Basic commands like df.groupby() in Python and similar are not included in this document, because this is (almost) exclusively meant as a reference guide for troubleshooting steps and gotchas for different scenarios. Function names and general syntax is a 2 second look-up with a LLM or Google.
Note
This cheatsheet uses html tags to add search functionality as an alternate way to navigating the cheatsheet by the index below. Not all entries have tags, but they follow the structure Category/Sub-category and these are the available tags at this moment in time:
BigQuery/Timetravel Git/SSH-access Git/Deleting-a-file-from-history Git/Splitting-repos Git/Morning-routine
GoogleIAM/Limiting-access PowerBI/Evaluating-criteria-per-category PowerBI/Z-levels PowerBI/Slicer Venv/Management JOIN-statements/SQL-Pandas
Tags are mostly used for longer entries that are less easily scanable when looking for a specific entry.
- BigQuery
- Compute Engine VMs
- Crontabs
- Docker
- d-types
- Fabric notebooks
- File extensions
- GeoJSON
- Git
- GitHub
- Google IAM
- Mapshaper
- PowerBi
- PowerShell
- Python
- SQL
- VS Code
- Miscellaneous
- BigQuery/Timetravel BigQuery timetravel. Bigquery stores snapshots of tables for 7 days. Use this to restore a BigQuery table to a previous snapshot:
Actual lifesaver. Replace
CREATE OR REPLACE TABLE `your_project.your_dataset.your_table` AS SELECT * FROM `your_project.your_dataset.your_table` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
INTERVAL 2 HOURwith a time delta long enough to where you are sure that the original table existed.
-
GCP's compute engine VMs with Debian don't come with Git or Python venv preinstalled, but do come with Python preinstalled. Run the following to install Git and Python venv:
sudo apt update && sudo apt install git.
sudo apt install python3-venv.
Doublecheck Git and Python installations withgit --versionandpython3 --version. -
The venv activation command is different on Linux too, compared to Windows. Use this:
source venv/bin/activateFor reference, this deactivates the currently active venv:
deactivateThis deletes the venv folder:
rm -rf venv.This creates a new venv:
python -m venv venv. -
Debian doesn't come with Python 3.12.0 which can cause issues with pinned package numbers in a requirements.txt. Use this to strip the pins from a requirements.txt in this case and test if the script works without pinned packages versions:
sed -i 's/[><=!~].*//' requirements.txtAs a TEST, you can run it without the -i parameter first, which won't change the file and will instead just print the output in the terminal, so you can visually verify the change before commiting to saving it:
sed 's/[><=!~].*//' requirements.txtTo verify the content of requirements.txt afterwards:
cat requirements.txt
Tip
-
Use the
pwdcommand in Debian to check the current directory. -
Use the
lscommand to see files in the current directory. Also works in PowerShell on Windows. -
Use
crontab -lto print the content of the crontab in the terminal. -
Use
timedatectlto print the VMs current time. Make sure to check if the VM is in the same timezone as you, because that will affect when it triggers. Especially if you're trying to test if the crontab runs successfully. -
Add
>> /home/headoffice/CityDNA/cron.log 2>&1in the crontab with the relevant path to have crontab print outputs from your script (print statements and errors) to a cron.log file that you can monitor. Otherwise crontab apparently defaults to e-mailing log entries to you, which won't work on a new VM without the e-mail service set up. -
Use
cat /home/headoffice/CityDNA/cron.logto print the current content of the cron.log file. -
Use
tail -f /home/headoffice/CityDNA/cron.logto start monitoring the log file. Changes will be printed to the terminal. UseCtrl+Cto cancel the monitoring. -
Add
-ubetween the venv and script paths in crontab to unbuffer your script. This just means that it'll print printouts to the cron.log file immediately. Otherwise Python buffers print statements in a 8kb file and only prints them to the log file once that buffer is filled up.
-
Crontabs are used to schedule tasks, like running a script, on linux systems. Run
crontab -eto edit the crontab file.Crontab uses the format * * * * * to schedule tasks with the stars representing minutes, hours, day of month, month and day of week, respectively. Leaving a parameter as a * just means there's no rule for it, so the following would be the equivalent of running a script on the 10th, 20th and 28th of every month at 4 am:
0 4 10,20,28 * *.
The crontab will run on the VM's schedule, which, for Google Compute Engine, defaults to UTC, which is 2 hours behind Copenhagen time. So the above example would run at 6 am Copenhagen time. Check the VM timezone withtimedatectl.
Crontab parameters can have multiple values seperated by a comma. All of10,20,28in the above example is related to the day of the month that the task runs on because the values are all together, seperated by a comma.Python scripts will need the venv to run, so the format for actually running a script is
[time] [venv path to python.exe] [python script], like in this example:
0 4 10,20,28 * * /home/mha/CityDNA/venv/bin/python3 /home/mha/CityDNA/main.py.The NANO editor in Debian is weird.. Ctrl+X --> Y --> Enter to save and exit.
Print out the crontab content withcrontab -lto doublecheck.
The affected docker is based on what directory you're in!
-
Spin up Docker:
docker compose up -d -
Spin down Docker:
docker compose down -
Spin down Docker AND delete volumes:
docker compose down -v
- Excel doesn't remember d-types. Use Parquet instead! Especially for intermediate datasets while testing.
df.to_parquet([FILENAME OR RAW PATH], index=False)instead ofdf.to_excel([FILENAME OR RAW PATH], index=False)
pd.read_parquet()instead ofpd.read_excel()
Unless it absolutely has to be used in excel afterwards.
-
Fabric notebooks connect to a lakehouse. Under
Explorerin the left hand side, chooseData items-->Add data itemsand choose the lakehouse that you want the notebook to be connected to.
This means that you DO NOT need lakehouse references when running spark commands. -
Fabric notebooks require spark dataframes to interact with lakehouse tables.
Usedf = spark.createDataFrame(df)to convert a pandas dataframe to a spark dataframe.
Usedf = df.toPandas()to convert a spark dataframe to a pandas dataframe. -
To create a new table in the connected lakehouse, use
df.write.mode('overwrite').saveAsTable('[TABLE NAME]'). If you're overwritting an existing table and the scema has changedm usedf.write.mode('overwrite').option('overwriteSchema', 'true').saveAsTable('[TABLE NAME]')because spark, by default, tries to match the two schemas when overwritting an existing table with a new one. -
To retrieve an existing table, use
df = spark.table('[TABLE NAME]'). Remember that this will be a spark table. Convert to a pandas one if desired. -
To merge new data into an existing table, overwriting existing values and adding new ones, use:
df.createOrReplaceTempView("staging") --This assumes that you have a **SPARK** dataframe. Otherwise convert it first. spark.sql(""" MERGE INTO [INSERT EXISTING TABLE NAME] AS prod USING staging ON prod.[INSERT VARIABLE TO MATCH ON] = staging.[INSERT VARIABLE TO MATCH ON] WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * """)
-
Spark uses backticks for column names instead of "" used in PostgreSQL, so creating a new table from an existing one will look like this in Spark:
CREATE OR REPLACE TABLE copenhagen_card_surveyxact_respondents AS SELECT `responde`, `s_152`, `s_368`, `s_3`, `card_len`, `order_id`, `passtoke`, `kids_inc` FROM copenhagen_card_surveyxact_dataset
-
Fabric notebooks comes with packages like Pandas and Requests preinstalled, but use
%pip install [PACKAGE NAME]to install other packages.
- Turn on file extensions in windows explorer's visual settings. You can change a file's extension without impacting the file at all. As in, you can take a .sav file, for instance. rename it to .sav.backup and Windows won't know how to open it, which can help prevent confusion if you need to have multiple files stored with similar names. Change it back to .sav at any time and it'll function as a normal .sav file.
-
Creating geographic polygons and measuring their area in km²
Start by going to geojson.io and drawing your polygons.In the feature editor at the bottom center of the screen, open the editor and add a property called "Name". This will add it to all polygons that you create. Fill it out for all polygons.
When you have drawn all of the polygons, export the .geojson file without truncating the coordinates.
Upload the geojson file to mapshaper and run this command in the console, which will add a property with an estimated size of the area in km²:
each 'area_km2=$.area / 1e6'Download that file and run it through this Python script to get a simple overview of each area's name and size:
import json with open(r"C:\[path]\POIs with area size.json") as f: data = json.load(f) for i in data['features']: print(f"Name of area: {i['properties']['Name']}, size of area: {round(i['properties']['area_km2'], 2)} km²")
This will create a printout similar to the following, which can then be customized however you want to:
Name of area: Stroeget, size of area: 0.04 km² Name of area: Kongens Have, size of area: 0.12 km² Name of area: Kongens Nytorv, size of area: 0.01 km² Name of area: Nyhavn, size of area: 0.01 km² Name of area: Bella_center, size of area: 0.06 km² Name of area: Norreport_station, size of area: 0.01 km² Name of area: Hovedbanegaarden, size of area: 0.02 km² Name of area: Noerrebro_station, size of area: 0.0 km² Name of area: Copenhagen_South_station, size of area: 0.02 km² Name of area: Copenhagen_airport_station, size of area: 0.01 km² Name of area: Botanical_Garden, size of area: 0.09 km² Name of area: Helsingor_Station, size of area: 0.01 km² Name of area: Helsingor_Center_og_Stengade, size of area: 0.02 km² Name of area: Louisiana_Museum, size of area: 0.03 km² Name of area: Roskilde_station, size of area: 0.02 km²
-
Git/SSH-access SETTING UP SSH access to GitHub (easier for switching between accounts than HTTPS authentication):
Start by creating an SSH key in PowerShell:
ssh-keygen -t ed25519 -C "email@email.com" -f "$HOME\.ssh\id_ed25519_personal".Windows has a config file for SSH connections where we need to specify an alias and what host the key should be used to connect to:
Host github-personal HostName github.com User git IdentityFile C:\Users\mha\.ssh\id_ed25519_personalThen add the public SSH key to GitHub in the settings menu. Use
cat ~/.ssh/id_ed25519_personal.pubto get the key. Copy and paste the entire return (3 elements) into GitHub SSH settings on the site.Test the connection to GitHub with
ssh -T git@github-personal.Existing HTTPS connections to repos can be switched over to SSH using
git remote set-url origin git@github-personal:Cazuchi/reponame.git.New repos can have their SSH connection established using
git remote add origin git@github-personal:Cazuchi/reponame.git.Use
git remote -vto double check that the project folder is connected to the correct GitHub repo.Git doesn't use Windows' OpenSSH by default and might not read the SSH config initially. Use
git config --global core.sshCommand "C:/Windows/System32/OpenSSH/ssh.exe"to ensure that it reads the correct SSH config file and can connect to GitHub. This only needs to be done once. It's a global setting that's saved afterwards.
Caution
Note: The following is DANGEROUS. It WILL destroy your repo if you do it wrong, but can be used to remove a file from a GitHub repo and the repo's entire history.
Important
The important thing is to remember --invert-paths in the first filter-repo command. That determines whether you're deleting a single file (the desired outcome) or everything in the repo EXCEPT that file.
Note
This will REMOVE any changes you have made to the local repo, but haven't pushed yet. Keep this in mind, because if you're doing this to exclude a filetype, for instance, and the .gitignore addition hasn't been pushed before doing this, you're going to be repeating this entire process again. Ask me how I know, lol...
-
Git/Deleting-a-file-from-history Removing a file from your GitHub repo's history
It's a pip install, so navigate to the local repo folder, activate the venv and runpip install git-filter-repo.
This example uses the filenamemove-data-bq.ipynbas an example.
The filter-repo package heavily encourages you to start from a freshly cloned local repo, because the file technically still exists in your local repo history until garbage cleanup, which takes 30 days unless you do it manually or modify the settings.THIS IS THE DANGEROUS PART! If you don't include
--invert-pathsfilter-repo will remove anything BUT the file you specify.--invert-pathsinverts that to remove ONLY the specified file, but forgetting that parameter WILL DELETE most of your repo.
Rungit filter-repo --path move-data-bq.ipynb --invert-pathsto remove the file from the repo and the repo's history. This will DELETE the file, so make a backup, if you need to keep it. If you're not starting from a freshly cloned repo it will refuse to run the command. Add--forceat the end to run it anyway. If you run this on a local repo that isn't a fresh clone, it'll like be unable to delete some objects related to the file's history, because windows locks them. Just replynto each. There will be alot, potentially. Git garbage cleanup will remove them eventually.Use
git log --oneline --all -- move-data-bq.ipynbto check if the removal was successful. This should return nothing.Add the file to your .gitignore, either as the full filename or
*.ipynbif you don't want any of those filetypes included in the future.As a safety feature, filter-repo removes your git origin. Add it back with
git remote add origin [repo].git push origin --force --allto push the updated history to your GitHub repo.--forceis needed because GitHub will refuse to push otherwise since your local history doesn't align with the GitHub repo's history. -
Git/Splitting-repos Extracting elements from a GitHub repo into a new repo Filter-repo can also be used if you want to extract elements from one repo and use them to build a new repo, like if you want to split a single repo in to multiple repos responsible for their own functionality.
Make sure you have filter-repo installed:pip install git-filter-repo.
Start by cloning the repo you want to extract elements from:git clone [repo].The structure of the command is
git filter-repo --path folder/ --path README.md --path-glob "*.py" --path-regex "^src/.*\.py$".
Every element reference starts with--pathand these can be chained.
--path folder/copies a given folder.
--path README.mdcopies a specific file - in this case the readme.md.
--path-glob "*.py"is a pattern matching function - in this case it matches all .py files.
--path-regex "^scripts/.*\.py$"allows for regex searches, in this case all .py files in the scripts folder.This deletes all unspecified files in the local repo rather than copying them to a new repo, so once this is done, change the name of the folder, point to a new repo with
git remote add origin repoand push to that repo. -
Git/Morning-routine MAKE THIS A HABIT every morning!
git remote -vcheck the repo
git branchcheck what branch I'm on.git checkout [branch]to switch
git fetchto get the newest version from github
git statuscheck if the local repo is up to date
git pullif it isn't -
If you forgot to pull the newest version of the remote repo before applying changes, you can use the
git stashcommand to fix it.
git stashsaves your changes, but removes them from the file.
Then rungit pullto get up to date with the remote repo.
Then rungit stash popto reapply the saved changes and push them to the remote repo as normal. -
Check current repo:
git remote -v -
Check current username:
git config user.name -
Set repo-specific username:
git config user.name "username" -
Set global username:
git config --global user.name "username" -
Check current e-mail:
git config user.email -
Set repo-specific e-mail:
git config user.email "e-mail" -
Set global e-mail:
git config --global user.email "e-mail" -
Just use git clone to initialize new repo instead of doing it in Powershell:
git clone [repo link] -
Check if local repo is up to date with online repo:
git fetch-->git status
-
GitHub has a set of highlights that you can use in markdown files to highlight specific parts. All of them require
>infront of each line - that is what tells GitHub what range of your text should be highlighted with the given highlight syntax. These have to be on the top-level of the markdown text. As in, they cannot be nested under a bulletpoint.The syntax is the following without the space after
>:
> [tag](!CAUTION, !WARNING, !IMPORTANT, !TIP or !NOTE)
> text
Caution:
Caution
Use this to highlight critical notes, like commands that will irreversibly damage your project if implemented incorrectly or carelessly.
Warnings:
Warning
Similar to the above, but for less dangerous commands.
Important:
Important
For important notes.
Tips:
Tip
For useful tips.
Notes:
Note
For regular notes.
- Github supports HTML tags to an extend, which are used in this cheatsheet as my search tags (example PowerBI/Slicer). The structure is
<kbd>followed by your text followed by<\kbd>.
Important
As a general rule-of-thumb, aim to minimize permissions. So any given user or service account only has access to the specific functions or services that they need. Not out of lack of trust for the user (maybe sometimes depending on experience), but out of concern for if someone else manages to gain access to a given account. If that account's permissions are minimized, the potential damage is also minimized.
-
GoogleIAM/Limiting-access Limiting service account access to secrets:
AddSecret Manager Secret Accessorpermission to your service account.
Go to Secret Manager in GCP, click on the secret and copy the path at the top with a structure likeprojects/[project number]/secrets/[secret name].
ClickAdd IAM conditionand add a condition forname,starts withand the secret path.starts withbecause the actually endpoint you're hitting when querying the secret is likely the secret path with/version/latestappended to it, to ensure you're getting the most up-to-date secret. -
GoogleIAM/Limiting-access Limiting Compute Engine VMs access to GCP: When setting up the VM, attach a service account with only the permissions needed for the VM to take the actions that you want, for instance:
BigQuery User permissionsto allow for the creation of dataset and tables.BigQuery Usershave limited access to existing datasets and tables, butownership levelaccess to datasets and tables that it creates itself.
Secret Manager Secret Accessor permissionsto allow for the retrieval of, say, API keys or similar.
Enable access to ALL GCP APIs for the VM. The VM will still only have the permissions equal to the permissions of the service account that you attach, but needs access to the same APIs in the VM settings. You're technically setting the permissions twice, which is tedious, but necessary. In the end, the VMs actual permissions are controlled by the attached Service Account, so choose "All APIs" in the VM settings is fine. -
Using
BigQuery Userlevel permissions for service accounts automatically grantsownership levelaccess to the service account for datasets and tables that it creates itself, while limiting access to datasets and tables created by other users or service accounts.
-
Run
each 'area_km2=$.area / 1e6'in the console to add a calculate property to each area with the estimated size in km2. -
Run
-filter 'navn == "København" || navn == "Frederiksberg" || navn == "Dragør" || navn == "Tårnby"'to extract a subset of the geographic areas in the uploaded file.navn == "København"should match the area's property.Run
-dissolveto combine the extracted areas into a single area.Upload the file to geojson.io and clean up any small artifacts that might be in the dissolved boundary. Download and use the file.
-
To get dynamic labels to be in the vertical center of card visuals, go to
format-->callout-->padding, turn on individual padding and ONLY add padding on the bottom, like 10-15 pixels. Maybe add 5 pixels of padding on the left side too, so push it in a bit. -
PowerBI/Z-levels Check the damn z-levels on visuals!
When editing interactions between elements on a page, if a visual isn't responding to changing interactions like you'd expect, check if the buttons belong to the correct visualization.If you have, say, a matrix on z-level 1 and a textbox on z-level 2 above it, the buttons that appear for editing interactions can look like they belong to the matrix, but they will actually belong to the textbox and the matrix won't react to changing the settings.
Move the matrix to a higher z-level than the textbox around it or move the textbox out of the way temporarily and the buttons for choosing interactions for the matrix will appear.. jesus ..
If you hover over the edit interactions buttons on a visual it'll add a faint highlight to the visual that they belong to, which can help identify issues too.
-
PowerBI/Slicer Variables in synced slicers across pages are PERMANENT. In the sense that if you copy/paste your slicers to a new page, choose to sync them to other pages and then change what variable is shown in a given slicer, it will change the variable in that slicer on the other pages too, which is going to break visuals on the other pages. You CAN delete a slicer on the new page and it won't affect that same slicer on other pages. Then just create a new slicer with the new variable.
-
PowerBI/Evaluating-criteria-per-category When creating measures that are supposed to evaluate criteria on a per-category basis, make sure to "loop" over the categories in the measure. As an example this would be a way to evaluate criteria per destination in a measure (from a tourism dataset):
VAR return_value = SUMX( VALUES('Monthly_data'[Destination]), ...
This evaluates whatever criteria you want for each unique
destinationin theMonthly_datatable.
-
Create a folder with
mkdir [folder name].
Delete a folder withremove-item -Recurse -Force [folder name].
Check items in a directory withls.
Move into a directory withcd [folder name].
Move out of a directory withcd ... -
IF PowerShell won't save the $PROFILE, it's because you need to create it first manually. Run this to create it:
New-Item -ItemType Directory -Force -Path (Split-Path $PROFILE) -
Sublime Text can be used to open textfiles from PowerShell, but the program has to be added to the environmental variables first. In windows, hit the windows key, search for environmental variables and add the path to the Sublime text folder as a new line in the PATH variable.
-
By DEFAULT PowerShell doesn't allow for the execution of scripts in the terminal. Run this to fix it (slightly more lenient script policy, standard for programming):
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser -
To make PowerShell open in a given directory by default open the $PROFILE in a texteditor:
subl $PROFILE
Then add this line:
Set-Location *path* -
Add pastel yellow timestamp and baby blue venv tag color in the PowerShell output:
subl $PROFILEto open the PowerShell settings.New-Item -ItemType Directory -Force -Path (Split-Path $PROFILE)to create it first, if the settings file doesn't exist, then:function prompt { $venv = if ($env:VIRTUAL_ENV) { "`e[38;5;153m($(Split-Path $env:VIRTUAL_ENV -Leaf))`e[0m " } else { "" } "[`e[38;5;229m$(Get-Date -Format 'HH:mm:ss')`e[0m] $venv PS $($executionContext.SessionState.Path.CurrentLocation)> " }
. $PROFILEto load the new PowerShell settings without having to restart the terminal. -
Filter ls search results (replace "citydna" with query term):
ls -Recurse -Filter "*citydna*.ipynb" | Sort-Object LastWriteTime -Descending | Select-Object Name, LastWriteTime -
You can use
New-Item name.filetypeto create a new file, but you can also just try to access the file and PowerShell will create it if it doesn't exist:
Code name.filetype -
Copy a file, like if you want a temporary backup while refactoring (... xD):
Copy-Item old_name.filetype new_name.filetype
-
TLDR for requests
Requests is NOT a standard library. Pip install it.Authentication
- Basic authentication is just
auth=("[USERNAME]", "[PASSWORD]")in the requests call, but some APIs require you to grab a token from an auth endpoint first and sending that instead, in which case it's just a parameter in the url, like shown under GET requests below. - Doing the above just tells the requests library to put the information inside the authorization header, which is the equivalent of defining the header yourself.
GET requests
- Used for retrieving data.
- Postman documentation will often include empty payload and headers variables. Ignore them. They are generally NOT used in GET requests.
- Parameters go in the url, with the structure usually being
endpoint?param#1¶m#2etc., but structure params as a dictionary and reference it asparams=paramsin the requests call and the requests library will automatically format them correctly into the url before sending the request.
POST requests
- Used for sending data.
- DOES use payloads and headers to send data to the server.
- Payload is either formatted as json (
json=payloadinside the request function call) or form-encoded (data=payload). Depends on what the API expects.
Sidenote on consuming API responses
import iowithpd.read_csv(io.StringIO(result.text))wraps a raw string from an API response variable namedresultin a file-like object that Pandas can iterate over. It basically mimicks reading a file from disc, but with an API response instead.- Other packages (not pandas) might expect a file-like object from bytes instead of a string, which can be done using
io.BytesIO(result.content)instead. - File-like just means it functions the same way a file read from the filesystem, with the same usable function-calls, like
.read(),.seek()and.tell(). - XML was created by the devil to torture programmers. Use xmltodict to navigate it (or ElementTree if you're absolutely forced to, lol).
- Sidenote, xmltodict returns a dict when tags only appear once and a list when tags appear multiple times. Still better than ElementTree for exploration, but worth checking for.
- Basic authentication is just
-
Venv/Management Create virtual environment. The second ".venv" is the name of the environment, which technically doesn't matter, but some programs (like VS Code) autodetect the environment IF it's named .venv, so leave as default unless it's important to have a different name for the environment:
python -m venv .venv -
Venv/Management Check the path to the ACTIVE virtual environment to confirm which is being used (like when you have multiple venvs called "venv"):
$env:VIRTUAL_ENV -
Venv/Management Dump all installed packages to a requirements.txt files (includes specific version numbers):
pip freeze > requirements.txt.If you DON'T want pinned versions, use pipreqs instead or run this in PowerShell to strip version numbers:
pip freeze | ForEach-Object { $_ -replace '[><=!~].*', '' } | Out-File requirements.txt. -
Venv/Management Install all packages from a requirements.txt file (-r specifies that you want pip to read from a file. The filename doesn't matter - the structure of the file does. Requirements.txt is the norm tho'):
pip install -r requirements.txt -
Venv/Management When VS Code doesn't automatically register your python kernel in Jupyter Lab (shouldn't be a problem with [Project folder] > [venv folder] type folder structure, but just in case). The name in the command can be whatever. It doesn't actually matter, but display-name determines what's shown in VS Code.
python -m ipykernel install --user --name=venv --display-name "Python (venv)" -
For replacing values in a dataframe, use one of the following:
df[col].map(DICTIONARY).fillna(VALUE) """ Make sure the dictionary's keys matches values in the column and the values are what you want to replace them with. Fillna() is optional if you want to handle values that don't match in a specific way. map() returns NaN values by default for missing matches. """
OR
search_criteria = [ dfm[col] == criteria1, dfm[col] == criteria2, dfm[col] == criteria3, ] replacement_values = [ replacement_value1, replacement_value2, replacement_value3 ] df[col] = np.select(search_criteria, replacement_values, default=dfm[col]) """ The "default" parameter specifies the value to return for rows that don't match any of the specified criteria. Point back to the original column to keep the original values or replace with a custom value. The first criteria maps to the first replacement value and so on. Like a key mapping to a value in a dictionary, except these are just two lists. """
OR
df.replace(mapping_dict)IF mapping_dict has the structure{column_name : {original_value1 : replacement_value1, original_value2 : replacement_value2, ...}}. df.replace just leaves the original value if no match is found, which is handy. -
SETS and FROZENSETS So sets are useful for a lot of lookup operations, but normal sets are mutable, meaning they can be changed at will. Frozensets are immutable. Once you define them, they can no longer be changed and can therefore also be used as dictionary keys, for instance.
-
Basic CTE structure. Pick and choose which function are needed (Placeholder. Will go through and add explanations later). See my F1 Ergast project for implemetation examples:
WITH cte_name AS ( SELECT t1.column1, t1.column2, t2.column3, t1.column4 AS "Named Column", 'literal string' AS "Static Value", SUM(t1.column5) AS total_value, AVG(t1.column5) AS avg_value, COUNT(t1.column5) AS count_value, ROW_NUMBER() OVER (PARTITION BY t1.column1 ORDER BY t1.column2) AS row_num, SUM(t1.column5) OVER (PARTITION BY t1.column1 ORDER BY t1.column2) AS running_total FROM table1 t1 INNER JOIN table2 t2 ON t1.column1 = t2.column1 LEFT JOIN table3 t3 ON t1.column1 = t3.column1 WHERE t1.column1 = 'value' AND t1.column2 > 100 GROUP BY t1.column1, t1.column2, t2.column3 HAVING SUM(t1.column5) > 100 ) SELECT * FROM cte_name WHERE row_num = 1 ORDER BY column1 ASC
-
JOIN-statements/SQL-Pandas JOIN statements - THIS APPLIES TO PANDAS AS WELL. The DIRECTION of the join statement specifies which table is protected. LEFT and RIGHT joins respectively keep all of the rows in the LEFT or RIGHT table, but drops any row from the other table without a match. INNER only keeps rows that match ACROSS BOTH TABLES. FULL OUTER JOIN (just "outer" in Pandas) protects BOTH tables and includes rows that DO NOT have a match across the two tables. In practice, RIGHT joins are pointless - just memorize the LEFT join function and switch the order of the tables depending on which needs protection.
-
Venv/Management VS Code's autodetecting of interpreters for Jupyter notebooks is incredibly unreliable.
UseCtrl+Shift+Pto jump to the search bar, search forPython: Select Interpreterand manually navigate to the Python.exe inside.\.venv\Scripts\.
That's the only way I've found to consistently select an interpreter. -
VS Code be default activates a venv when you launch it, but it's literally never the correct one, so
ctrl+shift+psearch forOpen user settings (JSON)and paste in"python.terminal.activateEnvironment": false. That'll stop VS Code from auto-activating any venv on launch.
- Backtick for code blocks:
Shift + the key to the left of backspace