This repository documents my structured journey through the Python for Data Analytics Bootcamp.
The focus of this program is not only learning Python syntax, but developing the ability to:
Design, automate, and scale data workflows in real-world analytics environments
Through this course, I am building the ability to:
- Automate repetitive analytical tasks
- Work with structured and large-scale datasets
- Build reproducible data workflows
- Combine SQL and Python effectively
- Develop production-ready analytical thinking
Python is not a replacement for SQL or Excel. It is a tool for controlling the entire data workflow.
- Python 3.x
- Miniconda (environment management)
- Jupyter Notebook / VS Code
- pandas, numpy
- matplotlib, seaborn
- scikit-learn
The course is structured into 12 progressive sessions:
| Session | Topic | Content |
|---|---|---|
| 01 | Foundations | Read |
| 02 | Python Fundamentals | Read |
| 03 | Pandas & Data Workflow | Read |
| 04 | Advanced Pandas | Read |
Session 01 introduces the fundamental shift from manual data analysis to automated workflows.
| SQL | Python |
|---|---|
| Declarative | Imperative |
| What you want | How to do it |
| Data querying | Workflow control |
👉 SQL retrieves data

👉 Python defines what happens next
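A minimal sketch of this split, using Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are invented for illustration and are not part of the course data.

```python
import sqlite3

# In-memory database with a hypothetical orders table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, 3), (2, 0.0, 1), (3, 50.0, 2)],
)

# SQL (declarative): state WHAT rows are wanted
rows = conn.execute(
    "SELECT order_id, price, quantity FROM orders WHERE price > 0"
).fetchall()
conn.close()

# Python (imperative): define HOW to process the result, step by step
revenue_by_order = {}
for order_id, price, quantity in rows:
    revenue_by_order[order_id] = price * quantity

print(revenue_by_order)
```

The query only says which rows to return. Everything after `fetchall()` is ordinary Python and can be extended with validation, exports, or further calculations.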
Transition:
- From manual tools → automation
- From queries → workflows
- From analyst → system thinker
Core types:

```python
int    # 10
float  # 3.14
str    # "Alice"
bool   # True / False
```

Variables are named references to values in memory:

```python
price = 100
quantity = 3
revenue = price * quantity
```

Used to:
- Store data
- Reuse values
- Build logic
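A short sketch of why the type of a variable matters for correctness. The variable names and values are illustrative; the point is that the same "number" behaves differently as `str` and as `int`.

```python
# The same value stored as text and as an integer (illustrative values)
price_text = "100"   # str, e.g. a value read with the csv module
price_number = 100   # int

quantity = 3

# Multiplying a str repeats it; multiplying an int does arithmetic
repeated = price_text * quantity    # "100100100"
revenue = price_number * quantity   # 300

# Converting explicitly restores the intended calculation
revenue_fixed = int(price_text) * quantity  # 300

print(type(price_text).__name__, repeated)
print(type(price_number).__name__, revenue)
```

Neither line raises an error, which is exactly why the wrong type can go unnoticed in an analysis.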
Example:

```python
df = df[df["price"] > 0]
```

Instead of repeating manual steps:

👉 Define logic once → execute automatically
Python ensures:
- Transparent workflows
- Repeatable results
- Collaboration readiness
Best practices:
- One step per cell
- Markdown for explanations
- Code for execution
👉 Notebook = analysis story
Key libraries:
- pandas → data manipulation
- numpy → numerical operations
- matplotlib / seaborn → visualization
Using Miniconda:

```bash
conda create -n myenv python=3.13
conda activate myenv
pip install pandas
```

👉 One project = one environment
`requirements.txt` pins exact package versions:

```text
pandas==2.2.2
numpy==1.26.4
```

Install them with:

```bash
pip install -r requirements.txt
```

Ensures:
- Consistent environments
- Easy collaboration
- Python enables automation, scalability, and control
- SQL and Python are complementary
- Data types directly affect correctness
- Variables are the foundation of logic
- Environments are critical for professional workflows
```text
data_analytics_with_python/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── notebooks/
├── imgs/
├── docs/
├── .gitignore
├── requirements.txt
└── README.md
```

- Session 01 — Foundations
- Session 02 — Data Structures
- Session 03 — Control Flow
- Session 04 — Functions
- Session 05 — File Handling
- Session 06 — NumPy
- Session 07 — Pandas
- Session 08 — Data Cleaning
- Session 09 — Visualization
- Session 10 — Workflows
- Session 11 — Automation
- Session 12 — Final Project
➡️ Session 02: Data Structures

Focus: Lists, Dictionaries, Data Organization
This repository reflects a transformation:
From running queries → to building data systems
Session 02 builds the core programming foundation required for analytical thinking in Python.
The focus is not just syntax, but understanding how to:
Represent, transform, and analyze data using programmatic logic
This session marks the transition from:
- Writing simple code → to thinking in data transformations
After completing this session, I can:
- Perform core analytical computations using Python
- Structure raw data into lists, dictionaries, and tabular formats
- Apply conditional logic for filtering and segmentation
- Iterate over datasets and compute aggregations
- Write concise transformations using list comprehension
- Transition from raw Python structures to pandas DataFrames
- Perform basic data manipulation and filtering
Python supports fundamental operations used in all analytical workflows:
```python
r = 100
t = 0.2
total = r * (1 + t)
```

Boolean logic enables decision-making:

```python
100 > 50    # True
100 == 50   # False
100 != 50   # True
```

Logical operators:

- `and`
- `or`
- `not`
👉 These form the basis of filtering and rule-based analysis
Lists:

```python
sales = [100, 200, 150]
```

Used for:
- Time series
- Measurements
- Sequential data
Tuples:

```python
coordinates = (40.18, 44.51)
```

Used for:
- Fixed values
- Constants
- Safe data storage
Sets:

```python
customer_ids = {1, 2, 3, 3}  # stored as {1, 2, 3}
```

Key features:
- Removes duplicates
- Fast membership checks
- Supports set operations
Used for:
- Deduplication
- Data validation
- Segment comparison
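The three uses listed above can be sketched with two hypothetical lists of customer IDs (the values are made up for illustration):

```python
# Hypothetical customer IDs from two periods (illustrative data)
january_ids = [1, 2, 3, 3, 4]
february_ids = [3, 4, 5]

# Deduplication: converting a list to a set drops repeated values
january = set(january_ids)      # {1, 2, 3, 4}
february = set(february_ids)    # {3, 4, 5}

# Data validation: membership check
is_known = 5 in january         # False

# Segment comparison: set operations
returning = january & february  # in both periods
churned = january - february    # only in January
new = february - january        # only in February

print(returning, churned, new)
```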
Dictionaries:

```python
customer = {
    "name": "Anna",
    "revenue": 150,
    "city": "Yerevan"
}
```

Used for:
- Representing entities (rows)
- Key-value relationships
- Building structured datasets
A list of dictionaries represents tabular data:
```python
customers = [
    {"name": "Anna", "revenue": 150},
    {"name": "David", "revenue": 220}
]
```

👉 This is the conceptual bridge to DataFrames
A DataFrame is conceptually:
- A collection of columns
- Each column behaves like a labeled list
```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Anna", "David"],
    "revenue": [150, 220]
})
```

Basic operations:
- Column selection
- Row filtering
- Feature creation
- Column removal
👉 Mirrors SQL operations (SELECT, WHERE)
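The four basic operations listed above can be sketched as follows. The DataFrame is built from a list of dictionaries to show the bridge from plain Python structures; the second `city` value and the `is_high_revenue` column are made up for illustration.

```python
import pandas as pd

# Illustrative records written as a list of dictionaries
customers = [
    {"name": "Anna", "revenue": 150, "city": "Yerevan"},
    {"name": "David", "revenue": 220, "city": "Gyumri"},
]
df = pd.DataFrame(customers)  # each dict becomes a row, keys become columns

# Column selection (SQL: SELECT name, revenue)
selected = df[["name", "revenue"]]

# Row filtering (SQL: WHERE revenue > 200)
high = df[df["revenue"] > 200]

# Feature creation: derive a new column from an existing one
df["is_high_revenue"] = df["revenue"] > 200

# Column removal: drop returns a new DataFrame, df itself is unchanged
without_city = df.drop(columns=["city"])

print(without_city)
```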
```python
revenue = 150  # example value

if revenue > 100:
    print("High revenue")
else:
    print("Normal revenue")
```

Key principles:

- Conditions evaluate to `True` or `False`
- Indentation defines execution blocks
- `elif` enables multi-branch logic
Used for:
- Filtering
- Segmentation
- Rule-based scoring
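A minimal sketch of multi-branch segmentation with `elif`. The thresholds (200 and 100) and the segment labels are hypothetical, chosen only to show the structure.

```python
revenue = 150  # example value

# Branches are checked top to bottom; the first condition that is True wins
if revenue > 200:
    segment = "High"
elif revenue > 100:
    segment = "Medium"
else:
    segment = "Low"

print(segment)
```

Because the first matching branch wins, the order of the conditions matters: checking `revenue > 100` first would also label a revenue of 250 as "Medium".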
```python
for value in sales:
    print(value)
```

Used for:
- Iterating over datasets
- Applying rules
- Aggregation
Example:
```python
total = 0
for value in sales:
    total += value
```

Filtering inside a loop:

```python
for value in sales:
    if value > 120:
        print(value)
```

👉 Represents row-by-row filtering logic
```python
for i in range(5):
    print(i)
```

Used for:
- Controlled iteration
- Index-based logic
```python
for i in range(3):
    for j in range(2):
        print(i, j)
```

Used for:
- Multi-dimensional data
- Pairwise operations
```python
new_sales = [value * 1.2 for value in sales]
```

With condition:

```python
high_sales = [value for value in sales if value > 120]
```

With transformation:

```python
labels = ["High" if v > 120 else "Low" for v in sales]
```

👉 Combines:
- Iteration
- Filtering
- Transformation
Mutable:
- list, dict, set, DataFrame
Immutable:
- int, float, str, tuple
Why it matters:
- Prevents unintended changes
- Avoids hidden bugs
- Ensures predictable behavior
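A short sketch of the kind of hidden bug mutability can cause, and how an explicit copy avoids it. The variable names are illustrative.

```python
sales = [100, 200, 150]

# Assignment does not copy: both names refer to the same list object
alias = sales
alias.append(999)
# sales is now [100, 200, 150, 999] as well

# An explicit copy protects the original from later changes
original = [100, 200, 150]
safe_copy = original.copy()
safe_copy.append(999)
# original is still [100, 200, 150]

# Immutable types reject in-place changes
coordinates = (40.18, 44.51)
try:
    coordinates[0] = 0.0
except TypeError as error:
    message = str(error)

print(sales, original, message)
```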
This session establishes the core mapping between Python and analytics:
| Python Concept | Analytics Equivalent |
|---|---|
| if | WHERE clause |
| loop | row iteration |
| sum logic | aggregation |
| list comprehension | transformation |
| dictionary | record |
| DataFrame | table |
- Data structures define how information is organized
- Conditional logic enables decision-making
- Loops enable scalable computation
- List comprehension enables clean transformations
- DataFrames formalize structured analysis
- Session 01 — Foundations
- Session 02 — Python Fundamentals
- Session 03 — Control Flow (Advanced)
- Session 04 — Functions
- Session 05 — File Handling
- Session 06 — NumPy
- Session 07 — Pandas
- Session 08 — Data Cleaning
- Session 09 — Visualization
- Session 10 — Workflows
- Session 11 — Automation
- Session 12 — Final Project
➡️ Session 03: Pandas + Data Workflow (Advanced)

Focus: Functions, deeper logic, and reusable workflows
This session completes the transition:
From writing Python code → to thinking in data logic and transformations
Session 03 introduces pandas as the core tool for data analysis and builds the first end-to-end data workflow.
The focus shifts from:
Writing Python logic → to working with real datasets
This session establishes the foundation for:
- Exploring, cleaning, transforming, and exporting structured data.
After completing this session, I can:
- Load large-scale datasets into pandas DataFrames.
- Perform structured data exploration (EDA).
- Clean and transform raw data into usable formats.
- Create analytical features from existing columns.
- Handle missing values correctly.
- Reshape datasets when necessary.
- Export processed datasets for further analysis.
- Series: 1-dimensional labeled array (Values + Index).
- DataFrame: 2-dimensional table (Rows + Columns + Values). Equivalent to a SQL table or Excel sheet.
```python
import pandas as pd

df_orders = pd.read_csv("../data/raw/orders.csv")
df_products = pd.read_csv("../data/raw/products.csv")
```

Key principle: Always validate data immediately after loading.
Core inspection methods:
| Method | Description |
|---|---|
| `df.head()` / `df.tail()` | View first / last rows |
| `df.columns` | Column names |
| `df.info()` | General info (types, nulls) |
| `df.dtypes` | Data type of each column |
| `df.describe()` | Statistical summary |
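- Dropping columns:

  ```python
  df.drop(columns=["eval_set"])  # returns a new DataFrame; assign it to keep the change
  ```

- Renaming columns:

  ```python
  df.rename(columns={"eval_set": "dataset_type"})  # also returns a new DataFrame
  ```

- Changing types:

  ```python
  df["order_id"] = df["order_id"].astype("int64")
  ```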
Creating business-ready insights:

```python
# Boolean flags
df["is_weekend"] = df["order_dow"].isin([0, 6])

# Categorization using apply
df["order_frequency_category"] = df["order_number"].apply(
    lambda x: "New" if x == 1 else "Low" if x <= 5 else "High"
)
```

Handling missing values:

```python
df["days_since_prior_order"] = df["days_since_prior_order"].fillna(0)
```

Key principle: Missing data must be explicitly handled before analysis.
| Approach | Efficiency |
|---|---|
| ✅ Preferred (vectorized): `df["order_hour_of_day"] < 12` | Fast and scalable |
| Row-wise: `df["col"].apply(lambda x: ...)` | Avoid when possible |
Key principle: Vectorized operations are faster and more scalable.
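As a sketch of what "vectorized" means in practice, the `apply`-based categorization shown earlier can be rewritten with `numpy.select`. This is one possible rewrite, not the course's own solution; the small DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the column name follows the apply example
df = pd.DataFrame({"order_number": [1, 3, 5, 8, 1]})

# Row-by-row version: the lambda is called once per value
with_apply = df["order_number"].apply(
    lambda x: "New" if x == 1 else "Low" if x <= 5 else "High"
)

# Vectorized version: each condition is evaluated on the whole column at once;
# np.select takes the first matching condition for each row
conditions = [df["order_number"] == 1, df["order_number"] <= 5]
choices = ["New", "Low"]
vectorized = pd.Series(
    np.select(conditions, choices, default="High"), index=df.index
)

print(vectorized.tolist())
```

Both versions produce the same labels. The difference only becomes visible on large tables, where the per-row function calls of `apply` add up.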
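- Transpose (swap rows and columns, e.g. to fix a table that was loaded sideways). This is not a wide → long conversion; that is what `df.melt()` is for:

  ```python
  df.T
  ```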
Exporting the result:

```python
df.to_csv("../data/processed/orders_cleaned.csv", index=False)
```

Key principle: Processed data should be saved separately from raw data.
| Action | Purpose |
|---|---|
| `head` / `tail` | Quick inspection |
| `info` / `dtypes` | Structure & data types |
| `describe` | Statistical overview |
| `drop` / `rename` | Cleaning |
| `astype` | Data correctness |
| Feature creation | Business logic |
| `fillna` | Data reliability |
| `to_csv` | Workflow output |
- Pandas is the core tool for data analysis in Python.
- Every analysis starts with Structured Exploration (EDA).
- Data must be cleaned before it can be analyzed.
- Feature engineering creates business value.
- Vectorized operations are essential for performance.
- Session 01 — Foundations
- Session 02 — Python Fundamentals
- Session 03 — Pandas & Data Workflow
- Session 04 — Functions
- Session 05 — File Handling
- Session 06 — NumPy
- Session 07 — Advanced Pandas
- Session 08 — Data Cleaning (Advanced)
- Session 09 — Visualization
- Session 10 — Workflows
- Session 11 — Automation
- Session 12 — Final Project
Next Step: ➡️ Session 04: Functions (Reusable logic & modular code)
💡 Final Note: This session marks a major transition from writing basic Python code to building real data pipelines.
Session 04 moves from single-table analysis → multi-table analytical thinking.
The focus is on:
- Combining datasets and extracting higher-level insights using aggregation and joins
This session reflects real-world analytics, where:
- Data is distributed across multiple tables
- Insights require joining + summarizing data
After completing this session, I can:
- Understand the structure and role of multiple related tables
- Identify primary and foreign keys
- Perform data aggregation using `groupby()`
- Merge datasets using `pd.merge()`
- Validate joins using `indicator=True`
- Build multi-table analytical datasets
- Detect and handle data integrity issues during merges
Real datasets are structured as relational systems, not single tables.
Example (Instacart dataset):
| Table | Role | Grain |
|---|---|---|
| `orders` | Behavioral | One row per order |
| `order_products_train` | Transaction | One row per product per order |
| `products` | Master | One row per product |
| `departments` | Lookup | One row per department |
| `aisles` | Lookup | One row per aisle |
👉 Understanding table relationships is critical before merging.
| Key Type | Description |
|---|---|
| Primary Key | Unique identifier (e.g. order_id) |
| Foreign Key | Reference to another table (e.g. product_id) |
Example relationships:
- `orders.order_id` → `order_products_train.order_id`
- `products.product_id` → `order_products_train.product_id`
- `products.department_id` → `departments.department_id`
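Before merging, these key relationships can be checked directly. The sketch below uses tiny stand-in tables with invented values; the column names follow the Instacart tables described above.

```python
import pandas as pd

# Tiny stand-ins for the real tables (values are invented for illustration)
df_products = pd.DataFrame(
    {"product_id": [1, 2, 3], "department_id": [10, 10, 20]}
)
df_order_products = pd.DataFrame(
    {"order_id": [100, 100, 101], "product_id": [1, 2, 4]}
)

# Primary key check: every product_id should appear exactly once in products
pk_is_unique = df_products["product_id"].is_unique

# Foreign key check: product_id values in the transaction table
# that have no matching row in products
orphans = df_order_products.loc[
    ~df_order_products["product_id"].isin(df_products["product_id"]),
    "product_id",
]

print(pk_is_unique, orphans.tolist())
```

A non-unique primary key or a non-empty list of orphans is a signal to investigate the data before joining.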
Aggregation allows us to summarize large datasets.
```python
df.groupby("department")["reordered"].mean()
```

Common operations:

- `sum()`
- `mean()`
- `count()`
- `nunique()`
groupby() follows:
- Split → divide data into groups
- Apply → perform computation
- Combine → return aggregated result
👉 Core principle of analytical computation
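The three steps can be written out by hand to see what `groupby()` does in one expression. The small DataFrame below is invented; only the column names come from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["produce", "produce", "dairy", "dairy", "dairy"],
    "reordered": [1, 0, 1, 1, 0],
})

# Manual version of the three steps
manual = {}
for department, group in df.groupby("department"):  # split
    manual[department] = group["reordered"].mean()   # apply
manual = pd.Series(manual)                           # combine

# groupby does the same in one expression
result = df.groupby("department")["reordered"].mean()

print(result)
```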
Core syntax:
```python
pd.merge(left_df, right_df, on="key", how="left")
```

Merge types:

| Type | Behavior |
|---|---|
| `inner` | Only matching rows |
| `left` | Keep all left rows |
| `right` | Keep all right rows |
| `outer` | Keep everything |
👉 Most common in analytics: left join
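A quick way to see the difference between the four merge types is to count the rows each one returns. The two small tables are invented for illustration: keys 2 and 3 exist on both sides, key 1 only on the left, key 4 only on the right.

```python
import pandas as pd

left_df = pd.DataFrame({"key": [1, 2, 3], "left_value": ["a", "b", "c"]})
right_df = pd.DataFrame({"key": [2, 3, 4], "right_value": ["x", "y", "z"]})

# Number of rows returned by each merge type
row_counts = {
    how: len(pd.merge(left_df, right_df, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}

print(row_counts)
```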
Step-by-step enrichment:
```python
# Step 1: products + departments
df_prod_dep = pd.merge(df_products, df_departments, on="department_id", how="left")

# Step 2: + aisles
df_prod_full = pd.merge(df_prod_dep, df_aisles, on="aisle_id", how="left")
```

👉 Adds business meaning to IDs
```python
pd.merge(df1, df2, on="key", how="left", indicator=True)
```

Produces a `_merge` column:

| Value | Meaning |
|---|---|
| `both` | matched |
| `left_only` | missing in right |
| `right_only` | missing in left |
👉 Critical for data quality checks
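A minimal sketch of how the `_merge` column can be used as a quality check. The tables `df1` and `df2` are tiny invented examples.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3]})
df2 = pd.DataFrame({"key": [2, 3, 4], "label": ["x", "y", "z"]})

checked = pd.merge(df1, df2, on="key", how="left", indicator=True)

# How many rows matched, and how many did not
match_counts = checked["_merge"].value_counts()

# Rows from the left table that found no partner on the right
unmatched = checked[checked["_merge"] == "left_only"]

print(match_counts)
print(unmatched)
```

With `how="left"`, any `left_only` rows are the ones whose right-hand columns are filled with missing values after the merge.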
Each table operates at a different level:
| Table | Grain |
|---|---|
| `orders` | order-level |
| `order_products` | order-product level (one row per product per order) |
| `products` | product-level |
⚠️ Merging different grains incorrectly → duplicated rows
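The effect of mixing grains can be seen on a tiny invented example, together with one way to guard against it: the `validate` argument of `pd.merge`, which raises an error when the declared relationship does not hold.

```python
import pandas as pd

# Order-level table: one row per order (invented values)
df_orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})

# Transaction table: one row per product per order (invented values)
df_order_products = pd.DataFrame(
    {"order_id": [1, 1, 1, 2], "product_id": [5, 6, 7, 5]}
)

# The result takes on the finer grain: 2 orders become 4 rows
merged = pd.merge(
    df_orders, df_order_products, on="order_id", how="left",
    validate="one_to_many",  # passes: order_id is unique on the left
)
n_rows = len(merged)
n_orders = merged["order_id"].nunique()  # count distinct keys, not rows

# Declaring the wrong relationship raises MergeError instead of
# silently duplicating rows
try:
    pd.merge(
        df_orders, df_order_products, on="order_id", how="left",
        validate="one_to_one",
    )
    one_to_one_failed = False
except pd.errors.MergeError:
    one_to_one_failed = True

print(n_rows, n_orders, one_to_one_failed)
```

After such a merge, order-level metrics should be computed with `nunique()` or on the original `orders` table, since plain row counts now count products.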
In transactional tables:
- `order_id` repeats → multiple products per order
- `product_id` repeats → appears in many orders
👉 This is expected, not an error
Full pipeline:
```python
# Filter relevant data
df_orders_train = df_orders[df_orders["eval_set"] == "train"]

# Merge orders with transactions
df_merged = pd.merge(
    df_orders_train,
    df_order_products,
    on="order_id",
    how="left"
)

# Add product info
df_merged = pd.merge(
    df_merged,
    df_products,
    on="product_id",
    how="left"
)
```

👉 Builds a fully enriched analytical dataset
| Operation | Analytics Meaning |
|---|---|
| `groupby` | aggregation |
| `merge` | JOIN |
| `on` | join key |
| `how="left"` | LEFT JOIN |
| `_merge` | data validation |
| repeated keys | transactional structure |
| filtering before merge | WHERE clause |
- Real-world data is multi-table
- Always understand each table before merging
- `groupby()` is the core of aggregation
- `pd.merge()` connects the data ecosystem
- Always validate joins using `indicator`
- Data grain awareness prevents analytical errors
- Enrichment transforms raw IDs → business insights
- Session 01 — Foundations
- Session 02 — Python Fundamentals
- Session 03 — Pandas & Data Workflow
- Session 04 — Advanced Pandas (Aggregation & Merging)
- Session 05 — File Handling
- Session 06 — NumPy
- Session 07 — Advanced Data Cleaning
- Session 08 — Visualization
- Session 09 — Feature Engineering
- Session 10 — Workflows
- Session 11 — Automation
- Session 12 — Final Project
➡️ Session 05: File Handling

Focus: Reading, writing, and managing data pipelines across file systems
This session marks a major transition:
From analyzing single tables → to building connected analytical datasets
👉 This is where analysis becomes real-world data engineering thinking