🐍 Python for Data Analytics — Professional Learning Journey

About This Repository

This repository documents my structured journey through the Python for Data Analytics Bootcamp.

The focus of this program is not only learning Python syntax, but developing the ability to:

Design, automate, and scale data workflows in real-world analytics environments

Objectives

Through this course, I am building the ability to:

Automate repetitive analytical tasks
Work with structured and large-scale datasets
Build reproducible data workflows
Combine SQL and Python effectively
Develop production-ready analytical thinking

Core Concept

Python is not a replacement for SQL or Excel. It is a tool for controlling the entire data workflow.

Tech Stack

Python 3.x
Miniconda (environment management)
Jupyter Notebook / VS Code
pandas, numpy
matplotlib, seaborn
scikit-learn

Course Structure

The course is structured into 12 progressive sessions:

📚 Sessions

Session	Topic	Content
01	Foundations	Read
02	Python Fundamentals	Read
03	Pandas & Data Workflow	Read
04	Advanced Pandas	Read

🧩 Session 01 — Foundations

🔹 Overview

Session 01 introduces the fundamental shift from manual data analysis to automated workflows.

🔹 Key Concepts

1. SQL vs Python

SQL	Python
Declarative	Imperative
What you want	How to do it
Data querying	Workflow control

👉 SQL retrieves data
👉 Python defines what happens next

2. Analytical Mindset Shift

Transition:

From manual tools → automation
From queries → workflows
From analyst → system thinker

3. Data Types

Core types:

int      # 10
float    # 3.14
str      # "Alice"
bool     # True / False

⚠️ Incorrect data types lead to incorrect analysis.
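A minimal sketch of how a wrong type can silently change a result. The variable names and values are illustrative assumptions, not taken from the course material.

```python
# Values read from a CSV file or a form often arrive as strings.
price_text = "10"
quantity_text = "5"

# With str operands, + concatenates instead of adding.
wrong_total = price_text + quantity_text

# Converting to int first gives the intended arithmetic.
right_total = int(price_text) + int(quantity_text)

print(repr(wrong_total))  # '105'
print(right_total)        # 15
```

No error is raised in the first case, which is why checking types early matters.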

4. Variables

Variables are named references to values in memory:

price = 100
quantity = 3
revenue = price * quantity

Used to:

Store data
Reuse values
Build logic

5. Automation

Example:

df = df[df["price"] > 0]

Instead of repeating manual steps:

👉 Define logic once → execute automatically
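One way to picture "define once, execute automatically" is to wrap the filter in a function and reuse it on every new dataset. This is a sketch under assumptions: `keep_valid_prices` is a hypothetical helper and the two DataFrames are made-up sample data.

```python
import pandas as pd


def keep_valid_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows with a positive price (hypothetical helper)."""
    return df[df["price"] > 0].copy()


# Illustrative monthly extracts, not the course dataset.
january = pd.DataFrame({"price": [100, -5, 20]})
february = pd.DataFrame({"price": [0, 15]})

# The same rule is applied to each file without repeating manual steps.
clean_january = keep_valid_prices(january)
clean_february = keep_valid_prices(february)
```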

6. Reproducibility

Scripted Python workflows support:

Transparent workflows
Repeatable results
Collaboration readiness

7. Notebook Workflow

Best practices:

One step per cell
Markdown for explanations
Code for execution

👉 Notebook = analysis story

8. Python Ecosystem

Key libraries:

pandas → data manipulation
numpy → numerical operations
matplotlib / seaborn → visualization

9. Environment Management

Using Miniconda:

conda create -n myenv python=3.12
conda activate myenv
pip install pandas

👉 One project = one environment

10. Reproducibility via requirements.txt

pandas==2.2.2
numpy==1.26.4

pip install -r requirements.txt

Ensures:

Consistent environments
Easy collaboration

🔥 Key Takeaways

Python enables automation, scalability, and control
SQL and Python are complementary
Data types directly affect correctness
Variables are the foundation of logic
Environments are critical for professional workflows

📁 Project Structure

data_analytics_with_python/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── notebooks/
├── imgs/
├── docs/
├── .gitignore
├── requirements.txt
└── README.md

📈 Progress

Next Step

➡️ Session 02: Data Structures
Focus: Lists, Dictionaries, Data Organization

💡 Final Note

This repository reflects a transformation:

From running queries → to building data systems

🧩 Session 02 — Python Fundamentals for Data Analytics

🔹 Overview

Session 02 builds the core programming foundation required for analytical thinking in Python.

The focus is not just syntax, but understanding how to:

Represent, transform, and analyze data using programmatic logic

This session marks the transition from:

Writing simple code → to thinking in data transformations

What I Can Do After This Session

After completing this session, I can:

Perform core analytical computations using Python
Structure raw data into lists, dictionaries, and tabular formats
Apply conditional logic for filtering and segmentation
Iterate over datasets and compute aggregations
Write concise transformations using list comprehension
Transition from raw Python structures to pandas DataFrames
Perform basic data manipulation and filtering

🔹 Key Concepts

1. Arithmetic & Boolean Logic

Python supports fundamental operations used in all analytical workflows:

r = 100
t = 0.2
total = r * (1 + t)

Boolean logic enables decision-making:

100 > 50
100 == 50
100 != 50

Logical operators:

and
or
not

👉 These form the basis of filtering and rule-based analysis
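As a small illustration, comparisons and logical operators can be combined into a business rule. The thresholds and flag names below are assumptions chosen for the example.

```python
revenue = 150
is_returning = True
is_blocked = False

# A customer is a target if revenue is high AND they are returning AND not blocked.
is_target_customer = revenue > 100 and is_returning and not is_blocked

# A record needs review if revenue is unusually high OR the customer is blocked.
needs_review = revenue > 1000 or is_blocked
```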

2. Core Data Structures

List — Ordered Data

sales = [100, 200, 150]

Used for:

Time series
Measurements
Sequential data

Tuple — Immutable Data

coordinates = (40.18, 44.51)

Used for:

Fixed values
Constants
Safe data storage

Set — Unique Values

customer_ids = {1, 2, 3, 3}

Key features:

Removes duplicates
Fast membership checks
Supports set operations

Used for:

Deduplication
Data validation
Segment comparison
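The set features listed above can be sketched as follows. The monthly buyer sets are invented sample data.

```python
customer_ids = {1, 2, 3, 3}   # the duplicate 3 is stored only once

# Hypothetical segments: customers who bought in each month.
january_buyers = {1, 2, 3}
february_buyers = {2, 3, 4}

retained = january_buyers & february_buyers    # intersection: in both months
churned = january_buyers - february_buyers     # difference: January only
all_buyers = january_buyers | february_buyers  # union: in either month
is_known = 4 in all_buyers                     # fast membership check
```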

Dictionary — Structured Records

customer = {
    "name": "Anna",
    "revenue": 150,
    "city": "Yerevan"
}

Used for:

Representing entities (rows)
Key-value relationships
Building structured datasets

3. From Structures to Tables

A list of dictionaries represents tabular data:

customers = [
    {"name": "Anna", "revenue": 150},
    {"name": "David", "revenue": 220}
]

👉 This is the conceptual bridge to DataFrames
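To make the bridge concrete, a list of dictionaries can be passed directly to `pd.DataFrame`: each dictionary becomes a row and each key becomes a column. A minimal sketch using the same two records:

```python
import pandas as pd

customers = [
    {"name": "Anna", "revenue": 150},
    {"name": "David", "revenue": 220},
]

# Keys become column names; each dictionary becomes one row.
df_customers = pd.DataFrame(customers)
print(df_customers)
```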

4. Introduction to pandas DataFrame

A DataFrame is conceptually:

A collection of columns
Each column behaves like a labeled list

import pandas as pd

df = pd.DataFrame({
    "name": ["Anna", "David"],
    "revenue": [150, 220]
})

Basic operations:

Column selection
Row filtering
Feature creation
Column removal

👉 Mirrors SQL operations (SELECT, WHERE)
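The four basic operations could look like this on the small DataFrame above. The `revenue_with_tax` column and the 1.2 multiplier are assumptions made only for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Anna", "David"],
    "revenue": [150, 220],
})

names = df["name"]                             # column selection (like SELECT name)
high = df[df["revenue"] > 200]                 # row filtering (like WHERE revenue > 200)
df["revenue_with_tax"] = df["revenue"] * 1.2   # feature creation (assumed 20% rate)
df = df.drop(columns=["revenue_with_tax"])     # column removal
```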

5. Conditional Statements (Decision Logic)

if revenue > 100:
    print("High revenue")
else:
    print("Normal revenue")

Key principles:

Conditions evaluate to True or False
Indentation defines execution blocks
elif enables multi-branch logic

Used for:

Filtering
Segmentation
Rule-based scoring
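A short sketch of multi-branch segmentation with `elif`. The `segment` function and its thresholds are hypothetical.

```python
def segment(revenue: float) -> str:
    """Assign a revenue segment using illustrative thresholds."""
    if revenue > 200:
        return "High"
    elif revenue > 100:
        return "Medium"
    else:
        return "Low"


# Branches are checked top to bottom; the first True condition wins.
segments = [segment(value) for value in [50, 150, 220]]
print(segments)  # ['Low', 'Medium', 'High']
```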

6. Loops (Iteration)

for value in sales:
    print(value)

Used for:

Iterating over datasets
Applying rules
Aggregation

Example:

total = 0
for value in sales:
    total += value

7. Combining Logic and Iteration

for value in sales:
    if value > 120:
        print(value)

👉 Represents row-by-row filtering logic

8. range() and Control Flow

for i in range(5):
    print(i)

Used for:

Controlled iteration
Index-based logic

9. Nested Loops

for i in range(3):
    for j in range(2):
        print(i, j)

Used for:

Multi-dimensional data
Pairwise operations

⚠️ Nested loops multiply the number of iterations (outer × inner), which hurts performance on large datasets

10. List Comprehension (Efficient Transformation)

new_sales = [value * 1.2 for value in sales]

With condition:

high_sales = [value for value in sales if value > 120]

With transformation:

labels = ["High" if v > 120 else "Low" for v in sales]

👉 Combines:

Iteration
Filtering
Transformation

11. Mutable vs Immutable Objects

Mutable:

list, dict, set, DataFrame

Immutable:

int, float, str, tuple

Why it matters:

Prevents unintended changes
Avoids hidden bugs
Ensures predictable behavior
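The kind of hidden bug that mutability can cause is easiest to see with two names pointing at the same list. This sketch uses made-up values.

```python
sales = [100, 200, 150]

alias = sales                # a second name for the SAME list object
alias.append(999)            # this also changes sales

independent = sales.copy()   # a new list with the same values
independent.append(-1)       # sales is not affected

coordinates = (40.18, 44.51)
try:
    coordinates[0] = 0       # tuples reject item assignment
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True
```

Copying before modifying is a common way to keep the original data intact.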

🔄 Analytical Perspective

This session establishes the core mapping between Python and analytics:

Python Concept	Analytics Equivalent
if	WHERE clause
loop	row iteration
sum logic	aggregation
list comprehension	transformation
dictionary	record
DataFrame	table

🔥 Key Takeaways

Data structures define how information is organized
Conditional logic enables decision-making
Loops enable scalable computation
List comprehension enables clean transformations
DataFrames formalize structured analysis

📈 Progress

Next Step

➡️ Session 03: Pandas + Data Workflow (Advanced)
Focus: Functions, deeper logic, and reusable workflows

💡 Final Note

This session completes the transition:

From writing Python code → to thinking in data logic and transformations

🧩 Session 03 — Introduction to Pandas & Data Workflow

🔹 Overview

Session 03 introduces pandas as the core tool for data analysis and builds the first end-to-end data workflow.

The focus shifts from:

Writing Python logic → to working with real datasets

This session establishes the foundation for:

Exploring, cleaning, transforming, and exporting structured data.

What I Can Do After This Session

After completing this session, I can:

Load large-scale datasets into pandas DataFrames.
Perform structured data exploration (EDA).
Clean and transform raw data into usable formats.
Create analytical features from existing columns.
Handle missing values correctly.
Reshape datasets when necessary.
Export processed datasets for further analysis.

🔹 Key Concepts

1. Pandas Core Data Structures

Series: 1-dimensional labeled array (Values + Index).
DataFrame: 2-dimensional table (Rows + Columns + Values). Equivalent to a SQL table or Excel sheet.

2. Data Import

import pandas as pd

df_orders = pd.read_csv("../data/raw/orders.csv")
df_products = pd.read_csv("../data/raw/products.csv")

Key principle: Always validate data immediately after loading.
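A possible set of checks to run right after loading is sketched below. To keep the example self-contained it reads from an in-memory string instead of a file; the column names and values are assumptions.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (illustrative content).
csv_text = (
    "order_id,user_id,days_since_prior_order\n"
    "1,10,\n"
    "2,10,7\n"
    "3,11,3\n"
)
df_orders = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df_orders.shape                     # expected size?
missing_per_column = df_orders.isna().sum()          # where are the gaps?
has_duplicate_ids = df_orders["order_id"].duplicated().any()  # is the key unique?

print(df_orders.dtypes)
print(missing_per_column)
```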

3. Data Exploration (EDA)

Core inspection methods:

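| Method | Description |
| --- | --- |
| `head()` / `tail()` | Quick inspection of the first or last rows |
| `info()` / `dtypes` | Structure and data types |
| `describe()` | Statistical overview |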

4. Data Wrangling (Cleaning & Structuring)

Dropping columns:
```
df = df.drop(columns=["eval_set"])  # drop() returns a new DataFrame; assign it to keep the change
```

Renaming columns:

df = df.rename(columns={"eval_set": "dataset_type"})  # rename() also returns a new DataFrame

Changing types:

df["order_id"] = df["order_id"].astype("int64")

5. Feature Engineering

Creating business-ready insights:

# Boolean flags
df["is_weekend"] = df["order_dow"].isin([0, 6])

# Categorization using apply
df["order_frequency_category"] = df["order_number"].apply(
    lambda x: "New" if x == 1 else "Low" if x <= 5 else "High"
)

6. Handling Missing Values

df["days_since_prior_order"] = df["days_since_prior_order"].fillna(0)

Key principle: Missing data must be explicitly handled before analysis.
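One way to handle missing values explicitly is to count them, record where they were, and only then fill them. In this sketch the `is_first_order` flag rests on an assumption: that a missing `days_since_prior_order` means the customer has no earlier order.

```python
import pandas as pd

df = pd.DataFrame({"days_since_prior_order": [None, 7.0, 3.0]})

# Count the gaps first, so the decision to fill is an informed one.
missing_count = df["days_since_prior_order"].isna().sum()

# Keep a flag for rows that were missing before overwriting them.
df["is_first_order"] = df["days_since_prior_order"].isna()
df["days_since_prior_order"] = df["days_since_prior_order"].fillna(0)
```

Keeping the flag means a filled 0 can still be told apart from a real same-day reorder.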

7. Vectorization vs `apply()`

Approach	Efficiency
✅ Preferred (Vectorized): `df["order_hour_of_day"] < 12`	Fast and scalable
⚠️ Less efficient: `df["col"].apply(lambda x: ...)`	Avoid when possible

Key principle: Vectorized operations are faster and more scalable.
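As an illustration, the same three-way categorization can be written with `apply()` or with the vectorized `numpy.select`. The sample `order_number` values are invented; both versions should give identical labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_number": [1, 3, 5, 12]})

# Vectorized: conditions are evaluated on whole columns, first match wins.
conditions = [df["order_number"] == 1, df["order_number"] <= 5]
choices = ["New", "Low"]
df["order_frequency_category"] = np.select(conditions, choices, default="High")

# Row-by-row equivalent with apply(), shown only for comparison.
via_apply = df["order_number"].apply(
    lambda x: "New" if x == 1 else "Low" if x <= 5 else "High"
)
```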

8. Data Reshaping

Transpose (swap rows and columns, e.g. to fix a table's orientation; converting wide → long format is a different operation, handled by `melt()`):
```
df.T
```
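The difference between transposing and reshaping to long format can be sketched on a tiny wide table. The product and month columns are made-up sample data.

```python
import pandas as pd

wide = pd.DataFrame({
    "product": ["A", "B"],
    "jan": [10, 20],
    "feb": [15, 25],
})

# Transpose: rows and columns swap places.
transposed = wide.set_index("product").T

# Wide -> long: one row per (product, month) pair.
long = wide.melt(id_vars="product", var_name="month", value_name="units")
print(long)
```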

9. Exporting Data

df.to_csv("../data/processed/orders_cleaned.csv", index=False)

Key principle: Processed data should be saved separately from raw data.

Analytical Perspective

Action	Purpose
`head` / `tail`	Quick inspection
`info` / `dtypes`	Structure & data types
`describe`	Statistical overview
`drop` / `rename`	Cleaning
`astype`	Data correctness
Feature creation	Business logic
`fillna`	Data reliability
`to_csv`	Workflow output

🔥 Key Takeaways

Pandas is the core tool for data analysis in Python.
Every analysis starts with Structured Exploration (EDA).
Data must be cleaned before it can be analyzed.
Feature engineering creates business value.
Vectorized operations are essential for performance.

📈 Progress (Updated)

Next Step: ➡️ Session 04: Functions (Reusable logic & modular code)

💡 Final Note: This session marks a major transition from writing basic Python code to building real data pipelines.

🧩 Session 04 — Advanced Pandas: Aggregation & Merging

🔹 Overview

Session 04 moves from single-table analysis → multi-table analytical thinking.

The focus is on:

Combining datasets and extracting higher-level insights using aggregation and joins

This session reflects real-world analytics, where:

Data is distributed across multiple tables
Insights require joining + summarizing data

What I Can Do After This Session

After completing this session, I can:

Understand the structure and role of multiple related tables
Identify primary and foreign keys
Perform data aggregation using groupby()
Merge datasets using pd.merge()
Validate joins using indicator=True
Build multi-table analytical datasets
Detect and handle data integrity issues during merges

🔹 Key Concepts

1. Multi-Table Data Model

Real datasets are structured as relational systems, not single tables.

Example (Instacart dataset):

Table	Role	Grain
`orders`	Behavioral	One row per order
`order_products_train`	Transaction	One row per product per order
`products`	Master	One row per product
`departments`	Lookup	One row per department
`aisles`	Lookup	One row per aisle

👉 Understanding table relationships is critical before merging.

2. Keys & Relationships

Key Type	Description
Primary Key	Unique identifier (e.g. `order_id`)
Foreign Key	Reference to another table (e.g. `product_id`)

Example relationships:

orders.order_id → order_products_train.order_id
products.product_id → order_products_train.product_id
products.department_id → departments.department_id

3. Aggregation with `groupby()`

Aggregation allows us to summarize large datasets.

df.groupby("department")["reordered"].mean()

Common operations:

sum()
mean()
count()
nunique()
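The common operations above can be combined in a single `groupby().agg()` call using named aggregation. This is a sketch on invented data; the output column names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["produce", "produce", "dairy", "dairy", "dairy"],
    "product_id": [1, 2, 3, 3, 4],
    "reordered": [1, 0, 1, 1, 0],
})

# One row per department, one column per named aggregation.
summary = df.groupby("department").agg(
    rows=("reordered", "count"),
    reorders=("reordered", "sum"),
    reorder_rate=("reordered", "mean"),
    unique_products=("product_id", "nunique"),
)
print(summary)
```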

4. Split-Apply-Combine Logic

groupby() follows:

Split → divide data into groups
Apply → perform computation
Combine → return aggregated result

👉 Core principle of analytical computation

5. Merging DataFrames

Core syntax:

pd.merge(left_df, right_df, on="key", how="left")

Merge types:

Type	Behavior
`inner`	Only matching rows
`left`	Keep all left rows
`right`	Keep all right rows
`outer`	Keep everything

👉 Most common in analytics: left join
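To see how the merge types differ, the sketch below joins two tiny tables with partially overlapping keys and compares the resulting row counts. The tables are illustrative only.

```python
import pandas as pd

left_df = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right_df = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Keys 2 and 3 match; key 1 exists only on the left, key 4 only on the right.
row_counts = {
    how: len(pd.merge(left_df, right_df, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```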

6. Data Enrichment Workflow

Step-by-step enrichment:

# Step 1: products + departments
df_prod_dep = pd.merge(df_products, df_departments, on="department_id", how="left")

# Step 2: + aisles
df_prod_full = pd.merge(df_prod_dep, df_aisles, on="aisle_id", how="left")

👉 Adds business meaning to IDs

7. Indicator for Validation

pd.merge(df1, df2, on="key", how="left", indicator=True)

Produces _merge column:

Value	Meaning
`both`	matched
`left_only`	missing in right
`right_only`	missing in left

👉 Critical for data quality checks
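A possible quality check built on `_merge` is sketched below. The tables are invented, with one product deliberately pointing at a department that does not exist.

```python
import pandas as pd

df_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "department_id": [10, 20, 99],   # 99 has no matching department
})
df_departments = pd.DataFrame({
    "department_id": [10, 20],
    "department": ["produce", "dairy"],
})

checked = pd.merge(
    df_products, df_departments,
    on="department_id", how="left", indicator=True,
)

# Rows that found no partner on the right side.
unmatched = checked[checked["_merge"] == "left_only"]
match_counts = checked["_merge"].value_counts()
```

With `how="left"`, `right_only` cannot occur, so `left_only` is the value to watch.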

8. Understanding Data Grain

Each table operates at a different level:

Table	Grain
`orders`	order-level
`order_products`	order-product level (one row per product per order)
`products`	product-level

⚠️ Merging different grains incorrectly → duplicated rows
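One way to guard against grain mistakes is the `validate` argument of `pd.merge`, which raises an error when the expected key relationship does not hold. A sketch on invented tables, assuming each `product_id` should appear once in the product table:

```python
import pandas as pd

df_order_products = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
df_products = pd.DataFrame({"product_id": [10, 11], "product_name": ["Banana", "Milk"]})

# many_to_one: keys may repeat on the left but must be unique on the right.
enriched = pd.merge(
    df_order_products, df_products,
    on="product_id", how="left", validate="many_to_one",
)
rows_preserved = len(enriched) == len(df_order_products)

# A duplicated product row breaks the assumption and is caught immediately.
df_products_dup = pd.concat([df_products, df_products.iloc[[0]]], ignore_index=True)
try:
    pd.merge(
        df_order_products, df_products_dup,
        on="product_id", how="left", validate="many_to_one",
    )
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Comparing row counts before and after a left join is a simpler check with the same purpose.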

9. Handling Repeated Keys

In transactional tables:

order_id repeats → multiple products per order
product_id repeats → appears in many orders

👉 This is expected, not an error

10. Analytical Merge Pipeline

Full pipeline:

# Filter relevant data
df_orders_train = df_orders[df_orders["eval_set"] == "train"]

# Merge orders with transactions
df_merged = pd.merge(
    df_orders_train,
    df_order_products,
    on="order_id",
    how="left"
)

# Add product info
df_merged = pd.merge(
    df_merged,
    df_products,
    on="product_id",
    how="left"
)

👉 Builds a fully enriched analytical dataset

Analytical Perspective

Operation	Analytics Meaning
`groupby`	aggregation
`merge`	JOIN
`on`	join key
`how="left"`	LEFT JOIN
`_merge`	data validation
repeated keys	transactional structure
filtering before merge	WHERE clause

🔥 Key Takeaways

Real-world data is multi-table
Always understand each table before merging
groupby() is the core of aggregation
pd.merge() connects the data ecosystem
Always validate joins using indicator
Data grain awareness prevents analytical errors
Enrichment transforms raw IDs → business insights

📈 Progress (Updated)

Next Step

➡️ Session 05: File Handling
Focus: Reading, writing, and managing data pipelines across file systems

💡 Final Note

This session marks a major transition:

From analyzing single tables → to building connected analytical datasets

👉 This is where analysis becomes real-world data engineering thinking
