Text Profiling & KPI Standardization
Source Code:string_functions.sql
Dataset: transactions_text_demo – 1,000 dummy transactions with messy text: inconsistent phones, annotated categories, mixed casing, duplicates.
- Phone format diversity: multiple formats inflate distinct counts
- Category fragmentation: annotations split logical categories
- Functions used:
LENGTH(),POSITION() / STRPOS(),COUNT(DISTINCT),GROUP BY
- Phone cleaned: last 8 digits only
- Category cleaned: annotations removed, trimmed
- Revenue per transaction:
quantity * price - Functions used:
REGEXP_REPLACE(),TRIM(),SUBSTRING(),CONCAT()
| Metric | Raw | Cleaned |
|---|---|---|
| Revenue by category | Fragmented | Consolidated |
| Unique customers | Inflated | Reduced |
| Avg transaction value | Stable | Stable |
- KPIs changed: Raw text caused dimensional fragmentation
- Biggest impact: Phone standardization reduced duplicate customers
- Assumptions: Last 8 digits identify a customer; annotations not meaningful; revenue = quantity * price
- Production risks: Variable phone lengths, new annotations, unhandled formatting, country code differences
source Code: analytics_schema.sql
This project contains a fully structured PostgreSQL schema created for practicing SQL JOIN operations and relational database design.
The database models a simplified analytical e-commerce system with geographical hierarchy and geospatial support using PostGIS.
The schema name is:
analytics
The schema includes the following logical groups of tables:
- countries
- regions
- cities
- customers
- products
- orders
- order_items
- country_boundaries
- region_boundaries
- city_boundaries
- customer_locations
The database follows a hierarchical and relational structure:
- One country → Many regions
- One region → Many cities
- One city → Many customers
- One customer → Many orders
- One order → Many order_items
- One product → Can appear in many order_items
Foreign key constraints are used to maintain referential integrity.
- Custom schema creation (
CREATE SCHEMA IF NOT EXISTS analytics) - Use of
FOREIGN KEYrelationships - Data validation with
CHECKconstraints - Unique constraints (e.g. email)
- PostGIS extension enabled for geospatial support
- Geometry types:
- MultiPolygon (countries)
- Polygon (regions, cities)
- Point (customer locations)
- PostgreSQL
- PostGIS Extension
- SQL (DDL)
- Open pgAdmin or connect via psql
- Create a new database
- Run the SQL script file
- The
analyticsschema and all tables will be created automatically
In the next phase, analytical SQL queries will be implemented using:
- JOIN
- GROUP BY
- Aggregations
- Filtering
- Time-based analysis
This project is part of SQL analytical training (Session 08 – JOINs).
Source Code: geospatial_data_etl.sql
This project demonstrates a professional-grade Geospatial ETL pipeline designed to handle complex spatial datasets. The workflow focuses on migrating raw administrative boundaries and customer location data into a structured PostgreSQL/PostGIS database, ensuring high data integrity and spatial accuracy.
The pipeline transforms raw text-based coordinates (WKT) into validated geometry objects, creating a foundation for advanced location intelligence and business analytics.
-
Database: PostgreSQL (with PostGIS extension)
-
Logic: Advanced SQL (DDL/DML)
-
Spatial Standard: WGS 84 (SRID 4326)
-
Tools: VS Code, Git, GitHub
The project follows a structured data engineering lifecycle to ensure scalability and reliability:
-
Implementation of a Staging Area pattern using
_stg_tables. -
Decoupling raw data imports from production tables to prevent data corruption.
-
Bulk loading of datasets including Countries, Regions, Cities, and Customer Locations.
-
WKT Parsing: Converting Well-Known Text (WKT) strings into native PostGIS geometry types.
-
Point & Polygon Management: Handling diverse spatial types, from specific customer coordinates to complex multi-polygon administrative boundaries.
-
Coordinate Standardization: Enforcing SRID 4326 across all layers for global map compatibility.
-
Topology Validation: Automated checks using
ST_IsValidto identify and filter malformed geometries. -
Spatial Integrity Audits: Verification of SRID consistency and geometry type classification to ensure the database is "analysis-ready."
-
geospatial_data_etl.sql: The main SQL script containing the database schema, ETL logic, and validation procedures. -
README.md: Documentation of the technical architecture and project goals. -
/data/: (Placeholder) Directory for source CSV datasets.
-
Database Architecture: Designing a schema that supports both relational business data and spatial geometries.
-
ETL Optimization: Using staging-to-production logic for cleaner data pipelines.
-
Spatial Analysis Readiness: Preparing data for complex queries like "customers within a specific delivery zone" or "regional sales density."
This project is part of my Data Engineering portfolio, focusing on SQL and Geospatial analytics.
This section contains example SQL queries demonstrating relational joins and basic analytical operations on the analytics schema.
Source Codejoins_analytics_queries.sql
-
Customer Orders (INNER JOIN)
Retrieves customers and their orders. -
Customers Without Orders (LEFT JOIN)
Finds customers who have not placed any orders. -
Orders and Products
Joins orders, order items, and products. -
Order Revenue Calculation
Calculates total revenue per order. -
Customer Geography Hierarchy
Joins customers with city, region, and country tables. -
Geospatial Join Example
Checks whether a customer's location is inside the city boundary using PostGIS. -
Join Row Count
Counts joined rows between orders and order_items.