Last updated: 2025-11-24
This automation handles the complete data pipeline setup for the Azure AI Shopping application.
## What It Does

The data pipeline automation performs the following tasks:
- Creates Python Virtual Environment: Sets up an isolated Python environment with all required dependencies
- Imports Data to Cosmos DB: Loads product catalog data from CSV into Cosmos DB container
- Creates Azure AI Search Index: Sets up a search index with vector search capabilities
- Imports Data to Search: Populates the search index from Cosmos DB using an indexer
## Prerequisites
- Python 3.8 or higher installed and available in PATH
- Product catalog CSV file at `src/data/updated_product_catalog(in).csv`
Automated by Terraform:
- Cosmos DB account and database
- Azure AI Search service
- Azure OpenAI model deployments
- Environment variables in `src/.env`
## Option 1: Run Automatically with Terraform

Enable data pipeline automation in `terraform.tfvars`:

```hcl
enable_data_pipeline = true
```

Then run:

```shell
terraform apply -auto-approve
```

This will:

- Deploy all Azure resources
- Create AI model deployments
- Generate the `.env` file
- Automatically run the complete data pipeline
## Option 2: Run Manually

If you prefer to run the data pipeline manually or separately:

1. Ensure the `.env` file exists (created by Terraform):

   ```shell
   cd terraform-infrastructure
   terraform apply -auto-approve
   ```

2. Navigate to the `src` directory:

   ```shell
   cd ../src
   ```

3. Create a virtual environment and install dependencies:

   ```powershell
   python -m venv venv
   .\venv\Scripts\Activate.ps1
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. Run the pipeline scripts in order:

   ```shell
   # Step 1: Import data to Cosmos DB
   python pipelines/ingest_to_cosmos.py

   # Step 2: Create Azure AI Search index
   python pipelines/create_search_index.py

   # Step 3: Upload data to search index
   python pipelines/upload_to_search.py
   ```
## Product Catalog CSV

The product catalog data should be placed at:

```
src/data/updated_product_catalog(in).csv
```

Expected columns:

- `ProductID`: Unique product identifier
- `ProductName`: Product name
- `ProductCategory`: Product category
- `ProductDescription`: Product description
- `ProductPrice`: Product price
- `ProductImageURL`: URL to product image
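As a quick sanity check after obtaining the file, the header can be validated against this column list. The helper below is an illustrative sketch, not part of the pipeline scripts:

```python
import csv
import io

# Columns the pipeline expects, per the product catalog schema above.
EXPECTED_COLUMNS = {
    "ProductID", "ProductName", "ProductCategory",
    "ProductDescription", "ProductPrice", "ProductImageURL",
}

def missing_columns(csv_text: str) -> set:
    """Return expected column names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return EXPECTED_COLUMNS - set(header)
```

An empty result means the header matches; otherwise the returned set names the gaps.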
## Download Data

If you don't have the data file, you can download it from the reference repository TechWorkshop-L300-AI-Apps-and-agents; you can also follow the accompanying guide, "Guide - TechWorkshop L300: AI Apps and Agents":
```shell
# Download the product catalog data (the URL is quoted because it contains parentheses)
curl -o "src/data/updated_product_catalog(in).csv" "https://raw.githubusercontent.com/microsoft/TechWorkshop-L300-AI-Apps-and-agents/main/src/data/updated_product_catalog(in).csv"
```

### pipelines/ingest_to_cosmos.py
- Reads CSV data with the product catalog
- Connects to Cosmos DB (uses AAD or key-based auth)
- Creates the database and container if they don't exist
- Imports all products with upsert operations
- Creates a `content_for_vector` field for semantic search
- Smart skip logic:
  - By default (`COSMOS_SKIP_IF_EXISTS=true`), checks if the container already has data
  - If data exists, skips the import to avoid duplicates and save time
  - Set `COSMOS_FORCE_INGEST=true` to force a re-import even if data exists
  - Set `COSMOS_SKIP_IF_EXISTS=false` to always import (legacy behavior)
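The skip/force behavior can be pictured as a small decision helper. The function name is illustrative, not the script's actual code, but the flag semantics follow the description above:

```python
import os

def should_ingest(existing_count: int) -> bool:
    """Decide whether to run the Cosmos DB import, honoring the documented flags.

    COSMOS_FORCE_INGEST=true always imports; otherwise, with
    COSMOS_SKIP_IF_EXISTS=true (the default), a non-empty container is skipped.
    """
    force = os.getenv("COSMOS_FORCE_INGEST", "false").lower() == "true"
    skip_if_exists = os.getenv("COSMOS_SKIP_IF_EXISTS", "true").lower() == "true"
    if force:
        return True
    if skip_if_exists and existing_count > 0:
        return False
    return True
```

Note that the force flag wins over the skip flag, which matches the "overrides skip" behavior documented in the configuration table.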
### pipelines/create_search_index.py
- Creates Azure AI Search index with vector search capabilities
- Configures HNSW algorithm for efficient vector similarity search
- Sets up Azure OpenAI vectorizer with text-embedding-3-small model
- Defines searchable, filterable, and vector fields
- Supports hybrid search (keyword + semantic)
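For orientation, the index shape these bullets describe roughly corresponds to a REST payload like the one sketched below. The field names, profile names, and the 1536-dimension figure for text-embedding-3-small are illustrative assumptions, not the script's actual schema:

```python
def search_index_payload(index_name: str = "products-index") -> dict:
    """Illustrative Azure AI Search index definition: keyword fields plus
    an HNSW-backed vector field."""
    return {
        "name": index_name,
        "fields": [
            {"name": "ProductID", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "ProductName", "type": "Edm.String", "searchable": True},
            {"name": "ProductCategory", "type": "Edm.String", "searchable": True, "filterable": True},
            {"name": "ProductDescription", "type": "Edm.String", "searchable": True},
            {
                "name": "content_vector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": 1536,  # output size of text-embedding-3-small
                "vectorSearchProfile": "hnsw-profile",
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "hnsw-config", "kind": "hnsw"}],
            "profiles": [{"name": "hnsw-profile", "algorithm": "hnsw-config"}],
        },
    }
```

Hybrid search works because the same index carries both keyword-searchable string fields and the vector field tied to an HNSW profile.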
### pipelines/upload_to_search.py
- Reads all documents from Cosmos DB container
- Authenticates using AAD or key-based auth (auto-fallback)
- Maps Cosmos DB fields to Azure AI Search index schema
- Uploads documents in batches to Azure AI Search
- Provides detailed success/failure reporting
- Note: This script replaces the traditional indexer approach to avoid managed identity complexity when Cosmos DB local auth is disabled
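The batched upload step can be pictured as a simple chunking loop; the batch size of 100 below is an assumption, and the reporting helper is illustrative rather than the script's actual output format:

```python
from typing import Iterator, List

def batched(docs: List[dict], size: int = 100) -> Iterator[List[dict]]:
    """Yield successive fixed-size batches of documents for upload."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

def upload_report(results: List[bool]) -> str:
    """Summarize per-document success flags into a short report line."""
    ok = sum(results)
    return f"{ok}/{len(results)} documents uploaded successfully"
```

Uploading in batches keeps each request well under the service's payload limits while still reporting per-document success and failure.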
For detailed troubleshooting guidance, see TROUBLESHOOTING.md.

Quick reference:

- Python not found: Install Python 3.8+ from https://www.python.org/downloads/
- CSV file not found: Download the product catalog CSV file and place it in the `src/data/` directory
- Authentication errors: Run `az login` and ensure you have proper permissions. See TROUBLESHOOTING.md for detailed solutions.
- Virtual environment issues: Delete the `venv` folder and recreate it. See TROUBLESHOOTING.md for details.
All configuration is pulled from the `.env` file created by Terraform:

```
COSMOS_DB_ENDPOINT=...
COSMOS_DB_KEY=...
COSMOS_DB_NAME=...
COSMOS_DB_CONTAINER_NAME=products
COSMOS_SKIP_IF_EXISTS=true   # Skip import if data already exists
COSMOS_FORCE_INGEST=false    # Force re-import even if data exists
SEARCH_SERVICE_ENDPOINT=...
SEARCH_SERVICE_KEY=...
SEARCH_INDEX_NAME=products-index
AZURE_OPENAI_ENDPOINT=...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-small
```

| Variable | Default | Description |
|---|---|---|
| `COSMOS_SKIP_IF_EXISTS` | `true` | Skip import if container already has data |
| `COSMOS_FORCE_INGEST` | `false` | Force re-import even if data exists (overrides skip) |
| `COSMOS_DB_ENDPOINT` | - | Cosmos DB account endpoint URL |
| `COSMOS_DB_KEY` | - | Cosmos DB account key (optional if using AAD) |
| `COSMOS_DB_NAME` | - | Database name |
| `COSMOS_DB_CONTAINER_NAME` | - | Container name for product catalog |
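Before running the scripts, it can help to confirm the essential variables are present. This check is a convenience sketch using the variable names from the `.env` sample above; the pipeline does not ship this helper:

```python
import os

# Variables the pipeline cannot run without (keys may be optional under AAD auth).
REQUIRED = [
    "COSMOS_DB_ENDPOINT",
    "COSMOS_DB_NAME",
    "COSMOS_DB_CONTAINER_NAME",
    "SEARCH_SERVICE_ENDPOINT",
    "SEARCH_INDEX_NAME",
    "AZURE_OPENAI_ENDPOINT",
]

def missing_vars(env=os.environ) -> list:
    """Return required variable names that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

An empty list means the environment looks complete; otherwise re-run `terraform apply` to regenerate `src/.env`.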
After running the pipeline, verify the data was imported:

```shell
# Check the Cosmos DB container
az cosmosdb sql container show \
  --account-name <cosmos-account> \
  --database-name zava \
  --name products \
  --resource-group <rg-name>

# Check the search index definition
az search index show \
  --index-name products-index \
  --service-name <search-service> \
  --resource-group <rg-name>

# Check search index statistics
az search index show-statistics \
  --index-name products-index \
  --service-name <search-service> \
  --resource-group <rg-name>
```

After the data pipeline completes:
- Your Cosmos DB container is populated with product data
- Azure AI Search index is created with vector search enabled
- Search index is populated from Cosmos DB
- You can now build AI agents that query this data
- Use the search index for hybrid search (keyword + semantic)