Skip to content

Latest commit

 

History

History
310 lines (205 loc) · 18.7 KB

File metadata and controls

310 lines (205 loc) · 18.7 KB

Demonstration: Medallion Architecture Overview

Costa Rica

GitHub brown9804

Last updated: 2025-07-17


List of References (Click to expand)
Table of Contents (Click to expand)

Overview

The medallion architecture is a data design pattern used in data lakes and lakehouses to organize data at different levels of refinement. It's a best practice for managing the data lifecycle and ensuring data quality

Layer Description
Bronze Layer This layer contains the raw data ingested from various sources. The data is typically stored in its original format and is append-only. The Bronze layer acts as the landing zone for all incoming data.
Silver Layer This layer contains cleaned and transformed data. The data in the Silver layer is often enriched with additional information and is structured in a way that makes it easier to query and analyze. This layer is also where data quality checks and transformations are applied.
Gold Layer This layer contains curated and aggregated data that is ready for consumption by business intelligence and reporting tools. The data in the Gold layer is highly structured and optimized for performance.

Note

This demo will be created step by step. Please note that Microsoft Fabric already assists by setting up the medallion flow for you.

Important

If you are not able to see the auto-create report option neither copilot be aware you need to enable AI features in your tenant, click here to see how

image image

Demo

Implementing a medallion architecture provides several benefits:

  • Data Quality: By organizing data into layers, you can apply quality checks and transformations in a structured manner, ensuring that the data in the Gold layer is reliable and ready for analysis.

  • Scalability: The architecture allows you to scale your data processing pipelines independently for each layer, providing flexibility and efficiency.

  • Performance: The Gold layer is optimized for performance, which means that your reporting and analytics queries will run faster.

  • Simplicity: It simplifies the data pipeline by breaking it down into smaller, manageable steps, each with a clear purpose.

  • Auditability: It provides a clear data lineage, making it easier to trace the origin of data and understand the transformations applied at each stage.

    image

Step 1: Set Up Your Environment

Note

If you need more visual help, check out the video that walks you through the steps. Below is a detailed step-by-step guide.

  1. Create a Fabric Workspace: This will be your central hub for all activities.

    • Click on Create a resource and create your Fabric Capacity if you haven't done so already.

      image
    • Go to the Fabric portal.

    • Click on Workspaces, then select either your existing workspace or create a new one by clicking New Workspace:

    • Provide a name and other required details, then create the workspace.

      image
    • Now, assign the Fabric Capacity to your workspace by clicking on Workspace settings and selecting the fabric capacity under the license.

      image
      fabricCapacity.mp4
  2. Create Lakehouses: Set up three lakehouses for the Bronze, Silver, and Gold layers.

    • In your Fabric workspace, create three lakehouses named raw_Bronze, cleansed_Silver, and curated_Gold.

      image image image
      CreateMedallionLakeLayers.mp4

Step 2: Ingest Data into the Bronze Layer

  1. Identify Data Sources:
    • Determine the sources from which you'll ingest data.

    • List all the data sources such as databases, APIs, file systems, etc.

      Let's suppose you have your data locally as csv, you can upload your data to the raw layer. Click to access sample files in case you don't have any data at the moment

      image

      You need to load the data as tables, either to a new table or to a existing one:

      Load information as table To a new one or existing one
      image image
      IngestDataIntoBronzeLayer.mp4

Suppose you need to extract data from your sql database

image

Note

In case you don't have any data at the moment, please follow this video to create a sample:

CREATE TABLE dbo.Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    BirthDate DATE,
    HireDate DATE,
    JobTitle VARCHAR(50),
    Salary DECIMAL(10, 4)
);

INSERT INTO dbo.Employees (EmployeeID, FirstName, LastName, BirthDate, HireDate, JobTitle, Salary)
VALUES 
(1, 'John', 'Doe', '1985-11-15', '2010-03-10', 'Software Engineer', 75000.0000),
(2, 'Jane', 'Smith', '1990-05-22', '2012-07-18', 'Project Manager', 85000.0000),
(3, 'Emily', 'Jones', '1988-04-17', '2014-06-25', 'Data Analyst', 65000.0000),
(4, 'Michael', 'Brown', '1982-06-21', '2008-09-15', 'HR Specialist', 55000.0000),
(5, 'Sarah', 'Davis', '1995-09-30', '2020-11-20', 'Marketing Specialist', 60000.0000);
ConnectSQLServerCreateTable.mp4

Important

Besides using Data pipelines to bring your SQL information, you can also leverage Microsoft Fabric's mirrored SQL capability. This feature allows you to create a mirrored copy of your SQL database, improving data availability, reliability, and disaster recovery. By maintaining a synchronized copy of your database in a different location, it ensures that your data is always accessible, even in the event of a failure or outage.

Note

The mirroring process can involve both inbound and outbound connections. Inbound connections refer to data coming into Azure from external sources, such as an on-premises SQL database being mirrored in Azure. Outbound connections, on the other hand, refer to data going out from Azure to external destinations, like mirroring an Azure SQL database to Microsoft Fabric. This setup allows for seamless data flow and integration across different platforms, ensuring data consistency and availability.

For example, both Azure SQL Database and Microsoft Fabric are Microsoft products. However, the concept of outbound connections still applies because the data is moving from one service (Azure SQL Database) to another service (Microsoft Fabric), even though they are both within the Microsoft ecosystem. This movement of data is considered outbound because it is leaving the Azure SQL Database environment and entering the Microsoft Fabric environment.

Under the Zero Trust Architecture, both inbound and outbound connections are treated with the same level of scrutiny and security protocols. This means that whether the connection is inbound or outbound, it is subject to strict verification processes to ensure it is safe and authorized. Key principles of Zero Trust include verification of every access request, least privilege access, continuous monitoring, and micro-segmentation. By applying these principles, Azure ensures that both inbound and outbound connections are secure, reducing the risk of unauthorized access and data breaches.

FabricMirroringSQL.mp4
  1. Create Dataflows or Pipelines: Use Data Factory to create dataflows or pipelines that ingest data into the Bronze lakehouse.
    • In Data Factory, create a new pipeline.
    • Add activities to the pipeline to ingest data from the identified sources.
    • Configure Data Ingestion: Set up the data ingestion process to load data into the Bronze layer in its raw format.
      • Configure the source and destination settings in the pipeline activities.

      • Ensure the data is being ingested into the raw_Bronze lakehouse.

        image image image image image image image image

Step 3: Transform Data in the Silver Layer

  1. Create Notebooks or Dataflows: Use Fabric's notebooks or dataflows to read data from the Bronze layer.
    • In the Fabric workspace, create a new notebook.

      image
    • Connect the notebook to the raw_Bronze lakehouse.

  2. Data Cleaning: Apply data cleaning steps to handle missing values, remove duplicates, and correct data types.
    • Write code in the notebook to clean the data.
    • Use functions like dropna(), dropDuplicates(), and withColumn() to clean the data.
  3. Data Enrichment: Enrich the data with additional information if needed.
    • Join the data with reference tables or lookup values.
    • Add new columns with enriched information.
  4. Write to Silver Layer: Write the cleaned and transformed data to the Silver lakehouse.
    • Use the write.format("delta").save() method to save the data to the cleansed_Silver lakehouse.

      PySpark Code to Move Data from Bronze to Silver:

      # Read data from the Bronze layer
      bronze_df = spark.read.format("delta").load("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-bronze-lakehousename>.Lakehouse/Tables/<table name>")
      
      # Perform transformations (if any)
      silver_df = bronze_df  # Assuming no transformations for simplicity
      
      # Write data to the Silver layer
      silver_df.write.mode("overwrite").option("mergeSchema", "true").format("delta").save("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-silver-lakehousename>.Lakehouse/Tables/<table_name>")
      image

      If you want see more, click here to see a sample of the notebook

      image

Step 4: Curate Data in the Gold Layer

  1. Read Data from Silver Layer: Use notebooks or dataflows to read data from the Silver lakehouse.
    • In a new notebook, connect to the cleansed_Silver lakehouse.
  2. Apply Business Logic: Apply any additional business logic or aggregations.
    • Write code to perform aggregations and calculations.
    • Use functions like groupBy(), agg(), and sum() to aggregate the data.
  3. Write to Gold Layer: Write the curated data to the Gold lakehouse.
    • Use the write.format("delta").save() method to save the data to the curated_Gold lakehouse.

      Before any changes:

      image

      Applying some transformations: If you want see more, click here to see a sample of the notebook

      PySpark Code to Move Data from Silver to Gold:

      # Read data from the Silver layer
      silver_df = spark.read.format("delta").load("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-silver-lakehouse>.Lakehouse/Tables/<table name>")
      
      # Perform aggregations
      gold_df = silver_df.groupBy("Name").agg(
          sum("Count").alias("TotalCount"),
          avg("price").alias("AveragePrice"),
          avg("tax").alias("AverageTax")
      )
      
      # Write data to the Gold layer
      gold_df.write.mode("overwrite").option("mergeSchema", "true").format("delta").save("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-gold-lakehouse name>.Lakehouse/Tables/<your table name>")
      image

Step 5: Set Up Pipelines for Orchestration

  1. Create Pipelines: Create pipelines to automate the movement of data from the Bronze layer to the Silver layer, and from the Silver layer to the Gold layer.
    • In Data Factory, create a new pipeline.
    • Add a copy activity to move data from the raw_Bronze lakehouse to the cleansed_Silver lakehouse.
    • Add another copy activity to move data from the cleansed_Silver lakehouse to the curated_Gold lakehouse.
  2. Schedule Pipelines: Schedule these pipelines to run at appropriate intervals.
    • Set up triggers in Data Factory to run the pipelines on a schedule that aligns with your data freshness requirements.
    • Consider the frequency of data updates and the latency that is acceptable for your use case.

Step 6: Enable Data Access for Reporting

  1. Configure SQL Analytics Endpoint:

    • Validate if you have the SQL Analytics Endpoint configured, you can review it from workspace view, primarly is required for Gold layer to be accessible to your reporting tools.
    image
  2. Create a semantic models with the required tables for the report.

    • Go to your gold layer.

      image
    • Create the semantic model:

      image image
  3. Create Power BI Reports: Use Power BI in Direct Lake Mode to create reports and dashboards from the data in the Gold layer.

    • Connect to the curated_Gold lakehouse using the SQL analytics endpoint.

    • Build reports and dashboards that provide insights into the data.

      image image

      Create new report:

      image

      If you click auto-create report Copilot will create a draft report for you:

      image

      You can edit your report:

      image

      Also, you can levarage AI capabilities with Copilot:

      image
Total views

Refresh Date: 2025-08-06