Medallion Architecture in Practice: Six Layers, One Source of Truth

Two hundred source definitions. Forty-seven job pipelines. Five hundred and sixteen transformation models. And not a single bespoke DAG written for any of them.

That's the headline result of a data platform we built to ingest, transform, and distribute healthcare and commercial data at scale - all driven by YAML configuration rather than hand-written pipeline code. Here's the architecture behind it, and what we're proud of delivering.

The Core Idea: Configuration Over Code

The guiding principle behind this platform is simple: adding a new data source should mean writing a YAML file, not a new DAG.

Instead of writing logic per source, we built general-purpose DAG generators that read metadata and produce the right Airflow behavior dynamically. The YAML files are the source of truth. This single decision shaped everything else about the platform.
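
To make the pattern concrete, here is a minimal sketch of a generator that turns source definitions into one pipeline spec each. It is illustrative only and not the platform's actual generator: DagSpec, generate_dags, and the config keys are hypothetical names, and plain dicts stand in for parsed YAML files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DagSpec:
    """Minimal stand-in for an Airflow DAG: an id plus ordered task names."""
    dag_id: str
    tasks: tuple


def generate_dags(source_configs):
    """Produce one DagSpec per source definition, with no per-source code."""
    dags = {}
    for cfg in source_configs:
        dag_id = f"ingest_{cfg['name']}"
        tasks = (f"fetch_from_{cfg['source_type']}", "load_raw", "build_base")
        dags[dag_id] = DagSpec(dag_id=dag_id, tasks=tasks)
    return dags


# In practice these dicts would be parsed from YAML files on disk.
configs = [
    {"name": "vendor_a_claims", "source_type": "sftp"},
    {"name": "vendor_b_accounts", "source_type": "s3"},
]
dags = generate_dags(configs)
```

Adding a third source here means appending one more dict (one more YAML file); generate_dags itself does not change.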

Architecture Overview

The platform has two connected repositories working in tandem: an orchestration layer built on Airflow, and a transformation layer built on dbt. Together they form a clean end-to-end pipeline - from raw vendor files to business-ready outputs distributed to downstream consumers.

The Ingestion Layer

For each data source, a YAML file defines everything the system needs to know: where the file comes from (SFTP, S3, Salesforce SOQL, or an external API), the filename pattern, load type, delimiter and encoding, archive behavior, and column-level data quality expectations.
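
As a rough illustration of what such a definition carries, the sketch below models one source as a Python dataclass. The field names, defaults, and example values are assumptions made for illustration; the real YAML schema is not shown in this post.

```python
from dataclasses import dataclass, field


@dataclass
class SourceConfig:
    """Hypothetical in-memory form of one source YAML file."""
    name: str
    source_type: str            # e.g. "sftp", "s3", "salesforce_soql", "api"
    filename_pattern: str       # e.g. "claims_*.csv"
    load_type: str = "full"     # assumed values: "full" or "incremental"
    delimiter: str = ","
    encoding: str = "utf-8"
    archive: bool = True
    # Maps a column name to the data quality expectations declared for it.
    column_checks: dict = field(default_factory=dict)


# Equivalent to what a parsed YAML file might contain.
raw = {
    "name": "vendor_a_claims",
    "source_type": "sftp",
    "filename_pattern": "claims_*.csv",
    "delimiter": "|",
    "column_checks": {"claim_id": ["not_null", "unique"]},
}
cfg = SourceConfig(**raw)
```

Anything the file omits (encoding, load type, archive behavior) falls back to a default, which keeps the common case short.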

A DAG generator reads these files at runtime and produces one Airflow DAG per source. Each DAG follows a consistent, reliable sequence:

1. Fetch files from the source system
2. Land them in an S3 input bucket
3. Log ingestion status to a Postgres/RDS metadata store
4. Load into Snowflake raw tables using COPY INTO
5. Run dbt tests on raw data
6. Build base models
7. Run dbt tests again, this time on the base models
8. Archive processed files
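
The sequence above can be sketched as a strictly linear chain of tasks. This is a simplified illustration, not the real generator: the step identifiers are hypothetical, and we assume the second test run targets the base models.

```python
STEPS = [
    "fetch_files",
    "land_in_s3",
    "log_ingestion_status",
    "copy_into_raw",
    "dbt_test_raw",
    "dbt_build_base",
    "dbt_test_base",   # assumption: the second test run covers base models
    "archive_files",
]


def build_task_chain(source_name, steps=STEPS):
    """Return (task_ids, edges) for a strictly linear pipeline."""
    task_ids = [f"{source_name}.{step}" for step in steps]
    # Each task depends only on the one immediately before it.
    edges = list(zip(task_ids, task_ids[1:]))
    return task_ids, edges


task_ids, edges = build_task_chain("vendor_a_claims")
```

Because every source shares the same STEPS list, a failed test gate stops the chain before anything downstream runs, for every source alike.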

Sensitive datasets follow a dedicated PII-specific path that routes them into an isolated Snowflake schema with separate bucket and stage configurations - keeping compliance requirements cleanly separated from the main pipeline with no extra work from the engineers adding new sources.
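
One way to picture this routing is a single flag in the source definition that selects the load target. The sketch below is an assumption about the shape of that logic; the pii key and every schema, bucket, and stage name are placeholders rather than the platform's real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    schema: str
    bucket: str
    stage: str


# All names below are placeholders, not real infrastructure identifiers.
STANDARD = Target(schema="RAW", bucket="example-input", stage="RAW_STAGE")
PII = Target(schema="RAW_PII", bucket="example-input-pii", stage="RAW_PII_STAGE")


def resolve_target(source_config):
    """Pick the load target from a single flag in the source definition."""
    return PII if source_config.get("pii", False) else STANDARD
```

The engineer adding a source only sets the flag; the bucket, stage, and schema follow from it.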

The Medallion Architecture: Six Layers of Trust

The heart of the transformation layer is a medallion-style architecture implemented in dbt. Rather than consolidating everything into a single schema and hoping for the best, we designed six distinct layers - each with a clear purpose, increasing levels of refinement, and explicit contracts between them.

The idea is straightforward: an engineer can look at any model path and instantly know how mature and trustworthy the data is. A raw_ prefix signals unprocessed ingestion straight from source. A reporting_ prefix signals a polished, validated, business-approved output. The layers in between are the journey from one to the other.
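
The naming convention is simple enough to check mechanically. A small sketch, assuming every model name starts with its layer name followed by an underscore (the helper layer_of is hypothetical):

```python
LAYERS = ["raw", "base", "core", "analytics", "reporting", "forecasting"]


def layer_of(model_name):
    """Return the layer implied by a model's prefix, or raise if none matches."""
    prefix = model_name.split("_", 1)[0]
    if prefix not in LAYERS:
        raise ValueError(f"{model_name!r} has no recognised layer prefix")
    return prefix
```

For example, layer_of("core_patients") returns "core", while a name without a layer prefix raises an error.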

RAW is the landing zone. Data arrives here exactly as it came from the source - no transformations, no business logic. Every source table is preserved in its original form, giving a full audit trail and the ability to reprocess from scratch if needed.

BASE is where raw data gets its first treatment. Light cleaning happens here: standardising column names, enforcing data types, trimming whitespace, handling nulls, and normalising encodings. BASE models are thin but consistent - they create a reliable foundation for everything built on top.

CORE is where the real transformation happens. This layer joins tables, deduplicates records, applies business rules, and produces the canonical representation of each business entity. If BASE is about technical correctness, CORE is about business correctness. A model like core_patients or core_prescriptions reflects what those entities actually mean to the organisation - not just what arrived in a file. CORE is the truth layer the entire downstream stack is built on.
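
Deduplication is a typical CORE-level rule. In the platform this logic lives in dbt SQL models; the Python below is only a sketch of one common rule (keep the most recent record per key), with hypothetical column names and made-up rows.

```python
def deduplicate_latest(rows, key, order_by):
    """Keep one row per key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Made-up example rows; ISO dates compare correctly as strings.
rows = [
    {"patient_id": 1, "updated_at": "2024-01-01", "state": "NY"},
    {"patient_id": 1, "updated_at": "2024-03-01", "state": "NJ"},
    {"patient_id": 2, "updated_at": "2024-02-01", "state": "CA"},
]
core_patients = deduplicate_latest(rows, key="patient_id", order_by="updated_at")
```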

ANALYTICS builds on CORE to produce aggregated, metric-level models - pre-computed measures, period-over-period comparisons, cohort rollups. By materialising these in the warehouse rather than in BI tools, every consumer querying the data sees consistent numbers.
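
A period-over-period comparison is a small calculation, which is exactly why it is worth defining once. The sketch below shows the arithmetic in Python for illustration; in the platform it would be a materialised dbt model, and the function name is hypothetical.

```python
def period_over_period(series):
    """Compute relative change between consecutive periods.

    series is a list of (period, value) pairs in chronological order.
    Returns (period, value, change) tuples, where change is None for the
    first period or when the previous value is zero.
    """
    out = []
    prev = None
    for period, value in series:
        if prev is None or prev == 0:
            change = None
        else:
            change = (value - prev) / prev
        out.append((period, value, change))
        prev = value
    return out
```

Edge cases such as the first period and a zero denominator are decided here, once, instead of separately in each dashboard.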

REPORTING packages analytics outputs into consumption-ready datasets shaped to the exact needs of specific consumers - dashboards, stakeholder reports, or downstream applications. These models are denormalised and labelled with business-friendly names, requiring minimal transformation by the end consumer.

FORECASTING is the specialised layer for predictive and modelling workflows. It draws from CORE and ANALYTICS to feed statistical models, demand forecasts, and scenario analyses - keeping predictive logic cleanly separated from the operational reporting stack.
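
The contracts between layers can be expressed as a table of allowed upstream layers and checked automatically. The sketch below reflects our reading of the descriptions above; allowing references within the same layer is an additional assumption, and find_violations is a hypothetical helper, not part of dbt.

```python
ALLOWED_UPSTREAM = {
    "raw": set(),
    "base": {"raw"},
    "core": {"base"},
    "analytics": {"core"},
    "reporting": {"analytics"},
    "forecasting": {"core", "analytics"},
}


def find_violations(dependencies):
    """Return (model, upstream) pairs that break the layer contracts.

    dependencies maps a model name to the list of models it selects from.
    """
    violations = []
    for model, upstreams in dependencies.items():
        layer = model.split("_", 1)[0]
        for upstream in upstreams:
            upstream_layer = upstream.split("_", 1)[0]
            same_layer = upstream_layer == layer  # assumed to be allowed
            if not same_layer and upstream_layer not in ALLOWED_UPSTREAM[layer]:
                violations.append((model, upstream))
    return violations
```

A check like this could run in CI against the dependency graph, so a reporting model that reaches past ANALYTICS straight into CORE is flagged before it merges.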

Job Orchestration

Job orchestration reads job YAML files to create sequential Airflow DAGs. A job can trigger source DAGs, trigger other jobs, run dbt commands, wait on external tasks, capture dbt artifacts, and send status or data quality emails. This composability is what makes the whole system feel like one cohesive platform rather than a collection of independent pipelines.
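
Composability is easiest to see when a job that references another job is flattened into its ordered steps. The sketch below is a simplification: the step types, job names, and dbt tags are hypothetical, and a real Airflow setup would more likely trigger the nested job as its own DAG run than inline its steps.

```python
def expand_job(job_name, jobs, _seen=()):
    """Flatten a job into an ordered list of (step_type, target) pairs.

    Steps of type "job" are expanded in place; a job that directly or
    indirectly triggers itself raises ValueError.
    """
    if job_name in _seen:
        raise ValueError(f"cycle detected at job {job_name!r}")
    flat = []
    for step in jobs[job_name]:
        if step["type"] == "job":
            flat.extend(expand_job(step["target"], jobs, _seen + (job_name,)))
        else:
            flat.append((step["type"], step["target"]))
    return flat


# Dicts standing in for parsed job YAML files.
jobs = {
    "nightly_core": [
        {"type": "source_dag", "target": "vendor_a_claims"},
        {"type": "dbt", "target": "dbt build --select tag:core"},
    ],
    "nightly_all": [
        {"type": "job", "target": "nightly_core"},
        {"type": "dbt", "target": "dbt build --select tag:analytics"},
        {"type": "email", "target": "status"},
    ],
}
plan = expand_job("nightly_all", jobs)
```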

Outbound Publishing

Data doesn't just flow in - it flows out. A feeds framework reads feed YAML files, executes Snowflake queries, generates output files, and uploads them to S3 or pushes directly back to Salesforce. A separate file transfer manager handles standalone SFTP-to-S3 delivery for vendor partners.
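
The query-to-file step of a feed can be sketched in a few lines. This is an illustration under assumptions: the feed keys (query, delimiter, header) are hypothetical, and a stub function stands in for the Snowflake query so the example runs on its own.

```python
import csv
import io


def render_feed(feed, run_query):
    """Run the feed's query and return delimited text.

    run_query is any callable returning (column_names, rows); in production
    it would wrap a warehouse cursor.
    """
    columns, rows = run_query(feed["query"])
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter=feed.get("delimiter", ","), lineterminator="\n"
    )
    if feed.get("header", True):
        writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def fake_query(sql):
    # Stand-in for a warehouse query; returns fixed data for illustration.
    return ["account_id", "tier"], [(1, "gold"), (2, "silver")]


feed = {"name": "partner_accounts", "query": "select 1", "delimiter": "|"}
text = render_feed(feed, fake_query)
```

Uploading the resulting text to S3 or pushing it to Salesforce would be a separate step driven by the same feed definition.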

What We Achieved

Scale without complexity. The platform handles over 200 source YAML definitions, 47 job definitions, 15 feed definitions, and 516 SQL models - spanning commercial data domains including Veeva, Prognos, Claritas, SFMC, MMIT, Komodo, Cardinal 3PL, and others. Adding a new vendor typically involves writing a source YAML and a handful of dbt models. No new DAG code, no new operators.

Operational reliability built in. Every source pipeline - without exception - gets ingestion logging, file archiving, DQ test gates, and dbt artifact capture. These aren't features someone has to remember to add; they're part of the framework that every YAML-defined source inherits automatically.

Compliance handled gracefully. PII data flows through a fully isolated path with separate infrastructure, enforced by the framework - not by individual engineers remembering to do the right thing.

A trustworthy data contract across six layers. The medallion architecture gives every downstream consumer a clear signal about what they're working with. RAW preserves the original record. BASE standardises it. CORE establishes business truth. ANALYTICS aggregates it consistently. REPORTING delivers it consumption-ready. FORECASTING powers predictive work, cleanly separated from operational flows. Each layer earns its name.

Real-world edge cases solved. ZIP file ingestion, incremental file detection, encoding variations, outbound Salesforce syncs, custom QC reporting - these aren't afterthoughts. They're built into the framework as first-class capabilities, accumulated through the real experience of working with dozens of external vendors.
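
Incremental file detection, for instance, comes down to comparing what the vendor has delivered against what has already been logged as loaded. A minimal sketch, assuming glob-style filename patterns and a list of previously loaded filenames taken from the metadata store (the function name and file names are hypothetical):

```python
from fnmatch import fnmatch


def new_files(available, pattern, already_loaded):
    """Return files matching the pattern that have not been loaded yet."""
    loaded = set(already_loaded)
    return sorted(
        name for name in available if fnmatch(name, pattern) and name not in loaded
    )


available = ["claims_20240101.csv", "claims_20240102.csv", "readme.txt"]
pending = new_files(available, "claims_*.csv", ["claims_20240101.csv"])
```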

The Bigger Picture

The metadata-driven approach isn't new - it's a well-established pattern in data engineering. What made it work here was the discipline to commit to it consistently across ingestion, transformation, orchestration, and outbound publishing, rather than applying it selectively and falling back to bespoke code when things got complicated.

The result is a platform that scales with the business. New data sources, new business domains, new downstream consumers - all handled through configuration. The engineering effort goes into improving the framework, not repeating the same work for the hundredth source.

For teams considering a similar path, our advice: invest in the YAML schema design before writing a single line of generator code. The shape of your configuration files determines everything downstream, and it's far easier to get that right at the start than to migrate 200 source definitions later.
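
One practical way to protect a schema once it is designed is to validate every definition before it reaches the generator. The sketch below is a hedged illustration: the required keys and the list of source types are assumptions, not the platform's actual schema.

```python
REQUIRED_KEYS = {"name", "source_type", "filename_pattern"}
ALLOWED_SOURCE_TYPES = {"sftp", "s3", "salesforce_soql", "api"}


def validate_source(definition):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in sorted(REQUIRED_KEYS - definition.keys()):
        problems.append(f"missing required key: {key}")
    source_type = definition.get("source_type")
    if source_type is not None and source_type not in ALLOWED_SOURCE_TYPES:
        problems.append(f"unsupported source_type: {source_type}")
    return problems
```

Running a check like this over every definition in CI reports all problems at once, which matters more as the number of definitions grows.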

Want to learn more about how we approach data platform engineering? Reach out to our team - we'd love to hear how you're tackling similar challenges.