
Building Scalable CDC Pipelines with Airbyte and Snowflake

Mar 2026 · 8 min read

Why CDC Matters

Change Data Capture (CDC) is a pattern that identifies and tracks changes in source databases so that downstream systems can react in near real-time. Traditional batch ETL simply cannot keep up when business teams need fresh data every few minutes rather than once a day. By capturing only the rows that changed — inserts, updates, and deletes — CDC dramatically reduces data transfer volumes and keeps your warehouse costs under control.

Architecture Overview

Our pipeline processes approximately 200 GB of incremental data per day from PostgreSQL to Snowflake. At its core, Airbyte handles the replication layer. It reads the PostgreSQL WAL (Write-Ahead Log) and emits change events in a normalized format. Airflow orchestrates the entire workflow — triggering Airbyte syncs, running data quality checks, and managing SCD Type-2 merge operations in Snowflake. The separation of concerns between the two tools is key: Airbyte focuses on reliable extraction, and Airflow handles scheduling, retries, and alerting.
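To make the hand-off concrete, here is a minimal sketch of the request an Airflow task issues when it triggers a sync through Airbyte's configuration API, which exposes a `POST /connections/sync` endpoint. The host, port, and connection ID are placeholders; in practice you would use an HTTP client or Airflow's Airbyte provider rather than building the request by hand:

```python
import json

AIRBYTE_URL = "http://airbyte-server:8001/api/v1"  # placeholder host/port


def sync_request(connection_id: str) -> tuple[str, str]:
    """Build the endpoint URL and JSON body for triggering an Airbyte sync.

    Airbyte identifies each source-to-destination pairing by a connection
    ID; triggering a sync is a POST with that ID in the body.
    """
    return (
        f"{AIRBYTE_URL}/connections/sync",
        json.dumps({"connectionId": connection_id}),
    )
```

Keeping the trigger this thin is what makes the separation of concerns work: Airflow only needs to know the connection ID and how to poll for job status, while all extraction logic stays inside Airbyte.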

Configuring Airbyte for CDC

Setting up CDC in Airbyte involves enabling logical replication on PostgreSQL (wal_level = logical), creating a dedicated replication slot, and configuring Airbyte's PostgreSQL source connector with the CDC option. One critical lesson: always alert on replication slot lag. If Airbyte goes down and the slot isn't consumed, the WAL will grow unbounded and eventually fill your disk. We use a simple CloudWatch alarm on the pg_replication_slots view to catch this early.
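A sketch of what that slot check can look like: the query measures how much WAL the slot is still pinning (restart_lsn is the oldest position the slot retains), and a small helper decides whether the lag crosses an alert threshold. The 4 GB default is a placeholder; tune it to your disk headroom:

```python
# Bytes of WAL retained by a replication slot (PostgreSQL 10+ functions).
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = %s;
"""


def slot_is_unhealthy(lag_bytes: int, threshold_gb: float = 4.0) -> bool:
    """True when retained WAL exceeds the alert threshold."""
    return lag_bytes > threshold_gb * 1024 ** 3
```

Run the query on a schedule (a cron, a Lambda, or an Airflow sensor all work) and page when `slot_is_unhealthy` returns True; by the time the disk-full alarm fires, it is usually too late.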

SCD Type-2 in Snowflake

Once the raw change events land in a staging table in Snowflake, we run a MERGE statement to apply them against the target dimension table using SCD Type-2 logic. This preserves full history: when a row is updated, the existing record gets an end_date timestamp and a new record is inserted with the updated values. The MERGE runs inside an Airflow task that also performs row-count reconciliation between the source and target to catch any drift.
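The merge logic itself can be sketched in a few lines of pure Python, independent of the Snowflake MERGE syntax. The sketch assumes each row carries a business key `id`, a `start_date`, and an `end_date` where `None` marks the current version; applying an update closes the open row and appends the new version (a delete would simply close the row without appending):

```python
def apply_scd2(history: list[dict], change: dict, ts: str) -> None:
    """Apply one change event to an in-memory SCD Type-2 history.

    Mirrors the warehouse merge: the current version of the changed key
    gets an end_date, and a new open-ended version is appended.
    """
    for row in history:
        if row["id"] == change["id"] and row["end_date"] is None:
            row["end_date"] = ts          # expire the old version
    history.append({**change, "start_date": ts, "end_date": None})
```

Whatever the target engine, the invariant to test for is the same: exactly one open-ended row per business key, with contiguous date ranges across versions.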

Handling Schema Evolution

Schema changes in the source database are inevitable. New columns, renamed fields, type changes — all of these can break a CDC pipeline. We handle this by enabling Airbyte's "detect and propagate schema changes" feature and pairing it with a custom Airflow sensor that detects schema mismatches before the merge step. When a mismatch is found, the pipeline pauses and sends a Slack alert so the team can review and approve the DDL change in Snowflake.
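The comparison at the heart of that sensor is straightforward. A minimal sketch, assuming each schema is represented as a column-to-type mapping (the real sensor would pull `expected` from Snowflake's information schema and `observed` from Airbyte's catalog):

```python
def schema_mismatches(expected: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Compare two column->type maps and describe every difference."""
    diffs = []
    for col, typ in observed.items():
        if col not in expected:
            diffs.append(f"new column: {col} {typ}")
        elif expected[col] != typ:
            diffs.append(f"type change: {col} {expected[col]} -> {typ}")
    for col in expected:
        if col not in observed:
            diffs.append(f"dropped column: {col}")
    return diffs
```

An empty result lets the merge proceed; anything else feeds the Slack alert so a human can approve the corresponding DDL.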

Performance and Cost Optimization

Running CDC at scale requires careful tuning. We use Snowflake's transient tables for staging to avoid Time Travel storage costs, size our virtual warehouse to SMALL for most syncs (scaling up only for initial full loads), and compress staging files with Snappy before loading. On the Airbyte side, we tune the batch size and flush interval to balance latency against API call volume. These optimizations reduced our monthly Snowflake bill by roughly 25%.
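The staging-table setup is worth showing explicitly, since it is where the Time Travel savings come from. A sketch of the DDL (table and column names are hypothetical): TRANSIENT tables skip Fail-safe storage, and setting the retention period to zero disables Time Travel on the staging data, which is disposable by definition:

```python
# Hypothetical staging table for raw change events.
STAGING_DDL = """
CREATE TRANSIENT TABLE IF NOT EXISTS stg_customer_changes (
    customer_id   NUMBER,
    payload       VARIANT,
    change_ts     TIMESTAMP_NTZ
)
DATA_RETENTION_TIME_IN_DAYS = 0;
"""
```

Keep this setting on staging only; the target dimension table should retain its Time Travel window for recovery and audits.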

💡 Key Takeaways

  1. CDC captures only changed rows, reducing transfer volumes by 90%+ compared to full-table batch loads.
  2. Separating extraction (Airbyte) from orchestration (Airflow) makes each component independently testable and replaceable.
  3. Always monitor your PostgreSQL replication slot to prevent WAL disk exhaustion.
  4. SCD Type-2 merge logic in Snowflake preserves full history for audit and analytics.
  5. Schema evolution handling is not optional — automate detection and approval workflows early.