The Cost Problem
Cloud data platforms make it incredibly easy to scale — and incredibly easy to overspend. We've seen Snowflake bills balloon 3× after a single misconfigured query, and AWS Glue jobs running 24/7 when they only needed 2 hours of compute per day. Cost optimization isn't about being cheap; it's about making sure every dollar of cloud spend translates into business value.
Snowflake Warehouse Tuning
The easiest Snowflake win is right-sizing virtual warehouses. Most teams default to LARGE or X-LARGE warehouses when SMALL handles 80% of workloads just fine. We profiled every query using the QUERY_HISTORY view, identified queries with high queue times (a sign of under-provisioning) and queries with low utilization (a sign of over-provisioning), and right-sized accordingly. We also implemented auto-suspend after 60 seconds of inactivity and used multi-cluster warehouses with scaling policies for bursty workloads. These changes alone cut our Snowflake compute costs by 35%.
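As a sketch of that profiling step, the queue-time and utilization checks reduce to a simple classifier over per-warehouse aggregates. The `WarehouseStats` shape and both thresholds below are illustrative assumptions, not Snowflake defaults:

```python
from dataclasses import dataclass

@dataclass
class WarehouseStats:
    """Per-warehouse aggregates, e.g. summarized from QUERY_HISTORY.

    The field names here are our own; in Snowflake the queue time
    would come from a column like QUEUED_OVERLOAD_TIME.
    """
    name: str
    avg_queue_seconds: float    # mean time queries waited for a slot
    avg_utilization_pct: float  # hypothetical utilization metric

def sizing_recommendation(stats: WarehouseStats,
                          queue_threshold: float = 30.0,
                          util_floor: float = 25.0) -> str:
    """Flag warehouses that likely need resizing (illustrative thresholds)."""
    if stats.avg_queue_seconds > queue_threshold:
        # Queries are waiting for compute: under-provisioned.
        return "scale up or add clusters"
    if stats.avg_utilization_pct < util_floor:
        # Paying for mostly idle credits: over-provisioned.
        return "scale down"
    return "keep as-is"
```

Running this against each warehouse's weekly aggregates gives a first-pass resizing list that a human then reviews.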
Batch vs. Streaming Cost Trade-offs
Not everything needs to be real-time. We audit each pipeline's freshness requirements with business stakeholders. If the dashboard is viewed once per morning, a nightly batch is fine — no need for a Kafka stream. We categorize pipelines into three tiers: Tier 1 (near real-time, < 5 min latency), Tier 2 (hourly), and Tier 3 (daily). This simple classification helped us move three pipelines from Kafka to batch Airflow DAGs, saving $2,100/month in streaming infrastructure costs.
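The tiering rule above can be sketched as a lookup that picks the cheapest tier still meeting a pipeline's agreed freshness. The `assign_tier` helper and its minute values are our own illustration of the three tiers:

```python
# Max delivery latency per tier, in minutes (Tier 1 < 5 min, Tier 2 hourly,
# Tier 3 daily, as in the classification above).
TIER_MAX_LATENCY_MIN = {1: 5, 2: 60, 3: 24 * 60}

def assign_tier(required_freshness_min: int) -> int:
    """Return the cheapest (highest-numbered) tier whose delivery latency
    still satisfies the stakeholder's freshness requirement."""
    for tier in sorted(TIER_MAX_LATENCY_MIN, reverse=True):  # cheapest first
        if TIER_MAX_LATENCY_MIN[tier] <= required_freshness_min:
            return tier
    return 1  # tighter than 5 minutes: genuinely needs streaming
```

A dashboard viewed each morning (freshness of a day) lands in Tier 3, which is exactly the kind of pipeline we moved from Kafka to batch.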
Storage Optimization
Storage costs creep up silently. We implemented three strategies: (1) convert CSV and JSON files to Parquet or Delta format at the Bronze layer — compression ratios of 5-10× are typical, (2) set lifecycle policies on S3/ADLS to transition data older than 90 days to cold storage tiers, and (3) disable Snowflake Time Travel (DATA_RETENTION_TIME_IN_DAYS = 0) on staging tables and use transient tables — which cap Time Travel at one day and skip the seven-day Fail-safe period — where we don't need long point-in-time history. Strategy 3 alone freed up 4 TB of storage.
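For strategy 2, the 90-day transition is easiest to keep in code. A minimal sketch that builds one rule in the shape boto3's `put_bucket_lifecycle_configuration` expects (the `lifecycle_rule` helper name and the prefix are our own):

```python
def lifecycle_rule(prefix: str, cold_after_days: int = 90,
                   storage_class: str = "GLACIER") -> dict:
    """Build one S3 lifecycle rule that moves objects under `prefix`
    to a cold tier after `cold_after_days` days."""
    return {
        "ID": f"cold-tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": cold_after_days, "StorageClass": storage_class},
        ],
    }
```

Collecting these rules into `{"Rules": [...]}` and applying them via boto3 keeps the policy version-controlled instead of hand-edited in the console.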
Compute Scheduling
Many pipelines run on fixed schedules regardless of whether new data has arrived. We replaced time-based triggers with event-based triggers wherever possible — S3 event notifications, database CDC triggers, and webhook-based Airflow DAG runs. This reduced unnecessary compute executions by 40%. For non-event-driven workloads, we use AWS Spot Instances for Airflow workers and Databricks pools with spot pricing, accepting the occasional task restart in exchange for 60-70% cost savings.
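The spot trade-off is easy to sanity-check with back-of-the-envelope arithmetic. This sketch (our own helper, with hypothetical parameter values) shows why a discount in the 60-70% range survives occasional restarts:

```python
def expected_spot_cost(on_demand_cost: float, spot_discount: float,
                       interruption_rate: float, retry_overhead: float) -> float:
    """Expected cost of running one task on spot capacity.

    on_demand_cost:    cost of one clean run at on-demand pricing
    spot_discount:     e.g. 0.65 for a 65% discount
    interruption_rate: fraction of runs interrupted at least once
    retry_overhead:    fraction of a run's compute wasted per interruption
    """
    spot_run = on_demand_cost * (1 - spot_discount)
    # Interruptions add, on average, a partial re-run on top of the clean run.
    return spot_run * (1 + interruption_rate * retry_overhead)
```

With a $100 on-demand run, a 65% discount, 10% of runs interrupted, and half a run lost per interruption, the expected spot cost is $36.75 — still roughly 63% cheaper, which is why we accept the restarts.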
Building a Cost Culture
Tools and techniques only work if the team cares about costs. We added a monthly "Cloud Cost Review" to our sprint ceremonies, set up per-team cost dashboards with alerts at 80% and 100% of budget, and made cost impact a mandatory section in every architecture design review. When engineers can see how their pipeline changes affect the bill in near real-time, they self-optimize. Our total data platform costs dropped 28% over six months without sacrificing any SLAs.
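The 80%/100% alerting rule itself reduces to a one-line check; a minimal sketch (the `budget_alerts` helper is our own illustration of the dashboard logic):

```python
def budget_alerts(spend: float, budget: float,
                  thresholds: tuple = (0.8, 1.0)) -> list:
    """Return the alert levels a team's spend has crossed,
    e.g. ["80%"] or ["80%", "100%"]."""
    return [f"{int(t * 100)}%" for t in thresholds if spend >= t * budget]
```

Wiring this into the per-team dashboards means an engineer sees the 80% warning with time left in the month to react, rather than a surprise on the invoice.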