Hello, I'm

Vibhor Bansal

Building high-throughput, cost-efficient data pipelines across Azure & AWS — transforming raw data into actionable intelligence at scale.

Download Resume

About Me

Bridging the gap between raw data and business intelligence

Results-driven Data Engineer with 5+ years of experience designing and optimizing end-to-end data pipelines in Azure and AWS. Skilled in ETL/ELT, data modeling, CDC/SCD, and orchestration using Airflow. Proven track record in building scalable, cost-efficient and secure data workflows, integrating batch and streaming sources like Kafka and relational databases into Snowflake and Delta Lake for analytics.

Scalable Pipelines

5+ years designing end-to-end ETL/ELT data pipelines processing hundreds of GBs daily with CDC, SCD, and Medallion Architecture.

Cloud Expertise

Deep experience across Azure (ADLS, ADF, Databricks, Synapse) and AWS (S3, Lambda, Glue, CloudWatch) ecosystems.

Performance & Cost

Proven track record of optimizing throughput, reducing processing costs by 30%, and achieving 99.9% uptime on production systems.

Skills & Technologies

The tools and platforms I use to build data systems

Cloud Platforms

Azure ADLS Gen2 · Azure Data Factory · Databricks · Synapse · AWS S3 · AWS Lambda · AWS Glue · CloudWatch

Data Technologies

Snowflake · Delta Lake · Spark SQL · Medallion Architecture · Kafka · Change Data Capture

Processing Frameworks

Apache Spark · PySpark · Airbyte

Orchestration Tools

Apache Airflow · Azure Data Factory

Programming Languages

Python · Java · SQL · Shell Scripting

DevOps & Monitoring

Jenkins · Argo CD · Docker · Helm · Kubernetes · Grafana · Prometheus · Splunk · Dynatrace · SonarQube

Professional Experience

A track record of building data systems that deliver real business impact

WPP Media

Data Engineer · Bangalore

Dec 2025 – Present
  • Working on a large-scale Media Data Lake project, ingesting and processing media data using PySpark into a Medallion Architecture (Bronze → Silver → Gold).
  • Building and optimizing PySpark-based ingestion pipelines to handle high-volume media datasets across multiple domains with reliability and performance.
  • Applying data quality checks and transformations at each layer of the Medallion architecture to ensure clean, analytics-ready data in the Gold layer.
PySpark · Medallion Architecture · Data Lake · Data Quality
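
For illustration, a minimal PySpark sketch of the Bronze-to-Silver promotion described above; the paths and column names are hypothetical, and the quality rules shown (de-duplication and null checks) are simplified stand-ins:

  from pyspark.sql import SparkSession, functions as F

  # Illustrative only: table paths and column names are placeholders.
  spark = SparkSession.builder.appName("media-bronze-to-silver").getOrCreate()

  bronze = spark.read.format("delta").load("/lake/bronze/media_events")

  # Quality rules: drop duplicates and rows missing mandatory keys,
  # then standardise the event timestamp before promoting to Silver.
  silver = (
      bronze.dropDuplicates(["event_id"])
            .filter(F.col("event_id").isNotNull() & F.col("event_ts").isNotNull())
            .withColumn("event_ts", F.to_timestamp("event_ts"))
  )

  silver.write.format("delta").mode("overwrite").save("/lake/silver/media_events")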

Optum (UnitedHealth Group)

Senior Software Engineer · Noida

Mar 2022 – Dec 2025
  • Architected an end-to-end CDC pipeline processing 200GB of data daily from PostgreSQL to Snowflake using Airbyte and Airflow, enabling real-time analytics for 50+ downstream teams.
  • Developed a Java Spring Boot microservice handling 100K+ Kafka messages/hour, achieving 99.9% uptime and reducing data processing costs by 30%.
  • Built ADF pipelines for healthcare data — ingesting data from multiple sources (APIs, source databases, ADLS storage) to ADLS, converting to Parquet, and transforming in Databricks (Bronze → Silver → Gold) as Delta tables.
  • Led migration from Jenkins to Argo CD, saving 15 hours/month in production deployment effort.
  • Managed 10+ domains with 90%+ code coverage, SonarQube gate 9, and zero critical vulnerabilities.
Snowflake · Airflow · Kafka · Airbyte · Databricks · ADF · Java · Argo CD
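
A trimmed Airflow DAG with the same shape as the orchestration above: trigger an Airbyte sync, then reconcile. Connection IDs are placeholders, and it assumes Airflow 2.x with the apache-airflow-providers-airbyte package installed:

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.python import PythonOperator
  from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

  with DAG(
      dag_id="postgres_to_snowflake_cdc",
      start_date=datetime(2024, 1, 1),
      schedule="@hourly",
      catchup=False,
  ) as dag:
      sync = AirbyteTriggerSyncOperator(
          task_id="trigger_airbyte_sync",
          airbyte_conn_id="airbyte_default",   # placeholder Airflow connection
          connection_id="pg-to-snowflake",     # placeholder Airbyte connection ID
          asynchronous=False,
      )

      def reconcile(**_):
          # Placeholder: compare source vs. target row counts and alert on drift.
          pass

      check = PythonOperator(task_id="reconcile_counts", python_callable=reconcile)

      sync >> check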

Bank of America

Apprentice Trainee · Chennai

Jul 2021 – Mar 2022
  • Developed an AWS ETL pipeline for trading data using Lambda, S3, Glue, and Athena.
  • Automated daily extraction with CloudWatch triggers, reducing manual work by 90%.
  • Performed data cleansing and transformation before loading into analytics layers, enabling near-real-time insights for business analysts.
AWS Lambda · S3 · Glue · Athena · CloudWatch
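
A small sketch of that serverless pattern: a Lambda handler, invoked by a daily CloudWatch (EventBridge) schedule, that starts a Glue job over the latest S3 drop. The job name and bucket prefix are hypothetical:

  import json

  import boto3

  glue = boto3.client("glue")

  def lambda_handler(event, context):
      # Invoked by a scheduled CloudWatch/EventBridge rule once per day.
      run = glue.start_job_run(
          JobName="trading-data-daily-etl",                          # hypothetical Glue job
          Arguments={"--source_prefix": "s3://trading-raw/daily/"},  # hypothetical bucket
      )
      return {"statusCode": 200, "body": json.dumps({"JobRunId": run["JobRunId"]})}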

Featured Projects

Real-world data engineering solutions with measurable impact

Featured Project

AI Resume Builder

An intelligent resume and portfolio builder powered by AI suggestions, offering multiple professionally designed templates, real-time preview, and PDF export capabilities.

Impact: Streamlines resume creation with AI-powered content suggestions and professional templates
Next.js · React · TypeScript · Tailwind CSS · AI/ML

Real-Time CDC Pipeline

End-to-end Change Data Capture pipeline processing 200GB daily from PostgreSQL to Snowflake. Uses Airbyte for ingestion and Airflow for orchestration, enabling real-time analytics.

Impact: Enables real-time analytics for 50+ downstream teams with automated data sync
Snowflake · Airbyte · Airflow · PostgreSQL · Python
Internal Project

Kafka Streaming Microservice

High-throughput Java Spring Boot microservice processing 100K+ Kafka messages per hour into Snowflake with 99.9% uptime and 30% cost reduction.

Impact: 99.9% uptime, 30% cost reduction through optimized batch sizing
Java · Spring Boot · Kafka · Snowflake · Docker
Internal Project

AWS Trading Data ETL

Serverless ETL pipeline for trading data using AWS Lambda, S3, Glue, and Athena with automated CloudWatch triggers for daily extraction.

Impact: 90% reduction in manual work through automated daily extraction
AWS Lambda · S3 · Glue · Athena · CloudWatch
Internal Project

Media Data Lake (Medallion)

Large-scale media data lakehouse built on Medallion Architecture processing high-volume datasets using PySpark with data quality checks at each layer.

Impact: Analytics-ready Gold layer for cross-domain media data analysis
PySpark · Delta Lake · Medallion Architecture · Data Quality
Internal Project

Azure Healthcare Pipeline

ADF pipeline ingesting data from multiple sources (APIs, source databases, ADLS storage) to ADLS, converting to Parquet, and transforming in Databricks with event-based triggers and SCD logic.

Impact: Near real-time Gold layer updates for healthcare data with SCD logic
Azure ADF · Databricks · ADLS Gen2 · Delta Lake · Parquet
Internal Project
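
As a sketch of the SCD-style load into Delta, a merge using the delta-spark API with made-up paths and keys. The production pipeline's SCD handling is richer; this shows only a simple upsert (Type 1) flavour:

  from delta.tables import DeltaTable
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("scd-merge").getOrCreate()

  updates = spark.read.parquet("/landing/patients_daily")       # hypothetical Parquet drop
  target = DeltaTable.forPath(spark, "/lake/silver/patients")   # hypothetical Delta table

  # Upsert keyed on patient_id: update changed rows, insert new ones.
  (target.alias("t")
      .merge(updates.alias("s"), "t.patient_id = s.patient_id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())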

System Design & Architecture

Data pipeline architectures I've designed and built

CDC Pipeline Architecture

End-to-end Change Data Capture pipeline from PostgreSQL to Snowflake with automated orchestration and reconciliation.

PostgreSQL source (tables with triggers) → Airbyte ingestion (CDC sync) → Airflow DAGs (orchestration and reconciliation) → Snowflake warehouse (SCD logic) → Serving layer (analytics for 50+ users)

Streaming Architecture

High-throughput streaming system processing 100K+ messages per hour from Kafka to Snowflake.

Application producers (events) → Kafka broker (topics and partitions) → Spring Boot consumer (batch processing) → Snowflake storage (streams) → Foundation layer (latest data, SCD)
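
The consumer stage runs as a Java Spring Boot service in production; purely to illustrate the batch-then-commit idea, here is a Python sketch using kafka-python with a hypothetical topic and a stubbed Snowflake flush:

  from kafka import KafkaConsumer  # kafka-python

  consumer = KafkaConsumer(
      "events-topic",                      # hypothetical topic
      bootstrap_servers="localhost:9092",
      group_id="snowflake-loader",
      enable_auto_commit=False,
      value_deserializer=lambda b: b.decode("utf-8"),
  )

  BATCH_SIZE = 5000  # batch sizing is the main throughput/cost lever
  batch = []

  for message in consumer:
      batch.append(message.value)
      if len(batch) >= BATCH_SIZE:
          # flush_to_snowflake(batch)  # stub: bulk-load via stage + COPY INTO
          batch.clear()
          consumer.commit()  # commit offsets only after a successful flush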

Medallion Architecture

Multi-layer data lakehouse processing high-volume datasets with quality checks at each stage.

Landing zone (raw CSV/JSON files) → Bronze (raw ingestion, schema applied) → Silver (cleansed and validated) → Gold (aggregated, analytics-ready) → Consumers (BI tools and reports)

Key Achievements

Measurable impact through engineering excellence

200GB+
Data Processed Daily
End-to-end CDC pipeline
100K+
Messages / Hour
Kafka streaming throughput
99.9%
System Uptime
Production microservice
30%
Cost Reduction
Optimized batch processing
90%
Manual Work Reduced
CloudWatch automation
90%+
Code Coverage
Across 10+ domains

Upskilling & Learning

Staying ahead in the ever-evolving data landscape

Currently Learning

dbt (Data Build Tool)

Exploring dbt for modern data transformations — building modular, testable SQL pipelines with version control and documentation as first-class citizens.

dbt Core · SQL Transformations · Data Testing · Documentation
Experimenting

Generative AI

Experimenting with LLMs and generative AI to build intelligent tools like AI-powered resume builders and automated data documentation systems.

LLMs · Prompt Engineering · AI Applications · RAG
Exploring

Intelligent Data Applications

Building data-driven applications that leverage ML models and AI for automated insights, anomaly detection, and smart data quality monitoring.

ML Pipelines · Anomaly Detection · Data Quality AI · AutoML

Get In Touch

Let's discuss data challenges and opportunities

I'm always open to discussing new opportunities, interesting data problems, or ways to collaborate on scalable data solutions.

Email
vibhorbansal1312@gmail.com
LinkedIn
Vibhor Bansal
GitHub
vibhor-bansal
Location
Bangalore, India