Hello, I'm

Vibhor Bansal

Building high-throughput, cost-efficient data pipelines across Azure & AWS — transforming raw data into actionable intelligence at scale.

Download Resume

About Me

Bridging the gap between raw data and business intelligence

Results-driven Data Engineer with 5+ years of experience designing and optimizing end-to-end data pipelines in Azure and AWS. Skilled in ETL/ELT, data modeling, CDC/SCD, and orchestration using Airflow. Proven track record in building scalable, cost-efficient and secure data workflows, integrating batch and streaming sources like Kafka and relational databases into Snowflake and Delta Lake for analytics.

Scalable Pipelines

5+ years designing end-to-end ETL/ELT data pipelines processing hundreds of GBs daily with CDC, SCD, and Medallion Architecture.

Cloud Expertise

Deep experience across Azure (ADLS, ADF, Databricks, Synapse) and AWS (S3, Lambda, Glue, CloudWatch) ecosystems.

Performance & Cost

Proven track record of optimizing throughput, reducing processing costs by 30%, and achieving 99.9% uptime on production systems.

Skills & Technologies

The tools and platforms I use to build data systems

Cloud Platforms

Azure ADLS Gen2 · Azure Data Factory · Databricks · Synapse · AWS S3 · AWS Lambda · AWS Glue · CloudWatch

Data Technologies

Snowflake · Delta Lake · Spark SQL · Medallion Architecture · Kafka · Change Data Capture

Processing Frameworks

Apache Spark · PySpark · Airbyte

Orchestration Tools

Apache Airflow · Azure Data Factory

Programming Languages

Python · Java · SQL · Shell Scripting

DevOps & Monitoring

Jenkins · Argo CD · Docker · Helm · Kubernetes · Grafana · Prometheus · Splunk · Dynatrace · SonarQube

Professional Experience

A track record of building data systems that deliver real business impact

WPP Media

Data Engineer · Bangalore

Dec 2025 – Present
  • Working on a large-scale Media Data Lake project, ingesting and processing media data using PySpark into a Medallion Architecture (Bronze → Silver → Gold).
  • Building and optimizing PySpark-based ingestion pipelines to handle high-volume media datasets across multiple domains with reliability and performance.
  • Applying data quality checks and transformations at each layer of the Medallion architecture to ensure clean, analytics-ready data in the Gold layer.
PySpark · Medallion Architecture · Data Lake · Data Quality
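
For illustration, a minimal PySpark sketch of the Bronze-to-Silver promotion described above; the paths and column names are hypothetical, and the quality rules shown (de-duplication and null checks) are simplified stand-ins:

  from pyspark.sql import SparkSession, functions as F

  # Illustrative only: table paths and column names are placeholders.
  spark = SparkSession.builder.appName("media-bronze-to-silver").getOrCreate()

  bronze = spark.read.format("delta").load("/lake/bronze/media_events")

  # Quality rules: drop duplicates and rows missing mandatory keys,
  # then standardise the event timestamp before promoting to Silver.
  silver = (
      bronze.dropDuplicates(["event_id"])
            .filter(F.col("event_id").isNotNull() & F.col("event_ts").isNotNull())
            .withColumn("event_ts", F.to_timestamp("event_ts"))
  )

  silver.write.format("delta").mode("overwrite").save("/lake/silver/media_events")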

Optum (UnitedHealth Group)

Senior Software Engineer · Noida

Mar 2022 – Dec 2025
  • Architected an end-to-end CDC pipeline processing 200GB of data daily from PostgreSQL to Snowflake using Airbyte and Airflow, enabling real-time analytics for 50+ downstream teams.
  • Developed a Java Spring Boot microservice handling 100K+ Kafka messages/hour, achieving 99.9% uptime and reducing data processing costs by 30%.
  • Built ADF pipelines for healthcare data — ingesting data from multiple sources (APIs, source databases, ADLS storage) to ADLS, converting to Parquet, and transforming in Databricks (Bronze → Silver → Gold) as Delta tables.
  • Led migration from Jenkins to Argo CD, saving 15 hours/month in production deployment effort.
  • Managed 10+ domains with 90%+ code coverage, SonarQube gate 9, and zero critical vulnerabilities.
Snowflake · Airflow · Kafka · Airbyte · Databricks · ADF · Java · Argo CD
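
A trimmed Airflow DAG with the same shape as the orchestration above: trigger an Airbyte sync, then reconcile. Connection IDs are placeholders, and it assumes Airflow 2.x with the apache-airflow-providers-airbyte package installed:

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.python import PythonOperator
  from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

  with DAG(
      dag_id="postgres_to_snowflake_cdc",
      start_date=datetime(2024, 1, 1),
      schedule="@hourly",
      catchup=False,
  ) as dag:
      sync = AirbyteTriggerSyncOperator(
          task_id="trigger_airbyte_sync",
          airbyte_conn_id="airbyte_default",   # placeholder Airflow connection
          connection_id="pg-to-snowflake",     # placeholder Airbyte connection ID
          asynchronous=False,
      )

      def reconcile(**_):
          # Placeholder: compare source vs. target row counts and alert on drift.
          pass

      check = PythonOperator(task_id="reconcile_counts", python_callable=reconcile)

      sync >> check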

Bank of America

Apprentice Trainee · Chennai

Jul 2021 – Mar 2022
  • Developed an AWS ETL pipeline for trading data using Lambda, S3, Glue, and Athena.
  • Automated daily extraction with CloudWatch triggers, reducing manual work by 90%.
  • Performed data cleansing and transformation before loading into analytics layers, enabling near-real-time insights for business analysts.
AWS Lambda · S3 · Glue · Athena · CloudWatch
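
A small sketch of that serverless pattern: a Lambda handler, invoked by a daily CloudWatch (EventBridge) schedule, that starts a Glue job over the latest S3 drop. The job name and bucket prefix are hypothetical:

  import json

  import boto3

  glue = boto3.client("glue")

  def lambda_handler(event, context):
      # Invoked by a scheduled CloudWatch/EventBridge rule once per day.
      run = glue.start_job_run(
          JobName="trading-data-daily-etl",                          # hypothetical Glue job
          Arguments={"--source_prefix": "s3://trading-raw/daily/"},  # hypothetical bucket
      )
      return {"statusCode": 200, "body": json.dumps({"JobRunId": run["JobRunId"]})}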

Featured Projects

Real-world data engineering solutions with measurable impact

Featured Project

AI Resume Builder

An intelligent resume and portfolio builder powered by AI suggestions, offering multiple professionally designed templates, real-time preview, and PDF export capabilities.

Impact: Streamlines resume creation with AI-powered content suggestions and professional templates
Next.js · React · TypeScript · Tailwind CSS · AI/ML

Real-Time CDC Pipeline

End-to-end Change Data Capture pipeline processing 200GB daily from PostgreSQL to Snowflake. Uses Airbyte for ingestion and Airflow for orchestration, enabling real-time analytics.

Impact: Enables real-time analytics for 50+ downstream teams with automated data sync
Snowflake · Airbyte · Airflow · PostgreSQL · Python
Internal Project

Kafka Streaming Microservice

High-throughput Java Spring Boot microservice processing 100K+ Kafka messages per hour into Snowflake with 99.9% uptime and 30% cost reduction.

Impact: 99.9% uptime, 30% cost reduction through optimized batch sizing
Java · Spring Boot · Kafka · Snowflake · Docker
Internal Project

AWS Trading Data ETL

Serverless ETL pipeline for trading data using AWS Lambda, S3, Glue, and Athena with automated CloudWatch triggers for daily extraction.

Impact: 90% reduction in manual work through automated daily extraction
AWS Lambda · S3 · Glue · Athena · CloudWatch
Internal Project

Media Data Lake (Medallion)

Large-scale media data lakehouse built on Medallion Architecture processing high-volume datasets using PySpark with data quality checks at each layer.

Impact: Analytics-ready Gold layer for cross-domain media data analysis
PySpark · Delta Lake · Medallion Architecture · Data Quality
Internal Project

Azure Healthcare Pipeline

ADF pipeline ingesting data from multiple sources (APIs, source databases, ADLS storage) to ADLS, converting to Parquet, and transforming in Databricks with event-based triggers and SCD logic.

Impact: Near real-time Gold layer updates for healthcare data with SCD logic
Azure ADF · Databricks · ADLS Gen2 · Delta Lake · Parquet
Internal Project
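
As a sketch of the SCD-style load into Delta, a merge using the delta-spark API with made-up paths and keys. The production pipeline's SCD handling is richer; this shows only a simple upsert (Type 1) flavour:

  from delta.tables import DeltaTable
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("scd-merge").getOrCreate()

  updates = spark.read.parquet("/landing/patients_daily")       # hypothetical Parquet drop
  target = DeltaTable.forPath(spark, "/lake/silver/patients")   # hypothetical Delta table

  # Upsert keyed on patient_id: update changed rows, insert new ones.
  (target.alias("t")
      .merge(updates.alias("s"), "t.patient_id = s.patient_id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())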

System Design & Architecture

Data pipeline architectures I've designed and built

CDC Pipeline Architecture

End-to-end Change Data Capture pipeline from PostgreSQL to Snowflake with automated orchestration and reconciliation.

PostgreSQL source (tables with triggers) → Airbyte ingestion (CDC sync) → Airflow DAGs (orchestration and reconciliation) → Snowflake warehouse (SCD logic) → Serving layer (analytics for 50+ users)

Streaming Architecture

High-throughput streaming system processing 100K+ messages per hour from Kafka to Snowflake.

Application producers (events) → Kafka broker (topics and partitions) → Spring Boot consumer (batch processing) → Snowflake storage (streams) → Foundation layer (latest data, SCD)
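
The consumer stage runs as a Java Spring Boot service in production; purely to illustrate the batch-then-commit idea, here is a Python sketch using kafka-python with a hypothetical topic and a stubbed Snowflake flush:

  from kafka import KafkaConsumer  # kafka-python

  consumer = KafkaConsumer(
      "events-topic",                      # hypothetical topic
      bootstrap_servers="localhost:9092",
      group_id="snowflake-loader",
      enable_auto_commit=False,
      value_deserializer=lambda b: b.decode("utf-8"),
  )

  BATCH_SIZE = 5000  # batch sizing is the main throughput/cost lever
  batch = []

  for message in consumer:
      batch.append(message.value)
      if len(batch) >= BATCH_SIZE:
          # flush_to_snowflake(batch)  # stub: bulk-load via stage + COPY INTO
          batch.clear()
          consumer.commit()  # commit offsets only after a successful flush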

Medallion Architecture

Multi-layer data lakehouse processing high-volume datasets with quality checks at each stage.

Landing zone (raw CSV/JSON files) → Bronze (raw ingestion, schema applied) → Silver (cleansed and validated) → Gold (aggregated, analytics-ready) → Consumers (BI tools and reports)

Key Achievements

Measurable impact through engineering excellence

200GB+
Data Processed Daily
End-to-end CDC pipeline
100K+
Messages / Hour
Kafka streaming throughput
99.9%
System Uptime
Production microservice
30%
Cost Reduction
Optimized batch processing
90%
Manual Work Reduced
CloudWatch automation
90%+
Code Coverage
Across 10+ domains

Upskilling & Learning

Staying ahead in the ever-evolving data landscape

Currently Learning

dbt (Data Build Tool)

Exploring dbt for modern data transformations — building modular, testable SQL pipelines with version control and documentation as first-class citizens.

dbt Core · SQL Transformations · Data Testing · Documentation
Experimenting

Generative AI

Experimenting with LLMs and generative AI to build intelligent tools like AI-powered resume builders and automated data documentation systems.

LLMs · Prompt Engineering · AI Applications · RAG
Exploring

Intelligent Data Applications

Building data-driven applications that leverage ML models and AI for automated insights, anomaly detection, and smart data quality monitoring.

ML Pipelines · Anomaly Detection · Data Quality AI · AutoML

Get In Touch

Let's discuss data challenges and opportunities

I'm always open to discussing new opportunities, interesting data problems, or ways to collaborate on scalable data solutions.

Email
vibhorbansal1312@gmail.com
LinkedIn
Vibhor Bansal
GitHub
vibhor-bansal
Location
Bangalore, India