PwC LLP
Senior Data Engineer
New York, NY
Dec 2023 – Present
Project: Designed and implemented a cloud-based data engineering solution to ingest, process, and analyze large-scale transportation and operational data, delivering reliable analytics and decision-making support for business stakeholders.
  • Designed, built, and maintained scalable data pipelines on Azure Databricks using PySpark and SQL for transportation and mobility analytics, handling large structured and semi-structured datasets.
  • Implemented Azure Data Factory (ADF) to orchestrate end-to-end ETL workflows, integrating on-premises and cloud-based sources into ADLS Gen2 for centralized processing.
  • Designed and implemented a Lakehouse architecture using Azure Databricks and ADLS Gen2 with Delta Lake — enabling ACID transactions, schema enforcement, and incremental data processing.
  • Integrated Databricks-processed datasets with Google BigQuery for downstream analytics; designed curated tables with partitioning and clustering for cost-efficient BI workloads.
  • Modeled analytical datasets supporting transportation KPIs: vehicle utilization, driver activity, payment reconciliation, and operational compliance for Tableau and Power BI consumption.
  • Leveraged ELK Stack (Elasticsearch, Logstash, Kibana) to track pipeline execution metrics, ingestion status, and processing anomalies for observability.
  • Developed Python-based utilities and automation scripts to streamline ingestion, validation, and monitoring — reducing manual effort and improving pipeline reliability.
Environment
Azure Databricks, Azure Data Factory, ADLS Gen2, Apache Spark, PySpark, Python, Google BigQuery, Tableau, Power BI, Elasticsearch, Kibana, Azure Monitor
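The Python-based validation utilities mentioned above can be illustrated with a minimal, pure-Python sketch — not the actual production code, and all field names (e.g. `trip_id`, `fare`) are hypothetical:

```python
"""Illustrative sketch of a batch validation utility (names hypothetical)."""
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    passed: bool
    issues: list = field(default_factory=list)


def validate_batch(rows, required_fields, non_negative_fields=()):
    """Check each row for missing required fields and negative metric values.

    Returns a ValidationResult whose `issues` list can be logged to an
    observability stack before a batch is promoted downstream.
    """
    issues = []
    for i, row in enumerate(rows):
        for f in required_fields:
            if row.get(f) in (None, ""):
                issues.append(f"row {i}: missing '{f}'")
        for f in non_negative_fields:
            value = row.get(f)
            if isinstance(value, (int, float)) and value < 0:
                issues.append(f"row {i}: negative '{f}' = {value}")
    return ValidationResult(passed=not issues, issues=issues)
```

Running such a check before writing a Delta table is one way a pipeline can fail fast on bad batches instead of surfacing errors in BI dashboards.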
Northwell Health
Data Engineer
New Hyde Park, NY
Jan 2022 – Nov 2023
Project: Built and supported a GCP-based enterprise data platform enabling governed analytics on claims and pharmacy data through scalable BigQuery pipelines, orchestration, and compliance-driven data models.
  • Designed and maintained ELT pipelines using BigQuery to transform large volumes of structured healthcare data into partitioned, clustered reporting tables.
  • Contributed to a GCP-based healthcare data platform supporting ingestion and normalization of HL7- and FHIR-aligned clinical and claims datasets for downstream analytics.
  • Orchestrated daily ingestion workflows using Cloud Composer (Apache Airflow), enabling dependency tracking, SLA monitoring, and automated retries.
  • Built Dataflow pipelines (Python SDK) to parse and enrich semi-structured JSON data prior to BigQuery loading for downstream analytics.
  • Implemented schema evolution workflows in BigQuery using dbt and automated validation checks to ensure downstream models remained stable during structure changes.
  • Used Pub/Sub to trigger ingestion events for near-real-time feeds from external pharmacy systems into BigQuery staging tables.
  • Collaborated with compliance and InfoSec teams to implement encryption-at-rest, audit logging, and DLP scanning via Cloud DLP API for sensitive fields.
Environment
GCP, BigQuery, Cloud Composer, Dataflow, Pub/Sub, Cloud Functions, dbt, Python, HIPAA, Cloud DLP
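The schema-evolution validation described above can be sketched as a simple compatibility check — a pure-Python illustration, not the real dbt tooling, with hypothetical column names:

```python
"""Sketch of a schema-compatibility check for evolving tables (names hypothetical)."""


def check_schema_compatibility(old_schema, new_schema):
    """Compare column -> type mappings between two schema versions.

    Additive changes (new columns) are treated as safe, mirroring relaxed
    schema-update behavior; removed columns and type changes would break
    downstream models and are reported as breaking.
    """
    breaking = []
    for col, col_type in old_schema.items():
        if col not in new_schema:
            breaking.append(f"column removed: {col}")
        elif new_schema[col] != col_type:
            breaking.append(f"type changed: {col} {col_type} -> {new_schema[col]}")
    added = [c for c in new_schema if c not in old_schema]
    return {"breaking": breaking, "added": added}
```

Gating a deployment on an empty `breaking` list keeps downstream reporting models stable while still allowing additive evolution.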
Bank of America
Data Engineer
New York, NY
Apr 2019 – Nov 2021
Project: Developed scalable batch and streaming data pipelines on hybrid cloud platforms to process high-volume banking data for compliance, risk, and enterprise analytics.
  • Designed and implemented scalable pipelines to process high-volume financial transaction, customer, and reference data across hybrid cloud and on-prem platforms.
  • Developed and optimized batch ETL workflows using Apache Spark, Hive, and SQL to support regulatory reporting, risk analytics, and enterprise data consumption.
  • Worked with Databricks on AWS to develop and optimize Spark-based ETL pipelines on Amazon S3; built and maintained PySpark jobs for compliance and risk analytics.
  • Integrated data from relational (SQL Server, Oracle) and NoSQL (MongoDB, Cassandra) systems into centralized analytical layers for downstream consumption.
  • Developed real-time and near-real-time ingestion pipelines using Apache Kafka to process transaction and event-driven data with low latency.
  • Implemented infrastructure-as-code using Terraform to provision and manage cloud resources; integrated into CI/CD pipelines via Git and Jenkins.
  • Designed and optimized partitioned Hive tables and HDFS storage layouts to improve query performance and reduce processing latency.
Environment
Hadoop, HDFS, Hive, Apache Spark, Apache Kafka, Apache Airflow, AWS S3, AWS Glue, Redshift, Terraform, Python, Jenkins
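The partitioned Hive/HDFS layout mentioned above follows the standard `key=value` directory convention; a minimal sketch (record fields hypothetical) of how records map to partition paths:

```python
"""Sketch of Hive-style partition layout for transaction data (names hypothetical)."""
from collections import defaultdict


def partition_key(record):
    """Hive-style partition directory, e.g. dt=2021-06-01/region=NY."""
    return f"dt={record['txn_date']}/region={record['region']}"


def bucket_by_partition(records):
    """Group records under their partition paths, mimicking how a writer
    lays out files so that queries filtering on dt/region can prune whole
    partitions instead of scanning the full table."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[partition_key(rec)].append(rec)
    return dict(buckets)
```

Choosing partition columns that match the dominant query filters (date, region) is what turns this layout into a latency win.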
Humana
Data Engineer
Louisville, KY
May 2017 – Mar 2019
Project: Developed scalable Azure data engineering solutions to ingest, process, and curate healthcare datasets for downstream analytics, reporting, and operational insights.
  • Designed and developed scalable data pipelines on Azure Cloud, integrating and processing data from multiple sources.
  • Implemented and optimized Azure Data Lake solutions for storing and managing large volumes of structured and unstructured data.
  • Built ETL workflows using Azure Data Factory (ADF) to automate data extraction, transformation, and loading across multiple sources.
  • Developed data processing and analytics solutions using Databricks and PySpark, improving performance and scalability.
  • Integrated streaming and batch data processing capabilities to support real-time and historical data analysis.
  • Participated in cloud cost optimization strategies, reducing storage and compute costs while maintaining performance.
Environment
Azure Cloud, Azure Data Lake, Databricks, PySpark, Azure Data Factory, SQL Server, PostgreSQL, Python, Azure DevOps
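The automated extraction described above typically follows a watermark-driven incremental pattern (the approach ADF delta loads use); a pure-Python sketch with hypothetical column names:

```python
"""Sketch of watermark-driven incremental loading (column names hypothetical)."""


def incremental_slice(rows, watermark):
    """Return rows modified after the last successful load, plus the new
    high-watermark to persist for the next run.

    ISO-8601 date strings compare correctly lexicographically, so plain
    string comparison is sufficient here.
    """
    fresh = [r for r in rows if r["modified_at"] > watermark]
    new_watermark = max((r["modified_at"] for r in fresh), default=watermark)
    return fresh, new_watermark
```

Persisting `new_watermark` only after a successful load keeps the pattern idempotent: a failed run simply re-reads the same slice.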
Education
Bachelor of Computer Science
Brooklyn College, City University of New York
New York, NY