
Building a Big Data Pipeline With Cloud Native Tools

A statewide health care system

CHALLENGE

Big data pipelines aggregate petabytes of data daily, and without the right infrastructure, managing them quickly becomes complex and overwhelming. Numerous cloud solutions are advertised for managing big data pipelines, but there is a point at which out-of-the-box solutions can’t scale up to meet demand, and proprietary tools can be hard to maintain and nearly impossible to scale.

Health systems and big data go hand-in-hand. Every day, hospitals need to aggregate data — including patient data, financial information, employee timesheets and much, much more — to help make life-changing health and business decisions. With 16 hospitals and roughly 36,000 employees under its umbrella, a statewide health system needed a bespoke, developer-first infrastructure that could handle this much data while remaining easy to manage and update with new data sources.

Six Feet Up helped the health system rebuild its infrastructure and implement a new world of cloud native and open source tooling, including Airflow, Spark, Delta Lake and Terraform. The developer-first infrastructure:

  • simplifies workflow processes;
  • enables organizations to import existing and new data sources quickly, efficiently and at scale;
  • keeps costs down in relation to the number and volume of data sources; and
  • makes hiring and onboarding new developers — who have experience working with the open source tools — easier by eliminating the tech learning curve that existed with the proprietary tools.

In addition to making life easier for the health system's developers, the new infrastructure allows employees who would otherwise be tasked with data entry to be reallocated to roles focused on patient care and outcomes. Furthermore, with this automation and capacity, the health system will be able to sequence genomes in-house — recouping the cost of third-party fees and allowing genomic sequencing to be used more widely.

Due to the impressive, purposeful and transformative nature of the mission this technology supports, this project has been designated as one of Six Feet Up’s 10 IMPACTFUL Projects. Six Feet Up’s 10-year goal is to complete 10 IMPACTFUL Projects by 2025.


Implementation Details

The health system — which knew Python was the go-to programming language for big data — reached out to Six Feet Up to review the code used to manage its data pipeline. Six Feet Up’s seasoned developers quickly spotted broader issues with the existing cloud infrastructure.

After helping the in-house team use Python to optimize the existing pipeline infrastructure, Six Feet Up developers identified an all-new stack that would make the process of analyzing big data more reliable. Plus, the new stack would make onboarding new developers easier.

Specifically, Six Feet Up:

  • redesigned and reconstructed the health system’s cloud native big data pipeline;
  • created development tools that allow the health system’s developers to quickly modify and deploy the pipeline infrastructure with minimal steps; and
  • integrated the new pipeline with the health system’s existing tools and data sources.

Designing a new pipeline

Six Feet Up developed a proof of concept for the new design. While the health system’s data pipeline infrastructure was already in the cloud, Six Feet Up experts saw areas where best practices could be implemented.

In designing the new stack, Six Feet Up’s experts:

  • switched the main orchestrator from JAMS to Airflow;
  • moved data input processing from Informatica to Azure Data Factory;
  • implemented pipeline analytics and logging via Azure Monitor;
  • implemented Databricks for cluster management;
  • transitioned from Azure Data Lake Storage Gen1 to Gen2;
  • implemented Terraform to help streamline infrastructure-as-code; and
  • executed a fully automated continuous integration and continuous deployment (CI/CD) pipeline.

Developer experience and usability remained top-of-mind during the design process. As such, Six Feet Up recommended a solution that would allow developers to deploy the entire infrastructure without touching the console. Plus, the infrastructure-as-code (IaC) design — which defines how the architecture is configured and built — makes updates less complex for developers.

The new setup generates the infrastructure-as-code from configuration files and then applies it to the live environment. This ensures that when a change is made, all associated cloud objects — of which there are half a million — are updated accordingly up and down the pipeline. You can read more about how Six Feet Up experts managed workflows up and down the pipeline using configuration files in the blog post "Too Big for DAG Factories?"
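
To make this concrete, below is a minimal, hypothetical sketch of how Airflow DAGs can be generated from per-source configuration files; the file layout, config keys and task names are illustrative assumptions, not the health system's actual configuration.

    # Sketch: generate one ingestion DAG per YAML config file (hypothetical layout).
    from pathlib import Path

    import pendulum
    import yaml
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    CONFIG_DIR = Path(__file__).parent / "configs"  # one YAML file per data source

    for config_file in CONFIG_DIR.glob("*.yaml"):
        config = yaml.safe_load(config_file.read_text())

        dag_id = f"ingest_{config['source_name']}"
        with DAG(
            dag_id=dag_id,
            start_date=pendulum.datetime(2023, 1, 1),
            schedule=config.get("schedule", "@daily"),
            catchup=False,
        ) as dag:
            # Placeholder tasks; real tasks would extract, validate and load the source.
            extract = EmptyOperator(task_id="extract")
            load = EmptyOperator(task_id="load")
            extract >> load

        # Expose the generated DAG at module level so Airflow's parser picks it up.
        globals()[dag_id] = dag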

The IaC design also builds in an inherent safety net by allowing the code to be easily and quickly deployed and rolled back in the event of a failure. All of this means that the health system’s developers can now:

  • spin up multiple sandbox pipelines and make changes with the knowledge that they will work in a live environment;
  • deploy pipelines consistently using a reproducible, push-button process; and
  • easily add or remove data sources without having to manually update every part of the pipeline that those data sources touch.

A fully automated CI/CD pipeline that lets in-house developers build, test and deploy software on a massive system at the push of a button was critical to the developer experience. Six Feet Up experts designed and executed a fully automated infrastructure and release process that allows developers to deploy repeatably with confidence and manage any configuration drift with source control and Terraform.
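
As a rough, hypothetical illustration of what such a push-button release step can look like with Terraform (the project's actual release scripts are not shown in this case study), a small wrapper might plan and apply the infrastructure in one repeatable command:

    # Hypothetical push-button release helper: plan, then apply only if changes exist.
    import subprocess
    import sys

    def run(*args: str) -> None:
        """Run a command and abort the release if it fails."""
        print("$ " + " ".join(args))
        subprocess.run(args, check=True)

    def release(workspace: str) -> None:
        run("terraform", "init", "-input=false")
        run("terraform", "workspace", "select", workspace)
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes to apply.
        plan = subprocess.run(
            ["terraform", "plan", "-input=false", "-out=release.tfplan", "-detailed-exitcode"]
        )
        if plan.returncode == 1:
            sys.exit("terraform plan failed")
        if plan.returncode == 2:
            run("terraform", "apply", "-input=false", "release.tfplan")
        else:
            print("No infrastructure changes; nothing to apply.")

    if __name__ == "__main__":
        release(sys.argv[1] if len(sys.argv) > 1 else "sandbox")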

While other companies have used these tools, this self-deploying infrastructure is broader in scope and more sophisticated than many solutions available today.

Simulating the existing pipeline

To test the new stack, Six Feet Up’s developers needed a way to run the same data loads coming from numerous sources in a local environment, but due to patient privacy laws, much of the health system’s data was off limits.

Using synthetic data generated with Faker, the Six Feet Up team simulated the health system’s third-party data sources, targets, and everything in between to test the new stack and ensure that it would work when faced with those loads in the live environment. The team also used this data to test against a simulated version of the health system’s existing stack to compare solutions and root out integration issues.
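
A minimal sketch of this kind of synthetic test data generation with Faker is shown below; the field names, record volume and file format are illustrative assumptions, not the health system's actual schema.

    # Sketch: generate de-identified, patient-like records for load testing with Faker.
    import csv

    from faker import Faker

    fake = Faker()
    Faker.seed(42)  # reproducible test loads

    with open("simulated_admissions.csv", "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["patient_id", "name", "dob", "admitted_at", "department"]
        )
        writer.writeheader()
        for _ in range(100_000):
            writer.writerow({
                "patient_id": fake.uuid4(),
                "name": fake.name(),
                "dob": fake.date_of_birth(minimum_age=0, maximum_age=95).isoformat(),
                "admitted_at": fake.date_time_this_year().isoformat(),
                "department": fake.random_element(
                    ["cardiology", "oncology", "radiology", "emergency"]
                ),
            })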

Deploying the new pipeline and tools

With help from the experts at Astronomer, a Six Feet Up partner and the company behind Apache Airflow, the Six Feet Up team used Terraform to deploy the new orchestrator and data pipeline stack to production. Astronomer makes it easier to deploy and revise the orchestration infrastructure without being tied to legacy tooling.

Because of the testing done using the simulated data, the new stack integrated with all of the health system’s data sources, third-party platforms, and SQL data warehouse.

Due to the number of configuration files needed to manage every table, data source, and file, Six Feet Up’s experts also devised an SQL-like query language that allows the health system’s developers to quickly find the configuration files they wish to update.
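
The query language itself is an in-house tool and is not documented here, but the underlying idea can be illustrated with a small, hypothetical helper that filters a tree of configuration files by field predicates:

    # Illustrative only: find config files whose top-level fields match simple predicates.
    from pathlib import Path

    import yaml

    def find_configs(config_root: str, **predicates: str) -> list[Path]:
        """Return config files whose top-level fields match all given predicates."""
        matches = []
        for path in Path(config_root).rglob("*.yaml"):
            config = yaml.safe_load(path.read_text()) or {}
            if all(config.get(key) == value for key, value in predicates.items()):
                matches.append(path)
        return matches

    # e.g. every table config fed by a (hypothetical) incremental billing source
    for path in find_configs("configs", source="billing", load_type="incremental"):
        print(path)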

RESULTS

Since implementation, the health system’s developers have given the new stack positive reviews.

The new stack has made data pipeline management much more user-friendly for the health system’s development staff by:

  • allowing for deployment of not only the code but also the infrastructure via CI/CD;
  • reducing the amount of documentation needed to manage the system;
  • allowing for near-infinite parallelization of scheduled jobs using Delta Tables and ad-hoc Databricks clusters (a brief sketch of this pattern follows this list);
  • greatly reducing the impact and time lost to failures; and
  • allowing for infinite branching of the source code for easy development and deployment.
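
As a hedged illustration of the parallelization pattern above, the sketch below fans scheduled loads out onto short-lived Databricks job clusters from Airflow; the cluster specs, notebook paths and source names are hypothetical.

    # Sketch: one ad-hoc Databricks job cluster per source, so loads run in parallel.
    import pendulum
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

    SOURCES = ["claims", "labs", "admissions"]  # illustrative source list

    with DAG(
        dag_id="parallel_delta_loads",
        start_date=pendulum.datetime(2023, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        for source in SOURCES:
            # Each task gets its own short-lived cluster, torn down when the run finishes.
            DatabricksSubmitRunOperator(
                task_id=f"load_{source}_delta",
                databricks_conn_id="databricks_default",
                new_cluster={
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2,
                },
                notebook_task={"notebook_path": f"/pipelines/load_{source}_to_delta"},
            )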

Six Feet Up will continue to build tools that help the health system’s developers optimize and navigate the new system, making such a comprehensive data pipeline truly manageable to run and scale.
