There are many tools companies can use to store, transfer, and transform data; however, beyond a certain point, scalability becomes an issue. The result is a big data architecture that can’t keep up with the flow of data and burns valuable hours on maintenance.
On Episode 142 of The Real Python Podcast, Calvin Hendryx-Parker, Six Feet Up’s CTO and AWS Community Hero, discusses a recent project that used Apache Airflow, along with a host of other open source and cloud native tools, to make a statewide health system’s big data architecture faster and more manageable. Calvin also touches on the upcoming 2023 Python Web Conference, which is scheduled for March 13-17.
"We've historically been a Python shop since our inception 23 years ago. Our most recent demand we've seen is around big data pipelines," Calvin says. “If you need to get a massive amount of data into a single spot, that's where these big data pipelines come into play.”
Every day, hospitals need to aggregate petabytes of data to help make life-changing health and business decisions. To make it easier for the health system to handle the flow of data and manage the system that oversees it, Six Feet Up developers devised a clever technique that lets the pipeline scale up quickly to meet demand. You can read about that technique in the blog post “Too Big for DAG Factories?”
"Now a data engineer doesn't have to be a full blown super senior python developer to be able to import a new dataset into the data warehouse,” Calvin says.
Check out “Building a Big Data Pipeline with Cloud Native Tools” for more on the project.
Listen to Episode 142: “Orchestrating Large and Small Projects With Apache Airflow”
The Real Python Podcast is a weekly podcast hosted by Christopher Bailey featuring interviews, coding tips, and conversations with guests from the Python community.