The July 2022 edition of IndyPy featured an introduction to synthetic data. IndyPy is Indiana’s largest Python meetup and was founded in 2007 by Six Feet Up CTO and Amazon Web Services (AWS) Community Hero Calvin Hendryx-Parker. Mason Egger, lead developer advocate at Gretel, walked through use cases for synthetic data and some of its major benefits for software developers and data scientists alike.
According to Egger’s presentation, 35% of a data scientist’s time is spent gathering data, and a number of hurdles stand in the way of an ideal data set for machine learning models or software applications. For example, access to some data can be limited by data privacy regulations, and data sets that are too small or too skewed can result in models that don’t reflect reality.
Egger says synthetic data gets around these hurdles by producing an extended data set that is based on real data. Although the resulting records are not real and contain no personally identifiable information, they are still representative of a real set.
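To make the idea concrete, here is a minimal sketch in plain Python of what "based on real data but not real" can mean. It does not use or imitate the Gretel library; the `fit_column` and `synthesize` helpers and the tiny `real` data set are hypothetical and exist only for illustration. The sketch summarizes each column of a small data set (mean and standard deviation for numbers, value frequencies for categories) and then samples brand-new rows from those summaries.

```python
import random
import statistics


def fit_column(values):
    """Summarize one column: mean/stdev for numbers, frequencies for categories."""
    if all(isinstance(v, (int, float)) for v in values):
        return ("numeric", statistics.mean(values), statistics.stdev(values))
    categories = sorted(set(values))
    weights = [values.count(c) for c in categories]
    return ("categorical", categories, weights)


def synthesize(records, n, seed=0):
    """Sample n new rows from per-column summaries of the given records."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    columns = {key: fit_column([r[key] for r in records]) for key in records[0]}
    synthetic = []
    for _ in range(n):
        row = {}
        for key, model in columns.items():
            if model[0] == "numeric":
                _, mean, stdev = model
                row[key] = round(rng.gauss(mean, stdev), 1)
            else:
                _, categories, weights = model
                row[key] = rng.choices(categories, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Hypothetical "real" data set, made up for this example.
real = [
    {"age": 34, "plan": "basic"},
    {"age": 45, "plan": "pro"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "pro"},
    {"age": 38, "plan": "basic"},
]

# Extend five real rows into ten synthetic ones.
synthetic = synthesize(real, n=10)
for row in synthetic:
    print(row)
```

Because every column is sampled independently, this toy version loses any relationship between columns (for example, between age and plan). Preserving those relationships is the hard part that dedicated synthetic data tools are built to address.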
Did you miss the presentation? Watch the recording and explore tidbits via @IndyPy’s live Twitter thread.
Best 49 open source synthetic data projects: https://www.opensourceagenda.com/tags/synthetic-data
Gretel Synthetics documentation: https://synthetics.docs.gretel.ai/en/stable/
Gretel synthetic data generators: https://github.com/gretelai/gretel-synthetics
Notebooks containing fun things you can do with Gretel synthetic data: https://github.com/gretelai/fun-with-synthetic-data