Join us for this one-day special event when we discuss best Python practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization.
Sparkflow: Utilizing Pyspark for Training Tensorflow Models on Large Datasets
By: Derek Miller, LifeOmic
30 mins, Intermediate
As more large public datasets become available, distributed data processing tools such as Apache Spark are vital for data scientists. While SparkML provides many machine learning algorithms, standard pipelines, and a basic linear algebra library, it does not support training deep learning models. With the rise of Tensorflow over the last two years, LifeOmic built the Sparkflow library to combine the power of Spark's Pipeline API with training deep learning models in Tensorflow. Sparkflow uses the Hogwild algorithm to train deep learning models in a distributed manner, leveraging Spark's driver/executor architecture under the hood to manage copied networks and gradients. In this session, we describe some of the lessons learned in building Sparkflow, the pros and cons of asynchronous distributed deep learning, how to use Spark Pipelines with Tensorflow in very few lines of code, and where the library is headed in the near future.
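For a sense of how little code is involved, here is a minimal sketch of wrapping a Tensorflow graph in a Spark ML Pipeline with Sparkflow. It follows the library's documented Pipeline-style usage, but the MNIST-style CSV, column names, and hyperparameters are illustrative assumptions, and exact class and parameter names (such as build_graph and SparkAsyncDL) may differ between releases.

```python
# Illustrative Sparkflow usage: a TF 1.x-style graph trained inside a Spark ML Pipeline.
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, OneHotEncoder
from pyspark.ml.pipeline import Pipeline
from sparkflow.graph_utils import build_graph
from sparkflow.tensorflow_async import SparkAsyncDL

def small_model():
    # Plain Tensorflow graph: two dense layers over 784 pixel features.
    x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
    y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
    hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)
    out = tf.layers.dense(hidden, 10)
    tf.argmax(out, 1, name='out')
    return tf.losses.softmax_cross_entropy(y, out)

spark = SparkSession.builder.appName("sparkflow-sketch").getOrCreate()
df = spark.read.option("inferSchema", "true").csv("mnist_train.csv")  # assumed MNIST-style CSV

# Standard Spark ML stages feed the serialized Tensorflow graph.
assembler = VectorAssembler(inputCols=df.columns[1:785], outputCol="features")
encoder = OneHotEncoder(inputCol="_c0", outputCol="labels", dropLast=False)
dl_stage = SparkAsyncDL(inputCol="features", tensorflowGraph=build_graph(small_model),
                        tfInput="x:0", tfLabel="y:0", tfOutput="out:0",
                        tfLearningRate=0.001, iters=20,
                        predictionCol="predicted", labelCol="labels", verbose=1)

fitted_pipeline = Pipeline(stages=[assembler, encoder, dl_stage]).fit(df)
```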
Policy on a Page: Operational Workflow for Ad-Hoc Analyses
By: Aaron Burgess, State of Indiana Family & Social Services Administration, Division of Data & Analytics
30-45 mins, Intermediate
The majority of requests to FSSA Data & Analytics are ad-hoc analyses, and these analyses had suffered from two major issues. One was the inferred belief that stakeholders wanted data points, when what they really wanted was a statement of fact that could be cited or applied. The other was an over-simplified workflow for ad-hoc requests that featured no version control and "wild west" peer review.
The solution to managing the workflow for ad-hoc analyses was to immediately implement Git and (due to existing licenses) Bitbucket policies and procedures. This included a standard ad-hoc repo template and repeated training on best practices, such as opening a pull request immediately upon branch creation and peer review responsibilities. JIRA was already in place for task management. In addition, Bamboo was used to introduce continuous integration, so that ad-hoc requests are run automatically on every data refresh, with change detection scripts serving as an early warning system. Finally, using Jupyter Notebooks and their extensions to deliver well-groomed HTML exports of deliverables became standard practice. These deliverables focus on clearly defining an objective, methodology, results, and a "statement of fact" for use by stakeholders.
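As one concrete piece of that workflow, the sketch below shows how a CI plan might re-execute a deliverable notebook, export it to HTML with nbconvert, and use a hash of the refreshed extract as a simple change-detection early warning. The file names and the hash-based check are illustrative assumptions, not FSSA's actual scripts.

```python
# Re-run an ad-hoc deliverable notebook, export it to HTML, and flag data changes.
import hashlib
from pathlib import Path

import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ExecutePreprocessor

NOTEBOOK = Path("adhoc_request_1234.ipynb")   # hypothetical deliverable notebook
DATA_EXTRACT = Path("refresh/extract.csv")    # hypothetical refreshed data extract
HASH_FILE = Path("refresh/extract.sha256")    # last known hash of the extract

def extract_changed() -> bool:
    """Early warning: did the source extract change since the last refresh?"""
    new_hash = hashlib.sha256(DATA_EXTRACT.read_bytes()).hexdigest()
    old_hash = HASH_FILE.read_text().strip() if HASH_FILE.exists() else None
    HASH_FILE.write_text(new_hash)
    return new_hash != old_hash

def export_deliverable() -> None:
    """Re-execute the notebook and write a stakeholder-friendly HTML export."""
    nb = nbformat.read(str(NOTEBOOK), as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})
    body, _resources = HTMLExporter().from_notebook_node(nb)
    NOTEBOOK.with_suffix(".html").write_text(body, encoding="utf-8")

if __name__ == "__main__":
    if extract_changed():
        print("Extract changed since the last run -- review results before publishing.")
    export_deliverable()
```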
The implemented changes have resulted in clearer expectations for Data & Analytics team members. The standard Jupyter Notebook HTML exports have been well received by stakeholders and have greatly reduced the number of "data heavy, information light" deliverables. The resulting trust from stakeholders has increased our request load and opened up opportunities to work on more complex modeling.
Data Visualization with Bokeh
By: James Alexander, Leaf Software Solutions
30 mins, Beginner
Learn how to create interactive charts and graphs without writing any JavaScript. We'll use Python to generate simple interactive graphs and plots within Jupyter notebooks and embed them in a running Django site. I'll show examples of streaming data to a Bokeh instance and of interactively exploring a large dataset using Datashader.
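As a taste of the kind of plot the talk covers, here is a minimal sketch of an interactive Bokeh figure rendered inline in a Jupyter notebook; the data is invented, argument names vary slightly across Bokeh versions, and Django embedding and Datashader require extra setup not shown here.

```python
# Minimal interactive Bokeh plot, rendered inline in a Jupyter notebook.
import numpy as np
from bokeh.io import output_notebook
from bokeh.plotting import figure, show

output_notebook()  # render inline in the notebook instead of writing an HTML file

x = np.linspace(0, 4 * np.pi, 200)
p = figure(title="Interactive sine wave",
           tools="pan,wheel_zoom,box_zoom,reset,hover",
           width=600, height=300)
p.line(x, np.sin(x), line_width=2, legend_label="sin(x)")
p.circle(x[::10], np.sin(x[::10]), size=6, legend_label="samples")

show(p)  # pan, zoom, and hover all work without hand-written JavaScript
```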
Brief Intro to Natural Language Processing (NLP)
By: Andrew (AJ) Rader, DMC Insurance, Inc
45 mins, Beginner
Natural Language Processing (NLP) is a broad domain that deals with analyzing and understanding human text and words. Typical areas of application for NLP include text classification, speech recognition, machine translation, chatbots, and caption generation. Fundamentally, NLP involves converting words into numbers and doing math on those numbers in order to identify relationships between words and the documents they live in. The goal of this talk is to present the basic theory of what NLP is and to demonstrate how to use machine learning approaches in Python to extract insights from text. An example text classification problem is presented, illustrating the steps required to ingest and preprocess an example text corpus and to build and test a model on it.
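To make the "words into numbers" idea concrete, here is a small sketch of one common approach of the kind the talk describes: TF-IDF features feeding a linear classifier in scikit-learn. The toy corpus and labels are invented for illustration, and a real corpus would of course be far larger.

```python
# Toy text classification: convert words to numbers (TF-IDF), then fit a linear model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented corpus: short claim descriptions labeled as 'auto' or 'property'.
texts = [
    "rear-ended at a stop light, bumper damage",
    "hail damage to roof shingles after storm",
    "windshield cracked by debris on the highway",
    "basement flooded when the sump pump failed",
    "fender bender in the parking garage",
    "kitchen fire caused smoke damage to walls",
]
labels = ["auto", "property", "auto", "property", "auto", "property"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),  # words -> numbers
    ("clf", LogisticRegression(max_iter=1000)),                            # math on the numbers
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```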
Building IoT Data Pipelines with Python
By: Logan Wendholt, Bastian Solutions
30-40 mins, Beginner
So you've learned about the data analytics capabilities of Python, and now you're ready to start churning through data -- great! But do you know how to turn your snippet of code into a system capable of taking in streams of raw sensor data and spitting out insights? This presentation will lay out the basic components of a Python-based data pipeline built for Internet-of-Things (IoT) applications, and will highlight some of the common challenges associated with putting together an efficient data analytics and storage system.
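To ground the idea, here is a simplified sketch of one common shape for such a pipeline: ingest raw JSON sensor payloads, parse them, derive a rolling statistic, and hand records to a storage layer. An in-memory queue and a simulated sensor stand in for a real broker such as MQTT or Kafka, and the payload fields are invented placeholders.

```python
# Skeleton of an IoT data pipeline: ingest -> parse -> analyze -> store.
import json
import queue
import random
import threading
import time
from collections import deque
from statistics import mean

raw_messages = queue.Queue()   # stand-in for a broker subscription (MQTT, Kafka, ...)
window = deque(maxlen=10)      # rolling window of recent readings

def sensor_simulator():
    """Pretend sensor: publishes a JSON temperature reading every 0.1 s."""
    for _ in range(50):
        payload = json.dumps({"sensor": "temp-01", "value": 20 + random.random() * 5})
        raw_messages.put(payload)
        time.sleep(0.1)
    raw_messages.put(None)     # sentinel: end of stream

def store(record):
    """Stand-in for the storage layer (time-series DB, object store, ...)."""
    print("storing", record)

def pipeline():
    """Consume raw payloads, derive a rolling mean, and hand records to storage."""
    while True:
        payload = raw_messages.get()
        if payload is None:
            break
        reading = json.loads(payload)
        window.append(reading["value"])
        reading["rolling_mean"] = round(mean(window), 2)
        store(reading)

threading.Thread(target=sensor_simulator, daemon=True).start()
pipeline()
```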
SQL Server 2017 Support for Machine Learning in Python
By:
30-45 mins, Intermediate
SQL Server is Microsoft’s flagship relational database product. A database is the best platform for storing data – even big data – and Structured Query Language (SQL) is unparalleled in terms of supporting data access and manipulation. Python is one of the de facto languages of choice for machine learning today, having many libraries facilitating all aspects, from data frames (Pandas) to neural nets (Keras or Tensorflow). Machine learning is crucial for deduction and prediction from vast datasets and underpins everything from web search to self-driving cars.
So far, relational databases and machine learning have been at arm’s length, even though both are intimately tied to data. This division is no more. SQL Server 2017 is the first version to support Python natively, so database scripts can contain Python and SQL side-by-side.
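As a hedged illustration of what "side-by-side" can look like, the snippet below calls SQL Server's sp_execute_external_script from Python via pyodbc, passing a small embedded Python script that uses a Pandas data frame to derive a new column. The connection string, table, and column names are placeholders, and the server is assumed to have external scripts enabled.

```python
# Run an embedded Python script inside SQL Server 2017 via sp_execute_external_script.
import pyodbc

CONN_STR = (  # placeholder connection details
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=SalesDB;Trusted_Connection=yes;"
)

TSQL = """
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
df = InputDataSet                      # rows from @input_data_1 arrive as a DataFrame
df["revenue"] = df["units"] * df["unit_price"]
OutputDataSet = df[["product_id", "revenue"]]
',
    @input_data_1 = N'SELECT product_id, units, unit_price FROM dbo.Sales'
WITH RESULT SETS ((product_id INT, revenue FLOAT));
"""

with pyodbc.connect(CONN_STR) as conn:
    for product_id, revenue in conn.cursor().execute(TSQL).fetchall():
        print(product_id, revenue)
```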
This talk gives examples and methods of how this exciting merging of technology can apply to real-world data.
Simplifying OAuth2.0 Authentication in Big Data Ecosystems
By:
30-45 mins, Beginner
OAuth2.0 has become a favorite option for authenticating system-to-system interactions, in place of the classic system ID/password combination handled by authentication servers. As popular as the ID/password combination might be, the old authentication process poses a number of limitations: stateful cookie-caching of the authentication, incompatibility with mobile clients, a tightly coupled app/auth server architecture, and incongruence with third-party identity providers. OAuth2.0 addresses these issues but still has a reputation for being complex and cumbersome to implement.
This presentation will demonstrate a simple means of implementing system-to-system OAuth2.0 authentication with Python and its native libraries. You will see a call that obtains a token from the identity provider, Okta, and uses it to authenticate to a Flask app - all made possible with minimal, straightforward code. In addition to the live coding demo, the presenters will walk through how this can be used in an open-source big data ecosystem.
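For a flavor of how little code this can take, here is a sketch of the client side using only the standard library: a client-credentials token request to an Okta authorization server, followed by a call to a Flask-protected endpoint with the bearer token. The Okta domain, client credentials, scope, and API URL are placeholders, and the token endpoint path assumes Okta's default authorization server.

```python
# Client-credentials flow: fetch a token from Okta, then call a protected Flask API.
import base64
import json
import urllib.parse
import urllib.request

OKTA_TOKEN_URL = "https://your-org.okta.com/oauth2/default/v1/token"  # placeholder domain
CLIENT_ID = "your-client-id"                                          # placeholder
CLIENT_SECRET = "your-client-secret"                                  # placeholder
API_URL = "http://localhost:5000/api/reports"                         # hypothetical Flask endpoint

# 1. Exchange client credentials for an access token (system-to-system authentication).
body = urllib.parse.urlencode({"grant_type": "client_credentials",
                               "scope": "analytics.read"}).encode()
basic_auth = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()
token_request = urllib.request.Request(
    OKTA_TOKEN_URL, data=body,
    headers={"Authorization": f"Basic {basic_auth}",
             "Content-Type": "application/x-www-form-urlencoded"})
with urllib.request.urlopen(token_request, timeout=10) as resp:
    access_token = json.load(resp)["access_token"]

# 2. Present the bearer token to the Flask app, which validates it against Okta.
api_request = urllib.request.Request(
    API_URL, headers={"Authorization": f"Bearer {access_token}"})
with urllib.request.urlopen(api_request, timeout=10) as resp:
    print(resp.status, resp.read().decode())
```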