How to start with Machine Learning

What’s Machine Learning?

Machine Learning (“ML”) is a software development discipline that aims to enable computers, or “machines”, to learn from data, recognize patterns and make inferences on their own.

Machine Learning initiatives can be powerful when your finance, sales, marketing or operations departments need to analyze large amounts of data. This is especially true when it makes sense to analyze data points together to gain greater insights or find patterns.

Note that ML is a subset of Artificial Intelligence (“AI”), which aims at allowing computers to make decisions without human interaction.

How can you use Machine Learning for your organization?

You can help your teams interrogate and interpret real-time data, as well as predict results, by developing custom interactive dashboards.

Large data sets can be especially difficult for the human brain to understand. That’s when visualization tools come in handy. Python has a range of mature tools to create beautiful visualizations, each with its own strengths and weaknesses. Depending on your needs, you’ll want to choose the library that best fits each visualization task.

Where to start with an AI project?

Before looking at pretty visualization libraries, it’s important to go through a due diligence phase and help define the business problems. If you need help narrowing down the scope and getting everyone on the same page, consider scheduling a complimentary Vision Builder™ workshop. For more details, simply contact us.

Once you know where you’re going and what data you need to collect and analyze, you’ll have to handle the integration, cleaning and normalization of various data types. This is typically called “ETL”, for “Extract, Transform, Load”. This phase is key to avoiding “garbage in, garbage out”. Be prepared for a lot of iteration to get it right. Everyone’s data is specific to their use case and comes with its own issues. For more on this topic, watch IndyPy member Alyssa Batula’s presentation “Proper Data Handling for Machine Learning”.
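To make the three ETL steps concrete, here is a minimal sketch that uses only the Python standard library. The sample data, the column names and the cleaning rules (trim whitespace, normalize case, skip rows with missing revenue, drop duplicates) are assumptions chosen for illustration, and a plain list stands in for the real target database.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace,
# a missing value, and a duplicate row.
RAW_CSV = """customer,region,revenue
 Acme ,midwest,1200.50
acme,Midwest,1200.50
Globex,WEST,
Initech,south,980
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim, normalize case, skip incomplete rows and duplicates."""
    seen = set()
    clean = []
    for row in rows:
        customer = row["customer"].strip().title()
        region = row["region"].strip().lower()
        revenue_text = row["revenue"].strip()
        if not revenue_text:
            continue  # skip rows with no revenue rather than guessing a value
        record = (customer, region, float(revenue_text))
        if record in seen:
            continue  # same customer, region and revenue already kept
        seen.add(record)
        clean.append({"customer": customer, "region": region, "revenue": record[2]})
    return clean


def load(rows, target):
    """Load: append cleaned rows to a target (a list stands in for a database)."""
    target.extend(rows)
    return len(target)


warehouse = []
load(transform(extract(RAW_CSV)), warehouse)
```

Whether to skip, flag or fill in incomplete rows is a business decision, which is one reason this phase tends to need several iterations.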

Depending on the type of data you are collecting, and from whom, you may have to consider managing data transfer security via encryption and/or virus scanning. The combination of AWS Lambda and S3 is a great way to automate a data pipeline and perform operations such as virus checking and initial transformations on incoming data. If you need to support SFTP or another file transfer method rather than uploading to S3 directly, using the Python watchdog package to watch for new files and deposit them into S3 is also very effective. Check out the watchmedo script included with the library.
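As a rough sketch of the Lambda side of such a pipeline, the handler below reads the bucket and object key from an S3 event notification and applies a first check to each incoming file. The bucket name, the sample keys and the `is_allowed` file-type rule are hypothetical; a real pipeline would run its virus scan and initial transformations where the comment indicates.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def is_allowed(key, allowed_suffixes=(".csv", ".json")):
    """Hypothetical first check: only accept expected file types."""
    return key.lower().endswith(allowed_suffixes)


def handler(event, context):
    """Lambda entry point: split incoming objects into accepted and rejected."""
    accepted, rejected = [], []
    for bucket, key in extract_s3_objects(event):
        (accepted if is_allowed(key) else rejected).append(f"s3://{bucket}/{key}")
    # A real pipeline would scan and transform the accepted objects here.
    return {"accepted": accepted, "rejected": rejected}


# Local dry run with a hand-written event that mimics the S3 notification shape.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "incoming-data"},
                "object": {"key": "uploads/sales+report.csv"}}},
        {"s3": {"bucket": {"name": "incoming-data"},
                "object": {"key": "uploads/setup.exe"}}},
    ]
}
result = handler(sample_event, None)
```

Keeping the event parsing in its own function makes it possible to test the logic locally, without deploying anything to AWS.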

If the data is especially sensitive, you may have to set up access control via granular roles and permissions combined with a content publication workflow. A web framework such as Django or Pyramid can provide the application layer needed to make decisions on accessing the data, and assist in publishing data to the web.
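The idea of roles combined with a publication workflow can be sketched in a few lines of plain Python. This is not Django's or Pyramid's permission API; the role names, workflow states and transition rules below are all hypothetical, and a framework would normally supply this layer along with authentication.

```python
# Hypothetical roles and the workflow states each one may read.
ROLE_CAN_VIEW = {
    "analyst": {"draft", "in_review", "published"},
    "reviewer": {"in_review", "published"},
    "viewer": {"published"},
}

# Hypothetical workflow: which role may move a report between two states.
TRANSITIONS = {
    ("draft", "in_review"): "analyst",
    ("in_review", "published"): "reviewer",
    ("in_review", "draft"): "reviewer",
}


def can_view(role, state):
    """Return True if the role may read a report in the given state."""
    return state in ROLE_CAN_VIEW.get(role, set())


def advance(role, current, target):
    """Return the new state, or raise if the role may not make this transition."""
    if TRANSITIONS.get((current, target)) != role:
        raise PermissionError(f"{role} cannot move a report from {current} to {target}")
    return target


# A report goes from draft to review to published.
state = advance("analyst", "draft", "in_review")
state = advance("reviewer", state, "published")
```

With these rules, a viewer only ever sees published reports, and only a reviewer can publish one.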

Analyzing large sets of data can be quite compute-intensive. So, unless your organization has a massive in-house infrastructure, you may want to run your big data workloads via scalable managed cloud services like Amazon SageMaker and AWS Lambda. With Lambda, you can run many models in parallel and never have to worry about managing operating systems or hardware while scaling out as needed.
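The fan-out pattern behind running many models in parallel can be tried locally before moving to a managed service. In this sketch, `score_model` is a hypothetical stand-in that just averages some numbers; in a cloud setup, the `invoke` argument would instead be a function that calls the deployed model, for example through an AWS SDK client.

```python
from concurrent.futures import ThreadPoolExecutor


def score_model(job):
    """Stand-in for one remote invocation; returns a made-up score."""
    name, values = job
    return name, sum(values) / len(values)


def run_in_parallel(jobs, invoke=score_model, max_workers=4):
    """Fan out one call per job and collect the results by model name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(invoke, jobs))


# Hypothetical model names and inputs.
jobs = [
    ("churn", [0.2, 0.4, 0.6]),
    ("forecast", [10, 20, 30]),
    ("fraud", [1, 0, 0, 1]),
]
results = run_in_parallel(jobs)
```

Threads suit this case because each call would mostly wait on a network response while the heavy computation happens on the remote service.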

Eventually you’ll want to collaborate with your data scientists to translate the analysis into pretty pixels and share it with decision-makers. You can create data visualization layers for interactive and real-time dashboards using Altair or Bokeh. For extra bonus points, you can also go one step further and embed data-streaming visualizations on your website, intranet or web portal.

What tools to consider for your ML projects?

Here’s a quick rundown of our favorite tools for analytics projects:

Visualization tools:

Bokeh
Matplotlib
Altair - We have used Altair for analyzing data in Jupyter Notebooks for biomedical testing equipment.
D3.js and Rickshaw - We have used the Rickshaw JavaScript helper for the D3.js graphing library to write ad hoc graphs and visualizations for NASA.
FusionCharts - We used FusionCharts to display ad hoc histograms with a variety of data sets for the University of Virginia.

Machine Learning and Data Science tools:

Application and Web Frameworks:

Additional resources

Here are a few great tutorials to help you get started with ML:
