Lessons: 6
Duration: 3 days
Language: English
Objectives:
- Learn about Apache Spark and the Spark 3.0 architecture
- Build and interact with Spark DataFrames using Spark SQL
- Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively
- Read, transform, and understand data and use it to train machine learning models
- Build machine learning models with MLlib and ML
- Learn how to submit your applications programmatically using spark-submit
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and pipelines
- Utilities: linear algebra, statistics, data handling, etc.
Course features:
- Practical hands-on training
- Lab sessions
- Training by experienced faculty
Pre-requisites:
- Basic Python data structures
- Basic knowledge of Pandas DataFrames and SQL
- Knowledge of common data storage formats such as JSON, delimiter-separated files, and HDFS
- Entry-level machine learning
Software Pre-requisites:
- Apache Spark (Downloadable from http://spark.apache.org/downloads.html)
- A Python distribution containing IPython, Pandas, and Scikit-learn, plus PySpark
Target Audience:
- Anyone interested in Machine Learning.
- Any intermediate-level professional who knows the basics of machine learning, including classical algorithms such as linear or logistic regression, but wants to learn more and explore the different fields of Machine Learning.
- Any professional who is not yet comfortable with coding but is interested in Machine Learning and wants to apply it easily to datasets.
- Any data analyst who wants to level up in Machine Learning.
- Any professional who is not satisfied with their job and wants to become a Data Scientist.
- Any professional who wants to create added value for their business by using powerful Machine Learning tools.
Learning Path
- Day 1
- What is Apache Spark
- Spark Jobs and APIs
- Spark Architecture
- Installation and Configuration
- Internal workings of an RDD
- Creating RDDs
- Global versus local scope
- Transformations
- Actions
- Hands-on Session on RDDs and Spark
- Assignment 1
- Best Practices 1
- Day 2
- Python to RDD communications
- Catalyst Optimizer refresh
- Speeding up PySpark with DataFrames
- Creating DataFrames
- Simple DataFrame queries
- Interoperating with RDDs
- Querying with the DataFrame API
- Hands-on Session on Pandas DataFrames and PySpark
- Assignment 2
- Checking for duplicates, missing observations, and outliers
- Getting familiar with your data
- Visualization
- Hands-on Session on Data Modeling
- Assignment 3
- Day 3
- Overview of the MLlib package
- Loading and transforming the data
- Getting to know your data
- Creating the final dataset
- Predicting infant survival
- Hands-on Session using PySpark MLlib
- Assignment 4
- Overview of the ML package
- Predicting the chances of infant survival with ML
- Parameter hyper-tuning
- Other features of PySpark ML in action
- Implementation of ML Algorithms
- Random Forest
- Regression
- K-means
- Conclusion and Summary