Lessons: 6
Duration: 3 days
Language: English
Objectives:
- Learn about Apache Spark and the Spark 3.0 architecture
- Build and interact with Spark DataFrames using Spark SQL
- Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively
- Read, transform, and understand data and use it to train machine learning models
- Build machine learning models with MLlib and ML
- Learn how to submit your applications programmatically using spark-submit
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and pipelines
- Utilities: linear algebra, statistics, data handling, etc.
Course features:
- Practical hands-on training
- Lab sessions
- Training by experienced faculty
Pre-requisites:
- Basic Python data structures
- Basic knowledge of Pandas DataFrames and SQL
- Knowledge of common data storage formats such as JSON, delimiter-separated files, and HDFS
- Entry-level machine learning
Software Pre-requisites:
- Apache Spark (Downloadable from http://spark.apache.org/downloads.html)
- A Python distribution containing IPython, Pandas, and Scikit-learn, plus PySpark
Target Audience:
- Anyone interested in Machine Learning.
- Any intermediate-level professional who knows the basics of machine learning, including classical algorithms such as linear or logistic regression, but wants to learn more and explore the different fields of Machine Learning.
- Any professional who is not yet comfortable with coding but is interested in Machine Learning and wants to apply it easily to datasets.
- Any data analyst who wants to level up in Machine Learning.
- Any professional who is not satisfied with their job and wants to become a Data Scientist.
- Any professional who wants to create added value for their business by using powerful Machine Learning tools.
Learning Path
- Day 1
- What is Apache Spark
- Spark Jobs and APIs
- Spark Architecture
- Installation and Configuration
- Internal workings of an RDD
- Creating RDDs
- Global versus local scope
- Transformations
- Actions
- Hands-on Session on RDDs and Spark
- Assignment 1
- Best Practices 1
- Day 2
- Python to RDD communications
- Catalyst Optimizer refresh
- Speeding up PySpark with DataFrames
- Creating DataFrames
- Simple DataFrame queries
- Interoperating with RDDs
- Querying with the DataFrame API
- Hands-on Session on Pandas DataFrames and PySpark
- Assignment 2
- Checking for duplicates, missing observations, and outliers
- Getting familiar with your data
- Visualization
- Hands-on Session on Data Modeling
- Assignment 3
- Day 3
- Overview of the MLlib package
- Loading and transforming the data
- Getting to know your data
- Creating the final dataset
- Predicting infant survival
- Hands-on Session using PySpark MLlib
- Assignment 4
- Overview of the ML package
- Predicting the chances of infant survival with ML
- Parameter hyper-tuning
- Other features of PySpark ML in action
- Implementation of ML Algorithms
- Random Forest
- Regression
- K-means
- Conclusion and Summary