Lessons: 8
Duration: 5 days
Language: English
Objectives:
- Learn about Apache Spark and the Spark 3.0 architecture
- Learn Big Data concepts
- Understand the Hadoop ecosystem
- Apply Python to data science tasks
- Build and interact with Spark DataFrames using Spark SQL
Course Features:
- Practical hands-on exercises
- Lab sessions
- Training by experienced faculty
Pre-requisites:
- Big Data and Hadoop
- Basic Python data structures
- Basic knowledge of pandas DataFrames and SQL
- Entry-level Data Science
Software Pre-requisites:
- Apache Spark (Downloadable from http://spark.apache.org/downloads.html)
- A Python distribution containing IPython, pandas, and scikit-learn
- PySpark
- Anaconda Python 3.x (Downloadable from www.anaconda.com)
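
Before Day 1, a quick sanity check along the following lines can confirm the software prerequisites are in place (assuming the packages were installed through Anaconda or pip; PySpark is also installable with `pip install pyspark`):

```python
# check_env.py -- verify the course's software prerequisites are importable.
import sys

import pandas
import sklearn
import pyspark

print("Python :", sys.version.split()[0])
print("pandas :", pandas.__version__)
print("sklearn:", sklearn.__version__)
print("PySpark:", pyspark.__version__)
```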
Learning Path
- Day 1
- Introduction to Big Data
- Hadoop Architecture
- Mapper and Reducer
- What is Apache Spark?
- Spark Jobs and APIs
- Spark 3.0 architecture
- Using Anaconda and Jupyter Notebook
- Installation and Configuration
- Python Introduction
- Python Objects
- Complex
- Boolean
- Python Data Structures
- list
- list methods
- tuple
- string
- string methods
- dictionary
- dictionary methods with examples
- Control Structures
- Functions
- global variables
- Variable arguments: *args, **kwargs
- Built-in Functions (see the sketch after this list)
- range
- lambda
- filter
- map
- reduce
- set
- zip
- Conclusion and Summary
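
As a taste of the built-in functions listed above for Day 1, here is a minimal sketch; the numbers and names are invented purely for illustration:

```python
# Minimal sketch of the Day 1 built-ins: range, lambda, filter, map, reduce, set, zip.
from functools import reduce  # reduce lives in functools in Python 3

nums = list(range(1, 7))                            # [1, 2, 3, 4, 5, 6]

evens = list(filter(lambda n: n % 2 == 0, nums))    # [2, 4, 6]
squares = list(map(lambda n: n * n, nums))          # [1, 4, 9, 16, 25, 36]
total = reduce(lambda a, b: a + b, nums)            # 21
unique = set([1, 2, 2, 3])                          # {1, 2, 3}

names = ["spark", "hadoop"]
pairs = dict(zip(names, range(len(names))))         # {'spark': 0, 'hadoop': 1}

print(evens, squares, total, unique, pairs)
```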
- Day 2
- File Handling
- Exception Handling
- List Comprehension
- Dictionary Comprehension
- Modules
- User-defined Modules
- Built in Modules
- os (including os.system)
- sys
- glob
- Classes (see the sketch after this list)
- Methods
- Inheritance
- Case Study
- Iterator
- Generator
- Regular Expressions (re)
- File Handling and Exception Handling
- Conclusion and Summary
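
A minimal sketch of the Day 2 class, inheritance, and generator topics; all class names and values are invented for illustration:

```python
# Sketch of Day 2 topics: classes, methods, inheritance, and a generator.

class Vehicle:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a vehicle"


class Car(Vehicle):                       # inheritance: Car extends Vehicle
    def describe(self):                   # method overriding
        return f"{self.name} is a car"


def countdown(n):
    """Generator: lazily yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1


print(Car("Model S").describe())          # Model S is a car
print(list(countdown(3)))                 # [3, 2, 1]
```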
- Day 3
- Hands-on Session
- Array Manipulation
- Matrix Manipulation
- pandas
- Hands-on Session
- Series
- DataFrames (see the sketch after this list)
- Case Study
- Data Visualisation
- Matplotlib
- Case Study
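
A minimal sketch of the Day 3 pandas and Matplotlib topics; the figures below are invented for illustration:

```python
# Sketch of Day 3 topics: a pandas Series, a DataFrame, and a Matplotlib plot.
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])                                # label-based access -> 20

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "sales": [120, 95, 140],                 # invented sample data
})
print(df.describe())                         # summary statistics
print(df[df["sales"] > 100])                 # boolean-mask filtering

df.plot(x="month", y="sales", kind="bar")    # pandas delegates to Matplotlib
plt.title("Monthly sales (illustrative data)")
plt.show()
```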
- Day 4
- Internal workings of an RDD
- Creating RDDs
- Global versus local scope
- Transformation Functions
- Action Functions
- Hands-on Session on RDDs and Spark (see the sketch after this list)
- Assignments 1
- Best Practices 1
- Project Discussion using PySpark
- Conclusion and Summary
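
A minimal sketch of the Day 4 RDD workflow, assuming a local PySpark installation; the data is invented for illustration:

```python
# Sketch of Day 4 topics: creating an RDD, transformations, and actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")   # local mode, all cores

rdd = sc.parallelize(range(1, 11))            # create an RDD from a Python range

evens = rdd.filter(lambda n: n % 2 == 0)      # transformation: lazy, returns a new RDD
squares = evens.map(lambda n: n * n)          # transformation: still nothing computed

print(squares.collect())                      # action: runs the job -> [4, 16, 36, 64, 100]
print(squares.reduce(lambda a, b: a + b))     # action: 220

sc.stop()
```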
- Day 5
- Python to RDD communications
- Catalyst Optimizer refresh
- Speeding up PySpark with DataFrames
- Creating DataFrames
- Simple DataFrame queries
- Interoperating with RDDs
- Querying with the DataFrame API
- Hands-on Session on pandas DataFrames and PySpark (see the sketch after this list)
- Assignments 2
- Checking for duplicates, missing observations, and outliers
- Assignments 3
- Conclusion and Summary
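
A minimal sketch of the Day 5 DataFrame topics: creating a DataFrame, querying it through the API and Spark SQL, and checking for duplicates and missing values. It assumes a local PySpark installation, and the rows are invented for illustration:

```python
# Sketch of Day 5 topics: Spark DataFrames, Spark SQL, and basic data checks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", None), (2, "bob", None)],  # invented rows
    ["id", "name", "age"],
)

df.filter(df.age > 30).show()                  # DataFrame API query

df.createOrReplaceTempView("people")           # register the DataFrame for SQL
spark.sql("SELECT name, age FROM people WHERE age IS NOT NULL").show()

print("duplicate rows:", df.count() - df.dropDuplicates().count())
print("missing ages  :", df.filter(df.age.isNull()).count())

spark.stop()
```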
- Group Project Presentation