Lessons: 10
Duration: 40 hours
Language: English
Course features:
- Practical hands-on exercises
- Lab sessions
- Training by experienced faculty
Software Prerequisites:
- Databricks Community Edition (cloud-based; requires an internet connection)
- https://docs.databricks.com/en/getting-started/community-edition.html
- https://community.cloud.databricks.com/login.html
Learning Path
- Overview of Databricks and its features
- Introduction to Apache Spark and its ecosystem
- Understanding the advantages of using Databricks for big data processing
- Session on creating and using a Databricks Spark cluster
- Spark architecture – driver program, cluster manager, executors
- Spark resource settings – number of executors, executor memory
- Conclusion and Summary
4 hours
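To tie the architecture terms to configuration, below is a minimal sketch (assuming a self-managed Spark setup; on Databricks the same settings are made in the cluster configuration) of how executor count and memory map to SparkSession config keys. The application name and values are illustrative only.

```python
from pyspark.sql import SparkSession

# Illustrative only: on Databricks the cluster manager applies these settings
# from the cluster configuration; in a self-managed deployment they can be
# passed when the driver program creates the SparkSession.
spark = (
    SparkSession.builder
    .appName("executor-config-demo")          # hypothetical application name
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .getOrCreate()
)

print(spark.version)  # confirm the session is up and see the Spark version
```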
- Databricks workspace and notebooks
- Clusters and cluster management
- Introduction to Databricks Runtime and Spark versions
- Understanding the Databricks File System (DBFS)
- Conclusion and Summary
4 hours
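A short sketch of working with the Databricks File System follows, assuming it runs inside a Databricks notebook where the built-in `dbutils` and `spark` objects are predefined; the paths are hypothetical.

```python
# Runs inside a Databricks notebook, where `dbutils` and `spark` are predefined.
dbutils.fs.mkdirs("/tmp/demo")                            # create a DBFS directory
dbutils.fs.put("/tmp/demo/hello.txt", "hello DBFS\n", overwrite=True)

display(dbutils.fs.ls("/tmp/demo"))                       # list files in the directory

df = spark.read.text("dbfs:/tmp/demo/hello.txt")          # read the file back via Spark
df.show()
```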
Spark Fundamentals:
- RDDs (Resilient Distributed Datasets) and transformations
- Actions and lazy evaluation
Transformations (Hands-on Session)
- Focused session on: filter, groupBy, sortBy, joins (inner, outer, cross),
  partitionBy, union, distinct, coalesce, repartition
- Brief overview of: map, flatMap, mapPartitions, mapPartitionsWithIndex,
  flatMapValues, groupByKey, reduceByKey
- Conclusion and Summary
4 hours
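A minimal sketch of a few of these RDD transformations, assuming a local session and made-up sample data; on Databricks the `spark` session and `sc` context already exist in the notebook.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-transformations").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([3, 1, 2, 4, 5, 4])
evens = nums.filter(lambda x: x % 2 == 0)            # transformation (lazy)
distinct_sorted = evens.distinct().sortBy(lambda x: x)

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
lookup = sc.parallelize([("a", "x"), ("c", "y")])
joined = pairs.join(lookup)                          # inner join on the key
summed = pairs.reduceByKey(lambda a, b: a + b)       # combine values per key

# Actions trigger evaluation of the lazy transformation chain
print(distinct_sorted.collect())                     # [2, 4]
print(joined.collect())                              # [('a', (1, 'x')), ('a', (3, 'x'))]
print(summed.collect())                              # [('a', 4), ('b', 2)]
print(nums.repartition(4).coalesce(2).getNumPartitions())  # 2
```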
- PySpark RDD count, min, max, sum, mean, variance, stdev
- PySpark RDD saveAsTextFile, saveAsParquetFile
- reduce, collect, keys, values, aggregate, first, take, foreach, top
Hands-on Session: Basic Word Count Application
- Correlating with Spark's map-reduce style of execution
- Applying Spark RDDs to problems: basic word count, log file manipulation and
  statistics, entity resolution
- Conclusion and Summary, Interim Test 1
4 hours
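A minimal word count sketch, assuming a local session and a hypothetical input file; on Databricks the paths would normally point at DBFS (e.g. dbfs:/...).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("sample.txt")                    # hypothetical input file
counts = (
    lines.flatMap(lambda line: line.split())         # "map" phase: emit individual words
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a + b)            # "reduce" phase: sum counts per word
)

print(counts.top(10, key=lambda kv: kv[1]))          # ten most frequent words
counts.saveAsTextFile("wordcount_output")            # hypothetical output directory
```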
Data Manipulation and Processing:
- Data ingestion from various sources (CSV, JSON, Parquet, etc.)
- Store and load the data using various formats – csv, avro, json, orc, parquet
- Data cleaning, filtering, and transformation
- Joins, aggregations, and window functions (Self Join, Recursive Join)
- Aggregate window functions – avg, count, max, min, sum
- Ranking window functions – cume_dist, dense_rank, ntile, percent_rank, rank, row_number
- Value window functions – lead, lag, first_value, last_value
- Handling missing data and outliers
- Spark SQL data frames and table creation
- Querying data with Spark SQL using the SparkSession available (or created) as spark
- Some example operations and queries
- Creating UDFs for transformations
- Joining the tables
- Conclusion and Summary
4 hours
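Below is a minimal sketch combining several of these ideas: CSV ingestion, handling missing data, a ranking window function, a UDF, a Spark SQL query over a temp view, and a Parquet write. The file path and column names are hypothetical; on Databricks the SparkSession is already available as `spark`.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

sales = spark.read.csv("sales.csv", header=True, inferSchema=True)  # hypothetical file
sales = sales.na.drop(subset=["amount"])                            # drop rows with missing amounts

# Ranking window function: rank each sale within its region by amount
w = Window.partitionBy("region").orderBy(F.desc("amount"))
ranked = sales.withColumn("rank_in_region", F.rank().over(w))

# A simple UDF applied as a column transformation
@F.udf(returnType=StringType())
def normalize_region(r):
    return r.strip().upper() if r is not None else None

ranked = ranked.withColumn("region", normalize_region("region"))

# Register a temp view and query it with Spark SQL
ranked.createOrReplaceTempView("ranked_sales")
spark.sql("SELECT region, COUNT(*) AS n FROM ranked_sales GROUP BY region").show()

# Store the result in Parquet format (hypothetical output path)
ranked.write.mode("overwrite").parquet("ranked_sales.parquet")
```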
- Introduction to Streaming
- Architecture
- Benefits
- How does it work?
- Handling streaming data with Databricks
Introduction to Structured Streaming
- Architecture
- How does it work?
- Using DStreams and Structured Streaming
- Input Sources
- Sinks
- Structured Streaming Operations
- Windowing on the Streams
- Spark Streaming Versus Structured Streaming
- Conclusion and Summary, Interim Test 2
8 hours
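A minimal Structured Streaming sketch follows, using the built-in rate source and a console sink so it runs without any external system; real pipelines would typically read from files, Kafka, or Auto Loader on Databricks.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Input source: the built-in "rate" source generates (timestamp, value) rows
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Windowing on the stream: count events per 10-second window
windowed = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Sink: write the running aggregate to the console for demonstration
query = (
    windowed.writeStream
            .outputMode("complete")     # emit the full aggregate table each trigger
            .format("console")
            .start()
)
query.awaitTermination(30)              # let it run for about 30 seconds
query.stop()
```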
- Retrieve Delta table history
- History schema
- Operation metrics keys
- Retrieve Delta table details
- Detail schema
- Generate a manifest file
- Convert a Parquet table to a Delta table
- Convert a Delta table to a Parquet table
- Shallow clone a Delta table
- Remove files no longer referenced by a Delta table
- Conclusion and Summary
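A minimal Delta Lake sketch covering a few of the operations above, with a hypothetical table path. On Databricks, Delta and the `spark` session are available out of the box; elsewhere the delta-spark package and its Spark configuration must be set up first.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "/tmp/delta/events"                                 # hypothetical table location
spark.range(0, 100).write.format("delta").mode("overwrite").save(path)

delta_table = DeltaTable.forPath(spark, path)

# Retrieve Delta table history and details
delta_table.history().show(truncate=False)
delta_table.detail().show(truncate=False)

# Convert an existing Parquet directory into a Delta table (hypothetical path)
# DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/events`")

# Remove files no longer referenced by the table (VACUUM, default 7-day retention)
delta_table.vacuum(retentionHours=168)
```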
- Performance tuning and optimization techniques
- Caching and persistence
- Broadcast variables and accumulators
- Working with large datasets and out-of-memory data processing
- PySpark coding best practices and guidelines
- Data Engineer Associate certification guidelines
- Participant Project Review
- Conclusion and Summary
- Final Test
4 hours
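To illustrate the tuning topics in this final module, here is a minimal sketch of caching a reused DataFrame, broadcasting a small lookup table into a join, and using an accumulator; the data and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

events = spark.range(0, 1_000_000).withColumn("country_id", F.col("id") % 5)
countries = spark.createDataFrame(
    [(0, "IN"), (1, "US"), (2, "UK"), (3, "DE"), (4, "JP")],
    ["country_id", "country_code"],
)

# Caching: persist a DataFrame that several actions reuse, avoiding recomputation
events.cache()
print(events.count())

# Broadcast join: ship the small dimension table to every executor, avoiding a shuffle
joined = events.join(F.broadcast(countries), "country_id")
joined.groupBy("country_code").count().show()

# Accumulator: count rows with a missing country code (none in this sample data)
missing = spark.sparkContext.accumulator(0)

def check(row):
    global missing
    if row.country_code is None:
        missing += 1

joined.rdd.foreach(check)
print("rows with missing country code:", missing.value)
```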