Lessons: 10
Duration: 40 hours
Language: English
Course features:
- Practical hands-on exercises
- Lab sessions
- Training by experienced faculty
Software Prerequisites:
- Databricks Community Edition (cloud-based; requires an internet connection)
- https://docs.databricks.com/en/getting-started/community-edition.html
- https://community.cloud.databricks.com/login.html
Learning Path
- Overview of Databricks and its features
- Introduction to Apache Spark and its ecosystem
- Understanding the advantages of using Databricks for big data processing
- Session on creating and using a Databricks Spark cluster
- Spark architecture – driver program, cluster manager, executors
- Spark resource settings – number of executors, executor memory
- Conclusion and Summary
4 hours
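To tie the architecture terms to configuration, below is a minimal sketch (assuming a self-managed Spark setup; on Databricks the same settings are made in the cluster configuration) of how executor count and memory map to SparkSession config keys. The application name and values are illustrative only.

```python
from pyspark.sql import SparkSession

# Illustrative only: on Databricks the cluster manager applies these settings
# from the cluster configuration; in a self-managed deployment they can be
# passed when the driver program creates the SparkSession.
spark = (
    SparkSession.builder
    .appName("executor-config-demo")          # hypothetical application name
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .getOrCreate()
)

print(spark.version)  # confirm the session is up and see the Spark version
```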
- Databricks workspace and notebooks
- Clusters and cluster management
- Introduction to Databricks Runtime and Spark versions
- Understanding the Databricks File System (DBFS)
- Conclusion and Summary
4 hours
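A short sketch of working with the Databricks File System follows, assuming it runs inside a Databricks notebook where the built-in `dbutils` and `spark` objects are predefined; the paths are hypothetical.

```python
# Runs inside a Databricks notebook, where `dbutils` and `spark` are predefined.
dbutils.fs.mkdirs("/tmp/demo")                            # create a DBFS directory
dbutils.fs.put("/tmp/demo/hello.txt", "hello DBFS\n", overwrite=True)

display(dbutils.fs.ls("/tmp/demo"))                       # list files in the directory

df = spark.read.text("dbfs:/tmp/demo/hello.txt")          # read the file back via Spark
df.show()
```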
Spark Fundamentals:
- RDDs (Resilient Distributed Datasets) and transformations
- Actions and lazy evaluation
Transformations (Hands-on Session)
- Focused session on: filter, groupBy, sortBy, joins (inner, outer, cross),
  partitionBy, union, distinct, coalesce, repartition
- Brief overview of: map, flatMap, mapPartitions, mapPartitionsWithIndex,
  flatMapValues, groupByKey, reduceByKey
- Conclusion and Summary
4 hours
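A minimal sketch of a few of these RDD transformations, assuming a local session and made-up sample data; on Databricks the `spark` session and `sc` context already exist in the notebook.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-transformations").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([3, 1, 2, 4, 5, 4])
evens = nums.filter(lambda x: x % 2 == 0)            # transformation (lazy)
distinct_sorted = evens.distinct().sortBy(lambda x: x)

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
lookup = sc.parallelize([("a", "x"), ("c", "y")])
joined = pairs.join(lookup)                          # inner join on the key
summed = pairs.reduceByKey(lambda a, b: a + b)       # combine values per key

# Actions trigger evaluation of the lazy transformation chain
print(distinct_sorted.collect())                     # [2, 4]
print(joined.collect())                              # [('a', (1, 'x')), ('a', (3, 'x'))]
print(summed.collect())                              # [('a', 4), ('b', 2)]
print(nums.repartition(4).coalesce(2).getNumPartitions())  # 2
```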
- PySpark RDD count, min, max, sum, mean, variance, stdev
- PySpark RDD saveAsTextFile, saveAsParquetFile
- reduce, collect, keys, values, aggregate, first, take, foreach, top
Hands-on Session: Basic Word Count Application
- Correlating with Spark's map-reduce style of execution
- Applying Spark RDDs to problems: basic word count, log file manipulation and
  statistics, entity resolution
- Conclusion and Summary, Interim Test 1
4 hours
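A minimal word count sketch, assuming a local session and a hypothetical input file; on Databricks the paths would normally point at DBFS (e.g. dbfs:/...).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("sample.txt")                    # hypothetical input file
counts = (
    lines.flatMap(lambda line: line.split())         # "map" phase: emit individual words
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a + b)            # "reduce" phase: sum counts per word
)

print(counts.top(10, key=lambda kv: kv[1]))          # ten most frequent words
counts.saveAsTextFile("wordcount_output")            # hypothetical output directory
```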
Data Manipulation and Processing:
- Data ingestion from various sources (CSV, JSON, Parquet, etc.)
- Store and load the data using various formats – csv, avro, json, orc, parquet
- Data cleaning, filtering, and transformation
- Joins, aggregations, and window functions (Self Join, Recursive Join)
- Aggregate window functions – avg, count, max, min, sum
- Ranking window functions – cume_dist, dense_rank, ntile, percent_rank, rank, row_number
- Value window functions – lead, lag, first_value, last_value
- Handling missing data and outliers
- Spark SQL data frames and table creation
- Querying data with Spark SQL using the SparkSession available (or created) as spark
- Some example operations and queries
- Creating UDFs for transformations
- Joining the tables
- Conclusion and Summary
4 hours
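Below is a minimal sketch combining several of these ideas: CSV ingestion, handling missing data, a ranking window function, a UDF, a Spark SQL query over a temp view, and a Parquet write. The file path and column names are hypothetical; on Databricks the SparkSession is already available as `spark`.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

sales = spark.read.csv("sales.csv", header=True, inferSchema=True)  # hypothetical file
sales = sales.na.drop(subset=["amount"])                            # drop rows with missing amounts

# Ranking window function: rank each sale within its region by amount
w = Window.partitionBy("region").orderBy(F.desc("amount"))
ranked = sales.withColumn("rank_in_region", F.rank().over(w))

# A simple UDF applied as a column transformation
@F.udf(returnType=StringType())
def normalize_region(r):
    return r.strip().upper() if r is not None else None

ranked = ranked.withColumn("region", normalize_region("region"))

# Register a temp view and query it with Spark SQL
ranked.createOrReplaceTempView("ranked_sales")
spark.sql("SELECT region, COUNT(*) AS n FROM ranked_sales GROUP BY region").show()

# Store the result in Parquet format (hypothetical output path)
ranked.write.mode("overwrite").parquet("ranked_sales.parquet")
```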
- Introduction to Streaming
- Architecture
- Benefits
- How does it work?
- Handling streaming data with Databricks
Introduction to Structured Streaming
- Architecture
- How does it work?
- Using DStreams and Structured Streaming
- Input Sources
- Sinks
- Structured Streaming Operations
- Windowing on the Streams
- Spark Streaming Versus Structured Streaming
- Conclusion and Summary, Interim Test 2
8 hours
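A minimal Structured Streaming sketch follows, using the built-in rate source and a console sink so it runs without any external system; real pipelines would typically read from files, Kafka, or Auto Loader on Databricks.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Input source: the built-in "rate" source generates (timestamp, value) rows
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Windowing on the stream: count events per 10-second window
windowed = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Sink: write the running aggregate to the console for demonstration
query = (
    windowed.writeStream
            .outputMode("complete")     # emit the full aggregate table each trigger
            .format("console")
            .start()
)
query.awaitTermination(30)              # let it run for about 30 seconds
query.stop()
```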
- Retrieve Delta table history
- History schema
- Operation metrics keys
- Retrieve Delta table details
- Detail schema
- Generate a manifest file
- Convert a Parquet table to a Delta table
- Convert a Delta table to a Parquet table
- Shallow clone a Delta table
- Remove files no longer referenced by a Delta table
- Conclusion and Summary
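A minimal Delta Lake sketch covering a few of the operations above, with a hypothetical table path. On Databricks, Delta and the `spark` session are available out of the box; elsewhere the delta-spark package and its Spark configuration must be set up first.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "/tmp/delta/events"                                 # hypothetical table location
spark.range(0, 100).write.format("delta").mode("overwrite").save(path)

delta_table = DeltaTable.forPath(spark, path)

# Retrieve Delta table history and details
delta_table.history().show(truncate=False)
delta_table.detail().show(truncate=False)

# Convert an existing Parquet directory into a Delta table (hypothetical path)
# DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/events`")

# Remove files no longer referenced by the table (VACUUM, default 7-day retention)
delta_table.vacuum(retentionHours=168)
```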
- Performance tuning and optimization techniques
- Caching and persistence
- Broadcast variables and accumulators
- Working with large datasets and out-of-memory data processing
- PySpark coding best practices and guidelines
- Data Engineer Associate certification guidelines
- Participant Project Review
- Conclusion and Summary
- Final Test
4 hours
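To illustrate the tuning topics in this final module, here is a minimal sketch of caching a reused DataFrame, broadcasting a small lookup table into a join, and using an accumulator; the data and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

events = spark.range(0, 1_000_000).withColumn("country_id", F.col("id") % 5)
countries = spark.createDataFrame(
    [(0, "IN"), (1, "US"), (2, "UK"), (3, "DE"), (4, "JP")],
    ["country_id", "country_code"],
)

# Caching: persist a DataFrame that several actions reuse, avoiding recomputation
events.cache()
print(events.count())

# Broadcast join: ship the small dimension table to every executor, avoiding a shuffle
joined = events.join(F.broadcast(countries), "country_id")
joined.groupBy("country_code").count().show()

# Accumulator: count rows with a missing country code (none in this sample data)
missing = spark.sparkContext.accumulator(0)

def check(row):
    global missing
    if row.country_code is None:
        missing += 1

joined.rdd.foreach(check)
print("rows with missing country code:", missing.value)
```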