Course features:


Learning Path

  • Overview of Databricks and its features
  • Introduction to Apache Spark and its ecosystem
  • Understanding the advantages of using Databricks for big data processing
  • Session on creating and using a Databricks Spark cluster
  • Spark architecture – driver program, cluster manager, executors
  • Spark operations – number of executors, executor memory
  • Conclusion and Summary

4 hours

  • Databricks workspace and notebooks
  • Clusters and cluster management
  • Introduction to Databricks Runtime and Spark versions
  • Understanding the Databricks File System (DBFS)
  • Conclusion and Summary

4 hours

Spark Fundamentals:

  • RDDs (Resilient Distributed Datasets) and transformations
  • Actions and lazy evaluation

Transformations (Hands-on Session)

  • More focused session on: filter, groupBy, sortBy, joins (inner, outer, cross), partitionBy, union, distinct, coalesce, repartition
  • Brief overview of: map, flatMap, mapPartitions, mapPartitionsWithIndex, flatMapValues, groupByKey, reduceByKey
  • Conclusion and Summary

4 hours

  • PySpark RDD count, min, max, sum, mean, variance, stdev
  • PySpark RDD saveAsTextFile, saveAsParquetFile
  • reduce, collect, keys, values, aggregate, first, take, foreach, top

Hands-on Session: Basic Word Count Application

  • Relating the application to how map-reduce works in Spark
  • Applying Spark RDDs to problems: basic word count, log file manipulation and statistics, entity resolution
  • Conclusion and Summary, Interim Test 1
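
Sample for the hands-on session: a minimal plain-Python sketch (not Spark code) of the map, shuffle, and reduce steps behind a word count. The input lines and all variable names are assumptions made for illustration. In the session itself the same steps would typically be expressed with RDD operations such as flatMap, map, and reduceByKey.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample input; in Spark this would be an RDD of text lines.
lines = [
    "spark makes big data simple",
    "big data needs spark",
]

# Map phase (comparable to flatMap + map): split lines, emit (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.lower().split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce phase (comparable to reduceByKey): add up the values for each word.
word_counts = {
    word: reduce(lambda a, b: a + b, ones) for word, ones in grouped.items()
}

# Most frequent words first; ties are broken alphabetically.
top_words = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top_words)
```

The grouping step is written out explicitly here only to make the shuffle visible; Spark performs it internally when reduceByKey is called.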

4 hours

Data Manipulation and Processing:

  • Data ingestion from various sources (CSV, JSON, Parquet, etc.)
  • Storing and loading data in various formats – CSV, Avro, JSON, ORC, Parquet
  • Data cleaning, filtering, and transformation
  • Joins, aggregations, and window functions (Self Join, Recursive Join)
  • Aggregate window functions – avg, count, max, min, sum
  • Ranking window functions – cume_dist, dense_rank, ntile, percent_rank, rank
  • Value window functions – lead, lag, first_value, last_value
  • Handling missing data and outliers
  • Spark SQL DataFrames and table creation
  • Querying data with Spark SQL through the SparkSession (available or created as spark)
  • Some example operations and queries
  • Creating UDFs for transformations
  • Joining the tables
  • Conclusion and Summary

4 hours

  • Introduction to Streaming
  • Architecture
  • Benefits
  • How does it work?
  • Handling streaming data with Databricks: introduction to Structured Streaming
  • Architecture
  • How does it work?
  • Using DStreams and Structured Streaming
  • Input Sources
  • Sinks
  • Structured Streaming Operations
  • Windowing on the Streams
  • Spark Streaming Versus Structured Streaming
  • Conclusion and Summary, Interim Test 2

8 hours

  • Retrieve Delta table history
  • History schema
  • Operation metrics keys
  • Retrieve Delta table details
  • Detail schema
  • Generate a manifest file
  • Convert a Parquet table to a Delta table
  • Convert a Delta table to a Parquet table
  • Shallow clone a Delta table
  • Remove files no longer referenced by a Delta table
  • Conclusion and Summary
  • Performance tuning and optimization techniques
  • Caching and persistence
  • Broadcast variables and accumulators
  • Working with large datasets and out-of-memory data processing
  • PySpark coding best-practice guidelines
  • Data Engineer Associate Certificate Guideline
  • Participant Project Review
  • Conclusion and Summary
  • Final Test

4 hours