Lessons: 19
Duration: 5 days
Language: English

Learning Path

Lesson 1: Introduction to the Databricks Platform
  • Overview of Databricks Unified Analytics Platform
  • Setting up Databricks workspace and clusters
  • Creating and managing notebooks

Lesson 2: Apache Spark Fundamentals
  • Overview of Apache Spark architecture
  • Working with Resilient Distributed Datasets (RDDs)
  • Introduction to Spark SQL and DataFrames (sketched below)
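
A minimal PySpark sketch contrasting the RDD and DataFrame APIs covered above. On Databricks, a SparkSession named spark is provided in every notebook; the builder call matters only when running elsewhere.

    from pyspark.sql import SparkSession

    # On Databricks, `spark` already exists; this line is for local runs.
    spark = SparkSession.builder.appName("spark-fundamentals").getOrCreate()

    # Low-level RDD API: explicit functional transformations.
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
    print(rdd.map(lambda x: x * x).filter(lambda x: x > 4).collect())  # [9, 16, 25]

    # DataFrame API: declarative, optimized by the Catalyst engine.
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
    df.groupBy("label").count().show()

    # Spark SQL over the same data via a temporary view.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT label, COUNT(*) AS n FROM events GROUP BY label").show()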

Lesson 3: Data Ingestion and File Formats
  • Ingesting data from various sources (e.g., S3, Azure Blob Storage, relational databases)
  • Exploring data formats (CSV, JSON, Parquet, Avro)
  • Hands-on exercises: ingesting sample datasets into Databricks (sketched below)
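
A sketch of the reads covered in this lesson. All paths, bucket names, and JDBC connection details are placeholders, not real endpoints; Avro support ships with Databricks Runtime but needs the spark-avro package on plain open-source Spark.

    # CSV with header and schema inference (explicit schemas are preferable
    # in production, since inference triggers an extra pass over the data).
    csv_df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("s3://example-bucket/raw/customers.csv"))

    json_df = spark.read.json("s3://example-bucket/raw/events.json")
    parquet_df = spark.read.parquet("s3://example-bucket/curated/orders")
    avro_df = spark.read.format("avro").load("s3://example-bucket/raw/clicks")

    # JDBC read from a relational database (all connection details are
    # placeholders; use secrets management rather than literal passwords).
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://db-host:5432/shop")
               .option("dbtable", "public.orders")
               .option("user", "reader")
               .option("password", "<from-secret-scope>")
               .load())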

Lesson 4: Exploratory Data Analysis and Data Preparation
  • Exploratory data analysis (EDA) using Databricks notebooks
  • Data cleaning and preprocessing techniques
  • Hands-on exercises: performing data transformation tasks using the Spark DataFrame API (sketched below)
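
A small sketch of typical EDA and cleaning steps with the DataFrame API; df and its columns (id, country, amount) are hypothetical.

    from pyspark.sql import functions as F

    # Quick EDA: schema, summary statistics, and null counts per column.
    df.printSchema()
    df.describe().show()
    df.select([F.sum(F.col(c).isNull().cast("int")).alias(c)
               for c in df.columns]).show()

    # Common cleaning steps: dedupe, fill nulls, fix types, drop bad rows.
    cleaned = (df.dropDuplicates(["id"])
                 .na.fill({"country": "unknown"})
                 .withColumn("amount", F.col("amount").cast("double"))
                 .filter(F.col("amount") >= 0))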

Lesson 5: Advanced Spark SQL
  • Working with complex data types (arrays, structs, maps)
  • User-defined functions (UDFs) in Spark SQL
  • Window functions and analytical queries (sketched below)
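
A sketch of the three topics above on toy data: exploding an array, reading a struct field, a Python UDF, and a per-customer ranking window.

    from pyspark.sql import functions as F, Window
    from pyspark.sql.types import StringType

    # Arrays and structs: explode an array column, read a nested field.
    orders = spark.createDataFrame(
        [(1, ["new", "gift"], ("Pune", "IN"))],
        "id INT, tags ARRAY<STRING>, address STRUCT<city: STRING, country: STRING>")
    orders.select("id", F.explode("tags").alias("tag"), "address.city").show()

    # A Python UDF; prefer built-in functions when one exists, since UDFs
    # bypass Catalyst optimizations and serialize rows out to Python.
    shout = F.udf(lambda s: s.upper() if s else None, StringType())
    orders.select(shout(F.col("address.city")).alias("city_uc")).show()

    # Window function: rank each customer's purchases by amount.
    sales = spark.createDataFrame(
        [("c1", 100.0), ("c1", 250.0), ("c2", 80.0)], ["customer_id", "amount"])
    w = Window.partitionBy("customer_id").orderBy(F.desc("amount"))
    sales.withColumn("rank", F.row_number().over(w)).show()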

Lesson 6: Spark Performance Optimization
  • Understanding data partitioning and its impact on performance
  • Techniques for optimizing Spark jobs (e.g., caching, broadcast joins)
  • Hands-on exercises: optimizing data pipelines in Databricks (sketched below)
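
A sketch of the tuning levers above; big_df and small_dim_df are hypothetical DataFrames standing in for a large fact table and a small dimension table.

    from pyspark.sql import functions as F

    # Partition by the join/aggregation key so related rows are co-located;
    # the partition count (200 here) is workload-dependent.
    events = big_df.repartition(200, "customer_id")

    # Cache a DataFrame that several downstream actions reuse.
    events.cache()
    events.count()  # the first action materializes the cache

    # Broadcast join: replicate the small table to every executor and
    # skip shuffling the large side entirely.
    joined = events.join(F.broadcast(small_dim_df), "customer_id")
    joined.explain()  # plan should show a BroadcastHashJoin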

Lesson 7: Delta Lake
  • Overview of Delta Lake and its features
  • ACID transactions and data versioning
  • Hands-on exercises: working with Delta tables in Databricks (sketched below)
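
A Delta Lake sketch assuming a Databricks runtime (or open-source Spark with the delta-spark package); the paths and updates_df are illustrative.

    from delta.tables import DeltaTable

    # Write a Delta table; every commit is an ACID transaction.
    df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

    # Upsert with MERGE through the DeltaTable API.
    tgt = DeltaTable.forPath(spark, "/tmp/delta/orders")
    (tgt.alias("t")
        .merge(updates_df.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Versioning: audit the commit history and time-travel to version 0.
    tgt.history().show()
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0).load("/tmp/delta/orders"))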

Lesson 8: Job Orchestration and Scheduling
  • Creating and scheduling jobs in Databricks
  • Monitoring job performance and execution history
  • Hands-on exercises: scheduling data pipeline jobs in Databricks (sketched below)
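
Jobs can be created in the workspace UI; this sketch drives the Databricks Jobs API 2.1 instead. The host, token, notebook path, cluster ID, and job name are all placeholders.

    import requests

    host = "https://<your-workspace>.cloud.databricks.com"
    token = "<personal-access-token>"  # prefer a secret store in practice

    job_spec = {
        "name": "nightly-ingest",
        "tasks": [{
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "existing_cluster_id": "<cluster-id>",
        }],
        # Quartz cron: run at 02:00 UTC every day.
        "schedule": {"quartz_cron_expression": "0 0 2 * * ?",
                     "timezone_id": "UTC"},
    }

    resp = requests.post(f"{host}/api/2.1/jobs/create",
                         headers={"Authorization": f"Bearer {token}"},
                         json=job_spec)
    print(resp.json())  # returns the new job_id on success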

Lesson 9: Structured Streaming
  • Overview of Apache Spark Structured Streaming
  • Working with streaming DataFrames
  • Hands-on exercises: building streaming pipelines in Databricks (sketched below)
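
A minimal Structured Streaming sketch; the input, checkpoint, and output paths are placeholders. The watermark bounds state kept for late-arriving events.

    from pyspark.sql import functions as F

    stream = (spark.readStream
              .schema("user STRING, amount DOUBLE, ts TIMESTAMP")
              .json("/tmp/incoming/"))

    # Windowed aggregation; the watermark lets Spark discard state for
    # events arriving more than 10 minutes late.
    agg = (stream.withWatermark("ts", "10 minutes")
           .groupBy(F.window("ts", "5 minutes"), "user")
           .agg(F.sum("amount").alias("total")))

    # Append mode emits each window once its watermark has passed; the
    # checkpoint lets the query restart exactly where it left off.
    query = (agg.writeStream
             .outputMode("append")
             .format("delta")
             .option("checkpointLocation", "/tmp/chk/user_totals")
             .start("/tmp/delta/user_totals"))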

Lesson 10: Experiment Tracking with MLflow
  • Overview of MLflow and its components
  • Tracking experiments, packaging, and deploying models
  • Hands-on exercises: managing machine learning pipelines with MLflow (sketched below)
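
An end-to-end MLflow tracking sketch using scikit-learn (preinstalled on Databricks ML runtimes); the dataset, run name, and parameter values are just for illustration.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    with mlflow.start_run(run_name="rf-baseline"):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_tr, y_tr)
        rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
        mlflow.log_param("n_estimators", 100)     # experiment tracking
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, "model")  # packaged for deployment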

Lesson 11: Data Pipeline Design Patterns
  • Design patterns for building scalable and reliable data pipelines
  • Error handling and fault tolerance strategies (sketched below)
  • Optimization techniques for improving pipeline performance
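
One common pairing of the ideas above, sketched with illustrative names (daily_df, the path, and the date are hypothetical): retry transient failures with backoff, and keep the write idempotent, here via Delta's replaceWhere, so a retry cannot duplicate output.

    import time

    def with_retries(fn, attempts=3, base_delay=5):
        """Run fn(), retrying on failure with exponential backoff."""
        for i in range(attempts):
            try:
                return fn()
            except Exception:
                if i == attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** i)

    def write_partition(df, run_date):
        # Overwriting one date partition is idempotent: rerunning the same
        # day replaces its output instead of appending duplicates.
        (df.write.format("delta")
           .mode("overwrite")
           .option("replaceWhere", f"event_date = '{run_date}'")
           .save("/tmp/delta/events"))

    with_retries(lambda: write_partition(daily_df, "2024-01-15"))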

Lesson 12: Case Studies and Best Practices
  • Case studies and examples of data engineering projects
  • Best practices for handling common data engineering challenges
  • Q&A and open discussion on real-world scenarios

Lesson 13: Machine Learning Pipelines with MLlib
  • Overview of Spark's machine learning packages (the RDD-based spark.mllib and DataFrame-based spark.ml)
  • Building end-to-end machine learning pipelines
  • Hands-on exercises: building and deploying ML pipelines in Databricks (sketched below)
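
An end-to-end MLlib pipeline sketch on a toy DataFrame: index a categorical column, assemble features, and fit a classifier as one unit.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    data = spark.createDataFrame(
        [(34.0, "US", 1.0), (21.0, "IN", 0.0),
         (45.0, "US", 1.0), (30.0, "IN", 0.0)],
        ["age", "country", "label"])

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="country", outputCol="country_idx"),
        VectorAssembler(inputCols=["age", "country_idx"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label"),
    ])

    model = pipeline.fit(data)  # fits every stage in order
    model.transform(data).select("label", "prediction").show()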

Lesson 14: Databricks Runtime and AutoML
  • Introduction to the Databricks Runtime
  • Automated model selection and hyperparameter tuning with Databricks AutoML (sketched below)
  • Hands-on exercises: exploring advanced features of the Databricks workspace
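
A sketch of the Databricks AutoML Python API, which is available on Databricks ML runtimes; train_df and its label column are hypothetical, and the exact summary fields may vary by runtime version.

    from databricks import automl

    # AutoML trains and tunes several candidate models, logging every
    # trial to MLflow and generating an editable notebook per trial.
    summary = automl.classify(
        dataset=train_df,        # a Spark or pandas DataFrame
        target_col="label",
        timeout_minutes=30,
    )
    print(summary.best_trial.model_path)  # MLflow URI of the best model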

Lesson 15: Scaling Data Engineering Workloads
  • Strategies for scaling data engineering workloads in Databricks
  • Autoscaling clusters and optimizing resource allocation (sketched below)
  • Hands-on exercises: scaling data pipelines for performance and efficiency
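
A cluster specification sketch with autoscaling enabled, in the shape accepted by the Databricks Clusters API; the runtime version and node type are placeholders.

    cluster_spec = {
        "cluster_name": "etl-autoscaling",
        "spark_version": "<databricks-runtime-version>",
        "node_type_id": "<cloud-node-type>",
        # Databricks adds or removes workers within these bounds under load.
        "autoscale": {"min_workers": 2, "max_workers": 10},
        # Release idle resources automatically.
        "autotermination_minutes": 30,
        "spark_conf": {
            # Adaptive query execution rebalances shuffle partitions at runtime.
            "spark.sql.adaptive.enabled": "true",
        },
    }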

Lesson 16: Monitoring and Performance Tuning
  • Monitoring and logging techniques in Databricks
  • Performance tuning and optimization strategies
  • Hands-on exercises: monitoring and optimizing data pipelines in Databricks (sketched below)
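
A few monitoring hooks, sketched with hypothetical df and query objects (a DataFrame and a running streaming query, respectively).

    import logging

    # Inspect the physical plan to spot full scans or unexpected shuffles.
    df.explain(mode="formatted")

    # For streaming queries, lastProgress reports input rates and batch
    # durations for the most recent micro-batch.
    print(query.lastProgress)

    # Standard Python logging from the driver lands in the cluster's
    # driver logs, visible in the Databricks UI.
    log = logging.getLogger("pipeline")
    log.warning("row count below threshold: %d", 42)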

Lesson 17: Capstone Project
  • Participants work on a comprehensive data engineering project using Databricks
  • Project scope includes data ingestion, transformation, orchestration, and optimization
  • Guidance and support provided by instructors for project implementation

Lesson 18: Capstone Presentations and Evaluation
  • Participants present their capstone projects to the class and instructors
  • Projects are evaluated based on completeness, accuracy, scalability, and adherence to best practices
  • Feedback provided to participants for further improvement and learning

Lesson 19: Wrap-Up and Certification
  • Recap of key concepts and takeaways from the bootcamp
  • Distribution of course completion certificates to participants