Lessons: 19
Duration: 5 days
Language: English
OBJECTIVE:
- This bootcamp is designed for data engineers seeking advanced skills in building and managing data pipelines using Databricks. Participants will learn how to leverage Databricks' Unified Analytics Platform to perform data engineering tasks efficiently, including data ingestion, transformation, orchestration, and optimization.
Course features:
- Practical hands-on exercises
- Lab sessions
- Training by experienced faculty
PRE-REQUISITES:
- Fundamental Data Engineering Knowledge: Participants should have a solid understanding of data engineering concepts, including data modeling, ETL processes, SQL, and distributed computing.
- Proficiency in Python or Scala: Participants should be proficient in at least one programming language, preferably Python or Scala, as Databricks supports both languages for data engineering tasks.
- Familiarity with Apache Spark: Basic knowledge of Apache Spark is recommended but not mandatory; familiarity with concepts such as RDDs, DataFrames, and Spark SQL will make the hands-on sessions easier to follow.
- Access to Databricks Workspace: Participants should have access to a Databricks workspace provisioned with the permissions needed to create clusters and notebooks and to run jobs.
LAB SETUP:
- Databricks Account: Each participant should have access to a Databricks account provided by the training provider or their organization, with the permissions needed to create clusters and notebooks and to run jobs.
- Internet Connectivity: Participants should have stable internet connectivity throughout the bootcamp to access the Databricks workspace, documentation, and other online resources.
- IDE Integration: Participants may choose to integrate their preferred integrated development environment (IDE) with Databricks. Supported IDEs include PyCharm, Visual Studio Code, and IntelliJ IDEA.
- Data Sources: Sample datasets and data sources should be available for participants to use during lab exercises and hands-on projects. These can include structured and semi-structured data in formats such as CSV, JSON, Parquet, and Avro.
- Collaboration Tools: Collaboration tools such as Slack, Microsoft Teams, or Zoom should be available for communication between instructors and participants, for sharing code snippets, and for seeking assistance during lab sessions.
Learning Path
- Day 1: Introduction to Databricks and Spark
- Overview of Databricks Unified Analytics Platform
- Setting up Databricks workspace and clusters
- Creating and managing notebooks
- Overview of Apache Spark architecture
- Working with Resilient Distributed Datasets (RDDs)
- Introduction to Spark SQL and DataFrames
- Ingesting data from various sources (e.g., S3, Azure Blob Storage, relational databases)
- Exploring data formats (CSV, JSON, Parquet, Avro)
- Hands-on exercises: ingesting sample datasets into Databricks
- Exploratory data analysis (EDA) using Databricks notebooks
- Data cleaning and preprocessing techniques
- Hands-on exercises: performing data transformation tasks using the Spark DataFrame API (a short PySpark sketch follows below)
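To give a feel for the Day 1 hands-on work, here is a minimal PySpark sketch that ingests a CSV file and applies a few DataFrame transformations. The storage path and column names (order_id, quantity, unit_price, order_date) are illustrative placeholders rather than course materials; in a Databricks notebook the spark session is already provided.

```python
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` already exists; the builder call is only
# needed when trying the sketch outside Databricks.
spark = SparkSession.builder.appName("day1-ingest-transform").getOrCreate()

# Hypothetical location of the raw data (e.g., a mounted cloud storage path).
raw_path = "/mnt/training/sales/raw_sales.csv"

# Ingest a CSV file, letting Spark infer the schema from the data.
sales = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Basic cleaning and transformation with the DataFrame API: drop rows missing
# a key column, derive a revenue column, and aggregate by day.
daily_revenue = (
    sales
    .dropna(subset=["order_id"])
    .withColumn("revenue", F.col("quantity") * F.col("unit_price"))
    .groupBy("order_date")
    .agg(F.sum("revenue").alias("total_revenue"))
    .orderBy("order_date")
)

daily_revenue.show(10)
```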
- Day 2: Advanced Data Engineering Techniques
- Working with complex data types (arrays, structs, maps)
- User-defined functions (UDFs) in Spark SQL
- Window functions and analytical queries
- Understanding data partitioning and its impact on performance
- Techniques for optimizing Spark jobs (e.g., caching, broadcast joins)
- Hands-on exercises: optimizing data pipelines in Databricks (see the optimization sketch after the Day 2 topics)
- Overview of Delta Lake and its features
- ACID transactions and data versioning
- Hands-on exercises: working with Delta tables in Databricks (see the Delta Lake sketch after the Day 2 topics)
- Creating and scheduling jobs in Databricks
- Monitoring job performance and execution history
- Hands-on exercises: scheduling data pipeline jobs in Databricks
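Two of the Day 2 topics lend themselves to short sketches. The first combines a window function, a broadcast join, and caching; the table names (training.orders, training.customers) and their columns are assumptions made for the example.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("day2-optimization").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.table("training.orders")        # assumed to be large
customers = spark.read.table("training.customers")  # assumed to be small

# Window function: rank each customer's orders by amount and keep the top 3.
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
top_orders = (
    orders
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 3)
)

# Broadcast join: hint that the small dimension table should be shipped to
# every executor rather than shuffling the large fact table.
enriched = top_orders.join(F.broadcast(customers), on="customer_id", how="left")

# Cache a DataFrame that several downstream actions will reuse.
enriched.cache()
print(enriched.count())
enriched.groupBy("country").agg(F.avg("amount").alias("avg_amount")).show()
```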
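The second sketch shows the Delta Lake basics covered on Day 2: transactional writes and reading an earlier version of a table (time travel). The storage path is a placeholder; Delta Lake is bundled with the Databricks Runtime, while outside Databricks the delta-spark package would need to be installed and configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day2-delta").getOrCreate()

# Hypothetical storage location for the Delta table.
delta_path = "/mnt/training/delta/customers"

# Initial write: Delta stores the data files plus a transaction log that
# provides ACID guarantees and versioning (this becomes version 0).
df_v0 = spark.createDataFrame(
    [(1, "alice", "DE"), (2, "bob", "FR")],
    ["id", "name", "country"],
)
df_v0.write.format("delta").mode("overwrite").save(delta_path)

# Append more rows in a second transaction (version 1).
df_new = spark.createDataFrame([(3, "carol", "IN")], ["id", "name", "country"])
df_new.write.format("delta").mode("append").save(delta_path)

# Read the current state of the table.
spark.read.format("delta").load(delta_path).show()

# Time travel: read the table as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(delta_path).show()
```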
- Day 3: Data Engineering Best Practices
- Overview of Apache Spark Structured Streaming
- Working with streaming DataFrames
- Hands-on exercises: building streaming pipelines in Databricks (see the streaming sketch after the Day 3 topics)
- Overview of MLflow and its components
- Tracking experiments, packaging, and deploying models
- Hands-on exercises: managing machine learning pipelines with MLflow (see the MLflow sketch after the Day 3 topics)
- Design patterns for building scalable and reliable data pipelines
- Error handling and fault tolerance strategies
- Optimization techniques for improving pipeline performance
- Case studies and examples of data engineering projects
- Best practices for handling common data engineering challenges
- Q&A and open discussion on real-world scenarios
- Day 4: Advanced Topics in Databricks
- Overview of MLlib and ML packages in Spark
- Building end-to-end machine learning pipelines
- Hands-on exercises: building and deploying ML pipelines in Databricks (see the MLlib pipeline sketch after the Day 4 topics)
- Introduction to Databricks Runtime
- Automated model training and hyperparameter tuning with Databricks AutoML
- Hands-on exercises: exploring advanced features of Databricks workspace
- Strategies for scaling data engineering workloads in Databricks
- Autoscaling clusters and optimizing resource allocation
- Hands-on exercises: scaling data pipelines for performance and efficiency
- Monitoring and logging techniques in Databricks
- Performance tuning and optimization strategies
- Hands-on exercises: monitoring and optimizing data pipelines in Databricks
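For the Day 4 ML pipeline exercise, here is a compact MLlib Pipeline sketch: feature indexing and assembly chained with a logistic regression estimator. The toy data, feature names, and hyperparameters are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("day4-ml-pipeline").getOrCreate()

# Hypothetical training data: two numeric features and a string label.
train = spark.createDataFrame(
    [(1.0, 3.5, "yes"), (2.0, 1.0, "no"), (0.5, 4.2, "yes"), (3.1, 0.7, "no")],
    ["feature_a", "feature_b", "label_str"],
)

# A Pipeline chains feature engineering and the estimator so the whole
# workflow can be fitted, saved, and reapplied as a single object.
indexer = StringIndexer(inputCol="label_str", outputCol="label")
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
lr = LogisticRegression(maxIter=20, regParam=0.01)

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train)

# Apply the fitted pipeline to get predictions (here, on the training data).
model.transform(train).select("features", "label", "prediction").show()
```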
- Day 5: Capstone Project and Final Assessment
- Participants work on a comprehensive data engineering project using Databricks
- Project scope includes data ingestion, transformation, orchestration, and optimization
- Guidance and support provided by instructors for project implementation
- Participants present their capstone projects to the class and instructors
- Projects are evaluated based on completeness, accuracy, scalability, and adherence to best practices
- Feedback provided to participants for further improvement and learning
- Recap of key concepts and takeaways from the bootcamp
- Distribution of course completion certificates to participants