Lessons: 19
Duration: 5 days
Language: English
OBJECTIVE:
- This bootcamp is designed for data engineers seeking advanced skills in building and managing data pipelines using Databricks. Participants will learn how to leverage Databricks' Unified Analytics Platform to perform data engineering tasks efficiently, including data ingestion, transformation, orchestration, and optimization.
Course features:
- Practical hands-on exercises
- Lab sessions
- Training by experienced faculty
PRE-REQUISITES:
- Fundamental Data Engineering Knowledge: Participants should have a solid understanding of data engineering concepts, including data modeling, ETL processes, SQL, and distributed computing.
- Proficiency in Python or Scala: Participants should be proficient in at least one programming language, preferably Python or Scala, as Databricks supports both languages for data engineering tasks.
- Familiarity with Apache Spark: Basic knowledge of Apache Spark is recommended but not mandatory; familiarity with concepts such as RDDs, DataFrames, and Spark SQL will make the hands-on sessions easier to follow.
- Access to Databricks Workspace: Participants should have access to a Databricks workspace provisioned with the permissions needed to create clusters and notebooks and to run jobs.
LAB SETUP:
- Databricks Account: Each participant should have access to a Databricks account provided by the training provider or their organization, with the permissions needed to create clusters and notebooks and to run jobs.
- Internet Connectivity: Participants should have stable internet connectivity throughout the bootcamp to access the Databricks workspace, documentation, and other online resources.
- IDE Integration: Participants may choose to integrate their preferred integrated development environment (IDE) with Databricks. Supported IDEs include PyCharm, Visual Studio Code, and IntelliJ IDEA.
- Data Sources: Sample datasets and data sources should be available for participants to use during lab exercises and hands-on projects. These can include structured and semi-structured data in formats such as CSV, JSON, Parquet, and Avro.
- Collaboration Tools: Collaboration tools such as Slack, Microsoft Teams, or Zoom should be available for communication between instructors and participants, for sharing code snippets, and for seeking assistance during lab sessions.
Learning Path
- Day 1: Introduction to Databricks and Spark
- Overview of Databricks Unified Analytics Platform
- Setting up Databricks workspace and clusters
- Creating and managing notebooks
- Overview of Apache Spark architecture
- Working with Resilient Distributed Datasets (RDDs)
- Introduction to Spark SQL and DataFrames
- Ingesting data from various sources (e.g., S3, Azure Blob Storage, relational databases)
- Exploring data formats (CSV, JSON, Parquet, Avro)
- Hands-on exercises: ingesting sample datasets into Databricks
- Exploratory data analysis (EDA) using Databricks notebooks
- Data cleaning and preprocessing techniques
- Hands-on exercises: performing data transformation tasks using the Spark DataFrame API (a short PySpark sketch follows below)
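To give a feel for the Day 1 hands-on work, here is a minimal PySpark sketch that ingests a CSV file and applies a few DataFrame transformations. The storage path and column names (order_id, quantity, unit_price, order_date) are illustrative placeholders rather than course materials; in a Databricks notebook the spark session is already provided.

```python
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` already exists; the builder call is only
# needed when trying the sketch outside Databricks.
spark = SparkSession.builder.appName("day1-ingest-transform").getOrCreate()

# Hypothetical location of the raw data (e.g., a mounted cloud storage path).
raw_path = "/mnt/training/sales/raw_sales.csv"

# Ingest a CSV file, letting Spark infer the schema from the data.
sales = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Basic cleaning and transformation with the DataFrame API: drop rows missing
# a key column, derive a revenue column, and aggregate by day.
daily_revenue = (
    sales
    .dropna(subset=["order_id"])
    .withColumn("revenue", F.col("quantity") * F.col("unit_price"))
    .groupBy("order_date")
    .agg(F.sum("revenue").alias("total_revenue"))
    .orderBy("order_date")
)

daily_revenue.show(10)
```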
- Day 2: Advanced Data Engineering Techniques
- Working with complex data types (arrays, structs, maps)
- User-defined functions (UDFs) in Spark SQL
- Window functions and analytical queries
- Understanding data partitioning and its impact on performance
- Techniques for optimizing Spark jobs (e.g., caching, broadcast joins)
- Hands-on exercises: optimizing data pipelines in Databricks (see the optimization sketch after the Day 2 topics)
- Overview of Delta Lake and its features
- ACID transactions and data versioning
- Hands-on exercises: working with Delta tables in Databricks (see the Delta Lake sketch after the Day 2 topics)
- Creating and scheduling jobs in Databricks
- Monitoring job performance and execution history
- Hands-on exercises: scheduling data pipeline jobs in Databricks
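Two of the Day 2 topics lend themselves to short sketches. The first combines a window function, a broadcast join, and caching; the table names (training.orders, training.customers) and their columns are assumptions made for the example.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("day2-optimization").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.table("training.orders")        # assumed to be large
customers = spark.read.table("training.customers")  # assumed to be small

# Window function: rank each customer's orders by amount and keep the top 3.
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
top_orders = (
    orders
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 3)
)

# Broadcast join: hint that the small dimension table should be shipped to
# every executor rather than shuffling the large fact table.
enriched = top_orders.join(F.broadcast(customers), on="customer_id", how="left")

# Cache a DataFrame that several downstream actions will reuse.
enriched.cache()
print(enriched.count())
enriched.groupBy("country").agg(F.avg("amount").alias("avg_amount")).show()
```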
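The second sketch shows the Delta Lake basics covered on Day 2: transactional writes and reading an earlier version of a table (time travel). The storage path is a placeholder; Delta Lake is bundled with the Databricks Runtime, while outside Databricks the delta-spark package would need to be installed and configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day2-delta").getOrCreate()

# Hypothetical storage location for the Delta table.
delta_path = "/mnt/training/delta/customers"

# Initial write: Delta stores the data files plus a transaction log that
# provides ACID guarantees and versioning (this becomes version 0).
df_v0 = spark.createDataFrame(
    [(1, "alice", "DE"), (2, "bob", "FR")],
    ["id", "name", "country"],
)
df_v0.write.format("delta").mode("overwrite").save(delta_path)

# Append more rows in a second transaction (version 1).
df_new = spark.createDataFrame([(3, "carol", "IN")], ["id", "name", "country"])
df_new.write.format("delta").mode("append").save(delta_path)

# Read the current state of the table.
spark.read.format("delta").load(delta_path).show()

# Time travel: read the table as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(delta_path).show()
```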
- Day 3: Data Engineering Best Practices
- Overview of Apache Spark Structured Streaming
- Working with streaming DataFrames
- Hands-on exercises: building streaming pipelines in Databricks (see the streaming sketch after the Day 3 topics)
- Overview of MLflow and its components
- Tracking experiments, packaging, and deploying models
- Hands-on exercises: managing machine learning pipelines with MLflow (see the MLflow sketch after the Day 3 topics)
- Design patterns for building scalable and reliable data pipelines
- Error handling and fault tolerance strategies
- Optimization techniques for improving pipeline performance
- Case studies and examples of data engineering projects
- Best practices for handling common data engineering challenges
- Q&A and open discussion on real-world scenarios
- Day 4: Advanced Topics in Databricks
- Overview of MLlib and ML packages in Spark
- Building end-to-end machine learning pipelines
- Hands-on exercises: building and deploying ML pipelines in Databricks (see the MLlib pipeline sketch after the Day 4 topics)
- Introduction to Databricks Runtime
- Automated model training and hyperparameter tuning with Databricks AutoML
- Hands-on exercises: exploring advanced features of Databricks workspace
- Strategies for scaling data engineering workloads in Databricks
- Autoscaling clusters and optimizing resource allocation
- Hands-on exercises: scaling data pipelines for performance and efficiency
- Monitoring and logging techniques in Databricks
- Performance tuning and optimization strategies
- Hands-on exercises: monitoring and optimizing data pipelines in Databricks
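For the Day 4 ML pipeline exercise, here is a compact MLlib Pipeline sketch: feature indexing and assembly chained with a logistic regression estimator. The toy data, feature names, and hyperparameters are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("day4-ml-pipeline").getOrCreate()

# Hypothetical training data: two numeric features and a string label.
train = spark.createDataFrame(
    [(1.0, 3.5, "yes"), (2.0, 1.0, "no"), (0.5, 4.2, "yes"), (3.1, 0.7, "no")],
    ["feature_a", "feature_b", "label_str"],
)

# A Pipeline chains feature engineering and the estimator so the whole
# workflow can be fitted, saved, and reapplied as a single object.
indexer = StringIndexer(inputCol="label_str", outputCol="label")
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
lr = LogisticRegression(maxIter=20, regParam=0.01)

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train)

# Apply the fitted pipeline to get predictions (here, on the training data).
model.transform(train).select("features", "label", "prediction").show()
```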
- Day 5: Capstone Project and Final Assessment
- Participants work on a comprehensive data engineering project using Databricks
- Project scope includes data ingestion, transformation, orchestration, and optimization
- Guidance and support provided by instructors for project implementation
- Participants present their capstone projects to the class and instructors
- Projects are evaluated based on completeness, accuracy, scalability, and adherence to best practices
- Feedback provided to participants for further improvement and learning
- Recap of key concepts and takeaways from the bootcamp
- Distribution of course completion certificates to participants