Course Outline

Introduction:

  • The Role of Apache Spark in the Hadoop Ecosystem
  • Overview of Python and Scala

Core Concepts (Theory):

  • Architecture
  • Resilient Distributed Datasets (RDD)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies
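The core idea behind transformations and actions is laziness: transformations only describe a computation, and nothing executes until an action demands a result. As an illustrative analogy only (plain Python generators, not the Spark API), the distinction can be sketched like this:

```python
# Illustrative analogy: Python generators mimic Spark's lazy
# transformations; nothing runs until an "action" consumes the pipeline.
data = range(1, 6)                      # source dataset: 1..5

doubled = (x * 2 for x in data)         # transformation: map-like (lazy)
evens = (x for x in doubled if x > 4)   # transformation: filter-like (lazy)

result = list(evens)                    # action: collect-like, triggers work
total = sum(result)                     # action: reduce-like aggregation
print(result, total)                    # [6, 8, 10] 24
```

In Spark, the chain of lazy transformations forms the lineage graph that the scheduler later splits into stages and tasks.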

Practical Workshop: Mastering Basics with Databricks

  • RDD API Exercises
  • Essential Action and Transformation Functions
  • PairRDDs
  • Join Operations
  • Caching Strategies
  • DataFrame API Exercises
  • SparkSQL
  • DataFrame Operations: select, filter, group, and sort
  • User-Defined Functions (UDFs)
  • Introduction to the Dataset API
  • Streaming
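Two of the workshop topics, PairRDDs and joins, boil down to key-based grouping and key-based matching. The following is a minimal pure-Python sketch of those semantics (again an analogy, not the PySpark `reduceByKey`/`join` API itself; the sample data is invented for illustration):

```python
from collections import defaultdict

# Hypothetical sample data: (key, value) pairs, as in a PairRDD.
sales = [("apples", 3), ("pears", 2), ("apples", 4)]
prices = [("apples", 1.5), ("pears", 2.0)]

# reduceByKey-style aggregation: combine values that share a key.
totals = defaultdict(int)
for key, qty in sales:
    totals[key] += qty               # {"apples": 7, "pears": 2}

# Inner-join-style matching: pair each total with its price by key.
price_map = dict(prices)
joined = {k: (v, price_map[k]) for k, v in totals.items() if k in price_map}
# {"apples": (7, 1.5), "pears": (2, 2.0)}
```

Spark performs the same key-based grouping in a distributed fashion, which is why these operations can trigger a shuffle across partitions.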

Practical Workshop: Deployment in an AWS Environment

  • Fundamentals of AWS Glue
  • Distinguishing between AWS EMR and AWS Glue
  • Sample Jobs in Both Environments
  • Evaluating Advantages and Disadvantages

Additional Content:

  • Introduction to Apache Airflow Orchestration

Requirements

  • Programming proficiency (preferably in Python or Scala)
  • Fundamental knowledge of SQL

Duration: 21 Hours
