
Course Outline

PySpark & Machine Learning 

Module 1: Big Data & Spark Foundations

  • Overview of the Big Data ecosystem and Spark's role in modern data platforms
  • Understanding Spark architecture: drivers, executors, cluster managers, lazy evaluation, the DAG, and execution planning
  • Differences between the RDD and DataFrame APIs and guidelines for choosing between them
  • Creating and configuring a SparkSession and understanding the fundamentals of application configuration
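
As a rough illustration of the lazy evaluation idea listed in this module, the sketch below uses plain Python rather than Spark. The LazyDataset class and its methods are hypothetical names invented for this example, not part of any Spark API. Transformations only record a plan, and nothing runs until an action is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: extends the plan, touches no data.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: extends the plan, touches no data.
        return LazyDataset(self._rows, self._plan + (("filter", pred),))

    def collect(self):
        # Action: the recorded plan is executed now, step by step per row.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []  # records every invocation of double()


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # no work has happened yet
result = ds.collect()             # the plan runs here
print(calls_before_action, result)
```

In Spark the recorded plan is also optimized before execution, which this toy version does not attempt.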

Module 2: PySpark DataFrames

  • Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
  • Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
  • Performing advanced operations such as window functions, timestamp handling, and working with nested data
  • Implementing data quality checks and writing maintainable, reusable PySpark code

Module 3: Processing Large Datasets Efficiently

  • Understanding performance fundamentals: partitioning strategies, shuffle behavior, caching, and persistence
  • Using optimization techniques such as broadcast joins and execution plan analysis
  • Processing large datasets efficiently and following best practices for scalable data workflows
  • Understanding schema evolution and modern storage formats used in enterprise environments

Module 4: Feature Engineering at Scale

  • Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables, and scaling features
  • Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
  • Introduction to feature selection and handling imbalanced datasets

Module 5: Machine Learning with Spark MLlib

  • Understanding MLlib architecture and the Estimator/Transformer pattern
  • Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  • Comparing models and interpreting results within distributed Machine Learning workflows
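
The Estimator/Transformer pattern named in this module can be sketched in plain Python. The classes below are hypothetical illustrations, not MLlib classes: an estimator learns something from data in fit() and returns a fitted model, and that model is a transformer that applies what was learned in transform().

```python
class MeanImputer:
    """Estimator (hypothetical): learns a fill value from the data in fit()."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        # Learn the mean of the non-missing values in the column.
        values = [r[self.column] for r in rows if r[self.column] is not None]
        return MeanImputerModel(self.column, sum(values) / len(values))


class MeanImputerModel:
    """Transformer (hypothetical): applies the learned fill value in transform()."""

    def __init__(self, column, fill_value):
        self.column = column
        self.fill_value = fill_value

    def transform(self, rows):
        # Return new rows; missing values are replaced by the learned mean.
        return [
            {**r, self.column: self.fill_value if r[self.column] is None else r[self.column]}
            for r in rows
        ]


train = [{"age": 20.0}, {"age": None}, {"age": 40.0}]
model = MeanImputer("age").fit(train)  # estimator -> fitted model
filled = model.transform(train)        # model acts as a transformer
print(model.fill_value, filled)
```

Keeping fit() and transform() separate means the statistics learned on training data can be reused unchanged on validation or inference data.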

Module 6: End-to-End ML Pipelines

  • Constructing end-to-end Machine Learning pipelines that integrate preprocessing, feature engineering, and modeling
  • Applying train/validation/test split strategies
  • Performing cross-validation and hyperparameter tuning via grid search and random search
  • Structuring reproducible Machine Learning experiments

Module 7: Model Evaluation & Practical ML Decision Making

  • Applying suitable evaluation metrics for regression and classification problems
  • Identifying overfitting and underfitting and making informed model selection decisions
  • Interpreting feature importance and understanding model behavior
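
For orientation, the kinds of metrics this module refers to can be written out in plain Python. This is a minimal sketch with hypothetical helper names (rmse, precision_recall) and made-up toy labels; it is not how Spark computes them at scale.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regression problem."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class of a classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy values chosen only for illustration.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
precision, recall = precision_recall([1, 0, 1, 1], [1, 1, 0, 1])
print(error, precision, recall)
```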

Module 8: Production & Enterprise Practices

  • Persisting and loading models in Spark
  • Implementing batch inference workflows on large datasets
  • Understanding the Machine Learning lifecycle in enterprise environments
  • Introduction to versioning, experiment tracking concepts, and fundamental testing strategies

Practical Outcome

  • Competence in working independently with PySpark
  • Capability to process large datasets efficiently
  • Skill in performing feature engineering at scale
  • Ability to construct scalable Machine Learning pipelines

Requirements

Participants are expected to have the following background:

  • Foundational Python programming skills, including working with functions, data structures, and libraries
  • A basic grasp of data analysis concepts such as datasets, transformations, and aggregations
  • Elementary knowledge of SQL and relational data principles
  • Introductory understanding of Machine Learning concepts, including training datasets, features, and evaluation metrics
  • Familiarity with command-line environments and basic software development practices is advised

Experience with Pandas, NumPy, or similar data processing libraries is advantageous but not required.

Duration: 21 Hours
