Course Outline

Section 1: Introduction to Hadoop

  • History and core concepts of Hadoop
  • Overview of the ecosystem
  • Various distributions
  • High-level architecture
  • Common misconceptions about Hadoop
  • Challenges associated with Hadoop
  • Hardware and software requirements
  • Lab: Initial exploration of Hadoop

Section 2: HDFS

  • Design and architectural principles
  • Core concepts (horizontal scaling, replication, data locality, rack awareness)
  • Key daemons: NameNode, Secondary NameNode, DataNode
  • Communication protocols and heartbeats
  • Data integrity mechanisms
  • Read and write pathways
  • NameNode High Availability (HA) and Federation
  • Labs: Interacting with HDFS
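The block-and-replica arithmetic behind "horizontal scaling" and "replication" can be previewed in a few lines of plain Java before touching a cluster. The 128 MB block size and replication factor 3 below are the Hadoop defaults; real clusters set them via dfs.blocksize and dfs.replication in hdfs-site.xml, so treat this as a sketch, not the HDFS implementation itself:

```java
// Sketch of how HDFS carves a file into blocks and replicates them.
// BLOCK_SIZE and REPLICATION use the Hadoop defaults; actual values
// are cluster configuration (dfs.blocksize, dfs.replication).
public class HdfsSizing {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default dfs.blocksize
    static final int REPLICATION = 3;                  // default dfs.replication

    // Number of HDFS blocks a file of the given size occupies (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw cluster bytes consumed once every block is replicated.
    static long rawUsage(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb)); // 8
        System.out.println(rawUsage(oneGb));   // three times the logical size
    }
}
```

Note that a 1 GB file occupies only 8 block slots but consumes 3 GB of raw disk across the cluster, which is why the labs look at both logical and raw usage.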

Section 3: MapReduce

  • Concepts and architecture
  • MRv1 daemons: JobTracker and TaskTracker
  • Execution phases: Driver, Mapper, Shuffle/Sort, Reducer
  • MapReduce Version 1 and Version 2 (YARN)
  • Internal workings of MapReduce
  • Introduction to Java-based MapReduce programming
  • Labs: Executing a sample MapReduce program
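Before the lab's real job, the Mapper → Shuffle/Sort → Reducer pipeline can be previewed in pure Java with no Hadoop classes at all. This word-count sketch only mimics the phases; an actual job would extend Mapper and Reducer from org.apache.hadoop.mapreduce and be submitted to YARN:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit (word, 1) for each word.
    // Shuffle/Sort: group values by key; a TreeMap keeps keys sorted,
    // mirroring the sorted input a Reducer receives.
    // Reduce phase: sum the 1s for each word.
    static SortedMap<String, Integer> wordCount(List<String> lines) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {                       // each "input split"
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty())
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

The same three stages appear in the lab program; the difference is that Hadoop distributes the map calls across DataNodes and performs the shuffle over the network.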

Section 4: Pig

  • Pig versus Java MapReduce
  • Pig job execution flow
  • Pig Latin language basics
  • ETL processes using Pig
  • Transformations and joins
  • User-defined functions (UDFs)
  • Labs: Writing Pig scripts to analyze data
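As a taste of Pig Latin and the kind of ETL script written in the labs, here is a minimal hypothetical example that counts page hits per URL. The input file name and its (user, url) schema are assumptions for illustration; a Java MapReduce equivalent would take dozens of lines:

```pig
-- Hypothetical input: tab-separated (user, url) visit records
logs  = LOAD 'visits.tsv' AS (user:chararray, url:chararray);
byUrl = GROUP logs BY url;                                 -- shuffle by key
hits  = FOREACH byUrl GENERATE group AS url, COUNT(logs) AS n;
top   = ORDER hits BY n DESC;
STORE top INTO 'top_urls';
```

Each statement builds a relation from the previous one; Pig compiles the whole flow into one or more MapReduce jobs only when STORE (or DUMP) is reached.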

Section 5: Hive

  • Architecture and design
  • Data types
  • SQL compatibility in Hive
  • Creating Hive tables and executing queries
  • Partitions
  • Joins
  • Text processing capabilities
  • Labs: Various exercises on processing data with Hive
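To preview table creation, partitioning, and querying before the Hive labs, here is a hypothetical HiveQL example; the table name, columns, and date value are illustrative, not part of the lab material:

```sql
-- Hypothetical partitioned table of page views
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (view_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Filtering on the partition column prunes the scan to one partition
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Because view_date is a partition column, its values live in the directory layout rather than in the data files, which is what makes the WHERE clause cheap.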

Section 6: HBase

  • Concepts and architecture
  • HBase compared to RDBMS and Cassandra
  • HBase Java API
  • Handling time series data on HBase
  • Schema design strategies
  • Labs: Interacting with HBase via the shell; programming with the HBase Java API; schema design exercise
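One schema-design idea this section covers, ordering time-series rows so a scan returns the newest reading first, can be sketched without an HBase cluster or the hbase-client library. The key layout below (an id prefix plus a reversed timestamp) is one conventional pattern, not the only one, and the names are illustrative:

```java
public class RowKeyDesign {
    // HBase stores rows sorted lexicographically by row key, so the key
    // encodes the desired scan order. Prefixing with the sensor id keeps
    // one sensor's readings contiguous; subtracting the timestamp from
    // Long.MAX_VALUE makes newer readings sort first.
    static String rowKey(String sensorId, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis; // newer -> smaller key
        return String.format("%s#%019d", sensorId, reversed); // zero-padded
    }

    public static void main(String[] args) {
        String older = rowKey("sensor42", 1_000L);
        String newer = rowKey("sensor42", 2_000L);
        // Lexicographically, the newer reading's key comes first:
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

In the labs the same string (or a byte-array equivalent) would be passed to Put and Scan via the HBase Java API; the zero padding matters because HBase compares keys as bytes, not as numbers.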

Requirements

  • Proficiency in Java programming (most coding exercises are in Java)
  • Familiarity with the Linux environment (including navigating the command line and editing files using vi or nano)

Lab Environment

No Installation Required: Participants do not need to install Hadoop software on their own devices. A fully operational Hadoop cluster will be provided for use during the course.

Students will need the following tools:

  • An SSH client (Linux and macOS systems include one built in; PuTTY is recommended for Windows users)
  • A web browser to access the cluster, with Firefox being the recommended choice

Duration: 28 Hours
