Course Outline
Section 1: Introduction to Hadoop
- History and core concepts of Hadoop
- Overview of the ecosystem
- Various distributions
- High-level architecture
- Common misconceptions about Hadoop
- Challenges associated with Hadoop
- Hardware and software requirements
- Lab: Initial exploration of Hadoop
Section 2: HDFS
- Design and architectural principles
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Key daemons: NameNode, Secondary NameNode, DataNode
- Communication protocols and heartbeats
- Data integrity mechanisms
- Read and write pathways
- NameNode High Availability (HA) and Federation
- Labs: Interacting with HDFS
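The block and replication concepts above can be illustrated with some simple arithmetic. The sketch below is plain Java (not the Hadoop API) and assumes the common defaults of a 128 MB block size (`dfs.blocksize`) and a replication factor of 3 (`dfs.replication`):

```java
// Illustrative sketch, not the Hadoop API: how HDFS splits a file into
// fixed-size blocks, and how replication multiplies raw storage used.
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assumed dfs.blocksize default
    static final int REPLICATION = 3;                  // assumed dfs.replication default

    // Number of blocks needed to store a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw cluster bytes consumed once every block is replicated.
    static long rawBytesStored(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb));     // 8 blocks of 128 MB
        System.out.println(rawBytesStored(oneGb)); // 3 GB of raw storage
    }
}
```

Note that a small file still occupies one block entry in the NameNode's metadata, which is why many tiny files are harder on HDFS than a few large ones.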
Section 3: MapReduce
- Concepts and architecture
- Daemons (MRV1): JobTracker and TaskTracker
- Execution phases: Driver, Mapper, Shuffle/Sort, Reducer
- MapReduce Version 1 and Version 2 (YARN)
- Internal workings of MapReduce
- Introduction to Java-based MapReduce programming
- Labs: Executing a sample MapReduce program
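The Mapper, Shuffle/Sort, and Reducer phases listed above can be sketched in plain Java, with no Hadoop dependency, using the classic word-count example: the map step emits `(word, 1)` pairs, and the grouping-plus-sum step stands in for shuffle/sort and reduce. This is a conceptual sketch only, not the `org.apache.hadoop.mapreduce` API used in the labs:

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch of the MapReduce phases in plain Java (no Hadoop):
// map emits (word, 1) pairs; the TreeMap merge stands in for shuffle/sort
// (grouping by key) and reduce (summing each group's values).
public class WordCountSketch {
    // Map phase: one input line -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle/sort + reduce: group pairs by key, then sum each group.
    static SortedMap<String, Integer> run(List<String> lines) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In real Hadoop, the map and reduce steps run as separate tasks on different nodes, and the framework performs the shuffle across the network; the data flow, however, is the same.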
Section 4: Pig
- Pig versus Java MapReduce
- Pig job execution flow
- Pig Latin language basics
- ETL processes using Pig
- Transformations and joins
- User-defined functions (UDF)
- Labs: Writing Pig scripts to analyze data
Section 5: Hive
- Architecture and design
- Data types
- SQL compatibility in Hive
- Creating Hive tables and executing queries
- Partitions
- Joins
- Text processing capabilities
- Labs: Various exercises on processing data with Hive
Section 6: HBase
- Concepts and architecture
- HBase compared to RDBMS and Cassandra
- HBase Java API
- Handling time series data on HBase
- Schema design strategies
- Labs: Interacting with HBase via the shell; programming with the HBase Java API; schema design exercise
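One widely used row-key pattern for time-series data can be sketched in plain Java. The helper below is hypothetical (not part of the HBase API) and assumes 16 salt buckets: the salt prefix spreads sequential writes across regions to avoid hotspotting, and the reversed timestamp makes newer rows sort first in a scan:

```java
// Illustrative sketch of a common HBase time-series row-key pattern
// (hypothetical helper, not an HBase API): a salt bucket spreads writes
// across regions; a reversed timestamp makes newer rows sort first.
public class RowKeyDesign {
    static final int SALT_BUCKETS = 16; // assumed bucket count

    static String rowKey(String metric, long epochMillis) {
        int salt = Math.abs(metric.hashCode() % SALT_BUCKETS); // stable per metric
        long reversed = Long.MAX_VALUE - epochMillis;          // newest-first ordering
        // Zero-pad so keys of the same metric compare correctly as strings.
        return String.format("%02d|%s|%019d", salt, metric, reversed);
    }

    public static void main(String[] args) {
        String older = rowKey("cpu.load", 1_000L);
        String newer = rowKey("cpu.load", 2_000L);
        // Same metric means the same salt, so within that bucket the
        // newer event sorts lexicographically before the older one.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

The trade-off discussed in the schema design lab applies here: salting removes the write hotspot but means a time-range scan must touch every bucket.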
Requirements
- Proficiency in Java programming (as most coding exercises utilize Java)
- Familiarity with the Linux environment (including navigating the command line and editing files using vi or nano)
Lab Environment
No Installation Required: Participants do not need to install Hadoop software on their own devices. A fully operational Hadoop cluster will be provided for use during the course.
Students will need the following tools:
- An SSH client (Linux and Mac systems come with built-in SSH clients; PuTTY is recommended for Windows users)
- A web browser to access the cluster, with Firefox being the recommended choice
Duration: 28 hours
Testimonials (1)
Hands-on exercises. The class should have been 5 days, but the 3 days helped clear up a lot of questions I had from already working with NiFi.