Module 1 – What is Big Data?
Introduction to Big Data
o What is Big Data?
o Impact of Big Data
o Parallel Processing, Scaling, and Data Parallelism
o Tools of Big Data
o Beyond the Hype
o Big Data Use Cases
o Viewpoints about Big Data
Module 2 – Introduction to the Hadoop Ecosystem
Introduction to the Hadoop Ecosystem
o What is Hadoop?
o An introduction to MapReduce
o The Hadoop Ecosystem and Common Components: Introducing HDFS, Hive, HBase, Spark, and other modules
o Working with HDFS
o Working with HBase
o Lab: MapReduce
Module 3 – Introduction to Apache Spark
Introduction to Apache Spark
o Why use Apache Spark?
o Functional Programming Basics
o Parallel Programming using Resilient Distributed Datasets
o Scale-out / Data Parallelism in Apache Spark
o DataFrames and SparkSQL
o Lab: Practical examples with PySpark
Module 4 – DataFrames and SparkSQL
DataFrames and SparkSQL
o Introduction to DataFrames & SparkSQL
o RDDs in Parallel Programming and Spark
o DataFrames and Datasets
o Catalyst and Tungsten
o ETL with DataFrames
o Lab: ETL with DataFrames
o Real-world usage of SparkSQL
o Lab: SparkSQL
Module 5 – Development and Runtime Environment Options
Development and Runtime Environment Options
o Apache Spark architecture
o Overview of Apache Spark Cluster Modes
o How to Run an Apache Spark Application
o Using Apache Spark on IBM Cloud
o Lab: Scale-out on IBM Spark Environment in Watson Studio
o Setting Apache Spark Configuration
o Running Spark on Kubernetes
o Lab: Spark on Kubernetes
Module 6 – Monitoring & Tuning
Monitoring and Tuning Apache Spark
o The Apache Spark User Interface
o Monitoring Jobs
o Debugging Parallel Jobs
o Understanding Memory Resources
o Understanding Processor Resources
o Lab: Monitoring and Performance tuning
Module 7 – Final Quiz