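Cloudera Apache Spark Programmer
Class type: Public course, in-house training
Course length: 3 days / 18 hours
Training dates: To be determined
Certification exam: None at present
Training location: 博學國際教育培訓中心
Facility requirements: Projector, whiteboard, flip-chart paper
Training format: Example-based instruction, live demonstrations, hands-on practice, and real-time discussion
Training materials: Course textbook
Course Content
Cloudera Developer Training for Apache Spark
Course overview:
Combine batch processing, stream processing, and interactive analytics to build complete, unified big data applications with Apache Spark. Learn to write sophisticated parallel applications that support faster, better decisions and real-time action across a wide range of use cases, architectures, and industries.
Audience:
This course is intended for developers and engineers who want to improve the speed, ease of use, and sophistication of their applications. Participants should have a background in Python or Scala; basic familiarity with Linux is a plus.
Learning objectives: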
• Using the Spark shell for interactive data analysis
• The features of Spark’s Resilient Distributed Datasets
• How Spark runs on a cluster
• How Spark parallelizes task execution
• Writing Spark applications
• Processing streaming data with Spark
Course outline:
Introduction to Spark
• What is Spark?
• Review: From Hadoop MapReduce to Spark
• Review: HDFS
• Review: YARN
• Spark Overview
Spark Basics
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
Working with RDDs in Spark
• Creating RDDs
• Other General RDD Operations
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
Writing and Deploying Spark Applications
• Spark Applications vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Hands-On Exercise: Write and Run a Spark Application
• Configuring Spark Properties
• Logging
Parallel Processing
• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
Spark RDD Persistence
• RDD Lineage
• RDD Persistence Overview
• Distributed Persistence
Basic Spark Streaming
• Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Spark Streaming Applications
Advanced Spark Streaming
• Multi-Batch Operations
• State Operations
• Sliding Window Operations
• Advanced Data Sources
Common Patterns in Spark Data Processing
• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
Improving Spark Performance
• Shared Variables: Broadcast Variables
• Shared Variables: Accumulators
• Common Performance Issues
• Diagnosing Performance Problems
Spark SQL and DataFrames
• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• DataFrames and RDDs
• Comparing Spark SQL, Impala and Hive-on-Spark