E-book, English, 288 pages
Chellappan / Ganesan Practical Apache Spark
1st ed.
ISBN: 978-1-4842-3652-9
Publisher: Apress
Format: PDF
Copy protection: 1 - PDF watermark
Using the Scala API
Work with Apache Spark using Scala to deploy and set up single-node, multi-node, and high-availability clusters. This book discusses various components of Spark such as Spark Core, DataFrames, Datasets and SQL, Spark Streaming, Spark MLlib, and R on Spark with the help of practical code snippets for each topic. Practical Apache Spark also covers the integration of Apache Spark with Kafka, with examples. You'll follow a learn-by-doing approach: learn the concepts, practice the code snippets in Scala, and complete the assignments to gain overall exposure.
On completion, you'll have knowledge of the functional programming aspects of Scala, and hands-on expertise in various Spark components. You'll also become familiar with machine learning algorithms with real-time usage.
What You Will Learn
Discover the functional programming features of Scala
Understand the complete architecture of Spark and its components
Integrate Apache Spark with Hive and Kafka
Use Spark SQL, DataFrames, and Datasets to process data using traditional SQL queries
Work with different machine learning concepts and libraries using Spark's MLlib packages
Who This Book Is For
Developers and professionals who deal with batch and stream data processing.
Subhashini Chellappan is an associate manager and technology enthusiast. She has rich experience in both academia and the software industry. She has published two books: Big Data Analytics and Pro Tableau. Her areas of interest and expertise are centered on business intelligence, big data analytics and cloud computing.
Bharath Kumar Dasa is a technology lead with expertise in the big data space and core expertise in the complete Hadoop stack. He has worked on the HDP distribution and has architected multiple data management and data life cycle auto service management projects for financial institutions. He has been working in machine learning and the integration of machine learning with big data technologies for the past few years. His areas of interest and expertise are centered on big data and analytics, machine learning, data visualization and deep learning.
Dharanitharan Ganesan is a senior analyst with five years of experience in IT. He has a high level of exposure and experience in big data, including Apache Hadoop, Apache Spark and various Hadoop ecosystem components. He has a proven track record of improving efficiency and productivity through the automation of various routine and administrative functions in business intelligence and big data technologies. His areas of interest and expertise are centered on machine learning algorithms, statistical modelling and predictive analysis.
Authors/Editors
Further Information & Material
1;Table of Contents;4
2;About the Authors;10
3;About the Technical Reviewers;11
4;Acknowledgments;12
5;Introduction;13
6;Chapter 1: Scala: Functional Programming Aspects;15
6.1;What Is Functional Programming?;16
6.1.1;What Is a Pure Function?;16
6.1.2;Example of Pure Function;17
6.2;Scala Programming Features;18
6.2.1;Variable Declaration and Initialization;19
6.2.2;Type Inference;20
6.2.3;Immutability;21
6.2.4;Lazy Evaluation;22
6.2.5;String Interpolation;24
6.2.5.1;String - s Interpolator;25
6.2.5.2;String - f Interpolator;26
6.2.5.3;String - raw Interpolator;27
6.2.6;Pattern Matching;27
6.2.7;Scala Class vs. Object;28
6.2.8;Singleton Object;29
6.2.9;Companion Classes and Objects;31
6.2.10;Case Classes;32
6.2.10.1;Pattern Matching on Case Classes;34
6.2.11;Scala Collections;35
6.2.11.1;Iterating Over the Collection;37
6.2.11.2;Common Methods of Collection;39
6.3;Functional Programming Aspects of Scala;41
6.3.1;Anonymous Functions;41
6.3.2;Higher Order Functions;43
6.3.3;Function Composition;44
6.3.4;Function Currying;45
6.3.5;Nested Functions;46
6.3.6;Functions with Variable Length Parameters;48
6.4;Reference Links;51
6.5;Points to Remember;51
7;Chapter 2: Single and Multinode Cluster Setup;52
7.1;Spark Multinode Cluster Setup;52
7.1.1;Recommended Platform;52
7.1.1.1;Operating System;53
7.1.2;Prerequisites;74
7.1.3;Spark Installation Steps;75
7.1.4;Spark Web UI;79
7.1.4.1;Spark Master UI;80
7.1.4.2;Spark Application UI;81
7.1.5;Stopping the Spark Cluster;83
7.2;Spark Single-Node Cluster Setup;83
7.2.1;Prerequisites;84
7.2.2;Spark Installation Steps;86
7.2.3;Spark Master UI;89
7.3;Points to Remember;90
8;Chapter 3: Introduction to Apache Spark and Spark Core;91
8.1;What Is Apache Spark?;92
8.2;Why Apache Spark?;92
8.3;Spark vs. Hadoop MapReduce;93
8.4;Apache Spark Architecture;94
8.5;Spark Components;96
8.5.1;Spark Core (RDD);96
8.5.2;Spark SQL;96
8.5.3;Spark Streaming;97
8.5.4;MLlib;97
8.5.5;GraphX;97
8.5.6;SparkR;97
8.6;Spark Shell;97
8.7;Spark Core: RDD;98
8.7.1;RDD Operations;100
8.7.1.1;Transformations;100
8.7.1.2;Actions;100
8.7.2;Creating an RDD;100
8.7.2.1;Using Parallelized Collection;100
8.7.2.2;From External Data Source;101
8.7.2.3;Creating an RDD from the Hadoop File System;102
8.7.2.4;Creating an RDD: File Partitioning;102
8.8;RDD Transformations;103
8.9;RDD Actions;107
8.10;Working with Pair RDDs;110
8.11;Directed Acyclic Graph in Apache Spark;113
8.11.1;How DAG Works in Spark;113
8.11.2;How Spark Achieves Fault Tolerance Through DAG;115
8.12;Persisting RDD;116
8.13;Shared Variables;117
8.13.1;Broadcast Variables;118
8.13.2;Accumulators;118
8.14;Simple Build Tool (SBT);119
8.15;Assignments;124
8.16;Reference Links;124
8.17;Points to Remember;125
9;Chapter 4: Spark SQL, DataFrames, and Datasets;126
9.1;What Is Spark SQL?;127
9.1.1;Datasets and DataFrames;127
9.2;Spark Session;127
9.3;Creating DataFrames;128
9.3.1;DataFrame Operations;129
9.3.1.1;Untyped DataFrame Operation: Select;130
9.3.1.2;Untyped DataFrame Operation: Filter;130
9.3.1.3;Untyped DataFrame Operation: Aggregate Operations;131
9.3.2;Running SQL Queries Programmatically;132
9.3.2.1;Creating Views;132
9.3.3;Dataset Operations;134
9.3.4;Interoperating with RDDs;136
9.3.4.1;Reflection-Based Approach to Infer Schema;136
9.3.5;Different Data Sources;140
9.3.5.1;Generic Load and Save Functions;140
9.3.5.2;Manually Specifying Options;141
9.3.5.3;Run SQL on Files Directly;141
9.3.5.4;JDBC to External Databases;143
9.3.6;Working with Hive Tables;144
9.3.7;Building Spark SQL Application with SBT;146
9.4;Points to Remember;150
10;Chapter 5: Introduction to Spark Streaming;151
10.1;Data Processing;152
10.2;Streaming Data;152
10.2.1;Why Streaming Data Are Important;152
10.3;Introduction to Spark Streaming;152
10.3.1;Internal Working of Spark Streaming;153
10.3.2;Spark Streaming Concepts;154
10.3.2.1;Discretized Streams (DStream);154
10.3.2.2;Streaming Context;154
10.3.2.3;DStream Operations;154
10.4;Spark Streaming Example Using TCP Socket;155
10.5;Stateful Streaming;159
10.5.1;Window-Based Streaming;159
10.5.2;Full-Session-Based Streaming;162
10.6;Streaming Applications Considerations;165
10.7;Points to Remember;166
11;Chapter 6: Spark Structured Streaming;167
11.1;What Is Spark Structured Streaming?;168
11.2;Spark Structured Streaming Programming Model;168
11.2.1;Word Count Example Using Structured Streaming;170
11.3;Creating Streaming DataFrames and Streaming Datasets;173
11.4;Operations on Streaming DataFrames/Datasets;174
11.5;Stateful Streaming: Window Operations on Event-Time;177
11.6;Stateful Streaming: Handling Late Data and Watermarking;180
11.7;Triggers;181
11.8;Fault Tolerance;183
11.9;Points to Remember;184
12;Chapter 7: Spark Streaming with Kafka;185
12.1;Introduction to Kafka;185
12.1.1;Kafka Core Concepts;186
12.1.2;Kafka APIs;186
12.2;Kafka Fundamental Concepts;187
12.3;Kafka Architecture;188
12.3.1;Kafka Topics;189
12.3.2;Leaders and Replicas;189
12.4;Setting Up the Kafka Cluster;190
12.5;Spark Streaming and Kafka Integration;192
12.6;Spark Structured Streaming and Kafka Integration;195
12.7;Points to Remember;197
13;Chapter 8: Spark Machine Learning Library;198
13.1;What Is Spark MLlib?;199
13.1.1;Spark MLlib APIs;199
13.1.2;Vectors in Scala;200
13.1.2.1;Vector Representation in Spark;202
13.1.3;Basic Statistics;203
13.1.3.1;Correlation;204
13.1.3.2;Hypothesis Testing;207
13.1.4;Extracting, Transforming, and Selecting Features;209
13.1.4.1;Feature Extractors;210
13.1.4.1.1;Term Frequency–Inverse Document Frequency (TF–IDF);210
13.1.4.1.2;Example;212
13.1.4.2;Feature Transformers;215
13.1.4.2.1;Tokenizer;215
13.1.4.2.2;StopWordsRemover;216
13.1.4.2.3;StringIndexer;218
13.1.4.3;Feature Selectors;220
13.1.4.3.1;VectorSlicer;221
13.1.5;ML Pipelines;224
13.1.5.1;Pipeline Components;225
13.1.5.1.1;Estimators;225
13.1.5.1.2;Transformers;225
13.1.5.1.3;Pipeline Examples;225
13.1.5.2;Machine Learning Regression and Classification Algorithms;233
13.1.5.2.1;Regression Algorithms;233
13.1.5.2.1.1;Linear Regression;233
13.1.5.2.2;Classification Algorithms;238
13.1.5.2.2.1;Logistic Regression;238
13.1.5.2.3;Clustering Algorithms;243
13.1.5.2.3.1;K-Means Clustering;243
13.2;Points to Remember;245
14;Chapter 9: Working with SparkR;246
14.1;Introduction to SparkR;246
14.1.1;SparkDataFrame;246
14.1.2;SparkSession;247
14.2;Starting SparkR from RStudio;247
14.3;Creating SparkDataFrames;250
14.3.1;From a Local R DataFrame;250
14.3.2;From Other Data Sources;251
14.3.3;From Hive Tables;252
14.4;SparkDataFrame Operations;253
14.4.1;Selecting Rows and Columns;253
14.4.2;Grouping and Aggregation;254
14.4.3;Operating on Columns;256
14.5;Applying User-Defined Functions;257
14.5.1;Run a Given Function on a Large Data Set Using dapply or dapplyCollect;257
14.6;Running SQL Queries from SparkR;258
14.7;Machine Learning Algorithms;259
14.7.1;Regression and Classification Algorithms;259
14.7.1.1;Linear Regression;259
14.7.2;Logistic Regression;264
14.7.3;Decision Tree;267
14.8;Points to Remember;269
15;Chapter 10: Spark Real-Time Use Case;270
15.1;Data Analytics Project Architecture;271
15.1.1;Data Ingestion;271
15.1.2;Data Storage;272
15.1.3;Data Processing;272
15.1.4;Data Visualization;273
15.2;Use Cases;273
15.2.1;Event Detection Use Case;273
15.2.2;Build Procedure;279
15.2.3;Building the Application with SBT;280
15.3;Points to Remember;282
16;Index;283