E-book, English, 243 pages
Zubair Nabi: Pro Spark Streaming
1st ed.
ISBN: 978-1-4842-1479-4
Publisher: Apress
Format: PDF
Copy protection: PDF watermark
The Zen of Real-Time Analytics Using Apache Spark
Learn the cutting-edge skills needed to leverage Spark Streaming for a wide array of real-time streaming applications. Pro Spark Streaming walks you through end-to-end real-time application development using real-world applications, data, and code. Taking an application-first approach, each chapter introduces use cases from a specific industry and uses publicly available datasets from that domain to unravel the intricacies of production-grade design and implementation. The domains covered include social media, the sharing economy, finance, online advertising, telecommunications, and the IoT.

In recent years, Spark has become synonymous with big data processing. DStreams enhance the underlying Spark processing engine to support streaming analysis with a novel micro-batch processing model. Pro Spark Streaming by Zubair Nabi will enable you to become a specialist in latency-sensitive applications by leveraging the key features of DStreams, micro-batch processing, and functional programming. To this end, the book includes ready-to-deploy examples and actual code.
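As a toy illustration of the micro-batch idea described above (plain Python, not Spark code; the sample feed, batch size, and word-count logic are invented for illustration): a conceptually unbounded stream is chopped into small finite batches, and each batch is then processed with the same batch-style operations that would apply to a static dataset.

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop an (in principle unbounded) iterator into small, finite batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def word_count(batch):
    """Batch-style processing: the same counting logic that would run on static data."""
    return Counter(word for line in batch for word in line.split())

# A stand-in for a live feed of text lines.
feed = ["spark streams data", "spark processes data", "streams of data"]

totals = Counter()
for batch in micro_batches(feed, batch_size=2):
    totals += word_count(batch)  # running state carried across micro-batches

print(totals.most_common(3))
```

In Spark Streaming proper, the batch boundary is a time interval rather than a record count, each batch materializes as an RDD, and state across batches is managed by operations such as updateStateByKey (covered in Chapters 3 and 6).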
Pro Spark Streaming will act as the bible of Spark Streaming.

What You'll Learn:
- Spark Streaming application development and best practices
- Low-level details of discretized streams
- The application and vitality of streaming analytics to a number of industries and domains
- Optimization of production-grade deployments of Spark Streaming via configuration recipes and instrumentation using Graphite, collectd, and Nagios
- Ingestion of data from disparate sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver
- Integration and coupling with HBase, Cassandra, and Redis
- Design patterns for side effects and maintaining state across the Spark Streaming micro-batch model
- Real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR
- Streaming machine learning, predictive analytics, and recommendations
- Meshing batch processing with stream processing via the Lambda architecture

Who This Book Is For:
The audience includes data scientists, big data experts, BI analysts, and data architects.
Zubair Nabi is one of the very few computer scientists who have solved Big Data problems in all three domains: academia, research, and industry. He currently works at Qubit, a London-based start-up backed by Goldman Sachs, Accel Partners, Salesforce Ventures, and Balderton Capital, which helps retailers understand their customers and provide personalized customer experiences, and whose rapidly growing client base includes Staples, Emirates, Thomas Cook, and Topshop. Prior to Qubit, he was a researcher at IBM Research, where he worked at the intersection of Big Data systems and analytics to solve real-world problems in the telecommunication, electricity, and urban dynamics space.
Zubair's work has been featured in MIT Technology Review, SciDev, CNET, and Asian Scientist, and on Swedish National Radio, among other outlets. He has authored more than 20 research papers, published at some of the top venues in computer science, including USENIX Middleware, ECML PKDD, and IEEE BigData, and he also has a number of patents to his credit.
Zubair has an MPhil in computer science with distinction from Cambridge.
Authors/Editors
More Information & Material
1;Contents at a Glance;6
2;Contents;8
3;About the Author;14
4;About the Technical Reviewer;16
5;Acknowledgments;18
6;Introduction;20
7;Chapter 1: The Hitchhiker’s Guide to Big Data;21
7.1; Before Spark;21
7.1.1; The Era of Web 2.0;22
7.1.1.1; From SQL to NoSQL;22
7.1.1.2; MapReduce: The Swiss Army Knife of Distributed Data Processing;23
7.1.1.2.1; Word Count à la MapReduce;24
7.1.1.2.2; Hadoop: An Elephant with Big Dreams;25
7.1.2; Sensors, Sensors Everywhere;26
7.2; Spark Streaming: At the Intersection of MapReduce and CEP;28
8;Chapter 2: Introduction to Spark;29
8.1; Installation;30
8.2; Execution;31
8.2.1; Standalone Cluster;31
8.2.1.1; Master;31
8.2.1.2; Workers;31
8.2.1.3; UI;31
8.2.2; YARN;32
8.3; First Application;32
8.3.1; Build;34
8.3.2; Execution;35
8.3.2.1; Local Execution;35
8.3.2.2; Standalone Cluster;35
8.3.2.3; YARN;37
8.4; SparkContext;37
8.4.1; Creation of RDDs;37
8.4.2; Handling Dependencies;38
8.4.3; Creating Shared Variables;39
8.4.4; Job Execution;40
8.5; RDD;40
8.5.1; Persistence;41
8.5.2; Transformations;42
8.5.3; Actions;46
8.6; Summary;47
9;Chapter 3: DStreams: Real-Time RDDs;48
9.1;From Continuous to Discretized Streams;48
9.2;First Streaming Application;49
9.2.1;Build and Execution;51
9.2.2;StreamingContext;51
9.2.2.1;Creating DStreams;52
9.2.2.2;DStream Consolidation;53
9.2.2.3;Job Execution;53
9.3;DStreams;53
9.3.1;The Anatomy of a Spark Streaming Application;55
9.3.2;Transformations;59
9.3.2.1;Mapping;59
9.3.2.1.1;map[U](function): DStream[U];59
9.3.2.1.1.1;mapPartitions[U](function): DStream[U];59
9.3.2.1.1.2;flatMap[U](function): DStream[U];60
9.3.2.1.1.3;filter(function): DStream[T];60
9.3.2.1.1.4;transform[U](function): DStream[U];60
9.3.2.2;Variation;61
9.3.2.2.1;union(that: DStream[T]): DStream[T];61
9.3.2.2.2;repartition(numPartitions: Int): DStream[T];61
9.3.2.2.3;glom(): DStream[Array[T]];61
9.3.2.3;Aggregation;61
9.3.2.3.1;count(): DStream[Long];61
9.3.2.3.2;countByValue(): DStream[(T, Long)];61
9.3.2.3.3;reduce(reduceFunc: (T, T) => T): DStream[T];62
9.3.2.4;Key-value;62
9.3.2.4.1;groupByKey(): DStream[(K, Iterable[V])];62
9.3.2.4.2;reduceByKey(reduceFunc: (V, V) => V): DStream[(K, V)];62
9.3.2.4.3;combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiner: (C, C) => C, partitioner: Partitioner): DStream[(K, C)];63
9.3.2.4.4;join[W](other: DStream[(K, W)]): DStream[(K, (V, W))];63
9.3.2.4.5;cogroup[W](other: DStream[(K, W)]): DStream[(K, (Iterable[V], Iterable[W]))];64
9.3.2.4.6;updateStateByKey[S](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)];64
9.3.2.5;Windowing;65
9.3.2.5.1;window(windowDuration: Duration, slideDuration: Duration): DStream[T];65
9.3.2.6;Actions;68
9.3.2.6.1;print(num: Int): Unit;68
9.3.2.6.2;print(): Unit;68
9.3.2.6.3;saveAsObjectFiles(prefix: String): Unit;68
9.3.2.6.4;saveAsTextFiles(prefix: String): Unit;68
9.3.2.6.5;saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit;68
9.3.2.6.6;saveAsNewAPIHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit;68
9.3.2.6.7;foreachRDD(foreachFunc: (RDD[T]) => Unit): Unit;69
9.4;Summary;69
10;Chapter 4: High-Velocity Streams: Parallelism and Other Stories;70
10.1; One Giant Leap for Streaming Data;70
10.2; Parallelism;72
10.2.1; Worker;72
10.2.2; Executor;73
10.2.2.1; Choosing the Number of Executors;74
10.2.2.2; Dynamic Executor Allocation;74
10.2.3; Task;75
10.2.3.1; Parallelism, Partitions, and Tasks;75
10.2.3.2;Task Parallelism;76
10.3; Batch Intervals;78
10.4; Scheduling;79
10.4.1; Inter-application Scheduling;79
10.4.2; Batch Scheduling;80
10.4.3; Inter-job Scheduling;80
10.4.4; One Action, One Job;80
10.5; Memory;82
10.5.1; Serialization;82
10.5.2; Compression;84
10.5.3; Garbage Collection;84
10.6; Every Day I’m Shuffling;85
10.6.1; Early Projection and Filtering;85
10.6.2; Always Use a Combiner;85
10.6.3; Generous Parallelism;85
10.6.4; File Consolidation;85
10.6.5; More Memory;85
10.7; Summary;86
11;Chapter 5: Real-Time Route 66: Linking External Data Sources;87
11.1;Smarter Cities, Smarter Planet, Smarter Everything;87
11.2;ReceiverInputDStream;89
11.3;Sockets;90
11.4;MQTT;98
11.5;Flume;102
11.5.1;Push-Based Flume Ingestion;103
11.5.2;Pull-Based Flume Ingestion;104
11.6;Kafka;104
11.6.1;Receiver-Based Kafka Consumer;107
11.6.2;Direct Kafka Consumer;109
11.7;Twitter;110
11.8;Block Interval;111
11.9;Custom Receiver;111
11.9.1;HttpInputDStream;112
11.10;Summary;115
12;Chapter 6: The Art of Side Effects;116
12.1;Taking Stock of the Stock Market;116
12.2;foreachRDD;118
12.2.1;Per-Record Connection;120
12.2.2;Per-Partition Connection;120
12.2.3;Static Connection;121
12.2.4;Lazy Static Connection;122
12.2.5;Static Connection Pool;123
12.3;Scalable Streaming Storage;125
12.3.1;HBase;125
12.3.2;Stock Market Dashboard;127
12.3.3;SparkOnHBase;129
12.3.4;Cassandra;130
12.3.5;Spark Cassandra Connector;132
12.4;Global State;133
12.4.1;Static Variables;133
12.4.2;updateStateByKey();135
12.4.3;Accumulators;136
12.4.4;External Solutions;138
12.4.4.1;Redis;138
12.5;Summary;140
13;Chapter 7: Getting Ready for Prime Time;141
13.1;Every Click Counts;141
13.2;Tachyon (Alluxio);142
13.3;Spark Web UI;144
13.3.1;Historical Analysis;158
13.3.2;RESTful Metrics;158
13.4;Logging;159
13.5;External Metrics;160
13.6;System Metrics;162
13.7;Monitoring and Alerting;163
13.8;Summary;165
14;Chapter 8: Real-Time ETL and Analytics Magic;166
14.1; The Power of Transaction Data Records;166
14.2; First Streaming Spark SQL Application;168
14.3; SQLContext;170
14.3.1; Data Frame Creation;170
14.3.1.1; Existing RDDs;170
14.3.1.2; Dynamic Schemas;170
14.3.1.3; Scala Sequence;172
14.3.1.4; RDDs with JSON;172
14.3.1.5; External Database;172
14.3.1.6; Parquet;172
14.3.1.7; Hive Table;173
14.3.2; SQL Execution;173
14.3.3; Configuration;173
14.3.4; User-Defined Functions;174
14.3.5; Catalyst: Query Execution and Optimization;175
14.3.6; HiveContext;175
14.4; Data Frame;176
14.4.1; Types;177
14.4.2; Query Transformations;177
14.4.2.1; select(col: String, cols: String*): DataFrame;177
14.4.2.2; select(cols: Column*): DataFrame;177
14.4.2.3; filter(conditionExpr: String): DataFrame;177
14.4.2.4; drop(colName: String): DataFrame;178
14.4.2.5; where(condition: Column): DataFrame;178
14.4.2.6; limit(n: Int): DataFrame;178
14.4.2.7; withColumn(colName: String, col: Column): DataFrame;178
14.4.2.8; groupBy(col1: String, cols: String*): GroupedData;178
14.4.2.9; agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame;179
14.4.2.10; orderBy(sortCol: String, sortCols: String*): DataFrame;179
14.4.2.11; rollup(col1: String, cols: String*): GroupedData;180
14.4.2.12; cube(col1: String, cols: String*): GroupedData;180
14.4.2.13; dropDuplicates(colNames: Seq[String]): DataFrame;180
14.4.2.14; sample(withReplacement: Boolean, fraction: Double): DataFrame;180
14.4.2.15; except(other: DataFrame): DataFrame;180
14.4.2.16; intersect(other: DataFrame): DataFrame;181
14.4.2.17; unionAll(other: DataFrame): DataFrame;181
14.4.2.18; join(right: DataFrame, joinExprs: Column): DataFrame;181
14.4.2.19; na;182
14.4.2.20; stats;182
14.4.3; Actions;183
14.4.3.1; format(source: String): DataFrameWriter;183
14.4.3.2; save(path: String): Unit;183
14.4.3.3; parquet(path: String): Unit;184
14.4.3.4; json(path: String): Unit;184
14.4.3.5; saveAsTable(tableName: String): Unit;184
14.4.3.6; mode(saveMode: SaveMode): DataFrameWriter;184
14.4.3.7; partitionBy(colNames: String*): DataFrameWriter;184
14.4.3.8; insertInto(tableName: String): Unit;184
14.4.3.9; jdbc(url: String, table: String, connectionProperties: Properties): Unit;184
14.4.4; RDD Operations;185
14.4.5; Persistence;185
14.4.6; Best Practices;185
14.5; SparkR;185
14.6; First SparkR Application;186
14.6.1; Execution;187
14.6.2; Streaming SparkR;188
14.7; Summary;190
15;Chapter 9: Machine Learning at Scale;191
15.1;Sensor Data Storm;191
15.2;Streaming MLlib Application;193
15.3;MLlib;196
15.3.1;Data Types;196
15.3.2;Statistical Analysis;198
15.3.3;Preprocessing;199
15.4;Feature Selection and Extraction;200
15.4.1;Chi-Square Selection;200
15.4.2;Principal Component Analysis;201
15.5;Learning Algorithms;201
15.5.1;Classification;202
15.5.2;Clustering;203
15.5.3;Recommendation Systems;204
15.5.4;Frequent Pattern Mining;207
15.6;Streaming ML Pipeline Application;208
15.7;ML;210
15.8;Cross-Validation of Pipelines;211
15.9;Summary;212
16;Chapter 10: Of Clouds, Lambdas, and Pythons;213
16.1;A Good Review Is Worth a Thousand Ads;214
16.2; Google Dataproc;214
16.3;First Spark on Dataproc Application;219
16.4; PySpark;226
16.5; Lambda Architecture;228
16.5.1; Lambda Architecture using Spark Streaming on Google Cloud Platform;229
16.6; Streaming Graph Analytics;236
16.7; Summary;239
17;Index;240




