Kotu / Deshpande | Predictive Analytics and Data Mining | E-Book | www.sack.de

E-book, English, 446 pages

Kotu / Deshpande Predictive Analytics and Data Mining

Concepts and Practice with RapidMiner
1st edition, 2014
ISBN: 978-0-12-801650-3
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: Adobe DRM




Put Predictive Analytics into Action

Learn the basics of predictive analytics and data mining through an easy-to-understand conceptual framework, and immediately practice the concepts learned using the open source RapidMiner tool. Whether you are brand new to data mining or working on your tenth project, this book will show you how to analyze data and uncover hidden patterns and relationships to aid important decisions and predictions. Data mining has become an essential tool for any enterprise that collects, stores, and processes data as part of its operations. This book is ideal for business users, data analysts, business analysts, business intelligence and data warehousing professionals, and anyone who wants to learn data mining.

You'll be able to:

1. Gain the necessary knowledge of different data mining techniques, so that you can select the right technique for a given data problem and create a general purpose analytics process.
2. Get up and running fast with more than two dozen commonly used, powerful algorithms for predictive analytics, using practical use cases.
3. Implement a simple step-by-step process for predicting an outcome or discovering hidden relationships from the data using RapidMiner, an open source GUI-based data mining tool.

Predictive analytics and data mining techniques covered: Exploratory Data Analysis, Visualization, Decision trees, Rule induction, k-Nearest Neighbors, Naïve Bayesian, Artificial Neural Networks, Support Vector Machines, Ensemble models, Bagging, Boosting, Random Forests, Linear regression, Logistic regression, Association analysis using Apriori and FP-Growth, K-Means clustering, Density-based clustering, Self-Organizing Maps, Text Mining, Time series forecasting, Anomaly detection, and Feature selection.
Implementation files can be downloaded from the book companion site at www.LearnPredictiveAnalytics.com.

- Demystifies data mining concepts with easy to understand language
- Shows how to get up and running fast with 20 commonly used powerful techniques for predictive analysis
- Explains the process of using open source RapidMiner tools
- Discusses a simple 5-step process for implementing algorithms that can be used for performing predictive analytics
- Includes practical use cases and examples

Vijay Kotu is Vice President of Analytics at ServiceNow. He leads the implementation of large-scale data platforms and services to support the company's enterprise business. He has led analytics organizations for over a decade with a focus on data strategy, business intelligence, machine learning, experimentation, engineering, enterprise adoption, and building analytics talent. Prior to joining ServiceNow, he was Vice President of Analytics at Yahoo. He worked at Life Technologies and Adteractive, where he led marketing analytics, created algorithms to optimize online purchasing behavior, and developed data platforms to manage marketing campaigns. He is a member of the Association for Computing Machinery and a member of the Advisory Board at RapidMiner.

Further Info & Material


Front Cover
Predictive Analytics and Data Mining
Copyright
Dedication
Contents
Foreword
Preface
    WHY THIS BOOK?
    WHO CAN USE THIS BOOK?
Acknowledgments
Chapter 1 - Introduction
    1.1 WHAT DATA MINING IS
    1.2 WHAT DATA MINING IS NOT
    1.3 THE CASE FOR DATA MINING
    1.4 TYPES OF DATA MINING
    1.5 DATA MINING ALGORITHMS
    1.6 ROADMAP FOR UPCOMING CHAPTERS
    REFERENCES
Chapter 2 - Data Mining Process
    2.1 PRIOR KNOWLEDGE
    2.2 DATA PREPARATION
    2.3 MODELING
    2.4 APPLICATION
    2.5 KNOWLEDGE
    WHAT'S NEXT?
    REFERENCES
Chapter 3 - Data Exploration
    3.1 OBJECTIVES OF DATA EXPLORATION
    3.2 DATA SETS
    3.3 DESCRIPTIVE STATISTICS
    3.4 DATA VISUALIZATION
    3.5 ROADMAP FOR DATA EXPLORATION
    REFERENCES
Chapter 4 - Classification
    4.1 DECISION TREES
    4.2 RULE INDUCTION
    4.3 K-NEAREST NEIGHBORS
    4.4 NAÏVE BAYESIAN
    4.5 ARTIFICIAL NEURAL NETWORKS
    4.6 SUPPORT VECTOR MACHINES
    4.7 ENSEMBLE LEARNERS
    REFERENCES
Chapter 5 - Regression Methods
    5.1 LINEAR REGRESSION
    5.2 LOGISTIC REGRESSION
    CONCLUSION
    REFERENCES
Chapter 6 - Association Analysis
    6.1 CONCEPTS OF MINING ASSOCIATION RULES
    6.2 APRIORI ALGORITHM
    6.3 FP-GROWTH ALGORITHM
    CONCLUSION
    REFERENCES
Chapter 7 - Clustering
    CLUSTERING TO DESCRIBE THE DATA
    CLUSTERING FOR PREPROCESSING
    7.1 TYPES OF CLUSTERING TECHNIQUES
    7.2 K-MEANS CLUSTERING
    7.3 DBSCAN CLUSTERING
    7.4 SELF-ORGANIZING MAPS
    REFERENCES
Chapter 8 - Model Evaluation
    8.1 CONFUSION MATRIX (OR TRUTH TABLE)
    8.2 RECEIVER OPERATOR CHARACTERISTIC (ROC) CURVES AND AREA UNDER THE CURVE (AUC)
    8.3 LIFT CURVES
    8.4 EVALUATING THE PREDICTIONS: IMPLEMENTATION
    CONCLUSION
    REFERENCES
Chapter 9 - Text Mining
    9.1 HOW TEXT MINING WORKS
    9.2 IMPLEMENTING TEXT MINING WITH CLUSTERING AND CLASSIFICATION
    CONCLUSION
    REFERENCES
Chapter 10 - Time Series Forecasting
    10.1 DATA-DRIVEN APPROACHES
    10.2 MODEL-DRIVEN FORECASTING METHODS
    CONCLUSION
    REFERENCES
Chapter 11 - Anomaly Detection
    11.1 ANOMALY DETECTION CONCEPTS
    11.3 DENSITY-BASED OUTLIER DETECTION
    11.4 LOCAL OUTLIER FACTOR
    CONCLUSION
    REFERENCES
Chapter 12 - Feature Selection
    12.1 CLASSIFYING FEATURE SELECTION METHODS
    12.2 PRINCIPAL COMPONENT ANALYSIS
    12.3 INFORMATION THEORY–BASED FILTERING FOR NUMERIC DATA
    CATEGORICAL DATA
    12.5 WRAPPER-TYPE FEATURE SELECTION
    CONCLUSION
    REFERENCES
Chapter 13 - Getting Started with RapidMiner
    13.1 USER INTERFACE AND TERMINOLOGY
    13.2 DATA IMPORTING AND EXPORTING TOOLS
    13.3 DATA VISUALIZATION TOOLS
    13.4 DATA TRANSFORMATION TOOLS
    13.5 SAMPLING AND MISSING VALUE TOOLS
    CONCLUSION
    REFERENCES
Comparison of Data Mining Algorithms
Index
About the Authors


Chapter 2

Data Mining Process


Abstract


Successfully uncovering patterns using data mining is an iterative process. Chapter 2 provides a framework to solve the data mining problem. The five-step process outlined in this chapter provides guidelines on gathering subject matter expertise; exploring the data with statistics and visualization; building a model using data mining algorithms; testing the model and deploying it in a production environment; and finally reflecting on the new knowledge gained in the cycle. Over the years, different frameworks for the data mining process have been put forward by various academic and commercial bodies, such as the Cross Industry Standard Process for Data Mining (CRISP-DM) and knowledge discovery in databases (KDD). These frameworks exhibit common characteristics, and hence we will be using a generic framework closely resembling the CRISP process.

Keywords


CRISP; KDD; data mining process; prior knowledge; modeling; data preparation; evaluation; application
The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities known as the data mining process. The standard data mining process involves (1) understanding the problem, (2) preparing the data samples, (3) developing the model, (4) applying the model to a data set to see how it may work in the real world, and (5) production deployment. Over the years, different frameworks for the data mining process have been put forward by various academic and commercial bodies. In this chapter, we will discuss the key steps involved in building a successful data mining solution. The framework we put forward in this chapter is synthesized from a few data mining frameworks and is explained using a simple example data set. This chapter serves as a high-level roadmap for building deployable data mining models, and discusses the challenges faced in each step, as well as important considerations and pitfalls to avoid. Most of the concepts discussed in this chapter are revisited later in the book with detailed explanations and examples.
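The five steps above can be sketched as plain functions. This is an illustrative skeleton only: the records, function names, and the trivial mean-rate "model" are all invented for this sketch and are not part of the book's companion material (the book implements each step in RapidMiner rather than in code).

```python
def understand_problem():
    # Step 1: prior knowledge -- frame the question in its business context.
    return "predict the interest rate for a new borrower"

def prepare_data(raw):
    # Step 2: data preparation -- here, simply drop records with missing values.
    return [r for r in raw if all(v is not None for v in r.values())]

def build_model(rows):
    # Step 3: modeling -- a placeholder model that predicts the mean rate
    # of the training sample for every borrower.
    mean_rate = sum(r["rate"] for r in rows) / len(rows)
    return lambda borrower: mean_rate

def apply_model(model, new_borrowers):
    # Step 4: application -- score unseen records, a dress rehearsal
    # for production deployment.
    return [model(b) for b in new_borrowers]

def gather_knowledge(predictions):
    # Step 5: knowledge -- reflect on the cycle; in practice this often
    # loops back to step 1 with a sharper question.
    return {"n_scored": len(predictions)}

raw = [{"score": 700, "rate": 5.0},
       {"score": 650, "rate": None},   # incomplete record, removed in step 2
       {"score": 800, "rate": 4.0}]
model = build_model(prepare_data(raw))
print(gather_knowledge(apply_model(model, [{"score": 720}])))
# the mean-rate placeholder predicts 4.5 for any borrower
```

The point of the skeleton is the loop structure, not the model: each function's output feeds the next step, and an unsatisfying step 5 sends you back to step 1.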
One of the most popular data mining process frameworks is CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining. This framework was developed by a consortium of companies involved in data mining (Chapman et al., 2000) and is the most widely adopted framework for developing data mining solutions. Figure 2.1 provides a visual overview of the CRISP-DM framework. Other data mining frameworks are SEMMA, an acronym for Sample, Explore, Modify, Model, and Assess, developed by the SAS Institute (SAS Institute, 2013); DMAIC, an acronym for Define, Measure, Analyze, Improve, and Control, used in Six Sigma practice (Kubiak & Benbow, 2005); and the Selection, Preprocessing, Transformation, Data Mining, Interpretation, and Evaluation framework used in the knowledge discovery in databases (KDD) process (Fayyad et al., 1996). We feel all these frameworks exhibit common characteristics, and hence we will be using a generic framework closely resembling the CRISP process. As with any process framework, a data mining process recommends performing a certain set of tasks to achieve optimal output. The process of extracting information from data is iterative: the steps within the data mining process are not linear and have many loops, going back and forth between steps and at times going back to the first step to redefine the data mining problem statement.

Figure 2.1 CRISP data mining framework.
The data mining process presented in Figure 2.2 is a generic set of steps that is business, algorithm, and data mining tool agnostic. The fundamental objective of any process that involves data mining is to address the analysis question. The problem at hand could be segmentation of customers, predicting climate patterns, or a simple data exploration. The algorithm used to solve the business question could be automated clustering or an artificial neural network. The software tools used to develop and implement the data mining algorithm could be custom coding, IBM SPSS, SAS, R, or RapidMiner, to mention a few.
Data mining, specifically in the context of big data, has gained a lot of importance in the last few years. Perhaps the most visible and discussed part of data mining is the third step: modeling. It involves building representative models that can be derived from the sample data set and used either for predictions (predictive modeling) or for describing the underlying pattern in the data (descriptive or explanatory modeling). Rightfully so, there is plenty of academic and business research in this step, and we have dedicated most of the book to discussing various algorithms and the quantitative foundations that go with them. We specifically wish to emphasize considering data mining as an end-to-end, multistep, iterative process instead of just a model building step. Seasoned data mining practitioners can attest to the fact that the most time-consuming part of the overall data mining process is not the model building, but the preparation of data, followed by data and business understanding. There are many data mining tools, both open source and commercial, available in the market that can automate the model building; the most commonly used are RapidMiner, R, Weka, SAS, SPSS, Oracle Data Miner, Salford, Statistica, etc. (Piatetsky, 2014). Asking the right business questions, gaining in-depth business understanding, sourcing and preparing the data for the data mining task, mitigating implementation considerations, and, most useful of all, gaining knowledge from the data mining process remain crucial to its success. Let's get started with Step 1: framing the data mining question and understanding the context.

Figure 2.2 Data mining process.

2.1. Prior Knowledge


Prior knowledge refers to information that is already known about a subject. The objective of data mining doesn’t emerge in isolation; it always develops on top of existing subject matter and contextual information that is already known. The prior knowledge step in the data mining process helps to define what problem we are solving, how it fits in the business context, and what data we need to solve the problem.

2.1.1. Objective


The data mining process starts with an analysis need, a question, or a business objective. This is possibly the most important step in the data mining process (Shearer, 2000). Without a well-defined statement of the problem, it is impossible to come up with the right data set and pick the right data mining algorithm. Even though the data mining process is sequential, it is common to go back to previous steps and revise the assumptions, approach, and tactics. It is imperative to get the objective of the whole process right, even if it is exploratory data mining.
We are going to explain the data mining process using a hypothetical example. Let's assume we are in the consumer loan business, where a loan is provisioned for individuals with the collateral of assets like a home or car, i.e., a mortgage or an auto loan. As many homeowners know, an important component of the loan, for both the borrower and the lender, is the interest rate at which the borrower repays the loan on top of the principal. The interest rate on a loan depends on a gamut of variables like the current federal funds rate as determined by the central bank, the borrower's credit score, income level, home value, initial deposit (down payment) amount, current assets and liabilities of the borrower, etc. The key factor here is whether the lender sees enough reward (interest on the loan) for the risk of losing the principal (the borrower's default on the loan). In an individual case, the default status of a loan is Boolean: either one defaults or not during the period of the loan. But in a group of tens of thousands of borrowers, we can find the default rate—a continuous numeric variable that indicates the percentage of borrowers who default on their loans. All the variables related to the borrower, like credit score, income, current liabilities, etc., are used to assess the default risk in a related group; based on this, the interest rate is determined for a loan. The business objective of this hypothetical use case is: if we know the interest rates of past borrowers with a range of credit scores, can we predict the interest rate for a new borrower?
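As a rough sketch of this objective, the snippet below fits an ordinary least squares line from credit score to interest rate and then scores a new borrower. The numbers are invented for illustration and do not come from the book's data set; Chapter 5 treats linear regression properly.

```python
# Hypothetical past borrowers: (credit score, interest rate in %)
past = [(500, 9.0), (600, 7.5), (700, 6.0), (800, 4.5)]

n = len(past)
mean_x = sum(x for x, _ in past) / n
mean_y = sum(y for _, y in past) / n

# Ordinary least squares for one predictor:
# slope = covariance(score, rate) / variance(score)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in past)
         / sum((x - mean_x) ** 2 for x, _ in past))
intercept = mean_y - slope * mean_x

def predict_rate(credit_score):
    # Score a new borrower with the fitted line.
    return intercept + slope * credit_score

print(round(predict_rate(750), 2))  # prints 5.25
```

The negative slope encodes the business intuition stated above: a higher credit score implies lower default risk, and hence a lower interest rate.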

2.1.2. Subject Area


The process of data mining uncovers hidden patterns in the data set by exposing relationships between attributes. But the issue is that it uncovers a lot of patterns. False signals are a major problem in the process. It is up to the data mining practitioner to filter through the patterns and accept the ones that are valid and relevant to answer the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
The lending business is one of the oldest, most prevalent, and most complex of all businesses. If the data mining objective is to predict the interest rate, then it is important to know how the lending business works: why the prediction matters, what we do once we know the predicted interest rate, what data points can be collected from borrowers, what data points cannot be collected because of regulations, what other external factors can affect the...


