Book, English, 416 pages, Format (W × H): 188 mm × 234 mm, Weight: 839 g
Handling Outliers and Anomalies in Data Science
ISBN: 978-1-394-29437-4
Publisher: Wiley
An essential guide for tackling outliers and anomalies in machine learning and data science.
In recent years, machine learning (ML) has transformed virtually every area of research and technology, becoming one of the key tools for data scientists. Robust machine learning is a new approach to handling outliers in datasets, an often-overlooked aspect of data science. Ignoring outliers can lead to poor business decisions, incorrect medical diagnoses, faulty conclusions, or misjudged feature importance, to name just a few consequences.
Fundamentals of Robust Machine Learning offers a thorough but accessible overview of this subject by focusing on how to properly handle outliers and anomalies in datasets. The book describes two main approaches: using outlier-tolerant ML tools, or removing outliers before applying conventional tools. Balancing theoretical foundations with practical Python code, it provides all the skills needed to enhance the accuracy, stability, and reliability of ML models.
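The two approaches can be sketched in a few lines of Python. This is a minimal illustration of the idea, not code from the book: it estimates the center of a small hypothetical sample containing one gross outlier, first with an outlier-tolerant estimator (the median), then by removing outliers with a simple MAD-based edit rule before applying a conventional estimator (the mean).

```python
import numpy as np

# Hypothetical 1-D sample with a single gross outlier appended.
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 100.0])

# Approach 1: an outlier-tolerant estimator.
# The median is barely affected by the outlier.
robust_estimate = np.median(data)

# Approach 2: remove outliers first, then use a conventional tool.
# Here a simple 3-sigma edit rule on the median/MAD scale flags the
# outlier, and the mean is computed on the cleaned sample.
mad = np.median(np.abs(data - np.median(data)))
scale = 1.4826 * mad  # MAD rescaled to estimate the standard deviation
mask = np.abs(data - np.median(data)) <= 3 * scale
cleaned_mean = data[mask].mean()

print(robust_estimate, cleaned_mean)
```

Both routes land near the true center of the inliers (about 10), whereas the raw mean of the full sample would be pulled above 20 by the single outlier.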
Readers of Fundamentals of Robust Machine Learning will also find:
- A blend of robust statistics and machine learning principles
- Detailed discussion of a wide range of robust machine learning methodologies, from robust clustering, regression, and classification to neural networks and anomaly detection
- Python code with immediate application to data science problems
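As a flavor of the methodologies listed above, the following sketch (an illustrative assumption in the book's Python style, not its actual code) contrasts conventional z-score standardization with a robust median/IQR alternative of the kind covered in the dataset standardization chapter. A single outlier inflates the mean and standard deviation, compressing the inliers; centering on the median and scaling by the interquartile range preserves their spread.

```python
import numpy as np

# Hypothetical feature column with one outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Conventional standardization: the outlier inflates the mean and std,
# so the inliers are squeezed into a narrow band near zero.
z_conventional = (x - x.mean()) / x.std()

# Robust standardization: center on the median, scale by the IQR,
# so the inliers keep a meaningful spread and the outlier stands out.
q1, q3 = np.percentile(x, [25, 75])
z_robust = (x - np.median(x)) / (q3 - q1)

print(z_conventional.round(2))
print(z_robust.round(2))
```

On this sample the conventional z-score of the outlier is only about 2, hiding it inside a typical 3-sigma band, while the robust score makes it unmistakable.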
Fundamentals of Robust Machine Learning is ideal for undergraduate or graduate students in data science, machine learning, and related fields, as well as for professionals in the field looking to enhance their understanding of building models in the presence of outliers.
Further Information & Material
Preface xv
About the Companion Website xix
1 Introduction 1
1.1 Defining Outliers 2
1.2 Overview of the Book 3
1.3 What Is Robust Machine Learning? 3
1.3.1 Machine Learning Basics 4
1.3.2 Effect of Outliers 6
1.3.3 What Is Robust Data Science? 7
1.3.4 Noise in Datasets 7
1.3.5 Training and Testing Flows 8
1.4 Robustness of the Median 9
1.4.1 Mean vs. Median 9
1.4.2 Effect on Standard Deviation 10
1.5 ℓ1 and ℓ2 Norms 11
1.6 Review of Gaussian Distribution 12
1.7 Unsupervised Learning Case Study 13
1.7.1 Clustering Example 14
1.7.2 Clustering Problem Specification 14
1.8 Creating Synthetic Data for Clustering 16
1.8.1 One-Dimensional Datasets 16
1.8.2 Multidimensional Datasets 17
1.9 Clustering Algorithms 19
1.9.1 k-Means Clustering 19
1.9.2 k-Medians Clustering 21
1.10 Importance of Robust Clustering 22
1.10.1 Clustering with No Outliers 22
1.10.2 Clustering with Outliers 23
1.10.3 Detection and Removal of Outliers 25
1.11 Summary 27
Problems 28
References 34
2 Robust Linear Regression 35
2.1 Introduction 35
2.2 Supervised Learning 35
2.3 Linear Regression 36
2.4 Importance of Residuals 38
2.4.1 Defining Errors and Residuals 38
2.4.2 Residuals in Loss Functions 39
2.4.3 Distribution of Residuals 40
2.5 Estimation Background 42
2.5.1 Linear Models 42
2.5.2 Desirable Properties of Estimators 43
2.5.3 Maximum-Likelihood Estimation 44
2.5.4 Gradient Descent 47
2.6 M-Estimation 49
2.7 Least Squares Estimation (LSE) 52
2.8 Least Absolute Deviation (LAD) 54
2.9 Comparison of LSE and LAD 55
2.9.1 Simple Linear Model 55
2.9.2 Location Problem 56
2.10 Huber’s Method 58
2.10.1 Huber Loss Function 58
2.10.2 Comparison with LSE and LAD 63
2.11 Summary 64
Problems 64
References 67
3 The Log-Cosh Loss Function 69
3.1 Introduction 69
3.2 An Intuitive View of Log-Cosh 69
3.3 Hyperbolic Functions 71
3.4 M-Estimation 71
3.4.1 Asymptotic Behavior 72
3.4.2 Linear Regression Using Log-Cosh 74
3.5 Deriving the Distribution for Log-Cosh 75
3.6 Standard Errors for Robust Estimators 79
3.6.1 Example: Swiss Fertility Dataset 81
3.6.2 Example: Boston Housing Dataset 82
3.7 Statistical Properties of Log-Cosh Loss 83
3.7.1 Maximum-Likelihood Estimation 83
3.8 A General Log-Cosh Loss Function 84
3.9 Summary 88
Problems 88
References 93
4 Outlier Detection, Metrics, and Standardization 95
4.1 Introduction 95
4.2 Effect of Outliers 95
4.3 Outlier Diagnosis 97
4.3.1 Boxplots 98
4.3.2 Histogram Plots 100
4.3.3 Exploratory Data Analysis 101
4.4 Outlier Detection 102
4.4.1 3-Sigma Edit Rule 102
4.4.2 4.5-MAD Edit Rule 104
4.4.3 1.5-IQR Edit Rule 105
4.5 Outlier Removal 105
4.5.1 Trimming Methods 105
4.5.2 Winsorization 105
4.5.3 Anomaly Detection Method 106
4.6 Regression-Based Outlier Detection 107
4.6.1 LS vs. LC Residuals 108
4.6.2 Comparison of Detection Methods 109
4.6.3 Ordered Absolute Residuals (OARs) 110
4.6.4 Quantile–Quantile Plot 111
4.6.5 Quad-Plots for Outlier Diagnosis 113
4.7 Regression-Based Outlier Removal 114
4.7.1 Iterative Boxplot Method 114
4.8 Regression Metrics with Outliers 116
4.8.1 Mean Square Error (MSE) 117
4.8.2 Median Absolute Error (MAE) 118
4.8.3 MSE vs. MAE on Realistic Data 119
4.8.4 Selecting Hyperparameters for Robust Regression 120
4.9 Dataset Standardization 121
4.9.1 Robust Standardization 122
4.10 Summary 126
Problems 126
References 131
5 Robustne