E-Book, English, 420 pages
Celebi: Partitional Clustering Algorithms
2015
ISBN: 978-3-319-09259-1
Publisher: Springer Nature Switzerland
Format: PDF
Copy protection: PDF watermark
This book focuses on partitional clustering algorithms, which are widely used in engineering and computer science applications. The goal of this volume is to summarize the state of the art in partitional clustering. The book covers topics such as center-based clustering, competitive learning clustering, and density-based clustering. Each chapter is contributed by a leading expert in the field.
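To give a concrete sense of what "center-based clustering" means in practice, here is a minimal sketch of Lloyd's k-means iteration, the algorithm examined in depth in Chapter 2. The function name, data layout, and initialization choice are illustrative assumptions, not code from the book.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update.

    `points` is a list of (x, y) tuples; returns (centers, labels).
    Illustrative sketch only: random-point initialization, fixed
    iteration count, no convergence test or empty-cluster handling
    beyond keeping the old center.
    """
    def d2(p, q):
        # Squared Euclidean distance; the square root is not needed
        # for nearest-center comparisons.
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct input points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: d2(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels
```

On well-separated data the two steps converge in a handful of iterations; the book's early chapters are largely about accelerating exactly this loop (Chapter 2) and choosing the initial centers well (Chapter 3).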
Dr. Emre Celebi is an Associate Professor in the Department of Computer Science at Louisiana State University in Shreveport.
Authors/Editors
Further Information & Material
Preface (p. 6)
Contents (p. 10)
1 Recent Developments in Model-Based Clustering with Applications (p. 12)
  1.1 Introduction (p. 12)
  1.2 Methodological Developments (p. 18)
    1.2.1 Initialization (p. 18)
    1.2.2 Spurious Solutions (p. 21)
    1.2.3 Merging Mixture Components for Clustering (p. 23)
    1.2.4 Nonparametric Clustering (p. 26)
    1.2.5 Variable Selection for Clustering (p. 28)
    1.2.6 Semi-Supervised Clustering (p. 30)
    1.2.7 Diagnostics and Assessment of Partition Variability (p. 32)
    1.2.8 Miscellaneous Topics (p. 34)
  1.3 Modern Applications of Model-Based Clustering (p. 35)
    1.3.1 Clustering Tree Ring Sequences (p. 35)
    1.3.2 Identification of Differentially Expressed Genes (p. 36)
    1.3.3 Analysis of Customer Navigation Patterns (p. 38)
    1.3.4 Data Clustering on Unit-Hypersphere (p. 39)
    1.3.5 Analysis of Mass Spectrometry Data (p. 40)
    1.3.6 Network Clustering (p. 41)
  References (p. 43)
2 Accelerating Lloyd's Algorithm for k-Means Clustering (p. 51)
  2.1 Introduction (p. 52)
    2.1.1 Popularity of the k-Means Algorithm (p. 52)
    2.1.2 The Standard k-Means Algorithm Does a Lot of Unnecessary Work (p. 53)
    2.1.3 Previous Work on k-Means Acceleration (p. 53)
      2.1.3.1 Algorithmic Improvements (p. 54)
      2.1.3.2 Parallelization (p. 55)
      2.1.3.3 Alternative Heuristic Methods (p. 55)
  2.2 Cluster Distortion and Lloyd's Algorithm (p. 55)
    2.2.1 Analysis of Lloyd's Algorithm (p. 57)
    2.2.2 MacQueen's Algorithm (p. 58)
  2.3 Tree Structured Approaches (p. 58)
    2.3.1 Blacklisting and Filtering Centers with k-d Trees (p. 58)
    2.3.2 Anchors Hierarchy (p. 60)
  2.4 Triangle Inequality Approaches (p. 60)
    2.4.1 Using the Triangle Inequality for Center-Center and Center-Point Distances (p. 61)
    2.4.2 Maintaining Distance Bounds with the Triangle Inequality (p. 62)
    2.4.3 Elkan's Algorithm: k Lower Bounds, k² Center-Center Distances (p. 64)
    2.4.4 Hamerly's Algorithm: 1 Lower Bound (p. 64)
    2.4.5 Drake's Algorithm: 1 < b < k Lower Bounds (p. 66)
    2.4.6 Annular Algorithm: Sorting the Centers by Norm (p. 68)
    2.4.7 Kernelized k-Means with Distance Bounds (p. 70)
  2.5 Heap-Ordered k-Means: Inverting the Innermost Loops (p. 72)
    2.5.1 Reducing the Number of Bounds Kept (p. 72)
    2.5.2 Cost of Combining Distance Bounds (p. 73)
    2.5.3 Inverting the Loops Over n and k (p. 73)
    2.5.4 Heap-Structured Bounds (p. 73)
    2.5.5 Analysis of Heap-Structured k-Means (p. 75)
  2.6 Parallelization (p. 76)
  2.7 Experiments and Discussion (p. 77)
    2.7.1 Testing Platforms (p. 78)
    2.7.2 Speedup Relative to the Naive Algorithm (p. 78)
    2.7.3 Parallelism (p. 79)
    2.7.4 Number of Distance Calculations (p. 81)
    2.7.5 Memory Use (p. 85)
  2.8 Conclusion (p. 86)
  References (p. 87)
3 Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm (p. 89)
  3.1 Introduction (p. 90)
  3.2 Linear, Deterministic, and Order-Invariant K-Means Initialization Methods (p. 92)
  3.3 Experimental Setup (p. 97)
    3.3.1 Data Set Descriptions (p. 97)
    3.3.2 Attribute Normalization (p. 97)
    3.3.3 Performance Criteria (p. 98)
  3.4 Experimental Results and Discussion (p. 99)
  3.5 Conclusions (p. 103)
  References (p. 104)
4 Nonsmooth Optimization Based Algorithms in Cluster Analysis (p. 109)
  4.1 Introduction (p. 109)
  4.2 Optimization Formulations of Clustering Problems (p. 112)
    4.2.1 Combinatorial Formulation of the Clustering Problem (p. 113)
    4.2.2 Mixed Integer Nonlinear Programming Formulation of the Clustering Problem (p. 113)
    4.2.3 Nonsmooth Nonconvex Optimization Formulation of the Clustering Problem (p. 114)
    4.2.4 Comparison of Different Formulations of Clustering Problem (p. 115)
  4.3 The Auxiliary Cluster Problem (p. 115)
  4.4 An Incremental Clustering Algorithm (p. 116)
  4.5 Computation of Starting Points for Cluster Centers (p. 117)
  4.6 The Modified Incremental Clustering Algorithm (p. 121)
  4.7 Solving Optimization Problems (p. 121)
    4.7.1 k-Means Type Heuristic Algorithm (p. 122)
    4.7.2 The Discrete Gradient Method (p. 122)
    4.7.3 An Algorithm Based on Smoothing Techniques (p. 123)
      4.7.3.1 Hyperbolic Smoothing of the Cluster Function (p. 124)
      4.7.3.2 Hyperbolic Smoothing of the Auxiliary Cluster Function (p. 125)
      4.7.3.3 Hyperbolic Smoothing of L1-Norm (p. 125)
      4.7.3.4 Hyperbolic Smoothing of L∞-Norm (p. 127)
      4.7.3.5 Smooth Clustering Problems (p. 128)
  4.8 Implementation of Incremental Clustering Algorithm (p. 129)
  4.9 Computational Results: Evaluation of the Incremental Algorithm (p. 130)
    4.9.1 Results for the Similarity Measure Based on the Squared L2-Norm (p. 131)
    4.9.2 Results for the Similarity Measure Based on the L1-Norm (p. 135)
    4.9.3 Results for the Similarity Measure Based on the L∞-Norm (p. 135)
    4.9.4 Dependence of Number of Distance Function Evaluations and CPU Time on Number of Clusters (p. 135)
    4.9.5 Results for Purity in Data Sets with Class Labels (p. 139)
    4.9.6 Visualization of Results (p. 142)
  4.10 Computational Results: Comparison with Other Clustering Algorithms (p. 143)
    4.10.1 Comparison of Algorithms Using the Similarity Measure Based on the Squared L2-Norm (p. 145)
    4.10.2 Comparison of Algorithms Using the Similarity Measure Based on the L1-Norm (p. 149)
    4.10.3 Comparison of Algorithms Using the Similarity Measure Based on the L∞-Norm (p. 150)
  4.11 Conclusions (p. 153)
  References (p. 154)
5 Fuzzy Clustering Algorithms and Validity Indices for Distributed Data (p. 157)
  5.1 Introduction (p. 157)
  5.2 Fuzzy Clustering Algorithms (p. 161)
    5.2.1 FCM: Fuzzy c-Means (p. 162)
    5.2.2 GK: Gustafson-Kessel (p. 163)
    5.2.3 Other Fuzzy Clustering Algorithms (p. 164)
    5.2.4 Complexity Analysis: Summary (p. 164)
  5.3 Clustering Validation (p. 165)
    5.3.1 XB: Xie-Beni (p. 165)
    5.3.2 FSS: Fuzzy Simplified Silhouette (p. 166)
    5.3.3 K: Kwon (p. 166)
    5.3.4 TSS: Tang-Sun-Sun (p. 167)
    5.3.5 FS: Fukuyama-Sugeno (p. 167)
    5.3.6 FHV: Fuzzy Hypervolume (p. 167)
    5.3.7 APD: Average Partition Density (p. 168)
    5.3.8 PD: Partition Density (p. 168)
    5.3.9 SCG (p. 168)
    5.3.10 PBMF (p. 169)
    5.3.11 Complexity Analysis: Summary (p. 169)
    5.3.12 OMR: Ordered Multiple Runs (p. 169)
  5.4 Distributed Fuzzy Clustering Algorithms (p. 171)
    5.4.1 DFCM: Distributed Fuzzy c-Means (p. 171)
    5.4.2 Framework for Distributed Data (p. 174)
    5.4.3 Other Distributed Fuzzy Clustering Algorithms (p. 176)
    5.4.4 Complexity Analysis: Communication (p. 177)
  5.5 Distributed Clustering Validation (p. 178)
    5.5.1 DXB: Distributed Xie-Beni (p. 179)
    5.5.2 DFSS: Distributed Fuzzy Simplified Silhouette (p. 179)
    5.5.3 DK: Distributed Kwon (p. 180)
    5.5.4 DTSS: Distributed Tang-Sun-Sun (p. 180)
    5.5.5 DFS: Distributed Fukuyama-Sugeno (p. 181)
    5.5.6 DFHV: Distributed Fuzzy Hypervolume (p. 182)
    5.5.7 DAPD: Distributed Average Partition Density (p. 182)
    5.5.8 DPD: Distributed Partition Density (p. 183)
    5.5.9 DSCG: Distributed SCG (p. 183)
    5.5.10 DPBMF: Distributed PBMF (p. 183)
    5.5.11 Complexity Analysis: Communication (p. 184)
    5.5.12 DOMR: Distributed Ordered Multiple Runs (p. 185)
  5.6 Summary of Results (p. 185)
  5.7 Experimental Evaluation (p. 186)
  5.8 Final Remarks (p. 190)
  Appendix: Additional Fuzzy Clustering Algorithms (p. 191)
    GG: Gath and Geva (p. 191)
    FCV: Fuzzy c-Varieties (p. 192)
    FCE: Fuzzy c-Elliptotypes (p. 193)
    PCM: Possibilistic c-Means (p. 194)
    PGK: Possibilistic Gustafson-Kessel (p. 196)
    FPCM: Fuzzy-Possibilistic c-Means (p. 196)
    PFCM: Possibilistic-Fuzzy c-Means (p. 198)
  References (p. 199)
6 Density Based Clustering: Alternatives to DBSCAN (p. 203)
  6.1 Introduction (p. 203)
  6.2 Related Work (p. 206)
    6.2.1 DBSCAN (p. 206)
  6.3 Clustering with a Different Notion of Density (p. 210)
    6.3.1 Black Hole Clustering (p. 210)
    6.3.2 Protoclustering (p. 212)
  6.4 Evaluation (p. 215)
    6.4.1 Black Hole Clustering Evaluation (p. 217)
    6.4.2 Protoclustering (p. 219)
  6.5 Conclusion (p. 221)
  References (p. 222)
7 Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering (p. 224)
  7.1 Introduction to Nonnegative Matrix Factorization (p. 225)
  7.2 Nonnegative Matrix Factorization for Clustering (p. 226)
  7.3 Optimization Framework for Nonnegative Matrix Factorization (p. 229)
    7.3.1 Block Coordinate Descent Framework (p. 229)
      7.3.1.1 Convergence Property (p. 231)
      7.3.1.2 Stopping Criterion (p. 232)
    7.3.2 Extension 1: Sparse NMF (p. 233)
    7.3.3 Extension 2: Weakly-Supervised NMF (p. 234)
  7.4 Choosing the Number of Clusters (p. 235)
  7.5 Experimental Results (p. 237)
    7.5.1 Data Sets and Algorithms (p. 237)
    7.5.2 Clustering Quality (p. 239)
    7.5.3 Convergence Behavior (p. 241)
    7.5.4 Sparseness (p. 241)
    7.5.5 Consistency from Multiple Runs (p. 242)
  7.6 UTOPIAN: User-driven Topic Modeling via Interactive NMF (p. 243)
    7.6.1 Usage Scenarios (p. 245)
  7.7 Conclusions and Future Directions (p. 247)
  References (p. 249)
8 Overview of Overlapping Partitional Clustering Methods (p. 253)
  8.1 Introduction (p. 253)
  8.2 Classification of Existing Approaches for Overlapping Clustering (p. 254)
  8.3 Overlapping Partitional Clustering Methods (p. 257)
    8.3.1 Uncertain Memberships Based Methods (p. 257)
      8.3.1.1 Fuzzy c-Means (FCM) and Possibilistic c-Means (PCM) (p. 258)
      8.3.1.2 Evidential c-Means (ECM) and Belief c-Means (BCM) (p. 260)
    8.3.2 Hard Memberships Based Methods (p. 263)
      8.3.2.1 Additive Methods (p. 263)
      8.3.2.2 Geometrical Methods (p. 265)
    8.3.3 Summary of Overlapping Partitional Methods (p. 270)
  8.4 Evaluation of Overlapping Clustering (p. 272)
    8.4.1 Label Based Evaluation (p. 272)
    8.4.2 Pair Based Evaluation (p. 273)
    8.4.3 BCubed Evaluation (p. 274)
    8.4.4 Synthesis of Evaluation Methods for Overlapping Clustering (p. 275)
  8.5 Empirical Evaluation of Overlapping Partitional Clustering Methods (p. 276)
  8.6 Conclusion (p. 281)
  References (p. 281)
9 On Semi-Supervised Clustering (p. 284)
  9.1 Introduction (p. 284)
  9.2 Overview and Taxonomy of Algorithms (p. 287)
    9.2.1 Types of Supervision (Side-Knowledge) (p. 288)
    9.2.2 Partitional SSC Algorithms (p. 289)
    9.2.3 Hierarchical SSC Algorithms (p. 290)
  9.3 Main Issues and Key Points (p. 291)
    9.3.1 Evaluation of Clustering Quality: Internal/External Measures (p. 291)
    9.3.2 Relevant Questions on Supervision (p. 292)
      9.3.2.1 Equivalence Between Constraint-Based and Label-Based Supervision (p. 292)
      9.3.2.2 Selecting the Best Query in Active Learning (p. 293)
      9.3.2.3 Uncertain Supervision (p. 294)
  9.4 Algorithms for SSC (p. 295)
    9.4.1 COP-COBWEB and COP-kMeans (p. 295)
    9.4.2 A Probabilistic Framework for SSC: The HMRF k-Means Method (p. 297)
    9.4.3 Semi-Supervised Learning of the Distance Measure (p. 301)
    9.4.4 Seeded k-Means and Constrained k-Means (p. 303)
    9.4.5 The Zheng-Li Algorithm (p. 304)
    9.4.6 Active Fuzzy Constrained Clustering (p. 307)
    9.4.7 Semi-Supervised Graph Clustering: A Kernel Approach (p. 310)
  9.5 A Glimpse of the Future: Some Research Directions (p. 313)
  9.6 Conclusions (p. 315)
  References (p. 315)
10 Consensus of Clusterings Based on High-Order Dissimilarities (p. 319)
  10.1 Introduction (p. 319)
  10.2 High-Order Dissimilarity: The Dissimilarity Increments Principle (p. 322)
    10.2.1 Dissimilarity Increments: Definition and Properties (p. 323)
    10.2.2 Dissimilarity Increments Distribution (DID) (p. 323)
      10.2.2.1 DID for High-Dimensional Data (p. 324)
      10.2.2.2 DID for Two-Dimensional Data (p. 325)
      10.2.2.3 Characterization and Properties of 2-DID (p. 326)
    10.2.3 DID Models and Data Fitting (p. 328)
      10.2.3.1 Best Approximation to DID for High-Dimensional Data (p. 328)
      10.2.3.2 Fitting DID to Non-Gaussian Data (p. 330)
  10.3 Partitional Clustering (p. 330)
    10.3.1 Initial Data Partition (p. 331)
    10.3.2 Merge Criterion (p. 332)
  10.4 Consensus Clustering with DID (p. 333)
  10.5 Validation with DID (p. 334)
  10.6 Experimental Results and Discussion (p. 336)
    10.6.1 Partitional Clustering (p. 338)
      10.6.1.1 Known Number of Clusters (p. 338)
      10.6.1.2 Unknown Number of Clusters (p. 339)
    10.6.2 Consensus Clustering (p. 345)
  10.7 Conclusions (p. 350)
  Appendix (p. 352)
  References (p. 355)
11 Hubness-Based Clustering of High-Dimensional Data (p. 358)
  11.1 Introduction (p. 359)
  11.2 High-Dimensional Data Clustering (p. 359)
  11.3 The Hubness Phenomenon (p. 360)
    11.3.1 The Emergence of Hubs (p. 361)
    11.3.2 Relation of Hubs to Data Clusters (p. 363)
  11.4 Effects of Hubness on Clustering (p. 364)
  11.5 Interaction Between Kernels and Hubness (p. 368)
  11.6 Clustering Algorithms Based on Hubness (p. 371)
    11.6.1 Deterministic Approach (p. 372)
    11.6.2 Probabilistic Approach (p. 373)
    11.6.3 A Hybrid Approach: Extending K-Means (p. 375)
    11.6.4 Using Kernels in Hubness-Based Clustering (p. 375)
    11.6.5 Scalability of Hubness-Based Approaches (p. 377)
  11.7 Experimental Comparisons (p. 379)
  11.8 Application Domains and Other Types of Hubness-Aware Methods (p. 385)
  11.9 Possible Future Directions (p. 386)
  References (p. 387)
12 Clustering for Monitoring Distributed Data Streams (p. 392)
  12.1 Introduction (p. 393)
  12.2 Text Mining Application (p. 395)
  12.3 Monitoring Threshold Functions Through Clustering: Motivation (p. 396)
  12.4 Monitoring Threshold Functions Through Clustering: Implementation (p. 399)
  12.5 Experimental Results (p. 403)
    12.5.1 Data (p. 403)
    12.5.2 Monitoring with Incremental Clustering (p. 404)
  12.6 Conventional Clustering Algorithms (p. 406)
    12.6.1 PDDP (p. 407)
    12.6.2 Batch k-Means (p. 408)
    12.6.3 Incremental k-Means (p. 410)
    12.6.4 Batch k-Means Followed by Incremental k-Means (p. 410)
    12.6.5 Node Clustering with Classical Clustering Algorithms (p. 411)
  12.7 Conclusions (p. 412)
  Appendix 1: First and Second Moments (p. 413)
  Appendix 2: Broadcast Count (p. 416)
  References (p. 418)




