Jo | Text Mining | E-Book | www.sack.de

E-book, English, Volume 45, 376 pages

Series: Studies in Big Data

Jo: Text Mining

Concepts, Implementation, and Big Data Challenge
1st edition, 2018
ISBN: 978-3-319-91815-0
Publisher: Springer Nature Switzerland
Format: PDF
Copy protection: 1 - PDF watermark




This book explains text mining and the ways this form of data mining can uncover implicit knowledge in text collections. The author presents the underlying concepts and approaches together with guidelines for implementing text mining systems in Java. The book opens with detailed text preprocessing techniques, then covers the concepts, techniques, implementation, and evaluation of text categorization, and finally turns to more advanced topics, including text summarization, text segmentation, topic mapping, and automatic text management.
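The preprocessing steps the book covers in its text-indexing chapter (tokenization, stop-word removal, and term weighting) can be sketched minimally in Java, the book's implementation language. This is an illustrative sketch only; the class name `TextIndexer`, the method names, and the tiny stop-word list are assumptions for the example, not the book's actual code or API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Minimal sketch of a text-indexing pipeline:
// tokenization -> stop-word removal -> term-frequency weighting.
public class TextIndexer {

    // A tiny illustrative stop-word list; real systems use far larger ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and"));

    // Tokenization: lower-case the text and split on non-letter characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Indexing: drop stop words, then count how often each term occurs.
    public static Map<String, Integer> index(String text) {
        Map<String, Integer> termFrequencies = new TreeMap<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                termFrequencies.merge(token, 1, Integer::sum);
            }
        }
        return termFrequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf =
                index("Text mining is the mining of text collections.");
        System.out.println(tf); // prints {collections=1, mining=2, text=2}
    }
}
```

A full pipeline as described in the book would add stemming between stop-word removal and weighting, and replace raw counts with a weighting scheme such as TF-IDF.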

Dr. Taeho Jo is a faculty member in the School of Games at Hongik University, South Korea. He received his PhD from the University of Ottawa in 2006. His research spans text mining, neural networks, machine learning, and information retrieval. He has four years of experience in industry and ten years of experience in academia. He has published almost 150 research papers and has been listed twice in the biographical dictionary Marquis Who's Who in the World.




Further Information & Material


Preface
Contents

Part I Foundation
  1 Introduction
    1.1 Definition of Text Mining
    1.2 Texts
      1.2.1 Text Components
      1.2.2 Text Formats
    1.3 Data Mining Tasks
      1.3.1 Classification
      1.3.2 Clustering
      1.3.3 Association
    1.4 Data Mining Types
      1.4.1 Relational Data Mining
      1.4.2 Web Mining
      1.4.3 Big Data Mining
    1.5 Summary
  2 Text Indexing
    2.1 Overview of Text Indexing
    2.2 Steps of Text Indexing
      2.2.1 Tokenization
      2.2.2 Stemming
      2.2.3 Stop-Word Removal
      2.2.4 Term Weighting
    2.3 Text Indexing: Implementation
      2.3.1 Class Definition
      2.3.2 Stemming Rule
      2.3.3 Method Implementations
    2.4 Additional Steps
      2.4.1 Index Filtering
      2.4.2 Index Expansion
      2.4.3 Index Optimization
    2.5 Summary
  3 Text Encoding
    3.1 Overview of Text Encoding
    3.2 Feature Selection
      3.2.1 Wrapper Approach
      3.2.2 Principal Component Analysis
      3.2.3 Independent Component Analysis
      3.2.4 Singular Value Decomposition
    3.3 Feature Value Assignment
      3.3.1 Assignment Schemes
      3.3.2 Similarity Computation
    3.4 Issues of Text Encoding
      3.4.1 Huge Dimensionality
      3.4.2 Sparse Distribution
      3.4.3 Poor Transparency
    3.5 Summary
  4 Text Association
    4.1 Overview of Text Association
    4.2 Data Association
      4.2.1 Functional View
      4.2.2 Support and Confidence
      4.2.3 Apriori Algorithm
    4.3 Word Association
      4.3.1 Word Text Matrix
      4.3.2 Functional View
      4.3.3 Simple Example
    4.4 Text Association
      4.4.1 Functional View
      4.4.2 Simple Example
    4.5 Overall Summary
Part II Text Categorization
  5 Text Categorization: Conceptual View
    5.1 Definition of Text Categorization
    5.2 Data Classification
      5.2.1 Binary Classification
      5.2.2 Multiple Classification
      5.2.3 Classification Decomposition
      5.2.4 Regression
    5.3 Classification Types
      5.3.1 Hard vs Soft Classification
      5.3.2 Flat vs Hierarchical Classification
      5.3.3 Single vs Multiple Viewed Classification
      5.3.4 Independent vs Dependent Classification
    5.4 Variants of Text Categorization
      5.4.1 Spam Mail Filtering
      5.4.2 Sentimental Analysis
      5.4.3 Information Filtering
      5.4.4 Topic Routing
    5.5 Summary and Further Discussions
  6 Text Categorization: Approaches
    6.1 Machine Learning
    6.2 Lazy Learning
      6.2.1 K Nearest Neighbor
      6.2.2 Radius Nearest Neighbor
      6.2.3 Distance-Based Nearest Neighbor
      6.2.4 Attribute Discriminated Nearest Neighbor
    6.3 Probabilistic Learning
      6.3.1 Bayes Rule
      6.3.2 Bayes Classifier
      6.3.3 Naive Bayes
      6.3.4 Bayesian Learning
    6.4 Kernel Based Classifier
      6.4.1 Perceptron
      6.4.2 Kernel Functions
      6.4.3 Support Vector Machine
      6.4.4 Optimization Constraints
    6.5 Summary and Further Discussions
  7 Text Categorization: Implementation
    7.1 System Architecture
    7.2 Class Definitions
      7.2.1 Classes: Word, Text, and PlainText
      7.2.2 Interface and Class: Classifier and KNearestNeighbor
      7.2.3 Class: TextClassificationAPI
    7.3 Method Implementations
      7.3.1 Class: Word
      7.3.2 Class: PlainText
      7.3.3 Class: KNearestNeighbor
      7.3.4 Class: TextClassificationAPI
    7.4 Graphic User Interface and Demonstration
      7.4.1 Class: TextClassificationGUI
      7.4.2 Preliminary Tasks and Encoding
      7.4.3 Classification Process
      7.4.4 System Upgrading
    7.5 Summary and Further Discussions
  8 Text Categorization: Evaluation
    8.1 Evaluation Overview
    8.2 Text Collections
      8.2.1 NewsPage.com
      8.2.2 20NewsGroups
      8.2.3 Reuter21578
      8.2.4 OSHUMED
    8.3 F1 Measure
      8.3.1 Contingency Table
      8.3.2 Micro-Averaged F1
      8.3.3 Macro-Averaged F1
      8.3.4 Example
    8.4 Statistical t-Test
      8.4.1 Student's t-Distribution
      8.4.2 Unpaired Difference Inference
      8.4.3 Paired Difference Inference
      8.4.4 Example
    8.5 Summary and Further Discussions
Part III Text Clustering
  9 Text Clustering: Conceptual View
    9.1 Definition of Text Clustering
    9.2 Data Clustering
      9.2.1 SubSubsectionTitle
      9.2.2 Association vs Clustering
      9.2.3 Classification vs Clustering
      9.2.4 Constraint Clustering
    9.3 Clustering Types
      9.3.1 Static vs Dynamic Clustering
      9.3.2 Crisp vs Fuzzy Clustering
      9.3.3 Flat vs Hierarchical Clustering
      9.3.4 Single vs Multiple Viewed Clustering
    9.4 Derived Tasks from Text Clustering
      9.4.1 Cluster Naming
      9.4.2 Subtext Clustering
      9.4.3 Automatic Sampling for Text Categorization
      9.4.4 Redundant Project Detection
    9.5 Summary and Further Discussions
  10 Text Clustering: Approaches
    10.1 Unsupervised Learning
    10.2 Simple Clustering Algorithms
      10.2.1 AHC Algorithm
      10.2.2 Divisive Clustering Algorithm
      10.2.3 Single Pass Algorithm
      10.2.4 Growing Algorithm
    10.3 K Means Algorithm
      10.3.1 Crisp K Means Algorithm
      10.3.2 Fuzzy K Means Algorithm
      10.3.3 Gaussian Mixture
      10.3.4 K Medoid Algorithm
    10.4 Competitive Learning
      10.4.1 Kohonen Networks
      10.4.2 Learning Vector Quantization
      10.4.3 Two-Dimensional Self-Organizing Map
      10.4.4 Neural Gas
    10.5 Summary and Further Discussions
  11 Text Clustering: Implementation
    11.1 System Architecture
    11.2 Class Definitions
      11.2.1 Classes in Text Categorization System
      11.2.2 Class: Cluster
      11.2.3 Interface: ClusterAnalyzer
      11.2.4 Class: AHCAlgorithm
    11.3 Method Implementations
      11.3.1 Methods in Previous Classes
      11.3.2 Class: Cluster
      11.3.3 Class: AHC Algorithm
    11.4 Class: ClusterAnalysisAPI
      11.4.1 Class: ClusterAnalysisAPI
      11.4.2 Class: ClusterAnalyzerGUI
      11.4.3 Demonstration
      11.4.4 System Upgrading
    11.5 Summary and Further Discussions
  12 Text Clustering: Evaluation
    12.1 Introduction
    12.2 Cluster Validations
      12.2.1 Intra-Cluster and Inter-Cluster Similarities
      12.2.2 Internal Validation
      12.2.3 Relative Validation
      12.2.4 External Validation
    12.3 Clustering Index
      12.3.1 Computation Process
      12.3.2 Evaluation of Crisp Clustering
      12.3.3 Evaluation of Fuzzy Clustering
      12.3.4 Evaluation of Hierarchical Clustering
    12.4 Parameter Tuning
      12.4.1 Clustering Index for Unlabeled Documents
      12.4.2 Simple Clustering Algorithm with Parameter Tuning
      12.4.3 K Means Algorithm with Parameter Tuning
      12.4.4 Evolutionary Clustering Algorithm
    12.5 Summary and Further Discussions
Part IV Advanced Topics
  13 Text Summarization
    13.1 Definition of Text Summarization
    13.2 Text Summarization Types
      13.2.1 Manual vs Automatic Text Summarization
      13.2.2 Single vs Multiple Text Summarization
      13.2.3 Flat vs Hierarchical Text Summarization
      13.2.4 Abstraction vs Query-Based Summarization
    13.3 Approaches to Text Summarization
      13.3.1 Heuristic Approaches
      13.3.2 Mapping into Classification Task
      13.3.3 Sampling Schemes
      13.3.4 Application of Machine Learning Algorithms
    13.4 Combination with Other Text Mining Tasks
      13.4.1 Summary-Based Classification
      13.4.2 Summary-Based Clustering
      13.4.3 Topic-Based Summarization
      13.4.4 Text Expansion
    13.5 Summary and Further Discussions
  14 Text Segmentation
    14.1 Definition of Text Segmentation
    14.2 Text Segmentation Type
      14.2.1 Spoken vs Written Text Segmentation
      14.2.2 Ordered vs Unordered Text Segmentation
      14.2.3 Exclusive vs Overlapping Segmentation
      14.2.4 Flat vs Hierarchical Text Segmentation
    14.3 Machine Learning-Based Approaches
      14.3.1 Heuristic Approaches
      14.3.2 Mapping into Classification
      14.3.3 Encoding Adjacent Paragraph Pairs
      14.3.4 Application of Machine Learning
    14.4 Derived Tasks
      14.4.1 Temporal Topic Analysis
      14.4.2 Subtext Retrieval
      14.4.3 Subtext Synthesization
      14.4.4 Virtual Text
    14.5 Summary and Further Discussions
  15 Taxonomy Generation
    15.1 Definition of Taxonomy Generation
    15.2 Relevant Tasks to Taxonomy Generation
      15.2.1 Keyword Extraction
      15.2.2 Word Categorization
      15.2.3 Word Clustering
      15.2.4 Topic Routing
    15.3 Taxonomy Generation Schemes
      15.3.1 Index-Based Scheme
      15.3.2 Clustering-Based Scheme
      15.3.3 Association-Based Scheme
      15.3.4 Link Analysis-Based Scheme
    15.4 Taxonomy Governance
      15.4.1 Taxonomy Maintenance
      15.4.2 Taxonomy Growth
      15.4.3 Taxonomy Integration
      15.4.4 Ontology
    15.5 Summary and Further Discussions
  16 Dynamic Document Organization
    16.1 Definition of Dynamic Document Organization
    16.2 Online Clustering
      16.2.1 Online Clustering in Functional View
      16.2.2 Online K Means Algorithm
      16.2.3 Online Unsupervised KNN Algorithm
      16.2.4 Online Fuzzy Clustering
    16.3 Dynamic Organization
      16.3.1 Execution Process
      16.3.2 Maintenance Mode
      16.3.3 Creation Mode
      16.3.4 Additional Tasks
    16.4 Issues of Dynamic Document Organization
      16.4.1 Text Representation
      16.4.2 Binary Decomposition
      16.4.3 Transition into Creation Mode
      16.4.4 Variants of DDO System
    16.5 Summary and Further Discussions
References
Index


