E-Book, English, Volume 132, 447 pages
Series: International Series in Operations Research & Management Science
Chan / Talburt / Talley: Data Engineering
1st edition 2009
ISBN: 978-1-4419-0176-7
Publisher: Springer US
Format: PDF
Copy protection: 1 - PDF Watermark
Mining, Information and Intelligence
DATA ENGINEERING: Mining, Information, and Intelligence describes applied research on collecting data and distilling useful information from it. Most of the work presented stems from collaborations between Acxiom Corporation and its academic research partners under the aegis of the Acxiom Laboratory for Applied Research (ALAR). Chapters are roughly ordered to follow the transformation of data from raw input streams to refined information. Four sections cover Data Integration and Information Quality; Grid Computing; Data Mining; and Visualization, and each chapter ends with exercises. The book is intended for anyone interested in data engineering, whether in academia, market research firms, or business-intelligence companies, and suits researchers, practitioners, and postgraduate students alike. With its focus on problems arising from industry rather than from basic research, together with its extensive references and subject and author indices, it can serve academic, research, and industrial audiences.
Authors/Editors
Further Information & Material
1;Preface;4
2;Table of Contents;7
3;1 Introduction;16
3.1;1.1 Common Problem;16
3.2;1.2 Data Integration and Data Management;18
3.2.1;1.2.1 Information Quality Overview;18
3.2.2;1.2.2 Customer Data Integration;19
3.2.2.1;1.2.2.1 Hygiene;20
3.2.2.2;1.2.2.2 Enhancement;21
3.2.2.3;1.2.2.3 Entity Resolution;22
3.2.2.4;1.2.2.4 Aggregation and Selection;22
3.2.3;1.2.3 Data Management;23
3.2.4;1.2.4 Practical Problems to Data Integration and Management;24
3.3;1.3 Analytics;25
3.3.1;1.3.1 Model Development;25
3.3.2;1.3.2 Current Modeling and Optimization Techniques;26
3.3.3;1.3.3 Specific Algorithms and Techniques for Improvement;27
3.3.4;1.3.4 Incremental or Evolutionary Updates;28
3.3.5;1.3.5 Visualization;30
3.4;1.4 Conclusion;30
3.5;1.5 References;31
4;2 A Declarative Approach to Entity Resolution;32
4.1;2.1 Introduction;32
4.2;2.2 Background;33
4.2.1;2.2.1 Entity Resolution Definition;33
4.2.2;2.2.2 Entity Resolution Defense;33
4.2.3;2.2.3 Entity Resolution Terminology;34
4.2.3.1;2.2.3.1 Prospecting;34
4.2.3.2;2.2.3.2 Blocking;34
4.2.3.3;2.2.3.3 Closure;34
4.2.3.4;2.2.3.4 Matching;35
4.2.4;2.2.4 Declarative Languages;35
4.3;2.3 The Declarative Taxonomy: The Nouns;35
4.3.1;2.3.1 Attributes;36
4.3.2;2.3.2 References;36
4.3.3;2.3.3 Paths and Match Functions;37
4.3.4;2.3.4 Entities;39
4.3.5;2.3.5 Super Groups;40
4.3.6;2.3.6 Matching Graphs;41
4.4;2.4 A Declarative Taxonomy: The Adjectives;42
4.4.1;2.4.1 Attribute Adjectives;42
4.4.2;2.4.2 Reference Adjectives;44
4.5;2.5 The Declarative Taxonomy: The Verbs;44
4.5.1;2.5.1 Attribute Verbs;44
4.5.2;2.5.2 Reference Verbs;45
4.5.3;2.5.3 Entity Verbs;47
4.6;2.6 A Declarative Representation;48
4.6.1;2.6.1 The XML Schema;49
4.6.2;2.6.2 A Representation for the Operations;51
4.7;2.7 Conclusion;52
4.8;2.8 Exercises;52
4.9;2.9 References;52
5;3 Transitive Closure of Data Records: Application and Computation;54
5.1;3.1 Introduction;54
5.1.1;3.1.1 Motivation;55
5.1.2;3.1.2 Literature Review;57
5.2;3.2 Problem Definition;58
5.3;3.3 Sequential Algorithms;60
5.3.1;3.3.1 A Breadth First Search Based Algorithm;60
5.3.2;3.3.2 A Sorting and Disjoint Set Based Algorithm;62
5.3.3;3.3.3 Experiment;66
5.4;3.4 Parallel and Distributed Algorithms;68
5.4.1;3.4.1 An Overview of a Parallel and Distributed Scheme;68
5.4.2;3.4.2 Generate Matching Pairs;70
5.4.3;3.4.3 Conversion Process;70
5.4.4;3.4.4 Closure Process;71
5.4.5;3.4.5 A MPI Based Parallel and Distributed Algorithm;77
5.4.6;3.4.6 Experiment;79
5.5;3.5 Conclusion;85
5.6;3.6 Exercises;86
5.7;3.7 Acknowledgments;88
5.8;3.8 References;89
6;4 Semantic Data Matching: Principles and Performance;91
6.1;4.1 Introduction;91
6.2;4.2 Problem Statement: Data Matching for Customer Data Integration;92
6.3;4.3 Semantic Data Matching;92
6.3.1;4.3.1 Background on Latent Semantic Analysis;92
6.3.2;4.3.2 Analysis;94
6.4;4.4 Effect of Shared Terms;95
6.4.1;4.4.1 Fundamental Limitations on Data Matching;95
6.4.2;4.4.2 Experiments;96
6.5;4.5 Results;97
6.6;4.6 Conclusion;101
6.7;4.7 Exercises;103
6.8;4.8 Acknowledgments;103
6.9;4.9 References;103
7;5 Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data;105
7.1;5.1 Introduction;105
7.2;5.2 Background;106
7.2.1;5.2.1 Techniques used for General Spelling Error Correction;107
7.2.1.1;5.2.1.1 Minimum edit distance techniques;107
7.2.1.2;5.2.1.2 Soundex and Phonetic Strategy;108
7.2.1.3;5.2.1.3 Rule-based techniques;108
7.2.1.4;5.2.1.4 N-gram-based techniques;108
7.2.1.5;5.2.1.5 Probabilistic techniques and Neural Nets;109
7.2.2;5.2.2 Domain-Specific Correction;109
7.3;5.3 Individual Name Spelling Correction Algorithm: the Personal Name Recognition Strategy (PNRS);110
7.3.1;5.3.1 Experiment Results;112
7.4;5.4 Conclusion;113
7.5;5.5 Exercises;113
7.6;5.6 References;114
8;6 A Parallel General-Purpose Synthetic Data Generator;116
8.1;6.1 Introduction;116
8.2;6.2 SDDL;117
8.2.1;6.2.1 Min/Max Constraints;118
8.2.2;6.2.2 Distribution Constraints;119
8.2.3;6.2.3 Formula Constraints;119
8.2.4;6.2.4 Iterations;119
8.2.5;6.2.5 Query Pools;121
8.3;6.3 Pools;121
8.4;6.4 Parallel Data Generation;123
8.4.1;6.4.1 Generation Algorithm 1;124
8.4.2;6.4.2 Generation Algorithm 2;125
8.5;6.5 Performance and Applications;126
8.6;6.6 Conclusion and Future Directions;127
8.7;6.7 Exercises;129
8.8;6.8 References;130
9;7 A Grid Operating Environment for CDI;131
9.1;7.1 Introduction;131
9.2;7.2 Grid-Based Service Deployment;132
9.2.1;7.2.1 Evolution of the Acxiom Grid (A Case Study);132
9.2.2;7.2.2 Services Grid;134
9.2.3;7.2.3 Grid Management;136
9.3;7.3 Grid-Based Batch Processing;139
9.3.1;7.3.1 Workflow Grid;139
9.3.2;7.3.2 I/O Constraints;145
9.3.3;7.3.3 Data Grid;147
9.3.4;7.3.4 Database Grid;149
9.3.5;7.3.5 Data Management;150
9.4;7.4 Conclusion;152
9.5;7.5 Exercises;153
10;8 Parallel File Systems;155
10.1;8.1 Introduction;155
10.2;8.2 Commercial Data and Access Patterns;156
10.2.1;8.2.1 Large File Access Patterns;157
10.2.2;8.2.2 File System Interfaces;158
10.3;8.3 Basics of Parallel File Systems;159
10.3.1;8.3.1 Common Storage System Hardware;160
10.4;8.4 Design Challenges;161
10.4.1;8.4.1 Performance;162
10.4.2;8.4.2 Consistency Semantics;162
10.4.3;8.4.3 Fault Tolerance;163
10.4.4;8.4.4 Interoperability;164
10.4.5;8.4.5 Management Tools;165
10.4.6;8.4.6 Traditional Design Challenges;166
10.5;8.5 Case Studies;166
10.5.1;8.5.1 Multi-Path File System (MPFS);166
10.5.1.1;8.5.1.1 Architecture;167
10.5.1.2;8.5.1.2 File Mapping Protocol;168
10.5.1.3;8.5.1.3 Caching;168
10.5.1.4;8.5.1.4 Fault Tolerance;169
10.5.1.5;8.5.1.5 Similar File Systems;169
10.5.2;8.5.2 Parallel Virtual File System (PVFS);169
10.5.2.1;8.5.2.1 Architecture;169
10.5.2.2;8.5.2.2 Fault Tolerance;170
10.5.2.3;8.5.2.3 Application Interfaces;171
10.5.2.4;8.5.2.4 Consistency Semantics;172
10.5.2.5;8.5.2.5 Similar File Systems;172
10.5.3;8.5.3 The Google File System (GFS);172
10.5.3.1;8.5.3.1 Architecture;173
10.5.3.2;8.5.3.2 Fault Tolerance;173
10.5.3.3;8.5.3.3 Application Interfaces;174
10.5.3.4;8.5.3.4 Consistency Semantics;175
10.5.3.5;8.5.3.5 Similar File Systems;175
10.5.4;8.5.4 pNFS;175
10.5.4.1;8.5.4.1 Architecture;176
10.5.4.2;8.5.4.2 Layouts;177
10.5.4.3;8.5.4.3 Layout Requests;177
10.5.4.4;8.5.4.4 Implementations;178
10.6;8.6 Conclusion;179
10.7;8.7 Exercises;179
10.8;8.8 References;180
11;9 Performance Modeling of Enterprise Grids;181
11.1;9.1 Introduction and Background;181
11.1.1;9.1.1 Performance Modeling;181
11.1.2;9.1.2 Capacity Planning Tools and Methodology;183
11.2;9.2 Measurement Collection and Preliminary Analysis;185
11.3;9.3 Workload Characterization;186
11.3.1;9.3.1 K-means Clustering;188
11.3.1.1;9.3.1.1 Starting Point Selection;191
11.3.1.2;9.3.1.2 K-means Analysis Example;192
11.3.2;9.3.2 Hierarchical Workload Characterization;193
11.3.3;9.3.3 Other Issues in Workload Characterization;194
11.4;9.4 Baseline System Models and Tool Construction;196
11.4.1;9.4.1 Analytic Models;196
11.4.1.1;9.4.1.1 Queueing Networks;197
11.4.1.2;9.4.1.2 Petri Nets;201
11.4.2;9.4.2 Simulation Tools for Enterprise Grid Systems;203
11.5;9.5 Enterprise Grid Capacity Planning Case Study;204
11.5.1;9.5.1 Data Collection and Preliminary Analysis;206
11.5.2;9.5.2 Workload Characterization;206
11.5.3;9.5.3 Development and Validation of the Baseline Model;207
11.6;9.6 Summary;211
11.7;9.7 Exercises;211
11.8;9.8 References;212
12;10 Delay Characteristics of Packet Switched Networks;214
12.1;10.1 Introduction;214
12.2;10.2 High-Speed Packet Switching Systems;215
12.2.1;10.2.1 Packet Switched General Organization;215
12.2.2;10.2.2 Switching Fabric Structures for Packet Switches;216
12.2.3;10.2.3 Queuing Schemes for Packet Switches;217
12.3;10.3 Technical Background;218
12.3.1;10.3.1 Packet Scheduling in Packet Switches;218
12.3.2;10.3.2 Introduction to Network Calculus;219
12.4;10.4 Delay Characteristics of Output Queuing Switches;221
12.4.1;10.4.1 Output Queuing Switch System;221
12.4.2;10.4.2 OQ Switch Modeling and Analysis;222
12.4.3;10.4.3 Output Queuing Emulation for Delay Guarantee;223
12.5;10.5 Delay Characteristics of Buffered Crossbar Switches;223
12.5.1;10.5.1 Buffered Crossbar Switch System;223
12.5.2;10.5.2 Modeling Traffic Control in Buffered Crossbar Switches;225
12.5.3;10.5.3 Delay Analysis for Buffered Crossbar Switches;226
12.5.4;10.5.4 Numerical Examples;227
12.6;10.6 Delay Comparison of Output Queuing to Buffered Crossbar;228
12.6.1;10.6.1 Maximum Packet Delay Comparison;228
12.6.2;10.6.2 Bandwidth Allocation for Delay Performance Guarantees;229
12.6.3;10.6.3 Numerical Examples;230
12.7;10.7 Summary;232
12.8;10.8 Exercises;233
12.9;10.9 References;233
13;11 Knowledge Discovery in Textual Databases: A Concept-Association Mining Approach;235
13.1;11.1 Introduction;235
13.2;11.2 Method;238
13.2.1;11.2.1 Concept Based Association Rule Mining Approach;238
13.2.2;11.2.2 Concept Extraction;239
13.2.3;11.2.3 Mining Concept Associations;241
13.2.4;11.2.4 Generating a Directed Graph of Concept Associations;241
13.3;11.3 Experiments and Results;243
13.3.1;11.3.1 Isolated words vs. multi-word concepts;243
13.3.2;11.3.2 New Metrics vs. the Traditional Support & Confidence;245
13.3.2.1;11.3.2.1 Directed Graphs;247
13.4;11.4 Conclusions;250
13.5;11.5 Examples;251
13.6;11.6 Exercises;252
13.7;11.7 References;252
14;12 Mining E-Documents to Uncover Structures;254
14.1;12.1 Introduction;254
14.2;12.2 Related Research;255
14.3;12.3 Discovery of the Physical Structure;256
14.3.1;12.3.1 Paragraph;256
14.3.2;12.3.2 Heading;257
14.3.2.1;12.3.2.1 Assigning Heading Levels to Informal Headings;257
14.3.3;12.3.3 Table;261
14.3.4;12.3.4 Image;262
14.3.5;12.3.5 Capturing the physical structure of an e-document;263
14.4;12.4 Discovery of the Explicit Terms Using Ontology;272
14.4.1;12.4.1 The Stemmer;273
14.4.2;12.4.2 The Ontology;273
14.4.3;12.4.3 Discovery Process;275
14.5;12.5 Discovery of the Logical Structure;277
14.5.1;12.5.1 Segmentation;277
14.5.2;12.5.2 Segments’ Relationships;279
14.6;12.6 Empirical Results;281
14.7;12.7 Conclusions;283
14.8;12.8 Exercises;283
14.9;12.9 Acknowledgments;285
14.10;12.10 References;285
15;13 Designing a Flexible Framework for a Table Abstraction;288
15.1;13.1 Introduction;288
15.2;13.2 Analysis of the Table ADT;290
15.3;13.3 Formal Design Contracts;292
15.4;13.4 Layered Architecture;294
15.5;13.5 Client Layer;295
15.5.1;13.5.1 Abstract Predicates for Keys and Records;296
15.5.2;13.5.2 Keys and the Comparable Interface;296
15.5.3;13.5.3 Records and the Keyed Interface;297
15.5.4;13.5.4 Interactions among the Layers;298
15.6;13.6 Access Layer;298
15.6.1;13.6.1 Abstract Predicates for Tables;298
15.6.2;13.6.2 Table Interface;298
15.6.3;13.6.3 Interactions among the Layers;300
15.7;13.7 Storage Layer;301
15.7.1;13.7.1 Abstract Predicate for Storable Records;301
15.7.2;13.7.2 Bridge Pattern;301
15.7.3;13.7.3 Proxy Pattern;302
15.7.4;13.7.4 RecordStore Interface;303
15.7.5;13.7.5 RecordSlot Interface;304
15.7.6;13.7.6 Interactions among the Layers;306
15.8;13.8 Externalization Module;306
15.9;13.9 Iterators;308
15.9.1;13.9.1 Table Iterator Methods;309
15.9.2;13.9.2 Input Iterators;310
15.9.3;13.9.3 Filtering Iterators;311
15.9.4;13.9.4 Query Iterator Methods;312
15.10;13.10 Evolving Frameworks;314
15.10.1;13.10.1 Three Examples;314
15.10.2;13.10.2 Whitebox Frameworks;315
15.10.3;13.10.3 Component Library;315
15.10.4;13.10.4 Hot Spots;316
15.10.5;13.10.5 Pluggable Objects;317
15.11;13.11 Discussion;317
15.12;13.12 Conclusion;319
15.13;13.13 Exercises;319
15.14;13.14 Acknowledgements;321
15.15;13.15 References;321
16;14 Information Quality Framework for Verifiable Intelligence Products;324
16.1;14.1 Introduction;324
16.2;14.2 Background;326
16.2.1;14.2.1 Production Process of Intelligence Products;326
16.2.2;14.2.2 Current IQ Practices in the IC;328
16.2.3;14.2.3 Relevant Concepts and Methods of IQ Management;330
16.2.3.1;14.2.3.1 TDQM Framework;330
16.2.3.2;14.2.3.2 Treating information as Product and IP-Map;331
16.2.3.3;14.2.3.3 PolyGen;331
16.2.3.4;14.2.3.4 QER;332
16.3;14.3 IQ Challenges within the IC;332
16.3.1;14.3.1 IQ Issues in Intelligence Collection and Analysis;332
16.3.2;14.3.2 Other IQ Problems;333
16.3.3;14.3.3 IQ Dimensions Related to the IC;334
16.4;14.4 Towards a Proposed Solution;335
16.4.1;14.4.1 IQ Metrics for Intelligence Products;336
16.4.2;14.4.2 Verifiability of Intelligence Products;337
16.4.3;14.4.3 Objectives and Plan;338
16.5;14.5 Conclusion;340
16.6;14.6 Exercises;340
16.7;14.7 References;340
17;15 Interactive Visualization of Large High-Dimensional Datasets;343
17.1;15.1 Introduction;343
17.1.1;15.1.1 Related work;343
17.1.2;15.1.2 General requirements for a data visualization system;344
17.2;15.2 Data Visualization Process;345
17.2.1;15.2.1 Data Rendering Stage;346
17.2.1.1;15.2.1.1 Choosing visual objects and features;347
17.2.1.2;15.2.1.2 Non-uniform data distribution problem;347
17.2.2;15.2.2 Backward Transformation Stage;349
17.2.3;15.2.3 Knowledge Extraction Stage;350
17.3;15.3 Interactive Visualization Model;351
17.4;15.4 Utilizing Summary Icons;352
17.5;15.5 A Case Study;354
17.6;15.6 Conclusion;358
17.7;15.7 Exercises;358
17.8;15.8 Acknowledgements;358
17.9;15.9 References;358
18;16 Image Watermarking Based on Pyramid Decomposition with CH Transform;360
18.1;16.1 Introduction;360
18.2;16.2 Algorithm for multi-layer image watermarking;361
18.2.1;16.2.1 Resistant watermarking;361
18.2.2;16.2.2 Resistant watermark detection;371
18.2.3;16.2.3 Fragile watermarking;376
18.3;16.3 Data hiding;377
18.4;16.4 Evaluation of the watermarking efficiency;378
18.5;16.5 Experimental results;379
18.6;16.6 Application areas;386
18.6.1;16.6.1 Resistant watermarks;386
18.6.2;16.6.2 Fragile watermarks;387
18.6.3;16.6.3 Data hiding;387
18.7;16.7 Conclusion;387
18.8;16.8 Exercises;388
18.9;16.9 Acknowledgment;393
18.10;16.10 References;393
19;17 Immersive Visualization of Cellular Structures;395
19.1;17.1 Introduction;395
19.2;17.2 Light Microscopic Cellular Images and Focus: Basics;396
19.3;17.3 Flat-Field Correction;398
19.4;17.4 Separation of Transparent Layers using Focus;399
19.5;17.5 3D Visualization of Cellular Structures;402
19.5.1;17.5.1 Volume Rendering;402
19.5.2;17.5.2 Immersive Visualization: CAVE Environment;404
19.6;17.6 Conclusions;407
19.7;17.7 Exercises;407
19.8;17.8 References;407
20;18 Visualization and Ontology of Geospatial Intelligence;409
20.1;18.1 Introduction;409
20.1.1;18.1.1 Premises;409
20.1.2;18.1.2 Research Agenda;410
20.2;18.2 Semantic Information Representation and Extraction;411
20.3;18.3 Markov Random Field;412
20.3.1;18.3.1 Spatial or Contextual Pattern Recognition;413
20.3.2;18.3.2 Image Classification using k-medoid Method;413
20.3.3;18.3.3 Random Field and Spatial Time Series;416
20.3.4;18.3.4 First Persian-Gulf-War Example;418
20.4;18.4 Context-driven Visualization;420
20.4.1;18.4.1 Relevant Methodologies;420
20.4.2;18.4.2 Visual Perception and Tracking;421
20.4.3;18.4.3 Visualization;423
20.5;18.5 Intelligent Information Fusion;425
20.5.1;18.5.1 Semantic Information Extraction;425
20.5.2;18.5.2 Intelligent Contextual Inference;426
20.5.3;18.5.3 Context-driven Ontology;426
20.6;18.6 Metrics for Knowledge Extraction and Discovery;427
20.7;18.7 Conclusions and Recommendations;428
20.7.1;18.7.1 Contributions;428
20.7.2;18.7.2 Looking Ahead;429
20.8;18.8 Exercises;430
20.9;18.9 Acknowledgements;433
20.10;18.10 References;434
21;19 Looking Ahead;436
21.1;19.1 Introduction;436
21.2;19.2 Data Integration and Information Quality;437
21.3;19.3 Grid Computing;439
21.4;19.4 Data Mining;440
21.5;19.5 Visualization;442
21.6;19.6 References;443
22;Index;445




