E-book, English, 297 pages
Ferilli Automatic Digital Document Processing and Management
1st edition 2011
ISBN: 978-0-85729-198-1
Publisher: Springer
Format: PDF
Copy protection: 1 - PDF Watermark
Problems, Algorithms and Techniques
Series: Advances in Computer Vision and Pattern Recognition
This text reviews the issues involved in handling and processing digital documents. Examining the full range of a document's lifetime, the book covers acquisition, representation, security, pre-processing, layout analysis, understanding, analysis of single components, information extraction, filing, indexing and retrieval.
Features:
provides a list of acronyms and a glossary of technical terms;
contains appendices covering key concepts in machine learning, and providing a case study on building an intelligent system for digital document and library management;
discusses issues of security, and legal aspects of digital documents;
examines core issues of document image analysis, and image processing techniques of particular relevance to digitized documents;
reviews the resources available for natural language processing, in addition to techniques of linguistic analysis for content handling;
investigates methods for extracting and retrieving data/information from a document.
Further Information & Material
1;Foreword;6
2;Preface;9
3;Acknowledgments;12
4;Contents;13
5;Acronyms;19
6;Digital Documents;23
6.1;Documents;25
6.1.1;A Juridic Perspective;25
6.1.2;History and Trends;26
6.1.3;Current Landscape;27
6.1.4;Types of Documents;29
6.1.5;Document-Based Environments;32
6.1.6;Document Processing Needs;33
6.1.7;References;34
6.2;Digital Formats;36
6.2.1;Compression Techniques;37
6.2.1.1;RLE (Run Length Encoding);37
6.2.1.2;Huffman Encoding;37
6.2.1.3;LZ77 and LZ78 (Lempel-Ziv);39
6.2.1.4;LZW (Lempel-Ziv-Welch);40
6.2.1.5;DEFLATE;42
6.2.2;Non-structured Formats;42
6.2.2.1;Plain Text;43
6.2.2.1.1;ASCII;44
6.2.2.1.2;ISO Latin;44
6.2.2.1.3;UNICODE;45
6.2.2.1.4;UTF;45
6.2.2.2;Images;49
6.2.2.2.1;Color Spaces;49
6.2.2.2.1.1;RGB;50
6.2.2.2.1.2;YUV/YCbCr;50
6.2.2.2.1.3;CMY(K);51
6.2.2.2.1.4;HSV/HSB and HLS;51
6.2.2.2.1.5;Comparison among Color Spaces;51
6.2.2.2.2;Raster Graphics;52
6.2.2.2.2.1;BMP (BitMaP);53
6.2.2.2.2.2;GIF (Graphics Interchange Format);55
6.2.2.2.2.3;TIFF (Tagged Image File Format);57
6.2.2.2.2.4;JPEG (Joint Photographic Experts Group);58
6.2.2.2.2.5;PNG (Portable Network Graphics);60
6.2.2.2.2.6;DjVu (DejaVu);62
6.2.2.2.3;Vector Graphic;64
6.2.2.2.3.1;SVG (Scalable Vector Graphic);64
6.2.3;Layout-Based Formats;66
6.2.3.1;PS (PostScript);66
6.2.3.2;PDF (Portable Document Format);77
6.2.4;Content-Oriented Formats;80
6.2.4.1;Tag-Based Formats;81
6.2.4.1.1;HTML (HyperText Markup Language);82
6.2.4.1.2;XML (eXtensible Markup Language);87
6.2.4.2;Office Formats;90
6.2.4.2.1;ODF (OpenDocument Format);90
6.2.5;References;91
6.3;Legal and Security Aspects;93
6.3.1;Cryptography;94
6.3.1.1;Basics;94
6.3.1.2;Short History;96
6.3.1.3;Digital Cryptography;97
6.3.1.3.1;DES (Data Encryption Standard);99
6.3.1.3.2;IDEA (International Data Encryption Algorithm);100
6.3.1.3.3;Key Exchange Method;101
6.3.1.3.4;RSA (Rivest, Shamir, Adleman);102
6.3.1.3.5;DSA (Digital Signature Algorithm);105
6.3.2;Message Fingerprint;105
6.3.2.1;SHA (Secure Hash Algorithm);106
6.3.3;Digital Signature;108
6.3.3.1;Management;110
6.3.3.1.1;DSS (Digital Signature Standard);112
6.3.3.1.2;OpenPGP Standard;113
6.3.3.2;Trusting and Certificates;114
6.3.4;Legal Aspects;117
6.3.4.1;A Law Approach;118
6.3.4.2;Public Administration Initiatives;121
6.3.4.2.1;Digital Signature;121
6.3.4.2.2;Certified e-mail;123
6.3.4.2.3;Electronic Identity Card & National Services Card;124
6.3.4.2.4;Telematic Civil Proceedings;124
6.3.5;References;128
7;Document Analysis;130
7.1;Image Processing;132
7.1.1;Basics;133
7.1.1.1;Convolution and Correlation;133
7.1.2;Color Representation;135
7.1.2.1;Color Space Conversions;136
7.1.2.1.1;RGB-YUV;136
7.1.2.1.2;RGB-YCbCr;136
7.1.2.1.3;RGB-CMY(K);137
7.1.2.1.4;RGB-HSV;137
7.1.2.1.5;RGB-HLS;138
7.1.2.2;Colorimetric Color Spaces;139
7.1.2.2.1;XYZ;139
7.1.2.2.2;L*a*b*;140
7.1.3;Color Depth Reduction;141
7.1.3.1;Desaturation;141
7.1.3.2;Grayscale (Luminance);142
7.1.3.3;Black&White (Binarization);142
7.1.3.3.1;Otsu Thresholding;142
7.1.4;Content Processing;143
7.1.4.1;Geometrical Transformations;144
7.1.4.2;Edge Enhancement;145
7.1.4.2.1;Derivative Filters;146
7.1.4.3;Connectivity;148
7.1.4.3.1;Flood Filling;149
7.1.4.3.2;Border Following;150
7.1.4.3.3;Dilation and Erosion;151
7.1.4.3.4;Opening and Closing;152
7.1.5;Edge Detection;153
7.1.5.1;Canny;154
7.1.5.2;Hough Transform;156
7.1.5.3;Polygonal Approximation;158
7.1.5.4;Snakes;160
7.1.6;References;162
7.2;Document Image Analysis;163
7.2.1;Document Structures;163
7.2.1.1;Spatial Description;165
7.2.1.1.1;4-Intersection Model;166
7.2.1.1.2;Minimum Bounding Rectangles;168
7.2.1.2;Logical Structure Description;169
7.2.1.2.1;DOM (Document Object Model);169
7.2.2;Pre-processing for Digitized Documents;172
7.2.2.1;Document Image Defect Models;173
7.2.2.2;Deskewing;174
7.2.2.3;Dewarping;175
7.2.2.3.1;Segmentation-Based Dewarping;176
7.2.2.4;Content Identification;178
7.2.2.5;Optical Character Recognition;179
7.2.2.5.1;Tesseract;181
7.2.2.5.2;JTOCR;183
7.2.3;Segmentation;184
7.2.3.1;Classification of Segmentation Techniques;185
7.2.3.2;Pixel-Based Segmentation;187
7.2.3.2.1;RLSA (Run Length Smoothing Algorithm);187
7.2.3.2.2;RLSO (Run-Length Smoothing with OR);189
7.2.3.2.3;X-Y Trees;191
7.2.3.3;Block-Based Segmentation;193
7.2.3.3.1;The DOCSTRUM;193
7.2.3.3.2;The CLiDE (Chemical Literature Data Extraction) Approach;195
7.2.3.3.3;Background Analysis;197
7.2.3.3.4;RLSO on Born-Digital Documents;201
7.2.4;Document Image Understanding;202
7.2.4.1;Relational Approach;204
7.2.4.1.1;INTHELEX (INcremental THEory Learner from EXamples);206
7.2.4.2;Description;208
7.2.4.2.1;DCMI (Dublin Core Metadata Initiative);209
7.2.5;References;211
8;Content Processing;215
8.1;Natural Language Processing;217
8.1.1;Resources-Lexical Taxonomies;218
8.1.1.1;WordNet;219
8.1.1.2;WordNet Domains;220
8.1.1.3;Senso Comune;223
8.1.2;Tools;224
8.1.2.1;Tokenization;225
8.1.2.2;Language Recognition;226
8.1.2.3;Stopword Removal;227
8.1.2.4;Stemming;228
8.1.2.4.1;Suffix Stripping;229
8.1.2.5;Part-of-Speech Tagging;231
8.1.2.5.1;Rule-Based Approach;231
8.1.2.6;Word Sense Disambiguation;233
8.1.2.6.1;Lesk's Algorithm;235
8.1.2.6.2;Yarowsky's Algorithm;235
8.1.2.7;Parsing;236
8.1.2.7.1;Link Grammar;237
8.1.3;References;239
8.2;Information Management;241
8.2.1;Information Retrieval;241
8.2.1.1;Performance Evaluation;242
8.2.1.2;Indexing Techniques;244
8.2.1.2.1;Vector Space Model;244
8.2.1.3;Query Evaluation;247
8.2.1.3.1;Relevance Feedback;248
8.2.1.4;Dimensionality Reduction;249
8.2.1.4.1;Latent Semantic Analysis and Indexing;250
8.2.1.4.2;Concept Indexing;253
8.2.1.5;Image Retrieval;255
8.2.2;Keyword Extraction;257
8.2.2.1;TF-ITP;259
8.2.2.2;Naive Bayes;259
8.2.2.3;Co-occurrence;260
8.2.3;Text Categorization;262
8.2.3.1;A Semantic Approach Based on WordNet Domains;264
8.2.4;Information Extraction;265
8.2.4.1;WHISK;267
8.2.4.2;A Multistrategy Approach;269
8.2.5;The Semantic Web;271
8.2.6;References;272
9;Appendix A. A Case Study: DOMINUS;274
9.1;General Framework;274
9.1.1;Actors and Workflow;274
9.1.2;Architecture;276
9.2;Functionality;278
9.2.1;Input Document Normalization;278
9.2.2;Layout Analysis;279
9.2.2.1;Kernel-Based Basic Blocks Grouping;280
9.2.3;Document Image Understanding;281
9.2.4;Categorization, Filing and Indexing;281
9.3;Prototype Implementation;282
9.4;Exploitation for Scientific Conference Management;285
9.4.1;GRAPE;286
10;Appendix B. Machine Learning Notions;288
10.1;Categorization of Techniques;288
10.2;Noteworthy Techniques;289
10.2.1;Artificial Neural Networks;289
10.2.2;Decision Trees;290
10.2.3;k-Nearest Neighbor;290
10.2.4;Inductive Logic Programming;290
10.2.5;Naive Bayes;291
10.2.6;Hidden Markov Models;291
10.2.7;Clustering;291
10.3;Experimental Strategies;292
10.3.1;k-Fold Cross-Validation;292
10.3.2;Leave-One-Out;293
10.3.3;Random Split;293
11;Glossary;294
11.1;Bounding box;294
11.2;Byte ordering;294
11.3;Ceiling function;294
11.4;Chunk;294
11.5;Connected component;294
11.6;Heaviside unit function;294
11.7;Heterarchy;295
11.8;KL-divergence;295
11.9;Linear regression;295
11.10;Run;295
11.11;Scanline;295
12;References;296
13;Index;305




