Kruschwitz | Intelligent Document Retrieval | E-Book | www.sack.de

E-book, English, Volume 17, 205 pages

Series: The Information Retrieval Series

Kruschwitz Intelligent Document Retrieval

Exploiting Markup Structure
1st edition 2006
ISBN: 978-1-4020-3768-9
Publisher: Springer Netherlands
Format: PDF
Copy protection: 1 - PDF Watermark




Collections of digital documents can nowadays be found everywhere in institutions, universities or companies. Examples are Web sites or intranets. But searching them for information can still be painful. Searches often return either large numbers of matches or no suitable matches at all. Such document collections can vary a lot in size and how much structure they carry. What they have in common is that they typically do have some structure and that they cover a limited range of topics. The second point is significantly different from the Web in general. The type of search system that we propose in this book can suggest ways of refining or relaxing the query to assist a user in the search process. In order to suggest sensible query modifications we would need to know what the documents are about. Explicit knowledge about the document collection encoded in some electronic form is what we need. However, typically such knowledge is not available. So we construct it automatically.




Further Information & Material


Contents
Foreword
Preface
List of Figures
List of Tables
1 Introduction
  1.1 Introductory Examples
  1.2 Using Markup to Extract Knowledge
  1.3 Applying the Extracted Knowledge
  1.4 Structure of the Book
Part I The Model
  2 Related Work
    2.1 Information Retrieval
    2.2 Information Extraction
    2.3 Clustering
    2.4 Classification
    2.5 Web Search Techniques
    2.6 Ontologies
    2.7 Layout Analysis
    2.8 Web Search Studies
    2.9 Navigating Concept Hierarchies
    2.10 Dialogue Systems
    2.11 Usability Issues
    2.12 Concluding Remarks on Related Work
  3 Data Analysis and Domain Model Construction
    3.1 Documents
    3.2 Concepts
    3.3 A Domain Model Based on Concepts
    3.4 Model Structure
    3.5 Model Construction
    3.6 Using the Model for Query Modification
    3.7 Implementational Issues
  4 Incorporating Additional Knowledge
    4.1 Internal Knowledge
    4.2 External Knowledge
  5 A Dialogue System for Partially Structured Data
    5.1 Dialogue as Movement in Space
    5.2 Dialogue Example
    5.3 Static
    Dynamic Clusters
    5.4 Real User Queries
    5.5 Properties
    5.6 Dialogue
Part II Practical Applications
  6 UKSearch - Intelligent Web Search
    6.1 Indexing Web Pages
    6.2 The UKSearch System
    6.3 Sample Domain 1: Essex University
    6.4 Sample Domain 2: BBC News
    6.5 Implementational Issues
  7 UKSearch - Evaluation and Discussion
    7.1 Log Analysis
    7.2 Investigating Domain Model Relations
    7.3 Task-Based Evaluation: Essex University
    7.4 Task-Based Evaluation: BBC News
  8 YPA - Searching Classified Directories
    8.1 System Overview
    8.2 Indexing Classified Advertisements
    8.3 Dialogue Strategy in the YPA
    8.4 Implementational Issues
  9 Future Directions and Conclusions
    9.1 Towards Evolving Domain Models
    9.2 Dialogue Management
    9.3 An Outlook on Future Evaluations
    9.4 Conclusions
References
Index


6 UKSearch - Intelligent Web Search (p.93-94)


Finding information on the Web is normally a straightforward task. For most user requests the information can be located by applying a standard search engine using simple pattern matching techniques. However, by restricting the search to some smaller document collection (one that is still too large to be searched without appropriate tools) this can become a tedious task. Examples of such collections are corporate intranets or university Web sites. Typically a search will return large numbers of matching documents even in smaller document collections. If no matching document can be found, the user is usually either left alone with a great number of partially matching documents or with no results at all.

These are well known problems and approaches for more sophisticated search systems exist to overcome them (see Chap. 2). But those approaches tend to rely very much on a given document structure or expensively created concept hierarchies. While this is appropriate for fairly well structured domains such as product catalogues and other applications where the information is stored in database formats, it is no help if the document collection is heterogeneous.

Surprisingly perhaps, the problem of not finding any document in the collection for a user query (a form of "data sparsity") is not necessarily a major problem in small domains. The log files of the search engine installed at the University of Essex Web site prove that the majority of queries that users submit result in a large number of matching documents despite the fairly small size of the collection. But unlike in general Web search where scalability issues prevent the application of more sophisticated indexing steps, we can build domain-specific concept hierarchies easily and rapidly in such well-defined document collections using the techniques introduced in the earlier chapters. These automatically created knowledge sources reflect the relations between documents or terms within those documents simply based on the available data.

Apart from that, collections of Web pages are well suited to verify the techniques introduced in this book, as these documents are typically marked up using HTML tags. This type of markup mixes visual markup and semantic representation (as found in the meta tags for example). We turn this implicit knowledge into explicit relations.

The earlier chapters presented the conceptual framework. Here we discuss the practical steps that lead to an explicitly structured representation of a Web document collection. Frequently used HTML tags are used to define markup contexts (the fundamental units to extract concepts which are then arranged in a domain model). The structure imposed on the data collection is employed in a dialogue system which assists the user with handling those queries that do not retrieve documents or result in large numbers of matches.
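
The idea of markup contexts can be illustrated with a minimal Python sketch. It is not the book's implementation: the set of tags in `CONTEXT_TAGS`, the simple tokenizer, and the rule that a term becomes a concept candidate once it appears in at least `min_contexts` different contexts are all assumptions made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: an illustrative selection of tags treated as markup contexts.
CONTEXT_TAGS = {"title", "h1", "h2", "h3", "b", "strong", "a"}


def tokens(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w]


class MarkupContextParser(HTMLParser):
    """Records, for each term, the set of markup contexts it occurs in."""

    def __init__(self):
        super().__init__()
        self.open_contexts = []           # context tags currently open
        self.contexts = defaultdict(set)  # term -> names of contexts

    def handle_starttag(self, tag, attrs):
        if tag in CONTEXT_TAGS:
            self.open_contexts.append(tag)
        elif tag == "meta":
            attr = dict(attrs)
            # Meta keywords are treated as one more (hypothetical) context.
            if (attr.get("name") or "").lower() == "keywords":
                for term in tokens(attr.get("content") or ""):
                    self.contexts[term].add("meta_keywords")

    def handle_endtag(self, tag):
        if tag in self.open_contexts:
            # Close the innermost open occurrence of this tag.
            rev = self.open_contexts[::-1].index(tag)
            del self.open_contexts[len(self.open_contexts) - 1 - rev]

    def handle_data(self, data):
        for term in tokens(data):
            for ctx in self.open_contexts:
                self.contexts[term].add(ctx)


def concept_candidates(html, min_contexts=2):
    """Return terms seen in at least `min_contexts` distinct contexts."""
    parser = MarkupContextParser()
    parser.feed(html)
    parser.close()
    return sorted(
        term for term, ctxs in parser.contexts.items()
        if len(ctxs) >= min_contexts
    )
```

Text that appears only in plain paragraphs contributes nothing here, because no context tag is open when it is read. A real indexer would also need stop-word handling and multi-word terms, which this sketch leaves out.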

We will see how the general dialogue manager introduced earlier is set up to work with the data collections discussed in this chapter. We will however not focus on the links between concepts and individual documents or directories. The more interesting aspect is the construction of domain models that are not closely tied to the individual documents, mainly because a separable domain model is more flexible. The reason is that despite the ever-changing nature of a collection of Web documents we will not need to constantly update the model. A domain model that is not linked to the individual documents will still be usable once the document collection has been updated. It can simply be plugged into a search system.
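
A small Python sketch may help to show what a domain model that is not linked to individual documents can look like, and how it could drive query refinement and relaxation. Everything in it is hypothetical: the example concepts in `DOMAIN_MODEL`, the function `suggest_modifications`, the `max_matches` threshold, and the relaxation strategy of dropping one term at a time are assumptions for illustration and are not taken from the UKSearch system.

```python
# Hypothetical domain model: concept -> related concepts.
# It holds no document identifiers, so it stays usable after re-indexing.
DOMAIN_MODEL = {
    "library": ["opening hours", "catalogue"],
    "exam": ["timetable", "results"],
}


def suggest_modifications(query_terms, match_count, model, max_matches=20):
    """Suggest relaxed or refined queries based on the number of matches.

    Returns a list of (kind, new_query_terms) pairs, where kind is
    "relax" or "refine". An empty list means the query is left alone.
    """
    if match_count == 0 and len(query_terms) > 1:
        # No matches: relax by dropping one query term at a time.
        return [
            ("relax", [t for t in query_terms if t != dropped])
            for dropped in query_terms
        ]
    if match_count > max_matches:
        # Too many matches: refine with related concepts from the model.
        suggestions = []
        for term in query_terms:
            for related in model.get(term, []):
                if related not in query_terms:
                    suggestions.append(("refine", query_terms + [related]))
        return suggestions
    return []
```

Because the function receives only the query, the match count reported by the search engine, and the model, the same model can be plugged into a different search back end without change.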


