E-Book, Englisch, Band 32, 227 Seiten, eBook
Reihe: Advances in Database Systems
Zezula / Amato / Dohnal Similarity Search
1. Auflage 2006
ISBN: 978-0-387-29151-2
Verlag: Springer US
Format: PDF
Kopierschutz: 1 - PDF Watermark
The Metric Space Approach
E-Book, Englisch, Band 32, 227 Seiten, eBook
Reihe: Advances in Database Systems
ISBN: 978-0-387-29151-2
Verlag: Springer US
Format: PDF
Kopierschutz: 1 - PDF Watermark
The area of similarity searching is a very hot topic for both research and c- mercial applications. Current data processing applications use data with c- siderably less structure and much less precise queries than traditional database systems. Examples are multimedia data like images or videos that offer query by example search, product catalogs that provide users with preference based search, scientific data records from observations or experimental analyses such as biochemical and medical data, or XML documents that come from hetero- neous data sources on the Web or in intranets and thus does not exhibit a global schema. Such data can neither be ordered in a canonical manner nor meani- fully searched by precise database queries that would return exact matches. This novel situation is what has given rise to similarity searching, also - ferred to as content based or similarity retrieval. The most general approach to similarity search, still allowing construction of index structures, is modeled in metric space. In this book. Prof. Zezula and his co authors provide the first monograph on this topic, describing its theoretical background as well as the practical search tools of this innovative technology.
Zielgruppe
Professional/practitioner
Autoren/Hrsg.
Weitere Infos & Material
Metric Searching in a Nutshell.- Foundations of Metric Space Searching.- Survey of Existing Approaches.- Metric Searching in Large Collections of Data.- Centralized Index Structures.- Approximate Similarity Search.- Parallel and Distributed Indexes.
Chapter 3 CENTRALIZED INDEX STRUCTURES (p. 105-106)
Most existing search structures have been designed to run on a single computer. Let us call them centralized structures. They are built with different assumptions about type of distance function (discrete vs. continuous), form of query (range, nearest neighbor, etc.), and temporal properties (static or dynamic) of the data to be organized. Although many index structures have been proposed as main memory structures, there are several indexes which organize data using disk storage to allow the processing of a large volume of data. In what follows, we focus on two basic approaches which store objects in secondary memories. Specifically, we discuss tree-based structures and methods which employ hashing (i.e., the key-to-address transformation) principles.
1. M-tree Family
[Ciaccia et al., 1997b] have proposed a dynamic organization, called the M-tree, which can handle data files that vary dynamically in size, i.e., in cases when insertion and deletion of objects is frequent. In contrast to other metric tree-based structures, the M-tree is built bottom-up and maintains the same length for all tree paths because the tree is balanced. This paradigm has become very popular and many researches have developed extensions of the M-tree storage structure with a main objective of increasing search efficiency, sometimes conditioned by specific application requirements. We start with the original idea of the M-tree, then describe its most important extensions.
1.1 The M-tree
Most of the indexing methods described in Chapter 2 are either static, unbalanced, or both. They are not very suitable for dynamic environments where data is subject to permanent alteration, nor for large data repositories where disk- based techniques are necessary. The M-tree is, by nature, designed as a dynamic and balanced index structure capable of organizing data stored on a disk. By building the tree in a bottom-up fashion from its leaves to its root, the M-tree shares some similarities with R-trees [Guttman, 1984] and B-trees [Comer, 1979].
This concept results in a balanced tree structure independent of the number of insertions or deletions and has a positive impact on query execution. In general, the M-tree behaves like the R-tree. All objects are stored in (or referenced from) leaf nodes while internal nodes keep pointers to nodes at the next level, together with additional information about their subtrees. Recall that R-trees store minimum bounding rectangles in non-leaf nodes that cover their subtrees. In general metric spaces, we cannot define such bounding rectangles because a coordinate system is lacking.
Thus M-trees use an object called a pivot, and a covering radius, to form a bounding ball region. In the M-tree, pivots play a role similar to that in the GNAT access structure [Brin, 1995], but unlike in GNAT, all objects are stored in leaves. Because pre-selected objects are used, the same object may be present several times in the M-tree - once in a leaf node, and once or several times in internal nodes as a pivot.