Long / Chang | Relevance Ranking for Vertical Search Engines | E-Book | www.sack.de

E-book, English, 264 pages

Long / Chang Relevance Ranking for Vertical Search Engines


1st edition, 2014
ISBN: 978-0-12-407202-2
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: 6 - ePub Watermark




In plain, uncomplicated language, and using detailed examples to explain the key concepts, models, and algorithms in vertical search ranking, Relevance Ranking for Vertical Search Engines teaches readers how to manipulate ranking algorithms to achieve better results in real-world applications. This reference book for professionals covers concepts and theories from the fundamental to the advanced, such as relevance, query intention, location-based relevance ranking, and cross-property ranking. It covers the most recent developments in vertical search ranking applications, such as freshness-based relevance theory for new search applications, location-based relevance theory for local search applications, and cross-property ranking theory for applications involving multiple verticals.

• Foreword by Ron Brachman, Chief Scientist and Head, Yahoo! Labs

• Introduces ranking algorithms and teaches readers how to manipulate ranking algorithms for the best results

• Covers concepts and theories from the fundamental to the advanced

• Discusses the state of the art: development of theories and practices in vertical search ranking applications

• Includes detailed examples, case studies, and real-world situations

Bo Long is currently a staff applied researcher at LinkedIn Inc., and was formerly a senior research scientist at Yahoo! Labs. His research interests lie in data mining and machine learning with applications to web search, recommendation, and social network analysis. He holds eight innovations and has published peer-reviewed papers in top conferences and journals including ICML, KDD, ICDM, AAAI, SDM, CIKM, and KAIS. He has served as a reviewer, workshop co-organizer, conference organizer, committee member, and area chair for multiple conferences, including KDD, NIPS, SIGIR, ICML, SDM, CIKM, and JSM.


Further Information & Material


2 News Search Ranking


Abstract


News search is one of the most important Internet user activities. For a commercial news search engine, it is critical to provide users with the most relevant and fresh ranking results. Furthermore, it is necessary to group related news articles so that users can browse search results in terms of news stories rather than individual news articles. This chapter describes a few algorithms for news search engines, including ranking algorithms and clustering algorithms. For the ranking problem, the main challenge is achieving an appropriate balance between topical relevance and freshness. For the clustering problem, the main challenge is grouping related news articles into clusters in a scalable manner. We begin by introducing a few news search ranking approaches, including a learning-to-rank approach (Section 2.1) and a joint learning approach from clickthroughs (Section 2.2). We then describe a scalable clustering approach to group news search results (Section 2.3).

Keywords

News search, freshness, relevance, clustering, temporal features

2.1 The Learning-to-Rank Approach


The main challenge for ranking in news search is how to strike an appropriate balance between two factors: relevance and freshness. Here relevance includes both topical relevance and news source authority.

A widely adopted approach in practice is to use a simple formula to combine relevance and freshness. For example, the final ranking score for a news article can be computed as

$$\mathrm{score}(q, d) = \mathrm{relevance}(q, d) \cdot e^{-\beta t} \tag{2.1}$$

where $\mathrm{relevance}(q, d)$ is the value representing the relevance between query $q$ and news article $d$, $t$ is the news article age, and $e^{-\beta t}$ is a time decay term: the older a news article is, the larger the penalty it receives in the final ranking. The parameter $\beta$ controls the relative importance of freshness in the final ranking result. In the information retrieval literature, the term document usually refers to a candidate item in a ranking task. In this chapter, we use the terms document and news article interchangeably because the application here is to rank news articles in a search.
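The heuristic combination described above can be sketched in a few lines of Python. This is only an illustration: it assumes a multiplicative form with an exponential time decay, which is one plausible reading of such a formula rather than a confirmed one, and the names heuristic_score, rank_articles, beta, and the sample articles are hypothetical.

```python
import math


def heuristic_score(relevance: float, age_hours: float, beta: float) -> float:
    """Combine relevance and freshness using an ASSUMED exponential time decay."""
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    # Older articles get a smaller decay factor, i.e. a larger penalty.
    return relevance * math.exp(-beta * age_hours)


def rank_articles(articles, beta):
    """Sort (doc_id, relevance, age_hours) tuples by descending heuristic score."""
    return sorted(
        articles,
        key=lambda a: heuristic_score(a[1], a[2], beta),
        reverse=True,
    )


# Hypothetical sample data: (doc_id, relevance, age in hours).
articles = [
    ("old_but_relevant", 0.9, 48.0),
    ("fresh_less_relevant", 0.6, 1.0),
]

# beta = 0 ignores freshness; a larger beta favors fresher articles.
print([doc_id for doc_id, _, _ in rank_articles(articles, beta=0.0)])
print([doc_id for doc_id, _, _ in rank_articles(articles, beta=0.05)])
```

With beta set to 0 the ranking is driven by relevance alone, while a larger beta lets the fresher article overtake the more relevant but older one. Note that the same beta is applied to every query, which is the limitation discussed next.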

The advantage of such a heuristic combination of relevance and freshness is its efficiency in practice: only a single parameter needs to be tuned, using some ranking examples. Furthermore, an appropriate parameter value often leads to good ranking results for many queries, which also makes this approach effective.

The drawback of this approach is that it leaves little room for further improving ranking performance, because such a heuristic rule is too naive to handle more complicated ranking cases. For example, in (2.1), time decay is represented by a term that depends solely on the document age. In fact, an appropriate time decay factor should also depend on the nature of the query, since different queries have different time sensitivities. If a query is related to breaking news, such as an earthquake that has just happened and has extensive media reports on casualties and rescue, then freshness should be very important, because even a document published only one hour ago could be outdated. On the other hand, if a query is for an event that happened weeks ago, then relevance is more important in ranking, because the user would like to find the most relevant and comprehensive reports in the search results.

2.1.1 Related Works


Many prior works have exploited the temporal dimension in searches. For example, Baeza-Yates et al. [22] studied the relation among Web dynamics, structure, and page quality and demonstrated that PageRank is biased against new pages. In the T-Rank Light and T-Rank algorithms [25], both activity (i.e., update rates) and freshness (i.e., timestamps of most recent updates) of pages and links are taken into account in link analysis. Cho et al. [66] proposed a page quality ranking function to alleviate the problem of popularity-based ranking, and they used the derivatives of PageRank to forecast future PageRank values for new pages. Nunes [269] proposed to improve Web information retrieval in the temporal dimension by combining the temporal features extracted from both individual documents and the whole Web. Pandey et al. [276] studied the tradeoff between new page exploration and high-quality page exploitation, using a ranking method that randomly promotes some new pages so that they can accumulate links quickly.

The temporal dimension is also considered in other information retrieval applications. Del Corso et al. [94] proposed a ranking framework to model news article generation, topic clustering, and story evolution over time; this ranking algorithm takes publication time and linkage time into consideration as well as news source authority. Li et al. [221] proposed the TS-Rank algorithm, which considers page freshness in the stationary probability distribution of Markov chains, since the dynamics of Web pages are also important for ranking. This method proves effective in the application of publication search. Pasca [277] used temporal expressions to improve question-answering results for time-related questions. Answers are obtained by aggregating matching pieces of information and the temporal expressions they contain. Furthermore, Arikan et al. [20] incorporated temporal expressions into a language model and demonstrated experimental improvement in retrieval effectiveness.

Recency query classification plays an important role in recency ranking. Diaz [98] determined the newsworthiness of a query by predicting the probability of a user clicking on the news display of a query. König et al. [204] estimated the clickthrough rate for dedicated news search results with a supervised model designed to adapt quickly to emerging news events.

2.1.2 Combine Relevance and Freshness


Learning-to-rank algorithms have shown significant and consistent success in various applications [226,184,406,54]. Such machine-learned ranking algorithms learn a ranking mechanism by optimizing particular loss functions based on editorial annotations. An important assumption in those learning methods is that document relevance for a given query is generally stationary over time, so that, as long as the coverage of the labeled data is broad enough, the learned ranking functions would generalize well to future unseen data. Such an assumption is often true in Web searches, but it is less likely to hold in news searches because of the dynamic nature of news events and the lack of timely annotations.

A typical procedure is as follows:

• Collect query-URL pairs.

• Ask editors to label the query-URL pairs with relevance grades.

• Apply a learning-to-rank algorithm to train the ranking model.

Traditionally, in learning-to-rank, editors label query-URL pairs with relevance grades, which usually take four or five values, such as perfect, excellent, good, fair, and bad. The editorial labels are used for both ranking model training and ranking model evaluation. For training, these relevance grades are directly mapped to numeric values as learning targets.
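The mapping from editorial grades to numeric learning targets can be sketched as follows. The specific numeric values (4 down to 0), the name GRADE_TO_TARGET, the helper to_training_targets, and the example query and URLs are all assumptions made for illustration; the text does not state which values are used.

```python
# ASSUMED mapping; the actual numeric targets are not given in the text.
GRADE_TO_TARGET = {
    "perfect": 4.0,
    "excellent": 3.0,
    "good": 2.0,
    "fair": 1.0,
    "bad": 0.0,
}


def to_training_targets(labeled_pairs):
    """Turn (query, url, grade) triples into (query, url, numeric target) triples."""
    targets = []
    for query, url, grade in labeled_pairs:
        key = grade.strip().lower()
        if key not in GRADE_TO_TARGET:
            raise ValueError(f"unknown grade: {grade!r}")
        targets.append((query, url, GRADE_TO_TARGET[key]))
    return targets


# Hypothetical editorial labels for one query.
labeled = [
    ("earthquake", "https://example.com/a", "Excellent"),
    ("earthquake", "https://example.com/b", "fair"),
]
print(to_training_targets(labeled))
```

The resulting numeric targets can then be passed to whichever learning-to-rank algorithm is used for training.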

For evaluation, we desire an evaluation metric that supports graded judgments and penalizes errors near the beginning of the ranked list. In this work, we use DCG [175],

$$\mathrm{DCG}_n = \sum_{i=1}^{n} \frac{G_i}{\log_2(i + 1)} \tag{2.2}$$

where $i$ is the position in the document list, and $G_i$ is the function of relevance grade. Because the range of DCG values is not consistent across queries, we adopt the NDCG as our primary ranking metric,

$$\mathrm{NDCG}_n = Z_n \sum_{i=1}^{n} \frac{G_i}{\log_2(i + 1)} \tag{2.3}$$

where $Z_n$ is a normalization factor, which is used to make the NDCG of the ideal list be 1. We can use $\mathrm{NDCG}_n$ at chosen cutoffs $n$ to evaluate the ranking results.
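As a minimal sketch of these two metrics, the following Python code assumes the common form of DCG in which the gain at position i is divided by log2(i + 1), and it normalizes by the DCG of the ideally ordered list of the same judged documents. The function names dcg and ndcg and the sample gains are hypothetical, and the gains are taken as already-computed numeric values of the relevance grades.

```python
import math


def dcg(gains, n):
    """DCG at cutoff n: position i (1-based) is discounted by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:n], start=1))


def ndcg(gains, n):
    """NDCG at cutoff n: DCG divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True), n)
    # If no document has positive gain, define NDCG as 0 to avoid dividing by 0.
    return dcg(gains, n) / ideal if ideal > 0 else 0.0


# Hypothetical gains of the documents in ranked order for one query.
ranked_gains = [2.0, 3.0, 0.0, 1.0]
print(dcg(ranked_gains, 4))
print(ndcg(ranked_gains, 1), ndcg(ranked_gains, 4))
```

Because the ideal ordering scores exactly 1, NDCG values are comparable across queries even when their raw DCG ranges differ.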

We extend the learning-to-rank approach to news searches by making two main modifications that reflect the dynamic nature of news search: (1) training sample collection and (2) editorial labeling guidelines.

2.1.2.1 Training Sample Collection

For news searches, the training sample collection has to be carried out in near real time, through the following steps:

1. Sample the latest queries from the news search query log.

2. Immediately get the candidate URLs for the sampled queries.

3. Immediately ask editors to judge the query-URL pairs with relevance and freshness grades.

We can see that all the steps need to be accomplished in a short period. Therefore, the training sample collection has to be well planned in advance; otherwise, any delay during this procedure would affect the reliability of the collected data. If queries are sampled from an outdated query log or if all of the selected candidate URLs are outdated, they cannot represent the real data distribution. If editors do not label query-URL pairs on time, it will be difficult for them to provide accurate judgments, because editors’ judgments rely on their good understanding of the related news events, which becomes more difficult as time elapses.
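One simple way to guard against such delays is to record a timestamp at each of the three steps and discard samples that took too long to collect. The sketch below is an assumption-laden illustration, not a description of the actual pipeline: the function is_timely, its parameters, and the six-hour max_delay threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def is_timely(query_time, fetch_time, label_time, max_delay=timedelta(hours=6)):
    """Check that URLs were fetched and labeled soon after the query was issued.

    max_delay is a HYPOTHETICAL threshold; a suitable value depends on how
    quickly the underlying news events evolve.
    """
    in_order = query_time <= fetch_time <= label_time
    return in_order and (label_time - query_time) <= max_delay


# Hypothetical timestamps for two collected samples.
issued = datetime(2014, 1, 1, 9, 0)
print(is_timely(issued, issued + timedelta(minutes=5), issued + timedelta(hours=2)))
print(is_timely(issued, issued + timedelta(minutes=5), issued + timedelta(days=3)))
```

The first sample was labeled within two hours and is kept; the second was labeled three days later and would be dropped under this assumed threshold.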

2.1.2.2 Editorial Labeling

In a news search, editors should provide query-URL grades on both traditional relevance and freshness. Although document age is usually available in news searches, it is impossible to determine a...


