
About this book

This book constitutes the refereed proceedings of the 11th Asia Information Retrieval Societies Conference, AIRS 2015, held in Brisbane, QLD, Australia, in December 2015.

The 29 full papers, presented together with 11 short and demonstration papers and the abstracts of 2 keynote lectures, were carefully reviewed and selected from 92 submissions. The final programme of AIRS 2015 is divided into 10 tracks: Efficiency, Graphs, Knowledge Bases and Taxonomies, Recommendation, Twitter and Social Media, Web Search, Text Processing, Understanding and Categorization, Topics and Models, Clustering, Evaluation, and Social Media and Recommendation.

Table of Contents

On Structures of Inverted Index for Query Processing Efficiency

The inverted index has been widely adopted by modern search engines to manage billions of documents effectively and respond to users’ queries. Recently, many auxiliary index variants have been proposed to improve the engine’s compression ratio or query processing efficiency. The most successful auxiliary index structures are the Block-Max Index and the Dual-Sorted Index, both used to speed up query processing. More precisely, the Block-Max Index is designed for efficient top-k query processing, while the Dual-Sorted Index introduces pattern matching to solve complex queries. Little work has thoroughly analysed and compared the performance of the two auxiliary structures. In this paper, an in-depth study of the Block-Max Index and the Dual-Sorted Index is presented, with a survey of related top-k query processing strategies. Finally, experimental results on the TREC GOV2 dataset, with detailed analysis, show that the Dual-Sorted Index achieves the best query processing performance at the price of a huge space occupation. The study also sheds light upon the prospect of combining compact data structures with the inverted index.

Xingshen Song, Xueping Zhang, Yuexiang Yang, Jicheng Quan, Kun Jiang
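
The block-max idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a single posting list of (doc_id, score) pairs split into fixed-size blocks, each storing its maximum score; the function names `build_blocks` and `top_k` and the block size are hypothetical and are not taken from the paper, which studies multi-term query processing strategies that are considerably more involved.

```python
import heapq


def build_blocks(postings, block_size=4):
    """Split (doc_id, score) postings into blocks and record each block's max score."""
    blocks = []
    for start in range(0, len(postings), block_size):
        chunk = postings[start:start + block_size]
        blocks.append((max(score for _, score in chunk), chunk))
    return blocks


def top_k(blocks, k):
    """Return the k best (doc_id, score) pairs and the number of blocks skipped."""
    heap = []  # min-heap of (score, doc_id); heap[0][0] is the current threshold
    skipped = 0
    for block_max, chunk in blocks:
        # If even the best posting in this block cannot beat the k-th best
        # score found so far, the whole block is skipped without scanning it.
        if len(heap) == k and block_max <= heap[0][0]:
            skipped += 1
            continue
        for doc_id, score in chunk:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    ranked = sorted(heap, key=lambda item: (-item[0], item[1]))
    return [(doc_id, score) for score, doc_id in ranked], skipped
```

With ten postings in blocks of four and k = 2, a block whose maximum score is below the current second-best score is never scanned, which is the source of the speed-up; the per-block maxima are the extra space the auxiliary structure costs.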

Access Time Tradeoffs in Archive Compression

Web archives, query and proxy logs, and so on, can all be very large and highly repetitive; and are accessed only sporadically and partially, rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random-access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together with a greedy factorization of the whole text encoded using static integer codes. Here we demonstrate more precisely than before the scenarios in which RLZ excels. We contrast RLZ with alternatives based on block-based adaptive methods, including approaches that “prime” the encoding for each block, and measure a range of implementation options using both hard-disk (HDD) and solid-state disk (SSD) drives. For HDD, the dominant factor affecting access speed is the compression rate achieved, even when this involves larger dictionaries and larger blocks. When the data is on SSD the same effects are present, but not as markedly, and more complex trade-offs apply.

Matthias Petri, Alistair Moffat, P. C. Nagesh, Anthony Wirth
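
The greedy factorization relative to a fixed dictionary that the RLZ abstract above describes can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a naive longest-match search, represents characters missing from the dictionary as literals (an assumption of this sketch), and leaves out the integer coding of the factors. The names `rlz_factorize` and `rlz_decode` are hypothetical.

```python
def rlz_factorize(text, dictionary):
    """Greedily factorize text into (offset, length) pairs relative to dictionary.

    A character that does not occur in the dictionary is emitted as (char, 0).
    """
    factors = []
    pos = 0
    while pos < len(text):
        best_off, best_len = -1, 0
        length = 1
        # Naive longest-match search; a suffix array or similar index would
        # normally replace this loop.
        while pos + length <= len(text):
            off = dictionary.find(text[pos:pos + length])
            if off < 0:
                break
            best_off, best_len = off, length
            length += 1
        if best_len == 0:
            factors.append((text[pos], 0))  # literal fallback
            pos += 1
        else:
            factors.append((best_off, best_len))
            pos += best_len
    return factors


def rlz_decode(factors, dictionary):
    """Rebuild the text; each factor is decoded independently of the others."""
    parts = []
    for first, length in factors:
        parts.append(first if length == 0 else dictionary[first:first + length])
    return "".join(parts)
```

Because every factor refers only to the static dictionary and never to previously decoded text, any fragment of the factor sequence can be decoded on its own, which is what makes random access to small parts of the archive possible.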

Large Scale Sentiment Analysis with Locality Sensitive BitHash

As social media data grows rapidly, sentiment analysis plays an increasingly important role in classifying users’ opinions, attitudes and feelings expressed in text. However, most studies have focused on the effectiveness of sentiment analysis while ignoring storage efficiency when processing large-scale, high-dimensional text data. In this paper, we combine machine learning based sentiment analysis with our proposed Locality Sensitive One-Bit Min-Hash (BitHash) method. BitHash compresses each data sample into a compact binary hash code while preserving the pairwise similarity of the original data. The binary code can be used as a compressed and informative representation in place of the original data for subsequent processing; for example, it can be naturally integrated with a classifier such as an SVM. By using the compact hash code, the storage space is significantly reduced. Experiments on a popular open benchmark dataset show that, as the hash code length increases, the classification accuracy of our proposed method approaches that of the state-of-the-art method, while requiring significantly less storage space.

Wenhao Zhang, Jianqiu Ji, Jun Zhu, Hua Xu, Bo Zhang
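
A minimal sketch of the one-bit min-hash idea from the abstract above: each document is reduced to a short binary code by keeping only the lowest bit of each min-hash value, and set similarity is estimated from the fraction of agreeing bits. The details here are assumptions for illustration (MD5-based hash functions, the names `one_bit_minhash` and `estimate_jaccard`, and the standard one-bit estimator J ≈ 2 · agreement − 1); the paper's BitHash construction may differ.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def one_bit_minhash(tokens, num_bits=256):
    """Return num_bits bits: the lowest bit of each of num_bits min-hash values.

    tokens must be non-empty.
    """
    token_set = set(tokens)
    return [min(_hash(t, seed) for t in token_set) & 1 for seed in range(num_bits)]


def estimate_jaccard(code_a, code_b):
    """Estimate Jaccard similarity from the fraction of agreeing bits.

    Two min-hash values collide with probability J; otherwise their low bits
    still agree about half the time, so agreement is roughly J + (1 - J) / 2.
    """
    agree = sum(a == b for a, b in zip(code_a, code_b)) / len(code_a)
    return max(0.0, 2.0 * agree - 1.0)
```

Each code costs one bit per hash function, so a 256-bit code occupies 32 bytes regardless of the document's vocabulary size; longer codes reduce the variance of the estimate, which is consistent with the accuracy trend the abstract reports.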

Knowledge-Based Query Expansion in Real-Time Microblog Search

Since the length of microblog texts, such as tweets, is strictly limited to 140 characters, traditional Information Retrieval techniques usually suffer severely from the vocabulary mismatch problem and cannot yield good performance in the context of the microblogosphere. To address this critical challenge, we propose a new language modeling approach for microblog retrieval that infers various types of context information. In particular, we expand the query using knowledge terms derived from Freebase so that the expanded query better reflects the information need. In addition, to address users’ real-time information needs, we incorporate temporal evidence into the expansion methods so that the proposed approach can boost recent tweets in the retrieval results for a given topic. Experimental results on two official TREC Twitter corpora demonstrate the significant superiority of our approach over baseline methods.

Chao Lv, Runwei Qiang, Feifan Fan, Jianwu Yang

Enrichment of Academic Search Engine Results Pages by Citation-Based Graphs

Researchers’ readings of academic papers make their research more sophisticated and objective. In this paper, we describe a method of supporting scholarly surveys by incorporating a graph based on citation relationships into the results page of an academic search engine. Conventional academic search engines have a problem in that users have difficulty in determining which academic papers are relevant to their needs because it is hard to understand the relationship between the academic papers that appear in the search results pages. Our method helps users to make judgments about the relevance of papers by clearly visualizing the relationship. It visualizes not only academic papers on the results page but also papers that have a strong citation relationship with them. We carefully considered the method of visualization and implemented a prototype with which we conducted a user study simulating scholarly surveys. We confirmed that our method improved the efficiency of scholarly surveys through the user study.

Shuhei Shogen, Toshiyuki Shimizu, Masatoshi Yoshikawa

Quality-Aware Review Selection Based on Product Feature Taxonomy

User-generated information such as online reviews has become increasingly significant for customers in decision-making processes. Meanwhile, as the volume of online reviews proliferates, there is an insistent demand to help users tackle the information overload problem. A considerable amount of research has addressed the problem of extracting useful information from overwhelming numbers of reviews; among the proposed approaches we recall review summarization and review selection. In particular, to reduce redundant information, researchers attempt to select a small set of reviews that represents the entire review corpus by preserving its statistical properties (e.g., opinion distribution). However, a significant drawback of the existing works is that they only measure the utility of the extracted reviews as a whole, without considering the quality of each individual review. As a result, the set of chosen reviews may consist of low-quality ones even if its statistical properties are close to those of the original review corpus, which users do not prefer. In this paper, we propose a review selection method that takes the reviews’ quality into consideration during the selection process. Specifically, we examine the relationships between product features based upon a domain ontology to capture the review characteristics, and use these to select reviews that are of good quality while also preserving the opinion distribution. Our experimental results on real-world review datasets demonstrate that our proposed approach is feasible and effectively improves the performance of review selection.

Nan Tian, Yue Xu, Yuefeng Li, Gabriella Pasi

An Author Subject Topic Model for Expert Recommendation

A supervised hierarchical topic model, named the Author Subject Topic (AST) model, is introduced for expert recommendation in this study. The difference between the Author Topic (AT) model and the AST model is that the AST model introduces an additional supervised “Subject” layer. This additional layer allows subjects to be shared across authors and groups documents under various topic distributions, rather than only grouping documents under a single author’s topic distribution, which encourages documents and words to be clustered with less noise. Considering that interdisciplinary studies are a major trend in many research fields, a typical interdisciplinary field, Information Management and Information System, is investigated, and corresponding real data were gathered from WANFANG DATA (http://www.wanfangdata.com.cn/). Several comparative experiments were conducted, which demonstrate that the AST model outperforms the AT model on this dataset. The results show that the AST model is able to capture the subject class and distinguish the topics effectively when modeling experts’ research interests, which helps expert recommendation.

Haikun Mou, Qian Geng, Jian Jin, Chong Chen

A Keyword Recommendation Method Using CorKeD Words and Its Application to Earth Science Data

In various research domains, data providers themselves annotate their own data with keywords from a controlled vocabulary. However, since selecting keywords requires extensive knowledge of the domain and the controlled vocabulary, even data providers have difficulty in selecting appropriate keywords from the vocabulary. Therefore, we propose a method for recommending relevant keywords in a controlled vocabulary to data providers. We focus on a keyword definition, and calculate the similarity between an abstract text of data and the keyword definition. Moreover, considering that there are unnecessary words in the calculation, we extract CorKeD (Corpus-based Keyword Decisive) words from a target domain corpus so that we can measure the similarity appropriately. We conduct an experiment on earth science data, and verify the effectiveness of extracting the CorKeD words, which are the terms that better characterize the domain.

Youichi Ishida, Toshiyuki Shimizu, Masatoshi Yoshikawa

Incorporating Distinct Opinions in Content Recommender System

As the media content industry is growing continuously, the content market has become very competitive. Various strategies such as advertising and Word-of-Mouth (WOM) have been used to draw people’s attention. It is hard for users to be completely free of others’ influences and thus to some extent their opinions become affected and biased. In the field of recommender systems, prior research on biased opinions has attempted to reduce and isolate the effects of external influences in recommendations. In this paper, we present a new measure to detect opinions that are distinct from the mainstream. This distinctness enables us to reduce biases formed by the majority and thus, to potentially increase the performance of recommendation results. To ensure robustness, we develop four new hybrid methods that are various mixtures of existing collaborative filtering (CF) methods and our new measure of Distinctness. In this way, the proposed methods can reflect the majority of opinions while considering distinct user opinions. We evaluate the methods using a real-life rating dataset with 5-fold cross validation. The experimental results clearly show that the proposed models outperform existing CF methods.

Grace E. Lee, Keejun Han, Mun Y. Yi

Detecting Automatically-Generated Arabic Tweets

Recently, Twitter, one of the most widely known social media platforms, has been infiltrated by several automation programs, commonly known as “bots”. Bots can easily be abused to spread spam and to hinder information extraction applications by posting large numbers of automatically generated tweets that occupy a good portion of the continuous stream of tweets. This problem heavily affects users in the Arab region because of recent political developments, as automated tweets can disturb communication and waste the time needed to filter such tweets. To mitigate this problem, this research work addresses the classification of Arabic tweets as automated or manual. We propose four categories of features: formality, structural, tweet-specific, and temporal features. Our experimental evaluation over about 3.5k randomly sampled Arabic tweets shows that classification based on individual categories of features outperforms the baseline unigram-based classifier in terms of classification accuracy. Additionally, combining tweet-specific and unigram features improved classification accuracy to 92 %, which is a significant improvement over the baseline classifier and constitutes a very strong reference baseline for future studies.

Hind Almerekhi, Tamer Elsayed

Improving Tweet Timeline Generation by Predicting Optimal Retrieval Depth

Tweet Timeline Generation (TTG) systems provide users with informative and concise summaries of topics, as they developed over time, in a retrospective manner. In order to produce a tweet timeline that constitutes a summary of a given topic, a TTG system typically retrieves a list of potentially relevant tweets over which the timeline is eventually generated. In such a design, the dependency of the performance of the timeline generation step on that of the retrieval step is inevitable. In this work, we aim at improving the performance of a given timeline generation system by controlling the depth of the ranked list of retrieved tweets considered in generating the timeline. We propose a supervised approach in which we predict the optimal depth of the ranked tweet list for a given topic by combining estimates of list quality computed at different depths. We conducted our experiments on a recent TREC TTG test collection of 243 M tweets and 55 topics. We experimented with 14 different retrieval models (used to retrieve the initial ranked list of tweets) and 3 different TTG models (used to generate the final timeline). Our results demonstrate the effectiveness of the proposed approach; it managed to improve TTG performance over a strong baseline in 76 % of the cases, out of which 31 % were statistically significant, with no single significant degradation observed.

Maram Hasanain, Tamer Elsayed, Walid Magdy

Abstract Venue Concept Detection from Location-Based Social Networks

We investigate a new graphical model that can generate latent abstract concepts of venues, or Points of Interest (POIs), by exploiting text data in venue profiles obtained from location-based social networks (LBSNs). Our model offers tailor-made modeling for two different types of text data that commonly appear in venue profiles, namely tags and comments. Such modeling can effectively exploit their different characteristics. Meanwhile, the modeling of these two parts is tied together in a coordinated manner. Experimental results show that our model can generate better abstract venue concepts than comparative models.

Yi Liao, Shoaib Jameel, Wai Lam, Xing Xie

Snapboard: A Shared Space of Visual Snippets - A Study in Individual and Asynchronous Collaborative Web Search

People often engage in search tasks that are collaborative, where two or more individuals work together on joint information needs. We introduced and built CoZpace, a web-based application that enables a group of users to collaborate on searching the web. We also presented the main feature of CoZpace, named Snapboard, which is a shared board for a collection of group-created visual snippets. A visual snippet is a snapshot of focused and salient information captured by a user. It acts as a visual summarization of web pages, which allows any user to quickly recognize information and to revisit web pages. This paper describes example usage scenarios and initially investigates the ways Snapboard facilitates users in individual and asynchronous collaborative search. We then analyze users’ interactions and discuss how Snapboard supports search collaboration among study participants.

Teerapong Leelanupab, Hannarin Kruajirayu, Nont Kanungsukkasem

Improving Ranking and Robustness of Search Systems by Exploiting the Popularity of Documents

In building Information Retrieval systems, much research is geared towards optimizing a specific aspect of the system. Consequently, there are many systems that improve the effectiveness of search results by striving to outperform a baseline system. Other systems, however, focus on improving robustness by minimizing the risk of obtaining, for any topic, a result subpar with that of the baseline system. Both tasks were organized by the TREC Web tracks 2013 and 2014 and undertaken by the track participants. Our work herein proposes two re-ranking approaches, based on exploiting the popularity of documents with respect to a general topic, that improve effectiveness while also improving the robustness of the baseline systems. We used each of the runs submitted to the TREC Web tracks 2013-14 as a baseline, and empirically show that our algorithms improve the effectiveness as well as the robustness of the systems in an overwhelming number of cases, even though the systems used to produce the runs employ a variety of retrieval models.

Ashraf Bah, Ben Carterette

Heading-Aware Snippet Generation for Web Search

We propose heading-aware methods of generating search result snippets of web pages. A heading is a brief description of the topic of its associated sentences. Some existing methods give priority to sentences containing many words that also appear in headings when selecting sentences to be included in snippets of limited length. However, according to our observations, words in headings are very often omitted from their associated sentences because readers can understand the topic of the sentences by reading the heading. To score sentences while accounting for such omission, our methods count keyword occurrences in their headings as well as in the sentences themselves. Our evaluation results indicated that our methods were effective only for queries with clear intents or containing four or more keywords. To discuss the statistical significance of the results, another evaluation with more queries is needed.

Tomohiro Manabe, Keishi Tajima
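
The scoring idea in the abstract above, counting query keyword occurrences in a sentence's headings as well as in the sentence itself, can be sketched as below. The function name `score_sentence`, the `heading_weight` parameter and the simple additive combination are assumptions made for illustration; the abstract does not specify how the two counts are combined.

```python
import re


def _words(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def score_sentence(sentence, headings, keywords, heading_weight=1.0):
    """Score a sentence by keyword hits in its own text and in its headings.

    headings is the list of headings the sentence sits under (hypothetical
    input format); heading_weight scales the contribution of heading hits.
    """
    sentence_words = _words(sentence)
    heading_words = _words(" ".join(headings))
    score = 0.0
    for keyword in keywords:
        keyword = keyword.lower()
        score += sentence_words.count(keyword)
        score += heading_weight * heading_words.count(keyword)
    return score
```

For the query keywords "museum" and "hours", the sentence "It opens at nine." under the headings "Tokyo Museum" and "Opening hours" scores 2.0 even though it contains neither keyword, which is exactly the omission case the abstract describes.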

Smoothing Temporal Difference for Text Categorization

This paper addresses the text categorization problem in which training data may be derived from a different time period than the test data. We present a method for text categorization that minimizes the impact of temporal effects by using term smoothing and transfer learning techniques. We first use a technique called Temporal-based Term Smoothing (TTS) to replace time-sensitive features with representative terms, and then apply a boosting-based transfer learning algorithm called TrAdaBoost for categorization. The results on a 21-year Japanese Mainichi Newspaper corpus show that integrating term smoothing and transfer learning improves overall performance, and that it is especially effective when the creation time period of the test data differs greatly from that of the training data.

Fumiyo Fukumoto, Yoshimi Suzuki

Charset Encoding Detection of HTML Documents: A Practical Experience

Charset encoding detection is a primary task in various web-based systems, such as web browsers, email clients, and search engines. In this paper, we present a new hybrid technique for charset encoding detection for HTML documents. Our approach consists of two phases: “Markup Elimination” and “Ensemble Classification”. The Markup Elimination phase is based on the hypothesis that charset encoding detection is more accurate when the markups are removed from the main content. Therefore, HTML markups and other structural data such as scripts and styles are separated from the rendered texts of the HTML documents using a decoding-encoding trick which preserves the integrity of the byte sequence. In the Ensemble Classification phase, we leverage two well-known charset encoding detection tools, namely Mozilla CharDet and IBM ICU, and combine their outputs based on their estimated domain of expertise. Results show that the proposed technique significantly improves the accuracy of charset encoding detection over both Mozilla CharDet and IBM ICU.

Shabanali Faghani, Ali Hadian, Behrouz Minaei-Bidgoli

Structure Matters: Adoption of Structured Classification Approach in the Context of Cognitive Presence Classification

Within online learning communities, receiving timely and meaningful insights into the quality of learning activities is an important part of an effective educational experience. Commonly adopted methods, such as the Community of Inquiry framework, rely on manual coding of online discussion transcripts, which is a costly and time-consuming process. Several efforts are underway to enable the automated classification of online discussion messages using supervised machine learning, which would enable the real-time analysis of interactions occurring within online learning communities. This paper investigates the importance of incorporating features that utilise the structure of online discussions for the classification of “cognitive presence”, the central dimension of the Community of Inquiry framework, which focuses on the quality of students’ critical thinking within online learning communities. We implemented a Conditional Random Field classification solution, which incorporates structural features that may be useful in increasing classification performance over other implementations. Our approach leads to an improvement in classification accuracy of 5.8 % over existing techniques when tested on the same dataset, with a precision and recall of 0.630 and 0.504 respectively.

Zak Waters, Vitomir Kovanović, Kirsty Kitto, Dragan Gašević

A Sequential Latent Topic-Based Readability Model for Domain-Specific Information Retrieval

In domain-specific information retrieval (IR), an emerging problem is how to provide different users, especially lay users, with documents that are both relevant and readable. In this paper, we propose a novel document readability model to enhance domain-specific IR. Our model incorporates the coverage and sequential dependency of latent topics in a document. Accordingly, two topical readability indicators, namely Topic Scope and Topic Trace, are developed. These indicators, combined with the classical Surface-level indicator, can be used to rerank the initial list of documents returned by a conventional search engine. In order to extract the structured latent topics without supervision, hierarchical Latent Dirichlet Allocation (hLDA) is used. We have evaluated our model from user-oriented and system-oriented perspectives in the medical domain. The user-oriented evaluation shows a good correlation between the readability scores given by our model and human judgments. Furthermore, our model also gains a significant improvement in the system-oriented evaluation in comparison with one of the state-of-the-art readability methods.

Wenya Zhang, Dawei Song, Peng Zhang, Xiaozhao Zhao, Yuexian Hou

Automatic Labelling of Topic Models Using Word Vectors and Letter Trigram Vectors

The native representation of LDA-style topics is a multinomial distribution over words, which can be time-consuming to interpret directly. As an alternative representation, automatic labelling has been shown to help readers interpret the topics more efficiently. We propose a novel framework for topic labelling using word vectors and letter trigram vectors. We generate labels automatically and propose automatic and human evaluations of our method. First, we use a chunk parser to generate candidate labels, then map topics and candidate labels to word vectors and letter trigram vectors in order to find which candidate label is most semantically related to the topic. A label can be found by calculating the similarity between a topic and its candidate label vectors. Experiments on three common datasets show that not only the labelling method, but also our approach to automatic evaluation is effective.

Wanqiu Kou, Fang Li, Timothy Baldwin
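
The label selection step in the abstract above, mapping a topic and its candidate labels to vectors and picking the most similar candidate, can be sketched for the letter trigram part alone. The word-boundary padding with "#", the summation of the trigram counts of a topic's top words, and the names `letter_trigrams`, `cosine` and `best_label` are assumptions of this sketch; the word vector representation and the chunk parser that generates candidates are not shown.

```python
import math
from collections import Counter


def letter_trigrams(text):
    """Count letter trigrams, using '#' as a word-boundary marker."""
    padded = "#" + text.lower().replace(" ", "#") + "#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_label(topic_words, candidates):
    """Pick the candidate whose trigram vector is closest to the topic's."""
    topic_vector = Counter()
    for word in topic_words:
        topic_vector.update(letter_trigrams(word))
    return max(candidates, key=lambda c: cosine(topic_vector, letter_trigrams(c)))
```

For a topic with top words "neural", "network" and "learning", the candidate "neural networks" shares many trigrams with the topic while "stock market" shares none, so the former is selected. Trigram vectors tolerate inflection ("network" versus "networks") without any morphological processing.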

A Study of Collection-Based Features for Adapting the Balance Parameter in Pseudo Relevance Feedback

Pseudo-relevance feedback (PRF) is an effective technique to improve the ad-hoc retrieval performance. For PRF methods, how to optimize the balance parameter between the original query model and feedback model is an important but difficult problem. Traditionally, the balance parameter is often manually tested and set to a fixed value across collections and queries. However, due to the difference among collections and individual queries, this parameter should be tuned differently. Recent research has studied various query based and feedback documents based features to predict the optimal balance parameter for each query on a specific collection, through a learning approach based on logistic regression. In this paper, we hypothesize that characteristics of collections are also important for the prediction. We propose and systematically investigate a series of collection-based features for queries, feedback documents and candidate expansion terms. The experiments show that our method is competitive in improving retrieval performance and particularly for cross-collection prediction, in comparison with the state-of-the-art approaches.

Ye Meng, Peng Zhang, Dawei Song, Yuexian Hou
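
The role of the balance parameter in the abstract above can be made concrete with a small sketch: the expanded query model is a linear interpolation of the original query model and the feedback model, and the parameter is predicted per query by logistic regression. The linear interpolation form is the usual one for PRF in the language modeling framework and is assumed here; the names `interpolate_query_model` and `predict_alpha`, and the feature and weight vectors, are hypothetical placeholders rather than the paper's collection-based features.

```python
import math


def interpolate_query_model(query_model, feedback_model, alpha):
    """Mix two term distributions: (1 - alpha) * query + alpha * feedback."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    terms = set(query_model) | set(feedback_model)
    return {
        term: (1.0 - alpha) * query_model.get(term, 0.0)
        + alpha * feedback_model.get(term, 0.0)
        for term in terms
    }


def predict_alpha(features, weights, bias=0.0):
    """Logistic regression output in (0, 1), used as the balance parameter.

    features and weights are placeholder vectors of equal length.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With alpha = 0 the feedback documents are ignored and with alpha = 1 the original query is discarded, so a fixed value is a compromise across queries; predicting it per query lets a noisy feedback set be down-weighted.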

An MDL-Based Frequent Itemset Hierarchical Clustering Technique to Improve Query Search Results of an Individual Search Engine

In this research we propose a technique of frequent itemset hierarchical clustering (FIHC) using an MDL-based algorithm, namely KRIMP. Unlike the original FIHC technique, the proposed method defines clustering as a rank sequence problem over the top-3 ranked list of each itemsets-of-keywords cluster in the web document search results of a given query to a search engine. The key idea of an MDL compression-based approach is the code table. Only frequent and representative keywords, such as those in a KRIMP code table, can be used as candidates, instead of all important keywords from a keyword extractor such as RAKE. To simulate information needs in the real world, the web documents originate from the search results of a multi-domain query. Starting in a meta-search engine environment to collect many relevant documents, we set k = {50, 100, 200} for the k-toplist of retrieved documents of each search engine to build a dataset for automatic relevance judgement. We then apply the MDL-based FIHC algorithm to the best individual search engine, with k = {50, 100, 200} for the k-toplist of retrieved documents, minimum support = 5 for itemset KRIMP compression, and minimum cluster support = 0.1 for FIHC clustering. Our results show that MDL-based FIHC clustering can significantly improve the relevance scores of web search results on an individual search engine (by up to 39.2 % at precision P@10, k-toplist = 50).

Diyah Puspitaningrum, Fauzi, Boko Susilo, Jeri Apriansyah Pagua, Aan Erlansari, Desi Andreswari, Rusdi Efendi, I. S. W. B. Prasetya

Improving Clustering Quality by Automatic Text Summarization

Automatic text summarization is the process of reducing the size of a text document to create a summary that retains the most important points of the original document. It can thus be applied to summarize the original document by decreasing the importance of, or removing, part of the content. The contribution of this paper in this field is twofold. First, we show that text summarization can improve the performance of classical text clustering algorithms, in particular by reducing noise coming from long documents that can negatively affect clustering results. Moreover, the clustering quality can be used to quantitatively evaluate different summarization methods. In this regard, we propose a new graph-based summarization technique for keyphrase extraction, and use the Classic4 and BBC NEWS datasets to evaluate the improvement in clustering quality obtained using text summarization.

Mohsen Pourvali, Salvatore Orlando, Mehrad Gharagozloo

Tweet Timeline Generation via Graph-Based Dynamic Greedy Clustering

When issuing a query on a microblogging service, a user would typically receive an archive of tweets as part of a retrospective piece on the impact of social media. To make the retrieved tweets easier to understand, it is useful to produce a summarized timeline about a given topic. However, tweet timeline generation is quite challenging due to the noisy and temporal characteristics of microblogs. In this paper, we propose a graph-based dynamic greedy clustering approach, which considers the coverage, relevance and novelty of the tweet timeline. First, a tweet embedding representation is learned in order to construct the tweet semantic graph. Based on the graph, we estimate the coverage of the timeline according to the graph connectivity. Furthermore, we integrate a noise tweet elimination component that removes noisy tweets using lexical and semantic features based on relevance and novelty. Experimental results on public Text Retrieval Conference (TREC) Twitter corpora demonstrate the effectiveness of the proposed approach.

Feifan Fan, Runwei Qiang, Chao Lv, Wayne Xin Zhao, Jianwu Yang

Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Short Text Conversation (STC) is a new NTCIR task which tackles the following research question: given a microblog repository and a new post to that microblog, can systems reuse an old comment from the repository to satisfy the author of the new post? The official evaluation measures of STC are normalised gain at 1 (nG@1), normalised expected reciprocal rank at 10 (nERR@10), and P+, all of which can be regarded as evaluation measures for navigational intents. In this study, we apply the topic set size design technique of Sakai to decide on the number of test topics, using variance estimates of the above evaluation measures. Our main conclusion is to create 100 test topics, but what distinguishes our work from other tasks with similar topic set sizes is that we know what this topic set size means from a statistical viewpoint for each of our evaluation measures. We also demonstrate that, under the same set of statistical requirements, the topic set sizes required by nERR@10 and P+ are more or less the same, while nG@1 requires more than twice as many topics. To our knowledge, our task is the first among all efforts at TREC-like evaluation conferences to actually create a new test collection by using this principled approach.

Tetsuya Sakai, Lifeng Shang, Zhengdong Lu, Hang Li

Towards Nuanced System Evaluation Based on Implicit User Expectations

Information retrieval systems are often evaluated through the use of effectiveness metrics. In the past, the metrics used have corresponded to fixed models of user behavior, presuming, for example, that the user will view a pre-determined number of items in the search engine results page, or that they have a constant probability of advancing from one item in the result page to the next. Recently, a number of proposals for models of user behavior have emerged that are parameterized in terms of the number of relevant documents (or other material) a user expects to be required to address their information need. That recent work has demonstrated that T, the user’s a priori utility expectation, is correlated with the underlying nature of the information need; and hence that evaluation metrics should be sensitive to T. Here we examine the relationship between the query the user issues, and their anticipated T, seeking syntactic and other clues to guide the subsequent system evaluation. That is, we wish to develop mechanisms that, based on the query alone, can be used to adjust system evaluations so that the experience of the user of the system is better captured in the system’s effectiveness score, and hence can be used as a more refined way of comparing systems. This paper reports on a first round of experimentation, and describes the progress (albeit modest) that we have achieved towards that goal.

Paul Thomas, Peter Bailey, Alistair Moffat, Falk Scholer
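
The two fixed user models mentioned in this abstract can be illustrated with a minimal sketch. Precision at depth k assumes the user views a pre-determined number of items, while rank-biased precision (RBP) assumes a constant probability p of advancing from one item to the next. The function names and the example relevance vector are illustrative assumptions, not taken from the paper, and the T-parameterised models it studies are not shown.

```python
def precision_at_k(rels, k):
    """Fixed-depth model: the user inspects exactly the top k items."""
    return sum(rels[:k]) / k


def rank_biased_precision(rels, p):
    """Constant-persistence model: after each item the user moves on
    to the next one with probability p."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))


# Hypothetical binary relevance judgements for one ranked list.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking, 5))
print(rank_biased_precision(ranking, 0.5))
```

Both functions score the same ranking differently because they weight ranks differently: the first treats the top k positions equally, the second discounts each position geometrically.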

A Study of Visual and Semantic Similarity for Social Image Search Recommendation

Partially due to the short and ambiguous keyword queries, many image search engines group search results into conceptual image clusters to minimize the chance of completely missing the user’s search intent. Very often, a small subset of the image clusters in the search results is relevant to the user’s search intent. However, existing search engines do not support further exploration once a user has located the image cluster(s) that interest her. Similar to the problem of finding images similar to a given image, in this paper, we study the problem of “finding similar image clusters of a given image cluster”. We study this problem in the context of socially annotated images (e.g., images annotated with tags in Flickr). Each image cluster is then represented in two feature spaces: the visual feature space to describe the visual characteristics of the images in the image clusters; and the semantic feature space to describe an image cluster based on the tags of its member images. Two measures named relatedness and diversity are proposed to evaluate the effectiveness of the visual and semantic similarities in image cluster recommendation. Our experimental results show that both visual and semantic similarities should be considered in image cluster recommendation to support search result exploration. We also note that using visual similarity leads to more diversified recommendations while the semantic similarity recommends conceptually more related image clusters.

Yangjie Yao, Aixin Sun

Company Name Disambiguation in Tweets: A Two-Step Filtering Approach

Using Twitter as an effective marketing tool has become a gold mine for companies interested in their online reputation. A significant research challenge in this setting is to disambiguate tweets with respect to company names. In fact, deciding whether a particular tweet is relevant or irrelevant to a company is an important task that has not yet been satisfactorily solved. To address this issue, in this paper we propose a Wikipedia-based two-step filtering algorithm. As opposed to most other methods, the proposed approach is fully automatic and does not rely on hand-coded rules. The first step is a precision-oriented pass that uses Wikipedia as an external knowledge source to extract pertinent terms and phrases from certain parts of company Wikipedia pages, and uses these as weighted filters to identify tweets about a given company. The second pass expands the first to increase recall by including more terms from URLs in tweets, Twitter user profile information and hashtags. The approach is evaluated on a CLEF lab dataset, showing good performance, especially for English tweets.

M. Atif Qureshi, Arjumand Younus, Colm O’Riordan, Gabriella Pasi

Utilizing Word Embeddings for Result Diversification in Tweet Search

The performance of result diversification for tweet search suffers from the well-known vocabulary mismatch problem, as tweets are too short and usually informal. As a remedy, we propose to adopt a query and tweet expansion strategy that utilizes automatically-generated word embeddings. Our experiments using state-of-the-art diversification methods on the Tweets2013 corpus reveal encouraging results for expanding queries and/or tweets based on the word embeddings to improve the diversification performance in tweet search. We further show that the expansions based on the word embeddings may be as useful as those based on a manually constructed knowledge base, namely, ConceptNet.

Kezban Dilek Onal, Ismail Sengor Altingovde, Pinar Karagoz
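
Expansion with word embeddings, as described in this abstract, can be sketched as a nearest-neighbour lookup in the embedding space. The following is a minimal sketch under stated assumptions: the two-dimensional toy vectors, the function names, and the choice of cosine similarity with a fixed number of neighbours per term are all hypothetical and do not reproduce the authors' configuration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(terms, embeddings, n=1):
    """Append up to n nearest embedding neighbours of each known term."""
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are left unexpanded
        neighbours = sorted(
            (w for w in embeddings if w != term),
            key=lambda w: cosine(embeddings[term], embeddings[w]),
            reverse=True,
        )
        for w in neighbours[:n]:
            if w not in expanded:
                expanded.append(w)
    return expanded


# Hypothetical toy embeddings; real ones would be learned from a corpus.
toy = {
    "quake": [0.9, 0.1],
    "earthquake": [0.88, 0.15],
    "tremor": [0.8, 0.3],
    "concert": [0.1, 0.95],
}
print(expand(["quake"], toy, n=2))
```

The same function can be applied to the tokens of a tweet rather than a query, which corresponds to the tweet-side expansion the abstract mentions.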

Displaying People with Old Addresses on a Map

This paper proposes a method of converting old addresses to current addresses for geocoding, with the aim of displaying on a map people who have such old addresses. Existing geocoding services often fail to handle old addresses since the names of towns, cities, or prefectures can be different from those of current addresses. To solve this geocoding problem, we focus on postal codes, extracting them from Web search result snippets using the query “prefecture name AND important place name AND postal code.” The frequency of postal codes and the edit distance between the old address and the addresses obtained using the postal codes are used to judge the most suitable postal code and thus the corresponding current address. The effectiveness of the proposed method is evaluated in an experiment using a relative dataset. A prototype system was implemented in which users could display people using their birthdate and birthplace addresses on a map chronologically with an associated history chart.

Gang Zhang, Harumi Murakami
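
The selection step in this abstract combines two signals: how often a postal code appears in the search snippets, and the edit distance between the old address and the address each code maps to. The abstract does not say how the two are combined, so the sketch below assumes one simple rule (smallest edit distance first, higher frequency as tie-breaker). All names, codes and addresses are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pick_postal_code(old_address, snippet_codes, code_to_address):
    """Choose the candidate code whose current address is closest to the
    old address; ties go to the more frequent code (assumed rule)."""
    freq = Counter(snippet_codes)
    return min(
        freq,
        key=lambda c: (edit_distance(old_address, code_to_address[c]), -freq[c]),
    )


# Hypothetical codes extracted from snippets and their current addresses.
codes = ["000-0001", "000-0002", "000-0002"]
addresses = {"000-0001": "New Town 1", "000-0002": "Far City 9"}
print(pick_postal_code("Old Town 1", codes, addresses))
```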

A Query Expansion Approach Using Entity Distribution Based on Markov Random Fields

The development of knowledge graph construction has prompted more and more commercial engines to improve the retrieval performance by using knowledge graphs as the basic semantic web. Knowledge graphs are often used for knowledge inference and entity search; however, the potential of their entities and properties to improve search performance through query expansion remains to be explored further. In this paper, we propose a novel query expansion technique with knowledge graph (KG) based on the Markov random fields (MRF) model to enhance retrieval performance. This technique, called MRF-KG, models the joint distribution of original query terms, documents and two expanded variants, i.e. entities and properties. We conduct experiments on two TREC collections, WT10G and ClueWeb12B, annotated with Freebase entities. Experimental results demonstrate that MRF-KG outperforms traditional graph-based models.

Rui Li, Linxue Hao, Xiaozhao Zhao, Peng Zhang, Dawei Song, Yuexian Hou

Analysis of Cyber Army’s Behaviours on Web Forum for Elect Campaign

The goal of a cyber army in an election campaign is to promote a certain candidate and denounce his/her rivals in order to win ballots. This paper investigates the cyber army’s behaviors through a real case study, the 2014 Taipei mayoral race. We analyze the data crawled from the Gossip Forum on the Professional Technology Temple (PTT), Taiwan’s largest online bulletin board. The operations of the cyber army are shown and discussed.

Man-Chun Ko, Hsin-Hsi Chen

Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure

In the vector space model, different types of term weighting schemes are used to adjust bag-of-words document vectors in order to improve the performance of the most widely used cosine distance. Even though the cosine distance with some term weighting schemes results in a more reliable (dis)similarity measure on some data sets, it may not perform well on others because of the underlying assumptions of the term weighting schemes. In this paper, we argue that the explicit adjustment of bag-of-words document vectors using term weighting is not required if a data-dependent dissimilarity measure called $$m_p$$-dissimilarity is used. Our empirical results on a document retrieval task reveal that $$m_p$$ with the simplest binary bag-of-words representation is either better than or competitive with the cosine distance with the best performing state-of-the-art term weighting scheme on four widely used benchmark document collections.

Sunil Aryal, Kai Ming Ting, Gholamreza Haffari, Takashi Washio
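
For orientation, the baseline this abstract argues against (cosine distance over term-weighted bag-of-words vectors) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses one common tf-idf variant (raw term frequency times log(N/df)), and the function names and toy documents are hypothetical. The $$m_p$$-dissimilarity itself is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term by tf * log(N / df); one of many tf-idf variants."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (nu * nv)


# Hypothetical tokenised documents.
docs = [["index", "query", "query"], ["index", "cache"], ["index", "query"]]
vecs = tfidf_vectors(docs)
print(cosine_distance(vecs[0], vecs[2]))  # share their only weighted term
print(cosine_distance(vecs[0], vecs[1]))  # share no weighted term
```

Note that "index" occurs in every toy document, so its idf weight is zero and it does not contribute to any distance. This is the kind of built-in assumption of a weighting scheme that the abstract refers to.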

Explorations of Cross-Disciplinary Term Similarity

This paper presents some initial explorations into how to compute term similarity across different domains, or in the present case, scientific disciplines. In particular, we explore the concept of polysemy across disciplines, where the same term can have different meanings in different disciplines. This can lead to confusion and/or erroneous query expansion if the domain is not properly identified. Typical bag-of-words systems are not equipped to highlight such differences as terms would have a single representation. Identifying the synonymy of terms across different domains is also a difficult problem for typical bag-of-words systems, as they use surrounding words that will usually also be different across domains. Yet discovering such similarities across domains can support tasks such as literature discovery. We propose an approach that integrates knowledge based distances into a distributional semantics framework and demonstrate its efficiency on a hand-crafted dataset.

A Signature Approach to Patent Classification

We propose a document signature approach to patent classification. Automatic patent classification is a challenging task because of the fast-growing number of patent applications filed every year and the complexity, size and nested hierarchical structure of patent taxonomies. In our proposal, the classification of a target patent is achieved through a k-nearest neighbour search using Hamming distance on signatures generated from patents; the classification labels of the retrieved patents are weighted and combined to produce a patent classification code for the target patent. The use of this method is motivated by the fact that, intuitively, document signatures are more efficient than previous approaches for this task that considered the training of classifiers on the whole vocabulary feature set. Our empirical experiments also demonstrate that the combination of document signatures and k-nearest neighbour search improves classification effectiveness, provided that enough data is used to generate signatures.

Dilesha Seneviratne, Shlomo Geva, Guido Zuccon, Gabriela Ferraro, Timothy Chappell, Magali Meireles
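
The classification procedure in this abstract (k-nearest neighbour search under Hamming distance, followed by a weighted combination of the neighbours' labels) can be sketched as below. This is a minimal sketch, assuming signatures are stored as Python integers and that each neighbour's vote is weighted by 1 / (1 + distance). The weighting, the 8-bit signatures and the example labels are illustrative assumptions, not the authors' settings.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integer-encoded signatures."""
    return bin(a ^ b).count("1")


def classify(target, labelled, k=3):
    """labelled is a list of (signature, label) pairs. The k closest
    signatures vote for their label, weighted by closeness (assumed)."""
    nearest = sorted(labelled, key=lambda sl: hamming(target, sl[0]))[:k]
    votes = defaultdict(float)
    for sig, label in nearest:
        votes[label] += 1.0 / (1 + hamming(target, sig))
    return max(votes, key=votes.get)


# Hypothetical 8-bit signatures with illustrative class labels.
training = [
    (0b11110000, "A01"),
    (0b11100000, "A01"),
    (0b00001111, "H04"),
    (0b00011111, "H04"),
]
print(classify(0b11110001, training, k=3))
```

Because the distance is a bitwise XOR followed by a bit count, each comparison is cheap, which is the efficiency argument the abstract makes for signatures.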

The Impact of Using Combinatorial Optimisation for Static Caching of Posting Lists

Caching posting lists can reduce the amount of disk I/O required to evaluate a query. Current methods use optimisation procedures for maximising the cache hit ratio. A recent method selects posting lists for static caching in a greedy manner and obtains higher hit rates than standard cache eviction policies such as LRU and LFU. However, a greedy method does not formally guarantee an optimal solution. We investigate whether the use of methods guaranteed, in theory, to find an approximately optimal solution would yield higher hit rates. Thus, we cast the selection of posting lists for caching as an integer linear programming problem and perform a series of experiments using heuristics from combinatorial optimisation (CCO) to find optimal solutions. Using simulated query logs we find that CCO yields comparable results to a greedy baseline using cache sizes between 200 and 1000 MB, with modest improvements for queries of length two to three.

Casper Petersen, Jakob Grue Simonsen, Christina Lioma
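
The remark in this abstract that a greedy method does not guarantee an optimal solution can be made concrete with a small sketch. It assumes the greedy criterion is expected hits per unit of cache space (the abstract does not specify the criterion), and it compares the result with an exhaustive search that is only feasible for toy inputs and is not the integer linear programming formulation used in the paper. All names and numbers are hypothetical.

```python
from itertools import combinations


def greedy_static_cache(lists, capacity):
    """lists maps term -> (query_frequency, size). Fill the cache in
    order of frequency per unit size (assumed greedy criterion)."""
    chosen, used = [], 0
    ranked = sorted(lists, key=lambda t: lists[t][0] / lists[t][1], reverse=True)
    for term in ranked:
        if used + lists[term][1] <= capacity:
            chosen.append(term)
            used += lists[term][1]
    return chosen


def exhaustive_best(lists, capacity):
    """Try every subset; exponential, for tiny illustrative inputs only."""
    best, best_hits = [], 0
    terms = list(lists)
    for r in range(1, len(terms) + 1):
        for combo in combinations(terms, r):
            size = sum(lists[t][1] for t in combo)
            hits = sum(lists[t][0] for t in combo)
            if size <= capacity and hits > best_hits:
                best, best_hits = list(combo), hits
    return best


# Hypothetical posting lists: term -> (query frequency, size in MB).
posting_lists = {"x": (7, 6), "y": (5, 5), "z": (5, 5)}
print(greedy_static_cache(posting_lists, 10))
print(exhaustive_best(posting_lists, 10))
```

In this toy case the greedy choice caches only "x" (7 hits), while caching "y" and "z" together fits the same budget and yields 10 hits.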

Improving Difficult Queries by Leveraging Clusters in Term Graph

Term graphs, in which the nodes correspond to distinct lexical units (words or phrases) and the weighted edges represent semantic relatedness between those units, have been previously shown to be beneficial for ad-hoc IR. In this paper, we experimentally demonstrate that indiscriminate utilization of term graphs for query expansion limits their retrieval effectiveness. To address this deficiency, we propose to apply graph clustering to identify coherent structures in term graphs and utilize these structures to derive more precise query expansion language models. Experimental evaluation of the proposed methods using term association graphs derived from document collections and popular knowledge bases (ConceptNet and Wikipedia) on TREC datasets indicates that leveraging semantic structure in term graphs makes it possible to improve the results of difficult queries through query expansion.

Rajul Anand, Alexander Kotov

EEST: Entity-Driven Exploratory Search for Twitter

Social media has become a comprehensive platform for users to obtain information. When searching social media, users’ search intents are usually related to one or more entities. Entities, which usually convey rich information for modeling relevance, are a common choice for query expansion. Previous work usually focuses on entities from a single source, which is not adequate to cover users’ various search intents. Thus, we propose EEST, a novel multi-source entity-driven exploratory search engine to help users quickly target their real information need. In the first phase, EEST extracts related entities and corresponding relationship information from multiple sources (i.e., Google, Twitter and Freebase). These entities help users better understand hot aspects of the given query. Expanded queries are generated automatically when users choose an entity for further exploration. In the second phase, related users and representative tweets are offered to users directly for quick browsing. A demo of EEST is available at http://demo.webkdd.org.

Chao Lv, Runwei Qiang, Lili Yao, Jianwu Yang

Oyster: A Tool for Fine-Grained Ontological Annotations in Free-Text

Oyster is a web-based annotation tool that allows users to annotate free-text with respect to concepts defined in formal knowledge resources such as large domain ontologies. The tool has been explicitly designed to provide (manual and automatic) search functionalities to identify the best concept entities to be used for annotation. In addition, Oyster supports features such as annotations that span across non-adjacent tokens, multiple annotations per token, the identification of entity relationships and a user-friendly visualisation of the annotation including the use of filtering based on annotation types. Oyster is highly configurable and can be expanded to support a variety of knowledge resources. The tool can support a wide range of tasks involving human annotation, including named-entity extraction, relationship extraction, annotation correction and refinement.

Hamed Tayebikhorami, Alejandro Metke-Jimenez, Anthony Nguyen, Guido Zuccon

TopSig: A Scalable System for Hashing and Retrieving Document Signatures

There are a large number of overlapping problems within information retrieval that involve retrieving objects with certain features or objects based on their similarity to other objects. If the features that define these objects can be extracted, these objects can be reduced to a common representation that maintains pairwise similarity but discards all other data in order to facilitate compact storage and scalable retrieval. In this paper we introduce TopSig, an open-source tool for hashing and retrieving topology-sensitive document signatures.

Timothy Chappell, Shlomo Geva

Backmatter

Further Information