About this book

This book constitutes the thoroughly refereed proceedings of the 8th Russian Summer School on Information Retrieval, RuSSIR 2014, held in Nizhniy Novgorod, Russia, in August 2014.

The 14 papers presented were selected from various submissions. The papers focus on visualization for information retrieval along with other topics related to information retrieval.

Table of contents

Frontmatter

Tutorial Papers

Frontmatter

Document Analysis and Retrieval Tasks in Scientific Digital Libraries

Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive tasks related to documents such as categorization and entity extraction. Consequently, the application of machine learning techniques for various large-scale IR tasks has gathered significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeer\(^x\), which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address these challenges.
Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, C. Lee Giles

Online Experimentation for Information Retrieval

Online experimentation for information retrieval (IR) focuses on insights that can be gained from user interactions with IR systems, such as web search engines. The most common form of online experimentation, A/B testing, is widely used in practice, and has helped sustain continuous improvement of the current generation of these systems.
As online experimentation is taking a more and more central role in IR research and practice, new techniques are being developed to address, e.g., questions regarding the scale and fidelity of experiments in online settings. This paper gives an overview of the currently available tools. This includes techniques that are already in wide use, such as A/B testing and interleaved comparisons, as well as techniques that have been developed more recently, such as bandit approaches for online learning to rank.
This paper summarizes and connects the wide range of techniques and insights that have been developed in this field to date. It concludes with an outlook on open questions and directions for ongoing and future research.
Katja Hofmann
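
The interleaved comparisons mentioned in this abstract can be illustrated with a minimal sketch of team-draft interleaving, one common interleaving method. This is not code from the paper: the function names `team_draft_interleave` and `credit_clicks` and the document identifiers are hypothetical, and real systems add logging, deduplication and statistical testing on top.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return the merged list and the team owning each result."""
    rng = rng or random.Random()
    merged, teams = [], {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = "A" if rng.random() < 0.5 else "B"
        doc = next((d for d in rankings[team] if d not in teams), None)
        if doc is None:
            # This ranking is exhausted, so the other team picks instead.
            team = "B" if team == "A" else "A"
            doc = next(d for d in rankings[team] if d not in teams)
        merged.append(doc)
        teams[doc] = team
        picks[team] += 1
    return merged, teams


def credit_clicks(teams, clicked_docs):
    """Count clicks per team; the team with more clicks wins the comparison."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in teams:
            wins[teams[doc]] += 1
    return wins
```

Each impression shows the user the merged list, and the clicks are credited to whichever ranker contributed the clicked result. Aggregating these credits over many impressions gives a preference between the two rankers.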

Introduction to Formal Concept Analysis and Its Applications in Information Retrieval and Related Fields

This paper is a tutorial on Formal Concept Analysis (FCA) and its applications. FCA is an applied branch of Lattice Theory, a mathematical discipline which enables formalisation of concepts as basic units of human thinking and analysing data in the object-attribute form. Originating in the early 1980s, it has become over the last three decades a popular human-centred tool for knowledge representation and data analysis with numerous applications. Since the tutorial was specially prepared for RuSSIR 2014, the covered FCA topics include Information Retrieval with a focus on visualisation aspects, Machine Learning, Data Mining and Knowledge Discovery, Text Mining and several others.
Dmitry I. Ignatov
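
To make the object-attribute form concrete, here is a minimal sketch that enumerates the formal concepts (extent, intent pairs) of a small context. It is a naive illustration, not the tutorial's own algorithm: the function name `formal_concepts` and the toy context are assumptions, and the approach (closing the object intents under intersection) only suits small data.

```python
def formal_concepts(context):
    """Return all (extent, intent) pairs of a context {object: set_of_attributes}."""
    all_attrs = frozenset().union(*context.values())
    # Every concept intent is an intersection of object intents
    # (the empty intersection is the full attribute set).
    intents = {all_attrs}
    for obj_attrs in context.values():
        intents |= {intent & frozenset(obj_attrs) for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is the set of objects that have every attribute of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= set(attrs))
        concepts.append((extent, intent))
    return sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0])))


# Hypothetical toy context: animals described by habitat attributes.
toy_context = {
    "frog": {"aquatic", "terrestrial"},
    "fish": {"aquatic"},
    "dog": {"terrestrial"},
}

for extent, intent in formal_concepts(toy_context):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents yields the concept lattice, which is the structure usually drawn in FCA visualisations.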

Visualization and Data Mining for High Dimensional Data: With Connections to Information Retrieval

The first, and still most popular, application of parallel coordinates is in exploratory data analysis (EDA): discovering data subsets (relations) satisfying given objectives.
Alfred Inselberg, Pei Ling Lai

Web as a Corpus: Going Beyond the n-gram

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on sub-problems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field.
Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should be only used as a baseline. We show that much better results are possible for structural ambiguity problems, when going beyond the n-gram.
Preslav Nakov

Author Profiling and Plagiarism Detection

In this paper we introduce the topics that we will cover in the RuSSIR 2014 course on Author Profiling and Plagiarism Detection (APPD). Author profiling distinguishes between classes of authors by studying how language is shared by classes of people. This task helps in identifying profiling aspects such as gender, age, native language, or even personality type. In the plagiarism detection task, by contrast, we are not interested in studying how language is shared. Instead, given a document, we investigate whether the writing style changes in order to unveil text inconsistencies, i.e., unexpected irregularities throughout the document such as changes in vocabulary, style and text complexity. In fact, when it is not possible to retrieve the source document(s) from which the text was plagiarised, the intrinsic analysis of the suspicious document is the only way to find evidence of plagiarism. The difficulty in retrieving the source of plagiarism could be due to the fact that the documents are not available on the web or that the plagiarised text fragments were obfuscated via paraphrasing or translation (in case the source document was in another language). In this overview, we also discuss the results of the shared tasks on author profiling (gender and age identification) and plagiarism detection that we help to organise at the PAN Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse (http://pan.webis.de).
Paolo Rosso

Young Scientists Conference Papers

Frontmatter

Transformation of Categorical Features into Real Using Low-Rank Approximations

Most existing machine learning techniques can handle objects described by real but not categorical features. In this paper we introduce a simple unsupervised method for transforming categorical feature values into real ones. It is based on low-rank approximations of collaborative feature value frequencies. Once object descriptions are transformed, any common real-value machine learning technique can be applied for further data analysis. For example, it becomes possible to apply the classic and powerful Random Forest predictor in supervised learning problems. Our experiments show that combining the proposed feature transformation method with common real-value supervised algorithms leads to results that are comparable to state-of-the-art approaches like Factorization Machines.
Alexander Fonarev
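
The general idea of deriving real-valued features from a low-rank approximation of value frequencies can be sketched as follows. This is an assumption-laden illustration, not the paper's method: it uses a truncated SVD of the co-occurrence counts between two categorical columns, and the function name `embed_categorical`, the `rank` parameter and the sample data are all hypothetical.

```python
import numpy as np


def embed_categorical(col_a, col_b, rank=2):
    """Map each value of col_a to a real vector via truncated SVD of co-occurrence counts."""
    values_a = sorted(set(col_a))
    values_b = sorted(set(col_b))
    index_a = {v: i for i, v in enumerate(values_a)}
    index_b = {v: j for j, v in enumerate(values_b)}

    # counts[i, j] = how often value i of the first column co-occurs
    # with value j of the second column in the same object.
    counts = np.zeros((len(values_a), len(values_b)))
    for a, b in zip(col_a, col_b):
        counts[index_a[a], index_b[b]] += 1

    # Keep the top singular directions: a rank-k approximation of the counts.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(rank, len(s))
    factors = u[:, :k] * s[:k]
    return {v: factors[i] for v, i in index_a.items()}


# Hypothetical sample data: two categorical columns of the same objects.
countries = ["ru", "ru", "us", "us", "de", "de"]
browsers = ["chrome", "firefox", "chrome", "firefox", "safari", "safari"]
embedding = embed_categorical(countries, browsers, rank=2)
```

Values that co-occur with the same partners receive the same vector, so a real-valued learner such as Random Forest can treat them alike after the categorical column is replaced by these vectors.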

A Comparative Evaluation of Statistical Part-of-Speech Taggers for Russian

Part-of-speech (POS) tagging is an essential step in many text processing applications. Quite a few works focus on solving this task for Russian, but their results are not directly comparable due to the lack of shared datasets and tools. We propose a POS tagging evaluation framework for Russian that comprises existing third-party resources available for researchers. We applied the framework to compare several implementations of statistical classifiers: HunPos, Stanford POS tagger, OpenNLP implementation of MaxEnt Markov Model, and our own re-implementation of Tiered Conditional Random Fields. The best tagger, trained on a corpus of less than one million words, achieved an accuracy above 93%. We expect that the evaluation framework will facilitate future studies and improvements on POS tagging for Russian.
Rinat Gareev, Vladimir Ivanov

Recommendation of Ideas and Antagonists for Crowdsourcing Platform Witology

This paper introduces several recommender methods for crowdsourcing platforms. These methods are based on modern data analysis approaches for object-attribute data, such as Formal Concept Analysis and biclustering. The use of the proposed techniques is illustrated by the results of recommending ideas and antagonists for the crowdsourcing platform Witology. In particular, we show how the quality of the antagonist recommender can be improved by using biclusters as focal areas for distance and similarity calculation.
Dmitry I. Ignatov, Maria Mikhailova, Alexandra Yu. Zakirova, Alexander Malioukov

Modelling Movement of Stock Market Indexes with Data from Emoticons of Twitter Users

The use of Twitter data to increase the prediction rate of stock price movements draws the attention of many researchers. In this paper we examine the possibility of analyzing Twitter users’ emoticons to improve the accuracy of predictions for the DJIA and S&P500 stock market indices. We analyzed 1.6 billion tweets downloaded from February 13, 2013 to May 19, 2014. As forecasting techniques, we tested Support Vector Machines (SVM), Neural Networks and Random Forest, which are commonly used for prediction tasks in financial analytics. The results of applying machine learning techniques to stock market price prediction are discussed.
Alexander Porshnev, Ilya Redkin, Nikolay Karpov

ImSe: Exploratory Time-Efficient Image Retrieval System

We consider the problem of Content-Based Image Retrieval (CBIR) with interactive user feedback when the user is unable to query the system with natural language text. We employ content-based techniques with a Relevance Feedback mechanism to capture the precise need of the user and interactively refine the query. We apply the Exploration/Exploitation trade-off with Hierarchical Gaussian Process Bandits and pseudo feedback in order to tackle the problem of optimization in the face of uncertainty and to improve the quality of multiple-image selection. We tackle the scalability issue with a Self-Organizing Map as a preprocessing technique. A prototype system called ImSe was developed and tested in experiments with real users in different types of search tasks. The experiments show favorable results and indicate the benefits of the proposed approach.
Ksenia Konyushkova, Dorota Głowacka

Semantic Clustering of Russian Web Search Results: Possibilities and Problems

The present paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then apply the data to cluster the results of Mail.ru search according to meanings in the query. We compare different methods of performing such clustering and different source corpora. Models of applying distributional semantics to big linguistic data are described.
Andrey Kutuzov

A Large-Scale Community Questions Classification Accounting for Category Similarity: An Exploratory Study

The paper reports on a large-scale topical categorization of questions from the Russian community question answering (CQA) service Otvety@Mail.Ru. We used a data set containing all the questions (more than 11 million) asked by Otvety@Mail.Ru users in 2012. This is the first study on question categorization dealing with non-English data of this size. The study focuses on adjusting category structure in order to get more robust classification results. We investigate several approaches to measure similarity between categories: the share of identical questions, language models, and user activity. The results show that the proposed approach is promising.
Galina Lezina, Pavel Braslavski

Towards Crowdsourcing and Cooperation in Linguistic Resources

Linguistic resources can be populated with data through the use of such approaches as crowdsourcing and gamification when motivated people are involved. However, current crowdsourcing genre taxonomies lack the concept of cooperation, which is the principal element of modern video games and may potentially drive the annotators’ interest. This survey on crowdsourcing taxonomies and cooperation in linguistic resources provides recommendations on using cooperation in existing genres of crowdsourcing and evidence of the efficiency of cooperation, using a popular Russian linguistic resource created through crowdsourcing as an example.
Dmitry Ustalov

Backmatter
