2011 | OriginalPaper | Buchkapitel
Improved Stable Retrieval in Noisy Collections
verfasst von : Gianni Amati, Alessandro Celi, Cesidio Di Nicola, Michele Flammini, Daniela Pavone
Erschienen in: Advances in Information Retrieval Theory
Verlag: Springer Berlin Heidelberg
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
We consider the problem of retrieval on noisy text collections. This is a paramount problem for retrieval with new social media collections, like Twitter, where typical messages are short, whilst dictionary is very large in size, plenty of variations of emoticons, term shortcuts and any other type of users jargon. In particular, we propose a new methodology which combines different effective techniques, some of them proposed in the OCR information retrieval literature, such as
n
-grams tokenization, approximate string matching algorithms, that need to be plugged in suitable IR models of retrieval and query reformulation. To evaluate the methodology we use the OCR degraded collections of the Confusion TREC. Unlike the solutions proposed by the TREC participants, tuned for specific collections and thus exhibiting a high variable performance among the different degradation levels, our model is highly stable. In fact, with the same tuned parameters, it reaches the best or nearly best performance simultaneously on all the three Confusion collections (Original, Degrade 5% and Degrade 20%), with a 33% improvement on the average MAP measure. Thus, it is a good candidate as a universal high precision strategy to be used when there isn’t any a priori knowledge of the specific domain. Moreover, our parameters can be specifically tuned in order to obtain the best up to date retrieval performance at all levels of collection degradation, and even on the clean collection, that is the original collection without the OCR errors.