Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topics models offer an elegant way to represent content across different languages. Their probabilistic framework allows for their easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages are integrated as useful additional evidences in cross-lingual information retrieval models.
Weitere Kapitel dieses Buchs durch Wischen aufrufen
Bitte loggen Sie sich ein, um Zugang zu diesem Inhalt zu erhalten
Sie möchten Zugang zu diesem Inhalt erhalten? Dann informieren Sie sich jetzt über unsere Produkte:
- Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval
- Springer Berlin Heidelberg