Chapter 1. Introduction to Topic Detection and Tracking

Abstract

The Topic Detection and Tracking (TDT) research program has been running for five years, starting with a pilot study and including yearly open and competitive evaluations since then. In this chapter we define the basic concepts of TDT and provide historical context for the concepts. In describing the various TDT evaluation tasks and workshops, we provide an overview of the technical approaches that have been used and that have succeeded.

James Allan

Chapter 2. Topic Detection and Tracking Evaluation Overview

Abstract

The objective of the Topic Detection and Tracking (TDT) program is to develop technologies that search, organize and structure multilingual, news-oriented textual materials from a variety of broadcast news media. This research program uses controlled laboratory simulations of hypothetical systems to test the efficacy of potential technologies, to gauge research progress, and to provide a forum for the exchange of research information. This chapter introduces TDT’s evaluation methodology, including the Linguistic Data Consortium’s TDT corpora, the evaluation metrics used in TDT, and the five TDT research tasks: Topic Tracking, Link Detection, Topic Detection, First Story Detection, and Story Segmentation.

Jonathan G. Fiscus, George R. Doddington

Chapter 3. Corpora for Topic Detection and Tracking

Abstract

The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio, reference text (including written news text and transcripts of the broadcast audio), boundary tables segmenting the broadcasts into stories, and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first story and story link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use, and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.

Christopher Cieri, Stephanie Strassel, David Graff, Nii Martey, Kara Rennert, Mark Liberman

Chapter 4. Probabilistic Approaches to Topic Detection and Tracking

Abstract

BBN’s systems for TDT use probabilistic models for higher accuracy and easy training. They generate measures that are normalized across topics, so that only one threshold is necessary to make decisions. These systems make little or no use of deep linguistic knowledge, and therefore are easy to modify for new languages and domains. At the same time their performance has consistently been in the top tier.

Tim Leek, Richard Schwartz, Srinivasa Sista

Chapter 5. Multi-strategy Learning for Topic Detection and Tracking

A joint report of CMU approaches to multilingual TDT

Abstract

This chapter reports on CMU’s work in all five TDT-1999 tasks: segmentation (story boundary identification), topic tracking, topic detection, first story detection, and story-link detection. We have addressed these tasks as supervised or unsupervised classification problems, and applied a variety of statistical learning algorithms to each problem for comparison. For segmentation we used exponential language models and decision trees; for topic tracking we used primarily k-nearest-neighbors classification (also language models, decision trees and a variant of the Rocchio approach); for topic detection we used a combination of incremental clustering and agglomerative hierarchical clustering; and for first story detection and story link detection we used a cosine-similarity-based measure. We also studied the effect of combining the output of alternative methods for producing joint classification decisions in topic tracking. We found that a combined use of multiple methods typically improved the classification of new topics when compared to using any single method. We examined our approaches with multilingual corpora, including stories in English, Mandarin and Spanish, and multimedia corpora consisting of newswire texts and the results of automated speech recognition for broadcast news sources. The methods worked reasonably well under all of the above conditions.

Yiming Yang, Jaime Carbonell, Ralf Brown, John Lafferty, Thomas Pierce, Thomas Ault
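
The Chapter 5 abstract mentions a cosine-similarity-based measure for first story detection and story link detection. As a rough illustration only, the sketch below compares two stories using raw term-frequency vectors and a fixed threshold. The tokenization, the absence of term weighting, the function names, and the threshold value of 0.2 are all assumptions made for this example; the abstract does not describe how CMU's systems were configured.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector from lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty story is similar to nothing
    return dot / (norm_a * norm_b)


def stories_linked(story_a, story_b, threshold=0.2):
    """Hypothetical link decision: True when similarity reaches the threshold."""
    similarity = cosine_similarity(term_vector(story_a), term_vector(story_b))
    return similarity >= threshold
```

Under the same assumptions, a first story detector could flag a story as new when its similarity to every earlier story stays below the threshold.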

Chapter 6. Statistical Models of Topical Content

Abstract

In this chapter we explore the behavior of two different statistical models, one based on simple unigrams and another based on the beta-binomial distribution, as applied to the problem of modeling story generation. We describe how these models can be incorporated into information extraction applications, particularly Tracking and Detection engines built for the Topic Detection and Tracking evaluations sponsored by DARPA. Tracking systems based on the two models have complementary strengths and weaknesses: a Beta-Binomial system yields high precision at a high decision threshold, but its performance quickly degrades as the threshold drops; a Unigram system is not as strong at a high decision threshold, but is very good at suppressing false alarms at lower thresholds. We will describe the features of these systems that give rise to this behavior, and discuss ways that each system might be improved by borrowing from the other. We will also discuss our Detection system, and how improvements in Tracking should lead to improvements in Detection.

J. P. Yamron, L. Gillick, P. van Mulbregt, S. Knecht
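
The Chapter 6 abstract contrasts a unigram model with one based on the beta-binomial distribution. The sketch below gives the standard closed forms for the probability of seeing a word k times in a story of n tokens: a binomial with a fixed per-token probability (the per-word view of a unigram model) and a beta-binomial whose per-token probability is itself drawn from a Beta(alpha, beta) prior. How the chapter's systems estimate or use these quantities is not stated in the abstract, so the function and parameter names here are illustrative assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_binomial_coefficient(n, k):
    """Logarithm of n choose k."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def binomial_log_pmf(k, n, p):
    """log P(k | n, p) with a fixed per-token probability 0 < p < 1."""
    return (log_binomial_coefficient(n, k)
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))


def beta_binomial_log_pmf(k, n, alpha, beta):
    """log P(k | n, alpha, beta), with the per-token probability integrated out."""
    return (log_binomial_coefficient(n, k)
            + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta))
```

For a given mean, the beta-binomial has a larger variance than the binomial, so it assigns more probability to a word occurring many times within a single story.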

Chapter 7. Segmentation and Detection at IBM

Hybrid Statistical Models and Two-tiered Clustering

Abstract

IBM’s story segmentation uses a combination of decision tree and maximum entropy models. These models take a variety of lexical, prosodic, semantic, and structural features as their inputs. Both types of models are source-specific, and we substantially lower C_seg by combining them. IBM’s topic detection system introduces a minimal hierarchy into the clustering: each cluster comprises one or more microclusters. We investigate the importance of merging microclusters together, and propose a merging strategy which improves our performance.

S. Dharanipragada, M. Franz, J. S. McCarley, T. Ward, W.-J. Zhu

Chapter 8. A Cluster-Based Approach to Broadcast News

Abstract

We present an approach to detection and tracking of topics in multilingual broadcast news based upon a dynamic clustering scheme. Our approach derives from a system used to filter Web searches from multiple sources, with extensions for pipelining document clusters, part-of-speech tagging and extraction of named entities for use in an extended similarity measure.

David Eichmann, Padmini Srinivasan

Chapter 9. Signal Boosting for Translingual Topic Tracking

Document Expansion and n-best Translation

Abstract

The University of Maryland participated in the TDT-1999 topic tracking task. This chapter describes the system architecture, including source-dependent normalization, and then focuses on the cross-language case in which English training stories were used to find Mandarin stories on the same topic. Processes that may introduce noise, including errorful translation and transcription, are described and five techniques for minimizing the impact of a reduced signal-to-noise ratio are identified. Three techniques focus on signal boosting: augmenting story representations with topically related terminology through “document expansion,” exploiting knowledge of alternative translations using balanced n-best term translation, and enriching the bilingual term list to improve translation coverage. The remaining two techniques focus on noise reduction: removing common “stopwords” before translation and using corpus statistics to guide translation selection. Two of the signal boosting strategies yielded substantial gains using techniques that can be ported to other languages fairly easily, while outperforming state-of-the-art general-purpose machine translation. By contrast, neither of the noise reduction strategies produced significant improvements.

Gina-Anne Levow, Douglas W. Oard
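
The Chapter 9 abstract names "balanced n-best term translation" without defining it. The sketch below shows one plausible reading, stated here as an assumption: each source term keeps at most its n best translations, and the term's weight is split evenly among them so that a term with many alternatives does not dominate the translated representation. The function name, the dictionary layout, and the default n are hypothetical.

```python
def balanced_nbest_translate(term_weights, bilingual_terms, n=3):
    """Map weighted source terms to weighted target terms.

    term_weights: dict of source term -> weight in the story.
    bilingual_terms: dict of source term -> list of translations, best first.
    Each source term's weight is divided equally among its top-n translations.
    """
    translated = {}
    for term, weight in term_weights.items():
        candidates = bilingual_terms.get(term, [])[:n]
        if not candidates:
            continue  # no dictionary coverage: the term is dropped
        share = weight / len(candidates)
        for candidate in candidates:
            translated[candidate] = translated.get(candidate, 0.0) + share
    return translated
```

Because terms without dictionary coverage are dropped in this sketch, enriching the bilingual term list (another technique the abstract lists) directly increases how much of the story survives translation.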

Chapter 10. Explorations Within Topic Tracking and Detection

Abstract

This chapter presents the system used by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts for its participation in four of the five TDT tasks: tracking, detection, first story detection, and story link detection. For each task, we discuss the parameter setting approach that we used and the results of our system on the test data.

For the task of link detection, we look more carefully at score normalization across different languages and media types. We find that we can improve results noticeably, though not substantially, by normalizing scores differently depending upon the source language. We also consider smoothing the vocabulary in stories using a “query expansion” technique from Information Retrieval to add words from the corpus to each story. This results in substantial improvements.

In addition, we use TDT evaluation approaches to show that the tracking performance that sites are achieving is what is expected from Information Retrieval technology. We further show that any first story detection system based on a tracking approach is unlikely to be sufficiently accurate for most purposes. Finally, we present an overview of an automatic timeline generation system that we developed using TDT data.

James Allan, Victor Lavrenko, Russell Swan

Chapter 11. Towards a “Universal Dictionary” for Multi-Language Information Retrieval Applications

Abstract

Multilingual information retrieval tasks such as Topic Tracking have yielded high-quality results simply using word-by-word translation approaches. However, the construction of translation dictionaries for new languages is expensive and time-consuming. We show that an appropriate metric for term selection in a monolingual English corpus allows us to define a fairly small list, containing about ten thousand inflected forms or about 7500 lemmas, which works essentially as well (for a particular monolingual document classification evaluation) as an unlimited vocabulary of more than 300,000 word forms does. We suggest that such a list can be taken to form the English axis of a sort of “universal dictionary” for document classification tasks, providing a much more efficient path to the addition of new languages.

J. Michael Schultz, Mark Y. Liberman

Chapter 12. An NLP & IR Approach to Topic Detection

Abstract

This paper presents algorithms for Chinese and English-Chinese topic detection. Named entities, other nouns and verbs serve as cue patterns to relate news stories describing the same event. Lexical translation and name transliteration resolve lexical differences between English and Chinese. A two-threshold scheme determines relevance (or irrelevance) between a news story and a topic cluster. Lookahead information deals with ambiguous cases in clustering. The least-recently-used removal strategy models the time factor in such a way that older and unimportant terms will have no effect on clustering. Experimental results show that the model combining nouns and verbs with the least-recently-used removal strategy outperforms the other models. The named-entity-only approach performs slightly worse, but it avoids the overhead of the nouns-and-verbs approach with the least-recently-used removal strategy.

Hsin-Hsi Chen, Lun-Wei Ku
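
The two-threshold scheme in the Chapter 12 abstract can be pictured as a three-way decision: a story is relevant to a cluster above the upper threshold, irrelevant below the lower one, and ambiguous in between, where the abstract says lookahead information is used. The sketch below is a minimal illustration under those assumptions. The function name, the label strings, and the handling of the boundaries are hypothetical, and the lookahead step itself is not modeled.

```python
def classify_relevance(similarity, low, high):
    """Three-way relevance decision between a story and a topic cluster.

    Returns "relevant" when similarity >= high, "irrelevant" when
    similarity < low, and "ambiguous" otherwise (a case that would be
    deferred until later stories provide more evidence).
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if similarity >= high:
        return "relevant"
    if similarity < low:
        return "irrelevant"
    return "ambiguous"
```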

Springer Professional

Topic Detection and Tracking

Event-based Information Organization
