ABSTRACT
A large number of mainstream applications, such as temporal search, event detection, and trend identification, assume knowledge of the timestamp of every document in a given textual collection. In many cases, however, the required timestamps are either unavailable or ambiguous. A characteristic instance of this problem emerges in the context of large repositories of old digitized documents. For such documents, the timestamp may be corrupted during the digitization process, or may simply be unavailable. In this paper, we study the task of approximating the timestamp of a document, a task known as document dating. We propose a content-based method that uses recent advances in the domain of term burstiness, which allow it to overcome the drawbacks of previous document dating methods, such as the fixed time partition strategy. We use an extensive experimental evaluation on different datasets to validate the efficacy and advantages of our methodology, showing that our method outperforms the state-of-the-art methods on document dating.
Index Terms
- A burstiness-aware approach for document dating