2013 | OriginalPaper | Buchkapitel
Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds
verfasst von : Salha M. Alzahrani
Erschienen in: Information Retrieval Technology
Verlag: Springer Berlin Heidelberg
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
The aim of this paper is to give a detailed and explicit design, composition and documentation of a new Arabic News Corpus (ArNeCo). We used RSS feeds from Google news as a big container of article titles, and crawled the web to extract the text. About 11,000 documents with more than 6 million words were tagged as belonging to one of 6 domains: Business, Entertainment, Health, Science-Technology, Sports, and World. Metadata has been added to the corpus as a whole and to each domain independently. The developed corpus, called ArNeCo, has been analysed to ensure that it has a considerable quality and quantity, and published on the Internet for research purposes. This article aims to help potential users of ArNeCo to understand the nature of the corpus and to do information retrieval research in many ways such as in the formulation of queries, justification of decisions taken or interpretation of results gained. Besides the corpus, this article presents a method for developing corpora that can keep track of recent natural language texts posted on the Internet by using RSS feeds.