short-paper

Active Learning for Entity Filtering in Microblog Streams

Authors:
Damiano Spina, RMIT University, Melbourne, Australia
Maria-Hendrike Peetz, University of Amsterdam, Amsterdam, Netherlands
Maarten de Rijke, University of Amsterdam, Amsterdam, Netherlands
SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2015, pages 975–978. https://doi.org/10.1145/2766462.2767839

Published: 9 August 2015

ABSTRACT

Monitoring the reputation of entities such as companies or brands in microblog streams (e.g., Twitter) starts by selecting mentions that are related to the entity of interest. Entities are often ambiguous (e.g., "Jaguar" or "Ford") and effective methods for selectively removing non-relevant mentions often use background knowledge obtained from domain experts. Manual annotations by experts, however, are costly. We therefore approach the problem of entity filtering with active learning, thereby reducing the annotation load for experts. To this end, we use a strong passive baseline and analyze different sampling methods for selecting samples for annotation. We find that margin sampling, an informative type of sampling that considers the distance to the hyperplane used for class separation, can effectively be used for entity filtering and can significantly reduce the cost of annotating initial training data.
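
As a rough illustration of the margin sampling idea mentioned in the abstract, the sketch below picks the unlabeled items whose scores lie closest to a linear decision boundary. It is not the authors' implementation: the linear scoring function, the names `decision_value` and `margin_sample`, the two-feature representation, and all numeric values are assumptions made up for this example.

```python
from typing import List, Sequence


def decision_value(weights: Sequence[float], bias: float,
                   features: Sequence[float]) -> float:
    """Signed score w.x + b of a linear classifier (hypothetical model)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def margin_sample(weights: Sequence[float], bias: float,
                  unlabeled: Sequence[Sequence[float]],
                  batch_size: int = 1) -> List[int]:
    """Return indices of the unlabeled items closest to the hyperplane.

    |w.x + b| is proportional to the geometric distance to the hyperplane
    (the constant factor is 1/||w||), so ranking by it gives the same order.
    """
    margins = [(abs(decision_value(weights, bias, x)), i)
               for i, x in enumerate(unlabeled)]
    margins.sort()  # smallest margin first; ties broken by index
    return [i for _, i in margins[:batch_size]]


# Hypothetical features per tweet: [count of car terms, count of animal terms]
# for the ambiguous entity "Jaguar". Weights and bias are made up.
weights, bias = [1.0, -1.0], 0.0
pool = [[3, 0], [0, 2], [1, 1], [2, 1]]

# The two most uncertain tweets would be sent to the expert for annotation.
print(margin_sample(weights, bias, pool, batch_size=2))  # prints [2, 3]
```

In an active learning loop, the selected items would be labeled by the expert, added to the training set, and the classifier retrained before the next selection round.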

Index Terms

Active Learning for Entity Filtering in Microblog Streams
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2015
1198 pages
ISBN: 9781450336215
DOI: 10.1145/2766462
General Chair: Ricardo Baeza-Yates (Yahoo Labs, USA)
Program Chairs: Mounia Lalmas (Yahoo Labs, UK), Alistair Moffat (University of Melbourne, Australia), Berthier Ribeiro-Neto (Google, Brazil, and UFMG, Brazil)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
active learning
entity filtering
text classification
twitter
Qualifiers
- short-paper
Acceptance Rates
SIGIR '15 paper acceptance rate: 70 of 351 submissions, 20%. Overall acceptance rate: 792 of 3,983 submissions, 20%.