ABSTRACT
Monitoring the reputation of entities such as companies or brands in microblog streams (e.g., Twitter) starts by selecting mentions that are related to the entity of interest. Entity names are often ambiguous (e.g., "Jaguar" or "Ford"), and effective methods for removing non-relevant mentions often rely on background knowledge obtained from domain experts. Manual annotations by experts, however, are costly. We therefore approach the problem of entity filtering with active learning, thereby reducing the annotation load for experts. To this end, we use a strong passive baseline and analyze different sampling methods for selecting samples for annotation. We find that margin sampling, an informative type of sampling that considers the distance to the hyperplane used for class separation, can effectively be used for entity filtering and can significantly reduce the cost of annotating initial training data.
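The idea behind margin sampling can be illustrated with a minimal sketch. It assumes a linear classifier whose separating hyperplane is given by a weight vector `w` and bias `b`; the function names, the toy feature vectors, and the batch size `k` are hypothetical and are not taken from the paper. The unlabeled samples closest to the hyperplane are the ones the classifier is least certain about, so they are the ones passed to the expert for annotation.

```python
import math
from typing import List, Sequence


def distance_to_hyperplane(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Unsigned distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_sample(pool: Sequence[Sequence[float]],
                  w: Sequence[float], b: float, k: int = 1) -> List[int]:
    """Return the indices of the k unlabeled items closest to the hyperplane."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: distance_to_hyperplane(pool[i], w, b))
    return ranked[:k]


# Toy pool of unlabeled tweets as hypothetical 2-d feature vectors.
pool = [[2.0, 0.0], [0.1, 0.0], [-1.5, 0.0], [-0.3, 0.0]]
w, b = [1.0, 0.0], 0.0  # assumed hyperplane: the vertical axis

# Ask the annotator about the two most uncertain tweets.
selected = margin_sample(pool, w, b, k=2)
print(selected)
```

In an active learning loop, the selected items would be labeled, added to the training set, and the classifier retrained before the next round of selection.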