ABSTRACT
Mining phrases, entity concepts, topics, and hierarchies from massive text corpora is an essential problem in the age of big data. Text data in electronic form is ubiquitous, ranging from scientific articles to social networks, enterprise logs, news articles, social media, and general web pages. It is highly desirable but challenging to bring structure to unstructured text data, uncover underlying hierarchies, relationships, patterns, and trends, and gain knowledge from such data.
In this tutorial, we provide a comprehensive survey of the state of the art in data-driven methods that automatically mine phrases, extract and infer latent structures from text corpora, and construct multi-granularity topical groupings and hierarchies of the underlying themes. We study their principles, methodologies, algorithms, and applications using several real datasets, including research papers and news articles, and demonstrate how these methods work and how the uncovered latent entity structures may help text understanding, knowledge discovery, and management.