Information Management and Big Data
Second Annual International Symposium, SIMBig 2015, Cusco, Peru, September 2-4, 2015, and Third Annual International Symposium, SIMBig 2016, Cusco, Peru, September 1-3, 2016, Revised Selected Papers
- 2017
- Book
- Edited by
- Juan Antonio Lossio-Ventura
- Hugo Alatrista-Salas
- Publisher
- Springer International Publishing
About this book
This book constitutes the refereed proceedings of the Second Annual International Symposium on Information Management and Big Data, SIMBig 2015, held in Cusco, Peru, in September 2015, and of the Third Annual International Symposium on Information Management and Big Data, SIMBig 2016, held in Cusco, Peru, in September 2016.
The 11 revised full papers presented were carefully reviewed and selected from 70 submissions. The papers address issues such as Data Science, Big Data, Data Mining, Natural Language Processing, Bio NLP, Text Mining, Information Retrieval, Machine Learning, Semantic Web, Ontologies, Web Mining, Knowledge Representation and Linked Open Data, Social Networks, Social Web and Web Science, Information Visualization, OLAP, Data Warehousing, Business Intelligence, Spatiotemporal Data, Health Care, Agent-based Systems, Reasoning and Logic, Constraints, Satisfiability, and Search.
Table of Contents
Frontmatter
Sense-Level Semantic Clustering of Hashtags
Ali Javed, Byung Suk Lee
Abstract: We enhance the accuracy of the currently available semantic hashtag clustering method, which leverages hashtag semantics extracted from dictionaries such as WordNet and Wikipedia. While immune to the uncontrolled and often sparse usage of hashtags, the current method distinguishes hashtag semantics only at the word level. Unfortunately, a word can have multiple senses, each representing its exact semantics in a given use, and, therefore, word-level semantic clustering fails to disambiguate the true sense-level semantics of hashtags and, as a result, may generate incorrect clusters. This paper shows how this problem can be overcome through sense-level clustering and demonstrates its impact on clustering behavior and accuracy.
Automatic Idiom Recognition with Word Embeddings
Jing Peng, Anna Feldman
Abstract: Expressions such as "add fuel to the fire" can be interpreted literally or idiomatically depending on the context they occur in. Many Natural Language Processing applications could improve their performance if idiom recognition were improved. Our approach is based on the idea that idioms and their literal counterparts do not appear in the same contexts. We propose two approaches: (1) Compute the inner product of context word vectors with the vector representing a target expression. Since literal vectors predict local contexts well, their inner product with contexts should be larger than that of idiomatic ones, thereby telling literals apart from idioms; and (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Since the scatter matrices represent context distributions, we can then measure the difference between the distributions using the Frobenius norm. For comparison, we implement [8, 16, 24] and apply them to our data. We provide experimental results validating the proposed techniques.
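The two scoring ideas named in this abstract can be illustrated with a minimal sketch. The vectors are toy values and the function names (`context_score`, `scatter_matrix`, `frobenius_distance`) are hypothetical, chosen only for this illustration; the paper's actual pipeline, embeddings, and decision rule may differ.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_score(target, context_vectors):
    """Average inner product between a target-expression vector and its context word vectors."""
    return sum(dot(target, c) for c in context_vectors) / len(context_vectors)


def scatter_matrix(vectors):
    """Scatter matrix: sum of outer products of the mean-centered vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    centered = [[v[i] - mean[i] for i in range(d)] for v in vectors]
    return [[sum(c[i] * c[j] for c in centered) for j in range(d)] for i in range(d)]


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two same-shaped matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Toy 2-dimensional "embeddings" (assumed values, not real data).
target = [1.0, 0.0]
literal_contexts = [[1.0, 0.0], [0.8, 0.2]]
idiomatic_contexts = [[0.0, 1.0], [0.6, 0.4]]

literal_score = context_score(target, literal_contexts)
idiomatic_score = context_score(target, idiomatic_contexts)
distance = frobenius_distance(
    scatter_matrix(literal_contexts), scatter_matrix(idiomatic_contexts)
)
```

In this toy setup the literal contexts score higher against the target vector than the idiomatic ones, and the Frobenius distance quantifies how differently the two sets of contexts are spread.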
A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository
Riza Batista-Navarro, Chrysoula Zerva, Nhung T. H. Nguyen, Sophia Ananiadou
Abstract: In our aim to make the information encapsulated by biodiversity literature more accessible and searchable, we have developed a text mining-based framework for automatically transforming text into a structured knowledge repository. A text mining workflow employing information extraction techniques, i.e., named entity recognition and relation extraction, was implemented in the Argo platform and was subsequently applied to biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository facilitates knowledge discovery over a huge amount of biodiversity literature by retrieving annotations matching user-specified queries. We present some use cases to illustrate the types of queries that the knowledge repository currently accommodates.
Network Sampling Based on Centrality Measures for Relational Classification
Lilian Berton, Didier A. Vega-Oliveros, Jorge Valverde-Rebaza, Andre Tavares da Silva, Alneu de Andrade Lopes
Abstract: Many real-world networks, such as the Internet, social networks, biological networks, and others, are massive in size, which impairs their processing and analysis. To cope with this, the network size could be reduced without losing relevant information. In this paper, we extend a work that proposed a sampling method based on the following centrality measures: degree, k-core, clustering, eccentricity and structural holes. For our experiments, we remove 30% and 50% of the vertices and their edges from the original network. Afterwards, we evaluate our proposal on six real-world networks on a relational classification task using eight different classifiers. Classification results achieved on sampled graphs generated from our proposal are similar to those obtained on the entire graphs. The execution time for the learning step of the classifier is shorter on the sampled graph compared to the entire graph and random sampling. In most cases, the original graph was reduced by up to 50% of its initial number of edges without losing topological properties.
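Centrality-based sampling of this kind can be sketched as follows, using degree as the centrality measure. The abstract does not state which vertices are removed, so this sketch assumes the lowest-degree vertices are dropped and the highest-degree ones kept; the function name `sample_by_degree` and the toy graph are hypothetical.

```python
def sample_by_degree(edges, remove_fraction):
    """Drop the given fraction of vertices (lowest degree first) and their edges.

    Assumption: vertices are ranked by degree and the least central are removed;
    ties are broken by vertex label so the result is deterministic.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Highest degree first, then alphabetical label.
    ranked = sorted(adjacency, key=lambda node: (-len(adjacency[node]), node))
    n_remove = int(round(remove_fraction * len(ranked)))
    kept = set(ranked[: len(ranked) - n_remove])

    # An edge survives only if both endpoints survive.
    sampled_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sampled_edges


# Toy undirected graph with six vertices (assumed data).
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e"), ("e", "f")]

kept_50, edges_50 = sample_by_degree(toy_edges, 0.5)
kept_30, edges_30 = sample_by_degree(toy_edges, 0.3)
```

Swapping the ranking key for another measure (k-core, clustering coefficient, and so on) would leave the rest of the procedure unchanged.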
Dictionary-Based Sentiment Analysis Applied to a Specific Domain
Laura Cruz, José Ochoa, Mathieu Roche, Pascal Poncelet
Abstract: The web and social media have been growing exponentially in recent years. We now have access to documents bearing opinions expressed on a broad range of topics. This constitutes a rich resource for natural language processing tasks, particularly for sentiment analysis. Nevertheless, sentiment analysis is usually difficult because expressed sentiments are usually topic-oriented. In this paper, we propose to automatically construct a sentiment dictionary using relevant terms obtained from web pages for a specific domain. This dictionary is initially built by querying the web with a combination of opinion terms and domain terms. In order to select only relevant terms, we apply two measures: \(\textit{AcroDef}_{\textit{MI}3}\) and TrueSkill. Experiments conducted on different domains highlight that our automatic approach performs better for specific cases.
A Clustering Optimization Approach for Disaster Relief Delivery: A Case Study in Lima-Perú
Jorge Vargas-Florez, Rosario Medina-Rodríguez, Rafael Alva-Cabrera
Abstract: During the last decade, funds for humanitarian operations have increased approximately tenfold. According to the Global Humanitarian Assistance Report, in 2013 the humanitarian funding requirement was US$ 22 billion, which is 27.2% more than the amount requested in 2012. Furthermore, the transportation cost represents between one-third and two-thirds of the total logistics cost. Therefore, a frequent problem in disaster relief is to reduce the transportation cost while keeping an adequate distribution service. The latter depends on a reliable delivery route design, which is not easy to achieve in a post-disaster environment, where the infrastructure and resources could be nonexistent, unavailable or inoperative. This paper tackles this problem, taking these constraints into account, to deliver relief aid in a post-disaster state (such as an eight-degree earthquake) in the capital of Perú. The routes found by the hierarchical ascending clustering approach, solved with a heuristic model, achieved a sufficient and satisfactory solution.
An Approach to Evaluate Class Assignment Semantic Redundancy on Linked Datasets
Leandro Mendoza, Alicia Díaz
Abstract: In this work we address the concept of semantic redundancy in linked datasets considering class assignment assertions. We discuss how redundancy can be evaluated as well as the relationship between redundancy and some class hierarchy aspects: number of classes, number of instances a class has, number of class descendants and class depth. Finally, we performed an evaluation on the DBpedia dataset using SPARQL queries for data redundancy checks. Results obtained from this evaluation suggest that the number of redundant class assignments increases when the number of classes is higher, for general classes, for classes with more descendants and for those with a larger number of instances. In this evaluation we also observed some patterns that can be used to classify class assignments. These observations may be useful for linked data stakeholders to understand how different schemas are used within a dataset, detect errors and improve the mechanisms to generate linked data.
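One plausible reading of a redundant class assignment is sketched below: an instance is explicitly typed with a class that is already implied by another of its types through the subclass hierarchy. This definition is an assumption made for illustration, not taken from the paper, and the sketch works on in-memory triples rather than SPARQL. The `ex:` identifiers and the function names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def ancestors(cls, parents):
    """All transitive superclasses of cls, given a class -> direct-superclasses map."""
    seen = set()
    stack = list(parents.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def redundant_assignments(triples):
    """Return (instance, class) pairs whose type assertion is implied by a more specific type."""
    parents, types = {}, {}
    for subject, predicate, obj in triples:
        if predicate == SUBCLASS_OF:
            parents.setdefault(subject, set()).add(obj)
        elif predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)

    redundant = set()
    for instance, classes in types.items():
        for cls in classes:
            # Any asserted class that is also an ancestor of cls is inferable.
            for ancestor in ancestors(cls, parents):
                if ancestor in classes:
                    redundant.add((instance, ancestor))
    return redundant


# Toy dataset (assumed, not DBpedia data).
toy_triples = [
    ("ex:Scientist", SUBCLASS_OF, "ex:Person"),
    ("ex:Person", SUBCLASS_OF, "ex:Agent"),
    ("ex:Ada", RDF_TYPE, "ex:Scientist"),
    ("ex:Ada", RDF_TYPE, "ex:Agent"),
    ("ex:Bob", RDF_TYPE, "ex:Person"),
]

redundant = redundant_assignments(toy_triples)
```

Here the assertion that `ex:Ada` is an `ex:Agent` is flagged because it already follows from `ex:Ada` being an `ex:Scientist`.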
Topic-Based Sentiment Analysis
Prasadith Buddhitha, Diana Inkpen
Abstract: We present a method that exploits syntactic dependencies for topic-oriented sentiment analysis in tweets. The proposed solution is based on supervised text classification and available polarity lexicons in order to identify the relevant dependencies in each sentence by detecting the correct attachment points for the polarity words. Our experiments are based on the data from the Semantic Evaluation Exercise 2015 (SemEval-2015), task 10, subtask C. The dependency parser that we used is adapted to this kind of text. Our classifier, which combines topic- and sentence-level features, obtained very good results.
A Security Price Data Cleaning Technique: Reynold’s Decomposition Approach
Rachel V. Mok, Wai Yin Mok, Kit Yee Cheung
Abstract: We propose a security price data cleaning technique based on Reynold’s decomposition that uses \(T_I\), the time period of integration, to determine the de-noise level of the price data. As price is a function of time, \(T_0\), the optimal time period of integration, may reveal an underlying price trend, possibly indicating the intrinsic value of the security. The DJIA (Dow Jones Industrial Average) Index and the thirty companies comprising the index are our fundamental interest under the initial investigation period from 1990 to 2016. Also, intra-day security price data from February 8th to August 19th, 2016 are obtained to further study \(T_0\) on a minute-by-minute basis. Preliminary results include the following: (1) It was discovered that \(\alpha\), a key percentage measure, drops exponentially for low \(T_I\) and then drops linearly at a fairly shallow slope for high \(T_I\). (2) In the linear region, \(\alpha\) hardly varies as \(T_I\) increases. Thus, we propose that the optimal time period of integration, \(T_0\), is the point at which \(\alpha\) transitions from exponential to linear behavior. We calculated that the average of the \(T_0\)’s for the thirty DJIA component companies is 64 business days and that for the DJIA itself is 63 business days. For the intra-day study of \(T_0\), \(\alpha\) seems to drop proportionally with the length of \(T_I\), exhibiting an almost linear relationship. The change in slope for the intra-day study is not as noticeable as in the total time period study. The average of the intra-day \(T_0\)’s for the thirty DJIA component companies is 52 minutes and for the DJIA Index is 69 minutes.
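The decomposition idea can be sketched as splitting each price into a time-averaged component over a window of length \(T_I\) plus a fluctuation around it. The sketch assumes a trailing window measured in samples (shorter at the start of the series); the paper may define the averaging window differently. The abstract does not define \(\alpha\), so it is not computed here. The function name `reynolds_decompose` and the price values are hypothetical.

```python
def reynolds_decompose(prices, t_i):
    """Split prices into a moving mean over t_i samples and the fluctuation around it.

    Assumption: a trailing window is used, truncated at the start of the series.
    """
    if t_i < 1:
        raise ValueError("t_i must be at least 1")

    means = []
    for k in range(len(prices)):
        window = prices[max(0, k - t_i + 1): k + 1]
        means.append(sum(window) / len(window))

    # By construction, price = mean + fluctuation at every time step.
    fluctuations = [p - m for p, m in zip(prices, means)]
    return means, fluctuations


# Toy price series (assumed values).
toy_prices = [10.0, 12.0, 11.0, 13.0]
means, fluctuations = reynolds_decompose(toy_prices, t_i=2)
```

A larger `t_i` yields a smoother mean component and moves more of the signal into the fluctuation term, which is the trade-off the choice of \(T_0\) is meant to settle.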
Big Data Architecture for Predicting Churn Risk in Mobile Phone Companies
Alonso Raul Melgarejo Galvan, Katerine Rocio Clavo Navarro
Abstract: In Peru, mobile phone companies have been affected by mobile number portability: since July 2014, customers have been able to change their mobile operator in just 24 hours. Companies look for solutions by analyzing historical customer data in order to build predictive models and identify which customers are likely to leave the company. However, the way this prediction is currently performed is too slow. In this paper, we present a Big Data architecture that solves the problems of the “classic architecture”, using data from social networks to predict, from customers' opinions, which of them may move to a competing company. Data processing is performed by Hadoop, which implements MapReduce and can process large amounts of data in parallel. In our tests, the approach achieved high accuracy (90.03%).
Social Networks of Teachers in Twitter
Hernán Gil Ramírez, Rosa María Guilleumas García
Abstract: This research aimed at identifying the trends in the topics of interest of the tweets published by 43 expert professors in the field of ICT and education and by the network of their followers and followed accounts on Twitter, as well as their relationship with the characteristics of that network. For this purpose, NodeXL was employed to import, directly and automatically, 185,517 tweets, which gave rise to a network of connections of 49,229 nodes. Data analysis involved social network analysis, text extraction and text mining using NodeXL, Excel and T-Lab. The research hypothesis was that there is a direct correlation between the trends identified in the topics of interest and the characteristics of the network of connections that emerges from the imported tweets. Among the conclusions of the study we can highlight the following: (1) most of the trends identified from the analyzed tweets were related to education and educational technologies that could enhance teaching and learning processes; (2) the text mining procedure applied to the tweets revealed an interesting association between education and technologies; and (3) the analysis of lemmas seems to be more promising than that of hashtags for detecting trends in the tweets.
Backmatter
- Title
- Information Management and Big Data
- Edited by
- Juan Antonio Lossio-Ventura
- Hugo Alatrista-Salas
- Copyright Year
- 2017
- Electronic ISBN
- 978-3-319-55209-5
- Print ISBN
- 978-3-319-55208-8
- DOI
- https://doi.org/10.1007/978-3-319-55209-5