
Information Management and Big Data

Second Annual International Symposium, SIMBig 2015, Cusco, Peru, September 2-4, 2015, and Third Annual International Symposium, SIMBig 2016, Cusco, Peru, September 1-3, 2016, Revised Selected Papers

  • 2017
  • Book

About this book

This book constitutes the refereed proceedings of the Second Annual International Symposium on Information Management and Big Data, SIMBig 2015, held in Cusco, Peru, in September 2015, and of the Third Annual International Symposium on Information Management and Big Data, SIMBig 2016, held in Cusco, Peru, in September 2016.

The 11 revised full papers presented were carefully reviewed and selected from 70 submissions. The papers address issues such as Data Science, Big Data, Data Mining, Natural Language Processing, Bio NLP, Text Mining, Information Retrieval, Machine Learning, Semantic Web, Ontologies, Web Mining, Knowledge Representation and Linked Open Data, Social Networks, Social Web and Web Science, Information Visualization, OLAP, Data Warehousing, Business Intelligence, Spatiotemporal Data, Health Care, Agent-based Systems, Reasoning and Logic, Constraints, Satisfiability, and Search.

Table of Contents

  1. Frontmatter

  2. Sense-Level Semantic Clustering of Hashtags

    Ali Javed, Byung Suk Lee
    Abstract
    We enhance the accuracy of the currently available semantic hashtag clustering method, which leverages hashtag semantics extracted from dictionaries such as WordNet and Wikipedia. While immune to the uncontrolled and often sparse usage of hashtags, the current method distinguishes hashtag semantics only at the word level. Unfortunately, a word can have multiple senses, each representing its exact semantics in a given use, and, therefore, word-level semantic clustering fails to disambiguate the true sense-level semantics of hashtags and, as a result, may generate incorrect clusters. This paper shows how this problem can be overcome through sense-level clustering and demonstrates its impact on clustering behavior and accuracy.
  3. Automatic Idiom Recognition with Word Embeddings

    Jing Peng, Anna Feldman
    Abstract
    Expressions such as "add fuel to the fire" can be interpreted literally or idiomatically depending on the context they occur in. Many Natural Language Processing applications could improve their performance if idiom recognition were improved. Our approach is based on the idea that idioms and their literal counterparts do not appear in the same contexts. We propose two approaches: (1) Compute the inner product of context word vectors with the vector representing a target expression. Since literal vectors predict local contexts well, their inner product with contexts should be larger than that of idiomatic ones, thereby telling literals apart from idioms; and (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Since the scatter matrices represent context distributions, we can then measure the difference between the distributions using the Frobenius norm. For comparison, we implement [8, 16, 24] and apply them to our data. We provide experimental results validating the proposed techniques.
  4. A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository

    Riza Batista-Navarro, Chrysoula Zerva, Nhung T. H. Nguyen, Sophia Ananiadou
    Abstract
    With the aim of making the information encapsulated in biodiversity literature more accessible and searchable, we have developed a text mining-based framework for automatically transforming text into a structured knowledge repository. A text mining workflow employing information extraction techniques, i.e., named entity recognition and relation extraction, was implemented in the Argo platform and was subsequently applied to biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository facilitates knowledge discovery over a huge amount of biodiversity literature by retrieving annotations matching user-specified queries. We present some use cases to illustrate the types of queries that the knowledge repository currently accommodates.
  5. Network Sampling Based on Centrality Measures for Relational Classification

    Lilian Berton, Didier A. Vega-Oliveros, Jorge Valverde-Rebaza, Andre Tavares da Silva, Alneu de Andrade Lopes
    Abstract
    Many real-world networks, such as the Internet, social networks, biological networks, and others, are massive in size, which impairs their processing and analysis. To cope with this, the network size could be reduced without losing relevant information. In this paper, we extend a work that proposed a sampling method based on the following centrality measures: degree, k-core, clustering, eccentricity and structural holes. For our experiments, we remove \(30\%\) and \(50\%\) of the vertices and their edges from the original network. Afterwards, we evaluate our proposal on six real-world networks in a relational classification task using eight different classifiers. Classification results achieved on sampled graphs generated by our proposal are similar to those obtained on the entire graphs. The execution time for the learning step of the classifier is shorter on the sampled graph compared to the entire graph and random sampling. In most cases, the original graph was reduced by up to \(50\%\) of its initial number of edges without losing topological properties.
  6. Dictionary-Based Sentiment Analysis Applied to a Specific Domain

    Laura Cruz, José Ochoa, Mathieu Roche, Pascal Poncelet
    Abstract
    The web and social media have been growing exponentially in recent years. We now have access to documents bearing opinions expressed on a broad range of topics. This constitutes a rich resource for natural language processing tasks, particularly for sentiment analysis. Nevertheless, sentiment analysis is difficult because expressed sentiments are usually topic-oriented. In this paper, we propose to automatically construct a sentiment dictionary using relevant terms obtained from web pages for a specific domain. This dictionary is initially built by querying the web with a combination of opinion terms and terms of the domain. In order to select only relevant terms, we apply two measures, \(\textit{AcroDef}_{\textit{MI}3}\) and TrueSkill. Experiments conducted on different domains highlight that our automatic approach performs better for specific cases.
  7. A Clustering Optimization Approach for Disaster Relief Delivery: A Case Study in Lima-Perú

    Jorge Vargas-Florez, Rosario Medina-Rodríguez, Rafael Alva-Cabrera
    Abstract
    During the last decade, funds for humanitarian operations have increased approximately tenfold. According to the Global Humanitarian Assistance Report, in 2013 the humanitarian funding requirement was US$ 22 billion, which represents \(27.2\%\) more than the amount requested in 2012. Furthermore, the transportation cost represents between one-third and two-thirds of the total logistics cost. Therefore, a frequent problem in disaster relief is to reduce the transportation cost while keeping an adequate distribution service. The latter depends on a reliable delivery route design, which is not easy to achieve in a post-disaster environment, where infrastructure and resources could be nonexistent, unavailable or inoperative. This paper tackles this problem, taking the constraints into account, to deliver relief aid in a post-disaster state (like an eight-degree earthquake) in the capital of Perú. The routes found by the hierarchical ascending clustering approach, solved with a heuristic model, achieved a sufficient and satisfactory solution.
  8. An Approach to Evaluate Class Assignment Semantic Redundancy on Linked Datasets

    Leandro Mendoza, Alicia Díaz
    Abstract
    In this work we address the concept of semantic redundancy in linked datasets considering class assignment assertions. We discuss how redundancy can be evaluated as well as the relationship between redundancy and some class hierarchy aspects: number of classes, number of instances a class has, number of class descendants and class depth. Finally, we performed an evaluation on the DBpedia dataset using SPARQL queries for data redundancy checks. Results obtained from this evaluation suggest that the number of redundant class assignments increases when the number of classes is higher, for general classes, for classes with more descendants and for those with more instances. In this evaluation we also observed some patterns that can be used to classify class assignments. These observations may be useful for linked data stakeholders to understand how different schemas are used within a dataset, detect errors and improve the mechanisms to generate linked data.
  9. Topic-Based Sentiment Analysis

    Prasadith Buddhitha, Diana Inkpen
    Abstract
    We present a method that exploits syntactic dependencies for topic-oriented sentiment analysis in tweets. The proposed solution is based on supervised text classification and available polarity lexicons in order to identify the relevant dependencies in each sentence by detecting the correct attachment points for the polarity words. Our experiments are based on the data from the Semantic Evaluation Exercise 2015 (SemEval-2015), task 10, subtask C. The dependency parser that we used is adapted to this kind of text. Our classifier, which combines topic- and sentence-level features, obtained very good results.
  10. A Security Price Data Cleaning Technique: Reynold’s Decomposition Approach

    Rachel V. Mok, Wai Yin Mok, Kit Yee Cheung
    Abstract
    We propose a security price data cleaning technique based on Reynold’s decomposition that uses \(T_I\), the time period of integration, to determine the de-noise level of the price data. As price is a function of time, \(T_0\), the optimal time period of integration, may reveal an underlying price trend, possibly indicating the intrinsic value of the security. The DJIA (Dow Jones Industrial Average) Index and the thirty companies comprising the index are our fundamental interest under the initial investigation period from 1990 to 2016. Also, intra-day security price data from February 8th to August 19th, 2016 are obtained to further study \(T_0\) on a minute-by-minute basis. Preliminary results include the following: (1) It was discovered that \(\alpha \), a key percentage measure, drops exponentially for low \(T_I\) and then drops linearly at a fairly shallow slope for high \(T_I\). (2) In the linear region, \(\alpha \) hardly varies as \(T_I\) increases. Thus, we propose that the optimal time period of integration, \(T_0\), is the point at which \(\alpha \) transitions from exponential behavior to linear behavior. We calculated that the average of the \(T_0\)’s for the thirty DJIA component companies is 64 business days and that for the DJIA itself is 63 business days. For the intra-day study of \(T_0\), \(\alpha \) seems to drop proportionally with the length of \(T_I\), exhibiting an almost linear relationship. The change in slope for the intra-day study is not as noticeable as in the total time period study. The average of the intra-day \(T_0\)’s for the thirty DJIA component companies is 52 min and for the DJIA Index is 69 min.
  11. Big Data Architecture for Predicting Churn Risk in Mobile Phone Companies

    Alonso Raul Melgarejo Galvan, Katerine Rocio Clavo Navarro
    Abstract
    In Peru, mobile phone companies have been affected by mobile number portability: since July 2014, customers can change their mobile operator in just 24 hours. Companies look for solutions through the analysis of historical customer data in order to generate predictive models and to identify which customers would leave the company. However, the way this prediction is currently performed is too slow. In this paper, we show a Big Data architecture which solves the problems of the “classic architecture” using data from social networks in order to predict which customers may move to a competitor, according to their opinions. Data processing is performed by Hadoop, which implements MapReduce and can process large amounts of data in parallel. In our tests, we obtained a high accuracy (90.03% of success).
  12. Social Networks of Teachers in Twitter

    Hernán Gil Ramírez, Rosa María Guilleumas García
    Abstract
    This research aimed at identifying the trends in the topics of interest of the tweets published by 43 expert professors in the field of ICT and education and the network of their followers and followed on Twitter, as well as their relationship with the characteristics of that network. For this purpose, NodeXL was employed to import, directly and automatically, 185,517 tweets, which gave rise to a network of connections of 49,229 nodes. Data analysis involved social network analysis, text extraction and text mining using NodeXL, Excel and T-Lab. The research hypothesis was that there is a direct correlation between the trends identified in the topics of interest and the characteristics of the network of connections that emerge from the imported tweets. Among the conclusions of the study we can highlight the following: (1) most of the trends identified from the analyzed tweets were related to education and educational technologies that could enhance teaching and learning processes; (2) the text mining procedure applied to the tweets revealed an interesting association between education and technologies; and (3) the analysis of lemmas seems to be more promising than that of hashtags for detecting trends in the tweets.
  13. Backmatter
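
The second approach named in the abstract of "Automatic Idiom Recognition with Word Embeddings" (comparing scatter matrices of context word vectors with the Frobenius norm) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-dimensional vectors, the normalization by the number of context words, and the nearest-matrix decision rule are all assumptions made for illustration.

```python
import numpy as np


def scatter_matrix(context_vectors):
    """Covariance-style scatter of context word vectors (one vector per row).

    Dividing by the number of rows is an assumption, made so that contexts
    of different lengths stay comparable.
    """
    x = np.asarray(context_vectors, dtype=float)
    centered = x - x.mean(axis=0)
    return centered.T @ centered / x.shape[0]


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two matrices."""
    return float(np.linalg.norm(a - b, ord="fro"))


def classify_by_scatter(context_vectors, literal_scatter, idiomatic_scatter):
    """Hypothetical decision rule: pick the class whose scatter is closer."""
    s = scatter_matrix(context_vectors)
    d_literal = frobenius_distance(s, literal_scatter)
    d_idiomatic = frobenius_distance(s, idiomatic_scatter)
    return "literal" if d_literal <= d_idiomatic else "idiomatic"


# Toy data (invented): literal contexts vary along the first axis,
# idiomatic contexts along the second.
LITERAL_SCATTER = scatter_matrix([[1, 0], [-1, 0], [2, 0], [-2, 0]])
IDIOMATIC_SCATTER = scatter_matrix([[0, 1], [0, -1], [0, 2], [0, -2]])

if __name__ == "__main__":
    new_context = [[1.5, 0.0], [-1.5, 0.0]]
    print(classify_by_scatter(new_context, LITERAL_SCATTER, IDIOMATIC_SCATTER))
```

In practice the rows would be pretrained word embeddings of the words surrounding the target expression, and the reference matrices would be estimated from labeled literal and idiomatic occurrences.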

Title
Information Management and Big Data
Edited by
Juan Antonio Lossio-Ventura
Hugo Alatrista-Salas
Copyright year
2017
Electronic ISBN
978-3-319-55209-5
Print ISBN
978-3-319-55208-8
DOI
https://doi.org/10.1007/978-3-319-55209-5

