## About this book

This book constitutes the refereed proceedings of the 4th Annual International Symposium on Information Management and Big Data, SIMBig 2017, held in Lima, Peru, in September 2017.

The 10 revised full papers presented were carefully reviewed and selected from 71 submissions. The papers address topics such as Data Science, Big Data, Data Mining, Natural Language Processing, Text Mining, Information Retrieval, Machine Learning, Semantic Web, Ontologies, Web Mining, Knowledge Representation and Linked Open Data, Social Web and Web Science, and Information Visualization.

## Table of Contents

### Parallelization of Conjunctive Query Answering over Ontologies

Abstract
Efficient query answering over Description Logic (DL) ontologies with very large datasets is becoming increasingly vital. Recent years have seen the development of various approaches to ABox partitioning to enable parallel processing. Instance checking using the enhanced most specific concept (MSC) method is a particularly promising approach. The applicability of these distributed reasoning methods to typical ontologies has been shown mainly through anecdotal observation. In this paper, we present a parallelizable, enhanced MSC method for the answering of ABox conjunctive queries, using a set of syntactic conditions that permit querying of large practical ontologies in reasonable time, and combining it with pattern matching to answer queries over role assertions. We also present execution time and efficiency of an implementation deployed over computing clusters of various sizes, showing the ability of the method to process instance checking for large scale datasets.
E. Patrick Shironoshita, Da Zhang, Mansur R. Kabuka, Jia Xu

### Could Machine Learning Improve the Prediction of Child Labor in Peru?

Abstract
Child labor is a relevant problem in developing countries because it may have a negative impact on economic growth. Policy makers and government agencies need information to correctly allocate their scarce resources to deal with this problem. Although there is research attempting to predict the causes of child labor, previous studies have used only linear statistical models. Non-linear models may improve predictive capacity and thus optimize resource allocation. However, the use of these techniques in this field remains unexplored. Using data from Peru, our study compares the predictive capability of the traditional logit model with artificial neural networks. Our results show that neural networks could provide better predictions than the logit model. Findings suggest that geographical indicators, income levels, gender, family composition and educational levels significantly predict child labor. Moreover, the neural network suggests the relevance of each factor which could be useful to prioritize strategies. As a whole, the neural network could help government agencies to tailor their strategies and allocate resources more efficiently.
Christian Fernando Libaque-Saenz, Juan Lazo, Karla Gabriela Lopez-Yucra, Edgardo R. Bravo

### Impact of Entity Graphs on Extracting Semantic Relations

Abstract
Relation extraction (RE) between a pair of entity mentions in text is an important and challenging task, especially for open-domain relations. Generally, relations are extracted based on lexical and syntactic information at the sentence level. However, global information about known entities has not yet been explored for the RE task. In this paper, we propose to extract a graph of entities from the overall corpus and to compute features on this graph that capture evidence of a relationship holding between a pair of entities. The proposed features boost RE performance significantly when combined with linguistic features.
Rashedur Rahman, Brigitte Grau, Sophie Rosset

### Predicting Invariant Nodes in Large Scale Semantic Knowledge Graphs

Abstract
Understanding and predicting how large scale knowledge graphs change over time has direct implications for the software and hardware associated with their maintenance and storage. An important subproblem is predicting invariant nodes, that is, nodes within the graph that will not have any edges deleted or changed (add-only nodes) or will not have any edges added or changed (del-only nodes). Predicting add-only nodes correctly has practical importance, as such nodes can then be cached or represented using a more efficient data structure. This paper presents a logistic regression approach using attribute-values as features that achieves 90%+ precision on DBpedia yearly changes trained using Apache Spark. The paper concludes by outlining how we plan to use these models for evaluating Natural Language Generation algorithms.
Damian Barsotti, Martin Ariel Dominguez, Pablo Ariel Duboue
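
The add-only and del-only definitions in the abstract above can be illustrated with a minimal sketch. It assumes the graph is available as two snapshots of (subject, predicate, object) triples and that a node's edges are the triples in which it is the subject. The function names `edges_by_node` and `invariant_labels` are hypothetical and are not taken from the paper. The sketch only derives labels from two snapshots; it does not implement the logistic regression predictor.

```python
from collections import defaultdict


def edges_by_node(triples):
    """Group (subject, predicate, object) triples by their subject node."""
    grouped = defaultdict(set)
    for subject, predicate, obj in triples:
        grouped[subject].add((predicate, obj))
    return grouped


def invariant_labels(old_triples, new_triples):
    """Return {node: (is_add_only, is_del_only)} for nodes in the old snapshot.

    A changed edge shows up as one removal plus one addition, so a node with
    a changed edge is neither add-only nor del-only. A node whose edges are
    identical in both snapshots satisfies both definitions.
    """
    old = edges_by_node(old_triples)
    new = edges_by_node(new_triples)
    labels = {}
    for node, before in old.items():
        after = new.get(node, set())
        removed = before - after  # edges deleted (or the old half of a change)
        added = after - before    # edges added (or the new half of a change)
        labels[node] = (not removed, not added)
    return labels


old_snapshot = [("a", "p", "x"), ("b", "p", "x"), ("c", "p", "x")]
new_snapshot = [("a", "p", "x"), ("a", "q", "y"), ("c", "p", "z")]
labels = invariant_labels(old_snapshot, new_snapshot)
```

Here node `a` only gains an edge, node `b` only loses one, and node `c` has an edge whose object changes.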

### Privacy-Aware Data Gathering for Urban Analytics

Abstract
Nowadays, there is a mature set of tools and techniques for data analytics, which help Data Scientists to extract knowledge from raw heterogeneous data. Nonetheless, there is still a lack of spatiotemporal historical datasets that allow the study of everyday life phenomena, such as vehicular congestion, press influence, the effect of politicians' comments on stock exchange markets, the relation between the evolution of food prices and temperatures or rainfall, and social structure resilience against extreme climate events, among others. Unfortunately, few datasets combine different sources of urban data to carry out studies of phenomena occurring in cities (i.e., Urban Analytics). To solve this problem, we have implemented a Web crawler platform for gathering different kinds of publicly available datasets.
Miguel Nunez-del-Prado, Bruno Esposito, Ana Luna, Juandiego Morzan

### Purely Synthetic and Domain Independent Consistency-Guaranteed Populations in

Abstract
The elaboration of artificial knowledge bases can represent a clever solution to test new semantics-based infrastructures before deploying them and a precious support for the design of some prototypes. One major challenge of such synthetic data generation is to guarantee the acquisition of sound knowledge bases able to pass the equivalent of a Turing test. Populations therefore have to be restricted to guarantee consistency up to a certain fragment of expressivity. In a past work, we released a first version of a populator guaranteeing consistency and populating knowledge bases founded on TBoxes expressed in $\mathcal{ALCQ}^{(\mathcal{D})}$. This purely syntactic and domain independent populator is based on a random process of concept, role and limited data instantiations. In this paper, we propose to extend the expressivity covered by the populator to the fragment $\mathcal{SHIQ}^{(\mathcal{D})}$. This extension deals with RBoxes, ensuring the consistency of the role assertions with respect to the domains/ranges, the universal quantifications and the maximal cardinalities of all the super and inverse roles. Finally, an evaluation of the performance of the populator has been carried out.
Jean-Rémi Bourguet

### Language Identification with Scarce Data: A Case Study from Peru

Abstract
Language identification is an elemental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual setups. However, there is a need to extend this application to other scenarios with scarce data and many classes, analyzing which of the most well-known methods is the best fit. In this respect, Peru offers a great challenge as a multicultural and multilingual country. Therefore, this study comprises three steps: (1) to build from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) to fit both standard and deep learning approaches for language identification, and (3) to statistically compare the results obtained. The standard model outperforms the deep learning one, as expected, with an average precision of 95.9%, and both corpus and model will be advantageous inputs for more complex tasks in the future.
Alexandra Espichán-Linares, Arturo Oncevay-Marcos

### A Multi-modal Data-Set for Systematic Analyses of Linguistic Ambiguities in Situated Contexts

Abstract
Human situated language processing involves the interaction of linguistic and visual processing and this cross-modal integration helps to resolve ambiguities and predict what will be revealed next in an unfolding sentence during spoken communication. However, most state-of-the-art parsing approaches rely solely on the language modality. This paper aims to introduce a multi-modal data-set addressing challenging linguistic structures and visual complexities, which state-of-the-art parsers should be able to deal with. It also briefly addresses the multi-modal parsing approach and a proof-of-concept study that shows the contribution of employing visual information during disambiguation.
Özge Alaçam, Tobias Staron, Wolfgang Menzel

### Community Detection in Bipartite Network: A Modified Coarsening Approach

Abstract
Interest in algorithms for community detection in networked systems has increased over the last decade, mostly motivated by a search for scalable solutions capable of handling large-scale networks. Multilevel approaches provide a potential solution to scalability, as they reduce the cost of a community detection algorithm by applying it to a coarsened version of the original network. The solution obtained in the small-scale network is then projected back to the original large-scale model to obtain the desired solution. However, standard multilevel methods are not directly applicable to bipartite networks and there is a gap in the existing literature on multilevel optimization applied to such networks. This article addresses this gap and introduces a novel multilevel method based on one-mode projection that allows executing traditional multilevel methods in bipartite network models. The approach has been validated with an algorithm for community detection that solves Barber’s modularity problem. We show it can scale a target algorithm to handle larger networks, whilst preserving solution accuracy.
Alan Valejo, Vinícius Ferreira, Maria C. F. de Oliveira, Alneu de Andrade Lopes
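
As a rough illustration of the one-mode projection mentioned in the abstract above, the following minimal sketch projects a bipartite edge list onto one of its vertex sets. The weighting scheme (number of shared neighbors) is a common choice and an assumption here; the abstract does not state which projection the authors use. The function name `one_mode_projection` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def one_mode_projection(bipartite_edges):
    """Project (left, right) edges onto the left vertex set.

    Two left vertices are linked if they share at least one right neighbor;
    the edge weight counts how many right neighbors they share (assumed
    weighting, not taken from the paper).
    """
    lefts_of = defaultdict(set)  # right vertex -> set of adjacent left vertices
    for left, right in bipartite_edges:
        lefts_of[right].add(left)

    weights = defaultdict(int)
    for lefts in lefts_of.values():
        # Every pair of left vertices adjacent to this right vertex shares it.
        for u, v in combinations(sorted(lefts), 2):
            weights[(u, v)] += 1
    return dict(weights)


edges = [("u1", "a"), ("u2", "a"), ("u2", "b"), ("u3", "b"), ("u1", "b")]
projection = one_mode_projection(edges)
```

The resulting weighted unipartite graph is the kind of input on which a standard coarsening step could then operate.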

### Reconstructing Pedestrian Trajectories from Partial Observations in the Urban Context

Abstract
The ever-greater number of technologies providing location-based services has given rise to a deluge of trajectory data. However, most of these trajectories are low-sampling-rate and, consequently, many movement details are lost. Due to that, trajectory reconstruction techniques aim to infer the missing movement details and reduce uncertainty. Nevertheless, most of the effort has been put into reconstructing vehicle trajectories. Here, we study the reconstruction of pedestrian trajectories by using road network information. We compare a simple technique that only uses road network information with a more complex technique that, besides the road network, uses historical trajectory data. Additionally, we use three different trajectory segmentation settings to analyze their influence over reconstruction. Our experiment results show that, with the limited pedestrian trajectory data available, a simple technique that does not use historical data performs considerably better than a more complex technique that does use it. Furthermore, our results also show that trajectories segmented in such a way as to allow a greater distance and time span between border points of pairs of consecutive trajectories obtain better reconstruction results in the majority of the cases, regardless of the technique used.
Ricardo Miguel Puma Alvarez, Alneu de Andrade Lopes

### Backmatter
