
2021 | Book

Information Management and Big Data

7th Annual International Conference, SIMBig 2020, Lima, Peru, October 1–3, 2020, Proceedings

Editors: Juan Antonio Lossio-Ventura, Jorge Carlos Valverde-Rebaza, Eduardo Díaz, Hugo Alatrista-Salas

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science


About this book

This book constitutes the refereed proceedings of the 7th International Conference on Information Management and Big Data, SIMBig 2020, held in Lima, Peru, in October 2020.*

The 32 revised full papers and 7 revised short papers presented were carefully reviewed and selected from 122 submissions. The papers address topics such as natural language processing and text mining; machine learning; image processing; social networks; data-driven software engineering; graph mining; and Semantic Web, repositories, and visualization.

*The conference was held virtually.

Table of Contents

Frontmatter

Natural Language Processing and Text Mining

Frontmatter
Comparative Analysis of Question Answering Models for HRI Tasks with NAO in Spanish

Recent studies on Human Robot Interaction (HRI) have shown that applications combining different metrics and techniques can help achieve a more efficient and organic interaction. These applications can be related to human care or go further and use a humanoid robot for nonverbal communication tasks. For verbal communication, we turn to Question Answering, a Natural Language Processing task that automatically captures and interprets a question and returns a suitable representation of an answer. Recent work on Question Answering models based on the Transformer architecture has obtained state-of-the-art results. Our main goal in this project is to build a new Human Robot Interaction technique that uses a Question Answering system, which we test with college students. To create the Question Answering model, we evaluate state-of-the-art pre-trained models like BERT and XLNet, as well as multilingual ones like m-BERT and XLM. We train them on a new Spanish dataset translated from the original SQuAD and obtain our best results with XLM-R: 68.1 F1 and 45.3 EM on the MLQA test dataset, and 77.9 F1 and 58.3 EM on the XQuAD test dataset. To validate these results, we evaluated the project with HRI metrics and a survey. The results show a high degree of acceptance among the students of the proposed type of interaction.
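
As a rough illustration of the kind of extractive Question Answering pipeline the abstract describes, the sketch below runs a multilingual transformer through the Hugging Face pipeline API; the checkpoint name and the Spanish context/question are illustrative assumptions, not the authors' fine-tuned model.

```python
# Minimal sketch of extractive QA with a multilingual transformer.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",  # assumed public multilingual QA checkpoint
)

context = (
    "NAO es un robot humanoide desarrollado para tareas de interacción "
    "humano-robot en entornos educativos."
)
result = qa(question="¿Qué es NAO?", context=context)
print(result["answer"], result["score"])
```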

Enrique Burga-Gutierrez, Bryam Vasquez-Chauca, Willy Ugarte
Peruvian Citizens Reaction to Reactiva Perú Program: A Twitter Sentiment Analysis Approach

The internet is part of people’s daily lives, and social networking sites (SNSs) may provide insights into how people perceive government actions. The present case study contributes to the debate concerning SNSs as an alternative communication tool between citizens and politicians in terms of information about the policies that rule citizens’ lives. To reach this goal, we explored the role of Twitter sentiment analysis as a means of monitoring reactions to Reactiva Perú, a program implemented by the Peruvian Government in response to the COVID-19 economic crisis. The findings suggest that SNSs may become an alternative source of information for policymakers to capture citizens’ reactions to implemented policies. Implications and possible strategies are discussed at an empirical level.

Rosmery Ramos-Sandoval
Twitter Early Prediction of Preferences and Tendencies Based in Neighborhood Behavior

In recent years, social networks have become increasingly massive. Consequently, they are a fundamental source of information and a powerful tool to spread ideas and opinions. Based on Twitter, this paper studies the problem of predicting the retweet preference of a user for a given tweet, considering how the tweet has been shared by that user's environment. It also addresses the more global problem of predicting whether a tweet will be popular, based on the retweet behavior of central users. For both problems, we explore how prediction quality evolves with the amount of information available over the time since a tweet is created, and derive insights about the trade-off between elapsed time and prediction performance. For the user retweet preference problem, this social prediction model achieves, for example, an F1 score of around 63.76% using the first 15 min of information, 75.2% by 4 h, and 86.08% without any time window. For popularity prediction, the model achieves scores of 65.67% with 60 min of information, 74.4% with 4 h, and 80.73% with no time window restriction, using the behavior of the 15% of users considered influencers. All these results are obtained without considering the content of the tweets. Next, we incorporate features based on FastText word embeddings to represent the content of tweets. While such models alone attain an F1 of barely around 50% for preference and popularity prediction, combined with the social models they generally improve popularity prediction by more than 4%. For preference prediction, the FastText model is more useful over small time spans. We conclude that it is possible to reasonably predict whether a user will retweet or how massive a publication will be, using only the information available during the first 30–60 min.

Emanuel Meriles, Martín Ariel Domínguez, Pablo Gabriel Celayes
Summarization of Twitter Events with Deep Neural Network Pre-trained Models

Due to the proliferation of online social media services such as Twitter, there is an upsurge in the volume of user-generated textual content. Such voluminous content is difficult for users to consume. Therefore, the development of technological solutions to automatically summarise voluminous texts is essential. The work presented in this paper reports on a system for automatically generating abstractive summaries from a collection of texts from Twitter. Our proposed approach is a two-stage framework: 1) event detection by clustering and 2) summarization of the events. We first generated a contextualized vector representation of the tweets and then applied different clustering techniques to the vectors. We evaluated the generated clusters and, based on the evaluation, chose the one best suited for the summarization task. For the summarization task, we used the pre-trained models of two recently developed state-of-the-art deep neural network architectures and evaluated them on the event clusters. Standard ROUGE measures were used for evaluating the summaries. We obtained a best ROUGE-1 score of 46%, ROUGE-2 score of 30%, ROUGE-L score of 41%, and ROUGE-SU score of 23% in our experiments.
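
A minimal sketch of the ROUGE evaluation step mentioned above, using the rouge-score package (ROUGE-SU is not provided by this package and is omitted); the reference and generated summaries are toy strings.

```python
# Score a generated event summary against a reference with ROUGE-1/2/L.
from rouge_score import rouge_scorer

reference = "heavy flooding reported across the city after the storm"
generated = "the storm caused heavy flooding in several parts of the city"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```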

Kunal Chakma, Amitava Das, Swapan Debbarma
Multi-strategic Approach for Author Name Disambiguation in Bibliography Repositories

The problem of author name ambiguity in digital bibliography repositories can compromise the integrity and reliability of data. Several techniques are available in the literature to solve the author name disambiguation problem. In this work, we present a multi-strategic approach for author name disambiguation in bibliography repositories that combines string comparison using the Jaccard similarity coefficient and the Levenshtein distance measure with a social network clustering technique. Information from the DBLP digital bibliography repository is used to compare disambiguation results with SCI-synergy, an online scientific social network analysis artifact. The proposed approach outperforms the baseline with a precision of 0.8867, recall of 1.0, and F-measure of 0.9399, considering a Brazilian graduate program case.
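
The two string-similarity measures named in the abstract are easy to illustrate; the sketch below is a plain-Python version with toy author name variants, not the authors' pipeline.

```python
# Jaccard similarity over name tokens and classic Levenshtein edit distance.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two names."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

print(jaccard("C. G. Ralha", "Celia Ghedini Ralha"))   # token overlap
print(levenshtein("Celia Ralha", "Célia Ralha"))       # edit distance of 1
```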

Natan de Souza Rodrigues, Aurelio Ribeiro Costa, Lucas Correa Lemos, Célia Ghedini Ralha
Machine Learning Techniques for Speech Emotion Classification

In this paper we propose and evaluate different models for speech emotion classification through audio signal processing, machine learning, and deep learning techniques. For this purpose, we collected from two databases (RAVDESS and TESS) a total of 5252 audio samples with 8 emotional classes (neutral, calm, happy, sad, angry, fearful, disgust, and surprised). We divided our experiments into 3 main stages. In the first stage, we used feature engineering to extract relevant features from the time, spectral, and cepstral domains. Features like ZCR, energy, spectral centroid, chroma, and MFCC were used to train an SVM classifier; the best model obtained an accuracy of 91.1%. In the second stage, we considered only 40 MFCC coefficients for training several deep neural networks such as CNN, LSTM, and MLP; the best model obtained an accuracy of 89.5% with an MLP architecture. Finally, in the third stage we trained an end-to-end CNN (SampleCNN) at the sample level. This last approach does not require feature engineering; it works directly on the audio signal. In this stage, we achieved a precision of 81.7%. The experiments show that the results are competitive and that some experiments surpass related work in accuracy.
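
A minimal sketch of the first stage (hand-crafted features plus an SVM), assuming librosa and scikit-learn; the file paths and labels are placeholders, not the RAVDESS/TESS loading code.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Aggregate frame-level MFCC, ZCR, and spectral-centroid features per clip."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    zcr = librosa.feature.zero_crossing_rate(y)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    return np.concatenate([mfcc.mean(axis=1), zcr.mean(axis=1), centroid.mean(axis=1)])

# Placeholder paths/labels standing in for the full labelled corpus.
audio_paths = ["clips/happy_001.wav", "clips/sad_001.wav"]
labels = ["happy", "sad"]

X = np.vstack([extract_features(p) for p in audio_paths])
clf = SVC(kernel="rbf").fit(X, labels)   # in practice: train/test split and accuracy report
print(clf.predict(X[:1]))
```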

Noe Melo Locumber, Junior Fabian
An Evaluation of Physiological Public Datasets for Emotion Recognition Systems

[Background] The performance of emotion recognition systems depends heavily on the datasets used in their training, validation, or testing stages. [Aims] This research aims to evaluate the extent to which publicly available physiological datasets created for emotion recognition systems meet a set of reference requirements. [Method] Firstly, we analyze the applicability of some reference requirements proposed for stress datasets and adjust the corresponding evaluation criteria. Secondly, nine public physiological datasets were identified from a previous survey. [Results] None of the evaluated datasets satisfies all the reference requirements needed to be considered a reference dataset for building reliable emotion recognition systems. [Conclusion] Although the evaluated datasets do not meet all the reference requirements, they provide a baseline for further development. A greater effort is also needed to establish specific reference requirements that can appropriately guide the creation of physiological datasets for emotion recognition systems.

Alexis Mendoza, Alvaro Cuno, Nelly Condori-Fernandez, Wilber Ramos Lovón

Machine Learning

Frontmatter
YTTREX: Crowdsourced Analysis of YouTube’s Recommender System During COVID-19 Pandemic

Algorithmic personalization is difficult to approach because it entails studying many different user experiences, with many variables outside of our control. Two biases are frequent in experiments: relying on corporate service APIs and using synthetic profiles with little regard for regional and individualized profiling and personalization. In this work, we present the results of the first crowdsourced data collection of YouTube's recommended videos via YouTube Tracking Exposed (YTTREX). Our tool collects evidence of algorithmic personalization via an HTML parser, anonymizing the users. In our experiment we used a BBC video about COVID-19, taking into account 5 regional BBC channels in 5 different languages, and we saved the recommended videos shown during each session. Each user watched the first five seconds of the videos, while the extension captured the recommended videos. We took into account the top 20 recommended videos for each completed session, looking for evidence of algorithmic personalization. Our results showed that the vast majority of videos were recommended only once in our experiment. Moreover, we collected evidence that there is a significant difference between the videos we could retrieve using the official API and what we collected with our extension. These findings show that filter bubbles exist and that they need to be investigated with a crowdsourced approach.

Leonardo Sanna, Salvatore Romano, Giulia Corona, Claudio Agosti
Parallel Social Spider Optimization Algorithms with Island Model for the Clustering Problem

The digital age came with an extraordinary ability to generate data across organizations, people, and devices; data that needs to be analyzed, processed, and stored. A well-known technique for analyzing this kind of data is clustering. Many bio-inspired algorithms have been proposed for this problem, such as Social Spider Optimization (SSO). In this work, we propose parallel island models of the SSO algorithm for the clustering problem, using 24 processors for each parallel algorithm. These models were implemented using static and dynamic topologies, and datasets from the UCI Machine Learning Repository were used for the experiments. The achieved average speedups range from 15 to 28 times faster than the SSO algorithm for large and small datasets, respectively, and the parallel model with a static ring topology performs slightly faster than the other parallel models. The parallel algorithms provide results with precision similar to those computed with the SSO algorithm.

Edwin Alvarez-Mamani, Lauro Enciso-Rodas, Mauricio Ayala-Rincón, José L. Soncco-Álvarez
Two-Class Fuzzy Clustering Ensemble Approach Based on a Constraint on Fuzzy Memberships

In recent years, the motivation to use hybrid mixtures of various methods has increased. In this regard, appropriate combinations of supervised or unsupervised techniques have been proposed to enhance classification performance. In this paper, a novel ensemble approach is presented to obtain a stable fuzzy clustering scheme. The proposed model consists of running several Fuzzy C-means (FCM) based algorithms, followed by the formation of a co-association matrix reflecting the probability of each observation belonging to the clusters. The mean of these values is combined with a restriction criterion designed to capture the exact possibility of assigning observations to clusters. In other words, certain objects receive a reward, while uncertain objects with lower fuzzy membership degrees tend to become ineffective. Since partitioning clustering algorithms are commonly used as a consensus function, in this study the resulting row vectors are given to K-means and FCM to generate the final clusters. Several datasets have been used to evaluate the performance of the proposed model in comparison with different methods. Especially on internal validity indices, the proposed method achieves better results than traditional algorithms.
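
A minimal sketch of the co-association idea under simplifying assumptions: scikit-learn KMeans stands in for the FCM-based base clusterers, and synthetic data replaces the evaluation datasets.

```python
# Build a co-association matrix from several clustering runs and feed it
# to a final consensus clusterer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, random_state=0)
n = X.shape[0]

coassoc = np.zeros((n, n))
n_runs = 10
for seed in range(n_runs):
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    # Increment co-association for every pair assigned to the same cluster.
    coassoc += (labels[:, None] == labels[None, :]).astype(float)
coassoc /= n_runs

# Consensus step: cluster the rows of the co-association matrix.
final_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coassoc)
print(final_labels[:10])
```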

Omid Aligholipour, Mehmet Kuntalp
Modeling and Predicting the Lima Stock Exchange General Index with Bayesian Networks and Information from Foreign Markets

This paper presents a Bayesian network approach to model and forecast the daily return direction of the Lima Stock Exchange general index using information from foreign markets. Thirteen worldwide stock market indices were used, along with the copper future traded in New York. The proposed approach was compared against popular machine learning methods, including decision trees, SVM, Multilayer Perceptron, and Long Short-Term Memory networks, and showed competitive results at classifying both positive and negative classes. The approach allows a graphical representation of the relationships between the markets, which facilitates understanding of the target market in the global context. A web application was developed to demonstrate the advantages of the proposed approach. To the best of our knowledge, this is the first effort to model the influence of the main stock markets around the world on the Lima Stock Exchange general index.

Daniel Chapi, Soledad Espezua, Julio Villavicencio, Oscar Miranda, Edwin Villanueva
Comparative Study of Spatial Prediction Models for Estimating PM Concentration Level in Urban Areas

Having accurate spatial prediction models of air pollutant concentrations can be very helpful to alleviate the shortage of monitoring stations, especially in low-to-middle income countries. However, given the large diversity of model types, whether statistical, numerical, or machine learning (ML) based, it is not clear which of them are most suitable for this task. In this paper we study the predictive capabilities of common machine learning methods for the spatial prediction of PM2.5 concentration levels. Three relevant factors were scrutinized: the extent to which meteorological variables impact prediction performance; the effect of variable normalization by inverse distance weighting (IDW); and the number of neighborhood stations needed to maximize predictive performance. Results on a dataset from the Beijing monitoring network show that simple models like linear regressors trained on IDW-normalized variables can cope with this task. Some knowledge has been derived to guide the construction of competent models for spatial prediction of PM2.5 concentrations with ML-based methods.
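
A minimal sketch of inverse distance weighting over neighbouring stations, one ingredient of the normalization examined above; station coordinates and PM2.5 readings are synthetic.

```python
# Estimate a pollutant level at an unmonitored location by weighting
# neighbouring station readings by 1 / distance**power.
import numpy as np

def idw_estimate(target_xy, station_xy, station_values, power=2.0, eps=1e-9):
    d = np.linalg.norm(station_xy - target_xy, axis=1)
    w = 1.0 / (d ** power + eps)
    return float(np.sum(w * station_values) / np.sum(w))

stations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
pm25 = np.array([35.0, 42.0, 31.0, 60.0])   # hypothetical PM2.5 readings
target = np.array([0.5, 0.5])               # location without a monitor

print(idw_estimate(target, stations, pm25))
```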

Irvin Rosendo Vargas-Campos, Edwin Villanueva
Prediction of Solar Radiation Using Neural Networks Forecasting

Solar radiation and wind data play an important role in renewable energy projects to produce electricity. In Ecuador, these data are not always available for locations of interest due to the absence of meteorological stations. In the scope of this paper, a low-cost automatic meteorological station prototype based on Raspberry Pi technology was developed to measure the aforementioned variables. The objective of this paper is twofold: a) to present a proposal for the design of a low-cost automatic weather station using the Raspberry Pi microcomputer, showing the feasibility of this technology as an alternative for the construction of automatic meteorological stations; and b) to use neural network forecasting to predict solar radiation in Manta, Ecuador, based on the historical data collected: solar radiation, wind speed, and wind direction. We show that the technology is feasible and that machine learning has high potential as a tool in this field of study.

Ponce-Jara Marcos, Alvaro Talavera, Carlos Velásquez, David Tonato Peralta
COVID-19 Infection Prediction and Classification

Symptoms associated with COVID-19 are very similar to, and difficult to distinguish from, those of seasonal flu, bronchitis, or pneumonia. Tests are expensive and unavailable in many countries, especially developing ones, and may be unnecessary for every suspected COVID-19 case. This work aims to decide, using a confidence threshold, whether a patient is a priori infected and must be tested; otherwise, the patient is not screened. The data were collected at the emergency department of the EHU of Oran in Algeria. COVID-19 infection classification and prediction are performed with decision trees.

Souad Taleb Zouggar, Abdelkader Adla

Image Processing

Frontmatter
Towards a Benchmark for Sedimentary Facies Classification: Applied to the Netherlands F3 Block

In this paper, we attempt to provide a new benchmark for seismic image interpretation tasks on a public seismic dataset (Netherlands F3 Block). For this, techniques such as data augmentation together with five different deep network architectures were used, as well as the focal loss function. Our experiments improved on all evaluation metrics reported in the current benchmark. For instance, we improved the pixel accuracy metric by 3.7% and the mean class accuracy by 5.4% for a modified U-Net that uses dilated convolution layers in its bottleneck. In addition, the confusion matrix of each model is shown for a better inspection of the classes (sedimentary facies) where the greatest amount of misclassification occurred. The training process of almost all networks took less than one hour to converge. Finally, we applied Conditional Random Fields (CRF) as post-processing in order to obtain smoother results. The inference performed with the best topology, on an inline (section) of the test set, comes close to achieving an interpretation at a human level.
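
A minimal sketch of a multi-class focal loss in PyTorch, as a generic formulation of the loss named above rather than the authors' exact implementation.

```python
# Focal loss down-weights well-classified pixels: (1 - p_t)**gamma * CE.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample cross-entropy
    p_t = torch.exp(-ce)                                     # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy example: 4 "pixels", 6 facies classes.
logits = torch.randn(4, 6, requires_grad=True)
targets = torch.tensor([0, 3, 5, 2])
loss = focal_loss(logits, targets)
loss.backward()
print(loss.item())
```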

Maykol J. Campos Trinidad, Smith W. Arauco Canchumuni, Marco Aurelio Cavalcanti Pacheco
Mobile Application for Movement Recognition in the Rehabilitation of the Anterior Cruciate Ligament of the Knee

Anterior cruciate ligament injury is a condition that requires physical rehabilitation therapy. Due to the problems of the COVID-19 pandemic and patients' mobility problems, it is difficult to attend rehabilitation sessions. The developed mobile application uses color recognition through the OpenCV library, with which a virtual goniometer can be generated by capturing the specific anatomical points of the lower limb through the camera of the device. It also allows controlling and monitoring the exercises prescribed by a specialist. The exercises performed by the patient are registered by the mobile application, which captures the series and repetitions, the flexion and extension movements, and their maximum and minimum angles; thanks to this, proper performance can be tracked. The results of four test subjects of different ages and sexes were obtained by submitting them to rehabilitation exercises and recording their respective measurements, thus verifying the effectiveness of the mobile application.
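
The virtual-goniometer idea reduces to computing the angle at the knee from three tracked points; the sketch below, with illustrative pixel coordinates, shows that computation.

```python
# Knee flexion angle from three anatomical points detected in image coordinates.
import numpy as np

def joint_angle(hip, knee, ankle):
    """Angle at the knee, in degrees, between the thigh and shank vectors."""
    v1 = np.asarray(hip, dtype=float) - np.asarray(knee, dtype=float)
    v2 = np.asarray(ankle, dtype=float) - np.asarray(knee, dtype=float)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

# Example pixel coordinates of the three colour markers in a frame (nearly extended leg).
print(joint_angle(hip=(120, 80), knee=(140, 220), ankle=(150, 360)))
```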

Iam Contreras-Alcázar, Kreyh Contreras-Alcázar, Victor Cornejo-Aparicio
Semantic Segmentation Using Convolutional Neural Networks for Volume Estimation of Native Potatoes at High Speed

Peru is one of the world's main producers of a wide variety of native potatoes. Nevertheless, to achieve competitive export of derived products, it is necessary to automate tasks in the production process. Nowadays, volume measurements of native potatoes are done manually, increasing production costs. To reduce these costs, a deep learning approach based on convolutional neural networks has been developed, tested, and evaluated, using a portable machine vision system to improve high-speed native potato volume estimation. The system was tested under different conditions and was able to estimate volume with up to 90% accuracy.

Miguel Chicchón, Ronny Huerta
Symbiotic Trackers’ Ensemble with Trackers’ Re-initialization for Face Tracking

Visual object tracking aims to deliver accurate estimates of the state of the target in a sequence of images or video frames. Nevertheless, tracking algorithms are sensitive to different kinds of image perturbations that frequently cause tracking failures. Indeed, tracking failures result from the insertion of imprecise target-related data into the trackers' appearance models, which leads the trackers to lose the target or drift away from it. Here, we propose a tracking fusion approach that incorporates feedback and re-initialization mechanisms to improve overall tracking performance. Our fusion technique, called SymTE-TR, enhances trackers' overall performance by updating their appearance models with reliable information about the target's state, while resetting imprecise trackers. We evaluated our approach on a facial video dataset, which characterizes a particularly challenging tracking application under different imaging conditions. The experimental results indicate that our approach contributes to enhancing individual tracker performance by providing stable results across the video sequences and, consequently, to stable overall tracking fusion performance.

Victor H. Ayma, Patrick N. Happ, Raul Q. Feitosa, Gilson A. O. P. Costa, Bruno Feijó
Multi-class Vehicle Detection and Automatic License Plate Recognition Based on YOLO in Latin American Context

In Latin America, and many other countries around the globe, serious problems exist regarding the high level of traffic that generates congestion on avenues and streets, with poor road planning being one of the main causes, in addition to the excess of buses, mini-buses, taxis, and other vehicles that cause obstructions. Therefore, it would be very useful to know the flow of existing vehicles in each area in order to know and segment which roads certain vehicles should transit, thus providing greater control. This research proposes a methodology for the detection and multi-class classification of vehicles in eight classes: cars, buses, trucks, combis (micro-buses), moto-taxis (auto-rickshaws), taxis, motorcycles, and bicycles; to later detect the vehicle license plates and recognize the characters on them, using Deep Learning techniques, specifically YOLOv3 and LeNet. The proposed methodology consists of four stages: Vehicle Detection, License Plate Detection, Character Segmentation, and Character Recognition. We also introduce a novel open-access dataset, LAT-VEDA, which contains more than 22,000 images divided into 8 classes. Good results were obtained in each of the four stages of the system in comparison with the state of the art, achieving the best mAP of 1.0 in the License Plate Detection stage and the lowest performance in the Vehicle Detection stage with a mAP of 0.68. This approach may be used by the government to support the management of public transport, giving greater control and information about the flow of vehicles by area; in addition, the license plate recognition system can help in the enforcement of public policies and regulations.

Pedro I. Montenegro-Montori, Jhonatan Camasca-Huamán, Junior Fabian
Static Summarization Using Pearson’s Coefficient and Transfer Learning for Anomaly Detection for Surveillance Videos

Data storage has become a problem as technology advances: there are ever more devices capable of capturing images, sounds, videos, etc. On the security side, many people choose to use security cameras that run 24 hours a day to capture anomalous events and maintain the security of an area. However, storing all captured video generates high costs, as does the prolonged analysis that this type of video implies. For this reason, we propose a method that selects only the important events captured by a video surveillance camera and then classifies them among the most frequent types of criminal acts in Peru.
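
A minimal sketch of keyframe selection driven by Pearson's coefficient between consecutive frames, using synthetic grayscale arrays in place of surveillance footage; the threshold is an assumption.

```python
# Keep a frame only when it no longer correlates strongly with the last kept frame.
import numpy as np

def select_keyframes(frames, threshold=0.9):
    kept = [0]
    for i in range(1, len(frames)):
        r = np.corrcoef(frames[kept[-1]].ravel(), frames[i].ravel())[0, 1]
        if r < threshold:          # scene changed enough to be worth storing
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
static = rng.random((10, 10))
frames = [static + 0.01 * rng.random((10, 10)) for _ in range(5)]  # near-duplicates
frames.append(rng.random((10, 10)))                                # abrupt change
print(select_keyframes(frames))   # expected: [0, 5]
```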

Steve Willian Chancolla-Neira, César Ernesto Salinas-Lozano, Willy Ugarte
Humpback Whale’s Flukes Segmentation Algorithms

Photo-identification consists of the analysis of photographs to identify cetacean individuals based on unique characteristics that each specimen of the same species exhibits. This tool allows us to carry out studies of population size and migratory routes by comparing catalogues. However, the number of images that make up these catalogues is large, so manual photo-identification takes considerable time. On the other hand, many of the methods proposed for the automation of this task include a segmentation phase to ensure that the identification algorithm takes into account only the characteristics of the cetacean and not the background. Thus, in this work, we compared four segmentation techniques from the image processing and computer vision fields to isolate whales' flukes. We evaluated the Otsu (OTSU), Chan-Vese (CV), Fully Convolutional Networks (FCN), and Pyramid Scene Parsing Network (PSP) algorithms on a subset of images from the Humpback Whale Identification Challenge dataset. The experimental results show that the FCN and PSP algorithms performed similarly and were superior to the OTSU and CV segmentation techniques.
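
A minimal sketch of the simplest baseline compared above, Otsu thresholding with OpenCV; the image path is a placeholder.

```python
# Otsu picks a binarization threshold automatically from the grayscale histogram.
import cv2

image = cv2.imread("fluke.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path
blurred = cv2.GaussianBlur(image, (5, 5), 0)            # reduce sea-surface noise
thresh_value, mask = cv2.threshold(
    blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
)
cv2.imwrite("fluke_mask.png", mask)
print("Otsu threshold:", thresh_value)
```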

Andrea Castro Cabanillas, Victor H. Ayma
Improving Context-Aware Music Recommender Systems with a Dual Recurrent Neural Network

Day by day, online content delivery providers increase the volume of data on the internet. Music streaming services are among those services that increase their number of users every day, as well as the number of songs in their catalogs. To help users find songs that fit their interests, music recommender systems can be used to filter a large number of songs according to the preference of the user. However, the context in which users listen to songs must be taken into account, which justifies the use of context-aware recommender systems. The goal of this work is to use a Dual Recurrent Neural Network to acquire contextual information (represented by embeddings) for each song, given the sequence of songs that each user has listened to. We evaluated the embeddings by using four context-aware music recommender systems on two datasets. The results showed that the embeddings (i.e. the contextual information) obtained by our proposed method are able to improve context-aware music recommender systems.

Igor André Pegoraro Santana, Marcos Aurélio Domingues

Social Networks

Frontmatter
Classification of Cybercrime Indicators in Open Social Data

Posting information on social media platforms is a popular activity through which personal and confidential information can leak into the public domain. Consequently, social media can contain information that indicates that an organization has been compromised or has suffered a data breach. This paper describes a technique for inferring whether an organization has been compromised from information posted on social media. The proposed strategy forms the basis of an alarm system that generates alerts for possible unreported cybercrime incidents. It uses two social media cybercrime-related datasets collected from financial organizations' Twitter accounts in the Irish and New York regions. The Tweets are labelled as either containing cybercrime indicators or not, and the cybercrime Tweets are further labelled into crime categories. A deep dense pyramidal neural network model is used to classify the Tweets. This approach achieves an AUC of 0.85 ± 0.03, which outperforms a baseline of deep convolutional neural networks.

Ihsan Ullah, Caoilfhionn Lane, Teodora Sandra Buda, Brett Drury, Marc Mellotte, Haytham Assem, Michael G. Madden
StrCoBSP: Relationship Strength-Aware Community-Based Social Profiling

User interest inference in social media is an important research topic with great value for modern personalization and advertisement systems. Using relationship characteristics such as strength may allow more refined inference. Indeed, due to influence and homophily phenomena, people maintaining the strongest relationships tend to be and become more similar. Accordingly, we present StrCoBSP, a Strength-aware Community-Based Social Profiling process that combines community structure and relationship strength to predict a user's interests in their egocentric network. We present an empirical evaluation of StrCoBSP performed on real-world co-authorship networks (DBLP/ResearchGate). The performance of the proposed approach is superior to that achieved by the existing strength-agnostic process, with lifts of up to 18.46% and 18.15% in terms of precision and recall at the top 15 returned interests.
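
A minimal sketch of the precision/recall-at-top-k evaluation reported above, with toy predicted and ground-truth interest lists.

```python
# Precision and recall over the top-k interests returned for a user.
def precision_recall_at_k(predicted, relevant, k=15):
    top_k = predicted[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

predicted = ["data mining", "semantic web", "nlp", "databases", "hci"]   # ranked inference
relevant = {"nlp", "data mining", "information retrieval"}               # true interests
print(precision_recall_at_k(predicted, relevant, k=3))   # (0.67, 0.67)
```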

Asma Chader, Hamid Haddadou, Leila Hamdad, Walid-Khaled Hidouci
Identifying Differentiating Factors for Cyberbullying in Vine and Instagram

A multitude of online social networks (OSNs) of varying types has been introduced in the past decade. Because of their enormous popularity and constant availability, the threat of cyberbullying launched via these OSNs has reached an unprecedented level. Victims of cyberbullying are now more vulnerable than ever before to the predators, perpetrators, and stalkers. In this work, we perform a detailed analysis of user postings on Vine and Instagram social networks by making use of two labeled datasets. These postings include threads of media posts and user comments that were labeled for being cyberbullying instances or not. Our analysis has revealed several important differentiating factors between cyberbullying and non-cyberbullying instances in these social networks. In particular, cyberbullying and non-cyberbullying instances differ in (i) the number of unique negative commenters, (ii) temporal distribution of positive and negative sentiment comments, and (iii) textual content of media captions and subsequent comments. The results of these analyses can be used to build highly accurate classifiers for identifying cyberbullying instances.

Rahat Ibn Rafiq, Homa Hosseinmardi, Richard Han, Qin Lv, Shivakant Mishra
Effect of Social Algorithms on Media Source Publishers in Social Media Ecosystems

Social media systems have become a primary platform to consume and exchange information nowadays. These systems usually have three main components: media sources, content distributors (social media services), and content consumers, which together we call the social media content delivery ecosystem. A content distributor runs social algorithms, as a black box, that were designed and trained to pick up, filter, and rank the most relevant and desired content to be delivered to each individual one of us. However, these modern social algorithms are typically complicated, so we do not really know how they work, and we are therefore unsure about the quality of the delivered content. Most researchers have focused on the user side and investigated how fairly content is delivered from social algorithms to users; in contrast, little attention has been paid to the impact of social algorithms on the publisher side. Thus, the main purpose of this paper is to understand how social algorithms impact content publishers in the social media ecosystem. From our SINCERE data, we first plot the time series of all posts on global and local news media Facebook pages, including CNN, Fox News, The New York Times, and The Sacramento Bee, from 2008 to early 2018, to see how their publishing times changed. We found that global news media changed their publishing times. Our hypothesis is that they did so because the social algorithms changed: if they obtained better user reactions after changing publishing time, we can assume the social algorithms might deliver more content to users at those times. We evaluated user reactions by the number of participants and user response time, and found that most publishers got better reactions from users after changing their publishing time. Therefore, we conclude that news media changed the periods in which they publish their posts to make their content more visible to users in response to changes in the social algorithms.

Ittipon Rassameeroj, S. Felix Wu
The Identification of Framing Language in Business Leaders’ Speech from the Mass Media

The value of a company can soar and plunge upon the utterances of its leaders. Organisation leaders are aware of the power of their words to influence their employees, customers, and the financial markets; consequently, they use rhetorical tricks such as framing to present statements that contain little or no useful information as if they contained positive material. These types of statements are not suitable for the traditional sentiment analysis that can be used to predict the prospects of a company. On occasion, business leaders are forced to make objective statements that contain useful information which could be used to predict the future share price or profit levels using sentiment analysis. This paper presents a technique that uses sentimental low-information words in quotes from business leaders to identify framing words. The identification of framing words allows quotes from business leaders to be ranked by framing likelihood, which can be used for further analysis.

Brett Drury, Samuel Morais Drury
Clustering Analysis of Website Usage on Twitter During the COVID-19 Pandemic

In this study we analyzed patterns of external website usage on Twitter during the COVID-19 pandemic. We used a multi-view clustering technique, which is able to incorporate multiple views of the data, to cluster the websites' URLs based on their usage patterns and the tweet text that occurs with the URLs. The results of the multi-view clustering of URLs used during the COVID-19 pandemic, from 29 January to 22 June 2020, revealed three main clusters of URL usage. These three clusters differed significantly in terms of using information from different politically biased, fake news, and conspiracy theory websites. Our results suggest that there are political biases in how information, including misinformation, about the COVID-19 pandemic is used on Twitter.

Iain J. Cruickshank, Kathleen M. Carley

Data-Driven Software Engineering

Frontmatter
Calibrated Viewability Prediction for Premium Inventory Expansion

Billions of ads are displayed on a daily basis, making online advertising a multi-billion-dollar industry. Most web pages contain multiple ads, which are largely served in real time using a bidding process where buyers (advertisers) offer a price tag to the seller (publishers) for each given possible ad on the page. Multiple factors impact an ad's price; one of the primary ones is the ad location's viewability likelihood. Due to the length of many web pages, certain ad locations are invisible to the visiting user, who may not scroll far enough down the page to where the ads are placed. According to recent industry metrics, less than 60% of ads are viewable. This poses a challenge to both buyers and sellers. Buyers want to optimize the likelihood that they buy an ad that will be viewed, while sellers want to maximize ad prices (by setting higher floor prices) by providing as many possible ad placements with high viewability probability. This paper addresses viewability prediction from the publisher's side and proposes a novel algorithm based on cascading gradient boosting. The algorithm enables sellers to predict an accurate viewability probability for ad impressions, optimized to match the actual viewability rate that will be measured for the served ads. Unlike other algorithms that minimize an average difference from a central mean error, we propose an algorithm that increases the number of extreme cases, which are the most valuable ones, thus expanding the premium ad inventory. We evaluate the algorithm on two datasets with a total of over 500 million impressions. We found that the algorithm outperforms other viewability prediction algorithms and works well for publishers while providing a measurable fairness metric to advertisers.
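
A minimal sketch of calibrated viewability probabilities with scikit-learn, using a gradient boosting classifier wrapped in probability calibration as a generic stand-in for the proposed cascading algorithm; the data are synthetic.

```python
# Calibrate predicted viewability probabilities so they match observed rates.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.6, 0.4],
                           random_state=0)           # y = "ad was viewed"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=3)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
# Calibration check: predicted mean viewability vs. observed viewability rate.
print(f"mean predicted: {proba.mean():.3f}  observed rate: {y_te.mean():.3f}")
```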

Jonathan Schler, Allon Hammer
Data Driven Policy Making: The Peruvian Water Resources Observatory

Nowadays, Big Data holds vast potential for improving decision-making in public policy, thanks to the different methodologies for working with complex, heterogeneous big data, which allow proposing policies based on real and measurable key performance indicators. This article describes the water resources observatory of the Public Management School of Universidad del Pacífico. The idea behind the observatory is to handle data extracted from non-traditional sources to enable efficient and responsive government solutions through evidence-based public policies for water regulation. We used the Elastic Search stack to centralize and visualize data from different sources, which was standardized using river basins as basic units. Finally, we show a use case of the gathered data to optimize the water supply in new urban zones on Lima's periphery.

Giuliana Barnuevo, Elsa Galarza, Maria Paz Herrera, Juan G. Lazo Lazo, Miguel Nunez-del-Prado, José Luis Ruiz

Graph Mining

Frontmatter
Complex Networks to Differentiate Elderly and Young People

Cardiovascular disease (CVD) is a general term that describes different heart problems. There are several heart diseases that still lead thousands of people to sudden death, among them high blood pressure, ischemia, variations in cardiac rhythm, and pericardial effusion. Studies of these diseases are usually made through the analysis of electrocardiogram (ECG) signals, which carry valuable information about the state of the heart. Recent papers have proposed the creation of quantile graphs (QGs) from ECG data. In this method, based on transition probabilities, a quantile graph results from a time series mapped into a network. This so-called QG method can be employed to differentiate between young and elderly patients using their ECG signals. The primary goal of our paper is to show how variations in ECG signals are mirrored in the respective QGs' topology. Our analyses were centered on three metrics: mean jump length, betweenness centrality, and clustering coefficient. The results indicate that the QG method is a reliable tool for differentiating ECG exams with respect to the age of the patients.
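
A minimal sketch of the quantile-graph mapping under simplifying assumptions: the series is discretized into quantile bins, bins become nodes, and consecutive-sample transitions become weighted directed edges; the signal is synthetic, not an ECG.

```python
import numpy as np
import networkx as nx

def quantile_graph(series, q=10):
    """Map a time series into a directed, weighted transition graph over q quantile bins."""
    edges = np.quantile(series, np.linspace(0.0, 1.0, q + 1))
    bins = np.digitize(series, edges[1:-1])          # node label 0..q-1 per sample
    g = nx.DiGraph()
    for a, b in zip(bins[:-1], bins[1:]):
        a, b = int(a), int(b)
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)
    return g

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)
g = quantile_graph(signal, q=10)
print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
print("mean betweenness:", np.mean(list(nx.betweenness_centrality(g).values())))
print("mean clustering:", nx.average_clustering(g.to_undirected()))
```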

Aruane M. Pineda, Francisco A. Rodrigues
Analysis of the Health Network of Metropolitan Lima Against Large-Scale Earthquakes

Peru is a highly seismic country located in the Ring of Fire, making it vulnerable to earthquakes and tsunamis. In the present work, we examined Lima's health system capacity from three different and complementary points of view. We first analyzed the Hospital Treatment Capacity (HTC) of 41 category II and III hospitals from EsSalud and MINSA in Lima, Peru. Second, we computed the hospitals' coverage areas and the citizens' health demand in the aftermath of an earthquake of 8 Mw magnitude. Finally, an accessibility simulation to reach the hospitals was performed, taking into account real traffic conditions and street degradation. This document aims to provide elements for strengthening the fragile Peruvian health system.

Miguel Nunez-del-Prado, John Barrera
Quasiquadratic Time Algorithms for Square and Pentagon Counting in Real-World Networks

Counting structures in graphs (triangles, cliques, graphlets, etc.) is an important task when analyzing real-world networks, given its usefulness for link prediction, community discovery, graph classification, etc. Among the many structures in graphs, triangle counting (cycles of length 3) has been a hot research topic in recent years, and as a result a number of algorithms have been developed for counting triangles and many closely related structures. However, square and pentagon counting (cycles of length 4 and 5, respectively) have received less attention despite their importance when analyzing real-world bipartite graphs and their influence on the graph spectrum. In this work, we propose quasiquadratic time algorithms for approximately counting the number of squares and pentagons in a simple graph as a whole, and also for obtaining such counts per vertex.
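
For orientation, the sketch below counts squares exactly as a dense-matrix baseline, not the quasiquadratic approximation proposed above: every 4-cycle i–a–j–b–i is determined by a diagonal pair {i, j} together with a pair {a, b} of their common neighbours.

```python
import numpy as np
import networkx as nx

def count_squares(g: nx.Graph) -> int:
    a = nx.to_numpy_array(g, dtype=np.int64)
    common = a @ a                         # common[i, j] = common neighbours of i and j
    np.fill_diagonal(common, 0)            # ignore i == j
    pairs = common * (common - 1) // 2     # C(common, 2) for every ordered pair (i, j)
    # Each square contributes 4 to this sum (2 diagonals x 2 orderings).
    return int(pairs.sum() // 4)

print(count_squares(nx.cycle_graph(4)))                   # exactly one square -> 1
print(count_squares(nx.complete_graph(4)))                # K4 contains 3 squares -> 3
print(count_squares(nx.complete_bipartite_graph(3, 3)))   # K3,3 -> 9
```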

Grover E. C. Guzman, Jared León
Identifying Covid-19 Impact on Peruvian Mental Health During Lockdown Using Social Network

The current outbreak generated by SARS-CoV-2 presented a challenge to governments because public health, the economy, and society are different in every country, so actions must be fitted to these pre-existing conditions. South America is a region of developing countries with limitations and problems, and the pandemic highlighted them. Peru adopted good initial policies to contain the pandemic: a lockdown started on March 15 and lasted more than 100 days. As a consequence, people were forced to change their daily activities and, of course, social and mental problems started to grow. The present study aims to identify the COVID-19 impact on the social network Twitter by filtering posts related to the topic. The initial findings show high interest in the topic during the first week and a decreasing pattern in the following weeks.

Josimar E. Chire Saire, Jimy Frank Oblitas Cruz
Diagnosis of SARS-CoV-2 Based on Patient Symptoms and Fuzzy Classifiers

The containment, mitigation, and prevention measures that governments have implemented around the world do not appear to be sufficient to prevent the spread of SARS-CoV-2. The number of infected and dead continues to rise every day, putting a strain on the capacity and infrastructure of hospitals and medical centers. Therefore, it is necessary to develop new diagnostic methods based on patients' symptoms that allow the generation of early warnings for appropriate treatment. This paper presents a new method, under development, for the diagnosis of SARS-CoV-2 based on patient symptoms and the use of fuzzy classifiers. Eleven (11) variables were fuzzified, knowledge rules were established, and finally the center-of-mass method was used to generate the diagnostic results. The method was tested with a database of clinical records of symptomatic and asymptomatic SARS-CoV-2 patients. Testing the proposed model with data from symptomatic patients, we obtained 100% sensitivity and 100% specificity. Patients are classified into two classes according to their symptoms, allowing patients requiring immediate attention to be distinguished from those with milder symptoms.
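
A minimal sketch of the centre-of-mass (centroid) defuzzification step, with simplified triangular membership functions and made-up rule activations standing in for the eleven fuzzified variables.

```python
# Turn an aggregated fuzzy output into a crisp score via its centre of mass.
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership function with feet at a, c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

x = np.linspace(0.0, 10.0, 1001)              # universe: risk score 0..10
low = triangular(x, 0.0, 2.0, 5.0)            # "mild symptoms" output set
high = triangular(x, 5.0, 8.0, 10.0)          # "needs immediate attention" output set

# Suppose rule evaluation activated "low" at 0.2 and "high" at 0.8.
aggregated = np.maximum(np.minimum(low, 0.2), np.minimum(high, 0.8))

# Centre of mass of the aggregated membership function.
crisp = np.sum(x * aggregated) / np.sum(aggregated)
print(f"crisp risk score: {crisp:.2f}")
```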

Fray L. Becerra-Suarez, Heber I. Mejia-Cabrera, Víctor A. Tuesta-Monteza, Manuel G. Forero

Semantic Web, Repositories, and Visualization

Frontmatter
Distributed Identity Management for Semantic Entities

We propose semDIM, a novel approach for Semantic Distributed Identity Management based on a Semantic Web architecture. For the first time, semDIM provides a framework for the distributed definition and management of entities such as persons belonging to an organization, groups, and roles across namespaces. It is suitable for informal networks, i.e., social networks, as well as for professional networks such as cross-organizational collaborations. Beyond the capabilities of existing Identity Management solutions, we allow distributed identifiers and management of groups (consisting of agents and sub-groups) and roles. semDIM uses owl:sameAs as a central property to represent and verify distributed identities via formal reasoning. This concept enables novel functionalities for Distributed Identity Management, as these entities can be referred to, related to each other, and managed across namespaces. Our semDIM approach consists of a modular software architecture, a process model, and a set of state-of-the-art DUL-based OWL ontology patterns. We demonstrate our approach with an example implementation that evaluates its functional fitness.
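
A minimal sketch of asserting and listing an owl:sameAs identity link with rdflib; the namespaces and IRIs are invented for illustration, and the reasoning step is only indicated.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, FOAF

ORG_A = Namespace("https://org-a.example/people/")
ORG_B = Namespace("https://org-b.example/members/")

g = Graph()
g.add((ORG_A.alice, RDF.type, FOAF.Person))
g.add((ORG_A.alice, OWL.sameAs, ORG_B.a_smith))   # the same person in another namespace

# A reasoner (or a simple closure over owl:sameAs) can merge statements made
# about either identifier; here we just list the asserted identity links.
for s, o in g.subject_objects(OWL.sameAs):
    print(f"{s} is declared identical to {o}")
```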

Falko Schönteich, Andreas Kasten, Ansgar Scherp
Telegram: Data Collection, Opportunities and Challenges

Over the years, social media platforms such as Facebook, Twitter, etc., have become a valuable resource for marketing, public relations, etc. One emerging mobile instant messaging medium, Telegram, has recently gained momentum in countries such as Brazil, Indonesia, Iran, Russia, Ukraine, and Uzbekistan. While most social media platforms have been studied extensively, Telegram is still underexplored and a gold mine for researchers and social scientists wishing to study user behaviors. Moreover, the ease of data collection through its API and access to historical data make it a lucrative platform for social computing research. This paper explores the features of Telegram and presents a methodology to collect and analyze data. We also demonstrate the viability of the platform as a source for social computing research by presenting a case study on Ukrainian parliamentary members' discourse. We conduct both text and network analysis to gain insights into political discourse and public opinion. Our findings include the use of Telegram by Ukrainian politicians to connect with their voter base, promote their work, and ridicule their peers. Channels actively disseminate information on current political affairs, and chat groups discuss views on the Ukrainian government. From our study, we conclude that Telegram is a rich data source for studying social behavior, analyzing information campaigns through content dissemination, etc. This study opens a plethora of future research opportunities on Telegram.

Tuja Khaund, Muhammad Nihal Hussain, Mainuddin Shaik, Nitin Agarwal
Graph Theory Applied to International Code of Diseases (ICD) in a Hospital

We analyze comorbidity in a set of International Code of Diseases (ICD) records from a tertiary-level hospital based on graph theory. Comorbidity is the simultaneous presence of two or more diseases or conditions in a patient. A total of 36,236 patient health records, containing 80,253 ICD code notifications, have been studied. We show that over 43.0% of all comorbidities can be determined from the dominant graph edges. This study goes beyond a first-order statistical analysis. ICD graph chapters can be used to understand patient flow in hospital specialties and may also be used to quantify inter- and intra-relationships among major hospitalization events. Our results could help plan hospital organization by arranging highly correlated sectors close together.
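
A minimal sketch of building a comorbidity co-occurrence graph with networkx; the patient records below are illustrative, not the hospital data.

```python
# ICD codes are nodes; an edge is weighted by how often two codes co-occur
# in the same patient record.
from itertools import combinations
import networkx as nx

patient_records = [
    ["I10", "E11", "N18"],     # hypertension, type 2 diabetes, chronic kidney disease
    ["I10", "E11"],
    ["J45", "J30"],            # asthma, rhinitis
    ["I10", "N18"],
]

g = nx.Graph()
for codes in patient_records:
    for a, b in combinations(sorted(set(codes)), 2):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)

# Dominant edges: the co-occurrences that account for most comorbidity.
for a, b, w in sorted(g.edges(data="weight"), key=lambda e: e[2], reverse=True):
    print(a, b, w)
```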

C. Boldorini Jr., C. D. G. Euzebio, L. P. Porto, A. S. Martinez, E. E. S. Ruiz
CovidStream: Interactive Visualization of Emotions Evolution Associated with Covid-19

Since the beginning of the pandemic caused by Covid-19, the emotions of humanity have evolved abruptly, mainly due to the policies adopted by national governments. Since these policies have a high impact on people's health, they need feedback on people's emotional perception and its connections with entities directly related to emotions, in order to provide relevant information for decision making. Given the global social isolation, emotions have been expressed with greater intensity in comments on social networks, generating a large amount of data that is a source for various investigations. The objective of this work is to design and adapt an interactive visualization tool called CovidStream for monitoring the evolution of emotions associated with Covid-19 in Peru, combining Visual Analytics, Deep Learning, and Sentiment Analysis techniques. This visualization tool shows the evolution of the emotions associated with Covid-19 and their relationships with three kinds of entities (persons, places, and organizations) that have an impact on emotions, all in a temporal-spatial dimension. For the visualization of entities and emotions, Peruvian tweets extracted between January and July 2020 were used, all of them with the hashtag #Covid-19. For the classification of emotions, a recurrent neural network model with an LSTM architecture was implemented, taking as training and test data the dataset proposed by SemEval-2018 Task 1, corresponding to Spanish tweets labeled with the emotions anger, fear, joy, and sadness.
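
A minimal sketch of an LSTM emotion classifier over four classes, assuming TensorFlow/Keras; vocabulary size, sequence length, and the random training data are placeholders for the SemEval-2018 Task 1 setup.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, N_CLASSES = 20000, 40, 4   # classes: anger, fear, joy, sadness

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder data: integer-encoded tweets and emotion labels.
x = np.random.randint(0, VOCAB_SIZE, size=(256, MAX_LEN))
y = np.random.randint(0, N_CLASSES, size=(256,))
model.fit(x, y, epochs=1, batch_size=32, verbose=0)
print(model.predict(x[:1]))
```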

Herwin Alayn Huillcen Baca, Flor de Luz Palomino Valdivia, Yalmar Ponce Atencio, Manuel J. Ibarra, Mario Aquino Cruz, Melvin Edward Huillcen Baca
Backmatter
Metadata
Title
Information Management and Big Data
Editors
Juan Antonio Lossio-Ventura
Jorge Carlos Valverde-Rebaza
Eduardo Díaz
Hugo Alatrista-Salas
Copyright Year
2021
Electronic ISBN
978-3-030-76228-5
Print ISBN
978-3-030-76227-8
DOI
https://doi.org/10.1007/978-3-030-76228-5
