2019 | Book

Web Engineering

19th International Conference, ICWE 2019, Daejeon, South Korea, June 11–14, 2019, Proceedings

About this Book

This book constitutes the refereed proceedings of the 19th International Conference on Web Engineering, ICWE 2019, held in Daejeon, South Korea, in June 2019.

The 26 full research papers and 9 short papers presented were carefully reviewed and selected from 106 submissions. Additionally, two demonstrations, four posters, and four contributions to the PhD symposium as well as five tutorials are included in this volume. The papers cover research areas such as Web mining and knowledge extraction, Web big data and Web data analytics, social Web applications and crowdsourcing, Web user interfaces, Web security and privacy, Web programming, Web services and computing, Semantic Web and linked open data applications, and Web application modeling and engineering.

Table of Contents

Frontmatter
Correction to: Dragon: Decision Tree Learning for Link Discovery

In the acknowledgements of the originally published version, the project number of OPAL was wrong. This has been corrected so that the new version now reads: “This work has been supported by the BMVI projects LIMBO (project no. 19F2029C) and OPAL (project no. 19F2028A).”

Daniel Obraczka, Axel-Cyrille Ngonga Ngomo

Web Mining and Knowledge Extraction

Frontmatter
Web Page Structured Content Detection Using Supervised Machine Learning

In this paper we present a comparative study using several supervised machine learning techniques, including homogeneous and heterogeneous ensembles, to solve the problem of classifying content and noise in web pages. We specifically tackle the problem of detecting content in semi-structured data (e.g., e-commerce search results) under two different settings: a controlled environment with only structured content documents, and an open environment where the web page being processed may or may not have structured content. The features are automatically obtained from a preexisting and publicly available extraction technique that processes web pages as a sequence of tag paths; the features are thus extracted from these sequences instead of the DOM tree. Besides comparing the performance of different models, we have also conducted extensive feature selection/combination experiments. We achieve an average F-score of about 93% in the controlled setting and 91% in the open setting.

Roberto Panerai Velloso, Carina F. Dorneles
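
To make the tag-path representation concrete, here is a minimal, hypothetical Python sketch of the idea: each text node is keyed by its path of tags from the root, and simple per-node features (depth, path frequency, text length) are derived from the sequence. The feature set and classifiers used in the paper differ; all names below are illustrative.

from html.parser import HTMLParser

class TagPathExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # current open-tag path
        self.records = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching tag (tolerates bad nesting)
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/".join(self.stack), text))

def features(records):
    # toy per-node features: path depth, path frequency, text length
    freq = {}
    for path, _ in records:
        freq[path] = freq.get(path, 0) + 1
    return [{"depth": path.count("/") + 1, "path_freq": freq[path],
             "text_len": len(text)} for path, text in records]

html = ("<html><body><div><ul><li>item A</li><li>item B</li></ul>"
        "<p>footer</p></div></body></html>")
parser = TagPathExtractor()
parser.feed(html)
for record, feat in zip(parser.records, features(parser.records)):
    print(record, feat)

Repeated tag paths with short texts (such as the two list items above) are exactly the kind of regularity a classifier can pick up when separating structured content from noise.
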
Augmenting LOD-Based Recommender Systems Using Graph Centrality Measures

In this paper we investigate the incorporation of graph-based features into LOD path-based recommender systems, an approach that so far has received little attention. More specifically, we propose two normalisation procedures that adjust user-item path counts by the degree centrality of the nodes connecting them. Evaluation on the MovieLens 1M dataset shows that the linear normalisation approach yields a significant increase in recommendation accuracy as compared to the default case, especially in settings where the most popular movies are omitted. These results serve as a fruitful base for further incorporation of graph measures into recommender systems, and might help in establishing the recommendation diversity that has recently gained much attention.

Bart van Rossum, Flavius Frasincar
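
As a hedged illustration of the normalisation idea, the toy Python sketch below counts length-2 user-item connection paths in a small LOD graph and down-weights each path by the degree of its connecting node, so that hub nodes (e.g., very common genres) contribute less. The paper's exact normalisation procedures may differ; the graph and names are invented.

from collections import defaultdict

edges = [  # toy LOD graph, undirected
    ("MovieA", "Spielberg"), ("MovieB", "Spielberg"),
    ("MovieA", "Drama"), ("MovieB", "Drama"), ("MovieC", "Drama"),
    ("MovieC", "Kubrick"),
]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def path_score(liked, candidate, normalise=True):
    # sum over length-2 paths: liked -- node -- candidate
    score = 0.0
    for node in adj[liked] & adj[candidate]:
        score += 1.0 / len(adj[node]) if normalise else 1.0
    return score

liked = "MovieA"
for candidate in ("MovieB", "MovieC"):
    print(candidate,
          "raw:", path_score(liked, candidate, normalise=False),
          "normalised:", round(path_score(liked, candidate), 3))
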
ST-Sem: A Multimodal Method for Points-of-Interest Classification Using Street-Level Imagery

Street-level imagery contains a variety of visual information about the facades of Points of Interest (POIs). In addition to general morphological features, signs on the facades of, primarily, business-related POIs can be a valuable source of information about the type and identity of a POI. Recent advancements in computer vision make it possible to leverage visual information from street-level imagery and contribute to the classification of POIs. However, the existing literature has paid little attention to assessing the value of visual labels contained in street-level imagery as indicators of POI categories. This paper presents Scene-Text Semantics (ST-Sem), a novel method that leverages visual labels (e.g., texts, logos) from street-level imagery as complementary information for the categorization of business-related POIs. Contrary to existing methods that fuse visual and textual information at the feature level, we propose a late fusion approach that combines visual and textual cues after resolving issues of incorrect digitization and semantic ambiguity of the retrieved textual components. Experiments on two existing datasets and a newly created one show that ST-Sem can outperform visual-only approaches by 80% and related multimodal approaches by 4%.

Shahin Sharifi Noorian, Achilleas Psyllidis, Alessandro Bozzon
Time and Location Recommendation for Crime Prevention

In recent years we have seen more and more open government and administrative data made available on the Web. Crime data, for example, allows civic organizations and ordinary citizens to obtain safety-related information on their surroundings. In this paper, we study crime prediction as a recommendation problem, using fine-grained open crime data. A common issue in current crime prediction methods is that, given fine-grained spatio-temporal units, crime data become very sparse and prediction does not work properly. By modeling crime prediction as a recommendation problem, however, we can make use of the abundant selection of methods in recommender systems that inherently handle data sparsity. We present our model and show how collaborative filtering and context-based recommendation methods can be applied. Focusing on two major types of crimes in the city of San Francisco, our empirical results show that recommendation methods can outperform traditional crime prediction methods at small spatial and temporal granularity. Specifically, we show that by using recommendation methods, we can capture 70% of future thefts using only 20% of the man-hours, 13% more than traditional methods.

Yihong Zhang, Panote Siriaraya, Yukiko Kawai, Adam Jatowt
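
A minimal sketch of the reformulation, assuming spatial cells play the role of users and time slots the role of items: sparse historical crime counts are completed with item-based collaborative filtering. Cosine similarity is used here purely for illustration; the paper evaluates several recommendation methods.

import numpy as np

# rows = grid cells, cols = hour-of-week slots; entries = historical crime counts
R = np.array([
    [5, 0, 2, 0],
    [4, 1, 0, 3],
    [0, 0, 1, 4],
], dtype=float)

def cosine_sim(M):
    # slot-by-slot cosine similarity over the count matrix
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    U = M / norms
    return U.T @ U

S = cosine_sim(R)
pred = R @ S / np.abs(S).sum(axis=0, keepdims=True)  # weighted average per slot
print(np.round(pred, 2))  # rank (cell, slot) pairs by pred to allocate patrols
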
Incremental PARAFAC Decomposition for Three-Dimensional Tensors Using Apache Spark

Recent studies have focused on tensor decomposition for tensor analysis because this method can identify more latent factors and patterns than the matrix factorization approach. Existing tensor decomposition studies used static datasets in their analyses. However, in practice, data change and increase over time. Therefore, this paper proposes an incremental Parallel Factor Analysis (PARAFAC) tensor decomposition algorithm for three-dimensional tensors. Incremental tensor decomposition can reduce the recalculation costs associated with the addition of new tensors. The proposed method, called InParTen, performs distributed incremental PARAFAC tensor decomposition based on the Apache Spark framework. It decomposes only the new tensors and then combines the results with the existing ones, without recalculating the complete tensors. In this study, it was assumed that the tensors grow with time, as the majority of the dataset is added over a period. The performance of InParTen was evaluated by comparing execution time and relative error against existing tensor decomposition tools. The results show that the method can reduce the recalculation cost of tensor decomposition.

Hye-Kyung Yang, Hwan-Seung Yong
Modeling Heterogeneous Influences for Point-of-Interest Recommendation in Location-Based Social Networks

The huge amount of heterogeneous information in location-based social networks (LBSNs) creates great challenges for POI recommendation. User check-in behavior exhibits two properties: diversity and imbalance. To effectively model both properties, we propose an Aspect-aware Geo-Social Matrix Factorization (AGS-MF) approach to exploit various factors in a unified manner for more effective POI recommendation. Specifically, we first construct a novel knowledge graph (KG), named Aspect-aware Geo-Social Influence Graph (AGS-IG), to unify multiple influential factors by integrating the heterogeneous information about users, POIs and aspects from reviews. We design an efficient meta-path based random walk to discover relevant neighbors of each user and POI based on multiple influential factors. The extracted neighbors are further incorporated into AGS-MF with automatically learned personalized weights for each user and POI. By doing so, both diversity and imbalance can be modeled for better capturing the characteristics of users and POIs. Experimental results on several real-world datasets demonstrate that AGS-MF outperforms state-of-the-art methods.

Qing Guo, Zhu Sun, Jie Zhang, Yin-Leng Theng
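
The neighbor-discovery step can be illustrated with a meta-path-constrained random walk over a small heterogeneous graph. The schema, node names, and the User-POI-Aspect-POI meta-path below are invented for the example; the paper's AGS-IG and its weighting scheme are richer.

import random
from collections import Counter, defaultdict

edges = [  # node names are prefixed with their type
    ("user:u1", "poi:cafeA"), ("user:u1", "poi:cafeB"),
    ("poi:cafeA", "aspect:coffee"), ("poi:cafeB", "aspect:coffee"),
    ("poi:cafeC", "aspect:coffee"), ("poi:cafeC", "aspect:wifi"),
]
adj = defaultdict(list)
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

def metapath_walk(start, metapath, walks=200, rng=random.Random(7)):
    # count end nodes of random walks constrained to the given type sequence
    ends = Counter()
    for _ in range(walks):
        node = start
        for next_type in metapath[1:]:
            options = [n for n in adj[node] if n.startswith(next_type + ":")]
            if not options:
                break
            node = rng.choice(options)
        else:
            ends[node] += 1
    return ends

# relevant POI neighbors of u1 via shared aspects
print(metapath_walk("user:u1", ["user", "poi", "aspect", "poi"]))
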
Exploring Semantic Change of Chinese Word Using Crawled Web Data

The way words change their meanings over time reflects various shifts in socio-cultural attitudes and conceptual structures. Understanding the semantic change of words over time is important for studying models of language and cultural evolution. Word embedding methods such as PPMI, SVD and word2vec have been evaluated in recent years. These representation methods, sometimes referred to as semantic maps of words, can facilitate the whole process of language processing, and the Chinese language is no exception. The development of technology gradually influences people's communication and the language they use. In this paper, a huge amount of data (300 GB) provided by Sogou, a Chinese web search engine provider, is pre-processed to obtain a Chinese language corpus. Three different word representation methods are extended to include temporal information; they are trained and tested on this dataset. A thorough qualitative and quantitative analysis is conducted with different thresholds to capture the semantic accuracy and alignment quality of the shifted words. The three methods are compared, and possible reasons behind the experimental results are discussed.

Xiaofei Xu, Yukun Cao, Li Li
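
Since the abstract lists PPMI among the representations trained per time period, the following minimal sketch shows a standard PPMI computation from a co-occurrence matrix; building one matrix per period and aligning the resulting spaces, as required for the shift analysis, is omitted for brevity.

import numpy as np

# toy word-by-context co-occurrence counts
C = np.array([
    [10, 2, 0],
    [3, 8, 1],
    [0, 1, 6],
], dtype=float)

def ppmi(C):
    total = C.sum()
    p_w = C.sum(axis=1, keepdims=True) / total   # word marginals
    p_c = C.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2((C / total) / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)                  # keep positive PMI only

print(np.round(ppmi(C), 3))
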

Web Big Data and Web Data Analytics

Frontmatter
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter Users

On social media platforms and Twitter in particular, specific classes of users such as influencers have been given satisfactory operational definitions in terms of network and content metrics. Others, for instance online activists, are no less important, but their characterisation still requires experimentation. We make the hypothesis that such interesting users can be found within temporally and spatially localised contexts, i.e., small but topical fragments of the network containing interactions about social events or campaigns with a significant footprint on Twitter. To explore this hypothesis, we have designed a continuous user profile discovery pipeline that produces an ever-growing dataset of user profiles by harvesting and analysing contexts from the Twitter stream. The profiles dataset includes key network and content-based user metrics, enabling experimentation with user-defined score functions that characterise specific classes of online users. The paper describes the design and implementation of the pipeline and its empirical evaluation on a case study consisting of healthcare-related campaigns in the UK, showing how it supports the operational definitions of online activism by comparing three experimental ranking functions. The code is publicly available.

Flavio Primo, Paolo Missier, Alexander Romanovsky, Mickael Figueredo, Nelio Cacho
Predicting Graph Operator Output over Multiple Graphs

A growing list of domains, at the forefront of which are Web data and applications, is modeled by graph representations. In content-driven graph analytics, knowledge must be extracted from large numbers of available data graphs. As the number of datasets (a different type of volume) can reach immense sizes, a thorough evaluation of each input is prohibitively expensive. To date, there exists no efficient method to quantify the impact of numerous available datasets on different graph analytics tasks. To address this challenge, we propose an efficient graph operator modeling methodology. Our novel, operator-agnostic approach focuses on the inputs themselves, utilizing graph similarity to infer knowledge about them. An operator is executed for a small subset of the available inputs and its behavior is modeled for the rest of the graphs using machine learning. We propose a family of similarity measures based on the degree distribution that prove capable of producing high-quality models for many popular graph tasks, even compared to modern, state-of-the-art similarity functions. Our evaluation over both real-world and synthetic graph datasets indicates that our method achieves extremely accurate modeling of many commonly encountered operators, while achieving massive speedups over a brute-force alternative.

Tasos Bakogiannis, Ioannis Giannakopoulos, Dimitrios Tsoumakos, Nectarios Koziris
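
A hedged sketch of the degree-distribution idea: each graph is summarized by a normalised degree histogram, and histograms are compared with a simple L1 distance; a regressor trained on the operator's outputs for a few graphs would then predict outputs for the rest. The concrete similarity measures in the paper differ.

from collections import Counter

def degree_histogram(edges, bins=8):
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    hist = [0.0] * bins
    for d in deg.values():
        hist[min(d, bins) - 1] += 1
    n = sum(hist) or 1.0
    return [h / n for h in hist]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

g1 = [(1, 2), (2, 3), (3, 1)]              # triangle
g2 = [(1, 2), (1, 3), (1, 4), (1, 5)]      # star
g3 = [(1, 2), (2, 3), (3, 4), (4, 1)]      # cycle
h1, h2, h3 = (degree_histogram(g) for g in (g1, g2, g3))
print("triangle vs star :", round(l1_distance(h1, h2), 3))
print("triangle vs cycle:", round(l1_distance(h1, h3), 3))
# a regressor over these distances would model the operator's output
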
Streaming Event Detection in Microblogs: Balancing Accuracy and Performance

In this work, we model the problem of online event detection in microblogs as a stateful stream processing problem and offer a novel solution that balances result accuracy and performance. Our new approach builds on two state-of-the-art algorithms. The first algorithm is based on identifying bursty keywords inside blocks of blog messages. The second one involves clustering blog messages based on the similarity of their contents. To combine the computational simplicity of the keyword-based algorithm with the semantic accuracy of the clustering-based algorithm, we propose a new hybrid algorithm. We then implement these algorithms in a streaming manner, on top of Apache Storm augmented with Apache Cassandra for state management. Experiments with a 12M tweet dataset from Twitter show that our hybrid approach provides a better accuracy-performance compromise than the previous approaches.

Ozlem Ceren Sahin, Pinar Karagoz, Nesime Tatbul
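
The bursty-keyword component can be sketched as follows: within each block of messages, a keyword is flagged as bursty when its count significantly exceeds its historical mean. The thresholding and the clustering stage that the hybrid algorithm adds are simplified away; names and numbers are illustrative.

from collections import Counter

def bursty_keywords(block_tokens, history, z_threshold=3.0):
    # history maps keyword -> (mean, std) of its per-block counts so far
    counts = Counter(block_tokens)
    bursts = []
    for word, count in counts.items():
        mean, std = history.get(word, (0.0, 1.0))
        z = (count - mean) / (std or 1.0)
        if z >= z_threshold:
            bursts.append((word, round(z, 1)))
    return bursts

history = {"goal": (1.0, 0.5), "rain": (4.0, 2.0)}
block = ["goal"] * 9 + ["rain"] * 5 + ["bus"] * 2
print(bursty_keywords(block, history))  # -> [('goal', 16.0)]
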
Supervised Group Embedding for Rumor Detection in Social Media

To detect rumors automatically in social media, methods based on recurrent neural network and convolutional neural network have been proposed. These methods split a stream of posts related to an event into several groups along time, and represent each group using unsupervised methods such as paragraph vector. However, many posts in a group (e.g., retweeted posts) do not contribute much to rumor detection, which deteriorates the performance of rumor detection based on unsupervised group embedding. In this paper, we propose a Supervised Group Embedding based Rumor Detection (SGERD) model that considers both textual and temporal information. Particularly, SGERD exploits post-level textual information to generate group embeddings, and is able to identify salient posts for further analysis. Experimental results on two real-world datasets demonstrate the effectiveness of our proposed model.

Yuwei Liu, Xingming Chen, Yanghui Rao, Haoran Xie, Qing Li, Jun Zhang, Yingchao Zhao, Fu Lee Wang
Fast Incremental PageRank on Dynamic Networks

Real-world networks are very large and are constantly changing. Computing PageRank values for such dynamic networks is an important challenge in network science. In this paper, we propose an efficient Monte Carlo based algorithm for PageRank tracking on dynamic networks. A revisit probability model is also presented to provide theoretical support for our algorithm. For a graph with $n$ nodes, the proposed algorithm maintains only $nR$ random walk segments ($R$ random walks starting from each node) in memory. The time cost to update PageRank scores for each graph modification is proportional to $n/|E|$ ($E$ is the edge set). Experiments on 5 real-world networks indicate that our algorithm is 1.3–30 times faster than state-of-the-art algorithms and does not accumulate any errors.

Zexing Zhan, Ruimin Hu, Xiyue Gao, Nian Huai
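
A minimal sketch of the Monte Carlo idea behind such algorithms: keep R random-walk segments per node and estimate PageRank from normalized visit counts, so that after an edge change only walks passing through the affected nodes need re-simulation. The revisit probability model and exact bookkeeping of the paper are not reproduced here.

import random
from collections import Counter

def simulate_walk(adj, start, alpha=0.85, rng=random.Random(0)):
    # walk until the (1 - alpha) stop event or a dangling node
    walk, node = [start], start
    while rng.random() < alpha and adj.get(node):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

def pagerank_estimate(adj, R=100):
    visits = Counter()
    for node in adj:
        for _ in range(R):
            visits.update(simulate_walk(adj, node))
    total = sum(visits.values())
    return {n: v / total for n, v in visits.items()}

adj = {1: [2], 2: [3], 3: [1], 4: [1]}
pr = pagerank_estimate(adj)
print({n: round(p, 3) for n, p in sorted(pr.items())})
# after inserting an edge, re-run simulate_walk only for walks through
# the affected endpoints instead of recomputing everything
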

Social Web Applications and Crowdsourcing

Frontmatter
Crowdsourced Time-Sync Video Recommendation via Semantic-Aware Neural Collaborative Filtering

As an emerging type of video comments, time-sync comments (TSCs) enable viewers to comment on video shots in a real-time manner. Such comments reflect user interests well at the frame level, which can be utilized to further improve the accuracy of video recommendation. In this paper, we make the first attempt in this direction and propose a new video recommendation algorithm called SACF that exploits the temporal relationship between time-sync comments and video frames. Our algorithm extracts a rich set of semantic features from crowdsourced time-sync comments, and combines latent semantic representations of users and videos through neural collaborative filtering. We conduct extensive experiments using real TSC datasets, and the results show that our proposed algorithm can improve recommendation performance by 9.73% in HR@10 and 5.72% in NDCG@10 compared with baseline solutions.

Zhanpeng Wu, Yan Zhou, Di Wu, Yipeng Zhou, Jing Qin
On Twitter Bots Behaving Badly: Empirical Study of Code Patterns on GitHub

Bots, i.e., algorithmically driven entities that behave like humans in online communications, are increasingly infiltrating social conversations on the Web. If not properly prevented, the presence of bots may cause harm to the humans they interact with. This paper aims to understand which types of abuse may lead to harm and whether these can be considered intentional or not. We manually review a dataset of 60 Twitter bot code repositories on GitHub, derive a set of potentially abusive actions, characterize them using a taxonomy of abstract code patterns, and assess the potential abusiveness of the patterns. The study not only reveals the existence of 31 communication-specific code patterns, which could be used to assess the harmfulness of bot code, but also shows their presence throughout all studied repositories.

Andrea Millimaggi, Florian Daniel
CrowDIY: How to Design and Adapt Collaborative Crowdsourcing Workflows Under Budget Constraints

Workflow quality is a key determinant of crowdsourcing complex work, but finding effective ways to design and plan tasks has proved elusive. We instead formulate workflow design as an optimization problem with budget constraints and fewer decision variables to set. We propose CrowDIY, a two-staged approach that can not only estimate task attributes based on previous tasks but also optimize them under budget constraints in order to publish tasks more wisely and in a timely manner. Several experimental studies have been conducted, and the results show compelling evidence that, under different conditions, the proposed approach can effectively reduce the workload of workflow design and planning, while avoiding the trial-and-error commonly encountered in crowdsourcing workflows and leading to successful outcomes for complex work.

Rong Chen, Bo Li, Hu Xing, Yijing Wang
Finding Baby Mothers on Twitter

In this paper, we study the task of detecting mothers of babies on Twitter. This could help such users find friends, and help companies, organizations, or experts deliver accurately targeted information. Prior works have proposed supervised classification methods to detect generic latent attributes of Twitter users such as age, gender, and political orientation. However, methods and features for classifying generic attributes do not perform well for more specific attributes, such as whether a user is a mother of a young baby. We design feature sets based on followed accounts and profile pictures, which are largely overlooked in existing work. Compared with three established feature sets, the experimental evaluation shows that our specifically designed feature sets considerably improve classification accuracy.

Yihong Zhang, Adam Jatowt, Yukiko Kawai

Web User Interfaces

Frontmatter
An End-User Pipeline for Scraping and Visualizing Semi-Structured Data over the Web

The Web is a vast source of semi-structured datasets that are made readily available to support the construction of new knowledge. Information visualization techniques have been demonstrated to be a suitable alternative for allowing users to analyze and understand large amounts of data. However, the steps required for visualizing semi-structured data obtained from the Web are not straightforward, and the data require proper treatment before information visualization techniques can be applied. In this work, we present a visualization pipeline describing the fundamental operations required for visualizing semi-structured data over the Web. We employ Web scraping and Web augmentation techniques to support interactive visualizations and to solve tasks without changing the context of use of the data. Our approach is supported by a framework including scraping, augmentation, and visualization tools, and it has been applied to different kinds of websites to demonstrate its validity and feasibility. Our ultimate goal is to expand the limits of our technology, improving user interaction with websites and creating new experiences for a better understanding of large datasets.

Gabriela Bosetti, Sergio Firmenich, Marco Winckler, Gustavo Rossi, Ulises Cornejo Fandos, Előd Egyed-Zsigmond
DotCHA: A 3D Text-Based Scatter-Type CAPTCHA

We introduce a new type of 3D text-based CAPTCHA, called DotCHA, which relies on human interaction. DotCHA asks users to rotate a 3D text model to identify the correct letters. The 3D text model is a twisted form of sequential 3D letters around a center pivot axis, and it shows different letters depending on the rotation angle. The model is composed not of solid letters but of a number of spheres, in order to resist character segmentation attacks; this is why DotCHA is classified as a scatter-type CAPTCHA. DotCHA is also resistant to machine learning attacks because each letter is identifiable only from a particular direction. We demonstrate that DotCHA, while maintaining usability, is resistant to existing types of attacks.

Suzi Kim, Sunghee Choi
Entropy and Compression Based Analysis of Web User Interfaces

In our paper we explore whether users' visual perception of web user interfaces (WUIs) can be predicted by certain quantitative characteristics of WUI screenshots. The considered metrics are JPEG file size, PNG file size, and the information entropy value calculated with MATLAB's frequency-based entropy(I) function. We ran a survey with 70 subjects who provided subjective evaluations of complexity, aesthetics and orderliness for 497 website homepages. The results suggest that all three metrics were significant, and the proposed regression models were considerably better than the respective baseline models that only used the popular JPEG-based metric. Remarkably, the entropy metric had significant positive correlations with the aesthetics and orderliness evaluations, but not with the size of the image. We believe our findings might be used in the development of automated WUI analysis tools to aid web engineers in their work.

Egor Boychuk, Maxim Bakaev
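
The entropy metric follows MATLAB's frequency-based entropy(I), i.e., the Shannon entropy of the grayscale-intensity histogram of a screenshot. A numpy equivalent, assuming the screenshot is already a 2-D uint8 array, might look like this:

import numpy as np

def image_entropy(gray):
    # Shannon entropy (bits) of an 8-bit grayscale image histogram
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins, since 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

flat = np.full((100, 100), 128, dtype=np.uint8)  # uniform page region
noisy = np.random.default_rng(0).integers(0, 256, (100, 100), dtype=np.uint8)
print(round(image_entropy(flat), 2))   # 0.0  -> visually simple
print(round(image_entropy(noisy), 2))  # ~8.0 -> visually complex
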

Web Security and Privacy

Frontmatter
Domain Classifier: Compromised Machines Versus Malicious Registrations

In “phishing attacks”, phishing websites disguised as trustworthy websites attempt to steal sensitive information. Remediation and mitigation options differ depending on whether the phishing website is hosted on a legitimate but compromised domain, in which case the domain owner is also a victim, or whether the domain itself is maliciously registered. We accordingly tackle the important question of classifying known phishing sites as either compromised or maliciously registered. Since the recent adoption of GDPR standards put personal data off-limits, few of the criteria used in the relevant literature still satisfy those standards. We propose a machine-learning-based domain classifier, introducing nine novel features which exploit the internet presence and history of a domain, using only publicly available information. We evaluated our domain classifier on a corpus of phishing websites hosted on over 1,000 compromised domains and 10,000 malicious domains. In the randomized evaluation, the classifier achieved over 92% accuracy with under an 8% false positive rate, with compromised cases as the positive class. We have also collected over 180,000 phishing website instances over the past 3 years. Using our classifier, we show that 73% of the websites hosting attacks are compromised while the remaining 27% belong to the attackers.

Sophie Le Page, Guy-Vincent Jourdan, Gregor V. Bochmann, Iosif-Viorel Onut, Jason Flood
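
As a hedged sketch of the setup, features describing a domain's public internet presence and history feed a standard supervised model, with compromised domains as the positive class. The three features below are invented stand-ins; the paper defines nine specific features and does not necessarily use a random forest.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# columns: domain age (years), # indexed pages (log), has archived history
X = np.array([
    [12, 4.1, 1], [8, 3.5, 1], [15, 4.8, 1], [9, 3.9, 1],  # compromised
    [0, 0.3, 0], [0, 0.7, 0], [1, 1.1, 0], [0, 0.0, 0],    # malicious
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = compromised (positive class)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=4).mean())  # toy data, not the paper's corpus
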
The “Game Hack” Scam

Game Hack Scam (GHS) is a cyberattack in which the attacker attempts to convince the victim, often a child or a young adult, that they will be provided with free, unlimited resources or other advantages for their favorite game. To obtain these claimed advantages, the victims are asked to complete one or more tasks, called “offers”. These so-called offers include, but are not limited to, subscriptions to questionable services and installation of executable files on the victim’s device. Although recent research has provided important insights into different types of scam such as “Technical Support Scam”, “Survey Scam”, and “Romance Scam”, to the best of our knowledge GHS has not been studied up to now. In this paper, we report the first systematic study of GHS. We use a data-driven approach to investigate and gain knowledge on this type of scam: we formulated GHS-related search queries, and used multiple search engines to collect data about the websites to which GHS victims are directed when they search online for various game hacks and tricks. We analyze the collected data to provide new insight into GHS and investigate the extent of this scam. We show that GHS attackers abuse social media, streaming sites, blogs, and even unrelated sites such as change.org or researchgate.net to carry out their attacks and reach a large number of victims. We estimate that these attacks have been clicked close to 60 million times since mid-2014. Our data collection spans over nine months; over the last five months, we uncovered over 3,000 GHS domains and over 100 different offer domains. Furthermore, we find that GHS instances are on the rise, and so is the number of victims. Finally, in keeping with similar large-scale scam studies, we find that the current public blacklists are inadequate and suggest that our method is more effective at detecting these attacks.

Emad Badawi, Guy-Vincent Jourdan, Gregor Bochmann, Iosif-Viorel Onut, Jason Flood
Decentralized Service Registry and Discovery in P2P Networks Using Blockchain Technology

Decentralized information systems radically change the power dynamics of the Web by establishing participants as equal peers, which form a self-governing community. However, decentralized infrastructures currently do not offer a way for users to easily explore the available services in the network, nor the ability to securely verify their origin and history. In this contribution, we approach these challenges by exploiting the tamper-proofness of blockchain technology to build a decentralized service registry and discovery system for an existing decentralized microservice infrastructure. With this, users are able to find services in a network and to verify their integrity and origin. Our first evaluations show promising results for this kind of system in the domain of decentralized service provisioning, while also raising questions for future research in this field.

Peter de Lange, Tom Janson, Ralf Klamma

Web Programming

Frontmatter
Amalgam: Hardware Hacking for Web Developers with Style (Sheets)

Web programming technologies such as HTML, JavaScript, and CSS have become a popular choice for user interface design due to their capabilities: flexible interfaces, first-class networking, and available libraries. In parallel, driven by the standards set by mobile companies, embedded device manufacturers now want to replicate these capabilities. As a result, embedded devices that use web technologies for their graphical interface have started to emerge. However, the programming effort required to integrate web technologies with embedded software hinders their adoption. In this paper, we introduce Amalgam, a system that facilitates the development of embedded devices that use web programming technologies. Amalgam does this by translating the physical interface of embedded hardware components (e.g., a push button) directly into HTML and CSS syntax. Our system reduces the programming effort required to develop new embedded devices that use web technologies and adds interesting new capabilities to their design. We show Amalgam’s capabilities by exploring three embedded devices built using web programming technologies. We also demonstrate how Amalgam reduces programming effort by comparing two traditional approaches to building one of these devices against Amalgam. Results show our system reduces the lines of code required to integrate hardware elements into an embedded device application to one line of code per hardware component added to the device.

Jorge Garza, Devon J. Merrill, Steven Swanson
Jekyll RDF: Template-Based Linked Data Publication with Minimized Effort and Maximum Scalability

Over the last decades the Web has evolved from a human–human communication network to a network of complex human–machine interactions. An increasing amount of data is available as Linked Data, which allows machines to “understand” the data, but RDF is not meant to be understood by humans. With Jekyll RDF we present a method to close the gap between structured data and human-accessible exploration interfaces by publishing RDF datasets as customizable static HTML sites. It consists of an RDF resource mapping system to serve the resources under their respective IRIs, a template mapping based on schema classes, and a markup language to define templates for rendering customized resource pages. Using the template system, it is possible to create domain-specific browsing interfaces for RDF data next to the Linked Data resources. This enables content management and knowledge management systems to serve datasets in a highly customizable, low-effort, and scalable way, to be consumed by machines as well as humans.

Natanael Arndt, Sebastian Zänker, Gezim Sejdiu, Sebastian Tramp
On the Web Platform Cornucopia

The evolution of the Web browser has been organic, with new features introduced on a pragmatic basis rather than following a clear rational design. This evolution has resulted in a cornucopia of overlapping features and redundant choices for developing Web applications. These choices include multiple architecture and rendering models, different communication primitives and protocols, and a variety of local storage mechanisms. In this position paper we examine the underlying reasons for this historic evolution. We argue that without a sound engineering approach and some fundamental rethinking there will be a growing risk that the Web may no longer be a viable, open software platform in the long run.

Tommi Mikkonen, Cesare Pautasso, Kari Systä, Antero Taivalsaari

Web Services and Computing

Frontmatter
Linked USDL Extension for Cloud Services Description

Cloud computing has become the most influential paradigm in recent years, both in industry and academia. A Cloud provider delivers Cloud services to businesses or individuals. However, each Cloud provider uses its own techniques to describe its Cloud services. It is therefore difficult to compare Cloud offers and then provide the appropriate service to the user, especially since Cloud services can provide the same functionalities while differing in quality of service, price, Cloud characteristics, service credibility, and so on. The variety of these techniques is due to the lack of standardization in Cloud service description. To deal with such issues, we propose in this paper a Cloud service description ontology that assists the Cloud service publication, discovery, and selection processes. The proposed ontology extends the Linked USDL language, chosen for its expressiveness, to describe Cloud services by covering four aspects: technical, operational, business, and semantic.

Hajer Nabli, Raoudha Ben Djemaa, Ikram Amous Ben Amor
An Automatic Data Service Generation Approach for Cross-origin Datasets

As a unified data access model, data services have become a promising technique to integrate and share heterogeneous datasets. In order to publish overwhelming amounts of data on the web, a key challenge is to automatically extract and encapsulate data services from various datasets in cloud environments. In this paper, a novel data service generation approach for cross-origin datasets is proposed. An attribute dependency graph (ADG) is constructed from inherent data dependencies. Based on the ADG, an automatic data service extraction algorithm is implemented. The extracted atomic data services are further organized into another representation named the data service dependency graph (DSDG). Then, a data service encapsulation framework, which includes an entity layer, a data access object layer, and a service layer, is designed. Via a flexible RESTful service template, this framework can automatically encapsulate the extracted data services into RESTful services accessible through the exposed interfaces. In addition, a data service generation system has been developed. Experimental results show that the system achieves high efficiency and good quality in data service generation.

Yuanming Zhang, Langyou Huang, Jiawei Lu, Gang Xiao
Merging Intelligent API Responses Using a Proportional Representation Approach

Intelligent APIs, such as Google Cloud Vision or Amazon Rekognition, are becoming ever more pervasive and easily accessible to developers for building applications. Because of the stochastic nature of machine learning and the disparate datasets used in training, the outputs of different APIs vary over time, with low reliability in some cases when compared against each other. Merging multiple unreliable API responses from multiple vendors may increase the reliability of the overall response, and thus the reliability of the intelligent end-product. We introduce a novel methodology, inspired by the proportional representation used in electoral systems, to merge outputs of different intelligent computer vision APIs provided by multiple vendors. Experiments show that our method outperforms both naive merge methods and traditional proportional representation methods by 0.015 F-measure.

Tomohiro Ohtake, Alex Cummaudo, Mohamed Abdelrazek, Rajesh Vasa, John Grundy
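
As a hedged illustration of the electoral inspiration, the sketch below applies the classic D'Hondt seat-allocation rule to label lists returned by several vision APIs, so that each label enters the merged result in proportion to its accumulated confidence votes. The paper's actual merging algorithm differs in its details; the responses are invented.

from collections import defaultdict

def dhondt_merge(api_responses, seats=3):
    # api_responses: list of {label: confidence} dicts, one per vendor
    votes = defaultdict(float)
    for response in api_responses:
        for label, confidence in response.items():
            votes[label] += confidence
    allocated = defaultdict(int)
    merged = []
    for _ in range(seats):
        # next seat goes to the label with the highest quotient votes/(s+1)
        label = max(votes, key=lambda l: votes[l] / (allocated[l] + 1))
        allocated[label] += 1
        if label not in merged:
            merged.append(label)
    return merged

responses = [
    {"dog": 0.9, "animal": 0.8},
    {"dog": 0.7, "cat": 0.9},
    {"animal": 0.9, "pet": 0.5},
]
print(dhondt_merge(responses))  # -> ['animal', 'dog', 'cat'] on this input
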

Semantic Web and Linked Open Data Applications

Frontmatter
Analyzing the Evolution of Linked Vocabularies

Reusing terms results in a Network of Linked vOcabularies (NeLO), where the nodes are the vocabularies that use at least one term from some other vocabulary and thus depend on each other. These dependencies become a problem when vocabularies in the network change, e.g., when terms are deprecated or deleted. In these cases, all dependent vocabularies in the network need to be updated. So far, there has been no study that analyzes vocabulary changes in NeLO over time. To address this shortcoming, we compute the state of NeLO from the available versions of the vocabularies over 17 years. We analyze static parameters of NeLO such as its size, density, average degree, and the most important vocabularies at certain points in time. We further investigate how NeLO changes over time. Specifically, we measure the impact of a change in one vocabulary on others, how the reuse of terms changes, and how the importance of vocabularies changes. Our analyses provide for the first time in-depth insights into the structure and evolution of NeLO. This study helps ontology engineers to identify shortcomings in their data modeling and to assess the dependencies implied by reusing a specific vocabulary.

Mohammad Abdel-Qader, Iacopo Vagliano, Ansgar Scherp
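
The static parameters mentioned above (size, density, average degree) are standard directed-graph measures; for one snapshot of the vocabulary network they can be computed as below. The vocabulary names and reuse edges are placeholders.

def static_parameters(edges):
    # edges: (vocab_using, vocab_reused) pairs, i.e., directed term reuse
    nodes = {v for e in edges for v in e}
    n, m = len(nodes), len(edges)
    density = m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = m / n if n else 0.0  # average out-degree
    return {"size": n, "edges": m, "density": round(density, 3),
            "avg_degree": round(avg_degree, 3)}

snapshot_2010 = [("foaf", "rdf"), ("dcterms", "rdf"), ("sioc", "foaf")]
snapshot_2015 = snapshot_2010 + [("schema", "rdf"), ("sioc", "dcterms")]
for year, snap in (("2010", snapshot_2010), ("2015", snapshot_2015)):
    print(year, static_parameters(snap))
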
Comparison Matrices of Semantic RESTful APIs Technologies

Semantic RESTful APIs combine the power of the REST architectural style, the Semantic Web, and Linked Data. They picture a world in which Web APIs are easier to browse and more meaningful for humans while also being machine-interpretable, turning them into platforms that developers and companies can build on. We counted 36 technologies that target building such APIs. As there is no one-size-fits-all technology, they have to be combined, which makes selecting the appropriate set of technologies for a specific context a difficult task for architects and developers. So how can the selection of such a set of technologies be eased? In this paper we propose three comparison matrices of Semantic RESTful API enabling technologies, based on an analysis of the differences and commonalities between existing technologies. They are intended to help developers and architects make an informed decision on the technologies to use. They also highlight the limitations of state-of-the-art technologies, from which open challenges are derived.

Antoine Cheron, Johann Bourcier, Olivier Barais, Antoine Michel
Dragon: Decision Tree Learning for Link Discovery

The provision of links across RDF knowledge bases is regarded as fundamental to ensure that knowledge bases can be used jointly to address the real-world needs of applications. The growth of knowledge bases, both in number and size, demands the development of time-efficient and accurate approaches for the computation of such links. This is generally done with the aid of machine learning approaches, such as decision trees. While decision trees are known to be fast, they are generally outperformed in the link discovery task by the state of the art in terms of quality, i.e., F-measure. In this work, we present Dragon, a fast decision-tree-based approach that is both efficient and accurate. Our approach was evaluated by comparing it with state-of-the-art link discovery approaches as well as the common decision-tree-learning approach J48. Our results suggest that our approach achieves state-of-the-art performance with respect to F-measure while being 18 times faster on average than existing algorithms for link discovery on RDF knowledge bases. Furthermore, we investigate why Dragon significantly outperforms J48 in terms of link accuracy. We provide an open-source implementation of our algorithm in the LIMES framework.

Daniel Obraczka, Axel-Cyrille Ngonga Ngomo

Web Application Modeling and Engineering

Frontmatter
Catch & Release: An Approach to Debugging Distributed Full-Stack JavaScript Applications

Localizing bugs in distributed applications is complicated by the potential presence of server/middleware misconfigurations and intermittent network connectivity. In this paper, we present a novel approach to localizing bugs in distributed web applications, targeting the important domain of full-stack JavaScript applications. The debugged application is first automatically refactored to create its semantically equivalent centralized version by gluing together the application’s client and server parts, thus separating the programmer-written code from configuration/environmental issues as suspected bug causes. The centralized version is then debugged to fix various bugs. Finally, based on the bug-fixing changes of the centralized version, a patch is automatically generated to fix the original application source files. We show how our approach can be used to catch bugs that include performance bottlenecks and memory leaks. These results indicate that our debugging approach can ease the challenges of localizing and fixing bugs in web applications.

Kijin An, Eli Tilevich
Multi-device Adaptation with Liquid Media Queries

The design of responsive Web applications is traditionally based on the assumption that they run on a single client at a time. Thanks to CSS3 media queries, developers can declaratively specify how the Web application UI adapts to the capabilities of specific devices. As users own more and more devices and attempt to use them to run Web applications in parallel, we propose to extend CSS media queries so that they can be used to adapt the UI of liquid Web applications while they are dynamically deployed across multiple devices. In this paper we present our extension of CSS media queries with liquid-related types and features, making it possible to detect the number of devices connected, the number of users running the application, or the role played by each device. The liquid media query types and features defined in this paper are designed for liquid component-based Web architectures, and they enable developers to control the deployment of individual Web components across multiple browsers. Furthermore, we show the design of liquid media queries in the Liquid.js for Polymer framework and propose different adaptation algorithms. Finally, we showcase the expressiveness of the liquid media queries on real-world examples and evaluate the algorithmic complexity of our approach.

Andrea Gallidabino, Cesare Pautasso
Conversational Data Exploration

This paper presents a framework for the design of chatbots for data exploration. With respect to conversational virtual assistants (such as Amazon Alexa or Apple Siri), this class of chatbots exploits structured input to retrieve data from known data sources. The approach is based on a conceptual representation of the available data sources, and on a set of modeling abstractions that allow designers to characterize the role that key data elements play in the user requests to be handled. Starting from the resulting specifications, the framework then generates a conversation for exploring the content exposed by the considered data sources.

Nicola Castaldo, Florian Daniel, Maristella Matera, Vittorio Zaccaria
Distributed Intelligent Client-Centric Personalisation

Personalisation is used extensively to improve user engagement, optimise user experience, and enhance marketing and advertising online. While privacy has always been an issue in personalised websites, only recently have we seen a noticeable change in consumers’ behaviour. Users are seeing breaches of the personal information harvested, stored, and shared by content providers and are increasingly adjusting privacy controls, negatively impacting the effectiveness of personalisation services. Client-side personalisation (CSP) approaches offer a privacy-conscious solution, keeping the user data and user model on the client’s own device and allowing users to enjoy personalised content without compromising the privacy of their personal data. However, these solutions have significant problems with scalability and performance due to client-device resource limitations. With an ever-increasing demand for rich multimedia, particularly on more lightweight mobile devices, performance is critical to providing a seamless user experience. This research proposes a hybrid approach which we term Intelligent Client-Centric Personalisation (ICCP), which minimises the leakage of user data while enhancing performance through predictive webpage prefetching. This paper performs a comparative framework evaluation, comparing the ICCP framework’s performance with a typical client-server personalisation approach. It uses a large dataset of user interactions across three contrasting consumer websites, following a case-study-based methodology. The evaluation shows that such a framework can realise the performance benefits of a client-server approach but with enhanced privacy and reduced personal data leakage.

Rebekah Storan Clarke, Vincent Wade

Demonstrations

Frontmatter
Webifying Heterogeneous Internet of Things Devices

Internet of Things (IoT) applications incorporate heterogeneous smart devices that support different communication protocols (Zigbee, RFID, Bluetooth, custom protocols). Enabling application development across different protocols requires interoperability between the different types of heterogeneous devices that co-exist in the IoT ecosystem. In this paper we propose the WoTDL2API tool, which automatically generates a running RESTful API based on the popular OpenAPI specification and integrates with the existing OpenAPI code generation toolchain. This solution provides interoperability by wrapping IoT devices with a Web-based interface, enabling easier integration with other platforms. We showcase our approach using a smart home scenario available online.

Mahda Noura, Sebastian Heil, Martin Gaedke
VR-Powered Scenario-Based Testing for Visual and Acoustic Web of Things Services

Web of Things (WoT) services are Web services that interact with physical things in the environment. Testing of WoT services should be performed considering the physical and human factors that affect their quality. Scenario-based testing is known to be one of the most effective testing techniques by which we can test software while considering various real-world scenarios. However, applying scenario-based testing to real-world WoT testbed environments is not practical in terms of cost and reconfigurability. In this work, we utilize Virtual Reality (VR) technology to mimic real-world WoT environments for cost-effective testing over various scenarios.

KyeongDeok Baek, HyeongCheol Moon, In-Young Ko

Posters

Frontmatter
User’s Emotional eXperience Analysis of Wizard Form Pattern Using Objective and Subjective Measures

Forms are the ordinary medium for collecting data from prospective users and indirectly building a cordial relationship with them. This communication bridge can affect the user’s emotional reaction whenever the user encounters an unexpected error while filling in or submitting the form. This paper presents an empirical study of user emotional eXperience with the wizard form pattern (multi-step form). The study uses both objective measures, through brain wave activity (EEG) and eye tracking data, and subjective measures, through self-reported metrics. Fifteen participants (N = 15) joined the experiment by filling in the wizard form. We manipulated the experiment by generating a sudden error at one step and grouped the experiments by step number. We observe that the error affects the motivational emotion of group 1 (error on the first step), the excitement emotion of group 2 (error on the second step), and the frustration emotion of group 3 (error on the third step) and group 4 (no error). We thus argue that an error while filling in or submitting a form is more an emotional issue than a technical one.

Muhammad Zaki Ansaar, Jamil Hussain, Asim Abass, Musarrat Hussain, Sungyoung Lee
Integration Platform for Metric-Based Analysis of Web User Interfaces

We present a software tool for collecting web UI metrics from different providers and integrating them in a single database for further analysis. The platform’s architecture supports both code- and image-based UI assessment, thus allowing the advantages of the two approaches to be combined. The data structures are based on a web UI measurement domain ontology (OWL) that organizes the currently dispersed set of metrics and services. Our platform can be of use to interface designers, researchers, and developers of UI analysis tools.

Maxim Bakaev, Sebastian Heil, Nikita Perminov, Martin Gaedke
Personal Information Controller Service (PICS)

This paper presents an overview of the PICS project (Personal Information Controller Service), which is concerned with personal data protection. More specifically, we present a software platform that allows users to control the exchanges between Web-based Personal Information Management Systems (the so-called PIMS that store users’ personal data) and SaaS services (such as e-commerce applications) using reinforced authentication. The ultimate goal of this platform is to empower users by allowing them to have full control over personal data exchange. Moreover, the platform includes specific components to help users solve cognitively demanding tasks related to data protection, such as properly interpreting the Terms of Service (ToS) imposed by the SaaS, recalling previous interactions with the SaaS (e.g., personal data exchanged with the SaaS and the corresponding terms of service), and detecting unauthorized use of personal data. The technical solution proposed by PICS is a suitable implementation of the General Data Protection Regulation (GDPR). We present the motivations, challenges, and research questions that led to the technical solution proposed by PICS.

Marco Winckler, Laurent Goncalves, Olivier Nicolas, Frédérique Biennier, Hind Benfenatki, Thierry Despeyroux, Nourhène Alaya, Alex Deslée, Mbaye Fall Diallo, Isabelle Collin-Lachaud, Gautier Ubersfeld, Christophe Cianchi
Enabling the Interconnection of Smart Devices Through Semantic Web Techniques

Nowadays, there are millions of devices connected to the Internet. This is what we know as the Internet of Things. The integration of these smart devices with web protocols makes them more accessible and understandable by people. The purpose of these devices is to make people’s lives easier. Thanks to collaboration between devices, the possibilities that the Web of Things offers can be exploited even further. However, many manufacturers develop their own devices and protocols in order to protect their market share, limiting in many ways the collaboration between devices of different manufacturers. This paper presents a solution based on semantic web techniques with the purpose of achieving collaboration between devices regardless of the technologies and protocols developed by their manufacturers.

Daniel Flores-Martin, Javier Berrocal, José García-Alonso, Carlos Canal, Juan M. Murillo

PhD Symposium

Frontmatter
Content- and Context-Related Trust in Open Multi-agent Systems Using Linked Data

In open multi-agent systems, linked data enables agents to communicate with each other and to gather knowledge for autonomous decisions. Until now, trust has been a factor only for starting communications, ignoring doubts about the content or context of ongoing communications. Several approaches are used to identify whom to trust and to model human trust computationally, yet they do not consider changes of context or of other agents’ behavior at runtime. The proposed doctoral work aims to support content- and context-related trust in open multi-agent systems using linked data. Existing trust models need to be surveyed with respect to content- and context-related trust. A framework based on a fitting trust model and working with linked data must be developed to establish and dynamically refine trust relationships from the autonomous agents’ point of view. This would enhance the applicability of decentralized systems without introducing central units, as the history of the web demonstrates. Web engineers are thereby supported in working at a new level of abstraction, exploiting decentralization without scrutinizing specific communication sequences.

Valentin Siegert
Facilitating the Evolutionary Modifications in Distributed Apps via Automated Refactoring

Actively used software applications must be changed continuously to ensure their utility, correctness, and performance. To perform these changes, programmers spend a considerable amount of time and effort pinpointing the exact locations in the code to modify, a particularly hard task for distributed applications, in which server/middleware misconfigurations and network volatility often cause performance and correctness problems. My dissertation research puts forward a novel approach to facilitating the evolutionary modification of distributed applications, built around an automated refactoring called Client Insourcing. This refactoring transforms a distributed application into its semantically equivalent centralized variant, in which the remote parts are glued together and communicate with each other by means of regular function calls, eliminating middleware, server, and network-related problems from the list of potential causes. Programmers can then use the resulting centralized variant to facilitate debugging, security enhancements, and fault-tolerance adaptations. (Some of the preliminary work of this dissertation is described in a paper accepted for presentation in the main technical program of ICWE 2019 [4].)

Kijin An
Effect-Driven Selection of Web of Things Services in Cyber-Physical Systems Using Reinforcement Learning

Recently, the Web of Things (WoT) has expanded its boundary to Cyber-Physical Systems (CPS) that actuate on or sense physical environments. However, there is no quantitative metric to measure the quality of the physical effects generated by WoT services, nor a dynamic service selection algorithm that can replace services with alternative ones to manage the quality of service provisioning. In this work, we study how to measure the effectiveness of delivering various types of WoT service effects to users, and we develop a dynamic service handover algorithm using reinforcement learning to ensure the consistent provision of WoT services under dynamically changing conditions caused by user mobility and the changing availability of WoT media that deliver service effects. Preliminary results show that a simple distance-based metric is insufficient for selecting appropriate WoT services in terms of the effectiveness of delivering service effects to users, and that the reinforcement-learning-based algorithm learns the optimal selection policy well from simulated experiences in WoT environments.

KyeongDeok Baek, In-Young Ko
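
A minimal tabular sketch of the reinforcement-learning idea, using a simplified one-step (undiscounted) update: the state is the user's current zone, an action picks the WoT device that delivers the effect, and the reward encodes delivery quality. The zones, devices, and rewards are invented stand-ins for the simulated WoT environment mentioned in the abstract.

import random
from collections import defaultdict

zones, devices = [0, 1, 2], ["speaker_A", "speaker_B"]
reward = {(0, "speaker_A"): 1.0, (1, "speaker_A"): 0.4,
          (1, "speaker_B"): 0.6, (2, "speaker_B"): 1.0}  # delivery quality

Q = defaultdict(float)
rng, alpha, epsilon = random.Random(1), 0.1, 0.2
zone = 0
for _ in range(5000):
    if rng.random() < epsilon:                       # explore
        action = rng.choice(devices)
    else:                                            # exploit
        action = max(devices, key=lambda d: Q[(zone, d)])
    r = reward.get((zone, action), 0.0)
    Q[(zone, action)] += alpha * (r - Q[(zone, action)])  # one-step update
    zone = rng.choice(zones)                         # user moves between zones

policy = {z: max(devices, key=lambda d: Q[(z, d)]) for z in zones}
print(policy)  # expected: zone 0 -> speaker_A, zone 2 -> speaker_B
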
Liquid Web Architectures

Nowadays users access the Web differently from how they used to in the past: the devices we use to fetch applications from the Web are no longer the slow desktop computers we owned twenty years ago. The Web can be accessed by devices with different sizes and capabilities, ranging from desktop and laptop computers to tablets and phones. More recently, smart and embedded devices, such as smart televisions, smart watches, or parts of smart cars, are also able to communicate with remote Web servers. The average number of Web-enabled devices owned by a single user has increased as well, and the connected user usually accesses the Web with multiple devices concurrently.

Web applications are traditionally designed with a server-centric architecture in mind, whereby the whole persistent data, dynamic state, and logic of the application are stored and executed on the Web server. The client device running the Web browser traditionally only renders pre-computed views fetched from the server. As more data, state, and computation are shifted to the client, it becomes more challenging to run Web applications across multiple devices while ensuring they can synchronize their state and react in real time to changes in the set of available devices.

In this symposium we define how we apply the liquid software paradigm to the design of liquid Web applications, and we identify and address the challenges of creating multi-device liquid user experiences. We discuss why research on liquid software running on the Web is important. We also present our prototype framework, Liquid.js for Polymer, whose goal is to simplify the creation of liquid Web applications.

Andrea Gallidabino

Tutorials

Frontmatter
Exploiting Side Information for Recommendation

Recommender systems have become essential tools to help resolve the information overload problem for users. However, traditional recommendation techniques suffer from critical issues such as data sparsity and cold start. To address these issues, a great number of recommendation algorithms have been proposed that exploit side information. This tutorial provides a comprehensive analysis of how to exploit various kinds of side information to improve recommendation performance. Specifically, we present the usage of side information from two perspectives: representation and methodology. Through this tutorial, recommender systems researchers will gain an in-depth understanding of how side information can be utilized for better recommendation performance.

Qing Guo, Zhu Sun, Yin-Leng Theng
Deep Learning-Based Sequential Recommender Systems: Concepts, Algorithms, and Evaluations

What is sequential recommendation? What challenges do traditional sequential recommendation models face? How can these challenges be addressed using advanced deep learning (DL) techniques? What factors affect the performance of a DL-based sequential recommendation system? And how can these factors be utilized to improve DL models? In this tutorial, we carefully answer these questions by combining DL techniques with sequential recommendation, and provide a comprehensive overview of DL-based sequential recommender systems. Specifically, we propose a novel classification framework for sequential recommendation tasks, with which we systematically introduce representative DL-based algorithms for different sequential recommendation scenarios. We further summarize the potentially influential factors of DL-based sequential recommendation and thoroughly demonstrate their effects via a carefully designed experimental framework, which will be of great help to future research.

Hui Fang, Guibing Guo, Danning Zhang, Yiheng Shu
Architectures Server-Centric vs Mobile-Centric for Developing WoT Applications

The massive adoption of smart devices has fostered the development of Web of Things (WoT) applications. Due to the limited capabilities of these devices (some are battery powered, or their data exchange is limited), these applications have very stringent requirements. The success or failure of these applications largely depends on how they address those requirements, with resource consumption being a crucial one. Our experience has shown us that different architectural styles can produce similar behaviour, but the selected style directly impacts resource consumption. Over the last few years, different frameworks, tools, and activities have been proposed to estimate this consumption in early development phases in order to guide the decision-making process. However, they have still not been adopted by industry and researchers. This tutorial delves into the different architectural styles that can be applied and the tools that can be used to estimate their consumption early.

Javier Berrocal, Jose Garcia-Alonso, Juan Manuel Murillo
Powerful Data Analysis and Composition with the UNIX-Shell

In addition to the wide range of commercially available data processing tools for data analysis and knowledge discovery, there is a bundle of Unix-shell scripting and text processing tools practically available on every computer. This paper reports on some of these data processing tools and presents how they can be used together to manipulate and transform data, as well as to perform analyses such as aggregation. Besides being freely available, these tools have the advantage that they can be used immediately, without first transforming and loading the data into a target system. Another important point is that they are typically stream-based; thus, huge amounts of data can be processed without running out of main memory.

Andreas Schmidt, Steffen Scholz
Non-monotonic Reasoning on the Web

In this tutorial we describe approaches to non-monotonic reasoning as a means for inference on the web. In particular, we focus on the ways in which reasoning technologies have adapted to five issues of the modern world wide web: (a) epistemic aspects, bound by the new models of the social web; (b) changes over time; (c) language variants, including the different languages in which a web site is deployed; (d) agent-based knowledge deployment, due to social networks and blogs; and (e) dialogue aspects, again introduced by blogs and social networks. The presentation covers these aspects from a technical viewpoint, including the introduction of specific knowledge-driven methods. The technical content is presented within a general logical framework known as defeasible logic.

Matteo Cristani
Backmatter
Metadata
Title
Web Engineering
Edited by
Dr. Maxim Bakaev
Flavius Frasincar
Dr. In-Young Ko
Copyright Year
2019
Electronic ISBN
978-3-030-19274-7
Print ISBN
978-3-030-19273-0
DOI
https://doi.org/10.1007/978-3-030-19274-7