2022 | Book

Web Engineering

22nd International Conference, ICWE 2022, Bari, Italy, July 5–8, 2022, Proceedings

Edited by: Prof. Tommaso Di Noia, Dr. In-Young Ko, Dr. Markus Schedl, Carmelo Ardito

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this Book

This book constitutes the thoroughly refereed proceedings of the 22nd International Conference on Web Engineering, ICWE 2022, held in Bari, Italy, in July 2022.

The 23 revised full papers and 5 short papers presented were carefully reviewed and selected from 81 submissions. The book also contains 6 demonstration and poster papers, 7 Ph.D. symposium papers, and 5 tutorial papers. They are organized in topical sections named: recommender systems based on web technology; social web applications; web applications modelling and engineering; web big data and web data analytics; web mining and knowledge extraction; web security and privacy; web user interfaces.

Table of Contents

Frontmatter

Recommender Systems Based on Web Technology

Frontmatter
MARF: User-Item Mutual Aware Representation with Feedback

As deep learning (DL) technologies have developed rapidly, many new techniques have become available for recommender systems. Yet, there is very little research addressing how users' feedback on particular items (such as ratings) can affect recommendations. This feedback can assist in building more fine-grained user profiles, as not all raw clicks truly reflect a user's preference. The challenge of encoding such records, which are typically prohibitively long, has also prevented research from using the whole click history to learn representations. To address these challenges, we propose MARF, a novel model for click prediction. Specifically, we construct fine-grained user representations (by considering both the multiple items browsed and the user's feedback on them) and item representations (by considering browsing histories from multiple users and their feedback). Moreover, a flexible up-down strategy is designed to avoid loading incomplete or overloaded historical information by selecting representative users/items based on their feedback records. A comprehensive evaluation on three large-scale real-world benchmark datasets shows that MARF significantly outperforms a variety of state-of-the-art solutions. Furthermore, the MARF model is evaluated through an ablation study that validates the contribution of each component. As a final demonstration, we show how MARF can be used for cross-domain recommendation.

Qinqin Wang, Khalil Muhammad, Diarmuid O'Reilly-Morgan, Barry Smyth, Elias Tragos, Aonghus Lawlor, Neil Hurley, Ruihai Dong
MRVAE: Variational Autoencoder with Multiple Relationships for Collaborative Filtering

Variational Autoencoder (VAE)-based collaborative filtering (VAE-based CF) methods have shown their effectiveness in top-N recommendation. Mult-VAE is one such method that achieves state-of-the-art performance. A multinomial likelihood and an additional hyperparameter β on the KL divergence term, controlling the strength of regularization, make Mult-VAE a strong baseline. However, Mult-VAE uses non-linear MLPs as its encoder and decoder, which boosts performance on dense datasets but, in our experiments, degrades it on sparse datasets. While recent studies shed light on non-linearity for modeling the relationships between users and items, they ignore the importance of linearity between users and items, especially on sparse datasets. To bridge the gap and capture both linear and non-linear user-item relationships, we design a hybrid encoder that incorporates both, together with a linear decoder for VAE-based CF, which achieves competitive performance on both sparse and dense datasets. Moreover, most VAE-based CF methods only consider the relationships between users and items, ignoring the relationships between items. To overcome this limitation, we incorporate item-item relationships into VAE-based CF with the help of the cosine similarity between items. Unifying these relationships into VAE-based CF yields our proposed method, the Variational Autoencoder with Multiple Relationships (MRVAE) for collaborative filtering. Extensive experiments on several dense and sparse datasets show the effectiveness of MRVAE.
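To make the role of the β hyperparameter concrete, here is a minimal plain-Python sketch (an illustrative stand-in, not the authors' code) of the Mult-VAE-style objective that MRVAE builds on: a multinomial log-likelihood plus a β-weighted KL regularizer.

```python
import math

def mult_vae_loss(x, logits, mu, logvar, beta=0.2):
    """Negative ELBO with multinomial likelihood and beta-weighted KL term.
    x: per-item click counts; logits: decoder outputs; mu/logvar: encoder output."""
    # multinomial log-likelihood: sum_i x_i * log softmax(logits)_i
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    log_lik = sum(xi * (li - log_norm) for xi, li in zip(x, logits))
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form
    kl = 0.5 * sum(math.exp(lv) + mi * mi - 1.0 - lv for mi, lv in zip(mu, logvar))
    return -log_lik + beta * kl  # beta < 1 relaxes the regularization

print(mult_vae_loss(x=[1, 0, 2], logits=[0.5, -1.0, 1.2], mu=[0.1, -0.2], logvar=[-0.1, 0.3]))
```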

Zhou Pan, Wei Liu, Jian Yin
Multilevel Feature Interaction Learning for Session-Based Recommendation via Graph Neural Networks

Predicting users' actions based on anonymous sessions is a challenging problem due to the uncertainty of user behavior and the limited information available. Recent advances in graph neural networks (GNNs) have led to a promising approach for addressing this problem. However, existing methods have three major issues. First, they are incapable of modeling the transitions between inconsecutive items. Second, they cannot learn cross-feature interactions when learning the item relationships. Third, very few models can exploit improvements in embedding quality to improve recommendation performance. To address these issues, we propose a novel model named Multilevel Feature Interaction Learning (MFIL) that effectively learns item and session representations using GNNs. By leveraging item side information, e.g., brands and categories, MFIL can model transitions between inconsecutive items in the session graph (session level). We further design hierarchical structures to learn the feature interactions, which is effective for estimating the importance weights of different neighboring items in the global graph (global level). In addition, an effective learning strategy is employed to enhance MFIL's capability, and it performs better than classic regularization methods. Extensive experiments conducted on real-world datasets demonstrate that MFIL significantly outperforms existing state-of-the-art graph-based methods.

Ming He, Tianshuo Han, Tianyu Ding

Social Web Applications

Frontmatter
A Real-Time System for Detecting Landslide Reports on Social Media Using Artificial Intelligence

This paper presents an online system that leverages social media data in real time to identify landslide-related information automatically using state-of-the-art artificial intelligence techniques. The designed system can (i) reduce information overload by eliminating duplicate and irrelevant content, (ii) identify landslide images, (iii) infer the geolocation of the images, and (iv) categorize the user type (organization or person) of the account sharing the information. The system was deployed online in February 2020 at https://landslide-aidr.qcri.org/landslide_system.php to monitor the live Twitter data stream and has been running continuously since then to provide time-critical information to partners such as the British Geological Survey and the European Mediterranean Seismological Centre. We trust this system can both contribute to the harvesting of global landslide data for further research and support global landslide maps to facilitate emergency response and decision making.

Ferda Ofli, Umair Qazi, Muhammad Imran, Julien Roch, Catherine Pennington, Vanessa Banks, Remy Bossu
Online Social Event Detection via Filtering Strategy Graph Neural Network

Nowadays, as a strongly time-dependent data type, the ubiquity of social media messages enables the detection and analysis of real-time events. By clustering online posts according to their topics, existing methods can quickly identify current trends on social media, which helps discover marketing opportunities, prevent potential crises, etc. However, due to the diversity of social network users, the performance of current approaches is significantly affected by the long tail of random topics, which should be regarded as outliers in a clustering problem. Moreover, current models are weak at detecting events that last for multiple days, which is common in real-world scenarios. Therefore, we propose FS-GNN, a graph neural network based on a filtering strategy, for incremental social event detection in data streams. Our method uses heterogeneous information networks (HINs) to construct a social message graph, and we propose a centrality-based scoring mechanism to grade and filter noisy data before clustering. In addition, a message complement window is introduced to connect the same topic mentioned across multiple days for better clustering accuracy. Extensive experimental results demonstrate the superiority of FS-GNN over multiple baselines in both offline and online scenarios.

Lifu Chen, Junhua Fang, Pingfu Chao, An Liu, Pengpeng Zhao
Similarity Search with Graph Index on Directed Social Network Embedding

Similarity search on directed social networks (DSNs) helps users find the K nearest neighbors (KNNs). Graph-index-based similarity search does not have to compare the query node against every node in the DSN, since the neighbor relationships of the nodes are captured by the edges. Nevertheless, the performance of similarity search is still unsatisfactory, for instance not supporting end-to-end search or taking unnecessary detours. In this paper, we propose Graph Index on Directed Social Network Embedding (GI-DSNE) to facilitate approximate KNN search on DSNs. First, DSNE is proposed to embed the DSN into a low-dimensional vector space, yielding embeddings for efficient calculation of similarities along the search path. Then, the nearest-neighbor-descent algorithm is adopted to compute the KNN graph. Subsequently, to construct the graph index efficiently, we propose a direction-guided strategy for edge selection, a maximum out-degree for GI-DSNE, and a depth-first-search tree that guarantees the connectivity of GI-DSNE. Experimental results show that our proposed method outperforms state-of-the-art competitors in both execution time and precision.
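For readers unfamiliar with graph-index search, the following sketch shows a generic greedy best-first traversal over a proximity-graph index (the general family GI-DSNE belongs to, not the paper's actual algorithm): the query is answered by walking out-edges toward ever-closer nodes rather than scanning every node.

```python
import heapq

def greedy_knn_search(graph, vectors, dist, query, entry, k=10, ef=50):
    """Greedy best-first search over a proximity-graph index.
    graph: node -> list of out-neighbors; dist: e.g. Euclidean or cosine distance."""
    visited = {entry}
    d0 = dist(vectors[entry], query)
    frontier = [(d0, entry)]          # min-heap of candidates to expand
    best = [(-d0, entry)]             # max-heap (negated) of current results
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break                     # frontier cannot improve the result set
        for nb in graph[node]:        # follow directed out-edges
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(vectors[nb], query)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(frontier, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-nd, n) for nd, n in best)[:k]
```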

Zhiwei Qi, Kun Yue, Liang Duan, Zhihong Liang

Web Application Modelling and Engineering

Frontmatter
An In-Depth Analysis of Web Page Structure and Efficiency with Focus on Optimization Potential for Initial Page Load

Web pages are nowadays usually built with a variety of different tools, frameworks, and generated code. The structure and size of the resulting HTML, CSS, and JavaScript code highly influence the time for page load and the related energy consumption. However, no large-scale baseline data exists about the efficiency of the resulting page code, e.g., what amount of the code is actually used, or whether code parts must be render-blocking. Furthermore, existing examinations analyze page code structure but do not investigate the potential impact on code efficiency if parts of the code were optimized. In this paper, the top 10,000 web pages worldwide according to the Tranco list were analyzed in-depth. Aspects with the highest impact on structure or performance are evaluated in detail and set into context regarding used techniques, frameworks, code efficiency, and differences between the delivered desktop and mobile versions. The results showed that the vast majority (over 70% for JavaScript and ≈90% for CSS) of externally loaded resources are loaded as render-blocking code. On average, only ≈40% of render-blocking JavaScript and ≈15% of CSS are used until page render, which unveils a significant potential for performance improvements on most analyzed websites.
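The notion of render-blocking resources can be made concrete with a small heuristic audit. This sketch (our illustration, not the paper's measurement pipeline) applies the standard browser rules: external scripts block parsing unless marked async/defer/module, and stylesheets block rendering unless their media query cannot match.

```python
from html.parser import HTMLParser

class RenderBlockingAudit(HTMLParser):
    """Heuristic sketch: classify external JS/CSS references as render-blocking."""
    def __init__(self):
        super().__init__()
        self.blocking, self.deferred = [], []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and "src" in a:
            # classic external scripts block HTML parsing unless async/defer/module
            deferred = "async" in a or "defer" in a or a.get("type") == "module"
            (self.deferred if deferred else self.blocking).append(a["src"])
        elif tag == "link" and a.get("rel") == "stylesheet":
            # stylesheets block first render unless their media query cannot match
            deferred = a.get("media") == "print"
            (self.deferred if deferred else self.blocking).append(a.get("href"))

audit = RenderBlockingAudit()
audit.feed('<script src="app.js"></script><script defer src="x.js"></script>')
print(audit.blocking, audit.deferred)   # ['app.js'] ['x.js']
```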

Lucas Vogel, Thomas Springer
Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition

Information extraction, transformation and loading (abbreviated as ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as "Dexi.io" or "Import.io" allow users to specify where to fetch a page and what information or data is to be extracted from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as either "NEXT", "PAGE" or "Other", where the first two guide the system to find pages similar to the seed URL. For multilingual support, we exploit the attribute contents in the links as well as Language-Agnostic SEntence Representations (LASER) for anchor-text embedding. The experimental results show that the proposed model achieves an average micro F1 score of 0.834 and macro F1 score of 0.818 on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 min.
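To illustrate what a pagination-recognition model consumes, here is a toy feature extractor (hypothetical names and heuristics; the paper instead feeds LASER sentence embeddings and attribute contents into a neural sequence labeler) for judging whether a link looks like a "NEXT" or "PAGE" candidate.

```python
import re

def link_features(anchor_text, href, dom_position):
    """Toy features for labeling a link NEXT / PAGE / Other (heuristic stand-in;
    the paper's model learns these cues from LASER embeddings instead)."""
    text = anchor_text.strip().lower()
    return {
        "is_digit": text.isdigit(),                          # "2", "3", ... -> PAGE
        "is_next_word": text in {"next", "more", ">", ">>"}, # language-dependent cue,
                                                             # which LASER avoids
        "has_page_param": bool(re.search(r"[?&](page|p|pg|offset|start)=\d+", href)),
        "dom_position": dom_position,                        # pagination sits late in the DOM
    }

print(link_features("Next", "/events?page=2", dom_position=0.95))
```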

Chia-Hui Chang, Cheng-Ju Wu, Tzu-Ping Lin
Disclosure: Efficient Instrumentation-Based Web App Migration for Liquid Computing

Web app migration means capturing a snapshot of the execution state of a web app on one device and restoring it on another device to continue its execution, for cross-device liquid computing. Although web apps are relatively easy to migrate due to their high portability, a JavaScript language feature called closure complicates migration, since it requires migrating the variable states of already-finished outer functions. One approach to web app migration is to instrument the source code to trace the closure variables, yet this often suffers from performance slowdown, especially for multiple migrations. In this paper, we propose a new instrumentation-based technique called Disclosure, which moves the declarations of closure variables to a managed data structure and replaces closure variables with the corresponding references to that data structure. This improves runtime performance while enhancing security. We evaluated our work with eight Octane benchmarks and four real web apps. The runtime performance penalty due to Disclosure is 0%–15%, much better than the result of the latest instrumentation-based work that, like Disclosure, supports deep closures and multiple migrations. Real web apps are also shown to migrate seamlessly, even multiple times.
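The closure-hoisting idea translates directly into other languages with closures. The following Python analogue (illustrative only; Disclosure itself rewrites JavaScript source) shows how moving closure variables into a managed store makes their state snapshot-able for migration.

```python
# Managed store for hoisted closure state: {scope_id: {var_name: value}}.
# Unlike variables captured in a closure, this dict can be serialized,
# shipped to another device, and re-bound there.
CLOSURE_STORE = {}

def make_counter_original():
    count = 0                       # closure variable: invisible to a snapshotter
    def inc():
        nonlocal count
        count += 1
        return count
    return inc

def make_counter_instrumented(scope_id):
    CLOSURE_STORE[scope_id] = {"count": 0}   # declaration moved to the store
    def inc():
        s = CLOSURE_STORE[scope_id]          # variable access becomes a store lookup
        s["count"] += 1
        return s["count"]
    return inc

counter = make_counter_instrumented("ctr1")
counter(); counter()
print(CLOSURE_STORE)   # {'ctr1': {'count': 2}} -> serializable app state
```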

Jae-Yun Kim, Soo-Mook Moon
Enriching Scholarly Knowledge with Context

Leveraging a GraphQL-based federated query service that integrates multiple scholarly communication infrastructures (specifically, DataCite, ORCID, ROR, OpenAIRE, Semantic Scholar, Wikidata and Altmetric), we develop a novel web-widget-based approach for the presentation of scholarly knowledge with rich contextual information. We implement the proposed approach in the Open Research Knowledge Graph (ORKG) and showcase it with three kinds of widgets. First, we devise a widget for the ORKG paper view that presents contextual information about related datasets, software, project information, topics, and metrics. Second, we extend the ORKG contributor profile view with contextual information including authored articles, developed software, linked projects, and research interests. Third, we advance ORKG comparison faceted search by introducing contextual facets (e.g. citations). As a result, the devised approach enables presenting ORKG scholarly knowledge flexibly enriched with contextual information sourced in a federated manner from numerous technologically heterogeneous scholarly communication infrastructures.

Muhammad Haris, Markus Stocker, Sören Auer
FAIRification of Citizen Science Data Through Metadata-Driven Web API Development

Citizen Science (CS) implies a collaborative process that encourages citizens to collect data in CS projects and platforms. Unfortunately, these CS initiatives follow neither metadata nor data-sharing standards, which hampers their discoverability and reusability. To improve this scenario, it is crucial for CS to consider the FAIR (Findability, Accessibility, Interoperability and Reusability) guidelines. Therefore, this paper defines a FAIRification process (i.e. making CS initiatives more FAIR compliant) which maps metadata of CS platforms' catalogues to DCAT and generates Web Application Programming Interfaces (APIs), improving CS data discoverability and reusability in an integrated approach. An experiment on a CS platform with different CS projects shows the performance and suitability of our FAIRification process. Specifically, the DCAT metadata generated by our FAIRification process was validated with a standard SHACL validator, which emphasises how the process could help CS projects become more FAIR compliant.
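To give a flavor of the metadata mapping step, here is a hypothetical sketch (field names on the input side are illustrative; the paper's actual mapping rules are richer): a CS project record becomes a DCAT dataset description in JSON-LD, with the generated Web API exposed as a distribution.

```python
import json

def project_to_dcat(p):
    """Map a CS project record (assumed schema) to a DCAT Dataset in JSON-LD."""
    return {
        "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                     "dct": "http://purl.org/dc/terms/"},
        "@type": "dcat:Dataset",
        "dct:title": p["name"],
        "dct:description": p["summary"],
        "dcat:keyword": p.get("tags", []),
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dcat:accessURL": p["api_url"],    # the generated Web API endpoint
            "dcat:mediaType": "application/json",
        }],
    }

project = {"name": "Bird Watch", "summary": "Citizen bird sightings",
           "tags": ["birds"], "api_url": "https://example.org/api/bird-watch"}
print(json.dumps(project_to_dcat(project), indent=2))
```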

Reynaldo Alvarez, César González-Mora, José Zubcoff, Irene Garrigós, Jose-Norberto Mazón, Hector Raúl González Diez
The Case for Cross-Entity Delta Encoding in Web Compression

Delta encoding and shared dictionary compression (SDC) for accelerating Web content have been studied extensively in research over the last two decades, but have only found limited adoption in the industry so far: Compression approaches that use a custom-tailored dictionary per website have all failed in practice due to lacking browser support and high overall complexity. General-purpose SDC approaches such as Brotli reduce complexity by shipping the same dictionary for all use cases, while most delta encoding approaches just consider similarities between versions of the same entity (but not between different entities). In this study, we investigate how much of the potential benefits of SDC and delta encoding are left on the table by these two simplifications. As our first contribution, we describe the idea of cross-entity delta encoding that uses cached assets from the immediate browser history for content encoding instead of a precompiled shared dictionary: This avoids the need to create a custom dictionary, but enables highly customized and efficient compression. Second, we present an experimental evaluation of compression efficiency to hold cross-entity delta encoding against state-of-the-art Web compression algorithms. We consciously compare algorithms some of which are not yet available in browsers to understand their potential value before investing resources to build them. Our results indicate that cross-entity delta encoding is over 50% more efficient for text-based resources than compression industry standards. We hope our findings motivate further research and development on this topic.
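The core mechanism is easy to demonstrate with zlib's preset-dictionary support: a previously cached page serves as the "dictionary" for encoding the next one. A minimal sketch using the standard zlib API (our illustration, not the authors' tooling):

```python
import zlib

def compress_with_history(new_asset: bytes, cached_asset: bytes) -> bytes:
    """Deflate new_asset using a previously delivered asset as preset dictionary."""
    c = zlib.compressobj(level=9, zdict=cached_asset)
    return c.compress(new_asset) + c.flush()

def decompress_with_history(payload: bytes, cached_asset: bytes) -> bytes:
    d = zlib.decompressobj(zdict=cached_asset)
    return d.decompress(payload) + d.flush()

# two pages from the same site share most of their markup
home = (b"<html><head><link rel='stylesheet' href='/s.css'><title>Shop</title>"
        b"</head><body><nav>Home Products About</nav>"
        b"<footer>(c) 2022 Shop Inc.</footer></body></html>")
product = (b"<html><head><link rel='stylesheet' href='/s.css'>"
           b"<title>Blue Mug - Shop</title></head><body>"
           b"<nav>Home Products About</nav><main>Blue Mug, $9</main>"
           b"<footer>(c) 2022 Shop Inc.</footer></body></html>")

plain = zlib.compress(product, 9)
delta = compress_with_history(product, home)
print(len(plain), len(delta))   # the history-based encoding is much smaller,
                                # since most bytes match the cached page
assert decompress_with_history(delta, home) == product
```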

Benjamin Wollmer, Wolfram Wingerath, Sophie Ferrlein, Fabian Panse, Felix Gessert, Norbert Ritter

Web Big Data and Web Data Analytics

Frontmatter
Dynamic Network Embedding in Hyperbolic Space via Self-attention

Graph Neural Networks (GNNs) have recently become increasingly popular due to their ability to learn node representations in complex graphs. Existing graph representation learning methods primarily target static graphs in Euclidean space, while many graphs in practical applications are dynamic and evolve constantly over time. Moreover, most of these methods underestimate the inherently complex and hierarchical properties of real-world graphs, leading to sub-optimal embeddings. In this work, we propose Dynamic Network Embedding in Hyperbolic Space via Self-Attention, referred to as DynHAT, a novel neural architecture that computes node representations by jointly considering two dimensions: a hyperbolic structural graph and a temporal attention graph. More specifically, DynHAT maps the structural graph into hyperbolic space to capture hierarchical information, while the temporal graph captures time-varying dynamic evolution over multiple time steps by flexibly weighting historical representations. Experimental results on three real-world datasets demonstrate the superiority of DynHAT for dynamic graph embedding, as it consistently outperforms competing methods in link prediction tasks.
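The hyperbolic side of such models rests on distances in the Poincaré ball, where tree-like hierarchies embed with low distortion. A minimal sketch of the standard distance formula (textbook math, not DynHAT's implementation):

```python
import math

def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball:
    d(u, v) = acosh(1 + 2*|u-v|^2 / ((1-|u|^2) * (1-|v|^2)))."""
    sq_norm = lambda w: sum(wi * wi for wi in w)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))

# points near the origin act like roots; points near the boundary act like leaves,
# so distances grow fast toward the boundary and hierarchies fit naturally
print(poincare_distance([0.0, 0.1], [0.0, 0.9]))
```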

Dingyang Duan, Daren Zha, Xiao Yang, Nan Mu, Jiahui Shen
Engineering Annotations to Support Analytical Provenance in Visual Exploration Processes

This paper focuses on the fundamental role played by annotations in supporting provenance analysis in visual exploration processes over large datasets. In particular, we investigate the use of annotations during the visual exploration of semantic datasets assisted by chained visualization techniques. We identify three potential uses of annotations: (i) documenting findings (including errors in the dataset), (ii) supporting collaborative reasoning among teammates, and (iii) analysing provenance during the exploratory process. To demonstrate the feasibility of our approach, we implemented it in a supporting tool, illustrating its usage and effectiveness through a series of use-case scenarios. We identify the attributes and metadata that describe the dependencies between annotations and visual representations, and we illustrate these dependencies through a domain-specific model.

Maroua Tikat, Aline Menin, Michel Buffa, Marco Winckler
Lunatory: A Real-Time Distributed Trajectory Clustering Framework for Web Big Data

Web big data contains a wealth of valuable information, which can be extracted through web mining and knowledge extraction. In particular, real-time location information from the web can provide a richer computational basis for existing applications, such as real-time monitoring systems and recommendation systems based on real-time trajectory clustering. However, as a trajectory is a sequence of user positions along the time dimension, the correlation calculation over trajectories inevitably incurs a massive computational cost. In addition, such trajectory data is usually time-sensitive; that is, once the trajectory data has been generated or changed, the corresponding clustering results need to be output with low latency. Although offline trajectory clustering has been well studied, directly extending such work to an online environment tends to incur (1) expensive network cost, (2) high processing latency, and (3) low-accuracy results. To enable real-time clustering on trajectory streams, we propose a distributed cLustering framework for hexagonal-based streaming trajectory (Lunatory). Lunatory covers three key components, namely: (1) Simplifier: to solve the problem of extensive network transmission in a distributed trajectory streaming system, a pivot trajectory data structure is introduced that simplifies trajectories by reducing the number of samples and extracting key features; (2) Partitioner: to enhance the local computational efficiency of subsequent clustering, a hexagonal-based indexing strategy is proposed to index the pivot trajectories; (3) Executor: extends DBSCAN to pivot trajectories and implements real-time trajectory clustering based on Flink. Empirical studies on real-world data validate the usefulness of our proposal and demonstrate the significant advantage of our approach over available solutions in the literature.
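The hexagonal indexing idea can be sketched as mapping each position sample to an axial hex-grid cell, so that nearby points land in the same partition. This uses the standard pointy-top hex-grid math with cube rounding (a generic illustration; Lunatory's actual partitioner is more involved):

```python
import math

def hex_cell(x, y, size=1.0):
    """Map a planar point to its axial hex-grid cell (pointy-top orientation)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size   # fractional axial coordinates
    r = (2 / 3) * y / size
    # cube rounding: round (q, -q-r, r), then repair the coordinate with the
    # largest rounding error so the invariant cx + cy + cz == 0 still holds
    cx, cy, cz = q, -q - r, r
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return int(rx), int(rz)

# samples in the same cell can be clustered locally on the same worker
print(hex_cell(0.2, 0.1), hex_cell(5.0, 3.0))
```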

Yang Wu, Zhicheng Pan, Pingfu Chao, Junhua Fang, Wei Chen, Lei Zhao

Web Mining and Knowledge Extraction

Frontmatter
Building Knowledge Subgraphs in Question Answering over Knowledge Graphs

Question answering over knowledge graphs aims to leverage facts in knowledge graphs to answer natural language questions. The presence of a large number of facts, particularly in huge and well-known knowledge graphs such as DBpedia, makes it difficult to access the knowledge graph for each given question. This paper describes a generic solution based on Personalized PageRank for extracting a small subset of the knowledge graph, a knowledge subgraph, which is likely to contain the answer to the question. Given a natural language question, relevant facts are determined by a bi-directed propagation process based on Personalized PageRank. Experiments are conducted over Freebase, DBpedia and WikiMovie to demonstrate the effectiveness of the approach in terms of recall and the size of the extracted knowledge subgraphs.
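A minimal power-iteration sketch of Personalized PageRank restarted at the question's seed entities (illustrative only; the paper uses a bi-directed propagation variant of this idea), with the top-scoring nodes kept as the knowledge subgraph:

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration for PageRank with restarts at the question's seed entities.
    adj: node -> list of out-neighbors; alpha: restart (teleport) probability."""
    restart = {s: 1.0 / len(seeds) for s in seeds}
    r = dict(restart)
    for _ in range(iters):
        nxt = {s: alpha * p for s, p in restart.items()}   # teleport mass to seeds
        for node, score in r.items():
            nbs = adj.get(node, [])
            for nb in nbs:                                 # spread mass along edges
                nxt[nb] = nxt.get(nb, 0.0) + (1 - alpha) * score / len(nbs)
        r = nxt
    return r

adj = {"Q": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["Q"]}
scores = personalized_pagerank(adj, seeds=["Q"])
subgraph_nodes = sorted(scores, key=scores.get, reverse=True)[:3]  # keep top-m nodes
print(subgraph_nodes)
```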

Sareh Aghaei, Kevin Angele, Anna Fensel
Dual-Attention Based Joint Aspect Sentiment Classification Model

Aspect-Category based Sentiment Analysis (ACSA) aims to predict the aspect category and the sentiment polarity mentioned in a sentence. Most works treat it as two individual tasks: aspect category detection (ACD) and aspect category sentiment classification (ACSC), resulting in missed categories and mismatches between sentiment words and aspect categories. This paper proposes a dual-attention based joint aspect sentiment classification model (AS-DATJM), which predicts aspect category and sentiment polarity jointly in one framework. Given a sentence, AS-DATJM first employs aspect-aware attention in ACD to obtain the hidden aspect terms. With these terms as guidance, the ACSC module aggregates relevant sentiment context over a Graph Convolutional Network. As a result, the inter-relations between aspect categories and sentiments can be captured and employed to predict both simultaneously. Extensive evaluations demonstrate the effectiveness of our model, and results show that it outperforms state-of-the-art methods on four benchmark datasets.

Ping Gu, Zhipeng Zhang
Explaining a Deep Neural Model with Hierarchical Attention for Aspect-Based Sentiment Classification Using Diagnostic Classifiers

LCR-Rot-hop++ is a state-of-the-art model for Aspect-Based Sentiment Classification. However, it is also a black-box model where the information encoded in each layer is not understood by the user. This study uses diagnostic classifiers (single-layer neural networks) to evaluate the information encoded in each layer of the LCR-Rot-hop++ model. This is done using various hypotheses designed to test for information deemed useful for sentiment analysis. We conclude that the model does not focus on identifying the aspect mentions associated with a word or the structure of the sentence. However, the model excels at encoding information to identify which words are related to the target. Lastly, the model is able to encode, to some extent, information about word sentiment and the sentiments of the words related to the target.

Kunal Geed, Flavius Frasincar, Maria Mihaela Truşcǎ
A Model for Meteorological Knowledge Graphs: Application to Météo-France Data

To study and predict meteorological phenomena and to include them in broader studies, the ability to represent and exchange meteorological data is of paramount importance. A typical approach to integrating and publishing such data is to formalize a knowledge graph relying on Linked Data and semantic Web standard models and practices. In this paper, we first discuss the semantic modelling issues related to spatio-temporal data such as meteorological observational data. We motivate the reuse of a network of existing ontologies to define a semantic model in which meteorological parameters are semantically defined, described and integrated. The model is generic enough to be adopted and extended by meteorological data providers to publish and integrate their sources while complying with Linked Data principles. Finally, we present a meteorological knowledge graph of weather observations based on our proposed model, published in the form of an RDF dataset, which we produced by transforming observation records made by Météo-France weather stations. It covers a large number of meteorological variables described through spatial and temporal dimensions and thus has the potential to serve several scientific case studies from different domains including agriculture, agronomy, environment, climate change and natural disasters.

Nadia Yacoubi Ayadi, Catherine Faron, Franck Michel, Fabien Gandon, Olivier Corby
An Ontological Approach for Recommending a Feature Selection Algorithm

Feature selection plays an important role in machine learning and data mining problems. Removing irrelevant features increases model accuracy and reduces the computational cost. However, selecting important features is not a simple task, as no single feature selection algorithm performs well on all datasets of interest. This paper addresses the recommendation of a feature selection algorithm based on dataset characteristics and quality. The research uses three types of dataset characteristics along with data quality metrics. The main contribution of the work is the utilization of Semantic Web techniques to develop a novel system that can aid in robust feature selection algorithm recommendations. The system's strength lies in assisting users of machine learning algorithms by providing more relevant feature selection algorithms for the dataset, using an ontology called Feature Selection algorithm recommendation based on Data Characteristics and Quality (FSDCQ). Results are generated using six different feature selection algorithms and four types of classifiers on ten datasets from the UCI repository. Recommendations take the form "Feature selection algorithm X is recommended for dataset i, as it performed better on dataset j, which is similar to dataset i in terms of class overlap 0.3, label noise 0.2, completeness 0.9, conciseness 0.8 units." While the domain-specific ontology FSDCQ was created to aid in the task of algorithm recommendation for feature selection, it is easily applicable to other meta-learning scenarios.

Aparna Nayak, Bojan Božić, Luca Longo
Towards Bridging the Gap Between Knowledge Graphs and Chatbots

Chatbots are nowadays widely applied in different life domains. One major reason for this trend is the mature development process supported by large companies and sophisticated conversational platforms. However, the required development steps are mostly done manually when transforming existing knowledge bases into interaction configurations, such that the algorithms integrated into the conversational platforms can learn the intended interaction patterns. Existing domain knowledge may vanish when a structured knowledge base is transformed into a "flat" text representation without backward references. In this paper, we aim for an automatic process for generating interaction configurations for a conversational platform (Google Dialogflow) from an existing domain-specific knowledge base. Our ultimate goal is to generate chatbot configurations automatically, such that quality and efficiency are increased.

Annemarie Wittig, Aleksandr Perevalov, Andreas Both

Web Security and Privacy

Frontmatter
Configurable Per-Query Data Minimization for Privacy-Compliant Web APIs

The purpose of regulatory data minimization obligations is to limit personal data to the absolute minimum necessary for a given context. Beyond the initial data collection, storage, and processing, data minimization is also required for subsequent data releases, as is the case when data are provided through query-capable Web APIs. Data-providing Web APIs, however, typically lack sophisticated data minimization features, leaving the task to manual and all too often missing implementations. In this paper, we address the problem of data minimization for data-providing, query-capable Web APIs. Based on a careful analysis of functional and non-functional requirements, we introduce Janus, an easy-to-use, highly configurable solution for implementing legally compliant data minimization in GraphQL Web APIs. Janus provides a rich set of information reduction functionalities that can be configured for different client roles accessing the API. We present a technical proof-of-concept along with experimental measurements that indicate reasonable overheads. Janus is thus a practical solution for implementing GraphQL APIs in line with the regulatory principle of data minimization.
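The per-role, per-field reduction idea can be sketched as a policy table of reduction functions applied before a query result leaves the API. This is a hypothetical configuration shape (illustrative names, not Janus's actual API):

```python
# role -> {field -> reduction function}; unlisted fields pass through unchanged
POLICIES = {
    "analyst": {"email": lambda v: "*@" + v.split("@")[1],  # keep provider only
                "birthdate": lambda v: v[:4]},              # generalize to year
    "support": {"birthdate": lambda v: None},               # suppress the field
}

def minimize(record, role):
    """Apply the role's reduction functions to a record before returning it."""
    policy = POLICIES.get(role, {})
    return {k: (policy[k](v) if k in policy else v) for k, v in record.items()}

user = {"name": "Ada", "email": "ada@example.org", "birthdate": "1990-04-02"}
print(minimize(user, "analyst"))
# {'name': 'Ada', 'email': '*@example.org', 'birthdate': '1990'}
```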

Frank Pallas, David Hartmann, Paul Heinrich, Josefine Kipke, Elias Grünewald
Effective Malicious URL Detection by Using Generative Adversarial Networks

Malicious URLs, a.k.a. malicious websites, pose a great threat to Web security. In particular, concept drift caused by variants of malicious URLs degrades the performance of existing detection methods based on known attack patterns. In this paper, we conduct an extensive measurement study of real-world URLs and find that hierarchical semantic features are suitable for identifying malicious URLs. We therefore propose URLGAN, a deep neural network model equipped with hierarchical semantic features, to distinguish between malicious and normal URLs. First, we embed the entire URL into a hierarchical semantic structure. Second, hierarchical semantic features are extracted from this structure through BERT. Then, the extracted features are combined with similar but slightly different features produced by the generator, enabling the conditional discriminator to extract the essential differences between normal and malicious URLs. Notably, the generator's features enhance the robustness of the system in detecting malicious URL variants. Extensive experiments on a public dataset and on data we collected from specific targets demonstrate that our method achieves superior performance over other methods and protects specific targets from malicious URLs.

Jinbu Geng, Shuhao Li, Zhicheng Liu, Zhenyu Cheng, Li Fan
MEMTD: Encrypted Malware Traffic Detection Using Multimodal Deep Learning

Malware that generates encrypted traffic presents a great threat to Internet security. The existing state-of-the-art malware traffic detection techniques based on deep learning (DL) ignore the heterogeneity of encrypted traffic, resulting in their inability to further improve detection performance. This paper applies multimodal DL to detect encrypted malware traffic, proposing a multimodal encrypted malware traffic detection (MEMTD) approach. MEMTD extracts features from three types of modal data—the transport layer security (TLS) handshake payload bytes (encryption behavior modal data), packet length sequence (spatial modal data), and packet arrival-time interval sequence (time modal data) of encrypted traffic. Moreover, an intermediate fusion mechanism is adopted in the MEMTD approach to mine the dependencies among modalities and fuse the discriminative traffic features, improving detection performance. The experimental results on datasets containing 8 malware families and normal traffic show that the MEMTD approach achieves 0.9996 macro-F1 and outperforms other single-modal DL detection methods.

Xiaotian Zhang, Jintian Lu, Jiakun Sun, Ruizhi Xiao, Shuyuan Jin

Web User Interfaces

Frontmatter
A Web Crowdsourcing Platform for Territorial Control in Smart Cities

Nowadays citizens engage with smart city ecosystems in several ways, using smartphones, mobile devices, connected cars, and drones. Pairing devices and data with a city's infrastructure and services can improve sustainability and achieve greater awareness and territorial control. Communities can improve energy distribution and decrease traffic congestion with the help of IoT technologies. To support and streamline such a process, in this paper we introduce a Web crowdsourcing platform serving as a Common Operational Picture dashboard to interoperate with smart devices, collect urban data from them, and monitor the city in real time. Its application to the Metropolitan City of Bari is presented and discussed.

Andrea Pazienza, Domenico Lofù, Giampaolo Flace, Marco Salzedo, Pietro Noviello, Eugenio Di Sciascio, Felice Vitulano
Supporting Natural Language Interaction with the Web

Conversational AI is disrupting the way information is accessed. However, there is still a lack of conversational technologies leveraging the Web. This paper introduces an approach to support the notion of Conversational Web Browsing. It illustrates design patterns for navigating websites through conversation and shows how such patterns are sustained by a Web architecture that integrates NLP technologies.

Marcos Baez, Cinzia Cappiello, Claudia M. Cutrupi, Maristella Matera, Isabella Possaghi, Emanuele Pucci, Gianluca Spadone, Antonella Pasquale
User Acceptance of Modified Web Page Loading Based on Progressive Streaming

In times of pandemic, it has become more evident that our modern society relies heavily on the Internet in most areas of life. As websites become more complex and their size increases, the amount of code slows down their loading speed, especially on mobile devices with poor network connectivity. Various improvements exist to optimize the code before and after delivering a web page to a client. However, the delivery and rendering themselves have rarely been examined. In this paper, we evaluate two new methods for loading and displaying websites faster, namely Text-First and Layout-First. Layout-First reduced the time until first contentful paint (FCP) on average from 281.75 s to 6.43 s at 32 KB/s, a difference of more than 4.5 min. Text-First reached the FCP on average in 2.15 s at the same network speed. However, our user study revealed that not every technological improvement is well accepted by users. Results showed that users will wait longer if the layout is stable while the page loads. More than 85% of participants preferred the Layout-First method introduced in this paper.

Lucas Vogel, Thomas Springer
We Don’t Need No Real Users?! Surveying the Adoption of User-less Automation Tools by UI Design Practitioners

The main principles for designing successful UIs in a perfect world have long been known—considering many possible solutions for a problem and involving representative users in the process. In practice, however, reasons for violating those principles can be plentiful: the infamous tight budgets and schedules, lack of management buy-in, restrictions on face-to-face meetings, etc. Yet, design tools that do not require real users, such as AI-/ML-powered solutions, which could mitigate these issues, seem to experience a rather low adoption rate in industry. In this paper, we present a survey of 34 professional digital designers and user researchers intended to investigate the above hypotheses. We inquire into awareness and usage of 61 such tools and platforms, as well as participants' design and research processes and general design tool adoption in industry. From the results we identify three particular challenges and three opportunities. Finding and recruiting relevant participants for user studies does indeed seem problematic, and professional designers and researchers often lack the time and resources to follow a textbook process. They are, however, open to novel tools addressing these shortcomings—particularly for ideation and evaluation—but at the same time seem to be largely unfamiliar with AI-/ML-based approaches or do not (yet) see added value in them. With these findings as a starting point, the Web Engineering community can work towards a deeper understanding of designers' and researchers' needs that could be met with AI-/ML-based support tools.

Maxim Bakaev, Maximilian Speicher, Johanna Jagow, Sebastian Heil, Martin Gaedke

Ph.D. Symposium

Frontmatter
Achieving Corruption-Transparency in Service Governance Processes with Blockchain-Technology Based e-Participation

Corruption takes place in public procurement, carried out by public servants through intermediaries, due to the use of centralized systems and complicated processes. Blockchain and Web3 have the potential to remove these intermediaries, instead allowing institutions to build trust among public servants and citizens through a decentralized web. It is feasible to positively reinforce transparency in tackling corruption in public procurement by establishing an e-participatory governance infrastructure using token economics from smart-contract blockchain technology. The overall success of public procurement in terms of service delivery to citizens is associated with citizen e-participation. Thus, increased e-participation through automated processes makes the government accountable and transparent in the provision of services, leading to the progress and economic growth of a country. In this paper, we investigate the potential of blockchain and smart contracts to improve the efficiency, trust, and transparency of public procurement in the case of Afghanistan. Moreover, we identify the existing barriers, namely lack of trust and transparency, the complexity of procurement documents, and inappropriate record-keeping systems. To address these issues, we propose a blockchain-based e-participatory infrastructure to boost transparency by curbing public procurement corruption.

Mohammad Mustafa Ibrahimy, Alex Norta, Peeter Normak
Applying a Healthcare Web of Things Framework for Infertility Treatments

According to doctors and researchers, fertility problems are reaching epidemic proportions. Meanwhile, the demand for infertility treatment is increasing by 5–10% per year. To support the growing demand, physicians need to define personalized remote-monitoring treatments supported by devices that send real-time information on hormone levels, heart rate, temperature, etc. To this end, Healthcare Monitoring Systems (HMS) have recently appeared, based on increasingly advanced devices that help to manage this task. However, current solutions are expensive and not very customizable by physicians themselves. In this paper, we propose a framework called MoSTHealth which, based on digital twins and Model-Driven Engineering (MDE), allows healthcare experts to model a personalized Web of Things (WoT) HMS scenario per treatment and per patient. Thanks to MDE, the simulated scenario allows us to generate a service-oriented enterprise cloud architecture that integrates a prediction module based on machine learning and data analysis. A WoT HMS scenario for infertility treatment is presented as a case study, in which a specific care plan is defined and associated with a set of devices, including a biosensing device that sends hormone levels in real time.

Anastasiia Gorelova, Santiago Meliá
Blockchain and AI to Build an Alzheimer’s Risk Calculator

The problems affecting healthcare databases and medical records are numerous, although the potential of the data stored in them is high. Medical records remain hidden across hospitals, and data-sharing processes fail to provide accountable data control. Blockchain technology has been successfully applied in various fields to support distributed data management and data quality. This article shows how Blockchain can be leveraged to better organize and share healthcare big data with mixed EHR (Electronic Health Record) and imaging (CAT, RX, etc.) sources. The aim is to exploit these data through Artificial Intelligence methods in order to build an Alzheimer's risk calculator based on neuro-images.

Paolo Sorino
Bridging Static Site Generation with the Dynamic Web

Historically, websites have been developed using HTML for their markup, either by authoring it directly or by generating it through abstractions. The currently available tools exist on a continuum between static, developer-oriented tools and dynamic services that cater to non-technical users. In this paper, we propose an approach that sits in the middle by using JSON for site definitions. The definition is leveraged on the client side for editing, bridging the continuum's ends.
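A toy illustration of the JSON-definition idea (our sketch, with an assumed schema, not the author's actual format): the same definition can be rendered to static HTML at build time, while a client-side editor could mutate the identical JSON structure.

```python
# assumed site-definition schema: a title plus an ordered list of typed blocks
site = {"title": "Home", "blocks": [
    {"type": "heading", "text": "Welcome"},
    {"type": "paragraph", "text": "This page is built from a JSON definition."},
]}

def render(site):
    """Render a JSON site definition to HTML; an editor would edit the same JSON."""
    body = "".join(
        f"<h1>{b['text']}</h1>" if b["type"] == "heading" else f"<p>{b['text']}</p>"
        for b in site["blocks"])
    return f"<html><head><title>{site['title']}</title></head><body>{body}</body></html>"

print(render(site))
```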

Juho Vepsäläinen, Petri Vuorimaa
Enhance Web-Components in Order to Increase Security and Maintainability

Today, client-side web applications are developed with one of the JavaScript frameworks, such as Angular or React. The excessive dependencies that arise in the Node Package Manager ecosystem increase the security risk and make one's own web application dependent on third-party packages. In contrast, the frameworkless approach proposes a renaissance of classic web development, striving to avoid external dependencies as far as possible and to fall back on the standards. Whether such an implementation achieves the maintainability and security of frameworks is questionable. It therefore makes sense to research which core concepts of the frameworks meet the requirements for maintainability and security, and how these are implemented. The novelty is that the concepts to be explored are moved into a standard in order to ensure developer efficiency, security, performance and maintainability in the long term. This allows existing approaches to focus on other essential features.

Tobias Münch, Rainer Roosmann
FAIRification of Citizen Science Data

Citizen Science (CS) initiatives encourage citizens to collect local data, contributing to knowledge creation and scientific development. However, these CS initiatives follow neither metadata nor data-sharing standards, which hampers their discoverability and reusability outside their own scope. To improve this scenario, it is crucial for CS to consider the Findable, Accessible, Interoperable and Reusable (FAIR) guidelines for research data sharing. This work proposes a FAIRification process (i.e. making CS initiatives more FAIR compliant) that enhances data-sharing capacities in the CS context. We consider the adoption of Web standards, Web application programming interfaces (APIs) and Web augmentation. This approach contributes to the production of FAIR data in CS for data consumers. As preliminary results, this paper explains the FAIRification process. The research objectives and plan are also presented.

Reynaldo Alvarez Luna, José Zubcoff, Irene Garrigós, Hector González
Towards Differentially Private Machine Learning Models and Their Robustness to Adversaries

The pervasiveness of modern machine learning algorithms exposes users to new vulnerabilities: violation of sensitive information stored in the training data and wrong model behaviors caused by adversaries. State-of-the-art approaches to prevent such behaviors are usually based on Differential Privacy (DP) and Adversarial Training (AT). DP is a rigorous formulation of privacy in probabilistic terms that prevents information leakages which could reveal private information about users, while AT algorithms empirically increase a system's robustness by injecting adversarial examples during the training process. Both techniques achieve their goal by modeling noise introduced into the system. We propose analyzing the relationship between these two techniques, studying how one affects the other. Our objective is to design a mechanism that guarantees both DP and robustness against adversarial attacks by injecting modeled noise into the system. We propose Recommender Systems as an application scenario because of the severe risks to user privacy and the systems' sensitivity to adversaries.
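For readers new to DP, the standard Laplace mechanism shows how "modeled noise" buys a formal ε-DP guarantee (textbook mechanism, not the mechanism proposed in this work):

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale): a random sign times an Exponential(mean=scale)."""
    return random.choice((-1, 1)) * random.expovariate(1 / scale)

def dp_release(true_value, sensitivity, epsilon):
    """epsilon-DP release of a numeric query result via the Laplace mechanism.
    sensitivity: max change of the query when one user's data changes."""
    return true_value + laplace_noise(sensitivity / epsilon)

# e.g. releasing a count query (sensitivity 1) under epsilon = 0.5
print(dp_release(42, sensitivity=1.0, epsilon=0.5))
```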

Alberto Carlo Maria Mancino, Tommaso Di Noia

Posters and Demonstrations

Frontmatter
A Metadata-Driven Tool for FAIR Data Production in Citizen Science Platforms

Citizen Science (CS) platforms include a large number of projects that manage data from citizen observations. However, data and metadata are not easily available and do not generally comply with standards. This makes it difficult to share data through the mechanisms commonly used in the scientific community, affecting the reuse of data outside the context of CS platforms. The adoption of Web standards could improve the FAIR (Findable, Accessible, Interoperable and Reusable) quality of shared data. Adopting standards is not enough, however; it is also important to provide the technologies that make it possible to find the data, access it, share it and interoperate with it. For this purpose, this paper presents a tool for the production of FAIR data from platforms based on the PPSR (Public Participation in Scientific Research) Core metadata model. The tool allows (i) transforming metadata from CS platforms to the DCAT (Data Catalogue Vocabulary) standard, (ii) generating Web APIs from the available data, and (iii) building a DCAT-validated data catalogue. This approach improves the FAIR compliance of CS data, empowering data consumers and developers.

Reynaldo Alvarez, César González-Mora, Irene Garrigós, Jose Zubcoff
A New Compatibility Measure for Harmonic EDM Mixing

DJ track selection can benefit from software-generated recommendations that optimise harmonic transitions. Emerging techniques (such as Tonal Interval Vectors) enable the definition of new metrics for harmonic compatibility (HC) estimation that improve the performance of existing applications. The aim of this study is thus to provide DJs with a new tool to improve their musical selections. We present a software package that can estimate the HC between digital music recordings, with a particular focus on modern dance music and the workflow of the DJ. The user defines a target track for which the calculation is to be made and obtains the HC values, expressed as a percentage, with respect to each track in the music collection. The system also calculates a pitch-transposition interval for each candidate track that, if applied, maximizes the HC with respect to the target track. Its graphical user interface allows the user to easily run it alongside the DJ software of choice during live performances. The system, tested with musically experienced users, generates pitch-transposition suggestions that improve mixes in 73.7% of cases.
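A simplified sketch of the transposition search, using plain chroma-vector cosine similarity as a stand-in for the Tonal Interval Vector metric the system actually uses: rotate the candidate's 12-bin pitch-class profile through all 12 semitone shifts and keep the best match.

```python
def best_transposition(target_chroma, candidate_chroma):
    """Return (HC %, semitone shift) maximizing harmonic overlap of two
    12-bin pitch-class profiles (cosine stand-in for TIV similarity)."""
    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        return num / den if den else 0.0
    scores = [(cosine(target_chroma, candidate_chroma[-s:] + candidate_chroma[:-s]), s)
              for s in range(12)]   # rotate candidate by s semitones
    hc, shift = max(scores)
    return round(hc * 100, 1), shift

c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]   # C, E, G
d_major = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]   # D, F#, A
print(best_transposition(c_major, d_major))       # (100.0, 10): a perfect match
                                                  # after shifting down two semitones
```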

Gabriel Bibbó Frau, Angel Faraldo
Compaz: Exploring the Potentials of Shared Dictionary Compression on the Web

In this demonstration, we present Compaz, an extensible benchmarking tool for web compression that enables evaluating approaches before they have been fully implemented and deployed. Compaz makes this possible by collecting all relevant data from user journeys on live websites first and then performing the benchmark analysis as a subsequent step with global knowledge of all transmitted resources. In our demonstration scenario, the audience can witness how current websites could improve their compression ratio and save bandwidth. They can choose from standard and widespread approaches such as Brotli or gzip and advanced approaches like shared dictionary compression that are currently not even supported by any browser.

Benjamin Wollmer, Wolfram Wingerath, Sophie Ferrlein, Felix Gessert, Norbert Ritter
Social Events Analyzer (SEA): A Toolkit for Mining Social Workflows by Means of Federated Process Mining

Users' smartphones collect information about the different interactions they perform in their daily life, including web interactions. Mining this information to discover users' processes provides information about them as individuals and as part of a social group. However, analyzing events produced by human behavior, where indeterminism and variability prevail, is a complex task. Techniques such as process mining focus on analyzing customary event logs produced by a system where all the possible interactions are predefined. The analysis becomes even harder when it involves a group of people whose joint activity is considered part of a Social Workflow. In this demo we present Social Events Analyzer (SEA), a toolkit for easy Social Workflow analysis using a technique called Federated Process Mining. The tool offers models more faithful to the behavior of the users that make up a Social Workflow and opens the door to the use of process mining as a basis for creating new automatic procedures adapted to user behavior.

Javier Rojo, José García-Alonso, Javier Berrocal, Juan Hernández, Juan M. Murillo, Carlos Canal
Solid Web Monetization

The Solid decentralization effort decouples data from services, so that users are in full control over their personal data. In this light, Web Monetization has been proposed as an alternative business model for web services that does not depend on data collection anymore. Integrating Web Monetization with Solid, however, remains difficult because of the heterogeneity of Interledger wallet implementations, lack of mechanisms for securely paying on behalf of a user, and an inherent issue of trusting content providers to handle payments. We propose the Web Monetization Provider as a solution to these challenges. The WMP acts as a third party, hiding the underlying complexity of transactions and acting as a source of trust in Web Monetization interactions. This demo shows a working end-to-end example including a website providing monetized content, a WMP, and a dashboard for configuring WMP into a Solid identity.

Merlijn Sebrechts, Tom Goethals, Thomas Dupont, Wannes Kerckhove, Ruben Taelman, Filip De Turck, Bruno Volckaert
Web Push Notifications from Solid Pods

Our demo showcases how a Solid Pod, i.e. a web server that adheres to the Solid Protocol, can be extended to support Web Push Notifications for Progressive Web Applications (PWAs). From the user's perspective, we present a PWA where a user can choose to receive Web Push Notifications when a message is posted to her Solid Pod's inbox.

Christoph H.-J. Braun, Tobias Käfer

Tutorials

Frontmatter
A Guide for Quantum Web Services Deployment

Quantum computing is a new paradigm for solving problems that classical computers cannot reach, to the point that it is already generating interest in the scientific and industrial communities. Currently, quantum computers and technology are being developed to support the execution of quantum software. Several large computer companies have already built functional quantum computers and developed several programming languages and quantum simulators that can be used by the general public. All this quantum computing infrastructure is offered to quantum developers through the cloud, following a model similar to the familiar Infrastructure as a Service. However, because quantum computing is at such an early stage, taking advantage of the capabilities of these computers requires very in-depth knowledge of quantum programming and quantum hardware, far from what developers are used to in classical cloud offerings. Although the future of quantum computing is still unknown, it is highly likely that quantum computing will coexist with classical computing for some time. At the same time, one of the most well-known and tested solutions for the communication of heterogeneous computing systems is web services. In this tutorial we offer an introductory view of how quantum algorithms can be converted into web services, how these web services can be deployed using the Amazon Braket platform for quantum computing, and how they can be invoked through classical web service endpoints. Finally, we propose a way in which a disadvantage of current quantum computers in terms of web services can be transformed into an advantage through the use of a Quantum API Gateway.

Jaime Alvarado-Valiente, Javier Romero-Álvarez, Jose Garcia-Alonso, Juan M. Murillo
About Lightweight Code Generation

There is often something mystical about code generation [1]. This is partly due to tools that are able to achieve a high degree of generation thanks to their flexibility and universality, but this also makes the tools extremely complex and restricts their use to suitably trained persons. This also applies to the OMG's "Model Driven Architecture" approach, which has tried to establish a standard in this field and to enable the exchange between different tools through additional technologies. A "code generation light" approach, which would often be sufficient, is difficult to implement with these tools. In principle, however, getting started with code generation is actually quite simple. Only two things are needed: (1) a model that describes the application to be realized, and (2) a template that transforms the model into code. In the simplest case, the model can consist of a series of statements in an ASCII file, or of an object graph over which the template iterates. This provides a clear separation between the semantic aspects (model) and the technical aspects (template). This tutorial introduces lightweight generator technologies that can easily be integrated into your own software development process and delegate tedious, monotonous programming tasks to the code generator, so that you can concentrate on the more demanding and interesting programming tasks. A number of different software generator technologies and their functional principles will be presented, along with how they can be realized with minimal effort. The tutorial also includes practical parts in which the participants perform a series of concrete tasks in the sphere of software code generation.

Andreas Schmidt
SPARQL Endpoints and Web API (SWApi)

The success of Semantic Web technology has boosted the publication of Knowledge Graphs in the Web of Data, and several technologies to access them have become available, covering different spots in the spectrum of expressivity: from the highly expressive SPARQL to the controlled access of Linked Data APIs, with GraphQL in between. Many of these technologies have reached industry-grade maturity. Finding the trade-offs between them is often difficult in the daily work of developers, who are interested in quick API deployment and easy data ingestion. This tutorial covers this in-between technology space, with the main goal of providing strategies and tools for publishing Web APIs that ensure the easy consumption of data coming from SPARQL endpoints. Together with an overview of state-of-the-art technologies, the tutorial focuses on two novel technologies: SPARQL Transformer, which produces a more compact JSON structure for SPARQL results, decreasing the effort required by developers to interface JavaScript and Python applications; and grlc, an automatic way of building APIs on top of SPARQL endpoints by sharing queries on collaborative platforms. Moreover, recent developments that combine the two are presented, offering a complete resource for developers and researchers. Hands-on sessions are proposed to internalise these concepts through practical exercises.
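As a taste of the kind of reshaping SPARQL Transformer automates, the following sketch flattens the standard SPARQL JSON results envelope into the compact objects developers usually want (sample data is illustrative):

```python
def compact(sparql_json):
    """Flatten SPARQL's results/bindings envelope into plain dicts,
    one per result row, keeping only the bound values."""
    return [{var: b["value"] for var, b in row.items()}
            for row in sparql_json["results"]["bindings"]]

# a standard SPARQL 1.1 JSON results document (illustrative values)
raw = {"head": {"vars": ["city", "pop"]},
       "results": {"bindings": [
           {"city": {"type": "literal", "value": "Bari"},
            "pop":  {"type": "literal", "value": "316491"}}]}}

print(compact(raw))   # [{'city': 'Bari', 'pop': '316491'}]
```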

Pasquale Lisena, Albert Meroño-Peñuela
Web Engineering with Human-in-the-Loop

Modern Web applications employ sophisticated Machine Learning models to rank news, posts, products, and other items presented to the users or contributed by them. To keep these models useful, one has to constantly train, evaluate, and monitor these models using freshly annotated data, which can be done using crowdsourcing. In this tutorial we will present a portion of our six-year experience in solving real-world tasks with human-in-the-loop pipelines that combine efforts made by humans and machines. We will introduce data labeling via public crowdsourcing marketplaces and present the critical components of efficient data labeling. Then, we will run a practical session, where participants address a challenging real-world Information Retrieval for e-Commerce task, experiment with selecting settings for the labeling process, and launch their label collection project on real crowds within the tutorial session. We will present useful quality control techniques and provide the attendees with an opportunity to discuss their annotation ideas. Methods and techniques described in this tutorial can be applied to any crowdsourced data and are not bound to any specific crowdsourcing platform.

Dmitry Ustalov, Nikita Pavlichenko, Boris Tseytlin, Daria Baidakova, Alexey Drutsa
Backmatter
Metadata
Title
Web Engineering
Edited by
Prof. Tommaso Di Noia
Dr. In-Young Ko
Dr. Markus Schedl
Carmelo Ardito
Copyright Year
2022
Electronic ISBN
978-3-031-09917-5
Print ISBN
978-3-031-09916-8
DOI
https://doi.org/10.1007/978-3-031-09917-5