2014 | Book

Web Information Systems Engineering – WISE 2014

15th International Conference, Thessaloniki, Greece, October 12-14, 2014, Proceedings, Part I

Editors: Boualem Benatallah, Azer Bestavros, Yannis Manolopoulos, Athena Vakali, Yanchun Zhang

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the proceedings of the 15th International Conference on Web Information Systems Engineering, WISE 2014, held in Thessaloniki, Greece, in October 2014.
The 52 full papers, 16 short papers, and 14 poster papers presented in the two-volume proceedings (LNCS 8786 and 8787) were carefully reviewed and selected from 196 submissions. They are organized in topical sections named: Web mining, modeling and classification; Web querying and searching; Web recommendation and personalization; semantic Web; social online networks; software architectures and platforms; Web technologies and frameworks; Web innovation and applications; and challenge.

Frontmatter

Web Mining, Modeling and Classification

Coupled Item-Based Matrix Factorization

The essence of the challenges cold start and sparsity in Recommender Systems (RS) is that the extant techniques, such as Collaborative Filtering (CF) and Matrix Factorization (MF), mainly rely on the user-item rating matrix, which sometimes is not informative enough for predicting recommendations. To solve these challenges, the objective item attributes are incorporated as complementary information. However, most of the existing methods for inferring the relationships between items assume that the attributes are “independently and identically distributed (iid)”, which does not always hold in reality. In fact, the attributes are more or less coupled with each other by some implicit relationships. Therefore, in this paper we propose an attribute-based coupled similarity measure to capture the implicit relationships between items. We then integrate the implicit item coupling into MF to form the Coupled Item-based Matrix Factorization (CIMF) model. Experimental results on two open data sets demonstrate that CIMF outperforms the benchmark methods.

Fangfang Li, Guandong Xu, Longbing Cao

A Lot of Slots – Outliers Confinement in Review-Based Systems

Review-based websites such as Amazon, eBay, TripAdvisor, and Booking have gained extraordinary popularity, with millions of users daily consulting online reviews to choose the services and products that best fit their needs. Some of the most popular review-based websites rank products by aggregating the single ratings through their arithmetic mean. In contrast, recent studies have shown that the median is a more robust aggregator with respect to ad hoc injections of outlier ratings. In this paper, we focus on four different types of ratings aggregators. We propose the slotted mean and the slotted median, and we compare their mathematical properties with those of the mean and the median. The results of our experiments highlight advantages and drawbacks of relying on each of these quality indexes. Our experiments have been carried out on a large data set of hotel reviews collected from Booking.com, while our proposed solutions are rooted in sound statistical theory. The results shown in this paper, other than being interesting on their own, also call for further investigations.

Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi
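
The robustness claim in the abstract above (mean versus median under injected outlier ratings) can be illustrated with a minimal sketch. The ratings are made up for illustration, and the slotted mean and slotted median are not shown because the abstract does not define them.

```python
from statistics import mean, median

def aggregate(ratings):
    """Return both aggregators for a list of numeric ratings."""
    return {"mean": mean(ratings), "median": median(ratings)}

# Hypothetical ratings on a 1-10 scale (not data from the paper).
honest = [8, 8, 9, 9, 9, 10, 10]
# Two low outlier ratings injected to push the score down.
injected = honest + [1, 1]

before = aggregate(honest)
after = aggregate(injected)

# How far each aggregator moved because of the injected ratings.
mean_shift = before["mean"] - after["mean"]
median_shift = before["median"] - after["median"]
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift}")
```

In this toy case the two injected ratings pull the mean down by more than a point and a half, while the median stays at 9.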

A Unified Model for Community Detection of Multiplex Networks

Multiplex networks contain multiple simplex networks. Community detection of multiplex networks needs to deal with information from all the simplex networks. Most approaches aggregate all the links in different simplex networks treating them as being equivalent. However, such aggregation might ignore information of importance in simplex networks. In addition, for each simplex network, the aggregation only considers adjacency relation among nodes, which can’t reflect real closeness among nodes very well. In order to solve the problems above, this paper presents a unified model to detect community structure by grouping the nodes based on a unified matrix transferred from multiplex network. In particular, we define importance and node similarity to describe respectively correlation difference of simplex networks and closeness among nodes in each simplex network. The experiment results show the higher accuracy of our model for community detection compared with competing methods on synthetic datasets and real world datasets.

Guangyao Zhu, Kan Li

Mining Domain-Specific Dictionaries of Opinion Words

The task of opinion mining has attracted interest during the last years. This is mainly due to the vast availability and value of opinions on-line and the easy access of data through conventional or intelligent crawlers. In order to utilize this information, algorithms make extensive use of word sets with known polarity. This approach is known as dictionary-based sentiment analysis. Such dictionaries are available for the English language. Unfortunately, this is not the case for other languages with smaller user bases. Moreover, such generic dictionaries are not suitable for specific domains. Domain-specific dictionaries are crucial for domain-specific sentiment analysis tasks. In this paper we alleviate the above issues by proposing an approach for domain-specific dictionary building. We evaluate our approach on a sentiment analysis task. Experiments on user reviews on digital devices demonstrate the utility of the proposed approach. In addition, we present NiosTo, a software that enables dictionary extraction and sentiment analysis on a given corpus.

Pantelis Agathangelou, Ioannis Katakis, Fotios Kokkoras, Konstantinos Ntonas

A Community Detection Algorithm Based on the Similarity Sequence

Community detection is a hot topic in the field of complex social networks. It is of great value to personalized recommendation, protein structure analysis, public opinion analysis, etc. However, most existing algorithms detect communities with misclassified nodes and peripheries, and the clustering accuracy is not high. In this paper, building on agglomerative hierarchical clustering, a community detection algorithm based on the similarity sequence is proposed, named ACSS (Agglomerative Clustering Algorithm based on the Similarity Sequence). First, similarities of nodes are sorted in descending order to get a sequence, and pairs of nodes are merged according to the sequence to construct a preliminary community structure. Second, the agglomerative clustering process is carried out to get the optimal community structure. The proposed algorithm is tested on real network and computer-generated network data sets. Experimental results show that ACSS can solve the problem of neglecting peripheries. Compared with the existing representative algorithms, it can detect stronger community structure and improve the clustering accuracy.

Hongwei Lu, Qian Zhao, Zaobin Gan
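
The first stage described in the abstract above (sort node similarities in descending order, then merge pairs along that sequence) might be sketched as follows. This is not the authors' ACSS implementation: the Jaccard similarity over closed neighbourhoods, the `threshold` cut-off, and the function names are assumptions made for illustration, and the second agglomerative stage is omitted.

```python
from itertools import combinations

def jaccard(adj, u, v):
    """Jaccard similarity of closed neighbourhoods (an assumed measure)."""
    a = adj[u] | {u}
    b = adj[v] | {v}
    return len(a & b) / len(a | b)

def preliminary_communities(adj, threshold=0.5):
    """Merge node pairs in descending similarity order (hypothetical cut-off)."""
    # The "similarity sequence": all node pairs, most similar first.
    pairs = sorted(
        ((jaccard(adj, u, v), u, v) for u, v in combinations(sorted(adj), 2)),
        reverse=True,
    )
    parent = {n: n for n in adj}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for sim, u, v in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        parent[find(u)] = find(v)

    groups = {}
    for n in adj:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())

# Two triangles joined by the single edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities = preliminary_communities(adj)
print(communities)
```

On this toy graph the bridge edge 3-4 has similarity 1/3, below the assumed cut-off, so the two triangles remain separate groups.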

A Self-learning Clustering Algorithm Based on Clustering Coefficient

This paper presents a novel clustering algorithm based on the clustering coefficient. It includes two steps: First, the k-nearest-neighbor method and correlation convergence are employed for a preliminary clustering. Then, the results are further split and merged according to intra-class and inter-class concentration degree based on the clustering coefficient. The proposed method takes the correlation among all elements in a cluster into account, thereby addressing a weakness of previous methods, which consider only the correlation with the center or core data element. Experiments show that our algorithm performs better in clustering compact data elements as well as forming some irregularly shaped clusters. It is more suitable for applications with little prior knowledge, e.g. hotspot discovery.

MingJie Zhong, ZhiJun Ding, HaiChun Sun, PengWei Wang

Detecting Hierarchical Structure of Community Members by Link Pattern Expansion Method

Community structure is an important property of complex networks, which is generally described as densely connected nodes and similar patterns of links. Hierarchy is a common property of networks. Different members have different belonging coefficients to the community, e.g. core members and boundary members, who are at different levels in the hierarchy of community. In this paper, a novel structure is presented, called hierarchical structure of members (HSM), which shows the relationships among members and multi-resolution of the community. A hierarchical link-pattern expansion method is proposed to detect HSM. First, we use the most similar link patterns to detect the seed communities which include both clique structures and star structures. Next, we define the influence between members to expand the community hierarchically. The experiment explores the hierarchical structure of members and the comparison with competitive algorithms on real-world networks demonstrates our method has stronger ability to detect communities.

Fengjiao Chen, Kan Li

An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification

The use of semantics in tasks related to information retrieval has become, in recent years, a vast field of research. Considering supervised text classification, which is the main interest of this work, semantics can be involved at different steps of text processing: during indexing step, during training step and during class prediction step. As for class prediction step, new text-to-text semantic similarity measures can replace classical similarity measures that are traditionally used by some classification methods for decision-making. In this paper we propose a new measure for assessing semantic similarity between texts based on TF/IDF with a new function that aggregates semantic similarities between concepts representing the compared text documents pair-to-pair. Experimental results demonstrate that our measure outperforms other semantic and classical measures with significant improvements.

Shereen Albitar, Sébastien Fournier, Bernard Espinasse

Automatically Annotating Structured Web Data Using a SVM-Based Multiclass Classifier

In this paper, we propose a new learning approach to Web data annotation, where a support vector machine-based multiclass classifier is trained to assign labels to data items. For data record extraction, a data section re-segmentation algorithm based on visual and content features is introduced to improve the performance of Web data record extraction. We have implemented the proposed approach and tested it with a large set of Web query result pages in different domains. Our experimental results show that our proposed approach is highly effective and efficient.

Daiyue Weng, Jun Hong, David A. Bell

Mining Discriminative Itemsets in Data Streams

This paper presents a single pass algorithm for mining discriminative Itemsets in data streams using a novel data structure and the tilted-time window model. Discriminative Itemsets are defined as Itemsets that are frequent in one data stream and their frequency in that stream is much higher than the rest of the streams in the dataset. In order to deal with the data structure size, we propose a pruning process that results in the compact tree structure containing discriminative Itemsets. Empirical analysis shows the sound time and space complexity of the proposed method.

Majid Seyfi, Shlomo Geva, Richi Nayak
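
The definition of a discriminative itemset given in the abstract above (frequent in one stream, and much more frequent there than in the rest) can be checked naively as in the sketch below. This is a brute-force illustration, not the paper's single-pass algorithm: there is no tilted-time window and no compact tree, and `min_support`, `min_ratio`, and `max_size` are hypothetical parameters. Both transaction lists are assumed to be non-empty.

```python
from collections import Counter
from itertools import combinations

def itemset_counts(transactions, max_size=2):
    """Count every itemset of up to max_size items across the transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return counts

def discriminative_itemsets(target, others, min_support=0.5, min_ratio=2.0):
    """Itemsets frequent in `target` and at least min_ratio times more
    frequent there than in `others` (both thresholds are assumptions)."""
    target_counts = itemset_counts(target)
    other_counts = itemset_counts(others)
    result = []
    for itemset, count in target_counts.items():
        freq_target = count / len(target)
        # A Counter returns 0 for itemsets never seen in the other stream.
        freq_other = other_counts[itemset] / len(others)
        if freq_target >= min_support and freq_target >= min_ratio * freq_other:
            result.append(itemset)
    return sorted(result)

# Made-up transactions standing in for two streams.
target_stream = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["a", "b"]]
other_streams = [["a"], ["c"], ["a", "c"], ["b"]]
found = discriminative_itemsets(target_stream, other_streams)
print(found)
```

Here the item `c` is frequent in the target stream but equally frequent elsewhere, so it is not reported, while the pair `("a", "b")` never occurs in the other stream and is.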

Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach

Identifying and targeting visitors on e-commerce websites with personalized content in real time is extremely important to marketers. Although such targeting exists today, it is based on demographic attributes of the visitors. We show that dynamic visitor attributes extracted from their click-stream provide much better predictive capabilities of visitor intent. In this work, we propose a mechanism for identifying similar visitor sessions on a website based on their click-streams. Novel techniques for extracting features from visitor clicks are employed. The large margin nearest neighbour (LMNN) algorithm is used to learn a similarity metric between any two sessions. Further, the sessions are classified into purchasers and non-purchasers using k-nearest neighbour (kNN) classification. Experimental results on two large real-world data sets show significant improvements over baseline algorithms based on the Hidden Markov Model (HMM), support vector machine (SVM), and random forest.

Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, Shradha Agrawal

BOSTER: An Efficient Algorithm for Mining Frequent Unordered Induced Subtrees

Extracting frequent subtrees from the tree structured data has important applications in Web mining. In this paper, we introduce a novel canonical form for rooted labelled unordered trees called the balanced-optimal-search canonical form (BOCF) that can handle the isomorphism problem efficiently. Using BOCF, we define a tree structure guided scheme based enumeration approach that systematically enumerates only the valid subtrees. Finally, we present the balanced optimal search tree miner (BOSTER) algorithm based on BOCF and the proposed enumeration approach, for finding frequent induced subtrees from a database of labelled rooted unordered trees. Experiments on the real datasets compare the efficiency of BOSTER over the two state-of-the-art algorithms for mining induced unordered subtrees, HybridTreeMiner and UNI3. The results are encouraging.

Israt J. Chowdhury, Richi Nayak

Web Querying and Searching

Phrase Queries with Inverted + Direct Indexes

Phrase queries play an important role in web search and other applications. Traditionally, phrase queries have been processed using a positional inverted index, potentially augmented by selected multi-word sequences (e.g., n-grams or frequent noun phrases). In this work, instead of augmenting the inverted index, we take a radically different approach and leverage the direct index, which provides efficient access to compact representations of documents. Modern retrieval systems maintain such a direct index, for instance, to generate snippets or compute proximity features. We present extensions of the established term-at-a-time and document-at-a-time query-processing methods that make effective combined use of the inverted index and the direct index. Our experiments on two real-world document collections using diverse query workloads demonstrate that our methods improve response time substantially without requiring additional index space.

Kiril Panev, Klaus Berberich
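
The combined use of an inverted index and a direct index described in the abstract above can be pictured with a small in-memory sketch: the inverted index narrows the search to documents containing every query term, and the direct index (document to token sequence) verifies adjacency. This is an illustration only, not the authors' term-at-a-time or document-at-a-time extensions; the dictionary-based indexes and function names are assumptions.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build a non-positional inverted index and a direct index."""
    inverted = defaultdict(set)   # term -> ids of documents containing it
    direct = {}                   # doc id -> token sequence of the document
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        direct[doc_id] = tokens
        for token in tokens:
            inverted[token].add(doc_id)
    return inverted, direct

def phrase_query(phrase, inverted, direct):
    """Return sorted ids of documents containing the phrase verbatim."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Step 1: candidate documents must contain every query term.
    candidates = set.intersection(*(inverted.get(t, set()) for t in terms))
    # Step 2: check term adjacency using the direct index.
    n = len(terms)
    hits = []
    for doc_id in sorted(candidates):
        tokens = direct[doc_id]
        if any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits

# Made-up toy collection.
docs = {
    1: "web information systems engineering",
    2: "information about web systems",
    3: "engineering web information",
}
inverted, direct = build_indexes(docs)
matches = phrase_query("web information", inverted, direct)
print(matches)
```

Document 2 contains both terms and so survives the inverted-index step, but the direct-index check rejects it because the terms are not adjacent.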

Ranking Based Activity Trajectory Search

With the proliferation of the GPS-enabled devices and mobile techniques, there has been a lot of work on trajectory search in the last decade. Previous trajectory search has focused on spatio-temporal features and text descriptions. Different from them, we study a novel problem of searching trajectories with activities and corresponding ranking information. Given a query, which is attached with a set of activities and a threshold of distance, the results of ranking based activity trajectory search (RTS) are trajectories such that the given activities are performed with the highest ranking within the threshold of distance. In addition, we also extend the query with an order, i.e., order-sensitive ranking based activity trajectory search (ORTS), which takes both the order of activities in a query and the order of trajectories into account. It is challenging to answer RTS and ORTS efficiently due to the structural complexity of trajectory data with ranking information. In this paper, a hybrid index AC-tree and its optimized variant RAC-tree are proposed to achieve higher efficiency. Extensive experiments verify the high efficiency and scalability of the proposed algorithms.

Wei Chen, Lei Zhao, Xu Jiajie, Kai Zheng, Xiaofang Zhou

Topical Pattern Based Document Modelling and Relevance Ranking

For traditional information filtering (IF) models, it is often assumed that the documents in one collection are only related to one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling was proposed to generate statistical models to represent multiple topics in a collection of documents, but in a topic model, topics are represented by distributions over words which are limited to distinctively represent the semantics of topics. Patterns are always thought to be more discriminative than single terms and are able to reveal the inner relations between words. This paper proposes a novel information filtering model, Significant matched Pattern-based Topic Model (SPBTM). The SPBTM represents user information needs in terms of multiple topics and each topic is represented by patterns. More importantly, the patterns are organized into groups based on their statistical and taxonomic features, from which the more representative patterns, called Significant Matched Patterns, can be identified and used to estimate the document relevance. Experiments on benchmark data sets demonstrate that the SPBTM significantly outperforms the state-of-the-art models.

Yang Gao, Yue Xu, Yuefeng Li

A Decremental Search Approach for Large Scale Dynamic Ridesharing

The Web of Things (WoT) paradigm introduces novel applications to improve the quality of human lives. Dynamic ridesharing is one of these applications, which holds the potential to gain significant economical, environmental, and social benefits particularly in metropolitan areas. Despite the recent advances in this area, many challenges still remain. In particular, handling large-scale incomplete data has not been adequately addressed by previous works. Optimizing the taxi/passengers schedules to gain the maximum benefits is another challenging issue. In this paper, we propose a novel system, MARS (Multi-Agent Ridesharing System), which addresses these challenges by formulating travel time estimation and enhancing the efficiency of taxi searching through a decremental search approach. Our proposed approach has been validated using a real-world dataset that consists of the trajectories of 10,357 taxis in Beijing, China.

Ali Shemshadi, Quan Z. Sheng, Wei Emma Zhang

Model-Based Search and Ranking of Web APIs across Multiple Repositories

Web API search and reuse for agile Web application development may benefit from selection criteria that combine several perspectives: they can be performed based on features used to describe APIs, or according to the co-occurrence of Web APIs in the same applications, or they can be driven through ratings assigned by designers who used the Web APIs for their own mashups. Nevertheless, different Web API repositories usually focus on a subset of these perspectives, thus providing complementary Web API descriptions. In this paper, we propose a unified model for Web API characterization. The model enables a cross-repository search of Web APIs and mashups, based on different kinds of similarity between them, identified regardless of the complementarity of their descriptions. This unified representation improves retrieval results compared with a Web API search performed over multiple repositories considered separately.

Devis Bianchini, Valeria De Antonellis, Michele Melchiori

Common Neighbor Query-Friendly Triangulation-Based Large-Scale Graph Compression

Large-scale graphs appear in many web applications, and are inevitable in web data management and mining. A lossless compression method for large-scale graphs, named bound-triangulation, is introduced in this paper. It differs from other graph compression methods in that: 1) it can achieve both a good compression ratio and a low compression time; 2) the compression ratio can be controlled by users, so that compression ratio and processing performance can be balanced; 3) it supports efficient common neighbor query processing over compressed graphs. Thus, it can support a wide range of graph processing tasks. An empirical study over two real-life large-scale social networks with different underlying data distributions shows the superiority of the proposed method over other existing graph compression methods.

Liang Zhang, Chen Xu, Weining Qian, Aoying Zhou

Continuous Monitoring of Top-k Dominating Queries over Uncertain Data Streams

In many scenarios, e.g., environmental monitoring using multiple sensors, the uncertain data objects arrive continuously (online) and need to be processed in a streaming manner. We first formally define the problem of continuous probabilistic top-k dominating (PTOPK) query processing over uncertain data streams based on a count-based sliding window model. Based on the observation that PTOPK does not change dramatically in consecutive sliding windows and most uncertain data objects not in PTOPK cannot be inserted in PTOPK in a certain period of time, an efficient postponed examination algorithm (PEA) is proposed. With PEA, the score calculation for some uncertain data objects not in PTOPK can be postponed and the computation cost can be saved. Extensive experiments have been conducted to demonstrate the efficiency of our approaches.

Guohui Li, Changyin Luo, Jianjun Li
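
As background for the abstract above, the sketch below shows a top-k dominating query over a count-based sliding window in its simplest, certain-data form: each object is scored by how many other objects in the window it dominates. It deliberately ignores uncertainty and does not implement PEA; the assumption that smaller values are better, the tie-breaking rule, and the function names are all choices made for illustration.

```python
from collections import deque

def dominates(a, b):
    """a dominates b if it is no worse in every dimension and strictly
    better in at least one (assuming smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def top_k_dominating(window, k):
    """Score each object by the number of window objects it dominates."""
    scored = [(sum(dominates(a, b) for b in window), a) for a in window]
    # Highest score first; ties broken by the object's natural order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:k]

def monitor(stream, window_size, k):
    """Recompute the top-k answer after every arrival (naive baseline)."""
    window = deque(maxlen=window_size)  # count-based sliding window
    results = []
    for obj in stream:
        window.append(obj)              # the oldest object expires automatically
        results.append(top_k_dominating(list(window), k))
    return results

# Made-up two-dimensional readings.
stream = [(1, 1), (2, 2), (3, 1), (0, 3), (2, 3)]
results = monitor(stream, window_size=3, k=1)
for step, answer in enumerate(results, start=1):
    print(step, answer)
```

This baseline rescans the whole window on every arrival, which is the kind of repeated score computation that the abstract says PEA postpones for objects unlikely to enter the result.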

Keyword Search over Web Documents Based on Earth Mover’s Distance

Keyword search is widely used in many practical applications. Unfortunately, most keyword-based search engines compute the similarity distance between two Web documents by only matching the keywords at the same positions in both the query and the document vectors, without considering the impact of the keywords at neighbouring positions. Such an approach usually results in incomplete search results. In this paper, we exploit the Earth Mover’s Distance (EMD) as a distance function, which is more flexible than other distance functions such as the Euclidean distance. To overcome the computational complexity of EMD, we use filtering techniques to minimize the total number of actual EMD computations. We further develop a novel lower bound as a new EMD filter for a partial matching technique that is suitable for searching Web documents. The experimental results demonstrate the efficiency of EMD-based search with filtering techniques.

Jiangang Ma, Quan Z. Sheng, Lina Yao, Yong Xu, Ali Shemshadi
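
The contrast drawn in the abstract above between position-wise matching and EMD can be seen in one dimension, where the EMD between two normalised histograms with unit distance between adjacent bins equals the sum of absolute differences of their cumulative sums. The sketch below uses that special case only; it is not the paper's method, it includes none of the filtering or lower-bound techniques, and the vectors are made up.

```python
import math
from itertools import accumulate

def emd_1d(p, q):
    """EMD between two 1-D histograms with unit ground distance between
    adjacent bins; both are normalised to total mass 1 first."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    cum_p = accumulate(x / total_p for x in p)
    cum_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cum_p, cum_q))

# Hypothetical weight vectors over three ordered positions.
query = [1, 0, 0]
near_doc = [0, 1, 0]   # mass sits one position away from the query
far_doc = [0, 0, 1]    # mass sits two positions away

# Euclidean distance cannot tell the two documents apart...
print(math.dist(query, near_doc), math.dist(query, far_doc))
# ...while EMD grows with how far the mass has to move.
print(emd_1d(query, near_doc), emd_1d(query, far_doc))
```

Both documents are at the same Euclidean distance from the query, while the EMD values are 1 and 2, reflecting how far the mass sits from the query position.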

iPoll: Automatic Polling Using Online Search

For years, opinion polls have relied on data collected through telephone or person-to-person surveys. The process is costly, inconvenient, and slow. Recently, online search data has emerged as a potential proxy for survey data. However, considerable human involvement is still needed for the selection of search indices, a task that requires knowledge of both the target issue and how search terms are used by the online community. The robustness of such manually selected search indices can be questionable. In this paper, we propose an automatic polling system through a novel application of machine learning. In this system, the need for examining, comparing, and selecting search indices has been eliminated through automatic generation of candidate search indices and intelligent combination of the indices. The results include a publicly accessible web application that provides real-time, robust, and accurate measurements of public opinions on several subjects of general interest.

Thin Nguyen, Dinh Phung, Wei Luo, Truyen Tran, Svetha Venkatesh

Web Recommendation and Personalization

Comparing the Predictive Capability of Social and Interest Affinity for Recommendations

The advent of online social networks created new prediction opportunities for recommender systems: instead of relying on past rating history through the use of collaborative filtering (CF), they can leverage the social relations among users as a predictor of user tastes similarity. Alas, little effort has been put into understanding when and why (e.g., for which users and what items) the social affinity (i.e., how well connected users are in the social network) is a better predictor of user preferences than the interest affinity among them as algorithmically determined by CF, and how to better evaluate recommendations depending on, for instance, what type of users a recommendation application targets. This oversight is explained in part by the lack of a systematic collection of datasets including both the explicit social network among users and the collaboratively annotated items. In this paper, we conduct an extensive empirical analysis on six real-world publicly available datasets, which dissects the impact of user and item attributes, such as the density of social ties or item rating patterns, on the performance of recommendation strategies relying on either the social ties or past rating similarity. Our findings represent practical guidelines that can assist in future deployments and mixing schemes.

Alexandra Olteanu, Anne-Marie Kermarrec, Karl Aberer

End-User Browser-Side Modification of Web Pages

The increasing volume of content and actions available on the Web, combined with the growing number of mature digital natives, anticipates a growing desire to control the Web experience. Akin to the Web2.0 movement, webies’ desires do not stop at content authoring but extend to controlling how content is arranged in websites. By content, we mainly refer to HTML pages or, more precisely, their runtime representation: DOM trees. The vision is for users to “prune” (removing nodes) or “graft” (adding nodes) existing DOM trees to improve their idiosyncratic and situational Web experience. Hence, Web content is no longer consumed as canned by Web masters. Rather, users can remove content of no interest, or place new content from somewhere else. This vision accounts for a post-production user-driven Web customization (referred to as “Web Modding”). Being user driven, appropriate abstractions and tools are needed. The paper introduces a set of abstractions (formalized in terms of a domain-specific language) and an IDE (realized as an add-on for Google Chrome) to empower non-programmers to achieve HTML rearrangement. The paper discusses the technical issues and the results of a first validation.

Oscar Díaz, Cristóbal Arellano, Iñigo Aldalur, Haritz Medina, Sergio Firmenich

Mobile Phone Recommendation Based on Phone Interest

As cellular users change mobile phones frequently, a mobile phone recommendation system is of great importance for mobile operators to achieve business benefit. Designing such a system poses essential challenges for researchers. A critical one is how to obtain and model a user’s interest in mobile phones. So far, recommendation approaches based on a phone’s hardware features or personalized web behavior have not achieved satisfactory results. In this paper, we propose phone interest for mobile phone recommendation. Phone interest is a latent-level concept extracted from the web log data of a group of users who have the same mobile phone. We propose a novel probabilistic model named “Phone Interest Model” based only on mobile web log data. All the log data come from the cellular operator’s servers, not from applications on the mobile phones. The model proves its effectiveness on large-scale station cellular data from a real cellular operator. In experiments, we validated the model against 1.3 billion mobile Web logs for 4 million distinct users in the Beijing metropolitan area, and show that the model achieves good performance in phone recommendation, outperforms the baseline methods, and offers significantly high fidelity.

Bozhi Yuan, Bin Xu, Tonglee Chung, Kaiyan Shuai, Yongbin Liu

Two Approaches to the Dataset Interlinking Recommendation Problem

Whenever a dataset is published on the Web of Data, an exploratory search over existing datasets must be performed to identify those datasets that are potential candidates to be interlinked with it. This paper introduces and compares two approaches to address the dataset interlinking recommendation problem, respectively based on Bayesian classifiers and on Social Network Analysis techniques. Both approaches define rank score functions that explore the vocabularies, classes and properties that the datasets use, in addition to the known dataset links. After extensive experiments using real-world datasets, the results show that the rank score functions achieve a mean average precision of around 60%. Intuitively, this means that the exploratory search for datasets to be interlinked with the new dataset might be limited to just the top-ranked datasets, reducing the cost of the dataset interlinking process.

Giseli Rabello Lopes, Luiz André P. Paes Leme, Bernardo Pereira Nunes, Marco Antonio Casanova, Stefan Dietze
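
The abstract above reports results as mean average precision (MAP). For readers unfamiliar with the metric, the following sketch computes it in its standard form; the dataset identifiers and rankings are hypothetical and unrelated to the paper's experiments.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank taken at each rank holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Two hypothetical recommendation runs: a ranked list of candidate datasets
# and the set of datasets that really should be interlinked.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),               # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

A higher MAP means the relevant datasets tend to appear near the top of the ranking, which is why a good score supports restricting the exploratory search to the top-ranked candidates.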

Exploiting Perceptual Similarity: Privacy-Preserving Cooperative Query Personalization

In this paper, we introduce privacy-preserving query personalization for experience items like movies, music, games or books. While these items are rather common, describing them with semantically meaningful attribute values is challenging, thus hindering traditional database query personalization. This often leads to the use of recommender systems, which, however, have several drawbacks as for example high barriers for new users joining the system, the inability to process dynamic queries, and severe privacy concerns due to requiring extensive long-term user profiles. We propose an alternative approach, representing experience items in a perceptual space using high-dimensional and semantically rich features. In order to query this space, we provide query-by-example personalization relying on the perceived similarity between items, and learn a user’s current preferences with respect to the query on the fly. Furthermore, for query execution, our approach addresses privacy issues of recommender systems as we do not require user profiles for queries, do not leak any personal information during interaction, and allow users to stay anonymous while querying. In this paper, we provide the foundations of such a system and then extensively discuss and evaluate the performance of our approach under different assumptions. Also, suitable optimizations and modifications to ensure scalability on current hardware are presented.

Christoph Lofi, Christian Nieke

Identifying Explicit Features for Sentiment Analysis in Consumer Reviews

With the number of reviews growing every day, it has become more important for both consumers and producers to gather the information that these reviews contain in an effective way. For this, a well performing feature extraction method is needed. In this paper we focus on detecting explicit features. For this purpose, we use grammatical relations between words in combination with baseline statistics of words as found in the review text. Compared to three investigated existing methods for explicit feature detection, our method significantly improves the F-measure on three publicly available data sets.

Nienke de Boer, Marijtje van Leeuwen, Ruud van Luijk, Kim Schouten, Flavius Frasincar, Damir Vandic

Facet Tree for Personalized Web Documents Organization

The vast amount of information and resources in digital libraries, and on the Web in general, demands effective methods of archiving and organization. However, most of the existing solutions support only very specific use case scenarios, or are not flexible enough to accommodate changes in the document collections over time. We propose a method for web document organization based on a facet view of the personal information structure. Facet chaining in a tree can create any depth of the structure and thus specify any context of resources. We enhanced this method by clustering similar resources and by using a special Search facet that allows users to specify arbitrary keyword queries as an input for the collection’s categorization. In order to evaluate the proposed approach, we carried out a user study in the bookmarking system Annota.

Róbert Móro, Mária Bieliková, Roman Burger

Mobile Web User Behavior Modeling

Models of mobile web user behavior have broad applicability in fields such as mobile network optimization, mobile web content recommendation, collective behavior analysis, and human dynamics. This paper proposes and evaluates the URI model, a novel approach to analyzing users' mobile Web usage behavior, which combines user interest modeling with location analysis. The URI model takes as input mobile user web logs associated with coarse-grained location drawn from real data, such as Event Detail Records (EDRs) from a cellular telephone network. We use probabilistic topic modeling to discover latent user interest from users' mobile Web usage logs. We validated the URI model against billions of mobile web logs for millions of cellular phones in the Beijing metropolitan area. Experiments show that the URI model achieves good performance and offers significantly high fidelity.

Bozhi Yuan, Bin Xu, Chao Wu, Yuanchao Ma

Effect of Mood, Social Connectivity and Age in Online Depression Community via Topic and Linguistic Analysis

Depression afflicts one in four people during their lives. Several studies have shown that for the isolated and mentally ill, the Web and social media provide effective platforms for support and treatment, as well as for acquiring a scientific, clinical understanding of this mental condition. More and more individuals affected by depression join online communities to seek information, express themselves, share their concerns and look for support [12]. For the first time, we collect and study a large online depression community of more than 12,000 active members from Live Journal. We examine the effect of mood, social connectivity and age on the online messages authored by members of an online depression community. The posts are considered in two aspects: what is written (topic) and how it is written (language style). We use statistical and machine learning methods to discriminate the posts made by bloggers in low versus high valence mood, in different age categories and with different degrees of social connectivity. Using statistical tests, language styles are found to be significantly different between low and high valence cohorts, whilst topics are significantly different between people with different degrees of social connectivity. High performance is achieved for low versus high valence post classification using writing style as features. The findings suggest the potential of using social media in depression screening, especially in online settings.

Bo Dao, Thin Nguyen, Dinh Phung, Svetha Venkatesh

A Review Selection Method Using Product Feature Taxonomy

Online reviews have become more and more important in the decision-making process. In recent years, the problem of identifying useful reviews for users has attracted significant attention. For instance, in order to select reviews that focus on a particular feature, researchers proposed a method which extracts all associated words of this feature as the relevant information to evaluate and find appropriate reviews. However, the extraction of associated words is not that accurate due to the noise in free review text, and this affects the overall performance negatively. In this paper, we propose a method to select reviews according to a given feature by using a review model generated based upon a domain ontology called product feature taxonomy. The proposed review model provides relevant information about the hierarchical relationships of the features in the review, which captures the review characteristics accurately. Our experimental results on a real-world review dataset show that our approach is able to effectively improve the review selection performance according to the given criteria.

Nan Tian, Yue Xu, Yuefeng Li

Semantic Web

A Genetic Programming Approach for Learning Semantic Information Extraction Rules from News

Due to the increasing amount of data provided by news sources and the user-specific information needs, many news personalization systems have recently been proposed. Often, these systems process news data automatically into information, while relying on underlying knowledge bases containing concepts and their relations for specific domains. For this, information extraction rules are frequently used, yet they are usually manually constructed. As it is difficult to efficiently maintain a balance between precision and recall while using a manual approach, we present a genetic programming-based approach for automatically learning semantic information extraction rules from (financial) news that extract events. Our evaluation results show that compared to information extraction rules constructed by expert users, our rules yield a 27% higher F1-measure after the same amount of rules construction time.

Wouter IJntema, Frederik Hogenboom, Flavius Frasincar, Damir Vandic

Ontology-Based Management of Conflicting Products in Pixel Advertising

Pixel advertising represents the placement of multiple pixel blocks on a banner for the purpose of advertising companies and their products. In this paper, we investigate how one can avoid product conflicts in the placement of pixel advertisements on a Web banner, while maximizing the overall banner revenue. Our solution for this problem is based on a product ontology that defines products and their relationships. We evaluate three heuristic algorithms for generating allocation patterns, i.e., the left justified algorithm, the orthogonal algorithm, and the GRASP constructive algorithm. The results show that the left justified algorithm and the orthogonal algorithm are most effective in terms of profit per pixel, while the GRASP constructive algorithm is identified as most efficient in terms of computational time.

Ferry Boon, Sabri Bouzidi, Raymond Vermaas, Damir Vandic, Flavius Frasincar

Exploiting Semantic Result Clustering to Support Keyword Search on Linked Data

Keyword search is by far the most popular technique for searching linked data on the web. The simplicity of keyword search on data graphs comes with at least two drawbacks: difficulty in identifying results relevant to the user intent among an overwhelming number of candidates, and performance scalability problems. In this paper, we claim that result ranking and top-k processing, which adapt schema-unaware IR-based techniques to loosely structured data, are not sufficient to address these drawbacks and efficiently produce answers of high quality. We present an alternative solution which hierarchically clusters the results based on a semantic interpretation of the keyword instances and takes advantage of relevance feedback from the user. Our clustering hierarchy exploits pattern graphs, which are structured queries that cluster together result graphs of the same structure and represent possible interpretations of the keyword query. We present an algorithm which computes r-radius Steiner pattern graphs using exclusively the structural summary of the data graph. The user selects relevant pattern graphs by exploring only a small portion of the hierarchy, supported by a ranking of the hierarchy components. Our experimental results show the feasibility of our system by demonstrating short reach times and efficient computation of the relevant results.

Ananya Dass, Cem Aksoy, Aggeliki Dimitriou, Dimitri Theodoratos

Discovering Semantic Mobility Pattern from Check-in Data

The wealth of check-in data offers new opportunities for better understanding user movement patterns. Existing studies have focused on mining explicit frequent sequential patterns. However, the sparseness of check-in data makes it difficult to discover all explicit patterns precisely. In addition, because explicit patterns are weak at expressing semantic knowledge, the need for discovering semantic patterns arises. In this paper, we propose the Topical User Transition Model (TUTM) to discover semantic mobility patterns by analyzing topical transitions. Via this model, we can discover semantic transition properties and predict user movement preferences. Furthermore, an Expectation-Maximization (EM) algorithm incorporating the Forward-Backward algorithm is provided for estimating the model parameters. To demonstrate the performance of the TUTM model, experimental studies are carried out, and the results show that our model can not only reasonably explain user mobility patterns, but also effectively improve the prediction accuracy in comparison with traditional approaches.

Ji Yuan, Xudong Liu, Richong Zhang, Hailong Sun, Xiaohui Guo, Yanghao Wang

An Offline Optimal SPARQL Query Planning Approach to Evaluate Online Heuristic Planners

In graph databases, a given graph query can be executed in a large variety of semantically equivalent ways. Each such execution plan produces the same results, but at different computation costs. The query planning problem consists of finding, for a given query, an execution plan with the minimum cost. The traditional greedy or heuristic cost-based approaches addressing the query planning problem do not guarantee by design the optimality of the chosen execution plan. In this paper, we present a principled framework to solve the query planning problem by casting it into an Integer Linear Programming problem, and discuss its applications to testing and improving heuristic-based query planners.

Achille Fokoue, Mihaela Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas

Agents, Models and Semantic Integration in Support of Personal eHealth Knowledge Spaces

The advancements in healthcare practice have brought to the fore the need for flexible access to health-related information and created an ever-growing demand for the design, development and management of personalized knowledge spaces. In this paper, we present a web-based platform that generates a Personal eHealth Knowledge Space as an aggregation of several knowledge sources relevant for the provision of individualized personal services. To this end, novel technologies are exploited, such as knowledge on demand to lower the information overload for the end-users, agent-based communication and reasoning to support cooperation and decision making, and semantic integration to provide uniform access to heterogeneous information. All three different technologies are combined to create a novel web-based platform allowing seamless user interaction through a portal that supports personalized, granular and secure access to relevant information.

Haridimos Kondylakis, Dimitris Plexousakis, Vedran Hrgovcic, Robert Woitsch, Marc Premm, Michael Schuele

Probabilistic Associations as a Proxy for Semantic Relatedness

Semantic relatedness computation is a well-known problem with multidisciplinary applications. Existing approaches to computing semantic relatedness ignore the asymmetric associations of words. In the absence of an explicit topical context, these asymmetric associations can be effectively used to represent the relation of words in directional contexts. Motivated by the idea of word associations, this paper presents a new approach to computing semantic relatedness using asymmetric association-based probabilities of words extracted from the directional contexts of words based on the Wikipedia corpus. The performance evaluation of the proposed approach on a variety of publicly available benchmark datasets shows that the asymmetric association-based measures outperformed not only the baseline symmetric measures but also most of the state-of-the-art approaches.

Shahida Jabeen, Xiaoying Gao, Peter Andreae

A Hybrid Model for Learning Semantic Relatedness Using Wikipedia-Based Features

Semantic relatedness computation is the task of quantifying the degree of relatedness of two concepts. The performance of existing approaches to computing semantic relatedness is highly dependent on particular aspects of relatedness. For instance, taxonomy-based approaches aim at computing similarity, which is a special case of semantic relatedness. On the other hand, corpus-based approaches focus on the associative relations of words by taking their distributional features into account. Based on the assumption that different aspects of knowledge sources cover different kinds of semantic relations, this paper presents a hybrid model for computing semantic relatedness of words using new features extracted from various aspects of Wikipedia. The focus of this paper is on finding the optimal feature combination(s) that enhance the performance of the hybrid model. The empirical evaluation on benchmark datasets has shown that hybrid features perform better than single features by providing a complementary coverage of semantic relations, leading to improved correlation with human judgments.

Shahida Jabeen, Xiaoying Gao, Peter Andreae

An Ontology-Based Approach for Product Entity Resolution on the Web

Product entity resolution is an important part of online product search, where product entities coming from different websites need to be aggregated in the search results. In this paper, we propose an approach to product entity resolution using the descriptive power of an ontology. In our algorithm, we use similarity measures that are defined specifically for each type of product feature and learn the feature weights by means of a genetic algorithm. In the evaluation of our algorithm, we obtain F1-measures of 59% and 72% for two product classes that we consider. The obtained results are significantly better than those obtained from a state-of-the-art product entity resolution algorithm.

Raymond Vermaas, Damir Vandic, Flavius Frasincar

Backmatter

Title: Web Information Systems Engineering – WISE 2014
Editors: Boualem Benatallah, Azer Bestavros, Yannis Manolopoulos, Athena Vakali, Yanchun Zhang
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-11749-2
Print ISBN: 978-3-319-11748-5
DOI: https://doi.org/10.1007/978-3-319-11749-2


