
2004 | Book

Web Mining: From Web to Semantic Web

First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Invited and Selected Revised Papers

Edited by: Bettina Berendt, Andreas Hotho, Dunja Mladenič, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

In the last years, research on Web mining has reached maturity and has broadened in scope. Two different but interrelated research threads have emerged, based on the dual nature of the Web:

– The Web is a practically infinite collection of documents: The acquisition and exploitation of information from these documents asks for intelligent techniques for information categorization, extraction and search, as well as for adaptivity to the interests and background of the organization or person that looks for information.

– The Web is a venue for doing business electronically: It is a venue for interaction, information acquisition and service exploitation used by public authorities, non-governmental organizations, communities of interest and private persons. When observed as a venue for the achievement of business goals, a Web presence should be aligned to the objectives of its owner and the requirements of its users. This raises the demand for understanding Web usage, combining it with other sources of knowledge inside an organization, and deriving lines of action.

The birth of the Semantic Web at the beginning of the decade led to a coercion of the two threads in two aspects: (i) the extraction of semantics from the Web to build the Semantic Web; and (ii) the exploitation of these semantics to better support information acquisition and to enhance the interaction for business and non-business purposes. Semantic Web mining encompasses both aspects from the viewpoint of knowledge discovery.

Table of contents

Frontmatter
A Roadmap for Web Mining: From Web to Semantic Web
Abstract
The purpose of Web mining is to develop methods and systems for discovering models of objects and processes on the World Wide Web and for web-based systems that show adaptive performance. Web Mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and the World Wide Web, and the more recent Semantic Web. The World Wide Web has made an enormous amount of information electronically accessible. The use of email, news and markup languages like HTML allows users to publish and read documents on a worldwide scale and to communicate via chat connections, including information in the form of images and voice records. The HTTP protocol that enables access to documents over the network via Web browsers created an immense improvement in communication and access to information. For some years these possibilities were used mostly in the scientific world, but recent years have seen an immense growth in popularity, supported by the wide availability of computers and broadband communication. The use of the internet for tasks other than finding information and direct communication is increasing, as can be seen from the interest in “e-activities” such as e-commerce, e-learning, e-government and e-science.
Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme
On the Deployment of Web Usage Mining
Abstract
In this paper we look at the deployment of web usage mining results within two key application areas of web measurement and knowledge generation for personalisation. We take a fresh look at the model of interaction between business and visitors to their web sites and the sources of data generated during these interactions. We then look at previous attempts at measuring the effectiveness of the web as a channel to customers and describe our approach, based on scenario development and measurement to gain insights into customer behaviour. We then present Concerto, a platform for deploying knowledge on customer behaviour with the aim of providing a more personalized service. We also look at approaches to measuring the effectiveness of the personalization. Various standards that are emerging in the market that can ease the integration effort of personalization and similar knowledge deployment engines within the existing IT infrastructure of an organization are also presented. Finally, current challenges in the deployment of web usage mining are presented.
Sarabjot Singh Anand, Maurice Mulvenna, Karine Chevalier
Mining the Web to Add Semantics to Retail Data Mining
Abstract
While research on the Semantic Web has mostly focused on basic technologies that are needed to make the Semantic Web a reality, there has not been a lot of work aimed at showing the effectiveness and impact of the Semantic Web on business problems. This paper presents a case study where Web and Text mining techniques were used to add semantics to data that is stored in transactional databases of retailers. In many domains, semantic information is implicitly available and can be extracted automatically to improve data mining systems. This is a case study of a system that is trained to extract semantic features for apparel products and populate a knowledge base with these products and features. We show that semantic features of these items can be successfully extracted by applying text learning techniques to the descriptions obtained from websites of retailers. We also describe several applications of such a knowledge base of product semantics that we have built including recommender systems and competitive intelligence tools and provide evidence that our approach can successfully build a knowledge base with accurate facts which can then be used to create profiles of individual customers, groups of customers, or entire retail stores.
Rayid Ghani
Semantically Enhanced Collaborative Filtering on the Web
Abstract
Item-based Collaborative Filtering (CF) algorithms have been designed to deal with the scalability problems associated with traditional user-based CF approaches without sacrificing recommendation or prediction accuracy. Item-based algorithms avoid the bottleneck in computing user-user correlations by first considering the relationships among items and performing similarity computations in a reduced space. Because the computation of item similarities is independent of the methods used for generating predictions, multiple knowledge sources, including structured semantic information about items, can be brought to bear in determining similarities among items. The integration of semantic similarities for items with rating- or usage-based similarities allows the system to make inferences based on the underlying reasons for which a user may or may not be interested in a particular item. Furthermore, in cases where little or no rating (or usage) information is available (such as in the case of newly added items, or in very sparse data sets), the system can still use the semantic similarities to provide reasonable recommendations for users. In this paper, we introduce an approach for semantically enhanced collaborative filtering in which structured semantic knowledge about items, extracted automatically from the Web based on domain-specific reference ontologies, is used in conjunction with user-item mappings to create a combined similarity measure and generate predictions. Our experimental results demonstrate that the integrated approach yields significant advantages both in terms of improving accuracy, as well as in dealing with very sparse data sets or new items.
Bamshad Mobasher, Xin Jin, Yanzan Zhou
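The combined similarity measure described in the abstract above can be illustrated with a minimal sketch. It assumes a linear blend of a rating-based and a semantic item similarity, controlled by a hypothetical weight `alpha`, and uses cosine similarity for both; the abstract does not fix these choices, and all names and data here are illustrative rather than the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(rating_sim, semantic_sim, alpha):
    """Assumed linear blend: alpha=1 uses semantics only, alpha=0 ratings only."""
    return alpha * semantic_sim + (1 - alpha) * rating_sim

def predict(user_ratings, target, item_ratings, item_features, alpha=0.5):
    """Item-based prediction: similarity-weighted average of the user's own ratings.

    user_ratings:  {item: rating given by this user}
    item_ratings:  {item: vector of ratings from all users, 0 = not rated}
    item_features: {item: vector of semantic attribute values}
    Returns None when no rated item is similar to the target.
    """
    weighted_sum = total_weight = 0.0
    for item, rating in user_ratings.items():
        if item == target:
            continue
        sim = combined_similarity(
            cosine(item_ratings[item], item_ratings[target]),
            cosine(item_features[item], item_features[target]),
            alpha,
        )
        if sim > 0:
            weighted_sum += sim * rating
            total_weight += sim
    return weighted_sum / total_weight if total_weight else None

# Toy data: "new" has no ratings yet but shares its semantic features with "A".
item_ratings = {"A": [5, 4, 0], "B": [1, 2, 0], "new": [0, 0, 0]}
item_features = {"A": [1, 0], "B": [0, 1], "new": [1, 0]}
user_ratings = {"A": 5, "B": 1}
```

With ratings alone (`alpha=0`) the new item has no usable neighbours, so `predict` returns `None`; with a semantic component the item inherits a prediction from the semantically similar item "A", which is the new-item case the abstract highlights.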
Mapping Documents onto Web Page Ontology
Abstract
The paper describes an approach to automatically mapping Web pages onto an ontology using document classification based on the Yahoo! ontology of Web pages. Techniques developed for learning on text data are used here on the hierarchical classification structure (ontology of Web documents). The high number of features is reduced by taking into account the hierarchical structure and using feature subset selection developed for the Naive Bayesian classifier. We focus on data sets with many features that also have a highly unbalanced class distribution. Documents are represented as word-vectors that include word sequences of up to five consecutive words. Based on the hierarchical structure the problem is divided into subproblems, each representing one of the categories included in the Yahoo! hierarchy. The resulting model is a set of independent classifiers, each used to predict the probability that a new document is a member of the corresponding category represented as a node in the hierarchy. Our example problem is automatic document categorization where we want to identify documents relevant for the selected category. Usually, only about 1%-10% of examples belong to the selected category. Experimental evaluation on real-world data shows that the proposed approach gives good results. Our experimental comparison of eleven feature scoring measures shows that considering data and algorithm characteristics significantly improves the performance.
Dunja Mladenić, Marko Grobelnik
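The document representation mentioned in the abstract above, word-vectors that include sequences of up to five consecutive words, might be built roughly as follows. This is a sketch under simple assumptions (lower-casing and whitespace tokenisation, no stop-word removal or feature selection); the function names are illustrative and not taken from the paper.

```python
from collections import Counter

def word_sequences(text, max_len=5):
    """Return every run of 1..max_len consecutive words, shortest runs first."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_len + 1):
        # A text with fewer than n tokens contributes no sequences of length n.
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

def word_vector(text, max_len=5):
    """Bag-of-sequences representation: feature -> number of occurrences."""
    return Counter(word_sequences(text, max_len))
```

Because the number of distinct sequences grows quickly with `max_len`, a representation like this makes the feature subset selection discussed in the abstract a practical necessity.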
Mining Web Sites Using Wrapper Induction, Named Entities, and Post-processing
Abstract
This paper presents a new framework for extracting information from collections of Web pages across different sites. In the proposed framework, a standard wrapper induction algorithm is used that exploits named entity information that has been previously identified. The idea of post-processing the extraction results is introduced for resolving ambiguous fields and improving the overall extraction performance. Post-processing involves the exploitation of two additional sources of information: field transition probabilities, based on a trained bigram model, and confidence scores, estimated for each field by the wrapper induction system. A multiplicative model that is based on the product of those two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of the new framework.
Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, Michalis Hatzopoulos
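The multiplicative post-processing model in the abstract above can be sketched as follows: for an ambiguous fragment, each candidate field is scored by the product of the bigram transition probability from the previously extracted field and the confidence score reported for that field. The field names, probabilities and the greedy left-to-right decoding are assumptions made for illustration; the abstract does not specify them.

```python
def resolve_field(prev_field, candidates, transitions):
    """Pick the candidate field with the highest transition * confidence score.

    candidates:  {field: confidence score from the wrapper}
    transitions: {(previous_field, field): bigram transition probability}
    Unseen transitions are treated as probability 0.0 (an assumption).
    """
    return max(
        candidates,
        key=lambda field: transitions.get((prev_field, field), 0.0) * candidates[field],
    )

def resolve_sequence(start_field, ambiguous_fragments, transitions):
    """Resolve fragments left to right, feeding each decision into the next."""
    resolved, prev = [], start_field
    for candidates in ambiguous_fragments:
        prev = resolve_field(prev, candidates, transitions)
        resolved.append(prev)
    return resolved

# Made-up numbers for a laptop page: after a processor field, a RAM field is
# assumed to be much more likely than a hard-disk field.
transitions = {("processor", "ram"): 0.7, ("processor", "hdd"): 0.2,
               ("ram", "hdd"): 0.8, ("ram", "ram"): 0.1}
fragment = {"ram": 0.5, "hdd": 0.6}
```

In this toy case the confidence score alone would choose `hdd` (0.6 against 0.5), while the product favours `ram` (0.35 against 0.12), showing how transition information can resolve an ambiguous field.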
Web Community Directories: A New Approach to Web Personalization
Abstract
This paper introduces a new approach to Web Personalization, named Web Community Directories, that aims to tackle the problem of information overload on the WWW. This is realized by applying personalization techniques to the well-known concept of Web Directories. The Web directory is viewed as a concept hierarchy which is generated by a content-based document clustering method. Personalization is realized by constructing community models on the basis of usage data collected by the proxy servers of an Internet Service Provider. For the construction of the community models, a new data mining algorithm, called Community Directory Miner, is used. This is a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, and specialize it to the needs of user communities. The data that are mined present a number of peculiarities such as their large volume and semantic diversity. Initial results presented in this paper illustrate the use of the methodology and provide an indication of the behavior of the new mining method.
Dimitrios Pierrakos, Georgios Paliouras, Christos Papatheodorou, Vangelis Karkaletsis, Marios Dikaiakos
Evaluation and Validation of Two Approaches to User Profiling
Abstract
In the Internet era, huge amounts of data are available to everybody, in every place and at any moment. Searching for relevant information can be overwhelming, thus contributing to the user’s sense of information overload. Building systems for assisting users in this task is often complicated by the difficulty in articulating user interests in a structured form – a profile – to be used for searching. Machine learning methods offer a promising approach to solve this problem. Our research focuses on supervised methods for learning user profiles which are predictively accurate and comprehensible.
The main goal of this paper is the comparison of two different approaches for inducing user profiles, respectively based on Inductive Logic Programming (ILP) and probabilistic methods. An experimental session has been carried out to compare the effectiveness of these methods in terms of classification accuracy, learning and classification time, when coping with the task of learning profiles from textual book descriptions rated by real users according to their tastes.
F. Esposito, G. Semeraro, S. Ferilli, M. Degemmis, N. Di Mauro, T. M. A. Basile, P. Lops
Greedy Recommending Is Not Always Optimal
Abstract
Recommender systems suggest objects to users. One form recommends documents or other objects to users searching for information on a web site. A recommender system can use data about a user to recommend information, for example web pages. Current methods for recommending are aimed at optimising single recommendations. However, usually a series of interactions is needed to find the desired information.
Here we argue that in interactive recommending a series of normal, ‘greedy’, recommendations is not the strategy that minimises the number of steps in the search. Greedy sequential recommending conflicts with the need to explore the entire space of user preferences and may lead to series of recommendations that require more steps (mouse clicks) from the user than necessary. We illustrate this with an example, analyse when this is so and outline when greedy recommending is not the most efficient.
Maarten van Someren, Vera Hollink, Stephan ten Hagen
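The argument in the abstract above can be made concrete with a toy model that is not taken from the paper. Assume the user wants exactly one of `n_items` equally likely items. A greedy recommender offers one item per step until the user accepts; an exploratory recommender first offers two groups of items per page, lets the user click the group containing the target, and only recommends the item once a single candidate is left. Both strategies and the click-counting rule are assumptions chosen for illustration.

```python
def greedy_clicks(target, n_items):
    """One item is recommended per step in a fixed order; the user clicks
    'next' on every miss and finally clicks the target itself."""
    if not 0 <= target < n_items:
        raise ValueError("target must be a valid item index")
    return target + 1

def halving_clicks(target, n_items):
    """Each page splits the remaining candidates into two groups; the user
    clicks the group holding the target, then clicks the last remaining item."""
    if not 0 <= target < n_items:
        raise ValueError("target must be a valid item index")
    low, high = 0, n_items  # candidates are the indices low .. high-1
    clicks = 0
    while high - low > 1:
        middle = (low + high) // 2
        if target < middle:
            high = middle
        else:
            low = middle
        clicks += 1
    return clicks + 1

def average_clicks(strategy, n_items):
    """Mean number of clicks when every item is equally likely to be the target."""
    return sum(strategy(t, n_items) for t in range(n_items)) / n_items
```

In this model the greedy strategy needs 32.5 clicks on average for 64 items, while the exploratory strategy always needs 7, because each of its steps gathers information about the user's preference instead of betting on a single item.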
An Approach to Estimate the Value of User Sessions Using Multiple Viewpoints and Goals
Abstract
Web-based commerce systems fail to achieve many of the features that enable small businesses to develop a friendly human relationship with customers. Although many enterprises have worried about user identification to solve the problem, the solution goes far beyond trying to find out what a navigator’s behavior looks like. Many approaches have recently been proposed to enrich the data in web logs with semantics related to the business so that web mining algorithms can later be applied to discover patterns and trends. In this paper we present an innovative method of log enrichment in which several goals and viewpoints of the organization owning the site are taken into account. By later applying discriminant analysis to the information enriched this way, it is possible to identify the relevant factors that contribute most to the success of a session for each viewpoint under consideration. The method also helps to estimate ongoing session value in terms of how the company’s objectives and expectations are being achieved.
E. Menasalvas, S. Millán, M. S. Pérez, E. Hochsztain, A. Tasistro
Monitoring the Evolution of Web Usage Patterns
Abstract
With the ongoing shift from off-line to on-line business processes, the Web has become an important business platform, and for most companies it is crucial to have an on-line presence which can be used to gather information about their products and/or services. However, in many cases there is a difference between the intended and the effective usage of a web site and, presently, many web site operators analyse the usage of their sites to improve their usability. But particularly in the context of the Internet, content and structure change rather quickly, and the way a web site is used may change often, either due to changing information needs of its visitors, or due to an evolving user group. Therefore, the discovered usage patterns need to be updated continuously to always reflect the actual behaviour of the visitors.
In this article, we introduce PAM, an automated Pattern Monitor, which can be used to observe changes to the behaviour of a web site’s visitors. It is based on a temporal representation of rules in which both the content of the rule and its statistical properties are modelled. It observes pattern change as evolution of the statistical measurements captured for a rule throughout its entire lifetime and notifies the user about interesting changes within the rule base. We present PAM in a case study on the evolution of web usage patterns. In particular, we discovered association rules from a web-server log that show which pages tend to be visited within the same user session. These patterns have been imported into the monitor, and their evolution throughout a period of 8 months has been analysed. Our results show that PAM is particularly suitable to gain insights into the changes of a rule base over time.
Steffan Baron, Myra Spiliopoulou
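The kind of monitoring described in the abstract above, following the statistics of an association rule across time periods and reporting notable changes, can be sketched in a few lines. The sketch recomputes support and confidence of a rule for each period of sessions and flags periods where support moved by at least a fixed amount; the threshold rule and all names are assumptions for illustration and are not PAM's actual change-detection method.

```python
def rule_stats(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    sessions is a list of sets of page names, one set per user session.
    """
    total = len(sessions)
    with_antecedent = sum(1 for pages in sessions if antecedent <= pages)
    with_both = sum(1 for pages in sessions if (antecedent | consequent) <= pages)
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

def monitor(periods, antecedent, consequent, min_change=0.1):
    """Return the per-period (support, confidence) history and the indices of
    periods whose support differs from the previous period by >= min_change."""
    history = [rule_stats(sessions, antecedent, consequent) for sessions in periods]
    alerts = [
        t for t in range(1, len(history))
        if abs(history[t][0] - history[t - 1][0]) >= min_change
    ]
    return history, alerts

# Two toy periods of four sessions each for the rule {home} -> {products}.
periods = [
    [{"home", "products"}, {"home", "products"}, {"home"}, {"contact"}],
    [{"home", "products"}, {"home"}, {"home"}, {"contact"}],
]
history, alerts = monitor(periods, {"home"}, {"products"})
```

Here support drops from 0.5 to 0.25 between the two periods, so the second period is flagged; a production monitor would more plausibly use statistical tests than a fixed threshold.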
Backmatter
Metadata
Title
Web Mining: From Web to Semantic Web
Edited by
Bettina Berendt
Andreas Hotho
Dunja Mladenič
Maarten van Someren
Myra Spiliopoulou
Gerd Stumme
Copyright year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-30123-3
Print ISBN
978-3-540-23258-2
DOI
https://doi.org/10.1007/b100615