
2006 | Book

Advances in Web Intelligence and Data Mining

Editors: Mark Last, Piotr S. Szczepaniak, Zeev Volkovich, Abraham Kandel

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Computational Intelligence


About this book

Today, in the middle of the first decade of the 21st century, the Internet has become a major communication medium, where virtually any kind of content can be transferred instantly and reliably between individual users and entire organizations located in any part of the globe. The World Wide Web (WWW) has a tremendous effect on our daily activities at work and at home. Consequently, more effective and efficient methods and technologies are needed to make the most of the Web’s nearly unlimited potential. The new Web-related research directions include intelligent methods usually associated with the fields of computational intelligence, soft computing, and data mining. AWIC, the “Atlantic Web Intelligence Conferences”, continue to be a forum for exchange of new ideas and novel practical solutions in this new and exciting area. The conference was born as an initiative of the WIC-Poland and the WIC-Spain Research Centers, both belonging to the Web Intelligence Consortium – WIC (http://wi-consortium.org/). Prior to this year, three AWIC conferences have been held: in Madrid, Spain (2003), in Cancun, Mexico (2004), and in Łódź, Poland (2005). AWIC 2006 took place in Beer-Sheva, Israel during June 5–7, 2006, organized locally by Ben-Gurion University of the Negev. The book presents state-of-the-art developments in the field of computationally intelligent methods applied to various aspects and ways of Web exploration.

Table of Contents

Frontmatter

Part 1

DataRover: An Automated System for Extracting Product Information From Online Catalogs

The increasing number of e-commerce Web sites introduces numerous challenges in organizing and searching product information across multiple sites. This problem is further exacerbated by the various presentation templates that different Web sites use for their product information and by the different ways in which they store product information in their catalogs. This paper describes the DataRover system, which can automatically crawl and extract all products from online catalogs. DataRover is based on pattern mining algorithms and domain-specific heuristics which utilize navigational and presentation regularities to identify taxonomy, list-of-products, and single-product segments within an online catalog. Next, it uses the inferred patterns to extract data from all such segments and to automatically transform an online catalog into a database of categorized products. We also provide experimental results to demonstrate the efficacy of DataRover.

Syed Toufeeq Ahmed, Srinivas Vadrevu, Hasan Davulcu
A New Path Generalization Algorithm for HTML Wrapper Induction

Recently it was shown that Inductive Logic Programming can be successfully applied to data extraction from HTML. However, the approach suffers from two problems: high computational complexity with respect to the number of nodes of the target document and to the arity of the extracted tuples. In this note we address the first problem by proposing an efficient path generalization algorithm for learning rules to extract single information items. The presentation is supplemented with a description of a sample experiment.
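To make the idea of path generalization concrete, here is a minimal sketch that generalizes the root-to-node tag paths of several training examples by wildcarding the steps on which they disagree. This is an assumed simplification; the paper's actual algorithm is more elaborate and, unlike this sketch, is not limited to equal-length paths.

```python
# Minimal sketch of path generalization for wrapper induction.
# Assumption: each training example is the tag path from the document
# root to a target node; the paper's actual algorithm may differ.

def generalize_paths(paths):
    """Generalize equal-length tag paths by wildcarding disagreeing steps."""
    if len({len(p) for p in paths}) != 1:
        raise ValueError("this sketch handles equal-length paths only")
    generalized = []
    for steps in zip(*paths):
        # Keep a step if all examples agree on it, otherwise use a wildcard.
        generalized.append(steps[0] if len(set(steps)) == 1 else "*")
    return generalized

paths = [
    ["html", "body", "table", "tr[1]", "td[2]"],
    ["html", "body", "table", "tr[2]", "td[2]"],
]
print(generalize_paths(paths))  # ['html', 'body', 'table', '*', 'td[2]']
```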

Costin Bădică, Amelia Bădică, Elvira Popescu
Trustworthiness Measurement Methodology for e-Business

The purpose of the Trustworthiness Measure is (a) to determine the quality of trusted agents and (b) once the trusting agent has determined and recorded the trustworthiness, or quality, of the trusted agent, to allow the trusting agent to use this recorded quality when another agent queries it about the trusted agent. Clearly, if the trusting agent has not determined the trustworthiness of the trusted agent and recorded it, it will not be in a position to communicate recommendations about the trusted agent. Unfortunately, the existing literature offers no methodology for quantifying and expressing the trustworthiness of the trusted agent. In this paper we propose a methodology that the trusting agent needs to follow in order to determine the trustworthiness of the trusted agent. This methodology supports trusted business transactions and virtual collaboration, keeps the service-oriented environment trustworthy, and helps provide a transparent and harmonious character to distributed, heterogeneous, anonymous, pseudo-anonymous, and non-anonymous e-service networks.

Farookh Khadeer Hussain, Elizabeth Chang, Tharam S. Dillon
Routing Using Messengers in Sparse and Disconnected Mobile Sensor Networks

Sparse mobile sensor networks, such as those used in forest ecology and modern battlefield applications, frequently become disconnected. Unfortunately, most existing routing protocols in mobile wireless networks mainly address connected networks, whether sparse or dense. In this paper, we study the problem of dynamic routing in sparse and disconnected mobile sensor networks utilizing messengers. We propose two route discovery protocols: Genetic Fuzzy Straight Line Moving of Messengers (GFSLMM) and Genetic Fuzzy Flexible Sharing Policy of Messengers (GFFSPM). A preliminary simulation shows the efficacy of our protocols.

Qiong Cheng, Yanqing Zhang, Xiaolin Hu, Nisar Hundewale, Alex Zelikovsky
Content Consistency Model for Pervasive Internet Access

In this paper, we propose a new content consistency model for pervasive Internet access. We argue that content retrieved over the Internet consists of not only the data object but also the attributes needed to perform appropriate network or presentation related functions such as caching, content reuse, and content adaptation. With this model, four types of content consistency are defined. To gain deeper insight into the current situation of content consistency over the Internet, real content on replica / CDN (Content Delivery Network) servers was monitored and analyzed. Surprisingly, we found numerous discrepancies in both the data object and its attributes when comparing the original copy of the content with the retrieved copy. This result is important because these discrepancies have direct implications for the trustworthiness of information over the Internet.

Chi-Hung Chi, Lin Liu, Choon-Keng Chua
Content Description Model and Framework for Efficient Content Distribution

In this paper, we propose a content description model and framework for efficient content distribution. The content description model employs ideas from the Resource Description Framework and External Annotation, which allow flexible descriptions of Web content. The model also allows a server to efficiently select any subset of the descriptions of any Web page and deliver it to a proxy. The framework consists of algorithms for the proxies to map user preferences and device capabilities to a set of functions to be performed, for the server to select and deliver the necessary content descriptions to the proxy, and for the proxy to efficiently cache and reuse the content descriptions. With our content description model and framework, best-fit content presentation for pervasive Internet access becomes possible.

Chi-Hung Chi, Lin Liu, Shutao Zhang
Exploiting Wikipedia in Integrating Semantic Annotation with Information Retrieval

The Semantic Web can be seen as an extension of the current Web in which information is given a formal meaning, making it understandable by computers. The process of giving formal meaning to Web resources is commonly known as semantic annotation. In this paper we describe an approach to integrating the semantic annotation task with the information retrieval task. This approach makes use of relevance feedback techniques and exploits the information generated and maintained by Wikipedia users. The validity of our approach is currently being tested by means of a Web portal, which also uses the annotations defined by users to provide basic semantic search facilities.

Norberto Fernández-García, José M. Blázquez-del-Toro, Luis Sánchez-Fernández, Vicente Luque
Ontology based Query Rewriting on Integrated XML based Information Systems

The paper discusses the extension of querying on integrated XML-based information systems (XISs) wrapped with ontologies. It discusses complex ontology mapping patterns with semantically enhanced similarity, such as subsumption mapping and composition mapping, as well as the semantic query mechanism, which extends the TAX-based XML query algebra, on XISs wrapped with local ontologies. Because common XML query languages such as XQuery and XUpdate can be translated into the TAX-based XML query algebra, the extension is manageable. Complex ontology mapping ensures that distributed querying can resolve semantic inconsistency, and it increases efficiency by refining queries and reducing redundancy.

Jinguang Gu, Yi Zhou
Visually Exploring Concept-Based Fuzzy Clusters in Web Search Results

Users of web search systems often have difficulty determining the relevance of search results to their information needs. Clustering has been suggested as a method for making this task easier. However, this introduces new challenges such as naming the clusters, selecting multiple clusters, and re-sorting the search results based on the cluster information. To address these challenges, we have developed Concept Highlighter, a tool for visually exploring concept-based fuzzy clusters in web search results. This tool automatically generates a set of concepts related to the users’ queries, and performs single-pass fuzzy c-means clustering on the search results using these concepts as the cluster centroids. A visual interface is provided for interactively exploring the search results. In this paper, we describe the features of Concept Highlighter and its use in finding relevant documents within the search results through concept selection and document surrogate highlighting.
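The single-pass clustering step can be illustrated with the standard fuzzy c-means membership computation, holding the concept vectors fixed as centroids. The vector representations below are toy data, not the system's actual features.

```python
import numpy as np

# Minimal sketch of the single-pass step: with concept vectors fixed
# as centroids, each document's fuzzy memberships follow the standard
# fuzzy c-means formula. Document/concept vectors here are assumed.

def fuzzy_memberships(docs, concepts, m=2.0, eps=1e-12):
    """docs: (n, d) array; concepts: (c, d) array of fixed centroids."""
    # Pairwise distances between documents and concept centroids.
    dist = np.linalg.norm(docs[:, None, :] - concepts[None, :, :], axis=2) + eps
    # u[i, j] = 1 / sum_k (dist[i, j] / dist[i, k]) ** (2 / (m - 1))
    ratio = dist[:, :, None] / dist[:, None, :]
    u = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
    return u  # each row sums to 1: one document's memberships across concepts

docs = np.random.rand(5, 10)      # toy document vectors
concepts = np.random.rand(3, 10)  # toy concept centroids
print(fuzzy_memberships(docs, concepts).round(2))
```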

Orland Hoeber, Xue-Dong Yang
A Grid Scheduling Optimization Strategy Based on Fuzzy Multi-Attribute Group Decision-Making

In grid environments, scheduling is more complex than in conventional high-performance computing systems, and it is one of the major factors affecting grid performance. In order to optimize grid scheduling, various factors have to be considered. By combining analysis and prediction methods based on different principles and approaches, we can make comprehensive decisions in different scenarios and provide a reference for scheduling optimization. In this paper, a method of fuzzy multi-attribute group decision-making is proposed, which introduces fuzzy sets and their operations into the decision-making process and reflects a group or collective ranking of alternatives based on individual preferences for those alternatives. The flexible selection models greatly enhance expressiveness and adaptability. The experiments show that grid scheduling with this method achieves high performance.

It should be pointed out that the decision-making approach in this paper relies on compensability between the decision attributes. In some cases, however, compensability between the decision attributes is conditional or even absent. Other comprehensive decision-making approaches are therefore needed for such cases, and they will be the focus of our further research.
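As an illustration of the kind of fuzzy multi-attribute group aggregation involved, here is one common construction (triangular fuzzy ratings averaged over experts, weighted over attributes, and defuzzified by centroid); the paper's exact model may differ.

```python
import numpy as np

# Minimal sketch of one common fuzzy multi-attribute group-decision
# construction; ratings and weights are illustrative assumptions.

# ratings[e][a][j] = expert e's triangular fuzzy rating (l, m, u)
# of alternative a on attribute j.
ratings = np.array([
    [[(2, 3, 4), (6, 7, 8)], [(5, 6, 7), (4, 5, 6)]],   # expert 1
    [[(3, 4, 5), (5, 6, 7)], [(6, 7, 8), (3, 4, 5)]],   # expert 2
], dtype=float)
attr_weights = np.array([0.6, 0.4])

group = ratings.mean(axis=0)                                  # pool the experts
weighted = (group * attr_weights[None, :, None]).sum(axis=1)  # per alternative
scores = weighted.mean(axis=1)              # centroid defuzzification of (l, m, u)
print("ranking:", np.argsort(-scores))      # best alternative first
```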

Jin Huang, Hai Jin, Xia Xie, Jun Zhao
An Adaptive PC to Mobile Web Contents Transcoding System Based on MPEG-21 Multimedia Framework

The purpose of this paper is to deliver Web content designed for PCs to various multi-platform devices such as PDAs and other portable devices. Conventional approaches did not consider the full range of such devices; they created mobile content in advance and then transmitted it to a limited set of mobile devices. The critical problem here is generating mobile content suitable for all kinds of mobile devices from PC Web content. This paper proposes a service system for delivering wired Web content to various portable devices using the MPEG-21 Multimedia Framework. It does not create mobile content for each device; instead, it uses the DIDL of MPEG-21 as an intermediate language to express the structure, resources, and descriptions of mobile content. In DIDL, multimedia resources are transcoded offline in advance, while the description part is converted in real time as soon as a service is requested by the end user. Mobile content integrates the adapted resources with the appropriate descriptions and is then transmitted. In addition, this paper proposes multi-level caching for reusing mobile content and reports results from an experimental system.

Euisun Kang, Daehyuck Park, JongKeun Kim, Kunjung Sim, Younghwan Lim
Recommendation Framework for Online Social Networks
Przemysław Kazienko, Katarzyna Musiał
Packet Manipulating Based on Zipf’s Distribution to Protect from Attack in P2P Information Retrieval

In composing P2P systems, the most important point is enabling users to exchange information and content anonymously while retaining exclusive rights. Most packets transferred from node to node do not include the sender’s IP address, and these packets are transmitted through dynamic routing carried out by intermediate hosts. Moreover, since dynamic routing is renewed (updated) periodically, it is impossible to know which host transferred a packet first and which is the designated recipient host; the network therefore provides basic anonymity. However, when content is uploaded and downloaded between a user and a provider, information on both sides is exposed, which undermines this anonymity. In order to settle this problem, this study calculates the distribution of Query and QueryHit packets within the whole network (a different approach from downloading cached information to protect the identity of user and provider), manipulates QueryHit packets on the basis of this calculation, and transfers content after caching it. This provides secured anonymity to the intermediate node performing a proxy role between user and provider.
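A hypothetical sketch of the Zipf-based calculation step: fit observed QueryHit frequencies against a Zipf rank-frequency model and flag the items that deviate most, which an intermediate node could then cache or manipulate. Everything beyond the use of Zipf's distribution is an assumption here.

```python
import numpy as np

# Minimal sketch: compare observed QueryHit frequencies with a Zipf
# rank-frequency model, f(r) ~ C / r**s. How the paper then uses the
# fit to decide which content an intermediate node caches is assumed.

def zipf_expected(freqs, s=1.0):
    """Expected counts under Zipf's law for the observed total volume."""
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    weights = ranks ** (-s)
    return freqs.sum() * weights / weights.sum()

observed = np.array(sorted([120, 60, 40, 31, 22, 19], reverse=True), float)
expected = zipf_expected(observed)
# Items far above the Zipf expectation are the hottest caching candidates.
print(np.round(observed - expected, 1))
```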

Byung-Ryong Kim, Ki-Chang Kim
Investigation of the Fuzzy System for the Assessment of Cadastre Operators’ Work
Dariusz Król, Grzegorz Kukla, Tadeusz Lasota, Bogdan Trawiński
DLAIP: A Description Logic Based Approach for Dynamic Semantic Web Services Composition

The Description Logic, which possesses strong knowledge representation and reasoning capabilities, is the logical basis of Semantic Web ontology languages such as OWL and OWL-S. AI planning, which provides an effective method for solving planning and task decomposition problems in AI, offers good modeling capability for action state transformations. Based on the merits of Description Logic and AI planning, this paper proposes a service composition mechanism, DLAIP, and verifies its feasibility in Description Logic. The results show that this composition mechanism is not only feasible but also helpful for the semantic modeling of the service composition process in the Semantic Web.

Yingjie Li, Li Wang, Xueli Yu, Wen Li, Yu Xing
Adding Support to User Interaction in Egovernment Environment

The paper presents a way of plugging a dialog system into a platform dedicated to eGovernment services. All platform modules are compliant with a central multi-lingual ontology used to represent the domain knowledge (the social care domain), the semantic descriptions of services, and the semantic indexing of documents. The dialog system is built as a multi-agent system in charge of responding to various types of users’ questions by querying the main modules of the platform and the ontology itself. The paper shows some scenarios which illustrate questions involving the discovery of services and the search for documents by the respective agents.

Claude Moulin, Fathia Bettahar, Jean-Paul A. Barthès, Marco Sbodio, Nahum Korda
Using Consensus Methodology in Processing Inconsistency of Knowledge

This paper presents two structures for representing inconsistency of knowledge on the semantic level. The first structure is based on a relational model, which is multi-valued and multi-attribute; this structure has been analyzed in the earlier literature. For the second structure (called the logical structure), in the form of clauses, this work presents its semantics, the method for distance calculation, and the algorithm for consensus computing. Future work should concern working out an algorithm for determining O2-consensus for inconsistent knowledge in the logical structure.
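A minimal sketch of consensus computing in the spirit described, with the symmetric-difference distance standing in for the paper's distance function:

```python
# Minimal sketch of consensus choice: given a profile of knowledge
# states (here, sets of clauses as frozensets of literals), pick the
# candidate that minimizes the total distance to the profile. The
# symmetric-difference distance is an assumed stand-in.

def distance(a, b):
    return len(a ^ b)  # symmetric difference between clause sets

def consensus(profile, candidates):
    return min(candidates, key=lambda c: sum(distance(c, p) for p in profile))

profile = [frozenset({"p", "q"}), frozenset({"p", "r"}), frozenset({"p"})]
candidates = [frozenset({"p"}), frozenset({"p", "q"}), frozenset({"q", "r"})]
print(sorted(consensus(profile, candidates)))  # ['p']
```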

Ngoc Thanh Nguyen
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

In web archive research, large temporal document collections are necessary in order to compare and evaluate new strategies and algorithms. Such collections are not easily available, and an alternative is to create synthetic document collections. In this paper we describe how to generate synthetic temporal document collections, how this is realized in the TDocGen temporal document generator, and we also present a study of the quality of the document collections created by TDocGen.

Kjetil Nørvåg, Albert Overskeid Nybø
Predicting Stock Trends with Time Series Data Mining and Web Content Mining

This paper presents a new methodology for predicting stock trends and making trading decisions based on the combination of Data Mining and Web Content Mining techniques. While research in both areas is quite extensive, inference from time series stock data and time-stamped news stories collected from the World Wide Web requires further exploration. Our prediction models are based on the content of time-stamped web documents in addition to traditional numerical time series data. The stock trading system based on the proposed methodology (ADMIRAL) will be simulated and evaluated on real-world series of news stories and stock data using several known classification algorithms. The main performance measures will be the prediction accuracy of the induced models and, more importantly, the profitability of the investments made by using system recommendations based on these predictions.
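A toy sketch of the combined setup: numeric lag features from the price series joined with TF-IDF features from time-stamped news, feeding a single classifier. The feature choices and data are illustrative, not ADMIRAL's actual design.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Minimal sketch: numeric time-series features joined with text
# features from time-stamped news, feeding one classifier.
prices = np.array([10.0, 10.2, 10.1, 10.5, 10.4, 10.8, 10.6, 11.0])
news = ["merger rumors lift sector", "weak earnings reported",
        "analyst upgrade issued", "ceo resigns unexpectedly",
        "strong guidance raised", "lawsuit settled favorably"]
returns = np.diff(prices) / prices[:-1]
labels = (returns[1:] > 0).astype(int)        # next-step up/down trend

text_X = TfidfVectorizer().fit_transform(news).toarray()
lag_X = returns[:-1].reshape(-1, 1)           # previous return as a feature
X = np.hstack([lag_X, text_X])

model = LogisticRegression(max_iter=1000).fit(X, labels)
print(model.predict(X))                       # in-sample sanity check
```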

Gil Rachlin, Mark Last
CatS: A Classification-Powered Meta-Search Engine

CatS is a meta-search engine that utilizes text classification techniques to improve the presentation of search results. After posting a query, the user is offered an opportunity to refine the results by browsing through a category tree derived from the dmoz Open Directory topic hierarchy. This paper describes some key aspects of the system (including HTML parsing, classification and displaying of results), outlines the text categorization experiments performed in order to choose the right parameters for classification, and puts the system into the context of related work on (meta-)search engines. The approach of using a separate category tree represents an extension of the standard relevance list, and provides a way to refine the search on need, offering the user a non-imposing but potentially powerful tool for locating needed information quickly and efficiently. The current implementation of CatS may be considered a baseline, on top of which many enhancements are possible.

Miloš Radovanović, Mirjana Ivanović
A Decision Tree Framework for Semi-Automatic Extraction of Product Attributes from the Web

Semi-automatic extraction of product attributes from URLs is an important issue for comparison-shopping agents. In this paper we examine a novel decision tree framework for extracting product attributes. The core induction algorithmic framework consists of three main stages. In the first stage, a large set of regular-expression-based patterns is induced by employing a longest common subsequence algorithm. In the second stage we filter the initial set and keep only the most useful patterns. In the last stage we represent the extraction problem (in which the domain values are not known in advance) as a classification problem and employ an ensemble of decision trees. An empirical study performed on real-world extraction tasks illustrates the capability of the proposed framework.
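The first stage can be illustrated with a small sketch: derive a regular-expression-like pattern from two example values via their longest common subsequence, wildcarding everything that differs. The filtering and decision-tree stages are omitted.

```python
from functools import lru_cache
import re

# Minimal sketch of LCS-based pattern induction; the paper's later
# filtering and classification stages are not shown.

def lcs(a, b):
    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == len(a) or j == len(b):
            return ""
        if a[i] == b[j]:
            return a[i] + rec(i + 1, j + 1)
        left, right = rec(i + 1, j), rec(i, j + 1)
        return left if len(left) >= len(right) else right
    return rec(0, 0)

def induce_pattern(a, b):
    # Anchor on LCS characters; allow anything between them.
    core = ".*".join(re.escape(ch) for ch in lcs(a, b))
    return re.compile(core)

pat = induce_pattern("Price: $12.99", "Price: $7.50")
print(bool(pat.search("Price: $3.25")))  # True
```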

Lior Rokach, Roni Romano, Barak Chizi, Oded Maimon
A Dynamic Generation Algorithm for Meta Process in Intelligent Platform of Virtual Travel Agency

Intelligent Platform of Virtual Travel Agency (IPVita) is a platform which can intelligently and automatically compose all kinds of travel web services into a satisfactory trip for tourists. This paper introduces the concepts of Meta Services and Meta Process, which can greatly simplify the complexity of web services composition. Corresponding to these concepts, a Dynamic Generation Algorithm for Meta Process is presented instead of predefining a rigid workflow model. Unlike some AI planning methods, this approach deals with the various requirements of tourists more effectively and flexibly. Additionally, this approach can also resolve similar problems in other domains.

Qi Sui, Hai-yang Wang
Linguistic Summaries of Standardized Documents

Automatic summarization of databases has become indispensable in a number of tasks involving information exchange or strategic decision making. It is also important when huge collections of documents must be clustered. The present paper deals with the summarization of standardized databases containing both numerical and textual records. The method and its variations are described and explained with illustrative examples.

Piotr S. Szczepaniak, Joanna Ochelska
A Web-knowledge-based Clustering Model for Gene Expression Data Analysis

Current microarray technology provides ways to obtain time series expression data for studying a wide range of biological systems. However, the expression data tends to contain considerable noise, which may deteriorate the clustering quality. We propose a web-knowledge-based clustering method to incorporate knowledge of gene-gene relations into the clustering procedure. Our method first obtains the biological roles of each gene through a web mining process, next groups genes based on their biological roles and the Gene Ontology, and finally applies a semi-supervised clustering model where the supervision is provided by the detected gene groups. Under the guidance of this knowledge, the clustering procedure is able to cope with data noise. We evaluate our method on a publicly available data set of human fibroblast response to serum. The experimental results demonstrate improved clustering quality compared to clustering methods without any prior knowledge.
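One standard way to realize such supervision is seeded k-means, where the detected gene groups supply the initial centroids and their member genes keep their group labels; the paper's semi-supervised model may differ in detail.

```python
import numpy as np

# Minimal sketch of seeded (semi-supervised) k-means, an assumed
# stand-in for the paper's supervision mechanism.

def seeded_kmeans(X, seeds, iters=20):
    """X: (n, d) expression profiles; seeds: list of index lists, one per group."""
    fixed = np.full(len(X), -1)
    for lab, idx in enumerate(seeds):
        fixed[idx] = lab                      # supervised genes keep their group
    centroids = np.array([X[idx].mean(axis=0) for idx in seeds])
    labels = fixed.copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.where(fixed >= 0, fixed, d.argmin(axis=1))
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(seeds))])
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])
print(np.bincount(seeded_kmeans(X, seeds=[[0, 1, 2], [20, 21, 22]])))  # [20 20]
```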

Na Tang, V. Rao Vemuri

Part 2

Estimations of Similarity in Formal Concept Analysis of Data with Graded Attributes

We study similarity in formal concept analysis of data tables with graded attributes. We focus on similarity related to formal concepts and concept lattices, i.e., the outputs of formal concept analysis. We present several formulas for estimating the similarity of outputs in terms of the similarity of inputs. The results answer some problems which arose in previous investigations, as well as some natural questions concerning similarity in conceptual data analysis. The derived formulas enable us to compute an estimate of the similarity of concept lattices much faster than one can compute their exact similarity. We omit proofs due to lack of space.

Radim Bělohlávek, Vilém Vychodil
Kernels for the Relevance Vector Machine - An Empirical Study

The Relevance Vector Machine (RVM) is a generalized linear model that can use kernel functions as basis functions. Experiments with the Matérn kernel indicate that the kernel choice has a significant impact on the sparsity of the solution. Furthermore, not every kernel is suitable for the RVM. Our experiments indicate that the Matérn kernel of order 3 is a good initial choice for many types of data.
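For reference, a Matérn kernel in its common closed form for smoothness ν = 3/2 is sketched below; the correspondence to the paper's "order 3" parameterization is an assumption.

```python
import numpy as np

# Minimal sketch of a Matérn kernel, nu = 3/2 closed form:
# k(r) = (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l).
# Mapping to the paper's "order 3" is assumed, not confirmed.

def matern32(x, z, length_scale=1.0):
    r = np.linalg.norm(np.atleast_2d(x)[:, None] - np.atleast_2d(z)[None], axis=2)
    a = np.sqrt(3.0) * r / length_scale
    return (1.0 + a) * np.exp(-a)

X = np.array([[0.0], [0.5], [2.0]])
print(matern32(X, X).round(3))   # Gram matrix usable as RVM basis functions
```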

David Ben-Shimon, Armin Shmilovici
A Decision-Tree Framework for Instance-space Decomposition

This paper presents a novel instance-space decomposition framework for decision trees. According to this framework, the original instance space is decomposed into several subspaces in an axis-parallel manner, and a different classifier is assigned to each subspace. Subsequently, an unlabelled instance is classified by the classifier of the subspace to which the instance belongs. An experimental study conducted to compare various implementations of this framework indicates that previously presented implementations can be improved in terms of both accuracy and computation time.
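A minimal sketch of the idea, assuming a single axis-parallel split with one classifier per subspace; how the framework selects attributes and thresholds is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Minimal sketch of axis-parallel instance-space decomposition:
# split on one attribute's threshold, train a separate classifier per
# subspace, and route unlabelled instances accordingly.

class AxisDecomposition:
    def __init__(self, attr, threshold):
        self.attr, self.threshold = attr, threshold
        self.low = LogisticRegression(max_iter=500)
        self.high = LogisticRegression(max_iter=500)

    def fit(self, X, y):
        mask = X[:, self.attr] <= self.threshold
        self.low.fit(X[mask], y[mask])
        self.high.fit(X[~mask], y[~mask])
        return self

    def predict(self, X):
        mask = X[:, self.attr] <= self.threshold
        out = np.empty(len(X), dtype=int)
        out[mask] = self.low.predict(X[mask])
        out[~mask] = self.high.predict(X[~mask])
        return out

X, y = make_classification(n_samples=200, random_state=0)
model = AxisDecomposition(attr=0, threshold=np.median(X[:, 0])).fit(X, y)
print((model.predict(X) == y).mean())         # in-sample accuracy
```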

Shahar Cohen, Lior Rokach, Oded Maimon
On prokaryotes’ clustering based on curvature distribution

Massive determination of complete genome sequences has led to the development of different tools for genome comparison. Our approach is to compare genomes according to typical genomic distributions of a mathematical function that reflects a certain biological function. In this study we carried out a comprehensive genome analysis of DNA curvature distributions before the starts and after the ends of prokaryotic genes with the assistance of mathematical and statistical procedures. Due to the extensive amount of data, we were able to define the factors influencing the curvature distribution in promoter and terminator regions. Two clustering methods, K-means and PAM, were applied and produced very similar clusterings that reflect genomic attributes and the environmental conditions of species’ habitats.

L. Kozobay-Avraham, A. Bolshoy, Z. Volkovich
Look-Ahead Mechanism Integration in Decision Tree Induction Models

Most decision tree induction algorithms use a greedy splitting criterion. One possible way to avoid this greediness is to look ahead in order to make better splits. Look-ahead has not been used in most decision tree methods, primarily because of its high computational complexity and its questionable contribution to predictive accuracy. In this paper we describe a new look-ahead approach to the induction of decision tree models. We present a computationally efficient algorithm which evaluates the quality of variable-depth subtrees in order to determine the best split attribute out of a set of candidate attributes whose splitting criterion values are statistically indistinguishable from the best one.
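A minimal sketch of one-step look-ahead scoring: a candidate split is judged by the best impurity achievable one level deeper rather than by its immediate gain. The XOR example shows why this helps where greedy gain is uninformative; the paper's variable-depth evaluation and statistical test are omitted.

```python
import numpy as np
from collections import Counter

# Minimal sketch of one-step look-ahead split scoring (assumed
# simplification of the paper's variable-depth evaluation).

def entropy(y):
    counts = np.array(list(Counter(y).values()), float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_child_entropy(X, y, exclude):
    """Weighted entropy of the best follow-up split on remaining attributes."""
    if len(set(y)) <= 1:
        return 0.0
    scores = []
    for j in range(X.shape[1]):
        if j == exclude:
            continue
        parts = [y[X[:, j] == v] for v in set(X[:, j])]
        scores.append(sum(len(p) / len(y) * entropy(p) for p in parts))
    return min(scores) if scores else entropy(y)

def lookahead_score(X, y, attr):
    total = 0.0
    for v in set(X[:, attr]):
        mask = X[:, attr] == v
        total += mask.mean() * best_child_entropy(X[mask], y[mask], attr)
    return total  # lower is better

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                    # XOR: greedy gain is blind here
print(lookahead_score(X, y, 0), lookahead_score(X, y, 1))  # both perfect: 0.0 0.0
```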

Michael Roizman, Mark Last
Feature Selection by Combining Multiple Methods

Feature selection is the process of identifying relevant features in the dataset and discarding everything else as irrelevant and redundant. Since feature selection reduces the dimensionality of the data, it enables learning algorithms to operate more effectively and rapidly. In some cases, classification performance can be improved; in other instances, the obtained classifier is more compact and can be easily interpreted. Much work has been done on feature selection methods for creating ensembles of classifiers; these works examine how feature selection can help an ensemble of classifiers gain diversity. This paper examines a different direction, namely whether ensemble methodology can be used to improve feature selection performance. We present a general framework for creating several feature subsets and then combining them into a single subset. Theoretical and empirical results presented in this paper validate the hypothesis that this approach can help find a better feature subset.
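One simple instance of the framework is rank aggregation: let several selection methods rank the features, average the ranks, and keep the top of the combined ranking. The combination rule below is illustrative, not the paper's only option.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Minimal sketch of combining feature-selection methods by averaging
# the rank each method assigns to every feature.

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
Xn = MinMaxScaler().fit_transform(X)          # chi2 needs non-negative input

scores = [
    chi2(Xn, y)[0],
    mutual_info_classif(X, y, random_state=0),
    RandomForestClassifier(random_state=0).fit(X, y).feature_importances_,
]
# Higher score -> better rank (rank 0 is best); average ranks across methods.
ranks = np.mean([np.argsort(np.argsort(-s)) for s in scores], axis=0)
subset = np.argsort(ranks)[:5]
print("selected features:", sorted(subset))
```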

Lior Rokach, Barak Chizi, Oded Maimon
Clustering and Classification of Textual Documents Based on Fuzzy Quantitative Similarity Measure — a Practical Evaluation

Clustering enables more effective information retrieval. In practice, similar approaches are used for ranking and clustering. This paper presents a practical evaluation of a method for clustering documents based on a textual fuzzy similarity measure. The similarity measure was originally introduced in [12] (cf. also [13]) and was later used in Internet-related applications [14, 15, 18]. Two textual databases [21, 22] with predefined clusters and diverse levels of freedom in document contents were used for experiments employing some variants of the basic clustering method [19].

Piotr S. Szczepaniak
Oriented k-windows: A PCA driven clustering method

In this paper we present the application of Principal Component Analysis (PCA) on subsets of the dataset to better approximate clusters. We focus on a specific density-based clustering algorithm, k-Windows, that holds particular promise for problems of moderate dimensionality. We show that the resulting algorithm, which we call Oriented k-Windows (OkW), is able to steer the clustering procedure by effectively capturing several coexisting clusters of different orientation. OkW combines techniques from computational geometry and numerical linear algebra and appears to be particularly effective when applied to difficult datasets of moderate dimensionality.
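The PCA step can be sketched directly: the points inside a window are decomposed into their principal axes, orienting the window along the local cluster direction. The windowing and merging machinery of k-Windows itself is not shown.

```python
import numpy as np

# Minimal sketch of the PCA step that orients a window along the
# local cluster direction; the rest of k-Windows is omitted.

def orient_window(points):
    """Return the principal axes and spreads of the points in a window."""
    centered = points - points.mean(axis=0)
    # SVD of the centered data gives principal directions (rows of vt).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    spreads = s / np.sqrt(len(points) - 1)    # std. dev. along each axis
    return vt, spreads

rng = np.random.default_rng(1)
# An elongated 2-D cluster tilted 45 degrees.
t = rng.normal(0, 1, 200)
pts = np.column_stack([t + rng.normal(0, 0.1, 200),
                       t + rng.normal(0, 0.1, 200)])
axes, spreads = orient_window(pts)
print(axes.round(2), spreads.round(2))        # first axis ~ (0.71, 0.71)
```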

D. K. Tasoulis, D. Zeimpekis, E. Gallopoulos, M. N. Vrahatis
A cluster stability criteria based on the two-sample test concept

A method for assessing cluster stability is presented in this paper. We hypothesize that if one uses a “consistent” clustering algorithm to partition several independent samples, then the clustered samples should be identically distributed. We use the two-sample energy test approach for analyzing this hypothesis. Such a test is not very efficient in clustering problems, because outliers in the samples and limitations of the clustering algorithms heavily contribute to the noise level. Thus, we calculate the value of the test statistic many times to obtain its empirical distribution, and we choose as the “true” number of clusters the one which yields the most concentrated distribution. Results of the numerical experiments are reported.
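The underlying two-sample energy statistic is easy to state and compute: small values suggest the two samples are identically distributed (scaling conventions vary; this is the plain E-statistic).

```python
import numpy as np

# Minimal sketch of the two-sample energy statistic:
# E = 2*mean d(x, y) - mean d(x, x') - mean d(y, y').

def pairwise_mean(a, b):
    return np.mean(np.linalg.norm(a[:, None] - b[None], axis=2))

def energy_statistic(x, y):
    return 2 * pairwise_mean(x, y) - pairwise_mean(x, x) - pairwise_mean(y, y)

rng = np.random.default_rng(0)
same = energy_statistic(rng.normal(0, 1, (100, 2)), rng.normal(0, 1, (100, 2)))
diff = energy_statistic(rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2)))
print(round(same, 3), round(diff, 3))  # the mismatched pair scores higher
```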

Z. Volkovich, Z. Barzily, L. Morozensky
Metadata
Title
Advances in Web Intelligence and Data Mining
Editors
Mark Last
Piotr S. Szczepaniak
Zeev Volkovich
Abraham Kandel
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-33880-2
Print ISBN
978-3-540-33879-6
DOI
https://doi.org/10.1007/3-540-33880-2
