
2011 | Book

Web Information Systems and Mining

International Conference, WISM 2011, Taiyuan, China, September 24-25, 2011, Proceedings, Part II

Edited by: Zhiguo Gong, Xiangfeng Luo, Junjie Chen, Jingsheng Lei, Fu Lee Wang

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 6987 and 6988 constitutes the refereed proceedings of the International Conference on Web Information Systems and Mining, WISM 2011, held in Taiyuan, China, in September 2011. The 112 revised full papers presented were carefully reviewed and selected from 472 submissions. The second volume includes 56 papers organized in the following topical sections: management information systems; semantic Web and ontologies; Web content mining; Web information classification; Web information extraction; Web intelligence; Web interfaces and applications; Web services and e-learning; and XML and semi-structured data.

Table of Contents

Frontmatter

Management Information Systems

Text Clustering Based on LSA-HGSOM

Text clustering has been recognized as an important component of data mining. Self-Organizing Map (SOM) based models have certain advantages for clustering sizeable text data, but existing approaches fail to provide an adaptive hierarchical structure within a single model. This paper presents a new method of hierarchical text clustering, called LSA-HGSOM, that combines latent semantic analysis (LSA) with a hierarchical GSOM. Traditional methods cannot expose the hierarchical structure of text clusters, yet this structure is very important in text clustering. The LSA-HGSOM method achieves hierarchical text clustering automatically: it establishes a vector space model (VSM) of term weights using LSA, so that semantic relations are included in the model. Both theoretical analysis and experimental results confirm that LSA-HGSOM reduces the vector dimensionality and enhances the efficiency and precision of text clustering.

Jianfeng Wang, Lina Ma
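
As an illustration of the LSA step this abstract describes, the following minimal sketch builds a TF-IDF term-weight VSM and projects it into a low-rank latent semantic space with truncated SVD, which is what reduces the vector dimensionality before SOM-style clustering. The toy corpus and the rank k=2 are assumptions, not the authors' setup.

```python
# Hedged sketch of the LSA stage: TF-IDF vector space model reduced to
# a k-dimensional latent semantic space via truncated SVD. Corpus and
# k are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "web mining extracts patterns from web data",
    "text clustering groups similar documents",
    "self organizing maps cluster text data",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # term-weight VSM
svd = TruncatedSVD(n_components=2)             # latent semantic space, k = 2
latent = svd.fit_transform(tfidf)              # documents as dense k-dim vectors
print(latent.shape)                            # (3, 2): fewer, denser features
```
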
Design Pattern Modeling and Implementation Based on MDA

Model Driven Architecture (MDA) is model-centric: it defines the framework of a system by means of various models. To increase both the modeling granularity and the reusability of model transformation rules, we apply design patterns to MDA. In this paper, firstly, a role-based modeling approach is presented, so that the pattern model and the transformation rule can be defined separately. Secondly, two extended meta-meta-models, ExPattern (Extended Pattern) and ExRole (Extended Role), which are the meta-models of Pattern and Role respectively, are demonstrated, and a QVT-based transformation rule is defined for the sake of model transformation. Finally, a case study of a Graduate Education Management System that uses the proposed technologies is demonstrated.

Xuejiao Pang, Kun Ma, Bo Yang
End-to-End Resources Planning Based on Internet of Service

Business innovation and collaboration are driving enterprise information systems to develop from an internal scope toward broader external parties. This paper presents EERP (End-to-End Resources Planning) as an emerging requirement in the enterprise information field. The relevant Internet-of-Services technologies for realizing EERP are examined in detail, and an approach based on value-aware service engineering and service network technologies is proposed and applied, which helps realize business-goal-driven dynamic semantic integration of Web services.

Baoan Li, Wei Zhang
A Comprehensive Reputation Computation Model Based on Fuzzy Regression Method of Cross-Domain Users

In cross-domain collaborative environments, users' reputations are distributed across domains and cannot be shared. A comprehensive reputation computation model based on fuzzy regression is proposed in this paper to supply a uniform reputation evaluation method, which is the basis of collaboration among cross-domain users. The model builds up a user's reputation vector for each single domain, and the vectors are combined according to weighting coefficients. To build the fuzzy regression model, the comprehensive reputation is fuzzified with symmetric triangular fuzzy numbers to determine the fuzzy coefficients and calculate the result. The experimental results show that the model reflects the flexible range objectively.

Juan Zhou, Gang Hu, Qinghua Pang
A TV Commercial Detection System

Automatic real-time recognition of TV commercials is an essential step in TV broadcast monitoring. It comprises two basic tasks: rapid detection of known commercials that are stored in a database, and accurate recognition of unknown ones that appear for the first time in the TV stream. In this paper, we present the framework of a TV commercial detection system.

Yijun Li, Suhuai Luo
Research and Implementation of Entropy-Based Model to Evaluation of the Investment Efficiency of Grid Enterprise

Evaluating the investment efficiency of power grid enterprises means evaluating the proportional relationship between investment results and investment consumption. The rationality of such an evaluation depends on the evaluation model, and the most important part of the model is the setting of weights. As a long-tested weighting method, the entropy weight coefficient method is reliable and objective. In this article, we discuss an entropy weight coefficient method applicable to evaluating the investment efficiency of power grid enterprises; its computer realization is also discussed.

Kehe Wu, Xiao Tu, Cheng Duan
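
The entropy weight coefficient method this abstract relies on is standard and can be stated compactly: criteria whose values vary more across alternatives have lower entropy and receive larger weights. The indicator matrix below is invented for illustration.

```python
# Entropy weight coefficients: normalize each criterion column to
# proportions, compute its entropy, and weight criteria by (1 - entropy).
# The alternatives-by-criteria matrix X is an illustrative assumption.
import numpy as np

X = np.array([[0.82, 0.61, 0.45],   # rows: grid enterprises (alternatives)
              [0.76, 0.58, 0.52],   # columns: efficiency indicators
              [0.91, 0.70, 0.40]])

P = X / X.sum(axis=0)                          # column-wise proportions
m = X.shape[0]
E = -(P * np.log(P)).sum(axis=0) / np.log(m)   # entropy per criterion, in [0, 1]
w = (1 - E) / (1 - E).sum()                    # entropy weights, summing to 1
print(w)
```
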
Passive Data Storage Based Housewares Store Management System

After reviewing several cases of management information systems and RFID systems, we design a novel and practical system for housewares stores. We use RFID tags in a different way: as passive data storage. The structure and implementation of the system are described in detail, and a case study is given for illustration.

Yang Xiao, Guoqi Li, Juan Zhang

Mobile Computing

Multiple Solutions for Resonant Difference Equations

In this paper, critical point theory, minimax methods, and Morse theory are employed to discuss the existence of nontrivial solutions for boundary value problems of second-order difference equations with resonance both at infinity and at zero. Some existence results are obtained.

Shuli Wang, Jianming Zhang
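
The abstract does not restate the problem; a representative formulation of the kind of resonant boundary value problem studied in this literature (an assumption, not necessarily the authors' exact setting) is:

```latex
% Representative second-order difference BVP with resonance at zero and
% at infinity; T, f and the eigenvalues \lambda are generic placeholders.
\[
  \Delta^{2} u(k-1) + f\bigl(k, u(k)\bigr) = 0, \quad k \in \{1,\dots,T\},
  \qquad u(0) = u(T+1) = 0,
\]
\[
  \lim_{|u|\to\infty} \frac{f(k,u)}{u} = \lambda_{\ell}, \qquad
  \lim_{u\to 0} \frac{f(k,u)}{u} = \lambda_{m},
\]
% where \lambda_{\ell}, \lambda_{m} are eigenvalues of the linearized
% problem, which is what "resonance at infinity and at zero" refers to.
```
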
The Application of the GPRS Network on the Design of Real-Time Monitor System for Water Pollution Resource

To design a real-time monitoring system with good performance, the application of the GPRS network to it was studied in depth. Firstly, the importance of designing a real-time monitoring system is introduced and the basic theory of the GPRS network is explained; then the operating principle of a real-time monitoring system for water pollution sources is analyzed, and the GPRS network methodology is studied. Finally, the hardware devices, the software, and their corresponding functions are designed. The monitoring system performed well when applied in an actual engineering project.

Shi-he Sun
CuttingPlane: An Efficient Algorithm for Three-Dimensional Spatial Skyline Queries Based on Dynamic Scan Theory

The skyline operator and skyline computation play an important role in database communication, decision support, data visualization, spatial databases, and so on. In this paper we first analyze the existing methods and point out problems in progressive processing, query efficiency, and convenience of subsequent user selection. Secondly, we propose and prove a theorem for pruning the query space based on dynamic scan theory; building on this theorem, we propose a more efficient algorithm, dynamic cutting-plane scan, for skyline queries, and we analyze and verify the feasibility, efficiency, and accuracy of the algorithm through examples and experiments.

Meng Zhao, Jing Yu
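
For reference, the dominance test underlying any skyline method is simple to state; the sketch below computes a 3-D skyline by brute force (the quadratic scan that the paper's pruning theorem is designed to avoid). The points are invented for illustration.

```python
# Brute-force 3-D skyline under minimization: a point survives if no
# other point is <= on every coordinate and < on at least one.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 4, 2), (2, 2, 3), (3, 1, 1), (2, 5, 4)]
print(skyline(pts))   # (2, 5, 4) is dominated by (1, 4, 2) and drops out
```
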

Semantic Web and Ontologies

ROS: Run-Time Optimization of SPARQL Queries

The optimization effect of existing cost-model-based algorithms on large-scale RDF data is not satisfactory. This paper presents Run-time Optimization of SPARQL queries (ROS) and describes the join graphs and index structures for SPARQL queries that are the foundations of the ROS approach. The ROS algorithm, without cost models, intertwines cost estimation and query optimization within the execution procedure and determines query plans at run time. Our experiments using the SP2Bench benchmark show that ROS can select the best query plan and improve query efficiency dramatically compared with existing approaches.

Liuqing Li, Xin Wang, Xiansen Meng, Zhiyong Feng
The Research and Implementation of Heterogeneous Data Integration under Ontology Mapping Mechanism

To deal with semantic heterogeneity among heterogeneous data sources, ontology is introduced into traditional data-integration middleware. A domain ontology and local ontologies are constructed as the overall data view and the local data views respectively. By establishing mappings between the domain ontology and the local ontologies, the problem of semantic heterogeneity is settled and a semantic standard is reached. In this work, mapping relations are generated automatically, an innovation for which an algorithm is given. As a result, this approach provides an effective solution to the automatic integration of massive data.

Jing Bian, Hai Zhang, Xinguang Peng
Extracting Hyponymy Patterns in Tibetan Language to Enrich Minority Languages Knowledge Base

A semantic ontology is a formal, explicit specification of a shared conceptualization. The construction of a semantic ontology knowledge base is a vital process in language processing, applied in information retrieval, information extraction, and automatic translation. Hyponymy is a basic semantic relationship between concepts, used for concept acquisition to enrich ontologies automatically. In this paper, the construction of a multilingual ontology with unified criteria and interfaces is introduced, and a hyponymy pattern is represented as a meaning frame defining the information to be extracted in the Tibetan language. Research on hyponymy relationship patterns can assist concept enrichment in ontologies, which reduces the cost of the ontology engineering process.

Lirong Qiu, Yu Weng, Xiaobing Zhao, Xiaoyu Qiu
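
Pattern-based hyponymy extraction works the same way across languages; since the Tibetan patterns themselves are not reproduced in the abstract, the sketch below uses a classic English "X such as Y" pattern purely to illustrate the mechanism of harvesting hypernym/hyponym pairs for ontology enrichment.

```python
# Hearst-style hyponymy pattern mining; the pattern and sentence are
# English stand-ins for the paper's Tibetan-specific meaning frames.
import re

PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+)")

def extract_hyponymy(text):
    pairs = []
    for hypernym, group in PATTERN.findall(text):
        for hyponym in group.split(", "):
            pairs.append((hypernym, hyponym))
    return pairs

print(extract_hyponymy("languages such as Tibetan, Mongolian are studied"))
# [('languages', 'Tibetan'), ('languages', 'Mongolian')]
```
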

Web Content Mining

Discovering Atypical Property Values for Object Searches on the Web

Conventional search engines are able to extract commonplace information by incorporating users' requests into their queries. Users perform niche requests when they want to obtain atypical objects or unique information, and in these instances it is difficult for them to expand their queries to match the niche requests. In this paper, we introduce a query suggestion method for finding objects that have atypical characteristics. Our method focuses on the property values of an object and elicits atypical property values by using the relation between an object's name and a typical property value.

Tatsuya Fujisaka, Takayuki Yumoto, Kazutoshi Sumiya
An Indent Shape Based Approach for Web Lists Mining

Mining repeated patterns from HTML documents is a key step in typical Web information extraction applications, which require efficient pattern-mining techniques to generate wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their efficiency remains a challenge. In this paper, we present a novel approach to Web list mining based on the indent shape of HTML documents. An indent shape is a simplified abstraction of an HTML document in which tandem repeated waves indicate the potential repeated patterns to be detected. By identifying the tandem repeated waves efficiently with a horizontal line scanning along the indent shape, the repeated patterns in the document can be recognized, from which the lists of the target Web page can be extracted. Extensive experiments show that our approach achieves better performance and efficiency than existing approaches.

Yanxu Zhu, Gang Yin, Huaimin Wang, Dianxi Shi, Xiang Li, Lin Yuan
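
The indent-shape abstraction can be made concrete in a few lines: walk the tag stream and record the nesting depth at each tag event, producing the "wave" profile in which tandem repeats signal lists. Wave detection itself (the horizontal scan line) is omitted, and the HTML snippet is an assumption.

```python
# Depth profile ("indent shape") of an HTML document: depth rises on
# opening tags and falls on closing tags; repeated waves mark lists.
from html.parser import HTMLParser

class IndentShape(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth, self.shape = 0, []
    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.shape.append(self.depth)
    def handle_endtag(self, tag):
        self.shape.append(self.depth)
        self.depth -= 1

p = IndentShape()
p.feed("<ul><li><a>x</a></li><li><a>y</a></li></ul>")
print(p.shape)   # [1, 2, 3, 3, 2, 2, 3, 3, 2, 1] - one wave per <li>
```
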
Web Text Clustering with Dynamic Themes

Research in data mining has developed many technologies for filtering useful information out of vast data; document clustering is one of the important ones. There are two approaches to document clustering: clustering on the metadata of documents, and clustering on their content. Most previous content-based approaches focused on document summaries (of single or multiple files) and word-vector analysis, finding the few important keywords with which to cluster documents. In this study, we categorize hot commodities on the web and then name the categories, based on the web texts (abstracts) describing these commodities and their access counts. Firstly, the Chinese web texts are parsed and the hierarchical agglomerative clustering algorithm (Ward's method) is applied to group word properties into themes and decide the number of themes. Secondly, the Cross-Collection Mixture Model used in temporal text mining, together with access counts (the degree of user identification with words), is adopted to collect dynamic themes, and stable words are gathered by probability distribution to serve as the document clustering vectors. Thirdly, the parameters are estimated with the Expectation-Maximization (EM) algorithm. Finally, K-means is applied with the extracted dynamic themes as the features for document clustering. This study proposes a novel approach to document clustering, and a series of experiments proves that the algorithm is effective and can improve the accuracy of clustering results.

Ping Ju Hung, Ping Yu Hsu, Ming Shien Cheng, Chih Hao Wen
Multi-aspect Blog Sentiment Analysis Based on LDA Topic Model and Hownet Lexicon

Blogs are an important Web 2.0 application that attracts many users to express subjective reviews about financial events, political events, and other objects. A Blog page usually includes more than one theme, yet existing research on multi-aspect sentiment analysis focuses on product reviews. In this paper, we propose a multi-aspect Chinese Blog sentiment analysis method based on the LDA topic model and the HowNet lexicon. First, we use a Chinese Blog corpus to train an LDA topic model and identify the themes of the corpus. The trained LDA model is then used to segment the themes of Blog pages by paragraph. After that, a HowNet-based sentiment word tagging method is used to calculate the sentiment orientation of every Blog theme, so the sentiment orientation of a Blog page can be represented by the sentiment orientations of its multi-aspect themes. Experimental results on a SINA Blog dataset show that our method not only produces good topic segments but also improves sentiment classification performance.

Xianghua Fu, Guo Liu, Yanyan Guo, Wubiao Guo
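
The final scoring step this abstract describes (lexicon-based orientation per theme) reduces to a signed sum of sentiment-word hits; the miniature English lexicon below stands in for HowNet and is purely illustrative.

```python
# Lexicon-based sentiment orientation of one theme: sum the polarities
# of lexicon words over the theme's paragraphs. LEXICON is a toy
# stand-in for HowNet.
LEXICON = {"good": 1, "excellent": 1, "bad": -1, "poor": -1}

def theme_orientation(paragraphs):
    score = sum(LEXICON.get(w, 0) for p in paragraphs for w in p.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

theme = ["The service was good and the food excellent", "Parking was bad"]
print(theme_orientation(theme))   # positive (score = +1)
```
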
Redundant Feature Elimination by Using Approximate Markov Blanket Based on Discriminative Contribution

As a high-dimensional problem, analyzing text data sets is a hard task, in which many weakly relevant but redundant features hurt the generalization performance of classifiers. Previous works handle this problem using pair-wise feature similarities, which do not consider the discriminative contribution of each feature by utilizing the label information. Here we define an Approximate Markov Blanket (AMB) based on the metric of DIScriminative Contribution (DISC) to eliminate redundant features, and propose the AMB-DISC algorithm. Experimental results on the Reuters-21578 data set show that AMB-DISC is much better than previous state-of-the-art feature selection algorithms that consider feature redundancy, in terms of Micro-avg F1 and Macro-avg F1.

Xue-Qiang Zeng, Su-Fen Chen, Hua-Xing Zou
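
The approximate-Markov-blanket idea can be sketched generically: rank features by label relevance and drop any feature that an already-kept, more relevant feature resembles more than it resembles the label. Absolute Pearson correlation below is a stand-in for the paper's DISC metric; that substitution is an assumption of this sketch.

```python
# FCBF-style redundancy elimination with correlation as a placeholder
# for the discriminative-contribution (DISC) metric.
import numpy as np

def corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

def amb_select(X, y):
    order = sorted(range(X.shape[1]), key=lambda j: corr(X[:, j], y), reverse=True)
    kept = []
    for j in order:
        # j is redundant if some kept feature approximates it better than
        # j predicts the label (an approximate Markov blanket for j)
        if all(corr(X[:, k], X[:, j]) < corr(X[:, j], y) for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
y = rng.normal(size=100)
f1 = y + 0.1 * rng.normal(size=100)       # relevant feature
f2 = f1 + 0.01 * rng.normal(size=100)     # redundant near-copy of f1
f3 = rng.normal(size=100)                 # noise
print(amb_select(np.column_stack([f1, f2, f3]), y))  # one of columns 0/1 is dropped
```
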
Synchronization of Hyperchaotic Rossler System and Hyperchaotic Lorenz System with Different Structure

This paper studies synchronization between the hyperchaotic Rossler system and the hyperchaotic Lorenz system with unknown parameters. Based on Lyapunov stability theory, active synchronization and adaptive synchronization make the structurally different systems achieve synchronization. Numerical simulations show the effectiveness and feasibility of these methods.

Yi-qiang Wei, Nan Jiang
Research of Matrix Clustering Algorithm Based on Web User Access Pattern

Summarizing the regular patterns by which users locate and browse the Web along URLs, and mining user browsing patterns to help users reach their target pages quickly, are of great significance for realizing personalized search engine navigation. In order to provide personalized service, an optimized matrix clustering algorithm is proposed, which clusters the pages users access and analyzes the regularities in Web log records, so as to improve the performance and organizational structure of a Web site according to users' browsing patterns, understand user behavior, and discover browsing patterns. The experimental results show that the algorithm has good practicability and accurately reflects Web visits.

Jian Bao

Web Information Classification

Combining Link-Based and Content-Based Classification Method

Link mining, also called social network analysis, is a new branch of data mining that differs from traditional data mining methods in that link information is used. Link information provides richer and more accurate information about the social network. In this paper, a representation is chosen from Graph, Dyad, and Subgraph for the statistical inference of mining. Then, based on the definition of the graph structure and link types, a model for obtaining link features is built. Finally, a combined link-based and content-based classification method is proposed, and this method is shown to improve classification results.

Kelun Tian
Chinese Expert Entity Homepage Recognition Based on Co-EM

To address the heavy labeling workload on expert homepages in Chinese expert entity homepage recognition, this paper proposes a recognition method based on Co-EM. In detail, we first collect the names of Chinese expert entities and the corresponding web pages, and label a small quantity of the pages. Secondly, for Chinese entity characteristics, we extract hyperlink features and web page content features as two independent feature sets. Thirdly, we train a hyperlink classifier on the hyperlink feature set and use it to label all the expert entity homepages, then train a content classifier on the content feature set with the labels produced by the hyperlink classifier; the labels produced by the content classifier are used in turn to update the hyperlink classifier, and the procedure is repeated until the two classifiers converge. Finally, experiments were carried out using 10-fold cross validation. The results show that the Co-EM-based semi-supervised method uses the unlabeled web pages effectively and increases recognition accuracy compared with using the labeled web pages only.

Li Liu, Zhengtao Yu, Lina Li
Semi–supervised K-Means Clustering by Optimizing Initial Cluster Centers

Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate and optimize initial cluster centers for the k-means algorithm. It proposes a max-distance search approach to find optimal initial cluster centers from unlabeled data, especially when the labeled data cannot provide enough initial centers. Experimental results demonstrate the advantages of this method over standard random selection and partial random selection, in which some initial cluster centers come from labeled data while the others come from unlabeled data by random selection.

Xin Wang, Chaofei Wang, Junyi Shen
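
The seeding strategy this abstract describes can be sketched directly: take the centroids of the labeled classes as the first centers, then repeatedly promote the unlabeled point farthest from every chosen center until k centers exist. The data and k below are illustrative assumptions.

```python
# Max-distance search for initial k-means centers: labeled centroids
# first, then farthest-point selection among unlabeled points.
import numpy as np

def seed_centers(labeled, labels, unlabeled, k):
    centers = [labeled[labels == c].mean(axis=0) for c in np.unique(labels)]
    while len(centers) < k:
        d = np.min([np.linalg.norm(unlabeled - c, axis=1) for c in centers], axis=0)
        centers.append(unlabeled[np.argmax(d)])  # farthest point becomes a center
    return np.array(centers)

labeled = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
labels = np.array([0, 0, 1])
unlabeled = np.array([[0.1, 0.2], [9.0, 0.5], [5.1, 4.9]])
print(seed_centers(labeled, labels, unlabeled, k=3))
# the third center is [9.0, 0.5], the unlabeled point farthest from both centroids
```
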
Fuzzy ID3 Algorithm Based on Generating Hartley Measure

Fuzzy decision tree induction is an important way to handle uncertain information. However, current fuzzy decision tree algorithms do not systematically consider the impact of different fuzzy levels, and simply fold uncertainty treatment into the selection of expanded attributes. To avoid this problem, this paper establishes a generating Hartley measure model based on cut-standards and then proposes a fuzzy ID3 algorithm based on this model; finally, experimental results indicate that the model is feasible and effective.

Fachao Li, Dandan Jiang
A Technique for Improving the Performance of Naive Bayes Text Classification

The naive Bayes classifier is widely used in text classification tasks; it can perform surprisingly well and is often regarded as a baseline. But previous research shows that a skewed distribution of the training collection may cause poor text classification results. This paper presents a new method to deal with this situation: we introduce a conditional probability that takes into account both the information of the whole corpus and that of each category. Our proposed method performs well on standard benchmark collections, competing with state-of-the-art text classifiers, especially on skewed data.

Yuqian Jiang, Huaizhong Lin, Xuesong Wang, Dongming Lu
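
The paper's exact conditional probability is not reproduced in the abstract; as one established remedy for skewed training collections, the sketch below uses the Complement Naive Bayes variant, which estimates each class's parameters from all other classes' documents and is therefore less biased toward large classes. The corpus and labels are invented.

```python
# Complement Naive Bayes on a deliberately skewed toy corpus - an
# illustration of skew-robust NB, not the authors' specific method.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

docs = ["cheap pills buy now", "meeting at noon", "buy cheap meds",
        "lunch meeting today", "project meeting notes"]  # 2 spam vs 3 ham
y = ["spam", "ham", "spam", "ham", "ham"]

vec = CountVectorizer()
clf = ComplementNB().fit(vec.fit_transform(docs), y)
print(clf.predict(vec.transform(["buy pills now"])))   # -> ['spam']
```
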
Mapping Data Classification Based on Modified Fuzzy Statistical Analysis

By analyzing traditional mapping data classification and considering the fuzziness of classification, a modified mapping data classification method based on fuzzy sets is put forward. First, a fuzzy sample set is obtained through an expert system, and the sample distribution function is computed by statistical analysis. Then, the distribution function is used to obtain a fuzzy membership function, and the fuzziest point is worked out from the membership function. Finally, mapping data are classified according to the fuzziest point. The proposed method overcomes the problems of the traditional approach in practice: misconstruction of the original data, lack of general applicability, computational complexity, and fuzzy classification.

Yi Cheng, Mingxia Xie, Jianzhong Guo
Web Clustering Using a Two-Layer Approach

The Internet is a rich and potentially vast information base, and scientific, effective methods are needed to find interesting information in it. Researchers have proposed many web clustering algorithms, but a single clustering algorithm costs too much time because the amount of web information is huge. Considering both the efficiency and the effect of clustering, in this paper we use a two-layer web clustering approach to cluster a large number of web access patterns from web logs. At the first layer, we use an LVQ (Learning Vector Quantization) neural network to group the web access patterns around several representative cluster centers. At the second layer, the rough k-means algorithm is adopted to process the result of the first layer, producing the final clusters. The experimental results show that the two-layer approach achieves an effect close to the single-layer rough k-means while exceeding its efficiency.

Yanping Li, Jinsheng Xing, Rui Wu, Fulan Zheng
An Improved KNN Algorithm for Vertical Search Engines

Secondary data processing further processes information by re-crawling and categorizing on the basis of structured data; it is a key research module of vertical search engines. This paper proposes an improved KNN algorithm for the categorization. The algorithm achieves the responsiveness and accuracy required by vertical search by reducing the time complexity and accelerating classification. Experiments prove that the improved algorithm has better feasibility and robustness when used in the secondary data processing and word segmentation of vertical search engines.

Yubo Jia, Hongdan Fan, Guanghu Xia, Xing Dong

Web Information Extraction

Concluding Pattern of Web Page Based on String Pattern Matching

Presently, each Web site has its own topics and formats for arranging page structure and presenting information; therefore, there is a great need for value-added services that extract information from multiple sources. Data extraction from HTML is usually performed by software modules called wrappers, and in many studies of wrapper construction, concluding the pattern of the Web site is an important first task. This paper studies the problem of concluding the pattern of a Web page that contains several nested and repeated structures. In our method, an algorithm based on string pattern matching discovers the nested structures and repeated structures in a Web page; a regular expression is then generated as the pattern of the Web site.

Yiqing Cai, Xinjun Wang, Chunsheng Lu, Zhongmin Yan, Zhaohui Peng
A Rapid Method to Extract Multiword Expressions with Statistic Measures and Linguistic Rules

Multiword Expressions (MWEs) have been a bottleneck in NLP. In particular, a resource of fixed MWEs can improve the performance of NLP tasks and applications. Due to the complex characteristics of MWEs, it is hard to distinguish fixed MWEs from unfixed ones. This paper puts forward an approach to extract fixed MWEs rapidly. First, the definition of fixed MWEs is given; features contributing to determining fixed MWEs are then considered both in statistical measures and in linguistic information. We extract fixed MWEs in a multi-feature framework and evaluate them manually. Experiments show that the approach is effective; our work can provide a desired list of fixed MWEs for NLP applications.

Lijuan Wang, Rong Liu
Mining Popular Menu Items of a Restaurant from Web Reviews

We propose a novel method to mine popular menu items from online reviews. To extract popular menu items, a crawler using wrappers on search web sites collects online reviews, restaurant names, and menu items; unnecessary posts are then removed using patterns. Post frequency is used to find the menu items appearing most frequently in the reviews and thus select the most popular ones. As a result, the total average accuracy was 0.900.

Yeong Hyeon Gu, Seong Joon Yoo
News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm

Information extraction is important in web information retrieval. In the case of news information extraction, because news pages have no representative keywords marking their beginning and end, it is difficult to identify the news title and body automatically. Our approach is based on an adaptive weighting factor using a Bayesian algorithm to solve this problem. We divide a news page into text fragments and represent them with a set of content features and layout features, using an adaptive weighting factor to make the features fit different pages. Experiments show that our method achieves higher precision on news information extraction than the original algorithm without a weighting factor.

Shilin Huang, Xiaolin Zheng, Xiaowei Wang, Deren Chen

Web Intelligence

Infectious Communities Forging Using Information Diffusion Model in Social Network Mining

This article proposes a new model for clustering individual nodes based on the nodes' interrelations, with a real-life mining application. The model is capable of detecting a network topology based on information flow and therefore could easily be extended and applied in a variety of today's research fields, e.g., discovering audience groups sharing similar attitudes, retrieving authors' academic referencing groups, or plotting active friend societies in social networks. An effective algorithm, the Boundary Growth Algorithm, is proposed, through which the underlying structure of networks can be found. Extensive experimental evaluations demonstrate the effectiveness of our approach.

Tianran Hu, Xuechen Feng
Extracting Dimensions for OLAP on Multidimensional Text Databases

With the amount of textual information massively growing in various kinds of business systems and on the Internet, there are increasing demands for analyzing both structured data and unstructured text data. Online Analytical Processing (OLAP) is effective for analyzing and mining structured data, but it is powerless when handling unstructured data. Having worked on several information integration and data analysis applications, we have recognized this defect of OLAP on text data analysis and address the issue with technical means. In this paper, we propose a semi-supervised algorithm to extract dimensions and their members from textual information for the purpose of analyzing huge sets of textual data, and we use straightforward measures to express the analysis results. Experimental results show that the extraction algorithm is valid and that our approach has high scalability and flexibility.

Chao Zhang, Xinjun Wang, Zhaohui Peng
A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

Virtual Integration systems require a crawling tool able to navigate and reach relevant pages on the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose at each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making it efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without prior intervention from the user.

Inma Hernández, Hassan A. Sleiman, David Ruiz, Rafael Corchuelo
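
A URL-only relevance classifier in the spirit of this framework can be prototyped by tokenizing URLs on punctuation and fitting a linear model, so relevance is judged before any download. The URLs, labels, and model choice are assumptions of this sketch, not the authors' classifier.

```python
# Classify page relevance from URL tokens alone, before downloading.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

urls = ["site.com/product/123", "site.com/about-us", "site.com/product/456",
        "site.com/contact", "site.com/product/789", "site.com/jobs"]
relevant = [1, 0, 1, 0, 1, 0]   # 1 = URL leads to a product detail page

vec = CountVectorizer(token_pattern=r"[a-z]+")   # split URLs into word tokens
clf = LogisticRegression().fit(vec.fit_transform(urls), relevant)
print(clf.predict(vec.transform(["site.com/product/42"])))   # -> [1]
```
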
Web Trace Duplication Detection Based on Context

Data integration becomes more and more important with the rapid spread of the Internet, and the study of entity traces is an increasingly important part of it. Entity traces are mainly extracted from text fragments, and the resulting records contain much duplication because of the large scale, strong autonomy, and high redundancy of web sources. Handling this problem often involves semantic features, so traditional integration methods cannot be applied to it directly. In this paper, we propose a web trace duplication detection method based on unsupervised learning and context. We address the problem with a new process: computing the comparison vector between two records based on their context, acquiring sample data automatically, training classifiers with the sample data, and finally classifying the records.

Chang Gao, Xiaoguang Hong, Zhaohui Peng, Hongda Chen
A Framework for Incremental Deep Web Crawler Based on URL Classification

As the Web grows rapidly, more and more data become available in the Deep Web, but users have to key in a set of keywords in order to access the pages of some web sites. Traditional search engines only index and retrieve Surface Web pages through static URL links, because Deep Web pages are hidden behind forms. However, the Deep Web not only contains far more information than the Surface Web; its information is also more valuable. As Deep Web pages change rapidly, keeping crawled Deep Web pages fresh and crawling new ones is a challenge. A framework for an incremental Deep Web crawler based on URL classification is proposed. According to the distinction between list pages and leaf pages, the URLs related to pages are divided into two kinds: list URLs and leaf URLs. The framework not only crawls the latest Deep Web pages according to the change frequency of list pages, but also crawls the leaf pages that change often.

Zhixiao Zhang, Guoqing Dong, Zhaohui Peng, Zhongmin Yan
Query Classification Based on Index Association Rule Expansion

Query classification can improve the query results of search engines, but existing query classification methods that use extra web resources to enrich query features easily incur high delay. In this paper, query classification based on index association rule expansion (IARE-QC) is proposed. IARE-QC uses an index-based query classification framework to reduce response time by transforming the online query classification problem into equivalent offline index term classification. Moreover, to enrich index term features more accurately, we propose a novel algorithm called index association rule expansion based on similarity voting (IARE-SV) to determine the category labels of index terms. Experimental results in a search engine simulation environment show that IARE-SV achieves much better query classification performance than the common simple voting (SV) method.

Xianghua Fu, Dongjian Chen, Xueping Guo, Chao Wang
A Link Analysis Model Based on Online Social Networks

As information technology has advanced, people turn to electronic media more frequently for communication, and social relationships are increasingly found on online channels. Traditional online social network research is based on a certain kind of comment interaction; though some interesting conclusions have been obtained, the resulting understanding of the entire online social network is one-sided. In this paper, we compare four different types of networks proposed by previous researchers. Statistical analysis reveals that the four networks are consistent in nature (both the "small-world effect" and skewed degree distributions are found in them). To discover the mechanism behind these observations, we propose a single-factor model with a single parameter K; using this model, various networks can be obtained by varying K within a given range. Simulation experiments based on this model show that the simulation results and the real data are consistent, which means that our model is valid.

Bu Zhan, ZhengYou Xia
Research on Information Measurement at Semantic Level

This paper defines an information measure associated with a topic, or semantics, for a keyword-based corpus. First, a topic-based corpus is obtained; then a latent semantic vector space model of the corpus is established, and the information measure of a keyword is defined through this vector space model. Accordingly, the amount of topic information any document contains can be calculated. Lastly, a membership degree measuring how strongly a document belongs to the topic is introduced; by setting a measurement threshold, it is determined whether a document belongs to the topic or not. Experiments show that the defined information measurement overcomes the difficulty of word-match search and truly achieves the goal of semantic-match search.

Kaizhong Jiang, Lu Li, Bosheng Xu
A New Similarity Measure Based Robust Possibilistic C-Means Clustering Algorithm

In this paper, we develop a new similarity measure based robust possibilistic c-means clustering (RPCM) algorithm that is not sensitive to the selection of initial parameters, is robust to noise and outliers, and is able to determine the number of clusters automatically. The proposed algorithm is based on an objective function of PCM that can be regarded as a special case of similarity-based robust clustering algorithms. Several simulations on artificial and benchmark data sets demonstrate the effectiveness of the proposed algorithm.

Kexin Jia, Miao He, Ting Cheng
DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages

Web pages usually contain much irrelevant information that users do not need, so effective extraction methods are required to pull relevant information out of this complicated heap. Exploiting the semi-structured nature of HTML, theme-relevant information in web pages can be extracted by semantic pruning over a DOM representation, combined with features of the web structure and a fuzzy classification of keywords.

Junjie Chen, Junyao Jia, Liguo Duan

Web Interfaces and Applications

A Domain Specific Language for Interactive Enterprise Application Development

Web-based enterprise applications (EAs) have become the mainstream of business systems; however, EA development faces enormous challenges in meeting software quality requirements and delivery deadlines. In this paper, we propose a domain specific language, called WL4EA, which combines components with generative reuse, targets popular application frameworks (platforms), and supports high interactivity. With WL4EA, an EA can be declaratively specified as sets of entities, views, business objects, and data access objects; these language elements are composed according to known EA architectures and patterns. Such a DSL with code generation can lower development complexity and error proneness and improve efficiency.

Jingang Zhou, Dazhe Zhao, Jiren Liu
Scalable Application Description Language to Support IPTV Client Device Independence Based on MPEG-21

This paper presents a framework and a description language to ensure application interoperability for heterogeneous IPTV client devices with different capabilities by providing application scalability. To support device-independent applications in the IPTV service environment, we suggest a new XML schema (named Scalable Application Description Language, SADL) based on MPEG-21 DIDL (Digital Item Declaration Language).

Tae-Beom Lim, Kyoungro Yoon, Kyung Won Kim, Jae Won Moon, Yun Ju Lee, Seok-Pil Lee
A Study on Using Two-Phase Conditional Random Fields for Query Interface Segmentation

Recently, the Web has been rapidly "deepened" by many online searchable databases, where data are hidden behind query interfaces. Automatic processing of a query interface is a must for accessing the invisible contents of the deep Web. This entails automatic segmentation, i.e., the task of grouping related components of an interface together. The segmentation is divided into two steps: interface component labeling and interface component grouping. In this paper we present a new approach to query interface segmentation using two-phase Conditional Random Fields (CRFs). In the first phase, one CRF model tags each component with a semantic label (attribute-name, operator, operand, or other); in the second phase, another CRF model creates groups of related components. Experiments show that our approach yields high accuracy.

Yongquan Dong, Xiangjun Zhao, Gongjie Zhang
Key Techniques Research on Water Resources Scientific Data Sharing Platform

In order to integrate distributed, heterogeneous water resources scientific data and provide one-stop data sharing services for different users, a distributed data sharing platform is urgently needed in China. A framework for the distributed water resources scientific data sharing platform (WSDSP) is proposed to solve this problem. The platform comprises one main center, one authentication center, and several sub-centers, in which core functions are encapsulated as web services. Some key techniques, i.e., metadata directory service, single sign-on (SSO), web services based service-oriented architecture (SOA), and distributed metadata synchronization, are then discussed in detail. The platform has been deployed in the Ministry of Water Resources and related provinces, where more than 8976 MB of data integrated from 5487 hydrometric stations can be shared with the public.

Yufeng Yu, Shijin Li, Jingjin Jiang
ExpertRec: A Collaborative Web Search Engine

ExpertRec is a collaborative Web search engine that differs from current mainstream search engines in allowing users to share search histories through a Web browser toolbar or a proxy browser. It can also be regarded as a novel social Web search engine that utilizes experts' search histories to build recommendations. In this paper, we give an anatomy of ExpertRec, specially introducing its architecture and core techniques. It includes two basic components: a client agent and a back-end server. The former is implemented as a Mozilla Firefox toolbar (a Firefox extension) that integrates with mainstream search engines such as Google and Yahoo! to meet users' teamwork needs; it allows users to generate high-quality tags, votes, and comments over the current Web, including search histories and personal archival content on the local host typically beyond the reach of existing Web 2.0 social tagging systems. The latter is a CBR (case-based reasoning) based recommendation engine implemented around core techniques such as recommendation rules and a scalable method for identifying search expertise based on hierarchical user profiles, in order to improve users' search quality. Finally, we give an evaluation and draw conclusions.

Jingyu Sun, Junjie Chen, Xueli Yu, Ning Zhong

Web Services and E-Learning

A QoS Evaluation Method for Personalized Service Requests

With the prevalence of Web services, QoS is playing an increasingly important role in service evaluation, recommendation, and selection. Most previous works assume that the delivered QoS of a Web service is determined solely by the service provider, not the service consumer. However, in practical service execution environments, Web services usually work in an interactive mode with the service consumer, so the consumer also bears responsibility for the delivered QoS. Hence, evaluating the QoS of Web services impartially becomes a challenge. In view of this challenge, a QoS evaluation method for personalized service requests is proposed in this paper. Finally, the effectiveness of our method is validated, and an optimization method is proposed to improve the QoS evaluation efficiency.

Rutao Yang, Qi Chen, Lianyong Qi, Wanchun Dou
Virtual Personalized Learning Environment (VPLE) on the Cloud

With the maturing of virtualization technology and the fast growth of cloud computing services, application service providers have changed the way they serve customers. The new service model lets users access resources through the browsers of their thin devices, and application service providers no longer need to buy many machines for uncertain backup or expansion requirements, because a cloud service can provision computing resources for customers dynamically and users pay as they go. The innovation of information science and technology is driving e-learning systems to produce new types of services, and users' personalization requirements are growing fast. This paper presents a solution for building a virtual, personalized learning environment that combines Cloud Infrastructure as a Service (IaaS) and Cloud Software as a Service (SaaS) technologies to create a service-oriented model for application service providers and learners.

The proposed environment, the Virtual Personalized Learning Environment (VPLE), is intended for subscribing to and exercising selected learning resources as well as creating a personalized virtual classroom. The VPLE system allows learning content providers to register their applications on the server, and learners to integrate other Internet learning resources into their learning application pools.

Po-Huei Liang, Jiann-Min Yang
MTrust-S: A Multi-model Based Prototype System of Trust Management for Web Services

As an important technique for Internet-scale information integration, Web services have become popular very rapidly in recent years, and more and more enterprises move their core business onto the Web in the form of web services. Malicious web services will affect the security and reliability of the requester's application that invokes them; therefore, identifying trustworthy web services is now a critical issue for making requesters' applications secure and reliable. In this paper, we develop a prototype that naturally provides a solution for the evaluation and management of web service trust and reputation. The prototype integrates the collection of service requesters' feedback, trust evaluation, and trust management; with these three components collaborating, it provides an effective way to select trustworthy services for requesters. To model service trust more precisely, we present a mathematical representation for the different types of data describing service trust, i.e., discrete values, probabilistic values, belief values, and fuzzy values, and a series of models has been developed to evaluate service trust according to these types of trust values. Simulation results verify that using these models can greatly improve the success rate of invoking trustworthy services.

Dunlu Peng, Shaojun Yi, Huan Huo, Jing Lu
Web Application Security Based on Trusted Network Connection

This paper introduces the security of Trusted Network Connection into Web applications. To address Web application security and the application limitations of Trusted Network Connection, which is widely used only in LANs and VPNs, a new method for Web application security based on the idea of Trusted Network Connection is presented. Through the model design and system realization, it is shown that the idea of Trusted Network Connection can be applied to Web applications and improve their security. At the same time, it can reduce attacks by viruses and trojans and broaden the fields of Trusted Network Connection application.

Yongwei Fu, Xinguang Peng
Model Checking for Asynchronous Web Service Composition Based on XYZ/ADL

Concerning Web service composition, this paper proposes a model checking method for verifying asynchronous communication behaviors and timed properties. Firstly, analyzing Web service composition from the software architecture perspective, the interactive behaviors and timed properties are described in XYZ/ADL, which is based on a temporal logic language. Secondly, a timed asynchronous communication model (TACM) conforming to the specification of the model checker UPPAAL is proposed. Finally, based on the translation of XYZ/RE communication commands into TACM, the correctness of the asynchronous communication behaviors of the service composition system can be verified with UPPAAL.

Guangquan Zhang, Huijuan Shi, Mei Rong, Haojun Di
Specification and Verification of Data and Time in Web Service Composition

The verification of Web service composition has been widely acknowledged as a challenging problem. In this paper we present a method based on a data- and time-aware service model to validate properties of Web service composition. First we translate the Web service composition specification into a formal model containing data-related and time-related information; we then translate this model into a UPPAAL specification, and finally the correctness of the Web service composition is verified through the UPPAAL tool.

Guangquan Zhang, Haojun Di, Mei Rong, Huijuan Shi
Development of LMS/LCMS (Contents Link Module) Real-Time Interactive in Videos for Maximizing the Effect of Learning

Since most existing LMS/LCMS have limited support for interactive elements in video contents, they lack a real-time mode and interactivity between teacher and learner and among learners, and it is difficult to measure learners' progress accurately in the teaching-learning process. In this paper, we propose a contents link technology for real-time interaction that adds various functions to videos in order to overcome these limitations of e-Learning, and we implement a platform for video contents operation using the proposed method. It overcomes the existing problems and maximizes the effect of e-Learning.

Junghyun Kim, Doohong Hwang, Kangseok Kim, Changduk Jung, Wonil Kim

XML and Semi-structured Data

Converting XML Schema Data to Object-Relational Data with DOM

There are many strategies for storing XML documents; using an (object-)relational database is one of them. This paper discusses a new mapping method from XML Schema to an object-relational database, presenting its basic mapping rules and an approach based on a DOM-Chart model in detail. The mapping process preserves the structure as well as the semantic constraints of the source schema in the target schema. Test results show that the theory and the design are feasible and effective.

Lijun Sang, Jihai Xiao, Xiaohong Cui
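
The DOM-driven direction of such a mapping is easy to picture: traverse the document with the DOM API and flatten each record element into a relational row. The element names and flat target schema below are assumptions; a real mapping must also carry the XML Schema's type and constraint information, as the abstract notes.

```python
# Flatten XML <book> elements into rows for an (object-)relational
# table via DOM traversal. Document and schema are illustrative.
from xml.dom.minidom import parseString

xml = """<books>
  <book><title>WISM 2011</title><year>2011</year></book>
  <book><title>LNCS 6988</title><year>2011</year></book>
</books>"""

dom = parseString(xml)
rows = []
for book in dom.getElementsByTagName("book"):
    row = {c.tagName: c.firstChild.data for c in book.childNodes
           if c.nodeType == c.ELEMENT_NODE}
    rows.append(row)
print(rows)   # [{'title': 'WISM 2011', 'year': '2011'}, {'title': 'LNCS 6988', ...}]
```
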
XML Query Algorithm Based on Matching Pretreatment Optimization

In the XML query algorithm based on matching pretreatment, three important existing tree matching models are used, and the match results of data sets are presented in descending order of match cost. A matching pretreatment function is added to improve the existing algorithm, and a series of experiments is conducted. The results show that the algorithm can remove unwanted nodes from the tree to improve the efficiency of data set retrieval, especially the recall ratio, the precision ratio, and the average response time, when the data scale is quite large. The algorithm is applied in the unified retrieval system of a scientific and technological resources database, where it facilitates resource navigation, narrows the search scope, and improves system efficiency.

Yi Wang, Heming Ye, Haixia Ma, Weizhao Zhang
Backmatter
Metadata
Title
Web Information Systems and Mining
Edited by
Zhiguo Gong
Xiangfeng Luo
Junjie Chen
Jingsheng Lei
Fu Lee Wang
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-23982-3
Print ISBN
978-3-642-23981-6
DOI
https://doi.org/10.1007/978-3-642-23982-3
