About this book

This book offers comprehensive coverage of information retrieval by considering both Text-Based Information Retrieval (TBIR) and Content-Based Image Retrieval (CBIR), together with new research topics. The approach to TBIR is based on creating a thesaurus, as well as on event classification and detection. N-gram thesaurus generation for query refinement offers a new method for improving the precision of retrieval, while the event classification and detection approaches aid in classifying and organizing information from web documents for domain-specific retrieval applications. Turning to CBIR, the book presents a histogram construction method based on human visual perception of colour. The book’s overarching goal is to introduce readers to new ideas in an easy-to-follow manner.

Table of contents

Frontmatter

Chapter 1. Intelligent Rule-Based Deep Web Crawler

Abstract
In this chapter, the architecture specification of a deep web crawler is discussed. The crawler has an indexer capable of fetching huge numbers of documents from both the surface and the deep web. Documents from the deep web are fetched based on rules, in which the core and allied fields of forms play an important role. Based on the domain and the nature of the FORM in HTML pages, functional dependencies between the fields are analysed and core and allied fields are identified. An SVM classifier categorizes each rule as most preferable, least preferable or mutually exclusive, and documents are fetched using the most preferable fields in the FORM. Each fetched document is indexed, and the architecture is scaled to support distributed functionality with the help of web services. The architecture processes a huge number of documents with an encouraging coverage rate and low fetching time. The retrieval performance of the crawler is compared with the Google retrieval system, and the proposed architecture is found to achieve similar precision of retrieval.
S. G. Shaila, A. Vadivel
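
As a rough illustration of the rule-classification step, the sketch below trains a linear SVM to label form-field rules as most preferable, least preferable or mutually exclusive. The feature set (core/allied flags, dependency strength, fill rate) is an assumption for illustration, not the chapter's actual feature design.

```python
# Hypothetical sketch: classifying form-field rules into the three
# categories named in the chapter with a linear SVM. The features
# [is_core, is_allied, dependency_strength, fill_rate] are illustrative.
from sklearn.svm import SVC

X = [
    [1, 0, 0.9, 0.8],   # core field, strong functional dependency
    [0, 1, 0.4, 0.6],   # allied field, moderate dependency
    [1, 1, 0.1, 0.2],   # conflicting core/allied combination
    [1, 0, 0.8, 0.9],
    [0, 1, 0.3, 0.5],
    [1, 1, 0.2, 0.1],
]
y = ["most_preferable", "least_preferable", "mutually_exclusive",
     "most_preferable", "least_preferable", "mutually_exclusive"]

clf = SVC(kernel="linear").fit(X, y)

# Classify a new rule extracted from an HTML FORM.
print(clf.predict([[1, 0, 0.85, 0.7]]))   # -> ['most_preferable']
```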

Chapter 2. Information Classification and Organization Using Neuro-Fuzzy Model for Event Pattern Retrieval

Abstract
Classifying sentences that describe Events is an important task for many applications. In this chapter, Event patterns are identified and extracted at the sentence level using term features. The terms that trigger Events, along with their sentences, are extracted from web documents, and the sentence structures are analysed using POS tags. A hierarchical sentence classification model is presented that considers specific term features of the sentence, and classification rules are derived. These rules fail to define a clear boundary between the patterns, creating ambiguity and imprecision. To overcome this, suitable fuzzy rules are derived that give importance to all term features of the sentence; the fuzzy rules are constructed with more variables and generate sixteen patterns. An adaptive neuro-fuzzy inference system (ANFIS) model is presented for training and classifying the sentence patterns, capturing the knowledge present in the sentences. The obtained patterns are assigned linguistic grades based on previous classification knowledge; these grades represent the type and quality of information in the patterns. A membership function is used to evaluate the fuzzy rules, and the patterns share membership values in the range [0, 1], which determine the weights for each pattern. The higher-weighted patterns are then used to build an Event corpus, which helps in retrieving useful and interesting information about Event instances. The classification performance of the presented approach is evaluated for the ‘Crime’ Event by crawling documents from the WWW, and for the ‘Die’ Event on a benchmark dataset. The performance is found to be encouraging when compared with recently proposed similar approaches.
S. G. Shaila, A. Vadivel
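
A minimal sketch of the fuzzy-weighting idea described above: each sentence pattern receives a membership value in [0, 1] computed from its term features, and higher-weighted patterns would be kept for the Event corpus. The membership shapes, feature names and threshold are assumptions for illustration, not the chapter's actual ANFIS model.

```python
# Hypothetical sketch of fuzzy pattern weighting; all shapes and
# thresholds below are assumptions, not the book's values.

def triangular(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def pattern_weight(trigger_score, pos_score):
    """Fuzzy AND (min) over two term-feature memberships."""
    high_trigger = triangular(trigger_score, 0.3, 1.0, 1.7)
    good_pos = triangular(pos_score, 0.2, 0.8, 1.4)
    return min(high_trigger, good_pos)

# Example: a sentence whose Event-trigger and POS features are strong.
w = pattern_weight(trigger_score=0.9, pos_score=0.75)
if w > 0.5:          # assumed cut-off for inclusion in the Event corpus
    print(f"keep pattern, weight={w:.2f}")
```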

Chapter 3. Constructing Thesaurus Using TAG Term Weight for Query Expansion in Information Retrieval Application

Abstract
In information retrieval applications, query expansion is an important procedure for improving the precision of retrieval. This chapter discusses an N-gram thesaurus, generated from the content of web documents, for expanding the query. The TAGs of HTML pages are parsed, and the text present within each TAG is assigned a weight based on the nature of the TAG. The total weight of a text is calculated as the sum of its TAG weight and its frequency of occurrence. The thesaurus is updated with single terms or texts as unigrams; similarly, the N-gram thesaurus is updated with N-term texts along with their total weights. Given a query, its term(s) are looked up in the corresponding thesaurus to obtain a set of predicted queries. The set is ordered by total weight, and the user selects any of the term(s) as a preference. Benchmark datasets such as ClueWeb09B, WT10g and GOV2 are used for the experiments, with a threshold value fixed as the baseline. The proposed approach gains 8%, 19% and 30% on ClueWeb09B, WT10g and GOV2, respectively. In addition, KLDCo and BoCo are used as benchmarks for evaluating the performance of the presented approach in terms of query refinement; the MAP and MRR are on the higher side against the baseline.
S. G. Shaila, A. Vadivel
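
The TAG-weighting scheme can be sketched as follows: text inside more "important" HTML TAGs receives a higher weight, and a term's total weight accumulates its TAG weight plus a frequency contribution. The tag-weight table and scoring below are illustrative assumptions, not the chapter's actual values.

```python
# Hypothetical sketch of TAG-based term weighting for a unigram
# thesaurus; the weight table and scoring are assumptions.
from collections import defaultdict
from html.parser import HTMLParser

TAG_WEIGHT = {"title": 1.0, "h1": 0.9, "h2": 0.8, "b": 0.6, "p": 0.3}

class TagTermWeighter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.weights = [], defaultdict(float)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "p"
        for term in data.lower().split():
            # total weight = TAG weight + frequency contribution
            self.weights[term] += TAG_WEIGHT.get(tag, 0.1) + 1.0

p = TagTermWeighter()
p.feed("<title>query expansion</title><p>query refinement for retrieval</p>")

# Thesaurus lookup: candidate expansions ordered by total weight.
print(sorted(p.weights.items(), key=lambda kv: -kv[1]))
```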

Chapter 4. Smooth Weighted Colour Histogram Using Human Visual Perception for Content-Based Image Retrieval Applications

Abstract
In this chapter, a histogram is constructed based on human colour visual perception for content-based image retrieval. For each pixel, the true-colour and grey-colour proportions are calculated using a suitable weight function. During histogram construction, the hue and intensity values are iteratively distributed to the neighbouring bins. The NBS distance between the colour values of the reference bin and the adjacent bins is estimated; this distance gives the proportion of overlap between the colour of the reference bin and the adjacent bins, and the weight is updated accordingly. Constructing the histogram in this way uses minute colour information and captures complex background colour content, making it possible to extract the background colour information effectively along with the foreground information. The low-level features of all the database images are extracted and stored in a feature database. Relevant images are retrieved for a query image based on the similarity ranking between the query and database images, with Manhattan distance used as the similarity measure. The performance of the presented approach on the Corel benchmark dataset is encouraging, and the precision of retrieval is compared with that of similar works.
S. G. Shaila, A. Vadivel
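
A minimal sketch of the smoothing idea: each pixel splits its contribution between a true-colour and a grey histogram according to its saturation, and spills part of its hue weight into the neighbouring bins. The saturation-based weight function and the fixed spill fraction are assumptions; the chapter's actual weight function and NBS-distance update are not reproduced here.

```python
# Hypothetical sketch of a smooth weighted colour histogram; the
# saturation-based split and the spill fraction are assumptions.
import colorsys

N_BINS = 36
SPILL = 0.25          # assumed fraction spilled into each adjacent bin

def add_pixel(hist_colour, hist_grey, r, g, b):
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    true_w, grey_w = s, 1.0 - s          # colour vs grey proportion
    bin_i = int(h * N_BINS) % N_BINS
    # distribute the hue weight over the reference and adjacent bins
    hist_colour[bin_i] += true_w * (1 - 2 * SPILL)
    hist_colour[(bin_i - 1) % N_BINS] += true_w * SPILL
    hist_colour[(bin_i + 1) % N_BINS] += true_w * SPILL
    hist_grey[int(v * (N_BINS - 1))] += grey_w

colour = [0.0] * N_BINS
grey = [0.0] * N_BINS
add_pixel(colour, grey, 200, 40, 40)     # a saturated red pixel
print(max(colour), colour.index(max(colour)))
```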

Chapter 5. Cluster Indexing and GR Encoding with Similarity Measure for CBIR Applications

Abstract
In content-based image retrieval applications, finding relevant images requires an exhaustive search of the image database, which is not scalable. This chapter presents an indexing scheme, an encoding scheme and a similarity measure for handling this scalability issue. An image is represented in terms of a colour feature, and the bin content of the feature is analysed to understand the colour content of the image. Based on the bin values and their contribution to the colour information, the size of the feature is truncated. The features are clustered based on the dimension of the histogram, and the bin values of the truncated feature are encoded with the Golomb–Rice (GR) coding scheme. The similarity between the query and a database image is calculated by measuring the degree of overlap in terms of bins and their content. Benchmark datasets are used for evaluating the performance of all the proposed schemes.
S. G. Shaila, A. Vadivel
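
Golomb–Rice coding itself is standard: with parameter M = 2^k, a value n is encoded as its quotient n >> k in unary followed by its k-bit remainder in binary. The sketch below shows this for a few illustrative bin values; the choice k = 2 and the example values are assumptions, not the chapter's configuration.

```python
# Standard Golomb-Rice coding applied to illustrative bin values.

def gr_encode(n, k):
    """Golomb-Rice code: unary quotient, then k-bit remainder (M = 2^k)."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def gr_decode(bits, k):
    """Decode a single Golomb-Rice codeword produced by gr_encode."""
    q = bits.index("0")                  # unary part ends at the first 0
    return (q << k) | int(bits[q + 1 : q + 1 + k], 2)

bins = [3, 0, 7, 1]                      # example truncated bin values
codes = [gr_encode(b, k=2) for b in bins]
print(codes)                             # ['011', '000', '1011', '001']
assert all(gr_decode(c, 2) == b for c, b in zip(codes, bins))
```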

Backmatter
