About this book

Document Processing and Retrieval: TEXPROS focuses on the design and implementation of TEXPROS (a TEXt PROcessing System), a personal, customizable, intelligent office information and document processing system for text-oriented documents. The system supports the storage, classification, categorization, retrieval and reproduction of documents, as well as extracting, browsing, retrieving and synthesizing information from a variety of documents. In a multi-user or distributed environment, TEXPROS requires specific protocols for extracting, storing, transmitting and exchanging information.
The authors have used a variety of techniques to implement TEXPROS, including object-oriented programming, Tcl/Tk and the X Window System. The system can be used in many applications, such as digital libraries, software documentation and information delivery.
Audience: Provides in-depth, state-of-the-art coverage of information processing and retrieval, and documentation for such professionals as database specialists, information systems and software developers, and information providers.

Table of contents

Frontmatter

1. Introduction

Abstract
Information in an office environment is kept in documents. Documents may be text-oriented documents (such as letters, memoranda, electronic mail, reports, etc.) or non-text-oriented documents (such as images, graphics, audio and video data, etc.). The purpose of office information processing systems is to support office workers in their management of information and documents. TEXPROS (TEXt PROcessing System) [171] is a personal intelligent filing and retrieval oriented office information processing system which focuses on text-oriented documents, and has the following major features:
  • A state-of-the-art data model capable of capturing the behavior of the various office activities [106, 107, 108, 170].
  • Extracting the synopsis or the most significant information from a document (such information is often sufficient to satisfy the user’s needs when information retrieval occurs) [61, 62, 175].
  • A knowledge-based, customizable document classification handler that exploits both spatial and textual analysis to identify the type of a document [21, 22, 60, 61, 62, 147, 174, 175].
  • An agent-based architecture supporting document filing and file reorganization [143, 168, 169, 189].
  • A retrieval system that can handle incomplete and vague queries [90, 91, 92, 93, 94].
Qianhong Liu, Peter A. Ng

2. Data Model and Algebra for Office Document

Abstract
There has been tremendous interest in document modeling for office information systems. In this chapter, we introduce a new document model (called the D_model) and an algebraic language (called the D_algebra) for describing and manipulating documents encountered in the office environment [106, 107, 108].
Qianhong Liu, Peter A. Ng

3. Document Categorization

Abstract
The document model of TEXPROS discussed in Chapter 2 employs a dual approach to describing and classifying office documents by defining both a document type hierarchy and a folder organization (or logical filing structure). The document type hierarchy depicts the structural organization of the document types used in the problem domain. It identifies and organizes the structural commonalities among documents, and facilitates classifying various documents. The folder organization represents the user’s view of the document filing organization. In this chapter, we present two different architectures to implement the document filing organization [143, 168, 169, 189]. We start in Section 3.1 by giving a formal definition of the document model, including frame templates, a document type hierarchy, folders, and folder organizations. A frame template (document type) specifies the structure and components common to different documents or frame instances (document instances) of the same kind. The folder organization specifying the document filing view is defined using predicates and directed graphs. Then, in Section 3.2, we show how these concepts can be used to solve the Reconstruction Problem. We investigate under what circumstances it is possible to reconstruct a folder organization from its folder-level predicates. The results are expressed in terms of graph-theoretic concepts such as an associated digraph, transitive closure, and redundant/nonredundant filing paths.
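The graph-theoretic vocabulary in this abstract can be illustrated with a small sketch. It is not the book's algorithm: it assumes that a folder organization is an acyclic directed graph whose edges are direct filing links, and that a filing link is "redundant" when the same target folder can also be reached through an intermediate folder. The folder names and the function names are hypothetical.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return {folder: set of folders reachable through one or more links}."""
    graph = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        graph[parent].add(child)
        nodes.update((parent, child))
    closure = {}
    for start in nodes:
        seen = set()
        stack = list(graph[start])
        while stack:  # iterative depth-first search from `start`
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
        closure[start] = seen
    return closure


def redundant_links(edges):
    """Links (u, v) where v is also reachable from u via another child of u.

    Assumption: the folder organization is acyclic.
    """
    closure = transitive_closure(edges)
    graph = defaultdict(set)
    for parent, child in edges:
        graph[parent].add(child)
    redundant = set()
    for parent, child in edges:
        if any(other != child and child in closure[other]
               for other in graph[parent]):
            redundant.add((parent, child))
    return redundant


# Hypothetical folder organization with one shortcut link.
links = [
    ("Root", "Faculty"),
    ("Faculty", "Publications"),
    ("Root", "Publications"),  # shortcut that duplicates the longer path
]
closure = transitive_closure(links)
shortcuts = redundant_links(links)
```

Here `shortcuts` contains only the link from "Root" to "Publications", since "Publications" is already reachable through "Faculty".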
Qianhong Liu, Peter A. Ng

4. Document Classification and Information Extraction

Abstract
In Chapters 4 and 5, we turn our attention to the techniques used for document classification and information extraction [60, 61, 62, 174, 175]. In TEXPROS, the task of document classification is to determine the types of office documents. That is, given an office document, the document classification subsystem identifies the corresponding frame template of the document. Once the defined type of a document has been identified, efficient storage and access methods can be implemented to enhance retrieval performance. The task of information extraction is to extract from the contents of the document the information most relevant to the user. That is, given an office document, the information extraction subsystem forms its frame instance by instantiating its corresponding frame template. Both document classification and information extraction can be achieved with the aid of analyzing the document structures.
Qianhong Liu, Peter A. Ng

5. Knowledge-Based Document Classification

Abstract
In Chapter 4 we introduced the sample-based classification mechanism. A document sample base is a repository for all the document samples in the form of document sample trees. In order to find an appropriate sample, we compare the incoming document with all the samples in the sample base. To improve the efficiency of the sample-based classification mechanism, we present a knowledge-based document classification mechanism in this chapter [174, 175]. An inductive learning process is employed to learn the document type knowledge from document sample trees and generate a smaller number of document type trees. During the tree matching process, a document is assigned its type by matching its tree structure against these document type trees. Once the type of the document has been identified, we use the document sample trees of that type to do the format recognition and information extraction.
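To make the idea of classifying by tree matching concrete, here is a minimal sketch. The matching rule is an assumption, not the one defined in the book: a type tree matches a document tree when the root labels agree and every child of the type tree matches some child of the document, in the same left-to-right order. The tree encoding `(label, children)` and all labels are hypothetical.

```python
def matches(type_tree, doc_tree):
    """True if `type_tree` can be embedded top-down, in order, in `doc_tree`."""
    type_label, type_children = type_tree
    doc_label, doc_children = doc_tree
    if type_label != doc_label:
        return False
    pos = 0
    for type_child in type_children:
        # Greedily advance to the next document child that matches.
        while pos < len(doc_children) and not matches(type_child, doc_children[pos]):
            pos += 1
        if pos == len(doc_children):
            return False
        pos += 1
    return True


def classify(doc_tree, type_trees):
    """Return the name of the first matching type tree, or None."""
    for name, type_tree in type_trees.items():
        if matches(type_tree, doc_tree):
            return name
    return None


# Hypothetical document type trees.
TYPE_TREES = {
    "memo": ("document", [("header", [("to", []), ("from", [])]), ("body", [])]),
    "letter": ("document", [("header", [("address", [])]), ("body", []),
                            ("signature", [])]),
}

incoming = ("document", [
    ("header", [("to", []), ("from", []), ("date", [])]),
    ("body", []),
])
doc_type = classify(incoming, TYPE_TREES)
```

With these assumed trees, `doc_type` is `"memo"`. Comparing against a few type trees instead of every stored sample is the efficiency gain the abstract describes.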
Qianhong Liu, Peter A. Ng

6. Document Retrieval

Abstract
In Chapters 2 through 5 we discussed the data model and the classification and categorization mechanisms for office documents. Consider a collection of documents to be stored in an information base. From each document, a synopsis of information is extracted to form a frame instance (reminiscent of the tuple in the relational data model). Frame instances can be classified according to their types, which are called frame templates (reminiscent of the schema in the relational data model). The frame instances can be categorized based on the nature of their information and are placed in folders. Thus, a folder can contain a collection of frame instances of various frame template types.
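The relational analogy above can be sketched in a few lines of code. This is only an illustration under simplifying assumptions: a frame template is reduced to a name plus a flat list of attribute names, and a frame instance to a dictionary of extracted values. The class names, attribute names and sample values are hypothetical and are not taken from TEXPROS.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrameTemplate:
    """Document type: plays the role of a relational schema."""
    name: str
    attributes: tuple


@dataclass
class FrameInstance:
    """Synopsis of one document: plays the role of a tuple."""
    template: FrameTemplate
    values: dict

    def __post_init__(self):
        # Reject values for attributes the template does not define.
        unknown = set(self.values) - set(self.template.attributes)
        if unknown:
            raise ValueError(
                f"attributes not in template {self.template.name}: {sorted(unknown)}"
            )


@dataclass
class Folder:
    """A folder may hold frame instances of several template types."""
    name: str
    instances: list = field(default_factory=list)

    def template_names(self):
        return {instance.template.name for instance in self.instances}


memo = FrameTemplate("Memo", ("sender", "receiver", "date", "subject"))
letter = FrameTemplate("Letter", ("sender", "receiver", "date"))

folder = Folder("Admissions")
folder.instances.append(FrameInstance(memo, {"sender": "Dean", "subject": "Deadlines"}))
folder.instances.append(FrameInstance(letter, {"sender": "Applicant"}))
```

After these calls, `folder.template_names()` contains both "Memo" and "Letter", which is the point where the analogy with a single-schema relation stops.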
Qianhong Liu, Peter A. Ng

7. Query Transformation

Abstract
In Section 6.3 we presented the overall architecture of the retrieval system of TEXPROS, which is capable of processing incomplete and vague queries. In TEXPROS, an integrated system catalog provides a centralized retrieval environment for processing such queries. We begin this chapter by introducing the system catalog mechanism in Section 7.1. Section 7.1.1 defines the structure of the system catalog. In Section 7.1.2 we present methods of retrieval on the system catalog using an algebraic query language. The system catalog describes the document filing organization and document classification at the system level. In Section 7.1.3 we discuss how to manage the system catalog dynamically during document classification and filing.
Qianhong Liu, Peter A. Ng

8. Browser

Abstract
In Chapter 7, we discussed an efficient and standard method for retrieving information from databases, which is called systematic retrieval [114]. The user presents a request as a formal query. Upon receiving this query, the system executes the query transformation: where necessary, it finds the proper index terms corresponding to the keyterms in the user query by retrieving the system frame instances in the system catalog, and it then generates the equivalent algebraic queries by applying the algebraic operators to these index terms. There are some situations, however, in which systematic retrieval has difficulty achieving its objectives. For instance, the user may only have a vague retrieval target (e.g. What is John Smith?). Here, the user does not know exactly what kinds of information he needs until some kind of description is displayed to him. (The user needs to gain knowledge about both schemas and instances from the database.) In such situations, TEXPROS employs a browsing mechanism as a complementary retrieval method.
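The keyterm-to-index-term step of systematic retrieval can be sketched as a lookup. The sketch assumes the system catalog can be reduced to a dictionary from normalized keyterms to sets of index terms, which is a strong simplification of the system frame instances mentioned above. The function name and catalog contents are hypothetical, and no algebraic query is generated here.

```python
def transform_query(keyterms, catalog):
    """Map user keyterms to catalog index terms.

    Returns (resolved, unresolved): `resolved` maps each keyterm that was
    found to its sorted index terms, and `unresolved` lists the rest.
    """
    resolved = {}
    unresolved = []
    for term in keyterms:
        index_terms = catalog.get(term.strip().lower())
        if index_terms:
            resolved[term] = sorted(index_terms)
        else:
            unresolved.append(term)
    return resolved, unresolved


# Hypothetical catalog: normalized keyterm -> index terms used in frame templates.
CATALOG = {
    "author": {"Sender", "Writer"},
    "date": {"Date"},
}

resolved, unresolved = transform_query(["Author", "topic"], CATALOG)
```

In this run "Author" resolves to the index terms "Sender" and "Writer", while "topic" stays unresolved. A keyterm with no catalog entry is the kind of vague target for which browsing is offered as the complementary method.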
Qianhong Liu, Peter A. Ng

9. Generalizer

Abstract
In Chapter 7 we presented a query transformation mechanism for processing incomplete or imprecise queries in the document retrieval system, which primarily relieves users of the need to remember the precise terms of individual entities in the system. However, since the query entered by the user is less restrictive, the response given to the user by the system may be less cooperative. According to Kao et al. [79], the requirements for achieving cooperative responses from the system are as follows: (1) the maxim of quantity: be as informative as required; (2) the maxim of quality: contribute only when an adequate amount of evidence is present; (3) the maxim of relation: be relevant; and (4) the maxim of manner: avoid ambiguity.
Qianhong Liu, Peter A. Ng

Backmatter
