About this book

This second edition provides a systematic introduction to the work and views of the emerging patent-search research and innovation communities as well as an overview of what has been achieved and, perhaps even more importantly, of what remains to be achieved. It revises many of the contributions of the first edition and adds a significant number of new ones.

The first part “Introduction to Patent Searching” includes two overview chapters on the peculiarities of patent searching and on contemporary search technology respectively, and thus sets the scene for the subsequent parts. The second part on “Evaluating Patent Retrieval” then begins with two chapters dedicated to patent evaluation campaigns, followed by two chapters discussing complementary issues from the perspective of patent searchers and from the perspective of related domains, notably legal search. The third part, “High Recall Search”, includes four completely new chapters dealing with the issue of finding only the relevant documents in a reasonable time span. The last (and, with six papers, the largest) part on “Special Topics in Patent Information Retrieval” covers a large spectrum of research in the patent field, from classification and image processing to translation. Lastly, the book is completed by an outlook on open issues and future research.

Several of the chapters have been jointly written by intellectual property and information retrieval experts. In addition, members of both communities with a background different to that of the primary author have reviewed the chapters, making the book accessible to both the patent search community and the information retrieval research community. The book offers the latest findings for academic researchers and is also a valuable resource for IP professionals wanting to learn about current IR approaches in the patent domain.



Introduction to Patent Searching


Chapter 1. Introduction to Patent Searching

Practical Experience and Requirements for Searching the Patent Space
This chapter introduces patent search in a way that should be accessible and useful to both researchers in information retrieval and other areas of computer science and professionals seeking to broaden their knowledge of patent search. It gives an overview of the process of patent search, including the different forms of patent search. It goes on to describe the differences among different domains of patent search (engineering, chemicals, gene sequences and so on) and the tools currently used by searchers in each domain. It concludes with an overview of open issues.
Doreen Alberts, Cynthia Barcelon Yang, Denise Fobare-DePonio, Ken Koubek, Suzanne Robins, Matthew Rodgers, Edlyn Simmons, Dominic DeMarco

Chapter 2. An Introduction to Contemporary Search Technology

This chapter is the counterpart of the preceding chapter. It gives an overview of some of the most important terms and concepts used in search technology and information retrieval (IR) today. We hope it can be useful to readers who are not researchers in these areas. After a short dip into the history of the field, we start with a high-level overview of the different types of search, then move on to the gap between user requirements and how search systems can be evaluated and finally narrow it down to the main evaluation methodology used today. This is followed by a step-by-step guide to the architectural components of a generic full-text document search system and its design implications. We then describe how the underlying models define to a large extent what the system can and cannot do. This chapter concludes with a short introduction to semantic search and an outlook on the challenges in patent IR, the main subject of this book.
Mihai Lupu, Florina Piroi, Veronika Stefanov

Evaluating Patent Retrieval


Chapter 3. Patent-Related Tasks at NTCIR

The NII Testbeds and Community for Information access Research (NTCIR) was the first benchmarking campaign to create a test collection specifically for patent retrieval, in 2001/2002. Over the course of just over a decade, organisers and participants at NTCIR patent-related challenges have addressed the problem of mono- and multilingual patent search and automated translation. In doing so, the only available East Asian language patent test collections have been created and made publicly available for research purposes. This chapter provides a reference summary of the efforts undertaken in NTCIR, helping the reader understand the challenges addressed, the datasets created and the solutions observed.
Mihai Lupu, Atsushi Fujii, Douglas W. Oard, Makoto Iwayama, Noriko Kando

Chapter 4. Evaluating Information Retrieval Systems on European Patent Data: The CLEF-IP Campaign

Although not always evident, patents have an economic and legal impact on our everyday life. The increase of digitally available patent data has triggered a growing interest in the use of information retrieval solutions in the intellectual property (IP) domain. The CLEF-IP benchmarking activity took place from 2009 to 2013 as part of the Conference and Labs of the Evaluation Forum (CLEF). It encouraged and facilitated research in multilingual and multimodal patent retrieval by providing a clean and comprehensive data set for experimentation and realistic retrieval tasks. We describe in this chapter the collection of patents used in the evaluation campaign and the motivation behind the campaign’s tasks. We describe each of the seven types of task that were organised. We explain how the topics and judgements for each of the tasks were created, as well as the measures involved in assessing the experiments submitted. All the data used in our evaluation activities can be downloaded from the CLEF-IP website under a Creative Commons licence.
Florina Piroi, Allan Hanbury

Chapter 5. Evaluating Real Patent Retrieval Effectiveness

In this chapter we consider the nature of information retrieval evaluation for patent searching. We outline the challenges involved in conducting patent searches and the commercial risks inherent in patent searching. We highlight some of the main challenges of reconciling how we evaluate retrieval systems in the laboratory and the needs of patent searchers, concluding with suggestions for the development of more informative evaluation procedures for patent searching.
Anthony Trippe, Ian Ruthven

Chapter 6. Measuring Effectiveness in the TREC Legal Track

In this chapter, we report our experiences from attempting to measure the effectiveness of large electronic discovery (e-Discovery) result sets in the Text Retrieval Conference (TREC) Legal Track campaigns of 2006–2011. For effectiveness measures, we have focused on recall, precision and F1. We state the estimators that we have used for these measures, and we outline both the rank-based and set-based approaches to sampling that we have taken. We share our experiences with the sampling error in the resulting estimates for the absolute effectiveness on individual topics, relative effectiveness on individual topics, mean effectiveness across topics and relative effectiveness across topics. Finally, we discuss our experiences with assessor error, which we have found has often had a larger impact than sampling error.
Stephen Tomlinson, Bruce Hedin
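
As a rough illustration of the three measures named in the Chapter 6 abstract, the following minimal sketch computes set-based precision, recall and F1 for a fully judged result set. It does not reproduce the sampling-based estimators the chapter describes; the function name and the document identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Plain set-based precision, recall and F1 (no sampling, no estimation)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four documents retrieved, three documents relevant.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5"])
```

In a real e-Discovery evaluation the full set of relevant documents is unknown, which is why the chapter works with estimators computed from samples.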

High Recall Search


Chapter 7. Retrieval Models Versus Retrievability

Retrievability is an important measure in information retrieval (IR) that can be used to analyse retrieval models and document collections. Rather than focusing only on the small set of documents covered by relevance judgements, retrievability examines what is retrieved, how frequently it is retrieved and how much effort is needed to retrieve it. Such a measure is of particular interest in recall-oriented retrieval settings (e.g. patent or legal retrieval), because in this context a document needs to be retrieved before it can be judged for relevance. If a retrieval model makes some patents hard to find, patent searchers could miss relevant documents just because of the bias of the retrieval model. In this chapter we explain the concept of retrievability in information retrieval. We also explain how it can be estimated and how it can be used to analyse the retrieval bias of retrieval models. Finally, we show how retrievability relates to effectiveness by analysing the relationship between retrievability and effectiveness measures, and how the retrievability measure can be used to improve effectiveness.
Shariq Bashir, Andreas Rauber
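
The Chapter 7 abstract does not state a formula, so the sketch below rests on two assumptions taken from common usage in the retrievability literature: a document's score is the number of queries that return it within a rank cutoff, and the bias of a retrieval model is summarised by the Gini coefficient of those scores. The function names, queries and document identifiers are hypothetical.

```python
from collections import Counter


def retrievability(rankings, doc_ids, cutoff):
    """Count, per document, how many queries rank it within the top `cutoff`.

    `rankings` maps each query to its ranked list of document ids.
    """
    counts = Counter()
    for ranked in rankings.values():
        counts.update(set(ranked[:cutoff]))
    # Documents that are never retrieved get an explicit score of 0.
    return {doc_id: counts[doc_id] for doc_id in doc_ids}


def gini(values):
    """Gini coefficient: 0 for perfectly equal scores, near 1 for high bias."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical rankings produced by one retrieval model for three queries.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["a", "c", "b"],
    "q3": ["b", "a", "d"],
}
scores = retrievability(rankings, ["a", "b", "c", "d"], cutoff=2)
bias = gini(scores.values())
```

Comparing the Gini value of two retrieval models over the same query set and collection gives a rough indication of which one makes documents more unevenly findable.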

Chapter 8. Federated Patent Search

Federated search, also known as distributed information retrieval (DIR), is a technique for searching multiple text collections simultaneously. This chapter presents the basic components of a typical federated search system and the main technical challenges in each component during its operation. We briefly review the methods and techniques of federated search and how these can be applied in the patent domain. We discuss problems that are usually ignored in DIR research but that need to be addressed in practice in real federated patent search systems. We also present PerFedPat, an interactive patent search system based on the federated search approach. PerFedPat provides core services for searching multiple online patent resources with a federated method, thus providing parallel access to multiple patent sources. PerFedPat hides complexity from the end user, who uses a single common query tool to query all patent datasets at the same time. The second innovative feature of PerFedPat is that it has a pluggable and extensible architecture, and therefore it enables the use of multiple search tools that are integrated in PerFedPat. We present an example of such a tool, the IPC suggestion tool, which uses a federated search technique (specifically source selection) that exploits topically organised patents (using their intellectually assigned classification codes) to support patent searches by automated IPC suggestion. This tool shows how DIR techniques can be applied beyond the typical scenario of implementing a federated search system.
Michail Salampasis
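
One component of a federated search system is the merging of result lists returned by different sources. The sketch below shows one simple way this could be done, assuming each source returns (document id, score) pairs on its own score scale and that min-max normalisation per source is acceptable. It is not a description of how PerFedPat merges results; the source names and identifiers are hypothetical.

```python
def merge_results(results_by_source, top_k=10):
    """Merge per-source result lists after min-max normalising their scores."""
    merged = []
    for source, hits in results_by_source.items():
        if not hits:
            continue
        scores = [score for _, score in hits]
        lowest, span = min(scores), max(scores) - min(scores)
        for doc_id, score in hits:
            # A source whose hits all share one score is mapped to 1.0.
            normalised = (score - lowest) / span if span else 1.0
            merged.append((normalised, source, doc_id))
    # Highest normalised score first; ties broken by source, then document id.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(source, doc_id, score) for score, source, doc_id in merged[:top_k]]


# Hypothetical result lists from two patent sources with different score scales.
results = {
    "source_a": [("EP1", 12.0), ("EP2", 6.0)],
    "source_b": [("US1", 0.9), ("US2", 0.6), ("US3", 0.3)],
}
top = merge_results(results, top_k=3)
```

Min-max normalisation is only one option; it ignores how reliable each source is, which is one reason source selection and merging are treated as separate research problems.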

Chapter 9. The Portability of Three Types of Text Mining Techniques into the Patent Text Genre

In this chapter, we examine the portability of several well-known text mining techniques to patent text. We test the techniques by addressing three different relation extraction applications: acronym extraction, hyponymy extraction and factoid entity relation extraction. These applications require different types of natural language processing tools, from simple regular expression matching (acronym extraction), to part-of-speech tagging and phrase chunking (hyponymy extraction), to a full-blown dependency parser (factoid extraction). With the relation extraction applications presented in this chapter, we want to elucidate the requirements placed on general natural language processing tools when they are deployed on patent text for a specific extraction task. On the other hand, we also present language technology methods which are already portable to the patent genre with no or only moderate adaptations to the text genre.
Linda Andersson, Allan Hanbury, Andreas Rauber
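
To give a feel for the simplest of the three applications in the Chapter 9 abstract, the sketch below extracts acronyms with a regular expression. It assumes the pattern "long form (ACRONYM)" in which each letter of the acronym is the initial of one preceding word; this is an illustrative heuristic, not the method evaluated in the chapter, and the function name is hypothetical.

```python
import re

# An acronym candidate: two to six capital letters inside parentheses.
ACRONYM = re.compile(r"\(([A-Z]{2,6})\)")


def extract_acronyms(text):
    """Return (acronym, long form) pairs whose word initials spell the acronym."""
    pairs = []
    for match in ACRONYM.finditer(text):
        short = match.group(1)
        # Take as many preceding words as the acronym has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short:
            pairs.append((short, " ".join(candidate)))
    return pairs


sentence = (
    "Federated search, also known as distributed information retrieval (DIR), "
    "differs from machine translation (MT)."
)
found = extract_acronyms(sentence)
```

A heuristic this strict misses acronyms that skip function words or use internal letters, which hints at why even simple techniques may need adaptation for patent text.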

Chapter 10. Visual Analysis of Patent Data Through Global Maps and Overlays

Visual analytics has been increasingly used to help to better grasp the complexity and evolution of scientific and technological activities over time, across science and technological areas and in organisations. This chapter presents general insights into some important fields of expertise such as mapping, network analysis and visual analytics applied to patent information retrieval and analysis. We also present a new global patent map and overlay technique and illustrative examples of its application. The concluding remarks offer considerations for future patent analysis and visualisation.
Luciano Kay, Alan L. Porter, Jan Youtie, Nils Newman, Ismael Ràfols

Special Topics in Patent Retrieval


Chapter 11. Patent Classification on Subgroup Level Using Balanced Winnow

In the past decade research into automated patent classification has mainly focused on the higher levels of International Patent Classification (IPC) hierarchy. The patent community has expressed a need for more precise classification to better aid current pre-classification and retrieval efforts (Benzineb and Guyot, Current challenges in patent information retrieval. Springer, New York, pp 239–261, 2011). In this chapter we investigate the three main difficulties associated with automated classification on the lowest level in the IPC, i.e. subgroup level. In an effort to improve classification accuracy on this level, we (1) compare flat classification with a two-step hierarchical system which models the IPC hierarchy and (2) examine the impact of combining unigrams with PoS-filtered skipgrams on both the subclass and subgroup levels. We present experiments on English patent abstracts from the well-known WIPO-alpha benchmark data set, as well as from the more realistic CLEF-IP 2010 data set. We find that the flat and hierarchical classification approaches achieve similar performance on a small data set but that the latter is much more feasible under real-life conditions. Additionally, we find that combining unigram and skipgram features leads to similar and highly significant improvements in classification performance (over unigram-only features) on both the subclass and subgroup levels, but only if sufficient training data is available.
Eva D’hondt, Suzan Verberne, Nelleke Oostdijk, Lou Boves
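
The Chapter 11 abstract mentions combining unigrams with skipgrams as classification features. A minimal sketch of that feature combination follows, assuming a skipgram is an ordered word pair with at most `max_skip` words in between. The part-of-speech filtering used in the chapter is omitted because it would require a tagger, and the function name and the `max_skip` parameter are hypothetical.

```python
def unigrams_and_skipgrams(tokens, max_skip=2):
    """Return unigram features followed by skipgram (word pair) features."""
    features = list(tokens)  # every token is a unigram feature
    for i, left in enumerate(tokens):
        # Pair the token with each later token at most `max_skip` words away.
        last = min(i + max_skip + 2, len(tokens))
        for j in range(i + 1, last):
            features.append(f"{left}_{tokens[j]}")
    return features


# Hypothetical tokenised fragment of a patent abstract.
features = unigrams_and_skipgrams(["rotor", "blade", "cooling", "channel"], max_skip=1)
```

With `max_skip=0` the pairs reduce to ordinary bigrams, so the parameter controls how much additional, sparser evidence is added on top of the unigrams.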

Chapter 12. Document Image Classification, with a Specific View on Applications of Patent Images

The main focus of this chapter is document image classification and retrieval, where we analyse and compare different parameters for the run-length histogram and Fisher vector-based image representations. We do an exhaustive experimental study using different document image data sets, including the MARG benchmarks, two data sets built on customer data and the images from the patent image classification task of the CLEF-IP 2011. The aim of the study is to give guidelines on how to best choose the parameters such that the same features perform well on different tasks. As an example of such need, we describe the image-based patent retrieval tasks of CLEF-IP 2011, where we used the same image representation to predict the image type and retrieve relevant patents.
Gabriela Csurka

Chapter 13. Flowchart Recognition in Patent Information Retrieval

In this chapter, we will analyse the current technologies available that deal with graphical information in patent retrieval applications and, in particular, with the problem of recognising and understanding information carried by flowcharts. We will review some of the state-of-the-art techniques that have arisen from the graphics recognition community and their application in the intellectual property domain. We will present an overview of the different steps that make up a flowchart recognition system, looking also at the achievements and remaining challenges in such a domain.
Marçal Rusiñol, Josep Lladós

Chapter 14. Modern Approaches to Chemical Image Recognition

Millions of existing patent documents and journal articles dealing with chemistry describe chemical structures by way of structure images (so-called Kekulé structures). While being human-readable, these structure images cannot be interpreted by a computer and are unusable in the context of most chemoinformatics applications: structure and substructure searches, chemo-biological property calculations, etc. There are currently many formats available for storing structural information in a computer-readable format, but the conversion of millions of images by hand is a cumbersome and time-consuming process. Therefore there is a need for an automatic tool for converting images into structures. One of the first such tools was presented at ICDAR in 1993 (OROCS). We would like to present modern developments in optical structure recognition which build upon the ideas developed earlier and add modern enhancements to the process of automatic extraction of structure images from the surrounding text and graphics and conversion of the extracted images into a molecular format. We describe in detail two top performing chemical OCR applications—one open source and one academic software package. The performance here was judged by TREC-CHEM 2011 and CLEF 2012 challenges.
Igor V. Filippov, Mihai Lupu, Alan P. Sexton

Chapter 15. Representation and Searching of Chemical Structure Information in Patents

This chapter describes the techniques that are used to represent and to search for molecular structures in chemical patents. There are two types of structure: specific structures that describe individual molecules and generic structures that describe sets of structurally related molecules. Methods for representing and searching specific structures have been well established for many years, and the techniques are also applicable, albeit with substantial modification, to the processing of generic structures.
Geoff M. Downs, John D. Holliday, Peter Willett

Chapter 16. Machine Translation and the Challenge of Patents

In this chapter, machine translation (MT) is first introduced in the context of patent information, and we touch upon what role it can play at various points in the intellectual property (IP) life cycle. We then step back to take a high-level look at what exactly defines MT, how it works, what makes it such a difficult task, as well as some of the more recent advances to overcome these hurdles and how we can go about ensuring that MT systems we develop are actually fit for purpose.
We then explore patent information as an application area for MT and describe how it presents a unique challenge not only for MT but for language technology in general. Finally, we take a closer look at some use cases involving MT and patents to show how they are already bringing significant value to consumers, but that there remains plenty of room for improvement.
John Tinsley

Chapter 17. Future Patent Search

In this chapter we make some predictions for patent search in about 10 years’ time—in 2026. We base these predictions on the contents of the earlier part of the book, the observed differences between this second edition and the first edition of the book as well as on some data and trends not well represented in the book (for one reason or another). We consider primarily incorporating knowledge of different sorts of patent search into the patent search process; utilising knowledge of the subject domain of the search into the patent search system; utilising multiple sources of data within the search system; the need to address the requirement to deal with multiple languages in patent search; and the need to provide effective visualisation of the results of patent searches. We conclude the real need is to find ways to support search independent of language or location.
Barrou Diallo, Mihai Lupu