1 Introduction
This research is in the field of Information Retrieval (IR), specifically in the subfield of professional search in the patent domain. The number of patents related to artificial intelligence, big data, and the Internet of Things has grown tremendously in recent years [2]. The increasing number of patent applications filed every year makes the need for better patent search systems inevitable. Patents and other innovation-related documents can be found in patent offices, online datasets, and resources that typically must be searched using various patent search systems and other online services such as Espacenet, Google Patents, bibliographic search, and many more [3].
From an information task perspective, patent retrieval tasks are typically recall-oriented [4]; therefore, retrieving all the patent documents related to a patent application is crucially important. Otherwise, there may be a significant economic impact due to lawsuits for patent infringement [5]. Thus, in professional search, it is vital to search effectively across all the potentially distributed resources containing patents or other patent-related data.
To that end, the Federated Search (FS) approach aims to solve the problem of effectively searching across all resources containing patent information. FS systems implement a Distributed Information Retrieval (DIR) scenario that permits the simultaneous search of multiple searchable, remote, and potentially physically distributed resources.
There are different patent search tasks with different purposes, such as prior-art search, infringement search, freedom-to-operate search, etc. In this work, the focus is on prior-art search, a task in which the novelty of an idea is examined [6]. Typically, users express their information need using Boolean queries [7]. I plan to investigate methods and architectures in patent retrieval, use Artificial Intelligence (AI) end-to-end processes to improve patent search and retrieval effectiveness, and propose future search engines.
2 Patent Search Characteristics
Patent search can be considered a specific example of Information Retrieval, i.e., finding relevant information of an unstructured nature in huge collections [8], and has been considered a complex area. Patent text differs from regular text: sentences used in patent documents are usually longer than general-use sentences [9].
More specifically, Iwayama [10] found that patent documents are 24 times the length of news documents. The syntactic structure of patent language is also a big challenge, as found by Verberne [9]. The same study also found that patent authors tend to use multi-word expressions to introduce novel terms. Another challenge in patent search is the vocabulary mismatch problem, i.e., the absence of common words between two relevant documents. Magdy et al. [11] showed that 12% of the relevant documents for topics from CLEF-IP 2009 have no words in common with the respective topics. All of this makes patent search a complicated process.
Researchers have categorized methodologies for patent search and retrieval. Lupu & Hanbury [12] summarized methods for patent retrieval, divided into text-based methodologies (Bag of Words, Latent Semantic Analysis, Natural Language Processing), query creation/modification methodologies, metadata-based methodologies, and drawing-based methodologies. Khode & Jambhorkar [13] split the procedures for patent retrieval into those based on IPC codes and those based on patent features and query formulation. More recently, Shalaby et al. [14] broke patent retrieval into the following categories: keyword-based methods, pseudo-relevance feedback methods, semantic-based methods, metadata-based methods, and interactive methods.
In recent years, there has been a shift in research toward neural approaches for IR, a new and developing field [15]. Transformer models like BERT [16] have achieved impressive results on various NLP tasks. However, the use of BERT for patent retrieval has not been investigated enough. While BERT has drawn lots of attention in patent-industry research, it is either used for classification [2, 17] or did not work as expected for patent retrieval [18]. Dense retrieval [19] is a new neural method for search and, given the particular characteristics of the patent industry, it is expected to alleviate problems like vocabulary mismatch and improve retrieval effectiveness. Generally, the use of AI techniques in the patent industry has drawn lots of attention and is currently an active area of research [7, 20, 21].
4 Summary of My Research so Far
1a) What is the effect of machine learning algorithms on result merging when searching for patents in federated environments?
The result merging problem was studied as a general DIR problem and not in the specific context of the patent domain. It appeared in research many years ago; one of the first works that conducted experiments on results merging is [22]. After that, many algorithms were presented in the relevant literature.
A very widely used and robust estimation method is the Collection Retrieval Inference network (CORI) [23]. CORI applies a simple heuristic formula: a linear combination of the score of the document returned by the collection and the source selection score. It thereby normalizes the collection-specific scores and produces globally comparable scores.
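As an illustration, the CORI-style combination can be sketched as follows. This is a minimal sketch: the min-max normalization and the 0.4/1.4 weighting follow the commonly cited heuristic, but the exact constants and normalization details vary across implementations.

```python
def cori_merge(doc_scores, collection_score, lam=0.4):
    """CORI-style merge heuristic: min-max normalize the collection-local
    document scores to [0, 1], then combine each with the (already
    normalized) source-selection score of the collection.

    doc_scores: list of (doc_id, local_score) from one collection.
    collection_score: normalized source-selection score in [0, 1].
    """
    scores = [s for _, s in doc_scores]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a constant score list
    merged = []
    for doc_id, s in doc_scores:
        d = (s - lo) / span                                # normalized document score
        g = (d + lam * d * collection_score) / (1 + lam)   # heuristic combination
        merged.append((doc_id, g))
    return merged
```

Each collection's result list is transformed with its own source-selection score, after which the per-collection lists can simply be concatenated and sorted by the combined score.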
Another effective estimation algorithm is the semi-supervised learning algorithm (SSL) [24], which is based on linear regression. The SSL algorithm, proposed by Si and Callan, applies linear regression to map local collection scores to globally comparable scores. To achieve that, the algorithm operates, for each query, on the documents that are common between a collection's results and a centralized index built from samples of all the collections.
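A minimal sketch of the SSL idea, assuming a simple per-collection linear mapping fitted by ordinary least squares on the overlap documents (the published algorithm's handling of multiple collections and its training details are more involved):

```python
def ssl_fit(local_scores, central_scores):
    """Fit y = a*x + b mapping a collection's local scores (x) to the
    scores the same overlap documents received in the centralized
    sample index (y), via closed-form least squares."""
    n = len(local_scores)
    mx = sum(local_scores) / n
    my = sum(central_scores) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(local_scores, central_scores))
    sxx = sum((x - mx) ** 2 for x in local_scores)
    a = sxy / sxx
    b = my - a * mx
    return a, b

def ssl_apply(a, b, local_score):
    """Convert any local score from that collection into a globally comparable one."""
    return a * local_score + b
```

Once the per-collection mapping is fitted from the overlap documents, it is applied to every document the collection returned, making scores from different collections directly comparable.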
SAFE (sample-agglomerate fitting estimate) is a more recent algorithm designed to function in uncooperative environments [25]. SAFE is based on the principle that the results over the sampled documents for each query form a sub-ranking of the original collection, so this sub-ranking can be used for curve fitting in order to predict the original scores.
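The curve-fitting principle behind SAFE can be sketched as below. The log-linear score model and the simple rank-scaling factor are illustrative assumptions for this sketch, not the exact fitting function of the published algorithm.

```python
import math

def safe_estimate(sample_scores, sample_size, collection_size, top_k):
    """SAFE-style sketch: treat the sample ranking as a sub-ranking of the
    full collection, map sample rank i to an estimated original rank
    i * (collection_size / sample_size), fit score = a*ln(rank) + b to
    those points by least squares, then predict scores for the first
    top_k original ranks."""
    ratio = collection_size / sample_size
    xs = [math.log(ratio * (i + 1)) for i in range(len(sample_scores))]
    ys = sample_scores
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return [a * math.log(r) + b for r in range(1, top_k + 1)]
```

The appeal of this family of methods is that no cooperation from the remote collection is needed: only the scores of the locally sampled documents are used.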
In download methods, the results are downloaded locally to calculate their relevance. Hung [32] proposed a technique in which the best documents are downloaded to re-rank and create the final merged list, using machine learning and genetic programming to re-rank the final merged results. While download methods seem to perform better than estimation approaches in the context tested in [33], they have essential disadvantages such as increased computation, download time, and bandwidth overhead during the retrieval process.
Hybrid methods combine estimation and download methods. Paltoglou et al. [34] proposed a hybrid method that downloads a limited number of documents and, based on them, trains a linear regression model to calculate the relevance of the remaining documents. The results showed that this method achieves a good balance between the speed of estimation approaches and the performance of download approaches.
Taylor et al. [35] published a patent on a machine learning process for conducting results merging. Another patent [36] uses the scores assigned to the lists and the documents to complete the final merging.
I started my research journey working on the first research question by implementing an idea that addresses the results merging process in federated search scenarios. The initial version of this work was published at the PCI 2020 conference [37]. It proposes two new methods that solve the results merging problem in federated patent search using machine learning models. The methods are based on a centralized index containing samples of documents from all potential resources, and they use machine learning models to predict comparable scores for the documents retrieved from different resources. The effectiveness of the new results merging methods was measured against very robust baselines and was found to be superior in many cases.
Patent documents have specific characteristics that differ from the regular text on which BERT-based approaches have achieved impressive results [16, 38]. Patent search is thus different from other types of search, such as web search. For example, in a typical patent prior-art search, the starting point is a patent application used as a topic [39] that needs to be transformed into search queries [39, 40]. BERT can only take an input of up to 512 tokens, so entire, lengthy patent documents cannot be fed directly to the model. Also, the diversity of the language, as well as the frequent use of vague terms, makes huge amounts of training data essential in order to train BERT effectively.
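One common workaround for the 512-token limit (assumed here for illustration, not a method claimed by this work) is to split a long document's token sequence into overlapping windows and feed each window to the model separately:

```python
def window_chunks(tokens, max_len=512, stride=256):
    """Split a long token sequence into overlapping windows that each fit
    BERT's 512-token input limit. The overlap (max_len - stride tokens)
    preserves context that would otherwise be cut at window boundaries."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks
```

Per-window scores can then be aggregated (e.g., by taking the maximum) to produce a document-level score.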
Another notable characteristic of patent documents is their structure. A patent document describes an invention through different fields: title, abstract, description, claims, metadata, and figures. There are also language differences between them; for example, the abstract and description usually use technical language, while the claims section uses legal jargon [40]. We need to choose which parts will be used to train a BERT model and for what task. As already mentioned, each part has its own characteristics, and we need to look into them closely and decide how to adapt the BERT model to them.
Another big challenge is the lack of data for training a BERT model for patent retrieval. Deep learning models in general require lots of data, and BERT likewise requires big datasets to take advantage of its power [16]. For example, CLEF-IP, a popular dataset in patent retrieval research, is an extract of the more extensive MAREC collection, but its structure is not directly usable for training models like BERT.
Lee & Hsiang [42] implemented a re-ranking approach for patent prior-art search using a BM25 model for the first retrieval and then a re-ranker using cosine similarity over BERT embeddings. As they only used the BERT embeddings, they trained BERT using the plain-text-file architecture, which has one sentence per line, with all examples being positive. Their re-ranking effectiveness was satisfactory, but they found that calculating semantic similarities between longer texts is still challenging.
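Such an embedding-based re-ranking step can be sketched as follows. This is a minimal illustration with plain vectors; in the cited work the vectors would be BERT embeddings of the query/topic text and of the BM25 candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(query_vec, candidates):
    """Re-rank first-stage (e.g., BM25) candidates by embedding similarity.

    candidates: list of (doc_id, doc_vec) pairs from the first retrieval.
    Returns the candidates sorted by descending cosine similarity."""
    return sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
```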
Althammer et al. [18] trained a BERT model on patent documents using the BERT paragraph-level interaction architecture [43] and compared its retrieval performance with BM25. They found BM25 to perform better than BERT.
Dai & Callan [44] found that BERT-based re-rankers performed better on longer queries than on short keyword queries. Therefore, as patent retrieval involves long queries, it makes sense to train a BERT re-ranker for the patent domain. Padaki et al. [45] worked on query expansion for BERT re-ranking. They found that queries need a rich set of concepts and grammatical structure to take advantage of BERT-based re-rankers; traditional word-based query expansion, which results in short queries, is not sufficient, and BERT achieved higher accuracy with longer queries.
Beltagy et al. [46] presented Longformer, a BERT-like model designed to work with long documents. It combines a local attention mechanism with a global one, allowing the processing of longer documents. Longformer can take as input documents up to 4,096 tokens long, eight times BERT's maximum input.
Kang et al. [2] worked on improving prior-art search performance by using a BERT model to solve the binary classification problem of identifying noisy, non-relevant patent documents so that they can be removed from the search, leaving valid patents.
Lee & Hsiang [17] worked on patent classification using the BERT model. They fine-tuned a BERT model and used it for CPC classification. They also showed that using only the claims is sufficient for patent classification.
We use a BERT re-ranker along with a BM25 model for first-stage retrieval. The BERT model is used as a gate-based function that modifies the BM25 score according to BERT's relevance score. The main challenges are the lack of appropriate data for training such a model and BERT's maximum input of 512 tokens. We therefore use only the abstract, as the abstract is mandatory for every patent document and is a good description of the invention. The first step was to create a dataset of relevant abstracts. We used the MAREC dataset [47]; from each patent document, we found its citations and used them to create a dataset of relevant abstracts. This resulted in 80 million pairs of abstracts, 50% of which are positive and 50% negative. We then trained the BERT model using this data, compared it with BM25, and found the method to be superior to BM25.
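A minimal sketch of one possible gate-based combination follows. The threshold, boost, and damping values are hypothetical placeholders, as the exact gating function is part of the ongoing work and is not specified here.

```python
def gated_score(bm25_score, bert_prob, threshold=0.5, boost=1.5, damp=0.5):
    """Hypothetical gate: if BERT judges the query-document pair relevant
    (probability at or above the threshold), boost the BM25 score;
    otherwise dampen it. All three constants are illustrative."""
    return bm25_score * (boost if bert_prob >= threshold else damp)
```

The design intent is that BM25 still drives the ranking, while BERT's relevance signal acts as a switch that promotes or demotes candidates rather than replacing the lexical score outright.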