1 Introduction
This research is in the field of Information Retrieval (IR), specifically in the subfield of professional search in the patent domain. The number of patents related to artificial intelligence, big data, and the Internet of Things has grown tremendously in recent years [2]. The increasing number of patent applications filed every year makes the need for better patent search systems inevitable. Patents and other innovation-related documents can be found in patent offices, online datasets, and resources that typically must be searched using various patent search systems and other online services such as Espacenet, Google Patents, bibliographic search, and many more [3].
From an information task perspective, patent retrieval tasks are typically recall-oriented [4]; therefore, retrieving all the patent documents related to a patent application is crucially important. Otherwise, there may be a significant economic impact due to lawsuits for patent infringement [5]. Thus, in professional search, it is vital to search effectively across all the potentially distributed resources containing patents or other patent-related data.
To that end, the Federated Search (FS) approach aims to solve the problem of effectively searching across all resources containing patent information. FS systems implement a Distributed Information Retrieval (DIR) scenario that permits the simultaneous search of multiple searchable, remote, and potentially physically distributed resources.
There are different patent search tasks with different purposes, such as prior-art search, infringement search, freedom-to-operate search, etc. In this work, the focus is on prior-art search, a task in which the novelty of an idea is examined [6]. Typically, users express their information need using Boolean queries [7]. I plan to investigate methods and architectures in patent retrieval, use Artificial Intelligence (AI) end-to-end processes to improve patent search and retrieval effectiveness, and propose future search engines.
2 Patent Search Characteristics
Patent search can be considered a specific example of Information Retrieval, i.e., finding relevant information of an unstructured nature in huge collections [8], and has been considered a complex area. Patent text differs from regular text: sentences used in patent documents are usually longer than general-use sentences [9].
More specifically, Iwayama [10] found that patent documents are 24 times the length of news documents. The syntactic structure of patent language is also a big challenge, as found by Verberne [9]. The same study also found that patent authors tend to use multi-word expressions to introduce novel terms. Another challenge in patent search is the vocabulary mismatch problem, i.e., the absence of common words between two relevant documents. Magdy et al. [11] showed that 12% of the relevant documents for topics from CLEF-IP 2009 have no words in common with the respective topics. All of this makes patent search a complicated process.
Researchers have categorized methodologies for patent search and retrieval. Lupu & Hanbury [12] summarized methods for patent retrieval, divided into text-based methodologies (Bag of Words, Latent Semantic Analysis, Natural Language Processing), query creation/modification methodologies, metadata-based methodologies, and drawing-based methodologies. Khode & Jambhorkar [13] split the procedures for patent retrieval into those based on IPC codes and those based on patent features and query formulation. More recently, Shalaby et al. [14] broke patent retrieval into the following categories: keyword-based methods, pseudo-relevance feedback methods, semantic-based methods, metadata-based methods, and interactive methods.
In recent years, there has been a shift in research toward neural approaches for IR, a new and developing field [15]. Transformer models like BERT [16] have achieved impressive results on various NLP tasks. However, the use of BERT for patent retrieval has not been investigated enough. While BERT has drawn lots of attention in patent-industry research, it is either used for classification [2, 17] or did not work as expected for patent retrieval [18]. Dense retrieval [19] is a new neural method for search and, given the particular characteristics of the patent industry, it is expected to alleviate problems like vocabulary mismatch and improve retrieval effectiveness. Generally, the use of AI techniques in the patent industry has drawn lots of attention and is currently an active area of research [7, 20, 21].
4 Summary of My Research so Far
1a) What is the effect of machine learning algorithms on result merging when searching for patents in federated environments?
The result merging problem was studied as a general DIR problem and not in the specific context of the patent domain. It appeared in research many years ago; one of the first works that conducted experiments on results merging is [22]. After that, many algorithms were presented in the relevant literature.
A very widely used and robust estimation method is the Collection Retrieval Inference network (CORI) [23]. CORI applies a simple heuristic formula: a linear combination of the score of the document returned by the collection and the source selection score. It thereby normalizes the collection-specific scores and produces globally comparable scores.
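As an illustration, the CORI-style combination can be sketched as follows. This is a minimal sketch: the min-max normalization and the 0.4/1.4 weighting follow the commonly cited heuristic, but the exact constants and normalization details vary across implementations.

```python
def cori_merge(doc_scores, collection_score, lam=0.4):
    """CORI-style merge heuristic: min-max normalize the collection-local
    document scores to [0, 1], then combine each with the (already
    normalized) source-selection score of the collection.

    doc_scores: list of (doc_id, local_score) from one collection.
    collection_score: normalized source-selection score in [0, 1].
    """
    scores = [s for _, s in doc_scores]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a constant score list
    merged = []
    for doc_id, s in doc_scores:
        d = (s - lo) / span                                # normalized document score
        g = (d + lam * d * collection_score) / (1 + lam)   # heuristic combination
        merged.append((doc_id, g))
    return merged
```

Each collection's result list is transformed with its own source-selection score, after which the per-collection lists can simply be concatenated and sorted by the combined score.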
Another effective estimation algorithm is the semi-supervised learning algorithm (SSL) [24], which is based on linear regression. The SSL algorithm, proposed by Si and Callan, applies linear regression to map local collection scores to globally comparable scores. To achieve that, the algorithm operates, for each query, on the documents that are common between a collection's results and a centralized index built from samples of all the collections.
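A minimal sketch of the SSL idea, assuming a simple per-collection linear mapping fitted by ordinary least squares on the overlap documents (the published algorithm's handling of multiple collections and its training details are more involved):

```python
def ssl_fit(local_scores, central_scores):
    """Fit y = a*x + b mapping a collection's local scores (x) to the
    scores the same overlap documents received in the centralized
    sample index (y), via closed-form least squares."""
    n = len(local_scores)
    mx = sum(local_scores) / n
    my = sum(central_scores) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(local_scores, central_scores))
    sxx = sum((x - mx) ** 2 for x in local_scores)
    a = sxy / sxx
    b = my - a * mx
    return a, b

def ssl_apply(a, b, local_score):
    """Convert any local score from that collection into a globally comparable one."""
    return a * local_score + b
```

Once the per-collection mapping is fitted from the overlap documents, it is applied to every document the collection returned, making scores from different collections directly comparable.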
SAFE (sample-agglomerate fitting estimate) is a more recent algorithm designed to function in uncooperative environments [25]. SAFE is based on the principle that the results over the sampled documents for each query form a sub-ranking of the original collection, so this sub-ranking can be used for curve fitting in order to predict the original scores.
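The curve-fitting principle behind SAFE can be sketched as below. The log-linear score model and the simple rank-scaling factor are illustrative assumptions for this sketch, not the exact fitting function of the published algorithm.

```python
import math

def safe_estimate(sample_scores, sample_size, collection_size, top_k):
    """SAFE-style sketch: treat the sample ranking as a sub-ranking of the
    full collection, map sample rank i to an estimated original rank
    i * (collection_size / sample_size), fit score = a*ln(rank) + b to
    those points by least squares, then predict scores for the first
    top_k original ranks."""
    ratio = collection_size / sample_size
    xs = [math.log(ratio * (i + 1)) for i in range(len(sample_scores))]
    ys = sample_scores
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return [a * math.log(r) + b for r in range(1, top_k + 1)]
```

The appeal of this family of methods is that no cooperation from the remote collection is needed: only the scores of the locally sampled documents are used.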
In download methods, the results are downloaded locally to calculate their relevance. Hung [32] proposed a technique in which the best documents are downloaded to re-rank and create the final merged list, using machine learning and genetic programming to re-rank the final merged results. While download methods seem to perform better than estimation approaches in the context tested in [33], they have essential disadvantages such as increased computation, download time, and bandwidth overhead during the retrieval process.
Hybrid methods combine estimation and download methods. Paltoglou et al. [34] proposed a hybrid method that downloads a limited number of documents and, based on them, trains a linear regression model to calculate the relevance of the remaining documents. The results showed that this method achieves a good balance between the speed of estimation approaches and the performance of download approaches.
Taylor et al. [35] published a patent on a machine learning process for conducting results merging. Another patent [36] uses the scores assigned to the lists and the documents to complete the final merging.
I started my research journey working on the first research question by implementing an idea that addresses the results merging process in federated search scenarios. The initial version of this work was published at the PCI 2020 conference [37]. It proposes two new methods that solve the results merging problem in federated patent search using machine learning models. The methods are based on a centralized index containing samples of documents from all potential resources, and they use machine learning models to predict comparable scores for the documents retrieved from different resources. The effectiveness of the new results merging methods was measured against very robust baselines and was found to be superior in many cases.
Patent documents have specific characteristics that differ from the regular text on which BERT-based approaches have achieved impressive results [16, 38]. Patent search is thus different from other types of search, such as web search. For example, in a typical patent prior-art search, the starting point is a patent application used as a topic [39] that needs to be transformed into search queries [39, 40]. BERT can only take an input of up to 512 tokens, so entire, lengthy patent documents cannot be fed directly to the model. Also, the diversity of the language, as well as the frequent use of vague terms, makes huge amounts of training data essential in order to train BERT effectively.
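One common workaround for the 512-token limit (assumed here for illustration, not a method claimed by this work) is to split a long document's token sequence into overlapping windows and feed each window to the model separately:

```python
def window_chunks(tokens, max_len=512, stride=256):
    """Split a long token sequence into overlapping windows that each fit
    BERT's 512-token input limit. The overlap (max_len - stride tokens)
    preserves context that would otherwise be cut at window boundaries."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks
```

Per-window scores can then be aggregated (e.g., by taking the maximum) to produce a document-level score.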
Another notable characteristic of patent documents is their structure. A patent document describes an invention through different fields: title, abstract, description, claims, metadata, and figures. There are also language differences between them; for example, the abstract and description usually use technical language, while the claims section uses legal jargon [40]. We need to choose which parts will be used to train a BERT model and for what task. As already mentioned, each part has its own characteristics, and we need to look into them closely and decide how to adapt the BERT model to them.
Another big challenge is the lack of data for training a BERT model for patent retrieval. Deep learning models in general require lots of data, and BERT likewise requires big datasets to take advantage of its power [16]. For example, CLEF-IP, a popular dataset in patent retrieval research, is an extract of the more extensive MAREC collection, but its structure is not directly usable for training models like BERT.
Lee & Hsiang [42] implemented a re-ranking approach for patent prior-art search using a BM25 model for the first retrieval and then a re-ranker using cosine similarity over BERT embeddings. As they only used the BERT embeddings, they trained BERT using the plain-text-file architecture, which has one sentence per line, with all examples being positive. Their re-ranking effectiveness was satisfactory, but they found that calculating semantic similarities between longer texts is still challenging.
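Such an embedding-based re-ranking step can be sketched as follows. This is a minimal illustration with plain vectors; in the cited work the vectors would be BERT embeddings of the query/topic text and of the BM25 candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(query_vec, candidates):
    """Re-rank first-stage (e.g., BM25) candidates by embedding similarity.

    candidates: list of (doc_id, doc_vec) pairs from the first retrieval.
    Returns the candidates sorted by descending cosine similarity."""
    return sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
```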
Althammer et al. [18] trained a BERT model on patent documents using the BERT paragraph-level interaction architecture [43] and compared its retrieval performance with BM25. They found BM25 to perform better than BERT.
Dai & Callan [44] found that BERT-based re-rankers performed better on longer queries than on short keyword queries. Therefore, as patent retrieval involves long queries, it makes sense to train a BERT re-ranker for the patent domain. Padaki et al. [45] worked on query expansion for BERT re-ranking. They found that queries need a rich set of concepts and grammatical structure to take advantage of BERT-based re-rankers; traditional word-based query expansion, which results in short queries, is not sufficient, and BERT achieved higher accuracy with longer queries.
Beltagy et al. [46] presented Longformer, a BERT-like model designed to work with long documents. It combines a local attention mechanism with a global one, allowing the processing of longer documents. Longformer can take as input documents up to 4,096 tokens long, eight times BERT's maximum input.
Kang et al. [2] worked on improving prior-art search performance by using a BERT model to solve the binary classification problem of identifying noisy, non-relevant patent documents so that they can be removed from the search, leaving valid patents.
Lee & Hsiang [17] worked on patent classification using the BERT model. They fine-tuned a BERT model and used it for CPC classification. They also showed that using only the claims is sufficient for patent classification.
We use a BERT re-ranker along with a BM25 model for first-stage retrieval. The BERT model is used as a gate-based function that modifies the BM25 score according to BERT's relevance score. The main challenges are the lack of appropriate data for training such a model and BERT's maximum input of 512 tokens. We therefore use only the abstract, as the abstract is mandatory for every patent document and is a good description of the invention. The first step was to create a dataset of relevant abstracts. We used the MAREC dataset [47]; from each patent document, we found its citations and used them to create a dataset of relevant abstracts. This resulted in 80 million pairs of abstracts, 50% of which are positive and 50% negative. We then trained the BERT model using this data, compared it with BM25, and found the method to be superior to BM25.
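A minimal sketch of one possible gate-based combination follows. The threshold, boost, and damping values are hypothetical placeholders, as the exact gating function is part of the ongoing work and is not specified here.

```python
def gated_score(bm25_score, bert_prob, threshold=0.5, boost=1.5, damp=0.5):
    """Hypothetical gate: if BERT judges the query-document pair relevant
    (probability at or above the threshold), boost the BM25 score;
    otherwise dampen it. All three constants are illustrative."""
    return bm25_score * (boost if bert_prob >= threshold else damp)
```

The design intent is that BM25 still drives the ranking, while BERT's relevance signal acts as a switch that promotes or demotes candidates rather than replacing the lexical score outright.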