Published in: Journal of Big Data 1/2023

Open Access 01.12.2023 | Research

Characterizing patent big data upon IPC: a survey of triadic patent families and PCT applications

Written by: Jewel X. Zhu, Minghan Sun, Shelia X. Wei, Fred Y. Ye


Abstract

Research objective

Triadic patent (TP) families and Patent Cooperation Treaty (PCT) applications are often used as datasets to measure innovation capability or R&D internationalization, but how well the two datasets agree is unclear. Their concordance is the main issue addressed in this study.

Methods

We collect the global TP and PCT data from the Derwent Innovations Index (DII), retrieving a total of 1,589,172 TP families and 4,067,389 PCT applications. Based on International Patent Classification (IPC) codes, we compare these two big datasets in three parts: IPC distribution, IPC co-occurrence network, and nation-IPC co-occurrence network. To understand the overall similarities and differences between TP and PCT, we compute basic statistics for the global data and for the w-core, which we define on the basis of the w-index. Furthermore, the w-cores are visualized and the global similarities are calculated to examine the concordance and differences in detail.

Findings

The results show that the w-core is suitable for selecting the core part of big data and that TP and PCT are highly concordant. A few differences nevertheless appear in technological convergence, in some specific technical fields (e.g. chemistry, medicine, electronic communication, and lighting technology) and in some countries/regions (e.g. Germany, Japan, China, and Korea).

Practical implications

TP families are very similar to PCT applications in terms of reflecting innovation capability or R&D internationalization at a macro level, but when it comes to technological convergence, specific research topics, and countries/regions, the choice may depend on the purpose of the research.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
TP: Triadic patent
PCT: Patent Cooperation Treaty
DII: Derwent Innovations Index
IPC: International Patent Classification
EPO: European Patent Office
JPO: Japan Patent Office
USPTO: United States Patent and Trademark Office
WIPO: World Intellectual Property Organization
OECD: Organization for Economic Cooperation and Development

Introduction

Patents, which contain 90–95% of the global technical information, represent valuable technical inventions and provide academia and industry with a reliable basis. Compared with other technical documents, patents are more authoritative and up-to-date. Many researchers have already used patent data to analyze current and future technological trends. However, with the explosive growth of patents and the massive influx of low-quality patents, the raw number of patents is no longer an effective measure of the state of innovation or of trends in technologies and industries. Researchers have therefore begun to look for indicators that represent high-quality patents, among which the number of triadic patent (TP) families and the number of Patent Cooperation Treaty (PCT) applications are frequently used.
Triadic patent (TP) families are sets of patents filed at three major patent offices, namely the European Patent Office (EPO), the Japan Patent Office (JPO), and the United States Patent and Trademark Office (USPTO) [1].
Meanwhile, the Patent Cooperation Treaty (PCT) is an international treaty with more than 150 Contracting States. It is possible for an invention to seek patent protection in plenty of countries at the same time by submitting a single “international” patent application via the PCT rather than several separate national or regional patent applications. The granting of patents remains under the control of the national or regional patent offices in what is called the “national phase” [2].
As cross-border patent applications, TP families and PCT applications are important datasets for investigating national or regional innovation capabilities, evaluating industrial development status, and measuring cross-border knowledge flow, whether in working papers and reports [3–7] or journal papers [8–10]. However, existing studies that use TP families or PCT applications as datasets focus on only a part of them, such as patents related to a specific topic or filed in a certain period. Therefore, in this study, we collect and investigate the global TP families and PCT applications, each numbering in the millions. In addition, no paper has compared TP families with PCT applications, so it is worth knowing whether the two datasets are concordant. In short, we quantitatively explore TP families and PCT applications based on the global data and assess their concordance from a global perspective.

Literature review

In this section, we review studies on three topics to understand the current state of research and the research gap: TP families and PCT applications; the IPC co-occurrence network and the nation-IPC co-occurrence network, where the nation refers to the earliest priority country or region; and the h-index and w-index.

TP families and PCT applications

Applicants tend to file patents at their home country’s patent office, an inclination called “home advantage bias” [11]. As multinational applications, TP families were able to balance the home advantage of domestic applicants/inventors in the 1990s [12] and thus show the innovation strength of a country or region more objectively. An examination of the extent of the ‘home advantage’ effect in USPTO and EPO patent data and in TP families concluded that TP families could be used as a satisfactory alternative to USPTO and EPO data for measuring R&D internationalization [13]. On this basis, many papers have conducted empirical studies using TP as an innovation dataset [14–20]. Tahmooresnejad and Beaudry studied the relationship between the structure and characteristics of TP families and patent value, and concluded that the structure and characteristics of patent families play an important role in explaining the high value of patents [21].
As a key indicator of technological and innovative strength, the number of TP families per country was found to be a function of technological specialization and (national) patenting strategies [22]. Based on TP families, potential future convergences among technologies can be predicted by using the Adamic/Adar similarity between IPC codes [23]. It was also shown that international filings, especially TP, are important for capturing variations in research productivity [24]. Recently, the number of TP families has continued to be an important indicator for measuring innovation. The registration of TP families was used as an innovation output variable, along with the number of research article citations and patent citations, to measure knowledge spillover efficiency [25]. Sun et al. used the TP database for 24 innovating countries between 1994 and 2013 to investigate the effects of technological innovation within certain countries on the energy efficiency performance of neighboring countries [26]. The number of TP families was selected as the output variable to analyze the relationship between regulation and R&D efficiency [8]. Higham et al. linked citation network layers through TP families and observed that these layers contain complementary, rather than redundant, information about technological relationships [27]. Wei et al. combined TP families and technology life cycle theory to define the grey-rhino model [10].
Similar to TP families, PCT applications have often been used to measure innovation output [28–30], innovation capability [31, 32] and international knowledge diffusion [33]. As early as 2008, based on the 138,751 patents filed in 2006 under the PCT, Leydesdorff used IPC codes to analyze the relations among technologies at different levels of aggregation [34]. As a representative of patent activities, PCT applications were also used to study the technological growth of countries [35] or the development of industries [36, 37]. By combining patent data from the PCT and the EPO, Kers studied trends in genetic patent applications in order to identify trends in the commercialization of research findings in genetics [38]. The participation of PCT applications in patent portfolios and a country’s degree of concentration of PCT application filings were used to evaluate the commercial potential of university patenting [39]. Schmoch analyzed China’s technological performance based on the transfer of China’s PCT applications [9]. Roszko-Wojtowicz et al. adopted PCT applications per billion GDP as one of the variables to describe the effects of innovative activity [40]. Based on the case of Siemens’ PCT applications, Ervits utilized the revealed technological advantage (RTA) index to measure the extent of the technological diversification of patent output [41].
In general, many recent studies have been based on TP families or PCT applications, but no paper has compared these two datasets from a global perspective. Hence, we focus on the relations between the global TP families and PCT applications: how to profile each dataset, and whether the two are concordant or not.

IPC co-occurrence network and nation-IPC co-occurrence network

Compared with simple quantitative statistical analysis, patent network analysis can provide more comprehensive, objective and accurate technical intelligence for the management of research and development activities [42].
Patent network analysis can not only show the technical relationship between research subjects such as patents, enterprises, technical fields, countries or regions [43, 44], but also present the knowledge exchange [45], technical cooperation [46, 47], the knowledge maps [48] and technology development trends [49, 50]. In addition, the patent network provided clear data insights for comparative studies of different patent databases [51].
Furthermore, patent networks can be one-mode, two-mode or even higher-mode. One-mode patent networks include only one kind of entity, as in IPC co-occurrence networks. When a patent is applied for, the IPC codes [2] of the technical fields it belongs to are assigned. The IPC is divided into eight sections, and each section is subdivided into classes, subclasses, groups, and subgroups [52]. A single patent can be assigned multiple IPC codes. IPC co-occurrence network analysis has been used to identify the convergence of technologies [53, 54] and to predict patterns of technological convergence [23]. Two- and higher-mode patent networks include different sets of entities, and because of this feature the two-mode network is essential for analyzing the links between two disjoint node sets [45, 55, 56, 57]. The nation-IPC two-mode network, which combines IPC information with the source country/region of the patent, has been effective in identifying the technological advantages of different countries/regions [58, 59].
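The hierarchy described above can be illustrated with a small parser. This is only a sketch: it assumes full IPC symbols written in the common form `A61K 31/12` (section letter, two-digit class, subclass letter, main group, subgroup), and the names `IPCCode` and `parse_ipc` are hypothetical, not part of any patent database API. The IPC categories compared later in this paper (for example G06F and A61K) appear to be at the four-character subclass level.

```python
import re
from typing import NamedTuple


class IPCCode(NamedTuple):
    section: str     # one of the eight sections, e.g. "A"
    ipc_class: str   # section + two digits, e.g. "A61"
    subclass: str    # class + letter, e.g. "A61K"
    main_group: str  # subclass + main group number, e.g. "A61K 31"
    subgroup: str    # full symbol, e.g. "A61K 31/12"


# Assumed surface form: "A61K 31/12" or "A61K31/12" (case-insensitive).
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(code: str) -> IPCCode:
    """Split a full IPC symbol into its five hierarchical levels."""
    match = _IPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {code!r}")
    section, cls, letter, group, subgroup = match.groups()
    subclass = f"{section}{cls}{letter}"
    return IPCCode(
        section=section,
        ipc_class=f"{section}{cls}",
        subclass=subclass,
        main_group=f"{subclass} {group}",
        subgroup=f"{subclass} {group}/{subgroup}",
    )


print(parse_ipc("A61K 31/12").subclass)
```

Truncating every code of a patent to its `subclass` and removing duplicates would give the set of categories from which co-occurrence links are counted.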
In addition to visualization, network analysis provides rich quantitative indicators for patent comparative analysis, including measures of nodes and links within a network and inter-network similarity such as cosine similarity [60].

The h-index and w-index

The h-index was proposed by Hirsch to evaluate the academic influence of scholars [61]. It is defined as follows: a scientist has index h if h of his or her \({N}_{p}\) papers have at least h citations each and the other \(\left({N}_{p}-h\right)\) papers have \(\le h\) citations each. The core part selected according to the h-index is called the h-core [62], and each paper in the h-core has at least h citations [63]. There are two main reasons why the h-index is popular. On the one hand, the h-index is simple and stable. On the other hand, it captures the common power-law phenomenon in informatics [64], naturally selects the top data, and balances quantity and influence [65, 66]. The h-index is now widely used in academic evaluation, information measurement and other fields [14, 15, 66, 68, 69, 70]. It was also introduced as a network node measure [71] and soon gained wide application [72, 73]. As links began to be recognized as playing a key role in networks [74], researchers found that the h-index, as a characteristic method for extracting top information, was well suited to measuring the important high-strength links in a network, and the h-strength (\({h}_{s}\)) was introduced. It is defined as follows: the h-strength of a network is equal to \({h}_{s}\) if \({h}_{s}\) is the largest natural number such that there are \({h}_{s}\) links in the network, each with strength at least equal to \({h}_{s}\) [75]. The h-strength can significantly simplify complex networks and effectively select the main link structures. However, the h-index and \({h}_{s}\) are of little use when extracting core information from very large-scale data and networks, which motivated the w-index and the generalized w-index.
The w-index is an improvement on the h-index [76] that focuses more on researchers' high-impact papers. It can be defined as follows: if \(w\) of a researcher’s papers have at least \(10w\) citations each and the other papers have fewer than \(10\left(w+1\right)\) citations, his/her w-index is \(w\). On this basis, Egghe replaced the constant 10 in the w-index with any natural number greater than or equal to 1 and proposed the generalized w-index (\({w}_{a}\)) in 2011 [77]. When \(a=1\), \({w}_{a}=h\). For the same data set, the larger \(a\) is, the smaller \({w}_{a}\) is, and the larger the number of items of the \({w}_{a}\)-th source. That is to say, the generalized w-index pays more attention to the top data than the h-index, and it can extract a core of an appropriate size, especially from huge datasets. If we combine the generalized w-index with the h-strength, we can therefore select a suitable core network from the network of large-scale data.
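A minimal sketch of these definitions follows. The function name `generalized_w_index` and the citation counts are illustrative assumptions, not data from the cited papers. With `a=1` the function returns the h-index, and with `a=10` it returns the original w-index.

```python
def generalized_w_index(counts, a=10):
    """Return the largest w such that the top w sources each have >= a*w items."""
    ranked = sorted(counts, reverse=True)
    w = 0
    for rank, items in enumerate(ranked, start=1):
        if items >= a * rank:
            w = rank
        else:
            # Items are non-increasing and a*rank is increasing,
            # so no later rank can satisfy the condition.
            break
    return w


# Hypothetical citation counts for one researcher.
citations = [120, 95, 60, 33, 30, 12, 8, 5, 2]

print(generalized_w_index(citations, a=1))   # h-index: 7
print(generalized_w_index(citations, a=5))   # 5
print(generalized_w_index(citations, a=10))  # w-index: 3
```

The three calls show the behavior stated above: as \(a\) grows, \({w}_{a}\) shrinks and the selection concentrates on the top of the ranking.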

Methodology

Methods and data applied in this paper are displayed as follows.

Method

We compare TP and PCT in the following three parts: IPC distribution, IPC co-occurrence network and nation-IPC co-occurrence network, where the nation refers to the earliest priority country or region. We propose to use the generalized w-index to extract the core part of datasets. There are three main reasons why we choose the generalized w-index. Firstly, given that the TP and PCT datasets are very large, we deem that it is necessary to focus on the core part. Secondly, although the h-index is very famous and popular, the w-index is more suitable for big datasets because the constant \(a\) can be adjusted. Finally, the generalized w-index considers two important aspects of datasets, namely the number of sources (including IPC categories, IPC-IPC links, and Nation-IPC links) and the number of items for each source (see below for detailed representations).
Specifically, we define the w-core based on the generalized w-index.
The generalized w-index, denoted \({w}_{a}\), for \(a \ge 1\) is the largest rank \(r = {w}_{a}\) such that the sources on ranks 1, …, r all have at least \(a{w}_{a}\) items. Following the concept of the generalized w-index, we introduce a new definition of w-core.
Definition
(w-core) A set of sources is divided into two groups by the generalized w-index. The first group, with \({w}_{a}\) sources each having at least \(a{w}_{a}\) items, is the w-core, and the rest of the sources, each having fewer than \(a{w}_{a}\) items, form the w-tail. If the w-core exists as a subnetwork, we directly call it a w-core network. When the network type changes (citation network, co-citation network, co-occurrence network and so on), the w-core can be extended to the corresponding w-cores.
In this paper, the w-index is applied to IPC distribution and co-occurrence networks to extract the w-cores. In the part of IPC distribution, an IPC category is a source and patents corresponding to this IPC category are items of this IPC category. In the part of IPC co-occurrence network, an IPC-IPC link is a source, and patents in which these two IPC categories co-occur are items of this IPC-IPC link. The sources and items of nation-IPC co-occurrence network are similar to IPC co-occurrence network. The detailed operation is as follows: first, for the IPC distribution, all IPC categories are sorted in descending order by the number of items in each IPC category. Similarly, for the IPC co-occurrence network and nation-IPC co-occurrence network, all links are sorted in descending order by the number of items in each link which is called the strength of links. Second, the maximum rank r is decided based on \(\mathrm{r}={w}_{a}\), where the top r IPC categories or links have at least \(a{\text{w}}_{a}\) items. The w-core consists of the top r IPC categories or links. The constant \(a\) depends on the volume of the dataset, and we can adjust the value of \(a\) to extract the w-core of IPC distribution or co-occurrence networks effectively.
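The two steps above (rank the sources, then cut at rank \({w}_{a}\)) can be sketched for the IPC co-occurrence case as follows. The helper names `cooccurrence_strengths` and `w_core` and the toy patents are assumptions made for illustration. This is not the authors' implementation, and ties at the cut-off are broken alphabetically here, which the paper does not specify.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_strengths(patents):
    """For each unordered IPC pair, count the patents in which both occur."""
    strengths = Counter()
    for ipc_codes in patents:
        for pair in combinations(sorted(set(ipc_codes)), 2):
            strengths[pair] += 1
    return strengths


def w_core(source_items, a):
    """Split a {source: item_count} mapping into (w-core, w-tail)."""
    # Step 1: sort sources by item count, descending (ties: by source name).
    ranked = sorted(source_items.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step 2: find the largest rank r whose source still has >= a*r items.
    w = 0
    for rank, (_, items) in enumerate(ranked, start=1):
        if items >= a * rank:
            w = rank
        else:
            break
    return dict(ranked[:w]), dict(ranked[w:])


# Toy data: each patent is the list of IPC subclasses assigned to it.
patents = [
    ["A61K", "A61P"],
    ["A61K", "A61P", "C07D"],
    ["A61K", "A61P"],
    ["A61K", "C07D"],
    ["H04L", "H04W"],
    ["G06F", "H04L"],
    ["G06F"],
]

links = cooccurrence_strengths(patents)
core, tail = w_core(links, a=1)  # a=1 only because the toy data is tiny
print(core)
```

The same `w_core` function applies to the IPC distribution (sources are IPC categories) and to nation-IPC links, since only the mapping from source to item count changes.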
Cosine similarity, which measures the similarity between two vectors by the cosine of the angle between them in vector space, is adopted to investigate the global situation. In general, cosine similarity ranges over [−1, 1]; for vectors of non-negative counts such as those used here, it lies in [0, 1]. The higher the cosine similarity, the more similar the two vectors are. When the value is 1, the angle between the two vectors is 0, which means that they point in exactly the same direction. The value of cosine similarity is independent of the length of the vectors and depends only on their direction, so the disparity between the numbers of TP families and PCT applications can be ignored.
Thus, for two n-dimensional vectors A and B, the cosine similarity between them is:
$$s(A,B)=\cos\left(\theta \right)=\frac{A\cdot B}{\Vert A\Vert \cdot \Vert B\Vert }=\frac{{\sum }_{i=1}^{n}{A}_{i}\times {B}_{i}}{\sqrt{{\sum }_{i=1}^{n}{\left({A}_{i}\right)}^{2}}\sqrt{{\sum }_{i=1}^{n}{\left({B}_{i}\right)}^{2}}} \tag{1}$$
In this study, we use cosine similarity to measure the global similarity of TP families and PCT applications in IPC distribution, IPC co-occurrence network and nation-IPC co-occurrence network. The TP and PCT are two vectors with the same dimensions. For three different parts, the dimensions of vectors are IPC categories, IPC-IPC links or nation-IPC links, and the values of dimensions are the number of patents in each IPC category or the strength of links. Then, the cosine similarity of TP and PCT can be calculated based on Eq. (1).
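As a sketch of how Eq. (1) applies to two count tables, the function below aligns both datasets on the union of their dimensions, filling missing dimensions with 0. The name `cosine_similarity` and the counts are hypothetical and not taken from the paper's data.

```python
import math


def cosine_similarity(a_counts, b_counts):
    """Cosine similarity of two {dimension: count} mappings, as in Eq. (1)."""
    keys = sorted(set(a_counts) | set(b_counts))
    a = [a_counts.get(k, 0) for k in keys]
    b = [b_counts.get(k, 0) for k in keys]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical patent counts per IPC category.
tp = {"A61K": 30, "G06F": 10, "H04L": 5}
pct_scaled = {ipc: 3 * n for ipc, n in tp.items()}  # same profile, 3x the volume
pct_disjoint = {"F21S": 40}                          # no category in common

print(cosine_similarity(tp, pct_scaled))    # approximately 1.0
print(cosine_similarity(tp, pct_disjoint))  # 0.0
```

The first call illustrates why the measure suits datasets of very different sizes: multiplying every count by the same factor leaves the similarity unchanged.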

Data

All patent data in this study are retrieved from the Derwent Innovations Index (DII). This database is currently one of the most comprehensive databases of international patent information in the world, published by Thomson Derwent Publishing Company. Every week, 25,000 patent documents published by more than 40 countries, regions and patent organizations and 45,000 patent citations are included in the database. Derwent, a world-class large patent database, provides a standardized and reliable data source for large-scale patentometric research.
The search strategy for TP families is “PN = (US*) AND PN = (JP*) AND PN = (EP*)” and the search strategy for PCT applications is “PN = (WO*)”. It should be noted that the PCT came into effect in 1978, so the earliest PCT applications appeared in 1978, and there were not many TP families before 1978. Therefore, we limit the search time range to after 1978, and the retrieval date is October 1, 2021. A total of 1,589,172 TP families and 4,067,389 PCT applications are retrieved, so the data volume of PCT applications is about 2.56 times that of TP families. Figure 1 shows the basic situation of the data.
In Fig. 1, the left part is the number of families of TP and PCT in every priority year. We can see that the number of PCT rises rapidly, while the number of TP rises relatively slowly and even shows a downward trend in recent years, which may be because the application process for TP is more complicated than that for PCT. The right part is the Venn diagram of TP and PCT, and they share 1,030,579 patent families which account for 64.85% of TP, 25.34% of PCT, and 22.28% of their union. It can be seen that the degree of overlap between TP and PCT is relatively high.
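The overlap percentages quoted above follow directly from the three counts reported in the text. The short check below uses only those counts; the variable names are ours.

```python
tp_families = 1_589_172
pct_applications = 4_067_389
shared = 1_030_579

# Inclusion-exclusion: size of the union of the two datasets.
union = tp_families + pct_applications - shared

share_of_tp = shared / tp_families
share_of_pct = shared / pct_applications
share_of_union = shared / union
size_ratio = pct_applications / tp_families

print(f"union: {union:,}")                       # union: 4,625,982
print(f"share of TP: {share_of_tp:.2%}")         # share of TP: 64.85%
print(f"share of PCT: {share_of_pct:.2%}")       # share of PCT: 25.34%
print(f"share of union: {share_of_union:.2%}")   # share of union: 22.28%
print(f"PCT / TP: {size_ratio:.2f}")             # PCT / TP: 2.56
```

The share of the union is the Jaccard index of the two sets of families.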
Furthermore, the broad flowchart of research is shown in Fig. 2. In the next section, we present the basic statistics of the global data and w-core, visualize the w-core and calculate the global similarity.

Results and discussion

The results are also divided into three parts, namely the IPC distribution, IPC co-occurrence networks, and nation-IPC co-occurrence networks. In the three parts, we will discuss the w-cores and global situations respectively.
As the quantities of both TP and PCT exceed one million, after repeated testing, it is found that the appropriate w-cores can be selected when \(a=100\). In order to understand overall similarities and differences between PCT and TP, the basic statistics of global data and w-cores are shown in Table 1, which includes the average, standard deviation, minimum, median, maximum, quartile and the Spearman Correlation between PCT and TP. In Table 1, IPC means IPC distribution, Co-IPC is IPC co-occurrence network, and Nation-IPC is nation-IPC co-occurrence network. In addition, N indicates the sample size, and the value of N in w-cores also means the value of \({w}_{100}\).
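Table 1
The basic statistics of global data and w-cores

| Scope | Type | Dataset | N | Min. | Q1 | Med. | Q3 | Max. | Avg. | Std. | Correl. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Global | IPC | PCT | 2374 | 0 | 1 | 2 | 341.5 | 420,059 | 4382.39 | 20,831.09 | 0.838** |
| Global | IPC | TP | 2374 | 0 | 1 | 2 | 158 | 208,380 | 2172.52 | 10,131.21 | |
| Global | Co-IPC | PCT | 137,286 | 0 | 1 | 4 | 19 | 247,641 | 96.82 | 1298.37 | 0.860** |
| Global | Co-IPC | TP | 137,286 | 0 | 1 | 2 | 11 | 135,449 | 61.57 | 769.09 | |
| Global | Nation-IPC | PCT | 36,610 | 0 | 1 | 6 | 46 | 203,726 | 284.08 | 2697.79 | 0.791** |
| Global | Nation-IPC | TP | 36,610 | 0 | 0 | 1 | 12 | 98,021 | 140.87 | 1342.38 | |
| W-core | IPC | PCT | 155 | 15,508 | 20,999 | 30,136 | 59,299 | 420,059 | 53,570.61 | 63,128.09 | 0.891** |
| W-core | IPC | TP | 111 | 11,317 | 14,096 | 21,421 | 41,334 | 208,380 | 34,522.39 | 32,397.93 | |
| W-core | Co-IPC | PCT | 125 | 12,533 | 15,886 | 20,315 | 32,214 | 247,641 | 30,234.93 | 27,866.84 | 0.852** |
| W-core | Co-IPC | TP | 101 | 10,182 | 12,547.5 | 15,603 | 23,955 | 135,449 | 21,012.50 | 15,939.64 | |
| W-core | Nation-IPC | PCT | 123 | 12,415 | 14,912 | 20,212 | 35,365 | 203,726 | 32,188.28 | 31,223.67 | 0.813** |
| W-core | Nation-IPC | TP | 91 | 9321 | 12,189 | 15,525 | 24,403 | 98,021 | 20,717.88 | 14,454.40 | |

Correl. is the correlation coefficient between PCT and TP, derived from a two-sided Spearman test and reported once per PCT/TP pair.
*p < 0.05; **p < 0.01. The correlation between the w-cores of PCT and TP is calculated based on the overlap of the two w-cores.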
As shown in Table 1, firstly, the values of these statistical indicators for PCT are higher than those for TP, apart from some ties among the minimum, Q1 and median of the global data, because the data volume of PCT is bigger than that of TP and PCT is more dispersed than TP. Secondly, the minimum, Q1, median, and Q3 of the three parts in the global data are very small, which indicates that most IPC categories have few patents and most links are weak. By contrast, the values of those indicators in the w-cores are much higher than in the global data, which suggests that the w-index and w-core can extract the core part of the global data. Thirdly, the three values of \({w}_{100}\) for PCT are greater than those for TP, because there are many more PCT applications than TP families. Finally, according to the Spearman correlation, PCT and TP have a strong positive correlation for both the global data and the w-cores.
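For readers who want to reproduce the Correl. column on their own data, the sketch below computes the Spearman coefficient as the Pearson correlation of average ranks. It is a plain-Python illustration with made-up counts; the function names are ours, it does not compute the p-values behind the significance stars, and it is not necessarily the software the authors used.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical patent counts for the same four IPC categories.
pct_counts = [10, 200, 3000, 40000]
tp_counts = [1, 5, 9, 1000]

print(spearman(pct_counts, tp_counts))  # approximately 1.0: identical ordering
```

Because only the ordering matters, the coefficient is insensitive to the difference in volume between PCT and TP.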
The basic statistics present the overall situation, while detailed information of PCT and TP needs to be further shown. Hence, in the following sections, we visualize the w-cores of PCT and TP and calculate the global similarity of the three parts to make sense of the specific similarities and differences.

IPC distribution

The w-cores of TP and PCT have 111 and 155 IPC categories respectively, and 107 IPC categories in the w-core of TP are included in the w-core of PCT. The 107 IPC categories shared by the w-cores of TP and PCT mainly distribute in the front of the w-core of PCT. 48 IPC categories only appear in the w-core of PCT because the data volume of PCT is larger and there are more patents belonging to each IPC category. Meanwhile, 4 IPC categories only appear in the w-core of TP. Actually, they also distribute in PCT, but they have not entered the w-core because of their relatively small numbers.
The overlap of w-cores of IPC distribution of TP and PCT is shown in Fig. 3. The vertical axis is the number of patents in each IPC category and the horizontal axis is the descending order of IPC categories of PCT. The green column is the IPC distribution of PCT, the red column is the IPC distribution of TP and the green line is the distribution of PCT* (see below).
Fig. 3 shows that the w-cores of the IPC distributions of TP and PCT are highly concordant. First, TP and PCT have similar w-cores. Second, a few IPC categories, such as G06F and A61K, contain a great many patents, while the number of patents in most IPC categories is relatively low. Third, TP and PCT follow similar distribution trends: in many IPC categories, if the share of TP is high, that of PCT tends to be high as well. In addition, based on Eq. (1), we calculate the cosine similarity of the global IPC distributions of TP and PCT; the similarity is 0.968, which further indicates that TP and PCT are alike.
However, a few differences exist. In all IPC categories in Fig. 3, PCT is higher than TP, because the data volume of PCT is about 2.56 times that of TP. Therefore, to make the comparison more intuitive, we divide the number of PCT applications in each IPC category by 2.56 to obtain PCT*, which removes the disparity in the numbers of TP and PCT. Even so, Fig. 3 shows that TP is generally slightly higher than PCT*. The reason is the broader technical convergence of TP: each TP family has 3.24 IPC categories on average, while the average number of IPC categories per PCT application is only 2.56, which is 0.79 times that of TP. When focusing on specific IPC categories, we find that there are still some differences between TP and PCT*. On the one hand, some categories of TP are much higher than PCT*, namely A61K (preparations for medical, dental, or toilet purposes), A61P (specific therapeutic activity of chemical compounds or medicinal preparations), C07D (heterocyclic compounds), C08L (compositions of macromolecular compounds), C07C (acyclic or carbocyclic compounds), C07B (general methods of organic chemistry; apparatus therefor), B01J (chemical or physical processes, e.g. catalysis or colloid chemistry), and C08F (macromolecular compounds obtained by reactions only involving carbon-to-carbon unsaturated bonds), all of which are related to chemistry and medicine. On the other hand, four categories of TP, which belong to electronic communication, are lower than PCT*: G06F (electric digital data processing), H04L (transmission of digital information), H04W (wireless communication networks) and G06K (recognition of data; presentation of data; record carriers; handling record carriers).
In recent years, with the rapid development of electronic communication [77, 79, 80], the patents corresponding to these IPC categories seem to be more inclined to PCT, perhaps because PCT makes international patent applications faster and more convenient. All these differences are at the micro level, while the IPC distributions of TP and PCT are similar on the whole.
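The PCT* rescaling described above can be sketched as follows. Only the dataset totals come from the text; the per-category counts are hypothetical, and the function names are ours.

```python
# Ratio of dataset sizes reported in the Data section (about 2.56).
SCALE = 4_067_389 / 1_589_172


def rescale_pct(pct_counts, scale=SCALE):
    """Divide each PCT count by the PCT/TP size ratio to obtain PCT*."""
    return {ipc: n / scale for ipc, n in pct_counts.items()}


def tp_minus_pct_star(tp_counts, pct_counts, scale=SCALE):
    """Positive values: category relatively stronger in TP than in PCT*."""
    pct_star = rescale_pct(pct_counts, scale)
    keys = sorted(set(tp_counts) | set(pct_star))
    return {k: tp_counts.get(k, 0) - pct_star.get(k, 0.0) for k in keys}


# Hypothetical counts for two IPC categories (not the paper's data).
tp = {"A61K": 500, "G06F": 300}
pct = {"A61K": 1000, "G06F": 1000}

diff = tp_minus_pct_star(tp, pct)
print({ipc: round(d, 1) for ipc, d in diff.items()})

# Average IPC categories per record, as reported: PCT 2.56 vs TP 3.24.
print(round(2.56 / 3.24, 2))  # 0.79
```

In this toy case A61K comes out positive (TP-leaning) and G06F negative (PCT-leaning), which mirrors the direction of the differences reported for these two categories.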

IPC co-occurrence network

The basic data of the global network and the w-core of the IPC co-occurrence network are shown in Table 2.
Table 2
The basic data of the IPC co-occurrence network

| Co-IPC | Global nodes | Global links | Global frequency | W-core nodes | W-core links | W-core frequency |
|---|---|---|---|---|---|---|
| TP | 2004 | 115,037 | 8,453,047 | 51 (2.54%) | 101 (0.09%) | 2,122,263 (25.11%) |
| PCT | 2085 | 127,535 | 13,291,821 | 65 (3.12%) | 125 (0.10%) | 3,779,366 (28.43%) |
In order to focus on the most important part of networks, Fig. 4 shows the w-cores of the IPC co-occurrence network of TP and PCT, where the rectangular box is the IPC category and different colors represent different clusters. The larger the rectangle box, the more times it co-occurs with other boxes. Similarly, if the link between two IPC categories is thick, they co-occur many times.
In Fig. 4, we can see that TP has five clusters and PCT has six clusters, but their clusters are very similar. For TP and PCT, the largest cluster is the red group represented by A61K, which is the field of medicine. The second largest cluster, colored blue, mainly includes H04W and H04L, which is communication technology. In addition, the purple group is chemical technology, electrical technology is represented by yellow and medical treatment and diagnosis technology is the green cluster which is closely linked to the red cluster. Furthermore, the cosine similarity of the global IPC co-occurrence networks of TP and PCT is 0.975, so they are highly similar in terms of IPC co-occurrence.
Nevertheless, there are also some differences. PCT has more nodes and its w-core network is denser than that of TP, which may be related to the larger number of PCT applications. The light blue cluster appears only on the right side of the PCT w-core network and includes three IPC categories, namely F21Y (relating to the form or the kind of the light sources or the color of the light emitted), F21S (non-portable lighting devices; systems thereof; vehicle lighting devices specially adapted for vehicle exteriors) and F21V (functional features or details of lighting devices or systems thereof; structural combinations of lighting devices with other articles). These IPC categories point to lighting technology, indicating that this technology leans more toward PCT.

Nation-IPC co-occurrence network

The basic data of the global network and the w-core of the Nation-IPC co-occurrence network are shown in Table 3.
Table 3
The basic data of the nation-IPC co-occurrence network

| Nation-IPC | Global nodes | Global links | Global frequency | W-core nodes | W-core links | W-core frequency |
|---|---|---|---|---|---|---|
| TP | 2110 | 23,837 | 5,157,334 | 58 (2.75%) | 91 (0.38%) | 1,885,327 (36.56%) |
| PCT | 2228 | 54,550 | 10,400,329 | 53 (2.38%) | 109 (0.20%) | 2,941,705 (28.28%) |
In the same way, Fig. 5 also displays the w-cores of the nation-IPC co-occurrence network of TP and PCT. The green boxes are countries or regions and the red boxes are IPC categories.
We find that the w-core of the nation-IPC co-occurrence network of TP is similar to that of PCT. In both subgraphs of Fig. 5, the United States is linked to the most IPC categories, which means that patents from the United States currently cover a wide range of fields. Japan ranks second, so its technical fields are broad too. In addition, the two w-cores share several other countries or regions, namely Germany, Europe, France and Great Britain.
To compare the similarity of global nation-IPC co-occurrence networks of TP and PCT, we count the number of dimensions in the vector of some representative countries/regions in global networks, and calculate their cosine similarity. The results are presented in Table 4.
Table 4
The similarity of five representative countries/regions in TP and PCT

| Indicators | Global | US | JP | DE | EP | CN |
|---|---|---|---|---|---|---|
| The number of dimensions | 36,610 | 1823 | 1235 | 1089 | 765 | 657 |
| Similarity | 0.935 | 0.972 | 0.970 | 0.892 | 0.987 | 0.978 |
Generally speaking, the similarities between TP and PCT are very high, both for these countries/regions and for the whole network. Taking Fig. 5 and Table 4 together, Japan and Germany deserve attention. Although Japan has a high similarity (0.970) between the global networks of TP and PCT, its position in the two w-core networks differs: Japan is linked to more IPC categories in the w-core of TP than in the w-core of PCT. Conversely, Germany has similar structures in the two w-core networks, but its similarity in the global network (0.892) is lower than that of the other countries/regions.
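However, as in Fig. 4, PCT has more nodes overall (2228 versus 2110 in the global network; Table 3) and its w-core network is denser than that of TP (109 links among 53 nodes versus 91 links among 58 nodes). This is probably also related to the large number of PCT applications. China and Korea appear only in the w-core network of PCT, which suggests that applicants there tend to file PCT applications.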
In this section, we have presented the similarities and differences between TP families and PCT applications in terms of IPC distribution, IPC co-occurrence networks, and nation-IPC co-occurrence networks, using three methods: statistical analysis, network visualization, and cosine similarity. We find that the w-core is suitable for selecting the core part of big data. The TP and PCT datasets are very similar in all three respects, for both the global data and the w-cores, although there are some micro-level differences, as noted above. Thus, at a macro level, TP families and PCT applications show high concordance in their ability to reflect innovation capability or R&D internationalization, but when it comes to technological convergence, specific research topics, and countries/regions, the choice may depend on the purpose of the research.

Conclusion and limitation

Based on the above analysis, this study makes three main contributions. First, the w-core is a useful concept for characterizing the core of important patents and patent networks. Second, we profile the w-cores and the global situation of the TP families and PCT applications, and characterize their concordance in three respects: IPC distribution, IPC co-occurrence network, and nation-IPC co-occurrence network. Although the data volumes of TP and PCT differ greatly, the results show that TP and PCT are very similar as a whole. Hence, if we want to observe the innovation capability, R&D internationalization, technical structure, or development trend of a country/region or an industry, an analysis based on TP gives results similar to one based on PCT, which means that TP and PCT can substitute for each other to a certain extent. Third, TP and PCT differ in technological convergence, in some specific fields (e.g. chemistry, medicine, electronic communication, and lighting technology), and in some countries/regions (e.g. Germany, Japan, China, and Korea), so it is necessary to choose between TP and PCT according to the research purpose.
This comparison between TP and PCT is still a relatively preliminary study, and it has some limitations. First, we use only basic statistics and network visualization, whereas many other statistical methods and network indicators, such as regression, clustering, and centrality, could be used to further portray the TP families and PCT applications. Second, we characterize PCT and TP in three respects (IPC distribution, IPC co-occurrence networks, and nation-IPC co-occurrence networks), which involve only the IPC codes and countries/regions of TP families and PCT applications. However, the citations and contents of patents both play important roles in patent analysis, so more diverse information about patents is needed to judge whether the two datasets are similar. Finally, because of delays in patent applications and publications [81], it is difficult to cover all TP families and PCT applications, especially those of recent years. In future work, we hope to extend our study to patent citations and contents, using various statistical methods and network indicators, to explore whether TP and PCT are concordant from different perspectives.

Acknowledgements

We acknowledge the financial support from the National Natural Science Foundation of China (Grant No. 71673131). We thank the anonymous reviewers for their constructive suggestions.

Declarations

Not applicable.
Not applicable.

Competing interests

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
8. Nam M, Ko J, Lee J. Analysis of the relationship between regulation and R&D efficiency using quantile regression. In: International conference on big data and smart computing (BigComp); 2022, January 17–20, Daegu, South Korea.
14. Chen DZ, Huang WT, Huang MH. Analyzing Taiwan’s patenting performance: comparing US patents and triadic patent families. Malays J Lib Inf Sci. 2014;19(1):51–70.
20. Wada T. Cognitive distances in prior art search by the triadic patent offices: empirical evidence from international search reports. In: Proceedings of the international conference on scientometrics and informetrics. 15th International Conference of the International Society for Scientometrics and Informetrics (ISSI), Bogazici Univ, Istanbul, Turkey; 2015.
39. Zdralek P, Stemberkova R, Matulova P, Maresova P, Kuca K. Commercial potential of university patents through patent cooperation treaty application. In: International conference on social sciences and humanities (SOSHUM), Kota Kinabalu, Malaysia; 2016, Apr 19–21.
46. Chen JH, Jang SL, Chang CH. The patterns and propensity for international co-invention: the case of China. Scientometrics. 2013;94(2):481–95.
48. Lee S, Kim MS. Inter-technology networks to support innovation strategy: an analysis of Korea’s new growth engines. Innovation. 2010;12(1):88–104.
49. Kumari R, Jeong JY, Lee BH, Choi KN, Choi K. Topic modelling and social network analysis of publications and patents in humanoid robot technology. J Inf Sci. 2019;47(5):658–76.
52. Leydesdorff L, Kushnir D, Rafols I. Interactive overlay maps for US patent (USPTO) data based on international patent classification (IPC). Scientometrics. 2012;98(3):1583–99.
54. Kim MS, Kim C. On a patent analysis method for technological convergence. Proc Soc Behav Sci. 2012;40(40):657–63.
55. Borgatti SP, Everett MG. Network analysis of 2-mode data. Soc Netw. 1997;19(3):243–69.
58. Rassenfosse GD, Dernis H, Guellec D, Picci L, Potterie BVPDL. The worldwide count of priority patents: a new indicator of inventive activity. Melbourne Inst Work Pap Ser. 2012;42(3):720–37.
64. Egghe L. Power laws in the information production process: Lotkaian informetrics. Oxford (UK): Elsevier; 2005.
68. Chen HC, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Quart. 2012;36(4):1165–88.
Metadata
Title: Characterizing patent big data upon IPC: a survey of triadic patent families and PCT applications
Authors: Jewel X. Zhu, Minghan Sun, Shelia X. Wei, Fred Y. Ye
Publication date: 01.12.2023
Publisher: Springer International Publishing
Published in: Journal of Big Data, Issue 1/2023
Electronic ISSN: 2196-1115
DOI: https://doi.org/10.1186/s40537-023-00778-5
