1 Introduction
Citation type | Example sentence |
---|---|
concept | To this end, SWRL [14] extends OWL-DL and OWL-Lite with Horn clauses |
claim | In the traditional hypertext Web, browsing and searching are often seen as the two dominant modes of interaction (Olston and Chi 2003) |
author | Gibson et al. [12] used hyperlink for identifying communities |
2 Citation recommendation
2.1 Terminology
2.2 Scenarios, advantages, and caveats of citation recommendation
- A researcher needs to write a scientific text on a topic outside of her core research area and expertise (e.g., a generic research proposal [20] or a description of potential future work).
- A journalist in the science domain (e.g., authoring texts for a popular science magazine) needs to write an article on a certain scientific topic [124, 130]. We can assume that the journalist is typically not an expert on the topic she needs to write about. Citations in the text help to substantiate the written facts and make the text more complete and understandable.
- “Newcomers” in science, such as Master's students and PhD students in their early years, are confronted with a vast amount of citable publications and typically do not yet know the relevant literature in the research field [71, 174]. Recommended citations help not only students in writing systematic and scientific texts, such as research project proposals (exposés), but also their mentors (e.g., professors).
2.3 Task definition
Symbol | Description |
---|---|
\(D=\{d_1,\ldots ,d_i,\ldots ,d_n\}\) | Set of citing documents in the offline step |
\(R=\{r_1,\ldots ,r_m,\ldots ,r_M\}\) | References of all citing documents D |
\(C_i=\{c_{i1},\ldots ,c_{ij},\ldots ,c_{iN}\}\) | Citation contexts from document \(d_i\) |
\(Z_i=\{z_{i1},\ldots ,z_{ij},\ldots ,z_{iN}\}\) | Abstract citation contexts from document \(d_i\) |
Z | Set of all abstract citation contexts of D |
f | Mapping function |
g | Mapping function |
d | Input document in the online step |
\(R^d\) | References of document d |
\(C^d=\{c_{1}^d,\ldots c_{k}^d, \ldots , c_{K}^d\}\) | Potential citation contexts of document d |
\(Z^d=\{z_{1}^d,\ldots ,z_{k}^d,\ldots ,z_{K}^d\}\) | Abstract representations of potential citation contexts of document d |
\(R_{z_{k}^d}\) | Set of papers recommended for citation |
\(d'\) | Input document d enriched by recommended citations |
2.3.1 Offline step
2.3.2 Online step
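The interplay of the offline and online steps can be illustrated with a minimal sketch using the notation of the table above. The bag-of-words abstraction for f and the overlap-based mapping g are placeholder assumptions for illustration only, not an approach from the surveyed literature.

```python
# Sketch of the citation recommendation pipeline (Sect. 2.3 notation).
# f maps a citation context c to an abstract representation z;
# g maps an abstract context z to a set of recommendable papers.
from collections import Counter

def f(citation_context):
    """Map a citation context c to an abstract representation z (bag of words)."""
    return Counter(citation_context.lower().split())

def offline_step(context_reference_pairs):
    """Build the mapping g from the citation contexts and references of D."""
    index = [(f(c), r) for c, r in context_reference_pairs]
    def g(z, top_k=3):
        # Score each indexed context by word overlap with z and return the
        # cited papers of the best-matching contexts.
        scores = [(sum((z & z_i).values()), r) for z_i, r in index]
        return [r for s, r in sorted(scores, key=lambda p: -p[0])[:top_k] if s > 0]
    return g

# Online step: for each potential context c_k^d of an input document d,
# compute z_k^d = f(c_k^d) and retrieve the recommendations R_{z_k^d} = g(z_k^d).
pairs = [("horn clauses extend description logics", "paper_A"),
         ("detecting communities in hyperlink graphs", "paper_B")]
g = offline_step(pairs)
recommendations = g(f("communities and hyperlink structure"))
```

Any real approach replaces the toy f (e.g., by topic models or neural embeddings) and g (e.g., by learning-to-rank), as surveyed in Sect. 3.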
2.4 Related research fields
2.4.1 Non-scholarly citation recommendation
Tool | Approach | Input format | Output format | Extracts citation contexts (citation context length) | Extracts citing paper’s abstract |
---|---|---|---|---|---|
CERMINE [158] | CRF | pdf | xml | Yes (300 words) | Yes |
ParsCit [39] | CRF | txt | xml, txt | Yes (200 words) | No |
– | CRF | pdf | xml | No | Yes |
PDFX [38] | Rule-based | pdf | xml | Yes (300 words) | Yes |
Crossref pdf-extractor [40] | Rule-based | pdf | xml, bib | No | No |
IceCite [12] | Rule-based | pdf | tsv, xml, json | No | Yes |
Science Parse [6] | CRF | pdf | json | Yes | Yes |
2.4.2 Scholarly data recommendation
2.4.3 Related citation-based tasks
3 Comparison of citation recommendation approaches
Reference | Venue | Local CR |
---|---|---|
McNee et al. [110] | CSCW’02 | |
Strohman et al. [144] | SIGIR’07 | |
Nallapati et al. [119] | KDD’08 | |
Tang et al. [151] | PAKDD’09 | |
He et al. [72] | WWW’10 | \(\checkmark \) |
Kataria et al. [89] | AAAI’10 | \(\checkmark \) |
Bethard et al. [20] | CIKM’10 | |
He et al. [71] | WSDM’11 | \(\checkmark \) |
Lu et al. [107] | CIKM’11 | |
Wu et al. [167] | FSKD’12 | |
He et al. [69] | SPIRE’12 | \(\checkmark \) |
Huang et al. [74] | CIKM’12 | \(\checkmark \) |
Rokach et al. [134] | LSDS-IR’13 | \(\checkmark \) |
Liu et al. [101] | AIRS’13 | \(\checkmark \) |
Jiang et al. [84] | TCDL Bulletin’13 | |
Zarrinkalam et al. [175] | Program’13 | |
Duma et al. [45] | ACL’14 | \(\checkmark \) |
Livne et al. [103] | SIGIR’14 | \(\checkmark \) |
Tang et al. [153] | SIGIR’14 | \(\checkmark \) |
Ren et al. [131] | KDD’14 | |
Liu et al. [99] | JCDL’14 | |
Liu et al. [98] | CIKM’14 | |
Jiang et al. [85] | Web-KR’14 | |
Huang et al. [75] | WCMG’15 | \(\checkmark \) |
Chakraborty et al. [35] | ICDE’15 | |
Hsiao et al. [73] | MDM’15 | |
Gao et al. [60] | FSKD’15 | |
Lu et al. [106] | APWeb’15 | |
Jiang et al. [86] | CIKM’15 | |
Liu et al. [100] | iConf’16 | |
Duma et al. [47] | LREC’16 | |
Duma et al. [46] | D-Lib’16 | |
Yin et al. [174] | APWeb’17 | \(\checkmark \) |
Ebesu et al. [48] | SIGIR’17 | \(\checkmark \) |
Guo et al. [65] | IEEE’17 | |
Cai et al. [29] | AAAI’18 | |
Bhagavatula et al. [21] | NAACL’18 | |
Kobayashi et al. [91] | JCDL’18 | \(\checkmark \) |
Jiang et al. [87] | JCDL’18 | |
Han et al. [67] | ACL’18 | \(\checkmark \) |
Jiang et al. [88] | SIGIR’18 | |
Zhang et al. [176] | ISMIS’18 | |
Cai et al. [28] | IEEE TLLNS’18 | |
Yang et al. [171] | JIFS’18 | |
Dai et al. [41] | JAIHC’18 | |
Yang et al. [170] | IEEE Access’18 | \(\checkmark \) |
Mu et al. [118] | IEEE Access’18 | |
Jeong et al. [81] | arXiv’19 | \(\checkmark \) |
Yang et al. [169] | IEEE Access’19 | |
Dai et al. [42] | IEEE Access’19 | |
Cai et al. [30] | IEEE Access’19 | |
3.1 Corpus creation
3.2 Corpus characteristics
3.3 Comparison of local citation recommendation approaches
Paper | Year | Group | Approach | User model | Prefilter | Citation context length | Citation placeholders | Cited papers’ content needed | Evaluation data set | Domain | Evaluation metrics |
---|---|---|---|---|---|---|---|---|---|---|---|
[72] | 2010 | b | Probabilistic model (Gleason’s Theorem) | – | – | 50 words before and after | Yes | No | CiteSeerX | Computer science | Recall, co-cited prob., nDCG, runtime |
[89] | 2010 | b | Topic model (adapt. LDA) | – | – | 30 words before and after | Yes | Yes | CiteSeer | Computer science | RKL |
[71] | 2011 | a | Ensemble of decision trees | – | – | 50 words before and after | No | No | CiteSeerX | Computer science | Recall, co-cited probability, nDCG |
[69] | 2012 | c | Machine translation | – | – | 1 sentence | Yes | Yes | Own dataset | Computer science | MAP |
[74] | 2012 | c | Machine translation | – | – | 1-3 sentences | Yes | No | CiteSeer & CiteULike | Computer science | Precision, recall, F1, Bpref, MRR |
[134] | 2013 | a | Ensemble of supervised ML techniques | Author | Top 500 | 50 words before and after | Yes | No | CiteSeer & CiteULike | Computer science | F1, precision, runtime |
[101] | 2013 | a | SVM | Author | – | On average 13.4 words | Yes | No | Own dataset | Computer science | Recall, MAP |
[45] | 2014 | a | Cosine similarity of vectors (TF-IDF based) | – | – | 5-30 words before and after | Yes | Depending on variant | Part of ACL Anthology | Comput. linguistics | Accuracy |
[103] | 2014 | a | Regression trees (gradient boosted) | Author | Top 500 | 50 words before and after | No | Yes | Own dataset | Computer science | nDCG |
[153] | 2014 | d | Learning-to-rank | – | – | Sentence plus sentence before and after | Yes | No | Own dataset | Computer science and technology | Recall, MAP, MRR |
[75] | 2015 | d | Neural network (feed-forward) | – | Variable | Sentence plus sentence before and after | Yes | No | CiteSeer | Computer science | MAP, MRR, nDCG |
[174] | 2017 | d | Neural network (CNN) | – | Variable | Sentence plus sentence before and after | Yes | No (but title + abstract) | Own (same as in [101]) | Computer science | MAP, recall |
[48] | 2017 | d | Neural network (CNN + RNN) | Author | Top 2048 | 50 words before and after | Yes | No | RefSeer | Computer science | Recall, nDCG, MAP, MRR |
[91] | 2018 | d | Cosine similarity of paper embeddings | – | – | 1 sentence | Yes | Yes | Own dataset (from ACM library) | Computer science | nDCG |
[67] | 2018 | d | Dot product of 2 paper embeddings | – | – | 50 words before and after | Yes | Yes | NIPS, ACL-ANT, CiteSeer + DBLP | Computer science | Recall, MAP, MRR, nDCG |
[170] | 2018 | d | Neural network (LSTM) | Author, venue | – | 5 sentences before and after | Yes | Yes | AAN + DBLP | Computer science | Recall, MAP, MRR |
[81] | 2019 | d | Neural network (feed-forward) | – | – | 50 words before and after | Yes | No | AAN + Own dataset | Computer science | MAP, MRR, recall |
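One of the simplest local approaches in the table above ranks candidate papers by the cosine similarity of TF-IDF vectors ([45]). A minimal self-contained sketch of that baseline follows; the tokenization, smoothing, and toy candidate papers are illustrative assumptions, not the exact setup of [45].

```python
# TF-IDF / cosine-similarity baseline for local citation recommendation:
# rank candidate papers by their similarity to a single citation context.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (dicts) for a list of token lists."""
    df = Counter(t for doc in docs for t in set(doc))     # document frequency
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}      # smoothed IDF
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(context, candidates, top_k=2):
    """Rank candidate papers, given as (id, text) pairs, for a citation context."""
    docs = [context.lower().split()] + [t.lower().split() for _, t in candidates]
    vecs = tfidf_vectors(docs)
    scores = [(cosine(vecs[0], v), cid)
              for (cid, _), v in zip(candidates, vecs[1:])]
    return [cid for s, cid in sorted(scores, reverse=True)[:top_k] if s > 0]
```

The "Cited papers' content needed" column in the table corresponds to the `text` field of each candidate here; variants of [45] instead use the text of the contexts citing the candidate.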
3.4 System demonstrations
4 Data sets for citation recommendation
4.1 Corpora containing papers’ content
4.1.1 Overview of data sets
- CiteSeerX (complete) [32] Referring to the CiteSeerX version of 2014, the number of indexed documents exceeded 2M. The CiteSeerX system crawls, indexes, and parses documents that are openly available on the Web. Therefore, only about half of all indexed documents are actually scientific publications, while a large fraction are other manuscripts. To what degree findings from evaluations based on CiteSeerX hold for actual citing behavior in science is therefore unknown.
- CiteSeerX cleaned by Caragea et al. [32] The raw CiteSeerX data set contains considerable noise and many errors, as outlined by Roy et al. [135]. Thus, in 2014, Caragea et al. [32] released a smaller, cleaner version of it. The revised data set resolves some of the noise problems and, in addition, links papers to DBLP.
- CiteSeerX cleaned by Wu et al. [168] According to Wu et al. [168], the cleaned data set [32] still has relatively low precision in matching CiteSeerX papers with papers in DBLP. Hence, Wu et al. published another approach for creating a cleaner data set out of the raw CiteSeerX data, achieving slightly better results in matching papers from CiteSeerX and DBLP.
- ACL Anthology Network (ACL-AAN) [129] ACL-AAN is a manually curated database of citations, collaborations, and summaries in the field of computational linguistics. It is based on 18k papers; the latest release is from 2016. ACL-AAN has been used as an evaluation data set for many tasks.
- ACL Anthology Reference Corpus (ACL-ARC) [22] ACL-ARC is a widely used corpus of scholarly publications about computational linguistics, available in several versions. It is based on the ACL Anthology website and contains the source PDF files (about 11k for the February 2007 snapshot), the corresponding content as plain text, and metadata of the documents taken either from the website or from the PDFs.
- CORE CORE collects openly available scientific publications (originating from institutional repositories, subject repositories, and journal publishers) as a data basis for approaches to search, text mining, and analytics. As of October 2019, the data set contains 136M open access articles. CORE has been proposed for citation-based tasks for several years. However, to the best of our knowledge, it has not yet been used for evaluating or deploying any of the published citation recommendation systems.
- unarXiv [136] This data set is an extension of the arXiv CS data set. It consists of over one million full-text documents (about 269 million sentences) linking to 2.7 million unique papers via 29.2 million citation contexts (with 15.9 million unique references). All papers and citations are linked to the Microsoft Academic Graph.
Data set | Size of data set | Citation context available, size | Metadata of citing paper (structured) | Metadata of cited paper (structured) | Full text of all citing papers | Full text of all cited papers | Abstract of citing paper | Abstract of cited paper | Full citation graph | Cleanliness | Links | Usage |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CiteSeerX complete | Very large | Yes, 400 chars | Yes (noisy) | Yes (noisy) | Yes | No | Yes | Not all | No (but large) | No | No | |
CiteSeerX cleaned by Caragea et al. | Large | Yes, 400 chars | Yes (noisy) | Yes (noisy) | No | No | Yes | Not all | No (but large) | No | DBLP | |
RefSeer | Large | Yes, 400 chars | Yes (noisy) | Yes (noisy) | No | No | Yes | Not all | No | No | No | [48] |
CiteSeerX cleaned by Wu et al. | Large | Yes, 400 chars | Yes (noisy) | Yes (noisy) | No | No | Yes | Not all | No (but large) | No | DBLP | |
ACL-AAN | Small | No (extractable) | Yes | No (extractable) | Yes (noisy) | No | (Extractable) | Not all | No | No | No | |
ACL-ARC | Small | No (extractable) | Yes | No (extractable) | Yes (noisy) | No | (Extractable) | Not all | No | No | No | [20] |
arXiv CS | Medium | Yes, 1 sentence | Yes | Yes | Yes | No | (Extractable) | Not all | No | Yes | DBLP | |
CORE | Very large | No (part. extractable) | Yes | No | Partially | No | Yes | Not all | No (but large) | Yes | No | |
Scholarly Dataset 2 | Medium | No (extractable) | No (extractable) | No (extractable) | Yes | No | (Extractable) | Not all | No | Yes | DBLP | |
unarXiv | Large | Yes, 3 sentences | Yes | Yes | Yes | No | (Extractable) | Not all | No | Yes | MAG | |
Data set | Size of data set | Abstract of citing paper | Abstract of cited paper | Full citation graph | Cleanliness | Links |
---|---|---|---|---|---|---|
AMiner DBLPv10 | Large | Partially | Partially | Yes | Yes | DBLP |
AMiner ACMv9 | Large | Yes | Yes | Yes | Yes | DBLP (but no URIs)
Microsoft Academic Graph | Very large | No | No | Yes | Yes | No |
Open Academic Graph | Very large | Yes | Yes | Yes (open access papers) | Yes | DBLP (but no URIs) |
PubMed | Large | No | Partially | Yes | Yes | No |
4.1.2 Comparison of evaluation data sets
4.2 Corpora containing papers’ metadata
- Microsoft Academic Graph This data set can be considered an actual knowledge graph about publications and associated entities such as authors, institutions, journals, and fields of study. Direct access to the MAG is provided only via an API; however, dump versions have been created. Prior versions of the MAG are known as the Microsoft Academic Search data set, based on the Microsoft Academic Search project, which was retired in 2012.
- Open Academic Graph This data set is designed to be the intersection of the Microsoft Academic Graph and the AMiner data. In many cases, the DBLP entries for computer science publications should be retrievable.
- PubMed PubMed is a database of bibliographic information with a focus on life science literature. As of October 2019, it contains 29M citations and abstracts. It also provides links to the full-text articles and third-party websites where available (but no content).
5 Evaluation methods and challenges
5.1 Evaluation methods for citation recommendation
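The offline metrics reported most frequently in the surveyed papers (see the evaluation metrics column in Sect. 3.3) are recall@k, MRR, MAP, and nDCG. A minimal sketch of their computation for a single ranked recommendation list follows; `ranked` is an ordered list of recommended paper ids and `relevant` the set of papers actually cited in the held-out context.

```python
# Standard offline ranking metrics for citation recommendation evaluation.
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the truly cited papers found in the top-k recommendations."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first correct recommendation."""
    for i, r in enumerate(ranked, 1):
        if r in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each position of a correct hit (AP;
    averaged over all test contexts this yields MAP)."""
    hits, total = 0, 0.0
    for i, r in enumerate(ranked, 1):
        if r in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary relevance: DCG of the list over DCG of an ideal list."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, r in enumerate(ranked[:k], 1) if r in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Note that with the citations-as-ground-truth protocol these metrics treat every non-cited paper as irrelevant, which is exactly the "fitness of citations" caveat discussed in Sect. 5.2.1.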
5.2 Challenges of evaluating citation recommendation approaches
5.2.1 Fitness of citations
5.2.2 Cite-worthiness of contexts
5.2.3 Scenario specificity
5.3 Discussion
6 Potential future work
- Topically diversifying recommended citations [35];
- Recommending papers that state similar, related, or contrary claims to the ones in the citation contexts (i.e., recommending not only papers with identical claims);
- Inserting a sufficient (optimal) set of citations; this could be useful in the presence of paper size limitations, which may be imposed, for example, by conferences. A citation recommendation system should then prioritize important citation contexts that cannot be left without citations, while perhaps skipping less important ones in order to keep the paper size within the limits;
- Given an input text with already present citations, suggesting newer or better ones to replace obsolete or poor citations;
- Combating the cold-start problem for freshly published papers that are not yet cited and for which, hence, no training data is available;
- Incorporating information on social networks among researchers and considering knowledge-sharing platforms; such data can offer additional (often timely) hints on the appropriateness of papers to be cited in particular citation contexts;
- Focusing on specific user groups that have a given pre-knowledge in common (see the scenarios listed in Sect. 2.2);
- Studying the influence of citing behavior on citation recommendation systems and developing methods for minimizing citation biases, such as biases arising from researchers belonging to the same domains, research groups, or geographical areas (cf. Sect. 5.2);
- Developing global context-aware citation recommendation approaches, i.e., approaches that recommend citations in a context-aware way yet still consider the entire content of a paper;
- Recommending citations that refute an argument (using argumentation mining);
- Designing domain-specific citation recommendation approaches and evaluating generic approaches on different disciplines (outside computer science).
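The "sufficient (optimal) set of citations" idea among the directions above can be sketched as a simple greedy selection under a citation budget. The importance scores and the budget are hypothetical inputs; how to estimate context importance is itself an open problem.

```python
# Illustrative sketch: keep only the most important citation contexts when a
# page limit allows at most `budget` citations. `context_scores` maps a
# (hypothetical) context id to an importance score in [0, 1].
def select_contexts(context_scores, budget):
    """Return the ids of the `budget` most important citation contexts."""
    ranked = sorted(context_scores.items(), key=lambda kv: -kv[1])
    return {ctx_id for ctx_id, _ in ranked[:budget]}

chosen = select_contexts({"c1": 0.9, "c2": 0.2, "c3": 0.7}, budget=2)
```

A full solution would also model interactions between contexts (e.g., two contexts covered by one citation), turning the selection into a set-cover-like optimization rather than an independent top-k ranking.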