7 February 2024 | Focus

A two-stage entity event deduplication method based on graph node selection and node optimization strategy

Written by: Wei Ai, Jia Xu, Hongen Shao, Tao Meng, Keqin Li

Published in: Soft Computing

Abstract

Entity event deduplication is the task of identifying all duplicate entity events, that is, events that describe the same entity, within a set of events. However, traditional entity event deduplication methods face two challenges. First, they usually rely on global comparison when searching for duplicate entity events: every entity event in the dataset must be compared with every other, which leads to low performance. Second, when an entity event evolves, traditional methods do not identify it well, which reduces effectiveness. To address these two problems and improve both performance and effectiveness, we propose a two-stage deduplication method based on a graph node selection and node optimization (TS-NSNO) strategy. In the first stage (TS-NS), we propose a graph node selection strategy that transforms the global comparison into a local comparison by selecting leader nodes, which greatly reduces the number of calculations and improves performance. In the second stage (TS-NO), we propose a graph node optimization strategy that combines the spatiotemporal distance and the change in entity event importance during event evolution to correct entity events that were judged incorrectly, which improves effectiveness. We conduct extensive experiments on real entity event datasets of different sizes, and the results show that our method improves both performance and effectiveness.
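The idea of replacing global comparison with local comparison against leader nodes can be illustrated with a minimal sketch. This is not the authors' TS-NS implementation: the Jaccard token similarity, the threshold value of 0.5, and the names `jaccard` and `leader_dedup` are all assumptions chosen for illustration. Each incoming event is compared only with the current leaders, and an event that matches no leader becomes a new leader.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (an assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_dedup(
    events: List[str], threshold: float = 0.5
) -> Tuple[Dict[int, List[int]], int]:
    """Group events by comparing each one only against current leaders.

    Returns (groups, comparisons). `groups` maps a leader's index to the
    indices of all events assigned to it, the leader itself included.
    `comparisons` counts the similarity computations that were performed.
    """
    tokens = [set(e.lower().split()) for e in events]
    groups: Dict[int, List[int]] = {}
    comparisons = 0
    for i, tok in enumerate(tokens):
        best_leader, best_sim = None, threshold
        for leader in groups:  # local comparison: leaders only
            comparisons += 1
            sim = jaccard(tok, tokens[leader])
            if sim >= best_sim:
                best_leader, best_sim = leader, sim
        if best_leader is None:
            groups[i] = [i]  # no leader is similar enough: new leader
        else:
            groups[best_leader].append(i)
    return groups, comparisons


# Hypothetical example events, invented for this sketch.
events = [
    "acme corp acquires beta ltd",
    "acme corp acquires beta ltd today",
    "gamma inc appoints new ceo",
    "gamma inc appoints a new ceo",
    "acme corp acquires beta ltd officially",
]
groups, comparisons = leader_dedup(events)
print(groups, comparisons)
```

On these five events the sketch performs 6 similarity computations, whereas an exhaustive pairwise comparison would need 10. The saving grows as more events collapse into existing groups. The second-stage correction based on spatiotemporal distance and importance change is not covered by this sketch.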

Footnotes
1. The data set and dictionary built in this experiment are available at www.github.com/jiaxu-git/TS-NSNO.

Metadata
Title
A two-stage entity event deduplication method based on graph node selection and node optimization strategy
Authors
Wei Ai
Jia Xu
Hongen Shao
Tao Meng
Keqin Li
Publication date
7 February 2024
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-023-09623-6
