2022 | OriginalPaper | Book chapter

Sentence Classification to Detect Tables for Helping Extraction of Regulatory Interactions in Bacteria

Authors: Dante Sepúlveda, Joel Rodríguez-Herrera, Alfredo Varela-Vega, Axel Zagal Norman, Carlos-Francisco Méndez-Cruz

Published in: Computational Intelligence Methods for Bioinformatics and Biostatistics

Publisher: Springer International Publishing

Abstract

Biomedical knowledge about transcriptional regulation in bacteria is published rapidly in scientific articles, so keeping biological databases up to date by manual curation is nearly impossible. Despite the efforts in biomedical text mining, there are still challenges in extracting regulatory interactions (RIs) between transcription factors and genes from text documents. One of them arises from text extraction from PDF files. We have observed that extracting RIs from text lines that come from tables in the original PDF article produces false positives. Here, we address the problem of automatically separating these text lines from those that are regular sentences by using automatic classification. Our best model was a Support Vector Classifier trained with character n-grams of part-of-speech tags, numbers, symbols, punctuation, brackets, and hyphens. Despite significantly imbalanced data, our classifier achieved a positive-class F1-score of 0.87. Our best classifier will eventually be coupled to a preprocessing pipeline for the automatic generation of transcriptional regulatory networks of bacteria, where it will discard text lines that come from tables in the original PDF.
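The feature idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-character codes, the regular expression, and the function names `encode_line` and `char_ngrams` are assumptions made for illustration, and the part-of-speech tags are replaced here by a single placeholder code `W` because no tagger is used. Each text line is mapped to a string of class codes, and character n-grams are then counted over that string.

```python
import re
from collections import Counter

# Hypothetical one-character class codes (not taken from the chapter):
# N = number, B = bracket, H = hyphen, P = punctuation,
# S = other symbol, W = word (placeholder for a part-of-speech tag).
BRACKETS = set("()[]{}")
PUNCT = set(".,;:!?")
HYPHENS = set("-\u2013\u2014")  # hyphen-minus and the two Unicode dashes
NUMBER = r"\d+(?:\.\d+)?"


def encode_line(line: str) -> str:
    """Map every token of a text line to a one-character class code."""
    tokens = re.findall(NUMBER + r"|\w+|\S", line)
    codes = []
    for tok in tokens:
        if re.fullmatch(NUMBER, tok):
            codes.append("N")
        elif tok in BRACKETS:
            codes.append("B")
        elif tok in HYPHENS:
            codes.append("H")
        elif tok in PUNCT:
            codes.append("P")
        elif tok[0].isalnum() or tok[0] == "_":
            codes.append("W")
        else:
            codes.append("S")
    return "".join(codes)


def char_ngrams(encoded: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count all character n-grams of length n_min..n_max in the code string."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(encoded) - n + 1):
            counts[encoded[i:i + n]] += 1
    return counts


# A regular sentence and a line that looks like a table row (invented examples).
sentence = "OmpR activates ssrA expression."
table_row = "ssrA 2.5 (0.3) - 1.8"

print(encode_line(sentence))   # WWWWP
print(encode_line(table_row))  # WNBNBHN
print(char_ngrams(encode_line(table_row), 2, 2))
```

Under these assumptions, table-like lines tend to produce code strings rich in `N`, `B`, and `H`, while regular sentences are dominated by word codes. The resulting n-gram counts could be passed as features to a classifier such as a Support Vector Classifier.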

Title: Sentence Classification to Detect Tables for Helping Extraction of Regulatory Interactions in Bacteria
Authors: Dante Sepúlveda, Joel Rodríguez-Herrera, Alfredo Varela-Vega, Axel Zagal Norman, Carlos-Francisco Méndez-Cruz
Publisher: Springer International Publishing
Book: Computational Intelligence Methods for Bioinformatics and Biostatistics
Print ISBN: 978-3-031-20836-2
Electronic ISBN: 978-3-031-20837-9
Copyright year: 2022
DOI: https://doi.org/10.1007/978-3-031-20837-9_12
