
2019 | OriginalPaper | Book chapter

Automatic Generation of Dictionaries: The Journalistic Lexicon Case

Written by: Matteo Cristani, Claudio Tomazzoli, Margherita Zorzi

Published in: Advances and Trends in Artificial Intelligence. From Theory to Practice

Publisher: Springer International Publishing


Abstract

Text normalisation is an important task in Natural Language Processing. Through normalisation, free text is mapped into dictionaries, i.e. indexed collections of locutions recognised as typical of a particular jargon. In general, technical dictionaries are difficult to build and validate. They are typically constructed by hand, through everyday human work, and they are agreement-based. This is undoubtedly time-consuming: the approach requires strong human supervision and does not provide a general methodology. In this paper, we take the first steps towards the automatic building of a dictionary for the Italian journalistic lexicon, called NewsDict, based on sub-dictionaries able to characterise the main topics occurring in newspaper articles. We exploit a dataset of annotated documents from some Italian newspapers and a statistical technique based on the Mutual Information Principle. Documents contain information such as the release date and the topic of the article, and have been directly annotated by the author. To check the accuracy of the dictionary we built, we develop an initial test: we normalise a control set of journal articles into NewsDict. By comparing the results presented in this paper against the human annotation, we provide a first measure of the performance of the described methodology.
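The abstract does not spell out how mutual information is applied, so the following is only a minimal sketch of one common reading: scoring ordered word pairs (bigrams) by pointwise mutual information, so that pairs co-occurring more often than chance become dictionary candidates. The function name `bigram_pmi`, the `min_count` threshold and the toy Italian tokens are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score ordered bigrams by pointwise mutual information (PMI).

    Hypothetical helper: the threshold and the probability estimates
    are assumptions made for illustration only.
    """
    unigrams = Counter(tokens)
    # Adjacent pairs, keeping word order: (w1, w2) differs from (w2, w1).
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI = log2( P(x, y) / (P(x) * P(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy example with made-up tokens.
tokens = "calcio di rigore calcio di rigore la partita di ieri".split()
scores = bigram_pmi(tokens)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

A positive score means the pair occurs together more often than independence would predict. Running the same scoring separately on the documents of each annotated topic would be one way to obtain topic-specific candidate lists, again as an assumption about how sub-dictionaries might be derived.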


Footnotes
1
This option can be set by the user. For the journalistic lexicon we chose to respect the order, so as to capture typical expressions and figures of speech of the slang.
 
2
This selection criterion could clearly be relaxed when we extend the dictionary to n-grams (\(n>2\)).
 
Metadata
Title
Automatic Generation of Dictionaries: The Journalistic Lexicon Case
Written by
Matteo Cristani
Claudio Tomazzoli
Margherita Zorzi
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-22999-3_63
