2020 | OriginalPaper | Book chapter

15. Applying Machine Learning to Gain Insights from Stacks of Documents

Abstract

"Document understanding" denotes the deep comprehension of a text. At its core, it is about converting unstructured data into information and, for companies, equally about complying with governance and compliance policies. In most cases a collection of different methods is used, including document classification and entity extraction. Many approaches rely on rule-based systems or on statistical methods.
Using machine learning (ML) to process unstructured documents at scale opens up new ways to, among other things, reveal relationships between documents. ML enables predictions for document classification as well as the extraction of knowledge from text passages, graphics, or fields beyond simple pattern recognition. ML also makes semantic search across documents possible and lays the foundation for advanced analyses such as anomaly detection.
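
As an illustration of the document classification mentioned in the abstract, the following is a minimal sketch of a statistical text classifier (multinomial naive Bayes with add-one smoothing) written in plain Python. It is not the chapter's own method. The toy corpus, the labels "invoice" and "contract", and all class and function names are assumptions made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()                  # all tokens seen in training

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label in sorted(self.doc_counts):
            # Log prior: share of training documents carrying this label.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.word_counts[label].values())
            for token in tokens:
                # Smoothed log likelihood of the token under this label.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus; real document stacks are far larger and noisier.
TRAIN_DOCS = [
    "invoice number total amount due payment",
    "please pay the invoice amount by bank transfer",
    "this contract is made between the parties",
    "the parties agree to the terms of this contract",
]
TRAIN_LABELS = ["invoice", "invoice", "contract", "contract"]

classifier = NaiveBayesClassifier().fit(TRAIN_DOCS, TRAIN_LABELS)
print(classifier.predict("amount due on this invoice"))
print(classifier.predict("the parties signed the contract"))
```

A word-count model like this is closer to the statistical approaches named in the first paragraph than to the ML-based understanding the chapter argues for. It is shown only to make the notion of predicting a document class concrete.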

Footnotes
1
Short name for a training dataset provided by the Message Understanding Conferences.

3
In many cases, the respective system transcribes audio and video data automatically.
 
Metadata
Title
Die Anwendung von Machine Learning zur Gewinnung von Erkenntnissen aus Dokumentenstapeln (Applying Machine Learning to Gain Insights from Stacks of Documents)
Written by
Stefan Ebener
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-658-29550-9_15