2020 | OriginalPaper | Chapter

15. Die Anwendung von Machine Learning zur Gewinnung von Erkenntnissen aus Dokumentenstapeln

Author: Stefan Ebener

Published in: Künstliche Intelligenz in Wirtschaft & Gesellschaft

Publisher: Springer Fachmedien Wiesbaden

Abstract

"Document understanding" refers to the deep understanding of a text. At its core, it is about converting unstructured data into information and, for companies, equally about complying with governance and compliance policies. Typically a collection of different methods is used, including document classification and entity extraction. Many approaches rely on rule-based systems or on statistical methods.
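As an illustration of the rule-based end of this spectrum, the following minimal sketch extracts entities with regular expressions. It is not taken from the chapter: the rule set `RULES`, the entity labels, the function `extract_entities`, and the sample sentence are all hypothetical choices made for this example.

```python
import re

# Hypothetical rule set (an assumption for illustration): one regular
# expression per entity type.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "AMOUNT": re.compile(r"\b\d+(?:[.,]\d{2})?\s?(?:EUR|USD)\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()  # order by position in the text
    return [(label, value) for _, label, value in found]


sample = "Invoice dated 23.12.2019 over 1250,00 EUR, contact billing@example.com."
print(extract_entities(sample))
```

Rules of this kind are transparent and easy to audit, but each new document layout or entity type requires another hand-written pattern, which is one reason statistical and learned approaches are used alongside them.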

The use of machine learning to open up unstructured documents at scale creates new ways to, among other things, make relationships between documents visible. ML enables predictions for document classification as well as the extraction of knowledge from text passages, graphics, or fields beyond simple pattern recognition. ML provides capabilities for semantic search across documents and lays the foundation for advanced analyses such as anomaly detection.
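To make the idea of predicting a document class concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written with the Python standard library only. It is an assumption-laden toy and not the method described in the chapter: the whitespace tokenizer, the function names `train` and `classify`, the class labels, and the four training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic whitespace tokenizer (assumption; real systems do more).
    return text.lower().split()


def train(labeled_docs):
    """Collect per-class word counts and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the given text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class plus smoothed log likelihood of each token.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training_data = [
    ("invoice total amount due payment", "invoice"),
    ("payment reminder invoice number", "invoice"),
    ("contract parties agree terms termination", "contract"),
    ("terms of the contract and liability", "contract"),
]
model = train(training_data)
print(classify("please settle the invoice amount", model))
```

A word-count model like this captures only surface vocabulary; the semantic search and knowledge extraction mentioned above would typically rely on richer representations such as learned embeddings.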

Short name for a training dataset provided by the Message Understanding Conferences.

https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

In many cases, audio and video data are transcribed automatically by the respective system.

https://github.com/cayleygraph/cayley

https://patents.google.com/patent/US20170011116A1/en

Title: Die Anwendung von Machine Learning zur Gewinnung von Erkenntnissen aus Dokumentenstapeln
Author: Stefan Ebener
Publisher: Springer Fachmedien Wiesbaden
Book: Künstliche Intelligenz in Wirtschaft & Gesellschaft
Print ISBN: 978-3-658-29549-3

Electronic ISBN: 978-3-658-29550-9

Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-658-29550-9_15