
2017 | OriginalPaper | Book chapter

Scatteract: Automated Extraction of Data from Scatter Plots

Authors: Mathieu Cliche, David Rosenberg, Dhruv Madeka, Connie Yee

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing


Abstract

Charts are an excellent way to convey patterns and trends in data, but they do not facilitate further modeling of the data or close inspection of individual data points. We present a fully automated system for extracting the numerical values of data points from images of scatter plots. We use deep learning techniques to identify the key components of the chart, and optical character recognition together with robust regression to map from pixels to the coordinate system of the chart. We focus on scatter plots with linear scales, which already have several interesting challenges. Previous work has done fully automatic extraction for other types of charts, but to our knowledge this is the first approach that is fully automatic for scatter plots. Our method performs well, achieving successful data extraction on 89% of the plots in our test set.
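The pixel-to-coordinate step can be illustrated with a minimal sketch. It assumes that tick marks have already been detected as bounding boxes and that OCR has produced one numeric label per tick, with one label misread. RANSAC stands in here as one common robust-regression technique; the abstract does not say which estimator the authors use. All function names, the tolerance, and the sample data are hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]  # (pixel position, chart value read by OCR)


def box_center_x(box: Box) -> float:
    """Horizontal center of a tick-mark bounding box, in pixels."""
    x_min, _, x_max, _ = box
    return (x_min + x_max) / 2.0


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Least-squares fit of value = slope * pixel + intercept."""
    n = len(points)
    mean_p = sum(p for p, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in points)
    cov = sum((p - mean_p) * (v - mean_v) for p, v in points)
    slope = cov / var_p
    return slope, mean_v - slope * mean_p


def ransac_line(points: List[Point], n_trials: int = 200,
                tol: float = 0.5, seed: int = 0) -> Tuple[float, float]:
    """Fit a line while ignoring points that disagree with the consensus."""
    rng = random.Random(seed)
    best_inliers: List[Point] = []
    for _ in range(n_trials):
        (p1, v1), (p2, v2) = rng.sample(points, 2)
        if p1 == p2:
            continue  # a vertical candidate line cannot map pixels to values
        slope = (v2 - v1) / (p2 - p1)
        intercept = v1 - slope * p1
        inliers = [(p, v) for p, v in points
                   if abs(slope * p + intercept - v) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus line found")
    # Refit on all inliers of the best candidate.
    return fit_line(best_inliers)


# Made-up X-axis ticks: boxes centered at 100, 200, ..., 500 pixels.
tick_boxes = [(95, 400, 105, 410), (195, 400, 205, 410), (295, 400, 305, 410),
              (395, 400, 405, 410), (495, 400, 505, 410)]
# Made-up OCR output: the third label "20" was misread as "2".
ocr_values = [0.0, 10.0, 2.0, 30.0, 40.0]

points = [(box_center_x(b), v) for b, v in zip(tick_boxes, ocr_values)]
slope, intercept = ransac_line(points)


def pixel_to_value(pixel: float) -> float:
    """Map a pixel position on the X axis to chart coordinates."""
    return slope * pixel + intercept
```

An ordinary least-squares fit over all five ticks would be pulled toward the misread label, whereas the consensus fit recovers the line defined by the four consistent ticks. The same procedure would be applied independently to the Y axis.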


Footnotes
2
As is traditional, we refer to the horizontal axis as “X” and the vertical axis as “Y”.
 
4
The procedure is largely inspired by this blog post: http://felix.abecassis.me/2011/10/opencv-rotation-deskewing/.
 
5
We use the center of the tick-mark bounding box as the tick mark location.
 
Metadata
Title
Scatteract: Automated Extraction of Data from Scatter Plots
Authors
Mathieu Cliche
David Rosenberg
Dhruv Madeka
Connie Yee
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-71249-9_9