
2016 | OriginalPaper | Chapter

A Scalable Document-Based Architecture for Text Analysis

Authors: Ciprian-Octavian Truică, Jérôme Darmont, Julien Velcin

Published in: Advanced Data Mining and Applications

Publisher: Springer International Publishing


Abstract

Analyzing textual data is challenging because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps, and performance and scaling issues. Existing text analysis architectures only partly solve these issues: they impose restrictive data schemas, address only one aspect of text preprocessing, and focus on a single task when optimizing performance. We therefore propose a new generic text analysis architecture in which document structure is flexible, many preprocessing techniques are integrated, and textual datasets are indexed for efficient access. We implement our conceptual architecture using both a relational and a document-oriented database. Our experiments demonstrate the feasibility of our approach and the superiority of the document-oriented logical and physical implementation.
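
The three ideas named in the abstract (flexible document structure, integrated preprocessing, and indexing for efficient access) can be illustrated with a minimal sketch. This is not the authors' implementation: the field names, the tiny stopword list, and the helper functions `preprocess` and `build_index` are hypothetical and chosen only for illustration, and plain Python dictionaries stand in for a document-oriented store.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the chapter).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        for term in preprocess(doc["text"]):
            index[term].add(doc["id"])
    return index


# Documents share only "id" and "text"; any other field is optional,
# which mimics a flexible, schema-less document structure.
documents = [
    {"id": 1, "text": "Text analysis is challenging", "author": "alice"},
    {"id": 2, "text": "Indexing speeds up text access", "tags": ["db"], "lang": "en"},
]

index = build_index(documents)
```

Looking up a term, for example `index["text"]`, then returns the ids of the matching documents without scanning every document, which is the kind of efficient access the abstract refers to.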

Metadata
Title
A Scalable Document-Based Architecture for Text Analysis
Authors
Ciprian-Octavian Truică
Jérôme Darmont
Julien Velcin
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-49586-6_33
