Published in: Knowledge and Information Systems 2/2017

27.03.2017 | Regular Paper

Dynamic sampling of text streams and its application in text analysis

Authors: Gang Tian, Jiajia Huang, Min Peng, Jiahui Zhu, Yanchun Zhang

Abstract

Large volumes of text are generated rapidly as streaming data on social media. Because such text streams are difficult to process in real time with limited memory, researchers resort to text stream compression and sampling to retain a small, valuable portion of the stream. In this study, we investigate a crucial question: how to store more valuable texts in less memory while maintaining the global information of the stream. First, we propose a lightweight text stream sampling framework based on compressed sensing theory, which reduces space consumption while still retaining the most valuable texts. We then develop a query word-based retrieval task and a topic detection and evolution analysis task on the sampled stream to evaluate how well the framework retains valuable information. The framework is evaluated on two representative social media datasets in terms of compression ratio, runtime, information reserved rate, and efficiency of the text analysis tasks. Experimental results demonstrate that the proposed framework outperforms baseline methods and completes the text analysis tasks with promising results.
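
For readers unfamiliar with compressed sensing, the following is a minimal, generic sketch of the underlying idea: a sparse vector is compressed into a small number of random linear measurements and then recovered greedily with orthogonal matching pursuit. It is not the sampling framework proposed in the paper. The function names (measure, omp), the Gaussian measurement matrix, and the sizes n, m, and k are all assumptions chosen only for illustration.

```python
import numpy as np


def measure(x, m, rng):
    """Take m random linear measurements y = phi @ x (phi is an assumed Gaussian matrix)."""
    n = x.shape[0]
    phi = rng.standard_normal((m, n)) / np.sqrt(m)
    return phi, phi @ x


def omp(phi, y, k):
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y = phi @ x."""
    n = phi.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        scores = np.abs(phi.T @ residual)
        if support:
            scores[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Re-fit all selected coefficients by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(phi[:, support], y, rcond=None)
        residual = y - phi[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat


# Hypothetical sizes: a 200-dimensional vector with 5 nonzeros, 60 measurements.
rng = np.random.default_rng(0)
n, m, k = 200, 60, 5
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)

phi, y = measure(x, m, rng)
x_hat = omp(phi, y, k)

print("stored values:", m, "of", n)
print("recovery error:", np.linalg.norm(x - x_hat))
```

In this toy setting only m values are stored instead of n, which is the sense in which measurements save space. How well the original vector is recovered depends on its sparsity and on the number of measurements.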

Footnotes
3
In the following, to distinguish the sampling framework proposed in this paper from the samples used in CS theory, we refer to the latter as linear measurements, or measurements for short, a term that is also common in CS theory.

4
The dataset was downloaded from http://snap.stanford.edu/data.

Metadata
Title
Dynamic sampling of text streams and its application in text analysis
Authors
Gang Tian
Jiajia Huang
Min Peng
Jiahui Zhu
Yanchun Zhang
Publication date
27.03.2017
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-017-1039-z
