Published in: Cluster Computing 1/2019

21.12.2017

A distributed incremental information acquisition model for large-scale text data

Authors: Shengtao Sun, Jibing Gong, Albert Y. Zomaya, Aizhi Wu

Abstract

Discovering and acquiring information from incremental data on the Internet in a timely manner is a hot topic in the big data era. This paper presents a distributed incremental information acquisition model for large-scale text data. To achieve a lower false positive rate and higher efficiency than the traditional Bloom filter, a distributed multidimensional Bloom filter is proposed for deduplicating large-scale Web URL text data. Three Bloom filter related methods were compared in terms of false positive rate and response efficiency. The results show that the proposed model can achieve a high duplicate removal rate with a lower false positive rate.
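
For readers unfamiliar with the baseline, the following is a minimal sketch of URL deduplication with a plain, single-node Bloom filter. It is not the distributed multidimensional filter proposed in the paper, whose design the abstract does not describe. The class name, the helper functions, the SHA-256 double-hashing scheme, and the parameter values are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Plain single-node Bloom filter (illustrative; not the paper's variant)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive the k bit positions from two base hashes
        # (double hashing) instead of k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k for a classic Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def deduplicate(urls, bloom: BloomFilter):
    """Return the URLs the filter has not seen, recording each as it goes."""
    fresh = []
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    bloom = BloomFilter(num_bits=1 << 16, num_hashes=4)
    batch = ["http://a.example/1", "http://a.example/2", "http://a.example/1"]
    print(deduplicate(batch, bloom))
    print(expected_false_positive_rate(1 << 16, 4, 1000))
```

Because a false positive causes a new URL to be discarded as a duplicate, the false positive rate directly limits how much fresh content an incremental crawler can acquire, which is presumably why the abstract treats it as a primary metric.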

Metadata
Title
A distributed incremental information acquisition model for large-scale text data
Authors
Shengtao Sun
Jibing Gong
Albert Y. Zomaya
Aizhi Wu
Publication date
21.12.2017
Publisher
Springer US
Published in
Cluster Computing / Special Issue 1/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1498-8