Published in: Peer-to-Peer Networking and Applications 2/2020

16 December 2019

Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data

Authors: Yong-Young Kim, Yong-Ki Kim, Dae-Sik Kim, Mi-Hye Kim

Abstract

Web crawlers collect and index the vast amount of data available online, gathering specific types of objective data, such as news, that researchers or practitioners need. As big data are increasingly used in a variety of fields and web data grow exponentially each year, the importance of web crawlers is growing as well. Web servers that handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. A crawler that generates a large amount of traffic to a web server closely resembles a DDoS attack, so its activity tends to be blocked by the web server. A peer-to-peer (P2P) crawler can be used to solve this problem. However, a limitation of the pure P2P crawler is that the entire system is difficult to maintain when network traffic increases or errors occur. To overcome this limitation, we propose a hybrid P2P crawler that collects web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea’s current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, the HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than, though comparable in performance to, the server/client distributed web crawler (SC-DWC). On Portals B and C, however, where the target server blocks crawling, the HP2PNC-AWS outperforms the other methods in terms of both the collection rate and the number of data items collected. The results also confirm that a hybrid P2P networking system can work efficiently in web crawler architectures.
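The abstract does not describe the internal design of HP2PNC-AWS, so the following is only a minimal sketch of the general hybrid P2P idea it refers to: a central coordinator keeps the URL queue and the list of usable peers, while the peers do the fetching. Every name here (`Coordinator`, `Peer`, `block_after`) is hypothetical, the "blocking" is simulated in memory rather than caused by a real server, and no AWS service is involved.

```python
from collections import deque


class Peer:
    """Hypothetical crawling peer. Fetching is simulated; no network I/O."""

    def __init__(self, name, block_after=None):
        self.name = name
        # Assumption for illustration: the target server blocks this peer
        # once it has made more than `block_after` requests.
        self.block_after = block_after
        self.requests = 0

    def fetch(self, url):
        self.requests += 1
        if self.block_after is not None and self.requests > self.block_after:
            raise PermissionError(f"{self.name} was blocked by the target")
        return f"content of {url} via {self.name}"


class Coordinator:
    """Hypothetical central node: owns the URL queue and tracks peers."""

    def __init__(self, urls, peers):
        self.pending = deque(urls)   # URLs not yet collected
        self.peers = list(peers)     # peers still believed usable
        self.results = {}            # url -> fetched content

    def run(self, max_rounds=100):
        rounds = 0
        while self.pending and self.peers and rounds < max_rounds:
            rounds += 1
            # Round-robin: each usable peer gets at most one URL per round.
            for peer in list(self.peers):
                if not self.pending:
                    break
                url = self.pending.popleft()
                try:
                    self.results[url] = peer.fetch(url)
                except PermissionError:
                    # The peer was blocked: retire it and requeue the URL
                    # so that another peer can collect it later.
                    self.peers.remove(peer)
                    self.pending.append(url)
        return self.results


if __name__ == "__main__":
    urls = [f"https://example.com/news/{i}" for i in range(6)]
    peers = [Peer("a", block_after=1), Peer("b"), Peer("c")]
    collected = Coordinator(urls, peers).run()
    print(len(collected), "pages collected")
```

The point of the central coordinator in this sketch is that losing one peer does not lose work: the URL goes back into the queue and the remaining peers pick it up. A pure P2P design would have to reach that decision without a central view of the queue.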

Metadata
Title
Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data
Authors
Yong-Young Kim
Yong-Ki Kim
Dae-Sik Kim
Mi-Hye Kim
Publication date
16 December 2019
Publisher
Springer US
Published in
Peer-to-Peer Networking and Applications / Issue 2/2020
Print ISSN: 1936-6442
Electronic ISSN: 1936-6450
DOI
https://doi.org/10.1007/s12083-019-00841-0
