
2021 | OriginalPaper | Chapter

From Web Scraping to Web Crawling

Authors: Harshit Nigam, Prantik Biswas

Published in: Applications of Artificial Intelligence and Machine Learning

Publisher: Springer Singapore

Abstract

The World Wide Web is the largest database, comprising information in many forms, from text to audio and video. However, most data published on the Web is unstructured and hard to handle, and hence difficult to extract and use in downstream text processing applications such as trend detection, sentiment analysis, and e-commerce market monitoring. Technologies such as Web scraping and Web crawling meet the need to extract large amounts of information from the Web in an automated way. This paper starts with a basic explanation of Web scraping and of the four methodologies, developed over time, on which scraping solutions and tools are built: DOM tree parsing, the semantic–syntactic framework, string matching, and computer vision/machine learning-based methods. The paper also explains Web crawling, an extension of Web scraping, and introduces Scrapy, a Web crawling framework written in Python. It describes the workflow of a Web crawling process initiated by Scrapy and gives a basic understanding of each component involved in a Web crawling project built with Scrapy. Further, the paper presents the implementation of a Web crawler, confSpider, which is dedicated to extracting information about upcoming conferences and summits from the Internet and may be used by educational institutions to promote student awareness of, and participation in, multi-disciplinary conferences.
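As a rough illustration of the first methodology named in the abstract, DOM tree parsing, the sketch below parses a small page into a tree and selects nodes by tag and attribute. It is not the chapter's confSpider and does not use Scrapy. The sample markup, the class names `conf` and `date`, and the function name `extract_conferences` are assumptions made for this example. It also assumes well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it; real Web pages usually need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration only).
SAMPLE_PAGE = """<html><body>
<div class="conf"><h2>AI Summit</h2><span class="date">2021-03-12</span></div>
<div class="conf"><h2>Data Science Conference</h2><span class="date">2021-04-05</span></div>
<div class="ad"><h2>Sponsored</h2></div>
</body></html>"""


def extract_conferences(page):
    """Parse the page into a tree and collect title/date from each 'conf' node."""
    root = ET.fromstring(page)
    records = []
    # Walk the tree: every <div class="conf"> anywhere below the root.
    for node in root.findall(".//div[@class='conf']"):
        title = node.findtext("h2", default="").strip()
        date = node.findtext("span[@class='date']", default="").strip()
        records.append({"title": title, "date": date})
    return records


if __name__ == "__main__":
    for record in extract_conferences(SAMPLE_PAGE):
        print(record["date"], record["title"])
```

Selecting by position in the tree, rather than by matching raw strings, is what lets the `div` with class `ad` be skipped even though it contains the same `h2` tag.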


Footnotes
1
Source code for this project is available on GitHub under the MIT License at https://github.com/NightmareNight-em/Scrapy-for-Web-Crawling.
2
The complete dataset in .csv and .json format, as well as the pipelining through a MySQL database, is available at https://github.com/NightmareNight-em/Scrapy-for-Web-Crawling.
Metadata
Title: From Web Scraping to Web Crawling
Authors: Harshit Nigam, Prantik Biswas
Copyright Year: 2021
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-16-3067-5_9
