2021 | OriginalPaper | Chapter

From Web Scraping to Web Crawling

Authors: Harshit Nigam, Prantik Biswas

Published in: Applications of Artificial Intelligence and Machine Learning

Publisher: Springer Singapore

Abstract

The World Wide Web is the largest database, comprising information in many forms, from text to audio and video. However, most of the data published on the Web is unstructured and hard to handle, and hence difficult to extract and use for downstream text processing applications such as trend detection, sentiment analysis, and e-commerce market monitoring. Technologies like Web scraping and Web crawling address the need to extract large amounts of information from the Web in an automated way. This paper starts with a basic explanation of Web scraping and the four methodologies, developed over time, on which scraping solutions and tools are based: DOM tree parsing, semantic–syntactic framework, string matching, and computer vision/machine learning-based methodology. The paper also explains Web crawling, an extension of Web scraping, and introduces Scrapy, a Web crawling framework written in Python. It describes the workflow behind a Web crawling process initiated by Scrapy and provides a basic understanding of each component involved in a Web crawling project built using Scrapy. Further, the paper dives into the implementation of a Web crawler, namely confSpider, that is dedicated to extracting information related to upcoming conferences and summits from the Internet and may be used by educational institutions to promote student awareness and participation in multi-disciplinary conferences.
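Two of the methodologies named in the abstract, DOM tree parsing and string matching, can be contrasted in a few lines. The following is a minimal sketch using only the Python standard library, not the authors' confSpider: the markup in PAGE, the class names "conf" and "date", and the function names extract_by_tree and extract_by_pattern are all hypothetical and chosen for illustration. It also assumes the page is well-formed enough to be parsed as XML, which real Web pages often are not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a conference listing page.
PAGE = """<html>
  <body>
    <div class="conf">
      <h2>Example AI Summit</h2>
      <span class="date">2021-03-14</span>
    </div>
    <div class="conf">
      <h2>Sample Data Conference</h2>
      <span class="date">2021-06-02</span>
    </div>
  </body>
</html>"""


def extract_by_tree(page):
    """DOM-tree-style extraction: parse the markup into a tree, then query it."""
    root = ET.fromstring(page)
    items = []
    # Select every <div class="conf"> anywhere below the root element.
    for div in root.iterfind(".//div[@class='conf']"):
        items.append({
            # Read the text of the child elements of each matched node.
            "name": div.findtext("h2", default="").strip(),
            "date": div.findtext("span[@class='date']", default="").strip(),
        })
    return items


def extract_by_pattern(page):
    """String-matching extraction: ignore structure and match date-like text."""
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", page)


if __name__ == "__main__":
    for item in extract_by_tree(PAGE):
        print(item["name"], item["date"])
    print(extract_by_pattern(PAGE))
```

The tree-based function keeps each name paired with its date because it follows the document structure, while the pattern-based function returns only the matching strings. In a Scrapy project the same selection would typically be written with the response.css or response.xpath selectors inside a spider's parse method.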

Source code for this project is available on GitHub under the MIT License at https://github.com/NightmareNight-em/Scrapy-for-Web-Crawling.

The complete dataset in .csv and .json format, as well as the pipeline to a MySQL database, is available at https://github.com/NightmareNight-em/Scrapy-for-Web-Crawling.

Title: From Web Scraping to Web Crawling
Authors: Harshit Nigam, Prantik Biswas
Publisher: Springer Singapore
Book: Applications of Artificial Intelligence and Machine Learning
Print ISBN: 978-981-16-3066-8

Electronic ISBN: 978-981-16-3067-5

Copyright Year: 2021
DOI: https://doi.org/10.1007/978-981-16-3067-5_9
