A Model-Based Approach for Crawling Rich Internet Applications

Authors:
Mustafa Emre Dincturk, University of Ottawa, Canada
Guy-Vincent Jourdan, University of Ottawa, Canada
Gregor V. Bochmann, University of Ottawa, Canada
Iosif Viorel Onut, IBM, Canada

ACM Transactions on the Web, Volume 8, Issue 3, Article No. 19, pp. 1–39
https://doi.org/10.1145/2626371

Published: 08 July 2014

Abstract

New Web technologies, such as AJAX, result in more responsive and interactive Web applications, sometimes called Rich Internet Applications (RIAs). Crawling techniques developed for traditional Web applications are not sufficient for crawling RIAs. This inability to crawl RIAs must be addressed, at the very least to make RIAs searchable and testable. We present a new methodology, called “model-based crawling”, that can be used as a basis for designing efficient crawling strategies for RIAs. We illustrate model-based crawling with a sample strategy, called the “hypercube strategy”. We compare the performance of our model-based crawling strategies against existing standard crawling strategies, including breadth-first, depth-first, and a greedy strategy. Experimental results show that our model-based crawling approach is significantly more efficient than these standard strategies.
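
The abstract names a greedy strategy as one of the baselines but does not define it. The following is a minimal sketch, assuming "greedy" means: execute an unexplored event in the current state if one exists, otherwise walk to the nearest known state that still has one. The toy application `TOY_APP`, the function `greedy_crawl`, and the cost model (one unit per event execution or reset) are hypothetical illustrations, not the authors' implementation.

```python
from collections import deque

# Hypothetical toy application: state -> {event: next_state}.
# The crawler can only learn a transition by executing its event.
TOY_APP = {
    "s0": {"a": "s1", "b": "s2"},
    "s1": {"a": "s3", "b": "s0"},
    "s2": {"a": "s3"},
    "s3": {},
}


def greedy_crawl(app, start):
    """Discover every transition of `app`, always exploring the nearest
    unexplored event. Returns (discovered model, cost), where cost counts
    event executions plus resets (an assumed cost model)."""
    known = {start: {}}  # transitions discovered so far
    current = start
    cost = 0

    def pending(state):
        # Events of `state` that have not been executed yet.
        return sorted(e for e in app[state] if e not in known[state])

    def path_to_pending(src):
        # Breadth-first search over known transitions for the nearest
        # state that still has an unexplored event.
        queue = deque([(src, [])])
        seen = {src}
        while queue:
            state, path = queue.popleft()
            if pending(state):
                return path
            for event, nxt in sorted(known[state].items()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [event]))
        return None

    while True:
        path = path_to_pending(current)
        if path is None:
            if not any(pending(s) for s in known):
                break  # every discovered state is fully explored
            # Unexplored events exist but are unreachable from here:
            # reset to the initial state (assumed to cost one unit).
            current = start
            cost += 1
            continue
        for event in path:  # replay known transitions to get there
            current = known[current][event]
            cost += 1
        event = pending(current)[0]  # explore one new event
        nxt = app[current][event]
        known[current][event] = nxt
        known.setdefault(nxt, {})
        current = nxt
        cost += 1

    return known, cost


model, cost = greedy_crawl(TOY_APP, "s0")
print(len(model), "states discovered at cost", cost)
```

Under this reading, the greedy baseline makes no assumptions about the application's structure. A model-based strategy such as the hypercube strategy would instead use an anticipated model to decide which event to explore next.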

Published in
ACM Transactions on the Web, Volume 8, Issue 3
June 2014
256 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2639948
Editor: Marc Najork (Google)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 July 2014
- Accepted: 1 December 2013
- Revised: 1 August 2013
- Received: 1 May 2012
Author Tags
AJAX
Crawling
DOM
dynamic analysis
modeling
rich Internet applications
Qualifiers
- research-article
- Research
- Refereed