Crawling Rich Internet Applications (RIAs) is important to ensure their security, accessibility and to index them for searching. To crawl a RIA, the crawler has to reach every application state and execute every application event. On a large RIA, this operation takes a long time. Previously published
proposes a distributed architecture to parallelize the task of crawling RIAs, and run the crawl over multiple computers to reduce time. In GDist-RIA Crawler, a centralized unit calculates the next task to execute, and tasks are dispatched to worker nodes for execution. This architecture is not scalable due to the centralized unit which is bound to become a bottleneck as the number of nodes increases. This paper extends GDist-RIA Crawler and proposes a fully peer-to-peer and scalable architecture to crawl RIAs, called
. PDist-RIA doesn’t have the same limitations in terms scalability while matching the performance of GDist-RIA. We describe a prototype showing the scalability and performance of the proposed solution.