RetriBlog: An architecture-centered framework for developing blog crawlers
Highlights
► We develop a blog crawler using software engineering techniques. ► We create applications related to the social web. ► We propose a quantitative and qualitative evaluation.
Introduction
Web 2.0 is changing the way users interact with information (O’Reilly, 2005). It allows them to become an active part of the Web through new tools that enable content creation and management in a collaborative manner. Among these tools, blogs deserve particular emphasis.
Blogs are proliferating on the Web, largely due to their simplicity and popularity. The simplicity of blog use is illustrated, e.g., by Rosenbloom (2004), who points out that “a blogger needs only a computer, Internet access, and an opinion”. Regarding the popularity of blogs, Blood (2004) declared that “Weblogs have become so ubiquitous that for many of us the term is synonymous with ‘personal Web site’ ”. Furthermore, according to a Technorati report,3 activity on the blogosphere doubles every two hundred days. These facts highlight blogs as a very interesting tool that may be used in several areas, e.g., e-commerce and e-learning.
In addition, services like Blog Pulse4 and Blog Scope5 provide interesting statistics. The first has more than 160 million blogs indexed, and more than 60 thousand are created every day. The second indexes approximately 53 million blogs and over a trillion posts.
Due to the large volume of information generated on blogs, it is unfeasible to extract information manually. Thus, efficient computational approaches are necessary to extract information from blogs and make profitable use of their relevant content. This information may include user preferences, the issues being addressed in the blogosphere, or texts of interest to certain domains (Xiong & Wang, 2008). It may be used, for example, to determine whether a product is well accepted in the blogosphere (Yamada, Yoshikawa, Terano, Moon, & Kikuta, 2010) or to assist in learning environments such as educational blogs (Qiao et al., 2009; Yang, 2008). Hence, some research has been carried out to automate the information extraction process from blogs (Fujimura et al., 2006; Joshi, 2006).
For the document acquisition task, the field of information retrieval (IR) stands out (Manning, Raghavan, & Schütze, 2008), since it is mainly concerned with identifying relevant texts for a particular purpose within a huge text collection. In order to enable users to perform their searches, the text and information related to each blog need to be properly indexed and stored; blog crawlers are responsible for performing this task (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001).
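The indexing step mentioned above can be sketched with a minimal inverted index: the core structure a crawler feeds so that searches over stored posts become term lookups. This is an illustrative sketch under assumed names, not RetriBlog's actual indexing layer; a production crawler would typically rely on a library such as Lucene.

```java
import java.util.*;

/** Minimal inverted index: maps each term to the sorted set of post ids
 *  containing it. Class and method names are hypothetical illustrations. */
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Tokenizes a post (lowercased, split on non-word chars) and records
     *  a term -> docId posting for every term it contains. */
    public void index(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+"))
            if (!term.isEmpty())
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    /** Returns ids of posts containing every query term (AND semantics). */
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String t : terms) {
            Set<Integer> p = postings.getOrDefault(t.toLowerCase(), Set.of());
            if (result == null) result = new TreeSet<>(p);
            else result.retainAll(p);
        }
        return result == null ? Set.of() : result;
    }
}
```

Because each query term is a single map lookup followed by set intersections, search cost is independent of the total text volume crawled, which is what makes crawling-then-indexing preferable to scanning blogs on demand.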
However, to build a blog crawler, many aspects related to blogs themselves must be considered, such as the language they use, the blog indexing service, preprocessing tasks, and indexing techniques. The variability and size of the blogosphere pose two main problems to be dealt with: (i) the difficulty of conceiving a general approach able to handle such variability; and (ii) the need to assist the developer with tools for extracting the main content of a blog and for preprocessing the blog text. The approach used in other works to address these problems is the creation of a framework (Johnson & Foote, 1988). Two works deal with these problems (Chau et al., 2009; Ferreira et al., 2010). However, these frameworks do not follow some software engineering principles, such as explicit traceability between software architecture and implementation, which facilitates, for example, the evolution of the framework.
An architecture-centered approach is a solution to this problem. Developing a system based on its architecture (Garlan & Shaw, 1994) has advantages such as: (i) control of complexity; (ii) modularity; and (iii) easier development and instantiation of the system. This approach fills the gap left by the systems mentioned above.
Besides this problem, some basic tasks, such as extracting the main content of blogs and recommending tags for posts, also pose challenges to obtaining interesting information from blogs. One solution to the first problem is to create algorithms that detect the main content of an HTML page without any prior information. The second deals mainly with the problem of tag alignment: the same blog can be labeled with different tags due to differences in the culture, knowledge, and experiences of users (Subramanya & Liu, 2008).
On the one hand, to handle the first problem, some algorithms for blog content extraction have been implemented: Boilerplate (Kohlschütter, Fankhauser, & Nejdl, 2010), Document Slope Curve (DSC) (Pinto et al., 2002), Text-to-Tag Ratio (TTR) (Weninger & Hsu, 2008), and the Link Quota Filter (LQF) heuristic (Gottron, 2007), which are among the most efficient algorithms for content extraction. Furthermore, these algorithms have properties that are very desirable in the context of this research: independence from page structure, which also allows applying them to other types of HTML pages (e.g., news) regardless of language, and fast execution time.
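To make this family of heuristics concrete, the sketch below implements a minimal line-based Text-to-Tag Ratio: a line's score is its count of text characters divided by its count of tags, and only high-ratio lines are kept as main content. This is a simplified illustration of TTR's idea (the published algorithm also smooths ratios across neighboring lines); the class name and threshold are assumptions.

```java
import java.util.*;
import java.util.regex.*;

/** Simplified Text-to-Tag Ratio (TTR) heuristic, after Weninger & Hsu (2008):
 *  content-rich lines have many text characters per HTML tag. */
public class TextToTagRatio {

    /** Ratio of non-tag text characters to tag count for one HTML line. */
    public static double ratio(String line) {
        Matcher m = Pattern.compile("<[^>]*>").matcher(line);
        int tags = 0;
        while (m.find()) tags++;
        int textChars = line.replaceAll("<[^>]*>", "").trim().length();
        // A line with no tags scores its plain text length.
        return tags == 0 ? textChars : (double) textChars / tags;
    }

    /** Keeps (tag-stripped) lines whose ratio exceeds the threshold. */
    public static List<String> extract(String html, double threshold) {
        List<String> content = new ArrayList<>();
        for (String line : html.split("\n"))
            if (ratio(line) > threshold)
                content.add(line.replaceAll("<[^>]*>", "").trim());
        return content;
    }
}
```

A navigation bar packed with links scores low, while a paragraph of post text scores high. Since no template information is used, the same code applies unchanged to any blog platform, any language, or other HTML pages such as news, which is the structure-independence property noted above.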
On the other hand, to recommend tags, this paper proposes four services, classified into tag-based and post-based approaches. These services deal with the problem in different ways, allowing users to choose the appropriate type of service according to their needs. For example, if the user has a corpus with little noise, a post-based service may achieve better results; if a fast implementation is needed, tag-based services are more appropriate.
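As an illustration of what a tag-based service can look like, the sketch below recommends tags by co-occurrence: tags that frequently appear together with a post's existing tags in the corpus are suggested first. This is a hypothetical example, not one of the four proposed services; class and method names are assumptions.

```java
import java.util.*;
import java.util.stream.*;

/** Illustrative tag-based recommender: suggests tags ranked by how often
 *  they co-occur with the post's current tags in a corpus of tagged posts. */
public class TagCooccurrenceRecommender {

    private final Map<String, Map<String, Integer>> cooc = new HashMap<>();

    /** Updates pairwise co-occurrence counts from one already-tagged post. */
    public void train(Set<String> postTags) {
        for (String a : postTags)
            for (String b : postTags)
                if (!a.equals(b))
                    cooc.computeIfAbsent(a, k -> new HashMap<>())
                        .merge(b, 1, Integer::sum);
    }

    /** Recommends up to k new tags by total co-occurrence with the given tags. */
    public List<String> recommend(Set<String> postTags, int k) {
        Map<String, Integer> scores = new HashMap<>();
        for (String t : postTags)
            cooc.getOrDefault(t, Map.of()).forEach((other, n) -> {
                if (!postTags.contains(other)) scores.merge(other, n, Integer::sum);
            });
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Training is a single pass over the corpus's tag sets with no text processing at all, which reflects why tag-based services are the faster option, at the cost of inheriting whatever tag-alignment noise the corpus contains.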
Therefore, this work proposes an architecture-centered framework for developing blog crawlers. The framework aims to provide services to build applications on the blogosphere, dealing both with general aspects, such as easing the creation of a blog crawler, and with specific problems, such as tag recommendation. The framework undergoes both a quantitative and a qualitative evaluation: the quantitative evaluation focuses on tests of the main proposed algorithms, while the qualitative evaluation shows the advantages of using architecture-centered development. Finally, three case studies are described.
The rest of the paper is organized as follows. Section 2 describes aspects related to the blogosphere, social media, information retrieval, and software architecture. Section 3 introduces the proposed framework, its architecture, and its implementation aspects. Three case studies that validate the proposed approach are described in Section 4. Section 5 presents the experiments performed on the main proposed algorithms. Section 6 contextualizes the proposed framework against related work, making the contribution of the proposed approach more explicit. Finally, conclusions and a discussion of possible future work are presented in Section 7.
Background
This section describes the background necessary to understand the proposed framework. The following subsections describe the blogosphere and social media, information retrieval, and some software engineering aspects.
RetriBlog: an architecture-centered framework for creating blog crawlers
The RetriBlog architecture consists of a gray-box framework (Fayad et al., 1999) that allows the fast development of blog crawlers. In other words, it provides services that developers can easily modify to create new crawlers; they instantiate the framework using only high-level interfaces. It was developed following an architecture-centered process and implemented in Java according to the COSMOS∗ component implementation model (Aguilar Gayard et al., 2008). In addition, the framework provides
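The gray-box idea of instantiating a crawler through small, high-level interfaces can be sketched as follows. All names here (ContentExtractor, TagRecommender, BlogCrawler) are hypothetical, chosen for illustration; they are not RetriBlog's actual component contracts.

```java
import java.util.*;

// Hypothetical service interfaces a gray-box framework would expose.
interface ContentExtractor { String extract(String html); }
interface TagRecommender { List<String> recommend(String post); }

/** A new crawler is assembled by plugging implementations into the framework
 *  class; the processing pipeline itself stays fixed. */
class BlogCrawler {
    private final ContentExtractor extractor;
    private final TagRecommender recommender;

    BlogCrawler(ContentExtractor e, TagRecommender r) {
        extractor = e;
        recommender = r;
    }

    /** Processes one fetched page: extract the main content, then recommend tags. */
    Map.Entry<String, List<String>> process(String html) {
        String content = extractor.extract(html);
        return Map.entry(content, recommender.recommend(content));
    }
}
```

For example, `new BlogCrawler(h -> h.replaceAll("<[^>]*>", ""), p -> List.of("blog"))` instantiates a trivial crawler; swapping in a TTR-based extractor or a different recommendation service requires no change to the pipeline class, which is the flexibility the gray-box design is meant to provide.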
Case study
This section presents three case studies created using RetriBlog. The case studies emphasize the advantages of the component-based and architecture-centered approaches, serving mainly to show the flexibility and reusability of the framework.
Preliminary experiments
This section describes the experimental setup of our research work and presents the preliminary results. Experiments were performed on three services: (i) content extraction, yielding a comparison between the algorithms, whose results were as expected; (ii) classification, showing that using content extraction and preprocessing improves classification; and (iii) tag recommendation, yielding a comparison between the algorithms and some
Related work
This section describes work related to RetriBlog and performs a comparison among the systems.
The main systems that provide services to create applications with blogs can be divided into two blocks. The first is composed of systems such as Blogscope (Agarwal et al., 2009; Bansal and Koudas, 2007; Fujimura et al., 2006), which provide services to obtain information from blogs but were not designed with variability in mind.
The second block consists of frameworks, like RetriBlog itself. In this block there are the systems
Conclusions and future work
This work presented RetriBlog, a framework for creating blog crawlers using an architecture-centered approach and following the COSMOS∗ component implementation model.
The main contributions of this work were: (i) construction of mechanisms for information retrieval in the blogosphere (blog crawlers); (ii) implementation of algorithms for content extraction in HTML pages; (iii) creation of tag recommendation services for blogs; and (iv) evaluation of the influence of the use of content extraction and preprocessing
References (62)
- et al. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management.
- et al. (2006). An effective refinement strategy for KNN text classifier. Expert Systems with Applications.
- Nitin Agarwal, Shamanth Kumar, Huan Liu, & Mark Woodward (2009). Blogtrackers: A tool for sociologists to track and...
- Leonel Aguilar Gayard, Cecília Mary Fischer Rubira, & Paulo Astério de Castro Guerra (2008). COSMOS∗: A component...
- Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan (2001). Searching the Web. ACM Transactions on Internet Technology.
- et al. (1999). Modern information retrieval.
- Nilesh Bansal, & Nick Koudas (2007). Searching the...
- Li Baoli, Yu Shiwen, & Lu Qin (2003). An improved k-nearest neighbor algorithm for text categorization. In: Proceedings...
- et al. (2003). Software architecture in practice.
- et al. (2007). Developing multi-agent systems with JADE.
- A computational model for developing semantic web-based educational systems. Knowledge-Based Systems.
- Blood (2004). How blogging software reshapes the online community. Communications of the ACM.
- No silver bullet: Essence and accidents of software engineering. Computer.
- A blog mining framework. IT Professional.
- Guest editors’ introduction: Social media and search. IEEE Internet Computing.
- Staged configuration through specialization and multilevel configuration of feature models. Software Process: Improvement and Practice.
- Objects, components, and frameworks with UML: The catalysis approach.
- Building application frameworks: Object-oriented foundations of framework design.
- Information retrieval: Data structures and algorithms.
- Usage patterns of collaborative tagging systems. Journal of Information Science.
- Designing software product lines with UML: From use cases to pattern-based software architectures.
- Gottron (2007). Evaluating content extraction on HTML documents. ITA.
- Lucene in action (In Action series).
- A brief survey of text mining. LDV Forum – GLDV Journal for Computational Linguistics and Language Technology.
- Social streams blog crawler.
- Survey of improving Naive Bayes for classification.
- Components, frameworks, patterns. Communications of the ACM.