RetriBlog: An architecture-centered framework for developing blog crawlers

https://doi.org/10.1016/j.eswa.2012.08.020

Abstract

Blogs have become an important social tool. They allow users to share their tastes, express their opinions, report news, and form groups around shared interests, among other activities. The information obtained from the blogosphere may be used to create applications in various fields. However, due to the growing number of blog posts published every day, as well as the dynamicity of the blogosphere, the task of extracting relevant information from blogs has become difficult and time consuming. In this paper, we use information retrieval and extraction techniques to deal with this problem. Furthermore, because blogs have many variation points, it is necessary to provide applications that can be easily adapted. Faced with this scenario, this work proposes RetriBlog, an architecture-centered framework for the development of blog crawlers. Finally, it presents an evaluation of the proposed algorithms and three case studies.

Highlights

► We developed a blog crawler using software engineering techniques.
► We created applications related to the social web.
► We proposed a quantitative and qualitative evaluation.

Introduction

Web 2.0 is changing the way users interact with information (O’Reilly, 2005). It allows them to become an active part of the Web through new tools that enable content creation and management in a collaborative manner. Among these tools, blogs deserve particular emphasis.

Blogs are proliferating on the Web, particularly due to their simplicity and popularity. The simplicity of blog use is illustrated, e.g., by Rosenbloom (2004), who points out that “a blogger needs only a computer, Internet access, and an opinion”. Regarding the popularity of blogs, Blood (2004) declared that “Weblogs have become so ubiquitous that for many of us the term is synonymous with ‘personal Web site’ ”. Furthermore, according to a Technorati report, the activity on the blogosphere doubles every two hundred days. This fact also highlights blogs as a very interesting tool which may be used in several areas, e.g., e-commerce and e-learning.

In addition, services such as Blog Pulse and Blog Scope report interesting statistics. The former has more than 160 million blogs indexed, with more than 60 thousand created every day; the latter indexes approximately 53 million blogs and over a trillion posts.

Due to the large volume of information generated on blogs, it is unfeasible to extract information manually. Thus, efficient computational approaches are necessary to extract information from blogs and make profitable use of their relevant content. This information may include user preferences, the issues currently being addressed in the blogosphere, and texts of interest to certain domains (Xiong & Wang, 2008). It may be used, for example, to determine whether a product is well accepted in the blogosphere (Yamada, Yoshikawa, Terano, Moon, & Kikuta, 2010) or to assist in learning environments (educational blogs) (Qiao et al., 2009, Yang, 2008). Hence, some research has been carried out to automate the information extraction process from blogs (Fujimura et al., 2006, Joshi, 2006).

For the document acquisition task, the field of information retrieval (IR) stands out (Manning, Raghavan, & Schütze, 2008), since it is mainly concerned with identifying relevant texts for a particular purpose within a huge text collection. In order to enable users to perform their searches, the text and information related to each blog need to be properly indexed and stored. Blog crawlers are responsible for performing this task (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001).
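The indexing idea above can be illustrated with a minimal inverted index. This is not RetriBlog's implementation (the framework is written in Java and, per the references, builds on Lucene-style indexing); it is only a toy Python sketch of the data structure a crawler populates, with hypothetical function names and corpus:

```python
from collections import defaultdict

def build_inverted_index(posts):
    """Map each term to the set of post ids whose text contains it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for term in text.lower().split():
            index[term].add(post_id)
    return index

def search(index, query):
    """Boolean AND retrieval: ids of posts containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical toy corpus of blog posts keyed by id.
posts = {
    1: "blog crawlers index the blogosphere",
    2: "tag recommendation for blog posts",
    3: "crawlers fetch posts from the blogosphere",
}
index = build_inverted_index(posts)
print(search(index, "blogosphere crawlers"))  # → {1, 3}
```

A production index would add tokenization, term weighting (e.g., TF-IDF), and persistent storage; the point here is only the term-to-documents mapping that makes search over crawled posts fast.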

However, to build a blog crawler, we should consider many aspects related to the blogs themselves, such as the language they use, the blog indexing service, preprocessing tasks, and indexing techniques. The variability and size of the blogosphere pose two main problems to be dealt with: (i) the difficulty of conceiving a general approach to handle such variability; and (ii) the need to assist the developer with tools for extracting the main content of the blog and for preprocessing the blog text. The approach used in other works to address these problems is the creation of a framework (Johnson & Foote, 1988). Two works deal with these problems (Chau et al., 2009, Ferreira et al., 2010). However, these frameworks do not follow some principles of software engineering, such as explicit traceability between software architecture and implementation, which facilitates, for example, the evolution of the framework.

An architecture-centered approach is a solution to this problem. Developing a system based on its architecture (Garlan & Shaw, 1994) has advantages such as: (i) control of complexity; (ii) modularity; and (iii) easier development and instantiation of the system. This approach fills the gap left by the systems mentioned above.

Besides this problem, basic problems such as extracting the main content of blogs and recommending tags for posts are also challenges to obtaining interesting information from blogs. One solution to the first problem is to create algorithms that can detect the main content of an HTML page without any prior information. The second deals mainly with the problem of tag alignment: the same blog can be labeled with different tags due to differences in the culture, knowledge, and experience of users (Subramanya & Liu, 2008).

On the one hand, to handle the first problem, some algorithms have been implemented to deal with blog content extraction: Boilerplate (Kohlschütter, Fankhauser, & Nejdl, 2010), Document Slope Curve (DSC) (Pinto et al., 2002), Text-to-Tag Ratio (TTR) (Weninger & Hsu, 2008), and the Link Quota Filter (LQF) heuristic (Gottron, 2007), which are among the most efficient algorithms for content extraction. Furthermore, these algorithms have properties that are very desirable in this research context: they are independent of page structure (which also allows applying them to other types of HTML pages, e.g., news), independent of language, and fast to execute.
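To make the Text-to-Tag Ratio idea concrete, here is a deliberately simplified Python sketch: per line of HTML, compute the length of the plain text divided by the number of tags, then keep the lines above the document mean. The published TTR algorithm additionally smooths the ratio histogram and clusters it, so this is an illustration of the principle, not the evaluated implementation; the function names and example page are ours:

```python
import re

TAG = re.compile(r"<[^>]+>")

def text_to_tag_ratio(html):
    """Per line: length of the plain text divided by the number of tags."""
    ratios = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        text = TAG.sub("", line).strip()
        ratios.append(len(text) / max(tags, 1))
    return ratios

def extract_content(html):
    """Keep lines whose ratio exceeds the document mean (simplified rule)."""
    lines = html.splitlines()
    ratios = text_to_tag_ratio(html)
    mean = sum(ratios) / len(ratios) if ratios else 0.0
    kept = [TAG.sub("", l).strip() for l, r in zip(lines, ratios) if r > mean]
    return "\n".join(kept)

# Navigation lines are tag-dense (low ratio); the post body is text-dense.
html = """<div><a href="#">Home</a><a href="#">About</a></div>
<p>This long paragraph is the actual post content of the blog entry.</p>
<div><a href="#">Tags</a><a href="#">Archive</a></div>"""
print(extract_content(html))  # keeps only the paragraph text
```

Note how no prior knowledge of the page template is required, which is exactly the structure independence the paragraph above highlights.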

On the other hand, to recommend tags, this paper proposes four services. These services are classified into tag-based and post-based approaches. They deal with the problem differently, allowing users to choose the type of service appropriate to their needs. For example, if the user has a corpus with little noise, a post-based service may achieve better results; if a fast implementation is needed, tag-based services are more appropriate.

Therefore, this work proposes an architecture-centered framework for developing blog crawlers. This framework aims to provide services to build applications on the blogosphere. The work deals both with general aspects, such as easing the creation of blog crawlers, and with specific problems, such as tag recommendation. The framework was evaluated both quantitatively and qualitatively: the quantitative evaluation focuses on tests of the main proposed algorithms, while the qualitative evaluation shows the advantages of architecture-centered development. Finally, three case studies are described.

The rest of the paper is organized as follows. Section 2 describes aspects related to the blogosphere, social media, information retrieval, and software architecture. Section 3 introduces the proposed framework, its architecture, and its implementation aspects. Three case studies that validate the proposed approach are described in Section 4. Section 5 presents the experiments performed on the main proposed algorithms. Section 6 contextualizes the proposed framework against some related work, making the contribution of the proposed approach more explicit. Finally, some conclusions and a discussion of possible future work are presented in Section 7.

Section snippets

Background

This section describes the necessary background to understand the proposed framework. The following subsections describe the blogosphere and social media, information retrieval, and some software engineering aspects.

RetriBlog: an architecture-centered framework for creating blog crawlers

The RetriBlog architecture consists of a gray-box framework (Fayad et al., 1999) that allows the fast development of blog crawlers. In other words, it provides services that developers can easily modify to create new crawlers; they instantiate the framework using only high-level interfaces. It was developed following an architecture-centered process and implemented in Java according to the COSMOS component implementation model (Aguilar Gayard et al., 2008). In addition, the framework provides
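RetriBlog itself is implemented in Java with the COSMOS component model, and this excerpt does not show its interfaces. Purely to illustrate the gray-box principle described above (developers extend the framework through high-level hook interfaces rather than touching its internals), here is a minimal, language-agnostic sketch in Python with hypothetical names:

```python
from abc import ABC, abstractmethod
import re

class ContentExtractor(ABC):
    """Hook interface: each framework instantiation plugs in its own extractor."""
    @abstractmethod
    def extract(self, html: str) -> str: ...

class StripTagsExtractor(ContentExtractor):
    """A trivial default service shipped with the sketch (illustrative only)."""
    def extract(self, html: str) -> str:
        return " ".join(re.sub(r"<[^>]+>", " ", html).split())

class BlogCrawler:
    """The crawler depends only on the hook interface, never on a concrete service."""
    def __init__(self, extractor: ContentExtractor):
        self.extractor = extractor

    def process(self, html: str) -> str:
        return self.extractor.extract(html)

# Instantiating the "framework": swap in any ContentExtractor implementation.
crawler = BlogCrawler(StripTagsExtractor())
print(crawler.process("<p>hello blogosphere</p>"))  # → hello blogosphere
```

Swapping `StripTagsExtractor` for, say, a TTR-based extractor requires no change to `BlogCrawler`, which is the traceability and modularity benefit the architecture-centered approach claims.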

Case study

This section presents three case studies created using RetriBlog. The case studies emphasize the advantages of the component-based and architecture-centered approaches. They serve mainly to show the flexibility and reusability of the framework.

Preliminary experiments

This section describes the experimental setup of our research work and presents the preliminary results. Experiments were performed on three services: (i) content extraction, where the comparison between the algorithms yielded the expected results; (ii) classification, where it was shown that using content extraction and preprocessing improves classification; and (iii) tag recommendation, where we compare the algorithms and some
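The preprocessing step whose effect on classification is measured above typically means normalization before feature extraction. The exact pipeline is not given in this excerpt; a minimal sketch, assuming a standard lowercase/tokenize/stopword-removal pipeline with a toy stopword list of our own:

```python
import re

# Toy stopword list for illustration; real pipelines use much larger lists.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords before classification."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The Crawlers of the Blogosphere, indexed in Java!"))
# → ['crawlers', 'blogosphere', 'indexed', 'java']
```

Removing template noise (via content extraction) and function words (via preprocessing) shrinks the feature space, which is the plausible mechanism behind the classification improvement reported.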

Related work

This section describes work related to RetriBlog and performs a comparison among the approaches.

The main systems that provide services to create applications with blogs can be divided into two blocks. The first is composed of systems such as Blogscope (Agarwal et al., 2009, Bansal and Koudas, 2007, Fujimura et al., 2006); they provide services to obtain information from blogs, but they are not designed with variability in mind.

The second block consists of frameworks, like RetriBlog itself. In this block there are the systems

Conclusions and future work

This work presented RetriBlog, a framework to create blog crawlers using an architecture-centered approach and following the COSMOS component implementation model.

The main contributions of this work were: (i) construction of mechanisms for information retrieval in the blogosphere (blog crawlers); (ii) implementation of algorithms for content extraction from HTML pages; (iii) creation of tag recommendation services for blogs; (iv) evaluation of the influence of using content extraction and preprocessing

References (62)

  • Salton, G., et al. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management.
  • Tan, S. (2006). An effective refinement strategy for kNN text classifier. Expert Systems with Applications.
  • Agarwal, N., Kumar, S., Liu, H., & Woodward, M. (2009). Blogtrackers: A tool for sociologists to track and...
  • Aguilar Gayard, L., Rubira, C. M. F., & de Castro Guerra, P. A. (2008). COSMOS∗: A component...
  • Arasu, A., et al. (2001). Searching the web. ACM Transactions on Internet Technology.
  • Baeza-Yates, R., et al. (1999). Modern information retrieval.
  • Bansal, N., & Koudas, N. (2007). Searching the...
  • Li, B., Yu, S., & Lu, Q. (2003). An improved k-nearest neighbor algorithm for text categorization. In: Proceedings...
  • Bass, L., et al. (2003). Software architecture in practice.
  • Bellifemine, F. L., et al. (2007). Developing multi-agent systems with JADE.
  • Bittencourt, I. I., et al. (2006). A computational model for developing semantic web-based educational systems. Knowledge-Based Systems.
  • Blood, R. (2004). How blogging software reshapes the online community. Communications of the ACM.
  • Brooks, F. P. (1987). No silver bullet: Essence and accidents of software engineering. Computer.
  • Chau, M., et al. (2009). A blog mining framework. IT Professional.
  • Cho, J., et al. (2007). Guest editors’ introduction: Social media and search. IEEE Internet Computing.
  • Clements, P., & Northrop, L. (2001). Software product lines: Practices and patterns (pp....
  • Czarnecki, K., et al. (2005). Staged configuration through specialization and multilevel configuration of feature models. Software Process: Improvement and Practice.
  • D’Souza, D. F., et al. (1999). Objects, components, and frameworks with UML: The Catalysis approach.
  • Fayad, M. E., et al. (1999). Building application frameworks: Object-oriented foundations of framework design.
  • Ferreira, R., Lima, R. J., Bittencourt, I. I., Melo Filho, D., Holanda, O., Costa, E., Freitas, F., &...
  • Frakes, W. B., et al. (1992). Information retrieval: Data structures and algorithms.
  • Fujimura, K., Toda, H., Inoue, T., Hiroshima, N., Kataoka, R., & Sugizaki, M. (2006). Blogranger –...
  • Garlan, D., & Shaw, M. (1994). An introduction to software architecture. Technical report, Pittsburgh, PA,...
  • Golder, S. A., et al. (2006). Usage patterns of collaborative tagging systems. Journal of Information Science.
  • Gomaa, H. (2004). Designing software product lines with UML: From use cases to pattern-based software architectures.
  • Gottron, T. (2007). Evaluating content extraction on HTML documents. ITA.
  • Hatcher, E., et al. (2004). Lucene in action (In Action series).
  • Hotho, A., et al. (2005). A brief survey of text mining. LDV Forum – GLDV Journal for Computational Linguistics and Language Technology.
  • Hurst, M., et al. Social streams blog crawler.
  • Jiang, L., et al. Survey of improving Naive Bayes for classification.
  • Johnson, R. E. (1997). Components, frameworks, patterns. Communications of the ACM.