Word statistics in Blogs and RSS feeds: Towards empirical universal evidence

https://doi.org/10.1016/j.joi.2007.07.001

Abstract

We focus on the statistics of word occurrences and of the waiting times between such occurrences in Blogs. Due to the heterogeneity of words’ frequencies, the empirical analysis is performed by studying classes of “frequently-equivalent” words, i.e. by grouping words depending on their frequencies. Two limiting cases are considered: the dilute limit, i.e. for those words that are used less than once a day, and the dense limit for frequent words. In both cases, extreme events occur more frequently than expected from the Poisson hypothesis. These deviations from Poisson statistics reveal non-trivial time correlations between events that are associated with bursts of activities. The distribution of waiting times is shown to behave like a stretched exponential and to have the same shape for different sets of words sharing a common frequency, thereby revealing universal features.

Introduction

Web logs, also known as Blogs, have become an influential medium (Hammersley, 2005; Glance et al., 2004; Thelwall et al., 2006) that encompasses a broad variety of subjects, e.g. politics and science, and is participative by nature. They involve a huge number of interacting users who belong to several layers of the population, from topic specialists to average people. This variety suggests that Blogs could be an efficient information source for identifying, tracking and modeling the spread of ideas and opinion formation, for example in public debates over political questions. Indeed, the democratic nature of Blogs allows us to examine how trends develop from the interactions of decentralized bloggers and to follow dynamic opinion changes over a wide and diverse sample of the population. This is in contrast with the mainstream media, where relatively few journalists are involved. Precise knowledge of word statistics in Blogs is consequently of interest for building coherent statistical tests that automatically detect critical events, e.g. trends or media shocks (Kleinberg, 2002; Kleinberg, 2008).

The most basic time statistics, which ignore correlations between events, can be modeled by the Poisson distribution. This distribution concerns independent events: the number $n$ of events arriving during some time interval $\Delta$ occurs with probability

$$P(n|a) = \frac{a^n}{n!}\, e^{-a}, \tag{1}$$

where $a$ is the arithmetic average number of events during this time interval. Moreover, the distribution of waiting times between two successive Poisson events is the negative exponential

$$f(\tau) = \tau_c^{-1} \exp\left(-\frac{\tau}{\tau_c}\right), \tag{2}$$

where $\tau_c = \Delta/a$ is the average characteristic waiting time between events. This distribution is well known to apply to nuclear disintegration, but it has also been used to describe the time gaps between shoppers entering a store (Kan & Fu, 1997), the number of product failures (Gregory, 2005), the number of terrorist acts (Telesca & Lovallo, 2006) and the number of airplane accidents as a function of time (Ausloos & Lambiotte, 2006a). An increasing amount of empirical evidence indicates, though, that human activity patterns do not fit this model. Many authors have shown that human processes are instead heterogeneously distributed in time, with short periods of high activity (Kleinberg, 2002; Kleinberg, 2008; Willinger & Paxson, 1998), or bursts, separated by long periods of inactivity (Barabási, 2005; Vázquez et al., 2006; Dewes et al., 2003; Paxson & Floyd, 1995; Dezsö et al., 2006; Vázquez, 2005; Ebeling & Neiman, 1995; Gopikrishnan et al., 2001; Sabatelli et al., 2002). This heterogeneity is characterized by a distribution of waiting times that deviates from the exponential (2) and that usually presents a so-called heavy tail.
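As a minimal illustration of the Poisson count probability (1) and the exponential waiting-time density (2), the following Python sketch evaluates both formulas. The function names and the values chosen for `a` and `delta` are assumptions made for the example; they are not taken from the data set analyzed in the paper.

```python
import math


def poisson_pmf(n: int, a: float) -> float:
    """Probability of n events in an interval with mean count a: a^n / n! * exp(-a)."""
    return a ** n / math.factorial(n) * math.exp(-a)


def waiting_time_density(tau: float, delta: float, a: float) -> float:
    """Exponential waiting-time density with tau_c = delta / a."""
    tau_c = delta / a
    return math.exp(-tau / tau_c) / tau_c


# Assumed toy parameters: interval of one day, on average half an event per day.
a = 0.5
delta = 1.0

# The probabilities sum to 1 and the mean count equals a (up to truncation error).
probs = [poisson_pmf(n, a) for n in range(50)]
total_probability = sum(probs)
mean_count = sum(n * p for n, p in enumerate(probs))

print(f"sum of P(n|a): {total_probability:.6f}")
print(f"mean count:    {mean_count:.6f}")
print(f"f(0) = 1/tau_c = {waiting_time_density(0.0, delta, a):.6f}")
```

With these toy values, $\tau_c = \Delta/a$ is two days, so the density at $\tau = 0$ equals $1/\tau_c = 0.5$ per day.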

In this paper, we focus on the statistics of such waiting times between word occurrences in Blogs (and other similar periodically updated web sources) and also on the statistics of the number of word occurrences per day. To do so, we focus on texts published in 68022 RSS feeds during a period of 214 days and analyze two limiting cases. On the one hand, we focus on very rare “events”, namely words that occur on average less than once per day. It is shown that the frequency of words is very heterogeneous, so that the time statistics have to be measured in classes of “frequently-equivalent” words, i.e. words are discriminated through their total number of occurrences during the whole time period. This discrimination allows us to show that the distribution of waiting times deviates from the exponential (2), i.e. it is fitted by a stretched exponential and therefore presents an overpopulated tail. The deviation from the pure exponential is evaluated with the quantity ζ, which measures the importance of the second moment of the time statistics. Interestingly, it is found that the shape of the distribution as well as the value of ζ do not depend on the class of words in which they are measured. On the other hand, we focus on events that occur many times per day on average. In that case, scaling laws are applied in order to smooth the empirical results. Deviations from the Poisson statistics (1) are also found. Consequently, our results not only confirm that the dynamics of topics in Blogs present bursts of activity (Kleinberg, 2002; Kleinberg, 2008; Willinger & Paxson, 1998) but also provide tools for measuring the importance of such bursts by comparing the empirical word statistics to an uncorrelated Poisson process.
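To make the notion of an overpopulated tail concrete, the sketch below compares a plain exponential with a stretched exponential that has the same mean waiting time. It is only an illustration under stated assumptions: the stretched exponential is written as a Weibull-type survival function $\exp[-(\tau/\tau_0)^\beta]$, and the values of `beta` and `tau0` are hypothetical, not the fitted values from the paper.

```python
import math


def exponential_survival(tau: float, mean: float) -> float:
    """Probability that a waiting time exceeds tau for an exponential law."""
    return math.exp(-tau / mean)


def stretched_survival(tau: float, tau0: float, beta: float) -> float:
    """Assumed Weibull-type stretched exponential: exp(-(tau / tau0)^beta)."""
    return math.exp(-((tau / tau0) ** beta))


def stretched_mean(tau0: float, beta: float) -> float:
    """Mean waiting time of the law above: tau0 * Gamma(1 + 1/beta)."""
    return tau0 * math.gamma(1.0 + 1.0 / beta)


# Hypothetical parameters chosen for illustration only.
beta = 0.5
tau0 = 1.0
mean = stretched_mean(tau0, beta)

# Compare both laws at the same mean waiting time.
for tau in (1.0, 5.0, 20.0):
    s_exp = exponential_survival(tau, mean)
    s_str = stretched_survival(tau, tau0, beta)
    print(f"tau = {tau:5.1f}  exponential: {s_exp:.3e}  stretched: {s_str:.3e}")
```

For a stretching exponent below 1, long waiting times are far more probable than under an exponential law with the same mean, which is the kind of excess of extreme events described above.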

RSS format

Really Simple Syndication (RSS) is an XML application designed to deliver brief summaries of the most recent updates of web sites (Hammersley, 2005), although it is flexible enough to incorporate other applications, such as reporting updates in digital libraries or search engine databases. Users with RSS reader software can subscribe to a range of RSS feeds based upon their interests, perhaps including favourite Blogs, some news sites or some special interest sites. The RSS reader will

Ensembles of equivalent words

Let us label each word by the index $\alpha$. The number of posts in which this word occurs on day $i$ is denoted $W_{\alpha i}$. Moreover, $W_\alpha = \sum_{i=1}^{214} W_{\alpha i}$ denotes the number of occurrences of $\alpha$ over the total time period. As discussed above, words may exhibit a large range of frequencies ($1$–$10^6$). The spread of these frequencies may find its origin in many causes, e.g. the word “popularity” (two synonyms may be more or less popular) or “contextuality” (words associated with general and frequent contexts should be used
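The bookkeeping behind the total count $W_\alpha$ and the grouping of words with equal totals can be sketched as follows. This is a toy example under stated assumptions: the words, the daily counts and the five-day window are invented for illustration (the study covers 214 days), and the helper names are hypothetical.

```python
from collections import defaultdict


def total_occurrences(daily_counts):
    """Sum the daily counts of each word to obtain its total count."""
    return {word: sum(counts) for word, counts in daily_counts.items()}


def group_by_total(totals):
    """Group words that share the same total count k into one ensemble."""
    ensembles = defaultdict(list)
    for word, k in totals.items():
        ensembles[k].append(word)
    return dict(ensembles)


# Invented toy data: number of posts containing each word on each of 5 days.
daily_counts = {
    "election": [0, 2, 0, 1, 0],
    "budget": [1, 0, 1, 1, 0],
    "the": [40, 38, 41, 39, 42],
}

totals = total_occurrences(daily_counts)
ensembles = group_by_total(totals)

for k in sorted(ensembles):
    print(f"k = {k}: {ensembles[k]}")
```

Time statistics such as waiting times can then be measured within each ensemble separately, so that rare and frequent words are not mixed.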

Conclusion

In this article, we have performed an empirical analysis of the word frequencies arising in Blogs and RSS feeds. To do so, we have collected RSS data during a large time period (more than 200 days during spring 2005). These data encompass several kinds of information sources, such as newspaper RSS feeds and personal diary-like Blogs. Our analysis has been performed by discriminating words depending on their number of occurrences $k$. Namely, ensembles $E_k$ of words occurring with the same frequency

Acknowledgement

This work has been supported by European Commission Project CREEN FP6-2003-NEST-Path-012864.

References (37)

  • C. Beck, Dynamical foundations of non-extensive statistical mechanics, Physical Review Letters (2001)
  • L. Benguigui et al., From lognormal distribution to power-law: A new classification of the size distribution, International Journal of Modern Physics C (2006)
  • C. Cattuto et al., A Yule-Simon process with memory, Europhysics Letters (2006)
  • C. Dewes et al., An analysis of Internet chat systems
  • Z. Dezsö et al., Dynamics of information access on the web, Physical Review E (2006)
  • T.S. Evans, Exact solutions for network rewiring models, The European Physical Journal B (2007)
  • R. Ferrer-Cancho et al., Two regimes in the frequency of words and the origins of complex lexicons: Zipf’s law revisited, Journal of Quantitative Linguistics (2001)
  • N.S. Glance et al., BlogPulse: Automated trend discovery for weblogs
