Elsevier

Pattern Recognition

Volume 63, March 2017, Pages 45-55

Multi-label methods for prediction with sequential data

https://doi.org/10.1016/j.patcog.2016.09.015

Highlights

  • Drawing connections between learning for multi-label and sequential data.

  • A unified view of multi-label and sequential classifiers.

  • A novel Markov model-inspired method for multi-label (and sequence) classification.

  • A novel multi-label-inspired method for sequence (and multi-label) classification.

  • An empirical comparison with related methods, on real-world datasets, demonstrating the competitiveness of proposed methods.

Abstract

The number of methods available for classification of multi-label data has increased rapidly over recent years, yet relatively few links have been made with the related task of classification of sequential data. If label indices are considered as time indices, the problems can often be seen as equivalent. In this paper we detect and elaborate on connections between multi-label methods and Markovian models, and study the suitability of multi-label methods for prediction in sequential data. From this study we draw upon the most suitable techniques from the area and develop two novel competitive approaches which can be applied to either kind of data. We carry out an empirical evaluation investigating performance on real-world sequential-prediction tasks: electricity demand, and route prediction. As well as showing that several popular multi-label algorithms are in fact easily applicable to sequencing tasks, our novel approaches, which benefit from a unified view of these areas, prove very competitive against established methods.

Introduction

Multi-label classification is the supervised learning problem where an instance is associated with multiple class variables (i.e., labels), rather than with a single class, as in traditional classification problems. See [1] for a review. The typical argument is that, since these labels are often strongly correlated, modelling the dependencies between them allows methods to obtain higher performance than if labels were modelled independently – at the expense of an increased computational cost.

The case of binary labels is most common, where a positive class value denotes the relevance of the label (and the negative or null class denotes irrelevance). Typical examples of binary multi-label classification involve categorizing text documents and images, which can be assigned any subset of a particular label set. For example, an image can be associated with both labels beach and sunset. This is usually represented in vector form, such that, given a set of labels L = {beach, urban, foliage, sunset, mountains, fields}, an associated label vector is y = [y1, y2, y3, y4, y5, y6] = [1, 0, 0, 1, 0, 0], which indicates that the first and fourth labels (beach and sunset) are relevant. The image itself can be represented by a feature vector x = [x1, …, xD], and thus the pair (x, y) represents an image and its associated labels. The multi-label classification paradigm has also been successfully applied in many other domains, such as text, video, audio, and bioinformatics – see [1] and references therein for further examples.
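The binary relevance-vector encoding described above can be sketched in a few lines of Python. The label set is the one from the text; the helper functions themselves are purely illustrative.

```python
# Minimal sketch of the multi-label representation: an instance is paired
# with a binary indicator vector y over a fixed label set.
labels = ["beach", "urban", "foliage", "sunset", "mountains", "fields"]

def to_label_vector(relevant, label_set):
    """Encode a set of relevant labels as a binary indicator vector."""
    return [1 if name in relevant else 0 for name in label_set]

def from_label_vector(y, label_set):
    """Decode a binary indicator vector back to the set of relevant labels."""
    return {name for name, yi in zip(label_set, y) if yi == 1}

# The image example from the text: beach and sunset are relevant.
y = to_label_vector({"beach", "sunset"}, labels)
# y == [1, 0, 0, 1, 0, 0]
```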

Although binary labels (representing relevance and irrelevance) are enough to represent a huge number of practical problems, the generalization where each label can take multiple values – variously called multi-target, multi-output, or multi-dimensional classification – has also been investigated in the literature (see [3], [4], [5]). In this case each t-th ‘label’ (t = 1, …, T) can take on up to L values, such as a rating yt ∈ {1, 2, 3, 4, 5} (where L = 5) or hour of day yt ∈ {0, …, 23} (where L = 24), rather than the simple relevance/irrelevance case (L = 2). In practice many multi-label algorithms can be applied directly to the general multi-output case, and all are applicable indirectly, since any multi-valued label can be encoded as a set of binary labels and vice versa. Fig. 1 shows the relationship between these paradigms. Throughout this work, we will continue to use the term multi-label classification for the general case.
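The indirect reduction mentioned above can be made concrete: a label taking L values can be re-encoded as ceil(log2(L)) binary labels, so binary multi-label methods remain applicable. The function names below are illustrative, not part of any library.

```python
# Sketch of the binary re-encoding of a multi-valued label.
import math

def encode_binary(value, num_values):
    """Encode a categorical value in {0, ..., num_values-1} as a bit vector
    (least significant bit first)."""
    width = max(1, math.ceil(math.log2(num_values)))
    return [(value >> i) & 1 for i in range(width)]

def decode_binary(bits):
    """Invert encode_binary."""
    return sum(b << i for i, b in enumerate(bits))

# e.g. hour of day (L = 24) becomes 5 binary labels:
bits = encode_binary(17, 24)
# decode_binary(bits) recovers 17
```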

Sequential data applications deal with a changing state over time, for example of an object or scenario at a particular time index. Approaches to modelling in relevant domains are frequently based on some variety of Markov model, of which detailed overviews are given by [6] and [7].

For example, a traveller's movements among waypoints in a city can be modelled as a series of references to these points, where we can consider yt as indicating the waypoint at time t. An example of a short path among four points, under typical notation, is y1:4 = [y1, y2, y3, y4] = [3, 8, 17, 5], where the numbers are unique to each node. The difference in real time between each t and t+1 depends on the application (it could be seconds, or minutes, for example). The observation (often known as the emission) available at time point t is represented as a vector xt.

These two problems – multi-label classification and sequential prediction – have until now mostly received attention as separate areas of research. However, they can often be seen not just as related problems, but in fact as identical ones, where the terms ‘time index’, ‘state’, ‘observation’, and ‘path’ can be interchanged with ‘label index’, ‘label’, ‘input’, and ‘label vector’, respectively.

We were motivated to take a unified view of these two tasks – multi-label classification and sequential prediction – in a framework that allows the natural application of one to the other. This allows us to apply and further develop suitable techniques from multi-label classification in the domain of sequential prediction, in the form of novel methods that overcome the disadvantages of hidden Markov models and related approaches by allowing the simultaneous prediction of multiple values across time.

In the first contribution of this work, we compare and contrast typical approaches for modelling of multi-label and sequential data, then draw strong connections between these areas (Section 2). We show that many (if not most) methods are directly applicable from one problem to the other, and that all methods are applicable in some way, usually with only small modifications to the way the data is preprocessed. We analyze and discuss the relative advantages and disadvantages of each method. Furthering this, we provide a unified view (in Section 3) describing a common framework for multi-label and sequential-data algorithms. We look particularly at the applicability of multi-label methods for obtaining competitive performance and the necessary scalability characteristics for sequential prediction. In a novel manner we adapt a Markov-based methodology for multi-label data to create a new method (Viterbi Classifier Chains, Section 4), and discuss its suitability in both domains. This leads us to formulate a further novel approach (in Section 5): Sequential Increasingly-sized Chained Labelsets (SICL), which applies a combination of chain-based and set-based approaches to the sequential problem by taking into account the decay of confidence for points relatively further in the future. In Section 6 we compare against a number of competitive multi-label and sequential methods in empirical evaluations on some real-world sequential-data problems. We find that our novel schemes are competitive and scalable. Finally, in Section 7 we discuss the results, summarize our contributions, draw conclusions and mention promising future work in both areas.

Section snippets

Connections between multi-label and sequential classification problems

In this work we study the supervised classification task, where a series of inputs is mapped to a series of outputs by a model trained on similar labelled examples (i.e., a training set is available). In the sequential task, classification of the future is often specifically referred to as prediction (as opposed to the estimation of a current state). In the multi-label context, there is no explicit time context, and therefore the terms prediction and estimation are used interchangeably for all

A unified view of multi-label and sequential-data methods

The area of multi-label classification has expanded rapidly over recent years, and many of its techniques now overlap with more established areas, such as methods for sequential data. Here we develop some of the connections.

The immediately obvious difference between multi-label classification and classification of sequential data lies in the name of the latter: Sequential data is by definition ordered, usually across time,

Viterbi classifier chains

Various methods have been applied as approximate inference on highly connected models [26], [5], [20]. Moreover, a number of search variants for optimal and approximate inference have been proposed for classifier chains, including: [26]'s ϵ-approximate inference, based on performing a depth-first search in the probabilistic tree with a cutting-off list; ‘beam search’ [20], a heuristic search algorithm that speeds up inference considerably; and Monte Carlo search [5]. The optimal and exhaustive
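The chain-inference variants surveyed above share a common skeleton: scoring candidate label vectors one label at a time under the chain factorization. A minimal sketch of beam search over a binary classifier chain follows; `cond_prob` stands in for the t-th trained classifier P(yt = 1 | x, y1..yt−1), and the toy probabilities are purely illustrative, not the authors' exact algorithm.

```python
# Hedged sketch: beam-search inference over a binary classifier chain.
def beam_search_chain(x, cond_prob, num_labels, beam_width=3):
    """Return the (approximately) most probable label vector and its
    joint probability under the chain factorization."""
    beam = [([], 1.0)]  # partial label vectors with their joint probability
    for t in range(num_labels):
        candidates = []
        for prefix, p in beam:
            p1 = cond_prob(t, x, prefix)  # P(y_t = 1 | x, prefix)
            candidates.append((prefix + [1], p * p1))
            candidates.append((prefix + [0], p * (1.0 - p1)))
        # keep only the beam_width most probable partial paths
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beam, key=lambda c: c[1])

# Toy conditional model: a label is likely relevant iff its predecessor was.
def toy_cond_prob(t, x, prefix):
    if t == 0:
        return 0.9
    return 0.8 if prefix[-1] == 1 else 0.2

y_hat, p_hat = beam_search_chain(None, toy_cond_prob, num_labels=4)
```

With beam_width = 1 this degenerates to greedy chain inference, while a wider beam approaches the exhaustive search over all 2^T paths.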

Sequential increasingly-sized chained labelsets

As we explained earlier, multi-label classifiers can be applied easily to the task of sequential prediction. The order of labels itself is not a factor in applicability, whether it be inference, or predicting ahead. Assuming neighbouring (Markov) time dependency has in many cases no particular advantage over arbitrary dependence of equivalent connectedness. However, there is a peculiarity of sequential data with regard to the problem of predicting into the future: generally accuracy will

Experiments

We carry out experiments on the task of multi-step ahead prediction. In sequential terms, given the current state and observation, we wish to estimate τ future states. There are several typical strategies for this [28]: In the so-called iterated strategy, the state is predicted one step ahead, and this prediction is used as input to estimate the second state, and so on. This is essentially rolling forward an HMM with a flat emission probability. In a direct strategy, all future time steps
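The iterated and direct multi-step strategies described above can be contrasted in a short sketch. The one-step model and the per-horizon models below use toy linear dynamics and are purely illustrative stand-ins for trained predictors.

```python
# Hedged sketch of the two multi-step-ahead prediction strategies.
def iterated_forecast(state, one_step, tau):
    """Iterated strategy: predict one step ahead, then feed each
    prediction back in as the input for the next step."""
    preds = []
    for _ in range(tau):
        state = one_step(state)
        preds.append(state)
    return preds

def direct_forecast(state, models):
    """Direct strategy: one dedicated model per horizon h = 1..tau,
    each predicting directly from the current state."""
    return [m(state) for m in models]

# Toy dynamics: the next state doubles the current one.
one_step = lambda s: 2 * s
direct_models = [lambda s, h=h: (2 ** h) * s for h in range(1, 4)]

# On these noiseless toy dynamics the two strategies coincide; with a
# learned, imperfect one-step model the iterated strategy compounds errors.
```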

Conclusions

We revealed and elaborated on many connections between methods for multi-label classification and those typically used in applications involving sequential data. Our motivation was the premise that drawing stronger links would be mutually beneficial for both areas of research. For example, the ‘label bias’ problem relating to a number of Markov models used on sequential data is in fact the same concept as ‘error propagation’ cited frequently in multi-label classification, with tens of models

Acknowledgements

This work was supported in part by the Aalto University AEF research programme http://energyefficiency.aalto.fi/en/.

Jesse Read is assistant professor in the Laboratory of Informatics (LIX) at the École Polytechnique in Paris. He obtained his PhD in Computer Science from the University of Waikato in New Zealand in 2010, and has carried out postdoctoral research at the Carlos III University of Madrid in Spain, Aalto University in Finland, and Télécom ParisTech in France.

References (31)

  • J. Hollmén, V. Tresp, Call-based fraud detection in mobile communications networks using a hierarchical...
  • S. Reddy et al., Using mobile phones to determine transportation modes, ACM Trans. Sen. Netw. (2010)
  • P. Widhalm, P. Nitsche, N. Brandie, Transport mode detection with realistic smartphone sensor data, in: Pattern...
  • A.J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inf. Theory (1967)
  • T.G. Dietterich, Machine learning for sequential data: A review, in: Proceedings of the Joint IAPR International...


Luca Martino was born in Palermo, Italy, in 1980. He obtained his PhD in Statistical Signal Processing from the Universidad Carlos III de Madrid, Spain, in 2011. Later, he spent two years in the Department of Statistics at the University of Helsinki, Finland. He carried out postdoctoral research at the Sao Paulo Research Foundation FAPESP, and is currently at the Universitat de València.

Jaakko Hollmén is a Chief Research Scientist at the Department of Information and Computer Science. He heads the Parsimonious Modelling group at the Helsinki Institute of Information Technology (HIIT). He started his research career as a Master's student at the Laboratory of Computer and Information Science in 1996, conducting research on industrial applications of Self-Organizing Maps. He completed his PhD at Siemens Corporate Technology in Munich.
