Elsevier

Pattern Recognition

Volume 63, March 2017, Pages 45-55

Multi-label methods for prediction with sequential data

https://doi.org/10.1016/j.patcog.2016.09.015

Highlights

  • Drawing connections between learning for multi-label and sequential data.

  • A unified view of multi-label and sequential classifiers.

  • A novel Markov model-inspired method for multi-label (and sequence) classification.

  • A novel multi-label-inspired method for sequence (and multi-label) classification.

  • An empirical comparison with related methods, on real-world datasets, demonstrating the competitiveness of proposed methods.

Abstract

The number of methods available for classification of multi-label data has increased rapidly over recent years, yet relatively few links have been made with the related task of classification of sequential data. If label indices are considered as time indices, the problems can often be seen as equivalent. In this paper we detect and elaborate on connections between multi-label methods and Markovian models, and study the suitability of multi-label methods for prediction in sequential data. From this study we draw upon the most suitable techniques from the area and develop two novel competitive approaches which can be applied to either kind of data. We carry out an empirical evaluation investigating performance on real-world sequential-prediction tasks: electricity demand, and route prediction. As well as showing that several popular multi-label algorithms are in fact easily applicable to sequencing tasks, our novel approaches, which benefit from a unified view of these areas, prove very competitive against established methods.

Introduction

Multi-label classification is the supervised learning problem where an instance is associated with multiple class variables (i.e., labels), rather than with a single class, as in traditional classification problems. See [1] for a review. The typical argument is that, since these labels are often strongly correlated, modelling the dependencies between them allows methods to obtain higher performance than if labels were modelled independently – at the expense of an increased computational cost.

The case of binary labels is most common, where a positive class value denotes the relevance of the label (and the negative or null class denotes irrelevance). Typical examples of binary multi-label classification involve categorizing text documents and images, which can be assigned any subset of a particular label set. For example, an image can be associated with both labels beach and sunset. This is usually represented in vector form, such that, given a set of labels L = {beach, urban, foliage, sunset, mountains, fields}, an associated label vector is y = [y1, y2, y3, y4, y5, y6] = [1, 0, 0, 1, 0, 0], which indicates that the first and fourth labels (beach and sunset) are relevant. The image itself can be represented by a feature vector x = [x1, …, xD], and thus the pair (x, y) represents an image and its associated labels. The multi-label classification paradigm has also been successfully applied in many other domains, such as text, video, audio, and bioinformatics – see [1] and references therein for further examples.
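The binary relevance-vector encoding described above can be sketched in a few lines of Python. The label set is the one from the text; the helper functions themselves are purely illustrative.

```python
# Minimal sketch of the multi-label representation: an instance is paired
# with a binary indicator vector y over a fixed label set.
labels = ["beach", "urban", "foliage", "sunset", "mountains", "fields"]

def to_label_vector(relevant, label_set):
    """Encode a set of relevant labels as a binary indicator vector."""
    return [1 if name in relevant else 0 for name in label_set]

def from_label_vector(y, label_set):
    """Decode a binary indicator vector back to the set of relevant labels."""
    return {name for name, yi in zip(label_set, y) if yi == 1}

# The image example from the text: beach and sunset are relevant.
y = to_label_vector({"beach", "sunset"}, labels)
# y == [1, 0, 0, 1, 0, 0]
```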

Although binary labels (representing relevance and irrelevance) are enough to represent a huge number of practical problems, the generalization where each label can take multiple values – variously called multi-target, multi-output, or multi-dimensional classification – has also been investigated in the literature (see [3], [4], [5]). In this case each t-th ‘label’ (t = 1, …, T) can take on up to L values, such as a rating yt ∈ {1, 2, 3, 4, 5} (where L = 5) or hour of day yt ∈ {0, …, 23} (where L = 24), rather than the simple relevance/irrelevance case (L = 2). In practice many multi-label algorithms can be applied directly to the general multi-output case, and all are applicable indirectly, since any multi-valued label can be encoded as a set of binary labels and vice versa. Fig. 1 shows the relationship between these paradigms. Throughout this work, we will continue to use the term multi-label classification for the general case.
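The indirect reduction mentioned above can be made concrete: a label taking L values can be re-encoded as ceil(log2(L)) binary labels, so binary multi-label methods remain applicable. The function names below are illustrative, not part of any library.

```python
# Sketch of the binary re-encoding of a multi-valued label.
import math

def encode_binary(value, num_values):
    """Encode a categorical value in {0, ..., num_values-1} as a bit vector
    (least significant bit first)."""
    width = max(1, math.ceil(math.log2(num_values)))
    return [(value >> i) & 1 for i in range(width)]

def decode_binary(bits):
    """Invert encode_binary."""
    return sum(b << i for i, b in enumerate(bits))

# e.g. hour of day (L = 24) becomes 5 binary labels:
bits = encode_binary(17, 24)
# decode_binary(bits) recovers 17
```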

Sequential data applications deal with a changing state over time, for example of an object or scenario at a particular time index. Approaches to modelling in relevant domains are frequently based on some variety of Markov model, of which detailed overviews are given by [6] and [7].

For example, a traveller's movements among waypoints in a city can be modelled as a series of references to these points, where we can consider yt as indicating the waypoint at time t. An example of a short path among four points, under typical notation, is y1:4 = [y1, y2, y3, y4] = [3, 8, 17, 5], where the numbers are unique to each node. The difference in real time between each t and t+1 depends on the application (it could be seconds, or minutes, for example). The observation (often known as the emission) available at time point t is represented as a vector xt.

These two problems – multi-label classification and sequential prediction – have until now mostly received attention as separate areas of research. However, they can often be seen not just as related problems, but in fact as identical ones, where the terms ‘time index’, ‘state’, ‘observation’, and ‘path’ can be interchanged with ‘label index’, ‘label’, ‘input’, and ‘label vector’, respectively.

We were motivated to take a unified view of these two tasks – multi-label classification and sequential prediction – in a framework that allows the natural application of one to the other. This allows us to apply and further develop suitable techniques from multi-label classification in the domain of sequential prediction, in the form of novel methods that overcome the disadvantages of hidden Markov models and related approaches by allowing the simultaneous prediction of multiple values across time.

In the first contribution of this work, we compare and contrast typical approaches for modelling of multi-label and sequential data, then draw strong connections between these areas (Section 2). We show that many (if not most) methods are directly applicable from one problem to the other, and that all methods are applicable in some way, usually with only small modifications to the way the data is preprocessed. We analyze and discuss the relative advantages and disadvantages of each method. Furthering this, we provide a unified view (in Section 3) describing a common framework for multi-label and sequential-data algorithms. We look particularly at the applicability of multi-label methods for obtaining competitive performance and the necessary scalability characteristics for sequential prediction. In a novel manner we adapt a Markov-based methodology for multi-label data to create a new method (Viterbi Classifier Chains, Section 4), and discuss its suitability in both domains. This leads us to formulate a further novel approach (in Section 5): Sequential Increasingly-sized Chained Labelsets (SICL), which applies a combination of chain-based and set-based approaches to the sequential problem by taking into account the decay of confidence for points relatively further in the future. In Section 6 we compare against a number of competitive multi-label and sequential methods in empirical evaluations on some real-world sequential-data problems. We find that our novel schemes are competitive and scalable. Finally, in Section 7 we discuss the results, summarize our contributions, draw conclusions and mention promising future work in both areas.

Section snippets

Connections between multi-label and sequential classification problems

In this work we study the supervised classification task, where a series of inputs is mapped to a series of outputs by a model trained on similar labelled examples (i.e., a training set is available). In the sequential task, classification of the future is often specifically referred to as prediction (as opposed to the estimation of a current state). In the multi-label context, there is no explicit time context, and therefore the terms prediction and estimation are used interchangeably for all

A unified view of multi-label and sequential-data methods

The area of multi-label classification has expanded rapidly over recent years, and many of its techniques now overlap with more established areas, such as methods for sequential data. Here we develop some of the connections.

The immediately obvious difference between multi-label classification and classification of sequential data lies in the name of the latter: Sequential data is by definition ordered, usually across time,

Viterbi classifier chains

Various methods have been applied as approximate inference on highly connected models [26], [5], [20]. Moreover, a number of search variants for optimal and approximate inference have been proposed for classifier chains, including: [26]'s ϵ-approximate inference, based on performing a depth-first search in the probabilistic tree with a cutting-off list; ‘beam search’ [20], a heuristic search algorithm that speeds up inference considerably; and Monte Carlo search [5]. The optimal and exhaustive
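The chain-inference variants surveyed above share a common skeleton: scoring candidate label vectors one label at a time under the chain factorization. A minimal sketch of beam search over a binary classifier chain follows; `cond_prob` stands in for the t-th trained classifier P(yt = 1 | x, y1..yt−1), and the toy probabilities are purely illustrative, not the authors' exact algorithm.

```python
# Hedged sketch: beam-search inference over a binary classifier chain.
def beam_search_chain(x, cond_prob, num_labels, beam_width=3):
    """Return the (approximately) most probable label vector and its
    joint probability under the chain factorization."""
    beam = [([], 1.0)]  # partial label vectors with their joint probability
    for t in range(num_labels):
        candidates = []
        for prefix, p in beam:
            p1 = cond_prob(t, x, prefix)  # P(y_t = 1 | x, prefix)
            candidates.append((prefix + [1], p * p1))
            candidates.append((prefix + [0], p * (1.0 - p1)))
        # keep only the beam_width most probable partial paths
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beam, key=lambda c: c[1])

# Toy conditional model: a label is likely relevant iff its predecessor was.
def toy_cond_prob(t, x, prefix):
    if t == 0:
        return 0.9
    return 0.8 if prefix[-1] == 1 else 0.2

y_hat, p_hat = beam_search_chain(None, toy_cond_prob, num_labels=4)
```

With beam_width = 1 this degenerates to greedy chain inference, while a wider beam approaches the exhaustive search over all 2^T paths.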

Sequential increasingly-sized chained labelsets

As we explained earlier, multi-label classifiers can be applied easily to the task of sequential prediction. The order of labels itself is not a factor in applicability, whether it be inference, or predicting ahead. Assuming neighbouring (Markov) time dependency has in many cases no particular advantage over arbitrary dependence of equivalent connectedness. However, there is a peculiarity of sequential data with regard to the problem of predicting into the future: generally accuracy will

Experiments

We carry out experiments on the task of multi-step ahead prediction. In sequential terms, given the current state and observation, we wish to estimate τ future states. There are several typical strategies for this [28]: In the so-called iterated strategy, the state is predicted one step ahead, and this prediction is used as input to estimate the second state, and so on. This is essentially rolling forward an HMM with a flat emission probability. In a direct strategy, all future time steps
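The iterated and direct multi-step strategies described above can be contrasted in a short sketch. The one-step model and the per-horizon models below use toy linear dynamics and are purely illustrative stand-ins for trained predictors.

```python
# Hedged sketch of the two multi-step-ahead prediction strategies.
def iterated_forecast(state, one_step, tau):
    """Iterated strategy: predict one step ahead, then feed each
    prediction back in as the input for the next step."""
    preds = []
    for _ in range(tau):
        state = one_step(state)
        preds.append(state)
    return preds

def direct_forecast(state, models):
    """Direct strategy: one dedicated model per horizon h = 1..tau,
    each predicting directly from the current state."""
    return [m(state) for m in models]

# Toy dynamics: the next state doubles the current one.
one_step = lambda s: 2 * s
direct_models = [lambda s, h=h: (2 ** h) * s for h in range(1, 4)]

# On these noiseless toy dynamics the two strategies coincide; with a
# learned, imperfect one-step model the iterated strategy compounds errors.
```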

Conclusions

We revealed and elaborated on many connections between methods for multi-label classification and those typically used in applications involving sequential data. Our motivation was the premise that drawing stronger links would be mutually beneficial for both areas of research. For example, the ‘label bias’ problem relating to a number of Markov models used on sequential data is in fact the same concept as ‘error propagation’ cited frequently in multi-label classification, with tens of models

Acknowledgements

This work was supported in part by the Aalto University AEF research programme http://energyefficiency.aalto.fi/en/.

Jesse Read is assistant professor in the Laboratory of Informatics (LIX) at the École Polytechnique in Paris. He obtained his PhD in Computer Science from the University of Waikato in New Zealand in 2010, and has carried out postdoctoral research at the Carlos III University of Madrid in Spain, Aalto University in Finland, and Télécom ParisTech in France.

References (31)

  • J. Hollmén, V. Tresp, Call-based fraud detection in mobile communications networks using a hierarchical...
  • S. Reddy et al., Using mobile phones to determine transportation modes, ACM Trans. Sen. Netw. (2010)
  • P. Widhalm, P. Nitsche, N. Brandie, Transport mode detection with realistic smartphone sensor data, in: Pattern...
  • A.J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inf. Theory (1967)
  • T.G. Dietterich, Machine learning for sequential data: A review, in: Proceedings of the Joint IAPR International...


Luca Martino was born in Palermo, Italy, in 1980. He obtained his PhD in Statistical Signal Processing from the Universidad Carlos III de Madrid, Spain, in 2011. Later, he spent two years in the Department of Statistics at the University of Helsinki, Finland. He carried out postdoctoral research at the Sao Paulo Research Foundation FAPESP, and is currently at the Universitat de València.

Jaakko Hollmén is a Chief Research Scientist at the Department of Information and Computer Science. He heads the Parsimonious Modelling group at the Helsinki Institute of Information Technology (HIIT). He started his research career as a Master's student at the Laboratory of Computer and Information Science in 1996, conducting research on industrial applications of Self-Organizing Maps. He completed his PhD at Siemens Corporate Technology in Munich.
