Pattern Recognition

Volume 46, Issue 3, March 2013, Pages 817-833

Tree ensembles for predicting structured outputs

https://doi.org/10.1016/j.patcog.2012.09.023

Abstract

In this paper, we address the task of learning models for predicting structured outputs. We consider both global and local prediction of structured outputs, the former based on a single model that predicts the entire output structure and the latter based on a collection of models, each predicting a component of the output structure. We apply ensemble methods in this context and propose to build ensembles of predictive clustering trees, which generalize classification trees and have been used for predicting different types of structured outputs, both locally and globally. More specifically, we develop methods for learning two types of ensembles (bagging and random forests) of predictive clustering trees for global and local prediction of different types of structured outputs. The types of outputs considered correspond to different predictive modeling tasks: multi-target regression, multi-target classification, and hierarchical multi-label classification. Each of these combinations can be applied in the context of either global prediction (producing a single ensemble) or local prediction (producing a collection of ensembles). We conduct an extensive experimental evaluation across a range of benchmark datasets for each of the three types of structured outputs. We compare ensembles for global and local prediction, as well as single trees for global prediction and tree collections for local prediction, both in terms of predictive performance and in terms of efficiency (running times and model complexity). The results show that both global and local tree ensembles perform better than their single-model counterparts in terms of predictive power. Global and local tree ensembles perform equally well, with global ensembles being more efficient, producing smaller models, and needing fewer trees in the ensemble to achieve maximal performance.

Highlights

► Tree ensembles for efficient local and global predictions of structured outputs.
► Considered multi-target prediction and hierarchical multi-label classification.
► Theoretical and empirical evaluations over a variety of benchmark domains.
► Local and global tree ensembles perform equally well, while global ones are more efficient.

Introduction

Supervised learning is one of the most widely researched areas of machine learning. The goal in supervised learning is to learn, from a set of examples with known class, a function that outputs a prediction for the class of a previously unseen example. If the examples belong to one of two classes (e.g., the example has some property or not), the task is called binary classification. The task where each example belongs to a single class from a given set of m classes (m ≥ 3) is known as multi-class classification. The case where the output is a real value is called regression.

However, in many real-life predictive modeling problems the output (i.e., the target) is structured, meaning that there can be dependencies between classes (e.g., classes organized into a tree-shaped hierarchy or a directed acyclic graph) or internal relations between the classes (e.g., sequences). Such problems occur in domains such as the life sciences (predicting gene function, finding the most important genes for a given disease, predicting the toxicity of molecules, etc.), ecology (analysis of remotely sensed data, habitat modeling), multimedia (annotation and retrieval of images and videos) and the semantic web (categorization and analysis of text and web pages). With the needs of these application domains and the increasing quantities of structured data in mind, Yang and Wu [1] and Kriegel et al. [2] listed the task of “mining complex knowledge from complex data” as one of the most challenging problems in machine learning.

A variety of methods, each specialized for predicting a given type of structured output (e.g., a hierarchy of classes [3]), have been proposed [4]. These methods can be categorized into two groups [4], [3]: (1) local methods that predict component(s) of the output and then combine the individual models to obtain the overall model and (2) global methods that predict the complete structure as a whole (also known as ‘big-bang’ approaches). Global methods have several advantages over local methods. First, they exploit the dependencies that exist between the components of the structured output during model learning, which can result in better predictive performance. Second, they are typically more efficient: the number of components in the output can be very large (e.g., hierarchies in functional genomics can have several thousand components), in which case running a basic method for each component is not feasible. Furthermore, they produce models that are typically smaller than the sum of the sizes of the models built for each of the components.
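
To make the local/global distinction concrete, the following minimal sketch contrasts the two approaches for multi-target regression. It uses scikit-learn decision trees as stand-ins for the predictive models discussed here (an illustrative assumption: this is not the authors' PCT implementation, and the data are synthetic):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                # descriptive attributes
    Y = np.column_stack([X[:, 0] + X[:, 1],      # target 1
                         X[:, 0] - X[:, 2]])     # target 2 (targets share structure)

    # Global approach: a single model predicts the whole output tuple at once.
    global_model = DecisionTreeRegressor(random_state=0).fit(X, Y)

    # Local approach: one model per component of the output, combined afterwards.
    local_models = [DecisionTreeRegressor(random_state=0).fit(X, Y[:, j])
                    for j in range(Y.shape[1])]

    x_new = rng.normal(size=(1, 5))
    y_global = global_model.predict(x_new)                           # shape (1, 2)
    y_local = np.column_stack([m.predict(x_new) for m in local_models])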

The predictive models that we consider in this paper are predictive clustering trees (PCTs). PCTs belong to the group of global methods. PCTs offer a unifying approach for dealing with different types of structured outputs and construct the predictive models very efficiently. They are able to make predictions for several types of structured outputs: tuples of continuous/discrete variables, hierarchies of classes, and time series. More details about the PCT framework can be found in [5], [6], [7], [8], [9].

PCTs can be considered a generalization of standard decision trees towards predicting structured outputs. Although they offer easily interpretable trees with good predictive performance, they inherit the stability issues of decision trees [10]. Namely, a change in just a few training examples can sometimes drastically change the structure of the tree. Breiman [11] states that unstable predictive models (such as decision trees) can be combined into an ensemble to improve their predictive performance. An ensemble is a set of (base) predictive models whose outputs are combined. For basic classification and regression tasks, it is widely accepted that an ensemble lifts the predictive performance of its base predictive models [12]. However, for the task of predicting structured outputs, this issue has not been thoroughly investigated. Moreover, in the case where the base predictive models are decision trees, Bauer and Kohavi [13] conclude that the ensemble's increase in performance is stronger if the trees are unpruned, i.e., allowed to overfit. On the other hand, Blockeel et al. [14] state that PCTs for structured outputs show less overfitting than trees for classification of a single target variable. Given these two conflicting influences, it is not obvious whether an ensemble of predictive clustering trees can significantly increase the predictive performance over that of a single predictive clustering tree.
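
For reference, bagging [11] amounts to learning each base tree on a bootstrap replicate of the training set and combining the (here, multi-target) predictions by averaging. A minimal sketch, again with scikit-learn trees as hypothetical stand-ins for PCTs (note that scikit-learn trees are unpruned by default):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagging_fit(X, Y, n_trees=50, seed=0):
        """Learn n_trees unpruned trees, each on a bootstrap replicate of (X, Y)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)   # sample n examples with replacement
            trees.append(DecisionTreeRegressor().fit(X[idx], Y[idx]))
        return trees

    def bagging_predict(trees, X):
        """Combine the base models by averaging their per-target predictions."""
        return np.mean([t.predict(X) for t in trees], axis=0)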

Furthermore, in an ensemble learning setting, it is not clear whether the predictive performance of an ensemble of global predictive models will be better or worse than that of a combination of ensembles of local predictive models. Generally, an ensemble is known to perform better than its base learner if the base learner is accurate and diverse [15]. While the superior predictive performance of global models has been shown before [8], less is known about their diversity or instability (i.e., whether they produce different errors with small changes to the training data). A PCT for predicting structured outputs, especially in the case of hierarchical classification, is expected to be more stable than a PCT for predicting components of the output. It is also not clear which approach will be more efficient, both in terms of running time and size of the predictive models.

In this paper, we investigate the aforementioned questions. We use bagging and random forests as ensemble learning methods, since they are the two most widely used ensemble learning methods in the context of decision trees [16]. We consider two structured output machine learning tasks: predicting multiple target variables and hierarchical multi-label classification. We perform an extensive empirical evaluation of the proposed methods over a variety of benchmark datasets.
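
Random forests differ from bagging in that, at each node, only a random subset of the descriptive attributes is considered when selecting a split. A minimal multi-target sketch using scikit-learn's forest as a stand-in (the subset size "sqrt" below is an illustrative assumption, not necessarily the parameter setting used in the paper):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] * X[:, 2]])   # two targets

    # max_features controls the size of the random attribute subset per split.
    forest = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                                   random_state=0).fit(X, Y)
    print(forest.predict(X[:3]))   # one prediction per target, per example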

This paper builds on our previous work, namely Kocev et al. [7] and Schietgat et al. [17]. Kocev et al. [7] conducted an initial comparison of different ensemble schemes using predictive clustering trees in the context of predicting multiple targets. Schietgat et al. [17] introduced bagging of PCTs for hierarchical multi-label classification (HMC) in the context of functional genomics. The present paper extends this previous work [7], [17] in the following directions:

  • The tasks of predicting multiple target variables and hierarchical multi-label classification are discussed jointly. Formal definitions of the considered tasks are provided, as well as an extensive discussion of the related work.

  • The experimental evaluation is performed on a much wider selection of datasets, from various domains. The performance of ensembles with different numbers of base predictive models is evaluated and saturation curves are provided.

  • A better methodology for performance comparisons is used and a more detailed discussion of the results is presented. Friedman tests combined with the Nemenyi post hoc test are used to compare the different ensemble schemes, instead of pairwise comparisons (see the sketch after this list).

  • Global random forests of PCTs for HMC, as well as local ensembles (both bagging and random forests) of PCTs for HMC, are introduced and evaluated.

  • A computational complexity analysis of the proposed methods is performed, which is consistent with the empirical evidence. Global tree ensembles are most efficient, especially random forests, and are scalable to large datasets.
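
As referenced in the list above, the comparison methodology follows the Friedman test with the Nemenyi post hoc analysis. A sketch of that procedure on a hypothetical score matrix (datasets as rows, algorithms as columns; the q value below is the standard alpha = 0.05 studentized-range constant for four algorithms):

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # Hypothetical performance scores: one row per dataset, one column per algorithm.
    scores = np.array([[0.81, 0.83, 0.80, 0.84],
                       [0.70, 0.74, 0.69, 0.75],
                       [0.92, 0.93, 0.90, 0.94],
                       [0.66, 0.69, 0.65, 0.70],
                       [0.58, 0.61, 0.57, 0.62]])

    # Friedman test: do the algorithms' average ranks differ significantly?
    stat, p = friedmanchisquare(*scores.T)

    # Nemenyi post hoc: two algorithms differ significantly if their average
    # ranks differ by more than the critical distance CD = q * sqrt(k(k+1)/(6N)).
    N, k = scores.shape
    avg_ranks = np.mean([rankdata(-row) for row in scores], axis=0)  # rank 1 = best
    q_alpha = 2.569                       # q_{0.05} for k = 4 algorithms
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
    print(p, avg_ranks, cd)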

The remainder of this paper is organized as follows. In Section 2, the considered machine learning tasks are formally defined. Section 3 first explains the predictive clustering trees framework and the extensions for predicting multiple targets and hierarchical multi-label classification. It then describes the ensemble methods and their extension for predicting structured outputs. The design of the experiments, the descriptions of the datasets, the evaluation measures and the parameter settings for the algorithms are given in Section 4. Section 5 presents and discusses the obtained results. The related work is presented in Section 6. Finally, the conclusions are stated in Section 7.

Section snippets

Machine learning tasks

The work presented in this paper concerns the learning of ensembles for predicting structured outputs. First, we formally describe the machine learning tasks that we consider here: predicting multiple targets and hierarchical multi-label classification. We follow the suggestions by Džeroski et al. [18] and Džeroski [86], where predictive modeling is defined for arbitrary types of input and output data. In particular, we describe the tasks of predicting multiple targets and hierarchical
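
In hierarchical multi-label classification, each example can belong to several classes at once, and membership in a class implies membership in all of its superclasses. The toy hierarchy and vector encoding below are illustrative assumptions, sketching how an HMC label set can be represented:

    # Toy class hierarchy: child -> parent (tree-shaped; for a DAG, map to a set).
    hierarchy = {
        "1": None, "2": None,
        "1.1": "1", "1.2": "1", "2.1": "2",
        "2.1.1": "2.1",
    }

    def close_upwards(labels):
        """Add all ancestors of each label (the hierarchy constraint)."""
        closed = set()
        for c in labels:
            while c is not None:
                closed.add(c)
                c = hierarchy[c]
        return closed

    classes = sorted(hierarchy)   # fixed ordering of all classes

    def to_vector(labels):
        """Encode a label set as a 0/1 vector over all classes in the hierarchy."""
        full = close_upwards(labels)
        return [int(c in full) for c in classes]

    print(to_vector({"2.1.1"}))   # also switches on "2.1" and "2"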

Tree ensembles for predicting structured outputs

In this section, we present our ensemble methods for predicting structured outputs. We begin by presenting predictive clustering trees and their instantiations for predicting multiple continuous variables, predicting multiple discrete variables, and hierarchical multi-label classification. Next, we describe how ensemble learning methods can be adapted to use predictive clustering trees as base predictive models. Finally, we describe an approach to the prediction of structured outputs that uses
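
The core of PCT induction is a top-down search that selects, at each node, the test maximizing the reduction of intra-cluster variance in the target space; for multiple continuous targets, the variance of a cluster is summed over the target variables. A minimal sketch of this heuristic (equal target weights are an assumption here; the PCT framework allows normalization and weighting choices):

    import numpy as np

    def cluster_variance(Y):
        """Variance of a set of examples, summed over the target variables."""
        return float(np.sum(np.var(Y, axis=0))) if len(Y) else 0.0

    def variance_reduction(Y, mask):
        """Score a binary test: Var(S) - sum_i |S_i|/|S| * Var(S_i)."""
        n = len(Y)
        left, right = Y[mask], Y[~mask]
        return cluster_variance(Y) - (len(left) / n * cluster_variance(left) +
                                      len(right) / n * cluster_variance(right))

    # Usage: among all candidate attribute/threshold tests, a PCT-style learner
    # would pick the one with the largest variance reduction.
    Y = np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [5.2, 5.9]])
    mask = np.array([True, True, False, False])
    print(variance_reduction(Y, mask))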

Experimental design

In this section, we describe the procedure for the experimental evaluation of the proposed ensemble methods for predicting structured outputs. First, we state the questions we consider. Next, we present the datasets used to evaluate the algorithms and the evaluation measures we apply. In the last subsection, we give the parameter values used in the algorithms and the statistical tests that we use.

Results and discussion

The results from the experiments can be analyzed along several dimensions. First, we present the saturation curves of the ensemble methods (both for predicting the structured outputs and the components of the outputs). We also compare single trees vs. ensembles of trees. Next, we compare models that predict the complete structured output vs. models that predict components of the structured output. Finally, we evaluate the algorithms by their efficiency in terms of running time and model size.

Related work

The task of predicting structured outputs is gaining more and more attention within the machine learning research community [4], [3]. The community has proposed a number of different methods for addressing this task. However, they are typically “computationally demanding and ill-suited for dealing with large datasets” [4]. In this paper, we have proposed a global method for predicting structured outputs that has good predictive performance and is very efficient: it scales linearly with the size

Conclusions

In this paper, we address the task of learning predictive models for structured outputs, i.e., models that take as input a tuple of attribute values and produce a structured object as output. In contrast to standard classification and regression, where the output is a single scalar value, in structured output learning the output is a data structure, such as a tuple or a directed acyclic graph. We consider both global and local prediction of structured outputs, the former based on a single model that

Acknowledgment

We would like to thank Valentin Gjorgjioski for the discussions concerning the computational complexity of the proposed methods. Celine Vens is a post-doctoral fellow of the Research Fund—Flanders (FWO—Vlaanderen). The work of Dragi Kocev and Sašo Džeroski was supported by the Slovenian Research Agency (Grants P2-0103 and J2-2285), the European Commission (Grants ICT-2010-266722 and ICT-2011-287713), and Operation no. OP13.1.1.2.02.0005 financed by the European Regional Development Fund (85%) and

References (86)

  • J. Díez et al.

    A semi-dependent decomposition approach to learn hierarchical classifiers

    Pattern Recognition

    (2010)
  • Q. Yang et al.

    10 challenging problems in data mining research

    International Journal of Information Technology & Decision Making

    (2006)
  • H.-P. Kriegel et al.

    Future trends in data mining

    Data Mining and Knowledge Discovery

    (2007)
  • C. Silla et al.

    A survey of hierarchical classification across different application domains

    Data Mining and Knowledge Discovery

    (2011)
  • G.H. Bakır, T. Hofmann, B. Schölkopf, A.J. Smola, B. Taskar, S.V.N. Vishwanathan, Predicting Structured Data, Neural...
  • H. Blockeel, L.D. Raedt, J. Ramon, Top-down induction of clustering trees, in: Proceedings of the 15th International...
  • J. Struyf, S. Džeroski, Constraint based induction of multi-objective regression trees, in: Proceedings of the 4th...
  • D. Kocev, C. Vens, J. Struyf, S. Džeroski, Ensembles of multi-objective decision trees, in: ECML '07: Proceedings of...
  • C. Vens et al.

    Decision trees for hierarchical multi-label classification

    Machine Learning

    (2008)
  • I. Slavkov et al.

    Finding explained groups of time-course gene expression profiles with predictive clustering trees

    Molecular Biosystems

    (2010)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • L. Breiman

    Bagging predictors

    Machine Learning

    (1996)
  • G. Seni et al.

    Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions

    (2010)
  • E. Bauer et al.

    An empirical comparison of voting classification algorithms: bagging, boosting, and variants

    Machine Learning

    (1999)
  • H. Blockeel, L. Schietgat, J. Struyf, S. Džeroski, A. Clare, Decision trees for hierarchical multilabel classification:...
  • L.K. Hansen et al.

    Neural network ensembles

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1990)
  • L. Schietgat et al.

    Predicting gene function using hierarchical multi-label decision tree ensembles

    BMC Bioinformatics

    (2010)
  • S. Džeroski, V. Gjorgjioski, I. Slavkov, J. Struyf, Analysis of time series data with predictive clustering trees, in:...
  • G. Tsoumakas et al.

    Multi-label classification: an overview

    International Journal of Data Warehousing and Mining

    (2007)
  • P. Langley, Elements of Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,...
  • H. Blockeel et al.

    Efficient algorithms for decision tree cross-validation

    Journal of Machine Learning Research

    (2002)
  • J.R. Quinlan

    C4.5: Programs for Machine Learning

    (1993)
  • A. Clare, Machine Learning and Data Mining for Yeast Functional Genomics, Ph.D. Thesis, University of Wales...
  • L. Kuncheva

    Combining Pattern Classifiers: Methods and Algorithms

    (2004)
  • Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th International...
  • L. Breiman

    Using iterated bagging to debias regressions

    Machine Learning

    (2001)
  • L. Breiman

    Random forests

    Machine Learning

    (2001)
  • T.K. Ho

    The random subspace method for constructing decision forests

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1998)
  • T.G. Dietterich, Ensemble methods in machine learning, in: Proceedings of the 1st International Workshop on Multiple...
  • I.H. Witten et al.

    Data Mining: Practical Machine Learning Tools and Techniques

    (2005)
  • Intel, Intel® SSE4 Programming Reference, D91561-003 Edition,...
  • T. Gärtner et al.

    On structured output training: hard cases and an efficient alternative

    Machine Learning

    (2009)
  • A. Karalič, First Order Regression, Ph.D. Thesis, Faculty of Computer Science, University of Ljubljana, Ljubljana,...

    Dragi Kocev received his Ph.D. degree in computer science from the Jozef Stefan International Postgraduate School (IPS), Ljubljana, in 2011. He is currently a post-doctoral researcher at the Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia. His research interests include machine learning and data mining, prediction of structured outputs and their applications in environmental and life sciences.

    Celine Vens received her Ph.D. in applied sciences from the Katholieke Universiteit Leuven, Belgium. She is a post-doctoral fellow of the Research Foundation—Flanders (FWO) and she is working at the Declarative Languages and Artificial Intelligence Research Group at the Department of Computer Science of the K.U. Leuven University in Leuven, Belgium. Her research interests include machine learning, (relational) data mining, and applications thereof in bioinformatics and other domains.

    Jan Struyf received his Ph.D. in applied sciences from the Katholieke Universiteit Leuven, Belgium. He is a former post-doctoral researcher of the Declarative Languages and Artificial Intelligence Research Group at the Department of Computer Science of the K.U. Leuven University in Leuven, Belgium. His research interests include machine learning, statistical relational data mining, predictive clustering, structured output learning, and inductive databases.

    Sašo Džeroski received his Ph.D. degree in computer science from the University of Ljubljana in 1995. He is currently a scientific councilor at the Department of Knowledge Technologies, Jozef Stefan Institute, and the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, both in Ljubljana, Slovenia. He is also an associate professor at the Jozef Stefan International Postgraduate School, also in Ljubljana. His research interests include data mining and machine learning and their applications in environmental sciences (ecology) and life sciences (biomedicine). He is an ECCAI fellow, member of the executive board of SLAIS, member of ACM SIGKDD and IEEE.
