In practice it is usually possible to get better results by combining several different models into an ensemble of models. There are two significantly different ways to create an ensemble:
- Ensembles with independent models;
- Ensembles with boosting.
The models themselves can be:
- Heterogeneous: completely different kinds of models, e.g. a decision tree and a neural network. This helps because different kinds of models tend to make very different errors.
- Homogeneous: models from the same class (e.g. all decision trees) that are made to differ from each other by other means, e.g. by training on slightly different parts of the dataset or by introducing stochasticity into the training process.
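One common way to train homogeneous models on "slightly different parts of the dataset" is to give each model its own resampled copy of the data. The following is a minimal sketch, assuming sampling with replacement (bootstrap sampling); the function name `bootstrap_samples` and its parameters are illustrative, not taken from any particular library.

```python
import random


def bootstrap_samples(dataset, n_models, seed=0):
    """Return one resampled copy of `dataset` per model.

    Each copy has the same size as the original and is drawn with
    replacement, so some examples repeat and others are left out.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    size = len(dataset)
    return [
        [dataset[rng.randrange(size)] for _ in range(size)]
        for _ in range(n_models)
    ]


data = list(range(10))
samples = bootstrap_samples(data, n_models=3)
for i, sample in enumerate(samples):
    print(f"model {i} trains on: {sorted(sample)}")
```

Each model would then be trained on its own sample, so that models of the same class still end up making somewhat different errors.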
Ensembles with Independent Models
In an ensemble with independent models, several different models are trained to perform the same task. The ensemble as a whole then makes predictions by voting (for classification) or averaging (for regression).
Ensembles of this kind improve generalization. Intuitively, correct models agree with each other, whereas errors tend to be stochastic and scatter across different wrong answers. The correct models can therefore outvote the incorrect ones even when they are in the minority.
from [levy], with modifications
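The voting intuition can be sketched in a few lines. This is an illustrative example only: the helper names (`majority_vote`, `average`, `majority_accuracy`) are made up for this sketch, and `majority_accuracy` assumes a binary task in which every model is correct with the same probability and errors are fully independent, which real ensembles only approximate.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Classification: return the most frequent predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the individual predictions."""
    return sum(predictions) / len(predictions)


def majority_accuracy(p, n):
    """Probability that more than half of n independent models are correct,
    assuming each one is correct with probability p (binary task)."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Two correct models ("cat") win although three of five models are wrong,
# because the wrong answers are scattered over different labels.
print(majority_vote(["cat", "dog", "bird", "cat", "fish"]))

# Five independent models that are each 70% accurate.
print(round(majority_accuracy(0.7, 5), 5))
```

Under these idealized assumptions, five models that are each right 70% of the time reach a majority-vote accuracy of about 84%. The gain shrinks as the models' errors become correlated.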
Ensembles with Boosting
The other kind of ensemble is fundamentally different: the models are not independent. Instead, the models are trained one after another, each focusing on correcting the errors of the previous ones. The overall prediction is a weighted combination (a weighted vote or sum) of the individual models’ predictions. This makes the overall model more expressive and improves accuracy.
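The sequential idea can be illustrated with a small regression sketch. It assumes one particular flavour of boosting (each new model is fitted to the residuals left by the models so far, as in least-squares gradient boosting) and uses one-split "stumps" on a single input as the weak models. The names `fit_stump`, `boost`, `mse` and the `learning_rate` weight are choices made for this example, not a reference implementation.

```python
def fit_stump(xs, residuals):
    """Fit a one-split model on a 1-D input by minimising squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave data on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Train stumps one after another, each on the current residuals."""
    stumps = []

    def predict(x):
        # Weighted sum of all models trained so far.
        return sum(learning_rate * stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.1, 0.9, 1.1, 2.0, 2.1, 2.9, 3.0]
print("1 round:  ", mse(boost(xs, ys, n_rounds=1), xs, ys))
print("20 rounds:", mse(boost(xs, ys, n_rounds=20), xs, ys))
```

A single stump can only output two values, while the sum of many stumps can follow the staircase in the data, which is the sense in which boosting makes the combined model more expressive. The errors printed here are training errors; accuracy on unseen data still has to be checked separately.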
Stacking: A Hybrid Approach
A hybrid approach is also possible. A method known as stacking first trains a number of models in parallel. It then augments the original dataset with their outputs and trains another model on the result. Including the original features is optional; some variants train the second-level model on the base models' outputs alone. Naturally, several levels of such stacking can be used.
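The augmentation step can be sketched as follows. Everything here is a toy assumption made for illustration: the base learner `fit_threshold`, the second-level learner `fit_nearest_neighbour`, and the helper `augment` are hypothetical names, and the second-level model is trained on the same rows as the base models, whereas practical stacking setups usually generate the base-model outputs on held-out data to limit overfitting.

```python
def fit_threshold(rows, targets, column):
    """Toy base learner: predict 1 if one column exceeds its training mean.
    (`targets` is unused; the parameter is kept for a uniform interface.)"""
    cut = sum(row[column] for row in rows) / len(rows)
    return lambda features: 1 if features[column] > cut else 0


def fit_nearest_neighbour(rows, targets):
    """Toy second-level learner: copy the target of the closest training row."""
    def predict(features):
        distances = [
            sum((a - b) ** 2 for a, b in zip(row, features)) for row in rows
        ]
        return targets[distances.index(min(distances))]
    return predict


def augment(rows, base_models, keep_original=True):
    """Append each base model's output to every row.
    With keep_original=False only the outputs are kept."""
    augmented = []
    for features in rows:
        outputs = [model(features) for model in base_models]
        augmented.append((list(features) if keep_original else []) + outputs)
    return augmented


rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (3.0, 2.0)]
targets = [0, 0, 1, 1]

# Level 1: base models trained independently of each other.
base_models = [fit_threshold(rows, targets, 0), fit_threshold(rows, targets, 1)]

# Level 2: another model trained on the augmented dataset.
meta_model = fit_nearest_neighbour(augment(rows, base_models), targets)


def stacked_predict(features):
    return meta_model(augment([features], base_models)[0])


print(augment(rows, base_models)[0])
print([stacked_predict(row) for row in rows])
```

Adding another level would mean treating `stacked_predict` as one of several base models and repeating the same augmentation step.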
Literature
- [levy] LEVY, J. Random Forests in Python [online]. [cit. 2014-11-03]. URL: <http://blog.yhathq.com/posts/random-forests-in-python.html>.