On Trees, Forests and Machines -- or: Do New Brooms Clean Better?
Classical regression models, such as linear or logistic regression, are the standard approach in biostatistics.
In the past decade, the statistical properties of several machine learning approaches, such as random forests or support vector machines, have become better understood. For example, results on consistency, convergence rates and asymptotic normality are available for random forests. However, machine learning approaches will only be used if they are available in fast, easy-to-use implementations. In this presentation, I will focus on random forests as a learning machine. In the first part of the presentation, I will give an intuitive introduction to classification trees and probability estimation trees. Trees will then be generalized to random forests, and the statistical properties of random forests will be sketched. A specific problem in machine learning is how probability estimates should be updated to make predictions for other centers or for different time points. In the second part of the presentation, I will show that both a general approach by Elkan and a novel approach specifically developed for random forests can be used for calibrating probability estimates. The approach will be illustrated using data from the German Stroke Study Collaboration.
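
To make the updating problem concrete, the following is a minimal sketch of one way a probability estimate can be adjusted when the event rate differs between the training center and the target center. It assumes that the "general approach by Elkan" is the base-rate adjustment from Elkan's work on cost-sensitive learning; the abstract does not state this explicitly. The function name and its parameters are hypothetical and chosen for illustration only, and the numbers in the usage lines are made up, not taken from the German Stroke Study Collaboration.

```python
def adjust_for_base_rate(p, b_old, b_new):
    """Update a probability estimate p for a new base rate.

    Sketch of a base-rate adjustment (assumed here to be the Elkan
    approach mentioned in the abstract):
        p' = b_new * (p - p * b_old)
             / (b_old - p * b_old + b_new * p - b_old * b_new)

    p      -- probability estimated by a model trained at base rate b_old
    b_old  -- event rate in the training data
    b_new  -- event rate in the target center or time period
    """
    if not (0.0 < b_old < 1.0 and 0.0 < b_new < 1.0):
        raise ValueError("base rates must lie strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    numerator = b_new * (p - p * b_old)
    # The denominator is strictly positive for valid inputs.
    denominator = b_old - p * b_old + b_new * p - b_old * b_new
    return numerator / denominator


# Illustrative values only: a model trained where half of the patients
# have the event, applied to a center where only 10% do.
updated = adjust_for_base_rate(0.5, b_old=0.5, b_new=0.1)
unchanged = adjust_for_base_rate(0.3, b_old=0.2, b_new=0.2)
```

The adjustment is equivalent to rescaling the odds of the original estimate by the ratio of the new to the old base-rate odds, so an estimate is left unchanged when the two base rates agree. It only corrects for a shift in prevalence; it does not address changes in how the predictors relate to the outcome.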