Certainly, not a new idea, but something you'll probably not see very often (or ever) ... a "regression" tree that can infer
things that aren't functions ... Very cool!
So, the typical regression problem is to infer some function on some observed points [latex]\{(\mathbf{x}_n,y_n)\}_{n=1}^N[/latex] where [latex]y_n[/latex] are possibly noisy (i.e. so interpolation wouldn't be appropriate).
The regression tree divides the feature space (the [latex]x[/latex]'s) in such a way so as to minimize
[latex]
\ell(Y) = \sum_{y \in Y} (y - \bar{y})^2 = |Y| \text{Var}(Y)
[/latex]
The divisions are binary splits, so each vertex is minimizing the above on each
side[latex]
\ell(Y_{\text{left}},Y_{\text{right}}) = \sum_{y \in Y_{\text{left}}} (y - \bar{y})^2 + \sum_{y \in Y_{\text{right}}} (y - \bar{y})^2 = N_{\text{left}} \text{Var}(Y_{\text{left}}) + N_{\text{right}}\text{Var}(Y_{\text{right}})
[/latex]
The left and right sides are per-feature (dimension) and determined by a threshold
[latex]
Y_{\text{left}} = \{ y_n\ :\ x_{nf} \leq t,\ n=1,2,3, ... N \} \\
Y_{\text{right}} = \{ y_n\ :\ x_{nf} > t,\ n = 1, 2, 3, ... N \}
[/latex]
So, to minimize the above, the vertex searches over thresholds [latex]t[/latex] in a subset of features [latex]f[/latex] and chooses the [latex](t,f)[/latex] with minimal residual.
Now, to generalize this to learn
things that aren't functions, consider minimizing a modification of the above
[latex]
\ell(Y) = \sum_{k=1}^K \sum_{y \in Y_k} (y - \bar{y}_k)^2 = \sum_{k=1}^K |Y_k| \text{Var}(Y_k)
[/latex]
Each partition [latex]Y_k[/latex] is a cluster of [latex]y[/latex] values. So, for example, if you were to examine a cross-section of a cloud of points describing a circle, you would see two distinct clusters of [latex]y[/latex] values (unless you were at the ends of the circle).
The general residual to minimize is
[latex]
\ell(Y_{\text{left}},Y_{\text{right}}) = \ell(Y_{\text{left}}) + \ell(Y_{\text{right}})
[/latex]
And I don't really care to expand it further because all the subscripts get ugly. The clusters are found using K-Means or similar. I chose to use K-Means because it is simple. The clustering problem is also simple ... 1D clustering.
So, I didn't explain any of this very well because I'm not really trying to lecture anyone ... just show some cool pictures with a little bit of explanation.
Here's some toy data:
Consider 1 feature [latex]x[/latex] and we want to predict [latex]y[/latex] (which can have two values).
So, here's some examples of individual KM regression trees:
These are plots of the pdfs which are estimated with Kernel Density Estimation. The parameters are inferred from K-Means clustering (the predictions for Y would be the K means).
This training set has 1000 instances. This tree has minimum local sample size of 20. This parameter is important so that clustering works well.
This is the same conditional, but at a different angle.
Here's an example of a cross-section which nicely captures the behavior of the data.
This training set has 3000 instances. This tree has minimum local sample size of 100.
This is the same conditional, but at a different angle.
Here's an attempt to infer Archimedes Spiral. This is the toy data below.
Here, the training set has 3000 instances. I trained a regression tree with K=5 clusters in mind and a minimum sample size of 100. Not a very nice picture
This is the same conditional, but at a different angle.
Lastly, here's a KM Regression Forest with 100 trees. It works very nicely here.
The same conditional at a different angle
Here are some cross-sections of the forest
You can use normal regression forest too ... but you don't get nice bimodal conditionals like that. Maybe if you had
a lot of regression trees, it would work.
All of this was written in Octave using a mixture of C++ Octfile and Matlab scripting. The code can deal with general data ... these toy data sets were just to make some cool pictures.