XOR, Moons, and Circles: classifying non-linearity
Today we will discuss how different classification algorithms fit different shapes of data. We will see how the decision boundaries are drawn by different algorithms when faced with a variety of non-linear shapes, and try to learn what to expect from which algorithm.
Meet the shapes
I had the idea for this post after coming across SciKit-Learn’s comparison of classification algorithms and decided to further the example with another shape and a deeper discussion of the results. We will consider four shapes: one follows a linearly separable relation that is smeared by noise, and three shapes that exhibit explicit non-linear distributions.
The XOR represents the logical eXclusive OR gate, which takes two binary inputs and outputs one if they are different and zero if they are the same, hence the name exclusive. To represent this as a distribution we added some normalised noise to smear the points around the four possible combinations of inputs. Algebraically, one could see the XOR to be parametrised by a second degree polynomial of the form
XOR(x_1,x_2) = (x_1-x_2)^2 \ ,
although there are many continuous functions that limit to the bitwise XOR in the discrete limit.
The Moons and Circles are defined by geometrical regions with polynomial boundaries. For example, the moons are separated by a polynomial of third degree of shape
x_2 = x_1(x_1-1)(x_1+1)
whereas the circles are separated from each other by a circle
x_1^2+x_2^2 = 1 \ .
These formulae are examples, where the actual boundaries might vary by changing overall normalisations, changing the non-homogeneous terms, etc. The point of presenting them is to show how these regions are intrinsically non-linear.
Meet the algorithms
We will use SciKit-Learn python library, which has a wealth of classification of algorithms. For this exercise, we will focus on choosing algorithms that behave quite differently from each other so we can learn what to expect from each family of algorithms.
The different estimators we choose for this exercise are
- Linear Models
- Support Vector Machines
- SVC with RBF Kernel
- Neural Network
- MLPClassifier, a SciKit-Learn implementation of a Dense Neural Network
- Discriminant Analysis
- Naive Bayes
- Ensemble Methods (All tree-based)
Let’s do a quick breakdown of these models.
Linear models are those where the response variable, \(y\), is assumed to follow a linear combination of the features, \(X\), over the weights, \(W\), and biases, \(b\),
y = W \cdot X + b\ ,
where the equation is implicitly multi-dimensional. Since the underlying assumption is that the target variable responds linearly with regards to the features, the decision boundary will be linear. We could feature engineer polynomial terms, but for the sake of this exercise we are only assessing the out-of-the-box behaviour of the algorithms.
In this set of models we’ve included the Perceptron, which is the ancestry of modern Neural Networks. Historically, the Perceptron was one of the first Machine Learning algorithms to be able to adjust its own waits during a training phase and was hoped to be a breakthrough in machine intelligence. Unfortunately, the Perceptron was proved to be incapable of fitting the XOR distribution and was progressively disregarded.
Support Vector Machines
Support Vector Machines are a set of algorithms that try to find the widest margin separating points of different classes. What’s most interesting, is that it allows for a powerful extension called the kernel trick. The kernel trick allows for the algorithm to consider non-linear relations whilst preventing mapping the features into new features, i.e without engineer new features and making use of higher dimensional feature spaces.
The LinearSVC class from Sci-Kit uses a lower-level library, liblinear, which has great performance but does not support non-linear kernels and, therefore, will have linear decision boundaries.
We have discussed Neural Networks here and here. In this exercise, we’ll use the SciKit-Learn implementation of a multi-layer perceptron, which is another way of calling a Dense Neural Network (DNN), using the MLPClassifier. By default, the MLPClassifier includes a single hidden layer with 100 cells/neurons. This is enought to highlight the strength of Neural Networks.
At each hidden cell in a DNN, there is a linear model like the one above, which means that a DNN is in fact a collection fo many linear models. This allows us to construct decision boundaries that are the result of many linear boundaries, each taken to be more or less relevant for different regions of the feature space. Indeed, training a DNN is the task of teaching it which linear models to consider for a given input vector of features. In this sense, a DNN can be seen as a more generalised and powerful version of a Local Regression.
Unlike the Perceptron, the MLP has hidden layers. This is crucial, as it will be able to fit non-linear decision boundaries where the Perceptron could not.
Discriminant Analysis and Naive Bayes
Discriminant Analysis and Naive Bayes have been around for a long time and are based on similar principles. The idea is to use Bayes rule for conditional probability to use all evidence to predict the probability of each class. Since we have observations over the features and classes, we can compute the probabilities of these and then infer the most probable class of a new observation.
The Naive Bayes further assume that the features are uncorrelated. This not only makes these models very fast, but very suitable for fat data, i.e. data with many features. Next, we have to assume the distribution followed by the features, leading us to Gaussian or Bernoulli, respectively suited for continuous and Boolean features.
The Discriminant Analysis classes relax these assumptions while assuming the features follow Gaussian distributions. The Linear implementation accounts for assuming that the Gaussian for each class share the same covariance matrix, leading to linear boundary conditions. The Quadratic implementation does not make the same assumptions, leading to a quadratic decision boundary.
In all cases, training is a matter of counting occurrences and is therefore very quick.
A Gaussian Process is a stochastic process where the response variable is linear combination of multivariate normal distributions. This is accomplished by fitting many normal distributions over the feature space and then summing them for a final distribution. Gaussian Processes often produce very good results, but they are computationally intensive for big data. They make use of Kernels, and are therefore equipped with the kernel trick introduced above allowing them to lift non-linear behaviour without making use multidimensional vector spaces.
Trees and Ensembles
Decision Trees is a class of models that is very popular. They represent a sequence of decisions (splitting of branches of a tree) at which in the end a conclusion as reached. The depth of a tree is normally a free parameter, and there are many ways of training (growing) a tree for a given problem. It might be difficult to train a tree with the correct shape before knowing the data. As such, trees are better used in groups, or forests, in what are called ensemble models.
Ensemble models are collections, ensembles, of estimators from which a final prediction is drawn. Therefore they are often called meta-estimators. While the principles of ensemble can be applied to any estimator, it is often to based them on decision trees.
There are two types of ensemble models we will look into: bagging and boosting. Bagging fits the estimators to subsets of the data (and features, to prevent high bias) and then combines the estimations to draw a final estimate, for example by averaging or voting. Boosting fits the estimators to the data, then assigns weights to the data points inversely proportionally to how well the estimator fit each point, before fitting again. In the subsequent fits, it focus more attention on the difficult points which improves the overall predictability and the capacity to uncover intricate relations. Boosting is very popular nowadays, and we also include XGBoost, a library with a boosting algorithm not included in SciKit-Learn.
Training can be performed in parallel in bagging, whilst boosting requires a sequence of models to assess the weights on the training data. Therefore, bagging is usually faster to fit than boosting, although boosting has been responsible for winning many Kaggle competitions.
Meet the results
I wrote a simple Python code that generates the data and fits the different models to each. Afterwards, it draws the points and the decision boundaries. The value on the bottom right corner of these plots is the score on the test set. Both the boundaries and the test score will indicate how well the model generalises to future data and how it learns how to make decisions.
Linearly Separable set
As expected, all models behave quite well. The set is linearly separable, and the linear models perform as well as those equipped with the capacity to deal with non-linear models. This first data-set serves mostly as a benchmark for the models and a starting point for comparison.
We also note that the ensemble models outperform the single decision tree even for a simple data-set like this.
The XOR set is known to be a tricky one for many models (for example, it’s responsible for the downfall of the Perceptron popularity), and the above results only reinforce this lore. Linear models fail dramatically as the set is not linearly separable.
Surprisingly, the Bernoulli Naive Bayes also fails even though the features follow a smeared Boolean distribution. This indicates that the Bernoulli Naive Byes can only fit a distribution of Boolean features if the classes are linearly separable. On the other hand, Gaussian Naive Bayes fails as a single Gaussian is unable to encapsulate a class while avoiding the other.
It’s also worth noting that how the AdaBoost ensemble completely misses the shape of the data, while all other ensemble methods perform very well. I’m not sure about why this happens, but I speculate the training procedure followed a bad sequence of models.
The Moons data-set provides some interesting insights. First, the Quadratic Discriminant Analysis fails to outperform the linear case, this is because the boundary between the classes is cubic and so one degree higher than the quadratic boundary proposed. We can also see the MLP performs only slightly better than the best models, although the decision boundary is seemingly following the predicted cubic shape.
Regarding the Tree based models we see that since the simple single Tree performs quite well, the ensemble models offer little improvement for this data-set.
Finally we have the co-centring circles. This data-set is different from the others as it exhibits a different topology, since you cannot continuously reshape this distribution into any of the others.
First we observe that the linear models and BernoulliNB fail unsurprisingly. The Gaussian Naive Bayes and the Gaussian Process perform very well, as the normal distribution has spherical/circular symmetry and so it fits the inner circle effortlessly. Interestingly enough, this time the MLP performs very well, whereas before it failed to generalise the cubic border accordingly.
We discussed specific models above, but some models seem to have very consistent performances: Non-linear SVC, Gaussian Process, ensembles of Trees. It is with no surprise that SVC and ensembles of Trees are often the first go-to choices when performing and out-of-the-box classification, and this exercise shows how versatile and powerful they are. We notice an awkward behaviour with AdaBoost with the XOR data-set, but otherwise ensembles of Trees perform very well.
Other considerations and shortcomings
While this exercise helped to understand some classification models better, it is by no means a complete analysis. There are many difficulties in every-day data-science that we haven’t touched with this analysis
- Only two dimensional feature space: no curse of dimensionality.
- Both features are continuous: no mixed feature difficulties.
- Fabricated data: the two features are relevant and independent, the models are not fooled by unimportant or highly correlated features.
- Both classes equally represented: in real-life cases the classes are not equally represented, leading to skewed models.
- No missing values: each observation has a value for both features.
As such, the results above are to be taken only as an indication and not a cheat-sheet for model assessing.
On top of this, so far I have been hiding a very important aspect of training these models: training times. In fact, there is a model that performs very well above, but which scales terribly with data: Gaussian Process. To highlight this I performed fits on increasingly bigger versions of the same data-sets of
[125,250,500,1000,2000,4000,8000,16000] points. The fitting times for the Gaussian Process can be seen below where it exhibits a \(O(n^2)\) complexity.
We studied how different classification models fit a selection of non-trivial data-sets. The scope of this exercise was to assess the shapes of decision boundaries of different algorithms to gather insight on when they are most likely to perform well.
Although better information about the data-set is our biggest asset, we found that Support Vector Machines with non-linear kernels and ensemble Tree estimators are very good default options as they offer great versatility across different data-sets. We also found that other models outperformed trees on data-sets with characteristics more suitable for their strengths, which means that a good understanding of the data is crucial to choose the best algorithm for the task at hand.
In the end, a few classifiers should be tried on the data-set to evaluate their performance, both in scores and in training times, to choose the one that is best suited for our needs.