I had the idea for this post after coming across SciKit-Learn’s comparison of classification algorithms, and decided to extend the example with another shape and a deeper discussion of the results. We will consider four shapes: one that follows a linearly separable relation smeared by noise, and three that exhibit explicitly non-linear distributions.

The XOR represents the logical eXclusive OR gate, which takes two binary inputs and outputs one if they differ and zero if they are the same, hence the name exclusive. To represent this as a distribution, we add some normally distributed noise to smear the points around the four possible combinations of inputs. Algebraically, one can see the XOR as parametrised by a second-degree polynomial of the form

\[

XOR(x_1,x_2) = (x_1-x_2)^2 \ ,

\]

although there are many continuous functions that limit to the bitwise XOR in the discrete limit.

The Moons and Circles are defined by geometric regions with polynomial boundaries. For example, the moons are separated by a third-degree polynomial of the form

\[

x_2 = x_1(x_1-1)(x_1+1)

\]

whereas the circles are separated from each other by a circle

\[

x_1^2+x_2^2 = 1 \ .

\]

These formulae are only indicative: the actual boundaries may vary by overall normalisations, by non-homogeneous terms, and so on. The point of presenting them is to show that these regions are intrinsically non-linear.
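As a sketch of how such data can be generated (the parameters here are illustrative and need not match those used for the plots below), the moons and circles come ready-made in SciKit-Learn, and the smeared XOR is easy to build with NumPy:

```python
import numpy as np
from sklearn.datasets import make_circles, make_moons

rng = np.random.RandomState(42)
n = 200

# XOR: four clusters at the corners of the unit square, smeared by Gaussian noise
X_xor = rng.randint(0, 2, size=(n, 2)).astype(float)
y_xor = np.logical_xor(X_xor[:, 0], X_xor[:, 1]).astype(int)
X_xor += rng.normal(scale=0.2, size=X_xor.shape)

# Moons and circles are provided by SciKit-Learn directly
X_moons, y_moons = make_moons(n_samples=n, noise=0.2, random_state=42)
X_circles, y_circles = make_circles(n_samples=n, noise=0.1, factor=0.5, random_state=42)
```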

We will use the SciKit-Learn Python library, which has a wealth of classification algorithms. For this exercise, we choose algorithms that behave quite differently from each other, so we can learn what to expect from each family of algorithms.

The different estimators we choose for this exercise are

- Linear Models
  - LogisticRegression
  - Perceptron
- Support Vector Machines
  - LinearSVC
  - SVC with RBF kernel
- Neural Network
  - MLPClassifier, a SciKit-Learn implementation of a Dense Neural Network
- Discriminant Analysis
  - LinearDiscriminantAnalysis
  - QuadraticDiscriminantAnalysis
- GaussianProcess
- Naive Bayes
  - BernoulliNB
  - GaussianNB
- KNeighborsClassifier
- DecisionTreeClassifier
- Ensemble Methods (all tree-based)
  - Bagging
    - RandomForestClassifier
    - ExtraTreesClassifier
  - Boosting
    - AdaBoostClassifier
    - GradientBoostingClassifier
    - XGBClassifier

Let’s do a quick breakdown of these models.

Linear models are those where the response variable, \(y\), is assumed to follow a linear combination of the features, \(X\), with weights, \(W\), and biases, \(b\),

\[

y = W \cdot X + b\ ,

\]

where the equation is implicitly multi-dimensional. Since the underlying assumption is that the target variable responds linearly with regard to the features, the decision boundary will be linear. We could engineer polynomial features, but for the sake of this exercise we only assess the out-of-the-box behaviour of the algorithms.

In this set of models we’ve included the Perceptron, which is the ancestor of modern Neural Networks. Historically, the Perceptron was one of the first Machine Learning algorithms able to adjust its own weights during a training phase, and it was hoped to be a breakthrough in machine intelligence. Unfortunately, the Perceptron was proved incapable of fitting the XOR distribution and was progressively disregarded.

Support Vector Machines are a set of algorithms that try to find the widest margin separating points of different classes. What is most interesting is that they allow for a powerful extension called the kernel trick. The kernel trick lets the algorithm consider non-linear relations without explicitly engineering new features or materialising higher-dimensional feature spaces.
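As an illustration of the kernel trick (with hypothetical dataset parameters), an RBF-kernel SVC separates the circles where a linear SVM cannot:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Concentric circles: not linearly separable in the original feature space
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_score = LinearSVC(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
rbf_score = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
# The RBF kernel captures the circular boundary; the linear machine cannot
```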

The LinearSVC class from SciKit-Learn uses a lower-level library, liblinear, which has great performance but does not support non-linear kernels and will therefore have linear decision boundaries.

We have discussed Neural Networks here and here. In this exercise, we’ll use the SciKit-Learn implementation of a multi-layer perceptron (another name for a Dense Neural Network, or DNN): the MLPClassifier. By default, the MLPClassifier includes a single hidden layer with 100 cells/neurons. This is enough to highlight the strength of Neural Networks.

At each hidden cell in a DNN, there is a linear model like the one above, which means that a DNN is in fact a collection of many linear models. This allows us to construct decision boundaries that are the result of many linear boundaries, each taken to be more or less relevant for different regions of the feature space. Indeed, training a DNN is the task of teaching it which linear models to consider for a given input vector of features. In this sense, a DNN can be seen as a more generalised and powerful version of a Local Regression.

Unlike the Perceptron, the MLP has hidden layers. This is crucial, as it will be able to fit non-linear decision boundaries where the Perceptron could not.
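A small sketch of this contrast (hypothetical noise level and seeds): on a smeared XOR, the Perceptron cannot reach a perfect score, while the hidden layer lets the MLP carve out a non-linear boundary.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(400, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)
X += rng.normal(scale=0.1, size=X.shape)  # smear the four corners

# No hyperplane separates XOR, so a linear model is capped well below 100%
perceptron_score = Perceptron().fit(X, y).score(X, y)
mlp_score = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000,
                          random_state=0).fit(X, y).score(X, y)
```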

Discriminant Analysis and Naive Bayes have been around for a long time and are based on similar principles. The idea is to use Bayes’ rule for conditional probability, combining all the evidence to predict the probability of each class. Since we have observations over the features and classes, we can compute these probabilities and then infer the most probable class of a new observation.

Naive Bayes further assumes that the features are uncorrelated. This not only makes these models very fast, but also very suitable for fat data, i.e. data with many features. Next, we have to assume the distribution followed by the features, leading us to the Gaussian or Bernoulli variants, respectively suited for continuous and Boolean features.

The Discriminant Analysis classes relax the independence assumption while still assuming the features follow Gaussian distributions. The Linear implementation assumes the Gaussians for each class share the same covariance matrix, leading to linear decision boundaries. The Quadratic implementation drops that assumption, leading to a quadratic decision boundary.
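This difference is easy to see on the circles, whose boundary \(x_1^2 + x_2^2 = 1\) is exactly quadratic (dataset parameters here are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# LDA is stuck with a linear boundary; QDA can draw an ellipse around the inner class
lda_score = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
qda_score = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
```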

In all cases, training amounts to estimating simple statistics of the data (counts, means, and covariances) and is therefore very quick.

A Gaussian Process is a stochastic process where the response variable is a linear combination of multivariate normal distributions. This is accomplished by fitting many normal distributions over the feature space and then summing them into a final distribution. Gaussian Processes often produce very good results, but they are computationally intensive for big data. They make use of kernels and are therefore equipped with the kernel trick introduced above, allowing them to capture non-linear behaviour without explicitly using higher-dimensional feature spaces.

Decision Trees are a very popular class of models. They represent a sequence of decisions (the splitting of branches of a tree) at the end of which a conclusion is reached. The depth of a tree is normally a free parameter, and there are many ways of training (growing) a tree for a given problem. It can be difficult to train a tree with the correct shape before knowing the data. As such, trees are better used in groups, or forests, in what are called ensemble models.

Ensemble models are collections (ensembles) of estimators from which a final prediction is drawn; they are therefore often called meta-estimators. While the principles of ensembling can be applied to any estimator, ensembles are most often based on decision trees.

There are two types of ensemble models we will look into: bagging and boosting. Bagging fits the estimators to subsets of the data (and of the features, to prevent high bias) and then combines the estimations to draw a final estimate, for example by averaging or voting. Boosting fits the estimators to the data, then assigns weights to the data points inversely proportional to how well the estimator fit each point, before fitting again. In the subsequent fits, it focuses more attention on the difficult points, which improves the overall predictive power and the capacity to uncover intricate relations. Boosting is very popular nowadays, and we also include XGBoost, a library with a boosting algorithm not included in SciKit-Learn.
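A minimal sketch of both families next to a single tree (illustrative dataset and default hyper-parameters; RandomForest stands in for bagging, AdaBoost for boosting):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree_score = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
bagging_score = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)   # bagging of trees
boosting_score = AdaBoostClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)      # boosting of stumps
```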

Training can be performed in parallel in bagging, whilst boosting requires a sequence of models to assess the weights on the training data. Therefore, bagging is usually faster to fit than boosting, although boosting has been responsible for winning many Kaggle competitions.

I wrote a simple Python script that generates the data and fits the different models to each data-set. Afterwards, it draws the points and the decision boundaries. The value in the bottom-right corner of these plots is the score on the test set. Both the boundaries and the test score indicate how well the model generalises to future data and how it learns to make decisions.
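The boundary-drawing part can be sketched as follows (the actual script on GitHub differs in its details): classify every point of a dense grid, then draw a filled contour of the predictions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)

# Evaluate the classifier on a dense grid covering the data (plus a margin);
# a plotting library can then render the boundary with contourf(xx, yy, Z)
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```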

The code can be found here, on my GitHub. The algorithms are all from SciKit-Learn, with exception to XGBoost, which can be found here.

As expected, all models behave quite well. The set is linearly separable, and the linear models perform as well as those equipped to deal with non-linear relations. This first data-set serves mostly as a benchmark for the models and a starting point for comparison.

We also note that the ensemble models outperform the single decision tree even for a simple data-set like this.

The XOR set is known to be a tricky one for many models (for example, it’s responsible for the downfall of the Perceptron popularity), and the above results only reinforce this lore. Linear models fail dramatically as the set is not linearly separable.

Surprisingly, the Bernoulli Naive Bayes also fails, even though the features follow a smeared Boolean distribution. This indicates that the Bernoulli Naive Bayes can only fit a distribution of Boolean features if the classes are linearly separable. On the other hand, Gaussian Naive Bayes fails because a single Gaussian is unable to encapsulate one class while avoiding the other.

It’s also worth noting how the AdaBoost ensemble completely misses the shape of the data, while all other ensemble methods perform very well. I’m not sure why this happens, but I speculate the training procedure followed a bad sequence of models.

The Moons data-set provides some interesting insights. First, the Quadratic Discriminant Analysis fails to outperform the linear case; this is because the boundary between the classes is cubic, one degree higher than the quadratic boundary proposed. We can also see that the MLP performs slightly better than the other best models, and its decision boundary seemingly follows the predicted cubic shape.

Regarding the tree-based models, we see that since the simple single Tree already performs quite well, the ensemble models offer little improvement for this data-set.

Finally, we have the concentric circles. This data-set differs from the others in that it exhibits a different topology: you cannot continuously reshape this distribution into any of the others.

First we observe that the linear models and BernoulliNB fail unsurprisingly. The Gaussian Naive Bayes and the Gaussian Process perform very well, as the normal distribution has spherical/circular symmetry and so it fits the inner circle effortlessly. Interestingly enough, this time the MLP performs very well, whereas before it failed to generalise the cubic border accordingly.

We discussed specific models above, but some models show very consistent performance: non-linear SVC, Gaussian Process, and ensembles of Trees. It is no surprise that SVC and ensembles of Trees are often the first go-to choices for an out-of-the-box classification, and this exercise shows how versatile and powerful they are. We noticed awkward behaviour from AdaBoost on the XOR data-set, but otherwise ensembles of Trees perform very well.

While this exercise helped us understand some classification models better, it is by no means a complete analysis. There are many difficulties in everyday data science that we haven’t touched on in this analysis:

- Only two dimensional feature space: no curse of dimensionality.
- Both features are continuous: no mixed feature difficulties.
- Fabricated data: the two features are relevant and independent, the models are not fooled by unimportant or highly correlated features.
- Both classes equally represented: in real-life cases the classes are not equally represented, leading to skewed models.
- No missing values: each observation has a value for both features.

As such, the results above should be taken only as an indication and not as a cheat-sheet for model selection.

On top of this, so far I have been hiding a very important aspect of training these models: training times. In fact, there is a model that performs very well above but scales terribly with data: the Gaussian Process. To highlight this, I performed fits on increasingly larger versions of the same data-sets, with `[125, 250, 500, 1000, 2000, 4000, 8000, 16000]` points. The fitting times for the Gaussian Process can be seen below, where it exhibits roughly \(O(n^2)\) scaling over this range (exact Gaussian Process training is in general \(O(n^3)\) in the number of points).

We studied how different classification models fit a selection of non-trivial data-sets. The scope of this exercise was to assess the shapes of decision boundaries of different algorithms to gather insight on when they are most likely to perform well.

Although better information about the data-set is our biggest asset, we found that Support Vector Machines with non-linear kernels and ensemble Tree estimators are very good default options as they offer great versatility across different data-sets. We also found that other models outperformed trees on data-sets with characteristics more suitable for their strengths, which means that a good understanding of the data is crucial to choose the best algorithm for the task at hand.

In the end, a few classifiers should be tried on the data-set to evaluate their performance, both in scores and in training times, to choose the one that is best suited for our needs.

`tf.layers`, a relatively low-level abstraction that simplifies constructing layers in Tensorflow. Then, we will use `keras`, a high-level abstraction that wraps Tensorflow and Theano functions in a unified functional API.
The purpose of this exercise is to see how we can accomplish the same models with two very different approaches, and assess the pros and cons of each workflow.

Before we start, it might be a good idea to review what a CNN is. If you have played around with Deep Neural Networks (like the one presented here) you may already be familiar with the notion of layers, forward feed, back propagation, etc. If not, let’s do a quick review.

A Neural Network is a Machine Learning algorithm that is composed of layers, each containing a certain number of neurons (usually called cells). The data is input in the first layer and then propagated to the next, and so on until it reaches the output layer, in a process called *forward feed* (or *forward propagation*).

Each neuron behaves as a linear model

\[

z = X \cdot W + b \ ,

\]

where \(X\) is the incoming connection, \(W\) the weights, and \(b\) the biases. The weights and biases act just like the slopes and intercept of a Linear Regression, for example, and are the parameters to be fit during training. Once \(z\) is computed, it is passed through an activation function and the resulting output is passed on to the next layer, and so on. The activation function can be non-linear. This, in addition to the fact that a Neural Network can combine multiple linear models at once, allows it to fit non-linear relations with regard to the target variable.
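A single cell can be sketched in a few lines of numpy (the numbers are purely illustrative, and ReLU stands in for the activation function):

```python
import numpy as np

def neuron(X, W, b):
    """One cell: a linear model z = X.W + b followed by a ReLU activation."""
    z = X @ W + b
    return np.maximum(z, 0.0)  # ReLU: pass positive values, zero out the rest

# Three features in, one activation out (illustrative numbers)
X = np.array([0.5, 1.0, 2.0])
W = np.array([0.2, 0.4, 0.1])
b = 0.05
out = neuron(X, W, b)  # z = 0.1 + 0.4 + 0.2 + 0.05 = 0.75
```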

At the output layer, the predicted values of the target variable are compared with the real values (during the training phase, of course), and the weights and biases are adjusted so as to reduce the error (also known as loss) through a process known as *back propagation*, the details of which we will omit.

A layer where all its neurons are connected to all the neurons in the previous and next layers is called *Dense*. A network composed only of Dense layers is called a Dense Neural Network (DNN). This is the case of the Artificial Neural Network presented in the image above.

Above, the data is fed as an array represented by the input layer. Each cell in the input layer represents a feature of the data being fed to the network. This makes DNNs especially suitable for data where:

- The position of each feature throughout all observations is fixed. That is, the DNN evaluates a new instance knowing that whatever comes from cell \(i\) bears the same meaning as in the previous instances.
- The relative position of features in the dataset is irrelevant (for example, you could shuffle all the features, in the same way for all observations). That is, there is no structure in the data that enhances our comprehension.

Clearly there is a simple task for which a DNN is not suitable: images. Why?

- In an image, the pixels around any given pixel bear meaning as to provide context for that pixel.
- Shuffling an image will destroy any structure. We clearly lose information if we shuffle the pixels of an image!
- A picture of a dog on the bottom-right or on the top-left corner of a white square does not change the fact that it is a dog. The same applies to rotation, scaling, etc

If we are to develop models that can recognise images, we need to develop a new architecture that can tackle these issues: in with the Convolutional Neural Networks!

CNNs were proposed following breakthroughs in understanding how the biological brain learns to see and interpret the information coming from the eyes. Studies showed that we do not observe the whole image and interpret it at once, but we process sub-regions of the image separately in an attempt to find features.

This explains, for example, why we can identify faces in cartoon drawings. Cartoon faces bear the same features as real faces, like eyes, a smile, a border, etc. We can also tell a cat from a dog straight away just by looking at noses or paws, without seeing the whole picture.

A CNN works the same way, as it is constructed to be efficient at finding features and deciding which are relevant for a certain classification task. To do so, a CNN has two new types of layers: **convolutional** and **pooling** layers. Convolutional layers share their name with the mathematical convolution operation, which will be familiar to those who have worked on signal processing. Fortunately, for the rest of us, there is a more visual way to understand it through image processing. In image processing, a convolution is defined by a kernel or filter: a matrix that is multiplied element-wise with a small region of an image, after which the elements of the resulting matrix are summed. This process is repeated across the rest of the original image in regular steps known as **strides**, resulting in a new image.

For example, consider the kernel (kernel and filter are used interchangeably) of the form

\[

K = \begin{bmatrix}

-1 & -1 & -1 \\

-1 & \ \ 8 & -1 \\

-1 & -1 & -1

\end{bmatrix} \ ,

\]

This is known as an edge detector, as the resulting value changes significantly wherever there is a sharp change in colour or hue. If we apply this filter to this image

we get the image below:

This is, in essence, what a convolutional layer is: a stack of images, each a result of a different filter. Each image in this stack is called a **feature map**. The filters are small matrices, where entries are initially generated and then changed during training: they are the **weights** of a convolutional layer. Training a CNN is then the process of finding the most relevant features for the classification task at hand.
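The scanning operation itself can be sketched in plain numpy (a "valid" convolution with no padding, using the edge-detector kernel from the text):

```python
import numpy as np

# The edge-detector kernel from the text
K = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=float)

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position, multiply
    element-wise and sum, producing one pixel of the feature map."""
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

flat = np.ones((5, 5))      # constant region: no edges anywhere
fmap = convolve2d(flat, K)  # the kernel sums to zero, so every response is zero
```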

Each feature map is produced by scanning an image with a single kernel and provides a looking glass into a single feature, but real-world images have many features. For example, if we want to distinguish human faces from cars or random pictures, we need not only to identify eyes and mouths, but also rims and windshields. Therefore, we need more and more kernels as the complexity of the images increases. This can lead to two issues: computer memory problems, as any computer has limited storage and processing capacities, and overfitting. To help tackle this we introduce a new ingredient: **pooling layers**.

A pooling layer is a layer that sub-samples incoming information. It is defined by a kernel and a pooling operation. The kernel, just like with the convolutional layer, defines the size of a sub-region of the incoming data on which we apply a pooling operation, such as **average** or **max**. This process returns a value as a function of the pixels in that sub-region (notice that we are considering a picture as being 3-d: height, width, and the channel, i.e. colours). Setting the stride to the size of the kernel, we effectively reduce the image to a \(height/stride \times width/stride\) image.

For example, a pooling layer with a kernel of side \(2\) and a stride of the same size will produce an image whose sides have half the size, i.e. a quarter of the original pixels. This not only helps in reducing memory usage when training the network, it also forces the training to use the features that are relevant and drop those that are not, slightly helping to prevent overfitting.
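A numpy sketch of exactly this \(2 \times 2\), stride-2 max pooling:

```python
import numpy as np

def max_pool(image, k=2):
    """k x k max pooling with stride equal to the kernel size,
    halving each spatial dimension for k=2."""
    h, w = image.shape[0] // k, image.shape[1] // k
    # Group the image into k x k blocks, then take the max of each block
    return image[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [0, 1, 1, 0],
                [2, 3, 0, 1]], dtype=float)
pooled = max_pool(img)  # 4x4 -> 2x2: a quarter of the original pixels
```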

So what does a CNN look like? There is room for a lot of creativity, but the idea is that we add convolutional layers close to the input layer, followed by a pooling layer, and so on. The pairing of convolution and pooling layers is often called a ConvPool layer and is the basic building block of a CNN. But there are many other options, for example stacking a few convolutional layers before a pooling layer, etc.

In principle, the deeper the CNN and the taller (i.e. more feature maps), the more features we can use to classify images. Unfortunately, very deep networks have many issues: overfitting, vanishing gradients, memory usage, training times, etc. So it is with no surprise that the state-of-the-art is ever-changing. For example, look into the ImageNet challenge, where every year all the hotshots (from academics to Google itself) compete to improve accuracy by a few decimal percents!

Finding the perfect sequence of convolutional and pooling layers is then a bit of an art that takes a lot of work, hyper-parameter searching, etc. But there is another ingredient that is more straightforward: the final and output part of the network. After the last convolutional or pooling layer, we are left with a stack of relevant feature maps. At this stage, we should only have the relevant information regarding each feature on each map, meaning that the relevant information was *separated* and broken down into pieces. We can now flatten all of this into an array and feed it into a DNN. This last part of the CNN is often called a *voter*, as it has to decide to which class the image belongs by looking at the final feature maps.

Putting this all together, a typical CNN looks like this:

We are now ready to construct a CNN! Building convolutional and pooling layers from scratch is a cumbersome effort that is prone to errors. This is because of the intricate shapes of the layers, which make weight initialisation and shape-matching between consecutive layers two tasks that require extra attention. To sidestep this, we will *cheat* a bit and use an abstraction called Layers, from `tf.layers`. For more information, see here.

Layers is a relatively low-level abstraction, like the Estimator API, that simplifies our lives by providing classes for the most commonly used layers, taking care of the most repetitive and cumbersome tasks. To keep this exercise as close to pure Tensorflow as possible, we will use as little functionality from Layers as we can.

We will deploy a simple, but powerful, architecture:

\[

Conv \to Conv \to Pool \to Conv \to Conv \to Pool \to Dense \to Output \ .

\]

On top of this, after each \(Pool\) and after \(Dense\), we will apply Dropout. Dropout is a very common regularisation technique for Neural Networks: at each mini-batch during training, some connections between neurons are randomly severed (dropped out) in order to force the Neural Network to learn from different combinations of neurons. Since the dropout is randomised, the neurons are forced to cooperate in different ways, which prevents overfitting. Although seemingly quite aggressive, dropout works amazingly well. The full code can be found on my GitHub.
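To make the mechanism concrete before diving into the Tensorflow code, here is a small numpy sketch of (inverted) dropout, the variant most frameworks implement, independent of the graph below:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate` and
    rescale the survivors by 1/(1-rate) so the expected sum is unchanged.
    Applied only during training; at inference time activations pass through."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10000)
dropped = dropout(a, rate=0.25, rng=rng)  # ~25% zeros, survivors scaled to 4/3
```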

```python
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

height = 28
width = 28
n_inputs = height * width
channels = 1
input_shape = (height, width, channels)
batch_size = 50
num_classes = 10
epochs = 10

X_train = mnist.train.images
X_val = mnist.validation.images
X_test = mnist.test.images
X_train_reshaped = X_train.reshape(X_train.shape[0], height, width, channels)
X_val_reshaped = X_val.reshape(X_val.shape[0], height, width, channels)
X_test_reshaped = X_test.reshape(X_test.shape[0], height, width, channels)
Y_train = mnist.train.labels
Y_val = mnist.validation.labels
Y_test = mnist.test.labels
```

This first part is straightforward: we import the relevant packages and the MNIST dataset. This is a dataset of \(28\times28\) black and white (i.e. single-channel) images of hand-written digits (the labels are from 0 to 9) and serves as a good benchmark for CNNs, or in general for Computer Vision problems. Tensorflow has its own function to import MNIST, and the data comes in three subsets: train, validation, and test. Training is performed with the train dataset. While the model is training, the validation dataset lets us assess how the training is going by evaluating the model on data it has never seen before. This information can be used to check whether the model is over- or underfitting, or as an early-stopping criterion. Finally, after the training is done, we test one last time on the test dataset, which gives us a final assessment of the model. The purpose of the test dataset is to prevent us from being misled about the quality of the model by looking only at the validation scores during training.

```python
conv1_fmaps = 64
conv1_ksize = 3
conv1_stride = 1
conv1_pad = "SAME"
conv1_act = tf.nn.relu

conv2_fmaps = 64
conv2_ksize = 3
conv2_stride = 1
conv2_pad = "SAME"
conv2_act = tf.nn.relu

pool1_fmaps = conv2_fmaps
pol1_ksize = [1, 2, 2, 1]
pol1_strides = [1, 2, 2, 1]
pol1_padding = "VALID"
dropout_1 = 0.25

conv3_fmaps = 32
conv3_ksize = 3
conv3_stride = 1
conv3_pad = "SAME"
conv3_act = tf.nn.relu

conv4_fmaps = 32
conv4_ksize = 3
conv4_stride = 1
conv4_pad = "SAME"
conv4_act = tf.nn.relu

pool2_fmaps = conv4_fmaps
pol2_ksize = [1, 2, 2, 1]
pol2_strides = [1, 2, 2, 1]
pol2_padding = "VALID"
dropout_2 = 0.25

dense1_neurons = 256
dense1_act = tf.nn.relu
dropout_3 = 0.5
n_outputs = num_classes
```

Here we define the parameters of the CNN layers.

- The `fmaps` parameter defines how many feature maps we stack on each layer. Notice that the pooling layers need to have the same number of feature maps as the previous layer. The reason we go from \(64\) to \(32\) is to force the CNN to start trimming and ignoring the least relevant features. This is a form of *encoding* that helps the CNN focus on what matters.
- The `ksize` parameter is the size of the kernel matrix, and `stride` is how many pixels the kernel moves to the right (and downwards once it reaches the right edge) at each step.
- The `pad` parameter refers to how we deal with the edges of the picture. If `SAME`, Tensorflow extends the picture in every direction with zero-valued pixels as borders (padding), while with `VALID` it does not and instead considers the image in its original shape.
- The `act` parameter refers to the activation function, and `relu` is the go-to default.
- Finally, the `dropout` parameters refer to the probability that a given neuron in the previous layer is ignored during training, and `neurons` refers to how many neurons we include in the last dense and output layers.

Let’s explain better what these parameters mean. Take, for example, the first convolutional layer with a \(3 \times 3\) kernel and padding `SAME`. Since the padding adds a border, the first sub-image it looks into is the top-left \(3\times3\) corner, with the first pixel at its centre. Then the kernel moves one pixel to the right, since `stride=1`. At the end, because the padding extends the initial image, the output image is of the same size. If the padding were instead set to `VALID`, it would not be possible to centre the kernel on each of the original pixels, and we would obtain a feature map smaller than the original image.

Likewise, in the pooling layers we consider a \(2 \times 2\) kernel and a `stride=2`, so that it effectively covers each pixel only once, in blocks of 2. The output image is going to be half the size. Notice that the pooling parameters have a different format. This is because we will use the Tensorflow built-in class for the pooling layers, but the Layers class for the convolutional layers. The shape `[1, 2, 2, 1]` refers to `[batch_size, height, width, channels]`, and we do not wish to pool along the observations (first index) or channels (last index).

A few extra comments for free:

- Pooling layers do not have weights! That is why we use the Tensorflow class; it is easy enough to use.
- They also do not have activation functions. We can add one by glueing an activation function node afterwards, but it’s not customary. The convolutional layers are the ones with weights that define the feature maps and, as such, play an active part in the training process.

This is the *creative* part of the code. Now we need to write the code that will turn these numbers into a CNN.

```python
tf.reset_default_graph()

with tf.name_scope("inputs"):
    X = tf.placeholder(tf.float32, shape=[None, n_inputs], name="X")
    X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
    y = tf.placeholder(tf.int64, shape=(None), name="y")
    training = tf.placeholder_with_default(False, shape=[], name='training')

with tf.name_scope("ConvPolLayer1"):
    convolution_1 = tf.layers.conv2d(X_reshaped, filters=conv1_fmaps,
                                     strides=conv1_stride, kernel_size=conv1_ksize,
                                     padding=conv1_pad, activation=conv1_act,
                                     name="conv1")
    convolution_2 = tf.layers.conv2d(convolution_1, filters=conv2_fmaps,
                                     strides=conv2_stride, kernel_size=conv2_ksize,
                                     padding=conv2_pad, activation=conv2_act,
                                     name="conv2")
    pool_1 = tf.nn.avg_pool(convolution_2, ksize=pol1_ksize, strides=pol1_strides,
                            padding=pol1_padding, name="pol1")
    pool_1_drop = tf.layers.dropout(pool_1, dropout_1, training=training)

with tf.name_scope("ConvPolLayer2"):
    convolution_3 = tf.layers.conv2d(pool_1_drop, filters=conv3_fmaps,
                                     strides=conv3_stride, kernel_size=conv3_ksize,
                                     padding=conv3_pad, activation=conv3_act,
                                     name="conv3")
    convolution_4 = tf.layers.conv2d(convolution_3, filters=conv4_fmaps,
                                     strides=conv4_stride, kernel_size=conv4_ksize,
                                     padding=conv4_pad, activation=conv4_act,
                                     name="conv4")
    pool_2 = tf.nn.avg_pool(convolution_4, ksize=pol2_ksize, strides=pol2_strides,
                            padding=pol2_padding, name="pol2")
    pool_2_drop = tf.layers.dropout(pool_2, dropout_2, training=training)

with tf.name_scope("Flatten"):
    shape = pool_2_drop.get_shape().as_list()
    last_flat = tf.reshape(pool_2_drop, [-1, shape[1] * shape[2] * shape[3]])

with tf.name_scope("Dense1"):
    dense1 = tf.layers.dense(last_flat, units=dense1_neurons,
                             activation=dense1_act, name="fc1")
    dense1_drop = tf.layers.dropout(dense1, dropout_3, training=training)

with tf.name_scope("Output"):
    logits = tf.layers.dense(dense1_drop, n_outputs, name="output")
    Y_proba = tf.nn.softmax(logits, name="Y_proba")

with tf.name_scope("Train"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.AdamOptimizer()
    training_op = optimizer.minimize(loss)

with tf.name_scope("Eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.name_scope("Init"):
    init = tf.global_variables_initializer()
```

Hopefully this is readable. I wrapped the different parts of the CNN with `tf.name_scope`, even though we are not using TensorBoard or other visualisation tools. It’s good practice to keep your code organised, and a Neural Network is no different!

First we define the inputs through the respective placeholders with the correct shapes and data types. We also define a placeholder for a Boolean training variable that lets us turn the dropout off when we are not training.

Then we have the first ConvPool layer (in fact, a ConvConvPool layer). Here we can see how convenient Layers can be: all the work of initialising the kernel weights in the convolutional layers is taken care of. Layers provides further arguments to customise these initialisations, but I found the defaults to work quite well for this exercise.

Next we pool the data down to images of half the size, and add a dropout capability. This will drop some of the connections from this layer to the next on each mini-batch during training.
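As an aside, the spatial halving is easy to see with a minimal NumPy sketch of a 2×2 average pooling with stride 2 (an illustration of the operation, not the Tensorflow kernel itself):

```python
import numpy as np

def avg_pool_2x2(img):
    """Average-pool a (H, W) array with a 2x2 window and stride 2."""
    h, w = img.shape
    # Group the pixels into non-overlapping 2x2 blocks and average each block.
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
pooled = avg_pool_2x2(img)
print(pooled.shape)  # (2, 2): each spatial dimension is halved
```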

Afterwards we have another ConvConvPool layer, which does basically the same but works only with \(32\) feature maps. The idea is that we force the CNN to focus on the most relevant features.

After this we flatten the output of the last pooling layer and feed it to the *voter* DNN. This Dense part will then evaluate which class each observation belongs to by looking at the features selected by the ConvConvPool layers.

Finally, we define the operational part of the network. First we introduce the Train functionality, composed of a loss function and the operation to minimise it with a chosen optimiser. We use the go-to default `Adam` as optimiser, while the softmax cross-entropy is the standard loss for multi-class problems. The Eval part defines an accuracy metric for assessing the network. We finish with the operation that initialises all variables, i.e. kicks off all the weights when we run the network.
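For intuition, here is what these two ops compute, sketched in NumPy (an illustration, not Tensorflow's implementation):

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability; softmax is shift-invariant.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sparse_xentropy(logits, labels):
    # Cross-entropy of the true class under the softmax probabilities.
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
labels = np.array([0, 2])

loss = sparse_xentropy(logits, labels).mean()
acc = (logits.argmax(axis=1) == labels).mean()
print(acc)  # 1.0: both predictions match the labels
```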

Time to train and test our model!

```python
n_epochs = 10
batch_size = 50
n_batches = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(n_batches):
            X_batch, Y_batch = mnist.train.next_batch(batch_size)
            batch_loss, batch_acc, _ = sess.run([loss, accuracy, training_op],
                                                feed_dict={X: X_batch, y: Y_batch, training: True})
        epoch_loss_val = loss.eval(feed_dict={X: X_val, y: Y_val})
        epoch_acc_val = accuracy.eval(feed_dict={X: X_val, y: Y_val})
        print("-" * 50)
        print("Epoch {} | Last Batch Train Loss: {:.4f} | Last Batch Train Accuracy: {:.4f} | "
              "Validation Loss: {:.4f} | Validation Accuracy: {:.4f}".format(
                  epoch + 1, batch_loss, batch_acc, epoch_loss_val, epoch_acc_val))
    acc_test = accuracy.eval(feed_dict={X: X_test, y: Y_test})
    print("Final accuracy on test set:", acc_test)
```

This part of the code is almost self-explanatory. The MNIST dataset from Tensorflow is actually a full-featured class that includes an iterator to generate mini-batches. Very convenient! At each epoch we compute how the network is performing by looking at training and validation accuracies, and afterwards we compute the accuracy on the test dataset. The results are shown below:

```
--------------------------------------------------
Epoch 1 | Last Batch Train Loss: 0.0246 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0535 | Validation Accuracy: 0.9848
--------------------------------------------------
Epoch 2 | Last Batch Train Loss: 0.0292 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0442 | Validation Accuracy: 0.9876
--------------------------------------------------
Epoch 3 | Last Batch Train Loss: 0.0511 | Last Batch Train Accuracy: 0.9600 | Validation Loss: 0.0339 | Validation Accuracy: 0.9906
--------------------------------------------------
Epoch 4 | Last Batch Train Loss: 0.0095 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0302 | Validation Accuracy: 0.9912
--------------------------------------------------
Epoch 5 | Last Batch Train Loss: 0.0574 | Last Batch Train Accuracy: 0.9800 | Validation Loss: 0.0276 | Validation Accuracy: 0.9920
--------------------------------------------------
Epoch 6 | Last Batch Train Loss: 0.0234 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0227 | Validation Accuracy: 0.9930
--------------------------------------------------
Epoch 7 | Last Batch Train Loss: 0.0039 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0260 | Validation Accuracy: 0.9916
--------------------------------------------------
Epoch 8 | Last Batch Train Loss: 0.0045 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0257 | Validation Accuracy: 0.9922
--------------------------------------------------
Epoch 9 | Last Batch Train Loss: 0.0152 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0260 | Validation Accuracy: 0.9916
--------------------------------------------------
Epoch 10 | Last Batch Train Loss: 0.0085 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0213 | Validation Accuracy: 0.9938
Final accuracy on test set: 0.9932
```

This is actually a pretty good result! Before the introduction of CNNs, the accuracy record on MNIST was below 99% (regardless of depth, plain DNNs seem to stabilise below 99%), and getting to 99.3% is quite good given that the world record is around 99.7% (or was, the last time I read about it). It is quite impressive that we can achieve such a result with a CNN we can run at home. Furthermore, we notice that the validation accuracy was still increasing while the training loss was still settling at lower levels. This indicates that, in principle, our CNN could perform better if we let it run for a bit longer! Notice, however, that the longer it runs, the higher the risk of overfitting. Assessing the number of epochs, which is a form of regularisation, is the subject of a future post.

So far we mostly used Tensorflow built-in functions and classes. We cheated a bit, I admit, by using Layers. But even then, the code feels and looks like proper Tensorflow, as Layers only provides *shortcuts* for Neural Network applications within a Tensorflow workflow.

Then there is Keras. A completely different beast. Keras was created by François Chollet, an engineer at Google, as a higher-level abstraction. Indeed, Keras is more of a generic functional API for Neural Networks: it can run not only on top of Tensorflow, but also on top of Theano or CNTK. While Tensorflow allows you to perform generic computations as nodes in a graph, Keras exists solely to simplify building Neural Networks.

Let’s see how the exact same model looks in Keras.

```python
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, AveragePooling2D
from keras.optimizers import Adam
from sklearn.metrics import accuracy_score
from keras import backend as K
```

First we import the packages and libraries. The main thing to notice is that Keras has its own built-in layers, as Layers does, but also its own way of calling optimisers, which Layers does not. Indeed, the code and workflow of Keras **do not look or feel** like Tensorflow.

```python
K.clear_session()

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', input_shape=input_shape, padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(loss=keras.losses.sparse_categorical_crossentropy,
              optimizer="Adam",
              metrics=['accuracy'])
```

And this is it. The same model, using significantly fewer lines and characters than above. The workflow in Keras is as follows: choose the type of model, in our case `Sequential`, a built-in class that stacks layers on top of each other; instantiate the class; and start adding layers **in sequence from input to output**. The connections between layers are drawn automatically as the layers are sequentially glued together. Keras knows how to do this.

The `compile` method represents the operational part of the network. In one go, we define the loss, the optimiser, and even the metrics to show during the training stage.

Running is also made a lot easier than in Tensorflow, as the Keras `fit` method knows how to generate mini-batches, accepts validation sets, and runs the required loops over epochs and mini-batches. Of course, this gives us less control over what happens inside these loops, but the convenience is undeniable if one only wants to quickly fit a Neural Network.

```python
print(model.summary())

model.fit(X_train_reshaped, Y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_val_reshaped, Y_val))

test_pred = model.predict_classes(X_test_reshaped)
print("Final accuracy on test set:", accuracy_score(test_pred, Y_test))
```

In addition, Keras has a nice `summary` method that prints the shape and information of the model, including how many parameters are being fit. Once the model is fit, we can use the `predict_classes` method to draw predictions from a test set. The API is very similar to that of SciKit-Learn, which I think is not surprising, as SciKit-Learn has set a comprehensive and consistent standard for a long time.
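The parameter counts that `summary` reports can be reproduced by hand: a convolutional layer has \(k \times k \times c_{in}\) weights plus one bias per filter, and a dense layer has one weight per input plus one bias per neuron. A quick sketch of the arithmetic for our architecture:

```python
def conv_params(k, c_in, c_out):
    # Each of the c_out filters has k*k*c_in weights plus one bias.
    return (k * k * c_in + 1) * c_out

def dense_params(n_in, n_out):
    # Fully connected: one weight per input per neuron, plus one bias each.
    return (n_in + 1) * n_out

counts = [
    conv_params(3, 1, 64),          # conv2d_1: 640
    conv_params(3, 64, 64),         # conv2d_2: 36928
    conv_params(3, 64, 32),         # conv2d_3: 18464
    conv_params(3, 32, 32),         # conv2d_4: 9248
    dense_params(7 * 7 * 32, 256),  # dense_1: 401664 (28 -> 14 -> 7 after two poolings)
    dense_params(256, 10),          # dense_2: 2570
]
print(sum(counts))  # 469514, matching "Total params: 469,514" in the summary
```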

```
Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 28, 28, 64)        640
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 64)        36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 64)        0
_________________________________________________________________
dropout_1 (Dropout)          (None, 14, 14, 64)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 14, 14, 32)        18464
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 32)        9248
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 32)          0
_________________________________________________________________
dropout_2 (Dropout)          (None, 7, 7, 32)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1568)              0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               401664
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2570
=================================================================
Total params: 469,514
Trainable params: 469,514
Non-trainable params: 0
_________________________________________________________________
None
Train on 55000 samples, validate on 5000 samples
Epoch 1/10
55000/55000 [==============================] - 25s 446us/step - loss: 0.2074 - acc: 0.9341 - val_loss: 0.0415 - val_acc: 0.9862
Epoch 2/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0727 - acc: 0.9777 - val_loss: 0.0425 - val_acc: 0.9886
Epoch 3/10
55000/55000 [==============================] - 24s 438us/step - loss: 0.0557 - acc: 0.9827 - val_loss: 0.0366 - val_acc: 0.9896
Epoch 4/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0486 - acc: 0.9854 - val_loss: 0.0351 - val_acc: 0.9896
Epoch 5/10
55000/55000 [==============================] - 24s 439us/step - loss: 0.0406 - acc: 0.9878 - val_loss: 0.0239 - val_acc: 0.9932
Epoch 6/10
55000/55000 [==============================] - 24s 438us/step - loss: 0.0385 - acc: 0.9886 - val_loss: 0.0325 - val_acc: 0.9910
Epoch 7/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0334 - acc: 0.9899 - val_loss: 0.0266 - val_acc: 0.9926
Epoch 8/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0313 - acc: 0.9904 - val_loss: 0.0300 - val_acc: 0.9914
Epoch 9/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0273 - acc: 0.9914 - val_loss: 0.0266 - val_acc: 0.9928
Epoch 10/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0273 - acc: 0.9919 - val_loss: 0.0330 - val_acc: 0.9920
Final accuracy on test set: 0.993
```

Not surprisingly, the model performs similarly with both Tensorflow+Layers and Keras. Again, we notice that the training loss is still on a descending path, while the validation accuracy could perhaps still increase.

It is important to note, though, that **both fits are not the same**. This is because there are many random aspects to training a Neural Network: Stochastic Optimisers, random initialisation of weights, etc. So don’t be surprised if you get different results on your computer.

In this exercise we implemented a Convolutional Neural Network model using both Tensorflow+Layers and Keras. The end results were similar, but the path taken was quite different.

Tensorflow is more than a framework for Neural Networks. It is a framework to build and run computation graphs, where the nodes are operations and the values flowing through them are tensors. Using some low-level helpers, like Layers and the Estimator API (I am planning a future post on this), we can deploy Neural Networks more easily while still being in control of the low-level architecture of our network.

Keras is an attempt to bring Neural Networks to the masses, by providing a functional API that makes constructing and training a Neural Network effortless. We lose the full power of Tensorflow, as we can only communicate with it through the Keras built-in classes, with the attributes and methods provided. For this example we might not have noticed any drawbacks in using Keras, but our task would have been considerably more complicated if we wanted to implement a non-trivial architecture, or to use options that Keras does not expose (for example, at the time of writing, Keras layers do not accept LeakyReLU as an `activation` argument; it is available only as a separate advanced-activation layer).

In the end, Layers and Keras help you achieve different goals. If you need to deploy a Neural Network with a standard architecture, Keras will get you there quickly. If, on the other hand, you need to design your own network, then Layers provides invaluable helpers when constructing a Neural Network directly in Tensorflow.

I hope this was useful to someone. I am planning a few more posts on Tensorflow and other Machine Learning exercises.

```python
import numpy as np
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split
```

The StandardScaler will transform each numerical feature to have zero mean and unit standard deviation. This matters because a Neural Network is sensitive to the scale of its inputs, since each neuron is in essence a linear model

\[ z = W \cdot X + b\ , \]

where \(W\) are the weights, \(X\) the inputs, and \(b\) the biases. The result \(z\) is then fed to the next layer of neurons after being passed through an activation function.
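A minimal NumPy sketch of what StandardScaler does per feature (illustrative; the real class also stores the fitted statistics so it can transform new data consistently):

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardise each column: subtract its mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # ~[1, 1]
```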

The LabelBinarizer will one-hot encode the classes, turning them into sparse arrays. One could argue that this is overkill for a binary classifier, where a single column of zeros and ones would be enough to encode the classes. But keeping the output dimensionality equal to the number of classes allows our model to be easily deployed for both binary and multi-class classification. Note, however, that SciKit-Learn's LabelBinarizer actually returns a single 0/1 column when there are only two classes, rather than two columns.
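A sketch of the one-hot encoding with plain NumPy (LabelBinarizer does the equivalent for three or more classes, and also provides the inverse transform):

```python
import numpy as np

y = np.array([0, 2, 1, 2])          # class labels
n_classes = y.max() + 1

one_hot = np.zeros((len(y), n_classes), dtype=np.int64)
one_hot[np.arange(len(y)), y] = 1   # set a single 1 per row, at the class index
print(one_hot)
```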

With that said, we transform the incoming data. Note that our data processing is simple and considers only data-sets with numerical features; further steps would be needed if the data included categorical features.

```python
data = load_iris()

scaler = StandardScaler()
lb = LabelBinarizer()

X = scaler.fit_transform(data.data).astype(np.float32)
Y_cls = data.target
Y = lb.fit_transform(Y_cls).astype(np.int64)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

n_features = X_train.shape[1]
n_classes = Y_train.shape[1]
```

Here we chose the iris data-set, but the same DNN architecture can be used for other classification problems with numerical features. For example, simply changing `load_iris()` to `load_breast_cancer()` would turn this exercise from a three-class into a binary classification without changing anything else in the code.

For the rest of the code above, notice that we have cast the data-set to specific data types. This is because Tensorflow does not automatically cast variables into the appropriate type given a context. The types above are those I found to work with the functions we use in the DNN defined below. Finally, `train_test_split()` splits the data-set into training and test sets, so that we can evaluate the performance of the DNN on data different from that it was trained on.
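In essence, the split shuffles the rows and slices off the requested test fraction. A NumPy sketch of the idea (sklearn's version adds stratification, reproducible seeding, and more):

```python
import numpy as np

X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# Shuffle the row indices, then slice at the test fraction (30% here).
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
n_test = int(round(0.3 * len(X)))
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```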

It’s time to construct our DNN. There is a generic structure to all Tensorflow implementations that encompasses two steps: construct the network (or graph), and train it (or run it). For the purpose of deep-learning, the nodes on a Tensorflow graph will be defined by the layers supporting the neurons. Since coding each layer becomes quite repetitive and prone to errors, we define a function that returns the outputs of a layer of neurons. This function takes in the incoming values, initialises the weights and biases, and passes the resulting computation through an activation function.

```python
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.shape[1])
        # Initialise the weights from a truncated normal whose width shrinks
        # with the number of inputs, and the biases at zero.
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="Weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        elif activation == "leaky_relu":
            return tf.nn.leaky_relu(z)
        elif activation == "sigmoid":
            return tf.nn.sigmoid(z)
        elif activation == "softmax":
            return tf.nn.softmax(z)
        else:
            return z
```

The weights and biases are Tensorflow Variables, as these are the parameters that Tensorflow will optimise to improve the performance of the network. Recall that Tensorflow has three types of tensors:

- Variables: hold the parameters that Tensorflow can use symbolically (for example, for differentiation) and update during the training phase. These have to be initialised.
- Constants: hold values that are never changed by the training phase.
- Placeholders: empty tensors (with a specific data type) that are assigned values through the `feed_dict` dictionary when evaluating the graph. These are how we feed data into the network, for both training and prediction.

As explained above, each neuron performs the action of a linear model and then applies an activation function before outputting the result. The number of neurons in a layer is set by the dimensions of the weights and biases, which define the dimension of \(z\). The initialisation of the Variables can influence the speed of convergence of the network, and even help avoid non-optimal local minima; it is common practice to sample the initial weights from a normal distribution and to set the biases at zero.
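The choice `stddev = 2/sqrt(n_inputs)` in `neuron_layer` is not arbitrary: it keeps the scale of \(z\) roughly independent of the layer width. A quick NumPy check of the idea (using a plain normal in place of Tensorflow's truncated normal):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 784, 300

# Weights sampled with stddev = 2/sqrt(n_inputs), inputs standardised.
W = rng.normal(0.0, 2 / np.sqrt(n_inputs), size=(n_inputs, n_neurons))
x = rng.normal(0.0, 1.0, size=(10_000, n_inputs))

z = x @ W
# Each z is a sum of n_inputs terms of variance 4/n_inputs, so std(z) ~ 2
# regardless of how wide the incoming layer is.
print(z.std())
```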

The `name_scope` wrapper mostly provides labels for visual Tensorflow tools like TensorBoard, which we are not using here. In this case, we use it to keep the code organised and easy to read.

We are now ready to construct our DNN. A DNN is defined by a sequential stacking of layers: the more layers we have, the deeper the network is. Here we will prepare a DNN with two hidden layers sandwiched between the inputs and the outputs.

```python
tf.reset_default_graph()

n_hidden_1 = 8
n_hidden_2 = 8
learning_rate = 0.001  # Adam's default, defined explicitly so we can tweak it

X = tf.placeholder(tf.float32, shape=[None, n_features], name="X")
y = tf.placeholder(tf.int64, shape=[None, n_classes], name="y")

with tf.name_scope("DNN"):
    hidden1 = neuron_layer(X, n_hidden_1, "hidden1", activation="leaky_relu")
    hidden2 = neuron_layer(hidden1, n_hidden_2, "hidden2", activation="leaky_relu")
    logits = neuron_layer(hidden2, n_classes, "outputs")

with tf.name_scope("loss"):
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits),
        name="avg_xentropy")

with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("accuracy"):
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.name_scope("init"):
    init = tf.global_variables_initializer()
```

Let’s go through these lines with some detail:

- We reset the default graph. When we do not define a new graph, all nodes and tensors in Tensorflow are assigned to the `default_graph`. It is good practice to reset it before running, especially in interactive environments such as Jupyter Notebooks, where we can run cells non-sequentially and mistakenly assign unwanted nodes to the default graph.
- `n_hidden_*` defines the dimensions of the hidden layers, i.e. the output dimension of \(z\) on each layer.
- We define a `placeholder` for each of the inputs we will be using: the feature data, `X`, and the target labels, `y`.
- The first hidden layer is fed the data `X` as input, and the second layer is then fed the outputs of the first layer, `hidden1`.
- The output of the second layer has the dimension of the number of classes, and each entry of the tensor is a real number. These are called `logits`, as opposed to probabilities or definite class assignments. Feeding the logits to a `softmax` function returns the probabilities that a given observation belongs to each of the classes.
- Once the data has been fed forward through both hidden layers, we compare the output with the target labels by computing the `loss`. For multi-class problems, the `logits` are fed through a `softmax` function, and the resulting probabilities are used to compute the `cross-entropy` loss. Tensorflow gives us a function that does this in one go, taking the logits and comparing them directly to the correct labels, `y`. The average cross-entropy is then computed with `tf.reduce_mean`, a function that computes the mean of a tensor.
- Having defined the loss function, we now have our criterion for training the network: minimise the loss. To do this, we choose an `optimizer` and tell it to minimise the loss. We chose `Adam`, which should always be the first choice. It is a stochastic optimiser (meaning it uses random mini-batches of the data per training step) with adaptive learning rate and momentum; the former helps prevent jumping over the minimum and the latter speeds up convergence. We define a starting learning rate to give us the chance to tweak it, though for simple problems this is barely necessary.
- The node `training_op` is the operation we will run during training: minimising the loss with the optimiser we chose.
- Having defined all the structure needed to train our network, we add a few more operations to help us assess its performance. We define an `accuracy` node, which measures how often our predictions are correct, by comparing the predicted classes with the true labels. Since the `softmax` function is monotonic in its arguments, we can use the largest `logit` as the criterion for the most probable class.
- Finally, we set up an operation to initialise all Variables. This samples the initial values of the Variables from the distributions we assigned.
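That last monotonicity point, that we may compare logits directly instead of probabilities, is easy to verify numerically (an illustrative NumPy check, not part of the graph):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

logits = np.array([1.2, -0.3, 3.1, 0.7])
# Softmax is strictly increasing, so it preserves the ordering of the logits.
print(np.argmax(logits), np.argmax(softmax(logits)))  # both 2
```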

With the DNN set up, we are ready to train. Almost ready, I should say, as the code above does not provide a way to generate mini-batches to feed the network. For that, we create an iterator function that returns mini-batches. I found this piece of code online in a notebook whose source I failed to keep, but I hope the original author does not mind me sharing it further!

```python
def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    assert inputs.shape[0] == targets.shape[0]
    if shuffle:
        indices = np.arange(inputs.shape[0])
        np.random.shuffle(indices)
    for start_idx in range(0, inputs.shape[0] - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt], targets[excerpt]
```
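One detail worth knowing: because the loop runs up to `inputs.shape[0] - batchsize + 1`, any remainder smaller than a full mini-batch is silently dropped. A quick check (the iterator is re-defined here so the snippet is self-contained):

```python
import numpy as np

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    assert inputs.shape[0] == targets.shape[0]
    if shuffle:
        indices = np.arange(inputs.shape[0])
        np.random.shuffle(indices)
    for start_idx in range(0, inputs.shape[0] - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt], targets[excerpt]

X = np.zeros((105, 4))   # e.g. a 70% training split of the 150 iris rows
Y = np.zeros((105, 3))
batches = list(iterate_minibatches(X, Y, batchsize=50))
print(len(batches))  # 2: the last 5 samples do not form a full batch
```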

With all the ingredients put together, we are ready to train our network! For that, we need to choose the number of epochs, during each of which every training point is fed forward through the network once, and the size of the mini-batches. Every 10 epochs we print the average training loss over the mini-batches and the accuracy on both the training and test data-sets.

```python
n_epochs = 100
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        total_loss = 0.
        total_batch = int(X_train.shape[0] / batch_size)
        for X_batch, Y_batch in iterate_minibatches(X_train, Y_train, batchsize=batch_size):
            _, l = sess.run([training_op, loss], feed_dict={X: X_batch, y: Y_batch})
            total_loss += l
        if epoch % 10 == 0:
            avg_loss = total_loss / batch_size
            print("Epoch:", '%02d' % (epoch + 1),
                  "| Average Training Loss= {:.2f}".format(avg_loss),
                  "| Training Accuracy: {:.2f}".format(accuracy.eval({X: X_train, y: Y_train})),
                  "| Test/Validation Accuracy: {:.2f}".format(accuracy.eval({X: X_test, y: Y_test})))
    print("Model fit complete.")
    print("Final Training Accuracy: {:.2f}".format(accuracy.eval({X: X_train, y: Y_train})))
    print("Final Validation Accuracy: {:.2f}".format(accuracy.eval({X: X_test, y: Y_test})))
```

The results for the iris data-set are shown below.

```
Epoch: 01 | Average Training Loss= 0.14 | Training Accuracy: 0.02 | Test Accuracy: 0.02
Epoch: 11 | Average Training Loss= 0.03 | Training Accuracy: 0.81 | Test Accuracy: 0.82
Epoch: 21 | Average Training Loss= 0.02 | Training Accuracy: 0.85 | Test Accuracy: 0.91
Epoch: 31 | Average Training Loss= 0.01 | Training Accuracy: 0.94 | Test Accuracy: 0.96
Epoch: 41 | Average Training Loss= 0.01 | Training Accuracy: 0.95 | Test Accuracy: 0.93
Epoch: 51 | Average Training Loss= 0.01 | Training Accuracy: 0.96 | Test Accuracy: 0.93
Epoch: 61 | Average Training Loss= 0.01 | Training Accuracy: 0.97 | Test Accuracy: 0.93
Epoch: 71 | Average Training Loss= 0.00 | Training Accuracy: 0.97 | Test Accuracy: 0.93
Epoch: 81 | Average Training Loss= 0.00 | Training Accuracy: 0.98 | Test Accuracy: 0.93
Epoch: 91 | Average Training Loss= 0.00 | Training Accuracy: 0.99 | Test Accuracy: 0.93
Model fit complete.
Final Training Accuracy: 0.99
Final Validation Accuracy: 0.93
```

Not too bad, although we see indications of over-fitting: around the halfway point the training accuracy keeps increasing while the test accuracy decreases, and subsequent gains in training accuracy bring no further benefit to test accuracy. This is not uncommon; neural networks are known to have multiple local minima, and the problem at hand is rather simple. We could employ regularisation techniques such as Lasso, Ridge, or Dropout, but I will leave that for another blog post.

Just before we finish, let's check that this DNN can also run a binary classification; for that we just change `load_iris()` to `load_breast_cancer()` at the top of the notebook. The results are below:

```
Epoch: 01 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 11 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 21 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 31 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 41 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 51 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 61 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 71 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 81 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Epoch: 91 | Average Training Loss= 0.00 | Training Accuracy: 1.00 | Test Accuracy: 1.00
Model fit complete.
Final Training Accuracy: 1.00
Final Validation Accuracy: 1.00
```

That looks like quite an accomplishment for such a simple model, although a perfect score with zero loss from the very first epoch should raise suspicion. Recall that LabelBinarizer encodes a two-class problem as a single 0/1 column, so here the network has a single output logit, and a softmax over a single logit is trivially one. It is worth encoding the binary labels as two columns and re-running before reading too much into these numbers.

I will continue to post my examples and discuss my ideas on different Machine and Deep Learning techniques here. Stay tuned!

Very happy to have you here. Unfortunately, the site is still under construction and is by no means ready for visitors like yourself!

Over the next few days, I will be adding my game portfolio, a blog, and other goodies.

Stay tuned!

Cheers!

Miguel
