# Convolutional Neural Networks for the MNIST Dataset: Tensorflow+Layers and Keras

Theoretical Physicist, Data Scientist, enthusiast Game Designer.


In this post, we are going to go through how to construct a Convolutional Neural Network (CNN) and train it to recognise hand-written digits with the MNIST dataset. In fact, we will do this twice. First we will construct the CNN using Tensorflow functions and some auxiliary classes from Layers (tf.layers), a relatively low-level abstraction that simplifies constructing layers in Tensorflow. Then, we will use Keras, a high-level abstraction that wraps Tensorflow and Theano functions in a unified functional API.

The purpose of this exercise is to see how we can build the same model with two very different approaches, and assess the pros and cons of each workflow.

## Neural Networks: super quick and dirty review

Before we start, it might be a good idea to review what a CNN is. If you have played around with Deep Neural Networks (like the one presented here) you may already be familiar with the notion of layers, forward feed, back propagation, etc. If not, let’s do a quick review.

A Neural Network is a Machine Learning algorithm that is composed of layers, each containing a certain number of neurons (also called units or cells). The data enters through the first layer and is then propagated to the next, and so on until it reaches the output layer, in a process called forward propagation (or feed-forward).

An Artificial Neural Network (source: Wikimedia)

Each neuron behaves as a linear model

$z = X \cdot W + b \ ,$
where $$X$$ is the incoming connection, $$W$$ the weights, and $$b$$ the biases. The weights and the biases act just like the slopes and intercept of a Linear Regression, for example, and are the parameters to be fit by the training. Once $$z$$ is computed, it is passed through an activation function and the resulting output is passed on to the next layer, etc. The activation function can be non-linear. This, in addition to the fact that a Neural Network can combine multiple linear models at once, allows the Neural Network to fit non-linear relations with regards to the target variable.
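
As a minimal sketch of the computation just described, the snippet below evaluates one dense layer with NumPy. The layer sizes, the random inputs, and the choice of tanh as activation are assumptions made purely for illustration.

```python
import numpy as np

def dense_forward(X, W, b, activation=np.tanh):
    """Compute activation(X . W + b) for a batch of inputs."""
    z = X @ W + b          # the linear model z = X . W + b
    return activation(z)   # the non-linearity, applied element-wise

# Hypothetical sizes: 4 observations, 3 input features, 2 neurons
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
b = np.zeros(2)

out = dense_forward(X, W, b)
print(out.shape)  # (4, 2): one output per observation and neuron
```

Stacking several such calls, each feeding its output to the next, gives the forward propagation through a whole network.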

At the output layer, the predicted values of the target variable are compared with the real values (during the training phase, of course), and the weights and biases are adjusted so as to reduce the error (also known as loss) through a process known as back propagation, the details of which we will omit.

A layer where all its neurons are connected to all the neurons in the previous and next layers is called Dense. A network composed only of Dense layers is called a Dense Neural Network (DNN). This is the case of the Artificial Neural Network presented in the image above.

## The super simple introduction to Convolutional Networks

Above, the data is fed as an array represented by the input layer. Each cell in the input layer represents a feature of the data being fed to the network. This makes DNNs especially suitable for data where:

• The position of each feature throughout all observations is fixed
• That is, the DNN will evaluate a new instance knowing that what is coming from cell $$i$$ bears the same meaning as in the previous instances.
• The relative position of features in the dataset is irrelevant (for example, you can shuffle all the features the same way for all observations)
• That is, there is no structure in the data that enhances our comprehension.

Clearly there is a simple task for which a DNN is not suitable: images. Why?

• In an image, the pixels around any given pixel carry meaning, as they provide context for that pixel.
• Shuffling an image will destroy any structure. We clearly lose information if we shuffle the pixels of an image!
• A picture of a dog on the bottom-right or on the top-left corner of a white square does not change the fact that it is a dog. The same applies to rotation, scaling, etc.

If we are to develop models that can recognise images, we need a new architecture that can tackle these issues: enter Convolutional Neural Networks!

### Featuring convolutions…

CNNs were proposed following breakthroughs in understanding how the biological brain learns to see and interpret the information coming from the eyes. Studies showed that we do not observe the whole image and interpret it at once, but we process sub-regions of the image separately in an attempt to find features.

This explains, for example, why we can identify faces in cartoon drawings. Cartoon faces bear the same features as real faces, like eyes, a smile, a border, etc. We can also tell a cat from a dog straight away just by looking at noses or paws, without seeing the whole picture.

A CNN works the same way, as it is constructed to be efficient at finding features and deciding which are relevant for a certain classification task. To do so, a CNN has two new types of layers: convolutional and pooling layers. Convolutional layers share their name with the mathematical convolution operation, which will be familiar to those who have worked on signal processing. Fortunately, for the rest of us, there is a more visual way to understand this using image processing. In image processing, a convolution is carried out with a kernel or filter: a small matrix that is multiplied element-wise with a small region of an image, after which the elements of the resulting matrix are added up. This process is repeated throughout the rest of the original image in regular steps, whose size is known as the stride, resulting in a new image.

For example, consider the kernel (kernel and filter are used interchangeably) of the form
$K = \begin{bmatrix} -1 & -1 & -1 \\ -1 & \ \ 8 & -1 \\ -1 & -1 & -1 \end{bmatrix} \ ,$
which is known as an edge detector, as the resulting value changes significantly wherever there is a sharp change in colour or hue. If we apply this filter to this image

Original image (source: Wikimedia)

we get the image below:

Image after passing through an Edge-detection matrix (source: Wikimedia)

This is, in essence, what a convolutional layer is: a stack of images, each the result of a different filter. Each image in this stack is called a feature map. The filters are small matrices whose entries are initialised (typically at random) and then adjusted during training: they are the weights of a convolutional layer. Training a CNN is then the process of finding the most relevant features for the classification task at hand.
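
To make the sliding-kernel picture concrete, here is a small NumPy sketch that applies the edge-detection kernel $$K$$ from above to a toy image. The helper name convolve2d and the 5x6 test image are made up for this illustration, no padding is used, and the kernel is not flipped (as is customary in CNNs; it makes no difference here because $$K$$ is symmetric).

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

K = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]])

# Toy 5x6 image: dark left half (0), bright right half (1)
image = np.zeros((5, 6))
image[:, 3:] = 1.0

feature_map = convolve2d(image, K)
print(feature_map)  # every row is [0, -3, 3, 0]
```

The output is zero wherever the image is uniform and non-zero only around the vertical edge, which is exactly what an edge detector should do.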

### … and pooling

Each feature map is produced by scanning an image with a single kernel and provides a looking glass into a single feature, but real-world images have many features. For example, if we want to distinguish human faces from cars or random pictures, we need not only to identify eyes and mouths, but also rims and windshields. Therefore, we need more and more kernels as the complexity of the images increases. This can lead to two issues: computer memory problems, as any computer has limited storage and processing capacities, and overfitting. To help tackle this we introduce a new ingredient: pooling layers.

A pooling layer is a layer that sub-samples incoming information. It is defined by a kernel and a pooling operation. The kernel, just like with the convolutional layer, defines the size of a sub-region of the incoming data on which we apply a pooling operation, such as average or max. This process returns a value as a function of the pixels in that sub-region (notice that we are considering a picture as being 3-d: height, width, and the channel, i.e. colours). Setting the stride to the size of the kernel, we effectively reduce the image to a $$height/stride \times width/stride$$ image.

For example, a pooling layer with a kernel of side $$2$$ and stride of the same size, will produce an image where the sides have half the size, leading to a quarter of the original pixels. This not only helps in reducing memory usage when training a CNN, it also forces the training to use the features that are relevant and drop those that are not, which helps slightly in preventing overfitting.
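
The following sketch illustrates pooling with a $$2 \times 2$$ kernel and a stride of 2 on a single-channel image. The helper pool2d is a hypothetical name, and the implementation assumes that the stride equals the kernel size.

```python
import numpy as np

def pool2d(image, size=2, op=np.max):
    """Apply `op` to non-overlapping size x size blocks (stride equal to the kernel size)."""
    h, w = image.shape
    h, w = h - h % size, w - w % size   # drop rows/columns that do not fill a block
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return op(blocks, axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)

max_pooled = pool2d(image, op=np.max)    # [[ 5,  7], [13, 15]]
avg_pooled = pool2d(image, op=np.mean)   # [[ 2.5,  4.5], [10.5, 12.5]]
print(max_pooled)
print(avg_pooled)
```

A 4x4 image becomes a 2x2 image: each side is halved and only a quarter of the pixels remain.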

### Typical architecture

So what does a CNN look like? There is room for a lot of creativity, but the idea is that we try to add convolutional layers close to the input layer, followed by a pooling layer, and so on. The pairing of convolution and pooling layers is often called a ConvPool layer and should be the basic building block of a CNN. But there are many other options, for example stacking a few convolutional layers before a pooling layer, etc.

In principle, the deeper and the taller (i.e. more feature maps) the CNN, the more features we can use to classify images. Unfortunately, very deep networks have many issues: overfitting, vanishing gradients, memory usage, training times, etc. So it is no surprise that the state-of-the-art is ever-changing. For example, look into the ImageNet challenge, where every year all the hotshots (from academics to Google itself) compete to improve accuracy by a few fractions of a percent!

Finding the perfect sequence of convolutional and pooling layers is then a bit of an art that takes a lot of work, hyper-parameter searching, etc. But there is another ingredient that is more straightforward: the final and output part of the network. After the last convolutional or pooling layer, we are left with a stack of relevant feature maps. At this stage, we should only have the relevant information regarding each feature on each map, meaning that the relevant information was separated and broken down into pieces. We can now flatten all of this into an array and feed it into a DNN. This last part of the CNN is often called a voter, as it has to decide to which class the image belongs by looking at the final feature maps.

Putting this all together, a typical CNN looks like this:

A typical CNN. Note that subsampling is what we call a pooling layer (source: exponea.com)

## A CNN on Tensorflow using Layers

We are now ready to construct a CNN! Building convolutional and pooling layers from scratch is a cumbersome effort that is prone to errors. This is because of the intricate shapes of the layers, which make weight initialisation and shape-matching between consecutive layers two tasks that require extra attention. To spare ourselves this effort, we will cheat a bit and use an abstraction called Layers, from tf.layers. For more information see here.

Layers is a relatively low-level abstraction, like the Estimator API, that simplifies our lives by providing classes for the most commonly used layers, taking care of the most repetitive and cumbersome tasks. In order for this exercise to be as close to pure Tensorflow as possible, we will try to use as little functionality from Layers as possible.

### Our architecture

We will deploy a simple, but powerful, architecture:
$Conv \to Conv \to Pool \to Conv \to Conv \to Pool \to Dense \to Output \ .$

On top of this, after each $$Pool$$ and after $$Dense$$, we will apply Dropout. Dropout is a very common regularisation technique for Neural Networks. At each mini-batch during training, a random subset of neuron outputs is set to zero (dropped out), so that the network has to learn from different combinations of neurons. Since this dropout is randomised, it forces the neurons to cooperate in different ways and helps prevent overfitting. Although seemingly quite aggressive, dropout works amazingly well. The full code can be found on my GitHub.
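
To see what dropout does to a layer's output, here is a NumPy sketch of one common variant, usually called inverted dropout, in which the surviving outputs are rescaled so that the expected value stays the same. The function name and the toy input are assumptions for illustration; this is not the code used in the model below.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of the outputs and rescale the rest."""
    if not training or rate == 0.0:
        return activations                          # dropout is switched off outside training
    keep = rng.random(activations.shape) >= rate    # True for the outputs that survive
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(42)
a = np.ones((1000, 10))

dropped = dropout(a, rate=0.25, training=True, rng=rng)
print((dropped == 0).mean())  # roughly a quarter of the outputs are zeroed
print(dropped.mean())         # the mean stays close to 1 thanks to the rescaling
```

Because a new random mask is drawn on every call, each mini-batch sees a slightly different network.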

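    import numpy as np
    import tensorflow as tf

    from tensorflow.examples.tutorials.mnist import input_data

    # Download (if needed) and load MNIST; the directory is an assumed location
    mnist = input_data.read_data_sets("/tmp/data/")

    height = 28
    width = 28
    n_inputs = height * width
    channels = 1
    input_shape = (height, width, channels)
    batch_size = 50
    num_classes = 10
    epochs = 10

    # Flat images (one row of 784 pixels per observation), used by the Tensorflow model
    X_train = mnist.train.images
    X_val = mnist.validation.images
    X_test = mnist.test.images

    # Images reshaped to (observations, height, width, channels), used by the Keras model
    X_train_reshaped = X_train.reshape(X_train.shape[0], height, width, channels)
    X_val_reshaped = X_val.reshape(X_val.shape[0], height, width, channels)
    X_test_reshaped = X_test.reshape(X_test.shape[0], height, width, channels)

    # Integer labels from 0 to 9
    Y_train = mnist.train.labels
    Y_val = mnist.validation.labels
    Y_test = mnist.test.labels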

This first part is straightforward: we import the relevant packages and the MNIST dataset. This is a dataset of $$28\times28$$ greyscale (i.e. only one channel) images of hand-written digits (the labels are from 0 to 9) and serves as a good benchmark for CNNs, or in general for Computer Vision problems. Tensorflow has its own function to import MNIST, and the data comes in three subsets: train, validation, and test. Training is performed with the train dataset. While training, the validation dataset lets us assess how well the model performs on data it has never seen before. This information can be used to assess whether the model is over- or underfitting, or as an early-stopping criterion. Finally, after the training is done, we test the model one last time on the test dataset, which gives us a final assessment. The purpose of the test dataset is to prevent us from being misled about the quality of the model by looking only at the validation scores during training.

    conv1_fmaps = 64
    conv1_ksize = 3
    conv1_stride = 1
    conv1_act = tf.nn.relu

    conv2_fmaps = 64
    conv2_ksize = 3
    conv2_stride = 1
    conv2_act = tf.nn.relu

    pool1_fmaps = conv2_fmaps
    pol1_ksize = [1, 2, 2, 1]
    pol1_strides = [1, 2, 2, 1]

    dropout_1 = 0.25

    conv3_fmaps = 32
    conv3_ksize = 3
    conv3_stride = 1
    conv3_act = tf.nn.relu

    conv4_fmaps = 32
    conv4_ksize = 3
    conv4_stride = 1
    conv4_act = tf.nn.relu

    pool2_fmaps = conv4_fmaps
    pol2_ksize = [1, 2, 2, 1]
    pol2_strides = [1, 2, 2, 1]

    dropout_2 = 0.25

    dense1_neurons = 256
    dense1_act = tf.nn.relu

    dropout_3 = 0.5

    n_outputs = num_classes

Here we define the parameters of the CNN layers.

• The fmaps parameter defines how many feature maps we stack on each layer. Notice that the pooling layers need to have the same number of feature maps as the previous layer. The reason we go from $$64$$ to $$32$$ is to force the CNN to start trimming and ignoring the least relevant features. This is a form of encoding that helps the CNN focus on what matters.
• The ksize parameter is the size of the kernel matrix.
• stride is how many pixels the kernel moves to the right (and downwards once it reaches the right edge) for each operation.
• The padding parameter of the convolutional layers refers to how we deal with the edges of the picture. If SAME, then Tensorflow extends the picture with a border of zero-valued pixels (padding), while with VALID it does not and instead considers the image in its original shape.
• The act refers to the activation function, and relu is the go-to default.
• Finally, the dropout parameters refer to the probability that a given neuron in the previous layer will be ignored during training, and the neurons parameters refer to how many neurons we are including in the last dense and output layers.

Let’s explain better what these parameters mean. Take, for example, the first convolution layer with a $$3 \times 3$$ kernel and padding SAME. Since the padding adds a border, the first sub-image the kernel looks at is the top-left $$3\times3$$ corner of the padded image, with the first pixel of the original image in its centre. Then, the kernel moves one pixel to the right since stride=1. In the end, since the padding extends the initial image, the output image has the same size as the input. If the padding were set to VALID instead, it would not be possible to have a kernel centred on each of the original pixels, and therefore we would get a feature map smaller than the original image.

Likewise, in the pooling layers we consider a $$2 \times 2$$ kernel and stride=2, so that each pixel is covered only once, in blocks of 2. The output image is going to be half the size. Notice that for the pooling layers we have different formats for the parameters. This is because we will use a Tensorflow built-in function for the pooling layers, but Layers for the convolution layers. The shape [1,2,2,1] refers to [batch_size, height, width, channels] and we do not wish to pool along the observations (first index) or channels (last index).
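
The effect of padding and stride on the image size can be checked with a few lines of plain Python. The helper output_size is a hypothetical name; the formulas follow Tensorflow's SAME and VALID conventions for a single spatial dimension.

```python
import math

def output_size(size, kernel, stride, padding):
    """Side length of the output along one spatial dimension."""
    if padding == "SAME":
        # Zero-padding is added as needed, so only the stride shrinks the image
        return math.ceil(size / stride)
    if padding == "VALID":
        # No padding: the kernel must fit entirely inside the image
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

print(output_size(28, kernel=3, stride=1, padding="SAME"))   # 28: size preserved
print(output_size(28, kernel=3, stride=1, padding="VALID"))  # 26: one pixel lost on each side
print(output_size(28, kernel=2, stride=2, padding="VALID"))  # 14: the 2x2 pooling halves the side
```

Applying the pooling step a second time takes the 14-pixel side down to 7, which is the size of the feature maps that reach the dense part of the network.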

• Pooling layers do not have weights! That is why we use the plain Tensorflow function. It is easy enough to use.
• They also do not have activation functions. We can add them by glueing an activation function node afterwards, but it’s not customary. The convolutional layers are the ones with weights: they define the feature maps and, as such, play an active part in the training process.

This is the creative part of the code. Now we need to write the code that will turn these numbers into a CNN.

    tf.reset_default_graph()

    with tf.name_scope("inputs"):
        X = tf.placeholder(tf.float32, shape=[None, n_inputs], name="X")
        X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
        y = tf.placeholder(tf.int64, shape=[None], name="y")
        # Defaults to False, so dropout is only active when we feed training=True
        training = tf.placeholder_with_default(False, shape=[], name='training')

    with tf.name_scope("ConvPolLayer1"):
        # padding="SAME" keeps the 28x28 size, as described in the text
        convulution_1 = tf.layers.conv2d(X_reshaped,
                                         filters=conv1_fmaps,
                                         strides=conv1_stride,
                                         kernel_size=conv1_ksize,
                                         padding="SAME",
                                         activation=conv1_act,
                                         name="conv1")
        convulution_2 = tf.layers.conv2d(convulution_1,
                                         filters=conv2_fmaps,
                                         strides=conv2_stride,
                                         kernel_size=conv2_ksize,
                                         padding="SAME",
                                         activation=conv2_act,
                                         name="conv2")
        # tf.nn.avg_pool requires a padding argument; for even sizes with a
        # 2x2 kernel and stride 2, VALID and SAME give the same result
        pool_1 = tf.nn.avg_pool(convulution_2,
                                ksize=pol1_ksize,
                                strides=pol1_strides,
                                padding="VALID",
                                name="pol1")
        pool_1_drop = tf.layers.dropout(pool_1,
                                        dropout_1,
                                        training=training)

    with tf.name_scope("ConvPolLayer2"):
        convulution_3 = tf.layers.conv2d(pool_1_drop,
                                         filters=conv3_fmaps,
                                         strides=conv3_stride,
                                         kernel_size=conv3_ksize,
                                         padding="SAME",
                                         activation=conv3_act,
                                         name="conv3")
        convulution_4 = tf.layers.conv2d(convulution_3,
                                         filters=conv4_fmaps,
                                         strides=conv4_stride,
                                         kernel_size=conv4_ksize,
                                         padding="SAME",
                                         activation=conv4_act,
                                         name="conv4")
        pool_2 = tf.nn.avg_pool(convulution_4,
                                ksize=pol2_ksize,
                                strides=pol2_strides,
                                padding="VALID",
                                name="pol2")
        pool_2_drop = tf.layers.dropout(pool_2,
                                        dropout_2,
                                        training=training)

    with tf.name_scope("Flatten"):
        # shape is [batch, height, width, feature maps]; flatten everything but the batch
        shape = pool_2_drop.get_shape().as_list()
        last_flat = tf.reshape(pool_2_drop, [-1, shape[1] * shape[2] * shape[3]])

    with tf.name_scope("Dense1"):
        dense1 = tf.layers.dense(last_flat,
                                 units=dense1_neurons,
                                 activation=dense1_act,
                                 name="fc1")
        dense1_drop = tf.layers.dropout(dense1,
                                        dropout_3,
                                        training=training)

    with tf.name_scope("Output"):
        logits = tf.layers.dense(dense1_drop,
                                 n_outputs,
                                 name="output")
        Y_proba = tf.nn.softmax(logits, name="Y_proba")

    with tf.name_scope("Train"):
        xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
        loss = tf.reduce_mean(xentropy)
        # Adam with its default settings, as mentioned in the text
        optimizer = tf.train.AdamOptimizer()
        training_op = optimizer.minimize(loss)

    with tf.name_scope("Eval"):
        correct = tf.nn.in_top_k(logits, y, 1)
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    with tf.name_scope("Init"):
        init = tf.global_variables_initializer()

Hopefully this is readable. I wrapped different parts of the CNN with tf.name_scope, even though we are not using TensorBoard or other visualisation tools. It’s good practice to keep your code organised, and a Neural Network is no different!

First we define the inputs through the respective placeholders with the correct shapes and data types. We also define a placeholder for a Boolean training variable that lets us turn the dropout off when we are not training.

Then we have the first ConvPool layer, in fact, ConvConvPool. Here we can see how convenient Layers can be. All the tasks of initialising the weights of the kernels in the convolutional layers are taken care of. Layers does provide further arguments to customise these initialisations, but I found the default to work quite well for this exercise.

Next we pool the data to images of half-size, and add a dropout capability. This will drop some of the outputs passed from this layer to the next on each mini-batch during training.

Afterwards we have another ConvConvPool layer, which does basically the same but works only with $$32$$ feature maps. The idea is that we force the CNN to focus on the most relevant features.

After this we flatten the output of the last pooling layer and feed it to the voter DNN. This Dense part will then evaluate which class each observation belongs to by looking at the features selected by the ConvConvPool layers.
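
The flattening step is just a reshape. As a small NumPy illustration, assuming a hypothetical batch of 50 observations and the $$7 \times 7$$ maps with 32 feature maps that the second pooling layer produces when the convolutions preserve the image size:

```python
import numpy as np

# Hypothetical batch of 50 outputs from the last pooling layer
pooled = np.zeros((50, 7, 7, 32))

n_features = int(np.prod(pooled.shape[1:]))   # 7 * 7 * 32
flat = pooled.reshape(-1, n_features)          # -1 lets NumPy infer the batch dimension

print(flat.shape)  # (50, 1568)
```

Each observation is now a plain array of 1568 numbers, which is the format a Dense layer expects.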

Finally, we define the operational part of the network. First we introduce the Train functionalities, which are composed of a loss function and the operation to minimise it using a certain optimiser. We use the go-to default Adam as optimiser, while the cross-entropy is the default for multi-class problems using softmax. The Eval part defines an accuracy metric to be used for assessment and evaluation of the network. We finish with the function to initialise all variables, i.e. to kick-off all the weights when we run the network.
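
For readers who want to see what the loss and the accuracy compute, here is a NumPy sketch of a softmax cross-entropy with integer labels and of a top-1 accuracy. The function names and the toy logits are made up for illustration; the Tensorflow operations used above work on tensors inside the graph rather than on NumPy arrays.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(logits, labels):
    """Mean of -log(probability assigned to the true class); labels are integer class ids."""
    proba = softmax(logits)
    return -np.mean(np.log(proba[np.arange(len(labels)), labels]))

def accuracy(logits, labels):
    """Fraction of observations whose largest logit is at the true class."""
    return np.mean(np.argmax(logits, axis=1) == labels)

# Hypothetical logits for 3 observations and 3 classes
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0],
                   [1.0, 1.5, 0.3]])
labels = np.array([0, 2, 0])

loss = sparse_cross_entropy(logits, labels)
acc = accuracy(logits, labels)
print(loss, acc)  # the third observation is misclassified, so acc is 2/3
```

A network that outputs identical logits for all ten digits has a loss of $$\log 10 \approx 2.30$$, which is a useful reference point when reading training logs.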

Time to train and test our model!

    n_epochs = 10
    batch_size = 50
    n_batches = mnist.train.num_examples // batch_size

    with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):
            # One pass over the training set in mini-batches, with dropout switched on
            for iteration in range(n_batches):
                X_batch, Y_batch = mnist.train.next_batch(batch_size)
                batch_loss, batch_acc, _ = sess.run(
                    [loss, accuracy, training_op],
                    feed_dict={X: X_batch, y: Y_batch, training: True})
            # Validation metrics once per epoch (training defaults to False: no dropout)
            epoch_loss_val = loss.eval(feed_dict={X: X_val, y: Y_val})
            epoch_acc_val = accuracy.eval(feed_dict={X: X_val, y: Y_val})
            print("-" * 50)
            print("Epoch {} | Last Batch Train Loss: {:.4f} | Last Batch Train Accuracy: {:.4f} | Validation Loss: {:.4f} | Validation Accuracy: {:.4f} ".format(epoch + 1, batch_loss, batch_acc, epoch_loss_val, epoch_acc_val))
        # Final assessment on the test set, after all epochs
        acc_test = accuracy.eval(feed_dict={X: X_test, y: Y_test})
        print("Final accuracy on test set:", acc_test)

This part of the code is almost self-explanatory. The MNIST dataset from Tensorflow is actually a full-featured class that includes an iterator to generate mini-batches. Very convenient! At each epoch we compute how the network is performing by looking at training and validation accuracies, and afterwards we compute the accuracy on the test dataset. The results are shown below:
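
If your data does not come with such an iterator, mini-batches are easy to generate by hand. The sketch below is one possible way to do it with NumPy; iterate_minibatches is a hypothetical helper and the random arrays merely stand in for real images and labels. It is not how the Tensorflow MNIST class is implemented.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once (one epoch)."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# Hypothetical stand-in data: 120 observations with 784 features each
rng = np.random.default_rng(0)
X = rng.random((120, 784))
y = rng.integers(0, 10, size=120)

batches = list(iterate_minibatches(X, y, batch_size=50, rng=rng))
print([len(xb) for xb, _ in batches])  # [50, 50, 20]
```

Note that the last batch is smaller when the number of observations is not a multiple of the batch size.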

--------------------------------------------------
Epoch 1 | Last Batch Train Loss: 0.0246 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0535 | Validation Accuracy: 0.9848
--------------------------------------------------
Epoch 2 | Last Batch Train Loss: 0.0292 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0442 | Validation Accuracy: 0.9876
--------------------------------------------------
Epoch 3 | Last Batch Train Loss: 0.0511 | Last Batch Train Accuracy: 0.9600 | Validation Loss: 0.0339 | Validation Accuracy: 0.9906
--------------------------------------------------
Epoch 4 | Last Batch Train Loss: 0.0095 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0302 | Validation Accuracy: 0.9912
--------------------------------------------------
Epoch 5 | Last Batch Train Loss: 0.0574 | Last Batch Train Accuracy: 0.9800 | Validation Loss: 0.0276 | Validation Accuracy: 0.9920
--------------------------------------------------
Epoch 6 | Last Batch Train Loss: 0.0234 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0227 | Validation Accuracy: 0.9930
--------------------------------------------------
Epoch 7 | Last Batch Train Loss: 0.0039 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0260 | Validation Accuracy: 0.9916
--------------------------------------------------
Epoch 8 | Last Batch Train Loss: 0.0045 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0257 | Validation Accuracy: 0.9922
--------------------------------------------------
Epoch 9 | Last Batch Train Loss: 0.0152 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0260 | Validation Accuracy: 0.9916
--------------------------------------------------
Epoch 10 | Last Batch Train Loss: 0.0085 | Last Batch Train Accuracy: 1.0000 | Validation Loss: 0.0213 | Validation Accuracy: 0.9938
Final accuracy on test set: 0.9932

This is actually a pretty good result! Before the introduction of CNN, the accuracy record was below 99% (regardless of how deep, DNN seem to always stabilise below 99%), and getting to 99.3% is quite good as the world record is around 99.7% (or was, the last time I read about it). It is quite impressive we can achieve such a high result with a CNN we can run at home. Furthermore, we notice that the Validation Accuracy was still increasing, while the Training Loss was still stabilising at lower levels. This indicates that, in principle, our CNN could perform better if we leave it running for a bit longer! Notice, however, that the longer it runs the higher is the risk of overfitting. Assessing the number of epochs, which is a form of regularisation, is the subject of a future post.
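
One simple way of choosing the number of epochs is early stopping: stop once the validation loss has not improved for a given number of epochs (the patience). The sketch below replays the validation losses from the log above through such a rule; the function name and the patience values are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation losses copied from the training log above
val_losses = [0.0535, 0.0442, 0.0339, 0.0302, 0.0276,
              0.0227, 0.0260, 0.0257, 0.0260, 0.0213]

print(early_stopping_epoch(val_losses, patience=3))  # 9
print(early_stopping_epoch(val_losses, patience=5))  # None
```

With a patience of 3 the rule would have stopped at epoch 9, just before the best validation loss of the run, which shows how sensitive the outcome can be to this choice.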

So far we mostly used Tensorflow built-in functions and classes. We cheated a bit, I admit, by using Layers. But even then, the code feels and looks like proper Tensorflow, as Layers provides only shortcuts for Neural Network applications to be used on a Tensorflow workflow and code.

Then there is Keras, a completely different beast. Keras was created by François Chollet, an engineer at Google (the company behind Tensorflow), and is a higher-level abstraction. Indeed, Keras is more of a generic functional API for Neural Networks and it can run not only on top of Tensorflow, but also on top of Theano or CNTK. While Tensorflow allows you to perform generic computations as nodes in a graph, Keras exists to simplify building Neural Networks.

Let’s see how the exact same model looks in Keras.

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, AveragePooling2D
from sklearn.metrics import accuracy_score
from keras import backend as K

First we import the packages and libraries. The main thing to notice is that Keras has built-in layers, like Layers, and also has its own way of calling optimisers, unlike Layers. This is because the code and workflow of Keras do not look or feel like Tensorflow.

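    K.clear_session()

    model = Sequential()
    # The model.add(...) calls that stack the layers go here; they are not shown in this listing.

    # Adam is assumed as the optimiser, mirroring the Tensorflow version above
    model.compile(loss=keras.losses.sparse_categorical_crossentropy,
                  optimizer=keras.optimizers.Adam(),
                  metrics=['accuracy'])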

And this is it. The same model, using significantly fewer lines and characters than above. The workflow in Keras is as follows: choose the type of model, in our case Sequential, which is a built-in class to stack layers on top of each other; instantiate the class; and start adding layers in sequence from input to output. The connections between layers are drawn automatically, since the layers are glued together sequentially. Keras knows how to do this.
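
The listing above does not show the calls that add the layers. As a hedged sketch, the stack can be reconstructed from the model summary printed further down and from the hyper-parameters of the Tensorflow version. Everything in this block is an assumption rather than the original code: build_cnn is a hypothetical helper name, and MaxPooling2D follows the summary, whereas the import line and the Tensorflow version use average pooling (swap in AveragePooling2D to mirror those).

```python
def build_cnn(input_shape=(28, 28, 1), num_classes=10):
    """Hypothetical reconstruction of the layer stack; not the original listing."""
    # Imported inside the function so that defining it does not require Keras
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

    model = Sequential()
    # First ConvConvPool block: 64 feature maps, image size preserved by padding
    model.add(Conv2D(64, kernel_size=3, strides=1, padding="same",
                     activation="relu", input_shape=input_shape))
    model.add(Conv2D(64, kernel_size=3, strides=1, padding="same", activation="relu"))
    model.add(MaxPooling2D(pool_size=2, strides=2))
    model.add(Dropout(0.25))
    # Second ConvConvPool block: 32 feature maps
    model.add(Conv2D(32, kernel_size=3, strides=1, padding="same", activation="relu"))
    model.add(Conv2D(32, kernel_size=3, strides=1, padding="same", activation="relu"))
    model.add(MaxPooling2D(pool_size=2, strides=2))
    model.add(Dropout(0.25))
    # The voter: flatten, one dense layer, and a softmax output
    model.add(Flatten())
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation="softmax"))
    return model
```

With Keras installed, calling model = build_cnn() in place of model = Sequential() and then compiling should give a model whose summary has the same layers and output shapes as the one shown below.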

The compile method represents the operational part of the network. In one go, we define the loss, the optimiser, and we can even choose metrics to show during the training stage.

Running is also a lot easier than in Tensorflow, as the Keras fit method knows how to generate mini-batches, accepts validation sets, etc. It runs the required cycles throughout epochs and mini-batches. Of course, this gives us less control over what is happening inside these loops, but the convenience is undeniable if one only wants to quickly fit a Neural Network.

    print(model.summary())
    model.fit(X_train_reshaped, Y_train, batch_size=batch_size, epochs=epochs,
              validation_data=(X_val_reshaped, Y_val))
    test_pred = model.predict_classes(X_test_reshaped)
    # accuracy_score expects the true labels first, then the predictions
    print("Final accuracy on test set:", accuracy_score(Y_test, test_pred))

In addition, Keras has a nice summary method that prints the shape and information of the model, including how many parameters are being fit. Once the model is fit, we can then use its prediction methods (here predict_classes) to draw predictions from a test set. The API is very similar to that of scikit-learn, which I think is not surprising as scikit-learn has set a comprehensive and consistent standard for a long time.

Using TensorFlow backend.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 28, 28, 64)        640
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 64)        36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 64)        0
_________________________________________________________________
dropout_1 (Dropout)          (None, 14, 14, 64)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 14, 14, 32)        18464
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 32)        9248
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 32)          0
_________________________________________________________________
dropout_2 (Dropout)          (None, 7, 7, 32)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1568)              0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               401664
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2570
=================================================================
Total params: 469,514
Trainable params: 469,514
Non-trainable params: 0
_________________________________________________________________
None
Train on 55000 samples, validate on 5000 samples
Epoch 1/10
55000/55000 [==============================] - 25s 446us/step - loss: 0.2074 - acc: 0.9341 - val_loss: 0.0415 - val_acc: 0.9862
Epoch 2/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0727 - acc: 0.9777 - val_loss: 0.0425 - val_acc: 0.9886
Epoch 3/10
55000/55000 [==============================] - 24s 438us/step - loss: 0.0557 - acc: 0.9827 - val_loss: 0.0366 - val_acc: 0.9896
Epoch 4/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0486 - acc: 0.9854 - val_loss: 0.0351 - val_acc: 0.9896
Epoch 5/10
55000/55000 [==============================] - 24s 439us/step - loss: 0.0406 - acc: 0.9878 - val_loss: 0.0239 - val_acc: 0.9932
Epoch 6/10
55000/55000 [==============================] - 24s 438us/step - loss: 0.0385 - acc: 0.9886 - val_loss: 0.0325 - val_acc: 0.9910
Epoch 7/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0334 - acc: 0.9899 - val_loss: 0.0266 - val_acc: 0.9926
Epoch 8/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0313 - acc: 0.9904 - val_loss: 0.0300 - val_acc: 0.9914
Epoch 9/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0273 - acc: 0.9914 - val_loss: 0.0266 - val_acc: 0.9928
Epoch 10/10
55000/55000 [==============================] - 24s 440us/step - loss: 0.0273 - acc: 0.9919 - val_loss: 0.0330 - val_acc: 0.9920
Final accuracy on test set: 0.993
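
The parameter counts in the summary above can be checked by hand. A convolutional layer has one $$k \times k$$ kernel per pair of input channel and filter, plus one bias per filter, while a dense layer has one weight per pair of input and unit, plus one bias per unit. The helper names below are hypothetical; the layer sizes are read off the summary.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of all kernels plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(n_inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * units + units

counts = {
    "conv2d_1": conv2d_params(3, 1, 64),
    "conv2d_2": conv2d_params(3, 64, 64),
    "conv2d_3": conv2d_params(3, 64, 32),
    "conv2d_4": conv2d_params(3, 32, 32),
    "dense_1": dense_params(7 * 7 * 32, 256),
    "dense_2": dense_params(256, 10),
}
total = sum(counts.values())

print(counts)
print(total)  # 469514, matching "Total params" in the summary
```

Most of the parameters sit in the first dense layer, not in the convolutions, which is typical for this kind of architecture.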


Not surprisingly, the model performs similarly on both Tensorflow+Layers and Keras. Again, we notice that the Training Loss seems to be on a descending path, while the Validation Accuracy could maybe increase.

It is important to note, though, that the two fits are not identical. This is because there are many random aspects to training a Neural Network: stochastic optimisers, random initialisation of weights, etc. So don’t be surprised if you get different results on your computer.

## Encore

In this exercise we implemented a Convolutional Neural Network model using both Tensorflow+Layers and Keras. The end results were similar, but the path taken was quite different.

Tensorflow is more than a framework for Neural Networks. It is a framework for computing graphs made of nodes (the operations) and the values flowing between them (the tensors). Using some low-level helpers, like Layers and the Estimator API (I am planning a future post on this), we can deploy Neural Networks more easily while still being in control of the low-level architecture of our network.

Keras is an attempt to bring Neural Networks to the masses, by providing a functional API that makes constructing and training a Neural Network effortless. We lose some of the power of Tensorflow, as we can only communicate with it through the Keras built-in classes, with the attributes and methods provided. For this example we might not have noticed any drawbacks in using Keras, but our task would have been considerably more complicated if we wanted to implement a non-trivial architecture, or if we wanted to use options that the Keras classes do not expose directly (for example, in the Keras version used here LeakyReLU cannot be passed by name through a layer's activation argument and has to be added as a separate layer, keras.layers.LeakyReLU).

In the end, Layers and Keras will help you achieve different goals. If you need to deploy a Neural Network with a standard architecture, Keras will help you achieve your goal quickly. If, on the other hand, you need to design your own network, then Layers provides invaluable helper classes for constructing a Neural Network on Tensorflow.

I hope this was useful to someone. I am planning a few more posts on Tensorflow and other Machine Learning exercises.