A simple low-level Tensorflow classifier


19/03/2018

As an exercise, I developed a simple classifier using only Tensorflow low-level functions. By this I mean I avoided any higher-level abstractions such as the Estimator API, Layers, or Keras. The resulting Deep Neural Network (DNN) is simple to read and serves as a good example for understanding the Tensorflow workflow. The full code can be found on my GitHub. First, we import the libraries of interest: Tensorflow, Numpy, and some utilities from the Scikit-Learn stack: train_test_split, StandardScaler, LabelBinarizer, and a couple of data-set loaders.

import numpy as np
import tensorflow as tf
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split

The StandardScaler will rescale all numerical features to zero mean and unit standard deviation. This matters because a Neural Network is sensitive to the scale of its inputs, since each neuron is in essence a linear model

\[ z = W \cdot X + b\ , \]

where \(W\) are the weights, \(X\) the inputs, and \(b\) the biases. The result \(z\) is then fed to the next layer of neurons after being passed through an activation function.
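
To make this concrete, here is a minimal numpy sketch (with made-up numbers, purely for illustration) of what a single layer of neurons computes:

import numpy as np

X = np.array([[0.5, -1.2, 0.3]])   # one observation with 3 standardised features
W = np.random.randn(3, 4) * 0.1    # weights for a layer of 4 neurons
b = np.zeros(4)                    # biases, started at zero

z = X @ W + b                      # the linear part: z = W.X + b
a = np.maximum(z, 0)               # a ReLU activation applied element-wise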

The LabelBinarizer will one-hot encode the classes, turning them into arrays that are mostly zeros. One could think this is overkill for a binary classifier, where a single column of zeros and ones would be enough to encode the classes. But keeping the class dimensionality equal to the number of classes allows the same model to be deployed for both binary and multi-class classification.
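
As a quick illustration (with a hypothetical label array, not the actual data), this is what the LabelBinarizer produces for a three-class problem:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
print(lb.fit_transform([0, 1, 2, 1]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]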

With that said, we transform the incoming data. Note that our data processing is simple and only considers data-sets with numerical features; further steps would be needed if the data included categorical features.

data = load_iris()

scaler = StandardScaler()
lb = LabelBinarizer()

# Standardise the features and one-hot encode the targets
X = scaler.fit_transform(data.data).astype(np.float32)
Y_cls = data.target
Y = lb.fit_transform(Y_cls).astype(np.int64)

# Split into training and test sets and read off the input/output dimensions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
n_features = X_train.shape[1]
n_classes = Y_train.shape[1]

Here we chose the iris data-set, but the same DNN architecture can be used for other classification problems with numerical features. For example, simply changing load_iris() to load_breast_cancer() turns this exercise from a three-class into a binary classification without touching anything else in the code.

For the rest of the code above, notice that we cast the data-set to specific data types. This is because Tensorflow does not automatically cast variables into the appropriate type given a context; the types above are those I found to work with the functions used in the DNN defined below. Finally, train_test_split() splits the data-set into training and test sets, so that we can evaluate the performance of the DNN on data different from what it was trained on.
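
As a side note, not used in the code above: if one wants the split to be reproducible between runs, train_test_split accepts a random_state seed:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)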

It’s time to construct our DNN. Every Tensorflow implementation follows a generic two-step structure: construct the network (the graph), and then train it (run it). For deep learning, the nodes of the Tensorflow graph are the layers holding the neurons. Since coding each layer by hand is repetitive and error-prone, we define a function that returns the outputs of one layer of neurons. It takes the incoming values, initialises the weights and biases, and passes the resulting computation through an activation function.

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.shape[1])
        # Initialise the weights from a truncated normal distribution and the biases at zero
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="Weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        # Linear part of the neuron: z = X.W + b
        z = tf.matmul(X, W) + b
        # Pass the result through the requested activation function (or none)
        if activation == "relu":
            return tf.nn.relu(z)
        elif activation == "leaky_relu":
            return tf.nn.leaky_relu(z)
        elif activation == "sigmoid":
            return tf.nn.sigmoid(z)
        elif activation == "softmax":
            return tf.nn.softmax(z)
        else:
            return z

The weights and biases are Tensorflow variables, as these are the parameters that Tensorflow will use to optimise the performance of the network. Recall that Tensorflow has three types of tensors:

  1. Variables: hold the parameters that Tensorflow can use symbolically (for example, for differentiation) and update during the training phase. These have to be initialised.
  2. Constants: hold values that are never changed by the training phase.
  3. Placeholders: empty tensors (of a fixed data type) that are assigned values through the feed_dict dictionary when the graph is evaluated. This is how we feed data into the network for both training and prediction (see the short sketch below).
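
Here is a minimal, self-contained sketch (with illustrative values of my own choosing) of the three kinds of tensors and of how a placeholder is filled through feed_dict:

import tensorflow as tf

c = tf.constant(2.0)                      # constant: never changed by training
v = tf.Variable(1.0)                      # variable: trainable, must be initialised
p = tf.placeholder(tf.float32, shape=[])  # placeholder: assigned a value at run time

out = c * v + p
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(out, feed_dict={p: 3.0}))  # 2.0 * 1.0 + 3.0 = 5.0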

As explained above, each neuron performs the action of a linear model and then applies an activation function before outputting the result. The number of neurons in a layer is set by the dimensions of the weights and biases, which in turn fix the dimension of \(z\), the layer's output. The initialisation of the Variables can influence the speed of convergence of the network, and even help avoid poor local minima; it is common practice to sample the initial weights from a normal distribution and start the biases at zero.

The name_scope wrapper mostly provides labels to Tensorflow visualisation tools such as TensorBoard, which we are not using here. In our case, it simply keeps the code organised and easy to read.

We are now ready to construct our DNN. A DNN is defined by a sequential stack of layers; the more layers, the deeper the network. Here we will prepare a DNN with two hidden layers sandwiched between the inputs and the outputs.

tf.reset_default_graph()

n_hidden_1 = 8
n_hidden_2 = 8
learning_rate = 0.01  # starting learning rate for Adam (a value chosen here; tweak if needed)

X = tf.placeholder(tf.float32, shape=[None, n_features], name="X")
y = tf.placeholder(tf.int64, shape=[None, n_classes], name="y")

with tf.name_scope("DNN"):
    hidden1 = neuron_layer(X,n_hidden_1,"hidden1",activation="leaky_relu")
    hidden2 = neuron_layer(hidden1,n_hidden_2,"hidden2",activation="leaky_relu")
    logits = neuron_layer(hidden2,n_classes,"outputs")
with tf.name_scope("loss"):
     loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y,logits=logits),name="avg_xentropy")
with tf.name_scope("train"):    
    optimizer =  tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op= optimizer.minimize(loss)
with tf.name_scope("accuracy"): 
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct,tf.float32))
with tf.name_scope("init"):
    init=tf.global_variables_initializer()

Let’s go through these lines in some detail:

  1. We reset the default graph. When a new graph is not explicitly defined, all nodes and tensors in Tensorflow are assigned to the default graph. It is good practice to reset it before running, especially in interactive environments such as Jupyter Notebooks, where cells can be run out of order and nodes can end up assigned to the default graph by mistake.
  2. n_hidden_* defines the dimensions of the hidden layers, i.e. the output dimension of \(z\) on each layer.
  3. We define a placeholder for each of the inputs we will be using: the feature data, X, and the target labels, Y.
  4. The first hidden layer is fed as input the data X, and the second layer is then fed the outputs of the first layer, hidden1.
  5. The output of the second layer has the dimension of the number of classes, and each entry of that (1-dimensional) tensor is a real number. These are called logits, as opposed to probabilities or definite class assignments. Feeding the logits to a softmax function returns the probabilities that a given observation belongs to each of the classes (see the short sketch after this list).
  6. Once the data has been fed forward through both hidden layers, we compare the output with the target labels by computing the loss. For multi-class problems, the logits are fed through a softmax function, and the resulting probabilities are used to compute the cross-entropy loss. Tensorflow provides a function that does this in one go: we feed it the logits and it compares the results directly with the correct labels, y. The average cross-entropy is then computed with tf.reduce_mean, a function that computes the mean of a tensor.
  7. Having defined the loss function, we now have our training criterion: minimise the loss. To do this, we choose an optimizer and tell it to minimise the loss. We chose Adam, which is a solid default choice. It is a stochastic optimizer (it works on random mini-batches of the data) with an adaptive learning rate and momentum; the former helps prevent jumping over the minimum and the latter speeds up convergence. We define a starting learning rate to give ourselves the chance to tweak it, although for simple problems this is barely necessary.
  8. The node training_op is the operation we will run during training, which is to minimise the loss, using the optimizer we chose.
  9. Having defined all the structure needed to train our network, we add a few more operations that help us assess its performance. We define an accuracy node, which measures how often our predictions are correct, by comparing the predicted classes with the true labels. Since the softmax function is monotonic in its arguments, we can use the largest logit as the criterion for the most probable class.
  10. Finally, we set up an operation to initialise all Variables. This samples the initial values of the Variables from the distributions we assigned.
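
As a side note, although we do not need them for training, class probabilities and predictions could be read off the logits with two extra nodes. The sketch below is illustrative only (predict_proba and predicted_class are names I introduce here), and it would have to be evaluated inside the training session, after the Variables are initialised:

predict_proba = tf.nn.softmax(logits, name="predict_proba")      # per-class probabilities
predicted_class = tf.argmax(logits, 1, name="predicted_class")   # index of the most probable class

# Inside a session, e.g. after training:
#   probs = sess.run(predict_proba, feed_dict={X: X_test})
#   preds = sess.run(predicted_class, feed_dict={X: X_test})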

With the DNN set up, we are almost ready to train; the code above does not yet provide a way to generate mini-batches to feed the network. For that we create an iterator function that returns mini-batches. I found this piece of code online in a notebook whose source I failed to keep, but I hope the original author does not mind me sharing it further!

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    assert inputs.shape[0] == targets.shape[0]
    if shuffle:
        indices = np.arange(inputs.shape[0])
        np.random.shuffle(indices)
    for start_idx in range(0, inputs.shape[0] - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt], targets[excerpt]

With all the ingredients put together, we are ready to train our network! For that, we set the number of epochs (in each epoch, every training point is fed forward through the network once) and the size of the mini-batches. Every 10 epochs we print the average training loss over the mini-batches and the accuracy on both the training and test data-sets.

n_epochs = 100
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        batch_step = 0
        avg_loss = 0.
        total_loss = 0.
        total_batch = int(X_train.shape[0] / batch_size)
        for X_batch, Y_batch in iterate_minibatches(X_train, Y_train, batchsize=batch_size):
            _, l = sess.run([training_op, loss], feed_dict={X: X_batch, y: Y_batch})
            batch_step += 1
            total_loss += l
        if epoch % 10 == 0:
            avg_loss = total_loss / batch_size
            print("Epoch:", '%02d' % (epoch+1),
                  "| Average Training Loss= {:.2f}".format(avg_loss),
                  "| Training Accuracy:  {:.2f}".format(accuracy.eval({X: X_train, y: Y_train})),
                  "| Test Accuracy:  {:.2f}".format(accuracy.eval({X: X_test, y: Y_test})))
    print("Model fit complete.")
    print("Final Training Accuracy: {:.2f}".format(accuracy.eval({X: X_train, y: Y_train})))
    print("Final Validation Accuracy: {:.2f}".format(accuracy.eval({X: X_test, y: Y_test})))

The results for the iris data-set are shown below.

Epoch: 01 | Average Training Loss= 0.14 | Training Accuracy:  0.02 | Test Accuracy:  0.02
Epoch: 11 | Average Training Loss= 0.03 | Training Accuracy:  0.81 | Test Accuracy:  0.82
Epoch: 21 | Average Training Loss= 0.02 | Training Accuracy:  0.85 | Test Accuracy:  0.91
Epoch: 31 | Average Training Loss= 0.01 | Training Accuracy:  0.94 | Test Accuracy:  0.96
Epoch: 41 | Average Training Loss= 0.01 | Training Accuracy:  0.95 | Test Accuracy:  0.93
Epoch: 51 | Average Training Loss= 0.01 | Training Accuracy:  0.96 | Test Accuracy:  0.93
Epoch: 61 | Average Training Loss= 0.01 | Training Accuracy:  0.97 | Test Accuracy:  0.93
Epoch: 71 | Average Training Loss= 0.00 | Training Accuracy:  0.97 | Test Accuracy:  0.93
Epoch: 81 | Average Training Loss= 0.00 | Training Accuracy:  0.98 | Test Accuracy:  0.93
Epoch: 91 | Average Training Loss= 0.00 | Training Accuracy:  0.99 | Test Accuracy:  0.93
Model fit complete.
Final Training Accuracy: 0.99
Final Validation Accuracy: 0.93

Not too bad, although there is an indication of over-fitting: the training accuracy keeps increasing from about halfway through while the test accuracy drops slightly, and the subsequent gains in training accuracy bring no further benefit to the test accuracy. This is not uncommon; neural networks are known to have multiple local minima, and the problem at hand is rather simple. We could employ regularisation techniques such as L1 (Lasso) or L2 (Ridge) penalties, or Dropout, but I will leave that for another blog post.
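
Just as a taste of what that could look like, and only as a sketch rather than part of the code above (keep_prob is a placeholder I introduce here), Dropout could be inserted between the hidden layers with tf.nn.dropout:

keep_prob = tf.placeholder(tf.float32, name="keep_prob")  # probability of keeping a neuron

hidden1 = neuron_layer(X, n_hidden_1, "hidden1", activation="leaky_relu")
hidden1_drop = tf.nn.dropout(hidden1, keep_prob)
hidden2 = neuron_layer(hidden1_drop, n_hidden_2, "hidden2", activation="leaky_relu")
logits = neuron_layer(hidden2, n_classes, "outputs")

# Feed e.g. keep_prob: 0.5 during training and keep_prob: 1.0 when evaluating accuracy.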

Just before we finish, let’s show that this DNN also works for binary classification: we simply change load_iris() to load_breast_cancer() at the top of the notebook. The results are below:

Epoch: 01 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 11 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 21 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 31 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 41 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 51 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 61 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 71 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 81 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Epoch: 91 | Average Training Loss= 0.00 | Training Accuracy:  1.00 | Test Accuracy:  1.00
Model fit complete.
Final Training Accuracy: 1.00
Final Validation Accuracy: 1.00

That is quite an accomplishment for such a simple model!

I will continue to post my examples and discuss my ideas on different Machine and Deep Learning techniques here. Stay tuned!
