Machine Learning: Flower Identifier
The aim of this project was to explore machine learning as an application of artificial intelligence, in which computers adapt, learn, and improve based on their environment and experience. This project explores the use of convolutional neural networks to classify images from a small dataset, and the effects of varying the learning rate and momentum of an SGD optimiser.
In practice, training and building an entire convolutional network from scratch is difficult, as it requires a very large, organised dataset. Instead, a method called transfer learning can be used. Transfer learning is a machine learning technique where a pretrained model developed for one task is used as the starting point for a model on another task. This report explores machine learning, and, more specifically, transfer learning, and applies it to retrain a pretrained convolutional neural network. Below is an example of the flower types examined. It should be noted that many of the photos in the training set were not labelled correctly, which may have had an effect on the results.
Scope
The project uses the MobileNetV2 network to identify flowers from a small dataset. The dataset contained five different types of flowers: daisies, dandelions, roses, sunflowers, and tulips. MobileNetV2 is a convolutional neural network architecture designed to perform well on mobile devices; it is lightweight, with only 3.5 million parameters (small compared with other architectures). This model was pretrained on ImageNet, an expansive image database organised according to the WordNet hierarchy, which aims to offer tens of millions of cleanly labelled and sorted images.
TensorFlow and Keras were used to develop and retrain the model. TensorFlow is an end-to-end open-source machine learning library used to aid the development of machine learning models, whilst Keras is a deep learning API written in Python that runs on TensorFlow and enables fast experimentation with machine learning models.
Building the model
As this project was for the purpose of assessment, many details are left out. However, if you are seeking to learn more about machine learning on small datasets using transfer learning, the TensorFlow and Keras websites both provide good examples of how to use their packages (examples are linked).
Key details of the process used in this project (a code sketch combining these steps follows the list):
The MobileNetV2 network was used with ‘include_top’ set to ‘False’, so that a new classification layer could be trained later; ‘weights’ set to ‘imagenet’, to download the pretrained weights from ImageNet; and ‘classifier_activation’ set to ‘softmax’, which converts the model’s outputs from a vector of values to a probability distribution.
To prepare the model for training, the pretrained base was frozen and a new trainable layer was added.
The training images were resized, rescaled, and augmented before entering the model. The rescaling was set to produce values between negative one and one. Tests with rescaling to values between zero and one produced less accurate results on average, and this option was therefore not used.
A dropout layer was used, which randomly sets input units to 0 with a frequency given by the inputted rate at each step during training. This aids in preventing overfitting.
Global average pooling 2D was used. This is an alternative to using a flattening block.
As the dataset was small, the image split chosen was 70% training, 15% validation, and 15% test. This maximised the number of images used in the training and validation process, whilst leaving sufficient images for testing to validate the results.
The optimiser used to compile the model was the SGD optimiser. This method allows for tuning by altering parameters such as the learning rate and momentum, which are explored in the following sections. In this section, the learning rate was set to 0.01, the momentum to 0, and ‘nesterov’ to ‘False’.
The model was then compiled with a ‘categorical crossentropy’ loss function and the ‘accuracy’ metric.
An early stopping callback was used which, in this case, monitored the validation loss. This was crucial to avoid overfitting the data and can limit the time taken to test parameters.
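Bringing these steps together, the sketch below shows how such a pipeline can be assembled with Keras. It is a minimal illustration under assumptions, not the project's exact code: the directory paths, image size, augmentation choices, and dropout rate are all illustrative.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IMG_SIZE = (224, 224)  # MobileNetV2's default input resolution (assumed here)

# Hypothetical dataset loading; the 70/15/15 split is assumed to have been
# done on the image files beforehand.
train_ds = keras.utils.image_dataset_from_directory(
    "flowers/train", image_size=IMG_SIZE, label_mode="categorical")
val_ds = keras.utils.image_dataset_from_directory(
    "flowers/val", image_size=IMG_SIZE, label_mode="categorical")

# Augmentation plus rescaling to the [-1, 1] range described above.
augment = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.Rescaling(1.0 / 127.5, offset=-1),
])

# Pretrained MobileNetV2 base, frozen so that only the new head is trained.
base = keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=IMG_SIZE + (3,))
base.trainable = False

inputs = keras.Input(shape=IMG_SIZE + (3,))
x = augment(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)   # alternative to a flattening block
x = layers.Dropout(0.2)(x)               # illustrative dropout rate
outputs = layers.Dense(5, activation="softmax")(x)  # five flower classes
model = keras.Model(inputs, outputs)

# SGD optimiser with the baseline settings from this section.
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.0,
                                   nesterov=False),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

# Early stopping on validation loss, restoring the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

history = model.fit(train_ds, validation_data=val_ds,
                    epochs=40, callbacks=[early_stop])
```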
Results
When running this model, the test loss was on average 0.35 to 0.45 and the test accuracy 0.8 to 0.9. To understand how the model was performing during training, the history of the model was recorded and plotted, as shown at right. The model was set to run for 40 epochs, although, due to the early stopping callback, it cut off at 26, preventing overfitting of the data. The training trend matches what was expected, and the model performs well when tasked to identify the test images.
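A minimal sketch of how that history plot can be produced (matplotlib assumed; `history` is the object returned by `model.fit` in the sketch above):

```python
import matplotlib.pyplot as plt

# Plot training vs. validation loss from the History object returned by fit().
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```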
Exploration of learning rate
This section explores the effects of varying the learning rate across three orders of magnitude: 0.1, 0.01, and 0.001. These were compared against one another based on test accuracy and test loss, as well as visually, by viewing their training histories. The tests were set to 50 epochs, with the early stopping callback described earlier configured to return to the epoch with the lowest validation loss if the validation loss had not decreased after 10 epochs.
The learning rate controls how quickly the model adapts to the dataset. The smaller the learning rate, the larger the number of epochs required, given that the changes to the weights are smaller on each update. This value needs to be optimised: if the learning rate is too large, the model may converge too quickly to a suboptimal solution, and if it is too small, training may take too long or get stuck. This parameter is often considered one of the most important hyperparameters for a model. The results for each test are shown below.
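A minimal sketch of how such a sweep might be run, assuming a hypothetical build_model() helper that reassembles the frozen-base model from the previous section, and a test_ds loaded in the same way as the other splits:

```python
# Hypothetical learning-rate sweep; build_model(), train_ds, val_ds, test_ds
# and early_stop are assumed to come from the pipeline sketched earlier.
results = {}
for lr in (0.1, 0.01, 0.001):
    model = build_model()  # rebuild so each run starts from the same weights
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=lr, momentum=0.0),
        loss="categorical_crossentropy",
        metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds,
              epochs=50, callbacks=[early_stop])
    results[lr] = model.evaluate(test_ds, verbose=0)  # [test loss, test accuracy]
```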
The results show that the best-performing learning rate was the original value of 0.01, although the test using the 0.1 learning rate was not far behind and even ended with a lower test loss. The smallest learning rate, 0.001, should produce the greatest accuracy and find the best convergence; however, as the test only ran for 50 epochs, it did not have enough time to converge. The graph below shows this more clearly.
Exploring Momentum
When using an SGD optimiser, a small batch is used to estimate the gradient of the loss function, instead of computing the exact gradient over the whole dataset. This means that the direction travelled may not be optimal due to noise. Momentum is used to accelerate gradient vectors in the right direction, resulting in faster convergence. It does this by maintaining a moving average of the gradients and using it to update the weights of the network.
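As a toy illustration of that update (a standard formulation of classical momentum, matching the form used by Keras' SGD, not the project's own code):

```python
import numpy as np

# Classical momentum: the velocity is a decaying moving average of past
# gradients, which smooths out batch noise and speeds up convergence.
learning_rate, momentum = 0.01, 0.9
weights = np.array([0.5, -0.3])
velocity = np.zeros_like(weights)

def gradient(w):
    # Toy quadratic loss 0.5 * ||w||^2; its gradient is simply w.
    return w

for _ in range(10):
    velocity = momentum * velocity - learning_rate * gradient(weights)
    weights = weights + velocity
```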
The momentum value must be between 0 and 1. The values used to understand the full range of this variable were 0.1, 0.5, and 0.9. These were compared against one another based on test accuracy and test loss, as well as visually, by viewing their training histories. The same testing conditions were used, with 50 epochs and the early stopping callback set. The learning rate was set to 0.01, as this provided the best result in the previous section.
The results show that, in fact, the best test accuracy occurred with the lowest momentum, and this run also produced the lowest test loss. It should be noted that the scores were very similar. The history graph, displaying the validation loss, shows that the tests with the greater momentums converged faster, which matches what was expected.
Not applied in this project, but worth mentioning: the Nesterov accelerated gradient can also be used when updating the momentum. This method has gained popularity because of the way the momentum is updated, using a “lookahead” gradient that is evaluated at the position reached by first applying the momentum step, rather than at the current weights.
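In Keras this is a one-flag change to the optimiser used earlier (a sketch, not a configuration tested in this project):

```python
from tensorflow import keras

# Nesterov momentum evaluates the gradient at the "lookahead" position
# reached by first applying the momentum step.
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```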