A Brief Recap
In Blog 1 we discussed the dataset, EDA, and our baseline model. The dataset consists of 21,367 images of Cassava crops. Our goal is to correctly classify the images into five categories: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and healthy leaf. During EDA we found that CMD makes up ~61% of the images. We adapted the LeNet5 architecture to create a baseline model, achieving a val_accuracy of 0.5999 and a final val_loss of 1.7663.
In this article, we will discuss additional neural network architectures and their performance for this Cassava Leaf Disease Classification competition on Kaggle.
This phase of the project is a computer vision exercise in adapting several standard modern computer vision architectures to our task of classifying cassava leaf images. We implemented four transfer learning models (VGG, ResNet, EfficientNet, and InceptionNet) that represent different paradigms in deep learning, comparing the impact of different ways of connecting and arranging layers in a network. Before doing so, we explored several preprocessing steps…
To create more robust models, we used data augmentation to effectively increase the size and variety of our training dataset. Data augmentation takes an image and manipulates some aspect of it. For example, RandomRotation rotates an image by a random amount within a given factor. The data augmentation methods we used are RandomFlip, RandomRotation, RandomTranslation, RandomZoom, and RandomContrast. For more information, check out the TensorFlow documentation.
We discovered during EDA that our training data was imbalanced among the five classes. As we noted in our first blog post, just over 60% of the training data from Kaggle was from the CMD class, while each of the other classes made up 5–15%. This left us with a very skewed prior for the class distribution we would face in the testing data, so we checked the class distribution of the Kaggle testing data by measuring the accuracy of five naive classifiers, each of which predicted a single class label for every input. Because the accuracy of such a constant classifier equals the share of test images belonging to its class, these accuracies gave us the proportion of each class, which ultimately reflected the same skew we saw in the training data. To prepare our model as well as possible for the test data, we used the class proportions from the test data to weight the samples of each class during training (using the class_weight parameter).
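The two ideas in this paragraph can be sketched in plain Python. This is a minimal illustration under stated assumptions: the toy label list is made up, the class indices follow the order 0=CBB, 1=CBSD, 2=CGM, 3=CMD, 4=healthy, and inverse-frequency weighting is one common way to turn proportions into weights, not necessarily the exact formula behind our class_weight values.

```python
def constant_classifier_accuracy(labels, predicted_class):
    """Accuracy of a naive model that predicts `predicted_class` for every input.

    This equals the proportion of `labels` that belong to `predicted_class`.
    """
    hits = sum(1 for y in labels if y == predicted_class)
    return hits / len(labels)


def inverse_frequency_weights(proportions):
    """Weight each class by 1 / (num_classes * proportion).

    Rare classes get weights above 1, common classes get weights below 1.
    (Assumed weighting scheme, chosen for illustration.)
    """
    num_classes = len(proportions)
    return {c: 1.0 / (num_classes * p) for c, p in proportions.items()}


# Made-up labels with the same kind of skew as the real data (class 3 dominates).
toy_labels = [3, 3, 3, 0, 1, 3, 2, 3, 4, 3]

# One "submission" per class recovers the class proportions.
proportions = {c: constant_classifier_accuracy(toy_labels, c) for c in range(5)}

# Dictionary in the {class_index: weight} form that Keras expects for class_weight.
class_weight = inverse_frequency_weights(proportions)
print(proportions)
print(class_weight)
```

A dictionary of this shape can be passed directly to the `class_weight` argument of `model.fit`.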
While training our models, we repeatedly exceeded the RAM capacity of our Google Colab instance, often crashing after just one or two epochs. We tried many adjustments to our code to resolve this, such as reducing the number of trainable parameters in our network, inspecting the GPU/CPU usage, and resizing our inputs. Ultimately a combination of approaches resolved the issue. The most impactful changes were reducing our input size from the default 600x800 to 224x224, changing the buffer size for the TFRecord caching, increasing our batch size, and moving data augmentation out of the sequential network (where it used preprocessing layers) so that it is applied once to our training set (using a dataset map function).
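As a rough sketch of the last change, augmentation can live in the input pipeline instead of in the model. The example below assumes TensorFlow 2.4 or later and uses `tf.image` functions as stand-ins for the Keras preprocessing layers named earlier; the random tensors, the contrast range, and the batch size are placeholders for illustration only.

```python
import tensorflow as tf


def augment(image, label):
    """Randomly flip and re-contrast one image; the label is passed through."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # Contrast changes can push values outside [0, 1], so clip them back.
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label


# Placeholder data: 8 random 224x224 RGB images with one-hot labels for 5 classes.
images = tf.random.uniform((8, 224, 224, 3))
labels = tf.one_hot(tf.random.uniform((8,), maxval=5, dtype=tf.int32), depth=5)

train_ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)  # augmentation in the pipeline
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)
```

With this arrangement the model itself contains no augmentation layers, so the same model can be used unchanged for validation and inference.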
For our training and validation data, we opted to pull data from TFRecord files rather than directories of JPEGs. The main reason for this decision was speed: TFRecord is a binary file format that is optimized for integration with TensorFlow pipelines. To use this file format, we needed to build several image parsing functions from scratch that map the files to TensorFlow datasets. First, the files are loaded into a TFRecord dataset object. Next, the images in the dataset are decoded from encoded strings into three-dimensional tensors of float values, rescaled, and downsampled. Additionally, the labels are one-hot encoded. Finally, the dataset is shuffled, prefetched with an autotuned buffer size, and batched with a specified batch size.
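The steps above might look roughly like the following. This is a sketch, not our exact parsing code: the feature keys `"image"` and `"target"`, the shuffle buffer size, and the helper names `parse_example` and `load_dataset` are assumptions, so check the keys against your own TFRecord files.

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)
NUM_CLASSES = 5

# Assumed layout of each serialized example (hypothetical keys).
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),   # JPEG-encoded bytes
    "target": tf.io.FixedLenFeature([], tf.int64),   # integer class label
}


def parse_example(serialized):
    """Decode one serialized example into a (float image, one-hot label) pair."""
    example = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    # resize() returns float32; dividing by 255 rescales pixel values to [0, 1].
    image = tf.image.resize(image, IMAGE_SIZE) / 255.0
    label = tf.one_hot(example["target"], NUM_CLASSES)
    return image, label


def load_dataset(filenames, batch_size=64, shuffle_buffer=2048):
    """Build a shuffled, batched, prefetched dataset from TFRecord files."""
    ds = tf.data.TFRecordDataset(filenames)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.batch(batch_size)
    return ds.prefetch(tf.data.AUTOTUNE)
```

Calling `load_dataset` with a list of TFRecord paths yields batches of shape `(batch_size, 224, 224, 3)` for images and `(batch_size, 5)` for labels.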
After loading the data from the TFRecord files, we split our images into 70% training and 30% validation. We then researched different networks to use as the base in our transfer learning models. Transfer learning involves loading a pre-trained network architecture by instantiating it with the weights obtained by training on a given dataset (in our case, the ImageNet dataset). There are many commonly used computer vision architectures, each with its own distinct qualities; see Figure 2 below, which compares the size and performance of some of these common networks.
… And here is a look at the usage of some of these networks over time (we can see different networks are popular at different times in history).
We decided to move forward with VGG, ResNet, EfficientNet, and InceptionNet. Each model uses the Adam optimizer, categorical cross-entropy as the loss function, and accuracy as the training metric. Each model also includes the callbacks ‘ReduceLROnPlateau’ (factor of 0.2 and a minimum loss plateau of 0.0001) and ‘EarlyStopping’ (patience of 15).
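A shared training setup along these lines could be sketched as follows. Several details are assumptions: the "minimum loss plateau of 0.0001" is read here as `min_delta=1e-4`, both callbacks monitor `val_loss`, the learning rate of 0.001 is the one quoted for ResNet later in this post, and `compile_for_training` and the tiny stand-in model are hypothetical names used only for illustration.

```python
import tensorflow as tf


def compile_for_training(model, learning_rate=1e-3):
    """Compile `model` and return the callbacks to pass to `model.fit`."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",   # labels are one-hot encoded
        metrics=["accuracy"],
    )
    return [
        # Multiply the learning rate by 0.2 when val_loss stops improving
        # by at least 1e-4 (assumed interpretation of the plateau setting).
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.2, min_delta=1e-4
        ),
        # Stop training after 15 epochs without improvement.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15),
    ]


# Tiny stand-in model, just to show the call; any Keras model works here.
toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(5, activation="softmax"),
])
callbacks = compile_for_training(toy_model)
```

The returned list is then passed as `callbacks=callbacks` when calling `fit`.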
Our initial transfer learning model used a VGG-16 as the network base. This type of network is characterized by the use of blocks of convolutional kernels chained in a repeated pattern so that the final model has 16 layers: 13 convolutional and 3 fully connected. Below is the architecture for this type of network.
Four additional fully connected layers were added with outputs of 512, 128, 32 (with ReLU) and a final output of 5 (with softmax). After building this model and freezing the layers in the VGG body, we have 15,047,301 total network parameters, with 332,613 trainable parameters and 14,714,688 frozen ones.
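The trainable parameter count can be checked by hand. The sketch below assumes the head receives a 512-dimensional feature vector from the VGG body (for example, after global pooling of its 512-channel output); under that assumption the arithmetic reproduces the numbers reported above.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def head_params(n_features, widths):
    """Total parameters of a stack of dense layers fed by `n_features` inputs."""
    total, n_in = 0, n_features
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


vgg_body = 14_714_688                            # frozen parameters reported above
vgg_head = head_params(512, [512, 128, 32, 5])   # assumed 512 input features

print(vgg_head)             # 332613
print(vgg_body + vgg_head)  # 15047301
```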
With a batch size of 64 and 5 epochs, this model achieved a 1.3051 val_loss and 0.3922 val_accuracy.
For our second transfer learning model we implemented an architecture based on ResNet50, a common CNN architecture that is defined by its use of residual connections between blocks. The use of residual connections is a departure from typical feedforward network design principles because it allows the input to a block to be added directly to that block’s output.
This increases the model’s expressivity simply by making it easier to preserve a useful input wherever it occurs in the network. Residual connections are generally found to be an efficient replacement for the many layers that would otherwise be devoted to recreating an input that we can instead deliver directly. In terms of the function space we search when we train a deep network, residual connections ensure that adding layers yields progressively larger function spaces, each of which contains the previous one, because a new block can always fall back to the identity. A model without residual connections searches through function spaces where this containment is not guaranteed, leaving us with much less theoretical certainty about whether a deeper model will end up any better.
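A toy example makes the idea concrete. The sketch below is a deliberate simplification in plain Python: it uses a single fully connected layer as the inner function, whereas a real ResNet block uses convolutions and batch normalization. When the inner layer's weights are all zero, the residual block returns its input unchanged, while the plain layer loses the input entirely.

```python
def dense_relu(x, weights, bias):
    """Plain fully connected layer followed by ReLU."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def residual_block(x, weights, bias):
    """y = x + f(x): the input skips around the layer and is added to its output."""
    fx = dense_relu(x, weights, bias)
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, -2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

print(residual_block(x, zero_w, zero_b))  # [1.0, -2.0, 3.0]
print(dense_relu(x, zero_w, zero_b))      # [0.0, 0.0, 0.0]
```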
In our actual implementation, the model is built on top of the ResNet50 architecture provided in the Keras Applications library, using the weights obtained by training the network on the ImageNet dataset. This model was first proposed in the ResNet paper (see references). It begins with a 7x7 convolutional layer and a 3x3 max pooling layer, then uses four convolutional sections with interposed residual connections, before ending in a 1000-dimensional fully connected layer with a softmax activation. In our model, we append to this ResNet50 body a global pooling layer followed by five dense layers that reduce in output dimension from 1024 to the final output dimension of 5 (the number of cassava leaf classes we have). All of these dense layers use ReLU activation except the output layer, which uses softmax.
After building this model and freezing the layers in the ResNet50 body, we have 26,267,293 total network parameters, with 2,679,581 trainable parameters and 23,587,712 frozen ones.
After training for 5 epochs (with an initial learning rate of 0.001 and using the Adam optimizer) our ResNet model achieved a maximum validation accuracy of 0.5336.
In general, the more layers a network has, the more “powerful” it is. ResNet’s ability to increase depth has made it a dominant architecture. Yet as we continue to increase depth, the gains begin to drop off.
One of the models we trained is from the EfficientNet family, whose members are often more accurate and more efficient than previous ConvNets. EfficientNet is a CNN that uses a compound coefficient (φ) to uniformly scale network depth, width, and resolution. For example, to use 2^N times more computational resources, we can use constant coefficients (α, β, and γ, scaled together) to increase network depth by α^N, width by β^N, and image size by γ^N. Compound scaling helps the network capture fine-grained patterns on larger images (by increasing channels) and enlarges the receptive field (by increasing layers).
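The scaling rule can be written out in a few lines. In this sketch `phi` is the compound coefficient (the exponent in the example above), and the base values 1.2 (depth), 1.1 (width), and 1.15 (resolution) are the ones reported in the EfficientNet paper, chosen there so that depth × width² × resolution² is roughly 2. The code illustrates the rule only and does not reproduce the exact configuration of any particular B-variant.

```python
# Base coefficients for depth, width, and resolution (from the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return the (depth, width, resolution, FLOPs) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Each step in `phi` roughly doubles the compute budget, which is what makes the single coefficient a convenient knob.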
We use EfficientNet-B0 and freeze layers so that the weights don’t change. We added a GlobalAveragePooling2D layer, BatchNormalization, Dropout of 0.2 and a final output of 5 (with softmax), to the EfficientNet-B0 body. After building this model and freezing the layers in the EfficientNet body, we have 4,061,096 total network parameters, with 8,965 trainable parameters and 4,052,131 frozen ones. After training the model, it seemed the class_weight parameter wasn’t working properly. After removing the class_weight parameter, we got the following results.
With a batch size of 64 and 5 epochs, this model achieved a val_loss of 1.2002 and val_accuracy of 0.6170.
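The EfficientNet-B0 model described above can be sketched with the Keras functional API. This is an illustration under assumptions rather than our exact code: `build_efficientnet_classifier` is a hypothetical helper, the 224x224 input size is the one used elsewhere in this post, and `weights=None` keeps the sketch runnable offline (pass `weights="imagenet"` to load the pre-trained body).

```python
import math

import tensorflow as tf


def build_efficientnet_classifier(input_shape=(224, 224, 3), num_classes=5,
                                  weights=None):
    """Frozen EfficientNet-B0 body with a small trainable classification head."""
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # freeze the body so its weights do not change

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_efficientnet_classifier()

# Count only the parameters that training will update (the head).
trainable = sum(math.prod(w.shape) for w in model.trainable_weights)
print(trainable)
```

Only the batch normalization scale and offset and the final dense layer are trainable, which is why the trainable count is so small compared with the total.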
Another base network architecture that we employed for one of our transfer learning models was InceptionV3 from the Keras Applications library. This type of architecture was first proposed in the Inception papers (see references) with the intent of reducing the number of trainable parameters of large networks such as VGG while maintaining high accuracy. To this end, the authors proposed extracting features at multiple scales by applying convolutional kernels of varying sizes to the same input within one layer and then concatenating the results. Examples of inception blocks used for this project are shown below, as well as the final architecture for the base. Note the varying convolutional kernel shapes used on the same input.
After building this model and freezing the layers in the Inception body, we have 22,921,829 total network parameters, with 1,119,045 trainable parameters and 21,802,784 frozen ones.
With a batch size of 64 and 5 epochs, this model achieved a 1.3431 val_loss and 0.4682 val_accuracy.
Overall, our models could have performed better. Our next steps include:
- Tuning hyper-parameters for each model, which will be our main focus: increasing epochs, adjusting learning rates, and adding exponential decay to improve results.
- Changing EfficientNet-B0 to EfficientNet-B7: EfficientNet-B7 reports a state-of-the-art 84.3% top-1 accuracy on ImageNet (see the EfficientNet paper in the references).
- Trying classical methods (e.g. K-Nearest Neighbors and support vector classifiers), using our trained models to create feature embeddings that will be used as input.
- Ensemble modeling (majority voting over final layers OR average of softmax predictions).
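The two ensembling options in the last item can disagree, which a small plain-Python sketch shows. The probability vectors below are made up for illustration, and `average_softmax` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def average_softmax(predictions):
    """Average the models' probability vectors, then pick the top class."""
    n_models = len(predictions)
    n_classes = len(predictions[0])
    mean = [sum(p[c] for p in predictions) / n_models for c in range(n_classes)]
    return argmax(mean)


def majority_vote(predictions):
    """Let each model vote for its top class and return the most common vote."""
    votes = [argmax(p) for p in predictions]
    return Counter(votes).most_common(1)[0][0]


# Made-up softmax outputs of three models for a single image.
predictions = [
    [0.05, 0.05, 0.05, 0.45, 0.40],  # model A narrowly prefers class 3
    [0.05, 0.05, 0.05, 0.45, 0.40],  # model B narrowly prefers class 3
    [0.00, 0.00, 0.00, 0.05, 0.95],  # model C is very confident in class 4
]

print(majority_vote(predictions))    # 3
print(average_softmax(predictions))  # 4
```

Averaging takes each model's confidence into account, whereas voting treats a narrow preference and a confident prediction the same way.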
All code can be found in this GitHub repository. Our Kaggle submissions can be found here: VGG, ResNet, EfficientNet, InceptionNet. Our final blog post will discuss our exploration of the list above and the details of our best trained model.
Blog Post 3: here!
EfficientNet: https://arxiv.org/pdf/1905.11946.pdf
T. A. Putra, S. I. Rufaida and J. Leu, “Enhanced Skin Condition Prediction Through Machine Learning Using Dynamic Training and Testing Augmentation,” in IEEE Access, vol. 8, pp. 40536–40546, 2020, doi: 10.1109/ACCESS.2020.2976045.
Zhang, Aston, et al. Dive into Deep Learning. https://d2l.ai/index.html
VGG: https://neurohive.io/en/popular-networks/vgg16/
InceptionNet: https://arxiv.org/pdf/1512.00567.pdf
InceptionNet: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43022.pdf
ResNet: https://arxiv.org/abs/1512.03385