Blog 3: Updates & Final Model
By Cameron Webster, Jillian Green, and Chris Rohlicek

We are graduate students in the Data Science Master’s program at Brown University. As part of our Deep Learning and Special Topics in Data Science course (Data 2040), we worked on a Kaggle Competition:
Cassava Leaf Disease Classification.

Project Overview

The dataset consists of 21,367 images of cassava crops from a Ugandan survey. Our goal is to correctly classify each image into one of five classes: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and Healthy. Properly classifying images will help farmers distinguish healthy from diseased plants, and hopefully save crops from further damage.

In our first blog we focused on Exploratory Data Analysis (EDA) and a baseline model. Through EDA we displayed visual similarities between classes, and found ~61% of the images in the dataset are of the CMD class. Our baseline model adapted the LeNet5 architecture, achieving a val_accuracy of 0.5999 and a final val_loss of 1.7663.

In our second blog we implemented three transfer learning models: ResNet50, InceptionNet and EfficientNet. Our models achieved val_accuracy between 47–61%, so we knew we had plenty of room for improvement.

In this blog post we focus on improving each individual transfer learning model. We then use a technique called ensembling to combine our three models to create our final model.

Improvements & Attempted Ideas

Here we discuss successful model updates, and other attempted methods.

Hyper-parameter Tuning (Improvement)
As a result of hyper-parameter tuning, we set a maximum of 100 epochs, a batch size of 16, and a starting learning rate of 0.001.

Data Augmentation (Improvement)
We were originally using TensorFlow's resize to reshape our images to 224x224, which reduces resolution. We replaced resize with TensorFlow's central_crop, which extracts the center of the image at a smaller size (360x360) without reducing resolution.
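A minimal sketch of the change (the exact preprocessing call is our assumption; the post only specifies cropping the center to 360x360 rather than resizing):

    import tensorflow as tf

    def preprocess(image):
        # Old approach: downsample to 224x224, losing resolution.
        # image = tf.image.resize(image, [224, 224])

        # New approach: keep native resolution and take the image center.
        # resize_with_crop_or_pad yields an exact 360x360 center crop;
        # tf.image.central_crop is equivalent given the matching fraction.
        return tf.image.resize_with_crop_or_pad(image, 360, 360)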

Ensembling (Improvement)
See our Final Model section for more!

Differential Learning Rates (Attempt)
In training our transfer learning models, our first strategy was fine tuning: training the full network with a small learning rate applied to the pre-trained base layers and a higher learning rate applied to the new head layers. This approach lets us adapt the entire network to our specific task while protecting the pre-trained weights from large, destructive updates.

We implemented this with a MultiOptimizer object, creating layer/optimizer pairs that assigned the two learning rates to their respective sections, but the training results showed validation accuracy and loss flatlining. Given our time constraints, after trying a few hyperparameter combinations we decided to train each model with a single learning rate (the larger of the two from our fine-tuning approach), which ultimately gave us the best results.
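For reference, here is a sketch of the approach using TensorFlow Addons' MultiOptimizer (the specific learning rates and base model are illustrative assumptions):

    import tensorflow as tf
    import tensorflow_addons as tfa

    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg")
    head = tf.keras.layers.Dense(5, activation="softmax")
    model = tf.keras.Sequential([base, head])

    # Pair a small learning rate with the pre-trained base and a
    # larger one with the new head.
    optimizer = tfa.optimizers.MultiOptimizer([
        (tf.keras.optimizers.Adam(1e-5), base),
        (tf.keras.optimizers.Adam(1e-3), head),
    ])
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])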

Figure 1: Using differential learning rates for InceptionNet, plateauing around 47%.
Figure 2: Using differential learning rates for EfficientNet, plateauing around 52%.

Disproportionate Class Weights (Attempt)
Through EDA we saw a heavily skewed class distribution, with roughly 60% of the samples in a single class and as few as 5% in another. The prior assumption about the underlying data distribution is an important part of building an effective classifier, so to check the Kaggle testing data's class distribution we submitted naive classifiers (classifiers hardcoded to make the same prediction for all input data) to the competition and checked their accuracies. Each naive classifier achieved the same score on the Kaggle test data as it did on the training data, so we could say with confidence that the class distribution of the training set is a good approximation of that of the test set.
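A naive classifier of this kind is trivial to build; the sketch below assumes the competition's sample_submission.csv format (the file names and chosen class are placeholders):

    import pandas as pd

    # Predict one hardcoded class for every test image. The resulting
    # leaderboard accuracy approximates that class's share of the
    # (hidden) test distribution.
    sub = pd.read_csv("sample_submission.csv")
    sub["label"] = 3  # class 3 (CMD), the majority class in training
    sub.to_csv("submission.csv", index=False)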

Extra Model Layers, Training Entire Model (Attempt)
Initially, we thought it would be best to freeze all of the transfer models' base layers and place a more complex dense architecture on top than what we used in our final models. However, this method proved ineffective. First, because we froze the base layers, our models all hit a validation accuracy ceiling at about 0.6; considering a majority classifier can produce the same accuracy, this was cause for concern. Second, because our model head had too many dense layers in a funnel pattern, we were effectively discarding useful features before they reached the final prediction layer.

Transfer Learning

What is Transfer Learning?
Transfer learning refers to the practice of building new networks on top of pre-trained models. The models we loaded (EfficientNet, InceptionNet, and ResNet50) were pre-trained on the ImageNet dataset. To adapt them to our task, we added a section of layers to the end of each model that maps its output to our five-class problem. For each model, the best result came from training the full network with a single learning rate; this is computationally intensive compared to transfer learning techniques that freeze the pre-trained layers or apply smaller learning rates to them.

Model Head
The section of layers that we add on top of each transfer learning model is shown below.
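A minimal sketch of such a head in Keras, assuming global average pooling followed by dropout and a five-way softmax (the exact layers and the dropout rate are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(base_constructor):
        # base_constructor: e.g. tf.keras.applications.ResNet50
        base = base_constructor(include_top=False, weights="imagenet",
                                input_shape=(360, 360, 3), pooling="avg")
        return models.Sequential([
            base,
            layers.Dropout(0.3),                    # assumed rate
            layers.Dense(5, activation="softmax"),  # five-class output
        ])

    model = build_model(tf.keras.applications.ResNet50)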

During training we used three callbacks: one to save a checkpoint model whenever a new best validation loss was achieved, one to reduce the learning rate when validation loss plateaued over the previous three epochs, and one to stop training early if no improvement was made for 15 epochs.
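In Keras these map directly onto ModelCheckpoint, ReduceLROnPlateau, and EarlyStopping (the file name, reduction factor, and dataset names below are our assumptions):

    from tensorflow.keras import callbacks

    cbs = [
        # Save a checkpoint whenever validation loss improves.
        callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                  save_best_only=True),
        # Reduce the learning rate after a 3-epoch plateau.
        callbacks.ReduceLROnPlateau(monitor="val_loss", patience=3,
                                    factor=0.1),
        # Stop early after 15 epochs without improvement.
        callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                restore_best_weights=True),
    ]
    model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=cbs)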

ResNet
Our first model uses ResNet50 as the base, a standard computer vision architecture defined by its use of residual connections, which make training more resilient and avoid the brittleness often seen when stacking a very large number of layers in a plain feedforward architecture. A residual block works like a standard block of layers, except that an additional connection adds the block's input to its output. This simple architectural change greatly improves gradient flow and enables stable training of very deep networks.
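To make the residual idea concrete, here is an illustrative block in Keras (a simplified sketch; the real ResNet50 block also uses 1x1 bottleneck convolutions and batch normalization):

    from tensorflow.keras import layers

    def residual_block(x, filters):
        shortcut = x
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        # Project the input if its channel count differs from the output's.
        if shortcut.shape[-1] != filters:
            shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
        # The residual connection: add the block's input to its output.
        y = layers.Add()([shortcut, y])
        return layers.Activation("relu")(y)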

Figure 3: Diagram of Example of ResNet Architecture [2].

Trainable Parameters: 24,587,269
Non-trainable Parameters: 54,144

After 100 epochs of training we achieved a final validation loss of 0.3908 and validation accuracy of 0.8666.

Figure 4: Accuracy, Loss, Learning Rate Curves for ResNet50.

InceptionNet
Our second model uses an InceptionNet architecture as the base. InceptionNet is built from inception blocks, which apply convolutional kernels of different sizes to the same input tensor and concatenate the outputs. This lets InceptionNet extract features at multiple scales from the same input without creating a model that is too large and prone to overfitting, lending it more discriminative capability without sacrificing ease of training. Below is a diagram of an example inception block architecture.
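An illustrative inception block in Keras (a simplified sketch; the published blocks also use 1x1 convolutions to reduce dimensionality before the larger kernels [3]):

    from tensorflow.keras import layers

    def inception_block(x, filters):
        # Kernels of different sizes applied to the same feature map.
        b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
        b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
        pool = layers.MaxPooling2D(3, strides=1, padding="same")(x)
        # Concatenate the branch outputs along the channel axis.
        return layers.Concatenate()([b1, b3, b5, pool])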

Figure 5: Diagram of Example of Inception Block Architecture [3].

Notice the variation of shape and size of the kernels applied to the same feature map.

Trainable Parameters: 22,821,029
Non-trainable Parameters: 35,456

After 81 epochs of training we achieved a best validation loss of 0.3633 and validation accuracy of 0.8773.

Figure 6: Accuracy, Loss, Learning Rate Curves for InceptionNet.

EfficientNet
Our third model uses EfficientNetB0 as the base. Convolutional neural networks have three scaling dimensions: depth, width, and resolution. EfficientNet is a family of networks that focuses on improving accuracy and efficiency by uniformly scaling network depth, width, and resolution.

Figure 7: Displaying Model Scaling Techniques [4].

EfficientNet's solution is compound scaling, governed by a compound coefficient ɸ. Adjusting the value of ɸ produces EfficientNets B1-B7 [1].
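Concretely, for fixed constants α, β, γ found by grid search, compound scaling sets depth d = α^ɸ, width w = β^ɸ, and resolution r = γ^ɸ, subject to α·β²·γ² ≈ 2 with α, β, γ ≥ 1 [4], so each unit increase in ɸ roughly doubles the network's FLOPS.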

After training our EfficientNet model with layers and callbacks described above, we have:

Trainable Parameters: 4,667,009
Non-trainable Parameters: 43,047

After 59 epochs of training we achieved a best validation loss of 0.3808 and validation accuracy of 0.8774.

Figure 8: Accuracy, Loss, Learning Rate Curves for EfficientNet.

Validation accuracy likely fluctuates due to the variance of the mini-batches.

Interpretations of Transfer Learning Models & Motivation for Ensembling

After training our transfer learning models, we decided to look at the behaviors of each model to see where they are succeeding and failing, and how we might be able to combine them.

To visually understand which images the models struggle to classify, we generated the 15 training images with the worst loss values for each model. However, we did not see any obvious trends across the misclassified images.

Figure 9: Top 15 images from training set with worst loss value, by transfer learning model.

Next, we looked at the confusion matrices for each transfer learning model to see whether any classes are consistently misclassified. In Figure 10, we can see that the correlation between model predictions ranges from 0.57 to 0.69 (left), that each model tends to misclassify class 0 (CBB) as class 4 (Healthy), and that ResNet and InceptionNet tend to confuse class 2 (CGM) with class 3 (CMD).

Figure 10: Correlation of Model Predictions, and Confusion Matrices for each Transfer Learning Model: 0:CBB, 1:CBSD, 2: CGM, 3:CMD, 4:Healthy.

After evaluating the transfer learning models, we conclude that each model individually achieves high accuracy, but exhibits different predictive behaviors. With this in mind, we try ensembling!

Final Model

We used a technique called ensembling to combine our three models.

What Is Ensembling?
Ensembling is the intuitive solution of combining model predictions to produce a single prediction that leverages all of the individual models’ strengths while compensating for their weaknesses. The most common approaches are:

1) Majority voting — models formulate their own separate predictions and use the most common one as the output.
2) Combining softmax outputs, using the argmax class as the final prediction.

Since we have only three models, a majority vote would risk being too variable (with frequent ties), so we decided to use the latter method.

Figure 11: Model Structure of ensembling ResNet50, EfficientNet, and InceptionNet.

We implemented this by creating a single Keras model that combines our three models' softmax outputs in one additional layer (see Figure 11).

With the 5-dimensional output of this layer, we can get the predicted class by taking the argmax.
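A sketch of this wrapper in Keras (the model variables stand in for our three trained networks, and the input size matches our 360x360 crops; averaging and summing the softmax vectors give the same argmax):

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = tf.keras.Input(shape=(360, 360, 3))
    outputs = layers.Average()([resnet_model(inputs),
                                inception_model(inputs),
                                efficientnet_model(inputs)])
    ensemble = tf.keras.Model(inputs, outputs)

    # The 5-dimensional output becomes a class prediction via argmax.
    pred_classes = tf.argmax(ensemble.predict(test_images), axis=-1)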

Through the ensembling method, our Kaggle submission achieved a private score of 0.8656 and a public score of 0.8618. (YAY!)

If We Had More Time

Like any model, there is almost always room for improvement. If given more time for this project, we would have further investigated some of the attempted ideas listed above. For example, we would revisit differential learning rates, applying a small learning rate to the base layers and a higher one to the head layers, while checking our code for implementation issues and verifying whether the method suits our models.

If given more time we would also have explored other architectures in the EfficientNet family. While EfficientNetB0 found success, we would have liked to train larger variants with more hyperparameter variation, at the cost of longer training times.

Another approach we were interested in is using classical methods (e.g., k-Nearest Neighbors) to draw the final decision boundaries for a network, replacing the linear softmax decision boundaries with non-linear ones.

Takeaways & What we Learned

We spent a lot of time trying to figure out the best image augmentation and image processing pipeline. We would have liked to allocate more time to tune hyperparameters and interpret intermediate model results.

We discovered a new tool, Weights and Biases (wandb), which allowed us to display training progress and diagnose issues earlier on. Check out our final reports for InceptionNet, EfficientNet, and ResNet.

We learned more about TFRecord, which allowed us to pull data from TFRecord files rather than directories of JPEGs. This saved a great deal of time and memory, which was essential for our data processing pipeline in Colab.
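A sketch of such a pipeline (the feature keys and file pattern are assumptions based on the competition's TFRecord files):

    import tensorflow as tf

    feature_spec = {
        "image": tf.io.FixedLenFeature([], tf.string),  # encoded JPEG
        "target": tf.io.FixedLenFeature([], tf.int64),  # class label
    }

    def parse_example(serialized):
        ex = tf.io.parse_single_example(serialized, feature_spec)
        image = tf.io.decode_jpeg(ex["image"], channels=3)
        return image, ex["target"]

    files = tf.io.gfile.glob("train_tfrecords/*.tfrec")
    ds = (tf.data.TFRecordDataset(files)
          .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
          .batch(16)
          .prefetch(tf.data.AUTOTUNE))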

The biggest conceptual insight we gained from this project was the benefit of ensembling models. Our three models each did reasonably well individually, but after noticing that their predictions were only moderately correlated, we were able to combine them through ensembling into a stronger final model.

Check out our code in our GitHub repository, our final Kaggle submission, and a recording of our project here.

Thanks for reading!

Resources

[1] "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (Medium). https://medium.com/@nainaakash012/efficientnet-rethinking-model-scaling-for-convolutional-neural-networks-92941c5bfb95
[2] Quinn, Joanne, et al. Dive into Deep Learning: Tools for Engagement. Corwin, a SAGE Company, 2020. https://d2l.ai/index.html
[3] Szegedy et al., "Going Deeper with Convolutions" (InceptionNet). https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43022.pdf
[4] Tan and Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."
