Cassava Leaf Disease Classification (Part 2)

Blog 2: Image Preprocessing and Transfer Learning Models
By Chris Rohlicek, Cameron Webster, and Jillian Green

A Brief Recap

In Blog 1 we discussed the dataset, EDA, and our baseline model. The dataset consists of 21,367 images of Cassava crops. Our goal is to correctly classify the images into five categories: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and healthy leaf. During EDA we found that CMD makes up ~61% of the images. We adapted the LeNet5 architecture to create a baseline model, achieving a val_accuracy of 0.5999 and a final val_loss of 1.7663.


This phase of the project is a computer vision exercise based around adapting a few standard modern computer vision architectures to our task of identifying leaves. We implemented four transfer learning models that represent different paradigms in deep learning, comparing the impact of different ways of connecting and arranging layers in a network. Before we do this, we begin by exploring preprocessing steps…

Preprocessing Steps

Data Augmentation
In order to create more robust models, we used data augmentation to increase the effective size of our training dataset. Data augmentation takes an image and manipulates some aspect of it. For example, RandomRotation rotates an image by a random factor. The data augmentation methods we used are RandomFlip, RandomRotation, RandomTranslation, RandomZoom, and RandomContrast. For more information, check the TensorFlow documentation.
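The augmentation layers listed above can be chained into a single Keras pipeline. A minimal sketch is below; the factor values are illustrative, not the ones we tuned for our models.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation pipeline using the transforms named above.
# Factors are illustrative placeholders.
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),          # rotate up to ±20% of a full turn
    layers.RandomTranslation(0.1, 0.1),  # shift up to 10% of height/width
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.2),
])

images = tf.random.uniform((4, 224, 224, 3))        # dummy batch
augmented = data_augmentation(images, training=True)  # applied only in training
```

Because these are Keras layers, the pipeline can be dropped directly into a model so augmentation runs on the fly during training and is skipped at inference.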

Figure 1: Class Score From Each Class Using Testing Dataset in Kaggle.
Figure 2: Comparing size and performance of some common networks [4].
Figure 3: Displaying the usage of some networks between 2014 and 2021 [2].

VGG Model

Our initial transfer learning model used a VGG-16 as the network base. This type of network is characterized by the use of blocks of convolutional kernels chained in a repeated pattern so that the final model has 16 layers: 13 convolutional and 3 fully connected. Below is the architecture for this type of network.
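A transfer-learning setup over the VGG-16 base can be sketched as follows. This is a simplified illustration assuming 224x224 RGB inputs and our five output classes; the head layer sizes are placeholders, not our exact configuration, and in practice `weights="imagenet"` would be used.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# VGG-16 convolutional base without its fully connected top.
# weights=None here to keep the sketch self-contained; use "imagenet" in practice.
base = tf.keras.applications.VGG16(include_top=False,
                                   weights=None,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),   # illustrative head size
    layers.Dense(5, activation="softmax"),  # five cassava classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```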

Figure 4: VGG Architecture.
Figure 5: Results of VGG.

ResNet Model

For our second transfer learning model we implemented an architecture based on ResNet50, a common CNN architecture that is defined by its use of residual connections between blocks. The use of residual connections is a departure from typical feedforward network design principles because it allows the input to a block to be added directly to its output.
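The idea can be seen in a minimal identity block, a simplified version of what ResNet50 uses (filter counts and kernel sizes here are illustrative): the block's input skips over the convolutions and is added to their output before the final activation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Identity residual block: output = relu(conv(conv(x)) + x)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])       # the residual connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
```

Because the shortcut passes gradients straight through the addition, very deep stacks of these blocks remain trainable, which is what lets ResNets reach 50+ layers.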

Figure 6: Diagram of Residual Connection [6].
Figure 7: Diagram of Nested Function Classes [6].
Figure 8: Results of ResNet.


EfficientNet Model

One of the models we trained is from the family of models called EfficientNets, which often perform more accurately and efficiently than previous ConvNets. EfficientNet is a CNN that uses a compound coefficient (Φ) to uniformly scale network depth, width, and resolution. For example, if we wanted to increase computational resources by 2^N, we can use constant coefficients (α, β, γ, scaled together) to increase network depth by α^N, width by β^N, and image size by γ^N. We use compound scaling to capture fine-grained patterns on larger images (by increasing channels) and increase the receptive field (by increasing layers) [2].
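The arithmetic behind compound scaling is simple enough to sketch directly. The constants below are the ones reported in the EfficientNet paper [2], found by a small grid search under the constraint α · β² · γ² ≈ 2, so that raising the compound coefficient by one roughly doubles FLOPs.

```python
# Compound scaling per the EfficientNet paper [2]:
# depth scales by alpha**phi, width by beta**phi, resolution by gamma**phi.
alpha, beta, gamma = 1.2, 1.1, 1.15  # grid-searched constants from [2]

def compound_scale(phi):
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs grow with depth * width^2 * resolution^2.
    flops_factor = (alpha * beta**2 * gamma**2) ** phi
    return depth, width, resolution, flops_factor

d, w, r, f = compound_scale(phi=1)
# alpha * beta^2 * gamma^2 = 1.2 * 1.21 * 1.3225 ≈ 1.92, close to 2
```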

Figure 9: Displaying Model Scaling Techniques [2].
Figure 10: EfficientNet Architecture, Left [1], Right[5].
Figure 11: Results of EfficientNet.


InceptionNet Model

Another base network architecture that we employed for one of our transfer learning models was the InceptionV3 application from TensorFlow. This type of architecture was first proposed with the intent of reducing the number of trainable parameters of large networks such as VGG while maintaining high accuracy. To this end, the authors proposed a novel method of extracting features at multiple scales by applying convolutional kernels of varying sizes to the same layer and subsequently concatenating the results. Examples of inception blocks used for this project are shown below, as well as the final architecture for the base. Note the varying convolutional kernel shapes used on the same input.
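The parallel-branch idea can be sketched as a simplified inception-style block. The filter counts below are illustrative, not taken from InceptionV3 itself: several kernel sizes run over the same input, and their outputs are concatenated along the channel axis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x):
    """Parallel convolutions of different sizes, concatenated channel-wise."""
    b1 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(x)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)
    # Same spatial size in every branch, so channels can be concatenated.
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(32, 32, 64))
model = tf.keras.Model(inputs, inception_block(inputs))
```

The 1x1 convolutions double as cheap channel-reduction steps, which is how Inception keeps its parameter count well below VGG's despite the extra branches.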

Figure 12: InceptionNet Architecture [9].
Figure 13: Results of InceptionNet.

Next Steps…

Overall, our models could have performed better. Our next steps include:

  1. Changing EfficientNet-B0 to EfficientNet-B7: EfficientNet-B7 has a state-of-the-art 84.3% top-1 accuracy on ImageNet [1].
  2. Trying classical methods (e.g., K-Nearest Neighbors and support vector classifiers), using our trained models to create feature embeddings that will be used as input.
  3. Ensemble modeling (majority voting over final layers OR average of softmax predictions).
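The softmax-averaging variant of step 3 can be sketched in a few lines of NumPy. The probabilities below are made-up numbers standing in for three models' predictions on the same image.

```python
import numpy as np

# Each row: one model's softmax output over the five classes
# (CBB, CBSD, CGM, CMD, healthy). Values are illustrative.
preds = np.array([
    [0.10, 0.05, 0.05, 0.70, 0.10],  # model A
    [0.20, 0.10, 0.10, 0.50, 0.10],  # model B
    [0.05, 0.15, 0.10, 0.60, 0.10],  # model C
])

avg = preds.mean(axis=0)              # average the softmax distributions
predicted_class = int(avg.argmax())   # index 3 here, i.e. CMD
```

Averaging probabilities (rather than majority voting over hard labels) lets a confident model outvote two uncertain ones, which often helps on ambiguous images.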
