Cassava Leaf Disease Classification (Part 1)

Blog 1: Exploratory Data Analysis and a Baseline Model
By Jillian Green, Chris Rohlicek, and Cameron Webster

Introduction

Over the course of this blog post series, our goal is to distinguish healthy cassava plants from diseased ones using the Cassava Leaf Disease Classification data from Kaggle. Cassava is a crop grown by African smallholder farmers: it is a carbohydrate source found on at least 80% of household farms in Sub-Saharan Africa. Unfortunately, the starchy root is prone to viral diseases. If only farmers could reliably identify these diseases…

About the Dataset

Our goal is to correctly classify the images into five categories: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and healthy leaf. Properly classifying images will help farmers identify healthy plants, and hopefully save crops from further damage.

The data includes the following:

  • train_images, a folder of 21,397 training images
  • test_images, a folder of roughly 15,000 test images
  • train.csv, which lists each image file name (image_id) and the ID code for its disease (label)
  • [train/test]_tfrecords, the same images in TFRecord format
  • label_num_to_disease_map.json, which maps each disease code to a disease name
  • sample_submission, a properly formatted example of the final predictions: the image file name (image_id) and the predicted disease ID code (label)
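To get oriented, a few lines of pandas are enough to load the label file and the class-name mapping. This is a minimal sketch; the DATA_DIR path assumes the default Kaggle competition directory layout.

    import json
    import pandas as pd

    # Path assumes the default Kaggle competition directory layout.
    DATA_DIR = "../input/cassava-leaf-disease-classification"

    # train.csv: one row per training image (image_id, label).
    train_df = pd.read_csv(f"{DATA_DIR}/train.csv")

    # Map each numeric disease code to its human-readable name.
    with open(f"{DATA_DIR}/label_num_to_disease_map.json") as f:
        label_map = {int(k): v for k, v in json.load(f).items()}

    train_df["class_name"] = train_df["label"].map(label_map)
    print(train_df.head())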

Exploratory Data Analysis (EDA)

First, we display the breakdown of the five classes. Figure 1 makes the class imbalance clear: CMD alone accounts for roughly 61% of the training images, appearing about five to twelve times more often than each of the other classes. In raw counts, we have 13,158 CMD, 2,577 Healthy, 2,386 CGM, 2,189 CBSD, and 1,087 CBB.

Figure 1: This image breaks down Relative Frequency by each of the 5 classifications: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and healthy leaf.
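The counts above come straight from train.csv. A short snippet like the following, reusing the train_df loaded in the sketch above, reproduces the relative-frequency breakdown shown in Figure 1:

    import matplotlib.pyplot as plt

    # Raw counts and relative frequencies per class.
    counts = train_df["class_name"].value_counts()
    print(counts)
    print(counts / counts.sum())

    # Bar chart of relative frequency, as in Figure 1.
    (counts / counts.sum()).plot(kind="bar")
    plt.ylabel("Relative frequency")
    plt.tight_layout()
    plt.show()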

Before building a model, we should explore some of the images in our dataset. Figure 2 displays five sample images per class. To the average human eye, images from different classes can look remarkably alike; for example, in Figure 2 healthy plant #4 and CBB plant #2 closely resemble each other. Can you find other similarities?

Figure 2, classified image examples for each of the five classification groups.
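A grid like Figure 2 can be produced with a simple loop over the classes. The sketch below takes the first five file names per class as examples (not necessarily the same images shown in Figure 2) and reads them with PIL:

    from PIL import Image

    fig, axes = plt.subplots(5, 5, figsize=(15, 15))
    for row, (label, name) in enumerate(sorted(label_map.items())):
        # First five images listed for this class; any five would do.
        samples = train_df[train_df["label"] == label]["image_id"].head(5)
        for col, image_id in enumerate(samples):
            img = Image.open(f"{DATA_DIR}/train_images/{image_id}")
            axes[row, col].imshow(img)
            axes[row, col].set_title(f"{name} #{col + 1}", fontsize=8)
            axes[row, col].axis("off")
    plt.tight_layout()
    plt.show()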

Could you tell the classes apart? It was probably tricky, which is exactly why our model needs to distinguish the classes better than the naked eye can. We can begin this process in EDA by breaking the images down into pixels. Using Python, we can convert an image into a NumPy array that holds the value of each pixel. For example,

Figure 3, Example of Image Converted to NumPy Array.

is an array for one image. The shape of this array is (600, 800, 3), which represents the height, width, and number of color channels, respectively. Representing an image as pixel values lets us explore the images and the dataset further.
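Converting an image to its pixel array takes only a couple of lines. This sketch uses PIL and NumPy on the first training image; any image_id would work the same way:

    import numpy as np
    from PIL import Image

    image_id = train_df["image_id"].iloc[0]
    arr = np.array(Image.open(f"{DATA_DIR}/train_images/{image_id}"))

    print(arr.shape)    # (600, 800, 3): height, width, RGB channels
    print(arr[:2, :2])  # pixel values are integers in the 0-255 range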

We can use the pixel values to distinguish typical and atypical images. We consider an image “typical” when its average pixel value is close to the class-wide average and “atypical” when its average pixel value is far from it. We find examples of typical images by looping through each class in our dataset and selecting the images whose means are closest to the class mean; similarly, we find atypical images by selecting those farthest from it. Can you guess which are typical and atypical below?

Figure 4, typical (top image) and atypical (bottom image) for each class name.
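The selection logic is straightforward: compute each image's mean pixel value, compare it to the class-wide mean, and keep the closest and farthest images. The sketch below samples 100 images per class purely to keep the loop fast; that sample size is an assumption for illustration:

    def mean_pixel(image_id):
        """Average pixel value of one training image."""
        return np.array(Image.open(f"{DATA_DIR}/train_images/{image_id}")).mean()

    for label, name in sorted(label_map.items()):
        ids = train_df[train_df["label"] == label]["image_id"].sample(100, random_state=0)
        means = ids.apply(mean_pixel)
        distance = (means - means.mean()).abs()
        print(name,
              "| typical:", ids.loc[distance.idxmin()],
              "| atypical:", ids.loc[distance.idxmax()])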

Next, we use a sample of 500 images to calculate the mean and standard deviation of each image's pixel values (i.e. of its NumPy array), grouped by class. Graphing the results in Figure 5, we don't see natural clustering in the data when looking at these pixel value statistics.

Figure 5, Pixel Value Distribution for 500 samples. Scatter plot was inspired by Kaggle Notebook [3].
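Figure 5 can be reproduced from a small random sample. This sketch draws 500 images, computes each image's overall mean and standard deviation, and scatters them colored by class:

    sample = train_df.sample(500, random_state=0)

    stats = []
    for _, row in sample.iterrows():
        arr = np.array(Image.open(f"{DATA_DIR}/train_images/{row['image_id']}"))
        stats.append((row["class_name"], arr.mean(), arr.std()))

    stats_df = pd.DataFrame(stats, columns=["class_name", "mean", "std"])
    for name, group in stats_df.groupby("class_name"):
        plt.scatter(group["mean"], group["std"], label=name, s=10)
    plt.xlabel("Mean pixel value")
    plt.ylabel("Pixel value standard deviation")
    plt.legend()
    plt.show()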

To further investigate our pixel data, we find the mean, standard deviation, skewness, red value, green value, and blue value of each image. Skewness often reflects the image surface: a darker or glossier surface is more positively skewed than a lighter or matte one. The color values represent the mean intensity of each color channel.

Figure 6, First five rows of dataset used to compare pixel values over all training images. Use of metadata inspired by Kaggle Notebook [2].
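Each row of that table can be computed directly from the pixel array. The sketch below reuses the 500-image sample from above and takes the skewness of the flattened pixel values via scipy.stats.skew:

    from scipy.stats import skew

    def image_features(image_id):
        arr = np.array(Image.open(f"{DATA_DIR}/train_images/{image_id}"), dtype=np.float32)
        return {
            "image_id": image_id,
            "mean": arr.mean(),
            "std": arr.std(),
            "skew": skew(arr.ravel()),
            "red": arr[..., 0].mean(),    # mean intensity of the red channel
            "green": arr[..., 1].mean(),
            "blue": arr[..., 2].mean(),
        }

    meta_df = pd.DataFrame([image_features(i) for i in sample["image_id"]])
    print(meta_df.head())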

Figure 7 displays the mean and standard deviation across all pixels in all channels for each class. We see that the classes have comparable ranges of mean and standard deviation for the pixel values, and each class has a number of outliers in both metrics.

Figure 7, Boxplot of pixel value mean (left), standard deviation (right), over all training images. Box plot was inspired by Kaggle Notebook [3].

Finally, we break down the pixel value mean by color channel. Each pixel's color is represented by the intensity of its red, green, and blue channels. In Figure 8 we observe that the medians per channel have similar values across the five classes, and the relative differences between classes are consistent across all three channels.

Figure 8, Boxplot of mean across all pixels for each class, by color channel.
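The per-channel comparison in Figure 8 follows from the same metadata table. A sketch, merging the class names back in and reshaping to long form so seaborn can group the boxes by channel:

    import seaborn as sns

    channel_df = meta_df.merge(train_df[["image_id", "class_name"]], on="image_id")
    long_df = channel_df.melt(id_vars="class_name",
                              value_vars=["red", "green", "blue"],
                              var_name="channel", value_name="channel_mean")

    sns.boxplot(data=long_df, x="class_name", y="channel_mean", hue="channel")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()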

Baseline Model

To create a more developed baseline model, we adapted the LeNet-5 architecture to operate on our larger image size. Our convolutional neural network uses ReLU activations, padding, pooling, and fully connected layers. The baseline model architecture and code can be seen in Figure 9.

Fig 9, Baseline Model Architecture (left) and code (right).
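For readers who prefer code to diagrams, the sketch below shows the general shape of a LeNet-5-style network adapted to larger inputs in Keras. The layer sizes and the 256x256 input resolution are illustrative assumptions, not an exact copy of the architecture in Figure 9:

    from tensorflow.keras import layers, models

    # Illustrative LeNet-5-style baseline; sizes are assumptions, not Figure 9 verbatim.
    model = models.Sequential([
        layers.Input(shape=(256, 256, 3)),
        layers.Conv2D(6, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(16, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        layers.Dense(5, activation="softmax"),   # one output per disease class
    ])
    model.summary()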

The baseline model was trained on the 21,397 training images and has 28,137,549 trainable parameters. When compiling the model, we chose Adam as our optimizer with a learning rate of 0.001, categorical cross-entropy as our loss function, and accuracy as our metric. To keep training time manageable, we set the number of epochs to 3 and used a static learning rate. The results in Figure 10 show a final val_accuracy of 0.5999 and a final val_loss of 1.7663.

Figure 10, Baseline Model Training Output.
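The compile and fit calls mirror those choices. In this sketch, train_ds and val_ds are placeholders for whatever pipeline feeds the model (e.g. tf.data datasets yielding image batches with one-hot labels); they are not defined here:

    from tensorflow.keras.optimizers import Adam

    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # train_ds / val_ds: placeholder training and validation datasets.
    history = model.fit(train_ds, validation_data=val_ds, epochs=3)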

We can also examine model performance by plotting loss and accuracy against epoch (see Figure 11).

Fig 11, Model Performance graphs for loss (left) and accuracy (right).
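Both curves come straight out of the History object returned by model.fit, as in this sketch:

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ax1.plot(history.history["loss"], label="train")
    ax1.plot(history.history["val_loss"], label="validation")
    ax1.set_xlabel("Epoch")
    ax1.set_ylabel("Loss")
    ax1.legend()

    ax2.plot(history.history["accuracy"], label="train")
    ax2.plot(history.history["val_accuracy"], label="validation")
    ax2.set_xlabel("Epoch")
    ax2.set_ylabel("Accuracy")
    ax2.legend()

    plt.show()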

Conclusion & Next Steps

Fig 12, Possible Methods for Improved Performance, stay tuned for Blog #2!

All code can be found in this GitHub repository, and the Kaggle competition can be found here!

Blog Post 2: here!
Blog Post 3: here!
