Have you ever run into an unfamiliar plant and wondered: is this plant safe? Nowadays, apps like PlantSnap and Leafsnap can take a picture and provide information about the particular species.
By the end of our blog post series, our goal is to distinguish healthy plants from diseased plants using the Cassava Leaf Disease Classification data from Kaggle. Cassava is a crop grown by African smallholder farmers, a carbohydrate staple grown on at least 80% of household farms in Sub-Saharan Africa. Unfortunately, the starchy root is prone to viral diseases. If only farmers could reliably identify these diseases in the field…
About the Dataset
The dataset consists of 21,367 images collected during a survey in Uganda. The majority of the images come directly from farmers, who often only have access to low-bandwidth cameras. These images were then annotated by experts at the National Crops Resources Research Institute (NaCRRI) and an AI lab at Makerere University, Kampala.
Our goal is to correctly classify the images into five categories: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and healthy leaf. Properly classifying images will help farmers identify healthy plants and hopefully save crops from further damage.
The data includes the following:
- train_images with 21,367 images
- test_images with approximately 15k images
- train.csv which includes image file name (image_id) and the ID code for the disease (label)
- [train/test]_tfrecords which are the image files in tfrecord format
- label_num_to_disease_map.json which maps disease codes to disease names
- sample_submission which shows the required format for the final predictions: the image file name (image_id) and the predicted ID code for the disease (label)
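To make this concrete, here is a minimal sketch of loading the tabular files with pandas, assuming the standard Kaggle directory layout (the file paths are assumptions):

```python
import json

import pandas as pd

# Load the training labels: one row per image, columns image_id and label.
train = pd.read_csv("train.csv")

# Map numeric disease codes to human-readable names.
# The JSON keys arrive as strings, so cast them to int.
with open("label_num_to_disease_map.json") as f:
    label_map = {int(k): v for k, v in json.load(f).items()}

print(train.head())
print(label_map)
```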
Exploratory Data Analysis (EDA)
Before creating our model, we analyze and investigate the dataset and summarize our findings. This is an important step in any data science project because it allows us to better understand the data, its outliers, and any errors. We begin with high-level summaries and then dive deeper into more specific analyses.
First, we display the breakdown of the five classifications. We can clearly see in Figure 1 that CMD dominates the dataset. In raw counts, we have 13,158 CMD, 2,577 Healthy, 2,386 CGM, 2,189 CBSD, and 1,087 CBB, so CMD appears roughly five times as often as the next most common class.
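A quick way to reproduce these counts, reusing the `train` dataframe and `label_map` from the loading sketch above:

```python
# Count images per class and attach readable names.
counts = train["label"].value_counts()
counts.index = counts.index.map(label_map)
print(counts)
```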
Before creating a model to predict image classification, we should explore some images in our dataset. Figure 2 displays five sample images per class. To the average human eye, images from different classes can look remarkably similar; for example, in Figure 2, healthy plant #4 and CBB plant #2 highly resemble each other. Can you find other similarities?
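Here is one way to build such a grid of sample images, a sketch using Pillow and Matplotlib (the train_images/ path follows the Kaggle layout):

```python
import matplotlib.pyplot as plt
from PIL import Image

n_samples = 5
classes = sorted(train["label"].unique())
fig, axes = plt.subplots(len(classes), n_samples, figsize=(15, 12))
for row, cls in enumerate(classes):
    # Take the first few images of each class; any sample would do.
    ids = train.loc[train["label"] == cls, "image_id"].head(n_samples)
    for col, image_id in enumerate(ids):
        axes[row, col].imshow(Image.open(f"train_images/{image_id}"))
        axes[row, col].set_axis_off()
    axes[row, 0].set_title(label_map[cls], fontsize=9, loc="left")
plt.tight_layout()
plt.show()
```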
Could you tell the difference? It might have been tricky. Our model must be trained to see better than the naked eye and to distinguish between the classifications. We can begin this process in EDA by breaking the images down into pixels. Using Python, we can convert an image into a NumPy array that represents the value of each pixel.
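For example, a minimal sketch using Pillow and NumPy (the file name is illustrative):

```python
import numpy as np
from PIL import Image

# Open one training image and view its raw pixel values.
img = Image.open("train_images/1000015157.jpg")  # illustrative file name
pixels = np.asarray(img)
print(pixels)
print(pixels.shape)  # (600, 800, 3)
```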
The printed output is the pixel array for one image. Its shape is (600, 800, 3), which represents height, width, and channel (color), respectively. Representing an image as pixel values can help us explore the images and the dataset further.
We can use the pixel values to distinguish typical and atypical images. We consider an image “typical” when its average pixel value is close to the class-wide average and “atypical” when its average pixel value is far from the class-wide average. We can find typical images by looping through each class in our dataset and finding the image whose mean is at the minimum distance from the class mean; similarly, we find atypical images by taking the maximum distance. Can you guess which of the images below are typical and which are atypical?
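A sketch of that search, reusing `train` and `label_map` from above (scanning every image is slow, so sampling each class first is reasonable):

```python
import numpy as np
from PIL import Image

def image_mean(image_id):
    """Average pixel value across all pixels and channels."""
    arr = np.asarray(Image.open(f"train_images/{image_id}"), dtype=np.float32)
    return arr.mean()

for cls in sorted(train["label"].unique()):
    subset = train[train["label"] == cls].copy()
    subset["mean"] = subset["image_id"].map(image_mean)
    # Distance from each image's mean to the class-wide mean.
    distance = (subset["mean"] - subset["mean"].mean()).abs()
    print(label_map[cls],
          "typical:", subset.loc[distance.idxmin(), "image_id"],
          "atypical:", subset.loc[distance.idxmax(), "image_id"])
```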
Next, we use the pixel values of a sample of 500 images to calculate the mean and standard deviation of the NumPy arrays (i.e., the pixel values of each image) by class. Graphing the results in Figure 5, we see no natural clustering by class in these pixel-value statistics.
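The scatter in Figure 5 can be reproduced roughly like this (a sketch; the random seed is arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Draw a 500-image sample and compute per-image mean and std.
sample = train.sample(500, random_state=42).copy()
means, stds = [], []
for image_id in sample["image_id"]:
    arr = np.asarray(Image.open(f"train_images/{image_id}"), dtype=np.float32)
    means.append(arr.mean())
    stds.append(arr.std())
sample["mean"], sample["std"] = means, stds

# One scatter series per class.
for cls, group in sample.groupby("label"):
    plt.scatter(group["mean"], group["std"], s=12, label=label_map[cls])
plt.xlabel("Mean pixel value")
plt.ylabel("Standard deviation of pixel values")
plt.legend()
plt.show()
```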
To further investigate our pixel data, we find the mean, standard deviation, skewness, and the red, green, and blue channel values of each image. Skewness often reflects the image surface: a darker or glossier surface produces a more positively skewed distribution than a lighter or matte one. The channel values represent the intensity of each particular color.
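A sketch of computing those per-image statistics on the same 500-image sample (skewness via scipy.stats.skew):

```python
import numpy as np
import pandas as pd
from PIL import Image
from scipy.stats import skew

def image_stats(image_id):
    arr = np.asarray(Image.open(f"train_images/{image_id}"), dtype=np.float32)
    return {
        "mean": arr.mean(),
        "std": arr.std(),
        "skew": skew(arr.ravel()),   # skew over all pixel values
        "red": arr[..., 0].mean(),   # per-channel mean intensities
        "green": arr[..., 1].mean(),
        "blue": arr[..., 2].mean(),
    }

stats = pd.DataFrame([image_stats(i) for i in sample["image_id"]])
stats["class"] = sample["label"].map(label_map).to_numpy()
```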
Figure 7 displays the mean and standard deviation across all pixels in all channels for each class. We see that the classes have comparable ranges of mean and standard deviation for the pixel values, and each class has a number of outliers in both metrics.
Finally, we break down the pixel value mean by channel color. Each pixel is represented in color by the intensity of the three color channels: red, green, and blue. In Figure 8 we observe that the medians per channel have similar values across the five classes, and the relative differences between the classes are similar across all three channels.
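The per-channel comparison in Figure 8 can be approximated from the `stats` dataframe above:

```python
import matplotlib.pyplot as plt

# One boxplot panel per channel, grouped by class.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, channel in zip(axes, ["red", "green", "blue"]):
    stats.boxplot(column=channel, by="class", ax=ax, rot=90)
    ax.set_title(f"Mean {channel} value per image")
    ax.set_xlabel("")
plt.suptitle("")  # drop pandas' automatic "grouped by" title
plt.tight_layout()
plt.show()
```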
Baseline Models
As a first sanity check, we use a simple majority-class classifier, which predicts label 3 (Cassava Mosaic Disease, or CMD) for every image. This makes sense since CMD is by far the most common class, as seen in our EDA.
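The baseline itself is nearly a one-liner (a sketch; accuracy here is measured against the training labels):

```python
# Always predict the most frequent class.
majority = train["label"].mode()[0]             # 3 -> CMD
accuracy = (train["label"] == majority).mean()  # ~0.62 (13,158 / 21,367)
print(majority, accuracy)
```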
To create a more developed baseline model, we adapted the LeNet-5 architecture to operate on our larger image size. Our convolutional neural network uses ReLU activations, padding, pooling layers, and fully connected layers. The baseline model architecture and code can be seen in Figure 9.
The baseline model was trained on 21,367 samples and has 28,137,549 trainable parameters. When compiling our model, we chose Adam as our optimizer with a learning rate of 0.001, categorical cross entropy as our loss function, and accuracy as our metric. To keep training time manageable, we set epochs to 3 and used a static learning rate. The results in Figure 10 show a final val_accuracy of 0.5999 and a final val_loss of 1.7663.
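For reference, a minimal sketch of a LeNet-5-style network in Keras, adapted to the full 600x800x3 input; the layer sizes here are illustrative rather than a verbatim copy of the code in Figure 9:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(6, kernel_size=5, padding="same", activation="relu",
                  input_shape=(600, 800, 3)),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(84, activation="relu"),
    layers.Dense(5, activation="softmax"),  # five classes
])

# Categorical cross entropy expects one-hot encoded labels.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```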
We can also view our model's performance by plotting loss and accuracy against epoch (see Figure 11).
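A sketch of those plots from the History object returned by model.fit (assumed stored in `history`):

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["loss"], label="train")
ax1.plot(history.history["val_loss"], label="validation")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()
ax2.plot(history.history["accuracy"], label="train")
ax2.plot(history.history["val_accuracy"], label="validation")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Accuracy")
ax2.legend()
plt.tight_layout()
plt.show()
```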
Conclusion & Next Steps
Overall, our baseline model did not perform very well. Next, we want to tune parameters and hyperparameters and investigate different models and architectures to improve performance.
Helpful Kaggle Notebooks
Inspiration for Future Models