Dan Bull, Senior Data Scientist
Here at Lynker-Analytics we have long pondered the best way to use more information than a simple 3-band RGB image to train a CNN model. But why, you ask, is this an issue? Surely a four-band or multi-band image is intuitively better than plain old three bands? Surely the more information a model has, the better it will be?
Well, yes, you are right in that assumption. However, in using the extra bands we run into a problem: how do we leverage the power of transfer learning in a multiband model? In this blog post I will outline an approach that uses multiband data to segment imagery into a 7-class output and map wetlands in Colorado.
Transfer learning is the process whereby models are fine-tuned starting from weights learned in previous training on massive datasets. At Lynker-Analytics we typically use ImageNet weights, which allow us to train a 3-band RGB model within hours using only a few thousand labelled examples. Of course we have the option of training the model without pretrained weights, but the result is typically 3-5% lower accuracy and often misses crucial details in the desired output.
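As a concrete illustration of where those weights slot in, here is a minimal sketch of the standard 3-band setup, assuming a Keras workflow with an off-the-shelf ResNet50 backbone (this is not our production architecture, just the usual pattern):

```python
from tensorflow import keras

# Standard 3-band transfer learning: a backbone initialised from ImageNet weights.
backbone = keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",           # pretrained on ImageNet, which is 3-band RGB only
    input_shape=(256, 256, 3),    # illustrative patch size
)

# A simple 7-class head on top of the pretrained features. Because training starts
# from ImageNet features rather than random initialisation, a few thousand labelled
# examples are enough.
x = keras.layers.GlobalAveragePooling2D()(backbone.output)
outputs = keras.layers.Dense(7, activation="softmax")(x)
model = keras.Model(backbone.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The catch is that `input_shape=(256, 256, 3)`: the ImageNet weights only exist for a 3-band input, which is exactly the problem the rest of this post is about.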
In this Colorado project, we wished to improve on previous work by adding additional information to our models. After analysing information-rich input options we decided to use Digital Elevation Model (DEM) derived data, and settled on slope, Depth to Water (DTW), and Topographic Wetness Index (TWI). DTW and TWI are derived from elevation data, and essentially measure how likely an area is to be wet based on slope and where water might flow.
Brilliant. This is highly likely to help find wetlands. In fact, using this data is a whole other blog post, so I won't expand on these concepts here. Suffice it to say, this is useful information for a model. It meant we had a total of 7 bands to train on: four image bands plus the three DEM indices.
So how do we make this work in a CNN model? Well, the options include not using transfer learning at all, training a 7-band CNN using ImageNet weights (which were designed for three bands), or coming up with a way of creating our own weights for a 7-band input model so that transfer learning is still possible. We opted for the last approach.
So it turns out that other people have had the same idea. This paper used a masked autoencoder to create weights. Essentially, an image is fed into one end of the model, parts of the image are randomly obscured, and the model is forced to recreate the original image. Through this training process, weights are created that encapsulate the relationships between the different bands, in a self-supervised fashion. These pretrained weights can then be used as the starting point for fine-tuning on the wetland classes.
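In code, the core idea is compact. Here is a toy sketch of the masked-reconstruction objective, with a deliberately tiny encoder-decoder and random stand-in data in place of the real network and imagery; the essential ingredients are the mask on the input and the reconstruction loss against the unmasked original:

```python
import numpy as np
from tensorflow import keras

N_BANDS = 7  # e.g. 4 image bands + 3 DEM-derived indices

# A deliberately tiny encoder-decoder standing in for the real network.
inputs = keras.Input(shape=(128, 128, N_BANDS))
x = keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
x = keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
outputs = keras.layers.Conv2DTranspose(N_BANDS, 3, strides=2, padding="same")(x)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruct the unmasked original

# Self-supervised pairs: the input is a masked copy, the target is the original.
images = np.random.rand(8, 128, 128, N_BANDS).astype("float32")   # stand-in data
mask = (np.random.rand(*images.shape[:3]) > 0.5)[..., None].astype("float32")
autoencoder.fit(images * mask, images, epochs=1, verbose=0)
```

The model never sees a label; the imagery itself provides the supervision.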
In our sphere of existence, aka geospatial data, we have a built-in advantage: there is a tonne of satellite and aerial imagery available for use. In general, the training bottleneck is a shortage of labelled data to train against, not a shortage of imagery. Not only that, but imagery is often available from an API, so pipelines can easily be built to use it.
In this project we used NAIP imagery, freely available from USGS, in conjunction with the DEM index bands, also freely available (although time-consuming to assemble). To say that processing the DEM indices for use in the model was a non-trivial task is probably an understatement, but I'll save the gory details for another post.
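For context, once the NAIP tiles and the DEM indices are on a common grid, assembling the 7-band input is just a band-stacking exercise. A rough sketch with rasterio, using hypothetical file names and assuming rasters that are already co-registered to the same extent and resolution (which is where most of the pain lives):

```python
import numpy as np
import rasterio

# Hypothetical file names; the real tiles come out of our NAIP/DEM processing pipeline.
naip_path = "naip_tile.tif"                          # 4 bands: R, G, B, NIR
index_paths = ["slope.tif", "dtw.tif", "twi.tif"]    # DEM-derived indices, same grid

with rasterio.open(naip_path) as src:
    bands = [src.read()]                             # (4, H, W)

for path in index_paths:
    with rasterio.open(path) as src:
        bands.append(src.read(1)[None, ...])         # (1, H, W) each

stack = np.concatenate(bands, axis=0)                # (7, H, W)
patch = np.moveaxis(stack, 0, -1).astype("float32")  # (H, W, 7), channels-last for the model
```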
The model architecture we used for the segmentation task was a highly modified DeepLabV3+. Adapting DeepLabV3+ to become an autoencoder with a random mask was comparatively simple. The model head was replaced with an output that recreated the original image, trained with a mean-squared-error loss function. A function that randomly masked out 50% of the original image (plus the 3 DEM indices) using patches of size 32 x 32 px was inserted into the data pipeline and, whizz bang, this was ready to go.
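A minimal sketch of that masking step (my own simplification rather than the production pipeline code): take a 7-band patch, carve it into 32 x 32 px blocks, and zero out a random half of them before the patch reaches the model.

```python
import numpy as np

def mask_patches(image: np.ndarray, patch_px: int = 32, mask_frac: float = 0.5) -> np.ndarray:
    """Zero out a random fraction of patch_px x patch_px blocks across all bands."""
    h, w, _ = image.shape
    rows, cols = h // patch_px, w // patch_px
    n_blocks = rows * cols
    masked_ids = np.random.choice(n_blocks, size=int(n_blocks * mask_frac), replace=False)

    out = image.copy()
    for idx in masked_ids:
        r, c = divmod(idx, cols)
        out[r * patch_px:(r + 1) * patch_px, c * patch_px:(c + 1) * patch_px, :] = 0.0
    return out

# e.g. a 256 x 256 x 7 tile has 64 blocks of 32 x 32 px; 32 of them get blanked at random
tile = np.random.rand(256, 256, 7).astype("float32")
masked_tile = mask_patches(tile)
```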
We used approximately 250,000 data patches randomly downloaded from NAIP imagery and our previously created DEM indices. This number was arrived at with a thumb-in-the-air approach. It took roughly 4 days to create the data, and another 5-7 days to train a model on AWS. Ideally we would have used more data and longer training, but time is money, and we estimated that this number of patches would capture the diversity of imagery/DEM data in the region without blowing out our training budget too much.
After creating pretrained weights using the above autoencoder, we fine-tuned the model on carefully crafted, human-labelled data covering our seven classes. When fine-tuning with the pretrained weights, we froze the first 525 layers and only trained the top 50 or so layers, i.e. only the top part of the network, or approximately 7 million out of 50 million parameters. So fine-tuning only touched the tip of the iceberg, and the majority of the weights in the neural network come directly from the pretraining (a sketch of this step is below). As an initial comparison, we trained the fine-tune model both with and without the derived pretrained weights.
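Mechanically, the freeze looks something like the following sketch. The file names and optimizer settings are made up for illustration; the layer and parameter counts are the real ones from the paragraph above.

```python
from tensorflow import keras

# Hypothetical file names; the segmentation model shares its backbone layers with the
# autoencoder used for pretraining, so weights transfer across by layer name.
model = keras.models.load_model("deeplab_7band_segmentation.h5", compile=False)
model.load_weights("masked_autoencoder_pretrained.h5", by_name=True, skip_mismatch=True)

# Freeze the first 525 layers; only the top ~50 layers (roughly 7M of the 50M
# parameters) are updated during fine-tuning on the labelled 7-class data.
for layer in model.layers[:525]:
    layer.trainable = False

model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="categorical_crossentropy")
```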
And here is an initial result:
As you can see, the model using pretrained weights better distinguished water (blues), wetland (green), and shoreline (yellow). Not only that, but using the pretrained weights vastly improved the training process: we could train with a larger batch size of 10 rather than 4, as fewer trainable weights needed to be held in GPU memory.