Monday, May 18, 2020

Training from Scratch, pt. 2 - Mechanics

In part 1, I discussed the desire to successfully train MobileNetV2 from scratch, both to act as a baseline for other architectures/variations and ultimately to better capture the specifics of the dataset.

First, it is worthwhile to discuss what it means to "train from scratch" in this context. I am using this as if it were some binary truth, but how well a dataset captures the population, in conjunction with how it trains on a specific network architecture variant, can vary a great deal depending on the network, the image size, the training regimen, and countless other factors. For my specific case, I wanted to achieve accuracy similar to (say, no more than about 1% absolute below) the original finetune against MobileNetV2, alpha=1.0, image size 224, with the same loss functions. The finetune had achieved in the neighborhood of 73% accuracy for the raw classifier, giving a goal of 72-73% for the new training.

But how can this be achieved?

First, I needed to obtain much better hardware. As noted in The Quest for New Hardware series, I had already obtained an old GPU server with a K80 in it, with roughly an effective 21 GB of GPU RAM. This was a giant step up from the Jetson TX1 dev kit I had previously, and it allowed for a much larger batch size.

Second, I needed to consider the various changes typically made when training from scratch. Surveying several results led me to broadly make the following theoretical and practical changes:
  1. Switch away from Adam to SGD for best generalization, albeit at a likely cost of training speed.
  2. As part of introducing SGD, introduce a schedule to reduce the learning rate.
  3. Greatly increase batch size. Increasing batch size has several side effects, including (among others) a change in implicit regularization (larger batches mean less gradient noise), a need to revisit overall learning rate, and of course a nice speed boost.
  4. On the practical side, greatly increasing the batch size also required changing the data pipeline significantly to keep up: this meant switching to a tf.data approach.
Interestingly, these are almost all inter-related. Unsurprisingly, the culprit for this interdependence is batch size: it affects both the base learning rate and loading performance. After some trial and error, I settled on a batch size of 192. This worked well not only with the GPU RAM, but also with system memory (64 GB DDR3), cores (16-core 2.6GHz, hyperthreaded), storage (WD SSD Red), and of course the variant of MobileNetV2. I'd like to say this was a refined scientific process, but sizing was largely driven by picking sums of powers of 2 and finding what fit without allocation OOM warnings.
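As a rough sketch of how the learning rate ties to batch size and to a decay schedule: the snippet below uses the common linear scaling heuristic and a simple step decay. Every number except the batch size of 192 is an illustrative assumption (reference rate 0.1 at batch 256, a 10x drop every 30 epochs, 90 epochs total), not the values used for this training run.

```python
def scaled_base_lr(batch_size, ref_lr=0.1, ref_batch=256):
    """Linear scaling heuristic: learning rate grows in proportion to batch size."""
    return ref_lr * batch_size / ref_batch


def step_decay(epoch, base_lr, drop=0.1, epochs_per_drop=30):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)


BATCH_SIZE = 192  # the batch size from the post
base_lr = scaled_base_lr(BATCH_SIZE)  # hypothetical starting rate for SGD
schedule = [step_decay(epoch, base_lr) for epoch in range(90)]  # hypothetical 90 epochs
```

In Keras, a function wrapping step_decay could be handed to tf.keras.callbacks.LearningRateScheduler and paired with tf.keras.optimizers.SGD. Treat the result as a starting point for the trial and error described above rather than a recipe.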
Tuning the data pipeline was a bit more scientific than the batch sizing: I watched the overall utilization rate of the GPUs using nvidia-smi. As noted, this did require changing the data loading method to tf.data. Like many others, I was using the older Keras style of PIL and load_img. tf.data is powerful, but lacked some of the image augmentations I had (e.g. rotation) while adding other powerful ones (random JPEG compression artifacts). I think one area where I initially stumbled was understanding the necessary preprocessing. In the past I had more or less used the textbook Keras style:

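import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input  # comes from the relevant application namespace

img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)  # add a batch dimension
x = preprocess_input(x)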

While I expected changes in the loading, I thought that the use of preprocess_input would likely be preserved. Not so. The TensorFlow example for finetuning MobileNetV2 now has the following code:

def format_example(image, label):
  image = tf.cast(image, tf.float32)
  image = (image/127.5) - 1
  image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
  return image, label

It would appear that this first maps onto [0, 255.0], then onto [0, 2.0], then onto [-1, 1]. It seems a bit strange that the specialized preprocess_input would be replaced with this more specific set of code.
Even more interesting, tf.image.convert_image_dtype is also available. My original code had this:

def map_to_image(hash, rating_onehot):
    img_bytes = tf.io.read_file(IMG_ROOT+hash+'.pic')
    img_u8 = tf.io.decode_image(img_bytes, channels=3, expand_animations=False)
    img_float = tf.image.convert_image_dtype(img_u8, tf.float32)  # [0,255] -> [0,1]
    img_resized = tf.image.resize_with_pad(img_float, SIZE, SIZE)
    return img_resized, rating_onehot

So, as noted, this maps onto [0,1]. After working on my results I noticed the apparent discrepancy and moved the range to [-1,1] instead. To my surprise, this greatly dropped training accuracy with the other constants staying the same, leading me to believe the correct range is [0,1] despite indicators to the contrary. I'm still not sure what to make of that; I understand there are different range conventions, but the mismatch has me puzzled.
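For reference, the arithmetic behind the two conventions can be checked in plain Python. This is only a sketch of the scalar math per pixel value; the function names are made up for illustration and are not TensorFlow APIs.

```python
def tutorial_scale(pixel):
    """Mapping used in format_example: [0, 255] -> [-1, 1]."""
    return pixel / 127.5 - 1.0


def unit_scale(pixel):
    """Scaling convert_image_dtype applies for uint8 -> float32: [0, 255] -> [0, 1]."""
    return pixel / 255.0


def unit_to_signed(value):
    """Remap a value that is already in [0, 1] onto [-1, 1]."""
    return value * 2.0 - 1.0


# Print each raw pixel next to its value under the two routes to [-1, 1].
for pixel in (0, 128, 255):
    print(pixel, tutorial_scale(pixel), unit_to_signed(unit_scale(pixel)))
```

Both routes land on the same [-1, 1] values. One general pitfall worth ruling out when switching conventions: applying the /127.5 - 1 form to data that is already in [0, 1] squeezes everything into roughly [-1, -0.992], so it helps to confirm which mapping is being applied to which input range.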

The image pipeline code also had two frustrating aspects. First, there is no WebP support, so it is not possible to get the equivalent of PIL image loading when dealing with a mixed bag of formats. Second, while the image loading code is quite robust, there are a few images PIL was able to open that tf.image.decode_image was not. Unfortunately, when an image failure occurs in the pipeline, you can't really just catch an exception and keep going: it crashes, and you don't get a helpful error message. This required me to create a separate script that tries opening all the images with TF one-by-one first and creates a blacklist of failures, which is then applied when loading up the dataset in training.
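A pre-screening pass of that kind might look roughly like the sketch below. It is not the actual script: the function names are hypothetical, and the decoder is passed in as a callable so the logic stays independent of TensorFlow.

```python
def find_undecodable(paths, decode):
    """Return every path for which `decode(path)` raises an exception."""
    failures = []
    for path in paths:
        try:
            decode(path)
        except Exception:  # broad on purpose: any failure disqualifies the file
            failures.append(path)
    return failures


def drop_blacklisted(paths, blacklist):
    """Filter the dataset listing down to the paths that decoded cleanly."""
    blocked = set(blacklist)
    return [path for path in paths if path not in blocked]
```

In the real pre-pass, decode would wrap something like tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False) run eagerly, where a bad file raises an exception instead of taking down the whole pipeline.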

However, despite its limitations, the performance of tf.data is amazing. I was able to greatly increase overall performance and keep the GPU relatively well-loaded. My processors were nowhere near maxed so there is likely some headroom left there if I get a second GPU. I'm not sure where SSD reading is with respect to becoming a bottleneck, though.
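Watching nvidia-smi by hand works, but the same check can be scripted. The sketch below polls the utilization.gpu field through nvidia-smi's CSV query mode; the helper names and the 80% threshold are assumptions for illustration, not part of the original setup.

```python
import subprocess

# Query mode prints one bare integer percentage per visible GPU.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(text):
    """Parse one integer percentage per line, skipping blanks and non-numeric entries."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def sample_gpu_utilization():
    """Run nvidia-smi once and return the utilization of each visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


def looks_starved(samples, threshold=80):
    """True if mean utilization over all samples is below the (hypothetical) threshold."""
    flat = [value for sample in samples for value in sample]
    return bool(flat) and sum(flat) / len(flat) < threshold
```

Calling sample_gpu_utilization() every few seconds during an epoch and passing the collected samples to looks_starved gives a crude signal of whether the input pipeline is keeping the GPU fed.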

So the change to training from scratch required changing several variables at once, with some trial and error. This took some time, and two other changes occurred along the way. First, I made an update to the dataset, increasing the Q class of images. Second, the sampling strategy changed to oversample a bit more fairly. Unfortunately, both of these muddy the purity of the results, but I am concerned primarily with the general progress.

The net result was a training from scratch of MobileNetV2 achieving 72.7% accuracy.

I was excited by this! Forward progress! While I did not achieve any earth-shattering accuracy improvements, I could now try other architectures, variants, and input sizes and have a reasonable baseline to compare.
