Monday, December 14, 2020

Release 2.0.1 - reCAPTCHA, Zone Switching Fixes

Upon the release of 2.0.0, I got a note from one user indicating that reCAPTCHA was no longer working for them. They recommended I add a whitelist for this - I thought that was a good idea, so I did.

However, while investigating I also found a bug that was potentially causing two different images to get scrambled together while scanning, which significantly impacted results. Ironically, my little Wingman Jr. icon was being flagged! I fixed this issue.

I also found and fixed a small error in the stats calculation that would have caused an unexpected change in zone switching behavior and overall image scan results.

I also added a new open-ended feedback link in the addon itself in the zones menu. I may need to restrict this at some point but I thought it might be good to provide a way for feedback!

Sunday, December 13, 2020

Firefox 83 Fix - Get 2.0!

(Update: 2.0 is out!)

Fix in Progress

There is a fix in progress for Wingman Jr. It changes quite a bit of code, but it should be out shortly. I have a fully working solution and am testing it. If you upgrade, you will likely see a note about hidden tabs; this is expected and is due to the nature of the solution.

What Went Wrong

First, it's important to know that to make the AI work, the computer does a lot of math for each image it scans. A typical computer has more than one type of chip on it that knows how to do math. There's the CPU, which is for general purpose math. Then - on most modern computers - there's the GPU, which is for large numbers of parallel calculations of the same kind - which works great for both graphics/video as well as for AI.

Having a GPU - and indeed a fast GPU - can make a big difference. In some cases, the GPU scans images 10x faster than the CPU, or even more. The AI library I use for running the calculations, Tensorflow.js, is careful to ensure that it goes as fast as possible.

So what happened with Firefox 83?

Prior to Firefox 83, there was a bug in Firefox in certain cases. Basically, there is a special call that you can make that says "give me access to the GPU, but if there's a big performance problem because something about loading the GPU isn't quite right, don't give me access to the GPU at all - just let me know and fail". For the most part, this function was working correctly in Firefox 82 and prior. However, it didn't work in all cases. In some cases, it would give access to the GPU even when it was taking a performance hit.
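
To make that a bit more concrete, here is a minimal sketch - not Wingman's actual code - of the kind of call involved. The failIfMajorPerformanceCaveat flag asks the browser to refuse a GPU-backed context if using it would come with a significant performance penalty:

const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl', { failIfMajorPerformanceCaveat: true });
if (gl === null) {
    // The browser decided the GPU path would be slow (or unavailable) and refused.
    console.log('No "fast" GPU context available - need to fall back to something else.');
}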

In Firefox 83, the Mozilla team fixed the glitch. So wouldn't this make things better?

Not quite. Basically, loading the GPU from the addon's background context wasn't fully supported in some cases. So when Tensorflow.js tried to get access to the GPU in this way, it would now correctly fail and say, "I won't give you access to the GPU because there is a bit of a performance hit". This meant that Tensorflow.js would fall back to doing all the calculations on the CPU, even though the performance-hit GPU would still have been much faster.

If you're one of the unfortunate users like myself that encountered this, it made the browsing experience basically unusable. The whole browser would seem to lock up on loading pages and things would take forever to load.

Partial Mitigation

This spring, Tensorflow.js also released another method of running AI models, the "WASM backend". It still used the CPU, but it did some advanced tricks leveraging features that basically all modern CPUs have, and it made the CPU case much faster. So much faster, in fact, that in some cases it was as good as the GPU or maybe even a tiny bit better. (See here for Google's blog post on the matter.)

I added this as a fallback method for calculation, and it helped some users. But for some users (like myself), the performance with this method is still unbearably slow.
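
For reference, here is a rough sketch of what that fallback ordering can look like with Tensorflow.js. It assumes the optional WASM backend package (@tensorflow/tfjs-backend-wasm) has been loaded, and the function name is illustrative rather than Wingman's actual code:

async function pickBackend() {
    for (const backend of ['webgl', 'wasm', 'cpu']) {
        // setBackend resolves to false if the backend fails to initialize.
        if (await tf.setBackend(backend)) {
            await tf.ready();
            console.log('Using Tensorflow.js backend: ' + tf.getBackend());
            return;
        }
    }
}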

Options

One option I pursued for fixing this was to have Tensorflow.js use the GPU even if there were performance issues noted by Firefox. This loading behavior is not exposed by Tensorflow.js, but they were kind enough to consider adding it as an option.

While this might work for some, it might end up being the wrong choice for others. If it was the wrong choice, then the system should by all rights fall back to the "WASM backend" - but it would not if we forced it to use the GPU. Likely then the right thing to do would be to expose an option in Wingman to pick which method to use, but this makes for a potentially poor default experience.

As the excellent team at Mozilla looked into my bug report and the true nature of the bug unfolded, it became clearer that 1) a real bug had been fixed and 2) existing performance may have already been suboptimal! Additionally, there was a critical realization: it wasn't necessarily that the GPU couldn't be loaded quickly - it's that the addon background setting wasn't set up quite correctly to allow it to do so. This meant that if you could load the GPU in a different setting, it might work as expected - for example, in a "normal" web page setting. So how could this be accomplished?

New Architecture

In the past, there was more or less one place where code would run: the background of the addon. This approach is simple, light, and generally works great. But now we needed to do processing in a "normal" web page - and an addon can create and load web pages, too.

So the solution was to split the code into two parts: the "background" and a "processor" running on a normal page. The two parts need to talk back and forth in deep conversation in order to work. The "background" says things like "here's a request and the data flowing in for it" and the "processor" says things like "here are the scan results you asked for". The addon ecosystem makes this straightforward to accomplish, but it's a lot of plumbing and a large rip up of the existing code.

 I've finished the rewrite and am ensuring the changes are stable. While there is some overhead in this approach (due to the two sides being in conversation), there are also some advantages. One of those is that it is much easier to load more than one processor if needed. So far I have not yet been able to see a performance gain out of this, but in the future I may be able to use the GPU and WASM backends together to see a bit of a performance boost.

 It is probably apparent now why there might be a warning about hidden tabs. Wingman creates the "processor" tabs as web pages so that they work properly, but they're not helpful for the user to see, so it immediately hides them. That's all the tab hiding that Wingman does, but it still drives needing the "tabHide" permission and will prompt a new message after the upgrade.
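
For the curious, here is a rough sketch of the idea. The page name, the message shapes, and scanImage are placeholders rather than Wingman's real code, but the browser.tabs and browser.runtime calls are the standard WebExtension plumbing involved:

// Background side: create the processor page in a tab, then hide it (needs "tabHide").
async function startProcessor() {
    const tab = await browser.tabs.create({
        url: browser.runtime.getURL('processor.html'),
        active: false
    });
    await browser.tabs.hide(tab.id);
    return tab.id;
}

// Background side: ask the processor to scan some image data.
function requestScan(imageBytes) {
    return browser.runtime.sendMessage({ type: 'scan-request', data: imageBytes });
}

// Processor page side: do the work and reply with the result.
browser.runtime.onMessage.addListener(async (message) => {
    if (message.type === 'scan-request') {
        return { type: 'scan-result', result: await scanImage(message.data) };
    }
});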

Final Notes and 2.0

This is a big change in the overall architecture - large enough that I plan to change the version to 2.0 to reflect what has happened.

I may try to squeeze in a couple other features or fixes, but stay tuned for a new release soon.



Friday, November 27, 2020

Release 1.3.0 - Partial Fix for Firefox 83 Slowness!

This is an emergency release in response to Firefox 83.

TL;DR - Firefox 83 broke things for some users and made browsing unbearably slow. Until things get properly fixed, I can make it faster again - but not quite as fast as the plugin was in Firefox 82.

The long version:
This plugin leverages another excellent library, Tensorflow.js, that runs the AI models created for this plugin. Tensorflow.js gives many different ways to run the AI models, called backends. They all give the same prediction, but some backends are much faster than others. The fast backend (WebGL) started failing in Firefox 83 for some users, which caused the default slow backend (CPU) to be used instead. For at least two users, this made the browsing experience so slow as to be unusable.
Fortunately Tensorflow.js recently added support for another relatively fast backend (WASM) that I have found does not seem to fail to load in Firefox 83. I am adding in support for that new backend as a fallback. It is not quite as fast, but makes browsing usable once again.

If you are experiencing issues, please disable the plugin and let me know over at GitHub - thanks!

For the technically curious, the Tensorflow.js team has a great writeup on the introduction of fast SIMD in the WASM backend over at their blog.

 One final note - this version also fixes one issue that caused downloads to sometimes show up as gibberish rather than prompting for download.

Saturday, November 21, 2020

Firefox 83 Problem!

(Update: This problem has been partially worked around, see the later post on 1.3.0)

(Update 2: I have traced this back to the specific change in Firefox 83 that caused the issue and have posted an issue on Mozilla's bug tracker. Please be aware that given the nature of the commit that caused the issue, it's possible that fixing the issue experienced by Wingman Jr. may cause other things to break - so fixing this may not be as easy as it seems.)

(Update 3: I have found a technical workaround and have a full fix in progress - here's the post explaining the changes.)

 

Today my browser updated itself to Firefox 83, and it promptly made the addon unusable! The underlying issue is something related to the way the graphics card, Firefox 83, and possibly Tensorflow.js are interacting. Note this may not affect all users, but if performance suddenly became unusable after Firefox updated itself, this is why.

Workaround: Revert to Firefox 82; otherwise, performance may be poor enough that you have to disable the plugin until this can be resolved.

Things I thought might help, but did not:

  • Updating graphics driver
  • Reverting to an old version of Tensorflow.js. This also means older versions of the addon are unlikely to work either.

Technical details can be found with the bug I am tracking for this.

 Sorry for the inconvenience!

Wednesday, November 4, 2020

Release 1.2.1 - The Case of the Distorted Symbols

International users - this release is a bug fix release for you!
One of you kindly reported that they were seeing special characters such as "ä", "ö", "ü", "ß" and "€" showing up incorrectly as ¿½. This release should fix most instances of that happening, but please comment at https://github.com/wingman-jr-addon/wingman_jr/issues/70 if you are still seeing problems. Thanks!

For the technically curious (or perhaps those who are having trouble falling asleep at night and need something boring to read), here's what was happening. In order to scan images that have been encoded as Base64 data URIs, I fully scan all documents of Content-Type text/html and do search and replace as necessary. However, when I get the document it is as bytes, so I need to handle the decoding from bytes into text myself. All examples out there just use UTF-8 for the TextDecoder, but alas, real life is a bit more complex - the source of this issue was incorrectly decoding non-UTF-8 documents as UTF-8. So now I try to do rudimentary encoding detection based on the "charset" in Content-Type. An interesting followup is that when I turn text back into bytes, I use TextEncoder, which - at present - only supports UTF-8, so I need to make sure the Content-Type gets set appropriately for that.
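
A simplified sketch of that decoding step is below. It assumes the charset, if any, is declared in the Content-Type header, and the function names are illustrative rather than the addon's exact code:

function decodeHtmlBytes(bytes, contentType) {
    // Pull e.g. "windows-1252" out of 'text/html; charset=windows-1252'.
    const match = /charset\s*=\s*"?([^";]+)/i.exec(contentType || '');
    const label = match ? match[1].trim() : 'utf-8';
    let decoder;
    try {
        decoder = new TextDecoder(label);
    } catch (e) {
        decoder = new TextDecoder('utf-8'); // Unknown label - fall back to UTF-8.
    }
    return decoder.decode(bytes);
}

// Going the other way, TextEncoder only produces UTF-8, so the outgoing
// Content-Type must be rewritten to declare charset=utf-8 to match.
function encodeModifiedHtml(html) {
    return new TextEncoder().encode(html);
}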

Note that using only Content-Type for character encoding detection is considerably simpler than the mechanism that browsers use, but it still handles the vast majority of use cases even though it is not perfectly accurate. You can see how it fares against a selection of standardized tests by W3C. Character encoding detection is exceedingly sophisticated - if I still haven't bored you with the details, I recommend checking out the spec for those facing truly persistent insomnia.

Saturday, September 26, 2020

Release 1.2.0

 It's been a while since the last release. I've got a couple small things in this latest 1.2.0 release:

  • A helpful user, Stephen, submitted a feature request to add an on/off button to the main menu. While showing it isn't the default, you can now turn it on in the options. I know I'll find this feature valuable as well! It's useful when most of the time you're browsing safe sites but then need to go to e.g. a photo site of some sort to find some content. See the GitHub issue here. Thanks Stephen!
  • As you may have noticed, the image score for blocked images almost always shows "99" since the release of the zones feature. I finally got back to adding a bit better approximation of the image score.
  • The key library for the AI part, Tensorflow.js, has seen an upgrade from version 1.x to 2.x in order to make sure this plugin will continue to be compatible with it.

The AI model was not changed, so no change in filtering quality is expected. However, if you're into machine learning, you may be interested to know that I've released the model into its own repository now, too.

As always, feel free to contact me at the GitHub project site: https://github.com/wingman-jr-addon/wingman_jr

Monday, May 18, 2020

Training from Scratch, pt. 2 - Mechanics

In part 1, I discussed the desire to successfully train MobileNetV2 from scratch, both to act as a baseline for other architectures/variations and to ultimately better capture the specifics of the dataset.

First, it is worthwhile to discuss what it means to "train from scratch" in this context. I am using the term as if it were some binary truth, but how well a dataset captures the population - in conjunction with how it trains on a specific network architecture variant - can vary a great deal depending on the network, the image size, the training regimen, and countless other factors. For my specific case, I wanted to achieve similar accuracy (say, within about 1% absolute accuracy) to the original finetune against MobileNetV2, alpha=1.0, image size 224 with the same loss functions. The finetune had achieved in the neighborhood of 73% accuracy for the raw classifier, giving a goal of 72-73% for the new training.

But how can this be achieved?

First, I needed to obtain much better hardware. As noted in The Quest for New Hardware series, I had already obtained an old GPU server with a K80 in it with roughly an effective 21 GB of GPU RAM. This was a giant step up from the Jetson TX1 dev kit I had previously and allows for greatly increasing batch size.

Second, I needed to consider the various changes typically made when training from scratch. Surveying several results led me to broadly make the following theoretical and practical changes:
  1. Switch away from Adam to SGD for best generalization, albeit at a likely cost of training speed.
  2. As part of introducing SGD, introduce a schedule to reduce the learning rate.
  3. Greatly increase batch size. Increasing batch size has several side effects, including (among others) an increase in regularization, a need to revisit overall learning rate, and of course a nice speed boost.
  4. On the practical side, greatly increasing the batch size also required changing the data pipeline significantly to keep up: this meant switching to a tf.data approach.
Interestingly, nearly all of these are inter-related. Unsurprisingly, the culprit for this interdependence is batch size. Batch size affects base learning rate and loading performance. After some trial and error, I settled on a batch size of 192. This worked well not only with the GPU RAM, but also with system memory (64 GB DDR3), cores (16-core 2.6GHz, hyperthreaded), storage (WD Red SSD), and of course the variant of MobileNetV2. I'd like to say this was a refined scientific process, but sizing was largely driven by picking sums of powers of 2 and finding what fit without warnings about allocation OOMs.
However, data pipelining was a bit more scientific: I watched the overall utilization rate of the GPUs using nvidia-smi. As noted, this did require changing the data loading method to tf.data. Like many others I was using the older Keras style of PIL and load_image. tf.data is powerful, but lacked some of the image augmentations I had (e.g. rotation) while adding other powerful ones (random JPEG compression artifacts). I think one area where I initially stumbled was understanding the necessary preprocessing. In the past I had more or less used the textbook Keras style:

import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) # Note preprocess_input comes from the relevant application namespace

While I expected changes in the loading, I thought that the use of preprocess_input would likely be preserved. Not so. The example for finetuning MobileNetV2 from Tensorflow now has the following example code (here):

def format_example(image, label):
  image = tf.cast(image, tf.float32)
  image = (image / 127.5) - 1
  image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
  return image, label

It would appear that this first maps onto [0,255.0], then onto [0,2.0], then onto [-1,1]. It seems a bit strange that the specialized preprocess_input would be replaced with this more specific set of code.
Even more interesting, there is also tf.image.convert_image_dtype. My original code had this:

def map_to_image(hash, rating_onehot):
    img_bytes = tf.io.read_file(IMG_ROOT + hash + '.pic')
    img_u8 = tf.io.decode_image(img_bytes, channels=3, expand_animations=False)
    img_float = tf.image.convert_image_dtype(img_u8, tf.float32) # [0,255] -> [0,1]
    img_resized = tf.image.resize_with_pad(img_float, SIZE, SIZE)
    return img_resized, rating_onehot

So, as noted this maps onto [0,1]. After working on my results I noticed the apparent discrepancy and instead moved it to [-1,1]. However, to my surprise this greatly dropped training accuracy with other constants staying the same, leading me to believe the correct range is [0,1] despite contrary indicators. I'm still not sure what to make of that; I understand there are different range conventions but the mismatch has me puzzled.

The image pipeline code also had two frustrating aspects. First, there is no WebP support, so it is not possible to get the equivalent of PIL image loading when dealing with a mixed bag of formats. Second, while the image loading code is quite robust, there are a few images PIL was able to open that tf.image.decode_image was not. Unfortunately, when an image failure occurs in the pipeline, you can't really just catch an exception and keep going - it crashes and you don't get a helpful error message. This required me to create a separate script that tries opening all the images with TF one-by-one first and builds a blacklist of failures to exclude when loading up the dataset for training.

However, despite its limitations, the performance of tf.data is amazing. I was able to greatly increase overall performance and keep the GPU relatively well-loaded. My processors were nowhere near maxed so there is likely some headroom left there if I get a second GPU. I'm not sure where SSD reading is with respect to becoming a bottleneck, though.

So the change to training from scratch required changing several variables at once, with some trial and error. This took some time, and two other changes occurred during this period as well. First, I made an update to the dataset, increasing the Q class of images. Second, the sampling strategy changed to oversample a bit more fairly. Unfortunately these both muddy the purity of the results, but I am concerned primarily with the general progress.

The net result was a training from scratch of MobileNetV2 achieving 72.7% accuracy.


I was excited by this! Forward progress! While I did not achieve any earth-shattering accuracy improvements, I could now try other architectures, variants, and input sizes and have a reasonable baseline to compare.

Training from Scratch, pt. 1 - Motivation

The model currently deployed for Wingman Jr. is a relatively stock MobileNetV2 finetune with a bit of magic at the end. The dataset has been growing for quite some time and as I reached about 200K images I started seriously contemplating attempting to train from scratch.

Why train from scratch?

First, domain mismatch. Almost all base models are trained against ImageNet. While ImageNet is a fantastic proxy for photo-based populations, the domain of photos is not the same as the domain of internet images. The dataset has a significant portion of non-photo images: classical art, anime, stylized logos and icons, line drawings, and paintings. The closer I can get to approximating a target population of internet images, the more training from scratch has a potential to improve the model by retaining the parts that matter.

Second, ability to try new architectures and variations. Couldn't I try new architectures without doing this? Certainly. But many architectures do not have ImageNet pre-trained weights available, so finetuning is not an option. Additionally, having a bit of a baseline against a standard network helps provide a useful backdrop for comparison and helps to hint at whether a new network architecture is less capable vs. simply not having enough data.

Next time I'd like to discuss some of the mechanics of achieving a successful train from scratch; part 2 is now available!

Saturday, April 4, 2020

1.1.1.1 for Families Opt-In Support in Wingman Jr. 1.1.0

I was excited about a new service announced by Cloudflare this week - "1.1.1.1 for Families"! I admit, without an understanding of the company and the technology, that headline might not be the most eye-catching. Let me provide a bit of background.

Cloudflare is a technology company that provides many foundational services for using the internet. One exceptionally important service they provide is DNS, the Domain Name System. While we think of internet addresses as text, under the hood these text-based addresses are converted to a numerical form called an IP address that is used to route traffic. Specifically, the hostname - for example "google.com" - is represented numerically, but not the part of the address afterwards that goes to a specific page. Basically, every single webpage you visit "resolves" the hostname into this IP address by using a "DNS provider".
One trick that has long been used to block hostnames that contain questionable content is simply to use a DNS provider that says "I don't know how to convert yourbadsite.com into an IP address", so all requests for media from that hostname fail. This is a lightweight check, and is a relatively coarse form of a blacklist. Maintaining this blacklist is a gargantuan effort, almost always a commercial one.
So what is this "1.1.1.1 for Families"? Well, two years ago Cloudflare launched their own DNS provider at "1.1.1.1". Now they have extended - free to the public - offerings that can filter out hostnames of known malware and adult content providers.


Wingman Jr. relies on AI to scan images fully client-side, which has the distinct advantages that 1) each image is considered individually rather than being lumped in with a whole site and 2) no communication with an external service provider is needed. However, as I've had at least one user helpfully remind me in an email, video is not blocked. Long term, I would like to support filtering video, but it is a difficult technical challenge to get right - and performant. One thing I can do in the meantime is provide the option to also block images and video using the lighter-weight DNS-based approach. This is now quite feasible thanks to Cloudflare!

So how does it work? Well, roughly speaking you go to the plugin's new settings area and enable DNS-based blocking. That's all you have to do. Under the hood, the plugin will start capturing image and video requests before they are sent and check the hostname with Cloudflare's servers. If Cloudflare says to block it, the image or video request will be aborted - you won't even see the usual Wingman icon or the update to the number of blocked images.
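
Here is a hedged sketch of the mechanism - not the plugin's exact implementation. It assumes Cloudflare's family resolver answers DNS-over-HTTPS JSON queries at family.cloudflare-dns.com and signals a blocked hostname by resolving it to 0.0.0.0; Firefox allows the blocking listener to return a Promise, so the request can wait for the answer:

async function isHostnameBlocked(hostname) {
    const url = 'https://family.cloudflare-dns.com/dns-query?name=' +
        encodeURIComponent(hostname) + '&type=A';
    const response = await fetch(url, { headers: { 'Accept': 'application/dns-json' } });
    const result = await response.json();
    return (result.Answer || []).some(answer => answer.data === '0.0.0.0');
}

browser.webRequest.onBeforeRequest.addListener(
    async (details) => {
        const hostname = new URL(details.url).hostname;
        return { cancel: await isHostnameBlocked(hostname) };
    },
    { urls: ['<all_urls>'], types: ['image', 'media'] },
    ['blocking']
);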

Now here's the thing: while there is a definite upside to this - a second layer of blocking, in some cases better efficiency, and basic video blocking - enabling this option does share the domains you are fetching media from with Cloudflare. Additionally, some websites with rather mixed content may end up being categorically blocked. These are tradeoffs - which is why I am making this an opt-in only feature.

However, I'm excited about this new option! I believe it makes sense for many users. I also want to thank the user that took the time to write me an email and got me thinking about this - it's great to hear how people are using this plugin and what they'd like to see next. Look for an update in Firefox soon - I plan to release this with version 1.1.0!

Saturday, March 28, 2020

Feedback: Image Scanning Speed

Tonight I got some helpful feedback from someone about the speed of scanning images:
I don't know if it's intentional, or if it comes from interaction with my other addons but I'm forced to wait 30 seconds in front of a blank page before seeing the site display, let the site display with images replaced by white squares or another image would allow us to start reading it without waiting.

Thank you anonymous user! Based on this I'd like to gather feedback from others as well about image scanning speed - it would really help me out if you would please take the survey at https://forms.gle/vgQZ1fMG2WxuCzop7 - note the survey will close after a while.

Now, back to the suggestion: I hear your feedback and would like to implement something like that. Given how the plugin works, I'm not entirely sure I can make it do quite what is hoped for, but I am going to think about it. If you're willing to listen to a bit more technical explanation, let me explain.

Images are scanned as they get downloaded. Unfortunately, in order to scan an image, I need to have the whole image, scan it, and only then start passing it along once I know it's good. I have rather limited control over what happens downstream, as I can only pass along an image.
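
To illustrate the constraint, here is a rough sketch using Firefox's webRequest.filterResponseData stream filter. scanImage stands in for the real AI check, and the actual code handles far more cases, but it shows why nothing can be passed along until the whole image has been seen:

function filterImageRequest(details) {
    const filter = browser.webRequest.filterResponseData(details.requestId);
    const chunks = [];
    filter.ondata = (event) => chunks.push(event.data); // Buffer - nothing is forwarded yet.
    filter.onstop = async () => {
        const isSafe = await scanImage(new Blob(chunks)); // Only now is the full image available.
        if (isSafe) {
            for (const chunk of chunks) {
                filter.write(chunk); // The one shot: release the original bytes downstream.
            }
        }
        filter.close();
    };
}

browser.webRequest.onBeforeRequest.addListener(
    filterImageRequest,
    { urls: ['<all_urls>'], types: ['image'] },
    ['blocking']
);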

Compare this to what typically happens with web pages that load progressively: the page quickly loads thumbnails or placeholders, and then comes back and replaces them.

Currently, the way the plugin works I don't have a chance to come back and "fix up" the images - I only get one shot. So the typical way to do this is not available to me. I could potentially play some other tricks to accomplish this, but I do not yet see a clear path and I have to take into account general performance as well. So stay tuned, and don't forget to take the survey because there are other ways to tackle this problem as well! https://forms.gle/vgQZ1fMG2WxuCzop7

The Quest for New Hardware, pt. 3 - conclusion

Well, I was able to put to rest the main remaining issue from last time: the fan control needing to trigger off the temperature of the GPU. I spent quite a bit of time on this, but at the end of the day the primary thing seems to have been that I needed to have both IPMI updated and the NVidia driver loaded for the "Optimal" fan setting to work properly - and voila, the fans now stabilize the GPU temperature at 60C!

Now I simply need to work on tuning the running of the model, but that's a task for another day. While there have been frustrating times at points, I've enjoyed learning about servers and IPMI management, and I've enjoyed getting my feet wet over at the Homelab subreddit - the folks there have been welcoming and knowledgeable!

Tuesday, March 24, 2020

The Quest for New Hardware, pt. 2

In the last post, I talked about the different paths one could take to do powerful machine learning on the cheap. I opted to buy an older GPU server, the SuperMicro 2027GR-TRF. I now have all the pieces and have it more or less all working, so I thought I'd share some things I ran into - and some embarrassing stories - along the way.

Here was what I ended up ordering:
  1. The SuperMicro 2027GR-TRF. I picked a used box from Garland Computers off Ebay for $470+$75 shipping. The specs on it were:
    • 2x E5-2670 2.6 GHz 8-core CPU's
    • 64GB DDR3 RAM
    • 4 GPU slots (but K80's are dual-slot, so 2x K80 slots)
    • 10x RAID slots
  2. A Western Digital Red 500GB SSD drive from Newegg for $90. If you're not familiar, the "blue" line is consumer-grade, and the "red" line is a bit more enterprise-grade. I may be hitting it pretty hard, and I've heard it plays better with RAID, which may be important in the future.
  3. An open box K80 from MET servers on Ebay for $325+$25 shipping.
  4. An old used server console for about $75+$30 shipping on Ebay.
  5. As it turns out, the K80 did not come with the extra converter to 2 PCI, so I ended up buying a generic cable.
  6. A PS/2 to USB adapter.
  7.  A Kill-a-watt to monitor power usage from my local Menards.
So, all in all right around $1120, not including any server rack solution. Not bad!

Here are some things I learned - if you are a server administrator please enjoy a good laugh!
  1. Power buttons are a bit different on servers. Basically, when you plug in the server its fans always go, so I thought it might be on. The main manual never had a picture of the full box showing where the parts were, so it took me longer than I care to admit to realize that the power button was on what I thought was a mounting bracket! However, I kept seeing references to a control center with status lights in the manual. Finally, I saw a ribbon cable underneath some packaging on the side and saw that one of the "mounting brackets" actually had buttons. This wasn't shipped attached to the main device. (See this angled picture from a related server to get an idea of what I'm talking about.)
  2. I didn't have my SSD in quite right in the cartridge, so the cartridge would go all the way in but never truly connect. It took a bit to figure out why it wasn't showing up in the BIOS.
  3. I've installed NVidia drivers on more than one occasion, but ran into a new issue. In the past, I've generally been using the box beforehand, and have lots of the standard tools already set up. This time I was going from a fresh install. I was installing via the apt package route. However, I failed to notice that the package - while appearing to install successfully - had a warning message about not being able to build the kernel module. I had to go get the kernel headers package manually and then it worked just fine. For some reason this step doesn't seem to make it into many of the installation guides out there.
  4.  The SuperMicro 2027GR-TRF product page clearly states there is a PS/2 port for keyboard/mouse. I wasn't sure if it was one for each or just one, but at any rate I can assure you it does not have one externally - only USB. And neither the manual nor the motherboard manual seem to make mention of one. So, I needed to buy the extra adapter to make it work. Fortunately, this was cheap and I already needed to wait on the power adapter for the K80.
  5. I didn't have detailed instructions for the K80 installation specifically. The K80 is a dual-slot card, so I was a bit unsure how to handle the general placement and/or what other hardware I might need. As noted, you need the extra power adapter. I ended up attaching it to the bottom slot to start with. However, I ran into overheating while doing some minor stress testing. The K80 shows up as two cards in nvidia-smi. The first card would always end up getting quite hot, encountering thermal issues at about 92C and turning off. So I tried switching the card to the top slot. No difference. The K80 is passively cooled and has several forum posts warning that trying to use it in anything but an official NVidia-certified integrator's server is likely to cause problems - part of the reason I went with a GPU server in the first place once I landed on the K80. Unfortunately, NVidia's official integrator list does NOT go back as far as the K80, so I had to rely on SuperMicro's word that they support it. Unfortunately this symptom is identical to not having a proper system to support the K80 (see also this important explanation here). Fortunately, however, the server itself has good power and cooling capabilities - and it only seems to be the closed-loop monitoring of the cooling that was having issues. To that end, I decided to see if I could manually increase the fan speeds, as they seemed to be running at a rather low speed all the time for the GPU. This box's BIOS did not expose the fan controls, but the IPMI management did, so I was able to set it that way. Unfortunately the options were "optimal for all fans" or "turn all fans to 11", so it now sounds rather like a jet engine. I actually went and grabbed hearing protection after a while. But, at least it is a cool jet engine. Now under reasonable (but not full) load, it was stabilizing around 51C for the hotter of the two cards.
So I learned a few things, but it was awesome to see it all up and running. I can now run batch sizes of at least 256 images once I spread across the cards for a 224x224 MobileNetV2, so I have definitely hit my target. However, I haven't yet gotten a good chance to try training from scratch and with current world circumstances around COVID-19, it is unfortunately a bit lower priority.