Saturday, March 28, 2020

Feedback: Image Scanning Speed

Tonight I got some helpful feedback from someone about the speed of scanning images:
"I don't know if it's intentional, or if it comes from interaction with my other addons, but I'm forced to wait 30 seconds in front of a blank page before seeing the site display. Letting the site display with images replaced by white squares or another image would allow us to start reading it without waiting."

Thank you, anonymous user! Based on this, I'd like to gather feedback from others about image scanning speed as well - it would really help me out if you would take the survey at https://forms.gle/vgQZ1fMG2WxuCzop7. Note that the survey will close after a while.

Now, back to the suggestion: I hear your feedback and would like to implement something like that. Given how the plugin works, I'm not entirely sure I can make it do quite what is hoped for, but I am going to think about it. If you're willing to sit through a slightly more technical explanation, here's why.

Images are scanned as they are downloaded. Unfortunately, in order to scan an image, I need to receive the whole thing, scan it, and only then start passing it along once I know it's good. I have fairly limited control over what happens downstream, as all I can do is pass along an image.
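To make that concrete, here's a minimal sketch of the situation in Python terms (the plugin isn't actually written in Python, and scan and PLACEHOLDER are hypothetical stand-ins):

    PLACEHOLDER = b""  # hypothetical stand-in bytes for a blocked image

    def filter_image_stream(chunks, scan):
        # Buffer the entire image - the scanner needs the complete data.
        buffered = bytearray()
        for chunk in chunks:        # chunks arrive as the image downloads...
            buffered.extend(chunk)  # ...but nothing is forwarded yet
        data = bytes(buffered)
        # Only now do we know whether the image is good, so the browser sees
        # no image bytes at all until the download AND the scan both finish.
        return data if scan(data) else PLACEHOLDER

Because every image on the page goes through that buffer-then-scan cycle, the page can sit blank until the slowest image clears - which matches the delay being described.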

Compare this to what typically happens with web pages that load progressively: the page quickly loads thumbnails or placeholders, and then comes back and replaces them.

Currently, the way the plugin works, I don't get a chance to come back and "fix up" the images - I only get one shot. So the typical approach is not available to me. I could potentially play some other tricks to accomplish this, but I do not yet see a clear path, and I have to take general performance into account as well. So stay tuned, and don't forget to take the survey, because there are other ways to tackle this problem as well! https://forms.gle/vgQZ1fMG2WxuCzop7

The Quest for New Hardware, pt. 3 - conclusion

Well, I was able to put to rest the main remaining issue from last time: the fan control needing to trigger off the temperature of the GPU. I spent quite a bit of time on this, but at the end of the day the primary thing seems to have been that I needed both IPMI updated and the NVidia driver loaded for the "Optimal" fan setting to work properly - and voila, the fans now stabilize the GPU temperature at 60C!
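If you want to watch the temperatures settle yourself, a quick polling loop over nvidia-smi does the trick - here's a rough Python sketch (the query flags are standard; the printed values are just illustrative):

    import subprocess
    import time

    def gpu_temps():
        # Ask nvidia-smi for each GPU's index and temperature as bare CSV.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader,nounits"], text=True)
        return {int(idx): int(temp)
                for idx, temp in (line.split(", ")
                                  for line in out.strip().splitlines())}

    while True:
        print(gpu_temps())  # e.g. {0: 60, 1: 57} once the fans catch up
        time.sleep(5)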

Now I simply need to work on tuning how the model runs, but that's a task for another day. While there were frustrating moments, I've enjoyed learning about servers and IPMI management, and I've enjoyed getting my feet wet over at the Homelab subreddit - the folks there have been welcoming and knowledgeable!

Tuesday, March 24, 2020

The Quest for New Hardware, pt. 2

In the last post, I talked about the different paths one could take to do powerful machine learning on the cheap. I opted to buy an older GPU server, the SuperMicro 2027GR-TRF. I now have all the pieces and have it more or less all working, so I thought I'd share some things I ran into along the way, including a few embarrassing stories.

Here was what I ended up ordering:
  1. The SuperMicro 2027GR-TRF. I picked a used box from Garland Computers off Ebay for $470+$75 shipping. The specs on it were:
    • 2x E5-2670 2.6 GHz 8-core CPUs
    • 64GB DDR3 RAM
    • 4 GPU slots (but K80s are dual-slot, so effectively 2x K80 slots)
    • 10x hot-swap drive bays
  2. A Western Digital Red 500GB SSD from Newegg for $90. If you're not familiar, the "Blue" line is consumer-grade, while the "Red" line is aimed more at NAS and server workloads. I may be hitting it pretty hard, and I've heard it plays better with RAID, which may be important in the future.
  3. An open box K80 from MET servers on Ebay for $325+$25 shipping.
  4. An old used server console for about $75+$30 shipping on Ebay.
  5. As it turns out, the K80 did not come with the adapter that converts two PCIe power connectors into the plug it needs, so I ended up buying a generic cable.
  6. A PS/2 to USB adapter.
  7. A Kill-A-Watt from my local Menards to monitor power usage.
So, all in all, right around $1120, not including any server rack solution. Not bad!

Here are some things I learned - if you are a server administrator, please enjoy a good laugh!
  1. Power buttons are a bit different on servers. Basically, when you plug in the server its fans always spin up, so I thought it might already be on. The main manual never had a picture of the full box showing where everything was, so it took me longer than I care to admit to find the power button. I kept seeing references in the manual to a control center with status lights; finally, I spotted a ribbon cable underneath some packaging on the side and realized that one of what I had taken for "mounting brackets" actually had the buttons - it just wasn't shipped attached to the main device. (See this angled picture from a related server to get an idea of what I'm talking about.)
  2. I didn't have my SSD seated quite right in its caddy, so the caddy would slide all the way in but the drive never truly connected. It took a bit to figure out why it wasn't showing up in the BIOS.
  3. I've installed NVidia drivers on more than one occasion, but ran into a new issue. In the past I've generally been using the box beforehand and already had the standard tools set up; this time I was starting from a fresh install. I was installing via the apt package route. However, I failed to notice that the package - while appearing to install successfully - had printed a warning about not being able to build the kernel module. I had to go install the kernel headers package manually, and then it worked just fine. For some reason this step doesn't seem to make it into many of the installation guides out there (see the sketch after this list for a quick check).
  4. The SuperMicro 2027GR-TRF product page clearly states there is a PS/2 port for keyboard/mouse. I wasn't sure if that meant one for each or just one, but at any rate I can assure you it has none externally - only USB. Neither the manual nor the motherboard manual seems to mention one. So I needed to buy the extra adapter to make my console work. Fortunately it was cheap, and I was already waiting on the power adapter for the K80 anyway.
  5. I didn't have detailed instructions for installing the K80 specifically. The K80 is a dual-slot card, so I was a bit unsure about its placement and what other hardware I might need. As noted, you need the extra power adapter. I attached the card to the bottom slot to start with, but ran into overheating during some minor stress testing. The K80 shows up as two cards in nvidia-smi, and the first card would always end up getting quite hot, hitting thermal limits at about 92C and shutting off. I tried switching the card to the top slot - no difference. The K80 is passively cooled, and several forum posts warn that trying to use it in anything but an official NVidia-certified integrator's server is likely to cause problems - part of the reason I went with a GPU server in the first place once I landed on the K80. Unfortunately, NVidia's official integrator list does NOT go back as far as the K80, so I had to rely on SuperMicro's word that they support it - and this symptom is identical to not having a proper system to support the K80 (see also this important explanation here). Fortunately, the server itself has good power and cooling capabilities - it only seems to be the closed-loop monitoring of the cooling that was having issues. To that end, I decided to see if I could manually increase the fan speeds, since they seemed to be running at a rather low speed all the time for the GPU. This box's BIOS did not expose the fan controls, but the IPMI management did, so I was able to set them that way. Unfortunately, the options were "optimal for all fans" or "turn all fans to 11", so the server now sounds rather like a jet engine - I actually went and grabbed hearing protection after a while. But at least it is a cool jet engine: under reasonable (but not full) load, the hotter of the two cards now stabilizes around 51C.
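Coming back to the kernel headers issue from #3 - a simple check like the following Python sketch would have saved me some head-scratching (paths and package names assume a Debian/Ubuntu-style system):

    import os
    import subprocess

    # The NVidia driver has to build a kernel module against the headers for
    # the *running* kernel, which live under /usr/src on Debian/Ubuntu.
    release = subprocess.check_output(["uname", "-r"], text=True).strip()
    headers = f"/usr/src/linux-headers-{release}"

    if os.path.isdir(headers):
        print(f"Kernel headers present at {headers}")
    else:
        print(f"Headers missing - try: sudo apt install linux-headers-{release}")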
So I learned a few things, but it was awesome to see it all up and running. I can now run batch sizes of at least 256 images for a 224x224 MobileNetV2 once I spread the work across the cards, so I have definitely hit my target. However, I haven't yet gotten a good chance to try training from scratch, and with current world circumstances around COVID-19 it is unfortunately a bit lower priority.
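For the curious, spreading a batch across both halves of the K80 looks roughly like this in TensorFlow/Keras terms - a minimal sketch, not my actual training code, with a hypothetical two-class problem:

    import tensorflow as tf

    # MirroredStrategy splits each global batch across all visible GPUs, so
    # each half of the K80 sees 128 images of a 256-image batch.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    with strategy.scope():
        model = tf.keras.applications.MobileNetV2(
            input_shape=(224, 224, 3),
            weights=None,  # training from scratch rather than finetuning
            classes=2,     # hypothetical class count
        )
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    # model.fit(train_dataset, epochs=...) with a global batch size of 256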

Friday, March 13, 2020

The Quest for New Hardware, pt. 1

Recently I've been considering what type of hardware setup might be the next natural step for training the model. In the past I've used a Jetson TX1 devkit, which is a rather modest piece of hardware but allows me to do a bit of heavier finetuning. I have it running on its own in a back bedroom, and it's helpful to have a box dedicated to the task rather than sharing time with my laptop or a desktop, for example.

However, the dataset I train on has grown considerably and now tops 200K images. That's large enough that I believe it may be reasonable to consider training from scratch rather than finetuning. Training from scratch, however, requires much more computing power - and ideally runs with significantly larger batch sizes.
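To illustrate the distinction, here's a hedged Keras-style sketch using MobileNetV2 as an example (the class count and setup are hypothetical, not my actual configuration):

    import tensorflow as tf

    # Finetuning: start from pretrained ImageNet weights and freeze the base,
    # so only a small new top needs training - relatively light on compute.
    base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False)
    base.trainable = False

    # From scratch: random initialization, every layer trains - far hungrier
    # for data, compute, and (ideally) batch size.
    scratch = tf.keras.applications.MobileNetV2(weights=None, classes=2)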

In fact, one of the key things I wanted to be able to do was run batch sizes on par with training "modern" CNNs from scratch on ImageNet. This varies widely, but let's say ~100 images per batch as a reasonable target. Generally speaking, this means more GPU RAM.

However, I'm doing my best to stay on a relatively tight budget - ideally something around $1000. This is lower than many new setups typically run, even ones aimed at the budget-conscious consumer - see for example this blog post, with a nice new box settling in at a reasonable $1700. This necessitated some deep introspection about what I was truly trying to achieve, and ultimately I settled on roughly this set of priorities:
  1. GPU RAM
  2. Speed
  3. Future expansion
My thought was that by prioritizing RAM, I could train almost any model - but perhaps at a slower speed.

With these priorities in mind, I considered several possibilities for a new or second-hand setup. Coming at this with a different set of constraints than most people shopping for a machine learning rig, I was pleasantly surprised by the sheer diversity of possible solutions.
  1. Purchase a Jetson AGX Xavier devkit. The price on these seems to have been coming down, and the devkit was just upgraded to 32GB of RAM shared between the GPU and CPU.
  2. Build a desktop and add higher-quality consumer-level cards like the popular 1080 Ti, or perhaps even the newer 2080.
  3. Build a specialized rig and purchase many second-hand crypto mining rig cards - sort of a quantity over quality approach.
  4. Look to older server solutions and see what was available.
After quite some time searching, I settled on option #4. I discovered that I could find the now relatively old NVidia K80 cards in GPU servers for an acceptable price, providing a surprisingly strong GPU RAM/$ ratio at a satisfactory level of speed. I had never really looked into the world of servers before, and it was an enjoyable journey - if a bit of an overwhelming one at first. I joined the subreddit /r/homelab, a friendly place to discuss running servers at home. (I'd especially like to thank merkuron for their help!)
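The arithmetic behind "strong GPU RAM/$" is simple enough - a quick sketch, where the K80 price is from my actual listing and the other prices are hypothetical going rates for comparison:

    # GPU RAM per dollar, back-of-envelope. Only the K80 price is real
    # (my Ebay listing); the others are rough assumed street prices.
    options = {
        "used K80 (2x12GB)":   (24, 325),
        "used 1080 Ti (11GB)": (11, 450),
        "new RTX 2080 (8GB)":  (8, 700),
    }
    for name, (gb, usd) in options.items():
        print(f"{name}: {gb / usd:.3f} GB per dollar")
    # -> the K80 lands around 0.074 GB/$, about 3x the 1080 Ti figure here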

Ultimately I settled on an older GPU server by SuperMicro, the 2027GR-TRF. I currently plan to add one K80 to it, but it has support for up to two if I wish to expand in the future. I have recently been working on acquiring the full solution in various pieces, primarily through Ebay. I have gotten most of the parts but need a few more before I can put it all together, so stay tuned for more updates!