Tuesday, March 24, 2020

The Quest for New Hardware, pt. 2

In the last post, I talked about the different paths one could take to do powerful machine learning on the cheap. I opted to buy an older GPU server, the SuperMicro 2027GR-TRF. I now have all the pieces and have it more or less working, so I thought I'd share some things I ran into, along with a few embarrassing stories.

Here was what I ended up ordering:
  1. The SuperMicro 2027GR-TRF. I picked a used box from Garland Computers off eBay for $470+$75 shipping. The specs on it were:
    • 2x E5-2670 2.6 GHz 8-core CPUs
    • 64GB DDR3 RAM
    • 4 GPU slots (but K80s are dual-slot, so 2x K80 slots)
    • 10x RAID slots
  2. A Western Digital Red 500GB SSD from Newegg for $90. If you're not familiar, the "blue" line is consumer-grade, and the "red" line is a bit more enterprise-grade. I may be hitting the drive pretty hard, and I've heard the red line plays better with RAID, which may be important in the future.
  3. An open-box K80 from MET servers on eBay for $325+$25 shipping.
  4. An old used server console for about $75+$30 shipping on eBay.
  5. As it turns out, the K80 did not come with the extra power adapter (the converter to 2 PCIe power connectors), so I ended up buying a generic cable.
  6. A PS/2 to USB adapter.
  7. A Kill-a-watt from my local Menards to monitor power usage.
So, all in all, right around $1120, not including any server rack solution. Not bad!
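
As a quick sanity check on that total, here is a small Python sketch that adds up only the prices quoted in the list. The three items without a quoted price (the generic cable, the PS/2 to USB adapter, and the Kill-a-watt) are left out, so the remainder below is just what the arithmetic implies, not a set of actual receipts. The variable names are mine, and the console price and the $1120 total are both stated as approximate above.

```python
# Prices quoted in the list above (item price + shipping), in US dollars.
priced_items = {
    "SuperMicro 2027GR-TRF": 470 + 75,
    "WD Red 500GB SSD": 90,
    "K80": 325 + 25,
    "server console": 75 + 30,  # "about" $75+$30 in the list
}

priced_total = sum(priced_items.values())

# The overall figure given in the post ("right around $1120").
stated_total = 1120

# What is left over for the cable, the PS/2 adapter, and the Kill-a-watt.
unpriced_remainder = stated_total - priced_total

if __name__ == "__main__":
    print(f"Priced items: ${priced_total}")
    print(f"Implied cost of the unpriced items: about ${unpriced_remainder}")
```

So the four big-ticket items account for nearly all of the spend, and the small accessories are close to a rounding error.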

Here are some things I learned. If you are a server administrator, please enjoy a good laugh!
  1. Power buttons are a bit different on servers. Basically, when you plug in the server its fans always run, so I thought it might be on. The main manual never had a picture of the full box showing where the parts were, but I kept seeing references to a control center with status lights. Finally, I saw a ribbon cable underneath some packaging on the side and realized that one of the "mounting brackets" actually had buttons. It wasn't shipped attached to the main device. So it took me longer than I care to admit to realize that the power button was on what I thought was a mounting bracket! (See this angled picture from a related server to get an idea of what I'm talking about.)
  2. I didn't have my SSD seated quite right in the cartridge, so the cartridge would slide all the way in but never truly connect. It took a bit to figure out why the drive wasn't showing up in BIOS.
  3. I've installed NVIDIA drivers on more than one occasion, but ran into a new issue. In the past, I had generally been using the box beforehand and had lots of the standard tools already set up. This time I was going from a fresh install. I was installing via the apt package route. However, I failed to notice that the package - while appearing to install successfully - had printed a warning about not being able to build the kernel module. I had to go get the kernel headers package manually, and then it worked just fine. For some reason this step doesn't seem to make it into many of the installation guides out there.
  4. The SuperMicro 2027GR-TRF product page clearly states there is a PS/2 port for keyboard/mouse. I wasn't sure if it was one for each or just one, but at any rate I can assure you it does not have one externally - only USB. And neither the server manual nor the motherboard manual seems to mention one. So, I needed to buy the extra adapter to make the console work. Fortunately, this was cheap and I already needed to wait on the power adapter for the K80.
  5. I didn't have detailed instructions for the K80 installation specifically. The K80 is a dual-slot card, so I was a bit unsure how to handle the general placement and/or what other hardware I might need. As noted, you need the extra power adapter. I ended up attaching it to the bottom slot to start with. However, I ran into overheating while doing some minor stress testing. The K80 shows up as two cards in nvidia-smi. The first card would always end up getting quite hot, encountering thermal issues at about 92C and turning off. So I tried switching the card to the top slot. No difference. The K80 is passively cooled, and several forum posts warn that trying to use it in anything but an official NVIDIA-certified integrator's server is likely to cause problems - part of the reason I went with a GPU server in the first place once I landed on the K80. Unfortunately, NVIDIA's official integrator list does NOT go back as far as the K80, so I had to rely on SuperMicro's word that they support it. Worse, this symptom is identical to what you would see in a system that can't properly support the K80 (see also this important explanation here). Fortunately, however, the server itself has good power and cooling capabilities - and it only seems to be the closed-loop monitoring of the cooling that was having issues. To that end, I decided to see if I could manually increase the fan speeds, as the fans for the GPU seemed to be running at a rather low speed all the time. This box's BIOS did not expose the fan controls, but the IPMI management did, so I was able to set it that way. Unfortunately the options were "optimal for all fans" or "turn all fans to 11", so it now sounds rather like a jet engine. I actually went and grabbed hearing protection after a while. But, at least it is a cool jet engine. Now under reasonable (but not full) load, it stabilizes around 51C for the hotter of the two cards.
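
Two of the lessons above (the missing kernel headers in item 3 and the overheating in item 5) lend themselves to a quick check script. What follows is only a sketch of how one might do that in Python, not what I actually ran. It assumes a Debian/Ubuntu-style layout where installed headers are reachable at /lib/modules/<kernel release>/build, and it assumes nvidia-smi is on the PATH. The function names and the 85C warning threshold are made up for illustration; the threshold is simply a number below the roughly 92C where the card shut off.

```python
import os
import platform
import subprocess

# Hypothetical warning threshold, chosen to sit below the ~92C shutdown point.
HOT_C = 85


def kernel_headers_dir(release=None):
    """Directory where headers for a kernel release are usually linked."""
    release = release or platform.release()
    return f"/lib/modules/{release}/build"


def headers_installed(release=None):
    """True if the headers directory exists (a dangling symlink counts as missing)."""
    return os.path.isdir(kernel_headers_dir(release))


def parse_gpu_temps(csv_text):
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_c}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            temps[int(fields[0])] = int(fields[1])
        except (IndexError, ValueError):
            continue  # skip lines that are not 'int, int'
    return temps


def read_gpu_temps():
    """Ask nvidia-smi for per-GPU temperatures; return {} if it is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,temperature.gpu",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return {}
    return parse_gpu_temps(result.stdout)


def hot_gpus(temps, limit=HOT_C):
    """Indices of GPUs at or above the warning threshold."""
    return sorted(index for index, temp in temps.items() if temp >= limit)


if __name__ == "__main__":
    if not headers_installed():
        print(f"No kernel headers found at {kernel_headers_dir()}")
    temps = read_gpu_temps()
    print("GPU temperatures (C):", temps or "nvidia-smi not available")
    for index in hot_gpus(temps):
        print(f"GPU {index} is at or above {HOT_C}C")
```

Since a K80 shows up as two cards, a healthy run on this box should list two entries. Running something like this in a loop during a stress test would have made the first card's climb toward 92C obvious much earlier.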
So I learned a few things, but it was awesome to see it all up and running. I can now run batch sizes of at least 256 images once I spread across the cards for a 224x224 MobileNetV2, so I have definitely hit my target. However, I haven't yet gotten a good chance to try training from scratch, and with current world circumstances around COVID-19, it is unfortunately a bit lower priority.
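
For a sense of what "spread across the cards" means in numbers, here is a tiny Python sketch of how a global batch divides across devices. It assumes the two devices are the two GPUs that the single K80 exposes in nvidia-smi. The split_batch helper is hypothetical; in practice a deep learning framework's multi-GPU support does this splitting for you.

```python
def split_batch(global_batch, n_devices):
    """Divide a global batch as evenly as possible across devices."""
    base, extra = divmod(global_batch, n_devices)
    # The first `extra` devices each take one additional example.
    return [base + (1 if i < extra else 0) for i in range(n_devices)]


if __name__ == "__main__":
    # 256 images across the two GPUs of one K80.
    print(split_batch(256, 2))
```

With 256 images and two devices, each GPU handles 128 images per step.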
