User:Poppy Linden/Network Errors and Data Loss 2008-01
In a blog post, Hamilton Linden announced "Region Data Corruption / Loss Problem Resolved". Behind that announcement is a long journey into just how reliable (or unreliable) TCP/IP connections really are. If you're interested in computer network diagnostics, read on.
What the specific error was
TCP/IP is supposed to be a completely reliable protocol. On top of the reliability guarantees from Ethernet (specifically, the 802.3 CRC mechanism), it would seem very difficult to have data corruption, especially through a major internet backbone provider, and especially to have corruption that nobody is alerted about. In this case, we noticed the result of an HTTP PUT operation was a file that had certain bits flipped. Note that this doesn't happen to every packet - quite the opposite! It was only showing up in 0.1% of transmissions of a 5.5MB file. Considering how rarely we actually pull down the data we upload (assets are immutable, and are usually replaced by a newer version before long), we almost never saw any problems. But when we did see problems, they were weird, to say the least.
A diff of the hex output of the original file and the corrupted file looked like this:
$ diff known_funky_simstate.tmp.gz.hex known_funky_simstate.tmp.gz.bunk.hex
69424c69424
< 010f2f0 c794 6f38 e893 6dbf e728 d435 348f cdcc
---
> 010f2f0 c794 6f38 e89b 6dbf e728 d435 348f cdcc
69436c69436
< 010f3b0 fd09 995b 2ddd e146 357c c5ea fa89 fd7b
---
> 010f3b0 fd09 995b 2dd5 e146 357c c5ea fa89 fd7b
If you remember your binary arithmetic and look right here:
c794 6f38 e893 6dbf e728 d435 348f cdcc
c794 6f38 e89b 6dbf e728 d435 348f cdcc
             ^
fd09 995b 2ddd e146 357c c5ea fa89 fd7b
fd09 995b 2dd5 e146 357c c5ea fa89 fd7b
             ^
you can see the "magic" at work. A 1 where there should be a 0, and a 0 where there should be a 1. And if you look at where the corruption happened in the file:
010f2f0
010f3b0
you'll notice the second offset is 0x00C0 later: 192 bytes, a multiple of 64. That's starting to sound like memory going bad in some device.
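As a quick sanity check in Python (using the words and offsets from the dumps above), XORing each original word against its corrupted counterpart isolates the changed bits; both sites turn out to differ by the same single bit, at offsets exactly 192 bytes apart:

# Corrupted 16-bit words and their originals, copied from the diff above.
pairs = [(0xe893, 0xe89b), (0x2ddd, 0x2dd5)]

for good, bad in pairs:
    flipped = good ^ bad   # XOR isolates the bits that changed
    print(f"{good:04x} -> {bad:04x}: changed bit mask {flipped:#06x}")
    # Both pairs print 0x0008: the same bit, flipped in opposite directions.

# Distance between the two corrupted hexdump offsets.
distance = 0x010f3b0 - 0x010f2f0
print(distance, distance % 64)   # 192 0 -> a multiple of 64 bytes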
Which device? In this case, it was a router. It couldn't have been a switch, because the Ethernet CRC would have protected the data in that case. In a router, the TCP checksum is all that protects the data, because the Ethernet CRC is recalculated for the new (now corrupt) frame, which has new sender and receiver addresses. And the TCP checksum is just a 16-bit one's-complement sum of the payload, so a bit flipped from 0 to 1 in one word, matched by the same bit flipped from 1 to 0 in another word of the same packet, leaves the sum - and the checksum - unchanged.
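To make that concrete, here is a minimal sketch of a textbook RFC 1071-style Internet checksum (not Linden Lab code), fed two payloads that differ only by the balanced pair of flips seen above; the checksums come out identical, so corruption like this passes through TCP unnoticed:

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

# Two payloads differing only by the balanced flips from the dumps above,
# with the corrupted words 192 bytes apart as observed.
original = bytes.fromhex("e893" + "00" * 190 + "2ddd")
corrupt  = bytes.fromhex("e89b" + "00" * 190 + "2dd5")

print(hex(internet_checksum(original)))
print(hex(internet_checksum(corrupt)))   # identical: the corruption slips through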
Right now you would assume that any communication through a layer with its own integrity protection (IPSEC, https, ssh, etc.) would catch this, and you'd be correct. Unfortunately, the result is typically an abruptly terminated session. Anything transmitted in the clear arrived with flipped bits, but anything sent through a transport with more protection was truncated.
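As a hedged illustration (reusing the two payloads from the sketch above, not any Linden Lab code), a real CRC - the same family of check used by Ethernet frames and by gzip - does catch the balanced flips that the Internet checksum misses:

import zlib

# Same two payloads as above: balanced, checksum-cancelling bit flips.
original = bytes.fromhex("e893" + "00" * 190 + "2ddd")
corrupt  = bytes.fromhex("e89b" + "00" * 190 + "2dd5")

# CRC-32 detects the change, so any layer carrying its own CRC or MAC
# notices the corruption and typically drops the frame or kills the session.
print(hex(zlib.crc32(original)))
print(hex(zlib.crc32(corrupt)))   # different values: the flips are caught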
How we found it
Like Jeopardy!, I've already given you the answer. What was the question?
Simulators (aka Regions to you lovely residents!) back up every hour. Yes, every one of the approximately 12k regions sends its entire simstate once an hour. If a sim crashes normally or shuts down, it sends another simstate right then. If a sim crashes abnormally (in a pretty rough way), it doesn't send state. Also, if a sim gets caught in a crash loop, the simstates are sent but are probably worthless, and the region stops trying to start up after a fixed number of tries. In either of those last two cases, our support team will perform a rollback for you, to a known-good simstate. Because of all of this, it's pretty hard to lose region data on the grid.
Support noticed an abnormal number of crash loops on the grid. Because this was abnormal, the LL:Grid_Monkey was given the simstates to look at. The first thing you have to do is unzip them, and that failed straight away. Everyone initially thought this was truncation, because some zip recovery tools showed the file up to a certain point (probably up to the point where the corruption started!). More investigation showed that the original simstate (recovered from the original sending machine - it's kept on the sim host for 7 days) and the corrupted simstate had the exact same file size, but showed differences with the cmp tool and had different md5 checksums. At this point the original grid monkey had prior obligations, and I stepped in.
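A rough Python sketch of that comparison step (the file names here are placeholders, not the actual simstate paths): identical sizes plus differing bytes plus differing MD5 digests is the signature of in-flight bit flips rather than truncation.

import hashlib, os

# Placeholder paths standing in for the original and corrupted simstates.
original, corrupted = "simstate_original.gz", "simstate_corrupted.gz"

# Same size, so it isn't truncation.
print(os.path.getsize(original) == os.path.getsize(corrupted))

with open(original, "rb") as a, open(corrupted, "rb") as b:
    data_a, data_b = a.read(), b.read()

# Offsets where the two files disagree (roughly what `cmp -l` reports).
print([i for i, (x, y) in enumerate(zip(data_a, data_b)) if x != y])

# Different digests confirm the content really changed.
print(hashlib.md5(data_a).hexdigest())
print(hashlib.md5(data_b).hexdigest())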
A hexdump and a diff (shown in the previous section) showed the bit flips. The first thing everyone said was "well, it's impossible for the network to corrupt it, let's look at our code." We looked long and we looked hard, and we were very simply generating the file and sending it with the unix utility curl. We began to look at the load balancers in front of the asset storage cluster. We wrote a test to systematically pummel the asset store using a known corrupt file, to see which transfers had problems, so we could look in the transfer and load balancer logs and see which routes the broken files took. We sent from 15 racks of sim hosts (40 per rack, each machine hosting 4 regions... basically a lot of hardware) in our Dallas facility to the load balancers and asset cluster in San Francisco. We wrote the software to email our gridmonkey list when it found any problems. Immediately, people across the office began to have their inboxes filled with test failures (whoops!).
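The harness itself isn't described in detail, so here is only a sketch of its shape (the endpoint URL and file name are invented): repeatedly upload a reference file, pull it back down, and flag any digest mismatch.

import hashlib, time
import requests   # third-party HTTP client, assumed available

# Hypothetical endpoint and reference file; placeholders, not the real asset store.
ASSET_URL = "http://asset-cluster.example.com/corruption-test"
with open("reference_payload.bin", "rb") as f:
    payload = f.read()
expected = hashlib.md5(payload).hexdigest()

failures = 0
for attempt in range(1000):
    requests.put(ASSET_URL, data=payload)        # upload the reference file
    returned = requests.get(ASSET_URL).content   # pull it straight back down
    if hashlib.md5(returned).hexdigest() != expected:
        failures += 1
        # The real harness emailed the gridmonkey list; a log line stands in here.
        print(f"attempt {attempt}: digest mismatch, possible in-flight corruption")
    time.sleep(1)

print(f"{failures} corrupted round trips out of 1000")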
We ran a few more tests and began to analyze all of these failures. They were evenly distributed among the load balancers, asset nodes, and sim hosts. We ruled out the load balancers and asset cluster by sending to the beta grid (aditi) instead; when errors still appeared there, we began to run tests from each point in our network to every other point. The data began to show that anything going in or out of Dallas had a chance of getting corrupted, while every other path was flawless.
We looked hard for a pattern in the data. The errors always appeared a multiple of 64 bytes apart, with none farther than 320 bytes apart. Clearly, the pairs of opposing flips needed to land in the same TCP packet, or the packet would fail the TCP checksum. The data string 0x0000 0x00FF 0xFFFF repeated through a file would let this situation happen as frequently as possible:
0000000 0000 0000 0000 ffff ffff ffff 0000 0000
0000010 0000 ffff ffff ffff 0000 0000 0000 ffff
0000020 ffff ffff 0000 0000 0000 ffff ffff ffff
would be flipped into
0000000 0800 0000 0000 ffff ffff ffff 0000 0000
0000010 0000 ffff ffff ffff 0000 0000 0000 ffff
0000020 f5ff ffff 0000 0000 0000 ffff ffff ffff
         ^
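A sketch of writing such a worst-case test file (the exact generator isn't shown; this simply reproduces the alternating runs of 0x00 and 0xFF bytes in the dump above, at roughly the 5.5MB size mentioned earlier):

# Six 0x00 bytes then six 0xFF bytes, repeated. Every packet then carries dense
# runs of both settable (0) and clearable (1) bits close together, so a pair of
# opposing flips that cancels in the TCP checksum, and therefore slips through,
# can occur as often as possible.
PATTERN = b"\x00" * 6 + b"\xff" * 6
TARGET_SIZE = 5_500_000   # assumed size, roughly matching the 5.5MB simstate

with open("bitflip_bait.bin", "wb") as f:   # hypothetical output file name
    f.write(PATTERN * (TARGET_SIZE // len(PATTERN)))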
Once we started pushing this data out, we saw failures immediately. This enabled our ISP to track down specifically which node on which router in the path was failing.