Of course, that’s when I choose to post…

March 23, 2007 - No Responses

So my first official personal post was written after I confidently walked home from deploying a patch, and in it I talked about weighing the pros and cons of doing so on a Friday.

Ooops.

Of course, it turned out that the patch had a bug we couldn’t find on any of our test grids due to the lack of load. (Side note: we are serious about doing synthetic load testing; we’re just not there yet.) The bug was in a change intended to address “stuck upload queues” - which basically means that a simulator is busy uploading attachments from residents to the asset servers and gets “stuck.” The fix unfortunately had bigger problems - the simulators would crash under certain circumstances. D’oh.

For those who are interested…

The grid is so large now that random hardware and network failures create enough noise that it usually takes a while to notice a trend. In this case, the crash rate spiked noticeably after the rollout was complete. Of course, everyone had gone home, so it was up to the Grid Monkey to notice this. The GM got in touch with me the next day, and we decided that a rollback was the safest option, and initiated that as quickly as was convenient.
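As a rough illustration of why a trend is hard to spot in noisy data, here is a minimal sketch that compares a current crash count against a baseline. The function name, the threshold, and the numbers are all made up for the example; this is not how the Lab's monitoring actually works.

```python
from statistics import mean, stdev

def is_spike(baseline, current, threshold=3.0):
    """Return True if `current` sits more than `threshold` standard
    deviations above the mean of the `baseline` samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any increase counts as a spike.
        return current > mu
    return (current - mu) / sigma > threshold

# Hypothetical hourly crash counts before the rollout (invented numbers).
hourly_crashes = [4, 6, 5, 7, 5, 6, 4, 5]

normal_hour = is_spike(hourly_crashes, 7)    # within the usual noise
bad_hour = is_spike(hourly_crashes, 20)      # well above the usual noise
```

The noisier the baseline, the larger the jump has to be before it stands out, which is why a real trend can take a while to notice.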

One positive thing about the patch/rollback is that while the patch was live, residents identified SVC-55: llLookAt and llRotLookAt were broken by the patch. We were able to identify the cause of that, fix it, and then see if we could address the issue with the “stuck upload queues” fix. After several days of investigation we narrowed it down, but it was unclear whether a re-fix would actually improve things in practice without rolling it out to the grid, so we ended up simply reverting the change.

It’s All Relative

March 17, 2007 - One Response

I had a somewhat amusing thought as I was walking home from the Lab today. But first, some background.

We pushed out a server patch late today. Since I joined the Lab about 8 months ago there’s been an unwritten rule - “no releases on Fridays.” It makes sense since there’s almost no one around on the weekend to respond to any crisis that may erupt. That said, occasionally I think it’s worth taking the risk and breaking the rule, so that we have a more stable weekend. It’s a big risk - any code we push out is inherently untested under the scale and usage patterns of the main grid - but it’s a very informed decision since I’ve usually worked closely with the developers and testers who’ve done their darnedest to make sure it’s the right thing.

When we do a deploy, we push the code out to the sim nodes in a multi-stage process. First we distribute the code in compressed form to all of the sim nodes, which is a time-consuming process. Then we decompress the code into place, even though it’s not running yet. Finally we “bounce” the process so that the new code is used. We can tune this sequence depending on levels of paranoia and on the exact code change that needs to go out.
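To make the staging concrete, here is a minimal sketch of the three stages with a configurable stopping point. The stage names, the `deploy` function, and the node names are hypothetical; the real tooling does actual file transfers and restarts, where this only records which stages each node has reached.

```python
from enum import IntEnum

class Stage(IntEnum):
    DISTRIBUTE = 1   # copy the compressed build to the node
    DECOMPRESS = 2   # unpack into place; the old code keeps running
    BOUNCE = 3       # restart the process so the new code is used

def deploy(nodes, stop_after=Stage.BOUNCE):
    """Run each stage across all nodes, in order, halting after `stop_after`.

    Returns a dict mapping each node to the list of stage names it completed.
    """
    progress = {node: [] for node in nodes}
    for stage in Stage:
        if stage > stop_after:
            break
        for node in nodes:
            progress[node].append(stage.name)
    return progress

# Stage the code everywhere but hold off on the restart.
staged = deploy(["sim1", "sim2"], stop_after=Stage.DECOMPRESS)
```

Finishing each stage on every node before moving to the next is what makes it possible to pause just before the bounce, which is the only step that residents would notice.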

For today’s push we went as far as decompressing the code, then paused. I contacted Yiffy Yaffle and CrystalShard Foo, who’d reported some of the LSL problems, and had them get ready for a test, then bounced some handy regions to verify the change. Although this was not a comprehensive functionality test, it was a good last-minute “let’s make sure we deployed the right code!” sanity check. After confirming that the fixes were in place, Aura Linden took over the job of monitoring the rolling restart, which bounced the rest of the grid.

However…

One little thing was amusing during the limited tests. After CrystalShard’s region came back up, I couldn’t teleport in. I could watch the simulator process starting in the log data on the sim host (which I have color coded based on message category, so it’s usually a steady scroll of green chatter which looks like The Matrix but is actually incredibly useful), but in-world I couldn’t teleport in.

To teleport, the region that you’re in needs to know that the region you want to teleport to is available.

Fairly recently, we finally gave up on the notion of “absolute knowledge” about the grid. This is somewhat like the Newtonian ideas of “absolute space” and “absolute time” which were shown to be incorrect by Einstein - because the speed of light is both finite and constant for any observer, you can’t define an objective clock or objective yardstick. In our case, rather than having a precise universal knowledge of space - found by having each simulator constantly query a single, central canonical source - simulators keep a local cache and periodically query a distributed, cached repository of that knowledge, which is itself updated less frequently.
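Here is a minimal sketch of what a cached view like that looks like from one simulator’s side. The class name, the `fetch` callable, and the 60-second refresh interval are assumptions made for illustration, not the grid’s actual design or numbers.

```python
class CachedRegionView:
    """One simulator's local, possibly stale view of which regions are up."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch        # callable returning the regions currently up
        self._ttl = ttl_seconds    # how long a cached answer is trusted
        self._cached = set()
        self._fetched_at = None

    def is_available(self, region, now):
        # Only go back to the shared repository when the cache has expired.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._cached = set(self._fetch())
            self._fetched_at = now
        return region in self._cached

live = {"Alpha"}                          # stand-in for the shared repository
view = CachedRegionView(lambda: live, ttl_seconds=60)

before = view.is_available("Beta", now=0)    # Beta is not up yet
live.add("Beta")                             # Beta comes back up
stale = view.is_available("Beta", now=30)    # cache has not refreshed yet
fresh = view.is_available("Beta", now=60)    # cache refreshes, Beta is visible
```

The window between `stale` and `fresh` is the cost of the design: a region can be up while a simulator asking about it still believes otherwise.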

It’s also a bit like the Heisenberg Uncertainty Principle of quantum mechanics, which is the observation that you can’t know both values of certain pairs of properties, like position and momentum, to arbitrary precision. This is commonly but incorrectly understood as “if you try and measure one, you change the other.” In actuality it’s more like you’re trying to measure a blob of jelly - if you get a precise measure of one, how you relate the other to the whole has less meaning. They’re more like aspects of the same thing. But in the case of the Second Life Grid, it really is a matter of “if you try to know everything about everything, it’ll all slip through your fingers and you’ll be left with nothing!” - if every part of the grid is spending its time asking every other part of the grid for status, the whole grid will be too busy to do anything fun!
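A quick back-of-the-envelope calculation shows why the “everyone asks everyone” approach falls over. The function below and the simulator count of 1,000 are purely illustrative, not real grid figures.

```python
def status_queries_per_round(n_sims, everyone_asks_everyone):
    """Count status queries needed for one round of updates.

    If every simulator asks every other simulator, that is n * (n - 1)
    queries. If each simulator asks a shared cache once, it is just n.
    """
    if everyone_asks_everyone:
        return n_sims * (n_sims - 1)
    return n_sims

all_pairs = status_queries_per_round(1000, everyone_asks_everyone=True)
via_cache = status_queries_per_round(1000, everyone_asks_everyone=False)
```

The first number grows with the square of the grid size and the second grows linearly, so the gap only widens as the grid gets bigger.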

So anyway, the end result is that despite having knowledge that a region was available from having a direct connection to the region (watching the logs!) I still couldn’t teleport to it since other parts of the system didn’t have that knowledge yet.

As the Grid grows and scales, this will be more and more the norm. Absolute knowledge of the state of a complex system is impossible. Embracing the uncertainty and building systems which can accommodate the uncertainty is critical. And it will also be a matter of adjusting the way that residents interact with the system so that this lack of certainty is not seen as a blocker.

Hello world!

October 5, 2006 - 6 Responses

Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!