It’s All Relative

I had a somewhat amusing thought as I was walking home from the Lab today. But first, some background.

We pushed out a server patch late today. Since I joined the Lab about 8 months ago, there’s been an unwritten rule - “no releases on Fridays.” It makes sense, since there’s almost no one around on the weekend to respond to any crisis that may erupt. That said, occasionally I think it’s worth taking the risk and breaking the rule so that we have a more stable weekend. It’s a big risk - any code we push out is inherently untested under the scale and usage patterns of the main grid - but it’s a very informed decision, since I’ve usually worked closely with the developers and testers who’ve done their darnedest to make sure it’s the right thing.

When we do a deploy, we push the code out to the sim nodes in a multi-stage process. First we distribute the code in compressed form to all of the sim nodes, which is a time-consuming process. Then we decompress the code into place, even though it’s not running yet. Finally we “bounce” (restart) the simulator process so that the new code is used. We can tune this sequence depending on our level of paranoia and on the exact code change that needs to go out.
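The three stages above can be sketched as a small in-memory model. This is only an illustration of the sequencing, not the Lab's actual tooling: `SimNode`, `run_stage`, `deploy`, and the stage names are all hypothetical, and a real deploy would copy files and restart processes rather than set fields.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stage names for the distribute / decompress / bounce sequence.
STAGES = ("distribute", "unpack", "bounce")


@dataclass
class SimNode:
    name: str
    running: str                      # version the simulator process is serving
    archive: Optional[str] = None     # compressed build copied to the host
    unpacked: Optional[str] = None    # build decompressed into place, not yet live


def run_stage(stage: str, nodes: List[SimNode], version: str) -> None:
    """Apply one stage to the given nodes, refusing to skip a prerequisite."""
    for node in nodes:
        if stage == "distribute":
            node.archive = version
        elif stage == "unpack":
            if node.archive != version:
                raise RuntimeError(f"{node.name}: {version} was never distributed")
            node.unpacked = version
        elif stage == "bounce":
            if node.unpacked != version:
                raise RuntimeError(f"{node.name}: {version} was never unpacked")
            node.running = version    # the restart picks up the new code
        else:
            raise ValueError(f"unknown stage: {stage}")


def deploy(nodes: List[SimNode], version: str, stop_after: str = "bounce") -> None:
    """Run the stages in order across all nodes, pausing after `stop_after`."""
    if stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after}")
    for stage in STAGES:
        run_stage(stage, nodes, version)
        if stage == stop_after:
            break


if __name__ == "__main__":
    grid = [SimNode("sim1", "old"), SimNode("sim2", "old")]
    deploy(grid, "new", stop_after="unpack")   # staged, but nothing restarted yet
    run_stage("bounce", grid[:1], "new")       # restart one node by hand
    print([(n.name, n.running) for n in grid])
```

Stopping after the unpack stage leaves every node serving the old code with the new build already in place, which is what makes it cheap to restart a handful of nodes first and the rest later.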

For today’s push we went as far as decompressing the code, then paused. I contacted Yiffy Yaffle and CrystalShard Foo, who’d reported some of the LSL problems, and had them get ready for a test, then bounced some handy regions to verify the change. Although this was not a comprehensive functionality test, it was a good last-minute “let’s make sure we deployed the right code!” sanity check. Once the fixes were confirmed to be in place, Aura Linden took over the job of monitoring the rolling restart, which bounced the rest of the grid.

However…

One little thing was amusing during the limited tests. After CrystalShard’s region came back up, I couldn’t teleport in. I could watch the simulator process starting up in the log data on the sim host (which I have color-coded by message category, so it’s usually a steady scroll of green chatter that looks like The Matrix but is actually incredibly useful), but in-world I couldn’t teleport in.

To teleport, the region that you’re in needs to know that the region you want to teleport to is available.

Fairly recently, we finally gave up on the notion of “absolute knowledge” about the grid. This is somewhat like the Newtonian ideas of “absolute space” and “absolute time,” which were shown to be incorrect by Einstein - because the speed of light is both finite and constant for any observer, you can’t define an objective clock or objective yardstick. In our case, rather than having precise, universal knowledge of space - obtained by having each simulator constantly query a single, central, canonical source - simulators keep their own cache and regularly query a distributed, cached repository of that knowledge, which is updated less frequently.
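A minimal sketch of that trade-off, assuming a simple time-to-live cache: the class name `RegionDirectoryCache`, the `fetch` callback, and the 30-second lifetime are all made up for illustration and say nothing about how the grid's repository is really built. The point is only that a cached answer can lag behind reality until the entry expires.

```python
import time
from typing import Callable, Dict, Tuple


class RegionDirectoryCache:
    """Caches "is this region up?" answers for a fixed time-to-live.

    Hypothetical illustration: `fetch` stands in for a query to the
    slower, shared repository of region status.
    """

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 30.0,                  # assumed value, for illustration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}  # region -> (is_up, fetched_at)

    def is_up(self, region: str) -> bool:
        now = self._clock()
        cached = self._entries.get(region)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]                        # possibly stale, but cheap
        status = self._fetch(region)                # entry missing or expired: ask again
        self._entries[region] = (status, now)
        return status


if __name__ == "__main__":
    fake_now = [0.0]
    truth = {"region-a": False}                     # the region is restarting
    cache = RegionDirectoryCache(lambda r: truth[r], clock=lambda: fake_now[0])

    print(cache.is_up("region-a"))                  # cached while the region is down
    truth["region-a"] = True                        # the region comes back up
    fake_now[0] = 10.0
    print(cache.is_up("region-a"))                  # still answered from the cache
    fake_now[0] = 31.0
    print(cache.is_up("region-a"))                  # entry expired, so it is re-fetched
```

A shorter lifetime narrows the window in which the answer is wrong, at the cost of more queries to the shared source.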

It’s also a bit like the Heisenberg Uncertainty Principle of quantum mechanics, which is the observation that you can’t know both values of certain pairs of properties, like position and momentum, to arbitrary precision. This is commonly but incorrectly understood as “if you try to measure one, you change the other.” In actuality it’s more like you’re trying to measure a blob of jelly - if you get a precise measure of one, how you relate the other to the whole has less meaning. They’re more like aspects of the same thing. But in the case of the Second Life Grid, it really is a matter of “if you try to know everything about everything, it’ll all slip through your fingers and you’ll be left with nothing!” - if every part of the grid is spending its time asking every other part of the grid for status, the whole grid will be too busy to do anything fun!
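The cost of “everyone asks everyone” is easy to put a number on. The sketch below just counts queries per polling round; the function names are hypothetical and the node count is an arbitrary example, not the size of the grid.

```python
def all_to_all_queries(nodes: int) -> int:
    """Queries per round if every node asks every other node for its status."""
    return nodes * (nodes - 1)


def shared_directory_queries(nodes: int) -> int:
    """Queries per round if every node asks one shared, cached directory instead."""
    return nodes


if __name__ == "__main__":
    n = 1000  # arbitrary example size
    print(all_to_all_queries(n), shared_directory_queries(n))
```

The first figure grows with the square of the node count and the second only linearly, so doubling the number of nodes roughly quadruples the all-to-all chatter.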

So anyway, the end result is that despite knowing the region was available, because I had a direct connection to it (watching the logs!), I still couldn’t teleport to it, since other parts of the system didn’t have that knowledge yet.

As the Grid grows and scales, this will be more and more the norm. Absolute knowledge of the state of a complex system is impossible. Embracing the uncertainty and building systems which can accommodate the uncertainty is critical. And it will also be a matter of adjusting the way that residents interact with the system so that this lack of certainty is not seen as a blocker.

One Response

  1. Hi,

    This is a great post because it’s a rare peek into what you guys do and how you do it, which I wish more was said about :) I hope you continue to post about your experiences.

    ~Ben

    Ben - March 22, 2007 at 8:20 pm
