So my first official personal post was written after I confidently walked home from deploying a patch, and it talked about weighing the pros and cons of doing so on a Friday.
Ooops.
Of course, it turned out that the patch had a bug we couldn’t find on any of our test grids due to the lack of load. (Side note: we are serious about doing synthetic load testing, we’re just not there yet.) The bug was in a change intended to address “stuck upload queues” – which basically means that a simulator is busy uploading attachments from residents to the asset servers and gets “stuck.” The fix unfortunately had bigger problems – the simulators would crash under certain circumstances. D’oh.
For those who are interested…
The grid is so large now that random hardware and network failures create enough noise that it usually takes a while to notice a trend. In this case, the crash rate spiked noticeably after the rollout was complete. Of course, everyone had gone home, so it was up to the Grid Monkey to notice this. The GM got in touch with me the next day, and we decided that a rollback was the safest option and initiated it as quickly as was convenient.
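For the curious, here is a rough sketch of what "noticing a trend through the noise" can look like. It is purely illustrative and not a description of our actual monitoring: the function name, the hourly crash counts, and the three-sigma threshold are all assumptions made up for this example.

```python
from statistics import mean, stdev


def crash_rate_spiked(baseline, recent, threshold_sigmas=3.0):
    """Hypothetical check: is the recent crash rate above normal noise?

    baseline: crash counts per hour from before the rollout.
    recent:   crash counts per hour from after the rollout.
    Returns True when the average of `recent` exceeds the baseline mean
    by more than `threshold_sigmas` baseline standard deviations.
    """
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline samples and one recent sample")
    # Averaging the recent window keeps one bad hour from raising the alarm.
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)
    return mean(recent) > limit


# Made-up numbers: ordinary background failures before the rollout.
before_rollout = [10, 12, 9, 11, 13, 10, 12, 11]

print(crash_rate_spiked(before_rollout, [12, 13, 11]))  # normal noise
print(crash_rate_spiked(before_rollout, [16, 10, 11]))  # one noisy hour
print(crash_rate_spiked(before_rollout, [25, 30, 28]))  # sustained spike
```

Comparing a window of recent samples against the baseline, rather than a single reading, is the point: on a large grid one bad hour is expected, while a sustained shift after a rollout is what deserves a phone call.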
One positive thing about the patch/rollback is that, while the patch was live, residents identified SVC-55: llLookAt and llRotLookAt were broken by the patch. We were able to identify the cause of that, fix it, and then see if we could address the issue with the “stuck upload queues” fix. After several days of investigation we narrowed it down, but it was unclear whether a re-fix would actually improve things in practice without rolling it out to the grid, so we ended up simply reverting the change.