Quote: I think he oversimplified the back-end architecture.
I'm sure there's more than one physical server running each world.
As I see it, you have a cabinet of blade systems (say, 50 servers). Each server can handle a given number of instances simultaneously (say, 100).
Of these 50 servers, 2 or 3 are dedicated to database transactions (they are not part of the instance cluster), and another 2 or 3 are dedicated to coordinating all the servers. The coordinators are part of the instance cluster but don't run actual instances: they tell the other servers to spawn instances when necessary, or move an instance from one server to another.
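A rough Python sketch of that layout, purely to illustrate my guess: the blade counts are the "say" numbers above (picking 3 database and 3 coordinator blades), and the Blade/Coordinator classes and the least-loaded placement policy are hypothetical, not anything the game is known to use.

```python
from dataclasses import dataclass, field

# Hypothetical figures from the guesses above ("2 or 3" each, so 3 is assumed).
TOTAL_BLADES = 50
DB_BLADES = 3
COORDINATOR_BLADES = 3
INSTANCES_PER_BLADE = 100


@dataclass
class Blade:
    name: str
    capacity: int = INSTANCES_PER_BLADE
    instances: set[str] = field(default_factory=set)

    def free_slots(self) -> int:
        return self.capacity - len(self.instances)


class Coordinator:
    """Hypothetical coordinator: runs no instances itself, it only decides
    which blade spawns an instance and can move one to another blade."""

    def __init__(self, blades: list[Blade]) -> None:
        self.blades = blades

    def spawn(self, instance_id: str) -> Blade:
        # Least-loaded placement: just one simple policy among many possible.
        target = max(self.blades, key=lambda b: b.free_slots())
        if target.free_slots() == 0:
            raise RuntimeError("cluster is full")
        target.instances.add(instance_id)
        return target

    def move(self, instance_id: str, src: Blade, dst: Blade) -> None:
        if dst.free_slots() == 0:
            raise RuntimeError(f"{dst.name} is full")
        src.instances.remove(instance_id)
        dst.instances.add(instance_id)


# Database and coordinator blades are excluded from the instance pool.
instance_blades = [
    Blade(f"blade-{i}")
    for i in range(TOTAL_BLADES - DB_BLADES - COORDINATOR_BLADES)
]
coordinator = Coordinator(instance_blades)
total_slots = sum(b.capacity for b in instance_blades)
print(total_slots)  # 4400
```

With those numbers that leaves 44 instance blades, or 4400 instance slots.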
Now, for reliability, every spawned instance is duplicated on two different blades. That way, when the instance crashes on one blade, the other blade can take over on the fly while the crashed copy is reset on its own blade.
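Here's how I picture the duplication and failover, again as a hypothetical Python sketch: the Replica/ReplicatedInstance names, the primary/secondary roles and the separate reset step are my assumptions, not the real server code.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    blade: str
    role: str  # "primary" or "secondary" (assumed roles)
    healthy: bool = True


class ReplicatedInstance:
    """Hypothetical: one instance kept on two different blades."""

    def __init__(self, instance_id: str, blade_a: str, blade_b: str) -> None:
        if blade_a == blade_b:
            raise ValueError("the two copies must be on different blades")
        self.instance_id = instance_id
        self.replicas = [Replica(blade_a, "primary"), Replica(blade_b, "secondary")]

    def primary(self) -> Replica:
        return next(r for r in self.replicas if r.role == "primary")

    def on_crash(self, blade: str) -> None:
        crashed = next(r for r in self.replicas if r.blade == blade)
        crashed.healthy = False
        if crashed.role == "primary":
            survivor = next(r for r in self.replicas if r is not crashed)
            if not survivor.healthy:
                raise RuntimeError("both copies are down")
            # Fail over on the fly: players keep playing on the other blade.
            survivor.role = "primary"
            crashed.role = "secondary"

    def reset(self, blade: str) -> None:
        # The crashed copy is restarted on its own blade and rejoins as secondary.
        replica = next(r for r in self.replicas if r.blade == blade)
        replica.healthy = True


inst = ReplicatedInstance("dungeon-1", "blade-3", "blade-17")
inst.on_crash("blade-3")
print(inst.primary().blade)  # blade-17
```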
I suspect that some of the lag spikes we've encountered (a typical case: when somebody disconnects) were caused by the instance crashing and us being transferred to the second blade that was running our instance.
Because the blade acting as secondary in the example above is also running other instances as primary, when we get transferred to it after a crash we get lag: we are the low-priority job.
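To show why the promoted copy could lag, here's a toy per-tick scheduler for one blade, in Python. The "CPU units", the strict primary-first ordering and the idea that the copy keeps its secondary priority after a failover are all my assumptions about how it might work.

```python
def schedule_tick(jobs: list[tuple[str, str, int]], budget: int) -> dict[str, int]:
    """Hand out one tick of (hypothetical) CPU units on a single blade.

    jobs: (instance_id, role, demand), role being "primary" or "secondary".
    Returns the units granted to each instance.
    """
    granted: dict[str, int] = {}
    remaining = budget
    # Primary-role jobs are served first; secondary-role jobs get the leftovers.
    for role in ("primary", "secondary"):
        for instance_id, job_role, demand in jobs:
            if job_role == role:
                share = min(demand, remaining)
                granted[instance_id] = share
                remaining -= share
    return granted


# The blade is primary for A and B, and secondary for C, which just took over
# after a crash elsewhere and now carries a full player load.
jobs = [("A", "primary", 40), ("B", "primary", 40), ("C", "secondary", 30)]
print(schedule_tick(jobs, budget=100))  # {'A': 40, 'B': 40, 'C': 20}
```

Instance C wanted 30 units but only got 20, so the players who were just moved onto it would feel that as lag.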
Anyway, I'm not sure I'm clear...