Details about FLNetwork

Hi 😉
As promised in the last report entry, here is some info about the current state of the NeREIDS networking system (this one focuses more on the programming aspects than what I’ve shown before).

Networking and time synchronization need to be supported by a robust and precise clock mechanism. I initially planned to explain my problem with timers under Windows and to publish my code for FLCore::AppClock (this would have been a first!). Unfortunately, while looking up the URLs of the references I used (so that I could link them for you), I ended up reading some more on the matter, and I’m no longer sure my solution is good enough to deserve publication. So, let’s talk about networking instead.

FLNetwork library

The FLNetwork library now contains quite a lot of stuff, but its primary classes are FLNetwork::SrvMgr (to be used by a server) and FLNetwork::CliMgr (to be used by… oh well, I’ll leave this one as an exercise). Here I’ll give a rough picture of what SrvMgr does. Since the client is meant to mirror what the server expects, you should be able to follow along and picture the reciprocal in your own head. Or with your own pen (or however you wish, but I won’t write things twice here).

On initialization, SrvMgr opens a socket in UDP mode and starts calling recv() on it: recv() returns every UDP packet sent to the PC it runs on, on the port bound to that socket.
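For readers who want to see that step concretely, here is a minimal sketch using POSIX sockets. The post’s code runs on Windows, where Winsock offers near-identical calls. One assumption is worth flagging. SrvMgr needs the sender’s address to route packets, so the sketch uses recvfrom() rather than plain recv(). The helper names openUdpSocket and receiveDatagram are hypothetical.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <vector>

// Opens a UDP socket bound to `port` on all local interfaces.
// Port 0 lets the OS choose a free port. Returns -1 on failure.
int openUdpSocket(uint16_t port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Blocks until one datagram arrives. Unlike recv(), recvfrom() also
// reports the sender's address, which the server needs in order to find
// (or create) the matching connection. Returns the byte count, or -1.
ssize_t receiveDatagram(int fd, std::vector<uint8_t>& buffer, sockaddr_in& from) {
    socklen_t fromLen = sizeof(from);
    return recvfrom(fd, buffer.data(), buffer.size(), 0,
                    reinterpret_cast<sockaddr*>(&from), &fromLen);
}
```

On Windows the same calls live in Winsock. There you call WSAStartup first and use closesocket() instead of close().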
When a packet comes in, it is filtered by means of a 32-bit header specific to the application. This may be unnecessary, but, heh… Glenn Fiedler seems to consider it good practice, so it should be good enough for me too. You have a better chance of winning your national lottery than a random packet from another application has of passing that filter by accident (about 1 in 2^32 for random header bytes, roughly 1 in 4.3 billion). And 32 bits is a very tiny overhead. If the packet passes, the sender address is compared to the already known addresses.
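Here is a minimal sketch of the application filter. It assumes the 32-bit identifier sits in the first four bytes of every packet, in little-endian order. The constant kAppProtocolId and the helper names are hypothetical, since the post does not give the real value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical application identifier: any fixed 32-bit value works as
// long as client and server agree on it.
constexpr uint32_t kAppProtocolId = 0x4E455244u;

// Reads a little-endian 32-bit value from the start of a buffer.
uint32_t readU32LE(const uint8_t* p) {
    return static_cast<uint32_t>(p[0]) |
           (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) |
           (static_cast<uint32_t>(p[3]) << 24);
}

// True if the packet starts with our protocol id. Packets shorter than the
// header, or carrying any other id, are meant to be dropped silently.
bool passesAppFilter(const uint8_t* data, size_t size) {
    return size >= 4 && readU32LE(data) == kAppProtocolId;
}
```

A packet that fails this check is dropped before the address lookup, so stray traffic never reaches connection handling.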

If, like me, you’re reading this page from left to right and top to bottom, you’ll know the server has just been initialized, so it doesn’t know any addresses yet.
The next piece of data after the header is the message type. Since this new address is unheard of, SrvMgr only lets messages of type FLNETWORK_MSG_CONNECTION_REQUEST go any further. The new address is then added to the addresses known by the SrvMgr instance, by registering an FLNetwork::SrvConnectionToClient. Every subsequent packet coming from that same address and port will be interpreted as part of that connection and assumed to come from the same client. This design is not resistant to spoofing attacks, but adding layers of security for the sake of a freaking game had to stop somewhere.
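The routing rule above might look like this: known endpoints go to their connection, and unknown endpoints may only open one. ConnectionRegistry, Connection and the numeric message codes are illustrative stand-ins, not FLNetwork’s actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical message codes: the real FLNetwork values are not given.
enum MsgType : uint8_t {
    FLNETWORK_MSG_CONNECTION_REQUEST = 1,
    FLNETWORK_MSG_CONNECTION_REPLY = 2,
};

// An endpoint is identified by its IPv4 address and port.
using Endpoint = std::pair<uint32_t, uint16_t>;

// Simplified stand-in for FLNetwork::SrvConnectionToClient.
struct Connection {
    int packetsSeen = 0;
};

class ConnectionRegistry {
public:
    // Returns the connection that should handle this packet, or nullptr if
    // the packet must be dropped. Unknown endpoints may only ask to connect.
    Connection* route(const Endpoint& from, MsgType type) {
        auto it = connections_.find(from);
        if (it == connections_.end()) {
            if (type != FLNETWORK_MSG_CONNECTION_REQUEST) return nullptr;
            it = connections_.emplace(from, Connection{}).first;
        }
        ++it->second.packetsSeen;
        return &it->second;
    }

    size_t size() const { return connections_.size(); }

private:
    std::map<Endpoint, Connection> connections_;
};
```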

The new connection then starts out following the Connection Protocol, so each subsequent packet from it that passes the application filter is handed over to the Connection Diplomat.

The Connection Protocol

The server Connection Diplomat begins the Connection Protocol by sending a message of type FLNETWORK_MSG_CONNECTION_CHALLENGE to the client’s address, along with a value named ‘salt’. If no answer to this message is received, the same packet is retransmitted a limited number of times, after which the connection simply fails. The client should answer with an FLNETWORK_MSG_CONNECTION_REPLY. That reply contains a salt value of its own, along with an MD5 hash of the server secret (typically, the password set up when you launch the NeREIDS server) concatenated with the two salts. The server obviously knows its own secret and now knows both salts, so it computes the same hash and compares the two values. If they match, the Connection Protocol succeeds.
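The challenge-response exchange can be sketched as follows, with two loud assumptions. First, the C++ standard library has no MD5, so a 64-bit FNV-1a hash stands in for it. FNV-1a is not cryptographic and must not be used as-is for security. Second, the default retry limit of 5 is hypothetical. The ‘|’ separators between the concatenated fields are an addition: they keep salts 42 and 7 from hashing the same as 4 and 27.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// STAND-IN for MD5: 64-bit FNV-1a. Chosen only because it is short and
// self-contained; it is NOT a cryptographic hash.
uint64_t fnv1a(const std::string& data) {
    uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ull;             // FNV prime
    }
    return h;
}

// Both sides compute hash(secret | serverSalt | clientSalt).
uint64_t challengeResponse(const std::string& secret,
                           uint64_t serverSalt, uint64_t clientSalt) {
    return fnv1a(secret + "|" + std::to_string(serverSalt) + "|" +
                 std::to_string(clientSalt));
}

// Server-side state of one connection challenge.
class ConnectionChallenge {
public:
    explicit ConnectionChallenge(uint64_t serverSalt, int maxSends = 5)
        : serverSalt_(serverSalt), maxSends_(maxSends) {}

    // Called on each retransmission timer tick. Returns false once the
    // CHALLENGE has already been sent maxSends times: the connection fails.
    bool sendOrGiveUp() {
        if (sends_ >= maxSends_) return false;
        ++sends_;  // a real server would (re)send the CHALLENGE packet here
        return true;
    }

    // Checks the hash carried by a CONNECTION_REPLY.
    bool verify(const std::string& secret, uint64_t clientSalt,
                uint64_t clientHash) const {
        return clientHash == challengeResponse(secret, serverSalt_, clientSalt);
    }

private:
    uint64_t serverSalt_;
    int maxSends_;
    int sends_ = 0;
};
```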
SrvMgr detects this success and switches the connection to the Authentication Protocol, handing subsequent messages to the Authentication Diplomat.

The Authentication Protocol

The server Authentication Diplomat begins the Authentication Protocol by sending heartbeats to the client’s address. It then waits for the client to send either a login request or a new user registration request. (The NeREIDS client typically waits for the user’s decision between connection and authentication, which is why the server just sends heartbeats while waiting on the reaction times of us petty humans.) The client should reply to those heartbeats while its biological self gathers the required information; otherwise it will time out.
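Here is a sketch of the heartbeat bookkeeping, assuming a fixed send interval and a silence timeout. HeartbeatMonitor and both durations are hypothetical, since the post does not give FLNetwork’s values. The current time is passed in explicitly, which keeps the logic easy to test.

```cpp
#include <cassert>
#include <chrono>

// Tracks when we last sent a heartbeat and when we last heard the peer.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(Clock::time_point start, Clock::duration interval,
                     Clock::duration timeout)
        : interval_(interval), timeout_(timeout),
          lastSent_(start), lastHeard_(start) {}

    // Returns true, and records the send, if a new heartbeat is due.
    bool heartbeatDue(Clock::time_point now) {
        if (now - lastSent_ < interval_) return false;
        lastSent_ = now;
        return true;
    }

    // Call whenever a valid packet (e.g. a heartbeat reply) arrives.
    void onPacketReceived(Clock::time_point now) { lastHeard_ = now; }

    // True if the peer has been silent for longer than the timeout.
    bool timedOut(Clock::time_point now) const { return now - lastHeard_ > timeout_; }

private:
    Clock::duration interval_;
    Clock::duration timeout_;
    Clock::time_point lastSent_;
    Clock::time_point lastHeard_;
};
```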

When a client sends an FLNETWORK_MSG_NEW_USER_REQUEST, SrvMgr asks the underlying application (through a callback) two things: whether this operation is allowed, and whether the username embedded in the request is available. If so, it notifies the client that registration is allowed and waits for an FLNETWORK_MSG_NEW_USER_FINALIZATION message. That message contains an HMAC-MD5 of the chosen user name, the chosen password for that user, and an application-wide constant salt value. The underlying application is responsible for storing this data however it wishes, since every future FLNETWORK_MSG_LOGIN_REQUEST check will need it.
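The registration exchange can be pictured as a two-step state machine that leaves every policy decision to the application. The RegistrationHooks structure, its two callbacks and the RegState values are hypothetical; the post only says that a callback is involved. Here `credential` stands for the HMAC-MD5 digest the client sends.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical application hooks (FLNetwork's real signatures are unknown).
struct RegistrationHooks {
    // Is registration allowed, and is this user name still free?
    std::function<bool(const std::string& user)> isRegistrationAllowed;
    // Persist the credential however the application wishes.
    std::function<void(const std::string& user, const std::string& credential)> storeUser;
};

enum class RegState { Idle, AwaitingFinalization, Done, Refused };

// Server-side handling of NEW_USER_REQUEST followed by NEW_USER_FINALIZATION.
class Registration {
public:
    explicit Registration(RegistrationHooks hooks) : hooks_(std::move(hooks)) {}

    RegState onNewUserRequest(const std::string& user) {
        user_ = user;
        state_ = hooks_.isRegistrationAllowed(user) ? RegState::AwaitingFinalization
                                                    : RegState::Refused;
        return state_;
    }

    // Ignored unless the request step was accepted.
    RegState onFinalization(const std::string& credential) {
        if (state_ != RegState::AwaitingFinalization) return state_;
        hooks_.storeUser(user_, credential);
        state_ = RegState::Done;
        return state_;
    }

private:
    RegistrationHooks hooks_;
    std::string user_;
    RegState state_ = RegState::Idle;
};
```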
As a side note, SrvMgr allows a user to request registration with a value named “userstatus” greater than 0 (0 being your default everyday client). In that case, a challenge protocol comparable to the connection challenge is set up, this time checking against a user-status upgrade secret (defining and allowing specific user statuses is left entirely to the discretion of the application).

When a client sends an FLNETWORK_MSG_LOGIN_REQUEST, SrvMgr sends an FLNETWORK_MSG_LOGIN_CHALLENGE to the client’s address, in pretty much the same way as for the connection challenge. This time, though, the client must also provide a user name, along with a (salted) rehash of the same HMAC-MD5 of user name and password that was sent during registration. SrvMgr then asks the underlying application whether that pair of user name and hashed password is valid, and if so, the Authentication Protocol succeeds.
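The login check can be sketched as follows. The client proves it knows the credential stored at registration without sending it, by hashing it together with both salts. As a labeled stand-in, std::hash replaces the MD5 rehash (std::hash is not cryptographic). The lookup callback is a hypothetical application hook.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// STAND-IN ONLY: std::hash is not a cryptographic hash; it replaces the
// MD5 rehash described in the post so the sketch stays self-contained.
size_t digest(const std::string& s) { return std::hash<std::string>{}(s); }

// Proof sent by the client: the stored credential (the HMAC of user name
// and password) mixed with both salts, so a captured proof is useless
// against a fresh server salt.
size_t loginProof(const std::string& storedCredential,
                  uint64_t serverSalt, uint64_t clientSalt) {
    return digest(storedCredential + "|" + std::to_string(serverSalt) + "|" +
                  std::to_string(clientSalt));
}

// Server side: fetch what the application stored at registration time and
// recompute the same proof. `lookup` returns false for unknown users.
bool verifyLogin(const std::function<bool(const std::string&, std::string&)>& lookup,
                 const std::string& user, uint64_t serverSalt,
                 uint64_t clientSalt, size_t proof) {
    std::string stored;
    if (!lookup(user, stored)) return false;
    return proof == loginProof(stored, serverSalt, clientSalt);
}
```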
SrvMgr detects this success and switches the connection to the Live Protocol, handing subsequent messages to the Live Diplomat.
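Putting the three protocols together, each server-side connection follows a small state machine. The active protocol decides which diplomat receives the next message. This is a simplified, hypothetical model. Failed stands in for cases such as an unanswered challenge or a heartbeat timeout.

```cpp
#include <cassert>

// Protocols a server-side connection goes through, plus a terminal state.
enum class Protocol { Connection, Authentication, Live, Failed };

// What the active diplomat reports after handling a message.
enum class Outcome { Pending, Succeeded, Failed };

// Decides which protocol (and therefore which diplomat) handles the
// connection's next messages.
Protocol advance(Protocol current, Outcome outcome) {
    if (outcome == Outcome::Failed) return Protocol::Failed;
    if (outcome == Outcome::Pending) return current;
    switch (current) {
        case Protocol::Connection:     return Protocol::Authentication;
        case Protocol::Authentication: return Protocol::Live;
        default:                       return current;  // Live is open-ended
    }
}
```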

The Live Protocol

This is where the fun begins, although FLNetwork does very little to help with it, since this was meant to be an entirely application-defined protocol (the application itself must provide the implementation of this protocol).

So, our application being NeREIDS, what does NeREIDS do with a live connection?
Well, lots already. But far less than it should for this to be called a game 😉
You can get an overview of the current state of the live protocol in that other post. I’ll just share some general considerations here.

Design Goals

Armed with less networking experience than I have now (which still isn’t much), I was initially heading towards an event-driven network design for NeREIDS. Event-driven means you only need to transmit the fact that something happened. For example, a player is currently performing a given action or, in a strategy game, a unit has received a given order. Both client and server then simulate the resulting states in parallel. The common hope is that, starting from the same chain of events in the correct order and timing, both will compute the same result. When done correctly this works well, and a lot of successful games use it. The colossal advantage is that an event is typically an order of magnitude lighter than the associated states in terms of data sent over the wire. A state being, for example, that unit AB is now at x,y,z, oriented in this direction, with so much life left, etc.

But let’s face it, a fully event-driven design means you need a total, complete, utterly reliable connection. Otherwise, as soon as you miss one tiny event, your simulation desyncs and diverges chaotically. And a total, complete, utterly reliable connection protocol is TCP.
Remember, I chose UDP.
Which is not a bad choice; it’s simply a different tradeoff. UDP is unreliable, true, but that doesn’t mean it sucks. It simply means that occasionally, a message you’ve sent will get lost (even though most of the time, messages do arrive at their destination).

We’ve traded reliability for more reactivity, as well as more control over that reactivity. When using TCP, you’re guaranteed that your messages will be received on the other side, true. But by giving you this certainty, TCP must give away something else. It’s like yin and yang. Or like saying that your postal service cannot guarantee that your postman will be there every morning to bring your daily mail. There will be weeks when Mr Postman is busy trying to deliver that letter from one month ago. Or he is checking with your old Aunt Amy whether the electricity bill was sent one minute earlier than the water bill. Aunt Amy likes things in order. But we may not have the same patience for this stuff, see? We are waiting to know whether that arrow pierced that helmet, and we are waiting to see the result on our screen. Aunt Amy doesn’t know what frames per second means. Aunt Amy doesn’t care about helmets and arrows. But we do. And we need that damn postman there. Every morning. Every frame, dammit. Now.

So, what to do with this unreliable protocol? To begin with, there is the advantage that you no longer care about the letter from one month ago. You wouldn’t know what to do with that letter anyway: its information is already obsolete. A distant-past action in a fast-paced game has almost no value. We care about present data. So we discard letters that are too old. Or packets. Same difference.
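Discarding letters that are too old is usually done with sequence numbers. The sketch below assumes each packet carries a 16-bit sequence number. It follows the wrap-around comparison from Glenn Fiedler’s networking articles. StaleFilter is a hypothetical helper that keeps only packets newer than the newest one already processed. That suits state-style data, where only the latest value matters.

```cpp
#include <cassert>
#include <cstdint>

// True if s1 is more recent than s2, treating the 16-bit counter as
// wrapping around (65535 is followed by 0).
bool sequenceGreaterThan(uint16_t s1, uint16_t s2) {
    return ((s1 > s2) && (s1 - s2 <= 32768)) ||
           ((s1 < s2) && (s2 - s1 > 32768));
}

// Accepts a packet only if it is newer than everything seen so far.
class StaleFilter {
public:
    bool accept(uint16_t seq) {
        if (hasAny_ && !sequenceGreaterThan(seq, newest_)) return false;
        newest_ = seq;
        hasAny_ = true;
        return true;
    }

private:
    uint16_t newest_ = 0;
    bool hasAny_ = false;
};
```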
I still want some of the advantages of the event-driven design, namely a great reduction in packet size. So we follow some of the rules of an event-driven design anyway: most of the time, it will save us bandwidth. For the occasional lost event, we just need to devise a resilient system, able to recover from event-data loss by means of state-data updates. Seems complicated? Well, maybe not so much:
We are free to use a mix of event-driven and “last-state” data. For example, sending an avatar’s position and velocity uses a mix of the two:
It can be seen as event-driven, since we can extrapolate a lot from the position and velocity of an avatar at a given point in time. Assuming it continues on its way in the same direction at the same pace, we can simulate an accurate result for its position next tick, then the next tick, then the next… See? We’ve already simulated a lot from the reduced “event” that the avatar was doing something (moving in that direction) at a single point in time.
But that same data can also be considered state-driven, and it lets our simulation recover perfectly from missed events. As soon as we’re notified of a more recent position-velocity pair, we can compute states from there and be in perfect sync with the server at that time. It doesn’t matter how many past position-velocity messages we’ve lost before. That’s because our data contains a little mix of event (a new moving direction and speed at time t) and state (position, direction and speed).
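Here is what that extrapolation (often called dead reckoning) might look like. The sketch assumes velocity is expressed in units per tick and that the newest snapshot received simply replaces the previous one. The types and names are illustrative, not taken from NeREIDS.

```cpp
#include <cassert>

struct Vec3 {
    float x, y, z;
};

// Last authoritative snapshot received from the server.
struct MotionSnapshot {
    int tick;       // server tick at which the snapshot was taken
    Vec3 position;
    Vec3 velocity;  // units per tick
};

// Predicts the position at targetTick, assuming constant velocity since
// the snapshot. Receiving a newer snapshot just replaces the old one,
// which is why lost snapshots in between do no lasting harm.
Vec3 extrapolate(const MotionSnapshot& s, int targetTick) {
    float dt = static_cast<float>(targetTick - s.tick);
    return {s.position.x + s.velocity.x * dt,
            s.position.y + s.velocity.y * dt,
            s.position.z + s.velocity.z * dt};
}
```

For example, a snapshot taken at tick 100 at the origin with velocity (1, 0.5, 0) predicts position (4, 2, 0) at tick 104.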
Yet this data is perfectly usable in an event-driven way: you just need to send a new position/velocity pair when the velocity changes (we could even go for “when the velocity changes unpredictably”). Take a bird that hasn’t modified its flight trajectory for the past two seconds, whose latest report the client acked no more than two seconds ago. For that bird we only send the changes in trajectory, which amount to… hmm… lemme check… ah: nothing.
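The send-only-on-change rule can be sketched per client and per entity. This rests on three assumptions. An exact velocity comparison decides whether the trajectory is unchanged. Ticks run at the 32 per second mentioned in the comments, so a two-second refresh is 64 ticks. And EntityReplication is a hypothetical structure.

```cpp
#include <cassert>

struct Vec3 {
    float x, y, z;
};

bool operator==(const Vec3& a, const Vec3& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

// Per-client, per-entity bookkeeping on the server.
struct EntityReplication {
    bool hasAck = false;                 // client ever confirmed this entity?
    Vec3 ackedVelocity{0.f, 0.f, 0.f};   // last velocity the client confirmed
    int ackedTick = 0;                   // tick of that confirmed report

    // Send when the client has no confirmed report, when the confirmed
    // trajectory no longer matches, or when it is older than refreshTicks.
    bool needsSend(const Vec3& currentVelocity, int nowTick, int refreshTicks) const {
        if (!hasAck) return true;
        if (!(ackedVelocity == currentVelocity)) return true;
        return nowTick - ackedTick > refreshTicks;
    }

    // Called when the client acknowledges a packet carrying this entity.
    void onAck(const Vec3& sentVelocity, int sentTick) {
        hasAck = true;
        ackedVelocity = sentVelocity;
        ackedTick = sentTick;
    }
};
```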
And ‘nothing’ is a really cool amount of data when you need to send it over a wire.
Only when a client reports a lost frame do we need to send the same data again. The same data “slot”, I should say. We will send the bird’s position and velocity again, sure, but we will send the current values at the time of this new send, not the obsolete, 10-frame-old lost data. Had we used TCP, we could have waited a long time for fresh data, simply because the old data had to be received first.
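Resending the slot rather than the data can be modelled by having each packet remember which entity slots it carried, not the values. When a packet is reported lost, its slots are marked dirty again. The next packet then serializes whatever the current values are. SlotResender and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

class SlotResender {
public:
    // An entity slot whose current value should go out in the next packet.
    void markDirty(int slot) { dirty_.insert(slot); }

    // Chooses the slots for packet `seq` and remembers them (slots only;
    // the caller serializes the *current* values when building the packet).
    std::vector<int> buildPacket(uint16_t seq) {
        std::vector<int> slots(dirty_.begin(), dirty_.end());
        inFlight_[seq] = slots;
        dirty_.clear();
        return slots;
    }

    // The client reported packet `seq` lost: its slots become dirty again.
    void onLost(uint16_t seq) {
        auto it = inFlight_.find(seq);
        if (it == inFlight_.end()) return;
        for (int slot : it->second) dirty_.insert(slot);
        inFlight_.erase(it);
    }

    // The client acknowledged packet `seq`: nothing to resend for it.
    void onAcked(uint16_t seq) { inFlight_.erase(seq); }

private:
    std::set<int> dirty_;
    std::map<uint16_t, std::vector<int>> inFlight_;
};
```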

Yet at times, it’s still harsh to lose precious networked data. Sometimes I think “event” too much while designing protocols, and I realize quite late that I’m driving myself towards a chaotic desync in case of packet loss. This realization is almost always a painful experience. It happened, for example, with my grand plans for an AI that would need almost no network traffic, except for an initialization phase and some occasional tactical decisions… *ahem*. Cannot work, dude, cannot work. Not with UDP. But with some more effort on the design, I’ve found ways to tweak it and make the system recoverable after event losses. It still uses an event-driven approach that keeps the required data transfer as tiny as possible. I don’t know for sure what I’ll manage to code in that regard, but I’m optimistic 😉

Sometimes, though, I need real reliability: for initialization data, or for chat data, which should also never be lost. The solution is simply to keep resending that data until it is acknowledged, waiting a reasonable amount of time between attempts. I just need to pack those resends along with the existing messages, so other traffic does not have to stall while we send them.
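Here is a minimal sketch of such a reliable channel riding on the unreliable one. It assumes messages are identified by an increasing id and acknowledged individually. ReliableQueue and the resend delay (a constructor parameter, 200 ms in the check) are hypothetical. collect() only chooses which pending messages ride along with the packet being built, so other traffic never waits.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

class ReliableQueue {
public:
    using Clock = std::chrono::steady_clock;

    explicit ReliableQueue(Clock::duration resendAfter) : resendAfter_(resendAfter) {}

    // Queues a message (e.g. a chat line) until the peer acknowledges it.
    void push(std::string text) {
        pending_.push_back(Entry{nextId_++, std::move(text), Clock::time_point{}, false});
    }

    // Messages to pack into the packet being built now: those never sent,
    // plus those still unacknowledged after at least resendAfter.
    std::vector<std::pair<uint32_t, std::string>> collect(Clock::time_point now) {
        std::vector<std::pair<uint32_t, std::string>> out;
        for (Entry& e : pending_) {
            if (!e.sentOnce || now - e.lastSent >= resendAfter_) {
                out.emplace_back(e.id, e.text);
                e.lastSent = now;
                e.sentOnce = true;
            }
        }
        return out;
    }

    // The peer acknowledged message `id`: stop resending it.
    void onAck(uint32_t id) {
        pending_.erase(std::remove_if(pending_.begin(), pending_.end(),
                                      [id](const Entry& e) { return e.id == id; }),
                       pending_.end());
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry {
        uint32_t id;
        std::string text;
        Clock::time_point lastSent;
        bool sentOnce;
    };

    Clock::duration resendAfter_;
    uint32_t nextId_ = 0;
    std::vector<Entry> pending_;
};
```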

2 Comments:

  1. You spend a lot of energy saving bandwidth, and I can see the personal achievement in doing so. But have you ever tested what a typical DSL connection’s bandwidth is? I mean, there has been a huge improvement in connection quality in the past few years. It’s a bit like hardware resources: every computer has a significant amount of computational power which is barely used.

    • Hey there! Thanks for your comment 😉
      It is true that I haven’t tested those network limits myself. It is true that I’m basing my assumptions about safe bandwidth usage on articles that are a few years old by now. And it is likely, as you say, that the issues their authors had to work around at the time, dealing with users equipped with 56k modems or worse, can now be dismissed. The typical gamer has at least ADSL, and the average ISP has improved too, I guess.
      And yet, being conservative about bandwidth usage still seems like a good choice. At the time of the Tribes article, the network was the bottleneck of your simulation. The average gamer’s bandwidth has increased, but less so (much less) than the average gamer’s available memory and processing power. In fact, I think the ratio of bandwidth capacity to computing capacity has actually decreased. And recent games make full use of local resources. My dream for a future NeREIDS is no exception to that rule, and it would typically operate on far more data than a similar-looking action game. That means the bottleneck is almost certainly still the network.
      But let’s crunch some numbers together. A server sends a packet every game tick, that is, roughly 32 times a second. I limit packet size to roughly 1 KByte to avoid any chance of a packet being split. That’s a maximum of 32 KBytes per second that a server must be able to upload per client. A gamer community hosting 32 concurrent connections would already need a DSL line with 1 MByte per second of upload. If that means commercial dedicated hardware, I think it is already pretty expensive. Mr Joe Gamer, if he wants to host and play along with 4 friends, needs to upload 128 KBytes per second from his own PC, which is already quite demanding for his asymmetric DSL.
      What about the other side of the equation? We’re left with a limit of about a thousand bytes per game tick to send ALL the data we need for that tick. In the bare NeREIDS skeleton I have here, an avatar state in memory already takes over 40 bytes, and this is way below what a final NeREIDS character would need. If I were to send every entity state, brute force, each time, the server could not operate on more than 20 entities…
      In conclusion, yes, I already need something other than brute force, and that’s why I’m spending so much energy finding ways around it 😉
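The figures in this reply can be checked with a few lines of arithmetic, assuming 1 KByte is taken as 1024 bytes. DSL offers are usually quoted in bits, so 1 MByte per second of upload is about 8 Mbit/s. Even at exactly 40 bytes per state and with no packet overhead, a 1 KByte packet holds at most 25 states. Larger states and headers push that toward the 20 quoted above.

```cpp
#include <cassert>

// Figures from the reply; 1 KByte is taken as 1024 bytes.
constexpr int kTicksPerSecond = 32;
constexpr int kMaxPacketBytes = 1024;
constexpr int kAvatarStateBytes = 40;  // lower bound: "already over 40 bytes"

// Upload a server needs when sending one full packet per tick to each client.
constexpr int uploadBytesPerSecond(int clients) {
    return clients * kTicksPerSecond * kMaxPacketBytes;
}

// Upper bound on full entity states per packet, ignoring any headers.
constexpr int maxEntitiesPerPacket(int packetBytes, int stateBytes) {
    return packetBytes / stateBytes;
}
```

With these definitions, uploadBytesPerSecond(1) is 32 KBytes, uploadBytesPerSecond(4) is 128 KBytes, and uploadBytesPerSecond(32) is 1 MByte per second.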
