Earlier this week I spent some time on the problem of scaling Socket.io in a Node.js cluster. The essence of the problem: when you run multiple Node app processes (workers) on one server, or across multiple servers, the cluster routes incoming socket.io client connections in a round-robin manner, so requests from a client that has handshaken and been authorized on one worker get handed to other workers that have never seen that handshake — and the mess begins. This happens when the socket.io sockets created by the workers use the default memory store and do not share transports with each other; in other words, they are not scale-ready.
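The failure mode boils down to each worker keeping its own in-memory table of handshaken sessions. A minimal sketch of that situation (plain Node; the `MemoryStore` below is a hypothetical stand-in for socket.io's per-process memory store, not its actual code):

```javascript
// Each worker holds its own in-memory map of handshaken session ids.
// Nothing is shared between workers, so a session authorized on
// worker A is unknown to worker B.
function MemoryStore() {
  this.sessions = {};
}
MemoryStore.prototype.handshake = function (id) {
  this.sessions[id] = true;
};
MemoryStore.prototype.isHandshaken = function (id) {
  return !!this.sessions[id];
};

// Simulate two cluster workers, each with an independent store.
var workerA = new MemoryStore();
var workerB = new MemoryStore();

// The client handshakes on worker A...
workerA.handshake('abc123');

// ...but round-robin routing sends the next request to worker B,
// which has never seen this session id.
console.log(workerA.isHandshaken('abc123')); // true
console.log(workerB.isHandshaken('abc123')); // false — and the mess begins
```

Any fix therefore has to move that session/transport state out of the worker's own memory, which is exactly what the Redis store below is meant to do.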
This is a known issue and StackOverflow has a few similar questions:
And more mentions elsewhere on the web:
http://www.quora.com/How-do-I-scale-socket-io-servers-2 – see top answer by Drew Harry.
Native Socket.io solution
Learnboost, the developers of Socket.io, suggest using the Redis store that is built into Socket.io:
io.set('store', new RedisStore());
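For reference, the fuller wiring in the 0.9.x API looks roughly like this (a sketch from the Socket.io docs of that era, assuming a running Redis server; `RedisStore` expects three separate redis clients):

```javascript
// Socket.io 0.9.x RedisStore configuration (requires a Redis server).
var io = require('socket.io').listen(server);
var RedisStore = require('socket.io/lib/stores/redis');
var redis = require('redis');

io.set('store', new RedisStore({
  redisPub: redis.createClient(),
  redisSub: redis.createClient(),
  redisClient: redis.createClient()
}));
```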
I have tested this approach and it does not work — and it seems I'm not the only one. That is why many of the sources above suggest different approaches and different architectures for scaling Socket.io. In my case, client connections would repeatedly try to re-handshake after a disconnect, and the socket.io server would not emit events to clients, because transports[id] would be null after the initial connect. I spent a few hours looking into these issues but do not have a definitive answer.
Drew Harry (see the Quora link above) suggests splitting the Node app into three separate pieces and having them talk to each other via a message queue or pub/sub:
- Application core. This does all the actual application logic, and holds the state of the system in its own memory, or relies on some datastore. These application cores can usually be easily scaled up by partitioning in some application-specific way.
- Socket.io layer. Clients connect directly to this, and it passes any messages from clients to the app core. Messages from the app core to clients are dispatched to the appropriate socket.io process which then sends the message on to the client.
- A load balancer. This could be nginx like in the examples elsewhere in this thread, or it could be a smarter app that can talk back and forth with the socket.io layers to measure their actual load and direct new connections appropriately.
I don't quite see how this approach solves the problem of running Socket.io on different workers; possibly his point is that managing the load of a Socket.io server is the solution, rather than scaling the Socket.io server itself.
Another company facing the same issue is Trello.com, which relies heavily on Socket.io. They describe exactly the same problem:
The Socket.io server currently has some problems with scaling up to more than 10K simultaneous client connections when using multiple processes and the Redis store, and the client has some issues that can cause it to open multiple connections to the same server, or not know that its connection has been severed. There are some issues with submitting our fixes (hacks!) back to the project – in many cases they only work with WebSockets (the only Socket.io transport we use). We are working to get those changes which are fit for general consumption ready to submit back to the project.
Other developers turn away from Socket.io entirely in favor of other libraries, such as SockJS. This comes from Ryan Smith, who posted this question on StackOverflow:
Sadly we turned away from Socket.io due to the issues we encountered with this project and switched to Sock.js (github.com/sockjs/sockjs-node) and have yet to look back. I haven’t seen the latest changes to Socket.io but I have heard that version 1.0 will include many fixes including the issue with the redis store. One thing to keep in mind if you consider Sockjs is that is a much lower level library than Socket.io, so if you need channels and groups you will have to build that out your self.
As for myself, I will need to revisit this issue later. For now, the main takeaway is to run Socket.io servers as a separate layer and not even try to scale them, scaling only the core application itself.