8

The application we have in mind is a relay server. The aim is to accept large numbers of incoming socket connections (at least thousands) that will stay open for lengthy periods (hours or maybe days). They will exchange modest quantities of data and they need reasonably low latency.

The technical design is straightforward and we have a test implementation. Our preference is to use Windows hosting and .NET, because that's the technology we know. However, this kind of usage is well outside what we are familiar with.

The question is whether there are specific limits or constraints to be aware of that are inherent in or common to software that does this, and that we should allow for in our design and/or test for before a roll-out.

I found this question (Handling large amounts of sockets) and this link (http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1), which tend to suggest that our solution should work.

--

Commenters have suggested opening and closing ports, or using some kind of protocol, without suggesting how. The problem is that at any time a message might be relayed to an individual known destination with the expectation that it be received quickly (say 1 second at most, preferably sooner). The destination is (in general) behind a firewall with unknown properties, so cannot (in general) run a server or accept incoming connections. It can only make outgoing connections. We think it needs a persistent outgoing connection in order to receive packets at any time without notice. Alternative suggestions would be of interest, although strictly off topic for this question.
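To make the "persistent outgoing connection" idea concrete, here is a minimal sketch (in Python rather than .NET, purely for illustration; the host, port, and reconnect policy are placeholders, not part of our actual design). The client dials out through the firewall, sits on the connection waiting for relayed packets, and reconnects with exponential backoff if the connection drops:

```python
import socket
import time

def backoff_delays(base=1.0, cap=30.0):
    """Yield exponential reconnect delays: base, 2*base, ... capped at `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def run_client(host, port, handle_message):
    """Keep one persistent outgoing connection open; reconnect if it drops."""
    delays = backoff_delays()
    while True:
        try:
            with socket.create_connection((host, port)) as sock:
                delays = backoff_delays()   # connected OK: reset the backoff
                while True:
                    data = sock.recv(4096)
                    if not data:            # server closed the connection
                        break
                    handle_message(data)
        except OSError:
            pass                            # network error: fall through and retry
        time.sleep(next(delays))
```

The backoff matters at our intended scale: if thousands of clients lose a server at once and all retry immediately, the reconnect storm itself becomes a load problem.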

Other commenters have suggested there are OS limits, but not specified any. Assuming that this is some version of Windows server and some (perhaps large) amount of memory, are the limits likely to be a problem? Roughly what will they be? Is Windows a bad idea?

zelusp
  • 103
david.pfx
  • 8,187

5 Answers

6

I think you're looking for a protocol: something that can handle errors, retransmissions, etc. For example, what happens if one of the sockets is dropped because the underlying network had a problem? Or if your messages are received twice because of a faulty switch along the line? Or if they arrive in the wrong order?

Given that you're planning for a large number of connections, you'll also have to consider the increase in probability that something will malfunction.

Also, what happens if you want to scale your architecture horizontally, i.e. adding more servers? You can't load-balance open sockets and seamlessly transfer them across nodes.

In the end I'd recommend using a more robust message-passing protocol, designed with failure in mind. In simpler terms, consider each communication atomic, where the worst case is a new connection every time. More or less along the lines of the C10K problem.

If you still need more convincing, test your architecture with a mock-up: see how it reacts to ten thousand clients connecting at the same time (it's easier in a LAN). Then imagine adding network latency, errors, etc.

lorenzog
  • 419
4

Computers have a hard limit on how many connections can be open at a time (decided by the OS). Programs see a subset of that.

Each open socket requires some resources and a network-timeout heartbeat so disconnections can be detected. Having a lot of sockets sending those heartbeats will start to eat into your bandwidth.

My suggestion is to close the connections as needed and just accept that you will need to reopen them.
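Heartbeat-based disconnection detection boils down to tracking a last-seen timestamp per connection and periodically reaping the silent ones. A rough Python illustration (the interval numbers are made up; tune them to your latency budget):

```python
import time

HEARTBEAT_INTERVAL = 30.0                  # assumed client ping period, seconds
DEAD_AFTER = 2.5 * HEARTBEAT_INTERVAL      # miss ~2 pings -> presume dead

class HeartbeatTracker:
    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}

    def beat(self, conn_id, now=None):
        """Record a heartbeat (or any traffic) from this connection."""
        self.last_seen[conn_id] = time.monotonic() if now is None else now

    def reap(self, now=None):
        """Return the connections that have gone silent and forget them."""
        now = time.monotonic() if now is None else now
        dead = [c for c, t in self.last_seen.items() if now - t > self.dead_after]
        for c in dead:
            del self.last_seen[c]
        return dead
```

The bandwidth cost scales linearly: at a 30-second ping interval, 10,000 idle connections mean ~333 heartbeat packets per second just to keep the sockets provably alive.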

ratchet freak
  • 25,986
4

I worked on a relay server for stock market data in C# on a Windows Server. There was no way I could get thousands of simultaneous connections relayed by one machine. The specs for the relay were very simple: one connection to the stock market data provider, and unlimited outbound connections to Silverlight clients.

There are two basic approaches that I investigated.

  • Use a thread pool, each client gets a socket and worker thread.
  • Use a worker thread, worker thread pushes data by iterating over all open sockets.

Neither approach could exceed the performance limits of the CPU, and each approach had serious limitations and restrictions.

Using a thread pool.

Windows sucks at multi-thread handling. Once I hit around 250 threads, things just started to go downhill. It isn't a memory or system resource problem; it's a quantity problem. While Windows has no problem managing 250 threads, it's another story asking it to keep those 250 threads busy relaying data. As performance lags, a data backlog starts to build.

Using a worker thread.

You can't use a worker thread to iterate over sockets if those sockets are blocking. Each time the thread hits a socket that has to time out, all the other sockets are left waiting. If you switch to async socket operations, a huge backlog of callbacks is generated too quickly and everything breaks.
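The callback-backlog failure mode is really a missing-backpressure problem. For contrast, here is a minimal sketch (in Python's asyncio rather than the C# I used) of a single-worker fan-out with bounded per-client queues and an explicit drop policy, so a slow client can't grow an unbounded backlog:

```python
import asyncio

async def relay(source_queue, client_queues):
    """Fan one upstream feed out to many clients from a single task.
    Slow clients get messages dropped rather than building an unbounded
    backlog (a deliberate policy; block instead if loss is unacceptable)."""
    while True:
        msg = await source_queue.get()
        if msg is None:                 # sentinel: shut down
            return
        for q in client_queues:
            try:
                q.put_nowait(msg)
            except asyncio.QueueFull:
                pass                    # this client is too slow: drop for it only

async def demo():
    source = asyncio.Queue()
    fast = asyncio.Queue(maxsize=10)
    slow = asyncio.Queue(maxsize=1)     # simulates a client that never drains
    for i in range(5):
        source.put_nowait(i)
    source.put_nowait(None)
    await relay(source, [fast, slow])
    return fast.qsize(), slow.qsize()
```

Note the irony for my use case: bounding the backlog means choosing what to do when a client falls behind, and "drop" was exactly what the stock data spec forbade.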

For me the results were.

At 100 clients everything was stable. At 250 clients everything still worked, but the limit had been reached. 1000 clients was never achieved.

Conclusion.

C# on Windows is not the right tool for a socket relay server. Not for client connections ranging in the thousands.

The only alternative is to drop the per-client TCP sockets and switch to a broadcast-style protocol like UDP. For me this was not an option, as no data was allowed to be dropped; most broadcast protocols assume packet loss is acceptable.

Finally.

If you are able to create a C# relay that can handle thousands of clients, please come back and let me know how you did it.

Reactgular
  • 13,120
  • 4
  • 50
  • 81
1

Experience suggests that there is a better and probably fairly different solution for your problem.

However, it is possible to create and maintain substantial numbers of socket connections over a long period of time, with less than 100% reliability.

A robust design would be done very differently.

The problem is what happens when your relay server crashes. If you only have one server, your service is 100% lost. If you have multiple servers, the clients can reconnect and be connected to a different server, but any messages relayed to them between the disconnect and the reconnect will have been lost. Depending on your product, this may be important.

If your 'relay' were actually implemented as a distributed message queue service, it would provide a tool set that solves most of these issues for you, and a mechanism for implementing the level of robustness and availability that suits your application.

This is not a recommendation, but RabbitMQ is one example of a product that may perform the relay function for you. There are others, and you will need to review a selection to decide which is most appropriate for your product.
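The key behaviour a message queue buys you here is store-and-forward: messages addressed to a disconnected client are held and delivered when it reconnects, instead of being lost. A toy in-memory Python sketch of that behaviour (not how RabbitMQ works internally, and without the durability a real broker adds):

```python
from collections import defaultdict, deque

class StoreAndForward:
    """Toy stand-in for a durable message queue: messages for an
    offline client are held and handed over when it reconnects."""
    def __init__(self):
        self.pending = defaultdict(deque)
        self.online = {}

    def connect(self, client_id, deliver):
        """Client comes online; flush anything queued while it was away."""
        self.online[client_id] = deliver
        while self.pending[client_id]:
            deliver(self.pending[client_id].popleft())

    def disconnect(self, client_id):
        self.online.pop(client_id, None)

    def send(self, client_id, msg):
        """Deliver immediately if the client is connected, else queue."""
        if client_id in self.online:
            self.online[client_id](msg)
        else:
            self.pending[client_id].append(msg)
```

A real broker additionally persists the pending queues to disk, so messages survive a crash of the relay node itself, which is precisely the failure case described above.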

Michael Shaw
  • 10,114
1

Having provided firewall and VPN support for a couple of years now, I can tell you with confidence that applications which keep ports open for extended periods are not stable, from either the local server's perspective or the network point of entry's. They can also be a security risk (if that is a consideration in your case).

Leaving a persistent connection open on a server port can cause problems at the firewall as the NAT table accumulates entries, for example.

I personally recommend keeping your connected client numbers in the 200 to 250 range per server for persistent connections via TCP/IP where the data being sent and received requires minimal latency.

If you have the hardware to move to ATM (Asynchronous Transfer Mode), you can reconfigure your clients to listen only to their specific address, which will take a significant burden off of your servers (at the expense of the clients). ATM should get you into the 300 to 350 active clients per server capacity.

To clarify, the solution to this issue is to put your server in a DMZ on the firewall to enhance access and use ATM connectivity with as many clients as possible. Otherwise you are at the mercy of the remote network firewall and other intermediate intelligent routers.

miniscule
  • 111