Help! Unknown multi-threaded socket limit.

Submitted by TragicWarrior
on February 16, 2009 - 11:53pm

I am writing a multi-threaded application which services hundreds of remote connections for data transfer. Several instances of this program run simultaneously. The problem is that whenever the total number of active user connections (cumulative total of all open sockets tallied across all process instances) reaches about 700, I seem to hit some kind of hard limit. Initially, I thought I was running into a file descriptor limit but I've implemented setrlimit() with RLIMIT_NOFILE raising the number of max descriptors to 32K so I don't believe that is a problem.

Because of this problem I currently have to constrain the number of concurrent processes running on the system to 2, allowing no more than 256 connections each. In this configuration the server will run for days without failure until I stop it. If I try to add a third process or restart one of the daemons with a higher connection limit, bad things will start happening at about 700 open sockets. I believe that I am hitting some sort of hard Linux or GNU limitation, but don't have any idea what it might be. I have repeated this test more than a dozen times.

Other things I have tried:
I have tried reducing my SO_SNDBUF to 32k from 65k but that made no difference. I have tried reducing the stack size of each thread from 8M to 4M but that has not helped either.

Thanks in advance to anyone who can help.

Probably not a socket limit

Rhad (not verified)
on February 17, 2009 - 2:51am

What kind of 'hard limit' do you hit? Which call errors, and what's the error return or errno value from the call?

I handle 10,000 simultaneous TCP connections in a process that runs fine on CentOS 5 / Fedora Core 10. System limits for open file handles need to be increased (/etc/security/limits.conf) before the process can raise its open file limit beyond the system-configured limit.
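One way to check whether a setrlimit() request actually took effect is to read the limit back afterwards. The following is only a minimal sketch: the helper name raise_nofile_limit is hypothetical and not taken from either poster's code. It caps the request at the hard limit, which an unprivileged process cannot exceed, and returns the soft limit that is really in force.

```c
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not from the thread): try to raise the soft
 * RLIMIT_NOFILE to `wanted`, capped at the current hard limit.
 * Returns the soft limit in effect afterwards, or -1 on failure. */
long raise_nofile_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* An unprivileged process cannot go above the hard limit. */
    rlim_t target = wanted;
    if (rl.rlim_max != RLIM_INFINITY && target > rl.rlim_max)
        target = rl.rlim_max;

    /* Only ever raise the soft limit, never lower it. */
    if (rl.rlim_cur != RLIM_INFINITY && target > rl.rlim_cur) {
        rl.rlim_cur = target;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return -1;
        }
    }

    /* Read the value back rather than trusting the request. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)LONG_MAX)
        return LONG_MAX;
    return (long)rl.rlim_cur;
}
```

Logging the returned value at startup and comparing it with the requested 32K would show whether the hard limit configured for the account is silently capping the request.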

Note that if you're using select() on file descriptors numbered above 1023 (if I recall the number correctly) you'll overflow the fixed size fd_set structure. If you need to wait on file descriptors this high, you'll need to use poll() or epoll().
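To illustrate the point about select(), here is a rough sketch with two hypothetical helpers (neither name comes from the thread). The first checks whether a descriptor can be stored in an fd_set at all, using the FD_SETSIZE constant. The second waits on a single descriptor with poll(), which has no such ceiling on the descriptor number.

```c
#include <poll.h>
#include <sys/select.h>

/* Returns 1 if fd may safely be passed to FD_SET(), 0 otherwise.
 * Descriptors numbered FD_SETSIZE or higher overflow the fd_set. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Wait up to timeout_ms for fd to become readable using poll().
 * Returns 1 if readable (or hung up / in error, so a read will not
 * block), 0 on timeout, -1 on failure or an invalid descriptor. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;              /* 0 = timeout, -1 = error (see errno) */
    if (pfd.revents & POLLNVAL)
        return -1;              /* fd is not an open descriptor */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}
```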

The process I run uses only 5 threads total, but it sounds like you're using a thread per connection, so perhaps you're hitting a thread limit rather than a connection limit.
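If a thread limit is the suspect, the return value of pthread_create() is the first thing to check, since it reports failures such as EAGAIN directly instead of through errno. As a back-of-the-envelope figure, 700 threads with 4M stacks reserve about 2.8 GB of address space, which would be close to the user address space of a 32-bit process; whether the application is 32-bit is an assumption, as the thread does not say. The sketch below uses hypothetical names (spawn_worker, noop_worker) and starts a detached thread with an explicit stack size.

```c
#include <pthread.h>
#include <stddef.h>

/* Trivial placeholder thread body, for illustration only. */
static void *noop_worker(void *arg)
{
    return arg;
}

/* Hypothetical helper: start a detached worker thread with an explicit
 * stack size. Returns 0 on success, or the pthread error number
 * (for example EAGAIN when resources for another thread are exhausted,
 * or EINVAL when stack_bytes is too small). */
int spawn_worker(void *(*fn)(void *), void *arg, size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t tid;

    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}
```

Calling spawn_worker(noop_worker, NULL, 256 * 1024) in a loop and logging the first non-zero return would show roughly how many threads the process can create with a given stack size.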

What kind of 'hard limit' do

on February 17, 2009 - 2:49pm

What kind of 'hard limit' do you hit?
I'm not sure.

Which call errors, and what's the error return or errno value from the call?
I believe write() and accept() start failing. The write() call fails with EPIPE.
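Since it is not yet certain which call fails and why, it may help to save errno immediately after each failing write() or accept() and sort it into a few buckets. The sketch below is illustrative only: the enum and function names are hypothetical. It separates errors that point at the remote end (EPIPE, ECONNRESET) from those that point at a local resource limit (EMFILE, ENFILE, ENOBUFS, ENOMEM).

```c
#include <errno.h>

/* Hypothetical categories for a failed socket call. */
enum socket_failure {
    FAIL_PEER_CLOSED,       /* connection closed or reset */
    FAIL_PROCESS_FD_LIMIT,  /* per-process descriptor limit reached */
    FAIL_SYSTEM_FD_LIMIT,   /* system-wide open file limit reached */
    FAIL_KERNEL_MEMORY,     /* not enough memory or socket buffers */
    FAIL_OTHER
};

/* Map an errno value saved right after write() or accept() failed
 * to one of the categories above. */
enum socket_failure classify_socket_errno(int err)
{
    switch (err) {
    case EPIPE:             /* writing to a socket that was closed */
    case ECONNRESET:        /* peer reset the connection */
        return FAIL_PEER_CLOSED;
    case EMFILE:
        return FAIL_PROCESS_FD_LIMIT;
    case ENFILE:
        return FAIL_SYSTEM_FD_LIMIT;
    case ENOBUFS:
    case ENOMEM:
        return FAIL_KERNEL_MEMORY;
    default:
        return FAIL_OTHER;
    }
}
```

Note that EPIPE from write() usually means the connection is already closed rather than that a local limit was hit, so the errno from the failing accept() is likely to be the more informative of the two.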

Note that if you're using select() on file descriptors numbered above 1023...
I am indeed using poll() and not select().

Thanks for your thoughts on this. If you have any more please let me know. I'm going to look at the /etc/security/limits.conf but I think I already did and everything was commented out.
