Hi,
When a process calls the poll or select, the kernel calls (struct
file_operations)->poll on every file descriptor and returns a mask of
events which are ready. If the process is only interested by POLLIN
events, the mask is still computed for POLLOUT and it can be expensive.
For example, on Unix datagram sockets, a process running poll() with
POLLIN will wakes-up when the remote end call read(). This is a
performance regression introduced when fixing another bug by
3c73419c09a5ef73d56472dbfdade9e311496e9b and
ec0d215f9420564fc8286dcf93d2d068bb53a07e.
The attached program illustrates the problem. It compares the
performance of sending/receiving data on an Unix datagram socket and
select(). When the datagram sockets are not connected, the performance
problem is not triggered, but when they are connected it becomes a lot
slower. On my computer, I have the following time:
Connected datagram sockets: >4 seconds
Non-connected datagram sockets: <1 second
The patch attached in the next email fixes the performance problem: it
becomes <1 second for both cases. I am not suggesting the patch for
inclusion; I would like to change the prototype of (struct
file_operations)->poll instead of adding ->poll2. But there is a lot of
poll functions to change (grep tells me 337 functions).
Any opinions?
Best regards,
Alban Crequy
---------------
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
int
main(int argc, char *argv[])
{
#define NB_CLIENTS 800
int client_fds[NB_CLIENTS];
int server_fd;
int max_fds = 0;
struct sockaddr_un addr_server;
struct sockaddr_un addr_client;
fd_set rfds;
struct timeval tv;
int retval;
int i;
char buffer[1024];
int pair[2] = {0,0};
time_t t1, t2;
int trigger;
if (argc != 2)
exit(1);
if (strcmp(argv[1], "connected") == 0)
{
printf("The ...---
fs/select.c | 27 ++++++++++++++++++++++++---
include/linux/fs.h | 1 +
include/linux/net.h | 3 +++
net/socket.c | 15 ++++++++++-----
net/unix/af_unix.c | 17 +++++++++++------
5 files changed, 49 insertions(+), 14 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 500a669..ff46afa 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -449,9 +449,21 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
if (file) {
f_op = file->f_op;
mask = DEFAULT_POLLMASK;
- if (f_op && f_op->poll) {
+ if (f_op && f_op->poll2) {
+ unsigned long filter_mask = 0;
+ filter_mask |= (in & bit)
+ ? POLLIN_SET : 0;
+ filter_mask |= (out & bit)
+ ? POLLOUT_SET : 0;
+ filter_mask |= (ex & bit)
+ ? POLLEX_SET : 0;
wait_key_set(wait, in, out, bit);
- mask = (*f_op->poll)(file, wait);
+ mask = (*f_op->poll2)(file,
+ wait, filter_mask);
+ } else if (f_op && f_op->poll) {
+ wait_key_set(wait, in, out, bit);
+ mask = (*f_op->poll)(file,
+ wait);
}
fput_light(file, fput_needed);
if ((mask & POLLIN_SET) && (in & bit)) {
@@ -738,7 +750,16 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
mask = POLLNVAL;
if (file != NULL) {
mask = DEFAULT_POLLMASK;
- if (file->f_op && file->f_op->poll) {
+ if (file->f_op && file->f_op->poll2) {
+ unsigned long filter_mask;
+ filter_mask = pollfd->events
+ | POLLERR | POLLHUP;
+ if (pwait)
+ pwait->key = pollfd->events |
+ POLLERR | POLLHUP;
+ mask = file->f_op->poll2(file, pwait,
+ filter_mask);
+ } else if (file->f_op && file->f_op->poll) {
if (pwait)
pwait->key = pollfd->events |
POLLERR | POLLHUP;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 509ca14..dd0eb7a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1493,6 +1493,7 @@ struct file_operations {
...My opinion would be to use epoll() for this kind of workload. Also, about unix_datagram_poll() being slow, it probably can be addressed separately. --
Yeah, epoll does check for event hints coming with the callback wakeup, and avoid waking up epoll_wait() waiters, for non matching events. Most of the devices we care about, have been modified to report the event mask with the wakeup call. - Davide
Alban test program is _very_ pathological :
All the time is consumed in do_select() because of false sharing between
two tasks.
We can probably rearrange variables in do_select() to make this false
sharing less problematic. I am taking a look at this.
Events: 3K cycles
+ 26.14% uclient [kernel.kallsyms] [k] do_raw_spin_lock
+ 21.11% uclient [kernel.kallsyms] [k] do_select
+ 13.38% uclient [kernel.kallsyms] [k] pollwake
+ 9.22% uclient [kernel.kallsyms] [k] unix_dgram_poll
+ 5.24% uclient [kernel.kallsyms] [k] unix_peer_get
+ 3.04% uclient [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 3.03% uclient [kernel.kallsyms] [k] task_rq_lock
+ 2.85% uclient [kernel.kallsyms] [k] do_raw_spin_unlock
+ 1.84% uclient [kernel.kallsyms] [k] try_to_wake_up
+ 1.55% uclient [kernel.kallsyms] [k] fget_light
+ 1.34% uclient [kernel.kallsyms] [k] core_kernel_text
annotate :
5.66 : 410fb342: 85 ff test %edi,%edi
0.00 : 410fb344: 74 1f je 410fb365 <do_select+0x3d5>
0.13 : 410fb346: 85 b5 6c fd ff ff test %esi,-0x294(%ebp)
0.00 : 410fb34c: 74 17 je 410fb365 <do_select+0x3d5>
: res_out |= bit;
0.00 : 410fb34e: 09 b5 5c fd ff ff or %esi,-0x2a4(%ebp)
: retval++;
0.00 : 410fb354: 83 85 64 fd ff ff 01 addl $0x1,-0x29c(%ebp)
: wait = NULL;
0.00 : 410fb35b: c7 85 ...Also, why not using the existing wait->key instead of adding a poll2()? - Davide
Following patch solves the problem for me, and its only a start, I am pretty sure we can optimize more than that. What I did is to move "struct poll_wqueues table;" out of do_select() in its caller. This structure is highly modified by an other cpu in Alban workload, and this slow down do_select() because many accesses to some of its local variables (normally, private ones !) hit a false sharing. Before : # time /root/uclient connected The performance problem will be triggered select: begin select: end: 3 seconds real 0m3.101s user 0m0.000s sys 0m6.104s After : # time /root/uclient connected The performance problem will be triggered select: begin select: end: 1 seconds real 0m1.041s user 0m0.004s sys 0m2.040s Impressive no ? Reported-by: Alban Crequy <alban.crequy@collabora.co.uk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- fs/compat.c | 3 ++- fs/select.c | 19 ++++++++++--------- include/linux/poll.h | 3 ++- 3 files changed, 14 insertions(+), 11 deletions(-) diff --git a/fs/compat.c b/fs/compat.c index ff66c0d..3c3c3d3 100644 --- a/fs/compat.c +++ b/fs/compat.c @@ -1704,6 +1704,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp, int size, max_fds, ret = -EINVAL; struct fdtable *fdt; long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; + struct poll_wqueues table; if (n < 0) goto out_nofds; @@ -1744,7 +1745,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp, zero_fd_set(n, fds.res_out); zero_fd_set(n, fds.res_ex); - ret = do_select(n, &fds, end_time); + ret = do_select(n, &fds, end_time, &table); if (ret < 0) goto out; diff --git a/fs/select.c b/fs/select.c index b7b10aa..ed054f5 100644 --- a/fs/select.c +++ b/fs/select.c @@ -393,10 +393,10 @@ static inline void wait_key_set(poll_table *wait, unsigned long in, } } -int do_select(int n, fd_set_bits *fds, struct timespec *end_time) +int do_select(int n, fd_set_bits *fds, struct ...
Sure. Its only a start as I said. We should identify the shared portion and make sure it sits in a dedicated cache line. --
Indeed, if wait is not null, we have in wait->key the interest of
poller. If a particular poll() function is expensive, it can test these
bits.
Thanks !
Note: I chose the 'goto skip_write' to make this patch really obvious.
[PATCH] af_unix: optimize unix_dgram_poll()
unix_dgram_poll() is pretty expensive to check POLLOUT status, because
it has to lock the socket to get its peer, take a reference on the peer
to check its receive queue status, and queue another poll_wait on
peer_wait. This all can be avoided if the process calling
unix_dgram_poll() is not interested in POLLOUT status. It makes
unix_dgram_recvmsg() faster by not queueing irrelevant pollers in
peer_wait.
On a test program provided by Alan Crequy :
Before:
real 0m0.211s
user 0m0.000s
sys 0m0.208s
After:
real 0m0.044s
user 0m0.000s
sys 0m0.040s
Suggested-by: Davide Libenzi <davidel@xmailserver.org>
Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/unix/af_unix.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3c95304..dcb84fe 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2090,6 +2090,9 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
}
+ if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
+ goto skip_write;
+
/* writable? */
writable = unix_writable(sk);
if (writable) {
@@ -2111,6 +2114,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
else
set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+skip_write:
return mask;
}
--
Plain agreement on th patch, and I understand the indent overflow concerns, but why not ... /* * No write status requested, avoid expensive OUT tests. */ if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT))) return mask The write-test code is the last one we do anyway. - Davide
Sorry to intrude out of the blue without really understanding the kernel side of most of the code in question, but if there's a performance regression for applications using poll() shouldn't we address that so we get back to the prior performance level rather than requireing all userspace apps to switch to epoll() ?? -- Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/ Plain text mails only, please http://www.expita.com/nomime.html Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
From: Jesper Juhl <jj@chaosbits.net> For such a pathological program like Alban's test case, I say absolutely not. --
Le Fri, 29 Oct 2010 21:27:11 +0200,
I found a problem with epoll() with the following program. When there
is several datagram sockets connected to the same server and the
receiving queue is full, epoll(EPOLLOUT) wakes up only the emitter who
has its skb removed from the queue, and not all the emitters. It is
because sock_wfree() runs sk->sk_write_space() only for one emitter.
poll/select do not have this problem.
-----------------------
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <sys/epoll.h>
int
main(int argc, char *argv[])
{
// cat /proc/sys/net/unix/max_dgram_qlen
#define NB_CLIENTS (10 + 2)
int client_fds[NB_CLIENTS];
int server_fd;
struct sockaddr_un addr_server;
int epollfd = epoll_create(10);
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int i;
char buffer[1024];
int trigger = atoi(argv[1]);
printf("trigger = %d\n", trigger);
memset(&addr_server, 0, sizeof(addr_server));
addr_server.sun_family = AF_UNIX;
addr_server.sun_path[0] = '\0';
strcpy(addr_server.sun_path + 1, "dgram_perfs_server");
server_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
bind(server_fd, (struct sockaddr*)&addr_server, sizeof(addr_server));
for (i = 0 ; i < NB_CLIENTS ; i++)
{
client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
}
ev.events = EPOLLOUT;
ev.data.fd = client_fds[NB_CLIENTS-1];
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[NB_CLIENTS-1], &ev) == -1) {
perror("epoll_ctl: client_fds max");
exit(EXIT_FAILURE);
}
if (trigger == 0)
{
ev.events = EPOLLOUT;
ev.data.fd = client_fds[0];
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[0], &ev) == -1) {
perror("epoll_ctl: client_fds 0");
exit(EXIT_FAILURE);
}
}
for (i = 0 ; i < NB_CLIENTS ; i++)
{
...I dont think this is the reason.
sock_wfree() really is good here, since it copes with one socket (the
one that sent the message)
Problem is the peer_wait, that epoll doesnt seem to be plugged into.
Bug is in unix_dgram_poll()
It calls sock_poll_wait( ... &unix_sk(other)->peer_wait,) only if socket
is 'writable'. Its a clear bug
Try this patch please ?
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0ebc777..315716c 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2092,7 +2092,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
/* writable? */
writable = unix_writable(sk);
- if (writable) {
+ if (1 /*writable*/) {
other = unix_peer_get(sk);
if (other) {
if (unix_peer(other) != sk) {
--
Also you'll need to change your program to make the epoll registrations
_after_ sockets connect(), or else you can see that epoll() wont know
about the other peer stuff.
for (i = 0 ; i < NB_CLIENTS ; i++) {
client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
}
for (i = 0 ; i < NB_CLIENTS ; i++) {
connect(client_fds[i], (struct sockaddr*)&addr_server, sizeof(addr_server));
}
ev.events = EPOLLOUT;
ev.data.fd = client_fds[NB_CLIENTS-1];
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[NB_CLIENTS-1], &ev) == -1) {
perror("epoll_ctl: client_fds max");
exit(EXIT_FAILURE);
}
if (trigger == 0) {
ev.events = EPOLLOUT;
ev.data.fd = client_fds[0];
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[0], &ev) == -1) {
perror("epoll_ctl: client_fds 0");
exit(EXIT_FAILURE);
}
}
--
Whoops, I must have missed AF_UNIX :) Looks good to me. - Davide --
From: Davide Libenzi <davidel@xmailserver.org> Applied. --
Le Sat, 30 Oct 2010 08:44:44 +0200, Pauli Nieminen told me about his performance problem in select() so I wrote the test program but I don't know what exactly is the real life Your 2 patches are good for me. In my opinion the improved performances are good enough with your 2 patches, so no need to add more complexity unless we discover new problems. I am preparing patches to implement multicast features on Unix datagram+seqpacket sockets and my patches could potentially make things worse in unix_dgram_poll() because it would need to check the receiving queues of all multicast members. So I want unix_dgram_poll() to be fast in the first place before proposing other changes for multicast. --
Le Wed, 24 Nov 2010 01:27:56 +0200, But are they SOCK_STREAM or SOCK_DGRAM sockets? The patches fix performances with SOCK_DGRAM. If the xserver scenario is with SOCK_STREAM sockets, your problem is probably still unfixed. --
It should not matter ? commit 67426b756c4d52c51 (af_unix: use keyed wakeups ) makes unix_write_space() call wake_up_interruptible_sync_poll() instead of wake_up_interruptible_sync(). So it should be fixed for both STREAM/DGRAM sockets ? --
