Re: [PATCH] af_unix: unix_write_space() use keyed wakeups

Previous thread: [PATCH] fs/dcache: fix kernel-doc warnings by Randy Dunlap on Friday, October 29, 2010 - 10:54 am. (1 message)

Next thread: [PATCH 6/7] kgdbts: prevent re-entry to kgdbts before it unregisters by Jason Wessel on Friday, October 29, 2010 - 11:43 am. (1 message)
From: Alban Crequy
Date: Friday, October 29, 2010 - 11:18 am

Hi,

When a process calls the poll or select, the kernel calls (struct
file_operations)->poll on every file descriptor and returns a mask of
events which are ready. If the process is only interested by POLLIN
events, the mask is still computed for POLLOUT and it can be expensive.
For example, on Unix datagram sockets, a process running poll() with
POLLIN will wakes-up when the remote end call read(). This is a
performance regression introduced when fixing another bug by
3c73419c09a5ef73d56472dbfdade9e311496e9b and
ec0d215f9420564fc8286dcf93d2d068bb53a07e.

The attached program illustrates the problem. It compares the
performance of sending/receiving data on an Unix datagram socket and
select(). When the datagram sockets are not connected, the performance
problem is not triggered, but when they are connected it becomes a lot
slower. On my computer, I have the following time:

Connected datagram sockets: >4 seconds
Non-connected datagram sockets: <1 second

The patch attached in the next email fixes the performance problem: it
becomes <1 second for both cases. I am not suggesting the patch for
inclusion; I would like to change the prototype of (struct
file_operations)->poll instead of adding ->poll2. But there is a lot of
poll functions to change (grep tells me 337 functions).

Any opinions?

Best regards,
Alban Crequy


---------------
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>

int
main(int argc, char *argv[])
{
#define NB_CLIENTS 800
  int client_fds[NB_CLIENTS];
  int server_fd;
  int max_fds = 0;
  struct sockaddr_un addr_server;
  struct sockaddr_un addr_client;
  fd_set rfds;
  struct timeval tv;
  int retval;
  int i;
  char buffer[1024];
  int pair[2] = {0,0};
  time_t t1, t2;
  int trigger;

  if (argc != 2)
    exit(1);

  if (strcmp(argv[1], "connected") == 0)
    {
      printf("The ...
From: Alban Crequy
Date: Friday, October 29, 2010 - 11:21 am

---
 fs/select.c         |   27 ++++++++++++++++++++++++---
 include/linux/fs.h  |    1 +
 include/linux/net.h |    3 +++
 net/socket.c        |   15 ++++++++++-----
 net/unix/af_unix.c  |   17 +++++++++++------
 5 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 500a669..ff46afa 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -449,9 +449,21 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 				if (file) {
 					f_op = file->f_op;
 					mask = DEFAULT_POLLMASK;
-					if (f_op && f_op->poll) {
+					if (f_op && f_op->poll2) {
+						unsigned long filter_mask = 0;
+						filter_mask |= (in & bit)
+							? POLLIN_SET : 0;
+						filter_mask |= (out & bit)
+							? POLLOUT_SET : 0;
+						filter_mask |= (ex & bit)
+							? POLLEX_SET : 0;
 						wait_key_set(wait, in, out, bit);
-						mask = (*f_op->poll)(file, wait);
+						mask = (*f_op->poll2)(file,
+							wait, filter_mask);
+					} else if (f_op && f_op->poll) {
+						wait_key_set(wait, in, out, bit);
+						mask = (*f_op->poll)(file,
+							wait);
 					}
 					fput_light(file, fput_needed);
 					if ((mask & POLLIN_SET) && (in & bit)) {
@@ -738,7 +750,16 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 		mask = POLLNVAL;
 		if (file != NULL) {
 			mask = DEFAULT_POLLMASK;
-			if (file->f_op && file->f_op->poll) {
+			if (file->f_op && file->f_op->poll2) {
+				unsigned long filter_mask;
+				filter_mask = pollfd->events
+						| POLLERR | POLLHUP;
+				if (pwait)
+					pwait->key = pollfd->events |
+							POLLERR | POLLHUP;
+				mask = file->f_op->poll2(file, pwait,
+							filter_mask);
+			} else if (file->f_op && file->f_op->poll) {
 				if (pwait)
 					pwait->key = pollfd->events |
 							POLLERR | POLLHUP;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 509ca14..dd0eb7a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1493,6 +1493,7 @@ struct file_operations {
 ...
From: Eric Dumazet
Date: Friday, October 29, 2010 - 12:27 pm

My opinion would be to use epoll() for this kind of workload.

Also, about unix_datagram_poll() being slow, it probably can be
addressed separately.



--

From: Davide Libenzi
Date: Friday, October 29, 2010 - 1:08 pm

Yeah, epoll does check for event hints coming with the callback wakeup, 
and avoid waking up epoll_wait() waiters, for non matching events.
Most of the devices we care about, have been modified to report the event 
mask with the wakeup call.


- Davide

From: Eric Dumazet
Date: Friday, October 29, 2010 - 1:20 pm

Alban test program is _very_ pathological :

All the time is consumed in do_select() because of false sharing between
two tasks.

We can probably rearrange variables in do_select() to make this false
sharing less problematic. I am taking a look at this.

Events: 3K cycles
+     26.14%  uclient  [kernel.kallsyms]  [k] do_raw_spin_lock              
+     21.11%  uclient  [kernel.kallsyms]  [k] do_select                     
+     13.38%  uclient  [kernel.kallsyms]  [k] pollwake                      
+      9.22%  uclient  [kernel.kallsyms]  [k] unix_dgram_poll               
+      5.24%  uclient  [kernel.kallsyms]  [k] unix_peer_get                 
+      3.04%  uclient  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore   
+      3.03%  uclient  [kernel.kallsyms]  [k] task_rq_lock                  
+      2.85%  uclient  [kernel.kallsyms]  [k] do_raw_spin_unlock            
+      1.84%  uclient  [kernel.kallsyms]  [k] try_to_wake_up                
+      1.55%  uclient  [kernel.kallsyms]  [k] fget_light                    
+      1.34%  uclient  [kernel.kallsyms]  [k] core_kernel_text              


annotate :

   5.66 :        410fb342:       85 ff                   test   %edi,%edi                  
    0.00 :        410fb344:       74 1f                   je     410fb365 <do_select+0x3d5> 
    0.13 :        410fb346:       85 b5 6c fd ff ff       test   %esi,-0x294(%ebp)          
    0.00 :        410fb34c:       74 17                   je     410fb365 <do_select+0x3d5> 
         :                                                        res_out |= bit;           
    0.00 :        410fb34e:       09 b5 5c fd ff ff       or     %esi,-0x2a4(%ebp)          
         :                                                        retval++;                 
    0.00 :        410fb354:       83 85 64 fd ff ff 01    addl   $0x1,-0x29c(%ebp)          
         :                                                        wait = NULL;              
    0.00 :        410fb35b:       c7 85 ...
From: Davide Libenzi
Date: Friday, October 29, 2010 - 1:46 pm

Also, why not using the existing wait->key instead of adding a poll2()?


- Davide
From: Eric Dumazet
Date: Friday, October 29, 2010 - 2:05 pm

Following patch solves the problem for me, and its only a start, I am
pretty sure we can optimize more than that.

What I did is to move "struct poll_wqueues table;" out of do_select()
in its caller. This structure is highly modified by an other cpu in
Alban workload, and this slow down do_select() because many accesses to
some of its local variables (normally, private ones !) hit a false
sharing.

Before :

# time /root/uclient connected
The performance problem will be triggered
select: begin
select: end: 3 seconds

real	0m3.101s
user	0m0.000s
sys	0m6.104s

After :

# time /root/uclient connected
The performance problem will be triggered
select: begin
select: end: 1 seconds

real	0m1.041s
user	0m0.004s
sys	0m2.040s


Impressive no ?

Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/compat.c          |    3 ++-
 fs/select.c          |   19 ++++++++++---------
 include/linux/poll.h |    3 ++-
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index ff66c0d..3c3c3d3 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1704,6 +1704,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
 	int size, max_fds, ret = -EINVAL;
 	struct fdtable *fdt;
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	struct poll_wqueues table;
 
 	if (n < 0)
 		goto out_nofds;
@@ -1744,7 +1745,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
 	zero_fd_set(n, fds.res_out);
 	zero_fd_set(n, fds.res_ex);
 
-	ret = do_select(n, &fds, end_time);
+	ret = do_select(n, &fds, end_time, &table);
 
 	if (ret < 0)
 		goto out;
diff --git a/fs/select.c b/fs/select.c
index b7b10aa..ed054f5 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -393,10 +393,10 @@ static inline void wait_key_set(poll_table *wait, unsigned long in,
 	}
 }
 
-int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
+int do_select(int n, fd_set_bits *fds, struct ...
From: Eric Dumazet
Date: Friday, October 29, 2010 - 3:08 pm

Sure. Its only a start as I said. We should identify the shared portion
and make sure it sits in a dedicated cache line.



--

From: Eric Dumazet
Date: Saturday, October 30, 2010 - 2:53 am

Indeed, if wait is not null, we have in wait->key the interest of
poller. If a particular poll() function is expensive, it can test these
bits.

Thanks !

Note: I chose the 'goto skip_write' to make this patch really obvious.

[PATCH] af_unix: optimize unix_dgram_poll()

unix_dgram_poll() is pretty expensive to check POLLOUT status, because
it has to lock the socket to get its peer, take a reference on the peer
to check its receive queue status, and queue another poll_wait on
peer_wait. This all can be avoided if the process calling
unix_dgram_poll() is not interested in POLLOUT status. It makes
unix_dgram_recvmsg() faster by not queueing irrelevant pollers in
peer_wait.

On a test program provided by Alan Crequy :

Before:

real    0m0.211s
user    0m0.000s
sys     0m0.208s

After:

real	0m0.044s
user	0m0.000s
sys	0m0.040s

Suggested-by: Davide Libenzi <davidel@xmailserver.org>
Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/unix/af_unix.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3c95304..dcb84fe 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2090,6 +2090,9 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 			return mask;
 	}
 
+	if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
+		goto skip_write;
+
 	/* writable? */
 	writable = unix_writable(sk);
 	if (writable) {
@@ -2111,6 +2114,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	else
 		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
+skip_write:
 	return mask;
 }
 



--

From: Davide Libenzi
Date: Saturday, October 30, 2010 - 10:45 am

Plain agreement on th patch, and I understand the indent overflow 
concerns, but why not ...

	/*
	 * No write status requested, avoid expensive OUT tests.
	 */
	if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
		return mask

The write-test code is the last one we do anyway.


- Davide

From: Jesper Juhl
Date: Friday, October 29, 2010 - 1:20 pm

Sorry to intrude out of the blue without really understanding the kernel 
side of most of the code in question, but if there's a performance 
regression for applications using poll() shouldn't we address that so we 
get back to the prior performance level rather than requireing all 
userspace apps to switch to epoll() ??

-- 
Jesper Juhl <jj@chaosbits.net>             http://www.chaosbits.net/
Plain text mails only, please      http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
From: David Miller
Date: Friday, October 29, 2010 - 1:40 pm

From: Jesper Juhl <jj@chaosbits.net>

For such a pathological program like Alban's test case, I say
absolutely not.
--

From: Alban Crequy
Date: Saturday, October 30, 2010 - 4:34 am

Le Fri, 29 Oct 2010 21:27:11 +0200,

I found a problem with epoll() with the following program. When there
is several datagram sockets connected to the same server and the
receiving queue is full, epoll(EPOLLOUT) wakes up only the emitter who
has its skb removed from the queue, and not all the emitters. It is
because sock_wfree() runs sk->sk_write_space() only for one emitter.

poll/select do not have this problem.

-----------------------

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <sys/epoll.h>

int
main(int argc, char *argv[])
{

// cat /proc/sys/net/unix/max_dgram_qlen
#define NB_CLIENTS (10 + 2)

  int client_fds[NB_CLIENTS];
  int server_fd;
  struct sockaddr_un addr_server;
  int epollfd = epoll_create(10);
#define MAX_EVENTS 10
  struct epoll_event ev, events[MAX_EVENTS];
  int i;
  char buffer[1024];

  int trigger = atoi(argv[1]);
  printf("trigger = %d\n", trigger);

  memset(&addr_server, 0, sizeof(addr_server));
  addr_server.sun_family = AF_UNIX;
  addr_server.sun_path[0] = '\0';
  strcpy(addr_server.sun_path + 1, "dgram_perfs_server");

  server_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
  bind(server_fd, (struct sockaddr*)&addr_server, sizeof(addr_server));

  for (i = 0 ; i < NB_CLIENTS ; i++)
    {
      client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
    }

  ev.events = EPOLLOUT;
  ev.data.fd = client_fds[NB_CLIENTS-1];
  if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[NB_CLIENTS-1], &ev) == -1) {
      perror("epoll_ctl: client_fds max");
      exit(EXIT_FAILURE);
  }
  if (trigger == 0)
    {
      ev.events = EPOLLOUT;
      ev.data.fd = client_fds[0];
      if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[0], &ev) == -1) {
        perror("epoll_ctl: client_fds 0");
        exit(EXIT_FAILURE);
      }
    }

  for (i = 0 ; i < NB_CLIENTS ; i++)
    {
      ...
From: Eric Dumazet
Date: Saturday, October 30, 2010 - 5:53 am

I dont think this is the reason.

sock_wfree() really is good here, since it copes with one socket (the
one that sent the message)

Problem is the peer_wait, that epoll doesnt seem to be plugged into.

Bug is in unix_dgram_poll()

It calls sock_poll_wait( ... &unix_sk(other)->peer_wait,) only if socket
is 'writable'. Its a clear bug

Try this patch please ?

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0ebc777..315716c 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2092,7 +2092,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 
 	/* writable? */
 	writable = unix_writable(sk);
-	if (writable) {
+	if (1 /*writable*/) {
 		other = unix_peer_get(sk);
 		if (other) {
 			if (unix_peer(other) != sk) {




--

From: Eric Dumazet
Date: Saturday, October 30, 2010 - 6:17 am

Also you'll need to change your program to make the epoll registrations
_after_ sockets connect(), or else you can see that epoll() wont know
about the other peer stuff.

for (i = 0 ; i < NB_CLIENTS ; i++) {
	client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
}


for (i = 0 ; i < NB_CLIENTS ; i++) {
	connect(client_fds[i], (struct sockaddr*)&addr_server, sizeof(addr_server));
}

ev.events = EPOLLOUT;
ev.data.fd = client_fds[NB_CLIENTS-1];
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[NB_CLIENTS-1], &ev) == -1) {
      perror("epoll_ctl: client_fds max");
      exit(EXIT_FAILURE);
}
if (trigger == 0) {
      ev.events = EPOLLOUT;
      ev.data.fd = client_fds[0];
      if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[0], &ev) == -1) {
        perror("epoll_ctl: client_fds 0");
        exit(EXIT_FAILURE);
      }
}



--

From: Davide Libenzi
Date: Saturday, October 30, 2010 - 8:03 am

Whoops, I must have missed AF_UNIX :)
Looks good to me.


- Davide


--

From: David Miller
Date: Monday, November 8, 2010 - 2:44 pm

From: Davide Libenzi <davidel@xmailserver.org>

Applied.
--

From: Alban Crequy
Date: Saturday, October 30, 2010 - 2:36 pm

Le Sat, 30 Oct 2010 08:44:44 +0200,

Pauli Nieminen told me about his performance problem in select() so I
wrote the test program but I don't know what exactly is the real life

Your 2 patches are good for me. In my opinion the improved performances
are good enough with your 2 patches, so no need to add more complexity
unless we discover new problems.

I am preparing patches to implement multicast features on Unix
datagram+seqpacket sockets and my patches could potentially make things
worse in unix_dgram_poll() because it would need to check the receiving
queues of all multicast members. So I want unix_dgram_poll() to be fast
in the first place before proposing other changes for multicast.
--

From: Alban Crequy
Date: Tuesday, November 23, 2010 - 5:20 pm

Le Wed, 24 Nov 2010 01:27:56 +0200,

But are they SOCK_STREAM or SOCK_DGRAM sockets? The patches fix
performances with SOCK_DGRAM. If the xserver scenario is with
SOCK_STREAM sockets, your problem is probably still unfixed.

--

From: Eric Dumazet
Date: Tuesday, November 23, 2010 - 5:28 pm

It should not matter ?

commit 67426b756c4d52c51 (af_unix: use keyed wakeups ) makes
unix_write_space() call wake_up_interruptible_sync_poll() instead of
wake_up_interruptible_sync().

So it should be fixed for both STREAM/DGRAM sockets ?



--

Previous thread: [PATCH] fs/dcache: fix kernel-doc warnings by Randy Dunlap on Friday, October 29, 2010 - 10:54 am. (1 message)

Next thread: [PATCH 6/7] kgdbts: prevent re-entry to kgdbts before it unregisters by Jason Wessel on Friday, October 29, 2010 - 11:43 am. (1 message)