[PATCH 4/5] select: add a poll specific struct to the restart_block union

Previous thread: [RFC: 2.6 patch] remove the broken i82443bxgx_edac driver by Adrian Bunk on Sunday, August 31, 2008 - 9:10 am. (6 messages)

Next thread: Re: [PATCH 2/4] libata: Implement disk shock protection support by Elias Oltmanns on Sunday, August 31, 2008 - 10:07 am. (5 messages)
From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 9:28 am

New in this version:
* With lots of help from Thomas Gleixner, select() and poll() now
  exclusively use hrtimers
* Several key cleanups from Thomas actually simplify and clean the
  code up so that it's an overall improvement in code quality
* Various interesting bugs were encountered during the switchover on this
  level. The Fedora/Red Hat "nash" program deserves a special mention for
  both asking for a 1 nanosecond ppoll() timeout AND depending on the
  implementation to set this to 0 nanoseconds in userspace memory at
  the end of the first iteration.


 fs/compat.c                 |  187 +++++++++--------------
 fs/select.c                 |  346 +++++++++++++++++++++-----------------------
 include/linux/hrtimer.h     |    2 
 include/linux/poll.h        |    8 -
 include/linux/thread_info.h |    8 +
 include/linux/time.h        |    4 
 kernel/hrtimer.c            |   65 ++++++++
 kernel/time.c               |   18 ++
 8 files changed, 344 insertions(+), 294 deletions(-)

(the bulk of actual linecount growth is just newly added comments)
----

Today in Linux, select() and poll() operate in jiffies resolution
(granularity), meaning an effective resolution of 1 millisecond (HZ=1000) to
10 milliseconds (HZ=100). Delays shorter than this are not possible, and all
delays are in multiples of this granularity.

The effect is that applications that want (on average) to specify more
accurate delays (for example multimedia or other interactive apps) just
cannot do that; this creates more than needed jitter.

With this patch series, the internals of select() and poll() interfaces are
changed such that they work on the nanosecond level (using hrtimers). The
userspace interface for select() is in microseconds, for pselect() and
ppoll() this is in nanoseconds.

[actual behavior obviously on what resolution the hardware timers work, on
modern PCs this is pretty good though]

To show this effect I made a test application to measure the error made
in the select() ...
From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 9:29 am

From: Arjan van de Ven <arjan@linux.intel.com>
Subject: [PATCH] select: Introduce a hrtimeout function

This patch adds a schedule_hrtimeout() function, to be used by select() and
poll() in a later patch. This function works similar to schedule_timeout()
in most ways, but takes a timespec rather than jiffies.

With a lot of contributions/fixes from Thomas

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h |    2 +
 kernel/hrtimer.c        |   65 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 6d93dce..becd17d 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -346,6 +346,8 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
 extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 				 struct task_struct *tsk);
 
+extern int schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
+
 /* Soft interrupt function to run the hrtimer queues: */
 extern void hrtimer_run_queues(void);
 extern void hrtimer_run_pending(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index b8e4dce..782137d 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1678,3 +1678,68 @@ void __init hrtimers_init(void)
 #endif
 }
 
+/**
+ * schedule_hrtimeout - sleep until timeout
+ * @expires:	timeout value (ktime_t)
+ * @mode:	timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
+ *
+ * Make the current task sleep until the given expiry time has
+ * elapsed. The routine will return immediately unless
+ * the current task state has been set (see set_current_state()).
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
+ * pass before the routine returns.
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the ...
From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 9:29 am

From: Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH] select: add a timespec_add_safe() function

For the select() rework, it's important to be able to add timespec
structures in an overflow-safe manner.

This patch adds a timespec_add_safe() function for this which is similar in
operation to ktime_add_safe(), but works on a struct timespec.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/time.h |    4 ++++
 kernel/time.c        |   18 ++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index e15206a..7269764 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -38,6 +38,8 @@ struct timezone {
 #define NSEC_PER_SEC	1000000000L
 #define FSEC_PER_SEC	1000000000000000L
 
+#define TIME_T_MAX	(time_t)((1UL << ((sizeof(time_t) << 3) - 1)) - 1)
+
 static inline int timespec_equal(const struct timespec *a,
                                  const struct timespec *b)
 {
@@ -72,6 +74,8 @@ extern unsigned long mktime(const unsigned int year, const unsigned int mon,
 			    const unsigned int min, const unsigned int sec);
 
 extern void set_normalized_timespec(struct timespec *ts, time_t sec, long nsec);
+extern struct timespec timespec_add_safe(const struct timespec lhs,
+					 const struct timespec rhs);
 
 /*
  * sub = lhs - rhs, in normalized form
diff --git a/kernel/time.c b/kernel/time.c
index 6a08660..d63a433 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -669,3 +669,21 @@ EXPORT_SYMBOL(get_jiffies_64);
 #endif
 
 EXPORT_SYMBOL(jiffies);
+
+/*
+ * Add two timespec values and do a safety check for overflow.
+ * It's assumed that both values are valid (>= 0)
+ */
+struct timespec timespec_add_safe(const struct timespec lhs,
+				  const struct timespec rhs)
+{
+	struct timespec res;
+
+	set_normalized_timespec(&res, lhs.tv_sec + rhs.tv_sec,
+				lhs.tv_nsec + rhs.tv_nsec);
+
+	if ...
From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 9:30 am

From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 31 Aug 2008 08:16:57 -0700
Subject: [PATCH] select: add poll_select_set_timeout() and
poll_select_copy_remaining() helpers

This patch adds 2 helpers that will be used for the hrtimer based
select/poll:

poll_select_set_timeout() is a helper that takes a timeout (as a
second, nanosecond pair) and turns that into a "struct timespec" that
represents the absolute end time. This is a common operation in the
many select() and poll() variants and needs various, common, sanity
checks.

poll_select_copy_remaining() is a helper that takes care of copying the
remaining time to userspace, as select(), pselect() and ppoll() do.
This function comes in both a natural and a compat implementation (due
to datastructure differences).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/compat.c          |   51 ++++++++++++++++++++++++++++++++++
 fs/select.c          |   75
++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/poll.h
|    2 + 3 files changed, 128 insertions(+), 0 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 075d050..424767c 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1436,6 +1436,57 @@ out_ret:
 
 #define __COMPAT_NFDBITS       (8 * sizeof(compat_ulong_t))
 
+static int poll_select_copy_remaining(struct timespec *end_time, void
__user *p,
+				      int timeval, int ret)
+{
+	struct timespec ts;
+
+	if (!p)
+		return ret;
+
+	if (current->personality & STICKY_TIMEOUTS)
+		goto sticky;
+
+	/* No update for zero timeout */
+	if (!end_time->tv_sec && !end_time->tv_nsec)
+		return ret;
+
+	ktime_get_ts(&ts);
+	ts = timespec_sub(*end_time, ts);
+	if (ts.tv_sec < 0)
+		ts.tv_sec = ts.tv_nsec = 0;
+
+	if (timeval) {
+		struct compat_timeval rtv;
+
+		rtv.tv_sec = ts.tv_sec;
+		rtv.tv_usec = ts.tv_nsec / NSEC_PER_USEC;
+
+		if (!copy_to_user(p, &rtv, sizeof(rtv)))
+			return ret;
+	} else {
+		struct ...
From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 9:30 am

From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 31 Aug 2008 08:19:15 -0700
Subject: [PATCH] select: add a poll specific struct to the restart_block union

with hrtimer poll/select, the signal restart data no longer is a single
long representing a jiffies count, but it becomes a second/nanosecond pair
that also needs to encode if there was a timeout at all or not.

This patch adds a struct to the restart_block union for this purpose

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/thread_info.h |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 38a5647..e6b820f 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -38,6 +38,14 @@ struct restart_block {
 #endif
 			u64 expires;
 		} nanosleep;
+		/* For poll */
+		struct {
+			struct pollfd __user *ufds;
+			int nfds;
+			int has_timeout;
+			unsigned long tv_sec;
+			unsigned long tv_nsec;
+		} poll;
 	};
 };
 
-- 
1.5.5.1

--

From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 9:31 am

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Sun, 31 Aug 2008 08:26:40 -0700
Subject: [PATCH] select: switch select() and poll() over to hrtimers

With lots of help, input and cleanups from Thomas Gleixner

This patch switches select() and poll() over to hrtimers.

The core of the patch is replacing the "s64 timeout" with a
"struct timespec end_time" in all the plumbing.

But most of the diffstat comes from using the just introduced helpers:
	poll_select_set_timeout
	poll_select_copy_remaining
	timespec_add_safe
which make manipulating the timespec easier and less error-prone.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 fs/compat.c          |  136 ++++----------------------
 fs/select.c          |  263 +++++++++++++++++---------------------------------
 include/linux/poll.h |    6 +-
 3 files changed, 111 insertions(+), 294 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 424767c..133ed7f 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1568,7 +1568,8 @@ int compat_set_fd_set(unsigned long nr, compat_ulong_t __user *ufdset,
 	((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)
 
 int compat_core_sys_select(int n, compat_ulong_t __user *inp,
-	compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout)
+	compat_ulong_t __user *outp, compat_ulong_t __user *exp,
+	struct timespec *end_time)
 {
 	fd_set_bits fds;
 	void *bits;
@@ -1615,7 +1616,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
 	zero_fd_set(n, fds.res_out);
 	zero_fd_set(n, fds.res_ex);
 
-	ret = do_select(n, &fds, timeout);
+	ret = do_select(n, &fds, end_time);
 
 	if (ret < 0)
 		goto out;
@@ -1641,7 +1642,7 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 	compat_ulong_t __user *outp, compat_ulong_t __user *exp,
 	struct compat_timeval __user *tvp)
 {
-	s64 timeout = -1;
+	struct timespec end_time, *to = NULL;
 	struct compat_timeval tv;
 	int ret;
 
@@ ...
From: Willy Tarreau
Date: Sunday, August 31, 2008 - 9:55 am

Hi Arjan,


This is a good news. I've been bothered by this problem for a long time
and got used to add 99999 us to any delay fed to select() (that started
when HZ=100). I remember having noticed performance improvements in my
applications using this trick even on other systems, so I don't think
that Linux is the only one with this anomaly. I think it's simply caused
by the rounding methods which make it hard to compute an accurate deadline
when a value is converted before being added to jiffies.

I do not remember noticing this on epoll though. Maybe it's just because
I did not look close enough :-/

Cheers,
Willy

--

From: Davide Libenzi
Date: Sunday, August 31, 2008 - 2:13 pm

Another way to get down in the timeout resolution for newer kernels, and 
still supporting the syscalls with millisec timeouts, is to use timerfds.


- Davide


--

From: Arjan van de Ven
Date: Sunday, August 31, 2008 - 2:28 pm

On Sun, 31 Aug 2008 14:13:45 -0700 (PDT)

perhaps, but once you have to change the app you can also make it use
ppoll() rather than poll().. that's a VERY minute change.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

Previous thread: [RFC: 2.6 patch] remove the broken i82443bxgx_edac driver by Adrian Bunk on Sunday, August 31, 2008 - 9:10 am. (6 messages)

Next thread: Re: [PATCH 2/4] libata: Implement disk shock protection support by Elias Oltmanns on Sunday, August 31, 2008 - 10:07 am. (5 messages)