I'm pleased to announce the availability of version 6 of the syslet subsystem. Ingo and I agreed that I'll handle syslet releases while he's busy with CFS. I copied the cc: list from Ingo's v5 announcement. If you'd like to be dropped (or added), please let me know.

The v6 patch series against 2.6.21 can be downloaded from:

  http://oss.oracle.com/~zab/syslets/v6/

Example applications and previous syslet releases can be found at:

  http://people.redhat.com/~mingo/syslet-patches/

The syslet subsystem aims to provide user-space with an efficient interface for managing the asynchronous submission and completion of existing system calls.

The only changes since v5 are small changes that I made to support the experimental aio patch described below.

My syslet subsystem todo list is as follows, in no particular order:

 - replace WARN_ON() calls with error handling or avoidance
 - split the x86_64-async.patch into more specific patches
 - investigate integration with ptrace
 - investigate rare ./syslet-test cpu spinning
 - provide distro kernel rpms and documentation for developers
 - compat design problems, still? http://lkml.org/lkml/2007/3/7/523

Included in this patch series is an experimental patch which reworks fs/aio.c to reuse the syslet subsystem to process iocb requests from user space. The intent of this work is to simplify the code and broaden aio functionality. Many issues need to be addressed before this aio work could be merged:

 - support cancellation by sending signals to async_threads
 - figure out what to do about signals from handlers, like SIGXFSZ
 - verify that heavy loads do not consume excessive cpu or memory
 - concurrent dio writes
 - cfq gets confused, share io_context amongst threads?
 - restrict allowed operations like .aio_{r,w} methods used to

More details on this work in progress can be found in the patch.

Any and all feedback is welcome and encouraged!

- z
-
.. so don't keep us in suspense. Do you have any numbers for anything (like Oracle, to pick a random thing out of thin air ;) that might actually indicate whether this actually works or not? Or is it just so experimental that no real program that uses aio can actually work yet? Linus -
I haven't gotten to running Oracle's database against it. It is going
to be Very Cranky if O_DIRECT writes aren't concurrent, and that's going
to take a bit of work in fs/direct-io.c.
I've done initial micro-benchmarking runs for basic sanity testing with
fio. They haven't wildly regressed, that's about as much as can be said
with confidence so far :).
Take a streaming O_DIRECT read. 1meg requests, 64 in flight.
str: (g=0): rw=read, bs=1M-1M/1M-1M, ioengine=libaio, iodepth=64
mainline:
read : io=3,405MiB, bw=97,996KiB/s, iops=93, runt= 36434msec
aio+syslets:
read : io=3,452MiB, bw=99,115KiB/s, iops=94, runt= 36520msec
That's on an old gigabit copper FC array with 10 drives behind a, no
seriously, qla2100.
The real test is the change in memory and cpu consumption, and I haven't
modified fio to take reasonably precise measurements of those yet. Once
I get O_DIRECT writes concurrent that'll be the next step.
I was pleased to see my motivation for the patches, to avoid having to
add specific support for operations to be called from fs/aio.c, work
out.
Take the case of 4k random buffered reads from a block device with a
cold cache:
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
mainline:
read : io=16,116KiB, bw=457KiB/s, iops=111, runt= 36047msec
slat (msec): min= 4, max= 629, avg=563.17, stdev=71.92
clat (msec): min= 0, max= 0, avg= 0.00, stdev= 0.00
aio+syslets:
read : io=125MiB, bw=3,634KiB/s, iops=887, runt= 36147msec
slat (msec): min= 0, max= 3, avg= 0.00, stdev= 0.08
clat (msec): min= 2, max= 643, avg=71.59, stdev=74.25
aio+syslets w/o cfq
read : io=208MiB, bw=6,057KiB/s, iops=1,478, runt= 36071msec
slat (msec): min= 0, max= 15, avg= 0.00, stdev= 0.09
clat (msec): min= 2, max= 758, avg=42.75, stdev=37.33
Everyone step back and thank Jens for writing a tool that gives us
interesting data without us always having to craft some stupid ...
-
You should pick up the kevent work :) Having async request and response rings would be quite useful, and most closely match what is going on under the hood in the kernel and hardware.

Jeff
-
> You should pick up the kevent work :)

Yeah, but I have lots of competing thoughts about this.

For the time being I'm focusing on simplifying the mechanisms that support the sys_io_*() interface so I never ever have to debug fs/aio.c (also known as chewing glass to those of us with the scars) again.

That said, I'll gladly work closely with developers who are seriously considering putting some next gen interface to the test. That todo item about producing documentation and distro kernels is specifically to bait Uli into trying to implement posix aio on top of syslets in glibc.

'cause we can go back and forth about potential interfaces for, well, how long has it been? years? I want non-trivial users who we can measure so we can *stop* designing and implementing the moment something is good enough for them.

- z
-
Get DaveJ to pick up the code for Fedora kernels and I'll get to it.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
On Tue, May 29, 2007 at 04:20:04PM -0700, Ulrich Drepper wrote:
> Zach Brown wrote:
> > That todo item
> > about producing documentation and distro kernels is specifically to bait
> > Uli into trying to implement posix aio on top of syslets in glibc.
>
> Get DaveJ to pick up the code for Fedora kernels and I'll get to it.

With F7 out the door, I'm looking at getting devel/ back in shape again, so I can get something done there soon-ish. With the usual caveat that if this isn't upstream by the time we do a release, we'll have to drop it due to the added syscall. (Maybe we can just get that reserved upstream now?)

Dave

-- 
http://www.codemonkey.org.uk
-
Maybe, but we'd have to agree on the bare syslet interface that is being
supported :).
Personally, I'd like that to be the simplest thing that works for people
and I'm not convinced that the current syslet-specific syscalls are that.
Certainly not the atom interface, anyway.
+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+	unsigned long val, new_val;
+
+	if (get_user(val, uptr))
+		return -EFAULT;
+	/*
+	 * inc == 0 means 'read memory value':
+	 */
+	if (!inc)
+		return val;
+
+	new_val = val + inc;
+	if (__put_user(new_val, uptr))
+		return -EFAULT;
+
+	return new_val;
+}
A syscall for *long addition* strikes me as a bit much, I have to admit.
Where do we stop? (Where's the compat wrapper? :))
Maybe this would be fine for some wildly aggressive optimization some
number of years in the future when we have millions of syslet interface
users complaining about the cycle overhead of their syslet engines, but
it seems like we can do something much less involved in the first pass
without harming the possibility of promising to support this complex
optimization in the future.
- z
-
note that async request and response rings are implemented already in essence: that's how FIO uses syslets. The linked list of syslet atoms is the 'request ring' (it's just that 'ring' is not a hard-enforced data structure - you can use other request formats too), and the completion ring is the 'response ring'. Ingo -
3 months ago i verified the published kevent vs. epoll benchmark and found that benchmark to be fatally flawed. When i redid it properly kevent showed no significant advantage over epoll. Note that i did those measurements _before_ the recent round of epoll speedups. So unless someone does believable benchmarks i consider kevent an over-hyped, mis-benchmarked complication to do something that epoll is perfectly capable of doing. Ingo -
I'm not going to judge your tests but saying there are no significant advantages is too one-sided. There is one huge advantage: the interface. A memory-based interface is simply the best form. File descriptors are a resource the runtime cannot transparently consume.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
yeah - this is a fundamental design question for Linus i guess :-) glibc (and other infrastructure libraries) have a fundamental problem: they cannot (and do not) presently use persistent file descriptors to make use of kernel functionality, due to ABI side-effects. [applications can dup into an fd used by glibc, applications can close it - shells close fds blindly for example, etc.] Today glibc simply cannot open a file descriptor and keep it open while application code is running due to these problems. we should perhaps enable glibc to have its separate fd namespace (or 'hidden' file descriptors at the upper end of the fd space) so that it can transparently listen to netlink events (or do epoll), without impacting the application fd namespace - instead of ducking to a memory based API as a workaround. it is a serious flexibility issue that should not be ignored. The unified fd space is a blessing on one hand because it's simple and powerful, but it's also a curse because nested use of the fd space for libraries is currently not possible. But it should be detached from any fundamental question of kevent vs. epoll. (By improving library use of file descriptors we'll improve the utility of all syscalls - by ducking to a memory based API we only solve that particular event based usage.) Ingo -
There is another issue with file descriptors - userspace must dig into the kernel each time it wants to get a new set of events, while with a memory based approach it has them without doing so. After it has returned from the kernel and knows that there are some events, the kernel can add more of them into the ring (if there is room) and userspace will process them without additional syscalls. Although syscall overhead is very small, it does exist and should not be

-- 
	Evgeniy Polyakov
-
Firstly, this is not a fundamental property of epoll. If we wanted to, it would be possible to extend epoll to fill in a ring of events from the wakeup handler. It's an incremental add-on to epoll that should not impact the design. How much info to put into a single event is another incremental thing - for most of the high-performance cases all the information we need is the type of the event and the fd it occurred on. Currently epoll supports that minimal approach.

Secondly, our current syscall overhead is below 0.1 usecs on latest hardware:

 dione:~/l> ./lat_syscall null
 Simple syscall: 0.0911 microseconds

so you need millions of events _per cpu_ for the syscall overhead to show up.

Thirdly, our main problem was not the structure of epoll, our main problem was that event APIs were not widely available, so applications couldnt go to a pure event based design - they always had to handle certain types of event domains specially, due to lack of coverage. The latest epoll patches largely address that. This was a huge barrier against adoption of epoll.

	Ingo
-
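To put the 0.0911 usec figure above in perspective: one million such syscalls per second cost roughly 1,000,000 x 0.0911 usec = 91 msec of CPU time, i.e. about 9% of one CPU. The number can be approximated on another machine with a rough sketch like the following. It is not lmbench's lat_syscall; the function name, the choice of getppid as the cheap call, and the use of CLOCK_MONOTONIC are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: average cost of one trivial syscall, in
 * microseconds, measured over `iterations` calls. syscall(SYS_getppid)
 * is used so that every iteration really enters the kernel.
 */
double null_syscall_usecs(long iterations)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iterations; i++)
		syscall(SYS_getppid);
	clock_gettime(CLOCK_MONOTONIC, &end);

	double ns = (end.tv_sec - start.tv_sec) * 1e9 +
		    (double)(end.tv_nsec - start.tv_nsec);

	return ns / 1000.0 / (double)iterations;
}
```

Calling null_syscall_usecs(1000000) and multiplying by an expected event rate gives a first estimate of how much CPU the per-event syscall would take; the result includes loop and timer overhead, so treat it as an upper bound.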
Well, quite frankly, to me, the most important part of syslets is that if they are done right, they introduce _no_ new interfaces at all that people actually use.

Over the years, we've done lots of nice "extended functionality" stuff. Nobody ever uses them. The only thing that gets used is the standard stuff that everybody else does too.

So when it comes to syslets, the most important interface will be the existing aio_read() etc interfaces _without_ any in-memory stuff at all, and everything done by the kernel to just make it look exactly like it used to look.

And the biggest advantage is that it simplifies the internal kernel code, and makes us use the same code for aio and non-aio (and I think we have a good possibility of improving performance too, if only because we will get much more natural and fine-grained scheduling points!)

Any extended "direct syslets" use is technically _interesting_, but ultimately almost totally pointless. Which was why I was pushing really really hard for a simple interface and not being too clever or exposing internal designs too much. An in-memory thing tends to be the absolute

glibc has a more fundamental problem: the "fun" stuff is generally not worth it. For example, any AIO thing that requires glibc to be rewritten is almost totally uninteresting. It should work with _existing_ binaries, and _existing_ ABI's to be useful - since 99% of all AIO users are binary-only and won't recompile for some experimental library.

The whole epoll/kevent flame-wars have ignored a huge issue: almost nobody uses either. People still use poll and select, to such an _overwhelming_ degree that it almost doesn't even matter if you were to make the

Yeah, I don't think it would be at all wrong to have "private file descriptors". I'd prefer that over memory-based (for all the abstraction issues, and because a lot of things really *are* about file descriptors!).

	Linus
-
Something like this would only work reliably if you have actual protection coming with it. Also, there are still reasons why an application might want to see, close, handle, whatever these descriptors in a separate namespace.

I think such namespaces are a broken concept. How many do you want to introduce? Plus, then you get away from the normal file descriptor interfaces anyway. If you'd represent these alternative namespace descriptors with ordinary ints you gain nothing. You'd have to use tuples (namespace,descriptor) and then you need a whole set of new interfaces or some sticky namespace selection which will only cause

It's not "ducking". Memory mapping is one of the most natural interfaces. Just because Unix/Linux is built around the concept of file descriptors does not mean this is the ultimate in usability. File descriptors are in fact clumsy: if you have a file descriptor to read and write data, all auxiliary data for that communication must be transferred out-of-band (e.g, fcntl) or in very magical and hard to use ways (recvmsg, sendmsg). With a memory based event mechanism this

Too simple.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
Here I think we are forgetting that glibc is userspace and there's no
separation between the application code and glibc code. An application
linking to glibc can break glibc in a thousand ways, independently of
fds or no fds. Like complaining that glibc is broken because printf()
suddenly does not work anymore ;)
#include <stdio.h>
#include <unistd.h>	/* close() */

int main(void) {
	close(fileno(stdout));
	printf("Whiskey Tango Foxtrot?\n");
	return 0;
}
- Davide
-
It's not (only/mainly) about breaking. File descriptors are a resource which has to be used under the control of the program. The runtime cannot just steal some for itself. This indirectly leads to breaking code. We've seen this many times and I keep repeating the same issue over and over again: why do we have MAP_ANON instead of keeping a file descriptor with /dev/null open? Why is mmap made more complicated by allowing the file descriptor to be closed after the mmap() call is done?

Take a look at a process running your favorite shell. Ever wonder why there is this stray file descriptor with a high number?

$ cat /proc/3754/cmdline
bash
$ ll /proc/3754/fd/
total 0
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 0 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 1 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:49 2 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 255 -> /dev/pts/19

File descriptors must be requested explicitly and cannot be implicitly consumed.

All that and the other problem I mentioned earlier today about auxiliary data. File descriptors are not the ideal interface. Elegant: yes, ideal: no. From physics and math you might have learned that not every result that looks clean and beautiful is correct.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
No, Davide, the problem is that some applications depend on getting _specific_ file descriptors.

For example, if you do

	close(0);
	.. something else ..
	if (open("myfile", O_RDONLY) < 0)
		exit(1);

you can (and should) depend on the open returning zero.

So library routines *must not* open file descriptors in the normal space.

(The same is true of real applications doing the equivalent of

	for (i = 0; i < NR_OPEN; i++)
		close(i);

to clean up all file descriptors before doing something new. And yes, I think it was bash that used to *literally* do something like that a long time ago.)

Another example of the same thing: people open file descriptors and know that they'll be "dense" in the result, and then use "select()" on them.

So it's true that file descriptors can't be used randomly by the standard libraries - they'd need to have some kind of separate "private space".

Which *could* be something as simple as saying "bit 30 in the file descriptor specifies a separate fd space" along with some flags to make open and friends return those separate fd's. That makes them useless for "select()" (which assumes a flat address space, of course), but would be useful for just about anything else.

	Linus
-
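The guarantee described above, that open() hands back the lowest-numbered free descriptor, can be checked with a small sketch. The helper name reopen_lowest is made up for illustration; it only relies on the POSIX behaviour of close() and open(), and it assumes a single-threaded process in which every descriptor below `target` is already in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: close `target`, then open `path` read-only and
 * return the new descriptor. Because open() must return the lowest
 * descriptor that is not currently open, the result equals `target`
 * as long as nothing else in the process grabbed a descriptor in
 * between (another thread, or a library call that opens a file).
 */
int reopen_lowest(int target, const char *path)
{
	close(target);
	return open(path, O_RDONLY);
}
```

This is exactly the property a library would break if it quietly opened a descriptor of its own between the close() and the open().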
Right. I misunderstood Uli and Ingo. I thought it was like trying to

I think it can be solved in a few ways. Yours or Ingo's (or something else) can work, to solve the above "legacy" fd space expectations.

- Davide
-
Then you can also exclude multi-threading, since a thread (even not inside glibc) can also use socket()/pipe()/open()/whatever and take the zero file descriptor as well.

Frankly I dont buy this fd namespace stuff.

The only hardcoded thing in Unix is 0, 1 and 2 fds. People usually take care of these, or should use a Microsoft OS.

POSIX mandates that open() returns the lowest available fd. But this obviously works only if you dont have another thread messing with fds, or if you dont call a library function that opens a file.

Quite buggy IMHO

This hack was to avoid bugs coming from ancestors applications, forking/execing a shell, and at times where one process could not open more than 20 files (AT&T Unix, 21 years ago)

Unix has fcntl(fd, F_SETFD, FD_CLOEXEC). A library should use this to make

Please dont do that. Second class fds. Then what about having ten different shared libraries ? Third class fds ?
-
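The fcntl(fd, F_SETFD, FD_CLOEXEC) call mentioned above could be used by a library roughly as follows. This is a minimal sketch, not code from any of the patches; open_cloexec is a made-up name, and the read-modify-write of the descriptor flags is just the conventional way to avoid clobbering other flag bits.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` read-only and mark the descriptor
 * close-on-exec so it is not inherited across execve().
 * Returns the descriptor, or -1 on error.
 */
int open_cloexec(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Preserve any existing descriptor flags and add FD_CLOEXEC. */
	int flags = fcntl(fd, F_GETFD);
	if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Note that this only deals with inheritance across exec. It does nothing about the other complaints in this thread: the descriptor still occupies a slot in the normal fd space, and in a threaded program another thread can fork and exec in the window between open() and fcntl().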
No. The application is _correct_. It's how file descriptors are defined to

Totally different. That's an application internal issue. It does *not*

Wrong. I already gave an example of real code that just didn't bother to keep track of which fd's it had open, and closed them all. Partly, in fact, because you can't even _know_ which fd's you have open when somebody else just execve's you.

You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG.

You cannot just change years and years of coding practice, and standard documentations. The behaviour of file descriptors is a fact. Ignoring that fact because you don't like it is na
If someone really cares, /proc/self/fd can help. But one shouldn't care at all.
About the things that the process can do before execing() a process, file
descriptors outside of 0,1,2 are the most obvious thing, but you also have
I want to change nothing. Current situation is fine and well documented, thank
you.
If a program does "for (i = 0; i < NR_OPEN; i++) close(i);", this
*will*/*should* work as intended : close all file descriptors from 0 to
NR_OPEN. Big deal.
But you wont find in a program :
FILE *fp = fopen("somefile", "r");
for (i = 0; i < NR_OPEN; i++)
close(i);
while (fgets(buff, sizeof(buff), fp)) {
}
You and/or others want to add fd namespaces and other hacks.
I saw on this thread suspicious examples, I am waiting for a real one,
justifying all this stuff.
After file descriptors separation, I guess we'll need memory space separation
as well, signal separations (SIGALRM comes to mind), uid/gid separation, cpu
time separation, and so on... setrlimit() layered for every shared lib.
-
Looking at it now, I'd agree (although I think I have that somewhere in my old code too). Consider though, that such code is contained also in reference books like Richard Stevens "UNIX Network Programming". - Davide -
Indeed. It was not only bash, though, I fixed probably a dozen applications. But even the new and better solution (readdir of /proc/self/fd) does not prevent the problem of closing descriptors the

I don't like special cases. For me things better come in quantities 0, 1, and unlimited (well, reasonably high limit). Otherwise, who gets to use that special namespace? The C library is not the only body of code which would want to use descriptors.

And then the semantics: should these descriptors show up in /proc/self/fd? Are there separate directories for each namespace? Do they count against the rlimit?

This seems to me like a shot from the hip without thinking about other possibilities.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
Well, don't think of it as a special case at all: think of bit 30 as a "the user asked for a non-linear fd".

In fact, to make it effective, I'd suggest literally scrambling the low bits (using, for example, some silly per-boot xor value to actually generate the "true" index - the equivalent of a really stupid randomizer).

That way you'd have the legacy "linear" space, and a separate "non-linear space" where people simply *cannot* make assumptions about contiguous fd allocations. There's no special case there - it's just an extension which explicitly allows us to say "if you do that, your fd's won't be allocated the traditional way any more, but you *can* mix the traditional and the

Oh, absolutely. They'd be real fd's in every way. People could use them 100% equivalently (and concurrently) with the traditional ones. The whole, and the _only_ point, would be that it breaks the legacy guarantees of a dense fd space.

Most apps don't actually *need* that dense fd space in any case. But by defaulting to it, we wouldn't break those (few) apps that actually depend on it.

	Linus
-
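The "bit 30 plus per-boot xor" idea above amounts to a reversible mapping between a dense table index and the number handed to user space. The following is a user-space sketch of that arithmetic only, under stated assumptions: the names NONLINEAR_FD_BIT, index_to_nonlinear_fd and friends are invented here, and nothing like this exists in the posted patches.

```c
#include <stdint.h>

/* Bit 30 tags a descriptor as living in the non-linear space. */
#define NONLINEAR_FD_BIT	0x40000000u
#define NONLINEAR_INDEX_MASK	(NONLINEAR_FD_BIT - 1)

/* Does this descriptor number carry the non-linear tag? */
static inline int fd_is_nonlinear(uint32_t fd)
{
	return (fd & NONLINEAR_FD_BIT) != 0;
}

/*
 * Map a dense table index (below 2^30) to the scrambled number shown
 * to user space: xor the low 30 bits with a per-boot value, then set
 * the tag bit.
 */
static inline uint32_t index_to_nonlinear_fd(uint32_t index, uint32_t boot_xor)
{
	return NONLINEAR_FD_BIT | ((index ^ boot_xor) & NONLINEAR_INDEX_MASK);
}

/* Inverse mapping: strip the tag and undo the xor on the low 30 bits. */
static inline uint32_t nonlinear_fd_to_index(uint32_t fd, uint32_t boot_xor)
{
	return (fd ^ boot_xor) & NONLINEAR_INDEX_MASK;
}
```

Because xor is its own inverse, the kernel side could keep a plain dense table while user space sees numbers it cannot sensibly treat as contiguous; legacy descriptors without the tag bit are left untouched.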
I agree. What would be a good interface to allocate fds in such area? We don't want to replicate syscalls, so maybe a special new dup function? - Davide -
I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or similar, and just have NONLINEAR_FD be some magic value (for example, make it be 0x40000000 - the bit that says "private, nonlinear" in the first place). But what's gotten lost in the current discussion is that we probably don't actually _need_ such a private space. I'm just saying that if the *choice* is between memory-mapped interfaces and a private fd-space, we should probably go for the latter. "Everything is a file" is the UNIX way, after all. But there's little reason to introduce private fd's otherwise. Linus -
it's both a flexibility and a speedup thing as well: flexibility: for libraries to be able to open files and keep them open comes up regularly. For example currently glibc is quite wasteful in a number of common networking related functions (Ulrich, please correct me if i'm wrong), which could be optimized if glibc could just keep a netlink channel fd open and could poll() it for changes and cache the results if there are no changes (or something like that). speedup: i suggested O_ANY 6 years ago as a speedup to Apache - non-linear fds are cheaper to allocate/map: http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html (i definitely remember having written code for that too, but i cannot find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap overhead as well and use a per-process list/pool of struct file buffers plus a maximum-fd field as the 'non-linear fd allocator' (at the price of only deallocating them at process exit time). Ingo -
On Thu, 31 May 2007 08:13:03 +0200 Only very few apps need to open more than 100.000 files. As these files are likely sockets, O_ANY is not a solution. A trick is to try to keep first 64 handles freed, so that kernel wont consume too much cpu time and cache in get_unused_fd() http://lkml.org/lkml/2005/9/15/307 This trick is portable (not linux centric). -
yes. I did not list it as a primary reason for private fds, it's just a nice side-effect. As long as the other apps are not hurt, i see no

why not? It would be a natural thing to extend sys_socket() with a 'flags' parameter and pass in O_ANY (along with any other possible fd

this is basically a user-space front-end cache to fd allocation - which duplicates data needlessly. I dont see any problem with doing this in the kernel. (Also, obviously 'first 64 handles' could easily break with certain types of apps so glibc cannot do this.)

	Ingo
-
to measure this i've written fd-scale-bench.c:

  http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c

which tests the (cache-hot or cache-cold) cost of open()-ing of two fds while there are N other fds already open: one is from the 'middle' of the range, one is from the end of it.

Lets check our current 'extreme high end' performance with 1 million fds. (which is not realistic right now but there certainly are systems with over a hundred thousand open fds). Results from a fast CPU with 2MB of cache:

cache-hot:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
 num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
 ...
 num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
 num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
 num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
 num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
 num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
 num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
 num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
 num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
 num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
 num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
 num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
 num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
 num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us

cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
 num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
 ...
 num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
 num_fds: ...
On Thu, 31 May 2007 11:02:52 +0200

Your numbers do not match mine (mine were more than two years old so I redid a test before replying)

I tried your bench and found two problems :
- You scan half of the bitmap
- You incorrectly divide best_delta and worst_delta by LOOPS (5)

Try to close not a 'middle fd', but a really low one (10 for example), and latency is doubled.

with a corrected bench; cache-cold numbers are > 100 us on this Intel Pentium-M

 num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us

On an Opteron x86_64 machine, results are better :)

 num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us
-
that was intentional. I really didnt want to fabricate a worst-case result but something more representative: in real apps the bitmap isnt fully filled all the time and most of the find-bit sequences are short.

ah, indeed, that's a bug - victim of a last minute edit :) Since the dividend is constant it doesnt really matter to the validity of the relative nature of the slowdown (which is what i was interested in), but you are right - i have fixed the download and have redone the numbers. Here are the correct results from my box:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 6.00 us, worst cost: 8.00 us
 num_fds: 2, best cost: 6.00 us, worst cost: 7.00 us
 ...
 num_fds: 31586, best cost: 7.00 us, worst cost: 8.00 us
 num_fds: 39483, best cost: 8.00 us, worst cost: 8.00 us
 num_fds: 49354, best cost: 7.00 us, worst cost: 9.00 us
 num_fds: 61693, best cost: 8.00 us, worst cost: 10.00 us
 num_fds: 77117, best cost: 8.00 us, worst cost: 13.00 us
 num_fds: 96397, best cost: 9.00 us, worst cost: 11.00 us
 num_fds: 120497, best cost: 10.00 us, worst cost: 14.00 us
 num_fds: 150622, best cost: 11.00 us, worst cost: 13.00 us
 num_fds: 188278, best cost: 12.00 us, worst cost: 15.00 us
 num_fds: 235348, best cost: 14.00 us, worst cost: 20.00 us
 num_fds: 294186, best cost: 16.00 us, worst cost: 22.00 us
 num_fds: 367733, best cost: 19.00 us, worst cost: 35.00 us
 num_fds: 459667, best cost: 22.00 us, worst cost: 37.00 us
 num_fds: 574584, best cost: 26.00 us, worst cost: 40.00 us
 num_fds: 718231, best cost: 31.00 us, worst cost: 62.00 us
 num_fds: 897789, best cost: 37.00 us, worst cost: 54.00 us
 num_fds: 1000000, best cost: 41.00 us, worst cost: 59.00 us

and cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the cache-cold performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 24.00 us, worst cost: 32.00 us
 ...
 num_fds: 49354, best cost: 26.00 us, worst cost: 28.00 us
 num_fds: 61693, ...
btw., this also allows mostly-lockless fd allocation, which would probably benefit threaded apps too. (we can just recycle it from a per-CPU list of cached fds for that process) Ingo -
See also: http://lkml.org/lkml/2006/6/16/144 which originates from a much simpler patch I did to fix performance regressions in this area for the SLES10 kernel. -- Jens Axboe -
If the deal is to be able to get faster open()/socket()/pipe()/... calls by not finding the first 0 bit in a huge bitmap, a better way would be to have a flag in struct task, reset to 0 at exec time. A new syscall would say : This process is OK to receive *random* fds. -
This sounds easy but doesn't really solve all the issues. Let me repeat your example and the solution currently in use:

problem: application wants to close all file descriptors except a select few, cleaning up what is currently open. It doesn't know all the descriptors that are open. Maybe all this in preparation of an exec call.

Today the best method to do this is to readdir() /proc/self/fd and exclude the descriptors on the whitelist.

If the special, non-sequential descriptors are also listed in that directory the runtimes still cannot use them since they are visible.

If you go ahead with this, then at the very least add a flag which causes the descriptor to not show up in /proc/*/fd.

You also have to be aware that open() is just one piece of the puzzle. What about socket()? I've cursed this interface many times before and now it's biting you: there is no parameter to pass a flag. What about transferring file descriptors via Unix domain sockets? How can I decide the transferred descriptor should be in the private namespace?

There are likely many many more problems and corner cases like this.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
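The "readdir() /proc/self/fd and exclude the whitelist" method described above might look like the sketch below. It is an illustration of the idiom, not code from glibc or any application; close_all_except and fd_is_kept are made-up names, and the function is Linux-specific because it depends on /proc.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Return 1 if fd appears in keep[0..nkeep-1], else 0. */
static int fd_is_kept(int fd, const int *keep, size_t nkeep)
{
	for (size_t i = 0; i < nkeep; i++)
		if (keep[i] == fd)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: close every open descriptor except those listed
 * in keep[]. Returns the number of descriptors closed, or -1 if
 * /proc/self/fd cannot be opened.
 */
int close_all_except(const int *keep, size_t nkeep)
{
	DIR *dir = opendir("/proc/self/fd");
	if (!dir)
		return -1;

	int self = dirfd(dir);	/* descriptor backing the scan itself */
	int closed = 0;
	struct dirent *ent;

	while ((ent = readdir(dir)) != NULL) {
		char *end;
		long fd = strtol(ent->d_name, &end, 10);

		if (end == ent->d_name || *end != '\0')
			continue;	/* not a number: "." or ".." */
		if (fd == self || fd_is_kept((int)fd, keep, nkeep))
			continue;
		if (close((int)fd) == 0)
			closed++;
	}
	closedir(dir);
	return closed;
}
```

The sketch makes the objection concrete: any descriptor a runtime library opened for itself shows up in the same directory and is closed along with everything else, unless it is somehow hidden from /proc/self/fd.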
Well, we can't just replicate/change every system call that creates a file descriptor. So I'm for something like: int sys_fdup(int fd, int flags); So you basically create your fds with their native/existing system calls, and then you dup/move them into the preferred fd space. - Davide -
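Editorial sketch (not from the thread): sys_fdup() is only a proposal here and does not exist as a system call. The closest thing available from user space today is probably fcntl(F_DUPFD), which duplicates a descriptor onto the lowest free number at or above a given minimum. The helper name move_fd_above() is an assumption made for this illustration; it only mimics the "create with the native call, then move" pattern and provides none of the hiding or namespace properties discussed in the thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: move fd to the lowest free descriptor number
 * that is >= min_fd and close the original. Returns the new
 * descriptor, or -1 on error (the original is left open in that case).
 */
int move_fd_above(int fd, int min_fd)
{
    int newfd;

    assert(min_fd >= 0);

    newfd = fcntl(fd, F_DUPFD, min_fd);
    if (newfd < 0)
        return -1;

    close(fd);
    return newfd;
}
```

A library could call this right after pipe() or socket() to park its private descriptors well above the range applications usually touch, although they would still show up in /proc/self/fd.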
On Wed, 30 May 2007 14:27:52 -0700 (PDT) If the sole point is to protect an fd from being closed or operated on outside of a certain context, why not just provide the ability to "protect" an fd to prevent its use. Maybe a pair of syscalls like "fdprotect" and "fdunprotect" that take an fd and an integer key. Protected fds would return EBADF or something if accessed. The same integer key must be provided to fdunprotect in order to gain access to it again. Then glibc or valgrind or whatever would just unprotect the fd before operating on it. - DML -
One could always stuff a seed or per-cpu seeds in the files_struct and use a PRNG. The only trick would be cacheline bounces and/or space consumption of seeds. Another possibility would be bitreversed contiguity or otherwise a bit permutation of some contiguous range, modulo (of course) the high bit used to tag the randomized range. With "truly" random/sparse fd numbers it may be meaningful to use a different data structure from a bitmap to track them in-kernel, though xor and other easily-computed mappings to/from contiguous ranges won't need such in earnest. -- wli -
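Editorial sketch (not from the thread): the "bitreversed contiguity" idea above can be illustrated with a mapping that reverses the bits of a contiguous index and ORs in a tag bit. Because bit reversal is its own inverse, the contiguous index is cheap to recover, so an ordinary bitmap could still track allocations. FD_BITS, FD_RANDOM_TAG and the function names are assumptions chosen for this example, not values from any kernel patch.

```c
#include <assert.h>
#include <stdint.h>

#define FD_BITS       16            /* assumed width of the index space */
#define FD_RANDOM_TAG (1u << 30)    /* assumed tag bit for the sparse range */

/* Reverse the low `bits` bits of v. */
static uint32_t bitrev(uint32_t v, unsigned bits)
{
    uint32_t r = 0;

    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* Map a contiguous allocation index to a sparse-looking fd number. */
uint32_t index_to_fd(uint32_t idx)
{
    assert(idx < (1u << FD_BITS));
    return FD_RANDOM_TAG | bitrev(idx, FD_BITS);
}

/* Inverse mapping: recover the contiguous index from a tagged fd. */
uint32_t fd_to_index(uint32_t fd)
{
    assert(fd & FD_RANDOM_TAG);
    return bitrev(fd & ~FD_RANDOM_TAG, FD_BITS);
}
```

Consecutive indices land far apart (index 1 maps to 0x40008000, index 2 to 0x40004000), so code that guesses "the next fd" is unlikely to hit a live one by accident.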
Valgrind could certainly make use of it. It currently reserves a set of
fds "high enough", and tries hard to hide them from apps, but
/proc/self/fd makes it intractable in general (there was only so much
simulation I was willing to do in Valgrind).
J
-
Please, do not drop me out of the Cc list. If you have a valid point, you should be able to carry it forward regardless, no? - Davide -
Some programs - legitimately, I think - scan /proc/self/fd to close
everything. The question is whether the glibc-private fds should appear
there. And something like a "close-on-fork" flag might be useful,
though I guess glibc can keep track of its own fds closely enough to not
need something like that.
J
-
Sure. I think there are things we can do (like make the non-linear fd's appear somewhere else, and make them close-on-exec by default etc). And it's not like it's necessarily at all the only way to do things. I just threw it out as a possible solution - and one that is almost certainly *superior* to trying to work around the fd thing with some shared memory area which has tons of much more serious problems of its own (*). Linus (*) Ranging from: specialized-only interfaces, inability to pass it around, lack of any abstraction interfaces, and almost impossible to debug. The security implications of kernel and user space sharing read-write access to some shared area are also legion! -
Side note: it might not even be a "close-on-exec by default" thing: it might well be an *always* close-on-exec. That COE is pretty horrid to do, we need to scan a bitmap of those things on each exec. So it might be totally sensible to just declare that the non-linear fd's would simply always be "local", and never bleed across an execve. Linus -
Hm, I wouldn't limit the mechanism prematurely. Using Valgrind as an
example of an alternate user of this mechanism, it would be useful to
use a pipe to transmit out-of-band information from an exec-er to an
exec-ee process. At the moment there's a lot of mucking around with
execve() to transmit enough information from the parent valgrind to its
successor.
J
-
Or.. we could have a method of swizzling in and out an entire FD array, similar to UML's trick for swizzling MMs. -- Mathematics is the supreme nostalgia of our time. -
I like that notion even better than randomization. I think it should happen. I like SKAS, too, of course. -- wli -
Hi Ingo, developers. I did not want to start with another round of ping-pong insults :), but, Ingo, you did not show that kevent works worse. I did show that sometimes it works better. The win varied from 0 to 30% in those tests; in the results Johann Bork presented, kevent and epoll behaved the same. In results I posted earlier, I said that sometimes epoll behaved better, sometimes kevent. What does it say? Just the fact that in that given workload the result was the one we saw. Nothing more, nothing less. It does not show something is broken, and definitely not that it is: citation1: we're heading to yet-another monolitic interface, we're heading with no valid reasons given if other than some handwaving. citation2: consider kevent an over-hyped, mis-benchmarked complication to do something that epoll is perfectly Taking into account the other features kevent has (and what it was designed for originally - for network AIO, which is quite hard (if ever possible) with files and epoll; I'm not talking about syslets as AIO, it is a different approach and likely it is simpler, and even considering only that, it is already very good), it is not what people said in the above citations. It looks like you have some personal insults on that, which I do not understand. But it has nothing to do with the technical side of the problem, so let's stop such rhetoric and concentrate on the real problem and forget any possible personal issues which might be raised sometimes :). Although I closed the kevent and eventfs projects, I would gladly continue if we can and want to have progress in that area. -- Evgeniy Polyakov -
let me refresh your recollection: http://lkml.org/lkml/2007/2/25/116 where you said: "But note, that on my athlon64 3500 test machine kevent is about 7900 requests per second compared to 4000+ epoll, so expect a challenge." for a long time you made much fuss about how kevents is so much better and how epoll cannot perform and scale as well (you said various arguments why that is supposedly so), and some people bought into the performance argument and advocated kevent due to its supposed performance and scalability advantages - while now we are down to "epoll and kevent are break-even"? in my book that is way too much of a difference, it is (best-case) a way too sloppy approach to something as fundamental as Linux's basic event model and design, and it is also compounded by your continued "nothing happened, really, lets move on" stance. Losing trust is easy, winning it back is hard. Let me reuse a phrase of yours: "expect a challenge". Ingo -
You can also find in that thread that I managed to run an epoll server on that machine with 9k requests per second, although that was not You just draw a picture you want to see. Even on the kevent page I have links to other people's benchmarks, which show how kevent behaves compared to epoll under their load. _My_ tests showed a kevent performance win, you tuned my (can be broken) epoll code and results changed - this is the development process, Well, I do not care much about what people think I did wrong or right. There are obviously bad and good ideas and implementations. I might be absolutely wrong with something, but that is a process of solving problems, which I really enjoy. I just want that there should be no personal insults, if I made such things, -- Evgeniy Polyakov -
You snipped the key part of my response, so I'll say it again: Event rings (a) most closely match what is going on in the hardware and (b) often closely match what is going on in multi-socket, event-driven software applications. To echo Uli and paraphrase an ad, "it's the interface, silly." This is not something epoll is capable of doing, at the present time. Jeff -
event rings are just pure data structures that describe a set of data, and they have advantages and disadvantages. For the record, we've already got direct experience with rings as software APIs: they were used for KAIO and they were an implementational and maintenance nightmare and nobody used them. Kevent might be better, but you make it sound as if it was a trivial design choice while it certainly isn't! Sure, for hardware interfaces like networking cards tx and rx rings are the best thing but that is apples to oranges: hardware itself is about _limited_ physical resources, matching a _limited_ data structure like a ring quite well. But for software APIs, the built-in limit of rings makes it a baroque data structure that has a fair share of disadvantages in epoll is very much capable of doing it - but why bother if something more flexible than a ring can be used and the performance difference is negligible? (Read my other reply in this thread for further points.) but, for the record, syslets very much use a completion ring, so i'm not fundamentally opposed to the idea. I just think it's seriously over-hyped, just like most other bits of the kevent approach. (Nor do we have to attach this to syslets and threadlets - kevents are an orthogonal approach not directly related to asynchronous syscalls - syslets/threadlets can make use of epoll just as much as they can make use of kevent APIs.) Ingo -
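Editorial sketch (not from the thread): the "built-in limit of rings" mentioned above is easy to see in a minimal fixed-size event ring. The struct layout, RING_SIZE and the function names are invented for this illustration; this is neither the kevent ring nor the syslet completion ring.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4     /* assumed capacity; must be a power of two */

struct event {
    int fd;
    unsigned mask;
};

struct event_ring {
    unsigned head;                  /* next slot the consumer reads */
    unsigned tail;                  /* next slot the producer writes */
    struct event slots[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int ring_push(struct event_ring *r, struct event ev)
{
    assert(r != NULL);
    if (r->tail - r->head == RING_SIZE)
        return -1;                  /* full: the event must go elsewhere */
    r->slots[r->tail & (RING_SIZE - 1)] = ev;
    r->tail++;
    return 0;
}

/* Consumer side: returns 0 and fills *out, or -1 if the ring is empty. */
int ring_pop(struct event_ring *r, struct event *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return -1;
    *out = r->slots[r->head & (RING_SIZE - 1)];
    r->head++;
    return 0;
}
```

Once RING_SIZE events are pending, ring_push() has to fail, so the producer needs an overflow policy (block, drop, or fall back to another queue). That policy is the part an API built around a ring has to specify and every consumer has to handle.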
in particular i'd like to (re-)stress this point: Thirdly, our main problem was not the structure of epoll, our main problem was that event APIs were not widely available, so applications couldn't go to a pure event based design - they always had to handle certain types of event domains specially, due to lack of coverage. The latest epoll patches largely address that. This was a huge barrier against adoption of epoll. starting with putting limits into the design by going to over-smart data structures like rings is just stupid. Let's fix, enhance and speed up what we have now (epoll) so that it becomes ubiquitous, and _then_ we can extend epoll to maybe fill events into rings. We should have our priorities right and should stop rewriting the whole world, especially when it comes to user APIs. Right now we have _no_ event API with complete coverage, and that's far more of a problem than the actual micro-structure of the API. Ingo -
I have rather strong counter-arguments:
(a) yes, it's how hardware does it, but if you actually look at hardware,
you quickly realize that every single piece of hardware uses a
*different* ring interface.
This should really tell you something. In fact, it may not be rings
at all, but structures with more complex formats (eg the USB
descriptors).
(b) yes, event-driven software tends to use some data structures that are
sometimes approximated by event rings, but they all use *different*
software structures. There simply *is* no common "event" structure:
each program tends to have its own issues, its own allocation
policies, and its own "ring" structures.
They may not be rings at all. They can be priority queues/heaps or
THERE IS NO INTERFACE! You're just making that up, and glossing over the
most important part of the whole thing!
If you could actually point to something specific that matches what
everybody needs, and is architecture-neutral, it would be a different
issue. As is, you're just saying "memory-mapped interfaces" without
actually going into enough detail to show HOW MUCH IT SUCKS.
There really are very few programs that would use them. We had a trivial
benchmark, the only function of which was to show usage, and here Ingo and
Evgeniy are (once more) talking about bugs in that one months later.
THAT should tell you something.
Make poll/select/aio/read etc faster. THAT is where the payoffs are.
In fact, if somebody wants to look at a standard interface that could be
speeded up, the prime thing to look at is "readdir()" (aka getdents).
Making _that_ thing go faster and scale better and do read-ahead is likely
to be a lot more important for performance. It was one of the bottle-necks
for samba several years ago, and nobody has really tried to improve it.
And yes, that's because it's hard - people would rather make up new
interfaces that are largely irrelevant even before ...
-
looking over the list of our new generic APIs (see further below) i
think there are three important things that are needed for an API to
become widely used:
1) it should solve a real problem (ha ;-), it should be intuitive to
humans and it should fit into existing things naturally.
2) it should be ubiquitous. (if it's about IO it should cover block IO,
network IO, timers, signals and everything) Even if it might look
silly in some of the cases, having complete, utter, no compromises,
100% coverage for everything massively helps the uptake of an API,
because it allows the user-space coder to pick just one paradigm
that is closest to his application and stick to it and only to it.
3) it should be end-to-end supported by glibc.
our failed API attempts so far were:
- sendfile(). This API mainly failed on #2. It partly failed on #1 too.
(couldnt be used in certain types of scenarios so was unintuitive.)
splice() fixes this almost completely.
- KAIO. It fails on #2 and #3.
our more successful new APIs:
- futexes. After some hiccups they form the base of all modern
user-space locking.
- splice. (a bit too early to tell but it's looking good so far. Would
be nice if someone did a brute-force memcpy() based vmsplice to user
memory, just to make usage fully symmetric.)
partially successful, not yet failed new APIs:
- epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but
not completely). Despite the non-complete coverage of event domains a
good number of apps are using it, and in particular a couple really
'high end' apps with massive amounts of event sources - which apps
would have no chance with poll, select or threads.
- inotify. It's being used quite happily on the desktop, despite some
of its limitations. (Possibly integratable into epoll?)
Ingo
-
Heh, I actually agree, at least then the interface is complete! We can
always replace it with something more clever, should someone feel so
inclined. Here's a rough patch to do that, it's totally untested (but it
compiles). sparse will warn about the __user removal, though. I'm sure
viro would shoot me dead on the spot, should he see this...
diff --git a/fs/splice.c b/fs/splice.c
index 12f2828..5023c01 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -657,9 +657,9 @@ out_ret:
* key here is the 'actor' worker passed in that actually moves the data
* to the wanted destination. See pipe_to_file/pipe_to_sendpage above.
*/
-ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
- struct file *out, loff_t *ppos, size_t len,
- unsigned int flags, splice_actor *actor)
+ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, void *actor_priv,
+ loff_t *ppos, size_t len, unsigned int flags,
+ splice_actor *actor)
{
int ret, do_wakeup, err;
struct splice_desc sd;
@@ -669,7 +669,7 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
sd.total_len = len;
sd.flags = flags;
- sd.file = out;
+ sd.file = actor_priv;
sd.pos = *ppos;
for (;;) {
@@ -1240,28 +1240,104 @@ static int get_iovec_page_array(const struct iovec __user *iov,
return error;
}
+static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+ struct splice_desc *sd)
+{
+ int ret;
+
+ ret = buf->ops->pin(pipe, buf);
+ if (!ret) {
+ void __user *dst = sd->userptr;
+ /*
+ * use non-atomic map, can be optimized to map atomically if we
+ * prefault the user memory.
+ */
+ char *src = buf->ops->map(pipe, buf, 0);
+
+ if (copy_to_user(dst, src, sd->len))
+ ret = -EFAULT;
+
+ buf->ops->unmap(pipe, buf, src);
+
+ if (!ret)
+ return sd->len;
+ }
+
+ return ret;
+}
+
+/*
+ * For lack of a better implementation, implement vmsplice() to userspace
+ * as a simple copy of the pipes pages to the user iov.
+ */
+static ...
-
I wonder how useful it would be to reimplement sendfile() using splice(), either in glibc or inside the kernel itself? sendfile() does get used a fair bit, but I really doubt that anyone outside of a handful of people on this list actually uses splice(). Cheers -
It's indeed the plan, I even have a git branch for it. Just never took the time to actually finish it. http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=splice-sendfile -- Jens Axboe -
I'd like that, if only because right now we have two separate paths that kind of do the same thing, and splice really is the only one that is generic. I thought Jens even had some experimental patches for it. It might be worth to "just do it" - there's some internal overhead, but on the other hand, it's also likely the best way to make sure any issues get sorted out. Linus -
I do, this is a one year old patch that does that: http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=f8f550e027fd07ad8d871101788... I'll update it, test, and submit for 2.6.23. -- Jens Axboe -
Last time I played with splice(), I found a bug with readahead logic, most probably because nobody but me tried it before. (corrected by Fengguang Wu in commit 9ae9d68cbf3fe0ec17c17c9ecaa2188ffb854a66 ) So yes, reimplementing sendfile() should help to find the last splice() bugs, and as a bonus it could add non-blocking disk io (O_NONBLOCK on input file -> socket) -
Well, to get those kinds of advantages, you'd have to use splice directly, since sendfile() hasn't supported nonblocking disk IO, and the interface doesn't really allow for it. In fact, since nonblocking accesses require also some *polling* method, and we don't have that for files, I suspect the best option for those things is to simply mix AIO and splice(). AIO tends to be the right thing for disk waits (read: short, often cached), and if we can improve AIO performance for the cached accesses (which is exactly what the threadlets should hopefully allow us to do), I would seriously suggest going that route. But the pure "use splice to _implement_ sendfile()" thing is worth doing for all the other reasons, even if nonblocking file access is not likely one of them. Linus -
The sendfile() interface doesn't allow it, but if you open("somediskfile", O_RDONLY | O_NONBLOCK), then a splice()-based sendfile() can perform non-blocking disk io (while starting an io with readahead). I actually use this trick myself :) (splice(disk -> pipe, NONBLOCK), splice(pipe -> worker)) -
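Editorial sketch (not from the thread): a sendfile()-like copy built from two splice() calls through an intermediate pipe, as discussed in the messages above, might look like this from user space. The function name copy_via_pipe() is made up for the example. For simplicity the sketch uses blocking calls; the trick described above would additionally open the input with O_NONBLOCK and pass SPLICE_F_NONBLOCK on the first splice(), and would then need to handle EAGAIN.

```c
#define _GNU_SOURCE             /* for splice() and SPLICE_F_MOVE */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd via a
 * private pipe. Returns the number of bytes moved, or -1 on error.
 */
ssize_t copy_via_pipe(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    if (pipe(p) < 0)
        return -1;

    while (len > 0) {
        /* Step 1: source -> pipe. */
        ssize_t in = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);

        if (in < 0) {
            if (total == 0)
                total = -1;
            break;
        }
        if (in == 0)
            break;              /* end of input */
        len -= (size_t)in;

        /* Step 2: pipe -> destination, until the pipe is drained. */
        while (in > 0) {
            ssize_t out = splice(p[0], NULL, out_fd, NULL,
                                 (size_t)in, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }

done:
    close(p[0]);
    close(p[1]);
    return total;
}
```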
I think, as Linus pointed out (as I did a few months ago), that there's confusion about the term "Unification" or "Single Interface". Unification is not about fetching all the data coming from the more diverse sources, into a single interface. That is just broken, because each data source wants a different data structure to be reported. This is ABI-hell 101. Unification is the ability to uniformly wait for readiness, and then fetch data with source-dependent collectors (read(2), io_getevents(2), ...). That way you have ABI isolation on the single data source, and not monster structures trying to blob together the more diverse data formats. AFAIK, inotify works with select/poll/epoll as is. - Davide -
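Editorial sketch (not from the thread): "uniformly wait for readiness, then fetch data with source-dependent collectors" can be illustrated with epoll watching two unrelated kinds of descriptor. The choice of a pipe and an eventfd as the two sources, and the names struct source, watch() and wait_and_collect(), are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>    /* eventfd(), used by callers to create a source */
#include <unistd.h>

enum source_kind { SRC_PIPE, SRC_EVENTFD };

struct source {
    enum source_kind kind;
    int fd;
};

/* Register a source for readability; epoll only reports readiness. */
int watch(int epfd, struct source *src)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = src };

    assert(src != NULL);
    return epoll_ctl(epfd, EPOLL_CTL_ADD, src->fd, &ev);
}

/*
 * Wait for one ready source, then collect its data with the call that
 * fits that source. Returns the byte count for a pipe, the counter
 * value for an eventfd, or -1 on error. *kind reports which one fired.
 */
long wait_and_collect(int epfd, char *buf, size_t buflen,
                      enum source_kind *kind)
{
    struct epoll_event ev;
    struct source *src;
    uint64_t count;

    if (epoll_wait(epfd, &ev, 1, -1) != 1)
        return -1;

    src = ev.data.ptr;
    *kind = src->kind;

    switch (src->kind) {
    case SRC_PIPE:          /* byte stream: plain read() */
        return (long)read(src->fd, buf, buflen);
    case SRC_EVENTFD:       /* 8-byte counter with its own format */
        if (read(src->fd, &count, sizeof(count)) != sizeof(count))
            return -1;
        return (long)count;
    }
    return -1;
}
```

The epoll_event itself carries no payload beyond a pointer back to the source, so each source keeps its own data format and adding a new kind of source does not change the wait interface.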
On Tue, May 29 2007, Zach Brown wrote: Yeah, it'll confuse CFQ a lot actually. The threads either need to share an io context (clean approach, however will introduce locking for things that were previously lockless), or CFQ needs to get better support for cooperating processes. The problem is that CFQ will wait for a dependent IO for a given process, which may arrive from a totally unrelated process. For the fio testing, we can make some improvements there. Right now you don't get any concurrency of the io requests if you set eg iodepth=32, as the 32 requests will be submitted as a linked chain of atoms. For io saturation, that's not really what you want. I'll take a stab at improving both of the above. -- Jens Axboe -
Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm using fio's libaio engine. I'm not testing the syslet syscall interface yet. - z -
Ah ok, then there's no issue from that end! -- Jens Axboe -
