Hi everybody, These patches are a follow up on two previous [rfc] threads: - new sk_buff member: hwstamp [1] - support for IEEE 1588 [2] [1] http://www.spinics.net/lists/netdev/msg68804.html [2] http://www.spinics.net/lists/netdev/msg67799.html The idea is to associate control messages with skbs - for both out-going and in-going traffic. The RX case seems pretty straight forward. For the TX case, I am queueing an empty (or a clone of the original) skb (with control messages attached) to the error queue of the socket the TX skb belongs to. To make an idea about how would this be used from userspace, I've added a simple userspace demo for getting TX hardware timestamps. I still ended up adding a new skb member but since this is a bit more generic maybe it can offset the cost? Thanks, tavi --- #include <stdlib.h> #include <string.h> #include <error.h> #include <stdio.h> #include <errno.h> #include <unistd.h> #include <sys/types.h> #include <sys/socket.h> #include <sys/poll.h> #include <sys/uio.h> #include <netinet/in.h> typedef unsigned long long __u64; typedef unsigned long long __u32; #define SOL_SKB 275 #include "include/linux/skb_cmsg.h" struct cmsg_tx_tstamp { struct cmsghdr hdr; struct skb_tx_tstamp stt __attribute__ ((packed)); }; int main() { char buffer[100]; struct iovec iovec = { .iov_base = buffer, .iov_len = sizeof(buffer) }; struct cmsg_tx_tstamp ctt = { .hdr = { .cmsg_len = sizeof(ctt), .cmsg_level = SOL_SKB, .cmsg_type = SKB_TX_GET_TSTAMP }, .stt = { .cookie = 0x0bad0badbeefbeefULL, } }; struct sockaddr_in daddr = { .sin_family = AF_INET, .sin_port = 8888, .sin_addr = { htonl(0x01010101) } }; struct msghdr msg = { .msg_iov = &iovec, .msg_iovlen = 1, .msg_control = &ctt, .msg_controllen = sizeof(ctt), .msg_name = &daddr, .msg_namelen = sizeof(daddr) }; int s=socket(PF_INET, SOCK_DGRAM, 0); struct pollfd pollfd = { .fd = s, .events = ...
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
include/linux/skbuff.h | 2 +-
include/net/ip.h | 1 +
net/ipv4/icmp.c | 2 ++
net/ipv4/ip_output.c | 2 ++
net/ipv4/ip_sockglue.c | 6 ++++++
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 1 +
7 files changed, 14 insertions(+), 1 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2988d1..2d394ed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1762,7 +1762,7 @@ struct skb_cmsg *skb_cmsg_alloc(int type, int len, int gfp);
*/
static inline void skb_cmsg_add(struct skb_cmsg **head, struct skb_cmsg *sc)
{
- sc->next = (*head)->next;
+ sc->next = *head;
*head = sc;
}
diff --git a/include/net/ip.h b/include/net/ip.h
index 3b40bc2..ccb8be7 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -54,6 +54,7 @@ struct ipcm_cookie
__be32 addr;
int oif;
struct ip_options *opt;
+ struct skb_cmsg *skb_cmsg;
};
#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 8739735..68da287 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -377,6 +377,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
inet->tos = ip_hdr(skb)->tos;
daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.skb_cmsg = NULL;
if (icmp_param->replyopts.optlen) {
ipc.opt = &icmp_param->replyopts;
if (ipc.opt->srr)
@@ -534,6 +535,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
inet_sk(sk)->tos = tos;
ipc.addr = iph->saddr;
ipc.opt = &icmp_param.replyopts;
+ ipc.skb_cmsg = NULL;
{
struct flowi fl = {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e527628..da85d42 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -941,6 +941,7 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
+ skb_shinfo(skb)->cmsg = ...SKB_RX_GET_TSTAMP - get hardware timestamp at receive time
SKB_TX_GET_TSTAMP - get hardware timestamp at transmit time
SKB_TX_SCHED - schedule packet to be transmited at hardware timestamp
SKB_TX_EMBED_TSTAMP - embed hardware timestamp into packet at
transmit time
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
include/linux/skb_cmsg.h | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)
create mode 100644 include/linux/skb_cmsg.h
diff --git a/include/linux/skb_cmsg.h b/include/linux/skb_cmsg.h
new file mode 100644
index 0000000..c5b8a9b
--- /dev/null
+++ b/include/linux/skb_cmsg.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_SKB_CMSG_H
+#define _LINUX_SKB_CMSG_H
+
+#define SKB_RX_GET_TSTAMP 1
+#define SKB_TX_GET_TSTAMP 2
+#define SKB_TX_SCHED 3
+#define SKB_TX_EMBED_TSTAMP 4
+
+struct skb_rx_tstamp {
+ __u64 tstamp;
+};
+
+struct skb_tx_tstamp {
+ __u64 tstamp, cookie;
+};
+
+struct skb_tx_sched {
+ __u64 tstamp;
+};
+
+#define SKB_TX_EMBED_TSTAMP_OFFSET_EOP 0x00010000
+
+struct skb_tx_embed_tstamp {
+ __u64 mask;
+ __u32 offset;
+};
+
+#endif
--
1.5.6.2
--
This patch introduces per skb control messages that can be used to
directly exchange information between the application and the
hardware, at a per packet level. Examples of usecases are: RX/TX
hardware timestamps, TX scheduling (request hardware to send packets
at a future time).
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
include/linux/skbuff.h | 62 +++++++++++++++++++++++++
include/linux/socket.h | 1 +
include/net/sock.h | 2 +
net/core/skbuff.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++-
net/core/sock.c | 46 +++++++++++++++++++
5 files changed, 227 insertions(+), 2 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f24d261..f2988d1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -135,6 +135,17 @@ struct skb_frag_struct {
__u32 size;
};
+/*
+ * Control message queue; used to exchange additional information between the
+ * application and hardware (e.g. RX/TX hardware timestamps, TX scheduling,
+ * etc.)
+ */
+struct skb_cmsg {
+ struct skb_cmsg *next;
+ int type, len;
+ char data[0];
+};
+
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
*/
@@ -148,6 +159,7 @@ struct skb_shared_info {
__be32 ip6_frag_id;
struct sk_buff *frag_list;
skb_frag_t frags[MAX_SKB_FRAGS];
+ struct skb_cmsg *cmsg;
};
/* We divide dataref into two halves. The higher 16 bits hold references
@@ -1719,5 +1731,55 @@ static inline void skb_forward_csum(struct sk_buff *skb)
}
bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+
+#define skb_cmsg_queue(skb) (skb_shinfo(skb)->cmsg)
+
+/**
+ * skb_cmsg_for_each - iterate over a control message list
+ * @i: the loop cursor (a struct skb_cmsg *)
+ * @head: the head (a struct skb_cmsg **)
+ */
+#define skb_cmsg_for_each(i, head) for (i = *head; i != NULL; \
+ i = i->next)
+
+void skb_cmsg_free(struct skb_cmsg ...Putting it in the share info area makes its life-cycle management difficult. This is also a major change to how we store skb control What about the other callers of copy_skb_header, like pskb_copy? In fact why isn't this in copy_skb_header? Also what about pskb_expand_head? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
On Thursday 24 July 2008, Herbert Xu wrote: Hi Herbert, Initially I tried putting the new member directly into the skb, but then I run into issues with cloned packets, since I needed a reference counter to know when to delete the cmsg queue. I've noticed that the shared info already has a reference counter and I've though that taking advantage of it would be better then doing it myself. The shinfo seems to be the right place also because the cmsg queue can not be modified once the skb has been pushed into the network stack. It sort of is like data, only with special meaning. From what I understand, the shinfo remains shared, thus there is no need to do In copy_skb_header we do a pointer copy. This should be enough for packets that are cloned or partially cloned. The only problem is when we do a full copy of the packet, where shinfo should I think this is only touching the skb header, not the shinfo. Thus we should be safe? BTW: I have a dilemma with regard to fully copied skbs: should we copy the cmsg queue as well, or should we just prune it in the copy? I don't see a real usecase for doing the copy at this point, but since these is at core level, maybe it is a good idea to be conservative and do the copying? Thanks, tavi --
No pskb_copy does a full copy on the head portion of the packet But you're freeing the cmsg in skb_release_data which is called Do we need to make it so generic? If we do then can you please come up with other uses for it too. For example, find out what existing fields can be better suited by storing them as a cmsg. Only then can we design this properly. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
It looks like we can switch a few members from the skb to cmsg: secmark, priority, tstamp, thus saving some space in the skb, but we will have a performance overhead for accessing these since we now have to search for them in the cmesg list. So, the more I think about this the more I am inclined to think that we don't really need it. For our hw timestamp usecases, the reason of this patch, we can get away by adding these new fields in the skb: u16 flags; /* request / signal presence of RX/TX hw stamps, TX scheduling, TX embed hw timestamp in packet */ u8 hwstamp[6]; and of course the mechanism of enabling hw stamps requests on per skb, with msghdr.msg_control stuff. Would that be acceptable (with proper CONFIG_HW_TSTAMPS of course)? Thanks, tavi --
From: Octavian Purdila <opurdila@ixiacom.com> Adding new fields to struct sk_buff that take up space is generally not allowed unless the new field adds substantially to the benefit of a large group of users of Linus. I don't think that applied here for this hw-tstamps stuff. And yes, this is even with the config option protection, because every distribution is going to turn all the options on, which makes them essentially pointless from a sk_buff overhead perspective. --
What about the approach proposed in the patch? Is it ok to add a pointer which may resolve other future similar issues? Thanks, tavi --
From: Octavian Purdila <opurdila@ixiacom.com> Traversing the list and maintaining the reference counting and sharing issues is expensive and error prone. This is what OpenBSD uses for their IPSEC state attached to MBUFs and it's a nightmare. We have a timestamp in the SKB already, why don't you simply override it when your feature is enable and set a single flag bit that indicates you used a HW timestamp to set that timestamp? --
On Thu, 24 Jul 2008 14:56:00 -0700 (PDT) How hard would it be to add PLL support to use same kind of timestamp. Enlist the help of some NTP/clock experts to help. --
Let me see if I got this correctly: modify the hardware so that the CPU is synchronized with the NIC timestamping unit. If so, I will agree that this is the cleanest possible solution. Thanks, tavi --
On Fri, 25 Jul 2008 01:35:52 +0300
Perodically, sample the hardware clock and the system time of day,
and resynchronize the clocks. I did something like this in a prototype
that never got integrated...
------------------------
Subject: sky2: hardware receive timestamp counter
Use hardware timestamp counter to stamp receive packets.
It allows for higher resolution values without the hardware overhead
of doing gettimeofday.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
---
drivers/net/sky2.c | 159 +++++++++++++++++++++++++++++++++++++++--------------
drivers/net/sky2.h | 5 +
2 files changed, 124 insertions(+), 40 deletions(-)
--- a/drivers/net/sky2.c 2007-08-29 11:41:16.000000000 -0700
+++ b/drivers/net/sky2.c 2007-08-30 13:05:39.000000000 -0700
@@ -26,6 +26,7 @@
#include <linux/kernel.h>
#include <linux/version.h>
#include <linux/module.h>
+#include <linux/reciprocal_div.h>
#include <linux/netdevice.h>
#include <linux/dma-mapping.h>
#include <linux/etherdevice.h>
@@ -2186,6 +2187,36 @@ error:
goto resubmit;
}
+static inline u32 sky2_tist2ns(const struct sky2_hw *hw, u32 clk)
+{
+ return reciprocal_divide(clk * 1000, hw->tist_rate);
+}
+
+/*
+ * Convert hardware timestamp clock into a real time value
+ */
+static void sky2_set_timestamp(struct sk_buff *skb,
+ const struct sky2_hw *hw, u32 stamp)
+{
+ unsigned long seq;
+
+ do {
+ s32 delta;
+
+ seq = read_seqbegin(&hw->tist_lock);
+
+ /* ticks since last synchronization */
+ delta = stamp - hw->tist_base;
+
+ if (delta > 0)
+ skb->tstamp = ktime_add_ns(hw->tist_real,
+ sky2_tist2ns(hw, delta));
+ else
+ skb->tstamp = ktime_sub_ns(hw->tist_real,
+ sky2_tist2ns(hw, -delta));
+ } while (read_seqretry(&hw->tist_lock, seq));
+}
+
/* Transmit complete */
static inline void sky2_tx_done(struct net_device *dev, u16 last)
{
@@ -2262,6 +2293,16 @@ static int sky2_status_intr(struct sky2_
break;
#ifdef ...I thought of something similar, but I am not sure if I can to so, as it seems that the skb->tstamp requires current gettimeofday semantics at least in netfilter's ipt_time module. Thanks, tavi --
From: Octavian Purdila <opurdila@ixiacom.com> Can your timestamp format at least be converted to gettimeofday() format? I thought we had a ton of accessor functions that code uses to access the timestamp? You should be able to do your translation in those routines. --
Sure but the problem is that the NIC hw timestamp is not synced with the CPU time. In that case I think that the netfilter rules which are looking at the timestamp will be messed up. [ A bit of general context about this annoying hw timestamp thing I keep bringing up here:) I know that this is a very specific thing and there is probably not a clean solution to this and we will probably have to go with an in internal patch approach. But the thing is that we accumulated a lot such internal patches to the point that makes it very hard to upgrade and track a recent Linux version. And I feel that we need to stop adding new stuff in this pile, otherwise we will not be able to keep up. Thus my fixation with this very specific and not so significant thing in the great Linux ecosystem. Probably I just chosen the wrong patch to battle. ] Thanks, tavi --
From: Octavian Purdila <opurdila@ixiacom.com> Right, but you can maintain a suitable "delta" between the two. --
OK, I think I got the idea now: - use a bit in the timestamp to specify that this is a hw timestamp - use a new net_device method to convert a hw timestamp to a gettimeofday timestamp This is great, thanks a lot ! tavi --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgb |
