[RFC][PATCH 3/3] net: add SKB_SOL control messages

Previous thread: [PATCH] xt_hashlimit: fix race between htable_destroy and htable_gc by Pavel Emelyanov on Wednesday, July 23, 2008 - 2:51 pm. (2 messages)

Next thread: Re: recent -git: BUG in free_thread_xstate by Vegard Nossum on Wednesday, July 23, 2008 - 3:45 pm. (5 messages)
From: Octavian Purdila
Date: Wednesday, July 23, 2008 - 3:01 pm

Hi everybody,

These patches are a follow up on two previous [rfc] threads: 
- new sk_buff member: hwstamp [1]
- support for IEEE 1588 [2]

[1] http://www.spinics.net/lists/netdev/msg68804.html
[2] http://www.spinics.net/lists/netdev/msg67799.html

The idea is to associate control messages with skbs - for both 
out-going and in-going traffic. 

The RX case seems pretty straight forward.  

For the TX case, I am queueing an empty (or a clone of the original)
skb (with control messages attached) to the error queue of the socket
the TX skb belongs to.

To make an idea about how would this be used from userspace, I've
added a simple userspace demo for getting TX hardware timestamps.

I still ended up adding a new skb member but since this is a bit more
generic maybe it can offset the cost?

Thanks,
tavi
---
#include <stdlib.h>
#include <string.h>
#include <error.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/poll.h>
#include <sys/uio.h>
#include <netinet/in.h>

typedef unsigned long long __u64;
typedef unsigned long long __u32;

#define SOL_SKB 275

#include "include/linux/skb_cmsg.h"

struct cmsg_tx_tstamp  {
	struct cmsghdr hdr;
	struct skb_tx_tstamp stt __attribute__ ((packed));
};


int main()
{
	char buffer[100];
	struct iovec iovec = {
		.iov_base = buffer,
		.iov_len = sizeof(buffer)
	};
	struct cmsg_tx_tstamp ctt = {
		.hdr = {
			.cmsg_len = sizeof(ctt),
			.cmsg_level = SOL_SKB,
			.cmsg_type = SKB_TX_GET_TSTAMP
		},
		.stt = {
			.cookie = 0x0bad0badbeefbeefULL,
		}
	};
	struct sockaddr_in daddr = {
		.sin_family = AF_INET,
		.sin_port = 8888,
		.sin_addr = { htonl(0x01010101) }
	};
	struct msghdr msg = {
		.msg_iov = &iovec,
		.msg_iovlen = 1,
		.msg_control = &ctt,
		.msg_controllen = sizeof(ctt),
		.msg_name = &daddr,
		.msg_namelen = sizeof(daddr)
	};
	int s=socket(PF_INET, SOCK_DGRAM, 0);
	struct pollfd pollfd = {
		.fd = s,
		.events = ...
From: Octavian Purdila
Date: Wednesday, July 23, 2008 - 3:01 pm

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
 include/linux/skbuff.h |    2 +-
 include/net/ip.h       |    1 +
 net/ipv4/icmp.c        |    2 ++
 net/ipv4/ip_output.c   |    2 ++
 net/ipv4/ip_sockglue.c |    6 ++++++
 net/ipv4/raw.c         |    1 +
 net/ipv4/udp.c         |    1 +
 7 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2988d1..2d394ed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1762,7 +1762,7 @@ struct skb_cmsg *skb_cmsg_alloc(int type, int len, int gfp);
  */
 static inline void skb_cmsg_add(struct skb_cmsg **head, struct skb_cmsg *sc)
 {
-	sc->next = (*head)->next;
+	sc->next = *head;
 	*head = sc;
 }
 
diff --git a/include/net/ip.h b/include/net/ip.h
index 3b40bc2..ccb8be7 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -54,6 +54,7 @@ struct ipcm_cookie
 	__be32			addr;
 	int			oif;
 	struct ip_options	*opt;
+	struct skb_cmsg		*skb_cmsg;
 };
 
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 8739735..68da287 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -377,6 +377,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	inet->tos = ip_hdr(skb)->tos;
 	daddr = ipc.addr = rt->rt_src;
 	ipc.opt = NULL;
+	ipc.skb_cmsg = NULL;
 	if (icmp_param->replyopts.optlen) {
 		ipc.opt = &icmp_param->replyopts;
 		if (ipc.opt->srr)
@@ -534,6 +535,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	inet_sk(sk)->tos = tos;
 	ipc.addr = iph->saddr;
 	ipc.opt = &icmp_param.replyopts;
+	ipc.skb_cmsg = NULL;
 
 	{
 		struct flowi fl = {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e527628..da85d42 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -941,6 +941,7 @@ alloc_new_skb:
 			skb->ip_summed = csummode;
 			skb->csum = 0;
 			skb_reserve(skb, hh_len);
+			skb_shinfo(skb)->cmsg = ...
From: Octavian Purdila
Date: Wednesday, July 23, 2008 - 3:01 pm

SKB_RX_GET_TSTAMP - get hardware timestamp at receive time
SKB_TX_GET_TSTAMP - get hardware timestamp at transmit time
SKB_TX_SCHED - schedule packet to be transmited at hardware timestamp
SKB_TX_EMBED_TSTAMP - embed hardware timestamp into packet at
                    transmit time

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
 include/linux/skb_cmsg.h |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/skb_cmsg.h

diff --git a/include/linux/skb_cmsg.h b/include/linux/skb_cmsg.h
new file mode 100644
index 0000000..c5b8a9b
--- /dev/null
+++ b/include/linux/skb_cmsg.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_SKB_CMSG_H
+#define _LINUX_SKB_CMSG_H
+
+#define SKB_RX_GET_TSTAMP      1
+#define SKB_TX_GET_TSTAMP      2
+#define SKB_TX_SCHED           3
+#define SKB_TX_EMBED_TSTAMP    4
+
+struct skb_rx_tstamp {
+       __u64 tstamp;
+};
+
+struct skb_tx_tstamp {
+       __u64 tstamp, cookie;
+};
+
+struct skb_tx_sched {
+       __u64 tstamp;
+};
+
+#define SKB_TX_EMBED_TSTAMP_OFFSET_EOP 0x00010000
+
+struct skb_tx_embed_tstamp {
+       __u64 mask;
+       __u32 offset;
+};
+
+#endif
-- 
1.5.6.2

--

From: Octavian Purdila
Date: Wednesday, July 23, 2008 - 3:01 pm

This patch introduces per skb control messages that can be used to
directly exchange information between the application and the
hardware, at a per packet level. Examples of usecases are: RX/TX
hardware timestamps, TX scheduling (request hardware to send packets
at a future time).

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
 include/linux/skbuff.h |   62 +++++++++++++++++++++++++
 include/linux/socket.h |    1 +
 include/net/sock.h     |    2 +
 net/core/skbuff.c      |  118 +++++++++++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c        |   46 +++++++++++++++++++
 5 files changed, 227 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f24d261..f2988d1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -135,6 +135,17 @@ struct skb_frag_struct {
 	__u32 size;
 };
 
+/*
+ * Control message queue; used to exchange additional information between the
+ * application and hardware (e.g. RX/TX hardware timestamps, TX scheduling,
+ * etc.)
+ */
+struct skb_cmsg {
+	struct skb_cmsg *next;
+	int type, len;
+	char data[0];
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -148,6 +159,7 @@ struct skb_shared_info {
 	__be32          ip6_frag_id;
 	struct sk_buff	*frag_list;
 	skb_frag_t	frags[MAX_SKB_FRAGS];
+	struct skb_cmsg *cmsg;
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references
@@ -1719,5 +1731,55 @@ static inline void skb_forward_csum(struct sk_buff *skb)
 }
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+
+#define skb_cmsg_queue(skb) (skb_shinfo(skb)->cmsg)
+
+/**
+ * skb_cmsg_for_each - iterate over a control message list
+ * @i: the loop cursor (a struct skb_cmsg *)
+ * @head: the head (a struct skb_cmsg **)
+ */
+#define skb_cmsg_for_each(i, head) for (i = *head; i != NULL; \
+				       i = i->next)
+
+void skb_cmsg_free(struct skb_cmsg ...
From: Herbert Xu
Date: Thursday, July 24, 2008 - 5:46 am

Putting it in the share info area makes its life-cycle management
difficult.  This is also a major change to how we store skb control

What about the other callers of copy_skb_header, like pskb_copy?
In fact why isn't this in copy_skb_header?

Also what about pskb_expand_head?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Octavian Purdila
Date: Thursday, July 24, 2008 - 6:34 am

On Thursday 24 July 2008, Herbert Xu wrote:

Hi Herbert,


Initially I tried putting the new member directly into the skb, but then I
run into issues with cloned packets, since I needed a reference
counter to know when to delete the cmsg queue.

I've noticed that the shared info already has a reference counter and I've 
though that taking advantage of it would be better then doing it myself. 

The shinfo seems to be the right place also because the cmsg queue can not be 
modified once the skb has been pushed into the network stack. It sort of is 
like data, only with special meaning.


From what I understand, the shinfo remains shared, thus there is no need to do 

In copy_skb_header we do a pointer copy. This should be enough for packets 
that are cloned or partially cloned.

The only problem is when we do a full copy of the packet, where shinfo should 

I think this is only touching the skb header, not the shinfo. Thus we should 
be safe?

BTW: I have a dilemma with regard to fully copied skbs: should we copy the 
cmsg queue as well, or should we just prune it in the copy? I don't see a 
real usecase for doing the copy at this point, but since these is at core 
level, maybe it is a good idea to be conservative and do the copying?

Thanks,
tavi
--

From: Herbert Xu
Date: Thursday, July 24, 2008 - 8:01 am

No pskb_copy does a full copy on the head portion of the packet

But you're freeing the cmsg in skb_release_data which is called

Do we need to make it so generic? If we do then can you please
come up with other uses for it too.  For example, find out what
existing fields can be better suited by storing them as a cmsg.
Only then can we design this properly.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Octavian Purdila
Date: Thursday, July 24, 2008 - 9:22 am

It looks like we can switch a few members from the skb to cmsg: secmark, 
priority, tstamp, thus saving some space in the skb, but we will have a 
performance overhead for accessing these since we now have to search for them 
in the cmesg list.

So, the more I think about this the more I am inclined to think that we don't 
really need it. For our hw timestamp usecases, the reason of this patch, we 
can get away by adding these new fields in the skb:

u16 flags; /* request / signal presence of RX/TX hw stamps, TX scheduling, TX 	
		    embed hw timestamp in packet */
u8 hwstamp[6];

and of course the mechanism of enabling hw stamps requests on per skb, with 
msghdr.msg_control stuff. 

Would that be acceptable (with proper CONFIG_HW_TSTAMPS of course)?

Thanks,
tavi
--

From: David Miller
Date: Thursday, July 24, 2008 - 1:28 pm

From: Octavian Purdila <opurdila@ixiacom.com>

Adding new fields to struct sk_buff that take up space is generally
not allowed unless the new field adds substantially to the benefit of
a large group of users of Linus.

I don't think that applied here for this hw-tstamps stuff.

And yes, this is even with the config option protection, because every
distribution is going to turn all the options on, which makes them
essentially pointless from a sk_buff overhead perspective.
--

From: Octavian Purdila
Date: Thursday, July 24, 2008 - 2:49 pm

What about the approach proposed in the patch? Is it ok to add a pointer which 
may resolve other future similar issues?

Thanks,
tavi
--

From: David Miller
Date: Thursday, July 24, 2008 - 2:56 pm

From: Octavian Purdila <opurdila@ixiacom.com>

Traversing the list and maintaining the reference counting and sharing
issues is expensive and error prone.

This is what OpenBSD uses for their IPSEC state attached to MBUFs and
it's a nightmare.

We have a timestamp in the SKB already, why don't you simply override
it when your feature is enable and set a single flag bit that
indicates you used a HW timestamp to set that timestamp?
--

From: Stephen Hemminger
Date: Thursday, July 24, 2008 - 2:58 pm

On Thu, 24 Jul 2008 14:56:00 -0700 (PDT)

How hard would it be to add PLL support to use same kind of timestamp.
Enlist the help of some NTP/clock experts to help.
--

From: Octavian Purdila
Date: Thursday, July 24, 2008 - 3:35 pm

Let me see if I got this correctly: modify the hardware so that the CPU is 
synchronized with the NIC timestamping unit. If so, I will agree that this is 
the cleanest possible solution. 

Thanks,
tavi
--

From: Stephen Hemminger
Date: Thursday, July 24, 2008 - 4:05 pm

On Fri, 25 Jul 2008 01:35:52 +0300

Perodically, sample the hardware clock and the system time of day,
and resynchronize the clocks.  I did something like this in a prototype
that never got integrated...
------------------------
Subject: sky2: hardware receive timestamp counter

Use hardware timestamp counter to stamp receive packets.
It allows for higher resolution values without the hardware overhead
of doing gettimeofday.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>


---
 drivers/net/sky2.c |  159 +++++++++++++++++++++++++++++++++++++++--------------
 drivers/net/sky2.h |    5 +
 2 files changed, 124 insertions(+), 40 deletions(-)

--- a/drivers/net/sky2.c	2007-08-29 11:41:16.000000000 -0700
+++ b/drivers/net/sky2.c	2007-08-30 13:05:39.000000000 -0700
@@ -26,6 +26,7 @@
 #include <linux/kernel.h>
 #include <linux/version.h>
 #include <linux/module.h>
+#include <linux/reciprocal_div.h>
 #include <linux/netdevice.h>
 #include <linux/dma-mapping.h>
 #include <linux/etherdevice.h>
@@ -2186,6 +2187,36 @@ error:
 	goto resubmit;
 }
 
+static inline u32 sky2_tist2ns(const struct sky2_hw *hw, u32 clk)
+{
+	return reciprocal_divide(clk * 1000, hw->tist_rate);
+}
+
+/*
+ * Convert hardware timestamp clock into a real time value
+ */
+static void sky2_set_timestamp(struct sk_buff *skb,
+			       const struct sky2_hw *hw, u32 stamp)
+{
+	unsigned long seq;
+
+	do {
+		s32 delta;
+
+		seq = read_seqbegin(&hw->tist_lock);
+
+		/* ticks since last synchronization */
+		delta = stamp - hw->tist_base;
+
+		if (delta > 0)
+			skb->tstamp = ktime_add_ns(hw->tist_real,
+						   sky2_tist2ns(hw, delta));
+		else
+			skb->tstamp = ktime_sub_ns(hw->tist_real,
+						   sky2_tist2ns(hw, -delta));
+	} while (read_seqretry(&hw->tist_lock, seq));
+}
+
 /* Transmit complete */
 static inline void sky2_tx_done(struct net_device *dev, u16 last)
 {
@@ -2262,6 +2293,16 @@ static int sky2_status_intr(struct sky2_
 			break;
 
 #ifdef ...
From: Octavian Purdila
Date: Thursday, July 24, 2008 - 3:12 pm

I thought of something similar, but I am not sure if I can to so, as it seems 
that the skb->tstamp requires current gettimeofday semantics at least in 
netfilter's ipt_time module.

Thanks,
tavi
--

From: David Miller
Date: Thursday, July 24, 2008 - 3:17 pm

From: Octavian Purdila <opurdila@ixiacom.com>

Can your timestamp format at least be converted to
gettimeofday() format?

I thought we had a ton of accessor functions that code uses to access
the timestamp?  You should be able to do your translation in those
routines.
--

From: Octavian Purdila
Date: Thursday, July 24, 2008 - 4:14 pm

Sure but the problem is that the NIC hw timestamp is not synced with the CPU 
time. In that case I think that the netfilter rules which are looking at the 
timestamp will be messed up.

[ 
A bit of general context about this annoying hw timestamp thing I keep 
bringing up  here:) 

I know that this is a very specific thing and there is probably not a clean 
solution to this and we will probably have to go with an in internal patch 
approach. 

But the thing is that we accumulated a lot such internal patches to the point 
that makes it very hard to upgrade and track a recent Linux version. And I 
feel that we need to stop adding new stuff in this pile, otherwise we will 
not be able to keep up.

Thus my fixation with this very specific and not so significant thing in the 
great Linux ecosystem. Probably I just chosen the wrong patch to battle. 
]

Thanks,
tavi
--

From: David Miller
Date: Thursday, July 24, 2008 - 4:18 pm

From: Octavian Purdila <opurdila@ixiacom.com>

Right, but you can maintain a suitable "delta" between the two.
--

From: Octavian Purdila
Date: Thursday, July 24, 2008 - 4:26 pm

OK, I think I got the idea now:
- use a bit in the timestamp to specify that this is a hw timestamp
- use a new net_device method to convert a hw timestamp to a gettimeofday 
timestamp

This is great, thanks a lot !

tavi
--

Previous thread: [PATCH] xt_hashlimit: fix race between htable_destroy and htable_gc by Pavel Emelyanov on Wednesday, July 23, 2008 - 2:51 pm. (2 messages)

Next thread: Re: recent -git: BUG in free_thread_xstate by Vegard Nossum on Wednesday, July 23, 2008 - 3:45 pm. (5 messages)