Re: [IPv6] "sendmsg: invalid argument" to multicast group after some time

Previous thread: [PATCH] *icsk_ca_priv: Reduce inet_connection_sock size? by Daniele Lacamera on Sunday, August 31, 2008 - 11:26 am. (7 messages)

Next thread: [PATCH] *icsk_ca_priv: Reduce inet_connection_sock size? by Daniele Lacamera on Sunday, August 31, 2008 - 11:05 am. (1 message)
From: Bernhard Schmidt
Date: Sunday, August 31, 2008 - 11:20 am

Hello all,

this is about the same box as the message from Remi an hour ago, but
most probably not related.

I'm running a Teredo (RFC4830) relay on a i386 Xen domU with kernel
2.6.26 vanilla with the integrated pv_ops feature. This relay function
is implemented in a userspace daemon called Miredo which provides a tun
interface to the OS where native IPv6 for 2001::/32 is routed into. The
traffic is then handled in the userspace daemon and emitted encapsulated
in IPv4/UDP. Apart from a few scalability problems which seem to be
related to the neighbor or route cache size it works fine. The machine
is doing around 2kpps of IPv6 traffic (which means 4kpps of IPv4+IPv6).

As there are a couple of similar relays globally anycasted I'm supposed
to withdraw the route from BGP if the daemon or the machine fails. To do
this I'm running ripngd from the Quagga routing suite, which announces
the Teredo prefix to my core routers using RIPng (RFC2080). On a kernel
level RIPng is basically periodic UDP to a link-local multicast address 
[ff02::9]:521.

Every few hours this announcement fails (no announcements reach the core
routers anymore, which kill the routing entry after a timeout). ripngd
debugging claims that it could not send the announcement due to 
"Invalid argument". There are no outgoing packets in tcpdump anymore.

I even get the same error when doing a multicast ping6:
miredo:~# ping6 -I eth0 ff02::9
PING ff02::9(ff02::9) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
ping: sendmsg: Invalid argument                                        
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.018 ms (DUP!)

(fe80::216:3eff:feb9:29f5 is the box itself, it's the only one that ever
answers ... duplicate however)

ping6 to other multicast addresses, even in the same scope works fine

miredo:~# ping6 -I eth0 ff02::2                       
PING ff02::2(ff02::2) from fe80::216:3eff:feb9:29f5 ...
From: David Stevens
Date: Monday, September 1, 2008 - 6:03 am

Well, it'd certainly be good to identify exactly where in
the sendmsg path you're getting the EINVAL from. If
possible, I'd suggest putting in some debugging code
and reproducing it.

 If you can pick up the exact arguments when it's happening
via strace and send those here, that may help, but I wouldn't
expect those to be incorrect from ping6 only some of the
time...

I'll look into this some more if someone doesn't beat me to
it, but I'm travelling now so it will be a few days at least before
I get to it.

                                        +-DLS

--

From: Bernhard Schmidt
Date: Monday, September 1, 2008 - 10:01 am

Do you have any suggestions/patches? I'm not an experienced programmer,
and since it usually takes a couple of hours for this problem to appear
I probably would need a few years to get anything reasonable.

Btw, it's happening again just right now (2.6.27-rc5), I've set a static
route so I can run some debugging without affecting the service. So if
you have any debugging commands you would like me to run this is the

The working group:

# strace -e recvmsg,sendmsg ping6 -c 1 -I eth0 ff02::2
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\22\37\274H4x\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1\10"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\22\37\274H4x\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 256
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\22\37\274H4x\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
sendmsg(3, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(58), inet_pton(AF_INET6, "ff02::2", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, msg_iov(1)=[{"\200\0\0\0004x\0\1\22\37\274H\366\v\5\0\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 64}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_IPV6, cmsg_type=, ...}, msg_flags=MSG_PEEK|MSG_PROXY|MSG_WAITALL|MSG_CONFIRM|MSG_FIN|MSG_SYN|MSG_RST|MSG_CMSG_CLOEXEC|0x8bc0000}, 0) = 64
recvmsg(3, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "fe80::216:3eff:feb9:29f5", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=if_nametoindex("eth0")}, msg_iov(1)=[{"\201\0\305\n4x\0\1\22\37\274H\366\v\5\0\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 4208}], msg_controllen=36, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 64


The non-working RIPng ...
From: Bernhard Schmidt
Date: Monday, September 1, 2008 - 10:05 am

BTW, these are the sendmsg() calls from ripngd (strace attached to the
running process):

sendmsg(5, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(521), inet_pton(AF_INET6, "ff02::9", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, msg_iov(1)=[{"\2\1\0\0 \1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \1"..., 24}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_IPV6, cmsg_type=, ...}, msg_flags=0}, 0) = -1 EINVAL (Invalid argument)

Regards,
Bernhard
--

From: Pekka Savola
Date: Monday, September 1, 2008 - 10:57 am

It seems that in the latter case, you haven't specified the interface 
(sin6_scope_id=0), but in the former case you have.  You can't send to 
link-local multicast groups if you have multiple interface if the 
interface isn't specified.

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings
--

From: Bernhard Schmidt
Date: Monday, September 1, 2008 - 11:03 am

On Mon, Sep 01, 2008 at 08:57:51PM +0300, Pekka Savola wrote:


I did specify the interface:

miredo:~# ping6 -I eth0 ff02::2
PING ff02::2(ff02::2) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.034 ms
[...]
miredo:~# ping6 -I eth0 ff02::9
PING ff02::9(ff02::9) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
ping: sendmsg: Invalid argument
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.033 ms

I have no idea what the different flags mean or where they come from.

Also please note that broken ping6 is just a symptom, the real problem
is ripngd suddenly not being able to send to the multicast group
anymore.

Bernhard
--

From: Pekka Savola
Date: Tuesday, September 2, 2008 - 2:06 am

Sorry, as a matter of fact in the quoted text, both sin6_scope_id was 
0.  I was looking at recvmsg where it was different.

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings
--

From: Brian Haley
Date: Tuesday, September 2, 2008 - 6:57 am

Well, at least in the ping6 sources I have, msg_flags is never 
initialized before the sendmsg() call, and since it's allocated on the 
stack it can have random bits set.  Can you rebuild your ping6 with the 
attached patch and retry?

-Brian
From: Bernhard Schmidt
Date: Tuesday, September 2, 2008 - 8:00 am

Done, no change.

sendmsg(3, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(58), inet_pton(AF_INET6, "ff02::2", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, msg_iov(1)=[{"\200\0\0\0\3252\0\0010T\275H\274\314\7\0\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 64}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_IPV6, cmsg_type=, ...}, msg_flags=0}, 0) = 64

vs. 

sendmsg(3, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(58), inet_pton(AF_INET6, "ff02::9", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, msg_iov(1)=[{"\200\0\0\0\3162\0\1+T\275H\255K\16\0\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 64}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_IPV6, cmsg_type=, ...}, msg_flags=0}, 0) = -1 EINVAL (Invalid argument)

don't push too hard on ping6, I just included it to show that all
processes are affected sending to this particular group.

Bernhard
--

From: Brian Haley
Date: Tuesday, September 2, 2008 - 8:48 am

That was just the obvious answer to why the flags were different.  Since 
EINVAL is too generic to point at one place in the kernel code path, I'd 
second David Stevens' suggestion of finding where in the sendmsg() code 
this is coming from.

Maybe you can trace the miredo daemon to see what it's doing that might 
fix the problem if you don't want to start hacking in the kernel.

-Brian
--

From: David Stevens
Date: Monday, September 8, 2008 - 5:34 pm

Bernhard,
        I looked at this some more and didn't see anything obvious.
The send side doesn't need group membership to send, or anything
special, really. The only thing that comes to mind is that maybe you
have a bogus route installed (since you don't have a bogus interface
flag :-)).
        Can you do an "ip -6 route list" when it's happening?

        Also might be worthwhile to see the entire arg list, so
maybe using the "-s" option to strace to increase it, and we
probably only need sendmsg(), so maybe:

strace -s 1024 -e trace=sendmsg -e verbose=sendmsg ping6 -I eth0 ....


                                        +-DLS

--

From: Bernhard Schmidt
Date: Monday, September 8, 2008 - 5:38 pm

On Mon, Sep 08, 2008 at 05:34:00PM -0700, David Stevens wrote:


Sure, here we go

miredo:~# ip -6 route list
2001::/32 via fe80::1 dev teredo  metric 1024  mtu 1280 advmss 1220 hoplimit 4294967295
2001:1b10:100::1:1 via fe80::2c0:9fff:fe4b:8ccf dev eth0  proto zebra  metric 2  mtu 1500 advmss 1440 hoplimit 4294967295
2001:1b10:100::1:2 via fe80::2c0:9fff:fe4b:8a4d dev eth0  proto zebra  metric 2  mtu 1500 advmss 1440 hoplimit 4294967295
2001:1b10:100::21:1 via fe80::2c0:9fff:fe4b:8a4d dev eth0  proto zebra  metric 2  mtu 1500 advmss 1440 hoplimit 4294967295
2001:1b10:100::53:1 via fe80::2c0:9fff:fe4b:8ccf dev eth0  proto zebra  metric 2  mtu 1500 advmss 1440 hoplimit 4294967295
2001:1b10:100::119:1 via fe80::2c0:9fff:fe4b:8a4d dev eth0  proto zebra  metric 2  mtu 1500 advmss 1440 hoplimit 4294967295
2001:1b10:100::1:9000:1 via fe80::2c0:9fff:fe4b:8a4d dev eth0  proto zebra  metric 2  mtu 1500 advmss 1440 hoplimit 4294967295
2001:1b10:100:3::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 4294967295
fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 4294967295
fe80::/64 dev teredo  proto kernel  metric 256  mtu 1280 advmss 1220 hoplimit 4294967295
ff00::/8 dev eth0  metric 256  mtu 1500 advmss 1440 hoplimit 4294967295
ff00::/8 dev teredo  metric 256  mtu 1280 advmss 1220 hoplimit 4294967295

Working (all-hosts):
sendmsg(3, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(58), inet_pton(AF_INET6, "ff02::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, msg_iov(1)=[{"\200\0\0\0\25S\0\3\0\305\305H\212/\r\0\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30\31\32\33\34\35\36\37 !\"#$%&'()*+,-./01234567"..., 64}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_IPV6, cmsg_type=, ...}, msg_flags=0}, MSG_CONFIRM) = 64

Non-working (RIPng group):
sendmsg(3, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(58), inet_pton(AF_INET6, "ff02::9", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, ...
From: David Stevens
Date: Monday, September 8, 2008 - 7:26 pm

Did you specify the interface on those?

I don't know whether or not I should trust that sin_scope_id==0,
which Brian also mentioned.

For link-local addresses, it must be specified, though you can
get there from an SO_BINDTODEVICE too. The error if it isn't
there is EINVAL, but sin_scope_id is showing as 0 for both cases.
The send side shouldn't actually care anything about the group
number unless they matched different routes, which I don't see
in your routing table.

So, I think it's back to running it on a debug kernel that prints
something distinguishing from the different EINVAL cases
to narrow it down.

                                        +-DLS

--

From: Rémi Denis-Courmont
Date: Monday, September 8, 2008 - 11:52 pm

On Tue, 9 Sep 2008 02:38:53 +0200, Bernhard Schmidt <berni@birkenwald.de>
               ^^^^^^
Uho... that interface is not multicast-capable. Not sure how the kernel
handles the conflicting routes.

-- 
Rémi Denis-Courmont

--

From: David Stevens
Date: Tuesday, September 9, 2008 - 12:17 am

Multicast programs shouldn't rely on the routing
table at all, IMAO. Unicast routing has nothing at all
to do with multicast routing, where a single address
means copying and forwarding to multiple different
segments, and the address has nothing at all to do with
the routing topology.
        That's why the scope_id should be set (and to
the proper index for eth0), but I think either strace
is lying to us here or they're using SO_BINDTODEVICE
or IP_MULTICAST_IF.
        But when it's breaking, I imagine it is ending
up on the wrong interface and thus the EINVAL... I
just don't see how some addresses still work while
others fail at the same time. :-)

                                                +-DLS

--

From: Bernhard Schmidt
Date: Tuesday, September 9, 2008 - 3:06 am

I agree with you that it should not rely on it. However, it obviously
did (in some way):

miredo:~# ip -6 route del ff00::/8 dev teredo  metric 256  mtu 1280 advmss 1220 hoplimit 4294967295                           
miredo:~# ping6 -I eth0 ff02::9
PING ff02::9(ff02::9) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from fe80::2c0:9fff:fe4b:8ccf: icmp_seq=1 ttl=64 time=0.453 ms (DUP!)
64 bytes from fe80::20c:86ff:fe9a:3819: icmp_seq=1 ttl=64 time=0.467 ms (DUP!)
64 bytes from fe80::20c:86ff:fe9a:2819: icmp_seq=1 ttl=64 time=0.472 ms (DUP!)

I'm not convinced that this route was the culprit though, I think it
might be related to some sort of routing-table locking foo that got
resolved when I changed something. I'll keep it running (will take one
or two days max to reappear) and try deleting an arbitrary route when it
happens again.

Bernhard
--

From: David Stevens
Date: Tuesday, September 9, 2008 - 8:05 am

Bernhard,
        It'll definitely fall back to using the routing table
when the interface is not specified already, which is why
I was ranting about it (I think it should fail without a unicast
routing table lookup in that case, and always require the
application to provide the interface).
        What is odd about your case is you specified
"-I eth0" on the ping6, so it should never be doing a
route lookup (but it also should have a nonzero scope_id).
        That could be a bug in ping6. If it's passing
uninitialized flags to sendmsg, it clearly needs some work.
        Then the question is whether RIPng is explicitly
providing the interface for the sendmsg or not. If it isn't,
that's the problem, but I'd expect it to-- to operate correctly,
it needs to know what interface it's talking on.

                                        +-DLS

--

From: Pekka Savola
Date: Tuesday, September 9, 2008 - 10:16 am

I wonder if it's relevant that msg_iov in working case has MSG_CONFIRM 
but in non-working case this is zero?  I guess not but apart from the 
contents of msg_iov, that's the only difference..

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings
--

From: David Miller
Date: Tuesday, September 9, 2008 - 1:13 pm

From: Pekka Savola <pekkas@netcore.fi>

MSG_CONFIRM refreshes the route's neighbour ->confirmed timestamp with
the current value of jiffies.

I really can't see how that could lead to the observed behavior, but who
knows at this point :)

--

Previous thread: [PATCH] *icsk_ca_priv: Reduce inet_connection_sock size? by Daniele Lacamera on Sunday, August 31, 2008 - 11:26 am. (7 messages)

Next thread: [PATCH] *icsk_ca_priv: Reduce inet_connection_sock size? by Daniele Lacamera on Sunday, August 31, 2008 - 11:05 am. (1 message)