From: Bernard Pidoux <bpidoux@free.fr> I don't see any large value in adding this patch right now. At best it's a new BUG check, it doesn't actually fix the problem. I'd rather apply a fix :-) --
I slightly disagree. The patch actually prevents use counter to become
underflowed (starting with 65535 then 65534 etc...)
The patch makes sure that use never becomes less than 0.
This may prevent further problems.
However I agree that the real reason of the bug is still unknown.
It must be hidden in the very obscure loop (at least for me).
Here is the original code :
void rose_kill_by_neigh( struct rose_neigh *neigh)
{
struct sock *s;
struct hlist_node *node;
spin_lock_bh(&rose_list_lock);
sk_for_each(s,node,&rose_list) {
struct rose_sock *rose = rose_sk(s);
if (rose->neighbour == neigh) {
rose_disconnect(s, ENETUNREACH, ROSE_OUT_OF_ORDER, 0);
rose->neighbour->use--;
rose->neighbour = NULL;
}
}
spin_unlock_bh(&rose_list_lock);
}
I suspect that the bug was unravelled when we added more than one
neighbour per route. The protocole accepts three, but this was not much
used during the early days since the density of radio stations on the
network was not big (only one node station per destination address
usually). The network is now denser with Internet links.
However, I don't understand why the test
if (rose->neighbour == neigh)
does not work, for
rose->neighbour = NULL;
should prevent next comparison to be valid and thus instruction
rose->neighbour->use--; not executed.
I have seen that a problem with sk_for_each() macro was identified a
while ago into ax25 code. The problem here could be similar ?
Bernard
--
From: Bernard Pidoux <bpidoux@free.fr>
I took a look at this some more.
That neighbour case loop you mention does set ->neighbour to NULL.
But other paths do not. Just look for all of the pieces of code
that do "rose->neighbour->use--;" and you'll find a few that do not
NULL it out.
One such example is rose_kill_by_device().
That would cause a problem because, while rose_disconnect() marks
the socket DEAD, it doesn't actually remove it from "rose_list".
That happens later when the user actually closes the socket or
some other similar event occurs.
Therefore if rose_kill_by_neigh() happens next, the ->neighbour could
match and we'll decrement again.
But I have no idea how safe it is to NULL out this ->neighbour
unconditionally. Lots of code seems to deref the thing unconditionally.
For example the ROSE_STATE_2 handling in rose_release().
I guess since rose_disconnect() sets sk->sk_state to TCP_CLOSE, we'll
be OK here.
Can you try this patch?
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index a7f1ce1..41dd630 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -197,6 +197,7 @@ static void rose_kill_by_device(struct net_device *dev)
if (rose->device == dev) {
rose_disconnect(s, ENETUNREACH, ROSE_OUT_OF_ORDER, 0);
rose->neighbour->use--;
+ rose->neighbour = NULL;
rose->device = NULL;
}
}
@@ -625,6 +626,7 @@ static int rose_release(struct socket *sock)
case ROSE_STATE_2:
rose->neighbour->use--;
+ rose->neighbour = NULL;
release_sock(sk);
rose_disconnect(sk, 0, -1, -1);
lock_sock(sk);
--
David, I tried the patch you proposed. I agree that it makes sense, however it does not prevent rose->neighbour->use to become negative as displayed in /proc/net/rose_neigh use field value. I already looked at all of the pieces of code that do "rose->neighbour->use--;" The only place that caused use to underflow (negative) is actually inside rose_kill_by_neigh(). This is why I had put a test and a warning there. I think that inside of sk_for_each() loop in rose_kill_by_neigh() when rose->neighbour->use-- becomes = 0 then rose->neighbour should be NULLed and in that case only. However it seems that rose->neighbour is not actually NULLed for in that case the comparison would not be true anymore and the decrement would not occur. I will soon report the printk output of rose->neighbour->use-- inside of the loop to illustrate what happens here. Bernard Pidoux --
