I've been hitting a lockup on my machine, and the way to hit
it is sort of straight forward.
1) Setup IPSEC connection using openswan
2) Use something like XAUTH where openswan is not able refresh the
IPSEC policy when it times out because it has no way to prompt the
user for the XAUTH password again.
3) After the timeout, try to send a packet over the IPSEC connection,
a simple DNS lookup over it works for me.
4) Shut down IPSEC daemon.
At this point with SMP we'll wedge in __xfrm_state_destroy() trying to
obtain xfrm_state_lock.
This problem seems to have been introduced by:
commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
Author: Timo Teras <timo.teras@iki.fi>
Date: Thu Feb 28 21:31:08 2008 -0800
[XFRM]: Speed up xfrm_policy and xfrm_state walking
(BTW, I want to point out how hesitant I was at the time to apply
this patch. It touches a lot of delicate areas...)
This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().
Such new locking cannot work properly, without also updating a lot of
other code in net/xfrm/xfrm_state.c
The new locking means that it is absolutely illegal to do an
xfrm_state_put() while holding xfrm_state_lock. Yet we do this
all over the place!
Asynchronously another context can drop the count to one, and this
xfrm_state_put() call will thus call into __xfrm_state_destroy() and
try to take xfrm_state_lock (again) causing the deadlock.
I'm not exactly sure how to fix this currently.
I guess we can add a local pointer variable to the relevant functions
"struct xfrm_state *to_put", which is initialized to NULL, and we
replace the put calls with assignments to to_put. In the "out" path,
after we drop the xfrm_state_lock, we do a put if "to_put" is
non-NULL.
Something like the patch below, but frankly this is some ugly stuff.
I double checked xfrm_policy.c, since it has the potential to have
similar issues due to the above changeset, but it appears to be
OK the best I can ...From: David Miller <davem@davemloft.net>
This part of my patch is buggy, we loop here so would need to remember more
state to do the puts right.
But luckily, this function doesn't need this to_put code, as it grabs
x->lock not xfrm_state_lock so all the changes to this function can
simply be removed :)
I'm going to do some testing on this and commit it unless more problems
are found.
ipsec: Fix deadlock in xfrm_state management.
Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
with xfrm_state_lock held. If we do, we'll deadlock since we
have the lock already and __xfrm_state_destroy() tries to take
it again.
Fix this by pushing the xfrm_state_put() calls after the lock
is dropped.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..7bd62f6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -780,11 +780,13 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
{
unsigned int h;
struct hlist_node *entry;
- struct xfrm_state *x, *x0;
+ struct xfrm_state *x, *x0, *to_put;
int acquire_in_progress = 0;
int error = 0;
struct xfrm_state *best = NULL;
+ to_put = NULL;
+
spin_lock_bh(&xfrm_state_lock);
h = xfrm_dst_hash(daddr, saddr, tmpl->reqid, family);
hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) {
@@ -833,7 +835,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
if (tmpl->id.spi &&
(x0 = __xfrm_state_lookup(daddr, tmpl->id.spi,
tmpl->id.proto, family)) != NULL) {
- xfrm_state_put(x0);
+ to_put = x0;
error = -EEXIST;
goto out;
}
@@ -849,7 +851,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
error = security_xfrm_state_alloc_acquire(x, pol->security, fl->secid);
if (error) {
x->km.state = ...This is simply bogus. We should remove the entry from the list when the state is deleted, not when it's destroyed. That is, we should be deleting x->all in places like __xfrm_state_delete instead of __xfrm_state_destroy. The fact that we're adding x->all in __xfrm_state_insert means that we should be deleting it from its counter-part, which is __xfrm_state_delete. The same thing applies to the policies of course. Once a policy or state is deleted there is no reason why it should show up in a However, moving things out of critical sections is always a good idea. So I think your patch is an improvement regardless of why it came about :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
But then caller of xfrm_state_walk() can be still holding a reference to the entry and dumping of the whole chain fails, because the entry is no longer part of the chain when dumping continues later. The walking coding skips entries which are dead so the entry is only kept temporarily to make a full traversal of Right. Also alternatively the xfrm_state_all could be protected by a different lock than xfrm_state_lock. - Timo --
Ah I see your point now.
Actually another thing we should do is to postpone the unlinking
into the GC process. This doesn't really need to be immediate.
And here is a patch to fix a couple of changes in user-behaviour.
ipsec: Restore larval states and socket policies in dump
The commit commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
Speed up xfrm_policy and xfrm_state walking") inadvertently removed
larval states and socket policies from netlink dumps. This patch
restores them.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 841b32a..5138845 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1077,6 +1077,7 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
struct hlist_head *chain = policy_hash_bysel(&pol->selector,
pol->family, dir);
+ list_add_tail(&pol->bytype, &xfrm_policy_bytype[pol->type]);
hlist_add_head(&pol->bydst, chain);
hlist_add_head(&pol->byidx, xfrm_policy_byidx+idx_hash(pol->index));
xfrm_policy_count[dir]++;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..89551b6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -856,6 +856,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
if (km_query(x, tmpl, pol) == 0) {
x->km.state = XFRM_STATE_ACQ;
+ list_add_tail(&x->all, &xfrm_state_all);
hlist_add_head(&x->bydst, xfrm_state_bydst+h);
h = xfrm_src_hash(daddr, saddr, family);
hlist_add_head(&x->bysrc, xfrm_state_bysrc+h);
@@ -1051,6 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re
xfrm_state_hold(x);
x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ;
add_timer(&x->timer);
+ list_add_tail(&x->all, &xfrm_state_all);
hlist_add_head(&x->bydst, xfrm_state_bydst+h);
h = xfrm_src_hash(daddr, saddr, family);
hlist_add_head(&x->bysrc, xfrm_state_bysrc+h);
Cheers,
-- ...Oh, I seemed to have missed that there was multiple places where the list insertion happens. Would it make sense to make function for inserting the entries in the lists? - Timo --
Sure, grep for the places that change xfrm_state_num and you've got your function :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Applied to net-2.6, thanks Herbert. --
Yes that's a good idea. Here's a patch to do that for states.
I'm still thinking about whether policy destruction should be
restructured or not so I'm not patching that.
ipsec: Move state dump list unlinking into GC
This patch moves the unlinking of x->all (for dumping) into the
GC and changes the lock to xfrm_cfg_mutex.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 89551b6..f29923a 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode)
static void xfrm_state_gc_destroy(struct xfrm_state *x)
{
+ mutex_lock(&xfrm_cfg_mutex);
+ list_del(&x->all);
+ mutex_unlock(&xfrm_cfg_mutex);
+
del_timer_sync(&x->timer);
del_timer_sync(&x->rtimer);
kfree(x->aalg);
@@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x)
{
WARN_ON(x->km.state != XFRM_STATE_DEAD);
- spin_lock_bh(&xfrm_state_lock);
- list_del(&x->all);
- spin_unlock_bh(&xfrm_state_lock);
-
spin_lock_bh(&xfrm_state_gc_lock);
hlist_add_head(&x->bydst, &xfrm_state_gc_list);
spin_unlock_bh(&xfrm_state_gc_lock);
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Shouldn't this also change the locking in all places where x->all is used? --
You're right. I'd forgotten about the larval states for some
reason, probably because I just added them to the list :)
So let's keep xfrm_state_lock and just move it.
ipsec: Move state dump list unlinking into GC
This patch moves the unlinking of x->all (for dumping) into the
GC so that we know for sure what locks we hold when we unlink the
entry.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 89551b6..0244325 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode)
static void xfrm_state_gc_destroy(struct xfrm_state *x)
{
+ spin_lock_bh(&xfrm_state_lock);
+ list_del(&x->all);
+ spin_unlock_bh(&xfrm_state_lock);
+
del_timer_sync(&x->timer);
del_timer_sync(&x->rtimer);
kfree(x->aalg);
@@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x)
{
WARN_ON(x->km.state != XFRM_STATE_DEAD);
- spin_lock_bh(&xfrm_state_lock);
- list_del(&x->all);
- spin_unlock_bh(&xfrm_state_lock);
-
spin_lock_bh(&xfrm_state_gc_lock);
hlist_add_head(&x->bydst, &xfrm_state_gc_list);
spin_unlock_bh(&xfrm_state_gc_lock);
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au> How about we un-crap the reference counting? Every list an object lives on is an external reference, and that is the rule that is violated by this ->all list handling. And that's why we need to take a single shared lock twice just to get rid of an xfrm_state object now. --
Well, it's just another list keeping a reference like ->bydst, ->bysrc and ->byspi. The actual amount of external references is still correct (the walking code calls _hold() when it returns while keeping an external pointer). The difference is that node should not be unlinked from ->all until all other references are gone. For other lists the unlinking can be done earlier since they are used only for lookups. Any good other ways to enumerate to list entries while allowing to keep a temporary "iterator"? The previous method was crap too. - Timo --
From: Timo Teräs <timo.teras@iki.fi> ->bydst, ->bysrc, and ->byspi are counted as a single external reference because: 1) They are controlled as a group 2) Doing 3 atomic operations is more expensive than one I know because I did that conversion from 3 refcount operations down to 1 and I timed it with stress tests, which showed that it made a Once there are no list references, there cannot be any other references. So in fact it seems to me that unlinking when the xfrm_state is removed from those other lists makes perfect sense. If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state objects marked DEAD, why does the ->all list reference have to survive past __xfrm_state_delete()? At least the old stuff was self-consistent and only needed one central lock grab to destoy an object. --
I was a bit confused what you meant by "external reference". But 1. xfrm_state_walk() called, it returns but holds an entry since the walking was interrupted temporarily (e.g. full netlink buffer). 2. xfrm_state_delete() called to the entry that xfrm_state_walk() is keeping a pointer to and it is unlinked. 3. xfrm_state_walk() called again, it tries to resume list walking Yes, but the dumping code produced crap. It could dump same entry multiple times, miss entries and was dog slow. With it there was no possibility to keep userland in sync with kernel SPD/SAD because entries were lost. - Timo --
From: Timo Teräs <timo.teras@iki.fi> Get creative, use a key of some sort to continue the walk, that's what I'd rather see an entry twice in a dump than have my IPSEC gateway lockup, or run slower because we take a lock twice as often as necessary during object destruction. --
Ok. ctnetlink_dump_table() seems to iterate the hash list and keeps a unique id where it left. If that entry was deleted it starts again from that hash bucket. We could iterate SPD by idx. By SPI would be nice for SAD, but not all entries have SPI. I guess we could use either of the bysrc or bydst hash and memorize xfrm_id. Restart from first hash bucket entry if the memorized entry gets deleted, and restart from the beginning if hash gets resized. This way we could avoid the additional locking and guarantee that all Seeing entry twice is not a problem. Not seeing an entry at all was the real problem. Also listing SAD could take tens of seconds on modern system. That's not nice either. - Timo --
OK let's see if this is creative enough :)
The next step after this patch is to kill x->all altogether and
go back to the old walking method (plus the saved state).
ipsec: Use RCU-like construct for saved state within a walk
Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.
As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.
An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
/* Full description of state of transformer. */
struct xfrm_state
{
- /* Note: bydst is re-used during gc */
struct list_head all;
- struct hlist_node bydst;
+ union {
+ struct list_head gclist;
+ struct hlist_node bydst;
+ };
struct hlist_node bysrc;
struct hlist_node byspi;
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
walk->count = 0;
}
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
- if (walk->state != NULL) {
- xfrm_state_put(walk->state);
- walk->state = NULL;
- }
-}
-
extern int xfrm_state_walk(struct xfrm_state_walk *walk,
int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
extern struct xfrm_state *xfrm_state_alloc(void);
extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..ec9d1e5 ...From: Herbert Xu <herbert@gondor.apana.org.au> What happens to all of the entries on the local gc_list which you will be skipping when you break out of that last loop? It looks like they will never get processed. --
I made gc_list static, so they'll be processed when the GC task runs again. Also whenever completed is incremented we schedule the GC task. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Thanks, I didn't catch that "static" bit :-) --
From: Herbert Xu <herbert@gondor.apana.org.au> Ok, now that I finally understand how this works I like it. The only comment I would make is that maybe it's a bit excessive to trigger the GC worker every time we walk the states. Instead you could expose the GC task's local list go: xfrm_state_walk_completed++; smp_wmb(); if (!list_empty(&xfrm_state_gc_work_list)) schedule_work(&xfrm_state_gc_work); in xfrm_state_walk_done(). --
Good point!
I've avoided the memory barrier by simply extending the mutexed
section in the GC to cover the list splicing. Here's the updated
patch:
ipsec: Use RCU-like construct for saved state within a walk
Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.
As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.
An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
/* Full description of state of transformer. */
struct xfrm_state
{
- /* Note: bydst is re-used during gc */
struct list_head all;
- struct hlist_node bydst;
+ union {
+ struct list_head gclist;
+ struct hlist_node bydst;
+ };
struct hlist_node bysrc;
struct hlist_node byspi;
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
walk->count = 0;
}
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
- if (walk->state != NULL) {
- xfrm_state_put(walk->state);
- walk->state = NULL;
- }
-}
-
extern int xfrm_state_walk(struct xfrm_state_walk *walk,
int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
extern struct xfrm_state *xfrm_state_alloc(void);
extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..d90f936 100644
--- ...From: Herbert Xu <herbert@gondor.apana.org.au> Looks good. I'll be able to merge this next time I pull net-2.6 into net-next-2.6 because of your larval/socket policy fix which touched similar areas. --
From: Herbert Xu <herbert@gondor.apana.org.au> Applied, thanks Herbert. --
There is only one reader at a time, right? Otherwise, I don't see how the increments and reads of xfrm_state_walk_completed line up. --
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> The RTNL semaphore is held across the modifications of the counters. --
Got it, thank you, sorry for the noise! Thanx, Paul --
s/RTNL/xfrm_cfg_mutex/ but otherwise what Dave said :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Thank you for the updated information, and I guess I don't feel quite so bad about the false alarm. ;-) Thanx, Paul --
Does not this logic fail if: 1. completed = ongoing 2. 1st walk started and iterator kept (a->lastused = ongoing, ongoing++) 3. 2nd walk started and iterator kept (b->lastused = ongoing, ongoing++) 4. 2nd walk finished (completed++) 5. gc triggered: a gets deleted since a->lastused == completed 6. 1st walk continued but freed memory accessed as a was deleted Though currently it does not affect, since xfrm_state_hold/_put are still called when keeping the iterator, so the entries won't actually get garbage collected anyway. So the completed/ongoing counting is basically useless. Or am I missing something? Wouldn't it be enough to do the list_del_rcu on delete? And just keep reference as previously? - Timo --
Obviously I missed the fact that it prevents deletion of the entries which the iterator held list node might still point to. But the flaw still exists: the entries which interator->next points can be deleted if there is a walking that takes a long time and So if we do list_del_rcu() on delete, could we also xfrm_state_hold() the entry pointed to by that list entry. And then on GC we could xfrm_state_put() the next entry. - Timo --
Unfortunately it's not that simple since we'll be in the same
bind if the entry after the next entry gets deleted as well as
the next entry.
The only other solution is to go back to the original scheme
where we keep the list intact until GC. However, nobody has
come up with a way of doing that that allows the SA creation
path to run locklessly with respect to the dumping path.
So I think we'll have to stick with the quasi-RCU solution for
now. Here's a patch to fix the flaw that you discovered.
ipsec: Fix xfrm_state_walk race
As discovered by Timo Teräs, the currently xfrm_state_walk scheme
is racy because if a second dump finishes before the first, we
may free xfrm states that the first dump would walk over later.
This patch fixes this by storing the dumps in a list in order
to calculate the correct completion counter which cures this
problem.
I've expanded netlink_cb in order to accomodate the extra state
related to this. It shouldn't be a big deal since netlink_cb
is kmalloced for each dump and we're just increasing it by 4 or
8 bytes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 9ff1b54..cbba776 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
int (*dump)(struct sk_buff * skb, struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
int family;
- long args[6];
+ long args[7];
};
struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 4bb9499..48630b2 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1246,6 +1246,8 @@ struct xfrm6_tunnel {
};
struct xfrm_state_walk {
+ struct list_head list;
+ unsigned long genid;
struct xfrm_state *state;
int count;
u8 proto;
@@ -1281,13 +1283,7 @@ static inline void xfrm6_fini(void)
extern int xfrm_proc_init(void);
#endif
-static inline void xfrm_state_walk_init(struct ...Well, I was thinking that we hold the next pointer. And when continuing the dump, we can first skip all entries that are marked as dead (each next pointer is valid since each of the next pointers are held once). When we find the first valid entry to dump we _put() the originally held entry. That would recursively _put() all But yes, this would work as well. Not sure which one would be faster. I guess the holding of individual entries would be at least more memory efficient. Cheers, Timo --
No that doesn't work. Let's say we store the entry X in walk->state, and we hold X as well as X->next. Now X, X->next, and X->next->next get deleted from the list. What'll happen is that X and X->next will stick around but X->next->next will be freed. So when we resume from X we'll dump X and X->next correctly, but then hit X->next->next and be in the same shithole. Really, the only way to do what you want here is to hold everything from X to the very end of the list. The problem with that is that you can no longer drop the reference counts when we resume. You can see why even if we only hold X and X->next. Should X->next be deleted from the list, when we resume from X we'll no longer be able to drop the reference count on the original X->next since it I think this is a lot more efficient than storing every single entry from X to the end of the list :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
I think it would work. Here's the scenarios: We hold X as dumping is interrupted there. X->next points statically to some non-deleted entry and is held. Now, if X->next gets deleted, it's marked dead and X->next->next is held too. Thus when there is multiple deleted entries in chain, the whole chain is held recursively/iteratively. When walking is continued on X the first thing we do is skip all dead entries from X, after that we put X and that would trigger put() for all X->next:s which were held iteratively. If X->next is not deleted, and X->next->next gets deleted, the X->next list structure is updated correctly by list_del_rcu and the entry can be actually freed even if the walking didn't iterate that entry (it would be skipped anyway as it's marked dead on deletion). So the idea was to hold X->next from deletion function, not from the walking function. That would be, we always hold deleted->next when there are ongoing walks. And on final _put() we _put() the ->next entry. I think that would work. - Timo --
Right, this would work. However, now you're again slowing down the relatively fast path of state deletion by moving extra work there from the non-critical dump path. When we optimise code for time, we don't necessarily try to minimise the work overall. We instead try to minimise the amount of work on the critical paths. It's OK to let the other paths such as dumping do a bit more work if it improves the fast paths. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
The extra step there wold be a hold call. The recursive _put on a _put of some entry can happen only on dump path. As otherwise the ->next entry is first held in state delete, but would be immediately _put on the _put as the final step of _delete(). So basically one additional atomic_inc() and one atomic_dec() on the Not sure about the overall penalty, but yes I know it would have some penalty. But at least there would be no locking. Particularly the thing that can happen on your approach is that if there is a user land process that gets suspended during a dump processing, it would prevent the garbage collection of all entries for all eternity until that process is continued or killed. So this would allow deletion and GC of entries even during walking. But this was just a thought. Maybe it's not worth trying. Cheers, Timo --
Can you post a patch? If this can be done locklessly then yes it would probably be a good way to go. However, I'm not sure whether I understand your lockless proposal yet. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Something like this. I just compile tested, so I'm not sure if it
actually works.
This basically reverts the GC changes. The interesting bits are in
xfrm_state_delete(). It will _hold() the ->next entry unless we
were at last entry. This will make sure that when an entry is
in XFRM_STATE_DEAD the all.next is valid all the way until the
entry is destroyed.
The corresponding put is in _destroy(). Added it as a final thing
to do so hopefully compiler optimizes tail recursion to iteration.
(Seemed to do that in my case.)
And finally, _walk() was right all along. It returns to dumping
and first skips all dead entries. And just before return, when
all dead entries are already skipped the originally held entry
is _put(). That _put call (or the one in _walk_done()) will result
all dead entries that were held, to be iteratively put.
- Timo
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
int (*dump)(struct sk_buff * skb, struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
int family;
- long args[7];
+ long args[6];
};
struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..d5cba7b 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -122,7 +122,7 @@ struct xfrm_state
{
struct list_head all;
union {
- struct list_head gclist;
+ struct hlist_node gclist;
struct hlist_node bydst;
};
struct hlist_node bysrc;
@@ -1246,8 +1246,6 @@ struct xfrm6_tunnel {
};
struct xfrm_state_walk {
- struct list_head list;
- unsigned long genid;
struct xfrm_state *state;
int count;
u8 proto;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 053970e..f45f006 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
static ...Yeah we should open code this to be absolutely sure that we don't get killed by a stupid compiler. However, I think this is still open to the same problem that my patch had, i.e., if a dumper goes to sleep during the dump we may prevent entries from being freed indefinitely. Yes your idea is better in that we may only withhold say a fraction (depending on the order of deletion it could be anywhere from none to every entry) of the entries instead of all of them, but fundamentally the same issue is still there. Considering the fact that dumps require root privileges I'm not convinced as of yet that this is worth it. Hmm, could we perhaps go back to your original scheme of keeping everything on the list and see if we can use RCU to make it lockless instead? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Thanks :) And actually, we can make _delete() do normal list_del() and not hold the next pointer if there are no running dumps. This is the common Right. The patch was just to show the idea. Will do a proper patch with this and the above mentioned optimization if you think we You are right. It can keep memory allocated. And in worst case you end up keeping all the deleted entries (entries are deleted in the order they appear in all list). But I would not be to That would be ideal of course. - Timo --
How about this: we keep list of walks as your latest patch does. When walking is interrupted, we do not hold the entry, we just store the pointer to walk iterator. When the entry is deleted from the lists we go through the walk contexts, and if someone is pointing to the entry being deleted, we just update it to next. The list of walkers can be protected by xfrm_state_lock. That needs to be taken anyway for accessing/modifying the other lists. During most of the time, there is no walks active, so the penalty for _destroy() is minimal. This would make it possibly to reclaim the deleted entries right away. Does this sound better? Cheers, Timo --
Yep this sounds pretty good to me. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
So the patch would look something like this. Compile tested.
Will test later when it becomes possible to reboot my box.
Cheers,
Timo
ipsec: Fix up xfrm_state_walk.state on node deletion
Now that we track xfrm_state_walks, it makes more sense to
fix up the state pointer on deletion of the node if needed.
This allows accurately to delete all entries immediately,
instead of possibly waiting for userland.
Also fixed locking of xfrm_state_walks list handling.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
include/linux/netlink.h | 2 +-
include/net/xfrm.h | 3 +-
net/xfrm/xfrm_state.c | 77 +++++++++++++++++-----------------------------
3 files changed, 31 insertions(+), 51 deletions(-)
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
int (*dump)(struct sk_buff * skb, struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
int family;
- long args[7];
+ long args[6];
};
struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..7f787c7 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -122,7 +122,7 @@ struct xfrm_state
{
struct list_head all;
union {
- struct list_head gclist;
+ struct hlist_node gclist;
struct hlist_node bydst;
};
struct hlist_node bysrc;
@@ -1247,7 +1247,6 @@ struct xfrm6_tunnel {
struct xfrm_state_walk {
struct list_head list;
- unsigned long genid;
struct xfrm_state *state;
int count;
u8 proto;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 053970e..636f7ee 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,11 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
static unsigned int xfrm_state_num;
static unsigned int xfrm_state_genid;
-/* Counter indicating ongoing walk, protected by ...Just make a new list in xfrm_state that has all the dumpers sitting on it. In fact all you need is a hlist, or even just a pointer. Walking through an unbounded list on the fast path is not good :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
There can be a lot of xfrm_state structs. As in thousands. So it will take more memory then. hlist would be enough, so it'd be a 4/8 bytes addition. Single pointer would not be enough as we can I was thinking about the same. That's why I wasn't so sure if this is better. In practice there aren't many walks active. Maybe we limit the maximum simultaneous walks? But in real life, there is only a one or two walkers if any active. So I would not consider a global update too expensive. But if you think it might become a problem, I'd add an hlist to xfrm_state of walkers referencing that entry. - Timo --
Right, we need doubly linked lists in the walkers to ease unlinking. Well in real life dumps don't tend to get suspended either :) In any case, a suspended dump is only going to pin down some memory, while a very long list walk may kill the box. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
So, what to do? 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and xfrm_state_put(all.next) on destruct. 2. Add per-entry hlist of walkers currently referencing it. 3. Use the global walker list. 1 can keep memory allocated until userland wakes up. 2 & 3 can make the delete of that entry slow if there's many walkers suspended. Btw. the current stuff in net-next is broken. There's no locking for xfrm_state_walkers list handling. Cheers, Timo --
I'd cross 3 off the list because 2 is just so much better :) I'd slightly lean towards 2 but now that you mention it yes even that is vulnerable to loads of dumpers sitting on the same entry. What about xfrm_cfg_mutex? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Umm... right. It's a tricky problem. Cannot think of perfect
solution atm. But I guess 3 is in general case the best. But in
worst case scenarios 1 performs better.
I have no strong opinion either way. So what ever you want, I'm
It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done}
touch xfrm_state_walks list without locking properly. At least in
the version I'm looking (= net-next-2.6 via git web interface).
- Timo
--
Their callers should be holding it. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Well, 1 is an improvement over the current implementation, so Ah, the other layers take it at least on _walk_init paths. But _walk_done can be called from recv() syscalls. The af_key implementation does not take xfrm_cfg_mutex there. I don't think xfrm_user does that either as it does not pass cb_mutex to netlink_kernel_create. So at least the _state_walk_done path is unsafe as-is, I think. - Timo --
OK we'd need to fix that up.
However, I've got a new idea :) Let's put the dumpers on the
list directly and then we won't have to deal any of this crap
about states going away.
Warning: compile tested only!
ipsec: Put dumpers on the dump list
As it is we go to extraordinary lengths to ensure that states
don't go away while dumpers go to sleep. It's much easier if
we just put the dumpers themselves on the list since they can't
go away while they're going.
I've also changed the order of addition on new states to prevent
a never-ending dump.
Finally the obsolete last optimisation is now gone.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
int (*dump)(struct sk_buff * skb, struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
int family;
- long args[7];
+ long args[6];
};
struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..17f9494 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,8 @@ extern struct mutex xfrm_cfg_mutex;
/* Full description of state of transformer. */
struct xfrm_state
{
- struct list_head all;
union {
- struct list_head gclist;
+ struct hlist_node gclist;
struct hlist_node bydst;
};
struct hlist_node bysrc;
@@ -138,6 +137,8 @@ struct xfrm_state
/* Key manger bits */
struct {
+ /* These two fields correspond to xfrm_state_walk. */
+ struct list_head all;
u8 state;
u8 dying;
u32 seq;
@@ -1246,11 +1247,11 @@ struct xfrm6_tunnel {
};
struct xfrm_state_walk {
- struct list_head list;
- unsigned long genid;
- struct xfrm_state *state;
- int count;
+ /* These two fields correspond to xfrm_state. */
+ struct list_head all;
+ u8 state;
u8 proto;
+ int count;
};
struct xfrm_policy_walk ...Now this is an interesting idea... I like this a lot. Only comment is that there really should be struct for the shared stuff. Otherwise it's bound to break at some point. Cheers, Timo --
That's why I put comments there. If people change it without reading the comments, they'll ignore the struct too :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Well, it's also because in the dump routine the entries are cast to xfrm_state, even if it was a walker entry. This is just wrong, though it probably works since only the specific entries are used. There should be some intermediate struct which we use to iterate in dumping routing, check if it was iterator/real entry (can be still based on the state thing). And only after that cast to struct xfrm_state. It would make the dumping routine more readable. Cheers, Timo --
No it works because all walker entries set the state to DEAD so they're skipped by everybody else. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yes, I know. I was pointing the fact that the walker function iterates using struct xfrm_state. So temporarily when it is iterating through the walker entry, we get strct xfrm_state pointer which points to some place before the struct xfrm_state_walk. Now since the km.state is checked first, those are skipped. I find it very confusing to let the code say "iterate through list of struct xfrm_state" when it is not such a list. It is a list of struct xfrm_state or struct xfrm_state_walk. So I'd use some intermediate struct to so the code can say e.g "iterate through list of struct xfrm_state_dump_entry" or whatever. Or at least add a comment to the dumping function to say that we have struct xfrm_state, but in matter of fact it can be also struct xfrm_state_walk pointer with displacement, so we better check km.state first. Cheers, Timo --
Which is exactly what we do. The first thing we check in the loop is km.state. I really don't see your problem. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yes. I'm not complaining that the code does not work. Just saying that the code easily misleads the reader. And in this kind of non-obvious places we should have some comments. Or make the code more readable by adding the intermediate struct. Cheers, Timo --
Fair enough. Feel free to reformat the patch and add the struct to make it more bullet-proof. I'm sorry I've got some paid work to do now :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Looks good to me, I espeically like the count/seq bit. But I'm sure Dave would appreciate it if we actually ran this too :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Gave it a test spin and found one bug. The walking routine will
get called one extra time when it has successfully dumped
everything so we need to check for that in the beginning; the
earlier version would start all over from the beginning resulting
never ending dumps.
So the patch is now as follows. I have an almost ready patch
for xfrm_policy dumping too. Will finish it and send tomorrow.
- Timo
ipsec: Put xfrm_state dumpers on the dump list
Based on Herbert Xu's patch and idea. Improved the original patch
to apply cleanly and modified iteration code to be more readable
by using a common struct for entries in all list.
As it is we go to extraordinary lengths to ensure that states
don't go away while dumpers go to sleep. It's much easier if
we just put the dumpers themselves on the list since they can't
go away while they're going.
I've also changed the order of addition on new states to prevent
a never-ending dump.
Finally the obsolete last optimisation is now gone.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
include/linux/netlink.h | 2 +-
include/net/xfrm.h | 29 ++++++-------
net/xfrm/xfrm_state.c | 109 +++++++++++++++-------------------------------
3 files changed, 50 insertions(+), 90 deletions(-)
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
int (*dump)(struct sk_buff * skb, struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
int family;
- long args[7];
+ long args[6];
};
struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..becaa1e 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex;
metrics. Plus, it will be made via sk->sk_dst_cache. Solved.
*/
+struct ...Aha, that's why the if statement was there :) Thanks for all the work! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yup. I forgot to test pf_key support. And remembered why the "last" thingy was in the walking routines. In pf_key dumps, the dump is terminated when an entry with seq zero is received; that's why the final entry received special treatment. And now that breaks. The first entry is zero so pf_key dumps only one entry. Not sure if we should fix this in pf_key or in the walking routines. Cheers, Timo --
This is part of the user-space ABI (and an RFC) so wecan't change
it. I've put the last stuff back and fixed up the error handling
on the very last entry.
ipsec: Put xfrm_state dumpers on the dump list
Based on Herbert Xu's patch and idea. Timo improved the original
patch to apply cleanly and modified iteration code to be more
readable by using a common struct for entries in all list.
As it is we go to extraordinary lengths to ensure that states
don't go away while dumpers go to sleep. It's much easier if
we just put the dumpers themselves on the list since they can't
go away while they're going.
I've also changed the order of addition on new states to prevent
a never-ending dump.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Timo Teras <timo.teras@iki.fi>
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
int (*dump)(struct sk_buff * skb, struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
int family;
- long args[7];
+ long args[6];
};
struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..becaa1e 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex;
metrics. Plus, it will be made via sk->sk_dst_cache. Solved.
*/
+struct xfrm_state_walk {
+ struct list_head all;
+ u8 state;
+ union {
+ u8 dying;
+ u8 proto;
+ };
+ u32 seq;
+};
+
/* Full description of state of transformer. */
struct xfrm_state
{
- struct list_head all;
union {
- struct list_head gclist;
+ struct hlist_node gclist;
struct hlist_node bydst;
};
struct hlist_node bysrc;
@@ -136,12 +145,8 @@ struct xfrm_state
u32 genid;
- /* Key manger bits */
- struct {
- u8 state;
- u8 dying;
- u32 seq;
- } km;
+ /* Key manager bits ...I was just figuring if we need to change seq in pf_key.c kernel code or in xfrm_state.c. But I guess it's good if the walking walks as previously. Now there is a new problem: if dumping is interrupted and meanwhile all the remaining entries get deleted, there is no entry to dump with seq zero when dumping is eventually continued. Since xfrm_user does not care for the seq as end-of-message, this problem appears only in af_key. Maybe we should make af_key.c code instead cache one entry before sending it to user space. We could then get rid of the last hack inside xfrm core walking code. Cheers, Timo --
Yes that should work. But you want to keep cache the skb instead of the state since the latter can die. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Exactly, keep the converted entry in skb, since we have to allocate that anyway. And when there's no more entries coming from enumerator the sequence number of final skb is adjusted to zero. Not sure if I can finish it today, but I think I should have the time by tomorrow. Unless you have time to do it before that, I'll post a patch later this week. Will update the similar xfrm_policy patch for that too, before sending it. - Timo --
Thanks! I've got some other work to do so feel free to post this tomorrow. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
It was simpler than I thought (and I thought I would be more busy with other things). Anyway, here goes. This is a single patch that modifies both xfrm_state and xfrm_policy dumping. This was to get af_key patch nicer (the final skb patching/sending is common code). Also, since the af_key implementation is the only place where this seq number is used, we might consider removing it from the core walking code. It's pretty af_key specfic too. The xfrm_policy patch uses moves xfrm_policy.dead to xfrm_policy_walk_entry which is the struct shared with walkers and the real entries. I did some testing to ensure that it works with xfrm and af_key as expected (with entries added by hand and dumping them). ipsec: Put dumpers on the dump list Herbert Xu came up with the idea and the original patch to make xfrm_state dump list contain also dumpers: As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Timo Teräs improved the patch to apply cleanly to latest tree, modified iteration code to be more readable by using a common struct for entries in the list, implemented the same idea for xfrm_policy dumping and moved the af_key specific "last" entry caching to af_key. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index ...
One more thing I noticed. If the af_key socket is closed while dumping is ongoing the current code can leak stuff. We should call the .done. And also in the new patch, release the stored skb. The right place to do that would be pfkey_release, I assume? - Timo --
Yes that sounds right. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Timo Teräs <timo.teras@iki.fi> I reread this thread, and the most recent patch seems good and even tested :-) But it seems that you want to make some more tweaks, right? Please post explicitly the patch you want me to apply to net-next-2.6 once you reach that point, thanks! --
Yes I think Timo is still touching it up. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Looks good, applied to net-next-2.6 --
From: Herbert Xu <herbert@gondor.apana.org.au> After your patch I just applied, things are a bit different now. --
This patch is now redundant. What could be done though is to remove x->all and restore the original walking by hashing. This would eliminate a list node from xfrm_state, and would also restore the original dump ordering. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Yes I remember you mentioning that. If you're too busy I can try to spin this idea into a patch. --
That would be very much appreciated. Thanks Dave! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> No problem. It might be a little bit of a chore because this new walker design is intimately tied to the af_key non-atomic dump changes. --
Actually I was mistaken as to how the original dump worked. I'd thought that it actually kept track of which bucket it was in and resumed from that bucket. However in reality it only had a global counter and would always start walking from the beginning up until the counted value. So it isn't as easy as just copying the old code across :) While it is possible to restore the old order and save a little bit of memory, I certainly don't think this is very urgent. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au>
Only AF_KEY gave an error, and this ended the dump. This was
one of Timo's goals, to make AF_KEY continue where it left
off in subsequent dump calls done by the user, when we hit the
Too late, I already implemented it :-)
I'm about to give the following patch a test:
ipsec: Restore hash based xfrm_state dumping.
Get rid of ->all member of struct xfrm_state, and just use a hash
iteration like we used before.
This shrinks the size of struct xfrm_state, and also restores the
dump ordering of 2.6.25 and previous.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 4bb9499..ba0e47c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex;
/* Full description of state of transformer. */
struct xfrm_state
{
- struct list_head all;
union {
struct list_head gclist;
struct hlist_node bydst;
@@ -1247,6 +1246,7 @@ struct xfrm6_tunnel {
struct xfrm_state_walk {
struct xfrm_state *state;
+ unsigned int chain;
int count;
u8 proto;
};
@@ -1283,9 +1283,10 @@ extern int xfrm_proc_init(void);
static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
{
- walk->proto = proto;
walk->state = NULL;
+ walk->chain = 0;
walk->count = 0;
+ walk->proto = proto;
}
extern int xfrm_state_walk(struct xfrm_state_walk *walk,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index aaafcee..ccf8492 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -50,7 +50,6 @@ static DEFINE_SPINLOCK(xfrm_state_lock);
* Main use is finding SA after policy selected tunnel or transport mode.
* Also, it can be used by ah/esp icmp error handler to find offending SA.
*/
-static LIST_HEAD(xfrm_state_all);
static struct hlist_head *xfrm_state_bydst __read_mostly;
static struct hlist_head *xfrm_state_bysrc __read_mostly;
static struct hlist_head ...This union now needs to move across to bysrc or byspi since bydst needs to be preserved for walking even after a node is added to This is a bug in the original patch. When we get an error on the very last node we'll lose that node instead of dumping it later. I wonder if we could just get rid of last altogether... Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> I'll take a look at this too after I resolve and test the above issues. Thanks. --
From: David Miller <davem@davemloft.net>
No, it doesn't explain the crash/hang. And it happens even without my
patch.
I get a hang in del_timer_sync() and it's from a workqueue so I suspect
your patch :-)
I can't see what does the list delete from the GC leftover list? Entries
seem to stay on there forever. We even kfree() the object.
The list iteration is:
list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
if ((long)(x->lastused - completed) > 0)
break;
xfrm_state_gc_destroy(x);
}
And xfrm_state_gc_destroy() doesn't do a list_del(&x->gclist) or similar.
--
From: David Miller <davem@davemloft.net>
I'm now testing the following fix:
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index aaafcee..abbe270 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -423,6 +423,7 @@ static void xfrm_state_gc_task(struct work_struct *data)
list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
if ((long)(x->lastused - completed) > 0)
break;
+ list_del(&x->gclist);
xfrm_state_gc_destroy(x);
}
--
Doh! Yes there's supposed to be a list_del in there.
Here's a fixed patch:
ipsec: Use RCU-like construct for saved state within a walk
Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.
As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.
An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
/* Full description of state of transformer. */
struct xfrm_state
{
- /* Note: bydst is re-used during gc */
struct list_head all;
- struct hlist_node bydst;
+ union {
+ struct list_head gclist;
+ struct hlist_node bydst;
+ };
struct hlist_node bysrc;
struct hlist_node byspi;
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
walk->count = 0;
}
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
- if (walk->state != NULL) {
- xfrm_state_put(walk->state);
- walk->state = NULL;
- }
-}
-
extern int xfrm_state_walk(struct xfrm_state_walk *walk,
int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
extern struct xfrm_state *xfrm_state_alloc(void);
extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..fe84a46 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,6 +59,11 @@ static ...From: Herbert Xu <herbert@gondor.apana.org.au> Your original change was already in net-next-2.6 so I commited the one-liner fix. Thanks. --
Yes, this was the other goal. The other goal was to speed up dumping in netlink from O(n^2) to O(n). And fixing the problem that netlink might miss to dump some entries (if an entry was removed I think bad things will happen if the hash gets resized between xfrm_state_walk() calls. Since it assumes that the walk->state entry has remained in walk->chain bucket. The dumping order shouldn't really make any difference. But reducing struct xfrm_state is definitely good. The only downside is that when an entry is inserted to the beginning of the hash while dump is ongoing, it won't be dumped at all. But that is not really a problem since you get a notification about that entry separately. Cheers, Timo --
From: Timo Teräs <timo.teras@iki.fi> That's a good point, the hash resizing case. If the hash grows, we'll see some entries again. If it shrinks, we might skip some instead, which is very undesirable. Actually, thinking about it some more, both possibilities (skipping and dups) seem to be possible in both situations (growing and shrinking). --
Yeah the hash resizing thing really kills this approach. In fact it seems that the original walker was written before resizing was added. I definitely agree that ordering isn't as important as showing the entries reliably. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Timo Teräs <timo.teras@iki.fi> Before your changes the state destroy was protected by no central locks at all. There was no need. Since the reference goes to zero, there are no external references to this object. But this is exactly what you added, an external references that is not reference counted. This is yet another reason all of Herbert's objections to your location for ->all list add and removal are sound. So we now have to take a centralized lock twice when getting rid of an XFRM state object. Which is potentially a scalability performance regression. --
