Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits

Previous thread: WebMail Quota Alert!!! by Webmail Technical Team on Thursday, June 25, 2009 - 7:17 am. (1 message)

Next thread: WebMail Quota Alert!!! by Webmail Technical Team on Thursday, June 25, 2009 - 6:18 am. (1 message)
From: Paweł Staszewski
Date: Thursday, June 25, 2009 - 8:48 am

Hello ALL

Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
        Aver depth:     2.28
        Max depth:      6
        Leaves:         276539
        Prefixes:       289922
        Internal nodes: 66762
          1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5  
9: 1  18: 1
        Pointers: 691228
Null ptrs: 347928
Total size: 35709  kB

Counters:
---------
gets = 26276593
backtracks = 547306
semantic match passed = 26188746
semantic match miss = 1117
null node hit= 27285055
skipped node resize = 0

Local:
        Aver depth:     3.33
        Max depth:      4
        Leaves:         9
        Prefixes:       10
        Internal nodes: 8
          1: 8
        Pointers: 16
Null ptrs: 0
Total size: 2  kB

Counters:
---------
gets = 26642350
backtracks = 1282818
semantic match passed = 18166
semantic match miss = 0
null node hit= 0
skipped node resize = 0



This machine is running bgpd with two bgp peers / full route table

 cat /proc/meminfo
MemTotal:       12279032 kB
MemFree:        ...
From: Eric Dumazet
Date: Thursday, June 25, 2009 - 2:19 pm

Curious, you seem to hit an old alloc_pages limit()... (MAX_ORDER allocation)

Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes (+ header -> 4194304 bytes)

But since following commit, we should use vmalloc() so this PAGE_SIZE<<10) limit
should not anymore be applied.

Could you do a "cat /proc/vmallocinfo" just to check your big tnodes are vmalloced() ?


commit 15be75cdb5db442d0e33d37b20832b88f3ccd383
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Thu Apr 10 02:56:38 2008 -0700

    IPV4: fib_trie use vmalloc for large tnodes

    Use vmalloc rather than alloc_pages to avoid wasting memory.
    The problem is that tnode structure has a power of 2 sized array,
    plus a header. So the current code wastes almost half the memory
    allocated because it always needs the next bigger size to hold
    that small header.

    This is similar to an earlier patch by Eric, but instead of a list
    and lock, I used a workqueue to handle the fact that vfree can't
    be done in interrupt context.

    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--

From: Paweł Staszewski
Date: Thursday, June 25, 2009 - 2:52 pm

cat /proc/vmallocinfo
0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfe6a000 ioremap
0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46 
phys=dfef5000 ioremap
0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef2000 ioremap
0xf800c000-0xf800e000    8192 
acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfefb000 ioremap
0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef4000 ioremap
0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef3000 ioremap
0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef1000 ioremap
0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef0000 ioremap
0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeef000 ioremap
0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeee000 ioremap
0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeed000 ioremap
0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeec000 ioremap
0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a 
phys=fed1c000 ioremap
0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65 
pages=1 vmalloc
0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf806e000-0xf8070000    8192 ...
From: Eric Dumazet
Date: Thursday, June 25, 2009 - 3:54 pm

This is from a 32 bit kernel.

This doesnt match your previous /proc/meminfo (from a 64bit kernel on a 12 GB machine)

Of course, I would like /proc/vmallocinfo on your loaded router, not from

--

From: Paweł Staszewski
Date: Friday, June 26, 2009 - 3:06 am

Yes sorry for no info about it.
I test the same kernel configurations on one 32bit machine and second 64bit

here is meminfo from this 32bit machine working on kernel 2.6.30
cat /proc/meminfo
MemTotal:        3625444 kB
MemFree:         3043648 kB
Buffers:          133968 kB
Cached:            36316 kB
SwapCached:            0 kB
Active:           256868 kB
Inactive:          76252 kB
Active(anon):     163064 kB
Inactive(anon):        0 kB
Active(file):      93804 kB
Inactive(file):    76252 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       2758160 kB
HighFree:        2556136 kB
LowTotal:         867284 kB
LowFree:          487512 kB
SwapTotal:        995896 kB
SwapFree:         995896 kB
Dirty:              3624 kB
Writeback:             0 kB
AnonPages:        162912 kB
Mapped:             3612 kB
Slab:             235888 kB
SReclaimable:      46408 kB
SUnreclaim:       189480 kB
PageTables:          384 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2808616 kB
Committed_AS:     170648 kB
VmallocTotal:     122880 kB
VmallocUsed:        2876 kB
VmallocChunk:     109824 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB
DirectMap4k:        8184 kB
DirectMap4M:      901120 kB
and vmallocinfo

cat /proc/vmallocinfo
0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfe6a000 ioremap
0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46 
phys=dfef5000 ioremap
0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef2000 ioremap
0xf800c000-0xf800e000    8192 
acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfefb000 ioremap
0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef4000 ioremap
0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46 ...
From: Eric Dumazet
Date: Friday, June 26, 2009 - 3:34 am

Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.

Even 2.6.30 is too young for a production machine.

2.6.29.5 contains the fixes, Pawel, did you tried this version ?
--

From: Paweł Staszewski
Date: Friday, June 26, 2009 - 3:47 am

No problem with this test i lost only one test failover and no traffic 
I alvays make like this - i have iBGP mesh with main access path of 
machines on stable 2.6.28.9 kernels and second failover path  on 
machines  that use  newest kernel for  testing in this case  2.6.29 but 
I will try 2.6.29.5 today

Thanks
Paweł Staszewski
--

From: Eric Dumazet
Date: Friday, June 26, 2009 - 3:52 am

OK thanks

Please report (while machine has enough load) output of

rtstat -c20 -i1

(rtstat is a symbolic link to lnstat, if not provided by your distro)

--

From: Paweł Staszewski
Date: Friday, June 26, 2009 - 10:26 am

here you have
rtstat  -i 1 -c 20
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| 
out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  
an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| 
_search|t_search|
   93362|22930850| 1671866|       0|    1369|       2|       0|       
0|   53432|    1324|       0|       0|       0|       0|       0| 
4783896|   11985|
   92067|  101426|    5315|       0|       2|       0|       0|       
0|     258|       2|       0|       0|       0|       0|       0|   
21893|       6|
   90561|  100094|    4666|       0|       6|       0|       0|       
0|     267|       1|       0|       0|       0|       0|       0|   
23433|      30|
   90101|   98672|    5630|       0|       2|       0|       0|       
0|     253|       0|       0|       0|       0|       0|       0|   
24386|      34|
   89994|   99962|    5654|       0|       6|       0|       0|       
0|     266|       2|       0|       0|       0|       0|       0|   
26251|      38|
   95209|   91974|   14860|       0|       9|       0|       0|       
0|     236|      31|       0|       0|       0|       0|       0|   
14238|      35|
   95323|  101714|   10126|       0|      14|       0|       0|       
0|     255|       9|       0|       0|       0|       0|       0|    
8532|      21|
   94814|   99918|    8539|       0|       5|       0|       0|       
0|     258|       4|       0|       0|       0|       0|       0|   
11069|      24|
   98510|   93929|   12672|       0|      13|       0|       0|       
0|     238|      31|       0|       0|       0|       0|       0|   
12704|      34|
   98983|   96131|   11128|       0|      12|       0|       0|       
0|     252|      10|       0|     ...
From: Jarek Poplawski
Date: Friday, June 26, 2009 - 1:03 am

On the other hand, even if there is no problem with memory, it seems
because of hitting max_resize the threshold should be changed, e.g.
by reverting the patch below.

Jarek P.


commit 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Mon Mar 19 16:29:58 2007 -0700

    [IPV4]: fib_trie root node settings
    
    The threshold for root node can be more aggressive set to get
    better tree compression. The new setting mekes the root grow
    from 16 to 19 bits and substansial improvemnt in Aver depth
    this with the current table of 214393 prefixes
    
    But really the dynamic resize should need more investigation
    both in terms convergence and performance and maybe it should
    be possible to change...
    
    Maybe just for the brave to start with or we may have to back
    this out.

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5d2b43d..9be7da7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -292,8 +292,8 @@ static inline void check_tnode(const struct tnode *tn)
 
 static int halve_threshold = 25;
 static int inflate_threshold = 50;
-static int halve_threshold_root = 15;
-static int inflate_threshold_root = 25;
+static int halve_threshold_root = 8;
+static int inflate_threshold_root = 15;
 
 
 static void __alias_free_mem(struct rcu_head *head)
--

From: Robert Olsson
Date: Friday, June 26, 2009 - 2:19 am

Jarek Poplawski writes:

 > >> oprofile: using NMI interrupt.
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits

 > On the other hand, even if there is no problem with memory, it seems
 > because of hitting max_resize the threshold should be changed, e.g.
 > by reverting the patch below.

 You seem to have some temporary memory problem. So the printout might be
 a bit misleading in this case. We really like to keep the root node as big 
 as we can to keep the tree as flat as possible for performance reasons.
 (We're even more motivated now when we can disable the route cache)

 So I'll guess the next insert/delete inflates the root node to be within
 the interval. So I'll assume this just a temporary failure?

 I would be nice to have *threshholds* settable by /proc or /sys. I would
 use this in the other direction to trade memory for even faster lookups. 
 
 But maybe experts memory allocation has some good suggestions.

 Cheers.
					--ro
 
--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 2:37 am

Pawel has reported these problems for a long time:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

So, until it's fully investigated, it seems some 'fast' fix is needed
here.

Cheers,
Jarek P.
--

From: Jorge Boncompte [DTI2]
Date: Friday, June 26, 2009 - 3:26 am

I have never reported these problems but I am definitely seeing the same
message on kernel 2.6.29.5, usually, when one of my BGP peers goes down. So,
just a "me too".

	Regards,

		Jorge

-----------------
[ 1198.333854] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1198.437028] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1198.460848] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.240223] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.279723] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.383081] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.154893] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.191711] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.223242] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.270299] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.355795] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.239254] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.271995] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.349351] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.384676] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.428801] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.457315] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.485710] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.513691] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.039681] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.069224] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.108840] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.141450] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.172317] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.197824] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.224711] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.251566] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.289603] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.561178] Fix ...
From: Robert Olsson
Date: Friday, June 26, 2009 - 5:42 am

Jarek Poplawski writes:

 > >  But maybe memory allocation experts has some good suggestions.
 > 
 > Pawel has reported these problems for a long time:
 > http://bugzilla.kernel.org/show_bug.cgi?id=6648
 > 
 > So, until it's fully investigated, it seems some 'fast' fix is needed
 > here.

 We talked about having a fixed pre-allocated root-node long ago but it's only 
 optimisation for routers w. full BGP. Best if memory problems got solved.
 
 Cheers
						--ro
--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 5:54 am

I think the current process of rebalancing can allocate and hold
unnecessarily long a lot of 'temp' memory, so probably something
like the patch below could be useful. It should be applied to the
2.6.30 after two patches below (from 2.6.31-rc). (Alas I can't even
compile-test it now).

Cheers,
Jarek P.

--- (for testing)

 net/ipv4/fib_trie.c |   24 ++++++++++++++++++------
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..c2fc862 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1020,19 +1028,23 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 		tnode_put_child_reorg((struct tnode *)tp, cindex,
 				      (struct node *)tn, wasfull);
 
-		tp = node_parent((struct node *) tn);
+		synchronize_rcu();
 		tnode_free_flush();
+		tp = node_parent((struct node *) tn);
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }

---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix ...
From: Jarek Poplawski
Date: Friday, June 26, 2009 - 6:28 am

Alternatively here is a faster version with less synchronize_rcu().

Jarek P.

--- (take 2 - for testing)

 net/ipv4/fib_trie.c |   27 +++++++++++++++++++++------
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..2936b2e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }



---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing
    
    While doing trie_rebalance(): resize(), inflate(), halve() RCU free
    tnodes before updating their parents. It depends on RCU delaying the
    real destruction, but if RCU readers start after call_rcu() and before
   ...
From: Robert Olsson
Date: Friday, June 26, 2009 - 6:52 am

Jarek Poplawski writes:

 Thanks, 

 Should be worth testing so we synchronize_rcu instead of doing call_rcu's
 
 Cheers
					--ro


 > Alternatively here is a faster version with less synchronize_rcu().
 > 
 > Jarek P.
 > 
 > --- (take 2 - for testing)
 > 
 >  net/ipv4/fib_trie.c |   27 +++++++++++++++++++++------
 >  1 files changed, 21 insertions(+), 6 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..2936b2e 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 >  	vfree(tn);
 >  }
 >  
 > +static void __tnode_free(struct tnode *tn)
 > +{
 > +	if (size <= PAGE_SIZE)
 > +		kfree(tn);
 > +	else
 > +		vfree(tn);
 > +}
 > +
 >  static void __tnode_free_rcu(struct rcu_head *head)
 >  {
 >  	struct tnode *tn = container_of(head, struct tnode, rcu);
 > @@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 >  	while ((tn = tnode_free_head)) {
 >  		tnode_free_head = tn->tnode_free;
 >  		tn->tnode_free = NULL;
 > -		tnode_free(tn);
 > +		__tnode_free(tn);
 >  	}
 >  }
 >  
 > @@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  				      (struct node *)tn, wasfull);
 >  
 >  		tp = node_parent((struct node *) tn);
 > -		tnode_free_flush();
 >  		if (!tp)
 >  			break;
 >  		tn = tp;
 >  	}
 >  
 > +	if (tnode_free_head) {
 > +		synchronize_rcu();
 > +		tnode_free_flush();
 > +	}
 > +
 >  	/* Handle last (top) tnode */
 > -	if (IS_TNODE(tn))
 > +	if (IS_TNODE(tn)) {
 >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 > -
 > -	rcu_assign_pointer(t->trie, (struct node *)tn);
 > -	tnode_free_flush();
 > +		rcu_assign_pointer(t->trie, (struct node *)tn);
 > +		synchronize_rcu();
 > +		tnode_free_flush();
 > +	} else {
 > +		rcu_assign_pointer(t->trie, (struct node *)tn);
 > +	}
 >  
 >  	return;
 >  }
 > 
 > 
 > 
 > ---
 > commit ...
From: Jarek Poplawski
Date: Friday, June 26, 2009 - 8:10 am

Alas take 2 (nor 1) doesn't compile, so here it is again.

Thanks,
Jarek P.
--- (take 3 - for testing)

 net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
 1 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1a4c4b7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }
--

From: Paul E. McKenney
Date: Friday, June 26, 2009 - 8:30 am

So the idea is to balance memory and latency, so that large changes
(those affecting the root node) get at least one synchronize_rcu(),
while smaller changes just use call_rcu(), correct?  This means that
the amount of memory awaiting an RCU grace period is limited, but
the algorithm avoids per-node synchronize_rcu() overhead.

If I understand the goal correctly, looks good!  (Give or take my
limited understanding of fib_trie and is usage, of course.)

--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 8:54 am

The goal is practically to replace all call_rcu() during
trie_rebalance() with synchronize_rcu() (except some freeing after
ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
memory after trie_rebalance() has finished, that's why there were
problems with enabled preemption. So this patch tries to do/force
this a bit earlier - at least before the top/largest node is
rebalanced.

Thanks,
--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 9:15 am

On the other hand, we could probably stay with call_rcu() plus only
one synchronize_rcu() before the top node's resize() if you think it's
enough here?

Thanks,
--

From: Paul E. McKenney
Date: Friday, June 26, 2009 - 9:23 am

Well, my first task is to understand the problem/goal.  ;-)

My guess from what you said above is that use of call_rcu(), when
combined with changes to the trie in rapid succession, is resulting
in excessive memory awaiting a grace period.  Is this the case, or am I
confused?

--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 9:45 am

Exactly! (I guess... ;-)

Thanks,
--

From: Paul E. McKenney
Date: Friday, June 26, 2009 - 10:05 am

;-)

In that case, simply invoking synchronize_rcu() every once and awhile
should take care of things.  This could be at the end of every large
trie operation, or you could even count the call_rcu() invocations and
do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
the amount of memory available.

							Thanx, Paul
--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 11:05 am

On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:

OK, for now the minimal change for testing (2.6.30 needs previously
mentioned two commits from 2.6.31-rc). (I guess I'll send it with a
changelog after net-next is opened.)

Thanks,
Jarek P.
--- (take 4 - for testing)

 net/ipv4/fib_trie.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..98b31a1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
 	t_key cindex, key;
-	struct tnode *tp;
+	struct tnode *tp, *oldtnode = tn;
 
 	key = tn->key;
 
@@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
+		/* force memory freeing after last changes */
+		if (oldtnode != tn)
+			synchronize_rcu();
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+	}
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();
--

From: Paul E. McKenney
Date: Friday, June 26, 2009 - 11:21 am

Looks promising to me!!!

--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 1:19 pm

Alas, after rethinking, there is one detail which bothers me. Those
largest allocs here are done with vmalloc and freed with RCU by
schedule_work(). So, I wonder if just because of this, the previous
version doing it directly isn't more reliable anyway. Of course, it's
my bad I didn't point it while describing the problem earlier. (I knew
I missed something...;-)

Thanks,
Jarek P.
--

From: Robert Olsson
Date: Friday, June 26, 2009 - 1:26 pm

Yes looks like a good solution but maybe it safest to synchronize unconditionally?


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..9cb8623 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1028,8 +1028,11 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
+		/* force memory freeing after last changes */
+		synchronize_rcu();
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+	}
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();

Cheers
						--ro

Jarek Poplawski writes:

 >  net/ipv4/fib_trie.c |    8 ++++++--
 >  1 files changed, 6 insertions(+), 2 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..98b31a1 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  {
 >  	int wasfull;
 >  	t_key cindex, key;
 > -	struct tnode *tp;
 > +	struct tnode *tp, *oldtnode = tn;
 >  
 >  	key = tn->key;
 >  
 > @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  	}
 >  
 >  	/* Handle last (top) tnode */
 > -	if (IS_TNODE(tn))
 > +	if (IS_TNODE(tn)) {
 > +		/* force memory freeing after last changes */
 > +		if (oldtnode != tn)
 > +			synchronize_rcu();
 >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 > +	}
 >  
 >  	rcu_assign_pointer(t->trie, (struct node *)tn);
 >  	tnode_free_flush();
 > --
 > To unsubscribe from this list: send the line "unsubscribe netdev" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 1:37 pm

Hmm... I lost around half an hour for this doubt... Sure! (Unless
there are some strange cases which very often create and destroy very
small tables?)

Thanks,
--

From: Jarek Poplawski
Date: Friday, June 26, 2009 - 2:20 pm

...or maybe even only updating such small tables very often?

Btw., Robert, I wondered about some design details lately, especially
about pointer to a parent. I didn't see it in the basic docs, so it
seems it could be avoided. It seems to be a problem with RCU, unless I
miss something: if there were no going back from children to parents
it seems we could free those "temporary" (created by inflate() and
halve() and destroyed before resize() has finished) earlier.

Another problem with this, it seems, are possibly false lookups (if we
go back to the new parent which doesn't have it's parent or other nodes
updated). So, was there so much performance gain to introduce this?

Thanks,
Jarek P.
--

From: Jarek Poplawski
Date: Saturday, June 27, 2009 - 12:20 pm

Robert, you and Eric pointed at memory problems, so I thought I missed
something. But after the second look I see "skipped node resize" should
show this, but it's always zero on these reports. So, isn't it possible
the current inflate_threshold_root is simply unreachable with some
conditions, at least within 10 loops?

Then these settable thresholds might be more useful here than memory
fixes, but here is some idea to try handle this automatically within
some limits. The patch below increases inflate_threshold_root (only)
up to ~50% of its initial value if needed, and should be able to go
back sometimes.

Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
some offsets.)

Thanks,
Jarek P.
---

 net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    ...
From: Jarek Poplawski
Date: Saturday, June 27, 2009 - 1:51 pm

On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:

A tiny adjustment in the last if...

Jarek P.
--- (take 2)

 net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
 
--

From: Paweł Staszewski
Date: Saturday, June 27, 2009 - 5:28 pm

Thanks Jarek

I apply this patch to 2.6.29.5
For some results we must wait to "rush hours" when there will be more 
traffic / routes :)



Regards
Paweł Staszewski



--

From: Robert Olsson
Date: Sunday, June 28, 2009 - 4:11 am

When testing please monitor size of root node and and aver depth

Cheers
				--ro
       	       	       	    	 

Jarek Poplawski writes:
 > On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
 > ...
 > > Then these settable thresholds might be more useful here than memory
 > > fixes, but here is some idea to try handle this automatically within
 > > some limits. The patch below increases inflate_threshold_root (only)
 > > up to ~50% of its initial value if needed, and should be able to go
 > > back sometimes.
 > > 
 > > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
 > > some offsets.)
 > 
 > A tiny adjustment in the last if...
 > 
 > Jarek P.
 > --- (take 2)
 > 
 >  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 >  1 files changed, 16 insertions(+), 7 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..1dc1bb4 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 >  static const int inflate_threshold = 50;
 >  static const int halve_threshold_root = 8;
 >  static const int inflate_threshold_root = 15;
 > +static int inflate_threshold_root_fix;
 >  
 >  
 >  static void __alias_free_mem(struct rcu_head *head)
 > @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  	/* Keep root node larger  */
 >  
 >  	if (!tn->parent)
 > -		inflate_threshold_use = inflate_threshold_root;
 > +		inflate_threshold_use = inflate_threshold_root +
 > +					inflate_threshold_root_fix;
 >  	else
 >  		inflate_threshold_use = inflate_threshold;
 >  
 > @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  	}
 >  
 >  	if (max_resize < 0) {
 > -		if (!tn->parent)
 > -			pr_warning("Fix inflate_threshold_root."
 > -				   " Now=%d size=%d bits\n",
 > -				   inflate_threshold_root, tn->bits);
 > -		else
 > +		if (!tn->parent) {
 > +			if ...
From: Paweł Staszewski
Date: Monday, June 29, 2009 - 12:57 am

Some fib_triestats - kernel.2.6.29.5 with first Jarek patch.

Dump every 10sec:

Mon Jun 29 11:54:31 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276978
        Prefixes:       290448
        Internal nodes: 66813
          1: 34703  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  
9: 1  18: 1
        Pointers: 691606
Null ptrs: 347816
Total size: 18403  kB

Counters:
---------
gets = 390981859
backtracks = 5332465
semantic match passed = 390452936
semantic match miss = 30198
null node hit= 375522207
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391017445
backtracks = 121012874
semantic match passed = 37565
semantic match miss = 0
null node hit= 261583
skipped node resize = 0

Mon Jun 29 11:54:41 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276976
        Prefixes:       290446
        Internal nodes: 66813
          1: 34703  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  
9: 1  18: 1
        Pointers: 691606
Null ptrs: 347818
Total size: 18403  kB

Counters:
---------
gets = 391061852
backtracks = 5334173
semantic match passed = 390532664
semantic match miss = 30199
null node hit= 375595706
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391097445
backtracks = 121039213
semantic match passed = 37570
semantic match miss = 0
null node hit= 261589
skipped node resize = ...
From: Robert Olsson
Date: Sunday, June 28, 2009 - 4:04 am

Jarek Poplawski writes:

 > Robert, you and Eric pointed at memory problems, so I thought I missed
 > something. But after the second look I see "skipped node resize" should
 > show this, but it's always zero on these reports. So, isn't it possible
 > the current inflate_threshold_root is simply unreachable with some
 > conditions, at least within 10 loops?
 > 
 > Then these settable thresholds might be more useful here than memory
 > fixes, but here is some idea to try handle this automatically within
 > some limits. The patch below increases inflate_threshold_root (only)
 > up to ~50% of its initial value if needed, and should be able to go
 > back sometimes.

 Yes we keep the old tnode size and the convergence interval was some
 of the concerns. That why this checks was added. Still we want to
 inflate the root node to a very max. 

 So this approach with halving or doubling tnodes towards the root
 node was the suggest by Dyntree paper. I asked Stefan (one of the
 authors) if we could get safe and very offensive settings. But 
 from what I understood there was no easy way to calculate this. 
 So any bright ideas in this area are very welcome. But we should
 also monitor size of root and average tree depth so we don't
 take an to defensive approach just to solve this case.

 The memory patches and "manual RCU" are also interesting to address
 the case with PREEMTP's.

 Inserts and deletes are also very fast due to the flat tree so I think
 we can "slow down" this a bit if need to be safe with all PREEMPT's.

 Thanks for giving this area energy.

 Cheers
					--ro


 > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
 > some offsets.)
 > 
 > Thanks,
 > Jarek P.
 > ---
 > 
 >  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 >  1 files changed, 16 insertions(+), 7 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..1dc1bb4 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ ...
From: Jarek Poplawski
Date: Sunday, June 28, 2009 - 5:03 am

On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:

Yes, but with this offensive approach it seems the current level of
warnings could be too alarming. Btw., because of a design flaw in my
current patch this _fix variable, which should be logically per trie/
table, can be reset by changes of other tables now, but I think it
all could be fine tuned more in the future. Of course if there are
people interested in testing/reporting this more.

Thanks,
Jarek P.
--

From: Jarek Poplawski
Date: Sunday, June 28, 2009 - 7:35 am

On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:

Since 2.6.29 looks like prefered here, and there were a lot of takes
in this thread, I attach below a combined all-in-one patch including:
- 2.6.29 -> 2.6.30 preemption disable patch
- 2 RCU vs. preemption fixes from 2.6.31-rc
- "manual RCU" patch to force vfree/kfree before root's resize (take 3)
- "automatic" inflate_threshold_root fix (take 2)

Thanks,
Jarek P.

--- (for 2.6.29.x or even .28 or .27; any testing appreciated)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-06-28 15:45:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -315,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -363,6 +367,17 @@ static void __tnode_vfree(struct work_st
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ ...
From: Paweł Staszewski
Date: Sunday, June 28, 2009 - 8:32 am

After 18 hours from apply first Jarek patch i have no info about Fix 
inflate_threshold_root
even if i make: "clear ip bgp *" on router
So i change Jarek patch from previous to this new one for test and we 
will see ...

Regards
Pawel Staszewski



--

From: Paweł Staszewski
Date: Sunday, June 28, 2009 - 8:48 am

After apply this patch something is wrong

Traffic is not forwarded
no info in dmesg / no info from bgp
and also i can't connect to bgpd process

I revert kernel to past version with first Jarek patch




--

From: Jarek Poplawski
Date: Sunday, June 28, 2009 - 12:56 pm

Thank you very much, Pawel, for trying this. I'm starting to look for
the reason. In the meantime try to get some fib_trie stats for Robert.

...
--

From: Jarek Poplawski
Date: Sunday, June 28, 2009 - 2:36 pm

To David Miller:
since among patches tested negatively by Pawel are current 2 fixes
from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
please withdraw them until they are tested alone. Thanks.

To Pawel:

Since checking this can take time I attach here a patch with only
changes which are currently in 2.6.31-rc. Of course, this part can be
broken as well, so it's up to you: if you could try it with caution
somewhere it would be very helpful; otherwise don't bother.

It could be applied to 2.6.29 with or without this currently working
patch.

Thanks,
Jarek P.
--- (for 2.6.29.x, .28 or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-06-28 23:06:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static ...
From: Paweł Staszewski
Date: Monday, June 29, 2009 - 1:08 am

Ok.
I applied this patch 15mins ago to 2.6.29.5 and now it's working - 
traffic is forwarded.

Some fib_triestats
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.29
        Max depth:      6
        Leaves:         277015
        Prefixes:       290493
        Internal nodes: 67115
          1: 35733  2: 13635  3: 9544  4: 4832  5: 2239  6: 1125  7: 5  
9: 1  18: 1
        Pointers: 686614
Null ptrs: 342485
Total size: 18396  kB

Counters:
---------
gets = 3956301
backtracks = 192497
semantic match passed = 3895955
semantic match miss = 133
null node hit= 4306948
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 3960981
backtracks = 2152441
semantic match passed = 4757
semantic match miss = 0
null node hit= 194997

--

From: Paweł Staszewski
Date: Monday, June 29, 2009 - 1:47 am

But
With all this patches i have the same problem with CPU load
Every time when route cache entries are purged cpu load is increasing 
from 1% to 40 / 80% it depends

I see that on 64bit machine when route cache entries are going down i 
have almost 80% load on each cpu where ethernet card is binded by 
smp_affinity
But on 32bit machine cpu load reported by mpstat is half that on 64bit 
machine
here is example from 32bit machine ( mpstat + rtstat -k entries )

Linux 2.6.29.5 (TM_02_C1)       06/29/09        _i686_  (2 CPU)

12:36:54     CPU    %usr   %nice    %sys %iowait    %irq   %soft  
%steal  %guest   %idle        RT CACHE ENTRIES (from rtstat)
12:36:57     all    0.00    0.00    0.00    0.00    1.51   15.08    
0.00    0.00   83.42        83346
12:36:58     all    0.00    0.00    0.00    0.00    1.01    7.58    
0.00    0.00   91.41        85988
12:36:59     all    0.00    0.00    0.00    0.00    0.50    1.01    
0.00    0.00   98.49        89979
12:37:00     all    0.00    0.00    0.50    0.00    0.00    1.51    
0.00    0.00   97.99        93652
12:37:01     all    0.00    0.00    0.00    0.00    0.00    2.01    
0.00    0.00   97.99        96533
12:37:02     all    0.00    0.00    0.00    0.00    0.51    1.01    
0.00    0.00   98.48        99451
12:37:03     all    0.00    0.00    0.00    0.00    0.00    2.49    
0.00    0.00   97.51        102018
12:37:04     all    0.00    0.00    0.00    0.00    0.00    1.52    
0.00    0.00   98.48        104153
12:37:05     all    0.00    0.00    0.00    0.00    0.00    1.01    
0.00    0.00   98.99        105979
12:37:06     all    0.00    0.00    0.00    0.00    0.00    1.01    
0.00    0.00   98.99        107684
12:37:07     all    0.00    0.00    0.00    0.00    0.00    1.53    
0.00    0.00   98.47        109070
12:37:08     all    0.00    0.00    0.00    0.00    0.00    1.51    
0.00    0.00   98.49        110462
12:37:09     all    0.00    0.00    0.00    0.00    0.00    1.52    
0.00    0.00   98.48        ...
From: Jarek Poplawski
Date: Monday, June 29, 2009 - 2:27 am

I guess Eric is thinking about this. Btw., two little suggestions:
it should be easier to track if these route cache reports stay in its
starting thread ("weird problem"?), and if you could send these
stats/logs as attachements or turn off line wrapping, please? ;-)

Thanks,
Jarek P.
--

From: Paweł Staszewski
Date: Monday, June 29, 2009 - 2:43 am

Sorry Jarek for combining problems :)
And yes i will apply next stats in attachements :)

--

From: Jarek Poplawski
Date: Monday, June 29, 2009 - 1:33 am

David, IMHO this fix is needed in net-2.6 even if it doesn't fix the
problem reported by Pawel (there could be still something more).

Pawel, I see you decided to test my previous patch, but try to add
this one on top.

Thanks,
Jarek P.
------------------->
ipv4: Fix fib_trie rebalancing, part 3

Alas current delaying of freeing old tnodes by RCU in trie_rebalance
is still not enough because we can free a top tnode before updating a
t->trie pointer.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/ipv4/fib_trie.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..00a54b2 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1021,6 +1021,9 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
+		if (!tp)
+			rcu_assign_pointer(t->trie, (struct node *)tn);
+
 		tnode_free_flush();
 		if (!tp)
 			break;
--

From: Paweł Staszewski
Date: Monday, June 29, 2009 - 2:51 am

I apply this patch

fib_triestats in attached file :)



From: Jarek Poplawski
Date: Monday, June 29, 2009 - 3:47 am

Great! But it would be nice to check if this (accidentally ;-) might
fix the previous problem, so I attach below the patch with "manual
RCU", which btw. (or even more important) should verify RCU use here.

It should be applied on top of this last "Fix..., part3". And
again: it's quite probable it can fail, so with caution, no hurry
(it can wait for quiet time)...

Many thanks,
Jarek P.
--------------------> (synchronize_rcu take 4)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-06-29 10:04:22.000000000 +0000
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_st
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,21 +1032,27 @@ static void trie_rebalance(struct trie *
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		if (!tp)
+		if (!tp) {
 			rcu_assign_pointer(t->trie, (struct node *)tn);
-
-		tnode_free_flush();
-		if (!tp)
 			break;
+		}
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 ...
From: Paweł Staszewski
Date: Monday, June 29, 2009 - 9:24 am

Jarek Poplawski pisze:
> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Pawe
From: Jarek Poplawski
Date: Monday, June 29, 2009 - 10:09 am

OK, I'll look at it again.

Thanks for testing!
Jarek P.
--

From: Jarek Poplawski
Date: Tuesday, June 30, 2009 - 12:09 am

Pawel, here is another try to check what's going on here, so just
like before, but this one on top of these 2 last working patches,
plus quite time... (Stats aren't necessary; if these are some doubts
let me know.)

Thanks,
Jarek P.
--------------------> (synchronize_rcu take 5)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-06-30 06:50:35.000000000 +0000
@@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie *
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();
+	synchronize_rcu();
 
 	return;
 }
--

From: Paweł Staszewski
Date: Tuesday, June 30, 2009 - 1:16 pm

Apply and tested


--

From: Jarek Poplawski
Date: Tuesday, June 30, 2009 - 1:41 pm

A little comment: these last 2 patches weren't exactly to fix the
problem you reported, which should be mostly fixed by the earlier
patch.

There is some other bug, which you omit with CONFIG_PREEMPT_NONE
(but it's not for sure there is no by effects). So, I'd like to be
sure you are willing and can (without too much risk) to do more such
tests. Alas I've no way to generate similar conditions so it would
simply have to wait for somebody else.

Many thanks again,
Jarek P.
--

From: Paweł Staszewski
Date: Tuesday, June 30, 2009 - 4:31 pm

Yes i can make tests like this.
My network is splited to test clients and other normal clients
so it's really no problem to make testing. - if testing clients working 
then traffic from normal clients is also switched to this router (but if 
traffic is not forwarded "like in this case" for testing clients then 
failover switching them to working router )

and other point to make this tests - is that - it is good to have all in 
linux kernel networking working well :)

Regards

--

From: Jarek Poplawski
Date: Tuesday, June 30, 2009 - 11:36 pm

On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:

It's extremely nice of you! On the other hand, this type of change
was planned to the net-next to fix possible memory problems, which
might have happened to you as well. So you'd probably experience this
problem in the future (2.6.32) anyway.

So here is the first of 2 patches (the second in a separate message),
which should be tested separately, each one applied on top of the
2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
one. So, they are again all-in-one, to eclude any misunderstanding.

Btw., I assume there were no oopses, warnings or lockups after those
previous non-working patches - only no routing/forwarding.

Thanks,
Jarek P.
----------> (synchronize take 6 all-in-one for 2.6.29x, .28, or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 05:30:50.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 05:15:37.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = ...
From: Jarek Poplawski
Date: Tuesday, June 30, 2009 - 11:41 pm

On Wed, Jul 01, 2009 at 06:36:51AM +0000, Jarek Poplawski wrote:

The second patch.

Jarek P.
----------> (synchronize take 7 all-in-one for 2.6.29x, .28, or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-01 06:17:08.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 05:27:02.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 ...
From: Jarek Poplawski
Date: Wednesday, July 1, 2009 - 12:24 am

It looks like Cc was shortened BTW, but I guess at least Robert is
interested in this testing, so I add him back.

Cheers,
--

From: Paweł Staszewski
Date: Wednesday, July 1, 2009 - 2:43 am

Yes on on previous patches there was / no warnings / no oopses or lockups

But now i apply this patch and i make more testing.
First boot with start of bgpd and - traffic is not forwarded
So i start to search and make only some routes (static without bgpd) 
thru this host
And all is working for this host when i make all by static routes.

So i change a little my bgp configuration and make default route to only 
one of my iBGP peers and start bgpd process
All is working and what is weird is number of routes in kernel table.
Kernel is learning routes from bgpd but very slowly - really very slowly.

In attached file there are some fib_triestats after 5min of traffic.

Without this patch (normally)
total size: reported by fib_triestats in less that 1sec is:  "Total 
size: 35769  kB"

But with this patch
Total size is growing up and in 5 min of traffic it grow to only:  
"Total size: 1005  kB"

Regards

--

From: Paweł Staszewski
Date: Wednesday, July 1, 2009 - 2:50 am

From: Jarek Poplawski
Date: Wednesday, July 1, 2009 - 3:13 am

On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote:

Pawel, this is really very helpful! So, this is (probably) only about
timing, not wrong memory freeing. On the other hand this test was only
for inserts. Btw., if you didn't start the second test, you can skip
it. I have to rethink this.

Many thanks,
Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, July 1, 2009 - 4:04 am

So, after your findings I'm about to recommend sending to -stable
3 patches from net-2.6, with additional lowering of threshold_root
settings, but it would be nice if you could give it a try with
CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
your other apps!) It is expected to work this time...;-) Maybe a
bit slower.

Thanks,
Jarek P.
--------> (all-in-one preempt fixes to apply with vanilla 2.6.29.x)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-01 06:17:08.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 10:43:44.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -313,8 +316,8 @@ static inline void check_tnode(const str
 
 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = ...
From: Paweł Staszewski
Date: Wednesday, July 1, 2009 - 3:17 pm

>> On Wed, Jul 01, 2009 at 11:43:04AM +0200, Pawe
From: Jarek Poplawski
Date: Wednesday, July 1, 2009 - 10:32 pm

Hmm... It should, because you tested very similar patch already;-)

It could probably matter only if you're using some broken out-of-tree
patches. Otherwise the kernel is expected to work OK.

Btw., it would be also interesting to check if there is any difference
wrt. these route cache problems while PREEMPT is enabled.

Thanks,
Jarek P.
--

From: Paweł Staszewski
Date: Wednesday, July 1, 2009 - 10:43 pm

Yes i know there was almost identical one.
Im a little confused about using of PREEMPT kernel because of past
there was many oopses / lockups :) but yes that was a little long time ago.

--

From: Jarek Poplawski
Date: Wednesday, July 1, 2009 - 11:00 pm

Yes, it looks like we can't free memory so simple because of such huge

And you're very right! The place we're fixing is the best example. On
the other hand, I hope there is not many such places yet. But if we
test/fix it there will be one less...

Jarek P.
--

From: Robert Olsson
Date: Thursday, July 2, 2009 - 8:31 am

Jarek Poplawski writes:

 > Yes, it looks like we can't free memory so simple because of such huge
 > latencies.  

 Controlling RCU seems crucial. Insertion of the full BGP table increased
 from 2 seconds to > 20 min with one synchronize_rcu patches.

 And fib_trie "worst case" wrt memory is the root node. So maybe we should 
 monitor changes in root node and use this to control synchronize_rcu.

 Didn't Paul suggest something like this?

 And with don't find any decent solution we have to add an option for 
 a fixed and pre-allocated root-nod typically for BGP-routers.

 Cheers
					--ro
--

From: Jarek Poplawski
Date: Thursday, July 2, 2009 - 12:06 pm

I wish I knew this a few days before. I could imagine a slow down,
but it looked like it was stuck. Since these last changes weren't
tested on SMP + PREEMPT I thought there is still something broken.
(I was mainly interested in this synchronize_rcu at the moment as

Sure, and it needs testing, but we should send some safe preemption

Probably you're right; I'd prefer to see the test results showing
a difference vs. simply less aggressive root thresholds. But of
course, even if not convinced, I'll respect your choice as the author
and maintainer, so feel free to NAK my proposals - I won't get it
personally.;-)

Cheers,
Jarek P.
--

From: Robert Olsson
Date: Thursday, July 2, 2009 - 2:32 pm

Jarek Poplawski writes:

 > >  Controlling RCU seems crucial. Insertion of the full BGP table increased
 > >  from 2 seconds to > 20 min with one synchronize_rcu patches.
 > 
 > I wish I knew this a few days before. I could imagine a slow down,
 > but it looked like it was stuck. Since these last changes weren't
 > tested on SMP + PREEMPT I thought there is still something broken.
 > (I was mainly interested in this synchronize_rcu at the moment as
 > a preemption test.)  


 Honestly this huge slowdown was surprise for me too. I think I sent 
 you a script so you could insert the full table yourself.

 > >  And fib_trie "worst case" wrt memory is the root node. So maybe we should 
 > >  monitor changes in root node and use this to control synchronize_rcu.
 > > 
 > >  Didn't Paul suggest something like this?
 > 
 > Sure, and it needs testing, but we should send some safe preemption
 > fix for -stable first, don't we?
 
 Yes my hope was that we could combine them... personally I'll need 
 to understand who we can preeemted better in the different configs
 and most of that this can be handled by "standard" RCU.

 > >  And with don't find any decent solution we have to add an option for 
 > >  a fixed and pre-allocated root-nod typically for BGP-routers.
 > 
 > Probably you're right; I'd prefer to see the test results showing
 > a difference vs. simply less aggressive root thresholds. But of
 > course, even if not convinced, I'll respect your choice as the author
 > and maintainer, so feel free to NAK my proposals - I won't get it
 > personally.;-)

 Thresholds we can change no problem... but very soon I'll people 
 will start routing without the route cache this at least in close
 to Internet core ,we will need all fib_look performance we can get.

 fib_trie was designed for classical RCU and no preempt you see the
 names i file... so this new and very challenging work to all of us.
 
 First week of vacation and have to fix the roof of the house...
 it's hot and ...
From: Jarek Poplawski
Date: Thursday, July 2, 2009 - 3:13 pm

I can't remember this script, but I guess my hardware should be

I mean changing thresholds as a temporary solution, until we can
control memory freeing; and it seems to me, even excluding the root
node, there could be a lot of temporary allocations during all those


Have a nice time,
Jarek P.
--

From: Paweł Staszewski
Date: Saturday, July 4, 2009 - 5:26 pm

Ok kernel configured with CONFIG_PREEMPT
and all this day work without any problems (with Jarek last patch).


So in attached file trere is fib_tirestats
I dont see any big change of (cpu load or faster/slower 
routing/propagating routes from bgpd or something else) - in avg there 
is from 2% to 3% more of CPU load i dont know why but it is - i change
from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL 
1 30"
always avg cpu load was from 2 to 3% more compared to "no preempt"

Regards

--

From: Paweł Staszewski
Date: Saturday, July 4, 2009 - 5:30 pm

Oh

I forgot - please Jarek give me patch with sync rcu and i will make test 
on preempt kernel

Thanks
Paweł Staszewski


--

From: Jarek Poplawski
Date: Sunday, July 5, 2009 - 9:20 am

Probably non-preempt kernel might need something like this more, but
comparing is always interesting. This patch is based on Paul's
suggestion (I hope).

Thanks,
Jarek P.
---> (synchronize take 7; apply on top of the 2.6.29.x with the last
	all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..fce8238 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,6 +164,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +394,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +407,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * 128) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)
--

From: Jarek Poplawski
Date: Sunday, July 5, 2009 - 10:32 am

Hold on ;-) Here is something even better... Syncing after 128 pages
might be still too slow, so here is a higher initial value, 1000, plus
you can change this while testing in:

/sys/module/fib_trie/parameters/sync_pages

It would be interesting to find the lowest acceptable value.

Jarek P.
---> (synchronize take 8; apply on top of the 2.6.29.x with the last
 	all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..decc8d0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
 #include <linux/netlink.h>
 #include <linux/init.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +411,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)
--

From: Paul E. McKenney
Date: Sunday, July 5, 2009 - 2:32 pm

Looks like a promising approach to me!

--

From: Jarek Poplawski
Date: Sunday, July 5, 2009 - 3:23 pm

Hmm... As a matter of fact, I'm a bit sceptical now: I'm worrying this
synchronize_rcu done at the lowest acceptable rate could be actually
mostly idle or on the contrary too late. Probably some more complex
(per cpu?) accounting would be necessary to really matter here, but
on the other hand these problems weren't reported often enough.

Thanks,
--

From: Paweł Staszewski
Date: Sunday, July 5, 2009 - 4:53 pm

kernel 2.6.29.5 preempt
bgp starts normal and kernel know routes normaly like without patch

Here are some fib_triestats

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277888
        Prefixes:       291399
        Internal nodes: 66818
          1: 33080  2: 14584  3: 10788  4: 4911  5: 2185  6: 900  7: 
366  8: 3  17: 1
        Pointers: 595584
Null ptrs: 250879
Total size: 18072  kB

Counters:
---------
gets = 1052940
backtracks = 55985
semantic match passed = 1034114
semantic match miss = 5
null node hit= 534415
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 1057636
backtracks = 1101307
semantic match passed = 4751
semantic match miss = 0
null node hit= 195605
skipped node resize = 0




kernel 2.6.29.5 no-preempt
All is ok like with preempt kernel (andl all working in normal time 
"routes propagation")

cat /sys/module/fib_trie/parameters/sync_pages
1000


cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.45
        Max depth:      6
        Leaves:         277905
        Prefixes:       291416
        Internal nodes: 66863
          1: 33119  2: 14594  3: 10782  4: 4911  5: 2187  6: 901  7: 
365  8: 3  17: 1
        Pointers: 595654
Null ptrs: 250887
Total size: 18074  kB

Counters:
---------
gets = 1060650
backtracks = 53161
semantic match passed = 1041008
semantic match miss = 12
null node hit= 504478
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: ...
From: Jarek Poplawski
Date: Monday, July 6, 2009 - 2:02 am

On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:

14 == 10!? ;-)

Hmm... So, it's better than I expected; syncing after 128 or 256 pages
could be quite reasonable. But then it would be interesting to find
out if with such a safety we could go back to more aggressive values
for possibly better performance. So here is 'the same' patch (so the
previous, take 8, should be reverted), but with additional possibility
to change:
/sys/module/fib_trie/parameters/inflate_threshold_root

I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15
can give faster lookups (or lower cpu loads); with this these inflate
warnings could be back btw.; or maybe you'll find something in between
like inflate_threshold_root 20 is optimal for you. (I think it should be
enough to try this only for PREEMPT_NONE unless you have spare time ;-)

Thanks,
Jarek P.
---> (synchronize take 9; apply on top of the 2.6.29.x with the last
  	all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..e8fca11 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
 #include <linux/netlink.h>
 #include <linux/init.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn)
 
 static const int ...
From: Paweł Staszewski
Date: Tuesday, July 7, 2009 - 3:56 pm

Applied to 2.6.29.5 preempt/no-preempt and tested: - with preempt i make 
only one test with sync_pages = 256 to check that is working :)

So here are some tests for different sync_pages size.

echo 1 > /sys/module/fib_trie/parameters/sync_pages
I stop count after 1minute - total size still rising :)

echo 2 > /sys/module/fib_trie/parameters/sync_pages
Total size in fib_triestats reach maximum in 33sec

echo 3 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 31sec

echo 4 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 23sec

echo 8 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 17sec

echo 16 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 14 sec

echo 32 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 14 sec

So i see in prev tests i make something wrong in time counting
So i modify test script and make tests again:

echo 64 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 13 sec

echo 128 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 10 sec

echo 256 > /sys/module/fib_trie/parameters/sync_pages
Total size reach max in 10 sec

And for sync_paqges >256 time for propagating routes is always 10sec.

Also today i have many messages in dmesg like this:

And after tune :
/sys/module/fib_trie/parameters/inflate_threshold_root
no more info :)

Regards

--

From: Jarek Poplawski
Date: Tuesday, July 7, 2009 - 4:50 pm

This is something new and a bit surprising to me: the same threshold
in previous tests didn't generate this? Do you mean more than: 

With what value?

Pawel, let's say that current defaults are:
inflate_threshold_root 25 sync_pages 256

I'd like you to try to check if e.g.:
inflate_threshold_root 15 sync_pages 256
can give you any visible or subjective difference worth tweaking it
at all? (These stats from the next messages don't show this enough.)
You don't need to hurry with this...

Thanks,
Jarek P.
--

From: Paweł Staszewski
Date: Thursday, July 9, 2009 - 1:34 pm

Yes. Sorry for that - this info was not all the day but only 5 minutes 
when i was making tests.
When i set 35 as inflate_threshold_root there was no info even if all 
iBGP peers was down/up.

But i start to search when i have info about "Fix inflate_threshold_root"
And i see that the best is set this to 20 for me i have no info then in 
normal router operation / without down/up bgp peers many times in short 
I will try to make more accurate tests in weekend.

Regards

--

From: Jarek Poplawski
Date: Tuesday, July 14, 2009 - 12:41 pm

So it looks like the patch tested earlier could be still useful; after
changing the inflate_threshold_root it seems these warnings should be
very rare but there is no reason to alarm users with something they
can't fix optimally, anyway.

Thanks,
Jarek P.
--------------------->
ipv4: Fix inflate_threshold_root automatically

During large updates there could be triggered warnings like: "Fix
inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
node isn't finished in 10 loops. It should be much rarer now, after
changing the threshold from 15 to 25, and a temporary problem, so
this patch tries to handle it automatically using a fix variable to
increase by one inflate threshold for next root resizes (up to the 35
limit, max fix = 10). The fix variable is decreased when root's
inflate() finishes below 7 loops (even if some other, smaller table/
trie is updated -- for simplicity the fix variable is global for now).

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-13 13:32:53.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-07-13 15:16:18.000000000 +0200
@@ -327,6 +327,8 @@ static const int inflate_threshold = 50;
 static const int halve_threshold_root = 15;
 static const int inflate_threshold_root = 25;
 
+static int inflate_threshold_root_fix;
+#define INFLATE_FIX_MAX 10	/* a comment in resize() */
 
 static void __alias_free_mem(struct rcu_head *head)
 {
@@ -617,7 +619,8 @@ static struct node *resize(struct trie *
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -641,15 +644,27 @@ static struct node ...
From: Robert Olsson
Date: Wednesday, July 15, 2009 - 12:43 am

Jarek Poplawski writes:


Looks good. Maybe we're getting close to some generic solution to take 
a very optimistic approach wrt thresholds for root node and adjust to 
settings without the warning. Or maybe now even remove warning totally
with stata counter?

Can we even consider some other different strategy for bumping up the root 
node. 

We need all lookup performance we can get when we now try to route without 
the route cache. And we probably need to evaluate the cost for the multiple 
lookups again at least for LOCAL and MAIN when we talking routing well at 
least straight-forward simple routing. (Semantic change)

I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns 
we see in/close to Internet core. This w/o route cache on our hi-end opterons
with 8 CPU cores using niu and ixgbe. I'll test again and your patches when
I'm back from vacation.

Cheers
					--ro

 > So it looks like the patch tested earlier could be still useful; after
 > changing the inflate_threshold_root it seems these warnings should be
 > very rare but there is no reason to alarm users with something they
 > can't fix optimally, anyway.
 > 
 > Thanks,
 > Jarek P.
 > --------------------->
 > ipv4: Fix inflate_threshold_root automatically
 > 
 > During large updates there could be triggered warnings like: "Fix
 > inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
 > node isn't finished in 10 loops. It should be much rarer now, after
 > changing the threshold from 15 to 25, and a temporary problem, so
 > this patch tries to handle it automatically using a fix variable to
 > increase by one inflate threshold for next root resizes (up to the 35
 > limit, max fix = 10). The fix variable is decreased when root's
 > inflate() finishes below 7 loops (even if some other, smaller table/
 > trie is updated -- for simplicity the fix variable is global for now).
 > 
 > Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
 > Reported-by: Jorge Boncompte ...
From: Jarek Poplawski
Date: Wednesday, July 15, 2009 - 6:05 am

I guess, we could, but maybe let's wait a bit to make sure there is

Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if
tests show there is a better strategy we should go for it.

Thanks,
Jarek P.
--

From: Robert Olsson
Date: Friday, July 17, 2009 - 1:08 am

Jarek Poplawski writes:
 > On Wed, Jul 15, 2009 at 09:43:11AM +0200, Robert Olsson wrote:

 > > a very optimistic approach wrt thresholds for root node and adjust to 
 > > settings without the warning. Or maybe now even remove warning totally
 > > with stata counter?
 > 
 > I guess, we could, but maybe let's wait a bit to make sure there is
 > nothing surprising?

Yes if Pawel is running it we we'll get reports. I've no chance to upgrade
any of our routers now.  I've seen this printout in one our routers but we 
don't do "clear ip bgp *" to often and besides we try to use soft re-
configuration inbound.


 > > I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns 
 > > we see in/close to Internet core. This w/o route cache on our hi-end opterons
 > > with 8 CPU cores using niu and ixgbe. I'll test again and your patches when
 > > I'm back from vacation.
 > > 

 > Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if
 > tests show there is a better strategy we should go for it.

Routing without route cache is "new" area probably for minority of systems
were caching is not possible. Read BGP routers in core. 

Yes we should have safe defults. Thanks for all your work.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>

Cheers 
					--ro
--

From: David Miller
Date: Monday, July 20, 2009 - 7:41 am

From: Jarek Poplawski <jarkao2@gmail.com>

Applied.
--

From: Paweł Staszewski
Date: Tuesday, July 7, 2009 - 4:23 pm

Jarek Poplawski pisze:
> On Mon, Jul 06, 2009 at 01:53:49AM +0200, Pawe
From: Paweł Staszewski
Date: Tuesday, July 7, 2009 - 4:30 pm

i forgot to add:
Traffic when i make test was +/- 10Mbit/s in next tests:
eth0:         RX: 231.21 Mb/s          TX: 287.40 Mb/s    

--

From: Jarek Poplawski
Date: Tuesday, July 14, 2009 - 11:33 am

Below is a simpler version of this patch, without the sysfs parameter.

------------------------>
ipv4: Use synchronize_rcu() during trie_rebalance()

During trie_rebalance() we free memory after resizing with call_rcu(),
but large updates, especially with PREEMPT_NONE configs, can cause
memory stresses, so this patch calls synchronize_rcu() in
tnode_free_flush() after each sync_pages to guarantee such freeing
(especially before resizing the root node).

The value of sync_pages = 128 is based on Pawel Staszewski's tests as
the lowest which doesn't hinder updating times. (For testing purposes
there was a sysfs module parameter to change it on demand, but it's
removed until we're sure it could be really useful.)

The patch is based on suggestions by: Paul E. McKenney
<paulmck@linux.vnet.ibm.com>

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/ipv4/fib_trie.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 63c2fa7..58ba9f4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,6 +164,14 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+/*
+ * synchronize_rcu after call_rcu for that many pages; it should be especially
+ * useful before resizing the root node with PREEMPT_NONE configs; the value was
+ * obtained experimentally, aiming to avoid visible slowdown.
+ */
+static const int sync_pages = 128;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +401,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 ...
From: David Miller
Date: Monday, July 20, 2009 - 7:41 am

From: Jarek Poplawski <jarkao2@gmail.com>

Applied.
--

From: Jarek Poplawski
Date: Tuesday, July 14, 2009 - 2:20 pm

While looking for other fib_trie problems reported by Pawel Staszewski
I noticed there are a few uses of tnode_get_child() and node_parent()
in lookups instead of their rcu versions.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
(this patch was prepared on top of my 2 today's fib_trie patches)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-14 20:40:39.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-07-14 22:41:26.000000000 +0200
@@ -1465,7 +1465,7 @@ static int fn_trie_lookup(struct fib_tab
 			cindex = tkey_extract_bits(mask_pfx(key, current_prefix_length),
 						   pos, bits);
 
-		n = tnode_get_child(pn, cindex);
+		n = tnode_get_child_rcu(pn, cindex);
 
 		if (n == NULL) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -1600,7 +1600,7 @@ backtrace:
 		if (chopped_off <= pn->bits) {
 			cindex &= ~(1 << (chopped_off-1));
 		} else {
-			struct tnode *parent = node_parent((struct node *) pn);
+			struct tnode *parent = node_parent_rcu((struct node *) pn);
 			if (!parent)
 				goto failed;
 
@@ -1813,7 +1813,7 @@ static struct leaf *trie_firstleaf(struc
 static struct leaf *trie_nextleaf(struct leaf *l)
 {
 	struct node *c = (struct node *) l;
-	struct tnode *p = node_parent(c);
+	struct tnode *p = node_parent_rcu(c);
 
 	if (!p)
 		return NULL;	/* trie with just one leaf */
--

From: David Miller
Date: Monday, July 20, 2009 - 7:41 am

From: Jarek Poplawski <jarkao2@gmail.com>

Applied.
--

From: Paweł Staszewski
Date: Saturday, July 4, 2009 - 5:31 pm

Sorry again no attachement.




Pawe
From: Jarek Poplawski
Date: Sunday, July 5, 2009 - 5:56 am

David & Robert,
below are my recommendations for -stable plus one more patch:

On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:

So after these patches from net-2.6 are tested both for PREEMPT and
PREEMPT_NONE I think they should go to -stable:

2.6.30 needs:
-------------

commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing

commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2

commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Tue Jun 30 12:47:19 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 3

plus the new patch below

    ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

2.6.29 needs:
-------------

this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Thu May 21 15:20:59 2009 -0700

    ipv4: Fix oops with FIB_TRIE

plus above mentionned patches for 2.6.30 (part 1 - 4)

-----------------

David, if possible, please add to all these "Fix... part 1 - 4":

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

This new patch below is intended only for -stable (and later for
net-next), because it doesn't meet rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).

Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/net/fib_triestat
Basic ...
From: Jarek Poplawski
Date: Sunday, July 5, 2009 - 6:08 am

(Take 2: Changelog spelling fixes, sorry.)

David & Robert,
below are my recommendations for -stable plus one more patch:

On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:

So after these patches from net-2.6 are tested both for PREEMPT and
PREEMPT_NONE I think they should go to -stable:

2.6.30 needs:
-------------

commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing

commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2

commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Tue Jun 30 12:47:19 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 3

plus the new patch below

    ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

2.6.29 needs:
-------------

this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Thu May 21 15:20:59 2009 -0700

    ipv4: Fix oops with FIB_TRIE

plus above mentionned patches for 2.6.30 (part 1 - 4)

-----------------

David, if possible, please add to all these "Fix... part 1 - 4":

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

This new patch below is intended only for -stable (and later for
net-next), because it doesn't meet rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).

Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 ...
From: David Miller
Date: Tuesday, July 7, 2009 - 7:42 pm

From: Jarek Poplawski <jarkao2@gmail.com>

I think if we' re going to toss this into -stable, we should
put it into net-2.6 too, and that's what I'm going to do.

Once this makes it's way to Linus I'll work on the -stable
submissions.

And I'll make sure to add the tested-by tags, as you mentioned.

Thanks!
--

From: Jarek Poplawski
Date: Tuesday, July 7, 2009 - 11:44 pm

It's your decision: I don't think this patch is worth any arguing
about (de)stabilizing. Btw., since -stable rules are less strict it
seems natural such patches with bug fixes should rather go net-next
-> -stable way, unless I miss something?

Thanks,
Jarek P.
--

From: Jarek Poplawski
Date: Monday, June 29, 2009 - 3:58 am

David, I guess you could add:

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

Thanks,
Jarek P.
--

From: David Miller
Date: Tuesday, June 30, 2009 - 12:48 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Done, and applied, thanks Jarek.
--

From: Jarek Poplawski
Date: Tuesday, June 30, 2009 - 1:14 pm

Btw., a little comment: there are still some issues while trying to
reclaim memory after synchronize_rcu, which means the algorithm is
buggy, or RCU use is still buggy, or maybe some timing because of
synchronize_rcu. Anyway, fib_trie still seems to be safe only with
CONFIG_PREEMPT_NONE, so I have no idea how this should be fixed in
-stables (or why people don't report more this BUG in 2.6.30)...

Thanks,
Jarek P.
--

From: Stephen Hemminger
Date: Friday, July 10, 2009 - 8:29 am

On Tue, 30 Jun 2009 12:48:49 -0700 (PDT)

This is probably in kernel bugzilla as well, so someone should
update:
  http://bugzilla.kernel.org/show_bug.cgi?id=6648

-- 
--

Previous thread: WebMail Quota Alert!!! by Webmail Technical Team on Thursday, June 25, 2009 - 7:17 am. (1 message)

Next thread: WebMail Quota Alert!!! by Webmail Technical Team on Thursday, June 25, 2009 - 6:18 am. (1 message)