Re: [NET] Add proc file to display the state of all qdiscs.

Previous thread: [PATCH 0/5] More const ops cleanups by Stephen Hemminger on Tuesday, September 1, 2009 - 10:25 pm. (10 messages)

Next thread: [PATCH] NET: Fix possible corruption in bpqether driver by Ralf Baechle on Wednesday, September 2, 2009 - 1:58 am. (2 messages)
From: Jarek Poplawski
Date: Wednesday, September 2, 2009 - 1:14 am

[Resent with fixed netdev@ address]


I think, tc should've no problem with displaying summary stats of
multiqueue qdiscs or even all of them separately, as mentioned by
Patrick. And, maybe I still miss something, but there should be

As I wrote earlier, it's a nice work, but since it makes a new API I
guess we shouldn't hurry with merging this patch, and publish it as an
RFC first.

Then my humble suggestions would be to reserve more space for most of
the columns to make it readable not only for scripts when more TX#,
bytes, packets etc. Users of non-default qdiscs would also miss things
like: q->ops->id, q->handle, and q->parent at least. Plus, as I
mentioned earlier, q->qstats.qlen update with q->q.qlen (or using it
directly) is needed.

--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 1:28 am

I made a patch, but for a 8 queue device (bnx2), here is the "tc -s -d qdisc" result :

$ tc -s -d qdisc show
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 51814 bytes 459 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0


Same name "eth0" is displayed, that might confuse parsers...

What naming convention should we choose for multiqueue devices ?

--

From: David Miller
Date: Wednesday, September 2, 2009 - 1:30 am

From: Eric Dumazet <eric.dumazet@gmail.com>

We could give an index field to multiple root qdiscs assigned
to a device.
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 5:30 am

Here is a patch then :)

Only point is that I am iterating from 0 to dev->real_num_tx_queues
instead of dev->num_tx_queues. I hope it's fine, because there are
allocated qdisc, but not really used.

Next patches to allow selective qdisc change/fetch (providing a TCA_QINDEX
selector value to kernel)

Thanks


[PATCH net-next-2.6] tc: report informations for multiqueue devices

qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
for multiqueue devices.

We use a new TCA_QINDEX attribute, to report queue index to user space.
iproute2 tc should be changed to eventually display this queue index as in :

$ tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0



Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/rtnetlink.h |    1
 net/sched/sch_api.c       |  118 ++++++++++++++++++++----------------
 2 files changed, 67 insertions(+), 52 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index ba3254e..b80e0f6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -490,6 +490,7 @@ enum
 	TCA_FCNT,
 	TCA_STATS2,
 	TCA_STAB,
+	TCA_QINDEX,
 	__TCA_MAX
 };
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 24d17ce..74cde83 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -35,9 +35,9 @@
 #include <net/pkt_sched.h>
 
 static int qdisc_notify(struct sk_buff *oskb, struct nlmsghdr *n, u32 clid,
-			struct Qdisc *old, struct Qdisc *new);
+			struct Qdisc *old, struct Qdisc *new, int qnum);
 static int tclass_notify(struct sk_buff *oskb, struct nlmsghdr *n,
-			 struct Qdisc *q, unsigned long cl, int event);
+			 ...
From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 5:48 am

Here is the iproute2 patch as well, to display queue indexes

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index ba3254e..b80e0f6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -490,6 +490,7 @@ enum
 	TCA_FCNT,
 	TCA_STATS2,
 	TCA_STAB,
+	TCA_QINDEX,
 	__TCA_MAX
 };
 
diff --git a/tc/tc_class.c b/tc/tc_class.c
index 9d4eea5..1bc4bc6 100644
--- a/tc/tc_class.c
+++ b/tc/tc_class.c
@@ -196,6 +196,13 @@ int print_class(const struct sockaddr_nl *who,
 	if (filter_ifindex == 0)
 		fprintf(fp, "dev %s ", ll_index_to_name(t->tcm_ifindex));
 
+	if (tb[TCA_QINDEX]) {
+		int qindex = 0;
+
+		memcpy(&qindex, RTA_DATA(tb[TCA_QINDEX]), MIN(RTA_PAYLOAD(tb[TCA_QINDEX]), sizeof(int)));
+		fprintf(fp, "qindex %d ", qindex);
+	}
+
 	if (t->tcm_parent == TC_H_ROOT)
 		fprintf(fp, "root ");
 	else {
diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index c7f2988..3ec2cf4 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -232,6 +232,14 @@ int print_qdisc(const struct sockaddr_nl *who,
 	fprintf(fp, "qdisc %s %x: ", (char*)RTA_DATA(tb[TCA_KIND]), t->tcm_handle>>16);
 	if (filter_ifindex == 0)
 		fprintf(fp, "dev %s ", ll_index_to_name(t->tcm_ifindex));
+
+	if (tb[TCA_QINDEX]) {
+		int qindex = 0;
+
+		memcpy(&qindex, RTA_DATA(tb[TCA_QINDEX]), MIN(RTA_PAYLOAD(tb[TCA_QINDEX]), sizeof(int)));
+		fprintf(fp, "qindex %d ", qindex);
+	}
+
 	if (t->tcm_parent == TC_H_ROOT)
 		fprintf(fp, "root ");
 	else if (t->tcm_parent) {
--

From: Stephen Hemminger
Date: Wednesday, September 2, 2009 - 9:23 am

On Wed, 02 Sep 2009 14:48:27 +0200

Even if the kernel keeps the value in something else for the API
QINDEX needs to be a fixed size unsigned type like u32.


--

From: Patrick McHardy
Date: Wednesday, September 2, 2009 - 9:30 am

Just to avoid duplicate work - I'm currently trying to put a patch
together that presents multiqueue qdiscs to userspace as regular
child qdiscs of a dummy qdisc, which also contains the aggregated
statistics. So far it actually looks pretty sane :)
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 9:50 am

Excellent !

I am cooking a patch to make vlan devices multiqueue aware
(ie having same tx_queue numbers than their real_dev)

--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 10:17 am

vlan devices are currently not multi-queue capable.

We can do that with a new rtnl_link_ops method,
get_tx_queues(), called from rtnl_create_link()

This new method gets num_tx_queues/real_num_tx_queues
from real device.

register_vlan_device() is also handled.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/rtnetlink.h  |    2 ++
 net/8021q/vlan.c         |    5 +++--
 net/8021q/vlan_netlink.c |   20 ++++++++++++++++++++
 net/core/rtnetlink.c     |   10 +++++++++-
 4 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 3c1895e..0525a1d 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -70,6 +70,8 @@ struct rtnl_link_ops {
 	size_t			(*get_xstats_size)(const struct net_device *dev);
 	int			(*fill_xstats)(struct sk_buff *skb,
 					       const struct net_device *dev);
+	int			(*get_tx_queues)(struct net*, struct nlattr *tb[],
+						 int *tx_queues, int *real_tx_queues);
 };
 
 extern int	__rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index e814794..8836575 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -330,12 +330,13 @@ static int register_vlan_device(struct net_device *real_dev, u16 vlan_id)
 		snprintf(name, IFNAMSIZ, "vlan%.4i", vlan_id);
 	}
 
-	new_dev = alloc_netdev(sizeof(struct vlan_dev_info), name,
-			       vlan_setup);
+	new_dev = alloc_netdev_mq(sizeof(struct vlan_dev_info), name,
+				  vlan_setup, real_dev->num_tx_queues);
 
 	if (new_dev == NULL)
 		return -ENOBUFS;
 
+	new_dev->real_num_tx_queues = real_dev->real_num_tx_queues;
 	dev_net_set(new_dev, net);
 	/* need 4 bytes for extra VLAN header info,
 	 * hope the underlying device can handle it.
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index e9c91dc..8ce4122 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -100,6 +100,25 @@ static int vlan_changelink(struct net_device ...
From: Patrick McHardy
Date: Wednesday, September 2, 2009 - 10:49 am

Looks great. This will also make it easier to test multiqueue
scheduling without a multiqueue-capable device :)
--

From: Brian Haley
Date: Wednesday, September 2, 2009 - 11:37 am

int			(*get_tx_queues)(struct net *net, struct nlattr *tb[],

Total nitpick, but tbird highlighted it because of the missing "net".

-Brian
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 11:55 am

Thats right, and I believe some gcc versions could complain

Thanks Brian

[PATCH net-next-2.6] vlan: multiqueue vlan device

vlan devices are currently not multi-queue capable.

We can do that with a new rtnl_link_ops method,
get_tx_queues(), called from rtnl_create_link()

This new method gets num_tx_queues/real_num_tx_queues
from real device.

register_vlan_device() is also handled.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/rtnetlink.h  |    2 ++
 net/8021q/vlan.c         |    5 +++--
 net/8021q/vlan_netlink.c |   20 ++++++++++++++++++++
 net/core/rtnetlink.c     |   10 +++++++++-
 4 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 3c1895e..3f4ab54 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -70,6 +70,8 @@ struct rtnl_link_ops {
 	size_t			(*get_xstats_size)(const struct net_device *dev);
 	int			(*fill_xstats)(struct sk_buff *skb,
 					       const struct net_device *dev);
+	int			(*get_tx_queues)(struct net *net, struct nlattr *tb[],
+						 int *tx_queues, int *real_tx_queues);
 };
 
 extern int	__rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index e814794..8836575 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -330,12 +330,13 @@ static int register_vlan_device(struct net_device *real_dev, u16 vlan_id)
 		snprintf(name, IFNAMSIZ, "vlan%.4i", vlan_id);
 	}
 
-	new_dev = alloc_netdev(sizeof(struct vlan_dev_info), name,
-			       vlan_setup);
+	new_dev = alloc_netdev_mq(sizeof(struct vlan_dev_info), name,
+				  vlan_setup, real_dev->num_tx_queues);
 
 	if (new_dev == NULL)
 		return -ENOBUFS;
 
+	new_dev->real_num_tx_queues = real_dev->real_num_tx_queues;
 	dev_net_set(new_dev, net);
 	/* need 4 bytes for extra VLAN header info,
 	 * hope the underlying device can handle it.
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index e9c91dc..8ce4122 ...
From: Stephen Hemminger
Date: Wednesday, September 2, 2009 - 12:01 pm

On Wed, 02 Sep 2009 20:55:24 +0200


Nitpick again.  use unsigned unless you have some reason
to allow -1.
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 12:12 pm

No problem :)

Thanks Stephen

[PATCH net-next-2.6] vlan: multiqueue vlan device

vlan devices are currently not multi-queue capable.

We can do that with a new rtnl_link_ops method,
get_tx_queues(), called from rtnl_create_link()

This new method gets num_tx_queues/real_num_tx_queues
from real device.

register_vlan_device() is also handled.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/rtnetlink.h  |    3 +++
 net/8021q/vlan.c         |    5 +++--
 net/8021q/vlan_netlink.c |   20 ++++++++++++++++++++
 net/core/rtnetlink.c     |   10 +++++++++-
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 3c1895e..b630196 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -70,6 +70,9 @@ struct rtnl_link_ops {
 	size_t			(*get_xstats_size)(const struct net_device *dev);
 	int			(*fill_xstats)(struct sk_buff *skb,
 					       const struct net_device *dev);
+	int			(*get_tx_queues)(struct net *net, struct nlattr *tb[],
+						 unsigned int *tx_queues,
+						 unsigned int *real_tx_queues);
 };
 
 extern int	__rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index e814794..8836575 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -330,12 +330,13 @@ static int register_vlan_device(struct net_device *real_dev, u16 vlan_id)
 		snprintf(name, IFNAMSIZ, "vlan%.4i", vlan_id);
 	}
 
-	new_dev = alloc_netdev(sizeof(struct vlan_dev_info), name,
-			       vlan_setup);
+	new_dev = alloc_netdev_mq(sizeof(struct vlan_dev_info), name,
+				  vlan_setup, real_dev->num_tx_queues);
 
 	if (new_dev == NULL)
 		return -ENOBUFS;
 
+	new_dev->real_num_tx_queues = real_dev->real_num_tx_queues;
 	dev_net_set(new_dev, net);
 	/* need 4 bytes for extra VLAN header info,
 	 * hope the underlying device can handle it.
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index e9c91dc..cd5c2ae 100644
--- ...
From: David Miller
Date: Wednesday, September 2, 2009 - 6:04 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

Applied, but now I need you to do an audit :-)

I believe that drivers will change the number of TX queues
in use at times where we'll need to trigger an event or
something so that vlan's can learn about the value changing.
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 6:05 pm

Yes, I know :)

This is why I kept both num_tx_queues and real_num_tx_queues values,
and not the later :)

Thanks
--

From: Eric Dumazet
Date: Thursday, September 3, 2009 - 2:17 am

Here is a followup to enable 'true' multiqueue vlan xmit.

I missed this in my first patch.

[PATCH net-next-2.6] vlan: enable multiqueue xmits

vlan_dev_hard_start_xmit() & vlan_dev_hwaccel_hard_start_xmit()
select txqueue number 0, instead of using index provided by
skb_get_queue_mapping().

This is not correct after commit 2e59af3dcbdf11635c03f
[vlan: multiqueue vlan device] because
txq->tx_packets  & txq->tx_bytes changes are performed on
a single location, and not the right locking.

Fix is to take the appropriate struct netdev_queue pointer

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 53f84c7..3938c3e 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -291,7 +291,8 @@ static int vlan_dev_hard_header(struct sk_buff *skb, struct net_device *dev,
 static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev)
 {
-	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
+	int i = skb_get_queue_mapping(skb);
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
 	struct vlan_ethhdr *veth = (struct vlan_ethhdr *)(skb->data);
 
 	/* Handle non-VLAN frames if they are sent to us, for example by DHCP.
@@ -329,7 +330,8 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 static netdev_tx_t vlan_dev_hwaccel_hard_start_xmit(struct sk_buff *skb,
 						    struct net_device *dev)
 {
-	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
+	int i = skb_get_queue_mapping(skb);
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
 	u16 vlan_tci;
 
 	vlan_tci = vlan_dev_info(dev)->vlan_id;
--

From: David Miller
Date: Thursday, September 3, 2009 - 2:20 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Details details :-)

Looks good, applied!
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 10:31 am

There is one thing that bothers me with "tc -s -d qdisc", maybe
its the right time to speak

qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 54823 bytes 490 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 468 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0


The rate estimation is given by kernel even if no rate estimation is performed.

User space doesnt have an indication saying this 0 numbers are real or fake
(rate 0bit 0pps).

Could we set a bit at Qdisc level when a gen_new_estimator() is really
done one a Qdisc, so that we do not call gnet_stats_copy_rate_est()
from tc_fill_qdisc() if this bit is not set ?

->

qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 54823 bytes 490 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 468 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

--

From: Patrick McHardy
Date: Wednesday, September 2, 2009 - 10:50 am

That should be possible, I'll look into it once I've finished my patch.
--

From: Patrick McHardy
Date: Wednesday, September 2, 2009 - 6:18 am

This might confuse existing userspace since the handle is not unique
anymore. libnl f.i. will treat all but the first root qdisc as an
update and use it to update the state of the first one. There's also
no combined view for applications unaware of multiqueue.

Please have a look at the mail I just wrote for some possible ways
around this.
--

From: Eric Dumazet
Date: Wednesday, September 2, 2009 - 6:49 am

Hum, how can we combine infos on qdisc/class if in the future we allow each queue index
to have its own qdisc/classes ?

htb on queue index 0
cbq on queue index 1

Combining info would lock us and not allow for special configurations.
Say 
   macvlan device 0 mapped to queue index 0
   macvlan device 1 mapped to queue index 1...

For old apps, just give informations for queue 0 as we do now, and
allow kernel to give more informations only if new application provided a TCA_INDEX attribute
in its request ?

(-1 : all queue indexes,  >=0 for a given queue index)
--

From: Patrick McHardy
Date: Wednesday, September 2, 2009 - 7:07 am

My suggestion was to only dump the statistics in the combined
view and use a virtual qdisc, something like:

qdisc multiqueue 0: dev eth0 root queues 8
  Sent ...
  rate ...

and show each real qdisc as child of this qdisc:

qdisc pfifo_fast <unique handle> dev eth0 parent 0: bands 3 ...
qdisc pfifo_fast <unique handle> dev eth0 parent 0: bands 3 ...

Configuration would be symetrical to this:

tc qdisc add dev eth0 handle 0: root multiqueue
tc qdisc add dev eth0 handle x: parent 0: pfifo_fast
...

without the virtual multiqueue qdisc, the root qdisc would simply


If we don't combine the information, existing multiqueue unaware
applications will get incorrect information. There's also the
problem of non-unique handles, I think we should encode the queue
index in the handle instead of using a new attribute. This needs
a bit more thought though to avoid clashes with user-defined handles.
--

From: Jarek Poplawski
Date: Wednesday, September 2, 2009 - 3:17 pm

Actually, I wonder why it can't be a real "virtual" qdisc with
classes, similar to... multiq, doing such mappings inside, according
to the current api:

tc qdisc add dev eth0 handle 1: root multiqueue
tc qdisc add dev eth0 handle x: parent 1:1 pfifo_fast

Jarek P.

PS: but I'd prefer (more) different name, even mq, mtq or something.
--

From: Jarek Poplawski
Date: Wednesday, September 2, 2009 - 2:18 am

Hmm... anything could break here something for somebody, so there is
still a (Patrick's) question if not sum it all? Otherwise, I wonder
about using the qdisc handle (tcm_handle>>16): there would be at
least one "pfifo_fast 0:" looking like proper root for somebody...

Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, September 2, 2009 - 2:33 am

I meant "proper" for pfifo_fast. On the other hand, I wonder why these
multiqueue qdisc handles can't be really given such unique per dev
(instead of per queue) handles?

Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, September 2, 2009 - 2:37 am

should be:
--

From: Patrick McHardy
Date: Wednesday, September 2, 2009 - 5:44 am

In my opinion the best way would be to sum up the stats and display
them for the "main" qdisc to avoid any compatibility problems and
additionally dump each queue for informational purposes.

This raises a few more questions however. First of all, there's no
"main" qdisc, so if we just use the first one for the summed up stats,
the question is whether the parameters of all root qdiscs are the same.
Currently they should be since pfifo_fast doesn't support changing
parameters, but this might change in the future, in which case we
might be displaying "incorrect" parameters.

The next question would be whether and how to support changing
parameters of individual multiq qdiscs. Similar to dumps, when
changing qdisc parameters we always pick queue number 0. It we want
to support changing parameters of different queues, we could
replicate changes to all queues, which would avoid the problem
stated above, or we could add an optional identifier for the
queue number, or we could use a combination of both which only
replicates settings when no queue number is specified.

In the second and third case, one possibility to get around the
incorrect parameters for the "main" qdisc would be to display
a virtual (non-existant) qdisc as the root, which only contains
the stats. I don't think the default choice of root qdisc type
should be counted as belonging to the API and userspace should
be prepared to deal with this.

And finally, the TC API used to be symetrical in the sense that
you could replay a dump to the kernel to get the same configuration.
Dumping the individual queues would break that property unless
we also support creating and configuring them in a symetrical
fashion.

So here's something of a possible solution, assuming that at some
point we do want to support changing parameters for individual root
qdiscs and taking the above into account:

- when creating a qdisc on a multiqueue-device through the TC
  API, we keep the current behaviour and share it between all
  ...
From: Christoph Lameter
Date: Wednesday, September 2, 2009 - 11:13 am

eth0/tx<number> ?


--

From: Jesper Dangaard Brouer
Date: Thursday, September 3, 2009 - 7:19 am

Remember that we already have a naming convention in /proc/interrupts

  eth0-tx-<number>

Lets not introduce too many new once ;-)

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Patrick McHardy
Date: Thursday, September 3, 2009 - 7:29 am

The approach I'm currently working on will present multiqueue root
qdiscs as children of a dummy classful qdisc. This avoids handle
clashes and the need for new identifiers and allows to address each
qdisc seperately, similar to how it works with other classful qdiscs:

qdisc mq 1: root refcnt 2
 Sent 126 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

class mq 1:1 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:2 root leaf 8002:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:3 root leaf 8003:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:4 root leaf 8004:
 Sent 126 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

Its a bit tricky though so I'm likely going to need a few more hours.
--

From: Eric Dumazet
Date: Thursday, September 3, 2009 - 7:43 am

Seems pretty cool, thanks Patrick.
--

From: Jesper Dangaard Brouer
Date: Thursday, September 3, 2009 - 10:30 am

I like your approach. Its well suited for the qdiscs :-)

I especially like the possibility to access each qdisc seperately.  Does 
it then support having seperate qdisc per TX queue?  (I'm toying with the 
idea of transmitting our multicast traffic into/via a seperate TX hardware 
queue, and making a special qdisc for IPTV MPEG2-TS shaping)

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Patrick McHardy
Date: Thursday, September 3, 2009 - 10:56 am

Yes, you can attach qdiscs to the classes representing the queues.
At least it should work :) It would probably also be possible to
use TC classifiers for queue selection.

--

From: David Miller
Date: Thursday, September 3, 2009 - 4:31 pm

From: Patrick McHardy <kaber@trash.net>

You mean just like Intel's multiqueue scheduler and multiqueue
classifiers which we already have in the tree :-)
--

From: Patrick McHardy
Date: Thursday, September 3, 2009 - 6:36 pm

No, actually integrated in the current infrastructure :) The patches
add a virtual root qdisc, so we still have a single root from userspace's
POV. This qdisc is only used for sch_api purposes and doesn't necessarily
match the real root qdiscs. When a "mq" qdisc is attached (done by
default for multiqueue devices), the queues are exposed as child
classes of this qdisc and can be individually grafted with different
qdiscs.

Since the virtual root is global per device, what we could do is allow
to attach classifiers and execute them in dev_pick_tx() to select the
queue. But currently this would require global locking, so this would
have to be fixed previously. Probably not an easy thing to do.

--

From: Patrick McHardy
Date: Thursday, September 3, 2009 - 6:43 pm

Since its getting late and I won't finish this tonight, here are the
patches I'm currently working on in case someone already wants to have
a look. Its working as intended and should be testable. Patch 1 is good
shape IMO, patch 3 still needs some work. Userspace needs no changes.


From: David Miller
Date: Thursday, September 3, 2009 - 4:28 pm

From: Jesper Dangaard Brouer <hawk@diku.dk>

Indeed it sounds interesting.

But I want to warn everyone against adding support for anything
that adds a way to do operations "on every device qdisc" at the
kernel level.

I tried to do that and the error unwinding complexity is unacceptable.

We can make the tools support things like this.  Keep the kernel
simple and allow it to only operate on one queue at a time for
a given individual config request.

--

From: David Miller
Date: Thursday, September 3, 2009 - 4:22 pm

From: Jesper Dangaard Brouer <hawk@diku.dk>

Yes, please :-)
--

From: Christoph Lameter
Date: Wednesday, September 2, 2009 - 11:12 am

Ok. Can you come up with a patch? net/sched/sch_api.c can likely be
patches with the loop logic that Eric suggested earlier.
--

From: Jarek Poplawski
Date: Wednesday, September 2, 2009 - 12:49 pm

Yes, it's nice Eric and Patrick found time to fix it most properly.
 
Jarek P.
--

Previous thread: [PATCH 0/5] More const ops cleanups by Stephen Hemminger on Tuesday, September 1, 2009 - 10:25 pm. (10 messages)

Next thread: [PATCH] NET: Fix possible corruption in bpqether driver by Ralf Baechle on Wednesday, September 2, 2009 - 1:58 am. (2 messages)