Beyond fq_codel & cake

fq_codel and cake have been available for some time. Cake has received a lot of improvements since it has become available. Both appear to not be particularly adequate for consumer embedded routers with MIPS and ARM microprocessors and symmetric gigabit links. Wireless routers have to do even more work when they handle NAT and they are also used as access points. These devices can't handle 1 gbps of traffic when using cake.

Software offloading can't work when using ingress shaping. Hardware offloading is also not available on many routers.

Another problem with shaping and AQM is latency. Network equipment from the provider's network and ONTs can also introduce latency/bloat. More changes and improvements are required. Reducing the time required to process packets can help.

This post is meant to be a place to brainstorm. The goals of this is to discuss ways to reduce latency, to increase throughput and to improve fairness.

Increasing throughput for fq_codel has been explored here: https://github.com/dtaht/fq_codel_fast. Removing unneeded fields from structs and shrinking the structs should help increase throughput for fq_codel and cake. The CPUs found in these routers have small caches for data and instructions. A significant reduction in cache misses can have a huge impact on performance.

Both cake and fq_codel are DRR based, even if they've greatly improved the algorithm. Alternatives to DRR might be worth exploring for both. fq_codel might be easier to experiment with due to the fact that its code is simple. What algorithms would be worth exploring? A different algorithm which doesn't remove items from the head of lists to append them to the tail may be able to get better performance and improved fairness.

fq_codel lacks GSO splitting, NAT awareness and the improved hashing that cake has. It might be interesting to benchmark a modified fq_codel implementation.

Using cake on a PPPoE interface still leaves us with fq_codel on the physical network interface. Perhaps benchmark with a local PPPoE server and client would provide the answer to the question: is it better to have CAKE on the PPPoE device or on the actual physical device?

Cake could be improved for x86-64 machines to increase the number of queues per tin. This should avoid collisions and improve fairness for networks with more flows. It might also need to probe for unused flows less often via the 8-way associative hashing if the tin has more queues. The 8-way associative hashing could probably be improved to avoid packet reordering when flows become empty. I'm not sure how that could be implemented.

mac80211 uses a different implementation of fair queueing which has a quantum of 300 bytes - see https://elixir.bootlin.com/linux/latest/source/include/net/fq_impl.h#L353. This leads to a few rounds of going through the list of flows before the packet can be transmitted. This is probably not a problem for x86-64 based access points. The smaller embedded ARM based and MIPS based routers may have a hard time going above a certain rate. Perhaps a quantum of 600 would help a little while still being somewhat fair towards small packets.

From cursory observation it seems neither the fq-schedulers nor the actual AQM parts are all that demanding, but implementing a low latency traffic shaper is the elephant in the room*.
For egress there is at least work done of offloading traffic shaping to the NIC, with in combination with BQL might help, but for ingress shaping nothing seems to be in the works.

*) Not sure whether this is caused by the shere CPU cycles needed for shapin or by the rlative tight deadlines in which a shaper needs to acquire the CPU to maintain shaper rate without creating too large a queue in the NIC/driver.

1 Like

I think MIPS is on the way out, so we should accept that a <1GHz mips cpu will not be able to power a router for a 1gbps+ link, independent of any offloads. These routers had a decent run, but with IAS speeds above 100 Mbps they have reached their limit.
ARM is different as there exist cores of sufficient performance to shape gigabit traffic, as demonstrated by e.g. the raspberry 4B's quad arm a72 cpu @>=1400 GHz. IMHO it will take a while before SoC makers switch from their current roughly 10 year old arm core designs to something better, but compared to MIPS there is a viable path forward.

However in Europe ISPs started to roll out affordable (nominal) 10 or even 25 Gbps internet plans which will likely require quite modern high performance ARM or x86 cpus. That and/or competent traffic shaping offloading to NICs...
The challenge with offloads is that either everything including AQM and fq scheduler move into the NIC or we still need BQL to generate backpressure so we can keep doing fq and AQM in our qdisc. How all of this will work with multiqueue NICs is also an open question.

2 Likes

a lot of mt7621 devices with @arinc9 's phy mux change can reach gigabit now.

Above 1G definitely needs something beefier than MIPS.

1 Like

OK, maybe my claim for routing capability might have been too pessimistic*, but I am certain that none of the MIPSen will traffic shape anywhere close to 1 Gbps.

*) Since I do not know the patch, I assume that you are talking about routing performance here, and not just about being able to operate/feed the switch from the CPU at 1 Gbps full duplex, the big limitations show up once you need to do actual per-packet processing.

1 Like

right. there's not enough CPU power to do SQM near that speed.

2 Likes

With cake in traffic shaper mode fq_codel on the interface will never really see a queue and hence should not engage. But you have a point we might want to switch that back to fifo or bfifo to avoid any theoretical potential for packet reordering.

Yes and no, cake will always keep at least one (or two?) packets in each queue which under sustained load affects the per-queue service interval. Let's look at that for a hypothetical bit torrent load with 1024 parallel flows all trying to transfer maximally sized ethernet packets at 1 Gbps:

1024 [flows] * ((1538 [byte/packet] * 8 [bit/byte]) / 1000^3 [bit/sec.]) * 1000 [ms/sec.] = 12.599296 ms

so the delay between serving the same queue twice is 12ms which already is above the default 5ms delay target. From such limit observations I am not sure whether 1024 flows might not suffice for 1 Gbps links?

1 Like

wonderful provocative thought, and a great summary! and I'm very much inclined to put all we currently know about cake and fq_codel aside whilst we think about this.

For starters, there's some GREAT work going on in libreqos and most recently https://github.com/rchac/LibreQoS/issues/57#issuecomment-1237525100 - a version of that rewritten in rust arrived here:

Leveraging xdp to help manage the input path and spread the load over multiple cpus.

Also, a very little understood facility is a programmable completion interrupt which (so far as I know), only a few intel chips have. This
make it possible to set egress to be at any rate below the actual physical rate of the link and not require a shaper. I've often thought that pushing this facility down into bql in software would be cheaper than shaping.

I'd really like to get numbers back on how well cake works on more modern hardware (2.5gbit especially) at the "line" rate.

1 Like

Tackling this. For patent reasons, we did not adopt SQF's improvements over SFQ and went with DRR because it had clear IP going back decades. Those SQF patents expire in 2028, and it is possible to work around them. We generally get the best results from fq_codel and cake at low bandwidths with a 300 byte quantum, and SFQ is per packet.

The new hotness is being seen in timestamping from ingress to egress, as you can see in (somewhat wrongheaded) cilium bandwidth manager. It is (and has always been) dumb to not leverage hardware timestamping all the way through to egress.

There are also calendar queues and timerwheel designs for complicated setups.

My fervent wish has mostly been that the flow offload hardware got more progammable, and would let us leverage new abstractions like P4, or FPGAs.

2 Likes

BQL btw is not scaling as I would like, as it maintains state per all the hardware queues in play. So 64 hardware queues can end up with 64x as much latency as one, well managed, hardware queue. Ideally
BQL would actually only maintain enough data in all the hardware queues to keep the medium busy.

1 Like

But would this not simply make BQL do the costly computations instead of the qdisc?

+1. But I think current aim is "well busy" so BQL tends to err gently on the side of throughput (however it is so much better than without BQL).
For me BQL is really just shorthand for "device/interface/driver" that creates back pressure on a qdisc if we can move the costly part in hardware all the better.

Would adding something like MQPRIO help on "underpowered" targets? A lot of our popular target should be able to do some kind of HW QOS which could be used via MQPRIO qdisc.

As reference see: https://bootlin.com/blog/tag/mqprio/

From their example:

# Shaping and priority configuration
tc qdisc add dev eth0 parent root handle 100 mqprio \
num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
hw 1 mode channel shaper bw_rlimit \
max_rate 1Gbit 500Mbit 250Mbit 125Mbit 50Mbit 25Mbit 25Mbit 25MBit

# Assign HTTP traffic a priority of 1, to limit it to 500Mbps
iptables -t mangle -A POSTROUTING -p tcp --dport 80 -j CLASSIFY \
--set-class 0:1

When thinking about > 1Gbit, maybe our ARM targets can keep up, but even the IPQ806x targets need help from the NSS core offloads. So for 2.5 or 10 Gbps and shaping something like x86 is the only way to go without specific "designed" hardware.

It happens closer to the interrupt handler than software rescheduling as we do it today, and can be offloaded to hardware that supports it. I am shocked, actually, that this facility doesn't exist in more ethernet hardware, it's just a shift register.

In general, finding things that can be offloaded to hardware would be a win, if only there was hardware that was thinking about queuing more correctly.

The linked article actually talks about an offloaded traffic shaper (including a per-port shaper) in mvneta (which I believe is in my turris omnia) if this also supports BQL that would allow to test my hypothesis about shaping being the expensive part.... not that I have a 1Gbps internetlink over which to test the whole thing ;). Plugging fq_codel or cake on top of that should be "easy".

Targets like the MT7621 can use 8 queues for QOS (similar to the mvneta) per port. It can do Strict Priority or Weighted Round Robin per port. This could be translated to the MQPRIO with Hardware Offload qdisc. It includes registers for Max-Min Scheduler for traffic shaping.

A pdf comparing SW and HW QoS on the MT7621 can be found here: Mediatek QoS

Targets like the MT7628 also can do HW QOS, but with 4 TX queues only (and by nature of the hardware are limited to 100mbps).

Qualcomm/Atheros targets should be able to do similar, but like HW-NAT are still behind in development vs the MT7621 (MT7530 GWS).

Another thing we are not using (yet) is multi queue RX (2 actually) on those targets.

I have been trying to work out how the fast/slow concept in fq_codel could apply to a strict prio queue (or PIFO) like this for some time now. http://www.diva-portal.org/smash/get/diva2:1251687/FULLTEXT01.pdf - rather than 8 classes, there would be 4, and it would be "fqprio" to handle classification that way into 1024 possible flows.

For starters I really want to see how load/performance changes if the traffic shaper gets offloaded. Multi-queue is IMHO a different kettle of fish.
I need to check whether my omnia's ethernet supports that.

The requirements of such a traffic shaper aren't clear to me. Would this require changes to the kernel and the drivers? Would it just be a different scheduler with a modified algorithm? A version of DRR with a larger number of queues which sends a single packet per queue might have lower latency between the queues.

MIPS might be on its way out because the new designs are ARM based. I hope RISC-V becomes popular. We still have a lot of MIPS based devices out there. All the performance improvements can make these devices more usable and provide better latency. Devices which have only the integrated ath9k 2.4 802.11n are still usable for the 2.4 GHz networks. I have some devices which are older than 10 years. They still work just fine with embedded clients.

Even though they're not the best, they're probably good devices to use for development. If a qdisc implementation runs well on these, it's very likely to run even better on better hardware.

1 Like

I didn't think of the packet reordering fq_codel would cause on the physical device. It seemed to be yet another queue the packets have to go through and which has to be properly handled by a qdisc.

Given the example with 1024 backlogged flows with MTU sized packets belonging to bitorrent transfers, I have the impression we need per application fairness, not just per network node host fairness. It'd be easy for one such bitorrent application to open a large number of connections and make all those 1024 flows backlogged. Perhaps qosify would help in that case.

10 gbps and 25 gbps connections would have a lower latency. Perhaps those connections would also benefit from improved algorithms and lower per flow processing latency.