I recently had to add `ssl_preread_server_name` to my NGINX configuration in order to `proxy_pass` requests for certain domains to another NGINX instance. In this setup, the first instance simply forwards the raw TLS stream (with `proxy_protocol` prepended), while the second instance handles the actual TLS termination.
This approach works well when implementing a failover mechanism: if the default path to a server goes down, you can update DNS A records to point to a fallback machine running NGINX. That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.
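For reference, the kind of stream configuration described above looks roughly like this; the hostnames, addresses, and upstream names are placeholders rather than the real config:

    stream {
        map $ssl_preread_server_name $upstream {
            example.com  second_nginx;   # raw TLS forwarded to the second instance
            default      local_tls;
        }

        upstream second_nginx { server 192.0.2.10:443; }
        upstream local_tls    { server 127.0.0.1:8443; }

        server {
            listen 443;
            ssl_preread    on;   # peek at the SNI without terminating TLS
            proxy_protocol on;   # prepend PROXY protocol for the next hop
            proxy_pass     $upstream;
        }
    }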
However, this method won't work with HTTP/3. Since HTTP/3 uses QUIC over UDP and encrypts the SNI during the handshake, `ssl_preread_server_name` can no longer be used to route based on domain name.
What alternatives exist to support this kind of SNI-based routing with HTTP/3? Is the recommended solution to continue using HTTP/1.1 or HTTP/2 over TLS for setups requiring this behavior?
Clients supporting QUIC usually also support HTTPS DNS records, so you can use a lower priority record as a failover, letting the client potentially take care of it. (See for example: host -t https dgl.cx.)
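Hypothetically, the records for that could look something like this (lower SvcPriority wins; the fallback target is made up):

    example.com.  300 IN HTTPS 1 .                     alpn="h3,h2"
    example.com.  300 IN HTTPS 2 fallback.example.net. alpn="h2"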
That's the theory anyway. You can't always rely on clients to do that (see how much of the HTTPS record Chromium actually supports[1]), but in general if QUIC fails for any reason clients will transparently fall back, as well as respecting the Alt-Svc[2] header. If this is a planned failover you could stop sending an Alt-Svc record and wait for the alternative to time out, although it isn't strictly necessary.
If you really do want to route QUIC, however, one nice property is that the SNI is always in the first packet, so you can route flows by inspecting the first packet. See Cloudflare's udpgrm[3] (this on its own isn't enough to proxy to another machine, but the building block is there).
Without Encrypted Client Hello (ECH) the client hello (including SNI) is encrypted with a known key (this is to stop middleboxes which don't know about the version of QUIC breaking it), so it is possible to decrypt it, see the code in udpgrm[4]. With ECH the "router" would need to have a key to decrypt the ECH, which it can then decrypt inline and make a decision on (this is different to the TLS key and can also use fallback HTTPS records to use a different key than the non-fallback route, although whether browsers currently support that is a different issue, but it is possible in the protocol). This is similar to how fallback with ECH could be supported with HTTP/2 and a TCP connection.
[1]: https://issues.chromium.org/issues/40257146
[2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
[3]: https://blog.cloudflare.com/quic-restarts-slow-problems-udpg...
[4]: https://github.com/cloudflare/udpgrm/blob/main/ebpf/ebpf_qui...
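For concreteness, here's a rough stdlib-only Python sketch of that known-key derivation for QUIC v1 (per RFC 9001): the Initial packet protection keys are derived from a published salt and the client's Destination Connection ID, so anything on-path can compute them.

    import hmac, hashlib

    INITIAL_SALT_V1 = bytes.fromhex("38762cf7f55934b34d179ae6a4c80cadccbb7f0a")  # RFC 9001

    def hkdf_extract(salt, ikm):
        return hmac.new(salt, ikm, hashlib.sha256).digest()

    def hkdf_expand_label(secret, label, length):
        # TLS 1.3 HKDF-Expand-Label with an empty context (RFC 8446, section 7.1)
        full = b"tls13 " + label.encode()
        info = length.to_bytes(2, "big") + bytes([len(full)]) + full + b"\x00"
        out, block, counter = b"", b"", 1
        while len(out) < length:
            block = hmac.new(secret, block + info + bytes([counter]), hashlib.sha256).digest()
            out += block
            counter += 1
        return out[:length]

    def initial_keys(client_dcid):
        initial_secret = hkdf_extract(INITIAL_SALT_V1, client_dcid)
        client_secret = hkdf_expand_label(initial_secret, "client in", 32)
        return {
            "key": hkdf_expand_label(client_secret, "quic key", 16),  # AES-128-GCM key
            "iv":  hkdf_expand_label(client_secret, "quic iv", 12),
            "hp":  hkdf_expand_label(client_secret, "quic hp", 16),   # header protection key
        }

    # DCID taken from the RFC 9001 Appendix A test vectors
    print({k: v.hex() for k, v in initial_keys(bytes.fromhex("8394c8f03e515708")).items()})

Actually decrypting the Initial packet additionally needs header protection removal and AES-GCM, but the point stands: nothing secret is required.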
> for setups requiring this behavior?
Terminating TLS at your edge (which is presumably where the IP addresses attach) isn't any particular risk in a world of Let's Encrypt, where an attacker (who gained access to that box) could simply request a new SSL certificate, so you might as well do it yourself and move on with life.
Also: I've been unable to reproduce performance and reliability claims of quic. I keep trying a couple times a year to see if anything's gotten better, but I mostly leave it disabled for monetary reasons.
> This approach works well when implementing a failover mechanism: if the default path to a server goes down...
I'm not sure I agree: DNS can take minutes for updates to be reflected, and dumb clients (like web browsers) don't fail over.
So I use an onerror handler to load the second path. For my ad tracking, that looks something like this:
<img src=patha.domain1?tracking
onerror="this.src='pathb.domain2?tracking';this.onerror=function(){}">
but with the more complex APIs, fetch() is wrapped up similarly in the APIs I deliver to users. This works much better than anything else I've tried.

For a failover circumstance, I wouldn't bother with failover for QUIC at all. If a browser can't make a QUIC connection (even if advertised in DNS), it will try HTTP/1.1 or HTTP/2 over TLS. Then you can use the same fallback mechanism you would if it wasn't in the picture.
Unfortunately I think that falls under the "Not a bug" category of bugs. Keeping the endpoint concealed all the way to the TLS endpoint is a feature* of HTTP/3.
* I do actually consider it a feature, but do acknowledge https://xkcd.com/1172/
PS. HAProxy can proxy raw TLS, but can't direct based on hostname. Cloudflare tunnel I think has some special sauce that can proxy on hostname without terminating TLS but requires using them as your DNS provider.
Unless you're using ECH (Encrypted Client Hello) the endpoint is obscured (known keys), not concealed.
PS: HAProxy definitely can do this too, something using req.ssl_sni like this:
    frontend tcp-https-plain
        mode tcp
        tcp-request inspect-delay 10s
        bind [::]:443 v4v6 tfo
        # hello_type 1 = TLS ClientHello
        acl clienthello req.ssl_hello_type 1
        acl example.com req.ssl_sni,lower,word(-1,.,2) example.com
        tcp-request content accept if clienthello
        tcp-request content reject if !clienthello
        default_backend tcp-https-default-proxy
        use_backend tcp-https-example-proxy if example.com
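A matching backend might look something like this (the address is a placeholder):

    backend tcp-https-example-proxy
        mode tcp
        server origin 192.0.2.10:443 send-proxy-v2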
Then tcp-https-example-proxy is a backend which forwards to a server listening for HTTPS (and using send-proxy-v2, so the client IP is kept). Cloudflare really isn't doing anything special here; there are also other tools like sniproxy[1] which can intercept based on SNI (a common thing commercial proxies do for filtering reasons).

Hm, that's a good question. I suppose the same would apply to TCP+TLS with Encrypted Client Hello as well, right? Presumably the answer would be the same/similar between the two.
Not an expert on eSNI, but my understanding was that the encryption in eSNI is entirely separate from the "main" encryption in TLS, and the eSNI keys have to be the same for every domain served from the same IP address or machine.
Otherwise, the TLS handshake would run into the same chicken/egg problem that you have: To derive the keys, it needs the certificate, but to select the certificate, it needs the domain name.
So you only need to replicate the eSNI key, not the entire cert store.
Personally, I'd like to have an option of the outbound firewall doing the eSNI encryption, is that possible?
That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.
Won't you need to "replicate the TLS config" on the back end servers then? And how hard is it to configure TLS on the nginx side anyway, can't you just use ACME?
QUIC v1 does encrypt the SNI in the client hello, but the keys are derived from a predefined salt and the destination connection id. I don't see why decrypting this would be difficult for a nginx plugin.
There is no way to demultiplex incoming QUIC or HTTP/3 connections based on plaintext metadata inside the protocol. The designers went one step too far in their fight against middleboxes of all sorts. Unless you can assign each destination at least its own (IP address, UDP port) pair you're shit out of luck and can't have end-to-end encryption. A QUIC proxy has to decrypt, inspect, and reencrypt the traffic. Such a great performance and security improvement :-(. With IPv6 you can use unique IP addresses which immediately undoes any of the supposed privacy advantages of encrypting the server name in the first place. With IPv4 you're pretty much fucked. Too bad SRV record support for HTTP(S) was never accepted because it would threaten business models. I guess your best bet is to try to redirect clients to unique ports.
Hiding SNI is more important than breaking rare cases of weird web server setups. This setup is not typical because large organizations like Google tend to put all the services behind the same domain name.
I recall this article on QUIC disadvantages: https://www.reddit.com/r/programming/comments/1g7vv66/quic_i...
Seems like this is a step in the right direction to resolve some of those issues. I suppose nothing is preventing it from getting hardware support in future network cards as well.
QUIC does not work very well for use cases like machine-to-machine traffic. However, most of the traffic on the Internet today is from mobile phones to servers, and that is where QUIC and HTTP/3 shine.
For other use cases we can keep using TCP.
Let me try providing a different perspective based on experience. QUIC works amazingly well for _some_ kinds of machine to machine traffic.
ssh3, based on QUIC, is quicker at dropping into a shell than ssh. The latency difference was clearly visible.
QUIC with the unreliable dgram extension is also a great way to implement port forwarding over ssh. Tunneling one reliable transport over another hides the packet losses in the upper layer.
The article that GP posted was specifically about throughput over a high speed connection inside a data center.
It was not about latency.
In my opinion, the lessons that one can draw from this article should not be applied for use cases that are not about maximum throughput inside a data center.
Why doesn't QUIC work well for machine-to-machine traffic? Is it due to the lack of the offloads/optimizations that TCP has, given that machine-to-machine traffic tends to be high volume/high rate?
QUIC would work okay, but not really have many advantages for machine-to-machine traffic. Machine-to-machine you tend to have long-lived connections over a pretty good network. In this situation TCP already works well and is currently handled better in the kernel. Eventually QUIC will probably be just as good as TCP in this use case, but we're not there yet.
You still have latency, legacy window sizes, and packet schedulers to deal with.
But that is the huge advantage of QUIC. It does NOT totally outcompete TCP traffic on links (we already have bittorrent over udp for that purpose). They redesigned the protocol 5 times or so to achieve that.
NAT firewalls do not like P2P UDP traffic. The majority of routers lack the smarts to pass through QUIC correctly; they essentially need to treat it the same as TCP.
NAT is the devil. bring on the IPoc4lypse
NAT isn't dead with IPv6. ISPs assigning a /128 to your residential network is a thing.
No it isn't unless they want to ban you from using iPhones.
What do you mean? If the v6 configuration is incompatible with iPhones, the iPhone will just use v4
NAT is massively useful for all sorts of reasons which have nothing to do with IP limitations.
sounds great but it fucks up P2P in residential connections, where it is mostly used due to ipv4 address conservation. You can still have nat in IPv6 but hopefully I won't have to deal with it
In practice, P2P over ipv6 is totally screwed because there are no widely supported protocols for dynamic firewall pinholing (allowing inbound traffic) on home routers, whereas dynamic ipv4 NAT configuration via UPnP is very popular and used by many applications.
Most home routers do a form of stateful IPv6 firewall (and IPv4 NAT for that matter) compatible with STUN. UPnP is almost never necessary and has frequent security flaws in common implementations.
You just send a (UDP) packet to the other side's address and port and they send one to yours. The firewalls treat it as an outbound connection on both sides.
I don't believe that's true. You would still need something like UDP hole punching to bootstrap the inbound flow on both sides first. Also you would still only be limited to UDP traffic, TCP would still be blocked.
Sending one packet outbound is hole punching. It's really that simple. Since there's no NAT, you don't need to bother with all the complexity of trying to predict the port number on the public side of the NAT. You just have two sides send at least one packet to each other, and that opens the firewalls on both sides.
You just need to tell the other side that you want to connect.
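A rough sketch of that in Python, with made-up addresses (each side learns the other's address/port through some signaling channel first):

    import socket

    PEER = ("2001:db8::2", 40000)   # learned out of band via a rendezvous/signaling channel

    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.bind(("::", 40000))
    sock.sendto(b"punch", PEER)      # outbound packet creates state in our own firewall
    sock.settimeout(5.0)
    try:
        data, addr = sock.recvfrom(1500)   # the peer's packet now looks like return traffic
        print("peer reachable:", addr)
    except socket.timeout:
        print("no reply yet; resend and retry")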
just don't use a firewall
The NAT RFC talks purely about IP exhaustion.
What do you have in mind?
Why run your K8S cluster on IPv6 when IPv4 with 10.0.0.0/8 works perfectly with less hassle? You can always support IPv6 at the perimeter for ingress/egress. If your cluster is so big it can’t fit in 10.0.0.0/8, maybe the right answer is multiple smaller clusters-your service mesh (e.g. istio) can route inter-cluster traffic just based on names, not IPs.
And if 10.0.0.0/8 is not enough, there is always the old Class E, 240.0.0.0/4 - likely never going to be acceptable for use on the public Internet, but growing use as an additional private IPv4 address range - that gives you over 200 million more IPv4 addresses
> Why run your K8S cluster on IPv6 when IPv4 with 10.0.0.0/8 works perfectly with less hassle? You can always support IPv6 at the perimeter for ingress/egress.
How is it "less hassle"? You've got to use a second, fiddlier protocol and you've got to worry about collisions and translations. Why not just use normal IPv6 and normal addresses for your whole network, how is that more hassle?
> You can always support IPv6 at the perimeter for ingress/egress. If your cluster is so big it can’t fit in 10.0.0.0/8, maybe the right answer is multiple smaller clusters-your service mesh (e.g. istio) can route inter-cluster traffic just based on names, not IPs.
You can work around the problems, sure. But why not just avoid them in the first place?
> How is it "less hassle"? You've got to use a second, fiddlier protocol and you've got to worry about collisions and translations.
Because, while less common than it used to be, software that has weird bugs with IPv6 is still a thing-especially if we are talking about internally developed software as opposed to just open source and major proprietary packages. And as long as IPv6 remains the minority in data centre environments, that’s likely to remain true - it is easy for bugs to linger (or even new ones to be introduced) when they are only triggered by a less popular configuration
True, but already the newest software has good IPv6 support, and that suggests a tipping point should be coming where as soon as the majority is on IPv6 it becomes in everyone's interest to get off of IPv4.
Kubes
Rather, NAT is a bandage for all sorts of reasons besides IP exhaustion.
Example: Janky way to get return routing for traffic when you don't control enterprise routes.
Source: FW engineer
Sure. When I can BGP advertise my laptop with my phone provider and have it update in a second or so globally when I move from tethering to wifi, or from one network to another.
No doubt you think I should simply renumber all my VMs every time that happens, breaking internal connections. Or perhaps run completely separate addressing in each VM in parallel and make sure each VM knows which connection to use. Perhaps the VMs peer with my laptop and then the laptop decides what to push out which way via localprefs, AS paths etc. That sounds so much simpler than a simple masquerade.
What happens when I want vm1 out of connection A, vm 3 out of connection B, vm 4-7 out of connection C. Then I want to change them quickly and easily. I’m balancing outbound and inbound rules, reaching for communities, and causing bgp dampening all over the place.
What when they aren’t VMs but instead physical devices. My $40 mifi is now processing the entire DFZ routing table?
What happens when I want a single physical device like a tv to contact one service via connection 1 and another via connection 2 but the device doesn’t support multiple routing tables or selection of that. What if it does support it but I just want to be able to shift my ssh sessions to a low latency higher loss link but keep my streaming ups on the high latency no loss link.
All this is trivial with nat. Now sure I can use NAT66, and do a 1:1 natting (no PAT here), but then I’m using nat and that breaks the ipv6 cult that believes translating network addresses is useless.
Fair, there are reasons to keep it around, like load-balancing and connection persistence.
QUIC isn’t generally P2P though. Browsers don’t support NAT traversal for it.
I think basically there is currently a lot of overhead and, when you control the network more and everything is more reliable, you can make tcp work better.
It's explained in the reddit thread. Most of it is because you have to handle a ton of what TCP does in userland.
For starters, why encrypt something literally in the same datacenter 6 feet away? It adds significant latency and processing overhead.
Encryption gets you data integrity "for free". If a bit is flipped by faulty hardware, the packet won't decrypt. TCP checksums are not good enough for catching corruption in many cases.
Interesting. When I read this I was thinking “that can’t be right, the whole internet relies on tcp being “reliable”. But it is right; https://dl.acm.org/doi/10.1145/347059.347561. It might be rare, but an unencrypted RPC packet might accidentally set that “go nuclear” bit. ECC memory is not enough people! Encrypt your traffic for data integrity!
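A toy illustration of how weak the 16-bit Internet checksum is: it's a ones'-complement sum of 16-bit words, so simply reordering the words in a payload goes completely undetected.

    def inet_checksum(data: bytes) -> int:
        # ones'-complement sum of 16-bit words, as used by TCP/UDP
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += int.from_bytes(data[i:i + 2], "big")
            total = (total & 0xFFFF) + (total >> 16)
        return (~total) & 0xFFFF

    a = b"\x12\x34\x56\x78"
    b = b"\x56\x78\x12\x34"   # same 16-bit words, different order
    assert a != b and inet_checksum(a) == inet_checksum(b)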
To stop or slow down the attacker who is inside your network and trying to move horizontally? Isn’t this the principle of defense in depth?
Because the NSA actively intercepts that traffic. There's a reason why encryption is non optional
To me this seems outlandish (e.g. if you're part of PRISM you know what's happening and you're forced to comply.) But to think through this threat model, you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts to spy traffic at the NIC level? I guess it would be harder to intercept and untangle traffic at the NIC level than intra-DC, but I'm not sure?
> you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts
It doesn't have to be one or the other. We've known for over a decade that the traffic between DCs was tapped https://www.theguardian.com/technology/2013/oct/30/google-re... Extending that to intra-DC wouldn't be surprising at all.
Meanwhile backdoored chips and firmware attacks are a constant worry and shouldn't be discounted regardless of the first point.
> you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts to spy traffic at the NIC level
It might not be able to, if you use secure boot and your server is locked in a cage.
> (e.g. if you're part of PRISM you know what's happening and you're forced to comply.)
Only a handful of people need to know what happens in Room 641A, and they're compelled or otherwise incentivized not to let anyone else know.
The difference between tapping intra-DC links and spying from inside the computer is that in-computer spying is much more likely to get caught and much less easily able to get data out. There's a pretty big difference between software/hardware weaknesses that require specific targeting to exploit and passively scooping everything up and scanning it.
If you are concerned about this, how do you think you could protect against AWS etc allowing NSA to snoop on you from the hypervisor level?
Assuming the PSP isn't backdoored, using AMD SME and SEV theoretically allow you to run VMs that are encrypted such that, even at the hypervisor level, you can't read code or data from the VM.
You cannot assume that. The solution is to have a server on your territory and use the datacenter only to forward the packets.
Imaginary problems are the funnest to solve.
It's a stone cold fact that the NSA does this; it was part of the Snowden revelations. Don't spread FUD about security, it's important.
Service meshes often encrypt traffic that may be running on the same physical host. Your security policy may simply require this.
Because any random machine in the same datacenter and network segment might be compromised and do stuff like running ARP spoofing attacks. Cisco alone has had so many vendor-provided backdoors cropping up that I wouldn't trust anything in a data center with Cisco gear.
Ummm, no, The network is completely isolated. No one enters the cage and just plugs something into my switches/routers.
Any communication between the cage and the outside world is through the cross-connects.
Unless it's some state-adversary, no one taps us like this. This is not a shared hosting. No one runs serious workloads like this.
"Unserious"? Sure, everything is encrypted p2p.
I don't understand what you mean by "machine-to-machine" if a phone (a machine) talking to a server (a machine) is not machine-to-machine.
I hope you don't think that user-to-machine means that I have to stick my finger in a network switch? :)
Machine-to-machine is usually meant as traffic where neither of the sides is the client device (desktop, mobile etc). Often not initiated by user, but that's debatable.
I would say a server syncing a database to a passive node is machine-to-machine, while a user connecting from their browser to a webserver is not.
I don't know about using it in the kernel but I would love to see OpenSSH support QUIC so that I get some of the benefits of Mosh [1] while still having all the features of OpenSSH including SFTP, SOCKS, port forwarding, less state table and keep alive issues, roaming support, etc... Could OpenSSH leverage the kernel support?
[1] - https://mosh.org/
SSH would need a lot of work to replace its crypto and mux layers with QUIC. It's probably worth starting from scratch to create a QUIC login protocol. There are a bunch of different approaches to this in various states of prototyping out there.
Fair points. I suppose Mosh would be the proper starting point then. I'm just selfish and want the benefits of QUIC without losing all the really useful features of OpenSSH.
OpenSSH is an OpenBSD project therefore I guess a Linux api isn't that interesting but I could be wrong ofc.
Once Linux implements it, I think odds are high that FreeBSD sooner or later does too. And maybe NetBSD and XNU/Darwin/macOS/iOS thereafter. And if they’ve all got it, that increases the odds that eventually OpenBSD also implements it. And if OpenBSD has the support in its kernel, then they might be willing to consider accepting code in OpenSSH which uses it. So OpenSSH supporting QUIC might eventually happen, but if it does, it is going to be some years away
That's a good point. At least it would not be an entirely new idea. [1] Curious what reactions he received.
[1] - https://papers.freebsd.org/2022/eurobsdcon/jones-making_free...
What will the socket API look like for multiple streams? I guess it is implied it is the same as multiple connections, with caching behind the scenes.
I would hope for something more explicit, where you get a connection object and then open streams from it, but I guess that is fine for now.
Ah, but look at this: https://github.com/microsoft/msquic/discussions/4257 --- unless this is an extension, the server side can also create new streams once a connection is established. The client creating new "connections" (actually streams) cannot abstract over this. Something fundamentally new is needed.
My guess is recvmsg to get a new file descriptor for a new stream.
I would look at the SCTP socket API; it supports multistreaming.
Ah fuck, it still has a stream_id notion
How are socket APIs always such garbage....
At least the SCTP API has sctp_peeloff, which gives you a new single-stream socket descriptor for the connection. Maybe QUIC will get something like that, eventually. Kind of a glaring omission, though, unless I'm misunderstanding.
Yeah. Huge omission.
> API RFC is ...
still a draft though.
I checked that out and....yuck!
- Send specifies which stream by ordinal number? (Can't have different parts of a concurrent app independently open new streams)
- Receive doesn't specify which stream at all?!
SCTP is very telecom-shaped; in particular, IIRC, the number of streams is fixed at the start of the connection, so (this sucks but also) GP’s problem does not appear.
I have a question - the bottleneck for TCP is said to be the handshake. But that can be solved by reusing connections and/or multiplexing. The current implementation is 3-4x slower than in-kernel TLS/TCP, and the performance gap is expected to close.
If speed is touted as the advantage of QUIC and it is in fact slower, why bother with this protocol? The author of the PR attributes some of the speed issues to the protocol design. Are there other problems in TCP that need fixing?
The article discusses many of the reasons QUIC is currently slower. Most of them seem to come down to "we haven't done any optimization for this yet".
> Long offers some potential reasons for this difference, including the lack of segmentation offload support on the QUIC side, an extra data copy in transmission path, and the encryption required for the QUIC headers.
All of these three reasons seem potentially very addressable.
It's worth noting that the benchmark here is on pristine network conditions, a drag race if you will. If you are on mobile, your network will have a lot more variability, and there TCP's design limits are going to become much more apparent.
TCP itself often has protocols run on top of it, to do QUIC like things. HTTP/2 is an example of this. So when you compare QUIC and TCP, it's kind of like comparing how fast a car goes with how fast an engine bolted to a frame with wheels on it goes. QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3. Thats less system design.
QUIC also has wins for connecting faster, and especially for reconnecting faster. It also has IP mobility: if you're on mobile and your IP address changes (happens!) QUIC can keep the session going without rebuilding it once the client sends the next packet.
It's a fantastically well thought out & awesome advancement, radically better in so many ways. The advantages of having multiple non-blocking streams (like SCTP) massively reduce the scope that higher level protocol design has to take on. And all that multi-streaming stuff being in the kernel means it's deeply optimizable in a way TCP can never enjoy.
Time to stop driving the old rust bucket jalopy of TCP around everywhere, crafting weird elaborate handmade shit atop it. We need a somewhat better starting place for higher level protocols and man oh man is QUIC alluring.
> QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3
IP is layer 3 - network (ensures packets are routed to the correct host). TCP is layer 4 - transport (some people argue that TCP has functions from layer 5, e.g. establishing sessions between apps), while TLS adds a few functions from layer 6 (e.g. encryption), which QUIC also has.
The OSI is not a useful guide to how layering works in the Internet.
TCP is level 4 in the OSI model
That's just one bottleneck. The other issue is head-of-line blocking. When there is packet loss on a TCP connection, nothing sent after that is delivered until the loss is repaired.
What's the packet loss rate on modern networks? Curious.
~80% when you step out of wifi range on your cell phone.
… from 0% (a wired home LAN with nothing screwy going on) to 100% (e.g., cell reception at the San Antonio Caltrain station), depending on conditions…?
As it always has been, and always will be.
It can be high on cellular.
Pretty bad sometimes when on a train
That depends on how much data you are pushing. If you are pushing 200 Mbps on a 100 Mbps line you will get 50% packet loss.
Well, yes, that's the idea behind TCP itself, but a "normal" rate of packet loss is something along the lines of 5/100k packets dropped on any given long-haul link. Let's say a random packet passes about 8 such links, so a "normal" rate of packet loss is 0.025% or so.
Once it makes it to the long haul links. Measure starting at your cell phone and packet loss is much higher than 0.025% and that's where QUIC shines.
TCP windowing fixes the issue you are describing. Make the window big and TCP will keep sending when there is a packet loss. It will also retry and usually recover before the end of the window is reached.
The statement in the comment you're replying to is still true. While waiting for those missed packets, the later packets will not be dropped if you have a large window size. But they won't be delivered either. They'll be cached in the kernel, even though it may be that the application could make use of them before the earlier blocked packet.
They are unrelated. Larger windows help achieve higher throughput over paths with high delay. You allude to selective acknowledgements as a way to repair loss before the window completely drains which is true, but my point is that no data can be delivered to the application until the loss is repaired (and that repair takes at least a round-trip time). (Then the follow-on effects from noticed loss on the congestion controller can limit subsequent in-flight data for a time, etc, etc.)
The application will hang waiting for the stack, but the stack keeps working and once the drop is remedied, the application will get a flood of data at a higher rate than the max network rate. So the application may pause sometimes, but the average rate of throughput is not much affected by drops.
The queuing discipline used by default (pfifo_fast) is barely more than 3 FIFO queues bundled together. The 3 queues allow for a barest minimum semblance of prioritisation of traffic, where Queue 0 > 1 > 2, and you can tweak some tcp parameters to have your traffic land in certain queues. If there's something in queue 0 it must be processed first before anything in queue 1 gets touched etc.
Those queues operate purely head-of-queue basis. If what is at the top of the queue 0 is blocked in any way, the whole queue behind it gets stuck, regardless of if it is talking to the same destination, or a completely different one.
I've seen situations where a glitching network card caused some serious knock on impacts across a whole cluster, because the card would hang or packets would drop, and that would end up blocking the qdisc on a completely healthy host that was in the middle of talking to it, which would have impacts on any other host that happened to be talking to that healthy host. A tiny glitch caused much wider impacts than you'd expect.
The same kind of effect would happen from a VM that went through live migration. The tiny, brief pause would cause a spike of latency all over the place.
There are alternatives like fq_codel that can mitigate some of this, but you do have to pay a small amount of processing overhead on every packet, because now you have a queuing discipline that actually needs to track some semblance of state.
> bottleneck for TCP is said to the handshake. But that can be solved by reusing connections
You can't reuse a connection that doesn't exist yet. A lot of this is about reducing latency not overall speed.
The "advantage" is tracking via the server provided connection ID https://www.cse.wustl.edu/~jain/cse570-21/ftp/quic/index.htm...
That's nonsensical. The connection ID doesn't allow tracking that you couldn't do with TCP.
I'm confused, I thought the revolution of the past decade or so was in moving network stacks to userspace for better performance.
Most QUIC stacks are built upon in-kernel UDP. You get significant performance benefits if you can avoid your traffic going through kernel and userspace and the context switches involved.
You can work that angle by moving networking into user space... setting up the NIC queues so that user space can access them directly, without needing to context switch into the kernel.
Or you can work the angle by moving networking into kernel space ... things like sendfile which let a tcp application instruct the kernel to send a file to the peer without needing to copy the content into userspace and then back into kernel space and finally into the device memory, if you have in-kernel TLS with sendfile then you can continue to skip copying to userspace; if you have NIC based TLS, the kernel doesn't need to read the data from the disk; if you have NIC based TLS and the disk can DMA to the NIC buffers, the data doesn't need to even hit main memory. Etc
But most QUIC stacks don't get benefit from either side of that. They're reading and writing packets via syscalls, and they're doing all the packetization in user space. No chance to sendfile and skip a context switch and skip a copy. Batching io via io_uring or similar helps with context switches, but probably doesn't prevent copies.
Yeah, there’s also a lot of offloads that can be done to the kernel with UDP (e.g. UDP segmentation offload, generic receive offload, checksum offload), and offloading quick entirely would be a natural extension to that.
It just offers people choice for the right solution at the right moment.
You are right but it's confusing because there are two different approaches. I guess you could say both approaches improve performance by eliminating context switches and system calls.
1. Kernel bypass combined with DMA and techniques like dedicating a CPU to packet processing improve performance.
2. What I think of as "removing userspace from the data plane" improves performance for things like sendfile and ktls.
To your point, Quic in the kernel seems to not have either advantage.
So... RDMA?
No, the first technique describes the basic way they already operate, DMA, but giving access to userspace directly because it's a zerocopy buffer. This is handled by the OS.
RDMA is directly from bus-to-bus, bypassing all the software.
You still need to offload your bytes to a NIC buffer. Either you can do something like DMA where you get privileged space to write your bytes to that the NIC reads from or you have to cross the syscall barrier and have your kernel write the bytes into the NIC's buffer. Crossing the syscall barrier adds a huge performance penalty due to the switch in memory space and privilege rings so userspace networking only makes sense if you're not having to deal with the privilege changes or you have DMA.
That is only a problem if you do one or more syscalls per packet, which is an utterly bone-headed design.
The copy itself is going at 200-400 Gbps so writing out a standard 1,500 byte (12,000 bit) packet takes 30-60 ns (in steady state with caches being prefetched). Of course you get slaughtered if you stupidly do a syscall (~100 ns hardware overhead) per packet since that is like 300% overhead. You just batch like 32 packets so the write time is ~1,000-2,000 ns then your overhead goes from 300% to 10%.
At a 1 Gbps throughput, that is ~80,000 packets per second or one packet per ~12.5 us. So, waiting for a 32 packet batch only adds an additional 500 us to your end-to-end latency in return for 4x efficiency (assuming that was your bottleneck; which it is not for these implementations as they are nowhere near the actual limits). If you go up to 10 Gbps, that is only 50 us of added latency, and at 100 Gbps you are only looking at 5 us of added latency for a literal 4x efficiency improvement.
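A quick back-of-the-envelope check of those figures (all inputs are the assumptions from the comment above):

    PACKET_BITS = 1500 * 8   # 12,000 bits per packet
    SYSCALL_NS = 100         # assumed fixed syscall cost

    for gbps in (200, 400):
        copy_ns = PACKET_BITS / gbps              # Gbit/s == bit/ns, so this is ns per packet copy
        unbatched = SYSCALL_NS / copy_ns          # one syscall per packet
        batched = SYSCALL_NS / (copy_ns * 32)     # one syscall per 32-packet batch
        print(f"{gbps} Gbps: copy {copy_ns:.0f} ns/pkt, overhead {unbatched:.0%} unbatched vs {batched:.0%} batched")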
What is done for that is userspace gets the network data directly without (I believe) involving syscalls. It's not something you'd do for end-user software, only the likes of MOFAANG need it.
In theory the likes of io_uring would bring these benefits across the board, but we haven't seen that delivered (yet, I remain optimistic).
I'm hoping we get there too with io_uring. It looks like the last few kernel release have made a lot of progress with zero-copy TCP rx/tx, though NIC support is limited and you need some finicky network iface setup to get the flow steering working
The constant mode switching for hardware access is slow. TCP/IP remains in the kernel for windows and Linux.
Performance comes from dedicating core(s) to polling, not from userspace.
Networking is much faster in the kernel. Even faster on an ASIC.
Network stacks were moved to userspace because Google wanted to replace TCP itself (and upgrade TLS), but it only cared about the browser, so they just put the stack in the browser, and problem solved.
> Calls to bind(), connect(), listen(), and accept() can be used to initiate and accept connections in much the same way as with TCP, but then things diverge a bit. [...] The sendmsg() and recvmsg() system calls are used to carry out that setup
I wish the article explained why this approach was chosen, as opposed to adding a dedicated system call API that matches the semantics of QUIC.
What is the need for mashing more and more stuff into the kernel? I thought the job of the kernel was to manage memory, hardware, and tasks. Shouldn't protocols built on top of IP be handled by userland?
Having networking, routing, VPN etc all not leave kernel space can be a performance improvement for some use cases.
Similarly, splitting the networking/etc stacks out from the kernel into userspace can also be a performance improvement for some use cases.
Can't you say that about virtually anything? I'm sure having, say, MIDI synthesizers in the kernel would improve performance too, but not many think that is a good idea.
Yup, context switches between kernelspace and userspace are very expensive in high-performance situations, which is why these types of offloads are used.
At specific workloads (think: load balancers / proxy servers / etc), these things become extremely expensive.
Maybe. Getting stuff into the kernel means (in theory) it’s been hardened, it has a serious LTS, and benefits from… well, the performance of being part of the kernel.
DMA transfers and NIC offloading
No, protocols directly on IP specifically can’t be used in userland because they can’t be multiplexed to multiple processes.
If everything above IP was in userland, only one program at a time could use TCP.
TCP and UDP being intermediated by the kernel allow multiple programs to use the protocols at the same time because the kernel routes based on port to each socket.
QUIC sits a layer even higher because it cruises on UDP, so I think your point still stands, but it’s stuff on top of TCP/UDP, not IP.
How do you think this works on microkernels? Do they have no support for multiple applications using the network?
That is not at all a problem. On a microkernel you just have a userspace TCP/network server that your other programs talk to that manages/multiplexes the shared network connection.
If they don’t have TCP in them, yes. Either each application would need its own IP or another application would be responsible for being the TCP port router.
Looks good. QUIC is a real game changer for many. The Internet should be a little faster with it. Probably we will not care because of 5G, but still valuable. Wondering why there are two separate handshakes; I was thinking that QUIC embeds TLS, but it seems I am wrong.
The general web is slowed down by bloated websites. But I guess this can make game latency lower.
https://en.m.wikipedia.org/wiki/Jevons_paradox
The Jevons Paradox is applicable in a lot of contexts.
More efficient use of compute and communications resources will lead to higher demand.
In games this is fine. We want more, prettier, smoother, pixels.
In scientific computing this is fine. We need to know those simulation results.
On the web this is not great. We don’t want more ads, tracking, JavaScript.
No, the last 20 years of browser improvements has made my static site incredibly fast!
I'm benefiting from WebP, JS JITs, Flexbox, zstd, Wasm, QUIC, etc, etc
This seems to be a categorical error, for reasons that are contained in the article itself. The whole appeal of QUIC is being immune to ossification, being free to change parameters of the protocol without having to beg Linux maintainers to agree.
IMHO, you likely want the server side to be in the kernel, so you can get to performance similar to in-kernel TCP, and ossification is less of a big deal, because it's "easy" to modify the kernel on the server side.
OTOH, you want to be in user land on the client, because modifying the kernel on clients is hard. If you were Google, maybe you could work towards a model where Android clients could get their in-kernel protocol handling to be something that could be updated regularly, but that doesn't seem to be something Google is willing or able to do; Apple and Microsoft can get priority kernel updates out to most of their users quickly; Apple also can influence networks to support things they want their clients to use (IPv6, MP-TCP). </rant>
If you were happy with congestion control on both sides of TCP, and were willing to open multiple TCP connections like http/1, instead of multiplexing requests on a single connection like http/2, (and maybe transfer a non-pessimistic bandwidth estimate between TCP connections to the same peer), QUIC still gives you control over retransmission that TCP doesn't, but I don't think that would be compelling enough by itself.
Yes, there's still ossification in middle boxes doing TCP optimization. My information may be old, but I was under the impression that nobody does that for IPv6, so the push for v6 is not only a way to avoid NAT (and especially CGNAT), but also a way to avoid optimizer boxes, a benefit for both network providers (less expense) and services (less frustration).
One thing is that congestion control choice is sort of cursed in that it assumes your box/side is being switched but the majority of the rest of the internet continues with legacy limitations (aside from DCTCP, which is designed for intra-datacenter usage), which is an essential part of the question given that resultant/emergent network behavior changes drastically depending on whether or not all sides are using the same algorithm. (Cubic is technically another sort-of-exception, at least since it became the default Linux CC algorithm, but even then you’re still dealing with all sorts of middleware with legacy and/or pathological stateful behavior you can’t control.)
This is a perspective, but just one of many. The overwhelming majority of IP flows are within data centers, not over planet-scale networks between unrelated parties.
I've never been convinced by an explanation of how QUIC applies for flows in the data center.
Ossification doesn't apply (or it shouldn't, IMHO, the point of Open Source software is that you can change it to fit your needs... if you don't like what upstream is doing, you should be running a local fork that does what you want... yeah, it's nicer if it's upstreamed, but try running a local fork of Windows or MacOS); you can make congestion control work for you when you control both sides; enterprise switches and routers aren't messing with tcp flows. If you're pushing enough traffic that this is an issue, the cost of QUIC seems way too high to justify, even if it helps with some issues.
I don't see why this exception to the end-to-end principle should exist. At the scale of single hosts today, with hundreds of CPUs and hundreds of tenants in a single system sharing a kernel, the kernel itself becomes an unwanted middlebox.
Unless you're using QUIC as some kind of datacenter-to-datacenter protocol (basically as SCTP on steroids with TLS), I don't think QUIC in the datacenter makes much sense at all.
As very few server administrators bother turning on features like MPTCP, QUIC has an advantage on mobile phones with moderate to bad reception. That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead. Any service those people are using should probably consider implementing QUIC, and if they use it, they'd benefit from an in-kernel server.
All the data center operators can stick to (MP)TCP, the telco people can stick to SCTP, but the consumer facing side of the internet would do well to keep QUIC as an option.
> That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead.
For what it's worth: Romania, one of the piss poorest countries of Europe, has a perfectly fine mobile phone network, and even outback small villages have XGPON fiber rollouts everywhere. Germany? As soon as you cross into the country from Austria, your phone signal instantly drops, barely any decent coverage outside of the cities. And forget about PON, much less GPON or even XGPON.
Germany should be considered a developing country when it comes to expectations around telecommunication.
Ossification does not come about from the decisions of "Linux maintainers". You need to look at the people who design, sell, and deploy middleboxes for that.
I disagree. There is plenty of ossification coming from inside the house. Just some examples off the top of my head are the stuck-in-1974 minimum RTO and ack delay time parameters, and the unwillingness to land microsecond timestamps.
Not a networking expert, but does TCP in IPv6 suffer the same maladies?
Yes.
Layer4 TCP is pretty much just slapped on top of Layer3 IPv4 or IPv6 in exactly the same way for both of them.
Outside of some little nitpicky things like details on how TCP MSS clamping works, it is basically the same.
…which is basically how it’s supposed to work (or how we teach that it’s supposed to work). (Not that you said anything to the contrary!)
The "middleboxes" excuse for not improving (or replacing) protocols in the past was horseshit. If a big incumbent player in the networking world releases a new feature that everyone wants (but nobody else has), everyone else (including 'middlebox' vendors) will bend over backwards to support it, because if you don't your competitors will and then you lose business. It was never a technical or logistical issue, it was an economic and supply-demand issue.
To prove it:
1. Add a new OSI Layer 4 protocol called "QUIC" and give it a new protocol number, and just for fun, change the UDP frame header semantics so it can't be confused for UDP.
2. Then release kernel updates to support the new protocol.
Nobody's going to use it, right? Because internet routers, home wireless routers, servers, shared libraries, etc would all need their TCP/IP stacks updated to support the new protocol. If we can't ship it over a weekend, it takes too long!
But wait. What if ChatGPT/Claude/Gemini/etc only supported communication over that protocol? You know what would happen: every vendor in the world would backport firmware patches overnight, bending over backwards to support it. Because they can smell the money.
The protocol itself is resistant to ossification, no matter how it is implemented.
It is mostly achieved by using encryption, and it is a reason why it is such an important and mandatory part of the protocol. The idea is to expose as little as possible of the protocol between the endpoints, the rest is encrypted, so that "middleboxes" can't look at the packet and do funny things based on their own interpretation of the protocol stack.
Endpoint can still do whatever they want, and ossification can still happen, but it helps against ossification at the infrastructure level, which is the worst. Updating the linux kernel on your server is easier than changing the proprietary hardware that makes up the network backbone.
The use of UDP instead of doing straight QUIC/IP is also an anti-ossification technique, as your app can just use UDP and a userland library regardless of the QUIC kernel implementation. In theory you could do that with raw sockets too, but that's much more problematic: because you don't have ports, you need the entire interface for yourself, and often root access.
Do you think putting QUIC in the kernel will significantly ossify QUIC? If so, how do you want to deal with the performance penalty for the actual syscalls needed? Your concern makes sense to me as the Linux kernel moves slower than userspace software and middleboxes sometimes never update their kernels.
That's so wrong, putting more and more stuff into the kernel and expanding attack surface. How long will it take before someone finds a vulnerability in QUIC handling?
The kernel should be as minimal as possible and everything that can be moved to userspace should be moved there. If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.
Use a microkernel if this is your strong opinion. Linux is a monolithic kernel and includes a whole lot in kernel space for the sake of performance and (as mentioned in the article) hardware integration. A well designed microkernel may be able to provide similar performance with better security, but until people put serious work in, it won't be competitive with Linux.
Unfortunately the OS community puts 99% of its collective energy into Linux. There is definitely pent up demand for a different architecture. China seems to be innovating here, but it's unclear if the West will get anything out of their designs.
Sadly Linux distributions use large kernel and there is no simple way to get a working desktop system with a microkernel.
> If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.
By the same logic, we should never improve performance in software and just tell everyone to buy new hardware instead. A bit ridiculous.
We should not compromise security for minor improvements in performance.
Would this (eventually) include the unreliable datagram extension?
Don't know if it could get faster than UDP if it is on top of it.
The use case for this would be running a multiplayer game server over QUIC
Or a mix of both, like datagrams for voice chat / movement and reliable packets for important data
Other use cases include video / audio streaming, VPNs over QUIC, and QUIC-over-QUIC (you never know)
The article didn’t discuss ACK. I have often wondered if it makes sense for the protocol to not have ACKs, and to leave that up to the application layer. I feel like the application layer has to ensure this anyway, so I don’t know how much benefit it is to additionally support this at a lower layer.
> QUIC is meant to be fast, but the benchmark results included with the patch series do not show the proposed in-kernel implementation living up to that. A comparison of in-kernel QUIC with in-kernel TLS shows the latter achieving nearly three times the throughput in some tests. A comparison between QUIC with encryption disabled and plain TCP is even worse, with TCP winning by more than a factor of four in some cases.
Jesus, that's bad. Does anyone know if userspace QUIC implementations are also this slow?
I think the ‘fast’ claims are just different. QUIC is meant to make things fast by:
- having a lower latency handshake
- avoiding some badly behaved ‘middleware’ boxes between users and servers
- avoiding resetting connections when user IP addresses change
- avoiding head of line blocking / the increased cost of many connections ramping up
- avoiding poor congestion control algorithms
- probably other things too
And those are all things about working better with the kind of network situations you tend to see between users (often on mobile devices) and servers. I don’t think QUIC was meant to be fast by reducing OS overhead on sending data, and one should generally expect it to be slower for a long time until operating systems become better optimised for this flow and hardware supports offloading more of the work. If you are Google then presumably you are willing to invest in specialised network cards/drivers/software for that.
Yeah I totally get that it optimizes for different things. But the trade offs seem way too severe. Does saving one round trip on the handshake mean anything at all if you're only getting one fourth of the throughput?
Are you getting one fourth of the throughput? Aren’t you going to be limited by:
- bandwidth of the network
- how fast the nic on the server is
- how fast the nic on your device is
- whether the server response fits in the amount of data that can be sent given the client’s initial receive window or whether several round trips are required to scale the window up such that the server can use the available bandwidth
It depends on the use case. If your server is able to handle 45k connections but 42k of them are stalled because of mobile users with too much packet loss, QUIC could look pretty attractive. QUIC is a solution to some of the problematic aspects of TCP that couldn't be fixed without breaking things.
The primary advantage of QUIC for things like congestion control is that companies like Google are free to innovate both sides of the protocol stack (server in prod, client in chrome) simultaneously. I believe that QUIC uses BBR for congestion control, and the major advantage that QUIC has is being able to get a bit more useful info from the client with respect to packet loss.
This could be achieved by encapsulating TCP in UDP and running a custom TCP stack in userspace on the client. That would allow protocol innovation without throwing away 3 decades of optimizations in TCP that make it 4x as efficient on the server side.
Is that true? Aren’t lots of the tcp optimisations about offloading work to the hardware, eg segmentation or tls offload? The hardware would need to know about your tcp-in-udp protocol to be able to handle that efficiently.
Most hardware is fairly generic for tunneled protocols, and tx descriptors can take things like "inner l4 header offset/len" and "outer l4 header offset/len"
Generic support for tunneled TCP is far more doable than support for a new and volatile protocol.
Maybe it’s a fourth as fast in ideal situations with a fast LAN connection. Who knows what they meant by this.
It could still be faster in real world situations where the client is a mobile device with a high latency, lossy connection.
There are claims of 2x-3x operating costs on the server side to deliver better UX for phone users.
> - avoiding some badly behaved ‘middleware’ boxes between users and servers
Surely badly behaving middleboxes won't just ignore UDP traffic? If anything, they'd get confused about udp/443 and act up, forcing clients to fall back to normal TCP.
Your average middlebox will just NAT UDP (unless it's outright blocked by security policy) and move on. It's TCP where many middleboxes think they can "help" the congestion signaling, latch more deeply into the session information, or worse. Unencrypted protocols can have further interference under either TCP or UDP beyond this note.
QUIC is basically about taking all of the information middleboxes like to fuck with in TCP, putting it under the encryption layer, and packaging it back up in a UDP packet precisely so it's either just dropped or forwarded. In practice this (i.e. QUIC either being just dropped or left alone) has actually worked quite well.
Yes. msquic is one of the best performing implementations and only achieves ~7 Gbps [1]. The benchmarks for the Linux kernel implementation only get ~3 Gbps to ~5 Gbps with encryption disabled.
To be fair, the Linux kernel TCP implementation only gets ~4.5 Gbps at normal packet sizes and still only achieves ~24 Gbps with large segmentation offload [2]. Both of which are ridiculously slow. It is straightforward to achieve ~100 Gbps/core at normal packet sizes without segmentation offload with the same features as QUIC with a properly designed protocol and implementation.
[1] https://microsoft.github.io/msquic/
[2] https://lwn.net/ml/all/cover.1751743914.git.lucien.xin@gmail...
Yes, they are. Worse, I've seen them shrink down to nothing in the face of congestion with TCP traffic. If QUIC is indeed the future protocol, it's a good thing to move it into the kernel IMO. It's just madness to provide these massive userspace impls everywhere, on a packet-switched protocol no less, and expect it to beat good old TCP. Wouldn't surprise me if we need optimizations all the way down to the NIC layer, and maybe even middleboxes. Oh and I haven't even mentioned the CPU cost of UDP.
OTOH, TCP is like a quiet guy at the gym who always wears baggy clothes but does 4 plates on the bench when nobody is looking. Don't underestimate. I wasted months to learn that lesson.
Why is QUIC being pushed, then?
From what I understand, the "killer app" initially was spotty mobile networks. TCP is interface (and IP) specific, so if you switch from WiFi to LTE the conn breaks (or worse, degrades/times out slowly). QUIC has a logical conn id that continues to work even when a peer changes the path. Thus, your YouTube ads will not buffer.
Secondarily, you have the reduced RTT, multiple streams (preventing HOL blocking), datagrams (realtime video on the same connection), and the ability to scale buffers (in userspace) to avoid BDP limits imposed by the kernel. However, I think in practice those haven't gotten as much visibility and traction, so the original reason is still the main one from what I can tell.
MPTCP provides interface mobility. It's seen widespread deployment with the iPhone, so network support today is much better than one would assume. Unlike QUIC, the changes required by applications are minimal to none. And it's backward compatible; an application can request MPTCP, but if the other end doesn't support it, everything still works.
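For what it's worth, on Linux the "minimal to none" application change really is about one line at socket creation time. A hedged sketch (assuming Linux >= 5.6; the fallback logic is just how I'd write it, not anything mandated):

```c
/*
 * Sketch of opting into MPTCP on Linux: request an MPTCP socket and
 * fall back to plain TCP if the kernel (or policy) doesn't allow it.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* defined by Linux >= 5.6 */
#endif

int open_stream_socket(void) {
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
    if (fd < 0) {
        perror("MPTCP unavailable, falling back to TCP");
        fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    }
    return fd;   /* connect()/read()/write() behave exactly as with TCP */
}
```

Everything after that point (connect, read, write, close) works exactly as it would on a normal TCP socket, which is the backward-compatibility point above.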
It has good properties compared to TCP-in-TCP (HTTP/2), especially when connected to clients without access to modern congestion control on iffy networks. HTTP/2 was perhaps adopted too broadly; the binary protocol is useful, header compression is useful (but sometimes dangerous), but multiplexing streams over a single TCP connection is bad unless you have very low loss ... it's not ideal for phones with inconsistent networking.
I know in the p2p space, peers have to send lots of small pieces of data. QUIC keeps a single delayed packet from blocking every stream.
Because it _does_ provide a number of benefits (potentially fewer initial round-trips, more dynamic routing control by using UDP instead of TCP, etc.), and it does so as a userspace software implementation compared with a hardware-accelerated option.
QUIC getting hardware acceleration should close this gap, and keep all the benefits. But a kernel (software) implementation is basically necessary before it can be properly hardware-accelerated in future hardware (is my current understanding)
To clarify, the userspace implementation is not a benefit; it's just that you can't have a brand-new protocol dropped into a trillion dollars of existing hardware overnight, so you have to do userspace first as a proof of concept.
It does save 2 round-trips during connection compared to TLS-over-TCP, if Wikipedia's diagram is accurate: https://en.wikipedia.org/wiki/QUIC#Characteristics That is a decent latency win on every single connection, and with 0-RTT you can go further, but 0-RTT is stateful and hard to deploy and I expect it will see very little use.
The problem it is trying to solve is not the overhead of the Linux kernel on a big server in a datacenter.
Google wants control.
QUIC performance requires careful use of batching. Using UDP sockets naively, i.e. sending one QUIC packet per syscall, will incur a lot of overhead - every time, the kernel has to figure out which interface to use, queue it up on a buffer, and all the rest. If one uses it like TCP, batching up lots of data and enqueuing packets in one "call" helps a ton. Similarly, the kernel WireGuard implementation can be slower than wireguard-go since it doesn't batch traffic. At the speeds offered by modern hardware, we really need to use vectored I/O to be efficient.
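As a concrete sketch of the batching point (assuming Linux; the helper name and BATCH size are just illustrative), sendmmsg(2) lets you hand the kernel a whole burst of already-built QUIC packets in one syscall instead of one sendto(2) per packet:

```c
/*
 * Illustrative helper: send up to BATCH pre-built UDP payloads to one
 * destination with a single sendmmsg(2) call.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

int send_batch(int fd, struct sockaddr *dst, socklen_t dstlen,
               char **pkts, size_t *lens, unsigned int count) {
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];

    if (count > BATCH)
        count = BATCH;

    memset(msgs, 0, sizeof msgs);
    for (unsigned int i = 0; i < count; i++) {
        iovs[i].iov_base = pkts[i];
        iovs[i].iov_len  = lens[i];
        msgs[i].msg_hdr.msg_name    = dst;
        msgs[i].msg_hdr.msg_namelen = dstlen;
        msgs[i].msg_hdr.msg_iov     = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
    }

    /* One syscall for up to BATCH packets instead of BATCH syscalls. */
    return sendmmsg(fd, msgs, count, 0);
}
```

UDP GSO (the UDP_SEGMENT socket option) goes further by letting the kernel split one large buffer into wire-size datagrams, but that's beyond this sketch.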
I would expect that a protocol such as TCP performs much better than QUIC in benchmarks. Now do a realistic benchmark over a roaming LTE connection and come back with the results.
Without seeing actual benchmark code, it's hard to tell if you should even care about that specific result.
If your goal is to pipe lots of bytes from A to B over an internal network or the public internet, there probably aren't many things, if any, that can outperform TCP. Decades were spent optimizing TCP for that. If HOL blocking isn't an issue for you, then you can keep using HTTP over TCP.
IMO being Google's proprietary crap is enough reason to stay away. It not actually being any better is an even more compelling reason.
It's not proprietary. It is an IETF standard and several of the authors are not from Google.
It’s an interesting testament to how well designed TCP is.
IMO, it's more a testament to how fast hardware designers can make things with 30 years to tune.
For the love of god, can we please move to microkernel-based operating systems already? We're adding a million lines of code to the Linux kernel every year. That's so much attack surface. We're setting ourselves up for a Kessler syndrome of sorts with every system that we add to the kernel.
Most of that code is not loaded into the kernel; it's only loaded when needed.
True, but the last time I checked (several years ago), the size of the portion of code that is not drivers or kernel modules was still 7 million lines of code, and the average system still has to load a few million more via kernel modules and drivers. That is still a phenomenally large attack surface.
The SeL4 kernel is 10k lines of code. OKL4 is 13k. QNX is ~30k.
Can I run Firefox or PostgreSQL with reasonable performance on SeL4, OKL4, or QNX?
SeL4 was not built for multiple CPU cores; it's not going to perform with modern-day "high end" hardware, and last I looked its formal proofs don't apply to multicore systems.
Reasonable performance includes GPU acceleration for both rendering and decoding media, right?
Not necessarily; shortcomings or "TODO"s are okay. I just want to know if I can run actual real-world complex applications on these microkernels, and what the trade-offs are (if any). Firefox on OpenBSD has fairly reasonable performance, but is quite a lot slower than on Linux. It's a perfectly reasonable trade-off, but you do need to be aware of it.
I've asked this question a few times over the last few years when people bring up "we must use microkernel now! They already exist!"-type posts, and thus far the response has either been crickets or vague hand-waving with microbenchmarks that bear no relation to real-world programs.
yes
You've still got a combinatorial complexity problem though, because you never know what a specific user is going to load.
Often you do know what a specific user is going to load
Naive question: is macOS or iOS a microkernel? They seem to support HTTP/3 in their network foundation libraries and I'm wondering if it's userland only or more.
macOS is a hybrid kernel, which has been becoming more microkernel-like over time, and they are aggressively pushing more and more things to userspace. I don't think it will ever be a full microkernel, but it is promising to see that happening there.
Ironic (in the Alanis Morissette sense) that Apple has strictly controlled hardware AND OS-level software... if there's anybody out there that can possibly get away with a monolithic kernel in a safe way, it would be them. But Linux, where you have to support practically infinite variations in hardware and the full bazaar of software? That's a dumpster fire waiting to happen.
I might be wrong, but microkernels also need drivers, so wouldn't the attack surface be the same?
You're not wrong, but monolithic kernel drivers run at a privilege level that's even higher than root (ring 0), while microkernels run them in userspace, so they're only about as dangerous as running a normal program.
"Just think of the power of ring-0, muhahaha! Think of the speed and simplicity of ring-0-only and identity-mapping. It can change tasks in half a microsecond because it doesn't mess with page tables or privilege levels. Inter-process communication is effortless because every task can access every other task's memory.
"It's fun having access to everything."
— Terry A. Davis
> Inter-process communication is effortless because every task can access every other task's memory.
I think this would get messy quick in an OS designed by more than one person
Redox is a microkernel-based OS written in Rust.
Brace for unauthenticated remote code execution exploits on the network stack.
I've been hearing about QUIC for ages, yet it is still an obscure tech and will likely end up like IPv6.
> yet it is still an obscure tech and will likely end up like IPv6.
Probably. According to Google, IPv6 has a measly 46% of internet traffic now [0], and is growing at about 5% per year. QUIC is 40% of Chrome traffic, and is growing at 5% every two years [1]. So yeah, their fates do look similar, which is to say both are headed for world domination in a couple of decades.
[0] https://dnsmadeeasy.com/resources/the-state-of-ipv6-adoption...
[1] https://www.cellstream.com/2025/02/14/an-update-on-quic-adop...
When you remove IoT, those numbers will look very different.
> When you remove IoT, those numbers will look very different.
To paraphrase: "when you remove all the new stuff being added, you will see all the old stuff is still using the old protocols". Sounds reasonable, but I don't believe it. These IoT devices usually have the simplest stack imaginable, with many of them implemented straight from the main loop. IPv6 isn't so bad, but QUIC, HTTP/2, and HTTP/3 are a long, long way from simple.
A major driver of IPv6 is phones, which I would not classify as IoT. Where I live they all receive an IPv6 address now. When I hotspot, they hand out a routable IPv6 address to the laptop / desktop. Modern Windows / Linux installations will use the IPv6 address in preference to the double-NAT'ed IPv4 address they also hand out. The funny thing is you don't even notice, or at least I didn't. I only twigged when I happened to be looking at a packet capture from my tethered laptop and saw all this IPv6 traffic, and wondered what the heck was going on. It could have been happening for years without me noticing. Maybe it was.
It wasn't a surprise that I didn't notice. I set up WiFi access for a conference of hundreds of computing nerds and professionals many years ago. Partly for kicks, partly as a learning exercise, I made it IPv6 only. As a backup plan I had an IPv4 network (behind a NAT sadly, which the IPv6 one wasn't) ready to go on a different SSID. To my utter disbelief there were no complaints, literally not a single one. Again, no one noticed.
QUIC is really simple for most IoT: just import the library.
If your idea of IoT is an RPi with 4 GB of RAM connected to a power supply running Linux, then yes, just import the library. But my idea of IoT is something that runs on a battery for a year, and in that case it's unlikely to have enough hardware to support QUIC (or its main user, HTTP/3). QUIC needs too much RAM (min 4 KB), too much flash (min 64 KB for the code), and too much CPU. In reality it's a squeeze even on an nRF52840.
QUIC is already out and in active use. Every major web browser supports it, and it's not like IPv6: there's no fundamental infrastructure change needed to support it, since it's built on top of UDP. The endpoints obviously have to support it, but that's the same as any other protocol built on UDP or TCP (like HTTP, SNMP, etc.).
Your browser is using it when you watch a video on YouTube (HTTP/3).
Isn't it just HTTP/3 now?