I recently had to add `ssl_preread_server_name` to my NGINX configuration in order to `proxy_pass` requests for certain domains to another NGINX instance. In this setup, the first instance simply forwards the raw TLS stream (with `proxy_protocol` prepended), while the second instance handles the actual TLS termination.
This approach works well when implementing a failover mechanism: if the default path to a server goes down, you can update DNS A records to point to a fallback machine running NGINX. That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.
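For reference, the kind of stream configuration described above looks roughly like this; the hostnames, addresses, and upstream names are placeholders rather than the real config:

    stream {
        map $ssl_preread_server_name $upstream {
            example.com  second_nginx;   # raw TLS forwarded to the second instance
            default      local_tls;
        }

        upstream second_nginx { server 192.0.2.10:443; }
        upstream local_tls    { server 127.0.0.1:8443; }

        server {
            listen 443;
            ssl_preread    on;   # peek at the SNI without terminating TLS
            proxy_protocol on;   # prepend PROXY protocol for the next hop
            proxy_pass     $upstream;
        }
    }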
However, this method won't work with HTTP/3. Since HTTP/3 uses QUIC over UDP and encrypts the SNI during the handshake, `ssl_preread_server_name` can no longer be used to route based on domain name.
What alternatives exist to support this kind of SNI-based routing with HTTP/3? Is the recommended solution to continue using HTTP/1.1 or HTTP/2 over TLS for setups requiring this behavior?
Clients supporting QUIC usually also support HTTPS DNS records, so you can use a lower priority record as a failover, letting the client potentially take care of it. (See for example: host -t https dgl.cx.)
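Hypothetically, the records for that could look something like this (lower SvcPriority wins; the fallback target is made up):

    example.com.  300 IN HTTPS 1 .                     alpn="h3,h2"
    example.com.  300 IN HTTPS 2 fallback.example.net. alpn="h2"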
That's the theory anyway. You can't always rely on clients to do that (see how much of the HTTPS record Chromium actually supports[1]), but in general if QUIC fails for any reason clients will transparently fall back, as well as respecting the Alt-Svc[2] header. If this is a planned failover you could stop sending an Alt-Svc record and wait for the alternative to time out, although it isn't strictly necessary.
If you really do want to route QUIC, however, one nice property is that the SNI is always in the first packet, so you can route flows by inspecting the first packet. See Cloudflare's udpgrm[3] (this on its own isn't enough to proxy to another machine, but the building block is there).
Without Encrypted Client Hello (ECH) the client hello (including SNI) is encrypted with a known key (this is to stop middleboxes which don't know about the version of QUIC breaking it), so it is possible to decrypt it, see the code in udpgrm[4]. With ECH the "router" would need to have a key to decrypt the ECH, which it can then decrypt inline and make a decision on (this is different to the TLS key and can also use fallback HTTPS records to use a different key than the non-fallback route, although whether browsers currently support that is a different issue, but it is possible in the protocol). This is similar to how fallback with ECH could be supported with HTTP/2 and a TCP connection.
[1]: https://issues.chromium.org/issues/40257146
[2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
[3]: https://blog.cloudflare.com/quic-restarts-slow-problems-udpg...
[4]: https://github.com/cloudflare/udpgrm/blob/main/ebpf/ebpf_qui...
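For concreteness, here's a rough stdlib-only Python sketch of that known-key derivation for QUIC v1 (per RFC 9001): the Initial packet protection keys are derived from a published salt and the client's Destination Connection ID, so anything on-path can compute them.

    import hmac, hashlib

    INITIAL_SALT_V1 = bytes.fromhex("38762cf7f55934b34d179ae6a4c80cadccbb7f0a")  # RFC 9001

    def hkdf_extract(salt, ikm):
        return hmac.new(salt, ikm, hashlib.sha256).digest()

    def hkdf_expand_label(secret, label, length):
        # TLS 1.3 HKDF-Expand-Label with an empty context (RFC 8446, section 7.1)
        full = b"tls13 " + label.encode()
        info = length.to_bytes(2, "big") + bytes([len(full)]) + full + b"\x00"
        out, block, counter = b"", b"", 1
        while len(out) < length:
            block = hmac.new(secret, block + info + bytes([counter]), hashlib.sha256).digest()
            out += block
            counter += 1
        return out[:length]

    def initial_keys(client_dcid):
        initial_secret = hkdf_extract(INITIAL_SALT_V1, client_dcid)
        client_secret = hkdf_expand_label(initial_secret, "client in", 32)
        return {
            "key": hkdf_expand_label(client_secret, "quic key", 16),  # AES-128-GCM key
            "iv":  hkdf_expand_label(client_secret, "quic iv", 12),
            "hp":  hkdf_expand_label(client_secret, "quic hp", 16),   # header protection key
        }

    # DCID taken from the RFC 9001 Appendix A test vectors
    print({k: v.hex() for k, v in initial_keys(bytes.fromhex("8394c8f03e515708")).items()})

Actually decrypting the Initial packet additionally needs header protection removal and AES-GCM, but the point stands: nothing secret is required.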
> for setups requiring this behavior?
Terminating TLS at your edge (which is presumably where the IP addresses attach) isn't any particular risk in a world of Let's Encrypt, where an attacker (who gained access to that box) could simply request a new SSL certificate, so you might as well do it yourself and move on with life.
Also: I've been unable to reproduce performance and reliability claims of quic. I keep trying a couple times a year to see if anything's gotten better, but I mostly leave it disabled for monetary reasons.
> This approach works well when implementing a failover mechanism: if the default path to a server goes down...
I'm not sure I agree: DNS can take minutes for updates to be reflected, and dumb clients (like web browsers) don't fail over.
So I use an onerror handler to load the second path. For my ad tracking, that looks something like this:
<img src=patha.domain1?tracking
onerror="this.src='pathb.domain2?tracking';this.onerror=function(){}">
but with the more complex APIs, fetch() is wrapped up similarly in the APIs I deliver to users. This works much better than anything else I've tried.

For a failover circumstance, I wouldn't bother with failover for QUIC at all. If a browser can't make a QUIC connection (even if advertised in DNS), it will try HTTP/1.1 or HTTP/2 over TLS. Then you can use the same fallback mechanism you would if it wasn't in the picture.
Unfortunately I think that falls under the "Not a bug" category of bugs. Keeping the endpoint concealed all the way to the TLS endpoint is a feature* of HTTP/3.
* I do actually consider it a feature, but do acknowledge https://xkcd.com/1172/
PS. HAProxy can proxy raw TLS, but can't direct based on hostname. Cloudflare tunnel I think has some special sauce that can proxy on hostname without terminating TLS but requires using them as your DNS provider.
Unless you're using ECH (Encrypted Client Hello) the endpoint is obscured (known keys), not concealed.
PS: HAProxy definitely can do this too, something using req.ssl_sni like this:
    frontend tcp-https-plain
        mode tcp
        tcp-request inspect-delay 10s
        bind [::]:443 v4v6 tfo
        # hello_type 1 = TLS ClientHello
        acl clienthello req.ssl_hello_type 1
        acl example.com req.ssl_sni,lower,word(-1,.,2) example.com
        tcp-request content accept if clienthello
        tcp-request content reject if !clienthello
        default_backend tcp-https-default-proxy
        use_backend tcp-https-example-proxy if example.com
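A matching backend might look something like this (the address is a placeholder):

    backend tcp-https-example-proxy
        mode tcp
        server origin 192.0.2.10:443 send-proxy-v2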
Then tcp-https-example-proxy is a backend which forwards to a server listening for HTTPS (and using send-proxy-v2, so the client IP is kept). Cloudflare really isn't doing anything special here; there are also other tools like sniproxy[1] which can intercept based on SNI (a common thing commercial proxies do for filtering reasons).

Hm, that's a good question. I suppose the same would apply to TCP+TLS with Encrypted Client Hello as well, right? Presumably the answer would be the same/similar between the two.
Not an expert on eSNI, but my understanding was that the encryption in eSNI is entirely separate from the "main" encryption in TLS, and the eSNI keys have to be the same for every domain served from the same IP address or machine.
Otherwise, the TLS handshake would run into the same chicken/egg problem that you have: To derive the keys, it needs the certificate, but to select the certificate, it needs the domain name.
So you only need to replicate the eSNI key, not the entire cert store.
Personally, I'd like to have an option of the outbound firewall doing the eSNI encryption, is that possible?
That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.
Won't you need to "replicate the TLS config" on the back end servers then? And how hard is it to configure TLS on the nginx side anyway, can't you just use ACME?
QUIC v1 does encrypt the SNI in the client hello, but the keys are derived from a predefined salt and the destination connection id. I don't see why decrypting this would be difficult for a nginx plugin.
There is no way to demultiplex incoming QUIC or HTTP/3 connections based on plaintext metadata inside the protocol. The designers went one step too far in their fight against middleboxes of all sorts. Unless you can assign each destination at least its own (IP address, UDP port) pair you're shit out of luck and can't have end-to-end encryption. A QUIC proxy has to decrypt, inspect, and reencrypt the traffic. Such a great performance and security improvement :-(. With IPv6 you can use unique IP addresses which immediately undoes any of the supposed privacy advantages of encrypting the server name in the first place. With IPv4 you're pretty much fucked. Too bad SRV record support for HTTP(S) was never accepted because it would threaten business models. I guess your best bet is to try to redirect clients to unique ports.
Hiding SNI is more important than breaking rare cases of weird web server setups. This setup is not typical because large organizations like Google tend to put all the services behind the same domain name.
I recall this article on QUIC disadvantages: https://www.reddit.com/r/programming/comments/1g7vv66/quic_i...
Seems like this is a step in the right direction to resolve some of those issues. I suppose nothing is preventing it from getting hardware support in future network cards as well.
QUIC does not work very well for use cases like machine-to-machine traffic. However, most of the traffic on the Internet today is from mobile phones to servers, and that is where QUIC and HTTP/3 shine.
For other use cases we can keep using TCP.
Let me try providing a different perspective based on experience. QUIC works amazingly well for _some_ kinds of machine to machine traffic.
ssh3, based on QUIC, is quicker at dropping into a shell than ssh. The latency difference was clearly visible.
QUIC with the unreliable dgram extension is also a great way to implement port forwarding over ssh. Tunneling one reliable transport over another hides the packet losses in the upper layer.
The article that GP posted was specifically about throughput over a high speed connection inside a data center.
It was not about latency.
In my opinion, the lessons that one can draw from this article should not be applied for use cases that are not about maximum throughput inside a data center.
Why doesn't QUIC work well for machine-to-machine traffic? Is it due to the lack of the offloads/optimizations that TCP has, given that machine-to-machine traffic tends to be high volume/high rate?
QUIC would work okay, but not really have many advantages for machine-to-machine traffic. Machine-to-machine you tend to have long-lived connections over a pretty good network. In this situation TCP already works well and is currently handled better in the kernel. Eventually QUIC will probably be just as good as TCP in this use case, but we're not there yet.
You still have latency, legacy window sizes, and packet schedulers to deal with.
But that is the huge advantage of QUIC. It does NOT totally outcompete TCP traffic on links (we already have bittorrent over udp for that purpose). They redesigned the protocol 5 times or so to achieve that.
NAT firewalls do not like P2P UDP traffic. The majority of routers lack the smarts to pass through QUIC correctly; they essentially need to treat it the same as TCP.
NAT is the devil. bring on the IPoc4lypse
NAT isn't dead with IPv6. ISPs assigning a /128 to your residential network is a thing.
No it isn't unless they want to ban you from using iPhones.
What do you mean? If the v6 configuration is incompatible with iPhones, the iPhone will just use v4
NAT is massively useful for all sorts of reasons which have nothing to do with IP limitations.
sounds great but it fucks up P2P in residential connections, where it is mostly used due to ipv4 address conservation. You can still have nat in IPv6 but hopefully I won't have to deal with it
In practice, P2P over ipv6 is totally screwed because there are no widely supported protocols for dynamic firewall pinholing (allowing inbound traffic) on home routers, whereas dynamic ipv4 NAT configuration via UPnP is very popular and used by many applications.
Most home routers do a form of stateful IPv6 firewall (and IPv4 NAT for that matter) compatible with STUN. UPnP is almost never necessary and has frequent security flaws in common implementations.
You just send a (UDP) packet to the other side's address and port and they send one to yours. The firewalls treat it as an outbound connection on both sides.
I don't believe that's true. You would still need something like UDP hole punching to bootstrap the inbound flow on both sides first. Also you would still only be limited to UDP traffic, TCP would still be blocked.
Sending one packet outbound is hole punching. It's really that simple. Since there's no NAT, you don't need to bother with all the complexity of trying to predict the port number on the public side of the NAT. You just have two sides send at least one packet to each other, and that opens the firewalls on both sides.
You just need to tell the other side that you want to connect.
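A rough sketch of that in Python, with made-up addresses (each side learns the other's address/port through some signaling channel first):

    import socket

    PEER = ("2001:db8::2", 40000)   # learned out of band via a rendezvous/signaling channel

    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.bind(("::", 40000))
    sock.sendto(b"punch", PEER)      # outbound packet creates state in our own firewall
    sock.settimeout(5.0)
    try:
        data, addr = sock.recvfrom(1500)   # the peer's packet now looks like return traffic
        print("peer reachable:", addr)
    except socket.timeout:
        print("no reply yet; resend and retry")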
just don't use a firewall
The NAT RFC talks purely about IP exhaustion.
What do you have in mind?
Why run your K8S cluster on IPv6 when IPv4 with 10.0.0.0/8 works perfectly with less hassle? You can always support IPv6 at the perimeter for ingress/egress. If your cluster is so big it can’t fit in 10.0.0.0/8, maybe the right answer is multiple smaller clusters-your service mesh (e.g. istio) can route inter-cluster traffic just based on names, not IPs.
And if 10.0.0.0/8 is not enough, there is always the old Class E, 240.0.0.0/4 - likely never going to be acceptable for use on the public Internet, but growing use as an additional private IPv4 address range - that gives you over 200 million more IPv4 addresses
> Why run your K8S cluster on IPv6 when IPv4 with 10.0.0.0/8 works perfectly with less hassle? You can always support IPv6 at the perimeter for ingress/egress.
How is it "less hassle"? You've got to use a second, fiddlier protocol and you've got to worry about collisions and translations. Why not just use normal IPv6 and normal addresses for your whole network, how is that more hassle?
> You can always support IPv6 at the perimeter for ingress/egress. If your cluster is so big it can’t fit in 10.0.0.0/8, maybe the right answer is multiple smaller clusters-your service mesh (e.g. istio) can route inter-cluster traffic just based on names, not IPs.
You can work around the problems, sure. But why not just avoid them in the first place?
> How is it "less hassle"? You've got to use a second, fiddlier protocol and you've got to worry about collisions and translations.
Because, while less common than it used to be, software that has weird bugs with IPv6 is still a thing-especially if we are talking about internally developed software as opposed to just open source and major proprietary packages. And as long as IPv6 remains the minority in data centre environments, that’s likely to remain true - it is easy for bugs to linger (or even new ones to be introduced) when they are only triggered by a less popular configuration
True, but already the newest software has good IPv6 support, and that suggests a tipping point should be coming where as soon as the majority is on IPv6 it becomes in everyone's interest to get off of IPv4.
Kubes
Rather, NAT is a bandage for all sorts of reasons besides IP exhaustion.
Example: Janky way to get return routing for traffic when you don't control enterprise routes.
Source: FW engineer
Sure. When I can BGP advertise my laptop with my phone provider and have it update in a second or so globally when I move from tethering to wifi, or from one network to another.
No doubt you think I should simply renumber all my VMs every time that happens, breaking internal connections. Or perhaps run completely separate addressing in each VM in parallel and make sure each VM knows which connection to use. Perhaps the VMs peer with my laptop and then the laptop decides what to push out which way via localprefs, AS paths etc. That sounds so much simpler than a simple masquerade.
What happens when I want vm1 out of connection A, vm 3 out of connection B, vm 4-7 out of connection C. Then I want to change them quickly and easily. I’m balancing outbound and inbound rules, reaching for communities, and causing bgp dampening all over the place.
What when they aren’t VMs but instead physical devices. My $40 mifi is now processing the entire DFZ routing table?
What happens when I want a single physical device like a tv to contact one service via connection 1 and another via connection 2 but the device doesn’t support multiple routing tables or selection of that. What if it does support it but I just want to be able to shift my ssh sessions to a low latency higher loss link but keep my streaming ups on the high latency no loss link.
All this is trivial with nat. Now sure I can use NAT66, and do a 1:1 natting (no PAT here), but then I’m using nat and that breaks the ipv6 cult that believes translating network addresses is useless.
Fair, there are reasons to keep it around, like load-balancing and connection persistence.
QUIC isn’t generally P2P though. Browsers don’t support NAT traversal for it.
I think basically there is currently a lot of overhead and, when you control the network more and everything is more reliable, you can make tcp work better.
It's explained in the reddit thread. Most of it is because you have to handle a ton of what TCP does in userland.
For starters, why encrypt something literally in the same datacenter 6 feet away? It adds significant latency and processing overhead.
Encryption gets you data integrity "for free". If a bit is flipped by faulty hardware, the packet won't decrypt. TCP checksums are not good enough for catching corruption in many cases.
Interesting. When I read this I was thinking “that can’t be right, the whole internet relies on tcp being “reliable”. But it is right; https://dl.acm.org/doi/10.1145/347059.347561. It might be rare, but an unencrypted RPC packet might accidentally set that “go nuclear” bit. ECC memory is not enough people! Encrypt your traffic for data integrity!
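A toy illustration of how weak the 16-bit Internet checksum is: it's a ones'-complement sum of 16-bit words, so simply reordering the words in a payload goes completely undetected.

    def inet_checksum(data: bytes) -> int:
        # ones'-complement sum of 16-bit words, as used by TCP/UDP
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += int.from_bytes(data[i:i + 2], "big")
            total = (total & 0xFFFF) + (total >> 16)
        return (~total) & 0xFFFF

    a = b"\x12\x34\x56\x78"
    b = b"\x56\x78\x12\x34"   # same 16-bit words, different order
    assert a != b and inet_checksum(a) == inet_checksum(b)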
To stop or slow down the attacker who is inside your network and trying to move horizontally? Isn’t this the principle of defense in depth?
Because the NSA actively intercepts that traffic. There's a reason why encryption is non optional
To me this seems outlandish (e.g. if you're part of PRISM you know what's happening and you're forced to comply.) But to think through this threat model, you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts to spy traffic at the NIC level? I guess it would be harder to intercept and untangle traffic at the NIC level than intra-DC, but I'm not sure?
> you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts
It doesn't have to be one or the other. We've known for over a decade that the traffic between DCs was tapped https://www.theguardian.com/technology/2013/oct/30/google-re... Extending that to intra-DC wouldn't be surprising at all.
Meanwhile backdoored chips and firmware attacks are a constant worry and shouldn't be discounted regardless of the first point.
> you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts to spy traffic at the NIC level
It might not be able to, if you use secure boot and your server is locked in a cage.
> (e.g. if you're part of PRISM you know what's happening and you're forced to comply.)
Only a handful of people need to know what happens in Room 641A, and they're compelled or otherwise incentivized not to let anyone else know.
The difference between tapping intra-DC links and spying from inside the computer is that in-computer spying is much more likely to get caught and much less easily able to get data out. There's a pretty big difference between software/hardware weaknesses that require specific targeting to exploit and passively scooping everything up and scanning it.
If you are concerned about this, how do you think you could protect against AWS etc allowing NSA to snoop on you from the hypervisor level?
Assuming the PSP isn't backdoored, using AMD SME and SEV theoretically allow you to run VMs that are encrypted such that, even at the hypervisor level, you can't read code or data from the VM.
You cannot assume that. The solution is to have a server on your territory and use the datacenter only to forward the packets.
Imaginary problems are the funnest to solve.
It's a stone cold fact that the NSA does this; it was part of the Snowden revelations. Don't spread FUD about security, it's important.
Service meshes often encrypt traffic that may be running on the same physical host. Your security policy may simply require this.
Because any random machine in the same datacenter and network segment might be compromised and do stuff like running ARP spoofing attacks. Cisco alone has had so many vendor-provided backdoors cropping up that I wouldn't trust anything in a data center with Cisco gear.
Ummm, no, The network is completely isolated. No one enters the cage and just plugs something into my switches/routers.
Any communication between the cage and the outside world is through the cross-connects.
Unless it's some state-adversary, no one taps us like this. This is not a shared hosting. No one runs serious workloads like this.
"Unserious"? Sure, everything is encrypted p2p.
I don't understand what you mean by "machine-to-machine" if a phone (a machine) talking to a server (a machine) is not machine-to-machine.
I hope you don't think that user-to-machine means that I have to stick my finger in a network switch? :)
Machine-to-machine is usually meant as traffic where neither of the sides is the client device (desktop, mobile etc). Often not initiated by user, but that's debatable.
I would say a server syncing a database to a passive node is machine-to-machine, while a user connecting from their browser to a webserver is not.
I don't know about using it in the kernel but I would love to see OpenSSH support QUIC so that I get some of the benefits of Mosh [1] while still having all the features of OpenSSH including SFTP, SOCKS, port forwarding, less state table and keep alive issues, roaming support, etc... Could OpenSSH leverage the kernel support?
[1] - https://mosh.org/
SSH would need a lot of work to replace its crypto and mux layers with QUIC. It's probably worth starting from scratch to create a QUIC login protocol. There are a bunch of different approaches to this in various states of prototyping out there.
Fair points. I suppose Mosh would be the proper starting point then. I'm just selfish and want the benefits of QUIC without losing all the really useful features of OpenSSH.
OpenSSH is an OpenBSD project therefore I guess a Linux api isn't that interesting but I could be wrong ofc.
Once Linux implements it, I think odds are high that FreeBSD sooner or later does too. And maybe NetBSD and XNU/Darwin/macOS/iOS thereafter. And if they’ve all got it, that increases the odds that eventually OpenBSD also implements it. And if OpenBSD has the support in its kernel, then they might be willing to consider accepting code in OpenSSH which uses it. So OpenSSH supporting QUIC might eventually happen, but if it does, it is going to be some years away
That's a good point. At least it would not be an entirely new idea. [1] Curious what reactions he received.
[1] - https://papers.freebsd.org/2022/eurobsdcon/jones-making_free...
What will the socket API look like for multiple streams? I guess it is implied it is the same as multiple connections, with caching behind the scenes.
I would hope for something more explicit, where you get a connection object and then open streams from it, but I guess that is fine for now.
Ah, but look at this: https://github.com/microsoft/msquic/discussions/4257 --- unless this is an extension, the server side can also create new streams once a connection is established. The client creating new "connections" (actually streams) cannot abstract over this. Something fundamentally new is needed.
My guess is recvmsg to get a new file descriptor for a new stream.
I would look at the SCTP socket API; it supports multistreaming.
Ah fuck, it still has a stream_id notion
How are socket APIs always such garbage....
At least the SCTP API has sctp_peeloff, which gives you a new single-stream socket descriptor for the connection. Maybe QUIC will get something like that, eventually. Kind of a glaring omission, though, unless I'm misunderstanding.
Yeah. Huge omission.
> API RFC is ...
still a draft though.
I checked that out and....yuck!
- Send specifies which stream by ordinal number? (Can't have different parts of a concurrent app independently open new streams)
- Receive doesn't specify which stream at all?!
SCTP is very telecom-shaped; in particular, IIRC, the number of streams is fixed at the start of the connection, so (this sucks but also) GP’s problem does not appear.
I have a question - the bottleneck for TCP is said to be the handshake. But that can be solved by reusing connections and/or multiplexing. The current implementation is 3-4x slower than in-kernel TLS/TCP, and the performance gap is expected to close.
If speed is touted as the advantage of QUIC and it is in fact slower, why bother with this protocol? The author of the PR attributes some of the speed issues to the protocol design. Are there other problems in TCP that need fixing?
The article discusses many of the reasons QUIC is currently slower. Most of them seem to come down to "we haven't done any optimization for this yet".
> Long offers some potential reasons for this difference, including the lack of segmentation offload support on the QUIC side, an extra data copy in transmission path, and the encryption required for the QUIC headers.
All of these three reasons seem potentially very addressable.
It's worth noting that the benchmark here is on pristine network conditions, a drag race if you will. If you are on mobile, your network will have a lot more variability, and there TCP's design limits are going to become much more apparent.
TCP itself often has protocols run on top of it, to do QUIC like things. HTTP/2 is an example of this. So when you compare QUIC and TCP, it's kind of like comparing how fast a car goes with how fast an engine bolted to a frame with wheels on it goes. QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3. Thats less system design.
QUIC also has wins for connecting faster, and especially for reconnecting faster. It also has IP mobility: if you're on mobile and your IP address changes (happens!) QUIC can keep the session going without rebuilding it once the client sends the next packet.
It's a fantastically well thought out & awesome advancement, radically better in so many ways. The advantages of having multiple non-blocking streams (like SCTP) massively reduce the scope that higher level protocol design has to take on. And all that multi-streaming stuff being in the kernel means it's deeply optimizable in a way TCP can never enjoy.
Time to stop driving the old rust bucket jalopy of TCP around everywhere, crafting weird elaborate handmade shit atop it. We need a somewhat better starting place for higher level protocols and man oh man is QUIC alluring.
> QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3
IP is layer 3 - network (ensures packets are routed to the correct host). TCP is layer 4 - transport (some people argue that TCP has functions from layer 5, e.g. establishing sessions between apps), while TLS adds a few functions from layer 6 (e.g. encryption), which QUIC also has.
The OSI is not a useful guide to how layering works in the Internet.
TCP is level 4 in the OSI model
That's just one bottleneck. The other issue is head-of-line blocking. When there is packet loss on a TCP connection, nothing sent after that is delivered until the loss is repaired.
What's the packet loss rate on modern networks? Curious.
~80% when you step out of wifi range on your cell phone.
… from 0% (a wired home LAN with nothing screwy going on) to 100% (e.g., cell reception at the San Antonio Caltrain station), depending on conditions…?
As it always has been, and always will be.
It can be high on cellular.
Pretty bad sometimes when on a train
That depends on how much data you are pushing. If you are pushing 200 Mbps on a 100 Mbps line you will get 50% packet loss.
Well, yes, that's the idea behind TCP itself, but a "normal" rate of packet loss is something along the lines of 5/100k packets dropped on any given long-haul link. Let's say a random packet passes about 8 such links, so a "normal" rate of packet loss is 0.025% or so.
Once it makes it to the long haul links. Measure starting at your cell phone and packet loss is much higher than 0.025% and that's where QUIC shines.
TCP windowing fixes the issue you are describing. Make the window big and TCP will keep sending when there is a packet loss. It will also retry and usually recover before the end of the window is reached.
The statement in the comment you're replying to is still true. While waiting for those missed packets, the later packets will not be dropped if you have a large window size. But they won't be delivered either. They'll be cached in the kernel, even though it may be that the application could make use of them before the earlier blocked packet.
They are unrelated. Larger windows help achieve higher throughput over paths with high delay. You allude to selective acknowledgements as a way to repair loss before the window completely drains which is true, but my point is that no data can be delivered to the application until the loss is repaired (and that repair takes at least a round-trip time). (Then the follow-on effects from noticed loss on the congestion controller can limit subsequent in-flight data for a time, etc, etc.)
The application will hang waiting for the stack, but the stack keeps working and once the drop is remedied, the application will get a flood of data at a higher rate than the max network rate. So the application may pause sometimes, but the average rate of throughput is not much affected by drops.
The queuing discipline used by default (pfifo_fast) is barely more than 3 FIFO queues bundled together. The 3 queues allow for a barest minimum semblance of prioritisation of traffic, where Queue 0 > 1 > 2, and you can tweak some tcp parameters to have your traffic land in certain queues. If there's something in queue 0 it must be processed first before anything in queue 1 gets touched etc.
Those queues operate purely head-of-queue basis. If what is at the top of the queue 0 is blocked in any way, the whole queue behind it gets stuck, regardless of if it is talking to the same destination, or a completely different one.
I've seen situations where a glitching network card caused some serious knock on impacts across a whole cluster, because the card would hang or packets would drop, and that would end up blocking the qdisc on a completely healthy host that was in the middle of talking to it, which would have impacts on any other host that happened to be talking to that healthy host. A tiny glitch caused much wider impacts than you'd expect.
The same kind of effect would happen from a VM that went through live migration. The tiny, brief pause would cause a spike of latency all over the place.
There are alternatives like fq_codel that can mitigate some of this, but you do have to pay a small amount of processing overhead on every packet, because now you have a queuing discipline that actually needs to track some semblance of state.
> bottleneck for TCP is said to the handshake. But that can be solved by reusing connections
You can't reuse a connection that doesn't exist yet. A lot of this is about reducing latency not overall speed.
The "advantage" is tracking via the server provided connection ID https://www.cse.wustl.edu/~jain/cse570-21/ftp/quic/index.htm...
That's nonsensical. The connection ID doesn't allow tracking that you couldn't do with TCP.
I'm confused, I thought the revolution of the past decade or so was in moving network stacks to userspace for better performance.
Most QUIC stacks are built upon in-kernel UDP. You get significant performance benefits if you can avoid your traffic going through kernel and userspace and the context switches involved.
You can work that angle by moving networking into user space... setting up the NIC queues so that user space can access them directly, without needing to context switch into the kernel.
Or you can work the angle by moving networking into kernel space ... things like sendfile which let a tcp application instruct the kernel to send a file to the peer without needing to copy the content into userspace and then back into kernel space and finally into the device memory, if you have in-kernel TLS with sendfile then you can continue to skip copying to userspace; if you have NIC based TLS, the kernel doesn't need to read the data from the disk; if you have NIC based TLS and the disk can DMA to the NIC buffers, the data doesn't need to even hit main memory. Etc
But most QUIC stacks don't get benefit from either side of that. They're reading and writing packets via syscalls, and they're doing all the packetization in user space. No chance to sendfile and skip a context switch and skip a copy. Batching io via io_uring or similar helps with context switches, but probably doesn't prevent copies.
Yeah, there’s also a lot of offloads that can be done to the kernel with UDP (e.g. UDP segmentation offload, generic receive offload, checksum offload), and offloading quick entirely would be a natural extension to that.
It just offers people choice for the right solution at the right moment.
You are right but it's confusing because there are two different approaches. I guess you could say both approaches improve performance by eliminating context switches and system calls.
1. Kernel bypass combined with DMA and techniques like dedicating a CPU to packet processing improve performance.
2. What I think of as "removing userspace from the data plane" improves performance for things like sendfile and ktls.
To your point, Quic in the kernel seems to not have either advantage.
So... RDMA?
No, the first technique describes the basic way they already operate, DMA, but giving access to userspace directly because it's a zerocopy buffer. This is handled by the OS.
RDMA is directly from bus-to-bus, bypassing all the software.
You still need to offload your bytes to a NIC buffer. Either you can do something like DMA where you get privileged space to write your bytes to that the NIC reads from or you have to cross the syscall barrier and have your kernel write the bytes into the NIC's buffer. Crossing the syscall barrier adds a huge performance penalty due to the switch in memory space and privilege rings so userspace networking only makes sense if you're not having to deal with the privilege changes or you have DMA.
That is only a problem if you do one or more syscalls per packet, which is an utterly bone-headed design.
The copy itself is going at 200-400 Gbps so writing out a standard 1,500 byte (12,000 bit) packet takes 30-60 ns (in steady state with caches being prefetched). Of course you get slaughtered if you stupidly do a syscall (~100 ns hardware overhead) per packet since that is like 300% overhead. You just batch like 32 packets so the write time is ~1,000-2,000 ns then your overhead goes from 300% to 10%.
At a 1 Gbps throughput, that is ~80,000 packets per second or one packet per ~12.5 us. So, waiting for a 32 packet batch only adds an additional 500 us to your end-to-end latency in return for 4x efficiency (assuming that was your bottleneck; which it is not for these implementations as they are nowhere near the actual limits). If you go up to 10 Gbps, that is only 50 us of added latency, and at 100 Gbps you are only looking at 5 us of added latency for a literal 4x efficiency improvement.
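A quick back-of-the-envelope check of those figures (all inputs are the assumptions from the comment above):

    PACKET_BITS = 1500 * 8   # 12,000 bits per packet
    SYSCALL_NS = 100         # assumed fixed syscall cost

    for gbps in (200, 400):
        copy_ns = PACKET_BITS / gbps              # Gbit/s == bit/ns, so this is ns per packet copy
        unbatched = SYSCALL_NS / copy_ns          # one syscall per packet
        batched = SYSCALL_NS / (copy_ns * 32)     # one syscall per 32-packet batch
        print(f"{gbps} Gbps: copy {copy_ns:.0f} ns/pkt, overhead {unbatched:.0%} unbatched vs {batched:.0%} batched")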
What is done for that is userspace gets the network data directly without (I believe) involving syscalls. It's not something you'd do for end-user software, only the likes of MOFAANG need it.
In theory the likes of io_uring would bring these benefits across the board, but we haven't seen that delivered (yet, I remain optimistic).
I'm hoping we get there too with io_uring. It looks like the last few kernel release have made a lot of progress with zero-copy TCP rx/tx, though NIC support is limited and you need some finicky network iface setup to get the flow steering working
The constant mode switching for hardware access is slow. TCP/IP remains in the kernel for windows and Linux.
Performance comes from dedicating core(s) to polling, not from userspace.
Networking is much faster in the kernel. Even faster on an ASIC.
Network stacks were moved to userspace because Google wanted to replace TCP itself (and upgrade TLS), but it only cared about the browser, so they just put the stack in the browser, and problem solved.
> Calls to bind(), connect(), listen(), and accept() can be used to initiate and accept connections in much the same way as with TCP, but then things diverge a bit. [...] The sendmsg() and recvmsg() system calls are used to carry out that setup
I wish the article explained why this approach was chosen, as opposed to adding a dedicated system call API that matches the semantics of QUIC.
What is the need for mashing more and more stuff into the kernel? I thought the job of the kernel was to manage memory, hardware, and tasks. Shouldn't protocols built on top of IP be handled by userland?
Having networking, routing, VPN etc all not leave kernel space can be a performance improvement for some use cases.
Similarly, splitting the networking/etc stacks out from the kernel into userspace can also be a performance improvement for some use cases.
Can't you say that about virtually anything? I'm sure having, say, MIDI synthesizers in the kernel would improve performance too, but not many think that is a good idea.
Yup, context switches between kernelspace and userspace are very expensive in high-performance situations, which is why these types of offloads are used.
At specific workloads (think: load balancers / proxy servers / etc), these things become extremely expensive.
Maybe. Getting stuff into the kernel means (in theory) it’s been hardened, it has a serious LTS, and benefits from… well, the performance of being part of the kernel.
DMA transfers and NIC offloading
No, protocols directly on IP specifically can’t be used in userland because they can’t be multiplexed to multiple processes.
If everything above IP was in userland, only one program at a time could use TCP.
TCP and UDP being intermediated by the kernel allow multiple programs to use the protocols at the same time because the kernel routes based on port to each socket.
QUIC sits a layer even higher because it cruises on UDP, so I think your point still stands, but it’s stuff on top of TCP/UDP, not IP.
How do you think this works on microkernels? Do they have no support for multiple applications using the network?
That is not at all a problem. On a microkernel you just have a userspace TCP/network server that your other programs talk to that manages/multiplexes the shared network connection.
If they don’t have TCP in them, yes. Either each application would need its own IP or another application would be responsible for being the TCP port router.
Looks good. QUIC is a real game changer for many. The Internet should be a little faster with it. Probably we will not care because of 5G, but still valuable. Wondering why there are two separate handshakes; I was thinking that QUIC embeds TLS, but it seems I am wrong.
The general web is slowed down by bloated websites. But I guess this can make game latency lower.
https://en.m.wikipedia.org/wiki/Jevons_paradox
The Jevons Paradox is applicable in a lot of contexts.
More efficient use of compute and communications resources will lead to higher demand.
In games this is fine. We want more, prettier, smoother, pixels.
In scientific computing this is fine. We need to know those simulation results.
On the web this is not great. We don’t want more ads, tracking, JavaScript.
No, the last 20 years of browser improvements has made my static site incredibly fast!
I'm benefiting from WebP, JS JITs, Flexbox, zstd, Wasm, QUIC, etc, etc
This seems to be a categorical error, for reasons that are contained in the article itself. The whole appeal of QUIC is being immune to ossification, being free to change parameters of the protocol without having to beg Linux maintainers to agree.
IMHO, you likely want the server side to be in the kernel, so you can get to performance similar to in-kernel TCP, and ossification is less of a big deal, because it's "easy" to modify the kernel on the server side.
OTOH, you want to be in user land on the client, because modifying the kernel on clients is hard. If you were Google, maybe you could work towards a model where Android clients could get their in-kernel protocol handling to be something that could be updated regularly, but that doesn't seem to be something Google is willing or able to do; Apple and Microsoft can get priority kernel updates out to most of their users quickly; Apple also can influence networks to support things they want their clients to use (IPv6, MP-TCP). </rant>
If you were happy with congestion control on both sides of TCP, and were willing to open multiple TCP connections like http/1, instead of multiplexing requests on a single connection like http/2, (and maybe transfer a non-pessimistic bandwidth estimate between TCP connections to the same peer), QUIC still gives you control over retransmission that TCP doesn't, but I don't think that would be compelling enough by itself.
Yes, there's still ossification in middle boxes doing TCP optimization. My information may be old, but I was under the impression that nobody does that for IPv6, so the push for v6 is not only a way to avoid NAT (and especially CGNAT), but also a way to avoid optimizer boxes, a benefit for both network providers (less expense) and services (less frustration).
One thing is that congestion control choice is sort of cursed in that it assumes your box/side is being switched but the majority of the rest of the internet continues with legacy limitations (aside from DCTCP, which is designed for intra-datacenter usage), which is an essential part of the question given that resultant/emergent network behavior changes drastically depending on whether or not all sides are using the same algorithm. (Cubic is technically another sort-of-exception, at least since it became the default Linux CC algorithm, but even then you’re still dealing with all sorts of middleware with legacy and/or pathological stateful behavior you can’t control.)
This is a perspective, but just one of many. The overwhelming majority of IP flows are within data centers, not over planet-scale networks between unrelated parties.
I've never been convinced by an explanation of how QUIC applies for flows in the data center.
Ossification doesn't apply (or it shouldn't, IMHO, the point of Open Source software is that you can change it to fit your needs... if you don't like what upstream is doing, you should be running a local fork that does what you want... yeah, it's nicer if it's upstreamed, but try running a local fork of Windows or MacOS); you can make congestion control work for you when you control both sides; enterprise switches and routers aren't messing with tcp flows. If you're pushing enough traffic that this is an issue, the cost of QUIC seems way too high to justify, even if it helps with some issues.
I don't see why this exception to the end-to-end principle should exist. At the scale of single hosts today, with hundreds of CPUs and hundreds of tenants in a single system sharing a kernel, the kernel itself becomes an unwanted middlebox.
Unless you're using QUIC as some kind of datacenter-to-datacenter protocol (basically as SCTP on steroids with TLS), I don't think QUIC in the datacenter makes much sense at all.
As very few server administrators bother turning on features like MPTCP, QUIC has an advantage on mobile phones with moderate to bad reception. That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead. Any service those people are using should probably consider implementing QUIC, and if they use it, they'd benefit from an in-kernel server.
All the data center operators can stick to (MP)TCP, the telco people can stick to SCTP, but the consumer facing side of the internet would do well to keep QUIC as an option.
> That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead.
For what it's worth: Romania, one of the piss poorest countries of Europe, has a perfectly fine mobile phone network, and even outback small villages have XGPON fiber rollouts everywhere. Germany? As soon as you cross into the country from Austria, your phone signal instantly drops, barely any decent coverage outside of the cities. And forget about PON, much less GPON or even XGPON.
Germany should be considered a developing country when it comes to expectations around telecommunication.
Ossification does not come about from the decisions of "Linux maintainers". You need to look at the people who design, sell, and deploy middleboxes for that.
I disagree. There is plenty of ossification coming from inside the house. Just some examples off the top of my head are the stuck-in-1974 minimum RTO and ack delay time parameters, and the unwillingness to land microsecond timestamps.
Not a networking expert, but does TCP in IPv6 suffer the same maladies?
Yes.
Layer4 TCP is pretty much just slapped on top of Layer3 IPv4 or IPv6 in exactly the same way for both of them.
Outside of some little nitpicky things like details on how TCP MSS clamping works, it is basically the same.
…which is basically how it’s supposed to work (or how we teach that it’s supposed to work). (Not that you said anything to the contrary!)
The "middleboxes" excuse for not improving (or replacing) protocols in the past was horseshit. If a big incumbent player in the networking world releases a new feature that everyone wants (but nobody else has), everyone else (including 'middlebox' vendors) will bend over backwards to support it, because if you don't your competitors will and then you lose business. It was never a technical or logistical issue, it was an economic and supply-demand issue.
To prove it:
1. Add a new OSI Layer 4 protocol called "QUIC" and give it a new protocol number, and just for fun, change the UDP frame header semantics so it can't be confused for UDP.
2. Then release kernel updates to support the new protocol.
Nobody's going to use it, right? Because internet routers, home wireless routers, servers, shared libraries, etc would all need their TCP/IP stacks updated to support the new protocol. If we can't ship it over a weekend, it takes too long!
But wait. What if ChatGPT/Claude/Gemini/etc only supported communication over that protocol? You know what would happen: every vendor in the world would backport firmware patches overnight, bending over backwards to support it. Because they can smell the money.
The protocol itself is resistant to ossification, no matter how it is implemented.
It is mostly achieved by using encryption, and it is a reason why it is such an important and mandatory part of the protocol. The idea is to expose as little as possible of the protocol between the endpoints, the rest is encrypted, so that "middleboxes" can't look at the packet and do funny things based on their own interpretation of the protocol stack.
Endpoint can still do whatever they want, and ossification can still happen, but it helps against ossification at the infrastructure level, which is the worst. Updating the linux kernel on your server is easier than changing the proprietary hardware that makes up the network backbone.
The use of UDP instead of doing straight QUIC/IP is also an anti-ossification technique, as your app can just use UDP and a userland library regardless of the QUIC kernel implementation. In theory you could do that with raw sockets too, but that's much more problematic: because you don't have ports, you need the entire interface for yourself, and often root access.
Do you think putting QUIC in the kernel will significantly ossify QUIC? If so, how do you want to deal with the performance penalty for the actual syscalls needed? Your concern makes sense to me as the Linux kernel moves slower than userspace software and middleboxes sometimes never update their kernels.
That's so wrong, putting more and more stuff into the kernel and expanding attack surface. How long will it take before someone finds a vulnerability in QUIC handling?
The kernel should be as minimal as possible and everything that can be moved to userspace should be moved there. If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.
Use a microkernel if this is your strong opinion. Linux is a monolithic kernel and includes a whole lot in kernel space for the sake of performance and (as mentioned in the article) hardware integration. A well designed microkernel may be able to provide similar performance with better security, but until people put serious work in, it won't be competitive with Linux.
Unfortunately the OS community puts 99% of its collective energy into Linux. There is definitely pent up demand for a different architecture. China seems to be innovating here, but it's unclear if the West will get anything out of their designs.
Sadly Linux distributions use large kernel and there is no simple way to get a working desktop system with a microkernel.
> If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.
By the same logic, we should never improve performance in software and just tell everyone to buy new hardware instead. A bit ridiculous.
We should not compromise security for minor improvements in performance.
Would this (eventually) include the unreliable datagram extension?
Don't know if it could get faster than UDP if it is on top of it.
The use case for this would be running a multiplayer game server over QUIC
Or a mix of both, like datagrams for voice chat / movement and reliable packets for important data
Other use cases include video / audio streaming, VPNs over QUIC, and QUIC-over-QUIC (you never know)
The article didn’t discuss ACK. I have often wondered if it makes sense for the protocol to not have ACKs, and to leave that up to the application layer. I feel like the application layer has to ensure this anyway, so I don’t know how much benefit it is to additionally support this at a lower layer.
> QUIC is meant to be fast, but the benchmark results included with the patch series do not show the proposed in-kernel implementation living up to that. A comparison of in-kernel QUIC with in-kernel TLS shows the latter achieving nearly three times the throughput in some tests. A comparison between QUIC with encryption disabled and plain TCP is even worse, with TCP winning by more than a factor of four in some cases.
Jesus, that's bad. Does anyone know if userspace QUIC implementations are also this slow?
I think the ‘fast’ claims are just different. QUIC is meant to make things fast by:
- having a lower latency handshake
- avoiding some badly behaved ‘middleware’ boxes between users and servers
- avoiding resetting connections when user IP addresses change
- avoiding head of line blocking / the increased cost of many connections ramping up
- avoiding poor congestion control algorithms
- probably other things too
And those are all things about working better with the kind of network situations you tend to see between users (often on mobile devices) and servers. I don’t think QUIC was meant to be fast by reducing OS overhead on sending data, and one should generally expect it to be slower for a long time until operating systems become better optimised for this flow and hardware supports offloading more of the work. If you are Google then presumably you are willing to invest in specialised network cards/drivers/software for that.
Yeah I totally get that it optimizes for different things. But the trade offs seem way too severe. Does saving one round trip on the handshake mean anything at all if you're only getting one fourth of the throughput?
Are you getting one fourth of the throughput? Aren’t you going to be limited by:
- bandwidth of the network
- how fast the nic on the server is
- how fast the nic on your device is
- whether the server response fits in the amount of data that can be sent given the client’s initial receive window or whether several round trips are required to scale the window up such that the server can use the available bandwidth
It depends on the use case. If your server is able to handle 45k connections but 42k of them are stalled because of mobile users with too much packet loss, QUIC could look pretty attractive. QUIC is a solution to some of the problematic aspects of TCP that couldn't be fixed without breaking things.
The primary advantage of QUIC for things like congestion control is that companies like Google are free to innovate both sides of the protocol stack (server in prod, client in chrome) simultaneously. I believe that QUIC uses BBR for congestion control, and the major advantage that QUIC has is being able to get a bit more useful info from the client with respect to packet loss.
This could be achieved by encapsulating TCP in UDP and running a custom TCP stack in userspace on the client. That would allow protocol innovation without throwing away 3 decades of optimizations in TCP that make it 4x as efficient on the server side.
Is that true? Aren’t lots of the tcp optimisations about offloading work to the hardware, eg segmentation or tls offload? The hardware would need to know about your tcp-in-udp protocol to be able to handle that efficiently.
Most hardware is fairly generic for tunneled protocols, and tx descriptors can take things like "inner l4 header offset/len" and "outer l4 header offset/len"
Generic support for tunneled TCP is far more doable than support for a new and volatile protocol.
Maybe it’s a fourth as fast in ideal situations with a fast LAN connection. Who knows what they meant by this.
It could still be faster in real world situations where the client is a mobile device with a high latency, lossy connection.
There are claims of 2x-3x operating costs on the server side to deliver better UX for phone users.
> - avoiding some badly behaved ‘middleware’ boxes between users and servers
Surely badly behaving middleboxes won't just ignore UDP traffic? If anything, they'd get confused about udp/443 and act up, forcing clients to fall back to normal TCP.
Your average middlebox will just NAT UDP (unless it's outright blocked by security policy) and move on. It's TCP where many middleboxes think they can "help" the congestion signaling, latch more deeply into the session information, or worse. Unencrypted protocols can have further interference under either TCP or UDP beyond this note.
QUIC is basically about taking all of the information middleboxes like to fuck with in TCP, putting it under the encryption layer, and packaging it back up in a UDP packet precisely so it's either just dropped or forwarded. In practice this (i.e. QUIC either being just dropped or left alone) has actually worked quite well.
Yes. msquic is one of the best performing implementations and only achieves ~7 Gbps [1]. The benchmarks for the Linux kernel implementation only get ~3 Gbps to ~5 Gbps with encryption disabled.
To be fair, the Linux kernel TCP implementation only gets ~4.5 Gbps at normal packet sizes and still only achieves ~24 Gbps with large segmentation offload [2]. Both of which are ridiculously slow. It is straightforward to achieve ~100 Gbps/core at normal packet sizes without segmentation offload with the same features as QUIC with a properly designed protocol and implementation.
[1] https://microsoft.github.io/msquic/
[2] https://lwn.net/ml/all/cover.1751743914.git.lucien.xin@gmail...
Yes, they are. Worse, I've seen them shrink down to nothing in the face of congestion with TCP traffic. If QUIC is indeed the future protocol, it's a good thing to move it into the kernel IMO. It's just madness to provide these massive userspace impls everywhere, on a packet-switched protocol no less, and expect it to beat good old TCP. Wouldn't surprise me if we need optimizations all the way down to the NIC layer, and maybe even middleboxes. Oh and I haven't even mentioned the CPU cost of UDP.
OTOH, TCP is like a quiet guy at the gym who always wears baggy clothes but does 4 plates on the bench when nobody is looking. Don't underestimate. I wasted months to learn that lesson.
Why is QUIC being pushed, then?
From what I understand, the "killer app" initially was spotty mobile networks. TCP is interface (and IP) specific, so if you switch from WiFi to LTE the conn breaks (or worse, degrades/times out slowly). QUIC has a logical conn id that continues to work even when a peer changes the path. Thus, your YouTube ads will not buffer.
Secondarily, you have the reduced RTT, multiple streams (preventing HOL blocking), datagrams (realtime video on the same connection), and the ability to scale buffers (in userspace) to avoid BDP limits imposed by the kernel. However, I think in practice those haven't gotten as much visibility and traction, so the original reason is still the main one from what I can tell.
MPTCP provides interface mobility. It's seen widespread deployment with the iPhone, so network support today is much better than one would assume. Unlike QUIC, the changes required by applications are minimal to none. And it's backward compatible; an application can request MPTCP, but if the other end doesn't support it, everything still works.
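For what it's worth, on Linux the "minimal to none" application change really is about one line at socket creation time. A hedged sketch (assuming Linux >= 5.6; the fallback logic is just how I'd write it, not anything mandated):

```c
/*
 * Sketch of opting into MPTCP on Linux: request an MPTCP socket and
 * fall back to plain TCP if the kernel (or policy) doesn't allow it.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* defined by Linux >= 5.6 */
#endif

int open_stream_socket(void) {
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
    if (fd < 0) {
        perror("MPTCP unavailable, falling back to TCP");
        fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    }
    return fd;   /* connect()/read()/write() behave exactly as with TCP */
}
```

Everything after that point (connect, read, write, close) works exactly as it would on a normal TCP socket, which is the backward-compatibility point above.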
It has good properties compared to TCP-in-TCP (HTTP/2), especially when connected to clients without access to modern congestion control on iffy networks. HTTP/2 was perhaps adopted too broadly; the binary protocol is useful, header compression is useful (but sometimes dangerous), but multiplexing streams over a single TCP connection is bad unless you have very low loss ... it's not ideal for phones with inconsistent networking.
I know in the p2p space, peers have to send lots of small pieces of data. QUIC keeps a single delayed packet from blocking every stream.
Because it _does_ provide a number of benefits (potentially fewer initial round-trips, more dynamic routing control by using UDP instead of TCP, etc.), and it does so as a userspace software implementation compared with a hardware-accelerated option.
QUIC getting hardware acceleration should close this gap, and keep all the benefits. But a kernel (software) implementation is basically necessary before it can be properly hardware-accelerated in future hardware (is my current understanding)
To clarify, the userspace implementation is not a benefit; it's just that you can't have a brand-new protocol dropped into a trillion dollars of existing hardware overnight, so you have to do userspace first as a proof of concept.
It does save 2 round-trips during connection compared to TLS-over-TCP, if Wikipedia's diagram is accurate: https://en.wikipedia.org/wiki/QUIC#Characteristics That is a decent latency win on every single connection, and with 0-RTT you can go further, but 0-RTT is stateful and hard to deploy and I expect it will see very little use.
The problem it is trying to solve is not the overhead of the Linux kernel on a big server in a datacenter.
Google wants control.
QUIC performance requires careful use of batching. Using UDP sockets naively, i.e. sending one QUIC packet per syscall, will incur a lot of overhead - every time, the kernel has to figure out which interface to use, queue it up on a buffer, and all the rest. If one uses it like TCP, batching up lots of data and enqueuing packets in one "call" helps a ton. Similarly, the kernel WireGuard implementation can be slower than wireguard-go since it doesn't batch traffic. At the speeds offered by modern hardware, we really need to use vectored I/O to be efficient.
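As a concrete sketch of the batching point (assuming Linux; the helper name and BATCH size are just illustrative), sendmmsg(2) lets you hand the kernel a whole burst of already-built QUIC packets in one syscall instead of one sendto(2) per packet:

```c
/*
 * Illustrative helper: send up to BATCH pre-built UDP payloads to one
 * destination with a single sendmmsg(2) call.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

int send_batch(int fd, struct sockaddr *dst, socklen_t dstlen,
               char **pkts, size_t *lens, unsigned int count) {
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];

    if (count > BATCH)
        count = BATCH;

    memset(msgs, 0, sizeof msgs);
    for (unsigned int i = 0; i < count; i++) {
        iovs[i].iov_base = pkts[i];
        iovs[i].iov_len  = lens[i];
        msgs[i].msg_hdr.msg_name    = dst;
        msgs[i].msg_hdr.msg_namelen = dstlen;
        msgs[i].msg_hdr.msg_iov     = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
    }

    /* One syscall for up to BATCH packets instead of BATCH syscalls. */
    return sendmmsg(fd, msgs, count, 0);
}
```

UDP GSO (the UDP_SEGMENT socket option) goes further by letting the kernel split one large buffer into wire-size datagrams, but that's beyond this sketch.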
I would expect that a protocol such as TCP performs much better than QUIC in benchmarks. Now do a realistic benchmark over a roaming LTE connection and come back with the results.
Without seeing actual benchmark code, it's hard to tell if you should even care about that specific result.
If your goal is to pipe lots of bytes from A to B over an internal network or the public internet, there probably aren't many things, if any, that can outperform TCP. Decades were spent optimizing TCP for that. If HOL blocking isn't an issue for you, then you can keep using HTTP over TCP.
IMO being Google's proprietary crap is enough reason to stay away. It not actually being any better is an even more compelling reason.
It's not proprietary. It is an IETF standard and several of the authors are not from Google.
It’s an interesting testament to how well designed TCP is.
IMO, it's more a testament to how fast hardware designers can make things with 30 years to tune.
For the love of god, can we please move to microkernel-based operating systems already? We're adding a million lines of code to the Linux kernel every year. That's so much attack surface. We're setting ourselves up for a Kessler syndrome of sorts with every system that we add to the kernel.
Most of that code is not loaded into the kernel; it's only loaded when needed.
True, but the last time I checked (several years ago), the size of the portion of code that is not drivers or kernel modules was still 7 million lines of code, and the average system still has to load a few million more via kernel modules and drivers. That is still a phenomenally large attack surface.
The SeL4 kernel is 10k lines of code. OKL4 is 13k. QNX is ~30k.
Can I run Firefox or PostgreSQL with reasonable performance on SeL4, OKL4, or QNX?
SeL4 was not built for multiple CPU cores; it's not going to perform with modern-day "high end" hardware, and last I looked its formal proofs don't apply to multicore systems.
Reasonable performance includes GPU acceleration for both rendering and decoding media, right?
Not necessarily; shortcomings or "TODO"s are okay. I just want to know if I can run actual real-world complex applications on these microkernels, and what the trade-offs are (if any). Firefox on OpenBSD has fairly reasonable performance, but is quite a lot slower than on Linux. It's a perfectly reasonable trade-off, but you do need to be aware of it.
I've asked this question a few times over the last few years when people bring up "we must use microkernel now! They already exist!"-type posts, and thus far the response has either been crickets or vague hand-waving with microbenchmarks that bear no relation to real-world programs.
yes
You've still got a combinatorial complexity problem though, because you never know what a specific user is going to load.
Often you do know what a specific user is going to load
Naive question: is macOS or iOS a microkernel? They seem to support HTTP/3 in their network foundation libraries and I'm wondering if it's userland only or more.
macOS is a hybrid kernel, which has been becoming more microkernel-like over time, and they are aggressively pushing more and more things to userspace. I don't think it will ever be a full microkernel, but it is promising to see that happening there.
Ironic (in the Alanis Morissette sense) that Apple has strictly controlled hardware AND OS-level software... if there's anybody out there that can possibly get away with a monolithic kernel in a safe way, it would be them. But Linux, where you have to support practically infinite variations in hardware and the full bazaar of software? That's a dumpster fire waiting to happen.
I might be wrong, but microkernels also need drivers, so wouldn't the attack surface be the same?
You're not wrong, but monolithic kernel drivers run at a privilege level that's even higher than root (ring 0), while microkernels run them in userspace, so they're only about as dangerous as running a normal program.
"Just think of the power of ring-0, muhahaha! Think of the speed and simplicity of ring-0-only and identity-mapping. It can change tasks in half a microsecond because it doesn't mess with page tables or privilege levels. Inter-process communication is effortless because every task can access every other task's memory.
"It's fun having access to everything."
— Terry A. Davis
> Inter-process communication is effortless because every task can access every other task's memory.
I think this would get messy quick in an OS designed by more than one person
Redox is a microkernel-based OS written in Rust.
Brace for unauthenticated remote code execution exploits on the network stack.
I've been hearing about QUIC for ages, yet it is still an obscure tech and will likely end up like IPv6.
> yet it is still an obscure tech and will likely end up like IPv6.
Probably. According to Google, IPv6 has a measly 46% of internet traffic now [0], and is growing at about 5% per year. QUIC is 40% of Chrome traffic, and is growing at 5% every two years [1]. So yeah, their fates do look similar, which is to say both are headed for world domination in a couple of decades.
[0] https://dnsmadeeasy.com/resources/the-state-of-ipv6-adoption...
[1] https://www.cellstream.com/2025/02/14/an-update-on-quic-adop...
When you remove IoT, those numbers will look very different.
> When you remove IoT, those numbers will look very different.
To paraphrase: "when you remove all the new stuff being added, you will see all the old stuff is still using the old protocols". Sounds reasonable, but I don't believe it. These IoT devices usually have the simplest stack imaginable, with many of them implemented straight from the main loop. IPv6 isn't so bad, but QUIC, HTTP/2, and HTTP/3 are a long, long way from simple.
A major driver of IPv6 is phones, which I would not classify as IoT. Where I live they all receive an IPv6 address now. When I hotspot, they hand out a routable IPv6 address to the laptop / desktop. Modern Windows / Linux installations will use the IPv6 address in preference to the double-NAT'ed IPv4 address they also hand out. The funny thing is you don't even notice, or at least I didn't. I only twigged when I happened to be looking at a packet capture from my tethered laptop and saw all this IPv6 traffic, and wondered what the heck was going on. It could have been happening for years without me noticing. Maybe it was.
It wasn't a surprise that I didn't notice. I set up WiFi access for a conference of hundreds of computing nerds and professionals many years ago. Partly for kicks, partly as a learning exercise, I made it IPv6 only. As a backup plan I had an IPv4 network (behind a NAT sadly, which the IPv6 one wasn't) ready to go on a different SSID. To my utter disbelief there were no complaints, literally not a single one. Again, no one noticed.
QUIC is really simple for most IoT: just import the library.
If your idea of IoT is an RPi with 4 GB of RAM connected to a power supply running Linux, then yes, just import the library. But my idea of IoT is something that runs on a battery for a year, and in that case it's unlikely to have enough hardware to support QUIC (or its main user, HTTP/3). QUIC needs too much RAM (min 4 KB), too much flash (min 64 KB for the code), and too much CPU. In reality it's a squeeze even on an nRF52840.
QUIC is already out and in active use. Every major web browser supports it, and it's not like IPv6: there's no fundamental infrastructure change needed to support it, since it's built on top of UDP. The endpoints obviously have to support it, but that's the same as any other protocol built on UDP or TCP (like HTTP, SNMP, etc.).
Your browser is using it when you watch a video on YouTube (HTTP/3).
Isn't it just HTTP/3 now?