• tptacek 5 months ago

    Netkit, which is what this is built on, is pretty neat. For transmitting packets from one container/VM to another, the conventional solution is to give each its own veth device. When you do that, the kernel network stack, at like the broad logic level, is sort of oblivious to the fact that the devices aren't real ethernet devices and don't have to go through the ethernet motions to transact.

    Netkit replaces that logic with a simple pairing of sending and receiving eBPF programs; it's an eBPF cut-through for packet-level networking between networks that share a host kernel. It's faster, and it's simpler to reason about; the netkit.c code is pretty easy to read straight through.
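
    Roughly, the per-device programs are just tc-style eBPF operating on an skb. A minimal sketch of the shape (untested, and the exact attach point and return codes netkit expects are from memory, so treat those as assumptions):

        // Minimal netkit-style policy program: let IPv4/IPv6 through, drop everything else.
        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        SEC("tc")  /* built as a SCHED_CLS program and attached to the netkit device pair */
        int nk_policy(struct __sk_buff *skb)
        {
            void *data     = (void *)(long)skb->data;
            void *data_end = (void *)(long)skb->data_end;
            struct ethhdr *eth = data;

            if ((void *)(eth + 1) > data_end)
                return TC_ACT_SHOT;                 /* malformed frame, drop it */

            if (eth->h_proto == bpf_htons(ETH_P_IP) ||
                eth->h_proto == bpf_htons(ETH_P_IPV6))
                return TC_ACT_OK;                   /* hand the packet to the peer device */

            return TC_ACT_SHOT;
        }

        char _license[] SEC("license") = "GPL";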

    • charleslmunger 5 months ago

      >When you do that, the kernel network stack, at like the broad logic level, is sort of oblivious to the fact that the devices aren't real ethernet devices and don't have to go through the ethernet motions to transact.

      Is that true even for virtio-net? I guess I just assumed all these virtual devices worked like virtiofs and had low overhead fast paths for host and guest communication.

      • XorNot 5 months ago

        Yeah this is a surprise to me too - my impression was things like loopback and virtio devices were used explicitly because they don't pretend to ever be real devices, and thus bypass all the real device handling.

        What additional overhead is cut out by the netkit approach?

        • tptacek 5 months ago

          Are you using virtual machines? They're not.

          The big win here as I understand it is that it gives you roughly the same efficient inter-device forwarding path that XDP gives you: you can bounce from one interface to another in eBPF without converting buffers back into skbuffs and snaking them through the stack again.
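
          The XDP version of that is basically one helper call: redirect the frame to another ifindex while it's still a driver-level buffer, before an skbuff is ever built. A hedged sketch (the target ifindex is a placeholder):

              // XDP sketch: bounce every frame arriving on this interface out ifindex 4.
              #include <linux/bpf.h>
              #include <bpf/bpf_helpers.h>

              SEC("xdp")
              int xdp_bounce(struct xdp_md *ctx)
              {
                  return bpf_redirect(4 /* target ifindex, placeholder */, 0);
              }

              char _license[] SEC("license") = "GPL";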

          • XorNot 5 months ago

            But in containers we use "veth" devices, which aren't even virtio and are only ever routed locally by the Linux kernel. So my question is: if this sort of optimization is possible, what does netkit sacrifice compared to veth to do it, given the constraints are (apparently) the same?

            • tptacek 5 months ago

              I assume the thing here is that veth simply doesn't do it? We're talking about a programmable fast path that bypasses the stack to get from interface A to interface B. For an ethernet interface, that's what XDP does.

          • kapilvt 5 months ago

            This article from Isovalent introducing netkit walks through the benefits and tradeoffs:

            https://isovalent.com/blog/post/cilium-netkit-a-new-containe...

        • lsnd-95 5 months ago

          It would be nice to see a Linux implementation of something like TCP fusion (from Solaris) or SIO_LOOPBACK_FASTPATH (from Windows).

          • sirjaz 5 months ago

            Someone on HN giving kudos to Windows for once. Has hell frozen over?

            • jiveturkey 5 months ago

              Came here to say the same. I'm glad linux is finally catching up to Solaris.

            • nonameiguess 5 months ago

              I'm not looking at the kernel source itself, but is this lying or am I reading it wrong?

              > Packets transmitted on one device in the pair are immediately received on the other device. When either device is down, the link state of the pair is down.

              https://www.man7.org/linux/man-pages/man4/veth.4.html

              That sure makes it sound like veth transmissions, at least on the same link, are instantaneous and bypass the networking stack. I would imagine in a containerized environment it should be something like:

              Pod 1 tries to send a packet to Pod 2, both on the same node but in different network namespaces with different IPs. Pod 1 sends its packet to the bridge connected to the other end of its veth pair, and that should be instantaneous. Then the bridge sends it across the other veth pair into Pod 2's namespace, which is also instantaneous.

              Is the problem with processing overhead at the bridge?
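
              For reference, the conventional setup I'm describing looks roughly like this in iproute2 terms (names are placeholders):

                  # veth pair into a pod's namespace, host end enslaved to a bridge
                  ip netns add pod1
                  ip link add veth0 type veth peer name eth0 netns pod1
                  ip link add name cni0 type bridge
                  ip link set veth0 master cni0
                  ip link set veth0 up
                  ip link set cni0 up
                  ip -n pod1 link set eth0 up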

              • jigneshdarji91 5 months ago
                • preisschild 5 months ago

                  Cilium (a Kubernetes CNI) can use netkit instead of veth bridges since netkit was introduced in the kernel

                  https://isovalent.com/blog/post/cilium-netkit-a-new-containe...
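
                  If I remember right, it's opt-in via the datapath mode, something like the following with Helm (the exact value name is from memory, so double-check the Cilium docs):

                      helm upgrade cilium cilium/cilium \
                        --namespace kube-system \
                        --reuse-values \
                        --set bpf.datapathMode=netkit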

                  • ignoramous 5 months ago

                    > Netkit, which is what this is built on, is pretty neat. For transmitting packets from one container/VM to another ...

                    Sounds like virtio but intra-host?

                    • tptacek 5 months ago

                      No, virtio presents to the network stack the same way other devices do.

                    • akamaka 5 months ago

                      Thanks for the clear explanation!

                    • erulabs 5 months ago

                      I'd love to see a more complete picture of ByteDance's TikTok infra. They released "KubeAdmiral" (1), so I'm assuming they're using eBPF via a Kubernetes CNI, and I see ByteDance listed on Cilium's GitHub (2). They're also using KubeRay (3) to orchestrate huge inference tasks. It's annoying that a company I definitely do not want to work for has such incredibly interesting infrastructure!

                      1. https://github.com/kubewharf/kubeadmiral

                      2. https://github.com/cilium/cilium/blob/main/USERS.md

                      3. https://www.anyscale.com/blog/how-bytedance-scales-offline-i...

                    • nighthawk454 5 months ago

                      > eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter (BPF, with the "e" originally meaning "extended") filtering mechanism in Linux and is also used in non-networking parts of the Linux kernel as well.

                      > It is used to safely and efficiently extend the capabilities of the kernel at runtime without requiring changes to kernel source code or loading kernel modules. Safety is provided through an in-kernel verifier which performs static code analysis and rejects programs which crash, hang or otherwise interfere with the kernel negatively.

                      https://en.wikipedia.org/wiki/EBPF?useskin=vector
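
                      Concretely, "without requiring changes to kernel source code or loading kernel modules" means the workflow is just compiling restricted C to BPF bytecode and handing it to the kernel, which runs the verifier at load time. A rough sketch of the tooling (file names and pin path are placeholders):

                          clang -O2 -g -target bpf -c prog.bpf.c -o prog.bpf.o
                          bpftool prog load prog.bpf.o /sys/fs/bpf/prog    # verifier runs here; load fails if it rejects the program
                          bpftool prog show pinned /sys/fs/bpf/prog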

                      • udev4096 5 months ago

                        There's https://github.com/eunomia-bpf/bpf-developer-tutorial if anyone wants to get started with eBPF.

                        • bogantech 5 months ago

                          Semi related: is there some way to check what eBPF programs are installed on a system and explore what they're attached to / doing etc?

                          Whenever I see a problem solved with eBPF, I feel like it's also making things more opaque and difficult to troubleshoot, but I'm guessing that's just because I don't know enough about it.

                          • AlotOfReading 5 months ago

                            That's what bpftool is for. It follows the grand Linux tradition of making everything possible, but not necessarily easy.
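
                            A few starting points (subcommand spellings from memory; see bpftool(8)):

                                bpftool prog show                 # every loaded program, with id/type/name
                                bpftool link show                 # which hooks the programs are attached to
                                bpftool net show                  # per-interface attachments (XDP, tc, ...)
                                bpftool prog dump xlated id 42    # readable bytecode; 42 is a placeholder id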

                          • throw78311 5 months ago

                            I guess this is why everything is under Federation/default now; the old mess was annoying to work with.

                            • tomohawk 5 months ago

                              Pretty cool, but it basically solves a problem caused by one too many layers of abstraction.
