• computerbuster 19 hours ago

    Another resource on the same topic: https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...

    As I'm seeing in the comments here, the usefulness of handwritten SIMD ranges from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical" side, so I'll talk a bit about that.

    FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.

    dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.

    While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.

    I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.

    • janwas 17 minutes ago

      I'm also in the mission-critical camp, with perhaps an interesting counterpoint. If we're focusing on small details (or drowning in incidental complexity), it can be harder to see algorithmic optimizations. Or the friction of changing huge amounts of per-platform code can prevent us from escaping a local minimum.

      Example: our new matmul outperforms a well-known library for LLM inference, sometimes even if it uses AMX vs our AVX512BF16. Why? They seem to have some threading bottleneck, or maybe it's something else; hard to tell with a JIT involved.

      This would not have happened if I had to write per-platform kernels. There are only so many hours in the day. Writing a single implementation using Highway enabled exploring more of the design space, including a new kernel type and an autotuner able to pick not only block sizes, but also parallelization strategies and their parameters.

      Perhaps in a second step, one can then hand-tune some parts, but I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.

      • cornstalks 10 hours ago

        One of the fun things about dav1d is that since it’s written in assembly, they can use their own calling convention. And it can differ from method to method, so they have very few stack stores and loads compared to what a compiler will generate following normal platform calling conventions.

        • janwas 37 minutes ago

          I'm curious why there are even function calls in time-critical code, shouldn't just about everything be inlined there? And if it's not time-critical, why are we interested in the savings from a custom calling convention?

          • MortyWaves 5 hours ago

            Doesn’t this just make it harder to maintain ports to other architectures though?

            • wolf550e 2 minutes ago

              There have indeed been bugs caused by amd64 assembly code that assumed the Unix calling convention while being used in Windows builds, causing data corruption. You have to be careful.

              • epr 4 hours ago

                For what's written in assembly, lack of portability is a given. The only exceptions would presumably be high-level entry points called from C, etc. If you wanted to support multiple targets, you'd have completely separate assembly modules for each architecture at least. You'd even need to bifurcate further for each SIMD generation (within x64, for example).

                • antoinealb 5 hours ago

                  Yes, but on projects like that, ease of maintenance is a secondary priority when compared to performance or throughput.

                  • secondcoming 5 hours ago

                    SIMD instructions are already architecture dependent

                • dundarious 15 hours ago

                  What does Zig offer in the way of builtin SIMD support, beyond overloads for trivial arithmetic operations? 90% of the utility of SIMD is outside of those types of simple operations. I like Zig, but my understanding is you have to reach for CPU specific builtins for the vast majority of cases, just like in C/C++.

                  GCC and Clang support the vector_size attribute and overloaded arithmetic operators on those "vectorized" types, and a LOT more besides -- in fact, that's how intrinsics like _mm256_mul_ps are implemented: `#define _mm256_mul_ps(a,b) (__m256)((v8sf)(a) * (v8sf)(b))`. The utility of all of that is much, much greater than what's available in Zig.

                  • anonymoushn 4 hours ago

                    Zig ships LLVM's internal generic SIMD stuff, which is fairly common for newish systems languages. If you want dynamic shuffles or even moderately exotic things like maddubs or aesenc then you need to use LLVM intrinsics for specific instructions or asm.

                    • MortyWaves 5 hours ago

                      I’m also wondering what “built in” even means. Many languages have SIMD, vector, matrix, quaternion and similar types as part of the standard library, but not necessarily as their own keywords. C#/.NET and Java have SIMD by this metric.

                      • neonsunset 5 hours ago

                        Java's Panama Vectors are work in progress and are far from being competitive with .NET's implementation of SIMD abstractions, which is mostly on par with Zig, Swift and Mojo.

                        You can usually port existing SIMD algorithms from C/C++/Rust to C# with few changes retaining the same performance, and it's practically impossible to do so in Java.

                        I feel like C veterans often don't realize how unnecessarily ceremonious platform-specific SIMD code is given the progress in portable abstractions. Unless you need an exotic instruction that does not translate across architectures and/or common patterns nicely, there is little reason to have a bespoke platform-specific path.

                        • kierank 2 hours ago

                          We in FFmpeg need all the instructions, and we often need to do register allocation by hand.

                          • neonsunset an hour ago

                            Absolutely fair! FFmpeg does fall into the category of scenarios where skipping to the very last mile optimizations is reasonable. And thank you for your work on FFmpeg!

                            Most code paths out there aren't like that however and compilers are not too bad at instruction selection nowadays (you'd be right to mention that they sometimes have odd regressions, I've definitely seen that being a problem in LLVM, GCC and RyuJIT).

                          • MortyWaves 4 hours ago

                            Exactly!

                      • zbobet2012 15 hours ago

                        So on point. We do _a lot_ of hand-written SIMD on the other side (encoders) as well, for similar reasons. In addition, on the encoder side it's often necessary to "structure" the problem so you can perform things like early elimination of loops, and especially loads. Compilers simply cannot generate autovectorized code that does those kinds of things.

                      • buserror 6 hours ago

                        I used to write quite a few SIMD versions of critical functions, but now I rarely do -- one thing to try is to isolate that code and run it in the Most Excellent Compiler Explorer [0].

                        And stare at the generated code!

                        More often than not, the auto-vectorisation now generates a pretty excellent SIMD version of your function, and all you have to do is 'hint' the compiler -- for example, explicitly list alignment, or provide your own vector source/destination types. You can do a lot by 'styling' your C code while thinking about what the compiler might be able to do with it -- for example, use extra intermediary variables, and really break down all the operations you want.

                        Worst case, if the compiler REALLY isn't clever enough, this gives you a good base to adapt and tweak the generated assembly, without having to write the boilerplate bits yourself.

                        In most cases, the resulting C function will be vectorized as well as, or better than, the hand-coded one I'd write -- and in many other cases it's "close enough" not to matter that much. The other good news is that the same code will probably vectorize fine for WASM, NEON etc. without needing explicit versions.

                        [0] https://godbolt.org/

                        • kimixa 4 hours ago

                          We did something slightly similar for the very few isolated things where it makes sense (e.g. image upload/download and conversions in the GPU driver that weren't supported, or not large enough to be worth firing off a GPU job). They were initially written in C, using compiler annotations to specify things like alignment or allowed pointer aliasing so the compiler would generate the code we wanted. GCC and Clang both support vector extensions that allow somewhat portable implementations of things like scatter/gather, shuffles, or masking elements within a single register -- things that are hard to express in "plain" C in a way that is both readable for humans and reliably generates the expected code across compiler versions.

                          But due to needing to support other compilers and platforms we actually ended up importing the generated asm from those source files in the actual build.

                          • anonymoushn 4 hours ago

                            I have no idea how to get the compiler to generate wider-than-16 pshufb in the general case, for example. And for the 16-wide case, writing out the actual semantics of pshufb prevents you from getting pshufb, while writing a version with UB gets you pshufb.

                          • kierank 16 hours ago

                            I am the author of these lessons.

                            Ask me anything.

                            • 201984 an hour ago

                              How does FFmpeg generate SEH tables for assembly functions on Windows? Is this something that x86asm.inc handles, or do you guys just not worry about it?

                              • ilyagr 9 hours ago

                                As a user of an ARM Mac, I wonder: how much effort does it take to get such optimized code to work the same in all platforms? I guess you must have very thorough tests and fallback algorithms?

                                If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?

                                • rbultje 2 hours ago

                                  > If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?

                                  Not ported, but rather re-implemented. So: yes.

                                  A bit more detail: during build, on x86, the FFmpeg binary would include hand-written AVX2 (and SSSE3, and AVX512, etc.) implementations of CPU-intensive functions, and on Arm, the FFmpeg binary would include hand-written Neon implementations (and a bunch of extensions; e.g. dotprod) instead.

                                  At runtime (when you start the FFmpeg binary), FFmpeg "asks" the CPU what instruction sets it supports. Each component (decoder, encoder, etc.) - when used - will then set function pointers (for CPU-intensive tasks) which are initialized to a C version, and these are updated to the Neon or AVX2 version depending on what's included in the build and supported by this specific device.

                                  So in practice, all CPU-intensive tasks for components in use will run hand-written Neon code for you, and hand-written AVX2 for me. For people on obscure devices, it will run the regular C fallback.

                                  • saagarjha 7 hours ago

                                    While the instructions differ, every platform has some implementation of the basic operations (load, store, broadcast, etc.), perhaps with a different bit width. With those you can typically write an accelerated baseline implementation (sometimes these are autogenerated or use some sort of portable intrinsics, but usually they don't). If you want to go past that, things get more complicated and you'll have specialized algorithms for whatever is available.

                                  • cnt-dracula 16 hours ago

                                    Hi, thanks for your work!

                                    I have a question: as someone who can just about read assembly but still doesn't intuitively understand how to write or decompose ideas into assembly, do you have any suggestions for learning / improving this?

                                    As in, at what point would someone realise a thing can be sped up with assembly? If you found a function that would be really performant in assembly, how would you go about writing it? Would you start from the compiler's generated assembly, or from scratch? Does it even matter?

                                    • qingcharles 15 hours ago

                                      You're looking for the tiniest blocks of code that are run an exceptional number of times.

                                      For instance, I used to work on graphics renderers. You'd find the bit that was called the most (writing lines of pixels to the screen) and try to jiggle the order of the instructions to decrease the number of cycles used to move X bits from system RAM to graphics RAM.

                                      When I was doing it, branching (usually checking an exit condition on a loop) was the biggest performance killer. The CPU couldn't queue up instructions past the check because it didn't know whether it was going to go true or false until it got there.

                                      • booi 10 hours ago

                                        Don’t modern or even just not ancient cpus use branch prediction to work past a check knowing that the vast majority of the time the check yields the same result?

                                        • kaslai 7 hours ago

                                          All the little tricks the CPU has to speed things up -- branch prediction, out-of-order execution, parallel branch execution, etc. -- mostly cost more than simply not having to rely on them in the first place. Branch prediction in particular shouldn't be relied on too heavily, since it's actually quite a fragile optimization that can cause relatively large performance swings from seemingly meaningless changes to the code.

                                          • akoboldfrying 5 hours ago

                                            Branch prediction is great for predictable branches, which is often what you have, or a good approximation to it. I forget the exact criteria, but even quite old chips could learn, e.g., all repeating patterns of length up to 4, most repeating patterns of length up to 8 and fixed-length loop patterns (n YESes followed by 1 NO) of any length.

                                            Quite often, though, you don't have predictable branches, and then you'll pay half the misprediction cost each time on average. If you're really unlucky, you could hit inputs where the branch predictor gets it wrong more than 50% of the time.

                                        • epr 3 hours ago

                                          The best answer to your question is some variant of "write more assembly".

                                          When someone indicates to me they want to learn programming, for example, I ask them how many programs they've written. The answer is usually zero, and in fact I've never heard an answer greater than 10. No one will answer with a larger number, because that selects out the people who would even ask the question. If you write 1000 programs that solve real problems, you'll be at least okay. 10k and you'll be pretty damn good. 100k and you might be better than the guy who wrote the assembly manual.

                                          For a fun answer: this is a $20 nand2tetris-esque game that holds your hand through creating multiple CPU architectures from scratch with verification (similar to Prolog/VHDL), plus your own assembly language. I admittedly always end up writing an assembler outside the game that copies to my clipboard, but I'm pretty fussy about UX and prefer my normal tools.

                                          https://store.steampowered.com/app/1444480/Turing_Complete/

                                          • otteromkram 14 hours ago

                                            This is one heck of a question.

                                            I don't know assembly, but my advice would be to take the rote route by rewriting stuff in assembly.

                                            Just like anything else, there's no quick path to the finish line (unless you're exceptionally gifted), so putting in time is always the best action to take.

                                          • HALtheWise 14 hours ago

                                            What's your perspective on variable-width SIMD instruction sets (like ARM SVE or the RISC-V V extension)? How does developer ergonomics and code performance compare to traditional SIMD? Are we approaching a world with fewer different SIMD instruction sets to program for?

                                            • janwas 10 minutes ago

                                              Var-width SIMD can mostly be written using the exact same Highway code, we just have to be careful to avoid things like arrays of vectors and sizeof(vector).

                                              It can be more complicated to write things which are vector-length dependent, such as sorting networks or transposes, but we have always found a way so far.

                                              On the contrary, there are increasing numbers of ISAs, including the two LoongArch LSX/LASX, AVX-512 which is really really good on Zen5, and three versions of Arm SVE. RISC-V V also has lots of variants and extensions. In such a world, I would not want to have to implement per-platform implementations.

                                            • qingcharles 15 hours ago

                                              As someone who wrote x86 optimization code professionally in the 90s, do we need to do this manually still in 2025?

                                              Can we not just write tests and have some LLM try 10,000 different algorithms and profile the results?

                                              Or is an LLM unlikely to find the optimal solution even with 10,000 random seeds?

                                              Just asking. Optimizing x86 by hand isn't the easiest, because to think it through you have to fit all the registers in your mind and work through the combinations. You also need to know how long each instruction combination will take; and some of these instructions have weird edge cases that run vastly slower or faster, which is hard for a human to take into account.

                                              • Ecco 7 hours ago

                                                I guess your question could be rephrased as "couldn't we come up with better compilers?" (LLM-based or not, brute force based or not).

                                                I don't have an answer but I believe that a lot of effort has been put in making (very smart) compilers already, so if it's even possible I doubt it's easy.

                                                I also believe there are some cases where it's simply not possible for a compiler to beat handwritten assembly : indeed there is only so much info you can convey in a C program, and a developer who's aware of the whole program's behavior might be able to make extra assumptions (not written in the C code) and therefore beat a compiler. I'm sure people here would be able to come up with great practical examples of this.

                                                • kierank 2 hours ago

                                                  I have tried with Grok3 and Claude. They both seem to have an understanding of the algorithms and data patterns which is more than I expected but then just guess a solution that's often nonsensical.

                                                  • danybittel 9 hours ago
                                                    • janwas 9 minutes ago

                                                      Collaborators have actually superoptimized some of the more complicated Highway ops on RISC-V, with interesting gains, but I think the approach would struggle with largish tasks/algorithms?

                                                    • magicalhippo 12 hours ago

                                                      While using a LLM might not be the best approach, it would be interesting to know if there are some tools these days that can automate this.

                                                      Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.

                                                      One potential pitfall I can see is that there are a lot of non-local interactions in modern systems. We have large out-of-order buffers, many caching layers, complex branch predictors, an OS running other tasks at the same time, and a dozen other things.

                                                      What is optimal on paper might not be optimal in the real world.

                                                      • dist-epoch 2 hours ago

                                                        > Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.

                                                        There are optimization libraries which can find the optimum combination of parameters for an objective, like Optuna.

                                                        It would be enough to expose all the optimization knobs that LLVM has, and Optuna will find the optimum for a particular piece of code on a particular test payload.

                                                      • saagarjha 7 hours ago

                                                        You would need to be very careful about verifying the output. Having an LLM generate patterns and then running them through a SAT solver might work, but usually it's only really feasible for short sequences of code.

                                                      • christiangenco 16 hours ago

                                                        Hacker News is such a cool website.

                                                        Hi thank you for writing this!

                                                      • Daniel_Van_Zant 21 hours ago

                                                        I'm curious, from anyone who has done it: is there any "pleasure" to be had in learning or implementing assembly (like there is for LISP or RISC-V), or is it something you learn and use because you want to do something else (like learning COBOL to work with certain kinds of systems)? It has always piqued my interest, but I don't have a good reason in my day-to-day job to get into it. Wondering if it's worth committing some time to for the fun of it.

                                                        • msaltz 21 hours ago

                                                          I did the first 27 chapters of this tutorial just because I was interested in learning more and it was thoroughly enjoyable: https://mariokartwii.com/armv8/

                                                          I actually quite like coding in assembly now (though I haven’t done much more than the tutorial, just made an array library that I could call from C). I think it’s so fun because at that level there’s very little magic left - you’re really saying exactly what should happen. What you see is mostly what you get. It also helped me understand linking a lot better and other things that I understood at a high level but still felt fuzzy on some details.

                                                          Am now interested to check out this ffmpeg tutorial bc it’s x86 and not ARM :)

                                                          • crq-yml 18 hours ago

                                                            Learning at least one assembly language is very rewarding because it puts you in touch with the most primitive forms of practical programming: while there are theoretical models like Turing machines or lambda calculus that are even more simplistic, the architectures that programmers actually work with have some forgiving qualities.

                                                            It isn't a thing to be scared of - assembly is verbose, not complex. Everything you do in it needs load and store, load and store, millions of times. When you add some macros and build-time checks, or put it in the context of a Forth system(which wraps an interpreter around "run chunks of assembly", enabling interactive development and scripting) - it's not that far off from C, and it removes the magic of the compiler.

                                                            I'm an advocate for going retro with it as well; an 8-bit machine in an emulator keeps the working model small, in a well-documented zone, and adds constraints that make it valuable to think about doing more tasks in assembly, which so often is not the case once you are using a 32-bit or later architecture and you have a lot of resources to throw around. People who develop in assembly for work will have more specific preferences, but beginners mostly need an environment where the documentation and examples are good. Rosetta Code has some good assembly language examples that are worth using as a way to learn.

                                                            • btown 20 hours ago

                                                              One “fun” thing about it is that it’s higher level than you think, because the actual chip may do things with branch prediction and pipelining that you can only barely control.

                                                              I remember a university course where we competed on who could have the most performant assembly program for a specific task; everyone tried various variants of loop unrolling to eke out the best performance and guide the processor away from bad branch predictions. I may or may not have hit Ballmer Peak the night before the due date and tried a setup that most others missed, and won the competition by a hair!

                                                              There’s also the incredible joy of seeing https://github.com/chrislgarry/Apollo-11 and quipping “this is a Unix system; I know this!” Knowing how to read the language of how we made it to the moon will never fade in wonder.

                                                              Short answer: yes!

                                                              • brown 21 hours ago

                                                                Learning assembly was profound for me, not because I've used it (I haven't in 30 years of coding), but because it completed the picture - from transistors to logic gates to CPU architecture to high-level programming. That moment when you understand how it all fits together is worth the effort, even if you never write assembly professionally.

                                                                • renox 7 hours ago

                                                                  While I think that learning assembly is very useful, I think one must be careful about applying assembly-language concepts in an HLL (C/C++/Zig...).

                                                                  For example, an HLL pointer is different from an assembly pointer(1). Sure, the HLL pointer will eventually be lowered to an assembly-language pointer, but it still has different semantics.

                                                                  1: because you're relying on the compiler to use registers efficiently, HLL pointers must be restricted (aliasing rules); otherwise programs would be awfully slow as soon as you used one pointer.

                                                                • daeken 21 hours ago

                                                                  I have spent the last ~25 years deep in assembly because it's fun. It's occasionally useful, but there's so much pleasure in getting every last byte where it belongs, or working through binaries that no one has inspected in decades, or building an emulator that was previously impossible. It's one of the few areas where I still feel The Magic, in the way I did when I first started out.

                                                                  • ghhrjfkt4k 21 hours ago

                                                                    I once used it to get a 4x speedup of sqrt computations, by using SIMD. It was quite fun, and also quite self contained and manageable.

                                                                    The library sqrt handles all kinds of edge-cases which prevent the compiler from autovectorizing it.

                                                                    • AnyTimeTraveler 5 hours ago

                                                                      I learned 8086 (not x86) assembly in a university course during my bachelors degree and won a contest to create the first correct implementation that would play "Jingle Bells" on the PC-Speaker[0] attached to the custom built computer. That was very fun and I kept playing around with assembly a bit afterwards, but never got around to learning any of the extensions made in x86 assembler and beyond.

                                                                      In my masters degree, there was another course where one built their own computer PCB in Eagle, got it fabbed, and then had to make a game for the 8052 CPU on it. 8052 assembly is very fun! The processor has a few bytes of RAM where every bit is individually addressable and testable. I built Tetris on three attached persistence-of-vision LED matrices[1]. Unfortunately, the repository isn't very clean, but I used expressive variable names, so it should be readable. I did create my own calling convention for performance reasons and calculated how many CPU cycles were available for game logic between screen refreshes. Those were all very fun things to think about :)

                                                                      Reading assembly now has me look up instruction names here and there, but mostly I can understand what's going on.

                                                                      [0] https://github.com/AnyTimeTraveler/HardwareNaheProgrammierun... [1] https://github.com/AnyTimeTraveler/HardwarenaheSystementwick...

                                                                      • sigbottle 17 hours ago

                                                                        If you're working with C++ (and I'd imagine C), knowing how to debug the assembly comes up. And if you've written assembly it helps to be aware of basic patterns such as loops, variables, etc. to not get completely lost.

                                                                        Compilers have debug symbols, you can tune optimization levels, etc. so it's hopefully not too scary of a mess once you objdump it, but I've seen people both use their assembly knowledge at work and get rewarded handsomely for it.

                                                                        • jwr 16 hours ago

                                                                          Yes, it is definitely worth it. You get a much better understanding of CPU architectures. Also, most of your knowledge will be applicable to any platform.

                                                                          • anta40 11 hours ago

                                                                            I do it purely for fun: learning NES/Sega/GBA coding, hopefully being able to write simple games one day.

                                                                            When lockdown started in 2020, I thought working from home would give me more spare time, thus enrolled those classes on Udemy.

                                                                            I'm a mobile app dev (Java/Kotlin), and assembly is practically irrelevant for daily use cases.

                                                                            • saagarjha 7 hours ago

                                                                              There's still a lot of reasons to learn it to apply your skills, not just because you want to do it for fun. It's quite helpful when debugging, critical in fields like binary security or compilers, and basically the whole game if you're writing (say) SIMD algorithms.

                                                                              • mobiledev2014 15 hours ago

                                                                                Given there's a mini-genre of games that simulate using assembly to solve puzzles, the answer is clearly yes. Not sure if any of them teach a real language.

                                                                                The most popular are the Zachtronics games and Tomorrow Corp games. They’re so so good!

                                                                                • colanderman 19 hours ago

                                                                                  Depends on the ISA. ARM32 is a lot more enjoyable to work with than x86-64. In-order VLIW architectures like TileGX and Blackfin (IIRC) are fun if you like puzzles. Implementing tight loops of vectorized operations on most any ISA is similarly entertaining.

                                                                                  • nevi-me 21 hours ago

                                                                                    I'm about 60% with RISC-V, I'm enjoying learning it, and my use-case is being able to embed some assembly on ESP32 code.

                                                                                    A few years ago I embarked on learning ARM assembly, I also got far, but I found it more laborious somehow. x64 is just too much for me to want to learn.

                                                                                    • pjmlp 20 hours ago

                                                                                      It was cool back in the day, when the alternative was BASIC, also during the demoscene early days.

                                                                                      Nowadays most of that can be done with intrinsics, which were already present in some 1960's system programming languages, predating UNIX for a decade.

                                                                                      Modern Assembly is too complex, it is probably easier to target retrogaming, or virtual consoles, if the purpose is having fun.

                                                                                      • kevingadd 19 hours ago

                                                                                        Learning assembly is really valuable even if you never write any. Looking at the x64 or ARM64 assembly generated by i.e. the C or C# you write can help you understand its performance characteristics a lot better, and you can optimize based on that knowledge without having to drop down to a lower level.

                                                                                        Of course, most applications probably never need optimization to that degree, so it's still kind of a niche skill.

                                                                                        • YZF 21 hours ago

                                                                                          It's a lot of fun and you get a better understanding of what goes on under the hood for everything running on your machine.

                                                                                          • bitwize 20 hours ago

                                                                                            If you want to get the ultimate performance out of a processor, understanding assembly is paramount. Writing it by hand is less critical today than it was in the days of old 8- and 16-bit CPUs when memory was at a premium, instruction cycle counts were known constants, and sequential execution was guaranteed. But being able to read your compiler's output and understand what the optimizer does is a huge performance win.

                                                                                            • gostsamo 20 hours ago

                                                                                              I took a course in it in college. Extreme fun. Currently, python microservices don't have much need of this exact skill, but it gave me a significant confidence boost at the time that I actually knew what was going on.

                                                                                              • dinkumthinkum 17 hours ago

                                                                                                I mean, some people are interested in computers. Some people are interested in performance. Some people like to understand how the things they work with and use on a regular basis work at a very fundamental level; it's not like understanding assembly is trying to understand computing via physics, it is directly a part of the process. I think there was a time when many people found it exciting to learn, and some still do, but now there are so many non-technical programmers working in the field, making web pages, etc., that it is a minority percentage compared to earlier times.

                                                                                              • jupp0r 20 hours ago

                                                                                                I personally don't think there's much value in writing assembly (vs using intrinsics), but it's been really helpful to read it. I have often used Compiler Explorer (https://godbolt.org/) to look at the assembly generated and understand optimizations that compilers perform when optimizing for performance.

                                                                                                • frontfor 12 hours ago

                                                                                                    Your comment is directly contradicted by the article.

                                                                                                  > To make multimedia processing fast. It’s very common to get a 10x or more speed improvement from writing assembly code, which is especially important when wanting to play videos in real time without stuttering.

                                                                                                  • TinkersW 8 hours ago

                                                                                                      They said they prefer intrinsics, which the article says are only about 10% slower (citation needed); you misunderstood and made a comparison against scalar.

                                                                                                    Personally I'd say the only good reason to use assembly over intrinsics is having control over calling convention, for example the windows CC is absolute trash and wastes many SIMD registers.

                                                                                                    • edward28 5 hours ago

                                                                                                      And how often are you doing multimedia processing?

                                                                                                  • wruza 17 hours ago

                                                                                                    I don’t care about the split, just wanted to say that this guide is so good. I wish I had this back when I was interested in low-low-level.

                                                                                                    • slicktux a day ago

                                                                                                      Kudos for the K&R reference! That was the book I bought to learn C and programming in general. I had initially tried C++ as my first language but I found it too abstract to learn because I kept asking what was going on underneath the hood.

                                                                                                      • lukaslalinsky a day ago

                                                                                                        This is perfect. I used to know the x86 assembly at the time of 386, but for the more advanced processors, it was too complex. I'd definitely like to learn more about SIMD on recent CPUs, so this seems like a great resource.

                                                                                                        • imglorp a day ago

                                                                                                          Asm is 10x faster than C? That was definitely true at some point but is it still true today? Have compilers really stagnated so badly they can't come close to hand coded asm?

                                                                                                          • jsheard a day ago

                                                                                                            C with intrinsics can get very close to straight assembly performance. The FFmpeg devs are somewhat infamously against intrinsics (IIRC they don't allow them in their codebase even if the performance is as good as equivalent assembly) but even by TFAs own estimates the difference between intrinsics and assembly is on the order of 10-15%.

                                                                                                            You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that, which is often, because auto-vectorization still mostly sucks beyond trivial cases. It's not really a surprise that expert code runs circles around naive code though.

                                                                                                            • CyberDildonics 20 hours ago

                                                                                                              > You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that

                                                                                                              I can get far more than 10x over naive C just by reordering memory accesses. With SIMD it can be 7x more, but that can be done with ISPC, it doesn't need to be done with asm.

                                                                                                              • magicalhippo 12 hours ago

                                                                                                                > I can get far more than 10x over naive C

                                                                                                                However you can write better than naive C by compiling and watching the compiler output.

                                                                                                                I stopped writing assembly back around y2k as I was fairly consistently getting beaten by the compiler when I wrote compiler-friendly high-level code. Memory organization is also something you can control fairly well from the high-level code side.

                                                                                                                Sure some niches remained, but for my projects the gains were very modest compared to invested time.

                                                                                                              • UltraSane 21 hours ago

                                                                                                                "The FFmpeg devs are somewhat infamously against intrinsics (they don't allow them in their codebase even if the performance is as good as equivalent assembly)"

                                                                                                                Why?

                                                                                                                • Narishma 21 hours ago

                                                                                                                  I don't know if it's their reason but I myself avoid them because I find them harder to read than assembly language.

                                                                                                                  • schainks 21 hours ago

                                                                                                                    Did you read lesson one?

                                                                                                                    TL;DR They want to squeeze every drop of performance out of the CPU when processing media, and maintaining a mixture of intrinsics code and assembly is not worth the trade-off when doing 100% assembly offers better performance guarantees, readability, and ease of maintenance / onboarding of developers.

                                                                                                                    • astrange 16 hours ago

                                                                                                                      Intrinsics have the disadvantages of asm (non-portable) but also don't reliably have the advantages of them (compilers are pretty unpredictable about optimizing with them) and they're ugly (especially x86 with its weird Hungarian stuff).

                                                                                                                      There is just a little bit of intrinsics code in ffmpeg, which I wrote, that does memory copies.

                                                                                                                      https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/i...

                                                                                                                      It's like this because we didn't want to hide the memory accesses from the compiler, because that hurts optimization, as well as memory tools like ASan.

                                                                                                                      • janwas 3 minutes ago

                                                                                                                        Intrinsics have the huge advantage of enabling wrapper functions, which remove the ugly names and allow you to write user code only once, such that it is even portable (or at least multiplatform-dependent).

                                                                                                                        Good point about asan and other instrumentation :) hm, I'd think that is very important for codecs in particular?

                                                                                                                      • brigade 20 hours ago

                                                                                                                        Well that was more true when you had to care about the 8 registers of x86, CPUs were only like 2-4 wide, and codecs preferred to operate on 8x8 blocks and one bitdepth.

                                                                                                                        Nowadays the impact of suboptimal register allocation and addressing calculations of compilers is almost unmeasurable between having 16/32 registers available and CPUs that are 8-10 wide in the frontend but only 3-4 vector units in the backend. But the added complexity of newer codecs has strained their use of the nasm/gas macro systems to be far less readable or maintainable than intrinsics. Like, think of how unmaintainable complex C macros are and double that.

                                                                                                                        And it's not uncommon to find asm in ffmpeg or related projects written suboptimally in a way a compiler wouldn't, either because the author didn't fully read/understand CPU performance manuals or because rewriting/twisting the existing macros to fix a small suboptimality is more work than it's worth.

                                                                                                                        (yes, I have written some asm for ffmpeg in the past)

                                                                                                                      • oguz-ismail 21 hours ago

                                                                                                                        Have you seen C code with SIMD intrinsics? They are an eyesore

                                                                                                                        • jsheard 21 hours ago

                                                                                                                          You're not wrong but that's more of an issue with C than an issue with intrinsics, in higher level languages like C++ or Rust you have the option to wrap instrinsics in types which are much nicer to work with.
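
                                                                                                                      A minimal sketch of the idea (the Vec4 name and muladd helper are made up for illustration) — a thin C++ wrapper over __m128 that compiles to the same instructions but hides the _mm_*_ps spelling:

                                                                                                                      ```cpp
                                                                                                                      #include <immintrin.h>

                                                                                                                      // Thin value type over an SSE register.
                                                                                                                      struct Vec4 {
                                                                                                                          __m128 v;
                                                                                                                          Vec4(__m128 x) : v(x) {}
                                                                                                                          static Vec4 load(const float* p) { return Vec4(_mm_loadu_ps(p)); }
                                                                                                                          void store(float* p) const { _mm_storeu_ps(p, v); }
                                                                                                                      };
                                                                                                                      inline Vec4 operator+(Vec4 a, Vec4 b) { return Vec4(_mm_add_ps(a.v, b.v)); }
                                                                                                                      inline Vec4 operator*(Vec4 a, Vec4 b) { return Vec4(_mm_mul_ps(a.v, b.v)); }

                                                                                                                      // a * b + c now reads like scalar code instead of
                                                                                                                      // _mm_add_ps(_mm_mul_ps(a, b), c).
                                                                                                                      inline Vec4 muladd(Vec4 a, Vec4 b, Vec4 c) { return a * b + c; }
                                                                                                                      ```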

                                                                                                                          • oguz-ismail 21 hours ago

                                                                                                                            >C++ or Rust

                                                                                                                            Nah. I find well commented three column AT&T assembly with light use of C preprocessor macros easier and more enjoyable to read.

                                                                                                                            • Inityx 20 hours ago

                                                                                                                              Now that's what I call an unpopular opinion.

                                                                                                                              • saagarjha 7 hours ago

                                                                                                                                Among people who write assembly regularly it's not that unpopular

                                                                                                                          • t-3 18 hours ago

                                                                                                                            Not just an eyesore; they're also typed, so any widening or narrowing, or using only part of a vector register, ends up needing casts, and things can get extremely confusing and cluttered when doing anything beyond basic algebra. With asm it's a much shorter, more elegant, visually aligned waterfall of code.

                                                                                                                            • xgkickt 16 hours ago

                                                                                                                              Only if using x86-64 IME. Other architectures that don’t require as much shuffling of data are far more legible.

                                                                                                                        • lukaslalinsky a day ago

                                                                                                                          This is for heavily vectorized code, using every hack possible to fully utilize the CPU. Compilers are smart when it comes to normal code, but codecs are not really normal code. Not a ffmpeg programmer, but have some background dealing with audio.

                                                                                                                          • PaulDavisThe1st 21 hours ago

                                                                                                                            > codecs are not really normal code.

                                                                                                                            Not really a fair comment. They are entirely normal code in most senses. They differ in one important way: they are (frequently) perfect examples of where "single instruction, multiple data" completely makes sense. "Do this to every sample" is the order of the day, and that is a bit odd when compared with text processing or numerical computation.

                                                                                                                            But this is true of the majority of signal processing, not just codecs. As simple a thing as increasing the volume of an audio data stream means multiplying every sample by the same value - more or less the definition of SIMD.
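
                                                                                                              Sketched with SSE intrinsics (function name illustrative), that volume example really is just one broadcast multiply applied to four samples per instruction:

                                                                                                              ```cpp
                                                                                                              #include <immintrin.h>
                                                                                                              #include <cstddef>

                                                                                                              // Multiply every sample by the same gain -- "do this to every sample".
                                                                                                              void apply_gain(float* samples, std::size_t n, float gain) {
                                                                                                                  const __m128 g = _mm_set1_ps(gain);  // broadcast gain into all 4 lanes
                                                                                                                  std::size_t i = 0;
                                                                                                                  for (; i + 4 <= n; i += 4)
                                                                                                                      _mm_storeu_ps(samples + i, _mm_mul_ps(_mm_loadu_ps(samples + i), g));
                                                                                                                  for (; i < n; ++i)                   // scalar tail
                                                                                                                      samples[i] *= gain;
                                                                                                              }
                                                                                                              ```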

                                                                                                                            • astrange 16 hours ago

                                                                                                                              There's a difference because audio processing is often "massively parallel", or at least like 1024 samples at once, but in video codecs operations could be only 4 pixels at once and you have to stretch to find extra things to feed the SIMD operations.

                                                                                                                              • screcth an hour ago

                                                                                                                                Can you use the remaining SIMD lanes for processing independent data streams?

                                                                                                                                Think encoding or decoding non-overlapping parts of a video.

                                                                                                                            • bad_username 19 hours ago

                                                                                                                              > codecs are not really normal code.

                                                                                                                              Codecs are pretty normal code. You can get decent performance by just writing quality idiomatic C or C++, even without asm. (I implemented a commercial x.264 codec and worked on a bunch of audio codecs.)

                                                                                                                            • variadix 21 hours ago

                                                                                                                              C compilers are still pretty bad at auto vectorization. For problems where SIMD is applicable, you can reasonably expect a 2x-16x speed up over the naive scalar implementation.

                                                                                                                              • astrange 16 hours ago

                                                                                                                                Also, if you write code with intrinsics the autovectorization can make it _worse_. eg a pattern is to write a SIMD main loop and then a scalar tail, but it can autovectorize that and mess it up.

                                                                                                                              • warble a day ago

                                                                                                                                I highly doubt it's true. I can usually approach the same speed in C if I'm working with a familiar compiler. Sometimes I can do significantly better in assembly but it's rare.

                                                                                                                                I work on bare metal embedded systems though, so maybe there's some nuance when working with bigger OS libs?

                                                                                                                                • umanwizard 19 hours ago

                                                                                                                                  The difference is probably that you don’t work in an environment that supports SIMD or your code can’t benefit from it.

                                                                                                                                  • warble 12 minutes ago

                                                                                                                                      You're correct, I don't use SIMD instructions much, but I can, and with a C compiler. So I'm still not sure what the advantage of asm is.

                                                                                                                                • 1propionyl a day ago

                                                                                                                                  It's not a matter of compiler stagnation. The compiler simply isn't privy to the information the assembly author makes use of to inform their design.

                                                                                                                                  Put more simply: a C compiler can't infer from a plain C implementation that you're trying to do certain mathematics that could alternately be expressed more efficiently with SIMD intrinsics. It doesn't have access to your knowledge about the mathematics you're trying to do.

                                                                                                                                    There are also target-specific considerations. A compiler is, necessarily, a general purpose compiler. Problems like resource (e.g. register) allocation are NP-complete (equivalent to graph coloring) and very few people want their compiler to spend hours upon hours searching for the absolute most optimal (if indeed you can even know that statically...) asmgen.

                                                                                                                                  • bob1029 21 hours ago

                                                                                                                                    This gets even more complex once you start looking at dynamic compilations. Some of the JIT compilers have the ability to hot patch functions based upon runtime statistics. In very large, enterprisey applications with unknowns regarding how they will actually be used at build time, this can make a difference.

                                                                                                                                    You can go nuclear option with your static compilations and turn on all the optimizations everywhere, but this kills inner loop iteration speed. I believe there are aspects of some dynamic compiling runtimes that can make them superior to static compilations - even if we don't care how long the build takes.

                                                                                                                                    • astrange 16 hours ago

                                                                                                                                      Statistics aren't magic and it's not going to find superoptimizing cases like this by using them. I think this is only helpful when you get a lot of incoming poorly written/dynamic code needing a lot of inlining, that maybe just got generated in the first place. So basically serving ads on websites.

                                                                                                                                        In ffmpeg's case you can just always do the correct thing.

                                                                                                                                    • epolanski a day ago

                                                                                                                                        I remember a series of lectures from an Intel engineer that went into how difficult it is to write assembly code for x86. He basically stated that the number of cases where you can really write code faster than what a compiler would produce is close to zero.

                                                                                                                                        Essentially, people think they are writing low-level code, but that's not how CPUs actually execute it, so he explained how writing manual assembly kills performance pretty much always (at least on modern x86).

                                                                                                                                      • iforgotpassword 21 hours ago

                                                                                                                                        That's for random "I know asm so it must be faster".

                                                                                                                                        If you know it really well, have already optimized everything on an algorithmic level and have code that can benefit from simd, 10x is real.

                                                                                                                                        • FarmerPotato 20 hours ago

                                                                                                                                          You have to consider that modern CPUs don't execute code in-order, but speculatively, in multiple instruction pipelines.

                                                                                                                                          I've used Intel's icc compiler and profiler tools in an iterative fashion. A compiler like Intel's might be made to profile cache misses, pipeline utilization, branches, stalls, and supposedly improve in the next compilation.

                                                                                                                                          The assembly programmer has to consider those factors. Sure would be nice to have a computer check those things!

                                                                                                                                          In the old days, we only worried about cycle counts, wait states, and number of instructions.

                                                                                                                                          • saagarjha 7 hours ago

                                                                                                                                            That's assembly by people who learned it in 1990. Intel very much does want you writing assembly for their processors and in many ways the only way to push them hard is by doing so.

                                                                                                                                          • jki275 a day ago

                                                                                                                                            Probably some very niche things. I know I can't write ASM that's 10x better than C, but I wouldn't assume no one can.

                                                                                                                                            • CyberDildonics 20 hours ago

                                                                                                                                              It isn't very hard to write C that is 10x better than typical C, because most programs have too many memory allocations and terrible memory access patterns. Once you sort that out you are already more than 10x ahead; then you can turn on the juice with SIMD, parallelization, and possibly optimize for memory bandwidth as well.
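
                                                                                                                                              A minimal sketch of the kind of win described here (illustrative names, not from any real codebase): allocate once instead of per iteration, and keep the hot field contiguous (struct-of-arrays) so the cache and the autovectorizer can do their jobs.

                                                                                                                                              ```c
                                                                                                                                              #include <assert.h>
                                                                                                                                              #include <stdlib.h>

                                                                                                                                              /* Array-of-structs: the field we sum is 1 of every 8 doubles,
                                                                                                                                                 so most of each cache line fetched is wasted. */
                                                                                                                                              struct aos { double x, y, z, pad[5]; };   /* 64 bytes per record */

                                                                                                                                              double sum_aos(const struct aos *p, size_t n) {
                                                                                                                                                  double s = 0;
                                                                                                                                                  for (size_t i = 0; i < n; i++) s += p[i].x;  /* strided access */
                                                                                                                                                  return s;
                                                                                                                                              }

                                                                                                                                              /* Struct-of-arrays: the same values, contiguous in memory. */
                                                                                                                                              double sum_soa(const double *x, size_t n) {
                                                                                                                                                  double s = 0;
                                                                                                                                                  for (size_t i = 0; i < n; i++) s += x[i];    /* contiguous access */
                                                                                                                                                  return s;
                                                                                                                                              }

                                                                                                                                              int main(void) {
                                                                                                                                                  enum { N = 1000 };
                                                                                                                                                  /* One allocation up front, not N small ones inside a loop. */
                                                                                                                                                  struct aos *a = calloc(N, sizeof *a);
                                                                                                                                                  double *x = calloc(N, sizeof *x);
                                                                                                                                                  for (size_t i = 0; i < N; i++) { a[i].x = 1.0; x[i] = 1.0; }
                                                                                                                                                  assert(sum_aos(a, N) == 1000.0 && sum_soa(x, N) == 1000.0);
                                                                                                                                                  free(a); free(x);
                                                                                                                                                  return 0;
                                                                                                                                              }
                                                                                                                                              ```

                                                                                                                                              Both loops compute the same sum; the difference shows up in cache misses and in whether the compiler can vectorize the load.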

                                                                                                                                              • 1propionyl a day ago

                                                                                                                                                It depends on what you're trying to do. I would in general only expect such substantial speedups when considering writing computation kernels (for audio, video, etc).

                                                                                                                                                Compilers today are liable in most circumstances to know many more tricks than you do. Especially if you make use of hints (e.g. "this memory is almost always accessed sequentially", "this branch is almost never taken", etc) to guide it.
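
                                                                                                                                                A sketch of the GCC/Clang spellings of those hints (`unlikely` here is a hypothetical convenience macro wrapping `__builtin_expect`; `restrict` is the standard C no-aliasing promise):

                                                                                                                                                ```c
                                                                                                                                                #include <assert.h>
                                                                                                                                                #include <stddef.h>

                                                                                                                                                /* Mark a branch as almost never taken. */
                                                                                                                                                #define unlikely(x) __builtin_expect(!!(x), 0)

                                                                                                                                                /* `restrict` tells the compiler dst and src don't overlap,
                                                                                                                                                   which is often the hint that unlocks vectorization. */
                                                                                                                                                void scale(float *restrict dst, const float *restrict src,
                                                                                                                                                           size_t n, float k) {
                                                                                                                                                    for (size_t i = 0; i < n; i++) {
                                                                                                                                                        if (unlikely(src[i] != src[i]))  /* NaN check: cold path */
                                                                                                                                                            dst[i] = 0.0f;
                                                                                                                                                        else
                                                                                                                                                            dst[i] = k * src[i];
                                                                                                                                                    }
                                                                                                                                                }

                                                                                                                                                int main(void) {
                                                                                                                                                    float in[4] = {1, 2, 3, 4}, out[4];
                                                                                                                                                    scale(out, in, 4, 2.0f);
                                                                                                                                                    assert(out[0] == 2.0f && out[3] == 8.0f);
                                                                                                                                                    return 0;
                                                                                                                                                }
                                                                                                                                                ```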

                                                                                                                                                • astrange 16 hours ago

                                                                                                                                                  Mm, those hints don't matter on modern CPUs. There's no good way for the compiler to pass it down to them either. There are some things like prefetch instructions, but unless you know the exact machine you're targeting, you won't know when to use them.

                                                                                                                                                  • jki275 20 hours ago

                                                                                                                                                    Oh I definitely agree that in the vast majority of cases the compiler will probably win.

                                                                                                                                                    But I suspect there are cases where the super experts exist who can do things better.

                                                                                                                                                • ajross 17 hours ago

                                                                                                                                                  No, that claim is ridiculous. When doing the same task, quite frankly, compilers are much better than any human at optimizing general logic.

                                                                                                                                                  But when the human and compiler are not faced with the same problem...

                                                                                                                                                  Say, if your compiler doesn't support autovectorization and/or your C code isn't friendly to the idiom, then sure: a 10x difference in performance between a hand-optimized SIMD implementation and a naive scalar one fed to a C compiler is probably about right.

                                                                                                                                                • foresto 19 hours ago

                                                                                                                                                  > Note that the “q” suffix refers to the size of the pointer *(*i.e in C it represents *sizeof(*src) == 8 on 64-bit systems, and x86asm is smart enough to use 32-bit on 32-bit systems) but the underlying load is 128-bit.

                                                                                                                                                  I find that sentence confusing.

                                                                                                                                                  I assume that i.e is supposed to be i.e., but what is *(* supposed to mean? Shouldn't that be just an open parenthesis?

                                                                                                                                                  In what context would *sizeof(*src) be considered valid? As far as I know, sizeof never yields a pointer.

                                                                                                                                                  I get the impression that someone sprinkled random asterisks in that sentence, or maybe tried to mix asterisks-denoting-italics with C syntax.

                                                                                                                                                  • kevingadd 19 hours ago

                                                                                                                                                    Yes, this looks like something went wrong with the markdown itself or the conversion of the source material to markdown.

                                                                                                                                                    • sweeter 17 hours ago

                                                                                                                                                      Wouldn't it return the size of the pointer? I would guess it's exclusively used to handle architecture differences

                                                                                                                                                      • foresto 12 hours ago

                                                                                                                                                        Strictly speaking, or maybe just the way I personally think of it, sizeof doesn't return anything. It's not a function, so it doesn't return at all. (At least, not at run time.)

                                                                                                                                                        Nitpicking aside, the result of sizeof(*src) would be the size of the object at which the pointer points. The type of that result is size_t. That's what makes this code from the lesson I quoted invalid:

                                                                                                                                                        *sizeof(*src)

                                                                                                                                                        That first asterisk tries to dereference the result of sizeof as though it were a pointer, but it's a size_t: an unsigned integer type. Not a pointer.
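
                                                                                                                                                        A tiny C program (illustrative only) makes the distinction concrete: sizeof always yields a size_t, whether applied to the pointer itself or to what it points at, so there is nothing to dereference.

                                                                                                                                                        ```c
                                                                                                                                                        #include <assert.h>
                                                                                                                                                        #include <stdint.h>

                                                                                                                                                        int main(void) {
                                                                                                                                                            uint8_t *src8 = 0;    /* never dereferenced: sizeof */
                                                                                                                                                            uint64_t *src64 = 0;  /* doesn't evaluate its operand */

                                                                                                                                                            /* sizeof(*p) is the size of the pointed-to object... */
                                                                                                                                                            assert(sizeof(*src8) == 1);
                                                                                                                                                            assert(sizeof(*src64) == 8);

                                                                                                                                                            /* ...while sizeof(p) is the size of the pointer itself. */
                                                                                                                                                            assert(sizeof(src8) == sizeof(void *));
                                                                                                                                                            return 0;
                                                                                                                                                        }
                                                                                                                                                        ```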

                                                                                                                                                        • sweeter 12 hours ago

                                                                                                                                                          Yea but that first asterisk is incorrect

                                                                                                                                                          • foresto 11 hours ago

                                                                                                                                                            Is there an echo in here? ;)

                                                                                                                                                    • xuhu a day ago

                                                                                                                                                      "Assembly language of FFmpeg" leads me to think of -filter_complex. It's not for human consumption even once you know many of its gotchas (-ss and keyframes, PTS, labeling and using chain outputs, fading, mixing resolutions etc).

                                                                                                                                                      But then again no-one is adjusting timestamps manually in batch scripts, so a high-level script on top of filter_complex doesn't have much purpose.

                                                                                                                                                      • chgs 19 hours ago

                                                                                                                                                        I use filter-complex all the time, often in batch scripts

                                                                                                                                                        • pdyc a day ago

                                                                                                                                                          what do you mean by no purpose? you can adjust them programmatically in batch scripts.

                                                                                                                                                        • fracus 20 hours ago

                                                                                                                                                          I'm halfway through this tutorial and I'm really enjoying it. I haven't touched assembly since back in university decades ago. I've always had an urge to optimize processes for some reason. This scratches that itch. I was also more curious about SIMD since hearing about it on Digital Foundry.

                                                                                                                                                          • agumonkey a day ago

                                                                                                                                                              I remember kempf saying most of the recent development on codecs is in raw asm. Only logical that they can write some tutorials :)

                                                                                                                                                            • fulafel 9 hours ago

                                                                                                                                                                SIMD was introduced in the 80s but became ubiquitous when Intel got in on it in the 90s. It's interesting that (for x86) PLT is still stuck at hand-writing assembly 40 years later.

                                                                                                                                                              • thayne 21 hours ago

                                                                                                                                                                It doesn't mention the downsides of using assembly. The biggest of which is that your code is architecture specific, so for example you have to write different code for x86 and arm, and possibly even different code for x86_64. Unfortunately, for SIMD, there isn't really a great way to write portable code for it, at least in C. Rust is working on stabilizing a portable simd API, and zig has simd support, but I suspect ffmpeg would still complain they aren't quite as fast as they would like.
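
                                                                                                                                                One semi-portable middle ground in C is GCC/Clang generic vector extensions (a sketch, not what ffmpeg does): the compiler lowers the same source to SSE/AVX on x86 or NEON on ARM, and whether that matches hand-written asm is exactly what's debated here.

                                                                                                                                                ```c
                                                                                                                                                #include <assert.h>

                                                                                                                                                /* A 128-bit vector of four floats; the compiler picks the ISA. */
                                                                                                                                                typedef float f32x4 __attribute__((vector_size(16)));

                                                                                                                                                f32x4 madd(f32x4 a, f32x4 b, f32x4 c) {
                                                                                                                                                    return a * b + c;   /* element-wise across all four lanes */
                                                                                                                                                }

                                                                                                                                                int main(void) {
                                                                                                                                                    f32x4 a = {1, 2, 3, 4}, b = {2, 2, 2, 2}, c = {1, 1, 1, 1};
                                                                                                                                                    f32x4 r = madd(a, b, c);
                                                                                                                                                    assert(r[0] == 3 && r[1] == 5 && r[2] == 7 && r[3] == 9);
                                                                                                                                                    return 0;
                                                                                                                                                }
                                                                                                                                                ```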

                                                                                                                                                                One thing that confuses me is the opposition to inline asm. It seems like inline asm would be more efficient than having to make a function call to an asm function.

                                                                                                                                                                • PaulDavisThe1st 21 hours ago

                                                                                                                                                                  I can't speak for ffmpeg, but I can report on why we use non-portable assembler inside Ardour (a x-platform digital audio workstation).

                                                                                                                                                                  Ardour's own code doesn't do very much DSP (it's a policy choice), but one thing that our own code does do is metering: comparing a current sample value to every previous sample value in a given audio data stream within a given time window to decide if it is higher (or lower) than the previous max (or min).
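
                                                                                                                                                  A scalar sketch of that metering loop (not Ardour's actual code): a one-pass running min/max over a buffer of samples, which is exactly the branch-light shape a SIMD version then widens to many lanes at once.

                                                                                                                                                  ```c
                                                                                                                                                  #include <assert.h>
                                                                                                                                                  #include <stddef.h>

                                                                                                                                                  /* Track the running min and max of a stream of samples. */
                                                                                                                                                  void meter(const float *buf, size_t n,
                                                                                                                                                             float *min_out, float *max_out) {
                                                                                                                                                      float lo = buf[0], hi = buf[0];
                                                                                                                                                      for (size_t i = 1; i < n; i++) {
                                                                                                                                                          if (buf[i] < lo) lo = buf[i];
                                                                                                                                                          if (buf[i] > hi) hi = buf[i];
                                                                                                                                                      }
                                                                                                                                                      *min_out = lo;
                                                                                                                                                      *max_out = hi;
                                                                                                                                                  }

                                                                                                                                                  int main(void) {
                                                                                                                                                      float buf[] = {0.1f, -0.8f, 0.5f, 0.9f, -0.2f};
                                                                                                                                                      float lo, hi;
                                                                                                                                                      meter(buf, 5, &lo, &hi);
                                                                                                                                                      assert(lo == -0.8f && hi == 0.9f);
                                                                                                                                                      return 0;
                                                                                                                                                  }
                                                                                                                                                  ```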

                                                                                                                                                                  When someone stepped forward (hi Sampo!) to code this in hand-written SIMD assembler, we got a 30% reduction in CPU usage when using mid-sized buffers on moderate size sessions (say, 24 tracks or so).

                                                                                                                                                                  That's a worthy tradeoff, even though it means that we now have 5 different asm versions of about half-a-dozen functions. The good news is that they don't really need to be maintained. New SIMD architectures mean new implementations, not hacks to existing code.

                                                                                                                                                                  However, I should note that it is always very important to compare what compilers are capable of, and to keep comparing that. In the decade or more after our asm metering code was first written, gcc improved to the point where simply using C(++) and some compiler flags produced code that was within an instruction or two of our hand-crafted version (and may be more correct in the face of all possible conditions).

                                                                                                                                                    So ... you can get dramatic performance benefits that are worth the effort, the maintenance costs are low, and you should keep checking how your code compares with today's compiler's best optimization effort.

                                                                                                                                                                  • thayne 11 hours ago

                                                                                                                                                      I'm not at all saying that it isn't worth it for ffmpeg to use assembly, but there is a tradeoff there. Ffmpeg needs to either support only a limited number of architectures and duplicate code for all of them, have asm implementations for the most popular architectures (probably x86(_64) and arm) plus a slower, arch-independent fallback implementation in C for the rest, or have asm implementations for a large number of ISAs. I'm guessing ffmpeg does the middle option, especially since this guide focuses on x86 assembly, but ffmpeg supports many other architectures.

                                                                                                                                                                    The performance wins may very well be worth it, but it is still good to be aware of the tradeoff involved.

                                                                                                                                                                    • saagarjha 7 hours ago

                                                                                                                                                                      ffmpeg has multiple implementations for each architecture to take advantage of microarchitectural wins.

                                                                                                                                                                    • sweeter 17 hours ago

                                                                                                                                                                      Ardour is a great piece of software! Thanks for that. I love hearing experiences like these.

                                                                                                                                                                    • arkj 21 hours ago

                                                                                                                                                                      If you look at this from a top-down perspective, you’ll see downsides, but from a bottom-up view, those same differences can be an advantage. Different architectures have different capabilities, and writing assembly means you’re optimizing for performance rather than prioritizing code portability or maintenance.

                                                                                                                                                                      • adgjlsfhk1 20 hours ago

                                                                                                                                                                        The counterpoint to this is that if you can write AVX2 assembly, that will be supported on ~99% of x86 CPUs around today (Haswell was 2013), so just that one branch covers ~80% of the desktop/laptop market.
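
                                                                                                                                                          The usual shape of "just that one branch" is a one-time runtime dispatch. A sketch using the GCC/Clang x86 builtin `__builtin_cpu_supports` (both implementations here are scalar placeholders; a real AVX2 kernel would replace one of them):

                                                                                                                                                          ```c
                                                                                                                                                          #include <assert.h>
                                                                                                                                                          #include <stddef.h>

                                                                                                                                                          static float sum_scalar(const float *p, size_t n) {
                                                                                                                                                              float s = 0;
                                                                                                                                                              for (size_t i = 0; i < n; i++) s += p[i];
                                                                                                                                                              return s;
                                                                                                                                                          }

                                                                                                                                                          typedef float (*sum_fn)(const float *, size_t);

                                                                                                                                                          /* Pick an implementation once, at startup or first call. */
                                                                                                                                                          static sum_fn pick_sum(void) {
                                                                                                                                                          #if defined(__x86_64__) || defined(__i386__)
                                                                                                                                                              if (__builtin_cpu_supports("avx2"))
                                                                                                                                                                  return sum_scalar;   /* the AVX2 version would go here */
                                                                                                                                                          #endif
                                                                                                                                                              return sum_scalar;       /* portable fallback */
                                                                                                                                                          }

                                                                                                                                                          int main(void) {
                                                                                                                                                              float buf[4] = {1, 2, 3, 4};
                                                                                                                                                              assert(pick_sum()(buf, 4) == 10.0f);
                                                                                                                                                              return 0;
                                                                                                                                                          }
                                                                                                                                                          ```

                                                                                                                                                          ffmpeg does this kind of dispatch per function pointer at init time, which is one reason the asm lives in standalone functions rather than inline.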

                                                                                                                                                                        • sorenjan 19 hours ago

                                                                                                                                                                          94.67% according to Steam hardware survey, which is probably close enough.

                                                                                                                                                                          https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

                                                                                                                                                                          • Someone 19 hours ago

                                                                                                                                                                            There’s no guarantee that the fastest AVX2 assembly is equal on all CPUs, and reading https://stackoverflow.com/a/64782733, there are differences between CPUs.

                                                                                                                                                                            So, chances are you’ll need to have more than one AVX2 assembly version of your code if you want to have the fastest code.

                                                                                                                                                                            • renhanxue 9 hours ago

                                                                                                                                                                If you really care about performance, though, you'd want to be a lot more specific than this. I've seen image processing code that not only does things like avoid specific instructions on some CPU families (for example, it avoids the vpermd instruction on Zen 1/2/3 CPUs because of excessive latency), but also queries the CPU cache topology at runtime and uses buffer allocation strategies that ensure it can work on data batches that fit in cache.

                                                                                                                                                                              • withinboredom 19 hours ago

                                                                                                                                                                                hmmm... that's not exactly true. Hosts may not expose all instructions to VMs, especially certain hosts. So, yeah, I agree with you on the desktop/laptop market, but be wary if your target is servers.

                                                                                                                                                                              • astrange 16 hours ago

                                                                                                                                                                                The code would be architecture specific anyway. ffmpeg is meant to be fast, so it's split into architecture independent and dependent (DSP) parts. The first relies on compiler optimizations, second is what uses SIMD, asm etc.

                                                                                                                                                                  There is no such thing as a generic "SIMD API" it could use, because it uses all the specific hardware tools it can to be performant. Anyone who thinks this is possible is simply mistaken. You can tell because none of them have written ffmpeg.

                                                                                                                                                                                (There are some things called "array languages" or "stream processing" or "autoscalarization" that work better than SIMD - an example is ispc. But they're not a great fit here, because ffmpeg isn't massively parallel. It's just parallel enough to work.)

                                                                                                                                                                                • wffurr 21 hours ago

                                                                                                                                                                                  What about Highway? https://github.com/google/highway I suppose that's C++ not C though.

                                                                                                                                                                                  • kccqzy 19 hours ago

                                                                                                                                                                                    I've enjoyed using Highway, but it does in fact use plenty of C++ features that would make it unappealing to C projects. And if you make even just one mistake, it's easy to get several screenfuls of error messages; I accept that as a C++ developer but C developers would hate it.

                                                                                                                                                                                    • femto 19 hours ago

                                                                                                                                                                                      In a similar vein (C++) there is also, Eigen: https://eigen.tuxfamily.org

                                                                                                                                                                                    • aidenn0 18 hours ago

                                                                                                                                                                        I don't know what the state of the art is today, but historically compilers have been terribly inefficient with inline assembly because it inhibits optimizations around the asm block, so inline asm is often slower than intrinsics. For DSP code, your performance-critical code is often a large number of iterations through a hot loop, so the function-call overhead incurred by calling your assembly function is negligible.

                                                                                                                                                                                      • jsheard 16 hours ago

                                                                                                                                                                                        MSVC doesn't even support inline assembly anymore, so to be portable across the big three compilers you have to use either intrinsics or standalone assembly.

                                                                                                                                                                                      • hereonout2 21 hours ago

                                                                                                                                                                                        Possibly they could have added that warning, but at the same time this is a guide from the ffmpeg project, presumably for ffmpeg developers.

                                                                                                                                                                                        They lay it out quite clearly I think, but things like libavcodec are probably one of the few types of project where the benefits of assembly outweigh the lack of portability.

                                                                                                                                                                                        I'm not sure rust or zig's support for SIMD would be the project's first complaint either. Likely more concerned with porting a 25 year old codebase to a new language first.

                                                                                                                                                                                        • brigade 19 hours ago

                                                                                                                                                                                          Asm is only good on one architecture; inline asm further restricts that to at most two compilers. Plus most of the "documentation" for inline asm constraints is scattered across various comments in the source code of those compilers, and you generally can't safely use gas macros or directives.

                                                                                                                                                                                          • wyldfire 18 hours ago

                                                                                                                                                                                            > at most two compilers.

                                                                                                                                                                                            As far as C, C++ go - that's two out of three. So it's not as bad as it sounds to be "at most two".

                                                                                                                                                                                        • eachro a day ago

                                                                                                                                                                                            This looks great! Are there going to be exercises or a project-based component as well?

                                                                                                                                                                                          • neallindsay 21 hours ago

                                                                                                                                                                                            Things that you would expect every software developer to know today will one day become niche, low-level knowledge.

                                                                                                                                                                                            • krick 18 hours ago

                                                                                                                                                                                              Huh, I didn't even know ffmpeg still actively employs assembly in its source code.

                                                                                                                                                                                              • Charon77 15 hours ago

                                                                                                                                                                                                This is very approachable and beginner friendly. Kudos to authors.

                                                                                                                                                                                                • jancsika 20 hours ago

                                                                                                                                                                                                  What's the cost of shuttling data in and out of SIMD land?

                                                                                                                                                                                                  • umanwizard 19 hours ago

                                                                                                                                                                                                    SIMD doesn’t operate on a separate memory space or anything like that. You just load data from normal memory into the SIMD registers, just like you would have to load it into the scalar registers if you wanted to operate on it with normal instructions.

                                                                                                                                                                                                    • aidenn0 18 hours ago

                                                                                                                                                                                                      On some targets you need to overalign data for vectorization.

                                                                                                                                                                                                      • astrange 16 hours ago

                                                                                                                                                                                                        It is slow to move data from SIMD to scalar registers, or can be.

                                                                                                                                                                                                        • TinkersW 7 hours ago

                                                                                                                                                                                                          It depends. For SIMD float -> scalar float it is fast, since they share the same registers; if pulling out of lane 0 you don't even need to do anything (just a type cast). For other lanes you need a shuffle first.

                                                                                                                                                                                                          For SIMD integer to scalar integer, the value has to move into a separate register file, so there is a short penalty (3 cycles, IIRC).

                                                                                                                                                                                                      • kccqzy 18 hours ago

                                                                                                                                                                                                        It's pretty cheap. You can easily find the latency and throughput numbers on different Intel architectures. Here's an example for movdqa: https://www.intel.com/content/www/us/en/docs/intrinsics-guid... which is a basic 128-bit load. Even a 512-bit load isn't much more expensive: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

                                                                                                                                                                                                      • ej1 20 hours ago

                                                                                                                                                                                                        This is a great article!

                                                                                                                                                                                                        • mkoubaa 21 hours ago

                                                                                                                                                                                                          I'm shocked there still isn't a hardware accelerator for video decoding.

                                                                                                                                                                                                          • qingcharles 15 hours ago

                                                                                                                                                                                                            Practically every computer device manufactured in the last 15 years has some sort of accelerator/specific instructions designed primarily for optimizing the decoding of video.

                                                                                                                                                                                                            • graypegg 21 hours ago

                                                                                                                                                                                                              There is! FFmpeg supports hardware acceleration for a lot of operations. (Though format/codec dependant on the chipset you're working with, so it's not as general as you might expect. I don't know a ton about video's guts, so I assume the variance between video codec decoding is big enough to require incompatible special silicon.)

                                                                                                                                                                                                              https://trac.ffmpeg.org/wiki/HWAccelIntro

                                                                                                                                                                                                              • Narishma 21 hours ago

                                                                                                                                                                                                                What do you mean? Pretty much any SoC designed for consumer applications has some form of hardware accelerated video decoding.

                                                                                                                                                                                                                • ghhrjfkt4k 21 hours ago

                                                                                                                                                                                                                  ffmpeg does more than hardware decoding. For example scaling, cropping, changing colors, effects. All this stuff can benefit from vectorized operations (on CPU or GPU).

                                                                                                                                                                                                                • henning a day ago

                                                                                                                                                                                                                  This is what Hacker News should be about. Awesome. Thank you.

                                                                                                                                                                                                                  • sylware a day ago

                                                                                                                                                                                                                    A gigantic mistake was made in much of ffmpeg's assembly:

                                                                                                                                                                                                                    They are abusing nasm macro-preprocessor up to obscene levels...

                                                                                                                                                                                                                    • ryanianian a day ago

                                                                                                                                                                                                                      Why is it "abusing," and what would you suggest as an alternative?

                                                                                                                                                                                                                      • sylware a day ago

                                                                                                                                                                                                                        Have a look at their code, it is obvious. Often you have to figure out what the macros actually do, and I remember it was not that straightforward.

                                                                                                                                                                                                                        And the macro language is specific to nasm.

                                                                                                                                                                                                                        What to do: unroll the macros and/or add a little abstraction using a simple, common macro preprocessor, i.e. one not tied to the assembler.

                                                                                                                                                                                                                        And I am just doing exactly that: my x86_64 assembly code does assemble with fasm/nasm/gas with a little abstraction using a C preprocessor.
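                                                                                                                                                                                                                        A hypothetical sketch of that kind of shim (the names and flags here are invented for illustration, not taken from any real codebase): a small header run through the C preprocessor so the same source can target either assembler dialect.

```c
/* vecmac.h -- hypothetical portability shim, selected with -DUSE_GAS
 * or -DUSE_NASM; directive differences are confined to these macros
 * so the bulk of the assembly source stays dialect-neutral. */
#if defined(USE_GAS)
#  define GLOBL(sym)   .globl sym
#  define ALIGN16      .balign 16
#  define BYTES(...)   .byte __VA_ARGS__
#elif defined(USE_NASM)
#  define GLOBL(sym)   global sym
#  define ALIGN16      align 16
#  define BYTES(...)   db __VA_ARGS__
#endif
```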

                                                                                                                                                                                                                        • PhilipRoman 21 hours ago

                                                                                                                                                                                                                          To be fair, nasm allows you to detach the preprocessor from the assembler (-E). But I agree with you in general.

                                                                                                                                                                                                                          • pengaru 21 hours ago

                                                                                                                                                                                                                            there is nothing wrong with depending on nasm

                                                                                                                                                                                                                            • sylware 4 hours ago

                                                                                                                                                                                                              Yes, there is: with a little C preprocessor abstraction you can assemble the same code with fasm/gas/nasm, so there is no need to tie yourself to nasm.

                                                                                                                                                                                                                      • belter a day ago

                                                                                                                                                                                                                        Uhmmm... lots of praise, but these are just three small lessons covering the basics, and the exercises are not uploaded yet. Is this a work in progress, or just the beginning?

                                                                                                                                                                                                                        • toisanji 17 hours ago

                                                                                                                                                                                                                          Just wondering, would it make sense to use LLMs to translate higher level languages to assembly or to directly write in assembly?

                                                                                                                                                                                                                          • mikestew 17 hours ago

                                                                                                                                                                                                                            Are you asking if an LLM can produce better assembly than an optimizing compiler?

                                                                                                                                                                                                                            • saagarjha 7 hours ago

                                                                                                                                                                                                                              Generally, no. It is hard to do this translation in a way that is correct, much less performant.

                                                                                                                                                                                                                            • beebaween 15 hours ago

                                                                                                                                                                                                                              I'm kind of stunned we haven't gotten something better / more rust based than ffmpeg?

                                                                                                                                                                                                                              Especially curious given the advent of apple metal etc.

                                                                                                                                                                                                                              Does anyone have recommendations?

                                                                                                                                                                                                                              • mvdtnz 15 hours ago

                                                                                                                                                                                                                                "Rust based" is not a feature. End users DO NOT CARE. What's your value prop?

                                                                                                                                                                                                                                • filleduchaos 11 hours ago

                                                                                                                                                                                                                                  Gstreamer is increasingly developed in Rust, and is a far saner, better documented and more flexible framework for developers than libav/ffmpeg.

                                                                                                                                                                                                                                  The pipeline/plugin based architecture is pretty neat even as an end user, I find it a lot more discoverable.

                                                                                                                                                                                                                                  • lukaslalinsky 2 hours ago

                                                                                                                                                                                                                                    GStreamer is a high-level API that uses FFmpeg; it is not an FFmpeg replacement.

                                                                                                                                                                                                                                • adamnemecek 15 hours ago

                                                                                                                                                                                                                                  Why? It's a Herculean effort. It took 28 years between the creation of C and FFmpeg, so if there still isn't a replacement by 2038, then your complaint will be justified.

                                                                                                                                                                                                                                • imchaz 15 hours ago

                                                                                                                                                                                                                                  I'll be honest, I didn't read through much. Ffmpeg gives me severe ptsd. My first task out of college was to write a procedurally generated video using ffmpeg, conform it to DASH, and get it under 150kb/s while remaining readable. The docs were unusable, DASH was only a few months old, and Stack Overflow was devoid of help. I kid you not, the only way to get any insight was some sketchy IRC channel. (2016 btw, well past IRC's prime.)

                                                                                                                                                                                                                                  • thegrim33 14 hours ago

                                                                                                                                                                                                                                    Not trying to be too negative, but given the memories your comment brought up in me, I need to rant about ffmpeg for a minute. ffmpeg is the worst-documented major library I've ever used in my life. I integrated with it to render videos inside my 3D engine, and boy do I shiver at any thought of having to work with it again.

                                                                                                                                                                                                                                    The "documentation" is a collection of 15-20 year old source samples. The vast majority of them either won't compile anymore because the API has changed, or they use 2, 3, or 4 times deprecated functions that shouldn't be used anymore. The source examples also have almost no comments explaining anything. They have super dense, super complicated code with no comments, but then there will be a line like "setRenderSpeed(3)" or whatever and it'll have a comment: "Sets render speed to 3", the absolute most useless comment ever. The source examples are also written in as old-fashioned a style of C as you can get: incredibly dense, with almost no organization; you have to jump up and down all over the file to find the global variables being accessed. It's just gross and barely comprehensible.

                                                                                                                                                                                                                                    They put a lot of effort into producing doxygen documentation for everything, but the doxygen documentation is nearly useless, it just lists the API with effectively zero documentation or explanation on the functions or parameters. There's so little explanation of how to do anything. On the website they have sections for each library, and for most libraries you get 2-3 sentences of explanation on what the library is for, and that's it. That's the extent the entire library is documented. They just drop an undocumented massive C API split across a dozen or so libraries on you and wish you luck.

                                                                                                                                                                                                                                    The API has also gotten absolutely wrecked over the last 20 years or however long it's been around as it has evolved. Sometimes they straight up delete functions to deprecate them; sometimes they create a new version of a function as function2 and then as function3 and keep all of them around; sometimes they replace a function with a completely differently named function and keep them both around. And there's absolutely nothing written anywhere about what the "right" way to do anything is, or which functions you should actually be using. So many times I went down rabbit holes reading some obscure 15 year old mailing list post trying to find anyone who had successfully done something I was trying to do. And again, the obscure message board posts and documentation that do exist are almost all deprecated at this point and shouldn't be used.

                                                                                                                                                                                                                                    Then there's the custom build system, so if you need to build it custom to support or not support different features, you can't use any modern build system, it's all custom scripts that do weird things like hardcoded dumping build output into your home directory. Makes it difficult to integrate with a modern build system.

                                                                                                                                                                                                                                    It has so much momentum, and so many users, but man, there has to be a massive opening for someone to replace ffmpeg with a modern programming language and a modern build system, built with GPU acceleration of stuff in mind from the beginning and not tacked on top 20 years later, and not using 30 year old c-style code, and an actually documented project.

                                                                                                                                                                                                                                    • Ono-Sendai an hour ago

                                                                                                                                                                                                                                      that's hilarious, thank you. The life of a c++ programmer using dodgy libraries.

                                                                                                                                                                                                                                      • imchaz 13 hours ago

                                                                                                                                                                                                                                        Here's a few lines out of the 700-line shell script:

                                                                                                                                                                                                                                            ${FFMPEG_CMD} -y -i $SIL -i ${STEP1_Q} -i $SIL -i ${STEP1_Q} -i $SIL -i ${STEP1_Q} -i $SIL -i ${ATTN_Q} -i $SIL -i ${ALERT_Q} -i ${STEP2BOUT} -i $SIL -i ${EOM_Q} -i $SIL -i ${EOM_Q} -i $SIL -i ${EOM_Q} -i $LOW -i $LOW -filter_complex concat=n=19:v=0:a=1 ${STEP2OUT} 2>> ${LOG_FILE}
                                                                                                                                                                                                                                            ...
                                                                                                                                                                                                                                            ${FFMPEG_CMD} -y -i ${STEP3OUT} -vf drawtext="fontfile=${FONT_FILE}:textfile=${INCOMING_DIR}/${VIDEO_LANG}.txt:fontcolor=white:fontsize=36:y=h-h/3:x=w-120*t" -b:v 9000k -maxrate 9000k -minrate 9000k -bufsize 1890k -acodec copy ${STEP4OUT} 2>> ${LOG_FILE}
                                                                                                                                                                                                                                            ...
                                                                                                                                                                                                                                            ${FFMPEG_CMD} -i ${STEP2OUT} -f lavfi -i color=c=red:s=640x480:d="${TOTAL_CRAWLTIME}" -vf "subtitles=${SUBTITLES_FILE}:force_style='Alignment=10,Outline=0,Fontsize=18', subtitles=${SLIDE_NUMBERS_FILE}:force_style='Alignment=2,Outline=0', drawtext=fontfile=${FONT_FILE}:fontsize=30:fontcolor=white:y=line_h:x=(w-text_w)/2:text='MESSAGE D’FOO ou foo', drawtext=fontfile=${FONT_FILE}:fontsize=20:fontcolor=white:y=(h-80):x=(w-text_w)/2:text='English message to follow.'" -b:v 9000k -maxrate 9000k -minrate 9000k -bufsize 1890k -acodec libmp3lame ${STEP4OUT} 2>> ${LOG_FILE}
                                                                                                                                                                                                                                            ${FFMPEG_CMD} -y -i ${STEP5OUT} -vcodec libx264 -x264opts keyint=60:min-keyint=30:ref=3:bframes=0 -profile:v Main -level 3.1 -s 576:432 -g 60 -r 29.970 -crf 22 -maxrate:v 300k -bufsize 600k -acodec aac -ab 32k -ar 22050 -ac 1 -vbsf h264_mp4toannexb ${OUTPUT_DIR}/${BASE_FILE_NAME}.ts 2>> ${LOG_FILE}

                                                                                                                                                                                                                                        • imchaz 13 hours ago

                                                                                                                                                                                                                                          might as well be assembly lol

                                                                                                                                                                                                                                    • netr0ute a day ago

                                                                                                                                                                                                                                      The only thing I don't like about this is the focus on x86 assembly, which is a sinking ship because RISC-V is coming to eat its lunch, FAST.

                                                                                                                                                                                                                                      • KeplerBoy a day ago

                                                                                                                                                                                                                                        I could understand if you had written ARM, because that's an architecture with actual market share (arguably more than x86-64 at this point), but you had to choose RISC-V for the lols.

                                                                                                                                                                                                                                        • wolf550e a day ago

                                                                                                                                                                                                                                          Where are the high performance RISC-V implementations? Those that compete with AMD Zen-5 and Apple M4? Or at least AWS Graviton 4?

                                                                                                                                                                                                                                          • zozbot234 17 hours ago

                                                                                                                                                                                                                                            The Tenstorrent folks are working on that.

                                                                                                                                                                                                                                          • high_na_euv a day ago

                                                                                                                                                                                                                                            HackerNews does not reflect real world well

                                                                                                                                                                                                                                            • ksec a day ago

                                                                                                                                                                                                                                              The unwritten rule of HN:

                                                                                                                                                                                                                                              You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.

                                                                                                                                                                                                                                            • do_not_redeem a day ago

                                                                                                                                                                                                                                              How would you define "fast"?

                                                                                                                                                                                                                                              • hagbard_c a day ago

                                                                                                                                                                                                                                                In relative terms, compared with similarly priced and powered devices on the market. RISC-V does lag behind the others - ARM, x86/64 - here, at least for now.

                                                                                                                                                                                                                                                • snvzz a day ago

                                                                                                                                                                                                                                                  Not eating. Only drinking water or zero calory drinks such as black coffee.

                                                                                                                                                                                                                                                  Only while fasting can a person think clearly. When thinking clearly, RISC-V is inevitably chosen as the ISA.

                                                                                                                                                                                                                                                  Fasting will also eventually make you hungry. Thus "RISC-V is coming to eat its lunch, FAST."

                                                                                                                                                                                                                                                • astrange 16 hours ago

                                                                                                                                                                                                                                                  Doesn't RISC-V use vector stream processing instead of SIMD? That's a poor fit for ffmpeg.

          • astrange 13 hours ago

            I should say, I think it would be. I haven't actually tried it, and I know ARM has added it too, so it'd be interesting to check for sure.

          • 201984 a day ago

            Wake me up when a RISC-V processor is on par with an N50.