• Syonyk 3 hours ago

    Post got the big one: Total Store Ordering (TSO).

    The rest are all techniques in reasonably common use, but unless you have hardware support for x86's strong memory ordering, you cannot get very good x86-on-ARM performance. From inspecting existing code, it's by no means clear when strong memory ordering matters and when it doesn't, so you have to liberally sprinkle memory barriers around, and those really kill performance.
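
    To make the ordering problem concrete, here is a toy sketch (a simplified model of my own, not how Rosetta or any real translator works). It enumerates every interleaving of the classic "message passing" pattern: a writer stores data then a flag, and a reader loads the flag then the data. The names `mp_outcomes`, `fifo_buffer` and `barrier` are made up for illustration. A FIFO store buffer stands in for TSO, a buffer that drains in any order stands in for a weaker model, and the reader's loads are kept in program order to keep the model small (real ARM can reorder those too).

```python
ADDR = {"data": 0, "flag": 1}  # index of each variable in the memory tuple


def mp_outcomes(fifo_buffer, barrier=False):
    """Enumerate every (r_flag, r_data) the reader can observe.

    Writer: data = 1; [barrier]; flag = 1
    Reader: r_flag = flag; r_data = data   (loads kept in program order)
    """
    writer = [("store", "data")]
    if barrier:
        writer.append(("barrier",))
    writer.append(("store", "flag"))
    reader = ["flag", "data"]
    results, seen = set(), set()

    def explore(wpc, buf, mem, regs):
        state = (wpc, buf, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if wpc == len(writer) and not buf and len(regs) == len(reader):
            results.add(regs)
            return
        # Writer step: a store enters the buffer; a barrier waits for it to drain.
        if wpc < len(writer):
            op = writer[wpc]
            if op[0] == "store":
                explore(wpc + 1, buf + (op[1],), mem, regs)
            elif not buf:
                explore(wpc + 1, buf, mem, regs)
        # Drain step: TSO drains oldest-first; the weak model drains any entry.
        for i in range(len(buf)):
            if fifo_buffer and i > 0:
                break
            new_mem = list(mem)
            new_mem[ADDR[buf[i]]] = 1
            explore(wpc, buf[:i] + buf[i + 1:], tuple(new_mem), regs)
        # Reader step: load the next variable straight from memory.
        if len(regs) < len(reader):
            value = mem[ADDR[reader[len(regs)]]]
            explore(wpc, buf, mem, regs + (value,))

    explore(0, (), (0, 0), ())
    return results


STALE = (1, 0)  # reader saw flag == 1 but data == 0
print("TSO:           ", STALE in mp_outcomes(fifo_buffer=True))
print("weak:          ", STALE in mp_outcomes(fifo_buffer=False))
print("weak + barrier:", STALE in mp_outcomes(fifo_buffer=False, barrier=True))
```

    In this model the stale result (flag seen, data not yet visible) only shows up in the weak case without a barrier. A translator looking at two plain x86 stores can't tell whether some other thread depends on their order, which is why the barrier ends up everywhere.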

    The huge and fast L1I/L1D cache doesn't hurt things either... emulation tends to be cache-intensive.

    • jsheard 3 hours ago

      It's surprising that (AFAIK) Qualcomm didn't implement TSO in the chips they made for the recent-ish Windows ARM machines. If anything they need fast x86 emulation even more than Apple does, since Windows has a much longer tail of software support than macOS. There are going to be important Windows apps that stubbornly refuse to support native ARM basically forever.

      • mdasen 2 hours ago

        It's definitely surprising that Qualcomm didn't. Not only does Windows have a longer tail of software to support, but given that the vast majority of Windows machines will continue to be x86-64, there's little incentive to do work to support ARM.

        With the Mac, Apple told everyone "we're moving to ARM and that's final." With Windows, Microsoft is saying, "these ARM chips could be cool, what do you think?" On the Mac, you either got on board or were left behind. Users knew that the future was ARM and bought machines even if there might be some short-term growing pains. Developers knew that the future was ARM and worked hard to support it.

        But with Windows, there isn't a huge incentive for users to switch to ARM and there isn't an incentive for developers to encourage it. You can say there's some incentive if the ARM chips are better. While Qualcomm's chips are good, the benchmarks aren't really ahead of Intel/AMD and they aren't the power-sipping processors that Apple is putting out.

        If Apple hadn't implemented TSO, Mac users/developers would still switch because Apple told them to. Qualcomm has to convince users that their chips are worth the short-term pain - and that users shouldn't wait a few years to make the switch when the ecosystem is more mature. That's a much bigger hill to climb.

        Still, for Qualcomm, they might not even care about losing a little money for 5-10 years if it means they become one of the largest desktop processor vendors for the following 20+ years. As long as they can keep Microsoft's interest in ARM as a platform, they can bide their time.

        • bee_rider 24 minutes ago

          I wonder if it’s possible that Qualcomm doesn’t super care about the long tail of software? Like maybe MS has some stats indicating that a very large percentage of software that they think will be used on these devices is first party, or stuff that reasonably should be expected to be compiled for ARM.

          How does the Windows App Store work anyway? Can they guarantee that all the stuff there gets compiled for ARM?

          Anyway, it is Windows, not macOS. The users expect some rough edges and poor craftsmanship, right?

        • p_l 2 hours ago

          Qualcomm has been phoning it in in various forms for over a decade, including forcing MS to ship machines that do not really pass Windows requirements (like broken firmware support). Maybe it got fixed with the recent Snapdragon X, but I won't hold my breath.

          We're talking about a company that, if certain personal sources are to be believed, started the Snapdragon brand by deciding to cheap out on memory bandwidth despite feedback that increasing it was critical, leaving the client to find out too late in the integration stage.

          Deciding that they make better money by not spending on implementing TSO, or not spending transistors on bigger caches, and getting more volume at lower cost, is perfectly normal.

          • deaddodo 2 hours ago

            Microsoft's AoT+JiT techniques still pull off impressive performance (90+% in almost every case, 96-99% in the majority).

            But yes, if they were actually serious about Windows on ARM, they would have implemented TSO in their "custom" Qualcomm SQ1/SQ2 chips.

            • wtallis an hour ago

              Last time I checked, the default behavior for Microsoft's translation was to pretend that the hardware is doing TSO, and hope it works out. So that should obviously be fast, but occasionally wrong.

              • saagarjha an hour ago

                They're a decent bit smarter than that but yes their emulation is not quite correct.

            • dundarious an hour ago

              On a first-order analysis, Qualcomm doesn't want good x64 support, because good x64 support furthers the lifetime of x64 and delays the "transition" to ARM. In the final analysis, I doubt that is an economically rational strategy, because even if there is to be a transition away from x64, you need a good legacy and migration story. And I doubt such a transition will happen in the next 10 years, and certainly not spurred by anything in Microsoft land.

              So maybe it's rational after all, because they know these Windows ARM products will never succeed, so they're just saving themselves the cost/effort of good support.

              • scottlamb 3 hours ago

                Does Windows's translation take advantage of those where they exist? E.g. if I launch an aarch64 Windows VM on my M2, does it use the M2's support for TSO when running x86_64 .exes or does it insert these memory barriers?

                If not, it makes sense that Qualcomm didn't bother adding them.

                • saagarjha an hour ago

                  No, because Windows is not aware of how Apple does it. There exist Linux patches documenting how to do so, though.

                  • zeusk 2 hours ago

                    The OS can use what the hardware supports. macOS does because SEG is a tightly integrated group at Apple, whereas Microsoft treats hardware vendors at arm's length (pun unintended). There is roadmap sharing and there are planning events through leadership, but it is not as cohesive as it is at Apple.

                    • Syonyk 2 hours ago

                      I would expect it to not use TSO, because the toggle for it isn't, to the best of my knowledge, a general userspace toggle. It's something the kernel has to toggle, and so a VM may or may not (probably does not) even have access to the SCRs (system control registers) to change it.

                      • zeusk 2 hours ago

                        TSO toggle on Apple Silicon is a user-space accessible/writable register.

                        It is used when you install Rosetta 2 for Linux VMs:

                        https://developer.apple.com/documentation/virtualization/run...

                        • Syonyk 2 hours ago

                          Are you sure it's userspace accessible?

                          Based on https://github.com/saagarjha/TSOEnabler/blob/master/TSOEnabl..., it's a field in ACTLR_EL1, which is explicitly (per the ARMv8 spec, at least...) not accessible to userspace (EL0) execution.

                          There may be some kernel interface to allow userspace to toggle that, but that's not the same as being a userspace-accessible SCR (and I also wouldn't expect it to be passed through to a VM - you'd likely need a hypercall to toggle it, unless the hypervisor emulated that), though admittedly I'm not quite as deep in the weeds on ARMv8 virtualization as I would prefer at the moment.

                          • zeusk 2 hours ago

                            Hmm, you’re right - maybe my memory serves me incorrectly, but yeah, it seems the register itself is privileged access while the interface is open to all processes to toggle the bit.

                          • shadowfacts 2 hours ago

                            It is not directly accessible from user-space. Making it so requires kernel support. Apple published a set of patches for doing this on Linux: https://developer.apple.com/documentation/virtualization/acc...

                            Without that kernel support, all processes in the VM (not just Rosetta-translated ones) are opted-in to TSO:

                            > Without selective enablement, the system opts all processes into this memory mode [TSO], which degrades performance for native ARM processes that don’t need it.

                            • mrpippy an hour ago

                              Before Sequoia, a Linux VM using Rosetta would have TSO enabled all the time.

                              With Sequoia, TSO is not enabled for Linux VMs, and that kernel patch (posted in the last few weeks) is required for Rosetta to be able to enable TSO for itself. If the kernel patch isn't present, Rosetta has a non-TSO fallback mode.

                          • saagarjha an hour ago

                            This is exposed to guest kernels of Sequoia (and maybe earlier?).

                        • saagarjha an hour ago

                          You can use RCpc atomics, which are part of the standard architecture.

                          • Syonyk 2 hours ago

                            My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance. Think your classic Visual Basic 6 sort of thing that a business relies on for decades.

                            I'm also fairly certain that the TSO changes to the memory system are non-trivial, and it's possible that Qualcomm doesn't see it as a value-add in their chips - and they're probably right. Windows machines are such a hot mess that outside a relatively small group of users (who probably run Linux anyway, so aren't anyone's target market), nobody would know or care what TSO is. If it adds cost and power and doesn't matter, why bother?

                            • jsheard 2 hours ago

                              > My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance.

                              Games are a pretty notable exception that demand high performance and for the most part will be stuck on x86 forever. Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.

                              • doctorpangloss an hour ago

                                > Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.

                                Unity supports Windows ARM. Unreal: probably never. IMO, the PC gaming market is so fragmented that, short of Microsoft funding games for the platform at the multi-million pre-sales scale that EGS did, games on ARM will only happen by complete accident, not because it makes sense.

                              • tiagod an hour ago

                                > My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance. Think your classic Visual Basic 6 sort of thing that a business relies on for decades.

                                In my experience, there's a lot of that kind of software around that was initially designed for a much simpler use-case and has decades of badly coded features bolted on, with questionable algorithmic choices. It can be unreasonably slow on modern hardware.

                                Old government database sites are the worst examples in my experience. Clearly tested with a few hundred records, but 15 years later there are a few million and nobody bothered to create a bunch of indexes, so searches take a couple of minutes. I guess this way they can just charge to upgrade the hardware once in a while instead.

                            • saagarjha an hour ago

                              TSO is nice to have but it's definitely not necessary. Rosetta doesn't even require TSO on Linux anymore by default. It performs fine for typical applications.

                              • wbl an hour ago

                                Barrier injection isn't the issue as much as the barriers becoming expensive. There's no reason a CPU can't use the techniques of TSO support to support lesser barriers just as cheaply.

                                • vlovich123 an hour ago

                                  Is TSO something other than doing atomics with seq_cst?
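
                                  As far as I understand it, one way to see the difference is the "store buffering" litmus test: each thread stores to its own variable and then loads the other one. With seq_cst atomics both loads can never return 0, but TSO lets a store sit in the core's store buffer while a later load runs, so that result is allowed. Below is a toy enumeration under that assumption. `sb_outcomes` and `store_buffers` are names made up for this sketch, and it models one buffered store per thread, not any real CPU.

```python
def sb_outcomes(store_buffers):
    """Enumerate every (r0, r1) for the store-buffering litmus test.

    Thread 0: x = 1; r0 = y
    Thread 1: y = 1; r1 = x
    """
    results, seen = set(), set()

    def put(tup, i, value):
        # Return a copy of tup with element i replaced by value.
        return tup[:i] + (value,) + tup[i + 1:]

    def explore(pcs, pending, mem, regs):
        state = (pcs, pending, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if pcs == (2, 2) and pending == (False, False):
            results.add(regs)
            return
        for t in (0, 1):
            if pcs[t] == 0:
                # Thread t stores 1 to its own variable (memory index t).
                if store_buffers:
                    explore(put(pcs, t, 1), put(pending, t, True), mem, regs)
                else:
                    explore(put(pcs, t, 1), pending, put(mem, t, 1), regs)
            elif pcs[t] == 1:
                # Thread t loads the other thread's variable from memory.
                explore(put(pcs, t, 2), pending, mem, put(regs, t, mem[1 - t]))
            if pending[t]:
                # The buffered store becomes visible to the other thread.
                explore(pcs, put(pending, t, False), put(mem, t, 1), regs)

    explore((0, 0), (False, False), (0, 0), (None, None))
    return results


print("no store buffers (SC-like):", sorted(sb_outcomes(False)))
print("store buffers (TSO-like):  ", sorted(sb_outcomes(True)))
```

                                  In this model only the buffered version produces (0, 0), so TSO looks like something weaker than seq_cst everywhere: the only reordering it permits is a store being delayed past a later load.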

                                  • ant6n 2 hours ago

                                    Perhaps you could keep each process on one core. But that would kill multi-threaded performance.

                                  • NL807 4 minutes ago

                                    Good article.

                                    • leshokunin 3 hours ago

                                      Super interesting. Putting my PM hat on, I wonder: how many x86 apps on Apple still benefit from this much performance? What's the coverage? The switch to M1 happened 4 years ago, so the software was designed for hardware nearly half a decade old.

                                      Excellent engineering and nice that it was built properly. Is this something that Linux / Wine / the Steam compatibility layer already benefit from?

                                      • spockz 3 hours ago

                                        I think it is less of a numbers game and more of a guarantee thing. As a user of a new Apple silicon machine you do not have to worry about running x86 software. (Aside from maybe specific audio software and such that are a pain to run on any other hardware and software combination.)

                                        As such it may very well be a loss leader and that is fine. Probably most development has been done and there is little maintenance needed.

                                        Also, while most native macOS apps that I encounter have an Apple silicon version now, I still find docker images for amd64 without an arm64 version present. Rosetta2 also helps with these applications.

                                        • Syonyk 3 hours ago

                                          "Apple M-series chips emulating x86," in certain benchmarks and behaviors, was right up there with the fastest x86 chips at the time - I'd guess largely in stuff that benefited from the huge L1I/L1D cache (compared to x86).

                                          I had an M1 Mini for a while, and it played Kerbal Space Program (x86) far better than my previous Intel Mini, which had Intel Integrated Graphics that could barely manage a 4k monitor, much less actual gaming.

                                          I believe there's a way to use Rosetta with Linux VMs, too (to translate x86 VM applications to ARM and run them natively) - but I no longer have any Macs, so I've not had a chance to play with it.

                                          • aaomidi 3 hours ago

                                            Games. So many games.

                                            Also, x86 containers.

                                            • jsheard 3 hours ago

                                              Then again games didn't stop Apple from dropping x86-32 support, which nuked half of the Mac Steam library. It wouldn't be out of character for them to drop x86-64 support and nuke the rest which haven't been updated to native ARM.

                                              • astrange 2 hours ago

                                                Developers had something like 15 years of warning before x86-32 was dropped, which was enough for everyone except Carbon apps and games.

                                                Btw, Rosetta 2 actually supports x86-32. Which means you can run 32-bit Windows binaries through WINE, just not Mac 32-bit binaries.

                                                • p_l 2 hours ago

                                                  For games on Intel Macs they had the fallback of Boot Camp, so combined with not really caring about games outside of random bursts like support for Unity, they were fine telling people to run Windows. (Ironically, the only Mac I owned ran faster under Windows than under macOS...)

                                                  • saagarjha an hour ago

                                                    Rosetta supports it at least. You can run Linux 32-bit games!

                                                    • darknavi 2 hours ago

                                                      Or OpenGL support

                                                      • rdsnsca 36 minutes ago

                                                        OpenGL was deprecated, not removed from macOS.

                                                • dhosek 2 hours ago

                                                  I wonder if these lessons might be applied to Wasm runtimes where the Wasm could be JIT compiled into native code. Of course this does raise the possibility of security concerns if the Wasm compilation has some bug, and then of course there’s also the question of whether Wasm’s requirements might mean native compilation doesn’t give much of a performance boost (as seems to be the case with e.g., Java byte code).

                                                  • kccqzy 2 hours ago

                                                    One other thing that is not mentioned is that Apple has an extension to compute rarely used x86 flags such as the parity flag in hardware rather than in software.

                                                    • benchess 2 hours ago

                                                      It’s mentioned

                                                      • kccqzy 2 hours ago

                                                        Ah I see it now. Sorry for the noise.

                                                    • brycewray 3 hours ago

                                                      (2022)