• burntsushi 2 days ago

    Good write-up. This is a very popular approach to substring search! It is still worst case `O(m*n)` though. Do you have a fallback implementation like the `memchr` crate has to guarantee `O(m+n)`?

    I'll also push back on some bits in the end:

        > But if it’s so much better, then why haven’t I made a pull request to
        > change std.mem.indexOf to use SIMD? Well, the reason is that
        >
        > std.mem.indexOf is generic over element size, and having a size
        > larger than u8 makes the algorithm much slower
        >
        > The algorithm used in std.mem.indexOf is cross-platform, while the
        > SIMD code wouldn’t be. (not all platforms have SIMD registers at all,
        > Arm has only 128-bit)
    
    Does Zig not have a way to specialize this for sequences of unsigned 8-bit integers? If not, and you're therefore forced to use a more generic algorithm, that seems pretty unfortunate.

        > Substring searching is rarely the bottleneck in programs,
        > especially ones written in a fast language like Zig. That’s why
        > I don’t personally think it would be worth it to add it to the
        > standard library.
    
    Oh I'm not sure I buy this at all! Substring search is a primitive operation and can easily be a bottleneck. There's a reason why widely used substring search implementations tend to be highly optimized.
    • aarol 2 days ago

      I'm the author of the post, thanks for your feedback! I was inspired by your comment on HN a while back and started learning about this stuff; reading the source code of `memchr` was especially great.

      You're totally right about the first part: if this were seriously considered for Zig's standard library, there would definitely need to be a fallback to avoid the `O(m*n)` situation.

      I'll admit that there are a lot of false assumptions at the end: you could totally specialize it for u8, and also get the block size according to CPU features at compile time with `std.simd.suggestVectorSize()`.
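
      Roughly like this sketch (using `std.simd.suggestVectorSize()` as named above; newer Zig releases call it `suggestVectorLength`):

        const std = @import("std");

        // Widest supported byte-vector length for the compile target,
        // with a fallback for targets that report no SIMD support.
        const vec_len = std.simd.suggestVectorSize(u8) orelse 16;
        const Block = @Vector(vec_len, u8);

        fn splatByte(b: u8) Block {
            // One vector compare against this splat tests vec_len
            // haystack bytes at a time.
            return @splat(b);
        }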

      • hansvm 2 days ago

        Or at runtime, if you'd like. You can create a generic binary that runs faster on supported platforms.

        • vlovich123 2 days ago

          > Or at runtime, if you'd like

          You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

          > also get the block size according to CPU features at compile time with `std.simd.suggestVectorSize()`

          You have to be careful with this, since `std.simd.suggestVectorSize` will (I believe) return values for the minimum SIMD version you're targeting, which can be suboptimal for portable binaries.

          You probably want a mix: carefully compute the vector size for the current platform once, globally, have multiple compiled dispatch paths in your binary that you pick based on that value, and let the CPU's branch predictor hide the cost of a check before each invocation.
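
          A minimal sketch of that shape (the names and the CPUID helper are hypothetical; the feature query is stubbed out since it would be a small asm routine or OS-specific call in practice):

            const std = @import("std");

            const IndexOfFn = *const fn ([]const u8, []const u8) ?usize;

            // Resolved once on first use; afterwards every call is a
            // single indirect jump the branch predictor handles cheaply.
            var indexOf: IndexOfFn = &firstCall;

            fn firstCall(haystack: []const u8, needle: []const u8) ?usize {
                // Not thread-safe as written; real code would use
                // std.once or an atomic store.
                indexOf = if (cpuHasAvx2()) &indexOfAvx2 else &indexOfScalar;
                return indexOf(haystack, needle);
            }

            fn cpuHasAvx2() bool {
                // Hypothetical stand-in for a real runtime CPUID query,
                // done once, outside any hot loop.
                return false;
            }

            fn indexOfAvx2(h: []const u8, n: []const u8) ?usize {
                return std.mem.indexOf(u8, h, n); // 32-byte SIMD path would go here
            }

            fn indexOfScalar(h: []const u8, n: []const u8) ?usize {
                return std.mem.indexOf(u8, h, n);
            }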

          • stouset 2 days ago

            > You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

            That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check whether the vectorized version is likely to beat the scalar loop (e.g., based on the number of iterations), and postambles that handle iteration counts that aren't cleanly divisible by the number of elements in the chosen vector size.

            Why would checking for supported SIMD instructions cause that much additional work?

            Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.

            • vlovich123 2 days ago

              > Why would checking for supported SIMD instructions cause that much additional work?

              Because CPUID checks on x86 are expensive for whatever reason.

              > That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.

              Compilers can't elide those checks unless they are given specific flags that tell them the target CPU supports that specific instruction set OR they always just choose to target the minimum supported SIMD instruction set on the target CPU. They often emit suboptimal code for all sorts of reasons, this being one of them.

              > Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.

              Yes, but like I said, you have to do it very carefully to make sure you're calling CPUID once outside of a hot loop to initialize your decision making and then relying on the CPU's predictor to elide the cost of a boolean / switch statement in your code doing the actual dispatch.

      • ozgrakkurt 2 days ago

        It is very easy to specialise a function in Zig: you just put `if (T == u8)` or something like that inside the function and do whatever in there.
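
        For instance (a minimal sketch; `indexOfByteSimd` is a hypothetical stand-in for the vectorized path):

          const std = @import("std");

          pub fn genericIndexOf(comptime T: type, haystack: []const T, needle: []const T) ?usize {
              // T is comptime-known, so only the taken branch is
              // analyzed and compiled for each instantiation.
              if (T == u8) return indexOfByteSimd(haystack, needle);
              return std.mem.indexOf(T, haystack, needle);
          }

          fn indexOfByteSimd(haystack: []const u8, needle: []const u8) ?usize {
              // Placeholder; the real version would use @Vector blocks.
              return std.mem.indexOf(u8, haystack, needle);
          }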

        • unwind 2 days ago

          Can the compiler detect that and use the proper code so no test is needed at runtime?

          This is Zig so I guess the answer is "yeah, duh" but wanted to ask since it sounded like the solution is less, uh, "compiler-friendly" than I would expect.

          • hansvm 2 days ago

            Yes, and if you're paranoid you can write

              if (comptime T == u8) {
                // code
              }
            
            to guarantee that if you're wrong about how the compiler behaves then you'll get a compiler error.
            • dpatterbee 2 days ago

              Zig has no type info at runtime so this is and always will be guaranteed to be a comptime check.

              • Zambyte a day ago

                Hence the paranoia :-D

                But I do think adding explicit `comptime` in places like this is reasonable for the sake of conveying intent to other programmers.

      • ashvardanian 2 days ago

        I like that more people are getting involved with SIMD, and there have been several posts lately on both memmem-like and memcpy-like operations implemented in SIMD in different programming languages.

        In most cases, though, these still focus on AVX/NEON instructions from over 10 years ago, rather than newer and more powerful AVX-512 variations, SVE & SVE2, or RVV.

        These newer ISAs can noticeably change how one would implement a state-of-the-art substring search or copy/move operation. In my projects, such as StringZilla, I often use mask K registers (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...) and an input-dependent mix of temporal and non-temporal loads and stores (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...).

        In typical cases, the difference between the suggested SIMD kernels and the state-of-the-art can be as significant as 50% in throughput. As SIMD becomes more widespread, it would be beneficial to focus more on delivering software and bundling binaries, rather than just the kernels.

        • ack_complete 2 days ago

          Sure, but I have to support a range of target CPUs in the consumer desktop market, and the older CPUs are the ones that need optimizations the most. That means NEON on ARM64 and AVX2 or SSE2-4 on x64. Time spent on higher vector instruction sets benefits a smaller fraction of the user base that already has better performance, and that's especially problematic if the algorithm has to be reworked to take best advantage of the higher extensions.

          AVX-512 is also in bad shape market-wise, despite its amazing feature set and how long it's been since initial release. The Steam Hardware Survey, which skews toward the higher end of the market, only shows 18% of the user base having AVX-512 support. And even that is despite Intel's best efforts to reverse progress by shipping all new consumer CPUs with AVX-512 support disabled.

          • moregrist 2 days ago

            I’m not as familiar with the NEON side, but AVX512 support is pretty variable on new processors. Alder Lake omits it entirely. So we’re still in a world where AVX2 is the lowest common denominator for a system library that wants wide support.

            • debugnik 2 days ago

              Even that is too high of a requirement if your target user runs low end hardware. Most Intel chips launched between 2017 and 2021 under the Pentium Silver/Gold and Celeron brands lack AVX (the first one, let alone AVX2).

              • teo_zero 2 days ago

                How strange! I was about to add a comment that I would probably stick to SSE2 or something like that to be sure my code suits as large an audience as possible, including CPUs from more than 10 years ago, ARM, etc.

                Case in point: I've been very disappointed lately when I wanted to try Ghostty on my laptop and the binary compiled for Debian failed to run due to an invalid instruction. I don't want to force the same experience on others.

                • burntsushi 2 days ago

                  This is sort of a category error. I don't know what ghostty is doing (or perhaps its distributor), but lots of widely used software (including my own ripgrep, but also even things like glibc) will query what the CPU supports at runtime and dispatch to the correct routine based on that. So things like, "I'm only going to stick to SSE2 for maximal compatibility" have a false assumption baked into them.

                • jonstewart 2 days ago

                  Not so much in AWS, though I’m unsure of other cloud providers. For desktop systems, sure.

                • ashvardanian 2 days ago

                  PS: Finding CPUs that support AVX-512 and SVE is relatively trivial - practically every cloud has them by now. It's harder to find Arm CPUs with wide physical registers, but that's another story.

                  • nromiun 2 days ago

                    But no one likes to develop on the cloud. The latency and storage synchronization can be very off-putting.

                  • nromiun 2 days ago

                    Because it is very hard to find new hardware to test it on, let alone expect your users to take advantage of it on their machines.

                    AVX512 is such a mess that Intel just removed it after a generation or two. On the ARM side, SVE is even worse: there is already SVE2, but good luck finding even an SVE-enabled machine.

                    Apple does not support it on Apple Silicon™ (only SME), and Snapdragon does not support it even on their latest 8 Elite; the 8 Elite Gen 2 is supposed to come with it.

                    Only MediaTek and Neoverse chips support them. So finding even one machine to develop and test such code on can be a little difficult.

                  • lukaslalinsky 2 days ago

                    I really wish Zig decided to add SIMD intrinsics. There are many SIMD algorithms that could be implemented, but you have to switch back to C for them, because they depend on operations outside of what LLVM provides for vectors.

                    • AndyKelley 2 days ago

                      Missing SIMD functionality is welcome as issue reports. Obviously we're not going to copy the C intrinsics directly, since Zig SIMD is portable, but everything should be expressible.

                      It doesn't really have to do with what operations LLVM provides for vectors: LLVM supports all the SIMD intrinsics of clang, and LLVM is only one of Zig's many backends.

                      • steeve 2 days ago

                        Like what?

                        You can also directly call LLVM intrinsics in case this doesn’t work

                        • lukaslalinsky 2 days ago

                          Just a couple of days ago, I wanted to implement specialized StreamVByte decoding in Zig, but @shuffle() needs the mask to be compile-time known, while _mm_shuffle_epi8() works just fine with a dynamic mask. I remember that some time ago, I also couldn't find an alternative to _mm_alignr_epi8().

                          • aqrit 2 days ago

                            `_mm_alignr_epi8` is a compile-time known shuffle that gets optimized well by LLVM [1].

                            If you need the exact behavior of `pshufb` you can use asm or the llvm intrinsic [2]. iirc, I once got the compiler to emit a `pshufb` for a runtime shuffle... that always guaranteed indices in the 0..15 range?
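
                            The intrinsic route looks roughly like this (a sketch; it relies on the LLVM backend resolving the extern declaration by its intrinsic name, so it only works when compiling through LLVM):

                              extern fn @"llvm.x86.ssse3.pshuf.b.128"(
                                  a: @Vector(16, u8),
                                  b: @Vector(16, u8),
                              ) @Vector(16, u8);

                              fn pshufb(v: @Vector(16, u8), mask: @Vector(16, u8)) @Vector(16, u8) {
                                  // Runtime mask, unlike @shuffle, which
                                  // requires a comptime-known one.
                                  return @"llvm.x86.ssse3.pshuf.b.128"(v, mask);
                              }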

                            Ironically, I also wanted to try Zig by doing a StreamVByte implementation, but got derailed by the lack of SSE/AVX intrinsics support.

                            [1] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5deec34... [2] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5deec34...

                            • lukaslalinsky 2 days ago

                              Oh, that's actually quite neat, it did not occur to me that you can use @shuffle with a compile time mask and it will optimize it to a specialized instruction.

                      • ww520 2 days ago

                        Excellent write-up. I really appreciate the clear and detailed explanation of the algorithm.

                        The nice thing about Zig's SIMD support is that the register size is transparent. You can just declare a 64-byte vector as the Block, and Zig will use one AVX-512 (64-byte) register or two AVX2 (32-byte) registers behind the scenes. All other SIMD operations on the type are mapped transparently onto the target platform's registers when compiled.

                        Even using two AVX2 registers for 64 bytes of data is a win, thanks to instruction pipelining: most CPUs have multiple SIMD execution units, and since the two 32-byte chunks are independent, the CPU can process them in parallel.

                        The next optimization is to line the data up at 64-byte alignment, to match the L1/L2 cache line size. Memory access is slow in general.
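
                        For example (a small sketch; `containsByte` is just an illustrative helper):

                          const Block = @Vector(64, u8);

                          fn containsByte(block: Block, b: u8) bool {
                              const needle: Block = @splat(b);
                              // `==` yields a @Vector(64, bool), and
                              // @reduce ORs it down to a single bool.
                              return @reduce(.Or, block == needle);
                          }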

                        • burntsushi 2 days ago

                          Another challenge may come from using a portable SIMD API instead of specific ISA instructions. I'm specifically thinking about computing the mask, which on x86-64 is, I assume, implemented via movemask. But aarch64 lacks such an instruction, so you need to do other shenanigans for the best codegen: https://github.com/BurntSushi/memchr/blob/ceef3c921b5685847e...
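
                          In the portable API that step looks roughly like this sketch (Zig allows bitcasting a vector of bools to an integer mask):

                            fn moveMask(cmp: @Vector(32, bool)) u32 {
                                // Packs the 32 compare results into one
                                // bit each, movemask-style.
                                return @bitCast(cmp);
                            }

                          On x86-64 this should lower to movemask; on aarch64 it is exactly where the codegen needs the workarounds linked above.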

                          • ww520 2 days ago

                            A language-level or standard-library-level polyfill for the missing SIMD operations on such platforms would be great.

                        • lokeg 2 days ago

                          What about the worst case? I.e. something like searching for 1000 'a's in a long string of 'a's interspersed with a 'b' every 500-1000 positions? Seems accidentally quadratic, unfortunately, in the absence of some KMP-like fallback.

                          • MattPalmer1086 2 days ago

                            Worst case for these types of search is O(mn), where m is the length of the needle and n the length of the haystack. It is not linear in n.

                            The absolute worst case is when the needle and the haystack are both composed of the same repeated byte (e.g. all zeros).

                            • expenses3 2 days ago

                              How is it quadratic? You do 1000 checks for every character in the haystack, but that's still O(n)

                          • suddenlybananas 2 days ago

                            > The difference between 4μs vs 1μs is extremely small, but it's slightly faster nonetheless.

                            Put that in a loop and it's an enormous speed-up.

                            • okr 2 days ago

                              I remember reading Daniel Lemire's blog about SIMD a few days ago, in which he discussed substring search as well!

                              https://lemire.me/blog/2025/08/09/why-do-we-even-need-simd-i...

                              • codethief 2 days ago

                                Nice article!

                                Also, this might be a stupid question (I'm a Zig newbie) but… instead of calling std.mem.eql() in the while loop to look at each potential match individually, couldn't you repeat the same trick as before? That is, use SIMD to search for the second and second-to-last character of the needle, then third and third-to-last, and so on, and finally take a bitwise AND of all the resulting bit masks? This way, one would avoid looking at each potential match one by one, and instead look at all of them at the same time.

                                Even if that doesn't work for some reason and you still need to loop over all potential matches individually, couldn't you use SIMD inside the while loop to replace std.mem.eql and thereby speed up string comparison? My understanding was that std.mem.eql loops over bytes one by one and compares them?
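
                                  Concretely, for the first idea I'm imagining something like this sketch (one extra position only, ignoring bounds: it assumes needle.len >= 2 and i + 1 + 32 <= haystack.len):

                                    const Block = @Vector(32, u8);

                                    fn candidates(hay: []const u8, i: usize, needle: []const u8) u32 {
                                        const a: Block = hay[i..][0..32].*;
                                        const b: Block = hay[i + 1 ..][0..32].*;
                                        const m0: u32 = @bitCast(a == @as(Block, @splat(needle[0])));
                                        const m1: u32 = @bitCast(b == @as(Block, @splat(needle[1])));
                                        // Only positions matching both characters survive.
                                        return m0 & m1;
                                    }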

                                • ncruces 2 days ago

                                  I know little about Zig, but std.mem.eql very likely already uses SIMD.

                                  This is about using SIMD to avoid even calling std.mem.eql for 99% of the possible attempts.

                                  • llimllib 2 days ago

                                    std.mem.eql is here, super easy to read: https://github.com/ziglang/zig/blob/master/lib/std/mem.zig#L...

                                    My read is that it would use SIMD if T is a @Vector, and not otherwise? But I'm neither a Zig nor a SIMD expert

                                    • ozgrakkurt 2 days ago

                                      Pretty sure that compiles into assembly for any primitive type like integers, floats etc.

                                      • IshKebab 2 days ago

                                        What do you mean it "compiles into assembly"?

                                        • ozgrakkurt 2 days ago

                                          I meant to write SIMD

                                • jiehong 2 days ago

                                  Nice!

                                  But does that work with non-ASCII characters (aka Unicode)?

                                  • llimllib 2 days ago

                                    Kind of! This code assumes that you're dealing with a byte slice, which means you've already encoded your Unicode data.

                                    If you just encoded your string to bytes naïvely, it will probably-mostly still work, but it will get some combining characters wrong if they're represented differently in the two sources you're comparing (e.g., e-with-an-accent-character vs. accent-combining-character+e).

                                    If you want to be correct-er you'll normalize your UTF string[1], but note that there are four different defined ways to do this, so you'll need to choose the one that is the best tradeoff for your particular application and data sources.

                                    [1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...

                                    • codethief 2 days ago

                                      > If you just encoded your string to bytes naïvely

                                      By "naïvely" I assume you mean you would just plug in UTF-8 bytestrings for haystack & needle, without adjusting the implementation?

                                      Wouldn't the code still need to take into account where characters (code points) begin and end, though, in order to prevent incorrect matches?

                                      • burntsushi 2 days ago

                                        IDK what "encoded your string to bytes naively" means personally. There is only one way to correctly UTF-8 encode a sequence of Unicode scalar values.

                                        In any case, no, this works because UTF-8 is self synchronizing. As long as both your needle and your haystack are valid UTF-8, the byte offsets returned by the search will always fall on a valid codepoint boundary.

                                        In terms of getting "combining characters wrong," this is a reference to different Unicode normalization forms.

                                        To be more precise... Consider a needle and a haystack, represented by a sequence of Unicode scalar values (typically represented by a sequence of unsigned 32-bit integers). Now encode them to UTF-8 (a sequence of unsigned 8-bit integers) and run a byte level search as shown by the OP here. That will behave as if you've executed the search on the sequence of Unicode scalar values.

                                        So semantically, a "substring search" is a "sequence of Unicode scalar values search." At the semantic level, this may or may not be what you want. For example, if you always want `office` to find substrings like `oﬃce` (spelled with the 'ﬃ' ligature) in your haystack, then this byte level search will not do what you want.

                                        The standard approach for performing a substring search that accounts for normalization forms is to convert both the needle and haystack to the same normal form and then execute a byte level search.

                                        (One small caveat is when the needle is an empty string. If you want to enforce correct UTF-8 boundaries, you'll need to handle that specially.)

                                        • llimllib 2 days ago

                                          By naively, I meant without normalization.

                                          You know much more about this than I do though

                                          edit: this is what I mean, for example: `tést` != `tést` in rg, because \u00e9 (e with accent) != e\u0301 (e followed by combining character accent)

                                              $ printf "t\\u00E9st" > /tmp/a 
                                              $ xxd /tmp/a
                                              00000000: 74c3 a973 74                             t..st
                                              $ cat /tmp/a
                                              tést
                                          
                                              $ printf "te\\u0301st" > /tmp/b 
                                              $ xxd /tmp/b
                                              00000000: 7465 cc81 7374                           te..st
                                              $ cat /tmp/b
                                              tést
                                          
                                              $ printf "t\\u00E9st" | rg -f - /tmp/a
                                              1:tést
                                              $ printf "t\\u00E9st" | rg -f - /tmp/b
                                              # ed: no result
                                          
                                          edit 2: if we normalize the UTF-8, the two strings will match

                                              $ printf "t\\u00E9st" | uconv -x any-nfc | xxd
                                              00000000: 74c3 a973 74                             t..st
                                              $ printf "te\\u0301st" | uconv -x any-nfc | xxd
                                              00000000: 74c3 a973 74                             t..st
                                          
                                          Which you know, and indicate! Just working through an example of it that will maybe help people understand, I dunno
                                      • jiehong 2 days ago

                                        Thanks for this detailed answer!

                                      • codethief 2 days ago

                                        I suppose generalizing the approach to UTF-32 should be straightforward, but variable-length encodings like UTF-8 and UTF-16 might be more involved(?) Either way, I'm sure BurntSushi found a solution and built it into ripgrep.

                                        • burntsushi 2 days ago

                                          ripgrep always deals with UTF-8. When it sees a different encoding, like UTF-16, ripgrep first transcodes to UTF-8 and then searches.

                                          This is absolutely in part because of all of the byte-oriented optimizations that are baked into ripgrep (and its regex engine). Note that I said in part: making ripgrep (and its regex engine) work on things other than a sequence of bytes is far more difficult than just porting a bunch of SIMD algorithms. There are also many optimizations and architectural constraints in the code based on the alphabet size. That is, with 8-bit integers, its alphabet size is 256. With 16-bit integers, the alphabet size is 65,536.

                                          • tialaramex 2 days ago

                                            I think this is the right choice, because in practice UTF-8 "won", just like the two's complement machine integer won. It's pretty good: Wikipedia has a brief section explaining how Ken Thompson, for example, made it self-synchronizing, which seems like a "duh" feature today, but the concept before Ken touched it didn't have this. It's a Best Common Practice for the Internet, it's the default in most modern systems, and places such as Java's virtual machine or Windows, which can't easily "just" use UTF-8, have nevertheless gradually shifted toward being very friendly toward it.

                                        • garganzol 2 days ago

                                          Every language tries to implement best-in-class memory set/search primitives. Maybe we should move them to something called libos, where they would be implemented by every host OS? Then OS manufacturers could supply libos.so/libos.dll/libos.dylib as part of the official OS distributions.

                                          If the wheels get reinvented again and again, it means that they should be readily available.

                                          • vitalnodo 2 days ago

                                            There is an existing concept called an exokernel. Moreover, I think Zig could be very suitable for this, because it already encourages passing allocators to functions to make memory allocation explicit. Recently, I/O has been (and is still being) reworked in a similar way.

                                            • ozgrakkurt 2 days ago

                                              Isn’t this what libc is? Like musl-libc or glibc

                                              • garganzol 2 days ago

                                                libc is de jure for the C language on Unix-like systems. De facto, it is used for other things and by other languages, creating a bit of an architectural mess.

                                                What I am talking about is creating a cross-vendor standard.