Implementing rotate-through-carry like that was a really bad decision IMO. It's almost never used to rotate by more than one bit left or right at a time, and the one-bit case could be done much more efficiently than with the constant-time code, which is only faster when the count is > 6.
Is the full microcode available anywhere?
I haven't published it yet as there are still some rough edges to clear up, but if you email me (andrew@reenigne.org) I'll send you the current work-in-progress (the same one that nand2mario is working from).
Since the shifter is also used for bit tests, the assumption that "most things are a 1-bit shift" might not hold. Perhaps they did the analysis and it made sense.
There are separate opcodes for shift/rotate by 1, by CL, or by an immediate operand. Those are decoded to separate microcode entry points, so they could have at least optimized the "RCL/RCR x,1" case.
And the microcode for bit test has to be different anyway.
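To illustrate the "RCL x,1" point, here is a minimal Python sketch of why the one-bit case is so cheap compared with the general one. This is only a behavioural model, not the 386 microcode; the function names `rcl1` and `rcl` and the `width` parameter are made up for the example.

```python
def rcl1(value, carry, width=32):
    """Rotate left through carry by exactly one bit (behavioural model)."""
    mask = (1 << width) - 1
    new_carry = (value >> (width - 1)) & 1   # top bit falls out into carry
    result = ((value << 1) | carry) & mask   # old carry enters at bit 0
    return result, new_carry


def rcl(value, carry, count, width=32):
    """General case: a rotate of the (width + 1)-bit quantity carry:value."""
    n = width + 1
    count %= n
    wide = (carry << width) | value
    wide = ((wide << count) | (wide >> (n - count))) & ((1 << n) - 1)
    return wide & ((1 << width) - 1), wide >> width
```

The one-bit version is a single shift plus an OR with the old carry, while the general version has to treat carry and operand as one wider value. Calling `rcl1` repeatedly gives the same result as `rcl` with the matching count.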
> For memory operands, there's an additional twist: the bit index is a signed offset that can address bits outside the nominal operand. A bit index of 35 on a dword accesses bit 3 of the next dword in memory.
I wonder what the use case is for testing a bit outside the operand at the given memory address.
So you can have bit arrays of any length in memory, rather than just 32 bits in a register.
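A rough Python model of that addressing, assuming 32-bit operand size and using a list of dwords to stand in for memory (the name `bt_mem` and its parameters are hypothetical, not anything from the manual):

```python
def bt_mem(dwords, base_index, bit_offset):
    """Model a bit test on a memory operand with a signed bit offset.

    dwords      -- list of 32-bit integers standing in for memory
    base_index  -- index of the dword the operand nominally points at
    bit_offset  -- signed bit index, may fall outside that dword
    """
    # Arithmetic shift (floor division by 32) selects the dword, so
    # negative offsets step backwards from the base.
    dword_index = base_index + (bit_offset >> 5)
    if not 0 <= dword_index < len(dwords):
        raise IndexError("bit offset falls outside the modelled memory")
    # The low five bits select the bit within that dword.
    return (dwords[dword_index] >> (bit_offset & 31)) & 1
```

With `memory = [0, 0b1000, 0]`, both `bt_mem(memory, 0, 35)` and `bt_mem(memory, 2, -29)` land on bit 3 of the middle dword, which is the "bit 3 of the next dword" case from the quote plus its negative-offset counterpart.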
That makes sense. LLVM could probably do better here by using the memory operand version:
I don't think the memory operand version would work here. If I understand the description in the x86 architecture manual correctly, the 32-bit operand form interprets the bit offset as signed. A 64-bit operand could work around that, but would then run into over-read issues because it fetches 64 bits of data.
The memory operand version tends to be as slow or slower than the manual implementation, so LLVM is right to avoid it.
It was probably easier to just implement it that way, given that the barrel shifter is 64 bits wide.