Comments Page - Exploring pre-1990 versions of wc(1) (2023)

Exploring pre-1990 versions of wc(1) (2023) (sigwait.org), submitted by henry_flower 2 years ago

shric 2 years ago
A fun read on word count optimization can be found in Abrash's Black Book:
https://www.jagregory.com/abrash-black-book/#lessons-learned...
You can gloss over the asm if you wish, the tricks that are explained around it are worth it imho.
- Joker_vD 2 years ago
  I wonder if large lookup tables/table-driven state machines are still as good as they used to be. After all, even with all the on-chip caches, the additional memory accesses today seem to be slower than doing some multi-instruction SIMD voodoo.
  LegionMammal978 2 years ago
  At least the GNU version of wc [0] uses AVX2 for line counting, if available. Though it falls back to a simple character-by-character loop if you ask for a character count [not to be confused with a byte count!] or a word count.
  [0] https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/wc_...
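As a point of comparison for the lookup-table question, here is a minimal sketch of a table-driven word counter. It is not Abrash's code and not GNU wc's; the table names (`IS_DELIM`, `NEXT`, `WORD_START`) and the function `count_lines_words` are made up for illustration. Every byte costs two small table lookups, while line counting needs no state at all, so it is left to a bulk scan.

```python
# Sketch only: a table-driven word counter over raw bytes.
# All names here are illustrative assumptions, not taken from any real wc.

OUT, IN = 0, 1  # states: outside a word, inside a word

# 256-entry class table: 1 for a delimiter (space, tab, newline), 0 otherwise.
IS_DELIM = bytearray(256)
for ch in b" \t\n":
    IS_DELIM[ch] = 1

# NEXT[state][is_delim] -> next state
NEXT = ((IN, OUT), (IN, OUT))
# WORD_START[state][is_delim] -> 1 when this byte begins a new word
WORD_START = ((1, 0), (0, 0))


def count_lines_words(data: bytes) -> tuple[int, int]:
    """Return (lines, words) using only table lookups per byte."""
    state = OUT
    words = 0
    for byte in data:
        cls = IS_DELIM[byte]
        words += WORD_START[state][cls]
        state = NEXT[state][cls]
    # Counting newlines is stateless, so a bulk scan is enough.
    lines = data.count(b"\n")
    return lines, words
```

Whether the tables beat a branchy or vectorized loop on a given machine is exactly the open question in the comment above; the sketch only shows the shape of the technique.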
tripdout 2 years ago
Those `goto`s between two different for loops are crazy.
- actionfromafar 2 years ago
  Assembly / machine code thinking.
  amszmidt 2 years ago
  More like a relic of (actual) "spaghetti code"; it was relatively common in really old Lisp code.
- lifthrasiir 2 years ago
  Not that crazy given that it closely mirrors its state machine structure.
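To illustrate the state-machine reading, here is a rough sketch (not the historical source; `count_words` is a hypothetical name) in which each inner loop plays the role of one state. Python has no `goto`, so falling out of one loop into the other stands in for the jump.

```python
def count_words(data: bytes) -> int:
    """Count maximal runs of non-delimiter bytes (delimiters: space, tab, newline)."""
    delims = b" \t\n"
    i, n = 0, len(data)
    words = 0
    while True:
        # State 1: outside a word. Skip over delimiters.
        while i < n and data[i] in delims:
            i += 1
        if i == n:
            return words
        # Transition: a non-delimiter byte starts a new word.
        words += 1
        # State 2: inside a word. Skip ahead to the next delimiter.
        while i < n and data[i] not in delims:
            i += 1
        if i == n:
            return words
```

The current position in the code is the state, which is why a version with jumps between two loops can do without an explicit "in word" flag.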
Joker_vD 2 years ago
> A word is a maximal string of characters delimited by spaces, tabs or newlines.
And then the actual code explicitly filters out and ignores every character larger than 0x7F. Just why.
- jolmg 2 years ago
  Probably because they're not characters. They're just bytes undefined by ASCII.
- Tor3 2 years ago
  ASCII is 7 bits (the eighth bit would be parity), so that makes perfect sense, in an ASCII world.
  Joker_vD 2 years ago
  So the character e.g. "F" would have this parity bit set and therefore should be filtered out and not count as a letter, in the ASCII world?
  aap_ 2 years ago
  There are only 7 bits in ASCII. An 8th can be used for parity when transmitting data but a regular program will never see it. Anything above 0x7F is simply not a character.
  Tor3 2 years ago
  Parity bits are not part of the character. They are for detecting transmission errors. You filter off the parity bit before looking at the byte.
  Joker_vD 2 years ago
  But this is not what the code is doing, is it? It's not doing (ch & 0x7F), it's doing ch <= 0x7F. And the parity checking/filtering is done in the tape drive/serial port driver anyhow, so it would never reach wc in the first place.
  Tor3 2 years ago
  Yes, that's true for that code, but that wasn't really my point. The point of my earlier post was that ASCII is 7 bits, 0..127, and, depending on where the characters came from, only values below 128 are valid ASCII. Because a parity bit was common, ASCII was limited to 7 bits to leave room for it. When other transports are involved, e.g. reading from a file, there aren't any parity bits, so the code simply focuses on valid ASCII, which is below 128. (Well, that's not entirely true: a minicomputer I worked with back in the day used parity bits on characters in text files, but that's not the case for the platform where this particular old 'wc' was used.)
  epcoa 2 years ago
  What in the hell are you going on about? F is 0x46 which is < 0x7F.
  Joker_vD 2 years ago
  I am going on about the parity bit. 0x46 has an odd number of bits set (three, to be precise), so for the parity to check out (that is, the number of bits set has to be even), a parity bit needs to be set and the resulting encoding has to be 0xC6, with four bits set.
  icedchai 2 years ago
  Assuming parity is enabled, the parity check is done at a lower level (serial port, TTY driver, etc.) and you'll never see it from the application. I used to mess around with serial ports and terminals a ton in my youth.
  Tor3 2 years ago
  The parity bit is not part of the character. It's external, an error-detecting device. To read ASCII you always look at bits 6..0, seven bits. You don't filter away the character because it has the parity bit set, you filter off the parity bit (whether it's set or not).
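The two operations this sub-thread keeps contrasting, masking with `ch & 0x7F` versus range-testing with `ch <= 0x7F`, can be sketched as follows. This assumes even parity carried in bit 7, as in the 0x46 to 0xC6 example above; the function names are hypothetical and none of this is taken from the wc source.

```python
def even_parity_bit(ch: int) -> int:
    """Return the bit that makes the total count of 1 bits even for a 7-bit code."""
    return bin(ch & 0x7F).count("1") & 1


def with_even_parity(ch: int) -> int:
    """Place the parity bit in bit 7, above the 7-bit character code."""
    return (ch & 0x7F) | (even_parity_bit(ch) << 7)


def strip_parity(byte: int) -> int:
    """Mask off bit 7 and keep the 7-bit character (the `ch & 0x7F` approach)."""
    return byte & 0x7F


def is_seven_bit(byte: int) -> bool:
    """Range test (the `ch <= 0x7F` approach): rejects the byte, does not repair it."""
    return byte <= 0x7F
```

With these definitions, `with_even_parity(0x46)` gives `0xC6`, `strip_parity(0xC6)` recovers `0x46`, and `is_seven_bit(0xC6)` is false. A range test therefore behaves very differently from a mask if a parity-carrying byte ever reaches it.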
- ivan_gammel 2 years ago
  Because they thought that a word is something said in a human language that they can understand.
  Joker_vD 2 years ago
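  Mi ne pensas ke lingvoj kiuj usas ekskluzive la basan latinan alfabeton estas komprepeneblaj per si mem. [Esperanto: "I don't think that languages which use exclusively the basic Latin alphabet are understandable in and of themselves."]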
  luismedel 2 years ago
  Cool how my native language is Spanish and I can almost-understand 80% of Esperanto.
  actionfromafar 2 years ago
  Ze riform iz komplit.
  Joker_vD 2 years ago
  The [z] and [ð] are phonemically different in English, just as [i] and [i:] are, so it'd actually be "Ðe riform is komplijt". American rhoticity prevents us from spelling it "rifoom" as would be proper, unfortunately.
dexen 2 years ago
The brevity carried over to Plan 9. Re-posting my older comment (https://news.ycombinator.com/item?id=4023385):
Plan 9 (http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs) follows the Unix philosophy. A lot of legacy has been shed. I can count 13 options to ls, 11 options to sed and just 5 to sed.
The standard Plan 9 shell, Rc, is described in a mere ~500 lines of man page, while Bash takes a whopping ~5400 lines.
Oh, and there is no `dll hell' in P9 :-)