I built a JWT support library at work (https://github.com/geldata/gel-rust/tree/master/gel-jwt) and I can confirm that JWTs all sound like "eyyyyyy" in my head.
It's like how all certificates sound like "miiiiii".
"It's miiiii! And I can prove it!"
Ok
"Is that you again, punk zip?" when seeing the first few bytes of a zip file.
Probably shouldn’t call Phil Katz a punk
Also, MZ in exe is Mark Zbikowski.
Ey, I'm JSON!
Eyy, I'm authin' here!
Eyy, JSON I'm PTER, you don't know me yet but just you wait. Eyyyy.
> JWTs all sound like "eyyyyyy" in my head.
"eeey bruh, open the the API it's me"
eyy lmao
Useful ones to know:
- `R0lGOD` - GIF files
- `iVBOR` - PNG files
- `/9j/` - JPG files
- `eyJ` - JSON
- `PD94` - XML
- `MII` - ASN.1 file, such as a certificate or private key
These are nice to know since they show up pretty frequently (images in data: URLs, JSON/XML/ASN.1 in various protocols).
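If you ever want to automate the eyeballing, here's a tiny sketch (the prefix table is just the list above; `guess_base64_payload` is a made-up name, not anyone's real API):

    PREFIXES = [
        ("R0lGOD", "GIF image"),
        ("iVBOR", "PNG image"),
        ("/9j/", "JPEG image"),
        ("eyJ", "JSON object"),
        ("PD94", "XML declaration"),
        ("MII", "ASN.1 DER (certificate, key, CRL, ...)"),
    ]

    def guess_base64_payload(s):
        # First match wins; these prefixes don't overlap.
        for prefix, kind in PREFIXES:
            if s.startswith(prefix):
                return kind
        return "unknown"

    print(guess_base64_payload("eyJhbGciOiJIUzI1NiJ9"))  # JSON object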
> GIF, PNG, JPG
It makes more sense to transmit binary formats in binary.
You would save bandwidth, memory and a decoding step.
Then you could also inspect the header bytes, instead of memorizing how they present in some intermediate encoding.
Knowing these magic bytes in base64 is mostly relevant in situations in which you see data encoded by other people, which means you probably had no control over the encoding. Other people (or rather everybody) sometimes do things which don't make sense.
Sometimes you need to embed binary data in a text format (e.g. JSON).
The amount of data I have stuffed into JSON as base64 encoded text makes me sick.
I wrote a glTF model converter once. 99% of those millions of JSON files I wrote were base64 encoded binary data.
A single glTF model sometimes wants to be two files on disk: one for the JSON and one for the binary data. You use the JSON to describe where in the binary data the vertices are defined, plus other windows into that data for the various other bits, like where the triangles, triangle fans, textures, and other stuff are stored. But you can also base64 encode that data and put it in the JSON file and not have a messy double-file model. So that's what I did, and I hated it. But it still felt better than having .gltf files and .bin files which together made up a single model file.
I have a tampermonkey script that's a megabyte in size because it includes an encoded gif file.
Thanks, GPT.
Very useful!
You can spot Base64 encoded JSON.
The PEM format (that begins with `-----BEGIN [CERTIFICATE|CERTIFICATE REQUEST|PRIVATE KEY|X509 CRL|PUBLIC KEY]-----`) is already Base64 within the body. The header and footer are ASCII, and shouldn't be encoded[0] (there's no link to the claim, so perhaps there's another format similar to PEM?)
You can't spot private keys, unless they start with a repeating text sequence (or use the PEM format with header also encoded).
The other base64 prefix to look out for is `MI`. `MI` is common to every ASN.1 DER encoded object (all public and private keys in standard encodings, all certificates, all CRLs) because overwhelmingly every object is a `SEQUENCE` (0x30 tag byte) followed by a length introducer (top nibble 0x8). `MII` is very very common, because it introduces a `SEQUENCE` with a two byte length.
You'll also see "AQAB" a lot. This is the base64 version of the integer representation of 65537, the usual public exponent parameter e in modern RSA implementations.
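Both claims are quick to check from a Python REPL (a minimal sketch, standard library only):

    import base64

    # "MII" pins the first decoded bytes to 0x30 0x82: the SEQUENCE tag,
    # then a length introducer meaning "two length bytes follow".
    print(base64.b64decode("MIIC").hex())  # 308202

    # "AQAB" decodes to the big-endian bytes 01 00 01, i.e. 65537.
    print(int.from_bytes(base64.b64decode("AQAB"), "big"))  # 65537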
I for one wait for the day when quantum computers will break all the encryption forever so nobody will have to suffer broken asn1 decoders, plaintext specifications of machine-readable formats and unearned aura of arcane art that surrounds the whole thing.
asn1 enjoyers can also look forward to the sweet release of death. though if you end up in hell you might end up staring at XER for the rest of eternity
Thanks for pointing it out! I've added an errata to the blog post
> The PEM format (that begins with `-----BEGIN [CERTIFICATE|CERTIFICATE REQUEST|PRIVATE KEY|X509 CRL|PUBLIC KEY]-----`) is already Base64 within the body. The header and footer are ASCII, and shouldn't be encoded[0] (there's no link to the claim, so perhaps there's another format similar to PEM?)
In practice, you will spot fully b64 encoded PEMs all the time once you have Kubernetes in play... create a Secret from a file and that's what you will find.
I don't always store my Kubernetes Secrets in files, but when I do, I prefer stringData.
I believe OP meant $(kubectl get secret) which by default returns them in JSON and base64 encoded. I do agree with you that it would be stellar if kubectl were bright enough to recognize "there's no weird characters, show me in stringData" but there are already other way more important DX issues that haven't gotten any traction
Mathematically, base64 is such that every block of three bytes of raw input will result in four characters of base64'd output.
These blocks can be considered independent of each other. So for example, with the string "Hello world", you can do the following base64 transformations:
* "Hel" -> "SGVs"
* "lo " -> "bG8g"
* "wor" -> "d29y"
* "ld" -> "bGQ="
These encoded blocks can then be concatenated together and you have your final encoded string: "SGVsbG8gd29ybGQ="
(Notice that the last one ends in an equals sign. This is because the input is less than 3 characters, and so in order to produce 4 characters of output, it has to apply padding - part of which is encoded in the third digit as well.)
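A minimal sketch demonstrating this block independence:

    import base64

    blocks = [b"Hel", b"lo ", b"wor", b"ld"]
    parts = [base64.b64encode(b).decode() for b in blocks]
    print(parts)                                      # ['SGVs', 'bG8g', 'd29y', 'bGQ=']
    print("".join(parts))                             # SGVsbG8gd29ybGQ=
    print(base64.b64encode(b"Hello world").decode())  # SGVsbG8gd29ybGQ= (identical)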
It's important to note that this is simply a byproduct of the way that base64 works, not actually an intended thing. My understanding is that it's basically like how if you take an ASCII character - which could be considered a base 256 digit - and convert it to hexadecimal (base 16), the resulting hex number will always be two digits long - the same two digits, at that - even if the original was part of a larger string.
In this case, every three base 256 digits will convert to four base 64 digits, in the same way that it would convert to six base 16 digits.
By the way, I would guess that this is almost certainly why LLMs can actually decode/encode base64 somewhat well, even without the help of any MCP-provided tools - it's possible to 'read' it in a similar way to how an LLM might read any other language, and most encoded base64 on the web will come with its decoded version alongside it.
Nitpick, but ASCII would be base 128; the largest ASCII value is 0x7F, which in itself is a telltale if you are looking at hex dumps.
Yeah, I was aware of that, but I figured it was the easiest way to explain it. It's true that "character representation of a byte" is more accurate, but it doesn't roll off the tongue as easily.
There is a Base64 quasi-fixed point:
    $ echo -n Vm0 | base64
    Vm0w

It can be extended indefinitely one character at a time, but there will always be some suffix. For reference, a program to generate the quasi-fixed point from scratch:
    #!/usr/bin/env python3
    import base64

    def len_common_prefix(a, b):
        assert len(a) < len(b)
        for i in range(len(a)):
            if a[i] != b[i]:
                return i
        return len(a)

    def calculate_quasi_fixed_point(start, length):
        while True:
            tmp = base64.b64encode(start)
            l = len_common_prefix(start, tmp)
            if l >= length:
                return tmp[:length]
            print(tmp[:l].decode('ascii'), tmp[l:].decode('ascii'), sep='\v')
            # Slicing beyond end of buffer will safely truncate in Python.
            start = tmp[:l*4//3+4]  # TODO is this ideal?

    if __name__ == '__main__':
        final = calculate_quasi_fixed_point(b'\0', 80)
        print(final.decode('ascii'))
This ultimately produces: Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVUbGho
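And you can check the quasi-fixed-point property directly: base64-encoding that 80-character string yields output that begins with the string itself.

    import base64

    s = b"Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVUbGho"
    assert base64.b64encode(s).startswith(s)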
From the other direction, you'd call it a tail-eating unquine?
Note that the suffix will grow in length with the input, making it less and less interesting.
(Because the output is necessarily 8/6 the size of the input, the suffix always adds 33% to the length.)
After staring one time too many at base64-encoded or hex-encoded ASN.1, I started to believe that scene in the Matrix where the operator looks at the raw stream from the Matrix on his terminal and sees things in it.
Years ago I was part of a group of people I knew who could read and edit large parts of sendmail.cf by hand without using m4. Other people who had to deal with mail servers at the time certainly treated it like a superpower.
In 1989, my Toronto-based team was at TJ Watson for the final push on porting IBM's first TCP/IP implementation to MVS. Some of our tests ran raw, no RACF, no other system protections. I was responsible for testing the C sockets API, a very cool job for a co-op.
When one of my tests crashed one of those unprotected mainframes, two guys who were then close to my age now stared at an EBCDIC core dump, one of them slowly hitting page down, one Matrix-like screen after another, until they both jabbed at the screen and shouted "THERE!" simultaneously.
(One of them hand delivered the first WATFOR compiler to Yorktown, returning from Waterloo with a car full of tapes. I have thought of him - and this "THERE!" moment - every time I have come across the old saw about the bandwidth of a station wagon.)
A significant part of my 1st ever job consisted of editing sendmail.cf’s by hand. Occasionally had to defer to my boss at the time for the real mind bending stuff. I now believe that he was in fact a non-human alien.
Where I work right now, the superpower of the day is pressing Ctrl-R in the terminal.
In some ways, I miss those days.
Spending hours wrangling sendmail.cf, and finally succeeding, felt like a genuine accomplishment.
Nowadays, things just work, mostly. How boring.
I feel that nowadays, it's a combination of "things just work" and "if they don't, good luck figuring out why".
I recently installed Tru64 UNIX on a DEC Alpha I got off eBay. I felt like it was more sluggish than it should be, so I looked around at man pages about the VM (virtual memory, not virtual machine) subsystem, and was amazed at how cleanly and thoroughly it was described, and what insights I could get about its state. The sys_attrs_vm man page alone, which just describes every VM-layer tunable, gave a pretty good description of what the VM subsystem does, how each of those tunables affects it, and why you might want to change it.
Nowadays, things are massively complex, underdocumented (or just undocumented), constantly changing, and often inconsistent between sub-parts. Despite thinking that I have both wide and deep knowledge (I'm a low-level kernel dev), it often takes me ages to figure out the root cause of sometimes even simple problems.
I wasn't one of those people, but I knew those people. The deepest I got was blindly doing fairly complex BIND DNS configs.
I don't really love this. It just feels so wasteful.
JWT does it as well.
Even in this example, they are double base64 encoding strings (the salt).
It's really too bad that there's really nothing quite like json. Everything speaks it and can write it. It'd be nice if something like protobuf was easier to write and read in a schemaless fashion.
What’s wrong with this?
The purpose of Base64 is to encode data—especially binary data—into a limited set of ASCII characters to allow transmission over text-based protocols.
It is not a cryptographic library nor an obfuscation tool.
Avoid encoding sensitive data using Base64, and don't include sensitive data in your JWT payload unless it is encrypted first.
I think it's more the waste of space in it all. Encoding data in base64 increases the length by 33%. So base64-encoding twice will blow it up by 33% of the original data and then again 33% of the encoded data, making 69% in total. And that's before adding JSON to the mix...
And before "space is cheap": JWT is used in contexts where space is generally not cheap, such as in HTTP headers.
Precisely my thoughts.
You have to ask the question "why are we encoding this as base64 in the first place?"
The answer to that is generally that base64 plays nice with HTTP headers. It has no newlines or special characters that need special handling. Then you ask "why encode JSON?" and the answer is "because JSON is easy to handle". Then you ask "why embed a base64 field in the JSON?" and the answer is "JSON doesn't handle binary data".
These are all choices that ultimately create a much larger text blob than need be. And because this blob is being used for security purposes, it gets forwarded onto the request headers for every request. Now your simple "DELETE foo/bar" endpoint ends up requiring a 10kb header of security data just to make the request. Or if you are doing HTTP/2, then it means your LB will end up storing that 10kb blob for every connected client.
Just wasteful. Especially since it's a total of about 3 or 4 different fields with relatively fixed sizes. It could have been base64(key_length(1 byte) | iterations(4 bytes) | hash_function(1 byte) | salt(32 bytes)), which would have produced something like a 52-character base64 string. The example is 3x that size (156 characters). It gets much worse than that on real systems I've seen.
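For what it's worth, a rough sketch of that compact layout (field order and sizes from the comment above; the concrete values are made up):

    import base64
    import struct

    # key_length (1 byte) | iterations (4 bytes) | hash_function (1 byte) | salt (32 bytes)
    packed = struct.pack(">BIB32s", 32, 600_000, 1, bytes(32))
    token = base64.b64encode(packed).decode()
    print(len(packed), len(token))  # 38 bytes -> 52 base64 characters (with padding)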
JSON doesn't even handle text...
Not exactly - encoding it twice increases by 4/3 * 4/3 - 1 = 7/9, which is about 77.78% more than the original.
JSON is already text based, not binary, so encoding it with base64 is a bit wasteful. Especially if you are going to just embed the text in another JSON document.
And of course text-based things themselves are quite wasteful.
Exactly. Using base64 as an obfuscation tool, or (shudder) as encryption, is seriously misusing it for what it was originally intended. If that's what you need to do, then avoid base64 in favor of something that was designed to do that.
> It's really too bad that there's really nothing quite like json
messagepack/cbor are very similar to json (schemaless, similar primitive types) but can support binary data. bson is another similar alternative. All three have implementations available in many languages, and have been used in big mature projects.
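For example, with the third-party cbor2 package (just one option; any CBOR library would do), binary payloads go in as raw bytes, with no base64 step:

    import cbor2  # pip install cbor2

    blob = bytes(range(16))                            # some binary payload
    doc = cbor2.dumps({"name": "tex0", "data": blob})  # bytes embed directly
    assert cbor2.loads(doc)["data"] == blob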
We just need to sacrifice n*field_count to a header describing the structure. We also need to define allowed types.
> Everything speaks it and can write it.
asn.1 is super nice -- everything speaks it and tooling is just great (runs away and hides)
> It'd be nice if something like protobuf was easier to write and read in a schemaless fashion.
If you just want a generic, binary, hierarchical type-length-value encoding, have you considered https://en.wikipedia.org/wiki/Interchange_File_Format ?
It's not that there are widely-supported IFF libraries, per se; but rather that the format is so simple that as long as your language has a byte-array type, you can code a bug-free IFF encoder/decoder in said language in about five minutes.
(And this is why there are no generic IFF metaformat libraries, ala JSON or XML libraries; it's "too simple to bother everyone depending on my library with a transitive dependency", so everyone just implements IFF encoding/decoding as part of the parser + generator for their IFF-based concrete file format.)
What's IFF used in? AIFF; RIFF (and therefore WAV, AVI, ANI, and — perhaps surprisingly — WebP); JPEG2000; PNG [with tweaks]...
• There's also a descendant metaformat, the ISO Base Media File Format ("BMFF"), which in turn means that MP4, MOV, and HEIF/HEIC can all be parsed by a generic IFF parser (though you'll miss breaking some per-leaf-chunk metadata fields out from the chunk body if you don't use a BMFF-specific parser.)
• And, as an alternative, there's https://en.wikipedia.org/wiki/Extensible_Binary_Meta_Languag... ("EBML"), which is basically IFF but with varint-encoding of the "type" and "length" parts of TLV (see https://matroska-org.github.io/libebml/specs.html). This is mostly currently used as the metaformat of the Matroska (MKV) format. It's also just complex enough to have a standalone generic codec library (https://github.com/Matroska-Org/libebml).
My personal recommendation, if you have some structured binary data to dump to disk, is to just hand-generate IFF chunks inline in your dump/export/send logic, the same way one would e.g. hand-emit CSV inline in a printf call. Just say "this is an IFF-based format" or put an .iff extension on it or send it as application/x-iff, and an ecosystem should be able to run with that. (And just like with JSON, if you give the IFF chunks descriptive names, people will probably be able to suss out what the chunks "mean" from context, without any kind of schema docs being necessary.)
Yeah! I agree with this. I use plain TLV (which is very close to this IFF format), similar to how PNG stores all its chunks in a single file, as you mentioned.
I got grief for saying that I prefer TLV data over textual data (even if the data is text) because of how easy it is to write code to output and ingest this format, and it is way, WAY faster than JSON will ever be.
It really is a very easy way to get much faster transmission of data over the wire than JSON, and it's dead easy to write viewers for. It's just an underrated way to store binary data. storing things as binary is underrated in general.
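A minimal sketch of that kind of IFF-style TLV (4-byte ASCII tag + 4-byte big-endian length + payload; the tag names here are made up):

    import struct

    def write_chunk(buf, tag, payload):
        # Append one chunk: tag (4 bytes), big-endian length, then payload.
        assert len(tag) == 4
        buf += tag + struct.pack(">I", len(payload)) + payload

    def read_chunks(data):
        # Walk the buffer chunk by chunk, yielding (tag, payload) pairs.
        off = 0
        while off < len(data):
            tag = data[off:off+4]
            (length,) = struct.unpack_from(">I", data, off + 4)
            yield tag, data[off+8:off+8+length]
            off += 8 + length

    buf = bytearray()
    write_chunk(buf, b"VERT", b"\x00" * 12)  # pretend vertex data
    write_chunk(buf, b"NAME", b"cube")
    for tag, payload in read_chunks(bytes(buf)):
        print(tag, len(payload))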
Reminds me of 1213486160[1]
Besides that, I just spent way too much time figuring out this is an encrypted OpenTofu state. It just looked way too much like a terraform state but not entirely. Tells ya what I spend a lot of time with at work.
This is probably another interesting situation in which you cannot read the state, but you can observe changes and growth by observing the ciphertext. It's probably fine, but remains interesting.
Kind of reminds me of a junior being amazed when I was able to read ascii strings out of a hex stream. Us old folks have seen a lot.
For anyone here who's never pondered it ("today's lucky 10,000"?), there's a lot of intentional structure in the organization of ASCII that comes through readily in binary or hex.
https://altcodeunicode.com/ascii-american-standard-code-for-...
The first nibble (hex digit) shows your position within the chart, approximately like 2 = punctuation, 3 = digits, 4 = uppercase letters, 6 = lowercase letters. (Yes, there's more structure than that considering it in binary.)
For digits (first nibble 3), the value of the digit is equal to the value of the second nibble.
For punctuation (first nibble 2), the punctuation is the character you'd get on a traditional U.S. keyboard layout pressing shift and the digit of the second nibble.
For uppercase letters (first nibble 4, then overflowing into first nibble 5), the second nibble is the ordinal position of the letter within the alphabet. So 41 = A (letter #1), 42 = B (letter #2), 43 = C (letter #3).
Lowercase letters do the same thing starting at 6, so 61 = a (letter #1), 62 = b (letter #2), 63 = c (letter #3), etc.
The tricky ones are the overflow/wraparound into first nibble 5 (the letters from letter #16, P) and into first nibble 7 (from letter #16, p). There you have to actually add 16 to the letter position before combining it with the second nibble, or think of it as "letter #0x10, letter #0x11, letter #0x12...", which may be less intuitive for some people.
Again, there's even more structure and pattern than that in ASCII, and it's all fully intentional, largely to facilitate meaningful bit manipulations. E.g. converting uppercase to lowercase is just a matter of adding 32, or a logical OR with 0x20 (binary 00100000). Converting lowercase to uppercase is just a matter of subtracting 32, or a logical AND with 0xDF (binary 11011111).
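In code, each direction of the case flip is literally one bit (bit 5, value 0x20), and the digit trick from above is just the low nibble:

    print(chr(ord("A") | 0x20))   # a  (set bit 5: uppercase -> lowercase)
    print(chr(ord("a") & ~0x20))  # A  (clear bit 5: lowercase -> uppercase)
    print(ord("3") & 0x0F)        # 3  (digits: the second nibble is the value)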
For reading hex dumps of ASCII, it's also helpful to know that the very first printable character (0x20) is, ironically, blank -- it's the space character.
I should just have put the printable character chart right here in the post for people to compare:
      0 1 2 3 4 5 6 7 8 9 A B C D E F
    2   ! " # $ % & ' ( ) * + , - . /
    3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
    4 @ A B C D E F G H I J K L M N O
    5 P Q R S T U V W X Y Z [ \ ] ^ _
    6 ` a b c d e f g h i j k l m n o
    7 p q r s t u v w x y z { | } ~

(rows 0 and 1, omitted here, are control characters)
I don't have a mnemonic for punctuation characters with second nibble >9, or for the backtick. The @ can be remembered via Ctrl+@, which is a way of typing the NUL character, ASCII 00 (also not coincidental; compare to Ctrl+A, Ctrl+B, Ctrl+C... for inputting ASCII 01, 02, 03...).

Hex 21 through 29 were the shift characters on the numbers on the old Apple ][ keyboard.
> the character you'd get on a traditional U.S. keyboard layout
I use a different layout so I'd never realised there was method to the madness! I get the following
    $ echo -n ' !@#$%^&*(' | xxd -p
    2021402324255e262a28
It’s more the old TTY layout which differs somewhat from the modified typewriter layout that’s become standard for computer keyboards. The old Apple ][ keyboard had 1–9 corresponding to the next row in ASCII, shift-0 was @, I think other characters were ±16 based on shift. Early ASCII implementations were often slightly inconsistent but codings were often based on keyboard layouts.
The order of the punctuation descends from the very first typewriters, in the late 19th century:
https://en.wikipedia.org/wiki/File:Remington_2_typewriter_ke...
The @ for shift-2 replaced the earlier " which you would see on many 1980s-era PCs.
I forget the story about what changed for shift-6 through shift-9.
When I say "traditional U.S. keyboard layout" I mean to contrast this with the modern one, which is the same as what you and I have.
I used to be able to read ASCII flying over a UART using an oscilloscope. I think these days the scopes will decode it for you.
Good times.
Not directly correlated, but I know an old guy who can decode EBCDIC and the credit card positional data format on the fly. And sometimes it was a "feeling": he couldn't explain it properly but knew exactly the value, name and other data.
It was amazing to see him decode VISA and MASTER transactions on the fly in logs and other places.
I've seen that done live, during audits, on live logs on the screen. Needless to say, the audit didn't fly first time round (those logs should have been redacted).
I would hope that these logs don't include the full details of the credit card (such as number/cvv).. if it does, the company that is logging this info could end up having some issues with Visa/MC
Edit: Now that I looked at it a little deeper, I'm assuming they are talking about these[0] sort of files?
PCI DSS is a relatively new thing. Before it, card data flew in the open.
I can do the same with several proprietary network protocols and data formats I've worked on, as well as some x86 Asm - once you start seeing enough of it, you begin to absorb it almost like learning a language.
That's got to be the most niche party trick I've ever heard of.
The encoded JSON string is going to start with "ey", unless there's whitespace in the first couple characters.
Also, it seem like the really important point is kind of glossed over. Base64 is not a kind of encryption, it's an encoding that anybody can easily decode. Using it to hide secrets in a GitHub repo is a really really dumb thing to do.
There is actual encryption here. The base64 JSON only encodes the salt and parameters of the key derivation function used to encrypt the data.
Good knowledge, now explain why it's like that.
{" is ASCII 01111011, 00100010
Base64 takes 3 bytes x 8 bits = 24 bits, groups that 24 bit-sequence into four parts of 6 bits each, and then converts each to a number between 0-63. If there aren't enough bits (we only have 2 bytes = 16 bits, we need 18 bits), pad them with 0. Of course in reality the last 2 bits would be taken from the 3rd character of the JSON string, which is variable.
The first 6 bits are 011110, which in decimal is 30.
The second 6 bits are 110010, which in decimal is 50.
The last 4 bits are 0010. Pad it with 00 and you get 001000, which is 8.
Using an encoding table (https://base64.guru/learn/base64-characters), 30 is e, 50 is y and 8 is I. There's your "ey".
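Or as a one-liner:

    import base64
    print(base64.b64encode(b'{"').decode())  # eyI=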
Funny how CS people are so incurious now; this blog post touches the surface but doesn't get into the explanation.
That’s really a leap about the writer’s interest.
They could just as easily have felt the underlying reason was so obvious it wasn’t worth mentioning.
I know how base64 encoding works but had never noticed the pattern the author pointed out. As soon as I read it, I understood why. It didn't occur to me that the author should have explained it at a deeper level.
Is it? In the first paragraph the author clearly shows his ignorance of base64. When told that it "looks like base64 encoded JSON", he was incredulous but gave it a go, and it worked!!
Even if you don't notice the "ey" specifically, the string itself just screams base64 encoding, regardless of what's actually inside. TBC, I was addressing the parent's suggestion that the writer was incurious.
One blog post is hardly enough to judge someone as ignorant, but after a quick look at the author's writing/coding/job history, I doubt he is that either.
I think it's fantastic that you can look at a string and feel its base64 essence come through without a decoder. Thinking about it for a minute, I suspect I could train myself to do the same. If someone who already knew how to do it well wrote a how-to, I bet it would hit the front page and inspire many people, just like this article did.
I just don't get the urge to dump on the original author for sharing a new-to-him insight.
They were probably expecting base64 encoded binary data. Base64-encoded-binary-inside-Base64-encoded-JSON-inside-JSON is a really strange construction if you haven't encountered it before, because of how much space it's wasting playing a game of Russian nesting dolls.
Just add a layer of compression periodically to reclaim the wasted space and it will all work out.
If you have a compression that works on encrypted data, you can avoid wasting your time on the "encryption".
Base64 isn't encryption. The overhead added follows an extremely predictable pattern. That said I've no idea what the performance of common compression algorithms might be in such a use case. The comment was entirely tongue in cheek.
I think the audience already understands why it works, it's more the knowing there's a relatively small set of mnemonics for these things that's interesting. "eyJ" for JSON, "LS0" for dashes (PEM encoding), "MII" for the DER payload inside a PEM, and so on.
I've been doing this a long time but until today the only one I'd noticed was "MII".
The audience, yes, but the author clearly seems not to have understood it when they wrote this the first time.
> I did a few tests in my terminal, and he was right!
He clearly had no clue how base64 worked. You don't need a test if you know it.
> As pointed out by gnabgib and athorax on Hacker News, this actually detects the leading dashes of the PEM format
They needed help for this. I'm not sure that they've opened Wikipedia to understand how base64 works even now. The whole article has an "it's magic!" vibe.
I'd be very hesitant to consider this as some runaway symbol of "CS people being incurious now" over the author simply not being this deeply invested in this at the time of writing in the context of their discovery, especially since it almost certainly doesn't actually matter for them beyond the pattern existing, if even that does.
> it almost certainly doesn't actually matter for them beyond the pattern existing, if even that does.
https://web.cs.ucdavis.edu/~rogaway/classes/188/materials/th...
And there goes my limit of curiosity now regarding this. I'm interested in what you have to say, but not 25 page mini-novel PDF from someone else interested. I'm glad you enjoyed that piece, but I have no interest in reading it, nor do I think it's reasonable for you to expect me to be interested. Much like with the author and the specifics of this encoding.
[flagged]
I guess I fully deserve this as some sort of karmic retribution, because I'm usually the person in the room who's frustrated about people poking things they don't fully understand, about folks continuing to spitball rather than looking a layer deeper, and the one who over-obsesses over details. It took me a very long time to accept that sometimes ignorance is not only acceptable, but optimal, and it continues to challenge me to this day.
You mention "being hackerly". Imagine you were reverse engineering some gnarly 100 MB obfuscated x86 binary. Surely you can appreciate that especially if you have a specific goal, it is overwhelmingly preferable to guess, experiment, and poke than to kick off some heroic RE effort that will take tens of people years, just so that you can supposedly "fully understand" what's happening. Attention is precious - not everything is worth equal attention. And it is absolutely possible to correctly guess things from limited information, and is even essential to be able to.
You find base64 encoding interesting enough that you were able to either recall detailed facts about its operation from memory here, or look it up quickly to break it down. How is the author, or me, not doing so any evidence for you that we're:
- ignorant about how base64 works and always have been
- don't care about (CS) things at depth in general
These are such immensely strong claims to make. Surely you can appreciate that some people just have different interests sometimes? That they might focus on different things? That they can learn things and then forget about them? That to some level everything is connected, so appealing to that is not exactly some grand revelation of missing a "key piece"?
A few years, or I guess more than just a few years ago, in college, I met up with a former classmate from primary school. He was studying history and shared some great (historical) stories that I really enjoyed. But then another thought formulated in my mind: if I had to actively study this, rather than just catch a story or two, I'd definitely be dropping out. And that's when I realized that there can be value to things, they can be interesting, yet at the same time it's OK for me not to be interested in them or pursue them deeper. Just like how I think it is perfectly OK to be interested in this pattern, but not care for the underlying mapping mechanism, as it is essentially irrelevant. The fun was in the fact, not in the mechanism (in my view for the author anyways).
Wow, that's an amazing story. I'd never read anything by E. M. Forster before, and I certainly wasn't expecting 1920s sci-fi like that.
The author also doesn't explain what JSON is. Because it's obvious to the target audience. There's simply no explanation necessary
I think CS grads often skip the part of how something actually works and are happy with abstractions.
CS post-COVID is in the worst state it has ever been; vibe coding and AI have enabled a category of grifters beyond the wet dreams of blockchain hacks.
Nitpick, but enclosing the first string in single quotes would make the reading better:

    $ echo '{"' | base64

vs.

    $ echo "{\"" | base64
That's a huge pet peeve of mine, but similar to the control-r comment elsewhere, I have just come to terms with the fact that most developers are allergic to shell
Oh that's nifty. Spotting base64 encoded strings is easy enough (and easy enough to test that I give it a shot if I'm even vaguely curious), but I'd never looked at them closely enough to spot patterns.
After copy and pasting enough access tokens into various tools you pick up on it pretty fast.
Isn't this obvious to anyone who has seen a few base64 encoded json strings or certificates? ey and LS are a staple.
`MII` for RSA private keys.
MII is not RSA; it's the opening header of an ASN.1 structure encoded as DER -- 30 82 0x, which is basically the "{" of DER -- and can be pretty much anything from an X.509 certificate to private keys for ECDSA.
The actual RSA OID is somewhere in the middle.
True, but for the most part, RSA keys are the only keys that anyone encounters that start with long SEQUENCEs requiring two-byte lengths.
`ey` could be any JSON, but it's most likely going to be a JWT.
Neither is a perfect signal, but contextually is more likely correct than not.
That depends on the kind of abyss you are staring into. Mine had plenty of non-RSA keys, certificates (which are of course two-byte length all the time) and CMS containers.
IMO depends on your career. I did a lot of pentesting with Burp Suite so I was able to (forced to) pick it up.
Probably is, but I still found it to be a fun tidbit.
I work with this stuff often enough to recognize something that looks like a key or a hash. I don't work with it often enough to have picked up `ey` and `LS`.
I thought so too, but xkcd 1053 / lucky 10000, I guess! I knew about ey but not LS
I debugged way too many JWT tokens
I know eyJhbG by heart
they technically don't need to begin like that! JWT is JSON and is therefore infamously vague... but in practice they for some reason always begin with "alg" so always like eyJhbG
Has anyone tried to send a JWT token with the fields in a different order (e.g. a long key first and key ID and algorithm behind) and see how many implementations will break?
there are better things to do, like send json that has "alg" twice, each different (one of them "none" ideally) and different implementations handle it differently
I didn't even realize I knew that string, but I recognized it immediately from your post.
I love this post style. Never stop learning friend!
Yeah, people are being snarky and saying it's obvious, but it was new to me! I guess I'm not staring at base64 all that often. It's a neat trick though, now I'm going to pay attention next time I have an opportunity to use it.
You should not store sensitive data in a JWT.
A dev who has never seen a base64 string before? Fascinating.
Base64 encoded yaml files will also be LS-prefixed if they have the document marker (---)
That's right, I've added an errata to clarify. Thanks for the heads up!
You can increase the guess accuracy a little by looking for the "tLS" characters, skipping the first 3 chars. This is also a mnemonic about TLS, and it identifies all strings starting with 5 dashes, so it excludes most YAML documents.
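A quick check of both claims:

    import base64
    print(base64.b64encode(b"-----BEGIN CERTIFICATE-----")[:8])  # b'LS0tLS1C' -> "tLS" at offset 3
    print(base64.b64encode(b"---\napiVersion: v1")[:8])          # b'LS0tCmFw' -> no "tLS"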
Are there any secret scanning tools that know about base64 encoded stuff?
Something similar pops up if you have to spend a lot of time looking at binary blobs with a hex editor. Certain common character sequences become familiar. This also leads to choosing magic numbers in data formats that decode to easily recognized ASCII strings. I'm sure if I worked with base64 I'd be choosing something that encoded nicely into particular strings for the same purpose.
Related trick I've learnt: binary data containing lots of 0x40 may be EBCDIC text, or binary data containing embedded EBCDIC strings – 0x40 is EBCDIC space character
Probably not a very useful trick outside of certain specific environments
so, uhh... insurance or banking?
Well duh. It’s a deterministic encoding. Does not matter if it’s base64, hex, or even rot13.
Is this the state of modern understanding of basic primitives?
On mobile, the long rows in the code blocks blow up the layout.
I'm more partial to PCFkb2N0eXBlIGh0bWw+
Funny trivia. But of course -- there is absolutely zero reason to base64 encode ASCII text. Even more laughable to put JSON encoded in base64 inside regular JSON.
I discovered this when I created a JWT system for my internship. I got really good at spotting JWTs, or any base64 encoded json payloads in our Kafka streams
“Welcome to the party, pal!”
babby's first base64
Yikes! It would be smart to bury these strings in an ad hoc obfuscation so they aren’t so obvious.
It doesn’t even need to be much better than ROT13. Security by obscurity is good for this situation.