• Onavo 2 days ago

    Congrats on reinventing the data lake? This is actually how most of the newer generations of "cloud native" databases work, where they separate compute and storage. The key is that they have a more sophisticated caching layer so that the latency cost of a query can be amortized across requests.

    • mbrt 2 days ago

      It's my understanding that the newer generation of data lakes still relies on a tiny, strongly consistent metadata database to keep track of what is where. This is orders of magnitude smaller than what you'd have by putting everything in the same database, but it's still there. This is also the case in newer data streaming platforms (e.g. https://www.warpstream.com/blog/kafka-is-dead-long-live-kafk...).

      I'm curious to hear if you have examples of any database using only object storage as a backend, because back when I started, I couldn't find any.

      • Onavo 2 days ago

        Love your article by the way. Not an expert but off the top of my head:

        https://docs.datomic.com/operation/architecture.html

        (However they cheat with dynamo lol)

        There's also some listed here

        https://davidgomes.com/separation-of-storage-and-compute-and...

        • mbrt 2 days ago

          OK, thanks for the reference. Yeah, so indeed separating storage and compute is nothing new. Definitely not claiming I invented that :)

          And as you mention, Datomic uses DynamoDB as well (so, not a pure S3 solution). What I'm proposing is to use only object storage for everything, paying the price in latency without giving up on throughput, cost, or consistency. The differentiator is that this comes with strict serializability guarantees, so this is not an eventually consistent system (https://jepsen.io/consistency/models/strong-serializable).

          No matter how sophisticated the caching is, if you want to retain strict serializability, writes must be confirmed by S3 and reads must be validated against S3 before returning, which puts a lower bound on latency.
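
          To make that lower bound concrete, here is a rough sketch of the idea, not the actual implementation from the blog. It assumes an object store that exposes a version per object and a conditional "put if the version still matches" operation. FakeObjectStore, CachedClient and put_if_version are hypothetical names made up for this illustration, and the store is just an in-memory dict. Every read and every write still makes at least one call to the store, even when the value is cached.

```python
class PreconditionFailed(Exception):
    """Raised when a conditional write sees an unexpected version."""


class FakeObjectStore:
    """In-memory stand-in for an object store with conditional writes (hypothetical)."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)
        self.round_trips = 0  # counts calls that would hit the network

    def head(self, key):
        # Return only the current version (0 means "does not exist").
        self.round_trips += 1
        return self._objects.get(key, (0, None))[0]

    def get(self, key):
        # Return (version, value) for the key.
        self.round_trips += 1
        return self._objects.get(key, (0, None))

    def put_if_version(self, key, value, expected_version):
        # Conditional write: succeeds only if nobody wrote in between.
        self.round_trips += 1
        current = self._objects.get(key, (0, None))[0]
        if current != expected_version:
            raise PreconditionFailed(f"{key}: expected v{expected_version}, found v{current}")
        self._objects[key] = (current + 1, value)
        return current + 1


class CachedClient:
    """Client with a local cache that never trusts the cache blindly."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # key -> (version, value)

    def read(self, key):
        cached = self.cache.get(key)
        # Validation round trip: required even on a cache hit.
        if cached is not None and cached[0] == self.store.head(key):
            return cached[1]
        version, value = self.store.get(key)
        self.cache[key] = (version, value)
        return value

    def write(self, key, value):
        # The write is acknowledged only after the store confirms it.
        expected = self.cache.get(key, (0, None))[0]
        new_version = self.store.put_if_version(key, value, expected)
        self.cache[key] = (new_version, value)


store = FakeObjectStore()
a, b = CachedClient(store), CachedClient(store)
a.write("k", "v1")
b.read("k")          # b caches version 1
b.write("k", "v2")   # succeeds: b's cached version is current
latest = a.read("k") # a's cache is stale, validation forces a refetch
```

          The cache saves bandwidth on a hit (a version check instead of a full fetch), but it cannot remove the round trip itself, which is why the latency floor stays.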

          I focused a lot on throughput, which is the one we can really optimize.

          Hopefully that's clear from the blog, though.

  • undefined 3 days ago
    [deleted]
    • svrakitin 3 days ago

      Pretty cool! Do you have any ideas already about how to make it work with S3, considering it doesn't support If- headers?