• woodhull 14 hours ago

    We've used PeerDB's hosted offering for syncing data from Postgres to ClickHouse, both before and after the acquisition by ClickHouse Inc. We've also helped test the integrated sync features in ClickHouse Cloud built on top of PeerDB. We're using it to power customer-facing features within our product.

    It works well. Their team is great. I feel a bit spoiled by how much access we had to the engineering team during the private beta.

    It's great for use cases where it makes sense to sync Postgres tables across to ClickHouse without denormalizing them. PeerDB can transform rows of a single table sent via CDC using Lua scripts, but it can't (yet!) denormalize data into ClickHouse when that data is stored in 3NF across multiple Postgres tables.

    On the ClickHouse query side, we end up wanting denormalized data for query performance and to avoid JOINs. It's frequently not a great idea to query ClickHouse using the same table structure as your transactional DB.

    In our experience we sync a few tables with PeerDB but mostly end up using custom app-level code to sync denormalized data into ClickHouse for our core use cases. Most of the PeerDB-synced tables end up as ClickHouse dictionaries, which we then use in our queries.
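
    To make that concrete, here is a rough sketch of the dictionary pattern (the table, column, and dictionary names are invented for illustration, not our actual schema): a PeerDB-synced table backs a ClickHouse dictionary, and queries then look attributes up with dictGet instead of JOINing.

      -- Dictionary backed by a PeerDB-synced table
      CREATE DICTIONARY tenants_dict
      (
          id UInt64,
          name String
      )
      PRIMARY KEY id
      SOURCE(CLICKHOUSE(TABLE 'tenants'))
      LAYOUT(HASHED())
      LIFETIME(MIN 300 MAX 600);

      -- Queries fetch attributes without a JOIN
      SELECT
          dictGet('tenants_dict', 'name', tenant_id) AS tenant_name,
          count() AS event_count
      FROM events
      GROUP BY tenant_name;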

    PeerDB works well and I like it for what it is. Just don't expect to be satisfied with querying ClickHouse against the same table structure as you've got in Postgres unless your data size is tiny.

    Curious to hear how others are using it and the architectures you've developed.

    • higeorge13 16 minutes ago

      I also agree with most of your comments and conclusions.

      In our setup, we use app-level ingestion with async inserts to send all the denormalised data into ClickHouse, and Debezium/Kafka/the Kafka table engine plus materialized views to sync a few Postgres tables into ClickHouse. Two of the replicated tables are on the order of billions of rows and are used in about 20% of the queries (usually directly, and less frequently with no more than 1-2 joins). Everything else queries the denormalised tables (usually no joins there, only some dictionary usage). Overall query performance is great, although it would be even better if we didn't have to use ReplacingMergeTree and FINAL.
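
      For anyone curious, the Debezium/Kafka path looks roughly like the sketch below. Names, columns, and settings are illustrative rather than our real schema, and it assumes the Debezium envelope has already been flattened to plain JSON rows (e.g. via the ExtractNewRecordState transform).

        -- Kafka engine table reading the CDC topic
        CREATE TABLE orders_queue
        (
            id UInt64,
            tenant_id UInt64,
            status String,
            updated_at DateTime
        )
        ENGINE = Kafka
        SETTINGS kafka_broker_list = 'kafka:9092',
                 kafka_topic_list = 'pg.public.orders',
                 kafka_group_name = 'ch_orders',
                 kafka_format = 'JSONEachRow';

        -- Target table; newer row versions replace older ones at merge time
        CREATE TABLE orders
        (
            id UInt64,
            tenant_id UInt64,
            status String,
            updated_at DateTime
        )
        ENGINE = ReplacingMergeTree(updated_at)
        ORDER BY (tenant_id, id);

        -- Materialized view that moves rows from the queue into the target
        CREATE MATERIALIZED VIEW orders_mv TO orders AS
        SELECT id, tenant_id, status, updated_at
        FROM orders_queue;

        -- FINAL deduplicates at query time; this is the performance hit mentioned above
        SELECT count()
        FROM orders FINAL
        WHERE tenant_id = 42;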

      The 2 main issues that we are facing are:

      - we need to periodically clean up deleted rows from the ReplacingMergeTree tables, since the application does lots of upserts and deleted rows just stay there (see the sketch after this list).

      - there is not much flexibility in the ordering keys of the replicated Postgres tables unless you enable REPLICA IDENTITY FULL. We took that performance hit (although nothing really noticeable on the Postgres side) in order to have some flexibility and better query performance in the replicated tables in ClickHouse.
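
      For reference, both workarounds boil down to something like the following (illustrative names, assuming deletes arrive as a _deleted soft-delete flag on the ClickHouse side):

        -- ClickHouse: periodic cleanup of soft-deleted rows via a lightweight delete
        -- (runs as an asynchronous mutation, so we schedule it off-peak)
        DELETE FROM orders WHERE _deleted = 1;

        -- Postgres: full replica identity so UPDATEs/DELETEs carry all columns,
        -- which gives more freedom when choosing the ClickHouse ordering key
        ALTER TABLE orders REPLICA IDENTITY FULL;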

      • saisrirampur 13 hours ago

        Thanks, Nathan, for chiming in and for all the support during the private beta! <3

        Overall, what you shared makes sense for use cases like yours. However, there are other scenarios, such as multi-tenant SaaS analytics running large-scale workloads with PeerDB/PostgreSQL CDC, where hundreds of tables across different schemas are synced using CDC. Some customers denormalize tables using materialized views (MVs), which are a powerful feature in ClickHouse, while others power dashboards directly with JOINs, relying on the recent JOIN improvements in ClickHouse and suitable, optimized ordering keys such as (tenant_id, id).
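
        As a rough illustration of the MV approach (hypothetical tables, not any specific customer's schema), a materialized view can denormalize at insert time into a target ordered by (tenant_id, id):

          CREATE TABLE orders_denorm
          (
              tenant_id UInt64,
              id UInt64,
              customer_name String,
              amount Decimal(18, 2)
          )
          ENGINE = ReplacingMergeTree
          ORDER BY (tenant_id, id);

          -- Fires on every insert into the CDC-synced orders table; the customer
          -- lookup uses a dictionary to avoid an insert-time JOIN
          CREATE MATERIALIZED VIEW orders_denorm_mv TO orders_denorm AS
          SELECT
              tenant_id,
              id,
              dictGet('customers_dict', 'name', customer_id) AS customer_name,
              amount
          FROM orders;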

        When dealing with hundreds to thousands of tables and a heavily relational schema, building dual-write pipelines with denormalization becomes extremely difficult, especially when the workload involves UPDATEs.

        We have many customers falling into the above bucket, replicating multiple petabytes of data to ClickHouse. A few customer deep dives on this are coming soon! :)

        Side note: we are tracking support for in-transit transformations as a future feature. However, MVs are the way to go; it's more of an ELT approach.

        • JonoBB 7 hours ago

          > On the ClickHouse query side, we end up wanting denormalized data for query performance and to avoid JOINs. It's frequently not a great idea to query ClickHouse using the same table structure as your transactional DB.

          > In our experience we sync a few tables with PeerDB but mostly end up using custom app-level code to sync denormalized data into ClickHouse for our core use cases.

          Have you explored dbt? You may find that using custom code is not scalable, and that dbt solves this exact problem.

          • woodhull an hour ago

            Our use case requires real-time inserts as events happen.

            dbt, as I understand it, is for batch transformations run on a set schedule.

        • jascha_eng 19 hours ago

          Not to sound too sales-y, but if you are looking into ClickHouse and are currently based on Postgres, you might also want to check out Timescale. Since we're just a Postgres extension, it's 100% compatible with existing systems but provides a lot of the same speed benefits as ClickHouse for analytical queries.

          Don't be confused by the time-series branding.

          • simonw 18 hours ago

            I've been very confused by the time-series branding - I had always thought Timescale was purely about adding time-series features to PostgreSQL. I didn't know the extension had other applications.

            Looks like you've expanded into vector indexing - https://github.com/timescale/pgvectorscale - and an extension which bakes RAG patterns (including running prompts from SQL queries) into PostgreSQL: https://github.com/timescale/pgai

            • akulkarni 17 hours ago

              That's interesting. Our first extension (TimescaleDB) is great for time-series and real-time analytics.

              And yes, you are correct: pgvectorscale scales pgvector for embeddings, and pgai includes developer-experience niceties for AI (e.g. automatic embedding management).

              Would love to hear any suggestions on how we could make this less confusing. :-)

              • mritchie712 14 hours ago

                The name of the company is timescale. That’s what’s confusing.

                • simonw 12 hours ago

                  People form initial impressions of a company and what they do, then file those away in their heads and it can be hard to get them to expand their mental model later on.

                  I guess that's why we have marketing teams!

              • qeternity 18 hours ago

                We use TSDB and are pretty happy with it.

                But it is much less performant than CH.

                • klaussilveira 10 hours ago

                  How do you guys fare for ad tech aggregation? We have something similar to this: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

                  But we're actively trying to simplify and remove as many moving parts as possible.

                  • js4ever 17 hours ago

                    Timescale is a very nice product, but not at all close to ClickHouse in terms of speed, based on my own tests on very large tables (billions of rows).

                    • whitepoplar 18 hours ago

                      How does Timescale compare to other extensions like Citus and Hydra/DuckDB?

                    • jascha_eng 17 hours ago

                      I didn't expect so many comments. I'm about to fly across the Atlantic and can't respond properly to everyone right now without internet, but I'll try to do it justice once I'm home.

                      • geoka9 18 hours ago

                        Could you go into the details of how one might go about replicating a PG database to a TSDB one? I assume the application level would not be the most simple/reliable option?

                        • asadawadia 18 hours ago

                          you don't

                          The data stays in Postgres - TimescaleDB is an extension installed on the database server.

                          • akulkarni 17 hours ago

                            Exactly. You can have the best of both worlds with Timescale.

                        • tomnipotent 18 hours ago

                          Not at all too sales-y.

                          I'm all for keeping as much as possible in your initial Postgres deployment. If your team isn't having to work around things and things "just work", it's a wonderful conjunction of requirements and opportunity. It's incredible how much you can get out of a single instance, really remarkable. I'd also add that it's still worth it even if there is a little pain.

                          But I've found that once I cross about 8-12 terabytes of data I need to specialize, and that a pure columnar solution like ClickHouse really begins to shine even compared to hybrid solutions, given the amortized cost of most analytic workloads. This difference quickly adds up, and I think at that scale it makes enough of a difference to the developer experience that a switch is worth considering. Otherwise stick to Postgres and save your org some money and, more importantly, sanity.

                          You reach a point where you have enough queries doing enough work that the extra I/O and memory required by PAX/hybrid formats become noticeably more costly than pure columnar, at least for the workloads I have experience with.

                          ClickHouse is now in my toolbox right alongside Postgres with things to deploy that I can trust to get the job done.

                          • saisrirampur 15 hours ago

                            Great summary and spot on! Once you reach the multi-TB scale, Postgres is hard to scale. Yes, you could push Postgres to larger scales, but it would need deep expertise and careful architecting, and the experience wouldn't be "it just works". Ex-Citus here: we had PB-scale deployments that needed multiple months of effort to implement and an expert team to manage. Eventually many (e.g. Cloudflare, Heap) migrated to purpose-built stores like ClickHouse and SingleStore. And not to forget storage costs: sure, there was compression/columnar storage in Postgres/Citus too, but it didn't fare well compared to pure columnar stores.

                            (Disclaimer: This is Sai from ClickHouse/PeerDB team)

                            • akulkarni 17 hours ago

                              YMMV, but our largest internal dogfooded Timescale instance is hundreds of terabytes:

                              https://www.timescale.com/blog/how-we-scaled-postgresql-to-3...

                              (The post is a year old; IIRC the database is over one petabyte now.)

                              • tomnipotent 2 hours ago

                                Totally doable, of course. But I'll need fewer ClickHouse servers for the same amount of data, and I'll get more utilization out of them with faster query times. High selectivity combined with colocated row data means that hybrid storage formats need to do more I/O, use more memory, and churn through more buffer cache for the same queries.

                              • nijave 17 hours ago

                                A few other things I can think of as well:

                                - you'd probably at least want a read replica so you're not running queries on your primary db

                                - if you're going to the trouble of setting up a column store, it seems likely you'll want to integrate other data sources, so you need some ETL regardless

                                - usually the column store is more OLAP-oriented, with lower memory and fast disks, whereas the operational database is OLTP, with more memory and ideally less disk I/O

                                I suppose you could get some middle ground with PG logical replication if you're mainly integrating PG data sources (rough sketch below).
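
                                A minimal sketch of that middle ground, with made-up names (a publication on the OLTP primary, a subscription on the reporting Postgres instance):

                                  -- On the OLTP primary (requires wal_level = logical)
                                  CREATE PUBLICATION analytics_pub FOR TABLE orders, customers;

                                  -- On the reporting/analytics Postgres instance
                                  CREATE SUBSCRIPTION analytics_sub
                                      CONNECTION 'host=primary.example.com dbname=app user=replicator'
                                      PUBLICATION analytics_pub;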

                            • jedberg 10 hours ago

                              > Once the databases are selected, the next challenge is ensuring reliable, ongoing data replication from PostgreSQL to ClickHouse. While it’s possible to handle this at the application layer by writing data directly into ClickHouse, replicating data at the database level tends to be simpler and more reliable.

                              I'm curious if you have data that backs this up, or if it's more of a "gut feeling" sort of thing. At first blush I agree with you, but at the same time, doing it at the application level opens up so many more possibilities, such as writing "pre-coalesced" data to the data warehouse or pre-enriching the data that goes into it.

                              • saisrirampur 10 hours ago

                                I think it's more about convenience and operational simplicity: managing multiple ingestion pipelines with potentially hundreds of tables at production grade isn't trivial and adds significant overhead. That also includes ensuring data consistency across both targets, monitoring lag, and so on.

                                Secondly, OLAP/DWH systems aren't as friendly as OLTP databases when it comes to UPDATEs/DELETEs. You can't just perform point UPDATEs or DELETEs and call it a day. So why not let a replication tool handle this for you in the most efficient way?
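
                                To illustrate with a hypothetical table, assuming ClickHouse as the target: a point update that is trivial in Postgres becomes an asynchronous mutation in ClickHouse, which is exactly the kind of thing a replication tool batches and models for you.

                                  -- Postgres: a cheap, transactional point update
                                  UPDATE orders SET status = 'shipped' WHERE id = 42;

                                  -- ClickHouse: an asynchronous mutation that rewrites whole data parts;
                                  -- fine occasionally, but not for OLTP-style per-row churn
                                  ALTER TABLE orders UPDATE status = 'shipped' WHERE id = 42;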

                              • spapas82 17 hours ago

                                If anybody's interested, I've written a script in Python that executes a query against a Postgres DB and saves the results to a Parquet file (the script is not as simple as one would expect, because Parquet is not as simple as CSV and because the amount of data may be huge).

                                Parquet is a columnar-friendly format, so the file can then simply be inserted into ClickHouse or DuckDB, or even queried directly.

                                This script and a cron job are enough for my (not very complex) needs for replicating my Postgres data into ClickHouse for fast queries.

                                https://github.com/spapas/pg-parquet-py
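
                                For example, once the Parquet file exists, loading or querying it from ClickHouse is a one-liner (paths and table names below are illustrative; file() resolves paths relative to ClickHouse's user_files directory, and DuckDB can read the same file directly):

                                  -- Load the exported file into an existing table
                                  INSERT INTO events
                                  SELECT * FROM file('exports/events.parquet', 'Parquet');

                                  -- Or query it in place without loading it
                                  SELECT count() FROM file('exports/events.parquet', 'Parquet');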