Comments Page - Reliably replicating data between Postgres and ClickHouse

« Back Reliably replicating data between Postgres and ClickHousebenjaminwootton.comSubmitted by saisrirampur 4 months ago

woodhull 4 months ago
We've used PeerDB's hosted offering for sync'ing data from Postgres to Clickhouse both pre and post acquisition by Clickhouse Inc. We've also helped test the integrated sync features in Clickhouse Cloud built on top of PeerDB. We're using it to power customer facing features within our product.
It works well. Their team is great. I feel a bit spoiled having had as much access to the engineering team during the private beta as we've experienced.
It's great for use cases where it makes sense to sync postgres tables across to clickhouse without denormalizing them. PeerDB can transform rows in a single table sent via CDC using a lua scripting language, but it can't (yet!) denormalize data into clickhouse that is stored in 3NF on Postgres across multiple tables.
On the clickhouse query side, we end up wanting denormalized data for query performance and to avoid JOINs. It's frequently not a great idea to query in clickhouse using the same table structure as you're using in your transactional db.
In our experience we sync a few tables with PeerDB but mostly end up using app-level custom code to sync denormalized data into Clickhouse for our core use-cases. Most of the PeerDB sync'd tables end up as Clickhouse Dictionaries which we then use in our queries.
PeerDB works well and I like it for what it is. Just don't expect to be satisfied with querying in Clickhouse against the same table structure as you've got in Postgres unless your data size is tiny.
Curious to know about how others are using it and the architectures you've developed.
- saisrirampur 4 months ago
  Thanks, Nathan, for chiming in and for all the support during the private beta! <3
  Overall, what you shared makes sense for use cases like yours. However, there are other scenarios—such as multi-tenant SaaS analytics running large-scale workloads with PeerDB/PostgreSQL CDC. In these cases there are 100s of tables across different schemas that are synced using CDC. Some customers denormalize tables using materialized views (MVs), which is a powerful feature in ClickHouse, while others power dashboards directly with JOINs using the recent JOIN improvements in ClickHouse and suitable/optimized order keys (tenant_id,id).
  When dealing with 100s to 1000s of tables and a heavily relational schema, building dual-write pipelines with denormalization becomes extremely difficult—especially when the workload involves UPDATEs.
  We have many customers falling in the above bucket, replicating multiple petabytes of data to ClickHouse. A few customer deep dives on this are coming soon! :)
  Side note: We are tracking support for in-transit transformations as a future feature. However, MVs are the way to go—more of an ELT approach.
- sarwarbproton 4 months ago
  Disclaimer: I work for Timeplus in the field team.
  This is exactly the kind of problem we've been solving with a few of our customers. With Timeplus, we can listen to Kafka and then do streaming joins to create denormalized records to send downstream to ClickHouse. Traditionally we did this with stream processing and this would build up really large join state in memory for when cardinality on the join keys would get very large (think 100s of millions of keys).
  This has recently been improved with two additional enhancements: 1. You can setup the join states to use hybrid memory/disk based hash tables in Timeplus Enterprise if you still want to keep the join happening locally (assume all data in the join is still coming in via Kafka) and maintaining high throughput
  2. Alternatively, where you have slow changing data on the right hand side(s), we can use a Kafka topic on the left hand side and do direct lookups against MySQL/Postgres/etc on each change on the LHS. This takes a hit throughput but may be ok for say 100s of records per second per join. There's an additional caching capability with TTL here to allow for the most frequently accessed reference data to be kept locally so that future joins are faster.
  On additional benefit from using Timeplus to send data downstream to ClickHouse is being able to batch appropriately so that it is not emitting lots of small writes to ClickHouse.
- higeorge13 4 months ago
  I also agree with most of your comments and conclusions.
  In our setup, we use app ingestion to send all the denormalised data into Clickhouse using async inserts and Debezium/Kafka/Kafka engine and materialized views to sync a few Postgres tables into Clickhouse. 2 of the replicated tables are in the order of billions of rows, and are used in 20% of the queries (usually directly and less frequently with no more than 1-2 joins). Everything else queries the denormalised tables (usually no joins there, only some dictionary usage). Overall query performance is great, although it would have been even better since we use replacing merge trees and final.
  The 2 main issues that we are facing are:
  - we need to periodically cleanup the deleted rows from the replacing merge trees, since the application does lots of upserts and deleted rows just stay there.
  - there is not much flexibility in the ordering keys of the replicated Postgres tables, unless you enable full replica identity. We took that performance hit (although nothing really noticeable in Postgres side) in order to have some flexibility and better query performance in the replicated tables in Clickhouse.
  saisrirampur 4 months ago
  Great points! We're making progress on improving both of the issues you mentioned.
  1. For deleted rows, you can create policies to simplify querying. However, periodic deletions are still necessary. We've been optimizing lightweight deletes/updates to improve performance, which should help with automatic deletions.
  2. For the second issue, refreshable materialized views with different order keys than raw tables are an option worth considering. However, having it in real time for tables with billions of rows might not be viable. That said, processing within tens of minutes to a few hours could work. We're tracking that the order key serves a dual role—as both a deduplication key and a skip index—which is the root cause of this issue of enabling REPLICA IDENTITY on Postgres side.
  Separately, working on a guide covering best practices for Postgres to ClickHouse data modeling, detailing these concepts further. More on this coming soon!
- JonoBB 4 months ago
  > On the clickhouse query side, we end up wanting denormalized data for query performance and to avoid JOINs. It's frequently not a great idea to query in clickhouse using the same table structure as you're using in your transactional db.
  > In our experience we sync a few tables with PeerDB but mostly end up using app-level custom code to sync denormalized data into Clickhouse for our core use-cases.
  Have you explored dbt? You may find that using custom code is not scalable, and that dbt solves this exact problem.
  ZiliangXK 4 months ago
  This is exactly a use case a streaming processing like Timeplus excels, which help users do 20s streaming to dynamic table joins (imagine the hand side tables are dynamically updates) to denormalize data before the data lands in ClickHouse. This helps a lot for warehouse like ClickHouse to save the joins in them. Disclaimer, I am one of the Timepluer core engine engineer.
  woodhull 4 months ago
  Our use case requires real-time inserts as events happen.
  dbt is as I understand it for batch processing transformations on a set schedule.
jascha_eng 4 months ago
Not to sound too sales-y but if you are looking into clickhouse and are currently based on postgres, you might also want to check out timescale. Since we're just a postgres extension it's 100% compatible with existing systems but provides a lot of the same speed benefits as clickhouse for analytical queries.
Don't be confused by the timeseries branding.
- simonw 4 months ago
  I've been very confused by the timeseries branding - I had always thought timescale was purely about adding time series features to PostgreSQL. I didn't know the extension had other applications.
  Looks like you've expanded into vector indexing - https://github.com/timescale/pgvectorscale - and an extension which bakes RAG patterns (including running prompts from SQL queries) into PostgreSQL: https://github.com/timescale/pgai
  akulkarni 4 months ago
  That's interesting. Our first extension (TimescaleDB) is great for time-series and real-time analytics.
  And yes you are correct, pgvectorscale scales pgvector for embeddings, and pgai includes dev experience niceties for AI (eg automatic embedding management).
  Would love to hear any suggestions on how we could make this less confusing. :-)
  mritchie712 4 months ago
  The name of the company is timescale. That’s what’s confusing.
  simonw 4 months ago
  People form initial impressions of a company and what they do, then file those away in their heads and it can be hard to get them to expand their mental model later on.
  I guess that's why we have marketing teams!
  jascha_eng 4 months ago
  Honestly, before I joined timescale I had the same impression. Since then I learned that a bunch of the improvements timescale brings (continuous aggregates, hybrid row/column storage and automatic partitioning) are much more widely useful than just IOT sensor data. There is definitely some room for improvement just in communication there.
  Ajay already commented, that he's open to new ideas on how to frame timescale. I for myself always thought of postgres as the best jack of all trades database. It's basically the best db if you don't 100% know what's the best choice yet. Timescale expands on that and enhances postgres's capabilities even further so that use-cases which would usually call for a second storage option (e.g. analytics, vector) end up working great with just postgres itself.
  I'd personally love if we also had a full-text offering akin to paradedb/pg_search so noone would ever need to host elasticsearch again. But it also doesn't make sense to spread the valuable postgres expert resources too thin.
- tomnipotent 4 months ago
  Not at all too sales-y.
  I'm all for keeping as much as possible in your initial Postgres deployment as possible. If your team isn't having to work around things and things "just work" it's a wonderful conjunction of requirements and opportunity. It's incredible how much you can get out of a single instance, really remarkable. I'd also add it's still worth it even if there is a little pain.
  But I've found that once I cross about 8-12 terabytes of data I need to specialize, and that a pure columnar solution like ClickHouse really begins to shine even compared to hybrid solutions given the amortized cost of most analytic workloads. This difference quickly adds up and I think at that scale really makes a difference to the developer experience that a switch is worth the consideration. Otherwise stick to Postgres and save your org some money and more importantly sanity.
  You reach a point when you have enough queries doing enough work that the extra I/O and memory required by PAX/hybrid becomes noticeably more costly than pure columnar, at least for the workloads that I have experience with.
  ClickHouse is now in my toolbox right alongside Postgres with things to deploy that I can trust to get the job done.
  saisrirampur 4 months ago
  Great summary and spot on! Once you reach that TBs scale Postgres is hard to scale. Yes, you could make Postgres scale to larger scales but it would need deep expertise and architecting and the experience wouldn’t be “it just works”. Ex-Citus here, we had PB scale deployments which needed multiple months of effort to implement and an expert team to manage. Eventually many (ex: CloudFlare, Heap) migrated to purpose built stores like ClickHouse, Singlestore. And not to forget storage costs. Sure there was compression/columnar in Postgres/Citus too - but it didn’t fare well compared to pure columnar stores.
  (Disclaimer: This is Sai from ClickHouse/PeerDB team)
  akulkarni 4 months ago
  YMMV but our largest internal dogfooded Timescale instance is 100s of terabytes
  https://www.timescale.com/blog/how-we-scaled-postgresql-to-3...
  (Post is a year old, IIRC the database is over one petabyte now)
  tomnipotent 4 months ago
  Totally doable, of course. But I'll need fewer ClickHouse servers for the same amount of data, and I'll get more utilization out of them with faster query times. High selectivity combined with colocated row data means that hybrid storage formats will need to read more I/O, use more memory, and churn through more buffer for the same queries.
  jascha_eng 4 months ago
  At the risk of getting my CEO angry (sorry Ajay :D): ClickHouse is great. But it also means hosting another database and losing ACID compliance, the question is often not ClickHouse vs Timescale but Postgres + ClickHouse vs just Timescale.
  In general the argument I was originally trying to make is not to never use ClickHouse, I think it's a great product. But if you already are on postgres, it might just be easier to give Timescale a try than to adapt everything to work with ClickHouse right away. There is more to consider here than raw query speed.
  And while I'm sure the systems behave differently scale and speed wise, I also wouldn't say Timescale looses straight up, there is situations where Timescale is faster and if it really breaks down for a use-case nothing stops you from still doing the postgres to ClickHouse migration. In the end timescale is just a better postgres, so there is no lock in.
  nijave 4 months ago
  A few other things I can think of as well
  - you'd probably at least want a read replica so you're not running queries on your primary db
  - if you're going to the trouble of setting up a column store, it seems likely you're wanting to integrate other data sources so need some ETL regardless
  - usually column store is more olap with lower memory and fast disks whereas operational is oltp with more memory and ideally less disk io usage
  I suppose you could get some middle ground with PG logical rep if you're mainly integrating PG data sources
- qeternity 4 months ago
  We use TSDB and are pretty happy with it.
  But it is much less performant than CH.
  jascha_eng 4 months ago
  Glad you like it! And yes I'm not saying Timescale is better than Clickhouse generally. But it does avoid having to host a second database, you keep ACID compliance, your application code doesn't have to change... There is more to analytics than raw speed, and even in raw speed we're slowly catching up. For some types of queries TS actually performs better than Clickhouse afaik, but I'm not a benchmarking expert so take it with a grain of salt.
  Always choose the right tool for the right job, Clickhouse is amazing software. I just wanted to mention that if someone currently runs analytics queries via postgres and runs into performance issues, trying out timescale doesn't really hurt and might be a simpler solution than migrating to Clickhouse.
- js4ever 4 months ago
  Timescale is a very nice product but not at all close to clickhouse in terms of speed based on my own tests on very large tables (billions of rows)
- whitepoplar 4 months ago
  How does Timescale compare to other extensions like Citus and Hydra/DuckDB?
  jascha_eng 4 months ago
  I'm not an expert on either of those, but this is my take anyway: Citus is distributed and afaik just eventually consistent, for a lot of folks that's simply not an option as their application relies on ACID compliance.
  As far as I understand Hyda + DuckDB it is a higher level add-on onto postgres than timescale is. This is on the one hand nice since you can just put it into an existing database without any migration effort whatsoever. But this also means they likely don't really interact with systems like the storage engine. For example Timescales deeper integration allows us to actually store the data differently on disk which allows for saving space via compression.
  mdaniel 4 months ago
  for one thing, depending on your licensing contraints:
  - Citus is AGPLv3 https://github.com/citusdata/citus/blob/v13.0.1/LICENSE
  - Hydra is Apache 2 https://github.com/hydradatabase/columnar/blob/v1.1.2/LICENS...
  - Timescale is mostly Apache 2 https://github.com/timescale/timescaledb/blob/2.18.2/LICENSE
- klaussilveira 4 months ago
  How do you guys fare for ad tech aggregation? We have something similar to this: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
  But actively trying to simplify and remove as many gears as possible.
- jascha_eng 4 months ago
  I didn't expect so many comments. I'm about to fly cross Atlantic and can't answer appropriately to everyone right now without internet but will try to do it justice once I'm home.
- geoka9 4 months ago
  Could you go into the details of how one might go about replicating a PG db to a tsdb one? I assume application level would not be the most simple/reliable?
  jascha_eng 4 months ago
  As other folks have already mentioned, since timescale is just a postgres extension you can install it on your existing postgres instance and don't need to migrate anything.
  Of course if you come to our cloud you're going to have to do some sort of migration effort but that shouldn't be more complicated than going from one postgresdb to another.
  asadawadia 4 months ago
  you don't
  the data stays in PGDB - TSDB is an extension installed onto the data base server
  akulkarni 4 months ago
  Exactly. You can have the best of both worlds with Timescale.
spapas82 4 months ago
If anybody's interested I've written a script in python that executes a query in a postgres db and saves the results to a parquet file (the script is not as simple as one would expect because of the fact that parquet is not as simple as csv and because the amount of data may be huge).
The parquet file is a columnar friendly friendly that can then be simply inserted to clickhouse or duckdb or even queried directly.
This script and a cron job are enough for my (not very complex) needs on replicating my postgres data on clickhouse for fast queries.
https://github.com/spapas/pg-parquet-py
- zX41ZdbW 4 months ago
  ClickHouse can query PostgreSQL directly, plus you can create a refreshable materialized view, which will do it automatically on a schedule:
  https://clickhouse.com/docs/sql-reference/table-functions/po...
  https://clickhouse.com/docs/materialized-view/refreshable-ma...
  Additionally, you can set up incremental import with https://clickhouse.com/blog/postgres-cdc-connector-clickpipe...
- asah 4 months ago
  nice. there's also https://github.com/exyi/pg2parquet which is written in rust. Not sure of pros vs cons.
  if you use AWS, you can upload to s3 and query via Athena. AWS Glue is a little klunky but it works and if your query load is small then it's cheap and very reliable.
- halfcat 4 months ago
  Thanks for sharing. I’m curious, would pandas, polars, or similar work here (read_sql/to_parquet)? Or is there a challenge around larger than memory data sets (or some other issue I’m not thinking of)?
  spapas82 4 months ago
  That's a good question. Unfortunately I haven't tried it because I am not really familiar with pandas.
jedberg 4 months ago
> Once the databases are selected, the next challenge is ensuring reliable, ongoing data replication from PostgreSQL to ClickHouse. While it’s possible to handle this at the application layer by writing data directly into ClickHouse, replicating data at the database level tends to be simpler and more reliable.
I'm curious if you have data that backs this up, or if it's more of a "gut feeling" sort of thing. At first blush, I agree with you, but at the same time, by doing it at the application level, it opens up so many more possibilities, such as writing "pre-coalesced" data to the data warehouse or pre-enriching the data that goes into the data warehouse.
- saisrirampur 4 months ago
  I think it's more about convenience / operational simplicity — managing multiple ingestion pipelines with potentially 100s of tables at a production grade isn’t trivial and adds significant overhead. This also includes ensuring data consistency across both targets, monitoring lag, and so on.
  Secondly, OLAP/DWH systems aren’t as friendly as OLTP databases when it comes to UPDATEs/DELETEs. You can’t just perform point UPDATEs or DELETEs and call it a day. So why not let a replication tool handle this for you in the most efficient way.
lukaslalinsky 4 months ago
With all these CDC solutions based on logical replication, how are PostgreSQL failovers handled? You can't really rewind ClickHouse. Does it leave it with inconsistencies? Does it rebuild the DB fully?
- saisrirampur 4 months ago
  Planned failovers can be managed through maintenance windows by recreating replication slots before maintenance ends. Unplanned failovers require resyncs, which are heavily optimized with PeerDB (https://blog.peerdb.io/parallelized-initial-load-for-cdc-bas...).
  With PG17+, this shouldn’t be an issue due to failover slots.
  Another important observation: Aurora Postgres, which is used by many customers, persists slots during failover, so this isn’t a problem at all. PeerDB has built-in retries that resume from the last committed source LSN.
- higeorge13 4 months ago
  I guess you talk about hard failures or replication slot deletion, right? Debezium has a snapshot mode (1) to refetch everything from scratch, we have tested it but opted out of using it because it would put pressure on postgres with the queries on huge tables and on kafka and its storage. Imo we prefer to manually rebuild the clickhouse tables via postgres snapshots and latest cdc data, but there might be better options or ideas.
  (1) https://debezium.io/documentation/reference/stable/connector...