
I need to integrate data from 3 different PostgreSQL databases (OLTP application backends) into a data warehouse. For the warehouse itself I am considering Citus or Greenplum. There is a requirement that the data from the applications has to be synced with the data warehouse as close to real time as possible (anything above a 3-5 minute delay is unacceptable; real-time replication would be best). In this regard I have the following questions:

  1. Will Postgres logical replication work with Citus? Citus is a Postgres extension, so can you treat a Citus cluster as an ordinary Postgres database? If so, logical replication should theoretically work, but how does it deal with distributed tables?
  2. Greenplum is a Postgres fork, so will Postgres logical replication work with it at all? I have also read that Greenplum is not optimized for OLTP workloads; does that mean it will break when I try to ingest OLTP data into it?
  3. If logical replication does not work with Citus/Greenplum, how do I stream data from Postgres? Do I need to stream logical-level WAL into Kafka and then write custom logic to translate it into SQL statements on the target database? Are there any tools for that?

Bonus question: does anyone have experience with both Citus and Greenplum, especially with their SQL limitations? I know that Citus does not fully support correlated subqueries and recursive CTEs; does Greenplum have any similar limitations?

I would appreciate any help with these questions. I tried googling, but there is little to no info on the subject; could you please give at least some direction?


2 Answers


After some research I've managed to come up with an answer. Kudos to @a_horse_with_no_name for pointing me in the right direction.

Real-time sync between different data sources can be done using Kafka Connect.

Kafka Connect lets you send data from various systems to a Kafka cluster via Source Connectors and read data from Kafka via Sink Connectors. All the platform-specific code for getting data from a system into Kafka, or from Kafka into a system, is abstracted away by the maintainers of the specific connector.

Connector maintainers usually respect a kind of "contract" between Sources and Sinks, i.e. they agree in advance on a normalized message format, which lets you combine disparate Sources and Sinks (e.g. you can have a PostgreSQL Source feeding Elasticsearch and Neo4j Sinks in parallel, even though those systems have completely different paradigms).
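To make that contract concrete, here is a heavily simplified sketch of the payload of a Debezium change event for a row update; the table and values are invented, and the schema part of the message is omitted:

    {
      "before": { "id": 42, "status": "pending" },
      "after":  { "id": 42, "status": "shipped" },
      "source": { "db": "app1", "schema": "public", "table": "orders" },
      "op": "u",
      "ts_ms": 1620000000000
    }

Any sink that understands this envelope (or a flattened version of it, see below) can consume the stream, regardless of what the target system is.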

For syncing Postgres with Citus/Greenplum, this should be possible with the Debezium PostgreSQL Source Connector and the JDBC Sink Connector (both Citus and Greenplum can be written to over JDBC).
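A minimal sketch of the two connector configs, in the JSON form you would POST to the Kafka Connect REST API. Hostnames, credentials, topic and table names are placeholders, and exact property names can differ between connector versions (e.g. newer Debezium releases renamed database.server.name to topic.prefix):

    {
      "name": "pg-source",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "app-db-1",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "secret",
        "database.dbname": "app1",
        "database.server.name": "app1"
      }
    }

    {
      "name": "dw-sink",
      "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "connection.url": "jdbc:postgresql://dw-coordinator:5432/warehouse",
        "connection.user": "etl",
        "connection.password": "secret",
        "topics": "app1.public.orders",
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
      }
    }

Note the unwrap transform: the JDBC Sink expects a flat row, so Debezium's before/after envelope has to be flattened with the ExtractNewRecordState SMT before the sink can write it.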

After some research I found that the latest Greenplum version (6.19) is based on Postgres 9.4. Upsert mode in the JDBC Sink Connector relies on INSERT ... ON CONFLICT, which is only available in Postgres 9.5+. So with Greenplum, only the insert and update values of the connector's "insert.mode" property can be used.
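For reference, upsert mode makes the sink generate statements of this shape (hypothetical table), which a 9.4-based server such as Greenplum 6 rejects as a syntax error:

    INSERT INTO orders (id, status)
    VALUES (42, 'shipped')
    ON CONFLICT (id)                      -- requires Postgres 9.5+
    DO UPDATE SET status = EXCLUDED.status;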

@Denis Arharov Can you provide some details on how you set up the JDBC Sink Connector? In particular, I'm interested in Greenplum upserts with the sink connector.