JoelJacobson 5 hours ago

> The problem with Postgres' NOTIFY is that all notifications go through a single queue!

> Even if you have 20 database connections making 20 transactions in parallel, all of them need to wait for their turn to lock the notification queue, add their notification, and unlock the queue again. This creates a bottleneck especially in high-throughput databases.

We're currently working hard on optimizing LISTEN/NOTIFY: https://www.postgresql.org/message-id/flat/6899c044-4a82-49b...

If you have an actual workload where you are currently experiencing performance/scalability problems, I would be interested in hearing from you, to better understand that workload. In some workloads, you might only listen to a single channel. For such single-channel workloads, the current implementation seems hard to tweak further, given the semantics and in-commit-order guarantees. However, for multi-channel workloads, we could do a lot better, which is what the linked patch is about.

The main problem with the current implementation for multi-channel workloads is that we currently signal and wake all listening backends (a backend is the PostgreSQL process your client is connected to), even if they are not interested in the specific channels being notified in the current commit. This means that if you have 100 connections open, each of which has issued a LISTEN on a different channel, then when someone does a NOTIFY on one of those channels, all 100 backends will be signaled instead of just the one backend that listens on that channel. For multi-channel workloads, this can mean an enormous extra cost from the context switching caused by all that signaling.
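
To make that scenario concrete, here is a minimal sketch (the channel names are just illustrative):

    -- 100 clients, each on its own connection, each listening on its own channel:
    LISTEN channel_1;    -- connection 1
    LISTEN channel_2;    -- connection 2
    -- ... and so on up to ...
    LISTEN channel_100;  -- connection 100

    -- Some other session commits a transaction containing:
    NOTIFY channel_42, 'payload';

    -- Today, all 100 listening backends are signaled and woken up, even though
    -- only the backend behind connection 42 has a matching LISTEN; the other 99
    -- wake up, find nothing relevant for them, and go back to sleep.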

I would greatly appreciate it if you could reply to this comment and describe the workloads where you've had problems with LISTEN/NOTIFY: approximately how many listening backends you had, how many channels, and the mix of volume across those channels. Anything that could help us run realistic simulations of such workloads would improve the benchmark tests we're working on. Thank you.

  • tobyhinloopen 4 hours ago

    We use it like this:

        CREATE FUNCTION public.notify_events() RETURNS trigger
        LANGUAGE plpgsql
        AS $$
        BEGIN
          PERFORM pg_notify('events', row_to_json(NEW)::text);
          RETURN NEW;
        END;
        $$;

        CREATE TRIGGER notify_events_trg AFTER INSERT ON xxx.events FOR EACH ROW EXECUTE PROCEDURE public.notify_events();
    
    
    And then we have a bunch of triggers like this on many tables:

        CREATE FUNCTION public.create_category_event() RETURNS trigger
            LANGUAGE plpgsql SECURITY DEFINER
            AS $$
        DECLARE
          category RECORD;
          payload JSONB;
        BEGIN
          category := COALESCE(NEW, OLD);
          payload := jsonb_build_object('id', category.id);
          IF NEW IS NULL OR NEW.deleted_at IS NOT NULL THEN
            payload := jsonb_set(payload, '{deleted}', 'true');
          END IF;
          INSERT INTO xxx.events (channel, inserted_at, payload)
            VALUES ('category', NOW() AT TIME ZONE 'utc', payload);
          RETURN NULL;
        END;
        $$;

        CREATE TRIGGER create_category_event_trg AFTER INSERT OR DELETE OR UPDATE ON public.categories FOR EACH ROW EXECUTE PROCEDURE public.create_category_event();
    
    We found no notable performance issues. We have a single LISTEN in another application. We did some stress testing and found that it performs way better than we would ever need.
    • JoelJacobson 3 hours ago

      Thanks for the report. For that use case (a single application using a single connection with a LISTEN), it's expected that it should perform well, since there is only a single backend to context-switch to when each NOTIFY signals it.

      • oulipo2 3 hours ago

        Just out of curiosity, could you try to frame in what context this would or would not work? If you have multiple backends with multiple connections for instance? And then if we start with such a "simple" solution and we later need to scale with distributed backends, how should we do this?

        • JoelJacobson 2 hours ago

          In the linked "Optimize LISTEN/NOTIFY" pgsql-hackers, I've shared a lot of benchmark results for different workloads, which also include results on how PostgreSQL currently works (this is "master" in the benchmark results), that can help you better understand the expectations for different workloads.

          The workaround we used at Trustly (a company I co-founded) is a component named `allas`, which a colleague of mine at the time, Marko Tikkaja, created to solve our problems; it massively reduced the load on our servers. Marko has open-sourced and published this work here: https://github.com/johto/allas

          Basically, `allas` opens a single connection to PostgreSQL and LISTENs on all the channels it needs. Clients then connect to `allas` over the PostgreSQL protocol (it essentially impersonates a PostgreSQL server), and when a client does LISTEN on a channel, allas issues the corresponding LISTEN on the real PostgreSQL server over that single connection. Since `allas` is implemented in Go, using Go's efficient goroutines for concurrency, it scales to a very large number of client connections. I'm not a Go expert myself, but my understanding is that Go is well suited to this type of application.

          This component is still being used at Trustly, and is battle-tested and production grade.

          That said, it would of course be much better to avoid the need for a separate component, and fix the scalability issues in core PostgreSQL, so that's what I'm currently working on.

  • oulipo2 3 hours ago

    The post seems to say that NOTIFY is generally not a good idea, while comments here say that NOTIFY can actually work, but that it depends on some particular things (which are not easy to know for newcomers to Postgres). That makes it a bit complicated to know what the way to go is for a new database.

    In my case I have an IoT setting, where my devices can change their "DesiredState", and I want to listen on this to push some message to MQTT... but then there might also be other cases where I want to listen to some messages elsewhere (e.g. do something when there is an alert on a device, or listen to some unrelated object, e.g. users, etc.)

    I'm not clear right now on what the best setup for this would be, the tradeoffs, etc.

    Imagine I have somewhere in the range of 100k to 10M devices, and sometimes these are updated in bulk, changing their DesiredState 10k at a time. Would NOTIFY work in that case? Should I use the WAL/Debezium/etc.?

    Can you try to "dumb down" in which cases we can use NOTIFY/LISTEN and in which cases it's best not to? You say something about single-channel/multi-channel/etc., but as a newcomer I'm not clear on what all of these are.

matharmin 44 minutes ago

We're relying on logical replication heavily for PowerSync, and I've found it is a great tool, but it is also very low-level and under-documented. This article gives a great overview - I wish I had this when we started with our implementation.

Some examples of difficulties we've run into:

1. LSNs for transactions (commits) are strictly increasing, but not for individual operations across transactions. You may not pick this up during basic testing, but it starts showing up when you have concurrent transactions.
2. You cannot resume logical replication in the middle of a transaction (you have to restart the transaction), which becomes relevant when you have large transactions.
3. In most cases, replication slots cannot be preserved when upgrading Postgres major versions.
4. When you have multiple Postgres clusters in an HA setup, you _can_ use logical replication, but it becomes more tricky (better in recent Postgres versions, but you're still responsible for making sure the slots are synced).
5. Replication slots can break in many different ways, and there's no good way to know all the possible failure modes until you've run into them. Especially fun when your server ran out of disk space at some point. It's a little better with Postgres 17+ exposing wal_status and invalidation_reason on pg_replication_slots (see the example query below).
6. You need to make sure to acknowledge keepalive messages and not only data messages, otherwise the WAL can keep growing indefinitely when you don't have incoming changes (depending on the hosting provider).
7. Common drivers often either don't implement the replication protocol at all, or attempt to abstract away low-level details that you actually need. Here it's great that the article actually explains the low-level protocol details.
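
For point 5, a query along these lines is a reasonable starting point for monitoring slot health (a rough sketch; wal_status is available since Postgres 13, invalidation_reason since Postgres 17):

    SELECT slot_name,
           active,
           wal_status,           -- 'reserved' / 'extended' / 'unreserved' / 'lost'
           invalidation_reason,  -- PG 17+: why the slot was invalidated, if it was
           pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
           ) AS retained_wal
    FROM pg_replication_slots;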

jumski 41 minutes ago

Worth mentioning that the WAL enables and is used by Supabase Realtime [0] and WalEx [1], which both do a very similar thing: polling WAL changes in order to emit them as WebSocket events.

I've never worked with WalEx, but I do have experience with Supabase Realtime: it is a joy to work with and fits great into the Supa ecosystem. So many layers disappear when you rely more on Postgres!

[0] https://github.com/supabase/realtime [1] https://github.com/cpursley/walex

decasia an hour ago

I misinterpreted the title and was hoping that this was going to be a post about realtime algorithmic music generation from the Postgres WAL, something like the Hatnote “listen to Wikipedia edits” project.

http://listen.hatnote.com/

JoelJacobson 6 hours ago

> If you call pg_notify or NOTIFY inside a trigger, it will get called 100,000 times and send out 100,000 notifications if you change 100,000 rows in a single transaction which from a performance perspective is ... not ideal.

This is only true if those notifications are different; if they are identical, such as when the notification simply alerts listeners that some table has new data (for cache invalidation), they are sent out as one notification only. See the source code comment in async.c:

     *   Duplicate notifications from the same transaction are sent out as one
     *   notification only. This is done to save work when for example a trigger
     *   on a 2 million row table fires a notification for each row that has been
     *   changed. If the application needs to receive every single notification
     *   that has been sent, it can easily add some unique string into the extra
     *   payload parameter.
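
A minimal way to see this behavior (channel and payload names are just for illustration):

    -- Session A:
    LISTEN events;

    -- Session B, all in one transaction:
    BEGIN;
    NOTIFY events, 'table_changed';         -- kept
    NOTIFY events, 'table_changed';         -- identical: collapsed into the one above
    NOTIFY events, 'table_changed:row=42';  -- distinct payload: delivered separately
    COMMIT;

    -- On COMMIT, session A receives exactly two notifications on channel "events":
    -- 'table_changed' (once) and 'table_changed:row=42'.
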
isidor3 4 hours ago

I was hoping from the title that this would be about creating an audio representation so you could listen to the database work, as if it were an old PC that had distinctive sounds for different tasks.

  • aitchnyu 2 hours ago

    We were kids learning C++ at school and we giggled each time we printed the \a character (system bell). At home I piped /dev/mem to /dev/dsp to get the harshest white noise.

JoelJacobson 4 hours ago

> It works, but suddenly your query times explode! Instead of doing 1 million transactions per second* you can now do only 3 (*These numbers were exaggerated for dramatic effect)

In general, a single-queue design doesn’t make throughput collapse when you add more parallelism; it just gives you a fixed ceiling. With a well-designed queue, throughput goes up with concurrency, then flattens when the serialized section (the queue) saturates, maybe sagging a bit from context switching.

If instead you see performance severely degrade as you add workers, that typically means there’s an additional problem beyond “we have one queue” — things like broadcast wakeups (“every event wakes every listener”), global scans on each event, or other O(N) work per operation. That’s a very different, and more serious, scalability bug than simply relying on a single queue.
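
One way to make that distinction concrete (my framing here, borrowing Gunther's Universal Scalability Law; σ and κ are model parameters, not Postgres settings):

    Throughput(N) ≈ N / (1 + σ·(N − 1) + κ·N·(N − 1))

The contention term σ models time spent serialized on the shared queue; with κ = 0, throughput merely flattens out near 1/σ as N grows. The crosstalk term κ models per-operation work that scales with the number of participants, such as waking every listener on every notify, and any κ > 0 makes throughput peak and then fall back toward zero, which is the kind of degradation people report.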

davidelettieri 6 hours ago

It's worth mentioning Debezium: https://debezium.io/

It allows you to publish all changes from the DB to Kafka.

  • umanwizard 3 hours ago

    Perhaps the situation has gotten better since I looked a few years ago, but my experience is the Debezium project doesn’t really guarantee exactly-once delivery. Meaning that if row A is replaced by row B, you might see (A, -1), (A, -1), (B, +1), if for example Debezium was restarted at precisely the wrong time. Then if you’re using this stream to try to keep track of what’s in the database, you will think you have negatively many copies of A.

    It sounds silly, but it caused enormous headaches and problems for the project I was working on (Materialize), one of whose main use cases is creating incrementally maintained live materialized views on top of replicated Postgres (or MySQL) data.

    • gunnarmorling 3 hours ago

      Debezium generally produces each change event exactly once if there are no unclean connector shut-downs. If that's not the case, I'd consider this a bug which ought to be fixed.

      (Disclaimer: I used to lead the Debezium project)

  • oulipo2 3 hours ago

    Does it handle the things the post mentions about the ever-growing WAL, and the fact that some listeners can go offline and need to catch up on old messages (e.g. if Kafka crashes)?

    • gunnarmorling 3 hours ago

      Robustness is a key design goal of Debezium. It supports heartbeating to address WAL growth issues (I wrote about that issue at [1]). If Kafka crashes (or Debezium itself), it will resume consuming the replication slot from where it left off before (applying at-least-once semantics, i.e. there can be duplicates in case of an unclean shut-down).

      Naturally, if the consumer is down, the WAL retained for that replication slot continues to grow until it comes back up again, hence monitoring is key (or the slot gets invalidated at a certain threshold, in which case it will restart with a new initial snapshot).

      Disclaimer: I used to lead the Debezium project

      [1] https://www.morling.dev/blog/mastering-postgres-replication-...

edistra 2 hours ago

Thanks for the deep dive on Postgres WAL.

I recently had the opportunity to play with the PostgreSQL WAL while implementing an OpenSearch CDC pipeline, and it was really exciting to see what it's possible to achieve with it.

Be cautious with idle replication slots though; I got bitten by inactive slots filling up the production database's storage.

PostgreSQL 18 introduces idle_replication_slot_timeout to mitigate this.

  • gunnarmorling an hour ago

    For those not on 18 yet, there's also max_slot_wal_keep_size, which invalidates slots based on the amount of WAL they retain rather than based on a period of inactivity. It's available since PG 13.
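
    A rough sketch of how one might set that guardrail (the value is a placeholder, not a recommendation; max_slot_wal_keep_size only needs a config reload):

        -- Cap how much WAL any replication slot may retain (PG 13+):
        ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
        SELECT pg_reload_conf();

        -- Slots that fall too far behind then show wal_status = 'unreserved'
        -- and eventually 'lost' in pg_replication_slots.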

antirez 4 hours ago

Related: Redis has keyspace notifications, a feature doing something similar that is not very well known, but when it is needed, it is really needed. We are thinking of extending this mechanism in new ways. Similarly, I have seen setups where the notifications arriving from Postgres/MySQL are used to materialize (and keep updated) a cached view in Redis. To me, it is interesting how certain teams rely on these kinds of mechanisms provided by database systems, while other teams like to do things on the client side to have full control, even in the face of having to reimplement some logic.

a_t48 5 hours ago

I had to set one of these up somewhat under duress to replace a poorly written scheduled query to export data to BigQuery. It’s nice to know what it’s actually doing under the hood, thanks for the info.

hbarka 7 hours ago

I remember decades ago working with something similar in SQL Server called transaction log shipping. It worked on the same principle, shipping the write-ahead log to another server, and was an ETL method for incremental change capture to a data warehouse.

morshu9001 7 hours ago

This is cool, but also, why is pg_notify designed with such a contentious queue?

  • JoelJacobson 2 hours ago

    It's a common misconception that the single queue is a poor design choice. The user reports of notifications/second severely degrading with lots of backends cannot be explained by the single-queue design alone. An efficient implementation of a single queue should flatten out as parallelism increases, not degrade and go towards zero.