
If you can have one event with multiple "things to process" or one event per thing to process, what is considered the best practice? Bear in mind that the former option may involve persisting the payload in a DB (because it's too big), whilst the latter may mean thousands more events; I'm unsure of the real disadvantage of having thousands.

This is in the context of AWS using SQS queues, EventBridge, and Lambdas (possibly with Step Functions depending on the approach). These physical resources place constraints on the event size, but may also affect the best choice in other ways.

Example: A service called 'File Splitter' takes large files and splits them into paragraphs. Each paragraph needs to be read by a downstream service called 'Paragraph Processor' that will do something with it (e.g., add up the words).

Options:

  1. The File Splitter service persists the paragraphs in a DB and then publishes a single 'ProcessParagraphs' event that includes a key to the paragraphs in the DB. Then, the Paragraph Processor handles that event by reaching out to the File Splitter and asking for the paragraphs from the DB. Or

  2. The File Splitter service publishes one 'ProcessParagraph' (singular) event per paragraph, each containing the paragraph to be processed. The event size is under the 256 KB limit of SQS messages. The Paragraph Processor then handles each event individually, possibly in batches of up to 10,000 as allowed by SQS batching (a sketch of this option follows below).
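For concreteness, here is a minimal sketch of option 2 in Python with boto3; the queue URL, event fields, and function name are made up for illustration. Note that `send_message_batch` accepts at most 10 messages per call, which is separate from the consumer-side batch size of up to 10,000.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL; in practice this would come from configuration.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/paragraph-queue"

def publish_paragraph_events(file_id: str, paragraphs: list[str]) -> None:
    """Publish one 'ProcessParagraph' event per paragraph (option 2)."""
    # SendMessageBatch accepts at most 10 entries per call.
    for start in range(0, len(paragraphs), 10):
        chunk = paragraphs[start:start + 10]
        entries = [
            {
                "Id": str(start + i),  # must be unique within this batch call
                "MessageBody": json.dumps({
                    "type": "ProcessParagraph",
                    "fileId": file_id,
                    "paragraphIndex": start + i,
                    "content": text,  # body must stay under the 256 KB limit
                }),
            }
            for i, text in enumerate(chunk)
        ]
        response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        # Individual entries can fail even when the HTTP call succeeds.
        if response.get("Failed"):
            raise RuntimeError(f"Failed to publish entries: {response['Failed']}")
```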

1 Answer


Setting aside the term 'best practice', there are some considerations.

First off, when it comes to messaging, forget any ideas you might have about 'guaranteed' delivery etc. It's very common for messages to be 'lost' or mishandled in such systems. Almost every time I've worked with someone who is new to messaging, they fail to understand that getting a message from a queue is a non-safe operation. That is, unlike reading a row from a DB, reading a message from a queue/topic modifies the state of the queue/topic, at least from the consumer's perspective. If you fail to process the message successfully but commit the read, that message may be 'gone'. It's very common for this to be the cause of bugs.
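To make that concrete, here is a minimal polling sketch in Python with boto3 (the queue URL and `process_paragraph` are placeholders): the message is only deleted after it has been processed successfully, so a failure leaves it on the queue to reappear after the visibility timeout instead of being 'gone'.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/paragraph-queue"  # hypothetical

def process_paragraph(body: str) -> None:
    ...  # placeholder for the real work

def consume_once() -> None:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,   # up to 10 per ReceiveMessage call
        WaitTimeSeconds=20,       # long polling
    )
    for message in response.get("Messages", []):
        try:
            process_paragraph(message["Body"])
        except Exception:
            # Do NOT delete: the message becomes visible again after the
            # visibility timeout and can be retried (or routed to a DLQ).
            continue
        # Deleting only after success is what safely 'commits the read'.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```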

For that reason, I have become very wary of using queues as the primary means of tracking work to be done, and using them as the primary storage mechanism for that work is a terrible idea for anything that matters.

The fact that you are using a DB for storage means that the latter is not a relevant issue here. But the former is a concern: you should track the processing status of each paragraph in the DB, if you are not already doing so.

This brings me to the question of your DB design. Is there a reason you don't store the paragraphs independently in the DB? That is, why not split the paragraphs in the DB and store them as separate rows? That would allow you to track their status independently and, in the case of a failure, reprocess only the failed rows. It also makes it much easier to track the processing in a distributed design.
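As an illustration, assuming DynamoDB (the table name and attribute names here are my assumptions, not anything from your question), each paragraph could be its own item keyed by file and paragraph index, with a status attribute that gets flipped as work completes:

```python
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "paragraphs"  # hypothetical table: partition key file_id, sort key paragraph_index

def store_paragraph(file_id: str, index: int, content: str) -> None:
    """File Splitter writes each paragraph as its own row, initially PENDING."""
    dynamodb.put_item(
        TableName=TABLE,
        Item={
            "file_id": {"S": file_id},
            "paragraph_index": {"N": str(index)},
            "content": {"S": content},
            "status": {"S": "PENDING"},
        },
    )

def mark_processed(file_id: str, index: int) -> None:
    """Paragraph Processor marks a single row done; failed rows stay PENDING."""
    dynamodb.update_item(
        TableName=TABLE,
        Key={
            "file_id": {"S": file_id},
            "paragraph_index": {"N": str(index)},
        },
        # 'status' is a DynamoDB reserved word, so alias it in the expression.
        UpdateExpression="SET #s = :done",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":done": {"S": "PROCESSED"}},
    )
```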

And that brings us to the next point: what is messaging good for, then? The main value of messaging in common contemporary systems is the ease of distributing work in a very flexible way. Queues work well with pull models where you can increase and decrease the number of workers as needed to keep up with load. They also handle spikes in load well, in that you can accept more load than you can process for short periods of time without failures.
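One rough sketch of what that flexibility looks like (the queue URL and worker count are arbitrary): every worker runs the same receive loop against the same queue, so you scale simply by running more or fewer copies of it.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/paragraph-queue"  # hypothetical
NUM_WORKERS = 8  # turn this knob up or down to match the load

def worker() -> None:
    sqs = boto3.client("sqs")  # one client per thread
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # queue drained; a long-running service would keep polling
        for message in messages:
            ...  # process the paragraph, as in the earlier sketch
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
            )

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for _ in range(NUM_WORKERS):
        pool.submit(worker)
```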

My recommendation would be to split the paragraphs in the DB, which would get you out of LOB (large object) territory based on your description of the paragraph size. Then I would probably publish an event per paragraph.

Whether the event contains the content of the paragraph or just a reference to where it can be retrieved is a little nuanced. On one hand, it's possible that the read from the queue involves just as much system and network overhead as retrieving from a DB; the assumption that the queue brings the message content 'closer' to the client is not necessarily true, though there may be something to that in a multi-region implementation. On the other hand, if you have a high number of workers, you might have contention on reads from the DB, which would indicate putting the content right on the queue. You should read a little about how SQS manages and distributes messages, and also understand what your DB can handle. Then do some profiling and monitor the performance once you have a working solution.
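If the consumer ends up being a Lambda behind the SQS queue, the handler might look roughly like the sketch below. The payload fields and the `fetch_from_db` helper are assumptions on my part, and returning `batchItemFailures` only has an effect if the event source mapping has "report batch item failures" enabled; with it, only the failed paragraphs are retried rather than the whole batch.

```python
import json

def fetch_from_db(file_id: str, paragraph_index: int) -> str:
    """Hypothetical lookup used when the event carries only a reference."""
    raise NotImplementedError

def count_words(text: str) -> int:
    return len(text.split())

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(record["body"])
            # Content-on-the-queue vs. reference-in-the-DB: support either.
            text = payload.get("content") or fetch_from_db(
                payload["fileId"], payload["paragraphIndex"]
            )
            count_words(text)  # the downstream "do something" from the question
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    # Partial batch response: only the failed messages become visible again.
    return {"batchItemFailures": failures}
```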
