
I'm looking at building a message queuing library in Go, which will be used as part of a larger application. A doubly-linked list seems like a sensible approach for an in-memory data structure, but if the list grows to, say, 4 million items of 10 KB each, it is no longer guaranteed to fit in memory, and I also need it to be persistent across restarts.

Most of the activity on the queue will be writing to the end and reading and deleting from the beginning; however, there are cases where consuming a message fails for reasons unrelated to the queue itself, and the item needs to be moved to the end of the queue.

I'm familiar with some data structures that work well on disk for other specific cases. A log-structured merge tree seems great for random-access data, but not (as far as I understand) when the reads and writes are localized to the beginning and end of the structure. B-trees and B+ trees also seem intended more for random access and traversal.

Curious what algorithms or approaches exist that are already tailored to this problem.

NOTE: Regarding SQLite, one of my concerns is the compaction time. The queue contents are likely to have very high turnover, and it's important that any compaction that needs to happen can run on a fairly large data set without a very large performance impact. I.e. if there are 4 GB worth of messages sitting in the queue, I don't want a scenario where things just lock up for 2 minutes while compaction runs, which is what I suspect would happen with SQLite.

3 Answers


I think I have enough information to provide an answer. It may not be what you're looking for, though.

From what we discussed and I understood, we have the following scenario:

  • A large number of messages to be processed;
  • Persistence needed to handle clean reboots (I'm not talking about power surges or power outages);
  • Messages of varying size;
  • A need for a FIFO structure;
  • The possibility of using the filesystem as storage;
  • A need to be portable within Go-supported environments (macOS, Linux, Windows);
  • A need to be self-hosted; an external queue provider cannot be used.

That being said, I'd architect something that would rely 100% on the filesystem.

Store each message as a single file, name the files with a timestamp or a counter you control from a reliable source, and let the filesystem's capabilities work for you.

You would be limited by what the filesystem can do for you (number of files per directory, number of directories per disk, maximum size per file, etc.), but it's flexible, easy to debug and monitor, and extremely portable, allowing you to increase the speed of your IO by just changing hardware around (HD -> SSD -> RAID -> storage server, etc.).
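A minimal sketch of that idea in Go might look like the following; the package name, the zero-padded counter file names, and the .msg suffix are illustrative assumptions, not part of the answer:

```go
package fsqueue

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"sync/atomic"
)

// Queue stores one message per file inside dir; file names are a
// monotonically increasing counter, so lexical order equals FIFO order.
// (On startup, seq would have to be recovered from the existing file names.)
type Queue struct {
	dir string
	seq uint64
}

// Push writes the message to a new file. Writing to a temporary name and
// renaming keeps half-written files from ever being picked up.
func (q *Queue) Push(msg []byte) error {
	n := atomic.AddUint64(&q.seq, 1)
	final := filepath.Join(q.dir, fmt.Sprintf("%020d.msg", n))
	tmp := final + ".tmp"
	if err := os.WriteFile(tmp, msg, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, final)
}

// Pop returns and deletes the oldest message (the smallest file name).
// A (nil, nil) result means the queue is currently empty.
func (q *Queue) Pop() ([]byte, error) {
	names, err := filepath.Glob(filepath.Join(q.dir, "*.msg"))
	if err != nil || len(names) == 0 {
		return nil, err
	}
	sort.Strings(names)
	msg, err := os.ReadFile(names[0])
	if err != nil {
		return nil, err
	}
	return msg, os.Remove(names[0])
}
```

Re-queuing a message whose consumption failed is then just another Push of the same payload, which places it at the end of the queue.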

You will have to study different filesystems to push each one to the limits, but the most common ones can handle these requirements very well:

Data below copied from an excellent answer on Stack Overflow:

FAT32:

  • Maximum number of files: 268,173,300
  • Maximum number of files per directory: 2^16 - 1 (65,535)
  • Maximum file size: 2 GiB - 1 without LFS, 4 GiB - 1 with

NTFS:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum file size
    • Implementation: 2^44 - 2^6 bytes (16 TiB - 64 KiB)
    • Theoretical: 2^64 - 2^6 bytes (16 EiB - 64 KiB)
  • Maximum volume size
    • Implementation: 2^32 - 1 clusters (256 TiB - 64 KiB)
    • Theoretical: 2^64 - 1 clusters

ext2:

  • Maximum number of files: 10^18
  • Maximum number of files per directory: ~1.3 × 10^20 (performance issues past 10,000)
  • Maximum file size
    • 16 GiB (block size of 1 KiB)
    • 256 GiB (block size of 2 KiB)
    • 2 TiB (block size of 4 KiB)
    • 2 TiB (block size of 8 KiB)
  • Maximum volume size
    • 4 TiB (block size of 1 KiB)
    • 8 TiB (block size of 2 KiB)
    • 16 TiB (block size of 4 KiB)
    • 32 TiB (block size of 8 KiB)

ext3:

  • Maximum number of files: min(volumeSize / 2^13, numberOfBlocks)
  • Maximum file size: same as ext2
  • Maximum volume size: same as ext2

ext4:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum number of files per directory: unlimited
  • Maximum file size: 2^44 - 1 bytes (16 TiB - 1)
  • Maximum volume size: 2^48 - 1 bytes (256 TiB - 1)

Another way of approaching this is to store the messages in blocks of a reasonable size, with each block stored in a file on disk and read in (or possibly memory-mapped) when needed.

Blocks could be, for example, 64 MB each. Each one is a file, named with a timestamp and a random number. The purpose of grouping messages is to prevent a proliferation of small files from becoming a bottleneck at the filesystem level: if the message count reaches into the millions, it could prove difficult to navigate the filesystem efficiently to find the first and last message. Blocks must be larger than the largest possible message, and at least a few of them must be guaranteed to fit in memory at a time.

Folder prefixing can be used to make navigation simpler. For example, if the file's timestamp is 201706021852, it could live in the 20170602/18 folder. The prefixing could be configurable for different predicted message volumes if needed.
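A small helper along these lines might derive the prefix path like so (the day/hour split and the .blk suffix are assumptions for illustration):

```go
package blockstore

import (
	"fmt"
	"path/filepath"
)

// blockPath maps a block timestamp such as "201706021852" to
// <root>/20170602/18/201706021852-<rand>.blk: a day/hour prefix keeps
// any single directory from growing without bound.
func blockPath(root, ts string, rnd uint32) string {
	day, hour := ts[:8], ts[8:10]
	return filepath.Join(root, day, hour, fmt.Sprintf("%s-%08x.blk", ts, rnd))
}
```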

When blocks are read into memory, a doubly-linked list would work as the in-memory representation. The first and last blocks should be easy to find on disk by crawling the folder structure; these blocks are loaded into memory and the in-memory versions are worked on, with messages pushed to the back and popped from the front. Consumed messages in a block could simply be flagged as such, and once every message in a block has been consumed, the block is deleted.
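For illustration, the in-memory structures could look roughly like this in Go, using container/list for the chain of blocks (field names and types are assumptions, not a fixed format):

```go
package blockstore

import "container/list"

// Message is a queue entry plus a flag set once a consumer has
// acknowledged it.
type Message struct {
	Body     []byte
	Consumed bool
}

// Block is one on-disk file loaded into memory.
type Block struct {
	Path     string
	Messages []Message
}

// fullyConsumed reports whether every message in the block has been
// consumed, i.e. whether the backing file can be deleted.
func (b *Block) fullyConsumed() bool {
	for _, m := range b.Messages {
		if !m.Consumed {
			return false
		}
	}
	return true
}

// Queue keeps only the head and tail blocks resident: pops flag
// messages in the front block, pushes append to the back block.
type Queue struct {
	blocks *list.List // of *Block, oldest at Front()
}
```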

At whatever intervals are appropriate, the loaded blocks would be flushed to disk and everything fsync()ed. Operations that are guaranteed to be atomic (rename) would be used to reliably know whether a write succeeded.
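A hedged sketch of that flush path, assuming a block has already been serialized to a byte slice:

```go
package blockstore

import (
	"os"
	"path/filepath"
)

// flushBlock writes the serialized block to a temporary file, fsyncs it,
// and then renames it over the final path. The rename is atomic on POSIX
// filesystems, so readers only ever see a complete block. (For stricter
// durability, the containing directory could also be fsynced afterwards.)
func flushBlock(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".block-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // fsync the contents
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```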

The above structure should work well as long as you never need to find a message in the middle of the queue (which would require a full scan of all blocks).


And yet another solution would be to just use a SQLite table. Internally, SQLite uses B-trees for both table data and indexes (not specifically optimized for head- and tail-heavy reads and writes, but not particularly inefficient either), and it will re-use pages that become completely empty. It may very well be that SQLite does the job with reasonable performance and never requires compaction, since a queue consumed in sequence should free whole pages as all of their records are deleted in order. This needs to be tested.
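If that route were tested, the schema and access pattern could be as simple as the following sketch (the driver choice, table name, and function names are assumptions on my part):

```go
package sqlitequeue

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // any SQLite driver would do
)

// Open creates the queue table if it does not yet exist. The
// monotonically increasing id gives FIFO ordering for free.
func Open(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS queue (
		id   INTEGER PRIMARY KEY AUTOINCREMENT,
		body BLOB NOT NULL
	)`)
	return db, err
}

// Push appends a message to the tail of the queue.
func Push(db *sql.DB, body []byte) error {
	_, err := db.Exec(`INSERT INTO queue(body) VALUES(?)`, body)
	return err
}

// Pop removes and returns the message at the head of the queue.
// With concurrent consumers this should be wrapped in a transaction.
func Pop(db *sql.DB) ([]byte, error) {
	var id int64
	var body []byte
	err := db.QueryRow(`SELECT id, body FROM queue ORDER BY id LIMIT 1`).Scan(&id, &body)
	if err != nil {
		return nil, err // sql.ErrNoRows when the queue is empty
	}
	_, err = db.Exec(`DELETE FROM queue WHERE id = ?`, id)
	return body, err
}
```

A message whose consumption failed can simply be pushed again, which moves it to the tail of the queue.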