
This is basically a logging/counting application that counts the number of packets, the type of each packet, etc. on a P2P chat network. This equates to about 4-6 million packets in a 5-minute period. And because I only take a "snapshot" of this information, I remove packets older than 5 minutes only once every five minutes. So the maximum number of items that will be in this collection is 10 to 12 million.

Because I need to make 300 connections to different superpeers, each packet may be inserted at least 300 times (which is probably why holding this data in memory is the only reasonable option).

Currently, I have been using a Dictionary for storing this information. But because of the large number of items I'm trying to store, I run into issues with the large object heap, and memory usage grows continuously over time.

Dictionary<ulong, Packet>

public class Packet
{
    public ushort RequesterPort;
    public bool IsSearch;
    public string SearchText;
    public bool Flagged;
    public byte PacketType;
    public DateTime TimeStamp;
}

I have tried MySQL, but it was not able to keep up with the amount of data that I need to insert (while checking that each packet was not a duplicate), and that was while using transactions.

I tried MongoDB, but its CPU usage was insane and it did not keep up either.

My main issue arises every 5 minutes, when I remove all packets that are older than 5 minutes and take a "snapshot" of the data. I use LINQ queries to count the number of packets of a given packet type. I also call Distinct() on the data: I strip 4 bytes (the IP address) out of the KeyValuePair's key, combine it with the RequesterPort value in the KeyValuePair's Value, and use that to get a distinct count of peers across all the packets.
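
For reference, the snapshot queries look roughly like this (simplified; assume here that the ulong key packs the peer's IP address into its top 4 bytes, and searchType is just an example value):

// Count packets of a given type.
int typeCount = packets.Values.Count(p => p.PacketType == searchType);

// Distinct peers: take the IP from the key's top 4 bytes, pair it with the port.
int distinctPeers = packets
    .Select(kvp => ((kvp.Key >> 32) << 16) | kvp.Value.RequesterPort)
    .Distinct()
    .Count();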

The application currently hovers around 1.1GB of memory usage, and when a snapshot is taken it can go as far as doubling that.

Now this wouldn't be an issue if I had an insane amount of RAM, but the VM I have this running on is limited to 2GB of RAM at the moment.

Is there any easy solution?

Josh

5 Answers


Instead of having one dictionary and searching it for entries that are too old, have 10 dictionaries. Every 30 seconds or so, create a new "current" dictionary and discard the oldest dictionary, with no searching at all.

Next, when you're discarding the oldest dictionary, put all its old objects onto a recycle stack (FILO) for later reuse; then, instead of using "new" to create objects, pop an old object off the stack and reinitialize it with a method (unless the stack is empty). This can avoid a lot of allocations and a lot of garbage-collection overhead.
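
A minimal sketch of both ideas, assuming the Dictionary<ulong, Packet> from the question (the 10-generation window and method names here are illustrative, not prescribed):

Dictionary<ulong, Packet>[] generations = new Dictionary<ulong, Packet>[10];
int current = 0;
Stack<Packet> pool = new Stack<Packet>();   // recycled Packet objects

// Call once at startup, then every 30 seconds: retires the oldest
// generation without any searching.
void Rotate()
{
    current = (current + 1) % generations.Length;
    var oldest = generations[current];
    if (oldest != null)
        foreach (var p in oldest.Values)
            pool.Push(p);                   // keep old objects for reuse
    generations[current] = new Dictionary<ulong, Packet>();
}

// Use instead of "new Packet()" on the insert path.
Packet Acquire()
{
    return pool.Count > 0 ? pool.Pop() : new Packet();
}

Note that duplicate checks and counting queries now have to consult every generation, not just one dictionary.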

Brendan

Simple approach: try memcached.

  • It is optimized to run tasks like this.
  • It can reuse spare memory on less busy boxes, not only on your dedicated box.
  • It has a built-in cache expiration mechanism, which is lazy, so there are no hiccups.

The downside is that it's memory-based and has no persistence. If an instance goes down, the data is gone. If you need persistence, serialize the data yourself.

More complex approach: try Redis.

  • It is optimized to run tasks like this.
  • It has a built-in cache expiration mechanism.
  • It scales / shards easily.
  • It has persistence.

The downside is that it's slightly more complex.
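
For instance, with Redis's per-key TTLs the 5-minute window takes care of itself. A sketch using the StackExchange.Redis client (the client choice, key names, and variables here are assumptions, not from this answer):

using StackExchange.Redis;

var redis = ConnectionMultiplexer.Connect("localhost");
var db = redis.GetDatabase();

// Store each packet with a 5-minute TTL; Redis expires it for you.
db.StringSet("packet:" + packetId, serializedPacket, TimeSpan.FromMinutes(5));

// Keep running counters per packet type instead of scanning at snapshot time.
db.StringIncrement("count:type:" + packetType);

Counting by packet type then becomes a handful of reads at snapshot time instead of a scan over millions of entries.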

9000

The first thought that springs to mind is: why wait 5 minutes? Could you take the snapshots more often and thus reduce the big spike you see at the 5-minute boundary?

Secondly, LINQ is great for concise code, but in reality LINQ is syntactic sugar over "regular" C#, and there is no guarantee that it generates the most optimal code. As an exercise, you could try rewriting the hot spots without LINQ; you may not improve performance, but you will have a clearer idea of what you are doing, and it will make profiling easier.
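
For example, a type count like the one in the question could be hand-rolled (a sketch, assuming the Dictionary<ulong, Packet> from the question and an illustrative targetType variable):

// LINQ version: concise, but hides an enumerator and a delegate call per element.
int count = packets.Values.Count(p => p.PacketType == targetType);

// Hand-rolled equivalent: same result, easier to profile and tune.
int count2 = 0;
foreach (Packet p in packets.Values)
    if (p.PacketType == targetType)
        count2++;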

Another thing to look at is data structures. I don't know exactly what you do with your data, but could you simplify what you store in any way? Could you use a string or byte array and extract the relevant parts from those items as you need them? Could you use a struct instead of a class, and even do something evil with stackalloc to set aside memory and avoid GC runs?

Steve

You don't have to store all the packets for the queries you mentioned. For example, for the packet-type counter:

You need two arrays:

int[] packageCounters = new int[NumberOfTotalTypes];
int[,] counterDifferencePerMinute = new int[6, NumberOfTotalTypes];

The first array keeps track of how many packets of each type are currently in the window. The second array keeps track of how many packets were added during each minute, so you know how many to remove at every one-minute interval. I hope you can tell that the second array is used as a circular FIFO queue.

So for each packet, the following operations are performed:

packageCounters[packageType] += 1;
counterDifferencePerMinute[current, packageType] += 1;
if (oneMinutePassed) {
  // Advance to the oldest one-minute slot, subtract its counts, then reset it.
  current = (current + 1) % 6;
  for (int i = 0; i < NumberOfTotalTypes; i++) {
    packageCounters[i] -= counterDifferencePerMinute[current, i];
    counterDifferencePerMinute[current, i] = 0;
  }
}

At any time, the packet counters can be retrieved by index instantly, and we don't need to store all the packets.

Codism

(I know this is an old question, but I ran across it while looking for a solution to a similar problem where the gen-2 garbage collection pass was pausing the app for several seconds, so I'm recording this for other people in a similar situation.)

Use a struct rather than a class for your data (but remember it's treated as a value, with pass-by-copy semantics). This takes out one level of searching that the GC has to do on each mark pass.

Use arrays (if you know the size of the data you are storing) or List<T>, which uses arrays internally. If you really need fast random access, use a dictionary of array indices. This takes out another couple of levels (or a dozen or more if you are using a SortedDictionary) that the GC has to search.

Depending on what you are doing, searching a list of structs may even be faster than the dictionary lookup (due to memory locality); profile for your particular application.

The combination of struct and List reduces both the memory usage and the size of the garbage collector's sweep significantly.
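
A minimal sketch of that layout, based on the Packet fields from the question (the index dictionary and Add method are illustrative assumptions):

public struct PacketStruct
{
    public ushort RequesterPort;
    public bool IsSearch;
    public bool Flagged;
    public byte PacketType;
    public DateTime TimeStamp;
    // Note: keeping the string SearchText field would reintroduce a heap
    // reference per element that the GC must trace.
}

List<PacketStruct> packets = new List<PacketStruct>();
Dictionary<ulong, int> indexByKey = new Dictionary<ulong, int>();

// Insert only if the key hasn't been seen (the duplicate check from the question).
void Add(ulong key, PacketStruct p)
{
    if (!indexByKey.ContainsKey(key))
    {
        indexByKey[key] = packets.Count;
        packets.Add(p);
    }
}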

Malcolm