53

One of my friends was asked this interview question -

"There is a constant flow of numbers coming in from some infinite list of numbers, out of which you need to maintain a data structure to return the top 100 highest numbers at any given point of time. Assume all the numbers are whole numbers only."

This is simple: you need to keep a list sorted in descending order and keep track of the lowest number in that list. If a new number is greater than that lowest number, you remove the lowest number and insert the new number into the sorted list at the right position.
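For concreteness, this approach can be sketched in Python using the standard `bisect` module (the list is kept ascending here, so the lowest kept number sits at index 0; the function names are illustrative):

```python
import bisect

def make_top_k(k):
    """Return an add() function that maintains the k highest numbers seen."""
    best = []  # sorted ascending; best[0] is the lowest kept number

    def add(x):
        if len(best) < k:
            bisect.insort(best, x)   # still filling up: insert in order
        elif x > best[0]:
            best.pop(0)              # evict the current lowest
            bisect.insort(best, x)   # insert the new number in sorted position
        return best

    return add

add = make_top_k(3)
for n in [5, 1, 9, 2, 7, 8]:
    top = add(n)
# top is now [7, 8, 9]: the three highest numbers seen
```

Note that `bisect.insort` itself costs O(k) because of the element shifts, which is exactly what the extended question below is probing at.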

Then question was extended -

"Can you make sure that the Order for insertion should be O(1)? Is it possible?"

As far as I knew, even if you add a new number to the list and sort it again using any sort algorithm, the best you can do is O(n log n), e.g. with quicksort (I think). So my friend said it was not possible. But the interviewer was not convinced; he asked my friend to maintain some other data structure rather than a list.

I thought of a balanced binary tree, but even there insertion is not O(1). So now I have the same question: is there any data structure that can do insertion in O(1) for the above problem, or is it not possible at all?

11 Answers

35

Let's say k is the number of highest numbers you want to keep (100 in your example). Then you can add a new number in O(k), which is also O(1), because O(k*g) = O(g) when k is a nonzero constant.

duedl0r
  • 487
  • 4
  • 7
19

Keep the list unsorted. Figuring out whether or not to insert a new number will take a longer time, but the insertion will be O(1).
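A minimal Python sketch of this trade-off (the names are illustrative):

```python
def add_unsorted(best, x, k=100):
    """Append is O(1); deciding whether to insert, and evicting the
    minimum, each cost an O(k) scan of the unsorted list."""
    if len(best) < k:
        best.append(x)       # O(1) insert while filling up
    else:
        lo = min(best)       # O(k) scan for the current minimum
        if x > lo:
            best.remove(lo)  # O(k) removal of the minimum
            best.append(x)   # O(1) insert
    return best

best = []
for n in [3, 1, 4, 1, 5]:
    add_unsorted(best, n, k=3)
# best now holds 3, 4 and 5, in no particular order
```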

12

This is easy. The size of the list is constant, therefore the time to sort the list is constant. An operation that executes in constant time is said to be O(1). Therefore sorting the list is O(1) for a list of fixed size.

9

Once you pass 100 numbers, the maximum cost you'll ever incur for the next number is the cost to check whether the number belongs in the highest 100 (let's label that CheckTime) plus the cost to enter it into that set and eject the lowest one (let's call that EnterTime). Both are constant time (at least for bounded numbers), i.e. O(1).

Worst = CheckTime + EnterTime

Next, if the distribution of numbers is random, the average cost decreases the more numbers you have. For example, the chance that you will have to enter the 101st number into the maximum set is 100/101, for the 1000th number it is 100/1000 = 1/10, and for the nth number it is 100/n. Thus, our equation for the average cost will be:

Average = CheckTime + EnterTime * (100 / n)

Thus, as n approaches infinity, only CheckTime is important:

Average = CheckTime

If the numbers are bound, CheckTime is constant, and thus it is O(1) time.

If the numbers are not bound, the check time will grow with more numbers. Theoretically, this is because if the smallest number in the maximum set gets big enough, your check time will be greater because you'll have to consider more bits. That makes it seem like it will be slightly higher than constant time. However, you could also argue that the chance that the next number is in the highest set approaches zero as n approaches infinity and so the chance you'll need to consider more bits also approaches 0, which would be an argument for O(1) time.

I'm not positive, but my gut says that it is O(log(log(n))) time. This is because the chance that the lowest number increases is logarithmic, and the chance that the number of bits you need to consider for each check is logarithmic as well. I'm interested in other people's takes on this, because I'm not really sure...

Briguy37
  • 271
  • 1
  • 6
7

This one is easy if you know binary heaps. Binary heaps support insertion in average constant time, O(1), and give you cheap access to the smallest element (in a min-heap), which is exactly the comparison you need for this problem.
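This can be made concrete with Python's `heapq` module, which implements a binary min-heap (the function name is illustrative):

```python
import heapq

def top_k_stream(stream, k=100):
    """Track the k largest numbers with a size-k min-heap: heap[0] is the
    smallest kept number, so most incoming numbers cost one comparison."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop the min, push x: one O(log k) step
    return sorted(heap, reverse=True)   # highest first, built only on demand

top = top_k_stream([9, 4, 7, 1, 8, 2], k=3)
# top == [9, 8, 7]
```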

ratchet freak
  • 25,986
6

If by the question the interviewer really meant to ask "can we make sure each incoming number is processed in constant time", then, as many have already pointed out (e.g. see @duedl0r's answer), your friend's solution is already O(1), and it would be even if he had used an unsorted list, bubble sort, or whatever else. In that case the question does not make much sense, unless it was a trick question or you remember it wrongly.

I assume the interviewer's question was meaningful, that he was not asking how to make something to be O(1) which is very obviously already that.

Because questioning algorithmic complexity only makes sense when the size of some input grows indefinitely, and the only input that can grow here is the list size (the 100), I assume the real question was: "can we make sure we get the Top N spending O(1) time per number (not O(N) as in your friend's solution), is it possible?"

The first thing that comes to mind is counting sort, which buys O(1) time per number for the Top-N problem at the price of O(m) space, where m is the length of the range of incoming numbers. So yes, it is possible.
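A hedged sketch of the counting-sort idea, assuming a known, bounded range of whole numbers (the tiny `max_value` here is only for illustration; all names are invented):

```python
def make_counting_topn(max_value, n=100):
    """Counting-sort flavoured Top-N: O(1) work per incoming number, at the
    price of O(m) space for the value range [0, max_value]."""
    tally = [0] * (max_value + 1)

    def add(x):
        if tally[x] < n:    # cap each count at n; extra copies can never matter
            tally[x] += 1

    def top_n():
        out = []
        for v in range(max_value, -1, -1):  # walk the range from high to low
            out.extend([v] * min(tally[v], n - len(out)))
            if len(out) == n:
                break
        return out

    return add, top_n

add, top_n = make_counting_topn(max_value=9, n=3)
for x in [5, 1, 9, 2, 9, 7]:
    add(x)
# top_n() == [9, 9, 7]
```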

4

Use a min-priority queue implemented with a Fibonacci heap, which has constant insertion time:

1. Insert first 100 elements into PQ
2. loop forever
       n = getNextNumber();
       if n > PQ.findMin() then
           PQ.deleteMin()
           PQ.insert(n)
2

The task is clearly to find an algorithm that is O(1) in the length N of the required list of numbers. So it doesn't matter whether you need the top 100 numbers or the top 10,000: the insertion time should be O(1).

The trick here is that although that O(1) requirement is mentioned for the list insert, the question didn't say anything about the order of search time in the whole number space, but it turns out this can be made O(1) as well. The solution then is as follows:

  1. Arrange for a hashtable with numbers for keys and pairs of linked-list pointers for values. Each pair of pointers is the start and end of a linked-list sequence; normally this is just one element and then the next. Every element in the linked list sits next to the element with the next highest number, so the linked list contains the sorted sequence of required numbers. Keep a record of the lowest number.

  2. Take a new number x from the random stream.

  3. Is it higher than the last recorded lowest number? Yes => Step 4, No => Step 2

  4. Hit the hash table with the number just taken. Is there an entry? Yes => Step 5. No => Take a new number x-1 and repeat this step (this is a simple downward linear search, just bear with me here, this can be improved and I'll explain how)

  5. With the list element just obtained from the hash table, insert the new number just after the element in the linked list (and update the hash)

  6. Take the lowest number l recorded (and remove it from the hash/list).

  7. Hit the hash table with the number just taken. Is there an entry? Yes => Step 8. No => Take a new number l+1 and repeat this step (this is a simple upward linear search)

  8. With a positive hit the number becomes the new lowest number. Go to step 2

To allow for duplicate values the hash actually needs to maintain the start and end of the linked list sequence of elements that are duplicates. Adding or removing an element at a given key thus increases or decreases the range pointed to.

The insert here is O(1). The searches mentioned are, I guess something like, O(average difference between numbers). The average difference increases with the size of the number space, but decreases with the required length of the list of numbers.

So the linear search strategy is pretty poor, if the number space is large (e.g. for a 4 byte int type, 0 to 2^32-1) and N=100. To get around this performance issue you can keep parallel sets of hashtables, where the numbers are rounded to higher magnitudes (e.g. 1s, 10s, 100s, 1000s) to make suitable keys. In this way you can step up and down gears to perform the required searches more quickly. The performance then becomes an O(log numberrange), I think, which is constant, i.e. O(1) also.

To make this clearer, imagine that you have the number 197 to hand. You hit the 10s hash table with '190' (197 rounded down to the nearest ten). Anything? No. So you go down in 10s until you hit, say, 120. Then you can start at 129 in the 1s hashtable, then try 128, 127 and so on until you hit something. You've now found where in the linked list to insert the number 197. Whilst putting it in, you must also update the 1s hashtable with the 197 entry, the 10s hashtable with 190, the 100s with 100, and so on. The most steps you ever have to do here is 10 times the log of the number range.

I might have got some of the details wrong, but since this is the programmers exchange, and the context was interviews I would hope the above is a convincing enough answer for that situation.

EDIT I added some extra detail here to explain the parallel hashtable scheme and how it means the poor linear searches I mentioned can be replaced with an O(1) search. I've also realised there is of course no need to search for the next lowest number, because you can step straight to it by looking in the hashtable with the lowest number and progressing to the next element.
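A much-simplified Python sketch of the core of this scheme (linear downward probe only, no magnitude buckets, no duplicate handling; the class and method names are invented, and it assumes the structure is pre-seeded with N distinct integers):

```python
class TopNHashList:
    """Hash from each kept number to the next higher kept number, forming a
    sorted singly linked list; `lowest` tracks the eviction candidate."""

    def __init__(self, seed):
        self.next = {}                 # number -> next higher kept number
        ordered = sorted(seed)
        for a, b in zip(ordered, ordered[1:]):
            self.next[a] = b
        self.next[ordered[-1]] = None  # the highest number has no successor
        self.lowest = ordered[0]

    def add(self, x):
        if x <= self.lowest or x in self.next:
            return                     # step 3: not in the top N (or already kept)
        probe = x - 1                  # step 4: linear downward search
        while probe not in self.next:
            probe -= 1
        self.next[x] = self.next[probe]          # step 5: splice x in after probe
        self.next[probe] = x
        self.lowest = self.next.pop(self.lowest) # steps 6-8: evict the old lowest

    def contents(self):
        out, cur = [], self.lowest
        while cur is not None:         # walk the linked list, lowest to highest
            out.append(cur)
            cur = self.next.get(cur)
        return out

t = TopNHashList([3, 5, 9])
t.add(7)
t.add(10)
# t.contents() == [7, 9, 10]
```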

Benedict
  • 1,097
1

Can we assume that the numbers are of a fixed data type, such as Integer? If so, then keep a tally of every single number that is added. This is an O(1) operation.

  1. Declare an array with as many elements as there are possible numbers:
  2. Read each number as it is streamed.
  3. Tally the number. Ignore it if that number has been tallied 100 times already as you'll never need it. This prevents overflows from tallying it an infinite number of times.
  4. Repeat from step 2.

VB.Net code:

Const Capacity As Integer = 100

Dim Tally(Integer.MaxValue) As Integer ' Assume all elements = 0
Dim Value As Integer
Do
    Value = ReadValue()
    If Tally(Value) < Capacity Then Tally(Value) += 1
Loop

When you return the list, you can take as long as you like. Simply iterate from the end of the tally and create a new list of the highest 100 values recorded. This is an O(n) operation, but that's irrelevant.

Dim List(Capacity - 1) As Integer ' VB arrays run 0 to the given bound, so this is exactly Capacity slots
Dim ListCount As Integer = 0
Dim Value As Integer = Tally.Length - 1
Dim ValueCount As Integer = 0
Do Until ListCount = List.Length OrElse Value < 0
    If Tally(Value) > ValueCount Then
        List(ListCount) = Value
        ValueCount += 1
        ListCount += 1
    Else
        Value -= 1
        ValueCount = 0
    End If
Loop
Return List

Edit: In fact, it doesn't really matter whether it's a fixed data type. Given there are no imposed limits on memory (or hard disk) consumption, you could make this work for any range of positive integers.

Hand-E-Food
  • 1,645
1

A hundred numbers are easily stored in an array of size 100. Any tree, list or set is overkill, given the task at hand.

If the incoming number is higher than the lowest (= last) entry in the array, walk the entries until you find the first one that is smaller than your new number (you can use a fancier search, such as binary search, to do that), then shift that entry and the remainder of the array "down" by one, dropping the last, and put the new number in the gap.

Since you keep the list sorted from the beginning, you don't need to run any sort algorithm at all. This is O(1).
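A sketch of this insert-by-shifting approach, assuming the array already holds its full complement of numbers in descending order (the function name is illustrative):

```python
def insert_sorted_desc(top, x):
    """Insert x into the fixed-size, descending array `top` if it beats the
    last (lowest) entry, shifting smaller entries down by one. No sort pass
    is ever needed because the array is kept sorted from the start."""
    if x <= top[-1]:
        return top                      # not in the top N, nothing to do
    i = 0
    while top[i] >= x:                  # find the first entry smaller than x
        i += 1
    top[i+1:] = top[i:-1]               # shift the tail down, dropping the last
    top[i] = x
    return top

top = [9, 7, 4]          # descending, size fixed at 3
insert_sorted_desc(top, 8)
# top == [9, 8, 7]
```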

0

You could use a Binary Max-Heap. You'd have to keep track of a pointer to the minimum node (which could be unknown/null).

You start by inserting the first 100 numbers into the heap. The max will be at the top. After this is done, you will always keep 100 numbers in there.

Then when you get a new number:

if(minimumNode == null)
{
    minimumNode = findMinimumNode();
}
if(newNumber > minimumNode.Value)
{
    heap.Remove(minimumNode);
    minimumNode = null;
    heap.Insert(newNumber);
}

Unfortunately findMinimumNode is O(n), and you do incur that cost once per insert (but not during the insert :). Removing the minimum node and inserting the new node are, on average, O(1) because they'll tend towards the bottom of the heap.

Going the other way with a Binary Min-Heap, the min is at the top, which is great for finding the min for comparison, but sucks when you have to replace the minimum with a new number that's > min. That's because you have to remove the min node (always O(logN)) and then insert the new node (average O(1)). So, you still have O(logN) which is better than Max-Heap, but not O(1).

Of course, if N is constant, then you always have O(1). :)