50

I'm trying to teach myself how to calculate big-O notation for an arbitrary function. I found this function in a textbook. The book asserts that the function is O(n²). It gives an explanation as to why this is, but I'm struggling to follow. I wonder if someone might be able to show me the math behind why this is so. Fundamentally, I understand that it is something less than O(n³), but I couldn't independently land on O(n²).

Suppose we are given three sequences of numbers, A, B, and C. We will assume that no individual sequence contains duplicate values, but that there may be some numbers that are in two or three of the sequences. The three-way set disjointness problem is to determine if the intersection of the three sequences is empty, namely, that there is no element x such that x ∈ A, x ∈ B, and x ∈ C.

Incidentally, this is not a homework problem for me -- that ship sailed years ago :) -- just me trying to get smarter.

def disjoint(A, B, C):
    """Return True if there is no element common to all three lists."""
    for a in A:
        for b in B:
            if a == b:  # only check C if we found a match from A and B
                for c in C:
                    if a == c:  # (and thus a == b == c)
                        return False  # we found a common value
    return True  # if we reach this, sets are disjoint

[Edit] According to the textbook:

In the improved version, it is not simply that we save time if we get lucky. We claim that the worst-case running time for disjoint is O(n²).

The book's explanation, which I struggle to follow, is this:

To account for the overall running time, we examine the time spent executing each line of code. The management of the for loop over A requires O(n) time. The management of the for loop over B accounts for a total of O(n²) time, since that loop is executed n different times. The test a == b is evaluated O(n²) times. The rest of the time spent depends upon how many matching (a,b) pairs exist. As we have noted, there are at most n such pairs, and so the management of the loop over C, and the commands within the body of that loop, use at most O(n²) time. The total time spent is O(n²).

(And to give proper credit ...) The book is: Data Structures and Algorithms in Python by Michael T. Goodrich et al., Wiley, p. 135

[Edit] A justification: below is the code before optimization:

def disjoint1(A, B, C):
    """Return True if there is no element common to all three lists."""
    for a in A:
        for b in B:
            for c in C:
                if a == b == c:
                    return False  # we found a common value
    return True  # if we reach this, sets are disjoint

In the above, you can clearly see that this is O(n³), because each loop must run to its fullest. The book would assert that in the simplified example (given first), the third loop contributes only O(n²), so the complexity works out to O(n) + O(n²) + O(n²), which ultimately yields O(n²).

While I cannot prove this is the case (thus the question), the reader can agree that the complexity of the simplified algorithm is at most that of the original.

[Edit] And to prove that the simplified version is quadratic:

import time

if __name__ == '__main__':
    for c in [100, 200, 300, 400, 500]:
        l1, l2, l3 = get_random(c), get_random(c), get_random(c)
        start = time.time()
        disjoint1(l1, l2, l3)
        print(time.time() - start)
        start = time.time()
        disjoint2(l1, l2, l3)
        print(time.time() - start)
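Note that get_random and disjoint2 aren't defined above: disjoint2 refers to the optimized disjoint from the top of the question, and get_random(c) presumably returns a list of c distinct random numbers. A minimal sketch of such a helper (my assumption, not shown in the original):

import random

def get_random(c):
    # Hypothetical helper: c distinct random numbers, honoring the
    # problem's no-duplicates assumption (random.sample never repeats).
    return random.sample(range(10 * c), c)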

Yields:

0.02684807777404785
0.00019478797912597656
0.19134306907653809
0.0007600784301757812
0.6405444145202637
0.0018095970153808594
1.4873297214508057
0.003167390823364258
2.953308343887329
0.004908084869384766

Since the second differences are roughly constant, the simplified function is indeed quadratic.


[Edit] And yet even further proof:

If I assume the worst case (A == B, with C disjoint from both),

if __name__ == '__main__':
    for c in [10, 20, 30, 40, 50]:
        l1, l2, l3 = range(0, c), range(0, c), range(5 * c, 6 * c)
        # disjoint1/disjoint2 are modified here to return iteration counts
        its1 = disjoint1(l1, l2, l3)
        its2 = disjoint2(l1, l2, l3)
        print(f"iterations1 = {its1}")
        print(f"iterations2 = {its2}")

yields:

iterations1 = 1000
iterations2 = 100
iterations1 = 8000
iterations2 = 400
iterations1 = 27000
iterations2 = 900
iterations1 = 64000
iterations2 = 1600
iterations1 = 125000
iterations2 = 2500

Using the second-difference test, the worst-case result is exactly quadratic.
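The counting variants of disjoint1 and disjoint2 aren't shown in the question; a reconstruction consistent with the printed numbers -- each counts only iterations of its innermost loop -- might look like:

def disjoint1(A, B, C):
    """Unoptimized version, counting innermost iterations: always n**3."""
    iterations = 0
    for a in A:
        for b in B:
            for c in C:
                iterations += 1
    return iterations

def disjoint2(A, B, C):
    """Optimized version, counting iterations of the loop over C.

    With A == B and C disjoint from both, each a finds exactly one
    match in B and then scans all of C: n * n = n**2 iterations.
    """
    iterations = 0
    for a in A:
        for b in B:
            if a == b:
                for c in C:
                    iterations += 1
    return iterations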


SteveJ

8 Answers

71

The book is indeed correct, and it provides a good argument. Note that timings are not a reliable indicator of algorithmic complexity. The timings might only consider a special data distribution, or the test cases might be too small: algorithmic complexity only describes how resource usage or runtime scales beyond some suitably large input size.

The book makes the argument that complexity is O(n²) because the if a == b branch is entered at most n times. This is non-obvious because the loops are still written as nested. It is more obvious if we extract it:

def disjoint(A, B, C):
  AB = (a
        for a in A
        for b in B
        if a == b)
  ABC = (a
         for a in AB
         for c in C
         if a == c)
  for a in ABC:
    return False
  return True

This variant uses generators to represent intermediate results.

  • In the generator AB, we will have at most n elements (because of the guarantee that input lists won't contain duplicates), and producing all of its elements takes O(n²) time.
  • Producing the generator ABC involves a loop over the generator AB of length n and over C of length n, so that its algorithmic complexity is O(n²) as well.
  • These operations are not nested but happen independently, so that the total complexity is O(n² + n²) = O(n²).

Because pairs of input lists can be checked sequentially, it follows that determining whether any number of lists are disjoint can be done in O(n²) time.
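A sketch of that sequential pairwise check, generalized to any number of lists (my illustration, not from the book; it assumes each list holds at most n distinct elements):

def disjoint_many(*lists):
    """Return True if no single value is common to all given lists.

    Each filtering pass costs at most O(n**2), and the passes run one
    after another, so k lists cost O(k * n**2) -- quadratic for fixed k.
    """
    if not lists:
        return True
    candidates = list(lists[0])
    for other in lists[1:]:
        # candidates holds at most n distinct values, so this pass
        # performs at most n * n comparisons
        candidates = [x for x in candidates if any(x == y for y in other)]
    return not candidates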

This analysis is imprecise because it assumes that all lists have the same length. We can say more precisely that AB has at most length min(|A|, |B|) and producing it has complexity O(|A|•|B|). Producing ABC has complexity O(min(|A|, |B|)•|C|). Total complexity then depends on how the input lists are ordered. With |A| ≤ |B| ≤ |C| we get total worst-case complexity of O(|A|•|C|).

Note that efficiency wins are possible if the input containers allow for fast membership tests rather than having to iterate over all elements. This could be the case when they are sorted so that a binary search can be done, or when they are hash sets. Without explicit nested loops, this would look like:

for a in A:
  if a in B:  # might implicitly loop
    if a in C:  # might implicitly loop
      return False
return True

or in the generator-based version:

AB = (a for a in A if a in B)
ABC = (a for a in AB if a in C)
for a in ABC:
  return False
return True
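With B and C given as hash sets, each a in B / a in C test above is an expected O(1) lookup, so the whole scan runs in expected O(n) time.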
amon
8

Note that since all elements are distinct within each list (which is assumed), you iterate over C at most once for each element of A (namely, when there is an element in B equal to it). So the inner loop is O(n²) in total.

RiaD
7

We will assume that no individual sequence contains duplicate values.

is a very important piece of information.

Otherwise, the worst case of the optimized version would still be O(n³), when A and B are equal and contain one element duplicated n times:

i = 0

def disjoint(A, B, C):
    global i
    for a in A:
        for b in B:
            if a == b:
                for c in C:
                    i += 1
                    print(i)
                    if a == c:
                        return False
    return True

print(disjoint([1] * 10, [1] * 10, [2] * 10))

which outputs:

...
...
...
993
994
995
996
997
998
999
1000
True

If the inputs of this function are considered to be three arbitrary collections, the above code is O(n³).

But, as mentioned by @sdenham:

by stipulation, it is being analyzed as a function from three sets to a boolean, which is O(n²) for a non-obvious (and therefore pedagogically useful) reason.

As explained by other answers, if no duplicates are allowed, the worst-case is indeed O(n²).

An additional optimization would be to use sets or dicts in order to test membership in expected O(1) time. In that case, disjoint would be O(n) for every input.
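A sketch of that set-based variant (assuming the elements are hashable):

def disjoint_fast(A, B, C):
    """Expected O(n): build hash sets once, then each membership test
    is O(1) on average."""
    B_set, C_set = set(B), set(C)  # O(n) each to build
    return not any(a in B_set and a in C_set for a in A)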

3

To put things into the terms that your book uses:

I think you have no problem understanding that the check a == b is worst-case O(n²).

Now in the worst case for the third loop, every a in A has a match in B, so the third loop will be called once for every matching (a, b) pair -- n pairs at most. In the case where a doesn't exist in C, it will run through the entire C set (n checks).

In other words, the loop over C runs once for every a and makes one check for every c: n * n, i.e. O(n²).

So there is the O(n²) + O(n²) that your book points out.
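A quick way to check that count empirically (my instrumentation, not from the book) is to tally the two kinds of comparisons separately:

def count_comparisons(A, B, C):
    """Tally a == b and a == c comparisons in the optimized algorithm."""
    ab = ac = 0
    for a in A:
        for b in B:
            ab += 1
            if a == b:
                for c in C:
                    ac += 1
                    if a == c:
                        return ab, ac
    return ab, ac

# worst case: A == B, with C disjoint from both
n = 100
print(count_comparisons(range(n), range(n), range(n, 2 * n)))
# prints (10000, 10000): n**2 + n**2 comparisons, i.e. O(n²) overall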

Mars
0

The trick of the optimized method is to cut corners. Only if a and b match will c be given a look. Now you may figure that in the worst case you would still have to evaluate each c. This is not true.

You probably think the worst case is that every check for a == b results in a run over C, because every check for a == b returns a match. But this is not possible, because the conditions for this are contradictory. For this to work you would need an A and a B that contain the same values. They may be ordered differently, but each value in A would have to have a matching value in B.

Now here's the kicker. There is no way to organize these values so that for each a you would have to evaluate all b's before you find your match.

A: 1 2 3 4 5
B: 1 2 3 4 5

This would be done instantly because the matching 1's are the first element in both series. What about

A: 1 2 3 4 5
B: 5 4 3 2 1

That would work for the first run over A: only the last element in B would yield a hit. But the next iteration over A would already have to be quicker, because the last spot in B is already occupied by 1. And indeed this would take only four iterations this time. And it gets a little better with every iteration.

Now I am no mathematician, so I cannot prove this will end up as O(n²), but I can feel it in my clogs.

Martin Maat
0

Note that the algorithm given is unnecessarily complicated. An obvious O(n²) algorithm that is also O(n²) for arrays with duplicated elements is very simple:

  1. Write a function contains(array A, value X) which returns whether A contains X in O(n); this is trivial.

  2. Disjoint(array A, B, C): for a in A: if contains(B, a) and contains(C, a), return false. Finally, return true. Obviously O(n²); see the sketch below.
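A direct Python transcription of those two steps:

def contains(A, x):
    """Linear scan: O(len(A))."""
    for value in A:
        if value == x:
            return True
    return False

def disjoint(A, B, C):
    """O(n²): one O(n) scan of B, and at most one of C, per element of A."""
    for a in A:
        if contains(B, a) and contains(C, a):
            return False
    return True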

If your standard library implements sets, it is likely as a hash table (expected O(1) membership) or a balanced tree (O(log n) membership), so the total would be O(n) or O(n log n). And if A, B, and C are all sorted arrays, then you can check whether they are disjoint in O(n), as sketched below.
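For the sorted case, a sketch of such a linear scan (my illustration): any common value must be at least as large as the largest of the three current cursors, so anything smaller can be skipped.

def disjoint_sorted(A, B, C):
    """O(|A| + |B| + |C|) for sorted inputs."""
    i = j = k = 0
    while i < len(A) and j < len(B) and k < len(C):
        if A[i] == B[j] == C[k]:
            return False  # common value found
        m = max(A[i], B[j], C[k])
        # advance every cursor whose value can no longer be common
        if A[i] < m:
            i += 1
        if B[j] < m:
            j += 1
        if C[k] < m:
            k += 1
    return True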

To make the algorithm a bit faster, pay some attention to the sizes of the sets or arrays. For unsorted arrays, pick A as the array with the smallest number of elements and B as the one with the second smallest. This makes the worst case better, and it also rules out the most searches: the number of common elements will be smallest between the two smallest sets.

The effect is stronger for sets with O(log n) lookup. The number of lookups grows linearly with the size of A, while the time for each lookup grows only logarithmically with the size of B or C, so you really want A to be the smallest set.

gnasher729
-1

I was baffled at first, but amon's answer is really helpful. I want to see if I can do a really concise version:

For a given value of a in A, the function compares a with every possible b in B, and it does so only once. So for a given a, it performs a == b exactly n times.

B doesn't contain any duplicates (none of the lists do), so for a given a there will be at most one match. (That's the key.) Where there is a match, a will be compared against every possible c, which means that a == c is carried out exactly n times. Where there is no match, a == c doesn't happen at all.

So for a given a, there are either n comparisons or 2n comparisons. This happens for every a, so the best possible case is n² comparisons and the worst is 2n² -- both O(n²).

TL;DR: every value of a is compared against every value of b and against every value of c, but not against every combination of b and c. The two costs add together; they don't multiply.

-3

Think about it this way: some numbers may be in two or three of the sequences, but the average case is that for each element in set A, an exhaustive search is performed in B. It is guaranteed that every element in set A will be iterated over, but it is implied that fewer than half of the elements in set B will be iterated over.

When the elements in set B are iterated over, a further iteration happens if there's a match. This means that the average case for this disjoint function is O(n²), but the absolute worst case for it could be O(n³). If the book didn't go into detail, it would probably give you the average case as an answer.