
This is a rather conceptual question, but I was hoping I could get some good advice on this. A lot of the programming I do is with (NumPy) arrays; I often have to match items in two or more arrays of different sizes, and the first thing I reach for is a for loop, or, even worse, a nested for loop. I want to avoid for loops as much as possible, because they are slow (at least in Python).

I know that for a lot of things with NumPy there are pre-defined commands that I just need to research, but do you (as more experienced programmers) have a general thought process that comes to mind when you have to iterate something?

So I often have something like this, which is awful and I want to avoid it:

import numpy as np

small_array = np.array(["one", "two"])
big_array = np.array(["one", "two", "three", "one"])

for i in range(len(small_array)):
    for p in range(len(big_array)):
        if small_array[i] == big_array[p]:
            print("This item is matched:", small_array[i])

I know there are multiple different ways to achieve this in particular, but I am interested in a general method of thinking, if it exists.

gnat
turnip
3 Answers


This is a common conceptual difficulty when learning to use NumPy effectively. Normally, data processing in Python is best expressed in terms of iterators, to keep memory usage low, to maximize opportunities for parallelism with the I/O system, and to provide for reuse and combination of parts of algorithms.

But NumPy turns all that inside out: the best approach is to express the algorithm as a sequence of whole-array operations, to minimize the amount of time spent in the slow Python interpreter and maximize the amount of time spent in fast compiled NumPy routines.

Here's the general approach I take:

  1. Keep the original version of the function (which you are confident is correct) so that you can test it against your improved versions both for correctness and speed.

  2. Work from the inside out: that is, start with the innermost loop and see if it can be vectorized; when you've done that, move out one level and continue.

  3. Spend lots of time reading the NumPy documentation. There are a lot of functions and operations in there and they are not always brilliantly named, so it's worth getting to know them. In particular, if you find yourself thinking, "if only there were a function that did such-and-such," then it's well worth spending ten minutes looking for it. It's usually in there somewhere.

There's no substitute for practice, so I'm going to give you some example problems. The goal for each problem is to rewrite the function so that it is fully vectorized: that is, so that it consists of a sequence of NumPy operations on whole arrays, with no native Python loops (no for or while statements, no iterators or comprehensions).

Problem 1

def sumproducts(x, y):
    """Return the sum of x[i] * y[j] for all pairs of indices i, j.

    >>> sumproducts(np.arange(3000), np.arange(3000))
    20236502250000

    """
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            result += x[i] * y[j]
    return result

Problem 2

def countlower(x, y):
    """Return the number of pairs i, j such that x[i] < y[j].

    >>> countlower(np.arange(0, 200, 2), np.arange(40, 140))
    4500

    """
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] < y[j]:
                result += 1
    return result

Problem 3

def cleanup(x, missing=-1, value=0):
    """Return an array that's the same as x, except that where x ==
    missing, it has value instead.

    >>> cleanup(np.arange(-3, 3), value=10)
    ... # doctest: +NORMALIZE_WHITESPACE
    array([-3, -2, 10, 0, 1, 2])

    """
    result = []
    for i in range(len(x)):
        if x[i] == missing:
            result.append(value)
        else:
            result.append(x[i])
    return np.array(result)

Spoilers below. You'll get the best results if you have a go yourself before looking at my solutions!

Answer 1

np.sum(x) * np.sum(y)
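Why this works: by distributivity, the sum of x[i] * y[j] over all pairs of indices factors into (sum of x) times (sum of y). A minimal check of the vectorized form against the original loop (a sketch; note that on platforms where NumPy defaults to 32-bit integers, the 3000-element doctest value can overflow):

```python
import numpy as np

def sumproducts_loop(x, y):
    # Original double loop, kept as a correctness reference.
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            result += x[i] * y[j]
    return result

def sumproducts_vec(x, y):
    # sum_{i,j} x[i] * y[j] == (sum_i x[i]) * (sum_j y[j])
    return np.sum(x) * np.sum(y)

x, y = np.arange(50), np.arange(30)
assert sumproducts_loop(x, y) == sumproducts_vec(x, y)
```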

Answer 2

np.sum(np.searchsorted(np.sort(x), y))
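The trick here: with x sorted, np.searchsorted(x, v) (side='left', the default) returns the number of elements of x strictly less than v, so summing it over all of y counts exactly the pairs with x[i] < y[j]. A sketch:

```python
import numpy as np

def countlower_vec(x, y):
    # For each y[j], searchsorted on the sorted x gives the number of
    # x elements strictly less than y[j]; summing counts all the pairs.
    return int(np.sum(np.searchsorted(np.sort(x), y)))

assert countlower_vec(np.arange(0, 200, 2), np.arange(40, 140)) == 4500
```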

Answer 3

np.where(x == missing, value, x)
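Here np.where(cond, a, b) builds a new array that takes `a` where the condition holds and `b` elsewhere, all in one compiled pass, so no list building is needed. A quick check against the docstring:

```python
import numpy as np

def cleanup_vec(x, missing=-1, value=0):
    # np.where selects `value` wherever x == missing and keeps x
    # elsewhere; the input array itself is left unmodified.
    return np.where(x == missing, value, x)

assert cleanup_vec(np.arange(-3, 3), value=10).tolist() == [-3, -2, 10, 0, 1, 2]
```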

Gareth Rees

To make things faster, you have to read up on data structures and use the appropriate ones.

For non-trivial sizes of the small and big arrays (say, small = 100 elements and big = 10,000 elements), one way to do it is to sort the small array, then iterate over the big array and use a binary search to find matching elements in the small array.

That makes the overall time complexity O((M + N) log M), where M is the size of the small array and N the size of the big one: O(M log M) for the sort plus O(N log M) for the binary searches. For a small small array and a very big big array this is close to O(N), whereas your nested-loop solution is O(M × N).

However, which data structures are most efficient depends heavily on the actual problem.
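A sketch of the sort-plus-binary-search idea applied to the question's arrays, here using NumPy's np.searchsorted to perform all the binary searches against the sorted small array at once (the helper name `matches` is mine):

```python
import numpy as np

def matches(small_array, big_array):
    # Sort the small array once: O(M log M).
    s = np.sort(small_array)
    # Binary-search every big-array element in it: O(N log M).
    idx = np.searchsorted(s, big_array)
    idx[idx == len(s)] = 0                 # clamp indices that fell off the end
    return big_array[s[idx] == big_array]  # keep only the exact matches

hits = matches(np.array(["one", "two"]),
               np.array(["one", "two", "three", "one"]))
assert hits.tolist() == ["one", "two", "one"]
```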

Pieter B

You can use a dictionary to improve performance significantly.

This is another example:

# Build a lookup table from airport abbreviation to coordinates once.
locations = {}
for i in range(len(airports)):
    # [1:-1] drops the first and last characters of the abbreviation
    # (the surrounding quote characters).
    locations[airports["abb"][i][1:-1]] = (airports["height"][i], airports["width"][i])

# Each dictionary lookup is O(1), so this pass is linear instead of
# rescanning the airports table for every row.
for i in range(len(uniqueData)):
    h, w = locations[uniqueData["dept_apt"][i]]
    uniqueData["dept_apt_height"][i] = h
    uniqueData["dept_apt_width"][i] = w
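Applied to the question's arrays, the same idea looks like this (a sketch; for plain membership matching, a set gives the same O(1) lookups as a dictionary):

```python
import numpy as np

small_array = np.array(["one", "two"])
big_array = np.array(["one", "two", "three", "one"])

# Build an O(1)-membership set from the small array once; a single pass
# over the big array then replaces the O(M * N) nested loop with O(M + N).
wanted = set(small_array.tolist())
matched = [item for item in big_array.tolist() if item in wanted]
assert matched == ["one", "two", "one"]
```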