Arrays and Data Structures

Contributed by Mike Stock.  The slides (here) are your best resource for this week’s topic.  Several images and some of the content come from the High Performance Python for Climate Science Short Course at the 2013 AMS Annual Meeting.  Short summary snippets are below.  For this week’s discussion, we will make use of NumExpr (http://code.google.com/p/numexpr/).

 

NDArray Data Structures

Even though the Python ‘view’ of numpy ndarrays may be multi-dimensional, they are stored linearly in memory.  Computations are fastest on cells that are contiguous in memory.  Test this by timing operations along different axes in ipython with %timeit:


import numpy as np

ary = np.random.random([1000,1000,1000])  #About 8 GB of float64 values; shrink the shape if memory is tight

%timeit ary[1,1,:]**2
%timeit ary[1,:,1]**2 #Slower
%timeit ary[:,1,1]**2 #Slowest

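You will note that the first example is the fastest: with numpy's default (C, row-major) ordering, elements along the last axis sit next to each other in memory.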
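The layout behind these timings can be inspected directly. The sketch below is only an illustration: it uses a much smaller array (the shape and the variable names are assumptions, chosen to keep memory use low) and looks at the strides and contiguity flags of slices along different axes.

```python
import numpy as np

# A small C-ordered (row-major) array; the shape is chosen only for illustration.
small = np.zeros((4, 5, 6), dtype=np.float64)

# Strides give the number of bytes to step to reach the next element on each axis.
print(small.strides)  # (240, 48, 8): the last axis has the smallest step

last_axis = small[1, 1, :]   # neighbouring elements in memory
first_axis = small[:, 1, 1]  # elements 240 bytes apart

print(last_axis.flags['C_CONTIGUOUS'])   # True
print(first_axis.flags['C_CONTIGUOUS'])  # False
```

Slices along the last axis walk through adjacent memory, which is why they tend to be the fastest in the timings above.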

 

Numbers

Python native numbers are not equivalent to numpy numbers. To see this, try looking at the output ipython gives for a=10**100 vs. a=ary[0]**100

When you operate on an ndarray with a python number, a conversion will take place.  For some applications, you may want to consider specifying the dtype.
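As a rough illustration of that conversion, the sketch below mixes an integer array with a Python float and then builds an array with an explicit dtype. The variable names are made up for the example.

```python
import numpy as np

ints = np.arange(3, dtype=np.int32)

# Mixing an integer array with a Python float gives a floating-point result.
mixed = ints + 1.5
print(mixed.dtype)  # float64

# Specifying the dtype up front makes the storage type explicit.
singles = np.arange(3, dtype=np.float32)
print(singles.dtype, singles.itemsize)  # float32 4
```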

Arithmetic on individual numpy scalars is much slower than on native Python numbers!  Try with ipython:


a = .375
%timeit a**2+a      #OK

a = np.float64(a)
%timeit a**2+a      #Slower

#It's still faster even if you convert and then compute
%timeit float(a)**2+float(a)

 

Lists vs. Arrays

For array-type computations, arrays become faster than lists once there are more than about 20 elements.
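One way to check where the crossover falls on your own machine is sketched below with the standard timeit module. The helper name time_square and the chosen sizes are assumptions for this example, and the exact crossover point will depend on the operation and the hardware.

```python
import timeit
import numpy as np

def time_square(n, repeats=1000):
    """Return (list_seconds, array_seconds) for squaring n elements."""
    data_list = list(range(n))
    data_array = np.arange(n)
    # List version: an explicit Python-level loop over the elements.
    t_list = timeit.timeit(lambda: [x * x for x in data_list], number=repeats)
    # Array version: one vectorized numpy operation.
    t_array = timeit.timeit(lambda: data_array * data_array, number=repeats)
    return t_list, t_array

for n in (5, 20, 100, 1000):
    t_list, t_array = time_square(n)
    print(n, t_list, t_array)
```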

[Figure: lists vs. arrays]

 

Masks

Masks are incredibly convenient constructs that are used to filter out certain elements in an array.  Try:


ary = np.arange(10)
mask = (ary>3)&(ary<6)
mask
#Out: array([False, False, False, False,  True,  True, False, False, False, False], dtype=bool)
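Once built, a boolean mask like this one is typically used to index the array. A minimal sketch (the name selected is illustrative):

```python
import numpy as np

ary = np.arange(10)
mask = (ary > 3) & (ary < 6)

selected = ary[mask]   # boolean indexing returns a copy of the matching elements
print(selected)        # [4 5]

ary[mask] = 0          # assignment through the mask modifies ary in place
print(ary)             # [0 1 2 3 0 0 6 7 8 9]
```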

To learn more, check out the official documentation on the related masked-array module (numpy.ma):

http://docs.scipy.org/doc/numpy/reference/maskedarray.html

The image below shows one example of things that are easy-peasy with arrays.  An image was masked such that only points where the signal to noise ratio is greater than three remain.  You may sacrifice speed by using masks rather than copying array values directly in an example such as this, but masks make coding quite convenient for the user.

[Figure: image masked to show only points with signal-to-noise ratio greater than three]
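The signal-to-noise example can be sketched on synthetic data as follows. The signal and noise arrays here are made up for illustration and are not the data behind the image. The sketch shows both a masked array and the alternative of copying values into a blanked array.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.random((4, 4)) * 10.0   # made-up signal values
noise = np.full((4, 4), 2.0)         # made-up constant noise level
snr = signal / noise

# Option 1: a masked array hides the low-SNR pixels (mask=True means hidden).
masked = np.ma.masked_where(snr <= 3, signal)

# Option 2: copy the values and blank the low-SNR pixels with NaN.
blanked = np.where(snr > 3, signal, np.nan)

# Both approaches keep the same number of pixels.
print(masked.count(), np.count_nonzero(~np.isnan(blanked)))
```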

 

Numexpr

NumExpr speeds up array computations by doing them in blocks, to maximize memory efficiency.  It also automatically parallelizes the operation, to utilize all the cores on the computer.  However, there is a large amount of overhead, and only some arithmetic operations are supported.

[Figure: NumExpr vs. NumPy]
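A minimal sketch of a NumExpr call is shown below, assuming the numexpr package is installed. The expression and array sizes are arbitrary choices for the example, and a plain numpy fallback is included so the snippet still runs without NumExpr.

```python
import numpy as np

try:
    import numexpr as ne   # third-party package; may not be installed
except ImportError:
    ne = None

a = np.random.random(1_000_000)
b = np.random.random(1_000_000)

# Plain numpy: builds temporary arrays for a**2 and 2*b before adding them.
expected = a**2 + 2*b

if ne is not None:
    # NumExpr compiles the expression string and evaluates it block by block.
    result = ne.evaluate("a**2 + 2*b", local_dict={"a": a, "b": b})
else:
    result = expected      # fallback so the sketch still runs without NumExpr

print(np.allclose(result, expected))
```

Timing both forms with %timeit at several array sizes is a reasonable way to see where the overhead stops dominating.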

 

Arrays Summary

Which is best to use for different array lengths?

  • N<10: Lists and ‘for’ loops
  • N>10: numpy arrays and numpy computations
  • N>10^4: numpy arrays and numexpr computations

 

Memory Maps

For big data, it is often useful to keep an array in a file on the hard disk and access it as if it were in memory.  This is called memory mapping.  It is faster to memory map when you only need a small piece of a large data set, since you don’t need to load the entire data set into memory.  For reference, see http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
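A small sketch of the idea using numpy.memmap is given below. The file name, shape, and temporary directory are assumptions made for the example. A real data set would be far larger than this one.

```python
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "big.dat")   # hypothetical file name

# Create a disk-backed array and fill it.
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 1000))
writer[:] = np.arange(1000 * 1000, dtype=np.float64).reshape(1000, 1000)
writer.flush()   # push pending changes to the file
del writer

# Reopen read-only; the shape and dtype must be given again for a raw file.
reader = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 1000))
row = np.array(reader[500, :10])   # copy ten values into ordinary memory
print(row)
```

Only the requested slice is copied into an ordinary array; the rest of the file stays on disk until it is accessed.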

HDF5 files are also slick, but we won’t cover them here.
