NaNs

NaN (Not a Number) is an interesting and useful concept.  For many people, NaNs represent errors or bugs in code, but there are many applications for which they are beneficial and convenient.  In this discussion we will make use of numpy masked arrays as we cover a few different cases where NaNs come into play.  You should check out the official webpage at
http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
They have several excellent examples and a much more thorough coverage of the topic.  I do not intend this guide as a replacement, but rather as a brief supplement with additional specific applications.



As always, begin by importing numpy:


import numpy as np

First, I want to demonstrate the proper way to check if a number is an NaN.  You do NOT use the usual double equal sign.  Instead, use np.isnan()  For example:

a = 5
a == 5
# --> True

a = np.nan
a == np.nan # --> False
np.isnan(a) # --> True

For the first example, let’s generate some sample data (a quadratic with random noise) and explore what happens when we insert some NaNs.

#Generate some x-vals, as well as some y-vals with scatter.
xvals = np.arange(-3,5,.2)
yvals = xvals**2 - 7
for i in range( len(yvals) ): yvals[i] += 2*np.random.randn()

If we wanted to perform a simple fit, we could use scipy.polyfit(), since we have defined this as a quadratic.  If we wanted to find the minimum, we could use either of these:
yvals.min()
np.min(yvals)

And life goes on.  But now, what if you have an NaN for one or more of your y-array values?  Let’s make an example array:

yvals_with_nans = yvals.copy()
yvals_with_nans[3:10] = np.nan

You can still add, subtract, multiply, etc. just fine.  Even plotting with plt.plot(xvals,yvals_with_nans) works as expected, but note what happens when you try to determine special properties of the array:

yvals_with_nans.min()
# --> nan

np.min(yvals_with_nans) # --> nan

np.max(yvals_with_nans) # --> nan

np.sum(yvals_with_nans) # --> nan

Luckily, this is an easy fix for this particular situation.  Simply use numpy’s built-in tools for dealing with NaNs:
numpy.nanmin(), numpy.nanmax()

np.nanmin(yvals_with_nans)
# --> -10.287...


But what if you want to use some regular functions on your array?  You can copy over only the parts that don’t have NaNs, but there is a more powerful way that lets you keep the original array while at the same time only using the non-NaN elements in calculations.  For this we use numpy masked arrays.

Masked arrays consist of two layers: the array of data and the array of mask values.  Experiment with the numpy.ma module:

a = np.ma.masked_array([1,2,3,4,5],mask=0)
print a
# --> [1 2 3 4 5], mask = [False False False False False]

np.sum(a) # --> 15

a.mask[2] = True; a.mask[4] = True
print a
# --> [1 2 -- 4 --], mask = [False False True False True]

print a.mask # --> [False False True False True]
print a.data # --> [1 2 3 4 5]

#Notice how a.data returns the full list of un-masked elements.

np.sum(a) # --> 7
#It only summed the masked elements.


You may also find this trick useful: you can sum up the number of masked/un-masked elements quite easily.

np.sum(a.mask) # --> 2
np.sum(-a.mask) # --> 3

Note the use of the negative sign.  When you sum the mask, you are summing the True/False values.  Since Booleans are 1/0, the sum of [True True False True] = [1 1 0 1] will be 3.  If you sum the negative of the mask, you are summing [-True -True -False -True] = [False False True False] = [0 0 1 0], which is 1.  We will make use of this in a later example.


Now let’s invoke a special version of the masked array to automatically mask out invalid numbers like NaN and inf: numpy.ma.masked_invalid()

y_masked = np.ma.masked_invalid( yvals_with_nans )

print y_masked #Note how the nans have been replaced by -- marks

np.sum(y_masked) # --> 3.34306...

Et voila!  Arithmetic operations and np.min(), etc., will work correctly now on the masked array, though polyfit will not.  Be sure to check out the myriad other incdredibly useful masking arrays, such as

  •      np.ma.masked_less(array,less-than-value)
  •      np.ma.masked_where(condition,array)
  •      np.ma.fix_invalid()
  •      and many more…

If you simply want to remove the NaNs entirely from your array, the easiest method is to compress the masked array.  This returns a 1-D array with all masked values stricken.  Note that even for 2-D or 3-D arrays (images or cubes), the masked_array.compressed() command will still return the 1-D array of all your un-masked values.

y_compressed = y_masked.compressed() #If you print, you will notice that the NaNs are now gone.

We probably want to keep our x-array the same length (for fitting, plotting, etc.).  We can remove the x-values corresponding to the y-array NaNs easily with the same mask:

x_masked = np.ma.masked_array(xvals, mask=y_masked.mask)
x_compressed = x_masked.compressed()
print len(xvals), len(yvals_with_nans), len(x_compressed), len(y_compressed)
# --> 40 40 33 33

#Or you can even just do it on-the-fly with
x_compressed = np.ma.masked_array(xvals, mask=y_masked.mask).compressed()




Now is a good time to insert a warning: Interpolation of missing data points should only be done with caution.  In general, it’s much better practice to perform a fit on your data (see the tutorial on fitting in python here) and get your data points that way.  The following is an example of why.  It can be useful to interpolate data for missing holes in 2-D images, though, if you need to make it continuous.

There are two easy methods for interpolation if you have 1-D data:

This is quick and easy if all your data values are increasing, but it will give you nonsense results if they are not.
Basic usage is numpy.interp(,x_array,y_array)

This is a much more full-featured module, which also has support for 2-D arrays/images.
Basic usage is scipy.interp1d(,x_array,y_array)

Here are examples for interpolating the y-value at index 2.5 (numpy) and at x=-2.5 (scipy) of the compressed arrays:

interpvals1 = np.interp(2.5, x_compressed, y_compressed) # --> -0.6589... (for my particular randomly-generated array)

from scipy import interpolate
interpvals2 = interpolate.interp1d(x_compressed, y_compressed, kind='cubic')
interpvals2( -2.5 )



Here is a plot demonstrating why blind interpolation can be dangerous.  The interpolated value can differ by quite a bit from the expected best fit.  Of course, in order to fit a curve, you must be sure of the function’s form (polynomial, exponential, etc.).

Interpolation Plot Code
import numpy as np
from scipy import polyfit
from matplotlib import pyplot as plt

plt.rc('font',family='serif') #Just because I prefer serif fonts on my plots

#Making the arrays as we did above
xvals=np.arange(-3,5,.2)
yvals_with_nans=xvals**2-7
for i in range( len(yvals) ): yvals_with_nans[i]+=2*np.random.randn()
yvals_with_nans[3:10]=np.nan

#Masking the NaNs
y_masked=np.ma.masked_invalid(yvals_with_nans)
y_compressed=y_masked.compressed()
x_compressed=np.ma.masked_array(xvals,mask=y_masked.mask).compressed()

#Performing additional steps
interpval=np.interp(2.5,x_compressed,y_compressed) #x-val is -1.8
fitvals=polyfit(x_compressed,y_compressed,2)
yfit=fitvals[0]*xvals**2+fitvals[1]*xvals+fitvals[2] #Note that I am using xvals here to smoothly fill in the empty region

#Actually making the figure
fig1=plt.figure(1)
ax1=fig1.add_subplot(1,2,1)
plt.plot(xvals,yvals_with_nans,'.',color='#153E7E')
plt.xlabel('x'); plt.ylabel('y')
plt.xlim(-3.5,5.5)
plt.title('y_masked',size=10)

ax2=fig1.add_subplot(1,2,2)
plt.plot(x_compressed,y_compressed,'.',color='#347C17',label='Compressed Array')
plt.plot(-1.8,interpval,'s',color='#990000',label='Interpolated Point')
plt.plot(xvals,yfit,'--',color='k',label='Quadratic Fit')
leg2=plt.legend(loc=2,prop={'size':8},fancybox=True,handlelength=2.5,numpoints=1)
plt.xlabel('x'); plt.ylabel('y')
plt.xlim(-3.5,5.5)
plt.title('Fit on y_compressed',size=10)

#Adding text boxes below the plots with some diagnostic info
ax1.text(0,-.2,'np.min(yvals_with_nans)       = %.2f\nnp.nanmin(yvals_with_nans) = %.2f\nnp.min(y_masked)                 = %.2f'%(np.min(yvals_with_nans),np.nanmin(yvals_with_nans),np.min(y_masked)),ha='left',va='top',transform=ax1.transAxes,size=8,bbox=dict(boxstyle='round',fc='w'))

ax2.text(0,-.2,'Best fit:  y = %.2f x$^2$ + %.2f x + %.2f\nTrue function: y = x$^2$ - 7'%(fitvals[0],fitvals[1],fitvals[2]),ha='left',va='top',transform=ax2.transAxes,size=8,bbox=dict(boxstyle='round',fc='w'))

plt.suptitle('Sample Arrays with NaNs',size=16)
plt.subplots_adjust(bottom=.5,top=.9,wspace=.4)

#plt.show()
plt.savefig('NaN_plot_example1.png',bbox_inches='tight')
plt.clf()

NaN_plot_example1



Images and NaNs

===========================

Many astronomical images have NaNs for valid reasons – most often it’s because we blank everything outside of some desired region, whether it be a photometry aperture, the edge of the a detector’s field-of-view, etc…  But often we want to perform arithmetic on an image with NaNs, which would normally make the routines give NaN answers. The methods presented above for 1-D cases extend to 2-D images as well.

Let’s use a real galaxy image as an example.  You can find the LITTLE THINGS VLA neutral hydrogen velocity (Moment-1) map for CVnIdwA here (click Save Link As; it’s ~1MB).  You can learn more about the LITTLE THINGS project at the NRAO site here:
https://science.nrao.edu/science/surveys/littlethings

This image contains the velocity values in m/s of the galaxy, determined from the beam-smoothed first moment.  All the values outside of the galaxy’s emission have been blanked with NaNs.  See Hunter, et al., 2012 in AJ for more details on the data processing.

Let’s calculate the rms value of the velocity values in all of the pixels.  Recall that RMS=sqrt(values^2/number_summed)   To make the process easier to follow, we’ll break it up into parts:

import pyfits

#Import the data with pyfits.  It still has additional dimensions, so we will only load the ones we care about - the last two
#veldata = pyfits.getdata('<path to where you saved the .FITS file>/CVnIdwA_NA_XMOM1.FITS')[0,0,:,:]
veldata = pyfits.getdata('CVnIdwA_NA_XMOM1.FITS')[0,0,:,:]
veldata_masked = np.ma.masked_invalid(veldata)

#We will sum all the non-NaN pixels.  We can calculate the number of pixels to be used in a number of ways.
#...Pick your favorite.

n_used = np.count_nonzero( veldata_masked.compressed() )
# -- OR -- #
n_used = np.sum(-veldata_masked.mask) #Note the negative sign.

rms = np.sqrt( np.sum(veldata_masked**2) / n_used)
# --> pixrms = 311076.84 m/s

# -- OR an example of on-the-fly calculation -- #
rms_v2 = np.sqrt( np.nansum(veldata**2) / np.count_nonzero( np.ma.masked_invalid(veldata).compressed() ))

#Now if we want the RMS of the difference from the mean:
rms_dif = np.sqrt( np.sum( (veldata_masked - np.mean(veldata_masked))**2 ) / n_used)
# --> The RMS difference of a pixel's velocity from the mean value is 9550.46 m/s

Note that we could also have done simple RMS calculation without masked arrays as shown above.  But there is no numpy.nanmean() to use for the RMS difference calculation. This is one example of the versatility of masked arrays.

If we simply want to set the NaNs to a number (say, 0) for processing in some later routine, we can do it easily with np.ma.fix_invalid()

veldata_fix_temp = np.ma.fix_invalid(veldata,fill_value=0)
#Then call the fixed array with:
veldata_fixed = veldata_fix_temp.data

#Or, just do it on-the-fly:
veldata_fixed = np.ma.fix_invalid(veldata,fill_value=0).data
np.max(veldata_fixed) # --> Max = 338740.47 m/s



You may want to leave your image with NaNs instead of filling with 0s, depending on what you’re interested in showing.  If you are plotting with pyplot.imshow() and you want to emphasize the regions that are masked, you can do this quite effectively with a clever choice of color map.  It is also a convenient trick if you want the background transparent instead of black for publications, etc.  It’s a matter of preference, really.  See the plots below. (The code to produce them is there as well.)  For a primer on plotting with kapteyn and pywcsgrid2, see my tutorial here.[link] Replace Plot NaNs with 0s

from matplotlib import pyplot as plt

plt.rc('font',family='serif') #Just because I prefer serif fonts on my plots

velheader=pyfits.getheader('CVnIdwA_NA_XMOM1.FITS')
veldata=pyfits.getdata('CVnIdwA_NA_XMOM1.FITS')[0,0,:,:]
veldata_fixed = np.ma.fix_invalid(veldata,fill_value=0).data

rms=np.sqrt( np.nansum(veldata**2)/np.count_nonzero(np.ma.masked_invalid(veldata).compressed()))
rms_dif=np.sqrt(np.sum((veldata_masked-np.mean(veldata_masked))**2)/np.count_nonzero(np.ma.masked_invalid(veldata).compressed()))

#To plot wcs coordinates nicely, use the kapteyn or pywcsgrid2 modules.  I demonstrate use of pywcsgrid2 here.
#If you simply wanted to plot the data as you would any other image (without WCS coords), you could use the following code instead:
#
# fig2=plt.figure(2)
# ax1=fig2.add_subplot(1,2,1)
# plt.imshow(veldata,origin='lower',interpolation='nearest')
#
# ax2=fig2.add_subplot(1,2,2)
# plt.imshow(veldata_fixed,origin='lower',interpolation='nearest')
#

import pywcsgrid2

fig2=plt.figure(2)
ax1=pywcsgrid2.subplot(121,header=velheader)
plt.imshow(veldata[110:390,110:390],origin='lower',interpolation='nearest',cmap='gist_heat')
plt.title('With NaNs')
ax1.text(.05,.05,'RMS vel. = %.2e m/s'%rms,ha='left',va='bottom',transform=ax1.transAxes,size=8,bbox=None)

ax2=pywcsgrid2.subplot(122,header=velheader)
plt.imshow(veldata_fixed[110:390,110:390],origin='lower',interpolation='nearest',cmap='gist_heat',vmin=np.nanmin(veldata))
plt.title('NaNs $\\rightarrow$ 0')
ax2.text(.05,.05,'RMS vel. from mean = %.2e m/s'%rms_dif,ha='left',va='bottom',transform=ax2.transAxes,size=8,bbox=None,color='w')

plt.suptitle('CVnIdwA Integrated Velocity Maps',x=.5,y=.8,size=16)
plt.subplots_adjust(wspace=.5)

#plt.show()
plt.savefig('NaN_plot_example2.png',bbox_inches='tight')
plt.clf()

NaN_plot_example2

Leave a Reply to Anonymous Cancel reply

Your email address will not be published. Required fields are marked *