Saturday, May 9, 2009

Working with Data: Record Arrays

A few posts are going to be directed at people who are not as familiar with Python and/or SciPy. I will try to assume as little as possible about what the user knows. Over the summer these types of posts might be put together as a larger tutorial.

One of the goals of the SciPy stats project is to provide a transparent way to work with the different array types. Towards this end, we are going to work with some data as record arrays. (Note: right now this is only supported for files containing ASCII characters. If your files contain anything else, you must do some data cleaning beforehand with Python.)

We have four data files: a comma-delimited .csv file with variable labels (headers) and all-numeric data (here, info here), a comma-delimited file with headers and a mix of numeric and string variables (here, info here), this same file without headers (here), and this same file without headers and tab-delimited (here).

First, we are going to put the data into record arrays. Record arrays are simply structured arrays whose fields can also be accessed as attributes (educ_dta.ahe in addition to educ_dta['ahe']). If you are interested, you can have a look here for a bit more about the basic data types in NumPy. Then see here for more on structured and record arrays (and why access to record arrays is slower) and here for more details on creating and working with record arrays.
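
To make the distinction concrete, here is a minimal sketch using a small made-up array (the values and the two field names are only for illustration, not taken from the data files). It builds a structured array and then views the same memory as a record array, which is what adds the attribute access.

```python
import numpy as np

# Hypothetical toy data with two fields: 'ahe' (float) and 'age' (int).
data = np.array([(12.5, 30), (6.25, 25), (17.31, 41)],
                dtype=[('ahe', 'f8'), ('age', 'i8')])

# A plain structured array only supports access by field name.
print(data['ahe'])

# Viewing the same memory as a record array adds attribute access.
rec = data.view(np.recarray)
print(rec.ahe)
print(rec.age.mean())
```

The view does not copy anything, so rec and data share the same underlying buffer.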

First download the data files and put them in a directory. I am using Linux, so my directory paths will look a little different than those using Windows, but all of the other details should be the same.


>>>import numpy as np
>>>educ_dta=np.recfromcsv('/home/skipper/scipystats/scipystats/data/educ_data.csv')


We now have our data in a record array. We can take a closer look at what's going on by typing


>>>educ_dta.dtype
dtype([('ahe', '<f8'), ('female', '<i8'), ('ne', '<i8'), ('midwest', '<i8'), ('south', '<i8'), ('west', '<i8'), ('race', '<i8'), ('yrseduc', '<i8'), ('ba', '<i8'), ('hsdipl', '<i8'), ('age', '<i8')])
>>>educ_dta.dtype.names
('ahe', 'female', 'ne', 'midwest', 'south', 'west', 'race', 'yrseduc', 'ba', 'hsdipl', 'age')
>>>educ_dta.ahe
array([ 12.5 , 6.25, 17.31, ..., 9.13, 11.11, 14.9 ])
>>>educ_dta['ahe']
array([ 12.5 , 6.25, 17.31, ..., 9.13, 11.11, 14.9 ])
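
Once the fields are available by name, they can be used like any other NumPy array, for example to select rows. The sketch below uses a tiny stand-in for educ_dta with made-up values and only three of the fields, so the numbers mean nothing; it just shows the pattern.

```python
import numpy as np

# Toy stand-in for educ_dta (values are invented for illustration).
toy = np.array([(12.5, 0, 30), (6.25, 1, 25), (17.31, 1, 41)],
               dtype=[('ahe', '<f8'), ('female', '<i8'), ('age', '<i8')]
               ).view(np.recarray)

# Boolean mask built from one field selects whole rows (records).
women = toy[toy.female == 1]
print(women.ahe.mean())

# A single row is a record; both access styles work on it too.
first = toy[0]
print(first.ahe, first['age'])
```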


We can do the same for each of our other data files. First we have another dataset that contains a string variable for the state name. We proceed just as above, though the string variables will have to be handled differently when used in a statistical model. Then we have the same file, but without any header information. Last we have a file without headers that is tab-delimited.


>>>gun_dta=np.recfromcsv('/home/skipper/scipystats/scipystats/data/handguns_data.csv')
>>>gun_dta_nh=np.recfromcsv('/home/skipper/scipystats/scipystats/data/handguns_data_noheaders.csv', names=None)
>>>gun_dta_tnh=np.recfromtxt('/home/skipper/scipystats/scipystats/data/handguns_data_tab_noheaders', names=None, delimiter='\t')
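
If you want to see what the header and delimiter options do without downloading anything, here is a small sketch that reads from in-memory text instead of files. The rows and the column names (state, year, rate) are made up for illustration and are not the contents of the handgun files. It calls np.genfromtxt directly, which should behave the same way across NumPy versions, and passes encoding explicitly as an assumption so that string columns come back as text.

```python
import io
import numpy as np

# Hypothetical sample data: one string column and two numeric columns.
csv_with_header = "state,year,rate\nAlabama,1977,414.4\nAlaska,1977,443.4\n"
tab_no_header = "Alabama\t1977\t414.4\nAlaska\t1977\t443.4\n"

# names=True takes the field names from the first line of the file.
with_names = np.genfromtxt(io.StringIO(csv_with_header), delimiter=',',
                           names=True, dtype=None,
                           encoding='utf-8').view(np.recarray)

# names=None means there is no header line; fields get default names.
no_names = np.genfromtxt(io.StringIO(tab_no_header), delimiter='\t',
                         names=None, dtype=None,
                         encoding='utf-8').view(np.recarray)

print(with_names.dtype.names)  # ('state', 'year', 'rate')
print(no_names.dtype.names)    # ('f0', 'f1', 'f2')
print(no_names.f0)
```

Without headers the fields are simply numbered f0, f1, and so on, so you will probably want to pass your own names once you know what the columns are.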


You can learn more about the possibilities for loading data into arrays here, or by having a look at the docstring for np.genfromtxt in the Python interpreter. The functions recfromcsv, recfromtxt, etc. all call genfromtxt; they just have different default values for the keyword arguments. (loadtxt is a separate, simpler reader meant for files with no missing values.)
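
To illustrate the idea of a wrapper that only changes defaults, here is a rough sketch of a helper built on np.genfromtxt. The name read_csv_records is hypothetical, and the defaults chosen here (comma delimiter, names from the first line, inferred dtypes, lowercased field names) are my assumption of what a CSV-friendly set looks like, not a copy of NumPy's own source.

```python
import io
import numpy as np

def read_csv_records(source, **kwargs):
    """Hypothetical helper: genfromtxt with CSV-friendly defaults."""
    kwargs.setdefault('delimiter', ',')           # comma-separated values
    kwargs.setdefault('names', True)              # first line holds the names
    kwargs.setdefault('dtype', None)              # infer a type per column
    kwargs.setdefault('case_sensitive', 'lower')  # lowercase the field names
    kwargs.setdefault('encoding', 'utf-8')        # read strings as text
    return np.genfromtxt(source, **kwargs).view(np.recarray)

# Made-up two-row sample with mixed-case headers.
sample = io.StringIO("AHE,Female\n12.5,0\n6.25,1\n")
recs = read_csv_records(sample)
print(recs.dtype.names)  # ('ahe', 'female')
print(recs.ahe)
```

Any keyword passed by the caller overrides the default, so read_csv_records(f, delimiter='\t', names=None) would handle a tab-delimited file without headers.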