Exercise 14 - Structured Data in Python (pandas)
Contents
Exercise 14 - Structured Data in Python (pandas
)¶
Andrew Valentine, Louis Moresi, louis.moresi@anu.edu.au
You should now know how to open files, read data tables from them using numpy, and write your outputs to new files. Getting data in/out of Python is a routine task for data analysis, and people have put a lot of effort into making it as simple as possible. pandas
is one of the main data science modules, which specialises in handling ‘spreadsheet’-type data - i.e. a data table with column and/or row labels.
In this practical you’ll learn the basics of reading and writing data using pandas
, some basic data manipulation and plotting.
In addition to this practical, pandas
has a number of excellent tutorials focussed at beginners. There is also a chapter about pandas
in the Python Data Science Handbook. If your work is based around spreadsheet-type data, we strongly recommend going through these tutorials in due course.
pandas
basics¶
The central concept of pandas
is the DataFrame
. This is a 2D table of data, with associated row and column labels - just like a spreadsheet. A DataFrame
can be ‘indexed’ by row or column names - i.e. you use the row/column labels, instead of their numeric indices like you do in numpy
arrays.
pandas
started as a convenient way to import and export labeled data, but has grown into something a lot more comprehensive, and now does basic statistics, plotting, time-series analysis, ‘split-apply-combine’ data grouping… and much more.
You import pandas just like any other module; conventionally, we use
import pandas as pd
To create a DataFrame
, you need some 2D data, which you give to pandas
:
df = pd.DataFrame(data)
➤ Generate a numpy array of random data with 3 rows and 4 columns, and make a DataFrame
.
# Try it here!
When you look at the DataFrame
, it should look like a nice table in the Jupyter notbook, with numbered rows and columns. These numbers are the same as the numpy
indices that you could use to access these rows/column.
You can access these data by these indices, just like you would in numpy
. However, because DataFrames
are quite a lot more complex than numpy
arrays, you can’t index them directly - you have to use .loc[]
.
➤ Use the .loc[]
method to index your DataFrame
. Compare it to numpy
indexing.
# Try it here!
So far, everything we’ve done can be achieved with equivalent convenience in numpy
… so what’s the point? The strength of pandas
is in its use of data labels, so you can associate rows/columns with meaningful information, rather than numerical indices.
This can be really useful if, for example, you’re working with a dataset that you add new columns to over time. If you were using numpy, the numeric indices of the columns of interest might change, and break your code. With pandas
, your code will be fine as long as the labels of the columns stays the same.
Let’s assign some labels to your dataset. You can do this by setting the columns
or index
(rows) attributes of your dataframe with a list of strings whose length is equal to the number of columns or rows. For example,
df = pd.DataFrame(...)
df.columns=['Label_1', 'Label_2',...,'Label_N']
df.index = ['Row_1','Row_2',...,'Row_M']
What should labels be?¶
You can call columns/rows whatever you like. A column name could be a whole paragraph, if you really wanted… but this is a really bad idea. You should use labels that are unique, meaningful and concise.
For example, Temperature
isn’t great - what are the units? What if you have Kelvin, Centigrade and Fahrenheit columns? Temperature °C
is better, but will be cumbersome in routine use, because it’s long, and has special characters and spaces in. T_C
is ideal - it’s nice and short, containing a commonly used abbreviation for temperature (‘T’), and the unit.
Bonus: If your label names start with a letter and don’t contain spaces or special characters, you can access them directly using the .
operator, without needing to use .loc[]
. For example:
df.Label_1
df.loc[:, 'Label_1']
both produce return the same column, but df.Label_1
brings the added benefit of having Tab auto-completion for your column names! This can be really convenient if, for example, you want to access columns during plotting.
➤ Label the columns of your dataframe as [A, B, C, D], and your rows [X, Y, Z]:
# Try it here!
When you look at your dataframe again, you should see that it now displays the column/row names that you gave it.
➤ Access column ‘B’ of your data:
Hint: If you want to select a particular column, you’ll have to use the :
operator to specify ‘all row values’, just like in numpy
.
# Try it here!
You may also notice here that numeric indices no longer work with the .loc[]
operator. You can still use numeric indices if you really want to using .iloc[]
… although if you’re doing this all the time you’d be better off using numpy
instead of pandas.
Just like numpy
arrays, DataFrames
also contain a number of useful properties, such as .size
and .shape
, which tell you useful information about the data.
➤ Try it here:
# Try it here!
Read & Write Data¶
Pandas has a number of functions for reading and writing data in a wide range of formats.
Data reading functions are available within the pandas module in the form pd.read_FORMAT()
, and return a DataFrame
. Writing functions are available as a method directly accessible from the DataFrame
, and are in the form dataframe.to_FORMAT()
. In both these cases, replace FORMAT
with the desired data format (e.g. ‘excel’ or ‘csv’). Both of these functions take a file path and a number of other arguments that modify their behaviour.
➤ Load the file ‘boston.csv’ into a DataFrame
, find out how large the dataset is, and look at the top 5 rows.
Hint 1: Take a look at the file in a text editor first, and then use pandas’ read_csv
function. You’ll probably want to use the comment
parameter to tell pandas that lines beginning with #
are comments and should be ignored!
Hint 2: Try the .shape
attribute.
Hint 3: Use the .head()
method to see the top N lines (and .tail()
to see the bottom N!)
# Try it here!
➤ Save the Boston DataFrame
as an html table
Take a look at the resulting files, to make sure they look right. Have a play around with parameters (e.g. index=False
to remove the row numbers).
Microsoft Excel¶
Is irritatingly common in science. Irritating, because it saves data in a proprietary binary format which can’t be easily read without having a copy of Microsoft Excel (or sometimes, the right version of Microsoft Excel!). There are ways of importing Excel files into Python, but most are quite complex. pandas
offers a very easy solution to this problem.
➤ Load the file iris.xlsx
into a DataFrame
, with species
as the index (row) labels
Hint 1: the data are in a sheet called ‘iris_data’.
Hint 2: Everything is installed on the RSES Jupyter server, but you will need to install the package xlrd
to read Excel spreedsheets with Pandas on your computer.
iris = pd.read_excel('iris.xlsx', sheet_name='iris_data', index_col=0)
iris.head()
➤ Save the iris DataFrame
as a csv
➤ Save the iris DataFrame
as a \(\LaTeX\) table
This can be really useful if you do your work in \(\LaTeX\), as you might for a MSc or PhD thesis!
Reading from the Web(!)¶
pandas
can also read data directly from the web, which is useful if you’re using a central database which is regularly updated.
For example, the famous Mauna Loa Atmospheric CO2 Dataset, which is updated monthly.
The URL for the monthly dataset is: ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt
➤ Look at this dataset in your browser, and read it into a pandas.DataFrame
using the pd.read_table
function.
Hint 1: you’ll need to specify the delimiter between the data columns using the delimiter
argument.
Hint 2: because of data formatting choices, there’s no simple way to get the column headings from this dataset, and you’ll have to assign them manually.
co2 = pd.read_table('ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt', comment='#', delimiter='\s+', header=None)
co2.columns = ['year', 'month', 'decimal_year', 'co2_average', 'co2_interpolated', 'co2_trend', 'n_days']
co2.head()
Plotting & Data Cleaning¶
Let’s have a look at this CO2 record. Combine what you learned about indexing DataFrames and plotting to make a line graph of time (decimal year) vs. average CO2.
➤ Make a plot! You can simply pass the dataframe columns to plt.plot()
as if they were Numpy arrays.
# Try it here!
import matplotlib.pyplot as plt
This looks mad! There are several points where average CO2 is less than zero… what’s going on?
You’ll see from the header of the Mauna Load CO2 record that values that are equal the -99.99 represent ‘missing data’. We therefore want to exclude this data from further analyses.
➤ Do this by using the .replace()
method to substitute np.nan
for -99.99
, then re-draw the plot
Hint: you want to replace these values in the original dataset, rather than return a copy of the data with these values replaced. Think about what the inplace
argument does in this function.
# Try it here!
Much better! Let’s keep a copy of this data for future offline use.
➤ Save the co2 data as a csv file.
# Try it here!
You’ll notice this data has an overall trend, and a period oscillation around the trend. This oscillation is caused by seasonal changes in the balance between respiration and photosynthesis - CO2 drops in the summer, when plants in the Northern hemisphere are photosynthesising more.
Let’s try to isolate the overall trend by working out a yearly, instead of monthly trend.
We can do this using the ‘split-apply-combine’ methodology. We ‘split’ the data according to the value of a particular field, ‘apply’ a function to each subset, and the ‘combine’ all the results back together.
In pandas
we do this using the .groupby()
function to specify that all records with the same entry for a particular field should be treated as a group, followed by aggregate
(or apply
, for more complex operations) to specify how each group should be reduced into a single number.
➤ Calculate and plot the annual mean CO2 for the Mauna Loa record
# Try it here!
Subsets¶
Often, you might want to create a subset of a dataset. For example, you might wish to isolate only the year
and co2_average
information from the yearly mean.
You can do this by indexing the dataset, and assigning it to a new name.
➤ Create a subset of the annual mean data containing only the co2_average
column
# Try it here!
reset_index()¶
In this subset you’ll notice that you only have one column, but the ‘year’ information is preserved in the index
of the data. You can turn the index back into a data column using .reset_index()
.
➤ Try it here!
Merge Dataframes¶
Let’s try to look at the seasonal oscillation in more detail. To do this, we need to subtract the annual mean. However, at the moment our annual mean and our seasonal data are in two different DataFrames
of different sizes.
pandas
has a number of functions for combining and merging DataFrames
in different ways. Let’s have a go at using pd.merge()
to combine the annual mean dataset with the original data.
➤ Try it here! You will need to look at the help for pd.merge()
to work out how to do this.
Operations and New Columns¶
Now we’ve combined them, we need to subtract the annual mean from the monthly signal to isolate the seasonal trend.
➤ Create a new column containing the seasonal signal
Hint: If you try to assign something to an index that doesn’t exist, pandas
creates a new column.
Selecting Data¶
Remember indexing from above? You can use a similar approach in pandas
to select data in a number of ways. For example, we can use ‘logical indexing’ to select data only from the last 10 years.
To do this, we use logical operators (>
, <
, ==
) to create a boolean
array the same size as a dimension of our array. For example:
ind = co2.loc[:, 'year'] >= 2008
In this example, ind
will be True
wherever year is greater than or equal to 2008.
We can then apply this to select a subset of our DataFrame
:
last_ten = co2.loc[ind, :]
➤ Create a subset containing the years between 1980-1990, and plot the seasonal trend.