# Exercise 14 - Structured Data in Python (pandas)

## Contents

# Exercise 14 - Structured Data in Python (`pandas`

)¶

*Andrew Valentine, Louis Moresi, louis.moresi@anu.edu.au*

You should now know how to open files, read data tables from them using numpy, and write your outputs to new files. Getting data in/out of Python is a routine task for data analysis, and people have put a lot of effort into making it as simple as possible. `pandas`

is one of the main data science modules, which specialises in handling ‘spreadsheet’-type data - i.e. a data table with column and/or row labels.

In this practical you’ll learn the basics of reading and writing data using `pandas`

, some basic data manipulation and plotting.

In addition to this practical, `pandas`

has a number of excellent tutorials focussed at beginners. There is also a chapter about `pandas`

in the Python Data Science Handbook. If your work is based around spreadsheet-type data, we strongly recommend going through these tutorials in due course.

`pandas`

basics¶

The central concept of `pandas`

is the `DataFrame`

. This is a 2D table of data, with associated row and column labels - just like a spreadsheet. A `DataFrame`

can be ‘indexed’ by row or column names - i.e. you use the row/column *labels*, instead of their numeric indices like you do in `numpy`

arrays.

`pandas`

started as a convenient way to import and export labeled data, but has grown into something a lot more comprehensive, and now does basic statistics, plotting, time-series analysis, ‘split-apply-combine’ data grouping… and much more.

You import pandas just like any other module; conventionally, we use

```
import pandas as pd
```

To create a `DataFrame`

, you need some 2D data, which you give to `pandas`

:

```
df = pd.DataFrame(data)
```

**➤ Generate a numpy array of random data with 3 rows and 4 columns, and make a DataFrame.**

```
# Try it here!
```

When you look at the `DataFrame`

, it should look like a nice table in the Jupyter notbook, with numbered rows and columns. These numbers are the same as the `numpy`

indices that you could use to access these rows/column.

You can access these data by these indices, just like you would in `numpy`

. However, because `DataFrames`

are quite a lot more complex than `numpy`

arrays, you can’t index them directly - you have to use `.loc[]`

.

**➤ Use the .loc[] method to index your DataFrame. Compare it to numpy indexing.**

```
# Try it here!
```

So far, everything we’ve done can be achieved with equivalent convenience in `numpy`

… so what’s the point? The strength of `pandas`

is in its use of data labels, so you can associate rows/columns with meaningful information, rather than numerical indices.

This can be really useful if, for example, you’re working with a dataset that you add new columns to over time. If you were using numpy, the numeric indices of the columns of interest might change, and break your code. With `pandas`

, your code will be fine as long as the labels of the columns stays the same.

Let’s assign some labels to your dataset. You can do this by setting the `columns`

or `index`

(rows) attributes of your dataframe with a list of strings whose length is equal to the number of columns or rows. For example,

```
df = pd.DataFrame(...)
df.columns=['Label_1', 'Label_2',...,'Label_N']
df.index = ['Row_1','Row_2',...,'Row_M']
```

### What should labels be?¶

You can call columns/rows whatever you like. A column name could be a whole paragraph, if you really wanted… but this is a really bad idea. You should use labels that are **unique**, **meaningful** and **concise**.

For example, `Temperature`

isn’t great - what are the units? What if you have Kelvin, Centigrade *and* Fahrenheit columns? `Temperature °C`

is better, but will be cumbersome in routine use, because it’s long, and has special characters and spaces in. `T_C`

is ideal - it’s nice and short, containing a commonly used abbreviation for temperature (‘T’), and the unit.

**Bonus:** If your label names start with a letter and don’t contain spaces or special characters, you can access them directly using the `.`

operator, without needing to use `.loc[]`

. For example:

```
df.Label_1
df.loc[:, 'Label_1']
```

both produce return the same column, but `df.Label_1`

brings the added benefit of having `Tab` auto-completion for your column names! This can be really convenient if, for example, you want to access columns during plotting.

**➤ Label the columns of your dataframe as [A, B, C, D], and your rows [X, Y, Z]:**

```
# Try it here!
```

When you look at your dataframe again, you should see that it now displays the column/row names that you gave it.

**➤ Access column ‘B’ of your data:**

**Hint:** If you want to select a particular column, you’ll have to use the `:`

operator to specify ‘all row values’, just like in `numpy`

.

```
# Try it here!
```

You may also notice here that numeric indices no longer work with the `.loc[]`

operator. You *can* still use numeric indices if you *really* want to using `.iloc[]`

… although if you’re doing this all the time you’d be better off using `numpy`

instead of pandas.

Just like `numpy`

arrays, `DataFrames`

also contain a number of useful properties, such as `.size`

and `.shape`

, which tell you useful information about the data.

**➤ Try it here:**

```
# Try it here!
```

### Read & Write Data¶

Pandas has a number of functions for reading and writing data in a wide range of formats.

Data reading functions are available within the pandas module in the form `pd.read_FORMAT()`

, and return a `DataFrame`

. Writing functions are available as a method directly accessible from the `DataFrame`

, and are in the form `dataframe.to_FORMAT()`

. In both these cases, replace `FORMAT`

with the desired data format (e.g. ‘excel’ or ‘csv’). Both of these functions take a file path and a number of other arguments that modify their behaviour.

**➤ Load the file ‘boston.csv’ into a DataFrame, find out how large the dataset is, and look at the top 5 rows.**

**Hint 1:** Take a look at the file in a text editor first, and then use pandas’ `read_csv`

function. You’ll probably want to use the `comment`

parameter to tell pandas that lines beginning with `#`

are comments and should be ignored!

**Hint 2:** Try the `.shape`

attribute.

**Hint 3:** Use the `.head()`

method to see the top N lines (and `.tail()`

to see the bottom N!)

```
# Try it here!
```

**➤ Save the Boston DataFrame as an html table**

Take a look at the resulting files, to make sure they look right. Have a play around with parameters (e.g. `index=False`

to remove the row numbers).

### Microsoft Excel¶

Is irritatingly common in science. Irritating, because it saves data in a proprietary binary format which can’t be easily read without having a copy of Microsoft Excel (or sometimes, the right *version* of Microsoft Excel!). There are ways of importing Excel files into Python, but most are quite complex. `pandas`

offers a very easy solution to this problem.

**➤ Load the file iris.xlsx into a DataFrame, with species as the index (row) labels**

**Hint 1:** the data are in a sheet called ‘iris_data’.

**Hint 2:** Everything is installed on the RSES Jupyter server, but you will need to install the package `xlrd`

to read Excel spreedsheets with Pandas on your computer.

```
iris = pd.read_excel('iris.xlsx', sheet_name='iris_data', index_col=0)
```

```
iris.head()
```

**➤ Save the iris DataFrame as a csv**

**➤ Save the iris DataFrame as a \(\LaTeX\) table**

This can be *really* useful if you do your work in \(\LaTeX\), as you might for a MSc or PhD thesis!

### Reading from the Web(!)¶

`pandas`

can also read data directly from the web, which is useful if you’re using a central database which is regularly updated.

For example, the famous Mauna Loa Atmospheric CO2 Dataset, which is updated monthly.

The URL for the monthly dataset is: ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt

**➤ Look at this dataset in your browser, and read it into a pandas.DataFrame using the pd.read_table function.**

**Hint 1:** you’ll need to specify the delimiter between the data columns using the `delimiter`

argument.

**Hint 2:** because of data formatting choices, there’s no simple way to get the column headings from this dataset, and you’ll have to assign them manually.

```
co2 = pd.read_table('ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt', comment='#', delimiter='\s+', header=None)
co2.columns = ['year', 'month', 'decimal_year', 'co2_average', 'co2_interpolated', 'co2_trend', 'n_days']
```

```
co2.head()
```

## Plotting & Data Cleaning¶

Let’s have a look at this CO2 record. Combine what you learned about indexing DataFrames and plotting to make a line graph of time (decimal year) vs. average CO2.

**➤ Make a plot!** You can simply pass the dataframe columns to `plt.plot()`

as if they were Numpy arrays.

```
# Try it here!
import matplotlib.pyplot as plt
```

This looks mad! There are several points where average CO2 is less than zero… what’s going on?

You’ll see from the header of the Mauna Load CO2 record that values that are equal the -99.99 represent ‘missing data’. We therefore want to exclude this data from further analyses.

**➤ Do this by using the .replace() method to substitute np.nan for -99.99, then re-draw the plot**

**Hint:** you want to replace these values in the original dataset, rather than return a copy of the data with these values replaced. Think about what the `inplace`

argument does in this function.

```
# Try it here!
```

Much better! Let’s keep a copy of this data for future offline use.

**➤ Save the co2 data as a csv file.**

```
# Try it here!
```

You’ll notice this data has an overall trend, and a period oscillation around the trend. This oscillation is caused by seasonal changes in the balance between respiration and photosynthesis - CO2 drops in the summer, when plants in the Northern hemisphere are photosynthesising more.

Let’s try to isolate the overall trend by working out a yearly, instead of monthly trend.

We can do this using the ‘split-apply-combine’ methodology. We ‘split’ the data according to the value of a particular field, ‘apply’ a function to each subset, and the ‘combine’ all the results back together.

In `pandas`

we do this using the `.groupby()`

function to specify that all records with the same entry for a particular field should be treated as a group, followed by `aggregate`

(or `apply`

, for more complex operations) to specify how each group should be reduced into a single number.

**➤ Calculate and plot the annual mean CO2 for the Mauna Loa record**

```
# Try it here!
```

## Subsets¶

Often, you might want to create a subset of a dataset. For example, you might wish to isolate only the `year`

and `co2_average`

information from the yearly mean.

You can do this by indexing the dataset, and assigning it to a new name.

**➤ Create a subset of the annual mean data containing only the co2_average column**

```
# Try it here!
```

## reset_index()¶

In this subset you’ll notice that you only have one column, but the ‘year’ information is preserved in the `index`

of the data. You can turn the index back into a data column using `.reset_index()`

.

**➤ Try it here!**

## Merge Dataframes¶

Let’s try to look at the seasonal oscillation in more detail. To do this, we need to subtract the annual mean. However, at the moment our annual mean and our seasonal data are in two different `DataFrames`

of different sizes.

`pandas`

has a number of functions for combining and merging `DataFrames`

in different ways. Let’s have a go at using `pd.merge()`

to combine the annual mean dataset with the original data.

**➤ Try it here!** You will need to look at the help for `pd.merge()`

to work out how to do this.

## Operations and New Columns¶

Now we’ve combined them, we need to subtract the annual mean from the monthly signal to isolate the seasonal trend.

**➤ Create a new column containing the seasonal signal**

**Hint:** If you try to assign something to an index that doesn’t exist, `pandas`

creates a new column.

## Selecting Data¶

Remember indexing from above? You can use a similar approach in `pandas`

to select data in a number of ways. For example, we can use ‘logical indexing’ to select data only from the last 10 years.

To do this, we use logical operators (`>`

, `<`

, `==`

) to create a `boolean`

array the same size as a dimension of our array. For example:

```
ind = co2.loc[:, 'year'] >= 2008
```

In this example, `ind`

will be `True`

wherever year is greater than or equal to 2008.

We can then apply this to select a subset of our `DataFrame`

:

```
last_ten = co2.loc[ind, :]
```

**➤ Create a subset containing the years between 1980-1990, and plot the seasonal trend.**