Reading and writing files

Authors

Andrew Valentine

Louis Moresi

Summary

In this section we learn about data-file handling (reading / writing information) which is a pre-cursor to doing useful computing at scale. We will use some very low-level tools to read and write to storage. In practice, we would probably use packages that can read common data formats, download data from the web, or work with datasets that are too large to read all at once.

Up to this point, we have typed all the data for our programs in ‘by hand’. However, as you have no doubt noticed, this quickly gets tedious. It is useful to be able to read and write data files, allowing information to be stored and shared with other people.

In order to read a data file, we need to understand what information it contains, and how this has been encoded. This is generally referred to as the ‘file format’. Different programs produce files in different formats - a photograph in .jpeg format cannot be read by a spreadsheet package, which might expect to receive files in .xlsx format.

The simplest file format for storing and transferring scientific data is a plain text file in ‘ascii’ (‘American Standard Code for Information Exchange’) format. This is the sort of file that can be read and produced using a simple text editor such as ‘notepad’ (on Windows) or ‘TextEdit’ (on a Mac). Commonly, such files will have an extension such as .txt or .dat.

Reading ascii files in Python is a four-step process: 1. Open the file; 2. Read each line from the file; 3. Parse each line (i.e., extract the required information from it); 4. Close the file.

We need to make sure the data file we are going to use is saved on your computer so we first have to make a copy into the sandbox file system we are using (this area is deleted whenever we reload this page)

We can check to see if the file is available

Reading data from a file

To open a file, we use the open() function, which returns a file object. It is important to assign this to a variable (here, fp) so that we can continue to access the open file.

filename = 'sample.dat'
fp = open(filename, 'r')

Here, filename is a variable holding the file name (or file path and name, if it is not in our current working directory), and the second argument, 'r', specifies that we want to open the file for reading only.

Once the file is open, we can read from it. There are various ways of doing this, but for small files the simplest method is generally to call the readlines() function, which returns the entire file in the form of a list:

lines = fp.readlines()

Each element of the list lines is now a string, containing one of the lines from the file sample.dat. Since we have read all the information in the file, we can now close the file:

fp.close()

Open the file and print out the individual lines to see what it contains.

Here is a hint that you might see in python code elsewhere. If you precede a structure like a tuple or list with a ’ * ’ then it is unpacked before being used. This might be helpful if you want to make the list of lines look more like the original file when printed. Another way would be to write a loop over all the lines and print them individually.

    t = (1,2,3,4)
    print(t)
    print(*t)

    # Output:
    # (1, 2, 3, 4)
    # 1 2 3 4
    l = ['a', 'b', 'c', '...', 'z']
    print(l)
    print(*l)

    # Output:
    # ['a', 'b', 'c', '...', 'z']
    # a b c ... z

python with statement

Opening and closing files explicitly is useful to illustrate how Python handles reading and writing files. However, doing this in practice can get quite messy because these ‘connections’ to files stay open until you explicitly close them. With single files, this can lead to data corruption and data loss, with more complex scripts you might open 10,000 files and not close any of them - you may then run into some limit on the system that will block your script or may block some other work that you are trying to do.

Python has a really handy construct for dealing with this issue automatically and you will doubtless see code that uses ‘context managers’ aka with statements.

This is not a very intuitive choice of language but, once you recognise what is happening, you will be able to understand python programs more easily. Any time you have an action which needs some clearing-up to be done afterwards, the context manager handles everything behind the scenes.

In the case of reading/writing files, you create a ‘context’ that contains a connection to a file, which is automatically closed when the code within the context is finished. This is a bit complicated to follow, but file-management is a very clear example of how it works.

filename = 'sample.dat'
with open(filename, 'r') as fp:
    lines = fp.readlines()
    ...
    ...

    # fp.close() is handled automatically here

The with statement tells python that the file you’re giving it is only used in the following indented code, and can be closed afterwards. This performs exactly the same as manually opening and closing the file, as above, but automatically cleans up after you.

Can you test to see if the fp was closed / deleted after the statement completed ? What about the contents of the file, are they available once the with statement is done ?

Once you have a list of strings, you can use the list- and string-parsing tools we have already encountered to extract the necessary data and store it in appropriate data structures. The file sample.dat contains records of the mass and volume of twenty samples of a given material.

::: {.callout-tip collapse=true icon=“false”}

Exercise 3 - work with the data

Read the data from this file, compute the density of each sample and hence the average density of the material.

  • You will need to parse the individual strings and store the information for your calculations.

  • Output the information to the screen with print statements so that you can validate your results.

:::

Writing data to a file

We don’t want to have to repeat this work once the processing is done, so we would like to write out the new information to a file.

To write data to file, we need to first open the file for writing. This can be done by using ‘w’ instead of ‘r’ when opening the file. Note that if a file already exists with the specified name, it will be immediately overwritten and lost. To avoid this, you can instead use ‘x’ when opening the file. This will throw an error if there is an existing file, rather than overwrite it. Again, when opening the file it is important to assign the result of open() to a variable, so we can write to it.

Once the file is open, we can write any text strings to it by calling the write() function. Remember, to insert a new line, you use the symbol '\n', otherwise everything will be on a single line:

fp.write("Hello!\n")
fp.write("This is some text to store in a file... ")
line = "The file has only two lines."
fp.write(line)

Once everything is written to the file, call close() to close it.

fp.close()

Create a new file, based on the data you read earlier. It should contain three columns: mass, sample volume and sample density. All quantities should be in SI units.

*Remember, you can use the string-formatting tools we encountered in the last exercise to control how your numbers are written out. Verify that the file has been correctly written (e.g. Exercise 1)

From the examples above, we just saw how to read and write data using Python built-in methods. These are good for simple files, but not for more complex information such as an Excel spreadsheet. Later in the course, we will encounter a number of more sophisticated tools that can help us with these kinds of files.

Sandbox