Strings

Authors

Louis Moresi

Andrew Valentine

Summary

In this section we look at the way we can manipulatre character strings so that we can produce well-formatted output of our data.

Until now, we have mainly been dealing with numeric data (numbers). However, it is often necessary to process text sequences, referred to as ‘character strings’. In this section, we will see some of the mechanisms Python provides for text processing. Even when we are not processing lots of text, we do need to be able to read and write the results of our work and the starting point for that is text manipulation.

As we have already seen, any literal text sequence can be entered between a pair of inverted commas (” ” or ’ ’), and stored to a variable. Python allows both single and double quotes to be used (not all languages do, so be careful if translating this sort of thinking to another language in the future).

name = "Hamlet"
quote = 'To be or not to be: that is the question.'

If we look at type(name), we see that Python has a str type to represent strings.

To join two strings together, we can simply combine them using the + operator:

x = name+" says "+quote
print(x)

We can enter quotation marks in a string literal by using the other kind of quotation mark as the string delimiter: either

x = name+" says '"+quote+"'"
y = name+' says "'+quote+'"'
print(x)
print(y)

We can also ‘escape’ the quotation mark, by placing a \ in front of it. This forces Python to treat it as a literal character, without any special meaning:

x = name+" says \""+quote+"\""
print(x)

Finally, we can use triple-quotes (““” “““) to mark the beginning and end of the string, which allows ’ and” to appear within the string without difficulties:

difficult = """ Apparently "we can't just use single quotes" around this string """

We already encountered triple-quoted strings, in the context of docstrings at the start of a function.

We can also use the multiplication operator to repeat a string multiple times:

y = 3*name
print(y)

However, most other ‘mathematical’ operators (such as subtraction and division) have no meaning with strings.

Exercise 1 - familiarisation

Try the string manipulation examples above to make sure you understand them

String conversion

Note that string variables containing numbers are not the same as integer or float variables containing the same numbers:

x = '6'
y = '3'
print(x + y) # prints '63'

However, we can obtain an integer or float variable by using the int and float functions to convert them:

print(int(x) + int(y)) # prints '9'
print(float(x) + float(y)) # prints '9.0'

This only works if the string is a plausible representation of a single number. int('3 6') is assumed to correspond to two distinct numbers, and so they cannot be converted to a single integer and python does not automatically return a tuple such as (3,6). Similarly, while numbers containing a decimal point can be converted to floating-point form, they cannot be interpreted as an integer.

Substrings

If we want to extract a ‘substring’ - a letter, or sequence of letters, from the middle of the string - we can use syntax similar to that for extracting a subset of a list:

quote = 'To be or not to be: that is the question.'
print(quote[2:35:3])

This will print every 3rd character, starting from the character at position 2. Remember, Python counts from 0:

0123456789...
To be or n...

We can see that the character at position 2 is a space, ’ ’. In general, a substring specification takes the form:

variable[istart:istop:istep]

Omitting istart means the substring should begin from the start of variable; omitting istep means it should go to the end of variable; and omitting istep implies that all intervening characters should be printed:

print(quote[:35:3])
print(quote[2::3])
print(quote[2:35])

If we wish to extract only a single character, we simply provide its index:

print(quote[4])

We can also iterate over the letters in a string:

for letter in quote:
    print(letter)

Exercise 2 - string iterations

Try the string sampling examples above to make sure you understand them. We will encounter this syntax again when we look at handling large data arrays and matrices, so your effort will not be wasted,

Special characters

A new-line (carriage return) can be represented within a string by entering '\n', for example:

multiline = 'This\nstring\noccupies\nfive\nlines.'
print(multiline)

Similarly, \t can be used to enter a Tab character. Spaces and tab characters are collectively known as ‘whitespace’.

String functions / methods

As with other data types, Python provides a number of functions to work with a string, s. Some of the more important ones are:

len(s) - Return the number of characters in s.
s.count(x) - Count the number of occurrences of string x within string s. Again, this is case-sensitive.
s.join(x) - Here, x is assumed to be an iterable (typically a list or tuple) of strings. This function returns a single string, containing all the strings from x with a copy of s between each. For example, ':'.join(['a','b','c']) will return 'a:b:c'.
s.split() - Return a list of all of the ‘words’ (substrings separated by whitespace) within s. Optionally, provide a character to be regarded as the word separator. For example, 'a,b,c'.split(',') will return ['a','b','c']. It is also possible to specify the maximum number of words to be returned; once this limit is reached, no further splitting is performed. A variant s.rsplit() works backwards from the end of the string.
s.replace(x,y) - Return a version of s where every occurrence of string x is replaced by string y.
s.find(x) - Return the index of the start of the first occurrence of string x in string s. Note that this is case-sensitive: compare quote.find('to') with quote.find('To'). A variant s.rfind(x) finds the last occurrence of x. Variants s.index() and s.rindex() are almost identical, except that they have different behaviour if x cannot be found within s: whereas s.find() raises an error, s.index() returns -1.
s.upper(), s.lower() and s.title() - Return a copy of the string s converted to be entirely in UPPER CASE/lower case/Title Case respectively.
s.isupper(), s.islower(), s.istitle() - Return True if s is entirely in UPPER CASE/lower case/Title Case respectively, otherwise False.
s.capitalize() - Return a version of s where the first character is in UPPER CASE and the remainder in lower case.
s.swapcase() - Return a version of s where all UPPER CASE characters are converted to lower case and vice versa.
s.center(n) - Create a string of length n containing a copy of s centered within this. By default, this is achived by padding with spaces (’ ’) before and after s; optionally, you can specify a different charater. For example, 'hello'.center(11, '_') returns '___hello___'.
s.ljust(n) - Create a string of length n containing a copy of s at its left. Optionally, specify a character to use for padding. Similarly, s.rjust(n) places s at the right of the n-character string.
s.strip() - Return a copy of s with all whitespace removed. s.lstrip() and s.rstrip() are variants removing whitespace only at the start or end of the string, respectively. For example, 'elephant'.strip('e') returns 'lephant'.

Exercise 3 - Do you really understand strings ?

Check the examples given above. Not everything is obvious so try the examples until you are confident you understand how strings work.

Use this block to get help on the string methods

Exercise 4 - Application: tables

Write a function which takes in strings like 'apples:3; pears:17; bananas:21' and produces a nicely-formatted table like this:

+-----------+-----+
| Apples    |   3 |
| Pears     |  17 |
| Bananas   |  21 |
+-----------+-----+
|     TOTAL |  41 |
+-----------+-----+

Here are some additional input strings it should be able to cope with:

"cheese:0; BREAD:5; milk:1; Bacon:2"
"Ham:15000; salami:36030; corned beef:1836"
"triangles:3; squares:4; pentagons:5; hexagons:6; heptagons:7; octagons:8; nonagons:9"

Pay attention to producing uniform formatting of the text !

String formatting

As you may have noticed, Python’s print function often displays information to a large number of decimal places, and it does not generally produce nicely-formatted output. To achieve this, we must make use of Python’s string-formatting facilities. These provide a mechanism for converting numbers into strings, and controlling the exact form this takes.

Python 3 provides three different frameworks for string formatting. In each, you create a string containing placeholders for the contents of each variable you want to output, then insert the data into these. The different methods for formatting are evolutions of an original idea and the older ones are retained so that older code does not have to be rewritten. There is a discussion about this on the Real Python website that is a worthwhile read.

Literal strings like "this string" are common in python and we often see them in print statements and variable assignments. We saw that there are some special characters that can be inserted into strings using the \ character and that these are used to build strings with (for example) line endings, tabs and so on. We can over-ride this behaviour by adding another \ character but this can get pretty cumbersome pretty quickly.

To get around this, we can write strings like this:

The last example has a ‘raw’ string - the r"... tells python to ingest this particular string without processing any of the special character sequences.

An ‘f-string’ is a similar idea but works in the opposite direction (don’t ignore, do more !). It tells python to parse this particular string by executing code that it finds enclosed in { } pairs.

There is an f-string mini-language that allows

Here are some examples

The first way to write things in clearly less prone to error because you write the subsitution in place and don’t have to take care to match the arguments of the substitution with the placeholders in the string one by one.

Remember this pattern: look for ways people can make mistakes and write code that makes the mistakes less likely. It’s always a good idea, not just for string formatting.

The official python docuementation has a tutorial on formatting strings.

Exercise 5 - Tables using f-strings

Repeat the table-formatting example, making use of the string formatting capabilities. Pay particular attention to using the string formatting to size the columns and align the contents. Can you fill the empty space with dashes instead of space ?

We will encounter many more text-formatting examples in later exercises and certainly when we review other people’s code. Learn to read all three types, but prefer ‘f-strings’ when you write code.

Worked Example: ‘Caesar’ cipher

As we have already discussed, every piece of information within a computer must be organised and represented in binary form. This implies that the sequence of letters in the alphabet can be mapped onto the set of integers, and this is usually done via the ‘ASCII’ code sequence.

Python provides the function chr(integer) to convert integers into their ASCII alphanumeric equivalent.

Exercise 6 - Character manipulation

Write a loop to print the integers from 33 up to 128, and their ASCII equivalents. (The first 32 ASCII codes are special control characters, and do not have alphanumeric equivalents).

A ‘Caesar cipher’ is a very simple way to hide a message making it difficult for someone to read. To encode a piece of text with a Caesar cipher, we simply shift each letter \(N\) places up (or down) the alphabet. For example, choosing \(N=1\), the message

I like Python

would become

J mjlf Qzuipm

because ‘J’ is one letter after ‘I’, ‘m’ is one after ‘l’, and so on.

Exercise 7 - Hail Caesar !

Write a function to encode messages using Caesar ciphers (for any choice of \(N\)). Note that the ‘decoder’ is simply the ‘encoder’, but instead using \(-N\).

Here is a message encoded using a Caesar cipher:

Pfl yrmv wzezjyvu kyzj vovitzjv

By looping through all possible values of \(N\), find the \(N\) used to encode this message and decode it.

Older approaches to string formatting

Older format - examples

This is best illustrated by an example, using the first (older) formatting framework:

x = 1/11
print(x)
s = "One eleventh is approximately %.3f"
print(s)
print(s%x)

Here, the string s contains the text we wish to produce, and the entry %.3f is a placeholder representing a floating point number with three decimal places. x is a floating point variable, calculated to many decimal places. The code s%x combines the two, resulting in the contents of x being inserted into the string s, formatted as required.

All placeholders begin with the ‘%’ symbol. Integer placeholders end with the letter ‘i’, floating-point placeholders end with the letter ‘f’, and string placeholders end with the letter ‘s’. Between the ‘%’ and the letter, one can specify various options controlling the exact form of output:

Placeholder	Description	Example	Output
`%i`	General integer (no further formatting specified)	`'%i'%3`	‘3’
`%3i`	Integer, at least 3 characters wide	`'%3i'%3`	`' 3'`
`%03i`	Integer, at least 3 characters wide, zero-padded	`'%03i'%3`	`'003'`
`%f`	General floating-point number (no format specified)	`'%f'%2.9`	‘2.900000’
`%12f`	Floating-point number, occupying at least 12 characters	`'%12f'%2.9`	`' 2.900000'`
`%012f`	Floating-point number, occupying at least 12 characters, zero-padded	`'%012f'%2.9`	`'00002.900000'`
`%8.2f`	Floating-point number, occupying at least 8 characters, rounded to two decimal places	`'%8.2f'%2.9`	`' 2.90'`
`%s`	General string (no format specified)	`'%s'%'test'`	`'test'`
`%10s`	String, occupying at least 10 characters	`'%10s'%'test'`	`' test'`
`%%`	Literal ‘%’ character	`'%6.2f%%'%2.9`	`' 2.90%'`

Where a string contains more than one placeholder, we can pass the required information as a tuple:

phrase = '%i litres of %s at $%.2f/L costs a total of $%.2f'
print(phrase%(2, 'milk', 1.29, 2*1.29))
print(phrase%(40, 'petrol', 1.53, 40*1.53))
print(phrase%(7.5, 'water', 0.17, 7.5*0.17))

Sometimes, it may be necessary to use string formatting to write the placeholders, allowing the style of output to be set at runtime:

def print_result(result,number_of_decimal_places):
    fmt = "The result is %%.%if"
    print((fmt%number_of_decimal_places)%result)

However, this is best avoided if possible.

A second, newer approach to formatting uses braces {} instead of %... to represent a placeholder, and a .format() function that can act on any string. The syntax of the format specifiers is also different. Our example would become:

x = 1/11
print(x)
s = "One eleventh is approximately {:.2f}"
print(s)
print(s.format(x))

Similarly,

phrase = '{} litres of {} at ${:.2f}/L costs a total of ${:.2f}'
print(phrase.format(2, 'milk', 1.29, 2*1.29))
print(phrase,format(40, 'petrol', 1.53, 40*1.53))
print(phrase.format(7.5, 'water', 0.17, 7.5*0.17))

The new approach provides a much richer set of formatting options, described in full in the online documentation. One benefit of the new style is that it is no longer necessary to pass information to format in the same order as it is used: we can number the placeholders. For example,

phrase = '{3} litres of {0} at ${1:.2f}/L costs a total of ${2:.2f}'
print(phrase.format('milk', 1.29, 2*1.29, 2))
print(phrase,format('petrol', 1.53, 40*1.53, 40))
print(phrase.format('water', 0.17, 7.5*0.17, 7.5))

This is particularly useful if you need to repeat the same information several times in a string:

phrase = "This sentence has the word {0} {0} {0} repeated three times and the word {1} {1} repeated twice."
print(phrase.format('cat', 'dog'))

Coding scratch space