Strings
Until now, we have mainly been dealing with numeric data (numbers). However, it is often necessary to process text sequences, referred to as ‘character strings’. In this section, we will see some of the mechanisms Python provides for text processing. Even when we are not processing lots of text, we do need to be able to read and write the results of our work and the starting point for that is text manipulation.
As we have already seen, any literal text sequence can be entered between a pair of inverted commas (" " or ' '), and stored in a variable. Python allows both single and double quotes to be used (not all languages do, so be careful if translating this sort of thinking to another language in the future).
name = "Hamlet"
quote = 'To be or not to be: that is the question.'
If we look at type(name), we see that Python has a str type to represent strings.
To join two strings together, we can simply combine them using the + operator:
x = name+" says "+quote
print(x)
We can enter quotation marks in a string literal by using the other kind of quotation mark as the string delimiter:
x = name+" says '"+quote+"'"
y = name+' says "'+quote+'"'
print(x)
print(y)
We can also ‘escape’ the quotation mark, by placing a \ in front of it. This forces Python to treat it as a literal character, without any special meaning:
x = name+" says \""+quote+"\""
print(x)
Finally, we can use triple-quotes (""" """) to mark the beginning and end of the string, which allows ' and " to appear within the string without difficulties:
difficult = """ Apparently "we can't just use single quotes" around this string """
We already encountered triple-quoted strings, in the context of docstrings at the start of a function.
We can also use the multiplication operator to repeat a string multiple times:
y = 3*name
print(y)
However, most other ‘mathematical’ operators (such as subtraction and division) have no meaning with strings.
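As a quick check of this claim, the sketch below shows concatenation and repetition working while subtraction fails with a TypeError. The names greeting, line and subtraction_error are illustrative assumptions, not part of the examples above.

```python
# Illustrative names only; they do not appear in the examples above.
greeting = "Ha"

# Concatenation (+) and repetition (*) are defined for strings.
line = greeting + "!" + 3 * greeting
print(line)

# Subtraction is not defined for strings, so Python raises a TypeError.
try:
    greeting - "a"
except TypeError as err:
    subtraction_error = str(err)
    print("Subtraction failed:", subtraction_error)
```

Catching the exception here is only for demonstration; in ordinary code you would simply avoid the unsupported operators.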
String conversion
Note that string variables containing numbers are not the same as integer or float variables containing the same numbers:
x = '6'
y = '3'
print(x + y) # prints '63'
However, we can obtain an integer or float variable by using the int and float functions to convert them:
print(int(x) + int(y)) # prints '9'
print(float(x) + float(y)) # prints '9.0'
This only works if the string is a plausible representation of a single number. int('3 6') is assumed to correspond to two distinct numbers, so it cannot be converted to a single integer; Python does not automatically return a tuple such as (3,6), and instead raises a ValueError. Similarly, while strings containing a decimal point (such as '3.5') can be converted to floating-point form, they cannot be interpreted as an integer.
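One way to handle strings that may or may not be valid numbers is to catch the ValueError that int and float raise. The helper to_number below is a hypothetical name introduced for illustration; it is a minimal sketch rather than a recommended general-purpose parser.

```python
def to_number(text):
    """Hypothetical helper: return an int if possible, else a float, else None."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            # This conversion failed; try the next one (if any).
            continue
    return None


print(to_number('6'))    # converted by int
print(to_number('3.5'))  # int fails, float succeeds
print(to_number('3 6'))  # neither conversion works
```

Trying int before float means that whole numbers stay as integers and only strings with a decimal point become floats.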
Substrings
If we want to extract a ‘substring’ - a letter, or sequence of letters, from the middle of the string - we can use syntax similar to that for extracting a subset of a list:
quote = 'To be or not to be: that is the question.'
print(quote[2:35:3])
This will print every 3rd character, starting from the character at position 2. Remember, Python counts from 0:
0123456789...
To be or n...
We can see that the character at position 2 is a space, ' '. In general, a substring specification takes the form:
variable[istart:istop:istep]
Omitting istart means the substring should begin from the start of variable; omitting istop means it should go to the end of variable; and omitting istep implies that all intervening characters should be printed:
print(quote[:35:3])
print(quote[2::3])
print(quote[2:35])
If we wish to extract only a single character, we simply provide its index:
print(quote[4])
We can also iterate over the letters in a string:
for letter in quote:
    print(letter)
Special characters
A new-line (line break) can be represented within a string by entering '\n', for example:
multiline = 'This\nstring\noccupies\nfive\nlines.'
print(multiline)
Similarly, \t can be used to enter a Tab character. Spaces, tabs and new-lines are collectively known as ‘whitespace’.
String functions / methods
As with other data types, Python provides a number of functions and methods to work with a string, s. Some of the more important ones are:
len(s) - Return the number of characters in s.
s.count(x) - Count the number of occurrences of string x within string s. This is case-sensitive.
s.join(x) - Here, x is assumed to be an iterable (typically a list or tuple) of strings. This method returns a single string, containing all the strings from x with a copy of s between each. For example, ':'.join(['a','b','c']) will return 'a:b:c'.
s.split() - Return a list of all of the ‘words’ (substrings separated by whitespace) within s. Optionally, provide a character to be regarded as the word separator. For example, 'a,b,c'.split(',') will return ['a','b','c']. It is also possible to specify the maximum number of splits to perform; once this limit is reached, no further splitting is performed and the remainder of the string is returned as the final item. A variant s.rsplit() works backwards from the end of the string.
s.replace(x,y) - Return a version of s where every occurrence of string x is replaced by string y.
s.find(x) - Return the index of the start of the first occurrence of string x in string s. Note that this is case-sensitive: compare quote.find('to') with quote.find('To'). A variant s.rfind(x) finds the last occurrence of x. Variants s.index() and s.rindex() are almost identical, except that they have different behaviour if x cannot be found within s: whereas s.index() raises an error (a ValueError), s.find() returns -1.
s.upper(), s.lower() and s.title() - Return a copy of the string s converted to be entirely in UPPER CASE/lower case/Title Case respectively.
s.isupper(), s.islower(), s.istitle() - Return True if s is entirely in UPPER CASE/lower case/Title Case respectively, otherwise False.
s.capitalize() - Return a version of s where the first character is in UPPER CASE and the remainder in lower case.
s.swapcase() - Return a version of s where all UPPER CASE characters are converted to lower case and vice versa.
s.center(n) - Create a string of length n containing a copy of s centered within this. By default, this is achieved by padding with spaces (' ') before and after s; optionally, you can specify a different character. For example, 'hello'.center(11, '_') returns '___hello___'.
s.ljust(n) - Create a string of length n containing a copy of s at its left. Optionally, specify a character to use for padding. Similarly, s.rjust(n) places s at the right of the n-character string.
s.strip() - Return a copy of s with all leading and trailing whitespace removed. s.lstrip() and s.rstrip() are variants removing whitespace only at the start or end of the string, respectively. Optionally, specify the characters to be removed instead of whitespace. For example, 'elephant'.strip('e') returns 'lephant'.
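The short sketch below exercises several of these methods on one string. The sample text and the variable names are illustrative assumptions chosen to keep the results easy to check by hand.

```python
s = 'To be or not to be'  # illustrative sample text

length = len(s)                    # number of characters, spaces included
n_be = s.count('be')               # case-sensitive count
first_to = s.find('to')            # lower-case 'to' only; 'To' at index 0 is skipped
missing = s.find('xyz')            # find() signals "not found" with -1

# index() signals "not found" by raising ValueError instead.
try:
    s.index('xyz')
    index_raised = False
except ValueError:
    index_raised = True

words = s.split()                  # split on whitespace
limited = s.split(' ', 2)          # at most 2 splits, so at most 3 items
joined = '-'.join(words)           # glue the words back together with '-'
replaced = s.replace('be', 'bee')  # every occurrence is replaced
cleaned = '  padded \n'.strip()    # leading/trailing whitespace removed
right = 'hello'.rjust(8, '.')      # pad on the left to a width of 8

for value in (length, n_be, first_to, missing, index_raised,
              words, limited, joined, replaced, cleaned, right):
    print(repr(value))
```

Note that none of these methods changes s itself: strings are immutable, so each call returns a new value.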
Python’s built-in help() function can be used to get help on the string methods, for example help(str) or help(str.split).
String formatting
As you may have noticed, Python’s print function often displays information to a large number of decimal places, and it does not generally produce nicely-formatted output. To produce well-formatted output, we must make use of Python’s string-formatting facilities. These provide a mechanism for converting numbers into strings, and controlling the exact form this takes.
Python 3 provides three different frameworks for string formatting. In each, you create a string containing placeholders for the contents of each variable you want to output, then insert the data into these. The different methods for formatting are evolutions of an original idea and the older ones are retained so that older code does not have to be rewritten. There is a discussion about this on the Real Python website that is a worthwhile read.
Literal strings like "this string" are common in Python and we often see them in print calls and variable assignments. We saw that there are some special characters that can be inserted into strings using the \ character and that these are used to build strings with (for example) line endings, tabs and so on. We can override this behaviour by adding another \ character, but this quickly becomes cumbersome.
To get around this, we can prefix the string literal with the letter r to make a ‘raw’ string: the r"... prefix tells Python to ingest this particular string without processing any of the special character sequences.
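A minimal sketch of the difference, using a made-up Windows-style path as the sample text (the path and the variable names are illustrative assumptions):

```python
# The same text written three ways; the path is purely illustrative.
unescaped = 'C:\new\table.txt'   # \n and \t are interpreted as new-line and tab
escaped = 'C:\\new\\table.txt'   # each backslash doubled by hand
raw = r'C:\new\table.txt'        # raw string: backslashes are left alone

print(escaped == raw)            # the two spellings give the same string
print(len(raw), len(unescaped))  # the unescaped version is two characters shorter
print(raw)
```

Raw strings are most often seen with file paths and regular expressions, where backslashes are common.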
An ‘f-string’ is a similar idea but works in the opposite direction (don’t ignore, do more!). It tells Python to parse this particular string by executing code that it finds enclosed in { } pairs.
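There is an f-string mini-language that allows the format of each substituted value (for example its width, alignment and number of decimal places) to be specified after a colon inside the braces.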
Here are some examples
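A minimal sketch, using illustrative values that are assumptions (they mirror the milk example used elsewhere in this section). The first way uses an f-string; the second builds the same text with the older .format() method.

```python
# Illustrative values; any numbers and names would do.
x = 1 / 11
item, litres, price = 'milk', 2, 1.29

# First way: an f-string, with each expression written where it is used.
first = f'{litres} litres of {item} at ${price:.2f}/L costs a total of ${litres * price:.2f}'

# Second way: placeholders in the string, values supplied separately and in order.
second = '{} litres of {} at ${:.2f}/L costs a total of ${:.2f}'.format(
    litres, item, price, litres * price)

print(first)
print(first == second)

# After the colon comes the format: precision, width, alignment, zero-padding.
approx = f'One eleventh is approximately {x:.3f}'
padded = f'{item:>10}|{litres:03d}'
print(approx)
print(padded)
```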
Writing the substitution as an f-string is clearly less prone to error because you write the substitution in place and don’t have to take care to match the arguments of the substitution with the placeholders in the string one by one.
Remember this pattern: look for ways people can make mistakes and write code that makes the mistakes less likely. It’s always a good idea, not just for string formatting.
The official Python documentation has a tutorial on formatting strings.
We will encounter many more text-formatting examples in later exercises and certainly when we review other people’s code. Learn to read all three types, but prefer ‘f-strings’ when you write code.
Worked Example: ‘Caesar’ cipher
As we have already discussed, every piece of information within a computer must be organised and represented in binary form. This implies that the sequence of letters in the alphabet can be mapped onto the set of integers, and this is usually done via the ‘ASCII’ code sequence.
Python provides the function chr(integer) to convert integers into their ASCII alphanumeric equivalent, and the function ord(character) to convert in the opposite direction.
A ‘Caesar cipher’ is a very simple way to hide a message making it difficult for someone to read. To encode a piece of text with a Caesar cipher, we simply shift each letter \(N\) places up (or down) the alphabet. For example, choosing \(N=1\), the message
I like Python
would become
J mjlf Qzuipo
because ‘J’ is one letter after ‘I’, ‘m’ is one after ‘l’, and so on.
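One possible implementation is sketched below. The function name caesar is an assumption, as is the decision to wrap around at the end of the alphabet (so that ‘z’ shifted by one becomes ‘a’) and to leave spaces and punctuation untouched; the description above does not specify either. It relies on the built-in ord, which is the inverse of chr.

```python
def caesar(message, n):
    """Shift each ASCII letter n places along the alphabet, wrapping at the end.

    Characters that are not ASCII letters (spaces, digits, punctuation)
    are copied unchanged. A negative n shifts down the alphabet.
    """
    shifted = []
    for ch in message:
        if 'a' <= ch <= 'z':
            base = ord('a')
        elif 'A' <= ch <= 'Z':
            base = ord('A')
        else:
            shifted.append(ch)
            continue
        # Position within the alphabet (0-25), shifted and wrapped with %.
        shifted.append(chr(base + (ord(ch) - base + n) % 26))
    return ''.join(shifted)


encoded = caesar('I like Python', 1)
decoded = caesar(encoded, -1)
print(encoded)
print(decoded)
```

Decoding is simply encoding with the opposite shift, which is why the same function serves both purposes.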
Older approaches to string formatting
This is best illustrated by an example, using the first (older) formatting framework:
x = 1/11
print(x)
s = "One eleventh is approximately %.3f"
print(s)
print(s%x)
Here, the string s contains the text we wish to produce, and the entry %.3f is a placeholder representing a floating point number with three decimal places. x is a floating point variable, calculated to many decimal places. The code s%x combines the two, resulting in the contents of x being inserted into the string s, formatted as required.
All placeholders begin with the ‘%’ symbol. Integer placeholders end with the letter ‘i’, floating-point placeholders end with the letter ‘f’, and string placeholders end with the letter ‘s’. Between the ‘%’ and the letter, one can specify various options controlling the exact form of output:
Placeholder | Description | Example | Output |
---|---|---|---|
%i | General integer (no further formatting specified) | '%i'%3 | '3' |
%3i | Integer, at least 3 characters wide | '%3i'%3 | '  3' |
%03i | Integer, at least 3 characters wide, zero-padded | '%03i'%3 | '003' |
%f | General floating-point number (no format specified) | '%f'%2.9 | '2.900000' |
%12f | Floating-point number, occupying at least 12 characters | '%12f'%2.9 | '    2.900000' |
%012f | Floating-point number, occupying at least 12 characters, zero-padded | '%012f'%2.9 | '00002.900000' |
%8.2f | Floating-point number, occupying at least 8 characters, rounded to two decimal places | '%8.2f'%2.9 | '    2.90' |
%s | General string (no format specified) | '%s'%'test' | 'test' |
%10s | String, occupying at least 10 characters | '%10s'%'test' | '      test' |
%% | Literal ‘%’ character | '%6.2f%%'%2.9 | '  2.90%' |
Where a string contains more than one placeholder, we can pass the required information as a tuple:
phrase = '%i litres of %s at $%.2f/L costs a total of $%.2f'
print(phrase%(2, 'milk', 1.29, 2*1.29))
print(phrase%(40, 'petrol', 1.53, 40*1.53))
print(phrase%(7.5, 'water', 0.17, 7.5*0.17))
Sometimes, it may be necessary to use string formatting to write the placeholders, allowing the style of output to be set at runtime:
def print_result(result,number_of_decimal_places):
    fmt = "The result is %%.%if"
    print((fmt%number_of_decimal_places)%result)
However, this is best avoided if possible.
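One way to avoid the two-stage substitution, while staying within the % framework, is to write the precision as * so that it is read from the tuple of values. The function name format_result below is a hypothetical variant of print_result that returns the string instead of printing it.

```python
def format_result(result, number_of_decimal_places):
    """Hypothetical variant of print_result that returns the formatted text.

    The '*' in the placeholder means "take the precision from the next
    item in the tuple", so no second round of formatting is needed.
    """
    return "The result is %.*f" % (number_of_decimal_places, result)


print(format_result(1/11, 3))
print(format_result(3.14159, 2))
```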
A second, newer approach to formatting uses braces {} instead of %... to represent a placeholder, and a .format() function that can act on any string. The syntax of the format specifiers is also different. Our example would become:
x = 1/11
print(x)
s = "One eleventh is approximately {:.3f}"
print(s)
print(s.format(x))
Similarly,
phrase = '{} litres of {} at ${:.2f}/L costs a total of ${:.2f}'
print(phrase.format(2, 'milk', 1.29, 2*1.29))
print(phrase.format(40, 'petrol', 1.53, 40*1.53))
print(phrase.format(7.5, 'water', 0.17, 7.5*0.17))
The new approach provides a much richer set of formatting options, described in full in the online documentation. One benefit of the new style is that it is no longer necessary to pass information to format in the same order as it is used: we can number the placeholders. For example,
phrase = '{3} litres of {0} at ${1:.2f}/L costs a total of ${2:.2f}'
print(phrase.format('milk', 1.29, 2*1.29, 2))
print(phrase.format('petrol', 1.53, 40*1.53, 40))
print(phrase.format('water', 0.17, 7.5*0.17, 7.5))
This is particularly useful if you need to repeat the same information several times in a string:
phrase = "This sentence has the word {0} {0} {0} repeated three times and the word {1} {1} repeated twice."
print(phrase.format('cat', 'dog'))