Python: An Intro to Regular Expressions

Regular expressions are basically a tiny language all their own that you can use inside of Python and many other programming languages. You will often hear regular expressions referred to as "regex", "regexp" or just "RE". Some languages, such as Perl and Ruby, actually support regular expression syntax directly in the language itself. Python only supports them via a library that you need to import. The primary use for regular expressions is matching strings. You create the string matching rules using a regular expression and then you apply it to a string to see if there are any matches.

The regular expression "language" is actually pretty small, so you won't be able to use it for all your string matching needs. Besides that, some tasks that can be done with a regular expression end up so complicated that the expression becomes difficult to debug. In cases like that, you should just use Python. Python is an excellent language for text parsing in its own right and can be used for anything you do in a regular expression. However, it may take a lot more code to do so and be slower than the regular expression, because regular expressions are compiled and then executed by a matching engine written in C.


The Matching Characters

When you want to match a character in a string, in most cases you can just use that character or that sub-string. So if we wanted to match "dog", then we would use the letters dog. Of course, there are some characters that are reserved for regular expressions. These are known as *metacharacters*. The following is a complete list of the metacharacters that Python's regular expression implementation supports:

. ^ $ * + ? { } [ ] \ | ( )

Let's spend a few moments looking at how some of these work. One of the most common pairs of metacharacters you will encounter is the square brackets: [ and ]. They are used for creating a "character class", which is a set of characters that you would like to match on. You may list the characters individually like this: [xyz]. This will match any one of the characters listed between the brackets. You can also use a dash to express a range of characters to match against: [a-g]. In this example, we would match against any one of the letters a through g.
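
As a quick illustration of character classes, here is a minimal sketch. It uses the findall function, which is covered later in this article, and the sample strings are made up for the example:

```python
import re

# [xyz] matches any single character that is x, y or z.
listed = re.findall('[xyz]', 'lazy fox')

# [a-g] matches any single lowercase letter from a through g.
ranged = re.findall('[a-g]', 'badge hill')

print(listed)  # ['z', 'y', 'x']
print(ranged)  # ['b', 'a', 'd', 'g', 'e']
```

Note that each match is a single character, because a class on its own matches exactly one character.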

To actually do a search though, we would need to add a beginning character to look for and an ending character. To make this easier, we can use the asterisk, which allows repetitions. Instead of matching a literal *, the * tells the regular expression that the previous character (or character class) may be matched zero or more times.

It always helps to see a simple example:

a[b-f]*f

This regular expression pattern means that we are going to look for the letter a, zero or more letters from our class, [b-f] and it needs to end with an f. Let's try using this expression in Python:

>>> import re
>>> text = 'abcdfghijk'
>>> parser = re.search('a[b-f]*f', text)
>>> parser
<re.Match object; span=(0, 5), match='abcdf'>
>>> parser.group()
'abcdf'

Basically this expression will look at the entire string we pass to it, which in this case is abcdfghijk. It will find the a at the beginning and match against that. Then, because it has a character class with an asterisk on the end, it will greedily consume as many of the following characters as the class allows, which here is bcdf. Since the pattern still needs a closing f, it will then backtrack one character at a time until the rest of the pattern matches.

All this magic happens when we call the re module's search function. If we don't find a match, then None is returned. Otherwise, we will get a Match object, which you can see above. To actually see what the match looks like, you need to call the group method.
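
Because search returns None when nothing matches, calling group on the result without checking it first raises an AttributeError. Here is a small sketch of the usual guard, using a made-up string that contains no match:

```python
import re

# 'xyz' contains no 'a', so the pattern cannot match anywhere.
match = re.search('a[b-f]*f', 'xyz')

if match is None:
    result = 'no match'
else:
    result = match.group()

print(result)  # no match
```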

There's another repeating metacharacter which is similar to *. It is +, which will match one or more times. This is a subtle difference from *, which matches zero or more times. The + requires at least one occurrence of the character it is looking for.
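
The difference between the two is easiest to see side by side. This is just a sketch with made-up patterns and strings:

```python
import re

# With *, the b is allowed to appear zero times, so 'ac' matches.
star = re.search('ab*c', 'ac')

# With +, at least one b is required, so 'ac' does not match.
plus = re.search('ab+c', 'ac')

# Both accept one or more b characters.
plus_many = re.search('ab+c', 'abbbc')

print(star.group())       # ac
print(plus)               # None
print(plus_many.group())  # abbbc
```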

The last two repeating metacharacters work a bit differently. There is the question mark, ?, which will match either once or zero times. It sort of marks the character before it as optional. A simple example would be "co-?op". This would match both "coop" and "co-op".
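
Here is a short sketch that tries the pattern against a few candidate words. The third word, with two dashes, is an extra case added for illustration:

```python
import re

pattern = 'co-?op'
words = ['coop', 'co-op', 'co--op']

# Record whether each word contains a match for the pattern.
results = {word: bool(re.search(pattern, word)) for word in words}

print(results)  # {'coop': True, 'co-op': True, 'co--op': False}
```

The last word fails because ? allows at most one dash.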

The final repeated metacharacter is {a,b} where a and b are decimal integers. What this means is that there must be at least a repetitions and at most b. You might want to try out something like this:

xb{1,4}z

This is a pretty silly example, but what it says is that we will match things like xbz, xbbz, xbbbz and xbbbbz, but not xz because it doesn't have a "b".
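
You can check those claims with a few lines of Python. This sketch also adds a string with five b's, which goes over the upper limit:

```python
import re

pattern = 'xb{1,4}z'
candidates = ['xz', 'xbz', 'xbbbbz', 'xbbbbbz']

# Keep only the candidates that the pattern matches.
matched = [c for c in candidates if re.search(pattern, c)]

print(matched)  # ['xbz', 'xbbbbz']
```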

The next metacharacter that we'll learn about is ^. This character will allow us to match the characters that are not listed inside our class. In other words, it will complement our class. This will only work if we put the ^ at the very start of the class. If it appears anywhere else inside the class, then we will be attempting to actually match against a literal ^. A good example would be something like this: [^a]. This will match any character except the letter 'a'.

The ^ is also used as an anchor: outside of a character class, it matches at the beginning of the string. There is a corresponding anchor for the end of the string, which is $.
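
The two roles of ^ can be sketched like this. The strings are made up, and findall is covered later in this article:

```python
import re

# Inside a class, a leading ^ complements the class.
not_a = re.findall('[^a]', 'banana')

# Outside a class, ^ anchors the match to the start of the string...
starts = re.search('^the', 'the cat')
not_start = re.search('^cat', 'the cat')

# ...and $ anchors it to the end.
ends = re.search('cat$', 'the cat')

print(not_a)           # ['b', 'n', 'n']
print(starts.group())  # the
print(not_start)       # None
print(ends.group())    # cat
```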

We've spent a lot of time introducing various concepts of regular expressions. In the next few sections, we'll dig into some more real code examples!


Pattern Matching Using search

Let's take a moment to learn some pattern matching basics. When using Python to look for a pattern in a string, you can use the search function like we did in the example earlier in this article. Here's how:

import re

text = "The ants go marching one by one"

strings = ['the', 'one']

for string in strings:
    match = re.search(string, text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        text_pos = match.span()
        print(text[match.start():match.end()])
    else:
        print('Did not find "{}"'.format(string))

For this example, we import the re module and create a simple string. Then we create a list of two strings that we'll search for in the main string. Next we loop over the strings we plan to search for and actually run a search for them. If there's a match, we print it out. Otherwise we tell the user that the string was not found.

There are a couple of other methods worth explaining in this example. You will notice that we call span on the match. This gives us the beginning and ending positions of the string that matched. If you print out the text_pos that we assigned the span to, you'll get a tuple like this: (21, 24). Alternatively, you can just call some other match methods, which is what we do next. We use start and end to grab the starting and ending position of the match, which should also be the two numbers that we get from span.
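
To see that span, start and end agree with each other, you could run something like the following sketch against the same text:

```python
import re

text = "The ants go marching one by one"
match = re.search('one', text)

# span() returns (start, end) as a tuple.
text_pos = match.span()
print(text_pos)  # (21, 24)

# start() and end() return the same two numbers individually.
print(match.start(), match.end())  # 21 24

# Slicing the original text with them gives back the matched string.
print(text[match.start():match.end()])  # one
```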


Escape Codes

There are also some sequences that you can search for using special escape codes in Python. Here's a quick list with brief explanations of each:

    
\d
    Matches digit
\D
    Matches non-digit
\s
    Matches whitespace
\S
    Matches non-whitespace
\w
    Matches alphanumeric characters and the underscore
\W
    Matches anything that is not alphanumeric or an underscore

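You can use these escape codes inside of a character class, like so: [\d]. This would allow us to find any digit and is roughly the equivalent of [0-9]. (For Unicode strings, \d also matches digits from other scripts unless you use the ASCII flag.) I highly recommend trying out a few of the others yourself.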
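
If you want a starting point for experimenting, here is a small sketch. The sample text is made up, and the patterns are written as raw strings, which are explained in the section on backslashes:

```python
import re

text = 'Room 42, floor 3'

digits = re.findall(r'\d', text)    # each single digit
numbers = re.findall(r'\d+', text)  # runs of one or more digits
words = re.findall(r'\w+', text)    # runs of word characters
chunks = re.findall(r'\S+', text)   # runs of non-whitespace

print(digits)   # ['4', '2', '3']
print(numbers)  # ['42', '3']
print(words)    # ['Room', '42', 'floor', '3']
print(chunks)   # ['Room', '42,', 'floor', '3']
```

Notice that \S+ keeps the comma attached to 42, while \w+ does not, because a comma is neither whitespace nor a word character.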


Compiling

The re module allows you to "compile" the expressions that you are searching for frequently. This will basically allow you to turn your expression into a Pattern object (shown as SRE_Pattern in older versions of Python 3). You can then use that object in your search function. Let's use the code from earlier and modify it to use compile:

import re

text = "The ants go marching one by one"

strings = ['the', 'one']

for string in strings:
    regex = re.compile(string)
    match = re.search(regex, text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        text_pos = match.span()
        print(text[match.start():match.end()])
    else:
        print('Did not find "{}"'.format(string))

You will note that here we create our pattern object by calling compile on each string in our list and assigning the result to the variable regex. We then pass that regex to our search function. The rest of the code is the same. The primary reason to use compile is to save the pattern so it can be reused later on in your code. However, compile also takes some flags that can be used to enable various special features. We will take a look at that next.

Special Note: When you compile patterns, they will get automatically cached, so if you aren't using a lot of regular expressions in your code, then you may not need to save the compiled object to a variable.
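
When you do keep the compiled object around, you can also call search directly on it instead of passing it to re.search. Here is a minimal sketch of reusing one compiled pattern across several strings. The list of lines is made up for the example:

```python
import re

# Compile once, outside the loop.
regex = re.compile('one')

lines = [
    "The ants go marching one by one",
    "two by two",
    "one more time",
]

# Pattern objects have their own search method.
hits = [line for line in lines if regex.search(line)]

print(hits)  # ['The ants go marching one by one', 'one more time']
```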


Compilation Flags

Python 3 includes several compilation flags that can change how your compiled pattern behaves. Let's go over what each of the 7 main ones does and then we will look at how to use a compilation flag.

re.A / re.ASCII

The ASCII flag tells Python to only match against ASCII instead of using full Unicode matching when coupled with the following escape codes: \w, \W, \b, \B, \d, \D, \s and \S. There is a re.U / re.UNICODE flag too that is for backwards compatibility purposes; however those flags are redundant since Python 3 already matches in Unicode by default.
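
As a rough sketch of the difference, here is \w applied to a made-up word containing an accented letter, with and without the flag. Functions such as findall accept flags as an optional third argument:

```python
import re

text = 'café'

# By default, \w uses Unicode rules, so the accented letter counts.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```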

re.DEBUG

This will display debug information about your compiled expression.

re.I / re.IGNORECASE

If you'd like to perform case-insensitive matching, then this is the flag for you. If your expression was [a-z] and you compiled it with this flag, your pattern would match uppercase letters too! This also works for Unicode and is not affected by the current locale.
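
This flag also explains why the earlier search example could not find 'the' in our text: the text starts with a capital T. Here is a sketch of both cases:

```python
import re

text = "The ants go marching one by one"

# Case-sensitive by default: 'the' does not match 'The'.
sensitive = re.search('the', text)

# With re.IGNORECASE, it does.
insensitive = re.search('the', text, re.IGNORECASE)

print(sensitive)            # None
print(insensitive.group())  # The

# The [a-z] class picks up the uppercase letter as well.
print(re.findall('[a-z]', 'Py', re.IGNORECASE))  # ['P', 'y']
```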

re.L / re.LOCALE

In current Python 3 releases, this flag makes \w, \W, \b, \B and case-insensitive matching depend on the current locale. However, the documentation says that you should not depend on this flag because the locale mechanism itself is very unreliable. Instead, just use Unicode matching. The documentation goes on to state that this flag can only be used with bytes patterns.

re.M / re.MULTILINE

When you use this flag, you are telling Python to make the ^ pattern character match at both the beginning of the string and at the beginning of each line. It also tells Python that $ should match at the end of the string and the end of each line, which is subtly different from their defaults. See the documentation for additional information.
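
A two-line string makes the difference visible. This is only a sketch with made-up text:

```python
import re

text = 'first line\nsecond line'

# By default, ^ only matches at the very start of the string.
default_starts = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after each newline.
multi_starts = re.findall(r'^\w+', text, re.MULTILINE)

# The same applies to $ at the end of each line.
multi_ends = re.findall(r'\w+$', text, re.MULTILINE)

print(default_starts)  # ['first']
print(multi_starts)    # ['first', 'second']
print(multi_ends)      # ['line', 'line']
```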

re.S / re.DOTALL

This fun flag will make the . (period) metacharacter match any character at all. Without the flag, it would match anything except a newline.
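
Here is a minimal sketch of that behavior, using a made-up string with a newline in the middle:

```python
import re

text = 'one\ntwo'

# Without the flag, . refuses to match the newline.
default = re.search('one.two', text)

# With re.DOTALL, . matches the newline too.
dotall = re.search('one.two', text, re.DOTALL)

print(default)         # None
print(dotall.group())  # prints 'one' and 'two' on separate lines
```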

re.X / re.VERBOSE

If you find your regular expressions hard to read, then this flag could be just what you need. It will allow you to visually separate logical sections of your regular expressions and even add comments! Whitespace within the pattern will be ignored except when in a character class or when the whitespace is preceded by an unescaped backslash.

Using a compilation flag

Let's take a moment and look at a simple example that uses the VERBOSE compilation flag! One good example is to take a common email finding regular expression such as r'[\w\.-]+@[\w\.-]+' and add some comments using the VERBOSE flag. Let's take a look:

re.compile(r'''
           [\w\.-]+      # the user name
           @
           [\w\.-]+      # the domain
           ''',
           re.VERBOSE)
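
To convince yourself that the verbose form behaves like the compact one, you could compare them on some sample text. In this sketch the email addresses and variable names are made up for illustration:

```python
import re

compact = re.compile(r'[\w\.-]+@[\w\.-]+')

verbose = re.compile(r'''
    [\w\.-]+      # the user name
    @
    [\w\.-]+      # the domain
    ''', re.VERBOSE)

text = 'Contact alice@example.com or bob@example.org today'

found = verbose.findall(text)
print(found)  # ['alice@example.com', 'bob@example.org']

# Both patterns find exactly the same matches.
print(compact.findall(text) == found)  # True
```

Keep in mind that this simple pattern is only meant for demonstrating the flag. It is not a complete email validator.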

Let's move on and learn how to find multiple matches.


Finding Multiple Instances

Up to this point, all we've seen is how to find the first match in a string. But what if you have a string that has multiple matches in it? Let's review how to find a single match:

>>> import re
>>> silly_string = "the cat in the hat"
>>> pattern = "the"
>>> match = re.search(pattern, silly_string)
>>> match.group()
'the'

Now you can see that there are two instances of the word "the", but we only found one. There are two methods of finding both. The first one that we will look at is the findall function:

>>> import re
>>> silly_string = "the cat in the hat"
>>> pattern = "the"
>>> re.findall(pattern, silly_string)
['the', 'the']

The findall function will search through the string you pass it and add each match to a list. Once it finishes searching your string, it will return a list of matches. The other way to find multiple matches is to use the finditer function.

import re

silly_string = "the cat in the hat"
pattern = "the"

for match in re.finditer(pattern, silly_string):
    s = "Found '{group}' at {begin}:{end}".format(
        group=match.group(), begin=match.start(),
        end=match.end())
    print(s)

As you might have guessed, the finditer function returns an iterator of Match instances instead of the strings that you would get from findall. So we needed to do a little formatting of the results before we could print them out. Give this code a try and see how it works.


Backslashes are complicated

Backslashes are a bit complicated in Python's regular expressions. The reason is that regular expressions use backslashes to indicate special forms or to allow a special character to be searched for instead of invoking it, such as when we want to search for a dollar sign: \$. If we didn't backslash that, we'd just be creating an anchor. The issue comes in because Python uses the backslash character for the same thing in literal strings. Let's say you want to search for a string like this (minus the quotes): "\python".

To search for this in a regular expression, you will need to escape the backslash, but because Python also uses the backslash, that backslash also has to be escaped, so you'll end up with the following search pattern: "\\\\python". Fortunately, Python supports raw strings, which you create by prefixing the string with the letter 'r'. So we can make this more readable by doing the following: r"\\python".

So if you need to search for something with a backslash, be sure to use raw strings or you may end up with some unexpected results!
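
Here is a small sketch showing that the two spellings really are the same pattern. The text being searched is made up for the example:

```python
import re

# The string literal needs a doubled backslash; the actual text is C:\python
text = 'C:\\python'

# Without a raw string, each backslash must be written twice for Python
# and twice again for the regular expression.
plain = re.search('\\\\python', text)

# A raw string only needs the regular expression's own escaping.
raw = re.search(r'\\python', text)

print('\\\\python' == r'\\python')  # True
print(plain.group())                # \python
print(raw.group())                  # \python
```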


Wrapping Up

This article barely scratches the surface of all you can do with regular expressions. In fact, there is much more to the re module itself. There are entire books on the subject of regular expressions, but this should give you the basics to get started. You will need to search for examples and read the documentation, probably multiple times, when you're using regular expressions, but they are a handy tool to have when you need them.

Copyright © 2024 Mouse Vs Python