Python 201: An Intro to Generators

The topic of generators has been covered numerous times before. However, it's still a topic that a lot of new programmers have trouble with, and I would hazard a guess that even experienced users don't use them much.

Python generators allow developers to lazily evaluate data. This is very helpful when you are dealing with so-called "big data". Their main use is for generating values and for doing so in an efficient manner. In this article, we will go over how to use a generator and take a look at generator expressions. Hopefully by the end you will be comfortable using generators in your own projects.

The canonical use case for a generator is to show how to read a large file in a series of chunks or lines. There's nothing wrong with that idea, so let's use that for our first example too. To create a generator, all we need to do is use Python's yield keyword. A function that contains a yield statement becomes a generator function: calling it returns a generator, which is a kind of iterator. In the simplest case, all you have to do to change a regular function into a generator is to replace the return statement with a yield statement. Let's take a look at an example:

#----------------------------------------------------------------------
def read_large_file(file_object):
    """
    Uses a generator to read a large file lazily
    """
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data

#----------------------------------------------------------------------
def process_file(path):
    """
    Open the file and print each line produced by the generator
    """
    try:
        with open(path) as file_handler:
            for line in read_large_file(file_handler):
                # process line
                print(line)
    except (IOError, OSError):
        print("Error opening / processing file")
                
#----------------------------------------------------------------------
if __name__ == "__main__":
    path = "TB_burden_countries_2014-01-23.csv"
    process_file(path)

To make testing easier, I went to the World Health Organization's (WHO) site and downloaded a CSV file on Tuberculosis. Specifically I grabbed the "WHO TB burden estimates [csv 890kb]" file from here. If you already have a big file to play with, feel free to edit the code appropriately. Anyway, in this code we create a function named read_large_file and turn it into a generator by making it yield back its data.

Here's how the magic works: We create a for loop that loops over the generator returned by our generator function. On each iteration, the generator yields a single line of data and the for loop processes it. In this case, the "process" is just to print the line to stdout, but you can modify that to do whatever you need. In a real program, you would probably be saving data to a database or creating a PDF or other report with the data. Each time the generator yields a value, it suspends the state of execution in the function so that local variables are preserved. This allows us to continue on the next loop without losing our place.
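To see that suspension in action, here is a minimal sketch that steps through a generator by hand using next. The count_up_to function is a hypothetical example made up for illustration; it is not part of the file-reading code above.

```python
def count_up_to(limit):
    """Yield the integers 1 through limit, one per request."""
    current = 1
    while current <= limit:
        yield current   # execution pauses here until the next request
        current += 1    # and resumes here with 'current' preserved

counter = count_up_to(3)
first = next(counter)
second = next(counter)
third = next(counter)
print(first, second, third)
```

Each call to next runs the function body only as far as the following yield, which is what a for loop does for us behind the scenes.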

Anyway, when the generator function runs out of data, we break out of it so that the loop doesn't continue on indefinitely. The generator allows us to process only one chunk of data at a time, which saves a lot of memory.
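The same pattern should work for reading fixed-size chunks rather than lines. The following is only a sketch: read_in_chunks and its chunk_size parameter are names I made up for illustration, and io.StringIO stands in for a real file so the example runs on its own.

```python
import io

def read_in_chunks(file_object, chunk_size=1024):
    """Lazily yield successive chunks of at most chunk_size characters."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            # read() returns an empty string at end of file
            break
        yield chunk

# Hypothetical stand-in for a large file
fake_file = io.StringIO("abcdefghij")
chunks = list(read_in_chunks(fake_file, chunk_size=4))
print(chunks)
```

Calling list on the generator defeats the laziness, of course. It is only done here to show what the chunks look like; in real code you would loop over the generator instead.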

Update 2014/01/28: One of my readers pointed out that files return lazy iterators to begin with, which is something I thought they did. Oddly enough, everyone and their dog recommends using generators for reading files, but just iterating over the file is enough. So let's rewrite the example above to utilize this concept:

#----------------------------------------------------------------------
def process_file_differently(path):
    """
    Process the file line by line using the file's returned iterator
    """
    try:
        with open(path) as file_handler:
            while True:
                print(next(file_handler))
    except (IOError, OSError):
        print("Error opening / processing file")
    except StopIteration:
        pass
        
#----------------------------------------------------------------------
if __name__ == "__main__":
    path = "TB_burden_countries_2014-01-23.csv"
    process_file_differently(path)

In this code, we create an infinite loop that will call Python's next function on the file handler object. This will cause Python to return the file back to us line-by-line. When the file runs out of data, the StopIteration exception is raised, so we make sure we catch it and ignore it.
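Since file objects are already iterators, a plain for loop can do the same job and takes care of StopIteration for us. Here is a minimal sketch of that approach. The process_lines function is a hypothetical name, and io.StringIO with made-up contents stands in for a real file so the example is self-contained.

```python
import io

def process_lines(file_handler):
    """Iterate directly over a file object, one line at a time."""
    processed = []
    for line in file_handler:
        # the "processing" here is just stripping the trailing newline
        processed.append(line.rstrip("\n"))
    return processed

# Hypothetical stand-in for an opened file
sample = io.StringIO("first line\nsecond line\n")
rows = process_lines(sample)
print(rows)
```

With a real file, you would pass in the handler from a with open(path) block instead of the StringIO object.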

Generator Expressions

Python has the concept of generator expressions. The syntax for a generator expression is very similar to a list comprehension. Let's take a look at both to see the difference:

# list comprehension
lst = [ord(i) for i in "ABCDEFGHI"]

# equivalent list built from a generator expression
gen = list(ord(i) for i in "ABCDEFGHI")

This example is based on one found in Python's HOWTO section on generators and frankly I find it a bit obtuse. The main difference between a generator expression and a list comprehension is in what encloses the expression. For a list comprehension, it is square brackets; for the generator expression, it is regular parentheses. Let's create the generator expression itself without turning it into a list:

gen = (ord(i) for i in "ABCDEFGHI")
while True:
    print(next(gen))

If you run this code, you will see it print out each ordinal value for each member of the string and then you'll see a traceback stating that a StopIteration has occurred. That means that the generator has exhausted itself (i.e. it's empty). So far, I have not found a use for the generator expression in my own work, but I would be interested to know what you're using it for.
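If you would rather not see that traceback, a for loop will consume the generator expression and stop cleanly when it is exhausted. One place generator expressions tend to show up is as the argument to a function such as sum, where no intermediate list is needed. The following is a minimal sketch of both ideas, reusing the string from the example above:

```python
gen = (ord(i) for i in "ABCDEFGHI")

# A for loop handles StopIteration for us
values = []
for value in gen:
    values.append(value)
print(values)

# The generator expression is passed straight to sum();
# the extra parentheses can be dropped when it is the only argument
total = sum(ord(i) for i in "ABCDEFGHI")
print(total)
```

Note that gen is empty after the loop finishes, so looping over it a second time produces nothing.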

Wrapping Up

Now you know what a generator is for and one of its most popular uses. You have also learned about the generator expression and how it works. I have personally used a generator for parsing data files that are supposed to become "big data". What have you used these for?
