Reading Spreadsheets with OpenPyXL and Python

There are a couple of fundamental actions that you will do with Microsoft Excel documents. One of the most basic is the act of reading data from an Excel file. You will be learning how to get data from your Excel spreadsheets.

Editor's note: This article is based on a chapter from the book: Automating Excel with Python. You can order a copy on Gumroad or Kickstarter.

Before you dive into automating Excel with Python, you should understand some of the common terminologies:

  • Spreadsheet or Workbook – The file itself (.xls or .xlsx).
  • Worksheet or Sheet – A single sheet of content within a Workbook. Spreadsheets can contain multiple Worksheets.
  • Column – A vertical line of data labeled with letters, starting with “A”.
  • Row – A horizontal line of data labeled with numbers, starting with 1.
  • Cell – A combination of Column and Row, like “A1”.

Now that you have some basic understanding of the vocabulary, you can move on.

In this chapter, you will learn how to do the following tasks:

  • Open a spreadsheet
  • Read specific cells
  • Read cells from a specific row
  • Read cells from a specific column
  • Read cells from multiple rows or columns
  • Read cells from a range
  • Read all cells in all sheets

You can get started by learning how to open a workbook in the next section!

Open a Spreadsheet

The first item that you need is a Microsoft Excel file. You can use the file that is in this GitHub code repository. There is a file in the chapter 2 folder called books.xlsx that you will use here.

It has two sheets in it. Here is a screenshot of the first sheet:

Book Worksheet

For completeness, here is a screenshot of the second sheet:

Sales Worksheet

Note: The data in these sheets are inaccurate, but they help learn how to use OpenPyXL.

Now you're ready to start coding! Open up your favorite Python editor and create a new file named open_workbook.py. Then add the following code to your file:

# open_workbook.py

from openpyxl import load_workbook


def open_workbook(path):
    workbook = load_workbook(filename=path)
    print(f"Worksheet names: {workbook.sheetnames}")
    sheet = workbook.active
    print(sheet)
    print(f"The title of the Worksheet is: {sheet.title}")


if __name__ == "__main__":
    open_workbook("books.xlsx")

The first step in this code is to import load_workbook() from the openpyxl package. The load_workbook() function will load up your Excel file and return it as a Python object. You can then interact with that Python object like you would any other object in Python.

You can get a list of the worksheets in the Excel file by accessing the sheetnames attribute. This list contains the titles of the worksheets from left to right in your Excel file. Your code will print out this list.

Next, you grab the currently active sheet. If your workbook only has one worksheet, then that sheet will be the active one. If your workbook has multiple worksheets, as this one does, then the last worksheet will be the active one.

The last two lines of your function print out the Worksheet object and the title of the active worksheet.

What if you want to select a specific worksheet to work on, though? To learn how to accomplish that, create a new file and name it read_specific_sheet.py.

Then enter the following code:

# read_specific_sheet.py

from openpyxl import load_workbook


def open_workbook(path, sheet_name):
    workbook = load_workbook(filename=path)
    if sheet_name in workbook.sheetnames:
        sheet = workbook[sheet_name]
        print(f"The title of the Worksheet is: {sheet.title}")
        print(f"Cells that contain data: {sheet.calculate_dimension()}")


if __name__ == "__main__":
    open_workbook("books.xlsx", sheet_name="Sales")

Your function, open_workbook() now accepts a sheet_name. sheet_name is a string that matches the title of the worksheet that you want to read. You check to see if the sheet_name is in the workbook.sheetnames in your code. If it is, you select that sheet by accessing it using workbook[sheet_name].

Then you print out the sheet's title to verify that you have the right sheet. You also call something new: calculate_dimension(). That method returns the cells that contain data in the worksheet. In this case, it will print out that "A1:D4" has data in them.

Now you are ready to move on and learn how to read data from the cells themselves.

Read Specific Cells

There are a lot of different ways to read cells using OpenPyXL. To start things off, you will learn how to read the contents of specific cells.

Create a new file in your Python editor and name it reading_specific_cells.py. Then enter the following code:

# reading_specific_cells.py

from openpyxl import load_workbook


def get_cell_info(path):
    workbook = load_workbook(filename=path)
    sheet = workbook.active
    print(sheet)
    print(f'The title of the Worksheet is: {sheet.title}')
    print(f'The value of A2 is {sheet["A2"].value}')
    print(f'The value of A3 is {sheet["A3"].value}')
    cell = sheet['B3']
    print(f'The variable "cell" is {cell.value}')

if __name__ == '__main__':
    get_cell_info('books.xlsx')

In this example, there are three hard-coded cells: A2, A3 and B3. You can access their values by using dictionary-like access: sheet["A2"].value. Alternatively, you can assign sheet["A2"] to a variable and then do something like cell.value to get the cell's value.

You can see both of these methods demonstrated in your code above.

When you run this code, you should see the following output:

<Worksheet "Sales">
The title of the Worksheet is: Sales
The value of A2 is 'Python 101'
The value of A3 is 'wxPython Recipes'
The variable "cell" is 5

This output shows how you can easily extract specific cell values from Excel using Python.

Now you're ready to learn how you can read the data from a specific row of cells!

Read Cells From Specific Row

In most cases, you will want to read more than a single cell in a worksheet at a time. OpenPyXL provides a way to get an entire row at once, too.

Go ahead and create a new file. You can name it reading_row_cells.py. Then add the following code to your program:

# reading_row_cells.py

from openpyxl import load_workbook


def iterating_row(path, sheet_name, row):
    workbook = load_workbook(filename=path)
    if sheet_name not in workbook.sheetnames:
        print(f"'{sheet_name}' not found. Quitting.")
        return

    sheet = workbook[sheet_name]
    for cell in sheet[row]:
        print(f"{cell.column_letter}{cell.row} = {cell.value}")


if __name__ == "__main__":
    iterating_row("books.xlsx", sheet_name="Sheet 1 - Books",
                  row=2)

In this example, you pass in the row number 2. You can iterate over the values in the row like this:

for cell in sheet[row]:
    ...

That makes grabbing the values from a row pretty straightforward. When you run this code, you'll get the following output:

A2 = Title
B2 = Author
C2 = Publisher
D2 = Publishing Date
E2 = ISBN
F2 = None
G2 = None

Those last two values are both None. If you don't want to get values that are None, you should add some extra processing to check if the value is None before printing it out. You can try to figure that out yourself as an exercise.

You are now ready to learn how to get cells from a specific column!

Read Cells From Specific Column

Reading the data from a specific column is also a frequent use case that you should know how to accomplish. For example, you might have a column that contains only totals, and you need to extract only that specific column.

To see how you can do that, create a new file and name it reading_column_cells.py. Then enter this code:

# reading_column_cells.py

from openpyxl import load_workbook


def iterating_column(path, sheet_name, col):
    workbook = load_workbook(filename=path)
    if sheet_name not in workbook.sheetnames:
        print(f"'{sheet_name}' not found. Quitting.")
        return

    sheet = workbook[sheet_name]
    for cell in sheet[col]:
        print(f"{cell.column_letter}{cell.row} = {cell.value}")


if __name__ == "__main__":
    iterating_column("books.xlsx", sheet_name="Sheet 1 - Books",
                    col="A")

This code is very similar to the code in the previous section. The difference here is that you are replacing sheet[row] with sheet[col] and iterating on that instead.

In this example, you set the column to "A". When you run this code, you will get the following output:

A1 = Books
A2 = Title
A3 = Python 101
A4 = wxPython Recipes
A5 = Python Interviews
A6 = None
A7 = None
A8 = None
A9 = None
A10 = None
A11 = None
A12 = None
A13 = None
A14 = None
A15 = None
A16 = None
A17 = None
A18 = None
A19 = None
A20 = None
A21 = None
A22 = None
A23 = None

Once again, some columns have no data (i.e., "None"). You can edit this code to ignore empty cells and only process cells that have contents.

Now let's discover how to iterate over multiple columns or rows!

Read Cells from Multiple Rows or Columns

There are two methods that OpenPyXL's worksheet objects give you for iterating over rows and columns. These are the two methods:

  • iter_rows()
  • iter_cols()

These methods are documented fairly well in OpenPyXL's documentation. Both methods take the following parameters:

  • min_col (int) – smallest column index (1-based index)
  • min_row (int) – smallest row index (1-based index)
  • max_col (int) – largest column index (1-based index)
  • max_row (int) – largest row index (1-based index)
  • values_only (bool) – whether only cell values should be returned

You use the min and max rows and column parameters to tell OpenPyXL which rows and columns to iterate over. You can have OpenPyXL return the data from the cells by setting values_only to True. If you set it to False, iter_rows() and iter_cols() will return cell objects instead.

It's always good to see how this works with actual code. With that in mind, create a new file named iterating_over_cells_in_rows.py and add this code to it:

# iterating_over_cells_in_rows.py

from openpyxl import load_workbook


def iterating_over_values(path, sheet_name):
    workbook = load_workbook(filename=path)
    if sheet_name not in workbook.sheetnames:
        print(f"'{sheet_name}' not found. Quitting.")
        return

    sheet = workbook[sheet_name]
    for value in sheet.iter_rows(
        min_row=1, max_row=3, min_col=1, max_col=3,
        values_only=True):
        print(value)


if __name__ == "__main__":
    iterating_over_values("books.xlsx", sheet_name="Sheet 1 - Books")

Here you load up the workbook as you have in the previous examples. You get the sheet name that you want to extract data from and then use iter_rows() to get the rows of data. In this example, you set the minimum row to 1 and the maximum row to 3. That means that you will grab the first three rows in the Excel sheet you have specified.

Then you also set the columns to be 1 (minimum) to 3 (maximum). Finally, you set values_only to True.

When you run this code, you will get the following output:

('Books', None, None)
('Title', 'Author', 'Publisher')
('Python 101', 'Mike Driscoll', 'Mouse vs Python')

Your program will print out the first three columns of the first three rows in your Excel spreadsheet. Your program prints the rows as tuples with three items in them. You are using iter_rows() as a quick way to iterate over rows and columns in an Excel spreadsheet using Python.

Now you're ready to learn how to read cells in a specific range.

Read Cells from a Range

Excel lets you specify a range of cells using the following format: (col)(row):(col)(row). In other words, you can say that you want to start in column A, row 1, using A1. If you wanted to specify a range, you would use something like this: A1:B6. That tells Excel that you are selecting the cells starting at A1 and going to B6.

Go ahead and create a new file named read_cells_from_range.py. Then add this code to it:

# read_cells_from_range.py

import openpyxl
from openpyxl import load_workbook


def iterating_over_values(path, sheet_name, cell_range):
    workbook = load_workbook(filename=path)
    if sheet_name not in workbook.sheetnames:
        print(f"'{sheet_name}' not found. Quitting.")
        return

    sheet = workbook[sheet_name]
    for column in sheet[cell_range]:
        for cell in column:
            if isinstance(cell, openpyxl.cell.cell.MergedCell):
                # Skip this cell
                continue
            print(f"{cell.column_letter}{cell.row} = {cell.value}")


if __name__ == "__main__":
    iterating_over_values("books.xlsx", sheet_name="Sheet 1 - Books",
                          cell_range="A1:B6")

Here you pass in your cell_range and iterate over that range using the following nested for loop:

for column in sheet[cell_range]:
    for cell in column:

You check to see if the cell that you are extracting is a MergedCell. If it is, you skip it. Otherwise, you print out the cell name and its value.

When you run this code, you should see the following output:

A1 = Books
A2 = Title
B2 = Author
A3 = Python 101
B3 = Mike Driscoll
A4 = wxPython Recipes
B4 = Mike Driscoll
A5 = Python Interviews
B5 = Mike Driscoll
A6 = None
B6 = None

That worked quite well. You should take a moment and try out a few other range variations to see how it changes the output.

Note: while the image of "Sheet 1 - Books" looks like cell A1 is distinct from the merged cell B1-G1, A1 is actually part of that merged cell.

The last code example that you'll create will read all the data in your Excel document!

Read All Cells in All Sheets

Microsoft Excel isn't as simple to read as a CSV file, or a regular text file. That is because Excel needs to store each cell's data, which includes its location, formatting, and value, and that value could be a number, a date, an image, a link, etc. Consequently, reading an Excel file is a lot more work! openpyxl does all that hard work for us, though.

The natural way to iterate through an Excel file is to read the sheets from left to right, and within each sheet, you would read it row by row, from top to bottom. That is what you will learn how to do in this section.

You will take what you have learned in the previous sections and apply it here. Create a new file and name it read_all_data.py. Then enter the following code:

# read_all_data.py

import openpyxl
from openpyxl import load_workbook


def read_all_data(path):
    workbook = load_workbook(filename=path)
    for sheet_name in workbook.sheetnames:
        sheet = workbook[sheet_name]
        print(f"Title = {sheet.title}")
        for row in sheet.rows:
            for cell in row:
                if isinstance(cell, openpyxl.cell.cell.MergedCell):
                    # Skip this cell
                    continue

                print(f"{cell.column_letter}{cell.row} = {cell.value}")


if __name__ == "__main__":
    read_all_data("books.xlsx")

Here you load up the workbook as before, but this time you loop over the sheetnames. You print out each sheet name as you select it. You use a nested for loop to loop over the rows and cells to extract the data from your spreadsheet.

Once again, you skip MergedCells because their value is None -- the actual value is in the normal cell that the MergedCell is merged with. If you run this code, you will see that it prints out all the data from the two worksheets.

You can simplify this code a bit by using iter_rows(). Open up a new file and name it read_all_data_values.py. Then enter the following:

# read_all_data_values.py

import openpyxl
from openpyxl import load_workbook


def read_all_data(path):
    workbook = load_workbook(filename=path)
    for sheet_name in workbook.sheetnames:
        sheet = workbook[sheet_name]
        print(f"Title = {sheet.title}")
        for value in sheet.iter_rows(values_only=True):
            print(value)


if __name__ == "__main__":
    read_all_data("books.xlsx")

In this code, you once again loop over the sheet names in the Excel document. However, rather than looping over the rows and columns, you use iter_rows() to loop over only the rows. You set values_only to True which will return a tuple of values for each row. You also do not set the minimum and maximum rows or columns for iter_rows() because you want to get all the data.

When you run this code, you will see it print out the name of each sheet, then all the data in that sheet, row-by-row. Give it a try on your own Excel worksheets and see what this code can do!

Wrapping Up

OpenPyXL lets you read an Excel Worksheet and its data in many different ways. You can extract values from your spreadsheets quickly with a minimal amount of code.

In this chapter, you learned how to do the following:

  • Open a spreadsheet
  • Read specific cells
  • Read cells from a specific row
  • Read cells from a specific column
  • Read cells from multiple rows or columns
  • Read cells from a range
  • Read all cells in all sheets

Now you are ready to learn how to create an Excel spreadsheet using OpenPyXL. That is the subject of the next article in this series!

Copyright © 2024 Mouse Vs Python | Powered by Pythonlibrary