Parse Excel with ease using Perl

Aug 5, 2014 by David Farrell

In the business world, it seems like Excel spreadsheets are everywhere. Recently I had to parse several hundred spreadsheets under a tight deadline for a client. To make matters worse, the spreadsheets were in a mix of Excel 2003 (xls) and 2007 (xlsx) formats. Fortunately I know Perl, and using the Spreadsheet::Read module, it was easy. This article will show you how to use Spreadsheet::Read to parse Excel spreadsheets.

Requirements

You’ll need to install Spreadsheet::Read and a couple of interface modules. Spreadsheet::ParseExcel is an interface for Excel 2003 spreadsheets and Spreadsheet::XLSX is for reading the modern Excel format. You can install all three modules from the terminal using cpan:

$ cpan Spreadsheet::ParseExcel Spreadsheet::XLSX Spreadsheet::Read

Using Spreadsheet::Read

Let’s parse the spreadsheet shown in the cover image, which contains an income statement. Spreadsheet::Read provides a simple, unified interface for reading spreadsheets. It exports the ReadData function which requires a filepath to the spreadsheet:

use Spreadsheet::Read;

my $workbook = ReadData('income_statement.xlsx');

Now $workbook contains the data structure representing the spreadsheet. We can inspect this structure by printing it with Data::Printer:

\ [
    [0] {
        error     undef,
        parser    "Spreadsheet::XLSX",
        sheet     {
            Sheet1   1
        },
        sheets    1,
        type      "xlsx",
        version   0.13
    },
    [1] {
        attr     [],
        B1       "Income Statement 2014",
        B2       "Revenue",
        B3       "Cost of goods sold",
        B4       "Gross profit",
        B5       "Financing costs",
        B6       "Tax",
        B7       "Net profit",
        cell     [
            [0] [],
            [1] [],
            [2] [
                [0] undef,
                [1] "Income Statement 2014",
                [2] "Revenue",
                [3] "Cost of goods sold",
                [4] "Gross profit",
                [5] "Financing costs",
                [6] "Tax",
                [7] "Net profit"
            ],
            [3] [
                [0] undef,
                [1] undef,
                [2] 50000,
                [3] 2500,
                [4] 47500,
                [5] 7150,
                [6] 10087.5,
                [7] 30262.5
            ]
        ],
        C2       " $ 50,000.00 ",
        C3       " $ 2,500.00 ",
        C4       " $ 47,500.00 ",
        C5       " $ 7,150.00 ",
        C6       " $ 10,087.50 ",
        C7       " $ 30,262.50 ",
        label    "Sheet1",
        maxcol   3,
        maxrow   7
    }
]

This shows that $workbook is an arrayref, whose first element describes the file, and subsequent elements represent the individual worksheets. The label key pair contains the worksheet name, access it like this:

$workbook->[1]{label}; #Sheet1

Cells can be referenced using Excel’s grid notation (“A3”) or via standard Perl array access. The different between these is formatting:

$workbook->[1]{C2}; #$ 50,000.00

$workbook->[1]{cell}[3][2]; #50000

So if you need to perform additional processing on the data you’re extracting (such as saving to a database), you probably want to use the {cell} notation, to obtain clean data. With Spreadsheet::Read array indexes begin at 1, so cell “C2” is [3][2].

Perhaps you want to loop through two columns at once and print them? No problem:

for (2..7) {
    print "$workbook->[1]{cell}[2][$_]: $workbook->[1]{cell}[3][$_]\n"; 
}

There are some data points which Spreadsheet::Read does not provide: you cannot access the underlying formula of a cell and the styling data is also not available.

Conclusion

Spreadsheet::Read isn’t just great for command line apps, it has many uses. Unlike the Microsoft .Net interop library, Perl’s Excel interfaces are not single threaded and do not require Excel to be installed to work. Instead Spreadsheet::Read directly parses the Excel file. That makes it possible to process large computing tasks in parallel. Another possible use case is for a spreadsheet upload interface on a web application; Spreadsheet::Read also supports the Libre / Open Office formats as well as CSV text files.

This article was originally posted on PerlTricks.com.

Tags

data