In this blog post I’ll show you how to scrape Income Statement, Balance Sheet, and Cash Flow data for companies from Yahoo Finance using Python, LXML, and Pandas.

I’ll use data from Mainfreight NZ (MFT.NZ) as an example.

The screenshot below shows what you can expect to get by following the steps in this blog post:

The first few columns of a Pandas DataFrame containing MFT.NZ Balance Sheet data

By following the steps in this blog post, you’ll also generate a DataFrame containing data from the Income Statement, and Cash Flow statement.

After creating Pandas DataFrames, I’ll show you how to export everything to an Excel file, so you’ll have output that looks something like this:

Disclaimers

Before we start, a few disclaimers:

  • This code doesn’t come with any guarantee or warranty.
  • I’m not a financial advisor. This blog post doesn’t represent financial advice.
  • I don’t recommend the use of this code for any investment decisions.
  • Use the code at your own risk.

Prerequisites

Make sure you have installed the Anaconda distribution of Python .. this includes Jupyter Notebook, which we’ll use throughout this blog post.

Now we begin!

Find the ticker symbol

In this case, we’ll be scraping data for Mainfreight NZ.

In Yahoo Finance, the symbol for Mainfreight is MFT.NZ:

Take a look at the Balance Sheet data that we’re going to scrape.

Here’s an example of some of the financial data we’ll be wanting to extract. Take note of the data displayed. Once we’ve scraped the data, we’ll cross-check it to ensure the scraping was accurate.

Inspect the page source

Open up the Chrome developer tools, and inspect the page source. Note that all of the balance sheet data is within an HTML table element, and that there is only one table element on the page:

Because of this, we’ll be able to use the //table xpath command, which fetches all <table> elemets in the document.

Now, we’ll look at where the numeric data is:

Note how in the scrrenshot above, the “80,526” value for Cash and Cash Equivalents is in <span> element within a <td> element.

Scrape some balance sheet data

Open up Jupyter Notebook, and execute the following code block:

import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd

symbol = 'MFT.NZ'

url = 'https://finance.yahoo.com/quote/' + symbol + '/balance-sheet?p=' + symbol

# Fetch the page that we're going to parse
page = requests.get(url)

# Parse the page with LXML, so that we can start doing some XPATH queries
# to extract the data that we want
tree = html.fromstring(page.content)

# Using XPATH, fetch all table elements on the page
table = tree.xpath('//table') 

# Ensure that there is only one table on the page; if there is more than
# one table, then it's possible that Yahoo Finance has changed their page
# structure. If this fails, please let me know which symbol you're trying
# to scrape in the comments below
assert len(table) == 1 

# Now that we've got the table element, convert it back to a string,
# so that it can be parsed by Pandas
tstring = lxml.etree.tostring(table[0], method='html')

# Read the HTML table into a Pandas DataFrame - read_html
# is designed to read HTML tables into a list of dataframe objects,
# one dataframe for each table.
df = pd.read_html(tstring)[0]

df

You should see some output which looks like the following:

There are a few observations to be taken from the screenshot of the Pandas DataFrame above:

  • The header row contains index values (0, 1, 2, 3, etc), rather than useful column names.
  • The first row of the table contains dates.
  • The first column contains account names.
  • The Current Assets contains NAN in every column. This is because the Current Assets row in Yahoo Finance has no values.
  • Rows such as Short Term Investments contain values ‘-’

Cross-checking this output with the numbers shown on the Balance Sheet screenshot from Yahoo Finance, and noting that the numbers and dates match, indicates that the scrape was a success.

Next, we’ll do some data cleanups and transformations to make the data more useful.

Clean up the data

Because we’re using Pandas, it’ll be more convenient if the columns are the account names, and the rows are indexed by Date, so let’s do that now:

df = df.set_index(0) # Set the index to the first column: 'Period Ending'.
df = df.dropna() # Get rid of rows containing 'NaN'
df = df.transpose() # Transpose the DataFrame, so that our header contains the account names
df

You should now see output which looks like:

Much better!

Now, let’s look at the data types of these columns:

df.dtypes

A few observations:

  • Period Ending is of type ‘object’ when it should be a date type.
  • All other columns such as Cash and Cash Equivalents are also of type ‘object’ when they should be numeric.

Let’s fix that up now, starting with the ‘Period Ending’ column:

df[df.columns[0]] = pd.to_datetime(df[df.columns[0]])
df = df.set_axis(df.columns, axis='columns', inplace=False)
df

The ‘Period Ending’ column should now be of type datetime64[ns]:

Now let’s convert the rest of the columns to numeric:

df = df.replace('-', '0') # Remove the '-' values that can't be converted to numeric.
numeric_columns = list(df.columns)[1::] # Take all columns, except the first (which is the 'Period Date' column)
df[numeric_columns] = df[numeric_columns].astype(np.float64) # Convert all columns to float64
df.dtypes

The numeric columns should be now of type float64:

Let’s have another look at the DataFrame,:

df

Which should output something something like:

Looking good! Now the Balance Sheet data has been fully scraped, with correct data types, in a form that’s ready to use.

Scraping Income Statement data from Yahoo Finance

Now we’ll create a more generalised form of the code above, by combining the code into a method.

The code is essentially the same as above, with one change - we’re renaming the ‘Period Ending’ date column to ‘Date’.

import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd

def scrape_table(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    table = tree.xpath('//table')
    assert len(table) == 1

    df = pd.read_html(lxml.etree.tostring(table[0], method='html'))[0]

    df = df.set_index(0)
    df = df.dropna()
    df = df.transpose()
    df = df.replace('-', '0')

    # The first column should be a date
    df[df.columns[0]] = pd.to_datetime(df[df.columns[0]])
    cols = list(df.columns)
    cols[0] = 'Date'
    df = df.set_axis(cols, axis='columns', inplace=False)

    numeric_columns = list(df.columns)[1::]
    df[numeric_columns] = df[numeric_columns].astype(np.float64)

    return df

symbol = 'MFT.NZ'
balance_sheet_url = 'https://finance.yahoo.com/quote/' + symbol + '/balance-sheet?p=' + symbol

df_balance_sheet = scrape_table(balance_sheet_url)
df_balance_sheet

You should get the same output as the final balance sheet above. The only change will be that the ‘Period Ending’ column is now called ‘Date’.

Now, let’s try the same method with the URL for income statement:

df_income_statement = scrape_table('https://finance.yahoo.com/quote/' + symbol + '/financials?p=' + symbol)
df_income_statement

Cross check these values with the income statement on yahoo finance:

and they match!

Scraping Statement of Cash Flows data from Yahoo Finance

Now that we’ve got a generic method that can be used on the Balance Sheet, and Income Statement, let’s try it on the Cash Flow statement.

df_cash_flow = scrape_table('https://finance.yahoo.com/quote/' + symbol + '/cash-flow?p=' + symbol)
df_cash_flow

Cross check these values with the cash flow statement on Yahoo Finance:

and they match!

Now you’ve got the following Pandas DataFrames:

  • df_cash_flow, containing data scraped from the Statement of Cash Flows
  • df_income_statement, containing data scraped from the Income Statement
  • df_balance_sheet, containing data scraped from the Balance Sheet

Exporting to Excel

It’s possible to export the Pandas DataFrame to Excel via ExcelWriter.

Below is the code to export to an Excel file with three worksheets; Income Statement, Balance Sheet, and Statement of Cash Flows

writer = pd.ExcelWriter(symbol + '-scraped.xlsx')
df_income_statement.to_excel(writer,'Income Statement')
df_balance_sheet.to_excel(writer,'Balance Sheet')
df_cash_flow.to_excel(writer,'Statement of Cash Flows')
writer.save()

I imported the file into Google Sheets, and got the following:

Thanks for Reading!

In this blog post, I’ve shown you how to scrape Income Statement, Balance Sheet, and Cash Flow data for companies in Yahoo Finance using Python and Pandas, and export the result to an Excel file for further analysis.

Did you enjoy this article? Please share it on Twitter, Facebook, or Linkedin via one of the share links below.

Any questions, feedback, or anything that you think I’d be interested in? Please leave a comment below.