Leveraging Google BigQuery and Python for Financial Analysis: A Financial Big Data Application
1) Installing Python in RStudio
For this project, I will work through the RStudio IDE, since it is one of the easiest and fastest ways to generate a Markdown document with snippets of Python. I first installed Python through the reticulate package and the Conda repositories, which allows me to run Python commands within the R session. The process is simple: first install reticulate with its dependencies, then run the install_miniconda() command to install Python. The third step is to accept the terms of service for installing Python on the local machine. Finally, create a Python environment that will be bound (loaded) when an R session starts (this happens when I knit the Markdown document).
(If you have already installed Python, or you prefer working with Jupyter, skip to Part 2.)
# Install reticulate
install.packages("reticulate", dependencies = T)
# Load Library
library(reticulate)
# Install Python
install_miniconda()
# Accept the ToS for the default channels that conda complained about
conda <- reticulate::conda_binary()
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/main"))
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/r"))
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/msys2"))
# Now create the env reticulate wanted
reticulate::conda_create("r-reticulate", packages = c("python=3.10","pip"))
# Install the needed Python libraries
reticulate::conda_install(
  "r-reticulate",
  c("pandas","pyarrow","google-cloud-bigquery",
    "pandas-gbq","google-cloud-bigquery-storage","db-dtypes"),
  channel = "conda-forge"
)
After the installation is done, it’s best to start a clean new R session (Ctrl + Alt + F10) and load Python into the session. Recall that this snippet binds R with Python within RStudio.
# To bind the Python env to the R session, it is best to restart first (Ctrl + Alt + F10)
reticulate::use_condaenv("r-reticulate", required = TRUE)
reticulate::py_config()
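Optionally, a short Python chunk can confirm that the bound environment exposes the libraries installed above; this is a minimal check and not required for the rest of the workflow:
# Confirm the key libraries import from the r-reticulate environment
import sys
import pandas
import pandas_gbq
from google.cloud import bigquery

print(sys.executable)  # should point inside the r-reticulate environment
print(pandas.__version__, pandas_gbq.__version__, bigquery.__version__)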
2) Setting Up the Cloud Environment
For this project, I will showcase the use of the Google Cloud CLI, which provides a powerful way to interact with Google BigQuery from Python. BigQuery has the advantage of running on Google's cloud infrastructure and can execute standard SQL as well as BigQuery-specific extensions when needed. The Google Cloud CLI also has the advantage of monitoring and managing query jobs that run regularly. Furthermore, BigQuery can directly train, evaluate, and run ML models suitable for prediction and forecasting.
The dataset I will use, theLook eCommerce, is public and accessible through Google BigQuery; it is rich in tables that contain inventories, KPIs, and other financial data typically needed for decision-making.
2.1) Install Google Cloud CLI
The first step is to install the Google Cloud CLI. I am installing it with Bundled Python and the Beta Commands, and skipping the Cloud Tools for PowerShell, which I do not currently need.
After installation, the Google Cloud SDK Shell will open and require authentication with a Google account. After selecting Yes, make sure the authentication is correct for the account you intend to use.
2.2) Creating a Project in the gcloud PowerShell
After accepting the terms, we can go back to the PowerShell to create a new project (Option 3) and give the project a name; I am calling mine finance-bigq-demo-2025-mgs. If there is a problem, for instance a non-compatible project ID name, you can run gcloud projects create finance-bigq-demo-2025-mgs to create the project.
2.3) Authentication and Enabling Big Data Services
After the project is created, we can proceed to authenticate the user in gcloud; this is done one time, and the credentials are stored locally: gcloud auth application-default login. After running this command, your default browser will open; just sign in with your Google credentials. Then, a final step is enabling the services we will use for this project, namely the BigQuery API and the BigQuery Storage API, with this command: gcloud services enable bigquery.googleapis.com bigquerystorage.googleapis.com.
2.4) Sanity Checks
Before continuing, it is useful to run some sanity-check commands to verify everything is working in good order. I recommend gcloud config list, which shows the active project, followed by gcloud services list --enabled, which verifies that the BigQuery API services are enabled. An additional step is to verify that your account is set up as the active account, which you can do with gcloud auth list. After running this last command, you should be able to see your account marked with *.
3) The thelook_ecommerce Dataset
The GCloud services come with public datasets designed to test Google's capabilities for processing Big Data. This particular dataset has variables relevant to the context of finance and e-commerce. According to Google, the dataset contains the following tables:
Table Name | Number of Rows | Number of Columns | Description |
---|---|---|---|
distribution_centers | 5 | 5 | Lists the distribution centers, including their ID and basic location data. |
events | 1.1 million+ | 14 | Website activity for users (page views, cart events, etc.). |
inventory_items | 2.5 million+ | 14 | Item-level inventory records, including status (shipped, returned, etc.). |
order_items | 2.5 million+ | 19 | Links products to orders with item-level details, including sale price. |
orders | 2.5 million+ | 10 | Transaction headers for each customer order. |
products | 28,000+ | 10 | Product catalog with category, brand, and cost. |
users | 100,000 | 12 | User demographics and traffic source. |
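Once the Python connection from the next section is in place, these row counts can be cross-checked directly against the dataset's metadata. The sketch below is optional and assumes the legacy __TABLES__ metadata view is queryable for this public dataset:
import pandas_gbq as pgbq

PROJECT_ID = "finance-bigq-demo-2025-mgs"  # billing project created in section 2.2

# Row counts and sizes straight from the dataset metadata (assumes __TABLES__ is accessible)
tables = pgbq.read_gbq("""
    SELECT table_id, row_count, ROUND(size_bytes / POW(10, 6), 1) AS size_mb
    FROM `bigquery-public-data.thelook_ecommerce.__TABLES__`
    ORDER BY table_id
""", project_id=PROJECT_ID, location="US")
print(tables)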
4) Connecting Gcloud with Python
This query connects to the thelook_ecommerce schema and retrieves all table and column names with their data types. For later use, I save the result as a CSV to study the variables for further analysis.
# Load libraries
import os                  # to check for a cached schema file
import pandas as pd        # to manipulate data
import pandas_gbq as pgbq  # to connect to the GCloud
# Define ENV variables
PROJECT_ID = "finance-bigq-demo-2025-mgs"
LOCATION = "US"            # theLook public dataset is in US
USE_STORAGE_API = False    # set to True if the Storage API is enabled, faster for big data
SCHEMA_PATH = "look_ecom_schema.csv"
if os.path.isfile(SCHEMA_PATH):
    # Load the cached schema from CSV if it was already pulled
    look_ecom_schema = pd.read_csv(SCHEMA_PATH)
    print(look_ecom_schema.head())
else:
    # Read the schema of the data from BigQuery and cache it locally
    look_ecom_schema = pgbq.read_gbq("""SELECT table_name, column_name, data_type
    FROM `bigquery-public-data.thelook_ecommerce`.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name IN ('distribution_centers', 'events', 'inventory_items',
                         'order_items', 'orders', 'products', 'users')
    ORDER BY table_name, column_name""", project_id=PROJECT_ID, location=LOCATION)
    look_ecom_schema.to_csv(SCHEMA_PATH, index=False)
    print(look_ecom_schema.head())
## table_name column_name data_type
## 0 distribution_centers distribution_center_geom GEOGRAPHY
## 1 distribution_centers id INT64
## 2 distribution_centers latitude FLOAT64
## 3 distribution_centers longitude FLOAT64
## 4 distribution_centers name STRING
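Before writing the ETL, a quick profile of the pulled schema makes the variable study easier; the sketch below simply groups the schema by table and counts distinct columns (a convenience check, not part of the pipeline):
# Profile the schema pull: number of columns and the data types present per table
schema_summary = (
    look_ecom_schema
    .groupby("table_name")
    .agg(n_columns=("column_name", "nunique"),
         data_types=("data_type", lambda s: ", ".join(sorted(s.unique()))))
    .reset_index()
)
print(schema_summary)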
5) General Approach for Data Manipulation (ETL)
My approach to handling Big Data is to take advantage of filtering, aggregation, and joining, which are performed efficiently in the GCloud with SQL/Google BigQuery. Once the dataset is ready, I save it locally as a data mart that can then be opened for further transformation and analysis using Python.
6.1) Data Filtering, Aggregation and Joining Strategy
For data consistency, I use fallbacks when NULL values are detected in key columns. For instance, when filtering the data by timestamp, I first check when the product was delivered, and if that is NULL, I fall back to the order creation timestamp: DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)).
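The same fallback is easy to prototype locally in pandas before committing it to SQL; a minimal sketch on a made-up two-row frame whose column names mirror order_items:
import pandas as pd

# Hypothetical order_items-like frame: the second row was never delivered
items = pd.DataFrame({
    "delivered_at": pd.to_datetime(["2025-01-15", None]),
    "created_at":   pd.to_datetime(["2025-01-10", "2025-02-03"]),
})

# COALESCE(delivered_at, created_at) -> fillna, then truncate to the first of the month
event_ts = items["delivered_at"].fillna(items["created_at"])
items["revenue_month"] = event_ts.dt.to_period("M").dt.to_timestamp()
print(items)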
Stage 1
To create the P&L data, I join three tables. I start by retrieving oi.sale_price from the order_items fact table, which is used to calculate revenue. For the calculation of cost, I left join order_items (bigquery-public-data.thelook_ecommerce.order_items AS oi) with the inventory_items table (LEFT JOIN bigquery-public-data.thelook_ecommerce.inventory_items AS ii) using the join key ON ii.id = oi.inventory_item_id. Then, as a fallback, I left join with the products table (LEFT JOIN bigquery-public-data.thelook_ecommerce.products AS p) using the join key ON p.id = oi.product_id. This fallback ensures we can retrieve the unit_cost when the cost is missing from the inventory table, by looking it up in the products table: COALESCE(ii.cost, p.cost) AS unit_cost.
Tables & columns used
- Facts: order_items -> oi.sale_price, oi.delivered_at, oi.created_at, oi.returned_at
- Cost: inventory_items -> ii.cost (preferred), products -> p.cost (fallback)
Stage 2
In the second stage, I aggregate gross revenue as SUM(sale_price) AS revenue_gross and cost as SUM(unit_cost) AS cogs. It is important to note that these are net line prices and costs that do not include taxes, freight, or other operational expenses that may affect the estimate.
Stage 3
I estimate returns in the month they occur, which is beneficial for real-time operational dashboards. I filter using WHERE returned_at IS NOT NULL to locate returned products. The aggregations are straightforward: SUM(sale_price) AS returns and SUM(unit_cost) AS cogs_returns.
Stage 4
In the last stage, I leverage BigQuery to perform fast arithmetic operations with fallbacks for NULL values via COALESCE. Here, it is worth noting that I use a FULL OUTER JOIN to include all months in the P&L data, even if they contain only revenue or only returns.
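Before running this at scale, the rollup-and-join logic of Stages 2 through 4 can be prototyped in pandas on a toy frame. This is only a sketch with made-up numbers to make the arithmetic concrete; the production path is the BigQuery query below:
import pandas as pd

# Toy version of the `base` CTE (hypothetical values)
base = pd.DataFrame({
    "revenue_month": pd.to_datetime(["2025-01-01", "2025-01-01", "2025-02-01"]),
    "return_month":  pd.to_datetime([None, "2025-02-01", None]),
    "sale_price":    [100.0, 80.0, 120.0],
    "unit_cost":     [55.0, 40.0, 60.0],
})

# Stage 2: revenue and COGS rolled up by delivery month
revenue_cogs = (base.groupby("revenue_month", as_index=False)
                    .agg(revenue_gross=("sale_price", "sum"), cogs=("unit_cost", "sum"))
                    .rename(columns={"revenue_month": "month"}))

# Stage 3: returns rolled up by the month they happen
returns_only = (base[base["return_month"].notna()]
                    .groupby("return_month", as_index=False)
                    .agg(returns=("sale_price", "sum"), cogs_returns=("unit_cost", "sum"))
                    .rename(columns={"return_month": "month"}))

# Stage 4: full outer join, COALESCE-style zero fill, then the P&L arithmetic
pnl = revenue_cogs.merge(returns_only, on="month", how="outer").sort_values("month")
num_cols = ["revenue_gross", "cogs", "returns", "cogs_returns"]
pnl[num_cols] = pnl[num_cols].fillna(0)
pnl["revenue_net"] = pnl["revenue_gross"] - pnl["returns"]
pnl["gross_profit"] = pnl["revenue_net"] - (pnl["cogs"] - pnl["cogs_returns"])
print(pnl)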
# Import os to check if the mart file exists; otherwise ETL the data
import os
# Set the working directory
os.chdir(r"R:/PHD/Semester 20/Jobs/Empresas/Solvo/financial_analyst/project")
# define mart file to save the data
PARQUET_PATH = "pnl_monthly_5y_operational.parquet"
if os.path.isfile(PARQUET_PATH):
# Load cached mart
df_pnl = pd.read_parquet(PARQUET_PATH)
print("Loaded cached mart from", PARQUET_PATH)
print(df_pnl.head())
else:
# ETL in the GCloud
sql = """
-- Base CTE: delivery-based timing, last 5y, and unit cost with fallback
WITH base AS (
SELECT
-- month bucket as a DATE (1st of month, UTC)
DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)) AS revenue_month,
oi.returned_at,
DATE(TIMESTAMP_TRUNC(oi.returned_at, MONTH)) AS return_month,
oi.sale_price,
COALESCE(ii.cost, p.cost) AS unit_cost
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
LEFT JOIN `bigquery-public-data.thelook_ecommerce.inventory_items` AS ii
ON ii.id = oi.inventory_item_id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
ON p.id = oi.product_id
WHERE
-- Compare DATE to DATE (last 5 years)
DATE(COALESCE(oi.delivered_at, oi.created_at)) >= DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR)),
-- Operational policy rollup 1: revenue & cogs by delivery month
revenue_cogs AS (
SELECT
revenue_month AS month,
SUM(sale_price) AS revenue_gross,
SUM(unit_cost) AS cogs
FROM base
GROUP BY month),
-- Operational policy rollup 2: returns by the month they happen
returns_only AS (
SELECT
return_month AS month,
SUM(sale_price) AS returns,
SUM(unit_cost) AS cogs_returns
FROM base
WHERE returned_at IS NOT NULL
GROUP BY month)
-- Final monthly P&L (operational)
SELECT
m.month,
COALESCE(m.revenue_gross, 0) AS revenue_gross,
COALESCE(r.returns, 0) AS returns,
(COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)) AS revenue_net,
COALESCE(m.cogs, 0) AS cogs,
COALESCE(r.cogs_returns, 0) AS cogs_returns,
-- GP = revenue_net - (cogs - cogs_returns)
(COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)
- (COALESCE(m.cogs, 0) - COALESCE(r.cogs_returns, 0))) AS gross_profit
FROM revenue_cogs m
FULL OUTER JOIN returns_only r USING (month)
ORDER BY month"""
# GCloud -> pandas DataFrame
df_pnl = pgbq.read_gbq(
sql,
project_id=PROJECT_ID,
location=LOCATION,
use_bqstorage_api=USE_STORAGE_API
)
# Save locally as Parquet (typed, compressed, fast reloads)
df_pnl.to_parquet(PARQUET_PATH, index=False)
print(f"Saved monthly P&L mart (operational policy) to {PARQUET_PATH}")
print(df_pnl.head())
print(list(df_pnl.columns))
## Loaded cached mart from pnl_monthly_5y_operational.parquet
## month revenue_gross ... cogs_returns gross_profit
## 0 2020-10-01 31392.780016 ... 1047.665769 15271.909169
## 1 2020-11-01 41859.880024 ... 2139.164078 19352.347184
## 2 2020-12-01 42759.699989 ... 2774.276256 18981.968990
## 3 2021-01-01 47594.270006 ... 2497.217012 22142.032789
## 4 2021-02-01 48373.559991 ... 2644.166791 22102.089914
##
## [5 rows x 7 columns]
# Some sanity checks: margin ratios, guarding against divide-by-zero months
df = df_pnl.copy()
df["gm_pct"] = (df["gross_profit"] / df["revenue_net"]).replace([float("inf"), float("-inf")], 0).fillna(0)
df["cogs_pct"] = (df["cogs"] / df["revenue_gross"]).replace([float("inf"), float("-inf")], 0).fillna(0)
print(df.tail(12)[["month","revenue_gross","returns","revenue_net","cogs","cogs_returns","gross_profit","gm_pct","cogs_pct"]])
## month revenue_gross returns ... gross_profit gm_pct cogs_pct
## 49 2024-11-01 269082.380185 26214.960039 ... 126124.694354 0.519315 0.480289
## 50 2024-12-01 273170.510381 29337.700056 ... 126714.060647 0.519676 0.480170
## 51 2025-01-01 297889.750371 30776.970042 ... 138763.774925 0.519495 0.479869
## 52 2025-02-01 281236.250379 28218.760069 ... 131552.407939 0.519934 0.479882
## 53 2025-03-01 312965.760157 28176.590032 ... 146840.430599 0.515611 0.484954
## 54 2025-04-01 339263.010250 32180.190041 ... 159081.004919 0.518039 0.481683
## 55 2025-05-01 363811.930262 37850.470028 ... 169256.207956 0.519252 0.480367
## 56 2025-06-01 377284.620347 35158.139980 ... 177297.917865 0.518223 0.481696
## 57 2025-07-01 448945.130368 40269.100062 ... 212147.626249 0.519110 0.481006
## 58 2025-08-01 497113.050341 44222.930023 ... 234784.138660 0.518413 0.481715
## 59 2025-09-01 607773.990589 58880.459994 ... 285932.294740 0.520925 0.478581
## 60 2025-10-01 529830.500402 61318.850029 ... 243232.963535 0.519161 0.481192
##
## [12 rows x 9 columns]
print("Overall GM% (net):", (df["gross_profit"].sum() / df["revenue_net"].sum()))
## Overall GM% (net): 0.5190164068500782
print("Median monthly GM%:", df["gm_pct"].median())
## Median monthly GM%: 0.5190465748341138
print("Median monthly COGS%:", df["cogs_pct"].median())
## Median monthly COGS%: 0.481191899780154
test = (df_pnl["revenue_net"] - df_pnl["gross_profit"]) - (df_pnl["cogs"] - df_pnl["cogs_returns"])
test.abs().max()
## np.float64(1.4551915228366852e-11)
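If this mart were rebuilt on a schedule, the identity above could be promoted from a printout to a hard guard. A minimal sketch, where the tolerance is an arbitrary choice for floating-point noise:
# Fail fast if the P&L identity drifts beyond floating-point noise
identity_gap = (df_pnl["revenue_net"] - df_pnl["gross_profit"]) - (df_pnl["cogs"] - df_pnl["cogs_returns"])
assert identity_gap.abs().max() < 1e-6, "P&L identity violated: check the mart build"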
7) Visualize the Profit & Loss (P&L) Statement
For a clean visualization of the P&L statement, I use the plotly library, which turns our P&L time series into dynamic plots. These are great for dashboarding because they are interactive: you can zoom in and out, save a PNG directly, and see the underlying values by hovering the mouse over the plot.
import plotly.express as px
df_pnl["month"] = pd.to_datetime(df_pnl["month"])
long = df_pnl.melt(
id_vars="month",
value_vars=["revenue_gross","returns","revenue_net","cogs","cogs_returns","gross_profit"],
var_name="metric",
value_name="value"
)
fig = px.line(long, x="month", y="value", color="metric",
title="Monthly P&L",
labels={"month": "Month", "value": "USD", "metric": "Series"},
markers=True)
fig = fig.update_layout(hovermode="x unified")
fig = fig.update_yaxes(tickprefix="$", separatethousands=True)
fig.show()
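Since the figure is interactive, it can also be exported as a standalone HTML file for stakeholders who do not run Python; fig.write_html is the standard Plotly call for this (the file name here is just an example):
# Export the interactive chart as a self-contained HTML file (opens in any browser)
fig.write_html("pnl_monthly.html", include_plotlyjs="cdn")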