1) Installing Python in RStudio

For this project, I will work through the RStudio IDE, since it is one of the easiest and fastest ways to generate a Markdown document with snippets of Python. I first installed Python through the reticulate package and the Conda repositories, which allows me to run Python commands within the R session. The process is simple: first install reticulate with its dependencies, then run the install_miniconda() command to install Python. The third step is to accept the conditions for installing Python on the local machine. And, finally, create a Python environment that will be called (bound) when an R session starts (this happens when I knit the Markdown document).

(If you have already installed Python, or you prefer working with Jupyter, skip to Part 2.)

# Install reticulate
install.packages("reticulate", dependencies = T)


# Load Library
library(reticulate)

# Install Python
install_miniconda()

# Accept the ToS for all the default channels it complained about
conda <- reticulate::conda_binary()
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/main"))
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/r"))
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/msys2"))



# Now create the env reticulate wanted
reticulate::conda_create("r-reticulate", packages = c("python=3.10","pip"))

# Install the needed Python libraries
reticulate::conda_install("r-reticulate",
                          c("pandas","pyarrow","google-cloud-bigquery",
                            "pandas-gbq","google-cloud-bigquery-storage","db-dtypes"),
                          channel = "conda-forge"
)

After the installation is done, it is best to start a clean new R session (Ctrl + Alt + F10) and load Python into the session. Recall that this snippet binds R with Python within RStudio.

# To bind the Python env to the R session, it is best to restart first (Ctrl + Alt + F10)
reticulate::use_condaenv("r-reticulate", required = TRUE)

reticulate::py_config()

2) Setting Up the Cloud Environment

For this project, I will showcase the Google Cloud CLI, which provides a powerful way to interact with Google BigQuery from Python. BigQuery has the advantage of running on Google's cloud infrastructure and supports standard SQL as well as BigQuery-specific extensions when needed. The Google Cloud CLI also has the advantage of monitoring and managing query jobs that run regularly. Furthermore, BigQuery can directly train, evaluate, and run ML models suitable for prediction and forecasting, as sketched below.
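To make the last point concrete, here is a minimal, hypothetical sketch of training a time-series model with BigQuery ML; the dataset and model names (demo.pnl_monthly, demo.revenue_arima) are placeholders I made up, not objects created in this project.

# Hedged sketch: BigQuery ML trains models with plain SQL.
# `demo.pnl_monthly` and `demo.revenue_arima` are hypothetical names.
bqml_sql = """
CREATE OR REPLACE MODEL `demo.revenue_arima`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'month',
  time_series_data_col = 'revenue_net'
) AS
SELECT month, revenue_net
FROM `demo.pnl_monthly`
"""
# It could be submitted with the google-cloud-bigquery client installed in Part 1:
# from google.cloud import bigquery
# bigquery.Client(project=PROJECT_ID).query(bqml_sql).result()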

The dataset that I will use, thelook_ecommerce, is public and accessible through Google BigQuery; it is rich in tables that contain inventories, KPIs, and other financial data typically needed for decision-making.

2.1) Install Google Cloud CLI

The first step is to install the Google Cloud CLI. I am installing it with Bundled Python and Beta Commands, and skipping the Cloud Tools for PowerShell, which I do not currently need.

After installation, the Google Cloud SDK Shell will open and ask you to authenticate with a Google Account. After selecting Yes, sign in and make sure the authentication is correct.

2.2) Creating a Project in the gcloud PowerShell

After accepting the terms, we can go back to PowerShell to create a new project (Option 3) and give the project a name; I am calling mine finance-bigq-demo-2025-mgs. If there is a problem, for instance a non-compliant project ID, you can run gcloud projects create finance-bigq-demo-2025-mgs to create the project directly.

2.3) Authentication and Enabling Big Data Services

After the project is created, we can proceed to authenticate the user in gcloud; this is done once and the credentials are stored locally: gcloud auth application-default login. After running this command, your default browser will open; just sign in with your Google credentials. The final step is enabling the services we will use for this project, namely the BigQuery API and the BigQuery Storage API, with this command: gcloud services enable bigquery.googleapis.com bigquerystorage.googleapis.com.

2.4) Sanity Checks

Before continuing, it is useful to run some sanity checks to verify everything is in good order. I recommend gcloud config list to show the active project, followed by gcloud services list --enabled to verify that the BigQuery API services are enabled. An additional step is to verify that your account is set as the active account, which you can do with gcloud auth list. After running this last command, you should see your account marked with *.
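A minimal sketch of running these checks from Python, assuming the gcloud CLI is on your PATH (on Windows you may need shell=True or the full path to gcloud.cmd):

# Run the three sanity-check commands from Python and print their output
import subprocess

for cmd in (
    ["gcloud", "config", "list"],                 # shows the active project
    ["gcloud", "services", "list", "--enabled"],  # BigQuery APIs enabled?
    ["gcloud", "auth", "list"],                   # active account marked with *
):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)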

3) The thelook_ecommerce Dataset

The GCloud services come with public datasets designed to test Google's capabilities for processing Big Data. This particular dataset has variables relevant to the context of finance and e-commerce. The dataset has the following tables, according to Google:

Table Name            | Number of Rows | Number of Columns | Description
distribution_centers  | 5              | 5                 | Lists the distribution centers, including their ID and basic location data.
events                | 1.1 million+   | 14                | Website activity for users (page views, cart events, etc.).
inventory_items       | 2.5 million+   | 14                | Item-level inventory records, including status (shipped, returned, etc.).
order_items           | 2.5 million+   | 19                | Links products to orders with item-level details, including sale price.
orders                | 2.5 million+   | 10                | Transaction headers for each customer order.
products              | 28,000+        | 10                | Product catalog with category, brand, and cost.
users                 | 100,000        | 12                | User demographics and traffic source.
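To confirm this list against the live dataset, here is a quick check with the google-cloud-bigquery client installed in Part 1 (a sketch; substitute your own project ID from Section 2.2):

# List the tables of the public dataset
from google.cloud import bigquery

client = bigquery.Client(project="finance-bigq-demo-2025-mgs")  # your project ID
for table in client.list_tables("bigquery-public-data.thelook_ecommerce"):
    print(table.table_id)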

4) Connecting Gcloud with Python

This query connects to the thelook_ecommerce schema and retrieves all table and column names with their variable types. For later use, I save the result as a CSV to study the variables for further analysis.

# Load libraries
import pandas as pd # to manipulate data
import pandas_gbq as pgbq # to connect to the Gcloud

# Define environment variables
PROJECT_ID = "finance-bigq-demo-2025-mgs"         
LOCATION   = "US"                      # theLook public dataset is in US
USE_STORAGE_API = False                 # set to True if Storage API is enabled, faster for big data
SCHEMA_PATH = "look_ecom_schema.csv"

import os  # to check whether the schema CSV is already cached

if os.path.isfile(SCHEMA_PATH):
    # Load the cached schema from CSV
    look_ecom_schema = pd.read_csv(SCHEMA_PATH)
    print(look_ecom_schema.head())
else:
    # Read the schema of the data from BigQuery and cache it locally
    look_ecom_schema = pgbq.read_gbq("""SELECT table_name, column_name, data_type
    FROM `bigquery-public-data.thelook_ecommerce`.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name IN ('distribution_centers', 'events', 'inventory_items',
                         'order_items', 'orders', 'products', 'users')
    ORDER BY table_name, column_name""", project_id=PROJECT_ID, location=LOCATION)
    look_ecom_schema.to_csv(SCHEMA_PATH, index=False)
    print(look_ecom_schema.head())
##              table_name               column_name  data_type
## 0  distribution_centers  distribution_center_geom  GEOGRAPHY
## 1  distribution_centers                        id      INT64
## 2  distribution_centers                  latitude    FLOAT64
## 3  distribution_centers                 longitude    FLOAT64
## 4  distribution_centers                      name     STRING

5) General Approach for Data Manipulation (ETL)

My approach to handling Big Data is to take advantage of filtering, aggregation, and joining performed efficiently in GCloud with SQL/Google BigQuery. Once the data set is ready, I save it locally as a data mart that is then opened for further transformation and analysis using Python.

5.1) Data Filtering, Aggregation and Joining Strategy

For data consistency, I use fallbacks when NULL values are detected in key columns. For instance, when filtering the data by timestamp, I first check when the product was delivered and, if that is NULL, I fall back to the order creation timestamp (a preview query is sketched below):
DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)).
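To make the fallback concrete, here is a hedged preview query; it assumes the PROJECT_ID and LOCATION variables defined in Section 4.

# Preview how the month bucket falls back from delivered_at to created_at
preview = pgbq.read_gbq("""
SELECT
  oi.delivered_at,
  oi.created_at,
  DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)) AS revenue_month
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
LIMIT 5
""", project_id=PROJECT_ID, location=LOCATION)
print(preview)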

Stage 1

To create the P&L data, I join three tables. I start by retrieving oi.sale_price from the order_items fact table, which is used to calculate revenue. For the calculation of cost, I left join order_items (bigquery-public-data.thelook_ecommerce.order_items AS oi) with the inventory_items table (LEFT JOIN bigquery-public-data.thelook_ecommerce.inventory_items AS ii) using the join key ON ii.id = oi.inventory_item_id. Then, as a fallback, I left join with the products table (LEFT JOIN bigquery-public-data.thelook_ecommerce.products AS p) using the join key ON p.id = oi.product_id. This fallback ensures we can retrieve the unit_cost when the cost is missing from the inventory table, by looking it up in the products table:
COALESCE(ii.cost, p.cost) AS unit_cost.

Tables & columns used

  • Facts: order_items -> oi.sale_price, oi.delivered_at, oi.created_at, oi.returned_at

  • Cost: inventory_items -> ii.cost (preferred), products -> p.cost (fallback); see the preview sketch after this list
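The cost fallback can be previewed the same way; this sketch again assumes the PROJECT_ID and LOCATION variables from Section 4.

# Preview the Stage 1 cost fallback: inventory cost preferred, product cost as backup
cost_preview = pgbq.read_gbq("""
SELECT
  oi.product_id,
  ii.cost AS inventory_cost,
  p.cost  AS product_cost,
  COALESCE(ii.cost, p.cost) AS unit_cost
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
LEFT JOIN `bigquery-public-data.thelook_ecommerce.inventory_items` AS ii
  ON ii.id = oi.inventory_item_id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
  ON p.id = oi.product_id
LIMIT 5
""", project_id=PROJECT_ID, location=LOCATION)
print(cost_preview)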

Stage 2

In the second stage, I aggregate gross revenue as SUM(sale_price) AS revenue_gross and cost as SUM(unit_cost) AS cogs. It is important to note that these are net line prices and costs that do not include taxes, freight, or other operational expenses that may affect the estimate.

Stage 3

I estimate returns in the month they occur, which is beneficial for real-time operational dashboards. I filter using WHERE returned_at IS NOT NULL to locate returned products. The aggregations are straightforward:
SUM(sale_price) AS returns and SUM(unit_cost) AS cogs_returns.

Stage 4

In the last stage, I leverage BigQuery to perform fast arithmetic operations with fallbacks for NULL values via COALESCE. Here, it is worth noting that I use a FULL OUTER JOIN to include all months in the P&L data, even if they contain only revenue or only returns.

# Import os to check whether the mart file already exists; otherwise run the ETL
import os

# Set the working directory
os.chdir(r"R:/PHD/Semester 20/Jobs/Empresas/Solvo/financial_analyst/project")

# Define the mart file used to cache the data locally
PARQUET_PATH = "pnl_monthly_5y_operational.parquet"

if os.path.isfile(PARQUET_PATH):
    # Load cached mart
    df_pnl = pd.read_parquet(PARQUET_PATH)
    print("Loaded cached mart from", PARQUET_PATH)
    print(df_pnl.head())
else:
    # ETL in the GCloud
    sql = """
    -- Base CTE: delivery-based timing, last 5y, and unit cost with fallback
    WITH base AS (
      SELECT
      
      -- month bucket as a DATE (1st of month, UTC)
      
      DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)) AS revenue_month,

      oi.returned_at,
      DATE(TIMESTAMP_TRUNC(oi.returned_at, MONTH)) AS return_month,

      oi.sale_price,
      COALESCE(ii.cost, p.cost) AS unit_cost
      FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
      LEFT JOIN `bigquery-public-data.thelook_ecommerce.inventory_items` AS ii
      ON ii.id = oi.inventory_item_id
      LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
      ON p.id = oi.product_id
      WHERE
      -- Compare DATE to DATE (last 5 years)
      DATE(COALESCE(oi.delivered_at, oi.created_at))>= DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR)),
      
      -- Operational policy rollup 1: revenue & cogs by delivery month
      revenue_cogs AS (
        SELECT
        revenue_month AS month,
        SUM(sale_price) AS revenue_gross,
        SUM(unit_cost)  AS cogs
        FROM base
        GROUP BY month),
        
        -- Operational policy rollup 2: returns by the month they happen
        returns_only AS (
          SELECT
          return_month AS month,
          SUM(sale_price) AS returns,
          SUM(unit_cost) AS cogs_returns
          FROM base
          WHERE returned_at IS NOT NULL
          GROUP BY month)
        
        -- Final monthly P&L (operational)
        SELECT
        m.month,
        COALESCE(m.revenue_gross, 0)                      AS revenue_gross, 
        COALESCE(r.returns, 0)                            AS returns, 
        (COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)) AS revenue_net, 
        COALESCE(m.cogs, 0)                               AS cogs,
        COALESCE(r.cogs_returns, 0)                             AS cogs_returns,
        -- GP = revenue_net - (cogs - cogs_returns)
        (COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)
        - (COALESCE(m.cogs, 0) - COALESCE(r.cogs_returns, 0))) AS gross_profit
        FROM revenue_cogs m
        FULL OUTER JOIN returns_only r USING (month)
        ORDER BY month"""
      
    # GCloud -> pandas DataFrame
    df_pnl = pgbq.read_gbq(
      sql,
      project_id=PROJECT_ID,
      location=LOCATION,
      use_bqstorage_api=USE_STORAGE_API
    )
    
    # Save locally as Parquet (typed, compressed, fast reloads)
    df_pnl.to_parquet(PARQUET_PATH, index=False)
    print(f"Saved monthly P&L mart (operational policy) to {PARQUET_PATH}")
    print(df_pnl.head())
    print(list(df_pnl.columns))
## Loaded cached mart from pnl_monthly_5y_operational.parquet
##        month  revenue_gross  ...  cogs_returns  gross_profit
## 0 2020-10-01   31392.780016  ...   1047.665769  15271.909169
## 1 2020-11-01   41859.880024  ...   2139.164078  19352.347184
## 2 2020-12-01   42759.699989  ...   2774.276256  18981.968990
## 3 2021-01-01   47594.270006  ...   2497.217012  22142.032789
## 4 2021-02-01   48373.559991  ...   2644.166791  22102.089914
## 
## [5 rows x 7 columns]
    
    
# Some sanity checks
df = df_pnl.copy()
df["gm_pct"] = (df["gross_profit"] / df["revenue_net"]).replace([pd.NA, pd.NaT], 0)
df["cogs_pct"] = (df["cogs"] / df["revenue_gross"]).replace([pd.NA, pd.NaT], 0)

print(df.tail(12)[["month","revenue_gross","returns","revenue_net","cogs","cogs_returns","gross_profit","gm_pct","cogs_pct"]])
##         month  revenue_gross       returns  ...   gross_profit    gm_pct  cogs_pct
## 49 2024-11-01  269082.380185  26214.960039  ...  126124.694354  0.519315  0.480289
## 50 2024-12-01  273170.510381  29337.700056  ...  126714.060647  0.519676  0.480170
## 51 2025-01-01  297889.750371  30776.970042  ...  138763.774925  0.519495  0.479869
## 52 2025-02-01  281236.250379  28218.760069  ...  131552.407939  0.519934  0.479882
## 53 2025-03-01  312965.760157  28176.590032  ...  146840.430599  0.515611  0.484954
## 54 2025-04-01  339263.010250  32180.190041  ...  159081.004919  0.518039  0.481683
## 55 2025-05-01  363811.930262  37850.470028  ...  169256.207956  0.519252  0.480367
## 56 2025-06-01  377284.620347  35158.139980  ...  177297.917865  0.518223  0.481696
## 57 2025-07-01  448945.130368  40269.100062  ...  212147.626249  0.519110  0.481006
## 58 2025-08-01  497113.050341  44222.930023  ...  234784.138660  0.518413  0.481715
## 59 2025-09-01  607773.990589  58880.459994  ...  285932.294740  0.520925  0.478581
## 60 2025-10-01  529830.500402  61318.850029  ...  243232.963535  0.519161  0.481192
## 
## [12 rows x 9 columns]
print("Overall GM% (net):", (df["gross_profit"].sum() / df["revenue_net"].sum()))
## Overall GM% (net): 0.5190164068500782
print("Median monthly GM%:", df["gm_pct"].median())
## Median monthly GM%: 0.5190465748341138
print("Median monthly COGS%:", df["cogs_pct"].median())
## Median monthly COGS%: 0.481191899780154
test = (df_pnl["revenue_net"] - df_pnl["gross_profit"]) - (df_pnl["cogs"] - df_pnl["cogs_returns"])
test.abs().max()
## np.float64(1.4551915228366852e-11)

6) Visualize the Profit & Loss (P&L) Statement

For a nice visualisation of the P&L statement, I use the plotly library, which turns our P&L time series into dynamic plots. These are great for dashboarding because they are interactive: you can zoom in and out, save a PNG directly, and see the underlying values when you hover over the plot.

import plotly.express as px
df_pnl["month"] = pd.to_datetime(df_pnl["month"])
long = df_pnl.melt(
    id_vars="month",
    value_vars=["revenue_gross","returns","revenue_net","cogs","cogs_returns","gross_profit"],
    var_name="metric",
    value_name="value"
)

fig = px.line(long, x="month", y="value", color="metric",
              title="Monthly P&L",
              labels={"month": "Month", "value": "USD", "metric": "Series"},
              markers=True)
fig = fig.update_layout(hovermode="x unified")
fig = fig.update_yaxes(tickprefix="$", separatethousands=True)
fig.show()
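If you want to share the chart outside RStudio, one option (my addition, not a step from the original workflow) is to export it as a standalone HTML file:

# Optional: save the interactive figure as a standalone HTML file
fig.write_html("pnl_monthly.html", include_plotlyjs="cdn")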