Creating a Dashboard of CPU Benchmarks Using R and Python.

Introduction

In this post I will teach you how to create and deploy a dashboard with a preview of the dataset alongside useful data visualization tools. Dashboards are useful for creating interactive and customizable data visualizations and web applications. They allow you to create a dynamic user interface that can interact with data and update in real-time, making it an excellent tool for data exploration, analysis, and sharing. Dashboards can be used for a wide range of purposes, from monitoring business metrics to visualizing scientific data.

To create the dashboard, I will combine R and Python to take advantage of the strengths of each language for web scraping, data cleaning, dashboard creation, and deployment. My source for CPU information and benchmarks is the CPU list from cpubenchmark.net. The goal is to scrape this data to create a dataset of benchmarks for our CPU dashboard example.

Web Scraping HTML tables with rvest

The rvest library for R has many useful functions for web scraping. We are interested in a function that can transform HTML tables (the content between <table> and </table>) into readable data frames.

library("rvest")

## Read data
webpage <- read_html("https://www.cpubenchmark.net/cpu_list.php")
tbls <- webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
length(tbls)

## [1] 3

head(tbls[[2]])

## # A tibble: 6 × 5
##   `CPU Name`              `CPU Mark(higher is better)` Rank(lo…¹ CPU V…² Price…³
##   <chr>                   <chr>                            <int>   <dbl> <chr>  
## 1 AArch64 rev 2 (aarch64) 2,246                             2187      NA <NA>   
## 2 AArch64 rev 4 (aarch64) 1,797                             2439      NA <NA>   
## 3 AC8257V/WAB             774                               3269      NA <NA>   
## 4 AMD 3015Ce              2,088                             2263      NA <NA>   
## 5 AMD 3015e               2,691                             1969      NA <NA>   
## 6 AMD 3020e               2,446                             2069      NA <NA>   
## # … with abbreviated variable names ¹​`Rank(lower is better)`,
## #   ²​`CPU Value(higher is better)`, ³​`Price(USD)`
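
For readers who want to see what this step does in Python terms, here is a minimal sketch of turning table markup into rows of cell text, using only the standard library. The TableParser class and the SAMPLE markup are hypothetical names made up for this illustration; this is not how rvest is implemented, and the sketch ignores nested tables and merged cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup shaped like a two-column benchmark table
SAMPLE = """
<table>
  <tr><th>CPU Name</th><th>CPU Mark</th></tr>
  <tr><td>AMD 3015e</td><td>2,691</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE)
rows = parser.tables[0]
print(rows)
```

The first row holds the header cells and each later row holds one record, which is the same shape that html_table() hands back as a tibble.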

Now the next step is to take the second table and perform some basic data cleaning. We are going to use regular expressions to strip the thousands separators, dollar signs, and asterisks from the character strings so that they can be converted into numeric values.

cpus_bench <- tbls[[2]]
cpus_bench$`CPU Mark(higher is better)` <- as.numeric(gsub(",", "", cpus_bench$`CPU Mark(higher is better)`))
cpus_bench$`Rank(lower is better)` <- as.numeric(cpus_bench$`Rank(lower is better)`)
cpus_bench$`CPU Value(higher is better)` <- as.numeric(cpus_bench$`CPU Value(higher is better)`)
cpus_bench$`Price(USD)` <-  gsub("(^\\$)|(\\*$)", "", cpus_bench$`Price(USD)`)
cpus_bench$`Price(USD)` <-  gsub(",", "", cpus_bench$`Price(USD)`)
cpus_bench$`Price(USD)` <- as.numeric(cpus_bench$`Price(USD)`)
head(cpus_bench)

## # A tibble: 6 × 5
##   `CPU Name`              `CPU Mark(higher is better)` Rank(lo…¹ CPU V…² Price…³
##   <chr>                                          <dbl>     <dbl>   <dbl>   <dbl>
## 1 AArch64 rev 2 (aarch64)                         2246      2187      NA      NA
## 2 AArch64 rev 4 (aarch64)                         1797      2439      NA      NA
## 3 AC8257V/WAB                                      774      3269      NA      NA
## 4 AMD 3015Ce                                      2088      2263      NA      NA
## 5 AMD 3015e                                       2691      1969      NA      NA
## 6 AMD 3020e                                       2446      2069      NA      NA
## # … with abbreviated variable names ¹​`Rank(lower is better)`,
## #   ²​`CPU Value(higher is better)`, ³​`Price(USD)`
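
The same cleaning rules can be sketched in Python with the standard re module. The function names parse_price and parse_mark are hypothetical, and returning None for values that cannot be parsed is an assumption of this sketch (R's as.numeric produces NA in that case).

```python
import re


def parse_price(raw):
    """Convert a scraped price such as '$6,644.99*' to a float, or None if missing."""
    if raw is None:
        return None
    # Drop a leading "$" and a trailing "*", mirroring the R pattern "(^\\$)|(\\*$)"
    cleaned = re.sub(r"^\$|\*$", "", raw.strip())
    # Drop thousands separators
    cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # for example "NA" or an empty string


def parse_mark(raw):
    """Convert a CPU Mark string such as '2,246' to an integer."""
    return int(raw.replace(",", ""))


print(parse_price("$6,644.99*"))
print(parse_price("NA"))
print(parse_mark("2,246"))
```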

Now that we have the data in good shape, it is time to retrieve more information on CPU benchmarks. The cpus_bench table contains information on 4080 CPUs; however, to make the dashboard more effective, I want to concentrate on the top 1000 CPUs according to CPU Mark.

## Sort according to CPU Mark
cpus_bench <- cpus_bench[order(cpus_bench$`CPU Mark(higher is better)`, decreasing = TRUE), ]
cpus_bench_1000 <- cpus_bench[1:1000,]
head(cpus_bench_1000, 10L)

## # A tibble: 10 × 5
##    `CPU Name`                        CPU Mark(higher i…¹ Rank(…² CPU V…³ Price…⁴
##    <chr>                                           <dbl>   <dbl>   <dbl>   <dbl>
##  1 AMD EPYC 9654                                  124119       1    10.5  11805 
##  2 AMD Ryzen Threadripper PRO 5995WX               96237       2    14.5   6645.
##  3 AMD EPYC 7773X                                  90731       3    21.4   4249 
##  4 AMD EPYC 7763                                   85944       4    23.4   3665 
##  5 AMD EPYC 7J13                                   85661       5    NA       NA 
##  6 AMD EPYC 7713                                   85521       6    23.1   3700.
##  7 AMD EPYC 7713P                                  83439       7    18.3   4550 
##  8 AMD Ryzen Threadripper PRO 3995WX               83097       8    13.3   6267.
##  9 AMD EPYC 7V13                                   82878       9    NA       NA 
## 10 AMD Ryzen Threadripper 3990X                    81109      10    11.5   7069 
## # … with abbreviated variable names ¹​`CPU Mark(higher is better)`,
## #   ²​`Rank(lower is better)`, ³​`CPU Value(higher is better)`, ⁴​`Price(USD)`

Very interesting, in the top-10 we find only AMD processors…

To add the rest of the CPU benchmarks, we are going to take advantage of a simple expression that takes the name of a CPU and creates a URL that the web scraping loop will use to retrieve that CPU's benchmarks.

i <- 1L
paste0("https://www.cpubenchmark.net/cpu.php?cpu=", gsub(" ", "\\+", cpus_bench_1000$`CPU Name`[i]))

## [1] "https://www.cpubenchmark.net/cpu.php?cpu=AMD+EPYC+9654"
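
A comparable URL builder in Python could look like the sketch below. The cpu_url helper is a hypothetical name. It uses urllib.parse.quote_plus, which turns spaces into "+" like the gsub() call above but also percent-encodes other reserved characters such as "@" or "/"; whether the site accepts both forms for such names is an assumption that has not been verified here.

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.cpubenchmark.net/cpu.php?cpu="


def cpu_url(cpu_name):
    """Build the detail-page URL for a CPU name, encoding spaces as '+'."""
    return BASE_URL + quote_plus(cpu_name)


print(cpu_url("AMD EPYC 9654"))
```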

The idea is to write a simple loop that iterates over the top 1000 CPUs and gathers information on benchmarks such as “integer_math(MOps/Sec)”, “floating_point_math(MOps/Sec)”, “find_prime_numbers(Million Primes/Sec)”, etc.

library("data.table") # provides data.table(), setnames() and rbindlist()

# Loop over the top-1000 CPUs and scrape the detail page of each one
df_bind <- list()
for (i in 1L:1000L) {
  # Build the detail-page URL and read its HTML
  webpage <- read_html(paste0("https://www.cpubenchmark.net/cpu.php?cpu=", gsub(" ", "\\+", cpus_bench_1000$`CPU Name`[i])))
  tbls <- webpage %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)
  # The second table holds the benchmark names (column 1) and values (column 2)
  df <- try(data.table(t(tbls[[2]][2])))
  # Keep the page only if the table was parsed and has the expected 9 benchmarks
  if (any(class(df) %in% "data.table")) {
    if (ncol(df) == 9) {
      setnames(df, t(tbls[[2]][1]))
      # Pull the remaining fields out of the raw HTML with regular expressions
      year <- as.integer(sub("^.*\\s", "", trimws(gsub('</p>.*$', '', gsub('^.*<strong class="bg-table-row">CPU First Seen on Charts:</strong>', "", webpage)))))
      gz <- as.integer(sub("\\s.*", "", trimws(gsub('</p>.*$', '', gsub('^.*<strong>Clockspeed:</strong>', "", webpage)))))
      cores <- as.integer(trimws(gsub('<strong>.*$', '', gsub('^.*<strong>Cores:</strong>', "", webpage))))
      threads <- as.integer(trimws(gsub('</p>.*$', '', gsub('^.*<strong>Threads:</strong>', "", webpage))))
      df_bind[[i]] <- cbind.data.frame(cpus_bench_1000[i, ], df, gz, cores, threads, year)
    }
  }
}

cpus_bench_full <- rbindlist(df_bind, fill = TRUE)
head(cpus_bench_full)

The algorithm may seem a bit intimidating, but it is actually quite simple. It gathers pieces of information from specific parts of the HTML code. For instance, to gather the CPU benchmarks we always scrape the second table on the page, taking its values as tbls[[2]][2]. Then we match the beginning and end of the HTML tags that surround useful information, such as gz (the clock speed), cores, and threads, in the HTML source code. The final data set looks like this:

##   X                          cpu_name cpu_mark.higher_is_better.
## 1 1                     AMD EPYC 9654                     124119
## 2 2 AMD Ryzen Threadripper PRO 5995WX                      95829
## 3 3                    AMD EPYC 7773X                      90731
## 4 4                     AMD EPYC 7763                      85944
## 5 5                     AMD EPYC 7J13                      85661
## 6 6                     AMD EPYC 7713                      85521
##   cpu_value.higher_is_better. price_usd integer_math.MOps.Sec.
## 1                       10.51  11805.00                 978227
## 2                       14.98   6399.00                 631867
## 3                       21.11   4299.00                 533457
## 4                       23.26   3695.00                 547840
## 5                          NA        NA                 555507
## 6                       23.11   3699.99                 533785
##   floating_point_math.MOps.Sec. find_prime_numbers.Million.Primes.Sec.
## 1                        522611                                     NA
## 2                        343904                                    676
## 3                        301129                                     NA
## 4                        299973                                    665
## 5                        300486                                    686
## 6                        272582                                    621
##   random_string_sorting.Thousand.Strings.Sec. data_encryption.MBytes.Sec.
## 1                                          NA                      187949
## 2                                         676                      132563
## 3                                          NA                      135770
## 4                                         665                      124591
## 5                                         686                      123954
## 6                                         621                      107100
##   data_compression.MBytes.Sec. physics.Frames.Sec.
## 1                           NA                  NA
## 2                           NA                  NA
## 3                           NA                  NA
## 4                           NA                  NA
## 5                           NA                  NA
## 6                           NA                  NA
##   extended_instructions.Million.Matrices.Sec. single_thread.MOps.Sec. ghz cores
## 1                                      200277                    2893   2    96
## 2                                      123388                    3302   2    64
## 3                                       91298                    2513   2    64
## 4                                       98801                    2576   2    64
## 5                                       99971                    2449   2    64
## 6                                       94897                    2718   2    64
##   threads year
## 1     192 2022
## 2     128 2022
## 3     128 2022
## 4     128 2021
## 5     128 2021
## 6     128 2021
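
The tag-matching idea used in the loop, reading the text that follows a known label up to the next tag, can be sketched in Python as follows. The SAMPLE_HTML fragment is hypothetical: it is shaped after the patterns in the R code, not copied from the real page, and the field_after helper is a name invented for this sketch. Keeping the clock speed as a float is also a choice made here, whereas the R code converts it with as.integer.

```python
import re

# Hypothetical fragment shaped like the markup the R patterns expect
SAMPLE_HTML = (
    '<p><strong>Clockspeed:</strong> 2.4 GHz</p>'
    '<p><strong>Cores:</strong> 96 <strong>Threads:</strong> 192</p>'
    '<p><strong class="bg-table-row">CPU First Seen on Charts:</strong> Q4 2022</p>'
)


def field_after(label_html, html):
    """Return the text between a label and the next tag, or None if the label is absent."""
    match = re.search(re.escape(label_html) + r"\s*([^<]*)", html)
    return match.group(1).strip() if match else None


cores = int(field_after("<strong>Cores:</strong>", SAMPLE_HTML))
threads = int(field_after("<strong>Threads:</strong>", SAMPLE_HTML))
# The clock speed is the first token ("2.4 GHz"), the year is the last one ("Q4 2022")
ghz = float(field_after("<strong>Clockspeed:</strong>", SAMPLE_HTML).split()[0])
year = int(
    field_after(
        '<strong class="bg-table-row">CPU First Seen on Charts:</strong>', SAMPLE_HTML
    ).split()[-1]
)

print(cores, threads, ghz, year)
```

Returning None when a label is missing makes it easy to skip pages that do not follow the expected layout, which plays the same role as the try() and ncol() checks in the R loop.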

Dashboard in Plotly Dash from Python

The library that I am going to use to create the dashboard is called Plotly Dash. Plotly Dash has several advantages for deploying a web application compared to other libraries. Firstly, it has a high level of interactivity, meaning that users are able to play around with the data, apply filters, and perform various operations. Secondly, in my view, it is flexible, as it allows users to create and customize different plots and layouts with relatively little effort. Thirdly, it integrates well with Pandas and NumPy, the main libraries commonly used for data science in Python. Finally, the library is relatively easy to deploy at zero cost as a website that can be embedded or used as a stand-alone service.

Don’t let the code overwhelm you! The structure of the dashboard's Python code is simple. We start by loading the packages that we are going to use. The first line loads the entire Dash library, and the next three lines import specific modules from it: dcc, html, and dash_table. These modules are needed to create the visual components of the dashboard, such as tables, dropdown menus, graphs, and other HTML elements. Pandas is imported in order to read and manipulate the dataset, which is stored in a CSV file. Then the Plotly graph objects (graph_objs) are imported to create a pie chart, and Plotly Express (px) is imported to create a scatter plot. After we have loaded the libraries and modules, we load the dataset using the read_csv function from Pandas, either from a local file or from a remote URL; in this case I am reading the data from my GitHub repository.

The app.layout is where the components of the dashboard are defined, such as tables, dropdowns, graphs, and other HTML elements. Here, you can be creative and write a layout that is both visually appealing and functional. I am going for a simple design with a header, html.H1('CPU Benchmark Data'), and a thin line, html.Hr(), that works as a separator between the sections of the dashboard. I start the dashboard by presenting a preview of the first 10 rows of the dataset using dash_table.DataTable(). It has several arguments, but perhaps the most important one is the data source, data=df.head(10).to_dict('records'), which displays only the first 10 rows of the data.

After the data preview, I add a line break, html.Br(), to mark the beginning of another section of the web application, followed by html.H4('Histogram variable:'), which displays the title of the histogram section. Next, I define a dropdown menu to select between the columns of the dataset:

dcc.Dropdown(
    id='variable-selector',
    options=[{'label': i, 'value': i} for i in df.columns],
    value='cpu_value(higher_is_better)'
)

This is followed by two radio buttons that are used to sort the table in ascending or descending order according to the variable selected. The snippet of code is the following:

dcc.RadioItems(
    id='sort-order',
    options=[{'label': i, 'value': i} for i in ['Ascending', 'Descending']],
    value='Ascending',
    labelStyle={'display': 'inline-block'}
)

And finally, we display the histogram using the following function:

dcc.Graph(
    id='histogram',
    figure={}
)

The rest of the layout follows the same mechanics. I define a dropdown menu, dcc.Dropdown(), to select a variable for the next plot, and then I render the plot using the same dcc.Graph() function.

After defining the app.layout, we have to write the app.callback decorators that bind the inputs and outputs of the interactive components (i.e., the dropdown menus and radio buttons) to the graphs and the table. The update_histogram_and_table, update_pie_chart, and update_scatter_plot functions are the callback functions that update each output based on the user input.

import dash
from dash import dcc
from dash import html
from dash import dash_table
import pandas as pd
import plotly.graph_objs as go # for the pie chart
import plotly.express as px # for the scatter plot

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

# Load the dataset
#df = pd.read_csv("test_cpus.csv")
df = pd.read_csv("https://raw.githubusercontent.com/Wario84/blog/main/assets/data/test_cpus.csv")

app.layout = html.Div([
    html.H1('CPU Benchmark Data'),
    html.Hr(),
    html.H3('Data Preview:'),
    dash_table.DataTable(
        id='table',
        columns=[{"name": i, "id": i} for i in df.columns],
        data=df.head(10).to_dict('records'),
        style_table={'overflowX': 'auto'},
        style_cell={'textAlign': 'left'},
        sort_action='native',
        page_action='none',
        style_data_conditional=[{
            'if': {'row_index': 'odd'},
            'backgroundColor': 'rgb(248, 248, 248)'
        }]
    ),
    html.Br(),
    html.H4('Histogram variable:'),
    dcc.Dropdown(
        id='variable-selector',
        options=[{'label': i, 'value': i} for i in df.columns],
        value='cpu_value(higher_is_better)'
    ),
    dcc.RadioItems(
        id='sort-order',
        options=[{'label': i, 'value': i} for i in ['Ascending', 'Descending']],
        value='Ascending',
        labelStyle={'display': 'inline-block'}
    ),
    dcc.Graph(
        id='histogram',
        figure={}
    ),
    html.Br(),
    html.H4('Pie-chart variable:'),
    dcc.Dropdown(
        id="variable-selector-2",
        options=[
            {"label": "Ghz", "value": "ghz"},
            {"label": "Cores", "value": "cores"},
            {"label": "Threads", "value": "threads"},
            {"label": "Year", "value": "year"},
        ],
        value="cores"
    ),
    dcc.Graph(id="pie-chart"),
    html.Br(),
    html.H4('Scatter-Plot variable:'),
    dcc.Dropdown(
        id='variable-selector-3',
        options=[{'label': i, 'value': i} for i in df.columns],
        value='cpu_name'
    ),
    dcc.Graph(id="scatter-plot"),
])

@app.callback(
    [dash.dependencies.Output('histogram', 'figure'),
     dash.dependencies.Output('table', 'data')],
    [dash.dependencies.Input('variable-selector', 'value'),
     dash.dependencies.Input('sort-order', 'value')]
)
def update_histogram_and_table(variable, sort_order):
    # Sort by the variable chosen in the dropdown, in the order chosen with the radio buttons
    df_sorted = df.sort_values(by=variable, ascending=(sort_order == 'Ascending'))
    data_table = df_sorted.head(10).to_dict('records')

    fig = {
        'data': [{
            'x': df[variable],
            'type': 'histogram'
        }],
        'layout': {
            'title': 'Histogram of ' + variable,
            'xaxis': {'title': variable},
            'yaxis': {'title': 'Count'}
        }
    }

    return fig, data_table

@app.callback(
    dash.dependencies.Output("pie-chart", "figure"),
    [dash.dependencies.Input("variable-selector-2", "value")]
)
def update_pie_chart(selected_column):
    # Count how many CPUs share each value of the selected column
    counts = df[selected_column].value_counts()
    fig = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values)])
    return fig

@app.callback(
    dash.dependencies.Output("scatter-plot", "figure"),
    [dash.dependencies.Input("variable-selector-3", "value")]
)
def update_scatter_plot(variable):
    return px.scatter(df, x="price_usd", y=variable).update_layout(
        xaxis={"title": "Price (USD)"},
        yaxis={"title": variable.capitalize()},
        margin={"l": 40, "b": 40, "t": 10, "r": 10},
        height=300,
    )

if __name__ == '__main__':
    # run_server() is deprecated in recent Dash releases in favor of run()
    app.run(debug=False)
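
The data handling inside the callbacks can be tried out without starting Dash. Below is a minimal sketch in plain Python of the two ideas: returning the first rows sorted by a selected variable, and counting categories for the pie chart. The helpers top_rows and category_counts are hypothetical names, the three sample rows reuse values from the tables above, and skipping missing values is a choice of this sketch (Pandas sort_values keeps them and places them last by default).

```python
from collections import Counter


def top_rows(records, variable, sort_order, n=10):
    """Sort row dicts by `variable` and return the first `n`, skipping missing values."""
    present = [row for row in records if row.get(variable) is not None]
    ordered = sorted(
        present,
        key=lambda row: row[variable],
        reverse=(sort_order == "Descending"),
    )
    return ordered[:n]


def category_counts(records, column):
    """Count how often each value of `column` occurs, most frequent first."""
    counts = Counter(row[column] for row in records if row.get(column) is not None)
    return counts.most_common()


# A few rows with values taken from the tables above
rows = [
    {"cpu_name": "AMD EPYC 7763", "cores": 64, "price_usd": 3665.0},
    {"cpu_name": "AMD EPYC 9654", "cores": 96, "price_usd": 11805.0},
    {"cpu_name": "AMD EPYC 7J13", "cores": 64, "price_usd": None},
]

print([r["cpu_name"] for r in top_rows(rows, "price_usd", "Descending")])
print(category_counts(rows, "cores"))
```

Keeping this logic in small functions of this kind makes it possible to check the sorting and counting rules on their own before wiring them to the dropdowns and radio buttons.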

Deploying as a web application

To finally deploy the dashboard as a web application, I am going to rely on a video published by Plotly: Deploy your Python Data App to the Web for Free - Dash. The procedure is step by step and very simple: first, we put the .py Python script in a public GitHub repository. Then we open an account on render.com and follow the steps shown in the video.

The final CPU Dashboard

Finally, I present the CPU benchmark dashboard. For a better experience and visualization, I invite you to check out the deployed version at cpu-benchmark-plotly-dash.onrender.com
