softwareRisk: Computation of Node and Path-Level Risk Scores in Scientific Models

The R package softwareRisk leverages the network-like architecture of scientific models together with software quality metrics to identify risky paths, which are defined by the complexity of its functions and the extent to which errors can cascade along and beyond their execution order. It operates on tbl_graph objects representing call dependencies between functions (callers and callees). By leveraging the sensobol package (Puy et al. 2022), sofwareRisk also supports variance-based uncertainty and sensitivity analyses to evaluate how the identification of risky function-call paths varies under alternative assumptions about the relative importance of function complexity, coupling and structural position within the software.

Workflow

We first load the packages needed for the analysis.

library(softwareRisk)
library(tidygraph)

Prepare the required datasets

softwareRisk draws on the representation of the model’s source code as a directed call graph $G = (V,E)$ in which each node $v_i \in V$ is a function or subroutine and each directed edge $e_{ij} = (v_i, v_j) \in E$ is a function call. It also assumes that each function will have a cyclomatic complexity value. Therefore the analyst should have two different datasets before starting the analysis:

A spreadsheet listing the set set of directed edges as an edge list, with one row per function call. The first column may contian the caller function (source node, $v_i$, “from”) and the second column the calle function (target node, $v_j$, “to”):
A spreadsheet listing the cyclomatic_complexity values for each function in the model.

Let us create these datasets to illustrate their format:


# Dataset 1: calls (edge list) -------------------------------------------------

calls_df <- data.frame(
  from = c("clean_data", "compute_risk", "compute_risk", "calc_scores", "plot_results"),
  to = c("load_data", "clean_data", "calc_scores", "mean", "compute_risk")
)

calls_df
#>           from           to
#> 1   clean_data    load_data
#> 2 compute_risk   clean_data
#> 3 compute_risk  calc_scores
#> 4  calc_scores         mean
#> 5 plot_results compute_risk

# Dataset 2: cyclomatic complexity (node attributes) ---------------------------

cyclo_df <- data.frame(
  name  = c("clean_data", "load_data", "compute_risk", "calc_scores", "mean", "plot_results"),
  cyclo = c(6, 3, 12, 5, 2, 4)
)

cyclo_df
#>           name cyclo
#> 1   clean_data     6
#> 2    load_data     3
#> 3 compute_risk    12
#> 4  calc_scores     5
#> 5         mean     2
#> 6 plot_results     4

The analyst can then merge them into a tbl_graph:


# Merge into a tbl_graph -------------------------------------------------------

graph <- tbl_graph(nodes = cyclo_df, edges = calls_df, directed = TRUE)

graph
#> # A tbl_graph: 6 nodes and 5 edges
#> #
#> # A rooted tree
#> #
#> # Node Data: 6 × 2 (active)
#>   name         cyclo
#>   <chr>        <dbl>
#> 1 clean_data       6
#> 2 load_data        3
#> 3 compute_risk    12
#> 4 calc_scores      5
#> 5 mean             2
#> 6 plot_results     4
#> #
#> # Edge Data: 5 × 2
#>    from    to
#>   <int> <int>
#> 1     1     2
#> 2     3     1
#> 3     3     4
#> # ℹ 2 more rows

Once this is done, the data is prepared for softwareRisk.

Analysis

Here we illustrate the functions of softwareRisk by using the build-in data synthetic_graph. It consists of five entry nodes, 35 middle nodes and 15 sink nodes. Each entry node calls between two and five middle nodes and each middle node calls one to three sink nodes, thus simulating realistic code architecture. The synthetic example reproduces the characteristic right-tailed distribution of cyclomatic complexity found in real software systems, with many low-complexity functions and few highly complex ones (Landman et al. 2016).


# Load the data ----------------------------------------------------------------

data("synthetic_graph")

# Print it ---------------------------------------------------------------------

synthetic_graph
#> # A tbl_graph: 55 nodes and 122 edges
#> #
#> # A directed acyclic simple graph with 1 component
#> #
#> # Node Data: 55 × 2 (active)
#>    name  cyclo
#>    <chr> <dbl>
#>  1 E1        7
#>  2 M15      10
#>  3 M14      39
#>  4 M3       12
#>  5 M10      13
#>  6 E2       43
#>  7 M25      17
#>  8 M26       3
#>  9 E3        6
#> 10 M5        8
#> # ℹ 45 more rows
#> #
#> # Edge Data: 122 × 2
#>    from    to
#>   <int> <int>
#> 1     1     2
#> 2     1     3
#> 3     1     4
#> # ℹ 119 more rows

The next step is to compute the in-degree and betweenness centrality of each node, calculate its risk score and identify all simple paths through the directed function-call graph. All this is done with the all_paths_fun function. The in-degree and betweenness of the nodes are calculated internally by functions in the igraph package. The risk score for node $v_i$ is computed as

\[\begin{equation} r_{(v_i)} = \alpha\,\tilde{C}_{(v_i)} + \beta\, \tilde{d}_{(v_i)}^{\text{in}} + \gamma\,\tilde{b}_{(v_i)}\,, \end{equation}\]

where the tilde $\tilde{}$ refers to normalization, $C$ denotes cyclomatic complexity, $d^{\text{in}}$ refers to in-degree and $b$ denotes betweenness. The weights $\alpha$, $\beta$ and $\gamma$ reflect the relative importance of complexity, coupling and network position. High $r$ values indicate functions that are complex and/or highly interconnected and hence potential points of structural vulnerability.

Path-level risk scores are defined as

\[\begin{equation} P_k = 1 - \prod_{i=1}^{n_k} (1 - r_{k(v_i)})\,, \label{eq:independent_events} \end{equation}\]

where $r_{k(v_i)}$ is the risk of the $i$-th function in path $k$ and $n_k$ is the number of functions in that path. The equation above behaves like a saturating OR-operator: $P_k$ is at least as large as the maximum individual function risk and monotonically increases as more functions on the path become risky, approaching 1 when several functions have high risk scores. High $P_k$ scores thus identify not only vulnerable paths, but also paths whose potential failure can have a larger cascading effect into other parts of the system through their shared high-centrality functions.

In this example, we set $\alpha=0.6$, $\beta=0.3$ and $\gamma = 0.1$ to adopt a definition of risk that prioritizes defect-proneness and the likelihood of unexpected behaviours, thus relegating propagation potential as secondary. Other weights are possible, with the only constrain that $\alpha + \beta + \gamma = 1$.


# Run the function -------------------------------------------------------------

output <- all_paths_fun(graph = synthetic_graph,
                        alpha = 0.6,
                        beta  = 0.3,
                        gamma = 0.1,
                        complexity_col = "cyclo",
                        weight_tol = 1e-8)

# Print the output -------------------------------------------------------------

output
#> $nodes
#> # A tibble: 55 × 6
#>    name  cyclomatic_complexity indeg outdeg   btw risk_score
#>    <chr>                 <dbl> <dbl>  <dbl> <dbl>      <dbl>
#>  1 E1                        7     0      4  0        0.0610
#>  2 M15                      10     3      3 14.5      0.236 
#>  3 M14                      39     2      3 12        0.485 
#>  4 M3                       12     1      3  2.5      0.157 
#>  5 M10                      13     3      2 34.8      0.289 
#>  6 E2                       43     0      3  0        0.427 
#>  7 M25                      17     3      4 25        0.319 
#>  8 M26                       3     2      3  7        0.114 
#>  9 E3                        6     0      4  0        0.0508
#> 10 M5                        8     1      6  7.25     0.122 
#> # ℹ 45 more rows
#> 
#> $paths
#> # A tibble: 209 × 10
#>    path_id path_nodes path_str       hops path_risk_score path_cc gini_node_risk
#>      <int> <list>     <chr>         <dbl>           <dbl> <list>           <dbl>
#>  1       1 <chr [6]>  E1 → M14 → M…     5           0.900 <dbl>           0.253 
#>  2       2 <chr [4]>  E1 → M14 → M…     3           0.793 <dbl>           0.277 
#>  3       3 <chr [5]>  E1 → M10 → M…     4           0.773 <dbl>           0.243 
#>  4       4 <chr [6]>  E2 → M14 → M…     5           0.939 <dbl>           0.136 
#>  5       5 <chr [4]>  E2 → M14 → M…     3           0.874 <dbl>           0.0946
#>  6       6 <chr [5]>  E2 → M25 → M…     4           0.868 <dbl>           0.132 
#>  7       7 <chr [3]>  E2 → M25 → S…     2           0.725 <dbl>           0.0841
#>  8       8 <chr [5]>  E3 → M25 → M…     4           0.781 <dbl>           0.253 
#>  9       9 <chr [3]>  E3 → M25 → S…     2           0.545 <dbl>           0.269 
#> 10      10 <chr [6]>  E3 → M5 → M1…     5           0.799 <dbl>           0.285 
#> # ℹ 199 more rows
#> # ℹ 3 more variables: risk_slope <dbl>, risk_mean <dbl>, risk_sum <dbl>

The output of all_paths_fun is a list with two slots, nodes and paths.

$nodes: it contains the name of the nodes and their relevant metrics (cyclomatic complexity, in-degree, out-degree, betweenness and risk score).
$paths: it describes each simple path in the call graph and includes the ordered sequence of nodes forming the path, the number of hops (i.e., the number of edges traversed along the path), the path-level risk score and the vector of cyclomatic complexity values for the nodes along the path (path_cc column). In addition, two distributional metrics are reported: the Gini coefficient of node-level risk along the path (gini_node_risk column), which quantifies the inequality of risk contributions across nodes within a path, and the risk_slope, which captures the direction and magnitude of change in node-level risk from the beginning to the end of the path.

Plotting

softwareRisk provides functions to inspect the results of the analysis. The function path_fix_heatmap allows the analyst to chose the top $n$ nodes and $n$ paths in terms of their risk score and observe how much the risk score of the riskiest paths would decrease if the selected high-risk nodes were made perfectly reliable. This analysis identifies nodes that act as chokepoints for risk propagation, highlights paths dominated by single high-risk functions and reveals which refactoring actions would yield the greatest reductions in path-level risk.


path_fix_heatmap(all_paths_out = output, n_nodes = 20, k_paths = 20)
#> $delta_tbl
#> # A tibble: 400 × 3
#>    node  path_id deltaR
#>    <fct> <fct>    <dbl>
#>  1 S6    44       0    
#>  2 S6    146      0    
#>  3 S6    192      0    
#>  4 S6    103      0    
#>  5 S6    39       0    
#>  6 S6    57       0    
#>  7 S6    143      0    
#>  8 S6    159      0    
#>  9 S6    93       0.260
#> 10 S6    4        0    
#> # ℹ 390 more rows
#> 
#> $plot

softwareRisk also allows to plot the call graph with the top risky paths highlighted. This is done with the function plot_top_paths_fun. The top ten most risky paths are highlighted in colour. The thickness of the edge shows how frequently an edge participates in the top 10 most risky paths. The color of the edge (from orange to red) indicates the mean risk of paths including that edge.


plot_output <- plot_top_paths_fun(graph = synthetic_graph,
                                  all_paths_out = output,
                                  model.name = "ToyModel",
                                  language = "Fortran",
                                  top_n = 10,
                                  alpha_non_top = 0.05)

The color of the nodes maps onto the cyclomatic complexity categories defined by Watson and McCabe (1996) (0-10 low risk; 10-20 moderate complexity, 20-50 complex, high risk; $>50$ very complex, untestable).

The alpha_non_top argument controls the transparency of the paths that are not identified as top. For small or sparse models, it can be set to alpha_non_top = 1 to better visualize the full call graph:


plot_output <- plot_top_paths_fun(graph = synthetic_graph,
                                  all_paths_out = output,
                                  model.name = "ToyModel",
                                  language = "Fortran",
                                  top_n = 10,
                                  alpha_non_top = 1)

Uncertainty and sensitivity analysis

softwareRisk also enables the analyst to perform uncertainty and sensitivity analyses of risk score calculations by leveraging the sensobol package (Puy et al. 2022). By systematically varying the relative contribution of cyclomatic complexity and structural metrics such as in-degree and betweenness, the package allows users to evaluate how sensitive node- and path-level risk scores are to modelling assumptions. This approach makes it possible to assess the robustness of the identification of high-risk paths under alternative definitions of risk.

The relative weights assigned to cyclomatic complexity and structural metrics (in-degree and betweenness) are parameterized by the coefficients $\alpha$, $\beta$, and $\gamma$, which are randomly sampled and internally transformed into normalized weights ($\alpha_{\mathrm{raw}}$, $\beta_{\mathrm{raw}}$, and $\gamma_{\mathrm{raw}}$) that sum to one.

Uncertainty and sensitivity analyses are implemented through the function uncertainty_fun:


# Run uncertainty analysis -----------------------------------------------------

uncertainty_analysis <- uncertainty_fun(all_paths_out = output, 
                                        N = 2^10, 
                                        order = "first")

# Print the top five rows ------------------------------------------------------

lapply(uncertainty_analysis, function(x) head(x, 5))
#> $nodes
#>      name
#>    <char>
#> 1:     E1
#> 2:    M15
#> 3:    M14
#> 4:     M3
#> 5:    M10
#>                                                           uncertainty_analysis
#>                                                                         <list>
#> 1: 0.03389831,0.04358354,0.02033898,0.02773498,0.04745763,0.05649718,...[1024]
#> 2:       0.2474083,0.1956474,0.3198736,0.2317179,0.2819272,0.1860685,...[1024]
#> 3:       0.3543718,0.3739882,0.3269088,0.3141830,0.4427872,0.4340059,...[1024]
#> 4:       0.1190252,0.1122161,0.1285580,0.1024348,0.1555242,0.1287104,...[1024]
#> 5:       0.3393575,0.3138678,0.3750430,0.3478581,0.3206560,0.2893171,...[1024]
#>    sensitivity_analysis
#>                  <list>
#> 1:        <sensobol[5]>
#> 2:        <sensobol[5]>
#> 3:        <sensobol[5]>
#> 4:        <sensobol[5]>
#> 5:        <sensobol[5]>
#> 
#> $paths
#>    path_id                         path_str  hops
#>      <int>                           <char> <num>
#> 1:       1 E1 → M14 → M23 → M29 → M31 → S15     5
#> 2:       2             E1 → M14 → M23 → S15     3
#> 3:       3       E1 → M10 → M29 → M31 → S15     4
#> 4:       4 E2 → M14 → M23 → M29 → M31 → S15     5
#> 5:       5             E2 → M14 → M23 → S15     3
#>                                                     uncertainty_analysis
#>                                                                   <list>
#> 1: 0.9257106,0.9177837,0.9390559,0.9351443,0.9143014,0.9025447,...[1024]
#> 2: 0.7303364,0.6921096,0.7880574,0.6902704,0.8078236,0.7221158,...[1024]
#> 3: 0.8739811,0.8528306,0.9041735,0.8980060,0.8258811,0.8012193,...[1024]
#> 4: 0.9413505,0.9402631,0.9466475,0.9462448,0.9399195,0.9375586,...[1024]
#> 5: 0.7871077,0.7762923,0.8144586,0.7432827,0.8652713,0.8219545,...[1024]
#>                                                                     gini_index
#>                                                                         <list>
#> 1:       0.2650993,0.3061022,0.2641478,0.3104253,0.2243279,0.2928069,...[1024]
#> 2:       0.2713840,0.3227031,0.2863332,0.3057715,0.2310621,0.3307675,...[1024]
#> 3:       0.2977858,0.3538348,0.2918408,0.3444242,0.2655702,0.3216598,...[1024]
#> 4:       0.1630284,0.1905796,0.2040234,0.2229020,0.1054993,0.1722266,...[1024]
#> 5: 0.10803287,0.15313501,0.19156308,0.15104038,0.05330423,0.15508876,...[1024]
#>                                                                                 risk_trend
#>                                                                                     <list>
#> 1:       0.038713760,0.019454135,0.065677236,0.041949065,0.031596090,0.004800082,...[1024]
#> 2:             0.07905928,0.03634152,0.13886414,0.06933133,0.10046077,0.02449915,...[1024]
#> 3:             0.04848646,0.02560504,0.08052045,0.04335761,0.05976993,0.01916173,...[1024]
#> 4:  0.009658070,-0.017903181, 0.048243822, 0.018176228,-0.009081876,-0.043626068,...[1024]
#> 5:        0.01804233,-0.04210884, 0.10225397, 0.01940837, 0.01503704,-0.07719577,...[1024]

The output is a list with two slots:

$nodes: a name column with the name of the node, an uncertainty_analysis column with a list of all the node-level risk scores after randomizing $\alpha$, $\beta$ and $\gamma$, and a sensitivity_analysis column with a data.table informing of the results of the sensitivity analysis.
$paths: a path_id column with the ID number of the path, a path_str column informing on the sequence of functions calls for that path, a hops column informing on the number of edges traversed along the path, and an uncertainty_analysis with a vector of path-level risk scores after the uncertainty analysis.

The sensitivity_analysis column can be unlisted and inspected as any other data.table. See the sensobol package for information on first-order ($S_i$) and total-order ($T_i$) indices (Puy et al. 2022).

The analyst can also plot the top $n$ risky paths and their uncertainty with the function path_uncertainty_plot. The error bars encompass the minimum, mean and maximum $P_k$ value for that path after the uncertainty analysis.


path_uncertainty_plot(ua_sa_out = uncertainty_analysis, n_paths = 20)
#> `height` was translated to `width`.