This vignette demonstrates how to use the auditor package for auditing residuals of machine learning models. The residual is the difference between the observed value and the value predicted by model. The auditor provides methods for model verification and validation by error analysis. It helps in finding answers to questions that may be crucial in deeper analyses.

Many algorithms, such as random forests and neutral networks are sometimes treated as black boxes and there is a lack of techniques that help in the analysis of errors in those models. Most methods provided in auditor package are model-agnostic, what means that they can be used regardless of expected distribution of residuals.

Data

Use case - predicting a length of life

To illustrate application of auditor we will use dataset “dragons” available in the DALEX package. The dataset contains characteristics of fictional creatures (dragons), like year of birth, height, weight, etc (see below). The goal is to predict the length of life of dragons (a regression problem).

library(DALEX)
data(dragons)
head(dragons)
##   year_of_birth   height   weight scars colour year_of_discovery
## 1         -1291 59.40365 15.32391     7    red              1700
## 2          1589 46.21374 11.80819     5    red              1700
## 3          1528 49.17233 13.34482     6    red              1700
## 4          1645 48.29177 13.27427     5  green              1700
## 5            -8 49.99679 13.08757     1    red              1700
## 6           915 45.40876 11.48717     2    red              1700
##   number_of_lost_teeth life_length
## 1                   25   1368.4331
## 2                   28   1377.0474
## 3                   38   1603.9632
## 4                   33   1434.4222
## 5                   18    985.4905
## 6                   20    969.5682

Models

First, we need models to compare. We selected linear regression and random forest because of their different structures. Linear regression model linear relationships between target response and independent variables. While random forest should be able to capture also non-linear relationships between variables.

# Linear regression
lm_model <- lm(life_length ~ ., data = dragons)

# Random forest
library(randomForest)
set.seed(59)
rf_model <- randomForest(life_length ~ ., data = dragons)

Preparation for residual (error) analysis

Analysis begins with creation of an explainer object with explain function from DALEX package. Explainer wraps a model with its meta-data, such as dataset that was used for training or observed response.

lm_exp <- DALEX::explain(lm_model, label = "lm", data = dragons, y = dragons$life_length)
rf_exp <- DALEX::explain(rf_model, label = "rf", data = dragons, y = dragons$life_length)

Next step requires creation of model_residual objects of each explained model. From this step on, only auditor functions will be used.

library(auditor)
lm_mr <- model_residual(lm_exp)
rf_mr <- model_residual(rf_exp)

Plots

In the following section, we show individual plotting functions which demonstrate different aspects of residual analysis. We devote more attention to selected functions, but usage of each function is more or less similar.

Observed vs predicted

First plot is a basic plot comparising predicted versus observed values. The red line corresponds to the y = x function. The patterns for both models are non-random around the diagonal line. The points corresponding to a random forest (darker dots) show the tendency to underprediction for large values of observed response. Points for linear model (lighter dots) are located more or less around diagonal line which means that this model predicted quite well.

plot(rf_mr, lm_mr, type = "prediction", abline = TRUE)

# alternatives:
# plot_prediction(rf_mr, lm_mr, abline = TRUE)
# plot_prediction(rf_mr, lm_mr, variable = "life_length")

Function plot_prediction presents observed values on the x-axis. However, on the x-axis there may be values of any model variable or by index of observations (variable = NULL).

plot(rf_mr, lm_mr, variable = "scars", type = "prediction")
plot(rf_mr, lm_mr, variable = "height", type = "prediction")

As you can notice, on above plots, there is no relationship for variable height and predicted values while for increasing number of scars, model predictions also increase. This means that that model captured monotonic relationship between number of scars and length of life of dragon.

Residuals vs observed, fitted or variable values

Next function (plot_residual()) shows residuals versus observed values. This plot is used to detect dependence of errors, unequal error variances, and outliers. For appropriate model, residuals should not show any functional dependency. Expected mean value should be equal to 0, regardless of \(\hat{y}\) values. Structured arrangement of points suggests a problem with the model. It is worth looking at the observations that clearly differ from the others. If points on the plot are not randomly dispersed around the horizontal axis, it may be presumed that model is not appropriate for the data.

plot(lm_mr, rf_mr, type = "residual")

# alternative:
# plot_residual(lm_mr, rf_mr)

Here also values (residuals) may be ordered by target variable, fitted values, any other variable or may be presented unordered.

plot(rf_mr, lm_mr, type = "residual", variable = "_y_hat_")
plot(rf_mr, lm_mr, type = "residual", variable = "scars")

# alternative:
# plot_residual(rf_mr, lm_mr, variable = "_y_hat_")
# plot_residual(rf_mr, lm_mr, variable = "scars")

In all examples above, we can see that linear model is better fitted for the data than random forest, because for greater values of selected variables residuals are larger Additionaly, we can identify most outlying observations:

# plot_residual(rf_mr, variable = "_y_hat_", nlabel = 10)

Density of residuals

Residual density plot (plot_residual_density()) detects the incorrect behavior of residuals. On the plot, there are estimated densities of residuals. Their values are displayed as marks along the x axis. For some models, the expected shape of density derives from the model assumptions. For example, simple linear model residuals should be normally distributed. However, even if the model does not have an assumption about the distribution of residuals, such a plot may be informative. If most of the residuals are not concentrated around zero, it is likely that the model predictions are biased.

plot(rf_mr, lm_mr, type = "residual_density")

# alternative
# plot_residual_density(rf_mr, lm_mr)

Resuduals may be also divided by values of a choosen variable (median of a numeric variable or levels of a factor).

plot_residual_density(rf_mr, lm_mr, variable = "colour")

Boxplot of residuals

Residual boxplot (plotResidualBoxplot()) shows the distribution of the absolute values of residuals. Boxplot usually presents following values:

  • box width which corresponds to the second and third quartile,
  • vertical line which reflects median,
  • the whiskers which extend to the smallest and largest values, no further than 1.5 of interquartile.

auditor adds sixth component to the boxplot which is the root mean square error (RMSE) measure, shown as . For the appropriate model, most of the residuals should lay near zero. A large spread of values indicates problems with a model. Comparing our two models we can see that random forest model is much more spreaded (worse) that linear one.

plot(lm_mr, rf_mr, type = "residual_boxplot")

# alternative
# plot_residual_boxplot(lm_mr, rf_mr)

More

more plots and more motivation TODO: linki do githuba, papera