3.5 Evaluation

The evaluation step includes the choice of a ‘best’ model and the decision to deploy a model (or not), based on performance indicators that were derived from the project’s objective. In addition to the performance directly related to the business objective, such as better services or reduced costs, compliance with regulations as well as possible risks and side effects have to be evaluated before deployment. Even after deployment, any change in the AI system should entail a new evaluation, including passive changes like demographic shifts in the input data.

The performance optimised in all the aspects described in Section 3.3 Model development is weighed against the aspects described here. The relevance and relative importance of the respective audit parts depend on the type of model and on the application.

3.5.1 Transparency and explainability

Regulations for public administration usually require transparent procedures and the substantiation of decisions where individuals are concerned. The use of ML in decision-making can make these requirements difficult to fulfil, in particular when a black box model is used. However, there are several approaches to understanding the behaviour of ML algorithms:17

  • The global relationship between features and model predictions can be tested by visualising correct and false predictions as a function of single features. This approach can give a first, coarse understanding, but it neglects correlations and nonlinear behaviour.

    Other standard tools for global explainability, readily implemented in various R and Python packages, include partial dependence plots (PDP), individual conditional expectation (ICE) and accumulated local effects (ALE). These usually require the model to be available, because they average the model's predictions in different ways. If the model is not available to the auditors, they can either implement the first-mentioned approach themselves, or ask the auditee organisation to provide suitable plots.

  • In order to explain a particular model prediction (for example, in the context of a user complaint and the need to justify a particular decision), local explainability is necessary, that is, an explanation of the influence of single features at a specific point in parameter space (for a specific user or case). The most common methods are LIME and Shapley values, both available in standard libraries in R and Python. The motivation for the chosen method and its applicability should be documented.18 These methods use local approximations of the model and hence need the model to be available. It is not constructive for auditors to test the model behaviour with these methods. The main objective here is to test that the auditee organisation has implemented such methods so that it is able to substantiate ML-based decisions.
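The global tools named above (ICE and PDP) can be sketched in a few lines. This is a minimal illustration, assuming the model is available as a callable that takes one row (a dict of feature values) and returns a number; `toy_model`, the feature names and the grid are hypothetical stand-ins and do not come from any audited system.

```python
from statistics import mean

def ice_curves(predict, rows, feature, grid):
    """ICE: one curve per row, obtained by sweeping `feature` over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)        # copy, so the original row is untouched
            modified[feature] = value   # overwrite only the feature under study
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def partial_dependence(predict, rows, feature, grid):
    """PDP: the average of all ICE curves at each grid value."""
    curves = ice_curves(predict, rows, feature, grid)
    return [mean(column) for column in zip(*curves)]

# Hypothetical stand-in for the audited model (an assumption for illustration).
def toy_model(row):
    return 0.1 * row["age"] + (2.0 if row["income"] > 50 else 0.0)

rows = [{"age": 30, "income": 40}, {"age": 50, "income": 60}]
grid = [20, 40, 60]

ice = ice_curves(toy_model, rows, "age", grid)
pdp = partial_dependence(toy_model, rows, "age", grid)
print("ICE curves:", ice)
print("PDP:", pdp)
```

Plotting the ICE curves next to the PDP shows whether the average hides diverging behaviour for individual cases. In practice auditors would more likely rely on an established R or Python package than on hand-written code like this.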


Risks:

  • No understanding of the model’s predictions
  • No understanding of the effects of the different input variables
  • The administrative unit is not able to explain and justify decisions made by or with support of the ML model

3.5.2 Equal treatment and fairness

The training data for ML models may incorporate demographic disparities which are learned by the model, and then reinforced if the model predictions are used for decisions that impact the same demographics. The most common sources of such disparities are the training data and the training procedure. The data is influenced by the measurement procedure and variable definitions (for example, when a model that is supposed to find the best candidate in a recruitment process is trained on data from candidates who passed previous recruitment criteria). Since ML models usually improve if more data is available, their performance is often best on the majority group, while they can give significantly worse results for minorities.

Fairness in ML has become an increasingly important topic in the last few years. There is no common standard for ML fairness; instead, many different definitions and metrics apply. The most tangible class of fairness definitions is group-based fairness, which requires ML models to treat different groups of people in the same way and thus fits well with requirements of equal treatment in anti-discrimination laws. At the same time, group-based fairness is easy to test by looking at the performance of an ML model separately for different groups.

For classification models, the relevant metrics are based on the confusion matrix, which shows (in the simplest case of binary classification) how many positive and negative cases are classified correctly and how many are misclassified.19
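As an illustration of the binary confusion matrix and the rates derived from it, the following sketch counts the four cells and computes the quantities that the group-based metrics build on. The labels are made-up example data, and the function names are chosen here for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (labels are 0 or 1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts

def ratio(numerator, denominator):
    """Return None instead of failing when a rate is undefined."""
    return numerator / denominator if denominator else None

def rates(c):
    """Derive common rates from confusion matrix counts."""
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    return {
        "prevalence": ratio(c["tp"] + c["fn"], total),
        "predicted_prevalence": ratio(c["tp"] + c["fp"], total),
        "precision": ratio(c["tp"], c["tp"] + c["fp"]),
        "true_positive_rate": ratio(c["tp"], c["tp"] + c["fn"]),
        "false_positive_rate": ratio(c["fp"], c["fp"] + c["tn"]),
        "negative_predictive_value": ratio(c["tn"], c["tn"] + c["fn"]),
    }

# Made-up example labels (1 = positive class, 0 = negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
summary = rates(counts)
print(counts)
print(summary)
```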

It is important for auditors to understand that if the true distribution of classes is not the same between two groups represented in the data, it is impossible to satisfy all fairness criteria20 (see Appendix One Equality and Fairness measures in classification models for details). This mathematical fact is most easily understood when considering fairness focusing on equal treatment (procedural fairness, equality of opportunity) versus equal impact (minimal inequality of outcome).
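One way to see this incompatibility is a short identity that follows from the confusion matrix definitions. The symbols are introduced here for illustration: p is the true prevalence of the positive class in a group, PPV the precision, TPR the true positive rate and FPR the false positive rate of the model in that group (assuming 0 < p < 1 and PPV > 0).

```latex
\mathrm{PPV} = \frac{p \cdot \mathrm{TPR}}{p \cdot \mathrm{TPR} + (1 - p) \cdot \mathrm{FPR}}
\quad\Longrightarrow\quad
\mathrm{FPR} = \frac{p}{1 - p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot \mathrm{TPR}
```

If two groups have different prevalence p but the same PPV and the same TPR, the right-hand side differs between the groups, so their FPR cannot be equal. Within this identity the only escape is PPV = 1, which forces FPR = 0, that is, a model that makes no false positive errors.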

Auditors have to assess whether or not a model sufficiently satisfies equality requirements by defining relevant groups, testing which criteria are violated to what extent, and considering the consequences in the respective application of the model.
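The assessment described above can be sketched as follows: compute the same metrics separately per group and report the gap to a reference group. This is a minimal illustration on made-up data; the group labels, the choice of reference group and the function names are assumptions. No threshold for an acceptable gap is implied, since that depends on the application and the applicable law.

```python
from collections import defaultdict

def ratio(numerator, denominator):
    """Return None instead of failing when a rate is undefined."""
    return numerator / denominator if denominator else None

def group_metrics(groups, y_true, y_pred):
    """Confusion-matrix-based metrics, computed separately for each group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, truth, pred in zip(groups, y_true, y_pred):
        # "t"/"f": prediction correct or not; "p"/"n": predicted class
        cell = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[group][cell] += 1
    metrics = {}
    for group, c in counts.items():
        total = sum(c.values())
        metrics[group] = {
            "predicted_prevalence": ratio(c["tp"] + c["fp"], total),
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
            "precision": ratio(c["tp"], c["tp"] + c["fp"]),
        }
    return metrics

def gaps_to_reference(metrics, reference):
    """Difference of each metric to the reference group (None if undefined)."""
    ref = metrics[reference]
    gaps = {}
    for group, values in metrics.items():
        if group == reference:
            continue
        gaps[group] = {
            name: (None if value is None or ref[name] is None
                   else value - ref[name])
            for name, value in values.items()
        }
    return gaps

# Made-up example data for two groups, "A" and "B".
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

metrics = group_metrics(groups, y_true, y_pred)
gaps = gaps_to_reference(metrics, reference="A")
print(metrics)
print(gaps)
```

In this sketch a gap in predicted prevalence corresponds to a violation of statistical parity, gaps in true and false positive rates to a violation of equalized odds, and a gap in precision to a violation of sufficiency.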


Risks:

  • Reinforcement of inequalities that were picked up from the training data
  • Worse performance for minorities
  • Unequal treatment based on protected variables, in the worst case discrimination against groups defined by gender, religion, nationality etc.

3.5.3 Security

AI systems naturally face the same security issues related to physical infrastructure as other IT systems. Due to the massive amount of data and computing power needed for the development of ML models, and sometimes also the deployed system, the security of distributed and possibly cloud-based computing infrastructure tends to be relevant. Privacy protection might be particularly challenging if the data is processed or temporarily stored in countries with different regulations.

ML applications can carry additional security risks, in particular when they run automatically in real-time applications. Poisoning was mentioned in Section 3.2, and disclosure of personal data in Section 3.1. Similarly, there might be a disclosure risk for industrial secrets, which in the context of public administration can relate to information about infrastructure or safety procedures that should not be publicly available.

For image recognition models, the risk of adversarial attacks has to be considered. Adversarial attacks are data modifications that are carefully designed to trick the algorithm, for example, small stickers placed on a roadside stop sign so that the image recognition system in a self-driving car misidentifies it as a speed limit sign [18], [19].
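The principle behind such attacks can be illustrated on a much simpler model than an image classifier. The sketch below applies a sign-of-gradient perturbation (in the style of the fast gradient sign method) to a toy logistic regression model. The weights, the input and the perturbation size `epsilon` are made-up assumptions; this is not the physical-world attack of [18] or [19].

```python
import math

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign(value):
    return (value > 0) - (value < 0)

def sign_gradient_perturbation(weights, bias, x, y_true, epsilon):
    """Shift each input by +/- epsilon in the direction that increases the loss.

    For the logistic (cross-entropy) loss, the gradient with respect to
    input i is (p - y_true) * weights[i].
    """
    p = predict_proba(weights, bias, x)
    gradient = [(p - y_true) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]

# Toy model and input (assumptions for illustration only).
weights, bias = [2.0, -1.0], 0.0
x, y_true, epsilon = [0.3, 0.2], 1, 0.2

x_adv = sign_gradient_perturbation(weights, bias, x, y_true, epsilon)
p_original = predict_proba(weights, bias, x)
p_adversarial = predict_proba(weights, bias, x_adv)
print("original input:", x, "probability of class 1:", round(p_original, 3))
print("perturbed input:", x_adv, "probability of class 1:", round(p_adversarial, 3))
```

Each input value moves by at most `epsilon`, yet the predicted class changes. With high-dimensional inputs such as images, many small per-pixel changes can add up in the same way.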

These examples are by no means exhaustive. The variety of ML model types and applications makes it difficult to foresee all potential failures and attack vectors, even for ML developers.21


Risks:

  • Security risks depend on the application: disclosure of personal data or of other confidential information, poisoning of the model, adversarial attacks

3.5.4 Risk assessment: Evaluation

The above-mentioned risks can be assessed by reviewing the documentation of the internal evaluation, including:

  • internal risk assessments related to relevant security risks;
  • a comparison of different models (including a defined weighting of performance, transparency, fairness and safety aspects);
  • approaches to explain model behaviour;
  • approaches to minimise bias;
  • (where applicable) compliance with privacy laws; and
  • the model’s deployment and retraining strategy documents.
Table 3.5: Aspects and contact persons: Evaluation

Contact person columns: Product owner | User | Helpdesk | Chief information officer | Project leader | Data analyst | Data engineer | Developer | Controller | IT security | Data protection official | Budget holder

Aspect | Contact persons marked (x) | Helper tool reference
Overall responsibility: Evaluation | x x | A5, A6
Evaluation method | x x x | A5.001
Comparison of different models | x x x | A5.002, A5.003
Approaches to explain model behaviour | x x x x x | A6.013, A6.014
Bias tests | x x x x | A6.021
Compliance with privacy laws (where applicable) | x x x x x | A6.011
Safety risks assessment and mitigation strategies | x x x x x x x | A5.004, A6.015, A6.018
Communication with data subjects (where applicable) | x x | A7.009
Policy for human-AI interactions | x x x | A6.004, A6.008, A6.009
Quality assurance plan | x x x x x | A6.019, A6.020, A7.004
Monitoring of the model performance in production | x x x x | A6.006
Strategy for development / maintenance | x x x | A6.001

3.5.5 Possible audit tests: Evaluation

In the evaluation phase, auditors should pay close attention to the grounds on which the project owner declared their acceptance of the deliverables of the ML development project.

In addition, auditors can carry out the following tests:

  • Test the global model behaviour for each feature, using tools like PDP, ICE or ALE.

  • Test applicability and implementation of local explainability methods like LIME or Shapley values.

  • Determine which laws and policies apply in the ML application context (protected groups, affirmative action).

  • Calculate group-based equality and fairness metrics based on the data, including predictions.22
    If auditors suspect the use of proxy variables, independent data sources might be necessary to show the correlations and define relevant groups.
    Suggested metrics:23

    • Disparity in: prevalence, predicted prevalence, precision, false positive rate, true positive rate, negative predictive value.
    • Fairness metrics: statistical parity (also called demographic parity), equalized odds (also called disparate mistreatment), sufficiency (also called predictive rate parity).

  • Test the model predictions with one feature changed (for example, change all men into women while keeping all other features the same). If the model is not available, the auditee organisation can test the model’s performance on similarly manipulated data.

  • Calculate the performance when less/no personal data is used.
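The single-feature test from the list above can be sketched as follows, assuming the model is available as a callable on one row. The deliberately biased `toy_model`, the feature names and the example rows are hypothetical and serve only to show what the test detects.

```python
def flip_test(predict, rows, feature, swap):
    """Return the cases whose prediction changes when only `feature` is swapped."""
    changed = []
    for row in rows:
        if row[feature] not in swap:
            continue                     # no counterpart defined for this value
        counterfactual = dict(row)       # copy; all other features stay the same
        counterfactual[feature] = swap[row[feature]]
        before, after = predict(row), predict(counterfactual)
        if before != after:
            changed.append((row, before, after))
    return changed

# Hypothetical, deliberately biased model: a higher income threshold for "f".
def toy_model(row):
    threshold = 60 if row["gender"] == "f" else 50
    return 1 if row["income"] >= threshold else 0

rows = [
    {"gender": "m", "income": 55},
    {"gender": "f", "income": 70},
    {"gender": "f", "income": 55},
    {"gender": "m", "income": 40},
]

changed = flip_test(toy_model, rows, "gender", {"m": "f", "f": "m"})
share_changed = len(changed) / len(rows)
print("predictions changed for", len(changed), "of", len(rows), "cases")
```

A share of zero shows only that the model does not use the feature directly. It does not rule out unequal treatment through proxy variables, which is why the group-based metrics remain necessary.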


[18] K. Eykholt et al. (2017): Robust Physical-World Attacks on Deep Learning Models, https://arxiv.org/abs/1707.08945.

[19] N. Morgulis et al. (2019): Fooling a Real Car with Adversarial Traffic Signs, https://arxiv.org/abs/1907.00374.

  17. For example, see reference [15] for an instructive overview.

  18. For example, see [16] and [17] for problems with explanations based on LIME or Shapley values, respectively.

  19. See Appendix One Equality and Fairness measures in classification models for an overview of the most common equality and fairness metrics, and Appendix Two Model evaluation terms for an introduction to the confusion matrix.

  20. Except for the unrealistic case of a perfect model with 100% accuracy.

  21. See reference [1] for suggestions of control measures: While ‘bug bounties’ might be less appropriate in the public sector, an independent internal third party challenging the application (‘red team exercise’) could be feasible for comparatively large auditee organisations. Smaller organisations might apply organisational incentives for independent individuals to be vigilant about their AI applications, raising issues when necessary.

  22. For example, use the Python package ‘aequitas’.

  23. See Appendix One Equality and Fairness measures in classification models for details.