3.3 Model development

ML developers often focus on optimising a single performance metric. Choosing the best-suited metric can itself be challenging, because it involves unavoidable trade-offs and value judgements, such as how to weigh the consequences of false positives against those of false negatives. Many other aspects of model development look, at first sight, like those of any other IT development project, but pose additional challenges when ML models are involved.

3.3.1 Development process and performance

An important aspect of the development process, and consequently of the final model, is reproducibility. Reproducibility can be tested by a simple review of the documentation, if that is deemed sufficient by the auditors. If any information on reproducibility is lacking, unclear or contradictory, and the process is not fully automated, reproducing the model and/or its predictions may be required. This can be a time-consuming exercise as it might include an iterative process with updated tests based on new information obtained by additional dialogue with the auditee organisation.

A well-structured and commented codebase (following the standards of the chosen coding language) and extensive documentation of all hardware and software used, including versions (for example, of R/Python modules and libraries), are prerequisites not only for reproducibility but also for long-term reliability, maintenance and possible succession or handover to new staff.
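As an illustration of the kind of version documentation meant here, the following minimal Python sketch records the interpreter, platform and package versions of the environment in which it is run. The function name `environment_snapshot` and the package names passed to it are assumptions made for this example; an auditee organisation would list whatever its model actually imports.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Collect interpreter, platform and package versions for the documentation."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Illustrative package names only (assumption, not taken from any audited system).
    print(json.dumps(environment_snapshot(["numpy", "scikit-learn"]), indent=2))
```

Storing such a snapshot next to each trained model gives auditors a concrete record to compare against the written documentation.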

The variables used as ‘features’ of the model can be examined, in both number and nature, to avoid unnecessarily complicated models that are prone to overfitting. Most features can be interpreted as private data depending on their use and context, and some can be proxies for protected variables; careful attention is therefore necessary to ensure compliance with data protection regulations (for examples, see Possible audit tests (above)).

The choice of ML algorithm type should be well motivated. If a hard-to-explain black box model is used, the documentation should show that this is justified by significantly better performance compared to a white box model. If auditors doubt that this is the case, they could train a simple white box model to test whether there is any performance difference (see footnote 12).
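A comparison of this kind might look like the sketch below. It assumes Python with scikit-learn, synthetic data from `make_classification`, logistic regression as the white box model, gradient boosting as the black box model, and ROC AUC as the metric. All of these are illustrative choices; in an audit, the auditee's data, model type and agreed performance metric would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the auditee's training data (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

models = {
    "white box (logistic regression)": LogisticRegression(max_iter=1000),
    "black box (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

# Five-fold cross-validated ROC AUC for each model on the same data.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean AUC {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

If the difference between the mean scores is small relative to the spread across folds, the justification for the black box model deserves closer scrutiny.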

Provided that an appropriate performance metric has been chosen that fits the objective of the application (see Section 3.1), it can be assumed that the model’s performance is well documented. Auditors should nevertheless verify that the reported performance is accurate and consistent, in particular if the granularity of the model’s input data is not identical to that of the reported performance (for example, the model operates on applications/issues/periods while the reported performance is defined per customer).
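The effect of such a granularity mismatch can be shown with a small worked example. The records, the customer labels and the aggregation rule (a customer counts as positive if any of their applications is positive) are all assumptions made for illustration; the rule actually used by the auditee organisation should be taken from its documentation.

```python
from collections import defaultdict


def accuracy(pairs):
    """Share of (truth, prediction) pairs that agree."""
    pairs = list(pairs)
    return sum(truth == pred for truth, pred in pairs) / len(pairs)


def customer_level(records):
    """Aggregate per-application records to one (truth, prediction) pair per customer.

    Assumed rule: a customer is positive if any of their applications is positive.
    """
    truth = defaultdict(bool)
    pred = defaultdict(bool)
    for customer, y_true, y_pred in records:
        truth[customer] |= bool(y_true)
        pred[customer] |= bool(y_pred)
    return [(truth[c], pred[c]) for c in truth]


# (customer, true label, predicted label) per application; hypothetical data.
records = [
    ("A", 0, 0), ("A", 0, 0), ("A", 0, 0), ("A", 1, 0),
    ("B", 0, 0), ("B", 0, 1),
    ("C", 1, 1),
]

app_acc = accuracy((t, p) for _, t, p in records)
cust_acc = accuracy(customer_level(records))
print(f"accuracy per application: {app_acc:.2f}")
print(f"accuracy per customer:    {cust_acc:.2f}")
```

The same predictions give clearly different accuracy figures at the two levels, which is why the level at which performance is reported needs to be checked explicitly.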

A comparison of the performance on training data versus test data is a standard procedure to test how well the model generalises to new data. If cross-validation was used during training, or trustworthy independent test data is not available, auditors might decide to test the performance on synthetic data (see footnote 13). If the performance on production data is very different from the test performance, the reason might be an overfitted model, or substantial differences in the production data compared to the training/test data. The latter might occur when the use of a well-performing model is extended to areas it was not originally designed for.
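The train-versus-test comparison can be sketched in a few lines of Python. The example is deliberately artificial: the 1-nearest-neighbour classifier, the synthetic data and the 20 % label noise are assumptions, chosen because such a model memorises its training data and therefore makes the gap easy to see.

```python
import random


def predict_1nn(train, x):
    """Return the label of the closest training point (a model that memorises its data)."""
    nearest = min(train, key=lambda row: (row[0] - x[0]) ** 2 + (row[1] - x[1]) ** 2)
    return nearest[2]


def accuracy(train, data):
    """Accuracy of the 1-NN model built on `train`, evaluated on `data`."""
    hits = sum(predict_1nn(train, (a, b)) == label for a, b, label in data)
    return hits / len(data)


def make_data(n, rng, noise=0.2):
    """Two features in [-1, 1]; label is 1 if their sum is positive, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = int(a + b > 0)
        if rng.random() < noise:
            label = 1 - label
        rows.append((a, b, label))
    return rows


rng = random.Random(0)
train, test = make_data(300, rng), make_data(200, rng)

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
gap = train_acc - test_acc
print(f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}, gap {gap:.2f}")
```

A large gap between the two figures is an indication of overfitting; what counts as large depends on the application and should be judged against the documented expectations.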


Risks:

  • Irreproducible and/or incomprehensible predictions
  • Dependencies on unspecified hardware, installation or environment variables (for example default model parameters dependent on the use of GPU vs CPU, undocumented software version)
  • Use of unnecessary data (correlated or unpredictive variables); if personal data is used unnecessarily, additional violation of GDPR may occur (data minimisation principle)
  • Overfitting (model does not generalise well to new data) or model bias/underfitting (oversimplified; model does not describe the data well)
  • Unnecessarily complicated model used because of convenience from earlier use, or personal preference rather than performance (for example, black box model that is not significantly better than a white box model)
  • Model optimised for an inappropriate performance metric (for example, a metric chosen out of personal preference or carried over from an earlier implementation rather than derived from the application objective)
  • Code can only be executed or understood by a single person or a small group of people

3.3.2 Cost-benefit analysis

The performance of the ML model in production should be compared to that of the previously used system (that is, the same performance metric measured without ML or with an older model). If the main objective of the ML application is to reduce costs, the savings with ML can be compared to the development plus maintenance costs (presumably staff or consultancy costs). Otherwise, depending on the application objective, classical cost-efficiency, cost-utility or cost-benefit analyses can be appropriate.
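For the simple cost-reduction case, the comparison amounts to basic arithmetic, sketched below. The function names and all monetary figures are hypothetical, and the sketch ignores discounting and any change in savings over time.

```python
def net_benefit(annual_saving, development_cost, annual_maintenance, years):
    """Savings minus development and maintenance costs over a number of years."""
    return years * (annual_saving - annual_maintenance) - development_cost


def break_even_years(annual_saving, development_cost, annual_maintenance):
    """Years until the development cost is recovered, or None if it never is."""
    yearly_gain = annual_saving - annual_maintenance
    if yearly_gain <= 0:
        return None
    return development_cost / yearly_gain


# Hypothetical figures, in the auditee organisation's currency.
benefit_3y = net_benefit(
    annual_saving=120_000, development_cost=200_000, annual_maintenance=40_000, years=3
)
payback = break_even_years(
    annual_saving=120_000, development_cost=200_000, annual_maintenance=40_000
)
print(f"net benefit after 3 years: {benefit_3y}")
print(f"break-even after {payback} years")
```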


Risks:

  • ML used for the sake of using ML, without improving the service
  • Inefficient spending and inappropriate use of the auditee organisation’s budget

3.3.3 Reliability

Reproducibility, which is a necessary condition for reliability, was discussed above, but with a focus on the ML component alone. Dependencies on other parts of the pipeline can influence the reliability of the model’s performance, in particular if the model runs automatically in real time. Auditors should consider all possible variations in the input to the model (intentional and unintentional) to assess the behaviour of the model under these circumstances. Another aspect of long-term reliability is the in-house competence in the auditee organisation to maintain the model (in particular, in performance monitoring, retraining and, where appropriate, re-optimisation).
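One way to examine input variations is to feed altered copies of a known record to the model and log what happens. The sketch below assumes a hypothetical `toy_model` standing in for the audited model, with made-up coefficients and field names; only the `probe` pattern is the point of the example.

```python
import math


def toy_model(record):
    """Hypothetical stand-in for the audited model: a score from two inputs."""
    z = 0.03 * record["age"] - 0.00002 * record["income"]
    return 1 / (1 + math.exp(-z))


def probe(model, baseline, variations):
    """Run the model on altered copies of a baseline record and log the outcome."""
    base_score = model(baseline)
    results = []
    for description, changes in variations:
        record = {**baseline, **changes}
        try:
            shift = model(record) - base_score
            outcome = "accepted"
        except (TypeError, ValueError, KeyError, OverflowError) as exc:
            outcome, shift = f"error: {type(exc).__name__}", None
        results.append((description, outcome, shift))
    return results


baseline = {"age": 40, "income": 50_000}
variations = [
    ("negative age", {"age": -5}),
    ("income as text", {"income": "50000"}),
    ("missing income", {"income": None}),
    ("income in cents instead of units", {"income": 5_000_000}),
]

results = probe(toy_model, baseline, variations)
for description, outcome, shift in results:
    print(f"{description}: {outcome}, score shift {shift}")
```

Inputs that are accepted silently although they are implausible (here the negative age and the wrong unit) are of particular interest, because they produce a prediction without any warning.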


Risks:

  • Performance expected from development not reached in production, or degrading over time
  • Untrustworthy prediction if unintended input data is given to the algorithm
  • High maintenance costs due to lack of in-house competence

3.3.4 Quality assurance

Quality assurance in the context of ML algorithms can be viewed as three-fold, with separate implementations for data quality, code quality and model quality. Data quality and code quality are not specific to audits of algorithms and we therefore do not discuss them in detail here. Model quality, in this context, refers to the model’s performance under different circumstances and over time.

  • Data quality: How is data quality ensured internally?
  • Code quality: A code review by auditors may not be feasible (beyond what is done for reproducibility), hence a test of the internal code quality assurance is advisable (for example, version control, unit tests, code review).
  • Model quality: Are sufficient performance tests in place, including:
    • tests for overfitting: does the model generalise well to unseen data?
    • retraining frequency: is the model retrained often enough to accommodate changes in the data (for example, demographic changes), changes in policies or legislation, or updated objectives?
    • feedback effects: is the data available for retraining biased by previous model predictions?
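A simple check of performance over time could look like the following sketch. The monthly periods, the reference accuracy of 0.9 and the tolerance of 0.1 are assumptions for illustration; in practice the reference would be the documented test performance and the tolerance a value agreed for the application.

```python
from collections import defaultdict


def accuracy_by_period(rows):
    """Accuracy per period from (period, true label, predicted label) rows."""
    hits, totals = defaultdict(int), defaultdict(int)
    for period, y_true, y_pred in rows:
        totals[period] += 1
        hits[period] += int(y_true == y_pred)
    return {period: hits[period] / totals[period] for period in sorted(totals)}


def flag_degradation(by_period, reference, tolerance):
    """Periods in which accuracy falls more than `tolerance` below the reference."""
    return [p for p, acc in by_period.items() if acc < reference - tolerance]


# Hypothetical production log.
rows = [
    ("2023-01", 1, 1), ("2023-01", 0, 0), ("2023-01", 1, 1), ("2023-01", 0, 0),
    ("2023-02", 1, 1), ("2023-02", 0, 0), ("2023-02", 1, 0), ("2023-02", 0, 0),
    ("2023-03", 1, 0), ("2023-03", 0, 1), ("2023-03", 1, 1), ("2023-03", 0, 0),
]

by_period = accuracy_by_period(rows)
flagged = flag_degradation(by_period, reference=0.9, tolerance=0.1)
print(by_period)
print("periods below tolerance:", flagged)
```

Note that such monitoring requires true outcomes to be available for production cases; if these exist only for cases selected by the model, the figures are affected by the feedback effect mentioned in the list above.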


Risks:

  • Unstable or erroneous results (for example, dependencies on data types that can change)
  • Unknown performance
  • Degrading performance over time
  • Model reinforcement loop (for example, if one is retraining on data selected by the model)

3.3.5 Risk assessment: Model development

In order to assess the various risks explained in the sections above, documentation of the following aspects should be reviewed:

  • Coding language, hardware, software versions (including all libraries)
  • Data transformations
  • Definition of performance metric(s) based on the project objective(s)
  • Choice of ML algorithm type (including black box versus white box)
  • Hyperparameter optimisation and final values (including any default values used)
  • Performance on different datasets
  • Comparison with the previous system
  • Privacy considerations (where applicable)
  • Safety considerations (where applicable)
  • Internal quality assurance

Table 3.3: Aspects and contact persons: Model development
Product owner | User | Helpdesk | Chief information officer | Project leader | Data analyst | Data engineer | Developer | Controller | IT security | Data protection official | Budget holder | Helper tool reference
Overall responsibility: Model development x x A4
Hardware and software specifications x x x x
Data transformations, choice of features x x x
Choice of performance metric(s) x x
Optimisation process x x x A4.011
Choice of ML algorithm type (including black box versus white box) x x x
Code quality assurance x x
Maintenance plan x x
Model quality assurance x x
Cost-benefit analysis x x x

3.3.6 Possible audit tests: Model development

  • Reproduction of the model (with given parameters), possibly on a subset of the training data or independent (possibly synthetic) data.
  • Reproduction of the model prediction/score with (a) the model and/or (b) a best reproduction of the model.
  • If not done internally: test of the feature importance (see footnote 14).
  • If there is a suspicion that too many or unnecessary features are used: re-train the model with fewer features to show whether performance differs significantly (see footnote 15).
  • If personal data is used as features: re-train the model with less or no personal data to quantify the trade-off between performance and personal data protection (see also suggestions from ICO in [14]).
  • If not done internally: test for overfitting by comparing performance on training versus test/validation versus production data.
  • If a black box model is used: train a simple white box model and compare the performance (if not done internally).
  • Cost-benefit analysis.
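The feature importance test in the list above can be approximated in a model-agnostic way by permutation: shuffle one feature at a time and measure how much the performance drops. The sketch below is a minimal pure-Python version under stated assumptions: the `predict` function is a hypothetical model that uses only the first feature, the data is synthetic, and accuracy is used as the metric. Most ML libraries offer their own, more complete implementations.

```python
import random


def accuracy(predict, X, y):
    """Share of rows for which the prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    """Hypothetical model: uses only feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

importances = permutation_importance(predict, X, y)
for j, value in enumerate(importances):
    print(f"feature {j}: mean accuracy drop {value:.3f}")
```

A feature with an importance close to zero is a candidate for removal, subject to the caveat that features can matter in combination with others, so a re-training test with the reduced feature set remains the more conclusive check.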

If the code and model are not available:

  • Reproducibility can only be verified completely in the same software, as the training of the model depends on the implementation in the respective library. However, if a similarly performing model of the same type with the same parameters can be trained in different software, it can be used for further tests.
  • Correlations between features, and between each feature and the model prediction, can be analysed with external tools (see footnote 16) using only the data (including predictions).
  • Retraining with fewer or different features can be done by the auditee organisation.
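The correlation analysis mentioned above needs nothing but the data and the predictions. A minimal Python sketch is given below; the column names and values are hypothetical, and Pearson correlation is an assumption that captures linear relationships only, so rank-based or non-linear measures may be needed in practice.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (sd_x * sd_y)


def correlation_table(columns):
    """Correlation for every pair of columns, keyed by (name_a, name_b)."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical extract: two features and the model's prediction.
columns = {
    "feature_a": [1, 2, 3, 4],
    "feature_b": [2, 4, 6, 8],
    "prediction": [4, 3, 2, 1],
}

table = correlation_table(columns)
for pair, value in table.items():
    print(pair, round(value, 3))
```

Strongly correlated feature pairs suggest redundancy, and a strong correlation between a sensitive variable and the prediction is a reason to look for proxy effects.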


[14] Information Commissioner’s Office (ICO), AI auditing framework, https://ico.org.uk/about-the-ico/news-and-events/ai-auditing-framework/.

  12. Note that while black box models often have the potential to outperform white box models, this potential might not be realised if the black box model is not sufficiently optimised. If its performance has further been judged to be sufficient and a white box model can achieve similar performance, the simpler model should be used to comply with explainability requirements.

  13. For example, data produced with the synthpop package in R or similar. Synthetic data is data that is produced artificially but with the same features as real data.

  14. Most ML libraries have built-in methods to test and visualise feature importance. A useful tool for explorative data analysis in R is the DataExplorer library, which can be used to easily test the features by their relation to the model predictions, feature correlations and principal component analysis.

  15. Note that features that do not contribute much to the performance alone can be important in combination with other features (feature combination possibilities depend on the type of model).

  16. For example, with the R package “DataExplorer”.