4.3 Model development

Model development is a pivotal stage in the AI system lifecycle, where conceptual designs are translated into operational models. For auditors, this phase provides the clearest view of how technical, organisational, and ethical requirements are embedded within the system.

This section explains how models are built and quality-assured, covering model selection and adaptation (including the use of foundation models and LLMs), feature engineering, prompt and parameter design, internal quality assurance, and methods to ensure reliability, robustness, and reproducibility. We address outcome-based acceptance testing and go/no-go decisions in Section 4.5, while operational testing in live environments is covered in Section 4.6.

4.3.1 Development process and performance characteristics

Model development involves building, training, and refining an AI model to perform specific tasks. This process typically includes preparing data (see section 4.2), selecting or adapting algorithms, evaluating performance, and optimising parameters to ensure the model performs accurately on new data. Increasingly, organisations adapt pre-trained or foundation models, such as LLMs, rather than developing models from scratch. Auditors should therefore assess model provenance, licensing, adaptation methods (such as fine-tuning, prompt engineering, or RAG), and any dependencies on external vendors.

Choosing the right model is critical, as it directly affects performance, scalability, and the ability to generalise to new situations. Key factors influencing this decision, including whether to develop a custom model or adapt a pre-trained model, are:

  • Data availability: quantity, quality and relevance of data.
  • Technical expertise: the organisation’s skills in machine learning and the subject domain.
  • Computing resources: the capacity to train and run models.
  • Time constraints: project timelines and delivery requirements.
  • Explainability needs: requirements for transparency and interpretability.
  • Budget: financial resources for development and ongoing operation.

For example, organisations might compare the performance of (i) simpler machine-learning methods, (ii) fine-tuned deep-learning/small language models, and (iii) LLMs accessed via APIs. In some cases, simpler machine learning methods or smaller, fine-tuned models may outperform more complex or costly general-purpose models. Auditors should expect organisations to document the rationale for model selection, ensuring choices are technically sound, justified, and transparent.
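The sketch below illustrates how such a comparison might be evidenced in practice. It is a minimal example, assuming a tabular classification task with scikit-learn; the dataset, candidate models, and metric are placeholders, not a recommendation.

```python
# Illustrative sketch: comparing a simpler and a more complex model on the same
# cross-validation splits, so the selection rationale can be documented.
# Dataset and model choices are placeholders for the system under audit.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    # Record the mean and spread so the comparison can be reproduced and audited.
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```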

Feature engineering – the process of transforming raw data into meaningful variables for the model – is particularly important. Methods for generating, extracting, and selecting features should be well documented. Auditors should pay attention to how individual features affect model performance and whether any features could introduce unnecessary complexity or risk of bias.
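A minimal sketch of a documented feature-engineering pipeline is shown below; the column names, transformers, and selection criterion are hypothetical and would need to be justified for the system under audit.

```python
# Illustrative sketch: a feature-engineering pipeline whose steps can be
# documented and reviewed. Column names and the choice of k are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["income", "tenure_months"]        # assumed numeric columns
categorical_features = ["region", "contract_type"]    # assumed categorical columns

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

feature_pipeline = Pipeline([
    ("preprocess", preprocess),
    # Keep only the k most informative features; the choice of k should be
    # justified and recorded, as should any features excluded because they
    # could act as proxies for sensitive attributes.
    ("select", SelectKBest(score_func=f_classif, k=5)),
])
```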

For generative AI and foundation models, prompt design and versioning are crucial, as the instructions or context provided can significantly influence outputs. In addition to prompt engineering, the configuration of sampling parameters plays a key role in shaping the behaviour and consistency of generated responses.65 Documenting these settings is essential for reproducibility, transparency, and accountability.
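The sketch below shows one way such settings could be recorded as a versioned, hashable configuration; the field names, model identifier, and parameter values are assumptions for illustration only.

```python
# Illustrative sketch: recording prompt and sampling configuration alongside a
# version identifier so generated outputs can be traced back to the exact setup.
# Field names and values are hypothetical placeholders.
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class PromptConfig:
    prompt_template: str
    model_name: str
    temperature: float
    top_k: int
    max_tokens: int
    version: str

config = PromptConfig(
    prompt_template="Summarise the following case file in neutral language:\n{document}",
    model_name="example-llm-v1",   # placeholder model identifier
    temperature=0.2,               # low temperature for more deterministic output
    top_k=40,
    max_tokens=512,
    version="2025-01-15-rev3",
)

# A content hash gives a stable reference for audit logs and version control.
config_hash = hashlib.sha256(json.dumps(asdict(config), sort_keys=True).encode()).hexdigest()
print(config_hash[:12], asdict(config))
```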

Supervised machine learning approaches are typically evaluated by measuring performance on held-out test data. For generative and general-purpose AI systems, additional measures are needed. A key challenge in developing and auditing these systems is the risk of hallucinations, which occur when a model generates outputs that sound plausible but are factually incorrect, misleading, or nonsensical. This risk is especially high for LLMs, which can generate authoritative-sounding responses that are not grounded in reliable data. Verifying outputs against trusted sources is essential to detect inaccuracies and prevent misinformation, helping maintain confidence in the system. Possible measures for evaluating modern AI systems include the following (a simple grounding check is sketched after this list):

  • Hallucination rates and factual accuracy.
  • Robustness to different prompts or inputs.
  • Safety and toxicity metrics.
  • Adaptation to specific domains.
  • Efficiency in handling context and data.
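A deliberately crude grounding check is sketched below; the reference facts, the generate_answer placeholder, and the substring matching rule are all hypothetical, and real evaluations typically combine curated benchmarks with human review.

```python
# Illustrative sketch: a simple hallucination check that verifies whether
# expected facts appear in generated answers. The data and the matching rule
# are hypothetical; generate_answer stands in for the model under audit.
reference_facts = {
    "What year was the agency founded?": "1987",
    "Who is the current director?": "A. Example",
}

def generate_answer(question: str) -> str:
    # Placeholder for a call to the model under audit.
    return "The agency was founded in 1987." if "founded" in question else "B. Example leads the agency."

checked = 0
hallucinated = 0
for question, expected in reference_facts.items():
    answer = generate_answer(question)
    checked += 1
    if expected not in answer:   # naive grounding check against a trusted source
        hallucinated += 1

print(f"Hallucination rate: {hallucinated / checked:.0%} over {checked} questions")
```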

No single metric can capture overall model quality, and evaluation often requires balancing competing priorities, such as accuracy versus fairness. Auditors should expect organisations to define and justify their chosen performance metrics.

Reproducibility – the ability to recreate results using the same data, parameters, and environment – remains a cornerstone of trustworthy AI system development. Auditors typically begin with documentation reviews to establish an audit baseline. If documentation is insufficient, unclear, or contradictory, reproducing the model and/or its predictions may be necessary. Where full reproduction is not feasible (for example, with large or proprietary models), thorough documentation of prompts, retrieval contexts, and configuration settings is important to support auditability.
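A minimal sketch of such reproducibility support is shown below, assuming a Python environment with NumPy; in practice the record would be extended with dataset hashes, code commit identifiers, and hardware details.

```python
# Illustrative sketch: pinning random seeds and recording the environment so a
# training run can be repeated and compared against documentation.
import json
import platform
import random
import sys

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# For deep-learning frameworks, seed those libraries as well (e.g. torch.manual_seed).

run_record = {
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    # Add dataset version/hash, model parameters, and code commit hash here.
}
with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```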

4.3.1.1 Risks to consider

  • Errors in data preparation or feature engineering that introduce bias or reduce model reliability (see also Section 4.2).
  • Overestimation of model performance, particularly if testing is inadequate or not representative.
  • Output instability and non-determinism, including sensitivity to prompt/context changes in generative systems.
  • Claims about a foundation model’s capabilities that are often unverifiable with limited resources and without access to model weights.

4.3.1.2 Expected controls

Effective documentation and version control are essential for transparency, traceability, and auditability. Auditors should expect to find:

  • Clear records of model selection, adaptation, and version history.
  • Performance evaluations using both general benchmarks and application-specific tests.
  • Evidence of reproducibility, including data, parameters, and environment details.
  • Integrated risk assessments covering operational and societal impacts.
  • For generative AI and foundation models, documentation of prompt design and versioning, parameter settings, and details of any external APIs or pre-trained models used.
  • Model selection: clearly document the criteria and tests used for model comparison, or provide a rationale if no comparison was made.
  • Version control: maintain robust version control for data, model artefacts, prompts, prediction outputs, and pipeline code. Include details of prompt design, versioning, and sampling configurations.
  • Comprehensive records: include information on model architecture, training data, known limitations, performance metrics, development processes, and deployment decisions.
  • Generative AI and foundation models: document the methods used to detect and mitigate hallucinations, and ensure these are proportionate to the risks associated with the system’s intended use.
  • Feature engineering: describe how variables are generated and selected, with attention to features that could act as proxies for sensitive attributes, to ensure compliance with data protection standards and prevent bias.

4.3.2 Quality assurance, reliability and robustness in development

Quality assurance in development brings together code quality, model quality, and best practices to ensure reliability and robustness before deployment.

  • Reliability refers to the model’s ability to perform consistently and accurately under expected conditions, particularly when handling data similar to that used in training.
  • Robustness is the model’s ability to maintain performance when faced with unexpected inputs – such as noise, changes in the environment, or ‘out-of-distribution’ data drawn from a different distribution than the training data. A simple perturbation check is sketched below.
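The following is a minimal robustness probe, assuming a simple scikit-learn classifier and Gaussian noise as the perturbation; real robustness testing would also cover out-of-distribution and adversarial inputs.

```python
# Illustrative sketch: comparing accuracy on clean versus noise-perturbed test
# data as a basic robustness probe. The dataset, model, and noise level are
# placeholders for the system under audit.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

rng = np.random.default_rng(0)
X_noisy = X_test + rng.normal(scale=2.0, size=X_test.shape)  # assumed noise level

print("clean accuracy:", round(model.score(X_test, y_test), 3))
print("noisy accuracy:", round(model.score(X_noisy, y_test), 3))
```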

These considerations are particularly relevant for foundation and generative models, where the distinction between training data and operational data is often blurred. Many modern computer vision and language models are created by fine-tuning a foundation model that was originally trained on a separate, much larger dataset. As a result, the data used during fine-tuning is only part of what shapes the final model, making it difficult to determine exactly what constitutes “out-of-distribution” data. Evidence suggests that models adapted from foundation models tend to generalise better to new data than those trained from scratch using traditional supervised learning. However, the reasons for this improved generalisation are still being explored.

Techniques such as RAG can improve reliability by grounding model outputs in information retrieved from external sources, helping to address issues such as hallucinations or outdated knowledge. Advances in model architecture, such as longer context windows that enable models to consider more surrounding information when generating responses, support more reliable, accurate, and contextually appropriate outputs.
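A minimal RAG sketch is shown below, assuming a toy corpus, TF-IDF retrieval, and a placeholder call_llm function; production systems typically use embedding models and vector databases instead.

```python
# Illustrative sketch of retrieval-augmented generation (RAG): retrieve the most
# relevant passages from a trusted source and include them in the prompt.
# The corpus, retrieval method, and call_llm function are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The agency was founded in 1987 and reports to parliament.",
    "Annual audits are published every October.",
    "The current director took office in 2021.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def call_llm(prompt: str) -> str:
    return "<model output>"  # placeholder for the model call

question = "When was the agency founded?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
answer = call_llm(prompt)
```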

Auditors should expect to see evidence that reliability and robustness have been considered from the outset, with appropriate controls in place to manage risks.

4.3.2.1 Risks to consider

  • Poor code quality or lack of testing, leading to unstable or unreliable outputs.
  • Vulnerabilities arising from external dependencies, such as vendor updates or changes to third-party models.
  • Risks of misinformation, including hallucinations or the perpetuation of biases.
  • Minimally altered (adversarially perturbed) inputs leading to erroneous and unreliable model behaviour.

4.3.2.2 Expected controls

  • Output consistency: test for consistent outputs by running the same prompt multiple times, especially for generative models (see the sketch after this list). Assess reproducibility using subsets of training data or independent (possibly synthetic) data. If overfitting or unnecessary features are suspected, retrain the model with fewer features and compare results.
  • Personal data sensitivity: for models using personal data, retrain with reduced or no personal data to assess the trade-off between performance and privacy.
  • Prompt and context tracking: record variations in prompts, supplied context, and RAG retrievals to support targeted reproduction when full model replication is not possible.
  • External correlation analysis: if code or models are inaccessible, use available data to analyse correlations between features and predictions, inferring model behaviour. Possible methods include LIME or SHAP.
  • Demonstration of variants: ask the auditee to demonstrate retraining or simplified workflows to understand model behaviour under different conditions.
  • Robust machine learning operations (MLOps), including version control, unit tests, code reviews, integration tests, and end-to-end testing.
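A minimal consistency check is sketched below; the generate function is a placeholder for the (possibly non-deterministic) model under audit, and exact-match agreement is a deliberately simple measure.

```python
# Illustrative sketch: running the same prompt several times and measuring how
# often the outputs agree. Exact-match counting is a crude but auditable metric.
from collections import Counter

def generate(prompt: str) -> str:
    return "<model output>"  # placeholder for a (possibly non-deterministic) model call

prompt = "Classify this complaint as 'urgent' or 'routine': ..."
outputs = [generate(prompt) for _ in range(10)]

counts = Counter(outputs)
most_common_output, freq = counts.most_common(1)[0]
print(f"Most frequent output occurred in {freq}/{len(outputs)} runs")
print("Distinct outputs:", len(counts))
```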

  65. Parameters such as “temperature” or “top-k” control the diversity and randomness of generated responses. For example, a higher temperature increases variability, while lower values produce more deterministic results.