Data quality is central to the performance of ML models. As with any other quantitative modelling, when data is not representative, outcomes tend to be biased. The model may perform well for characteristics that are well represented in the data but underperform for those that are underrepresented (for example, a facial recognition algorithm trained on pictures of predominantly one ethnicity will perform well for phenotypes that are sufficiently represented and badly for phenotypes that are underrepresented). Hence, scrutinising data quality for issues that cause problems in regular (statistical) modelling remains important. Examples of these central issues for data analysis are data reliability, population representativeness of the data and disclosure of personal data.
However, some data issues are specific to ML modelling. One of the better-known issues is the lack of separation between training and testing/validation data. When part of the data is used both for shaping the model (during the training phase) and for verifying its performance (during the testing or validation phase), the performance metrics of the model will be inflated. The inflated metrics conceal ‘overfitting’, where the model fits the training data too closely, and this leads to a loss of performance on new data, such as in production. ML models can also underperform when the historical data used to train or retrain the model is no longer representative. Data that is collected and used for training or retraining can also be ‘poisoned’. Poisoning occurs when an agent other than the developers of the ML model introduces corrupted or manipulated data into the training set. Several ways to achieve this come to mind for ML models that use user input for retraining. The intended effect can be to introduce bias by systematically influencing training data, or to make the ML model underperform by introducing ‘noise’ into the training data.
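As an illustration of the first issue, the separation between training and testing/validation data can be screened by looking for identical records in both sets. The following is a minimal sketch, not a prescribed audit procedure: it assumes records are available as Python dictionaries, and the names `row_fingerprint` and `find_overlap` as well as the sample data are hypothetical.

```python
import hashlib


def row_fingerprint(row):
    """Return a stable hash of one record (a dict of column name -> value).

    Columns are sorted so that the key order does not affect the hash.
    Assumption: values have a stable string form and do not contain '|'.
    """
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_overlap(train_rows, test_rows):
    """Return the test records that also occur in the training data."""
    train_hashes = {row_fingerprint(r) for r in train_rows}
    return [r for r in test_rows if row_fingerprint(r) in train_hashes]


# Hypothetical sample data for illustration only.
train = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
test = [{"id": 2, "age": 51}, {"id": 3, "age": 27}]

overlap = find_overlap(train, test)
print(f"{len(overlap)} of {len(test)} test records also appear in the training data")
```

An exact-match check of this kind only detects verbatim duplicates. Near-duplicates, or records of the same individual at different points in time, would require a comparison on identifying keys instead.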
A final important new data issue that comes with the use of ML models is leakage, or target leakage. This occurs when the training data contains information that is not available to the ML model during prediction. This information is usually contained in the variables that are used in the model; we will address these in the corresponding section.
In summary, the main data-related risks are:
- Insufficient separation between training and testing/validation data
- Mistreatment of personal data (for example, purpose limitation violation, lack of control over access and timely deletion, lack of transparency)
- Biased, unreliable or unrepresentative training data
- Poisoning: adversarial introduction of bad-quality data
- Target leakage
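Target leakage, the last risk listed, can be screened for by comparing the variables in the training data with the variables that are actually available when the model makes a prediction. The sketch below is only a first filter under stated assumptions: the function name `leakage_candidates` and all variable names are hypothetical, and the list of variables available at prediction time has to come from the auditee's documentation.

```python
def leakage_candidates(training_features, prediction_time_features, target):
    """List training columns that would not exist when the model predicts.

    The target itself is excluded, since it is expected to be present
    in the training data only.
    """
    available = set(prediction_time_features)
    return sorted(
        feature
        for feature in training_features
        if feature != target and feature not in available
    )


# Hypothetical example: a model predicting whether a loan is repaid.
training_features = ["income", "age", "loan_repaid", "days_overdue"]
prediction_time_features = ["income", "age"]

suspects = leakage_candidates(
    training_features, prediction_time_features, target="loan_repaid"
)
print(suspects)  # ['days_overdue']
```

A name-based comparison does not catch variables that are available at prediction time but were recorded after the outcome in the historical data, so flagged and unflagged variables alike may need a closer look at how and when they were collected.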
3.2.1 Personal data and GDPR in the context of ML and AI
In most countries, special rules apply to the use of personal data. While the definition of ‘personal data’ and the accompanying laws are country-specific, this white paper refers to the EU’s GDPR, as it applies in many countries either directly (in EU member states), via the European Economic Area agreement, or due to the processing of EU citizens’ data.
National data protection authorities are working on GDPR interpretations and guidance for practitioners, and the Norwegian data protection authority (Datatilsynet) has summarised the most important challenges around the use of personal data in ML algorithms in a dedicated report. Relevant considerations are summarised in the appendix ‘Personal data and GDPR in the context of ML and AI’. The main risks identified are related to purpose limitation, data minimisation, proportionality and transparency.
The responsibility to guarantee compliance with data protection laws lies with the authority that develops and uses the ML algorithm. For auditors, a review of the respective documentation should suffice (in particular the data protection impact assessment, where appropriate). Where a violation of data protection laws is suspected, the case could be forwarded to the data protection authorities.
Auditors should pay attention to data used in different development steps. Their considerations should also include data that is not used in the final model but that was nonetheless considered and tested during the model development phase. For example, a feature importance test could also indicate that personal data was used.
3.2.2 Risk assessment: Data
|Product owner||User||Helpdesk||Chief information officer||Project leader||Data analyst||Data engineer||Developer||Controller||IT security||Data protection official||Budget holder||Helper tool reference|
|Overall responsibility: Data||x||x||x||x||A2, A3|
|Data acquisition method||x||x||x||A2|
|Group representation and potential bias (raw data)||x||x||x||x||A2|
|Data quality (raw data)||x||A2.004|
|List of variables used||x||x||A3|
|Personal data and data protection||x||x||x||A2.008|
3.2.3 Possible audit tests: Data
- Verify that training data and test/validation data are used and stored separately.
- Define important population characteristics and test whether they are sufficiently represented in the data.
- Verify that stored training data is not outdated.
- Verify compliance with GDPR.
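The second audit test in the list above (representation of important population characteristics) can be sketched as a comparison between group shares in the data and reference shares for the population. This is an illustrative sketch only: the function name `representation_gaps`, the 5 percentage point tolerance and the sample figures are assumptions, and the reference shares would have to come from a trusted source such as official statistics.

```python
from collections import Counter


def representation_gaps(records, attribute, reference_shares, tolerance=0.05):
    """Return groups whose share in the data falls short of the reference.

    records          : list of dicts, one per observation
    attribute        : the characteristic to check, e.g. "region"
    reference_shares : dict of group -> expected share in the population
    tolerance        : accepted shortfall (assumed value, to be set by the auditor)
    """
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to analyse")

    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if expected - observed > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical data: 8 records from the north, 2 from the south.
records = [{"region": "north"}] * 8 + [{"region": "south"}] * 2
reference = {"north": 0.5, "south": 0.5}

print(representation_gaps(records, "region", reference))
```

Here the group "south" is flagged because its observed share (0.2) is well below the reference share (0.5). For small samples, a statistical test would be more appropriate than a fixed tolerance.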
European Parliament and Council of the European Union (2016): Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), https://eur-lex.europa.eu/eli/reg/2016/679/oj.