4.2 Data
AI systems are only as effective as the data on which they are built. While traditional data quality dimensions – such as completeness, validity, and uniqueness – remain important, AI introduces additional challenges. These include the quality, quantity, and representativeness of training data, as well as risks related to bias, overfitting, and the need for ongoing monitoring and maintenance. This section outlines the main data-related risks and controls relevant to auditing AI systems.
4.2.1 Data quantity and quality
The reliability of an AI system depends on the quality and sufficiency of its training data. Inadequate or biased data can result in models that produce unfair or inaccurate outcomes. For example, if training data does not reflect the real-world population, the model may systematically favour certain groups over others.
Other common data quality issues in AI systems include inaccuracy or incompleteness, where errors, omissions, or outdated information can undermine model performance. Inconsistency may arise when data from multiple sources is combined, leading to mismatches in format or meaning.
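As a simple illustration, the sketch below shows the kind of basic completeness, uniqueness, and validity checks that can support such an assessment. The file name and the column names (customer_id, birth_date, income) are hypothetical.

```python
# Illustrative data quality checks on a hypothetical extract of training data.
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Completeness: share of missing values per column
missing_share = df.isna().mean().sort_values(ascending=False)

# Uniqueness: duplicated records and duplicated identifiers
duplicate_rows = df.duplicated().sum()
duplicate_ids = df["customer_id"].duplicated().sum()

# Validity: implausible values and unparseable dates
negative_income = (df["income"] < 0).sum()
invalid_dates = pd.to_datetime(df["birth_date"], errors="coerce").isna().sum()

print(missing_share.head())
print(duplicate_rows, duplicate_ids, negative_income, invalid_dates)
```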
For externally developed or purchased AI systems, organisations may have limited access to the training data, making it difficult to assess its suitability for their context.
Insufficient data presents a significant challenge in developing AI systems. The amount of data required increases with the complexity of the problem the system is intended to address. When the available data is too limited relative to the complexity of the model, there is a risk of overfitting, where the model learns the training data – including its noise – too closely and fails to generalise to new situations. Conversely, if the model is too simple for the task, it may underfit, missing important patterns and failing to capture the underlying structure of the data.
In supervised learning, sufficient data is needed for training, validation and testing. Insufficient data at any of these stages leads to unreliable performance estimates, and reusing the same data for both training and testing inflates performance metrics. Techniques such as increasing the number of samples from less common categories (upsampling) or reducing the number of samples from more common categories (downsampling) may be used to help maintain balance and data representativeness. A further risk is target leakage, where training data inadvertently includes information that is unavailable at prediction time, leading to misleadingly high performance during development but poor real-world results. Such leakage often stems from variables that indirectly encode the outcome.
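The sketch below illustrates one way these points could be handled in practice: a stratified split into training, validation, and test sets, followed by upsampling of the minority class in the training partition only, so that the held-out data keeps its real class balance. It uses scikit-learn with generated placeholder data; the 60/20/20 split is an assumption.

```python
# Minimal sketch: stratified train/validation/test split with upsampling applied
# only to the training partition. Placeholder data generated for illustration.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_arr, y_arr = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X = pd.DataFrame(X_arr)
y = pd.Series(y_arr, name="label")

# Hold out validation and test data before any resampling (60/20/20 split)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Upsample the minority class in the training data only, so the validation and
# test sets still reflect the real class balance
train = pd.concat([X_train, y_train], axis=1)
minority = train[train["label"] == 1]
majority = train[train["label"] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
```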
AI models are also vulnerable to risks such as data poisoning, where malicious actors introduce corrupted data into the training set to undermine model integrity and reliability.
Data infrastructure also matters. Data silos, where information is fragmented across systems, can hinder integration and analysis. As data volumes grow, scalable infrastructure is needed to support development and ongoing operations. Security and privacy considerations are paramount, especially when handling sensitive or personal data.
4.2.1.1 Risks to consider
- Use of unreliable, biased, or non-representative data.
- Introduction of bias during data transformation.
- Inadequate separation of training and testing data.
- Poor generalisability to new data.
- Data poisoning or adversarial manipulation.
- Target leakage.
- Lack of transparency about the representativeness of training data in externally developed models.
4.2.1.2 Expected controls
- Assessing the provenance of training and fine-tuning datasets, especially when using web-scraped or synthetic data, to ensure compliance with intellectual property and data protection requirements.
- Ensuring that data splits for training, validation, and testing maintain representativeness and are handled appropriately.
- Comparing data quality and model performance across training, test and validation data.
- Defining key population characteristics and testing whether they are adequately represented in training, test and validation data (a sketch of such a test follows this list).
- Regularly reviewing and updating stored training data to prevent obsolescence.
- For externally developed or purchased systems, obtaining sufficient information about the training data from the developer, verifying performance with local data, and testing for potential biases.
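The following sketch shows one possible form a representativeness test could take: comparing the distribution of a key characteristic in the training data against reference shares for the population the system will serve. The column name (age_band), the reference shares, and the file name are hypothetical.

```python
# Hypothetical representativeness check: compare the distribution of a key
# characteristic in the training data against reference population shares.
import pandas as pd
from scipy.stats import chisquare

population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}  # assumed reference

train = pd.read_csv("training_data.csv")  # hypothetical extract
observed = train["age_band"].value_counts().reindex(population_shares.keys(), fill_value=0)
expected = [share * observed.sum() for share in population_shares.values()]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3f}")
# A very small p-value indicates that the training data deviates from the
# reference population and warrants follow-up with the developer.
```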
4.2.2 Large language models and retrieval-augmented generation
Retrieval-augmented generation (RAG) is an approach that enables LLMs to draw on trusted external sources of information when generating responses. By integrating relevant documents or data into the model’s context, RAG can help address persistent challenges in generative AI, such as hallucinations, outdated information, and the absence of source citations. However, this technique introduces new risks related to the quality and reliability of the retrieved data, as well as the construction and maintenance of the underlying knowledge base.64
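As a simplified illustration of the mechanism, the sketch below retrieves the most relevant document from a small knowledge base and places it in the model's prompt. TF-IDF similarity stands in for a production embedding model, and the knowledge base, query, and the final call to the deployed LLM are hypothetical placeholders.

```python
# Simplified RAG retrieval step: find the most relevant document for a query and
# insert it into the prompt. TF-IDF stands in for a production embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Policy A was updated in 2023 and applies to all new applications.",
    "Policy B sets retention periods for closed case files.",
    "Archived material older than ten years is held by the records office.",
]
query = "Which policy applies to new applications?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)
scores = cosine_similarity(vectorizer.transform([query]), doc_vectors).ravel()

best_doc = knowledge_base[scores.argmax()]
prompt = f"Answer using only the source below and cite it.\n\nSource: {best_doc}\n\nQuestion: {query}"
# answer = call_llm(prompt)  # hypothetical call to the deployed language model
```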
Information about the data used to train foundation models developed by external organisations is often not available. Data collected automatically from open sources (for example, by web scraping) can be vulnerable to data poisoning. If this type of data is used for training, fine-tuning, or in live AI systems (for example, AI agents that use web search), auditors should expect to see evidence that the risk of data poisoning has been considered and managed. This risk is also relevant for retrieval-augmented generation, where the knowledge base is built from automatically collected data.
Testing and evaluating the outputs of LLMs, including those produced by RAG pipelines, often requires large volumes of example data that reflect the range of possible user queries and expected answers. In practice, manually assembling such datasets is rarely feasible, particularly for complex or specialised applications. An alternative is to use synthetic data – data that is artificially generated rather than collected from real-world interactions – to create diverse and scalable test sets for training, fine-tuning, or validating AI models. Despite its advantages, synthetic data must be carefully validated to ensure it does not introduce new biases or degrade model performance. The quality of synthetic datasets depends on how well they reflect the intended use cases and the diversity of real-world scenarios.
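The sketch below illustrates two basic checks that could be applied to LLM-generated test queries before they are used for evaluation: detecting near-duplicates, since low diversity inflates evaluation scores, and detecting overlap with training or fine-tuning data, a simple form of leakage check. The example queries and the similarity threshold are illustrative assumptions.

```python
# Illustrative checks on synthetic test queries: near-duplicate detection and
# overlap with training data. Queries and the 0.9 threshold are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

synthetic_queries = [
    "How do I appeal a benefits decision?",
    "How can I appeal a decision about my benefits?",
    "What documents are needed to register a company?",
]
training_texts = ["How do I appeal a benefits decision?"]  # hypothetical sample

vec = TfidfVectorizer().fit(synthetic_queries + training_texts)
S = vec.transform(synthetic_queries)

# (1) Pairs of synthetic queries that are near-identical (low diversity)
pairwise = cosine_similarity(S)
np.fill_diagonal(pairwise, 0.0)
near_duplicate_pairs = np.argwhere(pairwise > 0.9)

# (2) Synthetic queries that closely match items in the training data (leakage)
overlap = cosine_similarity(S, vec.transform(training_texts)).max(axis=1)
leaked = [q for q, s in zip(synthetic_queries, overlap) if s > 0.9]
```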
4.2.2.1 Risks to consider
- Poisoned data included in externally developed models, in a RAG knowledge base or at query time.
- Unreliable evaluation of the quality of model outputs.
- Amplification of biases or other performance issues when using synthetic LLM-generated test data.
4.2.2.2 Expected controls
- Use of external models from trusted sources, where possible with a description of the training data.
- Validation of data quality in automatically collected input data.
- Validating synthetic data to ensure it does not reinforce existing biases, introduce new ones, or degrade model performance.
- Human verification of LLM-generated or synthetic data, where feasible.
4.2.3 Privacy issues in the context of AI systems
The use of personal data in AI is subject to strict legal requirements, including purpose limitation, data minimisation, proportionality, and transparency. We discuss jurisdiction-specific requirements in Section 3.3.
Responsibility for compliance lies with the organisation developing or deploying the AI system. Auditors should review documentation, such as data protection impact assessments (DPIA), to confirm that data protection obligations have been met. Attention should be paid to all data used during development, not just the final model inputs. For example, features tested but not ultimately used may still have involved processing personal data.
Where personal data is used, its significance for model performance should be assessed. By determining how much personal data contributes to the accuracy or effectiveness of the AI system, auditors and organisations can evaluate whether its use is truly necessary and proportionate to the intended purpose. Where there is doubt or suspicion of non-compliance with legal requirements, cases could be referred to the appropriate data protection authorities.
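One way such an assessment could be made concrete is an ablation test: evaluating the model with and without the personal-data features and comparing performance, as sketched below. The file name, feature names, model choice, and metric are illustrative assumptions.

```python
# Illustrative ablation test of how much personal-data features contribute to
# model performance. Dataset, features, model, and metric are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("model_data.csv")            # hypothetical development dataset
personal_features = ["age", "postcode"]        # hypothetical personal-data columns

y = df["outcome"]
X_full = df.drop(columns=["outcome"])
X_reduced = X_full.drop(columns=personal_features)

model = RandomForestClassifier(random_state=0)
auc_full = cross_val_score(model, X_full, y, cv=5, scoring="roc_auc").mean()
auc_reduced = cross_val_score(model, X_reduced, y, cv=5, scoring="roc_auc").mean()

print(f"AUC with personal data: {auc_full:.3f}; without: {auc_reduced:.3f}")
# A negligible difference suggests the personal data adds little predictive value,
# which weakens the case that processing it is necessary and proportionate.
```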
4.2.3.1 Risks to consider
- Breaches of data protection regulations (see Appendix 2).