4.2 Data
AI systems are only as effective as the data on which they are built. While traditional data quality dimensions – such as completeness, validity, and uniqueness – remain important, AI introduces additional challenges. These include the quality, quantity, and representativeness of training data, as well as risks related to bias, overfitting, and the need for ongoing monitoring and maintenance. This section outlines the main data-related risks and controls relevant to auditing AI systems.
4.2.1 Data quantity and quality
The reliability of an AI system depends on the quality and sufficiency of its training data. Inadequate or biased data can result in models that produce unfair or inaccurate outcomes. For example, if training data does not reflect the real-world population, the model may systematically favour certain groups over others.
Other common data quality issues in AI systems include inaccuracy or incompleteness, where errors, omissions, or outdated information can undermine model performance. Inconsistency may arise when data from multiple sources is combined, leading to mismatches in format or meaning.
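As a simple illustration, the sketch below shows the kind of basic completeness, uniqueness, and validity checks that can support such an assessment. The file name and the column names (customer_id, birth_date, income) are hypothetical.

```python
# Illustrative data quality checks on a hypothetical extract of training data.
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Completeness: share of missing values per column
missing_share = df.isna().mean().sort_values(ascending=False)

# Uniqueness: duplicated records and duplicated identifiers
duplicate_rows = df.duplicated().sum()
duplicate_ids = df["customer_id"].duplicated().sum()

# Validity: implausible values and unparseable dates
negative_income = (df["income"] < 0).sum()
invalid_dates = pd.to_datetime(df["birth_date"], errors="coerce").isna().sum()

print(missing_share.head())
print(duplicate_rows, duplicate_ids, negative_income, invalid_dates)
```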
For externally developed or purchased AI systems, organisations may have limited access to the training data, making it difficult to assess its suitability for their context.
Insufficient data presents a significant challenge in developing AI systems. The amount of data required increases with the complexity of the problem the system is intended to address. When the available data is too limited relative to the complexity of the model, there is a risk of overfitting, where the model learns the training data – including its noise – too closely and fails to generalise to new situations. Conversely, if the model is too simple for the task, it may underfit, missing important patterns and failing to capture the underlying structure of the data.
In supervised learning, sufficient data is needed for training, validation and testing. Insufficient data at any of these stages leads to unreliable performance estimates, and reusing the same data for both training and testing inflates performance metrics. Techniques such as increasing the number of samples from less common categories (upsampling) or reducing the number of samples from more common categories (downsampling) may be used to help maintain balance and data representativeness. A further risk is target leakage, where training data inadvertently includes information that is unavailable at prediction time, leading to misleadingly high performance during development but poor real-world results. Such leakage often stems from variables that indirectly encode the outcome.
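The sketch below illustrates one way these points could be handled in practice: a stratified split into training, validation, and test sets, followed by upsampling of the minority class in the training partition only, so that the held-out data keeps its real class balance. It uses scikit-learn with generated placeholder data; the 60/20/20 split is an assumption.

```python
# Minimal sketch: stratified train/validation/test split with upsampling applied
# only to the training partition. Placeholder data generated for illustration.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_arr, y_arr = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X = pd.DataFrame(X_arr)
y = pd.Series(y_arr, name="label")

# Hold out validation and test data before any resampling (60/20/20 split)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Upsample the minority class in the training data only, so the validation and
# test sets still reflect the real class balance
train = pd.concat([X_train, y_train], axis=1)
minority = train[train["label"] == 1]
majority = train[train["label"] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
```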
AI models are also vulnerable to risks such as data poisoning, where malicious actors introduce corrupted data into the training set to undermine model integrity and reliability.
Data infrastructure also matters. Data silos, where information is fragmented across systems, can hinder integration and analysis. As data volumes grow, scalable infrastructure is needed to support development and ongoing operations. Security and privacy considerations are paramount, especially when handling sensitive or personal data.
4.2.1.1 Risks to consider
- Use of unreliable, biased, or non-representative data.
- Introduction of bias during data transformation.
- Inadequate separation of training and testing data.
- Poor generalisability to new data.
- Data poisoning or adversarial manipulation.
- Target leakage.
- Lack of transparency about the representativeness of training data in externally developed models.
4.2.1.2 Expected controls
- Assessing the provenance of training and fine-tuning datasets, especially when using web-scraped or synthetic data, to ensure compliance with intellectual property and data protection requirements.
- Ensuring that data splits for training, validation, and testing maintain representativeness and are handled appropriately.
- Comparing data quality and model performance across training, test and validation data.
- Defining key population characteristics and testing whether they are adequately represented in training, test and validation data (a sketch of such a test follows this list).
- Regularly reviewing and updating stored training data to prevent obsolescence.
- For externally developed or purchased systems, obtaining sufficient information about the training data from the developer, verifying performance with local data, and testing for potential biases.
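The following sketch shows one possible form a representativeness test could take: comparing the distribution of a key characteristic in the training data against reference shares for the population the system will serve. The column name (age_band), the reference shares, and the file name are hypothetical.

```python
# Hypothetical representativeness check: compare the distribution of a key
# characteristic in the training data against reference population shares.
import pandas as pd
from scipy.stats import chisquare

population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}  # assumed reference

train = pd.read_csv("training_data.csv")  # hypothetical extract
observed = train["age_band"].value_counts().reindex(population_shares.keys(), fill_value=0)
expected = [share * observed.sum() for share in population_shares.values()]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3f}")
# A very small p-value indicates that the training data deviates from the
# reference population and warrants follow-up with the developer.
```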
4.2.2 Large language models and retrieval-augmented generation
Retrieval-augmented generation (RAG) is an approach that enables LLMs to draw on trusted external sources of information when generating responses. By integrating relevant documents or data into the model’s context, RAG can help address persistent challenges in generative AI, such as hallucinations, outdated information, and the absence of source citations. However, this technique introduces new risks related to the quality and reliability of the retrieved data, as well as the construction and maintenance of the underlying knowledge base.64
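As a simplified illustration of the mechanism, the sketch below retrieves the most relevant document from a small knowledge base and places it in the model's prompt. TF-IDF similarity stands in for a production embedding model, and the knowledge base, query, and the final call to the deployed LLM are hypothetical placeholders.

```python
# Simplified RAG retrieval step: find the most relevant document for a query and
# insert it into the prompt. TF-IDF stands in for a production embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Policy A was updated in 2023 and applies to all new applications.",
    "Policy B sets retention periods for closed case files.",
    "Archived material older than ten years is held by the records office.",
]
query = "Which policy applies to new applications?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)
scores = cosine_similarity(vectorizer.transform([query]), doc_vectors).ravel()

best_doc = knowledge_base[scores.argmax()]
prompt = f"Answer using only the source below and cite it.\n\nSource: {best_doc}\n\nQuestion: {query}"
# answer = call_llm(prompt)  # hypothetical call to the deployed language model
```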
Information about the data used to train foundation models developed by external organisations is often not available. Data collected automatically from open sources (for example, by web scraping) can be vulnerable to data poisoning. If this type of data is used for training, fine-tuning, or in live AI systems (for example, AI agents that use web search), auditors should expect to see evidence that the risk of data poisoning has been considered and managed. This risk is also relevant for retrieval-augmented generation, where the knowledge base is built from automatically collected data.
Testing and evaluating the outputs of LLMs, including those produced by RAG pipelines, often requires large volumes of example data that reflect the range of possible user queries and expected answers. In practice, manually assembling such datasets is rarely feasible, particularly for complex or specialised applications. An alternative is to use synthetic data – data that is artificially generated rather than collected from real-world interactions – to create diverse and scalable test sets for training, fine-tuning, or validating AI models. Despite its advantages, synthetic data must be carefully validated to ensure it does not introduce new biases or degrade model performance. The quality of synthetic datasets depends on how well they reflect the intended use cases and the diversity of real-world scenarios.
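The sketch below illustrates two basic checks that could be applied to LLM-generated test queries before they are used for evaluation: detecting near-duplicates, since low diversity inflates evaluation scores, and detecting overlap with training or fine-tuning data, a simple form of leakage check. The example queries and the similarity threshold are illustrative assumptions.

```python
# Illustrative checks on synthetic test queries: near-duplicate detection and
# overlap with training data. Queries and the 0.9 threshold are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

synthetic_queries = [
    "How do I appeal a benefits decision?",
    "How can I appeal a decision about my benefits?",
    "What documents are needed to register a company?",
]
training_texts = ["How do I appeal a benefits decision?"]  # hypothetical sample

vec = TfidfVectorizer().fit(synthetic_queries + training_texts)
S = vec.transform(synthetic_queries)

# (1) Pairs of synthetic queries that are near-identical (low diversity)
pairwise = cosine_similarity(S)
np.fill_diagonal(pairwise, 0.0)
near_duplicate_pairs = np.argwhere(pairwise > 0.9)

# (2) Synthetic queries that closely match items in the training data (leakage)
overlap = cosine_similarity(S, vec.transform(training_texts)).max(axis=1)
leaked = [q for q, s in zip(synthetic_queries, overlap) if s > 0.9]
```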
4.2.2.1 Risks to consider
- Poisoned data included in externally developed models, in a RAG knowledge base or at query time.
- Unreliable evaluation of the quality of model outputs.
- Amplification of biases or other performance issues when using synthetic LLM-generated test data.
4.2.2.2 Expected controls
- Use of external models from trusted sources, where possible with a description of the training data.
- Validation of data quality in automatically collected input data.
- Validating synthetic data to ensure it does not reinforce existing biases, introduce new ones, or degrade model performance.
- Human verification of LLM-generated or synthetic data, where feasible.
4.2.3 Privacy issues in the context of AI systems
The use of personal data in AI is subject to strict legal requirements, including purpose limitation, data minimisation, proportionality, and transparency. We discuss jurisdiction-specific requirements in Section 3.3.
Responsibility for compliance lies with the organisation developing or deploying the AI system. Auditors should review documentation, such as data protection impact assessments (DPIA), to confirm that data protection obligations have been met. Attention should be paid to all data used during development, not just the final model inputs. For example, features tested but not ultimately used may still have involved processing personal data.
Where personal data is used, its significance for model performance should be assessed. By determining how much personal data contributes to the accuracy or effectiveness of the AI system, auditors and organisations can evaluate whether its use is truly necessary and proportionate to the intended purpose. Where there is doubt or suspicion of non-compliance with legal requirements, cases could be referred to the appropriate data protection authorities.
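One way such an assessment could be made concrete is an ablation test: evaluating the model with and without the personal-data features and comparing performance, as sketched below. The file name, feature names, model choice, and metric are illustrative assumptions.

```python
# Illustrative ablation test of how much personal-data features contribute to
# model performance. Dataset, features, model, and metric are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("model_data.csv")            # hypothetical development dataset
personal_features = ["age", "postcode"]        # hypothetical personal-data columns

y = df["outcome"]
X_full = df.drop(columns=["outcome"])
X_reduced = X_full.drop(columns=personal_features)

model = RandomForestClassifier(random_state=0)
auc_full = cross_val_score(model, X_full, y, cv=5, scoring="roc_auc").mean()
auc_reduced = cross_val_score(model, X_reduced, y, cv=5, scoring="roc_auc").mean()

print(f"AUC with personal data: {auc_full:.3f}; without: {auc_reduced:.3f}")
# A negligible difference suggests the personal data adds little predictive value,
# which weakens the case that processing it is necessary and proportionate.
```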
4.2.3.1 Risks to consider
- Breaches of data protection regulations (see Appendix 2).