4.4 Evaluation of the model before deployment

Before an AI system is deployed, it must be evaluated to ensure it meets the standards set out in the project proposal and during model development. This stage is critical for confirming that the system is fit for purpose, operates as intended, and complies with legal and ethical requirements. Evaluation should be proportionate to the risks and context of use, and should be repeated if the system or its environment changes.

Using AI to support evaluation processes will become increasingly common as AI systems grow in size and complexity, making manual testing and assessment more challenging. One example is “LLM-as-a-judge”, where an LLM is used to evaluate and score the outputs of other AI systems. Auditors should ensure that any such additional use of AI has itself been sufficiently evaluated by the project owner.
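
A minimal sketch of the “LLM-as-a-judge” pattern is shown below. It assumes the OpenAI Python client; the model name, rubric, and scoring scale are illustrative placeholders and would need to be adapted to the system under evaluation.

    # Minimal "LLM-as-a-judge" sketch: one LLM scores another system's output
    # against a simple rubric. Assumes the OpenAI Python client; the model name
    # and rubric are illustrative, not a recommended configuration.
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment

    RUBRIC = (
        "Score the answer from 1 (poor) to 5 (excellent) for factual accuracy "
        "and relevance to the question. Reply with the number only."
    )

    def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
        """Ask the judge model to score one question/answer pair."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
            temperature=0,  # reduce (but not eliminate) variability between runs
        )
        return int(response.choices[0].message.content.strip())

    # Example: score a single output of the system under evaluation
    # print(judge("What is the capital of France?", "Paris"))

As with any other use of AI, the judge’s scores should themselves be validated, for example against human ratings on a sample of outputs.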

This chapter covers:

  • Performance and acceptance testing;
  • Transparency and explainability;
  • Fairness and minimising harm;
  • Security; and
  • Environmental impact.

The evaluation process ensures that the AI system meets the standards set out in governance (Section 4.1) and model development (Section 4.3), and that it is fit for deployment into production. The release and change plan following successful evaluation is addressed in Section 4.5. Ongoing monitoring and retraining after deployment are addressed in Section 4.6. If monitoring after deployment exposes issues with fairness, security, or accuracy, the relevant evaluation tests described here should be rerun.

4.4.1 Performance and acceptance testing

AI systems should be tested under realistic conditions before deployment. This includes technical validation of outputs and user testing in workflows that reflect actual use. Testing should be based on the business indicators and acceptance criteria defined at the start of the project.

Potential weaknesses and corrective actions should be documented. The results of these tests, along with evidence of meeting acceptance criteria, form the basis for the go/no-go deployment decision (see Section 4.5). This applies to both in-house developments and purchased systems.
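
As an illustration, a simple acceptance gate might compare metrics measured on representative test data against the pre-agreed criteria, as sketched below. The metric names and thresholds are hypothetical; in practice they would be taken from the project’s documented acceptance criteria.

    # Illustrative acceptance gate: compare metrics measured on representative
    # test data against acceptance criteria agreed at the start of the project.
    # The metric names and thresholds are hypothetical examples.
    from sklearn.metrics import accuracy_score, f1_score

    ACCEPTANCE_CRITERIA = {"accuracy": 0.90, "f1": 0.85}  # agreed in advance

    def acceptance_report(y_true, y_pred) -> dict:
        """Return each metric, its threshold, and whether the criterion is met."""
        measured = {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
        }
        return {
            name: {
                "measured": round(value, 3),
                "threshold": ACCEPTANCE_CRITERIA[name],
                "passed": value >= ACCEPTANCE_CRITERIA[name],
            }
            for name, value in measured.items()
        }

The documented report, together with user acceptance test results, feeds the go/no-go decision described in Section 4.5.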

4.4.1.1 Risks to consider

  • AI system does not meet business requirements.
  • Performance is worse on unseen data than the results observed during development.

4.4.1.2 Expected controls

  • Ensure that acceptance criteria are clear and agreed in advance.
  • Test the system with representative data.
  • Involve end-users in acceptance testing.
  • Document test results and decisions.

4.4.2 Transparency and explainability

Transparency and explainability are essential for building trust in AI systems and ensuring accountability. Transparency means providing clear information about how, when, and why an AI system is used. Explainability is the ability to describe how the system arrives at its outputs or decisions.

Assessing explainability can be challenging, especially for complex or third-party systems. Many AI models operate as “black boxes,” where the internal logic is not easily understood. Generative AI systems are often probabilistic, making it difficult to reproduce specific outputs or trace how a particular response was generated. Section 4.3 explains how correlations between features and predictions can be analysed to infer model behaviour.
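
For black-box models where only inputs and predictions are available, a model-agnostic technique such as permutation importance can indicate which features most influence the outputs. The sketch below uses scikit-learn and assumes a fitted model and a held-out test set in a pandas DataFrame; it is illustrative rather than a complete explainability assessment.

    # Model-agnostic sketch: permutation importance estimates how much each
    # feature influences a (possibly black-box) model's predictions by measuring
    # the performance drop when that feature's values are shuffled.
    # Assumes a fitted scikit-learn-compatible model and a held-out test set
    # (X_test as a pandas DataFrame, y_test as the true labels).
    from sklearn.inspection import permutation_importance

    result = permutation_importance(
        model, X_test, y_test,
        n_repeats=10,     # shuffle each feature several times for stable estimates
        random_state=0,
    )

    for name, mean, std in zip(X_test.columns,
                               result.importances_mean,
                               result.importances_std):
        print(f"{name}: {mean:.3f} +/- {std:.3f}")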

The level of explainability required depends on the system’s purpose and the audience of the explanation (such as a regulator or member of the public). Explainability should be proportionate to the risk presented by the AI system. A balance needs to be found between the resources required and the level of explanation that is necessary.

When systems are purchased externally, organisations should seek information from vendors about upstream inputs, such as model architecture, training data, fine-tuning processes, known limitations, and performance metrics. Model cards and data sheets are often used to summarise this information. Open-source foundation models may provide access to code, model weights, and responsible use guides.

4.4.2.1 Risks to consider

  • Inability to explain or justify decisions, limiting the organisation’s ability to judge model predictions and reliability.
  • Non-compliance with regulations that require transparent procedures and substantiation of decisions (common in regulations governing public administration).
  • Insufficient information for end-users to challenge or appeal outcomes.
  • Bias and discrimination may go undetected due to limited understanding of the data, model, and predictions.
  • “Fairwashing”, where post-hoc explanation techniques are manipulated to rationalise unfair outcomes.

4.4.2.2 Expected controls

  • Maintain documentation on system purpose, design, expected outcomes, and limitations. This may include data cards (dataset origins and biases), system cards (architecture, training data, limitations), and audit cards (regulatory requirements and risk quantifications).
  • Inform users when they are interacting with AI.
  • Provide mechanisms for users to challenge or appeal decisions, and ensure that arrangements exist to offer redress where harm occurs.
  • Justify the use of black-box models where appropriate.
  • Clarify legal responsibilities, including requirements for disclosing when content is synthetic.
  • Use solutions such as watermarking (disclosures placed on AI-generated content) and provenance data (records of content origin and history) to help identify AI-generated material.

4.4.3 Fairness and minimising harm

AI systems should be designed and tested to minimise harm and ensure fairness. Fairness means treating individuals and groups equitably, avoiding negative impacts based on characteristics such as gender, ethnicity, or location. Unfair outcomes can result in discrimination, financial loss, or other social disadvantages.

There are many ways to define and measure fairness. The right approach depends on the system’s purpose and context. Auditors should expect organisations to document and justify their choice of fairness metrics, and to show how these align with legal and policy requirements. Both individual and group fairness should be considered.

AI systems rarely operate in isolation. Where multiple models interact, it is important to assess how these interactions might create or amplify risks. Focusing only on individual model evaluations may underestimate the risks present in the overall system.

For purchased systems or foundation models, documentation should be provided that demonstrates how bias has been identified and addressed. This includes providing details about training data sources, bias mitigation methods, and fairness metrics. Documentation and fairness tests from third-party developers should be relevant to the organisation’s use case and data.

Fairness testing can include counterfactual testing and the use of benchmark datasets; both help to identify whether the system treats similar cases consistently, regardless of protected characteristics. Counterfactual testing changes sensitive variables in the input data (such as gender or location) and checks whether the outputs change inappropriately. Benchmark datasets can be used to evaluate the AI system’s performance and fairness across different groups.
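
A minimal counterfactual test might look like the sketch below. It assumes a fitted model with a predict method and a tabular dataset in pandas; the sensitive attribute name and its values are hypothetical.

    # Counterfactual fairness sketch: swap the values of a sensitive attribute
    # and measure how often the model's prediction changes. The column name and
    # values ("gender", "male"/"female") and the model object are assumptions.
    import pandas as pd

    def counterfactual_flip_rate(model, X: pd.DataFrame, sensitive_col: str,
                                 value_a, value_b) -> float:
        """Share of cases whose prediction changes when the sensitive value is swapped."""
        X_cf = X.copy()
        X_cf[sensitive_col] = X_cf[sensitive_col].replace({value_a: value_b,
                                                           value_b: value_a})
        original = model.predict(X)
        counterfactual = model.predict(X_cf)
        return float((original != counterfactual).mean())

    # Example: proportion of decisions that flip when gender is swapped
    # flip_rate = counterfactual_flip_rate(model, X_test, "gender", "male", "female")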

Fairness in AI classification can be assessed by comparing different performance metrics across groups to identify potential bias or discrimination. Auditors should note that no model can meet all fairness criteria at once, especially when group prevalence differs. For further details on fairness metrics for classification models, see Appendix 3.
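
The sketch below illustrates one way to compare such metrics across groups using pandas. The column and group names are illustrative; Appendix 3 describes the underlying metric definitions.

    # Group fairness sketch: compare selection rate and true positive rate for
    # groups defined by a protected characteristic. Column and group names are
    # illustrative; Appendix 3 describes the underlying metric definitions.
    import pandas as pd

    def group_metrics(y_true: pd.Series, y_pred: pd.Series,
                      group: pd.Series) -> pd.DataFrame:
        df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": group})
        rows = {}
        for g, part in df.groupby("group"):
            positives = part[part["y_true"] == 1]
            rows[g] = {
                "selection_rate": part["y_pred"].mean(),
                "true_positive_rate": (positives["y_pred"].mean()
                                       if len(positives) else float("nan")),
            }
        return pd.DataFrame(rows).T

Large gaps between groups on either metric may indicate bias, although, as noted above, no model can satisfy every fairness criterion at once when group prevalence differs.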

Bias can arise from non-representative or poor-quality training data, or from the design of the system itself. Because model performance generally improves with the quantity of data available (see Section 4.2), models tend to perform better for well-represented majority groups and worse for minorities. Generative AI models are especially vulnerable, as they are often trained on large, unfiltered datasets from the internet that may reflect societal biases, and generated content can reflect or reinforce harmful stereotypes.

When assessing bias in training data, auditors should review the origin and quality of datasets, especially if data has been web-scraped or generated synthetically. Organisations should show how they have addressed intellectual property rights, data protection, and ethical standards in collecting and using data. Clear documentation should be available.

Human oversight is important, especially for high-risk or sensitive applications. Auditors should confirm that human reviewers are involved in validating outputs (human-in-the-loop validation), particularly where automated decisions may have significant impacts. Ongoing monitoring should be in place to detect and address fairness issues as they arise in live operation (see Section 4.6 for more information).

4.4.3.1 Risks to consider

  • The AI system reinforces or amplifies existing inequalities.
  • Bias or discrimination goes undetected due to inadequate or inappropriate fairness testing.
  • Interacting models or systems create new risks that are not identified by evaluating components in isolation.
  • Purchased or third-party systems lack transparency about fairness testing or bias mitigation.
  • Relying on a single fairness metric or generic benchmarks may give a false sense of assurance.

4.4.3.2 Expected controls

Fairness should be a primary consideration throughout the design, development and deployment of AI systems. Auditors should expect to see evidence of robust, context-appropriate fairness testing, clear documentation, and ongoing monitoring. Where fairness cannot be fully assured, risks should be transparently acknowledged and managed.

  • Document training data sources, bias mitigation methods, and fairness metrics. Include the rationale where appropriate.
  • Use counterfactual testing (changing sensitive variables to check for inappropriate output changes) and benchmarking datasets to evaluate fairness.
  • Build human oversight into the validation process, especially for high-risk or sensitive applications.
  • Ensure ongoing monitoring is in place to detect and address fairness issues in live operation.
  • For generative AI, evaluate prompts and generated responses to ensure content is not prejudiced or offensive.
  • Identify, monitor, and address anticipated discrimination risks.
  • For third-party or foundation models, the auditee organisation should ensure that fairness tests and documentation provided by the developer are relevant to the intended use case.

4.4.4 Security

Security is about protecting data, assets, and system functionality from unauthorised access, misuse, or damage. As AI systems become more capable and autonomous, the risks to safety and security increase. This is especially true for systems that can act independently or process sensitive information.

AI systems face many of the same security challenges as other IT systems. However, the large amounts of data and computing power involved in AI development mean that the security of distributed and cloud-based infrastructure is particularly important.

4.4.4.1 Risks to consider

  • AI systems may access, expose, or misuse personal or confidential data. There is also a risk of unauthorised changes, damage, or loss of data.
  • Failure to comply with data protection laws, such as GDPR, can lead to legal and reputational consequences. This risk increases if data is processed or stored in countries with different regulations.
  • Malicious actors may manipulate data inputs to deceive models, leading to incorrect or harmful outputs. Examples include data poisoning and prompt injection attacks.
  • Attackers may gain access to models or underlying code, allowing them to manipulate or misuse the system.
  • Highly autonomous AI systems, especially agentic AI, may make decisions or take actions without appropriate human oversight.

4.4.4.2 Expected controls

Security should be considered at every stage of the AI lifecycle. Regular risk assessments and reviews should be undertaken to ensure that controls remain effective as technology and threats evolve. Ongoing security monitoring is addressed further in Section 4.6.

The variety of AI systems and applications makes it difficult to foresee all potential risks. Organisations should therefore undertake a thorough risk assessment to identify the specific security risks that may arise at different stages of the AI lifecycle, from development to deployment. Security should also be a consideration when deciding whether to develop AI systems in-house or purchase them from a third party. For known security issues, mitigation strategies should be implemented and tested, and systems should be regularly reviewed as part of those strategies, particularly in the case of new developments.

Possible controls include:

  • Prevent unauthorised access to data, models, and code by using access controls, network monitoring, and cryptography (including encryption).
  • Keep AI development environments separate from main IT networks to reduce the risk of wider compromise.
  • Regularly review systems for vulnerabilities and signs of attack; test defences through adversarial exercises such as red teaming and update them as needed.
  • Train staff to recognise security threats and understand their responsibilities.
  • Apply data protection measures to ensure compliance with relevant laws and regulations. For example, filter incoming data for malicious content before it enters the system and check outgoing data before it leaves (covering both quality assurance and data protection). For generative AI, use content filters to prevent the disclosure of sensitive information (see the sketch after this list).
  • Check that cloud service providers and external model suppliers meet security standards and ensure that service level agreements cover incident management and support.
  • Establish procedures to detect, report, and respond to security incidents, including rollback or decommissioning if necessary.
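
As a simple illustration of an outgoing-content filter for a generative AI system, the sketch below redacts responses matching a few sensitive patterns before they reach the user. The patterns are simplified hypothetical examples; production systems would rely on more robust detection.

    # Illustrative outgoing-content filter for a generative AI system: redact
    # responses matching sensitive patterns before they reach the user. The
    # patterns are simplified hypothetical examples; production systems would
    # use more robust detection (for example dedicated PII or secret scanners).
    import re

    SENSITIVE_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "national_insurance_no": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),
        "api_key": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
    }

    def filter_output(text: str) -> str:
        """Redact matches of known sensitive patterns from model output."""
        for label, pattern in SENSITIVE_PATTERNS.items():
            text = pattern.sub(f"[REDACTED {label}]", text)
        return text

    # Example:
    # print(filter_output("Contact me at jane.doe@example.com"))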

4.4.5 Environmental impact

AI systems, especially LLMs, can have a significant environmental impact. Training and deploying these systems often require substantial computing power, which leads to high energy use and increased carbon emissions. As AI becomes more widely adopted, understanding and managing its environmental footprint is increasingly important.

When AI systems are purchased from third parties, auditors should consider the environmental credentials of suppliers. This may include the use of renewable energy, energy-efficient infrastructure, and sustainable business practices. Many cloud service providers now offer tools to estimate the carbon footprint of cloud computing activities.
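
Where provider tooling is not available, a rough order-of-magnitude estimate can still be made from compute usage, as sketched below. All figures (power draw, PUE, grid carbon intensity) are illustrative assumptions.

    # Back-of-the-envelope carbon estimate for a training or inference workload.
    # All figures (power draw, PUE, grid carbon intensity) are illustrative
    # assumptions; provider reporting tools should be preferred where available.
    def estimate_emissions_kg(gpu_count: int, avg_power_kw: float, hours: float,
                              pue: float = 1.2,
                              grid_kg_co2e_per_kwh: float = 0.2) -> float:
        """Estimate kg CO2e from GPU energy use, scaled by data-centre overhead
        (PUE) and the carbon intensity of the local electricity grid."""
        energy_kwh = gpu_count * avg_power_kw * hours * pue
        return energy_kwh * grid_kg_co2e_per_kwh

    # Example: 8 GPUs drawing 0.4 kW each for 72 hours
    # print(f"{estimate_emissions_kg(8, 0.4, 72):.0f} kg CO2e")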

4.4.5.1 Risks to consider

  • The environmental costs of an AI system outweigh its benefits.
  • Non-compliance with environmental reporting standards or legislation.

4.4.5.2 Expected controls

Auditors should look for documentation demonstrating compliance with any environmental reporting standards or other relevant legislation. Because methodologies for measuring environmental impact are not yet well developed, investigative audits are challenging; as standards and tools continue to evolve, auditors should stay informed about best practice and emerging requirements in this area.

  • Document compliance with environmental reporting standards and relevant legislation.
  • Consider the environmental credentials of third-party suppliers.
  • Use available tools to estimate the carbon footprint and resource use of AI systems.
  • Assess and document energy use, water consumption, and greenhouse gas emissions where possible.