4.3 Model development

Model development is a pivotal stage in the AI system lifecycle, where conceptual designs are translated into operational models. For auditors, this phase provides the clearest view of how technical, organisational, and ethical requirements are embedded within the system.

This section explains how models are built and quality-assured, covering model selection and adaptation (including the use of foundation models and LLMs), feature engineering, prompt and parameter design, internal quality assurance, and methods to ensure reliability, robustness, and reproducibility. We address outcome-based acceptance testing and go/no-go decisions in Section 4.5, while operational testing in live environments is covered in Section 4.6.

4.3.1 Development process and performance characteristics

Model development involves building, training, and refining an AI model to perform specific tasks. This process typically includes preparing data (see section 4.2), selecting or adapting algorithms, evaluating performance, and optimising parameters to ensure the model performs accurately on new data. Increasingly, organisations adapt pre-trained or foundation models, such as LLMs, rather than developing models from scratch. Auditors should therefore assess model provenance, licensing, adaptation methods (such as fine-tuning, prompt engineering, or RAG), and any dependencies on external vendors.

Choosing the right model is critical, as it directly affects performance, scalability, and the ability to generalise to new situations. Key factors influencing this decision, including whether to develop a custom model or adapt a pre-trained model, are:

  • Data availability: quantity, quality and relevance of data.
  • Technical expertise: the organisation’s skills in machine learning and the subject domain.
  • Computing resources: the capacity to train and run models.
  • Time constraints: project timelines and delivery requirements.
  • Explainability needs: requirements for transparency and interpretability.
  • Budget: financial resources for development and ongoing operation.

For example, organisations might compare the performance of (i) simpler machine-learning methods, (ii) fine-tuned deep-learning/small language models, and (iii) LLMs accessed via APIs. In some cases, simpler machine learning methods or smaller, fine-tuned models may outperform more complex or costly general-purpose models. Auditors should expect organisations to document the rationale for model selection, ensuring choices are technically sound, justified, and transparent.
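The sketch below illustrates how such a comparison might be evidenced in practice. It is a minimal example, assuming a tabular classification task with scikit-learn; the dataset, candidate models, and metric are placeholders, not a recommendation.

```python
# Illustrative sketch: comparing a simpler and a more complex model on the same
# cross-validation splits, so the selection rationale can be documented.
# Dataset and model choices are placeholders for the system under audit.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    # Record the mean and spread so the comparison can be reproduced and audited.
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```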

Feature engineering – the process of transforming raw data into meaningful variables for the model – is particularly important. Methods for generating, extracting, and selecting features should be well documented. Auditors should pay attention to how individual features affect model performance and whether any features could introduce unnecessary complexity or risk of bias.
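A minimal sketch of a documented feature-engineering pipeline is shown below; the column names, transformers, and selection criterion are hypothetical and would need to be justified for the system under audit.

```python
# Illustrative sketch: a feature-engineering pipeline whose steps can be
# documented and reviewed. Column names and the choice of k are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["income", "tenure_months"]        # assumed numeric columns
categorical_features = ["region", "contract_type"]    # assumed categorical columns

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

feature_pipeline = Pipeline([
    ("preprocess", preprocess),
    # Keep only the k most informative features; the choice of k should be
    # justified and recorded, as should any features excluded because they
    # could act as proxies for sensitive attributes.
    ("select", SelectKBest(score_func=f_classif, k=5)),
])
```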

For generative AI and foundation models, prompt design and versioning are crucial, as the instructions or context provided can significantly influence outputs. In addition to prompt engineering, the configuration of sampling parameters plays a key role in shaping the behaviour and consistency of generated responses.65 Documenting these settings is essential for reproducibility, transparency, and accountability.
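The sketch below shows one way such settings could be recorded as a versioned, hashable configuration; the field names, model identifier, and parameter values are assumptions for illustration only.

```python
# Illustrative sketch: recording prompt and sampling configuration alongside a
# version identifier so generated outputs can be traced back to the exact setup.
# Field names and values are hypothetical placeholders.
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class PromptConfig:
    prompt_template: str
    model_name: str
    temperature: float
    top_k: int
    max_tokens: int
    version: str

config = PromptConfig(
    prompt_template="Summarise the following case file in neutral language:\n{document}",
    model_name="example-llm-v1",   # placeholder model identifier
    temperature=0.2,               # low temperature for more deterministic output
    top_k=40,
    max_tokens=512,
    version="2025-01-15-rev3",
)

# A content hash gives a stable reference for audit logs and version control.
config_hash = hashlib.sha256(json.dumps(asdict(config), sort_keys=True).encode()).hexdigest()
print(config_hash[:12], asdict(config))
```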

Supervised machine learning approaches are typically evaluated by measuring performance on held-out test data. For generative and general-purpose AI systems, additional measures are needed. A key challenge in developing and auditing these systems is the risk of hallucinations, which occur when a model generates outputs that sound plausible but are factually incorrect, misleading, or nonsensical. This risk is especially high for LLMs, which can generate authoritative-sounding responses that are not grounded in reliable data. Verifying outputs against trusted sources is essential to detect inaccuracies and prevent misinformation, helping maintain confidence in the system. Possible measures for evaluating modern AI systems include the following (a simple grounding check is sketched after this list):

  • Hallucination rates and factual accuracy.
  • Robustness to different prompts or inputs.
  • Safety and toxicity metrics.
  • Adaptation to specific domains.
  • Efficiency in handling context and data.
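A deliberately crude grounding check is sketched below; the reference facts, the generate_answer placeholder, and the substring matching rule are all hypothetical, and real evaluations typically combine curated benchmarks with human review.

```python
# Illustrative sketch: a simple hallucination check that verifies whether
# expected facts appear in generated answers. The data and the matching rule
# are hypothetical; generate_answer stands in for the model under audit.
reference_facts = {
    "What year was the agency founded?": "1987",
    "Who is the current director?": "A. Example",
}

def generate_answer(question: str) -> str:
    # Placeholder for a call to the model under audit.
    return "The agency was founded in 1987." if "founded" in question else "B. Example leads the agency."

checked = 0
hallucinated = 0
for question, expected in reference_facts.items():
    answer = generate_answer(question)
    checked += 1
    if expected not in answer:   # naive grounding check against a trusted source
        hallucinated += 1

print(f"Hallucination rate: {hallucinated / checked:.0%} over {checked} questions")
```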

No single metric can capture overall model quality, and evaluation often requires balancing competing priorities, such as accuracy versus fairness. Auditors should expect organisations to define and justify their chosen performance metrics.

Reproducibility – the ability to recreate results using the same data, parameters, and environment – remains a cornerstone of trustworthy AI system development. Auditors typically begin with documentation reviews to establish an audit baseline. If documentation is insufficient, unclear, or contradictory, reproducing the model and/or its predictions may be necessary. Where full reproduction is not feasible (for example, with large or proprietary models), thorough documentation of prompts, retrieval contexts, and configuration settings is important to support auditability.
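A minimal sketch of such reproducibility support is shown below, assuming a Python environment with NumPy; in practice the record would be extended with dataset hashes, code commit identifiers, and hardware details.

```python
# Illustrative sketch: pinning random seeds and recording the environment so a
# training run can be repeated and compared against documentation.
import json
import platform
import random
import sys

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# For deep-learning frameworks, seed those libraries as well (e.g. torch.manual_seed).

run_record = {
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    # Add dataset version/hash, model parameters, and code commit hash here.
}
with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```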

4.3.1.1 Risks to consider

  • Errors in data preparation or feature engineering that introduce bias or reduce model reliability (see also Section 4.2).
  • Overestimation of model performance, particularly if testing is inadequate or not representative.
  • Output instability and non-determinism, including sensitivity to prompt/context changes in generative systems.
  • Claims about a foundation model’s capabilities that are often unverifiable with limited resources and without access to model weights.

4.3.1.2 Expected controls

Effective documentation and version control are essential for transparency, traceability, and auditability. Auditors should expect to find:

  • Clear records of model selection, adaptation, and version history.
  • Performance evaluations using both general benchmarks and application-specific tests.
  • Evidence of reproducibility, including data, parameters, and environment details.
  • Integrated risk assessments covering operational and societal impacts.
  • For generative AI and foundation models, documentation of prompt design and versioning, parameter settings, and details of any external APIs or pre-trained models used.
  • Model selection: clearly document the criteria and tests used for model comparison, or provide a rationale if no comparison was made.
  • Version control: maintain robust version control for data, model artefacts, prompts, prediction outputs, and pipeline code. Include details of prompt design, versioning, and sampling configurations.
  • Comprehensive records: include information on model architecture, training data, known limitations, performance metrics, development processes, and deployment decisions.
  • Generative AI and foundation models: document the methods used to detect and mitigate hallucinations, and ensure these are proportionate to the risks associated with the system’s intended use.
  • Feature engineering: describe how variables are generated and selected, with attention to features that could act as proxies for sensitive attributes, to ensure compliance with data protection standards and prevent bias.

4.3.2 Quality assurance, reliability and robustness in development

Quality assurance in development brings together code quality, model quality, and best practices to ensure reliability and robustness before deployment.

  • Reliability refers to the model’s ability to perform consistently and accurately under expected conditions, particularly when handling data similar to that used in training.
  • Robustness is the model’s ability to maintain performance when faced with unexpected inputs – such as noise, changes in the environment, or ‘out-of-distribution’ data drawn from a different distribution than the training data. A simple perturbation check is sketched below.
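The following is a minimal robustness probe, assuming a simple scikit-learn classifier and Gaussian noise as the perturbation; real robustness testing would also cover out-of-distribution and adversarial inputs.

```python
# Illustrative sketch: comparing accuracy on clean versus noise-perturbed test
# data as a basic robustness probe. The dataset, model, and noise level are
# placeholders for the system under audit.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

rng = np.random.default_rng(0)
X_noisy = X_test + rng.normal(scale=2.0, size=X_test.shape)  # assumed noise level

print("clean accuracy:", round(model.score(X_test, y_test), 3))
print("noisy accuracy:", round(model.score(X_noisy, y_test), 3))
```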

These considerations are particularly relevant for foundation and generative models, where the distinction between training data and operational data is often blurred. Many modern computer vision and language models are created by fine-tuning a foundation model that was originally trained on a separate, much larger dataset. As a result, the data used during fine-tuning is only part of what shapes the final model, making it difficult to determine exactly what constitutes “out-of-distribution” data. Evidence suggests that models adapted from foundation models tend to generalise better to new data than those trained from scratch using traditional supervised learning. However, the reasons for this improved generalisation are still being explored.

Techniques such as RAG can improve reliability by grounding model outputs in information retrieved from external sources, helping to address issues such as hallucinations or outdated knowledge. Advances in model architecture, such as longer context windows that enable models to consider more surrounding information when generating responses, support more reliable, accurate, and contextually appropriate outputs.
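A minimal RAG sketch is shown below, assuming a toy corpus, TF-IDF retrieval, and a placeholder call_llm function; production systems typically use embedding models and vector databases instead.

```python
# Illustrative sketch of retrieval-augmented generation (RAG): retrieve the most
# relevant passages from a trusted source and include them in the prompt.
# The corpus, retrieval method, and call_llm function are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The agency was founded in 1987 and reports to parliament.",
    "Annual audits are published every October.",
    "The current director took office in 2021.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def call_llm(prompt: str) -> str:
    return "<model output>"  # placeholder for the model call

question = "When was the agency founded?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
answer = call_llm(prompt)
```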

Auditors should expect to see evidence that reliability and robustness have been considered from the outset, with appropriate controls in place to manage risks.

4.3.2.1 Risks to consider

  • Poor code quality or lack of testing, leading to unstable or unreliable outputs.
  • Vulnerabilities arising from external dependencies, such as vendor updates or changes to third-party models.
  • Risks of misinformation, including hallucinations or the perpetuation of biases.
  • Minimally altered (adversarially perturbed) inputs leading to erroneous and unreliable model behaviour.

4.3.2.2 Expected controls

  • Output consistency: test for consistent outputs by running the same prompt multiple times, especially for generative models (see the sketch after this list). Assess reproducibility using subsets of training data or independent (possibly synthetic) data. If overfitting or unnecessary features are suspected, retrain the model with fewer features and compare results.
  • Personal data sensitivity: for models using personal data, retrain with reduced or no personal data to assess the trade-off between performance and privacy.
  • Prompt and context tracking: record variations in prompts, supplied context, and RAG retrievals to support targeted reproduction when full model replication is not possible.
  • External correlation analysis: if code or models are inaccessible, use available data to analyse correlations between features and predictions, inferring model behaviour. Possible methods include LIME or SHAP.
  • Demonstration of variants: ask the auditee to demonstrate retraining or simplified workflows to understand model behaviour under different conditions.
  • Robust machine learning operations (MLOps), including version control, unit tests, code reviews, integration tests, and end-to-end testing.
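A minimal consistency check is sketched below; the generate function is a placeholder for the (possibly non-deterministic) model under audit, and exact-match agreement is a deliberately simple measure.

```python
# Illustrative sketch: running the same prompt several times and measuring how
# often the outputs agree. Exact-match counting is a crude but auditable metric.
from collections import Counter

def generate(prompt: str) -> str:
    return "<model output>"  # placeholder for a (possibly non-deterministic) model call

prompt = "Classify this complaint as 'urgent' or 'routine': ..."
outputs = [generate(prompt) for _ in range(10)]

counts = Counter(outputs)
most_common_output, freq = counts.most_common(1)[0]
print(f"Most frequent output occurred in {freq}/{len(outputs)} runs")
print("Distinct outputs:", len(counts))
```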

  65. Parameters such as “temperature” or “top-k” control the diversity and randomness of generated responses. For example, a higher temperature increases variability, while lower values produce more deterministic results.