Quality is being served here – How to think about quality in AI initiatives.

Machine learning is being adopted across businesses at an increasing pace, a trend that has been ongoing for several years. Legislators have been slow to react but are starting to catch up: regulations are being proposed and developed. In Europe, for example, the AI Act might enter into force by the end of this year.

The AI Act will directly impact only a limited set of use cases. However, it is likely to have much broader implications for customers' expectations of the quality of AI products and ML solutions in general, which is a good thing.

Vendors need to start paying more attention to the design and implementation of their products, services, and custom solutions. A natural starting point for delivering a good-quality product is to make sure that the business impact (such as increased revenue, reduced costs, or improved customer satisfaction) has been properly evaluated. However, quality considerations beyond establishing a business case have generally received little attention in the past.

I will briefly discuss what is meant by quality with respect to AI/ML and how it can be promoted. In short, it boils down to adopting rigorous approaches for managing the life cycle of AI/ML solutions as well as making them more trustworthy from the end user’s point of view.

How to define the quality of AI?

The quality of AI has been discussed in several research papers as well as represented in quality assurance frameworks. When defining quality, the following aspects are typically mentioned: Robustness, Performance, Interpretability, and Transparency. Other important aspects include User experience, Scalability, and Maintainability.

Robustness measures the reliability and stability of the solution over time, for example in different environments or under variable conditions. Performance concerns the accuracy and response time of the solution. Together these two aspects contribute to the solution being able to serve its intended purpose.

Interpretability encompasses both explainability (the ability of the solution to provide transparent and interpretable results, allowing stakeholders to understand and trust the predictions given or decisions being made) and uncertainty (information about the confidence around a given prediction/decision), contributing a great deal to the transparency of a solution.

Transparency also involves proper documentation, making it possible to understand the logic behind the development process. Documentation likewise enables stakeholders to understand and interpret the solution outputs, helping to find and correct any biases or errors in the data or model.

By promoting transparency, organizations can build trust and accountability, and ensure that the predictions given by ML solutions, or decisions made by AI systems, are aligned with their values and goals.

Robust MLOps tackles the technical aspects of quality

Most of the problems related to the quality of ML solutions/AI systems can be tackled by following MLOps (Machine Learning Operations) best practices. This allows organizations to ensure that their machine-learning models are developed, tested, and deployed in a scalable, efficient, and reliable manner.

The essence of MLOps is to combine data science development with commonly applied software engineering tools to robustly manage the life cycle of ML solutions. This typically includes:

  1. automation of as much of the ML pipeline as possible, including data pre-processing, model training, and deployment
  2. version control of everything (including source code, data, models, and environment definitions)
  3. audit trail of operations
  4. monitoring changes in data quality and model performance
  5. managing data flow from sources via pre-processing, feature engineering, and training/serving data generation
  6. governance of model lifecycle

MLOps is becoming everyday business in many organizations, which means we are moving away from production solutions based on non-standardized and poorly documented code notebooks with little or no reusability. However, MLOps only addresses quality from a technical perspective and thus cannot resolve all aspects of trustworthiness.
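Point 4 in the list above, monitoring changes in data quality, can be sketched with a widely used drift metric, the Population Stability Index (PSI). The feature values and thresholds below are illustrative assumptions, not tied to any particular platform:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a feature's training-time
    distribution (expected) and its serving-time distribution (actual)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    # Clip serving values into the training range so out-of-range
    # observations are still counted in the outer bins.
    clipped = np.clip(actual, edges[0], edges[-1])
    a_frac = np.histogram(clipped, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 10_000)     # distribution at training time
serving_stable = rng.normal(0.0, 1.0, 10_000)    # no drift
serving_drifted = rng.normal(1.0, 1.0, 10_000)   # mean has shifted

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate, > 0.25 major drift.
print(psi(train_feature, serving_stable))    # small value
print(psi(train_feature, serving_drifted))   # clearly above the alarm threshold
```

In a production pipeline a check like this would run on a schedule per feature, with an alert (and possibly a retraining trigger) when the index exceeds the agreed threshold.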

Understanding uncertainty

Users are unlikely to trust any predictions/decisions that are not accompanied by estimates of uncertainty. This is understandable; it is much easier to base orders or production planning on sales estimates of 1000±50 units than those of 1000±500 units!

ML model performance has traditionally been evaluated using aggregate accuracy metrics that tell nothing specific about the reliability of an individual point prediction. Knowing how a model performs on average, and whether there are any systematic biases, is still useful; but for operational purposes one would also want to know, for example, how reliable the forecast of demand for a given product, in a given store, on a given day in the future actually is.
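A small synthetic illustration of why the aggregate number is not enough (the stores and error magnitudes below are invented for the example): two stores can share one respectable-looking overall error while offering very different per-store reliability.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical daily demand-forecast errors (in units) for two stores.
errors_store_a = rng.normal(0.0, 10.0, 1_000)    # stable, easy-to-forecast store
errors_store_b = rng.normal(0.0, 100.0, 1_000)   # volatile, hard-to-forecast store

mae_a = np.mean(np.abs(errors_store_a))
mae_b = np.mean(np.abs(errors_store_b))
mae_overall = np.mean(np.abs(np.concatenate([errors_store_a, errors_store_b])))

# The single aggregate MAE sits between the two and describes neither store:
# planning store B's stock from the overall figure would badly understate risk.
print(mae_a, mae_b, mae_overall)
```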

Addressing the uncertainty problem requires the adoption of novel approaches and techniques for model development and evaluation, which is an area of active research and development. Currently, there are several alternative approaches available, including adjustments to existing algorithms, new developer tools, and probabilistic methods (e.g., GPy and GPflow).

A unified approach is provided by conformal prediction, applicable to problems across fields such as general machine learning, computer vision, natural language processing, and even deep reinforcement learning. Conveniently, one can use conformal prediction with any pre-trained model to produce prediction sets that are guaranteed to contain the ground truth with a user-specified probability, such as 90%.

In simple terms, conformal prediction compares predictions made on held-out calibration data with the actual outcomes, and from these differences a measure of the confidence or uncertainty associated with each prediction is calculated. This confidence measure is then used to size the prediction sets or intervals for new inputs, making it explicit which predictions are reliable and which are too uncertain to act on.
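A minimal sketch of split conformal prediction for regression, on synthetic data. A simple linear fit stands in for the "pre-trained model"; any black-box regressor would be used in exactly the same way:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "pre-trained" model: fit once, then treat as a black box.
X_train = rng.uniform(0, 10, 500)
y_train = 2.0 * X_train + rng.normal(0.0, 1.0, 500)
coef = np.polyfit(X_train, y_train, 1)
predict = lambda x: np.polyval(coef, x)

# Calibration set, held out from training.
X_cal = rng.uniform(0, 10, 200)
y_cal = 2.0 * X_cal + rng.normal(0.0, 1.0, 200)
scores = np.abs(y_cal - predict(X_cal))          # nonconformity scores

alpha = 0.1                                      # target 90% coverage
n = len(scores)
# Finite-sample-corrected quantile of the calibration scores.
level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
q = np.quantile(scores, level, method="higher")

# Interval for a new input: covers the truth with >= 90% probability,
# assuming the new data are exchangeable with the calibration data.
x_new = 5.0
lower, upper = predict(x_new) - q, predict(x_new) + q
print(lower, upper)
```

The appeal is that the guarantee holds regardless of what the underlying model is; a badly calibrated model simply produces wider intervals, which is itself useful information.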

Understanding the uncertainty around model predictions might not be enough. Predictions made by ML models are (largely) contingent on the properties of the training data. Unanticipated variation in inputs at inference time can thus result in unexpected outcomes, which is undesirable, especially in critical applications.

This kind of uncertainty can be addressed by data augmentation techniques (also for tabular data), where the goal is to help the model generalize better and become more robust to unseen data, leading to improved accuracy on the task it is trained for. With existing models, it is also possible to identify conditions of reduced performance and/or unexpected behaviour using a technique called metamorphic testing. This involves testing a system by applying a set of predefined transformations to the inputs and checking that the output remains consistent with the expected properties.
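As an illustrative sketch of metamorphic testing (the toy pipeline and the chosen relation are hypothetical examples): one simple metamorphic relation for a regression pipeline is that shuffling the order of the training rows must not change its predictions.

```python
import numpy as np

def fit_predict(X, y, x_new):
    """Toy pipeline: fit a line and predict at x_new. In a real
    metamorphic test the system under scrutiny is plugged in here."""
    slope, intercept = np.polyfit(X, y, 1)
    return slope * x_new + intercept

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 100)
y = 3.0 * X + rng.normal(0.0, 0.5, 100)

# Metamorphic relation: permuting the training rows is a transformation
# that should leave the prediction (essentially) unchanged.
perm = rng.permutation(len(X))
baseline = fit_predict(X, y, x_new=4.0)
transformed = fit_predict(X[perm], y[perm], x_new=4.0)

assert np.isclose(baseline, transformed), "hidden order dependency detected"
```

Other relations follow the same pattern, e.g. shifting all timestamps by one week or rescaling a unit of measurement, each with a stated expectation for how the output may or may not change.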


The application of data science to solve business problems is becoming more mature, facilitated by the adoption of software development best practices. A massive gain is the improved quality, robustness, and replicability of machine learning solutions. Moreover, being more transparent about the underlying uncertainties helps in building trust, which in turn makes it easier for stakeholders to accept and adopt new tools. Quality is being served and that will (or at least should) be accepted with gratitude.