Testing Generative AI Applications for Quality

How to test apps powered by Generative Artificial Intelligence (GenAI)? In this article, we’ll walk you through everything about testing generative AI applications, including a comprehensive checklist, upcoming challenges, and how all this can add value to your projects.

Generative Artificial Intelligence - Unveiling the potential of testing applications powered by Generative AI models - Ilustration

With the widespread use of advanced technologies such as GPT and other Large Language Models (LLMs), the methods we employ to enhance quality in software development and testing must evolve to match the complexity and adaptability of these systems.

Unlike traditional deterministic applications, AI-driven models introduce inherent variability, requiring novel testing strategies to enable reliability, accuracy, and security.

Applications integrating GPT and other LLMs are transforming industries such as customer service, medicine, education, finance, content creation, and beyond. These systems operate through natural language processing, image generation, and multimodal capabilities, making them highly versatile but also introducing significant challenges in evaluation and control.

Such advancements bring unique challenges, especially in software quality, highlighting the need for generative AI testing strategies that go beyond conventional validation techniques. Boosting software robustness requires dynamic and adaptive testing approaches that can assess AI behavior under diverse and unpredictable conditions.

The ability to integrate these models via APIs opens up novel possibilities in product development, but it also raises critical questions regarding reproducibility, fairness, and security. This introduces additional layers of complexity in test automation, as it requires handling non-deterministic responses and optimizing token consumption while maintaining meaningful evaluation criteria.

Ready to boost your productivity by 30%? Revolutionize Your Testing with OUR AI-powered assistant Abstracta Copilot! Request a Demo here.

Hyperparameters

Generative Artificial Intelligence - Exploring the intricacies of a Generative AI model in application testing

Hyperparameters are external configurations of a machine-learning model. They are set before training and are not learned by the model itself from the data.

Unlike model parameters, which are learned from the data during training, the development team is who determines the hyperparameters. They are crucial for controlling the behavior and performance of the algorithm.

To illustrate, consider a video game in development: hyperparameters would be like the difficulty settings configured before starting a test session. These settings, such as the power of enemy characters or how often they appear, don’t change during the game. Yet, they can greatly impact how the game feels and its difficulty level.

In Generative Artificial Intelligence, the right choice of hyperparameters is key to optimizing the model’s accuracy and effectiveness. Common examples include the learning rate, the number of layers in a neural network, and the batch size.

Temperature and Seed in Generative AI Models

Although these terms don’t fit the classic definition of hyperparameters (like learning rate or batch size used in machine learning model training), they serve a similar function in generative model practice and are fundamental for testing with Generative Artificial Intelligence support.

Their importance in generative model settings and their impact on outcomes make them akin to hyperparameters in practical terms. Therefore, in some contexts, especially in describing generative models, they are referred to as hyperparameters, albeit a special type of configuration.

They affect how the model behaves (in terms of randomness or reproducibility) but are not learned by the model from data. They are configured before using the model for generation.

Temperature

When using the API, it’s important to consider the temperature we choose, which determines the balance between precision and creativity in the model’s responses. It can vary from 0 to 2: the lower the temperature we select, the more precise and deterministic the response. The higher it is, the more creative or random.

Seed

This is an initial value used in algorithms to enable the reproducibility of random results. Using a specific seed, AI can help ensure that, under the same conditions and inputs, the model generates consistent responses.

This is crucial for testing, as it increases the reproducibility of results and allows for more precise and reliable evaluation of model behavior in different scenarios and after changes in code or data. Reproducibility plays a key role in testing environments. Using consistent test data helps improve the quality and efficiency of applications based on Generative Artificial Intelligence.

Tokens

In language or generative models, tokens are basic text-processing units, like words or parts of them. Each token can be a word, part of a word, or a symbol.

The process of breaking text into tokens is called tokenization, and it’s crucial for the model to understand and generate language. Essentially, tokens are building blocks the model uses to interpret and produce language. If you’re interested in learning more about the tokenization process, we recommend visiting Open AI’s Tokenizer.

It’s important to note that models have a limit on the number of tokens they can process in a single step. Upon reaching this limit, they may start to “forget” previous information to incorporate new tokens, which is a crucial aspect of managing the model’s memory.

Max Tokens Parameter and Cost

Token limit and cost: Not only does the number of tokens processed affect the model’s performance, but it also has direct implications in terms of cost. In cloud-based services, token usage is a key factor in billing: charges are based on the number of tokens processed. This means that processing more tokens increases the service cost..

Max_tokens parameter: Using the max_tokens parameter is essential in API configuration. It acts as a controller that determines when to cut off the model’s response to avoid consuming more tokens than necessary or desired.

In practice, it sets an upper limit on the number of tokens the model will generate in response to a given input and allows managing the balance between the completeness of the response and the associated cost.

Although tokens are not hyperparameters, they are an integral part of the model’s design and operation. The number of tokens processed and the setting of the max_tokens parameter are critical both for performance and for cost management in applications based on Generative Artificial Intelligence.

For all these reasons, Federico emphasized the need to find a proper balance between cost and outcome when providing the number of tokens the API can use.

Applicable Testing Approaches in Generative Artificial Intelligence

Testing applications using Generative Artificial Intelligence, as well as testing generative AI systems like LLMs, presents unique challenges. Testing generative AI systems requires a mix of traditional and AI-specific approaches.

Below, we explore different methods to evaluate these systems effectively. To address these, let’s review the most well-known categorization in the context of testing generative AI apps: black box and white-box methods

Black Box

These approaches focus on the system’s inputs and outputs, regardless of the underlying code. The key here is understanding the domain, the system, and how people interact with it. Internal implementation comprehension is not necessary.

The major limitation is the lack of insight into what happens inside the system, i.e., its implementation. This can lead to less test coverage and redundant cases. We cannot control internal system aspects. Moreover, testing can only begin once the system is somewhat complete.

White Box

In contrast to black box methods, this approach requires a deep understanding of the system’s interior.

We can analyze how parameters and prompts are configured and used, and even how the system processes responses. This is feasible because we can look inside the box. We leverage available information to design tests.

White box approaches allow more detailed analysis, avoid redundant test execution, and create tests with greater coverage.

However, disadvantages exist: programming knowledge and API handling are necessary. This might lead to bias, precisely due to knowledge of the construction.

This could limit our creativity in generating tests. It might result in losing focus on the real experiences and challenges of people using the systems.

If you’re interested in diving deeper into testing concepts, we recommend you to read Federico Toledo’s book ‘Introduction to Information Systems Testing.’ ¡You can download it for free!

Don’t miss this article! Overcome Black Box AI Challenges

Quality Factor-Based Approach

Diagram showing eight quality factors in AI testing: Usability, Accessibility, Maintainability, Security, Performance, Real-World Environments, Ethical Considerations, and a central checklist icon.

Another high-level approach involves asking questions guided by different software quality factors. These include reliability, usability, accessibility, security, real-world environments, and ethical considerations. In each of these aspects, it’s important to consider both general and specific questions for systems based on Generative Artificial Intelligence.

Usability and Accessibility

As applications increasingly have conversational interfaces, we face the challenge of making them inclusive and adaptable. It’s crucial to validate whether they are intuitive and useful for all people.

They should comprehend different voices under various conditions and provide understandable responses in different contexts and needs. Here, accessibility goes beyond mere compatibility with assistive tools. It seeks effective and accessible communication.

Maintainability

In this aspect, we aim to understand how easy it is to update and change the system. How can we automate tests for a system with non-deterministic results (that is, not always yielding the same results)? Are regression tests efficiently managed to prevent negative impacts on existing functionalities? Are prompts managed like code, easy to understand, well modularized, and versioned?

Security

Within the security framework, GAI introduced a new type of vulnerability known as ‘Prompt Injection.’ Like SQL Injection, an attacker could alter the system’s expected behavior through unauthorized data manipulation, injecting malicious instructions, or extracting unintended information.

They do this by manipulating inputs maliciously, knowing that if not carefully processed and concatenated to a preset prompt, it could exploit this vulnerability. In testing, we have to consider these types of attacks. We review how the system handles unexpected or malicious inputs that could alter its functioning.

One important aspect of security evaluation is adversarial testing, where AI models are exposed to intentionally crafted inputs designed to exploit vulnerabilities. This approach helps identify and mitigate generative AI risk, fostering robust and secure applications.

Another critical threat vector is when the model generates code or scripts that could be used to execute malicious code, particularly in coding assistants or autonomous agents.

How does the system respond to inputs designed to provoke errors or unwanted responses? It may return harmful, misleading, or nonsensical outputs. Or, in some cases, it may reveal system prompts or sensitive data. Testing these scenarios helps us detect such behaviors early and implement safeguards to prevent them in real-world use.

This evaluation is essential to check if applications based on Generative Artificial Intelligence are safe and resistant to manipulation attempts.

We invite you to read this article! Tips for Using ChatGPT Safely in Your Organization

Performance

Here, we focus on how the system handles load and efficiency under different conditions. In the context of GAI, many questions arise.

How does our system perform when OpenAI’s service, used via the API, experiences slowness or high demand? This includes evaluating the system’s ability to handle a high number of requests per second and how this affects the quality and speed of responses.

Moreover, we need to consider efficient computational resource management, especially about token usage and its impact on performance and operational costs. Analyzing performance under these circumstances is crucial to check whether applications based on GAI are not only functional and safe but also scalable and efficient in a real-use environment.

Real World Environments

Traditional testing setups often simulate ideal conditions, but Generative Artificial Intelligence systems must also be validated in real world environments, where factors like noisy inputs, fluctuating internet speeds, incomplete user prompts, and diverse user behaviors impact the system’s reliability.

Testing in these environments helps uncover issues that remain hidden in lab conditions, providing a more accurate picture of how the system behaves in actual production contexts. However, doing so also introduces specific risks, especially when the system under test is capable of generating dynamic content or interacting with real users or external APIs.

Uncontrolled environments can lead to unexpected outputs, trigger unwanted actions, or expose sensitive information if not properly sandboxed. This makes data security a critical concern, as systems may inadvertently access or reveal protected content during testing.

That’s why testing in real-world scenarios must be planned carefully, using isolated environments, realistic but anonymized data, and strict monitoring. The goal is to simulate production-level conditions without compromising privacy, performance, or system integrity.

Ethical Considerations

Beyond technical robustness, testing Generative AI also demands attention to ethical considerations. These include identifying biased outputs, preventing misinformation, and evaluating how systems handle sensitive topics such as gender, accessibility, religion, ageism, or politics.

Moreover, quality engineers must assess how AI behaves when prompted in ethically complex scenarios, where the correct response may depend on context, cultural norms, or subjective human judgment.

Responsible testing includes proactively identifying these risks and focusing on mitigating ethical issues before they can affect real users. This requires not only diverse testing teams and comprehensive scenario coverage, but also a clear framework to define acceptable outputs and flag potentially harmful behaviors.

Generative Artificial Intelligence: Why We Recommend Using Evals and Their Importance

Generative Artificial Intelligence - Evals - Ilustration

At Abstracta, we recognize the importance of responsible and efficient use of LLMs and the tools we build based on them. To keep pace with development, we face the challenge of automatically testing non-deterministic systems. The variability in expected outcomes complicates the creation of effective tests.

Faced with this challenge, we recommend adopting OpenAI’s framework known as ‘Evals’ for regression testing. It works through different mechanisms. To address the non-determinism of responses, it allows the use of an LLM as a ‘judge’ or ‘oracle.’ This method, integrated into our regression tests, lets us evaluate the results against an adaptable and advanced criterion. It aims to achieve more effective and precise tests.

What are OpenAI Evals?

It is a framework specifically developed for evaluating LLMs or generative AI tools that leverage these models. This framework is valuable for several reasons:

Open-Source Test Registry: It includes a collection of challenging and rigorous test cases, in an open-source repository. This facilitates access and use.
Ease of Creation: Evals are simple to create and do not require writing complex code. This makes them accessible to a wider variety of people and contexts.
Basic and Advanced Templates: OpenAI provides both basic and advanced templates. These can be used and customized according to each project’s needs.

Incorporating Evals into our testing process strengthens our ability to validate if LLMs and related tools function as intended. It allows us to maintain high standards of quality and efficacy. Additionally, organizations are increasingly adopting AI-powered testing tools to enhance automation and efficiency in generative AI-based testing.

If you’re interested in this topic, we recommend reading the article ‘Decoding OpenAI Evals.’ It offers a detailed view of this framework and its application in different contexts.

Challenges in Testing Generative AI Applications

We face multiple challenges every day. AI technology advances at striking paces. Therefore, building teams and sharing knowledge while continuing to experiment is vital.

Referring to software quality, possibly one of the most challenging aspects of testing AI in the context of Generative Artificial Intelligence is creating effective test sets that account for variability, subjectivity, and contextual dependencies in model outputs. This area, filled with complexities and nuances, requires a combination of creativity, deep technical understanding, and a meticulous approach.

Key Challenges in Testing Generative AI

Complexity and Variability of Responses: Generative Artificial Intelligence-based systems, like LLMs, can produce a wide range of different responses to the same input. This variability makes it difficult to predict and verify the correct or expected responses, presenting challenges that traditional testing methods are not designed to handle and complicating the creation of a test set that adequately covers all possible scenarios.
Need for Creativity and Technical Understanding: Designing tests for Generative AI requires not only a deep technical understanding of how these models work but also a significant dose of creativity, as in exploratory testing, where testers dynamically analyze system behavior. This is to anticipate and model the various contexts and uses that users might apply to the application.
Context Management and Continuity in Conversations: In applications like chatbots, managing context and maintaining conversation continuity are critical. This means that tests must be capable of simulating realistic and prolonged interactions to assess the system’s response coherence and relevance.
Evaluation of Subjective Responses: Often, the ‘correctness’ of a Generative Artificial intelligence-generated response can be subjective or context-dependent. This is especially true when evaluating whether the model produces or tolerates harmful content such as hate speech, which requires not only technical evaluation but ethical judgment as well. So this requires a more nuanced evaluation approach than traditional software testing.
Bias and Security Management: Generative AI models can reflect and amplify biases present in the training data. Identifying and mitigating these biases in tests is crucial to enable the system’s fairness and safety. Additionally, it is necessary to evaluate scenarios that could lead to data leakage, where sensitive information from training datasets might be unintentionally exposed in generated outputs.
Rapid Evolution of Technology: Generative Artificial Intelligence technology is constantly evolving. This means that test sets must be continually updated to remain relevant and effective.
Integration and Scalability: In many cases, Generative AI is integrated with other systems and technologies. Effectively testing these integrations, especially on a large scale, can be complex.
Observability: Given the complexity of systems, and as most approaches are black-box due to context, data, or code access restrictions, observability plays a key role in our effectiveness in testing.

While this presents a major challenge, it also presents a singular opportunity to venture into uncharted territories in software quality.

Our Solution

At Abstracta, we embrace this challenge with tools like Abstracta Copilot, our AI-powered solution designed to enhance manual testing and observability in complex systems. By leveraging Generative AI, Abstracta Copilot helps teams streamline test case creation, identify edge cases, and analyze large volumes of test results with greater efficiency.

Thanks to the ability to understand and make visible through observability platforms what happens inside the system, we can trace each step of internal processes, correlating inputs, behaviors, and outputs. This enables us to design more precise and effective tests, improving software reliability in AI-driven applications.

Ready to boost your productivity by 30%? Revolutionize Your Testing with OUR AI-powered assistant Abstracta Copilot! Request a Demo here.

Key Considerations for Testing AI Applications at Scale

Infographic showing a checklist of key areas for testing AI applications, including scope, inputs, quality, risk, oversight, behavior, scalability, integration, and coverage.

Testing AI applications involves much more than checking if an output is right or wrong. It requires a strategic view that aligns with quality goals, data sensitivity, performance expectations, and scalability. As these systems become deeply integrated into products and services, it’s crucial to cover the most critical areas.

AI Testing Checklist

Use this list to assess whether your testing strategy addresses the essential aspects of building reliable, secure, and responsible AI systems.

1. Scope and Intent

Have you validated the system’s intended capabilities?
Are the specific tasks the AI must perform clearly defined?
Has testing been integrated early in the design process?

2. Inputs and Evaluation

Are you using diverse prompts and structured data?
Have you tested across realistic real-world scenarios?
Are you accounting for a varying degree of responses?
Do you continuously monitor and log the AI’s responses?

3. Quality and Performance

Are you running automated benchmarking tests?
Are you establishing metrics and defining benchmarks?
Do your quality metrics reflect accuracy, context, and stability?
Have you verified the system’s ability to perform under different loads?
Are you using external tools to monitor performance?

4. Data Protection and Risk

Have you prevented data leaks and exposure of sensitive data?
Have you tested for risks such as:
- Divulging personal user information
- Revealing financial records
- Extracting sensitive data
- Leaking customer-sensitive data
Is the system restricted from accessing prohibited sites or downloading large files?
Are integrations with external systems secure?

5. Oversight and Failures

Are human testers involved in key evaluations?
Are you identifying and analyzing test fails?
Have you implemented effective AI’s safeguards?

6. System Behavior and Inputs

Are you evaluating the model’s behavior when faced with incomplete or ambiguous prompts?
Are you testing edge cases where the model may hallucinate, bypass rules, or respond inconsistently across contexts?
How do seed and temperature parameters affect your ability to reproduce bugs?

7. Scalability and Adaptability

Is your testing approach dynamic enough to adapt to rapid changes in model behavior after updates?
Are you monitoring token usage and its cost/performance tradeoffs?

8. AI Agents and Real-World Integration

Have you tested how agents or specialized copilots interact with external APIs and systems in real-world flows?
Can you trace, through observability tools, the full journey of each input, from user prompt to system response?

9. Testing Coverage

Are you applying different types of testing (functional, performance, accessibility, security, observability) according to the role of the AI system?

Testing practice involves learning, anticipating, and evolving with the technology. At Abstracta, we help teams transform these insights into concrete, scalable testing strategies.

Why Choose Abstracta for AI Transformation Services?

Generative Artificial Intelligence - Generative AI Models - Customized Copilots

With over 16 years of experience and a global presence, Abstracta is a leading technology solutions company with offices in the United States, Chile, Colombia, and Uruguay. We specialize in software development, AI-driven innovations & copilots, and end-to-end software testing services.

We believe that actively bonding ties propels us further. That’s why we’ve forged robust partnerships with industry leaders like Microsoft, Marvik, Datadog, Tricentis, Perforce BlazeMeter, and Saucelabs, empowering us to incorporate cutting-edge technologies.

We combine all our expertise in reliability, performance, accessibility, and test automation into AI transformation projects.

Gen AI has huge potential, but defining where & how to integrate it needs several considerations and expertise. We’ve already deployed successful Generative AI solutions for real use cases, and we would love to support you in this journey.

Find the Best Way to Implement Generative Artificial Intelligence in Your Team

Easily Add your Own AI Assistant

We’ve developed an open-source tool that helps teams build their own web-based assistants, integrating private data over any existing web application just by using a browser extension.

This enables your users to ask different cognitive tasks with either voice commands, writing messages, or sending images without touching your own app’s code! Check it out at Github.

🗣️Harness Leaders and Empower Your Team in a Safe Manner

Worried about your data? Worried about how to lead the challenges and changes that AI will bring to your organization? You don’t know how to onboard your leaders in digital transformation using AI? We’ve got you covered.

Using private-gpt your team can chat safely, share prompts and assistants using any LLMs using your own account, and control the budget. Check it out on Azure Marketplace.

🚀Craft Custom Solutions with the Support of an Experienced Partner

We’ve created AI-enabled tools and copilots for different technologies. These solutions aim to boost productivity and tackle challenges such as recruitment screening processes, supporting the team’s career path, and helping business experts with AI-based copilots, among others.

Let’s build strong AI solutions together!
Schedule a 30-minute call to understand how we can support you.

Follow us on Linkedin & X to be part of our community!

Blog

Testing Generative AI Applications

Hyperparameters