Testing Applications Powered by Generative Artificial Intelligence

Would you like to delve into how to test applications powered by Generative Artificial Intelligence (GenAI)? We invite you to explore this topic, discover the upcoming opportunities and challenges, and understand how all this can add value to your projects.

Generative Artificial Intelligence - Unveiling the potential of testing applications powered by Generative AI models - Ilustration

With the widespread use of advanced technologies such as GPT and other Large Language Models (LLMs), the methods we employ to enhance quality in software development and testing are continuously progressing.

Applications integrating GPT and other LLMs are transforming industries such as customer service, medicine, education, finance, content creation, and beyond. Of course, all of it involves natural language processing and image generation.

These advancements bring unique challenges, especially in software quality. In the midst of accelerated digital transformation, the quality of systems impacts most areas of people’s lives. Likewise, the dynamic nature of applications demands a novel and adaptable approach to ensure their effectiveness.

As expressed by Abstracta’s Chief Quality Officer, Federico Toledo, in a talk he gave at the Global Quality Summit organized by Globant, there is a large ecosystem. ‘We are talking about a Generative Artificial Intelligence that makes us rethink even the way to test things,‘ he said.

He explained that we have ChatGPT, which is the platform that most people interested in the subject have tried so far, for generating everything from content to code, with a conversational approach. This platform not only excels in natural language processing but also holds potential for advancements in areas like image generation. But it doesn’t end there; the system also offers an API, which we can use in multiple and novel ways in product development.

To use the API, it is necessary to have a Plus account, that is, a paid account. ‘To simplify, we could say that using OpenAI’s API is like having programs run a prompt on ChatGPT, which is going to be composed of context information from the program being tested, such as session data or user inputs,’ Federico introduced.

Generative Artificial Intelligence - Exploring the intricacies of a Generative AI model in application testing


Hyperparameters are external configurations of a machine-learning model. They are set before training and are not learned by the model itself from the data.

Unlike model parameters, which are learned from the data during training, the development team is who determines the hyperparameters. They are crucial for controlling the behavior and performance of the algorithm.

To illustrate, consider a video game in development: hyperparameters would be like the difficulty settings configured before starting a test session. These settings, such as the power of enemy characters or how often they appear, don’t change during the game. Yet, they can greatly impact how the game feels and its difficulty level.

In Generative Artificial Intelligence, the right choice of hyperparameters is key to optimizing the model’s accuracy and effectiveness. Common examples include the learning rate, the number of layers in a neural network, and the batch size.

Temperature and Seed in Generative AI Models

Although these terms don’t fit the classic definition of hyperparameters (like learning rate or batch size used in machine learning model training), they serve a similar function in generative model practice and are fundamental for testing with Generative Artificial Intelligence support.

Their importance in generative model settings and their impact on outcomes make them akin to hyperparameters in practical terms. Therefore, in some contexts, especially in describing generative models, they are referred to as hyperparameters, albeit a special type of configuration.

They affect how the model behaves (in terms of randomness or reproducibility) but are not learned by the model from data. They are configured before using the model for generation.


When using the API, it’s important to consider the temperature we choose, which determines the balance between precision and creativity in the model’s responses. It can vary from 0 to 2: the lower the temperature we select, the more precise and deterministic the response. The higher it is, the more creative, or random.


This is an initial value used in algorithms to ensure the reproducibility of random results. Using a specific seed, AI can help ensure that, under the same conditions and inputs, the model generates consistent responses.

This is crucial for testing, as it increases the reproducibility of results and allows for more precise and reliable evaluation of model behavior in different scenarios and after changes in code or data.

Reproducibility is especially valuable in testing environments where controlling and predicting model output is vital to ensuring the quality and efficiency of applications based on Generative Artificial Intelligence.


In language or generative models, tokens are basic text-processing units, like words or parts of them. Each token can be a word, part of a word, or a symbol.

The process of breaking text into tokens is called tokenization, and it’s crucial for the model to understand and generate language. Essentially, tokens are building blocks the model uses to interpret and produce language. If you’re interested in learning more about the tokenization process, we recommend visiting Open AI’s Tokenizer.

It’s important to note that models have a limit on the number of tokens they can process in a single step. Upon reaching this limit, they may start to “forget” previous information to incorporate new tokens, which is a crucial aspect of managing the model’s memory.

Max Tokens Parameter and Cost

Token limit and cost: Not only the number of tokens processed does affect the model’s performance but also has direct implications in terms of cost. In cloud-based services, token usage is a key factor in billing: charges are based on the number of tokens processed. This means that processing more tokens increases the service cost.

Max_tokens parameter: Using the max_tokens parameter is essential in API configuration. It acts as a controller that determines when to cut off the model’s response to avoid consuming more tokens than necessary or desired.

In practice, it sets an upper limit on the number of tokens the model will generate in response to a given input, and allows managing the balance between the completeness of the response and the associated cost.

Although tokens are not hyperparameters, they are an integral part of the model’s design and operation. The number of tokens processed and the setting of the max_tokens parameter are critical both for performance and for cost management in applications based on Generative Artificial Intelligence.

For all these reasons, Federico emphasized the need to find a proper balance between cost and outcome when providing the number of tokens the API can use.

Applicable Testing Approaches in Generative Artificial Intelligence

Testing applications using Generative Artificial Intelligence, like LLMs, presents unique challenges. Below, we explore various approaches applicable in this context. Each has its own advantages and considerations.

To categorize and order these approaches, let’s review the most known categorization. It separates them into black box and white box methods.

Black Box

Black Box

These approaches focus on the system’s inputs and outputs, regardless of the underlying code. The key here is understanding the domain, the system, and how people interact with it. Internal implementation comprehension is not necessary.

The major limitation is the lack of insight into what happens inside the system, i.e., its implementation. This can lead to less test coverage and redundant cases. We cannot control internal system aspects. Moreover, testing can only begin once the system is somewhat complete.

White Box

White Box

In contrast to black box methods, this approach requires a deep understanding of the system’s interior.

We can analyze how parameters and prompts are configured and used, and even how the system processes responses. This is feasible because we can look inside the box. We leverage available information to design tests.

White box approaches allow more detailed analysis, avoid redundant test execution, and create tests with greater coverage.

However, disadvantages exist: programming knowledge and API handling are necessary. This might lead to bias, precisely due to knowledge of the construction.

This could limit our creativity in generating tests. It might result in losing focus on the real experiences and challenges of people using the systems.

If you’re interested in diving deeper into testing concepts, we recommend you to read Federico Toledo’s book ‘Introduction to Information Systems Testing.’ ¡You can download it for free!

Quality Factor-Based Approach

Another high-level approach involves asking questions guided by different software quality factors. These include reliability, usability, accessibility, and security, among others. In each of these aspects, it’s important to consider both general and specific questions for systems based on Generative Artificial Intelligence.

Usability and Accessibility

As applications increasingly have conversational interfaces, we face the challenge of making them inclusive and adaptable. It’s crucial to ensure they are intuitive and useful for all people.

They should comprehend different voices under various conditions and provide understandable responses in different contexts and needs. Here, accessibility goes beyond mere compatibility with assistive tools. It seeks effective and accessible communication.


In this aspect, we aim to understand how easy it is to update and change the system. How can we automate tests for a system with non-deterministic results (that is, not always yielding the same results)? Are regression tests efficiently managed to prevent negative impacts on existing functionalities? Are prompts managed like code, easy to understand, well modularized, and versioned?


Within the security framework, GAI introduced a new type of vulnerability known as ‘Prompt Injection.’ Similar to the concept of SQL Injection, an attacker could alter the system’s expected behavior.

They do this by manipulating inputs maliciously, knowing that if not carefully processed and concatenated to a preset prompt, it could exploit this vulnerability. In testing, we have to consider these types of attacks. We review how the system handles unexpected or malicious inputs that could alter its functioning.

How does the system respond to inputs designed to provoke errors or unwanted responses? This evaluation is essential to ensure applications based on Generative Artificial Intelligence are safe and resistant to manipulation attempts.

We invite you to read this article! Tips for Using ChatGPT Safely in Your Organization


Here, we focus on how the system handles load and efficiency under different conditions. In the context of GAI, many questions arise.

How does our system perform when OpenAI’s service, used via the API, experiences slowness or high demand? This includes evaluating the system’s ability to handle a high number of requests per second and how this affects the quality and speed of responses.

Moreover, we need to consider efficient computational resource management, especially in relation to token usage and its impact on performance and operational costs. Analyzing performance under these circumstances is crucial to ensure that applications based on GAI are not only functional and safe but also scalable and efficient in a real-use environment.

Generative Artificial Intelligence: Why We Recommend Using Evals and Their Importance

Generative Artificial Intelligence - Evals - Ilustration

At Abstracta, we recognize the importance of responsible and efficient use of LLMs and the tools we build based on them. To keep pace with development, we face the challenge of automatically testing non-deterministic systems. The variability in expected outcomes complicates the creation of effective tests.

Faced with this challenge, we recommend adopting OpenAI’s framework known as ‘Evals’ for regression testing. It works through different mechanisms. To address the non-determinism of responses, it allows the use of an LLM as a ‘judge’ or ‘oracle.’ This method, integrated into our regression tests, lets us evaluate the results against an adaptable and advanced criterion. It aims to achieve more effective and precise tests.

What are OpenAI Evals?

It is a framework specifically developed for evaluating LLMs or tools using LLMs. This framework is valuable for several reasons:

  • Open-Source Test Registry: It includes a collection of challenging and rigorous test cases, in an open-source repository. This facilitates access and use.
  • Ease of Creation: Evals are simple to create and do not require writing complex code. This makes them accessible to a wider variety of people and contexts.
  • Basic and Advanced Templates: OpenAI provides both basic and advanced templates. These can be used and customized according to each project’s needs.

Incorporating Evals into our testing process strengthens our ability to ensure that LLMs and related tools function as intended. It allows us to maintain high standards of quality and efficacy.

If you’re interested in this topic, we recommend reading the article ‘Decoding OpenAI Evals.’ It offers a detailed view of this framework and its application in different contexts.

Future Challenges in Testing Applications Powered by Generative AI Models

We face multiple challenges every day. AI technology advances at striking paces. Therefore, building teams and sharing knowledge while continuing to experiment is vital.

Referring to software quality, possibly one of the most challenging aspects of testing Generative Artificial Intelligence is creating effective test sets. This area, filled with complexities and nuances, requires a combination of creativity, deep technical understanding, and a meticulous approach.

While this presents a major challenge, it also presents a singular opportunity to venture into uncharted territories in software quality.

The development of efficient test sets for Generative Artificial Intelligence poses significant challenges due to various factors:

  • Complexity and Variability of Responses

Generative Artificial Intelligence-based systems, like LLMs, can produce a wide range of different responses to the same input. This variability makes it difficult to predict and verify the correct or expected responses, complicating the creation of a test set that adequately covers all possible scenarios.

  • Need for Creativity and Technical Understanding

Designing tests for Generative AI requires not only a deep technical understanding of how these models work but also a significant dose of creativity. This is to anticipate and model the various contexts and uses that users might apply to the application.

  • Context Management and Continuity in Conversations

In applications like chatbots, managing context and maintaining conversation continuity are critical. This means that tests must be capable of simulating realistic and prolonged interactions to assess the system’s response coherence and relevance.

  • Evaluation of Subjective Responses

Often, the ‘correctness’ of a Generative Artificial Intelligence generated response can be subjective or context-dependent. This requires a more nuanced evaluation approach than traditional software testing.

  • Bias and Security Management

Generative AI models can reflect and amplify biases present in the training data. Identifying and mitigating these biases in tests is crucial to ensure the system’s fairness and safety.

  • Rapid Evolution of Technology

Generative Artificial Intelligence technology is constantly evolving. This means that test sets must be continually updated to remain relevant and effective.

  • Integration and Scalability

In many cases, Generative AI is integrated with other systems and technologies. Effectively testing these integrations, especially on a large scale, can be complex.

  • Observability

Given the complexity of systems, and as most approaches are black-box due to context, data, or code access restrictions, observability will play a key role in our effectiveness in testing.

Thanks to the ability to understand and make visible through observability platforms what happens inside the system, we can understand each step of internal processes. This allows us to associate inputs with behaviors and outputs, and to think of more and better tests.

These challenges mean that software quality in Generative AI requires an innovative and adaptable approach. It must keep up with the rapid development of this technology.

At Abstracta, we commit to navigating these challenges and accompanying businesses on this journey. We provide the knowledge and experience necessary to maximize the potential of Generative Artificial Intelligence applied to quality, from start to finish.

AI & Customized Copilots

Generative Artificial Intelligence - Generative AI Models - Customized Copilots

Why Choose Abstracta for AI Transformation Services?

  • We are an industry-leading firm focused on building quality software. We combine all our expertise in reliability, performance, and test automation into AI transformation projects.
  • We’ve successfully deployed several Generative AI solutions for real-use scenarios as well as different open-source tools to harness AI for positive long-lasting outcomes.
  • We’ve forged robust partnerships with industry leaders like Microsoft, Marvik, and GeneXus. This empowers us with the confidence to seamlessly incorporate cutting-edge technologies into our projects.
  • Constantly innovating, our team is committed to continuous learning by experimenting and applying the most recent AI breakthroughs.
  • We adopt a co-creation approach, meticulously tailored to meet your unique needs and objectives. This strategy can help ensure clear understanding and strong team support while aligning seamlessly with your core values and business priorities.

How Can We Help You?

Gen AI has huge potential, but defining where & how to integrate it needs several considerations and expertise. Don’t use AI just because it’s cool or powerful, make it count! We’ve already deployed successful Generative AI solutions for real use cases, and we would love to support you in this journey.

Find the Best Way to Implement Generative Artificial Intelligence in Your Team

🎧Easily Add your Own AI Assistant

We’ve developed an open-source tool that helps teams build their own web-based assistants, integrating private data over any existing web application just by using a browser extension.

This enables your users to ask different cognitive tasks with either voice commands, writing messages, or sending images without touching your own app’s code! Check it out at Github.

🗣️Harness Leaders and Empower your Team in a Safe Manner

Worried about your data? Worried about how to lead the challenges and changes that AI will bring to your organization? You don’t know how to onboard your leaders in digital transformation using AI? We’ve got you covered.

Using private-gpt your team can chat safely, share prompts and assistants using any LLMs using your own account, and control the budget. Check it out on Azure Marketplace.

🚀Craft Custom Solutions with the Support of an Experienced Partner

We’ve created AI-enabled tools and copilots for different technologies. These solutions aim to boost productivity and tackle challenges such as recruitment screening processes, supporting the team’s career path, and helping business experts with AI-based copilots, among others.

Whether starting to understand the latest LLMs capabilities or expert in Machine Learning practices, we help you find the right path to support you along the way, and step up your game.

Let us partner with you and build strong AI solutions together! Schedule a 30-minute call to understand how we can support you in maximizing the potential of AI to add real value to your business.

Would you like to know who our partners are? Find out here!

Follow us on Linkedin & X to be part of our community!

Tags In
395 / 444