AI Dev

The problem of AI agent instability and regression on the path from prototype to stable product

1. Describe the problem:

I have been in software development for 8 years. When you build an AI agent or an automation tool, the first 80% of the result comes easily. That is enough to attract users, but you can only retain them at 95–98% stability. The problem is that the path from 80% to 95% is very hard, and beyond that it is nearly impossible. When you fix one scenario (for example, by changing a prompt), others break. In systems with multiple AI agents, this instability compounds, leading to poor results.

Example: an automatic resume parser. Building a prototype is simple, but bringing it to stable operation is extremely difficult. Changing the prompt fixes one scenario but breaks others — a classic regression, just like in conventional programming.
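One common mitigation for the parser case is a golden-set regression suite: freeze a set of inputs with expected outputs and rerun it after every prompt change, so you see exactly which scenarios a change fixed and which it broke. A minimal sketch (the `parse_resume` stub and the sample data are hypothetical stand-ins for the real LLM-backed parser):

```python
# Golden-set regression check: rerun fixed examples after every prompt change
# and report which scenarios fail.

import re

def parse_resume(text: str) -> dict:
    # Stand-in for the real LLM-backed parser; here a trivial regex extractor.
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return {"email": email.group(0) if email else None}

# Frozen (input, expected output) pairs covering the scenarios that must not regress.
GOLDEN = [
    ("Jane Doe, jane.doe@example.com, 5 years Python", {"email": "jane.doe@example.com"}),
    ("No contact info provided", {"email": None}),
]

def run_regression(parser) -> list:
    """Return a list of (input, expected, got) tuples for every failing case."""
    failures = []
    for text, expected in GOLDEN:
        got = parser(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

failures = run_regression(parse_resume)
print(f"{len(GOLDEN) - len(failures)}/{len(GOLDEN)} golden cases pass")
```

Gating every prompt change on this suite turns "I think nothing else broke" into a measurable pass rate.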

In the parsing case, the response can at least be formalized to some degree, but with image generation, say, it is much harder — likewise with code generation. For outputs that are hard to formalize, the ultimate source of truth is the user.
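For the outputs that can be formalized, even a thin schema check catches many regressions automatically before a response reaches users or downstream agents; images and free-form code have no such oracle, which is why user feedback ends up as the ground truth there. A sketch of such a check (the field names are illustrative):

```python
# Validate that a model's structured response has the expected shape
# before it is passed downstream.

REQUIRED_FIELDS = {"name": str, "email": str, "skills": list}

def validate(response: dict) -> list:
    """Return a list of human-readable schema violations (empty list == valid)."""
    errors = []
    for field, typ in REQUIRED_FIELDS.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], typ):
            errors.append(f"{field}: expected {typ.__name__}, "
                          f"got {type(response[field]).__name__}")
    return errors

print(validate({"name": "Jane", "email": "j@x.com", "skills": ["python"]}))  # []
print(validate({"name": "Jane", "skills": "python"}))  # two violations
```

In production you would likely reach for a schema library rather than hand-rolled checks, but the principle is the same: reject malformed responses mechanically instead of discovering them in user reports.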

I don't know exactly what the solution should look like, but this is something I have experienced myself and heard about from other developers. It is probably some combination of a system for A/B testing prompts and a feedback system.

Also, if several AI agents run sequentially in your system, their instability begins to accumulate, and this leads to very poor end-to-end results.
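The accumulation across sequential agents is just multiplication of per-step reliability, which is why pipelines degrade so quickly; a quick illustration:

```python
# If each of n sequential agents succeeds independently with probability p,
# the whole pipeline succeeds with probability p**n.

def pipeline_reliability(p: float, n: int) -> float:
    return p ** n

for n in (1, 3, 5):
    print(f"{n} agents at 95% each -> {pipeline_reliability(0.95, n):.1%} end-to-end")
# Five agents at 95% each already drop below 78% end-to-end.
```

So even agents that individually clear the 95% retention bar can together produce a product well below it.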

2. How often does the problem occur?

This problem is not a one-off event but an unavoidable stage in the development of any reasonably complex AI product. I encounter it in every such project on the way from prototype to stable product.

3. What attempts have you made to solve the problem?

I fixed a seed in GPT, formalized responses, set the temperature to zero, and endlessly tweaked prompts. I wrote tests for the parsers and tried fine-tuning models (but that turned out to be expensive). The experiments took 2 to 4 weeks per project. In some cases, knowing about the problem in advance, it was easier to abandon complex ideas altogether.
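For reference, the determinism knobs I mean are the `temperature` and `seed` request parameters (OpenAI-style names; the wrapper function below is hypothetical, and even with these settings providers only promise best-effort reproducibility):

```python
# Build OpenAI-style chat-completion kwargs pinned for maximum reproducibility.
# Note: seed is best-effort only; backend changes can still alter outputs.

def deterministic_request(prompt: str, model: str = "gpt-4o-mini") -> dict:
    return {
        "model": model,
        "temperature": 0,                            # remove sampling randomness
        "seed": 42,                                  # best-effort determinism
        "response_format": {"type": "json_object"},  # force parseable output
        "messages": [{"role": "user", "content": prompt}],
    }

req = deterministic_request("Extract name and email from: Jane Doe, j@x.com")
print(req["temperature"], req["seed"])
```

These settings reduce run-to-run noise, but as described above they did not eliminate the regressions caused by prompt changes themselves.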

4. How much are you willing to pay for the solution?

Willingness to pay: 5,000–9,000 rubles ($50–$90). The problem is critical for serious B2B products that have the resources for quality. Right now it seems the solution would be either expensive fine-tuning of one's own model, or a dedicated service/specialist for improving prompts.

5. Problem author:

Name: Andrey
Country: USA
Contacts: Telegram