What AI coding benchmarks still miss about software quality
Date:
Thu, 21 May 2026 10:12:43 +0000
Description:
AI coding benchmarks miss long-term code quality degradation from repeated iterative changes.
FULL STORY ======================================================================
Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests?
This is a useful question, but it is too narrow. Software development is iterative. Requirements change and edge cases appear. Old design decisions become constraints on new work. Code that passes today can still make the
next change slower and more expensive, while also increasing risk. The gap matters more as AI raises the volume of code change. When generation gets cheap, the real question shifts from can the agent produce a working patch?
to what kind of codebase does repeated agent use create over time?

By Andrian Budantsov, CEO of Hypersequent.

A recent paper, SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (Orlanski et al.), gets closer to that question than most benchmark work. Instead of scoring one-shot solutions, it makes agents extend their own prior code across 20 problems and 93 checkpoints.
Each checkpoint changes the specification. The agent does not start fresh and is not given an internal design to follow. It has to live with earlier choices.
This setup is closer to real development than most benchmark suites, because real teams inherit yesterday's shortcuts.

Green tests can hide a worse codebase

The paper tracks two quality signals alongside correctness.
Verbosity measures redundant or duplicated code. Structural erosion measures how much of a codebase's complexity gets trapped inside functions that are already too complex.
Those are failure modes familiar to every engineering manager. A system can keep passing tests while more logic gets pushed into the same large functions and more special cases get bolted on. More files need to be touched for every feature. The software still works, but becomes more difficult to change.
The code-search problem in the benchmark is a good example. At
first, the system only needs to find Python code using exact text or regular expressions. Later, it needs to handle more languages, match on code structure (AST matching), and even fix problems automatically.
If the initial design is too strict and makes early assumptions, it might
pass the first tests but won't be able to handle the complex, later requirements easily.
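One way to picture a less rigid starting point is a small extension seam between the search loop and the matching strategy. The sketch below is purely illustrative: `Match`, `Matcher`, `ExactMatcher`, `RegexMatcher`, and `search` are hypothetical names, not taken from the benchmark or the paper.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Match:
    line_no: int
    text: str


class Matcher(Protocol):
    """Hypothetical extension point: one implementation per matching strategy."""

    def find(self, source: str) -> Iterable[Match]: ...


class ExactMatcher:
    def __init__(self, needle: str) -> None:
        self.needle = needle

    def find(self, source: str) -> Iterable[Match]:
        # Report every line that contains the literal text.
        for i, line in enumerate(source.splitlines(), start=1):
            if self.needle in line:
                yield Match(i, line)


class RegexMatcher:
    def __init__(self, pattern: str) -> None:
        self.pattern = re.compile(pattern)

    def find(self, source: str) -> Iterable[Match]:
        # Report every line where the pattern matches anywhere.
        for i, line in enumerate(source.splitlines(), start=1):
            if self.pattern.search(line):
                yield Match(i, line)


def search(source: str, matchers: Iterable[Matcher]) -> list[Match]:
    """Run every matcher; a later AST-based strategy would plug in here."""
    results: list[Match] = []
    for matcher in matchers:
        results.extend(matcher.find(source))
    return results
```

With a seam like this, a later checkpoint can add a new matcher class instead of threading more branches through one growing search function.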
The results are clear. None of the evaluated agents solved any problem end to end. The best strict solve rate was 17.2 percent, and by the final checkpoint strict solve rates fell to 0.5 percent. Across trajectories, verbosity rose
in 89.8 percent of runs and structural erosion in 80 percent.
The comparison with human-maintained code is even more useful. Against 48 maintained Python repositories, agent-generated code was 2.2 times more verbose and more structurally eroded.
When the authors tracked 20 of those repositories over time, the human code was comparatively flat while the agent code kept worsening with each iteration.
A passing suite tells you the latest version satisfied known checks. It does not tell you whether the code is becoming more fragile or more expensive to extend.

Why this matters for QA

For QA leaders, there are two key takeaways. The first is obvious: AI-built product code can degrade under repeated change even while current tests stay green. Teams may read continued output as proof that the system is healthy. In reality, they may be accumulating future regression cost at higher speed.
The second is closer to home. QA teams are now using AI tools to write and maintain tests, especially functional UI automation in tools like Playwright. That work follows the same pattern as the paper: the product changes, the
test has to change, the next feature adds another branch, another selector, another exception, another helper.
The paper is about coding broadly, not automation test suites specifically, but the mechanism carries over. A test suite can also become verbose and structurally weak under repeated AI-assisted edits.
A degraded test suite is harder to notice than degraded product code. The pipeline can still be green and the suite can still look larger on paper. Coverage can appear to improve.
Meanwhile, the core asset might be degrading. This could include bad selectors, weak checks, copied test steps, overly large helper functions, and UI tests that are hard to fix and easy to doubt. While test flakiness is obvious, problems like tests that don't do much or tests that run very slowly might not be noticed right away.
For QA leaders, that shifts the job. Quality assurance cannot stop at validating the latest output against today's requirements. It also has to watch whether repeated change is damaging both the product and the test
system that is supposed to protect it.

Prompting will not solve this by itself

The paper also tested whether better prompts could control the drift. They helped at the start, but not for long. Quality-aware prompts lowered initial verbosity and erosion. One anti-slop prompt cut initial verbosity by about a third on GPT-5.4.
The effect on the trajectory was minimal, though: cleaner starting points still degraded at roughly the same rate, and the better-looking code did not reliably improve pass rates.
In some cases, the prompts increased cost.
Many organizations treat prompting as a governance layer. While this helps,
it is not enough. If the workflow keeps asking an agent to extend its own
code under changing requirements, the organization still needs controls outside the prompt.

A better way to evaluate AI-assisted development

To manage AI-assisted development well, you need to look past quick wins. Check the code changes after a few adjustments, not just the first fix. Watch out for complex or repeated parts in the code.
Don't confuse success on the current feature with confidence in long-term stability. Treat maintainability as a release risk, especially for systems that deal with cost, user identity, access
rights, money, or rules.
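Checking code after several adjustments can be as simple as recomputing a quality signal on each snapshot and flagging the ones where it rose. The sketch below uses a crude duplicate-line ratio as a stand-in for verbosity; it is not the paper's metric, and `min_length`, `tolerance`, and both function names are assumptions.

```python
from collections import Counter


def duplicate_line_ratio(source: str, min_length: int = 12) -> float:
    """Fraction of non-trivial lines that repeat an earlier line.

    Lines are compared after stripping whitespace, and lines shorter
    than `min_length` characters are ignored as noise.
    """
    lines = [ln.strip() for ln in source.splitlines()]
    lines = [ln for ln in lines if len(ln) >= min_length]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(lines)


def worsening_checkpoints(snapshots: list[str], tolerance: float = 0.0) -> list[int]:
    """Indices of snapshots whose ratio rose by more than `tolerance`."""
    ratios = [duplicate_line_ratio(s) for s in snapshots]
    return [
        i for i in range(1, len(ratios))
        if ratios[i] - ratios[i - 1] > tolerance
    ]
```

Feeding it the same file as it stood after each change gives a rough trend line; a steady run of flagged indices is a prompt for human review, not a verdict.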
The same rule applies to tests. Review how AI-generated test code changes after several product iterations. Watch for suites that grow faster than
their signal and UI tests that absorb behavior better covered at lower
levels.
Also be aware of self-healing maintenance that subtly lowers assertion strength. A larger suite doesn't automatically mean better control.
Quality needs to move upstream. By the time a feature reaches final validation, some of the damage may already be baked into the path the system took to get there.
QA needs a voice earlier in the loop: in design constraints, review
standards, regression strategy, and the definition of acceptable change quality for both product code and test code.
Ultimately, passing tests still matters, but as AI increases the volume of code change, the more useful question is whether each successful change
leaves the codebase safer to extend or more dangerous to touch.

This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.
The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here:
https://www.techradar.com/pro/perspectives-how-to-submit
======================================================================
Link to news story:
https://www.techradar.com/pro/what-ai-coding-benchmarks-still-miss-about-software-quality
--- Mystic BBS v1.12 A49 (Linux/64)
* Origin: tqwNet Technology News (1337:1/100)