Maintainability sensors for coding agents

There are a number of dimensions we usually want to achieve and monitor in our codebases: functional correctness (works as intended), architectural fitness (is fast/secure/usable enough), and maintainability. I define maintainability here as making it easy and low risk to change the codebase over time – also known as “internal quality”. So I don’t only want to be able to make changes quickly today, but also in the future. And I don’t want to worry about introducing bugs or degradation of fitness every time I make a change – or have AI make a change. I usually see the first signs of cracks in the maintainability of an AI-generated codebase when the number of files changed for a small adjustment increases. Or when changes start breaking things that used to work.

Internal quality problems affect AI agents in much the same way as they affect human developers. An agent working in a tangled codebase might look in the wrong place for an existing implementation, create inconsistencies because it has not seen a duplicate, or be forced to load more context than a task should require.

In this article, I describe my experimentation with various sensors that help us and AI reflect on the maintainability of a codebase, and what I learned from that.

The application

I’m working on an internal analytics dashboard for community managers that reads chat space activity, engagement, and demographic data from a mix of APIs and presents the data in a web frontend.

Figure 1:
The example app: web UI, service layer, and external APIs.

The tech stack is TypeScript, NextJS, and React. The backend reads and joins data from the APIs. The application has been around for a while, but for the sake of these experiments I rebuilt it with AI from scratch.

There are hardly any guides (e.g. markdown files) for AI about code quality and maintainability in the codebase; I wanted to see how well it would do just by relying on sensor feedback.

Overview of all sensors used

Figure 2:
Where sensors can run: during the initial coding session, in the pipeline, on a schedule, and in production.

This is an overview of the sensors I set up across the path to production.

During coding session

Sensors that run continuously alongside the agent to provide fast feedback.

Type checker (computational)
ESLint (computational)
Semgrep, SAST tool prescribed by our internal AppSec team (computational)
dependency-cruiser, runs structural rules to check internal module dependencies (computational)
Test suite results including test coverage (computational – though the test suite is generated by AI, therefore created in an inferential way)
Incremental mutation testing (computational)
GitLeaks runs as part of the pre-commit hook; I consider it to be a sensor as well, as it will give the agent feedback when it tries to commit (computational)

After integration – pipeline

The same computational sensors run again in CI. The in-session sensors give the agent early feedback during development. The CI pipeline confirms the result on clean infrastructure and after integration.

Repeatedly

Sensors that run on a slower cadence to detect drift that accumulates over time, rather than errors that occur in the moment.

A security review, prompt derived from our AppSec checklist for internal applications (inferential)
A data handling review, prompt describes things like “no user names should ever be sent to the web frontend” (inferential)
Dependency freshness report, which runs a script first to get the age and activity of the library dependencies, and then has AI create a report with recommendations about potential upgrades, deprecations, etc. (computational and inferential)
Modularity and coupling review (computational and inferential)

With this context out of the way, let’s dive into the first category of sensors.

Base harnesses and models

Throughout building the application, I used a mix of Cursor, Claude Code, and OpenCode (in that order of frequency). My default model was usually Claude Sonnet; for some of the planning and analysis tasks I used Claude Opus, and for implementation tasks I frequently used Cursor’s composer-2 model.

Static code analysis: Basic linting

I’ll start with my learnings from using ESLint in this application. Basic linting tools like ESLint mostly target maintainability risk at the level of individual files and functions.

Rules for typical AI shortcomings

In my experience, the AI failure modes that are the most low-hanging fruit for static code analysis are

Max number of arguments for functions
File length
Function length
Cyclomatic complexity

However, these weren’t even active in ESLint’s default preset; I had to configure maximums for them first. Hopefully, static analysis tools will evolve to provide better presets for usage with AI. A bit of research shows that people are also starting to publish ESLint plugins with rule sets that specifically target known agent failure modes, like this one by Factory, with rules about things like requiring test files or structured logging.
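As a rough illustration of what configuring those maximums can look like, here is a minimal sketch in the shape of an ESLint flat config. The four rule names (max-params, max-lines, max-lines-per-function, complexity) are ESLint core rules; the threshold values and file globs are assumptions chosen for illustration, not the values used in this application, and the TypeScript parser setup is left out.

```typescript
// Sketch of the rules section of an ESLint flat config.
// All threshold values below are illustrative assumptions.
const maintainabilityRules = {
  // Max number of arguments for functions
  "max-params": ["warn", { max: 4 }],
  // File length, ignoring blank lines and comments
  "max-lines": ["warn", { max: 300, skipBlankLines: true, skipComments: true }],
  // Function length
  "max-lines-per-function": ["warn", { max: 60, skipBlankLines: true, skipComments: true }],
  // Cyclomatic complexity
  complexity: ["warn", { max: 10 }],
} as const;

// In a real eslint.config file this array would be the default export.
const config = [
  {
    files: ["**/*.ts", "**/*.tsx"],
    rules: maintainabilityRules,
  },
];
```

Using "warn" rather than "error" is also just an assumption here; either severity would surface the finding to the agent.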

Guidance for self-correction

A sensor is meant to give the agent feedback so that it can self-correct. Ideally, we want to give the agent additional context for that self-correction – a good kind of prompt injection. To do that, I built a custom ESLint formatter to override some of the default messages – with the help of AI, of course.

Here is an example of my guidance for the no-explicit-any warning.

We want things to be typed to make it easier to avoid mistakes, especially for key concepts.
But we also want to avoid cluttering our codebase with unnecessary types. Make a judgment
call about this. If you choose not to introduce a type, suppress it with:
// eslint-disable-next-line @typescript-eslint/no-explicit-any -- (give reason why)
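To make the mechanism more concrete, here is a minimal sketch of how a formatter could swap in such guidance. It is not the formatter used in this project. The two interfaces only mirror the fields of ESLint's formatter input that the sketch needs, and the GUIDANCE map and formatWithGuidance name are hypothetical.

```typescript
// Minimal shapes mirroring the parts of ESLint's formatter input used here.
interface LintMessage {
  ruleId: string | null;
  severity: number; // 1 = warning, 2 = error
  message: string;
  line: number;
  column: number;
}

interface LintResult {
  filePath: string;
  messages: LintMessage[];
}

// Hypothetical guidance texts, keyed by rule ID.
const GUIDANCE: Record<string, string> = {
  "@typescript-eslint/no-explicit-any":
    "We want things to be typed to make it easier to avoid mistakes, especially for key concepts. " +
    "But we also want to avoid cluttering our codebase with unnecessary types. Make a judgment call about this. " +
    "If you choose not to introduce a type, suppress it with: " +
    "// eslint-disable-next-line @typescript-eslint/no-explicit-any -- (give reason why)",
};

// One entry per message; guidance replaces the default text when there is any.
function formatWithGuidance(results: LintResult[]): string {
  const lines: string[] = [];
  for (const result of results) {
    for (const msg of result.messages) {
      const level = msg.severity === 2 ? "ERROR" : "WARN";
      const ruleId = msg.ruleId ?? "unknown";
      const text = GUIDANCE[ruleId] ?? msg.message;
      lines.push(`${level} ${result.filePath}:${msg.line}:${msg.column} ${ruleId}\n  ${text}`);
    }
  }
  return lines.join("\n");
}
```

To plug something like this into ESLint, the function would be exported from its own module and that file passed via the --format (-f) option. Rules without an entry in the map keep their default message.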

Managing warnings – now more feasible?

Static code analysis has been around for a long time, and yet, teams often didn’t use it consistently, even when they had it set up. One of the reasons for that is the management overhead that comes with it. Effective use of this analysis requires a team to keep a “clean house”, otherwise the metrics just become noise. In particular, warnings like the no-explicit-any example above are tricky, because you don’t always want to fix them – it depends. And suppressing them one by one has always felt tedious, and like noise in the code.

With coding agents, we might now have a chance at that clean baseline. In the guidance text above, the agent is instructed to make a judgment call, and allowed to suppress a warning in the code. This keeps the suppressions manageable, visible and reviewable.

For thresholds, like the maximum number of lines, or the maximum allowed cyclomatic complexity, I told the agent in the lint message that it can slightly increase the thresholds if it thinks that a refactoring is unnecessary or impossible in a particular case. This doesn’t suppress the threshold forever, just increases it, so that the rule fires again if it gets even worse in the future. Constraints are preserved without forcing a binary suppress-or-comply choice.
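The text above does not say how a raised threshold is recorded, so the following is only one possible shape, sketched under assumptions: a flat-config style array where a later, file-specific entry overrides the base limit. The file name, the limits, and the reason text are all hypothetical.

```typescript
// Base limit that applies everywhere (illustrative value).
const BASE_COMPLEXITY = 10;

// Hypothetical exceptions recorded by the agent: which file, the raised limit, and why.
const raisedLimits = [
  {
    files: ["server/services/report-builder.ts"],
    max: 12,
    reason: "single switch over report types; splitting it up would hurt readability",
  },
];

// In a flat config, later entries win for the files they match, so the rule
// stays active with a higher ceiling instead of being switched off.
const thresholdConfig = [
  { files: ["**/*.ts", "**/*.tsx"], rules: { complexity: ["warn", { max: BASE_COMPLEXITY }] } },
  ...raisedLimits.map((exception) => ({
    files: exception.files,
    rules: { complexity: ["warn", { max: exception.max }] },
  })),
];
```

Keeping the exceptions in one list with a reason next to each also gives a reviewer a single place to look at what was loosened.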

Observations

Looking at the exceptions AI created (suppressed warnings, increased thresholds) was a good point to start my code review.
AI frequently decided to increase the cyclomatic complexity threshold, but suggested good refactorings when I nudged it further. It was the only category where it did that, and I later discovered that I didn’t have self-correction guidance in place for this one, so there was no explicit instruction saying that a threshold increase should be the absolute exception. That’s an indicator that the custom lint messages can indeed make quite a difference.
Sometimes I want to treat rules differently in different parts of the code. Let’s take no-console, telling AI off when it uses console.log. In the backend, I want it to use a logger component instead. In the frontend, I might want to not use direct logging at all, or at the very least I want to use a different logging component. This is another example of the power of the self-correction guidance, and where AI can help with semantic judgment and management of analysis warnings.
I was watching out for examples of trade-offs between rules. The only one I’ve seen so far was created by the max-lines and max-lines-per-function rules. I’ve seen AI do quite a bit of useful refactoring and breakdown into smaller functions and components as a result of this sensor feedback. However, in the React frontend, I’m seeing a worrying trend of components with lots and lots of properties as a result of passing values through a growing chain of smaller and smaller components. I haven’t got useful observations yet about how good AI might be at making consistent decisions between trade-offs like that.
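The no-console observation above could be sketched as path-dependent guidance. This is an assumption about how such a lookup might work, not the project's code: the server/ prefix follows the folder names used elsewhere in this article, while the app/ prefix, the wording, and the guidanceFor name are hypothetical.

```typescript
// Hypothetical guidance entries: the first entry matching rule and path wins.
interface GuidanceEntry {
  ruleId: string;
  pathPattern: RegExp;
  guidance: string;
}

const PATH_GUIDANCE: GuidanceEntry[] = [
  {
    ruleId: "no-console",
    pathPattern: /^server\//,
    guidance: "Backend code should log through the logger component instead of console.",
  },
  {
    ruleId: "no-console",
    pathPattern: /^app\//,
    guidance: "Frontend code should avoid direct logging; if it is needed, use the frontend logging component.",
  },
];

// Returns the guidance for a rule in a given file, or undefined if there is none.
function guidanceFor(ruleId: string, filePath: string): string | undefined {
  return PATH_GUIDANCE.find(
    (entry) => entry.ruleId === ruleId && entry.pathPattern.test(filePath),
  )?.guidance;
}
```

The same rule then yields different instructions for the agent depending on where the offending file lives.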

Most important takeaways

Overall, I was positively surprised by how many things I can cover with static analysis. I had to remind myself a few times why it has been somewhat underused in the past, and what has changed: the cost-benefit balance. Cost is reduced because it’s a lot cheaper to create custom scripts and rules with AI. And the benefit has also increased: the analysis results help me get a first sense of lots of hygiene factors that wouldn’t even happen that much when I write code myself, so I can get common AI mistakes out of the way.

However, I can’t help but wonder if this can also lead to a false sense of security and an illusion of quality. After all, another reason why linters like this were less used in the past is that they have limits, and we have been wary of using them as a simplified indicator of quality. There are lots of more semantic aspects of quality that static analysis cannot catch; it remains to be seen if AI can adequately fill that gap in partnership with these tools. I also discovered new supposed issues in the code every time I activated a new set of rules. It was always a mix of irrelevant things and things that actually matter. So I worry about feedback overload for the agent, sending it into a spiral of over-engineered refactorings.

Static code analysis: Dependency rules

Basic linting is mostly focussed on quality and complexity within a file or function. Next I started looking into sensors that could give me and the agent feedback about maintainability concerns that cross file and module boundaries. Analysis tools in this area are historically even more underused than the basic linting.

To learn about the potential of sensors that can help us and AI keep up good modularity within a codebase, I explored three things:

Dependency rules (deterministic)
Coupling analysis (deterministic and inferential)
Modularity review (inferential)

Let’s start with dependency rules. I worked with the agent to come up with a layered module structure for my application, about halfway through implementing it. I asked it to help me write dependency-cruiser rules to enforce those layers.

Figure 3:
Layered module structure and dependency rules

For example, one of the rules enforces that code in the clients folder never imports anything from the services folder:

{
  name: "clients-no-services",
  comment:
    "API clients must not depend on the orchestration layer above them. " + LAYERS,
  severity: "error",
  from: { path: "^server/clients/", pathNot: "/__tests__/" },
  to: { path: "^server/services/" },
},

As with the ESLint messages, I also expanded the error messages a bit to be self-correction guidance, recapping the layering concept as a whole:

ERROR  clients-no-services
  API clients must not depend on the orchestration layer above them.
  [Layers: routes -> services -> clients + domain; Services orchestrate: fetch data via clients, compute via domain -- no I/O, no SDKs, no knowledge of data fetching.]
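To show how the shared LAYERS recap and the rule fit together, here is a self-contained sketch. The first rule is the one quoted above; the second rule, the server/domain path, and the violatedRules helper are hypothetical. The helper is a deliberately simplified stand-in for the tool's matching (it regex-tests both ends of a single import edge) and is not how dependency-cruiser is implemented.

```typescript
// Shared recap appended to every rule comment (wording taken from the message above).
const LAYERS =
  "[Layers: routes -> services -> clients + domain; Services orchestrate: fetch data via clients, " +
  "compute via domain -- no I/O, no SDKs, no knowledge of data fetching.]";

interface ForbiddenRule {
  name: string;
  comment: string;
  severity: "error" | "warn";
  from: { path: string; pathNot?: string };
  to: { path: string };
}

const forbidden: ForbiddenRule[] = [
  {
    name: "clients-no-services",
    comment: "API clients must not depend on the orchestration layer above them. " + LAYERS,
    severity: "error",
    from: { path: "^server/clients/", pathNot: "/__tests__/" },
    to: { path: "^server/services/" },
  },
  {
    // Hypothetical second rule, assuming a server/domain folder.
    name: "domain-no-clients",
    comment: "Domain code computes on data it is given and must not fetch it. " + LAYERS,
    severity: "error",
    from: { path: "^server/domain/", pathNot: "/__tests__/" },
    to: { path: "^server/clients/" },
  },
];

// Simplified stand-in for the tool's matching: names of rules violated by one import edge.
function violatedRules(fromFile: string, toFile: string): string[] {
  return forbidden
    .filter(
      (rule) =>
        new RegExp(rule.from.path).test(fromFile) &&
        !(rule.from.pathNot !== undefined && new RegExp(rule.from.pathNot).test(fromFile)) &&
        new RegExp(rule.to.path).test(toFile),
    )
    .map((rule) => rule.name);
}
```

Because every comment ends with the same recap, each violation message reminds the agent of the whole layering concept, not just the one edge it broke.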

Observations

Without AI, I would not have gotten these rules in place quickly. The tool’s configuration syntax has a steep entry cost, and AI absorbed that cost almost entirely.
The agent violated the rules a handful of times after I introduced them, and then self-corrected based on dependency-cruiser feedback, so it did help maintain my folder concepts.
I also used the same approach to introduce conventions for how React hooks should be structured in the frontend.
I had to figure out how to catch things when AI starts creating new folders outside of this structure, with a rule that requires every new file to be somewhere in the predefined folder structure.
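The last observation does not show the rule itself, so the following is a plain TypeScript stand-in for the idea rather than dependency-cruiser syntax: an allow-list of folders and a check that reports files outside of it. The folder names are assumptions based on the layer names mentioned above.

```typescript
// Hypothetical allow-list of locations; adjust to the real layout.
const ALLOWED_ROOTS = [
  "server/routes/",
  "server/services/",
  "server/clients/",
  "server/domain/",
];

// Returns the files that live outside every allowed root.
function filesOutsideStructure(files: string[]): string[] {
  return files.filter(
    (file) => !ALLOWED_ROOTS.some((root) => file.startsWith(root)),
  );
}
```

Run against the list of files in a change, a non-empty result would be the signal for the agent that it has started a folder the layering does not know about.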

Most important takeaways

At the point when I introduced these rules, the structuring of code into folders had already become a little bit haphazard. I could see how the rules helped the agent clean that up, and then continue to enforce those layers going forward. So I’ve found it quite a useful replacement for describing code structure in a markdown guide. However, tools like this are limited to what is expressible in terms of imports, file names, and folder structure.

Static code analysis: Coupling data

Next, I experimented with the extraction of typical coupling metrics from my codebase, i.e. the number of incoming and outgoing imports and calls per file.

I didn’t use any existing tools for this; instead I had a coding agent write an application that creates these metrics with the help of the TypeScript compiler, so that I could have maximum flexibility to play around with this as part of my experimentation. I had it add two interfaces: a web interface with a bunch of different visualisations of these metrics for my own human consumption, and a CLI that can provide these metrics to a coding agent.
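As a simplified illustration of the metrics themselves, the sketch below counts incoming and outgoing imports per file from a list of import edges. It leaves out the part the actual tool does with the TypeScript compiler (extracting the edges, and calls) and only shows the counting; all type and function names are hypothetical.

```typescript
interface ImportEdge {
  from: string; // importing file
  to: string; // imported file
}

interface CouplingMetrics {
  fanIn: number; // number of distinct files importing this file
  fanOut: number; // number of distinct files this file imports
}

function couplingByFile(edges: ImportEdge[]): Map<string, CouplingMetrics> {
  const incoming = new Map<string, Set<string>>();
  const outgoing = new Map<string, Set<string>>();
  const files = new Set<string>();

  const add = (index: Map<string, Set<string>>, key: string, value: string): void => {
    const set = index.get(key) ?? new Set<string>();
    set.add(value);
    index.set(key, set);
  };

  edges.forEach(({ from, to }) => {
    add(outgoing, from, to);
    add(incoming, to, from);
    files.add(from);
    files.add(to);
  });

  // Sets deduplicate repeated imports of the same file.
  const metrics = new Map<string, CouplingMetrics>();
  files.forEach((file) => {
    metrics.set(file, {
      fanIn: incoming.get(file)?.size ?? 0,
      fanOut: outgoing.get(file)?.size ?? 0,
    });
  });
  return metrics;
}
```

A CLI for an agent could print this map sorted by fanIn, while a web view could render the same edges as a matrix.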

Figure 4:
Coupling metrics: web visualisations and CLI for agents.

For human consumption

Most of these visualisations are well established concepts, like a dependency structure matrix (DSM). I found them tedious to interpret, and even though they were vibe coded and could most certainly be improved, I think that had more to do with the nature of the data. It’s quite detailed data that needs a lot of context and experience to interpret it, and map it back to more high level good practices. So I have a feeling that these types of tools still won’t really help reduce a human’s cognitive load much when reviewing codebases that were changed by AI.

For AI consumption

I gave an agent access to this custom CLI (coupling-analyser) and asked it to create a report based on the data, including suggestions of how to improve the critical issues.

Here is an excerpt of what that prompt looked like – I’m mainly reproducing this to show you that I didn’t actually give it much guidance on what good or bad modularity looks like; I mostly left that interpretation to the model:

Produce a markdown report on modularity and coupling quality for the target TypeScript codebase, grounded in actual CLI output from npx coupling-analyser, not guesswork from static browsing alone.

Gather evidence (run the CLI)

Execute the CLI and capture stdout. Use the report subcommands – combine as useful for the question:
…

Write the markdown report

Use clear headings. Prefer concrete module IDs / paths and numbers quoted or paraphrased from CLI output.

Suggested sections:

Context – What was analyzed
Executive summary – 2–5 bullets: overall modularity posture, top 1–3 systemic issues.
Findings from the tool – Summarize hotspots, top risks, notable cycles or mutual dependencies, and behavioural highlights as reported by the CLI.
Interpretation (modularity lens) – Tie metrics to software design: cohesion vs. spread of change, stability vs. dependency direction, fan-in/fan-out intuition, cycle impact.
Deep dives for each high and critical issue

What it is – Module(s), role in the system, dependency neighbours (from CLI + minimal code peek if needed).
Responsibilities today …
Why it hurts …
Design options (2+ where reasonable) …
Why the new design is better – Fewer cycles, clearer dependency direction, smaller surfaces, test seams, align with likely change vectors.
Future change risk – How each option reduces regression risk and makes safe evolution cheaper (concrete scenarios: “adding X”, “swapping Y”, “shipping Z independently”).

…

This LLM-led analysis actually pointed me to the same coupling hot spots that I would have found by looking through the visual diagrams, just in a format that was more digestible. And asking the LLM to ground its analysis in the results from the deterministic tool gave me a higher level of confidence, and probably also used less time and fewer tokens than if the agent had scanned the codebase itself to find coupling problems.

Observations

What the LLM found based on this data was quite lackluster (I used Claude Opus 4.7 for this):

It said one of the biggest issues was a factory that initialises all the necessary components, but I had introduced that factory on purpose as a component that acts like a lightweight dependency injection framework.
Another issue it had was with a shared (zod) schema between frontend and backend, declared a “god module” by the LLM. This is a common pattern though to create an explicit contract between backend and frontend, and is not as much of an issue when backend and frontend evolve together anyway, or even live together in the same repo, like in my case.
When legitimate patterns appear as high-coupling hubs, there needs to be a way to suppress those in future analyses, otherwise they create even more noise.
The only kind of interesting finding it had: an index.ts file in the domain folder indiscriminately exposed all files in ./domain, and is imported in lots of places. While that is also a common pattern to create explicit contracts for a layer, it does have its pros and cons, and is at least worth an investigation to see if it is appropriate for this codebase.

Most important takeaways

The examples above show that even more so than with the basic linting, good and bad do not have a clear definition; instead it is all about what is appropriate. And what coupling is appropriate depends on a lot of context, not just the raw call and import graph of a codebase. So based on this small experiment, I don’t have the impression that this type of coupling data is useful to AI on its own.

A more practical use I can imagine for this data is during risk triage for code review. When I review a code change made by AI, it seems useful to know what the impact radius of the changed files is, so that I can pay more attention when e.g. a file with 10+ callers is changed. Or an AI review agent could use the data to prioritise where it spends its tokens.
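The triage idea could be sketched like this. It is an untested idea from the paragraph above rather than something that exists in the project: the threshold of 10 comes from the example in the text, while the function name, the input shape, and the sample numbers in the comments are hypothetical.

```typescript
// Fan-in per file, e.g. taken from a coupling report.
type FanInByFile = Record<string, number>;

interface TriageItem {
  file: string;
  fanIn: number;
  highImpact: boolean;
}

// Sorts changed files by impact radius and flags those at or above the threshold.
function triageChangedFiles(
  changedFiles: string[],
  fanIn: FanInByFile,
  threshold = 10,
): TriageItem[] {
  return changedFiles
    .map((file) => {
      const callers = fanIn[file] ?? 0; // files missing from the report count as 0
      return { file, fanIn: callers, highImpact: callers >= threshold };
    })
    .sort((a, b) => b.fanIn - a.fanIn);
}
```

A human reviewer would start at the top of the resulting list; a review agent could use the highImpact flag to decide which diffs deserve a deeper read.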

Static code analysis: AI modularity review

The lackluster results from the coupling data experiment could have multiple reasons:

My prompt about what to analyse was not very specific
The coupling data is not useful to AI
The coupling data alone is too shallow and lacks context of the full code

So the final thing I did was to go fully down the inferential route and use Vlad Khononov’s “Modularity Skills” to analyse the codebase design and find modularity issues. This proved to be very fruitful! It gave me lots of interesting pointers for refactorings that would clearly reduce the risk of future changes. I ran the skills a second time and gave them access to my coupling analysis CLI. The AI mostly found confirmation in the data, but not any additional findings. On the contrary, it pointed out lots of things that the CLI was missing. It’s also worth noting that the second run of the analysis (without context of the first one) surfaced yet another issue that the first run did not find. A useful reminder that when it matters, it is often worth running an LLM-based analysis multiple times, to get a fuller picture.

Observations

Here are some highlights from the results (model used was Claude Opus 4.7, same as for the coupling analysis):

Duplicate route code – all three of my backend endpoints had their own route file, and each of those route implementations was almost identical. So whenever I would want to introduce a change to the general principles of the backend API (let’s say introducing a request ID, or changing the error handling or logging approach), I’d have to do it in multiple files. I had only just introduced a third endpoint, so I think it’s fair enough that this wasn’t abstracted out yet. But in my experience, AI agents usually don’t go ahead and start refactoring without an explicit nudge when they repeat a piece of code for the third or fourth time; they’re quite happy to copy and paste.
Inconsistency in calling the backend – or put another way, yet another form of semantic duplication. I have 3 pages in the application that need to call the backend with the same set of parameters (selected chat space, and which date range to analyse). Two of those pages were using the same hook and general approach to do this, but when AI introduced the third page, it deviated from that and reimplemented similar behaviour in its own way. This can e.g. lead to inconsistencies in error handling, or again the need to change multiple files when backend API principles change.
Inefficient handling of the core arguments – As just mentioned, all the pages in the application pass on a chat space ID and a date range to the backend. I had already noticed when I changed the way a user can specify a date range that AI had to change lots of files for that change – over 40! So I was already aware that something was fishy here, and the analysis confirmed it: “Issue: Request parameters repeated at every level”. The recommendation was to introduce an object that wraps all of these parameters. AI had already done that in a way – but never fully followed through with the usage of that object, so it was an inconsistent mess.
Responsibilities in the wrong place – The review found a bit of authentication code sitting inside our factory that was supposed to only be responsible for wiring up our modules. It implemented a fallback to mock data when the user is not authenticated. An unexpected location like that creates a risk of being missed when new routes are added.
Better interpretation of acceptable high-import-count “hubs” – Remember the “god classes” found by my previous coupling analysis? The modularity skills also noticed these, but in both cases wisely pointed out that they have a purpose in the context of this application. I assume that is either due to the good prompting in these skills, or due to the fact that this analysis actually read what was in the code, whereas I asked the other one to only rely on the coupling data.
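The recommendation for the core arguments, an object that wraps the request parameters, could look roughly like the sketch below. The field names, the query parameter names, and the /api/engagement path are assumptions for illustration and not taken from the application.

```typescript
// Hypothetical shape for the parameters every page sends to the backend.
interface AnalysisParams {
  chatSpaceId: string;
  dateRange: { from: string; to: string }; // ISO dates, e.g. "2025-01-31"
}

// The one place that knows how the parameters are serialised for the backend API.
function toQueryString(params: AnalysisParams): string {
  const pairs: [string, string][] = [
    ["chatSpaceId", params.chatSpaceId],
    ["from", params.dateRange.from],
    ["to", params.dateRange.to],
  ];
  return pairs.map(([key, value]) => `${key}=${encodeURIComponent(value)}`).join("&");
}

// Callers pass the object along as a whole instead of three separate arguments.
function engagementUrl(params: AnalysisParams): string {
  return `/api/engagement?${toQueryString(params)}`;
}
```

With a shape like this, changing how a date range is specified should mostly touch the type and the serialiser, rather than every function signature the values pass through.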

Most important takeaways

Dependency parsers like dependency-cruiser can be effective live sensors to enforce some basic folder structures and dependency directions, but they can only go so far.
The AI modularity review is a great example of “garbage collection”, and worked quite well when given powerful prompts. Grounding it in actual coupling data didn’t seem to make much difference. It would be great to find a way to apply this to the changed files in a commit, to have this earlier in the pipeline, but I did not explore this yet.
I ran the modularity review after building most of the codebase without applying that kind of review myself – and it had some quite concerning and very valid findings that would have increased risk in the future. It shows that without human review and coupling expertise, AND without these additional AI reviews, the agent was definitely compounding inadvertent technical debt.

Overall, codebase design and modularity seem like a concern where computational sensors alone cannot help us much; AI is needed to add semantic interpretation and consider trade-offs.

In the next update to this article, I will share about regression testing’s role as a sensor, and my experience with using coverage and mutation testing on AI-generated test suites.

