A couple of years ago, voters in a large state approved a multi-billion-dollar bond to overhaul how behavioral health services are funded, delivered, and measured. The ambition is broad, covering physical infrastructure, treatment capacity, populations the existing system has failed for decades, and the way counties plan, providers operate, and the state oversees all of it. The public, which gets to pay off the bond in the form of higher taxes, will want to see a return on what it voted for.
Transformation at this scale has a terrible track record. Large public-sector programs often arrive late, run over budget, or deliver a fraction of what was promised, and the causes are rarely singular. They are often tales of drift across technology, operations, workforce, and stakeholder alignment, and by the time a traditional assessment catches the problems, the report tends to read like an autopsy. Of course, by then the program was likely dead long before the report arrived.
A state agency I worked with recently was carrying that weight. They’d contracted a major systems integrator for the program and sought an independent view on whether the intended outcomes were likely to be realized, one that could be continuously updated rather than handed over as a quarterly or annual verdict.

The Scale Problem
Intended outcomes in a program like this span a wide group of people and organizations, including people receiving services, counties implementing processes, providers delivering care, partners integrating with state systems, the public that votes and pays, and the state itself. Getting a sense of where things are headed at this scale has traditionally meant interviews, document reviews, and a snapshot read at a point in time. Those snapshots arrive quarterly when things are going well and annually when they aren’t, and by the time patterns show up in that format, the window for course correction has already narrowed. The assessment ends up reporting on what happened rather than shaping what happens next.
Traditional oversight sometimes looks like status theater, where dashboards turn green because nobody wants to be the one to color them yellow, and where lagging indicators arrive so late they’re more useful for assigning blame than for adjusting course.
A Different Kind of Lens
Rather than default to the traditional approach, we designed something different. The underlying idea was to use AI to widen the lens and sharpen the resolution at the same time, picking up more signals from more sources across more stakeholder groups faster, and tying all of it back to the outcomes the bond was meant to produce.
One piece of the framework was a metrics architecture. Outcome metrics are lagging by definition, so on their own they don’t help much while you still have time to do something about them. What we needed were the leading and enabling metrics that should move first if the outcomes were going to land, along with a way to judge whether those leading indicators were real or aspirational. AI helped generate and pressure-test those chains, and then stitched them into an incremental path where reaching a certain level of confidence for year two was a precondition for claiming confidence about year five.
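To make the shape of that concrete, here is a minimal sketch of how an outcome chain might be represented in code. The metric names, step structure, and gating rule are hypothetical simplifications rather than the program's actual framework; the point is only that confidence in a later-year outcome is computed from the precursors that are already due.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    """One link in the chain: enabling, leading, or outcome."""
    name: str
    kind: str                      # "enabling" | "leading" | "outcome"
    target_year: int               # program year by which this metric should move
    observed: float | None = None
    target: float | None = None

    def on_track(self) -> bool:
        # A metric with no observation yet is aspirational, not evidence.
        if self.observed is None or self.target is None:
            return False
        return self.observed >= self.target

@dataclass
class OutcomeChain:
    """Ties an outcome commitment to the leading and enabling metrics that should move first."""
    outcome: Metric
    precursors: list[Metric] = field(default_factory=list)

    def confidence(self, as_of_year: int) -> float:
        # Only precursors already due count as evidence; confidence is the
        # share of those that are actually moving toward their targets.
        due = [m for m in self.precursors if m.target_year <= as_of_year]
        if not due:
            return 0.0
        return sum(m.on_track() for m in due) / len(due)

# Hypothetical example: confidence in a year-five outcome is gated on earlier precursors.
chain = OutcomeChain(
    outcome=Metric("reduction_in_untreated_population", "outcome", target_year=5),
    precursors=[
        Metric("county_plans_approved_pct", "enabling", target_year=1, observed=0.9, target=0.8),
        Metric("new_treatment_beds_online", "leading", target_year=2, observed=120, target=200),
    ],
)
print(chain.confidence(as_of_year=2))  # 0.5 -- one of the two due precursors is on track
```

The gate is the whole point: a year-five claim can never be more confident than what its year-two precursors have actually earned.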
Tying hard evidence to that framework was the next problem. Large programs produce enormous amounts of planning and execution data sitting in various planning and execution tools: epics, features, acceptance criteria, test results, and deployment records. In principle, you can trace the thread from an outcome commitment down to a test case. In practice, that trace is buried across multiple systems and nobody has time to assemble it manually. Building automated solutions across all these sources, and correlating them, used to be a project all by itself. But with AI-based code generation, we can now rapidly automate pulling quantitative and qualitative data from the planning and execution tooling and connecting the results to the metrics framework. That gives us a much better read on what is actually being delivered versus what is merely being claimed.
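As an illustration of what that automation produces, here is a minimal sketch of the traceability join, assuming the records have already been exported from the planning and execution tools into plain dictionaries. The field names (outcome_id, epic_id, feature_id, status) are hypothetical stand-ins for whatever the actual tooling exposes.

```python
from collections import defaultdict

def build_trace(outcomes, epics, features, test_results):
    """Connect each outcome commitment to the test evidence underneath it."""
    epics_by_outcome = defaultdict(list)
    for e in epics:
        epics_by_outcome[e["outcome_id"]].append(e)

    features_by_epic = defaultdict(list)
    for f in features:
        features_by_epic[f["epic_id"]].append(f)

    tests_by_feature = defaultdict(list)
    for t in test_results:
        tests_by_feature[t["feature_id"]].append(t)

    trace = {}
    for o in outcomes:
        evidence = []
        for e in epics_by_outcome[o["id"]]:
            for f in features_by_epic[e["id"]]:
                tests = tests_by_feature[f["id"]]
                evidence.append({
                    "feature": f["name"],
                    "tests_passed": sum(t["status"] == "passed" for t in tests),
                    "tests_total": len(tests),
                })
        trace[o["name"]] = {
            "evidence": evidence,
            # Tested capability underneath the outcome is delivery; a chain
            # that stops at an epic or a feature is still just a claim.
            "has_tested_capability": any(ev["tests_passed"] > 0 for ev in evidence),
        }
    return trace
```

The join itself is trivial; the work AI removes is generating and maintaining the extractors that feed it from each tool's own export format or API.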
The softer signals required a different kind of attention. Hard data alone doesn’t tell you enough about a program this complex. County responses to state guidance, provider sentiment in feedback loops, the tone of stakeholder correspondence, and the slow shift from enthusiasm into fatigue all say something about trajectory that a velocity chart will not. AI helped surface patterns across meeting notes, feedback sessions, and status reporting at a breadth no individual reviewer could have tracked, and without the surveillance optics that come with trying to do it manually.
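A minimal sketch of what that soft-signal aggregation might look like, assuming the documents have already been collected into records with hypothetical fields (stakeholder_group, quarter, text), and with a crude keyword placeholder standing in for the language-model scoring that would actually be used:

```python
from collections import defaultdict
from statistics import mean

def classify_tone(text: str) -> float:
    """Score a document from -1.0 (fatigued, resistant) to 1.0 (engaged, enthusiastic).
    The keyword lists below are a placeholder so the sketch runs; in practice this
    step would be a language-model call."""
    positives = ("on track", "excited", "ready", "supportive")
    negatives = ("delayed", "concerned", "unclear", "burden", "again")
    lowered = text.lower()
    score = sum(w in lowered for w in positives) - sum(w in lowered for w in negatives)
    return max(-1.0, min(1.0, score / 2))

def tone_trend(documents):
    """Average tone per (stakeholder group, quarter). A sustained quarter-over-quarter
    decline for a group is the early signal of the shift from enthusiasm into fatigue."""
    scores = defaultdict(list)
    for doc in documents:
        scores[(doc["stakeholder_group"], doc["quarter"])].append(classify_tone(doc["text"]))
    return {key: mean(vals) for key, vals in scores.items()}

# Hypothetical usage:
docs = [
    {"stakeholder_group": "county", "quarter": "2024-Q1", "text": "Teams are excited and ready."},
    {"stakeholder_group": "county", "quarter": "2024-Q3", "text": "Guidance unclear; reporting is a burden again."},
]
print(tone_trend(docs))  # county tone falls from 1.0 in Q1 to -1.0 in Q3
```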
A Bit More Confidence, One Increment at a Time
Putting these pieces together was never meant to produce a single verdict. Programs of this scale don’t lend themselves to pass or fail. The point was to build, and continuously update, a defensible view of confidence that the bond would produce the outcomes the public was promised, and to make any drift visible early enough to matter.
Confidence in this context is less a number than a position you can defend, backed by evidence you can show. Every increment of the program that produced linked, tested, working capability raised it, and every increment that produced activity without observable downstream movement lowered it. The real value was in keeping that ledger honest and current, without pretending the answer was simpler than it actually was.
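In code terms, that ledger is little more than an append-only log of increments and the evidence behind them. A minimal sketch, with hypothetical fields and illustrative step sizes rather than the program's actual weighting:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class IncrementRecord:
    """One program increment and the evidence it produced (fields are hypothetical)."""
    when: date
    description: str
    linked_to_outcome: bool      # traceable to an outcome commitment
    tests_passed: bool           # working, tested capability exists
    downstream_movement: bool    # a leading metric actually moved

def update_confidence(confidence: float, record: IncrementRecord) -> float:
    """Nudge confidence up for evidence, down for activity without movement."""
    if record.linked_to_outcome and record.tests_passed and record.downstream_movement:
        confidence += 0.05
    elif not record.downstream_movement:
        confidence -= 0.05
    return max(0.0, min(1.0, confidence))
```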
The Posture Question
Independent assessments in the public sector often drift into adversarial territory, where the integrator or the state leadership becomes the villain, because otherwise the report feels soft. That framing costs everyone something. Integrators defend rather than adjust, program leadership manages the review rather than absorbing it, and the people who would benefit most from early honest signal learn to steer clear of the reviewer.
You’d want the posture to be independent but aligned, remaining outside the delivery structure to see what the delivery team cannot, while working toward the same outcomes. That shift in posture, from retrospective judgment toward dynamic guidance, changes what the assessment work is for. It becomes a resource to the people delivering the work rather than a hazard they have to navigate around.
Extending Transparency to the Public
There’s a larger thread here that’s worth pulling on. Large public programs are funded by people who rarely see the return on what they voted for. Status reports go to legislators, formal audits arrive years after the fact, and the public reads headlines and forms opinions with very little real visibility into whether the thing they approved is actually working.
Doing this kind of assessment well, and doing it transparently, could be a modest contribution to closing that gap. Not a press release or a public-facing dashboard, both of which are easily gamed, but an evidence-grounded, continuously updated read on how a multi-billion-dollar commitment is tracking against what it was supposed to do. It is my hope that the public starts to demand this level of insight.
Takeaway
Large transformations have always been risky, and for a long time we accepted that assessing them would be coarse, late, and often political, because doing otherwise required a level of breadth and depth that was not practical. That constraint is relaxing. AI lets us look wider and finer at the same time, and it makes the assessment continuous rather than episodic. When the stakes are this high and the money is this public, that is worth paying attention to as an opening for a kind of accountability that was difficult to offer at this scale before.

