Defense leaders keep repeating a familiar promise about military AI. Faster decisions. Shorter timelines. More lethal outcomes. The problem is that most coverage treats that claim as a vibe, not a measurable statement.
If you want the data behind the headline, start where the military already measures performance: the targeting cell. Not the model. Not the vendor. The cell. The targeting workflow has bottlenecks, handoffs, approval gates, and queues. It produces numbers commanders recognize, even if the public rarely sees them.
This is also where the JADC2 pitch becomes testable. JADC2 literature frames the point of better connectivity and a common operating picture as improving the tactical timeline, getting warfighters the data they need to make decisions faster. That is a claim that can live or die on a scoreboard. It either compresses timelines and reduces friction, or it does not.
We already have a hint of what a real delta can look like. The head of the National Geospatial-Intelligence Agency has said NGA Maven compressed one warfighting element's targeting workflow, from sensing to target engagement, from hours to minutes during an exercise, and separate NGA testimony cited decreases as high as 80 percent. Those are not universal results, but they are the right kind of result: time measured across a defined workflow.
The targeting cell scoreboard
The scoreboard focuses on throughput, latency, and error. These are the metrics that tend to move first when AI is introduced into ops workflows, assuming the integration is real and not a demo layer sitting on top of broken pipes.
Key numbers to watch:
- Missions planned per hour
- Targets vetted per shift
- Time to update the common operating picture
- Re-tasking time
- Data ingest latency (sensor-to-model)
- False positive rate
If you want to track these metrics the same way any operations team would, build a simple dashboard and update it on a cadence. A plain Google Sheets dashboard is often enough to turn big claims into a weekly trend line that either improves or does not.
| Metric | What to track |
|---|---|
| Missions planned per hour | Median plans per hour by mission type, plus rework rate (revisions, reversals) |
| Targets vetted per shift | Median vetted targets per shift, plus returned-for-review count |
| Time to update the common operating picture | Median minutes from new input to update, plus correction rate |
| Re-tasking time | Median minutes from change in conditions to new tasking issued, plus failed re-task attempts |
| Data ingest latency | Median time from sensor capture to model ingestion (identifies bandwidth bottlenecks) |
| False positive rate | Percent of AI-flagged candidates later rejected, by source and scenario |
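To show how little tooling the scoreboard actually requires, here is a minimal sketch that computes three of the metrics above from an event log. It assumes a hypothetical CSV file (`events.csv`) with one row per workflow event; every column name here is illustrative, not drawn from any real system.

```python
# Minimal scoreboard sketch. Assumes a hypothetical events.csv with
# illustrative columns: sensed_at, ingested_at, cop_input_at, cop_updated_at,
# flagged_by_ai, rejected_after_review. None of these names come from a real system.
import csv
from datetime import datetime
from statistics import median

def minutes_between(start, end):
    """Elapsed minutes between two ISO-8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

with open("events.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Data ingest latency: sensor capture to model ingestion, in minutes.
ingest_latency = median(
    minutes_between(r["sensed_at"], r["ingested_at"]) for r in rows if r["ingested_at"]
)

# Time to update the common operating picture, in minutes.
cop_update_time = median(
    minutes_between(r["cop_input_at"], r["cop_updated_at"]) for r in rows if r["cop_updated_at"]
)

# False positive rate: share of AI-flagged candidates later rejected on review.
flagged = [r for r in rows if r["flagged_by_ai"] == "1"]
false_positive_rate = sum(r["rejected_after_review"] == "1" for r in flagged) / max(len(flagged), 1)

print(f"median ingest latency (min): {ingest_latency:.1f}")
print(f"median COP update time (min): {cop_update_time:.1f}")
print(f"false positive rate: {false_positive_rate:.1%}")
```

Run weekly and pasted into a plain spreadsheet, those three numbers already give the trend line described above.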
Why these metrics are the right test
Military AI discussions often drift toward abstractions: decision advantage, multi-domain awareness, faster kill chains. Those can be real, but they hide the mechanism. The mechanism is usually one of two things.
First, AI can reduce the transaction cost of information flow from sensor to analyst to commander. That is the core story behind Project Maven's early framing, which emphasized assisting humans with large volumes of imagery and video rather than replacing them. When that works, humans stop spending time on screen-watching and start spending time on higher-value judgment.
Second, AI can compress time across a chain of handoffs. That is what the NGA Maven exercise claim is describing: hours to minutes from sensing to target engagement. A number like that matters because it maps to a workflow, not a marketing deck.
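One way to see why "maps to a workflow" matters is to treat sensing-to-engagement as a sequence of timed handoffs and sum them. The sketch below does exactly that; the stage names and every duration are invented for illustration, chosen only to show what a roughly 80 percent reduction looks like when decomposed by stage. They are not NGA or program figures.

```python
# Illustrative only: stage names and durations are invented, not drawn from
# NGA or any program. The point is that sensing-to-engagement time is the sum
# of handoffs, so the chain is only as fast as its slowest stage.
baseline_minutes = {
    "sensor capture to exploitation queue": 40,
    "detection and triage": 150,   # the screen-watching stage AI assists
    "target vetting and approval": 90,
    "tasking and engagement": 60,
}

assisted_minutes = {
    "sensor capture to exploitation queue": 15,
    "detection and triage": 12,
    "target vetting and approval": 28,
    "tasking and engagement": 13,
}

baseline_total = sum(baseline_minutes.values())
assisted_total = sum(assisted_minutes.values())
reduction = 1 - assisted_total / baseline_total

print(f"baseline sensing-to-engagement: {baseline_total / 60:.1f} hours")
print(f"assisted sensing-to-engagement: {assisted_total / 60:.1f} hours")
print(f"timeline reduction: {reduction:.0%}")
```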
JADC2 theory makes the same bet in broader terms. Improve connectivity. Present data through a common operating picture. Accelerate the tactical timeline. The scoreboard above translates that theory into things a targeting cell can actually count.
What the hard-number upside looks like
In a data-driven article, the goal is not to claim universal gains. The goal is to define what a win looks like in numbers, then show where the early public signals point.
- Latency compression: hours to minutes from sensing to target engagement in an exercise context, with NGA citing reductions as high as 80 percent in testimony. That is a clean example of a tactical timeline claim expressed as a measurable delta.
- Throughput increase: more imagery and sensor inputs triaged per shift, which tends to show up as targets vetted per shift or a lower backlog of unreviewed detections. Maven and related programs are explicitly oriented around this type of scale problem.
- Planning cycle compression: if AI is truly improving how options are assembled, missions planned per hour should rise and re-tasking time should shrink, without a matching rise in reversals and corrective cycles.
None of these gains are free. They require clean data flows, stable interfaces, and a workflow that can accept machine assistance without creating new queues. AI can speed one part of a process and still make the whole system slower if it increases rework.
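That last caveat is worth making concrete. A step can get faster while the end-to-end cycle gets worse once rework is counted. The numbers below are invented for illustration; the simple additive model (one pass plus the expected cost of rework) is an assumption, not a doctrinal formula.

```python
# Invented numbers: shows how a faster middle step plus a higher rework rate
# can leave expected end-to-end time worse than the baseline.
def expected_cycle_minutes(step_minutes, rework_rate, rework_penalty_minutes):
    # One pass through the steps, plus the expected cost of rework loops.
    return sum(step_minutes) + rework_rate * rework_penalty_minutes

baseline = expected_cycle_minutes([60, 120, 45], rework_rate=0.05, rework_penalty_minutes=180)
assisted = expected_cycle_minutes([60, 30, 45], rework_rate=0.60, rework_penalty_minutes=180)

print(f"baseline expected cycle: {baseline:.0f} min")  # 234
print(f"assisted expected cycle: {assisted:.0f} min")  # 243: faster step, slower system
```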
The concerns that show up on the same scoreboard
The most important risks have a shared pattern. They look like wins until you measure error and rework.
- False positives and confidence bias: if the model flags more candidates, targets vetted per shift can rise while quality falls. If the cell starts trusting confident outputs under time pressure, mistakes become harder to catch.
- Deception and poisoned inputs: adversaries do not have to beat the model at everything. They have to bend the decision process at the right moment. If time to update the common operating picture falls but accuracy falls with it, the unit gets a faster path to the wrong answer.
- Auditability gaps: if a recommendation influences a decision, investigators need a trail showing what the system saw, what it produced, and who acted (a minimal record sketch follows this list). Otherwise, speed comes at the cost of accountability.
- Human judgment drift (DoD Directive 3000.09): if “time to update” drops toward zero, are operators actually exercising appropriate levels of human judgment, or just rubber-stamping? Speed cannot legally or ethically replace the human decision in the loop.
- Workflow brittleness: re-tasking time can shrink while plans become fragile, because the system optimizes for what it has seen before, not what an adversary will do next.
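For the auditability point, here is a minimal sketch of what one record in that trail could contain. It assumes nothing about how any fielded system actually logs; every field name and value is invented for illustration.

```python
# Invented field names: a minimal audit record covering what the system saw,
# what it produced, and who acted, so a recommendation can be reconstructed later.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class RecommendationAuditRecord:
    event_id: str
    model_version: str
    input_refs: list        # pointers to the imagery or tracks the model saw
    output_summary: str     # what the model produced (candidate, location, score)
    confidence: float
    presented_to: str       # operator or cell that saw the recommendation
    human_decision: str     # accepted / rejected / modified / returned for review
    decided_by: str
    decided_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = RecommendationAuditRecord(
    event_id="evt-0001",
    model_version="detector-v3.2",
    input_refs=["img-20240501-117"],
    output_summary="vehicle candidate, illustrative grid reference",
    confidence=0.91,
    presented_to="targeting cell A",
    human_decision="returned for review",
    decided_by="analyst-07",
)
print(json.dumps(asdict(record), indent=2))
```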
When leaders sell AI as a battlefield decision-making improvement, the simplest accountability demand is this: publish a minimum scoreboard for major exercises and pilots. Time, throughput, error, and rework. If the program cannot show those numbers, the claim is not ready for prime time.
Who gains the most if the scoreboard moves
The winners are not evenly distributed.
- Targeting cells and ISR exploitation teams gain first because their bottleneck is often volume and time. If AI reduces the transaction cost of information flow, they feel it immediately in backlog and triage speed.
- Commanders gain when the common operating picture updates faster and stays accurate, because it reduces the friction of re-tasking assets. JADC2 framing explicitly targets this tactical timeline improvement through better connectivity and a coherent operating picture.
- Program offices (CDAO/GIDE) gain because they need valid test data. The Chief Digital and Artificial Intelligence Office runs experiments explicitly to find these improvements; a unit that already tracks them becomes a primary candidate for new capabilities.
- Vendors and integrators gain when the metric improvements become institutionalized and rolled into doctrine and procurement. The moment the scoreboard becomes part of readiness reporting, tool adoption becomes hard to reverse.
The minimum evidence this claim needs
If leaders want the public to believe AI is improving battlefield decision-making, they should publish a minimum set of numbers from major exercises and pilots. No sensitive details required. Just the scoreboard, the baseline, and the error column, measured the same way every time.
- Baseline vs post-AI medians for the six metrics above, reported for the same mission type and tempo window.
- Error and rework reported alongside throughput: false positives, reversals, returned-for-review counts, and correction rates.
- Locked definitions for what counts as a mission plan, a vetted target, an operating picture update, and a re-task event, so comparisons stay valid over time.
- Assist vs automation clearly labeled: where AI proposed, where a human approved, and where any step ran automatically (see the schema sketch after this list).
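One row of that minimum evidence could look like the schema below. This is an illustrative sketch, not a DoD reporting format; the field names, metric labels, and numbers are all invented to show baseline, post-AI, error, and assist-vs-automation reported together under a locked definition.

```python
# Illustrative schema, not a DoD reporting format: one row per metric per
# exercise, carrying the locked definition, baseline vs post-AI medians,
# the error column, and an assist-vs-automation label.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoreboardRow:
    exercise: str
    mission_type: str
    metric: str                 # e.g. "targets vetted per shift"
    definition_version: str     # locked definition the number was counted under
    baseline_median: float
    post_ai_median: float
    error_measure: str          # e.g. "returned-for-review count"
    error_baseline: float
    error_post_ai: float
    mode: str                   # "assist", "human-approved", or "automated"

row = ScoreboardRow(
    exercise="exercise-x",
    mission_type="dynamic targeting",
    metric="targets vetted per shift",
    definition_version="v1.0",
    baseline_median=24,
    post_ai_median=61,
    error_measure="returned-for-review count",
    error_baseline=2,
    error_post_ai=9,
    mode="assist",
)
print(f"{row.metric}: {row.baseline_median} -> {row.post_ai_median} ({row.mode}), "
      f"{row.error_measure}: {row.error_baseline} -> {row.error_post_ai}")
```

Reported that way, a throughput gain that arrives with a tripled error count is visible in the same row, which is the whole point of the error column.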
If the program cannot produce that minimum evidence, the “faster decisions” line is still a slogan.
The next time a leader promises faster battlefield decisions, ask for one number from the targeting cell scoreboard and one number from the error column. That is where the truth usually lives.