Bot mitigation programs often look stronger on dashboards than they feel in operations.
Every threshold spends something
False positives are not an abstract statistic. They become:
- checkout friction
- login escalations
- support load
- conversion loss
If you do not model that cost explicitly, the system will drift toward invisible harm.
Tune against outcomes, not only scores
A production review should ask:
- Which enforcement decisions were reversed?
- Which customer journeys degraded?
- Which features created the most analyst disagreement?
Minimum review loop
That is what separates an interesting research detector from a production control.