Robotics: Science and Systems · 2026

TAIL-Safe: Task-Agnostic Safety Monitoring
for Imitation-Learning Policies

Riad Ahmed1 · Momotaz Begum1

1Department of Computer Science, University of New Hampshire

Overview of TAIL-Safe pipeline
Figure 1. A Gaussian-Splatting digital twin (left) is built from a five-minute capture and reconstructed offline. Inside the twin we sweep an imitation-learning policy under perturbations, collecting safe and unsafe rollouts that train a Lipschitz reach-avoid Q-function and a fusion network over three short-horizon, task-agnostic criteria. At deployment, TAIL-Safe sits between the policy and the robot: while the predicted safety value remains positive it forwards the policy action; when the value approaches zero it performs gradient ascent on Q to bring the system back into the empirical safe set.

Abstract

An imitation-learning policy that knows when it is about to fail.

Modern visuomotor policies (flow-matching, diffusion, ACT) are remarkably capable on the trajectories they were trained on and remarkably brittle outside of them. Field deployment requires more than higher accuracy: it requires the policy to recognize when it has wandered to a state from which the demonstrations no longer apply, and to do something sensible about it.

TAIL-Safe attaches a task-agnostic safety filter to any deterministic imitation-learning policy. We train a Lipschitz-continuous, reach-avoid Q-value function whose zero-superlevel set is an empirical control-invariant set: the region of state-action space from which the underlying policy reliably succeeds. The function is learned from rollouts inside a Gaussian-Splatting reconstruction of the workspace, scored along three short-horizon criteria (visibility, recognizability, and graspability) that are agnostic to the manipulation task. At run time, when the nominal policy action would leave the safe set, a Nagumo-inspired controller performs bounded gradient ascent on Q until the system re-enters it. On a Franka Emika robot, base flow-matching policies that fail under modest perturbations finish their tasks consistently when guided by TAIL-Safe, and unsafe states are detected about 23 milliseconds before failure on average.

Method

What can fail, where it can fail, and how to step away from the cliff.

A safety monitor is only useful if it can be trained without endangering the real robot, and if it generalises beyond the specific task it observed. TAIL-Safe addresses both points by separating the learning problem from the task it ultimately protects.

A photo-real digital twin as a free safety simulator. A short five-minute capture of the workspace is reconstructed into a Gaussian-Splatting scene; segmented object Gaussians are then rigidly pose-updated from a wrist-mounted RGB-D stream at ten hertz. The twin gives us a renderable, physics-faithful environment in which we can perturb the policy and harvest unsafe trajectories that we would never deliberately execute on a real arm.
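The per-frame object update amounts to rigidly transforming the segmented object's Gaussians by the pose estimated from the wrist RGB-D stream. A minimal sketch, assuming means and covariances stored as NumPy arrays and a pose given as (R, t); the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def update_object_gaussians(means, covs, R, t):
    """Rigidly transform a segmented object's Gaussians to a tracked pose.

    means: (N, 3) Gaussian centers; covs: (N, 3, 3) covariances;
    R: (3, 3) rotation and t: (3,) translation from the pose tracker.
    """
    new_means = means @ R.T + t      # rotate, then translate each center
    new_covs = R @ covs @ R.T        # conjugate each covariance by R
    return new_means, new_covs
```

Because only the object's pose changes, the rest of the splat (appearance, opacity, background) is untouched, which is what keeps the ten-hertz update cheap.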

Three task-agnostic criteria, instead of a hand-crafted reward. At every step we score the state along three lightweight signals that remain meaningful across manipulation tasks: visibility (does the target lie in the wrist camera's field of view?), recognizability (does the rendered object embedding match the reference DINOv2 prototype, in Mahalanobis distance?), and graspability (do antipodal grasp candidates exist near the predicted end-effector pose?). A small fusion network, WeightNet, learns when each criterion matters: visibility dominates during approach, graspability near contact, and so on.
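The criteria and their fusion can be sketched as follows. The image size, margin, Mahalanobis threshold, and the softmax fusion are illustrative assumptions; the paper's exact scoring functions are not reproduced here:

```python
import numpy as np

def visibility(target_px, img_w=640, img_h=480, margin=20):
    """1 if the projected target lies inside the image with a pixel margin."""
    u, v = target_px
    return float(margin <= u < img_w - margin and margin <= v < img_h - margin)

def recognizability(embed, proto_mean, proto_cov_inv, thresh=3.0):
    """Mahalanobis match of the rendered embedding to the DINOv2 prototype,
    mapped to [0, 1] (closer than `thresh` standard deviations scores > 0)."""
    d = embed - proto_mean
    m = np.sqrt(d @ proto_cov_inv @ d)
    return float(np.clip(1.0 - m / thresh, 0.0, 1.0))

def fused_score(criteria, weightnet_logits):
    """WeightNet-style fusion: softmax over logits gives convex weights
    for (visibility, recognizability, graspability)."""
    w = np.exp(weightnet_logits - np.max(weightnet_logits))
    w /= w.sum()
    return float(w @ criteria)
```

With uniform logits the fusion reduces to an equal-weight average, which is exactly the baseline the experiments compare against.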

A Lipschitz reach-avoid Q-function. We then train a single state-action value function on rollouts collected in the twin, using a reach-avoid Bellman operator and an energy-shaping loss that prevents the value landscape from flattening. Spectral-norm regularisation gives the network a Lipschitz constant that is small enough to support a closed-form recovery step, but large enough to sharply separate safe from unsafe regions.
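Two pieces of this paragraph are easy to make concrete. The backup shown below is the standard discounted reach-avoid operator in the style of Hsu et al.; whether the paper uses this exact form is an assumption, as is the power-iteration estimate of each layer's spectral norm:

```python
import numpy as np

def reach_avoid_backup(l, g, v_next, gamma=0.99):
    """Discounted reach-avoid Bellman backup (Hsu et al.-style; illustrative).

    l: reach margin (> 0 once the task criteria are satisfied),
    g: avoid margin (< 0 inside the failure set),
    v_next: bootstrapped value at the successor state.
    """
    return (1 - gamma) * min(l, g) + gamma * min(g, max(l, v_next))

def spectral_norm(W, iters=30):
    """Largest singular value of a weight matrix via power iteration,
    used to bound each layer's (and hence the network's) Lipschitz constant."""
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    v = W.T @ u
    for _ in range(iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)
```

Dividing each weight matrix by its spectral norm (times a target constant) is one common way to enforce the kind of Lipschitz bound the paper reports.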

Recovery as gradient ascent on the safety landscape. When the nominal action would push the system across the zero level-set of Q, we replace it with a bounded gradient-ascent step on the same function, a discrete Nagumo-tangentiality update, and continue executing the policy from the corrected state. The loop runs at 350 Hz on a single GPU and converges in fewer than three iterations on average.

Learned WeightNet weights
Figure 2. Learned criterion weights along a pick-and-place trajectory. WeightNet shifts emphasis from visibility during the approach phase to graspability near contact, then rebalances at completion.
2D Q-value landscapes
Figure 3. Two-dimensional slices of the learned Q-function around expert actions. The function forms a bounded "hill" of safety with a clean negative exterior, exactly the structure required for recovery.
3D safety landscape
Figure 4. The three-dimensional value landscape over the workspace. The zero-superlevel set defines an empirical control-invariant safe set; crossing its boundary is a reliable predictor of imminent task failure.

Experiments

Real-robot results on a Franka Emika and 270 perturbed rollouts in the twin.

We evaluate TAIL-Safe on two physical manipulation tasks, Candy Pick and Pick-and-Place, and on a controlled set of perturbed rollouts inside the Gaussian-Splatting twin. We protect a base flow-matching policy and ask three questions: does the safety filter predict failure before it happens, does the recovery controller restore task progress, and does the protected policy complete tasks it would otherwise abandon?

99.3%
trajectory-level AUROC of the learned Q-function on held-out rollouts.
100%
recovery success rate with energy-shaped training (vs. 20% without).
≈ 23 ms
average detection latency before the underlying policy would have failed.

Calibration of the safety predictor.

On 270 held-out rollouts in the twin, the trajectory-level AUROC of Q reaches 0.993, with a per-state AUROC of 0.962. WeightNet separates safe and unsafe trajectories with effect size Cohen's d = 0.93, against 0.29 for an instantaneous heuristic. Crucially, the empirically measured Lipschitz constant of the trained network is 2.31, comfortably below the theoretical bound of 2.5 enforced during training, which is what makes the closed-form recovery step well-posed.
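A measured Lipschitz constant like the 2.31 above can be lower-bounded by the maximum finite slope over sampled input pairs. This sampling scheme is an assumption for illustration, not the paper's measurement procedure:

```python
import numpy as np

def empirical_lipschitz(f, samples, pairs=2000, rng=None):
    """Lower-bound the Lipschitz constant of f by the largest finite
    slope |f(x) - f(y)| / ||x - y|| over random sample pairs."""
    rng = np.random.default_rng(rng)
    n = len(samples)
    best = 0.0
    for _ in range(pairs):
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        num = abs(f(samples[i]) - f(samples[j]))
        den = np.linalg.norm(samples[i] - samples[j])
        best = max(best, num / den)
    return best
```

Such a sampled estimate can only under-report the true constant, so reading 2.31 against a trained bound of 2.5 is a conservative check that the bound actually holds on the data.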

Recovery on the real Franka.

We perturb the robot online either by physically pushing the end-effector or by injecting SpaceMouse commands. Without the filter, the base flow-matching policy fails on every interrupted episode. With TAIL-Safe attached, the policy completes the task on every recoverable episode in our study, with an average of 2.3 recovery iterations per intervention and no observable degradation in nominal task time.

Detection of out-of-distribution states.

We deliberately drive the policy into severe OOD configurations on the real arm: camera occlusion, large object displacement, scene additions not present in any demonstration. In every such configuration the predicted Q-value crosses zero before the policy would have completed an unsafe action, and the recovery controller is triggered.

Table 1. Headline performance on the digital-twin evaluation set, averaged over 270 perturbed rollouts. Best row in bold.
Method                         Traj. AUROC   State AUROC   Recovery   Latency
Equal-weight criteria          0.437         0.514         n/a        n/a
WeightNet only (no recovery)   0.971         0.928         n/a        n/a
Q-function, no energy shaping  0.984         0.943         20%        15 ms
TAIL-Safe (full)               0.993         0.962         100%       23 ms
Real-robot recovery sequence
Figure 5. A representative real-robot recovery: the operator perturbs the end-effector mid-trajectory; the safety filter triggers, gradient ascent on Q brings the arm back into the safe set, and the policy resumes without an explicit reset.
Extreme out-of-distribution cases
Figure 6. Severe out-of-distribution scenes on the real arm. The base policy diverges; TAIL-Safe correctly flags these states as unsafe in real time and refuses to execute the nominal action.

Videos

How it looks on the robot.

A short overview is followed by real-robot recovery from human and teleoperated interruption, two extreme out-of-distribution scenes, and a small set of rollouts collected entirely inside the Gaussian-Splatting twin that are used to train the safety predictor.

System overview

A compiled walkthrough of the pipeline, from digital-twin construction to closed-loop recovery on the real Franka.

Recovery from interruption (real Franka)

The base policy is interrupted mid-task. The safety filter detects the unsafe state, the recovery controller steers the end-effector back into the safe set, and the policy resumes without an explicit reset.

Manual interruption by the operator

The operator physically perturbs the end-effector; the protected policy completes the task.

SpaceMouse teleoperation injected mid-trajectory

The robot is pushed off-trajectory through teleoperation; TAIL-Safe brings it back.

Extreme out-of-distribution cases

Configurations far outside the demonstration distribution. The base policy diverges; TAIL-Safe flags the unsafe state in real time.

OOD scene A

The base policy diverges; the safety predictor crosses zero before contact.

OOD scene B

A second extreme configuration; the unsafe state is correctly flagged.

Data collection inside the Gaussian-Splatting twin

Successful and intentionally failed rollouts collected inside the photo-real twin, used to train WeightNet and the reach-avoid Q-function. No physical robot is required for this stage.

Successful rollout 1

Nominal pick-and-place inside the twin.

Successful rollout 2

±5 cm variation in object placement.

Successful rollout 3

±30° variation in object orientation.

Failure rollout (unsafe class)

An intentionally failed trajectory used to label the unsafe region of state-action space.

Citation

If TAIL-Safe is useful in your work, please cite us.

@inproceedings{ahmed2026tailsafe,
  title     = {{TAIL}-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies},
  author    = {Ahmed, Riad and Begum, Momotaz},
  booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
  year      = {2026},
}