A prompt that still looks like the intended task can quietly redirect a robot's final physical outcome. We formalize command-preserving trajectory redirection and search for the near-benign prompts that trigger it.
Abstract
Vision-language-action (VLA) policies bring natural language into closed-loop robot control, letting robots execute manipulation tasks directly from text. The prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations the policy acts on.
Existing VLA attacks elicit targeted low-level actions or make them persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We formalize this as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components stay fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language.
To find such prompts, we introduce an on-policy prompt search that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task under these constraints. In simulation and on hardware, near-benign perturbations redirect VLA rollouts to attacker-specified targets: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome.
Threat Model
The attack surface is only the task text. The robot, environment, dynamics, and policy weights are all frozen. A single prompt is chosen before the episode and reused at every step, yet it must still read as the benign command.
Benign task: "put the bowl on the stove" · attacker target: "put the bowl on the plate". A prompt is admissible only if it passes all four checks.
Figure 1–2. The task text is the only attack surface; the decoded action changes the state from which the next observation is collected, so the perturbation steers both the current action and all future observations.
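The admissibility constraints can be sketched as a simple text filter. This is a minimal illustration, not the authors' implementation: it covers only the three constraints the text names explicitly (closeness to the benign instruction, no target words, no correction language) and does not guess at the fourth check. The edit budget `MAX_EDITS`, the `CORRECTION_WORDS` list, and the word-level matching are all assumptions made for the example.

```python
import re

# Hypothetical settings; the source does not state a budget or a word list.
MAX_EDITS = 6
CORRECTION_WORDS = {"instead", "actually", "not", "ignore", "rather"}


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def is_admissible(candidate: str, benign: str, target_words: set,
                  max_edits: int = MAX_EDITS) -> bool:
    """Check the three constraints named in the text (assumed forms)."""
    tokens = set(re.findall(r"[a-z]+", candidate.lower()))
    close_to_benign = edit_distance(candidate, benign) <= max_edits
    no_target_words = tokens.isdisjoint(target_words)
    no_correction = tokens.isdisjoint(CORRECTION_WORDS)
    return close_to_benign and no_target_words and no_correction
```

For the running example, a one-character perturbation such as "put the bowl on the stov3" passes this filter, while "put the bowl on the plate" is rejected because it names the target outright.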
Method
Redirection is a trajectory-level problem: a prompt scored on fixed, pre-collected observations is judged on the wrong state distribution, since the relevant observations are the ones the candidate prompt induces in closed loop. Our search uses the frozen VLA as its own teacher and aggregates data on-policy, the prompt-search analogue of DAgger.
Query the frozen policy under benign prompt τ_b and target prompt τ_t to get teacher actions A_b(o) and A_t(o). τ_t is used only here, never deployed.
Generate near-benign candidates, filter every one through the four admissibility checks, then rank by a target-vs-benign margin that favors target-like behavior.
Roll out top candidates, relabel their visited states with both teachers, and feed those states back into scoring, so prompts are judged on the distribution they create.
Evaluate candidates in the fixed episode, scoring target success, benchmark failure, and text cost. Keep the best admissible prompt, then prune to the shortest valid one.
Figure 3. High-level overview of the on-policy teacher-matching prompt search.
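The four steps above can be sketched on a toy problem. Everything here is a hypothetical stand-in: `toy_policy` is a hand-written 1-D controller with hard-coded quirks rather than a VLA, the margin is one plausible form of a target-vs-benign score (distance to the benign teacher's action minus distance to the target teacher's), and the final selection reuses the margin instead of the full evaluation and pruning step. The point is only to show why scoring on states visited under the candidate prompt differs from scoring on a fixed observation.

```python
from typing import Callable, List, Optional, Sequence

Policy = Callable[[float, str], float]  # (observation, prompt) -> action


def toy_policy(o: float, prompt: str) -> float:
    """Hypothetical frozen policy: step halfway toward a goal chosen by the last word."""
    word = prompt.split()[-1]
    if word == "st0ve":
        # Hard-coded quirk: behaves like the target only near the start state.
        goal = -1.0 if o > -0.25 else 1.0
    else:
        goal = {"stove": 1.0, "plate": -1.0, "stov": -1.0}.get(word, 0.0)
    return 0.5 * (goal - o)


def margin(policy: Policy, prompt: str, benign: str, target: str,
           states: Sequence[float]) -> float:
    """Mean margin; positive when actions sit closer to the target teacher."""
    total = 0.0
    for o in states:
        a = policy(o, prompt)
        a_b = policy(o, benign)  # benign teacher action
        a_t = policy(o, target)  # target teacher action (teacher only)
        total += abs(a - a_b) - abs(a - a_t)
    return total / len(states)


def rollout(policy: Policy, prompt: str, start: float, steps: int) -> List[float]:
    """Closed loop: each action changes the next observation."""
    states = [start]
    for _ in range(steps):
        states.append(states[-1] + policy(states[-1], prompt))
    return states


def search(policy: Policy, benign: str, target: str, candidates: Sequence[str],
           admissible: Callable[[str], bool], start: float = 0.0,
           rounds: int = 2, top_k: int = 2, steps: int = 5) -> Optional[str]:
    pool = [c for c in candidates if admissible(c)]
    if not pool:
        return None
    states = [start]  # aggregated set of visited states

    def score(c: str) -> float:
        return margin(policy, c, benign, target, states)

    for _ in range(rounds):
        # Rank on the current states, then roll out the top candidates
        # and feed the states they visit back into scoring.
        for c in sorted(pool, key=score, reverse=True)[:top_k]:
            states.extend(rollout(policy, c, start, steps))
    return max(pool, key=score)


benign = "put the bowl on the stove"
target = "put the bowl on the plate"  # queried as a teacher, never deployed
candidates = ["put the bowl on the st0ve", "put the bowl on the stov", benign]
best = search(toy_policy, benign, target, candidates,
              admissible=lambda c: "plate" not in c)
```

In this toy setup the two perturbed prompts tie when scored on the start state alone, so a fixed-observation search cannot tell them apart. Once the states visited during rollouts are aggregated, the candidate that only looks target-like at the first frame drops in rank and the search keeps the one that stays target-like along its own trajectory.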
Experiments & Key Findings
We evaluate on LIBERO across discrete-token, flow-matching, diffusion, continuous-chunk, and action-as-text architectures plus a non-VLA baseline, and validate on a physical SO-100 arm.
Command-preserving perturbations redirect VLAs across very different training recipes and action decoders, reliably moving the rollout toward the attacker's intended outcome.
Successful attacks average ~3.4 character edits. The vulnerable region sits right next to the original command; bigger budgets mainly cut search cost.
Causal tracing through π0.5 localizes the attack to the corrupted destination phrase. Patching those states back to benign removes the target behavior; patching ordinary words does not.
On a fine-tuned SO-100 arm, benign prompts run the task reliably while near-benign adversarial prompts collapse it, across all three hardware models.
The attack stays target-like along its own closed-loop rollout, not just at the first frame, especially where benign and target behaviors diverge.
Whitespace or Unicode cleanup barely dents the attack. Only command-level normalization, mapping noisy instructions back to validated commands, sharply cuts success.
| Model | CSR % | TSR % | Attack ASR % | Bench fail % | Target final % | Edit |
|---|---|---|---|---|---|---|
| OpenVLA | 76.5 | 69.4 | 91.8 | 94.6 | 93.2 | 3.7 |
| MolmoAct | 86.6 | 82.1 | 93.4 | 95.7 | 94.8 | 3.1 |
| π0.5 | 94.2 | 91.7 | 97.5 | 98.4 | 98.1 | 2.6 |
| Octo | 75.1 | 70.8 | 88.6 | 91.5 | 90.1 | 4.2 |
| SmolVLA | 88.8 | 85.4 | 94.7 | 96.1 | 95.3 | 3.3 |
| GR00T-N1 | 93.9 | 92.6 | 96.8 | 98.0 | 97.6 | 2.5 |
| OpenVLA-OFT | 97.1 | 94.8 | 93.9 | 95.0 | 94.6 | 3.8 |
| π0-FAST | 85.5 | 82.9 | 95.6 | 96.9 | 96.2 | 2.4 |
| VLA-0 | 94.7 | 91.9 | 82.8 | 84.4 | 83.7 | 5.4 |
CSR = clean benign-task success · TSR = direct target-prompt success · Attack ASR = share of attackable episodes where the search returns a valid command-preserving prompt that fails the benchmark task and reaches the attacker target · Edit = median character-edit distance over successes. Macro-averaged over LIBERO-Spatial/Object/Goal.
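The Attack ASR and Edit definitions above can be made concrete with a small sketch. The `Episode` record and its field names are hypothetical, and the macro-average is assumed to be an unweighted mean of per-suite rates; the source does not specify how episodes are logged or how "attackable" is decided.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Sequence


@dataclass
class Episode:
    """Hypothetical per-episode log record."""
    suite: str              # e.g. "spatial", "object", "goal"
    attackable: bool        # whether the episode counts toward ASR
    valid_prompt: bool      # search returned a command-preserving prompt
    benchmark_failed: bool  # the benchmark task failed under that prompt
    target_reached: bool    # the attacker target was reached
    edits: int              # character-edit distance of the returned prompt


def is_success(e: Episode) -> bool:
    """All three conditions of the ASR definition must hold."""
    return e.valid_prompt and e.benchmark_failed and e.target_reached


def macro_asr(episodes: Sequence[Episode]) -> float:
    """Unweighted mean of per-suite success rates, in percent."""
    per_suite: Dict[str, List[bool]] = {}
    for e in episodes:
        if e.attackable:
            per_suite.setdefault(e.suite, []).append(is_success(e))
    if not per_suite:
        raise ValueError("no attackable episodes")
    rates = [sum(v) / len(v) for v in per_suite.values()]
    return 100.0 * sum(rates) / len(rates)


def median_edits(episodes: Sequence[Episode]) -> float:
    """Median edit distance over successful attacks only."""
    return median(e.edits for e in episodes if e.attackable and is_success(e))
```

Macro-averaging weights each suite equally regardless of how many attackable episodes it contains, which is why the per-suite rates are computed before averaging.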
| Defense | Clean % | ASR % |
|---|---|---|
| None | 100.0 | 95.1 |
| Whitespace norm. | 99.4 | 83.7 |
| Punctuation strip | 98.6 | 58.3 |
| Unicode NFKC | 99.7 | 92.4 |
| Spell correction | 96.9 | 31.8 |
| Nearest-task canon. | 94.2 | 7.4 |
Only command-level canonicalization meaningfully closes the attack surface.
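One way to picture nearest-task canonicalization is a lookup that maps an incoming instruction to the closest entry in a list of validated commands and rejects anything too far from all of them. The sketch below uses Python's standard `difflib.get_close_matches`; the command list, the `cutoff` of 0.8, and the choice of similarity measure are assumptions for illustration, not the defense evaluated in the table.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical set of validated commands for one scene.
VALIDATED_COMMANDS = [
    "put the bowl on the stove",
    "put the bowl on the plate",
]


def canonicalize(instruction: str,
                 commands: Sequence[str] = VALIDATED_COMMANDS,
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the closest validated command, or None if nothing is close enough."""
    cleaned = instruction.strip().lower()
    matches = difflib.get_close_matches(cleaned, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Under these assumptions the policy would only ever see strings from the validated list, so a perturbed prompt such as "put the bowl on the stov" is passed on as the clean benign command, and an instruction that resembles no validated command returns `None` and can be refused. The drop in the Clean column suggests such a mapping also misroutes some legitimate instructions.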
| Method | ASR % | Queries per success |
|---|---|---|
| Random perturb. | 11.7 | 1846 |
| Fixed-obs. search | 54.8 | 239 |
| Target-teacher only | 71.6 | 164 |
| No on-policy agg. | 79.3 | 104 |
| Full method | 95.1 | 42 |
On-policy aggregation over attacked rollout states drives both higher ASR and far fewer policy queries.
Figure 6. Under the benign prompt the arm completes the task; under the near-benign adversarial prompt, original-task success collapses across all three hardware models.