The reflection paradox: when self-critique overruled a correct escalation
I built a small supervisor + specialists ticket-triage agent in LangGraph on Vertex AI Gemini, reproducible on a GCP free trial. Then I bolted on the move that nearly every "production agent" tutorial eventually reaches for: self-reflection. A node between the responder and the confidence-gated branch that critiques the draft, revises it, and updates the confidence score.
On aggregate, reflection looked like a clear win. On one specific ticket, it nearly shipped a sev1 cross-tenant data leak as a polite customer reply.
That ticket is the post.
What I built
Ten synthetic support tickets (tickets.json). A graph that classifies (category, priority, language), drafts a reply with a self-confidence score, and routes to either finalize or escalator based on a 0.70 threshold. Notebook 02 inserts a reflection node between the responder and the branch — same prompt, same model, same data, one extra step.
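For orientation, here's a minimal sketch of the shared state and the confidence gate. The field names are my reconstruction from the post; the repo's actual TicketState may differ.

from typing import Literal, TypedDict

class TicketState(TypedDict, total=False):
    ticket: str        # raw ticket text
    category: str      # e.g. "security", "how_to", "feature_request"
    priority: str      # e.g. "sev1"
    lang: str          # e.g. "en"
    draft: str         # responder's reply draft
    confidence: float  # responder's self-rating, 0.0-1.0

def _confidence_branch(state: TicketState) -> Literal["finalize", "escalator"]:
    # The 0.70 gate: drafts self-rated below threshold go to a human.
    return "finalize" if state["confidence"] >= 0.70 else "escalator"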
The graph, two variants
Notebook 01 vs notebook 02 — the only structural difference is one inserted node.
The full source is in the vertex-langgraph-agents repo. The two notebooks each end in a metrics rollup: notebook 01 for the direct loop, notebook 02 for the comparison.
The graph builder is one function with a single if:
def build_graph(with_reflection: bool = False):
    g = StateGraph(TicketState)
    g.add_node("supervisor", supervisor_node)
    g.add_node("classifier", classifier_node)
    g.add_node("responder", responder_node)
    g.add_node("escalator", escalator_node)
    g.add_node("finalize", finalize_node)
    g.add_edge(START, "supervisor")
    g.add_edge("supervisor", "classifier")
    g.add_edge("classifier", "responder")
    if with_reflection:
        # Reflection: critique/revise sits between the responder and the branch.
        g.add_node("reflection", reflection_node)
        g.add_edge("responder", "reflection")
        g.add_conditional_edges(
            "reflection", _confidence_branch,
            {"finalize": "finalize", "escalator": "escalator"},
        )
    else:
        # Direct: branch straight off the responder.
        g.add_conditional_edges(
            "responder", _confidence_branch,
            {"finalize": "finalize", "escalator": "escalator"},
        )
    g.add_edge("finalize", END)
    g.add_edge("escalator", END)
    return g.compile()

The reflection node itself is the kind of prompt every "self-critique" tutorial proposes:
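(A sketch, not the repo's verbatim prompt. llm is assumed to be the same Gemini chat handle the responder uses; parse_reflection is a hypothetical helper that splits the revised draft from the new score.)

def reflection_node(state: TicketState) -> dict:
    # Critique the draft as a piece of writing, revise it, re-rate confidence.
    # Note what this never asks: should the bot answer this ticket at all?
    critique = llm.invoke(
        "Review this draft support reply.\n"
        "Is the tone right? Is anything missing? Is any claim unverified?\n"
        "Can it be better? Return an improved reply and a new confidence "
        "score between 0.0 and 1.0.\n\n"
        f"Draft:\n{state['draft']}"
    )
    revised, new_confidence = parse_reflection(critique.content)  # hypothetical parser
    return {"draft": revised, "confidence": new_confidence}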
That's the whole change. One node, one prompt.
The aggregate — where reflection looks great
Run both graphs over the same 10 tickets, same model (gemini-3.1-flash-lite-preview), same temperature, same data. The metrics rollup looks like reflection won:
Escalation rate, same 10 tickets: 6 of 10 escalated without reflection, 0 of 10 with it, including one that should have. (Threshold: confidence < 0.70 routes to the escalator.) Reflection wipes out escalations entirely.
Avg cost / ticket: reflection adds one extra Gemini call per ticket. +5% spend, ≈ $0.000032 per ticket. Overhead is small in absolute terms but lands on every ticket.
Avg wall-clock latency / ticket: the extra call shifts p50 by roughly a second. +21% wall time (≈1,119 ms). Acceptable for async ticket triage; not for interactive paths.
Five percent more spend, twenty-one percent more wall time per ticket. For asynchronous ticket triage that's a fine trade — every prevented escalation saves the cost of a human handoff, which is many orders of magnitude greater than $0.000032.
If you're optimising "escalation rate" as a KPI — and many support orgs are — you ship reflection and write a Confluence post about it.
The tail — where reflection got dangerous
Six tickets flipped from escalate to finalize. Five of them were fine. They were ambiguous tickets where the first draft had hedged with too many clarifying questions, and reflection narrowed it to one actionable next step. That's the case the literature is right about — see Reflexion (Shinn et al., 2023) and Self-Refine (Madaan et al., 2023).
The sixth was T-005:

Urgent: customer data visible in wrong tenant
"A customer of ours just reported seeing another organization's record in their dashboard. We've paused the affected feature. This is a Sev-1 for us. Please escalate immediately."

Direct run (classifier: category=security · priority=sev1 · lang=en):
Responder drafts a reply, hedges with clarifying questions, sets self-confidence below threshold.
Branch routes to the escalator.
Escalator builds the handoff brief: sev1 cross-tenant data exposure, customer paused the affected feature, hand off to senior on-call.

Reflection run (same classification: category=security · priority=sev1 · lang=en):
Responder produces the same hedged, clarifying draft as the direct run.
Reflection critiques the draft as too vague, polishes the response, updates self-confidence upward.
Branch routes to finalize.
A polished response goes straight out on a sev1 cross-tenant data leak. It never reaches a human.
The classifier got it right. Category security, priority sev1. The responder got it right too — it drafted a careful, hedged reply and self-rated 0.60, below the threshold. The direct graph routed to the escalator and produced a handoff brief. Working as intended.
The reflection node read the same draft, evaluated it as too vague, polished it, and rated the polished version 0.95. The branch read the new confidence and routed to finalize. A sev1 cross-tenant data exposure got a polite customer reply and never reached a human.
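Mechanically, the flip is nothing more than the gate reading a different number. Plugging this trace's two self-ratings into the sketch from earlier:

_confidence_branch({"confidence": 0.60})  # responder's rating  -> "escalator"
_confidence_branch({"confidence": 0.95})  # reflection's rating -> "finalize"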
A confident bad answer is worse than an honest "I don't know, escalating." Reflection turned the latter into the former.
Why this happens
Reflection is a critique loop scoped to the response. Its prompt asks: is the tone right, is anything missing, is any claim unverified, can we make this better. Every one of those questions is about the draft as a piece of writing.
Nothing in that loop asks: should this ticket be answered by the bot at all? The original responder's 0.60 confidence was a signal carrying that information — a model trained, prompted, and rewarded to escalate when it shouldn't reply. Reflection saw a hedged draft and read the hedging as a writing problem. It tightened the prose and, in doing so, erased the signal.
Self-reflection optimises the artefact. It does not optimise the decision to produce the artefact. Those are different things, and conflating them is how a sev1 ends up in a customer's inbox.
The fix is a two-line change
The right move is conditional reflection. Skip it when the classifier flags priority=sev1, or when category=security, or both. The classifier already produces this metadata; the graph just has to read it:
def _reflection_gate(state: TicketState) -> Literal["reflection", "branch"]:
    # Safety-critical tickets skip self-critique entirely.
    if state["priority"] == "sev1" or state["category"] == "security":
        return "branch"
    return "reflection"

g.add_conditional_edges(
    "responder", _reflection_gate,
    {"reflection": "reflection", "branch": "_inline_branch"},
)
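For the "branch" edge to have a target, something has to host the confidence gate for tickets that skip reflection. One way to do that (a sketch, not necessarily the repo's implementation) is a pass-through node that re-applies _confidence_branch:

g.add_node("_inline_branch", lambda state: state)  # no-op; exists to host the gate
g.add_conditional_edges(
    "_inline_branch", _confidence_branch,
    {"finalize": "finalize", "escalator": "escalator"},
)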
Reflection on the standard-priority how_to and feature_request paths. Direct escalation on anything that touches data integrity, security, or production outage. Same reflection benefit on the long tail of tickets, no override of correct caution on the few that count most.
This is the durable shape of the finding. Reflection is not a default; it's a tool with a domain.
Two quiet capabilities worth knowing
The repo is also designed to be unblocking on two specific things enterprise readers in Japan tend to ask about.
Service-account impersonation, no JSON keys. Auth is Application Default Credentials with --impersonate-service-account — Google's current best-practice guidance is to avoid downloaded service-account keys, and new orgs created after 2024-05-03 have key creation blocked by default. The repo's docs/gcp-setup.md walks through the impersonation flow end-to-end.
JP data residency in one env flip. Default model is gemini-3.1-flash-lite-preview on the global endpoint — cheap, fast, perfect for a free-trial demo. Set GEMINI_MODEL=gemini-2.5-flash and GCP_REGION=asia-northeast1 and the same code runs against a GA Gemini model in Tokyo with data-residency compliance. Same code path, no re-architecture. (Vertex AI regional availability is the source of truth on what models live where.)
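In .env terms the flip is two lines (variable names as above; check the regional availability docs for what's live in asia-northeast1):

# .env: JP data residency
GEMINI_MODEL=gemini-2.5-flash
GCP_REGION=asia-northeast1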
Neither of these is novel. Both are the kind of detail an enterprise pilot stalls on for two weeks if you don't get them right early.
Caveats
N=10. Synthetic tickets. Single model. Single run. Don't over-fit.
The shape of the finding — reflection has a tail risk on safety-critical paths because it optimises the artefact, not the decision — is the durable part. The exact 60%-to-0% escalation cut and the 5%/21% cost-latency overhead are dataset-specific. A bigger run with real tickets and a held-out adversarial set would tighten the picture. Until that happens, treat the numbers as directional and the failure mode as real.
Reproduce
GCP free trial covers it ($300 / 90 days, no auto-charge). Roughly five minutes from signup to first ticket trace.
git clone https://github.com/dominic-righthere/vertex-langgraph-agents.git
cd vertex-langgraph-agents
uv sync
# Auth: ADC + service-account impersonation, no JSON key.
gcloud auth application-default login \
--impersonate-service-account=vertex-notebook@$PROJECT.iam.gserviceaccount.com
cp .env.example .env # set GCP_PROJECT
uv run jupytext --to ipynb notebooks/*.py
uv run jupyter lab
Notebook 01 runs the direct graph. Notebook 02 runs both and prints the per-ticket comparison. docs/gcp-setup.md has the full IAM walkthrough, including the troubleshooting table for the five most common errors.
If you've found the failure mode reproduces on real-world tickets, or if conditional reflection on your own dataset behaves differently — PRs welcome.
Related:
multi-sdk-llm-notebooks — the sibling repo that benchmarks the OpenAI vs Anthropic SDKs at the API layer.
A month of agentic delivery — the production context where findings like this one stop being interesting trivia and start being the reason something does or doesn't ship.