Error handling and recovery

Workflows managed by the Agents SDK are designed to keep agents moving forward, even when things do not go exactly as planned. Errors are inevitable when reasoning, tools, and external systems interact. This lesson covers how the SDK exposes those errors and how we recover from them without losing control of the agent.

How the SDK surfaces errors during execution

When something goes wrong during a workflow step, the SDK does not hide it. Errors are surfaced as structured outcomes of execution rather than raw crashes.

An error appears as part of the step result, alongside successful outputs. This lets the agent continue running and decide what to do next, instead of terminating the entire program.
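To make the idea of an error as a structured outcome concrete, here is a minimal sketch in plain Python. It does not use the SDK's actual API: the names StepResult and run_step are hypothetical, chosen only to show a step returning either an output or an error instead of raising.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StepResult:
    """Hypothetical step outcome: carries an output or an error description."""
    name: str
    output: Any = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_step(name: str, action: Callable[[], Any]) -> StepResult:
    """Run one step and convert any exception into a structured result."""
    try:
        return StepResult(name=name, output=action())
    except Exception as exc:  # deliberately broad: the caller decides what happens next
        return StepResult(name=name, error=f"{type(exc).__name__}: {exc}")


def failing_lookup() -> str:
    # Stand-in for a tool call that fails.
    raise TimeoutError("planet service did not respond")


results = [
    run_step("greet", lambda: "Hello!"),
    run_step("lookup", failing_lookup),
    run_step("summarize", lambda: "Mars is the fourth planet."),
]
for r in results:
    print(r.name, "ok" if r.ok else f"failed ({r.error})")
```

The failed lookup sits in the same list as the successful steps, so the surrounding loop keeps running and can decide what to do about it.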

Distinguishing reasoning errors from tool failures

Not all errors are the same. The SDK makes a clear distinction between failures in reasoning and failures in tools.

A reasoning error comes from the model producing an unusable or invalid decision. A tool failure occurs when a valid decision invokes a tool that then cannot complete its work. Treating these differently allows recovery logic to be precise instead of blunt.
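One way to picture this distinction is with two separate exception types. The sketch below is an assumption for illustration, not the SDK's own classes: ReasoningError, ToolFailure, execute_decision, and the two toy tools are all hypothetical names. A decision is validated first (unknown tool or mismatched arguments count as reasoning errors), and only then is the tool called.

```python
import inspect
from typing import Any, Callable, Dict


class ReasoningError(Exception):
    """The model's decision itself is unusable (hypothetical name)."""


class ToolFailure(Exception):
    """The decision was valid, but the tool could not finish (hypothetical name)."""


def planet_facts(name: str) -> str:
    return f"{name} is a planet in our solar system."


def moon_count(name: str) -> int:
    # Stand-in for a tool whose backing service is down.
    raise ConnectionError("moon database offline")


TOOLS: Dict[str, Callable[..., Any]] = {"planet_facts": planet_facts, "moon_count": moon_count}


def execute_decision(decision: Dict[str, Any], tools: Dict[str, Callable[..., Any]]) -> Any:
    tool_name = decision.get("tool")
    if tool_name not in tools:
        raise ReasoningError(f"unknown tool: {tool_name!r}")
    args = decision.get("args")
    if not isinstance(args, dict):
        raise ReasoningError("decision has no usable 'args' mapping")
    tool = tools[tool_name]
    try:
        # Check the arguments against the tool's signature before calling it.
        inspect.signature(tool).bind(**args)
    except TypeError as exc:
        raise ReasoningError(f"arguments do not match {tool_name}: {exc}") from exc
    try:
        return tool(**args)
    except Exception as exc:
        raise ToolFailure(f"{tool_name} failed: {exc}") from exc


def describe(decision: Dict[str, Any]) -> str:
    """Route each failure kind to a different recovery hint."""
    try:
        return f"ok: {execute_decision(decision, TOOLS)}"
    except ReasoningError as exc:
        return f"reasoning error -> ask the model for a new decision ({exc})"
    except ToolFailure as exc:
        return f"tool failure -> retry or pick another tool ({exc})"


print(describe({"tool": "teleport", "args": {}}))
print(describe({"tool": "moon_count", "args": {"name": "Mars"}}))
print(describe({"tool": "planet_facts", "args": {"name": "Mars"}}))
```

Because the two cases are separate types, the handler can send a reasoning error back to the model and a tool failure to retry or fallback logic.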

Recovering from failed steps

Once a step fails, recovery becomes a normal part of workflow execution. The SDK allows the agent to inspect the failure and continue with adjusted behavior.

Recovery might mean selecting a different tool, revising the plan, or skipping the failed step entirely. The important point is that recovery is explicit and deliberate, not accidental.
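The recovery options above can be sketched as an explicit policy attached to each step. This is an illustrative assumption rather than the SDK's mechanism: the Step dataclass, its fallback and optional fields, and run_plan are hypothetical names. Each step tries its primary action, then a fallback if one exists, then is skipped if it is optional; otherwise the plan stops.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """Hypothetical plan step with an explicit recovery policy."""
    name: str
    primary: Callable[[], str]
    fallback: Optional[Callable[[], str]] = None
    optional: bool = False


def run_plan(steps: List[Step]) -> List[str]:
    log: List[str] = []
    for step in steps:
        try:
            log.append(f"{step.name}: {step.primary()}")
            continue
        except Exception as exc:
            log.append(f"{step.name}: primary failed ({exc})")
        if step.fallback is not None:
            try:
                log.append(f"{step.name}: recovered via fallback: {step.fallback()}")
                continue
            except Exception as exc:
                log.append(f"{step.name}: fallback failed ({exc})")
        if step.optional:
            log.append(f"{step.name}: skipped")
            continue
        log.append(f"{step.name}: stopping plan")
        break
    return log


def broken() -> str:
    # Stand-in for a tool that is currently unavailable.
    raise RuntimeError("service unavailable")


plan = [
    Step("facts", primary=broken, fallback=lambda: "Jupiter is the largest planet."),
    Step("picture", primary=broken, optional=True),
    Step("answer", primary=lambda: "Here is what I found."),
]
log = run_plan(plan)
for line in log:
    print(line)
```

Every recovery decision is written to the log, so it is visible afterwards which steps ran normally, which were recovered, and which were skipped.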

Retrying or adjusting behavior after errors

Some failures are temporary. The SDK supports retrying steps under controlled conditions rather than repeating them blindly.

A retry usually happens with additional context or modified inputs, informed by what just failed. This keeps retries purposeful and prevents the agent from looping endlessly on the same mistake.
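A retry that carries forward what just failed might look like the following sketch. It is written under stated assumptions: retry_with_feedback and fake_model are hypothetical names, and the "model" is a stand-in function that only changes its answer once it receives feedback. Each failed attempt adds its error message to the context passed into the next attempt, and the number of attempts is bounded.

```python
from typing import Callable, List


def retry_with_feedback(
    action: Callable[[str, List[str]], str],
    request: str,
    max_attempts: int = 3,
) -> str:
    """Retry an action, passing the errors from earlier attempts as extra context."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return action(request, feedback)
        except ValueError as exc:
            feedback.append(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"gave up after {max_attempts} attempts: {feedback}")


def fake_model(request: str, feedback: List[str]) -> str:
    # Stand-in for a model call: it shortens its answer only after being told
    # that the previous answer was rejected.
    if feedback:
        answer = "Saturn has rings."
    else:
        answer = "Saturn is a gas giant with an extensive ring system made of ice and rock."
    if len(answer) > 40:
        raise ValueError("answer is too long for a young reader")
    return answer


print(retry_with_feedback(fake_model, "Tell me about Saturn"))
```

The second attempt succeeds here only because its input differs from the first; a retry with identical input would fail the same way until the attempt limit is reached.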

Preventing runaway or stalled workflows

Uncontrolled retries and indecision can cause workflows to stall or spiral. The SDK provides hooks to detect repeated failures and halt or redirect execution.

By setting limits on retries and requiring progress between steps, we ensure the workflow either advances or stops cleanly. This preserves predictability and keeps the agent under our control.
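These safeguards can be sketched as a small guarded loop. The function below is a hypothetical illustration, not one of the SDK's hooks: run_guarded, max_steps, and max_failures are assumed names. It stops when the goal is reached, when a step fails more than max_failures times in a row, when a step returns without changing the state, or when the step budget runs out.

```python
from typing import Callable, Tuple


def run_guarded(
    step: Callable[[int], int],
    state: int,
    is_done: Callable[[int], bool],
    max_steps: int = 10,
    max_failures: int = 2,
) -> Tuple[str, int]:
    """Advance the state step by step, halting cleanly instead of looping forever."""
    failures = 0
    for _ in range(max_steps):
        if is_done(state):
            return "done", state
        try:
            new_state = step(state)
        except Exception:
            failures += 1
            if failures > max_failures:
                return "halted: repeated failures", state
            continue
        if new_state == state:
            # The step ran but changed nothing, so the workflow is stalled.
            return "halted: no progress", state
        state, failures = new_state, 0
    if is_done(state):
        return "done", state
    return "halted: step budget exhausted", state


def always_fails(state: int) -> int:
    raise RuntimeError("tool unavailable")


print(run_guarded(lambda s: s + 1, 0, lambda s: s >= 3))
print(run_guarded(always_fails, 0, lambda s: s >= 3))
print(run_guarded(lambda s: s, 0, lambda s: s >= 3))
print(run_guarded(lambda s: s + 1, 0, lambda s: False, max_steps=5))
```

Every exit path returns a reason together with the last good state, so the caller can tell a clean stop from a finished run and decide whether to redirect the workflow.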

Conclusion

At this point, we have a working picture of how the Agents SDK handles errors inside workflows. We know how errors are surfaced, how different kinds of failures are distinguished, and how recovery, retries, and safeguards keep execution stable. With this mental model, we can design workflows that fail gracefully and keep agents productive rather than fragile.

Try it

We can now observe how an SDK-managed agent surfaces tool and validation errors as structured results, allowing it to adjust, retry, and recover without losing control of the workflow.

In this demonstration, the LLM is instructed to assume the user is a child curious about the solar system.

Suspiciously fast results? This demonstration shows the output of a prior offline run, since a live LLM-enabled agent would be slower and more costly to execute on demand.