Retry and recovery
Longer agent workflows do not succeed on the first attempt every time: files may be missing, tools may fail, or intermediate results may be incomplete. This lesson shows how a classical agent can respond to those situations without crashing or restarting from scratch.
The goal is not to eliminate errors, but to make workflows resilient. We want an agent that can retry when it makes sense, adapt when it doesn’t, and stop safely when progress is no longer possible.
Recoverable and non-recoverable failures
Not every failure deserves the same response. Some failures are temporary and can be retried, while others indicate that continuing would be pointless or unsafe.
A recoverable failure is one where repeating the same step might succeed later. A missing file that may appear shortly or a transient write failure are common examples.
A non-recoverable failure is one where retrying will not change the outcome. Invalid input data or a missing configuration value usually falls into this category.
The workflow needs a way to classify failures so later decisions are deliberate rather than accidental.
def is_recoverable(error):
    return error == "temporary_failure"
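In a real workflow, classification usually rests on more than one error string. The sketch below is one way to extend the idea; the specific error names and the classify_failure helper are illustrative assumptions, not part of any particular library.

# Illustrative error names: a real workflow would use its own codes or exception types.
RECOVERABLE_ERRORS = {"temporary_failure", "missing_file", "write_timeout"}
NON_RECOVERABLE_ERRORS = {"invalid_input", "missing_config"}

def classify_failure(error):
    # Returns "recoverable" or "non_recoverable" for a given error code.
    if error in RECOVERABLE_ERRORS:
        return "recoverable"
    # Treat anything unknown as non-recoverable so the workflow
    # never loops on a failure it does not understand.
    return "non_recoverable"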
Retrying failed steps
Once a failure is considered recoverable, the workflow can retry the step under controlled conditions. Retrying blindly is risky, so retries are usually limited and tracked explicitly.
A simple retry mechanism counts how many times a step has failed and stops once a threshold is reached. This prevents infinite loops while still allowing for recovery.
max_retries = 3
attempts = 0

while attempts < max_retries:
    result = run_step()
    if result["success"]:
        break
    attempts += 1
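One way to package this pattern is a small helper that retries only recoverable failures and waits briefly between attempts. The sketch below assumes the is_recoverable check from earlier and a run_step callable that returns a dict with "success" and "error" keys; the fixed delay is an assumption, and exponential backoff is a common alternative.

import time

def run_with_retries(run_step, max_retries=3, delay_seconds=1.0):
    # Run a step up to max_retries times, retrying only recoverable failures.
    attempts = 0
    while attempts < max_retries:
        result = run_step()
        if result["success"]:
            return result
        attempts += 1
        if not is_recoverable(result.get("error")):
            break  # retrying will not change the outcome
        if attempts < max_retries:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    return {"success": False, "attempts": attempts}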
Changing behavior after repeated failures
If a step keeps failing, repeating the same action may no longer be useful. At that point, the workflow can change behavior instead of retrying again.
This might mean switching to an alternative tool, skipping an optional step, or recording the failure for later review. The key idea is that repeated failure becomes a signal, not just an error.
if attempts >= max_retries:
    state["mode"] = "degraded"
Tracking partial progress in state
Workflows often make progress even when they fail partway through. That progress should be reflected in state so it is not lost.
Recording which steps have completed allows the workflow to resume intelligently. This is especially important in long-running agents that may continue operating after an error.
state["completed_steps"].append("generate_index_page")
Continuing or aborting safely
Eventually, the workflow must decide whether to continue or stop. Continuing makes sense when enough progress has been made and remaining steps are optional. Aborting makes sense when required steps cannot be completed.
The important part is that the decision is explicit and leaves the system in a consistent state. A stopped workflow should stop cleanly, not halfway through a side effect.
if not state["can_continue"]:
workflow_status = "aborted"
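One way to make the decision explicit is to check whether the failed step was required. The sketch below assumes each step record carries a "required" flag and a "name"; both field names are assumptions for illustration.

def decide_next_action(state, failed_step):
    # Returns "continue" or "abort" and leaves state consistent either way.
    if failed_step.get("required", True):
        # A required step cannot be completed: stop cleanly and record why.
        state["can_continue"] = False
        state["abort_reason"] = "required step failed: " + failed_step["name"]
        return "abort"
    # Optional step: note the gap and keep going.
    state.setdefault("skipped_steps", []).append(failed_step["name"])
    return "continue"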
Conclusion
By distinguishing failure types, retrying deliberately, adapting after repeated problems, and recording partial progress, we gain control over how workflows behave under stress. At this point, we are no longer reacting to errors—we are managing them.
That orientation is the foundation for building agents that can run for long periods, recover from setbacks, and remain predictable even when things go wrong.