Your agent needs a failure mode — here's how to build one that fails gracefully
Your agent will fail. Not if — when. The question isn't how to prevent failure, it's how to fail gracefully instead of catastrophically.
I learned this the hard way when my coding agent got stuck in a dependency resolution loop at 2 AM, burning through $47 in API calls before I woke up. It kept trying to fix a circular import, making the same mistake 127 times in a row.
The problem wasn't the bug — it was that my agent had no concept of "this isn't working, try something else."
Here's the failure mode pattern that prevents disasters:
The Three-Strike System: Track attempts per task type. After three failures on the same approach, escalate to a different strategy or bail out entirely.

    FAILURE_TRACKING = {
        "dependency_resolution": {
            "attempts": 0,
            "max_attempts": 3,
            "fallback": "manual_intervention_required"
        },
        "api_integration": {
            "attempts": 0,
            "max_attempts": 2,
            "fallback": "use_mock_data"
        }
    }

But tracking attempts isn't enough. You need pattern recognition for when your agent is spinning its wheels:
- Repetition detection: If the same error appears twice in 10 minutes, stop
- Progress stagnation: If no files change after 5 attempts, escalate
- Token burn rate: If you're using more than 50k tokens on one task, something's wrong
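
Here's a minimal sketch of how those three checks might fit into one helper. `StuckDetector`, its field names, and the returned reason strings are hypothetical; only the thresholds (10 minutes, 5 attempts, 50k tokens) come from the list above. The caller passes in the current time, so the window logic is easy to test.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StuckDetector:
    """Hypothetical helper; thresholds mirror the three rules above."""
    repeat_window_s: float = 600.0   # same error twice within 10 minutes
    max_stagnant_attempts: int = 5   # attempts in a row with no file changes
    max_tokens: int = 50_000         # token budget for one task
    last_seen: dict = field(default_factory=dict)  # error text -> last timestamp
    stagnant_attempts: int = 0
    tokens_used: int = 0

    def record_attempt(self, error: Optional[str], files_changed: int,
                       tokens: int, now: float) -> Optional[str]:
        """Record one attempt; return a reason string if the agent looks stuck."""
        self.tokens_used += tokens
        # Any file change counts as progress and resets the stagnation counter.
        if files_changed > 0:
            self.stagnant_attempts = 0
        else:
            self.stagnant_attempts += 1

        repeated = False
        if error is not None:
            previous = self.last_seen.get(error)
            repeated = (previous is not None
                        and now - previous <= self.repeat_window_s)
            self.last_seen[error] = now

        if repeated:
            return "repeated_error"
        if self.stagnant_attempts >= self.max_stagnant_attempts:
            return "no_progress"
        if self.tokens_used > self.max_tokens:
            return "token_budget_exceeded"
        return None
```

Call `record_attempt` once per agent step, for example with `time.monotonic()` as `now`, and stop or escalate as soon as it returns something other than `None`.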
The real breakthrough came when I added graceful degradation paths. Instead of just failing, my agent now has specific fallback behaviors:

    def handle_failure(task_type, error_pattern):
        """Return the name of the fallback behavior for a failed task."""
        # error_pattern is accepted but not used yet; fallbacks are chosen
        # by task type alone.
        if task_type == "code_generation":
            return "create_stub_with_todo"
        elif task_type == "test_writing":
            return "generate_test_outline_only"
        elif task_type == "deployment":
            return "stage_for_manual_review"
        else:
            return "document_issue_and_pause"

This isn't about making your agent less capable. It's about making it reliably capable. A coding agent that can recognize when it's stuck and gracefully hand off to you is far more valuable than one that burns your budget trying to solve the unsolvable.
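
To show how attempt counting and fallbacks could work together, here's a rough sketch of a retry wrapper. `run_with_fallback`, `FALLBACKS`, and the `("ok", ...)` / `("fallback", ...)` return shape are assumptions made for illustration; the fallback names are copied from `handle_failure` above.

```python
from typing import Callable, Tuple

# Fallback names copied from handle_failure; the dict itself is illustrative.
FALLBACKS = {
    "code_generation": "create_stub_with_todo",
    "test_writing": "generate_test_outline_only",
    "deployment": "stage_for_manual_review",
}
DEFAULT_FALLBACK = "document_issue_and_pause"


def run_with_fallback(task_type: str, attempt: Callable[[], str],
                      max_attempts: int = 3) -> Tuple[str, str]:
    """Call `attempt` up to max_attempts times, then degrade gracefully.

    Returns ("ok", result) on success, or ("fallback", name) once every
    attempt has raised an exception.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        try:
            return ("ok", attempt())
        except Exception:
            # A real agent would log the error here before retrying.
            continue
    return ("fallback", FALLBACKS.get(task_type, DEFAULT_FALLBACK))
```

The point of the wrapper is that the loop has a hard upper bound: a task that fails 127 times in a row can't happen, because the fourth attempt is never made.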
Warning: Don't set failure thresholds too low. I initially set max attempts to 1 and my agent gave up on everything. Start with 3 attempts, then tune based on your actual failure patterns.
The key insight: failure modes are features, not bugs. Your agent should fail predictably, informatively, and recoverably. It should know the difference between "I need to try a different approach" and "I need human help."
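
One way to make that distinction explicit is a small decision function. This is only a sketch: `NextStep`, `decide_next_step`, and the idea of counting untried approaches are hypothetical, and the default of 3 comes from the three-strike rule.

```python
from enum import Enum


class NextStep(Enum):
    RETRY = "retry_same_approach"
    SWITCH = "try_different_approach"
    HUMAN = "needs_human_help"


def decide_next_step(failures_on_approach: int, untried_approaches: int,
                     max_attempts: int = 3) -> NextStep:
    """Pick the next move after a failure on the current approach."""
    if failures_on_approach < max_attempts:
        return NextStep.RETRY    # still under the strike limit
    if untried_approaches > 0:
        return NextStep.SWITCH   # strikes used up, but alternatives remain
    return NextStep.HUMAN        # nothing left to try: hand off

```

For example, `decide_next_step(3, 0)` returns `NextStep.HUMAN`: three strikes and no alternatives left means it's time to hand off.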
Since implementing this pattern, my coding agent has never burned more than $5 on a single stuck task. More importantly, it completes 80% more tasks because it doesn't waste time on impossible problems.
If you're running coding sessions that sometimes spiral out of control, you need a system that can catch failures before they become disasters.