1. Classify transient vs permanent errors
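Separate transient failures, such as timeouts and network errors, from permanent ones, such as logic or data validation failures. Transient errors are usually safe to retry, while permanent errors will fail the same way on every attempt and need a fix or manual intervention instead.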
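One way to make this split explicit is a small classifier that maps exception types to a recovery action. The sketch below is illustrative only: `ValidationError`, `classify_error`, and `recovery_action` are hypothetical names, and the choice to treat unrecognized errors as non-retryable is an assumption, not a rule from any particular workflow engine.

```python
class ValidationError(Exception):
    """Hypothetical error raised when input data fails validation."""


# Assumption: these built-in exception types stand in for the
# timeout/network and logic/data categories described above.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)
PERMANENT_ERRORS = (ValidationError, ValueError, KeyError, TypeError)


def classify_error(exc):
    """Return 'transient', 'permanent', or 'unknown' for an exception."""
    if isinstance(exc, TRANSIENT_ERRORS):
        return "transient"
    if isinstance(exc, PERMANENT_ERRORS):
        return "permanent"
    return "unknown"


def recovery_action(exc):
    """Pick a recovery approach based on the error class."""
    if classify_error(exc) == "transient":
        return "retry_with_backoff"
    # Permanent and unknown errors are not retried in this sketch,
    # so an unexpected failure cannot loop indefinitely.
    return "fail_and_alert"
```

For example, `recovery_action(TimeoutError())` selects a retry, while `recovery_action(ValidationError("bad email"))` goes straight to failure handling.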
2. Use idempotent workflow steps
Where retries are possible, ensure repeated execution does not create duplicate records or conflicting updates.
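A common way to get this property is an idempotency key: each step derives a stable key from the run and step, and the write is skipped if that key has already been recorded. The following is a minimal sketch under stated assumptions. `RecordStore` is a hypothetical in-memory stand-in for a real database, and the `run_id:step_name` key format is just one possible convention.

```python
class RecordStore:
    """Hypothetical in-memory stand-in for a table with a unique key."""

    def __init__(self):
        self.records = {}

    def create_once(self, idempotency_key, payload):
        """Create a record unless the key was already processed.

        Returns (record, created), where created is False on a repeat.
        """
        if idempotency_key in self.records:
            return self.records[idempotency_key], False
        self.records[idempotency_key] = dict(payload)
        return self.records[idempotency_key], True


def create_invoice_step(store, run_id, step_name, payload):
    """Workflow step that is safe to retry.

    Assumption: the run ID plus step name uniquely identifies one
    intended write, so a retry reuses the same key.
    """
    key = f"{run_id}:{step_name}"
    return store.create_once(key, payload)
```

Calling `create_invoice_step` twice with the same `run_id` and `step_name` leaves a single record in the store; the second call returns the existing record with `created` set to `False`. In a real database, the same effect typically relies on a unique constraint on the key column rather than a lookup in application code.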
3. Alert with actionable context
Attach workflow run ID, affected entity, and failed step metadata to alerts so responders can triage faster.
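As an illustration, the alert payload can be built as a structured object so that every alert carries the same fields. This is a sketch, not a prescribed schema: `AlertContext`, `build_alert`, and the field names are assumptions chosen for the example, and the JSON string would be handed to whatever alerting channel is in use.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AlertContext:
    """Hypothetical set of fields a responder needs for triage."""

    run_id: str
    entity: str
    failed_step: str
    error_type: str
    error_message: str
    attempt: int


def build_alert(exc, run_id, entity, failed_step, attempt=1):
    """Serialize failure context into a JSON alert payload."""
    ctx = AlertContext(
        run_id=run_id,
        entity=entity,
        failed_step=failed_step,
        error_type=type(exc).__name__,
        error_message=str(exc),
        attempt=attempt,
    )
    # sort_keys keeps the payload stable, which helps when diffing alerts.
    return json.dumps(asdict(ctx), sort_keys=True)
```

A call such as `build_alert(ValueError("missing email"), "run-42", "customer:1001", "validate_input", attempt=2)` yields a payload that names the run, the affected entity, and the step that failed, so a responder does not have to search logs for them.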