# Error Handling

> "A program that doesn't handle errors is a program that's broken by design."

Error handling is the craft of anticipating, detecting, and responding to failures in ways that keep systems reliable, understandable, and recoverable. Good error handling turns catastrophic failures into manageable incidents, transforms cryptic crashes into actionable diagnostics, and builds user trust through graceful degradation.

## Why error handling matters

- **Failures are inevitable.** Networks fail, disks fill, APIs time out, users provide invalid input. Software that doesn't plan for failure will fail unpredictably.
- **Errors are data.** Well-structured error information accelerates debugging, reduces mean-time-to-resolution (MTTR), and enables better monitoring and alerting.
- **User experience depends on it.** Users forgive failures if the system fails gracefully with clear guidance. Cryptic error messages or silent failures erode trust.
## Core principles

- **Fail fast.** Detect errors as early as possible, ideally at input-validation boundaries, before corrupting state.
- **Make errors explicit.** Use exceptions, Result types, or error returns—don't hide failures in special values (`null`, `-1`, empty strings).
- **Provide context.** Include what failed, why it failed, and what action the user or operator should take.
- **Handle errors at the right level.** Don't catch-and-ignore everywhere; propagate errors to code that knows how to handle them meaningfully.
- **Preserve stack traces.** When wrapping or re-throwing errors, maintain the original stack trace for debugging.
- **Recover gracefully where possible.** Retry transient failures, provide fallback values, or degrade functionality rather than crashing.
- **Never swallow errors silently.** Empty catch blocks and ignored error returns are bugs waiting to happen.
## Error handling strategies

### Programmer errors vs. operational errors

Understanding the distinction shapes your handling strategy:

| Type | Description | Examples | Strategy |
|---|---|---|---|
| Programmer | Bugs in the code; should never occur | Null dereference, type errors | Fail fast; fix the bug |
| Operational | Expected failures in production | Network timeout, disk full, bad input | Handle gracefully; retry/fallback |

- **Programmer errors** indicate bugs—crash immediately (in dev) to surface the issue quickly.
- **Operational errors** are part of normal operation—handle them with retries, fallbacks, or error responses.
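To make the split concrete, here's a minimal sketch (the `charge` function and `PaymentTimeout` class are hypothetical, not from any real payment API): an assertion guards against a programmer error, while a dedicated exception models the operational one.

```python
class PaymentTimeout(Exception):
    """Operational error: the gateway can legitimately be slow or down."""

def charge(amount_cents: int, gateway_up: bool = True) -> dict:
    # Programmer error: a non-positive charge can only come from buggy
    # caller code, so fail fast rather than handling it "gracefully".
    assert amount_cents > 0, "amount_cents must be positive"
    if not gateway_up:
        # Operational error: expected in production; callers retry or fall back.
        raise PaymentTimeout("gateway did not respond")
    return {"charged": amount_cents}
```

The asymmetry is deliberate: the assertion should never fire once the bug is fixed, whereas `PaymentTimeout` is part of the function's contract and callers are expected to handle it.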
### Fail fast for programmer errors

```python
# ❌ BAD: Defensive programming that hides bugs
def calculate_discount(order):
    if order is None:
        return 0  # Silently returns a wrong result
    return order.total * 0.1

# ✅ GOOD: Fail fast exposes the bug immediately
def calculate_discount(order):
    if order is None:
        raise ValueError("order cannot be None")
    return order.total * 0.1
```
### Retry with exponential backoff for transient failures

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')

def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    """Retry a function with exponential backoff."""
    attempt = 0
    delay = initial_delay
    while True:
        try:
            return func()
        except Exception as e:
            attempt += 1
            if attempt >= max_attempts:
                raise  # Re-raise after exhausting retries
            # Log the retry attempt
            print(f"Attempt {attempt} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # Exponential backoff with cap

# Usage
data = retry_with_backoff(lambda: fetch_from_api(), max_attempts=5)
```
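When many clients retry a shared dependency, fixed exponential delays can synchronize into retry storms. A common refinement is to randomize each delay ("full jitter"). The sketch below is illustrative—the function name and parameters are ours, not part of the code above:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar('T')

def retry_with_full_jitter(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    """Retry with exponential backoff, sleeping a random duration up to the cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt >= max_attempts:
                raise  # Out of attempts; surface the last error
            cap = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, cap))  # Full jitter: anywhere in [0, cap)
    raise ValueError("max_attempts must be >= 1")
```

The trade-off versus plain exponential backoff: individual retries may come sooner or later than the deterministic schedule, but the population of retrying clients spreads out instead of hammering the dependency in waves.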
### Circuit breaker for cascading failures

A circuit breaker prevents cascading failures by stopping requests to a failing dependency:

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func()
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage
api_breaker = CircuitBreaker(failure_threshold=3, timeout=30.0)
data = api_breaker.call(lambda: fetch_from_external_api())
```
## Exception design patterns

### Use specific exception types

```python
# ❌ BAD: Generic exceptions lose context
def withdraw(account, amount):
    if amount > account.balance:
        raise Exception("Can't withdraw")  # Too vague
    account.balance -= amount

# ✅ GOOD: Specific exceptions enable precise handling
class InsufficientFundsError(Exception):
    def __init__(self, account_id, requested, available):
        self.account_id = account_id
        self.requested = requested
        self.available = available
        super().__init__(
            f"Account {account_id}: insufficient funds. "
            f"Requested ${requested}, available ${available}"
        )

def withdraw(account, amount):
    if amount > account.balance:
        raise InsufficientFundsError(account.id, amount, account.balance)
    account.balance -= amount
```
### Chain exceptions to preserve context

```python
# ✅ GOOD: Chain exceptions to preserve full context
try:
    user = load_user_from_database(user_id)
except DatabaseConnectionError as e:
    raise UserLoadError(f"Failed to load user {user_id}") from e

# The 'from e' preserves the original exception chain
```
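As a quick, self-contained illustration of what `from e` actually records (the exception classes here are stand-ins for the ones above, and the database failure is simulated): the original error stays reachable on the wrapper via `__cause__`, and tracebacks print both exceptions.

```python
class DatabaseConnectionError(Exception):
    pass

class UserLoadError(Exception):
    pass

def load_user(user_id):
    try:
        raise DatabaseConnectionError("connection refused")  # Simulated failure
    except DatabaseConnectionError as e:
        # 'from e' sets __cause__ on the new exception
        raise UserLoadError(f"Failed to load user {user_id}") from e

try:
    load_user(42)
except UserLoadError as e:
    assert isinstance(e.__cause__, DatabaseConnectionError)
```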
### Result types as an alternative to exceptions

In languages that support them (Rust, Haskell, TypeScript with libraries), Result types make error handling explicit:

```typescript
// TypeScript example using a Result type
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

type Config = { apiKey: string };

function parseConfig(raw: string): Result<Config, string> {
  try {
    const config = JSON.parse(raw);
    if (!config.apiKey) {
      return { ok: false, error: 'Missing apiKey' };
    }
    return { ok: true, value: config };
  } catch (e) {
    return { ok: false, error: `Invalid JSON: ${(e as Error).message}` };
  }
}

// Caller must explicitly handle both cases
const result = parseConfig(rawData);
if (result.ok) {
  console.log('Config loaded:', result.value);
} else {
  console.error('Config error:', result.error);
}
```
## Error handling at different layers

### API/Controller layer

Validate input and translate domain exceptions to HTTP status codes:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/accounts/<account_id>/withdraw', methods=['POST'])
def withdraw_endpoint(account_id):
    try:
        amount = request.json.get('amount')
        if not amount or amount <= 0:
            return jsonify({
                'error': 'invalid_amount',
                'message': 'Amount must be positive'
            }), 400
        account = account_service.get_account(account_id)
        account_service.withdraw(account, amount)
        return jsonify({'balance': account.balance}), 200
    except AccountNotFoundError as e:
        return jsonify({
            'error': 'account_not_found',
            'message': str(e)
        }), 404
    except InsufficientFundsError as e:
        return jsonify({
            'error': 'insufficient_funds',
            'message': str(e),
            'available': e.available
        }), 409
    except Exception as e:
        # Log unexpected errors
        logger.error(f"Unexpected error in withdraw: {e}", exc_info=True)
        return jsonify({
            'error': 'internal_error',
            'message': 'An unexpected error occurred'
        }), 500
```
### Service/Business logic layer

Handle domain-specific errors and propagate with context:

```python
class AccountService:
    def withdraw(self, account, amount):
        if amount > account.balance:
            raise InsufficientFundsError(account.id, amount, account.balance)
        try:
            transaction = self._create_transaction(account.id, -amount)
            account.balance -= amount
            self.db.save(account)
            self.db.save(transaction)
        except DatabaseError as e:
            raise TransactionError(f"Failed to process withdrawal for {account.id}") from e
```
### Data layer

Wrap database/infrastructure errors:

```python
import sqlite3

class AccountRepository:
    def save(self, account):
        try:
            self.connection.execute(
                "UPDATE accounts SET balance = ? WHERE id = ?",
                (account.balance, account.id)
            )
            self.connection.commit()
        except sqlite3.OperationalError as e:
            raise DatabaseError(f"Failed to save account {account.id}") from e
```
## Structured error responses

Provide consistent, machine-readable error formats:

```json
{
  "error": {
    "code": "insufficient_funds",
    "message": "Account 12345: insufficient funds. Requested $500.00, available $250.00",
    "details": {
      "account_id": "12345",
      "requested_amount": 500.0,
      "available_amount": 250.0
    },
    "timestamp": "2025-10-19T14:32:10Z",
    "request_id": "req_abc123"
  }
}
```
### Key fields in error responses

- **code** — Machine-readable error identifier (e.g., `invalid_input`, `rate_limit_exceeded`)
- **message** — Human-readable description
- **details** — Context-specific data (field names, constraint values, etc.)
- **timestamp** — When the error occurred
- **request_id** — Correlation ID for tracing across logs
## Logging and monitoring

### What to log
- Error type and message
- Stack trace (for unexpected errors)
- Request/correlation ID (to trace across services)
- User/session context (sanitized, no PII in plain logs)
- Input parameters (sanitized)
- Timestamps and latency
### Example structured error log

```python
import structlog

logger = structlog.get_logger()

try:
    result = process_payment(order_id, amount)
except PaymentGatewayError as e:
    logger.error(
        "payment_failed",
        order_id=order_id,
        amount=amount,
        gateway="stripe",
        error_code=e.code,
        error_message=str(e),
        request_id=request.id,
        exc_info=True  # Include stack trace
    )
    raise
```
For comprehensive guidance on logging strategies and structured logging, see our Logging chapter.
## Error handling anti-patterns

### 1. Swallowing exceptions

```python
# ❌ BAD: Silent failure makes debugging impossible
try:
    send_email(user.email, message)
except Exception:
    pass  # Error disappears without a trace

# ✅ GOOD: Log and handle appropriately
try:
    send_email(user.email, message)
except EmailServiceError as e:
    logger.warning(f"Failed to send email to {user.email}: {e}")
    # Maybe add to a retry queue or alert an admin
    error_queue.add_retry(user.email, message)
```
### 2. Overly broad exception handlers

```python
# ❌ BAD: Catches every exception, masking unrelated bugs
try:
    data = fetch_data()
    process(data)
except Exception:
    return "Error occurred"  # Too vague

# ✅ GOOD: Catch specific, expected exceptions
try:
    data = fetch_data()
    process(data)
except NetworkError as e:
    logger.error(f"Network error: {e}")
    return {"error": "network_error", "message": str(e)}
except ValidationError as e:
    return {"error": "validation_error", "message": str(e)}
# Let unexpected exceptions bubble up
```
### 3. Using exceptions for control flow

```python
# ❌ BAD: Exceptions are expensive for normal flow
def get_user_or_default(user_id):
    try:
        return database.get_user(user_id)
    except UserNotFound:
        return AnonymousUser()  # This is normal, not exceptional

# ✅ GOOD: Use explicit checks for expected branches
def get_user_or_default(user_id):
    user = database.get_user_or_none(user_id)
    return user if user else AnonymousUser()
```
### 4. Leaking implementation details

```python
# ❌ BAD: Exposes internal database details to API clients
@app.route('/users/<user_id>')
def get_user(user_id):
    try:
        return get_user_from_db(user_id)
    except psycopg2.IntegrityError as e:
        return str(e), 500  # Raw database error exposed

# ✅ GOOD: Translate to domain errors
@app.route('/users/<user_id>')
def get_user(user_id):
    try:
        return get_user_from_db(user_id)
    except DatabaseError:
        logger.error(f"Database error fetching user {user_id}", exc_info=True)
        return {"error": "database_error", "message": "Failed to retrieve user"}, 500
```
### 5. Missing error context

```python
# ❌ BAD: No information about what failed
raise ValueError("Invalid input")

# ✅ GOOD: Include actionable context
raise ValueError(
    f"Invalid email format: '{email}'. Expected format: user@domain.com"
)
```
## Testing error handling

### Test each error path

```python
import pytest

def test_withdraw_insufficient_funds():
    account = Account(id="123", balance=100.0)
    with pytest.raises(InsufficientFundsError) as exc_info:
        account.withdraw(200.0)
    assert exc_info.value.account_id == "123"
    assert exc_info.value.requested == 200.0
    assert exc_info.value.available == 100.0
    assert "insufficient funds" in str(exc_info.value).lower()

def test_withdraw_invalid_amount():
    account = Account(id="123", balance=100.0)
    with pytest.raises(ValueError, match="Amount must be positive"):
        account.withdraw(-50.0)
```
### Test retry logic

```python
def test_retry_eventually_succeeds():
    call_count = 0

    def flaky_function():
        nonlocal call_count
        call_count += 1
        if call_count < 3:
            raise NetworkError("Temporary failure")
        return "success"

    result = retry_with_backoff(flaky_function, max_attempts=5)
    assert result == "success"
    assert call_count == 3

def test_retry_exhausts_attempts():
    def always_fails():
        raise NetworkError("Permanent failure")

    with pytest.raises(NetworkError):
        retry_with_backoff(always_fails, max_attempts=3)
```
For comprehensive testing strategies including error scenarios, see our Software Testing chapter.
## Error handling checklist
Run through this before finalizing error handling in a new feature:
**Detection & Validation**
- ✅ Are inputs validated at system boundaries (API, file uploads, user forms)?
- ✅ Does the code fail fast on invalid state or arguments?
- ✅ Are errors detected early, before state corruption?
**Error Types & Context**
- ✅ Are custom exception types used for domain-specific errors?
- ✅ Do error messages include what, why, and what to do next?
- ✅ Are stack traces preserved when wrapping exceptions?
- ✅ Is sensitive data (passwords, tokens) excluded from error messages and logs?
**Handling & Recovery**
- ✅ Are transient failures (network, timeout) retried with backoff?
- ✅ Are circuit breakers used for external dependencies?
- ✅ Is there graceful degradation or fallback behavior where appropriate?
- ✅ Are errors handled at the correct layer (API vs. service vs. data)?
**Logging & Monitoring**
- ✅ Are errors logged with structured context (request ID, user ID, parameters)?
- ✅ Are critical errors alerting the team via monitoring systems?
- ✅ Are error rates tracked in dashboards?
**Testing & Documentation**
- ✅ Is every error path covered by tests?
- ✅ Are retry and circuit breaker behaviors tested?
- ✅ Are error response formats documented for API consumers?
## Further reading

### Books
- Release It! (2nd ed.) by Michael T. Nygard — Comprehensive patterns for production stability, including circuit breakers, timeouts, and bulkheads
- The Pragmatic Programmer by Andrew Hunt & David Thomas — Crash early principles and defensive programming
- Site Reliability Engineering by Google — Error budgets, monitoring, and incident response
- Distributed Systems Observability by Cindy Sridharan — Logging, metrics, and tracing for diagnosing failures
### Online Resources
- Resilience4j Documentation — Java library implementing circuit breaker, retry, and rate limiting patterns
- Error Handling in Node.js — Official Node.js guide to error handling
- Go Error Handling Best Practices — Idiomatic Go error handling
### Research & Standards
- RFC 7807: Problem Details for HTTP APIs — Standard format for HTTP error responses
- Netflix Engineering Blog — Circuit breaker patterns and resilience engineering
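For reference, RFC 7807 defines a standard media type (`application/problem+json`) whose fields map closely onto the structured error responses shown earlier in this chapter. A minimal problem document looks like this (the URI and values are illustrative):

```json
{
  "type": "https://example.com/probs/insufficient-funds",
  "title": "Insufficient funds",
  "status": 409,
  "detail": "Requested $500.00, available $250.00",
  "instance": "/accounts/12345/withdrawals"
}
```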
## Key takeaways

- **Fail fast for bugs, recover gracefully for operations** — Distinguish programmer errors (bugs to fix) from operational errors (expected failures to handle).
- **Make errors explicit and contextual** — Use specific exception types or Result types; include what failed, why, and what to do next.
- **Handle errors at the right layer** — Validate at boundaries, translate at API layers, and centralize logging and monitoring.
- **Retry transient failures** — Use exponential backoff for network errors, timeouts, and rate limits.
- **Prevent cascading failures** — Implement circuit breakers and timeouts to protect against failing dependencies.
- **Log with structure and context** — Include request IDs, user context, and stack traces to enable fast debugging.
- **Test every error path** — Errors are part of your API contract; test them as thoroughly as success cases.
- **Never swallow errors silently** — Empty catch blocks are bugs. Log, handle, or propagate—never ignore.

**Remember:** Error handling is not defensive paranoia—it's a design discipline that makes systems safer, more observable, and easier to debug. Every unhandled error is a missed opportunity to build trust with your users and your future self.