server.web.middleware.db_reset_retry

ASGI middleware that retries a request once on transient DB connection resets.

Azure Private Link (and similar cloud NATs / load balancers) periodically sever idle TCP connections. Those resets surface from SQLAlchemy as OperationalError with MySQL errno 2006 / 2013 / 2026. pool_pre_ping=True already validates pooled connections at checkout, but it cannot help when a query is already in flight or when the commit roundtrip that happens in jobmon.server.web.db.deps.get_db’s dependency finalizer hits the reset.

This middleware wraps the entire request pipeline — handler body AND dependency finalizers — in a single retry boundary. A transient connection-reset exception triggers one retry with a short backoff. Other OperationalError variants (deadlocks 1213, lock timeouts 1205, etc.) propagate immediately.

Implementation notes

We intentionally implement this as raw ASGI middleware rather than starlette.middleware.base.BaseHTTPMiddleware. BaseHTTPMiddleware wires call_next through anyio memory streams that are closed after a single use, so it cannot re-invoke the downstream app. Raw ASGI lets us buffer the request body once, replay it on each attempt, and stage the outgoing send() messages so a failed attempt can be discarded cleanly before any bytes reach the client.

Safety

get_db defers session.commit() until after the handler returns and rolls back on any in-handler exception before the session is closed. That means a mid-handler reset guarantees no write was committed, so a replay is safe. The rarer commit-phase race (MySQL committed but client got RST before ACK) has identical semantics whether or not we retry; jobmon’s hash-based unique constraints on the important writes already absorb the duplicate case.

Attributes

logger

Classes

DBResetRetryMiddleware

ASGI middleware that retries one HTTP request on a transient DB reset.

Functions

is_connection_reset(→ bool)

Return True iff exc represents a transient connection-loss error.

Module Contents

server.web.middleware.db_reset_retry.logger

server.web.middleware.db_reset_retry.is_connection_reset(exc: BaseException) → bool

Return True iff exc represents a transient connection-loss error.

Only errors clearly caused by a severed connection are retryable. Deadlocks (1213), lock timeouts (1205), integrity errors, etc. must propagate.

class server.web.middleware.db_reset_retry.DBResetRetryMiddleware(app: starlette.types.ASGIApp, max_attempts: int = DEFAULT_MAX_ATTEMPTS, backoff_seconds: float = DEFAULT_BACKOFF_SECONDS, budget_seconds: float = DEFAULT_BUDGET_SECONDS)

ASGI middleware that retries one HTTP request on a transient DB reset.

Non-HTTP scopes (lifespan, websocket) are forwarded unchanged.

Store retry policy.

max_attempts must be >= 1. budget_seconds caps total time spent retrying so a slow-query + retry doesn’t blow past the client’s read_timeout (default 20s). We stop retrying when the next backoff would land us outside the budget, even if attempts remain.

DEFAULT_MAX_ATTEMPTS = 2

DEFAULT_BACKOFF_SECONDS = 0.2

DEFAULT_BUDGET_SECONDS = 15.0

SCOPE_STATE_KEY = 'db_reset_retry'

app

static should_retry_connection_reset(scope: starlette.types.Scope) → bool

Return True iff a retry attempt remains within the configured budget.

Called from the generic exception handler to decide whether to re-raise a connection-reset error (so this middleware sees it and retries) or to let it flow through to a normal error response. Safe to call when no retry middleware is registered — returns False.