server.web.middleware.db_reset_retry
ASGI middleware that retries a request once on transient DB connection resets.
Azure Private Link (and similar cloud NATs / load balancers) periodically
sever idle TCP connections. Those resets surface from SQLAlchemy as
OperationalError with MySQL errno 2006 / 2013 / 2026.
pool_pre_ping=True already validates pooled connections at checkout,
but it cannot help when a query is already in flight or when the commit
roundtrip that happens in jobmon.server.web.db.deps.get_db’s
dependency finalizer hits the reset.
This middleware wraps the entire request pipeline — handler body AND
dependency finalizers — in a single retry boundary. A transient
connection-reset exception triggers one retry with a short backoff.
Other OperationalError variants (deadlocks 1213, lock timeouts 1205,
etc.) propagate immediately.
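The split between retryable and non-retryable errors can be sketched as a small errno check. This is a minimal illustration, not the module's actual code: `looks_like_connection_reset` and `CONNECTION_RESET_ERRNOS` are hypothetical names, and the sketch assumes the driver exception is exposed on `.orig` (as SQLAlchemy does) with the numeric errno in `args[0]`. A real classifier would presumably also check that the exception is an OperationalError.

```python
# MySQL client errnos that the text above treats as connection resets.
CONNECTION_RESET_ERRNOS = frozenset({2006, 2013, 2026})


def looks_like_connection_reset(exc: BaseException) -> bool:
    """Hypothetical classifier: True only for the reset errnos listed above."""
    # SQLAlchemy exposes the wrapped driver exception as `.orig`.
    orig = getattr(exc, "orig", None)
    args = getattr(orig, "args", ())
    if not args:
        return False
    # Assumption: the driver stores the numeric errno in args[0].
    errno = args[0]
    return isinstance(errno, int) and errno in CONNECTION_RESET_ERRNOS
```

Deadlocks (1213) and lock timeouts (1205) fall through to `False`, so they would propagate without a retry.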
Implementation notes
We intentionally implement this as raw ASGI middleware rather than
starlette.middleware.base.BaseHTTPMiddleware. BaseHTTPMiddleware
wires call_next through anyio memory streams that are closed after a
single use, so it cannot re-invoke the downstream app. Raw ASGI lets us
buffer the request body once, replay it on each attempt, and stage the
outgoing send() messages so a failed attempt can be discarded cleanly
before any bytes reach the client.
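The buffer, replay, and stage pattern described above might look roughly like the following. It is a simplified sketch under stated assumptions, not the real DBResetRetryMiddleware: `RetryOnceSketch`, `TransientError`, and the helper functions are hypothetical, and the sketch omits the time budget and the real error classification.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a transient connection-reset error."""


def replaying_receive(body, receive):
    """Build a receive() that replays the buffered body, then defers to the original."""
    delivered = False

    async def inner():
        nonlocal delivered
        if not delivered:
            delivered = True
            return {"type": "http.request", "body": body, "more_body": False}
        return await receive()

    return inner


def staging_send(buffer):
    """Build a send() that stages messages in a list instead of writing them out."""

    async def inner(message):
        buffer.append(message)

    return inner


class RetryOnceSketch:
    def __init__(self, app, max_attempts=2, backoff_seconds=0.0):
        self.app = app
        self.max_attempts = max_attempts
        self.backoff_seconds = backoff_seconds

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Non-HTTP scopes are forwarded unchanged.
            await self.app(scope, receive, send)
            return

        # Drain the request body once so every attempt can replay it.
        chunks = []
        while True:
            message = await receive()
            if message["type"] != "http.request":
                break
            chunks.append(message.get("body", b""))
            if not message.get("more_body", False):
                break
        body = b"".join(chunks)

        for attempt in range(1, self.max_attempts + 1):
            staged = []
            try:
                await self.app(scope, replaying_receive(body, receive), staging_send(staged))
            except TransientError:
                if attempt == self.max_attempts:
                    raise
                # Discard the staged messages of the failed attempt and try again.
                await asyncio.sleep(self.backoff_seconds)
                continue
            # The attempt succeeded: flush the staged response to the client.
            for message in staged:
                await send(message)
            return
```

One trade-off of staging is that the whole response is held in memory until the attempt finishes, so a streaming response would be delivered only once it is complete.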
Safety
get_db defers session.commit() until after the handler returns
and rolls back on any in-handler exception before the session is closed.
That means a mid-handler reset guarantees no write was committed, so a
replay is safe. The rarer commit-phase race (MySQL committed but client
got RST before ACK) has identical semantics whether or not we retry;
jobmon’s hash-based unique constraints on the important writes already
absorb the duplicate case.
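The commit-after-handler ordering that this argument relies on can be illustrated with a generator-style dependency. This is a sketch of the described behaviour only, not the actual get_db: `get_db_sketch` and `RecordingSession` are hypothetical, and the session is a stand-in that merely records which methods were called.

```python
class RecordingSession:
    """Hypothetical stand-in for a database session; it only records calls."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


def get_db_sketch(session):
    """Yield the session to the handler; commit only after the handler returns."""
    try:
        yield session
    except Exception:
        # The handler raised: nothing has been committed, so roll back.
        session.rollback()
        raise
    else:
        # The handler returned normally: this is the deferred commit.
        session.commit()
    finally:
        session.close()
```

In this shape, an error raised inside the handler reaches the `except` branch before any commit, while an error raised by `commit()` itself escapes from the finalizer, which is the case that only an outer retry boundary can observe.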
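Attributes
- logger
Classes
- DBResetRetryMiddleware: ASGI middleware that retries one HTTP request on a transient DB reset.
Functions
- is_connection_reset(exc): Return True iff exc represents a transient connection-loss error.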
Module Contents
- server.web.middleware.db_reset_retry.logger
- server.web.middleware.db_reset_retry.is_connection_reset(exc: BaseException) -> bool
Return True iff exc represents a transient connection-loss error.
Only errors clearly caused by a severed connection are retryable. Deadlocks (1213), lock timeouts (1205), integrity errors, etc. must propagate.
- class server.web.middleware.db_reset_retry.DBResetRetryMiddleware(app: starlette.types.ASGIApp, max_attempts: int = DEFAULT_MAX_ATTEMPTS, backoff_seconds: float = DEFAULT_BACKOFF_SECONDS, budget_seconds: float = DEFAULT_BUDGET_SECONDS)
ASGI middleware that retries one HTTP request on a transient DB reset.
Non-HTTP scopes (lifespan, websocket) are forwarded unchanged.
Store retry policy.
max_attempts must be >= 1. budget_seconds caps the total time spent retrying so that a slow query plus a retry doesn’t blow past the client’s read_timeout (default 20s). We stop retrying when the next backoff would land us outside the budget, even if attempts remain.
- DEFAULT_MAX_ATTEMPTS = 2
- DEFAULT_BACKOFF_SECONDS = 0.2
- DEFAULT_BUDGET_SECONDS = 15.0
- SCOPE_STATE_KEY = 'db_reset_retry'
- app
- static should_retry_connection_reset(scope: starlette.types.Scope) -> bool
Return True iff a retry attempt remains within the configured budget.
Called from the generic exception handler to decide whether to re-raise a connection-reset error (so this middleware sees it and retries) or to let it flow through to a normal error response. Safe to call when no retry middleware is registered — returns False.
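
The interaction between the attempt limit and the time budget can be sketched as a pure function. This is an illustrative assumption about the decision rule, not the middleware's code: `retry_allowed` and its parameters are hypothetical, and whether the real comparison at the exact budget boundary is strict is not stated in the documentation.

```python
import time


def retry_allowed(attempts_made, max_attempts, started_at,
                  backoff_seconds, budget_seconds, now=None):
    """Hypothetical check: True if another attempt fits the attempt and time limits."""
    if now is None:
        now = time.monotonic()
    # No attempts left: stop regardless of the remaining time.
    if attempts_made >= max_attempts:
        return False
    # Stop if sleeping for the backoff would land outside the budget.
    elapsed = now - started_at
    return elapsed + backoff_seconds <= budget_seconds
```

With the documented defaults (2 attempts, 0.2 s backoff, 15 s budget), a first attempt that fails after 1 second would still be retried, while one that fails after 14.9 seconds would not, even though an attempt remains.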