server.web.middleware.db_reset_retry
====================================

.. py:module:: server.web.middleware.db_reset_retry

.. autoapi-nested-parse::

   ASGI middleware that retries a request once on transient DB connection resets.

   Azure Private Link (and similar cloud NATs / load balancers) periodically
   sever idle TCP connections. Those resets surface from SQLAlchemy as
   ``OperationalError`` with MySQL errno 2006 / 2013 / 2026.
   ``pool_pre_ping=True`` already validates pooled connections at checkout, but
   it cannot help when a query is already in flight or when the commit roundtrip
   that happens in ``jobmon.server.web.db.deps.get_db``'s dependency finalizer
   hits the reset.

   This middleware wraps the entire request pipeline (handler body AND
   dependency finalizers) in a single retry boundary. A transient
   connection-reset exception triggers one retry with a short backoff. Other
   ``OperationalError`` variants (deadlocks 1213, lock timeouts 1205, etc.)
   propagate immediately.

   Implementation notes
   --------------------

   We intentionally implement this as raw ASGI middleware rather than
   ``starlette.middleware.base.BaseHTTPMiddleware``. ``BaseHTTPMiddleware``
   wires ``call_next`` through anyio memory streams that are closed after a
   single use, so it cannot re-invoke the downstream app. Raw ASGI lets us
   buffer the request body once, replay it on each attempt, and stage the
   outgoing send() messages so a failed attempt can be discarded cleanly before
   any bytes reach the client.

   Safety
   ------

   ``get_db`` defers ``session.commit()`` until after the handler returns and
   rolls back on any in-handler exception before the session is closed. That
   means a mid-handler reset guarantees no write was committed, so a replay is
   safe. The rarer commit-phase race (MySQL committed but client got RST before
   ACK) has identical semantics whether or not we retry; jobmon's hash-based
   unique constraints on the important writes already absorb the duplicate case.


Attributes
----------

.. autoapisummary::

   server.web.middleware.db_reset_retry.logger


Classes
-------

.. autoapisummary::

   server.web.middleware.db_reset_retry.DBResetRetryMiddleware


Functions
---------

.. autoapisummary::

   server.web.middleware.db_reset_retry.is_connection_reset


Module Contents
---------------

.. py:data:: logger

.. py:function:: is_connection_reset(exc: BaseException) -> bool

   Return True iff ``exc`` represents a transient connection-loss error.

   Only errors clearly caused by a severed connection are retryable.
   Deadlocks (1213), lock timeouts (1205), integrity errors, etc. must
   propagate.


.. py:class:: DBResetRetryMiddleware(app: starlette.types.ASGIApp, max_attempts: int = DEFAULT_MAX_ATTEMPTS, backoff_seconds: float = DEFAULT_BACKOFF_SECONDS, budget_seconds: float = DEFAULT_BUDGET_SECONDS)

   ASGI middleware that retries one HTTP request on a transient DB reset.

   Non-HTTP scopes (lifespan, websocket) are forwarded unchanged.

   Store retry policy.

   ``max_attempts`` must be >= 1. ``budget_seconds`` caps total time spent
   retrying so a slow-query + retry doesn't blow past the client's
   read_timeout (default 20s). We stop retrying when the next backoff
   would land us outside the budget, even if attempts remain.


   .. py:attribute:: DEFAULT_MAX_ATTEMPTS
      :value: 2


   .. py:attribute:: DEFAULT_BACKOFF_SECONDS
      :value: 0.2


   .. py:attribute:: DEFAULT_BUDGET_SECONDS
      :value: 15.0


   .. py:attribute:: SCOPE_STATE_KEY
      :value: 'db_reset_retry'


   .. py:attribute:: app


   .. py:method:: should_retry_connection_reset(scope: starlette.types.Scope) -> bool
      :staticmethod:

      Return True iff a retry attempt remains within the configured budget.

      Called from the generic exception handler to decide whether to
      re-raise a connection-reset error (so this middleware sees it and
      retries) or to let it flow through to a normal error response.
      Safe to call when no retry middleware is registered; in that case it
      returns False.
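
The buffer, replay, and stage approach described in the implementation notes can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: ``RetryOnceSketch`` and ``looks_like_reset`` are hypothetical names, the ``budget_seconds`` cap is left out for brevity, and the classifier assumes the driver error is reachable as ``exc.orig`` with the MySQL errno in ``args[0]``.

.. code-block:: python

   import asyncio

   RESET_ERRNOS = {2006, 2013, 2026}  # errno values named in the module docs


   def looks_like_reset(exc):
       """Hypothetical, duck-typed stand-in for is_connection_reset."""
       orig = getattr(exc, "orig", None)  # assumed: driver error exposed as .orig
       args = getattr(orig, "args", ())
       return bool(args) and args[0] in RESET_ERRNOS


   class RetryOnceSketch:
       """Illustrative raw ASGI wrapper; not the real DBResetRetryMiddleware."""

       def __init__(self, app, max_attempts=2, backoff_seconds=0.2):
           self.app = app
           self.max_attempts = max_attempts
           self.backoff_seconds = backoff_seconds

       async def __call__(self, scope, receive, send):
           if scope["type"] != "http":
               # Lifespan and websocket scopes pass through untouched.
               await self.app(scope, receive, send)
               return

           # Buffer the request body once so every attempt can replay it.
           body, more = b"", True
           while more:
               message = await receive()
               if message["type"] != "http.request":
                   break
               body += message.get("body", b"")
               more = message.get("more_body", False)

           for attempt in range(1, self.max_attempts + 1):
               staged = []  # outgoing messages held back until success
               replayed = False

               async def replay_receive():
                   nonlocal replayed
                   if not replayed:
                       replayed = True
                       return {"type": "http.request", "body": body, "more_body": False}
                   return await receive()  # later calls see disconnects

               async def staged_send(message):
                   staged.append(message)

               try:
                   await self.app(scope, replay_receive, staged_send)
               except Exception as exc:
                   if attempt < self.max_attempts and looks_like_reset(exc):
                       # Drop the staged messages; nothing reached the client.
                       await asyncio.sleep(self.backoff_seconds)
                       continue
                   raise
               for message in staged:
                   await send(message)
               return

Staging every ``send()`` message until the downstream app returns is what makes the replay invisible to the client. The trade-off is that the whole response is held in memory, so a sketch like this would not suit streaming endpoints.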