Applicant Portal Critical Service Outage

Incident Report for Goal Get It

Postmortem: Applicant Portal Outage During Two-Node Rollout

Date: February 14, 2026
Primary incident window: ~11:00 AM to ~3:00 PM PT
Follow-up hardening window: ~3:00 PM to ~5:05 PM PT
Status: Service restored, monitoring in place, hardening completed

Executive Summary
On February 14, 2026, we migrated production from a single-node deployment to a two-node architecture behind a Cloudflare Load Balancer to increase capacity and resilience for decision-day traffic. During and after the rollout, multiple independent issues occurred across process startup, container orchestration, database connectivity, TLS/origin configuration, load balancer health checks, and cross-node environment parity. These issues interacted to produce intermittent and, at times, severe disruption to the applicant portal and staff/admin experiences. Because the system had transitioned from one origin to multiple origins, any configuration drift or partial startup state could manifest as “sometimes works, sometimes fails” behavior depending on which node served the request.

The outage was not caused by a single bug. It was a cascade of failures and mismatches introduced during the topology change, combined with insufficient “readiness” validation (health checks were too shallow) and insufficient preflight enforcement of cross-node parity (secrets, flags, and TLS/origin artifacts). Applicant data remained secure and intact throughout the incident.

Systems and Components Involved

  • Cloudflare Load Balancer (origin pool + health monitoring + routing)
  • Node 1 (primary): application stack + workers + maintenance + primary PostgreSQL
  • Node 2 (app-only): backend/frontend/Caddy, connecting to the primary PostgreSQL on node 1
  • Caddy (origin web server / TLS termination at the origin)
  • Backend: Django behind Gunicorn
  • Frontend: Next.js (build + runtime)
  • Background workers and maintenance services (containerized)
  • Docker Compose orchestration and deployment automation
  • Production secrets/configuration (.env and integration credentials)

Customer Impact
Applicants intermittently experienced:

  • Inability to log in or complete login flows
  • HTTP 400/500 responses on portal actions
  • Timeouts and unstable portal behavior (requests failing non-deterministically)
  • TLS/origin failures on some requests (connection errors surfaced to users as failures/timeouts)

Staff intermittently experienced:

  • Manage/SSO/login inconsistencies (success depended on which origin handled the request)
  • Admin login failures (including 500s when routed to a partially broken node)
  • Inconsistent behavior in management features that depend on stable auth/session behavior

Operational impact:

  • Cloudflare LB routed traffic to origins that were “up” at a basic liveness level but not truly ready to serve end-to-end application traffic
  • Deployment automation and existing checks did not initially catch key broken states reliably (especially cross-node parity and application-path readiness)
  • Troubleshooting was complicated by multi-cause failure and by routing-dependent symptoms

Data and Security
Applicant data remained secure and intact. The incident affected availability and correctness (ability to log in, load pages, complete flows), not data integrity. There was no indication of data loss, data exposure, or unauthorized access as part of this event. Several failures were caused by missing or mismatched keys; those conditions typically prevent successful decryption/verification and therefore fail closed (deny or error), rather than exposing secrets.
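
To make the fail-closed point concrete, here is a minimal sketch that uses symmetric Fernet keys from the Python cryptography library as a stand-in for the portal's actual login-payload scheme (which is not described here): a payload encrypted under one node's key cannot be read on a node holding a different key, and the attempt errors out rather than revealing anything.

    # Illustrative only: Fernet stands in for the real login-payload encryption
    # to show that a key mismatch fails closed (an error, never partial plaintext).
    from cryptography.fernet import Fernet, InvalidToken

    node1_key = Fernet.generate_key()   # key held by node 1
    node2_key = Fernet.generate_key()   # drifted key held by node 2

    token = Fernet(node1_key).encrypt(b"login payload")

    try:
        Fernet(node2_key).decrypt(token)
    except InvalidToken:
        # The mismatched node cannot read the payload; the request simply fails.
        print("decryption refused: key mismatch fails closed")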

What Changed Before the Incident (Architecture Change)
We moved from a single-node production setup to:

  • Server 1 (Node 1): primary node hosting DB + app + workers + maintenance
  • Server 2 (Node 2): app-only node (backend/frontend/Caddy) sharing the primary DB on node 1
  • Cloudflare Load Balancer distributing traffic across both nodes

This change introduced new classes of failure modes that did not exist in the single-node topology:

  • Cross-node secret and configuration drift can cause authentication/session inconsistency across nodes
  • Cross-node TLS/origin parity becomes mandatory when both nodes are serving as origins
  • Cross-node DB routing/network constraints must be valid from every app node to the primary DB
  • Health checks must validate “application readiness” (DB connectivity, auth behavior, critical endpoints), not merely process liveness
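
As a concrete sketch of the last point, a readiness endpoint can exercise the database and verify required configuration before reporting healthy. This assumes a Django backend; the setting names and response shape are illustrative, not the portal's actual code.

    # Sketch of a readiness (not liveness) view for a Django backend.
    # Setting names are illustrative; the real critical set is defined elsewhere.
    from django.conf import settings
    from django.db import connection
    from django.http import JsonResponse

    def readiness(request):
        problems = []

        # Exercise the database path, not just process liveness.
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1")
        except Exception as exc:
            problems.append(f"database: {exc}")

        # Refuse to report ready if security-critical configuration is absent.
        for name in ("LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY", "BACKUP_ENCRYPTION_KEY"):
            if not getattr(settings, name, None):
                problems.append(f"missing setting: {name}")

        return JsonResponse(
            {"ready": not problems, "problems": problems},
            status=200 if not problems else 503,
        )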

Timeline (PT, approximate)
~11:00 AM: Two-node rollout began. Cloudflare LB pool creation/config and node 2 setup initiated.
Shortly after: Node 1 experienced backend startup instability due to a startup config issue affecting Gunicorn worker launch. In one deployment path, WEB_CONCURRENCY was effectively empty, causing Gunicorn worker startup failures and marking the node unhealthy (a defensive configuration sketch follows the timeline).
During node 2 bring-up: Docker Compose / dependency mismatches caused partial startup behavior. Services did not come up in a clean, deterministic app-only posture, creating states where some containers were running but the overall application was not end-to-end ready.
Next: Node 2 backend could not initialize reliably because node 2 could not reach the primary PostgreSQL instance (TCP 5432 connection timeouts). This blocked backend startup and any requests requiring DB access.
Next: Node 2 experienced Django worker boot failures due to missing production security config. Specifically, BACKUP_ENCRYPTION_KEY was missing while security maintenance was enabled, causing startup-time validation failures and preventing workers from coming online correctly.
Next: TLS/origin setup on node 2 was incomplete. Caddy certificate acquisition / ACME challenge flows did not complete successfully behind the proxied Cloudflare LB conditions, producing TLS/origin handshake failures and routing-specific HTTP 400 behavior.
As traffic was distributed: Cloudflare LB health behavior and routing exposed inconsistent node readiness. Because health checks were shallow, node 2 could appear “up” while still failing critical application paths.
After partial stabilization: Additional environment drift became visible between node 1 and node 2. Critical secrets/flags were not identical across nodes. Notably, LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY was invalid or different on node 2, and REQUIRE_SECURE_API differed between nodes. Additional secret/config drift also existed in some integrations. These differences caused intermittent failures depending on which node served the request, including login failures and SSO/manage inconsistencies.
~3:00 PM: Core user-facing functionality stabilized after DB connectivity restoration, TLS/origin correction, and cross-node env parity synchronization.
~5:05 PM: Monitoring update posted and hardening controls were completed to prevent recurrence.
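
For the WEB_CONCURRENCY failure noted early in the timeline, one defensive pattern is to have the Gunicorn config file refuse to pass an empty value through as the worker count and fall back to a CPU-based default. This is a sketch under that assumption, not the deployment's actual configuration.

    # gunicorn.conf.py sketch: tolerate an unset or empty WEB_CONCURRENCY rather
    # than letting an empty value break worker startup.
    import multiprocessing
    import os

    _raw = os.environ.get("WEB_CONCURRENCY", "").strip()

    if _raw.isdigit() and int(_raw) > 0:
        workers = int(_raw)
    else:
        # Conservative CPU-based default so the node still boots with capacity.
        workers = multiprocessing.cpu_count() * 2 + 1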

Technical Root Causes
This incident had a single “primary” systemic root cause and multiple “technical” root causes.

Primary systemic root cause:
A multi-node rollout was executed without strict preflight enforcement of cross-node configuration parity and without end-to-end readiness validation prior to admitting node 2 to production traffic behind the load balancer.

Technical root causes (the failures that directly produced outage symptoms):

  1. Backend startup config error caused Gunicorn worker crashes on one deployment path
  • WEB_CONCURRENCY was effectively empty on one deployment path, so Gunicorn could not launch workers reliably. When a node cannot keep workers running, it becomes unhealthy and capacity drops. During a topology change, losing stability on an existing origin amplifies the risk of introducing a second origin simultaneously.
  2. Node 2 compose/dependency mismatches created partial startup states
  • Compose/dependency/override mismatches caused node 2 to enter a partial startup state in which some services were running while others were not correctly initialized or were running with mismatched expectations. In multi-service deployments, “containers running” does not guarantee the app is ready; orchestration mismatches can create misleading “mostly up” conditions.
  3. Node 2 lacked reliable DB connectivity to the primary PostgreSQL instance
  • Node 2 could not establish TCP connections to the primary DB on port 5432 (timeouts). This prevented backend initialization and any DB-dependent endpoint from functioning. The underlying issue was that connectivity assumptions that held for node 1 (local DB) did not hold for node 2; node 2 needed a validated network path, routing, and firewall/provider allowances to the DB over the intended private network.
  4. Missing production security configuration on node 2 prevented Django worker boot
  • BACKUP_ENCRYPTION_KEY was missing while security maintenance was enabled. That combination caused Django startup-time failures (worker boot errors), effectively making the node unable to serve application traffic even if other containers were running (an illustrative startup check is sketched after this list).
  5. TLS/origin configuration on node 2 was incomplete; certificate acquisition failed under Cloudflare-proxied conditions
  • Caddy could not complete ACME challenge flow behind proxied Cloudflare LB conditions. In a two-origin setup, TLS must be correct on every origin node and must be compatible with the LB/origin flow. When origin TLS is inconsistent or incomplete, affected requests fail at the handshake layer or at routing layers (appearing as TLS errors, timeouts, or 4xx errors depending on the failure mode and routing path).
  6. Cloudflare LB health checks were insufficient to detect partial readiness and routing edge cases
  • The pre-existing health check strategy was too shallow (/healthz only). A node could respond “200 OK” to a basic liveness endpoint while still failing critical flows (DB-backed endpoints, login/auth endpoints, admin pages, or SSO initiation). This allowed Cloudflare to route traffic to a node that was not truly ready, exposing customers to intermittent errors.
  7. Cross-node configuration drift caused authentication and behavior to vary by origin
  • Critical drift existed between node 1 and node 2, including:

    • LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY invalid or different on node 2
    • REQUIRE_SECURE_API mismatched between nodes
    • Additional integration secrets/config drift
  • In a multi-node environment, auth/session and security-related secrets must be identical across origins. If they differ, requests may succeed on one node and fail on another, and multi-step workflows may break when step 1 and step 2 hit different nodes. That is why the outage presented as intermittent: load balancer routing made the user experience depend on node selection.
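
For root cause 4, one way to surface missing required security configuration as a clear, early failure is Django's system check framework, which runs at startup and during deploys. The following is a sketch; the SECURITY_MAINTENANCE_ENABLED flag name is illustrative.

    # Sketch using Django's system check framework: fail loudly when security
    # maintenance is enabled but the key it depends on is absent, instead of
    # letting workers crash later. Setting names here are illustrative.
    from django.conf import settings
    from django.core.checks import Error, register

    @register()
    def required_security_config(app_configs, **kwargs):
        errors = []
        maintenance_on = getattr(settings, "SECURITY_MAINTENANCE_ENABLED", False)
        if maintenance_on and not getattr(settings, "BACKUP_ENCRYPTION_KEY", None):
            errors.append(
                Error(
                    "BACKUP_ENCRYPTION_KEY must be set when security maintenance is enabled.",
                    id="deploy.E001",
                )
            )
        return errors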

Why the Issue Was Intermittent (Mechanics of “Sometimes Works”)
In single-node mode, all requests hit the same environment, same keys, same DB connectivity, and same TLS/origin behavior. In two-node mode behind a load balancer:

  • A request routed to node 1 might succeed while the same request routed to node 2 might fail (or vice versa) if configuration, keys, or connectivity differed.
  • Login flows are especially sensitive to key parity. If a node cannot decrypt/verify login payloads or session artifacts, users see login failures.
  • SSO and manage/admin flows are also sensitive to consistent security flags (e.g., requiring secure API), correct cookie/security policy, and identical integration credentials.
  • Even when a node responded to /healthz, it could still fail real application endpoints, which is why shallow health checks were not sufficient.

Detection and Signals
During the incident window, we observed patterns consistent with multi-origin inconsistency and partial readiness:

  • Spikes in HTTP 4xx/5xx responses
  • Increased login failures and user reports of inability to access the portal
  • TLS/origin handshake errors on requests routed to the incompletely configured origin
  • Inconsistent behavior depending on routing path and origin selection, indicating configuration drift and/or partial service readiness

Resolution (What We Did to Restore Service)
We restored service by addressing each failure class and removing node-to-node inconsistencies:

  1. Stabilized deployment model for node 2 (app-only)
  • Corrected compose/service behaviors and dependency expectations so node 2 starts cleanly in its intended role (backend/frontend/Caddy only, no DB). This removed partial startup ambiguity and made failures more diagnosable.
  2. Restored node 2 connectivity to the primary DB
  • Updated DB connectivity to use a validated private networking path and adjusted networking constraints so node 2 could reliably connect to the primary PostgreSQL instance on TCP 5432. Once node 2 had stable DB access, backend initialization and DB-backed endpoints could function reliably (a minimal connectivity preflight is sketched after this list).
  3. Added missing required security configuration on node 2
  • Ensured required production secrets were present, including BACKUP_ENCRYPTION_KEY under the enabled security maintenance mode, eliminating Django worker boot failures.
  4. Corrected TLS/origin configuration on node 2 and aligned origin TLS across both nodes
  • Installed and aligned origin certificate/key on both nodes. Eliminated reliance on ACME issuance flows that are incompatible with the Cloudflare-proxied LB/origin conditions. This removed TLS/internal-origin handshake failures and reduced routing-path-specific 400s tied to origin TLS issues.
  5. Enforced cross-node environment parity for critical secrets/flags
  • Synchronized critical .env values across both nodes, including login encryption key material and security flags such as REQUIRE_SECURE_API, plus additional integration secrets/configs. This removed routing-dependent behavior in login, manage/admin, and SSO flows.
  6. Corrected LB monitoring behavior and validated endpoints per origin
  • Validated origin behavior directly and ensured LB checks reflected actual readiness rather than basic liveness.
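
The connectivity restored in step 2 is also easy to verify up front. A minimal preflight sketch is below; the private address is a placeholder for wherever the primary database actually lives on the private network.

    # Preflight sketch: confirm the app-only node can reach the primary
    # PostgreSQL instance over the private network before starting the backend.
    import socket
    import sys

    DB_PRIVATE_HOST = "10.0.0.10"  # placeholder for node 1's private DB address
    DB_PORT = 5432                 # standard PostgreSQL port

    try:
        with socket.create_connection((DB_PRIVATE_HOST, DB_PORT), timeout=5):
            print("primary database reachable")
    except OSError as exc:
        print(f"primary database unreachable: {exc}", file=sys.stderr)
        sys.exit(1)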

Verification

  • Backend automated test suite is passing (160 tests).
  • Both nodes and LB health are actively monitored, including improved readiness signals and error-rate monitoring to detect regressions quickly.

Hardening Completed (Preventative Controls Now Enforced)
We implemented controls specifically designed to prevent two-node rollouts from admitting a broken node into the load balancer pool and to prevent drift from silently accumulating.

  1. Deploy-time parity gate for critical configuration
  • Added a deploy preflight that compares hashed values of critical secrets/config across node 1 and node 2 and fails the deploy immediately if drift is detected. This includes authentication and security-sensitive key material and flags that must match across origins (a sketch follows this list).
  2. Deploy-time parity gate for origin TLS artifacts
  • Added preflight checks for origin certificate parity by comparing checksums of origin.crt and origin.key across nodes. Any mismatch blocks deploy. This prevents TLS/origin drift and eliminates “works on one node, fails on the other” TLS behavior.
  3. Stronger backend health checks (readiness, not just liveness)
  • Updated health checks so that a node is only considered healthy if it can serve application-critical paths. This includes /api/v1/healthz and an application-path check such as /admin/login that exercises routing, templates, and backend availability in a way closer to real usage.
  4. Node-local post-deploy smoke checks (end-to-end)
  • Added smoke checks that run per node after deploy and before admitting/continuing traffic, covering:

    • health endpoint
    • admin login page
    • static asset availability
    • login encryption key endpoint (when enabled)
    • manage SSO start endpoint (when enabled)
      These checks are explicitly intended to catch the exact failure modes we saw: partial app readiness, auth/key issues, static serving issues, and SSO path issues.
  5. Operational monitoring improvements
  • Continued monitoring across both origins and LB health, including alerting for elevated 4xx/5xx rates, auth/login failures, and TLS/origin handshake anomalies.
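
To illustrate the parity gate in item 1, the sketch below computes a digest per critical key from a node's .env file and compares the digest maps from two nodes, failing on any mismatch. The key list and file names are illustrative; the real gate runs inside the deploy automation.

    # Parity gate sketch: compare per-key digests of critical values across
    # nodes and fail the deploy on any drift. Keys and paths are illustrative.
    import hashlib
    import sys

    CRITICAL_KEYS = [
        "LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY",
        "REQUIRE_SECURE_API",
        "BACKUP_ENCRYPTION_KEY",
    ]

    def env_digests(path):
        values = {}
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    values[key.strip()] = value.strip()
        # Hash rather than transmit values so secrets never leave the node in clear text.
        return {
            key: hashlib.sha256(values.get(key, "").encode()).hexdigest()
            for key in CRITICAL_KEYS
        }

    # In the real gate each node reports its digests during deploy; here two
    # local files stand in for the two nodes' environments.
    node1, node2 = env_digests("node1.env"), env_digests("node2.env")
    drifted = [key for key in CRITICAL_KEYS if node1[key] != node2[key]]
    if drifted:
        print("parity gate failed, drift in: " + ", ".join(drifted), file=sys.stderr)
        sys.exit(1)
    print("parity gate passed")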

What Went Well

  • Data integrity and security were maintained throughout the incident.
  • The team isolated and resolved multiple independent faults under pressure.
  • The final two-node architecture is now substantially safer due to enforced parity gates and deeper readiness checks.

What Went Wrong

  • Too many critical changes were made and validated sequentially in production during the same window (topology, LB, new origin provisioning, DB connectivity assumptions, TLS/origin setup).
  • Multi-node readiness checks were incomplete at the start of the rollout; /healthz alone was not sufficient.
  • Cross-node secret/config drift was not blocked before traffic was routed to node 2.
  • TLS/origin management was not standardized across nodes prior to enabling LB routing.

Action Items (Next Steps)

  • Centralize production secret management and synchronize to both nodes from a single source of truth (reduce drift risk).
  • Keep the deploy-time parity gates mandatory and expand the “critical set” as we learn more (already implemented; the gates remain enforced).
  • Document and version Cloudflare LB monitor configuration (host header expectations, probe paths, and acceptable response codes) alongside infrastructure code to prevent silent changes.
  • Add a staging environment that mirrors the two-node topology (including Cloudflare/LB-like behavior) to rehearse rollouts before production.
  • Add external synthetic canary checks (from outside the cluster) for applicant login and manage SSO to catch real-world routing/TLS issues early (see the sketch after this list).
  • Adopt a safer rollout method for future topology changes: keep new nodes out of LB until they pass parity + smoke checks; then ramp traffic gradually (weights) while monitoring.
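
A minimal version of the external synthetic canary from the list above could look like the sketch below. The URLs are placeholders; the real canary would run outside the cluster on a schedule and feed alerting.

    # External canary sketch: hit the public endpoints that broke during this
    # incident and report any failure. URLs are placeholders.
    import sys
    import urllib.request

    CHECKS = [
        "https://portal.example.com/api/v1/healthz",    # readiness endpoint
        "https://portal.example.com/admin/login/",      # admin login page
        "https://portal.example.com/manage/sso/start",  # manage SSO start (placeholder path)
    ]

    failures = []
    for url in CHECKS:
        try:
            # TLS errors, timeouts, and 4xx/5xx responses all raise here.
            with urllib.request.urlopen(url, timeout=10) as response:
                response.read(1)
        except Exception as exc:
            failures.append(f"{url} -> {exc}")

    if failures:
        print("canary failures:\n" + "\n".join(failures), file=sys.stderr)
        sys.exit(1)
    print("all canary checks passed")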

Closing Statement
We regret the disruption this caused applicants and staff during a critical period. This incident highlighted the specific risks introduced by multi-origin architectures when parity and readiness checks are not enforced end-to-end. The corrective actions above are now in place, and we are continuing to monitor closely to ensure stable, consistent behavior across both nodes and the load balancer.

Posted Feb 14, 2026 - 18:20 PST

Resolved

This incident has been resolved.
Posted Feb 14, 2026 - 18:15 PST

Update

We are continuing to monitor for any further issues.
Posted Feb 14, 2026 - 17:10 PST

Monitoring

A fix has been implemented and we are monitoring the results.
----------------
On February 14, 2026 (approximately 11:00 AM–3:00 PM PT), we experienced an intermittent outage and degraded reliability during our migration from a single-node deployment to a two-node setup behind a Cloudflare Load Balancer. This incident was multi-causal: several configuration, dependency, network connectivity, and TLS/origin issues compounded while node 2 was being brought online. Because our previous health checks were too shallow, the load balancer could route traffic to a partially broken node, resulting in inconsistent behavior depending on which node served a given request. Throughout the incident, application data remained secure and intact.

What happened: A backend startup configuration issue caused Gunicorn workers to crash on one deployment path because WEB_CONCURRENCY was effectively empty, which made the node unhealthy. During node 2 bring-up, compose/dependency mismatches caused partial startup behavior. Node 2 initially could not reach the primary PostgreSQL instance (TCP 5432 timeouts), which prevented backend services from initializing reliably. Node 2 was also missing required production security configuration (BACKUP_ENCRYPTION_KEY while security maintenance was enabled), which caused Django worker boot failures. In parallel, TLS/origin setup on node 2 was incomplete: Caddy could not successfully complete the ACME challenge flow behind the proxied Cloudflare Load Balancer, leading to TLS/internal-origin errors. Cloudflare Load Balancer/origin health behavior was inconsistent during setup, including unhealthy endpoint detection that did not reflect full readiness and HTTP 400 responses under specific routing paths. Finally, we identified critical environment drift between node 1 and node 2 (including an invalid LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY on node 2, differing REQUIRE_SECURE_API values, and additional secret/config drift in some integrations). This drift caused intermittent behavior depending on which node served a request, including login failures, SSO/manage inconsistencies, and occasional admin errors. Because our prior checks only validated /healthz, a partially broken node could still appear “up” and receive traffic.
Posted Feb 14, 2026 - 16:57 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 14, 2026 - 15:51 PST

Update

We are continuing to investigate this issue.
Posted Feb 14, 2026 - 12:36 PST

Investigating

Goal Get It! is currently experiencing a critical error that is preventing or severely disrupting access to the applicant portal. Our team is actively working to restore service as quickly as possible. During this time, you may be unable to log in, submit materials, or view updates, and you may see errors or timeouts. We’re very sorry for the disruption—your application information remains secure and intact.
Posted Feb 14, 2026 - 12:30 PST
This incident affected: Goal Get It! Applicant Portal and Management Portal.