Postmortem -
The full postmortem is included below.
Feb 14, 18:20 PST
Resolved -
This incident has been resolved.
Feb 14, 18:15 PST
Update -
We are continuing to monitor for any further issues.
Feb 14, 17:10 PST
Monitoring -
A fix has been implemented and we are monitoring the results.
----------------
On February 14, 2026 (approximately 11:00 AM–3:00 PM PT), we experienced an intermittent outage and degraded reliability during our migration from a single-node deployment to a two-node setup behind a Cloudflare Load Balancer. This incident was multi-causal: several configuration, dependency, network connectivity, and TLS/origin issues compounded while node 2 was being brought online. Because our previous health checks were too shallow, the load balancer could route traffic to a partially broken node, resulting in inconsistent behavior depending on which node served a given request. Throughout the incident, application data remained secure and intact.
What happened:
- A backend startup configuration issue caused Gunicorn workers to crash on one deployment path because WEB_CONCURRENCY was effectively empty, which made that node unhealthy (a defensive-default sketch follows below).
- During node 2 bring-up, compose/dependency mismatches caused partial startup behavior.
- Node 2 initially could not reach the primary PostgreSQL instance (TCP 5432 timeouts), which prevented backend services from initializing reliably.
- Node 2 was also missing required production security configuration (BACKUP_ENCRYPTION_KEY while security maintenance was enabled), which caused Django worker boot failures (the preflight sketch below covers both this and the database-reachability check).
- In parallel, TLS/origin setup on node 2 was incomplete: Caddy could not successfully complete the ACME challenge flow behind the proxied Cloudflare Load Balancer, leading to TLS/internal-origin errors.
- Cloudflare Load Balancer/origin health behavior was inconsistent during setup, including endpoint health detection that did not reflect full application readiness and HTTP 400 responses on specific routing paths.
- Finally, we identified critical environment drift between node 1 and node 2 (including an invalid LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY on node 2, differing REQUIRE_SECURE_API values, and additional secret/config drift in some integrations). This drift caused intermittent behavior depending on which node served a request, including login failures, SSO/manage inconsistencies, and occasional admin errors (a drift-fingerprint sketch follows below).
- Because our prior checks only validated /healthz, a partially broken node could still appear “up” and receive traffic (a deeper readiness-check sketch follows below).
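To make the first failure mode concrete: Gunicorn reads gunicorn.conf.py as ordinary Python, so the worker count can be given a defensive fallback instead of breaking when WEB_CONCURRENCY is unset or empty. The sketch below is illustrative only; the CPU-based fallback and the bind/timeout values are assumptions, not our production settings.

    # gunicorn.conf.py -- illustrative sketch, not our actual production config.
    # Fall back to a CPU-derived worker count when WEB_CONCURRENCY is unset or
    # empty, rather than letting an empty value break worker startup.
    import multiprocessing
    import os

    _raw = os.environ.get("WEB_CONCURRENCY", "").strip()
    workers = int(_raw) if _raw else multiprocessing.cpu_count() * 2 + 1

    bind = "0.0.0.0:8000"   # assumed bind address for the sketch
    timeout = 60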
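Several of the node 2 problems (the unreachable PostgreSQL instance, the missing BACKUP_ENCRYPTION_KEY, the invalid LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY) only showed up as worker crash loops after traffic was already flowing. A preflight script of roughly this shape, run before the app process starts, turns them into one clear, early failure. This is a hedged sketch: the required-variable list comes from this report, while DATABASE_HOST and its default hostname are hypothetical placeholders.

    #!/usr/bin/env python3
    """Illustrative preflight check, run before starting app workers (sketch only)."""
    import os
    import socket
    import sys

    # Variables named in this report; a real list would be longer.
    REQUIRED_ENV = [
        "WEB_CONCURRENCY",
        "BACKUP_ENCRYPTION_KEY",
        "LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY",
        "REQUIRE_SECURE_API",
    ]

    def missing_env():
        # Treat unset *and* empty values as failures, since an effectively
        # empty WEB_CONCURRENCY was enough to crash Gunicorn workers.
        return [name for name in REQUIRED_ENV if not os.environ.get(name, "").strip()]

    def tcp_reachable(host, port, timeout=5.0):
        # Plain TCP connect test; catches the port 5432 timeouts node 2 hit.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def main():
        problems = ["missing or empty env var: " + name for name in missing_env()]
        db_host = os.environ.get("DATABASE_HOST", "db.example.internal")  # hypothetical
        if not tcp_reachable(db_host, 5432):
            problems.append("cannot reach PostgreSQL at %s:5432" % db_host)
        for problem in problems:
            print("PREFLIGHT FAIL: " + problem, file=sys.stderr)
        return 1 if problems else 0

    if __name__ == "__main__":
        sys.exit(main())

Wiring a check like this into the container entrypoint, so a non-zero exit blocks worker startup, keeps a misconfigured node from ever joining the pool.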
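The /healthz lesson is the central one: a node can answer a static health route while its Django workers cannot complete a login. A deeper readiness view along the lines below gives the Cloudflare Load Balancer health monitor something closer to real readiness to probe. Treat it as a sketch: the readyz name, the check list, and the response shape are assumptions for illustration, not our exact endpoint.

    # Illustrative Django readiness view (sketch; route name and checks are assumptions).
    import os

    from django.db import connection
    from django.http import JsonResponse

    REQUIRED_ENV = ["BACKUP_ENCRYPTION_KEY", "LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY"]

    def readyz(request):
        checks = {}

        # 1. Database: run a trivial query instead of assuming connectivity.
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1")
            checks["database"] = "ok"
        except Exception as exc:
            checks["database"] = "error: %s" % exc

        # 2. Required secrets: present and non-empty on *this* node.
        missing = [name for name in REQUIRED_ENV if not os.environ.get(name, "").strip()]
        checks["env"] = "ok" if not missing else "missing: " + ", ".join(missing)

        ready = all(value == "ok" for value in checks.values())
        return JsonResponse({"ready": ready, "checks": checks},
                            status=200 if ready else 503)

Pointing the load balancer's health monitor at a route like this keeps a node that cannot actually serve logins out of rotation.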
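Finally, the environment drift between node 1 and node 2 was only discovered by chasing individual request failures. A small fingerprinting script of this kind, run on each node and diffed, surfaces drift directly and without printing secret values. Again a sketch: the key list is just the variables named in this report and would be broader in practice.

    # Illustrative config-drift fingerprint (sketch). Run on each node, then diff
    # the two outputs; values are hashed so secrets never leave the host.
    import hashlib
    import os

    KEYS_TO_COMPARE = [
        "BACKUP_ENCRYPTION_KEY",
        "LOGIN_PAYLOAD_ENCRYPTION_PRIVATE_KEY",
        "REQUIRE_SECURE_API",
        "WEB_CONCURRENCY",
    ]

    for key in sorted(KEYS_TO_COMPARE):
        value = os.environ.get(key)
        # An unset key, an empty key, and two different values all produce
        # visibly different lines in the diff.
        if value is None:
            digest = "<unset>"
        elif not value.strip():
            digest = "<empty>"
        else:
            digest = hashlib.sha256(value.encode()).hexdigest()[:12]
        print(key + "\t" + digest)

Diffing the two outputs turns a question like whether node 2 has the same REQUIRE_SECURE_API value as node 1 into a one-line check.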
Feb 14, 16:57 PST
Identified -
The issue has been identified and a fix is being implemented.
Feb 14, 15:51 PST
Update -
We are continuing to investigate this issue.
Feb 14, 12:36 PST
Investigating -
Goal Get It! is currently experiencing a critical error that is preventing or severely disrupting access to the applicant portal. Our team is actively working to restore service as quickly as possible. During this time, you may be unable to log in, submit materials, or view updates, and you may see errors or timeouts. We’re very sorry for the disruption—your application information remains secure and intact.
Feb 14, 12:30 PST