Preflight checklist
- I could not find a solution in the existing issues, docs, nor discussions.
- I agree to follow this project's Code of Conduct.
- I have read and am following this repository's Contribution Guidelines.
- I have joined the Ory Community Slack.
- I am signed up to the Ory Security Patch Newsletter.
Ory Network Project
No response
Describe the bug
When operating Hydra in production, it has been observed through tracing that `GetOrGenerateKeys` is called routinely for the Authorization, Token, and WellKnown endpoints. That function guards the in-memory key sets with two RWMutexes.
As identified in #3662, problems can arise when the tail latency of the database query that retrieves JWKs exceeds a certain threshold under highly concurrent conditions.
Additionally, looking at the locking logic of `GetOrGenerateKeys`, there may be an opportunity to take `RLock()` when reading from the database and acquire the write lock only when generating keys, enabling more throughput for this endpoint.
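For readers unfamiliar with Go's `sync.RWMutex` semantics, here is a minimal, self-contained sketch (not Hydra's code; the set name and toy cache are purely illustrative) of why serving reads under `RLock()` lets them proceed concurrently, while the exclusive `Lock()` would only be needed for key generation:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// A sync.RWMutex lets any number of readers hold RLock concurrently,
	// while Lock is exclusive. Hydra holds one lock per JWK set; this toy
	// cache only illustrates the general pattern.
	var mu sync.RWMutex
	cached := map[string]string{"hydra.openid.id-token": "kid-1"}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			mu.RLock() // readers do not block each other
			kid := cached["hydra.openid.id-token"]
			mu.RUnlock()
			fmt.Println("reader", n, "saw", kid)
		}(i)
	}
	wg.Wait()

	mu.Lock() // a writer (e.g. key generation) excludes all readers and writers
	cached["hydra.openid.id-token"] = "kid-2"
	mu.Unlock()
}
```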
Reproducing the bug
To recreate the issue locally:
Strategy
- Database latency: run `hydra` with an artificially set latency for the `persister_jwk`'s `GetKeySet` that mimics real-world conditions (>= 25ms).
- Request saturation: use any load-testing tool, such as bombardier, to saturate the server with hundreds of concurrent client connections against the `/.well-known/openid-configuration` endpoint.
Steps
- Install bombardier: `go install github.com/codesenberg/bombardier@latest`
- Run a CockroachDB instance: `docker compose -f quickstart.yml -f quickstart-cockroach.yml up -d cockroachd`
- Run migrations: `docker compose -f quickstart.yml -f quickstart-cockroach.yml run hydra-migrate`
- Add artificial latency to `persister_jwk.go`: `time.Sleep(25 * time.Millisecond)` (see the sketch after this list)
- Run the code: `DSN='cockroach://root@localhost:26257/defaultdb?sslmode=disable&max_conns=20&max_idle_conns=4' go run . serve all --dev -c contrib/quickstart/5-min/hydra.yml`
- Run the load test with 270-300 concurrent connections: `bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration`
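The reproduction adds the sleep directly inside the persister's `GetKeySet`. Since the exact file contents vary between versions, the following is a self-contained sketch of the same idea; `keySet`, `getKeySetFn`, and `withArtificialLatency` are hypothetical names used only for illustration, not Hydra's API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// keySet and getKeySetFn are illustrative stand-ins for the types used by the
// real JWK persister; in the reproduction the time.Sleep was added directly
// inside GetKeySet in persister_jwk.go instead.
type keySet struct{ KIDs []string }

type getKeySetFn func(ctx context.Context, set string) (*keySet, error)

// withArtificialLatency sleeps before delegating to the wrapped lookup,
// mimicking database tail latency of >= 25ms.
func withArtificialLatency(next getKeySetFn, delay time.Duration) getKeySetFn {
	return func(ctx context.Context, set string) (*keySet, error) {
		time.Sleep(delay)
		return next(ctx, set)
	}
}

func main() {
	fast := func(ctx context.Context, set string) (*keySet, error) {
		return &keySet{KIDs: []string{"kid-1"}}, nil
	}
	slow := withArtificialLatency(fast, 25*time.Millisecond)

	start := time.Now()
	_, _ = slow(context.Background(), "hydra.openid.id-token")
	fmt.Println("lookup took", time.Since(start))
}
```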
Expected Outcome
Request saturation would result in a moderate service degradation. Here's a test with only 100 concurrent connections:
```
❯ bombardier -d 30s -t 30s -a -c 100 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 100 connection(s)
[=====================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec       26.14      17.31      57.67
  Latency         3.62s   708.85ms     3.90s
  Latency Distribution
     50%      3.83s
     75%      3.86s
     90%      3.89s
     95%      3.89s
     99%      3.90s
  HTTP codes:
    1xx - 0, 2xx - 881, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:    54.73KB/s
```
Actual Results
Once request handler timing, which grows logarithmically, reaches 10s, request latency degrades significantly:
```
❯ bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 270 connection(s)
[=====================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        8.70      17.32      52.68
  Latency        28.31s     22.94s     0.85m
  Latency Distribution
     50%     41.07s
     75%      0.85m
     90%      0.85m
     95%      0.85m
     99%      0.85m
  HTTP codes:
    1xx - 0, 2xx - 519, 3xx - 0, 4xx - 0, 5xx - 0
    others - 10
  Errors:
    the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection - 10
  Throughput:    19.69KB/s
```
Potential Fix
If `GetOrGenerateKeys` were to take a read lock and acquire the write lock only when it needs to generate keys, this request saturation would not have the same effect. However, this may not provide the synchronization guarantees intended by the authors.
```diff
 func GetOrGenerateKeys(ctx context.Context, r InternalRegistry, m Manager, set, kid, alg string) (private *jose.JSONWebKey, err error) {
-	getLock(set).Lock()
-	defer getLock(set).Unlock()
-
+	getLock(set).RLock()
 	keys, err := m.GetKeySet(ctx, set)
+	getLock(set).RUnlock()
+
 	if errors.Is(err, x.ErrNotFound) || keys != nil && len(keys.Keys) == 0 {
 		r.Logger().Warnf("JSON Web Key Set \"%s\" does not exist yet, generating new key pair...", set)
+		getLock(set).Lock()
+		defer getLock(set).Unlock()
 		keys, err = m.GenerateAndPersistKeySet(ctx, set, kid, alg, "sig")
 		if err != nil {
 			return nil, err
```
With this change in place, the same load test with 270 concurrent connections no longer shows the degradation:

```
❯ bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 270 connection(s)
[===============================================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec     1379.25     412.60    3602.03
  Latency     195.31ms   321.83ms      9.35s
  Latency Distribution
     50%   110.06ms
     75%   199.68ms
     90%   377.27ms
     95%   565.53ms
     99%      1.44s
  HTTP codes:
    1xx - 0, 2xx - 41610, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.82MB/s
```
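Regarding the caveat that a plain read-lock change may not provide the intended synchronization guarantees: one way to keep reads concurrent while still ensuring a key set is generated at most once would be to re-check the set after acquiring the write lock. Below is a rough, hypothetical sketch of that shape only; `getOrGenerate`, `fetch`, and `generate` are placeholders standing in for the real helpers (`m.GetKeySet`, `m.GenerateAndPersistKeySet`, `getLock(set)`), not Hydra's API:

```go
package main

import (
	"fmt"
	"sync"
)

// getOrGenerate sketches a double-checked variant of the patch above. fetch
// and generate stand in for m.GetKeySet and m.GenerateAndPersistKeySet, and
// lock stands in for getLock(set); only the locking shape matters here.
func getOrGenerate(lock *sync.RWMutex, fetch, generate func() ([]string, error)) ([]string, error) {
	lock.RLock()
	keys, err := fetch()
	lock.RUnlock()
	if err == nil && len(keys) > 0 {
		return keys, nil // fast path: concurrent readers never serialize
	}

	lock.Lock()
	defer lock.Unlock()
	// Re-check under the write lock: another goroutine may have generated the
	// key set while we were waiting, so it is not generated twice.
	if keys, err := fetch(); err == nil && len(keys) > 0 {
		return keys, nil
	}
	return generate()
}

func main() {
	var (
		mu     sync.RWMutex
		stored []string
	)
	fetch := func() ([]string, error) { return stored, nil }
	generate := func() ([]string, error) {
		stored = []string{"new-kid"}
		return stored, nil
	}

	keys, _ := getOrGenerate(&mu, fetch, generate)
	fmt.Println(keys)
}
```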
Relevant log output
No response
Relevant configuration
No response
Version
v2.2.0
On which operating system are you observing this issue?
Linux
In which environment are you deploying?
Kubernetes
Additional Context
This was observed in a production Kubernetes deployment where 4 replicas of Hydra pods were receiving thousands of requests per second, which led to a significant increase in latency. Upon examination of the traces, the large gap between spans before `persistence.sql.GetKeySet` was found to be a signature of the issue.