
GetOrGenerateKeys locking leads to significant service degradation with high request saturation #3863

@awill1988

Description

Preflight checklist

Ory Network Project

No response

Describe the bug

When operating Hydra in production, it has been observed through tracing that GetOrGenerateKeys is called routinely for the Authorization, Token, and WellKnown endpoints. That function relies on two RWMutexes to synchronize access to the key sets it reads and stores.
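
One plausible reading of the two RWMutexes mentioned above, consistent with the getLock(set) calls in the fix further down, is a two-level arrangement: one RWMutex guards a map of per-set RWMutexes, and GetOrGenerateKeys holds the per-set lock for the whole call. A rough sketch of that structure (illustrative only, not copied from the Hydra source):

package jwk

import "sync"

var (
	// mapLock guards the locks map itself.
	mapLock sync.RWMutex
	// locks holds one RWMutex per JWK set name.
	locks = map[string]*sync.RWMutex{}
)

// getLock returns the lock for a given set, creating it on first use.
func getLock(set string) *sync.RWMutex {
	mapLock.Lock()
	defer mapLock.Unlock()
	if _, ok := locks[set]; !ok {
		locks[set] = new(sync.RWMutex)
	}
	return locks[set]
}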

As identified in #3662, there can be problems when the tail latency of the database query to retrieve JWKs is above a certain threshold under highly concurrent conditions.

Additionally, when considering the locking logic of GetOrGenerateKeys, there may be an opportunity to take RLock() when reading from the database and to acquire the write lock only when generating keys, which would allow more throughput on these endpoints.
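
To make the effect concrete, the standalone sketch below (not Hydra code) mimics a 25ms key-set query performed by 100 concurrent requests, first under an exclusive Lock and then under a shared RLock. The exclusive variant serializes the requests (roughly 100 x 25ms in total), while the shared variant lets them overlap (roughly 25ms in total):

package main

import (
	"fmt"
	"sync"
	"time"
)

// timeRequests runs n goroutines that each hold the lock for ~25ms,
// standing in for the JWK database query, and reports the wall-clock time.
func timeRequests(n int, acquire, release func()) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			acquire()
			time.Sleep(25 * time.Millisecond)
			release()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	var mu sync.RWMutex
	fmt.Println("Lock: ", timeRequests(100, mu.Lock, mu.Unlock))   // serialized
	fmt.Println("RLock:", timeRequests(100, mu.RLock, mu.RUnlock)) // overlapping
}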

Reproducing the bug

To recreate the issue locally:

Strategy

  1. Database Latency - Run hydra with an artificially set latency for the persister_jwk's GetKeySet that mimics real-world conditions (>= 25ms).
  2. Request Saturation - Use a load-testing tool such as bombardier to saturate the server with hundreds of concurrent client connections against the /.well-known/openid-configuration endpoint.

Steps

  1. Install bombardier: go install github.com/codesenberg/bombardier@latest
  2. Run cockroachdb instance: docker compose -f quickstart.yml -f quickstart-cockroach.yml up -d cockroachd
  3. Run migrations: docker compose -f quickstart.yml -f quickstart-cockroach.yml run hydra-migrate
  4. Add artificial latency to persister_jwk.go: time.Sleep(25 * time.Millisecond) (see the sketch after this list)
  5. Run code: DSN='cockroach://root@localhost:26257/defaultdb?sslmode=disable&max_conns=20&max_idle_conns=4' go run . serve all --dev -c contrib/quickstart/5-min/hydra.yml
  6. Run the load test with 270-300 concurrent connections: bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
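
Step 4 amounts to adding a single sleep at the top of the persister's GetKeySet. A minimal sketch, assuming the method in persistence/sql/persister_jwk.go has roughly this shape (check your checkout for the exact signature), with "time" added to the file's imports:

func (p *Persister) GetKeySet(ctx context.Context, set string) (*jose.JSONWebKeySet, error) {
	// Artificial database latency for the reproduction (>= 25ms, per the strategy above).
	time.Sleep(25 * time.Millisecond)

	// ... existing query logic continues unchanged ...
}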

Expected Outcome

Request saturation would result in moderate service degradation. For reference, here's a run with only 100 concurrent connections:

❯ bombardier -d 30s -t 30s -a -c 100 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 100 connection(s)
[=====================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        26.14      17.31      57.67
  Latency         3.62s   708.85ms      3.90s
  Latency Distribution
     50%      3.83s
     75%      3.86s
     90%      3.89s
     95%      3.89s
     99%      3.90s
  HTTP codes:
    1xx - 0, 2xx - 881, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:    54.73KB/s

Actual Results

As saturation grows, request-handler timing climbs; once it reaches roughly 10s, request latency degrades significantly and the server starts dropping connections:

❯ bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 270 connection(s)
[=====================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec         8.70      17.32      52.68
  Latency        28.31s     22.94s      0.85m
  Latency Distribution
     50%     41.07s
     75%      0.85m
     90%      0.85m
     95%      0.85m
     99%      0.85m
  HTTP codes:
    1xx - 0, 2xx - 519, 3xx - 0, 4xx - 0, 5xx - 0
    others - 10
  Errors:
    the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection - 10
  Throughput:    19.69KB/s

Potential Fix

If GetOrGenerateKeys were to take a read lock and acquire the write lock only when it needs to generate a key set, this request saturation no longer has the same effect. The change below may not, however, provide the synchronization guarantees intended by the authors (see the sketch after the results for one way to preserve them).

 func GetOrGenerateKeys(ctx context.Context, r InternalRegistry, m Manager, set, kid, alg string) (private *jose.JSONWebKey, err error) {
-       getLock(set).Lock()
-       defer getLock(set).Unlock()
-
+       getLock(set).RLock()
        keys, err := m.GetKeySet(ctx, set)
+       getLock(set).RUnlock()
+
        if errors.Is(err, x.ErrNotFound) || keys != nil && len(keys.Keys) == 0 {
                r.Logger().Warnf("JSON Web Key Set \"%s\" does not exist yet, generating new key pair...", set)
+               getLock(set).Lock()
+               defer getLock(set).Unlock()
                keys, err = m.GenerateAndPersistKeySet(ctx, set, kid, alg, "sig")
                if err != nil {
                        return nil, err

With this change applied, the same 270-connection load test produced:

❯ bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 270 connection(s)
[===============================================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      1379.25     412.60    3602.03
  Latency      195.31ms   321.83ms      9.35s
  Latency Distribution
     50%   110.06ms
     75%   199.68ms
     90%   377.27ms
     95%   565.53ms
     99%      1.44s
  HTTP codes:
    1xx - 0, 2xx - 41610, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.82MB/s
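
One way to keep the single-writer guarantee the authors presumably intended, while still allowing concurrent reads, is a double-checked variant: read under RLock, and if the set looks missing, take the write lock and query again before generating, so only one goroutine generates even when many readers miss at once. The sketch below is not a vetted patch; it reuses the identifiers from the diff above (getLock, Manager, InternalRegistry, x.ErrNotFound), and findPrivateKey is a placeholder for whatever the original function does with the retrieved key set:

func getOrGenerateKeysSketch(ctx context.Context, r InternalRegistry, m Manager, set, kid, alg string) (*jose.JSONWebKey, error) {
	// Fast path: concurrent readers share the read lock.
	getLock(set).RLock()
	keys, err := m.GetKeySet(ctx, set)
	getLock(set).RUnlock()

	if err != nil && !errors.Is(err, x.ErrNotFound) {
		return nil, err
	}

	if keys == nil || len(keys.Keys) == 0 {
		// Slow path: take the write lock and re-check, because another
		// goroutine may have generated the set between RUnlock and Lock.
		getLock(set).Lock()
		defer getLock(set).Unlock()

		keys, err = m.GetKeySet(ctx, set)
		if errors.Is(err, x.ErrNotFound) || (keys != nil && len(keys.Keys) == 0) {
			r.Logger().Warnf("JSON Web Key Set \"%s\" does not exist yet, generating new key pair...", set)
			keys, err = m.GenerateAndPersistKeySet(ctx, set, kid, alg, "sig")
		}
		if err != nil {
			return nil, err
		}
	}

	return findPrivateKey(keys) // placeholder for the original key-selection logic
}

The second GetKeySet under the write lock is the cost of correctness: it only runs when the fast path found nothing, so the common case stays entirely on the read lock.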

Relevant log output

No response

Relevant configuration

No response

Version

v2.2.0

On which operating system are you observing this issue?

Linux

In which environment are you deploying?

Kubernetes

Additional Context

This was observed in a production Kubernetes deployment where four replicas of the Hydra pod were receiving thousands of requests per second, which led to a significant increase in latency. Upon examination of the traces, a large gap between spans just before persistence.sql.GetKeySet turned out to be the signature of the issue.


    Labels

    bug - Something is not working.
