Avoid pending queue wedge if tracer flare is generated multiple times #9757
Conversation
What Does This Do

This change does a few things:
- Resets the index in collectTraces when the data field is replaced (and marks the index field as volatile). This should prevent the situation described under Motivation below from happening.
- In case the situation still happens, a stand-in CommandElement is returned to avoid returning null. A warning message is also logged.
- The existing "testing tracer flare dump with multiple traces" test case is expanded to exercise the problem.

Motivation

In DumpDrain, the collectTraces method replaces the 'data' field with an empty ArrayList, but it does not also reset the 'index' field. If another dump is performed later, this leads the get method to reach the 'return null' statement, and as the comment there states, this can (and does) break the queue. (A simplified sketch of the bug and the fix appears after the stack trace below.)

I noticed this while working on a separate enhancement to dump the pending and long-running traces over JMX; my unit test would reliably hang if it was run after the existing tracer flare dump unit test.

Additional Notes

Here is an example stack trace when the hang happens:

```
"dd-trace-monitor" #38 daemon prio=5 os_prio=31 tid=0x0000000110e6e000 nid=0x7617 runnable [0x0000000171032000]
   java.lang.Thread.State: RUNNABLE
        at org.jctools.queues.MpscBlockingConsumerArrayQueue.spinWaitForElement(MpscBlockingConsumerArrayQueue.java:634)
        at org.jctools.queues.MpscBlockingConsumerArrayQueue.parkUntilNext(MpscBlockingConsumerArrayQueue.java:566)
        at org.jctools.queues.MpscBlockingConsumerArrayQueue.take(MpscBlockingConsumerArrayQueue.java:482)
        at datadog.trace.core.PendingTraceBuffer$DelayingPendingTraceBuffer$Worker.run(PendingTraceBuffer.java:317)
        at java.lang.Thread.run(Thread.java:750)
```
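To make the failure mode concrete, here is a minimal, self-contained Java sketch of the idea. Names and structure are simplified for illustration and do not mirror the actual DumpDrain implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the failure mode and fix described above; not the real DumpDrain code.
final class DrainSketch {
  interface Element {}                                   // stands in for the real CommandElement type
  private static final Element STAND_IN = new Element() {};

  private volatile int index;                            // volatile: the reset must be visible to the reader
  private List<Element> data = new ArrayList<>();

  void add(Element e) {
    data.add(e);
  }

  // Returning null from here is what wedges the consuming queue, so an out-of-range
  // index logs a warning and hands back a harmless stand-in element instead.
  Element get() {
    if (index < data.size()) {
      return data.get(index++);
    }
    System.err.println("index " + index + " out of bounds for size " + data.size()
        + "; returning stand-in element instead of null");
    return STAND_IN;
  }

  // The bug: swapping in a fresh list without also resetting 'index' leaves the next
  // dump reading past the end of the new, shorter list.
  List<Element> collectTraces() {
    List<Element> collected = data;
    data = new ArrayList<>();
    index = 0;                                           // the fix: reset the cursor together with the data
    return collected;
  }
}
```

Before the fix, a second collectTraces call would leave the cursor pointing past the end of the fresh list, so the next get would have returned null and wedged the consumer.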
dd-trace-core/src/test/groovy/datadog/trace/core/PendingTraceBufferTest.groovy
I believe this is good, but don't take my word on it. Another pair of eyes might be needed.
```groovy
entries2.size() == 1
def pendingTraceText2 = entries2["pending_traces.txt"] as String
def parsedTraces2 = pendingTraceText2.split('\n').collect { new JsonSlurper().parseText(it) }.flatten()
parsedTraces2.size() == 2
```
question: would it be better to assert on the trace IDs as well?
The specific trace IDs don't really matter, so I think it is better to leave them out and not make the test over-specific. We mainly want to make sure:
- The dump didn't hang.
- We got the expected number of traces (and I'm not even sure that matters that much, but it was there in the original test).
I only generate a second set of traces because the first set is gone from the pending queue by the time the second dump is performed (well, they are still there, actually, but the traces have no pending spans left--they've been written within 500ms).
Thanks for the fix, it looks good! Just had a non-blocking open question about the logging.
```java
LOGGER.warn(
    "Index {} is out of bounds for data size {} in DumpDrain.get so returning filler CommandElement to prevent pending trace queue from breaking.",
    index,
    data.size());
return STAND_IN_ELEMENT;
```
I agree that having this `STAND_IN_ELEMENT` is better than returning null and breaking the queue.
❓ I wonder if there is a better way to alert Datadog/customers that this is the issue; I feel like this log could be easily skipped over. But this shouldn't block the PR.
> ❓ I wonder if there is a better way to alert Datadog/customers that this is the issue; I feel like this log could be easily skipped over. But this shouldn't block the PR.
I agree 100%. I poked around a little and it looks like `SEND_TELEMETRY` might be what we want?
Are any other options on the menu?
I should note: I've never successfully wedged a normal APM tracer this way. I've only seen the queue get wedged in the test. I've tried a bit, but I'm struggling to keep traces in the PendingTrace queue. Still trying.
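If the telemetry route is pursued, a rough sketch of what that could look like follows. The import path for SEND_TELEMETRY and the helper class are assumptions for illustration, not code from this PR.

```java
// Hedged sketch only: the marker's exact class/path is an assumption, not taken from this PR.
// The idea is to tag the warning so the tracer's telemetry pipeline can pick it up, rather
// than relying on someone noticing a plain log line.
import static datadog.trace.api.telemetry.LogCollector.SEND_TELEMETRY; // assumed location

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class DumpDrainAlertSketch {
  private static final Logger LOGGER = LoggerFactory.getLogger(DumpDrainAlertSketch.class);

  static void reportIndexOutOfBounds(int index, int size) {
    // SLF4J's warn(Marker, String, Object...) overload lets the marker travel with the event.
    LOGGER.warn(
        SEND_TELEMETRY,
        "Index {} is out of bounds for data size {} in DumpDrain.get; returning filler CommandElement.",
        index,
        size);
  }
}
```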
Contributor Checklist
- Assign the `type:` and (`comp:` or `inst:`) labels in addition to any useful labels
- Avoid `close`, `fix` or any linking keywords when referencing an issue. Use `solves` instead, and assign the PR milestone to the issue

Jira ticket: [PROJ-IDENT]