Avoid pending queue wedge if tracer flare is generated multiple times #9757
Conversation
What Does This Do

This change does a few things:
- Resets the index in collectTraces when the data field is replaced (and marks the index field as volatile). This should prevent the situation described under Motivation below from happening.
- In case the situation still happens, a stand-in CommandElement is returned to avoid returning null. A warning message is also logged.
- The existing "testing tracer flare dump with multiple traces" test case is expanded to exercise the problem.

Motivation

In DumpDrain, the collectTraces method replaces the 'data' field with an empty ArrayList, but it does not also reset the 'index' field. If another dump is performed later, this leads the get method to reach the 'return null' statement, and as the comment there states, this can (and does) break the queue. (A simplified sketch of the bug and the fix appears after the stack trace below.)

I noticed this while working on a separate enhancement to dump the pending and long-running traces over JMX; my unit test would reliably hang if it was run after the existing tracer flare dump unit test.

Additional Notes

Here is an example stack trace when the hang happens:

```
"dd-trace-monitor" #38 daemon prio=5 os_prio=31 tid=0x0000000110e6e000 nid=0x7617 runnable [0x0000000171032000]
   java.lang.Thread.State: RUNNABLE
        at org.jctools.queues.MpscBlockingConsumerArrayQueue.spinWaitForElement(MpscBlockingConsumerArrayQueue.java:634)
        at org.jctools.queues.MpscBlockingConsumerArrayQueue.parkUntilNext(MpscBlockingConsumerArrayQueue.java:566)
        at org.jctools.queues.MpscBlockingConsumerArrayQueue.take(MpscBlockingConsumerArrayQueue.java:482)
        at datadog.trace.core.PendingTraceBuffer$DelayingPendingTraceBuffer$Worker.run(PendingTraceBuffer.java:317)
        at java.lang.Thread.run(Thread.java:750)
```
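To make the failure mode concrete, here is a minimal, self-contained Java sketch of the idea. Names and structure are simplified for illustration and do not mirror the actual DumpDrain implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the failure mode and fix described above; not the real DumpDrain code.
final class DrainSketch {
  interface Element {}                                   // stands in for the real CommandElement type
  private static final Element STAND_IN = new Element() {};

  private volatile int index;                            // volatile: the reset must be visible to the reader
  private List<Element> data = new ArrayList<>();

  void add(Element e) {
    data.add(e);
  }

  // Returning null from here is what wedges the consuming queue, so an out-of-range
  // index logs a warning and hands back a harmless stand-in element instead.
  Element get() {
    if (index < data.size()) {
      return data.get(index++);
    }
    System.err.println("index " + index + " out of bounds for size " + data.size()
        + "; returning stand-in element instead of null");
    return STAND_IN;
  }

  // The bug: swapping in a fresh list without also resetting 'index' leaves the next
  // dump reading past the end of the new, shorter list.
  List<Element> collectTraces() {
    List<Element> collected = data;
    data = new ArrayList<>();
    index = 0;                                           // the fix: reset the cursor together with the data
    return collected;
  }
}
```

Before the fix, a second collectTraces call would leave the cursor pointing past the end of the fresh list, so the next get would have returned null and wedged the consumer.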
dd-trace-core/src/test/groovy/datadog/trace/core/PendingTraceBufferTest.groovy
I believe this is good, but don't take my word on it. Another pair of eyes might be needed.
```groovy
entries2.size() == 1
def pendingTraceText2 = entries2["pending_traces.txt"] as String
def parsedTraces2 = pendingTraceText2.split('\n').collect { new JsonSlurper().parseText(it) }.flatten()
parsedTraces2.size() == 2
```
question: would it be better to assert on the trace IDs as well?
The specific trace IDs don't really matter, so I think it is better to leave them out and not make the test over-specific. We mainly want to make sure:
- The dump didn't hang.
- We got the expected number of traces (and I'm not even sure that matters that much, but it was there in the original test).
I only generate a second set of traces because the first set is gone from the pending queue by the time the second dump is performed (well, they are still there, actually, but the traces have no pending spans left--they've been written within 500ms).
Thanks for the fix, it looks good! Just had a non-blocking open question about the logging.
```java
LOGGER.warn(
    "Index {} is out of bounds for data size {} in DumpDrain.get so returning filler CommandElement to prevent pending trace queue from breaking.",
    index,
    data.size());
return STAND_IN_ELEMENT;
```
I agree that having this `STAND_IN_ELEMENT` is better than returning null and breaking the queue.
❓ I wonder if there is a better way to alert Datadog/customers that this is the issue; I feel like this log could be easily skipped over. But this shouldn't block the PR.
> ❓ I wonder if there is a better way to alert Datadog/customers that this is the issue; I feel like this log could be easily skipped over. But this shouldn't block the PR.
I agree 100%. I poked around a little and it looks like `SEND_TELEMETRY` might be what we want?
Are any other options on the menu?
I should note: I've never successfully wedged a normal APM tracer this way. I've only seen the queue get wedged in the test. I've tried a bit, but I'm struggling to keep traces in the PendingTrace queue. Still trying.
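If the telemetry route is pursued, a rough sketch of what that could look like follows. The import path for SEND_TELEMETRY and the helper class are assumptions for illustration, not code from this PR.

```java
// Hedged sketch only: the marker's exact class/path is an assumption, not taken from this PR.
// The idea is to tag the warning so the tracer's telemetry pipeline can pick it up, rather
// than relying on someone noticing a plain log line.
import static datadog.trace.api.telemetry.LogCollector.SEND_TELEMETRY; // assumed location

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class DumpDrainAlertSketch {
  private static final Logger LOGGER = LoggerFactory.getLogger(DumpDrainAlertSketch.class);

  static void reportIndexOutOfBounds(int index, int size) {
    // SLF4J's warn(Marker, String, Object...) overload lets the marker travel with the event.
    LOGGER.warn(
        SEND_TELEMETRY,
        "Index {} is out of bounds for data size {} in DumpDrain.get; returning filler CommandElement.",
        index,
        size);
  }
}
```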
Contributor Checklist
- Assign the `type:` and (`comp:` or `inst:`) labels in addition to any useful labels
- Avoid `close`, `fix` or any linking keywords when referencing an issue. Use `solves` instead, and assign the PR milestone to the issue

Jira ticket: [PROJ-IDENT]