Conversation

@dcherednik
Member

Changelog entry

Allow transferring XDC events via RDMA using READ verbs

RDMA XDC integration: robdrynkin@ydb.tech

Changelog category

  • Experimental feature

Description for reviewers

...

@github-actions

github-actions bot commented Nov 10, 2025

2025-11-10 16:47:40 UTC Pre-commit check linux-x86_64-release-asan for cd18800 has started.
2025-11-10 16:47:59 UTC Artifacts will be uploaded here
2025-11-10 16:50:10 UTC ya make is running...
2025-11-10 17:06:31 UTC Check cancelled

@github-actions

github-actions bot commented Nov 10, 2025

2025-11-10 16:48:06 UTC Pre-commit check linux-x86_64-relwithdebinfo for cd18800 has started.
2025-11-10 16:48:10 UTC Artifacts will be uploaded here
2025-11-10 16:49:35 UTC ya make is running...
2025-11-10 17:06:30 UTC Check cancelled

@github-actions

🟢 2025-11-10 16:49:26 UTC The validation of the Pull Request description is successful.

Allow transferring XDC events via RDMA using READ verbs

Merge to main: dcherednik@ydb.tech
@dcherednik dcherednik force-pushed the rdma_read_xdc_support_merge branch from 539e3b6 to 433c1b6 Compare November 10, 2025 17:06
@github-actions

github-actions bot commented Nov 10, 2025

2025-11-10 17:07:01 UTC Pre-commit check linux-x86_64-release-asan for a79ded9 has started.
2025-11-10 17:07:22 UTC Artifacts will be uploaded here
2025-11-10 17:08:45 UTC ya make is running...
2025-11-10 19:06:51 UTC Check cancelled

@github-actions

github-actions bot commented Nov 10, 2025

2025-11-10 17:08:13 UTC Pre-commit check linux-x86_64-relwithdebinfo for a79ded9 has started.
2025-11-10 17:09:26 UTC Artifacts will be uploaded here
2025-11-10 17:11:35 UTC ya make is running...
2025-11-10 19:06:53 UTC Check cancelled

@iddqdex iddqdex added the rebase-and-check Rebase PR with the current base branch and check label Nov 10, 2025
@github-actions github-actions bot removed the rebase-and-check Rebase PR with the current base branch and check label Nov 10, 2025
@github-actions

github-actions bot commented Nov 10, 2025

2025-11-10 19:07:39 UTC Pre-commit check linux-x86_64-release-asan for 6e9be58 has started.
2025-11-10 19:09:22 UTC Artifacts will be uploaded here
2025-11-10 19:11:29 UTC ya make is running...
🟡 2025-11-10 21:50:02 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 18851 | 18357 | 0 | 188 | 278 | 28 |

🟢 2025-11-10 21:50:13 UTC Build successful.
🔴 2025-11-10 21:50:42 UTC ydbd size 3.8 GiB changed* by +43.9 MiB, which is >= 2.0 MiB vs main: Alert

| ydbd size dash | main: 8721579 | merge: 6e9be58 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 4 080 530 648 Bytes | 4 126 534 744 Bytes | +43.9 MiB | +1.127% |
| ydbd stripped size | 1 514 729 768 Bytes | 1 518 784 328 Bytes | +3.9 MiB | +0.268% |

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions

github-actions bot commented Nov 10, 2025

2025-11-10 19:07:45 UTC Pre-commit check linux-x86_64-relwithdebinfo for 6e9be58 has started.
2025-11-10 19:08:13 UTC Artifacts will be uploaded here
2025-11-10 19:10:25 UTC ya make is running...
🟡 2025-11-10 21:31:46 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 42462 | 39559 | 0 | 8 | 2866 | 29 |

2025-11-10 21:31:58 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-11-10 21:44:58 UTC Tests successful.

Ya make output | Test bloat | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 131 (only retried tests) | 109 | 0 | 0 | 1 | 21 |

🟢 2025-11-10 21:45:05 UTC Build successful.
🔴 2025-11-10 21:45:27 UTC ydbd size 2.3 GiB changed* by +35.1 MiB, which is >= 2.0 MiB vs main: Alert

| ydbd size dash | main: 3b1cf4e | merge: 6e9be58 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 2 437 459 680 Bytes | 2 474 308 952 Bytes | +35.1 MiB | +1.512% |
| ydbd stripped size | 518 602 832 Bytes | 519 960 496 Bytes | +1.3 MiB | +0.262% |

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation


bool SerializeToArcadiaStreamImpl(TChunkSerializer* chunker, const TVector<TRope> &payload) {
// serialize payload first
bool SerializeHeaderCommon(const TVector<TRope> &payload, std::function<bool(const char *p, size_t len)> append) {
Collaborator

Do we really need the indirect-call overhead here, for every little chunk of data?
I'd prefer a function with a template callback, plus another one that calls it through std::function when necessary.

Member Author

Agree. Fixed.
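
For readers following this thread, here is a minimal sketch of the pattern the reviewer describes. It is not the code that landed in this PR. The names `SerializeChunks`, `SerializeChunksErased`, and `TAppendFn` are hypothetical, and `std::vector<std::string>` stands in for `TVector<TRope>`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hot path: the callback type is a template parameter, so the compiler
// sees the concrete callable and can inline the per-chunk call.
template <typename TAppend>
bool SerializeChunks(const std::vector<std::string>& chunks, TAppend&& append) {
    for (const std::string& chunk : chunks) {
        if (!append(chunk.data(), chunk.size())) {
            return false; // stop at the first chunk the sink rejects
        }
    }
    return true;
}

// Hypothetical type-erased entry point for callers that only hold a std::function.
using TAppendFn = std::function<bool(const char*, std::size_t)>;

bool SerializeChunksErased(const std::vector<std::string>& chunks, const TAppendFn& append) {
    // The indirect call through std::function is paid only on this path.
    return SerializeChunks(chunks, append);
}
```

Callers that pass a lambda directly to `SerializeChunks` avoid the type-erased call, while callers that need to store or forward the callback can still use the wrapper.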

ui32 CalculateSerializedSizeImpl(const TVector<TRope> &payload, ssize_t recordSize) {
ssize_t result = recordSize;
if (result >= 0 && payload) {
ui32 CalculateSerilizedHeaderSizeImpl(const TVector<TRope> &payload) {
Collaborator

seriAlized

Member Author

fixed

result += CalculateSerilizedHeaderSizeImpl(payload);
size_t totalPayloadSize = 0;
for (const TRope& rope : payload) {
totalPayloadSize += rope.GetSize();
Collaborator

What's the use of the intermediate counter? Why not add to result directly?

Member Author

hmm) fixed
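
A minimal sketch of the simplification requested here, under stated assumptions: `TFakeRope` is a hypothetical stand-in that mimics only `GetSize()`, and `TotalSize` is an illustrative name, not a function from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for TRope; only GetSize() is modelled.
struct TFakeRope {
    std::size_t Size;
    std::size_t GetSize() const { return Size; }
};

// Adds each payload size straight into the running result,
// with no intermediate totalPayloadSize counter.
std::size_t TotalSize(std::size_t headerSize, const std::vector<TFakeRope>& payload) {
    std::size_t result = headerSize;
    for (const TFakeRope& rope : payload) {
        result += rope.GetSize();
    }
    return result;
}
```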

size_t Tailroom = 0; // tailroom for the chunk
size_t Alignment = 0; // required alignment
bool IsInline = false; // if true, goes through ordinary channel
bool IsRdma = false; // if true, could go through RDMA
Collaborator

Suggest calling it IsRdmaCapable or something like that.

if (!recordsSerializedBuf) {
return {};
}
Y_ABORT_UNLESS(Record.SerializePartialToArray(recordsSerializedBuf.GetDataMut(), size));
Collaborator

Usually the side effects of an assertion are not guaranteed, so it's better to evaluate Serialize... outside Y_ABORT_UNLESS.
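
To illustrate the concern, here is a minimal sketch that uses the standard `assert` macro as a stand-in. Whether `Y_ABORT_UNLESS` is ever compiled out is not stated in this thread, so treat that part as an assumption. `SerializeInto` and `SerializeChecked` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical serializer with a side effect: it writes into `out`.
bool SerializeInto(const std::string& record, char* out, std::size_t capacity) {
    if (record.size() > capacity) {
        return false;
    }
    std::memcpy(out, record.data(), record.size());
    return true;
}

// Risky form: when an assertion macro is compiled out (as assert is under
// NDEBUG), the call inside it disappears together with the check:
//     assert(SerializeInto(record, out, capacity));

// Safer form: the call always runs; only the check depends on the macro.
bool SerializeChecked(const std::string& record, char* out, std::size_t capacity) {
    const bool ok = SerializeInto(record, out, capacity);
    assert(ok && "serialization failed");
    return ok;
}
```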


task.Write<false>(buffer, partSize);

task.AttachRdmaPayloadSize(payloadSz);
Collaborator

Does it attach the whole message at once? In theory, that may cause problems with fair bandwidth distribution between channels.

Member Author

In theory, yes. But even on a 200 Gbit/s network, the time to transfer a 1 MB event (app to app via RDMA) is comparable to, or even less than, the time of one whole IC cycle, and we expect 400 or even 800 Gbit/s networks. So it is unclear whether splitting gives any benefit or just burns CPU. Moreover, it is unclear how bandwidth will be distributed between TCP messages and RoCEv2 traffic. We probably need to implement RDMA SEND and RECEIVE to reduce the TCP overhead for control messages.
So we decided not to write complicated code until we get real feedback from a production-like load.
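
The back-of-the-envelope arithmetic behind the figures above can be sketched as follows. It assumes 1 MB = 10^6 bytes and ignores protocol overhead and latency. `TransferTimeUs` is a hypothetical helper, not part of this PR.

```cpp
#include <cassert>
#include <cstdint>

// Idealized wire time in microseconds for `bytes` on a link of `gbitPerSec`.
// bits / (Gbit/s * 1e9) gives seconds; multiplying by 1e6 leaves a 1e3 divisor.
constexpr double TransferTimeUs(std::uint64_t bytes, double gbitPerSec) {
    return static_cast<double>(bytes) * 8.0 / (gbitPerSec * 1e3);
}
```

Under these assumptions a 1 MB event takes about 40 µs on a 200 Gbit/s link, about 20 µs at 400 Gbit/s, and about 10 µs at 800 Gbit/s.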

}
}

rope.clear();
Collaborator

What for?

Member Author

Just to be absolutely sure we free the RDMA memory)
Actually, this code will not be needed once we solve the problem of RDMA memory being "stashed" (held on to) somewhere. But right now we are running into problems with that, and the event copy in this function is just a workaround.

TRope newRope;

for (TRope::TIterator it = rope.Begin(); it != rope.End(); ++it) {
TRcBuf chunk = it.GetChunk();
Collaborator

GetChunk() returns a mutable reference, so you can modify it in place. At least that should be possible, and I suppose it is a bug if it is not.

Become(&TThis::WorkingState, DeadPeerTimeout, new TEvCheckDeadPeer);
LOG_DEBUG_IC_SESSION("ICIS01", "InputSession created");
if (RdmaQp) {
LOG_DEBUG_IC_SESSION("ICIS01", "InputSession created, rdma qp num: %d", RdmaQp->GetQpNum());
Collaborator

Markers must not repeat in code, even though they are obsolete :)

Member Author

Oh. That means we need to rename all the ICRDMA markers we have already added. Ok)

XXH3_64bits_update(&state, data, size);
}
if (checksum != descr.Checksum) {
checksum = XXH3_64bits_digest(&state);
Collaborator

You can't just change the checksumming algorithm; it will make this version incompatible with previous ones, won't it?

4 participants