-
Notifications
You must be signed in to change notification settings - Fork 726
[ICRDMA] Support for XDC transver via RDMA. EXT-1506 #28557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[ICRDMA] Support for XDC transver via RDMA. EXT-1506 #28557
Conversation
|
🟢 |
Allow to transfer XDC event via RDMA using READ verbs Merge to main: dcherednik@ydb.tech
539e3b6 to
433c1b6
Compare
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
|
||
| bool SerializeToArcadiaStreamImpl(TChunkSerializer* chunker, const TVector<TRope> &payload) { | ||
| // serialize payload first | ||
| bool SerializeHeaderCommon(const TVector<TRope> &payload, std::function<bool(const char *p, size_t len)> append) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need indirect call overhead here? For every little chunk of data?
I'd prefer function with template callback and other one calling it through std::function when necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Fixed.
| ui32 CalculateSerializedSizeImpl(const TVector<TRope> &payload, ssize_t recordSize) { | ||
| ssize_t result = recordSize; | ||
| if (result >= 0 && payload) { | ||
| ui32 CalculateSerilizedHeaderSizeImpl(const TVector<TRope> &payload) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seriAlized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| result += CalculateSerilizedHeaderSizeImpl(payload); | ||
| size_t totalPayloadSize = 0; | ||
| for (const TRope& rope : payload) { | ||
| totalPayloadSize += rope.GetSize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use for intermediate counter? Why not add to result directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm) fixed
| size_t Tailroom = 0; // tailroom for the chunk | ||
| size_t Alignment = 0; // required alignment | ||
| bool IsInline = false; // if true, goes through ordinary channel | ||
| bool IsRdma = false; // if true, could go through RDMA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest calling it IsRdmaCapable or something like that.
| if (!recordsSerializedBuf) { | ||
| return {}; | ||
| } | ||
| Y_ABORT_UNLESS(Record.SerializePartialToArray(recordsSerializedBuf.GetDataMut(), size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually side effects of an assertion are not guaranteed. So it's better to evaluate Serialize... outsize Y_ABORT_UNLESS.
|
|
||
| task.Write<false>(buffer, partSize); | ||
|
|
||
| task.AttachRdmaPayloadSize(payloadSz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it attach whole message at once? It may cause problems with fair bandwidth distribution between channels (in theory).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory... But even in case of 200Gbit/s network time of transfer 1MB event (app to app via RDMA) compared or even less than time of whole one IC cycle, but we expect 400 or even 800Gbit/s network, so it is unclear does the splitting make some profit or we just fire some CPU. Moreover it is unclear how does the bandwidth will be distributed between TCP messages and RoCEv2 traffic. Probably we need to implement RDMA SEND and RECIEVE to reduce TCP overhead for control messages.
So we decided not to write some complicated code until we got some real feedback from production like load.
| } | ||
| } | ||
|
|
||
| rope.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be absolutely sure we free rdma memory)
Actually this code is not needed after solving problem with "прикапыванием" rdma memory somewhere. But right now we meet some problems with it and this event copy in this function is just workaround.
| TRope newRope; | ||
|
|
||
| for (TRope::TIterator it = rope.Begin(); it != rope.End(); ++it) { | ||
| TRcBuf chunk = it.GetChunk(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetChunk() returns mutable reference. You can modify it inplace. At least it should be possible and I suppose it is a bug if it is not.
| Become(&TThis::WorkingState, DeadPeerTimeout, new TEvCheckDeadPeer); | ||
| LOG_DEBUG_IC_SESSION("ICIS01", "InputSession created"); | ||
| if (RdmaQp) { | ||
| LOG_DEBUG_IC_SESSION("ICIS01", "InputSession created, rdma qp num: %d", RdmaQp->GetQpNum()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Markers must not repeat in code, despite they are obsolete :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ou. It means we need to rename all ICRDMA markers we have added already. Ok)
| XXH3_64bits_update(&state, data, size); | ||
| } | ||
| if (checksum != descr.Checksum) { | ||
| checksum = XXH3_64bits_digest(&state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't just change checksumming algorithm, it will make this version incompatible with previous ones, won't it?
Changelog entry
Allow to transfer XDC event via RDMA using READ verbs
RDMA XDC integration: robdrynkin@ydb.tech
Changelog category
Description for reviewers
...