Fix files missing from large directories #25

buttercookie42 · 2024-11-04T19:02:50Z

In large directories I've observed that a number of files are missing on jfileserver shares. After some more investigation, it turned out that the missing files correspond to FIND_NEXT2 requests: the first file from every FIND_NEXT2 response is missing. (And I suspect that SMB2/3 would be similarly affected if a search needs to be split up into multiple request/response pairs.)

Fixing this issue in turn exacerbates an existing performance issue in the restartAt(FileInfo) code in JavaNIOSearchContext. Due to the way our search and SMB APIs are structured, when we need to split up a large search response (with lots of files) into multiple packets, at the boundary between each packet we need to retrieve the FileInfo object for the same file twice - once to find out that it doesn't fit into the previous packet anymore, and a second time to actually transmit it in the followup packet.

Because the NIO directory iterator cannot be rewound to go back by one entry, this means that we have to iterate through the whole directory up to the target file again, and due to the fix for the above we actually need to do it twice now.
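To illustrate the cost described above, here is a minimal sketch (not the actual JavaNIOSearchContext code) of what "going back" means with a forward-only `DirectoryStream`: a fresh stream has to be opened and every entry up to the target consumed again. The class name `ReopenAndSkip` and both helper methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class ReopenAndSkip {

    /**
     * Opens a fresh DirectoryStream and consumes entries until the one named
     * {@code target} has been read. Returns the number of entries consumed,
     * or -1 if no entry has that name. The work is linear in the position of
     * the target, which is what makes repeated restarts expensive.
     */
    static int entriesConsumedToReach(Path dir, String target) {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            int consumed = 0;
            for (Path entry : stream) {
                consumed++;
                if (entry.getFileName().toString().equals(target)) {
                    return consumed;
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temporary directory containing empty files with the given names. */
    static Path makeDemoDir(String... names) {
        try {
            Path dir = Files.createTempDirectory("reopen-demo");
            for (String name : names) {
                Files.createFile(dir.resolve(name));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Directory iteration order is unspecified, so the count returned for a given name depends on the file system. The point is only that it grows with the size of the directory.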

To fix the resulting performance issue in large directories (on a phone, with a few thousand files, a single FIND_FIRST/NEXT2 call can take hundreds of ms and usually returns at most ~100 files), I propose special-casing the common scenario of restarting the SearchContext exactly one entry earlier.

The problem is that in order to do the comparison between the file name from the iterator and the passed-in FileInfo, we always need to consume the respective Path from the DirectoryStream iterator, so even when the comparison declares success, we have already consumed that Path. This means that the next FIND_NEXT2 call will skip that entry and start one Path too late when it accesses this search context again. For correctness, we therefore have to rewind the iterator by one after having found the correct Path, so that the next nextFileInfo() call restarts at the right point. The current solution isn't the best for large directories performance-wise due to having to iterate twice over all files up to the resume point, but at least it is correct and better than omitting the first file from every FIND_NEXT2 response (or equivalent split-up Find request reply for SMB2).
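The off-by-one described above can be reproduced with a small in-memory model. This is only a sketch under simplifying assumptions: a `List<String>` stands in for the directory, obtaining a new iterator stands in for re-opening the `DirectoryStream`, and the class and method names (`RestartAtSketch`, `restartAtBuggy`, `restartAtFixed`) are hypothetical.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory model of a forward-only search context.
final class RestartAtSketch {
    private final List<String> entries;   // stands in for the directory
    private Iterator<String> it;          // forward-only, like the NIO iterator

    RestartAtSketch(List<String> entries) {
        this.entries = entries;
        this.it = entries.iterator();
    }

    /** Returns the next entry name, or null when the listing is exhausted. */
    String next() {
        return it.hasNext() ? it.next() : null;
    }

    /** Buggy variant: the match itself is consumed, so next() skips it. */
    boolean restartAtBuggy(String name) {
        it = entries.iterator();
        while (it.hasNext()) {
            if (it.next().equals(name)) {
                return true;
            }
        }
        return false;
    }

    /** Fixed variant: first locate the match, then re-iterate and stop just before it. */
    boolean restartAtFixed(String name) {
        int index = -1;
        int i = 0;
        for (Iterator<String> probe = entries.iterator(); probe.hasNext(); i++) {
            if (probe.next().equals(name)) {
                index = i;
                break;
            }
        }
        if (index < 0) {
            return false;
        }
        it = entries.iterator();          // second pass over the same entries
        for (int k = 0; k < index; k++) {
            it.next();
        }
        return true;
    }
}
```

With entries `a, b, c, d`, restarting at `c` with the buggy variant makes the following `next()` return `d`, while the fixed variant returns `c`, at the price of a second pass.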

Since I'm adding another caller for FileInfo.copyFrom(), I've had a look at that method and it seems that we ought to handle m_shortname there, too. And while we're at it, same thing for resetInfo(), too.
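As a rough illustration of the intent, the sketch below uses a stripped-down stand-in class rather than the real `FileInfo`. Only `m_shortname`, `copyFrom()` and `resetInfo()` are named in this PR; the class name `FileInfoSketch` and the other two fields are assumptions made for the example.

```java
// Hypothetical stand-in for FileInfo, reduced to three fields.
final class FileInfoSketch {
    String m_name = "";       // assumed field, for illustration
    String m_shortname;       // the field this change is about
    long m_size;              // assumed field, for illustration

    /** Resets every field, including the short name. */
    void resetInfo() {
        m_name = "";
        m_shortname = null;
        m_size = 0L;
    }

    /** Copies every field from another instance, including the short name. */
    void copyFrom(FileInfoSketch other) {
        m_name = other.m_name;
        m_shortname = other.m_shortname;
        m_size = other.m_size;
    }
}
```

If either method left `m_shortname` out, a reused or copied object could carry a stale short name from a previous file.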

…isting

The problem with the way our search code is implemented is that when we need to split up a large search response (with lots of files) into multiple packets, at the boundary between each packet we need to retrieve the FileInfo object for the same file twice - once to find out that it doesn't fit into the previous packet anymore, and a second time to actually transmit it in the followup packet. The protocol handler therefore calls restartAt() on the SearchContext in order to effectively rewind it by one entry, so that the next call to nextFileInfo() during the subsequent response packet returns the correct FileInfo entry (compare also the previous commit).

With the NIO files API, this turns into a problem, because the Streams-based iterator cannot be rewound, so every time we need to backtrack by one entry, we need to reiterate through all the directory's contents up to the desired restart point. (And even worse, the restartAt(FileInfo)-based method actually needs to iterate twice for every call.) For large directories with thousands of files (or more), this turns into a very noticeable overhead when listing the directory contents.

To work around this issue, we now cache the last returned FileInfo object, and check in restartAt(FileInfo) whether the call corresponds to the common case of going back by merely one entry. If so, instead of expensively rewinding the iterator, we simply set the next call to nextFileInfo() to return the previously cached FileInfo object and only subsequently resume iterating normally through the directory.
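The caching idea can be sketched generically as a one-slot replay wrapper around a forward-only iterator. This is not the code from the commit; `OneStepBack`, `tryRestartAt` and the field names are hypothetical, and equality is checked with `equals()` purely for illustration.

```java
import java.util.Iterator;

// Hypothetical wrapper that can replay the most recently returned element once.
final class OneStepBack<T> {
    private final Iterator<T> source;  // forward-only, cannot be rewound
    private T lastReturned;            // cache of the last element handed out
    private boolean replayLast;        // true if next() should return the cache

    OneStepBack(Iterator<T> source) {
        this.source = source;
    }

    /** Returns the cached element if a replay is pending, else advances the source. */
    T next() {
        if (replayLast) {
            replayLast = false;
            return lastReturned;
        }
        lastReturned = source.hasNext() ? source.next() : null;
        return lastReturned;
    }

    /**
     * Cheap path for "go back by exactly one entry". Returns false when the
     * target is not the last returned element, in which case the caller would
     * have to fall back to the expensive full rewind.
     */
    boolean tryRestartAt(T target) {
        if (lastReturned != null && lastReturned.equals(target)) {
            replayLast = true;
            return true;
        }
        return false;
    }
}
```

In this model the packet-boundary case costs a single comparison instead of one or two passes over the directory; any other restart target still takes the slow path.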

buttercookie42 added 4 commits November 3, 2024 00:37

Avoid leaking unclosed DirectoryStreams from searches

5882813

Have FileInfo.resetInfo() and copyFrom() handle the short name, too

43a548f

Since I'm adding another caller for FileInfo.copyFrom(), I've had a look at that method and it seems that we ought to handle m_shortname there, too. And while we're at it, same thing for resetInfo(), too.

buttercookie42 mentioned this pull request Nov 5, 2024

Speed up JavaNIODiskDriver.mapPath() #26

Open
