[GRADLE-2546] directory scanning is inefficient when all include patterns share a common prefix Created: 04/Nov/12  Updated: 10/Feb/13  Resolved: 04/Feb/13

Status: Resolved
Project: Gradle
Affects Version/s: None
Fix Version/s: 1.4-rc-1

Type: Bug
Reporter: Adam Murdoch Assignee: Unassigned
Resolution: Fixed Votes: 0

Comment by Adam Murdoch [ 04/Nov/12 ]

For example, given includes = [a/b/c/*/d] and base dir = /base, currently DirectoryFileTree does the following:

1. List the children of /base
2. Iterate over the children looking for one called 'a'.
3. List the children of /base/a
4. Iterate over the children looking for one called 'b'.
5. Repeat this process down /base/a/b/c.
6. Traverse the children of /base/a/b/c looking for matches.

This can be quite slow if one of these directories has a large number of children (eg when looking for an artifact in the artifact file store). Instead, in this instance, DirectoryFileTree should skip directly to step 6. above.

Also, PatternSet is currently implemented in Groovy. It should be ported to Java.

Comment by Luke Daley [ 06/Nov/12 ]

This is a pretty big change, that will require some heavy redesigning.

I'd be a bit uncomfortable about putting this in 1.3 at this late stage.

Comment by Peter Niederwieser [ 23/Nov/12 ]

There's now SingleIncludePatternFileTree, which provides a more efficient way to walk a directory hierarchy than DirectoryFileTree. Question is where to go from here. Some options:

1. Leave DirectoryFileTree as-is and switch selected clients (e.g. PatternBasedLocallyAvailableResourceFinder) to SingleIncludePatternFileTree.
2. Optimize DirectoryFileTree.visit() (only) for the case that all include patterns share a common prefix, and no include specs are provided. This optimization doesn't hurt but is limited in applicability and effectiveness. For example, it won't help to speed up PatternBasedLocallyAvailableResourceFinder.
3. Optimize DirectoryFileTree.visit() to only exhaustively scan the directory hierarchy where needed (e.g. from the point where a '**' is encountered, or for handling include specs). This can be made to work no matter how the DirectoryFileTree is configured, provided one gives up on the breadthFirst/depthFirst traversal guarantees (which really are misnamed prefix/postfix depth-first traversal guarantees for directories; files are always traversed in prefix order).
4. Same as 3., but with a switch (say prefix/postfix/no guarantees).

I've already spiked 1. and 4., but don't have conclusive benchmarks.

Comment by Adam Murdoch [ 05/Dec/12 ]

For now, the cases that we're interested in are improving PathKeyFileStore and PatternBasedLocallyAvailableResourceFinder. So option 1 is certainly a good next step.

Option 4 would be worth doing. PathKeyFileStore and PatternBasedLocallyAvailableResourceFinder don't care about the guarantee, just as long as all the matching files are visited at some point. They also don't care about visiting the directories, so that offers some other potential optimisations. The same is true for FileCollection.getFiles() and a few other methods.

So, I think we should do something like this:
1. Wire up PathKeyFileStore and PatternBasedLocallyAvailableResourceFinder to SingleIncludePatternFileTree to pick up the improvements.
2. Move the optimisations across to DirectoryFileTree.visit() with some option to specify which guarantees the visitor requires.
3. Change PathKeyFile and PatternBasedLocallyAvailableResourceFinder to use this, in "I only want files in any order" mode, and remove SingleIncludePatternFileTree.
4. Leave the other callers of DirectoryFileTree.visit() in "I want all the directories and files, in prefix order" mode. We can improve these later. The idea is that these callers will pick up some, but not all, of the optimisations for now.

p.s. can you update the Javadocs to use "prefix order" instead of "breadth-wise order"?

Generated at Wed Jun 30 12:24:53 CDT 2021 using Jira 8.4.2#804003-sha1:d21414fc212e3af190e92c2d2ac41299b89402cf.