Sort source file list in FileBasedSignatureProvider #268
Conversation
Thanks @sezruby, could you also hide this feature behind a flag which defaults to false? Alternatively, if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to a set and then create the fingerprint. A set will ensure the order of iteration is unique for a unique collection of elements. That way we can still achieve O(N). cc @imback82
I think it's crucial to have the same order whenever we do the fingerprint calculation. We cannot assume Spark will always give the right order, so let's sort them ourselves.
A set will have elements in sorted order, so it will be O(n log n); I don't think there is any difference.
The root cause of the issue is that our signature computation routine is not associative. Given that we have all input files before signature computation, sorting is one approach to make it order insensitive; however, as @apoorvedave1 says above, it could be overkill in a case with 1000s of files, given the number of times this routine would be called. If we really believe this is a critical issue, then
        _,
        _) =>
-     fingerprint ++= location.allFiles.foldLeft("")(
+     fingerprint ++= location.allFiles.sortBy(_.hashCode).foldLeft("")(
is hashCode enough? What if two strings return the same hashCode?
Maybe use sortWith to compare first with hashCode and then fall back to string comparison (very rare)?
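The sortWith suggestion above can be sketched as a comparator: order primarily by hashCode, then break ties with a full string comparison so that colliding hash codes cannot scramble the order. This is a hypothetical Java sketch, not Hyperspace code (Hyperspace itself is Scala):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PathSort {
    // Primary key: hashCode; tie-break: full lexicographic comparison.
    // Mirrors the suggested sortWith fallback for hashCode collisions.
    static final Comparator<String> BY_HASH_THEN_STRING = (a, b) -> {
        int c = Integer.compare(a.hashCode(), b.hashCode());
        return c != 0 ? c : a.compareTo(b);
    };

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known hashCode collision in Java (both 2112).
        List<String> paths = new ArrayList<>(List.of("BB", "Aa"));
        paths.sort(BY_HASH_THEN_STRING);
        System.out.println(paths); // [Aa, BB]
    }
}
```

Even with the collision, the tie-break keeps the resulting order deterministic.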
hash code function:

    public int hashCode() {
        if (hash != 0)
            return hash;
        int h = hashIgnoringCase(0, scheme);
        h = hash(h, fragment);
        if (isOpaque()) {
            h = hash(h, schemeSpecificPart);
        } else {
            h = hash(h, path);
            h = hash(h, query);
            if (host != null) {
                h = hash(h, userInfo);
                h = hashIgnoringCase(h, host);
                h += 1949 * port;
            } else {
                h = hash(h, authority);
            }
        }
        hash = h;
        return h;
    }
or we could use getPath.toString.hashCode
How about using XOR, or just comparing all file lists like hybrid scan?
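For reference, an XOR-style combine is order-insensitive without any sorting, but it can collide on different file sets. A hypothetical Java sketch (not Hyperspace code):

```java
import java.util.List;

public class XorFingerprint {
    // Fold hash codes with XOR: commutative and associative,
    // so iteration order does not affect the result.
    static int xorCombine(List<Integer> hashes) {
        int acc = 0;
        for (int h : hashes) {
            acc ^= h;
        }
        return acc;
    }

    public static void main(String[] args) {
        // Same elements in any order give the same result.
        System.out.println(
            xorCombine(List.of(1, 2, 3)) == xorCombine(List.of(3, 1, 2))); // true
        // But distinct sets can collide: {1, 2} folds to the same value as {3}.
        System.out.println(
            xorCombine(List.of(1, 2)) == xorCombine(List.of(3))); // true
    }
}
```

The second print illustrates the false-positive concern raised later in this thread.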
btw changing signature is a breaking change.... 😢
> btw changing signature is a breaking change.... 😢

I think we can maintain the backward compatibility if we want by creating IndexSignatureProviderV2:

hyperspace/src/main/scala/com/microsoft/hyperspace/index/LogicalPlanSignatureProvider.scala, line 55 (commit 199fa9a):

    def create(name: String): LogicalPlanSignatureProvider = {
> compare all file list like hybrid scan?

What do you mean by this?
> What do you mean by this?

It means not using FileBasedSignature and checking all files (name/len/modification time)? Sorting + hash computation is also not cheap.
> sorting + hash computation is also not cheap.
Are we looking at 10s, 100s or 1000s overhead?
> means not using FileBasedSignature and check all files (name/len/modification time)?
Signature was meant to quickly rule out indexes by looking at source files, but it may not be useful with hybrid scan anymore. We can address this in a follow-up PR, but let's fix this first with sorting (maybe sortBy(_.getPath.toString)?) since I see tests are failing due to this.
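The fix proposed here (sort by the full path string, then fingerprint) can be sketched as follows. The method name and the MD5 folding are illustrative assumptions, not the exact Hyperspace code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedFingerprint {
    // Sort by the full path string (deterministic, no hashCode ties),
    // then fold every path into a single MD5 digest.
    static String fingerprint(List<String> paths) {
        try {
            List<String> sorted = new ArrayList<>(paths);
            Collections.sort(sorted);
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String p : sorted) {
                md5.update(p.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String a = fingerprint(List.of("/data/part-0", "/data/part-1"));
        String b = fingerprint(List.of("/data/part-1", "/data/part-0"));
        System.out.println(a.equals(b)); // true: listing order no longer matters
    }
}
```

Sorting by the full string avoids the hashCode-collision question entirely, at the cost of O(n log n) string comparisons.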
While testing a 100k chunk dataset, I observed the overhead is similar to Hybrid Scan in case there's no candidate index (partial index case). A tag for the signature value might be helpful to reduce the overhead.
> I observed the overhead is similar to Hybrid Scan in case there's no candidate index (partial index case)
Is this good, bad, or reasonable?
But this can cause false positives? 1 + 4 = 2 + 3?
I guess even the current implementation using MD5 could cause false positives (I believe a 1 in 2^128 chance). 😄
imback82 left a comment
LGTM, thanks @sezruby!
@apoorvedave1 you had a concern with this PR, so I will wait for your approval before merging. (btw, this is affecting open PRs)
@apoorvedave1 Gentle ping. (I will create a separate issue to follow up on the perf concern)
@imback82 thanks, that should be fine then to track it in a separate issue. My concern is: what if sorting (O(n log n)) becomes costlier than just comparing the sets of files from both sides (O(n))? Meaning, the perf hit of sorting could remove the benefit of using the signature in the first place. Thanks for creating the issue. Other than that, LGTM. Thanks @sezruby
But you have to build a "set" with some ordering, which is not O(n)?
Oh sorry, I didn't mean a SortedSet implementation. I just meant HashSet. Sorry for causing confusion. (I am basing this on the understanding that inserting an element into a hash set is O(1) amortized, making set creation O(n). Please correct me if I am wrong.) Update: meaning, if we sort, we could be paying more cost than just iterating over the elements and comparing them one by one.
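The HashSet-based comparison described above (O(n) amortized, no sorting) could look like this hypothetical Java sketch:

```java
import java.util.HashSet;
import java.util.List;

public class SetCompare {
    // Build a HashSet from each file list (O(n) amortized inserts) and
    // compare the two sides directly, instead of sorting either list.
    static boolean sameFiles(List<String> indexed, List<String> current) {
        return new HashSet<>(indexed).equals(new HashSet<>(current));
    }

    public static void main(String[] args) {
        List<String> indexed = List.of("/data/f1", "/data/f2");
        List<String> reordered = List.of("/data/f2", "/data/f1");
        System.out.println(sameFiles(indexed, reordered)); // true
        System.out.println(sameFiles(indexed, List.of("/data/f3"))); // false
    }
}
```

Note this compares two concrete file lists rather than producing a standalone fingerprint, which is the trade-off discussed in this thread.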
Let's just measure it. Theory vs. actual can be quite different; 2*O(N) = O(N) in theory could be worse than O(n log n) in practice. It all depends on the implementation.
What is the context for this pull request?
What changes were proposed in this pull request?
The signature calculation in FileBasedSignatureProvider can differ depending on the order of input files. Therefore, we need to make sure the order is consistent.
For example, the following dataframes have the same list of input files, but can have a different order in HadoopFsRelation.allFiles. This can cause an unexpected signature mismatch.
Does this PR introduce any user-facing change?
Yes, fixes the bug described in the above section.
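The order dependence being fixed can be reproduced with a minimal sketch (hypothetical Java, not the actual provider code): folding files in listing order yields different fingerprints for the same file set.

```java
import java.util.List;

public class OrderDependentBug {
    // Naive fingerprint: fold file paths in the order the listing returns them.
    // If HadoopFsRelation.allFiles returns a different order, the result changes.
    static String naiveFingerprint(List<String> paths) {
        StringBuilder sb = new StringBuilder();
        for (String p : paths) {
            sb.append(p.hashCode()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = naiveFingerprint(List.of("/data/part-0", "/data/part-1"));
        String b = naiveFingerprint(List.of("/data/part-1", "/data/part-0"));
        System.out.println(a.equals(b)); // false: same files, different signature
    }
}
```

Sorting the file list before folding, as this PR does, makes the two calls above agree.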
How was this patch tested?
Unit test