This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby
Collaborator

@sezruby sezruby commented Nov 23, 2020

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

What changes were proposed in this pull request?

The signature calculation in FileBasedSignatureProvider can differ depending on the order of input files. Therefore, we need to make sure the order is consistent.

For example, the following dataframes have the same list of input files, but the files can appear in a different order in HadoopFsRelation.allFiles. This can cause an unexpected signature mismatch.

spark.read.parquet("testPath") // location.allFiles
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00000-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00001-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00003-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet

val basePath = "file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d"
spark.read.option("basePath", basePath).parquet(basePath + "/Date=2017-09-03", basePath + "/Date=2018-09-03")
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00003-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00000-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00001-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
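The order sensitivity can be sketched outside Spark. Here is a minimal illustration in Python (hypothetical paths; the real provider hashes FileInfo entries, but the principle is the same): hashing the concatenation of paths makes the digest depend on enumeration order, while sorting first makes it deterministic.

```python
import hashlib

def signature(paths):
    """Hash the concatenation of file paths (order-sensitive, like the bug)."""
    return hashlib.md5("".join(paths).encode()).hexdigest()

files_a = ["/base/Date=2017-09-03/part-00000", "/base/Date=2018-09-03/part-00002"]
files_b = list(reversed(files_a))  # same files, different enumeration order

assert signature(files_a) != signature(files_b)                  # spurious mismatch
assert signature(sorted(files_a)) == signature(sorted(files_b))  # sorting fixes it
```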

Does this PR introduce any user-facing change?

Yes, it fixes the bug described in the section above.

How was this patch tested?

Unit test

@sezruby sezruby added the bug Something isn't working label Nov 23, 2020
@sezruby sezruby self-assigned this Nov 23, 2020
@sezruby sezruby added this to the November 2020 milestone Nov 23, 2020
@apoorvedave1
Contributor

Thanks @sezruby,

Could you also hide this feature behind a flag which defaults to false?
My concern is that this would be unnecessary computation for regular scenarios. In any case, if the signature doesn't match, RuleUtils will still make sure that the index is used if it can be, right? I am just wondering why we should always pay O(N log N) for sorting to cover a rare scenario that would still work without this change (assuming Hybrid Scan is eventually enabled by default).

Alternatively, if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to a set and then create the fingerprint. A set will ensure the iteration order is the same for a given collection of elements. That way we can still achieve O(N).

cc @imback82

@imback82
Contributor

I think it's crucial to have the same order whenever we do the fingerprint calculation. We cannot assume Spark will always give the right order, so let's sort them ourselves.

Alternatively, if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to a set and then create the fingerprint. A set will ensure the iteration order is the same for a given collection of elements. That way we can still achieve O(N).

A set will have to keep its elements in sorted order (to guarantee deterministic iteration), so it will be O(N log N); I don't think there is any difference.

@pirz
Contributor

pirz commented Nov 23, 2020

The root cause of the issue is that our signature computation routine is not associative. Given that we have all input files before signature computation, sorting is one way to make it order-insensitive; however, as @apoorvedave1 says above, it could be overkill for a case with thousands of files, given the number of times this routine is called. If we really believe this is a critical issue, what about changing the signature format? Instead of returning one long concatenated string, it could hash each filePath to a number and add the numbers up to calculate the final signature. Adding a finite set of numbers is associative, and for signature usage we really only care about equality and inequality, not the content.
Update: We cannot switch to hashing/adding due to false positives.
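Why additive signatures admit false positives can be shown with toy hash values (illustrative numbers and file names only, not the real hash function): addition is commutative, and different file sets can sum to the same value.

```python
# Toy per-file "hashes", chosen to make the collision concrete.
toy_hash = {"w.parquet": 1, "x.parquet": 4, "y.parquet": 2, "z.parquet": 3}

def additive_signature(files):
    """Sum of per-file hashes: order-insensitive, but collision-prone."""
    return sum(toy_hash[f] for f in files)

# Two disjoint file sets, same signature: 1 + 4 == 2 + 3.
assert additive_signature(["w.parquet", "x.parquet"]) == \
       additive_signature(["y.parquet", "z.parquet"])
```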

    _,
    _) =>
-       fingerprint ++= location.allFiles.foldLeft("")(
+       fingerprint ++= location.allFiles.sortBy(_.hashCode).foldLeft("")(
Contributor

is hashCode enough? What if two strings return the same hashCode?

Contributor

Maybe use sortWith to compare first by hashCode and then fall back to string comparison (very rare)?
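The collision concern is real for 32-bit hash codes. A sketch in Python of the suggested ordering, using Java's String.hashCode recurrence as a stand-in (the actual code hashes file status objects, so this is illustrative only): "Aa" and "BB" are the classic String.hashCode collision, and a (hash, string) sort key gives hash-first ordering with a lexicographic tie-break.

```python
def java_string_hash(s):
    """Java's String.hashCode: h = 31*h + ch over 32-bit signed ints."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

assert java_string_hash("Aa") == java_string_hash("BB")  # collision: both 2112

# sortWith-style ordering: hashCode first, string comparison to break ties.
paths = ["BB", "Aa"]
ordered = sorted(paths, key=lambda p: (java_string_hash(p), p))
assert ordered == ["Aa", "BB"]  # ties broken deterministically
```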

Collaborator Author

@sezruby sezruby Nov 24, 2020

hash code function (this is java.net.URI#hashCode):

public int hashCode() {
        if (hash != 0)
            return hash;
        int h = hashIgnoringCase(0, scheme);
        h = hash(h, fragment);
        if (isOpaque()) {
            h = hash(h, schemeSpecificPart);
        } else {
            h = hash(h, path);
            h = hash(h, query);
            if (host != null) {
                h = hash(h, userInfo);
                h = hashIgnoringCase(h, host);
                h += 1949 * port;
            } else {
                h = hash(h, authority);
            }
        }
        hash = h;
        return h;
    }

or we could use getPath.toString.hashCode

How about using XOR, or just comparing the full file list as Hybrid Scan does?
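XOR-combining has the same weakness as addition. A hedged sketch with toy hash values: XOR is commutative (so order-insensitive), but it is also self-inverse and collision-prone across different file sets.

```python
from functools import reduce

def xor_signature(hashes):
    """Fold per-file hashes with XOR: order-insensitive but unsound."""
    return reduce(lambda a, b: a ^ b, hashes, 0)

# Self-inverse: a value appearing twice cancels out completely.
assert xor_signature([0b1010, 0b1010, 0b0111]) == xor_signature([0b0111])
# Different sets can collide: 1 ^ 2 == 3 ^ 0.
assert xor_signature([1, 2]) == xor_signature([3, 0])
```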

Collaborator Author

btw changing signature is a breaking change.... 😢

Contributor

btw changing signature is a breaking change.... 😢

I think we can maintain backward compatibility if we want by creating IndexSignatureProviderV2:

def create(name: String): LogicalPlanSignatureProvider = {

Contributor

compare all file list like hybrid scan?

What do you mean by this?

Collaborator Author

What do you mean by this?

I mean not using FileBasedSignature and instead checking all files (name/length/modification time). Sorting + hash computation is also not cheap.

Contributor

sorting + hash computation is also not cheap.

Are we looking at an overhead in the 10s, 100s, or 1000s?

means not using FileBasedSignature and check all files (name/len/modification time) ?

Signature was meant to quickly rule out indexes by looking at source files, but it may not be as useful with Hybrid Scan anymore. We can address this in a follow-up PR, but let's fix this first with sorting (maybe sortBy(_.getPath.toString)?) since I see tests are failing due to this.

Collaborator Author

While testing a 100k-chunk dataset, I observed that the overhead is similar to Hybrid Scan's in case there's no candidate index (the partial-index case). A tag for the signature value might help reduce the overhead.

Contributor

I observed the overhead is similar to Hybrid Scan in case there's no candidate index (partial index case)

Is this good, bad, or reasonable?

@imback82
Contributor

Adding a finite set of numbers is associative and regarding the signature usage we really only care about equality and inequality, not the content.

But this can cause false positives? 1 + 4 = 2 + 3?

@imback82
Contributor

imback82 commented Nov 24, 2020

But this can cause false positive? 1 + 4 = 2 + 3?

I guess even the current implementation using md5 could cause false positives (I believe a 1 in 2^128 chance). 😄

@imback82 imback82 closed this Nov 24, 2020
@imback82 imback82 reopened this Nov 24, 2020
Contributor

@imback82 imback82 left a comment

LGTM, thanks @sezruby!

@apoorvedave1 you had a concern with this PR, so I will wait for you approval before merging. (btw, this is affecting open PRs)

@imback82
Contributor

@apoorvedave1 you had a concern with this PR, so I will wait for you approval before merging. (btw, this is affecting open PRs)

@apoorvedave1 Gentle ping. (I will create a separate issue to follow up to handle perf concern)

@apoorvedave1
Contributor

@apoorvedave1 you had a concern with this PR, so I will wait for you approval before merging. (btw, this is affecting open PRs)

@apoorvedave1 Gentle ping. (I will create a separate issue to follow up to handle perf concern)

@imback82 thanks, then it should be fine to track it in a separate issue. My concern is: what if sorting (O(N log N)) becomes costlier than just comparing the sets of files from both sides (O(N))? That is, the perf hit of sorting could remove the benefit of using the signature in the first place. Thanks for creating the issue. Other than that, LGTM. Thanks @sezruby

@imback82
Contributor

comparing set of files from both sides (O(n))

But you have to build a "set" with some ordering, which is not O(n)?

@apoorvedave1
Contributor

apoorvedave1 commented Nov 24, 2020

comparing set of files from both sides (O(n))

But you have to build a "set" with some ordering, which is not O(n)?

Oh sorry, I didn't mean a SortedSet implementation, just a HashSet. Sorry for causing confusion.

My suggestion was:
Option 1 (current approach): sort elements -> iterate and create the signature: O(N log N + N) = O(N log N)
Option 2: insert elements into hash sets (O(N + N)) -> compare the sets (O(N)) = O(N)

(I am basing this on the understanding that inserting an element into a hash set is O(1) amortized, making set creation O(N). Please correct me if I am wrong.)

Update: meaning that if we sort, we could be paying more than the cost of just iterating over the elements and comparing them one by one.

@imback82
Contributor

Oh sorry, I didn't mean a SortedSet implementation, just a HashSet. Sorry for causing confusion.

My suggestion was:
Option 1 (current approach): sort elements -> iterate and create the signature: O(N log N + N) = O(N log N)
Option 2: insert elements into hash sets (O(N + N)) -> compare the sets (O(N)) = O(N)

(I am basing this on the understanding that inserting an element into a hash set is O(1) amortized, making set creation O(N). Please correct me if I am wrong.)

Update: meaning that if we sort, we could be paying more than the cost of just iterating over the elements and comparing them one by one.

Let's just measure it. Theory vs. practice can be quite different; 2·O(N) = O(N) in theory could still be worse than O(N log N) in practice. It all depends on the implementation.
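A measurement along these lines could be sketched as a micro-benchmark (illustrative only; random strings stand in for file paths, and the real cost would also include building the FileInfo entries):

```python
import random
import string
import timeit

# Synthetic stand-ins for 100k file paths.
paths = ["".join(random.choices(string.ascii_lowercase, k=40)) for _ in range(100_000)]

def option1_sort_then_fingerprint():
    # Option 1: sort once, then fingerprint in a single pass.
    return hash(tuple(sorted(paths)))

def option2_set_compare():
    # Option 2: build hash sets on both sides and compare them directly.
    return set(paths) == set(paths)

print("sort + fingerprint :", timeit.timeit(option1_sort_then_fingerprint, number=10))
print("set build + compare:", timeit.timeit(option2_set_compare, number=10))
```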

@imback82 imback82 merged commit cbfed13 into microsoft:master Nov 24, 2020
@sezruby sezruby deleted the fixSig branch November 25, 2020 15:46
@imback82 imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
Labels: breaking changes, bug