This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby
Collaborator

@sezruby sezruby commented Nov 23, 2020

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

What changes were proposed in this pull request?

The signature calculation in FileBasedSignatureProvider can differ depending on the order of input files. Therefore, we need to make sure the order is consistent.

For example, the following dataframes have the same list of input files, but the files can appear in a different order in HadoopFsRelation.allFiles. This can cause an unexpected signature mismatch.

spark.read.parquet("testPath") // location.allFiles
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00000-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00001-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00003-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet

val basePath = "file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d"
spark.read.option("basePath", basePath).parquet(basePath + "/Date=2017-09-03", basePath + "/Date=2018-09-03")
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2018-09-03/part-00003-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00000-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00001-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
file:/tmp/spark-5a167814-d814-4a69-b579-c90d3604de4d/Date=2017-09-03/part-00002-376e6280-30cb-4d77-80ce-4afec0991682.c000.snappy.parquet
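The order sensitivity can be sketched outside Spark. Here is a minimal illustration in Python (hypothetical paths; the real provider hashes FileInfo entries, but the principle is the same): hashing the concatenation of paths makes the digest depend on enumeration order, while sorting first makes it deterministic.

```python
import hashlib

def signature(paths):
    """Hash the concatenation of file paths (order-sensitive, like the bug)."""
    return hashlib.md5("".join(paths).encode()).hexdigest()

files_a = ["/base/Date=2017-09-03/part-00000", "/base/Date=2018-09-03/part-00002"]
files_b = list(reversed(files_a))  # same files, different enumeration order

assert signature(files_a) != signature(files_b)                  # spurious mismatch
assert signature(sorted(files_a)) == signature(sorted(files_b))  # sorting fixes it
```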

Does this PR introduce any user-facing change?

Yes, it fixes the bug described in the section above.

How was this patch tested?

Unit test

@sezruby sezruby added the bug Something isn't working label Nov 23, 2020
@sezruby sezruby self-assigned this Nov 23, 2020
@sezruby sezruby added this to the November 2020 milestone Nov 23, 2020
@apoorvedave1
Contributor

Thanks @sezruby,

Could you also hide this feature behind a flag which defaults to false?
My concern is that this would be unnecessary computation for regular scenarios. In any case, if the signature doesn't match, RuleUtils will still make sure that the index is used if it can be, right? I am just wondering why we should always pay O(N log N) for sorting to cover a rare scenario that would still work without this change (assuming Hybrid Scan is eventually enabled by default).

Alternatively, if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to a set and then create the fingerprint. A set will ensure the iteration order is the same for a given collection of elements. That way we can still achieve O(N).

cc @imback82

@imback82
Contributor

I think it's crucial to have the same order whenever we do the fingerprint calculation. We cannot assume Spark will always give the right order, so let's sort them ourselves.

Alternatively, if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to a set and then create the fingerprint. A set will ensure the iteration order is the same for a given collection of elements. That way we can still achieve O(N).

A set will have to keep its elements in sorted order (to guarantee deterministic iteration), so it will be O(N log N); I don't think there is any difference.

@pirz
Contributor

pirz commented Nov 23, 2020

The root cause of the issue is that our signature computation routine is not associative. Given that we have all input files before signature computation, sorting is one way to make it order-insensitive; however, as @apoorvedave1 says above, it could be overkill for a case with thousands of files, given the number of times this routine is called. If we really believe this is a critical issue, what about changing the signature format? Instead of returning one long concatenated string, it could hash each filePath to a number and add the numbers up to calculate the final signature. Adding a finite set of numbers is associative, and for signature usage we really only care about equality and inequality, not the content.
Update: We cannot switch to hashing/adding due to false positives.
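Why additive signatures admit false positives can be shown with toy hash values (illustrative numbers and file names only, not the real hash function): addition is commutative, and different file sets can sum to the same value.

```python
# Toy per-file "hashes", chosen to make the collision concrete.
toy_hash = {"w.parquet": 1, "x.parquet": 4, "y.parquet": 2, "z.parquet": 3}

def additive_signature(files):
    """Sum of per-file hashes: order-insensitive, but collision-prone."""
    return sum(toy_hash[f] for f in files)

# Two disjoint file sets, same signature: 1 + 4 == 2 + 3.
assert additive_signature(["w.parquet", "x.parquet"]) == \
       additive_signature(["y.parquet", "z.parquet"])
```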

    _,
    _) =>
-       fingerprint ++= location.allFiles.foldLeft("")(
+       fingerprint ++= location.allFiles.sortBy(_.hashCode).foldLeft("")(
Contributor

is hashCode enough? What if two strings return the same hashCode?

Contributor

Maybe use sortWith to compare first by hashCode and then fall back to string comparison (very rare)?
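The collision concern is real for 32-bit hash codes. A sketch in Python of the suggested ordering, using Java's String.hashCode recurrence as a stand-in (the actual code hashes file status objects, so this is illustrative only): "Aa" and "BB" are the classic String.hashCode collision, and a (hash, string) sort key gives hash-first ordering with a lexicographic tie-break.

```python
def java_string_hash(s):
    """Java's String.hashCode: h = 31*h + ch over 32-bit signed ints."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

assert java_string_hash("Aa") == java_string_hash("BB")  # collision: both 2112

# sortWith-style ordering: hashCode first, string comparison to break ties.
paths = ["BB", "Aa"]
ordered = sorted(paths, key=lambda p: (java_string_hash(p), p))
assert ordered == ["Aa", "BB"]  # ties broken deterministically
```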

Collaborator Author

@sezruby sezruby Nov 24, 2020

hash code function (this is java.net.URI#hashCode):

public int hashCode() {
        if (hash != 0)
            return hash;
        int h = hashIgnoringCase(0, scheme);
        h = hash(h, fragment);
        if (isOpaque()) {
            h = hash(h, schemeSpecificPart);
        } else {
            h = hash(h, path);
            h = hash(h, query);
            if (host != null) {
                h = hash(h, userInfo);
                h = hashIgnoringCase(h, host);
                h += 1949 * port;
            } else {
                h = hash(h, authority);
            }
        }
        hash = h;
        return h;
    }

or we could use getPath.toString.hashCode

How about using XOR, or just comparing the full file list as Hybrid Scan does?
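XOR-combining has the same weakness as addition. A hedged sketch with toy hash values: XOR is commutative (so order-insensitive), but it is also self-inverse and collision-prone across different file sets.

```python
from functools import reduce

def xor_signature(hashes):
    """Fold per-file hashes with XOR: order-insensitive but unsound."""
    return reduce(lambda a, b: a ^ b, hashes, 0)

# Self-inverse: a value appearing twice cancels out completely.
assert xor_signature([0b1010, 0b1010, 0b0111]) == xor_signature([0b0111])
# Different sets can collide: 1 ^ 2 == 3 ^ 0.
assert xor_signature([1, 2]) == xor_signature([3, 0])
```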

Collaborator Author

btw changing signature is a breaking change.... 😢

Contributor

btw changing signature is a breaking change.... 😢

I think we can maintain backward compatibility if we want by creating IndexSignatureProviderV2:

def create(name: String): LogicalPlanSignatureProvider = {

Contributor

compare all file list like hybrid scan?

What do you mean by this?

Collaborator Author

What do you mean by this?

I mean not using FileBasedSignature and instead checking all files (name/length/modification time). Sorting + hash computation is also not cheap.

Contributor

sorting + hash computation is also not cheap.

Are we looking at an overhead in the 10s, 100s, or 1000s?

means not using FileBasedSignature and check all files (name/len/modification time) ?

Signature was meant to quickly rule out indexes by looking at source files, but it may not be as useful with Hybrid Scan anymore. We can address this in a follow-up PR, but let's fix this first with sorting (maybe sortBy(_.getPath.toString)?) since I see tests are failing due to this.

Collaborator Author

While testing a 100k-chunk dataset, I observed that the overhead is similar to Hybrid Scan's in case there's no candidate index (the partial-index case). A tag for the signature value might help reduce the overhead.

Contributor

I observed the overhead is similar to Hybrid Scan in case there's no candidate index (partial index case)

Is this good, bad, or reasonable?

@imback82
Contributor

Adding a finite set of numbers is associative and regarding the signature usage we really only care about equality and inequality, not the content.

But this can cause false positives? 1 + 4 = 2 + 3?

@imback82
Contributor

imback82 commented Nov 24, 2020

But this can cause false positive? 1 + 4 = 2 + 3?

I guess even the current implementation using md5 could cause false positives (I believe a 1 in 2^128 chance). 😄

@imback82 imback82 closed this Nov 24, 2020
@imback82 imback82 reopened this Nov 24, 2020
Contributor

@imback82 imback82 left a comment

LGTM, thanks @sezruby!

@apoorvedave1 you had a concern with this PR, so I will wait for you approval before merging. (btw, this is affecting open PRs)

@imback82
Contributor

@apoorvedave1 you had a concern with this PR, so I will wait for you approval before merging. (btw, this is affecting open PRs)

@apoorvedave1 Gentle ping. (I will create a separate issue to follow up to handle perf concern)

@apoorvedave1
Contributor

@apoorvedave1 you had a concern with this PR, so I will wait for you approval before merging. (btw, this is affecting open PRs)

@apoorvedave1 Gentle ping. (I will create a separate issue to follow up to handle perf concern)

@imback82 thanks, then it should be fine to track it in a separate issue. My concern is: what if sorting (O(N log N)) becomes costlier than just comparing the sets of files from both sides (O(N))? That is, the perf hit of sorting could remove the benefit of using the signature in the first place. Thanks for creating the issue. Other than that, LGTM. Thanks @sezruby

@imback82
Contributor

comparing set of files from both sides (O(n))

But you have to build a "set" with some ordering, which is not O(n)?

@apoorvedave1
Contributor

apoorvedave1 commented Nov 24, 2020

comparing set of files from both sides (O(n))

But you have to build a "set" with some ordering, which is not O(n)?

Oh sorry, I didn't mean a SortedSet implementation, just a HashSet. Sorry for causing confusion.

My suggestion was:
Option 1 (current approach): sort elements -> iterate and create the signature: O(N log N + N) = O(N log N)
Option 2: insert elements into hash sets (O(N + N)) -> compare the sets (O(N)) = O(N)

(I am basing this on the understanding that inserting an element into a hash set is O(1) amortized, making set creation O(N). Please correct me if I am wrong.)

Update: meaning that if we sort, we could be paying more than the cost of just iterating over the elements and comparing them one by one.

@imback82
Contributor

Oh sorry, I didn't mean a SortedSet implementation, just a HashSet. Sorry for causing confusion.

My suggestion was:
Option 1 (current approach): sort elements -> iterate and create the signature: O(N log N + N) = O(N log N)
Option 2: insert elements into hash sets (O(N + N)) -> compare the sets (O(N)) = O(N)

(I am basing this on the understanding that inserting an element into a hash set is O(1) amortized, making set creation O(N). Please correct me if I am wrong.)

Update: meaning that if we sort, we could be paying more than the cost of just iterating over the elements and comparing them one by one.

Let's just measure it. Theory vs. practice can be quite different; 2·O(N) = O(N) in theory could still be worse than O(N log N) in practice. It all depends on the implementation.
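A measurement along these lines could be sketched as a micro-benchmark (illustrative only; random strings stand in for file paths, and the real cost would also include building the FileInfo entries):

```python
import random
import string
import timeit

# Synthetic stand-ins for 100k file paths.
paths = ["".join(random.choices(string.ascii_lowercase, k=40)) for _ in range(100_000)]

def option1_sort_then_fingerprint():
    # Option 1: sort once, then fingerprint in a single pass.
    return hash(tuple(sorted(paths)))

def option2_set_compare():
    # Option 2: build hash sets on both sides and compare them directly.
    return set(paths) == set(paths)

print("sort + fingerprint :", timeit.timeit(option1_sort_then_fingerprint, number=10))
print("set build + compare:", timeit.timeit(option2_set_compare, number=10))
```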

@imback82 imback82 merged commit cbfed13 into microsoft:master Nov 24, 2020
@sezruby sezruby deleted the fixSig branch November 25, 2020 15:46
@imback82 imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
Labels: breaking changes, bug