This repository was archived by the owner on Jan 28, 2021. It is now read-only.
plan: compute all inner joins in memory if they fit #638
Fixes #577
Because we do not have a way to estimate the cost of each side of
a join, it is really difficult to know when we can compute one in
memory. But not doing so causes inner joins to be painfully slow,
as one of the branches is iterated multiple times.
This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.
A user can set the maximum amount of memory the gitbase server can
use before joins are no longer performed in memory, using the
`MAX_MEMORY_INNER_JOIN` environment variable or the `max_memory_joins`
session variable, specifying the number of bytes. The default value is
half of the physical memory available on the operating system.
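As a rough illustration of how such a threshold could be resolved, here is a minimal sketch. `resolveMaxJoinMemory` is a hypothetical helper, not the code in this PR. It assumes the session variable takes precedence over the environment variable (the description above does not state an order), that `0` or an empty string means "unset", and that the total physical memory is obtained elsewhere and passed in.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveMaxJoinMemory is a hypothetical helper, not gitbase's actual code.
//
//   - sessionBytes: value of the max_memory_joins session variable (0 = unset)
//   - envValue: raw MAX_MEMORY_INNER_JOIN string ("" = unset)
//   - totalPhysical: physical memory in bytes, obtained by the caller
//
// Assumption: the session variable wins over the environment variable.
func resolveMaxJoinMemory(sessionBytes uint64, envValue string, totalPhysical uint64) uint64 {
	if sessionBytes > 0 {
		return sessionBytes
	}
	if envValue != "" {
		// Ignore values that are not positive integers and fall through
		// to the default instead of failing.
		if n, err := strconv.ParseUint(envValue, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	// Default described above: half of the available physical memory.
	return totalPhysical / 2
}

func main() {
	const total = 8 << 30 // pretend the machine has 8 GiB

	fmt.Println(resolveMaxJoinMemory(0, "", total))           // nothing set
	fmt.Println(resolveMaxJoinMemory(0, "1048576", total))    // env var only
	fmt.Println(resolveMaxJoinMemory(2048, "1048576", total)) // session and env
}
```

Passing the physical memory in as a parameter keeps the sketch portable, since reading it is OS-specific.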
Previously we had two iterators, `innerJoinIter` and
`innerJoinMemoryIter`. Because `innerJoinIter` must now be able to do
the join in memory, `innerJoinMemoryIter` has been removed and
`innerJoinIter` replaced with a version that can work in three modes:
- `unknownMode`: we don't know yet how to perform the join, so keep
  iterating until we can find out. By the end of the first full pass
  over the right branch, `unknownMode` will switch to either
  `multipassMode` or `memoryMode`.
- `memoryMode`: computes the rest of the join in memory. The iterator
  can start in this mode if the user activated the in-memory join via
  session or environment variables, in which case it loads the whole
  right side into memory before doing any further iteration. If instead
  the iterator started in `unknownMode` and switched to this mode, it is
  guaranteed to have already loaded the whole right side. From that
  point on, both cases work in exactly the same way.
- `multipassMode`: the previous default mode. It iterates the right
  side of the join for each row in the left side. More expensive, but
  less memory consuming. The iterator cannot start in this mode; it can
  only be switched to it from `unknownMode` when the memory used by the
  gitbase server exceeds the maximum amount of memory, whether set by
  the user or left at the default.
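The mode transitions described above can be sketched roughly as follows. This is a simplified illustration, not the iterator in this PR: rows are plain `int`s joined on equality, `right` is a hypothetical callback that performs one pass over the right branch, and `maxRows` stands in for the memory budget (the real check is described in terms of memory used by the server, not a row count).

```go
package main

import "fmt"

type joinMode int

const (
	unknownMode joinMode = iota
	memoryMode
	multipassMode
)

// innerJoin is a simplified, hypothetical model of the three-mode join.
// Each call to right() is one full pass over the right branch.
// It returns the joined pairs and the mode the join ended in.
func innerJoin(left []int, right func() []int, maxRows int, forceMemory bool) ([][2]int, joinMode) {
	mode := unknownMode
	var cache []int
	if forceMemory {
		// The user asked for in-memory joins: load the whole right
		// side before iterating.
		mode = memoryMode
		cache = append(cache, right()...)
	}

	var out [][2]int
	for _, l := range left {
		switch mode {
		case memoryMode:
			for _, r := range cache {
				if l == r {
					out = append(out, [2]int{l, r})
				}
			}
		case multipassMode:
			// Re-iterate the right branch for every left row.
			for _, r := range right() {
				if l == r {
					out = append(out, [2]int{l, r})
				}
			}
		case unknownMode:
			// First full pass: emit matches while trying to cache rows.
			fits := true
			for _, r := range right() {
				if l == r {
					out = append(out, [2]int{l, r})
				}
				if fits {
					if len(cache) < maxRows {
						cache = append(cache, r)
					} else {
						fits = false
						cache = nil // over budget: drop what was cached
					}
				}
			}
			if fits {
				mode = memoryMode
			} else {
				mode = multipassMode
			}
		}
	}
	return out, mode
}

func main() {
	passes := 0
	right := func() []int { passes++; return []int{2, 3, 4} }

	out, mode := innerJoin([]int{1, 2, 3}, right, 10, false)
	fmt.Println(out, mode == memoryMode, passes)

	passes = 0
	out, mode = innerJoin([]int{1, 2, 3}, right, 2, false)
	fmt.Println(out, mode == multipassMode, passes)
}
```

In this sketch the decision is made once, at the end of the first pass over the right branch, so the join never starts in `multipassMode` and only reaches it from `unknownMode`.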
Signed-off-by: Miguel Molina [email protected]