
fix(dgraph): Fix out of order issues with split keys in bulk loader.#6083

Merged
martinmr merged 16 commits into master from martinmr/bulk-split
Aug 3, 2020

Conversation

@martinmr
Contributor

@martinmr martinmr commented Jul 28, 2020

Split keys for indexes can cause out-of-order issues due to the variable length of the term inside the key.
To fix this, the split keys are first written to a temporary DB (using a WriteBatch to avoid the out-of-order
issues) and then copied to the main p directory.

Related to DGRAPH-1897



@github-actions github-actions Bot added the area/bulk-loader Issues related to bulk loading. label Jul 28, 2020
@martinmr martinmr changed the title fix(dgraph): batch and write split keys in different streams during bulk load. fix(dgraph): Fix out of order issues with split keys in bulk loader. Jul 30, 2020
@martinmr martinmr requested review from a team, ashish-goswami, parasssh and poonai July 30, 2020 18:12
@martinmr martinmr marked this pull request as ready for review July 30, 2020 18:13
Comment thread dgraph/cmd/bulk/reduce.go Outdated
func (r *reducer) createTmpBadger() *badger.DB {
tmpDir, err := ioutil.TempDir(r.opt.TmpDir, "split")
x.Check(err)
db := r.createBadgerInternal(tmpDir)
Contributor


`createBadgerInternal` creates a badger instance with compression enabled. I don't think we need compression for the temporary DB.

Contributor

@parasssh parasssh left a comment


:lgtm: but get Balaji to review as well.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @ashish-goswami, @balajijinnah, @manishrjain, @martinmr, @parasssh, and @vvbalaji-dgraph)

Contributor Author

@martinmr martinmr left a comment


Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @ashish-goswami, @balajijinnah, @jarifibrahim, @manishrjain, and @vvbalaji-dgraph)


dgraph/cmd/bulk/reduce.go, line 151 at r1 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

`createBadgerInternal` creates a badger instance with compression enabled. I don't think we need compression for the temporary DB.

Done.

Contributor

@manishrjain manishrjain left a comment


Didn't check the logic carefully, but looks alright overall.

Reviewed 2 of 3 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, @jarifibrahim, @martinmr, and @vvbalaji-dgraph)


dgraph/cmd/bulk/reduce.go, line 86 at r2 (raw file):

			splitWriter := tmpDb.NewManagedWriteBatch()

			ci := &countIndexer{reducer: r, writer: writer, splitWriter: splitWriter, tmpDb: tmpDb}

move this to multiple lines.


dgraph/cmd/bulk/reduce.go, line 428 at r2 (raw file):

		// value log from growing over the allowed limit.
		if splitBatchLen >= maxSplitBatchLen {
			x.Check(writer.Flush())

Do you need to explicitly call flush? I don't understand this -- so maybe think carefully here.

Contributor Author

@martinmr martinmr left a comment


Reviewable status: 2 of 3 files reviewed, 3 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, @jarifibrahim, @manishrjain, and @vvbalaji-dgraph)


dgraph/cmd/bulk/reduce.go, line 86 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

move this to multiple lines.

Done.


dgraph/cmd/bulk/reduce.go, line 428 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Do you need to explicitly call flush? I don't understand this -- so maybe think carefully here.

I cannot use a single WriteBatch or the value log will grow too big, so I am using multiple ones instead. According to the badger comments, Flush needs to be called to ensure that any pending writes are written to badger. Since the next line creates a new writer, the current writer should be flushed here.

@martinmr martinmr merged commit 2a3b85c into master Aug 3, 2020
@martinmr martinmr deleted the martinmr/bulk-split branch August 3, 2020 16:13
martinmr added a commit that referenced this pull request Aug 3, 2020
…6083)

Split keys for indexes can cause out-of-order issues due to the variable length of the term inside the key.
To fix the issue, the split keys are written to a temporary DB first (using the writebatch to avoid the out
of order issues) and then copied to the main p directory.

Related to DGRAPH-1897

(cherry picked from commit 2a3b85c)
martinmr added a commit that referenced this pull request Aug 3, 2020
(cherry picked from commit 2a3b85c)

Labels

area/bulk-loader Issues related to bulk loading.

4 participants