Raise error in #scroll_batches when search backend returns a failure by tomdev · Pull Request #916 · toptal/chewy

tomdev · 2023-12-15T13:30:09Z

We are running into a bug in production when performing chewy:sync.

Our search backend is intermittently returning a 200 response without hits and containing a backend failure, see example response:

{
  "_scroll_id": "<scroll_id>",
  "took": 1,
  "timed_out": false,
  "terminated_early": false,
  "_shards": {
    "total": 5,
    "successful": 2,
    "skipped": 0,
    "failed": 3,
    "failures": [
      {
        "shard": -1,
        "index": null,
        "reason": {
          "type": "search_context_missing_exception",
          "reason": "No search context found for id [34462229]"
        }
      },
      {
        "shard": -1,
        "index": null,
        "reason": {
          "type": "search_context_missing_exception",
          "reason": "No search context found for id [34462228]"
        }
      },
      {
        "shard": -1,
        "index": null,
        "reason": {
          "type": "search_context_missing_exception",
          "reason": "No search context found for id [34888662]"
        }
      }
    ]
  },
  "hits": {
    "total": {
      "value": 720402,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": []
  }
}

scroll_batches currently does not take these failures into account. Because no hits are returned, the fetched >= total condition is never satisfied, so the loop never breaks.
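To make the failure mode concrete, here is a minimal, self-contained sketch of the loop shape described above. It is not Chewy's actual implementation: `scroll_until_done`, the `pages` argument, and the `max_requests` cap are hypothetical names, and the cap exists only so the demonstration terminates.

```ruby
# Simplified model of a scroll loop whose only exit is `fetched >= total`.
# NOTE: hypothetical sketch, not Chewy's real code. `pages` stands in for
# successive scroll responses; `max_requests` is a demo-only safety cap.
def scroll_until_done(pages, total:, max_requests: 10)
  fetched = 0
  requests = 0

  loop do
    hits = pages.fetch(requests, []) # a failing backend keeps yielding no hits
    requests += 1
    fetched += hits.size

    break if fetched >= total         # the exit condition described in the PR
    break if requests >= max_requests # demo-only guard; without it this spins forever
  end

  { fetched: fetched, requests: requests }
end

healthy = scroll_until_done([[1, 2], [3, 4]], total: 4)
broken  = scroll_until_done([[1, 2], []], total: 4)
```

In the `broken` case the second page is empty, `fetched` stays at 2, and only the artificial cap stops the loop.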

Because of this we've experienced chewy:sync running for days instead of an hour. (Yes, we now have proper monitoring in place...)

This PR will raise a Chewy::Error when the search backend is returning failures.

Before submitting the PR make sure the following are checked:

The PR relates to only one subject with a clear title and description in grammatically correct, complete sentences.
Wrote good commit messages.
Commit message starts with [Fix #issue-number] (if the related issue exists).
Feature branch is up-to-date with master (if not - rebase it).
Squashed related commits together.
Added tests.
Added an entry to the changelog if the new code introduces user-observable changes. See changelog entry format for details.

konalegi

Looks good to me; small changes are required to the format of the error.

konalegi · 2023-12-15T14:20:36Z

lib/chewy/search/scrolling.rb


        loop do
+          failures = result.dig('_shards', 'failures')
+          raise Chewy::Error, failures if failures.present?


I think Chewy::Error is not the best fit here. From your example, failures is an array of hashes, while Chewy::Error is simply a StandardError that takes a string argument. Could you add a specific error class and convert the failures into a meaningful string? For instance:

chewy/lib/chewy/errors.rb

Lines 20 to 32 in 10ff43b

    class ImportFailed < Error
      def initialize(type, import_errors)
        message = "Import failed for `#{type}` with:\n"
        import_errors.each do |action, action_errors|
          message << " #{action.to_s.humanize} errors:\n"
          action_errors.each do |error, documents|
            message << " `#{error}`\n"
            message << " on #{documents.count} documents: #{documents}\n"
          end
        end
        super message
      end
    end
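Applied to the scroll case, the same pattern might look like the sketch below. This is only an illustration of the suggestion: `ScrollSketch::ShardFailuresError` is a hypothetical name, not a class in Chewy, and it assumes failures arrive as an array of hashes shaped like the `_shards.failures` entries in the example response.

```ruby
# Hypothetical sketch: render shard failures as a readable error message.
# `ScrollSketch` and `ShardFailuresError` are illustrative names only.
module ScrollSketch
  class Error < StandardError; end

  class ShardFailuresError < Error
    def initialize(failures)
      message = +"Scroll request returned #{failures.size} shard failure(s):\n"
      failures.each do |failure|
        reason = failure.fetch('reason', {})
        message << "  `#{reason['type']}`: #{reason['reason']} (shard #{failure['shard']})\n"
      end
      super(message)
    end
  end
end

failures = [
  { 'shard' => -1, 'index' => nil,
    'reason' => { 'type' => 'search_context_missing_exception',
                  'reason' => 'No search context found for id [34462229]' } }
]

error = ScrollSketch::ShardFailuresError.new(failures)
```

Building the string inside the error class keeps the raise site short and gives the exception a message that is readable in logs.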

konalegi · 2023-12-18T09:00:33Z

README.md

+## Running specs
+
+Make sure you're running a local Elasticsearch instance.
+
+```
+ES_PORT=9200 bundle exec rspec
+```
+


Btw, could we remove this? I'm planning to backport the docker-compose setup with ES, so it runs on the proper port: https://github.com/toptal/chewy/pull/917/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R1-R16

Of course, done!

konalegi · 2023-12-18T12:56:21Z

spec/chewy/search/scrolling_spec.rb

    let(:countries) { Array.new(3) { |i| Country.create!(rating: i + 2, name: "country #{i}") } }

    describe '#scroll_batches' do
+      describe 'with search backend returning failures' do


Btw, do you think it would be possible to provide an integration spec, one where you really call ES and get a real answer?

Do you have a suggestion for how I can set up an integration spec that returns this error state? Elasticsearch would have to return a total count higher than zero and then return zero hits in order to hit this code path.

barthez · 2023-12-18T13:49:31Z

Hey @tomdev
Thanks for the PR. I tried to reproduce it by setting a low scroll time and adding extra wait time between scroll calls. Whenever the mentioned error (search_context_missing_exception) occurs, the ES transport gem throws an exception for me:

[404] {"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [24881178]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [24881178]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [24881178]"}},"status":404} (Elasticsearch::Transport::Transport::Errors::NotFound)

Could you please share your chewy version, Elasticsearch server version, and elasticsearch gem version? That could be helpful.

The issue could also be solved if you could set the scroll pointer expiration (which defaults to 1 minute) for pluck calls (which chewy:sync uses).

tomdev · 2023-12-18T15:26:21Z

Hey Barthez, thanks for reproducing the issue.

Interesting to see the response is actually a 404 response. In our setup we recently migrated from Elasticsearch 7.10 to OpenSearch 1.3; this appears to return a 200, even though the "search_context_missing_exception" is being hit.

We're unable to continue using Elasticsearch as we're using the AWS OpenSearch service.

We're running:

gem chewy (7.3.4)
OpenSearch 1.3
gem elasticsearch (7.13.3)

We haven't identified any other issues (thus far) using chewy against OpenSearch.

An attempt to increase the scroll pointer expiration did not succeed: we set it to 10m, but the search_context_missing_exception still occurred. In our case it does not seem to be caused by an expiring scroll window; it appears to be related to an internal OpenSearch error that yields the same exception.

Do you think OpenSearch returning a 200 with failures should be handled in chewy, or would that be in a dependency like ES transport? You are probably more familiar than I am on where this should be handled.

barthez · 2023-12-19T10:41:58Z

Thanks @tomdev

Chewy does not aim to support OpenSearch. This sounds like an altered behavior of OpenSearch vs Elasticsearch.

I would suggest adding a custom exception, something like MissingHitsInScrollError, and raising it when we receive no hits when we expect some. I wouldn't parse the response as this is too platform-dependent. Can you do that?
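A minimal sketch of that idea follows, assuming responses are hashes shaped like the example payload. The error name comes from the suggestion above; `each_scroll_batch` and its arguments are hypothetical and do not reflect how Chewy's scrolling is actually structured.

```ruby
# Hypothetical sketch of "raise when hits are expected but none arrive".
# Only the hit count is inspected, so no backend-specific failure parsing is needed.
class MissingHitsInScrollError < StandardError
  def initialize(expected:, fetched:)
    super("Expected #{expected} hits in total, but the backend returned an empty page after #{fetched}")
  end
end

# `responses` is any Enumerable of response hashes; each batch of hits is yielded.
def each_scroll_batch(responses, total:)
  fetched = 0
  responses.each do |response|
    hits = response.dig('hits', 'hits') || []
    if hits.empty? && fetched < total
      raise MissingHitsInScrollError.new(expected: total, fetched: fetched)
    end

    fetched += hits.size
    yield hits
    break if fetched >= total
  end
end
```

Because the check looks only at `hits.hits` and the running count, it behaves the same whether the backend signals the problem with a 404 or with a 200 that carries shard failures.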

tomdev · 2024-01-09T09:25:23Z

I implemented the MissingHitsInScrollError, which is raised when no hits are returned although they are expected. This required me to change the looping behaviour when scrolling: instead of scrolling indefinitely, we now precalculate how many batched requests we should perform.

This slightly alters how the data is retrieved, and even though the test suite succeeds, I wanted to double check whether anyone knows why the previous approach was chosen (loop until fetched >= total hits). Could I be missing an edge case here that was covered by the previous implementation? The specs don't indicate that.
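For illustration, precalculating the request count can be as simple as a ceiling division. This is a sketch under assumptions: the thread does not show how the PR computes it, and `expected_scroll_requests` is a hypothetical helper name.

```ruby
# Hypothetical sketch: derive an upper bound on batched requests up front.
# Assumes every request, including the initial search, returns up to `batch_size` hits.
def expected_scroll_requests(total:, batch_size:)
  raise ArgumentError, 'batch_size must be positive' unless batch_size.positive?

  (total + batch_size - 1) / batch_size # integer ceiling of total / batch_size
end

expected_scroll_requests(total: 720_402, batch_size: 1_000) # => 721
```

A fixed bound guarantees the loop terminates no matter what the backend returns, which is the property the open-ended `fetched >= total` loop lacked.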

tomdev · 2024-01-23T14:47:32Z

We've been running this PR in production for ~2 weeks and have seen the failure (where we'd previously got stuck in an infinite loop) now successfully raising the MissingHitsInScrollError.

barthez · 2024-02-22T13:37:11Z

Thank you @tomdev. Sorry, I lost track of this PR. Could you please rebase and fix the conflict? I will try to merge & release it as soon as possible.

konalegi · 2024-10-08T13:23:53Z

@tomdev is this PR still relevant? Sorry for the delay, I've been extremely busy.

tomdev · 2024-10-08T14:06:22Z

> @tomdev is this PR still relevant? Sorry for the delay, I've been extremely busy.

No worries, same here! Yes; this is still relevant to us, and we've been running from this PR in production for months. It's working well for us.

I think this PR needs some rebasing and conflict fixes; I can take a look at that.

konalegi · 2024-10-08T15:35:34Z

@tomdev Sure, thank you! Keep in mind that we have moved to ES 8.x, so some extra adjustments might be needed.

bbatsov · 2026-02-25T13:16:46Z

Master has been updated with CI fixes and compatibility changes (#998) — we now target Ruby 3.2+ and Rails 7.2+. Could you rebase this PR on top of master so CI can run properly? Thanks!

bbatsov · 2026-02-25T20:05:44Z

This fixes a real production issue — shard failures during scroll causing infinite loops. The approach looks sound.

Could you:

Rebase onto current master
Add a changelog entry

Happy to merge once that's done.

AlfonsoUceda

Will take care of rubocop offenses and changelog entry

* CHANGELOG entries for #916 and #1008 are added
  Adds the #scroll_batches error-raising entry (#916), moves the GitHub Actions service entry (#1008) into chronological order, and registers @tomdev as a contributor.
* Layout/TrailingWhitespace offense is corrected

tomdev added 2 commits December 15, 2023 14:14

fix: Raise error on #scroll_batches when search backend is returning …

040d857

…failures

doc: Add documentation on how to run specs

fc46a0e

konalegi reviewed Dec 15, 2023

konalegi reviewed Dec 18, 2023

tomdev added 2 commits January 9, 2024 10:17

Fail when expected hits are not returned

3aa2bc1

Remove docs on running spec per @konalegi request

467974f

barthez assigned barthez and konalegi Feb 22, 2024

barthez requested a review from konalegi February 22, 2024 13:37

Merge branch 'master' into tomdev/raise-error-on-search-backend-failure

1ec4064

AlfonsoUceda requested a review from a team as a code owner March 25, 2026 20:57

AlfonsoUceda removed the request for review from konalegi March 25, 2026 20:58

AlfonsoUceda approved these changes Mar 26, 2026

AlfonsoUceda merged commit 70de104 into toptal:master Mar 26, 2026
12 of 13 checks passed

AlfonsoUceda mentioned this pull request Mar 26, 2026

[PF] Rubocop offenses are corrected, CHANGELOG entries are added #1017

Merged
