Add retries to configureIndex and update operations #318


Merged: aulorbe merged 12 commits into main from add-remaining-retries on Dec 23, 2024

Conversation

@aulorbe (author) commented on Dec 19, 2024

Problem

We first shipped retries to upsert as a POC. Now that we are happy with that, we are expanding retries to configureIndex and update operations.

This PR includes updates to the retry logic itself, too: I found some errors when I applied it to configureIndex, yay!
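
The retry helper itself isn't shown in the excerpts below, so here is a minimal sketch of the pattern being extended to configureIndex and update. The function name, backoff, and retryable status codes are assumptions for illustration, not the client's exact implementation:

// Minimal sketch of the retry pattern (assumed details, not the client's exact code):
// retry an async operation when the server answers with a retryable status (500/503),
// capped at `maxRetries` additional attempts with a simple exponential backoff.
type AsyncOp<T> = () => Promise<T>;

async function withRetries<T>(op: AsyncOp<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await op();
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status !== 500 && status !== 503) throw err; // only retry server-side failures
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100));
    }
  }
  throw lastError;
}

// Hypothetical usage against the operations this PR covers:
// await withRetries(() => pinecone.configureIndex(indexName, { spec: { pod: { podType: 'p1.x2' } } }));
// await withRetries(() => index.update({ id: 'vec-1', values: [0.1, 0.2] }));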

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

Test Plan

CI passes.


@@ -198,6 +198,8 @@ export const mapHttpStatusError = (failedRequestInfo: FailedRequestInfo) => {
       return new PineconeInternalServerError(failedRequestInfo);
     case 501:
       return new PineconeNotImplementedError(failedRequestInfo);
+    case 503:

@aulorbe (author):
Whoops forgot to map this the 1st time around
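
The return for the new 503 arm is cut off in the excerpt above. A hedged sketch of how it presumably completes the mapping; the class and type names below are simplified stand-ins rather than the client's exact ones:

// Sketch only: map 503 to a dedicated "service unavailable" error so the retry
// logic can recognize it, mirroring the neighboring cases.
type FailedRequestInfo = { status: number; message?: string };

class PineconeInternalServerError extends Error {}
class PineconeNotImplementedError extends Error {}
class PineconeUnavailableError extends Error {} // assumed name for the 503 mapping

const mapServerError = (info: FailedRequestInfo): Error => {
  switch (info.status) {
    case 500:
      return new PineconeInternalServerError(info.message);
    case 501:
      return new PineconeNotImplementedError(info.message);
    case 503:
      return new PineconeUnavailableError(info.message); // the newly added case
    default:
      return new Error(`Unmapped HTTP status: ${info.status}`);
  }
};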

Comment on lines -78 to -97
// Scale up podType to x2
let state = true;
let retryCount = 0;
const maxRetries = 10;
while (state && retryCount < maxRetries) {
  try {
    await pinecone.configureIndex(podIndexName, {
      spec: { pod: { podType: 'p1.x2' } },
    });
    state = false;
  } catch (e) {
    if (e instanceof PineconeInternalServerError) {
      retryCount++;
      await sleep(2000);
    } else {
      console.log('Unexpected error:', e);
      throw e;
    }
  }
}

@aulorbe (author):
Don't need this now that we've got retries!
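
With retries built into the client, the scale-up step in this integration test presumably collapses to a single call (sketch based on the deleted code above):

// Sketch: transient 500/503 responses are now retried inside configureIndex itself,
// so the test just awaits the call. `pinecone` and `podIndexName` come from the
// surrounding test setup.
await pinecone.configureIndex(podIndexName, {
  spec: { pod: { podType: 'p1.x2' } },
});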

  expect(callCount).toBe(2);
});

test('Update operation should retry 1x if server responds 1x with error and 1x with success', async () => {

@aulorbe (author):
Figured duplicating this type of test across upsert and update would be enough to justify me not duplicating it again for configureIndex, but lmk if you disagree!


@aulorbe (author):
(We should really centralize this type of thing to avoid duplicating this logic, but I think this is okay for now.)


@austin-denoble (reviewer):
It seems useful to validate that the calls themselves trigger retries as we'd expect, is that what you're talking about centralizing?


@aulorbe (author) commented on Dec 20, 2024:
Mmm I'm not 100% sure we're on the same page -- when you say "the calls themselves," are you talking about the different async funcs that we could pass into the RetryWrapper?

Assuming you answer yes to the above, yes that's what I'd like to centralize.... something like we have a single parameterized test that confirms that < whatever async func > is retried n times
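
Something like the following, as a rough sketch (every helper and name here is hypothetical, not code from this PR):

// Hypothetical parameterized test: each retryable operation runs against a mock
// server that fails once (503) and then succeeds, and we assert exactly one retry.
// The declarations below stand in for helpers/clients defined in the real test file.
declare const pinecone: any;
declare const index: any;
declare const indexName: string;
declare const startMockServer: (shouldSucceedOnSecondCall: boolean) => void;
declare const callCount: number;

const retryableOps: Array<[string, () => Promise<unknown>]> = [
  ['upsert', () => index.upsert([{ id: 'vec-1', values: [0.1, 0.2] }])],
  ['update', () => index.update({ id: 'vec-1', values: [0.3, 0.4] })],
  ['configureIndex', () => pinecone.configureIndex(indexName, { spec: { pod: { podType: 'p1.x2' } } })],
];

test.each(retryableOps)(
  '%s retries 1x if the server responds 1x with error and 1x with success',
  async (_name, op) => {
    startMockServer(true); // 503 on the first call, 200 on the second
    await op();
    expect(callCount).toBe(2); // initial failure + one successful retry
  }
);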

@aulorbe marked this pull request as ready for review on December 19, 2024 at 01:14

@austin-denoble (reviewer) left a comment:
Overall LGTM, it's nice that this is a pretty easy replacement due to how you set things up.


 // Helper function to start the server with a specific response pattern
 const startMockServer = (shouldSucceedOnSecondCall: boolean) => {
   // Create http server
-  server = http.createServer((req, res) => {
+  server = http.createServer({ keepAlive: false }, (req, res) => {

@austin-denoble (reviewer):
Just curious, what happens when setting this to false?


@aulorbe (author) commented on Dec 20, 2024:
It simply ensures any outstanding http connections close once whatever you're doing on the spun-up server concludes. It defaults to true in Node20+, which is a new development and something I thought might be the cause of our failing tests. Unfortunately, setting it to false (so that it stays false across Node 18 and 20) didn't fix the problem.
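
For reference, a self-contained sketch of the kind of mock server these integration tests spin up, with the keepAlive option in place (simplified; the real helper and its response bodies live in the test file):

import http from 'http';

// Sketch of the mock-server helper: the first request gets a 503, and when
// `shouldSucceedOnSecondCall` is true the second request gets a 200.
// `keepAlive: false` asks the server not to hold sockets open after responding.
let server: http.Server;
let callCount = 0;

const startMockServer = (shouldSucceedOnSecondCall: boolean) => {
  server = http.createServer({ keepAlive: false }, (req, res) => {
    callCount += 1;
    if (shouldSucceedOnSecondCall && callCount >= 2) {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok' }));
    } else {
      res.writeHead(503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'Service Unavailable' }));
    }
  });
  server.listen(4000); // arbitrary local port for the sketch
};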

if (error?.status) {
  return mapHttpStatusError(error);
}
return error; // Return original error if no mapping is needed

@austin-denoble (reviewer):
Do we know what errors we're seeing that end up without an associated error.status?


@aulorbe (author):
I can't quite remember off the top of my head and definitely need to look into this at some point, but basically: the BasePineconeError class sometimes shows a status field in the JSON object printed to the console (if you print the error), yet it comes back as undefined when you access error.status.


@aulorbe (author) commented on Dec 23, 2024:

Re: CI/CD failures in Node20+:

I did a lot of research into why the http server we spin up in our integration tests fails in Node20+ but passes in Node18. I honestly had to put a time-cap on this research because I kept finding more, but the TL;DR is:

  • In Node20+, the native fetch API stabilized. With that stabilization came a major upgrade to Node's internal undici dependency... from 5.x to 7.x in this PR.
    • undici change log here for v.6+, where they call out that they changed how the lib handles http errors
  • Within this upgrade were changes to how undici handles http connections. You can see more here.

Basically, I had to add a sleep to our final integration test because I was getting the following error when I was debugging in Node20+, which indicates an aborted TCP connection:

TypeError: fetch failed
        at node:internal/deps/undici/undici:13185:13 {
      [cause]: Error: read ECONNRESET
          at TCP.onStreamRead (node:internal/stream_base_commons:216:20) {
        errno: -54,
        code: 'ECONNRESET',
        syscall: 'read'
      }
    }

I tried a bunch of different ways to add a sleep, including adding timeouts to server.listen and http.createServer, as well as sleeps to the final afterEach method, but none of them fixed it. The only fix that resolved the test failure in Node20+ was adding the sleep to the 'Max retries exceeded w/o resolve' test itself.
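
The shape of that fix, as a sketch (the helper and the exact delay are assumptions; only the test name comes from this PR):

// Sketch: a short sleep at the end of the flaky test gives the aborted TCP
// connections time to close before afterEach tears the mock server down in Node20+.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

test('Max retries exceeded w/o resolve', async () => {
  // ... exercise the retry wrapper against a server that keeps returning 503 ...
  await sleep(500); // assumed delay; avoids undici's `TypeError: fetch failed` / ECONNRESET
});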

@aulorbe merged commit f71419d into main on Dec 23, 2024. 32 checks passed.
@aulorbe deleted the add-remaining-retries branch on December 23, 2024 at 19:54.