Discussion about the `eureka.client.transport.retryable-client-quarantine-refresh-percentage` property #4144

*This issue describes my questions and a suggestion regarding Eureka client configuration.*

## Background
I'm working on making our Eureka Server instances highly available by configuring separate zones for an in-house bare-metal environment and an AWS EKS cluster.

There are 2 Eureka Server instances on the bare-metal side and 3 Eureka Server instances in the EKS cluster. All of them are grouped into the same region and configured as peers, so that they replicate client instance information to each other.

While running a chaos test that assumes a specific zone is out of service, I found that a Eureka Client instance located in the inaccessible zone attempts the `register` event only against the server instances of that same inaccessible zone.

When the preferred zone becomes inaccessible, `candidateHosts` (returned by `RetryableEurekaHttpClient.getHostCandidates()`) is determined by two things: `clusterEndpoints` (returned from `AsyncResolver`, which is affected by the `eureka.client.availability-zones` property I explicitly set in application.yml) and the condition `quarantineSet.size() >= threshold`.
If this condition is met, `quarantineSet` is cleared and `candidateHosts` becomes the entire peer list from the YAML properties. As a result, the register event never reaches the secondary zone, because it keeps being retried against the peers in the temporarily inaccessible zone.

This may vary with the number of server instances configured per zone, but to get the behavior I wanted, I had to change `eureka.client.transport.retryable-client-quarantine-refresh-percentage` so that the condition `quarantineSet.size() >= threshold` is not met.

https://github.com/Netflix/eureka/blob/50db5d24a7c45375c0b32fffa238256276e6432e/eureka-client/src/main/java/com/netflix/discovery/shared/transport/decorator/RetryableEurekaHttpClient.java#L173-L175

```java
private List<EurekaEndpoint> getHostCandidates() {
    List<EurekaEndpoint> candidateHosts = clusterResolver.getClusterEndpoints();
    quarantineSet.retainAll(candidateHosts);

    // If enough hosts are bad, we have no choice but start over again
    int threshold = (int) (candidateHosts.size() * transportConfig.getRetryableClientQuarantineRefreshPercentage());
    //Prevent threshold is too large
    if (threshold > candidateHosts.size()) {
        threshold = candidateHosts.size();
    }
    if (quarantineSet.isEmpty()) {
        // no-op

    /* ---------------------- */
    } else if (quarantineSet.size() >= threshold) {
        logger.debug("Clearing quarantined list of size {}", quarantineSet.size());
        quarantineSet.clear();

    /* ---------------------- */
    } else {
        List<EurekaEndpoint> remainingHosts = new ArrayList<>(candidateHosts.size());
        for (EurekaEndpoint endpoint : candidateHosts) {
            if (!quarantineSet.contains(endpoint)) {
                remainingHosts.add(endpoint);
            }
        }
        candidateHosts = remainingHosts;
    }

    return candidateHosts;
}
```
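
To make the effect of the highlighted branch easier to see, here is a minimal standalone sketch that mirrors the branch structure of `getHostCandidates()` using plain strings instead of `EurekaEndpoint`. The class name `QuarantineSketch`, the method `hostCandidates`, and the host names `b1`..`a2` are hypothetical and exist only for illustration. The value `0.66` is assumed here to be the default refresh percentage; please verify it against the Eureka version in use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: not the Eureka implementation, just the same branch logic
// applied to plain host names. All names in this class are hypothetical.
final class QuarantineSketch {

    static List<String> hostCandidates(List<String> clusterEndpoints,
                                       Set<String> quarantineSet,
                                       double refreshPercentage) {
        List<String> candidateHosts = new ArrayList<>(clusterEndpoints);
        quarantineSet.retainAll(candidateHosts);

        // Same truncating cast and upper bound as in the quoted method.
        int threshold = (int) (candidateHosts.size() * refreshPercentage);
        if (threshold > candidateHosts.size()) {
            threshold = candidateHosts.size();
        }

        if (quarantineSet.isEmpty()) {
            return candidateHosts;
        }
        if (quarantineSet.size() >= threshold) {
            // Enough hosts are quarantined: forget them and start over with the full list.
            quarantineSet.clear();
            return candidateHosts;
        }
        // Otherwise skip the quarantined hosts.
        candidateHosts.removeAll(quarantineSet);
        return candidateHosts;
    }

    public static void main(String[] args) {
        // Zone B (the client's own, unreachable zone) is listed first, then Zone A.
        List<String> endpoints = List.of("b1", "b2", "b3", "a1", "a2");

        // Assumed default of 0.66: threshold = (int) 3.3 = 3, so the quarantine is cleared.
        Set<String> quarantined = new HashSet<>(List.of("b1", "b2", "b3"));
        System.out.println(hostCandidates(endpoints, quarantined, 0.66)); // [b1, b2, b3, a1, a2]

        // 0.8: threshold = 4, so the three Zone B hosts stay quarantined.
        quarantined = new HashSet<>(List.of("b1", "b2", "b3"));
        System.out.println(hostCandidates(endpoints, quarantined, 0.8));  // [a1, a2]
    }
}
```

With three of five hosts quarantined, the first call returns the full list again (Zone B first), while the second returns only the Zone A hosts.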

---

## Question
While working on this, I looked for documentation on `transportConfig` but couldn't find any.

If such documentation exists and I simply missed it, could you share the link?

Additionally, I wonder whether the config property (`eureka.client.transport.retryable-client-quarantine-refresh-percentage`) was introduced with the HA of Eureka Server in mind. If not, what is this property intended for, and how is it meant to be used?

## Suggestion

I would like to take this opportunity to contribute documentation based on what I worked through here. I would appreciate it if you could point me to the relevant contribution procedure.

---

## Appendix

Regarding the threshold calculation:

Suppose two server instances are configured in Zone A and three server instances in Zone B.

For a Eureka Client in Zone B to attempt the register event against the Zone A peers when it cannot reach the peers in its own zone, the threshold, that is `candidateHosts.size()` (5) * `retryable-client-quarantine-refresh-percentage` truncated to an integer, must be greater than the number of temporarily inaccessible Zone B server instances (3).

The threshold must therefore be at least 4, which means the refresh percentage must be at least `0.8`.
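
The arithmetic above can be checked with a small sketch. It reuses only the threshold formula from the quoted method. The class `QuarantineThreshold` and its methods are hypothetical helpers, not part of Eureka, and the list of percentages is just a set of sample values (with `0.66` assumed to be the default).

```java
// Illustration only: hypothetical helper reproducing the threshold arithmetic.
final class QuarantineThreshold {

    // Same computation as in getHostCandidates(): truncate, then cap at the host count.
    static int threshold(int totalHosts, double refreshPercentage) {
        int threshold = (int) (totalHosts * refreshPercentage);
        return Math.min(threshold, totalHosts);
    }

    // True when the quarantined hosts are skipped instead of the quarantine being cleared.
    static boolean keepsQuarantine(int totalHosts, int unreachableHosts, double refreshPercentage) {
        return unreachableHosts > 0
                && unreachableHosts < threshold(totalHosts, refreshPercentage);
    }

    public static void main(String[] args) {
        int totalHosts = 5;       // 2 in Zone A + 3 in Zone B
        int unreachableHosts = 3; // all of Zone B

        double[] samples = {0.66, 0.7, 0.79, 0.8, 1.0};
        for (double p : samples) {
            System.out.println("percentage=" + p
                    + " threshold=" + threshold(totalHosts, p)
                    + " keepsQuarantine=" + keepsQuarantine(totalHosts, unreachableHosts, p));
        }
    }
}
```

For the 2 + 3 layout, the sampled values below `0.8` all truncate to a threshold of 3, so the quarantine is cleared. From `0.8` upward the threshold is at least 4 and the Zone B hosts remain quarantined.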
