Discussion about eureka.client.transport.retryable-client-quarantine-refresh-percentage property #4144

@kworkbee

This issue raises some questions and a suggestion regarding the Eureka client transport configuration.

Background

I'm working on making our Eureka Server instances highly available by configuring separate zones: one in our in-house bare-metal environment and one in an AWS EKS cluster.

There are 2 Eureka Server instances on bare metal and 3 Eureka Server instances in the EKS cluster. They are all grouped into the same region and configured as peers, so that client instance information is replicated between them.

During a chaos test that simulated one zone being unavailable, I found that a Eureka Client instance located in the inaccessible zone attempts to register only with the server instances of that inaccessible zone.

When the preferred zone becomes inaccessible, candidateHosts (returned by RetryableEurekaHttpClient.getHostCandidates()) is determined by two things: clusterEndpoints (returned from AsyncResolver, and affected by the eureka.client.availability-zones property that I explicitly set in application.yml) and the condition (quarantineSet.size() >= threshold).
If this condition is met, quarantineSet is cleared and candidateHosts becomes the entire peer list from the yml properties again. As a result, registration is retried over and over against the peers in the temporarily inaccessible zone and never reaches the secondary zone.

The details may vary with the number of server instances configured per zone, but to get the behavior I wanted, I had to change eureka.client.transport.retryable-client-quarantine-refresh-percentage so that the condition (quarantineSet.size() >= threshold) is no longer met.

https://github.com/Netflix/eureka/blob/50db5d24a7c45375c0b32fffa238256276e6432e/eureka-client/src/main/java/com/netflix/discovery/shared/transport/decorator/RetryableEurekaHttpClient.java#L173-L175

private List<EurekaEndpoint> getHostCandidates() {
    List<EurekaEndpoint> candidateHosts = clusterResolver.getClusterEndpoints();
    quarantineSet.retainAll(candidateHosts);

    // If enough hosts are bad, we have no choice but start over again
    int threshold = (int) (candidateHosts.size() * transportConfig.getRetryableClientQuarantineRefreshPercentage());
    //Prevent threshold is too large
    if (threshold > candidateHosts.size()) {
        threshold = candidateHosts.size();
    }
    if (quarantineSet.isEmpty()) {
        // no-op

    /* ---------------------- */
    } else if (quarantineSet.size() >= threshold) {
        logger.debug("Clearing quarantined list of size {}", quarantineSet.size());
        quarantineSet.clear();

    /* ---------------------- */
    } else {
        List<EurekaEndpoint> remainingHosts = new ArrayList<>(candidateHosts.size());
        for (EurekaEndpoint endpoint : candidateHosts) {
            if (!quarantineSet.contains(endpoint)) {
                remainingHosts.add(endpoint);
            }
        }
        candidateHosts = remainingHosts;
    }

    return candidateHosts;
}
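
To make the effect of the highlighted branch easier to see, here is a minimal standalone sketch of the same selection logic. It is not the library code: endpoints are plain strings instead of EurekaEndpoint, the class name QuarantineSketch and the host names b1..b3 / a1, a2 are hypothetical, and 0.66 is what I understand to be the default refresh percentage (an assumption that should be checked against the Eureka source).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the quarantine logic in getHostCandidates().
class QuarantineSketch {

    // Same arithmetic as the original: the product is truncated to int,
    // then capped at the number of candidates.
    static int threshold(int candidateCount, double refreshPercentage) {
        int threshold = (int) (candidateCount * refreshPercentage);
        return Math.min(threshold, candidateCount);
    }

    // Returns the hosts that would be tried next; may clear quarantineSet.
    static List<String> hostCandidates(List<String> clusterEndpoints,
                                       Set<String> quarantineSet,
                                       double refreshPercentage) {
        quarantineSet.retainAll(clusterEndpoints);
        int threshold = threshold(clusterEndpoints.size(), refreshPercentage);

        if (quarantineSet.isEmpty()) {
            return clusterEndpoints;
        }
        if (quarantineSet.size() >= threshold) {
            // Enough hosts are quarantined: forget them and start over.
            quarantineSet.clear();
            return clusterEndpoints;
        }
        // Otherwise skip the quarantined hosts.
        List<String> remaining = new ArrayList<>(clusterEndpoints.size());
        for (String endpoint : clusterEndpoints) {
            if (!quarantineSet.contains(endpoint)) {
                remaining.add(endpoint);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Client lives in Zone B, so Zone B servers come first in the list.
        List<String> endpoints = List.of("b1", "b2", "b3", "a1", "a2");
        for (double percentage : new double[] {0.66, 0.8}) {
            // All three Zone B servers have failed and are quarantined.
            Set<String> quarantine = new HashSet<>(List.of("b1", "b2", "b3"));
            System.out.println(percentage + " -> "
                    + hostCandidates(endpoints, quarantine, percentage));
        }
    }
}
```

With 0.66 the threshold is 3, the quarantine set is cleared, and the full list (starting again with the Zone B hosts) is returned. With 0.8 the threshold is 4, so only a1 and a2 remain as candidates.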

Question

While working on this, I looked for documentation on transportConfig, but I couldn't find any.

If such documentation exists and I simply missed it, could you share the link?

Additionally, I wonder whether the config property (eureka.client.transport.retryable-client-quarantine-refresh-percentage) was introduced with Eureka Server HA in mind. If not, what is the purpose of this property and how is it intended to be used?

Suggestion

I would like to take this opportunity to contribute documentation based on what I found, and I would appreciate it if you could tell me the procedure for doing so.


Appendix

Regarding the threshold calculation

Suppose two server instances are configured in Zone A and three server instances in Zone B.

For a Eureka Client in Zone B to attempt registration with the Zone A server peers when it cannot reach the server peers in its own zone, the threshold, (int) (candidateHosts.size() * retryable-client-quarantine-refresh-percentage) with candidateHosts.size() = 5, must be greater than the number of temporarily inaccessible Zone B server instances (3).

Because the product is truncated to an integer, the threshold must be at least 4, so the refresh percentage must be at least 0.8.
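
The arithmetic can be checked with a small sketch that applies the same truncating threshold formula to a few percentages. The class name ThresholdTable, the method name failsOver, and the sample percentages are hypothetical and chosen only for illustration; the 5-server / 3-unreachable layout is the one described above.

```java
// Hypothetical helper for checking which percentages let the client fail over.
class ThresholdTable {

    // True when the quarantined hosts stay quarantined, i.e. the clearing
    // condition (quarantined >= threshold) is NOT met.
    static boolean failsOver(int totalServers, int unreachable, double percentage) {
        int threshold = Math.min((int) (totalServers * percentage), totalServers);
        return unreachable < threshold;
    }

    public static void main(String[] args) {
        int totalServers = 5; // 2 in Zone A + 3 in Zone B
        int unreachable = 3;  // all of Zone B
        for (double percentage : new double[] {0.5, 0.66, 0.79, 0.8, 1.0}) {
            int threshold = Math.min((int) (totalServers * percentage), totalServers);
            System.out.println("percentage=" + percentage
                    + " threshold=" + threshold
                    + " failsOver=" + failsOver(totalServers, unreachable, percentage));
        }
    }
}
```

For this layout, only 0.8 and 1.0 produce a threshold above 3. Note that 0.79 still truncates to 3, which is why the bound is 0.8 rather than anything strictly greater than 0.6.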
