Description
This issue describes my questions and a suggestion regarding the Eureka configuration.
Background
I'm working on making our Eureka Server instances highly available by configuring separate zones for an in-house bare-metal server and an AWS EKS cluster.
There are 2 Eureka Server instances on the bare-metal server and 3 Eureka Server instances in the EKS cluster. All of them are grouped into the same region and configured as peers, so that they replicate client instance information to each other.
While running a chaos test that simulates one zone being out of service, I found that a Eureka Client instance located in the inaccessible zone keeps sending its register event only to the server instances of that inaccessible zone.
When the preferred zone becomes inaccessible, candidateHosts (returned by RetryableEurekaHttpClient.getHostCandidates()) is determined by two things: clusterEndpoints (returned from AsyncResolver, and affected by the eureka.client.availability-zones property that I explicitly set in application.yml) and the condition quarantineSet.size() >= threshold.
If this condition is met, quarantineSet is cleared and candidateHosts becomes the entire peer list from the YAML properties. As a result, the register event is never sent to the secondary zone, because it keeps being retried against the peers in the temporarily inaccessible zone.
The exact behavior may vary with the number of server instances configured per zone, but to get the failover I intended, I had to change eureka.client.transport.retryable-client-quarantine-refresh-percentage so that threshold is large enough that the condition quarantineSet.size() >= threshold is not met.
    private List<EurekaEndpoint> getHostCandidates() {
        List<EurekaEndpoint> candidateHosts = clusterResolver.getClusterEndpoints();
        quarantineSet.retainAll(candidateHosts);

        // If enough hosts are bad, we have no choice but start over again
        int threshold = (int) (candidateHosts.size() * transportConfig.getRetryableClientQuarantineRefreshPercentage());
        // Prevent threshold is too large
        if (threshold > candidateHosts.size()) {
            threshold = candidateHosts.size();
        }
        if (quarantineSet.isEmpty()) {
            // no-op
        /* ---------------------- */
        } else if (quarantineSet.size() >= threshold) {
            logger.debug("Clearing quarantined list of size {}", quarantineSet.size());
            quarantineSet.clear();
        /* ---------------------- */
        } else {
            List<EurekaEndpoint> remainingHosts = new ArrayList<>(candidateHosts.size());
            for (EurekaEndpoint endpoint : candidateHosts) {
                if (!quarantineSet.contains(endpoint)) {
                    remainingHosts.add(endpoint);
                }
            }
            candidateHosts = remainingHosts;
        }

        return candidateHosts;
    }
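To make the effect of that branch easier to see, here is a minimal, self-contained sketch that mirrors the branching of the snippet above. It is not the Eureka client itself: the class QuarantineSketch, the quarantine() helper, the endpoint names b1..b3 / a1..a2, and the 0.66 percentage are all hypothetical and chosen only for illustration, with endpoints modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the client's quarantine handling; not Eureka code.
public class QuarantineSketch {
    private final Set<String> quarantineSet = new LinkedHashSet<>();
    private final List<String> clusterEndpoints;
    private final double refreshPercentage;

    public QuarantineSketch(List<String> clusterEndpoints, double refreshPercentage) {
        this.clusterEndpoints = clusterEndpoints;
        this.refreshPercentage = refreshPercentage;
    }

    // Assumed helper: marks an endpoint as failed, as a retry loop would.
    public void quarantine(String endpoint) {
        quarantineSet.add(endpoint);
    }

    // Same three branches as the getHostCandidates() snippet above.
    public List<String> getHostCandidates() {
        List<String> candidateHosts = new ArrayList<>(clusterEndpoints);
        quarantineSet.retainAll(candidateHosts);

        int threshold = (int) (candidateHosts.size() * refreshPercentage);
        if (threshold > candidateHosts.size()) {
            threshold = candidateHosts.size();
        }
        if (quarantineSet.isEmpty()) {
            return candidateHosts;               // nothing quarantined
        }
        if (quarantineSet.size() >= threshold) {
            quarantineSet.clear();               // start over with the full list
            return candidateHosts;
        }
        candidateHosts.removeAll(quarantineSet); // keep only healthy hosts
        return candidateHosts;
    }

    // Zone B (b1..b3) is listed first and is entirely unreachable.
    public static List<String> candidatesAfterZoneBFailure(double refreshPercentage) {
        List<String> endpoints = Arrays.asList("b1", "b2", "b3", "a1", "a2");
        QuarantineSketch sketch = new QuarantineSketch(endpoints, refreshPercentage);
        sketch.quarantine("b1");
        sketch.quarantine("b2");
        sketch.quarantine("b3");
        return sketch.getHostCandidates();
    }

    public static void main(String[] args) {
        System.out.println(candidatesAfterZoneBFailure(0.66));
        System.out.println(candidatesAfterZoneBFailure(0.8));
    }
}
```

With a percentage of 0.66 the threshold is 3, so quarantining the three Zone B hosts clears the set and the full list, starting with Zone B, is returned again. With 0.8 the threshold is 4, the quarantine is kept, and only the Zone A hosts remain as candidates.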
Question
While working on this, I looked for documentation on transportConfig but couldn't find any.
If such documentation exists and I simply missed it, could you share the link?
Additionally, I wonder whether the config property eureka.client.transport.retryable-client-quarantine-refresh-percentage was introduced with the HA of Eureka Server in mind. If not, what is this property intended for, and how is it meant to be used?
Suggestion
I would like to take this opportunity to contribute documentation based on what I worked on. I would appreciate it if you could tell me the procedure for doing so.
Appendix
Regarding the threshold calculation
Suppose two server instances are configured in Zone A and three server instances in Zone B.
For a Eureka Client in Zone B to attempt the register event against the Zone A server peers when it cannot reach the server peers in its own zone, threshold, that is candidateHosts.size() (which is 5) * retryable-client-quarantine-refresh-percentage truncated to an integer, must be greater than the number of temporarily inaccessible Zone B server instances (3).
Because of the integer truncation, threshold must be at least 4, so the refresh percentage must be at least 0.8.
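The arithmetic above can be checked with a small sketch. The class ThresholdSketch and its method names are hypothetical; only the threshold formula is taken from the getHostCandidates() snippet, and the percentages tried are arbitrary sample values.

```java
// Hypothetical helper for checking the threshold arithmetic; not Eureka code.
public class ThresholdSketch {

    // Same formula as in getHostCandidates(): truncate, then cap at the host count.
    public static int threshold(int candidateCount, double refreshPercentage) {
        int threshold = (int) (candidateCount * refreshPercentage);
        return Math.min(threshold, candidateCount);
    }

    // True when the quarantine set would be cleared, i.e. the client starts
    // over with the full peer list instead of skipping the failed hosts.
    public static boolean quarantineCleared(int candidateCount, int quarantined,
                                            double refreshPercentage) {
        return quarantined > 0
                && quarantined >= threshold(candidateCount, refreshPercentage);
    }

    public static void main(String[] args) {
        int totalHosts = 5;      // 2 in Zone A + 3 in Zone B
        int unreachable = 3;     // all of Zone B
        for (double pct : new double[] {0.66, 0.79, 0.8}) {
            System.out.println("percentage=" + pct
                    + " threshold=" + threshold(totalHosts, pct)
                    + " cleared=" + quarantineCleared(totalHosts, unreachable, pct));
        }
    }
}
```

For 5 hosts with 3 unreachable, 0.66 and 0.79 both truncate to a threshold of 3, so the quarantine set is cleared; 0.8 gives a threshold of 4, so the three Zone B hosts stay quarantined and the Zone A peers are tried.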