
Initial support for OpenJDK CRaC snapshotting #13619


Closed · wants to merge 1 commit

Conversation

@rvansa commented Apr 20, 2023

This change intends to support an application using the Vertx GRPC server to perform Checkpoint and Restore on a JVM implementing it, specifically OpenJDK CRaC or future versions of OpenJDK. The org.crac package is a facade that either forwards invocations to the actual implementation or provides a no-op implementation.
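For readers unfamiliar with the facade, a minimal sketch of the registration pattern it enables (the class name is illustrative, not the PR's actual code):

```java
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

// Illustrative org.crac usage: on a CRaC-enabled JVM the callbacks run around
// the checkpoint; on any other JVM the facade never invokes them (no-op).
class ClientNetworkResource implements Resource {

    ClientNetworkResource() {
        // The global context keeps only a weak reference, so the owning
        // component must hold on to this Resource instance itself.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Close open sockets so no live connection is captured in the snapshot.
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Allow connections to be re-created lazily once the process is restored.
    }
}
```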

Right now, a test of the actual behaviour is not provided; without running on a CRaC-enabled JVM there is nothing that would invoke the Resource methods, and making this part of the test suite would be complicated (probably requiring a containerized test). If needed, I could try to put together a test that does not involve an actual checkpoint but verifies that the code does not deadlock and that connections are eventually re-created.

It is not entirely clear to me what level of API publicity the KafkaClient and Selectable interfaces have, in what version this could land, or whether I should make the changes only on the implementation classes.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@divijvaidya (Member) commented

Hi @rvansa
Thank you for your first contribution to Apache Kafka! Thanks to you, today I learnt something new (CRaC).

For an effective discussion on this topic, could you start a discussion thread with the community at [email protected] (preferably accompanied by a KIP)? I am asking because we need to dig more into the problem this is solving for us, look at alternative solutions, and weigh the trade-offs of using a non-GA feature of OpenJDK (and I understand that you have tried to address this using the facade package). You can find the process for creating a KIP at https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

My initial thoughts (based on limited understanding, please correct me if I am wrong):

  • What is the downside of adding checkpoint & restore to the producer threads? What expense does it add?
  • Looks like we are trying to save resources by suspending the threads (on the producer) that are not actively doing anything and restoring them when they are needed? Is that right?

Let's talk more on the mailing list.

It is not entirely clear to me what level of API publicity the KafkaClient and Selectable interfaces have

In general, you can find all public APIs for Kafka at https://kafka.apache.org/34/javadoc/allclasses-index.html

@rvansa (Author) commented Apr 20, 2023

Sure, thanks for the pointers! I'll go through the docs and compose a proposal on the mailing list. If you don't mind I'll keep this PR open in the meantime.

What is the downside of adding checkpoint & restore to the producer threads? What expense does it add?

My take on this is that unless the checkpoint itself is performed, there shouldn't be any performance overhead (or only a very minimal one). In the case of this PR the sender performs a volatile read in its loop, which is cheap (unless contended by frequent writes). Components that handle the checkpoint process also usually need a little memory for tracking, but applications typically have only one or a few instances of each component. On the other hand, the cost of the checkpoint itself can be significant, but it happens in a controlled manner, sometimes even outside the production environment.

Looks like we are trying to save resources by suspending the threads (on the producer) that are not actively doing anything and restoring them when they are needed? Is that right?

The sender thread is paused not to save resources but to achieve correctness. Before performing the checkpoint we need to close all network connections, and we must not re-create them unexpectedly until restore. From my understanding of the code, the affected components are used exclusively by the Sender thread (which processes the request queues), so the most natural and performant option was to block it entirely rather than synchronizing with locks (which would add non-trivial overhead even without a checkpoint).
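A minimal sketch of that pattern (illustrative names, not the PR's actual Sender code): the hot path pays only an uncontended volatile read, and the checkpoint hook flips the flag so the loop parks until restore.

```java
// Sketch of the pause-for-checkpoint pattern described above.
class PausableSenderLoop {
    private volatile boolean paused;
    private final Object lock = new Object();

    // Called at the top of each iteration of the sender's run loop.
    void maybePause() throws InterruptedException {
        if (paused) {                  // cheap volatile read when not checkpointing
            synchronized (lock) {
                while (paused) {
                    lock.wait();       // blocks the sender thread until restore
                }
            }
        }
    }

    // beforeCheckpoint: stop the loop before connections are closed.
    void pause() {
        paused = true;
    }

    // afterRestore: let the loop continue; connections are re-created lazily.
    void resume() {
        synchronized (lock) {
            paused = false;
            lock.notifyAll();
        }
    }
}
```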

@Hangleton (Contributor) left a comment

Hi, Radim,

Thanks for the PR. This would definitely need a KIP.

I am quite curious about the use cases driving this. Some problems I see with the approach:

  • Apache Kafka brokers hold an extensive amount of state and cached metadata at any given time. This state is often not applicable upon application restart. For instance, after restoring the previous state of the application, replica states, leader epochs, and consumer ids would be stale. This list is far from exhaustive.
  • Active connections are used by the controller and quorum to determine the liveness of a broker. Once these connections are closed, the broker is reported dead, and any new process for that broker comes with a new identity, either in the form of a ZooKeeper session id or a broker incarnation id. This conflicts with re-using the state held by a defunct broker process.
  • A critical part of Apache Kafka's startup process is verifying the integrity of the data stored on the brokers. Based on the application checkpoints read on start, Kafka can perform index and state reconstruction by scanning replica logs. Without this, we would lose the integrity guarantees that the recovery path provides at startup.

Unless the broker needs to take the recovery actions stated above, its start-up is usually fairly fast, a matter of seconds in common use cases. What drives the need for the feature in this PR?

@rvansa (Author) commented Apr 20, 2023

Hi @Hangleton, in this PR I am not addressing the broker but a client. Since I am not that familiar with the whole project, I am following a whack-a-mole strategy; in my case I am trying to demo a C/R (checkpoint/restore) of the Quarkus Super-Heroes example application, which uses Kafka Clients to report some data to another component.

I can imagine that in the case of the broker you don't see the need for frequent scaling, as it would be counter-productive. For applications that don't hold (Kafka) state, though, a startup measured in seconds is quite a long time, especially if we're talking about serverless architectures and the like. Quarkus in native mode strives for sub-second startup; with CRaC we can get to tens of milliseconds.

@Hangleton (Contributor) commented

Hi, Radim,

Thank you for the follow-up and the clarification; I missed the fact that the targeted components are the Kafka clients.

Some of the previous statements regarding state may still be valid. Typically, a Kafka client holds cluster and topic metadata, and one of its first operations on start-up, once a connection with a bootstrap broker is established, is to fetch this metadata to get an up-to-date view of the cluster (e.g. broker membership).
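(For illustration, a hedged sketch of forcing such a refresh from application code, using only the public producer API; the topic name is a placeholder:)

```java
import java.util.List;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.PartitionInfo;

// A public API call such as partitionsFor() makes the producer fetch metadata
// for a topic when its cached view is missing or older than metadata.max.age.ms.
// "orders" is a placeholder topic name.
final class MetadataTouch {
    static List<PartitionInfo> refreshView(KafkaProducer<String, String> producer) {
        return producer.partitionsFor("orders");
    }
}
```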

But I lack the background to fully understand this approach.

@rvansa (Author) commented Apr 21, 2023

@Hangleton You're right, this process should be repeated; for example, in the JDK itself we flush DNS caches before a checkpoint. I was hoping that the code in https://github.com/apache/kafka/pull/13619/files#diff-dcc1af531d191de8da1e23ad6d878a3efc463ba4670dbcf2896295a9dacd1c18R658 would reload the cluster view; is that not the case?

@github-actions (bot) commented

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions bot added the 'stale' label on Jul 21, 2023
@tzvetkovg commented

Hi, any development on this? We've been using the Spring Kafka reactive streams and recently started testing the Azul CRaC JDK. I wonder, is there any way to gracefully pause the Kafka stream before checkpointing and then re-enable it afterwards? Thanks.

@rvansa (Author) commented Jan 29, 2024

Hi @tzvetkovg, there has not been much reaction from the community so far, so basically: we need your feedback. You can use the artifacts at https://mvnrepository.com/artifact/io.github.crac.org.apache.kafka/kafka-clients/3.3.1.CRAC.0 as a kind of preview until Kafka decides to integrate this. This should work for the sender; TBH I haven't integrated any hooks for the event receiver. Your feedback is welcome.

@hamburml commented Aug 14, 2024

Hi @rvansa @tzvetkovg @Hangleton ,

we use Quarkus for AWS Lambdas and have SnapStart (which uses the CRaC API) active, so a snapshot is created when the Lambda is started and is reloaded during cold start, which makes the cold start very fast. We use SmallRye Messaging and the Kafka connector to send messages to Kafka. After the cold start we always get an exception that the connection is lost. The client reconnects and everything works.

Of course the connection is lost: when AWS took the snapshot there was one working connection open, and when the Lambda is used two hours later this connection no longer exists.

So we would appreciate this feature :) A simple close-and-reconnect method on a KafkaClient would work; it could be called in the beforeCheckpoint and afterRestore methods.
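A hedged sketch of that close-and-reconnect idea from application code, using only public APIs (names are illustrative; this is not part of the PR):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

// Coarse approach: close() flushes and drops the old client entirely, so it
// only fits workloads that are idle around the checkpoint.
class ProducerLifecycle implements Resource {
    private final Properties config;   // must include serializers and bootstrap.servers
    private volatile KafkaProducer<String, String> producer;

    ProducerLifecycle(Properties config) {
        this.config = config;
        this.producer = new KafkaProducer<>(config);
        Core.getGlobalContext().register(this); // 'this' stays strongly referenced here
    }

    KafkaProducer<String, String> producer() {
        return producer;               // callers must re-read this after a restore
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        producer.close();              // no socket survives into the snapshot
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        producer = new KafkaProducer<>(config); // fresh connections and metadata
    }
}
```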

quarkusio/quarkus#42286

Cheers
Michael

@rvansa (Author) commented Aug 26, 2024

@hamburml Thanks for your use case! Have you tried the modified artifact I linked above? Since your concern is the sender side, I hope it might work...

@hamburml commented Aug 27, 2024

No, sorry, I did not. For Kafka we use SmallRye Reactive Messaging, which has a transitive dependency on kafka-clients. I haven't tried whether the modified artifacts are compatible with the one used there.

Edit: Additionally, I would not download artifacts from GitHub directly, for security reasons. And even if I tried, it would not work, because in my case the dependencies are loaded from a dependency repository. This would need integrated support.

github-actions bot removed the 'stale' label on Jan 2, 2025
github-actions bot commented Apr 2, 2025

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions bot added the 'stale' label on Apr 2, 2025
github-actions bot commented May 2, 2025

This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.

github-actions bot added the 'closed-stale' label on May 2, 2025
github-actions bot closed this on May 2, 2025