-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HIL SERL] (refactor): replace setup_process_handlers with ProcessSignalHandler #1263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
66e5255
bee6ae0
c3f69ae
0617d73
dec4d37
88d63e1
95576c2
d3ad67a
f57bded
c70c2e4
7458e73
80e2469
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,38 +16,68 @@ | |
# limitations under the License. | ||
|
||
import logging | ||
import os | ||
import signal | ||
import sys | ||
|
||
shutdown_event_counter = 0 | ||
|
||
class ProcessSignalHandler: | ||
"""Utility class to attach graceful shutdown signal handlers. | ||
|
||
def setup_process_handlers(use_threads: bool) -> any: | ||
if use_threads: | ||
from threading import Event | ||
else: | ||
from multiprocessing import Event | ||
The class exposes a shutdown_event attribute that is set when a shutdown | ||
signal is received. A counter tracks how many shutdown signals have been | ||
caught. On the second signal the process exits with status 1. | ||
""" | ||
|
||
shutdown_event = Event() | ||
_SUPPORTED_SIGNALS = ("SIGINT", "SIGTERM", "SIGHUP", "SIGQUIT") | ||
|
||
# Define signal handler | ||
def signal_handler(signum, frame): | ||
logging.info("Shutdown signal received. Cleaning up...") | ||
shutdown_event.set() | ||
global shutdown_event_counter | ||
shutdown_event_counter += 1 | ||
def __init__(self, use_threads: bool, display_pid: bool = False): | ||
# TODO: Check if we can use Event from threading since Event from | ||
# multiprocessing is the a clone of threading.Event. | ||
# https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Event | ||
if use_threads: | ||
from threading import Event | ||
else: | ||
from multiprocessing import Event | ||
|
||
if shutdown_event_counter > 1: | ||
logging.info("Force shutdown") | ||
sys.exit(1) | ||
self.shutdown_event = Event() | ||
self._counter: int = 0 | ||
self._display_pid = display_pid | ||
|
||
signal.signal(signal.SIGINT, signal_handler) # Ctrl+C | ||
signal.signal(signal.SIGTERM, signal_handler) # Termination request (kill) | ||
signal.signal(signal.SIGHUP, signal_handler) # Terminal closed/Hangup | ||
signal.signal(signal.SIGQUIT, signal_handler) # Ctrl+\ | ||
self._register_handlers() | ||
|
||
def signal_handler(signum, frame): | ||
logging.info("Shutdown signal received. Cleaning up...") | ||
shutdown_event.set() | ||
@property | ||
def counter(self) -> int: # pragma: no cover – simple accessor | ||
imstevenpmwork marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"""Number of shutdown signals that have been intercepted.""" | ||
return self._counter | ||
|
||
return shutdown_event | ||
def _register_handlers(self): | ||
"""Attach the internal _signal_handler to a subset of POSIX signals.""" | ||
|
||
def _signal_handler(signum, frame): | ||
pid_str = "" | ||
if self._display_pid: | ||
pid_str = f"[PID: {os.getpid()}]" | ||
logging.info(f"{pid_str} Shutdown signal {signum} received. Cleaning up…") | ||
self.shutdown_event.set() | ||
self._counter += 1 | ||
|
||
# On a second Ctrl-C (or any supported signal) force the exit to | ||
# mimic the previous behaviour while giving the caller one chance to | ||
# shutdown gracefully. | ||
# TODO: Investigate if we need it later | ||
if self._counter > 1: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any strong reason why you need a second signal to shutdown? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem that scripts sometimes stuck and doesn't stop. Or do that very slow - like minutes. We want to kill process immediately if user send There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But then why ignore the first ctlr+c? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't ignore. The flow is following - you press But sometimes something happen, maybe some bugs, or some issue in pytorch, or issue in multiprocesses setup and some of processes or threads are stuck. Like for minutes, or foreverer. In such case you will be seating and watching locked terminal. So, as every user u will try to press Without any signal handling - we won't be able to stop the process (processes) at all, without this trick we won't be able to stop processes in case of some bugs or breaking changes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Smelly IMHO There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @imstevenpmwork what is the best from your opinion to solve it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe the reason you guys need to ignore the first CTRL+C (but set the shutdown event) is to ensure synchronization between processes. If Process A terminates immediately, other processes (B and C) won't be able to check the
If Process A dies during the first signal, Processes B and C can't check its status. The delayed termination you observe likely occurs because the system relies on garbage collection to clean up these processes. I think, for example, that the approach below would have the same effect as ignoring the first signal: self.shutdown_event.set()
time.sleep(10) # Give Processes B & C time to detect the event and shutdown cleanly
sys.exit(1) # Process B & C had the time to respond to the shutdown signal before Process A terminates However, in my opinion, this would not be a good solution either. The best way to proceed is to just have a rigorous thread/process synchronization communication which is not straight forward There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see your point. This could be good too, yes. Just to give the context how we came up to the current solution: Theoretically - yeah, we can always do There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, it seems that systems will be more complex - as we will add some planners, supervisors, we will add locomotion parts, and probably remote inference as default way to produce actions. So, would be good to have a good basic system. |
||
logging.info("Force shutdown") | ||
sys.exit(1) | ||
|
||
for sig_name in self._SUPPORTED_SIGNALS: | ||
sig = getattr(signal, sig_name, None) | ||
if sig is None: | ||
# The signal is not available on this platform (Windows for | ||
# instance does not provide SIGHUP, SIGQUIT…). Skip it. | ||
continue | ||
try: | ||
signal.signal(sig, _signal_handler) | ||
except (ValueError, OSError): # pragma: no cover – unlikely but safe | ||
# Signal not supported or we are in a non-main thread. | ||
continue |
Uh oh!
There was an error while loading. Please reload this page.