# Add queue-based orchestration for three service requests #948

zhtshr wants to merge 8 commits into ModelTC:main
Conversation
Code Review
This pull request refactors the data management and service architecture to support multi-tenant processing, introducing a "room" concept for resource isolation and lifecycle management. Key changes:

- `DataManager` now handles room-specific data arguments, threads, and events, and gains `init` and `release` methods for room lifecycle.
- A new `ReqManager` class provides ZeroMQ-based request communication.
- `EncoderService`, `TransformerService`, and `DecoderService` are now stateless, receive configuration per request, and include `exec_request` methods that process requests in a loop, each managing its own `DataSender`/`DataReceiver` instances per room.
- A `ControllerService` is added to dispatch requests to these services.

Review comments highlight: a critical issue with using `time.sleep` for synchronization in the controller; a high-severity portability issue with a hardcoded user-specific path in `mooncake.py`; a medium-severity potential resource leak due to unhandled thread join timeouts; and a medium-severity code-duplication issue in `ReqManager`'s socket handling.
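The PR's actual controller and service code is not reproduced here, but the queue-based dispatch pattern it describes (a controller handing requests to per-service worker loops and collecting completions) can be sketched with the standard library alone. All names below (`service_worker`, `requests`, `results`) are hypothetical, not identifiers from the PR:

```python
import queue
import threading

def service_worker(name, requests, results):
    """Hypothetical service loop mirroring the exec_request pattern:
    pull a request from the queue, process it, report completion."""
    while True:
        req = requests.get()
        if req is None:  # sentinel: shut the worker down
            break
        results.put((name, req["room"], "done"))

# Controller side: one request/result queue pair per service.
services = {}
for name in ("encoder", "transformer", "decoder"):
    requests, results = queue.Queue(), queue.Queue()
    t = threading.Thread(target=service_worker, args=(name, requests, results), daemon=True)
    t.start()
    services[name] = (requests, results, t)

# Dispatch one request (a plain dict standing in for the real config) to each service.
for name, (requests, _, _) in services.items():
    requests.put({"room": "room-1", "config": {}})

# Collect completion signals from each service instead of sleeping.
completions = []
for name, (requests, results, t) in services.items():
    completions.append(results.get(timeout=5))
    requests.put(None)  # stop the worker
    t.join()
```

In the real multi-process setup, ZeroMQ sockets would play the role of these in-process queues, but the orchestration shape is the same.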
```python
self.req_mgr.send(bootstrap_addr, REQUEST_POLLING_PORT + 2, config)
self.logger.info("Request added to controller queue and dispatched to services")

time.sleep(10)  # Sleep briefly to allow services to process the request
```
Using time.sleep(10) to wait for services to process requests is a brittle and unreliable synchronization mechanism. This can lead to race conditions and makes the system's behavior dependent on timing. A more robust solution would involve a proper synchronization mechanism, such as having services send a completion signal back to the controller. For an example script, this might be acceptable as a temporary solution, but it's a critical flaw for any production-level code.
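A minimal stdlib-only sketch of the suggested completion signal, assuming the service can notify the controller when it finishes (in the actual distributed setup, a completion message over the existing ZeroMQ channel would play the role of this `Event`):

```python
import threading

# Hypothetical completion signal: the controller waits on an Event that the
# service sets when it finishes processing, instead of sleeping a fixed 10 s.
done = threading.Event()

def service_side(request, done_event):
    # ... process the request ...
    done_event.set()  # signal completion back to the controller

worker = threading.Thread(target=service_side, args=({"room": "room-1"}, done))
worker.start()

# Controller side: block until completion, with a timeout as a safety net.
completed = done.wait(timeout=30.0)
worker.join()
```

This returns as soon as the work is done rather than always paying the full sleep, and the timeout still bounds the wait if a service hangs.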
```diff
     @staticmethod
     def load_from_env() -> "MooncakeTransferEngineConfig":
-        config_file_path = os.getenv("MOONCAKE_CONFIG_PATH", "/data/nvme1/yongyang/FL/LightX2V/configs/mooncake_config.json")
+        config_file_path = os.getenv("MOONCAKE_CONFIG_PATH", "/root/zht/LightX2V/configs/mooncake_config.json")
```
The default path for MOONCAKE_CONFIG_PATH is user-specific (/root/zht/...). Hardcoding user-specific paths makes the code less portable and will likely cause it to fail for other developers or in different environments. It's better to remove the default value and rely on the environment variable being set, as the code already checks for None and raises an error.
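The suggested pattern, sketched as a standalone function (the function name `load_config_path` and the example path are illustrative, not from the PR):

```python
import os

def load_config_path():
    """No user-specific default: fail loudly when the variable is unset."""
    path = os.getenv("MOONCAKE_CONFIG_PATH")
    if path is None:
        raise ValueError("MOONCAKE_CONFIG_PATH environment variable is not set")
    return path

# Example usage with an illustrative, non-user-specific path.
os.environ["MOONCAKE_CONFIG_PATH"] = "/etc/lightx2v/mooncake_config.json"
resolved = load_config_path()
```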
```diff
-        config_file_path = os.getenv("MOONCAKE_CONFIG_PATH", "/root/zht/LightX2V/configs/mooncake_config.json")
+        config_file_path = os.getenv("MOONCAKE_CONFIG_PATH")
```
```python
for t in threads:
    if t.is_alive():
        t.join(timeout=1.0)
```
The join call has a timeout of 1 second, but there's no handling for the case where a thread doesn't terminate within this time. This could leave zombie threads running and lead to resource leaks. It would be more robust to check if the thread is still alive after the join and log a warning.
```diff
 for t in threads:
     if t.is_alive():
         t.join(timeout=1.0)
+        if t.is_alive():
+            logger.warning("Thread %s for room %s did not terminate in time.", t.name, room)
```
```python
def receive(self, port: int):
    socket = self.pull_sockets.get(port)
    if socket is None:
        socket = self.context.socket(zmq.PULL)
        socket.bind(f"tcp://*:{port}")
        self.pull_sockets[port] = socket
    return socket.recv_pyobj()

def receive_non_block(self, port: int):
    socket = self.pull_sockets.get(port)
    if socket is None:
        socket = self.context.socket(zmq.PULL)
        socket.bind(f"tcp://*:{port}")
        self.pull_sockets[port] = socket
    try:
        return socket.recv_pyobj(flags=zmq.NOBLOCK)
    except zmq.Again:
        return None
```
There's duplicated code for creating and caching pull sockets in receive and receive_non_block. This can be refactored into a private helper method to improve maintainability and reduce redundancy.
```diff
-def receive(self, port: int):
-    socket = self.pull_sockets.get(port)
-    if socket is None:
-        socket = self.context.socket(zmq.PULL)
-        socket.bind(f"tcp://*:{port}")
-        self.pull_sockets[port] = socket
-    return socket.recv_pyobj()
-
-def receive_non_block(self, port: int):
-    socket = self.pull_sockets.get(port)
-    if socket is None:
-        socket = self.context.socket(zmq.PULL)
-        socket.bind(f"tcp://*:{port}")
-        self.pull_sockets[port] = socket
-    try:
-        return socket.recv_pyobj(flags=zmq.NOBLOCK)
-    except zmq.Again:
-        return None
+def _get_pull_socket(self, port: int) -> zmq.Socket:
+    socket = self.pull_sockets.get(port)
+    if socket is None:
+        socket = self.context.socket(zmq.PULL)
+        socket.bind(f"tcp://*:{port}")
+        self.pull_sockets[port] = socket
+    return socket
+
+def receive(self, port: int):
+    socket = self._get_pull_socket(port)
+    return socket.recv_pyobj()
+
+def receive_non_block(self, port: int):
+    socket = self._get_pull_socket(port)
+    try:
+        return socket.recv_pyobj(flags=zmq.NOBLOCK)
+    except zmq.Again:
+        return None
```
This pull request refactors the `DataManager` class and related threading logic in `lightx2v/disagg/conn.py` to better support multi-room (multi-session) operation, improve thread management, and enhance code clarity. The changes introduce per-room resource management, refactor thread lifecycle handling, and add a new `ReqManager` utility class for ZeroMQ-based message passing.

Key changes include:

**Multi-room support and resource management:**
- `DataManager` now manages per-room resources such as `data_args`, threads, and events, enabling concurrent handling of multiple rooms/sessions. Initialization and cleanup for each room are now handled through new `init` and `release` methods.

**Thread lifecycle and event handling improvements:**

**API and naming consistency:**
- Clearer attribute names such as `sender_data_ptrs` and `receiver_ptrs` instead of ambiguous names. [1] [2]
- `bootstrap_room` in `DataReceiver`.

**New utility class:**
- `ReqManager`, a utility class for sending and receiving Python objects over ZeroMQ, including support for non-blocking receives and automatic conversion of nested mappings to built-in types.

**Minor improvements:**

These changes collectively improve the scalability, maintainability, and clarity of the data disaggregation infrastructure.
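The "automatic conversion of nested mappings to built-in types" mentioned for `ReqManager` can be sketched as a small recursive helper. This is a hypothetical implementation (the PR's actual code is not shown here); the function name `to_builtin` is illustrative:

```python
from collections import OrderedDict
from collections.abc import Mapping

def to_builtin(obj):
    """Recursively convert Mapping subclasses (e.g. OrderedDict or dict-like
    config objects) into plain dicts, and sequences into plain lists, so the
    payload serializes cleanly regardless of the sender's container types."""
    if isinstance(obj, Mapping):
        return {k: to_builtin(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_builtin(v) for v in obj]
    return obj

# Example: a nested OrderedDict config flattens to built-in dicts/lists.
config = OrderedDict(model=OrderedDict(name="encoder"), ports=(4000, 4001))
plain = to_builtin(config)
```

Converting to built-ins before `send_pyobj` avoids the receiver needing the sender's custom mapping classes importable to unpickle the message.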