
Port HIL SERL #644

Merged · merged 312 commits into main on Jun 13, 2025

Conversation

@AdilZouitine (Member) commented on Jan 17, 2025

Implementing HIL-SERL

This PR implements the HIL-SERL approach as described in the paper. HIL-SERL combines human-in-the-loop intervention with reinforcement learning to enable efficient learning from human demonstrations.

The implementation includes:

  • Reward classifier training with pretrained architecture: Added a lightweight classification head built on top of a frozen, pretrained image encoder from HuggingFace. This classifier processes robot camera images to predict rewards, supporting binary and multi-class classification. The implementation includes metrics tracking with WandB. (A minimal sketch follows this list.)

  • Environment configurations for HILSerlRobotEnv: Added configuration classes for the HIL environment including VideoRecordConfig, WrapperConfig, EEActionSpaceConfig, and EnvWrapperConfig. These handle parameters for video recording, action space constraints, end-effector control, and environment-specific settings.

  • SAC-based reinforcement learning algorithm: Implemented the Soft Actor-Critic (SAC) algorithm with configurable network architectures and optimization settings. The implementation includes actor and critic networks, policy configurations, temperature auto-tuning, and target network updates via exponential moving averages. (The EMA update and temperature tuning are sketched after this list.)

  • Actor-learner architecture with efficient communication protocols: Added actor server script that establishes connection with the learner, creating queues for parameters, transitions, and interactions. Implemented LearnerService class with gRPC for efficient streaming of parameters and transitions between components.

  • Replay buffer for storing transitions: Added a ReplayBuffer class for storing and sampling transitions in reinforcement learning. It includes functions for random cropping and shifting of images, memory optimization, and batch sampling. (The image-shift augmentation is sketched after this list.)

  • End-effector control utilities: Implemented input controllers (KeyboardController and GamepadController) that generate motion deltas for robot control. Added utilities for finding joint and end-effector bounds, and for selecting regions of interest in images.

  • Human intervention support: Added RobotEnv class that wraps robot interfaces to provide a consistent API for policy evaluation with integrated human intervention. Created PyTorch-compatible action space wrappers for seamless integration with PyTorch tensors.
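
To make the reward-classifier item concrete, here is a minimal sketch of a classification head on top of a frozen, pretrained image encoder. The encoder name, pooled-feature access, and class count are illustrative assumptions, not the exact choices made in this PR.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class RewardClassifier(nn.Module):
    """Lightweight classification head on a frozen, pretrained image encoder (sketch)."""

    def __init__(self, encoder_name: str = "facebook/convnext-tiny-224", num_classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the pretrained backbone frozen
        hidden = self.encoder.config.hidden_sizes[-1]  # ConvNeXt-style config; an assumption
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            features = self.encoder(pixel_values).pooler_output
        return self.head(features)  # logits for binary or multi-class rewards
```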
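
The SAC item mentions temperature auto-tuning and target network updates via exponential moving averages; the snippet below sketches only those two pieces, with illustrative hyperparameters.

```python
import torch


def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005) -> None:
    """Exponential-moving-average update of the target critic parameters."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), online_net.parameters()):
            tgt.mul_(1.0 - tau).add_(tau * src)


# Temperature auto-tuning: learn log_alpha so that the policy entropy tracks a target value.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -7.0  # common heuristic: -dim(action space); the value here is an assumption


def temperature_loss(log_probs: torch.Tensor) -> torch.Tensor:
    # log_probs are log pi(a|s) for actions sampled from the current policy.
    return -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
```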
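
For the replay-buffer item, here is a minimal sketch of the random image-shift augmentation applied to sampled batches; the pad size and function name are assumptions, and the PR's actual helper may differ.

```python
import torch
import torch.nn.functional as F


def random_shift(images: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Pad a batch of images, then crop back to the original size at a random offset per sample."""
    b, _, h, w = images.shape
    padded = F.pad(images, (pad, pad, pad, pad), mode="replicate")
    tops = torch.randint(0, 2 * pad + 1, (b,))
    lefts = torch.randint(0, 2 * pad + 1, (b,))
    out = []
    for i in range(b):
        t, l = int(tops[i]), int(lefts[i])
        out.append(padded[i, :, t : t + h, l : l + w])
    return torch.stack(out)
```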

Engineering Design Choices for HIL-SERL Implementation

Environment Abstraction and Entry Points

Currently, environment building for both simulation and real robot training is embedded within gym_manipulator.py. This creates a clean interface for robot interaction. While this approach works well for our immediate needs, future discussions may consider consolidating all environment creation through a single entry point in lerobot.common.envs.factory::make_env for consistency across the codebase and better maintainability.

Gym Manipulator

The gym_manipulator.py script contains the main RobotEnv class, which defines a gym-based interface for the Manipulator robot class. It also contains a set of wrappers that can be used on top of the RobotEnv class to provide additional functionality necessary for training. For example, the ImageCropResizeWrapper class is used to crop the image to a region of interest and resize it to a fixed size, EEActionWrapper is used to convert the end-effector action space to joint position commands, and so on.
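
As an illustration of this wrapper pattern, here is a minimal sketch of a crop-and-resize observation wrapper; the actual ImageCropResizeWrapper may differ in its configuration, observation keys, and handling of the observation space (the key name and ROI format below are assumptions).

```python
import gymnasium as gym
import torch
import torchvision.transforms.functional as TF


class CropResizeObservation(gym.ObservationWrapper):
    """Crop a camera image to a region of interest, then resize it to a fixed shape (sketch)."""

    def __init__(self, env, roi=(0, 0, 224, 224), size=(128, 128), image_key="pixels"):
        super().__init__(env)
        self.top, self.left, self.height, self.width = roi
        self.size = list(size)
        self.image_key = image_key

    def observation(self, obs):
        img = torch.as_tensor(obs[self.image_key])  # assumed layout: (C, H, W)
        img = TF.crop(img, self.top, self.left, self.height, self.width)
        obs[self.image_key] = TF.resize(img, self.size)
        return obs
```

A complete implementation would also update the wrapped env's observation_space to match the resized images.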

The script contains three additional functions:

  • make_robot_env: This function builds a gymnasium environment with the RobotEnv base and the requested wrappers.
  • record_dataset: This function records an offline dataset of demonstrations by logging the robot's actions in the environment. The dataset can be used to train the reward classifier or as the offline dataset for RL.
  • replay_dataset: This function replays a recorded dataset, which is useful for debugging the action space on the robot.

You can record or replay a dataset by setting the mode- and dataset-related arguments of HILSerlRobotEnvConfig in lerobot/common/envs/configs.py (more details in the guide).

Q: Why not use control_robot.py for collecting and replaying data?

A: Since we mostly use end-effector control and different teleoperation devices (gamepad, keyboard, or leader), it is more convenient to collect and replay data using the gym env interface in gym_manipulator.py.
After PR #777 we might be able to seamlessly change the teleoperation device and action space; then we can revert to using control_robot.py for collecting and replaying data.

Optional Dataset in TrainPipelineConfig

The TrainPipelineConfig class has been modified to make the dataset parameter optional. This reflects the reality that while imitation learning requires demonstration data, pure reinforcement learning algorithms can function without an offline dataset. This makes the training pipeline more versatile and better aligned with various learning paradigms supported by HIL-SERL.
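
A minimal sketch of what the optional dataset looks like at the config level; the field and class names below mirror the description but are simplified stand-ins, not the actual TrainPipelineConfig definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetConfig:  # hypothetical stand-in for the real dataset config
    repo_id: str


@dataclass
class TrainPipelineConfig:
    # None means pure online RL: no offline demonstration dataset is required.
    dataset: Optional[DatasetConfig] = None

    def validate(self) -> None:
        # Only runs that actually use demonstrations need a dataset entry.
        if self.dataset is not None and not self.dataset.repo_id:
            raise ValueError("dataset.repo_id must be set when a dataset is provided")
```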

Consolidation of Implementation Files

For actor_server.py, learner_server.py, and gym_manipulator.py, we deliberately chose to create larger, more comprehensive files rather than splitting functionality across multiple smaller files. While this approach goes against some code organization principles, it significantly reduces the cognitive load required to understand these critical components. Each file represents a complete, coherent system with clear boundaries of responsibility.

Organization of Server-Side Components

We've placed multiple related files in the lerobot/scripts/server folder as a first step toward better organization. This groups related functionality for the actor-learner architecture. We're waiting for reviewer feedback before proceeding with further organization to ensure our approach aligns with the project's overall structure.

MultiAdamConfig for Optimizer Management

We introduced the MultiAdamConfig class to simplify handling multiple optimizers. Reinforcement learning methods like SAC typically rely on different networks (actor, critic, temperature) that are optimized at different frequencies and with different hyperparameters. This class:

  • Provides a clean interface for creating and managing multiple optimizers
  • Reduces error-prone boilerplate code when updating different networks
  • Enables more sophisticated optimization strategies with minimal code changes
  • Simplifies checkpoint saving and loading for training resumption
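
A minimal sketch of the multi-optimizer idea: one Adam instance per named parameter group (actor, critic, temperature), built from per-group hyperparameters. The class and method names are illustrative, not the actual MultiAdamConfig API.

```python
from dataclasses import dataclass, field

import torch


@dataclass
class MultiAdamSketch:
    """Per-network Adam hyperparameters, e.g. {"actor": {"lr": 3e-4}, "critic": {"lr": 3e-4}}."""

    optimizer_groups: dict = field(default_factory=dict)

    def build(self, params: dict) -> dict:
        # `params` maps the same group names to iterables of parameters.
        return {
            name: torch.optim.Adam(params[name], **kwargs)
            for name, kwargs in self.optimizer_groups.items()
        }


# Usage sketch:
# cfg = MultiAdamSketch({"actor": {"lr": 3e-4}, "critic": {"lr": 3e-4}, "temperature": {"lr": 3e-4}})
# optimizers = cfg.build({"actor": actor.parameters(), "critic": critic.parameters(), "temperature": [log_alpha]})
```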

Gradient Flow Through Normalization

We removed the torch.no_grad() decorator from normalization functions to allow gradients to flow through these operations. This is essential for end-to-end training where normalized inputs need to contribute to the gradient computation. Without this change, backpropagation would be blocked at normalization boundaries, preventing the model from learning to account for input normalization during training.
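
The difference can be illustrated with a tiny normalization module: wrapping the forward pass in torch.no_grad() would detach it from the autograd graph, whereas the version below lets gradients flow back through the normalization (a sketch, not the PR's actual normalization code).

```python
import torch
import torch.nn as nn


class Normalize(nn.Module):
    def __init__(self, mean: torch.Tensor, std: torch.Tensor):
        super().__init__()
        self.register_buffer("mean", mean)
        self.register_buffer("std", std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No torch.no_grad() here: (x - mean) / std stays in the autograd graph,
        # so gradients can reach whatever produced x during end-to-end training.
        return (x - self.mean) / (self.std + 1e-8)
```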


How it was tested

  • We trained an agent on ManiSkill using this actor-learner architecture. The main task is PushCube-v1. The point of the ManiSkill experiments is to validate that the Soft Actor-Critic implementation is correct; this baseline involves no human interventions. We validated that the implementation works with both sparse and dense rewards, with and without an offline dataset.

Figure: reward on ManiSkill, training without offline data and without human intervention.

  • Another baseline is the MuJoCo-based simulation of the Franka Panda arm in the HuggingFace/gym-hil repo. We implemented the ability to teleoperate the simulated robot with an external keyboard or gamepad device.

Figures: intervention rate and reward vs. time during one training run. We are able to train a policy to 100% success within 10 to 30 minutes.

Other videos using this implementation:

IMG_5714.mov

How to check out & try it (for the reviewer) 😃

Documentation

@michel-aractingi force-pushed the user/adil-zouitine/2025-1-7-port-hil-serl-new branch from b1be31a to 2211209 on February 3, 2025 15:11
@AdilZouitine changed the title from "[WIP] Fix SAC and port HIL SERL" to "[WIP] Port HIL SERL" on March 18, 2025
@AdilZouitine force-pushed the user/adil-zouitine/2025-1-7-port-hil-serl-new branch from 9a68f20 to ae12807 on March 24, 2025 11:05
@AdilZouitine changed the base branch from user/michel-aractingi/2024-11-27-port-hil-serl to main on March 24, 2025 11:07
@AdilZouitine force-pushed the user/adil-zouitine/2025-1-7-port-hil-serl-new branch 2 times, most recently from ad51d89 to 808cf63 on March 28, 2025 17:20
@imstevenpmwork added the "enhancement" and "policies" labels on April 17, 2025
@AdilZouitine changed the title from "[WIP] Port HIL SERL" to "Port HIL SERL" on April 18, 2025
Co-authored-by: Michel Aractingi <[email protected]>
@aliberts (Collaborator) left a comment:

Added a few more comments

def signal_handler(signum, frame):
    logging.info("Shutdown signal received. Cleaning up...")
    shutdown_event.set()
    global shutdown_event_counter
Review comment (Collaborator):

Can you put this code in a class?
I'd really prefer to not have globals ^^

@@ -9,6 +9,10 @@
     title: Getting Started with Real-World Robots
   - local: cameras
     title: Cameras
+  - local: hilserl
Review comment (Collaborator):

We can refactor/reorganize the docs in an upcoming dedicated PR if not now

@imstevenpmwork (Collaborator) commented:

Given the size of this PR and our tight deadline for merging into main, I would address any findings worth discussing incrementally via follow-up tickets/PRs. The development team of this PR has already validated the functional aspects of the features described here through in-house experiments.

Once this PR lands in main, we should open tickets/PRs to address the unresolved conversations and to review the code more in depth. This also applies to #1263, which introduces last-minute changes in the critical resource management design, for which not all conversations were fully resolved either. Namely: #1263 (comment)

cc @AdilZouitine cc @michel-aractingi cc @helper2424

@helper2424 (Contributor) replied, quoting @imstevenpmwork's comment above:

@imstevenpmwork sounds good. I also have one more #1266. We can merge it after the hackathon 👍

@aliberts (Collaborator) left a comment:

LGTM, massive thanks and congrats to everyone involved 🥳

@aliberts merged commit d807958 into main on Jun 13, 2025
8 checks passed
@aliberts deleted the user/adil-zouitine/2025-1-7-port-hil-serl-new branch on June 13, 2025 11:15
@@ -97,7 +98,8 @@ stretch = [
     "pyrender @ git+https://github.com/mmatl/pyrender.git ; sys_platform == 'linux'",
     "pyrealsense2>=2.55.1.6486 ; sys_platform != 'darwin'"
 ]
-test = ["pytest>=8.1.0", "pytest-cov>=5.0.0", "mock-serial>=0.0.1 ; sys_platform != 'win32'"]
+test = ["pytest>=8.1.0", "pytest-timeout>=2.4.0", "pytest-cov>=5.0.0", "pyserial>=3.5", "mock-serial>=0.0.1 ; sys_platform != 'win32'"]
+hilserl = ["transformers>=4.48", "gym-hil>=0.1.8", "protobuf>=5.29.3", "grpcio==1.71.0"]
Review comment (Collaborator):

The gamepad introduces a dependency on pygame and hid, which don't seem to be explicitly declared in the .toml but are pulled in transitively through gym-hil. This means that trying to use the gamepad without installing gym-hil will raise an error.

@snknitheesh commented:

@AdilZouitine @michel-aractingi During data collection, why does the cube spawn in the same position in all episodes? Is there a config to randomize the cube position during reset?
