ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments


Calvin Yu3

Qinxi Yu3

Jingzhou Liu1,3

Jia Lin Yuan3

Pooria P. Tehrani3

Ritvik Singh1,3

1 NVIDIA   2 ETH Zurich   3 University of Toronto & Vector Institute





Abstract


We present ORBIT, a unified and modular framework for robotics and robot learning, powered by NVIDIA Isaac Sim. It offers a modular design to easily and efficiently create robotic environments with photo-realistic scenes, and fast and accurate rigid and soft body simulation. With ORBIT, we provide a suite of benchmark tasks of varying difficulty, from single-stage cabinet opening and cloth folding to multi-stage tasks such as room reorganization. The tasks include variations in objects' physical properties and placements, material textures, and scene lighting. To support working with diverse observation and action spaces, we include various fixed-arm and mobile manipulators with different controller implementations and physics-based sensors. ORBIT allows training reinforcement learning policies and collecting large demonstration datasets from hand-crafted or expert solutions in a matter of minutes by leveraging GPU-based parallelization. In summary, we offer fourteen robot articulations, three different physics-based sensors, twenty learning environments, wrappers to four different learning frameworks, and interfaces to help connect to a real robot. With this framework, we aim to support various research areas, including representation learning, reinforcement learning, imitation learning, and motion planning. We hope that it helps establish interdisciplinary collaborations between these communities and that its modularity makes it easily extensible to more tasks and applications in the future.

Video





Robotic Workflows



Reinforcement Learning

We include wrappers for different RL frameworks, such as RSL-RL, RL-Games and Stable-Baselines3. These let users train policies in their environments with a wider set of RL algorithms and facilitate algorithmic research. Using RSL-RL and RL-Games, we can train policies for cabinet opening and end-effector tracking in minutes, obtaining up to 100K samples per second. Since Stable-Baselines3 is not GPU-optimized, we obtain 6K-10K samples per second with it on the same tasks.


Imitation Learning

In ORBIT, we also include connections to various peripheral devices, including a keyboard and a 3D SpaceMouse. Using these interfaces, it is possible to send SE(2) and SE(3) commands for motion generation on the robot. Additionally, we provide data collection utilities to store demonstrations collected from peripheral devices or policies. This data is stored in the data structure used by robomimic, which allows training a wide range of policies through learning from demonstrations.
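To make the idea of sending SE(3) commands from a keyboard concrete, the following is a minimal sketch of how pressed keys could be turned into a 6D delta command (translation followed by rotation). The key bindings, step sizes, and the function name `keys_to_se3_delta` are illustrative assumptions, not ORBIT's actual interface.

```python
# Hypothetical key bindings: key -> (axis index, sign).
# Axes 0-2 are translation (x, y, z); axes 3-5 are rotation (roll, pitch, yaw).
KEY_TO_AXIS = {
    "w": (0, +1.0), "s": (0, -1.0),
    "a": (1, +1.0), "d": (1, -1.0),
    "q": (2, +1.0), "e": (2, -1.0),
    "z": (3, +1.0), "x": (3, -1.0),
    "t": (4, +1.0), "g": (4, -1.0),
    "c": (5, +1.0), "v": (5, -1.0),
}


def keys_to_se3_delta(pressed_keys, pos_step=0.01, rot_step=0.05):
    """Accumulate a 6D delta command from the currently pressed keys.

    pos_step (meters) and rot_step (radians) are assumed per-tick increments.
    Unknown keys are ignored; opposite keys cancel each other.
    """
    delta = [0.0] * 6
    for key in pressed_keys:
        if key not in KEY_TO_AXIS:
            continue
        axis, sign = KEY_TO_AXIS[key]
        step = pos_step if axis < 3 else rot_step
        delta[axis] += sign * step
    return delta


if __name__ == "__main__":
    # Move forward and up at the same time.
    print(keys_to_se3_delta(["w", "q"]))
```

An SE(2) command for a mobile base would follow the same pattern with three components (x, y, yaw) instead of six.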


Motion Planning and Control

Motion planning is one of the well-studied domains in robotics. The traditional Sense-Model-Plan-Act (SMPA) methodology decomposes the complex problem of reasoning and control into separate sub-components. ORBIT supports such paradigms by allowing users to define and evaluate hand-crafted state machines or motion planners.
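As an illustration of what a hand-crafted state machine can look like, here is a minimal sketch for a pick-and-lift style task. The phases, thresholds, and string commands are hypothetical and chosen only for this example; they do not reflect ORBIT's API.

```python
from enum import Enum, auto


class Phase(Enum):
    APPROACH = auto()
    GRASP = auto()
    LIFT = auto()
    DONE = auto()


def step_state_machine(phase, ee_to_object_dist, gripper_closed, object_height,
                       reach_tol=0.01, lift_height=0.2):
    """Return (next_phase, command) from the current phase and sensed quantities.

    Distances and heights are in meters; the tolerances are assumed values.
    """
    if phase is Phase.APPROACH:
        # Switch to grasping once the end-effector is close enough to the object.
        if ee_to_object_dist < reach_tol:
            return Phase.GRASP, "close_gripper"
        return Phase.APPROACH, "move_to_object"
    if phase is Phase.GRASP:
        # Wait until the gripper reports that it has closed.
        if gripper_closed:
            return Phase.LIFT, "move_up"
        return Phase.GRASP, "close_gripper"
    if phase is Phase.LIFT:
        # Keep lifting until the object reaches the target height.
        if object_height >= lift_height:
            return Phase.DONE, "hold"
        return Phase.LIFT, "move_up"
    return Phase.DONE, "hold"


if __name__ == "__main__":
    print(step_state_machine(Phase.APPROACH, 0.005, False, 0.0))
```

Each call consumes the sensed state ("Sense", "Model") and emits a high-level command ("Plan") that a lower-level controller would then execute ("Act").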


Deployment on a Real Robot

It is possible to extend the framework to real robots by using the same API. We studied the feasibility of deploying on a real robot using two different communication protocols: ZeroMQ, a lightweight message passing protocol, and ROS, a popular middleware for robotics.

Franka Emika Arm connection with ZeroMQ

In the following videos, we show the physical Franka Emika arm being controlled by the same actions as the simulated arm. The joint commands from ORBIT are sent to a computer running the real-time kernel for the robot. To abide by the real-time safety constraints, we use a quintic interpolator to upsample the 60 Hz joint commands from the simulator to 1000 Hz for execution on the robot.
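The upsampling step can be sketched as follows. This is a minimal illustration that assumes each pair of consecutive 60 Hz commands is connected by a rest-to-rest quintic segment (zero velocity and acceleration at both ends); the boundary conditions used on the real system are not stated here, and the function names are hypothetical.

```python
def quintic_blend(s):
    """Quintic time scaling on [0, 1] with zero velocity and acceleration at both ends."""
    return 10 * s**3 - 15 * s**4 + 6 * s**5


def upsample_joint_commands(q_start, q_end, sim_rate_hz=60.0, robot_rate_hz=1000.0):
    """Return the intermediate joint targets between two consecutive simulator commands.

    q_start and q_end are lists of joint positions (radians) of equal length.
    """
    # 1000 Hz / 60 Hz is about 16.7, so we round to 17 robot ticks per simulator step.
    n_steps = round(robot_rate_hz / sim_rate_hz)
    targets = []
    for k in range(1, n_steps + 1):
        s = quintic_blend(k / n_steps)
        targets.append([a + s * (b - a) for a, b in zip(q_start, q_end)])
    return targets


if __name__ == "__main__":
    targets = upsample_joint_commands([0.0, 0.5], [0.1, 0.4])
    print(len(targets), targets[-1])
```

The quintic profile keeps the commanded position, velocity, and acceleration continuous across segments, which is what makes it attractive for a controller running under real-time safety constraints.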

We demonstrate the modular system design by keeping the same "agent" stack but replacing the simulated arm with the real arm. Additionally, we experiment with two different tools on the arm: a parallel-jaw gripper and a dexterous hand.

Franka Allegro Lift
Franka Allegro Teleop
Franka Lift
Franka Object Avoidance

Sim-to-real legged locomotion with ROS connection

To further demonstrate the flexibility of the framework and the ease of deployment, we show the ANYmal-D robot being controlled by a policy trained in simulation. We use an MLP-based actuator network to model the robot's series elastic actuators, which have complex dynamics due to non-linear dissipation and delays. We also randomize the physics using the Isaac Replicator tool. The trained policy is then deployed to a real ANYmal-D robot using the ANYbotics ROS stack.


ANYmal Training in Simulation
Trained Policy in Simulation
Policy Deployed to Real Robot



Sample Tasks



Fixed Arm Manipulation


Rigid Objects

Open Cabinet
Hockey
Nut and Bolt
Peg in Hole

Deformable Objects

Hoist Flag
Drop Teddy
Pick Bear
Pour Fluid

In-hand Manipulation

Allegro Hand
Shadow Hand

Mobile Manipulator

Mobile Reach
Mobile Cabinet Opening



Benchmarking Simulation Throughput



Physics


To benchmark the physics performance, we trained the four tasks listed below with different numbers of parallel environments using RL-Games, both in ORBIT and in Isaac Gym. We evaluate the total frames per second (FPS) obtained as the number of concurrently running environments increases. The evaluation is done using an Intel i7-9800X CPU and an NVIDIA RTX3090 GPU.

allegro-hand
anymal-flat
shadow-hand
franka-cabinet

* These numbers were computed using Isaac Sim 2022.1.0.
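For readers who want to reproduce this kind of measurement, the sketch below shows one common way to compute total FPS. It assumes that total FPS means the number of parallel environments multiplied by the number of simulation steps per wall-clock second; `step_fn` is a hypothetical stand-in for one batched environment step and is not an ORBIT function.

```python
import time


def total_fps(num_envs, num_steps, elapsed_s):
    """Total frames per second across all parallel environments."""
    return num_envs * num_steps / elapsed_s


def measure_total_fps(step_fn, num_envs, num_steps=100):
    """Time num_steps calls of step_fn, each of which steps all num_envs environments."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed_s = time.perf_counter() - start
    return total_fps(num_envs, num_steps, elapsed_s)


if __name__ == "__main__":
    # Dummy step that only sleeps, to show the call pattern.
    print(measure_total_fps(lambda: time.sleep(0.001), num_envs=4096, num_steps=50))
```

For example, under this definition 4096 environments stepped 100 times in 4 seconds yield 102,400 frames per second.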


Rendering


To benchmark the multi-camera rendering performance, we created a simple scene, which contains just a robot and a table, and a detailed scene, which contains a variety of different assets, materials, and light sources. We benchmark these scenes with up to ten RGB cameras at two different resolutions, 320x240 and 640x480, with RTX ray-tracing. The figures show the total FPS obtained on the simple and detailed scenes, respectively. As expected, the simple scene provides a higher throughput due to less clutter and fewer light sources. In the current Isaac Sim, the rendering throughput is limited by the virtual memory available on the GPU. Thus, increasing the number of cameras does not linearly increase the simulation output. The evaluation is done using an NVIDIA RTX3090 GPU.

Scene 1: No direct lighting or background assets
Scene 2: Various lighting and background assets

* These numbers were computed using Isaac Sim 2022.1.0.


Citing


If you use ORBIT in your research, please cite the following paper:

@article{mittal2023orbit,
  title={ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments},
  author={Mittal, Mayank and Yu, Calvin and Yu, Qinxi and Liu, Jingzhou and Rudin, Nikita and Hoeller, David and Yuan, Jia Lin and Tehrani, Pooria Poorsarvi and Singh, Ritvik and Guo, Yunrong and others},
  journal={arXiv preprint arXiv:2301.04195},
  year={2023}
}