Rishabh Garg, Tesla Optimus — Challenges in High Performance Robotics Systems
Takeaway
High-performance robotics depends on pipelined, synchronized comms threads, and many apparent 'bad policy' bugs are actually CAN or thread timing bugs that only become visible with bus-level logging.
Summary
- Tesla Optimus engineer walks through what happens between controller and wire: when a robot policy misbehaves, the culprit is often the software/comms stack, not the policy.
- Toy example with CAN bus at 1 Mbps: 10 messages per loop saturate the bus and create a ~1ms gap, forcing pipelined RX/policy/TX threads to hit 2ms loop times.
- Multi-thread desync produces queued or skipped messages: cycle-time plots built from candump captures reveal 4ms gaps followed by zero-gap bursts, which show up as motor jitter and 'catch-up' motion.
- Fixes: synchronization primitives (mutex, cond var, semaphore) on RTOS-capable systems; padding/cushion buffers where they aren't available.
- Diagnostic discipline: cheap external CAN transceivers feeding a laptop running candump are essential for bus-level timing analysis.
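The bus arithmetic and the cycle-time check described in the summary can be sketched in Python. This is a minimal illustration, not the speaker's tooling. It assumes classic CAN frames with 11-bit IDs and 8-byte payloads (the talk summary does not state the payload size), a capture in candump's log-file format (for example from `candump -l can0`), and a nominal 2 ms cycle. The helper names `frame_time_us`, `cycle_times_ms`, and `flag_gaps`, the CAN ID `101`, and the 50% tolerance are hypothetical choices made for the example.

```python
import re
from collections import defaultdict

# Matches candump log-file lines such as:
#   (1700000000.002000) can0 101#0011223344556677
LOG_LINE = re.compile(r"^\((\d+\.\d+)\)\s+(\S+)\s+([0-9A-Fa-f]+)#")


def frame_time_us(payload_bytes=8, bitrate_bps=1_000_000, worst_case_stuffing=False):
    """Approximate on-wire time of one classic CAN frame with an 11-bit ID."""
    # 44 framing bits + 3 interframe-space bits + 8 bits per payload byte.
    bits = 47 + 8 * payload_bytes
    if worst_case_stuffing:
        # Upper bound on stuff bits across the 34 + 8n stuffable bits.
        bits += (34 + 8 * payload_bytes - 1) // 4
    return bits * 1e6 / bitrate_bps


def cycle_times_ms(lines):
    """Return {can_id: [gap_ms, ...]} between consecutive frames of each ID."""
    last_seen = {}
    gaps = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log-format frame line
        timestamp = float(match.group(1))
        can_id = match.group(3).upper()
        if can_id in last_seen:
            gaps[can_id].append((timestamp - last_seen[can_id]) * 1000.0)
        last_seen[can_id] = timestamp
    return dict(gaps)


def flag_gaps(gaps_ms, nominal_ms=2.0, tolerance=0.5):
    """Label gaps far above nominal as 'late' and far below as 'burst'."""
    flagged = []
    for index, gap in enumerate(gaps_ms):
        if gap > nominal_ms * (1 + tolerance):
            flagged.append((index, gap, "late"))
        elif gap < nominal_ms * (1 - tolerance):
            flagged.append((index, gap, "burst"))
    return flagged


if __name__ == "__main__":
    per_loop_ms = 10 * frame_time_us() / 1000.0
    print(f"10 frames at 1 Mbps occupy roughly {per_loop_ms:.2f} ms of bus time")

    # Made-up capture: one frame arrives late, then the next follows almost immediately.
    sample = [
        "(1700000000.000000) can0 101#0011223344556677",
        "(1700000000.002000) can0 101#0011223344556677",
        "(1700000000.006000) can0 101#0011223344556677",
        "(1700000000.006120) can0 101#0011223344556677",
        "(1700000000.008000) can0 101#0011223344556677",
    ]
    for can_id, gaps in cycle_times_ms(sample).items():
        print(can_id, flag_gaps(gaps))
```

Under these assumptions a single frame takes about 111 to 135 microseconds, so ten frames per loop account for a gap on the order of 1 ms. The per-ID gap list is the same data a cycle-time plot would show: a long gap followed by a near-zero gap is the queued-then-flushed pattern the summary describes.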
robotics, real-time, systems
Original description
A robot's behavior is influenced by the control policy, the software configuration, and the electrical characteristics of the communication protocol. When unexpected behaviors arise, it is not straightforward to root-cause them to the RL policy, the electrical characteristics, or the mechanical characteristics. This talk walks through some of these issues and explains what might cause the observed behavior. We will talk about concrete issues that the audience can take away and use to develop their understanding of physical systems. It will build intuition for what kinds of issues to expect when communication data rates increase manifold.

Timestamps
00:00 Introduction to high-performance robotics challenges
00:15 The problem of unexplained robot behavior
00:54 Root cause analysis: policy vs. software
01:17 Designing a toy robotics system for analysis
01:24 System architecture: sensors, CPU, GPU, actuators, CAN bus
01:57 The initial, simple code loop
02:14 Expectation vs. reality: unexpected loop execution gaps
02:42 The impact of CAN bus data rate on loop execution
03:13 Potential solutions: accepting delay vs. multithreading
04:00 A new, pipelined design for reduced cycle time
04:32 New problems: "stuttering" and abnormal motor behavior
04:49 Data collection with external transceivers and "candump"
05:24 Expected vs. actual message plots: missed messages and jitter
06:12 Using cycle time plots to identify desynchronization
06:58 Transmit phase desynchronization: missed and queued data
08:03 Receive phase desynchronization: stale data and overcompensation
08:38 Resolving synchronization issues: kernel primitives and padding
09:25 The impact of logging on system performance
11:09 Reception and priority inversion
12:02 Conclusion and summary of key takeaways

Rishabh Garg
Robotics Engineer at Tesla Optimus
I am Rishabh Garg, a robotics software engineer pushing the boundaries of software-hardware integration to meet the ever-increasing demand for data. I have been working with robots and embedded systems for the past 4 years, making systems more reliable and performant at companies like Tesla and Amazon. I am eager to learn what experts in the industry are doing differently and to share my own experience and insights into the challenges frequently encountered at the system software level for robotics.