Skip to main content
Loading...

Capstone: Simulated Humanoid Robot with Conversational AI

1. Heading Breakdown: Analytical Deconstruction

The title "Capstone: Simulated Humanoid Robot with Conversational AI" represents the convergence of every discipline in this textbook. It is the realization of the "Embodied Intelligence" dream.

"Capstone"

The Apex Stone: In architecture, the capstone is the final stone placed at the top of an arch or pyramid. Without it, the structure is incomplete. In this course, the Capstone is the System Integration project.

  • Holism: You have built individual modules (ROS 2 nodes, Simulation, Perception). Now you must bind them together. The failure modes of integration are complex (e.g., The Vision node steals too much GPU, causing the Balance node to miss a deadline).

"Simulated Humanoid Robot"

The Unitree G1 Digital Twin: We use the simulation (Gazebo/Isaac Sim) as the proving ground.

  • Kinematic Chains: The G1 is a branching tree of joints. The "Base" connects to the "Torso," which branches to "Head," "Left Arm," "Right Arm," "Left Leg," and "Right Leg."
  • Whole-Body Control (WBC): Unlike a mobile base where you just drive wheels, a humanoid requires coordination. If you reach your arm out, your center of mass shifts. The legs must adjust automatically to counter-balance. This coupling is the essence of humanoid control.

"Conversational AI"

The Interface of the Future: We are moving away from joysticks and keyboards. The interface is Natural Language.

  • LLM (Large Language Model): The cognitive engine (e.g., GPT-4, Llama-3). It handles the semantics and planning.
  • Prompt Engineering for Robotics: We don't just chat; we inject a "System Prompt" that defines the robot's capabilities.
    • Input: "Get me the red drill."
    • LLM Output: [SEARCH(red_drill), GRASP(red_drill), RETURN_TO_USER()].
  • Grounding: The hardest problem. The LLM knows what a "drill" is conceptally. The Vision system knows what a "red blob" is pixels. Grounding is the mapping between the Concept and the Pixels.

"Simulated... with..." (The VLA Loop)

Vision-Language-Action (VLA): This is the modern paradigm for AI robotics.

  1. Vision: See the world state ($S_t$).
  2. Language: Reason about the goal ($G$).
  3. Action: Generate the control policy ($A_t$). The loop is closed: S_t+1 <- Physics(S_t, A_t).

2. Assessment Challenge: The "Butler" Scenario

Your goal is to demonstrate a complete mission loop in simulation.

The Mission

  1. Start: Robot is in the "Kitchen" zone of the simulation.
  2. Command: User types/speaks: "I spilled some water. Find the sponge and bring it here."
  3. Execution:
    • Planner: LLM decomposes "Find sponge" -> Nav2(Kitchen_Counter) -> Detect(Sponge).
    • Navigation: Robot walks to the counter (avoiding the table).
    • Perception: Robot scans the surface, identifies the sponge (YOLO/Isaac).
    • Manipulation: Robot reaches out (MoveIt 2), grasps the sponge, and lifts it.
    • Return: Robot walks back to the user.

Requirements

  1. The "Brain" Node:

    • A Python node that interfaces with an LLM API (OpenAI or Local Llama).
    • It must maintain a State Machine (Idle, Planning, Moving, Grasping, Error).
  2. The "Safety" Layer:

    • The LLM is non-deterministic (it hallucinates).
    • You must implement a Pre-Condition Checker. If the LLM says GRASP, the checker asks: "Is the arm within 50cm of the object?" If no, reject the command.
  3. The "Body" Layer:

    • Use Nav2 for moving the base.
    • Use MoveIt 2 for moving the arm.
    • Synchronize them. (Stop walking before reaching).

Submission Artifacts

  • Video Demo: A screen recording of the simulation executing the full sequence.
  • System Logs: The ros2 bag recording of the topics.
  • Failure Analysis: A document listing 3 times the robot failed and how you fixed it (e.g., "Robot tried to grasp through the table -> Added Collision Object to MoveIt").

3. Analytical Deep Dive: The Hierarchy of Autonomy

We are building a Hierarchical Control System.

  • Level 1: Reflexes (500Hz). Keeping balance. Joint impedance. (Hardware/Firmware).
  • Level 2: Skills (10Hz). "Walk forward," "Grasp." (ROS 2 Nodes).
  • Level 3: Tactics (1Hz). "Navigate to kitchen." (Nav2/Behavior Trees).
  • Level 4: Strategy (0.1Hz). "Clean the spill." (LLM).

Research Question: Where does the VLA model fit?

  • End-to-End: Research like Google's RT-2 suggests merging Levels 2, 3, and 4 into a single Neural Network.
  • Hybrid: For this Capstone, we use a Hybrid approach. The LLM handles Strategy, but classical ROS 2 handles Reflexes. Why? Because LLMs are too slow (latency) and unpredictable for keeping a robot from falling over.

4. Conclusion

This Capstone is your portfolio piece. It proves you are not just a coder, but a Roboticist. You can take a vague human desire ("Clean this up") and translate it into torque, voltage, and motion. You have bridged the gap between the Digital Mind and the Physical Body.