E2Map: Experience-and-Emotion Map
for Self-Reflective Robot Navigation with Language Models


1 Seoul National University
2 Carnegie Mellon University
3 Stanford University

*Indicates Equal Contribution

Abstract

Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language-based instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments and immediate corrections by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance compared to existing LLM-based approaches by adjusting the agent's behavior in a one-shot manner.

Concept of E2Map

E2Map is a spatial map that captures the agent's emotional responses to its experiences. Our method enables one-shot behavior adjustments in stochastic environments by updating the E2Map through the diverse capabilities of large language models (LLMs) and a large multimodal model (LMM).
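To make this concrete, below is a minimal sketch (our own illustration, not the released implementation) of such a map: a grid whose cells store a visual-language feature vector and a scalar emotion parameter. The class name, the feature dimension, and the Gaussian spread used in the update are all assumptions.

import numpy as np

class E2MapSketch:
    # Illustrative grid map: each cell holds a visual-language feature vector
    # and a scalar emotion parameter (names are ours, not the paper's).

    def __init__(self, height, width, feat_dim=512):
        self.features = np.zeros((height, width, feat_dim))  # visual-language embeddings
        self.emotion = np.zeros((height, width))              # emotion parameter per cell

    def write_feature(self, row, col, feat):
        # Embed a visual-language feature into the corresponding grid cell.
        self.features[row, col] = feat

    def update_emotion(self, row, col, intensity, sigma=3.0):
        # Raise the emotion around an event location; the Gaussian spread is a
        # simple assumption about how far an experience should influence the map.
        ys, xs = np.indices(self.emotion.shape)
        dist_sq = (ys - row) ** 2 + (xs - col) ** 2
        self.emotion += intensity * np.exp(-dist_sq / (2.0 * sigma ** 2))

Raising the emotion values around a troublesome location and replanning over the updated map is what yields the one-shot behavior adjustment described above.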

System Overview

System Overview: (a) The E2Map is created by embedding visual-language features and emotion parameters into the corresponding grid cells. (b) When a user provides a language instruction, an LLM generates code that calls goal-selection APIs to define the goals to reach. (c) The planning algorithm then uses emotion as a cost function to determine the optimal path to the goal. (d) If the agent encounters unexpected events during navigation, the E2Map is updated through the sequential operation of the event descriptor and the emotion evaluator. Following the update, the planning algorithm replans the path to adjust the agent's behavior in a one-shot manner.
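As a rough illustration of step (c), the sketch below (our simplification, not the authors' planner) runs Dijkstra over the grid while adding each cell's emotion value to its traversal cost, so paths naturally avoid regions where strong emotions have been recorded. The function name, the 4-connected neighborhood, and the unit step cost are assumptions.

import heapq
import numpy as np

def plan_with_emotion_cost(emotion, start, goal, step_cost=1.0):
    # Dijkstra on a 2D grid: entering a cell costs step_cost plus its emotion
    # value, so high-emotion cells are traversed only when unavoidable.
    h, w = emotion.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[start] = 0.0
    frontier = [(0.0, start)]
    while frontier:
        d, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            break
        if d > dist[r, c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + step_cost + emotion[nr, nc]
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(frontier, (nd, (nr, nc)))
    if not np.isfinite(dist[goal]):
        return []  # goal unreachable on this map
    # Walk back from the goal to reconstruct the path.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

After the E2Map update in step (d), re-running such a planner on the new emotion values produces the replanned path that adjusts the agent's behavior.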

Experiments in Simulated Environments

Gazebo environment and Initial E2Map

We created a simulated environment that mirrored the real-world setting used for evaluation. We scanned the real-world setup using a 3D scanner and transferred the 3D model to the Gazebo simulator. The corresponding initial E2Map is displayed in the figure on the right.

Experimental Scenarios

In the simulated environment, we designed three scenarios to assess our method. First, after building the initial E2Map, we introduced static obstacles, such as danger signs (danger sign), to evaluate the method's ability to adapt to environmental changes. Second, we positioned a human figure behind a wall and had them step out unexpectedly (human-wall). Third, we added a dynamic door that opened unexpectedly as the robot approached (dynamic door). The human-wall and dynamic door scenarios tested whether our method could adjust behavior based on experiences with dynamic events, improving navigation performance.

Quantitative Results

Qualitative Results

Qualitative results are shown for each of the three scenarios: danger sign, human-wall, and dynamic door.

Qualitative Results from the Event Descriptor and Emotion Evaluator

For each scenario (danger sign, human-wall, dynamic door), we show the event images, the event description generated by the event descriptor, and the corresponding emotion evaluation.

Experiments in Real-World Environment

Real-World Setup

To evaluate the scalability and applicability of our method in real-world settings, we set up a real-world environment by placing objects such as a sofa, table, refrigerator, and microwave in a conference room at Seoul National University. We used the same language instructions as in simulation and employed real humans and danger signs to replicate the three scenarios.

Hardware Setup

For the real-world experiments, we used a Unitree Go1 quadruped robot equipped with an Intel RealSense L515 RGB-D camera, a Velodyne VLP-16 3D LiDAR, and an Intel NUC 13 with an i7 CPU for onboard computation. The navigation algorithm runs on the Intel NUC, while all other algorithms are executed on a server with four RTX 4090 GPUs. The Intel NUC and the server communicate remotely via Wi-Fi.
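The page does not state which middleware connects the two machines; purely as a hypothetical illustration of this onboard/off-board split, a minimal request-reply exchange (using ZeroMQ here, with an invented address and message format) could look like the following, where the NUC ships an event image to the GPU server and receives the description and emotion evaluation back.

import zmq

def request_emotion_update(image_bytes, server_addr="tcp://192.168.0.10:5555"):
    # Hypothetical client running on the Intel NUC: send a captured event image
    # to the GPU server over Wi-Fi and receive the event description and
    # emotion evaluation as a JSON reply. Address and schema are invented.
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)   # simple request-reply pattern
    sock.connect(server_addr)
    sock.send(image_bytes)
    reply = sock.recv_json()     # e.g. {"description": "...", "emotion": 0.8}
    sock.close()
    return reply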

Quantitative Results

BibTeX


@misc{kim2024e2mapexperienceandemotionmapselfreflective,
      title={E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models}, 
      author={Chan Kim and Keonwoo Kim and Mintaek Oh and Hanbi Baek and Jiyang Lee and Donghwi Jung and Soojin Woo and Younkyung Woo and John Tucker and Roya Firoozi and Seung-Woo Seo and Mac Schwager and Seong-Woo Kim},
      year={2024},
      eprint={2409.10027},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2409.10027}, 
}
      

Acknowledgements

This research was funded by the Korean Ministry of Land, Infrastructure and Transport (MOLIT) through the Smart City Innovative Talent Education Program and by the Korea Institute for Advancement of Technology (KIAT) under a MOTIE grant (P0020536). Additional support came from the Ministry of Education (MOE) and the National Research Foundation of Korea (NRF). K. Kim, D. Jung, and the corresponding author are affiliated with the Smart City Global Convergence program. Research facilities were provided by the Institute of Engineering Research at Seoul National University.