Problem: Developing autonomous home robots controlled by natural language has long been a human pursuit. While advances in large language models (LLMs) and embodied intelligence bring this goal closer, several challenges persist: 1) the lack of a unified benchmark for more complex robot tasks; 2) limited evaluation methods and metrics; 3) data incompatibility between LLMs and mobile manipulation trajectories.
EMMOE: Embodied Mobile Manipulation in Open Environments (EMMOE) requires robots to explore environments and perform various open-vocabulary mobile manipulation tasks based solely on language instructions and sensor observations.
Method: We collect EMMOE-100, the first daily-task dataset featuring Chain-of-Thought (CoT) outputs, diverse task designs, detailed re-plan processes, and SFT and Direct Preference Optimization (DPO) sub-datasets for LLM training. Additionally, we propose three new metrics for more comprehensive assessment. Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM trained with DPO, lightweight navigation and manipulation models, and multiple error detection and adaptation mechanisms.
Evaluation: We demonstrate how to construct new datasets based on EMMOE-100. We also show HomieBot's performance and evaluations of different models and policies. Finally, we provide in-depth result analysis, visualizations, and case studies.
Data Collection. We design 100 daily mobile manipulation tasks based on different episodes from the Replica Challenge. To enhance task diversity and better align with human demands, we define five task attributes, and one task can possess multiple attributes simultaneously. We then manually control a Fetch robot in Habitat-Lab 2.0 to complete all tasks and decompose the trajectories into discrete subtasks. Each subtask consists of a pre-defined action, a target, and a low-level model selection, as sketched below.
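As a concrete illustration, each subtask can be viewed as a small record; the field names below are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    """One decomposed step of an EMMOE trajectory (illustrative schema)."""
    action: str  # pre-defined action, e.g. "Go to", "Pick", "Place", "Open", "Close"
    target: str  # open-vocabulary target object or location
    model: str   # low-level model/policy selected to execute this subtask

# Example: one step of a collected trajectory (hypothetical model name)
step = Subtask(action="Pick", target="banana", model="M3-pick")
```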
Dataset Features. All subtasks are annotated with four first-person-view images and detailed reasoning processes. Moreover, we intentionally design some failed subtasks and provide corresponding re-plans to enhance dataset robustness. A key feature of EMMOE-100 is its emphasis on the reasoning process and interleaved execution. In the example task shown, the agent must check the fridge first; otherwise, even if the agent eventually gets a banana from the kitchen, the trajectory will not be counted as a success.
SFT Augmentation. All failed subtasks are skipped first, as they are treated as junk data for the SFT dataset. We then use GPT-4o to rewrite the text descriptions of tasks and the analysis of each subtask three times.
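A minimal sketch of this augmentation pass, assuming a trajectory layout with per-subtask success flags and a hypothetical `rewrite` callable wrapping a GPT-4o paraphrase request:

```python
def build_sft_dataset(trajectories, rewrite):
    """Build the SFT split: drop failed subtasks, then paraphrase three times."""
    samples = []
    for traj in trajectories:
        # Failed subtasks are skipped: they are junk data for SFT.
        kept = [st for st in traj["subtasks"] if st["success"]]
        for _ in range(3):  # three rewritten variants per task
            samples.append({
                "task": rewrite(traj["instruction"]),
                "subtasks": [dict(st, analysis=rewrite(st["analysis"])) for st in kept],
            })
    return samples
```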
DPO Augmentation. For the i-th subtask and its input instruction \( I_i \), if the execution of output \( O_i \) fails but the next output \( O_{i+1} \) succeeds after re-planning, we choose \( I_i \) as the prompt, \( O_i \) as the rejected response, and \( O_{i+1} \) as the chosen response. To obtain more data, we use the following augmentation methods: 1) Order Change: shuffle the order of successful subtasks, treating the successful output \( O_i \) as chosen and \( O_{i+1} \) as rejected. 2) Action Change: replace actions in subtasks with non-standard names or actions outside the available list. 3) Model Change: replace the model choice with models of the same type from the model list.
We also provide all data transformation and augmentation scripts here.
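For illustration, the core pair-construction rule might look like the sketch below; the field names are assumptions, not the released scripts' actual interface:

```python
def build_dpo_pairs(trajectory):
    """Collect (prompt, chosen, rejected) triples from re-plan events."""
    pairs = []
    steps = trajectory["subtasks"]
    for i in range(len(steps) - 1):
        cur, nxt = steps[i], steps[i + 1]
        # A failed output followed by a successful re-plan yields one triple.
        if not cur["success"] and nxt["success"] and nxt.get("is_replan"):
            pairs.append({
                "prompt": cur["instruction"],  # I_i
                "rejected": cur["output"],     # O_i (failed)
                "chosen": nxt["output"],       # O_{i+1} (succeeded after re-plan)
            })
    return pairs
```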
To better measure the task execution process and the interrelations among subtasks, we propose Task Progress (TP), which is calculated as follows:
\( TP = \max_{k_i \in K_T} \left( \frac{\text{len}(k_i^{\text{check}})}{\text{len}(k_i)} \right) \)
A keypath is defined as an ordered node set of all necessary subtasks required to complete a task; \( k_i \) is the i-th keypath in the keypath set \( K_T \) for task T. Each task is assigned several keypaths, representing different ways to complete the task. We strictly match the execution trajectory against the subtask nodes in \( k_i \) in sequential order. Once a node in \( k_i \) is successfully matched, it is added to another ordered set \( k_i^{\text{check}} \), and the ratio between the lengths of \( k_i^{\text{check}} \) and \( k_i \) is recorded. This process is repeated for all keypaths in \( K_T \), and the highest ratio becomes the TP value of the trajectory. Only if TP reaches 100% is the trajectory considered successful. TP captures both the flexibility of the execution process and the relationships between steps. Evaluating via natural language and execution results also simplifies new task design and enables evaluation in real-world scenarios, where writing PDDL files is impractical.
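A minimal sketch of the TP computation, assuming the trajectory is given as the ordered list of successfully executed subtask nodes:

```python
def task_progress(trajectory, keypaths):
    """TP: the best in-order match ratio over all keypaths K_T of a task."""
    best = 0.0
    for k in keypaths:                 # k is one keypath k_i (ordered node list)
        checked = []                   # k_i^check: nodes of k_i matched in order
        idx = 0
        for node in trajectory:        # strict sequential matching
            if idx < len(k) and node == k[idx]:
                checked.append(node)
                idx += 1
        best = max(best, len(checked) / len(k))
    return best  # the trajectory is successful only if TP == 1.0
```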
A fully autonomous robot should be able to actively terminate execution at the proper moment. Otherwise, even if the task is already done, the robot may keep running and get stuck in an endless loop. Therefore, we propose Success End Rate (SER) to evaluate whether the agent can understand its current situation and reasonably determine the appropriate timing for task termination. The calculation is as follows:
\( SER = \frac{\text{len}(S) }{\sum_{t\in M} \text{count}_t(\text{end}) } \)
t represents a single trajectory and M is the set of trajectories for all tasks. \( \text{count}_t(\text{end}) \) equals 1 if "End" is the final action of t and 0 otherwise. S is the set of successful trajectories, i.e., those whose TP equals 100%. SER is thus the ratio of the number of successful trajectories to the number of trajectories the agent deemed successful. Once SER reaches a certain threshold, or even 100%, auxiliary methods or metrics are no longer needed to calculate SR.
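A minimal sketch, assuming each trajectory record carries its TP value and action sequence:

```python
def success_end_rate(trajectories):
    """SER: successful trajectories / trajectories the agent itself ended."""
    ended = sum(1 for t in trajectories if t["actions"][-1] == "End")
    success = sum(1 for t in trajectories if t["tp"] == 1.0)  # the set S
    return success / ended if ended else 0.0
```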
Execution failures are common in the real world, especially in unfamiliar environments, which makes the ability to quickly recover from failures and continuously adapt to new environments a crucial skill. To measure the adaptation and generalization abilities of the agent, we propose Success Re-plan Rate (SRR), which is calculated as follows:
\( SRR = \frac{\sum_{t\in S} \text{count}_t(\text{replan}) }{\sum_{t\in M} \text{count}_t(\text{replan}) } \)
\( \text{count}_t(\text{replan}) \) is the number of re-plans in trajectory t; other symbol definitions are the same as for SER. SRR represents the effectiveness of re-planning and the adaptability of the agent. When SRR reaches 100%, it indicates that the agent can adapt to all failures and then successfully complete the task.
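Under the same assumed trajectory layout, with a per-trajectory re-plan counter:

```python
def success_replan_rate(trajectories):
    """SRR: re-plans inside successful trajectories / re-plans overall."""
    total = sum(t["num_replans"] for t in trajectories)
    in_success = sum(t["num_replans"] for t in trajectories if t["tp"] == 1.0)
    return in_success / total if total else 0.0
```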
HomieBot employs a hierarchical framework with communication mechanisms for interleaved execution. High-Level Planning (HLP) deals with embodied decision making and planning adaptation, while Low-Level Execution (LLE) handles continuous execution and provides feedback to HLP.
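A sketch of one interleaved episode under this framework; the `planner` and `executor` interfaces are assumptions for illustration, and the 20-step cap mirrors the evaluation setting:

```python
def run_episode(planner, executor, instruction, max_steps=20):
    """Interleaved HLP-LLE loop: plan a subtask, execute it, feed back errors."""
    feedback = None
    for _ in range(max_steps):
        # HLP: decide the next subtask from the instruction, observations, and feedback
        subtask = planner.plan(instruction, executor.observe(), feedback)
        if subtask.action == "End":   # the agent actively terminates
            break
        # LLE: run the selected low-level model and report success or an error type
        ok, error = executor.execute(subtask)
        feedback = None if ok else error  # an error triggers re-planning in HLP
```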
To facilitate communication with HLP and provide more detailed error information, we further classify common errors into four main types with several subtypes each:
1) Logical error. L1: the agent's hands are already full but it still attempts to pick/open/close; L2: the agent holds nothing but attempts to put; L3: the agent attempts to pick/put an object in a closed container; L4: the agent attempts to interact with a non-interactive object.
2) Distance error. D1: the agent stands too far away and cannot reach the target; D2: the agent is too close to the target and its arm is hindered from properly extending during interaction.
3) Format error. F1: the output action or model is not in the available list; F2: the output target does not exist in the current scene or cannot be recognized by low-level models.
4) Execution error. E1: the limited capabilities of the low-level models or policies cause the failure; E2: failed execution accidentally updates the inventory information.
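One possible encoding of this taxonomy as structured feedback from LLE to HLP; the messages are paraphrases of the definitions above:

```python
from enum import Enum

class ErrorType(Enum):
    """Error codes LLE can report back to HLP."""
    L1 = "hands already full but attempts to pick/open/close"
    L2 = "holds nothing but attempts to put"
    L3 = "attempts to pick/put an object in a closed container"
    L4 = "attempts to interact with a non-interactive object"
    D1 = "stands too far away to reach the target"
    D2 = "too close; arm hindered from extending during interaction"
    F1 = "output action or model not in the available list"
    F2 = "target absent from the scene or unrecognized by low-level models"
    E1 = "low-level model/policy capability caused the failure"
    E2 = "failed execution accidentally updated the inventory information"
```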
We select 90 tasks from EMMOE-100 for training, choose Video-LLaVA-7B as our base model, and conduct a two-stage training process: in the first stage, we fine-tune the base model (SFT); in the second stage, we align the fine-tuned model with DPO.
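The DPO stage could be run, for example, with Hugging Face TRL. The sketch below uses a text-only causal LM for brevity (Video-LLaVA itself is multimodal), and paths and arguments are placeholders; argument names vary across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Load the stage-one (SFT) checkpoint; path is a placeholder.
model = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-checkpoint")

# DPO sub-dataset with "prompt"/"chosen"/"rejected" columns, as built above.
dataset = load_dataset("json", data_files="emmoe_dpo.json")["train"]

trainer = DPOTrainer(
    model=model,                                  # reference model is created internally
    args=DPOConfig(output_dir="homiebot-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,                   # named `tokenizer` in older TRL versions
)
trainer.train()
```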
Metrics. In addition to SR, TP, SER and SRR, we also choose Path Length Weighted SR (PLWSR) as one of our evaluation metrics. PLWSR measures the ability gap between the agent and the expert in successful trajectories.
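For reference, a common path-length-weighted formulation (in the style of ALFRED's PLWSR; the exact form used here may differ) is

\( PLWSR = \frac{1}{\text{len}(M)} \sum_{t \in S} \frac{\text{len}(t^{*})}{\max\left(\text{len}(t),\ \text{len}(t^{*})\right)} \)

where \( t^{*} \) is the expert trajectory for the same task as t, so successful trajectories are discounted by how much longer they are than the expert's.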
Baselines. We select GPT-4o, Gemini-1.5-Pro, Qwen2-VL-7B, and MiniCPM-V 2.6 as baseline high-level planners. By leveraging in-context learning and minor adaptations, these models can be easily deployed in our system. For the low-level executor, we extract individual skills from M3, modify their implementations, and pass environmental information between executions to ensure consistency. Additionally, the robot arm is reset after each execution to improve the success rate.
Evaluation Benchmarks. All 100 EMMOE-100 tasks are used for evaluation; the ten untrained tasks serve as the test split. Each task is executed three times with a maximum step limit of 20 per run, and the averaged results are used for the final calculation. (Note: we report the average success rate over trajectories rather than the number of tasks completed.)
Model | SR | PLWSR | TP | SRR | SER |
---|---|---|---|---|---|
GPT-4o | 13.33 | 10.51 | 29.79 | 3.57 | 49.38 |
Gemini-1.5-Pro | 17.33 | 14.79 | 38.03 | 3.39 | 55.91 |
Qwen2-VL-7B | 1.00 | 0.50 | 16.55 | 0.59 | 25.00 |
MiniCPM-V 2.6 | 0.67 | 0.57 | 14.45 | 0.06 | 40.00 |
HomieBot-7B (SFT) | 27.67 | 20.88 | 50.27 | 9.23 | 53.90 |
HomieBot-7B (SFT+DPO) | 30.30 | 24.66 | 51.39 | 8.72 | 60.81 |
Model | Split | SR | PLWSR | TP | SRR | SER |
---|---|---|---|---|---|---|
HomieBot (SFT) | Train | 28.52 | 21.49 | 50.16 | 9.59 | 53.85 |
HomieBot (SFT) | Test | 20.00 | 15.36 | 51.19 | 6.55 | 54.55 |
HomieBot (SFT+DPO) | Train | 31.84 | 25.82 | 52.29 | 9.69 | 60.71 |
HomieBot (SFT+DPO) | Test | 16.67 | 14.36 | 43.39 | 3.08 | 62.50 |
Empirical Findings
After each trial, HomieBot automatically saves trajectory videos from different views. To make the entire pipeline more intuitive, we present raw experimental videos of three tasks: the first two are executed by the SFT version and the third by the DPO version.
To further explore the reasons for the overall low success rate and demonstrate how HomieBot can be used to evaluate both HLP and LLE simultaneously, we collect all errors that occurred during the experiments and conduct an in-depth analysis.
Comprehensive error types allow us to evaluate HLP and LLE separately. We further classify execution errors by action type and count the total occurrences of each action. The Pick action clearly has a significantly lower success rate and the highest proportion of execution errors compared to the other actions.
Metric | Go to | Pick | Place | Open | Close |
---|---|---|---|---|---|
Error proportion (%) | 38.49 | 49.77 | 7.30 | 3.32 | 1.11 |
SR (%) | 45.32 | 22.45 | 40.97 | 43.13 | 36.45 |
We also evaluate SR for each type of task. Short-Horizon tasks are relatively easy due to their straightforward processes and fewer overall steps. The most challenging are Open-Ended tasks, which usually involve a long total step count with flexible processes and results, demanding strong capabilities from both HLP and LLE models.
Model | Short-Horizon | Long-Horizon | Open-Ended | Logical | Human-Style |
---|---|---|---|---|---|
HomieBot (SFT) | 43.75 | 24.60 | 18.52 | 34.01 | 25.24 |
HomieBot (SFT+DPO) | 41.67 | 28.11 | 15.38 | 35.86 | 27.88 |
@misc{li2025emmoecomprehensivebenchmarkembodied,
  title={EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments},
  author={Dongping Li and Tielong Cai and Tianci Tang and Wenhao Chai and Katherine Rose Driggs-Campbell and Gaoang Wang},
  year={2025},
  eprint={2503.08604},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2503.08604}
}