Experiment Overview
🌍 AppWorld Experiment
We tested ReMe on AppWorld using qwen3-8b:
| Method | pass@1 | pass@2 | pass@4 |
|---|---|---|---|
| without ReMe | 0.083 | 0.140 | 0.228 |
| with ReMe | 0.109 (+2.6%) | 0.175 (+3.5%) | 0.281 (+5.3%) |
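Pass@K measures the probability that at least one of K generated samples successfully completes the task (score=1). The gains in parentheses are absolute differences from the "without ReMe" row. This experiment uses an internal AppWorld environment, so results may differ slightly from those on the standard AppWorld environment.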
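The text above defines Pass@K but does not say how it is estimated from the collected samples. A minimal sketch, assuming the widely used unbiased estimator (draw `n` samples per task, count `c` successes, then compute `1 - C(n-c, k) / C(n, k)` and average over tasks); the function names and the choice of `n` are illustrative assumptions, not taken from the ReMe code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which scored 1.

    Assumption: this is the standard unbiased estimator; the experiment
    above may compute pass@k differently.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Average the per-task estimates over all tasks."""
    return sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)
```

For example, with 4 samples per task and exactly one success, `pass_at_k(4, 1, 1)` gives 0.25 and `pass_at_k(4, 1, 2)` gives 0.5.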
You can find more details on reproducing the experiment in quickstart.md.
🧊 FrozenLake Experiment
[Figure: side-by-side comparison, without ReMe (left) and with ReMe (right)]
We tested ReMe on 100 random FrozenLake maps using qwen3-8b:
| Method | pass rate |
|---|---|
| without ReMe | 0.66 |
| with ReMe | 0.72 (+6.0%) |
You can find more details on reproducing the experiment in quickstart.md.
🔧 BFCL-V3 Experiment
We tested ReMe on BFCL-V3 multi-turn-base (randomly split into 50 train / 150 val) using qwen3-8b:
| Method | pass@1 | pass@2 | pass@4 |
|---|---|---|---|
| without ReMe | 0.2472 | 0.2733 | 0.2922 |
| with ReMe | 0.3061 (+5.89%) | 0.3500 (+7.67%) | 0.3888 (+9.66%) |
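The 50 train / 150 val random split mentioned above can be made reproducible by fixing a seed. A minimal sketch, assuming the tasks are identified by string IDs; the ID format, the seed value, and the `split_tasks` helper are hypothetical and not taken from the ReMe repository:

```python
import random


def split_tasks(task_ids: list[str], n_train: int, seed: int = 0) -> tuple[list[str], list[str]]:
    """Shuffle task IDs with a fixed seed and cut them into train and val lists."""
    ids = sorted(task_ids)      # sort first so the result ignores input order
    rng = random.Random(seed)   # local RNG; leaves the global random state alone
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Placeholder IDs for illustration only (50 + 150 = 200 tasks).
task_ids = [f"task_{i}" for i in range(200)]
train_ids, val_ids = split_tasks(task_ids, n_train=50, seed=42)
```

Sorting before shuffling means the same seed always yields the same split, regardless of the order in which the task list was loaded.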
🛠️ Tool Memory Benchmark
We evaluated the effectiveness of Tool Memory with Qwen3-30B-Instruct on a controlled benchmark built around three mock search tools:
| Scenario | Avg Score | Improvement |
|---|---|---|
| Train (No Memory) | 0.650 | - |
| Test (No Memory) | 0.672 | Baseline |
| Test (With Memory) | 0.772 | +14.88% |
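The +14.88% figure matches a relative gain over the Test (No Memory) baseline, (0.772 - 0.672) / 0.672, whereas the gains in the other tables on this page match absolute differences (for example 0.109 - 0.083 = 0.026, reported as +2.6%). A small sketch of both calculations, with hypothetical helper names:

```python
def absolute_gain_pp(baseline: float, score: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return (score - baseline) * 100


def relative_gain_pct(baseline: float, score: float) -> float:
    """Improvement relative to the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative gain")
    return (score - baseline) / baseline * 100


# Tool Memory table: relative gain over the no-memory test baseline.
tool_memory_gain = round(relative_gain_pct(0.672, 0.772), 2)

# AppWorld pass@1: absolute gain in percentage points.
appworld_gain = round(absolute_gain_pp(0.083, 0.109), 1)
```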
Key Findings:
- Tool Memory enables data-driven tool selection based on historical performance
- Success rates improved by ~15% with learned parameter configurations
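To illustrate what selecting a tool from historical performance can look like, here is a deliberately simplified sketch. It is not ReMe's Tool Memory implementation: the `ToolHistory` class, its methods, and the tool names are all hypothetical, and it ranks tools only by their mean recorded score.

```python
from collections import defaultdict


class ToolHistory:
    """Hypothetical store of past scores per tool (not the ReMe API)."""

    def __init__(self) -> None:
        self._scores: dict[str, list[float]] = defaultdict(list)

    def record(self, tool: str, score: float) -> None:
        """Remember the score a tool achieved on one past call."""
        self._scores[tool].append(score)

    def average(self, tool: str) -> float:
        """Mean past score for a tool; 0.0 if it has never been used."""
        scores = self._scores.get(tool)
        return sum(scores) / len(scores) if scores else 0.0

    def best_tool(self, candidates: list[str]) -> str:
        """Pick the candidate with the highest mean past score."""
        return max(candidates, key=self.average)


history = ToolHistory()
history.record("search_a", 0.4)
history.record("search_a", 0.6)
history.record("search_b", 0.9)
chosen = history.best_tool(["search_a", "search_b", "search_c"])
```

A purely greedy choice like this never tries unused tools again, so a real system would likely add some exploration or fall back to a default when the history is sparse.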
You can find more details in tool_bench.md and the implementation at run_reme_tool_bench.py.

