Tool Memory Benchmark¶
Overview¶
This benchmark evaluates Tool Memory effectiveness by comparing agent performance with and without tool memory across multiple epochs. The experiment uses mock search tools with varying performance characteristics for different query complexities.
Experimental Setup¶
Mock Search Tools¶
Three LLM-based mock search tools with different performance profiles:
| Tool | Simple Queries | Medium Queries | Complex Queries |
|---|---|---|---|
| SearchToolA | ⭐⭐⭐ Fast, high success (90%) | ❌ Poor (20% success) | ⚠️ Weak (50% success) |
| SearchToolB | ⚠️ Over-engineered (30%) | ⭐⭐⭐ Optimal (90% success) | ⚠️ Limited (50% success) |
| SearchToolC | ⚠️ Overkill (30%) | ⚠️ Excessive (40%) | ⭐⭐⭐ Best (90% success) |
Performance Characteristics:
- `success_rate`: Probability of successful execution (vs. a "Service busy" error)
- `relevance_ratio`: Probability of returning relevant results (vs. random content)
- `extra_time`: Simulated latency (currently 0 in the implementation)
Each tool uses an LLM to classify query complexity and generate an appropriate response.
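To make the probabilistic behavior concrete, here is a minimal sketch of how a mock tool could apply `success_rate` and `relevance_ratio`. The class and method names are illustrative assumptions; the actual implementation lives in `reme_ai/agent/tools/mock_search_tools.py`.

```python
import random


class MockSearchTool:
    """Illustrative mock tool: fails or returns irrelevant content probabilistically."""

    def __init__(self, name: str, success_rate: float, relevance_ratio: float, extra_time: float = 0.0):
        self.name = name
        self.success_rate = success_rate
        self.relevance_ratio = relevance_ratio
        self.extra_time = extra_time  # simulated latency, 0 in the benchmark

    def run(self, query: str) -> dict:
        # Fail with probability (1 - success_rate), mimicking a busy service
        if random.random() > self.success_rate:
            return {"success": False, "content": "Service busy"}
        # Return random filler with probability (1 - relevance_ratio)
        if random.random() > self.relevance_ratio:
            return {"success": True, "content": "unrelated filler content"}
        return {"success": True, "content": f"relevant results for: {query}"}


# Example: SearchToolA's profile on simple queries
tool_a_simple = MockSearchTool("SearchToolA", success_rate=0.9, relevance_ratio=0.9)
```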
Query Dataset¶
Source: cookbook/tool_memory/query.json
Train Set: 20 queries per complexity level × 3 levels = 60 queries
Test Set: 20 queries per complexity level × 3 levels = 60 queries
Complexity Levels: simple, moderate, complex
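The exact JSON layout is defined by the cookbook; as an assumption for illustration, the sketch below treats `query.json` as a mapping from complexity level to a list of query strings and splits each level into train and test halves.

```python
import json
from pathlib import Path

# Hypothetical layout: {"simple": [...], "moderate": [...], "complex": [...]}
queries = json.loads(Path("cookbook/tool_memory/query.json").read_text())

train_queries, test_queries = [], []
for level in ("simple", "moderate", "complex"):
    items = queries.get(level, [])
    train_queries += [(level, q) for q in items[:20]]    # first 20 per level -> train
    test_queries += [(level, q) for q in items[20:40]]   # next 20 per level -> test

print(len(train_queries), len(test_queries))  # 60 60 when each level has 40+ queries
```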
Benchmark Workflow¶
Single Epoch Process¶
Each epoch consists of 5 steps:
Step 1: Train without Memory¶
# Execute all train queries on TRAIN_WORKSPACE
# Agent selects tools without historical guidance
run_use_mock_search(TRAIN_WORKSPACE, train_queries, prompt_template)
# Add results to memory and get scored results
train_scored_results = add_tool_call_results(TRAIN_WORKSPACE, train_results)
Step 2: Test without Memory¶
# Execute all test queries on TEST_WORKSPACE (fresh workspace)
# Baseline performance without tool memory
run_use_mock_search(TEST_WORKSPACE, test_queries, prompt_template)
# Add results to memory (will be cleared in Step 4)
test_scored_results = add_tool_call_results(TEST_WORKSPACE, test_results)
Step 3: Summarize Tool Memory¶
# Summarize tool performance from TRAIN_WORKSPACE
summarize_tool_memory(TRAIN_WORKSPACE, "SearchToolA,SearchToolB,SearchToolC")
# Retrieve formatted tool memory content
memories = retrieve_tool_memory(TRAIN_WORKSPACE, tool_names)
The summarization produces memory content including:
Best/worst use cases per tool
Statistical metrics (avg score, success rate, token cost, time cost)
Usage recommendations
Step 4: Test with Memory¶
# Clear TEST_WORKSPACE to start fresh
delete_workspace(TEST_WORKSPACE)
# Inject tool memory into prompt
prompt_with_memory = f"Tool Information\n{memories}\nMust select one tool to answer\nQuery\n{query}"
# Execute test queries with memory guidance
run_use_mock_search(TEST_WORKSPACE, test_queries, prompt_with_memory)
# Add results and get scored results
test_scored_results_with_memory = add_tool_call_results(TEST_WORKSPACE, test_results)
Step 5: Compare Results¶
# Generate comparison table
print_comparison_table([train_no_memory_stats, test_no_memory_stats, test_with_memory_stats])
# Calculate improvements (baseline: test without memory)
improvements = calculate_improvements(test_no_memory_stats, test_with_memory_stats)
print_improvements(improvements)
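The prerequisites include `tabulate`, so the comparison table can be rendered along the lines below. The stats dictionaries and this simplified `print_comparison_table` are assumptions for illustration, not the script's exact implementation; the example values come from the sample output shown later.

```python
from tabulate import tabulate


def print_comparison_table(stats_list):
    """Render per-scenario stats as a grid table (simplified sketch)."""
    rows = [
        [s["scenario"], s["total_calls"], f'{s["avg_score"]:.3f}']
        for s in stats_list
    ]
    print(tabulate(rows, headers=["Scenario", "Total Calls", "Avg Score"], tablefmt="grid"))


print_comparison_table([
    {"scenario": "Epoch1 - Train (No Memory)", "total_calls": 60, "avg_score": 0.650},
    {"scenario": "Epoch1 - Test (No Memory)", "total_calls": 60, "avg_score": 0.633},
    {"scenario": "Epoch1 - Test (With Memory)", "total_calls": 60, "avg_score": 0.817},
])
```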
Multi-Epoch Execution¶
# Run benchmark with 3 epochs
python cookbook/tool_memory/run_reme_tool_bench.py
# Test mode (5 queries per complexity level)
main(test_mode=True, run_epoch=3)
# Full mode (20 queries per complexity level)
main(test_mode=False, run_epoch=3)
Key Components¶
1. Tool Selection: UseMockSearchOp¶
# The agent uses an LLM to select an appropriate tool for the query
tool_call = await self.select_tool(query, [SearchToolA(), SearchToolB(), SearchToolC()])

# Execute the selected tool and record the call as a ToolCallResult
result = ToolCallResult(
    create_time=timestamp,
    tool_name=tool_call.name,
    input={"query": query},
    output=content,
    token_cost=token_cost,
    success=success,
    time_cost=time_cost,
)
2. Tool Call Result Evaluation¶
Results are automatically evaluated and scored:
- `score`: 0.0 (failure/irrelevant) or 1.0 (complete success)
- `success`: Tool execution status
- `summary`: Brief description
- `evaluation`: Detailed assessment
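For illustration, a scored result could look roughly like the dictionary below. The field names follow the list above; the concrete values are invented for the example.

```python
scored_result = {
    "tool_name": "SearchToolA",
    "input": {"query": "What is the capital of France?"},
    "output": "Paris is the capital of France ...",
    "success": True,                       # tool execution status
    "score": 1.0,                          # 0.0 failure/irrelevant, 1.0 complete success
    "summary": "Relevant answer returned quickly",
    "evaluation": "Tool succeeded and the content fully answers the query",
}
```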
3. Tool Memory Schema¶
ToolMemory(
    workspace_id="workspace_id",
    memory_type="tool",
    when_to_use="Brief usage scenario description",
    content="Detailed performance analysis and recommendations",
    score=0.85,
    tool_call_results=[...],  # list of ToolCallResult objects
    metadata={"tool_name": "SearchToolA"},
)
Evaluation Metrics¶
Per-Scenario Metrics¶
Avg Score: Average quality score (0.0-1.0)
Total Calls: Number of tool invocations
Success Rate: Percentage of successful executions
Improvement Calculation¶
improvement_percentage = ((with_memory_score - without_memory_score) / without_memory_score) * 100
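A minimal sketch of this calculation, assuming each per-scenario stats dictionary carries an `avg_score` field (the function name mirrors Step 5, but the body here is simplified):

```python
def calculate_improvements(baseline_stats: dict, with_memory_stats: dict) -> dict:
    """Percentage change of the average score relative to the no-memory baseline."""
    baseline = baseline_stats["avg_score"]
    with_memory = with_memory_stats["avg_score"]
    return {"avg_score_improvement_pct": (with_memory - baseline) / baseline * 100}


# With the sample-output numbers: (0.817 - 0.633) / 0.633 * 100 ≈ +29.07%
print(calculate_improvements({"avg_score": 0.633}, {"avg_score": 0.817}))
```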
Expected Results¶
Hypothesis¶
Tool Memory should enable the agent to:
Select optimal tools based on query complexity
Improve average score by 10-30% on test set
Increase consistency across multiple epochs
Sample Output¶
==================================================================================================
BENCHMARK RESULTS COMPARISON
==================================================================================================
Note: Avg Score = average quality score
+------------------------------+-------------+-----------+
| Scenario                     | Total Calls | Avg Score |
+==============================+=============+===========+
| Epoch1 - Train (No Memory)   |          60 |     0.650 |
+------------------------------+-------------+-----------+
| Epoch1 - Test (No Memory)    |          60 |     0.633 |
+------------------------------+-------------+-----------+
| Epoch1 - Test (With Memory)  |          60 |     0.817 |
+------------------------------+-------------+-----------+
==================================================================================================
IMPROVEMENTS WITH TOOL MEMORY (Baseline: Test without memory)
==================================================================================================
Average Score : +29.07% ↑
==================================================================================================
Running the Benchmark¶
Prerequisites¶
pip install requests python-dotenv loguru tabulate
Start API Server¶
# Start ReMe API server
python reme_ai/app.py --port 8002
Execute Benchmark¶
# Full benchmark (3 epochs, 60+60 queries per epoch)
python cookbook/tool_memory/run_reme_tool_bench.py
# Quick test (3 epochs, 15+15 queries per epoch)
# Modify main() call: main(test_mode=True, run_epoch=3)
Output Files¶
- `tool_memory_benchmark_results.json`: Complete benchmark results
- Console output: Real-time progress and comparison tables
API Endpoints Used¶
- `/use_mock_search`: Execute tool selection and search
  - Input: `workspace_id`, `query`
  - Output: `ToolCallResult` JSON
- `/add_tool_call_result`: Add results to memory and get evaluation scores
  - Input: `workspace_id`, `tool_call_results` (list)
  - Output: `memory_list` with scored results
- `/summary_tool_memory`: Summarize tool performance
  - Input: `workspace_id`, `tool_names` (comma-separated)
  - Output: Updated `ToolMemory` with content
- `/retrieve_tool_memory`: Retrieve formatted tool memory
  - Input: `workspace_id`, `tool_names`
  - Output: Markdown-formatted memory content
- `/vector_store`: Delete workspace
  - Input: `workspace_id`, `action: "delete"`
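As an illustration of how a client talks to these endpoints, the snippet below posts to `/retrieve_tool_memory` with `requests`. The host and port follow the server command above; the workspace name is an illustrative placeholder, and the exact request/response fields should be checked against the cookbook script.

```python
import requests

BASE_URL = "http://localhost:8002"  # matches `python reme_ai/app.py --port 8002`

response = requests.post(
    f"{BASE_URL}/retrieve_tool_memory",
    json={
        "workspace_id": "tool_memory_train",                  # illustrative workspace name
        "tool_names": "SearchToolA,SearchToolB,SearchToolC",  # comma-separated, as above
    },
    timeout=120,  # the benchmark uses a 120-second timeout per API call
)
response.raise_for_status()
memories = response.json()  # markdown-formatted tool memory content
```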
Concurrency Control¶
Max workers: 4 parallel queries
Rate limiting: 1 second delay between submissions
Timeout: 120 seconds per API call
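The concurrency pattern described above can be reproduced with a standard thread pool. This is a simplified sketch, not the script's exact code; `execute_query` stands in for whichever function posts a single query to the API.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def execute_query(query: str) -> dict:
    """Placeholder for a single /use_mock_search call (120 s timeout per request)."""
    return {"query": query, "output": ""}


def run_queries(queries: list[str]) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:  # max 4 parallel queries
        futures = []
        for query in queries:
            futures.append(pool.submit(execute_query, query))
            time.sleep(1)                            # 1 second delay between submissions
        for future in as_completed(futures):
            results.append(future.result())
    return results
```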
References¶
- Tool Memory Schema: `reme_ai/schema/memory.py`
- Mock Tools Implementation: `reme_ai/agent/tools/mock_search_tools.py`
- LLM-based Search Op: `reme_ai/agent/tools/llm_mock_search_op.py`
- Tool Selection Op: `reme_ai/agent/tools/use_mock_search_op.py`