Working Memory Demo¶
This demo showcases how to use ReMe’s working memory capabilities with a ReAct agent. The working memory system automatically manages context by compressing and summarizing conversation history, enabling efficient long-context processing.
Installation¶
Install from PyPI (Recommended)¶
pip install reme-ai
Install from Source¶
git clone https://github.com/agentscope-ai/ReMe.git
cd ReMe
pip install .
Environment Configuration¶
Copy example.env to .env and modify the corresponding parameters:
FLOW_LLM_API_KEY=sk-xxxx
FLOW_LLM_BASE_URL=https://xxxx/v1
FLOW_EMBEDDING_API_KEY=sk-xxxx
FLOW_EMBEDDING_BASE_URL=https://xxxx/v1
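To catch configuration mistakes early, you can verify that these variables are actually picked up before launching the services. Below is a minimal standalone check, assuming the python-dotenv package is installed; it is not part of ReMe itself:

import os
from dotenv import load_dotenv  # assumes the python-dotenv package

load_dotenv()  # reads .env from the current directory
required = [
    "FLOW_LLM_API_KEY",
    "FLOW_LLM_BASE_URL",
    "FLOW_EMBEDDING_API_KEY",
    "FLOW_EMBEDDING_BASE_URL",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {missing}")
print("Environment configuration looks complete.")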
Starting the Services¶
Before running the demo, you need to start both the HTTP and MCP services:
Start MCP Service¶
reme backend=mcp mcp.port=8002
The MCP service provides tools for working memory management including:
grep_working_memory: Search for content in working memory
read_working_memory: Read specific sections of working memory
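For illustration only, these tools can also be invoked directly through the MCP client used later in the demo. The argument name below ("query") is an assumption, not the documented schema; check the tool definitions returned by list_tool_calls() for the real parameters:

# Hypothetical direct call; run inside the async with block shown later.
# The "query" argument name is a guess made for illustration.
result = await mcp_client.call_tool(
    "grep_working_memory",
    arguments={"query": "AppWorld"},
)
print(result)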
Start HTTP Service¶
reme backend=http http.port=8003
The HTTP service provides the flow execution endpoint for memory operations.
Running the Demo¶
Once both services are running, execute the demo:
cd cookbook/working_memory
python work_memory_demo.py
What the Demo Does¶
The demo simulates a scenario where:
A large README content is loaded (repeated 4 times to create a long context)
The agent needs to search through this content and extract specific information
Working memory automatically compresses the context from ~24,586 tokens to ~1,565 tokens (compression ratio: 0.06; see the quick check after this list)
The agent can still accurately answer questions about the content
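The quoted ratio is simply compressed tokens divided by original tokens:

# Quick arithmetic check of the compression ratio reported by the demo
original_tokens = 24_586
compressed_tokens = 1_565
print(f"compression ratio: {compressed_tokens / original_tokens:.2f}")  # ~0.06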
Core Code Explanation¶
ReactAgent with Working Memory (react_agent_with_working_memory.py)¶
1. Agent Initialization¶
class ReactAgent:
    def __init__(self, model_name="", max_steps: int = 50):
        # Use your own LLM class
        self.llm = OpenAICompatibleLLM(model_name=model_name)
        self.max_steps = max_steps
The agent is initialized with an LLM model and a maximum number of reasoning steps.
2. Service Connection¶
async with FastMcpClient("reme_mcp_server", {
    "type": "sse",
    "url": "http://0.0.0.0:8002/sse",
}) as mcp_client, HttpClient(base_url="http://localhost:8003") as http_client:
The agent connects to both:
MCP Client: For tool execution (grep, read operations)
HTTP Client: For flow execution (memory summarization)
3. Tool Registration¶
tool_calls = await mcp_client.list_tool_calls()
for tool_call in tool_calls:
    if tool_call.name in ["grep_working_memory", "read_working_memory"]:
        tool_dict[tool_call.name] = tool_call
The agent registers working memory tools that will be available to the LLM.
Note:
summary_working_memory is not an MCP tool. It is a flow exposed by the HTTP service and is invoked via HttpClient.execute_flow, as shown in the next section.
4. Working Memory Summarization (Key Feature)¶
result = await http_client.execute_flow(
    "summary_working_memory",
    messages=[x.simple_dump() for x in messages],
    working_summary_mode="auto",
    compact_ratio_threshold=0.75,
    max_total_tokens=20000,
    max_tool_message_tokens=2000,
    group_token_threshold=None,
    keep_recent_count=1,
    store_dir="./test_working_memory",
)
messages = [Message(**x) for x in result.answer]
This is the core of working memory management. Before each LLM call:
working_summary_mode="auto": Automatically decides when to compresscompact_ratio_threshold=0.75: Triggers compression when context exceeds 75% of max tokensmax_total_tokens=20000: Maximum total tokens allowedmax_tool_message_tokens=2000: Maximum tokens per tool messagekeep_recent_count=1: Keeps the most recent message uncompressedstore_dir: Directory to store compressed memory
The summarization process:
Analyzes the current message history
Identifies compressible content (especially long tool outputs)
Compresses/summarizes old messages while preserving semantic information
Returns a condensed message list that maintains context
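To observe the effect directly, you can log the size of the message list on both sides of the execute_flow call shown above. This sketch only relies on result.answer and the Message objects the demo already uses; the character count is just a cheap stand-in for a real token count:

# Capture sizes just before the summarization call, then compare afterwards.
before_count = len(messages)
chars_before = sum(len(str(m.content)) for m in messages)

result = await http_client.execute_flow("summary_working_memory", ...)  # same call as above
messages = [Message(**x) for x in result.answer]

print(f"messages: {before_count} -> {len(messages)}")
print(f"characters: {chars_before} -> {sum(len(str(m.content)) for m in messages)}")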
5. ReAct Loop¶
for i in range(self.max_steps):
    # Summarize working memory before each LLM call
    result = await http_client.execute_flow("summary_working_memory", ...)
    messages = [Message(**x) for x in result.answer]

    # LLM generates next action
    assistant_message = await self.llm.achat(messages=messages, tools=[...])
    messages.append(assistant_message)

    if not assistant_message.tool_calls:
        break

    # Execute tools
    for tool_call in assistant_message.tool_calls:
        result = await mcp_client.call_tool(tool_call.name,
                                            arguments=tool_call.argument_dict)
        messages.append(Message(role=Role.TOOL, content=result, ...))
The ReAct loop:
Compress: Summarize working memory to reduce context size
Reason: LLM decides what tool to use
Act: Execute the tool
Observe: Add tool result to messages
Repeat until task is complete or max steps reached
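Putting the pieces together, a minimal entry point might look like the following. The run method name and the question string are assumptions made for this sketch; use whatever interface react_agent_with_working_memory.py actually exposes:

import asyncio

async def main():
    agent = ReactAgent(model_name="qwen3-coder-30b-a3b-instruct", max_steps=50)
    # `run` is a hypothetical entry point wrapping the ReAct loop above.
    answer = await agent.run("How does task memory perform in AppWorld?")
    print(answer)

if __name__ == "__main__":
    asyncio.run(main())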
Benefits of Working Memory¶
Context Efficiency: Reduces token usage by ~94% (24,586 → 1,565 tokens in the demo)
Cost Reduction: Lower token counts mean lower API costs
Performance: Faster inference with smaller contexts
Scalability: Handle much longer conversations and tool outputs
Accuracy: Maintains semantic information despite compression
Model Configuration¶
The demo uses an OpenAI-compatible LLM configured via environment variables:
FLOW_LLM_API_KEY / FLOW_LLM_BASE_URL: LLM API credentials and endpoint

The model name is specified in work_memory_demo.py, for example:
model_name = "qwen3-coder-30b-a3b-instruct"
agent = ReactAgent(model_name=model_name, max_steps=50)
You can change model_name to any model your backend supports, as long as it is served through an OpenAI-compatible API.
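For reference, here is a minimal sketch of what an OpenAI-compatible wrapper can look like, built on the official openai SDK and the FLOW_LLM_* variables above. The demo's actual OpenAICompatibleLLM class may differ:

import os
from openai import AsyncOpenAI  # assumes the official openai Python SDK

class MinimalOpenAICompatibleLLM:
    """Thin illustration of an OpenAI-compatible chat client; not the demo's class."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.client = AsyncOpenAI(
            api_key=os.environ["FLOW_LLM_API_KEY"],
            base_url=os.environ["FLOW_LLM_BASE_URL"],
        )

    async def achat(self, messages, tools=None):
        kwargs = {"model": self.model_name, "messages": messages}
        if tools:
            kwargs["tools"] = tools  # tool schemas in OpenAI function-calling format
        response = await self.client.chat.completions.create(**kwargs)
        return response.choices[0].message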
Expected Output¶
When running the demo, you should see:
Token count before compression: ~24,586 tokens
Token count after compression: ~1,565 tokens
Compression ratio: ~0.06 (6% of original size)
The agent successfully answers the question about task memory performance in AppWorld
Customization¶
You can customize the working memory behavior by adjusting parameters in the summary_working_memory call:
compact_ratio_threshold: Lower values trigger compression earlier
max_total_tokens: Adjust based on your model's context window
max_tool_message_tokens: Control individual tool output size
keep_recent_count: Keep more recent messages uncompressed for better context
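For example, a more aggressive configuration for a smaller context budget might look like this; the specific values are illustrative, not recommendations from the ReMe documentation:

result = await http_client.execute_flow(
    "summary_working_memory",
    messages=[x.simple_dump() for x in messages],
    working_summary_mode="auto",
    compact_ratio_threshold=0.5,   # compress once half the budget is used
    max_total_tokens=8000,         # smaller overall token budget
    max_tool_message_tokens=1000,  # shrink long tool outputs sooner
    group_token_threshold=None,
    keep_recent_count=3,           # keep more recent turns uncompressed
    store_dir="./test_working_memory",
)
messages = [Message(**x) for x in result.answer]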
Troubleshooting¶
Services not starting: Ensure ports 8002 and 8003 are available
Connection errors: Verify both MCP and HTTP services are running
API errors: Check that your .env file has valid API keys and endpoints
Memory errors: Adjust max_total_tokens based on your available memory