Working Memory Demo

This demo showcases how to use ReMe’s working memory capabilities with a ReAct agent. The working memory system automatically manages context by compressing and summarizing conversation history, enabling efficient long-context processing.

Installation

Install from Source

git clone https://github.com/agentscope-ai/ReMe.git
cd ReMe
pip install .

Environment Configuration

Copy example.env to .env and modify the corresponding parameters:

FLOW_LLM_API_KEY=sk-xxxx
FLOW_LLM_BASE_URL=https://xxxx/v1
FLOW_EMBEDDING_API_KEY=sk-xxxx
FLOW_EMBEDDING_BASE_URL=https://xxxx/v1

Starting the Services

Before running the demo, you need to start both the HTTP and MCP services:

Start MCP Service

reme backend=mcp mcp.port=8002

The MCP service provides tools for working memory management, including:

  • grep_working_memory: Search for content in working memory

  • read_working_memory: Read specific sections of working memory

Start HTTP Service

reme backend=http http.port=8003

The HTTP service provides the flow execution endpoint for memory operations.

Running the Demo

Once both services are running, execute the demo:

cd cookbook/working_memory
python work_memory_demo.py

What the Demo Does

The demo simulates a scenario where:

  1. The content of a large README is loaded (repeated 4 times to create a long context; see the sketch after this list)

  2. The agent needs to search through this content and extract specific information

  3. Working memory automatically compresses the context from ~24,586 tokens to ~1,565 tokens (compression ratio: 0.06)

  4. The agent can still accurately answer questions about the content
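
As a rough illustration of step 1 (this is a sketch, not the demo's exact code; the file path and question wording are assumptions), the long context can be built by simply repeating the README text:

# Illustrative sketch only: the path and question are assumed, not taken from the demo
readme_text = open("README.md", encoding="utf-8").read()

# Repeat the README 4 times to create a long context, as described above
long_context = "\n\n".join([readme_text] * 4)

# The agent is then asked a question that can only be answered by searching this context
query = "How does task memory perform in AppWorld according to the README?"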

Core Code Explanation

ReactAgent with Working Memory (react_agent_with_working_memory.py)

1. Agent Initialization

class ReactAgent:
    def __init__(self, model_name="", max_steps: int = 50):
        # Use your own LLM class
        self.llm = OpenAICompatibleLLM(model_name=model_name)
        self.max_steps = max_steps

The agent is initialized with an LLM model and a maximum number of reasoning steps.

2. Service Connection

async with FastMcpClient("reme_mcp_server", {
    "type": "sse",
    "url": "http://0.0.0.0:8002/sse",
}) as mcp_client, HttpClient(base_url="http://localhost:8003") as http_client:

The agent connects to both:

  • MCP Client: For tool execution (grep, read operations)

  • HTTP Client: For flow execution (memory summarization)

3. Tool Registration

tool_calls = await mcp_client.list_tool_calls()

# Keep only the working memory tools and index them by name
tool_dict = {}
for tool_call in tool_calls:
    if tool_call.name in ["grep_working_memory", "read_working_memory"]:
        tool_dict[tool_call.name] = tool_call

The agent registers working memory tools that will be available to the LLM.

Note: summary_working_memory is not an MCP tool. It is a flow exposed by the HTTP service and is invoked via HttpClient.execute_flow, as shown in the next section.
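
For reference, a registered tool can also be invoked directly through the MCP client. The snippet below is only a sketch: the call_tool usage mirrors the ReAct loop shown later, but the argument names (query, store_dir) are assumptions and may differ from the actual tool schemas.

# Sketch: call a working memory tool by hand (argument names are assumed)
result = await mcp_client.call_tool(
    "grep_working_memory",
    arguments={"query": "AppWorld", "store_dir": "./test_working_memory"},
)
print(result)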

4. Working Memory Summarization (Key Feature)

result = await http_client.execute_flow("summary_working_memory",
                                        messages=[x.simple_dump() for x in messages],
                                        working_summary_mode="auto",
                                        compact_ratio_threshold=0.75,
                                        max_total_tokens=20000,
                                        max_tool_message_tokens=2000,
                                        group_token_threshold=None,
                                        keep_recent_count=1,
                                        store_dir="./test_working_memory")

messages = [Message(**x) for x in result.answer]

This is the core of working memory management. Before each LLM call:

  • working_summary_mode="auto": Automatically decides when to compress

  • compact_ratio_threshold=0.75: Triggers compression when the context exceeds 75% of max_total_tokens (here, roughly 0.75 × 20,000 = 15,000 tokens)

  • max_total_tokens=20000: Maximum total tokens allowed

  • max_tool_message_tokens=2000: Maximum tokens per tool message

  • keep_recent_count=1: Keeps the most recent message uncompressed

  • store_dir: Directory to store compressed memory

The summarization process (see the sketch after this list):

  1. Analyzes the current message history

  2. Identifies compressible content (especially long tool outputs)

  3. Compresses/summarizes old messages while preserving semantic information

  4. Returns a condensed message list that maintains context
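
To make the compress-old/keep-recent pattern concrete, here is a deliberately simplified sketch. It is not ReMe's implementation (ReMe uses the LLM to produce the summary and applies the token thresholds described above); it only shows the shape of the transformation.

# Simplified illustration of working memory compaction (not ReMe's actual logic)
def compact(messages, keep_recent_count=1, summarize=lambda msgs: "<summary of earlier steps>"):
    if len(messages) <= keep_recent_count:
        return messages
    split = len(messages) - keep_recent_count
    old, recent = messages[:split], messages[split:]
    # Replace the older messages with one condensed summary message,
    # keeping the key facts while dropping verbose tool output
    summary_message = {"role": "assistant", "content": summarize(old)}
    return [summary_message] + recent

history = [
    {"role": "user", "content": "Find the AppWorld results in the README"},
    {"role": "tool", "content": "...very long grep output..."},
    {"role": "assistant", "content": "The relevant section is ..."},
]
print(compact(history, keep_recent_count=1))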

5. ReAct Loop

for i in range(self.max_steps):
    # Summarize working memory before each LLM call
    result = await http_client.execute_flow("summary_working_memory", ...)
    messages = [Message(**x) for x in result.answer]

    # LLM generates next action
    assistant_message = await self.llm.achat(messages=messages, tools=[...])
    messages.append(assistant_message)

    if not assistant_message.tool_calls:
        break

    # Execute tools
    for tool_call in assistant_message.tool_calls:
        result = await mcp_client.call_tool(tool_call.name,
                                           arguments=tool_call.argument_dict)
        messages.append(Message(role=Role.TOOL, content=result, ...))

The ReAct loop:

  1. Compress: Summarize working memory to reduce context size

  2. Reason: LLM decides what tool to use

  3. Act: Execute the tool

  4. Observe: Add tool result to messages

  5. Repeat until task is complete or max steps reached

Benefits of Working Memory

  1. Context Efficiency: Reduces token usage by ~94% (24,586 → 1,565 tokens in the demo)

  2. Cost Reduction: Lower token counts mean lower API costs

  3. Performance: Faster inference with smaller contexts

  4. Scalability: Handle much longer conversations and tool outputs

  5. Accuracy: Maintains semantic information despite compression

Model Configuration

The demo uses an OpenAI-compatible LLM configured via environment variables:

  • FLOW_LLM_API_KEY / FLOW_LLM_BASE_URL: LLM API credentials and endpoint

  • The model name is specified in work_memory_demo.py, for example:

model_name = "qwen3-coder-30b-a3b-instruct"
agent = ReactAgent(model_name=model_name, max_steps=50)

You can change model_name to any model your backend supports, as long as it is served through an OpenAI-compatible API.

Expected Output

When running the demo, you should see:

  • Token count before compression: ~24,586 tokens

  • Token count after compression: ~1,565 tokens

  • Compression ratio: ~0.06 (about 6% of the original size; a quick arithmetic check follows this list)

  • The agent successfully answers the question about task memory performance in AppWorld
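
As a quick sanity check on these figures, using the token counts reported by the demo:

# 1,565 / 24,586 ≈ 0.064, i.e. roughly the reported 0.06 compression ratio,
# which corresponds to a ~94% reduction in tokens
tokens_before, tokens_after = 24586, 1565
print(f"compression ratio: {tokens_after / tokens_before:.3f}")    # ~0.064
print(f"token reduction:   {1 - tokens_after / tokens_before:.1%}")  # ~93.6%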

Customization

You can customize the working memory behavior by adjusting parameters in the summary_working_memory call (an example follows this list):

  • compact_ratio_threshold: Lower values trigger compression earlier

  • max_total_tokens: Adjust based on your model’s context window

  • max_tool_message_tokens: Control individual tool output size

  • keep_recent_count: Keep more recent messages uncompressed for better context
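
For example, a more aggressive configuration might look like the following. The specific values are illustrative only, not recommendations; only the parameter names come from the call shown earlier.

result = await http_client.execute_flow("summary_working_memory",
                                        messages=[x.simple_dump() for x in messages],
                                        working_summary_mode="auto",
                                        compact_ratio_threshold=0.5,    # compress once ~50% of max_total_tokens is used
                                        max_total_tokens=16000,         # match a smaller context window
                                        max_tool_message_tokens=1000,   # trim long tool outputs harder
                                        group_token_threshold=None,
                                        keep_recent_count=3,            # keep more recent messages uncompressed
                                        store_dir="./test_working_memory")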

Troubleshooting

  1. Services not starting: Ensure ports 8002 and 8003 are available (see the port check sketch below)

  2. Connection errors: Verify both MCP and HTTP services are running

  3. API errors: Check your .env file has valid API keys and endpoints

  4. Memory errors: Adjust max_total_tokens based on your available memory
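
For item 1, a small script like the one below can confirm that the ports are free before starting the services (a generic check, not part of the demo):

# Check whether the MCP (8002) and HTTP (8003) ports are already in use
import socket

for port in (8002, 8003):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
        print(f"port {port}: {'in use' if in_use else 'free'}")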