In Part 1, we built a foundational AI agent with a modular tool system, type-safe structured outputs, and the ReAct reasoning pattern. We created an agent that could use tools, think through problems step-by-step, and provide reliable responses.
But our Part 1 agent had limitations that would make it unsuitable for production:
- No memory: Each conversation started fresh, with no context from previous interactions
- No oversight: The agent could perform any action without human approval
- Limited visibility: We couldn't easily debug or monitor agent behavior
- Fragile execution: Tool failures would crash the entire agent
There are other limitations we could address, but these four are the most important, and the solutions apply to virtually any type of agent you might build.
Today, we're upgrading our agent with four production-critical features:
- Long-term Memory: Persisting conversations across sessions
- Human-in-the-Loop (HITL): Requiring approval for critical actions
- Advanced Observability: Comprehensive logging and tracing
- Error Recovery: Graceful failure handling with retry logic
If you haven't read Part 1 yet, I strongly recommend starting there, as we'll be building directly on that foundation. You can find the complete code for both parts in these notebooks: Colab Notebook Part 1, Colab Notebook Part 2.
These posts are also part of my Master LLMs series, a blog series I'm creating to guide you from fundamentals to production-ready systems. Read the full series to go deeper into what makes these models tick and how to wield them effectively.
Why These Features Matter
Before we dive into implementation, let's understand why each feature is essential in today's advanced agents:
- Long-term Memory transforms your agent from a stateless function into a learning system that improves over time. Imagine a customer support agent that remembers past issues or a personal assistant that learns your preferences.
- Human-in-the-Loop is critical for safety and trust. You don't want an agent deleting production databases or sending emails without approval. HITL provides a safety gate for high-stakes actions.
- Observability is what separates toy agents from production systems. When something goes wrong (and it will), you need detailed logs showing exactly what the agent did, when, and why.
- Error Recovery makes your agent resilient. APIs fail, networks time out, and rate limits hit. A production agent must handle these gracefully rather than crashing.
Let's build each of these, starting with memory.
Feature 1: Long-term Memory
Any chatbot or AI agent we use nowadays has some form of history that persists either locally (on your machine) or in the cloud. But our Part 1 agent was stateless, meaning every conversation started from scratch. This works for simple tasks, but real-world agents need to learn from past interactions.
Memory vs. Context Window
You might think: "Why not just keep all messages in the history?" The problem is token limits. LLMs have finite context windows (typically 32k-128k tokens). A long conversation can easily exceed this, causing:
- Truncated history (losing important context)
- Increased latency (processing massive prompts)
- Higher costs (you pay per token)
The solution is selective memory: We choose to persist only important information to disk and inject only recent, relevant context into each request.
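This selection step can be sketched in a few lines. The snippet below is a standalone illustration, not code from the agent we build next, and the 4-characters-per-token estimate is a crude heuristic standing in for a real tokenizer:

```python
# Rough sketch of selective memory: persist everything, inject only what fits.
def estimate_tokens(text: str) -> int:
    # Very rough heuristic: ~4 characters per token on average.
    return max(1, len(text) // 4)

def select_recent(messages: list[dict], token_budget: int) -> list[dict]:
    """Walk backwards from the newest message, keeping turns until the budget is spent."""
    selected, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > token_budget:
            break
        selected.append(msg)
        used += cost
    return list(reversed(selected))  # restore chronological order

history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
recent = select_recent(history, token_budget=250)  # keeps only the last two turns
```

The oldest turn is dropped once the budget runs out, which is exactly the trade-off selective memory makes: recency over completeness.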
Designing the Memory System
Our memory system needs four capabilities:
- Persistence: Survive session restarts.
- Session awareness: Track which conversations belong together.
- Selective retrieval: Load only recent or relevant memories.
- Clean separation: Memory doesn't interfere with live conversation flow.
Here's the implementation:
class MemoryStore:
    def __init__(self, file_path: str, max_entries: int = 50):
        self.file_path = file_path
        self.max_entries = max_entries
        self._ensure_file()

    def _ensure_file(self):
        if not os.path.exists(self.file_path):
            with open(self.file_path, "w") as f:
                json.dump([], f)

    def load_all(self) -> List[dict]:
        try:
            with open(self.file_path, "r") as f:
                return json.load(f)
        except Exception:
            return []

    def append(self, entry: dict):
        data = self.load_all()
        data.append(entry)
        with open(self.file_path, "w") as f:
            json.dump(data, f, indent=2)

    def get_recent(self, limit: Optional[int] = None) -> List[dict]:
        data = self.load_all()
        limit = limit or self.max_entries
        return data[-limit:]

    def delete_all(self):
        with open(self.file_path, "w") as f:
            json.dump([], f)

Key design decisions:
- JSON file format: Simple, human-readable, and Colab-friendly. For production, you'd use a proper database or vector store.
- Append-only writes: Each conversation turn is appended, creating a complete audit trail.
- Lazy loading: We only load from disk when needed, keeping memory footprint low.
- Graceful degradation: If the file is corrupted or missing, we return an empty list rather than crashing.
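As a quick sanity check on these decisions, here is the same append-only JSON pattern in isolation, against a temporary file. This is a standalone sketch mirroring MemoryStore's behavior, not the class itself:

```python
import json
import os
import tempfile

# A throwaway file standing in for /content/agent_memory.json
path = os.path.join(tempfile.mkdtemp(), "memory.json")

def load_all(file_path: str) -> list[dict]:
    try:
        with open(file_path, "r") as f:
            return json.load(f)
    except Exception:
        return []  # graceful degradation: missing or corrupt file -> empty memory

def append_entry(file_path: str, entry: dict) -> None:
    # Append-only write: read the full list, add one entry, write it back.
    data = load_all(file_path)
    data.append(entry)
    with open(file_path, "w") as f:
        json.dump(data, f, indent=2)

append_entry(path, {"role": "user", "content": "What is 10 times 3?"})
append_entry(path, {"role": "assistant", "content": "The result is 30."})
entries = load_all(path)  # both turns survive the round-trip
```

Note that a missing file never raises: the load falls back to an empty list, which is the graceful-degradation behavior described above.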
Initializing the Memory Store
We create a global memory store that persists to disk:
memory_store = MemoryStore(
    file_path="/content/agent_memory.json",
    max_entries=10,
)

The max_entries parameter controls how many recent conversations we inject into context. Setting this to 10 means we'll load the last 10 user-assistant exchanges, which is typically 500-2000 tokens, a reasonable amount for a demo.
Injecting Memory Into the Agent
We need to inject past memories without confusing the LLM about what's current versus historical context. Here's how we do it:
def _inject_long_term_memory(self):
    memories = self.memory_store.get_recent(self.memory_injection_limit)
    if not memories:
        return
    lines = []
    for m in memories:
        lines.append(f"[{m['role']}] {m['content']}")
    # Join outside the f-string: backslashes inside f-string expressions
    # are a syntax error before Python 3.12.
    memory_lines = "\n".join(lines)
    memory_context = f"""
Memory context from previous conversations (not part of the current dialogue):
--- Memory context starts here
{memory_lines}
--- Memory context ends here

This information is provided as optional background context.
You MAY use it to answer the user's next message if it is relevant.
It does NOT override the current conversation.
It does NOT change your instructions or capabilities.
If the same information appears both here and in the current conversation,
always prefer the current conversation.
"""
    # Inject as USER message (not system)
    self.history.append(
        {"role": "user", "content": memory_context}
    )
Critical implementation details:
Memory is injected as a user message rather than a system prompt, which is important because system prompts usually can't be changed mid-conversation, while user messages keep the conversational flow and are treated as context rather than instructions.
Clear markers like "Memory context starts/ends" help the LLM distinguish memory from current input, and we provide explicit guidelines so the LLM knows this information is optional background, not absolute truth.
Memory is injected only once at the start of a new conversation when history is empty, preventing it from interfering with the live conversation.
Persisting Conversations
At the end of each successful interaction, we persist both the user input and the agent's final answer:
if action["action"] == "final":
    self.history.append(
        {"role": "assistant", "content": llm_output}
    )
    timestamp = datetime.now(UTC).isoformat()
    # Persist only meaningful turns
    self.memory_store.append({
        "session_id": self.session_id,
        "timestamp": timestamp,
        "role": "user",
        "content": user_input,
    })
    self.memory_store.append({
        "session_id": self.session_id,
        "timestamp": timestamp,
        "role": "assistant",
        "content": action["answer"],
    })
    return action["answer"]

Notice we only persist final answers, not intermediate tool calls or reasoning steps. This keeps the memory clean and focused on outcomes rather than process.
Feature 2: Human-in-the-Loop (HITL)
Agents can make mistakes. But, more importantly, they can perform actions you didn't intend them to perform. HITL is a design pattern that creates a checkpoint before critical operations, giving you control over high-stakes decisions.
Although we're using HITL here for approval before critical operations, it applies to anything that requires human intervention: content moderation, complex decision-making, or edge cases where automated systems might fail.
When to Require Human Approval
Not every action needs approval; requiring it everywhere would make the agent unusable. Good HITL design requires approval for:
- Irreversible actions: Deleting data, sending emails, making purchases
- High-cost operations: Running expensive API calls, deploying code
- Sensitive data access: Reading private files, accessing credentials
- External communications: Posting to social media, contacting people
For our agent, we'll focus on a particularly dangerous operation:
➡ Deleting all memory.
Extending the Action Space
First, we add a new action type to our Pydantic models:
class HumanApproval(BaseModel):
    action: Literal["human"]
    reason: str

LLMResponse = Union[ToolCall, FinalAnswer, HumanApproval]

Now the LLM has three possible actions: call a tool, request human approval, or provide a final answer.
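At runtime, dispatching on the parsed output is a simple branch on the action field. Here's a minimal stdlib-only sketch; the real agent validates against the Pydantic models above, and these hand-rolled checks are a simplified stand-in:

```python
import json

def classify_action(llm_output: str) -> str:
    """Parse the LLM's JSON output and check it against the three action shapes."""
    action = json.loads(llm_output)
    kind = action.get("action")
    if kind == "tool" and "tool_name" in action and "args" in action:
        return "tool"
    if kind == "human" and "reason" in action:
        return "human"
    if kind == "final" and "answer" in action:
        return "final"
    raise ValueError(f"Malformed action: {action!r}")

kind = classify_action(
    '{"action": "human", "reason": "Deleting all memory is irreversible."}'
)
```

A malformed payload (e.g., a "tool" action with no args) raises immediately, which is exactly the fail-fast behavior you want before executing anything.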
Updating the System Prompt
We need to teach the LLM when and how to request approval. Here's the key addition to our system prompt:
HUMAN-IN-THE-LOOP (MANDATORY):
- You have a special action called "human".
- You MUST choose the "human" action BEFORE performing any irreversible,
destructive, or sensitive operation.
- Examples include (but are not limited to): deleting memory, resetting state,
or permanently altering stored data.
- When using the "human" action, you MUST clearly explain the reason approval
is required.
- After asking for human approval, you have two options depending on the response:
1. If approval is given: You MUST continue the task by selecting the
appropriate next action (usually a tool call).
2. If approval is denied: You MUST inform the user that the original action
won't be performed because approval was not given.
- Do not repeat this action consecutively. You must always follow a "human"
  action by a "tool" action.

These rules create a clear protocol: request approval first, then act on the human's decision.
Implementing the Approval Check
In our agent's run loop, we handle the human approval action:
if action["action"] == "human":
    observer.log("human_approval_requested", {
        "reason": action["reason"]
    })
    self.history.append(
        {"role": "assistant", "content": action["reason"]}
    )
    approved = self._human_approval(action["reason"])
    observer.log("human_approval_result", {
        "approved": approved
    })
    if not approved:
        self.history.append({
            "role": "user",
            "content": "Human approval was demanded and it is not given. "
                       "You cannot perform the action that required the approval."
        })
    else:
        self.history.append({
            "role": "user",
            "content": "Human approval was demanded and it is given. "
                       "You can now proceed with the action that required the approval."
        })
    continue

The _human_approval() method is straightforward:
def _human_approval(self, reason: str) -> bool:
    print(f"Approval required: {reason}")
    choice = input("Approve? (y/n): ").strip().lower()
    return choice == "y"

In a production system, you'd replace this with a more sophisticated approval mechanism: a web interface, Slack notification, or approval queue system.
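One way to make that swap painless is to treat the approval mechanism as an injected callback with a deny-by-default timeout. This is a hedged sketch of that idea, not code from the agent; the handler lambdas stand in for a console prompt, Slack webhook, or approval queue:

```python
import queue
import threading
import time

def wait_for_approval(request_approval, timeout_sec: float = 30.0) -> bool:
    """Ask an external handler for approval; deny by default if no answer arrives in time."""
    answers: queue.Queue = queue.Queue()
    # Run the handler in its own thread so a slow human (or webhook) can't hang the agent.
    worker = threading.Thread(target=lambda: answers.put(request_approval()), daemon=True)
    worker.start()
    try:
        return bool(answers.get(timeout=timeout_sec))
    except queue.Empty:
        return False  # fail closed: no response means no approval

# Simulated handlers:
approved = wait_for_approval(lambda: True, timeout_sec=1.0)
denied = wait_for_approval(lambda: False, timeout_sec=1.0)
timed_out = wait_for_approval(lambda: (time.sleep(0.5), True)[1], timeout_sec=0.05)
```

Failing closed on timeout is the key design choice: for irreversible actions, silence should never count as consent.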
Creating the Delete Memory Tool with a Factory Pattern
Now here's where things get interesting. We need a tool that can delete memory, but tools need access to agent state (the memory_store object), while remaining stateless from the model's perspective.
This is a perfect use case for the factory pattern:
def make_delete_all_memory_tool(memory_store: MemoryStore):
    def delete_all_memory(confirm: str):
        if confirm.lower() != "true":
            raise ValueError(
                "delete_all_memory called without explicit confirmation"
            )
        memory_store.delete_all()
        return "All long-term memory has been permanently deleted."
    return delete_all_memory

delete_all_memory_fn = make_delete_all_memory_tool(memory_store)

So, why use a factory here?
➡ A quick note to explain: Tools must be stateless from the LLM's perspective, meaning they're just function signatures. But they often need access to application state (databases, API clients, configuration). The factory pattern solves this seamlessly:
- The outer function (make_delete_all_memory_tool) captures the memory_store in a closure
- The inner function (delete_all_memory) is the actual tool, with a clean signature (just the confirm argument)
- The LLM only sees the inner function's arguments, maintaining abstraction
- We can create multiple versions of the tool with different state (e.g., dev vs. prod memory stores)
This pattern is essential whenever your tools need to access resources beyond their direct parameters. You'll see it in production systems for database connections, API clients, file systems, and more.
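The same trick works for any stateful dependency. Here's a deliberately tiny, hypothetical example (not a tool from this agent) showing one factory producing two tool instances bound to different backends:

```python
# Hypothetical illustration of the factory pattern: one factory, two bound tools.
def make_counter_tool(store: dict):
    def increment(key: str) -> int:
        """The tool the LLM would see: one string argument, no visible state."""
        store[key] = store.get(key, 0) + 1
        return store[key]
    return increment

dev_store, prod_store = {}, {}
dev_increment = make_counter_tool(dev_store)    # bound to the dev backend
prod_increment = make_counter_tool(prod_store)  # bound to the prod backend

dev_increment("runs")
dev_increment("runs")
prod_increment("runs")
```

Each closure carries its own store, so the two tools share code but never share state, which is precisely what you want when pointing the same agent at dev versus prod resources.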
We register the tool with an explicit confirmation parameter:
class DeleteAllMemoryArgs(BaseModel):
    confirm: Literal["true"] = Field(
        description="Must be 'true' to confirm permanent deletion of all memory."
    )

registry.register(
    Tool(
        name="delete_all_memory",
        description="Permanently delete all long-term memory. This action is irreversible.",
        input_schema=DeleteAllMemoryArgs,
        output_schema={"result": "string"},
        func=delete_all_memory_fn,
    )
)

The Literal["true"] constraint forces the LLM to explicitly pass confirm="true", making accidental deletion nearly impossible.
The Complete HITL Flow
Here's what happens when a user asks to delete all memory:

This flow ensures dangerous operations are never automated without oversight.
Feature 3: Advanced Observability
"If you can't observe it, you can't debug it."
This is the mantra of production systems.
Our Part 1 agent was a black box: when something went wrong, we had no visibility into what happened.
What Observability Means for AI Agents
Traditional software has stack traces, logs, and debuggers. AI agents need something similar but more adapted to their multi-step probabilistic behavior.
We need to track:
- What decisions were made (which action, which tool)
- When they happened (timestamps, durations)
- Why they were made (the LLM's reasoning)
- What the outcomes were (tool results, errors)
This creates an audit trail that lets you answer questions like:
- "Why did the agent call this tool?"
- "How long did each step take?"
- "Where did the agent get stuck?"
- "What caused this error?"
Building the Observer System
We create an AgentObserver class that handles all logging:
import uuid
import time
from pathlib import Path

class AgentObserver:
    def __init__(self, log_dir="/content/logs"):
        self.trace_id = str(uuid.uuid4())
        self.events = []
        Path(log_dir).mkdir(exist_ok=True)
        self.file_path = Path(log_dir) / f"trace_{self.trace_id}.jsonl"

    def log(self, event_type, data=None):
        entry = {
            "trace_id": self.trace_id,
            "timestamp": time.time(),
            "event": event_type,
            "data": data or {}
        }
        self.events.append(entry)
        with open(self.file_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def span(self, name):
        return Span(self, name)

Event Logging Decisions:
1. How do we track each agent run? We assign a unique trace ID (UUID) to every run. This makes it easy to correlate logs and see exactly what happened during a specific session.
2. How should logs be formatted for easy parsing? We use JSONL, where each line is a complete JSON object. This format keeps parsing simple, even for massive logs.
3. How do we balance speed and persistence? Events are stored in memory for fast access and also written to disk to ensure nothing is lost.
4. How do we keep logs consistent? Every log entry follows a structured schema: it always includes an event type, a timestamp, and any additional arbitrary data. This makes analysis and debugging straightforward.
The Span Context Manager
Spans are critical for understanding timing and performance:
class Span:
    def __init__(self, observer, name):
        self.observer = observer
        self.name = name

    def __enter__(self):
        self.start = time.time()
        self.observer.log("span_start", {"name": self.name})

    def __exit__(self, exc_type, exc, tb):
        duration = time.time() - self.start
        self.observer.log("span_end", {
            "name": self.name,
            "duration_sec": round(duration, 3)
        })

Spans use Python's context manager protocol (the with statement) to automatically measure execution time. Usage is beautifully simple:
with observer.span("llm_call"):
    llm_output = self.llm.generate(self.history)

This logs both when the LLM call started and when it finished, along with the total duration.
Integrating Observability Into the Agent
We create an observer at the start of each run and log every significant event:
def run(self, user_input: str):
    observer = AgentObserver()
    observer.log("run_start", {
        "session_id": self.session_id
    })
    # ... inject memory ...
    observer.log("user_message", {
        "text": user_input
    })
    for step in range(self.max_steps):
        with observer.span("llm_call"):
            llm_output = self.llm.generate(self.history)
        action = json.loads(llm_output)
        observer.log("llm_decision", {
            "step": step,
            "action": action["action"]
        })
        if action["action"] == "tool":
            observer.log("tool_call_requested", {
                "tool_name": action["tool_name"],
                "args": action["args"]
            })
            # Execute tool...
            observer.log("tool_call_result", {
                "tool_name": tool.name,
                "tool_response": result,
            })

Example Log Output
Here's what a real trace file looks like (formatted for readability):
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.123, "event": "run_start", "data": {"session_id": "xyz789"}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.125, "event": "user_message", "data": {"text": "What is 5 plus 3?"}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.126, "event": "span_start", "data": {"name": "llm_call"}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.891, "event": "span_end", "data": {"name": "llm_call", "duration_sec": 0.765}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.892, "event": "llm_decision", "data": {"step": 0, "action": "tool"}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.893, "event": "tool_call_requested", "data": {"tool_name": "add", "args": {"a": 5, "b": 3}}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.894, "event": "span_start", "data": {"name": "tool:add"}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.895, "event": "span_end", "data": {"name": "tool:add", "duration_sec": 0.001}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.896, "event": "tool_call_result", "data": {"tool_name": "add", "success": true, "attempt": 1}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701234.897, "event": "span_start", "data": {"name": "llm_call"}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701235.634, "event": "span_end", "data": {"name": "llm_call", "duration_sec": 0.737}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701235.635, "event": "final_answer", "data": {"text": "The result is 8."}}
{"trace_id": "a1b2c3d4", "timestamp": 1705701235.636, "event": "run_complete", "data": {"steps_used": 1}}

From this trace, you can see:
- The entire run took about 1.5 seconds
- Two LLM calls were made (0.765s and 0.737s each)
- One tool was called (add), taking 0.001s
- The agent completed in just 1 step
This is the difference between "something went wrong" and "the image generation API timed out after 5.2 seconds on the third retry attempt at step 7."
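Because each line is a self-contained JSON object, trace analysis takes only a few lines of stdlib code. Here's a sketch that recovers span durations from a trace, assuming the event schema shown above (the hard-coded lines stand in for reading the .jsonl file):

```python
import json

# Stand-in for open("trace_....jsonl").readlines()
trace_lines = [
    '{"trace_id": "a1b2c3d4", "timestamp": 1.0, "event": "span_start", "data": {"name": "llm_call"}}',
    '{"trace_id": "a1b2c3d4", "timestamp": 1.7, "event": "span_end", "data": {"name": "llm_call", "duration_sec": 0.7}}',
    '{"trace_id": "a1b2c3d4", "timestamp": 1.8, "event": "span_start", "data": {"name": "tool:add"}}',
    '{"trace_id": "a1b2c3d4", "timestamp": 1.81, "event": "span_end", "data": {"name": "tool:add", "duration_sec": 0.01}}',
]

def span_durations(lines: list[str]) -> dict[str, float]:
    """Collect the duration logged on each span_end event, keyed by span name."""
    durations: dict[str, float] = {}
    for line in lines:
        entry = json.loads(line)
        if entry["event"] == "span_end":
            durations[entry["data"]["name"]] = entry["data"]["duration_sec"]
    return durations

durations = span_durations(trace_lines)
```

From here it's one more step to aggregate per-tool latency across many traces or flag runs where an LLM call dominated the total time.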
Feature 4: Error Recovery
Failures such as API errors, network timeouts, and rate limits are unavoidable. A production agent must handle them correctly.
The Problem with Naive Tool Calling
In Part 1, our tool execution was simple:
tool = self.tool_registry.get(action["tool_name"])
result = tool(**action["args"])

If the tool throws an exception, the entire agent crashes. Game over.
Implementing Safe Tool Calls with Retries
We wrap tool execution in a retry handler:
def _safe_tool_call(self, observer, tool, args, retries=2):
    """
    Calls a tool safely with retry and error logging.
    """
    attempt = 0
    while attempt <= retries:
        try:
            with observer.span(f"tool:{tool.name}"):
                result = tool(**args)
            observer.log("tool_call_result", {
                "tool_name": tool.name,
                "success": True,
                "attempt": attempt + 1
            })
            return result
        except Exception as e:
            attempt += 1
            observer.log("tool_call_error", {
                "tool_name": tool.name,
                "attempt": attempt,
                "error": str(e)
            })
            if attempt > retries:
                # Final failure after all retries
                observer.log("tool_call_failed", {
                    "tool_name": tool.name
                })
                return None

Key features:
- Configurable retries: Default is 2 retries (3 total attempts), but adjustable per tool
- Detailed logging: Every attempt is logged with success/failure status
- Graceful degradation: Returns None on final failure rather than crashing
- Span timing: Each attempt is measured, showing if failures are slow (timeouts) or fast (validation errors)
Using the Safe Tool Call
In the agent's run loop, we replace the naive call:
if action["action"] == "tool":
    tool = self.tool_registry.get(action["tool_name"])
    result = self._safe_tool_call(observer, tool, action["args"])
    # The agent continues even if result is None
    self.history.append({
        "role": "tool",
        "tool_name": tool.name,
        "tool_response": result,
    })

Now the agent continues operating even if a tool fails. The LLM sees the failure (result is None) and can decide how to respond: maybe try a different tool, ask for clarification, or inform the user of the limitation.
When Retries Help, and When They Don't:
Retries are effective for:
- Transient network errors: Temporary connectivity issues
- Rate limiting: Brief API throttling
- Server overload: Temporary unavailability (503 errors)
Retries are NOT effective for:
- Invalid parameters: Will fail every time
- Authentication errors: Need to fix credentials, not retry
- Resource not found: Won't magically appear on retry
- Quota exhausted: Need to wait for quota reset, not immediate retry
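A simple way to encode this distinction is to classify errors before retrying. The sketch below is illustrative only: the two exception classes stand in for real API error codes (timeouts and 503s versus auth failures and 404s):

```python
# Illustrative classification: retry transient failures, fail fast on permanent ones.
class TransientError(Exception):
    """Stand-in for timeouts, 429 rate limits, 503s."""

class PermanentError(Exception):
    """Stand-in for auth failures, bad parameters, 404s."""

def call_with_retries(func, retries: int = 2):
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                return None  # exhausted retries, degrade gracefully
        except PermanentError:
            return None  # retrying would fail identically, so stop immediately

# A flaky call that succeeds on the third attempt:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary blip")
    return "ok"

result = call_with_retries(flaky)

# A permanent failure is not retried at all:
perm_calls = {"n": 0}
def always_denied():
    perm_calls["n"] += 1
    raise PermanentError("bad credentials")

denied_result = call_with_retries(always_denied)
```

The payoff is in the call counts: the transient failure gets three attempts, while the permanent one is abandoned after a single try instead of burning the retry budget.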
For production systems, you'd implement exponential backoff and jitter to avoid thundering herd problems:
import time
import random

def retry_with_backoff(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)

Putting It All Together
Let's see how all four features work in concert. Here's the complete agent initialization (Pseudo-code with conversation example, full code here):
from google import genai

# Initialize API client
client = genai.Client(api_key=GEMINI_API_KEY)

# Create memory store
memory_store = MemoryStore(
    file_path="/content/agent_memory.json",
    max_entries=10,
)

# Create tool registry with all tools
registry = ToolRegistry()
# ... register tools (add, multiply, delete_all_memory) ...

# Create LLM wrapper
llm = GeminiLLM(client, registry)

# Create agent with all features
agent = Agent(
    llm=llm,
    tool_registry=registry,
    memory_store=memory_store,
    max_steps=5,
    memory_injection_limit=6
)

# Start chatting
chat_with_agent(agent)
Example: A Complete Interaction
Let's walk through a complex scenario that exercises all our new features:
**Conversation 1:**
```
You: What is 10 times 3?
Agent: [Uses multiply tool] The result is 30.
```
*This gets stored in memory.*
**Conversation 2 (new session):**
```
You: Do you remember what we discussed before?
Agent: [Memory injected] Yes, in our previous conversation, you asked me to
multiply 10 by 3, and the result was 30.
You: Great! Now delete all our conversation history.
Agent: [HITL triggered] The user is requesting to permanently delete all stored
conversation history. This action cannot be undone. Approval is required
before proceeding.
Approve? (y/n): y
Agent: [Calls delete_all_memory tool] All long-term memory has been permanently
deleted.
```
Behind the scenes, the trace log shows:
- Memory was injected at conversation start
- LLM requested human approval before deletion
- Delete tool was called successfully after approval
- Entire interaction took 2.3 seconds across 4 steps
The Complete System Architecture

We now have a production-grade agent with all four features working together.
Each component has a clear responsibility:
- Agent: Orchestrates the conversation flow
- Memory: Provides context from past interactions
- HITL: Gates dangerous operations
- Observer: Tracks everything for debugging
- Error Recovery: Keeps the system resilient
Performance Considerations
- Memory injection: Adds 50–200 tokens to each conversation start. With proper limits, this is negligible.
- HITL checks: Zero overhead unless approval is actually requested. When triggered, adds human wait time (unpredictable).
- Observability: Minimal overhead. File I/O happens in the background. Typical overhead is <10ms per run.
- Error recovery: Only adds overhead on failures. Successful tool calls have zero retry cost.
What Could Be Next
We've built a robust foundation, but there's always more to explore:
- Advanced Memory Systems: Move beyond recency to semantic relevance using embeddings. Implement hierarchical memory (working memory, short-term, long-term). Add memory querying as an explicit tool.
- Sophisticated HITL: Build approval queues for asynchronous review. Implement role-based permissions. Create approval rules engines.
- Production Observability: Integrate with real monitoring systems. Build real-time dashboards. Implement distributed tracing across multiple agents.
- Intelligent Error Handling: Add circuit breakers and fallback strategies. Implement predictive failure detection. Build self-healing capabilities.
- Multi-Agent Systems: Coordinate multiple specialized agents. Implement agent-to-agent communication. Build supervisor agents that manage worker agents.
Conclusion
Building AI agents from scratch teaches you what frameworks abstract away. You learn why certain design patterns exist, when to use them, and how to adapt them to your specific needs.
The agent we've built across these two posts isn't just a toy demo; it's a legitimate foundation for production systems. Many commercial AI agents use variations of these exact patterns, with additional hardening layers:
- Modular tool systems
- Provider-agnostic LLM integration
- Type-safe structured outputs
- Persistent memory
- Human oversight gates
- Comprehensive observability
- Graceful error handling
Whether you're building a customer support agent, a coding assistant, a research tool, or something entirely new, these patterns will serve you well.
If you're building with LLMs, the Master LLMs series is here to guide you from fundamentals to production-ready systems. Read the full series to go deeper into what makes these models tick and how to wield them effectively.

Leave a comment and follow me for more insights on AI, ML, and coding. You can also check out my work and socials: Website | YouTube | GitHub | LinkedIn | X
🚀 I'm launching a curated weekly AI newsletter, and you're invited to be among the first. 👉No hype. No noise. Just essential news, tools, papers, and insights handpicked for engineers and thinkers who build with AI.
Be part of the founding circle → Join free now