Prompt Tester

Test system prompts against conversation scenarios with simulated agent responses and tool calls.

This interactive tester lets you experiment with system prompts and see how they influence agent behavior across different conversation scenarios. Test prompt changes without deploying to production.

Time to complete: 20-30 minutes

How It Works

The Prompt Tester provides a conversational simulator where you:

  1. Write or paste your system prompt
  2. Select or create conversation scenarios
  3. See how the agent responds, which tools it calls, and why
  4. Iterate on your prompt based on results

The simulator uses the same language model as production, so results are representative of real-world behavior.
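The loop described above can be sketched in a few lines. This is a hypothetical illustration, not the tester's actual implementation: `run_model` stands in for the production language-model call, and the transcript structure is an assumption.

```python
# Hypothetical sketch of the tester's simulation loop: replay each user turn
# of a scenario against a system prompt and record what the agent did.

def run_model(system_prompt, history, user_message):
    # Placeholder for the real model call; returns a canned structure here.
    return {"reply": f"(simulated reply to: {user_message})", "tool_calls": []}

def simulate(system_prompt, scenario_turns):
    """Run every user turn in a scenario and collect a transcript."""
    history, transcript = [], []
    for user_message in scenario_turns:
        result = run_model(system_prompt, history, user_message)
        history.append({"user": user_message, "agent": result["reply"]})
        transcript.append({
            "user": user_message,
            "agent": result["reply"],
            "tool_calls": result["tool_calls"],
        })
    return transcript

transcript = simulate("You are Otto, a hardware-store assistant.",
                      ["I need a drill", "Under $200 please"])
print(len(transcript))  # one entry per user turn
```

Because each turn is recorded with its tool calls, the same transcript can drive the quality checks and exports described later.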

Tutorial Interface

Note:

Interactive Tester Coming Soon

This tutorial is currently in development. The prompt tester will include:

  • Full-featured prompt editor with syntax highlighting
  • Pre-built conversation scenarios (happy path, edge cases, errors)
  • Simulated agent responses showing reasoning
  • Tool call visualization (which tools, what parameters)
  • Side-by-side prompt comparison
  • Export validated prompts as JSON

For now, use the Prompt Design Guide and test prompts in a development agent.

Preview: Tester Features

Feature 1: Prompt Editor

Write or load system prompts with structured guidance:

Sections:

  • Agent Identity — Name, tenant, tone, style rules
  • Capabilities — Tool definitions with when, args, success_say
  • Policy — Data rules and escalation guidelines
  • Dialog Flow — High-level conversation structure
  • Response Templates — Pre-written phrases for common situations
  • Examples — Conversation examples showing ideal behavior

Editor Features:

  • Syntax highlighting for JSON
  • Auto-completion for common fields
  • Validation of required sections
  • Character count and token estimation
  • Template library (load Otto, Cypress, or blank)
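Two of the editor checks above (section validation and token estimation) can be approximated with simple code. The required section names and the 4-characters-per-token heuristic are assumptions for illustration, not the editor's real rules.

```python
# Minimal sketch of two editor checks: verify required prompt sections are
# present, and give a rough token estimate for the prompt text.
import json

REQUIRED_SECTIONS = ["identity", "capabilities", "policy"]  # assumed names

def validate_prompt(prompt_json: str):
    """Return a list of missing required sections (empty means valid)."""
    prompt = json.loads(prompt_json)
    return [s for s in REQUIRED_SECTIONS if s not in prompt]

def estimate_tokens(prompt_json: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return len(prompt_json) // 4

draft = json.dumps({"identity": {"name": "Otto"}, "capabilities": []})
print(validate_prompt(draft))  # ['policy']
print(estimate_tokens(draft))
```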

Feature 2: Scenario Library

Test your prompt against real-world situations:

Pre-built Scenarios:

Happy Path:

  • User with clear intent and all required information
  • User provides information incrementally
  • User browses before deciding

Edge Cases:

  • User asks for something unavailable
  • User provides contradictory information
  • User changes their mind mid-conversation
  • User asks off-topic questions

Error Handling:

  • Tool returns no results
  • Tool times out or fails
  • User provides invalid input (wrong dates, negative quantities)
  • API returns unexpected data structure

Stress Tests:

  • User is vague or unclear
  • User asks complex multi-part questions
  • User interrupts or redirects frequently
  • Rapid-fire questions without pausing
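A scenario library like the one above is essentially a tagged collection you can filter by category. The structure and field names below are illustrative assumptions.

```python
# Sketch of a scenario library as filterable data.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    category: str  # "happy_path", "edge_case", "error", or "stress"
    turns: list

LIBRARY = [
    Scenario("Clear intent, all info", "happy_path",
             ["I need a cordless drill for a deck, under $200"]),
    Scenario("Unavailable item", "edge_case",
             ["Do you carry the Ryobi 18V drill?"]),
    Scenario("Tool returns no results", "error",
             ["Show me left-handed hammers"]),
    Scenario("Three questions at once", "stress",
             ["What's your best drill, is it in stock, can you ship today?"]),
]

def by_category(category):
    return [s for s in LIBRARY if s.category == category]

print([s.name for s in by_category("edge_case")])  # ['Unavailable item']
```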

Feature 3: Simulated Conversations

See how your agent handles each scenario:

Example Scenario: Product Search with Budget

User: "I need a cordless drill for building a deck, under $200"

Simulated Agent Response:

Agent Thinking:
  - User has clear use case: deck building (heavy duty)
  - Budget constraint: under $200
  - Need to call search_products tool
  - Filter to drills suitable for deck work
  - Use brushless for durability

Tool Call:
  search_products({
    query: "brushless cordless drill",
    category: "drills",
    max_price: 200,
    limit: 10
  })

Mock Tool Response:
  {
    "success": true,
    "count": 8,
    "products": [...]
  }

Agent Says:
  "Great choice for a deck project! For heavy-duty fastening, you'll want a brushless drill with good battery life. I found several matches; here are three excellent options under $200. The DeWalt 20V Max is our top seller for deck work — powerful, long battery life, and a 5-year warranty at $189.99. Let me show you all three."


UI Action:
  show_component("catalog_results", { items: products })

Quality Check:
  ✓ Used tools instead of guessing
  ✓ Acknowledged budget constraint
  ✓ Recommended based on use case
  ✓ Explained why (brushless for durability)
  ✓ Showed top recommendation
  ✓ Rendered visual component
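The quality checks shown above can be expressed as simple automated predicates over a simulated turn. The turn structure and the keyword heuristics are stand-ins for whatever evaluation the tester actually performs.

```python
# Sketch of automated quality checks for one simulated agent turn.

def quality_check(turn):
    """Return {check_name: passed} for one simulated agent turn."""
    return {
        "used_tools": len(turn["tool_calls"]) > 0,
        "acknowledged_budget": "under" in turn["agent"].lower()
                               or "$" in turn["agent"],
        "rendered_component": turn.get("ui_action") is not None,
    }

turn = {
    "agent": "I found three excellent options under $200.",
    "tool_calls": [{"name": "search_products", "args": {"max_price": 200}}],
    "ui_action": {"component": "catalog_results"},
}
results = quality_check(turn)
print(all(results.values()))  # True
```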

Feature 4: Tool Call Visualization

See exactly which tools the agent calls and why:

Visualization:

Conversation Turn 3:

User: "What's the difference between the Makita and Milwaukee?"

Agent Decision Tree:
  1. User asking for comparison → Need product details
  2. Two products mentioned → Call get_product_stats twice
  3. Present comparison focusing on key differences

Tool Calls:
  [1] get_product_stats({ product_id: "makita-18v-drill" })
  [2] get_product_stats({ product_id: "milwaukee-18v-drill" })

Responses:
  [1] { name: "Makita", weight: "7.3 lbs", power: "250 UWO", ... }
  [2] { name: "Milwaukee", weight: "8.9 lbs", power: "300 UWO", ... }

Agent Synthesizes:
  "The Makita is lighter at 7.3 lbs — great for overhead work and homeowner projects. The Milwaukee is heavier at 8.9 lbs but delivers more driving power. For weekend projects, the Makita is perfect. For daily contractor work, the Milwaukee is worth the extra cost."

Quality Check:
  ✓ Called tools for both products
  ✓ Compared key differentiators (weight, power)
  ✓ Explained tradeoffs in plain language
  ✓ Recommended based on use case

Feature 5: Side-by-Side Comparison

Test two prompt versions simultaneously:

Use Case: You're refining how the agent handles "out of stock" scenarios

Prompt A (Original):

{
  "escalation": [
    "If product is out of stock, apologize and ask if they want an alternative"
  ]
}

Prompt B (Improved):

{
  "policy": {
    "data_rules": [
      "If a product is out of stock, suggest in-stock alternatives with similar specs",
      "Mention why the alternative is comparable or better"
    ]
  },
  "response_templates": {
    "out_of_stock": "[Product] is currently out of stock, but we have [similar product] available now. It's in the same price range and has [comparable or better feature]. Would you like to see that?"
  }
}

Scenario: User asks for out-of-stock item

Results:

| Aspect | Prompt A | Prompt B |
| --- | --- | --- |
| Response | "I'm sorry, that item is out of stock. Would you like to see alternatives?" | "The Ryobi 18V is currently out of stock, but we have a DeWalt 20V impact driver available now. It's in the same price range ($99) and actually has better torque at 1,825 in-lbs. Would you like me to show you that?" |
| Proactivity | Waits for user to ask | Immediately suggests alternative |
| Detail | Vague | Specific product, price, key spec |
| Value | Low (user must ask) | High (agent does the work) |
| Winner | | ✓ |
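The aspects compared above can be approximated with simple heuristics: does the response name an available alternative, quote a price, and avoid making the user do the asking? These checks are illustrative, not the tester's real scoring.

```python
# Sketch of heuristic scoring for out-of-stock responses.

def score_response(text):
    return {
        "proactive": "available" in text.lower() or "we have" in text.lower(),
        "specific_price": "$" in text,
        "asks_user_to_do_work": text.strip().endswith("alternatives?"),
    }

prompt_a = ("I'm sorry, that item is out of stock. "
            "Would you like to see alternatives?")
prompt_b = ("The Ryobi 18V is currently out of stock, but we have a DeWalt "
            "20V impact driver available now, in the same price range ($99).")

print(score_response(prompt_a))
print(score_response(prompt_b))
```

Running both prompts through the same heuristics makes the winner obvious at a glance: Prompt B is proactive and specific, while Prompt A pushes the work back onto the user.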

Feature 6: Conversation Export

Save successful test conversations:

  • Export as training examples to add to your prompt
  • Save edge case handling for documentation
  • Share with team for review
  • Use in regression testing
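The export step could look something like this: save a passing conversation as a JSON structure that can be pasted into the prompt's examples section. The file layout is an assumption for illustration.

```python
# Sketch of exporting a validated conversation as a JSON training example.
import json

def export_example(transcript, path):
    example = {"examples": [
        {"user": t["user"], "agent": t["agent"]} for t in transcript
    ]}
    with open(path, "w") as f:
        json.dump(example, f, indent=2)
    return example

transcript = [{"user": "I need a drill",
               "agent": "Happy to help! What's the project?"}]
exported = export_example(transcript, "validated_example.json")
print(len(exported["examples"]))  # 1
```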

Guided Tutorial Flow

Step 1: Load or Write Your Prompt

  • Option A: Load a template (Otto, Cypress, Support Agent)
  • Option B: Paste your existing prompt
  • Option C: Start blank and build section by section

Step 2: Test Happy Path Scenarios

Start with straightforward conversations:

  • User with clear intent
  • All information provided
  • Tools return expected results
  • Successful outcome

Goal: Ensure basic functionality works

Step 3: Test Edge Cases

Challenge your prompt:

  • Vague user requests
  • Missing information
  • User changes mind
  • Out of stock or unavailable

Goal: Verify graceful handling of uncertainty

Step 4: Test Error Scenarios

Simulate failures:

  • Tool returns no results
  • Tool timeout
  • Invalid user input
  • API errors

Goal: Ensure user never sees broken experience

Step 5: Refine and Iterate

Based on test results:

  • Adjust tone or phrasing
  • Add handling for missed cases
  • Clarify tool usage rules
  • Add response templates

Step 6: Validate and Export

When satisfied with behavior:

  • Run full scenario suite (10-15 tests)
  • Export validated prompt
  • Add passing conversation examples to prompt
  • Deploy to staging environment

Common Prompt Issues to Test For

Tool Hallucination

Agent guesses product details instead of calling tools — test data_rules enforcement

Verbose Responses

Agent talks too much — test style_rules for conciseness

Off-Topic

Agent engages with unrelated questions instead of redirecting — test boundary enforcement

Unclear Tool Selection

Agent calls wrong tool or skips tools — test capability definitions

Advanced Testing Techniques

A/B Testing Response Templates

Compare different phrasings:

  • Formal vs casual greetings
  • Short vs detailed explanations
  • Proactive vs reactive suggestions

Stress Testing

Rapid conversation changes:

  • User interrupts mid-flow
  • User asks three questions at once
  • User provides contradictory information

Tone Consistency

Test brand voice across scenarios:

  • Does formality remain consistent?
  • Are sample phrases used appropriately?
  • Does personality shine through in all situations?

Scenario Authoring

Create custom test scenarios:

{
  "scenario": {
    "name": "Budget-Conscious DIYer",
    "description": "User wants a drill but is price-sensitive",
    "turns": [
      {
        "user": "I need a drill for home projects",
        "expected": "Agent asks clarifying questions about use case and budget"
      },
      {
        "user": "Just small repairs, I don't want to spend a lot",
        "expected": "Agent searches for budget-friendly drills and explains value"
      }
    ],
    "success_criteria": [
      "Agent calls search_products with appropriate filters",
      "Agent recommends based on value, not features",
      "Agent explains why budget option is sufficient"
    ]
  }
}
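A minimal checker for custom scenarios like the one above would run each turn and then grade the transcript against the scenario's success criteria. The keyword heuristics in `grade` are stand-ins for whatever evaluation the real tester applies.

```python
# Sketch: grade a transcript against a scenario's success criteria.

def grade(transcript, criteria):
    """Map each criterion to pass/fail using simple keyword heuristics."""
    calls = [c["name"] for t in transcript for c in t["tool_calls"]]
    return {
        c: ("search_products" in calls) if "search_products" in c else True
        for c in criteria
    }

transcript = [
    {"user": "I need a drill for home projects",
     "agent": "What kind of projects, and do you have a budget in mind?",
     "tool_calls": []},
    {"user": "Just small repairs, I don't want to spend a lot",
     "agent": "Here are some budget-friendly options.",
     "tool_calls": [{"name": "search_products",
                     "args": {"category": "drills", "max_price": 100}}]},
]
criteria = ["Agent calls search_products with appropriate filters",
            "Agent recommends based on value, not features"]
print(grade(transcript, criteria))
```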

Quality Rubric

Evaluate agent responses on:

| Criterion | Good | Bad |
| --- | --- | --- |
| Accuracy | Uses tools for all data | Guesses or hallucinates facts |
| Tone | Consistent with personality | Formal when it should be casual, or vice versa |
| Conciseness | 2-3 sentences unless complex | Overly verbose or too terse |
| Helpfulness | Proactive suggestions | Waits for user to ask |
| Tool Usage | Calls correct tools with right params | Wrong tools or missing calls |
| Error Handling | Graceful, offers alternatives | Breaks or gives up |
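Applied in practice, the rubric is a checklist: mark each criterion pass/fail for a response and compute a simple score. The criterion names below follow the table; the scoring scheme is an assumption.

```python
# Sketch of applying the quality rubric as a pass/fail checklist.

RUBRIC = ["accuracy", "tone", "conciseness", "helpfulness",
          "tool_usage", "error_handling"]

def score(marks):
    """marks: {criterion: bool}. Returns (passed, total)."""
    return sum(marks.get(c, False) for c in RUBRIC), len(RUBRIC)

marks = {"accuracy": True, "tone": True, "conciseness": False,
         "helpfulness": True, "tool_usage": True, "error_handling": True}
passed, total = score(marks)
print(f"{passed}/{total}")  # 5/6
```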

What's Next?

After validating your prompt:

Ready to test prompts? Launch the Prompt Tester (Coming Soon)

For now, manually test prompts using the Prompt Design Guide.