
Evaluation & Testing

Reliability is the difference between a demo and a product. The Lab offers built-in validation tools to automatically test your prompts against defined criteria.

Validators

Validators are assertions that run automatically after every execution. A test is considered “Passed” only if all validators succeed.
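
Conceptually, the pass/fail decision is a logical AND over all configured validators. Below is a minimal sketch of that behavior in Python; the names are illustrative and not the Lab's internal API.

```python
from typing import Callable, List

# A validator is just a predicate over the model output (illustrative only).
Validator = Callable[[str], bool]

def run_validators(output: str, validators: List[Validator]) -> bool:
    # A test is "Passed" only if every validator succeeds.
    return all(validator(output) for validator in validators)
```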

Types of Validators

1. Deterministic Validators

Simple, fast, and free checks (see the sketch after the list).

  • Contains: Checks that the output contains a specific string.
  • Exact Match: Strict equality check, mostly useful for classification tasks.
  • Not Contains: Ensures that specific hallucinations or forbidden words are absent.
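
These checks boil down to simple string predicates. A hypothetical sketch (again, not the Lab's implementation):

```python
# Each deterministic validator is a simple string predicate (illustrative only).
def contains(output: str, substring: str) -> bool:
    return substring in output

def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def not_contains(output: str, forbidden: str) -> bool:
    return forbidden not in output

# Example: require an apology, forbid a word the prompt should never emit.
output = "We are sorry for the inconvenience."
print(contains(output, "sorry"))        # True
print(not_contains(output, "refund"))   # True
print(exact_match(output, "POSITIVE"))  # False -- exact match suits classification labels
```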

2. Structural Validators

  • JSON Schema: If you require JSON output, paste your schema here. The Lab validates that the output is (a) valid JSON and (b) conforms to your schema definition (types, required fields). A sketch of this check follows.
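
For illustration, a JSON Schema check can be approximated with the off-the-shelf `jsonschema` package; the schema below is an invented example, and the Lab's own implementation may differ.

```python
import json
from jsonschema import validate, ValidationError

# An example schema: the output must be an object with these typed, required fields.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

def validate_output(raw_output: str) -> bool:
    # (a) the output must parse as JSON at all...
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # (b) ...and must conform to the schema (types, required fields).
    try:
        validate(instance=data, schema=schema)
        return True
    except ValidationError:
        return False

print(validate_output('{"sentiment": "positive", "confidence": 0.92}'))  # True
print(validate_output('{"sentiment": "great"}'))                         # False
```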

3. LLM-as-a-Judge

Use another LLM to evaluate the quality of the response. This is essential for subjective criteria like tone, helpfulness, or safety.

How to Configure:

  1. Select LLM Judge as the validator type.
  2. Criteria: Describe what success looks like in plain English.
    • Example: “The response should be empathetic but professional. It must apologize for the error without admitting legal liability.”
  3. Judge Model: Select a model to perform the evaluation (e.g., gpt-4o or a cheaper gpt-4o-mini).

The Judging Process

Warning: Running a judge incurs additional cost (one extra LLM call).

  1. The Lab takes the User Input, Assistant Output, and your Criteria.
  2. It constructs a specialized “Judge Prompt”.
  3. The Judge Model evaluates the interaction and returns a PASS or FAIL grade along with its reasoning (see the sketch below).
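
A rough sketch of what such a judge call can look like. The prompt wording, the verdict parsing, and the `call_llm` helper are placeholders rather than the Lab's actual judge prompt or API.

```python
def call_llm(model: str, prompt: str) -> str:
    """Placeholder for your LLM client call (e.g., a chat completion request)."""
    raise NotImplementedError

def build_judge_prompt(user_input: str, assistant_output: str, criteria: str) -> str:
    # A specialized prompt asking the judge model for a verdict plus reasoning.
    return (
        "You are grading an AI assistant's response.\n\n"
        f"User input:\n{user_input}\n\n"
        f"Assistant output:\n{assistant_output}\n\n"
        f"Criteria for success:\n{criteria}\n\n"
        "Reply with PASS or FAIL on the first line, followed by your reasoning."
    )

def judge(user_input: str, assistant_output: str, criteria: str) -> tuple[bool, str]:
    prompt = build_judge_prompt(user_input, assistant_output, criteria)
    reply = call_llm(model="gpt-4o-mini", prompt=prompt)  # the one extra LLM call
    verdict, _, reasoning = reply.partition("\n")
    return verdict.strip().upper() == "PASS", reasoning.strip()
```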

Test Cases

You can save specific inputs and expected outputs as Test Cases.

  • Save Case: After a successful run, save the Variables + Input + Validators as a reusable case.
  • Run Suite: Run all saved cases in bulk to ensure no regressions when you edit your prompt (a conceptual sketch follows).
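
Conceptually, a saved case bundles the variables, input, and validators, and a suite run replays every case against the current prompt. A hypothetical sketch, where `run_prompt` stands in for whatever executes your prompt:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TestCase:
    name: str
    variables: Dict[str, str]                # values substituted into the prompt template
    user_input: str
    validators: List[Callable[[str], bool]]  # e.g. the deterministic checks sketched above

def run_suite(cases: List[TestCase],
              run_prompt: Callable[[Dict[str, str], str], str]) -> bool:
    # run_prompt is a placeholder that executes your current prompt with the case's
    # variables and user input, returning the assistant output.
    all_passed = True
    for case in cases:
        output = run_prompt(case.variables, case.user_input)
        passed = all(validator(output) for validator in case.validators)
        all_passed = all_passed and passed
        print(f"{case.name}: {'PASSED' if passed else 'FAILED'}")
    return all_passed
```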