
Evaluation (LLM-as-a-Judge)

The Evaluation feature allows you to automatically assess the quality of your prompt’s output using another LLM call. This is often referred to as “LLM-as-a-Judge”.

Concept

Instead of manually reading every output to check if it meets requirements, you define Criteria in natural language. The system then asks an AI model to evaluate the output against those criteria.

How to Use

  1. Generate Output: First, run your prompt in the Playground to get an output.
  2. Open Judge: Click the Evaluate or Grading tab/button.
  3. Define Criteria: Enter a question or statement that defines success.
    • Example: “Does the response explicitly mention the refund policy?”
    • Example: “Is the tone empathetic and apologetic?”
  4. Run Evaluation: Click Judge.

How It Works

Backend process (a rough sketch in code follows these steps):

  1. The system constructs a new prompt for the “Judge” model.
  2. It feeds in your Criteria and the Output from the previous step.
  3. It instructs the model to return a PASS or FAIL verdict along with Reasoning.
  4. The result is displayed in the UI, helping you quickly validate if your prompt changes are improving or degrading quality.
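For illustration, here is a minimal sketch of that flow as a standalone script, assuming an OpenAI-style chat model as the judge. The prompt template, model name, and judge() helper are hypothetical stand-ins, not the platform’s actual internals.

```python
# Illustrative sketch only: the real judge prompt and model are internal to the
# platform. The SDK, model name, and template below are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are an evaluator. Judge the OUTPUT against the CRITERIA.

CRITERIA:
{criteria}

OUTPUT:
{output}

Respond with a verdict line ("PASS" or "FAIL") followed by your reasoning."""


def judge(criteria: str, output: str, model: str = "gpt-4o") -> tuple[str, str]:
    """Ask a judge model for a PASS/FAIL verdict plus its reasoning."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(criteria=criteria, output=output),
        }],
        temperature=0,  # keep verdicts as reproducible as possible
    )
    text = response.choices[0].message.content.strip()
    verdict, _, reasoning = text.partition("\n")
    return verdict.strip(), reasoning.strip()
```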

Best Practices

  • Be Specific: Vague criteria lead to vague evaluations. “Is it good?” is too vague to judge reliably; “Does it contain a bulleted list of 5 items?” is specific and verifiable.
  • Single Criterion: For best results, test one thing at a time. If you have multiple requirements, create multiple test cases or separate checks, as sketched below.
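If you script your evaluations instead of using the UI, one way to keep to a single criterion per check is to run the judge once for each requirement. This example reuses the hypothetical judge() helper sketched above.

```python
# Hypothetical example: one judge call per criterion, reusing judge() from above.
criteria_list = [
    "Does the response explicitly mention the refund policy?",
    "Is the tone empathetic and apologetic?",
]

output = (
    "We're sorry for the trouble. Per our refund policy, "
    "you can return items within 30 days for a full refund."
)

for criterion in criteria_list:
    verdict, reasoning = judge(criterion, output)
    print(f"{verdict}: {criterion}\n  {reasoning}\n")
```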