The Most Powerful Model for Evaluation & Hallucination Detection

Root Judge is a groundbreaking LLM that sets a new standard for reliable, customizable, and locally deployable evaluation models.

Explainable Outputs

Designed to provide transparent justifications for scoring, enhancing trust in AI-driven assessments.

Fine-Tuned Excellence

State-of-the-art hallucination detection, outperforming both closed-source frontier models (OpenAI's GPT-4o, o1-mini, and o1-preview, and Anthropic's Claude 3.5 Sonnet) and other open-source judge LLMs of similar size.

Open Access for Innovation

With open weights and a focus on privacy-centric deployments, Root Judge fosters innovation while addressing data security concerns.

Root Judge is a powerful mid-sized model that enables reliable and customizable LLM system evaluations. It was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix covering pairwise preference judgments and multi-turn instruction following with source citing.

Root Judge was tested to support complex, user-defined rating rubrics over large context sizes, provide granular qualitative feedback, and support structured evaluation outputs and tool calling. Released under the Apache 2.0 license, Root Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
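As a rough illustration of what a custom-rubric evaluation with structured output might look like, the sketch below builds a judge prompt from a user-defined rubric and validates a JSON verdict. The function names (`build_judge_prompt`, `parse_verdict`), the prompt wording, and the JSON fields (`score`, `justification`) are assumptions made for this example, not Root Judge's documented prompt format or output schema. The model call itself is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    score: float         # normalized to [0, 1] in this sketch
    justification: str   # the judge's free-text explanation


def build_judge_prompt(rubric: str, context: str, response: str) -> str:
    """Assemble a judge prompt from a user-defined rubric (hypothetical layout)."""
    return (
        "You are an evaluator. Apply the rubric to the response.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Context:\n{context}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        'Reply with JSON only: {"score": <number between 0 and 1>, '
        '"justification": "<short explanation>"}'
    )


def parse_verdict(raw: str) -> Verdict:
    """Parse and validate the judge's JSON reply; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    score = data.get("score")
    justification = data.get("justification")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("justification must be a non-empty string")
    return Verdict(score=float(score), justification=justification.strip())
```

Validating the reply before using it means a malformed or out-of-range verdict fails loudly instead of silently skewing an evaluation run.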

Leading Hallucination Detection

Root Judge surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and achieves state-of-the-art hallucination detection compared with leading closed models, at a fraction of the cost.

Detects Instruction Failures

Root Judge outperforms most leading closed models at detecting instruction-following failures in evaluations, while providing detailed, structured justifications on long inputs of up to 32k tokens. The accompanying chart reports total pass rates and consistency (delta).

State-of-the-Art Judge LLM for Evaluation & Hallucination Detection

Trust, Control, and Safety: The Big GenAI App Challenges

LLMs are unpredictable & hard to trust

Shipping to production is risky and costly

LLM behaviour & quality are hard to control

Designed to serve as an LLM-as-a-Judge, enabling organizations to:

Detect Context-Grounded Hallucinations

Facilitate Pairwise Preference Judgments

Support Privacy-Focused Deployments
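To make the pairwise preference use case above more concrete, here is a minimal sketch of one way to wrap a judge for A/B comparisons. It queries the judge twice with the answer order swapped and only declares a winner when both orderings agree. The order swap is a common mitigation for position bias and is an addition of this example, not something the page describes. The `Judge` callable is a hypothetical stand-in for a real call to a judge model such as Root Judge.

```python
from typing import Callable

# Hypothetical judge interface: given (prompt, first_answer, second_answer),
# return "first" or "second" to name the preferred answer.
Judge = Callable[[str, str, str], str]


def pairwise_preference(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after querying the judge in both orders."""
    forward = judge(prompt, answer_a, answer_b)   # A is shown first
    backward = judge(prompt, answer_b, answer_a)  # B is shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the ordering, so treat it as inconclusive.
    return "tie"
```

A judge that always prefers whichever answer it sees first yields "tie" under this scheme, which surfaces the bias instead of rewarding it.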

Frequently Asked Questions
