State-of-the-Art Judge LLM for Evaluation & Hallucination Detection

Root Judge is a groundbreaking LLM that sets a new standard for reliable, customizable, and locally deployable evaluation models.

Trust, Control, and Safety:
The Big GenAI App Challenges

Trust

LLMs are unpredictable & hard to trust

The unpredictable behavior of LLMs can create risks to your reputation and cause compliance issues.

Control

Shipping to production is risky and costly

Unclear LLM performance can lead to delays in launching your product and drive up development costs.

Safety

LLM behaviour & quality are hard to control

Managing how your model behaves and measuring its quality is tough, requiring specialized knowledge and significant time.

Explainable Outputs
Designed to provide transparent justifications for scoring, enhancing trust in AI-driven assessments.
Fine-Tuned Excellence
State-of-the-art hallucination detection, outperforming closed-source frontier models such as OpenAI's GPT-4o, o1-mini, and o1-preview and Anthropic's Sonnet-3.5, as well as other open-source judge LLMs of similar size.
Open Access for Innovation
With open weights and a focus on privacy-centric deployments, Root Judge fosters innovation while addressing data security concerns.

Root Judge is a powerful mid-sized model that enables reliable, customizable LLM system evaluations. It was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix covering pairwise preference judgments and multi-turn instruction following with source citation.

Root Judge has been tested to support complex, user-defined rating rubrics over large context sizes, provide granular qualitative feedback, and produce structured evaluation outputs and tool calls. Released under the Apache 2.0 license, it is an open, accessible model suited to developers and companies seeking cost-effective, rapid evaluations with custom rubrics.
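As a sketch of how a custom-rubric, structured-output evaluation might be wired up: the snippet below assembles an OpenAI-compatible chat request for a locally served judge. The model name `root-judge`, the rubric text, and the JSON response format here are illustrative assumptions about a typical self-hosted deployment (e.g. behind vLLM), not details from the release itself.

```python
import json


def build_judge_request(rubric: str, question: str, answer: str) -> dict:
    """Assemble an OpenAI-compatible chat payload that asks the judge to
    score one answer against a user-defined rubric and reply in JSON."""
    system = (
        "You are an evaluation judge. Score the answer against the rubric.\n"
        f"Rubric:\n{rubric}\n"
        'Respond with JSON: {"score": <1-5>, "justification": "<reason>"}'
    )
    user = f"Question:\n{question}\n\nAnswer to evaluate:\n{answer}"
    return {
        "model": "root-judge",  # hypothetical local deployment name
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.0,  # deterministic judging
        "response_format": {"type": "json_object"},  # structured output
    }


payload = build_judge_request(
    rubric="5 = fully grounded in the cited sources; 1 = unsupported claims.",
    question="What is the refund window?",
    answer="Refunds are accepted within 30 days of purchase.",
)
print(json.dumps(payload, indent=2))
```

The payload can then be POSTed to any OpenAI-compatible `/v1/chat/completions` endpoint; pinning temperature to 0 keeps repeated judgments consistent.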

Leading Hallucination Detection

Root Judge surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and achieves state-of-the-art hallucination detection compared to leading closed models, at a fraction of the cost.

Detects Instruction Failures

Root Judge outperforms most leading closed models at detecting instruction-following failures, while providing detailed, structured justifications on long inputs of up to 32k tokens. Total pass rates and consistency (delta) are shown above.

Root Judge represents a major leap in how organizations can evaluate and optimize their LLM systems. Its ability to transparently deliver context-grounded judgments ensures that businesses can deploy AI responsibly and effectively, while optimizing inference costs and ensuring privacy.
Ari Heljakka
CEO of Root Signals
With solutions for reliable and explainable AI, Root Signals is contributing to a topic critical for enterprises.

The successful training of Root Judge on the LUMI supercomputer demonstrates both the power of AMD compute platforms and the vibrancy of Finland's AI ecosystem. This is exactly the kind of innovation we need to see more of in Finland and Europe.
Peter Sarlin
Co-Founder and CVP, AMD Silo AI

Designed to serve as an LLM-as-a-Judge,
enabling organizations to:

Detect Context-Grounded Hallucinations
Automatically detect, describe, and block hallucinations in Retrieval-Augmented Generation (RAG) pipelines.
Facilitate Pairwise Preference Judgments
Use customizable rubrics for tasks like inference-time compute optimization or synthetic data generation requiring Best-of-N decisions.
Support Privacy-Focused Deployments
Avoid sending sensitive data over the public internet while leveraging cutting-edge LLM capabilities.

Frequently Asked Questions