EvalOps 101 – Building the Bridge from LLM Prototypes to Production

August 6, 2024

This post briefly examines the hypothesis that a new term is useful for capturing the peculiarities of running workflows built on LLM-as-a-judge evaluation techniques. It assumes you are already familiar with LLMs in software development, and it will surely leave you with more questions than answers.

To take LLM-based applications – such as chatbots or text summarizers – to production, you need controls and measurements. A powerful emerging technique to this end is to record the outputs of your operational LLM and use another LLM as a judge to analyze and measure them.
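To make the arrangement concrete, here is a minimal sketch of an operator/judge loop. It assumes an OpenAI-compatible chat-completions client; the model names, the rubric, and the 1-5 scale are illustrative placeholders, not recommendations.

```python
# A minimal operator/judge loop. Model names, the rubric, and the 1-5 scale
# are illustrative placeholders; any OpenAI-compatible chat API works the same way.
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are an evaluator. Score the ASSISTANT ANSWER for helpfulness "
    "on a 1-5 scale and reply with the integer only."
)

def generate(prompt: str) -> str:
    """The operational LLM that produces the user-facing answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder operational model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def judge(prompt: str, answer: str) -> int:
    """A second LLM that scores the recorded output against the rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model; it may differ from the operator
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"QUESTION:\n{prompt}\n\nASSISTANT ANSWER:\n{answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())

question = "Summarize our refund policy in two sentences."
answer = generate(question)
print(answer, judge(question, answer))
```

In production you would record the question, the answer, and the judge's score together and aggregate the scores, rather than print them.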

This technique, however, brings about a particular set of questions, such as:

How do you set up such measurements reliably in the first place? Asking the judge nicely does get you started, but we are now talking about production-grade reliability.

How do you juggle the sheer operational overhead of both the judge LLMs and the operational LLMs?

Can you use the same model for both?

How can you trust the judge itself?

How do you handle the performance-cost tradeoffs for the two kinds of models separately? After all, you are now paying both for the model that provides the direct value (responses) and for the model that does not.

For the sake of argument, let’s assume this endeavor is worthwhile. This article simply focuses on charting the necessary parts of this arrangement. The irregular, resource-intensive and time-sensitive nature of LLM judge management means going from development-time measurements to production monitoring and scaling, which feeds new measurements back into development, spurring the need for better tests and more optimized judges, which are then deployed back to production, and so on. Moving from static evaluations of model performance to a dynamic, real-time feedback loop resembles the way software development and Ops merged into DevOps. Hence, EvalOps.

EvalOps =def. the set of practices to automate the semantic alignment of an AI automation based on a strong AI model ("operator") with the norms, policies, and KPIs of specific human communities and organizations, by using another AI model ("judge").

The practical purpose of EvalOps is to enable and accelerate the adoption and optimization feedback cycle of GenAI tools for real-world applications. Essentially, the hypothesis is:

If you treat LLM evaluation as a separate branch of Ops work, you will naturally end up building the processes and infrastructure needed for GenAI business adoption. If you treat it as “just testing”, you won’t.


Why not just MLOps or LLMOps?

EvalOps seems largely (but not entirely) a subset of LLMOps, which in turn can be considered largely (but not entirely) a subset of MLOps. EvalOps is the specific practice of building the AI-based governance and measurement layers that contain the operative AI pipelines constructed with LLMOps, while LLMOps reuses and is driven by models developed with MLOps.

For functional context, to illustrate what makes EvalOps unique, consider that software product teams can rarely predict how a service will ultimately be used by actual users. This calls for a feedback loop. For non-deterministic LLM applications, it gets worse. Typically, once a chatbot has been tasked with specific kinds of language transformations, we discover a long list of corner cases to manage. We need to make semantic and even strategic decisions: how certain questions should be answered by the chatbot, how certain products are supposed to be used, how we want to talk about them, which terms we should or should not use, whether our existing policies actually cover and properly treat certain situations, and whether those sets of rules are up to date, and so on.

For human context, in addition to engineers and data scientists, EvalOps also includes many types of people not involved with the rest of LLMOps: domain experts who check objective performance, the AI governance lead who mediates the external requirements of the organization and the surrounding regulatory environment, human evaluators who simulate end-users, the business leads and product owners who set and interpret goals and KPIs, and so on. While the eval stack is an integral part of the feedback loop used to improve the LLMOps pipeline, it can almost be split out of the rest of the process.

The functional aspects of EvalOps include:

  • It operates on semantic measurement and control (in contrast to functional or data-scientific performance requirements)
  • It interfaces systems with complex and changing human requirements, including user-specific requirements (e.g. child safety), cultural differences (e.g. tone), formal semantic compliance, etc.
  • It concerns the relationships of the internal representations of artificial minds to external, data-defined reality (grounding) and to human reality (alignment), mediated by other artificial minds (judging)
  • Managing the complexity of using LLM-as-judge techniques in practice (asking one model to judge a statement by another model). EvalOps judgements can also be delivered by e.g. semantic distances across pieces of content (comparing the embedding vectors of statement A and statement B), or by “dumb” controls and metrics for low-level judgments (see the sketch after this list).
  • Solving semantic unification (the question of which extensional metrics, such as giving certain outputs on certain inputs, approximate the intensional definition, such as 'avoiding age discrimination bias')
  • Continuous monitoring and improvement based on semantic evaluation by automated evaluators and human reviewers
  • Stakeholders: developers, domain experts, business, AI governance, and regulators
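As referenced in the list above, a “dumb” embedding-distance control can deliver low-level judgments without a full judge call. The sketch below assumes an OpenAI-compatible embeddings endpoint; the embedding model name and the 0.75 threshold are assumptions for illustration.

```python
# An embedding-distance "judgment" as a cheap, low-level control.
# The embedding model name and the 0.75 threshold are assumptions for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def semantic_similarity(statement_a: str, statement_b: str) -> float:
    """Cosine similarity between the embeddings of two statements."""
    a, b = embed(statement_a), embed(statement_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Flag answers that drift too far from an approved reference statement.
reference = "Refunds are available within 30 days of purchase."
candidate = "You can get your money back within a month of buying."
if semantic_similarity(reference, candidate) < 0.75:  # threshold is an assumption
    print("Route to a human reviewer or a stronger judge.")
```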
Comparison Table

|                    | EvalOps | LLMOps | MLOps |
|--------------------|---------|--------|-------|
| Simplified         | Judgment | Embodiment | Brains & Cognition |
| Purpose            | Build AI controllers to align AI behaviors with human goals | Manage and optimize foundation models to execute tasks with AI | Build and maintain AI models for smarter software |
| Deliverables       | Metric evaluators, objectives, and guardrails | Operative LLM pipelines and agents | Models |
| Main Components    | Predicates, model-model interactions management, test/calibration data | Prompts, model management, monitoring, instruction/test data | Model training and serving infrastructure, models, train/test data |
| Users              | AI product owner, domain expert, governance lead | AI engineer | Data scientist, ML engineer |
| Framing & Modeling | Semantic Quantification | Empirical & systems view | Statistical & data view |
| Agnostic of        | Operative pipeline | Models | Data contents |


The main components of EvalOps include:

  • Building and managing Evaluators (predicates for specific attributes like “persuasiveness”) and aligning them with human judgements (a minimal sketch follows this list)
  • Objectives connecting evaluators to KPIs (e.g. “helpfulness” approximated by 10 evaluators)
  • Fine-tuning, managing and evaluating Judges (models used to implement Evaluators) including general-purpose LLMs as well as judgment-oriented LLMs
  • Calibrating & evaluating evaluators themselves
  • Model Cross-evaluation; management and containment of data contamination
  • Guardrails: Control, intervention and routing of content
  • Monitoring in real-time across evaluated metrics and disseminating relevant signals and patterns
  • Defining and managing the source of truth and connecting the AI to it; grounding (in data, in other judges including humans, in intents/definitions)
  • Semantic Separation of Concerns (separate the human goal/value from alternative technical measurements for it) by managing the Objective/sub-objective hierarchy
  • Semantic Quantification: The techniques of converting an LLM assessment of content into a meaningful and principled metric
  • Capturing & cross-comparing semantically meaningful dimensions ("is conciseness separate from informativeness")
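As referenced in the list, one way to picture how Evaluators, Objectives, and calibration fit together is as plain data structures. The sketch below is illustrative only; the class names, fields, and the agreement measure are assumptions, not a prescribed schema.

```python
# One possible way to organize the core objects; class names, fields, and the
# agreement measure are illustrative assumptions, not a prescribed schema.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Evaluator:
    """A predicate for one semantic attribute, e.g. 'persuasiveness'."""
    name: str
    score: Callable[[str], float]  # maps a piece of content to a [0, 1] score
    # (text, human_label) pairs used to calibrate the evaluator against humans
    calibration_examples: list = field(default_factory=list)

    def agreement_with_humans(self) -> float:
        """Fraction of calibration examples where the evaluator agrees with the human label."""
        if not self.calibration_examples:
            return float("nan")
        hits = sum(
            1 for text, human_label in self.calibration_examples
            if (self.score(text) >= 0.5) == human_label
        )
        return hits / len(self.calibration_examples)

@dataclass
class Objective:
    """A KPI-level goal approximated by a weighted stack of evaluators."""
    kpi: str  # e.g. 'helpfulness'
    evaluators: list[Evaluator]
    weights: list[float]

    def score(self, content: str) -> float:
        total = sum(self.weights)
        return sum(w * e.score(content) for e, w in zip(self.evaluators, self.weights)) / total
```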

We will continue examining these topics in other posts. We conclude with brief explanations of some unusual ideas and terms used above. For the sake of succinctness, we must sacrifice some clarity.

How do you do LLM-to-LLM judgements and why is this special?

In short, this is special because you have a semantic (artificial) system judging another semantic (artificial) system, calibrated by a third semantic system known as humans. How you do it is outside the scope of this document.


Note on Semantic quantification

Scoring LLM-based automations across various sophisticated semantic dimensions is what we call semantic quantification. Such semantic dimensions can be, for example, ‘degree of compliance with regulation X’ or ‘degree of similarity to product Y’. This connects technical KPIs to business KPIs (rather than to traditional low-level ML metrics). The need for a neutral mediator grows with the number of model providers.
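One hedged sketch of what such a conversion can look like in practice: repeat a noisy judge verdict, normalize it onto a fixed scale, and report the spread alongside the score. The sampling count, the 1-5 input scale, and the function names below are assumptions for illustration.

```python
# Turning raw judge verdicts into a stable, normalized metric. The sampling
# count, the 1-5 input scale, and the function names are assumptions.
import statistics

def quantify(judge_fn, content: str, samples: int = 5) -> dict:
    """Repeat a noisy 1-5 judge call, then report a [0, 1] score with its spread."""
    raw = [judge_fn(content) for _ in range(samples)]
    normalized = [(r - 1) / 4 for r in raw]      # map 1..5 onto 0..1
    return {
        "score": statistics.mean(normalized),    # the reported metric
        "stdev": statistics.pstdev(normalized),  # flags unstable judgments
    }

# Usage with any judge callable, e.g. the judge() sketch earlier in this post:
# quantify(lambda text: judge("Is this compliant with policy X?", text), some_answer)
```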


Note on Semantic Separation of Concerns

As an example, consider that your user communication must, under the EU AI Act, adhere to ‘Do not use language that is offensive to minorities’. This is a top-level intent. Organization A can then interpret it by splitting it into sub-intents such as ‘Use gender-neutral language’, ‘Do not mention or answer any questions related to ethnicity’, etc.

The sub-intents then need to be expressed as stacks of evaluators. E.g. ‘Use gender-neutral language’ could turn into metrics that separately evaluate the presence of certain pronouns, the presence of certain stereotypical biases, etc. So for flexibility, one must allow for both intensional and extensional variants. The Semantic Separation of Concerns must allow both for different splits into sub-intents and for different evaluator stacks that attempt to implement each intent.
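One way to picture this, using the example above, is to express the intent hierarchy as data so that alternative splits and alternative evaluator stacks can be swapped independently. The evaluator names below are illustrative placeholders.

```python
# The example hierarchy from this note, expressed as data so that alternative
# splits and alternative evaluator stacks can be swapped independently.
# The evaluator names are illustrative placeholders.
intent_hierarchy = {
    "intent": "Do not use language that is offensive to minorities",
    "sub_intents": [
        {
            "intent": "Use gender-neutral language",
            "evaluator_stack": [
                "pronoun_presence_check",   # extensional: pattern-level metric
                "stereotype_bias_judge",    # intensional: LLM-judged attribute
            ],
        },
        {
            "intent": "Do not mention or answer any questions related to ethnicity",
            "evaluator_stack": ["ethnicity_topic_classifier"],
        },
    ],
}
```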
