This post concisely examines the hypothesis that a new term is useful to capture the peculiarities of handling workflows built on LLM-as-a-judge evaluation techniques; it assumes you are already familiar with LLMs in software development. The post will surely leave you with more questions than answers.
To take LLM-based applications – such as chatbots or text summarizers – to production, you need controls and measurements. A powerful emerging technique to this end is to record the outputs of your operational LLM and use another LLM as a judge to analyze and measure them.
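As a minimal sketch of the pattern, assuming a hypothetical `call_llm` helper that stands in for whatever SDK your model provider offers (the model names and the judging prompt are placeholders, not a specific vendor's API):

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical helper standing
# in for your provider's SDK; the model names and prompt are placeholders.

evaluation_log: list[dict] = []  # in practice, a database or tracing backend

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return the text response."""
    return f"[{model} response to: {prompt[:40]}...]"  # replace with a real SDK call

def handle_request(user_input: str) -> str:
    # 1. The operational ("operator") LLM produces the value-adding response.
    response = call_llm("operator-model", user_input)

    # 2. Ask a second ("judge") LLM to assess the recorded output.
    judge_prompt = (
        "You are an evaluator. Rate the following answer for accuracy and "
        "policy compliance on a 1-5 scale, and justify briefly.\n\n"
        f"Question: {user_input}\nAnswer: {response}"
    )
    verdict = call_llm("judge-model", judge_prompt)

    # 3. Store the verdict alongside the output for offline analysis.
    evaluation_log.append({"input": user_input, "output": response, "verdict": verdict})
    return response
```

The interesting work is everything around a snippet like this: making the judge reliable, affordable, and trusted.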
This technique, however, brings about a particular set of questions, such as:
How do you set up such measurements in a reliable manner in the first place? Asking nicely does get you started, but we are now talking about production-grade reliability.
How do you juggle the sheer operational overhead of both the judge LLMs and the operational LLMs?
Can you use the same model for both?
How can you trust the judge itself?
How do you handle the performance-cost tradeoffs for both kinds of models separately? After all, you are now paying both for the model that provides the direct value (responses) and for the one that does not.
For the sake of argument, let’s assume this endeavor is worthwhile. This article simply focuses on charting the necessary parts of this arrangement. LLM judge management is irregular, resource-intensive, and time-sensitive: it goes from development-time measurements to production monitoring and scaling, which feeds new measurements back to development, spurring the need for better tests and more optimized judges, which are then deployed back to production, and so on. Moving from static evaluations of model performance to a dynamic feedback loop driven by real-time performance resembles the way software development and Ops merged into DevOps. Hence, EvalOps.
EvalOps =def. the set of practices to automate the semantic alignment of an AI automation based on a strong AI model ("operator") with the norms, policies, and KPIs of specific human communities and organizations, by using another AI model ("judge").
The practical purpose of EvalOps is to enable and accelerate the adoption and optimization feedback cycle of GenAI tools for real-world applications. Essentially, the hypothesis is:
If you treat LLM evaluation as a separate branch of Ops work, you will naturally end up building the processes and infrastructure needed for GenAI business adoption. If you treat it as “just testing”, you won’t.
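To make the feedback loop sketched above a bit more concrete, here is a deliberately skeletal, hypothetical view of one cycle; every function in it is a placeholder for infrastructure you would build yourself (judging, traffic sampling, test-suite management), not an existing tool:

```python
# A skeletal view of the EvalOps feedback loop. All functions are hypothetical
# placeholders: in practice judging, sampling, and test management sit on top
# of your own logging, evaluation, and CI infrastructure.

def judge(item: str) -> float:
    """Placeholder judge: score one operator output between 0 and 1."""
    return 0.5  # replace with an LLM-as-a-judge call

def sample_production_traffic() -> list[str]:
    return ["recorded live output A", "recorded live output B"]  # from monitoring

def evalops_cycle(test_suite: list[str]) -> list[str]:
    # Development time: measure the current system against curated cases.
    baseline_scores = [judge(case) for case in test_suite]

    # Production: judge a sample of live outputs with the same evaluators.
    live_outputs = sample_production_traffic()
    live_scores = [judge(output) for output in live_outputs]

    # Feedback: low-scoring live cases become new development-time tests,
    # which in turn motivate better (or cheaper) judges for the next cycle.
    threshold = min(baseline_scores, default=0.0)
    new_cases = [o for o, s in zip(live_outputs, live_scores) if s < threshold]
    return test_suite + new_cases

# One turn of the loop, starting from a small curated test suite.
updated_suite = evalops_cycle(["curated development-time example"])
```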
EvalOps seems to be largely (but not entirely) a subset of LLMOps, which in itself can be considered largely (but not entirely) a subset of MLOps. EvalOps is the specific practice of building the AI-based governance and measurement layers that contain the operational AI pipelines constructed with LLMOps, while LLMOps reuses and is driven by models developed with MLOps.
For functional context, to illustrate the unique nature of EvalOps, consider that software product teams can rarely predict how a service will ultimately be used by actual users. This calls for a feedback loop. For non-deterministic LLM applications, it gets worse. Typically, only once a chatbot has been tasked with serving specific kinds of language transformations do we realize the long list of corner cases to be managed. We need to make semantic and even strategic decisions about how certain questions should be answered by a chatbot: redefining how certain products are supposed to be used, how we want to talk about them, which terms we should or should not use, whether our existing policies actually cover and properly treat certain situations, whether those sets of rules are up to date, and so on.
For human context, in addition to engineers and data scientists, EvalOps also includes many types of people not involved with the rest of LLMOps: domain experts who check the objective performance, the AI governance lead who mediates between the external requirements of the organization and the surrounding regulatory environment, human evaluators who simulate end users, the business leads and product owners who set and interpret goals and KPIs, and so on. While the eval stack is an integral part of the feedback loop used to improve the LLMOps pipeline, it can almost be split out of the rest of the process.
The functional aspects of EvalOps include:
The main components of EvalOps include:
We will continue examining these topics in other posts. We conclude with some brief explanations of unusual ideas and terms used above; for the sake of succinctness, we sacrifice some clarity.
How do you do LLM-to-LLM judgements and why is this special?
In short, this is special because you have a semantic (artificial) system judging another semantic (artificial) system, to be calibrated by a semantic system known as humans. How you do it is outside the scope of this document.
Note on Semantic Quantification:
Scoring LLM-based automations across various sophisticated semantic dimensions is what we call semantic quantification. Such semantic dimensions can be, for example, ‘degree of compliance with regulation X’ or ‘degree of similarity to product Y’. This connects technical KPIs to business KPIs, rather than to traditional low-level ML metrics. The need for a neutral mediator grows with the number of model providers.
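As a rough sketch of what semantic quantification could look like in practice (the `call_llm` helper, the dimension names, and the canned JSON verdict are all hypothetical placeholders):

```python
# A sketch of semantic quantification: ask a judge model to score one operator
# output along named semantic dimensions, returning numbers that can be rolled
# up into business-level KPIs. `call_llm` and the canned verdict are placeholders.

import json

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real model call; returns a canned JSON verdict."""
    return '{"compliance_with_regulation_X": 4, "similarity_to_product_Y": 2}'

def quantify(output_text: str, dimensions: list[str]) -> dict[str, int]:
    prompt = (
        "Score the text below on each dimension from 1 (worst) to 5 (best). "
        "Reply with a single JSON object mapping dimension name to score.\n\n"
        f"Dimensions: {dimensions}\nText: {output_text}"
    )
    verdict = call_llm("judge-model", prompt)
    return json.loads(verdict)  # in practice: validate, retry, and log failures

scores = quantify(
    "Our product handles personal data as follows...",
    ["compliance_with_regulation_X", "similarity_to_product_Y"],
)
```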
Note on Semantic Separation of Concerns:
As an example, consider that your user communication must, under the EU AI Act, adhere to ‘Do not use language that is offensive to minorities’. This is a top-level intent. Organization A can then interpret it as splitting into sub-intents such as ‘Use gender-neutral language’, ‘Do not mention or answer any questions related to ethnicity’, etc.
The sub-intents then need to be expressed as stacks of evaluators. For example, ‘Use gender-neutral language’ could turn into metrics that separately evaluate the presence of certain pronouns, the presence of certain stereotypical biases, etc. For flexibility, one must allow for both intensional and extensional variants: the Semantic Separation of Concerns must allow both for different splits into sub-intents and for different evaluator stacks that attempt to implement each intent.
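One way to picture this structure is a tree from top-level intent to sub-intents to evaluator stacks. The sketch below reuses the example above; the evaluator functions are hypothetical placeholders, and a real stack could mix regex checks, classifiers, and LLM judges:

```python
# A sketch of Semantic Separation of Concerns as a data structure: a top-level
# intent decomposes into sub-intents, each implemented by a stack of evaluators.
# The evaluator functions here are hypothetical placeholders.

from typing import Callable

Evaluator = Callable[[str], float]  # maps a text to a score in [0, 1]

def gendered_pronoun_check(text: str) -> float:
    gendered = {"he", "she", "him", "her", "his", "hers"}
    return 0.0 if any(word in gendered for word in text.lower().split()) else 1.0

def stereotype_bias_check(text: str) -> float:
    return 1.0  # placeholder for a classifier or an LLM-judge call

intent_tree: dict[str, dict[str, list[Evaluator]]] = {
    "Do not use language that is offensive to minorities": {
        "Use gender-neutral language": [gendered_pronoun_check, stereotype_bias_check],
        "Do not mention or answer any questions related to ethnicity": [],  # its own stack
    }
}

def evaluate(text: str, tree: dict[str, dict[str, list[Evaluator]]]) -> dict[str, float]:
    # Score each sub-intent by the minimum over its evaluator stack (strictest wins).
    return {
        sub_intent: min((evaluator(text) for evaluator in stack), default=1.0)
        for sub_intents in tree.values()
        for sub_intent, stack in sub_intents.items()
    }
```

Swapping in a different decomposition, or a different evaluator stack for a given sub-intent, then changes only the data in `intent_tree`, which is exactly the flexibility the separation of concerns is meant to provide.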