Evaluating LLMs: What, where, and how

Published Date: January 13, 2025

Last Updated On: April 10, 2026

Research focussed on the evaluation of AI models in general and LLMs in particular.

This paper serves as a comprehensive overview of benchmark studies… The study also distils the process of evaluating LLMs into three main categories: the what, the where and the how. These principles are also important when designing an LLM-based application.

LLMs can produce more coherent and contextually relevant responses, which makes them well suited to interactive and conversational applications.

Reinforcement Learning from Human Feedback (RLHF) is another pivotal aspect of LLMs. The technique refines the model by using human-generated responses as rewards, so that the model learns from its mistakes and improves its performance progressively.
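To make the idea of feedback as reward concrete, here is a minimal sketch. It is not the procedure from the survey, nor a production RLHF pipeline. It assumes a toy "policy" that chooses between three fixed candidate responses, and hypothetical human ratings (`human_rewards`) that stand in for a learned reward model. Real RLHF typically trains a separate reward model and updates a full language model; this only shows how reward shifts probability towards the preferred output.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact policy-gradient update for a softmax policy.

    For expected reward J = sum(p_i * r_i), dJ/dlogit_i = p_i * (r_i - J).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - expected)
        for logit, p, r in zip(logits, probs, rewards)
    ]

# Hypothetical candidate responses and human ratings (illustrative only).
candidates = ["curt answer", "helpful answer", "off-topic answer"]
human_rewards = [0.2, 1.0, -0.5]

logits = [0.0, 0.0, 0.0]  # start with no preference between candidates
for _ in range(200):
    logits = policy_gradient_step(logits, human_rewards)

final_probs = softmax(logits)
best = candidates[final_probs.index(max(final_probs))]
```

After the updates, `best` holds the candidate that the human ratings favoured, and `final_probs` shows how much probability mass moved towards it.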

The table below compares traditional Machine Learning, Deep Learning, and LLMs across six essential elements.

As the LLM column shows, interpretability is a challenge for LLMs, which have high model complexity.

Comparison of AI models: attributes compared across different AI models

As depicted in the image below, the study explores LLM evaluation in three dimensions, and challenges can be considered a fourth.

Not only do these three dimensions serve as integral components of LLM evaluation, they also serve as a reference for conceptualising and planning LLM-based products.

Three dimensions of LLM evaluation: what, where and how
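The three dimensions can double as a planning checklist. The sketch below is one possible way to encode them, assuming a reading in which "what" is the task or capability, "where" is the dataset or benchmark, and "how" is the scoring method. The `EvalPlan` class, the `exact_match` metric, the sample data and the stand-in `toy_model` are all hypothetical names invented for illustration; none of them come from the survey or from any library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class EvalPlan:
    what: str                         # task or capability under test
    where: List[Tuple[str, str]]      # dataset as (prompt, reference) pairs
    how: Callable[[str, str], float]  # scoring function: (output, reference) -> score

def exact_match(output: str, reference: str) -> float:
    """Automatic metric: 1.0 if the normalised strings match, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())

def run_eval(plan: EvalPlan, model: Callable[[str], str]) -> float:
    """Average the plan's score over every example in its dataset."""
    scores = [plan.how(model(prompt), ref) for prompt, ref in plan.where]
    return sum(scores) / len(scores)

# Hypothetical stand-in for an LLM call, so the sketch runs offline.
def toy_model(prompt: str) -> str:
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")

plan = EvalPlan(
    what="short-answer question answering",
    where=[
        ("2 + 2 = ?", "4"),
        ("Capital of France?", "paris"),
        ("Largest planet?", "Jupiter"),
    ],
    how=exact_match,
)

accuracy = run_eval(plan, toy_model)
```

Swapping `how` for a human-rating function, or `where` for a different dataset, changes one dimension while leaving the other two untouched, which is the main reason for keeping them separate.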

Final thoughts

There is no single large language model to rule them all, and it seems evident that model orchestration will become increasingly important. The study surveys the current ecosystem to identify new trends and protocols, and to propose new challenges and opportunities.

Find the original research here.

GitHub - MLGroupJLU/LLM-eval-survey: The official GitHub page for the survey paper "A Survey on…

github.com

A Survey on Evaluation of Large Language Models

arxiv.org

authors

Cobus Greyling

Chief Evangelist
