How should large language models be evaluated?

Published Date:

January 7, 2025

Last Updated On:

April 10, 2026

A while ago, Tianjin University released a study that defined a taxonomy for LLM evaluation and its implementation.

The image below shows the taxonomy of major categories and sub-categories used by the study for LLM evaluation.

Visual representation of large language model evaluation metrics and performance analysis.

Looking at the taxonomy from a technology perspective, much of the focus has been on Evaluation Organisation, Knowledge & Capability and Specialised LLMs.

From an enterprise perspective, the concerns and questions raised centre around Alignment and Safety.

The survey aims to give a comprehensive perspective on how LLMs should be evaluated.

The study categorises the evaluation of LLMs into three major groups:

Knowledge and Capability Evaluation,
Alignment Evaluation and
Safety Evaluation.

What I also found interesting in the study was its focus on more complex metrics, such as reasoning and tool learning, as seen below:

Visual representation of the four reasoning types: deductive, inductive, abductive, and analogical.

In the early days of Natural Language Processing (NLP), researchers frequently used a series of simple benchmarks to assess the performance of their language models. These early assessments focused predominantly on syntax and vocabulary, with tasks such as parsing syntactic structures and disambiguating word senses.

LLMs have introduced significant complexity. In current evaluations they reveal behaviours that indicate risk, and they also demonstrate the ability to perform higher-order tasks.

Consequently, creating a taxonomy like this to reference can help ensure that due diligence is followed when assessing LLMs.
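As a rough illustration of using the taxonomy as a reference, the sketch below treats the study's three major groups as a coverage checklist. The `EvaluationPlan` class, its methods and the benchmark names are hypothetical placeholders of my own, not something taken from the study.

```python
from dataclasses import dataclass, field

# The three major evaluation groups named in the study.
MAJOR_GROUPS = (
    "Knowledge and Capability Evaluation",
    "Alignment Evaluation",
    "Safety Evaluation",
)


@dataclass
class EvaluationPlan:
    """Hypothetical checklist mapping each major group to planned benchmarks."""

    benchmarks: dict = field(
        default_factory=lambda: {group: [] for group in MAJOR_GROUPS}
    )

    def add(self, group: str, benchmark: str) -> None:
        # Reject anything that is not one of the three major groups.
        if group not in self.benchmarks:
            raise ValueError(f"Unknown group: {group}")
        self.benchmarks[group].append(benchmark)

    def uncovered(self) -> list:
        # Groups that do not yet have any benchmark assigned.
        return [g for g in MAJOR_GROUPS if not self.benchmarks[g]]


plan = EvaluationPlan()
# Benchmark names below are placeholders, not real benchmark suites.
plan.add("Knowledge and Capability Evaluation", "reasoning benchmark (placeholder)")
plan.add("Safety Evaluation", "robustness benchmark (placeholder)")
print(plan.uncovered())  # ['Alignment Evaluation']
```

A check like this only flags a group that has no evaluation planned at all; it says nothing about whether the chosen benchmarks are adequate.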

Find the full study here.

authors

Cobus Greyling

Chief Evangelist
