Image Source : https://www.pexels.com/@rdne/

Ranking and testing LLMs and other AIs

Like taking them to school !

7 min read · Jul 2, 2024


You might have seen AI leaderboards here, here, here and there, and, like most busy people, probably nodded, made a mental note of the top AI dogs and went on with your day. This article is for those who want to go a little deeper into how AIs are ranked and tested, either because you need to know 🤌 as a decision maker or because you are just curious about the subject.

So how do you test an AI ?

Turns out, exactly like you would a human being (with some caveats). Let's say your AI is tasked with identifying shapes. You make a test with shapes and labels and let your AI identify as many as it can; you can even set a benchmark where x amount of shapes need to be correctly identified for it to be considered a "Shapeologist AI":

This AI has to go back to school and train harder on pointy shapes !
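To make that concrete, here is a minimal sketch in Python of what such a shape test could look like; identify_shape, the file names and the 80% pass mark are all invented for illustration, not taken from any real benchmark:

def identify_shape(image_path):
    """Stand-in for the AI under test; a real model would return its prediction."""
    return "circle"  # stubbed so the sketch runs end to end

test_set = [  # (image, correct label) pairs the AI has never seen
    ("img_001.png", "circle"),
    ("img_002.png", "triangle"),
    ("img_003.png", "hexagon"),
]

correct = sum(identify_shape(img) == label for img, label in test_set)
accuracy = correct / len(test_set)
print(f"Accuracy: {accuracy:.0%}")

if accuracy >= 0.80:  # arbitrary pass mark, made up for this example
    print("Certified Shapeologist AI!")
else:
    print("Back to school: more training on pointy shapes needed.")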
The caveat here, which can be a big one, is that you are testing a machine (at least until my side project of sentient AIs bears fruit!), which means you are testing a system that is itself the result of testing. For instance, a shape-detecting AI is built by providing pairs of shapes and labels as a dataset to train on, and it then gets tested against unseen pairs (a validation/test dataset) until it converges (usually as measured by some loss function).
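For context, here is a rough sketch of that fit-then-validate idea using scikit-learn; the toy "corner count / curved edges" features and labels are made up, and the only point is that the model is trained on one set of pairs and scored on pairs it never saw:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# toy features: (number of corners, has curved edges) -> shape label
X = [[0, 1], [3, 0], [4, 0], [0, 1], [3, 0], [4, 0], [6, 0], [0, 1]]
y = ["circle", "triangle", "square", "circle",
     "triangle", "square", "hexagon", "circle"]

# hold some pairs back so the model is scored on data it never trained on
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on unseen pairs:", model.score(X_val, y_val))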

So, for new tests to be effective, they need to be sufficiently different from the original training data (which might not be available) while still being relevant and covering base cases. Additionally, as LLMs/AIs become more complex, both questions and answers become harder to interpret and score. We'll cover this in a second, but here are some current tests so you get a better idea…

Specific tests

LLMs are popular these days, so here’s a couple of tests specific to them:

MMLU ( Massive Multitask Language Understanding ) provides multiple-choice questions and answers on a number of topics ranging from marketing to anatomy. I remember liking those in school, as you could play the odds or make a good bet when two options rang true.

Sample :

Q: On which planet in our solar system can you find the Great Red Spot?

Multiple Choice : [ "Venus", "Mars", "Jupiter", "Saturn" ]

A: 2 (in array notation, so the 3rd option : Jupiter)

Source/Dataset: cais/mmlu on Hugging Face
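Scoring this kind of test is straightforward, since the gold answer is just an index into the choices and the metric is plain accuracy. Here is a hedged sketch where the question comes from the sample above and ask_model is a stand-in for the LLM being tested:

questions = [
    {
        "question": "On which planet in our solar system can you find the Great Red Spot?",
        "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
        "answer": 2,  # index 2 -> "Jupiter"
    },
]

def ask_model(question, choices):
    """Stand-in for the LLM under test; returns the index it picks."""
    return 2  # pretend the model picked "Jupiter"

correct = sum(ask_model(q["question"], q["choices"]) == q["answer"]
              for q in questions)
print(f"MMLU-style accuracy: {correct / len(questions):.0%}")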

At this point we should note that these datasets/tests are perhaps biased by their very nature of being selective rather than comprehensive; if you have ever only studied certain parts of a course and wished the test would cover just those, you'll understand.

HellaSwag: a quirkily named dataset that tries to test for reasoning by means of sentence completion, like this one:

Context: A chef is seen chopping vegetables on a cutting board. 
Prompt: The chef then:

a. stops chopping and cleans the knife with a towel.

b. continues chopping and adds the vegetables to a sizzling pan.

c. takes a break and sits down with a cup of coffee.


Here, b. is generally believed to be the correct answer because it follows the common-sense flow of events.

The caveat here is that the situations need to be carefully considered. I have done a. while working at a restaurant, so that also seems reasonable to me, but by adding many more scenarios we can try to get a more accurate estimate.
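As an aside on how such completion tests are typically scored: rather than asking the model to answer "a", "b" or "c" outright, evaluation harnesses often have the model assign a likelihood to each candidate ending and pick the highest. A rough sketch, with invented log-probability numbers and a hypothetical ending_log_prob standing in for a real model call:

context = "A chef is seen chopping vegetables on a cutting board. The chef then"
endings = {
    "a": "stops chopping and cleans the knife with a towel.",
    "b": "continues chopping and adds the vegetables to a sizzling pan.",
    "c": "takes a break and sits down with a cup of coffee.",
}

fake_log_probs = {"a": -12.0, "b": -7.5, "c": -11.2}  # invented numbers

def ending_log_prob(context, label, ending):
    """Hypothetical stand-in: how likely the model finds `ending` after `context`."""
    return fake_log_probs[label]

best = max(endings, key=lambda label: ending_log_prob(context, label, endings[label]))
print("Model's pick:", best, "-", endings[best])  # b is the expected answer here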

Is it a dataset or a test ?

Well, it's both, or rather it could be both. There's nothing stopping you from training on a test dataset, but that defeats the purpose; it's like having the answers to the test beforehand.
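If you do want to check that your test questions haven't leaked into the training data, even a crude overlap check helps. This sketch uses exact, lower-cased matching on a couple of invented strings; real contamination checks usually rely on n-gram or fuzzy matching:

train_texts = {  # imagine these came from the training corpus
    "on which planet in our solar system can you find the great red spot?",
    "a chef is seen chopping vegetables on a cutting board.",
}
test_texts = [
    "On which planet in our solar system can you find the Great Red Spot?",
    "Which gas makes up most of Earth's atmosphere?",
]

leaked = [t for t in test_texts if t.lower().strip() in train_texts]
print(f"{len(leaked)} of {len(test_texts)} test items also appear in the training data")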

GRIT ( General Robust Image Task Benchmark ) To round out these few specific examples, let's change domain and look at an image/vision-related test which covers various tasks, summarized here:

Source : https://github.com/allenai/grit_official

The rest… there are quite a few other tests and benchmarks out there, and I expect the list will grow and change with time, but here's a non-exhaustive one which should get you started:

The ROBOTICS + AI combo seems underrepresented and undervalued in today's LLM-centric environment, but frameworks do exist, like this one:

https://robotperf.net

Advanced/Complex Testing

As LLMs and AIs increase in complexity, so does the testing, and this can generate new problems. For instance, how do you test for answers that are not exactly the same as the given answer yet are also correct? Multi-step answers with multiple correct paths? And perhaps the most dreaded test format, the essay?

For LLMs, the answer could be to simply train another LLM to do the scoring, or to bring in human experts to rate the LLM (oh, the irony!). For other AIs that require a more complex environment, think navigating rooms or operating alongside humans, a custom 3D virtual environment may be the only safe/sane way of testing. But all of these are new ideas that still need to be built and, well, tested.
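For the "LLM grades LLM" idea, the setup could look something like the sketch below; call_judge_model is a hypothetical wrapper around whatever LLM API you use, and the 1-to-5 rubric is just one possible choice, not a standard:

def call_judge_model(prompt: str) -> str:
    """Hypothetical: send `prompt` to a judge LLM and return its reply."""
    return "4"  # stubbed reply so the sketch runs

def judge_answer(question: str, reference: str, candidate: str) -> int:
    prompt = (
        "You are grading an exam.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {candidate}\n"
        "Rate the student answer from 1 (wrong) to 5 (fully correct). "
        "Reply with a single digit."
    )
    return int(call_judge_model(prompt).strip())

score = judge_answer(
    "Why does the Great Red Spot persist for so long?",
    "It is a long-lived anticyclonic storm sustained by Jupiter's jets.",
    "It's a giant storm on Jupiter that keeps being fed by the planet's winds.",
)
print("Judge score:", score)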

How relevant are leaderboards and tests anyways ?

I’d say anywhere from not relevant at all, especially for complex cognitive tasks like reasoning, to very relevant for things like code, sentence completion and general knowledge; something like instruction following or vision/audio tasks could be somewhere in the middle. Follow-up question: why?

What you test and how you score it depends in many ways on your ultimate goal for the student or AI, which in turn puts you in the hot seat as an exam creator. I mean, how much do you really know about the subject yourself? What’s your definition of a perfect student or AI?

If the tasks are relatively simple and the ground truth is generally accepted and exhaustive, it is usually not a big issue; there are only so many basic geometric shapes (about 20). But as for the ways to reason about and solve complex problems, that’s an unknown.

Another issue with testing AIs is that how the AI or student got to the result can matter: you would be considered a cheat if you had an exhaustive list of answers by your side at an exam, and even if you had dictionary-level knowledge, that would only get you so far in the more advanced tasks AIs try to emulate.

And this is just one aspect; tests targeting emotions, feelings, experience and complex memory tasks in AIs are simply not a thing yet.

AIs vs Humans ?

Another aspect of testing AIs is the inevitable comparison against humans they elicit, and the perils of pitting human apples against AI oranges. MMLU, for instance, is quoted at 89.8% for human domain experts, but it has already been superseded by the newer MMLU-Pro, so it’s a moving target and it’s probably wise not to draw conclusions. Still, it’s not looking great for us humans, at least in some predefined categories:

A snapshot as of late 2023/early 2024 of different benchmarks and estimated Human/AI equivalent performance.

Even though we came out on top (12 out of 19 benchmarks), this table will probably be outdated by the time you read this, and AIs are not that far behind:

Don’t panic! (Or maybe panic a little.) There’s much more to being a human and much more required to do the jobs that we do. Look, for instance, at the chest X-ray interpretation benchmark: it won’t replace a doctor but rather supplement one, and it could even lower the cost of a consultation. For other LLM-centric tasks like customer support, the spoiler is that they have already been replaced… press #4 if you’d like to speak with a human!

Takeaways

I normally end my articles with a small summary, but for me the main takeaway here is that AI tests, benchmarks and leaderboards are all over the place. It is almost as if these tests were made to give current AIs a passing/failing grade by being too narrow, opinionated or tailor-made for some reason ( hey VCs, my AI passes this one weird test that the other AI doesn’t… give me money ! )

As a user myself, it would be hypocritical to discount newer AIs (especially chat-based ones), which seemingly came out of nowhere if you weren’t paying attention and currently dominate the AI scene; and that, I think, is the ultimate test.

I tried using “smart” voice assistants on my “smart” phone, but that only lasted a few days (too awkward/impractical). This time around I find myself using LLMs and other generative AIs more regularly, but they too seem like a cool new tool rather than life-changing. Then again, as I’ve grown accustomed to saying while dealing with new technology: maybe we are just early!

Thanks for reading, there won’t be a test today and you can take the rest of the day off !

Keno Leon