Measuring the reliability of AI tools can be challenging – they have a wide variety of capabilities and produce results in unexpected, creative ways. Here, we’ll break down common benchmarks used to understand the performance of AI tools so you can confidently decide which are best for your team.
It’s important to keep in mind that no tool is inherently “good” or “bad”; instead, explore whether an AI model is a good fit for the work you’d like it to do by choosing a benchmark that tests those types of tasks.
A note on hallucinations: a hallucination occurs when a model’s answer isn’t supported by the data it uses to produce that answer. So if I asked an AI that references time zone data, “What is the time zone in Manila?”, and it answered, “There are crocodiles in the Philippines.”, that would be a hallucinated answer.
CLICK HERE to learn more about AI hallucinations.
- Asks the model short, fact-seeking questions (e.g. “What is the capital of France?”)
- Scoring is relatively straightforward because each question is written to have a single, concrete right answer
- Good general measurement of model response accuracy
CLICK HERE to read OpenAI’s breakdown. CLICK HERE to read the technical documentation. CLICK HERE to see the leaderboard.
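If you’re curious what that kind of scoring looks like under the hood, here’s a minimal Python sketch of exact-match grading for short factual questions. The questions, answers, and the model_answer stand-in are invented for illustration; they aren’t drawn from any official benchmark.

```python
# Illustrative sketch of exact-match scoring for short, fact-seeking questions.
# Everything below (questions, answers, the model_answer stand-in) is made up.

def model_answer(question: str) -> str:
    """Stand-in for a call to the AI model being evaluated."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is the time zone in Manila?": "There are crocodiles in the Philippines.",
    }
    return canned.get(question, "I don't know")

benchmark = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the time zone in Manila?", "answer": "UTC+8"},
]

correct = 0
for item in benchmark:
    response = model_answer(item["question"])
    # Because each question has a single concrete right answer,
    # grading can be as simple as a direct comparison.
    if response.strip().lower() == item["answer"].strip().lower():
        correct += 1

print(f"Accuracy: {correct}/{len(benchmark)} = {correct / len(benchmark):.0%}")
```

In practice, grading tends to be more forgiving than a strict string comparison, since a correct answer can be phrased in many different ways.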
An internal benchmark designed and used by OpenAI, PersonQA tests how much information a model knows about people.
- Asks the model for publicly available information about real-world people (e.g. “When is Oprah’s birthday?”)
- Checks accuracy when answering fact-based questions
- Helps score how likely a model is to hallucinate
CLICK HERE to read OpenAI’s breakdown.
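To make the idea of a hallucination score more concrete, here’s a simplified, hypothetical Python sketch. The graded answers are invented, and splitting grades into correct, incorrect, and not attempted is just one reasonable way to frame it rather than OpenAI’s exact procedure.

```python
# Hypothetical example: turning graded answers into an accuracy score and a
# hallucination rate. The grades below are invented for illustration.

graded_answers = [
    "correct",        # model gave the right birthday
    "incorrect",      # model confidently gave a wrong birthday (a hallucination)
    "not_attempted",  # model declined to answer
    "correct",
    "incorrect",
]

correct = graded_answers.count("correct")
incorrect = graded_answers.count("incorrect")
attempted = correct + incorrect

accuracy = correct / len(graded_answers)
# One way to frame a hallucination rate: of the answers the model actually
# gave, how many were wrong?
hallucination_rate = incorrect / attempted if attempted else 0.0

print(f"Accuracy: {accuracy:.0%}")                      # 40% in this toy example
print(f"Hallucination rate: {hallucination_rate:.0%}")  # 50% in this toy example
```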
An intense test that pushes models to the edge of their ability, used to evaluate how well they handle a wide range of tasks. Designed by researchers at UC Berkeley, Columbia, UChicago, and UIUC.
- Tests 57 different tasks from a randomized database
- Covers a wide range of topics including math, history, law, and more
- Designed to stress a model’s capability to handle many different kinds of tasks at once
CLICK HERE to read the technical documentation. CLICK HERE to see the leaderboard.
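To see how performance across so many topics rolls up into a single score, here’s a rough Python sketch. The subjects, answer choices, and results are invented examples, not real benchmark data.

```python
# Illustrative sketch of scoring a multi-subject, multiple-choice benchmark.
# The subjects and answers are invented examples.
from collections import defaultdict

results = [
    {"subject": "math",    "model_choice": "B", "correct_choice": "B"},
    {"subject": "math",    "model_choice": "C", "correct_choice": "A"},
    {"subject": "history", "model_choice": "D", "correct_choice": "D"},
    {"subject": "law",     "model_choice": "A", "correct_choice": "A"},
    {"subject": "law",     "model_choice": "B", "correct_choice": "C"},
]

per_subject = defaultdict(lambda: {"correct": 0, "total": 0})
for r in results:
    per_subject[r["subject"]]["total"] += 1
    if r["model_choice"] == r["correct_choice"]:
        per_subject[r["subject"]]["correct"] += 1

# Per-subject scores show where the model is strong or weak...
for subject, tally in per_subject.items():
    print(f"{subject}: {tally['correct']}/{tally['total']}")

# ...while the overall score aggregates across every question.
overall = sum(t["correct"] for t in per_subject.values()) / len(results)
print(f"Overall accuracy: {overall:.0%}")
```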
Another intense test used to check how well AI models respond to expert-level questions, along with their ability to perform complex reasoning tasks. Designed by researchers at IN.AI Research, UWaterloo, Ohio State, Carnegie Mellon, UVictoria, and Princeton.
- Tests based on a large dataset of expert-level information
- Challenges AI models in four categories: comprehensiveness, heterogeneous images, interleaved text and images, and expert-level perception
- Covers six core disciplines: art and design, business, science, health, social science, and engineering
CLICK HERE to read the technical documentation. CLICK HERE to see the leaderboard.
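To illustrate what “interleaved text and images” means in practice, here’s a hypothetical Python sketch of how a single multimodal question might be represented and graded. The fields, image file, and question content are all invented for this example and aren’t the benchmark’s actual format.

```python
# Hypothetical example of an interleaved text-and-image question.
# The fields, image file, and question content are all invented.

question = {
    "discipline": "health",
    "prompt_parts": [
        {"type": "text",  "content": "The chest X-ray below was taken after the procedure."},
        {"type": "image", "content": "xray_042.png"},  # hypothetical image file
        {"type": "text",  "content": "Which finding is most consistent with the image?"},
    ],
    "options": {"A": "Pneumothorax", "B": "Normal study", "C": "Pleural effusion"},
    "correct_choice": "A",
}

# The model has to read the text, interpret the image, and reason at an
# expert level before choosing an option; the grading itself is still a
# simple comparison against the correct choice.
model_choice = "A"  # stand-in for the model's actual response
print("Correct" if model_choice == question["correct_choice"] else "Incorrect")
```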
A test specifically focused on hallucinations, designed by Vectara (a company that helps businesses integrate AI into their products).
- Provides the model with multiple lines of text from a randomized dataset to use as its source of context
- Prompts the model to respond based only on the context provided
- Scores how often the model offers a result that doesn’t match the context
CLICK HERE to read the technical documentation. CLICK HERE to see the leaderboard.
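Here’s a rough Python sketch of that idea: hand the model some source text, then count how often its output isn’t supported by it. The contexts, responses, and the is_supported_by check are invented stand-ins; Vectara’s actual evaluation uses a dedicated model to judge consistency rather than a simple keyword check.

```python
# Illustrative sketch of a context-grounded hallucination check.
# The contexts, responses, and is_supported_by() are invented stand-ins.

def is_supported_by(response: str, context: str) -> bool:
    """Crude stand-in for a consistency judge: checks whether the response's
    first few words appear in the context."""
    return all(word.lower() in context.lower() for word in response.split()[:3])

examples = [
    {
        "context": "The museum opened in 1952 and welcomes about 200,000 visitors a year.",
        "response": "The museum opened in 1952.",                   # supported by the context
    },
    {
        "context": "The museum opened in 1952 and welcomes about 200,000 visitors a year.",
        "response": "The museum was founded by a famous painter.",  # not in the context
    },
]

unsupported = sum(1 for ex in examples if not is_supported_by(ex["response"], ex["context"]))
print(f"Hallucination rate: {unsupported}/{len(examples)} = {unsupported / len(examples):.0%}")
```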
As you can see, each test provides different insights into the areas where a model excels or struggles. New models are being developed all the time, and older models may occasionally receive updates or be tweaked to improve performance. For that reason, benchmark leaderboards are always changing – so it helps to keep an eye on how your favorite tools are performing.
Building AI policy for your organization? Get more insight into what your team needs to know about AI tools by exploring membership today.