Nov 21, 2025

Simplifying perceptual quality evaluation

Lucas Theis
Founder
Ducks in flight (Matthieu Rochette)

Generative models of audiovisual content have advanced tremendously. Fifteen years ago, a typical generative image model might have produced grayscale images of 8x8 pixels. Today, we can hold real-time conversations with photorealistic avatars that are rendered in HD. From the beginning, how to evaluate and measure that progress was a hotly debated topic. I remember keeping our compute cluster busy for weeks to measure the performance of some of the earliest deep generative models algorithmically, while my colleagues at the Max Planck Institute were performing psychophysical tests to see if humans could tell the difference between real and generated image patches. The debate around the evaluation of generative models only intensified with the arrival of more capable models such as GANs and diffusion models.

Today, evaluation is more important and more challenging than ever. Audiovisual generative models are no longer just of academic interest but power a wide range of applications, from generating video game assets to replacing job interviewers with AI-generated avatars. But with new capabilities also come new failure modes. Where once we may have focused on a single dimension such as "realism", there is now a multitude of dimensions along which our models can fail and content can appear unrealistic, unnatural, or otherwise undesirable. These range from the aesthetics, responsiveness, and lip-sync quality of avatars to the prosody of synthesized speech, or the prompt alignment and musicality of AI-generated music.

Subjective tests are hard to get right

For most of these criteria, automated metrics can only take us so far, and human feedback is required to accurately measure them. Unfortunately, performing subjective tests well can take a lot of effort. Sourcing capable participants with the appropriate hardware is only the first step. Then there's the design and implementation of the study itself, having to deal with web APIs and worrying about browser compatibility. Seemingly small changes in the study design or user interface can have a surprisingly large impact on a study's effectiveness and cost, which is why organizations like the ITU go to such lengths to meticulously standardize subjective tests. And unless you're evaluating an API (which brings its own host of challenges), you're dealing with hosting, certificates, signed URLs, CDNs, and caching to ensure that participants aren't spending the majority of their time waiting for your data to load. Then there's quality control and data analysis. The list goes on.

Solutions to these problems exist and are well known, of course. Measuring subjective quality is a long-standing research topic, and entire research conferences have been dedicated to it. However, the tooling to make all of this knowledge easily accessible and available to everyone has been severely lacking. Working on generative models and compression for many years, I have experienced firsthand how studies can take weeks to complete not just in academia, but at startups and at big tech companies. Instead of a single intuitive user interface, we'd be dealing with bash scripts, CSVs, and Colab notebooks. Instead of launching studies at the click of a button, we'd be sending emails and setting up meetings to discuss next steps.

Introducing Mabyduck

That’s why, over the past year, we’ve been building Mabyduck, a company dedicated to perceptual quality evaluation. As a first step on our journey, we have developed a self-serve platform to make human-based evaluation as easy as running automated metrics. Choosing a self-serve model was intentional: it forces us to build a service that feels obvious and intuitive. Our internal README contains just one instruction:

"Simple things should be simple, complex things should be possible." —Alan Kay

Today, we are releasing the public beta of our self-service platform 🚀. Because it's fully self-serve, you can try it out right now – no demos, sales calls, or onboarding sessions required.

Mabyduck is the tool I always wished I had when doing my own research on generative AI. It already supports more than 13 well-tested experiment types, from standardized tests such as MUSHRA to highly configurable pairwise video studies with frame-accurate synchronization.

One of my favorite features is the ability to run pilot studies with your own team as raters. As researchers and engineers, we are naturally prone to confirmation bias when "vibe checking" model outputs. Viewing results in a controlled setting can reduce that bias and already give us a much better idea of model performance. With just a few more clicks, you can then launch a crowd-sourced study and compare the perception of experts and naïve raters.

Another feature I want to highlight is the ability to create custom rubrics and leaderboards. Today's generative models are often built by large teams and at significant cost, which makes establishing a clear evaluation protocol early on essential. Without it, teams risk optimizing for the wrong target or getting stuck in endless debates about which model version to ship. Setting up a rubric can be a fun way to align the team around what truly matters, and inspire healthy competition.

Thanks

This work has been made possible by a generous pre-seed round led by Zero Prime Ventures, with participation from Magnetic. We are also fortunate to have the support of Aäron van den Oord, Sumith Kulal, Ferenc Huszár, Matthias Bethge, Lasse Espeholt, Rafał Mantiuk, and other leading experts in AI as angel investors and advisors.

We are also deeply grateful to our early customers and design partners for their trust, and for the invaluable feedback that has shaped the product.

Stay tuned for more!
