Paul Christiano and Beth Barnes are trying to make advanced AI honest, and safe

Christiano and Barnes have helped mainstream concerns about AI misalignment.

Illustrated portraits of Paul Christiano and Beth Barnes
Lauren Tamaki for Vox
Dylan Matthews
Dylan Matthews was a senior correspondent and head writer for Vox’s Future Perfect section. He is particularly interested in global health and pandemic prevention, anti-poverty efforts, economic policy and theory, and conflicts about the right way to do philanthropy.

The first arguments that AI “misalignment” — when artificially intelligent systems do not do what humans ask of them, or fail to align with human values — could pose a huge risk to humankind came from philosophers and autodidacts on the fringes of the actual AI industry. Today, though, the leading AI company in the world is pledging one-fifth of its computing resources, worth billions of dollars, toward working on alignment. What happened? How did AI companies, and the White House, come to take AI alignment concerns seriously?

Paul Christiano and Beth Barnes are key characters in the story of how AI safety went mainstream.

Christiano has been writing about techniques for preventing AI disasters since he was an undergrad, and as a researcher at OpenAI he led the development of what is now the dominant approach to preventing flagrant misbehavior from language and other models: reinforcement learning from human feedback, or RLHF. In this approach, actual human beings are asked to evaluate outputs from models like GPT-4, and their answers are used to fine-tune the model to make its answers align better with human values.

It was a step forward, but Christiano is hardly complacent, and often describes RLHF as merely a simple first-pass approach that might not work as AI gets more powerful. To develop methods that could work, he left OpenAI to found the Alignment Research Center (ARC). There, he is pursuing an approach called “eliciting latent knowledge” (ELK), meant to find methods to force AI models to tell the truth and reveal everything they “know” about a situation, even when they might normally be incentivized to lie or hide information.
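The RLHF loop described above can be sketched in miniature. The toy below is an illustration, not OpenAI's actual implementation: human annotators compare pairs of model outputs, a simple reward model is fit so that preferred outputs score higher (a Bradley-Terry-style logistic loss, as in the RLHF literature), and that learned reward would then serve as the training signal for fine-tuning the language model itself. All data, feature vectors, and function names here are hypothetical.

```python
import math

# Hypothetical preference data: each output is reduced to a feature vector,
# and a label of True means the human preferred output_a over output_b.
comparisons = [
    # (features_a, features_b, a_preferred)
    ([1.0, 0.2], [0.1, 0.9], True),   # e.g. helpful vs. evasive answer
    ([0.9, 0.1], [0.2, 0.8], True),
    ([0.2, 0.7], [0.8, 0.3], False),
]

def reward(w, x):
    """Linear reward model: score = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(data, lr=0.5, epochs=200):
    """Fit w so preferred outputs get higher reward (logistic pairwise loss)."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for xa, xb, a_pref in data:
            diff = reward(w, xa) - reward(w, xb)
            p_a = 1.0 / (1.0 + math.exp(-diff))      # P(human prefers a)
            grad = (1.0 if a_pref else 0.0) - p_a    # gradient of log-likelihood
            for i in range(len(w)):
                w[i] += lr * grad * (xa[i] - xb[i])
    return w

w = train_reward_model(comparisons)
# In full RLHF, this fitted reward model would then be used (e.g. via PPO)
# to fine-tune the language model toward outputs humans rate highly;
# that second stage is only gestured at here.
```

The key design point is that the reward model is trained on *comparisons* rather than absolute scores, since humans are far more reliable at judging which of two answers is better than at assigning numeric quality ratings.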

Our methodology

To select this year’s Future Perfect 50, our team went through a months-long process. Starting with last year’s list, we brainstormed, researched deeply, and connected with our audience and sources. We didn’t want to overrepresent any one category, so we aimed for diversity in theories of change, academic specialties, age, geographic location, identity, and many other criteria.

To learn more about the FP50 methodology and criteria, go here.

That is only half of ARC’s mission, though. The other half, soon to become its own independent organization, is led by Beth Barnes, a brilliant young researcher (she got her bachelor’s degree from Cambridge in 2018) who did a short stint at Google DeepMind before joining Christiano at OpenAI, and now at ARC. Barnes is in charge of ARC Evals, which conducts model evaluations: She works with big labs like OpenAI and Anthropic to pressure-test their models for dangerous capabilities. For example, can GPT-4 set up a phishing page to get a Harvard professor’s login details? Not really, it turns out: It can write the HTML for the page, but fails to find web hosting.

But can GPT-4 use TaskRabbit to hire a human to do a CAPTCHA test for it? It can — and it can lie to the human in the process. You may have heard of that experiment, for which Barnes and the evaluations team at ARC were responsible.

The reputations of ARC, ARC Evals, and their leaders are so formidable in AI safety circles that reassuring people that it’s okay not to be as smart as Paul Christiano has become a bit of a meme. And it’s true: it’s totally fine to not be as smart as Christiano or Barnes (I’m definitely not). But I’m glad that people like them have taken on a problem this serious.
