Paul Christiano and Beth Barnes are trying to make advanced AI honest, and safe

Christiano and Barnes have helped mainstream concerns about AI misalignment.

Illustrated portraits of Paul Christiano and Beth Barnes
Lauren Tamaki for Vox
Dylan Matthews
Dylan Matthews was a senior correspondent and head writer for Vox’s Future Perfect section. He is particularly interested in global health and pandemic prevention, anti-poverty efforts, economic policy and theory, and conflicts about the right way to do philanthropy.

The first arguments that AI “misalignment” — when artificially intelligent systems do not do what humans ask of them, or fail to align with human values — could pose a huge risk to humankind came from philosophers and autodidacts on the fringes of the actual AI industry. Today, though, the leading AI company in the world is pledging one-fifth of its computing resources, worth billions of dollars, toward working on alignment. What happened? How did AI companies, and the White House, come to take AI alignment concerns seriously?

Paul Christiano and Beth Barnes are key characters in the story of how AI safety went mainstream.

Christiano has been writing about techniques for preventing AI disasters since he was an undergrad, and as a researcher at OpenAI he led the development of what is now the dominant approach to preventing flagrant misbehavior from language and other models: reinforcement learning from human feedback, or RLHF. In this approach, actual human beings are asked to evaluate outputs from models like GPT-4, and their answers are used to fine-tune the model to make its answers align better with human values.

It was a step forward, but Christiano is hardly complacent, and often describes RLHF as merely a simple first-pass approach that might not work as AI gets more powerful. To develop methods that could work, he left OpenAI to found the Alignment Research Center (ARC). There, he is pursuing an approach called “eliciting latent knowledge” (ELK), meant to find methods to force AI models to tell the truth and reveal everything they “know” about a situation, even when they might normally be incentivized to lie or hide information.
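The RLHF loop described above can be sketched in miniature. The toy below is an illustration, not OpenAI's actual implementation: human annotators compare pairs of model outputs, a simple reward model is fit so that preferred outputs score higher (a Bradley-Terry-style logistic loss, as in the RLHF literature), and that learned reward would then serve as the training signal for fine-tuning the language model itself. All data, feature vectors, and function names here are hypothetical.

```python
import math

# Hypothetical preference data: each output is reduced to a feature vector,
# and a label of True means the human preferred output_a over output_b.
comparisons = [
    # (features_a, features_b, a_preferred)
    ([1.0, 0.2], [0.1, 0.9], True),   # e.g. helpful vs. evasive answer
    ([0.9, 0.1], [0.2, 0.8], True),
    ([0.2, 0.7], [0.8, 0.3], False),
]

def reward(w, x):
    """Linear reward model: score = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(data, lr=0.5, epochs=200):
    """Fit w so preferred outputs get higher reward (logistic pairwise loss)."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for xa, xb, a_pref in data:
            diff = reward(w, xa) - reward(w, xb)
            p_a = 1.0 / (1.0 + math.exp(-diff))      # P(human prefers a)
            grad = (1.0 if a_pref else 0.0) - p_a    # gradient of log-likelihood
            for i in range(len(w)):
                w[i] += lr * grad * (xa[i] - xb[i])
    return w

w = train_reward_model(comparisons)
# In full RLHF, this fitted reward model would then be used (e.g. via PPO)
# to fine-tune the language model toward outputs humans rate highly;
# that second stage is only gestured at here.
```

The key design point is that the reward model is trained on *comparisons* rather than absolute scores, since humans are far more reliable at judging which of two answers is better than at assigning numeric quality ratings.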

Our methodology

To select this year’s Future Perfect 50, our team went through a months-long process. Starting with last year’s list, we brainstormed, researched deeply, and connected with our audience and sources. We didn’t want to overrepresent any one category, so we aimed for diversity in theories of change, academic specialties, age, geographic location, identity, and many other criteria.

To learn more about the FP50 methodology and criteria, go here.

That is only half of ARC’s mission, though. The other half, soon to become its own independent organization, is led by Beth Barnes, a brilliant young researcher (she got her bachelor’s degree from Cambridge in 2018) who did a short stint at Google DeepMind before joining Christiano at OpenAI, and now at ARC. Barnes is in charge of ARC Evals, which conducts model evaluations: She works with big labs like OpenAI and Anthropic to pressure-test their models for dangerous capabilities. For example, can GPT-4 set up a phishing page to get a Harvard professor’s login details? Not really, it turns out: It can write the HTML for the page, but fails to find web hosting.

But can GPT-4 use TaskRabbit to hire a human to do a CAPTCHA test for it? It can — and it can lie to the human in the process. You may have heard of that experiment, for which Barnes and the evaluations team at ARC were responsible.

The reputations of ARC, ARC Evals, and their leaders are so formidable in AI safety circles that reassuring people that it’s okay not to be as smart as Paul Christiano has become a bit of a meme. And it’s true: it’s totally fine to not be as smart as Christiano or Barnes (I’m definitely not). But I’m glad that people like them have taken on a problem this serious.
