Valuation Pipelines in AI

Let’s be honest. The last AI conference you attended was probably littered with ethical buzzwords (fairness, privacy, accountability, transparency, safety…) whose actual meanings are vague and contentious. Of course, we all think fairness matters—the whole problem is that we disagree on what it involves!

Let’s stick with fairness as our example. It’s usually connected to being unbiased somehow, though the specifics vary. For instance, Biden’s Blueprint for an AI Bill of Rights envisioned “Algorithmic Discrimination Protections” and fleshed this out using at least nine terms that have since been effectively banned, including discrimination, equitable, race, ethnicity, sex, gender identity, and intersex.

Cynically, you might think a notion of fairness emptied of any such concrete references has little left to offer. But Trump’s Executive Order on AI is still laser-focused on eliminating at least one kind of bias: “We must develop AI systems that are free from ideological bias or engineered social agendas,” he says, just before emphasizing AI’s value for promoting “human flourishing, economic competitiveness, and national security.” No ideology here!

In my research, I study how AI researchers translate constructs like fairness into metrics they can calculate and optimize for in their work. What fascinates me is how little time researchers spend, in day-to-day practice, deciding which metrics to use.

But before we get to ethical constructs, let’s think about how practical constructs like causal reasoning might get operationalized. How would you demonstrate that an AI model was a capable causal reasoner? Maybe you could have it try…

  • Acing physics tests?
  • Or chemistry tests?
  • Or emotional intelligence tests?
  • Maybe you could train a model to extend video clips of physical systems?
  • Or read people’s faces to assess their psychological tendencies?
  • Or pilot a robotic body to puzzle its way out of an escape room?

Wow, it turns out that we mean lots of different things by causal reasoning! But the thing to do in this context is straightforward: we should create a bunch of benchmarks and start figuring out what our current models can and can’t achieve.

Just so we’re on the same page, a benchmark is one part training dataset, one part evaluation metric, and one part community challenge. For example, “we trained a model to get 94% of these physics questions right. Can you build a model that does better?”
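
To make that bundle concrete, here is a minimal sketch in Python of what a benchmark’s evaluation step might look like. The questions, answers, toy model, and score below are invented for illustration; they are not drawn from any real benchmark.

```python
# A minimal sketch of the "dataset + metric + challenge" bundle described above.
# The items, the model, and the resulting score are all illustrative.

from typing import Callable

# Hypothetical benchmark items: (question, gold answer)
physics_items = [
    ("A ball is dropped from rest. After 2 s, its speed is roughly...", "20 m/s"),
    ("Doubling the distance between two point charges scales the force by...", "1/4"),
    # ...a real benchmark would have many more items here
]

def accuracy(model: Callable[[str], str], items) -> float:
    """The benchmark's evaluation metric: fraction of answers matched exactly."""
    correct = sum(1 for question, gold in items if model(question).strip() == gold)
    return correct / len(items)

def toy_model(question: str) -> str:
    """A placeholder 'model' that always gives the same answer."""
    return "20 m/s"

# The "community challenge" part: publish the score and invite others to beat it.
print(f"Benchmark accuracy: {accuracy(toy_model, physics_items):.0%}")
```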

Some benchmarks take off, attracting lots of competitive attention. Others don’t. A technical paper introducing a new benchmark might be cited more if its dataset is interesting and readily available, its evaluation metric is simple to calculate and interpret, its technical methods are novel and attention-grabbing, and its performance is good while still inviting further progress.

But of course, none of these are strongly related to the normative value of that benchmark’s metric—does the metric adequately measure the construct in question? Acing physics tests is an impressive technical feat that says something about a model’s understanding of physics, at least in the abstract, and we’ve been evaluating humans with tests like that for a while. But what would it mean for a model to ace fairness tests? What would those even look like, and how should we understand their limitations?

Here’s the current state of play: Several competing families of fairness metrics in the literature each champion a different notion of fairness. Do we care about equality of opportunity, outcome, or something else? And how will we measure these?

By counting some things and not others, each metric answers these questions in its own way, embedding a unique philosophy of fairness. And when AI researchers optimize for these metrics, they are willing to treat everything that isn’t counted as merely useful for everything that is.
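
To see how two such metrics can count different things, here is a toy comparison of an equality-of-outcome style measure (a demographic parity gap) with an equality-of-opportunity style measure (a true-positive-rate gap). The groups, labels, and predictions are invented for the example, not taken from the post; the point is only that the same predictions can look perfectly fair by one count and unfair by the other.

```python
# Toy illustration: two fairness metrics, one set of predictions, two verdicts.
import numpy as np

group = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # protected attribute (invented)
label = np.array([1, 1, 0, 0, 1, 1, 1, 0])   # true outcomes (invented)
pred  = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # model decisions (invented)

def demographic_parity_gap(pred, group):
    """Equality of outcome: difference in positive-decision rates across groups."""
    return abs(pred[group == 0].mean() - pred[group == 1].mean())

def equal_opportunity_gap(pred, label, group):
    """Equality of opportunity: difference in true-positive rates across groups."""
    def tpr(g):
        return pred[(group == g) & (label == 1)].mean()
    return abs(tpr(0) - tpr(1))

print("Demographic parity gap:", demographic_parity_gap(pred, group))        # 0.0
print("Equal opportunity gap: ", equal_opportunity_gap(pred, label, group))  # ~0.33
# The first metric says "fair"; the second says otherwise. Which one counts?
```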

Of course, these competing fairness metrics don’t add up to a coherent philosophical account of fairness. Worse, they often seem to function as sites of imagined moral consensus that prevent genuine moral discourse from taking place. After all, take a look—we agree on these numbers, and they’re improving!

Given how many AI papers are coming out every day, most researchers conduct only cursory searches for the most popular papers, aiming to replicate and extend their approaches while exceeding their results on relevant measurables. That means that some metrics have enjoyed runaway success for (essentially) socially random reasons.

A resulting valuation pipeline has sprung up in which AI researchers consistently use a handful of relatively standardized metrics to measure fairness and treat them as just another constraint on optimization. Long before AI ethicists or regulators show up on the scene, the discussion of value has already been framed for them. Meanwhile, evidence-based decision-makers plod ahead using the numbers we have, whatever it is they happen to be measuring.

And that’s a real shame, because when different fairness metrics disagree deeply on what constitutes fairness and how to measure it, they force what I call fairness tradeoffs on us. Since each metric imposes a practical stance on what fairness is, the choice of metrics already decides moral tradeoffs for us in advance. The fact that this choice is made for (essentially) socially random reasons should give us pause. Furthermore, since fairness does not seem to be some preexisting quantity in disguise, this metric-first approach seems apt to obscure the moral concerns it claims to “objectively” represent.

What to do about this problem? The answer seems more structurally daunting than drafting better top-down AI policies or requiring a second ethics class for a STEM degree. I hope that by digging deeper into the dynamics of the problem, we can better appreciate just how early in the valuation pipeline ethical considerations need to be taken seriously.

