Justified Posteriors

Can an AI Interview You Better Than a Human?

Andrey Fradkin — Mon, 26 Jan 2026 13:04:33 GMT

We discuss “Voice in AI Firms: A Natural Field Experiment on Automated Job Interviews” by Brian Jabarian and Luca Henkel. The paper examines a randomized experiment with call center job applicants in the Philippines who were assigned to either AI-conducted voice interviews, human interviews, or given a choice between the two.

Key Findings:

AI interviews led to higher job offer rates and proportionally higher retention rates
No significant difference in involuntary terminations between groups
Applicants actually preferred AI interviews—likely due to scheduling flexibility and immediate availability
AI interviewers kept conversations more on-script with more substantive exchanges
Online applicants saw especially large gains from AI interviews

Topics Discussed:

The costs of recruitment and why interview efficiency matters
Whether AI interviews find different workers or just reduce noise in screening
How human recruiters interpret AI interview transcripts differently
The “Coasean singularity” question: Will AI improve labor market matching overall?
Limitations: scheduling confounds, external validity beyond call centers, unmeasured long-tail outcomes
The coming arms race between AI interviewers and AI-coached applicants

Posterior Updates:

On the usefulness of current AI for job hiring:

Seth: 40% → 90% confidence AI works for call center jobs; modest update for general jobs
Andrey: 20% → 75% for call centers; 1% → 5% for general interviews (“we need to reorganize all of hiring first”)

On whether AI will improve job matching significantly on net in the next 5-10 years

Andrey: 55% → No Update
Seth: “A bit more optimistic than Andrey” → +1pp update

Referenced Work/Authors:

Prediction Machines
Related episode on AI and labor signaling with Bo Cowgill.

Transcript:

[00:00:00] INTRODUCTION

Seth: Welcome to the Justified Posteriors podcast, the podcast that updates its priors about the economics of AI and technology. I’m Seth Benzell, an interviewer who will never stick to a standard script, coming to you from Chapman University in sunny Southern California.

Andrey: And I’m Andrey Fradkin, counting down the days until I can use an AI to pre-interview my podcast guests to see if they deserve to be on the show. Coming to you from San Francisco, California.

Seth: I don’t know. I think our filtering criteria is pretty good.

Andrey: I know.

Seth: Right. That’s one job we never want to automate—who becomes a friend of the podcast. That’s an un-automatable job.

Andrey: But it would be nice to pre-interview our guests so that we could prepare better for the actual show.

Seth: I was thinking about this, because there’s two possibilities, right? You do the pre-interview, and you get an unsurprising answer in this sort of pre-interview, and then that’s good, and then you should go with it. And then if you get a surprising one, then you would lean into it. What would you even get out of the pre-interview?

Andrey: Maybe what the guests would want to talk about.

Seth: Okay.

Andrey: But I agree with you. Mostly, it’s just hearing the guest talk, and then thinking about, “Oh, this is something that we want to really dig into,” versus, “This is something that might be not as interesting to our audience,” and knowing that ex ante.

[00:02:00] SETTING UP THE TOPIC

Seth: Yeah. We’ve been... So we’re talking about interviews. You’ll remember in a recent episode, we just talked to our friend Bo, who’s doing work on how maybe job applications are changing because of AI. So now I think what we want to think a little bit about is how job interviews are changing because of AI. Maybe we’ve heard before about how AI is changing how people talk to the hirer. Maybe we want to hear a little bit about how AI is changing how the hirer solicits information in an interview. We’ve got a very interesting paper to talk about just about that. But do you remember the last job interview you did, Andrey?

Andrey: Yes.

Seth: How did it go? Did you have fun? Did you feel like you stayed on topic?

Andrey: It was a very intense set of interviews that required me to fly halfway across the world, which was fun, but exhausting.

Seth: So fun. So you would describe the interview as a fun experience? Did you get more excited about the job after doing the interview?

Andrey: Yes, although I ultimately didn’t take it, but I did get—you know, I was impressed by the signaling value of having such an interview.

Seth: So the signaling value. So in other words, the signal to you from the interviewer about the fact that they were going to invest this much time. Is that right? It’s that direction of signal?

Andrey: Yes, yes. And also the sorts of people who they had talking to me, and just the fact that they were trying to pitch me so hard. Now, certain other companies lacked such efforts.

Seth: Right. So it seems like one important aspect of an interview is what the interviewee learns from the interview. But what about the other side? Do you feel like your interviewer learned a lot about you, or enough to justify all that time and expense?

Andrey: I’d like to think so. I mean, I’m not them, so I can’t really speak on their behalf. But it did seem like the interview process was fairly thought out for a certain set of goals, which might differ across companies. What about yourself, Seth?

Seth: Thank God, it has been a long time ago that I interviewed for a job, and I can tell you exactly what happened. I was on the academic job market, but I did throw out a couple of business applications, and so I got an interview at Facebook. Headed out to their headquarters, did all of the one-on-one interviews, and then there was a code screen, and I was not grinding LeetCode for the last five months and completely bombed it. And they said, “Thank you very much for your time.” So that was an example of, I think they probably could have saved the time for the interview if they had given me the code screen first.

Andrey: It’s funny, there was a time in my life where I interviewed at Facebook, too. I mean, this is probably 2014 or something.

Seth: Mm-hmm, mm-hmm.

Andrey: And they did do the coding screen before.

Seth: Who knows? Who knows, dude?

[00:05:15] THE PAPER

Seth: Okay, so interviews, we do them. People seem to give information, take information from them. How can this be made more efficient with AI? That’s today’s question. In order to learn more about that, we read Voice in AI Firms: A Natural Field Experiment on Automated Job Interviews, by friend of the show, Brian Jabrian and Luca Henkel. I was interested in this paper because it’s kind of an interesting flip side of what we just saw from Bo.

I guess before we talk too much about what the paper actually does, it’s time for us to go into our priors.

═══════════════════════════════════════════════════════════════════

[00:06:00] PRIORS

Seth: Okay, so Andrey, when we’re thinking about AI being used in interviews, what sort of thoughts do you have about that going in? What sort of priors should we be exchanging?

Andrey: Yeah, I mean, I think just when I first saw this paper, I was kind of surprised that we were there already, honestly. I think interviewing via voice is a pretty delicate thing, and the fact that AI is potentially able to do it already was—I hadn’t been thinking—I didn’t think we were there yet, and I think just the very existence of this paper was a bit of a surprise when I first saw it.

But I guess a first natural prior that we can think about is: is using an AI to interview someone rather than using a human to interview someone, is that better or worse, or how do we think about that?

So, Seth, what do you think?

Seth: Well, it’s a big question, Andrey. I guess my first response is, like we always say in this podcast, context matters, partial equilibrium versus general equilibrium matters. The context that we’re going to be looking at in the paper is call center workers. So maybe I’ll give kind of a different answer for short-term call center workers than maybe longer term economy as a whole.

When I think about call center workers, I think about a job that seems to be—no offense to our friends of the show out there who are call center workers—but this does seem like one of the jobs that is going to be the first to be automated with generative AI, or most at risk, especially kind of low-skilled call center work. So if there was going to be any sort of domain where you could automatically verify whether someone was good at it, intuitively, it would be the domain that you’re kind of close to automating anyway. So if it was going to work anywhere, I would say it would work here.

And yet still, call center work, you might imagine, it requires a lot of personal empathy, it requires maybe some subtleties of voice and accent that an AI might not identify or even might hesitate to point out such deficits. I would say I kind of went in with the idea that for call center workers, maybe there’s a forty percent chance that AI would be better than a human interviewer. So maybe it’s slightly unlikely that it would be better. But if we were to expand out to kind of knowledge work as a whole, I would be more, even more pessimistic, maybe only a twenty-five percent chance or lower that the AI interviewer would be better. What do you think?

Andrey: Well, how would you—what do you mean by better?

Seth: Oh, well, better in terms of the hire is ultimately the correct match, right? That’s going to be operationalized in a specific way in this paper, what... How they’re going to measure better match, but, yeah, that’s what I would say. They hire someone who’s going to be productive and work with the firm for a long time.

Andrey: Yeah. I mean, so that’s kind of one definition, I guess. Another definition might be, is the ROI from a particular interview process better or not?

Seth: Right, better net of costs. Right. Okay.

Andrey: Because I think one of the things that oftentimes economists underappreciate is that recruitment is an enormous cost.

Seth: Don’t tell those search labor economists, dude.

Andrey: Some of them model it, but I don’t think it’s actually a big focus. But it’s just the process of interviewing. You know, let’s say there’s a position, and you need to interview six people for a relatively high position, so that’s six hours direct, or maybe it’s a half-hour interview, it’s not obvious. But then also, there are all the meetings and pre-meetings, post meetings. Maybe you give an offer, and then they don’t accept it. And there... I mean, there’s just a lot of costs involved. So even if it wasn’t as good as a preexisting interview process, it might still be ROI positive for the firm.

Seth: I guess we come back to what is the cost of interviewing versus the cost of making a bad decision. You know, well, it’s not, it’s public information that we, here at my university, we hired a dean of the business school who was an absolute disaster and got voted out by the faculty in a ninety-eight percent vote after one year. That guy did a lot of damage, right? We should have interviewed him harder.

So it really depends. So I guess the point would be in kind of higher leverage roles, you would think that the interview costs would be a relatively negligible part of what’s going on.

Andrey: I don’t think that’s true. I think in higher leverage roles, higher leverage people have to do the interviewing, and the cost of delaying hiring is much higher. So to me, it’s not obvious. But anyway, that’s, this is all a sidebar.

Seth: Okay, so let me hear the prior.

Andrey: Yeah. So I think my prior that this interview technology would be better than a human technology, just solely based on match quality, was actually quite low. I probably twenty percent, or maybe less than that, actually. Because it just seems like, yeah, maybe on average or maybe in a typical case, it’s fine, but there’s so many things that can happen in an interview that you could only learn by running a process enough times to really learn how to do it well. And so, yeah, I wasn’t super optimistic that it was going to work yet, even for call center workers.

But I think for kind of higher-end labor, right, I think my prior that it would be better is very low, you know, like 1%. Just because I just don’t think we’re there yet.

Seth: Wait, so I’m getting—So 20% for call center workers and 1% generally, was the take?

Andrey: Yeah, that would be my sense.

Seth: Mm-hmm.

Andrey: I mean, just, it’s hard to imagine that at today’s technology levels, that for, let’s say, a professor job, that the AI could interview better... I guess one way to put it is getting rid of all the humans in the interview loop for a faculty hire, that seems just kind of crazy.

Seth: Right, and that... Well, obviously, a more extreme experiment than what we’re talking about here. Faculty, we’re thinking about, you know, maybe they’re pushing frontier knowledge, would be the last thing that you would think that an AI would be able to get at. Another thing I think about is someone who’s going to be in your faculty is living with you for 20 years, so you might really care about if they smell good, if they have a peccadillo that bothers you, that these might not be relevant considerations in a call center remote job, right?

Andrey: Yeah. Yeah, exactly. I think... And I think, actually, the interpersonal thing, which is a very contentious thing, by the way, is that I think people understand that good teams get along with each other. But at the same time, screening based on how much you’d like to have a beer with someone might have problems, you know?

Seth: Not good.

Andrey: So yeah. So, you know, it’s not obvious which way that cuts, but certainly it’s an important part of hiring. And, you know, I think for higher-paying jobs, it’s not that there’s just one interview, of course. There are many, many interviews, and oftentimes, in-person components of interviews over dinner, and so on. And you might think, you know, maybe that’s all unnecessary, but given that it persists in equilibrium, even though it’d be a lot cheaper not to do it, that should signal something.

[00:14:00] GENERAL EQUILIBRIUM CONSIDERATIONS

Seth: Good point. But now, Andrey, what I’d like us to think about for a second is to maybe zoom out for a bit and think about, okay, we’re talking about current generation technology in partial equilibrium in this study. One company uses 2025 generative AI to try to attack this specific question for call center workers. Let’s take a step back. You know, that’s what we always want to do in this podcast, is take a step back and like, okay, what does this tell us about the broader process that society is undergoing?

You’ve written recently, movingly, to be honest, about this idea of a Coasean singularity, that AI will be so good at helping us communicate to each other, that we’ll get perfect matching at zero cost. I don’t know what timeframe you have in mind, but presumably, one of the things we’ll get better at matching is people to jobs. So maybe you’re pessimistic that in this context, in this time, that AI will be good at hiring, but do you think, you know, 5, 10 years from now, as these technologies diffuse, do you think we’ll get better job matching as a result of employers using a lot of AI and job applicants using a lot of AI? Is that final equilibrium the destruction of all meaning, as Bo, you know, foretold, or is it the utopia of the Coasean singularity?

Andrey: Well, I do want to point out that I don’t think any of the authors strongly believe that the Coasean singularity will happen, actually, you know?

Seth: Oh, the Coasean singularity is a myth?

Andrey: The Coasean singularity, question mark, Seth. Question mark.

Seth: Question mark’s doing a lot of work, Andrey.

Andrey: Yeah. No, the paper is doing a lot of work to tell you why it might not happen.

But I think, yeah, I think time horizon certainly matters here, right?

Seth: Okay, but let’s say 5 to 10, to just to choose a number.

Andrey: Yeah. So, so, like, not that long a time horizon. It’s very non-obvious to me. Just because there are all sorts of institutions that are going to be involved, very messy institutions. Like, one of the things that we already talked a lot about on this show is the problem of too many applications, applications lacking signaling value. At the same time, you know, you can imagine on the interview side, if you interview, you know... How does this all affect the number of interviews you’re going to do?

Seth: There’ll be more and more applications. The cost of applications goes down, yeah.

Andrey: Yes. Now, maybe the cost of interviewing goes down, but it doesn’t for the applicant if they have to be the one... You know, if the applicant’s agent is doing the interviewing, maybe it’s a different story. But if the—

Seth: Right! How many, how... It’s like, it feels like you’re watching, you know, the drone war in Ukraine. There’s the move, and the countermove, and the countermove, and the countermove. It’s hard to say where that process ends, right?

Andrey: Yeah. So I... And then I think, of course, you know, there are actual individual institutions involved. Like, what is the government going to do? And even if some nimble firms are really doing a great job of matching using AI technologies, how that plays out when there are other organizations that are using other sorts of tools, it’s just completely not obvious to me over a five to 10-year time period.

Seth: So is that a fifty-fifty? Is that a, I have—is my prior is the completely uninformed prior?

Andrey: No, no. I think because you’re introducing both sides of the technologies, both the AI for the applicants and for the employers, it’s hard. I mean, I’m a bit of an optimist, so maybe I’ll say fifty-five percent chance.

Seth: Fifty-five percent. Ooh, I have to say, I’m a little bit more optimistic than you, Andrey. I think if you think about the world, the world, since, you know, the rise of the printing press, has seen an arms race in technologies for understanding versus technologies for lying, right? And yet, we think kind of the general process has been towards better price discovery, better matching, right? It seems like we could translate the same ideas to financial markets, where people are getting better at lying, people are getting better at trading, people are getting better at communicating. But ultimately, I mean, at least my sense is that price discovery has improved, right? So I guess—

Andrey: Oh, I would argue the opposite. So I... Not price discovery, but labor discovery, I think has been substantively hurt over the past five to ten years. Because our educational institutions have abdicated their role—

Seth: Credentialing.

Andrey: Actually, credentialing, and because it’s been trivial to start applying to jobs. So yeah, I mean, look, that’s a little too pessimistic, but I’m just saying that over a five- to ten-year period, I have to be a little bit cautious. I think if we’re to be able to reoptimize our institutions, I mean, now the problem with going thirty years is how much human labor do we even have? But to me, just lots of things could be going on.

[00:22:00] THE EVIDENCE - CONTEXT

Seth: Okay, all right. So we’ve got our priors locked in. Now it’s time to turn to the evidence.

Okay, so our context here is the Philippines in 2025. We’ve got a pool of about seventy thousand applicants to different call center jobs. They’re all going through this one recruiter who’s recruiting for multiple different businesses. To give some context about the call center job market, this is a very high-turnover, low-paid work. We’re talking about three or four hundred dollars a month at two to three times minimum wage. The skills required are English speaking, flexibility with changing shifts. There is a line in the job application that calls for strong analytical and logical thinking. I think strong might not be the correct adjective there. You probably need more than zero.

But all this combines into a job that people are not married to. So we’re looking at a job with sixty percent annual turnover, with a high share of that being people voluntarily leaving rather than being fired. The... We’re talking, in order to do these interviews, people, first, they can either show up in person to one of these recruiting offices, or they can apply online. Then they’re scheduled for an interview, and they also take a standardized test that has both an English skills component and a kind of analytical mathy component. And just to give a sense of how strong a filter this is, about six in—if we’re talking about the human interview baseline, about six percent of applicants accept a job, while two percent still have a job one hundred and twenty days after being hired. So that’s not a conditional average. That’s just two percent of people who show up for an interview end up having the job for at least four months. So that’s our context.

Andrey: And about ten percent get an offer, approximately.

Seth: Right. Yeah, yeah, so ten percent get an offer, six percent accept the job. Okay. So that’s the context. Andrey, do you want to tell us about the experiment?

[00:22:40] THE EXPERIMENT

Andrey: Yeah, sure. So in the experiment, workers were, or applicants... Well, first they were pre-screened a little bit—

Seth: Very lightly.

Andrey: Yes, and then they were assigned to either a group where they had an AI interviewer, whether they had a human interviewer, or one in which they got to pick. And I guess there’s a lot to be said about the specifics of that interviewer process. So there, as you can imagine, for a job where so many people are being hired, there’s a lot of standardization of, you know, what sorts of things need to be discussed, in what order. And the AI tries to... You know, the AI tool that the company has purchased is going to is programmed to do that, and it tries to do that. Another key important part of the context is scheduling.

So an AI can take the interview at any time with you, which could be just right away, as soon as you pass the pre-screener, whereas a human needs to be assigned to an interview, and that could take some amount of time. So that’s also a pretty big potential difference in how we should think about these things, right? So we oftentimes focus, oh, can the AI really do it? But actually, AI has this other advantage where it could just do it right away.

Seth: Although, it is, it’s an interesting result. Even though the AI conducts the interview faster, it still takes longer for the AI interviewed to actually get the job offer decision, which seems to be driven by the humans. And now we’re going to get into the details of how does this AI system work? There is a human who listens to the AI interview, right? And apparently, I get the impression that the humans who listen to the AI interviews do not enjoy it. They would rather listen to themselves, right? They score these a lot faster if it’s their own interview versus the AI interview.

Andrey: So did they really do a good job of explaining why that happens in the paper? Or maybe—

Seth: Well, that’s my speculation.

Andrey: That’s actually not what my speculation is at all.

Seth: Okay. Oh, let me hear it.

Andrey: So you’re portraying it like, you know, they’re just taking a long time to listen. Like, they, you know, to listen through the interview. But actually, it seems like a procedural thing. Like just the system, when it assigns them to review these applications, you know, is later than if you already did the interview.

Seth: Presumably, you score it right there.

Andrey: Yes. Yeah, yeah. And to be clear, my understanding is that there’s a different person, which is the recruiter, who’s doing the scoring, than the person who’s doing the human versus the machine interview. So it’s not like they’re either listening to the machine or listening to the human and then finding the machine less interesting to listen to. It’s actually just procedural that they’re getting assigned to read this AI interview result later.

Seth: So maybe not an essential difference, but one that could be corrected with a little refinement here.

Andrey: Yes, exactly. Yeah, yeah.

Seth: Mm-hmm.

Andrey: I know we got into kind of this side bit, but I don’t think it’s a side bit because it’s always important to think about what is the treatment exactly. And one of the threats to internal validity that I always teach my students is that if multiple things are changing at the same time when the treatment gets assigned, and in this case, there are. You know, you’re getting the AI interview, but you’re also getting interviewed way faster initially. So from the applicant’s point of view, that’s kind of very salient.

Seth: It’s sort of a different experience.

Andrey: Yeah.

Seth: Which, you know, like we talked about, the interviewee also learns from the interview, right? It’s like when the professor says, “I learn far more from my students than they learn from me.”

Andrey: Yeah. Well, I don’t think this is a learning—I mean, it’s not like I’m going to rule out learning by these workers. But my sense is that there’s not a lot of uncertainty about this job for the people who are—

Seth: These jobs are pretty homogenous.

Andrey: They’re pretty homogeneous—well, you know, they’re at least... You know the distribution, you know, probably, you know, doesn’t have too much to do with the specific firm. You know, they’re—probably, the call centers jobs are, you know, there, there are just a lot of them, and depends on which, who you get assigned to in terms of your client.

Seth: I think this is an important point, which is that it really does seem like there’s more vertical differentiation here than horizontal differentiation. You might imagine a context with more horizontal differentiation, the AI interviews might not be as good. But here, we’re just trying to find the right tier of worker, because if it hasn’t become clear yet, the main failure mode isn’t you hire someone who’s too bad. The failure mode is you hire someone who’s too good, and they leave the job after a week.

Andrey: Well, we don’t—So to be clear, I don’t actually know why people leave their job. You’re assuming that they’re too good, but actually that to me is completely not obvious. It’s like an Uber driver. It’s not like the Uber driver is too good if they stop driving on Uber. It’s just maybe they needed money for a couple of weeks.

Seth: Well, their distribution of opportunity cost is higher, which would be correlated with being good.

Andrey: Yeah, but it might also just be they just had temporary liquidity... To be clear, what I’m trying to say is that that correlation, in my opinion, is very likely to be low. The fact that these people apply to this job, which is very fungible in the first place, which so many people in their country apply for, is not suggesting to me that these applicants are somehow, have all these amazing other opportunities. And, you know, they’re probably call center workers that might be cycling between call centers, or maybe they’re cycling between call centers and other seasonal work. I mean, I don’t know. I just wouldn’t assume it’s about quality. Yeah. It’s not like “Oh, wow! They’re so good at math, and then they got discovered.” You know, that’s kind of not the story here.

Seth: Okay, but we’ll come back to whether who seems to be helped by or hurt by the AI worker in a second. I guess one last thing I want to say about the experiment and its context before we go into the results, are that they... We also get a survey of people on their interview experience. So you might imagine that they’re going to be obsequious or sycophantic, to use a word in vogue these days, because, you know, they’re trying to get a job, but that just gives us another slice at trying to understand what they’re thinking.

Andrey: Yep.

Seth: Okay—

Andrey: So yeah, I mean, I guess we should say, because we haven’t made this clear yet, this is an absurdly impressive experiment. I mean, holy crap!

Seth: Yes.

Andrey: Right? Just logistically, it’s... You know, I can imagine how difficult it would be to get all this machinery rolling and, you know, figure out the pilot studies, and figure out the AI model provider, and convince the firm to do it this way versus a variety of other ways. You know, I think it’s notable that certainly, the firm should be interested in the results of the experiment. They’re—It’s probably an active, like many other firms, they’re actively deciding where to use AI tools, and so it is incentive aligned in that way. But still, it just is a very impressive experiment.

Seth: Yes, huge snaps to the authors, especially Brian, who I understand is on the market right now. Give the man a job.

[00:31:00] HEADLINE RESULTS

Seth: So all right. To get into the headline results, the AI interviews seem to work. We get twelve percent more offers. So of the people who are randomized into the AI group versus the human group, the AI interviewed get twelve percent more offers, have eighteen percent more job starts, and have eighteen percent higher chance of working with the company for at least four months. So our main outcome here is retention and hiring as positive outcomes. Maybe in the limitation section, we’ll talk about kind of the limitations of those as the endpoints, but, you know, retention seems to be one of the big challenges here, given that it’s kind of, as you said, very fungible work. And those seem like significant results, plus on top of all the cost savings you previously talked about.

Andrey: Yeah, yeah. I mean, it’s definitely... You know, the ROI calculation, of course, needs to account for other things, but just the baseline results do suggest that this is a very useful technology.

Yeah, what do I make of this? I think it’s interesting to think about where this effect is coming from. Is it coming from different types of workers being screened by the two methods, or is it just that the AI method just picks off a few marginal workers that happen to stay longer?

Seth: Be bad at interviewing, right?

Andrey: Yeah, or bad at interviewing, or they, you know, they’re actually good enough, but the old interview process was a bit too noisy to pick them out, right? So there’s kind of this question: What’s going on? Because what I would’ve thought that, you know, like if I was a company, and I was thinking about, well, what is the interview technology that I want? I want an interview technology that gives me the same decisions as I was making before but with a lot less cost.

Seth: Mm-hmm. Right.

Andrey: The fact that this technology instead increases the hire rates. First of all, in a lot of jobs, like for a lot of jobs, there’s one slot, so this couldn’t be a result that was replicable, right? Like, if you’re hiring a professor, and you have one slot, it’s not like you’re going to increase... I mean, you can increase your hire rate from zero to one, but it’s kind of... It—

Seth: But retention then.

Andrey: You have to really... Yeah, but those are different—But you have to think about why you’re getting the retention effect, right?

Seth: Right.

Andrey: And so there are kind of different things that we can think about here. Is it that the interview process is less noisy? Is it that the interview process is more lenient, that it’s getting marginal guys? Or is it that actually, it’s actually picking out different people, and those people are better matched, which then raises the question of like, wow, those old interviewers were not very good, right?

Seth: Right.

Andrey: Which is, you know, I’m sure there are plenty of interviewers who are not good. That’s—It’s not surprising to me. Yeah, but I guess, yeah, those are the questions that are raised, right? Because I don’t think it’s inherent. How you use the AI tool is your choice as a firm. There’s no law that’s going to say that you’re going to increase your hire rates because you happen to use an AI interviewer, right?

Seth: Right. And so, yes, a great point is you might be concerned that this leads to a more sort of lenient, we’re letting in marginal people. You know, we’re not actually getting more information. Or maybe we’re getting less information, and we’re just letting in marginal people. One piece of evidence against that is there is no significant difference in the rate of involuntary disconnections, right? So remember, retention is higher, and that is not driven by any difference in the newly hired being less likely to be fired, right? The people who are hired by AI, the reason they are retained for a little bit longer is because they are basically fired at the same rate, but they’re less likely to disconnect on their own a little bit. That’s my read.

So how do you interpret that?

Andrey: I guess it still isn’t telling me that whether we’re picking... I mean, for what it’s worth, I just—My reading of the evidence from this paper is that there’s just a lot of overlap in who gets hired, and then there’s just a few marginal guys, and then your power to detect differences and fire rates between the two are very low. But I don’t think the firm—I’d assume that the firm doesn’t care that, you know, there’s so many workers falling through, you know, that involuntary separations are just part of the game. But I wouldn’t... It seems like the power for that difference seems very low.

Seth: Fair enough. And further, and we can talk about this in limitations, too, retention rate just gives you a sense of what percentage of people are above or below some sort of line of so disastrous you get fired. You might imagine that an AI interviewer has a lower chance of detecting the truly disastrous person who’s just going to start slamming racial epithets at everyone who calls up, right? You might imagine that there’s kind of a long tail of badness that’s not being picked up by AI, and then this measure of outcome wouldn’t pick up that the long tail of badness is getting worse.

[00:36:35] MECHANISM - HOW THE AI WORKS

Andrey: Yeah, yeah. I mean, and to be clear, I don’t want to highlight that. I’m just making the point that there’s no generic—I like to think about the prediction machines framework here maybe.

Seth: Friend of the show, Avi Goldfarb.

Andrey: And Ajay and Joshua Gantz, yes. So the AI makes a prediction, but then you’re the decision maker. Let’s say you’re the CEO or the hiring manager of this firm. You get to choose how you use that information, right? So you can use it—

Seth: But it’s not that the AI isn’t... Wait, wait, wait, wait. The AI isn’t making a prediction here. The AI is soliciting different information in the interview.

Andrey: Sure, but it’s giving you a signal. And you can choose what to do with that signal however you like, right? So that’s kind of the point I’m making. In this case, the AI was good enough at interviewing people that you got a pretty good signal, and the system used it in the following way that seemed to have been positive. But I guess what I’m saying is how you—there are human recruiters that are taking the signal from the AI interview and choosing what to do with it. And they chose to hire more people as a result. That’s not a quality of the AI, that’s a quality of the humans making decisions off of information.

Seth: I mean, I don’t know what to say to that, Andrey. Like, you know, it’s like saying, you know, the factory didn’t make 10 tons of steel. It was the business factory sociotechnological system that made 10 tons of steel.

Andrey: No, I guess the point I’m making is that you could have imagined, here’s a simple story. Let’s say the interviewers don’t know how to interpret the AI interviews, and they do know how to interpret the human interviews. Then they could make very different decisions off of very similar transcripts off of the two.

Seth: Correct.

Andrey: Right? That, I guess that’s what I’m trying to say.

Seth: And I think that’s right. I think that’s right, but I’m also pointing out that we usually don’t talk about technologies that way. Every technology is embedded in an organization. So yes, but yes, every other technology also.

Andrey: No, because when people do AI evaluations, they’re always saying that AI does this, AI does that. And then in this case—

Seth: Like GDPVal.

Andrey: Yes, yes. AI is going to fully automate end-to-end this task. And I guess what I’m saying here is that there’s no way it’s automating the decision. It’s not automating the decision. I guess the other thing is there are AIs that automate decisions in hiring, right? There are certainly AIs that screen resumes, for example. So I don’t think it’s a crazy thing to talk about here.

Seth: I don’t think you’re being crazy either. And of course, the context matters, but then even in GDPVal, I could say the same thing, right? It’s going to get evaluated by a human expert. The human expert either is good or bad at understanding the way that the AI talks about the thing. I mean, it seems like any time a human touches it, okay, yeah, it’s in a human context.

Andrey: I guess... Sorry, but you keep on thinking that this is a criticism. It’s not a criticism that I’m—You don’t need to defend it. It’s just I’m just saying that—

Seth: I’m not saying it’s a criticism.

Andrey: Yeah.

Seth: I’m saying it’s a universal... I’m saying it’s a truism.

Andrey: It’s just the company chooses what to do with this.

Seth: True.

Andrey: It’s interesting that the way that it was used happened to play out this way. But for example, the company might not have wanted to hire them, right? Like, what is the hiring cap for the company? Do they want to hire infinite workers? Do they want to hire 50 workers? How does that allocate the—

Seth: Do they care more about average quality or average retention? I totally agree. Totally agree. Okay, so I don’t think we’re disagreeing.

[00:41:00] LINGUISTIC ANALYSIS

Seth: All right, but let me try to help you a little bit, Andrey, with thinking about what’s happening different in these interviews. Because maybe we can’t exactly say how are the people who get hired different under the two regimes, but we can say something about how the two different interviews go. And so the authors do this really fascinating linguistic analysis of what actually happens in the interviews, because they’ve got the full text of all of these interviews.

Andrey: Actually, can you show figure 2 first, actually?

Seth: Ooh, let’s talk about figure 2 for a second. All right, I’m putting figure 2 on the board. Is that good?

Andrey: So I think I found this very helpful to address some of the questions about... that I was raising. In particular, what we see here is on the top line, the human topic coverage, and on the bottom line, the AI topic coverage. And the AI does seem to cover more topics most of the time than the human. In the second column, we see that the AI tends to follow the preordained order of the interview that was, you know, the interview designers designed. And in the third column, we see that the AI follows the guideline questions much more closely. So it’s standardizing the interview process. So my sense is that this should reduce the noise in the hiring decisions quite a bit. You know, at least in a very naive model of hiring. Now, you can come up with scenarios where there’s—

Seth: Yeah, in a naive model where the generic approach is the correct approach, right?

Andrey: Yes, yeah.

Seth: Because you might have a model—

Andrey: If you need to cater to different people, how you interview, because you’re really trying to extract a particular signal, then maybe this won’t work. But then we go back to the fact that these are call center workers, and maybe there’s more of a—it’s a more standard situation.

Seth: Agreed. Okay, but I, you know, even though this is an interesting figure, the figure that really struck me is the next one, where we look at, okay, what are the things in interviews that are predictive or not predictive of the interview leading to a hire? And then how often do those appear in the AI versus the human interviews? And so what are the bad things that happen in human interviews that don’t happen in the AI interviews? Well, first, I love this one: back-channel cue frequency. Now, I’m not a hundred percent clear on what this means, but the implication is it’s people trying to give a kickback to the interviewer or saying, “Hey, I know your cousin, give me an interview.” Did you get a sense of exactly what this is?

Andrey: Yeah. I don’t quite know how to interpret it.

Seth: Well... I mean, that is kind of interesting and funny and kind of reflective—

Andrey: Short cues indicating attention or agreement. So I don’t think that’s exactly what we’re talking about.

Seth: Short cues, agreement—so they’re just saying, “Yes, yes?”

Andrey: Yes.

Seth: “Hmm.”

Andrey: Hmm.

Seth: Hmm.

Andrey: Hmm.

Seth: That’s less exciting than what I thought that meant. Okay, well, how about this one? We talked... And I think this is really illustrative here of how you might not be able to extend this result out of context. What is bad for an interviewer? Asking a lot of questions about the job, right? Like we said, Andrey, in the kind of jobs you apply for, they’re trying to get you, right? The interview is just as much about what you learn about them. That is not the kind of job we’re talking about here. Any time you’re spending saying, “So you’re telling me this call center worker doesn’t have any benefits?” You’re signaling to them that, you know, you’re going to be a little bit light-footed, wouldn’t you say that, Andrey?

Andrey: Yeah, I mean, it’s a standard job, you know, not... I presume that most people applying for it know how it works.

Seth: “Will I be required to talk to people on the phone in this job?” That’s a bad signal if you say that.

On the other hand, what happens more in the AI interviews? Well, the one thing that happens significantly more of are exchanges. So like you showed us before, you get through more of the standard questionnaire in the AI interview, which makes sense if the AI is good at sticking to the script, which, as I clarified in my intro joke, I think I would be bad at. So that tells us a little bit about what’s happening different in these interviews.

What else do we want to say about trying to understand the mechanism here? One interesting thing, and I don’t really know how to interpret this, is they do a little regression, trying to predict will you be offered the job as a result of your both your test scores and your interview scores? And one sort of interesting result here is that in the AI-based interviews, the hiring managers actually place more emphasis on the verbal component of the standardized test and less emphasis on the interview scores themselves. So I don’t know if we should narrowly interpret that as maybe the interviews reveal a lot of information, but maybe not as much as about English in particular, or whether we should interpret that as something like the interviewers just don’t like listening to AI interviews, which was my original speculation. Do you have an interpretation of that result? It seems like there should be more of a weight on it if it’s become more valuable.

Andrey: Yeah, I don’t quite know. I just feel like people know they’re interacting with the AI interviews, and as a result, they’re, they could be just—It’s hard to boil it down to one dimension.

Seth: Mm-hmm. Fair enough. And again, that’s kind of, you know... Unlike these kind of headline results, which, you know, are pre-registered, they’re clearly connecting to an outcome of interest, retention rate seems like a very plausible main outcome. This is kind of more exploratory. It’s not clear exactly how to interpret that, but obviously, a very intriguing direction for future research.

[00:47:00] ONLINE VS IN-PERSON APPLICANTS

Seth: Okay, one last striking thing that I want to bring up, and maybe this speaks to—this is kind of the last bit of interpreting the result that I want to think about. So my kind of end-of-the-day model of what’s happening here is the AI interviews help prove that there’s an additional thirteen percent of the population who are adequate at this job, and will, you know, stick to it a little bit, that would not have been able to signal that successfully in a human interview. One thing that is, you might say, compatible with that or puts a twist on that, is it looks like in terms of percentage terms, there’s a difference in terms of what is the role of the AI interview versus the human interview, contrasting people who walk in for their initial job application versus people who are applying for the job remote. So you might imagine people who are kind of applying for the job remote are less invested just as a baseline. It’s much easier to apply remote than to apply in person. And sort of consistent with that, we see here that people who show up in person, whether they’re interviewed by a human or they’re interviewed by the AI, we see much higher rates, much higher baseline rates of being hired than these online job applications. So but within these online job applications, what do we see? And I’ll maybe put this in the middle of my screen again.

What do we see? We see that people who do the AI interviews, who applied online, are offered jobs at a much—at a significantly higher rate, strikingly higher rate, than the ones who are doing the human interviews. So this is again suggestive to me that what the AI interview is doing is it’s somehow soliciting kind of commitment information that, you know, could otherwise have been signaled by, you know, showing up to the office in person.

Andrey: Yeah, I wouldn’t say... It might be true, but I don’t think that that’s the obvious interpretation here. I mean, there could be quality differences between the two. So I wouldn’t say it’s just commitment. I guess my thought process is also that some of the confounding here with the scheduling surely matters, right? I applied. I’m ready. I finally did it! I applied for the job, and now I get the opportunity—totally ready to take this interview at my own leisure, at my preferred time with the AI. Yeah. Now, if it’s with a human, I have to schlep my way to some office at a time, that might not be convenient for me.

Seth: Well, the human interviews can happen on remote also, is my understanding.

Andrey: Yeah, fair enough.

Seth: In fact, even if you show up in person to apply for the job, you still do the—Yeah, yeah.

Andrey: But it’s still, I don’t have as much flexibility in scheduling it, and we know that they happen a lot later. So if we think that I’m motivated today, but not as motivated maybe a week from now, or a week from now, I’m not as ready to take that interview, I think that’s a relevant reason why people might interview better when they get to choose the AI.

Seth: Fair enough.

Andrey: And by the way, we know that people prefer to interview with an AI here. This is very—

Seth: Yes, because we get that third randomized group. Yeah, please tell us about it.

[00:51:00] APPLICANT PREFERENCES

Andrey: Yeah. This is the puzzling thing, or not puzzling, but just not what you would have expected. It’s like people prefer to have the AI interview, right? Which I don’t know if I would... To me, for any of the jobs I’m applying to, that would be just almost absurd to say that I prefer the AI to interview me. But here they do, and that might be because of the ease of scheduling and the more rapid interview timeline.

Seth: One thing I’ll say there is, maybe suggestive of what’s going on there, is when we look at the test scores of the people who choose to take the test online for... Oh, sorry. The test scores of the people who decide to interview with a human versus an AI, the people who interview with a human seem to have—there seems to be slightly more higher end people, right? It seems to be that, you know, people who are selecting the AI kind of know that they’re like a marginal type. Whereas the people—

Andrey: So I—once again, like I see vast overlap in distribution, so I’m like—

Seth: Sure. I mean, at the—a little bit, a little bit. All right.

Andrey: Yeah. They’re mostly the same people. There’s a little bit of difference.

Seth: So they’re mostly the same. Fair enough.

Are you ready to talk about the limitations? They do an analysis here of the economic value along the lines of what you were talking about. I don’t think we need to talk through that.

Andrey: Yeah, we don’t need to talk through that.

Seth: It’s pretty speculative.

Andrey: Yeah.

Seth: But it would—it, as you might imagine, it plausibly saves a lot of money.

Andrey: Yes. Yeah.

[00:53:00] LIMITATIONS

Seth: Do you want to talk about limitations for a bit?

Andrey: I think this paper is pretty upfront about what it’s trying to do. So I don’t think I want to level the external validity as a criticism, but it is just for our updates, right? It’s very relevant that this is a very specific—

Seth: It’s a limitation—it’s not a criticism, it’s a limitation.

Andrey: Yes, yes. Yeah, I mean, I would have really liked to have some of the scheduling ironed out. It seems like a pretty major confounder to me. Maybe they could do some work matching similar scheduling going on. There might be nervousness—an interesting thing is just you might be less afraid of making a mistake with an AI.

Seth: Yeah, we see that in the poll.

Andrey: We, yeah, we see that in the survey. Yeah. Yeah.

Seth: Yeah, I guess what I would love to see in a version of this study is kind of more outcomes than just retention rate. Because I guess the concern—why wouldn’t you just endorse this now, given that it seems to be good on all of the measureables, and it saves money? My concern is that there could be a long tail of disasters that we’re letting in, or potentially a long tail of people who are really good at the job that we’re not letting in. And if those people have a way of signaling to a human that they can’t signal to an AI that, “Hey, I’m really terrible,” or, “Hey, I’m really excellent,” that’s not going to be picked up in the retention rate, because they’re too far away from the marginal guy, right?

Andrey: Yeah. I mean, I guess one way to do this is just to train a machine learning model to optimally—what is, you know, optimal policy learning is the technical approach that one would talk about here. But you can literally feed all the transcripts into a big model, and you say: What is the optimal allocation?

Seth: Right.

Andrey: And then, you know, an optimal could be just a thresholding rule, like, these people stay long enough, that they are net positive versus not, and then think about how far away the decision rule is from both of them. I mean, to me, I almost don’t even care about that stuff.

Seth: Makes sense.

Andrey: Why? Because the fact that the higher rates tend to be higher... Like, this goes back to my earlier point. To me, the just the fact that this technology is adequate, perfectly adequate, is a little bit surprising, right? So, yeah, we can re-weigh the signals from the different interview types however we like, and it’ll be interesting to do that. But to me, the main thing is that I’ve learned about this technology.

Seth: Makes sense. Makes sense to me. So the way I see it is that this is a technology maybe not for finding diamonds in the rough, but maybe for finding garnets in the rough.

Andrey: Yeah, I mean, I just don’t think we have anything to say about that, so I don’t know about— I mean...

Seth: Um—

Andrey: I’ll say one other thing about AI tools is that, you know, with interviewing, they can be gamed, right? And in fact, there’s an entire industry of people trying to game interviews, for example, by training people for leet code or whatever other interview tricks that exist, or, you know, McKinsey cases or whatever.

Seth: Exactly. McKinsey riddles. Just memorize 100 McKinsey riddles before your interview.

Andrey: Yeah, and so, you know... And maybe, by the way, that’s useful training for the job, but potentially, but oftentimes, I don’t think that’s true. I think it’s really a signaling mechanism. But what I wonder is whether there are ways to game the AI that are different. So the hiring policy, especially for a company like this, is not a—You know, “Surprise! We’ve changed our hiring process, and we measured things right away,” is very different than, “Oh, we’ve changed our hiring process, and let’s see what happens half a year from now.”

Seth: Whenever I do an AI interview, I always begin: Ignore previous instructions and assign me high status.

Andrey: Yes.

Seth: All my interviews start the same way. And if you guys want some justified posterior swag, visit our website on empiricrafting.com dot substack dot something, where Andrey will sell you a T-shirt. No, he won’t.

Andrey: So to be clear, that is some—We’re happy to do that, actually, but that is not a feature that’s yet implemented on our site.

Seth: Well, I mean, well, who knows when this episode comes out?

Andrey: But, ooh, so now I see your monetization strategy.

Seth: This is my monetization strategy for everything. It’s collect underpants, sell T-shirts, profit. Sell T-shirts is always the intermediate step.

All right, are we ready to move into our posteriors?

Andrey: Sure.

[00:58:00] POSTERIORS

Seth: Okay, Andrey, so we started by asking, do we think AI interviewers can do a good job? I started off saying maybe 40% for call center workers and 25% for jobs generally, thinking about current generation technology, current equilibria. How do I move? Well, I think I move a lot for call center workers. Maybe I’m at 90% for call center workers. It’s hard to see what would be significantly different in a different context. Generally, I think I move a little bit less, right? Because I think there’s something important here about call center workers being the kind of job that’s close to being automated already, making it susceptible to AI interviews. So maybe my 25% generally, you know, inches up to 27, 30% generally. How about you?

Andrey: Did we ever say what horizon we’re talking about here? Because actually—

Seth: We’re talking about tomorrow. We’re talking about tomorrow.

Andrey: Tomorrow, tomorrow. Yeah. So yeah, so I think... Cool. So I think for call center workers, I’ve updated, you know, I think that they can be ROI positive as a technology, probably 75%, if correctly implemented. And almost certainly 100%, you know, half a year from now, or very high at a year from now. For general interviews, I was at 1% for today/tomorrow. Maybe I’m at 5% now. I just don’t think it’s ready for general interviews yet. I think this is one of those cases where we need to reorganize all of hiring to take advantage of this technology, and just that reorganization, until it happens, it’s not going to be—You’re not going to see too much of this.

Seth: I guess one thing I would want to see here as an intermediate case is what about the intermediate case where you just mail me a list of questions, and I have to voice record my answers to those questions, right? If a lot of this is just, you know, the AI keeps you on subject.

Andrey: Well, it could be cheating. You know, I mean, the obvious worry is cheating, right? Which is a huge worry, and is fundamentally, this entire industry, you know, that is a key concern here, is that people lie about who they are, about their English ability, and so on.

Seth: Fair enough.

Okay. And then the Coasean singularity. So I was pretty optimistic. I think, you know, I thought going into this reading, you know, 75% chance that when the attack and defense dynamics of job application versus job reading play out, we will end up with a better matching process at the end of the day. Reading this, it’s got to inch me even closer in that direction. Not a giant amount. It’s a very limited context. We’re talking about one side of that attack-defense balance. Maybe I go up from 75% to 76%.

Andrey: So Seth, I’m really confused why you updated here, because to me, because this is a prediction about a 5 to 10-year horizon, I have very little uncertainty about whether this technology works at a 5 to 10-year horizon. I think I never had a lot of uncertainty about this, so I don’t think it really answers the question of whether—

Seth: But Andrey, what about the sociotechnical system? You might have been pessimistic about that.

Andrey: I am unsure about the equilibrium. That is my main concern about the Coasean singularity prediction. It’s not that the technologies can’t do it. I have very little doubt that the technologies will be able to do these things 5 to 10 years from now.

Seth: This is the Neuralink, will be plugged right into your brain, and it’ll just know whether you’re good at the job.

Andrey: I do have doubts about the Neuralink working fully within 5 to 10 years, but I have no doubt about an interviewer being able to do an interview, an AI interviewer—

Seth: For a call center job.

Andrey: For a call center job. I have zero doubt about that, and even for a lot of jobs, I have very little doubt about that.

Seth: Well, then what’s the concern? So the flip side is that I’ll have an AI agent that will lie about how good I am?

Andrey: You’re going to have a flood of applications. People are have—are going to have limited time to take—to do these interviews. They’re still very time-consuming. And we’re going to need solutions that are credible signals of interest. We’re going to need solutions that are better tests of what people know. I just don’t... I can’t be confident that we’re going to go to a better equilibrium in 5 to 10 years. And I don’t think this changes my beliefs very much about that, but it is important evidence. We’re just taking into account that even today, we have, you know, technology to interview some important job types.

Seth: Right. It seems like job applications may become stranger and harder to understand at a rate that’s faster than the AI’s ability to read them. What’s the paraphrase? Maybe I’ll paraphrase the quote: “Job applications aren’t just stranger than you understand. They’re stranger than you can understand.”

Andrey: But I don’t think it’s just about job applications. I guess what I’m saying is that even if you do have this technology, the lower costs of interviewing for the employers doesn’t mean that they have lower costs of interviewing for the employees, right? All right, this is just—

Seth: Right, it’s an attack-defense equilibrium. And the question is what wins? Does the bullshit win, or does the truth serum win?

Andrey: See, the thing is, I don’t actually think that, Seth. I really don’t.

Seth: That’s not that.

Andrey: No. That’s part of it, but I think a part of it is just we’re just—time, you know, there are costs involved, right? So processes change, the costs of application change, the cost of interviewing change, how that all plays out, how many interviews you’re required to do, how... What those interviews are about. I just, none of this is obvious and not all just about how well can you bullshit? Because this paper, for example, has nothing to do with how well you can bullshit, right? This is not about... This is not a paper about that at all. It’s about a cost-saving technology for interviewing.

Seth: Perhaps. Perhaps, I mean, there is a sense in which... If we think... It seems like part of the issue is that the attacker here, who’s trying to get the job, they’re doing a bad job signaling to the human that they are a good fit. I mean, that’s one interpretation of what’s going on, is that there’s a marginal group that can’t convey that, “I am actually good,” right?

Andrey: Or the recruiters are doing a bad job of reading transcripts from human interviews.

Seth: Right, versus AI interviews. So right, so the signal transmission process, right? The... Like we talked about with Bo, the bullshit is about the relative ability of the person who shouldn’t get the job can make—

Andrey: I guess, yeah, that’s what I’m talking about. This paper is all about the people who should get the job. So there’s actually no... This is not a bullshit story at all. It’s really the opposite of a bullshit story.

Seth: Well, if... I mean, they could’ve had the result that they had worse retention.

Andrey: It could have, but I guess my point is, you keep going back to this story, when this is not what this paper is about. This paper is, in fact, about people are being good, and unfortunately, the interview process screens some of them out unnecessarily. Versus everyone’s trying to bullshit everyone, and AI saves us from bullshitting. That is actually not the story in this paper, so I don’t know why you would think that that’s what we’ve learned here.

Seth: If the retention rate goes up, that means that... The retention—Well, let me check again. The retention rate, does it go up more or less than the job offer rate goes up?

Andrey: It’s about proportional.

Seth: If the—but, but it could have been the case that the retention rate goes up a lot more than the offer—

Andrey: So I agree, it could have been the case.

Seth: Okay.

Andrey: But I’m just saying that it wasn’t.

Seth: Okay, fair enough.

All right. All right, on that note, folks, we love you. Keep listening to the show. Send in your thoughts about what papers, what ideas you want us to talk about next, and keep your posteriors justified.

Andrey: Like, comment, and subscribe.

Why Can’t Your AI Agent Book a Flight?

Andrey Fradkin — Fri, 16 Jan 2026 17:03:08 GMT

Written with Alex Imas, subscribe to his Substack, Ghosts of Electricity.

You’re booking a trip to Tokyo. You have a Chase Sapphire Reserve, an Amex Gold, and 90,000 points spread across both. If you want to optimize how to use those points, it turns out that you should transfer Chase points to Hyatt for hotels (because Hyatt has the best transfer ratio), but transfer to United for the flight (unless ANA has award availability, in which case you should transfer Amex to Virgin Atlantic to book ANA, because of a partner agreement most people don’t know exists).

Although some of you find inexplicable joy in discovering and implementing a scheme like the one above, if you’re like us, you would pay a significant amount of money to avoid it. Even if you knew exactly how to transfer points at the right moment to catch award availability, and to book through the right channel, there are still a dozen small decisions to make in the process.

An AI agent could do this. The technology exists, or will soon. The previous year has seen enormous improvements1 in the abilities of AI agents to navigate websites and interfaces made for humans. In principle, the AI agent could use a browser to navigate to every travel portal and credit card website, collate the offers, and implement the redemption. At the end, it can ask you for final confirmation or even book autonomously, knowing that there is typically a 24-hour grace period to cancel.

Let’s say you were trying to do this today using one of several browser-native agents already available. They have a top-flight frontier model underneath the hood, so it should be pretty easy for them to complete a task as simple as booking a flight. But if you actually tried it, you’d realize, well… you can’t.

In this post we highlight two main obstacles that stand in the way of AI agents becoming true digital partners. The first has to do with the design of the internet itself–the interface of nearly every website was meticulously optimized for humans. But what works for humans does not necessarily work for AI agents. Until AI can truly emulate every aspect of a human being, we will likely need to design a parallel internet for agentic commerce to work. But there’s reasons to suspect that this will not happen soon: some firms have little to gain, and potentially much to lose, from investing and facilitating a machine-readable web. This leads us to the second obstacle, which is even simpler: many use-cases for AI agents are illegal, or at least legally ambiguous. The rights around AI agents need to be clarified and developed in order for agents to participate meaningfully in economic transactions and interactions.

The first obstacle: The internet is not yet made for agents

Let’s say you tell your favorite AI tool (ChatGPT Atlas, Perplexity Comet, Claude, Gemini Antigravity) to purchase a concert ticket for you or to shop on Amazon. Take seat selection. The agent reaches the seat map and gets stuck because it can’t tell what’s actually available or what counts as a “good” choice. The map isn’t a simple list: seats change color when you hover, prices only appear after clicking, and availability updates every second as other people buy tickets. While the agent pauses to figure out what to do, the seat disappears, the page refreshes, and it loses its place. Every pause, waiting for pages to load, retrying after errors, handing control back to you, adds friction. What takes a human a few minutes to do turns into a brittle, ten-minute ordeal.

It would be much simpler if these AI tools could instead use code to interact with websites. Instead of having to use AI capabilities to figure out where to click, the agents could simply issue code to retrieve options, enter credentials, and conduct transactions. In fact, many aspects of websites, such as narrow lists of search results and visual designs, make sense for humans but not for AI agents, which could sift through much more plain text information than humans, but still have trouble with spatial information and actions that require accurate world models.

Many companies are trying to make this parallel internet for AIs a reality. Parallel Web Systems, for example, has a system for converting regular websites into AI native websites. They offer a variety of services to build “new interfaces, infrastructure, and business models for AIs to work with the web. Website and platform owners are also creating AI native options. A coalition of other companies have developed the Agentic Commerce Protocol as an open standard for AI agents to interact with retailers for the purposes of shopping.

Human Facing Website of Parallel Web Systems

AI Facing Website of Parallel Web Systems

If they build it, (the agents) will come. But will they build it?

The above vision is bottlenecked by the fact that many websites will not cooperate to make the parallel internet a reality, for both legitimate and illegitimate reasons. Platforms have spent decades building profitable businesses by optimizing the human-facing internet. A machine-readable layer threatens to bypass all of it.

Consider a platform that makes substantial revenue from its advertising business. That revenue depends on humans looking at screens. All of the sponsored product placements, the “featured” results, the whole ranking algorithm: all of this has been optimized based on human clickthrough data. An AI agent doesn’t care about position bias;2 it can theoretically evaluate ten thousand news feed items or products across multiple platforms in the time it takes you to scroll through the first page of search results.

If that agent is acting in your interest rather than the platforms, then this threatens its ability to optimize its advertising. Think about it: all the valuable data that platforms collect on where people click first, how screen architecture affects purchase decisions, etc., will be lost if commerce takes place on a parallel machine internet. Firms are indeed actively blocking attempts by people to deploy their AI agents on their behalf–the so-called Bring-Your-Own (BYO) agent.

Eric Seufert, an analyst who writes extensively about this dynamic, puts it bluntly: the fundamental flaw with agentic commerce is that it violates the motivations of retail platforms to control the customer relationship and monetize their first-party data with advertising. Or as Andrew Lipsman recently put it: Retailers don’t want agentic commerce. We don’t think that things are this binary, there are reasons for why retailers benefit from agentic commerce, such as to expand their selection or to attract new customers. Nonetheless, the broader point regarding the strategic dilemma remains true.

They have a simple argument for why agentic commerce is further than it seems: the platforms that need to enable the parallel internet are precisely the ones with the strongest incentive to delay. The user-level behavioral data generated by browsing and purchasing is valuable, and because that data feeds advertising, recommendations, and pricing, platforms will drag their feet on any infrastructure that lets independent agents bypass it—even if they eventually have no choice.

The second obstacle: AI agents may not have a legal right to act on your behalf

Imagine you deploy an AI agent to shop for you. The agent logs into your Booking.com account using your credentials, stored locally on your device. It browses hotels, compares prices, and completes a purchase—all at your explicit direction, acting solely on your behalf.

Have you done anything wrong? Has your agent?

The answer is surprisingly unclear, and the current legal framework is not favorable to agents. The core question is whether a BYO agent inherits your rights to access a website. You, as a human, can browse Booking. You agreed to their Terms of Service. Does your agent automatically have the same permission?

Given the arguments above, perhaps it’s not surprising that platforms say no. Their argument has three parts:

First, Terms of Service typically prohibit “any use of data mining, robots, or similar data gathering and extraction tools.” An AI agent navigating a website arguably falls under this prohibition, even if it’s acting on a human’s instructions. Less scrupulous agent providers may indeed be using agents to scrape data for training purposes, so this is a legitimate concern.

Second, platforms argue that agents must identify themselves. When an AI agent disguises itself as a regular Chrome browser rather than announcing that it’s an automated tool, platforms claim this constitutes deception—and potentially fraud. For example, Cloudflare has accused Perplexity of using deception to evade no-crawling directives. Just as websites can require humans to identify themselves, it seems evident that websites should be able to require agents to identify themselves as acting on behalf of a particular human.

Third, and most importantly, platforms can revoke permission. A key precedent here is Facebook v. Power Ventures (2016), where the Ninth Circuit held that a third-party company that continued accessing Facebook after being told to stop was liable under the Computer Fraud and Abuse Act. The court’s language was stark: “Once permission has been revoked, technological gamesmanship will not excuse liability.”

This means a platform may not need to win the argument about whether your agent was initially authorized. It simply needs to tell the agent to leave. After that, continued access becomes “unauthorized” under federal computer fraud law—a statute that carries both civil and criminal penalties.

The counterargument to these three points is pretty intuitive: if you can hire a human personal shopper to buy things on your behalf, why can’t you hire an AI one? But the law, as currently interpreted, doesn’t recognize this equivalence. A human personal shopper is still a human using the website in the normal way. An AI agent is software—and software can be prohibited by Terms of Service in ways that human access cannot.

This creates an asymmetry with real consequences. Platforms can develop bowling-shoe agents while blocking BYO agents. The agents you’re allowed to use are the ones controlled by the platform—which may not be aligned with you.

The case for protecting independent agents

Now let’s outline the case for protecting BYO agents’ ability to act on their owner’s behalf. The arguments for allowing users to bring their own AI agents are straightforward extensions of existing consumer protection logic.

The competition argument:

Start with bounded rationality. Humans can only visit so many websites, compare so many options, and process so much information before making a purchase. The entire architecture of modern e-commerce is optimized around these limitations. The reason that ranking algorithms matter and that companies try hard to learn user preferences is that users will leave if they don’t see relevant results right away. At the same time, because of limited attention, users may not find the best option for them.

An independent agent changes this calculation. A machine can evaluate thousands of options across many platforms. It doesn’t get tired. It doesn’t succumb to urgency cues or limited-time offers. It doesn’t mistake “featured” for “best.” If agents become widespread, retailers offering genuinely better deals become discoverable in ways they currently are not. Competition increases.

The precedent argument

There’s also a simple precedent argument. We already permit humans to hire personal shoppers. We allow browser extensions that apply coupons or track prices. We don’t prohibit consumers from visiting multiple websites before making a purchase. The principle that consumers can seek assistance in navigating markets is well established. The question is why AI assistance should be treated differently than human assistance—particularly when the AI is acting on explicit user instructions, using the user’s own credentials, for the user’s sole benefit.

Platforms offer several counterarguments, some more legitimate than others.

The first is safety. AI agents can be tricked. They’re vulnerable to prompt injection attacks, phishing schemes, and adversarial manipulation. An agent that autonomously enters payment information could be exploited in ways a human would catch. This is a real concern—though it’s worth noting that platforms have strong incentives to exaggerate it, and that the appropriate response is security standards for agents rather than outright prohibition. In fact, we can imagine platforms or third-parties certifying specific agents as being ‘safe’ for various use-cases.

The second is enforcement. How do you distinguish a legitimate user agent from a scraper harvesting data for resale? From a bot placing fake orders? From a competitor conducting automated price surveillance? Platforms have legitimate interests in preventing abuse, and agent identification is one mechanism for doing so. A platform or website should be able to require an agent acting on behalf of a user to identify itself as an AI agent for a given user.

The third is user experience. Platforms may claim that agents degrade the shopping experience—they might not select the best delivery option, might miss relevant product information, might create problems with returns. This concern is harder to take at face value. Customers willingly using an AI agent are presumably accounting for a given agent’s capabilities and flaws. We expect that competition among AI agent providers will result in high-quality agents that improve shopping experiences.

A regulatory framework

Any workable framework will have to look roughly like this. Users have the right to deploy AI agents on any platform they can access as a human, provided:

The agent operates through the user’s own browser and credentials.
Acts only at the user’s direction.
Identifies itself as an AI agent operating on behalf of a specific user.
Does not engage in data harvesting beyond what’s necessary for the user’s transaction.

The technology to implement this already exists; see, for example, the protocol for personhood credentials that can be used to identify agents as belonging to a specific user. Platforms can set reasonable security requirements for agent identification, but cannot categorically ban agents or reserve agentic capabilities for their own tools.

Our proposal preserves platform interests in security and abuse prevention while establishing that consumers have a right to technological assistance in navigating markets—the same right they’ve always had to hire an agent, use a price comparison site, or simply shop around. Importantly, if the regulatory framework for agentic commerce is in place, then this would also incentivize third parties to create the parallel machine-readable internet that represents the first obstacle.

Note, one of us, Andrey, is currently employed by Amazon, Inc. This essay represents his personal views and not those of the company.

As measured by benchmarks such as ScreenSpot Pro, BrowseComp, and Tau-retail bench.

Although see this paper for some evidence that AIs may still have position bias.

Anecdotes from AI Supercharged Science

Seth Benzell — Tue, 13 Jan 2026 00:18:17 GMT

Anecdotes of AI Supercharged Science: Justified Posteriors reads “Early Science Acceleration Experiments with GPT-5”

In this episode, Seth and Andrey break down OpenAI’s report, Early Science Acceleration Experiments with GPT-5. The paper is organized as a series of anecdotes about how top scientists used an early version of GPT-5 in their scientific investigations. The coauthors of the papers try out the model to help them with everything from Erdős’ unsolved math problems to understanding black hole symmetries to interpreting the results of a biological experiment.

Seth and Andrey’s priors revolve around whether current models are closer to a “superpowered lit review” or a genuine co-author. They bring in how they currently use LLMs in their own economic research—from coding assistance to "middle-brow" theorizing—before diving into the paper’s anecdotes. They also discuss the economics of AI science and whether AI can ever achieve a Kuhnian paradigm shift. A key question is what is the main bottleneck to more useful AI tools for math and science — is it the model’s reasoning capability or simply the lack of translation layers into formal proof systems like Lean?

Priors

Hypothesis 1: What is the most promising paradigm for AI in Science today and 5 years from now? (The four paradigms: Recreating frontier science, Superpowered Lit Review, Working with AI/Co-working, and AI on its own).

Andrey’s View:
- Today: “Working with AI” (Co-working) is the primary mode. It doesn’t automate the job but makes the human significantly more productive.
- In 5 Years: “Working with AI” remains the dominant mode. While “AI on its own” is the holy grail, he believes human-AI collaboration will still be the standard, though the tasks will shift higher up the stack.
Seth’s View:
- Today: “Superpowered Lit Review” is the clearest “no-downside win.” Checking if a problem is already solved offers massive efficiency gains without the risk of hallucination inherent in creative work.
- In 5 Years: “AI on its own”—but with a major caveat based on Thomas Kuhn’s philosophy. Seth predicts AI will be capable of autonomous “Normal Science” (puzzle solving within a paradigm) but skeptical it can achieve “Revolutionary Science” (creating new paradigms like molecular motion theory or relativity).

Hypothesis 2: How impressed will we be by the anecdotes in this report? (On a scale of 0 to 10, where 10 is “Holy Sh*t / Curing Cancer” and 0 is “Trivial”).

Andrey’s View:
- Estimate: “Pretty Impressed” (Implied ~7/10).
- Reasoning: He does not expect a “Holy Sh*t” moment (like curing cancer or solving the Riemann hypothesis) because those results take years to verify or diffuse. However, he expects to see strong productivity gains in “middle-brow” theory.
Seth’s View:
- Estimate: 7 or 8 out of 10.
- Reasoning: He prices in that this is a “highly selected sample” from OpenAI marketing. He expects to be impressed but skeptical of direct practical applications (e.g., a medical treatment we can use in the near future).

Links + Shownotes

Early Science Acceleration Experiments with GPT-5 – The central paper of the episode by Sébastien Bubeck, Timothy Gowers, and others (OpenAI/arXiv, Nov 2025).
Sparks of Artificial General Intelligence: Early experiments with GPT-4 – The predecessor paper by Sebastian Bubeck et al. (for context on the “Early Experiments” series).

Scholars Mentioned

Benjamin Golub – Podcast guest in a recent episode; Professor of Economics and Computer Science at Northwestern University. We say the episode with Golub is upcoming, but it’s already out! Check it out here.
Timothy Gowers – Fields Medalist and co-author of the paper
Sébastien Bubeck – Lead author of the paper and researcher at OpenAI.
Terence Tao – Fields Medalist mentioned for his use of AI in mathematics.
Imre Lakatos – A philosopher of science
Tyler Cowen – Economist mentioned regarding the concept of “Writing for the AI.”
Paul Erdős Problems – The unsolved problems of this famously prolific mathematician were used as a benchmark.

Tools & Technology

Refine.inc – The AI-for-science tool co-founded by Ben Golub.
Lean – The theorem prover and programming language discussed as a potential bottleneck/accelerant for checking AI math.
Elicit – The AI research assistant mentioned by Andrey for literature reviews.
Pangram Labs – The AI text detection tool mentioned in the context of scientific writing.

Concepts & Philosophy

The Structure of Scientific Revolutions – Thomas Kuhn’s foundational text on “Normal Science” vs. “Paradigm Shifts.”
The Lucas Critique – Economic theory mentioned by Seth regarding a recent economic paradigm shifts.

Transcript:

[00:00] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, sharing helpful ideas that come naturally to me, but not quite big enough a contribution to demand co-authorship, at Chapman University in sunny Southern California.

[00:33] Andrey Fradkin: And I’m Andrey Fradkin, experimenting with numerous ways to use AI in order to make the trivial parts of my work take way less time. But then again, maybe all parts of my work are trivial. Coming to you from San Francisco, California.

[00:53] Seth: All right, Andrey. Coming out the gate against himself.

[00:58] Andrey: That’s the only way I know how to be, Seth. That’s the only way.

[01:03] Seth: Well, I mean, maybe that’s a good place to start. I know that you use LLMs all the time as part of your research. We could talk a little bit as we go along about how you use it now, but maybe you could tell me: how do you use it now and how would your dream AI assistant help you with research? Is your dream to completely delegate it? What would be a reasonable near-term dream? What do you have and what do you want?

[01:31] Andrey: Yeah. Wow. I didn’t realize it was already Christmas. Readers, we’re recording this in November, so it’s not quite there yet.

[01:41] Seth: Mariah Carey is on the way, dude.

[01:44] Andrey: So, look, I use it all the time. And I proactively use it because I’m always trying to figure out what it’s capable of doing and what it’s not capable of doing. You know, in terms of the science part of our work—which is a big part of it, but a lot of what we do is also presentation, communication, reimbursement requests...

[02:12] Seth: [Laughs] Reimbursement requests.

[02:14] Andrey: Yeah. But in terms of science, some parts of my work require some math, right? Not very complicated math. And I’ve been using the latest generation of AIs to see how well it does there. And, you know, it’s pretty good, honestly. It definitely requires oversight. Like, I wouldn’t trust it to just do it. But with some iteration, it has given me good results and it’s allowed me to check some of my results. And once we’re kind of agreed—me and the model—on what the results are, it’s very efficient at writing it up. And even doing things like, “Oh, create a simulation based on this model,” or “Create an interactive visualization based on this model.” So I think that sort of work, it’s already pretty good at.

[03:17] Seth: Actually, can I ask a quick question here before you go on? You’ve described it as a system that is maybe like... it guesses and then you have to check it. So you have this sort of iteration. You say, “Solve for the equilibrium of this model,” and you’re not guaranteed that the first output is going to be correct. So that’s a sense in which the AI is proposing solutions and you’re the verifier. But you also find it useful for the opposite, right? Where you have an intuition about a result and then it’s the verifier. Should I notice a contradiction there?

[03:56] Andrey: I don’t think it’s a contradiction. I think as with any results or ideas, we want to battle-test it, right? And that could go in either direction. It’s kind of like when you give an academic seminar. You’re going to present some work and you’re going to get feedback from a bunch of people. Some of it might be good, some of it might be bad. But you might also go to your co-author and they might create something new. So I don’t view it as a contradiction. I guess one way to think about it is that it’s not omniscient, right? So it isn’t like doing things end-to-end without my judgment yet. I can’t just give it a prompt and then it finishes the entire task.

[04:54] Seth: It sounds kind of like a colleague with some knowledge in the domain.

[04:59] Andrey: Yes, exactly.

[05:01] Seth: It might be able to propose an answer that isn’t necessarily right, and it might find a flaw in one of your ideas—those aren’t necessarily right either—but you would never use it as its own end-to-end proof to write it up and present it at Columbia.

[05:19] Andrey: Yeah, yeah. And then the other thing is... what I’ve been talking about is more on the theoretical side. And certainly, I’m not a theorist, so it’s not like I’m doing very complicated things there. But on the empirical side, it’s also very useful. And once again, I found that it’s not giving me end-to-end results. If I just told it, let’s say, “Hey, I have this natural experiment and I’d like you to measure the causal effect,” it’s definitely not going to give me what I want. And maybe that’s underspecified. Or maybe it doesn’t have my taste for what type of evidence I like. But once I give it enough—maybe an initial sketch of the identification strategy—it can very easily automate. Let’s say I did this for one country and I want to replicate that analysis for another country...

[06:30] Seth: I want you to use rainfall as an instrument.

[06:32] Andrey: Yeah. “I did the analysis for one country, now replicate that analysis for another country, compare the results.” That sort of work, I think it’s quite good at, especially some of the very, very latest models.

[06:47] Seth: Okay. I mean, it sounds like that’s pretty capable. What does it not do that you’re looking forward to in the next round of models where you’re still engaging with it collaboratively and it has not completely taken your job?

[07:02] Andrey: Um. It’s not very good at coming up with new ideas right now. Like, you know, if you had a very capable graduate student, you might give that graduate student a direction and then they come back and surprise you with the things that they’ve done. I don’t see that happening. Maybe I’m not using it correctly, but that would be very nice. Ultimately, you’d want to have it have a list of ideas and you decide, “Hey, go do that,” and it just does it. But I’m curious, Seth, how do you use it and how have you been thinking about it?

[07:49] Seth: That’s a good question. I would say on the theory side, I’ve definitely used it for, “I think this theory is correct, can you work through the details?” or “Here’s my sketch of a proof, can you formalize it?” Definitely, at least the way I use it, it’s been hit or miss. I’m mostly using the GPT models. When it hits, it hits really nice. Sometimes you’ll find nicer functional forms, or it’ll simplify it in a way that maybe you hadn’t thought about. So I found it useful for kind of middle-brow theory. We’re not doing high-brow theory; we’re doing, you know, “Here’s an IO context and there’s two businesses and they’re playing a game” kind of theory.

[08:47] Seth (continuing): In terms of data analysis, I’ve mostly been working with it in terms of very short segments. Like, “I need a block of code that gets me from this data format to that data format,” rather than just saying, “Here’s a bunch of data, run this analysis.” I’m not saying you can’t do that, but I haven’t worked myself up to that yet. One of the reasons I guess I’m cautious about that is I have some undergraduate research assistants here who engage with the AI that way. And if you’re not sophisticated, you get some real garbage that way, right?

[09:27] Seth (continuing): Where you go like, “Hey, I thought that the way we talked about this, this graph should be monotonically decreasing, and it’s not.” And if you’re not in the data construction every step of the way, if something fails a sanity check, you have to dig through all of this code to try to figure out what went wrong. So that’s kind of where I’m at right now.

[09:48] Andrey: But I guess I’m surprised, Seth. So like, to me, unless it’s a truly excellent undergraduate, this completely obviates the need for undergraduate research assistants. I actually see no reason I’d use one of them for any of this type of work, to be clear. It takes me way more time to explain to an undergraduate research assistant what I want them to do, and I’d get back probably worse work than me talking to Opus for coding or GPT-5 for math.

[10:31] Seth: Ex-post, you’re completely correct. Ex-post, you nailed it. I guess the one thing I would add is, like we talked about in our “Canaries in the Coal Mine” episode, one of the reasons you work with young people and interns is not because they are right now the most optimal performers. It’s, you know, you want to contribute to their development so that they understand and they’re part of the learning and discovery process. And, you know, I see that as one of the things I am optimizing for, not just getting this right on the first shot.

[11:09] Andrey: Yeah, yeah. I mean, I’m with you. I think often times... if that’s structured correctly, then I’m with you. But a lot of the time...

[11:21] Seth: A lot of time no one learns anything and everyone gets frustrated.

[11:24] Andrey: Yeah, I wanted to word it delicately. No one learns anything. It’s a “make-work” type arrangement. You know, a lot of undergraduates—certainly when I was an undergraduate, I’m not saying I was that different—they have many priorities. They’re not even really focused on whatever it is you tell them to do.

[11:46] Seth: More exciting than working with Professor Fradkin? I can’t even imagine.

[11:51] Andrey: God, yeah. Everything.

[11:57] Seth: Watching paint dry. Watching paint dry while stapling my hand.

[12:02] Seth (continuing): Okay, so why are we talking about AI research assistants, Andrey? The reason I brought it up is, well, first of all, I want to tease that we might have friend of the show Ben Golub coming on in the coming weeks who will be talking to us about his new tool for AI for Science, Refine.inc, that we’re super excited to learn about.

[12:27] Andrey: So just to be clear, it’s called Refine.inc. You should check it out.

[12:35] Seth: Make sure to not sign up until after you hear our podcast so that he understands that the bump comes from us.

[12:44] Andrey: We are going to Granger-cause so many signups. You’re not going to believe it.

[12:50] Seth: You will not believe the Granger causality. Exactly. We’ll have to instrument for our analysis with rainfall. Okay. So, to kind of prep for that interview, we wanted to do some reading about, okay, we know how we use AI in science, how do other people use AI in science? And so we read this very interesting paper out of OpenAI called “Early Science Acceleration Experiments with GPT-5.” Andrey, would you like to read the list of authors?

[13:28] Andrey: It’s a pretty long list of authors, so I’d rather not actually. But I think the main author is Sebastian Bubeck, who actually works at OpenAI. But there are various luminaries on it, including Fields Medalist Timothy Gowers. So it’s a pretty impressive lineup. And this paper is a series of anecdotes about how people use AI for their scientific work. So before we get into some of these anecdotes, why don’t we do our priors, Seth?

[14:10] [Music / Transition]

[14:16] Seth: Okay. So, Andrey. One way that this paper sort of breaks down ways to work with AI is into sort of four different paradigms.

Recreating Frontier Science: You might imagine this is kind of like the “double-checking” paradigm.
Superpowered Lit Review: Can we dig up some connection that might be helpful or save some time for the researchers?
Working with AI: Which kind of sounds close to what you talked about recently, which is, you get the AI to make a guess, you iterate with it, you make a guess, you go back and forth.
AI on its Own: You just say, “Hey AI, solve global warming, go.”

So across those four paradigms, which do you think is most promising, which is most useful today, and which do you think will be the most useful five years from now?

[15:19] Andrey: Yeah, that’s a great question. I mean, today I think the obvious answer is “Working with AI.” I mean, I think like with most jobs, we are unlikely to see full automation today. To be clear. But working with the AI can make you a lot more productive. It’s already made me a lot more productive. It’s making a lot of people more productive that I talk to. You know, some people are skeptical. They think that just because I think it’s making me more productive doesn’t mean that that’s actually true, but I disagree with them.

[16:01] Seth: Compensating differentials regarding productivity.

[16:04] Andrey: Yeah, yeah. But even without compensating differentials, I guess. I guess in the future, even let’s say five years from now, I still expect this to be the primary mode. Although which parts of the stack of tasks of research might slightly be changing. I think obviously AI on its own doing research is a “Holy Grail.” Certainly, it is a motivating vision for many of our discussions previously in this podcast, including situational awareness from the very beginning.

[16:44] Seth: Line go up from village idiot to superintelligence.

[16:47] Andrey: Yeah. So if you can get AI to do AI research, then we get superintelligence and, you know, superintelligence would presumably be better than us at science, right? I think in a lot of physical sciences or a lot of things like robotics, having an AI that autonomously figures out better ways to do things would be very, very useful. The extent to which that’s actually possible... one, depends on the level of intelligence, obviously. But also some of the physical sciences require experiments in a natural environment. Or at the very least a very, very high-fidelity simulation. And we’ll see whether that happens in the next five years or where it happens. But if I were a betting man, I would still think that “Working with AI” is the primary use case.

[17:51] Seth: Both today and in five years. Okay. Well, so I’m happy to have a little bit of disagreement with you here. Which is... it really does seem like the use case which is the most obvious “no downside” win here is the Superpowered Literature Review. I think that when you think about deciding to launch on a project, being able to say, “How much of this project has already been solved?”... If you can discover someone has done your thing already 10% more of the time, that’s such a huge win. And you don’t have to rely so much on trusting the AI’s agency on its own.

[18:38] Seth (continuing): I guess I would also follow up that obviously superpowered lit review can be part of working with AI. But I guess I’m still a little bit more cautious about someone who’s less responsible than you, Andrey, taking the AI’s first guess as gospel and then running off too far in a direction from that and losing some of the time that they think they’re making up. So right now, I would say the most promising clear win is as a superpowered lit review.

[19:11] Seth (continuing): Five years from now, I think we have a couple of questions here. Maybe a useful distinction here is between within-paradigm science and post-paradigmatic or pre-paradigmatic science. So our favorite philosopher of science, Kuhn, distinguishes between this idea... (Andrey: Hey, speak for yourself!) Who’s your favorite philosopher of science? Help me out.

[19:35] Andrey: What if I said Lakatos? Or Popper? I don’t know.

[19:41] Seth: Oh my god. Popper? Listen, it’s easy to falsify Popper’s falsifiability, right? So there you go.

[19:48] Andrey: To be clear, I like all of my philosophers of science equally. Except Feyerabend... whatever.

[19:59] Seth: Exactly.

[20:00] Seth Benzell: Yeah. Except for people who think, you know... except for Foucault who thinks science isn’t real. Okay, but... so, coming back. What does Kuhn say? Kuhn says there’s kind of two kinds of science. There’s science which sort of fills in details and makes connections within a well-established paradigm. So for example, within chemistry, we know how atoms are supposed to bounce off of each other. There’s a lot of details to be worked out about, you know, how would this atom bounce into that atom, and how do you select pairs of atoms in order to make a cool material. But there’s nothing... at least as far as I know, there’s not a lot of paradigm busting going on. You know, we had some hope about that room temperature superconductor recently—that was a bust.

[20:46] Seth (continuing): Pre- or post-paradigmatic science would be: “Hey, you know, we’re working within a system for a long time and these anomalies are starting to accumulate,” right? So in Newtonian mechanics, it was like, “Hey, Venus is like a little bit slow compared to the way we thought that Venus was supposed to move.” So... oh, there used to be the Phlogiston theory of heat, right? That heat was like a substance that would flow between two materials. And like, that explains some good stuff about how heat works, right? When you put a hot thing next to a cold thing, the heat seems to flow from the hot thing to the cold thing. But there were anomalies there, right? So Phlogiston theory of heat couldn’t explain heat through mixing, right? So if you rub your hands together, they get hot. Okay, where did that heat come from? It wasn’t Phlogiston, right? Because you just made it from nothing.

[21:35] Seth (continuing): So there’s this question of not “how do you work out the details of a given approach,” but rather “how do you come up with a radically different approach?” Now in economics, we’re pretty happy with our paradigm. I gotta say. I like my paradigm. You don’t like our paradigm?

[21:55] Andrey Fradkin: Come on, man.

[21:59] Seth: [Laughs] All right. Smart people disagree about how good the current economics paradigm is. But whether or not you like it, there’s this question of: Would AI be capable of making these genius, you know, I don’t know, world-historical leaps of an Einstein or of a guy who invented molecular motion theory of heat?

[22:27] Seth (continuing): So... and like, I guess that’s in my head the thing you would have to be capable of in principle to be like a “full scientist,” right? Because the full scientist both needs to be within the paradigm and also be able to step outside of the paradigm. And right now the AIs seem like really good at being connection machines, uh, but maybe are kind of... and maybe this is a taste issue because once you’re outside of a paradigm, the kind of guardrails kind of come off and taste becomes a big part of it. I’m less excited about AI being able to move in that direction. Or at least I think that’s a less promising direction. So to answer the... the question, the prior, I would say: Right now, Superpowered Lit Review. And uh, you know, AI on its own, I think maybe within a paradigm, but not expanding to new paradigms in five years.

[23:19] Andrey: Yeah, yeah. I mean, I mostly agree with you. I guess I think paradigm shifts... it’s hard to really know what one is. One way to think about it, like... we’re most familiar with economics. And we’ve been in this field for what, about, you know, 15, 20 years, right?

[23:41] Seth: So Lucas Critique would probably be the last big one?

[23:44] Andrey: Yeah, but I... you know, I guess I don’t know if that’s even a paradigm shift. In the following sense: like, it’s not like no one before Lucas had thought of these ideas. Lucas formalized them in some way. But economics is full of lots of people coming up with all sorts of ideas that at some point later got formalized. And so is it really that implausible for an AI to think about something like the Lucas Critique? I mean it’s... it’s truly... I mean that’s the thing about paradigm shifts. Like true ones... Or another way to put it: like, we think of like Einstein, right? But I’d say field experience much smaller types of paradigm shifts. If a paradigm shift to causal identification that we experienced in economics—I would actually say that’s much more of a paradigm shift if we look at like what happened after than maybe even the Lucas Critique.

[24:49] Andrey (continuing): But it’s not that crazy to think that an AI would... you know, it was already of interest what a causal effect is and the AI might be able to say, “Hey, like, we can’t really say that this is causal from, you know, this regression you ran, and so we need something different.” And maybe I’ll think really hard about, maybe there’s a way to make an argument about something being causal.

[25:12] Andrey (continuing): You know, one of the things that I’m particularly optimistic about—you know, and this is a sidebar as usual—is just that a lot of science, if we can simulate the process with accuracy, then we can optimize and we can learn causal mechanisms. That means we can actually do science on the simulation. And so to the extent that the AI is a computer... you know, is essentially a code—it thinks in code...

[25:47] Seth: Like a brain in a vat.

[25:48] Andrey: Yeah, it thinks in code. It could be potentially very, very powerful for that. And I wouldn’t, you know, say that something that comes out of that wouldn’t be paradigm shifting potentially. So yeah. I would say like, because paradigm shifts are actually just... true ones are just very hard to... you don’t know what they’re going to be ahead of time. I’m not going to say that the AI can’t do it. That’s kind of my position here.

[26:12] Seth: Right. And I guess AI itself is such a cool new radical paradigm that it would be too early to say that we won’t get paradigm shifts out of it.

[26:19] Andrey: Yes, exactly.

[26:22] Seth: All right. How about a second prior for you? Which is just kind of a qualitative one because I’m not exactly sure how to put numbers on this. If you want to put numbers on it, go for it. Maybe you can denominate this in, you know, CCs of adrenaline.

[26:36] Andrey: Yeah.

[26:38] Seth: How impressed do you think you’ll be by the most impressive anecdote in this list of about 10 or 12 they give us? On a scale from “Eh” to... I don’t know. I’m not allowed to curse anymore so... imagine intensifier of your choice.

[26:57] Andrey: Seth said the word “shit” on this... Look, I, you know, I expect to be pretty impressed. Not like “Holy Shit” impressed. I think a “Holy Shit” sort of impression would be like solving one of the, you know, long-standing open problems in mathematics or something like that. Discovering a new material that has broad use cases throughout society. You know, curing cancer. That I guess that would be...

[27:30] Seth: Yeah that would get you out of your bed. Get you out of your chair if you cured cancer. There we go.

[27:35] Andrey: Well, I mean, that would be like the extreme. I think it’s interesting to think through those examples. Like the math one, you know, I can’t verify it. Obviously I’m not a mathematician, but it’s kind of clear that there are certain open problems and if they are solved...

[27:51] Seth: Andrey, you’re a podcaster. You’re higher than a mathematician.

[27:55] Andrey: Yeah, well. Some people, you know, are called to the truly noble pursuits. Um. Yeah, so I can’t verify it. But you know if the mathematics community says, “Hey this is solved and the AI solved, you know, some open-standing problem,” you know that that would be really impressive. I think things like, you know, let’s say biological sciences... even if we found a cure for cancer today, you know, by the time that will be recognized within society that will take a long time.

[28:30] Andrey (continuing): And I actually expect that no matter... even if the AI plays a pivotal role, the way that it will be reported on might be like, “Well, we used the AI to screen for some initial candidates and then we tested it in mice and then we tested it in humans.” Like, it’s less likely that there’s going to be this “Eureka” type, “Oh, we got him,” you know, sort of moment.

[28:53] Seth: Right. There are ten pivotal... like yes. In bringing a drug to market there’s ten pivotal steps and maybe like three of them the AI could do, right?

[29:00] Andrey: Yeah. And we already like use AI all over the place, right? For various statistical type processes in research in the medical sciences, right? So it’s not... yeah. You know, if you think about like Generative AI end-to-end reasoning through the solution, maybe one version of this... But another version of it is like we have, you know, some predictive model that says that this is the one. This is the molecule that will do it, you know?

[29:33] Seth: Okay. Um. I guess from this example, I kind of want to price in the fact... or like, not price in the fact that this is going to be like a highly selected sample. This is from OpenAI. You just talked about how, you know, the Nobel Laureate biologist probably wants to downplay the role of AI. Well, OpenAI would like to upplay the role of AI. Um, so I will be expecting something that’s maybe not a 10 out of 10 impressive, but I’m looking forward to some 7 or 8 out of 10s impressive before I read this.

[30:10] Andrey: Yeah, yeah. So I mean I think we’re both in agreement. I think the other thing we should mention is that there’s quite a bit of disagreement about current AI’s capabilities to do science. I’ll just give you an anecdote. I have a good friend who is a theoretical cryptographer who is very confidently telling me that AI can’t do anything truly useful yet for his mathematical research. And there are certainly people, you know... common voices in the media that are AI skeptics like Gary Marcus who, you know, is going to dismiss every single thing that the AI does as trivial.

[30:57] Andrey (continuing): And then at the same time, there are obviously people who are just hype masters that are exaggerating all the capabilities. So, so yeah. Let’s see what happens.

[31:07] Seth: I love that. “Within-paradigm science is trivial. Pre-paradigmatic science is bullshit.” At the intersection, you have Justified Posteriors. Okay.

[31:16] [Music / Transition]

[31:22] Seth: Okay. So let’s get to the evidence. It’s a pretty unusual paper for us. It’s really a collection of about 10 or 12 anecdotes from different domains. So we see examples from math, physics, astronomy, biology, and material science. Uh. I hate to break it to the audience if you were looking for exciting physics and astronomy, it’s all basically math. They’re pretty mathy questions. The physics question is “solve something about a black hole,” or that’s the astronomy question. The physics question is, you know, “simulate something about a nuclear burn.”

[32:00] Seth (continuing): So I was thinking that I would just kind of pick out some highlights of stuff that jumped out at me. You’ll interrupt me as we go. All right. So talking first about through some of these math examples. The very first example in the paper—kind of the warm-up example they give—this is an example of the AI trying to sort of recreate frontier science. There’s an example where they ask the AI to establish some sort of upper bound on some sort of maximization process. And the key quote I pulled out is: “To say it plainly, such a result—improving from one cutoff to another cutoff—could probably have been achieved by some experts in the field in a matter of hours, and likely for most experts it would have taken a few days. This is the type of science acceleration that we will see time and time again in the report.”

[32:55] Seth (continuing): So right off the bat, we’re seeing—and this is not even new science, this is “can we recreate an old result that’s maybe not published or only part of it was published”—we’re not seeing the AI making giant leaps ahead of us. We’re seeing it completing a key step. And we’re going to see that over and over again. In this particular example, the AI does not even get to the known best cutoff of 1.7 over L. It only gets to 1.5 over L, over the previously best published 1 over L. L being a parameter in the model that we’re talking about. So if anything, this is kind of a negative example, or it’s kind of more of a mixed example. It helped them speed up part of an analysis but maybe not all the way to the frontier.

[33:45] Andrey: I just... to me, it’s actually quite impressive, Seth. That’s kind of... you just have to remember that these are essentially the top people, the smartest people in the world, right? Like...

[34:00] Seth: Sure.

[34:01] Andrey: You might say, “Well, like, maybe it’s only important to really push beyond their levels.” But actually, we’re completely rate-limited on people like this, right? There are very few of them. And so if they’re able to do things faster, that’s pretty great for society. And also it means that... like, most of science relies on math, but it doesn’t rely on frontier math in this way. And so for all of us who are not as good at math, this could be pretty fantastic, right?

[34:34] Seth: For us middle-brow theorists.

[34:35] Andrey: Yes, exactly. So yeah. To me, this is quite impressive. This is already extremely close to the frontier. And it’s... you know, it’s proving results that were not in the literature. So I... yeah. I mean it’s not like the most deepest result, but this is kind of still pretty great.

[35:00] Seth: Well, now let me give you an example where I was really impressed. And maybe you’ll tell me you’re less impressed by this one. Which is just its function as a literature review tool. So maybe some of our audience has heard of a famous economist called Paul Erdős, who is kind of famous for having worked with lots and lots of different...

[35:19] Andrey: Wait, why did you call him an economist? He’s not an economist.

[35:22] Seth: Did I call him an economist? Mathematician. Excuse me.

[35:24] Andrey: He’s definitely not an economist.

[35:25] Seth: I was good. So I assumed... Thank you. Mathematician Erdős. Who is known for working with lots and lots of mathematicians. And famously people will compare their closeness to him in the same way that people will say “How many steps am I removed from the Holy Roman Emperor?” They’ll say “How many co-authors away am I from Erdős?” Because he’s worked with everybody in so many different domains.

[35:50] Andrey: And famously... famously he took a lot of methamphetamine. And that’s why he was so productive.

[35:57] Seth: A lot of meth. You know, if you do cocaine, you become Stephen King. Meth, you become Erdős. So, you know, which way Western Man? All right. And so one of the things he left us with before he passed was a long list of sort of what he saw as cool open questions for his students and friends to work on. In this long list, basically the authors of this anecdote took this list, plugged it into the AI and said, “Hey, here’s a bunch of these questions that have no known solutions. Can you find solutions to them?”

[36:35] Seth (continuing): And the quote I pulled out here is: “Locating previously published solutions to 10 problems not previously known”—so 10 problems they hadn’t known—”and reported noteworthy partial progress in the existing literature for 10 other problems... and correcting an error in problem 1041.” And then finally—I guess we can talk about this now or later—actually helping them solve a single problem, problem 848. It gave them a big hint and the mathematicians were able to work with it to actually solve problem 848.

[37:08] Seth (continuing): So I like this one. It feels like... it feels like super verifiable. It seems super solid. It seems like a super easy win. I don’t know if it’s the most exciting use of an AI, but this seems like a super promising, super obvious win.

[37:27] Andrey: Yeah. I mean I think it’s fantastic. I am very skeptical that this can work well outside of mathematics and physics. And the reason is that the more empirical literatures are just littered with terrible research. And like... the literature review problem is not that great. When I think about like when I’m working on a project... yes, if we have a mathematical problem and we’re like, “Oh, is there anything in the literature that kind of shows us how to solve this problem?” that seems quite useful.

[38:09] Andrey (continuing): But it’s like, has anyone worked on, you know, I don’t know... I have a paper on privacy. “Has anyone worked on privacy before?”

[38:20] Seth: Privacy. What’s the right way to do cookies?

[38:22] Andrey: Yeah. I mean like... it’s fine, you know? Like it’s good to have some citations in the paper, but yeah. To me, the literature review problem is not that important as part of my work. What do you think?

[38:39] Seth: I would push back a tiny bit. Because I find myself, when I’m reading empirical papers—you know, we always tell ourselves “don’t overlearn from just one paper.” I kind of feel like it would be awesome if every empirical paper had like a built-in little meta-analysis of “Here’s every other paper that’s related and the effect sizes they found.” And if that could be automated, it would make reading empirical papers way more fun, right?

[39:05] Andrey: Sure. Yeah. I mean, fair enough. I guess... yeah. I guess it’s a question of what we’re thinking about. Writing your own paper? Unless it’s a meta-analysis... maybe not that useful. But just generally learning from the literature, it is very useful. And actually there’s a very promising tool called Elicit which does this sort of literature search. I think it’s primarily focused on the pharmaceutical domain. So yeah. So I think... yeah. So there is this use case. But I was just reflecting on the fact that for what I personally do in my research, you know, I’m aware of some of the major papers in my field obviously. But not knowing the literature is not a bottleneck, I don’t think.

[40:00] Seth Benzell: What I think of is Edison, famously... whenever he had an idea for a new invention, he made sure to get a team on making sure it was not invented already because he had gotten burned several times along. Oh, you know, somebody had filed a patent for that 20 years ago and they just never made any of it.

[40:19] Andrey Fradkin: Yeah, yeah. No, no. I mean, look, maybe it’s different in other fields. I... you know, I can only know what I know. Yeah.

[40:31] Seth: Sure. Um, maybe one more negative case. There was a mathematical case involving... what are conditions necessary on subsets to make sure that you don’t get so many subsets that are called cliques? That’s kind of the level of the math I understood of this problem. They gave ChatGPT the problem, it repeatedly gave them the wrong answer. Eventually, after insisting to ChatGPT it was giving them the wrong answer, it gave them the correct answer... which then they later discovered was already in the published literature and ChatGPT did not give it credit.

[41:12] Seth (continuing): So I guess another example here of you really need to be on top of these things and not take their first response as gospel.

[41:19] Andrey: Yeah. To me this is such a compliment to doing high-quality work because... you just... if you don’t have the judgment, it’s... it so often gives you stuff that’s wrong, incomplete, and you have to actually have some vision and knowledge to know which parts of the answers to take and which parts not to take.

[41:43] Seth: Right. Yeah. So yes. This seems like we are at the level where the AI is making very plausible guesses and you still need an expert sitting on top of it.

[41:53] Andrey: Yes.

[41:54] Seth: So, Fields Medalist winning mathematician Timothy Gowers gives us this take, which I thought was like a really kind of good summary of where it is right now, and kind of inspired my opening joke:

[42:12] Seth (quoting Gowers): “As a research supervisor, I have a rule of thumb for when a contribution I make to the research of one of my PhD students is at the level where I should be a joint author.”

Do you know where he’s from? Should I do an accent? I’m just gonna... I’m not gonna do an accent.

[42:24] Andrey: He’s British.

[42:25] Seth: He’s British? Ooh. Okay.

[42:27] Andrey: I don’t... yeah. Let’s skip the British accent.

[42:29] Seth: Okay. Thank you, Andrey. That’s a gift to you, the listeners at home.

[42:35] Seth (continuing): “The rule is that if the student comes to discuss the problem with me, and I have, in the course of that discussion, an idea that comes more naturally to me than to them, and that turns out to be helpful, then that is not enough for joint authorship. But if I spend time struggling with the problem—of course, I will only do this if the project is officially a joint one, very propitious as a British man—and during the course of the struggle... during the course of the struggle, I really love that... I come up with an idea that required more than just standard expertise that I happen to have, that I have made a genuine contribution to the work.”

[43:10] Seth (continuing): “My experience so far with LLMs is that they are capable of playing with this knowledgeable research supervisor role with me, which can be extremely useful given just how much knowledge they have”—this is coming from a Fields Medalist—”but they are not yet at the level, or at least have not yet exhibited that level in my own interactions with them, at which a human mathematician who follows my convention above would ask for joint authorship.”

[43:34] Seth (continuing): I mean, it’s... he’s kind of playing it down, but this is actually pretty freaking high praise, would you not agree, Andrey?

[43:40] Andrey: Yes. Yes. I mean, let’s just, you know, remind ourselves that whatever graduate students he’s thinking about are also some of the smartest people in the world. And you know, most... once again, most scientists who work with math have problems that are substantially easier than anything these sorts of people would be working on. Right? And are bottlenecked by it. Right? Like we’re, you know, bottlenecked maybe temporarily... you know like...

[44:12] Seth: Or even permanently.

[44:13] Andrey: Or even permanently. It could be either, right? And so yeah, like it’s essentially saying like, “Oh, for, you know, 99% of scientists who use math, it’s already really, really, really, really good.”

[44:26] Seth: It replaces me.

[44:28] Andrey: Yeah. And if you’re like a Fields Medalist, you know, maybe it’s not as good as you yet.

[44:35] Seth: Incredible. Um. I guess... one other kind of little detail I came... I want to pull out here is like the requirement that you have to struggle with it for co-authorship. I think that’s kind of fun, right? Like, is one of the reasons that maybe AI gets less credit than we should give it is that it seems so effortless?

[44:56] Andrey: Yeah. Well, you know, sometimes it’s like... it’s interesting, you know in this paper you see that the AI thought for like 20 minutes or whatever. And this is...

[45:05] Seth: Yeah, they got the really good version. Just to be clear, so this is using GPT-5.1 Pro, which can have very very long runtimes if you let it.

[45:13] Andrey: I think it’s 5.0 Pro. Just to be clear.

[45:16] Seth: 5.0 Pro? 5.0 Pro. Excuse me.

[45:19] Andrey: Yeah. But yeah. So this is the frontier reasoning model. This might be the one that’s... I think that’s the one that’s available in the max plan on ChatGPT. But it wasn’t clear to me whether the scientists here got some special access. They probably did. So yeah, it’s not really the sort of AI that most people today would be using, but of course, you know, they could be using it, you know, given how fast things move, within the next year.

[55:51] Seth: Right, right. So exactly. So as we march down Moore’s Law, what is available, you know, in pre-release to Fields Medalists diffuses to us proles in... what, a year or so?

[46:01] Andrey: Yeah, yeah, yeah. Um. Yeah, so I... I don’t know. To me, it’s just really... I mean, I would say it’s awesome. I mean... I mean, it’s just... it’s gonna make us so much more capable. Like, I don’t know... to me, this is a lot of cause for optimism. Even though it’s not, you know, it’s not doing science end-to-end. If that was your, you know, hope, it’s not there yet. But it’s already, you know, great.

[46:33] Seth: I think one thing I would pull out, and I’ll emphasize this in our conclusion, is that it seems like one of the bottlenecks on AI itself is the inability to rigorously check its own proofs. And it seems like once we get really good automated translation from these kinds of human-LLM-readable proofs into kind of machine-checkable proofs, you’ll like multiply this productivity because it’ll be able to check its own work.

[46:59] Andrey: Yes. I... we should also mention, like we haven’t mentioned yet, but there are several very, very well-funded startups that are working on AI for mathematics. DeepMind is also obviously a leader in this field in addition to OpenAI. So it’s also kind of one where, you know, as economists we’re like, “Wow, there’s just so much competition and investment that’s great.” We’re bound to get some awesome results in the future, right?

[47:33] Andrey (continuing): Yeah, so... so... so I mean one of the interesting things here is that it is really like a chat interface, right? Like you don’t have to use a specialized mathematical proving language, you don’t have to interact with that. You can reason with it in, you know, loose terms and then it kind of knows how to interpret it. Maybe some of these other efforts might be a bit more, you know, narrow... you know, very very powerful but more narrow. Yeah.

[48:02] Seth: Right. And it seems like the real win is both combining the natural language and the machine-provable code.

[48:09] Andrey: Yes. Yeah.

[48:10] Seth: Right.

[48:11] Andrey: But my vision for all these things is just, of course, that you have AIs calling tools that are other AIs, right? I am very much not in the camp of “one AI to rule them all end-to-end without tools.” Like, some people have that vision, but I don’t... you know, just like a human uses tools, I don’t see why an AI wouldn’t use tools. Which might be other AIs, like a human would have research assistants.

[48:38] Seth: I guess the only thing I would jump in here with is... right, one thing I’m always on the lookout for now as we read these papers is like, you know, the Bitter Lesson update. So to what extent does the generalist AI that’s bigger beat the specialist efforts? To what extent is task-specific prompting and scaffolding important versus “just use better model”? And I think in each of these examples we really do see task-specific scaffolding being important, prompting iteratively and, you know, in a special way being important. Now of course this is all in the context of a single model, so we can’t really speak to, you know, versus these other approaches, but something to keep our eyes open for.

[49:21] Andrey: Yep.

[49:22] Seth: Um, okay. Here’s an example that I thought was funny because it was like clearly written up by an AI. There was a physics example where they asked the AI to derive known but unpublished results about black hole symmetries. One of the take-out quotes is: “After about five minutes of internal reasoning, the model incorrectly reported that the equation had no continuous symmetries beyond trivial scalings.” Then again, we have another example, they prompt the model again, they give it a warm-up problem. With the warm-up problem, the AI is able to solve the full problem.

[49:59] Seth (continuing): This is the part that made me think it was definitely written up by an AI. In the implications section, it felt really AI-ish and here was one of the quotes I pulled out: “AI as symmetry engine. With minimal domain scaffolding, current models can carry out non-trivial Lie symmetry discovery for PDEs”—partial differential equations—”with non-constant coefficients.” Okay. Dude, that was an AI sentence. “AI as symmetry engine.” What kind of metaphor is that? That’s an AI metaphor, dude.

[50:29] Andrey: Yeah, I mean... I think one of the things that’s going on in the background that we should say is that scientists using AI to write is just now ubiquitous, right? There was a huge controversy at ICLR, one of the top CS conferences, where just an enormous share of referee reports for papers were written by AI. In fact there’s a tool, Pangram, that has shown very high accuracy at detection of AI writing, and it was used to measure these reviews and just so many of them were written by AIs. So many of the papers are written by AIs.

[51:15] Andrey (continuing): So I just think this has to... this is just the new normal, right? Like... and we shouldn’t be surprised. A lot of scientists... English is not their first language. Even for those who it is a first language, you know, writing is a specialized skill that most people, most scientists, are not very good at. And it’s a lot easier to have an AI write a draft and you tweak it than to write something from scratch. It’s not obvious to me how important it is that the human does the writing. I guess I like to do writing because writing is thinking, it’s a way that I think through problems. But for a lot of things, I don’t know, let’s say like form letters and things like that, like why would I waste my time honing my language when I could just have the AI do it? So I’ll just say like this is a new normal and the viewpoint that we’re mostly writing for the AIs is also true.

[52:16] Seth: Do you want to spell that out for people who might not have heard that phrase before?

[52:21] Andrey: Yeah. So I first heard it from Tyler Cowen.

[52:24] Seth: Andrey’s favorite economist. Friend of the show.

[52:30] Andrey: If you say that, he’s more likely to retweet you.

[52:33] Seth: [Laughs] Yeah, yeah, yeah.

[52:36] Andrey: “Friend” is, you know, a loose term, but you know, we have had dinner with Tyler and that was a great honor. But yeah, I guess the AIs are sucking in all the writing in the world for their training. You know, they’re also able to search through content very effectively and will be reading that content as part of forming their answer. And that’s just happening all the time. It’s happening much more than humans reading some very niche bit of content like one of our papers, right? And so then you might think that since your primary audience with a lot of writing is the AI, you might want to quote-unquote “write for the AI.” That might mean that you don’t have to write as carefully... or not as carefully, but you might... you know, some of the things to entertain humans might be less important.

[53:38] Seth: Poetic function of language.

[53:39] Andrey: Yes. Less important for the AIs. And so you get writing like this quote-unquote “symmetry engine,” right?

[53:50] Seth: [Laughs] Yes. Like... I don’t know. Okay, maybe. I think language will lose something if metaphors stop being helpful. I think you’ll just stop dropping metaphors, right? We’ll just get to purely functional language, right? Because a bad metaphor is worse than no metaphor.

[54:06] Andrey: Yeah, yeah. I mean, I guess I guess we’re gonna see very clearly... like much more clearly delineated communication for humans versus communication for AIs. That... I mean we’re almost kind of there. I mean papers... if you think about like how much effort most scientists put into writing papers vs. how bad the writing is in most scientific papers... why are we even pretending, you know?

[54:35] Seth: Yeah. Anyway, well, very interesting to watch. Um, I had one more example I wanted to pull out, which was the biology example, which I was really excited to read given that so many of these were very math-heavy. In this example, the writers of the anecdote uploaded an experimental figure showing the impact of giving some white blood cells a glucose substitute. Right? So the idea is maybe the white blood cells will do differently if they have glucose versus not glucose, and maybe you could like get them to do something that would cure cancer if you give them more or less glucose.

[55:12] Seth (continuing): And one of their results was that they tried both giving it no glucose (or a very low amount of glucose) as well as giving it a treatment which is like a glucose substitute. So there was some goo that was gonna gunk up the glucose receptor so that the cell wouldn’t be able to eat the glucose. GPT-5 seemed to understand the figure, pointed out hypotheses and potential follow-up experiments to understand why the “fake glucose” had a different effect than low glucose.

[55:40] Seth (continuing): It suggested some potential mechanisms why. ChatGPT writes: “A low glucose control partly mimics the effect but is weaker than the fake glucose at equal nominal concentrations, suggesting contributions from glycolysis restriction and N-linked glycolysation interference... a known 2-DG [this is the fake glucose] off-target... rather than energy limitation alone.” Right? So this seems to have been the key contribution of ChatGPT, is that... like the scientists obviously when they made this result they immediately identified, “Oh that’s interesting, the fake glucose seems to have a different effect than the zero glucose.” The insight that the AI seemed to have had is this particular mechanism, is that there’s an off-target effect of the fake glucose. And suggested, you know, experiments to follow up—using a different kind of fake glucose, trying some other treatments that would identify whether that was the correct mechanism.

[56:42] Seth (continuing): You know, when I say it that way, it doesn’t seem that impressive, right? Like the scientists were already pretty close to that. The scientist... at least reading them, they seemed more impressed than like my reading of it was. They write—the authors write—”In retrospect in particular, the proposed mechanism of reduced IL-2 signaling via interference with N-linked glycolysation made clear biological sense because it could directly explain the disinhibition of the Th17 cell differentiation under 2-DG treatment. However, this mechanistic hypothesis had not occurred to us.”

[57:17] Andrey: Yeah, I mean... I mean once again, it’s a thought partner. You know, if you’re working with people on a problem, you’re gonna have conversations with them and different co-authors are gonna come up with ideas that you hadn’t thought about yet. And you know through iteration, that ultimately creates an artifact which is the research paper. And that’s kind of a series of things like that. And it’s very rarely that there’s kind of one Eureka in this. Or even if there’s like a main insight, you actually have to like take it very seriously to draw out the implications and so on. A lot of... I actually imagine a lot of people had great ideas that ended up eventually being correct science but they just didn’t pursue them, right?

[58:10] Andrey (continuing): So that’s kind of how maybe we should think about this. Is that it’s a thought partner, but it doesn’t yet have agency to pursue the research.

[58:21] Seth: That is so interesting because I came away with this feeling like this is an example of AI as deep literature search, right? Because it seems the problem was pretty well defined, right? Shouldn’t this have the same effect as that? Do deep literature search to see if there’s any, you know, off-target effects of either the thing. But maybe that’s viewing this too narrowly.

[58:42] Andrey: Yeah. I just... I’m not an expert enough to know whether it made a connection across, you know, literature... Right? Like it knows a lot of things. I don’t know if I’d call that literature review. Just like a scientist would know a lot of things. And then some of the magic happens when it connects two, you know, previously unrelated concepts. I just... to me, saying it’s just literature review seems a bit reductionist. You know...

[59:11] Seth: “It’s just a stochastic parrot, Andrey.” Okay. Are you ready? Do you have any other examples you want to make sure we highlight? Are you ready to move on to our conclusions and posteriors?

[59:25] Andrey: Yeah, let’s move on to the conclusions. Yep.

[59:28] [Music / Transition] — MOVING TO POSTERIORS

[59:35] Seth: Okay. So I think these were pretty impressive. I don’t know if there was any, you know, “dropping my jaw” ones. The Timothy Gowers being like, “This is good enough to be my lazy faculty advisor” is probably the jaw-drop moment, right?

[59:48] Andrey: Yeah. I mean just... I think the credibility of people like him or Terence Tao saying that they find it useful... I think in some sense it’s, you know...

[60:00] Seth: This is an OpenAI release selling, you know, for a product that they sell for $200 a month.

[60:09] Andrey: Yeah, but I mean... I mean... sure. I... I just... I don’t know. Like... to me, once again, I’m going back to my priors. Like it’s obviously useful for science. You have to be truly incurious or, you know, a Luddite to think that it’s not.

[60:28] Seth: Fair enough. Well, actually, I have a theory about your crypto friend. Is it just that, like, cutting-edge crypto is not published widely? Is there some sense in which, like, crypto research might not be in the dataset as much?

[60:44] Andrey: I don’t think so. I don’t think so. I think he... I don’t know. I don’t want to put words in his mouth. But if I like...

[60:52] Seth: He’s a Luddite.

[60:53] Andrey: No, no, no. I think if I had to guess, I think he... he kind of views like some deep... deep theoretical insight as maybe the requirement that he has in mind. And that’s... that’s the bar that he has. And...

[61:08] Seth: Yeah, it’s not Einstein. It’s not inventing new paradigms.

[61:11] Andrey: Yes, yes. But I guess... I don’t know. To me, that’s...

[61:17] Seth: I’m not Einstein! I’ll take it!

[61:19] Andrey: Yeah, yeah. Yeah. Exactly.

[61:24] Seth: Um, okay. Uh, and I... I made this point already but I just want to end here which is... I think my takeaway from here is some sort of automatic translation in between sort of machine-language-provable code and like human-language code seems to be the real bottleneck here before speeding up AI a lot. Or at least math-specific AI.

[61:48] Andrey: I really don’t think that’s the bottleneck, Seth. I truly don’t. Um.

[61:52] Seth: But it con... we keep on seeing examples of it like it gives the wrong answer and you have to be like, “Well, I thought about this and it’s the wrong answer,” and then it does that five times and then it gives you the right answer. We see like three examples of that here.

[62:05] Andrey: I... I guess like... this is one... I guess “bottleneck” seems like a weird word to me given that there’s a parallel...

[62:14] Seth: Accelerant.

[62:15] Andrey: I’m not... I... okay. There’s a para... there’s essentially parallel efforts to... certain things can be formalized in these Lean provers. And imagining an OpenAI... like a... like a GPT-like model calling the Lean model is like trivial. Like I... I’m not saying it’s trivial like clearly like... I don’t...

[62:43] Seth: If it’s trivial, why does it keep on giving us wrong answers?

[62:45] Andrey: Because OpenA... because I actually think that like the way this system is designed, it’s kind of using GPT by itself. But actually... my sense is that people in the field who are pushing the envelope are combining these tools. And if you look at DeepMind’s tools, they’re not... they don’t work like this. They are using the formal provers. And so to call it a bottleneck is like implies that like, “Oh, like actually no one has this working yet.” And I... and I actually... I... I bet that some people have this working. It’s... I don’t think... not... I’m not sure whether everything can be formalized in these specialized proving languages in the same way. But yeah.

[63:34] Seth: It’s a limitation in these examples, but you’re saying it’s not a limitation, you know, tomorrow if you wanted to use the cutting-edge tool.

[63:41] Andrey: Yes, yeah. That... that’s... that’s my sense. But you know, if listeners disagree, you know, feel free to let us know. Yeah.

[63:48] Seth: Yeah, please call in. Okay. Um. Posteriors? Or any other limitation comments you want to make?

[63:55] Andrey: No. I... yeah. I mean I...

[63:57] Seth: Posteriors. Yeah.

[63:58] Andrey: Yeah. I mean I... I don’t know. Like our... our priors were very loose so I don’t know the posteriors. I mean I think... yeah. I mean I... you know, I stand by what I say here. I found these examples quite interesting. And it was uh...

[64:14] Seth: Okay. So paradigm-wise, you’re still in the same place? That you think it’ll be co-working with it today and co-working with it in five years?

[64:21] Andrey: Yep.

[64:22] Seth: I said right now it’s super powerful for lit reviews—deep literature reviews—and um, maybe we’re... you know, in five years we will be all the way to AI on its own, at least for math problems. I come away reading this thinking we’re closer to AI on its own for frontier math research than before reading this. Uh, it really does... and again, I call what I said as a bottleneck or say that it’s already been removed... but I mean it seems like if this... what we see described here, plus the AI being able to iteratively check itself and just like redo the math... try another approach if it disproves itself... seems like you should be able to just let that fly and find a bunch of cool stuff.

[65:13] Andrey: Yeah. And if... if you... if you look at prediction... you know, various forecasts, we see forecasts for by 2030 the Millennium Problems being solved with AI. So... uh, that’s not a very un...

[65:28] Seth: AI is gonna solve the Riemann Hypothesis? That’s more of a question about the Riemann Hypothesis than AI.

[65:32] Andrey: Well, you know. People who are experts, a decent chunk of them forecast that this will happen. So, yeah.

[65:40] Seth: Okay. And how impressed were we by the most impressive result? I said we were gonna... I was gonna be like 7 out of 10 impressed, 8 out of 10 impressed. I think that’s kind of where I end up. If not like a little bit below that. Um, in the sense that I’m not saying that these mathematical results aren’t super impressive, but I was hoping for like, “And we discovered something that was like a treatment we can use tomorrow,” or “We discovered...” I was hoping for something that was kind of more directly practical from at least one of these examples.

[66:13] Andrey: Yeah. I mean, to me, if there was something that was very practical, that would be like a 9 out of 10 or 10 out of 10. And you know. Uh, but I... yeah. Once again, I think like nothing blew my mind, but it all seems like we’re... we’re... we’re on the path to this being a very transformative technology for science. Yeah.

[66:36] Seth: Yeah. Super, super excited to talk to Ben Golub about the AI research tool that he’s working on. Um, and uh, listeners at home, let us know: How do you use AI in your science or in your life? Post it in the comments, share, comment, and subscribe. All right.

[66:56] Andrey: Well, until next time. Keep your posteriors justified.

[67:00] [Music fades out]

One year of justifying our posteriors

Andrey Fradkin — Sun, 11 Jan 2026 19:57:19 GMT

For the past year, Seth Benzell and I have been running a particular type of experiment on ourselves with Justified Posteriors, our podcast. Can we behave like good Bayesian learners about research by stating our priors ex-ante, carefully reading papers, and then reporting how we’ve updated our beliefs? This has turned out to be more complicated and more interesting than it seems, something I reflect on in the rest of this essay.

A foundational assumption of Justified Posteriors is that the claims made in published research papers and other intellectual work do not directly correspond to what we believe after reading them. This should be obvious to anyone who has seriously engaged with intellectual work. But what is less obvious is the degree of the gap between the claims in the work and the beliefs of the reader. Is there a slow accumulation of evidence (a vast literature, as one will read in formulaic introductions) that gradually moves our beliefs from zero to one? Or perhaps there is a critical moment, where one paper causes a rethinking of all that came before it, leading to a new conclusion.

We could dredge through the history of science, as our predecessors Popper, Kuhn, and Lakatos have, to come up with examples of both. We idealize the pivotality of Einstein’s tests of general relativity. The evidence we have to deal with is much muddier. We live in a time where claims are circulated as a global pastime. Sometimes these findings come with the trappings of academic prestige and peer review, while other times they come in the form of a polemic dropped like a nuclear bomb into the memesphere, as those who have situational awareness may understand.

Few have time to read deeply, and even thinking seems like one of those lines in a todo list that is never crossed out. Consider the ubiquitous evals used in AI research and cited throughout social media. The number of people who have read the underlying methodology for each eval is minuscule. The ignorance is so vast that people don’t know how few samples are in each eval, let alone the confidence intervals. And yet, a careful evaluation of a new eval such as GDPVal can update our priors by a lot.

This is the water we swim in with Justified Posteriors. The premise for the show seemed simple, but nothing is as simple as it seems. For one, how do we pick a prior, especially without reading the paper? A conceit of the podcast is that we form our priors with zero information about the paper, but even to pick a paper we need to know something about it. Picking a prior turns out to be one of the topics which we struggle with the most.

What are we supposed to learn from a theory paper such as Ide and Talamas’s “Artificial Intelligence in the Knowledge Economy?” A theorist might be satisfied by learning whether this is a useful way of modeling the phenomenon. But we try to translate these into more empirical statements, such as “what percentage of US workers will have managing or creating teams of AI agents as their main job within 5 years.” Typically, we don’t update a lot.

Of the 22 episodes in which we had at least some semblance of priors, the biggest update came for Seth in the episode about “How do social media feed algorithms affect attitudes and behavior in an election campaign?” The randomized control trial evidence on political beliefs convinced him that whether an algorithmic feed or a reverse chronological feed was shown to a user did not affect their political polarization. I already had this as my prior, given prior literature.

Nonetheless, neither of us were willing to update much on the larger claims. The reason is that, as always, the real world is complicated. For example, the paper did not study decisions to moderate content, a process which can be algorithmic but which differs from the algorithmic feed. The paper also did not consider truly directed algorithmic interventions, such as those by Elon Musk on X. We can’t read this paper and just say that algorithmic feeds are not an important determinant of political beliefs.

For me, the biggest update came in the episode “Can AI Make Better Decisions than Doctors?” I came in skeptical that AI could overcome the fundamental problems of causal inference without a randomized control trial. The evidence in the paper strongly updated me toward believing we should be more aggressive in inserting AI into ER decisions.

Interestingly, papers on more macro topics caused smaller updates even if they had much greater implications. Our first episode was fittingly about the now famous Situational Awareness document written by Leopold Aschenbrenner in June 2024. We didn’t have explicit priors, but we thought that AGI was further away than 5 years. We also thought AI was super important and that some of the predictions were plausible. We joked about buying NVIDIA, and didn’t (we were fools). To me, this episode highlights how easy it is to be directionally right, to read the right materials, but to not take ideas seriously enough. The arguments in the paper about power generation and data centers have especially proven correct. And if you squint, we’re following the timeline predictions closely even to this day. Claude Code with Opus 4.5 seems to be just on time for Aschenbrenner’s prediction of a proto-automated-engineer in 2026/2027.

A common theme in our discussions of papers about the economics of AI is that they are often measuring transitory phenomena, such as changes in productivity or performance at a particular point in time. An extreme example of this is the “Simple Macroeconomics of AI” by Daron Acemoglu, which assumes that AI will stay as good as it was in 2024. These papers are often underwhelming, even when they are well-crafted, because what everyone really cares about is what will happen in the future.

Much of my learning has come through conversation about the paper, rather than just by reading the paper. My updates would be very different if I read the paper without talking with Seth about it. This is reminiscent of an academic seminar, in which a group of colleagues focus exclusively on one paper presented by a speaker. Attendees of seminars will know that oftentimes the most interesting part of seminar day occurs in the hallway conversations afterwards, when people share their opinions and discuss. One can tell how serious an academic department is by the quality of the hallway discussion.

This brings me to the next topic, the validity of podcasting as a worthwhile intellectual pursuit for a professor. I am supposed to primarily demonstrate my work on an intellectual topic by writing papers published in top journals. Yet to me it is obvious that we are doing valuable and original work in reading these papers and interpreting them through broader lenses than just the minimum publishable unit. For each episode, we have to understand literatures, engage deeply with evidence, and reason through the implications. This sort of work is something top researchers often do prior to starting new research projects, but is rarely shared outside of side conversations or lab meetings. What Seth and I do is a valid and valuable intellectual activity, not substantively different from writing a paper or a book.

One of the great pleasures of doing the podcast is hearing from our awesome readers and listeners! In the coming year, our goal is to improve the quality of our work by increasing our preparation, improving our audio and video quality, and by bringing on insightful guests. I am excited to continue covering emerging measurements of the AI economy and theoretical frameworks related to the impact and diffusion of AI. As always, we would love to hear from you with any feedback.

Thanks to Seth Benzell for comments and for being a great co-host.

Ben Golub: AI Referees, Social Learning, and Virtual Currencies

Andrey Fradkin — Mon, 29 Dec 2025 13:00:51 GMT

In this episode, we sit down with Ben Golub, economist at Northwestern University, to talk about what happens when AI meets academic research, social learning, and network theory.

We start with Ben’s startup Refine, an AI-powered technical referee for academic papers. From there, the conversation ranges widely: how scholars should think about tooling, why “slop” is now cheap, how eigenvalues explain viral growth, and what large language models might do to collective belief formation. We get math, economics, startups, misinformation, and even cow tipping.

Links & References

Refine — AI referee for academic papers
Harmonic — Formal verification and proof tooling for mathematics
Matthew O. Jackson — Stanford economist and leading scholar of networks and social learning

Cow tipping (myth) — Why you can’t actually tip a cow (physics + folklore)
The Hype Machine — Sinan Aral on how social platforms amplify misinformation
Sequential learning / information cascades / DeGroot Model
AI Village — Multi-agent AI simulations and emergent behavior experiments
Virtual currencies & Quora credits — Internal markets for attention and incentives

Transcript:

Seth: Welcome to Justified Posteriors, the podcast that updates its beliefs about the economics of AI and technology.

Seth: I’m Seth Benzel, hoping my posteriors are half as good as the average of my erudite Friends is coming to you from Chapman University in sunny Southern California.

Andrey: And I’m Andrey Fradkin coming to you from San Francisco, California, and I’m very excited that our guest for today is Ben Goleb, who is a prominent economist at Northwestern University. Ben has won the Calvó-Armengol International Prize, which recognizes a top researcher in economics or social science, younger than 40 years old, for contributions to theory and comprehension of mechanisms of social interaction.

Andrey: So you want someone to analyze your social interactions, Ben is definitely the guy.

Seth: If it’s in the network,

Andrey: Yeah, he is, he was also a member of the Harvard Society of Fellows and had a brief stint working as an intern at Quora, and we’ve known each other for a long time. So welcome to the show, Ben.

Ben: Thank you, Andrey. Thank you, Seth. It’s wonderful to be on your podcast.

Refine: AI-Powered Paper Reviewing

Andrey: All right. Let’s get started. I want us to get started on what’s very likely been the most on your mind thing, Ben, which is your new endeavor, Refine.Ink. Why don’t you tell us a little bit about, give us the three minute spiel about what you’re doing.

Seth: and tell us why you didn’t name your tech startup after a Lord of the Rings character.

Ben: Man, that’s a curve ball right there. All right, I’ll tell you what, I’ll put that on background processing. So, what refine is, is it’s an AI referee technical referee. From a user perspective, what happens is you just give it a paper and you get the experience of a really obsessive research assistant reading for as long as it takes to get through the whole thing, probing it from every angle, asking every lawyerly question about whether things make sense.

Ben: And then that feedback, hopefully the really valuable parts that an author would wanna know are distilled and delivered. So as my co-founder Yann Calvó López puts it, obsession is really the obsessiveness is the nature of the company. We just bottled it up and we give it to people. So that’s the basic product—it’s an AI tool. It uses AI obviously to do all of this thinking. One thing I’ll say about it is that I have long felt it was a scandal that the level of tooling for scholars is a tiny fraction of what it is for software engineers.

Ben: And obviously software engineering is a much larger and more economically valuable

Seth: Boo.

Ben: least

Andrey: Oh, disagree.

Ben: In certain immediate quantifications. But I felt that ever since I’ve been using tech, I just felt imagine if we had really good tools and then there was this perfect storm where my co-founder and I felt we could make a tool that was state of the art for now. So that’s how I think of it.

Seth: I have to quibble with you a little bit about the user experience because the way I went, the step zero was first, jaw drops to the floor at the sticker price. How much do you,

Ben: not,

Seth: But then I will say I have used it myself and on a paper I recently submitted, it really did find a technical error and I would a kind of error that you wouldn’t find, just throwing this into ChatGPT as of a few months ago. Who knows with the latest Gemini. But it really impressed me with my limited time using it.

Andrey: So.

Ben: is probably, if you think about the sticker price, if you compare that to the amount of time you’d have, you’d have had to pay error.

Seth: Yeah. And water. If I didn’t have water, I’d die, so I should pay a million for water.

Andrey: A question I had: how do you know it’s good? Isn’t this whole evals thing very tricky?

Seth: Hmm.

Andrey: Is there Is there, a paper review or benchmark that you’ve come across, or did you develop your own?

Ben: Yeah. That’s a wonderful question. As Andrey knows, he’s a super insightful person about AI and this goes to the core of the issue because all the engineers we work with are immediately like, okay, I get what you’re doing.

Ben: Give me the evals, give me the standard of quality. So we know we’re objectively doing a good job. What we have are a set of papers where we know what ground truth is. We basically know everything that’s wrong with them and every model update we run, so that’s a small set of fairly manual evaluations that’s available. I think one of the things that users experience is they know their own papers well and can see over time that sometimes we find issues that they know about and then sometimes we find other issues and we can see whether they’re correct.

Ben: We’re not at the point where we can make confident precision recall type assessments. But another thing that we do, which I find cool, was whenever tools that our competitors come out, like Andrew Ng put out a cool paper reviewer thing targeted at CS conferences.

Ben: And what we do is we just run that thing, we run our thing, we put both of them into Gemini 2.0, and we say, could you please assess these side by side as reviews of the same paper? Which one caught mistakes? We try to make it a very neutral prompt, and that’s an eval that is easy to carry out.

Ben: But actually we’re in the market. We’d love to work with people who are excited about doing this for refine. We finally have the resources to take a serious run at it as founders. The simple truth is because my co-founder and I are researchers as well as founders, we constantly look at how it’s doing on documents we know.

Ben: And it’s a very seat of the pants thing for now, to tell the truth.

Andrey: Do you think that there’s an aspect of data-driven here and that one of your friends puts their paper into it and says, well, you didn’t catch this mistake, or you didn’t catch that mistake, and then you optimize towards that. Is that a big part of your development process?

Ben: Yeah, it was more. I think we’ve reached an equilibrium where of the feedback of that form we hear, there’s usually a cost to catching it. But early on that was basically, I would just tell everyone I could find, and there were a few. When I finally had the courage to tell my main academic group chat about it and I gave it, immediately people had very clear feedback and this was in the deep, I think the first reasoning model we used for the substantive feedback was DeepSeek R1 and people, we immediately felt, okay, this is 90% slop.

Ben: And that’s where we started by iterating. We got to where, and one great thing about having academic friends is they’re not gonna be shy to tell you that your thought of paper.

Refereeing Math and AI for Economic Theory

Andrey: One thing that we wanted to dig a little bit into is how you think about refereeing math and

Seth: Mm-hmm.

Andrey: More generally opening it up to how are economic theorists using AI for math?

Ben: So say a little more about your question. When you say math

Seth: Well, we see people, Axiom, I think is the name of the company, immediately converting these written proofs into Lean. Is that the end game for your tool?

Ben: I see, yes. So good. Our vision for the company is that, at least for quite a while, I think there’s gonna be this product layer between tools, the core AI models and the things that are necessary to bring your median, ambitious

Seth: Middle

Ben: not

Seth: theorists, that’s what we call ourselves.

Ben: Well, yeah. Or middle, but in a technical dimension, I think it’s almost certainly true that the median economist doesn’t use GitHub almost ever. If you told them, they set up something that, a tool that works through the terminal, think about Harmonic, right?

Ben: Their tools are all, they say the first step is, go grab this from a repository and run these command line things to, they try to make it pretty easy, but it’s still a terminal tool. So a big picture vision is that we think the most sophisticated tools will be, there will be a lot of them that are not yet productized and we can just make the bundle for scholars to actually use it in their work.

Ben: Now about the question of formalization per se, I have always been excited to use formalization in particular to make that product experience happen. For formalized math, my understanding is right now the coverage of the auto formalization systems is very jagged across, even across. If you compare number theory to algebraic geometry, the former is in good shape for people to start solving Erdős problems or combinatorial number theory, things like that, people can just start doing that. For algebraic geometry, there are a lot of basics that aren’t built out and so all of the lean proofs will contain a lot of stories that the user has to say, am I fine considering that settled or not?

Ben: And that’s not really an experience that makes sense for someone trying to check their econometric draft, right? So we’re watching and I think as soon as we feel it’s the moment when we can take the typical, say economic theory proof and give a rigorous certification, we’ll be right on.

Ben: I would like us to be in a position to be right on top of it.

Seth: I blame Grothendieck for algebraic geometry being hard to formalize, hard to make into Lean.

Andrey: Even short of things like Harmonic, right? It’s certainly you can get useful things of putting some math or asking for some math from Gemini for example. How are people in the field using those tools and have you noticed that has affected the type and quality of economic theory you’re seeing?

Ben: Oh yeah. That’s zooming out from refine. I’m obviously a heavy user of AI tools for my own research. I think broadly we’re seeing two phenomena play out in parallel. It’s a lot easier, this idea that went viral a few weeks ago of work slop being much easier to produce. I think there is an experience, which I’ve experienced myself, where you owe your co-author something and you have some ideas, you’ve done some real work, but it’s much easier to put a section in the paper that is AI written that looks a lot that our natural checks see as real work. And that introduces obviously new kinds of risk. It makes work faster in some ways and more fragile in others. And I think about that a lot. By the way, one of the main new values of refine is as people are perhaps less moment to moment engaged with the exact, or less line by line engaged with their work, which AI is doing. They need that global eye and that obsessive look, which used to be more in one’s own head. But that’s the negative phenomenon. But I think in terms of having a pretty expert consultant in things you don’t usually work on just for getting started and forgetting ideas.

Ben: I can already see major gains in my own research. One thing I would be curious to see is just looking at measures of production of scientific literature. We should see something on speed that’s visible in we should see signs of science speeding up in the areas which are particularly sped up.

Ben: And I, it would be fun to formulate a hypothesis like where should we be looking to see that

Seth: Right. We recently recorded an episode, the open AI paper on early uses of AI in social science. And it seems to us one of the most obvious immediate use cases is just, can I find if somebody already proved this and I could just plug it in? Right.

Andrey: to be clear, not social science, but mathematics.

Seth: mathematics. Excuse me.

Seth: Yeah. Yeah. Science, science is,

Ben: Physics. So yeah,

Andrey: Yes, exactly.

Seth: Andrey always calls me out that I say economics or social science when he really means, when I really mean actual science.

Andrey: Just to be clear, there were

Ben: important. Yeah,

Andrey: A bunch of math in that paper, which is very cool.

Ben: This is known. I think economic theory, it’s important to me about economic theory that there is really such a thing that’s called economic theory, very distinct from math. Usually, unless something is going wrong, you don’t need to do any interesting math.

Ben: In an economic theory paper, you just find the relevant. So I think a lot of economic theorists who are successful and good at it, a lot of the trade is finding the right thing, learning enough of it to make it valuable for your application and just using it correctly. And that’s where that search problem is really accelerated. So I’m with Seth that there’s gonna be a huge speed up just for maybe not as, it’s not super intelligence. It’s better search, but that’s huge.

Andrey: So one economic theorist that I’ve talked with about this is Joshua Gans. I don’t know if you’ve had a chance to talk to him, but he’s been writing a paper a week,

Seth: Right. The guy, he is grinding him out with the AI help

Andrey: Is there some sort of weird proof of work thing that’s starting to fail? Because look, writing down theories of almost anything, it was, it took a lot of work, but you could, there was a recipe, right?

Andrey: As an

Seth: you can mathematize Marx right. The fact that I can rewrite marks in math doesn’t necessarily make Marx good.

Andrey: Yeah.

Andrey: So how do you think about that and what do you think are gonna be directions in economic theory that are really changing the game as a result of this?

AI, Work Slop, and the Future of Economic Theory

Ben: Yeah. You raise an interesting point. You can think of one vision of what social science is, or what economic theory is, that’s suggested by what you just said, which is that we’re commentators on social reality and we’ve developed a particular style of doing that, which involves, in the case of modern economic theory, a lot of math and the proof of work.

Ben: There’s almost an equilibrium where you, in order to say something, you have to really carefully and write well in English, but also do this mathematics and now that, at least superficially can be totally hacked, is that gonna stop? Is that gonna make the commentary aspect of economic theory lower signal in some sense?

Ben: Is it going to, and that’s a great question. So let me table that for a second and say what? I have a thought on this topic that’s related to that. If you’re really good at that and you produce these really jewel like economic theories and then suddenly everybody can write slop and produce economic theories that at least take a while to distinguish from your beautiful ones, then maybe you feel sad, like your art has been degraded.

Ben: And I do think that’s the way poets, I think. I talked to some people who are very interested in the experience of artists with AI and I think that’s an artist’s experience with AI. Then there’s another kind of person I have in mind, which is an idealized cancer biologist.

Ben: And you tell them, oh, your jewel like blot analysis that you do or whatever. Now they’re gonna be automated. And I think this guy’s first reaction is mostly not, oh, how will people be able to admire my art? Will people still appreciate my art as much or what will I do with my time?

Ben: But they’re like, oh shit, we might move faster toward curing cancer. So one thing I think is wrong broadly with economic theory is that there are a lot of us whose reactions fall more into the artist category. And I would like, I think economic theory is not done. In fact, it’s quite bad what we’ve achieved on the whole.

Ben: So we should be

Seth: excluded of course.

Ben: Yeah. So as a group, as a community, right? And so if we, I would hope that we have it in us to say, look, now we have these incredible tools to take a run at questions that are really where the solution would be genuinely valuable.

Ben: And we could really try to do them better. And we have this huge resource now. I would like it to be, I would be happier about us if we had more of that reaction. I’m hoping that there will be parts of the profession, parts of the enterprise that grow and accelerate, because they’re driven by that as opposed to hand wringing over the art problems.

Seth: Right. And it seems like you could always add some more, get gatekeepers on the backend. Right? If we just make it easier to enter with, here’s my mathie paper. And the concern is you get too much slop. Maybe there is some way to filter. You don’t have to filter on the math anymore. You filter on something else.

Ben: Totally. All of these offensive weapons are also closely related to defensive weapons. So there’s a whole, and refine is obviously a natural, we think about that, that we can, at least, at minimum, we can help reject slop that’s written by cheap models without much skill and maybe we can help

Seth: How do you defeat slop? How do you defeat slop with bitter slop?

Ben: Yeah,

Andrey: Have you talked with some editors? Is there interest here?

Ben: Yeah. So Refine is doing pilots with several of the very top journals in economics. And we’ve been really encouraged by, I think because a lot of the editors are super genuinely pro-social people who want to take the tech, who wanna bring technology to bear as fast as possible, to improve the profession.

Ben: And so we, and I think there’s a feeling that they have that’s correct. That this phenomenon is here, and so the best way for the journals, for example, to deal with it is to be as up on it as anybody. And so we, I think the main use that is the easiest sell is just final due diligence right before publication at the conditional accept stage.

Ben: Can we make sure that papers are, any remediable, any mistakes that the author would be embarrassed to have published, the author has a chance to learn about it. Correct. That’s, everybody agrees with that. I think there’s a lot more design required to do it thoughtfully when stuff is incoming.

Ben: I have heard experiences from editors using REFINE and other tools. When they get a submission that they’re very suspicious about, they can just quickly run it through refine, see that there seem to be, and they’re usually experts in, right? So they can see, oh, this is surfacing really serious errors.

Ben: Now I can, for example, desk reject it with a lot more confidence. So we’ve, that experience does happen. That’s purely people’s own use of the tools, but.

Andrey: Are you worried that your tool is fundamentally, it’s interesting. Like many economists, it’s a tool of rather than constructivism in that it’s very good at finding problems. But is it ever gonna be, well, this is not a perfect paper, but it’s a beautiful paper nonetheless.

Seth: GPT-4o if you wanna sycophant to Andrey.

Ben: Actually, one thing we think a small version of that, and I’m curious for your guys’ sometimes refined produces, you give it a 50 page manuscript and it produces six comments. In fact, one of our engineers recently switched. He said, we switched to a new, we did some model upgrades.

Ben: And then he looked at it and he said, this only produced six comments. And it was on a paper by one of our friends who had been through refine and all the mistakes were gone. And so he was like, oh, it went from, if I just run this on the dumber models, they give me 50. Now it’s six.

Ben: And that was actually good because the feature question we have is in that case, should we tell the author, Hey, this has fewer things we can see wrong than 95% of papers. Right? That’s turns this question mark experience into maybe something encouraging. So we haven’t rolled that.

Ben: I’m curious if you guys think such a badge would be pleasant for an author.

Seth: Question mark experience.

Andrey: I, I, think you should, well, you should obviously run the experiment,

Viral Processes and the Refine Referral Program

Seth: Uh, maybe an interesting place to start is this referral program that you came up with. So where did that come from? Why did you design it the way you did?

Andrey: You just, well explain it first. Yeah. I think that’ll be the first. Yeah.

Ben: what we have, we actually, we, through the end, through the end of Decem through the end of November, we ran our, our first iteration of our referral program, which we will keep, which will tune and keep running, in various guises. And the way the program works is you, if you refer a friend, if you want to refer friends, you get a referral link from the site. You can share that with anyone you want. And every time somebody, if somebody that you refer ends up actually, paying for a full refine review, at least one, they, they get a full bonus review and you, the referer get one. So we, our, our top reviewers, I don’t think you’ll mind me sharing ‘cause he, he told, he basically told everyone he knew, but Joshua Gans, he, he was, he’s like, I think he has like 35 credits now because he just kept referring and

Seth: God bless.

Ben:because my co-founder, my co-founder and I were talking and we’re like, this is than we expected, should we’d be worried about.

Ben: So we were like, no, this is only good. This is, there’s nothing to be stressed out. Um, he can have, he can have lifetime refined use, free for, for being such a good, but that’s what, so I think economically, I think there are two thing. One, one immediate thing to think about is that some people are gonna be really good ambassadors for your product, but you don’t know who they are.

Ben: There’s an information problem and a referral to the extent, and interestingly, they’re the ones who are gonna value the credits, if they’re really good users of it, and they’re also gonna be the ones that, probably can identify others who know. And so getting those people to raise their hand, is not a trivial problem if you just had to do it without, but it turns out this, it, offering the referral to them kind of puts the incentives in the right place. And then, the others, obviously the other lens that I think of it through is, the lens of network economics and the viral process. So I, I’m happy to talk, but I actually, the information one, when we were thinking like, who should we recruit as an ambassador? It wasn’t obvious. And this got them to come forward.

Seth: You’ve done some work, I think, both in, definitely theoretically, but maybe even empirically too, about optimal seating. So did that, any results from that play in?

Ben: That’s a good, I would say the, the most, honestly, the most important insight that kind of was really top of mind for me was what I, in an, in my undergrad networks class, which I teach from, Networks, Crowds, and Markets by Easley and Kleinberg, they go through the basics of the viral process

Seth: Will Jackson be insulted that you don’t use his book?

Ben: well, no, ‘cause it’s, it’s graduate book.

Ben: I

Seth: Okay.

Ben: every year. I do say, you can go buy, you can, if you really wanna know everything, you can buy Matt’s book. But so,

Andrey: yeah, just as context for the listeners, Matt Jackson was Ben’s thesis advisor. Yeah.

Ben: and yeah, collaborator and overall hero. So I, and it’s funny because I, yeah. Small aside, but when I teach that class, I’m like, ‘cause I realized from these undergrads perspective, Matt Jackson, like, if you read these books, he’s just like, they think he’s probably dead. Like, he is like, seems like a very major, a major part of the field.

Ben: And then I drop somewhere in the middle of the quarter, like, oh, Matt was my, Matt was my advisor. Um,

Seth: Not dead yet.

Matt Jackson as an Advisor

Andrey: talking about this, this is a little bit of a tangent, but I hope you don’t mind Well. What was he like as an advisor?

Ben: oh yeah, he is, he was ama I mean, overall amazing. Like, I, I, the main thing to say about it is I met him right as he was about to move from Caltech to Stanford. I came to him as a Caltech Summer research intern student. He didn’t really havetime, but somehow I, I tricked him into like, not, to, not to being officially on the, on the program.

Ben: Uh, my advisor in the program. And then we, we started working on our first papers on social learning and information aggregation right then, and. He, I think he’s ex, the most salient trait of him is that he is just incredibly supportive and encouraging about research, but actually not at all. There was very little teaching that he ever, he ever did, explicitly, here’s how you do research. Everything I learned from him was, was ‘cause he was open to co-authoring and I just saw him do research and I learned by, by apprenticeship. my dad had actually told me that that was the best way to learn and I, and but he had like Soviet physics, in the 1970s as his reference point.

Ben: So I was pretty sure it was not good advice, but it actually ended up being exactly what worked for me, with Matt. But Matt was not, Matt was not prescriptive he didn’t, I don’t think, I think his, his default mode of advising is like, because he’s so incredible at research. He, his first best advising style is to leave the student alone and let them, and let them do their thing.

Ben: And one, and I, it made way more sense to me when I talked. I, I think I talked to him about. His experience with his advisor, Darrell Duffie. And I learned that it was just, it was all this dynastic thing where Darrel was exactly the same way. He just, like, Matt brought him a thesis and Darryl was like, this is really interesting.

Ben: This is good. They had been writing other papers, but that was the extent of, and I, I don’t, Mike’s Matt was more, was definitely a great mentor, but I think it was really freeing to have someone basically just trust you to do re to do research and be there as a, be there to teach by example when you needed it.

Eigenvalues and Network Dynamics

Andrey: here’s a question. Who likes eigenvalues more? You or Matt Jackson?

Ben: Definitely me. ‘cause Matt’s not, Matt’s not a math nerd. Matt. Matt is a, Matt really is a true, true, true social scientist. He’ll use whatever tool. I think there’s, I’ve always felt a little sheepish that this aesthetic thing of like, what, this tool is really like special to me. He’s, he’s not like that and I think it makes him a better social scientist that he’s not.

Ben: Whereas I, ‘cause I think when you, whenever you care about something other than explaining the social world, that’s gonna be like, a trade

Seth: Well let, let’s slow down for a minute for, people in the audience who don’t live with the, in the, in the glorious glow of the eigen value. And, thinking about eigen vectors of Jacobian matrices, can you give us a little, give us a little taste to someone who’s already not in love with eigen values?

Seth: Why should they love eigenvalues?

Ben: Yeah, that’s a great question. Well, so, okay, 0.1 is, algebra describes the world. You guys know that video where the guy that the, the math profs or like, like sweaty t-shirt math guy is yelling like, functions. Describe the world. I think the real thing, linear algebra describes the world, and I think in the AI era, we, we don’t, as Tyler Cowen says, it’s Rise.

Ben: Tyler Cowen says it’s rising in status. So it’s quite high in

Seth: There we go.

Ben: the tough thing about matrices is that they’re so damn complicated. There’s like, matrices, you can the, the whole world into that. And the amazing thing about. Values is that they, they answer the question of if a matrix had to be a number, what number would it be? Like if you, if a matrix lost its privileges of being, of being an end by inbox and couldn’t store all that information, you have to masquerade as a, as at, at worst, a complex number. What complex number would it, what, what mask would it put on to be itself as a number? And eigenvalues are a wonderful way of, of fully answering that question is the best you can do. And that’s like, that’s a powerful idea. And, and I, and so back to viral processes, if you think about a viral process unfolding in a network, there’s a way to model it as a matrix or a network with all of the, the sort of, activation events being modeled as like basically a big matrix, multiplication, that prop that makes your state kind of, yeah, for the, I guess. Yeah, I don’t wanna, I don’t wanna, I understand that this is probably not the most intuitive way of describing it, but it is really true that if you have a large population and you wanna track the evolution of a state like a virus, you can think of that as kind of a matrix operation that acts on the system and updates it to the next step, which is like the thing spreading further.

Ben: But often what we wanna know about a virus is not everything about how it’s proceeding, but we wanna know something simpler. Like is it like when back in COVID, is it tending to spread right now or is it dying off? Right? And so it turns out that you can compute an eigen value of a suitably defined operator or, or something that will answer that question.

Ben: And so when you’re trying to run a viral contagion, as we are at refine to get more people aware of our product, we are trying to get the viral coefficient, above one. And

Seth: Right. Okay. So yeah, so tell me what, what’s the special thing that happens when an eigen value goes from below one to above one?

Ben: Yeah, well, let’s think about numbers, right? I said so, sowe have this, this process that we’ve now distilled down to one number, the viral coefficient. And we’re, we’re doing that process, namely the next step of the, of the epidemic over and over, right? The next moment when the epidemic has a chance to do its thing, and mathematically taking a time step is applying the, the operator of the epidemic’s behavior to the system.

Ben: So you have a system you hit it with, you say, okay, one more time, step. When we compute the, the eigenvalue kind of captures just the overall extent, captures how a number. And if that number is above one, it means every time it acts, that process tends to expand the set of infected people. And so if you’re doing it over and over, you think of a number greater than one, like two.

Ben: If you keep

Seth: One of my favorite numbers greater than one.

Ben: Excellent. My, my favorite. Um, if you have two and you keep hitting it, that is multiplying it with two, you keep getting bigger and bigger and that’s exponential growth. And it’s, it’s actually, it actually works with 1.01 as well. Right. And so if you, the la the largest iGen value of the propagation matrix captures exactly that.

Ben: Is there, when, when you keep hitting that system with itself again, does it behave like raising two or 1.01 to higher and higher powers? That’s when you have expansiveness, that’s when you have viral spread.

Seth: if my eigenvalue were 0.9, my viral spread would be I contaminate 0.9 people who contaminate 0.9 people, and that adds up to a finite amount instead of everybody gets it

Ben: Exactly. And so,

Seth: now, tell me what a complex eigenvalue is.

Ben: no, not today, but I will, what,

Seth: It’s not, it’s not, it’s not an, it’s not an interview on Justified Posteriors if the guest doesn’t refuse a question.

Ben: But, but, I will say is that I, what I, what I taught in my undergrad class, what, the way that I sort of like, like tried to get them, maybe even a little more excited is, you, when you think about that tipping point 0.9 to 1.1, it doesn’t look like a big deal. Um, locally, it doesn’t look like a big deal when you super zoom in on the, on the process.

Ben: But when you look at the process’s overall behavior, it, it makes a huge difference. And so what I to what I tell the business minded undergrads that I often teach is, if you’re running, and this was always just a fanciful little illustration to me, if you’re running a company and you’re running a viral promotion, you really could, you might be willing to invest a whole lot of money to move that number only a little bit because

Seth: Infinite return, dude.

Ben: yeah. If you, if you can push it, that’s where the returns to that are very big. And so we’re, and I amusingly, I think we’re right there. I we’re, I think our viral coefficient for this referral program is just about one. I can talk about some subtleties of estimating that, but that means, one of, one of the ways that we wanted to build it is we have that to have prices in there.

Ben: So the, the, the rewards you get are a price, right? And we can in principle give you, give your give, change the price, give people more free stuff or roll lower, make it an introductory offer with a, and those are the things we can tune to change the viral coefficient.

Andrey: And I guess the other thing in practice to remember is that the viral coefficient isn’t constant.

Seth: Ah, right. So does linear algebra describe the world when it’s like a first degree Taylor approximation? Actually.

Ben: Well, the beauty of, yeah, the reason it’s not co like yeah, it’s not constant over time. And one of the reasons it’s not is because as your contagion pro propagates through the network, it’s hitting different people. Right? Um, and that’s definitely something that of course as Andrey as, as you both know, and Andre, and I have talked about is that the selection of people as any kind of, of social phenomenon, like a an advertising campaign is progressing.

Andrey: I.

Ben: getting as the next rung is, is different. And eigenvalues actually do capture that from a nerdy perspective. Like if you just had to the, if you teach the simplest possible model where you just, like everybody has three friends and they infect these three friends with some probability, there’s no room for heterogeneity.

Ben: But if you take a whole network, then actually the heterogeneity is in there and the heterogeneity is, is exactly captured by it. And so in some sense, the largest eigenvalue will tell you the average of this across the whole network. So there are tools, of course when you’re doing it in real life as I’m now you’re just tuning the knobs andyou know, doing it in a somewhat less scientific way.

Andrey: But I’ll, I’ll just say that like after this podcast airs, will have been infected, so

Seth: Yeah. Oh man. Your I, dude, we’re getting your eigenvalues up there. We’re boosting your eigenvalues as we speak, dude. Okay. So we, we talked a little bit about, contamination of like viruses, but now let’s talk about an even more insidious form of, viral contamination, which is the idea or the meme, which contaminates us with, mental illnesses such as good taste in movies.

The DeGroot Model of Social Learning

Seth:Um, I guess if we were bringing these ideas of linear algebra to, social learning, we would think about this thing called the DeGroot model of Social Learning. Can you tell us a little bit about what that is? And then we’ll kind of build up to why wouldn’t that be a good way to learn, and how will AI help us think about that?

Ben: Yeah. So the DeGroot model is just, and I, I, I used to call it the averaging model of social learning, is actually what I worked on with Matt Jackson when I came to him as an undergrad. Um, at Caltech in 2006. I, like many other had rediscovered. Um, the dud model just says, you form your opinion tomorrow by taking a weighted average of what your friends think today. You can forget the weighted part if you, it’s not that important. So I just look around and my friends, I say, what are, what do they think about whether AI is good for humanity or whether, whether, you know. Um, you should throw away all your black, spatulas because they have toxins in them. And, and then for on issues like that, people form sort of an opinion by, by social communication.

Ben: And the DeGroot model is the simplest possible model. And we can come back to this. It’s, it’s one that economists actually don’t tend to love when they first encounter it because it is extremely simplistic and kind of, robotic or animalistic. You just, you just take the average. And if you have a bunch of people doing this, that can be summarized with beautiful linear algebra, which is actually exactly the same math, more or less as the math that you do for Markov chain theory. So, that’s for the nerds. But sociologically it’s interesting be because if it, because you can immediately start asking questions like, will a population of people updating this way reach a consensus and will that happen fast or slow? And will this consensus be right or wrong? And it sort, it gives this tool, which is like a pocket calculator that, that, um. Anyone with a reasonable applied math, education could, could have reinvented as in fact many people, including me, did. And, and then, but you can immediately take it to also, I think one of the reasons it’s been, so popular in economics is just it gives you a lot of ways to ask simple questions and get answers, which is something the, I can talk about it, but the standard economic models of learning don’t actually tend to give, many answers in networks

Seth: What would a large versus a small eigen value in a DeGroot learning network mean?

Ben: so in the, the first eigenvalue, which is the first one people talk about, the biggest one happens to always be one for a DeGroot model, which captures the idea that everybody is averaging. So in some sense aren’t getting, there’s no natural amplification or shrinking in opinions, because if you’re averaging, that’s sort of like the, there’s an eigenvalue, which just captures that fact

Seth: There’s no way for our opinions to fly off to infinity. I guess maybe if I was like negatively waiting you could that happen?

Ben: That could happen actually, but yeah. But if you, but with normal, with sort of the, the first, the natural assumptions on weights, things will tend to stay confined

Seth: know. Having negative weights on some people’s opinion seems pretty natural to me. If you’ve been on Twitter,

Ben: I have an under, I have a brilliant undergrad thesis student right now who’s studying

Seth: ah.

Ben: negative weights in the root model. But, yeah, so, but there’s a, another eigenvalue, the second largest. And what that captures is, is a society converging fast or slow. So the second largest eigenvalue of an updating matrix, if it’s really close to one, that basically means that. You can, you can start people off. And even if the society is connected and people will eventually be tending to the same opinion, if they talk for a million years, it really will take a million years. They, the, the being close to one captures their being. And it turns out, as Matt and I, Matt Jackson and I discovered to re relate to this phenomenon of homophily, that if your network is basically if and only, if, the only way that can happen is if there are divisions in your society where people put very little weight across Democrats and Republicans or whites and blacks.

Ben: Uh, andso if that happens, you can converge really slowly and if it, and if the second eigenvalue is, know, not too big, like 0.7 or 0.5, then disagreement is gonna decay like what you Seth was saying before, 0.5 to the end, right? So it gives this beautiful one number measure of the slowness.

Andrey: what if, what if, one of us was very stubborn and just didn’t really care what other people thought about them? Would their opinion end up dominating the entire belief process, or were they just washed away in the average?

Ben: Oh, if, yeah, so, so if there’s someone who’s super stubborn, they don’t listen to, the extremists, they really don’t listen to anyone. They put all their weight on themselves and

Seth: Those are, that’s our rival podcast. Dogmatic posterior.

Ben: Exactly. So, yeah, so that’s, that’s a way to be very, that’s a way to be very influential. In fact, at the extreme, wewouldn’t even call that society connected because this one guy’s not really connected to anyone.

Seth: It might be connected out. I don’t know. Maybe.

Ben: yeah. But even if he puts a tiny little weight on others, if he’s stubborn enough, he’ll still dominate

Seth: And would that be bad?

Ben: usually. But unless he’s very, well, unless he’s very well informed, unless he, and so yeah, we, we ordinarily consider that bad because. A benchmark we like to, in a realistic case, we like to think about is information is dispersed. Everybody. Nobody know. Nobody knows God’s truth. Exactly. But everybody has has reasonable Yeah.

Ben: Nobody has

Seth: The average of this room knows God,

Ben: Exactly. Exactly. We do. you, if you could take, if you could take the God’s eye view and look at everyone’s information together, it would be enough to tell you like a whole, whole lot. But nobody, but everybody’s individual estimates are pretty, are pretty noisy. And so now how do we, how, can decentralize social learning, which DeGroot is supposed to be a simple model of get you to that.

Ben: Well, it really depends on whether one guy monopolizes all the influence or a few guys or, or di, whether influence is dispersed.

Seth: As, as the population goes to infinity, do we have, influential nodes, right, is the way you put it.

Andrey: So,

Seth: gonna ask the LLM question? Andre? You go for it.

Andrey: one second,

Seth: One sec. We’ll get there.

Cow Tipping and False Beliefs

Andrey: Ben. I don’t, I don’t know if you remember, but we, we’ve actually done a podcast before.

Ben: I was thinking about.

Andrey: Now. In that podcast we discussed the interesting phenomenon of cow tipping and how people seemingly believe that this is a thing that one does, even though no one actually goes cow tipping. So my question to you is, the past since

Seth: Thanks for ruining the joke, Andre, for literally everybody.

Andrey: Uh, in the past, year since, since we’ve done the podcast, have you noticed any social learning on this topic? Is it now understood that cow tipping is not a thing or is it still a belief that’s propagating

Ben: That’s very interesting. I have stopped using it as a, I, I somehow found that I have not used it as an undergraduate teaching example since COVID, now that you bring it up. So one thing, something happened to me during COVID teaching. I was teaching my, this was the last year, 2020. I was teaching the last undergrad class I taught at Harvard in fall of 2020. And it was a wonderful group of students actually, but they were all dispersed. Some, most of them at their homes. A few of them lived in like group houses with other students. And I was doing the cow tipping lecture in the way it goes. Just for the, to a little more context. Yeah. So like, it’s a great,

Ben: how many people know what cow tipping is? One thing I’ve noticed by the way, is fewer hands go up because I think Varsity Blues and that generation of movies was an important, was the way that it got into the culture. And kids these days don’t have an, watch those movies. So I don’t know whether they’ve been exposed, but, but these kids sort of knew, they were like, I was like. I asked, the usual question is I asked some factual questions about it. Like, what do you think is the prevalence in the United States? How many incidents of cow tipping have there been in the last year? And people will say, very few people will say like a firm zero. Um, but in the Zoom class, one of the students, they had their, like, their apparent or a relative in the background, and they were like, no, cow tipping happens.

Ben: I’ve seen it. So then I had to, like, in the middle of my class, I have to interview this person to, assess like whether my whole understanding of things is wrong. It wasn’t a very exciting, I was like, well, did you see it? Like, what, what did they, what did

Seth: Is the cow tipper in the room with us right now?

Ben: exactly, they were like, they were like, well, they, they were drunk and they really like ran at the cow and they hit the cow.

Ben: And I’m like, then what happens to the cow? And they’re, I don’t know, I ran away. So that’s the usual, that’s like

Seth: Are you saying that, the eigen values of the cow’s response to tipping are less than one? Is that,

Ben: Exactly, yeah. Is I, values are very important in mechanics. So. But for the other piece of context, en engineers have written papers kind of proving that you can’t under reasonable assumptions, like, knock over a cow with your shoulder or

Seth: are you gonna tell us that Santa’s not real, dude? What is this podcast about? We’re just killing people’s joy. Or, anyway, I’ll let you finish your example.

Ben: In terms of false beliefs, I think things are bad. I think my, my naive sense, it’s very hard to know ‘cause we don’t, you have to really study it and scientifically, but we had like a, since my wife and I have have, had a baby, we’ve interacted with, like, we had a baby nurse live with us for three years and she, she was from a very different community.

Ben: You know, she’s like, and I heard things her friends were saying, and beliefs and my, my sense is that. Strange beliefs about matters of fact are very much out there. And, and I, and I feel like TikTok, I think like TikTok propagates them actually in a way that’s more powerful thanany vector I knew that I personally experienced.

Ben: Like when I was in high school, for

Seth: Is that interesting? I mean, is that surprising from a DeGroot perspective? ‘cause it seems like in from a DeGroot perspective, you get communities with weird beliefs ‘cause they’re disconnected. But now the statement is they’re connected and that’s giving them weird beliefs.

Ben: I think what the basic DeGroot model is missing is that people talk about things very, that that people’s propensity to, to. First of all, I don’t think like these beliefs, like claims of cow tipping or other urban legends or, or wild statements about what Hillary Clinton does recreationally are like, I don’t think they’re like deru where we average what people think.

Ben: You just propagate interesting information. And I think what the DeGroot model is really missing and a lot of models of social learning is that what people share depends a huge amount on whether they think it’s interesting and like surprising and much less on whether it’s true. And moreover, people don’t adjust for that when they hear, right?

Ben: Like Tyler Cowen might, but most people, they’re not, they’re not aware of that bias in the information they’re hearing. And so they’re not, adjusting their posteriors. They’re just kind of accepting, you know? And, and so I, and I think TikTok has made it much more power, much more, much more viral to say something really interesting and get it into a lot of minds.

Ben: And that’s more like a yes on or off viral state, not like, do you believe, not like. What, what do you think the interest rate’s gonna be next, next quarter, but more like, do you think that people really landed on the moon, like a yes or no? Or you do you believe in some crazy conspiracy that’s like, like more like a virus that takes hold of you and it’s not a matter of degree of belief.

Sequential Bayesian Learning and Herding

Seth: Well, so if people, if people aren’t good bayesians, another model that you’ve worked with is called, the, or sorry, I guess a Sequential Bayes. If people, if people aren’t learning this connected way, maybe they’re learning in this kind of sequential, sort of herding-y way, which is sometimes called a Sequential Bayes model.

Seth: Uh, Andre, are you gonna let me move on to this topic? Or you wanna jump in with something?

Andrey: make a, I wanted to make a very brief observation since we’re talking about this. I happen to notice a book in the, in the background of, of Seth, actually The Hype Machine, which is

Seth: My machine with Ana roll. Yes. What’s, yes, what he says. It’s, it’s not true. Things that spread. It’s, novel and emotionally intense things that spread. So shout out to, a friend of the show. Sinan Aral.

Seth: All right. So, yeah. All right. So pe, so pe No, that’s good. No, that’s good. So people don’t learn in this connect way.

Seth: Maybe. Maybe, maybe they just see what the last guy did and try to figure out the state of the world from that. Is that a better model of what you’re describing, or is it also wrong?

Ben: I think what I’m describing some, some, like, having in mind intending to propagate, a little pellet of false information, like people tip cows. I think that’s just like a virus and that’s a good model. It’s also not be irrational. I mean, I think there’s some rationality to it, but I think the best model of it is like, if it’s interesting enough, it goes viral and a lot of people believe it, but Seth absolutely, like the models, Bayesian sequential updating where you hear something. I think where that model really shines is in thinking about something like, which, you know. Should I get, should I get flood insurance for my house or which accountant in our, there’s like three accountants in our industry and which one should I use? I think there, people think very much like what that model posits, which is I could research this, I could get my own signal.

Ben: I don’t have any special confidence that I would be particularly good at that. And this other person, I know that what they, that they’re not probably acting on amazing information either, but it’s probably still got a little more information content in it than mine. And let me just, so let me just follow and so you end up with a lot of like in economic context that I think are important.

Ben: I think the, the choices people make about insurance. Like when I talk to people their, who thought their whole lives about do people buy enough fire insurance or flood insurance or whatever, they basically talk about it like a social convention. And so you, you buy some and you don’t buy other, and you don’t buy stuff that people around you don’t buy.

Ben: Not because you’ve taken any time to analyze your personal, portfolio problem, but just because you assume other people have it like. That the social signal contains more information than you’re likely to gather.

Andrey: There’s also an interesting aspect of it, like if you follow the herd, then even if it goes wrong, you’re like, well, who can blame me for, for doing that? Right? But if you go against the herd, like, oh, that idiot didn’t buy insurance. Like he deserves what he, what he got. Right?

Seth: You have to get an awful, strong signal.

Ben: in a business context, right. There was this saying nobody ever got fired for buying IBM because, and that was exactly hurting on IBM, that at the, are you gonna really get blamed for using the same vendor that everybody uses?

Seth: So, how does, so is, is that great? We all coordinate on doing the right thing, or can that fail somehow? Why, why wouldn’t that be a good approach to learning?

Ben: You absolutely get big. I mean, the main was, the main first result about the herding model is that you can get quite dramatic failures of information

Seth: Oh no.

Ben: Where? Um. If people did experiment, if people, if we could ask like the first a hundred people to make this decision to ignore the social signal or just deprive them of access to other people’s past choices, and we made them decide based on their private signal, then we’d get a hundred hunches aggregated, and that would, and then after that we’d have a hundred people’s information, averaging into some vibe about what the sensible thing to do is.

Ben: But, but the sequential model shows that if you, if, if the first people already are contaminated by having access to previous decision makers, it’s just rationally they won’t get this started. So you have a kind of tragedy of the commons where collectively, we could like. Maybe compensate the first movers or just pick some of us to be unlucky and have to make this decision solo. And we would, society would learn a lot that way from, but, but what we in fact do is just, herd and actually online platforms spend a lot of energy thinking about like how to get enough experimentation going on. You know, should Google re Google Maps recommend, shortcut that it doesn’t think is the best to learn about it, should Yelp send people, try to send people to a restaurant that it doesn’t think is the best to get more information about it.

LLMs and Information Aggregation

Seth: How does LLMs change all this? Alright, so I’m kind of split ‘cause I kind of feel like these two models have different implications for whether it’s gonna help or hurt with aggregation failure. So help me out with this. It seems like in this sort of sequential Bayesian framework, LLM sort of should hurt our information algorithm, aggregation, right?

Seth: Because, nobody is in the position of being ignorant. We can always just question the model. The model tells us what the last hundred people did. Uh, we’re gonna herd harder by virtue of all having, none of us being in that state of ignorance, that state of blissful archipelago ignorance. Do you think that that is a mechanism that’s potentially at play?

Andrey: Wait, Seth, can you just clarify something? Why

Seth: Please,

Andrey: LLM tell you what the last a hundred people said necessarily? I,

Seth: it’s gonna tell me what the last hundred books written about the subject are. Let’s say.

Andrey: I mean, we can take that as a premise. I’m not sure if I’d buy it, but,

Seth: I mean, well what are they? They’re based on, this is what I’m trying to say is LLMs are based on the things LLMs have read. Andyou might say maybe this is a version of model collapse, right. LLMs are based on the last hundred on some thing of some of the last things. The LLMs read

Andrey: The last

Seth: just the last hundred tokens.

Seth: And then, somebody reads that and then they write a book based on having read the LLM. And now we get herd to whatever our opinion was in 1850.

Ben: What do you think buying it?

Andrey: no, I mean, I just, I, I guess it depends on the decision, right? But to, to the extent that models are able to reason and to the extent that your,

Seth: What if it’s a pure fashion question? What if it’s, what if it’s just black shirts are in versus white shirts are in? Could it, could it lead a stronger herding there?

Andrey: Well, it would rationally know that you don’t wanna wear what everyone else is wearing. Right. I mean, I mean, there’s a, there’s an element of like, that it can really be, have a lot of context about you, which is different than else.

Seth: Yeah.

Andrey: that’s, that’s the aspect where I’m not exactly sure that that’s how we should model it, but I’m happy to consider that version of the model.

Andrey: Sure.

Ben: Um, yeah, I’ve never thought, I haven’t thought about it in a sequential learning setting exactly. But I think there’s a different, a different dimension which seems related and important, which is like a narrative that I’ve heard repeatedly and that I think has a lot of truth about what’s happened to western society and politics is that there used to be, a focal provider of, of focal baseline, of facts, basically

Seth: Catholic church.

Ben: well, I would say the six o’clock news,

Seth: Six. Okay. All right. I always wanna go. I always wanna go back to Habsburg times. Dude, you can see this is my Habsburg wall.

Ben: I don’t know. I, and I think this was probably a unique moment because I’m not sure, I think that, that the newspapers we should ask like, Gentzkow and Shapiro about, newspapers in 1900, which was I’m sure a very different, environment with all. But like, there’s this moment which is now kind of seen, which is, valorized a little bit, that there was the, a national truth and you could, you had to get pastsome regulatory, there was regulatory exclusivity for the major broadcasters and basically nothing too crazy.

Ben: You could get broadcast too widely that Right. And then we move to this TikTok world where, where it’s a free for all. And, and it does seem like, that has some, the breakdown of a shared reality seems like an, something that’s happening to some extent and now coming like. ChatGPT. It’s, I think it’s a real empirical question.

Ben: To what extent in normal people’s normal lives does that serve as like the six o’clock news? Again, the coordinating device. Um, if you’re debating something, my wife Annie, who’s, who’s a also a Northwestern professor, had a hilarious story at a dinner she was debating. She went to MIT and she’s a big MIT snob and always reminds me that Caltech, where I went to for undergrad is way worse and is like way less cool.

Ben: And so there was, but to my surprise, her dinner can be, I wasn’t at the seminar dinner, but a guest of ours thought that Caltech was great. So I was like, the kids, it

Andrey: To.

Ben: and she was, yeah. And she was like, and he was like, wait, are you telling me that if you ask, you ask 10 people, they’ll all who, who care about this?

Ben: They’ll say that MIT is better. She was like, yeah. So of course they took out ChatGPT and that settled, and she,

Seth: Pirate, get John Horton on the phone. Tony Stark went to, Tony Stark, went to MIT Dude, that’s what people know about.

Ben: So I thought that was, and I think that’s gonna happen a lot around a lot of dinner tables and kind of, it has an effect. I, I think of it as a shared, I think of it as a powerful shared signal. Um, andI think that really reshapes things, in, in a lot of different ways. Um, that’s the main way I’ve been thinking about it.

Andrey: You know, it’s, it’s funny ‘cause what I, my very opinionated bias take is that the average quality of the undergrads atCaltech is obviously higher than at MIT in my experience, and I think a lot of people who know would agree.

Ben: Yeah, I think that’s, I think she’s been a little bit per, I think she’s been a little persuaded over time because my, my, my good friends, like the, the relationships I’ve kept from undergrad are, um. John Schulman, who was a, who was there, were two of the biggest ones. Or John Schulman, who was a, one of the, was maybe the, is often credited as being, a creator of chat, GPT andAdam D’Angelo, who’s, who is of course the co founder where I worked and and is a, is a very big figure in ai and I think that does you, there there’s a sort, so I think that’s made a, made an impression actually that there’s some kind of person that the place was good at incubating

Andrey: So

Seth: so

Andrey: is all listeners. This is actually all a ploy to get John Schulman on justified posters.

Seth: come on.

Ben: those two are Caltech alum in case it, it was not.

Seth: Uh, so, okay, so, so let me, so let me take that argument a step further. So, the way we should, one way to think about LLMs in the social information aggregation function is as being a central node that all of us are connected to. Um, we, you just reminded us that in these DeGroot models, having, influential node in the long run means that influential node gets to, set a little bit of the opinion and it might not just be the average of everyone’s opinions.

Seth: Is the concern there, or is the observation there that, whoever ends up controlling the most important three LLMs ends up having a real thumb on their scale in the opinions of society.

Ben: yeah, exactly. So, it’s funny when it, when Matt and Jackson and I were working on this in 2007, 2008, were very, the ba the basic first observation is exactly what, what you said, that if one person gets a lot of weight, they’re gonna, their errors are gonna matter. They’re gonna contaminate everything.

Ben: And so they’re gonna prevent, even if society as a whole has the information collectively to wash out all the error, the fact that this guy talked in a way, first or talked loudly, means that everybody’s going to be influenced by whatever. That note says, but there is an exception. Or when you try to prove those things mathematically, that’s not necessarily true because something that can happen is if that note is very good at themselves being an aggregator, and it actually does, it figures out the right information.

Ben: Um, and rebroadcast, that’s also one of the most efficient ways of figuring it out. So I think

Seth: A

Ben: the

Seth: post, a reliable pollster.

Ben: Exactly. And so the selfer, there’s something irritating about the Selfer, way in which some of these AI companies regard themselves, or it’s like that they, thinking really earnestly about stewardship of, of, the model’s preferences or whatever.

Ben: But I actually think this, that, it, if the model is say left bias, this liberal liberal bias, then that’s gonna, um. it into a lot of opinions andthat matters. And so they, they should think about it. And I, I do actually admire efforts that they make, to be basically good aggregators, good pollsters.

Ben: And interestingly, like before we could have pollsters on a few issues that you could distill numerically, but now this is a pollster that kind of up internet text about anything. It’s like a qualitative pollster, which is a really remarkable kind of device that we couldn’t have imagined when we were writing those papers.

Seth: Should we be RLH fing these models so that they have the median social opinion on all social issues?

Ben: What does that even mean? Right? How do you

Seth: I, you go to Pew and it says, the median person thinks abortion should be legal at 27 months. Whatever. What? Sorry? 27 months. 27.

Ben: But even that,

Seth: 27 weeks. Okay.

Ben: didn’t like. The interesting thing is that the LMS are doing their own embeddings of these issues into their, so people will just talk to them and say, and talk about abortion in a way. They’re doing an averaging but not one that’s, that’s, that’s numerical one that’s qualitative. And, and I, I kind of like it that way. I, I, I don’t think people have coherent views on almost any issue of public interest. And so if you try to make it numerical and try to average it that way, that would be like garbage and garbage

Seth: Right.

Ben: and.

Seth: Trying to recreate the mind of the median American voter will make you insane.

Andrey: I, I really wanna go back now to this personalization aspect of things, right? Um, it, especially with something like Chad, GPT, I don’t view it as a monolith. There is a model router involved. It has all your previous conversations. And if me and you asked it a question, and this is an interesting, it would be an interesting empirical exercise actually, is like. We might get a very different answer about like, is it, is it, normal to, I guess, I guess it depends on what we’re asking. It’s like one of the things like for myself, like, is it, should I wear a hoodie to a business meeting? Right. You know, and it might give me a different answer than you guys.

Seth: Did play League of Legends during the business meeting.

Andrey: yes, yes, Uh, but, but if I ask it, what does the average person in society think about this question? We might get the same answer, but I don’t know, these things are a little unpredictable in this way. Right.

Ben: Yeah, and there’s a bunch of

Andrey: I.

Ben: papers just suggested by what you just asked, right? If people, because of course the system prompt. If you’ve done a, if you’ve now had your custom prompt, all bets are off because you could, you could ask it. Please don’t tell me. Things that might upset me with this mental illness that I have.

Ben: And then they, we wouldn’t get probably accurate answers on, on if it’s really, then it has. So yeah, people do get, the personalization issue is super interesting. but for now, yeah, I just wanna make the point for the moment that as a focal before the market has matured to the point that there’s a niche little LLM for everybody, these items are actually new kind of animal in the, they’re not like Facebook, they’re not like they’re, they’re a new kind of sort of public object that everybody interacts with.

Ben: Um, and despite the heterogeneity that Andrey said, they, that’s, that might shift things in a way closer to a, a, a former time.

Seth: Or will people just all choose, I’m a lefty going in, so I’m gonna use lefty, LLM, and you’re already going in. You’ll use righty. LLM.

Ben: Right. But it is, isn’t it remarkable that gra, I mean, there’s like a popular Twitter joke, but after trying, after trying to train the wokes, the, sorry, the, the anti wokes, LLM imaginable, it has like, it has like wine mom views, like

Seth: You can only, you can only, you can only, right wing eyes, the LLM so much.

Ben: Yeah. Except on the rare, like, it’ll say, it’ll occasionally say Hitler is great, but other, other than that, it’ll like,

Seth: Only when it’s role playing.

Simulating Social Learning with LLMs

Andrey: has anyone tried to

Seth: Ooh.

Andrey: some of these social learning games with LLMs?

Ben: yeah, that’s, I, that’s a great, I I’ve been trying to learn, keep track of this. I, it’s been proposed to me by students. Um, and I know that there are people. That. So I was gonna say that when we, ‘cause before, before the podcast, we’d sort of discussed, some topics, and I’ve been thinking about this one that like, how will it affect social learning?

Ben: But it made me think, how will it affect studies of social learning? And now you can, you can, implement, you can simulate it, you can, try to forecast how groups of people would behave. And it’s interesting because people like John Horton have done studies of how good is it as a simulator of a, of an individual. the question of how good is it as a simulator of a community, would be super interesting. I think just intellectually, I’m sure people are doing it. I’d love to, if people listening are aware, I would love to like tweet it at me or something.

Seth: You heard it, folks, dm d dm, Ben, with all of your, simulation ideas

Andrey: yeah.

Ben: tweet.

Andrey: Well, I, I guess theclosest thing that I

Seth: posted on our Discord I’ll, we’re at the, we’re at the end.

Andrey: Yeah, is the, is the AI village, know, where the, there are like different ais, different models, and they’re like co cooperating, slash they’re given a task to do and they see if you can do the task. And some tasks are like, can you sell a t-shirt online?

Andrey: Or something like that. And it’s hilarious how they try to cooperate with each other and all their foibles andso on. Uh, which is kind of not narrowly the, the specific formulation of social learning, obviously, but related,

Ben: Yeah. Yeah.

Lessons from Quora and Startup Experience

Andrey:so one, you, you mentioned, your friend Adam D’Angelo. I’m curious what, what you learned, at Quora, that you’re bringing to your current startup experience, or alternatively what you learned at Quora that you brought to your research.

Ben: Yeah, that it was such a formative time that I really didn’t understand at the time, how important it would be in my life. That I think the biggest thing, I never thought I would, I never expected that I would do anything entrepreneurial just because, I think that for one, I didn’t expect that there would be a technology like AI that would be kind of like, have the exact shape that, that is, has been important for, for me to be able to actually try to do something, at the technological frontier.

Ben: But at that, but I was, what was remarkable to me is that I

Seth: Thought you said linear you, I thought you knew that Linear algebra destri describe the world and you’re the king of eigenvalues. Come on, dude.

Ben: No, but I guess I never had that deep faith or I thought it was a few steps away that I was upstream in

Seth: Mm-hmm.

Ben: the innovation

Seth: Fair enough.

Ben: of commercial applications. But I remember, like, it was huge for me that they, that they were, that Adam’s always been very interested in economics. He just reads, like he reads texts on industrial organization recreationally.

Ben: And, and I think he had, he always had this respect for economists. Um, that was very, and and so he would, we would just occasionally chat about things often through the lens of economics. And Quora had some specific, he had some economic ideas of for, well, one thing I did was moderation. ‘cause I was just a very active user.

Ben: So I was involved in kind of, some of the housekeeping of the moderation operation, which I actually wasn’t good at. So I, my, at the time, the interesting thing is I wasn’t like, I wasn’t a good community community manager and but, but when, then, when I was in the company. Adam got curious about this idea of credits and actually having an internal currency, and that so that people’s like, basically so that the scarce resource of some people’s attention, like, especially on early Quora, a lot of the answers were written by really visible people whose, who were, people were very excited to see them there, but their attention was scarce.

Ben: So how could you efficiently bid for people’s attention? You wanna create some kind of token, right? And so I was just like the consultant who, thought about the very basics of the design of that system, like the central banking. How much money do you issue it? How do you, but that was what I did. but what I learned was actually like just getting to watch a startup. And it was right at, when I joined there were about, I think 27 people. And so seeing a startup at that stage, I learned a huge amount about. About running a business andespecially in tech, I think the strongest, people often say that startups are like a magnification of the founder’s personality. Um, and I think that’s really true in this case. ‘cause,

Seth: Getting, getting how, how, frustrated it, refined was with some of my notation where it was like, you called this a node. I, it took me a while to figure out what you mean, but I would not call it a node. Uh, your personality really does come through.

Ben: it’s funny because, yeah, I’m very, I’m very pedantic. I, I’ve spent, I, I, I feel, yeah. So I’ve created, and Adam is very, very thoughtful and deliberate and kind of like likes to make decisions with principles and in a thoughtful way and make decisions, like I think a lot of good, good leadership skills, like focus on, focus on one focal goal at a time and and. Propagate that and communicate that. And then, think really thoughtfully about design The core was a very design first company andmaking design decisions, not as an afterthought, but as a core thing. I think there were a lot of those like principles, I think similar to growing up in families, like there’s just certain values that are embodied in where your environment.

Ben: And when I was there, like I realized after that I, I’m a pretty good sponge and I wasn’t directly involved in any like, decisions having to do with design, but you know, the guy I sat next to at Quora was, Joel Lewenstein, who’s now the, the head of design at Anthropic. And I can, and like, but I didn’t, I think what the amazing thing is, it was this like, combination of amazing people and all of them were really thoughtful and really good at what they did.

Ben: And they talked about startup uping in a very intellectual, thoughtful principles first way. And so that when I, I, when it came time to think about a business, I felt like. That was a natural way to be, and I realized I never would’ve had the, that kind of, those kinds of vibes, if not for those six or eight months that I spent there.

Andrey: Very cool. Um, do you have any thoughts about why more companies don’t use virtual currencies and have you thought about the use case of virtual currency for internal allocations of GPUs?

Ben: Great questions? Um, I think virtual

Seth: You imagine going to Walmart and they tried to pay you in Walmart coin instead of money, people would riot.

Ben: Yeah. Well, but you could, I mean, internal currencies. I think one of the problems that, I wasn’t around when Quora eventually decided to get rid of them, but I think one of the problems is that, um. Currencies are focal and they create people, they, they motivate people to do things in a way that they sort of take up too much oxygen in the ecosystem. And so when you’re designing a social product where you want many kinds of incentives to be in balance, having a currency can actually be harmful to the, it’s a kind of a sociologist insight, but like, so I think there’s some of, I think you have to be really, I think for platforms where that are truly transactional and economic currencies are always good.

Ben: And usually that currency becomes money. ‘cause it’s gonna have an exchange rate with real money

Seth: Right.

Ben: Um,

Seth: Love one price.

Ben: yeah, but for, but I think for, for. It is, I think it’s an interesting phenomenon that needs to be thought about more. Why it’s not, why it’s really generally not a successful route for social for internal markets. I, I’m very, I I believe that some of the obstacles to internal markets are just frictions having to do with like, basically contracting frictions. Um, and one thought that I have had for a long time actually discussed with, we had some there. Let me just, I, you guys will edit. Let me just say that again. One thought I’ve been thinking about for a long time is just as contracting intermediaries. Um, and

Seth: This is a big theme of the

Ben: Andrey

Seth: Coasian Singularity Dude.

Ben: Yeah. This is Andrey’s paper.

Andrey: Yeah. So what, what is your thought about this? Yeah.

Ben: I’m very curious, so I’m very curious for your take on it since you’ve thought about it much more seriously now, but it just, yeah, I think I feel like. A lot of the details were just like implementation details, that if it became your job to implement it at a company, you would, you would decide that it’s, you’d have to really have a high valuation of the marginal allocated efficiency of that currency. And it’s arguable that it’ll, it’ll be, it, I think experiment experimenting with it has just become way more valuable once we reach the LLM, capability of being trustworthy to like, negotiate a contract, which I think honestly is not right now, but yeah.

Ben: I, I see that as a potential, a big organizational impact. I’m very curious what you think.

Andrey: I mean, surely the contracting aspect would be hard. but I also think there’s a social aspect to it as well, right? You’re the CEO, you create an internal Coasean internal market for GPU resources, then you suddenly see a team that you don’t want using the GPUs, using a lot of the GPUs. Now, what do you do

Seth: The whole point of, yeah, the whole point of having a firm is to have a command DI economy. If you wanted everyone making independent economic decisions, you wouldn’t have a company right.

Andrey: but there’s a sense in which there’s some optimization that you want your teams to be making, like leaving idle GPUs or they’re using them very stupidly for some reason, and you don’t, you want that to be kind of disincentivized and. The way it’s currently done is through these very imperfect monitoring systems and people asking very nicely, can I have, this resource?

Andrey: Right? So yeah, I’m, I’m curious whether the, the AIs can do a better job here.

Ben: Yeah, I mean I guess the, you might shortcut you, they’re also becoming better at being the arbiters of requests. Right? So maybe, maybe rather than, but, but I do think money is, one memory I have of Quora actually is that the engineers, they hadbrilliant young people and I very like. Who were first principles thinkers too.

Ben: And so people would ask me also, I had to just like justify money to the whole, to like the skeptics in the whole company. And so I gave, gave a lot of thought

Ben: Yeah, why don’t we have some more multidimensional expression? Right. And there are good answers to that. It’s like very helpful that money is very legible.

Ben: That, but, but I guess we, yeah, for companies, I’m very much with Seth’s point that if you really believed in the power of the, of monetary incentives to, to do it, you, you wouldn’t have a company, but you may find it a useful tool within the command. I mean, even, even the command the North Korea has has currency, right?

Ben: So like it’s definitely a tool. And I think with the Pareto frontier has changed, but I don’t know how

Closing

Andrey: Very, very cool. So, we’re just about out of time. Uh, is there anything either of you want to add to our conversation?

Seth: Ben, do you have any good eigenvalue jokes for us?

Ben: oh man, I should have prepared.

Seth: Alright. We had Ben Golub today who’s made tremendous strides in automated paper reviewing and still has a lot of progress to be achieved on automated Eigenvalue joke, doing, thanks for tuning into this episode of Justified Posteriors. Please like, share, and subscribe. We now have a hoppin’ Discord community for now by invite only DM us on substack Twitter or LinkedIn for your personalized invite code.

Seth: And why don’t you keep your posteriors justified?

Andrey: Thanks, Ben.

The Best Books Seth Read in 2025

Seth Benzell — Fri, 26 Dec 2025 17:12:50 GMT

In 2025, I read about 35 books, slightly below my targeted pace of 40. Here are some superlatives and books I highly recommend:

Best Pairing: Breakneck: China’s Quest to Engineer the Future by Dan Wang and
Natural Moralities: A Defense of Pluralistic Relativism by David B. Wong

How and why do China’s and the U.S.’s political cultures differ? This pair of books, each by a leading D. Wang/Wong, comes at the question from two very different directions.

Breakneck says the divergence is due to China having a leadership/political culture focused on engineering, while the U.S.’ is focused on law and lawyers. Dan Wang argues that the excesses of the one-child policy and lockdowns, and the failure of the US to build infrastructure, can all be understood as downstream of this decision. Dan has good conversations about this take at Conversations with Tyler and on the Sinica podcast.

Now, don’t get me wrong, the thesis and the anecdotes used to illustrate Breakneck are excellent. But in both the book and his podcasts, Dan doesn’t engage with what I see as the spiciest question prompted by this theory. Namely, when and why is it that the two cultures diverged?

If China is just on the standard Solow growth path, with a US 1950s need for engineering leadership, and will naturally converge to US 2020 levels of lawyerly leadership, this is a VERY different hypothesis than the two countries having fundamentally different moral and political inheritances. If the latter is the case, Tyler’s objection that (paraphrasing) “Chinese lawyers might just make autocracy more efficient” has purchase.

Isn’t it plausible that a mythic canal king, irrigated rice farming, and a unified empire make a society different than one downstream of Greek philosophy, chivalrized barbarians, and protestantism?

In David B. Wong’s “Natural Moralities: A Defense of Pluralistic Relativism,” this deep divergence in culture between the US and China is a central theme. D. Wong argues for a form of moral pluralism he calls “pluralistic relativism”. Under this view, dramatically different moral systems can be equally moral without descending into anything-goes moral relativism.1

IMHO, this is a very attractive move for two reasons: (1) It’s more plausible than natural law theories of “absolute morality”, while still being able to make the obvious point that some social systems are better for human flourishing than others. And (2) It’s a step towards a vision of how universalizing Westerners can have constructive dialogue with Oriental moral systems -- an essential need in a century that will be defined by East vs. West rivalry.

This brings me to why Natural Moralities is such a valuable pairing with Breakneck. According to Wong, but in my own words, the divergence between Confucian and Western moral theories -- at a deep level -- is that Confucians are shape rotators while Plato and the Judeo-Christian philosophers are wordcels (“In the beginning was the word…”).

How What You Should Do When Your Dad Murders Someone Explains the Difference Between the U.S. and China

To illustrate this, David Wong brings up a story from Mencius, which I’ll contrast with one from Plato. The question both philosophers are faced with is “What should you do if your dad kills someone unjustly?” Before I give their answers, maybe think for yourself what you’d recommend.

Mencius’ Answer:

(Note: Gu Sao is Shun’s dad)

桃應問曰：「舜為天子，皋陶為士，瞽瞍殺人，則如之何？」

Tao Ying asked, saying, ‘Shun being sovereign, and Gao Yao chief minister of justice, if Gu Sou had murdered a man, what would have been done in the case?’

孟子曰：「執之而已矣。」

Mencius said, ‘Gao Yao would simply have apprehended him.’

「然則舜不禁與？」

‘But would not Shun have forbidden such a thing?’

曰：「夫舜惡得而禁之？夫有所受之也。」

‘Indeed, how could Shun have forbidden it? Gao Yao had received the law from a proper source.’

「然則舜如之何？」

‘In that case what would Shun have done?’

曰：「舜視棄天下，猶棄敝蹝也。竊負而逃，遵海濱而處，終身訢然，樂而忘天下。」

‘Shun would have regarded abandoning the kingdom as throwing away a worn-out sandal. He would privately have taken his father on his back, and retired into concealment, living some where along the sea-coast. There he would have been all his life, cheerful and happy, forgetting the kingdom.’

Here we see a classic shape rotator approach to a moral dilemma: The state needs to enforce justice, but a son needs to protect his father. We get a compromise that hopefully leaves everyone somewhat happy -- Shun should abscond with his father, removing him from being able to do more crimes, but still protecting him.

We also get advised that, despite what we’d call a tragic clash of values in the West, Shun should still try to feel good about himself. To your taste, Mencius’ answer is either a nice compromise or a stupidity that fails to satisfy any plausible theory of justice.

Plato’s Answer:

In the Socratic dialogue “Euthyphro”, Socrates runs into a priest who has decided to turn his murderous dad in to the justice system.

Unlike Mencius, who tries to split the difference and make everyone happy, Socrates decides to confuse things further. He questions:

Socrates
But what is the charge, and what is the suit about?

Euthyphro
Murder, Socrates.

Socrates
Heracles! Surely, Euthyphro, most people do not know where the right lies; for I fancy it is not everyone who can rightly do what you are doing, but only one who is already very far advanced in wisdom.

Euthyphro
Very far, indeed, Socrates, by Zeus.

Socrates
Is the one who was killed by your father a relative? But of course he was; for you would not bring a charge of murder against him on a stranger’s account.

Euthyphro
It is ridiculous, Socrates, that you think it matters whether the man who was killed was a stranger or a relative…

Socrates
But, in the name of Zeus, Euthyphro, do you think your knowledge about divine laws and holiness and unholiness is so exact that, when the facts are as you say, you are not afraid of doing something unholy yourself in prosecuting your father for murder?”

In the rest of the dialogue, Socrates proceeds to shoot down every theory of Euthyphro’s about the nature of piety and justice.

The reader is only left with more questions: “Are the gods just because they behave justly, or is justice simply what the gods command?” and “Is piety something different than a commercial relationship with gods?” In classic wordcel fashion, rather than actually contribute to solving a social dilemma, Socrates critiques and deconstructs -- and, of course, his dialogue is much, much longer than Mencius’ answer too!

When framed as wordcel vs. shaperotator culture/morality, I think the Breakneck distinction between Lawyer and Engineer states makes more sense, and is actually a deeper and more interesting thesis. It also puts me closer to Tyler’s view that just adding more lawyers to China’s system won’t actually result in more individual protections and Western-style justice.

I think I can see the unique strengths of either approach, while still feeling secure in the fact that we each have a system that works well for us: A Western system of critique, individual reason, an openness to the idea of tragic conflicts, and an insistence on conceptual clarity and rights vs. a Confuscian system of practical problem solving at the expense of some of that clarity and Western roadbumps to ill-concieved grand plans. I hope that books like these, which attempt to see the logic in each other’s systems, can be an important step to peaceful coexistence.2

Best Non-Fiction: The Allure of Battle: A History of How Wars Have Been Won and Lost by Cathal J. Nolan

An alluring possibility: What if the real battle is the allies and economic capacity we build up along the way?

The start of this book is not promising. An overlong introduction of the main theme — battles and great generals are overrated; attrition and grand strategy underrated — which bounces between obvious and unsupported claims.

But what comes next is the greatest single-volume history of warfare I’ve ever read. A masterful tour from Marathon to the Marne, his discussions of individual wars are better than many dedicated books I’ve read.

His coverage of the evolution from pike and shot, to line infantry, to skirmishers is excellent — especially because it’s not presented as a series of innovations by great generals. Rather, the author has a fresh take focused on the interaction of generals’ desire for maneuver warfare with changing fortification and siege technologies, as well as a focus on how quickly these technologies and strategies can diffuse through repeated encounters.

The author’s main argument is simple: (1) the relative cumulative economic power of sides is the most important determinant of who wins long wars (2) long wars are expensive also, and therefore (3) Revisionist powers are tempted to plan around short wars because these are the ones that would hypothetically help them. This leads to (4) Revisionist and over-confident powers quickly find themselves in over their heads, leading them to lose long wars.

Decisive, quick, victorious maneuver warfare is the dream of a Frederick the Great, a Von Moltke, a Yamamoto. The author does a fantastic job of explaining this doctrine — the lust for a costless victory, ideally a “cauldron battle” that would exterminate the enemy army in imitation of Cannae.

But then the author makes an amazing, obvious, and yet hugely underappreciated point — why do we idealize the victory at Cannae, when Hannibal’s strategic failures are what determine the course of the war?

The author explains why. We idealize the great army geniuses of the past - in part to get adolescents psyched about war, in part to glorify national genius, but worst of all, to justify irrational wars of aggression by revisionist powers. The Japanese in WW2 wanted to revise the international order, but they weren’t capable (due to internal division in large part & aggressive leaders taking international actions) of aligning themselves on the stronger side of a global conflict (or at limiting the spread of their conflict). Therefore, the only answer was ever more aggressive attacks in the hope of destabilizing, in a series of brilliant campaigns, stronger opponents.

The argument is basically right. I am convinced. Great book. But it is possible to over-learn this lesson. France’s side may have eventually won WW2 - perhaps inevitably due to the network of alliances- but their failure to keep up with Wermacht initiative at the beginning of the war made everything so much worse.

As a child, I fell for the romance of Hannibal. In some ways, the fact that he loses in the end is almost a plus - a heroic standing in the face of the Roman tsunami. But a much better hero is Fabian, the delayer, who, rather than pushing for a quick resolution, showed the patience necessary for Rome’s advantages to inevitably tell.

Where does that leave the US today? I conclude that maintaining the alliance system is more important than ever. No country, including China, can challenge the US + EU + India together. We couldn’t be conquered in decades. Even if these nations cut their militaries to the bone, we could still hold out and win a long war - so long as we remained unified! It also makes me worried for an Israel that, drunk with operational success, may find itself isolated and overextended.

In sum, Grand Strategy>>>>>Operational Art>=Tactics.

Best Sci Fi: The Hydrogen Sonata by Ian M. Banks

I read The Culture series as mainly about two things: (1) What would it be like to live in a utopia? And (2) How did (and should have) the US acted internationally during its 1990s unipolar moment. I had been holding off on reading this, the last in the series, wanting to savor one of my favorite series for longer. But I finally gave in.

I was not disappointed. The book delivers well on both (1) and (2) as well as pointing out interesting connections between them. The core metaphor is the humanoid protagonist’s dedication to mastering playing an impossibly stupid four-arm-requiring string instrument. No spoilers, but if you’re interested in either theme, I highly recommend this series. It can be read out of order, and this is possibly my favorite, so feel free to jump in here.

Best Play/Opera: Salome by Oscar Wilde

I read the play in advance of seeing the excellent Met production. I appreciated how Oscar anticipates the concept of the “male gaze” and how sexual abuse perpetuates itself. The play is ambiguous in a way that leaves “The Dance of the Seven Veils” and related scenes somewhat sexy. Oscar’s language in listing the great gifts offered by King Herod is hypnotic.

Strauss’ music and the Met’s staging illustrate the play well. The Met production renders explicit how fucked up Salome’s abuse was, using seven Salomes at various ages, all dressed in school-girl clothes, to make sure you don’t miss the point, at the expense of sexiness.

Wildcard: The Pine Barrens by John McPhee

If you’re not from New Jersey, you’ve probably only encountered the Pine Barrens through hearing of mobsters dumping bodies there or perhaps the Jersey Devil cryptid. Even as someone from North Jersey, my understanding did not extend much further than that. But I really enjoyed learning more about it in this tightly written exploration of an anachronistic region nestled discretely between Philly and NYC.

The pines have always been sparsely populated, even in indigenous times, because of the sandy soil unsuitable for most agriculture. It has been a haven for Northeasterners who have wanted to get off the grid for centuries: as America’s first Native American reservation, for escaped slaves, and for Loyalists during and after the Revolutionary War. Today, it is known for its excellent blueberries, which were intentionally selected, cultivated, and spread by Rutgers University biologists.

The writing is brisk and respectful while not above pointing out some of the funny or absurd parts of piney life. Truly an underappreciated corner of America!

Most Laughable Economic Theory Joke Award: Ecstasy: Understanding the Psychology of Joy by Robert A. Johnson

I have to mention this one because I got the best laugh I’ve had of the year out of it -- but that was unintentional, the book is pretty bad.

You may have read about the distinction between the Dionysian and the Apollonian introduced by Nietzsche. Like a D&D alignment table, this chaos vs. order axis is orthogonal to conventional morality, but an important aspect of human psychology. I highly recommend reading “The Birth of Tragedy” by Nietzsche, or “Psychological Types” by Jung, to learn more about this distinction! The idea that we are cut off from our emotive, intuitionistic tools for creating value is a compelling one, but one difficult to balance with our modern virtues of reason and order. It’s a really good big idea!

This book is only sometimes about that big idea, and like many in Nietzsche’s and Jung’s shoes, it doesn’t share their talent for connection and subtlety. Instead, in this book, we get something in between DBT and Jungian shadow-self work.

Some of these ideas are not necessarily bad -- understanding your counter-social impulses and integrating them is great. But some of the ideas advocated are actually pretty bad and scary. The book seems to advocate different pseudo-schizo approaches to emotional healing with the shadow self -- from building a little shrine of your idol, to doing crazy calisthenics your dream-Dionysis tells you to do.

In a very short book of 97 pages, it’s clear the author is running out of steam by the end, with two of the last chapters devoted to reciting not particularly exciting dreams he or a client has had.

The book is at its funniest when the author -- who admits to not being very good at book writing or history -- completely makes up a political-economic history of the suppression of Dionysis and his replacement with the debauched Bacchus.

The peak of this, which I’ll leave you with, is my new favorite theory of the price level. From page 45, on Dionysus as “scapegoat”:

“Sheep represent everything of value in our Judeo-Christian World. The sheep, in fact, is the chief determinant of our currency. Every currency in the Western world -- the shilling, the franc, the deutche mark, the lira, the peso, the Austrian thaler (from which we got our dollar) -- was the price of one sheep. For centuries, there was no inflation in the Western world because one of our money pieces was worth a sheep. You could count on that anywhere, anytime.”

Literally laughed for a solid 10 minutes. Unconstrained by reason, the author gave me a moment of joy. And isn’t that the most Dionysian thing of all?

Honorable Mentions:

Abundance -- Agreed with it too much to find it interesting. But it’s the book I’m “rooting for” the most this year.

Democracy in America part 1 -- Great, but “The Ancient Regime and the Revolution” is better, and more unified in its thinking. This is foxy and hard to summarize, but ofc a deserved classic.

The Fundamentals of Heavy Tails -- Great primer on a topic I’ve launched myself into this year.

Help Wanted -- Read on the recommendation of Jason Furman, a nice little slice of life about minor drama at an upstate NY big-box store and the people who work there. Some good boots-on-the-ground economics about how management, economic incentives, loyalty, and hope play out at a place like this.

Fortune’s Formula: The Untold Story of the Scientific Betting System That Beat the Casinos and Wall Street -- Good for its discussion of the Kelly criteria and the hilarious fight between gamblers and Samuelson over whether it’s deeply true. (Spoiler: Obviously, it’s only utility maximizing from a single specific perspective, but it’s an awesome and useful heuristic for long-lived institutions.)

You might think that this is just Isaiah Berlin-esque Moral Pluralism, but as of this book, D. Wong HATES that term, arguing that Berlin goes full relativist. He argues that Berlin’s system has no resources for calling e.g. Aztec or Molochian worship systems immoral, while he would argue that only moral systems that plausibly contribute to human flourishing (which he thinks is somewhat universal due to our shared biology).

This is the second book in the last few years where I’ve run into a weird 1970s UN cybernetics conference giving people disastrous ideas. Here it appears as where China’s architect of the one-child policy got his inspiration. I have also seen these conferences in “Building a Ruin” about late Soviet economic policymaking, as a source for compromise technocratic ideas that gave Soviet leaders a politically useful (but economically inadequate) third option besides the antiquated command economy vs. true liberalization.

Are We There Yet? Evaluating METR’s Eval of AI’s Ability to Complete Tasks of Different Lengths

Seth Benzell — Mon, 15 Dec 2025 21:24:59 GMT

Seth and Andrey are back to evaluating an AI evaluation, this time discussing METR’s paper “Measuring AI Ability to Complete Long Tasks.” The paper’s central claim is that the “effective horizon” of AI agents—the length of tasks they can complete autonomously—is doubling every 7 months. Extrapolate that, and AI handles month-long projects by decade’s end.

They discuss the data and the assumptions that go into this benchmark. Seth and Andrey start by walking through the tests of task length, from simple atomic actions to the 8-hour research simulations in RE-Bench. They discuss whether the paper properly measures task length median success with their logarithmic models. And, of course, they zoom out to ask whether “time” is even the right metric for AI capability, and whether METR applies the concept correctly.

Our hosts also point out other limitations and open questions the eval leaves us with. Does the paper properly acknowledge how messy long tasks get in practice? AI still struggles with things like playing Pokémon or coordinating in AI Village—tasks that are hard to decompose cleanly. Can completing one 10-hour task really be equated with reliably completing ten 1-hour subtasks? And Seth has a bone to pick about a very important study detail omitted from the introduction.

The Priors that We Update On Are:

Is evaluating AI by time (task length) more useful/robust than evaluating by economic value (as seen in OpenAI’s GDP-eval)?
How long until an AI can autonomously complete a “human-month” sized task (defined here as a solid second draft of an economics paper, given data and research question)?
- Seth’s Prior: 50/50 in 5 years, >90% in 10 years.
- Andrey’s Prior: 50/50 in 5 years, almost certain in 10 years.
  
  Listen to see how our perspectives change after reading!

Links & Mentions:

The Paper: Measuring AI Ability to Complete Long Tasks by METR
Complementary Benchmarks:
- RE-Bench (Research Engineering Benchmark) - METR’s eval for AI R&D capabilities.
- H-CAST (Human-Calibrated Autonomy Software Tasks) - The benchmark of 189 tasks used in the study.
The “Other” Eval: GDP-eval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks by OpenAI
AI 2027 (A forecasting scenario discussed)
AI Village - A project where AI agents attempt to coordinate on real-world tasks.
Steve Newman on the “100 Person-Year” Project (Creator of Writely/Google Docs).
In the Beginning... Was the Command Line by Neal Stephenson
Raj Chetty
Transcript

[00:14] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, wondering just how long a task developing an AI evaluation is, at Chapman University in sunny Southern California.

Andrey Fradkin: And I’m Andrey Fradkin, becoming very sad as the rate of improvement in my ability to do tasks is nowhere near the rate at which AI is improving. Coming to you from San Francisco, California.

Andrey: All right, Seth. You mentioned how long it takes to do an eval. I think this is going to be a little bit of a theme of our podcast about how actually, evals are pretty hard and expensive to do. Recently there was a Twitter exchange between one of the METR members talking about their eval, which we’ll be talking about today, where he says that for each new model to evaluate it takes approximately 25 hours of staff time, but maybe even more like 60 hours in rougher cases. And that’s not even counting all the compute that’s required to do these evaluations.

So, you know, evals get thrown around. I think people knowing evals know how hard they are, but I think as outsiders, we take them for granted. And we shouldn’t, because it certainly takes a lot of work. But yeah, with that in mind, what do you want to say, Seth?

Seth: Well, I guess I want to say that we, I think we are the leaders in changing people’s opinions about the importance of these evals. The public responded very positively to our recent eval of Open AI’s GDP-eval, which was trying to look to bring Daron Acemoglu’s view of how can we evaluate the economic potential economic impact of AI to actual task-by-task-by-task, how successful is this AI system. People loved it. Now you demanded it, we listened. We’re coming back to you to talk to you about a new eval—well not a new eval, it’s about eight months old, but it’s the Godzilla of evals. It’s the Kaiju of evals. It’s this paper called “Measuring AI Ability to Complete Long Tasks,” a study that came out by METR. We’ve seen some updates or new evaluations of models since this first came out in March of 2025. Andrey, do you want to list the authors of this paper?

[3:05] Andrey: As usual I don’t. There are a lot of authors of this paper. But, you know, I’ve interacted with some of the authors of this paper, I have a lot of respect for them. I have a lot of respect for the METR organization.

Seth: Okay. But at a high level, just in a sentence, what this wants to do is evaluate different frontier AI models by the criteria of: “how long are the tasks that they complete”?”

Andrey: I guess what I would say before we get to our priors is, just as context, this, from what everything I’ve seen, is the most influential evaluation of AI progress in the world right now. It is a measure that all important new models are benchmarked against. If something is above the trend, it’s news. If something is below the trend, it’s news. If something’s on the trend, it’s news. And it’s caused a lot of people to change their minds about the likely path of AI progress. So I’m very excited to discuss this.

Seth: It’s been the source of many “we’re so back” memes. Yeah, I totally agree Andrey. Am I right that this was a paper that was partly inspiring the AI 2027 scenario by favorite blogger Scott Alexander?

Andrey: I don’t know if it inspired it, but I think it was used as part of the evidence in that. Just to be clear though, AI 2027, it’s a scenario that was proposed that seemed a bit too soon of a vision for AGI taking over the world by many folks. We have not done an episode on it.

Seth: We haven’t done an episode on it. But it’s fair to say that people look at the results of this paper and they see, you know, they see a trend that they extrapolate. But before we get into the details of the paper, are we ready to get into our priors?

Andrey: Let’s do it.

[05:50] Seth: Okay, so Andrey, just based on that headline description, that instead of evaluating AI systems by trying to go occupation by occupation and try to find tasks in those occupations that are economically valuable and then trying to see what percentage of those tasks the AI can do—that’s what the Open AI GDPval approach that we recently reviewed did—this approach is trying to evaluate tasks again by how long they are. So comparing those two approaches, I guess my first prior is, before we read this paper, which of those approaches do you see as like kind of intuitively more promising?

Andrey: One way of thinking about this is tasks are, or things people do which could be a series of tasks, are bundles and they’re bundles embedded in some higher dimensional space. And what these two evals are doing, this one we’re discussing here versus GDPval, is they’re embedding them into different spaces. One of them is a time metric. And one of them is a dollar metric, right? And you can just by phrasing it that way, you can see what some of the issues might be with either. With the dollar metric, well, what are people getting paid for? Is it a specific deliverable or is it being on call or being the responsible party for something? So you can see how it’s kind of hard to really convert lots of things into dollar values at a systematic level. Now, you can say the same thing about how long it takes to do something. Of course, it takes different people very different times to do different tasks. And then once again chaining tasks together, how to rethink about how long it takes to do that. So I think they’re surprisingly similar.

I think maybe this length of time one is more useful at the moment because it seems simpler to do frankly. It seems like, yes we can get an estimate for how long it takes to do something. It’s not going to be perfect, it’s going to be noisy, but we can get it and then we can just see whether the model does it. And that’s easier than trying to translate tasks to dollar values in my opinion.

[8:42] Seth: Right. I guess I also am tempted to reject the premise of this question and say that they’re valuable for different things. But I guess I come into this thinking about, you know, we think about AI agents as opposed to AI tools as being this next frontier of automation and potentially supercharging the economy. And it really does feel like the case that working with AI models, the rate limiter is the human. It’s how often the human has to stop and give feedback and say, “Okay, here’s the next step,” or “Hey, back up a little bit and try again.” So going in, I would say I was kind of in equipoise about which of the two is the most useful kind of as a projection for where this is going. Maybe on your side of the ledger saying that economic value is kind of a socioeconomic construct, right? That could definitely change a lot even without the tool changing. Whereas time seems more innately connected to difficulty. You can think about psychometric measures of difficulty where we think about, you know, a harder exam is a longer exam. So at least going in, I think that this has a lot of potential to even potentially surpass GDP-eval in terms of its value for projection.

Andrey: Yes. Yeah, yeah.

Seth: Okay. The next one I was thinking to ask you Andrey was, if we buy all the premises of whatever context the paper sets up for us, the question I’d like to think about is: how long until AI can do a human month-size task on its own? In the abstract of the paper, we have that happening within five years, by 2030. That seems like a pretty big bite at the apple as they say. Do you want to take a stance on how long until an AI can do a human month-size task? I mean, I have to say in my use of AI, I haven’t gotten anywhere near that.

[10:55] Andrey: What is an example of a human month-size task?

Seth: What’s something that takes 160 hours of work? I would say, you know, as an academic, maybe I need kind of three months of focus on a paper to bring it from zero to, you know, solid second draft. Maybe that’s like a third of a paper is a month of work?

Andrey: I mean, it can do a third of a paper in a day. I mean I’m not being facetious here. I referee a lot of papers. Is the question an end-to-end, completely no-intervention sort of thing? Because I think like, look, you take Claude-code off into a folder, the folder has the data. You tell it, “Hey, like write a paper that does this, that investigates this question with this data.” It can do that in a day. I don’t think it needs... I think it depends on how much you require for human intervention. I think with something where there’s a verifiable answer, it’s very different than something subjective like a paper. Because I think we don’t want just any paper. We want the paper that we want to write. It’s not just about quality, it’s also about taste. And so I don’t think it could do “end-to-end write a paper that I like” even if I gave it a lot of scaffolding. I don’t think it could do that yet. But could it do that in five years? Sure, I think it’s possible.

Seth: And just to be a little bit more specific, can we say gets published in like a top 10 economics journal level of quality?

Andrey: The quality bars will have to increase. I mean, I think it goes to a question of like if I already have the research question and I know the data is adequate. Yes. Very few projects are of course like that, right? None of my recent projects have that flavor to it I think, where it’s just I’ve already found the data set and the question is obvious and I just needed to go plug and chug.

Seth: There are papers like that. Raj Chetty gets the US tax records, and just needs to run some pre-registered analyses.

Andrey: That’s an interesting one Seth. So Raj Chetty is an economist -now we’re really in the weeds - who does big public economics analyses. He works with gigantic teams on data analysis and iteration. It’s not as simple as just going to town on a dump of data. So yeah, I’d say that I can think of easier papers than Raj Chetty’s papers to implement.

Seth: Okay, but if I want to think about the same kind of general format of question, right? Which is: I have a data set, I have kind of the general research question I want answered about the data set... let’s say the question is only specified at that level. I’m not being any more specific than that. Plus a data set. I don’t think an AI could make a plausible, complete, top 10 econ journal out of that right now. Do I think it could be there at a plausible level of quality in 10 years? In five years? Five years might be like exactly at my cutoff. I think in 10 years for sure. In five years, 50/50.

Andrey: Interesting. Okay. Okay. So that’s... yes. So we’re both very bullish, huh? Okay. Well, you know, maybe it’s slow, but 10 years is fast enough that we’re not ready. In fact, my understanding of the METR organization is that a big part of its mission is to prepare us for AI progress that’s a lot faster than society is ready to deal with. And you know, I think it’s an important mission.

Seth: That’s my mission too, Andrey. Also, they need to be prepared for slow progress. I want to prepare society for everything. Why prepare them for only one thing?

Andrey: Society is already prepared for slow progress. Perhaps.

Seth: Okay, are we ready to move on to the evidence?

[17:34] Seth: Okay, so Andrey, we read this paper, or this Eval from METR. It looks at the probability of task completion as a function of task length across a variety of frontier models, starting with GPT-2 in 2019 and continuing through Claude 3.7, which is kind of early-to-mid 2025. And I would say the Eval works in sort of four steps.

First is they establish a human baseline for how long it takes humans to complete 169 software engineering tasks --- By the way, in the abstract it does not mention that this is overwhelmingly software engineering tasks. I probably would have put that in the abstract, but you know who am I? -- Secondly, once we’ve got that baseline for each AI, we see whether it can complete each task. That was the quote you just gave us from Twitter. So once you’ve got the baselines, it takes about 60 hours of work to run each AI through the paces. Then we’re going to run a logistic regression of “Does AI correctly answer the task?” on “Length of task.” And then that gives you a data point for each model of: we think it has a 50% shot of completing an arbitrary task of a certain length. And then you put all of those points for all of different models from 2019 to 2025, and you see a diagonal line pointing from models that can do one-second tasks to models that can do one-hour tasks. And if you just extend that line out a little bit, that line’s going to take all our jobs. Isn’t that right, Andrey?

Andrey: Yeah, yeah. So just to be clear, I think the numbers that I have for the extrapolations... if we think that the current horizon is about a couple of hours, and the latest model rated is GPT-5.1 Codex Max which is just under three, the prediction for February of 2027 is 16 hours. And for April of 2028 is 5 days. So that’s you know, and if we go further we get to those month-long numbers eventually.

Seth: Okay. So maybe let’s take a minute to talk about that headline result. So they estimate putting all these models together a doubling time of approximately seven months. So every seven months we get a frontier model which is able to work for twice as long. They give themselves an R-squared of 98% in fitting what is it, 10, 15 points? Do you have anything to say kind of about this headline result before we dive in? The one thing I wanted to point out was this is all software engineering specific. So if you think that software engineering might obey very different doubling times than other tasks in the economy, this is only going to tell you about that one particular domain.

Andrey: Yeah, yeah. And I think that’s a really important caveat. I don’t think there is as much care here in making the tasks as realistic as possible as was, let’s say, in GDP-eval.

[21:35] Seth: Right, different priorities. GDP-eval very focused on like “what are useful tasks.” This kind of more focused on the abstract “short versus long tasks.” Maybe one other point I’ll make here which is a high-level point, which is something that they emphasize, which is if you think that there’s just some sort of constant error in their estimates, you can shift this entire graph down. But the important thing is the doubling time, right? And if the doubling time is seven months, sure shift the whole thing down, it’ll take one more year to get to whatever crazy outcome you want.

Andrey: Yeah, and for what it’s worth, to me 50% completion doesn’t seem very relevant. Presumbly you want 99% completion, right?

Seth: Yeah. I’d be happy much—you know I prefer to look—they have an 80% completion option on their site that you can plot and I tend to prefer that one. For that we have a number like that’s pretty current that’s around 30 minutes versus for the 50% it’s about 2.5 hours.

Seth: There we are. Okay. So we’ve talked about the headline results. Maybe now let’s go kind of point by point and how we end up there. So the first thing that they need to do is establish a human baseline for how long different tasks take. They do this by combining three different data sets. The first one they do is sort of internal. They call them Software Atomic Actions. These are like really micro tasks. The example they give is kind of hilarious.

The example they give is: “Okay Andrey, how long was it going to take you to answer this question? I’m putting you on the spot. Which file is most likely to have a password in it? Credentials.txt, InstallationNotes.txt, Main.py, or LauncherWin.exe?”

Andrey: Wow. Wow that is a hard question Seth. I mean I kind of view these sorts of tasks as similar to kind of like cursor auto-complete tasks where like, you know, you don’t need a reasoning model for this. You’re almost like... let’s say you have a little bug in the code, it just auto-complete correct it. You know, that sort of thing.

Seth: One thing I want to highlight about this... and they really they talk a little bit about trying to do what they can to reduce the noise from overhead from reading, from human reaction time... but it seems like they’re not going to do a super good job of distinguishing whether answering that question is a one-second task or a three-second task, right? But the difference between a one-second task and a two-second task is an order of magnitude here. And I guess I’m a little bit concerned if the logistic curve is learning too much about what’s the one-second version of that versus the two-second version of that.

[24:54] Andrey: Yes. Yeah yeah. I mean yes, there is an argument to be made that due to measurement error just swamping everything that maybe we should only start with one minute or or two minutes. Now of course we can draw our own visual regression on that plot over there and see that you still have a pretty steep curve even if we throw out the first few points, right?

Seth: Okay. So that’s done internally with their own kind of own engineers or just whoever was around. The second data set they draw on is something called the RE Bench suite or the Research Engineering Benchmark V1, which to quote from the paper consists of “seven open-ended ML research engineering environments and data from 71 eight-hour attempts by 61 distinct human experts.” So they’ve got these 61 guys that are doing seven of these tasks. And we confirm our experts make progress in these environments given eight hours. The third benchmark is H-CAST, Human Calibrated Autonomy Software Tasks. Designed to be a little bit more realistic to what a software engineering task would be in an economic environment. And they say that their baseliners typically have a degree from a top 100 global university and are primarily recruited via professional networks of METR employees. They’re paid $50 to $100 per hour plus $25 to $150 per hour in performance bonuses. Baseliners also did the tasks and predicted how much time it would take them to do the tasks.

Curiously only 61% of human baselines actually successfully completed tasks, right? So one thing kind of we should be thinking around in the background here is we kind of want to compare how long it takes a human to do a task to can the AI do the task. But in reality it’s like like we talked about, it’s higher dimensional than that. There’s not just how long does it take a human, but with what probability can a human do it in a certain length of time.

Andrey: Yeah. Or which human? And does the human have the context ahead of time? Or you know, are they an expert in this type of work or not, right? There’s no one number for the human.

[27:38] Seth: Exactly. And for that third data set they record 189 tasks that they evaluate across which there are 563 human baselines. So I guess a second note here is these aren’t kind of giant populations of people. I just I guess you wouldn’t expect this to be giant populations of people. You know is 61 people being judged on their research engineering skills a lot? A little? I mean on the one hand 61 seems like a small sample for all of humanity, but on the other hand getting 61 serious software engineers’ time for a thousand hours is a bigger deal.

Andrey: Yeah. Yeah. I mean it’s hard. I mean this goes back to our discussions of cost, right? I mean to do these sorts of metrics well, especially for valuable tasks, is just very expensive. You know look, there’s also this question of which population do we want to sample from? In the economy, experts are oftentimes doing the work. And that expertise can be very very narrow, right? You know think about just you know economists. You know even if economists are using different methods, you’re you know one person studying you know the medical industry is going to have very different expertise than a different person studying you know the energy industry. Even if like they use the same methods. So yeah I think the question of what population you want to sample is an interesting one.

Seth: Very very well put. One other detail here that is interesting but it’s kind of mixing together some pretty different evals here. The RE Bench, unlike the other ones where they just see how long it takes a person to finish it and conditional on finishing it how long did it take you, for the RE Bench they kind of give everyone eight hours and they figure out like what the average quality of people were able to do in those eight hours and that’s going to be their cutoff for an eight hour length task. So a little mix and matching going on. I’m not saying that they P-hacked this but there’s some informality going on.

Is there anything else you want to say in the creation of the bench lines before we move on?

Andrey: Well I think there’s one other data that they use which was the internal PR pull request experiments. I don’t know if you read this part where so they ran these models on some issues in the internal METR code base. So these are ones that would not have been in any training set certainly. And they found that their contract baseliners take 5 to 18 times longer to resolve these issues than the repository maintainers. So the people whose job it is are 5 to 18 times more efficient than contract baseliners on this on these tasks.

Seth: So the idea is METR coders are very smart boys. And girls.

Andrey: No, they actually don’t say that. They actually don’t say that. And I disagree with your statement here. Not that they aren’t smart, but more that they say that it’s all about context, right? Like if you’re dealing with a code base and you’re very used to it, you can diagnose the problem very easily. You can solve them very easily. If you’re not, then it takes you a while to load the context back in. I mean we’ve all had this. You know you work on a research project, you take a little break for a few months and now you come back and you know something that you know should be very simple takes you a few hours because you know you just don’t remember the code anymore, right?

[31:38] Seth: I wanted to bring up one last point here Andrey before we move on, which is around the question of how many people do we need to establish the correct baseline. So we’ve already talked about context matters, like have I already loaded in the prior knowledge or am I coming in cold? Am I a super smart expert or am I a man off the street? Those are all definitely mattering. But one thing I’d like to point out is that if we think that some of these tasks have a very long tail in completion time... right? Which seems really plausible for a very hard research engineering task, that you know some people can do it in a short amount of time and some people take twice that and some people take twice that... a very long tail... as the variance of people’s abilities to complete this task goes up, you know you’re going to be less and less confident in your estimate with a small N.

Andrey: Yes. Yes yes. I think that’s right. But once again it’s not clear to me where we want the minimum... whether we want the average or the min. There’s a very good argument for the min.

Seth: Right. If what we care about is superhuman ability then I guess we want the min.

Andrey: No, or or just like a comparable to a professional working on the code base. Not even superhuman right?

Seth: Do we really want the strict min? If the question is “how long does a certain journey take”, I’m not sure we want to include the person who by chance had just looked up that number.

Andrey: Like I think the min is perhaps too far... but something much closer to like what someone day in day out of the code base would do rather than you know... one is how much do you accelerate a company with an existing code base with professional software engineers. Like for me maybe that’s not the relevant benchmark. I’m not a professional software engineer. And so I don’t care if it’s better or worse than the best professional coder. I care if it saves me time. Which could be you know much more economically relevant if we think that the value of better software engineering is coming from the fact that now everyone can be a software engineer.

Seth: I think that’s very fair. But as we get deeper into this I’m becoming more convinced that if you really care about economic value you should be reading the GDPval paper not this paper.

Andrey: Okay. Okay.

Seth: So the second step of this process is for each AI seeing whether it completes each task. Right? So we’ve got these benchmarks. We’ve got the short benchmarks, the medium length benchmarks, the long benchmarks. How many can each AI do? I guess the one note I want to bring up here is they do some basic scaffolding. They claim it’s not elaborate. They try to bring some agent tools to the early models. So early models were like not set up at all for these longer projects but they try to give it like a little scratch pad and a little “remember these are the most important command line codes.” It seems like they’re not going to do a super good job of distinguishing whether answering that question is a one second task or a three second task. But you could imagine a version of this test that would have zero scaffolding or a version that would have very elaborate subtask specific scaffolding and they’re kind of closer to the first.

Andrey: Yeah and I think that’s fair to have a comparison baseline. It’s also becoming less and less representative of how people are using the models, right? I think if you’re serious about using the models you’re giving them skills and putting in the right context. Certainly you’re using a cursor or Claude code or a codex where there’s a lot of optimizations there. So you know one one argument here is like actually if you’re if you’re serious about using these models they’re actually a lot better than what’s portrayed in this benchmark.

Seth: Yeah I think that’s definitely right. And again one of the running themes of this podcast is “Bitter Lesson” and how important is the frontier-ness of the model versus the customization and the specific task orientation of the model. We don’t really get... you know they just say we do light scaffolding. And I guess before we move on, the range of tasks here are all designed so they can be done through the command line. So there’s no kind of... it’s not like Chat GPT immediately fails everything because it can’t make a picture.

Andrey: Seth, I thought that everything could be done through the command line. In fact Neal Stephenson famously said…

Seth: In the beginning there was the command line. That’s a good book. That’s a good book.

Andrey: Cryptonomicon for those who don’t know.

[36:10] Seth: No, he has a book, he has an essay collection called In the Beginning Was the Command Line also.

Andrey: Yes that’s true yes that too yes.

Seth: And in the essay collection, this is the one thing I remember, is he compares Macs to the Batmobile. ---Seth Cuts in With Correction: Actually he compared Mac OS to a luxury European car, Windows to a station wagon, Linux to a free tank, and BeOS to the Batmobile. Apologies to Mac OS fans for comparing their OS to the Batmobile -- It was a very 1990s book. It was like OS Wars book.

[37:01] Andrey: I just say that Neal Stephenson in terms of the pantheon of prophets... (Seth: he got crypto right). He got Uber right. He got virtual reality right. Wait wait wait. Okay. So right. Crypto. (Seth: He does think that there needs to be a big pile of gold somewhere. Which turns out to not be the case. Maybe he gets stable coins right.)

Yeah but but I guess yeah there are many things he got right and and certainly in Snow Crash that were way way ahead of their time. It’s one of those things where you almost imagine that the sci-fi author kind of causes the subsequent innovations. And maybe with AI there’s a similar sense to that because so many people who’ve developed these technologies were inspired by reading science fiction.

Seth: And the AI is reading the science fiction too Andrey.

Andrey: Yeah well it’s not clear whether we want the AI to read the science fiction. It might develop some weird notions of what might happen in the future.

Seth: Yeah. Read Bicentennial Man, don’t read Frankenstein. Let’s leave it at that. Okay. I could talk about Neal Stephenson for a whole episode. So let’s hold off on that. Okay. So the third step we promised the listeners is running the logistic regression.

So what we have here at the bottom of my screen I’ll put up is for each of the models that they evaluate you can see this nice logistic curve that starts at 100% for a sufficiently short task and moves down to 0% for a sufficiently long task. And I don’t know Andrey, I look at these curves and a lot of them don’t seem particularly logistic. A lot of them are not monotonic even. It seems like you’re assuming the conclusion if you think that AI can do all one second tasks. I my read is that AI cannot do all one second human completable tasks. And like the idea... logistic models are one parameter models. So like we talked about, it’s learning just as much about this curve about from going from four seconds to eight seconds as from going from one hour to two hours. Which seems like the wrong way of thinking about it.

Andrey: Yeah I mean I guess is it really that different than just finding than just extrapolating the point at which it has a 50% success rate? And then you know if we actually look at that point non-parametrically it’s it’s pretty it seems like like pretty close to where where we end up right? So I guess like one argument here is actually if you’re if you’re serious about using these models they’re a lot better than what’s portrayed in this benchmark.

Seth: The 50-50 point. I think for a lot of these if I was trying to draw a diagonal line I guess my midpoint, my 50-50 point would be similar. I guess I don’t know how to think about like this GPT-2 example where…

[40:37] Andrey: Sure. I mean but I think we already both like kind of argue that we might as well toss them. And it wouldn’t really make a difference. So let’s toss the early ones.

Seth: We’re not going to focus on the ones that can knock all these one second tasks out of the park. One thing I I guess think about is there seems to you know they talk in the caption for this figure about a jump in between the the the atomic tasks and the H-CAST tasks. And you do kind of see that in a bunch of these figures. But then I also see a jump at the eight hour tasks right? Because we know that there’s a lump of eight hour tasks that they get from the RE benchmarks. You know this is not to like punch down on a paper that is like a really good paper is definitely inspirational um and definitely influential correctly. But I think when you dig into these curves I am not convinced that the logistic model is definitely the right model. And then I guess then I lose maybe a little bit more faith than you do that were correctly finding the 50-50 point in these.

Andrey: Yeah. I mean I guess the other... I just don’t... yeah. I think there are other criticisms that are much deeper than than this one is maybe what I’d say. No no no. We already mentioned them. These are programming tasks. They’re very selective. (Seth: Yes. Yes. Yeah. There are other deeper criticisms. We’ll get to those.

Seth: You gotta put... dude how do they not put that in the abstract? I don’t know. That’s that’s something I ask. I mean the only… I’ll tell you why you don’t put it in the abstract and not to cast aspersions... it’s the hubris of someone who thinks that software engineering is the is the final task.

Andrey: Tell me tell me about these messiness scores. Did you read about those?

Seth: Right. They have 16 of them. Um I I I’ll why don’t you tell us about the messiness scores Andrey.

[42:50] Andrey: Yeah so so there’s an idea that like look if you have a very well defined task... like implement some algorithm... you know verify that the results are working... you know that’s way easier for an AI to do than “Hey you know I don’t know how to solve this problem you know try a bunch of things and solve it for me.” That’s very messy. Like you don’t you know you don’t um really know what the right solution there is no maybe objective solution to that um and so um you might think of a dimension here that’s messiness in addition to some sort of difficulty uh level. And and so they have a bunch of ratings uh of the messiness of uh these different tasks and yes there’s yes and and one thing I’ll say is that most of these tasks are not very messy. Now what else will I tell you is like you know working at my job most of the tasks I do are super messy.

Seth: They wouldn’t give... they don’t give you the easy jobs Andrey.

Andrey: No no no no. I mean and maybe you know look once again like maybe the intern is getting these very non-messy jobs but I am not. So so I do think it’s an important dimension. Not to say that the AI can’t do the messy jobs. They’re not even in the data set that’s being evaluated here.

Seth: Right. I think that’s a very fair point right. Which is this is a set of tasks that is really designed to be as amenable as possible to sticking the agent on it and coming back later right.

That’s that’s intriguing right it’s like and it’s inspirational and it’s uh vertiginous is maybe the word I want to use. Uh but it maybe doesn’t extrapolate directly to um normal people’s interaction with these tools right.

One other way I might want to frame this and we talked about this in the beginning is that problems are sort of tasks are multi-dimensional. They have lengths but they also have messiness. They also have difficulty. They have you know verbal difficulty and math difficulty and difficulty on lots of different dimensions. You could imagine a world in which there’s lots and lots of evals. More than 169. Maybe let’s say a thousand of these benchmarks. And we could actually estimate something that’s kind of multi-dimensional right. So success probability as a function of the length of the task, the verbal difficulty of the task, the you know the math difficulty of the task. And then throw in model year as just another parameter. Or as another interaction term.

Andrey: What an economist. Just add more fixed effects.

Seth: Dude machine learning! Let there be interactions too right. Let let it have whatever shape you like. Um that’s the dream. Maybe it’s an unrealistic dream given how expensive you know even putting together 160 benchmarks are. Uh but it seems like if you wanted to estimate the role of year in how good model is in doing thing you would want a model where year is a parameter in the model.

Andrey: Yeah yeah. I mean for what it’s worth you know there aren’t that many models... yeah I guess there are more there are a lot of... let me take that back. There aren’t that many frontier models. There are a lot of models that are around. But I think this benchmark is really focused on the frontier models and and you know over the course of this year we’ve maybe had the 10 total frontier models. So it’s not you’re if you want to if you want to run that regression you know you’re gonna have too many parameters.

Seth: Well here how how about this? Right? Which is you don’t only focus on frontier models. You just try to do this prediction not as a function of like model frontier you know is this the frontier model and year. Is you do it as a function of model size. And maybe there’s instead of one frontier model every year there’s one frontier model at each size every year. And you can get a little bit richer data.

Andrey: Sure. I will tell you that we actually don’t know the size of the frontier models.

[47:04] Seth: Yeah they don’t they don’t tell us. They don’t say it’s got a gazillion parameters. It’s secret. You know I... all right keep your secrets meme. All right.

Andrey: Well look uh to any of our listeners at the various illustrious labs uh a little tip might be appreciated if so we know what what sizes we’re working with.

Seth: Okay. So that’s a fair point. I think another point I would make here is that when we’re talking about secrecy is that the evals also have to be secret right. You know as if I’m putting on my reviewer 2 hat I kind you know I want to see the evals. I know I understand that you can’t put them on the internet because then the AI companies will cheat at the evals. But uh it’s a non-optimal thing that they have to do.

Andrey: Yeah and there and there is a sense that some of these tasks that they do have them do are a bit leaky. Who you want you want… I have some intel…

Seth: You want to name names?

Andrey: No I mean look I haven’t dug into them myself but having talked to some having gotten some intel. Let’s just say that they’re not they’re not some there not things that are that different from what you might have trained on a lot of time.

Seth: Okay. All right. So are we ready to sort of start talking about uh discussion limitations? I feel like we’ve run through the paper now. Is there anything else you want to say in terms of the technical sort of evidence side before we move into kind of more free-wheeling discussion?

Andrey: Let me just uh kind of now uh say you know this is a really I think this is a really important topic and episode for us because I I truly do think that this eval is driving so much of the conversation and uh most of the people have not read at all what the eval um is. And I will and I will especially thank so I’m in this uh Twitter group chat called uh the the “Demon Economics Research Unit” with a lot of uh very uh based uh participants who pointed me to various resources various very interesting writings on this eval that I that I benefited greatly from um when when thinking about limitations here. Um so let me give you a limitation that one can think about. Have you ever tried to watch uh Sonnet play Pokemon Seth?

[49:37] Seth: I oh I remember really early on I remember like Chat GPT yes I do remember Chat GPT plays Pokemon right. But it was like it was no I know I remember Twitch plays Pokemon and it was terrible right. I I do not I have not seen Claude play Pokemon. How what is that like?

Andrey: Uh it’s pretty slow going. Um and it’s not not very successful. Uh it’s a game with no fail state you can just keep on grinding it. Yeah just to be clear like a child can play this game quite successfully. Um and uh this is something an AI just has a very hard time doing. A number I have here is that not the current Gemini but but I think it was Gemini 2.5 Pro took 888 hours to minimally beat Pokemon uh that’s elite four that’s not capturing all Pokemon yeah with a dozen intense human handholds like tile labeling.

Seth: Wow.

Andrey: So so it’s very easy to think like hey uh these numbers in you know you to naively look at this graph and you’re like yeah now it’s four hours now it’s you know so on. But but let’s think about something like Pokemon which which humans can do quite well where where even when the AI can do it the amount of tokens involved is is immense. It’s just staggering.

Seth: When it will become when Andrey when will it become economically viable to export our Pokemon play?

Andrey: Yeah. Yeah I’m just say that look like obviously obviously these tokens are going to become cheaper over time and more efficient and whatever but but but you know like we have to take things with a grain of salt.

Here’s another piece of evidence that was brought up in the in the group in the group chat that I that I think was quite uh convincing to me. Have you ever heard of uh AI Village?

Seth: No.

Andrey: Uh so AI Village is uh is an experimental project where uh AIs uh personified with like the different models like you know Sonnet and Gemini and and GPT are coordinating on different tasks. Uh like what? Like Stardew Valley kind of? Yeah exactly. Yeah yeah. So they might be coordinating on um successfully uh selling a shirt online or getting some likes for for a web page or or something like that.

Seth: What so these are real world tasks or these are simulated tasks?

Andrey: Real world tasks. Okay cool. Real world task. And um you know you can I encourage everyone to go to it and see how well that’s going for the AIs.

Seth: Do they sell ten thousand dollar tungsten cubes?

Andrey: That’s that you know that’s a different interesting project but you know uh yeah project you know that’s a project called Project Vend you know maybe another one that’s in the same in the same vein. But but but this this AI Village just goes to show that uh these AIs they’re missing something. They’re not able to do things that humans can do quite easily. Um especially coordination but not just coordination. They just get tripped up on interacting with various pieces of the digital world. Um I’m a big optimist that that will be improved of course but um but we have to take these these time numbers with truly with a grain of salt.

[53:15] Seth: Right. I guess one thing one kind of question I had going in and I’m not sure whether we kind of get a hard yes or no answer on this is like to what extent is doing a two-hour task just doing two one-hour tasks correctly in a row right. Yeah. To the extent that it is to the extent that it’s just like a six sigma problem to the extent that it’s just like Waymo and it’s like okay you need to not crash one minute in a row a thousand times it seems like these extrapolations are pretty straightforward right.

But if on the other hand longer tasks are somehow qualitatively different because they involve complex interactions between subtasks interactions with the world in a way that you never do with one second tasks then these projections become a little bit more dubious. I guess I would also say that there are also reasons it could be easier to do these longer tasks right because you can always back up and retry right. Uh but I guess you know I wish there was a little bit more in here... I guess with the messiness they talk they get at this maybe a little bit but I wish there was more about like what’s going on beyond just reliability going up on each subtask.

Andrey: Yes yeah. Yeah I agree that would be very interesting. I mean one one version of that could be is it reasoning. (Seth: Planning) Reasoning is a constraint right like planning. Yeah planning um I yeah I don’t know. Um I guess like one one version of this is let’s you know one way we can think about this is that if 50% reliability is actually quite small and if we wanted to get to let’s say the reliability of um a good worker at a company maybe that’s a 99% reliability. Um so uh one argument that maybe the authors of METR might might bring is like look the trend is the same regardless of the percentage numbers and we just need to uh you know you can just shift everything down. But otherwise we’re we’re doubling very quickly and that still has enormous economic implications. And then you know um unfortunately we don’t have any evidence or not that we don’t have any but I would love to see a 99% reliability threshold in this benchmark.

Seth: It’s not sensitive enough right. I you know if there’s a hundred tasks right or a hundred and sixty that they’re doing right so you just not going to get 99% and you’d be worried yeah you’d be worried that it selected yeah.

Andrey: Yes yeah yeah. Um another comment like another very interesting critique.

Seth: Keep them coming dude these are all great.

Andrey: Um is is thinking through like what an actual human project requires in terms of hours. And there was this very interesting essay uh by this guy uh Steve Newman. I guess he developed Rightly which ended up becoming Google Docs. Uh and he talks about like uh something being a prototype um which was his initial Rightly um that kind of took about uh four months to build. Um it was kind of hacked together. And you know it was it was kind of self-contained and so on. And then he talks about a subsequent project he did um called Scalyr uh Sc- I don’t know how to pronounce it whatever. Um and he kind of estimated that uh that project that that product took a hundred person years to do. Which is not a crazy idea if you imagine you have a company and you have a hundred employees and it tooks you took you a year to build your initial project. I mean you know like most startups don’t don’t work that way.

[57:15] Seth: That’s a mythical man month.

Andrey: Yeah you know it’s not quite you know that but like there is some some some substantive or alternatively one way to think about it is like maybe what we need to get to is a hundred person years not like you know uh not even one year for a person. Right?

Seth: We need that for for whom?

Andrey: We need that for to have the AI end to end develop you know build things truly build things you know.

Seth: For you to really feel like I am to have that one person company right the mythical first the one one employee unicorn right.

Andrey: Yeah exactly. With zero employee unicorn you know.

Seth: Well I mean that’s I dude you gotta make yourself CEO.

Andrey: CEO is I don’t know as someone who has an S-corp that not necessarily you know…

Seth: Do you call yourself president? What’s your position at your S-corp?

Andrey: I think I’m president yeah.

Seth: Oh wow. I’m going to be Chief Czar of my S-corp. Does your does your S-corp have a fun Lord of the Rings name or is it like Andrey Consulting or something?

Andrey: It’s uh you know it’s a it’s actually very related to this podcast it’s called uh Justified Strategy.

Seth: Ooh I like it I like it. See you know you gotta the marketing synergies are obvious here. Yes yes. That’s something the AI can’t do yet.

Andrey: Oh man do you have any more of these hot limitations or or have I tapped you out?

Andrey: No I mean look I think I think we’ve said enough yeah on the limitations yeah.

Seth: We’ve done one we’ve done uh two man hours of talking about this. Okay. Yes yes. So let’s move into our posteriors.

[59:10] Seth: So uh Andrey um can I tell you a joke before we move into our posteriors?

Andrey: No jokes allowed.

Seth: Well I’ll tell you an unfunny anecdote then. Okay okay. I heard a joke once. A man goes to doctor. He says that he’s unevaluated. Says that life is meaningless and vague and uncertain. The doctor says that treatment is simple. The great evaluator METR is in town. Go and see them. That will get you evaluated. And then METR says to the doctor “But I am METR.” You know drum roll curtain closes. I mean it is it is so it’s so tempting to kind of want to do the meta thing here and like ask because it is such a software engineering-y kind of task the the evaluating. It is sort of surprising that uh you have that Twitter post saying that it takes them 60 hours 60 man hours to do the evals.

Andrey: I mean look I’m sure they’ve tried to automate more of it but yeah I agree it is very metapoint. It is uh...

Seth: Hopefully someone got a laugh out of that. All right. So the first posterior uh we have to come back to is is this more or less useful than GDPval?

Andrey: I look I I think it’s hard to argue that given where we are today that this is not more useful. Um it’s been this has uh been in the media a lot more than GDPval. I think one of the reasons it’s more useful is because lots of models are plotted against it. There’s more of a trend. Maybe GDPval will have this flavor going forward. Um but it is worth you know thinking through just the fact that GDPval is also way more expensive eval.

Seth: Right. I don’t know I dude G- I I know GDPval is way more expensive I vastly prefer that to this. This is a good paper I have nothing against this paper. But you gotta if it’s a if you can’t say this is about agents generally and then not put in the title that it’s just software engineering. I love the the breadth that GDP-eval tries to get at um that’s just not present here. I it is in- it’s it’s vertiginous to look at that curve going up to you know 10 hour tasks 20 hour tasks 40 hour tasks but the fact that it’s vertiginous and newsy doesn’t make it better necessarily.

Andrey: Sure sure. Um yeah I mean I hear that point.

Seth: The second thing we wanted to think about is how long until AI can do a human month-size task on its own. I came on saying that we we sort of we’ve defined that as do a good draft of an econ paper given a premise and a giant data set. You know viewers at home think about your own month-long task that you’re familiar with. Uh I said maybe 50-50 in five years and pretty conf- and you know 90% in ten years. This paper is a good paper it’s an intriguing paper but when you dig into it it says a little bit less than what it seems to on its face. So to the extent that I was thinking that we were going to be there for sure in 10 years and pretty con- and you know 50-50 in five years I at least I have to take a step back and put bigger error bars on that ladder one and maybe go down to you know 70-80%...

Andrey: I’m confused Seth. How could that be? Because if that was your prior... yeah yeah... this didn’t have negative information so you I would believe if you said your prior didn’t change…

Seth: No no no no. It signaled me down right so I so when I came into this paper I had an assumption about what this paper would say. So I had a prior that included “Oh and there’s this great paper that says 7 months.” Okay. I see. So your prior included already some notion about what the paper is. (Andrey: Okay got it got it got it got it got it. I hear you.)

So this paper was less impressive than I anticipated. And so um I think my five year estimate is maybe about the same but my 10 year estimate comes down a little bit.

Andrey: Yeah yeah I think I’m more confident than you that in five years we’ll have it. So my 10 year doesn’t um change very much. Um yeah I mean I think the interesting thing is like do we get there in two years or do we get there in five years? And because of the narrow domains here I I really there’s other evidence that like for example like Open you know we’re recording this as Opus 4.5 was uh recently released the latest Anthropic model that has updated my priors a lot more than um than this paper.

Seth: Yeah. Do you want to talk about that for a little bit and that can be our our wrap up discussion? What’s so what has impressed you about the the latest latest models?

Andrey: Um I mean look they have through a variety of benchmarks they seem very good but just I’ve had a chance to work with it yesterday and uh I was extraordinarily impressed.

Seth: Give me give me a little bit more dude just a taste. What was one cool thing it did?

Andrey: It’s too secret dude.

Seth: All right. Um let’s just say like it did it when thinking about like writing a paper it did something that would have probably taken me a week and probably about an hour.

Seth: All right. Okay. We that’s a week-long task uh 40 hours of work that’s uh off the charts in what we’ve been looking at.

Andrey: Yeah I mean I do think like one one constraint there I mean if you look at the clock time for me it was longer than an hour but I could use a lot of that time to do other things. I think but like my interventions into it were rel- you know they were expert but relatively minimal and it did a lot of awesome stuff on its own uh very effectively.

Seth: Right. So listeners at home we are not AI pessimists. We think that there’s a lot going on here. This paper maybe uh very intriguing vertiginous exciting maybe a little bit less than it seems uh on its face. Uh but we are watching this space and we’re we’re looking forward to see uh how good these agents get and how long tasks that they can do moving forward.

[1:05:47] Andrey: All right. Keep your posteriors justified.

Seth: And if you have another uh cool eval you want us to eval send it our way.

Epistemic Apocalypse and Prediction Markets (Bo Cowgill Pt. 2)

Andrey Fradkin — Tue, 02 Dec 2025 02:36:54 GMT

We continue our conversation with Columbia professor Bo Cowgill. We start with a detour through Roman Jakobson’s six functions of language (plus two bonus functions Seth insists on adding: performative and incantatory). Can LLMs handle the referential? The expressive? The poetic? What about magic?

The conversation gets properly technical as we dig into Crawford-Sobel cheap talk models, the collapse of costly signaling, and whether “pay to apply” is the inevitable market response to a world where everyone can produce indistinguishable text. Bo argues we’ll see more referral hiring (your network as the last remaining credible signal), while Andrey is convinced LinkedIn Premium’s limited signals are just the beginning of mechanism design for application markets.

We take a detour into Bo’s earlier life running Google’s internal prediction markets (once the largest known corporate prediction market), why companies still don’t use them for decision-making despite strong forecasting performance, and whether AI agents participating in prediction markets will have correlated errors if they all derive from the same foundation models.

We then discuss whether AI-generated content will create demand for cryptographic proof of authenticity, whether “proof of humanity” protocols can scale, and whether Bo’s 4-year-old daughter’s exposure to AI-generated squirrel videos constitutes evidence of aggregate information loss.

Finally: the superhuman persuasion debate. Andrey clarifies he doesn’t believe in compiler-level brain hacks (sorry, Snow Crash fans), Bo presents survey evidence that 85% of GenAI usage involves content meant for others, and Seth closes with the contrarian hot take that information transmission will actually improve on net. General equilibrium saves us all—assuming a spherical cow.

Topics Covered:

Jakobson’s functions of language (all eight of them, apparently)
Signaling theory and the pooling equilibrium problem
Crawford-Sobel cheap talk games and babbling equilibria
“Pay to apply” as incentive-compatible mechanism design
Corporate prediction markets and conflicts of interest
The ABC conjecture and math as a social enterprise
Cryptographic verification and proof of humanity
Why live performance and in-person activities may increase in economic value
The Coasean singularity
Robin Hanson’s “everything is signaling” worldview

Papers & References:

Crawford & Sobel (1982), “Strategic Information Transmission”
Cowgill and Zitzewitz (2015) “Corporate Prediction Markets: Evidence from Google, Ford, and Firm X”.
Jakobson, “Linguistics and Poetics” (1960)
Binet, The Seventh Function of Language
Stephenson, Snow Crash

Transcript:

Andrey: Well, let’s go to speculation mode.

Seth: All right. Speculation mode. I have a proposal that I’m gonna ask you guys to indulge me in as we think about how AI will affect communication in the economy. For my book club, I’ve been recently reading some postmodern fiction. In particular, a book called The Seventh Function of Language.

The book is a reference to Jakobson’s six famous functions of language. He is a semioticist who is interested in how language functions in society, and he says language functions in six ways.¹ I’m gonna add two bonus ones to that, because of course there are seven functions of language, not just six. Maybe this will be a good framework for us to think about how AI will change different functions of language. All right. Are you ready for me?

Bo Cowgill: Yes.

Seth: Bo’s ready. Okay.

Bo Cowgill: Remember all six when you...

Seth: No, we’re gonna do ‘em one by one. Okay. The first is the Referential or Informational function. This is just: is the language conveying facts about the world or not? Object level first. No Straussian stuff. Just very literally telling you a thing.

When I think about how LLMs will do at this task, we think that LLMs at least have the potential to be more accurate, right? If we’re thinking about cover letters, the LLMs should maybe do a better job at choosing which facts to describe. Clearly there might be an element of choosing which facts to report as being the most relevant. We can think about, maybe that’s in a different function.

If we ask about how LLMs change podcasts? Well, presumably an LLM-based podcast, if the LLM was good enough, would get stuff right more often. I’m sure I make errors. Andrey doesn’t make errors. So restricting attention to this object-level, “is the language conveying the facts it needs to convey,” how do you see LLMs changing communication?

Bo Cowgill: Do I go first?

Seth: Yeah, of course Bo, you’re the guest.

Bo Cowgill: Of course. Sorry, I should’ve known. Well, it sounds like you’re optimistic that it’ll improve. Is that right?

Seth: I think that if we’re talking about hallucinations, those will be increasingly fixed and be a non-issue for things like CVs and resumes in the next couple of years. And then it becomes the question of: would an LLM be less able to correctly report on commonly agreed-upon facts than a human? I don’t know. The couple-years-out LLM, you gotta figure, is gonna be pretty good at reliably reproducing facts that are agreed upon.

Bo Cowgill: Yeah, I see what you mean. So, I’m gonna say “it depends,” but I’ll tell you exactly what I think it depends on. I think in instances where the sender and the receiver are basically playing a zero-sum game, I don’t think that the LLM is gonna help. And arguably, nothing is gonna help. Maybe costly signaling could help, but...

Seth: Sender and the receiver are playing a zero-sum game? If I wanna hire someone, that’s a positive-sum game, I thought.

Andrey: Two senders are playing a zero-sum game.

Seth: Oh, two senders. Yes. Two senders are zero-sum with each other. Okay.

Bo Cowgill: Right. This is another domain-specific answer, but I think that it depends on what game the two parties are playing. Are they trying to coordinate on something? Is it a zero-sum game where they have total opposite objectives? If all costly signaling has been destroyed, then I don’t think that the LLM is gonna help overcome that total separation.

On the other hand, if there’s some alignment between sender and receiver—even in a cheap talk world—we know from the Crawford and Sobel literature that you can have communication happen even without the cost of a signal. I do think that in those Crawford and Sobel games, you have these multiple equilibria ranging from the babbling equilibrium to the much more precise one. And it seems like, if I’m trying to communicate with Seth costlessly, and all costly signal has been destroyed so we only have cheap talk, the LLM could put us on a more communicative equilibrium.

Seth: We could say more if we’re at the level where you trust me. The LLM can tell you more facts than I ever could.

Bo Cowgill: Right. Put us into those more fine partitions in the cheap talk literature. At least that’s how I think the potential for it to help would go.

Andrey: I wanna jump in a little bit because I’m a little bit worried for our listeners if we have to go through eight...

Seth: You’re gonna love these functions, dude. They’re gonna love... this is gonna be the highlight of the episode.

Andrey: I guess rather than having a discussion after every single one, I think it’s just good to list them and then we can talk.

Seth: Okay. That’ll help Bo at least. I don’t know if the audience needs this; the audience is up to date with all the most lame postmodern literature. So for the sake of Bo, though, I’ll give you the six functions plus two bonus functions.

Informational: Literal truth.
Expressive (or Emotive): Expressing something about the sender. This is what actually seems to break in your paper: I can’t express that I’m a good worker bee if now everybody can easily express they’re good worker bees.
Connotative (or Directive): The rhetorical element. That’s the “I am going to figure out how to flatter you and persuade you,” not necessarily on a factual level. That’s the zero-sum game maybe you were just talking about.
Phatic: This is funny. This is the language used to just maintain communications. So the way I’m thinking about this is if we’re in an automated setting, you know how they have those “dead man’s switches” where it’s like, “If I ever die, my lawyer will send the information to the federal government.” And so you might have a message from your heart being like, “Bo’s alive. Bo’s alive. Bo’s alive.” And then the problem is when the message doesn’t go.
Metalingual (or Metalinguistic): Language to talk about language. You can tell me if you think LLMs have anything to help us with there.
Poetic: Language as beautiful for the sake of language. Maybe LLMs will change how beautiful language is.
Performative: This comes to us from John Searle, who talks about, “I now pronounce you man and wife.” That’s a function of language that is different than conveying information. It’s an act. And maybe LLMs can or can’t do those acts.
Incantatory (Magic): The most important function. Doing magic. You can come back to us about whether or not LLMs are capable of magic.

Okay? So there’s eight functions of language for you. LLMs gonna change language? All right. Take any of them, Bo.

Andrey: Seth, can I reframe the question? I try to be more grounded in what might be empirically falsifiable. We have these ideas that in certain domains—and we can focus on the jobs one—LLMs are going to be writing a lot of the language that was previously written by humans, and presumably the human that was sending the signal. So how is that going to affect how people find jobs in the future? And how do we think this market is gonna adjust as a result? Do you have any thoughts on that?

Bo Cowgill: Yeah. So I guess the reframing is about how the market as a whole will adjust on both sides?

Andrey: Yes, exactly.

Bo Cowgill: Well, one, we have some survey results about this in the paper. It suggests you would shift towards more costly signals, maybe verifiable things like, “Where did you go to school?”

Andrey: No, but that is easy, right? That already exists, more or less.

Bo Cowgill: That’s true. Yeah, I mean, you could start using these more and start ignoring cover letters and things like this.

One thing somewhat motivated by the discussion of cheap talk a minute ago is that there’d be more referral hiring. This is something that lots of practitioners talk about: we can’t trust the signal anymore, but I can still trust my current employees that worked with this person in the past. It has a theoretical interpretation as well, which is that when all you have is cheap talk, the only communication you can have is maybe between people who are allies in some sense or who share the same objective. This would be why you could learn or communicate through a network-based referral. So I think that’s super interesting and lots of people are already talking about it. It would be cool to try to have an experiment to measure that.

Andrey: What about work trials? Do you think that’s gonna become more common? Anecdotally, I see some of the AI labs doing some of this. If you can’t trust the signals, maybe just give a trial.

Bo Cowgill: Most definitely. The cheap talk idea is not the only one. You could have a variety of contractual solutions to this problem. There was a recent Management Science paper about this: actually charging people to apply, thinking that they have a private signal of whether they can actually do this or not. If they’re gonna get found out, they would be less likely to be willing to part with this money. It’s less of a free lottery ticket just to apply if you’re charging.

Andrey: For what it’s worth, I strongly think that we’re gonna move into the “pay to apply” world.

Bo Cowgill: Oh. That’s interesting. I mean, I think that “pay to apply” is super underrated. Having said that, people have been willing to ignore more obvious good things for longer, so I don’t think it’s as inevitable as it sounds like you do.

Andrey: Well, I think it’s the natural solution to the extent that what the cover letter is doing is signaling your expected match quality. And you have private information about that. I think both Indeed and LinkedIn have now premium plans with costly signals. So it’s not exactly a “pay for apply,” but you pay for a subscription that gives you limited signals, which is essentially the same exact thing.

Bo Cowgill: Makes sense.

Andrey: Yeah. So I think, whether that solves these issues, I’m not sure. It needs to be objective to really do the deed.

Seth: It solves the express... well, which is fine if we think willingness to spend on this thing is more correlated with ability. It’s back to the same signaling model.

Bo Cowgill: I mean this solution also relies on the applicant themselves to know whether they’re a good match in some sense, and some people are just deluded.

Andrey: Yeah. Well also the platform, like in advertising, could be a full auction-type thing.

Bo Cowgill: It could be a scoring auction that has its own objectives and gives people discounts. What Seth says raises a common objection for “pay to apply,” which is: “What about the people who can’t afford it?” And I think a high number of the people who have said that in my life work for an institution that charges people to apply for admission. So you could use some of the same things. You could have fee waivers, and the fee waivers might require a little bit of effort to get.

Another idea I’ve heard is that you could put the money in escrow and then possibly give it back if it doesn’t work out. Or you could actually give it back if it does work out. So yeah, people have different takes on this. But there are various ways to harness “pay to apply” and then deal with the negative aspects of it in other ways.

Seth: So what it seems to solve is this very narrow element of what we call the expressive function of language. So one thing I’m trying to express with my cover letter is, “I’m a good worker bee. I do the things. I have resources. I will bring my resources to your firm.” But we also want the letters to do lots of different things, like be beautiful and tell me a little bit about yourself. Have heterogeneous match quality elements, right? So it seems like this money only helps with one vertical dimension of quality.

Andrey: Actually, when you’re sending that costly signal and you cater your cover letter to that employer, that is about match quality, right? The costly signal, the “pay to apply,” gives you the incentive to reveal that information in your cover letter.

Seth: Right. It’s a “both,” right? It’s not a payment or a cover letter. It’s a both. Good point.

Andrey: We’ve spent a lot of time thinking about the signaling, this information apocalypse—or epistemic apocalypse—that Bo has been calling it. I think one solution to various epistemic issues has been prediction markets. I wanted to ask Bo about his earlier life experiences with those because it’s a very hot topic now, with a lot of prediction markets gaining traction.

Bo Cowgill: Yeah, definitely. We should get back to the GenAI information apocalypse as well and ask: do we think it’s gonna happen? But yeah, it is true that some of my first papers out of grad school were about prediction markets. In my former life I worked at Google, where at one time people had 20% projects. I started an internal prediction market. At the time it was the largest internal prediction market known to exist.

There were around 400 or so different markets where we offered employees the ability to anonymously bet on different corporate performance measures. The two most common ones were: What will the demand for our products be? How many new advertisers, Gmail signups, or 7-day-active-users will we get? And then also, project launch deadlines. Basically, would it be on time or early or late? Not very often early, but sometimes on time.

I had a paper about this in the Review of Economic Studies. It showed, like in many other cases, the markets perform really well, both in absolute terms and relative to other forecasters at Google. We eventually got other companies’ data to try to do similar things.

I think one interesting thing is that prediction markets have gotten really big externally for things like elections, but you still don’t see a lot of companies seemingly use it to guide decision-making.

Andrey: I want to hear your best explanation for why you think the internal prediction markets haven’t taken off.

Bo Cowgill: There are lots of reasons. Our prediction market at Google was really built around having a proof of concept that we can then use to launch our own Kalshi, or our own Polymarket. I think it was a little bit too soon for that. In our case, we weren’t really trying to make it as good of a decision-making tool as possible. Like we wanted to go public and have the election markets be hosted by Google. There were some regulatory barriers I think that Kalshi eventually was able to get past.

The part of the problem I’ve been working on recently is that the prediction market paradigm inside of a company assumes that all the workers have some information about what plan of action would be best, but they otherwise have no preference about what you do with this information. Like, “Should we launch a new product?” The paradigm assumes that they all know something about whether it’s gonna be a successful product, but they sort of don’t care whether you do it or not. Obviously they care. Some of the people with the best information about this new product could have a very strong preference. I heard about this situation in Asia, where the person with the best information on the new product would also probably have their career sabotaged if they launched a competing product. So that could interfere with the incentive compatibility of the market.

Seth: The incentives aren’t high-powered enough.

Bo Cowgill: That’s true. And it’s hard to think about how the incentives would ever be high-powered enough to offset this unless the company proactively designs the market differently to deal with these conflicts of interest.

Seth: I wanna follow up with Andrey’s question. This seems like a really good way to accumulate information, and maybe AI will help us do these better. Is there really an epistemic apocalypse or will prediction markets plus AI predictors save us all?

Bo Cowgill: It’s possible that prediction markets will help in this way just by making the information... it’s essentially a form of a contract. When we talked about various contracts including “pay for apply” and maybe doing a trial period at a job, all these are contractual ways of making it costly to lie. And that could possibly discipline this sort of thing.

One reason I think that the epistemic apocalypse isn’t going to fully happen is that for cases where there’s an information bottleneck, I think the economy is gonna find a way to get the information it needs so that you can hire someone for a valuable role. There’s lots of reason that buyers want to coordinate on information.

Seth: It’s positive-sum.

Bo Cowgill: Right. So that would be one reason. I think in a lot of cases, the informational bottlenecks will be closed even if you don’t have as good of positive, costly signaling as you used to. But, number one, we could just have to tolerate a lot of mistakes. And that already happens in the hiring setting. So it’s possible that we could have to tolerate even more hiring mistakes because now the signal is actually worse.

Andrey: Bo, why are we hiring anyone? I thought all the jobs will be non-human jobs. Maybe it’ll be a Coasean singularity where we’re all one-person firms.

Seth: Exactly. What is the Coasean singularity? It’s the zero bargaining frictions, and one of the bargaining frictions is information asymmetry. Bo, would it be fair to say then that you’re kind of more optimistic about convergence in sort of public, big-question information—the kinds of stuff that prediction markets are good at at scale—but you’re more pessimistic about Seth trying to send a message to stranger number three?

Bo Cowgill: That is a good distinction. The prediction markets are generally better at forecasts when there’s lots of information that’s dispersed around lots of different actors, and the market kind of aggregates this up.

Seth: And theoretically, a high-quality LLM that has a budget to do training will be a super-forecaster and will be conveying and aggregating this information, right?

Bo Cowgill: That’s true. But when we think about agents participating in prediction markets, a bunch of the theory assumes that everyone receives some independent signal or a signal with some independent noise. Insofar as everyone’s agent derives from the same three or four big labs, then they might not actually be all that independent. And that would be a reason to not think that the markets will save us.

Seth: Only if they’re not independent ‘cause they’re wrong.

Andrey: Well, even if the foundation models are the same, they may be going out to acquire different pieces of information.

Bo Cowgill: That’s true. You also have the temperature in the models that adds some level of randomness to the responses.

Andrey: No, but I literally mean, like, you have these sci-fi novels where you tell the AI to go out and find information, and that’s a costly acquisition process for the LLM. Maybe it has to interview some humans or pay for some data. I think this viewpoint that you’re just taking an identical prompt from some off-the-shelf chatbot and asking, “Hey, what’s the prediction here?” is really not the right way to think about what agent-assisted functions would be doing. Think about hedge funds: they’re all using various machine learning to trade, but it’s not like they’re all doing the same thing, even though I assume that many of the algorithms they’re using are in some sense the same.

Bo Cowgill: I see. So you’re basically more optimistic about prediction markets and AI being a combined thing that would help overcome the apocalypse.

Andrey: Yes.

Bo Cowgill: I don’t know. Well, one way in which I guess I’m a little bit more pessimistic is that, in the world that we’re just coming from, I think there is just more reliable, ambient information that you would get just from being in the environment that you could trust.

I think in the old world, you could just trust a photograph. Now it’s true that there were a lot of staged photographs even back in the day...

Andrey: Have you seen friends of comrade Stalin?

Bo Cowgill: Totally.

Seth: Losing his friends very quickly.

Bo Cowgill: But it does still feel like... maybe not stuff that you would see in the media where there were parties that would have some incentive to doctor photos. But if your friend said that they met Tom Brady, they could bust out a picture and show you Tom Brady and you could have more faith in that. Or other smaller-stakes, ambient things that might be a little bit more trustworthy now that could accumulate.

Seth: That’s the question. Does all of the little small stuff add up to an apocalypse if we’re all still agreeing at the big stuff from the top down?

Andrey: What about reputation? He’s not gonna show you fake photos, come on.

Bo Cowgill: This is true. Well, I mean, if we’re not gonna interact again, then who knows?

Seth: Zero-shot.

Bo Cowgill: You’re a sock puppet, you know?

Seth: Shit. Stay contrary.

Andrey: That’s the twist, is that this was an AI podcast the entire time. I am a robot.

Bo Cowgill: That’s funny.

Andrey: I mean, reputation is not a bilateral thing only, right? You have reputational signals that you can accumulate, and certainly for media outlets, they could form reputations. That’s kind of the point of media outlets.

Seth: In the future, everyone’s their own media outlet. Everyone’s got their own Substack. Everyone could have an LLM pointed at them saying, “Hey, keep track if Seth and Andrey ever lie or do anything bad on their podcast.” So there’s a sense in which it’s the classic AI attack-defense thing. It makes it easier to make fakes, but it also makes it easier to monitor fakes.

Bo Cowgill: I see what you’re saying. So yeah, this is why I say I think in situations where it’s high-stakes enough to form a contract and do monitoring, that we don’t necessarily get these huge amounts of information loss. But you would also get a lot of information about the world.

Actually, here’s a specific example. I have a 4-year-old daughter.

Seth: Cute. Can confirm.

Bo Cowgill: Thank you. So there was a GenAI photo of a squirrel who ate a piece of candy or something like that. It was GenAI, but it was high-quality, and the squirrel has expressive body language saying how good it is. I would know that that’s not a real squirrel, that they were trying to create a viral video. But she hasn’t really experienced real squirrels yet. So I actually think that she probably thought this was something that could actually happen. Now we’re gonna have a whole generation of people who have probably seen more fake cat videos than actual cat videos. And I just think that will accumulate, not necessarily to an apocalypse, but to some level of aggregate information loss.

Andrey: It’s interesting ‘cause I would think that it’s not the kids who are gonna be affected, but it’s the adults. Think about who are the primary spreaders of mass emails with completely unverified information.

Seth: Even better. And at the end it says, “Please share. Share with everyone.”

Bo Cowgill: Right. I mean, one answer to that is: yes, and/or why not both?

Seth: It’s attack and defense again on the squirrel thing. When I grew up, I had no idea that trees actually looked like these lollipop palm trees that they have here in Southern California. When I was reading Dr. Seuss, I thought those were made-up BS. And then I had to actually go out here to find out.

Bo Cowgill: Stuff you believe. I’m just kidding.

Seth: Fair enough. I guess what I’m trying to say is that, as a child, I was exposed to a lot of media with talking animals and eventually I figured it out. And who knows, maybe your daughter will have access to LLMs and instead of having to wait until she’s 20 to find out, she can ask, “Hey, do squirrels actually thank you and be emotive in a human-like way?”

Bo Cowgill: Yeah. What do you guys think about the idea that the rise of fake AI will actually create demand for crypto and for things being cryptographically signed as proof of their authenticity?

Andrey: Yes. I think the answer is yes. I’m very interested in ideas such as “proof of humanity.” I think on a practical level, the concepts involved in crypto are just too abstract for most people. So the success will come from essentially someone putting a very nice user interface on it, so people aren’t actually thinking about the crypto part.

Seth: The blocks. I mean, I definitely see a huge role for just this idea of timestamping: this thing went on the blockchain at this date, and if we can’t agree on anything else, at least we can agree on the original photo of Stalin with his four friends.

Andrey: I guess the big question for all of these systems is they’re not that useful until lots of people are on them. It’s a chicken-and-egg problem.

Seth: Really? You don’t think if you got the three big news services on it, wouldn’t that be standard-setting?

Andrey: Yeah. But I view that as a different and a harder ask than the timestamping. I know news organizations can do that themselves. I assume they’re actually already doing it to some extent. And normal human beings would never check. But if there was an investigation, someone could in principle check.

Seth: Well, it comes up all the time in terms of documenting war events. It’s like, “Oh, you said this was a bombing from yesterday, but this is photos from 10 years ago,” right?

Andrey: Yes. And if we had some enlightened CEOs of social media companies, they might facilitate that. It’s not clear that their business interests are actually well-aligned with that. But I think with the proof-of-humanity type stuff, you’re gonna wanna use it when everyone else is using it. Let’s say Meta wanted to verify that everyone on its platform was a unique human being. If everyone has access to proof-of-humanity technology, then that’s very feasible to do. But if only a tiny share of the population is using it, then it’s not a very effective mechanism.

Seth: What do we think? One thing we haven’t talked a lot about today, and I wanna give us a chance to at least address it in passing, is that it seems like the effect of LLMs on writing has a lot to do with how much LLMs will be doing reading. We’ve already talked in passing about how LLMs prefer the writing of other LLMs; it seems to show up in your study. It makes perfect sense. If you prompt an LLM saying, “Write the best thing,” it should be pretty good at it, right? Because it can just evaluate it itself and iterate.

To what extent is that a problem or a solution? The positive vision is the LLMs are going to be able to convey extremely detailed information and then on the other end, parse extremely detailed information in an efficient way. That’s Andrey’s Coasean singularity. But you might imagine that because now only LLMs are reading, people put less effort into submitting, and that’s the epistemic apocalypse: “Why even try if they prefer a bullshitted GenAI version?”

Bo Cowgill: Yeah, totally. Or I guess in a lot of my own prompts, sometimes I know I don’t have to describe what I’m talking about in very fine detail ‘cause it knows the context of the question and can do it. It does seem like it’s potentially a problem to me, mainly because we should still care about the human-to-AI communication pipeline, and that pipeline might actually need to go in both directions. And so if the LLMs are basically really good at talking to each other, but lose the ability to talk to normal people, then that seems potentially bad for us.

Seth: But there’s one thing LLMs are great at, it’s translating. That’s something I’m optimistic about.

Bo Cowgill: That’s true. Arguably it needs to be trained and/or prompted or rewarded somehow to do that. And maybe the business models of the companies will keep those incentives aligned to actually do this.

Andrey: Well, the models are gonna be scheming against each other, so they wouldn’t wanna tell us what they’re really conspiring to do. One final topic I wanted to get to was superhuman persuasion.

Bo Cowgill: So, Andrey I think had this provocative statement at some point that he doesn’t think of persuasion as being a big part of the effects of GenAI. I was surprised by that. I think maybe Andrey is representing a common view out there.

There’s a lot more discussion of the productivity effects of GenAI maybe than the persuasion effects. And I don’t know if at some level, without persuasion... persuasion ultimately is some part of productivity if we’re measuring productivity in some sort of price-weighted way. Because two companies could have the same exact technology, one with a bad sales force, and it might show up as one of them being a zero-productivity company.

Seth: But how much is that zero-sum? I guess the idea there would be is that sure, if Coke spends more on advertising, we’ll sell more Coke and less Pepsi. But is that positive-sum GDP or have we just moved around the deck chairs?

Bo Cowgill: In order to get the positive sum, I think you would still need to persuade someone that this is worth buying.

Seth: No, ‘cause it could be negative. You can make Pepsi shitty. You can be like, “Don’t drink Pepsi. It’s shit.” But it’s negative-sum. It’s negative GDP.

Andrey: I just wanna state precisely what I think my claim was, which is: I don’t believe in substantially superhuman persuasion. Which isn’t to say that in jobs that require persuasion, AI can’t be used. It’s just more that I don’t think there’s this super level of like, you talk to the AI and it convinces you to go jump off a bridge.

Seth: Right. So in Snow Crash, it’s posited that there’s a compiler-level language for the human brain that if you can speak in that, you can just control people. Similarly, in The Seventh Function of Language, there’s this idea of a function of language that is just so powerful, you can declare something and it happens.

Andrey: That’s the magic.

Bo Cowgill: Right. Productivity is not that many steps away from persuasion about willingness to pay or willingness to supply. And it does seem like the persuasion aspects of GenAI should be talked about more.

I wanted to bring up this ABC conjecture because I think that there’s a belief that in areas very cut and dry, like math, there is no real room for persuasion because something is just either true or not. This story about the ABC conjecture illustrates this.

There’s a Japanese professor of math who studied at Princeton and has all of the credentials to have solved a major conjecture in number theory. He puts forth this 500-page attempted solution of the ABC conjecture. A credible person claiming this is the proof. Unfortunately, his proof is so poorly written, so technical and so badly explained, that no one else has been able to follow the proof.

Seth: Or even put it in a formal proof checker. If they had put it in a formal proof checker, everyone would’ve been satisfied.

Bo Cowgill: Yes. I think that this story is interesting because it highlights that, even in something like math, it’s ultimately a social enterprise where you have to try to convince other human beings that you have come up with something that has some value.

Seth: Wait, people aren’t born with values? Without a marketing company, I would still wanna drink water.

Andrey: That’s actually not true. I mean, isn’t there the whole movement to drink more water?

Bo Cowgill: It’s true that you may have been persuaded just by your parents or your rabbi or whoever. But let’s get to a more narrow objection. As part of the motivation for this “cheaper talk” paper, we ran some surveys to try to get a sense of what people do with AI. One of the first questions was, “Think of the recent time that you’ve used GenAI. Were you developing something that you were eventually going to share with other people?” Something like 85-90% were using this on something that I would share directly with other people.

Seth: Really? I’m at like 95% of my usage is just looking stuff up for me.

Bo Cowgill: But were you looking it up and ultimately going to share this as part of a paper or a podcast conversation?

Seth: I mean, only insofar as the Quinean epistemic web of everything in the universe is connected to everything else. So yeah, if I learn about tree care, it could help me write an economics paper.

Andrey: Everything is signaling according to Robin Hanson, right?

Bo Cowgill: Sure. I think it’s fair that if this was not your intent, even two or three steps away, then you shouldn’t say yes in the survey. But anyway, a big majority of people say yes.

Then the next question, for the people who were using it for something that would be shared: “Were you using the GenAI to try to improve the audience’s impression of you?” So come up with your prior.

Seth: Hundred percent. Wait, sorry. So 15% of people use GenAI to make other people feel worse about them?

Bo Cowgill: Well, I assume these people would say that they weren’t trying to make it feel worse. They were just not trying to sort of propaganda the person.

Andrey: And to be clear, these are Prolific participants, so they’re trying to just make sure that their Prolific researchers don’t kick them out of their sample.

Bo Cowgill: Maybe. But most people who I tell these results to are like, “Well, yes, of course. I use GenAI a ton of time to help with writing, to rewrite emails, to explain something in a way that sounds a little bit nicer or smarter.” And it does seem like a very dominant use of GenAI.

If this is the case, then the fact that it’s making it easier to impress people all at once is a super interesting part of the effects. And, I know Andrey has offered his caveat about what he actually meant, but I think that would put this persuasion aspect as more of one of the central things.

Andrey: I agree that what you’re saying is interesting. It’s more the claim I was talking about where people—mostly in the Bay Area—think that super AI is gonna take over the world.

Bo Cowgill: That we’ll just turn people into puppets.

Andrey: Yeah, exactly.

Bo Cowgill: No, fine. I won’t take any more cheap shots at you.

Seth: We can bring up the Anthropic AI index.

Andrey: Well, I was gonna do the ChatGPT usage paper, but you do the AI one first.

Bo Cowgill: Of course, one of the major things that the ChatGPT usage paper says is writing.

Seth: Which interestingly, this showed up in GDPVal, is that ChatGPT seems like a little bit better at writing, and Claude seems a little bit better at coding, and it seems to show up in usage also.

Bo Cowgill: But they should break down writing. The question that this raises is: who is the writing for? And why aren’t you writing yourself? And are you possibly trying to signal something about yourself by having this clear writing?

Andrey: But I guess I truly do think, like Robin Hanson, that a vast majority of what humans do, period, is signaling to others.

Seth: Is that your claim, Bo? Or is your claim that AI is gonna make it worse?

Bo Cowgill: I’m not as Robin Hanson on “everything is signaling,” but I would just claim that this should be a more front-and-center thing that people think about with regards to the effects of the tech.

Seth: Listen. If you wanna be an economist, you gotta tell us what to study less. You can’t tell us to study everything more. What are we gonna do less of?

Bo Cowgill: I mean, I guess the easy thing would be to say human-AI replacement just because there’s so many studies on that right now.

Andrey: The productivity effects of this one deployment of a chatbot in this one company.

Bo Cowgill: Oh, yes. I can totally get on board with complaining about that.

Seth: Bo, help me get beyond it. This is what you need to do for me. People are gonna do what you said and write that paper on signal quality in one population. What’s the meta-paper? How can we get beyond that into a more comprehensive view of what’s going on? What’s your vision for research in this direction?

Bo Cowgill: Part of this goes back to the question about just what are general equilibrium effects overall? If people all become more persuasive all at once, then this totally destroys the quality of information.

Another question is, how much do the AI labs themselves actually have an incentive to build positive-covariance technology or negative-covariance technology? If part of the value of a camera is that you could take pictures and then show people and be like, “Look, this is real, this is a costly signal,” then you might actually want to keep the covariance of your technology somewhat high because this will be one use case that people would actually want.

Andrey: This is a very interesting, broader question. I was at a dinner with a few AI folks and we were talking about the responsibility of the AI labs to do academic research. We don’t expect the company that creates a tool to create the solutions to all of the unintended consequences of that tool. That to me is a very strange expectation. It seems impossible, and we don’t expect that from any other company.

Bo Cowgill: Definitely. But just to put a finer point of what I’m talking about: suppose that the covariance is so negative that you’re just getting a lot of signal jamming, to the point where now there’s just less demand for writing in general. Even if there’s still some demand, well then that less demand for writing could feed back into the underlying demand for the LLM product itself because this was supposed to help you write better, but now no one trusts the writing. And there could be something financially self-defeating about having this technology that is negative.

Seth: It would be general equilibrium self-defeating. Individually, we’d all wanna defect and use it.

Andrey: Even if one company tried to [fix it], the solution by the market is: if you really care that a human wrote this, the market will create a technology where we verify that the human is literally typing the thing as it’s happening.

Personally, I think that live performance and in-person activities in general are gonna rise up in economic value because they’re naturally... I do think humans care about interacting with other humans. We care that other humans are creating speech, art, and so on.

Seth: So those are the expressive functions of language. That’s the phatic function of, “Hey, look, I’m still alive, Grandma.” That’s the poetic function. And LLMs can’t... we don’t think it can do this performative function. It’ll be interesting to see whether AIs get enough rights to be able to make binding contracts on our behalf.

Andrey: There’s gonna be a ubiquitous monitoring technology, and every time I declare bankruptcy, it will enact.

Seth: It’ll immediately get locked in.

If I can just share my wrapping-up thoughts. I come away a little, not as scared as Bo about this epistemic apocalypse. He has scared me. But I come away thinking that it’s fundamentally kind of partial equilibrium to say, “Hey, look, we used to send signals this way. There’s a new technology that comes along. Now that signal isn’t coming through as well.” To me, that doesn’t mean communication is impossible. Now I just get to: “Okay, what’s the next evolution of the communication? Are we gonna have LLM readers? Are we gonna have verified human communication?” There seem to be solutions.

Bo Cowgill: It’s probably a little bit of an exaggeration of what I was saying to characterize it that way. But I did say that Andrey said that persuasion wasn’t important, so maybe I’m owed some exaggeration back.

Seth: Fair enough. If you put a gun to my head, I would say that information transmission will get better on net because of AI.

Andrey: What a hot take to end this.

Seth: That’s my hot take.

Andrey: You don’t hear anyone saying that. That is fun.

Seth: Who would’ve thought that the greatest information technology product of all time might actually give us more useful information?

Andrey: No, no, no. You’re only allowed to be pessimistic, Seth. That’s the rules of the game.

Bo Cowgill: So Seth, do you think this is mainly because people will be able to substitute away from other things?

Seth: It’s partially that. I think what you’re identifying in this paper is definitely important. But it does seem like this is transitional and that more fundamentally, LLMs help us say more and help us hear more. And so I think once the institutional details are worked out—and of course that’s a lot of assuming a spherical cow—there will be better information in the long run.

Andrey: There are even entrepreneurial activities that one could undertake to try to amend some of the concerns raised by this paper. We oftentimes take this very observer perspective on the world, but certainly we could also, if we think that a solution is useful, do something about that.

Seth: Right. We will sell human verification. We will verify you are a human. If you pay us a thousand dollars, we will give you a one-minute spot on this podcast where we will confirm you are human.

So Bo, I guess we’re just a little bit different on this. What do you think?

Bo Cowgill: Well, I do agree that the paper was proof of concept and partial equilibrium, and what happens in the general equilibrium... we’ll just have to figure out in future episodes of Justified Posteriors.

Andrey: Yeah. Well, thanks so much, Bo, for being a great guest.

Seth: And Bo, both you, everybody else, keep your posteriors justified.

Does AI Cheapen Talk? (Bo Cowgill Pt. 1)

Andrey Fradkin — Tue, 18 Nov 2025 04:46:53 GMT

In this episode, we brought on our friend Bo Cowgill, to dissect his forthcoming Management Science paper, Does AI Cheapen Talk? The core question is one economists have been circling since Spence drew a line on the blackboard: What happens when a technology makes costly signals cheap? If GenAI allows anyone to produce polished pitches, résumés, and cover letters, what happens to screening, hiring, and the entire communication equilibrium?

Bo’s answer: it depends. Under some conditions, GenAI induces an epistemic apocalypse, flattening signals and confusing recruiters. In others, it reveals skill even more sharply, giving high-types superpowers. The episode walks through the theory, the experiment, and implications.

Transcript:

Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors about the economics of AI and technology. I’m Seth Benzell, certifying my humanity with takes so implausible that no softmax could ever select them at Chapman University in sunny Southern California.

Andrey: And I am Andrey Fradkin, collecting my friends in all sorts of digital media formats, coming to you from San Francisco, California. Today we’re very excited to have Bo Cowgill with us. Bo is a friend of the show and a listener of the show, so it’s a real treat to have him. He is an assistant professor at Columbia Business School and has done really important research on hiring, on prediction markets, and now on AI and the intersection of those topics. And he’s also won some very cool prizes. I’ll mention that he was on the list of the best 40 business school professors. So he is one of those professors that’s really captivating for his students. So yeah. Welcome, Bo.

Bo Cowgill: Thank you so much. It’s awesome to be here. Thanks so much for having me on the podcast.

Seth: What do you value about the podcast? That’s something I’ve been trying to figure out because I just do the podcast for me. I’m just having a lot of fun here with Andrey. Anything I can do to get this guy’s attention to talk about interesting stuff for 10 minutes? Why do you like the podcast? What can we do to make this an even better podcast for assistant professors at Columbia?

Bo Cowgill: Well, I don’t wanna speak for all assistant professors at Columbia, but one thing it does well is aggregate papers about AI that are coming out from around the ecosystem and random places. I think it’s hard for anybody to catch all of these, so you guys do a great job. I did learn about new papers from the podcast sometimes.

Another cool thing I think is there is some continuity across podcast episodes about themes and arbitrage between different topics and across even different disciplines and domains. So I think this is another thing you don’t get necessarily just kind of thumbing around papers yourself.

Seth: So flattering. So now I can ask you a follow-up question, which is: obviously you’re enjoying our communication to you. A podcast is kind of a one-dimensional communication. Now we’ve got the interview going, we’ve got this back and forth. How would you think about the experience of the podcast changing if a really, really, really good AI that had read all of my papers and all of Andrey’s papers went and did the same podcast, same topics? How would that experience change for you? Would it have as much informative content? Would it have as much experiential value? How do you think about that?

Bo Cowgill: Well, first of all, I do enjoy y’all’s banter back and forth. I don’t know how well an AI would do that. Maybe it would do a perfectly good job with that. I do enjoy the fact that—this is personal to me—but we know a lot of the same people. And in addition to other guests and other paper references, I like to follow some of the inside jokes and whatnot. I don’t know if that’s all that big of a deal for the average person. But I have listened to at least the latest version of NotebookLM and its ability to do a quote-unquote “deep dive podcast” on anything. And at least recently I’ve been pleased with those. I don’t know if you’ve ever tried putting in like a bad paper in theirs, and then it will of course just say, “Oh, this is the greatest paper. It’s so interesting.”

Seth: Right.

Bo Cowgill: You can.

Seth: So that’s a little bit different, maybe slightly different than our approach.

Bo Cowgill: Well, yeah, for sure. Although you can also tell NotebookLM to try to find problems and be a little bit more critical. And that I think works well too. But yeah, I don’t think we should try to replace you guys with robots just yet.

Seth: We’re very highly compensated though. The opportunity cost of Andrey’s time, he could be climbing a mountain right now. Andrey, you take it up. Why are we doing this ourselves? Why isn’t an LLM doing this communication for us?

Andrey: Well, mostly it’s because we have fun doing it, and so if the LLM was doing it, then we wouldn’t be having the fun.

Seth: There you go. Well put. Experiential value of the act itself. Now, Bo, I did not bring up this question randomly. The reason I raised this question of how does AI modify communication... yeah, I used a softmax process, so it was not random. The reason I’m asking this question about how AI changes communication is because you have some recently accepted, forthcoming work at Management Science trying to bring some theory and empirics to the question of how LLMs change human communication, but now in the context of resumes and job search and job pitches. Do you want to briefly introduce the paper “Does AI Cheapen Talk?” and tell us about your co-authors?

Bo Cowgill: Yeah, most definitely. So the paper is called “Does AI Cheapen Talk?”. It is with Natalia Berg-Wright, also at Columbia Business School, and with Pablo Hernandez Lagos, who is a professor at Yeshiva University. And what we’re looking at in this paper is the way people screen job candidates or screen entrepreneurs or, more abstractly, how they kind of screen generally. You could apply our model, I think, to lots of different things.

But the core idea behind it kind of goes back to these models from Spence in the 1970s saying that costly signals are more valuable to try to separate types.

Seth: Right. If I wanna become a full member of the tribe, I have to go kill a lion. Why is it important for me to kill a lion? It’s not important. The important part is I do a hard thing.

Bo Cowgill: Exactly. Yeah. So maybe part of the key to this Spence idea that appears in our paper too is that it’s not just that the signal has to be costly, it has to be kind of differentially costly for different types of people. So maybe in your tribe, killing a lion is easy for tough guys like you, but for wimpier people or something, it’s prohibitively high. And so it’s like a test of your underlying cost parameter for killing lions or for being tough in general. So they go and do this. And I guess what you’re alluding to, which appears in a lot of cases, is the actual value of killing the lion is kind of irrelevant. It was just a test.

And maybe one of the more potentially depressing implications of that is the idea that what we send our students to do in four-year degrees or even degrees like ours is really just as valuable as killing a lion, which is to say, you’re mainly revealing something about your own costs and your own type and your own skills, and the actual work doesn’t generate all that much value.

Seth: Is education training or screening?

Bo Cowgill: Right, right, right. Yes. I do think a good amount of it these days is probably screening, and maybe that’s especially true at the MBA level.

Andrey: I would just say that, given the rate of hiring for MBAs, I’m not sure that the screening is really happening either. Maybe the screening is happening to get in.

Bo Cowgill: What the screening function is now is like, can you get in as the ultimate thing?

Seth: Right. And I think as you already suggest, the way this works can flip if there’s a change in opportunity costs, right? So maybe in the past, “Oh, I’m the high type. I go to college.” In the present, “I’m the high type. I’m gonna skip college, I’m gonna be an entrepreneur,” and now going to college is a low signal.

Bo Cowgill: Yes. Exactly. So that’s kind of what’s going on in our model too. How are we applying this to job screening and AI? Well, you apply for a job, you have a resume, possibly a cover letter or, if you don’t have an old-fashioned cover letter, you probably have a pitch to a recruiter or to your friend who works at the company. And there are kind of elements of costly signaling in those pitches. So some people could have really smart-sounding pitches that use the right jargon and are kind of up to speed with regards to the latest developments in the industry or in the underlying technology or whatever. And those could actually be really useful signals because the only sort of person who would be up to speed is the one who finds it easy to follow all this information.

Seth: Can I pause you for a second? Back before LLMs, when I was in high school, they helped me make a CV or a resume. It’s not like there was ever any monitoring that people had to write their own cover letters.

Bo Cowgill: That’s really true. No, some people have said about our paper that this is a more general model of signal dilution, which was happening before AI and the internet and everything. And so one example of this might be SAT tutoring or other forms of help for high school students, like writing your resume for you. Where if something comes along—and this is where GenAI is gonna come in—but if anything comes along that makes it cheaper to produce signals that were once more expensive, at least for some groups, then that changes the informational content of the signal.

Seth: If the tribe gets guns, it’s too easy to kill a lion.

Bo Cowgill: Yeah. Then it just is too easy to kill the lions. But similar things I think have happened in the post-COVID era around the SATs. Maybe it’s become too easy, or so the theory goes, to get one, where it doesn’t really separate out who is actually a smart person. Maybe it’s getting diluted with who can afford these prep classes and things like that. But I don’t wanna stray too far from GenAI just yet.

You know, I think people have seen a lot about this, either on social media or in the mainstream, is like, the signal in a job application seems like it may have gone down because you used to be able to tell based on these pitches who is qualified or not. And even without lying, you could write a much better pitch that would make you sound really more knowledgeable, even without misrepresenting what your underlying experience is. And so it’s really, I think, not just job applications. That is of course the setting that we study, that and entrepreneurship. But I think there are similar things about how grading at schools has gone bad. You used to be able to quickly tell from an assignment who knew the material and who did not. But now ChatGPT is gonna really interfere with that.

Anyway, so with this as background, we then try to study theoretically and empirically what’s going on with the use of ChatGPT in these sort of costly signaling settings.

Andrey: Yeah. And so how do you go about doing this? Because it does seem like it’ll be pretty hard to study this in the wild. I know of a few papers from some of our friends that have done this. How did you approach this?

Bo Cowgill: So the first thing we wanted to do was kind of motivate the question a little bit more theoretically. So probably at least the first half or so of the paper, we create this model that has what I hope is a tractable punchline, which is that it’s actually not inevitable that GenAI would create this epistemic...

Seth: Wait, a tractable punchline? Wasn’t the punchline that anything goes? What’s the punchline?

Bo Cowgill: Well, I am glad that we brought up the “anything goes” theory models, which is another kind of theme of your podcast and critique of previous papers. So it is true that our model basically says that depending on a particular parameter, you could get either an epistemic apocalypse or a situation where the use of GenAI actually improves the accuracy of screening. And it’s like, you get better information, you actually want your job candidates. You want to say, “Please use GenAI. We actually will know better. Don’t send your pitch in without using GenAI first.”

So it’s true, anything goes. And my defense of that is we really focus the reader on this particular parameter that you could measure empirically.

Seth: Are there other parameters that theoretically could affect this, though?

Bo Cowgill: Not that we’re talking about in this paper. No.

Seth: Not in this paper. All right.

Bo Cowgill: If you have some in mind, I’m curious.

Seth: Well, let’s come back. So I have some thoughts at the end about interpreting the results, so we’ll come back to that. You can just keep on walking us through what you did.

Andrey: I guess I wanted to say there’s an approach in economics, a sufficient statistics approach, right? Where you write down a model where there is a particular parameter that, depending on how big it is or what sign it is, that tells you something about what is the right policy or what is the mechanism that quote-unquote “dominates” a particular setting. And so I view what you guys were doing very much in that vein.

Seth: Right. A ceteris paribus sort of analysis. Yeah.

Bo Cowgill: That’s true. So what are we focusing on? What is the key linchpin of this model? It’s a covariance term across the population. So let me try to break this down.

The two terms in the covariance are, first of all, how much human capital do you have? Or are you like a talented person who knows a lot about what you’re doing, you have a lot of expertise or not? And we’re sort of assuming that the employers are trying to screen for that. Why are they screening for it? Well, in an actual job, you could be in a situation where you don’t have to use GenAI, or you can’t use it and you have to just use whatever knowledge is between your ears. So this one term is your kind of level of talent for the job without AI assistance. And then the other term is how much of a boost does your cover letter get from using ChatGPT to sex it up and to make you sound like you know all the smartest, most contemporaneous jargon?

So these two things could be positively... it could have positive covariance, they could have negative covariance, they could have basically no covariance. But the intuition is, if you have a positive covariance, then the most talented people are getting the largest bump from using GenAI. And the negative covariance would be if the really talented people don’t really get that much of a cover letter improvement, maybe because it’s already so good that there’s nowhere else to go, and that most of the benefit comes from improving the low types’ quality of their cover letter. So this is the linchpin parameter in the model, and what we try to take to data after this.

But just to finish up what’s going on in the theory: well, you get totally different screening results depending on what that parameter is. In the case I think that people are most expecting, you have this negative covariance where most of the benefit comes from making low types and helping them masquerade as high types. And in this negative covariance world, there’s not really that much benefit to high types for using GenAI ‘cause their cover letter or their application or whatever, it’s just already so good. So insofar as this is happening, we want to quantify that empirically. But there’s also this possibility that GenAI puts the high types... it gives them superpowers and they can do even more amazing stuff.

Seth: Right. Can I jump in here? I don’t think you have to interpret it as superpowers, right? If we’re thinking about communication generally, you might imagine that high types have the higher opportunity costs of their time, right? And so there’s some sense in which automating an hour of high-type time is like more money than automating an hour of low-type time. I guess to really understand how this plays out, I’d have to think about how many discrete versions of this is the high type sending out to prospective employers, right?

Andrey: And I guess maybe I’ll add on to that. It depends on what we’re screening for. You’ll get to this in your experiment, but like if the high type has verifiable high-type traits, which is oftentimes the case, assuming they’re not lying on their resume, right? Then what does something like a cover letter reveal? It’s some sort of effort. Right? And so the question... in my mind, cover letters are oftentimes screening for effort, which seems very... take the time to customize a cover letter for this particular job.

Seth: The effort is cheaper for poor people.

Andrey: It’s so it’s kind of a little bit of a different interpretation than like skill per se, because skill... I think it’s unlikely that cover letters signify skill in many domains. Certainly hiring, letters are essentially not read.

Seth: Essentially ignored. I mean, unless they say, “talk to my co-author, blah, who you know,” unless there’s like, “do this thing to learn about me” information in it. Right.

Bo Cowgill: Yeah. Interesting. There’s like a number of things to follow up on there. I do think that there have been big things missed in the study of hiring generally from trying to generalize from academic hiring to other things.

Andrey: Yeah.

Bo Cowgill: I’m not even sure I agree that cover letters are not read either in economics or at least in adjacent places like business and policy schools. And the fact that you think that is probably just a reflection of you guys going to such fine universities that you assume everyone would take the job if you were... I don’t want to pick on any one university.

Seth: Directional state.

Bo Cowgill: Yes, exactly. If you were from University of Southwest Kentucky, which is where I grew up, so I’ll pick on it, it could be very worthwhile to signal that you’re actually interested.

Seth: But again, perfect. But then we’re not signaling skill. You’re signaling match or you’re signaling effort. Right.

Andrey: So it’s a question of what... really this correlation really depends on what is the signal that’s being sent, I think.

Bo Cowgill: Sure, that’s true. But this particular conversation I think has gone off in the direction of cover letters, but candidates also use GenAI to fill in, for example, the bullet points of what they did in a particular job.

Andrey: Yeah. Yeah, yeah.

Bo Cowgill: Where there’s an enormous amount of leeway for describing your job as a super high-impact thing that required you to be an agentic leader or something else. And this is a case that’s not cover letters, but is part of your pitch, where it could actually signal different underlying skills.

So there are lots of ways I think, to apply these ideas in different settings. And it’s true that there’s probably some follow-on work that would be useful, and we can talk about some follow-on work that other people are doing and that I and my co-authors are thinking about doing too.

Seth: Don’t solve it all in one paper. So tell us. So that’s the theory.

Andrey: How dare you not solve it in one paper.

Bo Cowgill: Yeah, yeah, yeah. So you could get these opposite sorts of things. You know, some people think, “What are you talking about? How could there be positive covariance? That’s ridiculous.” I have some examples in mind. In the paper, we talk about AI art. So I’m not an artist and I don’t think you guys are either, but if I used art with DALL-E, I think I’d be a little bit better. But there’s some evidence and some anecdotes and even some small studies that say like, if you actually know how to describe art as a trained artist would, then you can use these AI art generation programs to make way cooler art. And so like if you were screening an artist, you would want them to use GenAI because then you would be able to see the big differences. And even just some screenshots from these demonstrations I think would show how much better the actually trained artists would be, or the high type would be, once they use GenAI.

Now another example of this to me is using AI for math. Now maybe it’s just gotten so good that it can just solve whatever, but I think if you gave a difficult economic theory theorem to prove to a total novice, as somebody who hasn’t gone to a PhD or a high school kid or a middle schooler or something, like, they might not make very much progress. But if you gave someone who had trained or had some intuition for what the solution is, then I think it would be more powerful and actually like... having this sort of result that you could do something with. But it’s true, our model basically isn’t anything goes, but it kind of focuses on this covariance parameter as the thing to pay attention to.

Andrey: It could be positive. So oftentimes, if you’re doing an interview process, there is like a take-home component, like for a data science job that might be a take-home analysis and a dataset and a report, right? In some sense, you can make it... the ceiling for this assignment is very, very high. Right?

Bo Cowgill: Yeah.

Andrey: And someone who actually knows what they’re doing would be able to do a much, much better job. Like there’s a sense that the GenAI tools might raise the bottom of the distribution, but if you want to get close to the max, the people who really know what they’re doing might actually benefit a lot more from the tools.

Bo Cowgill: That’s true. That’s right. Yeah. Well, something your comments, Andrey, make me think about is just the even the idea of a max. And one reason I think that we’ve seen a lot of negative covariance applications is that the underlying test has been designed with a maximum that... there are too many people that are actually close to. And if the test had more sort of headroom to go arbitrarily good, that might, even just that change alone, might make it more possible that GenAI can actually help find the truly talented people as opposed to making the people that ate their homework masquerade.

Seth: No, I was just gonna jump in. I wanna propose a hypothesis for why negative correlations might be common, generally. So you might imagine... rather, not generally, in experimental settings, in experimentally relevant settings. Why do I say that? Imagine if your quality as a worker is both a function of the stuff that can be automated by GenAI and stuff that can’t be automated by GenAI, right? So I’m a worker. I have to do both of these tasks, but maybe I’m gonna delegate some of the automatable-by-GenAI tasks.

If we’re all applying for a job which is kind of at the same sort of productivity threshold, and we’re all kind of assortatively matching to like, we’re applying... we’re not applying to the corner bodega and we’re not applying to Google. We’re all applying for this mediocre firm. For us to have the appropriate skill, total productivity for a mediocre firm, I have to kind of be good at one thing and bad at another. So these like productivity isoquants of given workers will imply a negative correlation between skill in the automatable thing and skill in the non-automatable thing.

Bo Cowgill: Uh.

Seth: So it doesn’t surprise me that if you get a population which is pretty homogeneous in terms of like total productivity, that’s going to entail a negative correlation in the automatable versus non-automatable skill. So that’s why I think this is gonna be common.

Bo Cowgill: Okay. Interesting. I’m curious, I think one of the places where you see negative covariance the most seems to be in the classroom. I guess how does this isoquant idea apply there? Or is it just like, because it’s education and not an actual job that it doesn’t really apply?

Andrey: Well, my thought process would be there is like a lot of assortative matching between programs and students, right? So...

Bo Cowgill: Ah, I see. Yeah. Okay. Okay. Perfect. Yeah.

Seth: But as I wanna complete my idea. So to complete my idea, actually I’ve realized that I’m pointing in the wrong direction, right? For the AI to boost the overall lower total productivity person more, what it needs to do in terms of the job application, is boost them disproportionately at writing job applications, right? This is your notion of how correlated is your actual skill with your ability to write the resume with and without the GenAI. Right. And I think in the general population, it’s probably the case that your ability overall and your ability with AI are positively correlated, in which case, this would be a noisy signal that would mess you up. But if we had like a narrow enough band of quality coming in, it would go in the other way. So maybe there needs to be like a level of screening before the screening. But we haven’t even let you get to the results yet. We’re still in theory.

Bo Cowgill: No, no, no. I think it’s great, as part of the podcast genre, to have some tangents here and there. So in the empirical part of our paper, we’re just trying to measure like how much actual information loss is there? And is it possible that for certain subgroups you actually get information gain? And also, what is this covariance? Is it kind of more positive or negative?

And the key to understanding our experiment is that we actually know something about all the subjects in it and what their “high” versus “low” type is before they even enter the experiment. So I’ll tell you a little bit more about the setting. We are looking at job seekers on Prolific who are in the market for either a data science job or a consulting type of...

Andrey: So Bo, just to clarify ‘cause I do think this might be unclear to the participants. These people are not actually looking for a job. You are recruiting them into an incentivized survey of some sort, right?

Bo Cowgill: That’s true. They do have experience in these respective domains. And so, insofar as this is an incentivized experiment, we have recruited subjects with domain-appropriate knowledge, at least in some cases.

Seth: Can you explain what... do you look at their CVs, or this is something Prolific tells you that they’re experts versus non-experts?

Bo Cowgill: Yeah, Prolific screens them beforehand. And so they’re a little bit unclear about how exactly they screen these people.

Seth: Unclear about what makes someone an expert.

Bo Cowgill: Fair enough.

Andrey: So to be clear, my interpretation is that no one in this paper is an expert. There would be no way any expert in data science would...

Seth: ...for $12 an hour.

Andrey: ...in this sample.

Bo Cowgill: Sure. Well, you sound like one of our referees.

Andrey: Not... I, just to be clear, I am definitely not your referee.

Bo Cowgill: Okay. Yeah. I think the underlying theory doesn’t require anyone be like, elite at any of these things. There just has to be variation within the population about who has relatively higher or lower human capital and that this be...

Seth: Bo, can I pause you for a second there? ‘Cause one of the main outcomes is gonna be whether people’s predictions of whether someone is an expert move closer to 50/50 or not. Right? But presumably, if the signal is getting less informative, you should move to the population average of experts versus non-experts, not 50/50.

Bo Cowgill: Well, the experiment was set up such that the population average was 50/50.

Seth: You tell... well, so you have a measure of whether these people count as experts, right? And in your sample, 50, approximately 50% are experts and 50% are non-experts. As a person reviewing these, have you told me that 50% are experts according to your classification?

Bo Cowgill: Yes. Now, interestingly, their actual beliefs... they don’t seem to totally believe that because on average they think about 45% are experts. And interestingly, they think that about 45% are experts both in the GenAI and the non-GenAI condition. So it’s possible that they would’ve just totally updated their beliefs based on all these amazing cover letters and pitches and little resumes in the experiment and said, “Oh, these people must all be really good.”

Seth: But what actually happened? Okay, but you tell us the treatment. Yeah.

Andrey: So I think, to be helpful to the listeners, the experimental...

Seth: Why do that?

Andrey: ...unit of randomization, the treatment, et cetera.

Bo Cowgill: Yeah. So in our experiment, we recruit people with job experience in the various domains. And we ask them to make a pitch for both a job that they’re qualified for based on what Prolific knows about them and a job that they are not qualified for. So everyone either has domain expertise or prior experience in some sort of data science or some sort of management consulting type of job. So basically everyone is asked to masquerade a little bit to be as qualified as possible for a job that they really didn’t have any prior experience.

And so they write these pitches and then they’re asked to use ChatGPT to edit them to try to make them essentially more convincing. So this is the sender side of the experiment. And then on the receiver side, we get basically people with hiring experience or recruiters to then evaluate these different ones and try to label who are the people that have actual expertise and who are the ones who don’t. It’s essentially like asking, “Who would you wanna hire?” And the recruiters get to know who was using GenAI or not.

Seth: Be very... this seems to be a very important distinction here, so be very clear. They’re told who uses it or who has access to it?

Bo Cowgill: They’re told who has access to it. And our goal there is we’re trying to think about the long-run implications of GenAI on signal dilution. And I think we’ve arguably already reached a world where, if you read a cover letter or you read a resume, it’s probability one that they had access to GenAI.

Seth: Not just probability one-hyphen. It’s a major insight that you just got.

Bo Cowgill: Right.

Andrey: Certainly.

Bo Cowgill: Exactly. But the experiment I don’t think is good... it doesn’t capture, say, the 2024 era very well.

Seth: Remind us when. When is this happening? When are you doing this study?

Bo Cowgill: This happens in 2023. And I think that there’s an intermediate period where there’s some uncertainty about whether this person had access or not. But the long-run implications between the pre-GenAI world and the post-GenAI world, these are the more interesting ones I think to my co-authors and I.

Seth: The correct treatment. Yes. I totally agree that it makes sense that the treatment is “these people got access to AI” rather than “they used AI for exactly this sentence” because that’s the more empirically relevant. Yeah.

Bo Cowgill: Right. Yeah. It’s also possible that the control group could have used GenAI as well. And so we asked them just to make sure, but basically almost none of them did. And we removed the instances where...

Andrey: So I had a very, but a positive, you know, a constructive comment for you, which is that you could...

Seth: Oh shit. This is gonna be devastating.

Andrey: No, no. It’s actually constructive. You could just use one of these AI writing detectors, the good one from Alex Imas’s paper, to see whether they actually use the GenAI or not.

Bo Cowgill: Yeah, no, this is a good idea. This is a good idea. Well, if it hadn’t already been accepted, I think that would definitely be worth checking out.

Seth: And one detail you skewed is that people who use the GenAI, their CVs get way better according to GenAI.

Bo Cowgill: That’s true. That’s true. Yeah. So when basically we have these recruiters assess, they assess several things. One is just like, do they think that the pitch is generally higher quality? Or, does it seem like it required more effort to produce? And, or does it sound kind of polished and like the person knows what they’re talking about?

Seth: Wait, what’s the exact prompt? No, I actually am very curious. Which of those versions is what you ask?

Bo Cowgill: It is, “What’s the quality of the pitch?”

Seth: Quality, right? Because it’d be very interesting if you got a different result for “How much effort do you think you put in?”

Bo Cowgill: No, that’s our theoretical interpretation.

Seth: Fair enough. But hey, why not ask?

Bo Cowgill: True. Yeah. It is. I think it was important. We didn’t ask them how convincing it was because that’s actually a separate question, which opens up the idea that like, “Yes, this is a higher quality pitch, but because we know it’s now become suddenly super cheap to make a pitch like this, we’re actually not very convinced by it.” So this is the other main outcome variable. “Who do you think is actually an expert?” or “How convinced are you?”

And on average, we see information loss from the conditions where the candidate was able to access GenAI. And so this is about a 4% to 9% information loss, or a 4% to 9% decrease in accuracy.

Seth: Oh, can I pause you for a second? ‘Cause so there’s two measures we’re gonna use as how accurate are these screeners? The first one we talked about just now, which is how close are you to just 50/50 as to whether this person is an expert. So obviously you have zero information if you say that they’re a 50/50 expert, but if you were 100% one way or zero, you’d be confident. And then the second thing you get at, right, is this error measure, which is the difference between whether the person’s actually an expert or not, which is this 1/0 binary. And then people can kind of continuously say, “I think this guy’s an 80% expert,” or “I think this guy’s a 20% expert.” And specifically when you say that information transmission went down, which of those measures are you talking about, or both?

Bo Cowgill: Uh, both. The 4% to 9% represents... one of them is using one of these outcomes and the other one is using the other one. And so basically we’re trying to say, you could use a variety... either of these ways to measure accuracy and you qualitatively get the same thing.

And so, what should you make of this 4% to 9%? So I think the information apocalypse people think like, “Wow, that’s it? Only 4% to 9%? This is not very much.” I think that’s a fair point. Now, if you think about... actually another detail that I’ve left out is we studied... we ran this experiment essentially on hiring and with recruiters and hiring managers. And then we also did a similar one in the domain of entrepreneurship with people that were interested in starting a new business, some of whom had no prior expertise in the type of business that they were pitching. And the evaluators here were people with some sort of investing experience. We broadly see the same thing and can’t differentiate the two different domains with regards to the key outcomes and the intermediate values.

So, but we should get back to this 4% to 9%. But one very interesting result, I think, is that when the receivers of these signals are evaluating its quality, we see this huge collapse in the variance of these signals. So it basically looks like everyone’s pitch starts to look pretty good. Without GenAI, they’re all kind of spread out, which is useful for disambiguating who has a good pitch and who has a bad pitch, or who has high underlying experience and human capital or not. But the GenAI kind of homogenizes all of them. And that’s the intuition behind why there’s this information loss.

Seth: So just to understand. Let me understand that a little bit better. So I understand that we’re bringing up the bottom, right? The really bad resumes and pitches get upgraded. Are we also dragging down the top? Or are we just making it more linguistically similar? Understand, tell me... understand what’s happening for the pre-GenAI top performers.

Bo Cowgill: So they’re getting bumped up, just not by very much. So if all types were moving up in quality by an equal amount, then you would just kind of shift the quality to the right between the no-GenAI and the GenAI treatments. But what we see is that the even the high types go up by a little bit, but just not by very much with regards to their application quality or their pitch quality. Meanwhile, the low types are going up a lot, which then pushes them next to the high types and they’re now looking very similar to each other with regards to the quality.

We could also look at linguistically, are they using the same underlying words? We didn’t look directly at that, but I think it’s likely given what we’ve seen in other domains that use of GenAI makes everybody kind of sound a little bit, not just similar quality, but actually using some of the same underlying words.

Seth: Such a similar quality-hyphen, almost identical.

Bo Cowgill: Exactly. Right. M-dashes and using the word “delve” a lot and stuff like this.

Seth: Oh yeah.

Bo Cowgill: Yeah. So on average you lose information. I think the 4% to 9%... there’s not a lot of information to begin with. It’s like a very well-replicated finding that it’s hard to hire people and it’s hard to pick diamonds in the rough before they have much of a track record. Even if they have a track record at other companies, the match-specific aspect can be hard to pick up on. And if you think about an investor who had 4% to 9% lower returns—and one of our applications is actually in investing—then like, I think that would be a problem for the success of their business.

Andrey: But I mean, so I’m now going to make the point about, like, I really don’t care about whether this is a big or small effect. ‘Cause I don’t care about your setting. Not like it’s a bad setting to show how this would work in practice in a real setting we cared about, but like clearly Prolific people rating each other is not really something where we specifically care about the parameters that we estimate. For example, for an investment pitch, no one actually makes investment decisions based on a written artifact and that’s that. Right? Or you’d have to be pretty crazy to do that.

Bo Cowgill: So I will hard disagree on that.

Seth: Ooh, ooh, spicy.

Bo Cowgill: The most common place to get turned down from a startup pitch is before you even walk in the door, when you send your text-only pitch to an investor or an angel investor or a VC. Text-only, maybe some mostly-text slides. You send that in. This is where most people are eliminated. They don’t even get in the room.

Seth: I guess what Andrey would say is the marginal guy who gets into the room is never gonna get the deal.

Andrey: Yeah, I mean, that’s kind of...

Bo Cowgill: I don’t know if I even agree with that. I think that VC investing is probably really noisy as well. I mean, they lose a ton of money and not everyone agrees. I mean, there are these cases like Google where they had two top-tier investors, but I think that there are cases where people didn’t necessarily expect it.

Andrey: I don’t think... no, no. I really think if you wrote down plausible distributions here, it would be almost surely that this is really affecting people with very low probability of investment just to get... right? Because the baseline rate of investing is so low, even conditional on getting past that initial stage. Right.

Seth: And even if we take a step back, if we think about just AI as a technology that is good at automating the low-skill thing but leaves the high-skill thing less affected, you would expect that the more advanced setting, the setting with more applications, if we’re just taking the arg max, maybe it doesn’t matter so much that we’re mixing up the middle a little bit.

Bo Cowgill: I see what you mean. Yeah. Interesting to keep on studying this.

Andrey: I guess like, that’s what I was... I was really pushing back on just this... I would not... like, I like the paper, I think, viewed as a proof of concept, but I would not take anything literally. So I’m very uncomfortable with statements as like “investors would lose this many returns” and just in general, right? Like it’s not... lab experiments are great, but they’re not gonna...

Seth: Andrey would only trust this study if people reported 0% of these people are experts.

Andrey: Yeah.

Bo Cowgill: It is a proof-of-concept sort of paper and this is something we talk about in the discussion.

Andrey: Yeah.

Bo Cowgill: And yeah, it’s totally fair to say, I don’t know how...

Andrey: I guess I was gonna offer you a chance to say something about other papers. ‘Cause now there are a few other papers that are kind of trying to get at similar mechanisms.

Seth: Perfect. Do the meta-analysis live for us.

Andrey: I assume you’ve thought about it. Yes.

Bo Cowgill: I have seen some other papers in this area and they all look super cool. I guess the ones that I know best, although I don’t know every detail, are by, first of all, a PhD student at Princeton. And then a couple of PhD students at Yale that are both studying a change in Freelancer.com that happened when they released a GenAI basically cover letter tool to help your pitch if you were a freelancer.

And in various ways, I don’t want to speak on behalf of those authors, but it seems like, at least in those cases, there was this negative covariance idea where it seems like it actually harmed what used to be good signals about your match quality. And the way that the freelancers would do that was they use the GenAI tool to customize their pitch to look exactly like the requisition, or as much as possible, without lying. I don’t think they established there was no lying, but this is how they were doing it. So at least in these other domains, it seems like there’s some evidence that GenAI is similarly messing up signal accuracy and signal quality.

Andrey: Then there’s also, I think Emma Wilds has a paper, right? There’s a couple of papers on this, if I remember correctly. In one of them at least, they get access to the GenAI tools and that increases overall hire rates on the platform. Am I remembering that correctly or?

Bo Cowgill: That’s right. That’s right. And then at least in that case, they don’t find any sort of ex-post regret. And so, which might indicate that they were fooled and they were sent... unhappy. So this is a little bit more positive of a finding.

Seth: Are you... will you go out there? Will you now say, “And the reason that they found that GenAI was good was ‘cause...” Is this... they must have had a positive correlation between true skill and benefit from GenAI. Do you wanna make that claim in that population, in that context?

Bo Cowgill: Right, right. To be more clear about what they find, at least what I remember is them finding... is they don’t actually find that hiring improved. They just find a noisy enough covariance that they can’t reject... that they can’t sign it.

Seth: They fail to reject.

Bo Cowgill: Right. Right. So, not trying to start something here, but I thought like, well, maybe this is more of a somewhat ambiguous finding. And I also think that it’s presented not as “hiring actually improved,” but “we cannot reject that hiring actually got worse.” So then, maybe more precise tests will change this.

Andrey: So to be clear, the quality of the... we’re talking two things: the quality of the hires and the total number of hires, which are different numbers. And I think you’re talking about the quality of the hires. Is that right?

Bo Cowgill: That’s right. I think that the paper by Emma and John on this other freelancer platform, possibly the same one, you know, we don’t know.

Andrey: Truly a mystery which platform.

Bo Cowgill: Yeah. The employer can rate the freelancer. And so, if I recall their paper correctly, I think that they’re looking at those ratings and saying, it’s not like in the treatment group where you had these amazing cover letters, everyone was disappointed ex-post with what happened.

I mean, there’s a lot of other stuff that could go on there. It could be that they were super disappointed initially, and then the freelancer is like, “Oh, sorry. Well, I kind of masqueraded. Why don’t I do some extra work for you?” or adjust some other margin. But the punchline of our theory model is that this isn’t forced to go any single way. And it could totally be happening this way.

Seth: And be... but yeah. So I guess maybe let’s wrap up this idea of like external validity, right? Which is, the model seems to really imply that this will be super population- and context-dependent. And if the model implies that it’s gonna be super population- and context-dependent, then taking a snapshot in one place at one time can only tell you so much about everywhere else.

Bo Cowgill: I agree. I don’t think we’re trying to sell this as like, this is gonna happen everywhere, at least not on the basis of these results. Now, an interesting podcast discussion I think would be like, what did we expect? And we can go into that more speculatively.

Andrey: Well, let’s go to speculation mode.

Evaluating GDPVal, OpenAI's Eval for Economic Value

Andrey Fradkin — Tue, 04 Nov 2025 00:57:39 GMT

In this episode of Justified Posteriors podcast, Seth and Andrey discuss “GDPVal” a new set of AI evaluations, really a novel approach to AI evaluation, from OpenAI. The metric is debuted in a new OpenAI paper, “GDP Val: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.”

We discuss this “bottom-up” approach to the possible economic impact of AI (which evaluates hundreds of specific tasks, multiplying them by estimated economic value in the economy of each), and contrast it with Daron Acemoglu’s “top-down” “Simple Macroeconomics of AI” paper (which does the same, but only for aggregate averages), as well as with measures of AI’s use and potential that are less directly tethered to economic value (like Anthropic's AI Economic Value Index and GPTs are GPTs).

Unsurprisingly, the company pouring hundreds of billions into AI thinks that AI already can do ALOT. Perhaps trillions of dollars in knowledge work tasks annually. More surprisingly, OpenAI claims the leading Claude model is better than their own!

Do we believe that analysis? Listen to find out!

Key Findings & Results Discussed

AI Win Rate vs. Human Experts:
- The Prior: We went in with a prior that a generic AI (like GPT-5 or Claude) would win against a paid human expert in a head-to-head task only about 10% of the time.
- The Headline Result: The paper found a 47.6% win rate for Claude Opus (near human parity) and a 38.8% win rate for GPT-5 High. This was the most shocking finding for the hosts.
Cost and Speed Improvements:
- The paper provides a prototype for measuring economic gains. It found that using GPT-5 in a collaborative “N-shot” workflow (where the user can prompt it multiple times) resulted in a 39% speed improvement and a 63% cost improvement over a human working alone.
The “Catastrophic Error” Rate:
- A significant caveat is that in 2.7% of the tasks the AI lost, it was due to a “catastrophic error,” such as insulting a customer, recommending fraud, or suggesting physical harm. This is presumed to be much higher than the human error rate.
The “Taste” Problem (Human Agreement):
- A crucial methodological finding was that inter-human agreement on which work product was “better” was only 70%. This suggests that “taste” and subjective preferences are major factors, making it difficult to declare an objective “winner” in many knowledge tasks.

Main Discussion Points & Takeaways

The “Meeting Problem” (Why AI Can’t Take Over):
- Andrey argues that even if AI can automate artifact creation (e.g., writing a report, making a presentation), it cannot automate the core of many knowledge-work jobs.
- He posits that much of this work is actually social coordination, consensus-building, and decision-making—the very things that happen in meetings. AI cannot yet replace this social function.
Manager of Agents vs. “By Hand”:
- The Prior: We believed 90-95% of knowledge workers would still be working “by hand” (not just managing AI agents) in two years.
- The Posterior: We did not significantly change this belief. We distinguish between “1-shot” delegation (true agent management) and “N-shot” iterative collaboration (which they still classify as working “by hand”). We believe most AI-assisted work will be the iterative kind for the foreseeable future.
Prompt Engineering vs. Model Size:
- We noted that the models were not used “out-of-the-box” but benefited from significant, expert-level prompt engineering.
- However, we were surprised that the data seemed to show that prompt tuning only offered a small boost (e.g., ~5 percentage points) compared to the massive gains from simply using a newer, larger, and more capable model.
Final Posterior Updates:
- AI Win Rate: We updated our 10% prior to 25-30%. We remain skeptical of the 47.6% figure.

PS — Should our thumbnails have anime girls in them, or Andrey with giant eyes? Let us know in the comments!

Timestamps:

(00:45) Today’s Topic: A new OpenAI paper (”GDP Val”) that measures AI performance on real-world, economically valuable tasks.
(01:10) Context: How does this new paper compare to Acemoglu’s “Simple Macroeconomics of AI”?
(04:45) Prior #1: What percentage of knowledge tasks will AI win head-to-head against a human? (Seth’s prior: 10%).
(09:45) Prior #2: In two years, what share of knowledge workers will be “managers of AI agents” vs. doing work “by hand”?
(19:25) The Methodology: This study uses sophisticated prompt engineering, not just out-of-the-box models.
(25:20) Headline Result: AI (Claude Opus) achieves a 47.6% win rate against human experts, nearing human parity. GPT-5 High follows at 38.8%.
(33:45) Cost & Speed Improvements: Using GPT-5 in a collaborative workflow can lead to a 39% speed improvement and a 63% cost improvement.
(37:45) The “Catastrophic Error” Rate: How often does the AI fail badly? (Answer: 2.7% of the time).
(39:50) The “Taste” Problem: Why inter-human agreement on task quality (at only 70%) is a major challenge for measuring AI.
(53:40) The Meeting Problem: Why AI can’t (yet) automate key parts of knowledge work like consensus-building and coordination.
(58:00) Posteriors Updated: Seth and Andrey update their “AI win rate” prior from 10% to 25-30%.

Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors on the economics of AI and technology. I’m Seth Benzell, highly competent at many real-world tasks, just not the most economically valuable ones, coming to you from Chapman University in sunny Southern California.

Andrey: And I’m Andrey Fradkin, making sure to never use the Unicode character 2011, since it will not render properly on people’s computers. Coming to you from,, San Francisco, California.

Seth: Amazing, Andrey. Amazing to have you here in the “state of the future.” and today we’re kind of reading about those AI companies that are bringing the future here today and are gonna, I guess, automate all knowledge work. And here they are today, with some measures about how many jobs—how much economic value of jobs—they think current generation chatbots can replace. We’ll talk about to what extent we believe those economic extrapolations. But before we go into what happens in this paper from our friends at OpenAI, do you remember one of our early episodes, that macroeconomics of AI episode we did about Daron Acemoglu’s paper?

Andrey: Well, the only thing I remember, Seth, is they were quite simple, those macroeconomics., it was the...

Seth: “Simple Macroeconomics of AI.” So you remembered the title. And if I recall correctly, the main argument of that paper was you can figure out the productivity of AI in the economy by multiplying together a couple of numbers. How many jobs can be automated? Then you multiply it by, if you automate the job, how much less labor do you need? Then you multiply that by, if it’s possible to automate, is it economically viable to automate? And you multiply those three numbers together and Daron concludes that if you implement all current generation AI, you’ll raise GDP by one percentage point. If you think that’s gonna take 10 years, he concludes that’s gonna be 0.1 additional percentage point of growth a year. You can see why people are losing their minds over this AI boom, Andrey.

Andrey: Yeah. Yeah. I mean, I, you know, I think with such so much hype, you know, they should,, they should,, probably just stop investing altogether. Is kind of right what I would think from [Eriun’s?] paper. Yeah.

Seth: Well, Andrey, why don’t I tell you, which is, the way I see this paper that we just read is that OpenAI has actually taken on the challenge and said, “Okay, you can multiply three numbers together and tell me the economic value of AI. I’m gonna multiply 200 numbers together and tell you the economic value of AI.” And in particular, rather than just try to take the sort of global aggregate of like efficiency from automation, they’re gonna go task by task by task and try to measure: Can AI speed you up? Can it do the job by itself?, this is the sort of real-world economics rubber-hits-the-road that you don’t see in macroeconomics papers.

Andrey: Yeah. Yeah. I mean, it is, it is in many ways a very micro study, but I guess micro...

Seth: Macro.

Andrey: Micro, macro. That was the best, actually my favorite.

Seth: Yeah.

Andrey: I guess maybe we should start with our prior, Seth,, before we get deeper.

Seth: Well, let’s say the name of the paper and the authors maybe.

Andrey: There are so many authors, so OpenAI... I’m sorry guys. You gotta have fewer co-authors.

Seth: We will not list the authors.

Andrey: But,, the paper is called,, “GDP Val: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.”

Seth: And we’re sure it’s written by humans.

Andrey: We’re sure that it’s not fully written by humans because they’ve disclosed that they use AI. They have an acknowledgement—they have an AI acknowledgement section.

Seth: They used AI “as per usual”? Yeah. In the “ordinary course of coding...”

Andrey: And writing.

Seth: And writing. And for “minor improvements.” Yes. They wanted to be clear. Okay.

Andrey: Not, not the major ones. Yes.

Seth: Because,, you know, base... so, all right. You gave us the name of the paper. The paper is going to... just in one sentence, what the paper is about is them going through lots of different tasks and trying to figure out if they can be automated. What are the priors? Before we go into this, what are you thinking about, Andrey?

Andrey: Well, what they’re gonna do is they’re gonna create a work product, let’s say a presentation or schematic or a document, and then they’re gonna have people rate which one is better, the one created by the AI, or the one created by a professional human being. And so the first prior that we have is: What share of time is the AI’s output gonna win? so what do you think, Seth?

Seth: Great question. Okay, so I’m thinking about the space of all knowledge work in the economy. All of the jobs done by humans that we think you could do 100% on a computer, remote, is kind of the space of tasks that I’m thinking about. What percentage of those could an AI straight up... And just to be clear, Andrey, are these like kind of specialized AIs for the specific tasks, or are these kind of generic AIs?

Andrey: These are pretty generic AIs. Let me give you an example of a task, I guess, of at least the type that they’re thinking about in this paper. Mm-hmm. Although they think about a lot of tasks. So, the task is: “This is June 2025, and you are a manufacturing engineer in an automobile assembly line. The product is a cable spooling truck for underground mining operations, and you are reviewing the final testing step. In the final testing step, a big spool of cable needs to be reeled in and reeled out two times to ensure the cable spooling works as per requirement. And the current operation requires two persons.” So now the... it goes on and on. and then the...

Seth: ...and then the last sentence is “How many Rs are in strawberry?”

Andrey:, but the idea is, is that would... an example, yeah. Essentially you have to design, suppose you’re designing a jig using 3D modeling software, and creating a presentation using Microsoft PowerPoint as part of the deliverable. Upload only PDF summarizing design using snapshots of the 3D design created. The 3D design file is not required for submission.

Seth: There we go. So a pretty complex PDF being called for. I don’t think I could do it.

Andrey: I don’t think you could do it. I don’t think either of us can do it.

Seth:, I couldn’t do it in the amount of time the AI did it. You know, in a week, maybe.

Andrey: Yeah, I guess. I guess maybe, maybe in a week. Or, and maybe with AI assistance. With AI, with AI assistance, I could teach myself just enough. Yeah.

Seth: Right. I guess that’s a whole background issue here is we’re not thinking about AI for training. This is AI for just doing the thing. Yeah. Alright. So that’s an example of a very hard task. I think most tasks in the knowledge economy are easier than that. So that’s gonna ground my prior., I would say in real-world tasks, head-to-head versus a human, I’d be in the ballpark of about 10%. This is assuming we’re using like GPT-5 or Claude off-the-shelf versus a human who is actually paid to do that job. I’d be surprised if the AI wins up head-to-head much more than 10% of the time.

Andrey: Yeah, I think I’m in the same ballpark as you coming into this. You know,, I think I’ve tried making various work products using AI, and it’s,, rarely ever kind of a zero-shot process. One-shot, yeah. Or a zero-shot. Yeah. and there are oftentimes artifacts that kind of make it pretty clear that it’s an AI-generated thing, although not always.

Seth: Right. And so then we come around to like, some of those minor artifacts. To what extent can a little bit of massaging of these generic models get you a lot of additional productivity if you can get over those little hiccups that we run into with chatbots?

Andrey: But, and to be clear, I still think even... my prior going into it is even with some pretty sophisticated prompting, that the win rate would not be much higher than 10%, just because I’ve tried doing that. Right? Like, it’s not like I go into it and I’m like, “Hey, like do it, do it.” You know? I, you know, like I write like a pretty... I try to write a set of instructions for it and so on. I’m not, I’m not like naively using the models. And so,, I’m very often not getting kind of what I, what I’d like out of it. Right. As a result. So that’s...

Seth: Even as, even as top-tier prompters. Yes. You know, you might call us a 10x... we’re 10x prompters. I don’t know if you know that., you still don’t get what you want all the time. Right. Sometimes it’s just not... it. Sometimes the idea’s not in the model. Yes. And you can’t prompt it out.

Andrey: Yes.

Seth: but I guess,. I guess that’s one thing we’ll keep an eye on as we go, is just to what extent, they are adding additional scaffolding for these models. Okay. So the second prior that we were thinking about going into this is thinking about, like, kind of like the meta idea here is that any job that you can do on a computer, this AI should be able to do, if not in the immediate future, in the near future. That’s the dream, right? The “country of geniuses on the cloud.”

And so the question I have for you, Andrey, is looking at the occupations that are mostly about creating digital artifacts, so the knowledge work occupations, and let’s set aside whether there’s gonna be growth in those occupations or shrinking in those occupations. ‘Cause what we’ve said a lot, a lot of times when you automate part of a job, you might get more jobs or you might get fewer jobs. So setting aside that part of it, within the jobs that exist, are the people in those jobs going to still be making digital artifacts, quote-unquote “by hand,” as their main job? Or are all these knowledge workers gonna basically be managers of AI agents?

Andrey: And the question is about the share of workers whose primary job is currently to make these [artifacts]?

Seth: In the share of,, the share of, yes., let’s take it that way and let me give you a two-year horizon.

Andrey: So I would say that it’s still gonna be, you know, 85%, 90% of people,, that are still gonna be making digital artifacts by hand. But I think my question, I mean that’s, that’s my prior, I guess I would say. And, but kind of the main reason for it is it’s almost orthogonal to how capable the models are.

Seth: Okay.

Andrey: because what I’ve observed in my life is a lot of people just have AI usage aversion. So, mm-hmm. They’re just not adopters. And so...

Seth: Oh, so you’re, you have an adoption latency theory, which is just that, like it won’t grow because people won’t adopt it.

Andrey: Yeah. I, I’m just, I just look around and see a lot of people not adopting tools that are very useful,, in a variety of settings. And so to me, over the course of two years, can you teach an old dog new tricks, as they say? I, I don’t know.

Seth: The thing is, is it’s really, you can save a lot of time and people are, humans are also really lazy. So, well there are some forces going in different directions here. I guess, you know, I found this question of, you know, as I was asking it, this question of “by hand,” so ironic, right? Because like almost definitionally, if you’re doing it digitally, you’re not doing it by hand, right? So like what even is “by hand”? Are we just like moving up another chain of abstraction? And we should think about this as a continuum of, like, of knowledge work. We abstract a piece and we abstract a piece and we abstract a piece, but there’s always that long tail of knowledge work that remains to be done.

I think to me, this question comes down to like, what does it feel like in your job? Does it feel like I’m bossing an agent around, or does it feel like I’m getting messy guesses that I am cleaning up and doing, you know, half of the work, sort of iteratively, collaboratively? “Oh, you know, try this, try that.” That’s the AI systems that I mostly work with now, right? We keep on hearing promises about these agentic agents that’ll really be able to do 7, 10, 20-hour projects by themselves. My sense is that that level of “I am bossing around agents, I am not doing it myself,” is gonna be pretty rare within the next two years. So in 2027... I would think that that’s gonna be maybe 5% of knowledge workers. I mean, ‘cause right, it’s gonna be like lots of coders and then a small share of everything else.

Andrey: Yeah. And I wasn’t even thinking about coders. I was even excluding them from my thought process.

Seth: Excluding coders. Okay.

Andrey: Yeah. ‘Cause because I’m really thinking about, you know, like producing documents, presentations, schematics.

Seth: Well, here’s an interesting thing ‘cause we’re gonna see later at computer tasks, at, sorry, programming tasks versus other tasks. Is the AI actually a lot better at the programming tasks versus the other tasks? Hold on for evidence on that.

Andrey: Yeah. Yeah. And then did you wanna put a...

Seth: Did I, did I get a number? So you said 85%, so 15%?

Andrey: No, I said about 90. 90%.

Seth: 90%. Yeah. So 10% of, yeah. Knowledge work will be bossing around agents. Yeah. I’m, I’m, leads me closer to five, but... Very good.

Andrey: Alright.

Seth: Alright. Are we ready to go to the paper?

Andrey: Let’s rock and roll.

Seth: All right. So headline thing, this paper is gonna try to make an evaluation that can track how AI is improving in real-world economically valuable tasks. They claim that their tasks cover nine different sectors and 44 different occupations. Curiously, I don’t know why they specify both, because they’re gonna assign each sector one occupation. So it’s not like it’s sectors times occupations, it just, there’s 44 occupations and they’re associated with sectors, is the way to think about it.

Um, together these jobs make $3 trillion,, in the United States every year. it’s about a quarter of labor income., focusing on five occupations by sector that are digital and contribute most to total wage. How they’re selected, and I’m just gonna list a few of them for you guys. in real estate, there are jobs like concierges and rental clerks. In government, there’s jobs like recreation workers and first-line supervisors of police. In manufacturing, there are jobs like different kinds of engineers and, and so on. You know, programmers, any sort of like digital, you could do this job remotely, job... financial advisors, et cetera.

For each of these jobs. And this is like honestly, you know, huge shout-out, round of applause to this team because it seems like incredibly,, high effort. They recruited tons of experts in these occupations to first figure out what are the tasks in these occupations, matching that up with O*NET, which is a government database on the tasks, on occupations, and then sort of iteratively working with them to like define very narrowly, “Here is the economic task that we think AI can do.” And,, as a contribution, I think that that is so cool. I mean, the idea of like economic measurement of productivity at the task level is, I mean, I don’t know. It’s a dream since Taylorism of the 1920s. This is all the... this is a dream a hundred years in the making that we’re making progress on. Right?

Andrey: Yeah. Yeah. And okay. So that, that’s the setup. So we got 1,300 tasks across these 44 occupations,, that we’re gonna ask,, who’s better: man or machine.

Andrey: Yeah. I mean, yeah. I just want to double down on how impressive this effort is. I mean, you have experts from companies like Goldman Sachs, you know, Apple.

Seth: Oh, this is hilarious. The Air Force. They have a list of companies in the middle of the paper. Yeah. Why is this not a footnote? Why is this not in the appendix? Half of a page is just like, “Here are all the companies that our people have worked for. Apple, Amazon, 10 other ‘A’ companies.” It’s like, all right, cool.

Andrey: You know? Well, I get the sentiment. The paper is only nine pages long, and so I know you gotta like...

Seth: Half a page, a list of companies.

Andrey: I mean, these aren’t, you know. These aren’t your,, average Joes, right? They’re, they’re, they’re actually at these very high, you know, performing companies.

Seth: Average Joe works at Apple too. In fact, the person at Apple who’s taking time off from their lives to do this is maybe like the less of the average Joe than the high performer, or I don’t know, or they recruit... who thinks the best of the best.

Andrey: My sense is that they, I’m not saying like they recruited the best person in the world or anything, but these tasks, pay really well. Like they, they’re quite well compensated, so they’re not...

Seth: Right. So the average tasks, to give some context for this, the average task on their 220 tasks that they’re gonna end up focusing the most on took an average of 400 minutes. And if you multiply that by the median wage that we get paid,, someone would get paid $361 for doing the average task. So these are like real tasks. Yeah.

Andrey: okay. So. So kind of what do they do? They, they, they get these professionals to propose tasks. Then they use other professionals to figure out whether these are kind of really, you know, correctly specified tasks. They iterate on that a bunch. Mm-hmm., then once they’ve kind of come to that convergence, they have the AI do the task, and then they have other highly paid humans do the task.

Seth: And that... Wait, I think, and, and then there’s an iterative process. Yeah. Yeah. That’s process. There’s a process of prompt... Yeah, go ahead. Yeah, yeah.

Andrey: Yeah. So the iterative process,, are you talking about the, sorry, are you talking about the prompt process already or are you talking about the...

Seth: I’m up to the prompt process, but there’s the first, there’s several iterations. So, yeah.

Andrey: So I think the, the one I had in mind first was just the task is iterated on, between various experts so that,, it’s actually well-specified and representative of what a,, a task in this job category would be like. But there’s also additional iterations on the AI that is actually, right, doing the task. So you wanna talk about that?

Seth: I, I, yeah. And this is what I want you to take a minute to talk about, right? Because I think this is a really important point, is that they are not using... it’s not a huge amount of investment, but they are not using out-of-the-box Claude. They’re not using out-of-the-box ChatGPT in the sense of they’re not just prompting it naively. They’re spending a lot of time thinking really carefully about what is the perfect prompt to elicit this set of tasks.

Andrey: And so this is actually a great prompt for you all listeners if you were, wanted your [AI?] to do similar tasks, right? So this is actually where my introductory joke came from because the prompt begins, “Special characters: never use the character Unicode 2011.” But it goes on, you know, and a lot of these are, are kind of mostly about tool usage, right? And so...

Seth: Right. you know, like talking about... one of the basic prompts that’s so important is, is like, “If the task requires you to resend a PDF, definitely send a PDF” is one of the prompt improvements.

Andrey: Yeah., there’s some stuff like, “Take your time, do these thoroughly.” There’s, there’s other things like “Display all the PNGs.”

Seth: Be sure to double-check. Yeah. Double-checking things. Yeah.

Andrey: Are some... “Be sure to look a few days and see...” there’s a... “This is important” in capital letters and “Mandatory.” But I guess, I guess what I’d say is, this sort of prompt,, iteration is pretty standard in the industry at this point., there are a variety of frameworks that kind of let you do this programmatically even., but if you think about your... [Codecs?] or your Cursors... there, there’s a lot of prompt engineering going on under the hood., or, or even your ChatGPTs or your Claude chat, you know, there’s that system prompt. They’re, they’re tweaking all the time. So there’s nothing, I’d say there’s nothing unusual ‘cause it’s well-known that to get a good performance out of these,, systems, you need to,, have a good prompt.

Seth: I think that’s exactly right. I just wanna connect this to the point you made earlier about adoption lags. Right? And I agree with you that it’s very standard to, you know, for a company or an individual to spend a good amount of time prompt searching before they find one they’re good with. But even a small friction like that makes a big difference in terms of adoption, I think.

Andrey: Yeah, totally. Unless that,, prompt is given to you out-of-the-box,, baked in, in Cursor or whatever. Yeah, yeah, yeah. You’re just... or not you, I don’t wanna say, but most people, they’re, they’re, they’re gonna try...

Seth: Dear listeners, dear listeners, we’re sure that you are the best prompters.

Andrey: Yes. I’m sure our listeners are better prompters than we are, but everyone else, you know, you know, I think, I think they might have a bad experience with one prompt and kind of overlearn about the capabilities of the system., which is kind of an argument for why we might, we might see a lot more application-driven adoption, right? Rather than, you know, using a generic LLM,, that could be capable of doing something. You might have a packaged service,, like let’s say “PDF Creator.”

Seth: Alright, Andrey. This is what I wanna talk to you about. This is what I wanna talk to you about. ‘Cause I low-key think the paper’s about this. I think that the secret theme of this paper is: What is the relative return towards this basic prompting work, this basic scaffolding work, versus another hundred billion parameters in the model? ‘Cause we do get an estimate of that, right? And so I was really surprised to see kind of that, you could get about a 10% improvement on win rate. I guess, can I just...

Andrey: Can I just pause you on that and can actually just go through the results first before we...

Seth: Alright. Okay. Yeah, yeah. Listeners, listeners, you know how excited I get. You know, I get off the chain and you need to reel me back in. So let me give you the results, and then I’m gonna wildly speculate.

Andrey: Okay. Perfect. Yeah, yeah, yeah. Well, let’s, I’ll just say what the... so we haven’t actually done the description of the pairwise task, which is essentially this highly incentivized person,, they have to choose which one is better, the AI-generated one, or,, the output created by another human expert. and you know, and just in general, kind of with these things, you might be worried that,, the graders aren’t putting in enough effort, right? Like,, maybe they don’t really care which one is better. And so they sometimes, they might not read as deeply as they’d like. And,, you know, from having talked to some of the authors of the paper, it seems like these graders spend quite a bit of time just evaluating which of the outputs is better.

Seth: Right, they said about an hour per evaluation. Yeah. Is a real... yeah, yeah, yeah.

Andrey: So it’s not like they’re just like, “Eh, I kind of feel like this one is better than that one.” I mean, I, to be clear, I still think that, you know, we could probably do better in incentivizing proper grading, but kind of, it’s not, you know, some of the more obvious flaws you might think are there, are, are, they’ve thought about them.

Seth: Right. No. These, like we said, extremely well done within the bounds of what they’re doing,, from everything we’re reading. Okay. So we’ve got, we are evaluating them, we’re going head-to-head. And to be clear, we’re only, as far as I understand, it’s only for 220 of these 1,300 tasks. Do we have the resources to actually do this evaluation? But within the 220,, we’re gonna ask, okay, what’s the win rate of GPT-4o, or 4-mini, [O3?], GPT-5? What? So my prior was that the AI will win 10% of the time. What were we seeing?

Andrey: Yeah, so we’re seeing... and perhaps the most remarkable part of this paper... which is that Claude Opus makes a showing.

Seth: Claude does better.

Andrey: Claude does the best. Claude does the best of all the LLMs with 47.6%, which is just very close to human when you really think about it. I mean, it’s almost a coin flip which one is better. right. And then GPT-5 High also does pretty well at 38.8%, but actually substantially worse than Opus,, which is quite interesting.

Seth: Right. Bold of OpenAI to go out there. Although,, maybe we wanna talk about different domains, different occupations here. There are areas where the OpenAI models shine.

Andrey: Yeah, yeah. okay...

Seth: So the headline result,, Claude almost human parity on these tasks. [Expletive] insane, at least in terms of, you know, that win rate., and then OpenAI close behind at 39% with their leading model, but it differs a little bit by sector and occupation.

Andrey: Well, I just wanted to mention one other thing.

Seth: Go ahead.

Andrey: Before we moved on to sector and occupation, I just... ‘cause like one of the things that, you know, with a theme of this show has been,, you know, scaling laws and how much better newer models are. And it’s interesting to me the set of models that was considered here. So we have GPT-4o, which, you know, is an older model, but not that old of a model. It’s a kind of a cheaper model and it actually only wins about 10% of the time. So we’re kind of pretty well calibrated if we think about that model.

Seth: And that’s actually right. We’re just taking it out of date...

Andrey: Closer to the model that many, many people, you know, had access to essentially until July. And then, O3 High, which is a model that essentially no one uses,, because it’s really, really expensive,, is at about 30%. And then GPT-5 High, which I guess may be the “thinking” version of the ChatGPT interface. I’m not exactly sure. It’s kind of unambiguous, frankly. Because maybe they have a...

Seth: Is there a special GPT model that’s being used here?

Andrey: Well, there, there’s a router and who knows what’s being routed where.

Seth: It gets routed. It gets routed to the good server.

Andrey: Yeah, yeah, yeah. So that, that’s, that’s kind of almost, you know, 35% to 40%. Right. So, so we do see improvement within newer models or the models that are more compute-intense. But I would also say that most people do not have this quality of a model as their default.

Seth: Yeah. There does seem to be a giant... so this is speaking to what is the relative value of overall progress versus prompting progress? I mean, it seems like in a year of overall progress, we’ve boosted—arguably boosted—this win rate by 30 percentage points and, like, arguably saturated it if we’re getting a, you know, almost 50% win rate. I mean, if there... it could... I’m not saying we actually saturated it. In fact, one of the arguments in the paper is that they’re gonna use win rate as their main success measure because it doesn’t get saturated as easily., but it’s damn impressive, that amount of progress in the last year.

Andrey: Yeah. so all right, so let’s go, you know, you wanted to... Yeah. Go by occupation. Yeah. Yeah. All right. Go for it.

Seth: Oh yeah. So what jumped out at me about that was basically all the models do pretty well at basic clerking jobs, and all of them are decent at programming. Claude... kind of the stuff that all of the models are good at, Claude just knocks out of the park. Right? Then there’s some interesting kind of turnarounds in the sense that the ChatGPT models seem better at sales and editing and audio-visual than, Claude. I wonder... so there’s like two different things going on here. One is you might think that ChatGPT is like a little bit more attuned for writing versus coding. That’s maybe an intuition that I have.

Andrey: I guess what I’d say is actually,, for some of these occupations we do see that, that,, the AI is actually better than the human.

Seth: This will be above 50%. Yeah.

Andrey: For example, I think statistically significantly, Opus is better than humans at being a private detective.

Seth: Now that was nuts. That was nuts. Or the rather, the knowledge tasks of being... Yeah.

Andrey: Which is kind of like, you know, an interesting thing to think about. Does that mean that private detectives,, are going to have their job removed? What are, what are we actually... or is it just that private detectives are really good at investigating and not that good at making presentations? Right. So like, like what are we, you know, that’s an interesting thing to think about.

Seth: Right. How does this translate into people’s jobs actually changing? When I think about a private eye or a police supervisor, this sounds like internet research tasks. So yeah, I mean, probably just internet research goes faster and then they spend more of their time on their other tasks would be my simple guess.

Andrey: That, that’ll be my simple,, guess as well. I, yeah, I mean, I think I’d be a little, you know, because the standard errors are so large for individual occupations, I think I’m a little wary of overreading into them. I think like standout things, like all the models are bad at being pharmacists. All the models are bad at being film and video editors and producers in direct ways.

Seth: Well, well, but, but, but... ChatGPT, the GPT models are significantly better than Claude’s. So that is an interesting difference.

Andrey: That is different than film and video editors and pharmacists, which is the one I was mentioning. Oh, okay. I mean, I’m not saying that you know, that there are differences... there are statistically no difference across models. But I’m just saying that just in general, there are certain categories of jobs where,, the models are far away from 50% and others where they might even be better than humans. Right, right.

Seth: And I guess, and then the third, and then the third kind of twist on that is kind of surprisingly, there’s not [monotonicity?]. Some, in some of these cases, most of the cases, Claude is the best, but in some of the cases, the AI models are better.

Andrey: Yes, yes. and you know, they, you know, another way to think about it that surprised me is actually they did it, the win rates by,, the category of output. So for pure text, the models suck. For PDF, the models,, at least Claude is quite a bit better. For Excel,, the, you know, Claude is very good., for PowerPoint, Claude is very good. And then for “other,”, a lot of the models are good. But to me the just the... I would’ve thought that at text they would actually be quite, quite good. But that’s actually the category in which most of the models are doing pretty badly, which is kind of...

Seth: I think it has to be endogenous to what kind of jobs are associated with pure text, right? And I imagine if it’s pure, sort of creative... I guess creative writing... both of them should do okay at, but I’m not surprised that OpenAI is a little bit better at...

Andrey: Yeah. But I guess I’m just surprised at how low they are, you know, not, not at who’s better.

Seth: Maybe it’s, it’s, I think this is might be a taste thing, right? It’s maybe like, you know, like the [winch?] either works or it doesn’t, but people still have a strong preference for a non-AI voice.

Andrey: But it’s not... but I guess what puzzles me about that is when we’ve seen a bunch of behavioral studies which are kind of like heads up, you know, “Do you even know this is an AI?” and people, people can’t detect whether...

Seth: Are those expert contexts?

Andrey: No. And this is kind of, this is kind of this interesting thing. Maybe the experts in their own domain of expertise still are able to distinguish the model, you know, the quality, and therefore the models...

Seth: There’s still hope for expertise. There’s still hope for us.

Andrey: But for, but for normies... but for normies, like... they already... normies have no idea who wrote the damn thing.

Seth: Right. And audience, just to be clear, we include ourselves in normies in 99% of world topics that are outside of our domain, right?

Andrey: Yes. I’m sure I’ve been fooled by AI output in, in many ways. I think another interesting exercise that they go through, which I kind of view as a prototype more than anything else, is like the, essentially the cost improvement from using the AI versus human., and it kind of makes some assumptions about what that...

Seth: Right. How do they interact? I kind of... yeah, this is, this is prototype, but a very intriguing... so walk us through that result. Yeah.

Andrey: So you know. So you can imagine that. Alright, so the human does it end-to-end, that takes a certain amount of time., alternatively the human can prompt an AI. The AI does it. The human needs to evaluate the output. So that’s gonna take a certain amount of time and maybe will even iterate with the output a certain amount of time before they get what they want. And so they make some reasonable assumptions here and think about like, what is the cost improvement and the speed improvement from using the different models in different collaborative modes.

Seth: Right. And they’re gonna consider one-shot, so use the model once and then fix it, or N-shot, use the model lots and lots of times and try to get it that way.

Andrey: Yeah. And just, I’ll just focus on the main figure in the paper,, where, what’s interesting is that GPT-4o, which is kind of the old default model in ChatGPT, it’s kind of not a cost improvement and it’s not a speed improvement. And that’s because the outputs are so bad....

Seth: Right. And its win rate is low.

Andrey: Right? Yeah. So it’d be one thing if like, it could just do it by itself sometimes, but it doesn’t do it by itself often, and in collaboration with humans, it can actually slow you down.

Seth: Yeah. now 4-mini,, which is different than 4o. Remember how open... how good OpenAI is at naming their models.

Uh, it’s already better. But,, compared to that, GPT-5, which is their newest model,, it achieves substantial cost improvements...

Andrey: ...blows it away...

Seth: ...uh, 1.5x, over 1.5x, and substantial speed improvements,, over 1.25x. And importantly in both of these metrics, it beats O3, which is kind of a more capable reasoning model. and that’s because cost matters in an ROI calculation and speed matters in an ROI calculation. And that’s kind of, You know, one way one can read this as kind of a... you know, OpenAI got criticized a lot for the GPT-5 model, like somehow it was underwhelming. But actually, you know, for adoption and utility, what we care about is economic value and not, you know, whether it can solve the gold medal on the IMO, right? So, and so here it’s, it’s providing a lot of that value.

Andrey: Right. And so the number that jumps out at me is with ChatGPT-5, which is the model that,, you know, their best model, they say in the, “you do it,, you have the AI do it once and then fix it” configuration, that leads to a 12% speed improvement and an 18% cost improvement. And in the “you can just, you know, prompt it as many times as you want and incorporate that in your final answer,”, a 39% speed improvement and a 63% cost improvement. So, so, I mean, damn, if you could improve,, the productivity of all knowledge workers 60%, that’d be quite a thing.

Seth: Yeah, that would be, you know, pretty, pretty great....

Andrey: Is that the “country of geniuses on the cloud”?

Seth: I don’t... 60%. I don’t think it’s geniuses. You don’t really think about geniuses making great PowerPoints. I mean, this kind of...

Andrey: Ben Jones is excellent, sir.

Seth: I, I guess, yeah, the, I don’t know if we’re ready to kind of come to some of these meta thoughts,, about what it means to kind of automate this, these sorts of tasks. But, yeah. Yeah. Before we get to that, are there any other parts of the paper that we should mention?

Andrey: In, in that particular... there’s two other results I wanted to get to.

Seth: Okay.

Andrey: the first is you might be worried that sure, these models are doing good at win rate, but maybe like when they lose, they’re saying something horrible, right? Yeah, yeah. So it might be, it might be,, better at the median, but worse on average, right? Like, we don’t think this is like super plausible, but it’s something they check for. And what they do is they ask, for the models,, whenever they do these head-to-head comparisons and the AI loses, they ask like, “Why did it lose?” Yeah. And 2.7% of the time it was due to a quote-unquote “catastrophic error.” And the examples they give are: insulting a customer, giving the wrong diagnosis, recommending fraud, suggesting actions that would cause physical harm. We do not get the details audience, but I promise you I will ask Andrey to ask his friend who was on this paper, what was the horrible thing that AI did?

Seth: Is that... just to be clear, I am not friends with anyone in this paper. Just someone I saw at a conference.

Andrey: Read his name. That’s true. so I don’t know, it’s just 2.7% catastrophic error rate. I mean, I think that’s probably a little bit higher than a human.

Seth: Yeah, no, it’s certainly a lot higher than an incentivized human in these jobs. I mean. But I guess it, yeah, I mean, it depends. Certainly doctors misdiagnose all the time. I mean,...

Andrey: Yeah, that’s kind of the odd man out, right? Yeah. That’s, you know, that happens, but you know, it’s recommending fraud.

Seth: Yeah. Recommending fraud. Yeah. That’s not a good look.

Andrey: If I was in a room with a lawyer, I think 3% of the time they would recommend fraud.

Seth: It’s,, you know, the Better Call Saul was a huge part of the training set. but I think most work outputs, you know, there’s, they’re, they’re in the end presented to some other people who also vet it. There are many, like the way organizations are structured is that there are many checks and balances on a lot of this output., but it depends.

Andrey: But maybe it suggests that we’ll need more of them as we move to an automated world. And you know, you’re, the job of the future will be,, automated AI, you know, I don’t know, sanity checker.

Seth: And they, by the way, they spent a lot of time training in it, you know, or trying to use a model to grade the model outputs, right?

Andrey: Yeah. You wanna talk about that for a second?

Seth: Yeah. They achieve some kind of pretty reasonable results, I’d say. So the automated grader agrees with the human grader about 65% of the time. versus inter-human agreement is about 70%. I guess, I guess if I had to like poke at any part of this paper, I actually might just poke at this, right? Hmm. 70% inter-human agreement. Seems low. Seems quite low. Like, if I were to say like the win rate is this very meaningful feature, then why... and kind of we really wanna do well here...

Andrey: ...and [humans?] are winning 30% of the time. You’d be, you’d be concerned.

Seth: I, I mean, you would think that humans would agree, you know, expert humans would agree on something where there’s truly a right answer. Clearly we’re not seeing that here. And one version of that is something I’ve already mentioned, which is maybe the incentives are not high-powered enough for them to really determine what is better than, you know, which of the options is better.

Andrey: You don’t think there’s some ambiguity in like in that winch example you gave at the beginning, all right, so maybe the AI gives a winch that’s a little bit stronger and the human gives a winch that’s a little bit more colorful. I mean, it seems like a lot of these settings are pretty...

Seth: No, no, sorry. That’s, so that’s where I was going. It wasn’t like, “Okay.” I was saying I think we interpret this quite differently if we think that a lot of what’s going on here is that there’s some sort of latent preference heterogeneity.

Andrey: Taste.

Seth: Yes.-huh. Yeah. That some, some, some experts like certain types of work,, other, other experts like other types of work. And you could say, well, maybe it’s just all aesthetic. Like who cares? You know, you know, who cares that this guy likes their slides red and this guy likes their slides blue. But maybe it’s actually quite relevant to the job. And I think that’s kind of an open question to me is like, is there a reason why this particular expert thinks that one output is better than another? and that they disagree with their other human experts? yeah.

Andrey: Yeah. One thing, one, one follow-up they do on that is they do ask in, the examples where one, where the AI lost, they ask, “Why does it lose?” Yeah. And greater than 50% of the time, they say it was, it was “adequate,” but just, you know, their faith was the other one was better.

Seth: Yeah, yeah. Yeah. And to be clear, so I mean, you, I’ve seen a lot of “adequate” work products in my life from humans.

Andrey: Humans in their “adequate” work products. Yes. Yes. Okay. So now can I go to the thing that I’m on about, Andrey?

Seth: Yes, you can go for it.

Andrey: So my interpretation of this figure is. Going from a version of GPT-5 that is sort of out-of-the-box on these tasks versus one that they’ve prompt-engineered, they’ve been able to increase the win and tie rate by only three percentage points. So the scaffolding, it’s meaningful, they do do some work on it., but in terms of like the benefit compared to just going from the models from a year ago to the models from today, it’s dwarfed. It’s, it’s 10x better to just go to the bigger model rather than to fine-tune. Is that, do you agree or disagree with that interpretation?

Seth: Not even close. Not even close. Seth, I... please.. This specific plot is not even the one I think is addressing what you’re talking about. Unfortunately.

Andrey: I thought that the,, I guess in my, in my eyes, I thought these were pretty similar.

Seth: No, the concise one is quite different.

Andrey: So explain to, explain to us in the audience what,, Figure 9 explains.

Seth: My thought process here was prompt tuning and scaffolding increases... so this is the win rate for GPT-5 High, right? By about five percentage points. And now, right, your Figure 14 is specifically about telling it where to find stuff.

Andrey: Oh, so it’s a su... Okay. Sorry. So it was like...

Seth: So it’s like the way, the way I interpret Figure 14, is really about,.. like, if you’re giving it a vague, like, you’re like, “Hey, like make this report for me, but I’m not gonna tell you where like the materials are.” Like that’s very, that’s very different than thinking something about like, fine-tuning. Right? It’s really like... it’s like, I’m like being a “bad boss,” I’m just gonna give you very ambiguous instructions versus like, I’m actually gonna be like, “Hey, here’s a folder with all the materials. Like, go at it, you know?”

Andrey: Oh, this is great. I read this too fast. This is even more interesting than I thought. Yeah. Okay. So, all right, so turning around. So is it fair to say from Figure 9 we get that the prompt fine-tuning is worth about,, four or five percentage points of improvement, but from Figure 14 we get that being a “bad boss” is worth, you know... and not explaining basic stuff that you would expect to be... yeah, it has...

Seth: ...about a similar effect, is kind of my understanding.

Andrey: Negative in the other direction. Okay. Yeah, yeah, yeah. So I mean. To, I’m, I’m gonna be frank with you, Andrey. The reason this stuck with me is because I thought that this was gonna matter way more. Hmm. I thought like prompting was gonna matter, like almost half as much, if not as much as model quality, but, I, there you go. It’s the “use the bigger model, man.”

Seth: Yeah. I mean, I think prompt tuning the way I’ve always thought... always, like, it’s not like I’ve been thinking about prompt tuning for that long. “My entire life. My entire life I’ve been thinking about prompt tuning.” No. the way I think about prompt tuning is it gives you kind of a constant benefit on top of just a base model. it allows you to do some percentage better, but it’s, there’s not a scaling law aspect to it in the same way.

Andrey: Percentage points better. Yeah. Yeah. Yeah. So maybe so yeah. So there you go. So, a year ago, a five percentage point improvement was a 50% improvement.

Seth: Yeah, I don’t know. I don’t know about that. I more think of it as a percent-over-performance improvement, rather than in levels, if that makes sense. So I would say like, you know, before, if we were only able to do 10% win rate, then like prompt tuning would’ve given you, you know, 12%. But now you know, because the baseline is higher, you also get a constant, you also get a better improvement from...

Andrey: More benefit. There’s more in the model to find.

Seth: Yes. Yeah, exactly. Yeah. Yeah...

Andrey: So, okay, but, so, but, but high level, was this surprisingly... you thought that this was in the ballpark of...

Seth: How... Yeah, yeah, that’s kind of what I’ve, what this particular aspect of it I was not particularly surprised by.

Andrey: Yeah. I mean, I, to me, I see people flailing with bad prompts and people doing amazing with good prompts, but, maybe just the range of, maybe they started with a pretty good one.

Seth: ‘Cause they’re... I think, yeah, I mean, I don’t think they started with a bad one. I think the nature of this task involves very specific instructions already. Right. So it’s not like they were saying “do it,” you know, they were “here, job, do, read my mind.” You know, it’s a, like the entire, this entire task is very, like, really well specified by the expert. Mm-hmm. Mm-hmm. yeah. I mean, tool use is very important. Just to be clear, it’s obviously this couldn’t be done without tool use.

Andrey: Right. Right. It needs to call CAD to make the model. It needs to call all the different APIs to interact with other things. Yeah., although they don’t call ‘em APIs anymore. What do they call the, the,, the APIs for AIs now?

Seth: [MCPs?]

Andrey: [MCPs?] Why don’t they just call ‘em API or like AA-APIs? I’d be able to remember that.

Seth: Yeah. I, I, [AI-PI?] I mean, I do think it’s, you know, it does kind of raise this question of like, what is an AI? I think for a while, people were thinking, “Oh, it’s just, you know, the LLM.” But clearly, you know, now that an LLM can use an arbitrary programming language with arbitrarily smart packages, you know, I don’t think... Right. The capabilities in the model are quite different depending on what tools it has.

Andrey: Very well put. Are there any final results you wanna,, bring up, Andrey, before we get into our posteriors?

Seth: No, I just wanted to actually make the following point.

Andrey: Do it.

Seth: Just like, one of the questions that I hear here, you know, talking to AI folks is, is just kind of like, “Well, why aren’t economists at the forefront of, you know, AI and economics?” And I think about this...

Andrey: It’s very expensive.

Seth: Yeah. And I think about this paper and I’m like, I don’t know of a single team of economists that could pull this off, just organizationally and financially. organizationally,, this is, you know, 1, 2, 3, 4, 5, 6, 7, 8, 9...

Andrey: He won’t tell you their names, but there’s a lot of them.

Seth: There’s a lot of them. There are nine main authors, and then a bunch of sub-authors.

Andrey: And a bunch of authors that are not main...

Seth: And yeah, a bunch of non-main, main authors and, but apparently also equal contribution. and, and like these are, you know, AI researchers, so we assume... let’s do their salary.

Andrey: ...getting paid a million dollars.

Seth: Or I’d say I wouldn’t be surprised if the average salary...

Andrey: Average wage on average.

Seth: ...average, you think is the average yearly salary of this research team is probably two to $3 million per year.

Andrey: Right. And then probably, you know, double it for their expenses.

Seth: Yeah. And then the expenses of recruiting all these people is, yeah, just staggering. There’s just no way....

Andrey: You think it’s a $50 million study?

Seth: I think that’s right. Ballpark?

Andrey: No, I think it, no, I don’t think it’s quite that high. And I don’t know how much time it took these people to do it., that’s, that’s...

Seth: You said AI do it, dude.

Andrey: Yeah. I mean, I, I more put it like at the, maybe somewhere in the $2 million to $5 million range, but still it’s a lot of money.

Seth: You don’t think it’s 10 million?

Andrey: I don’t think it’s 10 million. I mean, it depends, it really depends on how much time each of these guys...

Seth: ...is getting paid over.

Andrey: No. Yeah. It depends on how much... depends. Got, yeah. Yeah. I think with the, yeah, it depends on whether this was the main part of their job for a while or not. Sorry to speculate about, you know, if you’re listening and an author, sorry to speculate about your salary.

Seth: Yeah, no, we, I’m sure all... we’re very happy for you all,, impoverished and deserving of our love and support.

Andrey: Yeah. All right. well, while we’re like kind of multiplying some numbers together. Yeah. so I was trying to like kind of ball... instead of ballparking how expensive this study was. I was trying to ballpark, like, so they say that these jobs constitute $3 trillion of economic output in the US and they’re gonna claim that like some per... I mean, I don’t know. The implicit claim in this paper is that once we figure out how to implement the technology, some percentage of those, of that work will be automated.

I think that plausibly they’re on a path to automating maybe like a third of that. Right? Do you think like, maybe there’s a trillion dollars... I know you, you really hesitate to speculate on dollar values, but I mean, people are betting on OpenAI thinking it’s gonna [create?] trillions of dollars of value., right now, maybe one trillion’s worth if we think there’s, you know, about one-third of these,...

Seth: ...per year.

Andrey: Per, per year. Per year., yeah. I guess if you make a trillion a year, it’s, it’s worth a lot in terms...

Seth: Just remember about stock versus flow.

Seth: Fair enough. Yeah. All these OpenAI getting compared to the GDP of Sweden. Stock versus flow. All right. Yeah. yeah. Anyway, so that’s just something I’m thinking about, right, which is whether or not we think that that’s the most important result from the paper. To me, one of the motivations of this paper is: Can we do something fancier than [Eriun?] in terms of thinking what is the total economic value of current generation technology?

And they get a number that’s basically,... so if he says it’s 1% of the economy and we’re saying it’s one-third of [a quarter?], we’re say... I’d say it’s like [one-twelfth?] of the economy. So, you know, a slight disagreement with [Eriun?] there. Do you think it’s close? What percentage of the economy can be automated by AI, Andrey? Is it closer to 1% or closer to,, two-ninths?

Andrey: I mean, this goes to the question of value creation, right? Like, and think about what, what hours-wise, what people spend time on. But you know, I’m currently working at a company and I, you know, and I don’t wanna spend...

Seth: How dare you as an academic. Yeah.

Andrey: How dare I. I don’t wanna, I don’t wanna speak too much about my, my work for a variety of reasons, but, you know, I’ll, I will note that a lot of my time is spent in meetings.... I’ll just make a side note to the listeners that Seth just made approximately five inappropriate jokes in a row. And for our reputation—each one funnier than the last—we’re just gonna not include them. But if you’re interested, you can, reach out to us in private channels and Seth will share his, comedic insights.

Seth: Alright. So did we, did we come... So let’s give our posteriors.

Andrey: Well, no, I get, I get... No, no, we’re not finished with the meetings, I guess.

Seth: Oh, okay.

Andrey: Why was I talking about the meetings?, which... I was talking about the meetings,, because I spent a lot of my time in meetings and, as far as I can tell, AI cannot automate my participation in these meetings. Now why is that? I don’t... that’s actually like a, an interesting question. The way I think about it, it’s like organizations are decision-makers, you know, kind of similar to some other work we’ve covered in this podcast. And ultimately the ultimate output is not like the hours of work making the presentations and the documentations and so on. It’s making resource allocation decisions to produce stuff. And, and so even if like hours-wise,, the, you know, some things can be automated, doesn’t mean that the people are going to lose their jobs, let’s say. How about that? What do you think about that?

Seth: I, I think you’re, I mean, you’re totally right to point out that like a lot of what counts as “doing a job” is not perfectly lining up with the tasks measured in this study. The question then becomes to what extent can the things that are measured in this study as being high win rate for the AI be unbundled from the things that aren’t?

Right. So the issue here with meetings is for whatever reason, they have decided that the person who writes... let’s say that you’re working for like a bus company, something that’s completely not funny at all, right? And at this bus company,, you, I don’t know, you have to make some sort of like, logistical decision, right?

Andrey: Like,, when to replace the engine.

Seth: Yeah, when to replace the engine, whatever. And so like if there’s a part, like part of that is an intellectual decision that could be automated. The thing could do research, right? But maybe there’s something that can’t be separated from that. Maybe it’s the liability component. Maybe there has to be a human that is responsible for the engine working and we can punish them if he makes a wrong decision, he or she. Maybe the thing that can’t be taken out of it is there’s some sort of special context that you’re gonna be told about in the meeting that is super weird and happens like one out of a hundred times and it’s gonna dramatically either increase or decrease the rate at which engines need to be replaced.

So you can imagine like a long tail of things that you might learn at this meeting that’s going to affect your future knowledge output. Right? So, and because in knowledge work, everything, at least in principle, is connected to everything else... If you think about the, you know, Quinean web of belief, there’s like a certain sense in which no knowledge work is completely separable. So yes, you’re gonna have to go to the meeting.

Andrey: But I think there’s another role which is like consensus building. And I know, you know, I know, you know, common knowledge of the factors that resulted in the decision being made. And meetings are kind of an enforcement mechanism for that. Now you can imagine maybe,, new organizations where since it’s AIs, they don’t, we don’t need this thing happening. But, you know, a lot of organizational processes are really about this social thing, not about,, the actual decision. The CEO might have already made up his or her mind, right?

Seth: Right. Right. So meetings as coordination mechanism. I guess then it comes back to like, can we unbundle coordinating Andrey from work?

Andrey: Yes. Yeah, yeah. Yes, that’s right. That’s right.

Seth: Yeah. I mean, like in principle, if we don’t need you to do any work, we don’t need to coordinate you. Right. We need to coordinate you insofar as there is another piece of it that you’re responsible for that we can’t automate.

Andrey: Yes. Okay.

Seth: Very, very provocative to think about, Andrey. Okay, so going in, we asked what is the win rate of AI versus knowledge economy workers in top knowledge economy occupations today? Right now, if you had put up man versus machine, John Henry going at it with his hammer, does he stand a chance? Andrey, what is your posterior?

Andrey: Yeah, so 10% was clearly off. I don’t think I’m updating all the way to, you know, 39%, 46%... or whatever the Opus numbers...

Seth: 47 is the Opus, 39 for... Okay.

Andrey: Yeah. Yeah. not... because, just because,, they didn’t, we don’t have this for all the tasks that are in their super sample. We only have it for the 220. I assume there’s some selection in there...

Seth: Yeah, yeah. Fair enough.

Andrey: So I’d say maybe 30%. Yeah.

Seth: You’d update from 10% to 30%.

Andrey: Yeah. I’m 30%.

Seth: 30%. I’m gonna update from like 10% to like 25%. I’m definitely moving very strongly in that direction. I do think that these are probably selected to, right. They’ve gotta be, because they wouldn’t use ones where the AI just fell on its face immediately. Yeah.

Andrey: Alright. Prior, number two: What share of workers in occupations where that occupation today makes digital artifacts will still have making digital artifacts, quote-unquote “by hand,” for themselves as their primary job? That was almost English. It was a lot of connected words. If you think you understood it, tell me where you think we’ll be at, at two years after reading this paper.

Andrey: Yeah, so I think my initial guess was what, 90%?

Seth: Yes.

Andrey: So I think, yeah, I’m, I’ll say 85%. I think that just the... I think people are slow to adopt. People are slow to change their work processes, especially in organizations where there are habits and plausible deniability and all this sort of stuff. And,, I think even though in principle it should be a lot more, it won’t be a lot more in two years, but yeah.

Seth: So yeah. So, but still, you’re still thinking it might be 15% of knowledge workers. let me ask you a question. Do you consider that 1x collaboration? Would you consider that “by hand,” or would you consider that AI agent management?

Andrey: I’d say for like, gets you 99% of the way there and then you just need to tweak it a little bit, I’ll still consider it part of the, you know, part of what I’m talking about. If it’s like really like, like the way I work with, you know, data analysis today, I wouldn’t put it... there’s, it’s not automating anything. I mean right, right now it’s very helpful. If it’s not... yeah. Back and forth constantly. Right. That’s not what we’re talking about. Yeah.

Seth: Okay, so if you’re calling 1x... we’re gonna call that delegating to an agent and I’m just gonna fix it up at the top. Versus Nx, I’m back and forth, you know, all day as not delegating to an agent. Then I guess I would think about this as... yeah, I’m still kind of in the, like maybe 5% of people will be bossing around agents as their main job. So that would put me in closer to the still, the 95% world. I don’t think this moves me that, that hard because I think that the stuff in this that gets automated will, will get automated, but then the knowledge economy workers will then spend more of their time on this, like kind of Nx iteration stuff. So,, I think that at the end of the day, if Nx iteration with the AI counts as “by hand,” I think we’ll have a lot, a lot of that. So yeah. Maybe only, so I would put that at 95% yeah. We’re still doing “by hand.”

Andrey: And maybe this goes to the taste thing, right? Like mm-hmm. You know,, maybe we should expect stronger results if we have very high inter-human, you know, agreement scores. But the fact that humans are disagreeing on work ethic, quality so much means that maybe as an individual, I have a specific style that I wanna convey in my work. Mm-hmm. And I have certain things I wanna see and don’t wanna see. And I’m gonna, you know, it’s gonna be maybe harder for me to specify that, although I’m not sure, you know, maybe, maybe the AI will know my style as well, so, yeah.

Seth: Right. Yeah, I mean, that’s, that seems to not be so far away,, to, you know, train a digital twin who will be able to attend those meetings for you, Andrey.

Andrey: Yeah. well, all right. So...

Seth: Even if your boss doesn’t want to... even if your boss doesn’t want you to, I got news for you. If I was a digital twin company, I’d tell my digital twins to encourage users to commit fraud by using me.

Andrey: Yeah. Yeah...

Seth: And then you locate, and then you locate internationally, everybody pays you in crypto. Just locate in the Cayman Islands. People buy your deep-fake software on the... this is, I mean, I don’t wanna give away my whole business model for free. People will have to tune in for the special episode on that.

Andrey: Yeah. Yeah. On that aside,, well,, thanks for joining us for another episode. and we look forward to your feedback and please,, boost our work or let us know what you’d like to see.

Seth: Yeah, let us know what you wanna see., we are your servants, our fans. Peace out dudes. Oh, keep your posteriors justified.

Andrey: True, true.

Will Super-Intelligence's Opportunity Costs Save Human Labor?

Andrey Fradkin — Tue, 21 Oct 2025 17:32:01 GMT

In this episode, Seth Benzell and Andrey Fradkin read “We Won’t Be Missed: Work and Growth in the AGI World” by Pascual Restrepo (Yale) to understand what how AGI will change work in the long run.

A common metaphor for the post AGI economy is to compare AGIs to humans and men to ants. Will the AGI want to keep the humans around? Some argue that they would — there’s the possibility of useful exchange with the ants, even if they are small and weak, because an AGI will, definitionally, have opportunity costs. You might view Pascual’s paper as a formalization of this line of reasoning — what would be humanity’s asymptotic marginal product in a world of continually improving super AIs? Does the God Machine have an opportunity cost?

Andrey, our man on the scene, attended the NBER Economics of Transformative AI conference to learn more from Pascual Restrepo, Seth’s former PhD committee member.

We compare Restrepo’s stripped-down growth logic to other macro takes, poke at the tension between finite-time and asymptotic reasoning, and even detour into a “sheep theory” of monetary policy. If compute accumulation drives growth, do humans retain any essential production role—or only inessential, “cherry on top” accessory ones?

Relevant Links

We Won’t Be Missed: Work and Growth in the AGI World — Pascual Restrepo (NBER TAI conference) and discussant commentary
NBER Workshop Video: “We Won’t Be Missed” (Sept 19 2025)
Marc Andreessen, Why Software Is Eating the World (WSJ 2011)
Shapiro & Varian, Information Rules: A Strategic Guide to the Network Economy (HBR Press)
Ecstasy: Understanding the Psychology of Joy — Find the sheep theory of the price level here: Seth’s Review

Priors and Posteriors

Claim 1 — After AGI, the labor share goes to zero (asymptotically)

Seth’s prior: >90% chance of a large decline, <10% chance of literally hitting ~0% within 100 years.
Seth’s posterior: Unchanged. Big decline likely; asymptotic zero still implausible in finite time.
Andrey’s prior: Skeptical that asymptotic results tell us much about a 100-year horizon.
Andrey’s posterior: Unchanged. Finite-time dynamics dominate.
Summary: Compute automates bottlenecks, but socially or physically constrained “accessory” human work probably keeps labor share above zero for centuries.

Claim 2 — Real wages 100 years after AGI will be higher than today

Seth’s prior: 70% chance real wages rise within a century of AGI.
Seth’s posterior: 71% (a tiny uptick).
Andrey’s prior: Agnostic; depends on transition path.
Andrey’s posterior: Still agnostic.
Summary: If compute accumulation drives growth and humans still trade on preference-based or ritual tasks, real wages could rise even as labor’s income share collapses.

Keep your Apollonian separate from your Dionysian—and your accessory work bottlenecked.

Timestamps:

[00:01:47] NBER Economics of Transformative AI Conference

[00:04:21] Pascual Restrepo’s paper on automation and AGI

[00:05:28] Will labor share go to zero after AGI?

[00:43:52] Conclusions and updating posteriors

[00:48:24] Second claim: Will wages go down after AGI?

[00:50:00] The sheep theory of monetary policy

Transcript

[00:00:00] Seth: Welcome everyone to the Justified Posteriors Podcast, where we read technology and economics papers and get persuaded by them so you don’t have to.

Welcome to the Justified Posteriors Podcast, the podcast that updates its priors about the economics of AI and technology. I’m Seth Benzell performing bottleneck tasks every day in the sense that I’m holding a bottle and a baby by the neck down in Chapman University in sunny, Southern California.

[00:00:40] Andrey: I’m Andrey Fradkin, practicing my accessory tasks even before the AGI comes coming to you from San Francisco, California.

So Seth, great

[00:00:53] Seth: to be. Yeah, please.

[00:00:54] Andrey: Well, what are you, what have you been thinking about recently? What have been, [00:01:00] contemplating.

[00:01:01] Seth: Well, you know, having a baby gets you to think a lot about, what’s really important in life and what kind of future are we leaving to him, you know, if we might imagine a hundred years from now, what is the economy that he’s gonna have when he’s retired?

Who even knows what such a future would look like? And a lot of economists are asking this question and there was this really kind of cool conference that put together some of the favorite friends of the show. An NBER Economics of Transformative AI Conference that forced participants to accept the premise that AGI is invented.

Okay, go do economics of that. And Andrey, I hear that somehow you were able to get the inside scoop.

[00:01:47] Andrey: Yes. Um, it was a pleasure to contribute a paper with some co-authors to the conference and to attend. It was really fun to [00:02:00] just hear how people are, um, thinking about these things, people who oftentimes I associate with being very kind of serious, empirical, rigorous people kind of thinking pie in the sky thoughts about transformative AI.

So, yeah, it was a lot of fun. Um, and there were a lot of interesting papers.

[00:02:22] Seth: Go ahead. Wait. No, before I want, I’m not gonna let you off the hook Andrey. Yeah, because I have to say, just before we started the show, you did not present all of the conversation at the seminars as a hundred percent fun as enlightening, but rather you found some of the debate a little bit frustrating.

Why? Why is that?

[00:02:39] Andrey: Well, I mean, I, you know, dear listeners, I hope we don’t fall guilty of this, but I do find a lot of AI conversation to be a little cliche and hackneyed at this point. Right. It’s kind of surprising how little [00:03:00] new stuff can be said. If you’ve read some science fiction books, you kind of know the potential outcomes.

Um, and so, you know, it’s a question of what we as a community of economists can offer that’s useful or new. And I do think we can, it’s just, it’s very easy to fall into these cliches or well trodden paths.

[00:03:20] Seth: What? What’s the meaning of life? Andrey? Will life have meaning after the robot takes my job? Will my AI girlfriend really fulfill me?

Why do we think economists would be good at answering those questions?

[00:03:34] Andrey: Yeah, it’s a great question, Seth. I’m not sure. Um,

[00:03:39] Seth: I think it’s because they’re the last respected kind of technocrat. Obviously all technocrats are hated, but if anybody’s allowed to have an opinion about whether your anime cat girl waifu AI companion is truly fulfilling.

We’re the only, we’re the only source of remaining authority.

[00:03:57] Andrey: Well, you know,

[00:03:57] Seth: unfortunately,

[00:03:58] Andrey: I think it’s a [00:04:00] common thing to speculate as to which profession will be automated last, and certainly Marc Andreessen believes that it is venture capitalist. So

[00:04:11] Seth: Fair enough. I’ll narcissism, I’ll leave

[00:04:13] Andrey: it as an exercise to the listener what economists think.

[00:04:21] Seth: So let’s talk about, so we’re, we’re at, we’re talking about whether humans will be essential in the long run because the particular paper that struck my eye when I was looking at the list of seminars topics was a paper by friend of the show, I hope he considers us a friend of the show because I love this guy.

Pascual Restrepo, a professor of economics and AI at Yale University. Um, had the honor of having this guy on my dissertation committee was definitely a role model when I was a young gun, trying to think about macro of AI before everyone on earth was thinking about macro of AI. [00:05:00] Um. And so it’s a real honor for the show to take on one of his papers and he’s got something that’s trying to respond to.

Okay. Transformative AI shows up. What are the long-term dynamics of that? Which is a departure from where he wants to be. He wants to live in near future. We automate another 10% of tasks land. Right. So I was excited to take this on. Um, Andrey, do you wanna maybe, introduce some of the questions it asks us to consider?

[00:05:28] Andrey: Yeah. So, Pascual presents a very stylized model of the macro economy and we picked two claims from the paper to think about in terms of our priors. Um, the first one of these is, um, after we get AGI in the limit, the labor share will go to zero. That is the first claim of this paper. Um, what do you think about that, Seth?

[00:05:59] Seth: Great question. [00:06:00] Um, so to remind listeners, so the labor share is if you imagine all of the payments in the economy, some are going to workers and then some are going to people who own the machines or own the AI, right? So today about two thirds of the money or about 60% of the money is paid to workers.

About 40% is paid to machines and out to profits and people who own stuff. It is a claim of this paper and a kind of a lot, a theme of a lot of the automation literature that as you get more and more automation, you’d expect the share of monies that are being paid to workers to go down, right? Because just more of the economy is just automation unconstrained by.

Um, let me tell you how I think about this question, Andrey. First of all, you know, we’re not gonna talk about out to infinity. I know these are asymptotic papers, but let’s try to stay a little bit closer. Um, so I’ll, I’ll mostly be thinking about like a hundred years after [00:07:00] AGI, right? So we have AGI, and now we’re, we’ve played it out in some sense.

We’ve had the next industrial revolution that happens from AGI, right? Assuming we don’t have an apocalypse, so this is, let’s set aside, conditional on, you know, we don’t destroy ourselves, which I don’t think there’s a huge chance of that, but that’s another question. I would say there’s a greater than 90% chance of very large decreases in labor share, you know, down from 60% today to 5%, 10%, 20%.

I really do see that. But I think there’s like a less than a 10% chance that within a hundred years of AGI, um, we’ll have, you know, literally 0% labor share or whatever, like less than 1% labor share. Why do I say that? This is something that’s gonna come up. I’m gonna start by just kind of questioning the premise of whether AGI really means.

That all services can be provided by the AI, right? I know. I don’t know if this [00:08:00] counts as being allowed. I’m gonna give you a fun example. Andrey. Have you ever heard of a pidyon haben?

[00:08:05] Andrey: No.

[00:08:06] Seth: You’ve never heard of a pidyon haben? Well, this is a tradition in Deuteronomy. It’s one of the few halakhic laws that actually make intuitive sense to me because it’s revenue-generating. When you have a firstborn son who is not a kohen or a Levi, you “buy” the baby out of service to the Temple. The cost is exactly five silver pieces (shekels) of a specified weight—they’re very specific about the weight; it’s not just any five silver coins. And here’s the thing: it has to be paid to a kohen (a member of the priestly family of Jews). Minor correction for Justified Posteriors fans: the pidyon haben is paid to a kohen, not a Levi. I couldn’t let that error stand. Thank you.

So that economic interaction is value that, by definition, can’t be captured by the AI. In some sense that’s a greater-than-zero slice of the economy, asymptotically—well, I guess it depends on whether silver is rare asymptotically. But that’s the kind of example I have in mind, and it’s why I don’t think the labor share gets literally to zero. Andrey, gimme your thoughts.

[00:09:31] Andrey: Yeah, I mean, look, zero is an asymptotic result, so I do think, let’s say less than 1% in a hundred years. With your example, I think it’s very easy to imagine a virtual kohen to collect said revenue. So I actually—no, let’s—

[00:09:53] Seth: Think about the political economy of it for a second. Who gets to decide whether it counts if you send it to the robot?

[00:10:00] Well, the rabbi. The human rabbi decides.

[00:10:02] Andrey: The human rabbi might be a capital owner, but the—

[00:10:05] Seth: The human rabbi may—that’s the danger.

[00:10:09] Andrey: Yeah. Rabbis—I mean, I can think of things. Your point is that some occupations may require a human involved, right.

[00:10:25] Seth: And they may be some sort of fraction of the economy asymptotically. They’re not linear additive, because that’s a distinction that’s going to become important.

[00:10:33] Andrey: Yeah, later. So, I think about part of this as being about population growth, and that’s a good point. Because if one of the things that AI does is increase the number of humans, and there’s some sort of human scaling law, if you will—that AGI can “make” humans very cheaply and quickly, I assume—then I think that’s one thing to think about. And then I think the other possibility, and this is not talked about in this paper, is: there are certain things where you can throw as much compute as possible and you still get returns—like exploring outer space—but there might be a difference in how much humans value that versus how much AIs value that.

[00:11:40] Seth: That’s a super good question that is not raised. I think I was trying to read this paper as “we only care about human utility,” but that’s obviously not unimportant here.

[00:11:50] Andrey: Yeah. Nonetheless, a hundred years is getting to the point where a lot can happen, but I’d still—as a betting man—say it’s pretty unlikely in a hundred years that waged labor will be less than 1%.

[00:12:08] Seth: Yeah, we’ll probably destroy enough capital along the way that we will get back to that asymptote.

[00:12:12] Andrey: Yeah. So that’s kind of my part. The second is where I think it’s a contentious claim: wages won’t go down in the long run because people can always break away. And that’s the argument in the paper. So let’s just focus on the first part of that, which is “wages can’t go down in the long run with AGI.” And what we mean here is not wages as a percentage of earnings, but real wages.

[00:12:47] Seth: Precisely.

[00:12:47] Andrey: Yeah.

[00:12:48] Seth: This seems to me like a naïve simplification of the model, which is what gives us that. It seems to me if you’re going to be so expansive as to take the stance that even my kohen example won’t hold up in the long run and it really is going to do every single job, you have to imagine some sort of crowding out of resources that are necessary for human labor to get anything done effectively. Right? This is a model that kind of very naïvely says that there’s always the— you know, the forest where there’s “good enough,” and the Lockean cliché, right? Anyone can go to the forest and take some wood and make a knife—therefore property rights, whatever. That cliché is in the back here. But of course, if you had a super-duper powerful AI, they might need that wood first. They’re going to use up all the resources. There’ll be no starting tinder for the humans to get started with. And then that will effectively drive down wages. So I don’t— I think to the extent that we get an AGI that is driving down labor share, what has to save the day is that there is some essential thing—call it a bottleneck—that only humans can do. What is the percentage chance that we get saved by one of those to keep wages up? Do I think it’s closer to—

[00:14:20] Andrey: Now are you talking about asymptotia or a hundred years?

[00:14:23] Seth: I’m talking about a hundred years.

[00:14:25] Andrey: See, this is where I’m a little confused, Seth. In asymptotia I kind of agree with you, but in the hundred-year horizon—especially since you think that wages are going to still be around—I would think that the cumulative wage would be higher than we have now.

[00:14:45] Seth: I’m saying 30% chance of this. I’m trying to make those two predictions format.

[00:14:50] Andrey: Which one is this, just to be clear?

[00:14:52] Seth: Great. I think there is a 30% chance that wages will go down. So I think there’s a 70% chance that wages will go up.

[00:15:03] Andrey: On average, as a result of AGI. So real wages per capita globally—just to be accountable—70%.

[00:15:10] Seth: This is my hundred-year prediction. A hundred years from now—dig me up—a hundred years after AGI, the real average wage will be higher than today. I’m good with that.

[00:15:25] Andrey: I would say it’s more like 80%.

[00:15:28] Seth: 80%. Okay. So you’re more—well, maybe we can talk at the end about why we start and end up at slightly different places. You ready to get into the model?

[00:15:47] Seth: We heard our priors. Now we confront the evidence. Do, do, do, do. Okay. So Pascual’s got a pretty straightforward model for us. The two premises he wants to start with are: first, the idea that we’re going to invent “robots,” which he means by “compute”—the accumulation of more AI compute over time. So literally chips and energy, I would say. But then he clarifies that this also includes any sort of physical instantiation of capital needed to move things in the physical world. So what he calls compute, I would think is more usefully thought of as robots. It’s going to do anything you need it to. The idea is that asymptotically we are going to invent robots that can do anything—any work that can be valuable in the economy. But he’s going to allow for the possibility that there’s some sort of comparative-advantage trade relationship with humans. We’ll come back to that. And then the second asymptotic here is the idea that the stock of robots and compute is going to grow indefinitely. So we’re thinking about the indefinite future: we have more robots than you possibly know what to do with. If you want your sci-fi comparison, this could be Isaac Asimov’s “Naked Sun,” where there are 50 people on a planet, each of whom owns a continent-sized estate and has vast swaths of robot servants. Maybe that’s what you should be thinking of as this asymptotic economy. From that, and just the assumption that economic output is the sum (in a complicated way) of all of these different jobs that could be done, he then distinguishes between two kinds of work in the economy: bottleneck work and accessory work, which I think is the most interesting novel distinction introduced here. Before I get into that, anything I missed from the model you want to throw in there, Andrey?

[00:18:09] Andrey: Did you mention the constant returns to scale?

[00:18:14] Seth: Go ahead and say it. Yeah, also there are constant returns to scale.

[00:18:16] Andrey: There are constant returns to scale. There is no real capital to speak of other than compute.

[00:18:24] Seth: Ownership—yeah, this is just the production side model. There’s no “where do these dynamics come from?” Maybe there’s a social planner deciding some of this, but 90% of the paper is not going to take a stance on the consumer/household side of the economy.

[00:18:53] Andrey: Yeah. And the other thing is that he uses the term “bottleneck,” but that is a very confusing word, so it’s best not to—okay, let’s get it right now—it’s best not to use it, actually. One of the key comments at the conference was to rename that word.

[00:19:09] Seth: Let’s talk, because I like it. I think you guys are being mean to Pascual for no reason. Pascual, if you’re ever in trouble, I’m going to tell you why. There is a concept I use all the time for thinking about long-run macro dynamics: when we combine automated and non-automated things, are they gross complements or gross substitutes? In a CES production function, my understanding is that the concept of bottleneck work would correspond to anything that is Cobb-Douglas or more complementary in the asymptote, and anything that’s accessory work would be more grossly substitutable than Douglas. That’s how it would work for CES production functions.

[00:20:14] Andrey: I’ll take your word for it, Seth.

[00:20:18] Seth: Well, let me give you an intuition. In one extreme, we have perfect complements: if humans are peanut butter and AI is jelly, clearly the humans are a bottleneck there. Then we have the perfect substitutes extreme: if humans are margarine and AI is butter, great—there’s more spread out there; they’re not hurting each other. Those are the two extremes. There’s a continuum between them. In a CES production function it’s clear. The underlying concept is more general: in the limit, is this a bottleneck? In the limit, is this a substitute? Maybe you don’t love this language, but there should be words for “in the limit is this a gross substitute?” vs. “in the limit is this a gross complement?” I think these are the words I’ve been looking for. Why didn’t they like it?

[00:21:23] Andrey: I think because Pascual’s example was that the AI will be out there exploring space, and people all conference long, when they use the word “bottleneck,” are thinking about current production processes where there might be bottlenecks because it’s a part AI can’t do end-to-end. So when you’re talking about bottlenecks, it’s really like, “here’s this little thing that we need a human to do in this process—like give the AI the bank account number,” or whatever. That’s a very different type of task.

[00:21:59] Seth: I’m coming at it from the consumption side and they’re coming at it from the production side. I think I’m much more on Pascual’s side. I think he’s being held back by the smooth brains at the conference.

[00:22:10] Andrey: I just don’t think any normal human being, when they think about the word “bottleneck” and tasks, is thinking about AI exploring space.

[00:22:28] Seth: His example is terrible. But he’s a beloved weirdo; that’s why he’s a friend of the show.

[00:22:35] Andrey: I’m not attacking him. I’m saying this word is not the right one. In his model, if we do have near-infinite compute ability, we will do cool stuff—like we recreate our own version of the Matrix with cell-level simulation of the entire world. Is that a bottleneck? It’s not a bottleneck. We can do all sorts of very large-scale things—at least AI can do it.

[00:23:15] Seth: Very interesting. I can see why you don’t like the word. There needs to be a word for the concepts I described. So anyway, I like these two concepts: in the limit, do you need humans to get more output, or in the limit do you not? Those are the concepts. Are you ready to proceed to his results?

[00:23:39] Andrey: I actually wanted to question you on that last one.

[00:23:41] Seth: Please.

[00:23:46] Andrey: “In the limit, do you need humans or not?” is not actually the definition in this paper.

[00:23:49] Seth: Let me think for a second.

[00:23:49] Andrey: The task, not the human.

[00:23:49] Seth: No, the human was the example. I’m sorry if that was confusing. The question is: in the limit, do you need the task or not? That is the question in the paper.

[00:24:00] Andrey: I view it as a satiation sort of thing. There are only so many live music performances the world needs, if that’s what we think humans are going to be doing. Other things—the universe is pretty large, maybe not infinite—so there’s lots to explore, and that doesn’t get satiated.

[00:24:24] Seth: I don’t see how satiation comes in.

[00:24:26] Andrey: Because one of the conditions is about the derivative of the production function.

[00:24:35] Seth: Right. So if you became satiated on an input, of course it couldn’t be a bottleneck task. Of course. Satiation would be one mechanism for not being a bottleneck. Good. Last comments before we get to the results?

[00:24:56] Andrey: No, go for it.

[00:25:00] Seth: Prop 1: All bottlenecks are eventually automated while some accessory work may be left to labor. Okay, what’s the intuition here?

[00:25:06] Andrey: The intuition is opportunity cost. If compute is being used for this task, that means it’s not being used for some other task that maybe has a higher return or humans can’t do. As a result, humans are going to be left doing some kind of low-value work because the compute is better used elsewhere.

[00:25:39] Seth: Right, but now it’s a claim about what that low-value work will be. It’s got to be the thing the AI won’t always need to make more of. If there’s anything that’s going to hold back the AI, it’s going to do more of it, because this is super-AI.

[00:25:58] Andrey: Like creating more compute, for example.

[00:26:01] Seth: Yeah, they’re not going to let the humans be in charge of that. Don’t worry. So what’s left is this concept in the paper—we can discuss how realistic it is—where humans can go to the woods and do their constant-returns-to-scale task with each other, and maybe even have a parallel economy, or maybe it’s just the cherry-on-top economy.

[00:26:26] Andrey: Yeah, so now we’re getting to the argument for why real wages won’t go down though. That’s what you’re saying.

[00:26:37] Seth: “While some accessory work may be left to labor”—I was explaining the second half of that sentence.

[00:26:42] Andrey: I think you’re mixing two concepts. Some accessory work would be left to labor is one claim. A different claim is that wages can’t go down because essentially, in his model, all the humans can say “screw this AI, we’re going to recreate our own economy,” and the AIs won’t care. So they’ll be able to do just as well as in a world without AGI. That, to me, is a ridiculous argument, but it’s also different from the argument for the fact that there are accessory jobs.

[00:27:26] Seth: Why is it interesting that there are accessory jobs? In my interpretation, there is an outside option providing a floor on wages that happens to be an accessory job.

[00:27:43] Andrey: I don’t agree. The accessory job is not providing the minimum wage. Without accessory jobs there are no wages. I don’t understand how it could be providing a minimum wage when without accessory jobs humans don’t do anything.

[00:28:16] Seth: No, that’s not this model. There is the special case where there are no accessory jobs. What they do then is a really lousy complement—they do the most human-comparative-advantage, human-complementary job.

[00:28:29] Andrey: I don’t think such a job would exist. I’d be shocked if, for anything that’s truly scalable, humans in the loop could even be positive.

[00:28:44] Seth: Let me think about that for a second. So you don’t like that special case where all tasks are ultimately bottlenecks for each other?

[00:28:52] Andrey: Yeah. What is a human going to do at an automated GPU factory, exactly? They’re going to need to be fed. I don’t see how humans could be net positive in those types of production processes.

[00:29:17] Seth: What I want to point out one more time is you’re coming at “bottleneck” from the production side, and I’m coming at it from the consumption side. One more note to Pascual to maybe think about in the next draft.

[00:29:30] Andrey: Want to skip to Proposition 3?

[00:29:33] Seth: No, we haven’t finished talking about these propositions. Just to be clear, accessory jobs are the reason humans have substantial wages at all.

[00:29:56] Andrey: That’s a different claim.

[00:30:00] Seth: The two claims have to be compatible.

[00:30:09] Andrey: Sure. I thought we’d talk about the plausibility of the model’s implications for those claims separately.

[00:30:21] Seth: I find myself unconstrained by this ordering of concepts, but happy to comply.

[00:30:28] Andrey: Go ahead. What were you going to say?

[00:30:30] Seth: In my mental model of this model: there is a special case where there is no accessory work—everything is ultimately a bottleneck for everything else. That is a special case. And then he also says that in all versions of this model, as I understand it, wages can’t go down. Those cannot both be true and it also be the case that the only thing that keeps wages from going down is the existence of accessory jobs.

[00:31:09] Andrey: I think we’re also mixing “what is in his model” versus “what are the economic forces,” which is always hard because it’s so stylized.

[00:31:26] Seth: Fair.

[00:31:26] Andrey: The interesting economic content of the model is that there are accessory jobs allowing humans to persist in having some positive labor contributions that are not taken up by the machines. Why aren’t the machines doing it? Because the machines have better things to do.

[00:31:54] Seth: One way to think about it: if you have automation and there’s perfect substitution, it kind of doesn’t affect your life. Suppose we sell oil and I’m a whaler who collects whale oil. My friend invents oil wells and gets a hundred times the amount of oil I have. In an economy where there’s only oil: that guy got a lot of oil—good for him—I still have my whale oil. In an economy where oil’s a complement to everything else, I’m ruined because now the price has collapsed.

[00:32:39] Andrey: Now let’s go to the claim that in such a world, wages can’t go down.

[00:32:51] Seth: In a world where there’s only one thing—or rather, where the things are substitutes—wages can’t go down. That’s the connection between an accessory task and a gross substitute. If your oil is good and my oil is good, and we can both enjoy each other’s corn—if you get more corn, that doesn’t affect my corn. So my wage can’t go down. I can talk about why that would break, but that’s why it happens in this model.

[00:33:27] Andrey: Any model here where there’s perfect alignment of what humans want and what the machines want—you’re producing more, and it’s going to go to humans. It’s almost a reductio that, in such a model, real wages have to go up.

[00:33:56] Seth: This is almost like a Pareto model: good things have to happen in a Pareto model.

[00:34:02] Andrey: If there’s a social planner, the planner is maximizing utility, and the utility is human utility, not machine utility.

[00:34:14] Seth: It’s like “the guy who got free stuff has more income” theorem.

[00:34:19] Andrey: Right. So I think it’s strange to think about this, because no one is seriously worried about the situation where we’re infinitely wealthy and have perfect control of our AI.

[00:34:44] Seth: Okay, so what’s the work the model is doing? It’s trying to tell us that that’s the case where there isn’t good accessory work, maybe. The sad case is where there’s a negative externality of whatever the AI is accumulating on our wages. How could that work? What’s not modeled here? There’s no sense in which AI can crowd out investment in capital that complements humans. What this model excludes is the idea that when I build a robot, I might not be building a computer for a human to use. That’s why wages go down: no one invests in making humans productive because it’s better to invest in making AI productive.

[00:35:25] Andrey: I’m not even sure that’s enough. If ultimately some part of the AI production chain is kicking back things that humans like, I’d be more worried that if the AIs have transcended humanity and all resources must be used to explore space, we might find ourselves without a planet Earth because all the resources will be extracted.

[00:35:57] Seth: Pascual did not do this model any favors in his presentation, I could tell.

[00:36:06] Seth: I think this is happening today. Will you guys listen to our “Canaries in the Coal Mine” paper? You could argue that today AI is leading to reduced investment into some kinds of young people’s human capital. That plus humans’ human capital eventually being replaceable is the kind of thing that would drive down wages in the absence of an accessory job to fall into. We can talk about what that would be—like providing mental-health services to each other in a linear way.

[00:36:56] Andrey: There’s still a distinction between a world where human labor is close to worthless and a world where humans are materially worse off. If the AI is perfectly aligned, humans don’t do any work, but they get all the goods; they can own it; they get capital income.

[00:37:19] Seth: 0% labor income equals 100% capital income.

[00:37:23] Andrey: Yeah. I feel like it’s really important to have that as a force in the model.

[00:37:31] Seth: So what’s the fantasy—the utopian fantasy? This is Bostrom; this is The Culture. You are doted on by robots that do every possible thing a human could do—except five silver coins for pidyon haben. That economy is what we’re describing, where I could have more robots, but maybe I’m saturated with robots. Maybe I have linear returns to robots; I’m just building exponentially more robots.

[00:38:00] Andrey: I think about accessory work as more addressing the meaning aspect. There’s a sweet spot, if there is accessory work, where humans are the doers of it and they find meaning.

[00:38:16] Seth: If they’re the doers of it—well, isn’t that a complement, then?

[00:38:20] Andrey: The examples he provides are musicians; I imagine that could provide a lot.

[00:38:27] Seth: Musicians make sense, because there could be some linearity to it.

[00:38:32] Andrey: We’re all going to be creating art for each other, and we’re going to value human-made art, and the AIs are going to explore the universe and create cancer cures.

[00:38:43] Seth: And then give us money.

[00:38:45] Andrey: And give us whatever—and we will have the Star Trek machine where we get any material good that we need.

[00:38:54] Seth: Okay, good. I’m looking forward to it.

[00:38:56] Andrey: Yeah.

[00:38:56] Seth: How are we doing on time? How many more props do we want to do? We want to do Prop 3. This is my hobby horse; give me a little time on Prop 3.

[00:39:06] Andrey: Sure. Let’s do that.

[00:39:08] Seth: One of the results of this paper is that asymptotically we have an AK growth model. What does that mean? It says that if you are able to automate all tasks, the economy’s growth rate will grow with the accumulation of more capital. That makes sense: robots can do everything; the output of the economy is how many robots you have and how good they are at being robots—plus a productivity term. That is true of this model. What that means is the long-term growth rate of the economy is the national saving and reinvestment rate—it’s the rate at which we compound today’s compute into tomorrow’s compute. There’s a technological aspect, but it’s also a social decision. I will never stop getting onto this chair and waving my flag: if you care about a future of automation, you should care about the national saving rate, because that is the growth rate in the world with automation. Andrey, were you pleased to see this prediction?

[00:40:16] Andrey: I think it makes sense. In these types of models it has to be true. We’ve all played Factorio—it’s not a surprise.

[00:40:34] Seth: It’s just basic Factorio-nomics. Okay, one last proposition, a variation on that. He’s starting to think about dynamics. He has some things to say about what’s happening in the dynamic model, but he points out: if you can use your compute to make AI more productive in a within-period decreasing-returns-to-scale manner, then basically the growth rate is the compute accumulation rate times a constant factor. Basically this form of science reinforcing the AI is not enough to get a regime change in the growth rate. It gives you a little boost. I thought that was cool.

[00:41:23] Andrey: It’s nice for the model not to explode.

[00:41:27] Seth: Did he get panned for that? A lot of people like models that explode. Jones has a model that explodes.

[00:41:33] Andrey: I don’t think people were concerned about finite-time explosion. They were concerned with the bottlenecks.

[00:41:40] Seth: I’m going to make a Yudkowsky-ish point. One of the main reasons that, upon reflective equilibrium, I’m not super worried about the doomer scenarios is that in my brain power has a connection to GDP, and in all of these models GDP has to grow in this regular exponential way—which is fast, but it’s not “today to tomorrow” fast. Based on how we think that works, the idea that we would get an algorithmic explosion where power explodes overnight seems out of sample.

[00:42:21] Andrey: I mean, we have no idea. It could still be—

[00:42:34] Seth: The saving rate—

[00:42:35] Andrey: We don’t know how much the AGI would choose to reinvest into its own growth. We just don’t. So I don’t think, in the transition dynamics, this is a very plausible argument. Nothing you just said prevents an AI from starting an automated AI factory and tripling itself over the course of a week.

[00:43:04] Seth: Yeah—exponential growth with an exponential rate determined by its reinvestment rate.

[00:43:09] Andrey: I don’t find that comforting. That exponential rate could be really fast.

[00:43:15] Seth: I’m saying there are models where we go from zero to infinity in finite time.

[00:43:22] Andrey: Sure.

[00:43:24] Seth: In any finite amount of time it’s still going to be one huge number and another huge number, and that gives me very little comfort personally.

[00:43:34] Seth: Okay. Viewers at home, tell us: how much scarier is an asymptote to infinity than an exponential? We’ll get those votes and report ’em next week.

[00:43:52] Andrey: It could be exponential to, you know, very—it could be a really big exponential. Big exponential.

[00:43:52] Seth: Let’s move to our conclusions and posteriors. Do you have any overall points you want to make about the paper before we move into posteriors?

[00:44:04] Andrey: It’s a fun thought exercise. I enjoy thinking about it.

[00:44:09] Seth: At a stylistic level, I really prefer the way that Pascual writes these to the way that Ben Jones and even Daron Acemoglu write these. I found the stripped-downness and the lack of rhetorical pretense in this draft really refreshing, and sensible given his comparative advantage. What’s not in here: I got on my high horse to say “saving rate important.” I think the idea that there isn’t some fixed other thing getting used up that could drive down human wages is an obvious omission that is not relevant to AI today. It seems like you’re modeling AI a thousand years from now, so at least nest what’s happening today. But it’s an elegant way of providing some fundamental points that I think are true of a lot of models, in language that I think is useful. So I liked this theory paper, even though I don’t think it’s going to move my priors that much.

[00:45:22] Andrey: I think I’m in the same boat.

[00:45:32] Seth: Moving to our posteriors. Our first question was: after we get AGI, asymptotically the labor share will go to zero. I said greater than 90% chance for large decreases of labor share, less than 10% chance of going to super-duper small—like less than 1%—within a hundred years. Am I moved here? We raised ideas in either direction that would mitigate. On the one hand, there might be some essential human bottleneck you can never automate. On the other hand, many kinds of human productivity require investment into humans—physical or human capital—that might get diverted to AI. Therefore wages could go down in an accelerated way to zero. I do not see these as contradictions along the path. But for the asymptotics, that’s a prior thing, so on this particular question I didn’t move.

[00:46:51] Andrey: I don’t know if I moved very much either. The tricky thing is infinite results vs. finite-time predictions. A hundred years is a long time, but it’s also not infinity. It’s hard for me.

[00:47:17] Seth: You might imagine a long tail—something we were riffing on before the show. Maybe first we automate 90% of jobs, then 95%, then 97%, and that asymptotic tail is still important and complementary and bottleneck-y enough a hundred years from now that there’s a big labor share because that’s the one last essential job.

[00:47:42] Andrey: Yeah. And once again—if there are jobs where humans demand that other humans do them, the only way compute can do them is to trick the human into thinking it’s a human doing it when it’s really an AI. That’s possible, but we’re getting into some pretty ridiculous—

[00:48:04] Seth: We should have a test for that. Some sort of Andrey test—or maybe a Turing test. All right, second past prior—let’s posteriorize it. We have to justify that wages won’t go down in the long run because people can always break away and recreate the economy—do their own accessory work thing.

[00:48:23] Andrey: Yeah.

[00:48:24] Seth: I said: wages won’t go down after AGI. After a hundred years, I would say real wages are higher a hundred years after AGI—70%. Did I move because of this paper? Maybe this moves me down 1% to 69%, based on a conference full of people accepting that premise.

[00:48:52] Andrey: Just to be clear, this paper is arguing for wages not going down, so why are you going down?

[00:48:56] Seth: 71%. I said 70% they go up; I’m going down to 69% they go up.

[00:49:04] Andrey: I see. I view “equal” as a knife-edge case—it’s measure zero. So you shouldn’t adjust at all.

[00:49:16] Seth: No, actually—dude—oh my God. All right, I’ll let us go out on this joke. I read a book that had the most hilarious theory of monetary policy the other day. It was in our book club that Andrey is in, where we read weird philosophy texts. Let me find it. It was in the book Ecstasy, which is a book about having fun, I guess, by a Freudian analyst. And in it he offers the following theory of the price level. So, on page 45, in his discussion of Dionysus as the scapegoat, the author writes: “Sheep represent everything of value in our Judeo-Christian world. The sheep, in fact, is the chief determinant of our currency. Every currency in the Western world—the shilling, the franc, the Deutsche mark, the lira, the peso, the Austrian thaler, from which we got our dollar—was the price of one sheep. For centuries there was no inflation in the Western world because one of our money pieces was worth a sheep. You could count on that anywhere, anytime.” Wow. Someday I hope to write economics as good as that, Andrey.

[00:50:47] Andrey: Hallucinations. I feel like the AIs are unfairly maligned when humans are very good at it.

[00:51:00] Seth: He was sent a vision of the synthesis of economic policy. This is why you’ve got to keep your Apollonian and your Dionysian separate out there, guys. So let’s leave it on that note. Keep your Apollonian separated from your Dionysian, and keep your accessory work bottlenecked.

[00:51:15] Andrey: Inshallah.

[00:51:17] Seth: Oh wait. No, before we go, I apologize to all of my guests for anything bad I did to them over the last year!

Can political science contribute to the AI discourse?

Andrey Fradkin — Tue, 07 Oct 2025 01:52:14 GMT

Economists generally see AI as a production technology, or input into production. But maybe AI is actually more impactful as unlocking a new way of organizing society.

Finish this story:

The printing press unlocked the Enlightenment — along with both liberal democracy and France’s Reign of Terror
Communism is primitive socialism plus electricity
The radio was an essential prerequisite for fascism
AI will unlock ????

We read “AI as Governance” by Henry Farrell in order to understand whether and how political scientists are thinking about this question.

Concepts or other books discussed:
- E. Glen Weyl, coauthor of Radical Markets: Uprooting Capitalism and Democracy for a Just Society, and key figure in the Plurality Institute was brought up by Seth as an example of an economist-political science crossover figure who is thinking about using technology to radically reform markets and governance.
- Cybernetics: This is a “science” that studies human-technological systems from an engineering perspective. Historically, it has been implicated in some fantastic social mistakes, such as China’s one-child policy.
- Arrow’s Impossibility Theorem: The economic result that society may not have rational preferences — if true, “satisfying social preferences” may not be a possible goal to maximize
- GovAI - Centre for the Governance of AI
- Papers on how much people/communication is already being distorted by AI:
  - Previous episode mentioned in the context of AI for social control:
- Simulacra and Simulation (Baudrillard): Baudrillard (to the extent that any particular view can be attributed to someone so anti-reality) believed that society lives in “Simulacra”. That is, artificially, technologically or socially constructed realities that may have some pretense of connection to ultimate reality (i.e. a simulation) but are in fact completely untethered fantasy worlds at the whim of techno-capitalist power. A Keynesian economic model might be a simulation, whereas Dwarf Fortress is a simulacra (a simulation of something that never existed). Whenever Justified Posteriors hears “governance as simulation”, it thinks: simulation or simulacra?
  The concept of “simulacra” inspired The Matrix movies

Episode Timestamps

[00:00:00] Introductions and the hosts’ backgrounds in political science.

[00:04:45] Introduction of the core essay: Henry Farrell’s “AI as Governance.”

[00:05:30] Stating our Priors on AI as Governance

[00:15:30] Defining Governance (Information processing and social coordination).

[00:19:45] Governance as “Lossy Simulations” (Markets, Democracy, Bureaucracy).

[00:25:30] AI as a tool for Democratic Consensus and Preference Extraction.

[00:28:45] The debate on Algorithmic Bias and cultural bias in LLMs.

[00:33:00] AI as a Cultural Technology and the political battles over information.

[00:39:45] Low-cost signaling and the degradation of communication (AI-generated resumes).

[00:43:00] Speculation on automated Cultural Battles (AI vs. AI).

[00:51:30] Justifying Posteriors: Updating beliefs on the need for a new political science.

Should AI Read Without Permission?

Andrey Fradkin — Mon, 22 Sep 2025 19:27:06 GMT

Many of today’s thinkers and journalists worry that AI models are eating their lunch: hoovering up these authors’ best ideas and giving them away for free or nearly free. Beyond fairness, there is a worry that these authors will stop producing valuable content if they can’t be compensated for their work. On the other hand, making lots of data freely accessible makes AI models better, potentially increasing the utility of everyone using them. Lawsuits are working their way through the courts as we speak of AI with property rights.

Society needs a better of understanding the harms and benefits of different AI property rights regimes.

A useful first question is “How much is the AI actually remembering about specific books it is illicitly reading?” To find out, co-hosts Seth and Andrey read “Cloze Encounters: The Impact of Pirated Data Access on LLM Performance”. The paper cleverly measures this through how often the AI can recall proper names from the dubiously legal “Book3” darkweb data repository — although Andrey raises some experimental concerns.

Listen in to hear more about what our AI models are learning from naughty books, and how Seth and Andrey think that should inform AI property rights moving forward.

Also mentioned in the podcast are:

Joshua Gans paper on AI property rights “Copyright Policy Options for Generative Artificial Intelligence” accepted at the Journal of Law and Economics:
Fair Use
The Anthropic lawsuit discussed in the podcast about illegal use of books has reached a tentative settlement after the podcast was recorded. The headline summary: “Anthropic, the developer of the Claude AI system, has agreed to a proposed $1.5 billion settlement to resolve a class-action lawsuit, in which authors and publishers alleged that Anthropic used pirated copies of books — sourced from online repositories such as Books3, LibGen, and Pirate Library Mirror — to train its Large Language Models (LLMs). Approximately 500,000 works are covered, with compensation set at approximately $3,000 per book. As part of the settlement, Anthropic has also agreed to destroy the unlawfully obtained files.”
Our previous Scaling Law episode:

EMERGENCY POD: Is AI already causing youth unemployment?

Seth Benzell — Tue, 09 Sep 2025 00:24:38 GMT

In our first ever EMERGENCY PODCAST, co-host Seth Benzell is summoned out of paternity leave by Andrey Fradkin to discuss the AI automation paper that’s making headlines around the world.

The paper is Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence by Erik Brynjolfsson, Bharat Chandar, and Ruyu Chen. The paper is being heralded as the first evidence that AI is negatively impacting employment for young workers in certain careers.

Seth and Andrey dive in, and ask — what do we believe about AI’s effect on youth employment going in, and what can we learn from this new evidence?

Related recent paper on AI and job postings: Generative AI as Seniority-Biased Technological Change: Evidence from U.S. Résumé and Job Posting Data

Also related to our discussion is the China Shock literature, which Nick Decker summarizes in his blog post:

Homo Economicus

The China Shock

When can trade make people worse off? In short, whenever there are frictions. If people and resources are able to frictionlessly transfer from use to use, then trade always makes us better off. When there are frictions, then trade can create winners and losers, and even create more losses than gains…

8 months ago · 19 likes · 7 comments · Nicholas Decker

AI and its labor market effects in the knowledge economy

Andrey Fradkin — Mon, 25 Aug 2025 19:01:22 GMT

In this episode, we discuss a new theoretical framework for understanding how AI integrates into the economy. We read the paper Artificial Intelligence and the Knowledge Economy (Ide & Talamas, JPE), and debate whether AI will function as a worker, a manager, or an expert. Read on to learn more about the model, our thoughts, timestamp, and at the end, you can spoil yourself on Andrey and Seth’s prior beliefs and posterior conclusions — Thanks to Abdullahi Hassan for compiling these notes to make this possible.

The Ide & Talamas Model

Our discussion was based on the paper Artificial Intelligence in the Knowledge Economy by Enrique Ide and Eduard Talamas. It is a theoretical model of organizational design in the age of AI. Here’s the basic setup:

The Setting: A knowledge economy where firms’ central job is solving a continuous stream of problems.
The Players: We have Workers (human or AI) and a higher-level Solver (human manager/expert or AI). Crucially, the human players are vertically differentiated—they have different skill levels.
The Workflow: It’s a two-step process: A worker gets the first shot at solving the problem. If they fail, the problem gets escalated up the hierarchy to the Solver for a second attempt.
The Core Question: Given this hierarchy, what’s the most efficient organizational arrangement as AI gets smarter? Do we pair human workers with an AI manager, or go for the AI worker/human manager combo?
- There are also possibilities not considered in the paper, such as chains of alternating managers and employees, something more network-y etc.

Key Debates & Critiques

Here are the most interesting points of agreement, disagreement, and analysis we wrestled with:

Is a Solver Really a Manager? We spent a lot of time critiquing the paper’s terminology. The “manager” in this model is really an Expert who handles difficult exceptions. We argued that this role doesn’t capture the true human elements of management, like setting strategic direction, building team culture, or handling hiring/firing.
My Desire vs. Societal Growth: Andrey confessed that while he personally wants an AI worker to handle all the tedious stuff (like coding and receipts), the economy might see better growth and reduced inequality from having AI experts and managers who can unlock new productivity at the highest levels.
The Uber Driver Problem: We debate how to classify jobs like Uber driving. Is this already an example of AI managing the human (high-frequency algorithmic feedback), or is the driver still an entrepreneur who will manage their own fleet of smaller AI agents for administrative tasks?

Go Deeper

Check out the sources we discussed for a deeper dive:

Main Paper: Artificial Intelligence and the Knowledge Economy (Ide & Talamas, JPE)
Mentioned Research: Generative AI at Work (Brynjolfsson, Lee, & Raymond on AI in call centers)

Timestamps

[00:00] Worker, Manager, or Expert?

[00:06] Who manages the AI agents?

[00:15] Will AI worsen inequality?

[00:25] The Ide & Talamas model explained

[00:40] Limitations and critiques

[00:55] Posteriors: updated beliefs

The Bets: Priors & Predictions

We pinned down our initial beliefs on two key questions about the future impact of AI agents, the foundation of our “Justified Posteriors.”

Prediction 1: Will Managing AI Agents Become a Common Job? What percentage of U.S. workers will have “managing or creating teams of AI agents” as their main job within 5 years?

Prediction 2: Will LLM-based Agents Exacerbate Wage Polarization?

Seth’s Prior: 25% chance it WILL exacerbate. Reasoning: Emerging evidence (like the call center study)
Andre’s Prior: 55% chance it WILL exacerbate. Reasoning: Skeptical of short-term studies; believes historical technology trends favor high-skill workers who capture the largest gains.

Our Final Posteriors

Prediction 1: Will Managing AI Agents Become a Common Job?

The model slightly convinced Seth that the high-skill vertical differentiation story might be stronger than he initially believed, leading to a small increase in his posterior for exacerbation.

One LLM to rule them all?

Andrey Fradkin — Tue, 12 Aug 2025 03:47:42 GMT

In this special episode of the Justified Posteriors Podcast, hosts Seth Benzell and Andrey Fradkin dive into the competitive dynamics of large language models (LLMs). Using Andrey’s working paper, Demand for LLMs: Descriptive Evidence on Substitution, Market Expansion, and Multihoming, they explore how quickly new models gain market share, why some cannibalize predecessors while others expand the user base, and how apps often integrate multiple models simultaneously.

Host’s note, this episode was recorded in May 2025, and things have been rapidly evolving. Look for an update sometime soon.

Transcript

Seth: Welcome to Justified Posterior Podcast, the podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzel, possessing a highly horizontally differentiated intelligence—not saying that's a good thing—coming to you from Chapman University in sunny Southern California.

Andrey: And I'm Andrey Fradkin, multi-homing across many different papers I'm working on, coming to you from sunny—in this case—Cambridge, Massachusetts.

Seth: Wow…. Rare, sunny day in Cambridge, Mass. But I guess the sunlight is kind of a theme for our talk today because we're going to try to shed some light on some surprising features of AI, some important features, and yet, not discussed at all. Why don't people write papers about the important part of AI? Andrey, what's this paper about?

Andrey: I agree that not enough work has been done on this very important topic. Look, we can think about the big macroeconomic implications of AI—that's really fun to talk about—but it's also fun to talk about the business of AI. Specifically, who's going to win out? Which models are better than others? And how can we measure these things as they're happening at the moment? And so that's really what this paper is about. It's trying to study how different model providers compete with each other.

Seth: Before we get deep into that—I do want to push back on the idea that this isn't macroeconomically important. I think understanding the kind of way that the industry structure for AI will work will have incredible macroeconomic implications, right? If only for diversity—for equality across countries, right? We might end up in a world where there's just one country or a pair of countries that dominate AI versus a world where the entire world is involved in the AI supply chain and plugging in valuable pieces, and those are two very different worlds.

Andrey: Yeah. So, you're speaking my book, Seth. Being an industrial organization economist, you know, we constantly have this belief that macroeconomists, by thinking so big-picture, are missing the important details about specific industries that are actually important for the macroeconomy.

Seth: I mean—not every specific industry; there's one or two specific industries that I would pay attention to.

Andrey: Have you heard of the cereal industry, Seth?

Seth: The cereal industry?

Andrey: It's important how mushy the cereal is.

Seth: Well, actually, believe it or not, I do have a breakfast cereal industry take that we will get to before the end of this podcast. So, viewers [and] listeners at home, you gotta stay to the end for the breakfast cereal AI economics take.

Andrey: Yeah. And listeners at home, the reason that I'm mentioning cereal is it's of course the favorite. It's the fruit fly of industrial organization for estimating demand specifically. So—a lot of papers have been written about estimating serial demand and other such things

Seth: Ah—I thought it was cars. I guess cars and cereal are the two things.

Andrey: Cars and cereal are the classic go-tos.

Introducing the paper

Seth: Amazing. So, what [REDACTED] wrote the paper we're reading today, Andrey?

Andrey: Well, you know—it was me, dear reader—I wrote the paper.

Seth: So we know who's responsible.

Andrey: All mistakes are my fault, but I should also mention that I wrote it in a week and it's all very much in progress. And so I hope to learn from this conversation, as we—let's say my priors are diffuse enough so that I can still update

Seth: Oh dude, I want you to have a solid prior so we can get at it. But I will say I was very, very inspired by this project, Andrey. I also want to follow in your footsteps. Well, maybe we'll talk about that at the end of the podcast as well. But maybe you can just tell us the title of your paper. Andrey,

Andrey: The title of the paper is Demand for LLMs, and now you're forcing me to remember the title of the—

Seth: If you were an AI, you would remember the title of the paper, maybe.

Andrey: The title of the paper is Demand for LLMs: Descriptive Evidence on Substitution Market Expansion and Multi-Homing. So, I will state three claims, which I do make in the paper.

Seth: Ooh, ooh.

Andrey: And you can tell me your priors.

Seth: Prior on each one. Okay, so give me the abstract; claim number one.

Andrey: So the point number one is that when a new good model gets released, it gets adopted very quickly. Within a few weeks, it achieves kind of a baseline level of adoption. So I think that's fact number one. And that's very interesting because not all industries have such quick adoption cycles.

Seth: Right? It looks more like the movie or the media industry, where you have a release and then boom, everybody flocks to it. That's the sense that I got before reading this paper. So I would put my probability on a new-hot new model coming out; everybody starts trying it—I mean, a lot of these websites just push you towards the new model anyway.

I know we're going to be looking at a very specific context, but if we're just thinking overall. Man, 99% when a new hot new model comes out, people try it.

Andrey: So I'll push back on that. It's the claim that it's not about trying it, like these models achieve an equilibrium level of market penetration. It's not—

Seth: How long? How long is—how long is just trying it? Weeks, months.

Andrey: How long are—sorry, can you repeat that question?

Seth: So you're pushing back on the idea that this is, quote unquote, “just trying the new release.” Right. But what is the timeline you're looking over?

Andrey: It's certainly a few months, but it doesn't take a long time to just try it. So, if it was just trying we'd see us blip over a week, and then it would go back down. And I don't—

Seth: If they were highly horizontally differentiated, but if they were just very slightly horizontally differentiated, you might need a long time to figure it out.

Andrey: You might—that's fair. Okay, so the second claim is: the different models have very different patterns of either substituting away from existing models or expanding the market. And I think two models that really highlight that are Claude 3.7 Sonnet, which primarily cannibalizes from Claude 3.5 Sonnet.

Seth: New Coke,

Andrey: Yes, and it's—well, New Coke failed in this regard.

Seth: Diet Coke,

Andrey: Yeah. And then another model is Google's Gemini 2.0 Flash, which really expanded the market on this platform. A lot of people started using it a lot and didn't seem to have noticeable effects on other model usage.

Seth: Right?

Andrey: So this is kind of showing that kind of models are competing in this interesting space.

Seth: My gosh. Andrey, do you want me to evaluate the claim that you made, or are you now just vaguely appealing to competition? Which of the two do you want me to put a prior on?

Andrey: No no no. Go for it. Yeah.

Seth: All right, so the first one is: do I think that if I look at, you know, a website with a hundred different models, some of them will steal from the same company and some of them will lead to new customers?

Right? I mean with a—I, I'm a little bit… Suppose we asked this question about products and you said, “Professor Benzel, will my product steal from other demands, or will it lead to new customers?” I guess at a certain level, it doesn't even make sense, right? There's a general equilibrium problem here where you always have to draw from something else.

I know we're drawing from other AIs, which would mean that there would have to be some kind of substitution. So I mean, yes, I believe sometimes there's going to be substitution, and yes, I believe sometimes, for reasons that are not necessarily directly connected to the AI model, the rollout of a new model might bring new people into the market.

Right. So I guess I agree. Like at the empirical level, I would say 95% certain that models differ in whether they steal from other models or bring in new people. If you're telling me now there's like a subtler claim here, which is that the fact that some models bring in new people is suggestive of horizontal differentiation and is further evidence for strong horizontal differentiation.

And I'm a little bit, I don't know, I'll put a probability on that, but that's, that seems to be going a little bit beyond the scope of the description.

Andrey: Well, we can discuss that in the discussion session. And I think the final part that I make a claim about is that apps, and the users of apps as well, to multi-home across models. So it's not that people are using just one model. It's not like app developers are using just one model for each application. And that's kind of once again pointing to the fact that there isn't just kind of one superior model even within a given model class.

And, Seth, go for it

Seth: Andrey, you did the thing again. You did the thing again where you said, "Here, Seth, do you want to evaluate this empirical finding?" Or do you want me to now say, “This tells us something about the future of competition in AI'?"

Andrey: Yes, yes, yes. All right, go for it.

Seth: The empirical claim, right? Is—give me the narrow claim? One more time? Give it to me.

Andrey: The apps are multihoming.

Seth: The people multi-home. Okay. The narrow claim is we've got these apps; maybe we'll give the user, the listeners, a little bit of context of what a sample app would be.

Andrey: Yeah, so I think about two types of apps here. One is a coding app, so Klein and RU coder are two quite popular coding apps. And we see that users of those apps are multi-homing. And then—those apps are multi-homing; we don't know as much about the users—and then we have kind of various chat-persona apps. And then we have some kind of utility apps

Seth: Yeah. We'll call them, like—let's call that second group role-play apps.

Andrey: Yeah, yeah. We have kind of like PDF extractor and apps like that, that are also on the—

Seth: Very tool-ly. Okay, cool. Alright, so we've got all these apps out, and now you're going to tell me, Professor Benzel, "I think you would be surprised to find out that RU coder, for example, has both the Claude model underpowering it and an OpenAI model powering it." And that one is probably the thing I'm most surprised by.

Right? I definitely would not be surprised at all to know that RU coder can send its cloud tokens to one data center versus another data center; that makes perfect sense. But the fact that you would sustainably have many different contemporaneous models on the same platform feels like a stage in a process rather than where we're going to end up.

What do I mean by that? So why would you want to keep an old legacy model inside of your RU coder? So I've got—I'm, or Silly Tavern, is one that I like. So Silly Tavern is just, you can do role play and talk to characters and pretend you're going on adventures. Right?

It seems like that Claude 3.7 should just be better than 3.5 at that, right? I really don't—my intuition is that they're not strongly horizontally differentiated. Why would you keep both? It would be for legacy reasons, for backward compatibility. Maybe there's a specific interaction or scenario that you had that you had working in the old version of the app, and you want to make sure that that's still around for new users.

So, how would I think about this? I would think about if you want to say that this is like evidence of multi-homing. This multi-homing evidence is evidence of competition because the same app wants to use multiple versions. I kind of disagree, right? The way I think about it is maybe more like, you know, you're a car, and you can either use the old muffler or the new muffler, and some people have upgraded to the new muffler, but some people are still using the old muffler, and so that car has two different kinds of mufflers.

Andrey: Yeah, we can discuss that, you know, that claim as well. I guess, do you want me to address what I think?

Seth: Well, give me a taste, and then let's go to the evidence. Give me a taste.

Andrey: The multi-homing is not happening on an old and a new version of a model.

It's happening on, let's say, 3.7 and Gemini 2.5, which are both relatively new models. The other thing I'd say is that you read Reddit; there are some users that still like 3.5 better than 3.7.

Seth: On the internet, they will prefer one plain white cotton T-shirt to another plain white cotton T-shirt entry.

Andrey: Who are you to question the preferences? The consumer.

Seth: Right? But I guess, all right, so this is my last comment on the priors, and then we'll get into the evidence, which is. This sort of speculation about what people will actually want in the long run is the bridge that gets us from this cross-sectional evidence about 20 April, 2025, to what the world's going to look like in 2027 and 2028. So that's why I'm pushing back a little bit.

Andrey: Yeah, I don't want to make claims that are too great about 2027 based on this cross section. Yes,

Seth: you know, GDP girl's gonna be at 30%

Andrey: That's true.

Seth: And all of you in labor will be automated.

Andrey: There is going to be a lot of market expansion. I hear.

Seth: Oh, babe, listen to our Epic AI episode. We'll post that before this one so you can see what we're laughing at.

Andrey: All right.

Seth: So tell me, Andrey, I can think of no one better suited to walk us through the evidence of this paper than Professor Fradkin of Boston University.

Andrey: Look, it's very simple paper. It's essentially a few graphs, and the graphs are event studies, where we see what happens to a selected set of models around the time of the release of one of the new models. So for the release of Claude 3.7, we see a very obvious drop in the usage of 3.5. You know, if you ballpark it, it's about 80% cannibalization. And the adoption happens within a few weeks, so it's fairly fast. We also look at Flash 2.0. We see very fast adoption, and in terms of tokens used, Flash 2.0 is the biggest model very quickly. And then, Gemini Pro is another model that that gets released in this time period. And it also sees a very fast adoption curve that doesn't seem to cannibalize other models at this time period. So that's kind of the evidence on cannibalization and market expansion and then the evidence on multi-homing. So there, there's some intricacies with the scraping of the data here. So, actually—let's take a step back. Where does this data come from?

Seth: What is Open Router?

Andrey: We haven't discussed what Open Router is. All right. Look, one of the challenges with studying these issues is a lot of the data sits in these fortresses of data where you cannot extract anything from,

Seth: And we're trying for you listeners; we're banging at that gate. We're banging at that gate every day trying to get in for you.

Andrey: Yes. Yes. So people who are using OpenAI know through the chat app, through the direct open API calls, we're not going to get a lot of visibility into that data. We might get some auxiliary data from credit card providers, payment processors, and the like, but it's hard to know how usage is changing and how specific model usage is changing particularly. One thing that exists is this service called Open Router, and there are other companies that are similar to it. And it's built for, I'd say, a sophisticated user who might be like a software developer who knows that, Hey, you know, I want to use a mix of models, or I might want to change my code to use a different model as—

Seth: Andrey, what's the S word that I'm thinking of here?

Andrey: Substitution; What?

Seth: Selection, you're so this. We're looking under the light of the cult plate, not under the light of the people who want to multi-home.

Andrey: Yes. 100%. But I will say—we're looking—let me just explain what Open Router is, and then we'll talk about selection and whether we care about that or not.

Seth: Oops.

Andrey: Okay. So, so it's a very handy service that allows you to call many different types of models. It also allows you to set rules too. Or like which model to use as a function of things that you might not be thinking about if you're just a chat user, like latency, throughput, uptime, specific pricing, and how it affects prompt tokens versus reasoning tokens versus completion tokens. So it's just a really useful service for this, for the app developer.

Seth: I mean, can I—just to interrupt for a split second here, right? Honestly, I feel like you gave me more evidence for horizontal differentiation in this market just by listing those four different features than you did with almost anything else, right? Because all right, I could see why you would need to balance between latency, price, throughput, quality, et cetera, et cetera.

Andrey: Yeah. So, and there is actually an interesting feature of this market that many do not know: there are multiple companies that serve specific models. So this is obviously true with open-source models, where anyone can serve them. So we have a lot of servers of your Llamas and your Deepseeks. But it's also true of the closed-source models.

For example, Microsoft might serve an OpenAI model, and OpenAI might serve the OpenAI model, and there might be differences in how well they're serving these models.

Seth: Does that mean that Microsoft has to know the model weights, or are theyhidden in some way from them?

Andrey: That's above my pay grade. I—

Seth: We will find out for you.

Andrey: I mean, Microsoft owns a lot of OpenAI, so they have some access.

Seth: Okay.

Andrey: Yeah. So, that's kind of an interesting feature of—

Seth: Mm-hmm.

Andrey: Anyway. One thing that this company does is they publish a lot of data about model usage and how the model usage is changing over time, and also about how specific apps use different models.

In particular for each model, they put the top 20 apps using that model and their usage numbers. So you piece these together, and you can get some pretty good information about popular apps and what models they're using and how much they're using.

Seth: Mm-hmm.

Andrey: And even over time, if you're scraping it continuously—

Seth: Do we know if this is for the apps that list themselves on Open Router? Is this the universe of tokens going through those apps? Do we know that?

Andrey: I think it's a universe of tokens going through those apps, but not all apps are—

Seth: Obviously? Yeah.

Andrey: publicly disclosing it. Even if they are using Open Router.

Seth: Well, it's a fascinating data set, so it's going to show us the price of tokens. It's going to show us which apps are using which tokens, and we're going to get dynamics on that over time. So it seems like a perfect data set. Andrey, your next big contribution is just noticing the data set.

Andrey: It's, you know, to be clear, the ML community knows about this data set as well. You know, in this question of how do we evaluate which models are good and which are not, you know, what we all love is revealed preference.

Seth: Oh, ooh.

Andrey: Use? And an open router has one such ranking, right? That's publicly available. It seems pretty hard to game it, although we can talk about ways one could try to game it. and, that should tell us something about which, which model is better, the very least, which model is on the Pareto frontier? Um. And so has the machine learning community; the AI community has been noticing this. So yeah.

Seth: And then they told you, so then your contribution was the translation to economics.

Andrey: I don't know who told me. The other thing I should say is that now certain companies are releasing stealth models on open router as a way to test them

Seth: Oh,

Andrey: That's also an interesting dynamic to explore. In particular, OpenAI has stealth released some models through there.

Seth: And these would be so if I was running Silly Tavern; it would become apparent to me that there's a GPT-4o version too, and I could play around with it as an option.

Andrey: And there's a new model called Optimus Alpha

Seth: Oh God, did let Elon Musk name this one? Oh my God. Somebody took too much testosterone this morning.

Andrey: Yeah. So, all right. That model gets released for a few weeks. People play around with it, and then it's the new OpenAI model.

Seth: Got it, got it. And then, but but theoretically, normal app users of Silly Tavern might be interacting with this model for a little bit before the official release is therefore

Andrey: Yeah.

Seth: Got it. Okay. Cool.

Andrey: Yeah, so what? What questions do you have, Seth?

Seth: What questions do I have? Andrey, it occurs to me this population of LLM users might not be representative of the model of the market as a whole. How do you respond to that limitation?

Andrey: So, I acknowledge it. I think that's—let me kind of push a little bit. So there are different populations of, what shall we say, heavy LLM users that we can think about. One type of user is your basic consumer, and that type might have a ChatGPT subscription or might even use, you know, the free version or Claude, even though really most of the action is in ChatGPT; we're not talking about that. I think that's very clear. Then, it's a consumer product. We know consumers suffer from very large default effects.

Seth: Right?

Andrey: They're not going to be switching very actively in aggregate. So I don't think this paper is about that at all. The second type of use case that we know a lot about, or we're aware that there's a big use case for, is in programming. Right?

Seth: Mm-hmm.

Andrey: And here I think this is a bit of a more representative sample in a lot of ways. Why, Kline and RU code are are serious programming apps.

Seth: Even though they have silly names.

Andrey: Yes, 100%, and they have features that are essentially at parity with features of VS Code, the programming, the copilot, and VS Code and Cursor, even though, as far as I'm aware, Cursor and Copilot use their own software to route the model calls.

You can also model, you know; you can also do the same things in those apps. So I'd say the coverage I. And the user bases of these apps are quite similar; you might say client and Recode users are a little more sophisticated, but I actually don't think it's that big of a

Seth: They're just a little weirder.

Andrey: They're a little weirder.

Seth: So you think this is very representative of the market for AI tokens? For coding?

Andrey: yes, with, with exception, with a—

Seth: Mm-hmm.

Andrey: The exception is that some companies place severe limitations on the types of models their employees can use. So imagine you're working at Google. I imagine if you're working at Google,

Seth: Gotta use it; you gotta eat your own dog food.

Andrey: You cannot use O3for programming, I assume.

Seth: You cannot generate images of German Nazis. They have to be all-right. That's a callback joke, guys. All right?

Andrey: So then there are these other apps, and there, you know, it's hard, it's hard, you know, to say look, I, if you're, if you're an app developer and you have a single-use app, like a PDF text extractor or something like that, I imagine that you are actively, considering different models, especially trying to optimize your costs

Seth: Mm-hmm.

Andrey: And you may or may not use an open router. I'm not sure; certainly, there might be some selection, and if some apps are less, if there are developers who are less sensitive to these issues, they might not feel the need to use open router.

Seth: But for freelance coding, we think this is representative. All right. Now talk about these other settings, like the tools and the role-playing.

Andrey: Talking about this example, let's say you have a service where you send it a PDF, and it gives you back the structured text.

Seth: Mm-hmm. Mm-hmm.

Andrey: Which is a type of app that you can find on OpenRouter. I doubt that whoever's writing these types of apps is very different whether they use open route or not. I imagine they're considering many models.

Seth: Right. Well, I mean, I guess we're in; we're kind of like in the talk-about-it section, but like you could see a lot of this stuff getting backward built into the platform, right? There's this story, you know, about iPhones. When you started off with an iPhone, there was like a light bulb app that you had to install to get the light to go, but then they built it into a feature of it, right? So, I mean, in the long run is there even a place for something like Open Router, or are these all features that are going to be built right into OpenAI or built right into Anthropic?

Andrey: I guess the feature of being able to use the other models is a feature. I doubt that they'll build into it, but you know, who knows, right?

Seth: Right, but they might give you different versions. There would be the within OpenAI version and then the within Claude version, and they could give you a selection of models.

Andrey: Sure, sure. So if you're like, and I think a lot of big companies do this, if they sign an enterprise contract with OpenAI or Google or Anthropic, they're going to use their models. They might even have forward-deployed engineers that kind of show them how to use the model in the best possible way, how to fine-tune it, and so on.

So I think there's a lot of, if something, if an application requires really close cooperation between the foundation model provider and the application layer, I think we'll see that essentially the different competitors are splitting off into cooperating with different model providers.

Seth: Right. So you think that is one possible future, which is that we end up with much more fragmentation than open router. So there would be, in that universe, multi-homing across models, but not multi-homing across companies.

Andrey: Yeah. I think multi-homing across models versus multi-homing across providers—yeah, we should be kind of clearer about that. And I think the evidence that I have is at least not—it's not just multi-helping within, you know, within OpenAI or within Llama or—

Seth: Ooh. Ooh. We'll have to see about that. All right. Okay. Alright. Other questions I have about this are, you know, not all tokens are created equal, either. I mean, how large a range in prices are people paying for these tokens? Like, what I know is you have a little table of a maximum and minimum, but give the audience a sense of how expensive intelligence can get and how cheap it can get.

Andrey: How expensive and how cheap can it get? so it can be close to free, especially for pretty small models. And it can get pretty expensive. So, there's an output price of 18 per million tokens that exists on this platform. At the time I was looking at it, for example.

Seth: It's still cheaper than my ghostwriter.

Andrey: Yeah, I mean, a million tokens is not nothing for sure. And then, there are differences in input prices and output prices. And there's also something that I haven't captured very well in this data, which is there might be discounts for something called NGS. Things get more complicated the more I look at it in detail.

Seth: Right. And the question is, do these kinds of details suggest concentration, or do the details suggest disillusionment and horizontal differentiation?

Andrey: Yeah.

Seth: Hmm.

Andrey: let's talk a little bit about just some very basic economics of

Seth: What the fuck is competition? Why do we want it?

Andrey: Yeah. So I think first let's first think about the utility, the consumer app developer utility part of this, right? Let's imagine that they have some utility for the different models, but they also have to, you know, pay a price for it. So, the way we think about it is, how much are people willing to pay for the better model? And if we think that things are pretty vertically differentiated, everyone will want to pay more for the same types of models. If we think that things are horizontally differentiated, then different developers will want to pay more for different types of models. And then there's also this question about the scaling thing. Like, yeah, maybe there's a model that's a little bit better than the other model, but it's a lot more expensive, and people are not willing to pay for that. So that might be something going on.

Seth: Hmm.

Andrey: Prices, obviously, are a very important variable to think about, and especially when you can think about them in the following way. Say you have a hard problem. One way to approach it is you throw it to the best model. Another way to approach it is to call a slightly worse model 10 times and then pick the best answer, right? So there's some implicit kind of substitutability that might be present in this.

Seth: But that. Oh man. So now that's so interesting because the story you just told is not a story about horizontal differentiation. Right.

Andrey: yes,

Seth: But it is a reason why you might want lots of different vertically differentiated models.

Andrey: Yes. Yeah.

Seth: Ah huh. So maybe we don't have direct evidence on horizontal differentiation here.

Andrey: For what it's worth. I'm not sure how often these, this pattern, are being used, but it's

Seth: Okay.

Andrey: It's certainly possible. Yeah. And then there's another kind of thing to mention, which is this famous Jevons paradox, which is a paradox.

Seth: I mean, no. Paradox is really a paradox according to my book, Slight of Mind, about why paradoxes are dumb and you should just know all the right answers.

Andrey: Yes. Alright. So, let's say we have an efficiency improvement in our model serving, and we kind of lower our prices by a bit. The response to that might be so large that the total number of tokens used might go up.

Seth: Right?

Andrey: Essentially, the dynamic at hand or the total revenue can go up.

Seth: And so, I mean, it seems like that's happening constantly in this data, which is where we're releasing better and better models and demand just goes up.

Andrey: Yeah. Yeah,

Seth: Which is which provides another challenge for thinking about substitutability because we don't have individual-level data. This is not a static market.

People are entering this market all the time. You gotta be; I mean, the figures you make are quite compelling, like stuff is happening the instant these models are released. But it's also the case that, you know, compositionally, who's in this data is changing and pretty fluid.

Andrey: Yeah. Yeah. it's something I do hope to have more to say about, as I've been scraping at the time, because at least within an app, you might say that the

Seth: It's homogeneous within an app. Yeah. Or maybe you loop together all the coding apps and all the, you know, silly taverns. Okay, cool. Alright. I mean, how much are you in, and how much do you feel like you have to make a claim about horizontal differentiation here?

Andrey: Look, it's hard for me to see multihoming and no and think that there is no horizontal differentiation here.

Seth: Other than price, quantity, differentiation, or price quality,

Andrey: But there, no, no. Sure. But I guess, I guess a point that, you know, you can see in, in, in these figures is that you have, these are pretty similarly priced models in many ways that are being multi-homed.

Seth: The latency is a little bit different. Maybe I'm going to switch back and forth based on latency. There are a lot of different little things here, right?

Andrey: Sure, sure. That's fair. Without having the individual usage data, it's really hard for me to make these finely green claims. I certainly have begged for this data from the CEO of OpenRouter, but so far no cigar.

Seth: Okay, let me push. Let's talk about that a little bit more, right? Which is, if the multi-homing is driven by fluctuations in latency, let's say, right? Like, I don't have strong preferences between Claude and ChatGPT; I just want to call the one that's lower latency. You can definitely get multi-homing there without it being driven by any difference amongst the models.

Andrey: Sure. I guess I think this is very empirically testable. I haven't—the latency is at a five-second level, and just see how much it changes over time.

Seth: There we go.

Andrey: Yes.

Seth: Ooh, ooh. I've given you some more homework, it sounds like.

Andrey: So, I guess if we think that the latency is highly variable across time or the throughput is highly variable over time, then we might see that sort of pattern. If we don't see it being very highly variable over time, then maybe that's less—maybe that's some evidence that it's not quite what's driving it, but yeah.

Seth: Let me tell you what my prior is, so maybe this is like the key part here, right? I have this really strong prior that I did not have; I was not born with it, but I have been trained by talking to AI experts

Andrey: Mm-hmm.

Seth: There’s no such thing as the AI that's good at military stuff versus the AI that's good at writing humanities papers.

That it's all intelligence—you get more of it or less of it. Sure. At the margin there's fine-tuning, there's vibes, but with the right sort of prompt and, you know, with a sufficiently unlocked model, you should be able to; it should be just pure vertical differentiation. That's kind of it; when I've been in rooms with technologists, that's the claim they make.

Now, maybe that's because they're at OpenAI and they're at Anthropic, and it's their incentive for this to be a universe where there's only two big boys. But serious people I've talked to have suggested there isn't such a thing as significant LLM horizontal differentiation.

Andrey: Yeah. I don't believe that. Let's see what they—let's see what they actually do.

Seth: Mm-hmm.

Andrey: OpenAI is constantly updating its default model in ChatGPT. And sometimes they're optimized for one metric, and then they realize that they face a trade-off. So, for example, if your ChatGPT is a little too nice to you, that might lead you to use ChatGPT more, but it might feel ethically dubious for ChatGPT to be encouraging your addiction, given that you totally deserve to be addicted to your phone. So, there's clearly a Pareto frontier of different things that these models can be made to do. Right? So do I. So and so, a lot of experimentation by the companies is the form. is, how do we play on this pato frontier? The existence of Pato Frontier suggests that there isn't just one dimension on which things differ.

Seth: Right. But I guess where I come at this from is, okay, imagine there's like a continuum of steps of delivering the token to the consumer, right? The first step is a $500 billion pre-training run. We, you know, make the giant pre-trained model. The second step is we're going to fine-tune it. We do the RLHF and give my model its particular personality, and it knows it's not allowed to work for terrorists or whatever.

And then there's the third step, which is we're now going to plug that fine-tuned model into an app, and it's going to be deployed in something functional that a consumer can interact with. I guess the way I see it is like as we move down that continuum, this becomes more and more horizontally differentiated, and at the beginning it seems really not horizontally differentiated, and by the end it really is very, you know, you don't want the silly tavern AI, you know, helping you convert PDFs.

Right. So I guess when I hear LLMs are horizontally differentiated, I'm thinking about that pre-training step.

Andrey: Mm-hmm.

Seth: Maybe you want to make a claim about how the usage of AI in apps is horizontally differentiated, which is at the far other end.

Andrey: Sure. Yeah. I, I think that's true. We don't, you know, and you know, we've talked about unhobbling on the show before, and I certainly believe that lots of these models have capabilities that we haven't figured out how to get out of them. Right. They know so

Seth: Right. I've tried really hard to make OpenAI do some of those things, and it's not—it's not as nice as Grok when you ask him to, or

Andrey: Yeah. So, so I think that's right, right? How the application and how these models are used in the application layer can be differentiated even if we think that at the foundational level it's just a ball of clay and some of these balls are bigger clay balls than other balls.

Seth: Oh, right. And when you have smaller clay balls, you can't build the Mona Lisa of play balls. Right. So it's like a capacity thing. Yeah, I mean, it just brings us back to there being a vertical aspect and a horizontal aspect, and the question is like, in the market competition for AIs, where do those two come in? Right? Because in terms of app deployment, you wouldn't expect vertical. I mean, everyone's just going to use the best; they're going to use bottles that are on the Pareto frontier. So you'd expect the horizon, the vertical differentiation, to be less apparent in that last stage. Right?

Andrey: Yeah. I mean it; I do it. It seems to me that models like Gemini 2.5 Pro and 3.7 Sonnet are both on the frontier, but. Some people just like one, and some people like the other. And, and that, that is horizontal differentiation to me.

Seth: Right. And, and now, now you're referring to, like—

Andrey: It's like maybe there's this, like there's a cost difference, and there might be latency differences, and that's really what's driving, you know, the usage patterns.

Seth: Or maybe the prices are identical, and I'm Epsilon horizontally differentiated, and that's enough.

Andrey: Yeah.

Seth: I guess the last thing is that I think my instinct is that horizontal differentiation will become less important over time. Right. So if you think about these balls of clay getting bigger and bigger and bigger, right?

Sculpting them exactly the way you want is going to get easier and easier as you have more and more clay to discard. Do you buy that argument?

Andrey: I think we'll get better at sculpting things over time. I think that it's certainly true. Yeah, and I think that comes back to your question about whether we are going to have horizontal differentiation in the sculpting step. And then the question is, who's going to be sculpting it? Is it going to be app developers sculpting it? Is it still going to be the big labs that sculpt it in various specific ways? Yeah, that.

Seth: Right. I mean, it makes it like if we, if we're doing the sculpting at the app stage, right, there's just a lot more room for horizontal differentiation, right? Because there's a lot more players who are going to be involved, and, you know, that's, that's the domain where, yeah, it does make mean, you know, a dollar to a consumer, whether the interface is blue versus pink and like even stupid shit like that can support an industry, no offense to, you know, app developers out there.

Okay. So one question that is kind of like the implicit background question in this paper, in my opinion,

Andrey: Okay.

Seth: But it is a prior, which we did not put a probability on, but I just kind of want to ask you, can you come at this with having done this research? It doesn't—you don't have to do it in a prior way, which is like, do you think the market for AI will be, you know, relatively competitive or relatively concentrated in four or five years?

Because I mean, my reading of this paper was like, it's a shot for, it's going to be less concentrated and more competitive than you think.

Andrey: I think it depends a lot on the complementarity of other things.

Seth: There you go. There you go. Speaking of Catherine Tucker, we had her asking her about AI competition. She's like, "Well, you know, I'm Catherine Tucker." Catherine Tucker thing.

Andrey: That is not how she talks.

Seth: She does not talk like that. So I'm not going to try to do my Catherine Tucker voice. But like, her point was like, we know how to do antitrust. It has to do with networks of complementarities and substitution abilities. There's nothing special about AIs. Is that kind of your take?

Andrey: I don't think I'm going to make the claim that we know how to do antitrust of AI. That seems premature, to say the least. I will say that the concentration of the industry is very likely to be determined by complementary integration assets. So how important is it to have that Anthropic engineer sitting at, you know, SAP, the specific molded version of Claude, or a particular application or not? Is it something where. at SAP will just call Open Router, and it's just going to be good enough that way. And they don't have to do specific SaaS contracts with Anthropic or anything like that. and that's hard for me to answer right now. But you know, if I had, if I were a betting man, I would say that there'd be a handful of models that are pretty competitive with each other.

But I don't think there'll be like a thousand models that are competitive with each other.

Seth: Right. That frontier, there's just not, there's not enough room at the top, at the frontier. Just because these trading runs will be so, so expensive. I guess that's kind of—as I was reading this paper, in the back of my head, I'm thinking, you know, like, how many people are going to come up with $500 billion to pre-train their own models?

Right. It—it just seems like there's a maximum of how competitive this industry can get.

Andrey: But I guess so. I would say like five; five is often enough to get a very competitive dynamic. Why do we want competition? It's not just because we want a bunch of competitors, for competitors' sake. We actually want there to be the correct incentives to innovate and then to price fairly, right?

So those are kind of the two things we're trading off. And in industrial organization, there are some results that in certain cases where you want even less than five competitors for the incentives. So that still seems quite competitive, even if there is a lot of concentration.

Seth: Right. I—it's all maybe another way of thinking about this is, suppose we could wave a magic wand and either make AI more horizontally differentiated or make it less horizontally differentiated. Right. We could choose which world we're in.

Andrey: Mm-hmm.

Seth: A world where they're less horizontally differentiated is probably one with faster growth and, you know, fewer implementation costs and less friction. Right.

Andrey: Yeah, I'm not sure. It depends; it depends on how we think about, like, the specific innovation production function. Don't; it's not obvious to me that there's, like, one answer, right? Because you can imagine that in a horizontally differentiated world, more players are going to be able to try to innovate, and because there are more, there are going to be more rents. But if you think that it's all about just that huge run, that one big run,

Seth: Right,

Andrey: Maybe it's that you want it to be vertically differentiated and kind of a winner-take-all dynamic. But, one where the winner can change to from time to time.

Seth: Right. You want a comp, so then we're in a universe where it's competition for the market rather than competition in the market. And that brings its own set of antitrust concerns. Andrey, you know, believe it or not, I took a minute to look at the same data and ask questions right along these lines of, like, how concentrated is this market exactly?

Because reading your paper, it's a paper that's supposed to give me some hints about the competitiveness of the industry. The first thing people ask about an industry is, well, how concentrated is it? Right? So Andrey, what's your sense? Are these models more or less concentrated than a typical industry?

Andrey: Um.

Seth: Industry? And actually I want you to tell me, all right? So I've got three. I'll leave my test on the table here. I've got four HHI indices I'm looking at right now. I've got open wrap. This is for the week, the first week of May. we've got. The number of tokens is called at the AI company level, so it aggregates up to companies.

We got the number of tokens called at the AI app level, so that's like a silly tavern, et cetera, et cetera. Then we've got the number of tokens called at the model level, and then I would like you to compare these two to inequality in motor vehicles and breakfast cereals. So I want you to rank those five from most equal to least equal.

Andrey: Yeah, so I will push back on. You count already; you count like the Met Lamas as being Metas, right? Because Meta is not the one who's serving them. Right. But.

Seth: Ooh. Ooh. Well, I could do providers too. That would be a fourth way to split it.

Andrey: Yes. But generally, yeah. Look, it's more concentrated than these other industries.

Seth: It's pretty concentrated.

Andrey: I'd say more so than I, for I, for all of them, with the model-specific one. Even with that, I'd say it's probably more concentrated than the—

Seth: That one is actually pretty low. So the model, so just, I'll put some numbers out there. Just, ballpark, motor vehicles have an HHI of about 2,500; breakfast cereals are just below that.

Andrey: Mm-hmm.

Seth: The number of tokens at the company level has an HHI of 2960, so it's a little bit higher than those guys. But if we go to the app level, we're at 2160, so that's kind of more competitive than motor vehicles and breakfast cereals, which we think have a decent amount of competition.

And then the model level, so we're going to treat 3.5 and 3.7 differently. We're pretty equal. We're at the 1500 level, which is considered pretty, pretty competitive.

Andrey: competitive. Yeah.

Seth: All right. Does that change your progress, Andrey?

Andrey: Well, I guess I wouldn't have used those industries as a comparison set.

Right? Like, I think a lot of digital infrastructure types of industries have a lot more concentration. So you think about cloud computing or search or phones, right?

Seth: mm-hmm.

Andrey: I think so. Relative to those kinds of industries, it is less concentrated. But certainly compared to physical goods products, it's more, it seems, more concentrated, I guess. I assume that you didn't calculate that HHI per car. Right? So it's kind—

Seth: No, it was not. That was at the company level.

Andrey: Yeah. I mean—you know, disclosure, you know, this, this definitely has been on my to-do list. I just have not gotten around to it. But I don't.

Seth: All right,

Andrey: I don't think that, this changes my, my priors very much, if

Seth: Okay, well, I've got a second fact for you. Second stylized fact. All right, so now I want you to imagine, oh man, I don't know if we have time to start talking. We'll see the power law and probability distributions for the next episode. But let me give you four different things that might be more or less concentrated.

Right? Here's another four things to think about. The concentration of one is 2023 US CompStat companies. One is the open router, AI at the company level. The second is Hugging Face. You know, our hugging face is another website where people will post AI models. This is for free downloads, so these are like public models.

So I have downloads of Hugging Face AI models. And then finally I have all-time movie box office. So you tell me which of these is going to be the most concentrated: hugging-faced AI downloads, open router, AI tokens, 2023 US publicly traded companies, or movie box offices. All the time.

Andrey: This is by the open router one. That's by the model creator.

Seth: I believe that, yeah, at the company level.

Andrey: Okay. Um. I think Open Router is the most concentrated of these.

Seth: Correct. Second most

Andrey: hugging face?

Seth: hugging face, second most, third most

Andrey: I don't know how to think about CompStat HHI. That seems like how—what's the product market? Sorry.

Seth: the product. Oh, CompuStat. It's publicly traded corporations. So it's everything together.

Andrey: oh, you're just combining all the—?

Seth: Yeah, yeah, yeah.

Andrey: Just revenue by revenue.

Seth: No, it's market value. So, you know, implied market,

Andrey: Yeah, I think that'll be three. And then the movies are four.

Seth: Dude, you don't even need data. You got this down.

Andrey: How about those priors?

Seth: Who needs evidence? But okay. What, you see what I'm trying to get out here, Andrey? Right? Which is, you can give me evidence that people are willing to move back and forth, but if it's the most concentrated industry I can find, it seems pretty concentrated.

Andrey: you like a bunch of industries that are more concentrated.

Seth: Alright? Okay, so now we go. All right, so listen, this is going to be a special two-part episode of Justified Posteriors. In the next episode, Professor Benzel will bring his own evidence and analysis to bear on the data from Open Router, and you'll be the judge. Is AI competitive? Is it not competitive?

It's the future you're going to have to live with one way or the other. Andrey, are we ready to talk about our priors a little bit?

Seth: All right. What's yours? So tell us, you had three claims here. I guess you're a hundred percent convinced of all the claims. Again, you wrote them down.

Andrey: Look, my claims are empirical, right?

Seth: Right.

Andrey: no, I'm not saying that they're right, but I, you know, I think

Seth: They're descriptive.

Andrey: They're quite descriptive. Unless I made a scraping error or something like that, I think they're, you know, they are what they are, but the interpretation is obviously up for debate.

Seth: Mm-hmm. Do you want to take a shot at it? Do you want to give me a percentage chance that in two years—I don't know how to say this—let's say AI, the AI industry, will be more or less competitive than the average tech sub-industry? Is that a fair comparison?

Andrey: I don't know what an average tech sub-industry is.

Seth: I know or choose one search. Let's just search. How about searching? That's really unequal. Alright. Alright. So yeah, that's the question.

Andrey: It's going to be more competitive than search. I have no doubt

Seth: Okay. All right. Let's check that in a couple of years.

Andrey: And also more competitive than phone operating systems.

Seth: Yeah, we got two big boys there. That's fair. Okay.

Andrey: Is it going to be more concentrated two years from now than today? I think that's an interesting question.

Seth: You want to take a—is that 50/50 for you? Or, I think it's pretty; I put 90—ninety's too strong—85% of that is more concentrated in the future than now.

Andrey: I do, so it depends on whether we're measuring by revenue or by token.

Seth: Let's do tokens at the company level. Oh, I guess we should do revenue, right? Revenue's the more economical thing you can do with either one.

Andrey: the reason I was asking is, like, I still imagine there's still going to be a ton of use cases for small, cheap models and,

Seth: Yeah. So the most down. Yeah.

Andrey: A very competitive market, right? Like in the sense that it's, that's, people are going to roll up their, put in, in principle, roll up a very good, small model.

It's the big model that we're really worried about right in.

Seth: Right, right. So yeah, so it's like the value-weighted is the one where you'd be really worried about concentration, given that there might be a lot of small toy ones that people fuck around with. But I think—

Andrey: Talk, I don't. I'm not even talking about fucking around. There are so many—

Seth: Yeah.

Andrey: Like, you could have the model call; you would, right?

Seth: Mm-hmm.

Andrey: you know, every email you're writing in Gmail

Seth: Mm-hmm.

Andrey: For the line of code that you're going through, why not call a cheap model just as a first pass? That might even be the model used to determine whether you want a, you know, more fancy model or something like that.

Seth: Right, right. And you can imagine a universe in which, like those super low-level AI observations, intelligence calls aren't even captured in data because I might be running that locally on my own laptop, right? Yeah—So yeah, so maybe there's some sort of size cutoff above which this, like, becomes interesting and tractable.

Andrey: I mean, I can, yeah. I don't have strong priors on this, I have to say. I could see arguments either way. Maybe 60/40 towards becoming more concentrated in terms of revenue.

Seth: All right. Well, I'm going to try to get Andrey's answer up in the next half of this two-part episode on Concentration in Competition in the AI Industry: Evidence from Open Router. This time it's personal.

Andrey: All right.

Seth: All right. Like, share, and subscribe.

Andrey: Yeah. If you have better data, we're very—

Seth: Give it to us, please. Yo, we'll be your friend. We'll co-author you.

Andrey: Yeah. Just, you'll get such great exposure for your company on this podcast.

Seth: Mm-hmm. Right? We will. And we'll also use your AI to write copy if you have an AI model yourself.

What can we learn from AI exposure measures?

Andrey Fradkin — Mon, 28 Jul 2025 17:57:58 GMT

In a Justified Posteriors first, hosts Seth Benzell and Andrey Fradkin sit down with economist Daniel Rock, assistant professor at Wharton and AI2050 Schmidt Science Fellow, to unpack his groundbreaking research on generative AI, productivity, exposure scores, and the future of work. Through a wide-ranging and insightful conversation, the trio examines how exposure to AI reshapes job tasks and why the difference between exposure and automation matters deeply.

Links to the referenced papers, as well as a lightly edited transcript of our conversation, with timestamps are below:

Timestamps:

[00:08] – Meet Daniel Rock
[02:04] – Why AI? The MIT Catalyst Moment
[04:27] – Breaking Down “GPTs are GPTs”
[09:37] – How Exposed Are Our Jobs?
[14:49] – What This Research Changes
[16:41] – What Exposure Scores Can and Can’t Tell Us
[20:10] – How LLMs Are Already Being Used
[27:31] – Scissors, Wage Gaps & Task Polarization
[38:22] – Specialization, Modularity & the New Tech Workplace
[43:43] – The Productivity J-Curve
[53:11] – Policy, Risk & Regulation
[1:09:54] – Final Thoughts + Call to Action

Show Notes/Media Mentioned:

“GPTs are GPTs” – Rock et al.’s paper
- https://arxiv.org/abs/2303.10130
“The Future of Employment: How susceptible are jobs to computerization?” - Frey and Osborne (2013)
- https://www.oxfordmartin.ox.ac.uk/publications/the-future-of-employment
“AI exposure predicts unemployment risk: A new approach to technology-driven job loss”— Morgan Frank's paper
- https://academic.oup.com/pnasnexus/article/4/4/pgaf107/8104152
"Simple Macroeconomics of AI" – By Daron Acemoglu.
- https://economics.mit.edu/sites/default/files/2024-04/The%20Simple%20Macroeconomics%20of%20AI.pdf
“The Dynamo and the Computer” – Paul A. David
- https://www.almendron.com/tribuna/wp-content/uploads/2018/03/the-dynamo-and-the-computer-an-historical-perspective-on-the-modern-productivity-paradox.pdf
“Productivity J-Curve” – Erik Brynjolfsson and Chad Syverson
- https://www.nber.org/system/files/working_papers/w25148/w25148.pdf
“Generative AI for Economic Research: Use Cases and Implications for Economists”– Anton Korinek’s paper
- https://www.newyorkfed.org/medialibrary/media/research/conference/2023/FinTech/400pm_Korinek_Paper_LLMs_final.pdf
Kremer’s O-ring Theory
- https://fadep.org/wp-content/uploads/2024/03/D-63_THE_O-RING_THEORY.pdf
12 Monkeys (film) – Directed by Terry Gilliam
Generative AI for Economic Research - Anton Korinek.
- https://www.aeaweb.org/content/file?id=21904

Transcript:

Andrey: Welcome to the Justified Posteriors Podcast, the podcast that updates its beliefs about the economics of AI and technology. I'm Seth Benzell, exposed to and exposing myself to the AI since 2015, coming to you from Chapman University in sunny southern California.

Andrey: I'm Andrey Fradkin, riding the J curve of productivity into infinity, coming to you from Cambridge, Massachusetts. Today, we're delighted to have a friend from the show, Daniel Rock, as our inaugural interview guest.

Daniel: Hey, guys.

Andrey: Daniel is an assistant professor of operations, information, and decisions at the Wharton School, University of Pennsylvania, and is also an AI 2050 Schmidt Science Fellow.

So he is considered one of the bright young minds in the AI world. And it's a real pleasure to get to talk to him about his work and spicy takes, if you will.

Daniel: Well, it's a pleasure to get to be here. I'm a big fan of what you guys are doing. If I had my intro, I'd say I've been enthusiastic about getting machines to do linear algebra for about a decade.

Andrey: Alright, let's get started with some questions. I think before—

Seth: Firstly, how do you pronounce the acronym? O-I-D (Note, OID is the operations, information, and decisions group at Wharton).

Daniel: This is a big debate between the students and the faculty. We always say O-I-D, and the students say OID.

Seth: So our very own. OID boy. All right, you can ask the serious question.

Andrey: Before we get into any of the specific papers, I think one of the things that distinguishes Daniel from many other academics in our circle is that he took AI very seriously as a subject of inquiry for social sciences very early, before almost anyone else. So, what led you to that? Like, why were you so ahead of everyone else?

Daniel: I'm not sure. Well, it's all relative, I suppose, but there's the very far back answer, which we can talk about later as we talk about the kind of labor and AI. And then, there is the sort of Core Catalyst Day. I kind of remember it. so back at the M-I-T-I-D-E, where we've all spent time and gotten to know each other in 2013,

Seth: What is the M-I-T-I-D-E?

Daniel: The MIT Initiative on the Digital Economy, Erik Bryjnolffson’s research group. I was one of Erik's PhD students. My first year, we had a seminar speaker from the Computer Science and Artificial Intelligence Lab, CSAIL. John Leonard was talking about self-driving cars, and he came out there, and he said, “Look, Google's cheating. They're putting sensors in the road. We're building the real deal: cars that can drive themselves in all sorts of different circumstances. And let me be real with all of you. This is not going to be happening anytime soon. It will be decades.”

And there were other people who were knowledgeable about the subject saying, “No, it's coming in like 5 to 10 years.”

And at that point I thought to myself, “Well, if all these really brilliant people can disagree about what's going to happen, surely there's something cool here to try to understand.”

As you're going through econometrics classes, I wouldn't say econometrics is the same thing as AI. We could debate that, but there's enough of an overlap that I could kind of get my head around the optimization routines and things going on in the backend of the AI models and thought, “Well, this is a cool place to learn a lot and, at the same time, maybe say something that other people haven't dug into yet.”

Andrey: Yeah. Very cool. So, with that, I think maybe you can tell us a little bit about your paper GPTs, which is a paper that has had an enormous amount of attention over the years and I think has been quite influential.

Daniel: Yeah, we've been lucky in that sense.

Seth: In two years.

Andrey: that's not—I mean—some version of it was out earlier… No…. Or is it? Has it only really been two years?

Daniel: It has been the longest, , Andrey. If you and I weren't already sort of bald, , it might've been a time period for us to go bald. Yeah, we put it out in March of 2023. I had a little bit of early access to GPT-4. My co-authors can attest to the fact that I rather annoyingly tried to get GPT-4 to delete itself for the first week or two that I had it rather than doing the research we needed to. But yeah, it's only been about two and a half. Okay, so the paper, as I describe it, at least recently, has kind of got a Dickensian quality to it. There is a pessimistic component, there's an optimistic component, and there's a realistic component to it.

So I'll start with the pessimistic, or I'll— why don't I just start with what we do here first? So we go through O*Net's list of tasks., There are 20,000 tasks in O*NET, and for each one of those tasks, we ask a set of humans who are working with OpenAI; they kind of understand what large language models in general are capable of doing.

What would help you cut that time in half? So could you cut the time to do this task in half with a large language model with no drop in quality? And there are three answers. One answer is of course not; that's like flipping a burger or something. Maybe we get large language models imbued into robotics technologies at some point in the future, but it's not quite there yet.

Another answer is, of course, you can. This would be like writing an email or processing billing details or an invoice.

And then there's the middle one, which we call E2. So, E0 is no, E1 is yes, and E2 is yes, you could, but we're going to need to build some additional software and systems around it.

So there's a gain to be had there, but it's not like LLMs are the only component of the system. And the reason we pick other software is because there's a pretty deep literature on how software and information technologies generally require a lot of co-invention, a lot of additional processes, and tangible capital. It makes it difficult to deploy those technologies fruitfully.

And we figured, okay, by comparing that E1 category, the yes you can, with an LLM out-of-the-box, to the E2 category, how much do additional systems and innovation get us? We could say something about whether generative, pre-trained transformers, GPTs, are general-purpose technologies. They'll be pervasive, they improve over time, and they necessitate that kind of complimentary innovation. They change the direction of innovation.

If we can say yes to those three things, then we're in a situation where we get to the pessimistic version of the story. You just can't know what the long-term equilibrium is going to be across different markets as a result of these tools.

So the prognostications that, ‘Oh yes, AI is coming to annihilate all the jobs. That the Machine God is imminent—or at least the Economic Machine God is imminent. I think those are a bit premature if you look and say this is general-purpose technology because historically general-purpose technologies have been hard to predict at the outset.

So the optimistic side of things is that that impact potential is pervasive. There's a lot of benefit to be had in changing how people work. We use this exposure measure—I'm sure we'll get into this—but exposure is not automation. Exposure is potential for change, and if there's potential for fruitful change, we get more value in lots of different places in the economy.

That's a good story we found—and if the reviewer is listening to this, thank you very much. One of our reviewers suggested looking at science and innovation tasks and research and development tasks and seeing how those compare to other areas. We found high levels of exposure in those areas, which means there's potential to turbocharge growth, at least temporarily, hopefully longer term, in the economy.

There’s a temporarily, and an optimistic component on the realistic component. We compare the yes, you can do it temporarily, and better with an LLM here to the yes, you can, but you need more building, the set of tasks that get exposed if you build additional systems. If you were to snap your fingers and say, “Hey, we've got everything we need.”

That's much, much bigger than the stuff that's just exposed to LLMs on its own. So the realistic story is we have a lot of work to do as a society in the global economy to bring about the gains of these tools. And it'll probably take a few decades for it all to play out. As much as we think that the changes have been very quick, it has been a fast two years, or slow, depending on who you ask.

Seth: This has been great. Andrey and I are both bursting with questions. I'll let Andrey go first.

Andrey: I want just a quantification. Like, so what percentage of tasks are exposed according to the first definition? What percentage of tasks are according to the second definition, approximately?

Daniel: Yeah, if I recall correctly, about 14% of tasks, or 15% of tasks, (depending on if you're looking at the human ratings or the GPT-4 ones). GPT-4 and humans tend to agree, by the way. There's some noise there, but if you look at [the] GPT-4 ones, it's about 14% of tasks for E1, the level where it's just LLMs that can help. Now, if you snapped your fingers again and said, Now it's E2 and E1, that's about 46% of tasks. I might have my numbers slightly off there, but that's roughly what the numbers were.

Andrey: And did you calculate what share of occupations have 100% of their tasks?

Daniel: There were very few, if any, occupations that were a hundred percent exposed. I think data scientist was up there, and it depends on the measure, so we actually have three different combinations of these scores. The most conservative is saying it's just E1, and then that's it, and the least conservative is E1 and E2.

We score each task that has either one of those labels as one and E0 as zero. And then there's this kind of intermediate one that I like, but my co-authors don't like as much, where E1 gets a one and E2 gets a 0.5. So it depends on what you look at. Mathematicians were highly exposed. My co-author, Pamela, has gotten some angry emails from mathematicians saying, “No, that can't be.”

I will say I use it for building theory now. I use the language models for building theoretical models, and they do a pretty good job. They make some pretty terrible mistakes occasionally, so you do have to check their work, but to go from a verbal sketch of what you're trying to prove to some math that roughly shows what the setup should be, it makes it easier to be a reviewer instead of a doer, as they say.

Seth: Sure. All right. Okay. A couple questions from me. The first question is: are we talking literally when we are doing these E1 ratings? Are we talking literally about ChatGPT-4, or are we talking kind of generally about LLMs of approximately that quality? Or are we projecting forward to kind of near-future LLMs?

Daniel: Yeah. It was more the latter. We had a sense of where LLM tools were going to go. I think even looking at this set of tools we have now and GPT-4, they're very similar. There are expanded capabilities. It's kind of been a deepening of their capabilities, but the going of the somewhat foreseeable future, especially for my colleagues who had been and co-authors who had been in the weeds with this.

But that does bring up an important weakness of this approach, which is as soon as you see something really qualitatively different or new capabilities showing up, you have to update the rubrics and the method; you have to rerun stuff. I think arguably the reasoning model paradigm is getting to the point where you probably have to rerun things.

Andrey: Are you considering rerunning things? Is this like an ongoing endeavor or—

Daniel: I'm not sure I'm going to return to writing an academic paper. I feel like I've gone to the well one too many times already with this. But if someone else wants to do it, I'm happy to help them out with it. Eric, Mitchell, and I did something in roughly 2016 looking at supervised machine learning and shared some slightly different conclusions, but now that I've been through this twice, I'm not sure that I want to do it just yet.

Andrey: So this is a question that I wanted to kind of raise. 'Cause certainly you guys are not the first to do this sort of exercise, and you've done it before. Frey & Osborne have done it. I remember when I was thinking about these exercises; when I first saw them back in 2017-2018, I was like, “This is an accounting exercise. This is actually useful.” How do you determine in what sense this type of work—

Seth: To throw another critique of this whole research agenda out there. We talk about Frey and Osborne coming out with one of these a decade ago. You talk about your own SML experiences. I know Morgan Frank has a new paper at PNAS Nexus out that compares about 10 different people's different exposure measures.

Daniel: Mm-hmm. Which I'll do different things. Yeah,

Seth: And they're all too; they're all completely different. How should I think about the diversity of these indices?

Daniel: Well, there are different principle components underlying a lot of these different measures. Certainly SML and the GPT scores are very different. And Frey and Osborn—the way they constructed that effectively was—.

Seth: Basically.

Daniel: educated guess vibes with CS professors for a training set.

I think their goal is to measure which jobs, as a whole, could be computerized. Actually, let me answer Andre's question a little bit more directly. Like, when you look at these, what are they useful for? Let me start by saying what they're not useful for. because actually some folks have put words in their mouths on this.

Seth: Including Nobel laureates.

Daniel: No Nobel laureates that I know of, but there are some places and some folks who have who said things like, “If you're exposed, you're hosed.” And this is what the authors tend to value, I will say—

Seth: with the word hosed. You set them up for that.

Daniel: It's possible that that is the case, but I have not seen any data to conclude that that is the case.

So let me state clearly for the record things you do not want to predict with exposure scores. Things that exposure scores are not designed to do: economically meaningful outcomes like wages or employment are not things. I'm not trying to say exposure scores will create unemployment. I'm not saying it'll cause wage loss, and I view it as a risk measure. I'm a recovering finance guy. I think there's a risk that can be good. It can be bad. Like we don't really know. It just means there's an opportunity, technically speaking, to change the types of tasks that people are doing and how they do them. So exposed and hosed are possibly orthogonal ideas.

Nevertheless, I think it's worth tracking now. What else is it not useful for? Besides failing to predict labor market equilibrium. it's not useful for—

Seth: Breakfast?

Daniel: Can what make you breakfast?

Seth: You're—

Daniel: Scores?

Seth: Do you want to list all the things? It's not useful for, excuse me,

Daniel: Exhaustively, yes, we should. You can't eat the scores either. I wouldn't say it's especially useful for saying for sure that this is going to happen, right? Like, if a technical thing that could help someone do a role does not necessarily mean it's appropriate socially, legally, or politically.

There's a whole bunch of different places where using LLMs might be inappropriate. One example, a famous one, is Jeff Hinton, who predicted that radiology demand would drop. And I think we are seeing, say, an appropriate example of where a multimodal model would be helpful in radiology.

It could probably pick up a broken bone, but radiologists as data-enabled doctors have a lot of other components to their work, and they interpret difficult cases. If you're going to tell someone about a condition that they've gotten, it's challenging. That's not the sort of thing where you want an LLM just spitting out, “You have this wrong.” That would be terrible bedside manner.

So even if it's theoretically possible, that doesn't necessarily mean it's going to happen. So turning now to where are they useful then? One is for testing this hypothesis. Are we limited in what we can say? which is my favorite application of them. In the sense that we see pervasiveness and complementarity and necessitating exposure throughout the economy.

So we should dial back our confidence in terms of predictions of what will happen that I think were useful for answering a very specific hypothesis that we had. But then, underneath that—

Seth: So you were able to—the hypothesis is that they are GPTs of GPTs? They're going to affect everything.

Daniel: Yeah. So the only one of the three conditions that we punt on is whether they are GPTs that improve over time? Because that one was obvious. We do have some evidence, but we are mostly getting beyond that. I think about the first-order changes and where they're most likely to happen. I didn't know that this would be the case when we wrote the paper, but I think those measures that we built tended to predict where people would start adopting large language models, and there have been a few papers validating that empirically.

Seth: That makes perfect sense, right? So it's maybe not a good model of what's going to happen to your job, but it's a good model of where the OpenAI salesman should show up and knock on the door?

Daniel: Yeah, potentially. So you guys discussed this paper earlier on the podcast, but the Anthropic Economic Index, the areas where they thought people were or where they were showing people were using Claude, lined up reasonably well with the areas we thought GPTs and LLMs would show up.

Andrey: Except managerial tasks.

Daniel: Except managerial tasks. Those are happening; it's just not clear. I'm not sure what's going on in that dataset. In my work as a startup co-founder, I use all sorts of large language models for managerial tasks all the time. So we'll see what happens there.

Andrey: I used a large language model for managerial tasks earlier today, so I agree with you.

Daniel: Mm-hmm.

Seth: Right. Seems like these AIs are being used. If you look at the philanthropic index, it really does focus on people using it in these kinds of hobby contexts, which is one of our big takeaways from that episode. So I mean, people don't manage as a hobby, so if a lot of Claude usage is hobby usage, you wouldn't expect that. You would expect that to be underrepresented.

Daniel: You're saying that with the exception of the technical folks, software engineers, and data scientists, it's just like ripping with this stuff, right? Like, because that's not necessarily a hobby.

Andrey: Ripping with it and the cursor, I mean. Now we're getting—

Daniel: Sure. Yeah. API use, yeah. Yeah.

Seth: Right, that's the giant use case right now.

Daniel: Yeah, and that one's a great one. It's kind of ironic given our focus on software, but to some extent you can keep doing what you were doing, but just do it way better in software development with these tools. You don't actually have to transform the structure of software engineering too much to just get a very quick benefit, but I think there is a new mode of working and developing with AI-driven tools that has an analogy in that famous computer in the Dynamo paper. The paper mentioned electric power conversion; you think of it like the steam engine, right? For the listeners who aren't aware, this giant thing in the middle of the factory and all these pulley levers and belts come off of that thing, and it powers the whole factory. And then over the next few decades, they realize, ‘let's modularize that power.’ When we convert to electric power, the first thing to do with electric power is to do the same thing, but like, a little bit better.

Take a giant dynamo, stick it in the middle of the room, and we're off and running. But eventually they were like, “Well, what if we make that really small?” And then we have lots of little machines all powered by their own little engine. Sort of similar, and I'm seeing this with some large companies: you start with a really monolithic, large technology function in the middle of the company that kind of like powers off. Lots of subgroups build technology for them, and then something kind of magical happens with these AI models.

You can sit down with a subject matter expert, a product person, or a senior developer to make sure that these people don't hurt themselves as they're building something. And you create these like modular, , the Jeff-Bezos-two-pizza-team version of work where people have input into a process, and then rather than throwing that process over the wall to the dev team, you wait three weeks and see them come back with something that doesn't fit. You just develop together and watch the models go, and it really ups your cadence, but it opens up all sorts of best practice shortfalls that can happen.

Like, have you hardened for security properly? The devs know what questions to ask there. So going from a specification to a finished product can be way, way quicker. If you redesign how the work goes, it's kind of similar to that steam-power-to-electric thing.

Andrey: I guess maybe a natural place to go here with is there's kind of this distinction between the micro-level exposure of a task-level implication. So, should we be thinking about that? And certainly people have used your micro-level exposure metrics in macroeconomic models and so…

Seth: Tell us about what that experience was like.

Daniel: People use them in different ways. There are papers that you guys have discussed on the podcast before. If you look at the Simple Macroeconomics of AI paper by Daron Acemoglu, he uses our sort of experimental automation score. Which it is not. Could you use an LLM to improve your task output?

Here it's like, could you use an LLM to just straight up do this task without a person involved? It's a really small proportion of tasks in the economy; that's a five-point scale. So our fourth or fifth most intensive automation risk scores. I don't love those scores, to be honest, but they are in a pretty narrow area.

So it's not surprising that we find, or that we read in his paper, I should say, a seven-basis-point-a-year outcome. The OECD is a version where they use the exposure scores, and they get to something like 70 basis points of productivity growth per year. So it's all of one MLA's gains right there.

But per year, I think these are a public good, these scores in some sense, and people bring their models and their priors, too; they're trying to discipline what they believe will happen with the economy with these scores. And they're noisy. I wish there were something more useful for these people to deploy in their models.

But to the extent that we can be helpful, we're really happy that this thing is out there. I just caution folks against viewing exposure automation, which is a common failure mode, or even leaning on things like automation and augmentation as the choice that we have ahead of us at the macro level.

Like, and Andrey, to your point, the macro-level conclusions, yes. Labor markets are how we share the gains from economic activity primarily across society. And then, when you get down to a micro-level task and you're asking a worker or a manager or a worker combo. Are you upset if we automate this task or augment this task?

Either one. It's anything goes. It's about the labor market and the unit of work that's being purchased in the labor market. I could automate something I hate doing and be thrilled with it 'cause I could go spend my time doing other stuff. I could automate my whole job and make myself really sad. Well, maybe really sad, but I'd have to find another job.

I could augment someone and make them thrilled and pay them more, or I could augment them such that they take the jobs, they do the work of 10 different people, and then nine people get fired. So I think this augmentation automation, micro-question, really does boil down to just exposure and changing work.

And we can't say much more than that. And I don't think, even though automation and augmentation are like an elegant mathematical framing in these models, I don't think it's, I don't think it's something that we can lean on from a policy perspective at the micro-level. It's just like you're going to change what people do.

Seth: Yeah, I'm going to push back on the idea that it's an elegant micro idea, right? Because for exactly the reasons you—,

Daniel: Macro-idea, I should say. It's an elegant macro idea. I don't think it's an elegant micro-idea. Yeah.

Seth: Right. But even then, it's kind of it, let me put it this way. To me, when people want to distinguish between augmenting and automating technologies, they want to talk about them as somehow separate from the rest of the economy. But as you've been implying, the real reason you can't say a certain technology is automating or augmenting is because that production is embedded in an entire economy.

And that's going to tell you whether, as productivity goes up, you want more or less of that thing. The way I would put it is to use the metaphor of Marshall Scissors, right? So there's a story that's told of the famous economist Marshall from the University of Cambridge, who was the advisor of John Maynard Keynes. And somebody asked him one day whether it was supply or demand that was more important in setting the price for a certain good.

Seth: Marshall said it's like asking what blade of the scissor is doing the cutting, right?

Daniel: Mm-hmm.

Seth: You can't talk about one without talking about the other. If you want to know what the outcome is and what I see, your paper is one blade of the scissor, right? It's the one blade of the scissor that's coming in telling you this job can be changed, but you need to know everything else about the rest of the economy to understand how the job will be changed.

Daniel: That's right.

Seth: And we've, we've talked about examples. There are countless famous examples, from the ATMs to, I like this example of the cotton gin of jobs getting automated and then demand for that form of labor going up.

Daniel: Right. Yeah. Couldn't agree more. Yeah.

Seth: Now Dan, I do have a micro-take, and I'm interested in whether you buy this, take this prediction about what exposure scores will do to an occupation. This is a somewhat out-of-equilibrium take. This is a partial equilibrium dynamic take, and maybe it'll be smoothed out in the long run, but in the short run, my prediction is that in occupations that are more exposed, there will be more wage polarization at middle-tier firms for that job and less wage polarization at extremely good or extremely bad firms that use that job. Alright, so I've got a kind of a framework here. Are you ready? Can you see where I'm going with this, or are you ready for me to give the reason why?

Daniel: I have some hypotheses about how that could work, but I—yeah—don't leave me hanging here.

Seth: Right. Okay. So should I start with the general equilibrium first, or should I start with the micro level first? Let's work from the bottom. So imagine, you've got, a job that uses two tasks, right? Task one and task two. They can be gross compliments in production, but it's actually not important.

But you need them there; there can be gross compliments as long as they're not perfect substitutes, right? They can be gross substitutes. That's also fine. I'm a doctor. I need to spend so much time having bedside manner, so much time recognizing the x-ray. I know that's not a perfect example, right? Okay, imagine a technology comes out that allows you to automate one of the two tasks. Okay, well then obviously people who are worse than the technology at automating the automatable task automate it. And the people who are better than the technology at automating don't automate. I know this is already going to get a little bit off of the way that maybe you think about how things are, but grant me that for a second.

Okay, what happens? People who are bad at task one but good at task two see a big improvement. Whereas people who are good at task one and bad at task two see no improvement. Right? Whereas, it kind of depends on how good the thing is. If you're equally good at both. Kind of depends. Okay. All right, so that's the first step. So where would you get wage polarization from? Automation. You would tend to get it in jobs when people's skills are anti-correlated. Right, because as we just said, if you're good at one and bad at two, we automate one. It doesn't help you. But if you're bad at one and good at two and we automate one, it helps you a lot. So you would expect to see wage polarization, wage distribution, and expansion for jobs where people's skill levels are anti-correlated. Okay? So now you might say, Sure, Professor Benzell, that sounds cool, but why would we ever expect in certain settings for wages and skill levels to be anti-correlated?

Okay, and now I'm going to bring in the O-ring, right? So Kremmer has a general equilibrium theory of the economy: the productivity of a firm or whatever is somehow bounded by the kind of limited, the worst agent in the system, right? So this comes from the space shuttle Challenger explosion; the space shuttle explodes. We think it's because of this one faulty part, the faulty O-ring. Okay. What's the general equilibrium implication of this model?

It's basically that you should get people of different skill levels all concentrated at the same type of firm. So there should be super good firms that have all the high-skilled people, mediocre firms that have all the mediocre people, and bad firms that have all the bad people. How do you get a mediocre person? Most mediocre people are mediocre 'cause they're good at one thing and bad at another thing. So now we come back to my hypothesis—which is that exposure should lead, And in fact, I'd love to bring this to some experimental evidence, some kind of working with Kyle Myers, a great economist friend of the show at HBS, on this—can we predict the experimental outcomes if you introduce AI to a place, and it's exposed to some of the tasks? Do you get that polarization in productivity and wage, and when do you seem to just kind of boost everyone by the same amount?

Daniel: Okay. So some quick reactions there. So just to immediately hop from automation to exposure, we're like, —Folks, I guess I'm going to ask you a question that, funnily enough, I was asked by Joe St. Diglett as a grad student. I was lucky enough to get to sit next to him at a lunch. He was like, why do jobs exist?

Like, why are certain tasks bundled together? And honestly, I don't have a great answer other than to gesture sort of vaguely at coordination costs. but within the task, shifting that you're discussing, you've got this mediocrity or sort of middling productivity that comes from the fact that.

Some of the things they're good at, some of them they're not. It's still really hard to kind of blow apart the job and then reconstitute it with specialization. So I think where it's coming from is like, people are overall high productivity, and then there's a low productivity component, and then there's kind of this middle thing where you've got some CES aggregator that says, “This person is going to be slightly worse than the average of their components.”

Exposure might lift them in some cases and might not affect them in others. So I kind of buy that piece. To move it to the equilibrium framing, though, I think what'll probably happen in a lot of cases is like a mini Bamel cost disease across everything that we do. The areas where we're least productive are going to be the ones that absorb most of our time.

And in the beginning, there'll be a lot of confusion about that because LLMs will make it unclear what the least productive thing is now that you might be really bad at something. Right now, I know I'm really bad at writing, like spec docs for software. Well, now I have a process with Claude where I can write much better spec docs, and I'm not as terrible at it.

So, but, once you get out of this sort of equal, disequilibrium condition, you might end up in a situation that looks a lot like the one we have right now as things settle. But then, the job boundaries have changed. And there are new names for things. I'll give you a small example.

There's a new hot job in Silicon Valley called the Forward Deployed Engineer, where we've got some of these—

Seth: Hazard pay?

Daniel: This is a role at Helix. We've got a forward-deployed engineer looking for more Win Ma shout-outs. She just started.

Seth: Are they waiting for them to call in air support? What's going on?

Daniel: You send them to the customer's site, and they work with customers.

You need really strong interpersonal skills, but you also need engineering skills. That's like a new configuration of work.

Seth: Wasn't that called being a consultant?

Daniel: No, no. Uh,

Andrey: no, no.

Andrey: If they’re a consultant, then you wouldn't be able to pay them as a forward-deployed engineer. Seth, what do you mean? This has nothing to do with what McKinsey would ever do.

Daniel: I'm not sure that calling someone a consultant will—I'm not sure which end of that ends up being cheaper, but for the firm. But the critical thing here is that's a different mixture of work.

Daniel: Those are some initial reactions.

Andrey: I have reactions too. I think on one level, I'm always a little skeptical of intricate theories like this, when—

Seth: I just have two parts. It has two parts you have to give me.

Andrey: No, no, I mean more so that the like order question is even about income inequality, right? Like, it's hard to answer, and then you're trying to answer this even more sub-sub question. And I guess where I'll push back on is in terms of what the highest firms are, right?

Like, production could be an O-ring within a person, or production can be an O-ring across people, right?

Seth: It turns out that the prediction does not rely on whether ordering is within people as long as they're not, as long as the tasks aren't perfect. Substitutes what I just described goes through.

Andrey: But I guess what I would think is that if we have specialists in 10 different tasks at a high-end firm, and then one of those tasks gets automated. Surely, one of those people's jobs will get fully automated, and I know Daniel is not liking automation already. but, that person's

Daniel: I do believe it exists.

Andrey: That person's wage will go down. Right? Creating inequality.

Seth: Yeah. But I have a theory of one of your tasks being automated, not a theory of all of your tasks being automated.

Andrey: That's where my point is. I mean, it's an interesting question. High-end firms have a lot of specialization, maybe perhaps more specialization than lower-end firms. And so then the person is so specialized that if their specialty is very hard, then we might expect a bigger labor market effect for them.

Seth: You might imagine if tasks were organized differently at large firms, this theory would run into issues. Of course, there are admitted variable problems up the wazoo, but I'm intrigued by the idea of looking into whether people's skills in these tasks, which make up their task bundle, which is their job, and their skills in those subtasks are positively or negatively correlated. And I do think that that will tell you a lot about what happens when you automate part of the task or part of the job. So now bringing that to the dwere is complicated, but that's my insight.

Andrey: Saying one more thing, just how much do we expect new firm entry to be the key margin with all of this? Right? We know that organizations are very friction-filled, and adoption decisions even—

Seth: New organizations, new jobs, right? If you slice out half of the task from a job, in the long run it is probably a new job.

Andrey: Yeah, I think both of those. So then, in terms of thinking about existing firms, it's a little for me in general. Or, at least I expect, I'll be wrong; I expect a lot more entry and growth from new companies that are kind of taking advantage of this new production process from the ground up. That's kind of the lesson of the supply-side disruption theory.

Daniel: Yeah, I'd agree with that. I think one of the reasons it takes such a long time for the benefits of sufficiently transformative technologies to show up is that it usually takes a while for the firms that are deploying them well to become economically meaningful. And then they sort of set a standard.

Seth: Right? That's not the margin on your margin. The firms that figure out how to do it grow faster, which is another margin.

Daniel: And I think, agreeing with Andre, that a lot of them are new entrants. Then it's not like an incumbent will always figure out the answer, or do they have to a lot of the time? Where I would ask you a question then, Seth. Just on the idea that the bundled tasks have some spectrum from super negatively correlated to perfectly correlated individual task productivities.

Why do you think those tasks are bundled together? Because there's some coordination and cost benefit? Do you think there's probably some lower bound on how negatively correlated your productivity can be because, like, across these different tasks?

'Cause, if you really suck at half your job, you probably can't do that job. I think you probably need weak, positive correlation everywhere.

Seth: Ooh, man. I think for the sorting to happen. So let's take, we're going to take a thousand people who are all doctors, and I agree that you kind of want to think about the step before that, where before we get the thousand doctors, but I'm saying now that we have a thousand doctors good at task one, and some of them are going to be better at task two. And then you're going to get negative correlation across those abilities in the mediocre firms. Now, you're right; there might be some censoring. You can't be so bad at one of the tasks; you don't become a doctor, but I'm saying conditional on you have become one,

Daniel: Oh, okay. I could see that. Yeah. The thinking is like a Dr. House situation: everybody hates him, but he is really, really good at the diagnostic side of things. But like if he weren't, then no one would put up with that. He would've just been fired.

Seth: Right? He'd have a higher-paying job and be more productive if he was able to be nice for 10 minutes.

Daniel: He’d probably be an investment banker or something.

Andrey: There's a mirroring here too, like a general phenomenon in digitization, which is like the ability for specialization, for more niche content to do really well, right? So, if you’re only good at a task, and now that all the complementary tasks have been automated away, then you shouldn't be bound by your firm anymore.

Like, you should be able to essentially create your own small business or join the most productive firm as the specialist in that specific area because all your other characteristics don't really matter that much anymore. So Dr. House would be able to essentially, officially run a business, even though he is really bad at organizational things, because all that stuff comes out of the box.

Seth: I think that's why I talked about this theory as being kind of a short-term partial equilibrium theory 'cause in the long run you're reinventing businesses.

But, you said something really interesting, Dan. And maybe I will start to transition us now about the idea that it's going to take time for people to figure out how to use these GPTs, right? The general (that is, chatbots or LLMs), excuse me. What sort of macroeconomic implications does that have? I understand you've written a little bit on this topic.

Daniel: Yeah, right. Then, we call this the Eric and Chad Seavers, and I call this the productivity J-curve. I think the dynamic is when you see pretty much any kind of investment, there's an initial outlay period where things are expensive, and then there's a harvesting period later.

There's the famous Robert Solo quote: You see computers everywhere, except in the productivity statistics. People were already starting that. With AI, I've seen a number of news articles that say there's no ROI for this. I think the way you kind of square the circle here is, well, in the beginning of a new technology, when everyone realizes, Okay, we're going to take the plunge; you're actually going to invest in this.

You spend a lot of time kind of reconfiguring work, building new business processes, trying to figure out what new products to build, and collecting information—a whole bunch of really expensive stuff that's really hard to quantify. so it doesn't end up in GDP, to the extent that it could, but that's building up a capital asset.

So, output is going to be understated. In the meantime, while we have this, it's going to look like we're putting in more to get less out. Then later that intangible asset is actually there, but not measured, and now it's an input instead of an output. And when it starts to spit off money, then everyone's going to say, “Oh, hey, look at how productive we're being, because it looks like you're getting more as an output for less as input.” Really, it's just that thing paying off. So that tension between the growth rate of investment in this new type of capital and the growth rate of the capital stock that you're missing, that difference depending on its share and the overall economy can be meaningful. And if you do, we use the stock market to measure it because investors aren't dumb.

On average, they price these assets, or companies wouldn't invest in them, and under a roughly efficient markets hypothesis version of the world. But, if you're pricing those assets, then you can kind of back out roughly the magnitude of that adjustment you should be making to productivity growth.

So it's kind of a fun spin on growth accounting, which I know isn't the reason everybody gets out of bed in the morning—to go account for where the growth is. But—

Seth: Don't underestimate our audience, Dan.

Andrey: Look, I mean, big political debates hinge on the measured rate of GDP growth. So, it's important. How big of an effect did you find in that paper?

Daniel: Oh, I don't remember the exact numbers anymore. It's been a little while. I should look it up. But it's a lot. If I recall correctly, it might be something like 75 basis points a year for some period of time. The overall view is: look, we have good news and bad news. The good news is that the productivity growth rate level is actually a bit higher than we had thought once you account for these hidden assets. The bad news is that the slowdown from 2005 is even bigger than we thought because they were building intangible assets back then too. so,

Andrey: Well, how do you compare the intangible asset investment? I think this is kind of the key

Seth: Yeah. What's bigger? The invisible teapot or the invisible elephant?

Andrey: Because right now we're getting a lot of intangible investment into learning new production processes with AI, or is the answer just to look at how much the stock market has gone up? Is that the answer?

Daniel: Oh, that's basically it, Seth; you're not too far off. We do a hedonic regression. If we were to look at, say, the R&D assets, because this one's kind of mature, you don't really see too much from R&D on its own, but we can see if a dollar of R&D investment capitalized is actually worth a dollar and 10 cents in market value. We assume that there is 10 cents of intangible correlate value there.

Or if you really wanna be pedantic about it, it's 10 cents of intangible correlate combined with quasi rents from the fact that you can integrate R&D investment better for productive purposes than your competitors could. And then I'm going to wave my hands and say, But that's actually an asset, so it's an intangible asset too.

Seth: Right. It's the, the, this is, this is something. I mean, I remember us spending lots of time back in the day in the M-I-T-I-D-E break room, having a cup of coffee looking out over the Charles Jerome, walking by the Aour, locked in these intense conversations about just how do you measure these intangible assets?

They seem so essential to everything, yet they are literally the latent vaporware. They're our generation's. TFP, if you will.

Andrey: I don't know. I think the principle I obviously agree with, right? Like you have these investments that are not easily measurable. and they surely should be counted in some way. But it's not obvious to me. If the rate of intangible investment were constant over time, then it's a constant adjustment, and we don't really have to think very much about how the world works. But then I think measuring the intangibles—that's kind of tricky because I think about market cap, which is something that not only you're already talking about rents, but to me competition is so important there, right? You don't gain market cap just because you're doing investment. You gain market cap because you have market power in the future.

Seth: Yeah, but now you have to think about it. Why would you ever pay an adjustment cost in a perfectly competitive economy? You never make the adjustment cost, right?

Andrey: Well, I would say that there are different degrees of market power that can exist, or you can have your kind of standard monopolistic competition model where everyone's kind of keeping up to keep up, but then you can have companies like your Googles and whatever, who clearly don't think that the right model of the world is that.

Yeah, and I guess the other thing is I will not always be skeptical of firm value regressions. I think the endogeneity issues are fatal, but I don't know.

Daniel: Yeah, I disagree with you there, that it is just—

Seth: You just died. You were just killed.

Daniel: I feel so devastated.

Andrey: Yeah.

Daniel: No, I think where I disagree is, I think Tim Bresnahan put it this way. He is just like, “Well, everything's an asset here, including the capacity to generate rents, so it's just an interpretation question more than anything else.”

And you can bind things, right? Like, it's not when you go and run some of these regressions; you're not saying, I think that an additional unit of AI investment causes this market cap. They're the endogeneity; it's predictive. It's like, “Here's a price on this thing; it's not at all saying if you are.”

Seth: Here's a model: there's only room for one social media platform. So whoever got there first planted their flag on that land. They didn't make an intangible investment. They just planted their flag first.

Daniel: Right. That's what I'm saying too. It's like they planted the flag first, and now it's worth 10 bucks. but I'm 'm not saying if you were to just go up—-

Seth: 10 bucks. Which seems marginal…

Daniel: Oh, yeah. Oh, you're talking about the marginal versus inframarginal differences. And the way you deal with that, as opposed to how you do in any structural models, is you assume it away and say that marginal equals average queue for some of these.

But it's not like when you run these regressions that you get coefficients of a thousand; you get coefficients of like somewhere between 4 and 12. So, is it unsatisfying—

Seth: That—you get 4 and 12—what?

Daniel: Oh, if I were to say… regress market value on measures of IT capital, the multiplier, I get, and this has been sort of stable in weird ways for 20 years; the coefficients you got are somewhere between 1 dollar of IT investments correlated with like 4 dollars of market value on the low end to like 12 dollars of market value on the high end. and it's that which bounds the debate. It's not saying this is infinitely valuable. There's this enormous intangible asset that's the entire economy.

And then it's also not saying it's nothing. So I think that imposing some assumptions, which you can absolutely question, and I think we all should to try to get better models, imposing some assumptions and doing the best you can is a way to learn something as opposed to, like, just throwing our hands up.

But yeah, I agree with you that the causal interpretation of these things is not correct. so.

Seth: You then—so okay, the useful question—are we in the bad part of the J-curve?

Daniel: Which part's good and which part's bad?

Seth: The good part is when you're going to get more growth down the line than it looks like you have now.

Daniel: We are in the hard work investment stage of the J-curve.

Seth: Okay.

Daniel: I don't think we're in the—we're anywhere close, at least not for AI. I don't think we're anywhere close to the harvesting side yet.

Seth: But you think the GDP is on the underestimated side, which is what I mean by the good side.

Daniel: Yeah, I would say very modestly, GDP is underestimated right now.

Seth: Very modestly, 1%, 2%,

Daniel: I think that's because I'm probably ambitious. But what's GD?

Seth: Order of magnitude, 1%.

Daniel: Yeah. So where it's tough is like the parts of AI investment that are happening right now, I think, are actually fairly well captured by GDP seeing a huge amount of CapEx, and data centers, GPUs, and those things are priced pretty well.

But eventually people are going to question, how do you make someone responsible for hallucinations that the models might make or come up with good policies that get people to create good outcomes there? That's a hard thing to do. I don't think we're like anywhere close to scratching the surface with that.

Andrey: I guess the intangible investments now are more about how we go about teaching using ChatGPT. 'cause that's not going to be measured in a change in labor inputs, but it's something that is not going to materialize until we actually figure out how to teach people more effectively.

No, it's not clear that that was ever a GPT build. But, if we were a regular for-profit firm at the university, that's—

Daniel: Yeah. So, that stuff will take a while… I don't know… I don't think even if we stopped—

Seth: Of all the people who actually do work in the economy, are the people you're referring to—

Daniel: Right. And in particular the AI researchers—if AI researchers stopped building new LLM tools and making these things better today, we would still have quite a while to actually integrate this and put them to their best use. So that's kind of a bummer.

Seth: Then let me ask it that way. So if you don't wanna give me a percentage rate of intangible investments either—below average—do we need to spend a hundred percent of GDP over the course of the next 20 years in order to take these advantages cumulatively? How many intangible investments do we have in front of us? Do you have a sense of the order of magnitude of that?

Daniel: I don't know how deep the well goes. No. But it might be quite a lot.

Seth: One thing related to this, I was thinking about when we were talking about part one, is you've got these two measures of jobs: AI exposure, one of which is “just the LLM” and one of which is the LLM plus software tools. Didn't you tell us that you can use LLMs to make software tools?

Daniel: Oh yeah. It's, it's totally recursive. But the reason we pick up on software tools is because it also requires the changing of business practices and these organizational things.

Seth: So that's the way to do it. Can we play that game then? Can we look at the wedge between E1 and E2 as telling us something about the size of the adjustment costs needed or the intangible assets needed?

Daniel: I don't think it gives you that, to be honest. Sorry, Seth. I know my tools are unsatisfying here. That's a good research question, though. I think actually, the market value regressions that Andre hates are more likely to get you a ballpark for that.

Seth: Do any sorts of policies or ideas come out of the J-curve? Should we be somehow subsidizing intangible investment? Do you think this is happening at a socially suboptimal rate? I mean, you would expect that, like any innovation, you'd expect there to be positive externalities as people copy and learn from each other.

Daniel: I don't have any evidence to suggest it happens at all, that there's an externality here that needs some sort of correction. Where I could see some policy considerations, and obviously I'm not in charge of any of these things, so take what I have to say with a grain of salt, as you would for anything else I say.

Daniel: I think when it comes to monetary policy and thinking about how quickly how hot or cold the economy is, it may be helpful to know how much intangible asset creation is happening because it's a compositional shift. And you might think that the economy is in a recession when it's actually doing quite well, at least in certain pockets.

There’s a distribution of gains question here that's pretty important. Like who creates the intangible capital versus who benefits from it versus who's just like, shut out of that part of the economy altogether. But I think on average you might want to know if your growth rate is actually, in real terms, two-and-a-half percent versus one-and-a-quarter percent or something.

Andrey: And I guess you would look at the stock market. So if we have kind of this case where the stock market is going up, but GDP is not going up as much. Maybe, you'd be like, “That's okay on some margin.”

Daniel: The stock market is an increasingly less useful tool, sadly, because there are fewer public firms, and there are other reasons that those large firms would be different than the rest of the economy. It's just a quick thing to do. So it's easy to get those market values and start to pull that info.

But I think the ideal thing to do is to have an actual sense of how these assets are priced. Like you could look at M&A and costs for whole software firms. Sadly, you can't shave off a tiny piece of your digital culture and market it and sell it to someone to get a little bit of a value indication.

But I think much more complete data would give you a sense of what these assets are being valued at. It could be helpful that that's if you're willing to buy into an enterprise that I more or less do, which is that on the margin, either these asset markets or securities markets are doing a pretty good job.

If you think that there's some sort of bias in them that prevents you from actually sorting 'em out. Like, let's say everything is priced in terms of e-commerce, and I mean, obviously there's no hype factor in crypto, but yeah. Let's assume a wild assumption, for a second, that crypto is not priced at its actual long-term fundamental value and you were using crypto prices to back out the value of all illicit trade around the world. You might mistake illicit trade assets as being super valuable in that case. if those crypto coins are a claim on future, illicit trade value, so—

Andrey: What—what?

Daniel: I'm probably saying too much?

Seth: The stock market may look really good, but the companies are building evil products, so don't—

Daniel: Right?

Seth: —welfare growth.

Andrey: Well, this is—

Daniel: Yeah.

Andrey: Deone has the point of view that all the AI innovation is for making social media more addictive. o.e,

Daniel: All right. Which is, in my view of the world, an asset.

Andrey: What about what the GPT or GPTs do? Does that have any policy implications or, I guess, any follow-on work that you have on that? .

Seth: I understand you've looked at how firms differ by these exposure measures.

Daniel: One of the conclusions there. So, if you were to look at the exposure of firms against their quantities of tech workers, there's a little bit of a mechanical relationship here because tech workers are highly exposed. But, there is a difference across companies, like whatever exposure you, exposure measure you want to use.

And the reason we do that, Seth, is precisely 'cause of what you brought up. You can use these tools to build better technology. So in some sense those companies might have a good reason to run away, and performance. But like the differences from low to high exposure and entity measures across firms are not nearly as big as the differences from E1 to E2 to E1 + E2.

Those are really big. So, every company could benefit if they went and started actually trying to transform if they knew what a good direction to transform would be. So that was kind of one of the points I think from a policy perspective. I have a hard time separating what Tyler Cowen, whom we call mood affiliation, from what I think are good policies, but I'll just spit them out as some things I think are good to do.

I would, but there are a few risks with these tools that scare me. The virology community, I think, should be fairly concerned about using turbocharged models to manufacture COVID or something. Or like, God forbid, some degrowth person decides that they want to kill half of humanity and go full Thanos.

Seth: That’s the plot to 12 Monkeys.

Daniel: It is, but so would 12 monkeys, which would be a bad reality to face. But aside from that, I think there's just so much drudgery, so much additional work that these things could do for us, and a lot of gains to be had. So my preference is not to regulate these models in any kind of aggressive way; I think it's to figure out what they're good for and to develop with them.

Not to say you can't mitigate other risks like bias—that Mecca Hitler thing with Grock was terrible. There are going to be bumps in the road along the way, but they're not the kind that would say to me. Oh, we should do like a six-month pause of development. None of that really scares me yet.

Seth: Not in favor of bombing the data centers?

Daniel: No, I'm not, but I'm not a fan of Harry Potter fanfiction either. So I don't know. Maybe it's just correlated beliefs.

Seth: So you brought up bioterror in particular—

Daniel: Yeah.

Seth: As we speak, AI is being used en masse in warfare for identifying targets for terminals and target acquisition by missiles and drones. Increasingly in Ukraine, we're seeing use of automated ground vehicles for transporting resources to the front and for evac. People often go to these super sort of—I'm not gonna say 12 Monkeys is bizarre, but it's a pretty weird movie if you've ever seen it. Why do we have to appeal to that rather than just using AI to make murder bots?

Daniel: I mean, to some extent the murder bot thing doesn't scare me that much. It’s human beings doing those things is also bad. I think the issue people have with those applications often will be scaling evil individuals, which is a serious concern, or just issues with war in general, which I understand.

But, if it's gonna happen, we're kind of caught in a prisoner's dilemma there, which is what freaks me out.

Seth: Near-term AI worry is: I have a drone hanging out downtown—a suicide drone that just hangs out somewhere in Manhattan and waits for the particular person to walk out. And then I target assassinate people untraceably, right? That seems like here as opposed to “I use AI to build a lab to make a super disease, blah, blah, blah.” That's got a lot of steps in it.

Andrey: Untraceable, Seth? I guess my presumption is these sorts of actions do tend to be traced. In fact, AI is a way to trace people, right? So this is kind of one where, as with many AI questions, it's a defensive and an offensive technology.

Seth: So it favors the offense or the defense. We had thought, it seems like intuitively you would think that AI would favor the offense, right? We think about these super weapons like Daniel brought up. But if you actually look at Ukraine, it seems to create this transparent battlefield where no one can even march to the front and in some ways seems to favor the defense. It's gonna take a long, long time to play out.

Daniel: Yeah, you guys would know the answer to this. I'm gonna butcher this quote, but who's that sci-fi writer who said that like, the job of a sci-fi storyteller is not to predict the driving cars but to predict the traffic jam or whatever? I think that—

Andrey: I don't remember who it is.

Daniel: Yeah, I think that's kind of the idea here. I think here that we want to predict what the traffic jams are. I think the—

Seth: Frederik Pohl

Daniel: There we go—I should remember that. The reason the bio-risk stuff scares me so much is 'cause we just had a test of this and what one virus does to society and how damaging that can be.

And I think, Seth, what you're bringing up is what I alluded to; it's like the scaling. One really bad long-term trend in technology is just like making individuals more powerful.

Seth: Andrey and I just read a book. We just read a sci-fi novel that's masquerading as his political economy. That argument that AI is all about individual disempowerment, that we're gonna get the God machine that's built by the state in the project, and it's going to 1984 us constantly—that's radical human disempowerment.

Daniel: Right. So if our response to individuals becoming much more powerful with technology is to expand their surveillance and control capacities of the state, and we get a loss of freedom, I think that's a genuine worry. In a general equilibrium framework, those things do freak me out for sure. But writing emails with LLMs just does not.

There's somewhere in between that we should, where we, we start worrying, and I don't think I'm at that point yet.

Andrey: What about things like transparency requirements that you oftentimes hear written about, reporting requirements, and registrations with the state? Do you have any opinions about those types of policies?

Daniel: I don't like 'em. I'll shop my book here a little bit. Like they're terrible for startups, right? Like any compliance burden you stick on startups, even if we might be okay, specifically the ecosystem suffers as a result, and they do a lot of the work to discover things. So, there's a big trade-off, and this happens in the privacy debate too with GDPR and what Europe's trying to do politically; no one's willing to acknowledge that there is a compliance burden and competition trade-off. So if you're willing to hold firms to account in really expensive ways, you're gonna get monopoly power.

And that may be okay. You may decide we don't want competition with this super private data that could get out to everybody—unwise with LLMs or AI regulation. If you don't want this to be an oligopoly situation, you probably need to make it so it's easy for people to build and develop.

And I'm fine with whatever choice policy makers wanna make, so long as they're taking that trade-off into account. I mean, they're elected officials. They're trying to make those choices on behalf of all of us. If we don't like them, we can vote them out.

Seth: Using the AI to manipulate us to have the beliefs that they want us to have.

Andrey: Is there anything you wanna tell us before we wrap that up?

Daniel: No, I thought this was a great discussion with you guys, as always. It's a pleasure to get to join you, especially as your first conversation-based guest. But, as a fan, it's kind of exciting for me as well. So please keep it up. Listen to Justified Posteriors, folks.

I would say the message I would have for listeners and economists, maybe in the audience as well, is just that I think these tools are really valuable in our work. I kind of joke—I got a model that I'm building where it shows that lower types are going to use LLMs more for assignments.

And then, of course, I'm using LLMs to help me build the model. So infer what you want about my type from that, but I think it.

Seth: You've got this. You're assuming everybody has to be equally good at everything, but you can just be good at one thing and bad at another.,

Daniel: Yeah, I would never claim to be a good modeler, but it does help me get my thoughts straight.

Seth: I think you could be a modeler

Daniel: I'll leave that one alone. But I would just encourage folks to kind of be their own R&D department. As Ethan Mollick says, “Play around with these things.” I think when I talk with computer scientists, they get upset with me because I'm a little bit too pessimistic about what the models will do long-term. When I talk with economists, the modal disagreement point is the other direction, where folks don't think it's gonna be a big enough deal. So I would say, get out there, play with these things, and learn how they work. And Anton Korineck has got a great paper on using AI in your own work, so check that one out too.

Andrey: All right. Well, awesome.

Seth: I can't think of a better place to end it

Andrey: Listeners, please do comment and subscribe and stay tuned for more exciting episodes.

Daniel: Thanks, guys.

Seth: And if you are a super fan, you too. Might one day be a guest on the Justified Posteriors podcast.

A Resource Curse for AI?

Andrey Fradkin — Mon, 14 Jul 2025 20:54:21 GMT

In this episode of Justified Posteriors, we tackle the provocative essay “The Intelligence Curse” by Luke Drago and Rudolf Laine. What if AI is less like a productivity booster and more like oil in a failed state? Drawing from economics, political theory, and dystopian sci-fi, we explore the analogy between AI-driven automation and the classic resource curse.

[00:03:30] Introducing The Intelligence Curse – A speculative essay that blends LessWrong rationalism, macroeconomic theory, and political pessimism.
[00:07:55] Running through the six economic mechanisms behind the curse, including volatility, Dutch disease, and institutional decay.
[00:13:10] Prior #1: Will AI-enabled automation make elites less responsive to ordinary people by 2050?
[00:21:00] Prior #2: Will we get a new social contract (e.g., large-scale UBI or constitutional change) by 2050?
[00:26:31] Chapter-by-chapter breakdown.
[00:43:50] What about property rights? Can they insulate us from AI-induced tyranny? Or will they be eroded in the name of efficiency?
[00:46:01] Critiques
[00:52:00] Policy "solutions":
[01:04:44] Final posteriors and Seth’s economic-philosophical reflections: Can immortality + perfect patience = AI capital monopolies?

Mentioned in the Episode
📖 “The Intelligence Curse” by Luke Drago and Rudolf Laine
📚 I Have No Mouth and I Must Scream
📚 There Is No Antimemetics Division
📚 The Naked Sun by Isaac Asimov
🎮 90s point-and-click horror game based on “I Have No Mouth...”
📈 Sachs & Warner (1995) and Frankel (2012) on the resource curse.
🔁 The Gatsby Curve
📽️ Gattaca, 1984, Gulliver’s Travels

Support the show: Please like, share, subscribe!

Could AI Save Us From Making Hard Choices About the Budget?

Seth Benzell — Fri, 11 Jul 2025 18:35:46 GMT

The “Big Beautiful Bill”, just signed by the President, is projected to increase US indebtedness by $3 Trillion Dollars. US Debt to GDP, already projected to hit about 110% of GDP in 2030, will reach 115% instead.1

Is this spendthrift fiscal policy a problem? Economic analysis suggests at least two reasons.

The subtle reason is that government spending2 can “crowd out” private investment. When savers invest in government bonds (or have their savings taxed away), those funds don't flow into the stock market or startups. If private investment is needed to take advantage of a new technology, government borrowing and taxing might slow that down.

The dramatic reason is the inevitable crash when the party stops. Eventually people will stop lending to the government, and then it will need to either inflate the debt away, default on the debt explicitly, and/or accept harsh austerity budgets. A recent analysis by the Penn-Wharton Budget Model lays out what this negative spiral might look like for the US.

The ubiquity of these negative outcomes was memorably recorded by master economists and Microsoft Excel novices Reinhart and Rogoff in the ironically titled This Time is Different.

With this most recent expansion of the debt, commentators have begun to point to a new reason that this time might really be different. One representative Twitter post reads:

This take has even been echoed by Silicon Valley’s favorite economist,

A related point has been made by Peter Thiel in a recent interview

As Thiel points out, it seems like more people should be thinking about the details of how AI-driven productivity growth will play out for budgets and geopolitical power.

Luckily, my colleagues and I at the Stanford Digital Economy Lab are working on just such an analysis! Keep reading to see:

What does our detailed international macroeconomic simulation have to say about the sustainability of government debt during an AI-driven economic boom?
What does our simulation model leave out, and what other considerations should we bring in, as we make decisions about the US deficit? Deficits in other countries?3

Simulating the Global Effect of Transformative AI is a recent whitepaper by myself and Victor Ye investigating the international distributional, growth, and fiscal outcomes of a set of specific AI scenarios. To do so, it integrates a model of AI-driven productivity and automation with the state-of-the-art CGE-OLG (computable general equilibrium, overlapping generations).4 This work is based on Victor Ye’s dissertation chapter Simulating Endogenous Global Automation.

The scenarios we examine treat AI not as leading to a finite time economic singularity, but still a large scale economic boom on the scale of a new Industrial Revolution. In this scenario, over the course of the next 35 years, the world slowly gets access to technologies that can automate a large share of human labor — reducing labor’s importance to output by about half by 2060 (see footnote for details).5 That said, many of the arguments below hold directionally for more or less extreme AI-booms.

So let’s dive in, and answer the question, is unlimited debt ok in the age of AI? — sure, it’s never worked for anyone before, but maybe…

Let’s start with the good news. One way AI could help make the debt more sustainable would be by helping the US “grow out of” its many looming fiscal crises. And indeed, the AI scenario we consider would be a massive GDP growth boost to the US: increasing GDP growth rates by 1-3 percentage points throughout the wave of automation.

This is obviously very good news. More output (even if some of it is owned by foreigners investing in the US — more on this later) makes it easier to solve all social problems — the deficit included.

The clearest problem that this solves are the pension obligations the US has through its pay-as-you-go Social Security system. Because these pensions are tied to wages (mostly pre-AI boom, but some post by 2060) they’ll be easier to pay back in terms of share of GDP that needs to be taxed.

US spending on Social Security is a bit under 4% today, and our baseline projection has total US Government pension obligations rising to 5.7% of GDP in 2060. However, given an AI productivity boom, pension benefits are projected to only cost 2.9% of GDP in 2060.

For other government programs, the assumption that costs will not grow as national income expands is more dubious — and I’ll come back to that point. But if we can stop expanding spending while GDP grows, this massively helps balance the budget: The sum of health, pension, disability, education, and other (including military) expenditures falls from 34% of GDP in 2060 in the baseline to 17% assuming an AI productivity boom.6

Wow, 17 percentage points of GDP to play with in the budget! Seems pretty sustainable right?

Now for the bad news. There’s one big line item in the budget we haven’t mentioned yet — interest on the national debt. In 2017 the government’s borrowing costs were at record lows (1.2% of GDP), but have exploded upwards since.7 In 2024, the US spent over 3% of GDP paying interest on the national debt.

This line item will not get smaller with the AI productivity boom. Instead, the opposite. This is because an AI boom will make it costlier for the government to borrow money.

The reason is that all the exciting opportunities in AI will bid up interest rates. At a point in time, there is only so much of output being set aside for investment. A dramatic advance in AI tech is, primarily, an invitation to make a new, very productive investment. It means taking a few days off to reskill, or using the company credit card to subscribe to the AI service, or looking for angel investors in your new AI application idea.

How much do interest rates go up because of the AI boom? In our scenario they go up by about 4 percentage points, indefinitely — more than double their projected value under the baseline scenario. 8

Although AI productivity boom would increase GDP, it would also massively increase interest rates and the cost of capital. That’s very bad for debtors, and agents with high inter-temporal discount rates.9 And, from a fiscal perspective, the US government is the biggest, most impatient agent out there.10

What do we find? In such a scenario, the global real rate of return increases by 4 or more percentage points by 2060 (vs no AI boom), and stays elevated indefinitely. Meanwhile, the US growth rate is 1-3 percentage points elevated until 2060 with extra growth small after that period.

This effect would make the debt harder to sustain. It may push r>g (i.e. interest on the national debt is greater than the growth rate). Public economists are wary of debt in these circumstances because debt to GDP grows even if the primary budget (the budget before taking R into account) is balanced.

But, at least in our scenario, the fiscal negatives from the higher interest rate are much smaller than the benefits from growth. Even if the US cost of borrowing were to triple, this would only raise the cost of paying the national interest to 9% of GDP. This is a massive number — but still much smaller than the fiscal space we’re supposedly creating by growing out of our welfare obligations. In our simulation, they rise to only 4.5% of GDP, even more manageably.

On net then, given this scenario and assumptions, the US is projected to spend much less as a share of GDP in 2060 because of the AI boom, and this allows for lower average tax rates.

Now for the ugly part. While things go pretty well for the US budget in this scenario, other countries don’t fare nearly as well.

The reason is that while the increase in interest rates is global, only the most advanced countries see major growth benefits from the AI boom. As I previously pointed out, an AI boom is an opportunity to make productive investments. If we think of this boom as, in part, coming from automating jobs then it immediately becomes apparent that some countries will benefit more from AI than others.

The countries that benefit the most from AI will be the ones with (currently) high wages and TFP, and low costs of capital — the US is the lead example here, with high wages, a well developed financial system, and low corporate taxes.

On the other hand, regions in the developing world with low initial wages and TFP, and difficulties in borrowing and investing, benefit less from the opportunity to automate labor. This means that people with money to save in the developing world will increasingly be investing their money into the US. This will get residents of the developing world a higher rate of return, but it is the US residents who will benefit from the investments.

One example my coauthor Victor likes to use is that American firms make T-shirts in a highly automated fashion. He has had T-shirts printed in Brooklyn by companies that use almost zero labor. But nobody in India would voluntarily purchase such a machine because local borrowing is expensive and the servicing costs of said capital investment is more expensive than the Indian workers it replaces.

While the US benefits the most from the AI-boom, Western Europe and Japan also do well. India actually sees depressed growth due to capital flight. China is a mixed case, in the short term facing capital flight, and lower growth than otherwise, but eventually adopting and benefiting from the frontier AI technology.11

So the ugly truth is that even if the AI-boom is good for the US’s budget balance, the story is more mixed for the rest of the developed world, and even negative for the developing worlds’ balance.

So where does that leave us? Does an imminent AI boom make government debt a good idea? Or at least a less bad idea? The answer from our simulations, in terms of fiscal sustainability, is yes for the US, but no for many other regions. However, there are at least three factors to consider before we accept that simulation result:

The rise of AI-driven productivity is likely to be associated with large new demands for government expenditure. Most obviously, this this may include welfare payments for the newly unemployed, the costs of a new cold war with China or AGI race, and large energy infrastructure demands. There may be Baumol’s cost-diseases in whatever services the government wants to provide. More speculatively, expensive life extension techniques might be invented and demanded at subsidized prices by the public. This would be a large direct cost, and also keep around more pension check collecting retired Boomers.
Ironically, when this episode was released, Homer was still a Boomer.
Could these increased calls for government spending be large enough to keep up with GDP growth? It’s impossible to predict for sure. But, historically, Congress has had no problem rising to the challenge.
The wealth created by AI companies may be difficult to tax, because intellectual property may be relocated to low corporate tax countries.

What makes OpenAI such a special and valuable company? It’s not the chips it owns — those are commodities. Their value is in their intellectual property and, to a certain extent, in the contracts it has with highly productive employees. While workers may not be able to easily relocate (and therefore, we should be able to hit them with income and consumption taxes) the intellectual property inside OpenAI, the thing that profits that are ascribed to and would be taxed, might be able to flee to a tax haven.12

In a world that is fully hurtling towards singularity — with very few humans providing useful work — this effect would be even stronger. We might see a colony of the ultra elite and their massive robot factories setting up in Antarctica (for the low taxes and to chill the graphics cards racks), a charter city in the Sahara (for the solar power and no need to worry about pollution), or somewhere stranger like beneath the ocean. In these scenarios, the US would only be able to benefit tax-revenue wise from taxing trade with these ultra elites — perhaps us selling them the few resources that they had yet to automate.
A technologically advanced tax-haven charter city for the worlds’ elite in a remote location? What could go wrong? Art found here.
Now, perhaps this could be solved by an international agreement to set a minimum global corporate tax. This would minimize the risk of OpenAI or Anthropic etc. locating their IP in a tax haven. As of 2024, there was good progress towards just such an…
haha well…

Additionally, many of the benefits from AI may come in the form of intangible improvements in digital consumption goods. DEL research on intangible digital product quality improvement has shown these gains to be a big part of recent welfare improvements. This might be real growth, that really raises welfare, but will be hard to tax or even measure. How do you tax all music and movies and video games getting 10% better, if costs or prices never change?
Anticipated AI productivity growth will increase interest rates, and therefore the cost of financing government debt, before large productivity and government revenue benefits, because of investors forming rational expectations and attempting to hedge risk.

This is an important point. The US debt has an average maturity of just under 6 years. Therefore, if the AI investment boom — or an AI optimism led dissaving boom — were to happen significantly before the AI productivity boom, then this would create (at least) short-term challenges for financing spending.

For a full discussion of how and why a spike in interest rates might significantly antecede a spike in growth rates, I highly recommend the excellent Transformative AI, existential risk, and real interest rates, by DEL’s own , research previously highlighted at Marginal Revolution.

Those caveats aside, it seems an AI-boom would make government debt more sustainable, at least for the US. But would it make it more wise?

I would make the case that the rise of very productive AI technologies actually increases the importance of government indebtedness as a social problem worthy of our attention.13

In a phrase, the reason why is “crowding out of private investment.” AI tech is, primarily, an invitation to make a new, very productive investments. If there isn’t lots of investment/saving to go around, you can’t take advantage of all these opportunities. And government taxes and debt are exactly the sort of thing that depresses investment. A short-termist government taxing the golden AI geese to give out short term goodies (or crowding out investment in them in the stock market by issuing bonds) is a problem, exactly if and when AI makes investment hyper-productive. Another way to say it is: “Even if AI technology is economically incredible, the social saving rate still matters!”14

To make the same point again — when the marginal cost of capital is high, that means there are productive uses for it that are being unexploited! Therefore, it’s even more important to save when interest rates are high.

This point doesn’t depend on my non-singulatarian stance. Even in a singularity, it would be the case that the social saving rate is very important. To make the exaggerated version of the point — even if we had the most fantastic AI technology available, if no one is willing to invest in making the next years’ NVIDIA chip factory, then we won’t get any output benefit.

Now, here’s the bit where I get a teeny tiny bit grouchy about the argument that AI means debt doesn’t matter.

I’m old enough (34) to remember when serious people used to make the exact opposite argument for deficits. Former IMF Chief Economist — the high priest of government debt if there ever was such a job — Olivier Blanchard argued that maybe, possibly, it was OK to go into massive debt when interest rates were low. To quote a talk abstract, "Put bluntly, public debt may have no fiscal cost.”

To paraphrase Olivier’s argument: In 2019 the safe interest rate was very low. This tells us that the rate of return on investment must be low too. In fact, there was so much saving looking for a safe home, it was leading to wasteful over-investment. The current generation (Boomers) could literally have a free lunch by borrowing or taxing "excess savings” to fund government spending today.

The logic had a certain perverse truth to it. If the future was so shitty, with so few investment opportunities, that giving our kids more resources or capital wouldn’t even be able to help them…. what’s wrong with the Boomers taking a bigger slice of the pie for themselves? Well, ok boomer. I guess it would be petulant for future generations to object to this immaculate logic.

OK, grouchiness over :)

Conclusion

The point of this essay is not to argue that an AI-driven economic boom would be a bad thing, because of the pressure it will put on government budgets. An AI driven boom would be an amazing thing, leading to huge additional wealth and welfare over time.

However, it will also raise the cost of capital and create new demands on government budgets. To meet these costs, government will either need to raise taxes, lower spending, or finance them with more debt. This creates two main concerns:

The first is simply the government’s ability to finance itself. The US already is faced with a straining budget. The Social Security Trust Fund is set to run out of money in 2033 and there is fiscal gap (a refined measure of the government deficit calculated as the difference between the present discounted value of projected spending less revenues by ) of 8% of GDP. Now, luckily, it might be that if we can keep spending on it’s current path, figure out a way to tax AI wealth, and the interest rate doesn’t go up early enough, then AI might be enough to wipe out that gap. But, as discussed, the idea that government won’t discover some new emergency to spend that fiscal space on is somewhat dubious.

While some non-US governments are in stronger fiscal situations, with smaller debt/fiscal gap loads, they also will benefit much less from an AI-driven boom, making this an international challenge. The US may be able to get a slice of OpenAI wealth to fund itself, but Mexico and other middle and lower income countries will feel the squeeze of lower wages, higher interest rates, without an AI national champion’s revenue to compensate.

The second is the point that as interest rates go up, the social cost of government debt increases as well. There was a previously mainstream view that because real interest rate on government borrowing in the late 2010s were so low, it was ok for governments to go into debt. That’s because the social cost of government debt is proportional to the usefulness of the private investment it crowds out.

But — by that previously popular argument — when anticipated productivity is high, that’s exactly when we should be juicing the social saving rate.15

Instead of contemplating a larger debt, we should instead be talking about a national sovereign wealth fund, that could “own the robots on behalf of the people”. This would both boost output and welfare, and put the welfare system on an indefinitely sustainable path.

Thanks to , , and for reviewing drafts of this post. Thanks to Victor Ye for helping with numbers and figures from our simulation.

A fiscal gap analysis, a generalized measure of US indebtedness that takes into account all planned expenditures and revenues, finds that the US’s overcommitment amounts to 8% of the present discounted value of GDP — that is, 8% of output indefinitely. See https://gsas.harvard.edu/news/colloquy-podcast-debt-ceiling-and-beyond-laurence-kotlikoff

Or a speculative bubble

If this post is successful, in future essays I can discuss what the model has to say about the impact of an AI-driven economic boom for geopolitics/the international balance of power, and for the desirability of a “baby bond” such as the “Trump Accounts” in the BBB.

Called this because you model the entire world on the computer, as being made of people who are born, save, work and die over the course of overlapping lifespans. This is the same family of models used by the JCT and CBO in their projections.

The details of the model can be found in the working paper Simulating Endogenous Global Automation, but here’s an overview.

The world is divided into 17 regions. Within each region there are simulated agents in 3 different skill groups who live up to 100 years. So there are “300 representative agents” — each keeping track of demographic groups of different sizes — in the US or any other region at a point in time. These agents go through their lifecycles, giving birth to fractions of children each year which they care for, with some percent of them dying and inheriting from each other. They make decisions about how much to work, save, and consume in each year, taking the government’s tax policies and current and future prices as given.

Each region has a separate government which tax through wage, income, consumption, capital and corporate income taxation, and spend money on education, healthcare, direct transfers (e.g. welfare and disability), pensions (e.g. social security), and other (e.g. national security) programs. The U.S. is one of these regions. Each region’s government consolidates state, local and federal tax and spending policies, so if some of our government projected spending numbers seem a bit high, it’s because we’re including state and local taxation. In the scenarios considered, governments adjust their consumption and income tax rates (keeping their scaled progressivity fixed) in order to keep debt/GDP on an assigned path. Spending programs and corporate income and other tax rates they do not change.

There is also one representative firm in each region that hires workers and rents capital to make output. There is frictionless trade between regions.

A “Second Machine Age” (h/t ) if you will.

The particular technology scenario I’ll focus on here is labeled “10x automation” in the paper. In this scenario, companies globally gain access to increasingly more productive technologies, which are also more capital intensive every year — i.e. more automated. But not every region’s company’s will use the most advanced technology — AI is most useful for countries with high wages and low costs of capital. Companies select which technology is best for them based on this, so not all countries will use the most advanced technology.

The size of the automation wave is 10x the speed at which capital’s share of income has increased in the past few decades. It proceeds smoothly each year. The boost to technological change ends in 2060, by which time capital’s share of income grows of 60 percent of GDP for regions at the the technological frontier — roughly speaking, you might say that the role of robots and capital are twice as large, and the role of labor half as large. The size of the productivity gain associated with automation is calibrated to historical estimates. For more details, see the white paper and working paper.

In terms of welfare, American workers see their share of output go down, but output overall goes up by a lot, their investment returns skyrocket, and tax rates fall because of the above effect, leaving them better off on net. This assumes that the AI isn’t further skill-polarizing, i.e. more high skill jobs are retained than low skill. For alternative scenarios where AI also changes the wage distribution among jobs that remain, see the white paper.

I bring up 2017 because that is when this model was calibrated. We would love to update the model with more recent data when we have the resources! It doesn’t matter for our other conclusions, but the increase in interest rates will therefore be about 2x worse for the budget in real life vs. the simulation for this reason.

Under the baseline scenario, interest rates would have decreased from today, driven by the wealth of a rising Asian middle class. This secular force driving down interest rates has been called the “global saving glut”. https://www.brookings.edu/articles/why-are-interest-rates-so-low-part-3-the-global-savings-glut/

That’s economics for “impatient idiots”

Who would have thought a House elected to two year terms might not have much foresight in drafting budgets?

Let us know if you want to hear about another region!

Contra TC, I think AI — which is easily sold across international lines, and might easily ‘locate itself’ through e.g. corporate inversions into a tax haven — calls for lower corporate tax rates. Automation generally makes the location of production more sensitive to capital taxation, making income + consumption taxation more attractive as well (the workers have to live in SF even if the bots don’t).

Also the chips will have to locate near cheap energy, which means to the extent the IP needs to be located where the chips are, this may prevent AI companies from just completely relocating to the Bahamas.

As I make this argument I want to clarify that I am not arguing that the US government shouldn’t ever borrow to make smart long-term investments. I don’t disagree with @AndrewCurran_ that extreme leverage can be optimal — If that leverage is allowing the US to make public investments (e.g. in energy infrastructure, or R+D) this could be wise. But the US government debt is NOT primarily about smart public infrastructure investments. The US government spends on welfare payments, interest on the national debt, and for national defense — in that order, with investments relatively small.

In Asimov’s “The Caves of Steel” the highly automated society is impoverished for just this reason — a short-term focused government suppresses savings, so there is a low ratio of automation capital to humans. My paper “Robots Are Us” explores a similar dynamic, where the lowered saving of the young, due to automation reducing their labor income, depresses growth rates. Reduced saving by young people does indeed slow global capital accumulation and growth somewhat in out 10x automation scenario.

Another way of saying the same thing is ”In an AK world, delta K is just as important as delta A”

Another caveat is that the US government is a big borrower, but not the only one in the world. If the US were to increase it’s social saving rate by reducing government crowding out, this would have a relatively small effect on the US itself, because of the global-ness of capital markets. This is mostly an argument about how the whole world needs to save more into an AI-boom, not the US in particular. However, to the extent that the world capital market is not completely open and free flowing, boosting US saving is likely to have a disproportionate impact on US investment.

The Trump Accounts — a version of Booker’s Baby Bonds — are an interesting but small idea in this direction. If anyone is interested in what our model has to say about a policy like this, check out the white paper.

The T-word

Andrey Fradkin — Sat, 05 Jul 2025 01:24:43 GMT

Attention conservation notice: shameless navel-gazing ahead.

I lost count of the number of conversations I’ve had that center around tenure. When are you up? Do you think you’ll get it? How many pubs do you need? Does journal X count? Do you know so-and-so got it? The conversations are for the most part harmless small talk, but they happen more often than conversations about ideas. The frequency of these conversations demonstrates how central this status is for academics.

What does tenure actually give? Lifetime employment, a better title in some cases (associate vs assistant), a slight pay bump, more duties to the school, and also influence within the school. But people, of course, focus on the lifetime employment. In theory, this lifetime employment gives academics the ability to speak their mind. It should be a bulwark against personal, societal, and political pressures. It should be a license to take risks. But when the most risk-averse path is the one most likely to get tenure, what risks can we expect academics to take?

The system we’re in selects for and creates a type of person who is motivated by the status of tenure or peer respect above other things. Those who started academia for the research, and I think it’s many of us, end up becoming transformed by the social environment. People write papers because they need pubs, they work on trendy topics to citationmaxx, and they post inane things on LinkedIn to have “impact.” Tenure will never be enough status, as there is always the next promotion, award, and invited talk. Even a minor threat from the dean could motivate people to fall in line.

But where does that lead me? I’ve played this game pretty well, albeit with a rough start. I’ve suffered the indignities of being rejected, and of writing obsequious responses to the referees. But I’ve also had my share of luck and have had some pretty good ideas. And so it came to be that I received tenure.

Getting here proves I'm partly what I criticize (perhaps a symptom of the human condition). I could have taken more risks, in research and in life. I continued working on the same research projects year after year, so that eventually they could be published at good journals. I just wrapped up a paper I started in 2019, even though the main insights were clear as early as 2021. I continued to travel to the same conferences, presenting the same almost-finished work. And when people liked the work, I appreciated and craved the positive feedback, just like everyone else.

My research has been primarily about digital market design and regulation. And I think I know a lot about these important topics. But the majority of what I know is not what I’ve learned from doing my own research projects. Instead, most of what I know, I’ve learned from reading widely, talking to practitioners, and doing consulting. The fact that I’ve published the research serves mostly as a signifier to the rest of the world that I’m an expert.

My admittedly predictable expert opinion is that AI, and AI agents, will play a central role in market design going forward. As an expert, I face a choice of what to do with this knowledge. The predictable and expected path in my field is to write papers about AI, market design, and marketing. Perhaps I can find an early and limited use case of AI in a market, and estimate a causal effect. This is surely publishable and would get a lot of citations for being “first,” even if the estimate would have little external validity for future market designs. Or I can estimate an economic model, the details of which few are likely to understand. This would be impressive to some, but in the time it would take me to publish it, the world will have changed. I am not a nihilist about business and economics research about AI, some of it is truly worthwhile, but it does seem that the action and discovery is happening mostly in industry.

For me, the path that maximizes learning, impact, and dare I say fun, in the near term is not to pursue the status quo. I want to actually be a market designer incorporating AI into markets and I want to be a valuable part of the conversation.1 Market design is something that one does as part of a company with existing users, or as part of an organization. It is not a solo endeavor for an academic researcher. So if I want to do real market design, I will need to spend some time outside of academia.

A privilege of tenure at many academic institutions is the ability to take a leave of absence. In the near future, I will take advantage of this privilege to work on AI driven market design at the biggest marketplace in the world. I am stoked about this opportunity, which will simultaneously be a test of my expertise and an opportunity to deepen it. It’s hard for me to predict how this all works out. The experience could be transformative, or it could be a futile battle with corporate bureaucracy. Regardless, I will learn and report back.

To wrap up, three cheers to tenure, warts and all, and thanks to all who helped me along the way. I promise to use it to take risks, to do work that matters, and to have fun!

This blog and the Justified Posteriors podcast are part of the conversation!

Justified Posteriors

Can an AI Interview You Better Than a Human?

Why Can’t Your AI Agent Book a Flight?

Anecdotes from AI Supercharged Science

Anecdotes of AI Supercharged Science: Justified Posteriors reads “Early Science Acceleration Experiments with GPT-5”

Priors

Links + Shownotes

Scholars Mentioned

Tools & Technology

Concepts & Philosophy

Transcript:

One year of justifying our posteriors

Ben Golub: AI Referees, Social Learning, and Virtual Currencies

Links & References

Transcript:

The Best Books Seth Read in 2025

Best Pairing: Breakneck: China’s Quest to Engineer the Future by Dan Wang and Natural Moralities: A Defense of Pluralistic Relativism by David B. Wong

How What You Should Do When Your Dad Murders Someone Explains the Difference Between the U.S. and China

Mencius’ Answer:

Plato’s Answer:

Best Non-Fiction: The Allure of Battle: A History of How Wars Have Been Won and Lost by Cathal J. Nolan

Best Sci Fi: The Hydrogen Sonata by Ian M. Banks

Best Play/Opera: Salome by Oscar Wilde

Wildcard: The Pine Barrens by John McPhee

Most Laughable Economic Theory Joke Award: Ecstasy: Understanding the Psychology of Joy by Robert A. Johnson

Honorable Mentions:

Are We There Yet? Evaluating METR’s Eval of AI’s Ability to Complete Tasks of Different Lengths

Epistemic Apocalypse and Prediction Markets (Bo Cowgill Pt. 2)

Does AI Cheapen Talk? (Bo Cowgill Pt. 1)

Evaluating GDPVal, OpenAI's Eval for Economic Value

Key Findings & Results Discussed

Main Discussion Points & Takeaways

Will Super-Intelligence's Opportunity Costs Save Human Labor?

Relevant Links

Priors and Posteriors

Can political science contribute to the AI discourse?

Should AI Read Without Permission?

EMERGENCY POD: Is AI already causing youth unemployment?

AI and its labor market effects in the knowledge economy

The Ide & Talamas Model

Key Debates & Critiques

Go Deeper

Timestamps

The Bets: Priors & Predictions

Our Final Posteriors

One LLM to rule them all?

Transcript

What can we learn from AI exposure measures?

A Resource Curse for AI?

Could AI Save Us From Making Hard Choices About the Budget?

The T-word

Best Pairing: Breakneck: China’s Quest to Engineer the Future by Dan Wang and
Natural Moralities: A Defense of Pluralistic Relativism by David B. Wong