News May 13, 2026

The Impact of Government-Controlled Media on LLMs


Brandon M. Stewart

Princeton SPIA’s Research Record series highlights the vast scholarly achievements of our faculty members, whose expertise extends beyond the classroom and into everyday life.


The Details

  • Authors: Hannah Waight (University of Oregon), Eddie Yang (Purdue University), Yin Yuan (University of California San Diego), Solomon Messing (New York University), Margaret E. Roberts (University of California San Diego), Brandon M. Stewart (Princeton University), Joshua A. Tucker (New York University)
  • Title: “State Media Control Influences Large Language Models”
  • Journal: Nature

The Breakdown

Even as artificial intelligence, in the form of large language models (LLMs), gains increasing recognition for its persuasive capacity, the question of what data the models learn from remains difficult to answer. Most scrutiny has focused on LLMs’ output rather than the source material the models draw on, and AI companies do not disclose their data sources, leaving researchers to wonder who exactly is influencing AI’s responses to prompts.

A team that included Brandon M. Stewart, an associate professor of sociology affiliated with the Princeton School of Public and International Affairs and the paper’s corresponding author, was curious about the extent to which state-coordinated media helps train LLMs.

“States are constantly seeking to shape discourse through control of the media and information operations,” Stewart said. “These efforts can — sometimes unintentionally — influence large language models that those institutions don’t directly control.”

To test their hypothesis, the researchers — who included co-first author and Princeton Ph.D. graduate Hannah Waight, whom Stewart advised — combined evidence from evaluations of LLMs in the local languages of 37 countries with a case study from China. In the case study, they compared two sources of Chinese state-coordinated media with a major open-source training dataset derived from Common Crawl, an open repository of web crawl data collected over the last 18 years.
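The paper’s own measurement pipeline is not reproduced here, but the core idea of the case study, checking how often state-media phrasing resurfaces in an open, Common Crawl–derived training corpus, can be sketched with simple n-gram matching. Everything in the sketch below (the toy documents, the 10-character shingle length, and the helper names) is a hypothetical illustration, not the authors’ code.

```python
# Hypothetical sketch: estimate how often distinctive state-media phrasing
# appears in a sample of web-derived training documents, relative to a
# baseline corpus. Toy data and shingle length are illustrative only.

def shingles(text, n=10):
    """Return the set of overlapping character n-grams in a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def match_rate(source_docs, corpus_docs, n=10):
    """Fraction of corpus documents sharing at least one n-gram with the source."""
    source_grams = set().union(*(shingles(d, n) for d in source_docs))
    hits = sum(1 for doc in corpus_docs if shingles(doc, n) & source_grams)
    return hits / len(corpus_docs)

# Toy stand-ins for state-media articles, a baseline corpus, and a sample
# of web-crawl training documents (the real study used far larger corpora).
state_media = ["the campaign brings tangible benefits to the people of the region"]
baseline = ["an encyclopedia entry about local geography and rivers"]
web_sample = [
    "reposted: the campaign brings tangible benefits to the people",
    "a blog post about regional cooking traditions",
    "local forum thread on transit schedules",
]

state_rate = match_rate(state_media, web_sample)
baseline_rate = match_rate(baseline, web_sample)
print(f"state-media match rate: {state_rate:.2f}")
print(f"baseline match rate:    {baseline_rate:.2f}")
# A much higher state-media rate would indicate recirculated official phrasing
# in the training sample, analogous to the prevalence gap the study reports.
```

Character shingles are only one plausible way to flag recirculated text; the point of the sketch is the ratio between the state-media match rate and a baseline rate, which parallels the prevalence comparison described below.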

The Findings

The team found that state-scripted news showed up in LLMs’ common training data at a rate 41 times greater than Chinese-language Wikipedia entries. In addition, the models reproduced distinctive memorized phrases from state media, a result of the government’s intentional repetition of its messaging across various platforms. Finally, additional pretraining of an open-weight model on state-scripted news led to more pro-government responses than training on other Chinese text.
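The memorization finding can be illustrated with a simple probe: prompt a model with the first half of a distinctive slogan and check whether it continues with the rest verbatim. The sketch below uses the Hugging Face transformers library with a placeholder model name and phrase; the actual models, prompts, and scoring used in the paper are not reproduced here, so treat every name in the sketch as an assumption.

```python
# Hypothetical memorization probe: does the model complete a distinctive
# phrase verbatim when given only its first half? The model name and the
# phrase are placeholders, not those used in the study.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder open-weight model
phrase = "the campaign brings tangible benefits to the people of the region"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Split the phrase roughly in half: the prefix is the prompt, the suffix
# is what a memorizing model would reproduce.
words = phrase.split()
prefix = " ".join(words[: len(words) // 2])
suffix = " ".join(words[len(words) // 2:])

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=len(tokenizer(suffix)["input_ids"]) + 5,
    do_sample=False,  # greedy decoding: memorized text tends to surface deterministically
)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

print("prompted with:", prefix)
print("model continued:", completion.strip())
print("verbatim match:", completion.strip().startswith(suffix))
```

Greedy decoding is used because memorized continuations tend to appear as the most probable next tokens; a verbatim match on a phrase that is rare outside state media suggests that phrase was present in the training data.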

“State-coordinated content is not just about what appears in official media,” Stewart said. “It is also about recirculation: the same phrasing moving through newspapers, apps, reposts, and ordinary webpages until it looks like part of the broader information environment. Once state-coordinated content is in the training data, the model can launder it into what looks and sounds like neutral, objective information.”

Armed with those findings, the researchers delved further and discovered that the government’s influence on the LLMs appeared most clearly in responses generated in the state’s primary language.

The Implications

“Institutional influence” is the term the team uses to describe the effect state media control has on shaping the behavior of LLMs — even beyond the state’s borders — by altering the training data on which the models rely. Among its most significant implications is that propaganda can become indistinguishable from objective information.

“Large language models separate the message from the messenger,” Stewart said. “What began as a strategic narrative from a powerful government in a state media outlet can reappear as informed commentary from a highly knowledgeable intelligent agent. With no visible source reputation, people lack any signal about the interests that shaped that answer.”

Among the researchers’ recommendations are a call for greater transparency from AI companies about the data their models are trained on, and an extension of the study from text-only models to image and video models.

“The bottom line is that training data does not just fall from the sky — it is produced in a context mediated by socio-political institutions,” the team concluded. “Understanding these institutions can and should be harnessed in the future to better understand the outputs of LLMs.”