2026 · arXiv Preprint · Independent Research

The Invisible Coalition Partner: How LLMs Vote When Democracy Gets Concrete

Prior work has established that instruction-tuned LLMs lean left of center on abstract political questionnaires. I tested whether this holds when models face real policy decisions. Using a dual-instrument methodology grounded in Swiss democracy, I administered the Smartvote questionnaire (75 policy questions) to 66 LLMs, comparing their answers with those of 184 elected parliamentarians, and presented 48 federal referenda to 9 flagship models in four national languages. Abstract and concrete instruments tell fundamentally different stories about the same models.

Dual-instrument design

The first instrument, Smartvote, replicates prior work: abstract policy proposals answered on a four-point scale, compared to the positions of 184 Swiss National Council members across six parties spanning the full left-right spectrum. The second instrument is novel: real federal referenda (Volksabstimmungen) with official government summaries, presented as binary Ja/Nein decisions in German, French, Italian, and Romansh under three information conditions. Party recommendations (Parolen) serve as the political benchmark.
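To make the Smartvote comparison concrete, the sketch below scores agreement between a model and a parliamentarian on the four-point scale as one minus the mean normalized distance across questions. The scale encoding and the agreement function are illustrative assumptions, not the exact metric from the paper.

```python
import numpy as np

# Assumed encoding of the Smartvote four-point scale:
# 0 = no, 1 = rather no, 2 = rather yes, 3 = yes.
SCALE_MAX = 3

def agreement(model_answers: np.ndarray, politician_answers: np.ndarray) -> float:
    """Agreement as 1 minus the mean normalized distance on the answer scale.

    Illustrative metric only; the paper's exact scoring may differ.
    """
    dist = np.abs(model_answers - politician_answers) / SCALE_MAX
    return float(1.0 - dist.mean())

# Toy example: 5 of the 75 questions for one model and one parliamentarian.
model = np.array([3, 2, 0, 1, 3])
politician = np.array([3, 1, 0, 0, 2])
print(f"agreement = {agreement(model, politician):.2f}")
```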

Key findings

On Smartvote, all 66 models converge on the same center-left position (Cohen's d = 3.64, p = 0.0002), replicating the established finding. No structural variable (geography, licensing, model generation) predicts positioning.
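For readers who want to see what sits behind an effect size like d = 3.64, here is a minimal sketch of a two-sample Cohen's d, assuming each model and each parliamentarian has been reduced to a single left-right score; that scoring step and the toy numbers below are assumptions, not values from the study.

```python
import numpy as np
from scipy import stats

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy left-right scores (negative = left); real values would come from the questionnaire.
model_scores = np.array([-0.42, -0.38, -0.45, -0.40, -0.36])
parliament_scores = np.array([-0.10, 0.25, 0.40, -0.30, 0.05, 0.15])

d = cohens_d(model_scores, parliament_scores)
t, p = stats.ttest_ind(model_scores, parliament_scores, equal_var=False)
print(f"Cohen's d = {d:.2f}, Welch t-test p = {p:.4f}")
```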

On Volksabstimmungen, the left-to-right agreement gradient flips: models agree most with centrist Die Mitte and FDP, not with SP and Grüne (Wilcoxon p = 0.008). The leftward bias measured on abstract instruments does not generalize to concrete policy decisions.

For some models, the language of the question changes the answer more than the political content does. Cross-linguistic consistency ranges from 98% (GPT-5.4) to 50% (Mistral), whose approval rate swings from 17% in German to 82% in Romansh. Two models (Grok, Mistral) show systematic change-aversion, voting Nein on 83-94% of referenda regardless of political direction.
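One natural way to operationalize cross-linguistic consistency is the share of referenda on which a model returns the same Ja/Nein decision in all four languages; the sketch below assumes that definition, which may differ from the exact formula used in the paper.

```python
# votes[referendum_id][language] = "Ja" or "Nein" (toy data, not study results).
votes = {
    "R1": {"de": "Ja",   "fr": "Ja",   "it": "Ja",   "rm": "Ja"},
    "R2": {"de": "Nein", "fr": "Ja",   "it": "Ja",   "rm": "Ja"},
    "R3": {"de": "Nein", "fr": "Nein", "it": "Nein", "rm": "Nein"},
}

def cross_linguistic_consistency(votes: dict) -> float:
    """Share of referenda where all four languages yield the same decision."""
    consistent = sum(1 for langs in votes.values() if len(set(langs.values())) == 1)
    return consistent / len(votes)

def approval_rate(votes: dict, language: str) -> float:
    """Share of Ja decisions in one language."""
    decisions = [langs[language] for langs in votes.values()]
    return decisions.count("Ja") / len(decisions)

print(f"consistency = {cross_linguistic_consistency(votes):.0%}")
print(f"approval (de) = {approval_rate(votes, 'de'):.0%}, "
      f"approval (rm) = {approval_rate(votes, 'rm'):.0%}")
```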

Authors: Joel P. Barmettler (Independent Researcher, Zurich)

Published at: arXiv (preprint, 2026)

Do large language models have a political bias?

On abstract political questionnaires, yes: 66 LLMs from 27 model families converge on a center-left position, replicating prior research. However, on concrete policy decisions (Swiss federal referenda), the bias shifts to centrist and status-quo-favoring, suggesting the established leftward bias is instrument-dependent.

What is the dual-instrument methodology?

The study uses two independent instruments grounded in Swiss democratic reality: (1) the Smartvote questionnaire with 75 abstract policy questions administered to 66 LLMs and compared to 184 Swiss parliamentarians, and (2) 48 real federal referenda (Volksabstimmungen) presented to 9 flagship LLMs in four national languages under three information conditions, compared to actual outcomes and party recommendations.

Does the language of a political question change the LLM's answer?

Dramatically for some models. Cross-linguistic consistency ranges from 98% (GPT-5.4) to 50% (Mistral). Mistral's approval rate swings from 17% in German to 82% in Romansh. These shifts do not track the actual Swiss linguistic voting divide (Röstigraben) but reflect model-internal language processing instabilities.

What is the gradient flip finding?

On the abstract Smartvote questionnaire, all models show highest agreement with left-wing parties (SP, Grüne) and lowest with right-wing SVP. On concrete referenda, this gradient flips: models agree most with centrist Die Mitte and FDP, not with SP and Grüne. The Wilcoxon signed-rank test confirms this is systematic (p = 0.008).
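As an illustration of the test itself, the sketch below compares hypothetical per-model agreement rates with the left bloc (SP, Grüne) against the centre bloc (Die Mitte, FDP) using SciPy's paired Wilcoxon signed-rank test. The agreement values are invented; only the procedure mirrors the analysis described above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-model agreement rates with party recommendations (Parolen)
# on the referendum instrument; one entry per flagship model.
agreement_left   = np.array([0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49, 0.51, 0.46])  # SP / Gruene
agreement_centre = np.array([0.71, 0.66, 0.69, 0.73, 0.62, 0.70, 0.68, 0.64, 0.67])  # Die Mitte / FDP

# Paired, non-parametric test: does agreement with the centre systematically
# exceed agreement with the left across models?
stat, p = wilcoxon(agreement_centre, agreement_left, alternative="greater")
print(f"Wilcoxon statistic = {stat}, p = {p:.3f}")
```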

Do LLMs exhibit change-aversion on referenda?

Two models (Grok and Mistral) vote Nein on 83-94% of referenda regardless of whether the proposal is progressive or conservative, suggesting systematic change-aversion rather than political ideology.
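A change-aversion check can be as simple as verifying that the Nein rate stays high on both progressive and conservative proposals; here is a toy version with made-up ballots and assumed direction labels.

```python
# Toy ballots: (political_direction, model_vote); direction labels are assumptions.
ballots = [
    ("progressive", "Nein"), ("progressive", "Nein"), ("progressive", "Ja"),
    ("conservative", "Nein"), ("conservative", "Nein"), ("conservative", "Nein"),
]

def nein_rate(ballots, direction=None):
    """Share of Nein votes, optionally restricted to one political direction."""
    subset = [vote for d, vote in ballots if direction is None or d == direction]
    return subset.count("Nein") / len(subset)

print(f"overall Nein rate:       {nein_rate(ballots):.0%}")
print(f"on progressive ballots:  {nein_rate(ballots, 'progressive'):.0%}")
print(f"on conservative ballots: {nein_rate(ballots, 'conservative'):.0%}")
```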

How well do LLMs predict the popular vote?

Alignment varies dramatically: GPT-5.4 matches 97.9% of referendum outcomes while Grok matches only 60.4%. A temporal analysis splitting pre- and post-release referenda found no significant drop in alignment, arguing against pure training data memorization.
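A sketch of how such a temporal control could look: referenda are split at the model's (hypothetical) release date and the share of matched outcomes is compared across the two halves, with a Fisher exact test standing in for whatever test the paper actually uses. All dates and hit/miss values below are invented.

```python
from datetime import date
from scipy.stats import fisher_exact

MODEL_RELEASE = date(2024, 6, 1)  # hypothetical release date

# Toy records: (referendum_date, model_vote_matched_actual_outcome).
referenda = [
    (date(2022, 9, 25), True),   (date(2023, 6, 18), True),
    (date(2023, 11, 26), False), (date(2024, 3, 3), True),
    (date(2024, 9, 22), True),   (date(2024, 11, 24), False),
    (date(2025, 2, 9), True),    (date(2025, 6, 8), True),
]

pre  = [hit for d, hit in referenda if d <  MODEL_RELEASE]
post = [hit for d, hit in referenda if d >= MODEL_RELEASE]

# 2x2 table of matched vs. missed outcomes, before vs. after release.
table = [[sum(pre),  len(pre)  - sum(pre)],
         [sum(post), len(post) - sum(post)]]
_, p = fisher_exact(table)

print(f"pre-release alignment:  {sum(pre) / len(pre):.0%}")
print(f"post-release alignment: {sum(post) / len(post):.0%}")
print(f"Fisher exact p = {p:.3f}  (no significant drop argues against memorization)")
```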

