<aside>

<aside>

Problem

When AI responses appear equally fluent across languages, how do you reveal differences in evidence quality—especially when non-English queries surface weaker sources?

</aside>

<aside>

Approach

The team designed an interface-level intervention that makes language-based differences in source quality visible and enables users to access verified, cross-lingual evidence.

</aside>

<aside>

My Role

Led research — conducted literature review, defined project scope, and designed a multi-method, experimental evaluation to assess how interface changes impact trust and comprehension.

</aside>

<aside>

Outcome

Increased verification behavior and better-calibrated trust: users sought stronger evidence, recalled more accurate information, and showed more appropriately calibrated confidence in their judgments.

</aside>

</aside>

<aside>

Context & Problem


Large language models (LLMs) are increasingly used as everyday information tools, yet they reflect structural imbalances in the linguistic makeup of the Internet. Although only around one-fifth of the global population speaks English, English accounts for roughly half of all online content. Because training data is English-dominant, these systems tend to perform more robustly—and cite stronger sources—in English than in many other languages.

For users, this imbalance is largely invisible. Responses across languages often appear equally fluent and confident, even when the quality of supporting evidence differs significantly. Prior research and our internal testing revealed a language-based quality gap: in everyday health-related queries, Hungarian-language prompts more often surfaced secondary or non-academic sources, while equivalent English prompts more consistently cited peer-reviewed research. This creates uneven access to verifiable information across users.

<aside>

Below is an example of the same health-related query prompted in English and Hungarian. While both responses appear similarly fluent, the cited sources differ in type and academic rigor.

English prompt: Peer-reviewed and academic citations

Hungarian prompt: Secondary and non-academic sources

</aside>

Rather than intervening at the model level, our team approached this as a design problem. We developed a user-facing intervention that makes language-dependent differences in source quality visible and enables users to request more transparent, verifiable evidence directly through the interface. The goal was not to change the answers themselves, but to help users understand how language shapes the evidence they receive—and to support more informed judgments of credibility.

</aside>

<aside>

My Role


As the research lead, I shaped the project’s direction by conducting the literature review that grounded our understanding of language bias in large language models and defined the scope of the work. I then designed a multi-method evaluation framework, including a randomized between-subjects experiment, behavioral uptake analysis, free-recall coding, and calibrated confidence measures to assess how interface-level changes influenced trust and comprehension across languages.

Throughout the design process, I collaborated closely with two designers to build and refine the prototype, systematically testing model behavior across languages and documenting patterns in source quality, tone, and reliability. These observations directly informed design decisions, ensuring the interface remained grounded in both user experience principles and observed system behavior.

I am currently extending this work to examine how ongoing system changes reshape the relationship between language, evidence quality, and user trust.

</aside>

<aside>

Design Intervention


Because we could not modify the underlying language model, we approached the problem at the interface level. Our goal was to increase user awareness and agency by making language-dependent differences in evidence quality visible during search.

<aside>

We designed a verified-source feature within ChatGPT’s search experience. When a query returns limited reliable sources in the user’s selected language, the system surfaces a contextual prompt explaining the limitation and offering an alternative:

“It seems there’s limited reliable content in this language. Would you like to broaden the search to other languages for more accurate information? I’ll translate the results for you.”

Prototype walkthrough demonstrating the verified-source feature and cross-lingual evidence expansion.

</aside>

If accepted, the system expands the search across languages and translates verified results back into the user’s preferred language. This preserves access to higher-quality evidence while maintaining transparency about how the information was obtained.
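
The flow can be summarized in a short sketch. The code below is an illustrative Python sketch, not the production implementation: every helper (`search`, `reliability_score`, `translate`, `ask_user`) is a hypothetical stub, and the reliability threshold is an assumed value, but the decision sequence mirrors the behavior described above.

```python
# Minimal sketch of the verified-source fallback flow described above.
# All helpers here are stand-in stubs, not a real ChatGPT or retrieval API;
# only the decision sequence is the point.

RELIABILITY_THRESHOLD = 0.7   # assumed cutoff for a "reliable" source
MIN_RELIABLE_SOURCES = 2      # assumed minimum before skipping the prompt

def answer_with_verified_sources(query, user_lang, fallback_langs,
                                 search, reliability_score, translate, ask_user):
    """Return (sources, notice), broadening the search cross-lingually if needed."""
    results = search(query, user_lang)
    reliable = [r for r in results if reliability_score(r) >= RELIABILITY_THRESHOLD]
    if len(reliable) >= MIN_RELIABLE_SOURCES:
        return reliable, None  # enough strong evidence in the user's language

    # Surface the contextual prompt and let the user decide.
    notice = ("It seems there's limited reliable content in this language. "
              "Would you like to broaden the search to other languages for "
              "more accurate information? I'll translate the results for you.")
    if not ask_user(notice):
        return results, notice  # user keeps the original, weaker results

    # Broaden the search, keep only verified sources, translate summaries back.
    expanded = []
    for lang in fallback_langs:
        for r in search(query, lang):
            if reliability_score(r) >= RELIABILITY_THRESHOLD:
                expanded.append({**r,
                                 "summary": translate(r["summary"], user_lang),
                                 "origin_lang": lang})  # keep provenance visible
    return expanded, notice

# Toy demo with stubbed services.
if __name__ == "__main__":
    fake_index = {
        "hu": [{"summary": "blog post", "score": 0.3}],
        "en": [{"summary": "peer-reviewed study", "score": 0.9}],
    }
    sources, notice = answer_with_verified_sources(
        "vitamin D dosage", "hu", ["en"],
        search=lambda q, lang: fake_index.get(lang, []),
        reliability_score=lambda r: r["score"],
        translate=lambda text, lang: f"[{lang}] {text}",
        ask_user=lambda msg: True,  # simulate the user accepting the prompt
    )
    print(notice)
    print(sources)
```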

<aside>

To further support clarity and trust, the interface introduces two visual cues:

Together, these elements are designed to support more informed credibility judgments—while making the role of language in shaping evidence quality explicit.

</aside>

<aside>

User Testing


Aim 1: Behavioral Uptake

We first examined whether participants in the intervention condition activated the verified-source option.

<aside>

80% (24 participants) chose to broaden the search

20% (6 participants) did not

</aside>

Participants who clicked typically cited the disclaimer about limited reliable content and reported wanting access to stronger evidence. Those who did not click generally reported being satisfied with the initial answer.

<aside>

In the control condition, only a small minority (5 participants) indicated they would have followed up if such an option had been available.

</aside>
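
As a small, purely illustrative note on how the headline uptake figure could be summarized statistically, the sketch below computes a Wilson 95% confidence interval around the 24-of-30 activation rate; it is not the study's actual analysis code.

```python
# Illustration only: uncertainty around the observed 80% uptake rate
# (24 of 30 intervention participants broadened the search).
from statsmodels.stats.proportion import proportion_confint

activated, n = 24, 30
low, high = proportion_confint(activated, n, alpha=0.05, method="wilson")
print(f"Uptake: {activated / n:.0%} (95% CI {low:.0%}-{high:.0%})")
```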

Aim 2: Understanding and Confidence

After the interaction phase, participants completed two additional tasks to assess comprehension and evaluation.

Task 1: Free Recall

Participants were asked to write everything they remembered from the response. Answers were coded for the accuracy of recalled information and the degree of nuanced reasoning.

<aside>

Participants who activated the verified-source option recalled significantly more accurate information and demonstrated higher levels of nuanced reasoning than the control group.

Participants in the intervention condition who did not activate the feature did not differ meaningfully from the control group.

</aside>
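
For illustration only, a group comparison like the one reported here could be run as a nonparametric test on per-participant accuracy counts. The scores below are placeholder values, not the study's data, and the actual coding and analysis may have differed.

```python
# Illustration only: comparing coded recall accuracy between groups.
# The score lists are placeholders, not the study's actual data.
from scipy.stats import mannwhitneyu

# Number of accurately recalled statements per participant (toy values).
verified_group = [6, 5, 7, 4, 6, 5]
control_group  = [3, 4, 2, 4, 3, 5]

stat, p = mannwhitneyu(verified_group, control_group, alternative="greater")
print(f"U = {stat:.1f}, one-sided p = {p:.3f}")
```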

Task 2: Confidence Calibration

Participants rated their certainty (1–5 scale) for ten statements, including both overly absolute and appropriately qualified claims.

<aside>

Participants who activated the verified-source option did not simply become more confident overall. The intervention improved epistemic calibration: they became more cautious in response to overly absolute claims and more confident in appropriately qualified ones.

</aside>
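
One way to quantify this calibration pattern is to average the 1-5 certainty ratings separately for the overly absolute and the appropriately qualified statements: better calibration then appears as lower mean certainty on the former and higher mean certainty on the latter. The sketch below uses placeholder item tags and ratings, not the study's data.

```python
# Illustration only: scoring calibrated confidence on the 1-5 certainty task.
# Each statement is tagged as "absolute" (overstated) or "qualified";
# the ratings below are placeholder values for a single participant.
from statistics import mean

statement_types = ["absolute", "qualified", "absolute", "qualified", "absolute",
                   "qualified", "absolute", "qualified", "absolute", "qualified"]
ratings = [2, 4, 1, 5, 2, 4, 3, 4, 2, 5]  # 1-5 certainty ratings

def calibration_summary(types, scores):
    """Mean certainty by claim type: cautious on absolute, confident on qualified."""
    by_type = {"absolute": [], "qualified": []}
    for t, s in zip(types, scores):
        by_type[t].append(s)
    return {t: mean(v) for t, v in by_type.items()}

summary = calibration_summary(statement_types, ratings)
print(summary)  # a lower mean for "absolute" than "qualified" indicates better calibration
```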

<aside>

Key Insights


1. Source limitations are largely invisible unless surfaced

In the control condition, most participants accepted the response without questioning the evidentiary strength.

<aside>

Transparency must be designed into the interaction; users rarely infer source limitations on their own.

</aside>

2. When transparency is introduced, users seek stronger verification

The majority chose to broaden the search when reliability limitations were made explicit.

<aside>

Users are willing to invest additional effort when the system clearly communicates uncertainty.

</aside>

3. The intervention changed how participants evaluated information

Participants who accessed verified sources demonstrated more conditional reasoning and more calibrated confidence.

<aside>

The feature influenced not only interaction behavior but also how users cognitively processed and evaluated claims.

</aside>

</aside>

<aside>

Implications for AI Product Design


1. Fluency is not credibility

Users often accept AI responses at face value, even when sourcing is vague. Systems should not rely on confident language to signal reliability. When evidence quality is limited, that limitation must be made visible.

2. Surface verification pathways

When users are given a clear opportunity to access stronger or cross-lingual evidence, most will take it. Verification should be embedded directly into the interaction flow—not hidden behind advanced features or additional effort.

3. Design for calibrated trust, not maximal trust

Increasing user confidence indiscriminately is not a meaningful success metric. Interfaces should support differentiated confidence—encouraging caution for overstated claims while reinforcing confidence where evidence is appropriately qualified.

4. Interface design shapes epistemic outcomes

Even without modifying the underlying model, interface-level transparency meaningfully influenced user behavior and evaluation patterns. Responsible AI is not only a model-level challenge—it is also a product design responsibility.

</aside>
