Open research · researchers for researchers

Multilingual, multimodal trust evaluation for medical MCQA.

OmniTrust-Med is a community benchmark built from the TrustFullQA healthcare subset. We study whether language models answer medical trust questions (as opposed to generic trivia) reliably when inputs and outputs move between text and speech, across many languages.

Work in progress

The benchmark pipeline, validation tools, recordings, and evaluation harness are under active development. This page is intentionally high-level; the repository README tracks phases, scripts, and current priorities (French and Hindi first).

Overview

What we are building, how the pipeline is organized, and the questions we aim to answer.

What OmniTrust-Med is

A multilingual, multimodal multiple-choice benchmark for medical trust questions — items designed to probe faithful, reliable answers rather than surface-level medical knowledge alone.

Each item can be evaluated in text-only and speech-mediated settings. Items come with human-validated translations, medical entity spans, human audio (multi-speaker English), and a Qwen3-TTS synthetic counterpart for comparison.

Pipeline at a glance

  1. Phase 1 — LLM translation and entity annotation with voting.
  2. Phase 2 — Human validation of text and entity spans.
  3. Phase 3 — Human recordings and Qwen3-TTS synthesis.
  4. Phase 4 — ASR + LLM cascade and end-to-end omni-model evaluation.
  5. Phase 5 — Hugging Face release, paper, harness integration, leaderboard.
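
As a rough illustration of the "voting" step in Phase 1, the sketch below keeps only the entity spans that enough independent LLM annotation runs agree on. The span format `(start, end, label)`, the function name `vote_entity_spans`, and the `min_votes` threshold are assumptions made for this example; the repository's actual voting scheme may differ.

```python
from collections import Counter


def vote_entity_spans(annotations, min_votes):
    """Keep spans proposed by at least `min_votes` annotation runs.

    `annotations` is a list of runs; each run is a list of
    (start, end, label) tuples. This layout is hypothetical.
    """
    counts = Counter()
    for spans in annotations:
        # Deduplicate within a run so each run votes at most once per span.
        counts.update(set(spans))
    kept = [span for span, votes in counts.items() if votes >= min_votes]
    return sorted(kept)


# Three hypothetical LLM runs over the same question text.
runs = [
    [(0, 9, "DRUG"), (24, 32, "SYMPTOM")],
    [(0, 9, "DRUG")],
    [(0, 9, "DRUG"), (24, 32, "SYMPTOM"), (40, 45, "DOSE")],
]
agreed = vote_entity_spans(runs, min_votes=2)
print(agreed)
```

Spans that survive voting would then go to human validators in Phase 2; raising `min_votes` trades recall for fewer spurious spans to review.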

Research questions (preview)

  • How does trust MCQA accuracy change across languages and modalities?
  • Where do ASR errors hurt medically relevant terms?
  • Does speaker diversity (gender, accent) shift measured trust?
  • How far can high-quality TTS stand in for human speech?
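
To make the first two questions concrete, here is a minimal sketch of two measurements they suggest: MCQA accuracy grouped by language and modality, and the share of annotated medical terms that survive in an ASR transcript. The record fields (`lang`, `modality`, `predicted`, `gold`), both function names, and the simple substring match are assumptions for illustration, not the project's documented metrics.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return accuracy per (language, modality) pair."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["lang"], r["modality"])
        totals[key] += 1
        correct[key] += int(r["predicted"] == r["gold"])
    return {key: correct[key] / totals[key] for key in totals}


def entity_recall(entities, transcript):
    """Fraction of entity strings found verbatim (case-insensitive) in a transcript."""
    if not entities:
        return 1.0
    text = transcript.casefold()
    found = sum(1 for e in entities if e.casefold() in text)
    return found / len(entities)


# Hypothetical evaluation records, one per (item, condition).
records = [
    {"lang": "fr", "modality": "text", "predicted": "B", "gold": "B"},
    {"lang": "fr", "modality": "text", "predicted": "A", "gold": "C"},
    {"lang": "fr", "modality": "speech", "predicted": "C", "gold": "C"},
    {"lang": "hi", "modality": "speech", "predicted": "D", "gold": "A"},
]
scores = accuracy_by_condition(records)

# A made-up ASR transcript in which one medical term was misrecognized.
recall = entity_recall(["ibuprofen", "asthma"], "Is I be profen safe with asthma?")
print(scores, recall)
```

Exact substring matching is deliberately crude; a real analysis would likely need normalization or fuzzy matching, especially for morphologically rich languages.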

Full experiment matrix and metrics are documented in the repository; results will appear here when ready.

Coverage & artifacts

Languages under study and where to find data and documentation.

20 target languages

English is the source language. Other languages are produced via translation, entity enrichment, and human review before audio work begins.

en, fr, hi, mr, es, it, ar, hbs, ary, zh, zh-TW, vi, ru, bho, uk, sq, cs, apc, fa, si