2026 · On GitHub · PyPI · npm · HuggingFace

gheim: An open Swiss-language PII dataset and NER model

Existing public PII detectors handle Swiss text poorly. openai/privacy-filter reaches a strict-span F1 of 0.44 on the gheim Swiss test set. Microsoft Presidio reaches 0.43. AI4Privacy's distilbert fine-tunes drop to 0.16 on real Swiss text. None of them fluently handle the four official Swiss languages together with structured Swiss-format PII (CH-IBAN, AHV/AVS, VAT-CHE), and only SwissNER ships a Romansh slice.

gheim is the artifact built to close that gap: a dataset, a model trained on it, and a reproducible evaluation harness that scores both against the open alternatives. The full write-up is the technical report at the bottom of this page. What follows is the skim.

Dataset

gheim-ch-pii-212k is 212,503 chunks of real Swiss text from the Apertus pretraining corpora (court rulings, federal parliament records, Swiss-filtered FineWeb-2, a Romansh corpus) plus a synthetic gap-fill layer for sparse cells. It covers five languages (de_CH, fr_CH, it_CH, rm, en) and eight PII categories, with a 33-class BIOES tag scheme byte-identical to that of openai/privacy-filter. Released under CC BY 4.0.
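The class count follows from the tag scheme: BIOES assigns four positional tags (B, I, E, S) to each category and adds one shared O tag. A minimal sketch of that arithmetic, where the category names are placeholders and the label ordering is an assumption rather than the released label map:

```python
# Placeholder category names; the released label set may use different names.
CATEGORIES = [f"CAT_{i}" for i in range(8)]


def build_bioes_labels(categories):
    """Return a flat label list: O plus B-/I-/E-/S- for every category."""
    labels = ["O"]
    for cat in categories:
        for prefix in ("B", "I", "E", "S"):
            labels.append(f"{prefix}-{cat}")
    return labels


def tag_span(num_tokens, category):
    """BIOES tags for a single entity that covers num_tokens tokens."""
    if num_tokens == 1:
        return [f"S-{category}"]  # single-token entity
    inside = [f"I-{category}"] * (num_tokens - 2)
    return [f"B-{category}"] + inside + [f"E-{category}"]


LABELS = build_bioes_labels(CATEGORIES)
print(len(LABELS))  # 8 categories * 4 tags + O
```

Eight categories give 8 × 4 + 1 = 33 labels, which matches the stated scheme size.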

Three open-weights LLMs label in parallel under the same JSON schema: Gemma 4 26B-A4B, Qwen3.6 35B-A3B, and Nemotron-3 Nano Omni 30B-A3B. A separate regex pass gated by checksum validators (ISO 13616 for IBAN, ISO 7064 for VAT-CHE, Luhn for cards) scans the same corpora independently, so structured PII the LLMs missed still gets picked up. A fourth format-aware annotator runs only on the commercial-register subset, which the general-purpose labellers under-recall. The four streams merge at the span level with per-span provenance kept. A Geonames-CH gazetteer demotes municipality names that get mis-tagged as people. A five-phase balancer then caps per-document, per-value, and per-cell concentration. Templated synthetic chunks filled by Faker_CH populate the cells where real-text coverage was still insufficient. Splits are document-isolated.
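The checksum gating described above can be illustrated for two of the named validators. This is a sketch under assumptions, not the project's regex catalogue: the candidate pattern `IBAN_CH` and the function names are hypothetical, and only the ISO 13616 mod-97 check and the Luhn check are shown.

```python
import re


def iban_is_valid(iban: str) -> bool:
    """ISO 13616 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and require value mod 97 == 1."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{1,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_is_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10 == 0."""
    s = number.replace(" ", "").replace("-", "")
    if len(s) < 2 or not s.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical candidate pattern for Swiss IBANs (21 characters, optional spaces).
IBAN_CH = re.compile(r"\bCH\d{2}(?: ?[0-9A-Z]{4}){4} ?[0-9A-Z]\b")


def find_swiss_ibans(text: str):
    """Return (start, end, text) for regex candidates that also pass the checksum."""
    return [
        (m.start(), m.end(), m.group())
        for m in IBAN_CH.finditer(text)
        if iban_is_valid(m.group())
    ]
```

The regex only proposes candidates. The validator decides, so a string that merely looks like an IBAN but fails the checksum is not emitted as a span.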

To put a number on residual label disagreement, a 580-chunk sample from val and test was re-annotated by four 2026-era LLMs (Kimi K2.6, DeepSeek V4 Pro, Minimax M2.7, GLM 5.1) and reduced to per-span majority vote. The released labels score F1 = 0.71 against that consensus. The breakdown of the gap is in the report: it is dominated by policy disagreement (the dataset flags publicly-listed officials by design), not random annotator error.
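A per-span majority vote of this kind can be sketched as follows. The span representation `(start, end, label)`, the exact-match criterion, and the strict-majority threshold (three of four annotators, so a 2-2 split is dropped) are assumptions for illustration; the report's actual voting rule may differ.

```python
from collections import Counter


def majority_vote(annotations, min_votes=None):
    """Reduce several annotators' span sets to a consensus set.

    annotations: one iterable of (start, end, label) tuples per annotator.
    A span is kept when at least min_votes annotators proposed exactly
    that span; the default is a strict majority.
    """
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1
    # set(ann) ensures an annotator cannot vote twice for the same span.
    counts = Counter(span for ann in annotations for span in set(ann))
    return {span for span, votes in counts.items() if votes >= min_votes}


# Toy example with four hypothetical annotators.
a = [(0, 4, "PERSON"), (10, 16, "LOCATION")]
b = [(0, 4, "PERSON"), (10, 16, "LOCATION")]
c = [(0, 4, "PERSON")]
d = [(0, 4, "PERSON"), (10, 16, "PERSON")]
consensus = majority_vote([a, b, c, d])
```

In the toy example only the first span survives: the second was proposed by two annotators with one label and by one with another, so no variant reaches three votes.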

Model

gheim-ch-560m is a token-classification head on xlm-roberta-large (560M parameters). XLM-R-large was selected from a bake-off against ZurichNLP/swissbert (270M dense). Both bases received an identical 5 × 3 sweep over (learning rate, layer-wise LR decay) at one epoch. XLM-R-large won by 0.8 pp validation F1. The winning configuration was then trained for three full epochs: AdamW with a peak learning rate of 5e-5, a cosine schedule with 5% warmup, effective batch size 128 (per-device 64 × 2 GPUs under DDP), bf16, and max sequence length 512, on 2 × RTX 4090. Wall time was around 66 minutes.
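The two swept quantities and the schedule can be made concrete with a small sketch. It assumes linear warmup followed by cosine decay to zero, which is a common convention but not confirmed by the text, and it uses a decay factor of 0.9 purely as a hypothetical value, since the winning layer-wise decay is not stated here. The 24-layer count is that of the xlm-roberta-large encoder.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.05):
    """Learning rate at a given optimiser step.

    Linear warmup over the first warmup_frac of training, then cosine
    decay from peak_lr down to zero (assumed shape).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def layerwise_lrs(base_lr, decay, num_layers=24):
    """Per-layer learning rates under layer-wise LR decay.

    The top encoder layer gets base_lr; each layer below it is scaled
    by one more factor of decay. Index 0 is the lowest layer.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Effective batch size from the stated setup: 64 per device on 2 GPUs.
effective_batch = 64 * 2

# Hypothetical decay value, for illustration only.
lrs = layerwise_lrs(5e-5, decay=0.9)
```

With layer-wise decay, lower layers move more slowly than the task head and top layers, which is the usual motivation for sweeping it together with the base learning rate.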

Released under Apache 2.0. Ships fp32 PyTorch, fp32 and fp16 ONNX, and an int8 ONNX export so the demo can run in-browser through transformers.js. A sibling gheim-ch-560m-research checkpoint adds external NER/PII corpora (openpii-1m, WikiNeural, CoNLL-2003) to the training mix. It ties the main model on in-domain Swiss PII but gains 20 pp on Swiss-news transfer and 46 pp on Romansh. The research checkpoint is released under a non-commercial licence.

Headline numbers

System                                        Strict-span F1   Char F1
gheim-ch-560m (this)                          0.910            0.946
openai/privacy-filter                         0.443            0.610
Microsoft Presidio                            0.434            0.562
Isotonic/distilbert_finetuned_ai4privacy_v2   0.16             0.46

All systems were scored on the same 21,246-chunk held-out test split. The full per-language by per-category breakdown, methodology validation against each baseline's published numbers, and cross-domain evaluation on four external benchmarks (swissner, openpii-1m, WikiNeural, CoNLL-2003) are in the report.
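The difference between the two metric columns can be shown on a toy case. The definitions below are assumptions based on the usual meaning of the terms (exact boundary and label match for strict-span, per-character label overlap for char F1); the report's harness defines the exact scoring and averaging.

```python
def _f1(tp, n_pred, n_gold):
    """F1 from true positives and the sizes of the predicted and gold sets."""
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


def strict_span_f1(pred, gold):
    """A predicted span counts only if start, end, and label all match a gold span."""
    pred, gold = set(pred), set(gold)
    return _f1(len(pred & gold), len(pred), len(gold))


def char_f1(pred, gold):
    """Score labelled characters instead of whole spans (end index exclusive)."""
    pred_chars = {(i, label) for start, end, label in pred for i in range(start, end)}
    gold_chars = {(i, label) for start, end, label in gold for i in range(start, end)}
    return _f1(len(pred_chars & gold_chars), len(pred_chars), len(gold_chars))


# The prediction clips the last two characters of a ten-character gold span.
gold = [(0, 10, "PERSON")]
pred = [(0, 8, "PERSON")]
strict = strict_span_f1(pred, gold)
char = char_f1(pred, gold)
```

Here the strict-span score is 0.0 because the boundaries differ, while the character-level score is 8/9 because most of the span is covered. That gap is why a system can look noticeably better on char F1 than on strict-span F1.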

SDKs

Two thin wrappers expose the model for the typical use case: anonymise the prompt, send it to any LLM, restore the originals as the response streams back. Both ship a drop-in replacement for the official OpenAI client.

from gheim.openai import OpenAI

client = OpenAI()
r = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi, my name is Joel."}],
)
# r.choices[0].message.content contains "Joel".
# OpenAI only ever saw "<PERSON_1>".
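The round trip behind that wrapper can be sketched without the SDK. The code below is not gheim's implementation: `anonymise`, `restore_stream`, and the span format are hypothetical, and the spans are passed in by hand where the real wrapper would obtain them from the model. Only the `<PERSON_1>` placeholder style is taken from the example above.

```python
import re

PLACEHOLDER = re.compile(r"<[A-Z_]+_\d+>")


def anonymise(text, spans):
    """Replace each (start, end, label) span with a numbered placeholder.

    Spans must not overlap. Repeated values reuse the same placeholder.
    Returns the masked text and a placeholder -> original mapping.
    """
    mapping, seen, counters = {}, {}, {}
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        key = (label, original)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"<{label}_{counters[label]}>"
            mapping[seen[key]] = original
        out.append(text[cursor:start])
        out.append(seen[key])
        cursor = end
    out.append(text[cursor:])
    return "".join(out), mapping


def restore_stream(chunks, mapping):
    """Restore originals in a streamed response.

    A placeholder may be split across chunks, so text after the last
    unclosed '<' is held back until the closing '>' arrives.
    """
    def fill(s):
        return PLACEHOLDER.sub(lambda m: mapping.get(m.group(), m.group()), s)

    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = buffer.rfind("<")
        if cut != -1 and ">" not in buffer[cut:]:
            ready, buffer = buffer[:cut], buffer[cut:]
        else:
            ready, buffer = buffer, ""
        if ready:
            yield fill(ready)
    if buffer:
        yield fill(buffer)  # flush whatever is left at end of stream


masked, mapping = anonymise("Hi, my name is Joel.", [(15, 19, "PERSON")])
restored = "".join(restore_stream(["Hello <PER", "SON_1>, nice to meet you."], mapping))
```

One limitation of this sketch: a literal `<` in the response that is never closed is held back until the stream ends, so a production version would want to bound the buffer.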

The live demo at gheim.ch runs the full ONNX model in the browser via WebGPU, with no server inference. The Python SDK, the JS/TS SDK, and a self-host detection server (air-gap-friendly, weights baked into the image) are published as the deployment surface around the model.

Repository: github.com/joelbarmettlerUZH/gheim

Model: gheim-ch-560m

Dataset: gheim-ch-pii-212k

What is gheim?

An open Swiss-language PII NER artifact: a 212k-chunk multilingual dataset covering the four official Swiss languages and English, and a 560M xlm-roberta-large fine-tune trained on it. Two thin SDKs (Python and JavaScript) wrap the model for the LLM-anonymisation round-trip.

How accurate is gheim-ch-560m?

0.910 strict-span F1 on the held-out test split (0.946 char F1). The next-best public detector on the same Swiss test set is openai/privacy-filter at 0.443.

How is the dataset labelled?

Three open-weights LLMs (Gemma 4, Qwen3.6, Nemotron-3) label in parallel. A checksum-validated regex catalogue (IBAN, AHV, VAT-CHE, Luhn) scans independently. A Geonames-CH gazetteer demotes municipality names mis-tagged as people. A five-phase balancer caps per-document and per-cell concentration. A synthetic gap-fill layer populates sparse cells.



Copyright 2026 - Joel P. Barmettler ·Impressum·Privacy