PDF extraction has gotten good. Tools like MinerU, Docling, and Marker can pull text, tables, and images out of PDFs and produce clean markdown. But they share a common failure mode: heading hierarchy. A document with six levels of nesting comes out with every heading at # or ##. The content is there, but the structure that makes it navigable is gone.
md-reheader is a fine-tuned Qwen3-0.6B model that reads a markdown document and predicts the correct heading level for each heading. It is small enough to run on CPU, published on PyPI and HuggingFace, and works as a post-processing step in any PDF-to-markdown pipeline.
Heading levels encode document structure. They tell you that section 3.2.1 is a subsection of 3.2, which belongs to chapter 3. PDF parsers can detect that something is a heading (larger font, bold weight), but inferring the correct nesting depth requires understanding the document semantics: what is a chapter title, what is a subsection, what is a minor detail. Most parsers do not attempt this and output flat headings instead.
Install with `pip install md-reheader`, then run in three lines:

```python
from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettlerUZH/md-reheader")
fixed = reheader_document(markdown_text, model, tokenizer)
```
Under the hood, the pipeline extracts all headings from the document using markdown-it-py, flattens them to level 1, strips body text to the first and last 128 tokens per section (preserving structural cues without bloating the context), truncates to 8k tokens, and sends the result to the model. The model outputs the correct heading prefix for each heading, and the pipeline applies those levels back to the original document.
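The flattening and body-stripping steps are simple to picture. Below is a minimal sketch of both; the function names are mine, and the regex-based flattening is a simplification (the actual pipeline parses with markdown-it-py, which also correctly ignores `#` lines inside code fences):

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)

def flatten_headings(markdown_text: str) -> str:
    # Rewrite every ATX heading to level 1, as the pipeline does
    # before the document is sent to the model.
    return HEADING_RE.sub(lambda m: "# " + m.group(2), markdown_text)

def strip_body(section_text: str, keep: int = 128) -> str:
    # Keep only the first and last `keep` whitespace-delimited tokens
    # of a section body: structural cues survive, context stays small.
    toks = section_text.split()
    if len(toks) <= 2 * keep:
        return section_text
    return " ".join(toks[:keep]) + " ... " + " ".join(toks[-keep:])
```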
The model was fine-tuned on ~197k documents from two sources: GitHub markdown files from codeparrot/github-code and Wikipedia articles from euirim/goodwiki. The training set is published as joelbarmettlerUZH/md-reheader-dataset.
Heading levels follow a power-law distribution: H2 and H3 dominate, while H5 and H6 are rare. To address this, documents with deep nesting are oversampled (depth 4 at 2x, depth 5 at 4x, depth 6 at 8x). Documents are split by repository name (GitHub) or article title (Wikipedia) to prevent data leakage between train and test sets.
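The oversampling scheme amounts to a small weighting table. A minimal sketch, with names of my own choosing rather than the package's:

```python
import random

# Oversampling factors stated above: depth 4 at 2x, depth 5 at 4x,
# depth 6 at 8x. Shallower documents are kept once.
OVERSAMPLE_FACTOR = {4: 2, 5: 4, 6: 8}

def oversample(docs):
    # docs: iterable of (text, max_heading_depth) pairs
    out = []
    for text, depth in docs:
        out.extend([text] * OVERSAMPLE_FACTOR.get(depth, 1))
    random.shuffle(out)  # avoid long runs of duplicated documents
    return out
```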
Training uses Axolotl on 2x NVIDIA RTX 4090 GPUs with DDP, BF16 precision, flash attention, and gradient checkpointing. The learning rate is set higher than typical (5e-5) because assistant tokens make up only ~2% of the sequence. One epoch is optimal; a second epoch overfits.
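An Axolotl config capturing these settings might look like the fragment below. Key names follow Axolotl's config schema; any value not stated above (e.g. the exact base-model identifier) is an assumption:

```yaml
base_model: Qwen/Qwen3-0.6B   # assumed model id
sequence_len: 8192            # matches the 8k-token input truncation
bf16: true
flash_attention: true
gradient_checkpointing: true
learning_rate: 5.0e-5         # higher than typical; assistant tokens are ~2% of the sequence
num_epochs: 1                 # a second epoch overfits
train_on_inputs: false        # loss only on the predicted heading lines
```

DDP across the two GPUs then comes from the launcher (e.g. `accelerate launch -m axolotl.cli.train config.yml`) rather than the config itself.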
The current approach (V3) emerged from two earlier attempts:
V1 kept full document text and randomly corrupted heading levels. The model learned to reproduce text (98% of the loss) rather than predict levels. It defaulted to H2 for almost everything.
V2 replaced heading text with marker tokens and only output numeric levels. Removing the heading text removed too much semantic signal. Performance dropped.
V3 strips body text but keeps heading text in the output. This gives the model the semantic cues it needs from pretraining while keeping sequences short enough for efficient training. All input headings are flattened to # so the model must infer structure from content alone, not copy existing levels.
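Concretely, a V3 training pair might look like the following. This is an illustrative reconstruction of the format from the description above, not an actual dataset row:

```
Input (all headings flattened to #, body stripped):
# Installation
To install the package, run ... see the troubleshooting notes below.
# Requirements
# Troubleshooting

Target (heading prefixes with heading text preserved):
## Installation
### Requirements
### Troubleshooting
```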
Benchmarked on 7,321 held-out documents against an all-H1 baseline and a heuristic rule-based approach:
| Metric | All-H1 baseline | Heuristic | md-reheader |
|---|---|---|---|
| Exact document match | 0.0% | 14.5% | 56.1% |
| Per-heading accuracy | 13.1% | 49.1% | 80.6% |
| Hierarchy preservation | 61.3% | 68.6% | 91.0% |
| Mean absolute error | 1.38 | 0.62 | 0.22 |
Performance is strongest on H1-H3 (77-85% accuracy). H5 and H6 reach 45-50%, with most errors being off-by-one: the relative structure is preserved even when absolute levels shift. Wikipedia articles score higher (71.3% exact match, 95.5% per-heading) than GitHub markdown (49.5%, 74.0%), likely because Wikipedia follows more consistent heading conventions.
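Three of the four metrics in the table reduce to simple per-document arithmetic over predicted versus true heading levels. A sketch under my own naming, not the benchmark harness's (hierarchy preservation is omitted, since its exact definition is not spelled out here):

```python
def heading_metrics(pred, true):
    # pred/true: lists of heading levels (1-6), one per heading, same order.
    assert len(pred) == len(true) and len(true) > 0
    correct = sum(p == t for p, t in zip(pred, true))
    return {
        "exact_match": correct == len(true),  # whole document correct
        "per_heading": correct / len(true),   # fraction of headings correct
        "mae": sum(abs(p - t) for p, t in zip(pred, true)) / len(true),
    }
```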
On an RTX 4090, inference takes 0.4s for short documents (<1k tokens) and 3.4s for long ones (4-8k tokens). On CPU, the same range is 5-60s.
The model compresses deep hierarchies. Documents with 5-6 heading levels often come back with levels shifted inward by one or two. Heading levels are inherently subjective at the boundaries, and the model has learned majority conventions rather than universal rules. Documents longer than 8k tokens are truncated from the end, so late headings retain their input levels.
- Repository: github.com/joelbarmettlerUZH/md-reheader
- Model: huggingface.co/joelbarmettlerUZH/md-reheader
- Dataset: huggingface.co/datasets/joelbarmettlerUZH/md-reheader-dataset
- Package: pypi.org/project/md-reheader
Copyright 2026 - Joel P. Barmettler