A focused Hebrew RoBERTa project delivering the first large Hebrew language model, achieving state-of-the-art benchmark performance; pre-trained on a TPUv4-128 pod and evaluated on private hardware.
The first large Turkish RoBERTa-style model, developed after PortBERT, with extensive evaluations on private GPUs and the LRZ BayernKI H100 cluster. The study highlights the importance of corpus variance over sheer corpus size.
A domain-adaptive German medical RoBERTa model, exploring continued pre-training and from-scratch training with specialized vocabularies.
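As a rough illustration of one of the two regimes this project explores, below is a minimal sketch of continued pre-training with the Hugging Face Trainer. The checkpoint id, corpus path, and hyperparameters are placeholder assumptions, not the project's actual setup; a from-scratch run would instead initialize the model from a config together with a newly trained domain-specific tokenizer.

```python
# Minimal sketch of domain-adaptive continued pre-training (masked language
# modeling) with Hugging Face Transformers. All names below are illustrative:
# the checkpoint id, corpus file, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from an existing general-domain German RoBERTa checkpoint (assumed id).
# Training from scratch would instead build the model from an AutoConfig and a
# tokenizer trained on the domain corpus.
checkpoint = "uklfr/gottbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Domain corpus: plain-text file, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# RoBERTa-style masked-language-modeling objective with 15% token masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="medical-roberta",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```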
A Portuguese RoBERTa model evaluated during a research stay on the Azores, offering an efficiency-focused perspective on model development.
A continued pre-training extension of GottBERT, developed during a period of transition and finalized as a preprint before being presented at GlobalNLP@RANLP 2025.
The first published German RoBERTa-based model family with a clear development path: from its 2020 preprint to the extended EMNLP 2024 version.
A modern medical search engine developed from scratch with a fully open, extensible information-retrieval (IR) architecture.