A focused Hebrew RoBERTa project delivering the first large Hebrew language model, achieving state-of-the-art benchmark performance; pre-trained on a TPUv4-128 pod and evaluated on private hardware.
The first large Turkish RoBERTa-style model, developed after PortBERT, with extensive evaluations on private GPUs and the LRZ BayernKI H100 cluster. The study highlights the importance of corpus variance over sheer corpus size.
A domain-adaptive German medical RoBERTa model, exploring continued pre-training and from-scratch training with specialized vocabularies.
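As a rough illustration of one of the two regimes this project explores, below is a minimal sketch of continued pre-training with the Hugging Face Trainer. The checkpoint id, corpus path, and hyperparameters are placeholder assumptions, not the project's actual setup; a from-scratch run would instead initialize the model from a config together with a newly trained domain-specific tokenizer.

```python
# Minimal sketch of domain-adaptive continued pre-training (masked language
# modeling) with Hugging Face Transformers. All names below are illustrative:
# the checkpoint id, corpus file, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from an existing general-domain German RoBERTa checkpoint (assumed id).
# Training from scratch would instead build the model from an AutoConfig and a
# tokenizer trained on the domain corpus.
checkpoint = "uklfr/gottbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Domain corpus: plain-text file, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# RoBERTa-style masked-language-modeling objective with 15% token masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="medical-roberta",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```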
A Portuguese RoBERTa model evaluated during a research stay on the Azores, offering an efficiency-focused perspective on model development.
A continued pre-training extension of GottBERT, developed during a period of transition and finalized as a preprint before being presented at GlobalNLP@RANLP 2025.
The first published German RoBERTa-based model family with a clear development path: from its 2020 preprint to the extended EMNLP 2024 version.
A modern medical search engine developed from scratch with a fully open, extensible information-retrieval (IR) architecture.