Corpus Design

SindBERT

The first large Turkish RoBERTa-style model, developed after PortBERT with extensive evaluations on private GPUs and the LRZ BayernKI H100 cluster. The study highlights the importance of corpus variance over sheer size.

ChristBERT

A domain-adaptive German medical RoBERTa model, exploring continued pre-training and from-scratch training with specialized vocabularies.

GeistBERT

A continued-pretraining extension of GottBERT, developed during a period of transition and finalized as a preprint before being presented at GlobalNLP@RANLP 2025.