ML Researcher · LLM Evaluation · Computational Pragmatics · Berlin
I develop methods to evaluate whether large language models genuinely understand social and pragmatic language: politeness, register, precision, and the other fine-grained signals humans navigate effortlessly.
My background spans Computer Science (BSc), Interdisciplinary Media Studies (MSc), and Computational Linguistics (PhD), with 15+ years of formal modeling work in game theory, probabilistic NLP, and multi-agent systems.
Current focus:
- Novel calibration metrics (ESR, CDS) for LLM evaluation on social meaning tasks
- Benchmarking GPT-4, Claude, and Gemini on pragmatic phenomena
- Probabilistic speaker models of politeness and register
Stack: Python · PyTorch · Hugging Face · scikit-learn · R
Links: muehlenbernd.net · LinkedIn