AI Safety Engineer @ Gray Swan AI

Mateusz Dziemian

I build benchmarks and red teaming tools to stress-test AI. My best-known works are AgentHarm (200+ citations) and segformer_b2_clothes (30M+ downloads on Hugging Face). Currently interested in eval awareness and automating AI safety research.

Selected Research
Security Challenges in AI Agent Deployment: Insights from a Large-Scale Public Competition
A. Zou, M. Lin, E. Jones, M. Nowak, M. Dziemian, N. Winter, A. Grattan, V. Nathanael, A. Croft, X. Davies, J. Patel, R. Kirk, N. Burnikell, Y. Gal, D. Hendrycks, J. Z. Kolter, M. Fredrikson
NeurIPS 2025
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y. Gal, X. Davies
ICLR 2025 · 200+ citations
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
S. Lermen, M. Dziemian, N. Pérez-Campanero Antolín
NeurIPS 2024 · SATA Workshop
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
S. Lermen, M. Dziemian, G. Pimpale
NeurIPS 2024 · SafeGenAi Workshop
About

AI Safety Engineer at Gray Swan AI. I work on red teaming AI agents and building safety benchmarks. At Gray Swan I focus on automated red teaming (Shade), our public red teaming competitions (the Arena), and pre-release safety evaluations for frontier models.

Currently interested in eval awareness and automating AI safety research, in the direction of work like Petri and AuditBench.

Outside of research, I'm a purple belt in BJJ (10th Planet London) turned boulderer, currently projecting V6. Based in London.

30M+ HF downloads
200+ citations
4 papers
V6 bouldering

Open Source

AgentHarm

Open-source subset of the AgentHarm benchmark for measuring harmfulness of LLM agents. 44 of the 110 unique behaviors are publicly available, covering 11 harm categories. Available on Hugging Face and Inspect AI.

200+ citations
huggingface inspect paper

segformer_b2_clothes

Fine-tuned SegFormer for body part and clothing segmentation. One of the most liked and most downloaded segmentation models on Hugging Face.

30,000,000+ downloads
huggingface

Contributions

PRs to Inspect AI for Hugging Face agent support. Sections on multimodal and generative models for the Hugging Face Computer Vision course.

github
Contact

Let's talk.

Interested in Gray Swan, collaborating on research, or just doing interesting things? Reach out.