Paper Club: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Date
Thursday 11 September 2025
Time
19:00 - 21:00
Location
Lorong AI

About the event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

This paper addresses a critical problem with open-weight AI models: traditional post-training safety measures can be bypassed with just a few hundred fine-tuning steps. Instead of teaching models to refuse harmful requests after training, the researchers test whether filtering dangerous content from pretraining data creates more durable safeguards. They develop a multi-stage pipeline to remove biothreat-related content from training data, creating models with "deep ignorance" of certain dangerous topics.