Paper Club: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Name: Paper Club: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Start: 2025-09-11T11:00:00.000Z
End: 2025-09-11T13:00:00.000Z
Location: Lorong AI

Date: Thursday 11 September 2025
Time: 19:00 - 21:00 (UTC+8)

About the event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

This paper addresses a critical problem with open-weight AI models: traditional post-training safety measures can be bypassed with just a few hundred fine-tuning steps. Instead of teaching models to refuse harmful requests after training, the researchers test whether filtering dangerous content from pretraining data creates more durable safeguards. They develop a multi-stage pipeline to remove biothreat-related content from training data, creating models with "deep ignorance" of certain dangerous topics.