Paper Club: Preference Learning with Lie Detectors can Induce Honesty or Evasion

Date
Thursday 7 August 2025
Time
19:00 - 21:00
Location
Lorong AI

About the event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Last session, we discussed how Chain-of-Thought (CoT) reasoning is transparent when models need to "think out loud" to arrive at the answer. This week, we investigate what happens when you detect lies by looking directly at a model's internal states, and when you try to use these "lie detectors" to train models to be more honest.

The core tension: we want our AI systems to be honest, but we can end up preferring lies we find convincing over uncomfortable truths. Will including a lie detector in the training process make models more honest? Or will it simply train models to hide their lies better?