Paper Club: Preference Learning with Lie Detectors can Induce Honesty or Evasion

Name: Paper Club: Preference Learning with Lie Detectors can Induce Honesty or Evasion
Start: 2025-08-07T11:00:00.000Z
End: 2025-08-07T13:00:00.000Z
Location: Lorong AI

Date: Thursday 7 August 2025
Time: 19:00 - 21:00
Location: Lorong AI

About the event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Last session, we discussed how Chain-of-Thought (CoT) reasoning is transparent when models need to "think out loud" to arrive at the answer. This week, we investigate what happens when you detect lies by looking directly at a model's internal states, and when you try to use these "lie detectors" to train models to be more honest.

The core tension: we want our AI systems to be honest, but we can end up preferring lies we find convincing over uncomfortable truths. Will including a lie detector in the training process make models more honest? Or will it simply train models to hide their lies better?