
Would I lie to you? Inference time alignment of language models using direct preference heads

Avelina Asada Hadji-Kyriacou, Ognjen Arandjelović

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.
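The abstract does not spell out the architecture or the objective, so the following is only a toy sketch of the general idea it describes: an auxiliary reward head scores candidate outputs at inference time while the language modeling head is left untouched. The names `reward_head` and `rerank`, the linear-plus-sigmoid head, and the use of one pooled hidden vector per candidate are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def reward_head(hidden: Sequence[float],
                weights: Sequence[float],
                bias: float = 0.0) -> float:
    """Hypothetical linear reward head.

    Maps a pooled hidden state to a preference score in (0, 1).
    A real model would take this vector from the transformer's
    final layer; here it is just a list of floats.
    """
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def rerank(candidates: List[str],
           hidden_states: List[Sequence[float]],
           weights: Sequence[float],
           bias: float = 0.0) -> str:
    """Return the candidate whose hidden state gets the highest reward.

    The candidates are assumed to have been sampled from the language
    modeling head already; this step only selects among them and does
    not change the distribution they were drawn from.
    """
    scores = [reward_head(h, weights, bias) for h in hidden_states]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


if __name__ == "__main__":
    # Made-up candidates and hidden vectors, for illustration only.
    outputs = ["answer A", "answer B", "answer C"]
    pooled = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
    print(rerank(outputs, pooled, weights=[1.0, 1.0]))
```

Because selection happens after sampling, a scheme of this shape can express preferences without fine-tuning the generator's output distribution, which is the property the abstract emphasizes.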

Original language: English
Title of host publication: Advances in neural information processing systems 37
Subtitle of host publication: 38th conference on neural information processing systems (NeurIPS 2024)
Editors: A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang
Publisher: Neural Information Processing Systems Foundation, Inc. (NeurIPS)
Pages: 1-26
Number of pages: 26
ISBN (Print): 9798331314385
Publication status: Published - 9 Dec 2024
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 9 Dec 2024 to 15 Dec 2024
Conference number: 38
https://nips.cc/Conferences/2024/Dates

Publication series

Name: Advances in neural information processing systems
Publisher: Neural Information Processing Systems Foundation, Inc. (NeurIPS)
Volume: 37
ISSN (Print): 1049-5258

Conference

Conference: 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Abbreviated title: NeurIPS 2024
Country/Territory: Canada
City: Vancouver
Period: 9/12/24 to 15/12/24
Internet address
