Deakin University

Multi-Reference Preference Optimization for Large Language Models

journal contribution
posted on 2025-05-30, 18:28 authored by H Le, QH Tran, D Nguyen, K Do, S Mittal, K Ogueji, Svetha Venkatesh
How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preferences on model outputs and finetune the LLMs accordingly, while ensuring that updates do not deviate too far from a reference model. Recent approaches, such as direct preference optimization (DPO), have eliminated the need for unstable and sluggish reinforcement learning optimization by introducing closed-form supervised losses. However, a significant limitation of these approaches is that they are designed for a single reference model only, neglecting to leverage the collective power of numerous pretrained LLMs. To overcome this limitation, we introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models, substantially enhancing preference learning capabilities compared to single-reference DPO. Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference datasets, regardless of data scarcity or abundance. Furthermore, MRPO effectively finetunes LLMs to achieve superior performance on several downstream natural language processing benchmarks such as HH-RLHF, GSM8K, and TruthfulQA.
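
To make the preference-learning setup concrete, the sketch below shows the standard single-reference DPO loss and one possible way a multi-reference variant could aggregate log-probabilities from several reference models. The multi_reference_dpo_loss function, its uniform weighting, and all argument names are illustrative assumptions; the abstract does not give MRPO's closed-form objective, so this is not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Standard single-reference DPO loss: maximise the margin between the
        # policy's implicit reward for the chosen and the rejected response,
        # both measured relative to the reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    def multi_reference_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                 ref_chosen_logps_list, ref_rejected_logps_list,
                                 weights=None, beta=0.1):
        # Hypothetical multi-reference variant: average the per-reference
        # log-probabilities (uniformly by default) before applying the DPO
        # objective. This is an illustrative assumption, not the closed-form
        # MRPO objective of the paper.
        k = len(ref_chosen_logps_list)
        if weights is None:
            weights = [1.0 / k] * k
        ref_chosen = sum(w * lp for w, lp in zip(weights, ref_chosen_logps_list))
        ref_rejected = sum(w * lp for w, lp in zip(weights, ref_rejected_logps_list))
        return dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen, ref_rejected, beta=beta)

    # Example usage with random sequence log-probabilities and two references.
    if __name__ == "__main__":
        b = 4
        pol_c, pol_r = torch.randn(b), torch.randn(b)
        refs_c = [torch.randn(b), torch.randn(b)]
        refs_r = [torch.randn(b), torch.randn(b)]
        print(multi_reference_dpo_loss(pol_c, pol_r, refs_c, refs_r).item())

In this sketch, each argument is a 1-D tensor (or list of tensors) of per-example sequence log-probabilities, and the function returns a scalar loss averaged over the batch.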

History

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Volume

39

Pagination

24375-24383

ISSN

2159-5399

eISSN

2374-3468

Publication classification

E3 Extract of paper

Issue

23

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)
