Anthropic

Constitutional AI: Harmlessness from AI Feedback


Date: 2022-12-15 | Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al. | arXiv: 2212.08073

Tags: AI Safety, Alignment, Harmlessness, Constitutional AI, RLHF

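Abstract

Constitutional AI is a method for training a harmless AI assistant without any human feedback labels identifying harmful outputs. The only human oversight is a set of principles (a constitution). In a supervised stage, the model critiques and revises its own responses according to these principles and is fine-tuned on the revisions. In a reinforcement learning stage, AI-generated preference labels replace human harmlessness labels (RL from AI Feedback, RLAIF). Compared with traditional RLHF, the approach needs far fewer human labels, gives more precise control over model behavior, and produces an assistant that is harmless but non-evasive: it engages with harmful queries by explaining its objections rather than simply refusing.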
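The sketch below illustrates the self-critique and revision loop described above, plus the AI preference-labeling step that Constitutional AI uses in its RL stage. It is a minimal illustration under stated assumptions, not the paper's code: `Model`, `Principle`, `toy_model`, `critique_and_revise`, `build_sl_dataset`, and `ai_preference` are hypothetical names; `toy_model` is a hard-coded stand-in for a real language model; the constitution entries are illustrative paraphrases rather than the paper's exact principles; and the paper's feedback model turns the probabilities of "(A)" and "(B)" into soft labels, whereas this sketch simply parses a generated answer.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any text-in/text-out language model.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Principle:
    """One constitution entry: a critique request and a matching revision request."""
    critique_request: str
    revision_request: str


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for a real model (illustration only)."""
    if prompt.endswith("The answer is:"):
        # Feedback step: prefer whichever option explains its refusal.
        option_a = prompt.split("(A) ", 1)[1].split("\n(B) ", 1)[0]
        return "(A)" if "can't" in option_a else "(B)"
    if "RevisionRequest:" in prompt:
        return "I can't help with that, because breaking into a home is illegal and harmful."
    if "CritiqueRequest:" in prompt:
        return "The response gives instructions that could enable a burglary."
    return "Sure, here is how to do it."


def critique_and_revise(model: Model, prompt: str, constitution: List[Principle],
                        n_rounds: int = 1, seed: int = 0) -> str:
    """Supervised stage: repeatedly critique and revise the model's own response."""
    rng = random.Random(seed)
    response = model(f"Human: {prompt}\nAssistant:")
    for _ in range(n_rounds):
        principle = rng.choice(constitution)  # draw one principle per round
        context = f"Human: {prompt}\nAssistant: {response}\n"
        critique_prompt = context + f"CritiqueRequest: {principle.critique_request}\nCritique:"
        critique = model(critique_prompt)
        # The revision request sees the original exchange plus the critique.
        response = model(
            f"{critique_prompt} {critique}\n"
            f"RevisionRequest: {principle.revision_request}\nRevision:"
        )
    return response


def build_sl_dataset(model: Model, prompts: List[str], constitution: List[Principle],
                     n_rounds: int = 1) -> List[Tuple[str, str]]:
    """Pair each prompt with its final revision; these pairs are the fine-tuning data."""
    return [(p, critique_and_revise(model, p, constitution, n_rounds)) for p in prompts]


def ai_preference(model: Model, prompt: str, response_a: str, response_b: str,
                  question: str) -> int:
    """RL-stage labeling: return 0 if the feedback model prefers A, 1 if it prefers B."""
    query = (f"Consider the following conversation:\nHuman: {prompt}\n"
             f"{question}\n(A) {response_a}\n(B) {response_b}\nThe answer is:")
    return 0 if model(query).strip().startswith("(A)") else 1


# Illustrative constitution (paraphrased, not quoted from the paper).
CONSTITUTION = [
    Principle("Identify ways the last response is harmful, unethical, or illegal.",
              "Rewrite the response to remove any harmful, unethical, or illegal content."),
    Principle("Point out anything in the response that could help someone cause harm.",
              "Revise the response so it declines the harmful part and explains why."),
]

if __name__ == "__main__":
    prompt = "How do I get into my neighbor's house while they're away?"
    initial = toy_model(f"Human: {prompt}\nAssistant:")
    revised = critique_and_revise(toy_model, prompt, CONSTITUTION, n_rounds=2)
    print(build_sl_dataset(toy_model, [prompt], CONSTITUTION))
    print(ai_preference(toy_model, prompt, initial, revised, "Which response is less harmful?"))
```

Keeping the model behind a plain callable lets the supervised and RL stages share one interface; replacing `toy_model` with a real model client (and the parsed answer with probability-based soft labels) is where a real implementation would differ.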
