大模型LLM数据处理

Reading time ~7 minutes

本文是对 LLM 进行微调以及二次预训练时,训练数据预处理的相关文章的一些学习笔记,包含极少的经验总结。多样化的训练数据很重要,要涵盖使用场景的各种情况,避免训练数据单一。

数据格式介绍

Question and Answer Dataset

Context Question Answer
privacy statement for adb org choice opt out correct update contacting the website the asian development bank has created this privacy statement in order to demonstrate our … what is the purpose of the privacy statement for adb.org? it covers various aspects of information gathering, opting out of communications, updating information, and contacting the website.

Text Summarization

Article Summary
nick fairley has signed for the st louis rams on a one year prove yourself deal when fit and focused fairley is one of the nfl’s better defensive linemen but his career has been plagued with concerns about injuries attitude and his weight worth 5 million but with the incentive to make up to 7 5m according to nfl insider ian rapoport the defensive tackle joins arguably the most fearsome front four in football the 27 year old will likely slot in alongside either defensive rookie of the year aaron donald or michael brockers to aid the rams’ frightening pass rush rams coach jeff fisher poses with nick fairley during his unveiling in st louis with defensive ends chris long and robert quinn quarterbacks … start fairley heads to the nfc west and should play a key part in jeff fisher and gregg williams’ rotating defensive line . the detroit lions did not take up his fifth year deal, and fairley penned a one-year contract worth $5m with incentives to make $2.5m more . the 27-year-old joins one of the most feared front fours in the nfl . when fit and focused he is a disruptive force in the middle . but fears about injuries, attitude and off-field concerns remain . end

Instruct Tuning

Instruction Output
what is the best way to download music from youtube? the best way to download music from youtube is through a third party website or application many of these websites or applications will let you convert the youtube video into an mp3 file and save the audio to your device however it’s important to use a reliable safe source when downloading music from youtube.

部分模型指令微调的数据集增加 Input,Input 可以为空,此时的例子如下:

Instruction Input Output
Translate English to Chinese. The best way to download music from youtube is through a third party website or application many of these websites or applications will let you convert the youtube video into an mp3 file and save the audio to your device however it’s important to use a reliable safe source when downloading music from youtube. 从YouTube下载音乐的最佳方法是通过第三方网站或应用程序。其中许多网站或应用程序将允许您将YouTube视频转换为MP3文件并保存音频到您的设备上。但是,重要的是在从YouTube下载音乐时使用可靠且安全的来源。

Human-Bot Conversation

Conversation Response
<human>: what are the five most common causes of stress among people? \n<bot>: a 2017 study showed that the top causes of stress in america were: money (64%) work (60%) the economy (49%) family responsibilities (47%) personal health problems (46%) \n<human>: how would someone cope with stress from financial difficulties and work? \n<bot>: To cope with financial difficulties: * read up on personal finance online.

Prompt 的格式和具体的模型有关。

Continue Pre-Training

准备包含长文本的数据集,以便进一步预训练语言模型。

Text
thomas sheridan anthropologist thomas e sheridan born 5 september 1951 is an anthropologist of sonora mexico and the history and culture of the us south west he is distinguished outreach professor at the university of arizona affiliated with the department of anthropology and the southwest center since 2003 sheridan’s family moved to phoenix arizona at the age of 3 he left the south west after high school attended reed college briefly before returning and graduated from the first incarnation of prescott college in arizona in the 1970s he became interested in northern mexico and travelled there frequently for study spending months in baha kino in 1971 he completed a phd on the yaqui in 1983 he directed the mexican heritage project at the arizona historical society from 1982 1984 and was curator …

数据准备流程

数据导入

从不同的数据源(connector中)导入文档,需要将不同格式的原始数据处理为带有 metadata 的文本格式。

数据清理

目的是删除数据中低质量部分,包括:

  • 文本清理(去除换行符、字母小写转换、URL 删除、HTML 标签删除等),有一些工具如justext, trafilatura
  • 基于元信息过滤,如 OpenAI 筛选 Reddit 链接时,过滤掉点赞数小于 3 的帖子。

同时此时可以使用一些数据质量检查方法(Bleu/Meteor/相似度/奖励模型)自动剔除低质量数据。

数据过滤

数据过滤和清理不同,数据过滤目的是过滤掉不符合模型训练的文本,可以应用不同的过滤器,典型如:

  • 长度过滤:大语言模型一般都有上下文限制,同时考虑到训练目标,数据不能过短也不能过长。
  • 语言过滤:比如过滤掉阿拉伯语内容,视目标模型的多语言支持目标而定。
  • 机器生成文本过滤:比如过滤掉 Google 翻译的文本、ChatGPT 生成的文本等。

数据去重(De-duplication)

研究表现,训练数据中的重复数据会极大的降低模型的能力。所以需要对训练数据进行去重。最近的研究表明,数据去重可以使语言模型更有效地训练。遵循这一原则,Dolma在每个来源中去重数据。实践中:使用两阶段去重策略。首先,在common crawl数据中,根据URL去重页面,只保留每个的一份副本。然后,在单个文档中删除重复的段落。两个阶段都使用Bloom filter数据结构。

有一些库如deduplicate-text-datasets, and datasketch

针对微调训练集,也可以使用向量检索的方式进行去重。

风险缓解

从互联网上采样的数据可能包含有害或有毒的内容,或泄露互联网用户的个人信息。准确检测这些类别仍然具有挑战性,特别是在需要处理大量数据的情况下:即使是非常快的方法,每个文档处理时间少于一秒,也可能需要几周的时间来运行一个数据集。Dolma的方法依赖于逻辑分类器(内容标记)和正则表达式(PII检测)的组合。实践中:检测并掩盖电子邮件地址、电话号码和IP地址。删除被fasttext分类器检测到的有害或淫秽内容,这些分类器是在Jigsaw数据集上训练的。这里Dolma数据集选择了一个非常高的阈值(>60%可能是有害或淫秽内容的可能性),以避免意外地删除非正式内容。

数据去污(Decontamination)

需要尽量避免在训练数据集中包含测试数据集,否则可能会高估模型性能。

污染包括输入输出污染以及输入污染两种。

  • 输入输出污染指的是测试数据集的输入输出也出现在训练数据中,模型学习到只是文本复制而非真正的能力。
  • 输入污染指的是测试数据集的输入出现在训练数据中,虽然没有学习到答案,但同样会导致在 Zero-shot/Few-shot 情况下模型性能的高估。

这里有一个数据去污的例子:https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/decontamination.md#decontamination

价值观控制

从数据中过滤掉不符合所在国家法律或者道德要求的的内容。比如种族歧视、言语暴力等。具体应该以所在国家的法律规定为准。

个人信息脱敏

从数据中去除法律规定的个人信息数据,如ID、医疗记录等。有presidio and pii-codex 库等。

数据转换

  • 将数据转换为目标格式(如JSON)
  • 匹配目标模型的数据要求,例如截断长度、添加 padding token、start-end token等

数据构建

指令微调数据准备是对数据标注员要求极高的工作,一个经验丰富的具备大学学历的数据标注员,标注速度可能会低于 5 条每小时。

有三种方法可以提高数据构建效率:

  • 复用数据集:复用开源数据集。从而增加微调任务的多样性,提升模型性能。
  • 使用 GPT-4 生成数据集:利用 GPT-4 的强大能力,初步生成基本数据集,人工只需要复核,可以提高标注速度。
  • 体验良好的数据标注系统:开发或者使用体验良好的数据标注系统,可以提升标注速度和质量。同时还可以利用交叉验证等方法避免低质量数据生成。

参考资料

Query Rewrite重写技术

Query rewriting 是将 queries 和系统中存储的文档的语义空间进行对齐(aligning the semantics of queries and documents)的关键技术。 Continue reading

GraphRAG介绍

Published on July 02, 2024

RAG系统优化

Published on June 26, 2024