NLP (101): Fine-Tuning Embedding Models in Practice
2024-06-07 22:24
This article shows how to fine-tune the open-source Embedding model bge-base-zh-v1.5 with Sentence Transformers and AutoTrain, and verifies how the model performs after fine-tuning.
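Embedding models come up constantly in RAG pipelines and semantic similarity tasks. Sentence Transformers is a Python library for using and training Embedding models for a wide range of applications, such as retrieval-augmented generation (RAG), semantic search, semantic textual similarity, and paraphrase mining. Its 3.0 release is the largest update since the project was created and introduces a new training approach.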
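We take bge-base-zh-v1.5, the Embedding model open-sourced by the Beijing Academy of Artificial Intelligence (BAAI), as the baseline model. We show how to evaluate it with Sentence Transformers, fine-tune it with several training frameworks, and verify that the fine-tuned models improve on the baseline.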
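Baseline Evaluation Metrics
In the article NLP (82): Evaluating Retrieve Algorithms in the RAG Framework, we used the LlamaIndex framework to evaluate the various Retrieve algorithms in the RAG pipeline, including Embedding-model recall, with Hit Rate and MRR as the metrics. This article continues to use the dataset from that article for evaluation.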
Sentence Transformers provides a convenient way to evaluate the retrieval quality of an Embedding model. Here we use the InformationRetrievalEvaluator evaluator and report accuracy@5, accuracy@10, map@100, mrr@10, ndcg@10, and related metrics.
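To make the metric names concrete, here is a minimal sketch of how accuracy@k, mrr@k, and ndcg@k can be computed for a single query with binary relevance. It illustrates the standard definitions only and is not the library's code; the function names and the toy document IDs are made up for this example. The evaluator reports the average of such per-query values over all queries.

```python
import math


def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k (a "hit"), else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking (hypothetical IDs): the only relevant document sits at rank 3.
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1"}

print(accuracy_at_k(ranked, relevant, 5))   # hit within the top 5
print(mrr_at_k(ranked, relevant, 10))       # 1 / 3
print(ndcg_at_k(ranked, relevant, 10))      # 1 / log2(4) = 0.5
```

Hit Rate from the earlier article corresponds to accuracy@k here: both ask whether at least one relevant chunk was retrieved in the top k.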
The example evaluation code is as follows:
# -*- coding: utf-8 -*-
# @file: bge_base_zh_eval.py
import os
import json
import time
import torch
from pprint import pprint
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim

project_dir = os.path.dirname(os.path.abspath(__file__)).split('/src')[0]

# Load the dataset and extract corpus, queries, relevant_docs
with open(os.path.join(project_dir, "data/doc_qa.json"), "r", encoding="utf-8") as f:
    content = json.loads(f.read())
corpus = content['corpus']
queries = content['queries']
relevant_docs = content['relevant_docs']

# Load a model
# Replace with the full path to your own model, or use a Hugging Face model ID
model_name = "bge-base-zh-v1.5"
model_path = os.path.join(project_dir, f"models/{model_name}")
model = SentenceTransformer(model_path, device="cuda" if torch.cuda.is_available() else "cpu")
print("Model loaded")

s_time = time.time()

# Build the evaluator
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name=f"{os.path.basename(model_path)}",
    score_functions={"cosine": cos_sim}
)

# Evaluate the model
result = evaluator(model)
pprint(result)
print(f"Time cost: {time.time() - s_time:.2f}s")
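We pass the queries, corpus, and relevant_docs dictionaries to the evaluator; once the model is loaded, the evaluation can run.
The evaluation results are given later in this article and serve as the baseline metrics.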
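As a rough illustration of the three dictionaries, the sketch below checks that every ID in relevant_docs exists in queries and corpus, and converts the relevant-document lists to sets. The layout (query ID to text, chunk ID to text, query ID to a list of chunk IDs) is inferred from how the scripts in this article index the JSON files; the helper name and the toy records are hypothetical.

```python
def validate_ir_dataset(content):
    """Check the queries/corpus/relevant_docs layout; return relevant_docs as sets."""
    queries, corpus = content["queries"], content["corpus"]
    relevant_docs = {}
    for query_id, doc_ids in content["relevant_docs"].items():
        if query_id not in queries:
            raise KeyError(f"relevant_docs refers to unknown query id: {query_id}")
        missing = [doc_id for doc_id in doc_ids if doc_id not in corpus]
        if missing:
            raise KeyError(f"query {query_id} refers to unknown corpus ids: {missing}")
        relevant_docs[query_id] = set(doc_ids)
    return queries, corpus, relevant_docs


# Toy data with made-up IDs and texts, shaped like the JSON files used here.
toy = {
    "queries": {"q1": "What are wafers sliced from?"},
    "corpus": {
        "c1": "Wafers are sliced from silicon ingots.",
        "c2": "Lithography transfers a pattern onto the wafer.",
    },
    "relevant_docs": {"q1": ["c1"]},
}

queries, corpus, relevant_docs = validate_ir_dataset(toy)
print(relevant_docs)
```

Running a check like this before evaluation catches a mismatched ID early, rather than having it quietly count as a miss in the metrics.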
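Synthesizing Fine-Tuning Data
Before fine-tuning the Embedding model, we use the LlamaIndex framework to build the fine-tuning dataset. The construction method was already described in the article NLP (86): Fine-Tuning the Embedding Model in the Retrieve Stage of the RAG Framework; we go over it again here.
In LlamaIndex, the generate_qa_embedding_pairs method makes it easy to generate related questions for each text chunk through prompt engineering and to link each question to the chunk it came from.
The data synthesis script for Embedding model fine-tuning is as follows: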
# -*- coding: utf-8 -*-
# @file: make_ft_corpus.py
import os
from llama_index.legacy.finetuning import (
    generate_qa_embedding_pairs
)
from llama_index.llms.openai import OpenAI
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from dotenv import load_dotenv

# Read OPENAI_API_KEY from a local .env file
load_dotenv()

project_dir = os.path.dirname(os.path.abspath(__file__)).split('/src')[0]

TRAIN_FILES = [os.path.join(project_dir, "data/ft_train.txt")]
VAL_FILES = [os.path.join(project_dir, "data/ft_test.txt")]

TRAIN_CORPUS_FPATH = os.path.join(project_dir, "data/ft_train_corpus.json")
VAL_CORPUS_FPATH = os.path.join(project_dir, "data/ft_val_corpus.json")


def load_corpus(files, verbose=False):
    """Load the files and split them into chunks (nodes) of up to 250 tokens."""
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter(chunk_size=250, chunk_overlap=0)
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes


train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

llm = OpenAI(model="gpt-3.5-turbo", api_key=os.getenv("OPENAI_API_KEY"))

# Prompt that asks the LLM to write Chinese questions for each chunk
qa_generate_prompt_tmpl = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination in Chinese. The questions should be diverse in nature \
across the document in Chinese. The questions should not contain options, not start with Q1/ Q2. \
Restrict the questions to the context information provided.
"""

# Generate one question per chunk and link it to the chunk it came from
train_dataset = generate_qa_embedding_pairs(
    nodes=train_nodes,
    llm=llm,
    num_questions_per_chunk=1,
    qa_generate_prompt_tmpl=qa_generate_prompt_tmpl,
)
val_dataset = generate_qa_embedding_pairs(
    nodes=val_nodes,
    llm=llm,
    num_questions_per_chunk=1,
    qa_generate_prompt_tmpl=qa_generate_prompt_tmpl,
)

train_dataset.save_json(TRAIN_CORPUS_FPATH)
val_dataset.save_json(VAL_CORPUS_FPATH)
The output is as follows:
Loading files ['/Users/admin/PycharmProjects/embedding_model_exp/data/ft_train.txt']
Loaded 1 docs
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 23.54it/s]
Parsing nodes: 0%| | 0/1 [00:00<?, ?it/s]Parsed 137 nodes
Loading files ['/Users/admin/PycharmProjects/embedding_model_exp/data/ft_test.txt']
Loaded 1 docs
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 45.84it/s]
0%| | 0/137 [00:00<?, ?it/s]Parsed 111 nodes
100%|██████████| 137/137 [03:34<00:00, 1.57s/it]
100%|██████████| 111/111 [01:55<00:00, 1.04s/it]
This gives us the fine-tuning dataset, saved as ft_train_corpus.json and ft_val_corpus.json.
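The pairing idea behind this step can be sketched in plain Python: every chunk yields one or more questions, each question receives a fresh ID, and relevant_docs records which chunk it came from. This is a simplified illustration rather than LlamaIndex's implementation; build_qa_pairs and fake_llm are hypothetical names, and fake_llm merely stands in for the real LLM call.

```python
import uuid


def build_qa_pairs(chunks, generate_questions):
    """Pair each chunk with generated questions.

    chunks: dict mapping chunk id -> chunk text
    generate_questions: callable text -> list of question strings
                        (a stand-in for the LLM call)
    """
    queries, relevant_docs = {}, {}
    for chunk_id, text in chunks.items():
        for question in generate_questions(text):
            question = question.strip()
            if not question:
                continue  # skip empty lines in the model output
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            relevant_docs[query_id] = [chunk_id]  # the source chunk is the positive
    return {"queries": queries, "corpus": dict(chunks), "relevant_docs": relevant_docs}


def fake_llm(text):
    """Hypothetical question generator used only for this illustration."""
    return [f"What does the passage say about {text.split()[0]}?"]


chunks = {
    "node-1": "Wafers are sliced from silicon ingots.",
    "node-2": "Lithography transfers a pattern onto the wafer.",
}
dataset = build_qa_pairs(chunks, fake_llm)
print(len(dataset["queries"]), "questions for", len(dataset["corpus"]), "chunks")
```

With num_questions_per_chunk=1, as in the script above, the 137 training chunks and 111 validation chunks would each contribute roughly one question, depending on what the LLM actually returns.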
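Fine-Tuning the Embedding Model
Next, we fine-tune the bge-base-zh-v1.5 model. The goal is to adapt the model to our own dataset and thereby achieve better recall.
Below, I introduce three training frameworks to help readers fine-tune Embedding models.
Using `sentence-transformers v3`
Here we use version 3.0.0 of the sentence-transformers module.
With this module, fine-tuning an Embedding model is straightforward. The fine-tuning code is as follows: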
# -*- coding: utf-8 -*-
# @file: ft_sentence_transformers_trainer.py
import os
import json
import time
import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers import SentenceTransformerTrainer

start_time = time.time()
project_dir = os.path.dirname(os.path.abspath(__file__)).split('/src')[0]

# NOTE: the synthesis script above writes ft_train_corpus.json / ft_val_corpus.json;
# adjust the two paths below if your files use those names.
# load eval dataset
with open(os.path.join(project_dir, "data/ft_val_dataset.json"), "r", encoding="utf-8") as f:
    eval_content = json.loads(f.read())

corpus, queries, relevant_docs = eval_content['corpus'], eval_content['queries'], eval_content['relevant_docs']

# load train dataset
with open(os.path.join(project_dir, "data/ft_train_dataset.json"), "r", encoding="utf-8") as f:
    train_content = json.loads(f.read())

# Build (question, chunk) pairs: each query is paired with its first relevant chunk
train_anchor, train_positive = [], []
for query_id, context_id in train_content['relevant_docs'].items():
    train_anchor.append(train_content['queries'][query_id])
    train_positive.append(train_content['corpus'][context_id[0]])

# In sentence-transformers v3 the loss receives the columns by position, not by name.
# The original column order (positive, anchor) is kept here.
train_dataset = Dataset.from_dict({"positive": train_positive, "anchor": train_anchor})

print(train_dataset)
print(train_dataset[0:5])

# Load a model
model_name = 'bge-base-zh-v1.5'
# Replace with the full path to your own model, or use a Hugging Face model ID
model_path = os.path.join(project_dir, f"models/{model_name}")
model = SentenceTransformer(model_path, device="cuda:0" if torch.cuda.is_available() else "cpu")
print("Model loaded")

# Evaluator used during training to pick the best checkpoint
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name=f"{model_name}",
    score_functions={"cosine": cos_sim}
)
train_loss = MultipleNegativesRankingLoss(model)

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir=f"ft_{model_name}",  # output directory and hugging face model ID
    num_train_epochs=5,  # number of epochs
    per_device_train_batch_size=2,  # train batch size
    gradient_accumulation_steps=2,  # effective batch size of 2 x 2 = 4 per device
    per_device_eval_batch_size=4,  # evaluation batch size
    warmup_ratio=0.1,  # warmup ratio
    learning_rate=2e-5,  # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",  # use cosine learning rate scheduler
    optim="adamw_torch_fused",  # use fused adamw optimizer
    tf32=True,  # use tf32 precision (needs an Ampere or newer GPU)
    bf16=True,  # use bf16 precision (needs an Ampere or newer GPU)
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate texts within a batch
    eval_strategy="epoch",  # evaluate after each epoch
    save_strategy="epoch",  # save after each epoch
    logging_steps=10,  # log every 10 steps
    save_total_limit=3,  # save only the last 3 models
    load_best_model_at_end=True,  # load the best model when training ends
    metric_for_best_model=f"eval_{model_name}_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score
)

# train the model
trainer = SentenceTransformerTrainer(
    model=model,  # the model to train
    args=args,  # training arguments
    train_dataset=train_dataset.select_columns(
        ["positive", "anchor"]
    ),  # training dataset
    loss=train_loss,
    evaluator=evaluator
)

trainer.train()
trainer.save_model()
print(f"cost time: {time.time() - start_time:.2f}s")
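I trained on a single NVIDIA A800-SXM4-80GB GPU, which took about 63.10 seconds. The fine-tuned Embedding model is saved on the GPU server, in the output directory ft_bge-base-zh-v1.5.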
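The loss used above, MultipleNegativesRankingLoss, relies on in-batch negatives: for each pair in a batch, the texts from all other pairs act as negatives. The following pure-Python sketch shows the idea on toy two-dimensional vectors. It is an illustration under simplifying assumptions, not the library's implementation: real embeddings come from the model, and the scale of 20 is assumed here to match the library's default.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy in which row i's correct "class" is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Score the anchor against every positive in the batch.
        logits = [scale * cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(anchors, [[1.0, 0.1], [0.1, 1.0]])
shuffled = in_batch_negatives_loss(anchors, [[0.1, 1.0], [1.0, 0.1]])
print(f"aligned pairs:  {aligned:.4f}")
print(f"shuffled pairs: {shuffled:.4f}")
```

The loss is small when each anchor is closest to its own positive and large otherwise. This is also why the training arguments use BatchSamplers.NO_DUPLICATES: if the same text appeared twice in a batch, one copy would be wrongly treated as a negative for the other.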
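Using `AutoTrain`
AutoTrain is a tool for automatically training and deploying state-of-the-art machine learning models, and it integrates seamlessly with the Hugging Face ecosystem. It supports a wide range of machine learning tasks, including text classification, text regression, entity recognition, summarization, question answering, translation, and tabular tasks.
You can train models for all of these tasks without writing any code.
Here we use Embedding model training as an example of how to train a model with AutoTrain.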
First, we process the fine-tuning dataset above into a format compatible with the datasets module and upload it to the HuggingFace Hub under the project name jclian91/embedding_exp_semiconductor.
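One possible way to do this conversion is sketched below: flatten the queries/corpus/relevant_docs JSON into rows with anchor and positive columns (the column names used in the AutoTrain config) and write them as JSON Lines. The helper names and toy records are hypothetical, and the upload step is described only in comments, as an assumption about how the files might then be pushed to the Hub.

```python
import json


def to_pair_rows(content):
    """Flatten the dataset into {"anchor": question, "positive": chunk} rows."""
    rows = []
    for query_id, doc_ids in content["relevant_docs"].items():
        rows.append({
            "anchor": content["queries"][query_id],
            "positive": content["corpus"][doc_ids[0]],  # first relevant chunk
        })
    return rows


def write_jsonl(rows, path):
    """Write one JSON object per line; keep Chinese text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Toy record with made-up IDs and texts.
toy = {
    "queries": {"q1": "What are wafers sliced from?"},
    "corpus": {"c1": "Wafers are sliced from silicon ingots."},
    "relevant_docs": {"q1": ["c1"]},
}
rows = to_pair_rows(toy)
print(rows)

# Assumed follow-up (not run here): write train.jsonl and dev.jsonl with
# write_jsonl, load them with datasets.load_dataset("json", data_files=...),
# and call push_to_hub() with your own repository name.
```

The split names train and dev are chosen to match train_split and valid_split in the configuration file.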
The training configuration file (config.yml) is as follows:
task: sentence-transformers:pair
base_model: /workspace/code/embedding_model_exp/models/bge-base-zh-v1.5
project_name: autotrain-pair
log: tensorboard
backend: local

data:
  path: jclian91/embedding_exp_semiconductor
  train_split: train
  valid_split: dev
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive

params:
  max_seq_length: 512
  epochs: 5
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch_fused
  scheduler: cosine
  gradient_accumulation: 2
  mixed_precision: fp16
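The training command is `CUDA_VISIBLE_DEVICES=0 autotrain --config config.yml`.
We did not write any code along the way; configuring the training parameters was enough to fine-tune the Embedding model, which keeps the whole process quick and painless.
Using `LlamaIndex`
Fine-tuning an Embedding model with the LlamaIndex framework was already covered in the article NLP (86): Fine-Tuning the Embedding Model in the Retrieve Stage of the RAG Framework, so I will not repeat the details here; it is just as convenient to use.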
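Comparing Evaluation Metrics
We compare the metrics of the baseline model and the Embedding models produced by the different training methods. The results are shown in the table below:
| Model | accuracy@5 | accuracy@10 | map@100 | mrr@10 | ndcg@10 | cost time |
| --- | --- | --- | --- | --- | --- | --- |
| bge-base-zh-v1.5 | 0.8100 | 0.8816 | 0.6998 | 0.6945 | 0.7396 | 15.44s |
| ft_bge-base-zh-v1.5 | 0.9128 | 0.9408 | 0.8052 | 0.8018 | 0.8362 | 15.14s |
| autotrain-bge-base-zh-v15 | 0.9159 | 0.9346 | 0.7952 | 0.7918 | 0.8272 | 15.07s |
ft_bge-base-zh-v1.5 and autotrain-bge-base-zh-v15 are both fine-tuned from the baseline model bge-base-zh-v1.5; the former was fine-tuned with sentence-transformers and the latter with AutoTrain.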
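The evaluation script is src/baseline_eval/bge_base_zh_eval.py, run on a Mac CPU (Apple M2 Pro). The choice of training framework has little effect on how the fine-tuned model performs, apart from differences in some training parameters; what determines the quality of an Embedding model is the dataset, the model itself, and the hyperparameters.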
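The gains in the table can be quantified with a few lines of arithmetic. The sketch below copies the scores from the table and prints the absolute and relative improvement of each fine-tuned model over the baseline; the variable and function names are just for this illustration.

```python
# Scores copied from the comparison table.
baseline = {"accuracy@5": 0.8100, "accuracy@10": 0.8816, "map@100": 0.6998,
            "mrr@10": 0.6945, "ndcg@10": 0.7396}
finetuned = {"accuracy@5": 0.9128, "accuracy@10": 0.9408, "map@100": 0.8052,
             "mrr@10": 0.8018, "ndcg@10": 0.8362}
autotrain = {"accuracy@5": 0.9159, "accuracy@10": 0.9346, "map@100": 0.7952,
             "mrr@10": 0.7918, "ndcg@10": 0.8272}


def gains(base, new):
    """Return {metric: (absolute gain, relative gain)} of new over base."""
    return {
        name: (new[name] - base[name], (new[name] - base[name]) / base[name])
        for name in base
    }


for label, scores in [("ft_bge-base-zh-v1.5", finetuned),
                      ("autotrain-bge-base-zh-v15", autotrain)]:
    for name, (abs_gain, rel_gain) in gains(baseline, scores).items():
        print(f"{label:<26} {name:<12} {abs_gain:+.4f} ({rel_gain:+.1%})")
```

For example, ndcg@10 rises from 0.7396 to 0.8362 with sentence-transformers, an absolute gain of 0.0966, or about 13% relative to the baseline. The two fine-tuned models stay within about 0.01 of each other on every metric.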
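Summary
This article focused on how to fine-tune the open-source Embedding model bge-base-zh-v1.5 with Sentence Transformers and AutoTrain, and verified the effect of fine-tuning.
Sentence Transformers is a treasure trove: it covers nearly every aspect of Embedding models and is an indispensable tool for learning about and digging deeper into them. In follow-up articles I will cover Embedding model quantization, Matryoshka Representation Learning (MRL), and related topics.
The Python code in this article is part of a broader set of Embedding model experiments, and the project will be extended over time. It is open source on GitHub: https://github.com/percent4/embedding_model_exp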
References
Training and Finetuning Embedding Models with Sentence Transformers v3: https://huggingface.co/blog/train-sentence-transformers
Fine-tune Embedding models for Retrieval Augmented Generation (RAG): https://www.philschmid.de/fine-tune-embedding-model-for-rag
An Overview of Matryoshka Embedding Models (in Chinese): https://huggingface.co/blog/zh/matryoshka
Finetune Embeddings: https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/
NLP (86): Fine-Tuning the Embedding Model in the Retrieve Stage of the RAG Framework: https://mp.weixin.qq.com/s?__biz=MzU2NTYyMDk5MQ==&mid=2247486333&idx=1&sn=29d00d472647bc5d6e336bec22c88139&chksm=fcb9b2edcbce3bfb42ea149d96fb1296b10a79a60db7ad2da01b85ab223394191205426bc025&token=1376257911&lang=zh_CN#rd
How to Fine-Tune Custom Embedding Models Using AutoTrain: https://huggingface.co/blog/abhishek/finetune-custom-embeddings-autotrain
Upload a dataset to the Hub: https://huggingface.co/docs/datasets/v1.16.0/upload_dataset.html
NLP (82): Evaluating Retrieve Algorithms in the RAG Framework: https://mp.weixin.qq.com/s?__biz=MzU2NTYyMDk5MQ==&mid=2247486199&idx=1&sn=f24175b05bdf5bc6dd42efed4d5acae8&chksm=fcb9b367cbce3a711fabd1a56bb5b9d803aba2f42964b4e1f9a4dc6e2174f0952ddb9e1d4c55&token=402005631&lang=zh_CN#rd