EasyNLP: A Chinese NLP Algorithm Framework


As pre-trained models such as BERT, Megatron, and GPT-3 have achieved remarkable results in NLP, more and more teams have moved into ultra-large-scale training, pushing model sizes from hundreds of millions of parameters to hundreds of billions or even trillions. However, applying such very large models in real-world scenarios still faces challenges. First, the huge parameter count makes training and inference slow and deployment extremely costly. Second, the lack of data in many practical scenarios still limits the use of large models in few-shot settings, and improving their generalization with little data remains difficult. To address these problems, the Alibaba Cloud machine learning PAI team released EasyNLP, a Chinese NLP algorithm framework that helps put large models into production quickly and efficiently.

Key Features

  • Easy to use and compatible with open source: EasyNLP supports commonly used Chinese NLP datasets and models, making it convenient to evaluate Chinese NLP techniques. In addition to providing simple PAI commands for invoking cutting-edge NLP algorithms, EasyNLP abstracts customizable modules such as AppZoo and ModelZoo to lower the barrier to building NLP applications; ModelZoo hosts both widely used pre-trained models and models developed by PAI, including knowledge-enhanced pre-trained models. EasyNLP can seamlessly load huggingface/transformers models, is also compatible with EasyTransfer models, and can improve training efficiency with its built-in distributed training framework (based on Torch-Accelerator).
  • Few-shot learning for large models: EasyNLP integrates classic few-shot learning algorithms such as PET and P-Tuning to tune large models on small amounts of data, addressing the mismatch between large models and small training sets. Furthermore, combining classic few-shot learning with ideas from contrastive learning, the PAI team proposed Contrastive Prompt Tuning, a scheme that adds no new parameters and requires no hand-crafted templates or label words; it ranked first on the FewCLUE few-shot learning leaderboard, improving over fine-tuning by more than 10%.
  • Knowledge distillation for large models: Because their sheer parameter count makes large models hard to deploy, EasyNLP provides knowledge distillation to compress large models into efficient small models that meet online serving requirements. It also provides the MetaKD algorithm for meta knowledge distillation, which improves student models to the point where they can match the teacher in many domains. In addition, EasyNLP supports data augmentation, using pre-trained models to augment data in the target domain, which effectively improves distillation quality (a conceptual sketch of plain logit distillation follows this list).
  • Multi-modal models: Since many NLP tasks rely on representations from other modalities, the EasyNLP framework supports not only pure NLP tasks but also popular multi-modal pre-trained models for NLP tasks that require visual knowledge or visual features. For example, EasyNLP integrates CLIP for text-image matching and a DALL-E style Chinese model for text-to-image generation.
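As referenced in the knowledge distillation bullet above, vanilla logit distillation boils down to matching a student's temperature-softened output distribution to the teacher's. The following is a conceptual PyTorch sketch of that loss only; it is not EasyNLP's distillation API, and the temperature and weighting values are illustrative assumptions.

import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Soft KL term on temperature-scaled logits plus the usual hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard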

Installation

$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ pip install -r requirements.txt
$ python setup.py install

Requirements: Python 3.6, PyTorch >= 1.8.
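After installation, a quick sanity check is to import the package from a fresh shell (this only confirms that the easynlp package used in the snippets below resolves correctly):

$ python -c "import easynlp; print(easynlp.__file__)"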

Quick Start

Below is a BERT text classification example; a BERT model can be trained in just a few lines of code:

First, load the data via the load_dataset interface; next, build a classification model; then call the Trainer to train.

from easynlp.core import Trainer
from easynlp.appzoo import GeneralDataset, SequenceClassification, load_dataset
from easynlp.utils import initialize_easynlp

args = initialize_easynlp()

row_data = load_dataset('glue', 'qnli')["train"]
train_dataset = GeneralDataset(row_data, args.pretrained_model_name_or_path, args.sequence_length)

model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model,  train_dataset=train_dataset).train()
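Because initialize_easynlp() reads its configuration from the command line, a script containing the snippet above is launched with flags rather than edited in place. The invocation below is only a hypothetical sketch: quick_start.py is a placeholder file name, and the flag names are inferred from the args attributes used above and from the commands later in this document, so adjust them to your setup.

python quick_start.py \
  --mode train \
  --pretrained_model_name_or_path bert-base-uncased \
  --checkpoint_dir ./tmp/ \
  --sequence_length 128 \
  --epoch_num 1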

For more datasets, please check out DataHub.

You can also use the custom data interface:

from easynlp.core import Trainer
from easynlp.appzoo import ClassificationDataset, SequenceClassification
from easynlp.utils import initialize_easynlp

args = initialize_easynlp()

train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    data_file=args.tables,
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    is_training=True)

model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model,  train_dataset=train_dataset).train()

Test it with the following command:

python main.py \
  --mode train \
  --tables=train_toy.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./tmp/ \
  --epoch_num=1  \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-tiny-uncased'
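For reference, train_toy.tsv is expected to hold tab-separated columns in the order declared by --input_schema (label, sid1, sid2, sent1, sent2). The rows below are purely illustrative; a small helper like this writes such a toy file:

# Hypothetical helper: write a tiny train_toy.tsv matching the schema
# label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1
rows = [
    ("0", "s1-0001", "s2-0001", "今天天气不错", "适合出去走走"),
    ("1", "s1-0002", "s2-0002", "这部电影很精彩", "这部电影非常好看"),
]
with open("train_toy.tsv", "w", encoding="utf-8") as f:
    for r in rows:
        f.write("\t".join(r) + "\n")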

We also provide the AppZoo command line for training models; training can be started with a simple parameter configuration:

$ easynlp \
   --mode=train \
   --worker_gpu=1 \
   --tables=train.tsv,dev.tsv \
   --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
   --first_sequence=sent1 \
   --label_name=label \
   --label_enumerate_values=0,1 \
   --checkpoint_dir=./classification_model \
   --epoch_num=1  \
   --sequence_length=128 \
   --app_name=text_classify \
   --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'
$ easynlp \
  --mode=predict \
  --tables=dev.tsv \
  --outputs=dev.pred.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --output_schema=predictions,probabilities,logits,output \
  --append_cols=label \
  --first_sequence=sent1 \
  --checkpoint_path=./classification_model \
  --app_name=text_classify
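The predict command writes one row per input to dev.pred.tsv. Per --output_schema it should contain the predictions, probabilities, logits, and output columns, with the appended label column at the end; the exact column layout here is an assumption, so verify it against your own output. A minimal inspection sketch:

import csv

# Assumed layout: prediction first (from --output_schema), appended label last
with open("dev.pred.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        prediction, appended_label = row[0], row[-1]
        print(prediction, appended_label)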

For more AppZoo examples, see the AppZoo documentation.

ModelZoo

EasyNLP's ModelZoo currently supports the following pre-trained models.

  1. PAI-BERT-zh (from Alibaba PAI): pre-trained BERT models with a large Chinese corpus.
  2. DKPLM (from Alibaba PAI): released with the paper DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding by Taolin Zhang, Chengyu Wang, Nan Hu, Minghui Qiu, Chengguang Tang, Xiaofeng He and Jun Huang.
  3. KGBERT (from Alibaba Damo Academy & PAI): pre-trained BERT models with knowledge graph embeddings injected.
  4. BERT (from Google): released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  5. RoBERTa (from Facebook): released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
  6. Chinese RoBERTa (from HFL): the Chinese version of RoBERTa.
  7. MacBERT (from HFL): released with the paper Revisiting Pre-trained Models for Chinese Natural Language Processing by Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang and Guoping Hu.
  8. WOBERT (from ZhuiyiTechnology): the word-based BERT for the Chinese language.
  9. Mengzi (from Langboat): released with the paper Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese by Zhuosheng Zhang, Hanqing Zhang, Keming Chen, Yuhang Guo, Jingyun Hua, Yulong Wang and Ming Zhou.

See the README for the full list.
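Since EasyNLP is built to interoperate with huggingface/transformers checkpoints, the open-source backbones listed above can also be inspected directly with the transformers API. The sketch below uses transformers itself rather than an EasyNLP-specific loader, and the model ID is the Hugging Face hub name corresponding to item 6 (Chinese RoBERTa from HFL):

from transformers import AutoModel, AutoTokenizer

# Load the HFL Chinese RoBERTa-wwm-ext backbone and encode a short sentence
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

inputs = tokenizer("EasyNLP 支持常用的中文预训练模型", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)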

Putting Large Pre-trained Models into Practice

EasyNLP provides few-shot learning and knowledge distillation to help users put very large pre-trained models into production.

  1. PET (from LMU Munich and Sulzer GmbH): released with the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference by Timo Schick and Hinrich Schütze. We have made some slight modifications to make the algorithm suitable for the Chinese language (a conceptual cloze-prompt sketch follows this list).
  2. P-Tuning (from Tsinghua University, Beijing Academy of AI, MIT and Recurrent AI, Ltd.): released with the paper GPT Understands, Too by Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang and Jie Tang. We have made some slight modifications to make the algorithm suitable for the Chinese language.
  3. CP-Tuning (from Alibaba PAI): released with the paper Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning by Ziyun Xu, Chengyu Wang, Minghui Qiu, Fuli Luo, Runxin Xu, Songfang Huang and Jun Huang.
  4. Vanilla KD (from Alibaba PAI): distilling the logits of large BERT-style models to smaller ones.
  5. Meta KD (from Alibaba PAI): released with the paper Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains by Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li and Jun Huang.
  6. Data Augmentation (from Alibaba PAI): augmenting the data based on the MLM head of pre-trained language models.
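To make the cloze-style prompting behind PET (item 1 above; see the note there) concrete, here is a conceptual sketch using a masked language model from transformers: the input is wrapped in a template containing a [MASK] slot, and the label is read off by comparing the MLM scores of the label words at that position. The template and verbalizer are illustrative assumptions, not EasyNLP's built-in ones, and no few-shot training is shown.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Illustrative template and verbalizer for a sentiment-style task
text = "这部电影的画面和配乐都很出色"
template = text + "。总之很[MASK]。"
verbalizer = {"好": "positive", "差": "negative"}

inputs = tokenizer(template, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Score each label word at the [MASK] position and pick the best
scores = {label: logits[tokenizer.convert_tokens_to_ids(word)].item()
          for word, label in verbalizer.items()}
print(max(scores, key=scores.get))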

CLUE Benchmark

EasyNLP provides CLUE evaluation code, making it easy to quickly evaluate model performance on the CLUE datasets.

# Format: bash run_clue.sh device_id train/predict dataset
# e.g.: 
bash run_clue.sh 0 train csl

With our scripts, the evaluation results of BERT, RoBERTa, and other models can be obtained (on the dev sets):

(1) bert-base-chinese

Task   AFQMC    CMNLI    CSL      IFLYTEK   OCNLI    TNEWS    WSC
P      72.17%   75.74%   80.93%   60.22%    78.31%   57.52%   75.33%
F1     52.96%   75.74%   81.71%   60.22%    78.30%   57.52%   80.82%

(2) chinese-roberta-wwm-ext:

Task   AFQMC    CMNLI    CSL      IFLYTEK   OCNLI    TNEWS    WSC
P      73.10%   80.75%   80.07%   60.98%    80.75%   57.93%   86.84%
F1     56.04%   80.75%   81.50%   60.98%    80.75%   57.93%   89.58%

For detailed examples, please refer to the CLUE evaluation example.

Tutorials

License

This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.

Contact Us

Scan the QR code below to join the DingTalk group; feel free to raise any questions or feedback in the group.

(DingTalk group QR code)

Reference

For a more detailed description, please refer to our arXiv paper:

@article{easynlp,
  doi = {10.48550/ARXIV.2205.00258},  
  url = {https://arxiv.org/abs/2205.00258},  
  author = {Wang, Chengyu and Qiu, Minghui and Zhang, Taolin and Liu, Tingting and Li, Lei and Wang, Jianing and Wang, Ming and Huang, Jun and Lin, Wei},
  title = {EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing},
  publisher = {arXiv},  
  year = {2022}
}