EMNLP 2021 | Balancing Strategies for Long-Tailed Distributions in Multi-Label Text Classification


Author | Yi Huang (黄毅)

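About the author: Yi Huang, first author of the paper, is a data scientist at Roche whose research focuses on biomedical applications of natural language processing.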

Paper: https://arxiv.org/pdf/2109.04712.pdf

Source code: https://github.com/Roche/BalancedLossNLP

Abstract

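Multi-label text classification is a classic task in natural language processing: a model is trained to assign a variable number of category labels to a given text. In practice, however, the amount of training data per label often varies widely (class imbalance) and may even follow a long-tailed distribution, which degrades the resulting model. Resampling and reweighting are commonly used to address class imbalance, but because labels are correlated in multi-label text classification, existing methods end up oversampling the high-frequency labels. In this work, we explore strategies for optimizing the loss function, in particular the use of balanced loss functions in multi-label text classification. In experiments on a general-domain dataset (Reuters-21578, 90 labels) and a biomedical dataset (PubMed, 18,211 labels), we find that a distribution-balanced loss function outperforms commonly used loss functions overall. Researchers recently showed that this family of loss functions improves image recognition models; our work further demonstrates its effectiveness in natural language processing.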

Introduction

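Multi-label text classification is one of the core tasks in natural language processing (NLP). It aims to find multiple relevant labels for a given text from a label set, and it is used in many settings such as search (Prabhu et al., 2018) and product categorization (Agrawal et al., 2013). Figure 1 shows sample data from Reuters-21578, a general-domain multi-label text classification dataset (Hayes and Weinstein, 1990).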

Figure 1. Sample data from Reuters-21578 (article titles only).

The number after each label is the number of data instances in the dataset that carry that label.

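Multi-label text classification becomes more complicated when the label data exhibit a long-tailed distribution (class imbalance) and label linkage (class co-occurrence) (Figure 2). A long-tailed distribution is a form of class imbalance in which a small number of labels (the head labels) have many data instances, while most labels (the tail labels) have very few. Label linkage means that head labels co-occur with tail labels, which skews the model's weighting toward the head labels. Existing NLP solutions include, among others, resampling tail labels during classification (Estabrooks et al., 2004; Charte et al., 2015), taking class co-occurrence information into account at model initialization (Kurata et al., 2016), and a multi-task architecture that mixes head and tail labels (Yang et al., 2020). However, these solutions either depend on a specially designed model architecture or do not apply to long-tailed data.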

Figure 2. Long-tailed distribution and label linkage in Reuters-21578.

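The heatmap shows the conditional probability p(i|j) that a data instance carrying the label in row j also carries the label in column i.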
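To make the quantity behind the heatmap concrete, the following is a minimal sketch of how p(i|j) could be estimated from raw label sets. The function name `cooccurrence_conditional` and the toy documents are hypothetical and are not taken from the paper or its repository.

```python
def cooccurrence_conditional(label_sets, labels):
    """Return p[j][i], the fraction of instances with label j that also have label i."""
    count = {j: 0 for j in labels}
    joint = {j: {i: 0 for i in labels} for j in labels}
    for instance in label_sets:
        for j in instance:
            count[j] += 1
            for i in instance:
                joint[j][i] += 1
    return {
        j: {i: (joint[j][i] / count[j] if count[j] else 0.0) for i in labels}
        for j in labels
    }


# Toy label sets for illustration only (not real Reuters-21578 counts).
docs = [{"earn"}, {"earn"}, {"earn", "acq"}, {"grain", "wheat"}, {"grain"}]
p = cooccurrence_conditional(docs, ["earn", "acq", "grain", "wheat"])
print(p["wheat"]["grain"], p["grain"]["wheat"])
```

In this toy data the rarer label "wheat" always appears together with "grain", while "grain" appears with "wheat" only half of the time. That asymmetry is the kind of pattern the heatmap is meant to reveal.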

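In recent years, multi-label classification has also been studied extensively in computer vision (CV). Strategies that optimize the loss function have been applied to a variety of CV tasks, such as object recognition (Durand et al., 2019; Milletari et al., 2016), semantic segmentation (Ge et al., 2018), and medical imaging (Li et al., 2020a). Balanced loss functions, such as Focal loss (Lin et al., 2017), Class-balanced loss (Cui et al., 2019), and Distribution-balanced loss (Wu et al., 2020), offer solutions to the long-tailed distribution and label linkage problems in multi-label image classification. Because a change of loss function is independent of the model architecture and can be plugged into common models, similar loss-optimization strategies are gradually being explored in NLP as well (Li et al., 2020b; Cohan et al., 2020). For example, Li et al. (2020b) brought Dice loss (Milletari et al., 2016) from medical image segmentation into NLP and significantly improved model performance on several tasks.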

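In this work, we introduce a new family of balanced loss functions into NLP for multi-label text classification, and we run experiments on Reuters-21578 (a small general-domain dataset) and PubMed (a large biomedical dataset). On both datasets, the distribution-balanced loss outperforms the other loss functions on overall metrics and significantly improves model performance on tail labels. We believe balanced loss functions provide an effective strategy for multi-label text classification applications.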

Method

Loss functions

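Binary cross entropy (BCE) is the most commonly used loss function in multi-label text classification (Bengio et al., 2013). Plain BCE is easily dominated by the large number of head labels or negative samples. In recent years, several new loss functions have made training relatively more balanced by adjusting the weights applied to BCE. We review three such designs here.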

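Focal loss (FL) weights BCE according to how "hard" it is for the model to assign each label to a data instance (Lin et al., 2017). For the same instance, labels that are easy to classify (predicted p close to the true value) are down-weighted strongly, while labels that are hard to classify (predicted p far from the true value) are down-weighted much less, so the hard labels carry relatively more weight than they do under BCE. Because FL adapts well during training, the two loss functions below also include it as a component.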
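The reweighting can be illustrated with a minimal sketch of the standard focal loss from Lin et al. (2017) for a single (instance, label) pair. The function names and the value gamma = 2.0 are illustrative assumptions, not settings confirmed by this article.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one (instance, label) pair.

    p: predicted probability that the label applies; y: 1 if it does, else 0.
    gamma: focusing parameter (illustrative default); gamma = 0 recovers BCE.
    """
    eps = 1e-12
    pt = p if y == 1 else 1.0 - p      # probability assigned to the true outcome
    pt = min(max(pt, eps), 1.0)        # clamp to avoid log(0)
    return -((1.0 - pt) ** gamma) * math.log(pt)


def bce(p, y):
    """Binary cross entropy, written as focal loss with gamma = 0."""
    return focal_loss(p, y, gamma=0.0)


# Ratio FL / BCE for an easy positive (p = 0.9) and a hard positive (p = 0.1).
easy_ratio = focal_loss(0.9, 1) / bce(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / bce(0.1, 1)
print(easy_ratio, hard_ratio)
```

With gamma = 2, the easy positive keeps about 1% of its BCE loss and the hard positive keeps about 81%, so training signal is concentrated on the hard cases.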

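Class-balanced focal loss (CB) estimates the effective number of samples, which accounts for the diminishing marginal benefit of each additional training instance for a label, and uses it to adjust the weights across labels supported by different amounts of training data (Cui et al., 2019).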
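As a minimal sketch of the effective-number idea, the weight from Cui et al. (2019) is the inverse of E_n = (1 - beta^n) / (1 - beta), where n is the number of training instances for a label. The function name and the value beta = 0.9 are illustrative assumptions.

```python
def class_balanced_weight(n, beta=0.9):
    """Inverse effective number of samples for a label with n training instances.

    beta in [0, 1) controls how quickly additional instances stop adding
    information (illustrative default, not a setting from this article).
    """
    effective_num = (1.0 - beta ** n) / (1.0 - beta)
    return 1.0 / effective_num


# A rare label (5 instances) versus a frequent label (1000 instances).
w_tail = class_balanced_weight(5)
w_head = class_balanced_weight(1000)
print(w_tail, w_head)
```

The weight equals 1 for a label with a single instance and approaches 1 - beta as n grows, so frequent labels are down-weighted but never reduced to zero. In CB this weight multiplies the focal loss term for each label.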

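Distribution-balanced loss (DB) adds two components on top of FL (Wu et al., 2020). The first is a rebalancing component, which reduces the redundant information introduced by label linkage. The second is Negative Tolerant Regularization (NTR), which adjusts the weights across labels with different numbers of positive and negative samples and lowers the threshold for tail labels.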
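The two components can be sketched separately under stated assumptions. The code below follows one reading of the formulas in Wu et al. (2020): a smoothed rebalancing weight alpha + sigmoid(beta * (r - mu)), and an NTR term that shifts each logit by a label-specific bias and scales negative logits by lambda. The focal modulation that DB also applies is omitted here. All function names and hyperparameter defaults are illustrative assumptions, and the official implementation may differ in details.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rebalancing_weight(label, instance_labels, counts,
                       alpha=0.1, beta=10.0, mu=0.2):
    """Smoothed rebalancing weight of `label` for one instance.

    instance_labels: non-empty set of positive labels of the instance.
    counts: dict mapping each label to its number of training instances.
    """
    # Sampling probability if only `label` were considered (1/C factors cancel).
    p_class = 1.0 / counts[label]
    # Sampling probability of the instance given all of its positive labels.
    p_instance = sum(1.0 / counts[j] for j in instance_labels)
    r = p_class / p_instance
    return alpha + sigmoid(beta * (r - mu))


def ntr_loss(z, y, n_label, n_total, lam=5.0, kappa=0.05):
    """Negative-tolerant BCE on logit z for a label with prior n_label / n_total."""
    prior = n_label / n_total                # assumes 0 < n_label < n_total
    b_hat = -math.log(1.0 / prior - 1.0)
    v = -kappa * b_hat                       # label-specific bias
    if y == 1:
        return softplus(-(z - v))
    return softplus(lam * (z - v)) / lam     # scaled term for negatives


# Toy counts for illustration only.
counts = {"head": 1000, "tail": 10}
w_alone = rebalancing_weight("head", {"head"}, counts)
w_linked = rebalancing_weight("head", {"head", "tail"}, counts)
print(w_alone, w_linked)
```

In this toy setting the head label receives a smaller weight on an instance that also carries a tail label, since that instance would already be sampled often through the tail label. The NTR term, for its part, shrinks the loss contributed by negatives that are already scored low.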

The specific forms of these loss functions are shown in Figure 3 (summation and averaging terms are omitted for simplicity).

Figure 3. Specific forms of the loss functions.

Datasets

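In this work, we use two multi-label text classification datasets that differ in size and domain (Table 1). The Reuters-21578 dataset contains more than 10,000 news articles published by Reuters in 1987 (Hayes and Weinstein, 1990). We follow the train-test split used by Yang and Liu (1999) and divide the 90 labels evenly into head (30 labels, each with ≥35 instances), medium (31 labels, each with 8-35 instances), and tail (30 labels, each with ≤8 instances) subsets. The PubMed dataset comes from the BioASQ challenge (Licence: 8283NLM123) and contains the titles and abstracts of PubMed articles together with their Medical Subject Headings (MeSH) annotations (Tsatsaronis et al., 2015; Coordinators, 2017). Similarly, the 18,211 labels are divided by quantile into head (6,018 labels, each with ≥50 instances), medium (5,581 labels, each with 15-50 instances), and tail (6,612 labels, each with ≤15 instances) subsets.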

Table 1. Basic information about the experimental datasets.

Experiments

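We compare the different loss functions against a classical SVM one-vs-rest model. For each dataset and model, we compute micro-F1 and macro-F1 scores on the full label set and on the head, medium, and tail label subsets (Wu et al., 2019; Lipton et al., 2014). Table 2 summarizes the results for the different loss functions. On Reuters-21578, BCE performs worst. The effect of the long-tailed distribution is visible when comparing micro-F1 with macro-F1 and when comparing scores across the label groups. On PubMed, where the imbalance is more pronounced, the effect of the long-tailed distribution is even larger.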
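Why the gap between micro-F1 and macro-F1 reflects a long tail can be seen in a minimal sketch that computes both from label sets. The function names and the toy predictions are hypothetical and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Micro-F1 pools counts over labels; macro-F1 averages per-label F1."""
    per_label = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab not in p)
        per_label[lab] = (tp, fp, fn)
    totals = [sum(c[k] for c in per_label.values()) for k in range(3)]
    micro = f1(*totals)
    macro = sum(f1(*c) for c in per_label.values()) / len(labels)
    return micro, macro


# Toy case: a model that always predicts the frequent label.
y_true = [{"head"}] * 8 + [{"tail"}] * 2
y_pred = [{"head"}] * 10
micro, macro = micro_macro_f1(y_true, y_pred, ["head", "tail"])
print(micro, macro)
```

Here micro-F1 stays high because the frequent label dominates the pooled counts, while macro-F1 is roughly halved because the tail label scores zero. A large gap between the two therefore signals poor tail performance.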

Table 2. Comparison of experimental results.

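On the Reuters-21578 dataset, the loss functions FL, CB, R-FL, and NTR-FL perform similarly to BCE on head labels but better than BCE on medium and tail labels, which shows that they mitigate the imbalance problem. DB yields the largest improvement on tail labels, and its overall performance also exceeds that of earlier solutions evaluated on the same dataset, such as Binary Relevance, EncDec, CNN, CNN-RNN, Optimal Completion Distillation, and GNN (Nam et al., 2017; Pal et al., 2020; Tsai and Lee, 2020). On the PubMed dataset, because BCE fails on the medium and tail labels, we use FL as a stronger baseline. All other loss functions outperform FL on medium and tail labels. DB again shows strong results on the overall, medium, and tail label sets.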

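We further run an ablation by removing one component of DB at a time: removing the NTR component gives R-FL, removing the rebalancing component gives NTR-FL, and removing the FL component gives DB-0FL. Comparing these three ablated variants shows the effect of each component. As shown in Table 2, on both datasets, removing the NTR component (R-FL) or the FL component (DB-0FL) lowers performance in every subgroup. Removing the rebalancing component (NTR-FL) gives a similar overall micro-F1, but the overall macro-F1 and the F1 scores on medium and tail labels fall below those of DB, which shows the contribution of the rebalancing component. Finally, we also combine NTR-FL with CB to obtain a new loss function, CB-NTR, whose F1 scores are all better than those of CB on both datasets. The only difference between CB-NTR and DB is that the CB weight replaces the rebalancing weight. DB performs better than or very close to CB-NTR on medium and tail labels, possibly because the rebalancing weight handles label linkage and thereby improves the model.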

Conclusion

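To address class imbalance in multi-label text classification, we studied strategies for optimizing the loss function and systematically compared several balanced loss functions. We are the first to introduce DB into NLP, and we designed a new balanced loss function, CB-NTR. Experiments on the open datasets Reuters-21578 (90 labels, general domain) and PubMed (18,211 labels, biomedical domain) show that DB outperforms the other loss functions. This study shows that optimizing the loss function can effectively address class imbalance in multi-label text classification. Because the strategy only requires changing the loss function, it is compatible with a wide range of neural network model architectures and is also applicable to other NLP tasks affected by long-tailed distributions.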

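Fujun Shi (施涪军), CIO of Roche Pharma China: This work grew out of the collaborating teams' exploration of deep learning applications in biomedicine. Compared with everyday text, biomedical corpora tend to be more specialized and more sparsely annotated, so AI applications face a "last mile" challenge in deployment. Starting from problems such as the long-tailed distribution of sparse annotations, the paper brings in loss functions from frontier CV research and optimizes them, so that existing NLP models can, with their architecture unchanged, shift training resources toward classes with fewer instances and thereby improve overall performance. We are pleased to see that the strategy is equally effective on everyday text facing similar problems, and we look forward to continued collaboration with universities and companies on the research and application of frontier technologies.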

References:

Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Francisco Charte, Antonio J Rivera, María J del Jesus, and Francisco Herrera. 2015. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163:3–16.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.

NCBI Resource Coordinators. 2017. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 46(D1):D8–D13.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9260–9269.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

T. Durand, N. Mehrasa, and G. Mori. 2019. Learning a deep convnet for multi-label classification with partial labels. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 647–657, Los Alamitos, CA, USA. IEEE Computer Society.

Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. 2004. A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1):18–36.

Weifeng Ge, Sibei Yang, and Yizhou Yu. 2018. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Philip J. Hayes and Steven P. Weinstein. 1990. Construe/tis: A system for content-based indexing of a database of news stories. In Proceedings of the The Second Conference on Innovative Applications of Artificial Intelligence, IAAI ’90, page 49–64. AAAI Press.

Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Jianqiang Li, Guanghui Fu, Yueda Chen, Pengzhi Li, Bo Liu, Yan Pei, and Hui Feng. 2020a. A multi-label classification model for full slice brain computerised tomography image. BMC Bioinformatics, 21(6):200.

Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2020b. Dice loss for data-imbalanced NLP tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 465–476, Online. Association for Computational Linguistics.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, Los Alamitos, CA, USA. IEEE Computer Society.

Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014. Optimal thresholding of classifiers to maximize f1 measure. In Machine Learning and Knowledge Discovery in Databases, pages 225–239, Berlin, Heidelberg. Springer Berlin Heidelberg.

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571.

Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Ankit Pal, Muru Selvakumar, and Malaikannan Sankarasubbu. 2020. Magnet: Multi-label text classification using attention-based graph neural network. In ICAART (2), pages 494–505.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pages 993–1002.

Che-Ping Tsai and Hung-yi Lee. 2020. Order-free learning alleviating exposure bias in multi-label classification. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 6038–6045. AAAI Press.

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artieres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A meta-learning approach for multi-label classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4354–4364, Hong Kong, China. Association for Computational Linguistics.

Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. 2020. Distribution-balanced loss for multi-label classification in long-tailed datasets. In Computer Vision – ECCV 2020, pages 162–178, Cham. Springer International Publishing.

Wenshuo Yang, Jiyi Li, Fumiyo Fukumoto, and Yanming Ye. 2020. HSCNN: A hybrid-Siamese convolutional neural network for extremely imbalanced multi-label text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6716–6722, Online. Association for Computational Linguistics.

Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, page 42–49, New York, NY, USA. Association for Computing Machinery.
