NLP (40): Extracting Entity Recognition Results from Tag Sequences with the seqeval Module

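In the earlier article NLP (23): Using seqeval, an Evaluation Module for Sequence Labeling Algorithms, I first introduced the seqeval module. It makes it easy to evaluate sequence labeling models, and it can be plugged into the training process of a Keras model.
seqeval also provides a get_entities function, which quickly extracts complete entities from a tag sequence and supports common tagging schemes such as BIO and BMESO. Let's look at its source code: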

def get_entities(seq, suffix=False):
    """Gets entities from sequence.

    Args:
        seq (list): sequence of labels.

    Returns:
        list: list of (chunk_type, chunk_start, chunk_end).

    Example:
        >>> from seqeval.metrics.sequence_labeling import get_entities
        >>> seq = ['B-PER', 'I-PER', 'O', 'B-LOC']
        >>> get_entities(seq)
        [('PER', 0, 1), ('LOC', 3, 3)]
    """
    # for nested list
    if any(isinstance(s, list) for s in seq):
        seq = [item for sublist in seq for item in sublist + ['O']]

    prev_tag = 'O'
    prev_type = ''
    begin_offset = 0
    chunks = []
    for i, chunk in enumerate(seq + ['O']):
        if suffix:
            tag = chunk[-1]
            type_ = chunk.split('-')[0]
        else:
            tag = chunk[0]
            type_ = chunk.split('-')[-1]

        if end_of_chunk(prev_tag, tag, prev_type, type_):
            chunks.append((prev_type, begin_offset, i-1))
        if start_of_chunk(prev_tag, tag, prev_type, type_):
            begin_offset = i
        prev_tag = tag
        prev_type = type_

    return chunks

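The function takes a tag sequence as input and returns a list of entities, each given as a tuple of entity type, start index, and end index (the end index is inclusive, as the docstring example shows).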
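The source excerpt calls two helpers, end_of_chunk and start_of_chunk, whose bodies are not shown. To make the boundary logic easier to follow, here is a minimal, simplified sketch of how such checks could be written for BIO/BMES-style prefix tags. The names is_chunk_end, is_chunk_start, and extract_entities are hypothetical and chosen for this illustration; this is not seqeval's actual implementation, and it handles only the prefix form (suffix=False).

```python
def is_chunk_end(prev_tag, tag, prev_type, type_):
    """Return True if an entity that was open at the previous position ends there."""
    if prev_tag in ('E', 'S'):            # E and S always close an entity
        return True
    if prev_tag in ('B', 'I') and tag in ('B', 'S', 'O'):
        return True                       # a new entity or an O tag follows
    # still inside an entity, but the entity type changed
    return prev_tag != 'O' and prev_type != type_


def is_chunk_start(prev_tag, tag, prev_type, type_):
    """Return True if a new entity starts at the current position."""
    if tag in ('B', 'S'):                 # B and S always open an entity
        return True
    if prev_tag in ('E', 'S', 'O') and tag in ('E', 'I'):
        return True                       # malformed input: inside-tag with no open entity
    # entity tag whose type differs from the previous position
    return tag != 'O' and prev_type != type_


def extract_entities(seq):
    """Return a list of (type, start, end) tuples, with end inclusive."""
    prev_tag, prev_type = 'O', ''
    begin = 0
    chunks = []
    # the extra 'O' sentinel closes an entity that runs to the end of the sequence
    for i, label in enumerate(list(seq) + ['O']):
        tag = label[0]                    # B, I, M, E, S or O
        type_ = label.split('-')[-1]      # e.g. PER, LOC
        if is_chunk_end(prev_tag, tag, prev_type, type_):
            chunks.append((prev_type, begin, i - 1))
        if is_chunk_start(prev_tag, tag, prev_type, type_):
            begin = i
        prev_tag, prev_type = tag, type_
    return chunks


print(extract_entities(['B-PER', 'I-PER', 'O', 'B-LOC']))
```

The sentinel 'O' appended to the sequence mirrors the `seq + ['O']` trick in the source above: it guarantees that an entity ending on the last token is still flushed into the result.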
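As an example, we take the named entity recognition program from the article Getting Started with NLP (6): Introduction to and Usage of pyltp, and extract the entities from its tag sequence in two ways: with a hand-written loop and with the seqeval module. The code is as follows: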

# -*- coding: utf-8 -*-

import os
from pyltp import Segmentor, Postagger

# Word segmentation
cws_model_path = os.path.join(os.path.dirname(__file__), 'ltp_v3.4/cws.model')  # path to the segmentation model, `cws.model`
lexicon_path = os.path.join(os.path.dirname(__file__), 'ltp_v3.4/lexicon.txt')  # path to the user-defined lexicon file

segmentor = Segmentor()
segmentor.load_with_lexicon(cws_model_path, lexicon_path)

sent = "据韩联社12月28日反映，美国防部发言人杰夫·莫莱尔27日表示，美国防部长盖茨将于2011年1月14日访问韩国。"
# sent = "记者4日从中国航空工业集团有限公司获悉，AG600项目研制加速推进，001架机在成功完成陆上、水上、海上首飞之后，于3月4日在湖北荆门漳河机场完成灭火任务系统首次科研试飞，飞机状态良好。"
# sent = "大临铁路通车当天，81岁的佤族老人田学明专程赶到临沧站，观看列车发车。“我最大的梦想，就是有一天火车能开进阿佤山。今天，我的梦想终于实现了！”"
words = segmentor.segment(sent)  # segment the sentence into words

# Part-of-speech tagging
pos_model_path = os.path.join(os.path.dirname(__file__), 'ltp_v3.4/pos.model')  # path to the POS tagging model, `pos.model`

postagger = Postagger()  # create the tagger
postagger.load(pos_model_path)  # load the model
postags = postagger.postag(words)  # tag each word with its part of speech


# Named entity recognition
ner_model_path = os.path.join(os.path.dirname(__file__), 'ltp_v3.4/ner.model')   # path to the NER model, `ner.model`

from pyltp import NamedEntityRecognizer
recognizer = NamedEntityRecognizer() # create the recognizer
recognizer.load(ner_model_path)  # load the model
netags = list(recognizer.recognize(words, postags))  # one NER tag per word
print(list(words))
print(netags)

# Extract person, place, and organization names with a hand-written loop
persons, places, orgs = set(), set(), set()
i = 0
for tag, word in zip(netags, words):
    j = i
    # Person names (Nh)
    if 'Nh' in tag:
        if str(tag).startswith('S'):
            persons.add(word)
        elif str(tag).startswith('B'):
            union_person = word
            # append the following words until the closing E-Nh tag;
            # the bounds check avoids an IndexError if E-Nh is missing
            while netags[j] != 'E-Nh' and j + 1 < len(words):
                j += 1
                union_person += words[j]
            persons.add(union_person)
    # Place names (Ns)
    if 'Ns' in tag:
        if str(tag).startswith('S'):
            places.add(word)
        elif str(tag).startswith('B'):
            union_place = word
            while netags[j] != 'E-Ns' and j + 1 < len(words):
                j += 1
                union_place += words[j]
            places.add(union_place)
    # Organization names (Ni)
    if 'Ni' in tag:
        if str(tag).startswith('S'):
            orgs.add(word)
        elif str(tag).startswith('B'):
            union_org = word
            while netags[j] != 'E-Ni' and j + 1 < len(words):
                j += 1
                union_org += words[j]
            orgs.add(union_org)

    i += 1

print('人名：', '，'.join(persons))  # person names
print('地名：', '，'.join(places))  # place names
print('组织机构：', '，'.join(orgs))  # organization names

# Extract person, place, and organization names with seqeval
from seqeval.metrics.sequence_labeling import get_entities
from collections import defaultdict
seq_result = get_entities(netags)  # list of (type, start, end) tuples, end inclusive
words = list(words)
ner_result_dict = defaultdict(list)
for seq_res in seq_result:
    # join the words covered by the entity span and group them by entity type
    ner_result_dict[seq_res[0]].append("".join(words[seq_res[1]:seq_res[2]+1]))
print('人名：', ner_result_dict["Nh"])  # person names
print('地名：', ner_result_dict["Ns"])  # place names
print('组织机构：', ner_result_dict["Ni"])  # organization names

# Release the models
segmentor.release()
postagger.release()
recognizer.release()

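The output is as follows (人名 = person names, 地名 = place names, 组织机构 = organizations; the first three label lines come from the hand-written loop, the last three from seqeval):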

['据', '韩联社', '12月', '28日', '反映', '，', '美', '国防部', '发言人', '杰夫·莫莱尔', '27日', '表示', '，', '美', '国防部长', '盖茨', '将', '于', '2011年', '1月', '14日', '访问', '韩国', '。']
['O', 'S-Ni', 'O', 'O', 'O', 'O', 'B-Ni', 'E-Ni', 'O', 'S-Nh', 'O', 'O', 'O', 'S-Ns', 'O', 'S-Nh', 'O', 'O', 'O', 'O', 'O', 'O', 'S-Ns', 'O']
人名： 盖茨，杰夫·莫莱尔
地名： 韩国，美
组织机构： 美国防部，韩联社
人名： ['杰夫·莫莱尔', '盖茨']
地名： ['美', '韩国']
组织机构： ['韩联社', '美国防部']
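Because each entity comes with word-level start and end indices, the spans can also be mapped back to character positions in the original sentence, which the set-based hand-written approach cannot do. The following is a minimal sketch of that idea; the function name to_char_spans is hypothetical, and the entity tuples are written out by hand in the (type, start, end) format for the tag sequence printed above rather than produced by calling seqeval.

```python
def to_char_spans(words, entities):
    """Convert word-level (type, start, end) spans, end inclusive, into
    (type, text, char_start, char_end) tuples, char_end exclusive."""
    # char_starts[k] is the character offset at which words[k] begins
    char_starts = []
    offset = 0
    for w in words:
        char_starts.append(offset)
        offset += len(w)

    spans = []
    for type_, start, end in entities:
        text = "".join(words[start:end + 1])
        char_start = char_starts[start]
        spans.append((type_, text, char_start, char_start + len(text)))
    return spans


# Segmented words of the first example sentence, copied from the output above
words = ['据', '韩联社', '12月', '28日', '反映', '，', '美', '国防部', '发言人',
         '杰夫·莫莱尔', '27日', '表示', '，', '美', '国防部长', '盖茨', '将', '于',
         '2011年', '1月', '14日', '访问', '韩国', '。']
# Entity tuples in the (type, start, end) format, written by hand for this sketch
entities = [('Ni', 1, 1), ('Ni', 6, 7), ('Nh', 9, 9),
            ('Ns', 13, 13), ('Nh', 15, 15), ('Ns', 22, 22)]

sentence = "".join(words)
for type_, text, start, end in to_char_spans(words, entities):
    # slicing the sentence with the character span recovers the entity text
    print(type_, text, start, end, sentence[start:end] == text)
```

This assumes that joining the segmented words reproduces the original sentence exactly, which holds for Chinese text segmented without inserted whitespace.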

Trying the two other sentences (the ones commented out in the script) gives the following results:

['记者', '4日', '从', '中国', '航空', '工业', '集团', '有限公司', '获悉', '，', 'AG600', '项目', '研制', '加速', '推进', '，', '001架机', '在', '成功', '完成', '陆上', '、', '水上', '、', '海上', '首', '飞', '之后', '，', '于', '3月', '4日', '在', '湖北', '荆门', '漳河', '机场', '完成', '灭火', '任务', '系统', '首', '次', '科研', '试飞', '，', '飞机', '状态', '良好', '。']
['O', 'O', 'O', 'B-Ni', 'I-Ni', 'I-Ni', 'I-Ni', 'E-Ni', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Ns', 'I-Ns', 'I-Ns', 'E-Ns', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
人名： 
地名： 湖北荆门漳河机场
组织机构： 中国航空工业集团有限公司
人名： []
地名： ['湖北荆门漳河机场']
组织机构： ['中国航空工业集团有限公司']

['大临铁路', '通车', '当天', '，', '81', '岁', '的', '佤族', '老人', '田学明', '专程', '赶到', '临沧站', '，', '观看', '列车', '发车', '。', '“', '我', '最', '大', '的', '梦想', '，', '就', '是', '有', '一', '天', '火车', '能', '开进', '阿佤山', '。', '今天', '，', '我', '的', '梦想', '终于', '实现', '了', '！', '”']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-Nh', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-Ns', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
人名： 田学明
地名： 阿佤山
组织机构： 
人名： ['田学明']
地名： ['阿佤山']
组织机构： []

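From the above we can see:

The seqeval-based extraction finds the same entities as the hand-written one;
The seqeval-based extraction is more concise: judging by the code, its core needs only 3-4 lines, whereas the hand-written loop needs more than 30.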
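The first observation can be checked mechanically. The hand-written loop collects names in sets (unordered), while the seqeval-based code collects them in lists in order of appearance, so a fair comparison ignores order. Below is a small sketch using the results printed for the first sentence; the helper name same_entities is hypothetical.

```python
def same_entities(manual, extracted):
    """Compare per-type entity sets with per-type entity lists, ignoring order."""
    keys = set(manual) | set(extracted)
    return all(set(manual.get(k, ())) == set(extracted.get(k, ())) for k in keys)


# Results of the hand-written loop for the first sentence (sets)
manual = {
    'Nh': {'盖茨', '杰夫·莫莱尔'},
    'Ns': {'韩国', '美'},
    'Ni': {'美国防部', '韩联社'},
}
# Results of the seqeval-based extraction for the same sentence (lists)
extracted = {
    'Nh': ['杰夫·莫莱尔', '盖茨'],
    'Ns': ['美', '韩国'],
    'Ni': ['韩联社', '美国防部'],
}

print(same_entities(manual, extracted))
```

Note that a set-based comparison also ignores repeated mentions; if an entity occurs twice in a sentence, the list keeps both occurrences while the set keeps one.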

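This article showed how to use the seqeval module to quickly extract entity recognition results from a tag sequence, a problem that comes up often in named entity recognition work. Using seqeval makes that work more efficient and keeps the code concise and elegant.
That's all for this post. Thanks for reading!
March 5, 2021, Yangpu, Shanghai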