Recognition of Crops, Diseases and Pesticides Named Entities in Chinese Based on Conditional Random Fields
(Original title: 基于条件随机场的农作物病虫害及农药命名实体识别)

Fund Project:

National Natural Science Foundation of China (61502500), Beijing Natural Science Foundation (4164090), and the Fundamental Research Funds for the Central Universities (2017QC077)

    Abstract:

    On online agricultural technology Q&A platforms, thousands of new questions await expert answers every day. Because these platforms currently rely entirely on manual answering, responses are slow and answer quality is hard to guarantee. An intelligent answering system backed by an agricultural technology knowledge base could answer some questions automatically, and building that knowledge base requires extracting “crop-disease-pesticide” named-entity triples from the large volume of existing question-and-answer data. However, few studies have reported recognition methods for Chinese named entities of diseases and pesticides, and the accuracies of existing methods for crop named entities are low. Based on the characteristics of crop, disease and pest, and pesticide named entities, a recognition method based on conditional random fields (CRF) was therefore proposed for agricultural technology Q&A data. The data set was formatted and automatically segmented into words, and each word was then automatically annotated with several features: whether it contained specific cue characters, whether it contained specific radicals, whether it was a numeral or quantifier, whether it was a specific left or right boundary word, and its part of speech. A CRF model trained on the annotated data classified each word, judging whether it belonged to a crop, disease, or pesticide named entity and identifying its position within a compound named entity. The three types of named entities could thus be recognized and the associated triples constructed automatically. Selecting feature combinations and adjusting the context window size through experiments improved recognition accuracy and reduced model training time. The recognition accuracies for crop, disease, and pesticide named entities were 97.72%, 87.63% and 98.05%, respectively, significantly higher than those of existing methods.
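    The feature annotation and context window described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cue-character list, the radical lookup (approximated here by a small hand-picked character set), the BIO-style position tags, and the toy sentence are all assumptions made for illustration, since the abstract does not give the actual lists or tag scheme.

```python
# Hypothetical cue characters; the paper's actual lists are not given in the abstract.
CUE_CHARS = {"病", "虫", "蚜", "螨", "剂", "灵", "素"}
# Stand-in for a real radical lookup: characters assumed to carry the 虫 or 艹 radical.
RADICAL_CHARS = {"蚜", "螨", "蛾", "菜", "茄", "葱"}
NUMERAL_CHARS = set("0123456789一二三四五六七八九十百千万两半")


def token_features(word, pos):
    """Return the per-token features for one segmented word and its POS tag."""
    return {
        "word": word,
        "pos": pos,
        "has_cue": any(ch in CUE_CHARS for ch in word),
        "has_radical": any(ch in RADICAL_CHARS for ch in word),
        "is_numeral": bool(word) and all(ch in NUMERAL_CHARS for ch in word),
    }


def sentence_features(tokens, window=1):
    """Give each token its own features plus those of neighbours within +/- window."""
    base = [token_features(word, pos) for word, pos in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                # Prefix each feature name with its relative position, e.g. "-1:pos".
                for name, value in base[j].items():
                    row[f"{offset:+d}:{name}"] = value
            else:
                # Mark positions that fall outside the sentence.
                row[f"{offset:+d}:pad"] = True
        rows.append(row)
    return rows


# Toy segmented sentence with POS tags, plus assumed BIO-style position labels.
tokens = [("黄瓜", "n"), ("霜霉", "n"), ("病", "n"), ("用", "v"), ("多菌灵", "n")]
labels = ["B-CROP", "B-DISEASE", "I-DISEASE", "O", "B-PESTICIDE"]
features = sentence_features(tokens, window=1)
```

    Each row in `features` pairs with one entry in `labels`, which is the form of input a linear-chain CRF toolkit typically expects. Widening `window` adds more context features per token at the cost of a larger feature space, which is the trade-off between accuracy and training time that the abstract reports tuning.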

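Cite this article

李想, 魏小红, 贾璐, 陈昕, 刘磊, 张彦娥. Recognition of Crops, Diseases and Pesticides Named Entities in Chinese Based on Conditional Random Fields[J]. Transactions of the Chinese Society for Agricultural Machinery (农业机械学报), 2017, 48(s1): 178-185.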

History
  • Received: 2017-07-10
  • Published online: 2017-12-10