A COMPARATIVE STUDY OF WORD REPRESENTATION METHODS WITH CONDITIONAL RANDOM FIELDS AND MAXIMUM ENTROPY MARKOV FOR BIO-NAMED ENTITY RECOGNITION

Maan Tareq Abd; Masnizah Mohd

doi:10.22452/mjcs.sp2018no1.2

Published: Dec 28, 2018

Keywords:

biomedical named entity; prototypical representation; data representation methods; Word2Vec

Maan Tareq Abd

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia

Masnizah Mohd

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia

Abstract

Bio-Named Entity Recognition (Bio-NER) is the process of identifying and semantically classifying biomedical technical terms and named entities in the biomedical literature, which makes it a major task in biomedical knowledge acquisition. Natural Language Processing (NLP) plays an important role in Bio-NER. The first and most essential task in biomedical literature mining is the recognition of biomedical entities such as proteins, genes, and chemicals. Most recent Bio-NER methods rely on predefined traditional features that attempt to capture the specific surface properties of entity types. However, these empirically predefined feature sets differ between entity types and are manually constructed and complicated, so developing them is costly. In this paper, we present a systematic comparative evaluation of three representation methods: the traditional feature representation method, the continuous bag-of-words (CBOW) model, and a new prototypical representation method. Each is combined with two popular sequence-labeling approaches, Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMMs). We evaluated these models on two major Bio-NER tasks, based on the JNLPBA and GENETAG corpora. We examined the prototypical word representation method and found that Word2Vec can be used successfully for Bio-NER. Our results show that the new prototypical representation method improved the performance of both machine learning models on both datasets, and that it outperformed the traditional feature representation method and the CBOW model on both datasets. Finally, the CRF classifier with the new prototypical representation method achieved the best results when 90% of the data was used for training, yielding overall F-measure values of 0.79 and 0.85 for the JNLPBA and GENETAG corpora, respectively. In comparison, the MEMM classifier yielded overall F-measure values of 0.76 and 0.78 for the JNLPBA and GENETAG corpora, respectively.

How to Cite

Abd, M. T., & Mohd, M. (2018). A COMPARATIVE STUDY OF WORD REPRESENTATION METHODS WITH CONDITIONAL RANDOM FIELDS AND MAXIMUM ENTROPY MARKOV FOR BIO-NAMED ENTITY RECOGNITION. Malaysian Journal of Computer Science, 15–30. https://doi.org/10.22452/mjcs.sp2018no1.2

Issue

2018: Special Issue December 2018: "Information Retrieval and Knowledge Management Special Issue Publication"

Section

Articles
