Evaluating and improving morpho-syntactic classification over multiple corpora using pre-trained, “off-the-shelf”, parts-of-speech tagging tools

Glass, K; Bangay, Shaun

File(s) under permanent embargo

journal contribution

posted on 2008-01-01, 00:00 authored by K Glass, Shaun Bangay

This paper evaluates six commonly available parts-of-speech tagging tools over corpora other than those upon which they were originally trained. In particular, this investigation measures the performance of the selected tools over varying styles and genres of text without retraining, under the assumption that domain-specific training data is not always available. An investigation is performed to determine whether improved results can be achieved by combining the set of tagging tools into ensembles that use voting schemes to determine the best tag for each word. It is found that while accuracy drops due to non-domain-specific training and tag-mapping between corpora, accuracy remains very high, with the support vector machine-based tagger and the decision tree-based tagger performing best over different corpora. It is also found that an ensemble containing a support vector machine-based tagger, a probabilistic tagger, a decision tree-based tagger and a rule-based tagger produces the largest increase in accuracy and the largest reduction in error across different corpora, using the Precision-Recall voting scheme.
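To illustrate the general idea of combining taggers by voting, here is a minimal Python sketch. It is not the authors' implementation, and the abstract does not define the Precision-Recall scheme, so the weighted variant below is only a simplified stand-in that weights each tagger's vote by an assumed per-tag precision. The function names, tag labels and precision figures are all hypothetical, and the sketch assumes every tagger's output has already been mapped to a common tagset.

```python
from collections import Counter, defaultdict


def majority_vote(tags):
    """Return the tag proposed by the most taggers.

    Ties are broken in favour of the earliest tagger in the list.
    """
    counts = Counter(tags)
    best = max(counts.values())
    for tag in tags:
        if counts[tag] == best:
            return tag


def weighted_vote(tags, precisions):
    """Return the tag with the highest summed weight.

    tags[i] is the tag proposed by tagger i. precisions[i] maps a tag to
    tagger i's estimated precision on that tag (hypothetical values that
    would normally be measured on held-out data).
    """
    scores = defaultdict(float)
    for tag, precision in zip(tags, precisions):
        scores[tag] += precision.get(tag, 0.0)
    return max(scores, key=scores.get)


# Proposals from four hypothetical taggers for a single word.
proposals = ["NN", "VB", "VB", "NN"]

# Assumed per-tag precision for each tagger, in the same order.
precisions = [
    {"NN": 0.90, "VB": 0.80},
    {"VB": 0.95},
    {"VB": 0.70},
    {"NN": 0.60},
]

print(majority_vote(proposals))
print(weighted_vote(proposals, precisions))
```

With these made-up numbers the plain majority vote is tied and falls back to the first tagger's proposal, while the weighted vote prefers the tag backed by the taggers that are assumed to be more precise on it. That difference is the motivation for weighting votes rather than simply counting them.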

History

Journal

South African Computer Journal

Volume

40

Pagination

4 - 10

Publisher

Computer Society of South Africa

Location

Halfway House, South Africa

ISSN

1015-7999

Language

eng

Publication classification

C1.1 Refereed article in a scholarly journal

Copyright notice

2008, Computer Society of South Africa
