Bios is a suite of syntactico-semantic analyzers that includes the most common
tools needed for the shallow analysis of English text.
The following tools are currently included:
- Smart tokenizer that recognizes abbreviations, SGML tags, etc.
- Part-of-speech (POS) tagger. The POS tagger is implemented as a wrapper
around the TNT tagger by Thorsten Brants.
- Syntactic chunking using the labels promoted by the CoNLL chunking
evaluations.
- Named-Entity Recognition and Classification (NERC) for the CoNLL entity
types plus an additional 11 numerical entity types.
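To illustrate the chunk labels mentioned above, here is a minimal sketch that
groups tokens into chunks from CoNLL-style tags. It assumes the B-/I-/O tag
scheme of the CoNLL-2000 chunking task (B-NP opens a noun phrase, I-NP continues
it, O is outside any chunk). The class and method names are hypothetical and
are not part of the Bios API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Bios: decodes CoNLL-style B-/I-/O tags.
public class BioChunks {
    /** Groups tokens into labeled chunks such as "[NP the current account deficit]". */
    public static List<String> decode(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null; // text of the chunk being built
        String label = null;          // label of the chunk being built
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            // I-X extends the open chunk only if its label matches.
            boolean continues = tag.startsWith("I-") && current != null
                    && tag.substring(2).equals(label);
            if (continues) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            // Any other tag closes the open chunk, if there is one.
            if (current != null) {
                chunks.add("[" + label + " " + current + "]");
                current = null;
                label = null;
            }
            // B-X starts a chunk; a stray I-X is leniently treated as a start too.
            if (!tag.equals("O")) {
                label = tag.substring(2);
                current = new StringBuilder(tokens[i]);
            }
        }
        if (current != null) {
            chunks.add("[" + label + " " + current + "]");
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "reckons", "the", "current", "account", "deficit", "will", "narrow"};
        String[] tags   = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP"};
        System.out.println(decode(tokens, tags));
    }
}
```

The same decoding applies to named-entity tags if they follow the same B-/I-/O
convention (for example B-PER, I-PER), which is also an assumption here.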
Why should you use this software?
There are at least four reasons:
- You can configure it for very high accuracy but slower execution (using
Yamcha) or for high speed and slightly lower accuracy (using my own
implementation of an asymmetric Perceptron). Note: Maximum Entropy (ME) is
also supported, but the ME models are not included in this package because
they are both less accurate and slower than the Perceptron. See the project
NEWS file for performance numbers using Yamcha and Perceptron.
- It has built-in models for both case-sensitive and case-insensitive text.
- You can retrain Bios on your own corpus and label set.
- It has a clean, easy-to-use Java API.