This code implements a linear interpolation of several linear-time parsing models (all based on MaltParser). Each individual parser runs in its own thread, so if enough cores are available, the overall runtime is close to that of a single MaltParser model. The resulting parser has state-of-the-art performance yet remains very fast. For example, we parse Stanford dependencies slightly better than the Stanford parser and about 100 times faster, i.e., we parse approximately 100 sentences/second.
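The threading and interpolation idea described above can be illustrated with a minimal sketch. It is not the actual Ensemble implementation: the `BaseParser` interface, the `EnsembleSketch` class, and the weights are hypothetical names introduced for illustration only, and the combination step is a simple per-token weighted vote over predicted head indices, which does not by itself guarantee a well-formed tree.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EnsembleSketch {
    /** Hypothetical stand-in for one base model: returns the predicted head index of each token. */
    public interface BaseParser {
        int[] parse(String[] tokens);
    }

    /** Runs every base parser in its own thread, then combines the predictions. */
    public static int[] parse(List<BaseParser> parsers, double[] weights, String[] tokens)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parsers.size());
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (BaseParser p : parsers) {
                Callable<int[]> task = () -> p.parse(tokens);
                futures.add(pool.submit(task));
            }
            List<int[]> predictions = new ArrayList<>();
            for (Future<int[]> f : futures) {
                predictions.add(f.get()); // blocks until that parser has finished
            }
            return combine(predictions, weights, tokens.length);
        } finally {
            pool.shutdown();
        }
    }

    /** For each token, picks the head with the largest summed weight across models. */
    public static int[] combine(List<int[]> predictions, double[] weights, int n) {
        int[] heads = new int[n];
        for (int i = 0; i < n; i++) {
            Map<Integer, Double> score = new HashMap<>();
            for (int m = 0; m < predictions.size(); m++) {
                score.merge(predictions.get(m)[i], weights[m], Double::sum);
            }
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<Integer, Double> e : score.entrySet()) {
                if (e.getValue() > bestScore) {
                    best = e.getKey();
                    bestScore = e.getValue();
                }
            }
            heads[i] = best;
        }
        return heads;
    }
}
```

With one thread per model, the wall-clock time of `parse` is bounded by the slowest base parser plus the cost of the vote, which is why the ensemble can stay close to single-parser speed on a machine with enough cores.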
We provide models for parsing both Stanford dependencies and CoNLL-2008 dependencies. Our current scores in these two domains, using the traditional Section 23 of the Penn Treebank for testing, are:
CoNLL-2008 (with predicted POS tags):
Stanford (with gold POS tags):
Mihai Surdeanu, David McClosky, Christopher Manning
If you use this code for a research publication, please cite this paper:
Mihai Surdeanu and Christopher D. Manning. Ensemble Models for Dependency Parsing: Cheap and Good?
Proceedings of the North American Chapter of the Association for Computational Linguistics Conference (NAACL-2010), 2010.
Copyright (c) 2009-2010 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
The MaltParser, which is used to generate all individual parsing models, is distributed with its own "as is" license (LICENSE.MaltParser).
The easiest way to run the ensemble parser is to use the included properties files. We provide one file for CoNLL-2008 dependencies (conll08.properties) and one for Stanford dependencies (stanford.dependencies). Please read these files to see the available parameters.
For example, to parse a corpus with CoNLL-2008 dependencies, use this command:
java edu.stanford.nlp.parser.ensemble.Ensemble --arguments conll08.properties --run test --testCorpus <YOUR-TEST-CORPUS>
or use the included ensemble.sh shell script (probably simpler because it includes all necessary jar files):
sh ensemble.sh conll08.properties <YOUR-TEST-CORPUS>
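To see at a glance which parameters a properties file such as conll08.properties sets, a small helper like the following may be useful. It is only a sketch, and it assumes the files use the standard java.util.Properties key=value format; the `ListParameters` class is a hypothetical name and is not part of this distribution.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

public class ListParameters {
    /** Returns the parameter names found in a properties stream, sorted alphabetically. */
    public static Set<String> parameterNames(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in); // assumes standard key=value properties syntax
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new TreeSet<>(props.stringPropertyNames());
    }

    /** Usage: java ListParameters conll08.properties */
    public static void main(String[] args) throws IOException {
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            for (String name : parameterNames(in)) {
                System.out.println(name);
            }
        }
    }
}
```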
To re-train the CoNLL-2008 models (useful only to developers!), use this command:
java edu.stanford.nlp.parser.ensemble.Ensemble --arguments conll08.properties --run train --trainCorpus <YOUR-TRAINING-CORPUS>
Take a look at the input* files in the samples/ directory for examples of correctly formatted input files.