========================================================================
Twelfth Conference on Computational Natural Language Learning (CoNLL 2008)
Shared Task Distribution -- Official Release

http://www.yr-bcn.es/conll2008/

Created February 28, 2008

Organizers:
  Mihai Surdeanu
  Richard Johansson
  Adam Meyers
  Lluis Marquez
  Joakim Nivre
========================================================================

WARNING

The data in this distribution uses portions of the Penn Treebank II
collection. For participants who do not own a valid license for the Penn
Treebank II collection, LDC is providing an "evaluation license", valid
during competition time, which allows the free download and use of the
CoNLL-2008 shared task datasets. See the shared task website for details.

GENERAL

This is the 20080228 release of the shared task corpus. This release is
intended to be stable, but it is subject to minor changes and updates if
errors are found (please inform the organizers as soon as possible if you
notice something wrong with the datasets).

The 2008 CoNLL shared task focuses on the identification of syntactic
dependencies (from the Penn Treebank [TB]) and semantic dependencies
(from PropBank [PB] and NomBank [NB]). This year's shared task is
monolingual: only English is covered.

The syntactic dependencies follow the format and description of the
previous shared tasks (with some notable exceptions; see the website for
details). To identify semantic dependencies, systems must first identify
the semantic predicates in each sentence. For each target predicate, all
corresponding roles must then be identified.

Please consult the shared task webpage for a detailed description of the
task, instructions on how to participate, the calendar, and updates to
the task and data.

DIRECTORY STRUCTURE

The following directories are included in this distribution:

* trial/      : Contains the trial corpus
* train/      : The complete training corpus (covers Sections 02-21 of
                TreeBank)
* devel/      : Development corpus (Section 24 of TreeBank)
* test.wsj/   : In-domain test corpus (Section 23 of TreeBank)
* test.brown/ : Out-of-domain test corpus (Sections ck01, ck02, and ck03
                of the Brown corpus)

Each data directory contains two files:

* .closed : Contains the data relevant for the closed challenge.
* .open   : Contains the additional data for the open challenge.

DATA FORMAT

The format of the file for the closed challenge is detailed in the
shared task website:

http://www.yr-bcn.es/dokuwiki/doku.php?id=conll2008:format

Note that the test corpora have the GPOS column filled with "_" and no
syntactic or semantic dependency information is provided (columns 9+).

The additional data provided for the open challenge (e.g., trial.open)
follows the same column-based format as the data for the closed
challenge. For the open challenge, five additional columns are provided:

1. Named entity (NE) labels using the tag set from the CoNLL-2003 shared
   task (Tjong Kim Sang and De Meulder 2003).
2. NE labels using the tag set from the BBN Wall Street Journal Entity
   Corpus [BBN].
3. WordNet [WN] super senses (Ciaramita and Altun 2006).
4 and 5. Syntactic dependencies generated by the MALT parser
   (Nivre et al 2006).

PREPROCESSING SYSTEMS

The input annotations provided for both closed and open challenges are
generated using the following state-of-the-art systems:

*) The predicted Part-of-Speech (PoS) tags (i.e., the PPOS and PPOSS
   columns in the closed-challenge file) are generated using the PoS
   tagger of (Gimenez and Marquez 2004).
*) The lemmas (LEMMA and SPLIT_LEMMA columns) are extracted from WordNet
   using the most common sense for the corresponding predicted PoS tag.
*) Columns 1 to 3 in the open-challenge file are generated using the
   semantic tagger of (Ciaramita and Altun 2006).
*) Columns 4 and 5 in the open-challenge file are generated using the
   MALT parser (Nivre et al 2006).

REFERENCES

(Ciaramita and Altun 2006)
  M. Ciaramita and Y. Altun
  "Broad Coverage Sense Disambiguation and Information Extraction with a
  Supersense Sequence Tagger"
  Proc. of EMNLP, 2006

(Gimenez and Marquez 2004)
  J. Gimenez and L. Marquez
  "SVMTool: A general POS tagger generator based on Support Vector
  Machines"
  Proc. of LREC, 2004

(Nivre et al 2006)
  J. Nivre, J. Hall, J. Nilsson and G. Eryigit
  "Labeled Pseudo-Projective Dependency Parsing with Support Vector
  Machines"
  Proc. of the CoNLL-X Shared Task, 2006

(Tjong Kim Sang and De Meulder 2003)
  E. F. Tjong Kim Sang and F. De Meulder
  "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named
  Entity Recognition"
  Proc. of CoNLL-2003, 2003

[PB]  PropBank Project:
      http://verbs.colorado.edu/~mpalmer/projects/ace.html
[NB]  NomBank Project:
      http://nlp.cs.nyu.edu/meyers/NomBank.html
[TB]  Penn TreeBank II Project:
      http://www.cis.upenn.edu/~treebank
[BBN] Pronoun coreference and entity type corpus:
      LDC catalog number LDC2005T33
[WN]  WordNet:
      http://wordnet.princeton.edu/

ACKNOWLEDGMENTS

The organizers thank Massimiliano Ciaramita for his help with the
semantic tagger and Jesus Gimenez for PoS tagging the corpus.
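
EXAMPLE: READING THE DATA FILES

The column-based layout described under DATA FORMAT can be loaded with a
few lines of Python. The following is only a minimal sketch, not an
official tool of the shared task. It assumes that each token occupies one
line of whitespace-separated columns and that sentences are separated by
blank lines; please check the format page on the shared task website
before relying on either assumption. The function name read_sentences is
hypothetical and was chosen for this illustration.

```python
from typing import Iterable, Iterator, List

# One token row is a list of column strings; one sentence is a list of rows.
Row = List[str]
Sentence = List[Row]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from a column-based file, one at a time.

    Assumptions (not confirmed by this README): columns are separated by
    whitespace and sentences are separated by blank lines.
    """
    sentence: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split())
    # Emit the last sentence when the input does not end with a blank line.
    if sentence:
        yield sentence
```

Because the function accepts any iterable of lines, an open file object
can be passed to it directly, and the sentences are produced lazily
rather than being held in memory all at once. The same reader should also
work for the open-challenge files, since they follow the same
column-based format.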