This archive contains additional material to our paper:

  Christian M. Meyer and Iryna Gurevych: What Psycholinguists Know About
  Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage,
  in: Proceedings of the 5th International Joint Conference on Natural
  Language Processing (IJCNLP), November 2011. Chiang Mai, Thailand.
  http://www.ukp.tu-darmstadt.de/data/sense-alignment/

Please cite this paper if you plan to use our datasets.

--------------------------------------------------------------------------------

The following files are available from our homepage:

1) ijcnlp2011-meyer-data-alignment.tsv.bz2

A full alignment of Wiktionary and WordNet in a bzip2-compressed,
tab-separated file.

File format:
* WN_SYNSET: the synset id of WordNet 3.0. The id consists of a synset's
  database offset and its part of speech tag. See the WordNet documentation
  for more information on synset ids: http://wordnet.princeton.edu/.
* WKT_ID: the sense id of the English Wiktionary edition from April 3, 2010
  as generated by the Java-based Wiktionary Library (JWKTL) Version 0.15.2.
  See the JWKTL homepage for more information on sense ids:
  http://www.ukp.tu-darmstadt.de/software/jwktl/. If you do not have a parsed
  Wiktionary edition for this date at hand, you can use the dataset
  described in (3).

This file contains all WordNet synsets and all Wiktionary senses. If a synset
or sense has no counterpart in the other resource, "null" is printed instead
of the synset or sense id.

--------------------------------------------------------------------------------

2) ijcnlp2011-meyer-data-classification.tsv.bz2

The classification data used to create the alignment described in (1) in a
bzip2-compressed, tab-separated file.

File format:
* WN_SYNSET: the WordNet 3.0 synset id, see (1).
* WKT_ID: the Wiktionary sense id, see (1).
* SIM_COS: the similarity score of the COS measure described in the paper.
* SIM_PPR: the similarity score of the PPR measure described in the paper.
* IS_ALIGNED: the automatic judgment of our classifier on whether the WordNet
  synset and the Wiktionary sense should be aligned (= "1") or not aligned
  (= "0").

--------------------------------------------------------------------------------

3) ijcnlp2011-meyer-data-wiktionary.tsv.bz2

An excerpt of the English Wiktionary edition from April 3, 2010 parsed with
the Java-based Wiktionary Library (JWKTL) Version 0.15.2. The data is
available as a bzip2-compressed, tab-separated file.

File format:
* WKT_ID: the Wiktionary sense id for this Wiktionary edition. The ids
  correspond to the alignment results in files (1) and (2).
* LEXEME: the lexeme that the Wiktionary article describes; e.g. 'plant'.
* POS: the lexeme's part of speech tag (N = noun, V = verb, A = adjective,
  R = adverb, ? = other).
* GLOSS: the sense gloss; e.g. "an organism that is not an animal [...]" for
  'plant'.
* EXAMPLES: an example sentence for this sense; e.g. "The garden had a couple
  of [...] plants around the border" for 'plant' (might be empty).
* SYNONYMS: synonymous words for this sense; e.g. 'gratis' for 'free' (might
  be empty; multiple synonyms are separated by semicolons).

--------------------------------------------------------------------------------

4) ijcnlp2011-meyer-dataset.txt

The annotated dataset used for the evaluation of our work.

File format:
* WN synset offset: the WordNet 3.0 synset offset, which is used to form the
  synset id.
* pos: the WordNet 3.0 part of speech tag of the synset, which is used to
  form the synset id.
* lemma: Wiktionary's lexeme.
* WKT id: the sense id in Wiktionary, as described in (1).
* annotation: the gold standard annotation for the sense pair of WordNet
  synset and Wiktionary sense.

--------------------------------------------------------------------------------

5) ijcnlp2011-meyer-dataset_annotation-guidebook.pdf

The corresponding annotation guidebook that was given to the annotators.
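--------------------------------------------------------------------------------

Usage sketch.

The alignment file described in (1) can be read with a few lines of Python.
The following is only a minimal sketch, not part of the original archive. It
assumes that the file is UTF-8 encoded, that the first two tab-separated
columns are WN_SYNSET and WKT_ID in that order, and that a header row, if one
exists at all, repeats these column names. The function names
parse_alignment and read_alignment are hypothetical helpers introduced here
for illustration.

```python
import bz2
import csv
from typing import Iterable, Iterator, Optional, Tuple

# One alignment row: (WordNet synset id, Wiktionary sense id).
# Either side is None when the file prints "null" for it.
Pair = Tuple[Optional[str], Optional[str]]


def parse_alignment(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield (wn_synset, wkt_id) pairs from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        wn_synset, wkt_id = row[0], row[1]
        if wn_synset == "WN_SYNSET":
            continue  # assumption: a header row repeats the column names
        yield (
            None if wn_synset == "null" else wn_synset,
            None if wkt_id == "null" else wkt_id,
        )


def read_alignment(path: str) -> Iterator[Pair]:
    """Stream pairs from the bzip2-compressed file without unpacking it."""
    # Assumption: the file is UTF-8 encoded.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from parse_alignment(handle)
```

Calling read_alignment("ijcnlp2011-meyer-data-alignment.tsv.bz2") then yields
one pair per row. Pairs in which one side is None correspond to synsets or
senses that have no counterpart in the other resource.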
--------------------------------------------------------------------------------

License issues.

The Wiktionary dataset is available under the Creative Commons
Attribution/Share-Alike License (CC-BY-SA). See
http://creativecommons.org/licenses/by-sa/3.0/ and http://www.wiktionary.org/
for details.

WordNet is a registered trademark of Princeton University. Please refer to
http://wordnet.princeton.edu for further details. The data can also be
obtained from their homepage.

--------------------------------------------------------------------------------

Contact.

In case of any questions, please don't hesitate to contact the corresponding
author Christian M. Meyer:
http://www.ukp.tu-darmstadt.de/people/christian-m-meyer/