CorpusDraw displays the tree structures assigned to sentences in
a parsed corpus and allows an annotator to edit these trees in the
course of corpus construction or revision. It can also be used to
display parse trees for presentation purposes.
CorpusDraw is a module within the CorpusSearch program. On any computer where CorpusSearch has
been downloaded and installed, CorpusDraw is also available. It has been used extensively under
Linux and MacOS X but has not been tested under Windows.
CorpusDraw is invoked with the following command, where "/FOO" represents the path to the CorpusSearch .jar file:
% java -classpath /FOO/CS.jar drawtree/CorpusDraw
An alias to this command can be included in a .cshrc or .bashrc file, as described for CorpusSearch itself
in the installation chapter for the program.
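As an illustration, such an alias might look like the following line in a .bashrc file. The alias name "CD" is only an assumption made here by analogy with the "CS" alias used for CorpusSearch, and "/FOO" must be replaced by the real path to the .jar file. A .cshrc file uses the csh alias syntax instead.

```shell
# Hypothetical alias name "CD"; replace /FOO with the actual path to CS.jar.
alias CD='java -classpath /FOO/CS.jar drawtree/CorpusDraw'
```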
CorpusDraw accepts a command file and a source file as its command-line arguments:
- The command ("query") file argument is optional. When provided, it specifies structural constraints on which
sentences to display.
- The source file argument is obligatory and specifies which file of parsed data to display.
In addition, CorpusDraw will read in a file of legal syntactic and
part-of-speech tags, if one is supplied. The source file, command file, and
CorpusSearch program itself must reside in different directories. The recommended
directory configuration has a root corpus directory with three sister
subdirectories, one for CorpusSearch itself, one for the corpus source files and
one for the command files and for the file of legal tags. When starting CorpusDraw,
the current directory should normally be the root directory of the corpus, with the
path to the corpus file being worked on specified on the command line.
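The recommended configuration can be sketched as a small shell function. The subdirectory names "CorpusSearch", "parsed" and "queries" are illustrative assumptions (the last two echo the paths used in the legal tags example in this chapter); CorpusDraw does not require these particular names.

```shell
# Create a root corpus directory with three sister subdirectories.
# All directory names are assumptions chosen for illustration.
make_corpus_layout() {
    root="$1"
    mkdir -p "$root/CorpusSearch"   # the CorpusSearch program (CS.jar)
    mkdir -p "$root/parsed"         # corpus source files (.psd)
    mkdir -p "$root/queries"        # command (.q) files
}
```

After running, for example, "make_corpus_layout mycorpus", the annotator would change into "mycorpus" before starting CorpusDraw.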
A source file is any file that contains parsed, labelled sentences. This could
be a file from the Penn Parsed Corpora of Historical English or from another
parsed corpus.
The command file contains a query, which describes a structure that a sentence
must match in order to be displayed by CorpusDraw. The use of such a command
file allows the annotator to view only those sentences relevant to a given
editing change being implemented on the corpus.
In order to prevent the accidental introduction of ill-formed labels, CorpusDraw can be given a file of all allowed tags
(both phrasal and part-of-speech).
The file of allowed or "legal" tags is generated from existing parsed files by using a query file with the following
command as its content:
make_tag_list: t
It is possible to add a line specifying the character encoding of the corpus to the query file, as in the following
sample, but that information is perhaps better included in the preferences file.
corpus_encoding: UTF-8
make_tag_list: t
Like any other query file, the legal tags query file should have a .q extension.
Here is an example of how the legal tags query would be invoked:
CS queries/legaltags.q parsed/DONE/*.psd
The legal tags creation query outputs a file with the same basename as the query file, but with the .tag extension.
(It also outputs a spurious empty .out file which should be discarded.)
Caution: The corpus tag set and hence the .tag file must not contain any tags that consist of or begin with a
hyphen or a colon, since these characters function as delimiters.
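A quick check for offending tags might look like the sketch below. It assumes that the .tag file lists one tag per line, which this chapter does not state, so the pattern may need adjusting to the actual file layout. The function name "find_bad_tags" is hypothetical.

```shell
# Print, with line numbers, every line that begins with a hyphen or a colon.
# Assumption: the .tag file holds one tag per line.
# Exit status is 0 if offending lines were found, non-zero otherwise.
find_bad_tags() {
    grep -n -e '^[-:]' "$1"
}
```

For example, "find_bad_tags queries/legaltags.tag" prints nothing when no line starts with either delimiter character.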
When generated, the .tag file is placed into the same directory as its associated .q file. In order for CorpusDraw
to read it, however, it must be moved to the directory from which CorpusDraw is invoked (ordinarily the directory
above the one containing the parsed files). Note that CorpusDraw expects the directory from which it is invoked,
the directory containing the parsed files, and the queries directory to be distinct. If they aren't, CorpusDraw will
issue a warning.
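The housekeeping after running the legal tags query can be sketched as follows. The function name "install_tag_file" is hypothetical, and the sketch assumes that the spurious .out file is written next to the .q file, which should be verified locally.

```shell
# Run from the directory in which CorpusDraw will be started.
# $1 is the query path without its extension, e.g. queries/legaltags.
install_tag_file() {
    base="$1"
    # Discard the .out file only if it exists and is empty.
    if [ -f "$base.out" ] && [ ! -s "$base.out" ]; then
        rm "$base.out"
    fi
    # Move the .tag file into the current directory.
    mv "$base.tag" .
}
```

With the paths used above, the call would be "install_tag_file queries/legaltags".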
When invoked, CorpusDraw automatically looks for a .tag file in the same directory. On opening the display, it
displays a message containing the name of the .tag file when it succeeds or a warning if there is a problem with
the .tag file (as when it contains illegal characters - see above). If no message is displayed, no .tag file has
been read, and tag editing is not constrained.
The CorpusDraw GUI is intended to be largely self-explanatory.
The display contains the following parts:
- a tree display window
- a window containing the text of the displayed sentence
- a top row of buttons for editing the tree
- a second row of buttons for modifying the display for ease of use
The scroll bars at the bottom and on the right edge of the tree display window
allow different parts of the tree to be centered in the window. This
can also be accomplished by clicking on the word in the text window
that the user wishes to place in the center of the display. The arrows
at the left of the editing button row move the display from one sentence
to the next.
The editing buttons allow the annotator:
- to change node labels
- to move nodes and their descendants around in the tree
- to coindex nodes
- to add empty categories of the various types specified in the legal tags
file
CorpusDraw will not permit the annotator to accidentally change the
order of words in the sentence or to delete any text.
The actions controlled by the editing buttons can also be triggered by the
use of shortcuts, both keystrokes and mouse clicks. Some of these require a
sequence of keystrokes or clicks. A current list of these shortcuts can be
found in the last section of this chapter.
When CorpusDraw is displaying a file with the name "foo.psd" and the file is
saved after changes have been made, the saved file has the name
"foo.psd.new". This change in name guarantees that changes can easily be
discarded.
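One possible way of handling the ".new" file afterwards is sketched below. The function names are hypothetical, the ".bak" backup is an added precaution rather than anything CorpusDraw does, and the sketch assumes that the original "foo.psd" is still on disk unchanged.

```shell
# Promote the edited copy: keep a backup of the original, then replace it.
accept_changes() {
    orig="$1"                          # e.g. parsed/foo.psd
    [ -f "$orig.new" ] || return 1     # no edited copy was saved
    cp "$orig" "$orig.bak" && mv "$orig.new" "$orig"
}

# Throw the edited copy away and keep the original as it was.
discard_changes() {
    rm -f "$1.new"
}
```

Before choosing between the two, "diff parsed/foo.psd parsed/foo.psd.new" shows exactly what was changed.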