Contents of this chapter:

What is CorpusDraw?
Obtaining and running CorpusDraw
Input to CorpusDraw
source file(s)
command file
file of legal tags
The CorpusDraw graphical user interface
the tree display window
the text window
editing buttons
display buttons
Output of CorpusDraw

What is CorpusDraw?

CorpusDraw displays the tree structures assigned to sentences in a parsed corpus and allows an annotator to edit these trees in the course of corpus construction or revision. It can also be used to display parse trees for presentation purposes.

Obtaining and running CorpusDraw

CorpusDraw is a module within the CorpusSearch program. On any computer where CorpusSearch has been downloaded and installed, CorpusDraw is also available. It has been used extensively under Linux and MacOS X but has not been tested under Windows.

CorpusDraw is invoked with the following command, where "/FOO" represents the path to the directory containing the CorpusSearch .jar file:

% java  -classpath  /FOO/CS.jar  drawtree/CorpusDraw

An alias to this command can be included in a .cshrc or .bashrc file, as described for CorpusSearch itself in the installation chapter for the program.
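
As a minimal sketch of such an alias for a .bashrc file: the alias name "CD" is an assumption chosen for illustration, and "/FOO" must be replaced by the real path on your machine.

```shell
# Hypothetical entry for ~/.bashrc. The alias name "CD" is an assumption;
# /FOO stands for the directory that actually contains CS.jar.
alias CD='java -classpath /FOO/CS.jar drawtree/CorpusDraw'

# The csh/tcsh form for a .cshrc file omits the "=" sign:
#   alias CD 'java -classpath /FOO/CS.jar drawtree/CorpusDraw'
```

After the shell has re-read its startup file, typing the alias name should have the same effect as typing the full java command.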

Input to CorpusDraw

CorpusDraw accepts a command file and a source file as its command line arguments:

  1. The command ("query") file is optional; when provided, it specifies structural constraints on which sentences to display.
  2. The source file is obligatory and specifies which file of parsed data to display.

In addition, CorpusDraw will read in a file of legal syntactic and part-of-speech tags, if one is supplied.

The source file, command file, and CorpusSearch program itself must reside in different directories. The recommended directory configuration has a root corpus directory with three sister subdirectories: one for CorpusSearch itself, one for the corpus source files, and one for the command files and the file of legal tags. When starting CorpusDraw, the current directory should normally be the root directory of the corpus, with the path to the corpus file being worked on specified on the command line.
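
The recommended layout can be sketched as follows. All directory names here (mycorpus, corpussearch, parsed, queries) and the file name foo.psd are assumptions for illustration only; any names will do as long as the three subdirectories are distinct.

```shell
# Hypothetical corpus root; set CORPUS_ROOT to use a different location.
root="${CORPUS_ROOT:-./mycorpus}"

# Three sister subdirectories under the root:
#   corpussearch/  the CorpusSearch program (CS.jar)
#   parsed/        the corpus source files
#   queries/       the command files and the file of legal tags
mkdir -p "$root/corpussearch" "$root/parsed" "$root/queries"

# CorpusDraw would then be started from the root, giving the path to the
# source file on the command line, for example:
#   cd "$root"
#   java -classpath corpussearch/CS.jar drawtree/CorpusDraw parsed/foo.psd
```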

source file

A source file is any file that contains parsed, labelled sentences. This could be a file from the Penn Parsed Corpora of Historical English or from another parsed corpus.

command file

The command file contains a query, which describes a structural condition that a sentence must meet in order to be displayed by CorpusDraw. Using such a command file allows the annotator to view only those sentences that are relevant to a particular editing change being made to the corpus.

file of legal tags

In order to prevent the accidental introduction of ill-formed labels, CorpusDraw can be given a file of all allowed tags (both phrasal and part-of-speech).

The file of allowed or "legal" tags is generated from existing parsed files by using a query file with the following command as its content:

make_tag_list: t

It is possible to add a line specifying the character encoding of the corpus to the query file, as in the following sample, but that information is perhaps better included in the preferences file.

corpus_encoding: UTF-8
make_tag_list: t

Like any other query file, this one should have a .q extension.
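
One way to create such a query file from the command line is sketched below. The directory name queries and the file name legaltags.q are assumptions; the file contents are the two lines shown in the sample, of which the corpus_encoding line is optional.

```shell
# Write the legal-tags query into a (hypothetically named) queries directory.
mkdir -p queries
cat > queries/legaltags.q <<'EOF'
corpus_encoding: UTF-8
make_tag_list: t
EOF
```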

Here is an example of how the legal tags query would be invoked:

CS queries/legaltags.q parsed/DONE/*.psd

The legal tags creation query outputs a file with the same basename as the query file, but with the .tag extension. (It also outputs a spurious empty .out file which should be discarded.)

Caution: The corpus tag set and hence the .tag file must not contain any tags that consist of or begin with a hyphen or a colon, since these characters function as delimiters.
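
A quick check for such tags might look like the sketch below. It assumes that the .tag file lists one tag per line, which is not stated in this chapter; if the file is laid out differently, the pattern must be adapted. The function name check_tag_file is hypothetical.

```shell
# Report lines of a .tag file that begin with a hyphen or a colon.
# ASSUMPTION: the .tag file holds one tag per line.
check_tag_file() {
    # grep prints each offending line with its line number and exits 0
    # when it finds one, so a match means the file is NOT clean.
    if grep -n '^[-:]' "$1"; then
        return 1
    fi
    return 0
}
```

For example, `check_tag_file legaltags.tag` prints nothing and succeeds when no line starts with a delimiter character.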

When generated, the .tag file is placed into the same directory as its associated .q file. In order for CorpusDraw to read it, however, it must be moved to the directory from which CorpusDraw is invoked (ordinarily the directory above the one containing the parsed files). Note that CorpusDraw expects the directory from which it is invoked, the directory containing the parsed files, and the queries directory to be distinct. If they aren't, CorpusDraw will issue a warning.
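
Moving the .tag file can be sketched as a small shell function. The function name install_tag_file is hypothetical, and the sketch assumes that the spurious .out file is written next to the .tag file and that the current directory is the one from which CorpusDraw will be invoked.

```shell
# Move the generated .tag file from the queries directory into the current
# directory. The function name and the .out location are assumptions.
install_tag_file() {
    query_file="$1"            # e.g. queries/legaltags.q
    base="${query_file%.q}"    # strip the .q extension
    if [ ! -f "$base.tag" ]; then
        echo "no tag file found at $base.tag" >&2
        return 1
    fi
    mv "$base.tag" .
    # Discard the spurious .out file, but only if it really is empty.
    if [ -f "$base.out" ] && [ ! -s "$base.out" ]; then
        rm "$base.out"
    fi
}
```

With the file names used in the example above, the call would be `install_tag_file queries/legaltags.q`, run from the root directory of the corpus.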

When invoked, CorpusDraw automatically looks for a .tag file in the directory from which it was invoked. On opening the display, it shows a message containing the name of the .tag file if the file was read successfully, or a warning if there is a problem with the file (for example, when it contains illegal characters; see above). If no message is displayed, no .tag file has been read, and tag editing is not constrained.

The CorpusDraw graphical user interface

The CorpusDraw GUI is intended to be largely self-explanatory. The display contains the following parts: the tree display window, the text window, the editing buttons, and the display buttons.

The scroll bars at the bottom and on the right edge of the tree display window allow different parts of the tree to be centered in the window. This can also be accomplished by clicking on the word in the text window that the user wishes to place in the center of the display. The arrows at left of the editing button row move the display from one sentence to the next.

The editing buttons allow the annotator:

CorpusDraw will not permit the annotator to accidentally change the order of words in the sentence or to delete any text.

The actions controlled by the editing buttons can also be triggered by the use of shortcuts, both keystrokes and mouse clicks. Some of these require a sequence of keystrokes or clicks. A current list of these shortcuts can be found in the last section of this chapter.

Output of CorpusDraw

When CorpusDraw is displaying a file named "foo.psd" and the file is saved after changes have been made, the saved file is named "foo.psd.new". Because the original file is left untouched, unwanted changes can easily be discarded.
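
Accepting or discarding a saved file can be sketched with two small shell functions. The function names and the .bak backup suffix are assumptions made for illustration, and the sketch assumes that the .new file is written next to the original.

```shell
# Accept the edits: keep a backup of the original, then put the .new file
# in its place. The .bak suffix is an assumption.
accept_edits() {
    src="$1"                   # e.g. parsed/foo.psd
    if [ ! -f "$src.new" ]; then
        echo "no $src.new to accept" >&2
        return 1
    fi
    cp "$src" "$src.bak" && mv "$src.new" "$src"
}

# Discard the edits: remove the .new file and leave the original as it was.
discard_edits() {
    rm -f "$1.new"
}
```

Before choosing, `diff foo.psd foo.psd.new` shows what an editing session changed.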