CorpusSearch finds linguistic structures in a corpus of parsed, labelled sentences. It also has other features, including support for the automatic creation of coding strings for statistical analysis and the automatic creation of a lexicon for a corpus.
A new feature of CorpusSearch 2 is support for corpus creation, in the form of automated modification of corpus tree structures. This feature is useful for correcting systematic errors and for applying global changes in annotation guidelines to an entire corpus.
CorpusSearch needs two pieces of information: one or more source files to search, and a command file that specifies the search.
A source file is any file that contains parsed, labelled sentences. This could be a file from the Penn Parsed Corpora of Historical English or from another parsed corpus. It could also be an output file from a previous search, or perhaps a file of sentences that the user has cut and pasted together. Any number of source files can be searched in a single run of CorpusSearch.
The command file contains a query, which describes the structures being searched for, and possibly additional control and output specifications. This additional material may specify the node boundaries within which to search, and may set options that control the form of the output.
CorpusSearch always builds a text output file, containing the sentences with the specified structure, and basic statistics.
The output file contains the sentences that were found to contain the searched-for structure, along with comments describing where the structures were found. Statistics are kept detailing the number of "hits," that is, distinct constituents containing the structure, the number of matrix sentences ("tokens") containing hits, and the total number of tokens in the file. Notice that the number of hits may change depending on the definition of the boundary node.
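The difference between hits and tokens, and the way the hit count depends on the boundary node, can be illustrated with a minimal Python sketch. This is not CorpusSearch itself and does not use its query language. The helper names (`parse`, `subtrees`, `count`), the two sample sentences, and the toy "query" (a boundary node that immediately dominates a node with a given label) are all assumptions made for illustration. The sketch also assumes that each tree is a single labelled bracketing with no wrapper or ID node, and that a boundary is matched by label prefix.

```python
def parse(text):
    """Parse one labelled bracketing into (label, children) tuples.
    Children are either nested tuples or plain strings (words)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield the tree itself and every constituent below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def count(trees, boundary_prefix, child_label):
    """Return (hits, tokens containing hits, total tokens).
    A hit is a boundary node that immediately dominates child_label."""
    hits = tokens_with_hits = 0
    for tree in trees:
        n = sum(
            1
            for label, kids in subtrees(tree)
            if label.startswith(boundary_prefix)
            and any(isinstance(k, tuple) and k[0] == child_label for k in kids)
        )
        hits += n
        tokens_with_hits += int(n > 0)
    return hits, tokens_with_hits, len(trees)


# Hypothetical sample data: one sentence with an embedded clause, one without a subject.
trees = [
    parse("(IP-MAT (NP-SBJ (PRO She)) (VBD said) "
          "(CP-THT (C that) (IP-SUB (NP-SBJ (PRO he)) (VBD left))))"),
    parse("(IP-MAT (VBI Go))"),
]

print(count(trees, "IP", "NP-SBJ"))      # (2, 1, 2): both clauses of sentence 1 are hits
print(count(trees, "IP-MAT", "NP-SBJ"))  # (1, 1, 2): only the matrix clause is a hit
```

With the broader boundary, the first sentence contributes two hits but still counts as one token; narrowing the boundary to the matrix clause drops the hit count to one while the token counts stay the same.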
CorpusSearch can be asked to create an output file in which a coding string is added to each boundary node in the corpus that matches a given query. The content of the columns in the coding string can be specified automatically by subqueries.
CorpusSearch can be asked to generate the set of all local syntactic environments within which a given word of the corpus occurs. Local environments are defined as syntactic sisters of the part-of-speech label of the word and are called local frames.
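As a rough illustration of what a local frame is, the following Python sketch collects, for a given word, the labels of the sisters of its part-of-speech node. It is an assumption-laden sketch rather than CorpusSearch's own procedure or output format: the function names `parse` and `local_frames`, the sample sentences, and the choice to represent a frame as the tuple of all daughter labels of the parent (including the word's own part-of-speech label, to show its position) are hypothetical.

```python
from collections import Counter


def parse(text):
    """Parse one labelled bracketing into (label, children) tuples.
    Children are either nested tuples or plain strings (words)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def local_frames(trees, word):
    """Count the local frames in which `word` occurs.
    A frame is the tuple of labels of all daughters of the node that
    immediately dominates the word's part-of-speech node."""
    frames = Counter()

    def walk(node):
        _, kids = node
        for kid in kids:
            if isinstance(kid, tuple):
                if word in kid[1]:    # kid is the part-of-speech node of the word
                    frames[tuple(k[0] for k in kids if isinstance(k, tuple))] += 1
                walk(kid)

    for tree in trees:
        walk(tree)
    return frames


# Hypothetical sample data: the same verb with two different sets of sisters.
trees = [
    parse("(IP-MAT (NP-SBJ (PRO they)) (VBD gave) "
          "(NP-OB2 (PRO him)) (NP-OB1 (D a) (N book)))"),
    parse("(IP-MAT (NP-SBJ (PRO she)) (VBD gave) (NP-OB1 (N money)))"),
]

for frame, n in local_frames(trees, "gave").items():
    print(n, " ".join(frame))
```

Here the verb is found in two distinct frames, one with two objects and one with a single object, each occurring once.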
CorpusSearch can be asked to generate a lexicon for a corpus. The lexicon is a list of every word in the corpus along with the number of times it occurs under each part-of-speech label that it can have.
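The idea of such a lexicon can be sketched in a few lines of Python. This is only an illustration of the counting involved, not CorpusSearch's implementation or output format. The names `parse` and `build_lexicon` and the sample sentences are assumptions, each tree is assumed to be a single labelled bracketing with no wrapper or ID node, and the sketch makes no attempt to skip traces or other empty categories or to normalize case.

```python
from collections import Counter, defaultdict


def parse(text):
    """Parse one labelled bracketing into (label, children) tuples.
    Children are either nested tuples or plain strings (words)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def build_lexicon(trees):
    """Map each word to a Counter of the part-of-speech labels it occurs under."""
    lexicon = defaultdict(Counter)

    def walk(node):
        label, kids = node
        for kid in kids:
            if isinstance(kid, tuple):
                walk(kid)
            else:
                lexicon[kid][label] += 1   # kid is a word, label is its POS tag

    for tree in trees:
        walk(tree)
    return lexicon


# Hypothetical sample data: "fish" appears once as a noun and once as a verb.
trees = [
    parse("(IP-MAT (NP-SBJ (D the) (N fish)) (VBP swim))"),
    parse("(IP-MAT (NP-SBJ (PRO they)) (VBP fish))"),
]

lexicon = build_lexicon(trees)
for word in sorted(lexicon):
    print(word, dict(lexicon[word]))
```

In this sample, the entry for "fish" records one occurrence under N and one under VBP, which is the kind of per-label count the lexicon is described as containing.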