What Is CorpusSearch?

Contents of this chapter:

What is CorpusSearch?
input to CorpusSearch
output of CorpusSearch

search output
coding output
frames output
lexicon output

What is CorpusSearch?

CorpusSearch finds linguistic structures in a corpus of parsed, labelled sentences. It also has other features, including support for the automatic creation of coding strings for statistical analysis and the automatic creation of a lexicon for a corpus.

A new feature of CorpusSearch 2 is support for corpus creation, in the form of automated modification of corpus tree structures. This feature is useful for correcting systematic errors and for applying global changes in annotation guidelines to an entire corpus.

input to CorpusSearch

CorpusSearch needs two pieces of information:

a corpus of sentences to search (source file(s)).
a specification of what structures to search for (command file).

source file(s)

A source file is any file that contains parsed, labelled sentences. This could be a file from the Penn Parsed Corpora of Historical English or from another parsed corpus. It could also be an output file from a previous search, or a file of sentences that the user has cut and pasted together. Any number of source files can be searched in a single run of CorpusSearch.
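To make "parsed, labelled sentences" concrete, the following Python sketch reads one sentence written in a simple labelled-bracket notation and turns it into a nested structure. It is only an illustration: the function name parse_tree, the example sentence, and the simplified bracket format are assumptions made for this sketch, and real corpus files may carry extra wrappers or ID nodes that it does not handle. It is not part of CorpusSearch.

```python
def parse_tree(text):
    """Parse one bracketed, labelled sentence into nested (label, children) tuples.

    Assumed toy format: every constituent is "(LABEL child child ...)",
    where a child is either another constituent or a bare word.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return read_node()


# Hypothetical example sentence, invented for illustration.
sentence = "(IP-MAT (NP-SBJ (PRO he)) (VBD came))"
tree = parse_tree(sentence)
print(tree[0])  # label of the root constituent
```

Each constituent becomes a (label, children) pair, so the labels a query refers to can be inspected directly.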

command file

The command file contains a query, which describes the structures being searched for, and possibly additional control and output specifications. These specifications may set the node boundaries within which to search and select options that control the form of the output.

output of CorpusSearch

CorpusSearch always builds a text output file containing the sentences that have the specified structure, along with basic statistics.

search output

The output file contains the sentences that were found to contain the searched-for structure, along with comments describing where the structures were found. It also reports three statistics: the number of "hits," that is, distinct constituents containing the structure; the number of matrix sentences ("tokens") containing hits; and the total number of tokens in the file. Notice that the number of hits may change depending on the definition of the boundary node.
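The relationship between hits, tokens, and the boundary node can be illustrated with a small Python sketch. This is a simplified model, not CorpusSearch's actual algorithm: it assumes that a hit is any constituent whose label matches the boundary pattern and that contains the structure somewhere inside it, and that the structure is "a node matching one label immediately dominates a node matching another". The function names, the trailing-"*" wildcard handling, and the two example sentences are all hypothetical.

```python
def label_matches(label, pattern):
    """Match a label against a pattern; a trailing '*' matches any suffix (assumed)."""
    if pattern.endswith("*"):
        return label.startswith(pattern[:-1])
    return label == pattern


def subtrees(node):
    """Yield the node itself and every constituent below it."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def has_structure(node, parent_pat, child_pat):
    """True if some constituent in `node` matches parent_pat and has an
    immediate child matching child_pat."""
    for sub in subtrees(node):
        if label_matches(sub[0], parent_pat):
            for child in sub[1]:
                if isinstance(child, tuple) and label_matches(child[0], child_pat):
                    return True
    return False


def count(corpus, boundary, parent_pat, child_pat):
    """Return (hits, tokens containing hits, total tokens)."""
    hits = 0
    tokens_with_hits = 0
    for token in corpus:
        token_hits = sum(
            1
            for sub in subtrees(token)
            if label_matches(sub[0], boundary)
            and has_structure(sub, parent_pat, child_pat)
        )
        hits += token_hits
        if token_hits:
            tokens_with_hits += 1
    return hits, tokens_with_hits, len(corpus)


# Two invented tokens: the first has a pronoun subject in both the matrix
# clause and the embedded clause; the second has no pronoun subject.
corpus = [
    ("IP-MAT", [
        ("NP-SBJ", [("PRO", ["he"])]),
        ("VBD", ["said"]),
        ("CP-THT", [
            ("IP-SUB", [
                ("NP-SBJ", [("PRO", ["she"])]),
                ("VBD", ["came"]),
            ]),
        ]),
    ]),
    ("IP-MAT", [
        ("NP-SBJ", [("N", ["dogs"])]),
        ("VBP", ["bark"]),
    ]),
]

wide = count(corpus, "IP*", "NP-SBJ", "PRO")       # every IP is a boundary
narrow = count(corpus, "IP-MAT", "NP-SBJ", "PRO")  # only matrix IPs are boundaries
print(wide, narrow)
```

Under these assumptions the same corpus and the same structure yield two hits when every IP counts as a boundary node and one hit when only matrix IPs do, while the number of tokens containing hits and the total number of tokens stay the same.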

Basic Concepts

Contents of this chapter: