CorpusSearch has been used to search Old, Middle, and Modern English corpora, as well as corpora of Chinese, Korean, and Yiddish. For CorpusSearch
to search a corpus, the corpus must meet the following formatting requirements:
- Every sentence in the corpus must be completely parsed; that is, every word must be labeled and must be included within the outside brackets of some sentence.
- Phrasal and part-of-speech labels may not contain a space or other whitespace character, nor may they begin with a digit.
- Constituents must be bracketed with parentheses -- "(" and ")" -- not with square brackets or other delimiters.
- Every sentence must have a "wrapper", that is, an unlabeled pair of parentheses surrounding the sentence.
Below is an example of a sentence bracketed in accordance with these guidelines, using the labels of the PPCME2 and PPCEME. Note that CorpusSearch is indifferent to the choice of phrasal and part-of-speech labels.
(ADVP-TMP (ADV Then))
(NP-SBJ (D the) (N child))
(ADJP (ADJR happier) (CONJ and) (ADJR happier))
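
The bracketing requirements listed above can be checked mechanically before a corpus is handed to CorpusSearch. The following Python sketch is an illustration only and is not part of CorpusSearch: the function name check_tree and its messages are hypothetical, and it assumes one sentence per input string. It checks for balanced parentheses, an unlabeled wrapper, labels that begin with a digit, and material outside the outer brackets. It cannot detect a label that contains white space, because after splitting on white space such a label is indistinguishable from a label followed by a word.

```python
import re


def check_tree(text):
    """Return a list of formatting problems in one bracketed sentence.

    An empty list means no problem was found by this (partial) check.
    """
    problems = []
    # Tokens are "(", ")", or a run of characters with no space or parenthesis.
    tokens = re.findall(r"[()]|[^\s()]+", text)

    # The wrapper is an unlabeled pair of parentheses, so the sentence
    # must open with "(" followed directly by another "(".
    if tokens[:2] != ["(", "("]:
        problems.append("missing unlabeled wrapper around the sentence")

    depth = 0
    closed = False  # becomes True once the outermost bracket has closed
    for i, tok in enumerate(tokens):
        if closed:
            problems.append("material found after the outer brackets closed")
            break
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis with no matching opener")
                break
            if depth == 0:
                closed = True
        elif depth == 0:
            problems.append(f"token {tok!r} lies outside the outer brackets")
        elif tokens[i - 1] == "(" and tok[0].isdigit():
            # A token directly after "(" is a label.
            problems.append(f"label {tok!r} begins with a digit")

    if depth > 0:
        problems.append("unclosed opening parenthesis")
    return problems


if __name__ == "__main__":
    for sample in [
        "( (NP-SBJ (D the) (N child)))",   # wrapped and balanced
        "(NP-SBJ (D the) (N child))",      # no wrapper
        "( (2NP (D the) (N child)))",      # label starts with a digit
    ]:
        print(sample, "->", check_tree(sample) or "ok")
```

A check of this kind reports every problem it finds rather than stopping at the first, which tends to be more convenient when cleaning a large corpus file.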
For more information on corpus formatting for CorpusSearch, see the CorpusSearch Users Guide.