CorpusSearch has been used to search corpora of Old, Middle, and Modern English, as well as corpora of Chinese, Korean, and Yiddish. For CorpusSearch to search a corpus, the corpus must meet the following formatting requirements:
  1. Every sentence in the corpus must be completely parsed; that is, every word must be labeled and must be included within the outside brackets of some sentence.
  2. Phrasal and part-of-speech labels may not contain a space or any other whitespace character, nor may they begin with a digit.
  3. Constituents must be bracketed with parentheses -- "(" and ")" -- not with square brackets or other delimiters.
  4. Every sentence must have a "wrapper", that is, an unlabeled pair of parentheses surrounding the sentence.
Below is an example of a sentence bracketed in accordance with these guidelines, using the labels of the PPCME2 and PPCEME. Note that CorpusSearch is indifferent to the choice of phrasal and part-of-speech labels.
( (IP-MAT (ADVP-TMP (ADV Then)) (NP-SBJ (D the) (N child)) (VBD became) (ADJP (ADJR happier) (CONJ and) (ADJR happier)) (E_S .)) )
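The requirements listed above can be checked mechanically before a corpus is handed to CorpusSearch. The following Python sketch is one possible way to do so; it is not part of CorpusSearch, and the function names (tokenize, parse, find_format_problems) are hypothetical. It assumes that parentheses occur only as bracketing delimiters, never as part of a word, and it does not attempt to detect square brackets or other delimiters used in place of parentheses.

```python
import re


def tokenize(text):
    # Split into "(", ")" and runs of characters that are neither
    # whitespace nor parentheses (labels and words).
    return re.findall(r"[()]|[^\s()]+", text)


def parse(tokens):
    # Build nested lists from the tokens; return the top-level items.
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return stack[0]


def check_node(node, problems):
    # A well-formed constituent is [LABEL, child, ...]; a leaf is [LABEL, word].
    if not node or not isinstance(node[0], str):
        problems.append("unlabeled constituent inside a sentence")
        return
    label, children = node[0], node[1:]
    if label[0].isdigit():
        problems.append("label begins with a digit: " + label)
    if not children:
        problems.append("empty constituent: " + label)
    if len(children) == 1 and isinstance(children[0], str):
        return  # leaf: part-of-speech label plus one word
    for child in children:
        if isinstance(child, str):
            # Also catches a label that contained a space, e.g. (NP SBJ ...).
            problems.append("unlabeled word: " + child)
        else:
            check_node(child, problems)


def find_format_problems(text):
    # Return a list of human-readable problems; an empty list means
    # none of the checks implemented here failed.
    problems = []
    for item in parse(tokenize(text)):
        if isinstance(item, str):
            problems.append("word outside any sentence: " + item)
        elif not item:
            problems.append("empty sentence wrapper")
        elif isinstance(item[0], str):
            problems.append("sentence lacks an unlabeled wrapper: " + item[0])
        else:
            for child in item:
                check_node(child, problems)
    return problems


example = ("( (IP-MAT (ADVP-TMP (ADV Then)) (NP-SBJ (D the) (N child)) "
           "(VBD became) (ADJP (ADJR happier) (CONJ and) (ADJR happier)) "
           "(E_S .)) )")
print(find_format_problems(example))
```

Run on the example sentence above, the sketch reports no problems. Removing the outer wrapper, starting a label with a digit, or leaving a word without a label each adds an entry to the returned list.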
For more information on corpus formatting for CorpusSearch, see the CorpusSearch Users Guide.