Contents of this chapter:

about the query language
search function calls
wild cards and escaping wild cards
logical operators
regular expressions

about the query language

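The CorpusSearch query language has these basic components: search function calls, wild cards, logical operators, and regular expressions. Each is described in turn below.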

search function calls

The most basic query is a single search-function call. For instance, here is a query that searches for nodes labelled QP ("quantifier phrase") that immediately dominate nodes labelled CONJ ("co-ordinating conjunction"):

(QP iDominates CONJ)

and here is a sentence found by the query:

/~*
and so he is bo+te more and lasse to his seruaunt.
(CMWYCSER,351.2223)
*~/
/*
1 IP-MAT: 9 QP, 10 CONJ bo+te
1 IP-MAT: 9 QP, 12 CONJ and
*/
(0 (1 IP-MAT (2 CONJ and)
             (3 ADVP (4 ADV so))
             (5 NP-SBJ (6 PRO he))
             (7 BEP is)
             (8 ADJP (9 QP (10 CONJ bo+te) (11 QR more) (12 CONJ and) (13 QR lasse))
                     (14 PP (15 P to)
                            (16 NP (17 PRO$ his) (18 N seruaunt))))
             (19 E_S .))
      (ID CMWYCSER,351.2223))
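To make the notion of immediate dominance concrete, here is a minimal Python sketch that parses a bracketed tree and reports every node with one label that is the direct parent of a node with another label. It is not how CorpusSearch itself is implemented; the names parse and i_dominates are hypothetical, the tree is a fragment of the sentence above with the node numbers left out, and only label-to-label dominance is modelled.

```python
import re

def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Leaves (the words of the text) are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word of the text
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def i_dominates(tree, parent_label, child_label):
    """Yield (parent, child) pairs where parent is the direct parent of child."""
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            if label == parent_label and child[0] == child_label:
                yield tree, child
            yield from i_dominates(child, parent_label, child_label)

# Fragment of the example sentence, node numbers omitted for brevity.
fragment = ("(ADJP (QP (CONJ bo+te) (QR more) (CONJ and) (QR lasse)) "
            "(PP (P to) (NP (PRO$ his) (N seruaunt))))")
tree = parse(fragment)

# Words under each CONJ that is immediately dominated by a QP.
hits = [child[1][0] for _, child in i_dominates(tree, "QP", "CONJ")]
print(hits)
```

The two hits correspond to the two result lines in the output above, one for each CONJ directly under the QP.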

Any number of search-function calls may be combined into more complex queries using AND, OR, and NOT.

wild cards and escaping wild cards

CorpusSearch supports two wild cards, namely * and #.


* stands for any string of symbols, including the empty string, much as ".*" does in a regular expression. For instance, "CP*" means any label beginning with the letters CP (e.g. CP, CP-ADV, CP-QUE-SPE), "*-SPE" means any label ending with "-SPE", and "*hersum*" means any string containing the substring "hersum" (e.g. "hersumnesse", "unhersumnesse"). * by itself matches any string. * may be used anywhere in a function argument: at the beginning, in the middle, or at the end.

escaping the asterisk (\*)

Some labels, for example "*con*" ("subject elided under conjunction"), contain the character '*'. If you're looking for such a label, use \ (escape character) to show that you're searching for * and not using it as a wild card. For instance, to search for *con* dominated by a noun phrase, you could use this query:

(NP* dominates \*con\*)

to find (among others) this sentence:

ne did euyll.

    1 IP-MAT: 3 NP-SBJ *con*

(0 (1 IP-MAT (2 CONJ ne)
             (3 NP-SBJ *con*)
             (4 DOD did)
             (5 NP-OB1 (6 N euyll))
             (7 E_S .))
      (ID CMMANDEV,1.14))
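The behaviour of * and \* described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the helper names star_pattern_to_regex and matches are hypothetical, every character other than * and \ is treated as literal, and the regular-expression syntax that CorpusSearch also accepts in arguments is not modelled.

```python
import re

def star_pattern_to_regex(pattern):
    """Translate a label pattern into a compiled regular expression.

    '*' matches any string (including the empty one); a backslash makes
    the following character literal, so '\\*' matches a real asterisk.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped char is literal
            i += 2
        elif ch == "*":
            parts.append(".*")                       # wild card
            i += 1
        else:
            parts.append(re.escape(ch))              # ordinary character
            i += 1
    return re.compile("".join(parts))

def matches(pattern, label):
    """True if the whole label matches the pattern."""
    return star_pattern_to_regex(pattern).fullmatch(label) is not None

print(matches("CP*", "CP-QUE-SPE"))       # wild card at the end
print(matches("*hersum*", "unhersumnesse"))
print(matches(r"\*con\*", "*con*"))       # escaped asterisks are literal
print(matches(r"\*con\*", "second"))      # no literal asterisks in the label
```

Without the backslashes, the pattern *con* would match any label containing "con", which is why the escape is needed to find the *con* label specifically.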


# is the wild card for digits. For instance, to find prepositions divided into parts, you could use this query:

(PP iDominates P#) 

to find sentences like this:

Anone there $with all arose sir Gawtere

    1 IP-MAT: 4 PP, 7 P21 $with
    1 IP-MAT: 4 PP, 8 P22 all

(0 (1 IP-MAT
             (2 ADVP-TMP (3 ADV Anone))
             (4 PP
                   (5 ADVP (6 ADV there))
                   (7 P21 $with)
                   (8 P22 all))
             (9 VBD arose)
             (10 NP-SBJ (11 NPR sir) (12 NPR Gawtere)))
      (ID CMMALORY,199.3135))
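The effect of # can be sketched in Python as follows. The example rests on an assumption drawn from the output above, where P# matches both P21 and P22: that # stands for a run of one or more digits rather than exactly one digit. The helper name digit_pattern_to_regex and the list of labels are hypothetical.

```python
import re

def digit_pattern_to_regex(pattern):
    """Translate a label pattern in which '#' is the digit wild card.

    Assumption: '#' matches one or more digits, since P# matches P21.
    All other characters are treated as literal.
    """
    parts = ["[0-9]+" if ch == "#" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))

labels = ["P", "P21", "P22", "PP", "PRO$"]
p_hash = digit_pattern_to_regex("P#")

# Keep only the labels that consist of P followed by digits.
split_prepositions = [label for label in labels if p_hash.fullmatch(label)]
print(split_prepositions)
```

Plain P and PP are not matched, so the query picks out only the numbered parts of a divided preposition.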

escaping integers

Integer arguments are expected for some search functions and not allowed for others. But suppose you want to search for a piece of text that is an integer, for instance a year. You can't do this:

(WRONG!) query: (1929 exists)

because "exists" won't take an integer argument. To cause the query parser to accept an integer as text, use a "\" as follows:

query: (\1929 exists)

logical operators

Search-function calls may be combined using the logical operators AND, OR, and NOT.

There are also logical operators that act on the arguments to search functions. These are "|", which means "or" within a list of arguments (e.g. "MD*|HV*" means "MD* or HV*"), and "!", which negates an argument or list of arguments (e.g. "(NP-SBJ dominates !N)" returns cases where NP-SBJ does not dominate N).
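As an illustration of how "|" and "!" act on a single argument, the following Python sketch tests whether one label satisfies an argument. It is a simplification under stated assumptions: the function name arg_matches is hypothetical, Python's fnmatchcase stands in for the * wild card, and the way CorpusSearch applies a negated argument across a whole tree is not modelled.

```python
from fnmatch import fnmatchcase

def arg_matches(argument, label):
    """True if the label satisfies the argument.

    '|' separates alternatives; a leading '!' negates the whole list.
    """
    negated = argument.startswith("!")
    if negated:
        argument = argument[1:]
    alternatives = argument.split("|")
    hit = any(fnmatchcase(label, alt) for alt in alternatives)
    return not hit if negated else hit

print(arg_matches("MD*|HV*", "HVP"))   # second alternative matches
print(arg_matches("MD*|HV*", "VBD"))   # neither alternative matches
print(arg_matches("!N", "ADJ"))        # any label other than N
print(arg_matches("!MD*|HV*", "HVD"))  # negation covers the whole list
```

Treating the leading "!" as applying to the entire list follows the description above, where "!" negates "an argument or list of arguments".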

regular expressions

CorpusSearch allows the use of regular expression syntax in the arguments to functions. For example, the expression "[xyz]" stands for a single character that is either an "x", a "y" or a "z". Note that the period character "." stands for any letter or digit, and the sequence ".*" stands for any sequence of such characters. If the argument in the query contains a literal period, the period must be escaped with a "\", just as with the asterisk.
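The same constructs can be tried out in Python's re module, shown here only as an analogy. The regular-expression dialect used by CorpusSearch may differ in its details; for instance, an unescaped period in Python also matches punctuation, not just letters and digits. The word list is made up for the illustration.

```python
import re

words = ["x", "y", "w", "."]

# "[xyz]" matches a single character that is x, y or z.
xyz = [w for w in words if re.fullmatch(r"[xyz]", w)]

# An unescaped period matches any single character (in Python's dialect).
any_char = [w for w in words if re.fullmatch(r".", w)]

# An escaped period matches only a literal period.
literal_period = [w for w in words if re.fullmatch(r"\.", w)]

print(xyz)
print(any_char)
print(literal_period)
```

The last pattern is the one to use when the text being searched for is itself a period, such as sentence-final punctuation.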