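The CorpusSearch query language has these basic components: search-function calls, wild cards, the escape character, logical operators, and regular expressions.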
The most basic query is a single search-function call. For instance, here is a query that searches for nodes labelled QP ("quantifier phrase") that immediately dominate nodes labelled CONJ ("co-ordinating conjunction"):
(QP iDominates CONJ)
and here is a sentence found by the query:
/~*
and so he is bo+te more and lasse to his seruaunt.
(CMWYCSER,351.2223)
*~/
/*
    1 IP-MAT: 9 QP, 10 CONJ bo+te
    1 IP-MAT: 9 QP, 12 CONJ and
*/
(0
   (1 IP-MAT (2 CONJ and)
             (3 ADVP (4 ADV so))
             (5 NP-SBJ (6 PRO he))
             (7 BEP is)
             (8 ADJP (9 QP (10 CONJ bo+te) (11 QR more) (12 CONJ and) (13 QR lasse))
                     (14 PP (15 P to)
                            (16 NP (17 PRO$ his) (18 N seruaunt))))
             (19 E_S .))
      (ID CMWYCSER,351.2223))
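To make "immediately dominates" concrete, here is a minimal Python sketch that walks a bracketed tree and reports each QP node that has a CONJ daughter. It is an illustration only, not how CorpusSearch itself is implemented: the names parse and i_dominates are hypothetical, and the tree is a simplified copy of the sentence above with the node numbers and the ID node left out.

```python
import re

def parse(text):
    # Turn a bracketed tree into nested (label, children) tuples; words are plain strings.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word of the text
                pos += 1
        pos += 1                          # skip ")"
        return (label, children)

    return node()

def i_dominates(tree, parent, child):
    # Yield (mother, daughter) pairs where a `parent` node has a `child` node as a direct daughter.
    label, children = tree
    for kid in children:
        if isinstance(kid, tuple):
            if label == parent and kid[0] == child:
                yield (tree, kid)
            yield from i_dominates(kid, parent, child)

# Simplified tree (assumption: node numbers and the ID node are dropped).
TREE = ("(IP-MAT (CONJ and) (ADVP (ADV so)) (NP-SBJ (PRO he)) (BEP is) "
        "(ADJP (QP (CONJ bo+te) (QR more) (CONJ and) (QR lasse)) "
        "(PP (P to) (NP (PRO$ his) (N seruaunt)))) (E_S .))")

hits = [daughter[1][0] for _, daughter in i_dominates(parse(TREE), "QP", "CONJ")]
print(hits)   # ['bo+te', 'and']
```

The sentence-initial CONJ "and" is not reported, because its mother is IP-MAT rather than QP.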
Any number of search-function calls may be combined into more complex queries using AND, OR, and NOT.
CorpusSearch supports two wild cards, namely * and #.
* stands for any string of symbols, including the empty string. For instance, "CP*" means any label beginning with the letters CP (e.g. CP, CP-ADV, CP-QUE-SPE), "*-SPE" means any label ending with "-SPE", and "*hersum*" means any string containing the substring "hersum" (e.g. "hersumnesse", "unhersumnesse"). * by itself matches any string. * may be used anywhere in the function argument: at the beginning, in the middle, or at the end.
Some labels, for example "*con*" ("subject elided under conjunction"), contain the character '*'. If you're looking for such a label, use \ (escape character) to show that you're searching for * and not using it as a wild card. For instance, to search for *con* dominated by a noun phrase, you could use this query:
(NP* dominates \*con\*)
to find (among others) this sentence:
/~*
ne did euyll.
(CMMANDEV,1.14)
*~/
/*
    1 IP-MAT: 3 NP-SBJ *con*
*/
(0
   (1 IP-MAT (2 CONJ ne)
             (3 NP-SBJ *con*)
             (4 DOD did)
             (5 NP-OB1 (6 N euyll))
             (7 E_S .))
      (ID CMMANDEV,1.14))
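The behaviour of * and of the escaped \* can be sketched in Python by translating an argument into a regular expression. This is a rough model under stated assumptions, not CorpusSearch's own matcher: the function names star_pattern and matches are hypothetical, and the sketch handles only the * wild card and the \ escape.

```python
import re

def star_pattern(arg):
    # Assumption: "*" matches any string, and a backslash makes the next character literal.
    parts = []
    i = 0
    while i < len(arg):
        ch = arg[i]
        if ch == "\\" and i + 1 < len(arg):
            parts.append(re.escape(arg[i + 1]))   # escaped character, taken literally
            i += 2
        elif ch == "*":
            parts.append(".*")                    # wild card
            i += 1
        else:
            parts.append(re.escape(ch))           # ordinary character
            i += 1
    return re.compile("".join(parts))

def matches(arg, label):
    # True if the whole label fits the argument.
    return star_pattern(arg).fullmatch(label) is not None

print(matches("CP*", "CP-QUE-SPE"))       # True
print(matches("*hersum*", "unhersumnesse"))  # True
print(matches(r"\*con\*", "*con*"))       # True
print(matches(r"\*con\*", "second"))      # False
```

Without the backslashes, "*con*" would behave as two wild cards around "con" and would also match a word such as "second".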
# is the wild card for digits. For instance, to find prepositions divided into parts, you could use this query:
(PP iDominates P#)
to find sentences like this:
/~*
Anone there $with all arose sir Gawtere
(CMMALORY,199.3135)
*~/
/*
    1 IP-MAT: 4 PP, 7 P21 $with
    1 IP-MAT: 4 PP, 8 P22 all
*/
(0
   (1 IP-MAT
             (2 ADVP-TMP (3 ADV Anone))
             (4 PP
                   (5 ADVP (6 ADV there))
                   (7 P21 $with)
                   (8 P22 all))
             (9 VBD arose)
             (10 NP-SBJ (11 NPR sir) (12 NPR Gawtere)))
      (ID CMMALORY,199.3135))
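The effect of # on the labels in this tree can be modelled in a few lines of Python. The sketch assumes that # stands for a run of one or more digits, which is consistent with P# matching P21 and P22 above but is not spelled out in the text; digit_pattern is a hypothetical name.

```python
import re

def digit_pattern(arg):
    # Assumption: "#" is a run of one or more digits; every other character is literal.
    return re.compile("".join("[0-9]+" if ch == "#" else re.escape(ch) for ch in arg))

labels = ["IP-MAT", "ADVP-TMP", "ADV", "PP", "P21", "P22", "VBD", "NP-SBJ", "NPR"]
found = [label for label in labels if digit_pattern("P#").fullmatch(label)]
print(found)   # ['P21', 'P22']
```

PP is not selected, because the character after the first P is a letter rather than a digit.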
Integer arguments are expected for some search functions and not allowed for others. But suppose you want to search for a piece of text that is an integer, for instance a year. You can't do this:
(WRONG!) query: (1929 exists)
because "exists" won't take an integer argument. To cause the query parser to accept an integer as text, use a "\" as follows:
query: (\1929 exists)
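One way to picture what the backslash does here is a small classifier that decides whether an argument is read as an integer or as text. This is a guess at the general idea, not the actual CorpusSearch parser; classify_argument is a hypothetical name, and the sketch deals only with the integer case.

```python
import re

def classify_argument(token):
    # Hypothetical helper: decide how a single query argument is read.
    if re.fullmatch(r"[0-9]+", token):
        return ("integer", int(token))
    if re.fullmatch(r"\\[0-9]+", token):
        return ("text", token[1:])        # drop the escape, keep the digits as text
    return ("text", token)

print(classify_argument("1929"))      # ('integer', 1929)
print(classify_argument("\\1929"))    # ('text', '1929')
print(classify_argument("NP-SBJ"))    # ('text', 'NP-SBJ')
```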
Search-function calls may be combined using the logical operators AND, OR, and NOT.
There are also logical operators that act on the arguments to search functions: "|", which means "or" within a list of arguments (e.g. "MD*|HV*" means "MD* or HV*"), and "!", which negates an argument or a list of arguments (e.g. "NP-SBJ dominates !N" returns cases where NP-SBJ does not dominate N).
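As a rough illustration of how | and ! act on a single label, the following Python sketch tests a label against an argument list. It assumes that a leading ! negates the whole list, as the description above suggests, and it uses Python's fnmatchcase as a stand-in for the * wild card; label_matches is a hypothetical name and the real matcher may differ.

```python
from fnmatch import fnmatchcase

def label_matches(arg, label):
    # Assumption: a leading "!" negates the whole "|"-separated list.
    negated = arg.startswith("!")
    alternatives = (arg[1:] if negated else arg).split("|")
    hit = any(fnmatchcase(label, alt) for alt in alternatives)
    return not hit if negated else hit

print(label_matches("MD*|HV*", "HVD"))   # True
print(label_matches("MD*|HV*", "VBD"))   # False
print(label_matches("!N", "N"))          # False
print(label_matches("!N", "ADJ"))        # True
```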
CorpusSearch allows the use of regular expression syntax in the arguments to functions. For example, the expression "[xyz]" stands for a single character that is either an "x", a "y" or a "z". Note that the period character "." stands for any letter or digit, and the sequence ".*" stands for any sequence of such characters. If the argument in the query contains a literal period, it must be escaped with a "\", as with the asterisk.
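The character class and the escaped period can be tried out with Python's re module, used here only as a stand-in for the syntax just described. Python's "." matches any character except a newline, which is broader than "any letter or digit", so treat this as an analogy; the helper full and the sample strings are made up for the illustration.

```python
import re

def full(pattern, text):
    # True if the pattern matches the whole text.
    return re.fullmatch(pattern, text) is not None

print(full(r"[xyz]", "y"))     # True: one character from the class
print(full(r"[xyz]", "w"))     # False
print(full(r"N.R", "NPR"))     # True: "." stands for one character
print(full(r"1\.5", "1.5"))    # True: escaped period is a literal "."
print(full(r"1\.5", "125"))    # False
```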