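The CorpusSearch query language has these basic components: search-function calls, wild cards, the escape character, logical operators, and regular expressions.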
The most basic query is a single search-function call. For instance, here is a query that searches for nodes labelled QP ("quantifier phrase") that immediately dominate nodes labelled CONJ ("co-ordinating conjunction"):
(QP iDominates CONJ)
and here is a sentence found by the query:
/~*
and so he is bo+te more and lasse to his seruaunt.
(CMWYCSER,351.2223)
*~/
/*
1 IP-MAT: 9 QP, 10 CONJ bo+te
1 IP-MAT: 9 QP, 12 CONJ and
*/
(0 (1 IP-MAT (2 CONJ and)
             (3 ADVP (4 ADV so))
             (5 NP-SBJ (6 PRO he))
             (7 BEP is)
             (8 ADJP (9 QP (10 CONJ bo+te) (11 QR more) (12 CONJ and) (13 QR lasse))
                     (14 PP (15 P to)
                            (16 NP (17 PRO$ his) (18 N seruaunt))))
             (19 E_S .))
   (ID CMWYCSER,351.2223))
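To make the meaning of "immediately dominates" concrete, here is a minimal Python sketch that parses a bracketed tree and collects every CONJ node that is a direct child of a QP node. This is an illustration only, not how CorpusSearch itself is implemented: the function names parse and i_dominates are hypothetical, and the tree is a simplified copy of the sentence above with the node numbers and the ID node removed.

```python
def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Leaves (words) are kept as plain strings inside the children list.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def i_dominates(tree, parent_label, child_label):
    """Return every child_label node that is a direct child of a parent_label node."""
    hits = []
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            if label == parent_label and child[0] == child_label:
                hits.append(child)
            # Keep searching deeper in the tree.
            hits.extend(i_dominates(child, parent_label, child_label))
    return hits


# Simplified version of the example sentence (node numbers and ID node omitted).
TREE = (
    "(IP-MAT (CONJ and) (ADVP (ADV so)) (NP-SBJ (PRO he)) (BEP is) "
    "(ADJP (QP (CONJ bo+te) (QR more) (CONJ and) (QR lasse)) "
    "(PP (P to) (NP (PRO$ his) (N seruaunt)))) (E_S .))"
)

hits = i_dominates(parse(TREE), "QP", "CONJ")
print([child[1][0] for child in hits])
```

The sentence-initial CONJ is not reported, because its parent is IP-MAT rather than QP; only the two conjunctions inside the QP are found, which agrees with the two hits listed in the output above.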
Any number of search-function calls may be combined into more complex queries using AND, OR, and NOT.
CorpusSearch supports two wild cards, namely * and #.
* stands for any string of symbols, including the empty string (it behaves like ".*" in regular expressions, not like the regular-expression repetition operator). For instance, "CP*" means any label beginning with the letters CP (e.g. CP, CP-ADV, CP-QUE-SPE), "*-SPE" means any label ending with "-SPE", and "*hersum*" means any string containing the substring "hersum" (e.g. "hersumnesse", "unhersumnesse"). * by itself matches any string. * may be used anywhere in a function argument: at the beginning, in the middle, or at the end.
Some labels, for example "*con*" ("subject elided under conjunction"), contain the character '*'. If you're looking for such a label, use \ (escape character) to show that you're searching for * and not using it as a wild card. For instance, to search for *con* dominated by a noun phrase, you could use this query:
(NP* dominates \*con\*)
to find (among others) this sentence:
/~*
ne did euyll.
(CMMANDEV,1.14)
*~/
/*
1 IP-MAT: 3 NP-SBJ *con*
*/
(0 (1 IP-MAT (2 CONJ ne)
             (3 NP-SBJ *con*)
             (4 DOD did)
             (5 NP-OB1 (6 N euyll))
             (7 E_S .))
   (ID CMMANDEV,1.14))
# is the wild card for digits. For instance, to find prepositions divided into parts, you could use this query:
(PP iDominates P#)
to find sentences like this:
/~*
Anone there $with all arose sir Gawtere
(CMMALORY,199.3135)
*~/
/*
1 IP-MAT: 4 PP, 7 P21 $with
1 IP-MAT: 4 PP, 8 P22 all
*/
(0 (1 IP-MAT (2 ADVP-TMP (3 ADV Anone))
             (4 PP (5 ADVP (6 ADV there))
                   (7 P21 $with)
                   (8 P22 all))
             (9 VBD arose)
             (10 NP-SBJ (11 NPR sir) (12 NPR Gawtere)))
   (ID CMMALORY,199.3135))
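The behaviour of the two wild cards and the escape character can be sketched in Python by translating a pattern into an ordinary regular expression. This is a rough model under stated assumptions, not the CorpusSearch implementation: the function name wildcard_to_regex is hypothetical, and # is assumed to match a run of one or more digits, because the example above shows P# matching P21 and P22. The text does not say whether # can also match a single-digit-only limit or an empty run.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a CorpusSearch-style wild-card pattern into a compiled regex.

    Assumptions (not confirmed by the manual):
      *  matches any string, including the empty string
      #  matches one or more digits
      \\  makes the following character literal
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: take the next character literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")
        elif ch == "#":
            out.append(r"\d+")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out))


def label_matches(pattern, label):
    """True if the whole label matches the wild-card pattern."""
    return wildcard_to_regex(pattern).fullmatch(label) is not None


print(label_matches("CP*", "CP-QUE-SPE"))        # label beginning with CP
print(label_matches("*hersum*", "unhersumnesse"))  # substring anywhere
print(label_matches(r"\*con\*", "*con*"))          # escaped asterisks are literal
print(label_matches("P#", "P21"))                  # digits after P
print(label_matches("P#", "PP"))                   # no digits, so no match
```

Matching is done against the whole label (fullmatch), so "CP*" does not match a label that merely contains "CP" somewhere in the middle.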
Integer arguments are expected for some search functions and not allowed for others. But suppose you want to search for a piece of text that is an integer, for instance a year. You can't do this:
(WRONG!) query: (1929 exists)
because "exists" won't take an integer argument. To cause the query parser to accept an integer as text, use a "\" as follows:
query: (\1929 exists)
Search-function calls may be combined using the logical operators AND, OR, and NOT.
There are also logical operators that act on arguments to search functions. These are "|", which means "or" for a list of arguments (e.g. "MD*|HV*" means "MD* or HV*"), and "!", which negates an argument or list of arguments (e.g. "NP-SBJ dominates !N" returns cases where NP-SBJ does not dominate N).
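As an illustration of how "|" and "!" act on a single argument, the following Python sketch decides whether one label satisfies an argument such as "MD*|HV*" or "!N". It is a simplified model, not CorpusSearch code: the names label_matches and matches_argument are hypothetical, only the * wild card is handled, escaped "|" or "!" characters are not supported, and a leading "!" is assumed to negate the whole list that follows it. How a search function then uses a negated argument is outside the scope of this sketch.

```python
import re


def label_matches(pattern, label):
    """True if label matches pattern, where * stands for any string."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, label) is not None


def matches_argument(argument, label):
    """Evaluate an argument that may use "|" (or) and a leading "!" (not)."""
    negated = argument.startswith("!")
    if negated:
        argument = argument[1:]
    # "|" separates alternatives; any one of them may match.
    hit = any(label_matches(alt, label) for alt in argument.split("|"))
    # A leading "!" inverts the result for the whole list.
    return hit != negated


for label in ["MD", "HVD", "VBD"]:
    print(label, matches_argument("MD*|HV*", label), matches_argument("!MD*|HV*", label))
```

Here MD and HVD satisfy "MD*|HV*" and VBD does not, and the negated form gives the opposite answer for each label.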
CorpusSearch allows the use of regular-expression syntax in the arguments to functions. For example, the expression "[xyz]" stands for a single character that is either an "x", a "y", or a "z". Note that the period character "." stands for any letter or digit, and the sequence ".*" stands for any sequence of such characters. If an argument in a query contains a literal period, the period must be escaped with a "\", just as a literal asterisk must be.
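The regular-expression constructs mentioned here can be tried out in Python's re module, which is used below only as a stand-in; CorpusSearch's own regular-expression handling may differ in detail. In particular, Python's "." matches any character except a newline, which is broader than the "letter or digit" described above, so the example sticks to letters.

```python
import re

single_char = re.compile(r"[xyz]")    # exactly one of x, y, z
any_run = re.compile(r"N.*")          # N followed by any sequence of characters
literal_period = re.compile(r"\.")    # escaped: matches only a real period
bare_period = re.compile(r".")        # unescaped: matches any single character

print(single_char.fullmatch("y") is not None)      # one character from the set
print(single_char.fullmatch("xy") is not None)     # two characters, so no match
print(any_run.fullmatch("NPR") is not None)        # N plus a run of letters
print(literal_period.fullmatch(".") is not None)   # the period itself
print(literal_period.fullmatch("a") is not None)   # escaped period rejects a letter
print(bare_period.fullmatch("a") is not None)      # unescaped period accepts a letter
```

The last two lines show why the escape matters: without the backslash, a pattern meant to find a literal period would also match ordinary characters.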