To save typing and to improve readability, CorpusSearch allows shorthands and lower-case/upper-case variations for the names of search functions. Acceptable variants are listed below with each function.
When a function has an integer argument, there is always a space between the function and argument. This syntax is a change from earlier versions of CorpusSearch.
In general, the wild card "*" in CorpusSearch corresponds to the standard regular expression ".*". In other words, it matches a string of any number (including zero) of arbitrary characters, not necessarily the same as each other. However, after expressions in square brackets (and in certain other complex contexts), only the standard regular expression has the desired effect. If the one-character wild card "*" gives unexpected results, use the standard regular expression ".*" instead.
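As a rough illustration of the correspondence described above, the following Python sketch translates a CorpusSearch-style pattern into a standard regular expression. The helper names wildcard_to_regex and label_matches are hypothetical, and the treatment of "|" and of backslash escapes is an assumption based on the queries shown in this chapter, not a description of how CorpusSearch itself is implemented.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate a CorpusSearch-style pattern into a Python regex (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: match it literally (e.g. \* is a real asterisk).
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")           # wild card: any run of characters, possibly empty
        elif ch == "|":
            out.append("|")            # assumed: alternation between patterns
        else:
            out.append(re.escape(ch))  # everything else is literal, e.g. the $ in PRO$
        i += 1
    return "(?:" + "".join(out) + ")"

def label_matches(pattern: str, label: str) -> bool:
    """True if the whole label matches the pattern."""
    return re.fullmatch(wildcard_to_regex(pattern), label) is not None
```

Under these assumptions, label_matches("NP*", "NP-SBJ") is true, label_matches("NP*", "PP") is false, and the escaped pattern \*ICH\* matches only the literal string *ICH*.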
      A
     / \
    B   C
   / \   \
  D   E   F

B ccommands C and F, and both C and F ccommand B, D and E. D and E, on the other hand, ccommand only each other. A ccommands no node because, being the root of the tree, it dominates all of the other nodes. The following query:
query: (NP-SBJ* idoms PRO$) AND (PRO$ ccommands NP*)
finds examples like:
(NP-SBJ (PRO$ his) (ADVR+Q ouermoch) (N fearinge) (PP (P of) (NP (PRO you))))
in which a possessive pronoun ccommands a noun phrase, here the object of a prepositional complement to the head noun.
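The relations listed for the small tree rooted in A can be reproduced with a short Python sketch. It assumes the common definition in which x ccommands y when neither dominates the other and the first branching node above x dominates y; this definition agrees with the relations stated above, but it is an assumption about, not a copy of, the CorpusSearch implementation. The Node class and the helper commanded_by are hypothetical names.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def dominates(x, y):
    """True if x is a proper ancestor of y."""
    node = y.parent
    while node is not None:
        if node is x:
            return True
        node = node.parent
    return False

def ccommands(x, y):
    """Assumed definition: first branching ancestor of x dominates y."""
    if x is y or dominates(x, y) or dominates(y, x):
        return False
    ancestor = x.parent
    # Skip non-branching ancestors (e.g. C above F in the example tree).
    while ancestor is not None and len(ancestor.children) < 2:
        ancestor = ancestor.parent
    return ancestor is not None and dominates(ancestor, y)

# The example tree: A dominates B and C; B dominates D and E; C dominates F.
D, E, F = Node("D"), Node("E"), Node("F")
B, C = Node("B", [D, E]), Node("C", [F])
A = Node("A", [B, C])
ALL_NODES = [A, B, C, D, E, F]

def commanded_by(x):
    """Sorted labels of every node that x ccommands."""
    return sorted(n.label for n in ALL_NODES if ccommands(x, n))
```

With this sketch, commanded_by(B) gives C and F, commanded_by(F) gives B, D and E, and commanded_by(A) is empty.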
If, for instance, you want to find sentences whose CODING-IP-MAT node contains an "m" or "n" in the 7th column, use this query:
query: (CODING-IP-MAT column 7 m|n)
If you want to find sentences whose CODING-IP-MAT node does not contain a "p" or "q" in the 4th column, use this query:
query: (CODING-IP-MAT column 4 !p|q)
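The logic of the two column queries can be sketched in Python as follows. The sketch assumes that the columns of a coding string are separated by colons and numbered from 1, and that a leading "!" negates the whole list of alternatives; the function name column_matches and the sample coding strings are hypothetical.

```python
def column_matches(coding_string, column, spec):
    """Check one column of a coding string against a spec such as 'm|n' or '!p|q'."""
    negate = spec.startswith("!")
    wanted = set(spec.lstrip("!").split("|"))
    fields = coding_string.split(":")   # assumption: columns are colon-separated
    if column < 1 or column > len(fields):
        return False                    # assumption: a missing column never matches
    found = fields[column - 1] in wanted
    return not found if negate else found
```

For example, column_matches("a:b:c:d:e:f:m:h", 7, "m|n") is true, while column_matches("a:b:c:p:e", 4, "!p|q") is false because the 4th column holds "p".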
(IP-INF dominates \*arb*)
returns this sentence:
/~*
And soo by the counceil of Merlyn the kyng lete calle his barons to counceil,
(CMMALORY,14.419)
*~/
/*
18 IP-INF: 19 NP-SBJ *arb*
*/
( (18 IP-INF (19 NP-SBJ *arb*) (20 VB calle) (21 NP-OB1 (22 PRO$ his) (23 NS barons)) (24 PP (25 P to) (26 NP (27 N counceil)))) (ID CMMALORY,14.419))
domsWords counts the number of words dominated by the search-function argument. So "domsWords 4" means "dominates 4 words", "domsWords 2" means "dominates 2 words", and so on. A word in this case is defined as a leaf node that is not on the word_ignore_list. Here's the default word_ignore_list:
RMV:*|COMMENT|CODE|ID|LB|'|\"|,|E_S|0|\**
Thus, traces, 0 complementizers, punctuation, and comments are not counted as words.
So this query:
node: NP*
query: (NP-OB* domsWords 3)
will return this structure (ignoring the trace *ICH*-1):
/~*
and by kynge Ban and Bors his counceile they lette brenne and destroy all the contrey before them there they sholde ryde.
(CMMALORY,20.613)
*~/
/*
24 NP-OB1: 27 N contrey
*/
( (24 NP-OB1 (25 Q all) (26 D the) (27 N contrey) (28 CP-REL *ICH*-1)) (ID CMMALORY,20.613))
(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").
domsWords< is just like domsWords except that it returns structures that dominate strictly less than the given number of words. For instance, this query:
(NP-OB* domsWords< 3)
will return this structure:
/~*
for it was I myself that cam in the lykenesse.
(CMMALORY,5.131)
*~/
/*
6 NP-OB1: 9 PRO$+N myself
*/
( (6 NP-OB1 (7 PRO I) (8 NP-PRN (9 PRO$+N myself))) (ID CMMALORY,5.131))
(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").
domsWords> is just like domsWords except that it returns structures that dominate strictly more than the given number of words. For instance, this query:
(NP-OB* domsWords> 3)
will return this structure:
/~*
for she was called a fair lady and a passynge wyse,
(CMMALORY,2.9)
*~/
/*
9 NP-OB1: 20 ADJ wyse
*/
( (9 NP-OB1 (10 NP (11 D a) (12 ADJ fair) (13 N lady)) (14 CONJP (15 CONJ and) (16 NP (17 D a) (18 ADJP (19 ADV passynge) (20 ADJ wyse))))) (ID CMMALORY,2.9))
(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").
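The word count behind the three domsWords functions can be sketched in Python. This is only an illustration under stated assumptions: a tree is written as nested (label, children) tuples with (label, text) for leaves, a leaf is skipped when either its label or its text matches an entry of the default word_ignore_list, and \" in that list is taken to mean a literal double quote. The helper names are hypothetical.

```python
import re

# Default word_ignore_list from the text, split on "|".
# Assumption: \" in the list stands for a literal double quote.
IGNORE_PATTERNS = ["RMV:*", "COMMENT", "CODE", "ID", "LB", "'", '"', ",", "E_S", "0", "\\**"]

def to_regex(pattern):
    """Turn one ignore-list entry into a compiled regex ('*' becomes '.*')."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
        elif pattern[i] == "*":
            parts.append(".*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

IGNORE = [to_regex(p) for p in IGNORE_PATTERNS]

def ignored(text):
    return any(regex.fullmatch(text) for regex in IGNORE)

def count_words(node):
    """Count leaves under node that are not on the ignore list."""
    label, body = node
    if isinstance(body, str):  # leaf: (label, text)
        return 0 if ignored(label) or ignored(body) else 1
    return sum(count_words(child) for child in body)

# The three NP-OB1 nodes from the example outputs above.
all_the_contrey = ("NP-OB1", [("Q", "all"), ("D", "the"), ("N", "contrey"),
                              ("CP-REL", "*ICH*-1")])
i_myself = ("NP-OB1", [("PRO", "I"), ("NP-PRN", [("PRO$+N", "myself")])])
fair_lady = ("NP-OB1", [
    ("NP", [("D", "a"), ("ADJ", "fair"), ("N", "lady")]),
    ("CONJP", [("CONJ", "and"),
               ("NP", [("D", "a"),
                       ("ADJP", [("ADV", "passynge"), ("ADJ", "wyse")])])]),
])
```

Here count_words(all_the_contrey) is 3 because the trace *ICH*-1 is ignored, count_words(i_myself) is 2, and count_words(fair_lady) is 7, which is consistent with the three queries above.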
(MD0 exists)
will find this sentence:
/~*
but I fere me that I shal not conne wel goo thyder /
(CMREYNAR,14.261)
*~/
/*
10 IP-SUB: 15 MD0 conne
*/
( (10 IP-SUB (11 NP-SBJ (12 PRO I)) (13 MD shal) (14 NEG not) (15 MD0 conne) (16 ADVP (17 ADV wel)) (18 VB goo) (19 ADVP-DIR (20 ADV thyder))) (ID CMREYNAR,14.261))
A common mistake is to use "exists" unnecessarily, as in this example:
(MD exists) AND (MD iPrecedes VB)
If a sentence contains the structure (MD iPrecedes VB), MD necessarily exists in the sentence. So this query would get the same result:
(MD iPrecedes VB)
(NP* HasLabel NP-SBJ)
will find all NPs with the simple label NP-SBJ and no indices or further dash tags. This function is useful only for coding queries.
node: IP*
query: (NP-SBJ hasSister BE*)
finds both of these sentences:
/~*
indeede I must be gone:
(DELONEY,69.13)
*~/
/*
1 IP-MAT-SPE: 5 NP-SBJ, 10 BE
*/
( (IP-MAT-SPE (PP (P+N indeede)) (NP-SBJ (PRO I)) (MD must) (BE be) (VBN gone) (. :)) (ID DELONEY,69.13))
/~*
I pray you is it true?
(DELONEY,70.47)
*~/
/*
13 IP-SUB-SPE: 16 NP-SBJ, 14 BEP
*/
( (CP-QUE-SPE (IP-MAT-PRN-SPE (NP-SBJ (PRO I)) (CODE {TEMP:prn_ok}) (VBP pray) (NP-ACC (PRO you))) (IP-SUB-SPE (BEP is) (NP-SBJ (PRO it)) (ADJP (ADJ true))) (. ?)) (ID DELONEY,70.47))
((NP* iDominates FP) AND (FP iDominates ane))
finds this sentence:
/~*
Sythen he ledes +tam by +tar ane,
(CMROLLEP,118.978)
*~/
/*
1 IP-MAT: 11 NP, 13 FP ane
*/
(0 (1 IP-MAT (2 ADVP-TMP (3 ADV Sythen)) (4 NP-SBJ (5 PRO he)) (6 VBP ledes) (7 NP-OB1 (8 PRO +tam)) (9 PP (10 P by) (11 NP (12 PRO$ +tar) (13 FP ane))) (14 E_S ,)) (ID CMROLLEP,118.978))
Notice that "iDominates" describes the relationship between a label and its associated text (e.g., "FP" and "ane").
"iDomsFirst" means "immediately dominates as a first child."
For instance, this query:
node: IP*
query: (NP* iDomsFirst PRO$)
results in this output:
/~*
My Lady yor mother, I thanke God, is very well and cheerly,
(KNYVETT-1630,86.12)
*~/
/*
1 IP-MAT: 2 NP-SBJ, 3 PRO$
1 IP-MAT: 7 NP-PRN, 8 PRO$
*/
( (IP-MAT (NP-SBJ (PRO$ My) (N Lady) (NP-PRN (PRO$ yor) (N mother))) (, ,) (IP-MAT-PRN (NP-SBJ (PRO I)) (VBP thanke) (NP-ACC (NPR God))) (, ,) (BEP is) (ADJP (ADJP (ADV very) (ADJ well)) (CONJP (CONJ and) (ADJX (ADJ cheerly)))) (. ,)) (ID KNYVETT-1630,86.12))
"iDomsLast" means "immediately dominates as a last child."
So this query:
node: IP*
query: (IP* iDomsLast BEN)
results in this output:
/~*
but keepes her chamber because of the Bitter weather that hath been.
(KNYVETT-1630,86.13)
*~/
/*
31 IP-SUB: 31 IP-SUB, 36 BEN
*/
( (IP-MAT (CONJ but) (NP-SBJ *con*) (VBP keepes) (NP-ACC (PRO$ her) (N chamber)) (PP (P+N because) (PP (P of) (NP (D the) (ADJ Bitter) (N weather) (CP-REL (WNP-1 0) (C that) (IP-SUB (NP-SBJ *T*-1) (HVP hath) (BEN been)))))) (. .)) (ID KNYVETT-1630,86.13))
node: IP*
query: (NP-SBJ iDomsMod NP*|CONJ* PRO)
finds this sentence:
/~*
So by the entrete at the last the kyng and she met togyder.
(CMMALORY,4.104)
*~/
/*
1 IP-MAT: 21 NP-SBJ, 31 PRO, 27 CONJP
*/
(0 (1 IP-MAT (2 ADVP (3 ADV So)) (5 PP (6 P by) (8 NP (9 D the) (11 N entrete))) (13 PP (14 P at) (16 NP (17 D the) (19 ADJ last))) (21 NP-SBJ (22 NP (23 D the) (25 N kyng)) (27 CONJP (28 CONJ and) (30 NP (31 PRO she)))) (33 VBD met) (35 ADVP (36 ADV togyder)) (38 E_S .)) (40 ID CMMALORY,4.104))
The query
node: IP*
query: (NP-SBJ iDomsMod NP*|CONJ* !PRO)
would also find the above sentence because "NP-SBJ iDomsMod NP" is true of the full NP "the kyng".
(CP-DEG iDomsNumber 1 C)
produces this output:
/~*
And Merlion was so disgysed that kynge Arthure knewe hym nat,
(CMMALORY,30.939)
*~/
/*
1 IP-MAT: 9 CP-DEG, 10 C that
*/
(0 (1 IP-MAT (2 CONJ And) (3 NP-SBJ (4 NPR Merlion)) (5 BED was) (6 ADJP (7 ADVR so) (8 VAN disgysed) (9 CP-DEG (10 C that) (11 IP-SUB (12 NP-SBJ (13 NPR kynge) (14 NPR Arthure)) (15 VBD knewe) (16 NP-OB1 (17 PRO hym)) (18 NEG nat)))) (19 E_S ,)) (ID CMMALORY,30.939))
iDomsOnly means "immediately dominates as an only child." That is, x immediately dominates y as an only child if x immediately dominates y and y is the only legitimate child of x. So this query:
(ADJP iDomsOnly Q*)
results in this output:
/~*
But after my lytyll wytt it semeth me, sauynge here reuerence, +tat is more.
(CMMANDEV,123.2992)
*~/
/*
23 IP-SUB: 27 ADJP, 28 QR more
*/
( (23 IP-SUB (24 NP-SBJ (25 D +tat)) (26 BEP is) (27 ADJP (28 QR more))) (ID CMMANDEV,123.2992))
iDomsTotal counts the number of nodes immediately dominated by the search-function argument. Traces count as daughters unless they are added to the ignore list. The following query:
(NP-OB* iDomsTotal 3)
yields this output:
/~*
And +tere it lykede him to suffre many repreuynges and scornes for vs
(CMMANDEV,1.4)
*~/
/*
10 IP-INF-1: 13 NP-OB1, 16 CONJP
*/
( (10 IP-INF-1 (11 TO to) (12 VB suffre) (13 NP-OB1 (14 Q many) (15 NS repreuynges) (16 CONJP (17 CONJ and) (18 NX (19 NS scornes)))) (20 PP (21 P for) (22 NP (23 PRO vs)))) (ID CMMANDEV,1.4))
Here, the 3 nodes immediately dominated by NP-OB1 are labelled Q, NS, and CONJP.
iDomsTotal< is like iDomsTotal except that it returns structures that immediately dominate strictly less than the given number of nodes. So this query:
(NP-OB* iDomsTotal< 3)
yields this output:
/~*
& take of euereche iliche myche
(CMHORSES,125.397)
*~/
/*
1 IP-IMP: 8 NP-OB1, 9 QP
*/
(0 (1 IP-IMP (2 CONJ &) (3 VBI take) (4 PP (5 P of) (6 NP (7 Q euereche))) (8 NP-OB1 (9 QP (10 ADV iliche) (11 Q myche)))) (ID CMHORSES,125.397))
iDomsTotal> is like iDomsTotal except that it returns structures that immediately dominate strictly more than the given number of nodes. So this query:
(NP-OB* iDomsTotal> 3)
will yield this output:
/~*
& aftur tak an hot yre +tat is smal bi-fore
(CMHORSES,95.119)
*~/
/*
1 IP-IMP: 6 NP-OB1, 10 CP-REL
*/
(0 (1 IP-IMP (2 CONJ &) (3 ADVP-TMP (4 ADV aftur)) (5 VBI tak) (6 NP-OB1 (7 D an) (8 ADJ hot) (9 N yre) (10 CP-REL (11 WNP-1 0) (12 C +tat) (13 IP-SUB (14 NP-SBJ *T*-1) (15 BEP is) (16 ADJP (17 ADJ smal)) (18 ADVP-LOC (19 ADV bi-fore)))))) (ID CMHORSES,95.119))
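The three iDomsTotal functions differ only in the comparison they apply to the number of daughters, which the following Python sketch makes explicit. Trees are written as nested (label, children) tuples with (label, text) for leaves. The helper names, the is_trace rule used to model an ignore list, and the choice not to count the text under a leaf label are all assumptions made for illustration.

```python
import operator

# Map each function name from the text to the comparison it performs.
COMPARISONS = {
    "iDomsTotal": operator.eq,
    "iDomsTotal<": operator.lt,
    "iDomsTotal>": operator.gt,
}

def is_trace(child):
    """Hypothetical ignore rule: a leaf whose text starts with '*'."""
    return isinstance(child[1], str) and child[1].startswith("*")

def count_daughters(node, ignored=lambda child: False):
    """Count the daughters of a (label, children) node, skipping ignored ones."""
    _label, body = node
    if isinstance(body, str):
        return 0  # assumption: the text under a leaf label is not counted here
    return sum(1 for child in body if not ignored(child))

def idoms_total(node, function, n, ignored=lambda child: False):
    """Apply iDomsTotal, iDomsTotal< or iDomsTotal> to one node."""
    return COMPARISONS[function](count_daughters(node, ignored), n)

# NP-OB1 nodes taken from the example outputs in this chapter.
many_repreuynges = ("NP-OB1", [("Q", "many"), ("NS", "repreuynges"),
                               ("CONJP", [("CONJ", "and"), ("NX", [("NS", "scornes")])])])
iliche_myche = ("NP-OB1", [("QP", [("ADV", "iliche"), ("Q", "myche")])])
hot_yre = ("NP-OB1", [("D", "an"), ("ADJ", "hot"), ("N", "yre"),
                      ("CP-REL", [("WNP-1", "0"), ("C", "+tat")])])
all_the_contrey = ("NP-OB1", [("Q", "all"), ("D", "the"), ("N", "contrey"),
                              ("CP-REL", "*ICH*-1")])
```

In this sketch all_the_contrey has four daughters when traces count and three when is_trace is passed as the ignore rule, which mirrors the remark above that traces count as daughters unless they are on the ignore list.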
Node x immediately dominates node y via trace t if and only if x immediately dominates t, t is co-indexed with another node z, and z immediately dominates y. The label of z must be that of x, and the index of z must be that of t. The trace t can be any of the corpus empty categories that bear indices; the form of the empty category is specified in the function call. A characteristic use of this function is to search for extraposed relative clauses. Thus, to search for extraposed relative clauses with pronominal subjects, you can use the following query:
node: IP*
query: (CP-REL iDomsViaTrace \*ICH\* IP-SUB) AND (IP-SUB iDoms NP-SBJ) AND (NP-SBJ iDoms PRO)
Note that the node boundary, here IP, must include both the trace and the extraposed constituent. The query finds this sentence:
/~*
Another defect I note, wherin I shall neede some Alchimist to helpe me...
(BACON-E2-H,2,4R.132)
*~/
/*
1 IP-MAT: 7 CP-REL, 24 IP-SUB, 27 NP-SBJ, 28 PRO
*/
( (IP-MAT (NP-OB1 (D+OTHER Another) (N defect) (CP-REL *ICH*-1)) (NP-SBJ (PRO I)) (VBP note) (, ,) (CP-REL-1 (WPP-2 (WADV+P wherin)) (C 0) (IP-SUB (PP *T*-2) (NP-SBJ (PRO I)) (MD shall) (VB neede) (NP-OB1 (Q some) (NS Alchimist) (CP-EOP (WNP-3 0) (IP-INF (NP-SBJ *T*-3) (TO to) (VB helpe) (NP-OB2 (PRO me)))))))
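To make the roles of x, t, z and y concrete, here is a minimal Python sketch run on a simplified version of the Bacon sentence. It assumes that indices are written as a final hyphen plus digits, compares labels exactly rather than with wild cards, and requires z to immediately dominate y; the function names and the abbreviated tree are hypothetical.

```python
import re

def split_index(text):
    """Split 'CP-REL-1' into ('CP-REL', '1'); return (text, None) if unindexed."""
    match = re.fullmatch(r"(.*?)-(\d+)", text)
    return (match.group(1), match.group(2)) if match else (text, None)

def walk(node):
    """Yield node and all of its descendants."""
    yield node
    if not isinstance(node[1], str):
        for child in node[1]:
            yield from walk(child)

def idoms_via_trace(tree, x_label, trace, y_label):
    """Return (x, z, y) triples for 'x_label iDomsViaTrace trace y_label'."""
    hits = []
    for x in walk(tree):
        label, body = x
        # x must carry the requested label and immediately dominate an indexed trace.
        if split_index(label)[0] != x_label or not isinstance(body, str):
            continue
        trace_form, index = split_index(body)
        if trace_form != trace or index is None:
            continue
        for z in walk(tree):
            z_label, z_index = split_index(z[0])
            # z has the label of x and the index of the trace.
            if z is x or z_label != x_label or z_index != index:
                continue
            if isinstance(z[1], str):
                continue
            for y in z[1]:  # y must be an immediate daughter of z
                if split_index(y[0])[0] == y_label:
                    hits.append((x, z, y))
    return hits

# Simplified version of the example sentence (lower levels omitted).
tree = ("IP-MAT", [
    ("NP-OB1", [("D+OTHER", "Another"), ("N", "defect"), ("CP-REL", "*ICH*-1")]),
    ("NP-SBJ", [("PRO", "I")]),
    ("VBP", "note"),
    ("CP-REL-1", [("WPP-2", [("WADV+P", "wherin")]), ("C", "0"),
                  ("IP-SUB", [("PP", "*T*-2"), ("NP-SBJ", [("PRO", "I")]),
                              ("MD", "shall"), ("VB", "neede")])]),
])
hits = idoms_via_trace(tree, "CP-REL", "*ICH*", "IP-SUB")
```

The single hit pairs the trace-bearing CP-REL inside NP-OB1 with the extraposed CP-REL-1 and its daughter IP-SUB.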
"inID" is true of substrings of the ID node. This functin is introduced because the ID node, being outside of the parsed sentence, cannot serve as an argument of a search function. In particular, (ID iDominates *) will return no hits.
Here's a typical ID node from the Malory parsed file in the Middle English corpus:
(ID CMMALORY,3.41)
To isolate Malory sentences from an output file, you could use this query:
query: (*MALORY* inID)
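The effect of this query can be imitated outside CorpusSearch with a few lines of Python. The sketch below assumes that inID compares the pattern with the text of the ID node and that "*" stands for any run of characters; the function name in_id is hypothetical, and the ID strings are taken from examples in this chapter.

```python
import re

def in_id(pattern, id_text):
    """True if id_text matches pattern, where '*' stands for any run of characters."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, id_text) is not None

ids = [
    "CMMALORY,3.41",
    "CMREYNAR,14.261",
    "CMMALORY,14.419",
]
# Keep only the tokens whose ID contains MALORY.
malory_only = [i for i in ids if in_id("*MALORY*", i)]
```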
The algorithm for "x iPrecedes y" runs as follows:
1.) Find x.
2.) If x has an immediately following sister, then that sister and all its leftmost descendants (that is, the first child of the sister, the first child of the first child, and on as far as the tree goes) are candidates for y.
3.) If x has no immediately following sister, recurse from 2.) with the mother of x in place of x.
The following query:
query: ([1]as iPrecedes sone) AND (sone iPrecedes [2]as)
produces this output:
/~*
and as sone as he myght he toke his horse
(CMMALORY,206.3401)
*~/
/*
1 IP-MAT: 6 as, 8 sone, 11 as
*/
( (IP-MAT (CONJ and) (ADVP-TMP (ADVR as) (ADV sone) (PP (P as) (CP-CMP (WADVP-1 0) (C 0) (IP-SUB (ADVP-TMP *T*-1) (NP-SBJ (PRO he)) (MD myght) (VB *))))) (NP-SBJ (PRO he)) (VBD toke) (NP-OB1 (PRO$ his) (N horse))) (ID CMMALORY,206.3401))
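The three steps of the iPrecedes algorithm can be sketched in Python as follows. The Node class and the function iprecedes_candidates are hypothetical names, words are modelled as childless nodes under their labels, and any special handling of ignored nodes is left out. The tree is only the opening of the ADVP-TMP from the sentence above.

```python
class Node:
    def __init__(self, label, *children):
        self.label = label
        # A plain string child is a word, stored as a childless node.
        self.children = [Node(c) if isinstance(c, str) else c for c in children]
        self.parent = None
        for child in self.children:
            child.parent = self

def iprecedes_candidates(x):
    """Return every node y for which 'x iPrecedes y' holds in this sketch."""
    node = x                                  # step 1: start from x
    while node.parent is not None:
        sisters = node.parent.children
        pos = next(i for i, s in enumerate(sisters) if s is node)
        if pos + 1 < len(sisters):
            # Step 2: the following sister and all its leftmost descendants.
            candidates = []
            current = sisters[pos + 1]
            while True:
                candidates.append(current)
                if not current.children:
                    break
                current = current.children[0]
            return candidates
        node = node.parent                    # step 3: retry with the mother
    return []

advp = Node("ADVP-TMP",
            Node("ADVR", "as"),
            Node("ADV", "sone"),
            Node("PP", Node("P", "as"), Node("CP-CMP", Node("WADVP-1", "0"))))
first_as = advp.children[0].children[0]
sone = advp.children[1].children[0]
```

Starting from the first "as", which has no following sister, the search moves up to ADVR and finds ADV and "sone"; starting from "sone" it finds PP, P and the second "as", which is why both halves of the query above succeed.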
isRoot searches for the argument label at the root of the tree of the parsed token. For instance, this query:
query: (CP* isRoot)
will return all tokens in the corpus whose root is a CP, for instance, the following sentence:
/~*
why thou whoreson when wilt thou be maried?
(DELONEY,79.296)
*~/
/*
1 CP-QUE-SPE: 1 CP-QUE-SPE
*/
( (CP-QUE-SPE (INTJP (WADV why)) (NP-VOC (PRO thou) (N$+N whoreson)) (WADVP-1 (WADV when)) (IP-SUB-SPE (ADVP *T*-1) (MD wilt) (NP-SBJ (PRO thou)) (BE be) (VAN maried)) (. ?)) (ID DELONEY,79.296))
isRoot ignores the node boundary set by the query and returns results based only on the label of the root of the parse tree of each token in the input file.
(VB precedes NP-OB*)
produces this output:
/~*
thenne have ye cause to make myghty werre upon hym. '
(CMMALORY,2.25)
*~/
/*
9 IP-INF-PRP: 11 VB make, 12 NP-OB1
*/
( (9 IP-INF-PRP (10 TO to) (11 VB make) (12 NP-OB1 (13 ADJ myghty) (14 N werre) (15 PP (16 P upon) (17 NP (18 PRO hym))))) (ID CMMALORY,2.25))
node: IP*
query: (NP* iDoms \*exp\*) AND (NP* sameIndex CP*)
finds this sentence:
/~*
hym thought there was com into hys londe gryffens and serpentes,
(CMMALORY,33.1031)
*~/
/*
1 IP-MAT: 2 NP-SBJ-1, 3 *exp*, 9 CP-THT-1
*/
( (IP-MAT (NP-SBJ-1 *exp*) (NP-OB2 (PRO hym)) (VBD thought) (CP-THT-1 (C 0) (IP-SUB (NP-SBJ-2 (EX there)) (BED was) (VBN com) (PP (P into) (NP (PRO$ hys) (N londe))) (NP-2 (NS gryffens) (CONJ and) (NS serpentes)))) (E_S ,)) (ID CMMALORY,33.1031))
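In the hit above, NP-SBJ-1 and CP-THT-1 share the index 1. A minimal Python sketch of such a comparison is given below. It assumes that an index is the run of digits after the final hyphen of a label and that unindexed labels never match; the function names are hypothetical, and the sketch does not try to cover every index notation used in the corpora.

```python
import re

def label_index(label):
    """Return the numeric index at the end of a label, or None if there is none."""
    match = re.search(r"-(\d+)$", label)
    return match.group(1) if match else None

def same_index(first, second):
    """True if both labels carry an index and the two indices are equal."""
    index = label_index(first)
    return index is not None and index == label_index(second)
```

Under these assumptions, same_index("NP-SBJ-1", "CP-THT-1") is true, while same_index("NP-SBJ-2", "CP-THT-1") and same_index("NP-OB2", "CP-THT-1") are both false.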