Contents of this chapter:

General considerations
Search functions


General considerations

We commonly refer to the first argument to a search function as "x", and the second argument as "y".

To save typing and to improve readability, CorpusSearch allows shorthands and lower-case/upper-case variations for the names of search functions. Acceptable variants are listed below with each function.

When a function has an integer argument, there is always a space between the function and argument. This syntax is a change from earlier versions of CorpusSearch.

In general, the wild card "*" in CorpusSearch corresponds to the standard regular expression ".*". In other words, it matches a string of any number (including zero) of arbitrary characters, not necessarily the same as each other. However, after expressions in square brackets (and certain other complex contexts), only the standard regular expression has the desired effect. If the one-character wild card "*" gives unexpected results, use the standard regular expression ".*" instead.
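The correspondence between "*" and ".*" can be illustrated with a small Python sketch. It is not CorpusSearch's own code: the function names are hypothetical, and it assumes only three rules (an unescaped "*" becomes ".*", a backslash makes the next character literal, and "|" separates alternatives). The square-bracket contexts mentioned above are not handled.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate a CorpusSearch-style pattern into a Python regex string.

    Assumed rules (a simplification): an unescaped "*" becomes ".*",
    a backslash makes the following character literal, and "|"
    separates alternatives.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
        elif ch == "*":
            out.append(".*")  # wild card: any run of characters
            i += 1
        elif ch == "|":
            out.append("|")  # alternation is kept as is
            i += 1
        else:
            out.append(re.escape(ch))  # everything else is literal
            i += 1
    return "(?:" + "".join(out) + ")"


def matches(pattern: str, label: str) -> bool:
    """True if the whole label matches the pattern."""
    return re.fullmatch(wildcard_to_regex(pattern), label) is not None
```

Under these assumptions, matches("NP*", "NP-SBJ") and matches("NP*", "NP") are both true, while matches("NP*", "PP") is false.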

Search functions

CCommands (variants: cCommands, ccommands)

A node x ccommands a node y if and only if:
  1. neither x nor y dominates the other AND
  2. the first branching node dominating x does dominate y.
In the following tree,
             A
            / \
           B   C
          / \   \
         D   E   F
B ccommands C and F and both C and F ccommand B, D and E. D and E, on the other hand, ccommand only each other. A ccommands no node because, being the root of the tree, it dominates all of the other nodes. The following query:
query: (NP-SBJ* idoms PRO$) AND (PRO$ ccommands NP*)

finds examples like:
(NP-SBJ (PRO$ his)
        (ADVR+Q ouermoch)
        (N fearinge)
        (PP (P of)
            (NP (PRO you))))
in which a possessive pronoun ccommands a noun phrase, here the object of a prepositional complement to the head noun.
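The two-part definition can be checked against the small tree above with a short Python sketch. The dictionary encoding of the tree and the function names are illustrative assumptions, not part of CorpusSearch.

```python
# Hypothetical encoding of the example tree: A -> B C, B -> D E, C -> F.
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
            "D": [], "E": [], "F": []}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def dominates(x, y):
    """True if y lies in the subtree below x (a node does not dominate itself)."""
    node = PARENT.get(y)
    while node is not None:
        if node == x:
            return True
        node = PARENT.get(node)
    return False


def ccommands(x, y):
    """Apply the two conditions of the definition given above."""
    # Condition 1: neither node dominates the other.
    if x == y or dominates(x, y) or dominates(y, x):
        return False
    # Condition 2: climb to the first branching node above x ...
    node = PARENT.get(x)
    while node is not None and len(CHILDREN[node]) < 2:
        node = PARENT.get(node)
    # ... and require that it dominates y.
    return node is not None and dominates(node, y)


def ccommanded_by(x):
    """All nodes that x ccommands, in alphabetical order."""
    return sorted(y for y in CHILDREN if ccommands(x, y))
```

Here ccommanded_by("B") gives ['C', 'F'], ccommanded_by("F") gives ['B', 'D', 'E'], and ccommanded_by("A") gives the empty list, as described in the text.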

Column (variants: column, Col, col)

"Column" is used to search columns of the CODING node, or any other leaf whose text is written in columns separated by ":".

If, for instance, you want to find sentences whose CODING-IP-MAT node contains an "m" or "n" in the 7th column, use this query:

query:  (CODING-IP-MAT column 7 m|n)
If you want to find sentences whose CODING node does not contain a "p" or "q" in the 4th column, use this query:
query:  (CODING-IP-MAT column 4 !p|q)
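As a rough illustration of what a column test does, the following Python sketch splits a leaf on ":" and checks one column. The function name and the sample coding string are hypothetical. The sketch also assumes that a column must equal one of the listed alternatives exactly and that a leading "!" negates the whole list; CorpusSearch itself may treat these details differently.

```python
def column_matches(leaf_text: str, number: int, spec: str) -> bool:
    """Check a 1-based column of a ':'-separated leaf against a spec.

    spec is a "|"-separated list of alternatives; a leading "!" is
    assumed to negate the whole list.
    """
    columns = leaf_text.split(":")
    if not 1 <= number <= len(columns):
        return False  # assumption: a missing column never matches
    negated = spec.startswith("!")
    alternatives = (spec[1:] if negated else spec).split("|")
    found = columns[number - 1] in alternatives
    return found != negated


# Hypothetical coding string with eight columns.
coding = "a:b:c:x:e:f:m:h"
```

With this string, column_matches(coding, 7, "m|n") is true because the 7th column is "m", and column_matches(coding, 4, "!p|q") is true because the 4th column is neither "p" nor "q".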

Dominates (variants: dominates, Doms, doms)

dominates means "dominates to any generation." That is, y is contained in the sub-tree dominated by x. Dominates will accept text as y, but text as x will always return an empty set (text never dominates a subtree). The following query, which uses the escape character "\" to search for *arb*:

(IP-INF dominates \*arb*)

returns this sentence:

And soo by the counceil of Merlyn the kyng lete calle his barons to counceil,

    18 IP-INF: 19 NP-SBJ *arb*

      (18 IP-INF (19 NP-SBJ *arb*)
                 (20 VB calle)
                 (21 NP-OB1 (22 PRO$ his) (23 NS barons))
                 (24 PP (25 P to)
                        (26 NP (27 N counceil))))
      (ID CMMALORY,14.419))

DomsWords (variants: domsWords, domswords)

domsWords counts the number of words dominated by the search-function argument. So "domsWords 4" means "dominates 4 words", "domsWords 2" means "dominates 2 words", and so on. A word in this case is defined as a leaf node that is not on the word_ignore_list. Here's the default word_ignore_list:


Thus, traces, 0 complementizers, punctuation, and comments are not counted as words.

So this query:

node: NP*

(NP-OB* domsWords 3)

will return this structure (ignoring the trace *ICH*-1):

and by kynge Ban and Bors his counceile they lette brenne and destroy all the
contrey before them there they sholde ryde.

    24 NP-OB1: 27 N contrey

      (24 NP-OB1 (25 Q all)
                 (26 D the)
                 (27 N contrey)
                 (28 CP-REL *ICH*-1))
      (ID CMMALORY,20.613))

(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").
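The counting rule can be sketched in Python as follows. The tuple encoding of the tree is an illustrative choice, and is_ignored is only a stand-in for the word_ignore_list: it assumes that traces start with "*", that empty elements are written "0", and that a handful of punctuation and comment labels are skipped. The real default list may differ.

```python
# A node is (label, child, ...); a leaf is (label, "text").
def leaves(node):
    """Yield (label, text) for every leaf below node."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        yield label, children[0]
    else:
        for child in children:
            yield from leaves(child)


def is_ignored(label, text):
    # ASSUMPTION: stand-in for word_ignore_list (traces, zero elements,
    # punctuation labels, comments); not the actual default list.
    return (text.startswith("*") or text == "0"
            or label in {",", ".", "E_S", "CODE", "COMMENT", "ID"})


def count_words(node):
    """Number of leaves below node that count as words."""
    return sum(1 for label, text in leaves(node) if not is_ignored(label, text))


# The NP-OB1 node from the example output above.
np_ob1 = ("NP-OB1", ("Q", "all"), ("D", "the"), ("N", "contrey"),
          ("CP-REL", "*ICH*-1"))
```

For this node count_words(np_ob1) is 3, because the trace *ICH*-1 is skipped, so the node satisfies "domsWords 3".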

DomsWords< (variants: domsWords<, domswords<)

domsWords< is just like domsWords except that it returns structures that dominate strictly less than the given number of words. For instance, this query:

(NP-OB* domsWords< 3)

will return this structure:

for it was I myself that cam in the lykenesse.

    6 NP-OB1: 9 PRO$+N myself

      (6 NP-OB1 (7 PRO I)
                (8 NP-PRN (9 PRO$+N myself)))
      (ID CMMALORY,5.131))

(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").

DomsWords> (variants: domsWords>, domswords>)

domsWords> is just like domsWords except that it returns structures that dominate strictly more than the given number of words. For instance, this query:

(NP-OB* domsWords> 3)

will return this structure:

for she was called a fair lady and a passynge wyse,

    9 NP-OB1: 20 ADJ wyse

      (9 NP-OB1
                (10 NP (11 D a) (12 ADJ fair) (13 N lady))
                (14 CONJP (15 CONJ and)
                          (16 NP (17 D a)
                                 (18 ADJP (19 ADV passynge) (20 ADJ wyse)))))
      (ID CMMALORY,2.9))

(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").

Exists (variants: exists)

exists searches for a label or text anywhere in the sentence. For instance, this query:

(MD0 exists)

will find this sentence:

but I fere me that I shal not conne wel goo thyder /

    10 IP-SUB: 15 MD0 conne

      (10 IP-SUB
                 (11 NP-SBJ (12 PRO I))
                 (13 MD shal)
                 (14 NEG not)
                 (15 MD0 conne)
                 (16 ADVP (17 ADV wel))
                 (18 VB goo)
                 (19 ADVP-DIR (20 ADV thyder)))
      (ID CMREYNAR,14.261))

A common mistake is to use "exists" unnecessarily, as in this example:

(MD exists) AND (MD iPrecedes VB)

If a sentence contains the structure (MD iPrecedes VB), MD necessarily exists in the sentence. So this query would get the same result:

(MD iPrecedes VB)

HasLabel (variants: hasLabel, haslabel)

x hasLabel y if the label of node x is the string y. This query:

(NP* HasLabel NP-SBJ)

will find all NPs with the simple label NP-SBJ and no indices or further dash tags. This function is useful only for coding queries.

HasSister (variants: hasSister, hassister)

x hasSister y if x and y have the same mother. It doesn't matter whether x precedes y or y precedes x. So this query:
node: IP*
query: (NP-SBJ hasSister BE*)
finds both of these sentences:

indeede I must be gone:
1 IP-MAT-SPE:  5 NP-SBJ, 10 BE

( (IP-MAT-SPE (PP (P+N indeede))
              (NP-SBJ (PRO I))
              (MD must)
              (BE be)
              (VBN gone)
              (. :))
  (ID DELONEY,69.13))

I pray you is it true?
13 IP-SUB-SPE:  16 NP-SBJ, 14 BEP

                              (CODE {TEMP:prn_ok})
                              (VBP pray)
                              (NP-ACC (PRO you)))
              (IP-SUB-SPE (BEP is)
                          (NP-SBJ (PRO it))
                          (ADJP (ADJ true)))
              (. ?))
  (ID DELONEY,70.47))

iDominates (variants: idominates, iDoms, idoms)

iDominates means "immediately dominates". That is, x immediately dominates y if y is a child of x (exactly one generation apart). So this query:

((NP* iDominates FP) AND (FP iDominates ane))

finds this sentence:

Sythen he ledes +tam by +tar ane,

    1 IP-MAT: 11 NP, 13 FP ane

   (1 IP-MAT
             (2 ADVP-TMP (3 ADV Sythen))
             (4 NP-SBJ (5 PRO he))
             (6 VBP ledes)
             (7 NP-OB1 (8 PRO +tam))
             (9 PP (10 P by)
                   (11 NP (12 PRO$ +tar) (13 FP ane)))
             (14 E_S ,))
      (ID CMROLLEP,118.978))


Notice that "iDominates" describes the relationship between a label and its associated text (e.g., "FP" and "ane").

iDomsFirst (variants: idomsfirst)

"iDomsFirst" means "immediately dominates as a first child."

For instance, this query:

node: IP*
query: (NP* iDomsFirst PRO$)

results in this output:

My Lady yor mother, I thanke God, is very well and cheerly,
1 IP-MAT:  2 NP-SBJ, 3 PRO$
1 IP-MAT:  7 NP-PRN, 8 PRO$

                  (N Lady)
                  (NP-PRN (PRO$ yor) (N mother)))
          (, ,)
          (IP-MAT-PRN (NP-SBJ (PRO I))
                      (VBP thanke)
                      (NP-ACC (NPR God)))
          (, ,)
          (BEP is)
          (ADJP (ADJP (ADV very) (ADJ well))
                (CONJP (CONJ and)
                       (ADJX (ADJ cheerly))))
          (. ,))
  (ID KNYVETT-1630,86.12))

iDomsLast (variants: idomslast)

"iDomsLast" means "immediately dominates as a last child."

So this query:

node: IP*
query: (IP* iDomsLast BEN)

results in this output:

but keepes her chamber because of the Bitter weather that hath been.
31 IP-SUB:  31 IP-SUB, 36 BEN

( (IP-MAT (CONJ but)
          (NP-SBJ *con*)
          (VBP keepes)
          (NP-ACC (PRO$ her) (N chamber))
          (PP (P+N because)
              (PP (P of)
                  (NP (D the)
                      (ADJ Bitter)
                      (N weather)
                      (CP-REL (WNP-1 0)
                              (C that)
                              (IP-SUB (NP-SBJ *T*-1)
                                      (HVP hath)
                                      (BEN been))))))
          (. .))
  (ID KNYVETT-1630,86.13))

iDomsMod (variants: idomsmod)

This function takes three arguments in the following form: (x iDomsMod z y). It is read as "x immediately dominates y, mod z". It is satisfied if x dominates y and the only nodes intervening on the path from x to y (if any) are instances of the label z. Note that if no nodes at all intervene on the path from x to y, the function is true. The most obvious use of this function is to search within conjuncts. Thus, to search for pronominal subjects within conjoined NPs, you can use the following query:
node: IP*
query: (NP-SBJ iDomsMod NP*|CONJ* PRO)
This query finds this sentence:

So by the entrete at the last the kyng and she met togyder.
1 IP-MAT:  21 NP-SBJ, 31 PRO, 27 CONJP

(0  (1 IP-MAT (2 ADVP (3 ADV So))
              (5 PP (6 P by)
                    (8 NP (9 D the) (11 N entrete)))
              (13 PP (14 P at)
                     (16 NP (17 D the) (19 ADJ last)))
              (21 NP-SBJ (22 NP (23 D the) (25 N kyng))
                         (27 CONJP (28 CONJ and)
                                   (30 NP (31 PRO she))))
              (33 VBD met)
              (35 ADVP (36 ADV togyder))
              (38 E_S .))
    (40 ID CMMALORY,4.104))
The query
node: IP*
query: (NP-SBJ iDomsMod NP*|CONJ* !PRO)
would also find the above sentence because "NP-SBJ iDomsMod NP" is true of the full NP "the kyng."
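The path condition behind iDomsMod can be sketched in Python. This is an illustration under stated assumptions, not CorpusSearch's implementation: trees are encoded as nested tuples, label_matches is a hypothetical helper that supports only "*", "|" and a leading "!", and the text of a leaf is not treated as a node.

```python
import re


def label_matches(pattern, label):
    """Simplified label test: "*" is a wild card, "|" separates
    alternatives, and a leading "!" negates the whole pattern."""
    negated = pattern.startswith("!")
    if negated:
        pattern = pattern[1:]
    regex = "|".join(re.escape(p).replace(r"\*", ".*")
                     for p in pattern.split("|"))
    return (re.fullmatch(f"(?:{regex})", label) is not None) != negated


def children(node):
    """Daughter nodes of a (label, child, ...) tuple; leaf text is skipped."""
    return [c for c in node[1:] if not isinstance(c, str)]


def reachable_mod(node, z):
    """Descendants whose path from node crosses only z-labelled nodes."""
    for child in children(node):
        yield child
        if label_matches(z, child[0]):  # only z nodes may intervene
            yield from reachable_mod(child, z)


def idoms_mod(x_node, z, y):
    """True if x_node immediately dominates some y node, mod z."""
    return any(label_matches(y, d[0]) for d in reachable_mod(x_node, z))


# The subject NP of the example sentence above.
np_sbj = ("NP-SBJ",
          ("NP", ("D", "the"), ("N", "kyng")),
          ("CONJP", ("CONJ", "and"), ("NP", ("PRO", "she"))))
```

With this tree, idoms_mod(np_sbj, "NP*|CONJ*", "PRO") is true because only CONJP and NP intervene between NP-SBJ and PRO, whereas idoms_mod(np_sbj, "NP*", "PRO") is false because CONJP is not allowed as an intervening node. The call with "!PRO" is also true, since the first conjunct NP is itself a non-PRO daughter.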

iDomsNumber (variants: idomsnumber, iDomsNum, idomsnum)

"iDomsNumber" means "immediately dominates as the #th child". That is, x immediately dominates y as the #th child if x immediately dominates y and y is the #th child of x. Note that "iDomsNumber 1" is identical to "iDomsFirst." This query:

(CP-DEG iDomsNumber 1 C)

produces this output:

And Merlion was so disgysed that kynge Arthure knewe hym nat,

    1 IP-MAT: 9 CP-DEG, 10 C that

   (1 IP-MAT (2 CONJ And)
             (3 NP-SBJ (4 NPR Merlion))
             (5 BED was)
             (6 ADJP (7 ADVR so)
                     (8 VAN disgysed)
                     (9 CP-DEG (10 C that)
                               (11 IP-SUB
                                          (12 NP-SBJ (13 NPR kynge) (14 NPR Arthure))
                                          (15 VBD knewe)
                                          (16 NP-OB1 (17 PRO hym))
                                          (18 NEG nat))))
             (19 E_S ,))
      (ID CMMALORY,30.939))

iDomsOnly (variants: idomsonly)

iDomsOnly means "immediately dominates as an only child." That is, x immediately dominates y as an only child if x immediately dominates y and y is the only legitimate child of x. So this query:

(ADJP iDomsOnly Q*)

results in this output:

But after my lytyll wytt it semeth me, sauynge here reuerence, +tat is more.

    23 IP-SUB: 27 ADJP, 28 QR more

      (23 IP-SUB
                 (24 NP-SBJ (25 D +tat))
                 (26 BEP is)
                 (27 ADJP (28 QR more)))
      (ID CMMANDEV,123.2992))

iDomsTotal (variants: idomstotal)

iDomsTotal counts the number of nodes immediately dominated by the search-function argument. Traces count as daughters unless they are added to the ignore list. The following query:

(NP-OB* iDomsTotal 3)
yields this output:
And +tere it lykede him to suffre many repreuynges and scornes for vs

    10 IP-INF-1: 13 NP-OB1, 16 CONJP

      (10 IP-INF-1 (11 TO to)
                   (12 VB suffre)
                   (13 NP-OB1 (14 Q many)
                              (15 NS repreuynges)
                              (16 CONJP (17 CONJ and)
                                        (18 NX (19 NS scornes))))
                   (20 PP (21 P for)
                          (22 NP (23 PRO vs))))
      (ID CMMANDEV,1.4))

Here, the 3 nodes immediately dominated by NP-OB1 are labelled Q, NS, and CONJP.

iDomsTotal< (variants: idomstotal<)

iDomsTotal< is like iDomsTotal except that it returns structures that immediately dominate strictly less than the given number of nodes. So this query:

(NP-OB* iDomsTotal< 3)

yields this output:

& take of euereche iliche myche

    1 IP-IMP: 8 NP-OB1, 9 QP

   (1 IP-IMP (2 CONJ &)
             (3 VBI take)
             (4 PP (5 P of)
                   (6 NP (7 Q euereche)))
             (8 NP-OB1
                       (9 QP (10 ADV iliche) (11 Q myche))))
      (ID CMHORSES,125.397))

iDomsTotal> (variants: idomstotal>)

iDomsTotal> is like iDomsTotal except that it returns structures that immediately dominate strictly more than the given number of nodes. So this query:

(NP-OB* iDomsTotal> 3)

will yield this output:

& aftur tak an hot yre +tat is smal bi-fore

    1 IP-IMP: 6 NP-OB1, 10 CP-REL

   (1 IP-IMP (2 CONJ &)
             (3 ADVP-TMP (4 ADV aftur))
             (5 VBI tak)
             (6 NP-OB1 (7 D an)
                       (8 ADJ hot)
                       (9 N yre)
                       (10 CP-REL (11 WNP-1 0)
                                  (12 C +tat)
                                  (13 IP-SUB (14 NP-SBJ *T*-1)
                                             (15 BEP is)
                                             (16 ADJP (17 ADJ smal))
                                             (18 ADVP-LOC (19 ADV bi-fore))))))
      (ID CMHORSES,95.119))
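The three iDomsTotal variants differ only in the comparison they apply, as the following Python sketch shows. The tuple encoding and the variant argument are illustrative assumptions, and the ignore list is not modelled, so every daughter is counted.

```python
import operator

# Comparison used by each variant: "" for iDomsTotal, "<" and ">" for the others.
COMPARE = {"": operator.eq, "<": operator.lt, ">": operator.gt}


def idoms_total(node, n, variant=""):
    """Compare the number of immediate daughters of node with n.

    A node is (label, child, ...); every child counts as one daughter.
    """
    return COMPARE[variant](len(node) - 1, n)


# Daughters of the three NP-OB1 nodes in the examples above
# (material below the daughters is omitted).
three = ("NP-OB1", ("Q", "many"), ("NS", "repreuynges"), ("CONJP",))
one = ("NP-OB1", ("QP", ("ADV", "iliche"), ("Q", "myche")))
four = ("NP-OB1", ("D", "an"), ("ADJ", "hot"), ("N", "yre"), ("CP-REL",))
```

Here idoms_total(three, 3), idoms_total(one, 3, "<") and idoms_total(four, 3, ">") are all true, matching the three example outputs.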

iDomsViaTrace (variants: idomsviatrace)

This function was introduced in version 70 of CorpusSearch 2. It is defined as follows:

Node x immediately dominates (via trace t) node y if and only if x immediately dominates t, t is co-indexed with another node z, and z immediately dominates y. The label of z must be that of x and the index of z must be that of t. The trace t can be any of the corpus empty categories that bear indices, and the form of the empty category is specified in the function call. A characteristic use of this function is to search for extraposed relative clauses. Thus, to search for extraposed relative clauses with pronominal subjects, you can use the following query:

node: IP*
(CP-REL iDomsViaTrace \*ICH\* IP-SUB)
Note that the node boundary, here IP, must include both the trace and the extraposed constituent. The query finds this sentence:

Another defect I note, wherin I shall neede some Alchimist to helpe me...
1 IP-MAT: 7 CP-REL, 24 IP-SUB, 27 NP-SBJ, 28 PRO

( (IP-MAT (NP-OB1 (D+OTHER Another)
                  (N defect)
                  (CP-REL *ICH*-1)
          (NP-SBJ (PRO I))
          (VBP note)
          (, ,)
          (CP-REL-1 (WPP-2 (WADV+P wherin))
                    (C 0)
                    (IP-SUB (PP *T*-2)
                            (NP-SBJ (PRO I))
                            (MD shall)
                            (VB neede)
                            (NP-OB1 (Q some)
                                    (NS Alchimist)
                                    (CP-EOP (WNP-3 0)
                                            (IP-INF (NP-SBJ *T*-3)
                                                    (TO to)
                                                    (VB helpe)
                                                    (NP-OB2 (PRO me)))))))
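The lookup through a trace can be sketched in Python. The sketch assumes, as the example suggests, that y must be an immediate daughter of the co-indexed node z, that z's label is x's label followed by "-" and the index of the trace, and that the trace is written as the text of the x node. The tuple encoding, the function names and the abridged tree are illustrative only.

```python
def walk(node):
    """Yield node and every node below it; a node is (label, child, ...)."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from walk(child)


def idoms_via_trace(root, x, trace, y):
    """Labels of y nodes that an x node dominates via the given trace.

    ASSUMPTION: z is found by looking for the label x + "-" + index.
    """
    hits = []
    for node in walk(root):
        if node[0] != x:
            continue
        for child in node[1:]:
            # x must immediately dominate an indexed trace, e.g. "*ICH*-1".
            if isinstance(child, str) and child.startswith(trace + "-"):
                index = child[len(trace) + 1:]
                target = f"{x}-{index}"
                for z in walk(root):
                    if z[0] == target:
                        hits += [c[0] for c in z[1:]
                                 if not isinstance(c, str) and c[0] == y]
    return hits


# Abridged version of the example tree above.
tree = ("IP-MAT",
        ("NP-OB1", ("D+OTHER", "Another"), ("N", "defect"),
         ("CP-REL", "*ICH*-1")),
        ("NP-SBJ", ("PRO", "I")),
        ("VBP", "note"),
        ("CP-REL-1", ("WPP-2", ("WADV+P", "wherin")), ("C", "0"),
         ("IP-SUB", ("NP-SBJ", ("PRO", "I")), ("MD", "shall"),
          ("VB", "neede"))))
```

For this tree, idoms_via_trace(tree, "CP-REL", "*ICH*", "IP-SUB") returns ['IP-SUB']: the CP-REL inside the object dominates the trace *ICH*-1, and the co-indexed CP-REL-1 immediately dominates IP-SUB.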

InID (variants: inID)

"inID" is true of substrings of the ID node. This function was introduced because the ID node, being outside of the parsed sentence, cannot serve as an argument of a search function. In particular, (ID iDominates *) will return no hits.

Here's a typical ID node from the Malory parsed file in the Middle English corpus:


To isolate Malory sentences from an output file, you could use this query:

query:  (*MALORY* inID)
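The effect of an inID test can be approximated in Python with the standard fnmatch module. This is only an analogy: fnmatchcase treats "*" as "any run of characters", which is close to the wild card described in this chapter, but it also gives special meaning to "?" and "[...]". The function name in_id is hypothetical, and the sample IDs are taken from example outputs elsewhere in this chapter.

```python
from fnmatch import fnmatchcase


def in_id(pattern: str, id_text: str) -> bool:
    """Approximate (pattern inID) by shell-style matching on the ID text."""
    return fnmatchcase(id_text, pattern)


# ID strings copied from example outputs in this chapter.
ids = ["CMMALORY,206.3401", "CMREYNAR,14.261", "CMMALORY,2.25"]
malory_only = [i for i in ids if in_id("*MALORY*", i)]
```

After this runs, malory_only holds the two CMMALORY IDs and omits the CMREYNAR one.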

iPrecedes (variants: iprecedes, iPres, ipres)

This function is true if and only if its first argument immediately precedes its second argument in the text string spanned by the parse tree.

The algorithm for "x iPrecedes y" runs as follows:

1.) Find x.

2.) If x has an immediately following sister, then that sister and all its leftmost descendants (that is, the first child of the sister, the first child of the first child, and on as far as the tree goes) are candidates for y.

3.) If x has no immediately following sister, recurse from 2.) with the mother of x in place of x.
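The three steps above can be sketched in Python as follows. The Node class and the function names are illustrative assumptions, words are modelled as leaf nodes whose label is the word itself, and no nodes are skipped as ignorable.

```python
class Node:
    """Minimal tree node with a parent pointer."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def next_sister(node):
    """The immediately following sister of node, or None."""
    if node.parent is None:
        return None
    sisters = node.parent.children
    i = sisters.index(node)
    return sisters[i + 1] if i + 1 < len(sisters) else None


def iprecedes_candidates(x):
    """All nodes y for which "x iPrecedes y" holds under steps 1 to 3."""
    node = x
    while node is not None:
        sister = next_sister(node)
        if sister is not None:
            # Step 2: the sister and all of its leftmost descendants.
            result = [sister]
            while sister.children:
                sister = sister.children[0]
                result.append(sister)
            return result
        # Step 3: no following sister, so retry with the mother.
        node = node.parent
    return []


# Fragment of a tree for "as sone as": ADVP-TMP -> ADVR ADV PP.
as1, sone, as2 = Node("as"), Node("sone"), Node("as")
advp = Node("ADVP-TMP", Node("ADVR", as1), Node("ADV", sone),
            Node("PP", Node("P", as2)))
```

Starting from the first "as", the candidates are ADV and "sone"; starting from "sone", they are PP, P and the second "as". The last word of the fragment has no candidates because no ancestor has a following sister.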

The following query:

query: ([1]as iPrecedes sone) AND (sone iPrecedes [2]as)

produces this output:

and as sone as he myght he toke his horse
1 IP-MAT:  6 as, 8 sone, 11 as

( (IP-MAT (CONJ and)
          (ADVP-TMP (ADVR as)
                    (ADV sone)
                    (PP (P as)
                        (CP-CMP (WADVP-1 0)
                                (C 0)
                                (IP-SUB (ADVP-TMP *T*-1)
                                        (NP-SBJ (PRO he))
                                        (MD myght)
                                        (VB *)))))
          (NP-SBJ (PRO he))
          (VBD toke)
          (NP-OB1 (PRO$ his) (N horse)))
  (ID CMMALORY,206.3401))

IsRoot (variants: isRoot, isroot)

isRoot searches for the argument label at the root of the tree of the parsed token. For instance, this query:

query: (CP* isRoot)

will return all tokens in the corpus whose root is a CP, such as the following sentence:

why thou whoreson when wilt thou be maried?

              (NP-VOC (PRO thou) (N$+N whoreson))
              (WADVP-1 (WADV when))
              (IP-SUB-SPE (ADVP *T*-1)
                          (MD wilt)
                          (NP-SBJ (PRO thou))
                          (BE be)
                          (VAN maried))
              (. ?))
  (ID DELONEY,79.296))

IsRoot ignores the node boundary set by the query and returns results based only on the label of the root of the parse tree of each token in the input file.

Precedes (variants: precedes, Pres, pres)

"x precedes y" means "x comes before y in the tree but x does not dominate y". So this query:

(VB precedes NP-OB*)

produces this output:

thenne have ye cause to make myghty werre upon hym. '

    9 IP-INF-PRP: 11 VB make, 12 NP-OB1

      (9 IP-INF-PRP (10 TO to)
                    (11 VB make)
                    (12 NP-OB1 (13 ADJ myghty)
                               (14 N werre)
                               (15 PP (16 P upon)
                                      (17 NP (18 PRO hym)))))
      (ID CMMALORY,2.25))

SameIndex (variants: sameIndex, sameindex)

x sameIndex y finds structures where x ends with the same index as y. This is useful in searching for antecedents with the same index as a trace. For instance, this query:

node: IP*
query: (NP* iDoms \*exp\*) AND (NP* sameIndex CP*)

finds this sentence:

hym thought there was com into hys londe gryffens and serpentes,
1 IP-MAT:  2 NP-SBJ-1, 3 *exp*, 9 CP-THT-1

( (IP-MAT (NP-SBJ-1 *exp*)
          (NP-OB2 (PRO hym))
          (VBD thought)
          (CP-THT-1 (C 0)
                    (IP-SUB (NP-SBJ-2 (EX there))
                            (BED was)
                            (VBN com)
                            (PP (P into)
                                (NP (PRO$ hys) (N londe)))
                            (NP-2 (NS gryffens) (CONJ and) (NS serpentes))))
          (E_S ,))
  (ID CMMALORY,33.1031))
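The index comparison can be sketched in Python. The sketch assumes that an index is a final "-" followed by digits at the end of a label; the function names are hypothetical, and CorpusSearch may recognise other index forms.

```python
import re


def index_of(label: str):
    """Return the trailing numeric index of a label, or None.

    ASSUMPTION: an index is a final "-<digits>" suffix.
    """
    m = re.search(r"-(\d+)$", label)
    return int(m.group(1)) if m else None


def same_index(x: str, y: str) -> bool:
    """True if both labels carry an index and the indices are equal."""
    ix, iy = index_of(x), index_of(y)
    return ix is not None and ix == iy
```

For the sentence above, same_index("NP-SBJ-1", "CP-THT-1") is true, while same_index("NP-SBJ-1", "NP-2") is false. A label such as NP-OB2 has no index under this assumption, because its final digit is not preceded by "-".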