CorpusSearch Logical Operators

Contents of this chapter:

search-function operators:

AND

same-instance
same-instance with prefix indices

OR
NOT

argument operators:

! (not)

! one argument at a time
ordering ! and prefix indices

| (or argument)

negating a list

AND

AND, in its simplest form, returns trees in which both conjuncts hold within a single boundary node. For instance, this query:


node: IP*

query: (NP-TMP* iDominates ADV*)
       AND (TO iPrecedes VB)

yields this output:

/*
4 IP-INF-SBJ:  5 NP-TMP, 6 ADV+NS, 8 TO, 10 VB
*/

( (IP-MAT (CONJ but)
          (IP-INF-SBJ (NP-TMP (ADV+NS oftymes))
                      (TO to)
                      (VB rede)
                      (NP-OB1 (PRO it)))
          (MD shal)
          (VB cause)
          (NP-OB1 (PRO it))
          (IP-INF (ADVP (ADV wel))
                  (TO to)
                  (BE be)
                  (VAN vnderstande))
          (E_S /))
  (ID CMREYNAR,6.10))

AND has been implemented with a default feature that we call "same-instance." If the same label occurs twice in search functions conjoined by AND, CorpusSearch assumes that the two occurrences should refer to the same node in the tree. Thus, the following query

(IP* iDomsNumber 1 VBP|VBD) AND (IP* iDomsNumber 2 ADVP|PP*)

returns only trees in which a single IP node has both a first child matching VBP|VBD and a second child matching ADVP|PP*. Trees in which one IP has VBP as its first child and some other IP has ADVP as its second child are not returned.
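The difference between the two readings can be sketched in Python. This is a toy model, not CorpusSearch code: each IP node is reduced to a hypothetical list of its children's labels, Python's fnmatchcase stands in for the * wildcard, and the function names are invented for illustration.

```python
from fnmatch import fnmatchcase


def nth_child_matches(children, n, alternatives):
    """True if the n-th child (1-based) matches any '|'-separated alternative."""
    if n < 1 or n > len(children):
        return False
    return any(fnmatchcase(children[n - 1], alt) for alt in alternatives.split("|"))


def same_instance_hit(ip_nodes):
    """Same-instance reading: both conjuncts must hold of one and the same IP."""
    return any(
        nth_child_matches(kids, 1, "VBP|VBD") and nth_child_matches(kids, 2, "ADVP|PP*")
        for kids in ip_nodes
    )


def any_instance_hit(ip_nodes):
    """Reading without same-instance: each conjunct may hold of a different IP."""
    first = any(nth_child_matches(kids, 1, "VBP|VBD") for kids in ip_nodes)
    second = any(nth_child_matches(kids, 2, "ADVP|PP*") for kids in ip_nodes)
    return first and second


# Hypothetical data: each inner list holds the child labels of one IP node.
one_ip = [["VBD", "PP-LOC", "NP-SBJ"]]
two_ips = [["VBP", "NP-SBJ"], ["NP-SBJ", "ADVP"]]

print(same_instance_hit(one_ip))   # True
print(same_instance_hit(two_ips))  # False
print(any_instance_hit(two_ips))   # True
```

Only the first function pair corresponds to the default behaviour described above; the second is shown for contrast.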

The same-instance assumption is triggered by matching argument label strings, so that

(ADVP precedes MD|HV*|VB*) AND (MD|HV*|VB* precedes NP-SBJ)

returns only sentences with the same instance of MD|HV*|VB*, but

(ADVP precedes MD|VB*|HV*) AND (MD|HV*|VB* precedes NP-SBJ)

returns sentences with either the same instance or different instances, because the two argument lists differ in the order of their elements and therefore do not match as strings.
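The string-matching rule can be illustrated with a few lines of Python. This is a sketch under the assumption that the comparison is a plain, literal string comparison, as the text suggests; the function names are hypothetical and are not part of CorpusSearch.

```python
def triggers_same_instance(arg_a: str, arg_b: str) -> bool:
    """Toy model: same-instance applies only when the argument strings are identical."""
    return arg_a == arg_b


def same_alternatives(arg_a: str, arg_b: str) -> bool:
    """Order-insensitive comparison, shown for contrast (not the rule described above)."""
    return set(arg_a.split("|")) == set(arg_b.split("|"))


print(triggers_same_instance("MD|HV*|VB*", "MD|HV*|VB*"))  # True
print(triggers_same_instance("MD|VB*|HV*", "MD|HV*|VB*"))  # False
print(same_alternatives("MD|VB*|HV*", "MD|HV*|VB*"))       # True
```

The practical upshot is to write a repeated argument list in exactly the same order each time it is meant to pick out the same node.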

Same-instance does not apply within a single search function. Thus the query (ADVP precedes ADVP) is not vacuous: its two arguments may refer to different ADVP nodes.

AND; same-instance with prefix indices

If you need to specify which arguments coincide (that is, refer to the same instance) and which don't, you can use prefix indices. Matching arguments with the same prefix index must coincide; matching arguments with different prefix indices must not coincide. Prefix indices must be enclosed in square brackets, "[" and "]".

For example, suppose you are looking for two sister noun-phrases that each immediately dominate a pronoun. Use prefix indices as follows:

([1]NP* hasSister [2]NP*) AND ([1]NP* iDominates [3]PRO) AND ([2]NP* iDominates [4]PRO)

to find sentences like this one:

/~*
And +tere it lykede him to suffre many repreuynges and scornes for vs
(CMMANDEV,1.4)
*~/

/*
    1 IP-MAT: 5 NP-SBJ-1, 8 NP-OB2, 6 PRO it, 9 PRO him
*/

(0
   (1 IP-MAT (2 CONJ And)
             (3 ADVP-LOC (4 ADV +tere))
             (5 NP-SBJ-1 (6 PRO it))
             (7 VBD lykede)
             (8 NP-OB2 (9 PRO him))
             (10 IP-INF-1 (11 TO to)
                          (12 VB suffre)
                          (13 NP-OB1 (14 Q many)
                                     (15 NS repreuynges)
                                     (16 CONJP (17 CONJ and)
                                               (18 NX (19 NS scornes))))
                          (20 PP (21 P for)
                                 (22 NP (23 PRO vs)))))
      (ID CMMANDEV,1.4))

Here's another example:

query: (IP-SMC iDoms [1]NP*)
   AND ([1]NP* iDoms [3]\**)
   AND (IP-SMC iDoms [2]NP*)
   AND ([2]NP* iDoms [4]\**)

This query searches for a node labelled IP-SMC which immediately dominates two different NP* nodes, each immediately dominating a trace. In this example, the two mentions of IP-SMC must refer to the same node in the tree (same-instance); [1]NP* and [2]NP* must refer to different nodes (because of the different prefix indices); similarly, [3]\** and [4]\** must not coincide. If the substrings following the indices were not identical, then the arguments would not be forced to pick out distinct nodes.
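The way prefix indices constrain a pair of arguments can be summarised in a small Python sketch. It is a toy model of the rules stated above, not CorpusSearch's implementation: the function names and the returned strings are invented, a leading ! is not handled, and the case where only one of two matching arguments carries an index is reported as "unspecified" because the text does not describe it.

```python
import re

# "[1]NP*" -> index "1", label "NP*"
_PREFIX = re.compile(r"^\[(\d+)\](.*)$")


def split_prefix(arg):
    """Split '[1]NP*' into (1, 'NP*'); an unindexed argument gives (None, arg)."""
    m = _PREFIX.match(arg)
    if m:
        return int(m.group(1)), m.group(2)
    return None, arg


def relation(arg_a, arg_b):
    """Describe how two arguments in AND-conjoined search functions are related."""
    idx_a, label_a = split_prefix(arg_a)
    idx_b, label_b = split_prefix(arg_b)
    if label_a != label_b:
        return "unconstrained"    # labels differ, so nothing is forced
    if idx_a == idx_b:
        return "same node"        # same index, or default same-instance
    if idx_a is not None and idx_b is not None:
        return "different nodes"  # matching labels, different indices
    return "unspecified"          # mixed case, not covered by the text


print(relation("IP-SMC", "IP-SMC"))     # same node
print(relation("[1]NP*", "[2]NP*"))     # different nodes
print(relation("[1]NP*", "[3]\\**"))    # unconstrained
```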

Here's a sentence found by the above query:

 
/~*
+After +t+am L+acedemonie gecuron him to ladteowe, Ircclidis w+as haten,
(OR4,1.53.30.12)
*~/

/*
    23 IP-SMC: 24 NP-NOM *-2, 25 NP-NOM-PRD *ICH*-1
    23 IP-SMC: 25 NP-NOM-PRD *ICH*-1, 24 NP-NOM *-2
*/


(0  (1 CODE )
  (2 IP-MAT
            (3 PP (4 P +After)
                  (5 NP-DAT (6 D^D +t+am)))
            (7 NP-NOM (8 NPR^N L+acedemonie))
            (9 VBDI gecuron)
            (10 NP-DAT-RFL-ADT (11 PRO|D him))
            (12 PP (13 P to)
                   (14 NP-DAT (15 N|D ladteowe)))
            (16 , ,)
            (17 IP-MAT-PRN (18 NP-NOM-2 *pro*)
                           (19 NP-NOM-1 (20 NPR^N Ircclidis))
                           (21 BEDI w+as)
                           (22 VBN haten)
                           (23 IP-SMC (24 NP-NOM *-2)
                                      (25 NP-NOM-PRD *ICH*-1)))
            (26 . ,))
  (27 ID OR4,1.53.30.12))

OR

WARNING: OR is currently under active development and is likely to yield unexpected results. Its functionality can normally be achieved with the argument disjunction operator "|" (see below). When this operator is not sufficient, it is recommended that a coding query be used.

OR is logical disjunction. "(FOO) OR (BAR)" returns all subtrees rooted in an instance of the query's selected node boundary in which either the property "FOO" or the property "BAR" or both hold. "FOO" and "BAR" may consist of single search functions or be built up out of conjunctions, disjunctions and negations of simple search functions.

NOT

WARNING: NOT is currently under active development. It does not yet work correctly in any but the simplest cases. Avoid it except for testing purposes.

NOT returns trees rooted in the node boundary that do not contain the described structure. It differs from ! in that none of the arguments needs to appear in the domain defined by the node boundary.

For instance,

NOT(NP* precedes VB*)

returns trees that do not contain the structure (NP* precedes VB*), including those that contain neither NP* nor VB*.

On the other hand,

(NP* iPrecedes !VB*)

returns trees that contain an NP* which does not iPrecede VB*.

! (not)

! is used to negate one argument to a search function.

For instance, suppose you're looking for sentences in which the nodes immediately dominated by the subject do not include a pronoun. You could use this query:

(NP-SBJ* iDominates !PRO*)

to obtain sentences like this:

/~*
a runde fot & +ticke bi-come+t an hors wel.
(CMHORSES,87.17)
*~/

/*
    1 IP-MAT: 2 NP-SBJ, 10 ADJ +ticke
*/

(0
   (1 IP-MAT
             (2 NP-SBJ (3 D a)
                       (4 ADJP (5 ADJ runde)
                               (6 CONJP *ICH*-1))
                       (7 N fot)
                       (8 CONJP-1 (9 CONJ &))
                       (10 ADJ +ticke))
             (11 VBP bi-come+t)
             (12 NP-OB1 (13 D an) (14 N hors))
             (15 ADVP (16 ADV wel))
             (17 E_S .))
      (ID CMHORSES,87.17))

! one argument at a time

CorpusSearch does not allow you to negate both arguments to a single search function. So this is *not* a legitimate command, and its appearance will abort a search:

(!NP-SBJ iPrecedes !VBD)

ordering ! and prefix indices

If you need to use both ! and prefix indices, put the ! before the indices.

For instance, suppose you're looking for sentences that contain a subject that precedes the object, and neither the subject nor the object contains a pronoun. You could use this query:

    (NP-SBJ* precedes NP-OB1*)
AND (NP-SBJ* iDominates ![1]PRO*)
AND (NP-OB1* iDominates ![2]PRO*)

to obtain sentences like this:

/~*
& +tat schal be a good hors.
(CMHORSES,85.9)
*~/

/*
    1 IP-MAT: 3 NP-SBJ, 7 NP-OB1, 4 D +tat, 10 N hors
*/

(0
   (1 IP-MAT (2 CONJ &)
             (3 NP-SBJ (4 D +tat))
             (5 MD schal)
             (6 BE be)
             (7 NP-OB1 (8 D a) (9 ADJ good) (10 N hors))
             (11 E_S .))
      (ID CMHORSES,85.9))

Notice that it is necessary to use prefix indices before the PRO* labels. Otherwise, CorpusSearch would try to find an NP-SBJ* and an NP-OB1* both dominating the *same* not-PRO* object, and would come up empty.

| (or argument)

Any number of arguments to a search function may be linked together into an argument list using |, which means "or". For instance,

(*VB*|*HV*|*BE*|*DO*|*MD* iPrecedes NP-SBJ*)

means "*VB* or *HV* or *BE* or *DO* or *MD* immediately precedes NP-SBJ*," and will find sentences like this:

/~*
+Tan was pompe & pryde cast down & leyd on syde.
(CMKEMPE,2.12)
*~/

/*
    2 IP-MAT-1: 5 BED was, 6 NP-SBJ
*/

(
      (2 IP-MAT-1
                  (3 ADVP-TMP (4 ADV +Tan))
                  (5 BED was)
                  (6 NP-SBJ (7 N pompe) (8 CONJ &) (9 N pryde))
                  (10 VAN cast)
                  (11 RP down))
      (ID CMKEMPE,2.12))

negating a list

If a list is preceded by !, the entire list is negated. So,

(!*VB*|*HV*|*BE*|*DO*|*MD* iPrecedes NP-SBJ*)

means, "none of these (*VB* or *HV* or *BE* or *DO* or *MD*) iPrecedes NP-SBJ*", and finds sentences like this:

 
/~*
& sche wold not consentyn in no wey,
(CMKEMPE,3.34)
*~/

/*
    1 IP-MAT: 2 CONJ &, 3 NP-SBJ
*/

(0
   (1 IP-MAT (2 CONJ &)
             (3 NP-SBJ (4 PRO sche))
             (5 MD wold)
             (6 NEG not)
             (7 VB consentyn)
             (8 PP (9 P in)
                   (10 NP (11 Q no) (12 N wey)))
             (13 E_S ,))
      (ID CMKEMPE,3.34))
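The scope of ! over a list can be sketched in Python. This toy model covers only the label test applied to a single candidate node; how CorpusSearch chooses which nodes to test is not modelled. Python's fnmatchcase is an assumed stand-in for the * wildcard (it does not handle escaped characters such as \*), and the function name label_matches is hypothetical.

```python
from fnmatch import fnmatchcase


def label_matches(argument: str, label: str) -> bool:
    """Test one node label against an argument such as 'A|B*' or '!A|B*'."""
    negated = argument.startswith("!")
    body = argument[1:] if negated else argument
    hit = any(fnmatchcase(label, pattern) for pattern in body.split("|"))
    # A leading ! negates the whole list, not just its first element.
    return not hit if negated else hit


verbs = "*VB*|*HV*|*BE*|*DO*|*MD*"

print(label_matches(verbs, "BED"))         # True
print(label_matches("!" + verbs, "CONJ"))  # True
print(label_matches("!" + verbs, "MD"))    # False
```

Under this reading, MD is rejected by the negated list even though it is the last alternative, which is what "the entire list is negated" implies.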