Contents of this chapter:

Format of a part-of-speech tagged corpus
Search functions

Exists
iDominates
iPrecedes
Neighborhood
Precedes

Format of a part-of-speech tagged corpus

CorpusSearch can search files that are tagged for part of speech but not further parsed for syntactic structure.

Here is a template for the format of a POS-tagged corpus file:

#!FORMAT=POS_1

Insert header information, if any, below the format line above, which must be the first line in the file.

<text>

WORD1/TAG WORD2/TAG WORD3/TAG ..... ./.
 
WORD1/TAG WORD2/TAG WORD3/TAG ..... ?/.

.....

</text>

Every word in a POS-tagged file after the initial "<text>" tag should end with a slash and a tag, followed by a space. There can be no spaces within a word or between a word and its tag. Note that every sentence of the corpus must end with a punctuation mark. Also, there must be a blank line between sentences. If the format of the file is "POS_1", the tag for sentence-final punctuation must be a period. If the format is "POS_0", an alternative format, the tag for sentence-final punctuation is "PONFP". Sentence-internal punctuation must also be treated as a separate word with a tag, which should be different from the sentence-final punctuation tag.
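The format rules above can be illustrated with a small reader. This is only a sketch in Python, not part of CorpusSearch: the function name parse_pos_corpus and the sample text are hypothetical, and the sketch assumes that anything between the format line and the "<text>" tag is header material to be skipped.

```python
import re

def parse_pos_corpus(text):
    """Return a list of sentences, each a list of (word, tag) pairs.

    Hypothetical helper, not a CorpusSearch API.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#!FORMAT="):
        raise ValueError("the format line must be the first line in the file")
    body = "\n".join(lines[1:])
    # Keep only what lies between <text> and </text>; header lines are skipped.
    start = body.index("<text>") + len("<text>")
    end = body.index("</text>")
    sentences = []
    # Sentences are separated by blank lines.
    for block in re.split(r"\n\s*\n", body[start:end]):
        pairs = []
        for token in block.split():
            if "/" not in token:
                raise ValueError(f"token without a tag: {token!r}")
            # Split on the last slash, so "./." becomes (".", ".").
            word, tag = token.rsplit("/", 1)
            pairs.append((word, tag))
        if pairs:
            sentences.append(pairs)
    return sentences

# Illustrative sample only; not taken from a real corpus file.
SAMPLE = """#!FORMAT=POS_1
<text>

I/PRO shal/MD not/NEG ./.

wilt/MD thou/PRO ?/.

</text>
"""
```

Calling parse_pos_corpus(SAMPLE) yields two sentences, the second ending in the pair ("?", "."), which shows a question mark carrying the period tag required by POS_1.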

Search functions

The query file for searching a POS-tagged corpus looks much like that for a parsed corpus. The node boundary, however, is always $ROOT. CorpusSearch treats POS-tagged files as containing sentences parsed with a completely flat structure, with every word/tag pair as an immediate daughter of the root node. The tag for a word is treated as its mother, so that a query like "(N iDoms king)" returns sentences containing the word/tag pair "king/N". Because of the flat structure of a POS-tagged file, many CorpusSearch functions cannot be used. Below is a list of those that are ordinarily appropriate. The function "Neighborhood" works only on POS-tagged files.
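To make the flat structure concrete, the following Python sketch renders a sentence, given as (word, tag) pairs, in bracketed form with each tag as the mother of its word. The function name to_flat_tree is hypothetical, and the exact spacing of the output is an assumption modeled on the examples in this chapter.

```python
def to_flat_tree(pairs):
    """Render (word, tag) pairs as a flat bracketed sentence.

    Hypothetical helper: every word/tag pair becomes an immediate
    daughter of the root, with the tag as the mother of the word.
    """
    daughters = " ".join(f"({tag} {word})" for word, tag in pairs)
    return f"( {daughters} )"

print(to_flat_tree([("the", "D"), ("king", "N"), (".", ".")]))
```

In this rendering the pair "king/N" appears as "(N king)", which is why a query such as "(N iDoms king)" matches it.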

Exists (variants: exists)

Exists searches for a POS tag or text anywhere in the sentence. For instance, this query:

(MD0 exists)

will find this sentence:

/~*
I shal not conne wel goo thyder ./. (ID CMREYNAR,14.261)
*~/

/*
    4 MD0 conne
*/

( (PRO I) (MD shal) (NEG not) (MD0 conne) (ADV wel) (VB goo) (ADV thyder) )
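As a rough model of what Exists checks, the Python sketch below scans a sentence of (word, tag) pairs for an exact match on either the word or the tag. The function name exists is hypothetical, exact string matching is an assumption, and positions are counted from 1 to mirror the sample output above.

```python
def exists(sentence, arg):
    """Return 1-based positions of tokens whose word or tag equals arg.

    Hypothetical sketch; exact, case-sensitive matching is assumed.
    """
    return [i for i, (word, tag) in enumerate(sentence, start=1)
            if arg in (word, tag)]

sentence = [("I", "PRO"), ("shal", "MD"), ("not", "NEG"), ("conne", "MD0"),
            ("wel", "ADV"), ("goo", "VB"), ("thyder", "ADV"), (".", ".")]
print(exists(sentence, "MD0"))
```

An empty result would mean that the sentence does not satisfy the query.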

iDominates (variants: idominates, iDoms, idoms)

iDominates means "immediately dominates". That is, x immediately dominates y if y is a child of x. So this query:

((PRO iDominates he) AND (FP iDominates ane))

finds this sentence:

/~*
Sythen he ledes +tam by +tar ane,
(CMROLLEP,118.978)
*~/

/*
    2 PRO he, 7 FP ane
*/
( (ADV Sythen) (PRO he) (VBP ledes) (8 PRO +tam) (10 P by) (12 PRO$ +tar) (13 FP ane) (. ,) )


Notice that "iDominates" describes the relationship between a POS tag and its associated text (e.g., "FP" and "ane").
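In a flat POS-tagged sentence, this relationship reduces to a tag and a word belonging to the same pair. The Python sketch below models that reading; the function name idominates is hypothetical and exact string matching is an assumption.

```python
def idominates(sentence, tag, word):
    """Return 1-based positions where `tag` is the mother of `word`.

    Hypothetical sketch over (word, tag) pairs with exact matching.
    """
    return [i for i, pair in enumerate(sentence, start=1)
            if pair == (word, tag)]

sentence = [("Sythen", "ADV"), ("he", "PRO"), ("ledes", "VBP"),
            ("+tam", "PRO"), ("by", "P"), ("+tar", "PRO$"),
            ("ane", "FP"), (",", ".")]

# Both conjuncts of the AND query must find at least one match.
found = bool(idominates(sentence, "PRO", "he")) and bool(idominates(sentence, "FP", "ane"))
print(found)
```

The two position lists here are [2] and [7], in line with the "2 PRO he, 7 FP ane" line of the sample output.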

iPrecedes (variants: iprecedes, iPres, ipres)

This function is true if and only if its first argument immediately precedes its second argument in the text/tag string.

The following query:

query: (as iPrecedes sone) AND (sone iPrecedes P)

finds this sentence:

/~*
and as sone as he myght he toke his horse .
(CMMALORY,206.3401)
*~/
/*
2  as, 3 sone, 4 P as
*/

( (CONJ and) (ADVR as) (ADV sone) (P as) (PRO he) (MD myght) (PRO he) (VBD toke) (PRO$ his) (N horse) (. .) )
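The adjacency test can be sketched in Python as follows. The function name iprecedes is hypothetical, and the sketch assumes that an argument matches a token when it equals either the word or the tag.

```python
def iprecedes(sentence, first, second):
    """Return 1-based position pairs (i, i + 1) where a token matching
    `first` is directly followed by a token matching `second`.

    Hypothetical sketch; an argument matches a (word, tag) pair if it
    equals the word or the tag.
    """
    hits = []
    for i in range(len(sentence) - 1):
        if first in sentence[i] and second in sentence[i + 1]:
            hits.append((i + 1, i + 2))
    return hits

sentence = [("and", "CONJ"), ("as", "ADVR"), ("sone", "ADV"), ("as", "P"),
            ("he", "PRO"), ("myght", "MD"), ("he", "PRO"), ("toke", "VBD"),
            ("his", "PRO$"), ("horse", "N"), (".", ".")]
print(iprecedes(sentence, "as", "sone"), iprecedes(sentence, "sone", "P"))
```

Note that this simplified version checks the two conjuncts independently; it does not require the "sone" of the first conjunct to be the same token as the "sone" of the second.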

Neighborhood (variant: neighborhood)

Neighborhood takes three arguments, two words or tags and a number. It searches for sentences in which the two words/tags occur within a certain number of words of one another. For instance, this query:

query: (whoreson Neighborhood 2 wilt)

will return all tokens in the corpus in which the word "whoreson" is within two words of the word "wilt," for instance, the following sentence:

/~*
why thou whoreson when wilt thou be maried?
(DELONEY,79.296)
*~/
/*
3 whoreson,  5 wilt
*/

( (WADV why) (PRO thou) (N whoreson) (WADV when) (MD wilt) (PRO thou) (BE be) (VAN maried) (. ?)
  (ID DELONEY,79.296))
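A minimal Python sketch of the distance test follows. The function name neighborhood is hypothetical. The sketch assumes that "within n words" means the two positions differ by at most n, which agrees with the example above (positions 3 and 5), and that the order of the two arguments in the sentence does not matter; neither assumption is confirmed here.

```python
def neighborhood(sentence, first, distance, second):
    """Return 1-based position pairs (i, j) where a token matching
    `first` and a token matching `second` are at most `distance`
    positions apart, in either order.

    Hypothetical sketch; an argument matches a (word, tag) pair if it
    equals the word or the tag.
    """
    hits = []
    for i, tok_i in enumerate(sentence, start=1):
        if first not in tok_i:
            continue
        for j, tok_j in enumerate(sentence, start=1):
            if i != j and second in tok_j and abs(i - j) <= distance:
                hits.append((i, j))
    return hits

sentence = [("why", "WADV"), ("thou", "PRO"), ("whoreson", "N"),
            ("when", "WADV"), ("wilt", "MD"), ("thou", "PRO"),
            ("be", "BE"), ("maried", "VAN"), ("?", ".")]
print(neighborhood(sentence, "whoreson", 2, "wilt"))
```

With a distance of 1 the same call finds nothing, because "when" stands between the two words.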

Precedes (variants: precedes, Pres, pres)

"x precedes y" means "x comes before y in the sentence but perhaps not immediately". So this query:

(VB precedes N)

finds this case:

/~*
thenne have ye cause to make myghty werre upon hym.
(CMMALORY,2.25)
*~/

/*
    6 VB make, 8 N werre
*/

( (ADV thenne) (HV have) (PRO ye) (N cause) (TO to) (VB make) (ADJ myghty) (N werre) (P upon)
(PRO hym) (. .)
      (ID CMMALORY,2.25))
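The difference from iPrecedes is that other material may intervene. A Python sketch of this looser ordering test follows; the function name precedes is hypothetical, and an argument is assumed to match a token when it equals either the word or the tag.

```python
def precedes(sentence, first, second):
    """Return 1-based position pairs (i, j) with i < j where token i
    matches `first` and token j matches `second`.

    Hypothetical sketch; an argument matches a (word, tag) pair if it
    equals the word or the tag.
    """
    return [(i, j)
            for i, tok_i in enumerate(sentence, start=1) if first in tok_i
            for j, tok_j in enumerate(sentence, start=1)
            if j > i and second in tok_j]

sentence = [("thenne", "ADV"), ("have", "HV"), ("ye", "PRO"), ("cause", "N"),
            ("to", "TO"), ("make", "VB"), ("myghty", "ADJ"), ("werre", "N"),
            ("upon", "P"), ("hym", "PRO"), (".", ".")]
print(precedes(sentence, "VB", "N"))
```

The noun "cause" at position 4 is not paired with "make" because it comes before the verb, so only positions 6 and 8 are returned.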