The CorpusSearch Command File

Contents of this chapter:

placement of commands
boolean shorthand
label strings
nodes to ignore in queries

node:
add_to_ignore:
ignore_nodes:
ignore_words:

begin_remark:, end_remark
nodes_only:
print_complement:
print_indices:
remove_nodes:
ur_text_only:

query: <query specification>
coding_query: <coding specification>
local_frames: <frame specification>
make_lexicon: true
print_only: <pos_label string>
reformat_corpus: true
copy_corpus: true

Comments

Introduction

Every command file must contain a search specification command, and ordinary query files must also contain a value for the search control command node:. The extension of a command file is determined by the search specification command that it contains; see below.

placement of commands

The preamble to a command file consists of the search control commands and the output format commands. These may appear in any order with respect to one another, but they must all appear before the query specification. Comments may appear anywhere. Many commands have default values, which are used if no value is found in the command file. In a query file, the query: command itself is obligatory, as is the node: command.
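The placement rules above can be sketched as a small check. This is only an illustration in Python, not part of CorpusSearch: the function name check_command_file and the problem messages are hypothetical, the command names are taken from this chapter's table of contents, and the sketch assumes that comments and remark text have already been removed from the input.

```python
import re

# Command names as listed in this chapter (an assumption of this sketch:
# the real program may accept further commands).
PREAMBLE = {"node", "add_to_ignore", "ignore_nodes", "ignore_words",
            "begin_remark", "nodes_only", "print_complement",
            "print_indices", "remove_nodes", "ur_text_only"}
SEARCH_SPEC = {"query", "coding_query", "local_frames", "make_lexicon",
               "print_only", "reformat_corpus", "copy_corpus"}
COMMAND = re.compile(r"^\s*([a-z_]+):")


def check_command_file(text):
    """Return a list of placement problems found in a command file."""
    seen = []
    for line in text.splitlines():
        m = COMMAND.match(line)
        if m and m.group(1) in PREAMBLE | SEARCH_SPEC:
            seen.append(m.group(1))

    problems = []
    spec_positions = [i for i, name in enumerate(seen) if name in SEARCH_SPEC]
    if not spec_positions:
        problems.append("no search specification command")
    else:
        # Preamble commands must all come before the search specification.
        for name in seen[spec_positions[0] + 1:]:
            if name in PREAMBLE:
                problems.append(name + ": appears after the search specification")
    if "query" in seen and "node" not in seen:
        problems.append("query: is present but node: is missing")
    return problems
```

For a file containing node:, nodes_only: and query: in that order the list of problems is empty; moving nodes_only: below query: is reported.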

boolean shorthand

For commands that take a boolean argument, CorpusSearch will accept any of these strings: "true", "TRUE", "T", "t", or "false", "FALSE", "F", "f".
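As an illustration of the accepted strings, here is a minimal Python sketch. The function name parse_boolean is hypothetical, and raising an error for any other string is an assumption of the sketch; this chapter does not say how CorpusSearch itself treats other spellings.

```python
# The eight spellings listed above.
TRUE_STRINGS = {"true", "TRUE", "T", "t"}
FALSE_STRINGS = {"false", "FALSE", "F", "f"}


def parse_boolean(value):
    """Map a boolean shorthand string to True or False."""
    value = value.strip()
    if value in TRUE_STRINGS:
        return True
    if value in FALSE_STRINGS:
        return False
    # Assumption of this sketch: anything else is rejected.
    raise ValueError("not a boolean shorthand: " + repr(value))
```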

node label strings

Many commands, including query language clauses, can accept strings of alternative label values as well as single node labels. These alternatives are separated by the vertical bar character "|" without any spaces.
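The following Python sketch shows one way to read such a label string. It is an illustration only, not the program's own matching code. It assumes that "*" stands for any sequence of characters (as in IP*), that a backslash makes the next character literal (as in \** for traces), and that the alternatives are split on every "|". The names label_regex and matches are hypothetical.

```python
import re


def label_regex(pattern):
    """Translate one label pattern into a compiled regular expression."""
    out = []
    for tok in re.findall(r"\\.|.", pattern, flags=re.S):
        if len(tok) == 2:          # backslash escape: next character is literal
            out.append(re.escape(tok[1]))
        elif tok == "*":           # wildcard: any sequence of characters
            out.append(".*")
        else:
            out.append(re.escape(tok))
    return re.compile("".join(out), flags=re.S)


def matches(label, label_string):
    """True if label matches any alternative in a '|'-separated label string."""
    return any(label_regex(alt).fullmatch(label)
               for alt in label_string.split("|"))
```

Under these assumptions, matches("NP-SBJ", "NP*|PP*") holds, while matches("IP-MAT", "NP*|PP*") does not.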

nodes to ignore

There are some nodes in the corpus that we usually don't want to consider as part of the structure of the sentence, for instance, punctuation, line breaks, page numbers, and comments. These and other nodes should usually also be ignored when a query function counts the number of words in a constituent. In deciding whether a function is matched by a given structure in the corpus, CorpusSearch will ignore nodes whose labels are contained in the "ignore-list". If the function is a word counting function, CorpusSearch ignores the nodes on the "word-ignore-list". Below are the default versions of the two ignore-lists. Note that traces and empty complementizers (\** and 0) are on the default word-ignore-list but not on the default ignore-list.

ignore_nodes: COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*
ignore_words:  COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*|0|\**

For instance, if you run this query:

(NP* iPrecedes PP*)

This sentence will be returned:

/*
 1 IP-MAT-SPE: 5 NP-1, 9 PP
*/
/~*
There ar two bretheren beyond the see,
(CMMALORY,15.439)
*~/

(0
(1 IP-MAT-SPE
              (2 NP-SBJ-1 (3 EX There))
              (4 BEP ar)
              (5 NP-1 (6 NUM two) (7 NS bretheren))
              (8 CODE <P_15>)
              (9 PP (10 P beyond)
                    (11 NP (12 D the) (13 N see)))
              (14 E_S ,))
(15 ID CMMALORY,15.439))

Notice that NP-1 immediately precedes PP in spite of the intervening node (8 CODE <P_15>). This is because CODE is on the default ignore-list.

We will sometimes refer to nodes that are not to be ignored as "legitimate" nodes.
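The effect of the ignore-list in the example above can be sketched in Python. This is a deliberate simplification: it looks only at a flat list of sister nodes, whereas the real iPrecedes function works on full trees. The wildcard and backslash conventions, and the names matches and i_precedes_sisters, are assumptions of the sketch.

```python
import re

DEFAULT_IGNORE = r"COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*"


def matches(label, label_string):
    """True if label matches any '|'-separated alternative ('*' is a wildcard)."""
    for alt in label_string.split("|"):
        regex = "".join(
            re.escape(tok[1]) if len(tok) == 2
            else ".*" if tok == "*"
            else re.escape(tok)
            for tok in re.findall(r"\\.|.", alt, flags=re.S))
        if re.fullmatch(regex, label, flags=re.S):
            return True
    return False


def i_precedes_sisters(labels, x, y, ignore=DEFAULT_IGNORE):
    """Index pairs (i, j) where sister i matches x and the next legitimate
    sister j matches y; ignored sisters are skipped over."""
    legit = [(i, lab) for i, lab in enumerate(labels)
             if not matches(lab, ignore)]
    return [(i, j) for (i, a), (j, b) in zip(legit, legit[1:])
            if matches(a, x) and matches(b, y)]


# Daughters of IP-MAT-SPE in the example sentence.
sisters = ["NP-SBJ-1", "BEP", "NP-1", "CODE", "PP", "E_S"]
found = i_precedes_sisters(sisters, "NP*", "PP*")
```

With the default ignore-list, NP-1 (position 2) and PP (position 4) form a match because CODE is skipped; with an empty ignore-list (ignore="") the intervening CODE node blocks the match.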

Search control commands

node: <node_boundary string>

Required element in every command file of the query type. A query file without a node specification produces an ERROR. The node specification is a node label or a disjunction of labels.

The value of node: gives CorpusSearch a node boundary within which to search. The list of labels gives boundaries that any structure you search for will fall within; for example, IP* would yield all the basic clauses in the corpus, and $ROOT is the topmost level of every syntactic tree, whatever its label. In the case of searches on the output of a previous search in which nodes_only is set to "true", $ROOT refers to the root of the tree, which will have the label of the node boundary.

Whenever you want to consider the entire tree as the domain within which to search, use

node: $ROOT

The choice of node boundary determines the following:

the counting of hits, defined as "number of distinct node boundaries containing the structure";
what nodes are removed if remove_nodes is true;
what nodes are printed if nodes_only is true.

To illustrate this, we ran the same query with different node boundaries on a simple file containing one sentence. First we ran the query with the node boundary, IP*|$ROOT:

node:   IP*|$ROOT
query:  (NP* iDominates PRO*)

Here's the output; notice that 1 hit is counted because there was one IP* node (1 IP-MAT) containing both NP* nodes:


/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/

/*
    1 IP-MAT: 3 NP-SBJ, 4 PRO he
    1 IP-MAT: 6 NP-OB2, 7 PRO them
*/

(0
   (1 IP-MAT (2 CONJ and)
             (3 NP-SBJ (4 PRO he))
             (5 VBD made)
             (6 NP-OB2 (7 PRO them))
             (8 NP-OB1 (9 ADJ grete) (10 N chere))
             (11 ADVP (12 ADV out)
                      (13 PP (14 P of)
                             (15 NP (16 N mesure)))))
      (ID CMMALORY,2.13))

/*
    FOOTER
    source file:  CMMALORY
    hits found:  1
    sentences containing the hits:  1
    total sentences searched:  1
*/

Next we ran the query with node boundary NP*:

node:   NP*
nodes_only: t
query:  (NP* iDominates PRO*)

Here's the output; this time 2 hits are counted, because there are two distinct NP* nodes, (3 NP-SBJ) and (6 NP-OB2). Because nodes_only is set to true in this query, only the NP* nodes are printed:

/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/

/*
    3 NP-SBJ: 4 PRO he
    6 NP-OB2: 7 PRO them
*/

(
      (3 NP-SBJ (4 PRO he))
      (ID CMMALORY,2.13))

(
      (6 NP-OB2 (7 PRO them))
      (ID CMMALORY,2.13))


/*
    FOOTER
    source file:  CMMALORY
    hits found:  2
    sentences containing the hits:  1
    total sentences searched:  1
*/
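The difference between the two hit counts follows directly from the definition of a hit as a distinct node boundary. A minimal Python sketch, with a hypothetical function count_hits and a hypothetical tuple layout (sentence ID, index of the node boundary, indices of the matched nodes), reproduces both footers:

```python
def count_hits(matches):
    """Count distinct node boundaries among the reported matches.

    Each match is a tuple (sentence_id, boundary_index, matched_indices...).
    """
    return len({(m[0], m[1]) for m in matches})


# node: IP*|$ROOT  (both matches fall inside the single boundary 1 IP-MAT)
ip_matches = [("CMMALORY,2.13", 1, 3, 4),
              ("CMMALORY,2.13", 1, 6, 7)]

# node: NP*  (each match has its own boundary, 3 NP-SBJ and 6 NP-OB2)
np_matches = [("CMMALORY,2.13", 3, 4),
              ("CMMALORY,2.13", 6, 7)]

ip_hits = count_hits(ip_matches)
np_hits = count_hits(np_matches)
```

Here ip_hits is 1 and np_hits is 2, matching the "hits found" lines in the two footers above.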

add_to_ignore: (label_list string)

default "" (empty string)

adds given labels to the ignore_list. For instance,

add_to_ignore:  \**

will tell CorpusSearch to ignore traces for this search. When nodes are ignored, they are not considered as possible arguments for search functions. For example, ignoring traces means that IPs with subject traces due to movement of the subject to a position outside the IP will behave in searches as though they had no subject. Thus, whether a given node type should appear on the ignore list depends on the purpose of the search.

ignore_nodes: (ignore_list string)

default COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*

tells CorpusSearch what nodes to ignore.

To replace the default ignore-list with your own ignore-list, include this command in your command file:

ignore_nodes:  <your_ignore_list>

To tell CorpusSearch not to ignore any nodes, include this command in your command file:

ignore_nodes:   null

If you try to search for an item that is on the ignore_list, you'll get a warning message. For instance, this query:

(NP-SBJ* iPrecedes CODE)

generates this message:

WARNING!  CODE in y_argument to iPrecedes is on the ignore_list.

    To make the ignore_list empty, add this line to your command file:

        ignore_nodes: null

    To write your own ignore_list, add this line to your command file:

        ignore_nodes:

The program goes ahead and runs as usual, but if you don't get the results you were looking for, you should probably change the ignore_list.

ignore_words: (word_ignore_list string)

default COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*|0|\**

tells CorpusSearch what nodes to ignore in counting words.

To replace the default word-ignore-list with your own word-ignore-list, include this command in your command file:

ignore_words:  <your_word_ignore_list>

To tell CorpusSearch not to ignore any nodes in counting words, include this command in your command file:

ignore_words:   null

To add nodes to the word-ignore-list, use

add_to_ignore_words:

The following search functions are governed by the word-ignore-list: DomsWords, DomsWords<, DomsWords>. All other functions use the main ignore-list.

Output format commands

These commands do not in any way influence the current search. They only give instructions about how the results of the current search should be printed to the output file. However, because these commands can cause the output of the current search to take different forms, they may influence future searches which will take as their input the output of the current search.

begin_remark: (remark string) end_remark

default "" (empty string)

tells CorpusSearch to print user's remark in the output Preface. This is a way for the user to record a note, for instance to remember the goal of the search.

For instance, the command file "pro-obj.q" contains this command:

begin_remark: 
	pronoun objects
end_remark

which is printed in the output preface like this:

/*
    PREFACE:  regular output file.
    CorpusSearch copyright Beth Randall 1999.
    Date:  Wed Nov 03 19:12:03 EST 1999

    command file:       pro-obj.q
    input file:         ipmat-2vb.out
    output file:        pro-obj.out

    remark:
        pronoun objects

    node:   IP*
    query:  (NP-OB* iDominates PRO)
*/

nodes_only: (boolean true or false)

default false

If true, CorpusSearch prints out only the nodes that contain the structure described in "query".

If false, CorpusSearch prints out the entire sentence that contains the structure described in "query".

For instance, suppose you have this query:

node:  ADVP* 

nodes_only: t
query:  (ADVP* iDominates ADVP*)

Here's what a piece of the output looks like with nodes_only true.

/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.574)
*~/

/*
 2 ADVP: 3 ADVP
*/

(     (ADVP
            (ADVP (ADV certayn))
            (CONJP (CONJ and)
                   (PP (P wit-owte)
                       (NP (N doute))))
            (, ,))(ID CMAELR3,45.574))

And here's the same piece of output with nodes_only false:

/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.589)
*~/

/*
 2 ADVP: 3 ADVP
*/

(
(IP-MAT
        (ADVP
              (ADVP (ADV certayn))
              (CONJP (CONJ and)
                     (PP (P wit-owte)
                         (NP (N doute)))))
        (, ,)
        (NP-OB1 (NPR Ihon))
        (BEP is)
        (NP-SBJ (PRO$ is) (N name))
        (E_S .))
(ID CMAELR3,45.589))

print_complement: (boolean true or false)

default false

In the normal case CorpusSearch prints as output only nodes or tokens that match the query. Setting print_complement to true causes CorpusSearch to print not only the matching tokens (in the regular output file, extension .out), but also all the tokens that don't match, in a separate file called the complement file (extension .cmp). Thus, print_complement is a form of NOT applied to queries. Generally print_complement should be used on the output of a previous search that has narrowed down the possibilities to some set that can be meaningfully divided; using it on corpus files will generally result in a completely meaningless set of tokens.

Examples: the following query could be used on an output file containing all IPs with objects to divide the IPs into two sets: those with two objects (in the .out file) and those with one (in the .cmp file). The first example is from the regular output file and matches the query, that is, it has two objects. The second example is from the complement file and does not match the query; it has only one object.

print_complement: t
node: IP*
query: ((IP* iDoms [1]NP-OB*)
AND (IP* iDoms [2]NP-OB*))

from the regular output file:

/~*
And there is no knyght now lyvynge that ought to yelde God so grete thanke os
ye,
(CMMALORY,655.4474)
*~/
/*
1 IP-SUB-SPE: 6 NP-OB2, 8 NP-OB1
1 IP-SUB-SPE: 8 NP-OB1, 6 NP-OB2
*/

(0 (1 IP-SUB-SPE (2 NP-SBJ *T*-2)
                 (3 MD ought)
                 (4 TO to)
                 (5 VB yelde)
                 (6 NP-OB2 (7 NPR God))
                 (8 NP-OB1 (9 ADJP (10 ADVR so) (11 ADJ grete))
                           (12 N thanke)
                           (13 PP (14 P os)
                                  (15 NP (16 PRO ye)))))
      (ID CMMALORY,655.4474))

from the complement file:

/~*
The kynge lyked and loved this lady wel,
(CMMALORY,2.12)
*~/

(0  (1 IP-MAT (2 NP-SBJ (3 D The) (4 N kynge))
              (5 VBD (6 VBD lyked) (7 CONJ and) (8 VBD loved))
              (9 NP-OB1 (10 D this) (11 N lady))
              (12 ADVP (13 ADV wel))
              (14 E_S ,))
    (15 ID CMMALORY,2.12))
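The division into a regular output file and a complement file amounts to partitioning the input tokens by whether they match the query. The Python sketch below illustrates this with the two example sentences. The names split_by_query and has_two_objects are hypothetical, and each token is reduced to its ID and the labels of the daughters of its IP node, which is far simpler than the program's own tree representation.

```python
def split_by_query(tokens, predicate):
    """Partition tokens into (matching, complement) lists, keeping their order."""
    matching, complement = [], []
    for token in tokens:
        (matching if predicate(token) else complement).append(token)
    return matching, complement


def has_two_objects(token):
    """Stand-in for the two-object query: at least two NP-OB* daughters."""
    return sum(lab.startswith("NP-OB") for lab in token["daughters"]) >= 2


tokens = [
    {"id": "CMMALORY,655.4474",
     "daughters": ["NP-SBJ", "MD", "TO", "VB", "NP-OB2", "NP-OB1"]},
    {"id": "CMMALORY,2.12",
     "daughters": ["NP-SBJ", "VBD", "NP-OB1", "ADVP", "E_S"]},
]

out_tokens, cmp_tokens = split_by_query(tokens, has_two_objects)
```

The first token lands in out_tokens (the .out file) and the second in cmp_tokens (the .cmp file).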

print_indices: (boolean true or false)

default false

tells CorpusSearch whether or not to print indices in the output.

Indices start at 0 and are used to label every node in the tree. CorpusSearch uses indices to distinguish, for instance, between several different NP nodes in the same output structure.

Here's a piece of output structure with indices:

             (10 NP-OB1 (11 NPR Morgan)
                        (12 NPR le)
                        (13 NPR Fay)

Here's how it looks without indices:

                  (NP-PRN (NPR Morgan)
                          (NPR le)
                          (NPR Fey)))

remove_nodes: (boolean true or false)

default false

removes subtrees whose root is of the same syntactic category as the node boundary and which are embedded within an instance of that node boundary. remove_nodes thus removes recursive structure. If the removed subtree matches the query, it will appear as a separate output token later in the output file. If the removed subtree does not contain the searched-for structure, it is discarded and replaced with a label indicating what has been removed.

The purpose of this feature is to make it easier to search output. For instance, if you were looking for IP nodes containing a certain structure, remove_nodes will ensure that your output contains only IP nodes with that structure, and no other IP nodes.

CorpusSearch uses the following algorithm to find the syntactic category of a node: Start with the node boundary label. If that label contains any hyphens, the node's syntactic category is the substring of the label up to the leftmost hyphen, with a '*' tacked on. If the node boundary label does not contain a hyphen, the syntactic category is simply the label with a '*' tacked on, unless the label already has one.

Thus, if the node boundary label is IP-PRN*, the node category is IP*.
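The algorithm just described can be written out as a short Python sketch. The function name syntactic_category is hypothetical, and the sketch handles a single label only, not a disjunction of labels.

```python
def syntactic_category(boundary_label):
    """Derive the syntactic category from a node boundary label."""
    if "-" in boundary_label:
        # Keep everything up to the leftmost hyphen and tack on '*'.
        return boundary_label.split("-", 1)[0] + "*"
    # No hyphen: tack on '*' unless the label already ends in one.
    if boundary_label.endswith("*"):
        return boundary_label
    return boundary_label + "*"
```

For example, syntactic_category("IP-PRN*") gives "IP*", and both "IP*" and "IP" also give "IP*".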

Consider the following command file, in which remove_nodes is set to true, and its effect on the output below:

node: IP*
remove_nodes: true
query: (NP-OB* iDoms PRO)

Output:

/~*
'And I shall defende the,' seyde the knyght.
(CMMALORY,39.1264)
*~/

/*
 1 IP-MAT-SPE: 8 NP-OB1, 9 PRO the
*/

 (0 (1 IP-MAT-SPE (2 ' ')
                 (3 CONJ And)
                 (4 NP-SBJ (5 PRO I))
                 (6 MD shall)
                 (7 VB defende)
                 (8 NP-OB1 (9 PRO the))
                 (10 , ,)
                 (11 ' ')
                 (12 IP-MAT-PRN RMV:seyde_the_knyght...)
                 (13 E_S .))
     (ID CMMALORY,39.1264))

The structure of sub-sentence "seyde the knyght" has been removed from the parsed sentence and replaced with the symbol RMV:<rmv_string>, where rmv_string stands for a string of (up to) the first three words (leaf nodes) of the removed material and serves as a reminder of what has been removed. A further search on this output will be a search only on IP* nodes that contain a pronoun object, and on no other nodes.
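As a rough illustration of how such a replacement label could be built, consider the Python sketch below. The function name rmv_label is hypothetical. Joining the words with underscores and appending "..." are inferred from the single example above, so treat them as assumptions rather than as a description of the program's code.

```python
def rmv_label(words, max_words=3):
    """Build an RMV: label from the leaf words of a removed subtree."""
    # Assumption: underscore-joined words followed by "...", as in the example.
    return "RMV:" + "_".join(words[:max_words]) + "..."


label = rmv_label(["seyde", "the", "knyght"])
```

For the removed sub-sentence above this yields RMV:seyde_the_knyght..., and a longer subtree would contribute only its first three words.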

ur_text_only: (boolean true or false)

default false

prints only the text of the tokens that match the query, suppressing printing of the labeled bracketing and associated information.

Search specification

Every command file must contain a search specification, which instructs CorpusSearch as to what action to carry out. The search specification string must follow the preamble of search control commands and output format commands. The most common search specification, by far, is query:, used for searching a corpus.

query: <query specification>

Queries must follow the syntax of the CorpusSearch query language. Every command file containing queries must bear the extension .q.

coding_query: <coding specification>

Coding query syntax is described in the chapter on coding. Every command file containing coding queries must bear the extension .c.

local_frames: <frame specification>

See the chapter on local frames for a description of this option.

make_lexicon: true

See the chapter on building a lexicon for a description of this option.

print_only: <pos_label string>

This option was designed for use on a file that contains coding strings produced by coding queries. To create an output file with only the coding strings use the following search specification:

print_only: CODING

The resultant output file will bear the extension .ooo. Please note that this feature does not work on output files of prior queries in which the nodes of the parse were indexed.

In theory, you could substitute a part of speech label for CODING, although if you wanted a list of, for instance, all the nouns in your file, you would probably be better off using the make_lexicon feature.

reformat_corpus: true

This takes as input a corpus file, and outputs the same file in the same format as CS would output from a search. This is useful, for instance, if you need to follow up with unix "diff" to compare search output with an original corpus file. The output file of "reformat_corpus" bears the extension .fmt.

copy_corpus: true

See the chapter on automated corpus revision for a description of this option.

Comments

Comments may be added anywhere to the command file or to files of parsed sentences that serve as input to CS. The program uses the following delimiters for such comments. Comment lines begin with "//" and block comments appear between "/*" and "*/". Unlike remarks, comments are not printed to the search output file.
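To make the two default delimiters concrete, the Python sketch below removes both kinds of comment from the text of a command file. It is an illustration, not the program's own reader: the function name strip_comments is hypothetical, it assumes that a line comment may be preceded by whitespace, and it would misread any "/*" sequence that occurred inside a label string.

```python
import re


def strip_comments(text):
    """Remove /* ... */ block comments and lines that begin with //."""
    without_blocks = re.sub(r"/\*.*?\*/", "", text, flags=re.S)
    kept = [line for line in without_blocks.splitlines()
            if not line.lstrip().startswith("//")]
    return "\n".join(kept)


command_file = (
    "// find pronoun objects\n"
    "node: IP*\n"
    "/* a note that\n"
    "   spans two lines */\n"
    "query: (NP-OB* iDoms PRO)\n"
)
cleaned = strip_comments(command_file)
```

After the call, cleaned holds only the node: and query: lines, separated by the blank line left behind by the block comment.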

For input files, but not for command files, you can also define custom comment delimiters. Add the following commands to the command file preamble or to the preferences file, followed by the desired delimiter strings.

For a line comment delimiter, add corpus_line_comment:

For block comment delimiters, add corpus_comment_begin: and corpus_comment_end: