ignore_nodes: COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*
ignore_words: COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*|0|\**
For instance, if you run this query:
(NP* iPrecedes PP*)
This sentence will be returned:
/* 1 IP-MAT-SPE: 5 NP-1, 9 PP */
/~* There ar two bretheren beyond the see, (CMMALORY,15.439) *~/
(0 (1 IP-MAT-SPE (2 NP-SBJ-1 (3 EX There)) (4 BEP ar) (5 NP-1 (6 NUM two) (7 NS bretheren)) (8 CODE <P_15>) (9 PP (10 P beyond) (11 NP (12 D the) (13 N see))) (14 E_S ,)) (15 ID CMMALORY,15.439))
Notice that NP-1 immediately precedes PP in spite of the intervening node (8 CODE <P_15>). This is because CODE is on the default ignore-list.
We will sometimes refer to nodes that are not to be ignored as "legitimate" nodes.
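The effect of the ignore-list on immediate precedence can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not CorpusSearch's implementation: the helper names (matches, is_ignored, i_precedes) are hypothetical, wildcard handling is reduced to a trailing '*', the ignore-list is abbreviated, and the tree is flattened to the sibling labels of the example sentence.

```python
# Abbreviated form of the default ignore-list (assumption: only a
# trailing '*' wildcard is handled in this sketch).
IGNORE = ["COMMENT", "CODE", "ID", "LB", "E_S", "RMV:*"]

def matches(label, pattern):
    """Match a label against a pattern with an optional trailing '*'."""
    if pattern.endswith("*"):
        return label.startswith(pattern[:-1])
    return label == pattern

def is_ignored(label, ignore_list):
    """True if the label matches any pattern on the ignore-list."""
    return any(matches(label, p) for p in ignore_list)

def i_precedes(siblings, x_pat, y_pat, ignore_list):
    """True if a node matching x_pat is directly followed by a node
    matching y_pat once ignored nodes have been skipped."""
    legit = [s for s in siblings if not is_ignored(s, ignore_list)]
    return any(matches(a, x_pat) and matches(b, y_pat)
               for a, b in zip(legit, legit[1:]))

# Daughters of IP-MAT-SPE in the example sentence, in order.
siblings = ["NP-SBJ-1", "BEP", "NP-1", "CODE", "PP", "E_S"]

print(i_precedes(siblings, "NP*", "PP*", IGNORE))  # True: CODE is skipped
print(i_precedes(siblings, "NP*", "PP*", []))      # False: CODE intervenes
```

With an empty ignore-list the CODE node counts as a legitimate node, so NP-1 no longer immediately precedes PP.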
The value of node: gives CorpusSearch a node boundary within which to search. The list of labels gives boundaries that any structure you search for will fall within; for example, IP* would yield all the basic clauses in the corpus, and $ROOT is the topmost level of every syntactic tree, whatever its label. In the case of searches on the output of a previous search in which nodes_only is set to "true", $ROOT refers to the root of the tree, which will have the label of the node boundary.
Whenever you want to consider the entire tree as the domain within which to search, use
node: $ROOT
The choice of node boundary determines the following:
node: IP*|$ROOT
query: (NP* iDominates PRO*)
Here's the output; notice that 1 hit is counted because there is one IP* node, (1 IP-MAT ...), containing both NP* nodes:
/~* and he made them grete chere out of mesure (CMMALORY,2.13) *~/
/*
1 IP-MAT: 3 NP-SBJ, 4 PRO he
1 IP-MAT: 6 NP-OB2, 7 PRO them
*/
(0 (1 IP-MAT (2 CONJ and) (3 NP-SBJ (4 PRO he)) (5 VBD made) (6 NP-OB2 (7 PRO them)) (8 NP-OB1 (9 ADJ grete) (10 N chere)) (11 ADVP (12 ADV out) (13 PP (14 P of) (15 NP (16 N mesure))))) (ID CMMALORY,2.13))
/* FOOTER
source file: CMMALORY
hits found: 1
sentences containing the hits: 1
total sentences searched: 1
*/
Next we ran the query with node boundary NP*:
node: NP*
nodes_only: t
query: (NP* iDominates PRO*)
Here's the output; this time 2 hits are counted, because there are two distinct NP* nodes, (3 NP-SBJ ...) and (6 NP-OB2 ...). Because nodes_only is set to true in this query, only the NP* nodes are printed:
/~* and he made them grete chere out of mesure (CMMALORY,2.13) *~/
/*
3 NP-SBJ: 4 PRO he
6 NP-OB2: 7 PRO them
*/
( (3 NP-SBJ (4 PRO he)) (ID CMMALORY,2.13))
( (6 NP-OB2 (7 PRO them)) (ID CMMALORY,2.13))
/* FOOTER
source file: CMMALORY
hits found: 2
sentences containing the hits: 1
total sentences searched: 1
*/
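The difference between the two hit counts can be reproduced with a short Python sketch. It assumes, based only on the two outputs above, that a hit is one boundary node containing at least one match for the query. The tree encoding and the function names (children, subtree, has_match, count_hits) are hypothetical, and Python's fnmatch stands in for CorpusSearch's own wildcard matching.

```python
from fnmatch import fnmatchcase

# The example sentence as nested tuples: (label, [children]) for phrases,
# (label, word) for leaves.
TREE = ("IP-MAT", [
    ("CONJ", "and"),
    ("NP-SBJ", [("PRO", "he")]),
    ("VBD", "made"),
    ("NP-OB2", [("PRO", "them")]),
    ("NP-OB1", [("ADJ", "grete"), ("N", "chere")]),
    ("ADVP", [("ADV", "out"),
              ("PP", [("P", "of"), ("NP", [("N", "mesure")])])]),
])

def children(node):
    """Daughter nodes of a phrase; a leaf has none."""
    return node[1] if isinstance(node[1], list) else []

def subtree(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in children(node):
        yield from subtree(child)

def has_match(boundary, x_pat, y_pat):
    """True if, inside the boundary node, some node matching x_pat
    immediately dominates a node matching y_pat."""
    return any(
        fnmatchcase(n[0], x_pat)
        and any(fnmatchcase(c[0], y_pat) for c in children(n))
        for n in subtree(boundary)
    )

def count_hits(tree, boundary_pat, x_pat, y_pat):
    """Count boundary nodes that contain at least one match."""
    return sum(
        1 for n in subtree(tree)
        if fnmatchcase(n[0], boundary_pat) and has_match(n, x_pat, y_pat)
    )

print(count_hits(TREE, "IP*", "NP*", "PRO*"))  # 1
print(count_hits(TREE, "NP*", "NP*", "PRO*"))  # 2
```

With IP* as the boundary, both pronoun-dominating NPs fall inside the single IP-MAT node, so they count once; with NP* as the boundary, each of the two NPs is its own search domain.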
add_to_ignore adds the given labels to the ignore_list. For instance,
add_to_ignore: \**
will tell CorpusSearch to ignore traces for this search. When nodes are ignored, they are not considered as possible arguments for search functions. For example, ignoring traces means that IPs with subject traces due to movement of the subject to a position outside the IP will behave in searches as though they had no subject. Thus, whether a given node type should appear on the ignore list depends on the purpose of the search.
ignore_nodes tells CorpusSearch which nodes to ignore.
To replace the default ignore-list with your own ignore-list, include this command in your command file:
ignore_nodes: <your_ignore_list>
To tell CorpusSearch not to ignore any nodes, include this command in your command file:
ignore_nodes: null
If you try to search for an item that is on the ignore_list, you'll get a warning message. For instance, this query:
(NP-SBJ* iPrecedes CODE)
generates this message:
WARNING! CODE in y_argument to iPrecedes is on the ignore_list.
To make the ignore_list empty, add this line to your command file:
ignore_nodes: null
To write your own ignore_list, add this line to your command file:
ignore_nodes:
The program goes ahead and runs as usual, but if you don't get the results you were looking for, you should probably change the ignore_list.
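The way add_to_ignore and ignore_nodes shape the ignore-list can be summarized in a Python sketch. This is an illustration only: the function name effective_ignore_list is hypothetical, the default list is abbreviated, and the order of operations (ignore_nodes applied first, then add_to_ignore) is an assumption, since the text above does not say how the two commands interact when both are given.

```python
# Abbreviated default ignore-list, for illustration only.
DEFAULT_IGNORE = ["COMMENT", "CODE", "ID", "LB", "E_S", "RMV:*"]

def effective_ignore_list(ignore_nodes=None, add_to_ignore=None):
    """Return the ignore-list a search would use.

    ignore_nodes:  None (keep the default), "null" (empty list), or a
                   '|'-separated list that replaces the default.
    add_to_ignore: optional '|'-separated list of extra labels.
    """
    if ignore_nodes is None:
        result = list(DEFAULT_IGNORE)
    elif ignore_nodes.strip() == "null":
        result = []
    else:
        result = ignore_nodes.split("|")
    if add_to_ignore:
        # Assumption: additions are appended and duplicates are skipped.
        result += [p for p in add_to_ignore.split("|") if p not in result]
    return result

print(effective_ignore_list())                       # the default list
print(effective_ignore_list(add_to_ignore=r"\**"))   # default plus traces
print(effective_ignore_list(ignore_nodes="null"))    # []
print(effective_ignore_list(ignore_nodes="CODE|ID")) # ['CODE', 'ID']
```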
ignore_words tells CorpusSearch which nodes to ignore when counting words.
To replace the default word-ignore-list with your own word-ignore-list, include this command in your command file:
ignore_words: <your_word_ignore_list>
To tell CorpusSearch not to ignore any nodes in counting words, include this command in your command file:
ignore_words: null
To add nodes to the word-ignore-list, use
add_to_ignore_words:
The following search functions are governed by the word-ignore-list: DomsWords, DomsWords<, DomsWords>. All other functions use the main ignore-list.
These commands do not in any way influence the current search. They only give instructions about how the results of the current search should be printed to the output file. However, because these commands can cause the output of the current search to take different forms, they may influence future searches which will take as their input the output of the current search.
The remark command (begin_remark: ... end_remark) tells CorpusSearch to print the user's remark in the output preface. This is a way for the user to record a note, for instance to remember the goal of the search.
For instance, the command file "pro-obj.q" contains this command:
begin_remark: pronoun objects end_remark
which is printed in the output preface like this:
/*
PREFACE: regular output file.
CorpusSearch copyright Beth Randall 1999.
Date: Wed Nov 03 19:12:03 EST 1999
command file: pro-obj.q
input file: ipmat-2vb.out
output file: pro-obj.out
remark: pronoun objects
node: IP*
query: (NP-OB* iDominates PRO)
*/
If nodes_only is true, CorpusSearch prints only the nodes that contain the structure described in "query".
If nodes_only is false, CorpusSearch prints the entire sentence that contains the structure described in "query".
For instance, suppose you have this query:
node: ADVP*
nodes_only: t
query: (ADVP* iDominates ADVP*)
Here's what a piece of the output looks like with nodes_only true.
/~* certayn and wit-owte doute, Ihon is is name. (CMAELR3,45.574) *~/
/* 2 ADVP: 3 ADVP */
( (ADVP (ADVP (ADV certayn)) (CONJP (CONJ and) (PP (P wit-owte) (NP (N doute)))) (, ,))(ID CMAELR3,45.574))
And here's the same piece of output with nodes_only false:
/~* certayn and wit-owte doute, Ihon is is name. (CMAELR3,45.589) *~/
/* 2 ADVP: 3 ADVP */
( (IP-MAT (ADVP (ADVP (ADV certayn)) (CONJP (CONJ and) (PP (P wit-owte) (NP (N doute))))) (, ,) (NP-OB1 (NPR Ihon)) (BEP is) (NP-SBJ (PRO$ is) (N name)) (E_S .)) (ID CMAELR3,45.589))
In the normal case CorpusSearch prints as output only nodes or tokens that match the query. Setting print_complement to true causes CorpusSearch to print not only the matching tokens (in the regular output file, extension .out), but also all the tokens that don't match, in a separate file called the complement file (extension .cmp). Thus, print_complement is a form of NOT applied to queries. Generally print_complement should be used on the output of a previous search that has narrowed down the possibilities to some set that can be meaningfully divided; using it on corpus files will generally result in a completely meaningless set of tokens.
Examples: the following query could be used on an output file containing all IPs with objects to divide the IPs into two sets: those with two objects (in the .out file) and those with one (in the .cmp file). The first example is from the regular output file and matches the query, that is, it has two objects. The second example is from the complement file and does not match the query; it has only one object.
print_complement: t
node: IP*
query: ((IP* iDoms [1]NP-OB*) AND (IP* iDoms [2]NP-OB*))
from the regular output file:
/~* And there is no knyght now lyvynge that ought to yelde God so grete thanke os ye, (CMMALORY,655.4474) *~/
/*
1 IP-SUB-SPE: 6 NP-OB2, 8 NP-OB1
1 IP-SUB-SPE: 8 NP-OB1, 6 NP-OB2
*/
(0 (1 IP-SUB-SPE (2 NP-SBJ *T*-2) (3 MD ought) (4 TO to) (5 VB yelde) (6 NP-OB2 (7 NPR God)) (8 NP-OB1 (9 ADJP (10 ADVR so) (11 ADJ grete)) (12 N thanke) (13 PP (14 P os) (15 NP (16 PRO ye))))) (ID CMMALORY,655.4474))
from the complement file:
/~* The kynge lyked and loved this lady wel, (CMMALORY,2.12) *~/
(0 (1 IP-MAT (2 NP-SBJ (3 D The) (4 N kynge)) (5 VBD (6 VBD lyked) (7 CONJ and) (8 VBD loved)) (9 NP-OB1 (10 D this) (11 N lady)) (12 ADVP (13 ADV wel)) (14 E_S ,)) (15 ID CMMALORY,2.12))
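The division into a regular output file and a complement file amounts to partitioning the input tokens by whether they match the query. The Python sketch below illustrates that idea under simplifying assumptions: each token is reduced to an IP label plus the labels of its immediate daughters, the two-object query is approximated by counting daughters that match NP-OB*, and the names has_two_objects and split_with_complement are hypothetical.

```python
from fnmatch import fnmatchcase

def has_two_objects(token):
    """token is (ip_label, [daughter_labels]); true when at least two
    distinct daughters match NP-OB*."""
    _label, daughters = token
    return sum(fnmatchcase(d, "NP-OB*") for d in daughters) >= 2

def split_with_complement(tokens, predicate):
    """Return (matching, non_matching), mirroring the .out and .cmp files."""
    out, complement = [], []
    for token in tokens:
        (out if predicate(token) else complement).append(token)
    return out, complement

# The two example sentences above, reduced to their IP daughters.
tokens = [
    ("IP-SUB-SPE", ["NP-SBJ", "MD", "TO", "VB", "NP-OB2", "NP-OB1"]),
    ("IP-MAT", ["NP-SBJ", "VBD", "NP-OB1", "ADVP", "E_S"]),
]

out, complement = split_with_complement(tokens, has_two_objects)
print([t[0] for t in out])         # ['IP-SUB-SPE']
print([t[0] for t in complement])  # ['IP-MAT']
```

Every input token ends up in exactly one of the two lists, which is why the complement is only meaningful when the input has already been narrowed to a sensible set.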
print_indices tells CorpusSearch whether or not to print indices in the output.
Indices start at 0 and are used to label every node in the tree. CorpusSearch uses indices to distinguish, for instance, between several different NP nodes in the same output structure.
Here's a piece of output structure with indices:
(10 NP-OB1 (11 NPR Morgan) (12 NPR le) (13 NPR Fay)
Here's how it looks without indices:
(NP-PRN (NPR Morgan) (NPR le) (NPR Fey)))
remove_nodes removes subtrees whose root is of the same syntactic category as the node boundary and which are embedded within an instance of that node boundary. It thus removes recursive structure. If the removed subtree matches the query, it will appear as a separate output token later in the output file. If the removed subtree does not contain the searched-for structure, it is discarded and replaced with a label indicating what has been removed.
The purpose of this feature is to make it easier to search output. For instance, if you were looking for IP nodes containing a certain structure, remove_nodes will ensure that your output contains only IP nodes with that structure, and no other IP nodes.
CorpusSearch uses the following algorithm to find the syntactic category of a node: Start with the node boundary label. If that label contains any hyphens, the node's syntactic category is the substring of the label up to the leftmost hyphen, with a '*' tacked on. If the node boundary label does not contain a hyphen, the syntactic category is simply the label with a '*' tacked on, unless the label already has one.
Thus, if the node boundary label is IP-PRN*, the node category is IP*.
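The category rule just described can be written out as a few lines of Python. This is a sketch of the stated rule, not CorpusSearch's own code; the function name syntactic_category is hypothetical, and "already has one" is read here as "already ends in '*'".

```python
def syntactic_category(boundary_label):
    """Derive the syntactic category from a node boundary label.

    With a hyphen: keep everything before the leftmost hyphen, add '*'.
    Without a hyphen: add '*' unless the label already ends in one.
    """
    if "-" in boundary_label:
        return boundary_label.split("-", 1)[0] + "*"
    if boundary_label.endswith("*"):
        return boundary_label
    return boundary_label + "*"

print(syntactic_category("IP-PRN*"))  # IP*
print(syntactic_category("IP*"))      # IP*
print(syntactic_category("NP"))       # NP*
```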
Consider the following command file, in which remove_nodes is set to true, and its effect on the output below:
remove_nodes: true
query: (NP-OB* iDoms PRO)
Output:
/~* 'And I shall defende the,' seyde the knyght. (CMMALORY,39.1264) *~/
/* 1 IP-MAT-SPE: 8 NP-OB1, 9 PRO the */
(0 (1 IP-MAT-SPE (2 ' ') (3 CONJ And) (4 NP-SBJ (5 PRO I)) (6 MD shall) (7 VB defende) (8 NP-OB1 (9 PRO the)) (10 , ,) (11 ' ') (12 IP-MAT-PRN RMV:seyde_the_knyght...) (13 E_S .)) (ID CMMALORY,39.1264))
The structure of sub-sentence "seyde the knyght" has been removed from the parsed sentence and replaced with the symbol RMV:<rmv_string>, where rmv_string stands for a string of (up to) the first three words (leaf nodes) of the removed material and serves as a reminder of what has been removed. A further search on this output will be a search only on IP* nodes that contain a pronoun object, and on no other nodes.
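The construction of the RMV: placeholder can be sketched in Python. The joining with underscores and the trailing "..." are inferred from the single example above (RMV:seyde_the_knyght...) and are assumptions, as is the function name rmv_label.

```python
def rmv_label(words, max_words=3):
    """Build the placeholder for a removed subtree from its leaf words.

    Assumption: at most the first three words are kept, joined with
    underscores and followed by '...', as in the example output.
    """
    return "RMV:" + "_".join(words[:max_words]) + "..."

print(rmv_label(["seyde", "the", "knyght"]))  # RMV:seyde_the_knyght...
```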
print_only prints only the text of the tokens that match the query, suppressing printing of the labeled bracketing and associated information.
print_only: CODING
The resultant output file will bear the extension .ooo. Please note that this feature does not work on output files of prior queries in which the nodes of the parse were indexed.
In theory, you could substitute a part of speech label for CODING, although if you wanted a list of, for instance, all the nouns in your file, you would probably be better off using the make_lexicon feature.
Comments may be added anywhere to the command file or to files of parsed sentences that serve as input to CS. The program uses the following delimiters for such comments. Comment lines begin with "//" and block comments appear between "/*" and "*/". Unlike remarks, comments are not printed to the search output file.
For input files, but not for command files, you can also define custom comment delimiters. Add the following commands to the command file preamble or to the preferences file, followed by the desired delimiter strings.
For a line comment delimiter, add corpus_line_comment:
For block comment delimiters, add corpus_comment_begin: and corpus_comment_end:
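To make the delimiter conventions concrete, here is a Python sketch that removes comments from a piece of text, with the default delimiters described above and optional custom ones. It is an illustration only: the function name strip_comments and its parameters are hypothetical, and treating a line comment as a whole line that starts with the delimiter (after optional indentation) is an assumption about how "comment lines" are recognized.

```python
import re

def strip_comments(text, line_delim="//", block_begin="/*", block_end="*/"):
    """Remove block comments and whole-line comments from text."""
    # Block comments: shortest span from block_begin to block_end.
    block = re.compile(
        re.escape(block_begin) + r".*?" + re.escape(block_end), re.DOTALL
    )
    text = block.sub("", text)
    # Line comments: lines whose first non-blank characters are line_delim.
    line = re.compile(
        r"^[ \t]*" + re.escape(line_delim) + r".*\n?", re.MULTILINE
    )
    return line.sub("", text)

command_file = (
    "// find pronoun objects\n"
    "node: IP*\n"
    "/* block\ncomment */query: (NP-OB* iDoms PRO)\n"
)
print(strip_comments(command_file))

# A custom line delimiter, as one might set for an input file.
print(strip_comments("## editor's note\n(NP (N see))\n", line_delim="##"))
```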