The extension of an output file will be .out. Output files have this general form:
| 1 per output file | 1 per input file | 1 per output sentence |
|---|---|---|
| Preface | Header | ur_text sentence |
| Summary | Footer | result block |
| | | parsed sentence |
Since the output file can become input to a subsequent search, everything except parsed sentences is surrounded by comment markers /* and */ (the ur_text block has slightly different markers).
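As a rough illustration of why the comment markers matter, the following Python sketch strips every commented block from an output file's text, leaving only the parsed sentences. It is not part of CorpusSearch; the names COMMENT_BLOCK and strip_comment_blocks are hypothetical, and the sketch assumes that the sequences "*/" and "*~/" never occur inside a block except as its closing marker.

```python
import re

# Both comment forms: the ur_text block /~* ... *~/ and ordinary /* ... */ blocks.
# Assumption: the closing markers never appear inside a block's own content.
COMMENT_BLOCK = re.compile(r"/~\*.*?\*~/|/\*.*?\*/", re.DOTALL)


def strip_comment_blocks(output_text: str) -> str:
    """Return the text that lies outside all comment markers."""
    return COMMENT_BLOCK.sub("", output_text).strip()


sample = """/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/
/*
1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he
*/
(0 (1 IP-MAT (2 NP-OB1 (3 NP-POS (4 PRO$ His) (5 N$ fadir)) (6 N scheep)) (7 VBD kepte) (8 NP-SBJ (9 PRO he)) (10 ADVP (11 ADVR ful) (12 ADV mekly)) (13 E_S ;)) (ID CMCAPCHR,32.13))"""

parsed_only = strip_comment_blocks(sample)
print(parsed_only)  # only the labeled bracketing remains
```

What remains after stripping is the material a subsequent search would treat as parse trees.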
As an example, we will walk through a typical output file. The query was designed to search for inverted pronoun subjects, that is, pronoun subjects that appear after the tensed verb.
To make this example easier to follow, the search was done with the default value (false) for "nodes_only."
We will discuss "nodes_only" and "remove_nodes" below.
/*
PREFACE:
CorpusSearch copyright Beth Randall 2004.
Date: Sun Apr 30 07:05:51 EDT 2004
command file: under.q
output file: under.out
remark: this query searches for inverted pronoun subjects.
node: IP*
print_indices: true
query: ((((NP*|ADJP*|ADVP*|PP* iPrecedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)
AND (NP*|ADJP*|ADVP*|PP* iDominates !\*T*))
AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iPrecedes NP-SBJ*))
AND (NP-SBJ* iDominates PRO|MAN))
*/
The preface begins with a copyright declaration and the date and time of the search.
The names of the command file and output file are listed. If this search had been performed using an output file as input (instead of a corpus file), the name of the output-as-input file would also have been listed in this block. But because the input file is a corpus file, the header and summary blocks contain all the necessary information (for more on searching output files, see below).
The remark was found in the command file. It serves as a reminder of the purpose of the query.
The beginning of the query,
((NP*|ADJP*|ADVP*|PP* iPrecedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD) AND (NP*|ADJP*|ADVP*|PP* iDominates !\*T*))
requires a constituent (NP*|ADJP*|ADVP*|PP*) which immediately precedes the tensed verb (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD). The constituent must not immediately dominate a trace (\*T*), that is, a placeholder for a word which would appear in that position under some circumstances but in fact appears elsewhere in this particular sentence. This requirement was put in to preclude questions (such as "Kepte he his fadir scheep full mekly?"), where nothing other than the tensed verb comes before the inverted pronoun subject. In Middle English statements, there must be one constituent before the tensed verb, as the first two lines of the query describe.
The last two lines of the query,
AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iPrecedes NP-SBJ*)) AND (NP-SBJ* iDominates PRO|MAN))
describe the tensed verb (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD), which immediately precedes the subject noun phrase (NP-SBJ*), which itself immediately dominates a pronoun ("PRO|MAN"); that is, the subject is a pronoun.
/*
HEADER:
source file: cmcapchr.m4.psd
*/
Here, the source file is listed as its name appears in the corpus directory. If the input had been an output file, the source file would have been listed as its name appears in the ID node of each sentence, that is, CMCAPCHR (for more on searching output files, see below).
Here's an example of an output sentence, first presented as the original text, followed by the result block, which lists the nodes relevant to the query, followed by the sentence in its parsed form:
/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/
/*
1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he
*/
(0 (1 IP-MAT (2 NP-OB1 (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep))
             (7 VBD kepte)
             (8 NP-SBJ (9 PRO he))
             (10 ADVP (11 ADVR ful) (12 ADV mekly))
             (13 E_S ;))
   (ID CMCAPCHR,32.13))
The indices on the nodes of the labeled bracketing are intended to facilitate seeing how the token comes to match the query. They are present whenever the query preamble contains the line "print_indices: true."
Notice that the original text is surrounded by special markers, "/~*" and "*~/". When a search is run on the output file, CorpusSearch will find and record this block of data as the original text of the output sentence. In this way the entire original text is conserved, even when only bits and pieces of the original parse tree for the sentence appear in the output.
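To make the role of the special markers concrete, here is a minimal Python sketch that pulls the original text and its sentence ID out of each ur_text block. It only illustrates the marker convention and is not how CorpusSearch itself is implemented; the function name extract_ur_texts is hypothetical, and the sketch assumes the ID is the final parenthesized item in the block, as in the example above.

```python
import re

# Everything between the ur_text markers /~* and *~/.
UR_TEXT = re.compile(r"/~\*(.*?)\*~/", re.DOTALL)
# Assumption: the sentence ID is the last parenthesized item in the block.
TRAILING_ID = re.compile(r"\(([^()\s]+)\)\s*$")


def extract_ur_texts(output_text):
    """Return a list of (original_text, sentence_id) pairs, one per ur_text block."""
    pairs = []
    for match in UR_TEXT.finditer(output_text):
        body = " ".join(match.group(1).split())  # collapse line breaks and runs of spaces
        id_match = TRAILING_ID.search(body)
        if id_match:
            pairs.append((body[: id_match.start()].rstrip(), id_match.group(1)))
        else:
            pairs.append((body, None))
    return pairs


sample = (
    "/~* His fadir scheep kepte he ful mekly; (CMCAPCHR,32.13) *~/ "
    "/* 1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he */"
)
ur_texts = extract_ur_texts(sample)
print(ur_texts)
```

Because the ordinary comment marker /* does not match /~*, the result block in the sample is ignored by this pattern.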
The first item in the result block is the boundary node (in this case, 1 IP-MAT), which matches the value of the "node: " line of the command file. The boundary node is followed by a colon to separate it from the rest of the list, which gives the structures that correspond to the "query: " line of the command file. The list of indices and structures is arranged so that no node is reported more than once.
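The layout of a result-block line (boundary node, colon, comma-separated nodes) can be read mechanically. The Python sketch below is one possible way to do so, not a CorpusSearch API; Node, parse_node, and parse_result_line are hypothetical names. It assumes "print_indices: true" was set, so every entry begins with an index, and that no reported word itself contains a colon or comma.

```python
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    index: int
    label: str
    word: Optional[str]  # None for phrasal nodes such as NP-SBJ


def parse_node(text: str) -> Node:
    """Parse an entry such as '7 VBD kepte' or '8 NP-SBJ'."""
    parts = text.split()
    return Node(int(parts[0]), parts[1], " ".join(parts[2:]) or None)


def parse_result_line(line: str) -> Tuple[Node, List[Node]]:
    """Split a result line into its boundary node and the reported nodes."""
    boundary_text, _, rest = line.partition(":")
    matched = [parse_node(item) for item in rest.split(",") if item.strip()]
    return parse_node(boundary_text), matched


boundary, matched = parse_result_line(
    "1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he"
)
print(boundary)                    # the boundary node, index 1
print([n.index for n in matched])  # indices in the order they are reported
```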
For some queries, there may be many nodes that fit one search-function argument. In these cases CorpusSearch always reports the last legitimate fitting node. For instance, look at this part of the query:
(NP*|ADJP*|ADVP*|PP* iDominates !\*T*)
In the sentence above, 2 NP-OB1 immediately dominates the following nodes, neither of which (3 NP-POS and 6 N scheep) is \*T*:
(3 NP-POS (4 PRO$ His) (5 N$ fadir)) (6 N scheep)
so it is the last node, (6 N scheep), that is reported in the result block.
The parsed version of the output sentence is indented to show the structure of the tree. Sisters have the same indentation (for instance, 2 NP-OB1 and 7 VBD kepte). Daughters are indented further than their mothers. If a node dominates only leaves, they are printed on the same line to save space.
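The mother, daughter, and sister relations that the indentation displays are already encoded in the parentheses. As a sketch only (the names TOKEN and parse_tree are hypothetical, and this is not CorpusSearch's own parser), the following Python code reads one labeled bracketing into nested dictionaries and lists the sisters under the boundary node. It assumes that words never contain parentheses.

```python
import re

# A token is an opening parenthesis, a closing parenthesis, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(text):
    """Parse one labeled bracketing into nested dicts of atoms and children."""
    tokens = TOKEN.findall(text)

    def parse_node(pos):
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = {"atoms": [], "children": []}
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = parse_node(pos)
                node["children"].append(child)
            else:
                node["atoms"].append(tokens[pos])  # index, label, or word
                pos += 1
        return node, pos + 1  # skip the closing parenthesis

    root, end = parse_node(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the closing parenthesis")
    return root


tree_text = (
    "(0 (1 IP-MAT (2 NP-OB1 (3 NP-POS (4 PRO$ His) (5 N$ fadir)) (6 N scheep)) "
    "(7 VBD kepte) (8 NP-SBJ (9 PRO he)) (10 ADVP (11 ADVR ful) (12 ADV mekly)) "
    "(13 E_S ;)) (ID CMCAPCHR,32.13))"
)
root = parse_tree(tree_text)
ip_mat = root["children"][0]
sisters = [" ".join(child["atoms"][:2]) for child in ip_mat["children"]]
print(sisters)  # the daughters of 1 IP-MAT, which are sisters of one another
```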
/*
FOOTER
source file, hits/tokens/total
cmcapchr.m4.psd 220/220/4175
*/
The footer gives the statistics for hits, tokens, and total as found in that input file. The same information appears again as one line of the summary block.
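A statistics line of the form "source hits/tokens/total" is easy to read programmatically. The sketch below is an illustration under the assumption that the three counts are always the last whitespace-separated field and are joined by slashes, as in the footer above; FileStats and parse_stats_line are hypothetical names, not part of CorpusSearch.

```python
from typing import NamedTuple


class FileStats(NamedTuple):
    source: str
    hits: int
    tokens: int
    total: int


def parse_stats_line(line: str) -> FileStats:
    """Parse a line such as 'cmcapchr.m4.psd 220/220/4175'."""
    source, counts = line.rsplit(None, 1)  # counts are the last field
    hits, tokens, total = (int(n) for n in counts.split("/"))
    return FileStats(source, hits, tokens, total)


stats = parse_stats_line("cmcapchr.m4.psd 220/220/4175")
print(stats.source, stats.hits, stats.tokens, stats.total)
```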
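CorpusSearch reports these statistics:
hits: the number of distinct boundary nodes that contain the structure described by the query.
tokens: the number of tokens that contain at least one hit.
total: the total number of tokens in the input file.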
When you're searching a corpus file, most "tokens" are "matrix sentences", though some corpora have fragments and other material as independent tokens. In searches of corpus files it is very common to have "hits" greater than "tokens", since one matrix sentence may contain many distinct boundary nodes.
But suppose you follow these steps:
/*
SUMMARY:
source files, hits/tokens/total:
cmaelr4.m4.psd 46/46/766
cmcapchr.m4.psd 220/220/4175
cmcapser.m4.psd 12/12/91
cmedmund.m4.psd 2/2/300
cmfitzja.m4.psd 14/14/228
cmgregor.m4.psd 14/14/2631
cminnoce.m4.psd 6/6/208
cmkempe.m4.psd 203/202/3851
cmmalory.m4.psd 214/213/4995
cmreynar.m4.psd 36/36/547
cmreynes.m4.psd 0/0/245
cmsiege.m4.psd 6/6/731
whole search, hits/tokens/total 773/771/18772
*/
The summary block gives the same information as the footer blocks for each input file, but brought together in one place. This summary block was produced by a search on all files in the Middle English corpus (PPCME2) whose titles contain "m4", meaning they are from the fourth chronological period (1420 - 1500).
Consider this query file, called ipmat-2vb.q:
begin_remark:
This query searches for matrix clauses which contain a subject and at least two verbs. The subject precedes both verbs.
end_remark
nodes_only: t
remove_nodes: t
node: IP-MAT*
query: (((((IP-MAT* iDoms NP-SBJ*)
AND (NP-SBJ* precedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD))
AND (NP-SBJ* precedes VB|VAN|VBN|HV|HAN|HVN|DO|DAN|DON|BE|BEN))
AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iDoms ![1]\**))
AND (VB|VAN|VBN|HV|HAN|HVN|DO|DAN|DON|BE|BEN iDoms ![2]\**))
Because remove_nodes and nodes_only are set to "true," the output will print only the boundary nodes containing the structure, and irrelevant boundary nodes will be removed. The purpose of these settings is to ensure that subsequent searches are conducted only on the matrix clauses that contain a subject preceding two verbs. Here's a sample output sentence. In Modern English, it would read: "He would have told you more if you had allowed him to."
/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a
1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde
*/
(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0) (IP-SUB RMV:$ye_wolde_a...)))
                 (24 E_S .))
   (ID CMMALORY,35.1106))
Notice that the IP-SUB clause, "$ye wolde a suffirde hym", has been removed.
Suppose we run this output through a search for pronoun objects, using this query file, called "pro-obj.q":
begin_remark:
pronoun objects
end_remark
add_to_ignore: \**
query: (NP-OB* iDoms PRO)
The 35.1106 sentence shows up again, because it has a pronoun object "you":
/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
*/
(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0) (16 IP-SUB RMV:$ye_wolde_a...)))
                 (17 E_S .))
   (ID CMMALORY,35.1106))
Notice that the result block describes one structure:
1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
This structure will be counted as one hit in the final summary block.
Now suppose we run the same series of searches, but this time we change "nodes_only" to "false."
nodes_only: f
When "nodes_only" is false, "remove_nodes" is automatically false.
Here's how the 35.1106 sentence looks after running ipmat-2vb.q with nodes_only and remove_nodes false:
/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a
1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde
*/
(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0)
                                   (16 IP-SUB (17 NP-SBJ (18 PRO $ye))
                                              (19 MD wolde)
                                              (20 HV a)
                                              (21 VBN suffirde)
                                              (22 NP-OB1 (23 PRO hym)))))
                 (24 E_S .))
   (25 ID CMMALORY,35.1106))
Notice that the clause "$ye wolde a suffirde hym" is printed out in full.
Now we run pro-obj.q on this output. Here's the 35.1106 sentence in the output of this search:
/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
16 IP-SUB: 22 NP-OB1, 23 PRO hym
*/
(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0)
                                   (16 IP-SUB (17 NP-SBJ (18 PRO $ye))
                                              (19 MD wolde)
                                              (20 HV a)
                                              (21 VBN suffirde)
                                              (22 NP-OB1 (23 PRO hym)))))
                 (24 E_S .))
   (25 ID CMMALORY,35.1106))
Notice that here the result block contains two different structures:
1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
16 IP-SUB: 22 NP-OB1, 23 PRO hym
The structure
16 IP-SUB: 22 NP-OB1, 23 PRO hym
is reported in this case because remove_nodes was false in the previous search. The pronoun object "hym" was found in a subordinate clause, not in the matrix clause that the previous search was designed to isolate.
Because the structures occur in two distinct boundary nodes (1 IP-MAT-SPE and 16 IP-SUB), this will count as two hits in the summary block, in contrast to the one hit that was found when remove_nodes was true. This explains why, after using "remove_nodes: true" in the initial search, a second search for pronoun objects finds fewer hits than when the initial search was conducted with "remove_nodes: false."
Here's the summary block from the "remove_nodes: true" version:
/*
SUMMARY:
source files, hits/tokens/total:
CMMALORY 177/176/875
whole search, hits/tokens/total 177/176/875
*/
And here's the summary block from the "remove_nodes: false" version:
/*
SUMMARY:
source files, hits/tokens/total:
CMMALORY 290/249/875
whole search, hits/tokens/total 290/249/875
*/
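The difference between the two summaries follows from the counting rule described above: each distinct boundary node in a result block counts as one hit. The Python sketch below models that rule for the 35.1106 sentence. It is an illustration based on this chapter's description rather than CorpusSearch's actual code; the function names are hypothetical, and treating a sentence with at least one hit as one token is an assumption of the sketch.

```python
def boundary_of(result_line):
    """Return the boundary node part of a result line, e.g. '16 IP-SUB'."""
    return result_line.partition(":")[0].strip()


def count_hits(result_lines):
    """Count the distinct boundary nodes among one sentence's result lines."""
    return len({boundary_of(line) for line in result_lines if line.strip()})


def summarize(result_blocks):
    """Return (hits, tokens) for a list of result blocks, one block per sentence."""
    hits = sum(count_hits(block) for block in result_blocks)
    # Assumption: a sentence counts as one token if it contains at least one hit.
    tokens = sum(1 for block in result_blocks if count_hits(block) > 0)
    return hits, tokens


# Result block for 35.1106 when the first search used remove_nodes: true
with_removal = [["1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you"]]
# Result block for 35.1106 when the first search used remove_nodes: false
without_removal = [[
    "1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you",
    "16 IP-SUB: 22 NP-OB1, 23 PRO hym",
]]

print(summarize(with_removal))     # (1, 1)
print(summarize(without_removal))  # (2, 1)
```

In the second case the single sentence contributes two hits but only one token, which is the pattern behind a hits figure larger than the tokens figure.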