Contents of this chapter:

general form of the output
a typical output file
preface
header
result block with output sentence
footer
hits/tokens/total
summary block
using nodes_only and remove_nodes

general form of the output

The extension of an output file will be .out. Output files have this general form:

1 per output file    1 per input file    1 per output sentence
-----------------    ----------------    ---------------------
Preface              Header              ur_text sentence
Summary              Footer              result block
                                         parsed sentence

Since the output file can become input to a subsequent search, everything except parsed sentences is surrounded by comment markers /* and */ (the ur_text block has slightly different markers).
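To illustrate why the comment markers matter, here is a minimal Python sketch (not part of CorpusSearch) that removes both kinds of comment block and leaves only the parsed sentences. The function name strip_comments and the shortened sample text are made up for this illustration, and the sketch assumes that the marker sequences never occur inside a comment's own content.

```python
import re

# Matches both kinds of comment block: the ur_text block /~* ... *~/
# and the ordinary comment /* ... */ (hypothetical helper, not CorpusSearch code).
COMMENT = re.compile(r"/~\*.*?\*~/|/\*.*?\*/", re.DOTALL)

def strip_comments(output_text: str) -> str:
    """Return only the material outside comment markers (the parsed sentences)."""
    return COMMENT.sub("", output_text).strip()

# A shortened, hand-made sample in the general shape of an output file.
sample = """/*
    HEADER:
    source file:  cmcapchr.m4.psd
*/
/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/
/*
    1 IP-MAT: 8 NP-SBJ, 9 PRO he
*/
(0 (1 IP-MAT (8 NP-SBJ (9 PRO he))) (ID CMCAPCHR,32.13))
"""

print(strip_comments(sample))
```

What remains after stripping is the labeled bracketing alone, which is the part a subsequent search would treat as parsed input.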

a typical output file

As an example, we will walk through a typical output file. The query was designed to search for inverted pronoun subjects, that is, pronoun subjects that appear after the tensed verb.

To make this example easier to follow, the search was done with the default value (false) for "nodes_only."

We will discuss "nodes_only" and "remove_nodes" below.

preface

/*
    PREFACE:
    CorpusSearch copyright Beth Randall 2004.
    Date:  Sun Apr 30 07:05:51 EDT 2004

    command file:       under.q
    output file:        under.out

    remark:   this query searches for inverted pronoun subjects.

    node:   IP*
    print_indices: true

    query:  ((((NP*|ADJP*|ADVP*|PP* iPrecedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)
            AND (NP*|ADJP*|ADVP*|PP* iDominates !\*T*))
            AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iPrecedes NP-SBJ*))
            AND (NP-SBJ* iDominates PRO|MAN))
*/

The preface begins with a copyright declaration and the date and time of the search.

The names of the command file and output file are listed. If this search had been performed using an output file as input (instead of a corpus file), the name of the output-as-input file would also have been listed in this block. But because the input file is a corpus file, the header and summary blocks contain all the necessary information (for more on searching output files, see below).

The remark was found in the command file. It serves as a reminder of the purpose of the query.

The beginning of the query,

          ((NP*|ADJP*|ADVP*|PP* iPrecedes  *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)
          AND (NP*|ADJP*|ADVP*|PP* iDominates !\*T*))

requires a constituent (NP*|ADJP*|ADVP*|PP*) that immediately precedes the tensed verb (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD). The constituent is required not to have a trace (\*T*), that is, a placeholder for a word that would appear in that place under some circumstances but in fact appears elsewhere in this particular sentence. This requirement was put in to preclude questions (such as "Kepte he his fadir scheep full mekly?"), where there is no constituent before the inverted pronoun subject other than the tensed verb. In Middle English statements, there must be one constituent before the tensed verb, as the first two lines of the query describe.

The last two lines of the query,

          AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iPrecedes NP-SBJ*))
          AND (NP-SBJ* iDominates PRO|MAN))

describe the tensed verb (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD), which immediately precedes the subject noun phrase (NP-SBJ*), which itself immediately dominates a pronoun ("PRO|MAN"); that is, the subject is a pronoun.
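The label patterns in this query can be read as small wildcard expressions. The following Python sketch is an approximation written for this walkthrough, not CorpusSearch's own matcher: it assumes that "*" matches any run of characters, "|" separates alternatives, and "\*" stands for a literal asterisk. It does not handle negation ("!") or prefix indices such as "[1]". The names label_pattern and matches are hypothetical.

```python
import re

def label_pattern(spec):
    """Compile a spec such as 'NP-SBJ*' or 'PRO|MAN' into a regular expression."""
    alternatives = []
    for alt in spec.split("|"):
        out, i = [], 0
        while i < len(alt):
            if alt[i] == "\\" and i + 1 < len(alt):
                out.append(re.escape(alt[i + 1]))  # \* is a literal asterisk
                i += 2
            elif alt[i] == "*":
                out.append(".*")                   # * matches any run of characters
                i += 1
            else:
                out.append(re.escape(alt[i]))
                i += 1
        alternatives.append("".join(out))
    # \Z forces the whole label to match, not just a prefix.
    return re.compile("(?:" + "|".join(alternatives) + r")\Z")

def matches(spec, label):
    """True if the node label fits the spec (assumed semantics, see above)."""
    return label_pattern(spec).match(label) is not None

print(matches("NP-SBJ*", "NP-SBJ"))                 # bare label
print(matches("NP-SBJ*", "NP-SBJ-1"))               # label with an extension
print(matches("PRO|MAN", "PRO$"))                   # possessive pronoun tag
print(matches("*MD|*HVP|*HVD|*VBP|*VBD", "VBD"))    # tensed verb tag
print(matches(r"\*T*", "*T*-1"))                    # trace
```

Under these assumptions, "PRO|MAN" accepts PRO but not PRO$, which is why the possessive "His" in the example sentence is not taken for a pronoun subject.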

header

/*
    HEADER:
    source file:  cmcapchr.m4.psd
*/

Here, the source file is listed as its name appears in the corpus directory. If the input had been an output file, the source file would have been listed as its name appears in the ID node of each sentence, that is, CMCAPCHR (for more on searching output files, see below).

result block with output sentence

Here's an example of an output sentence, first presented as the original text, followed by the result block, which lists the nodes relevant to the query, followed by the sentence in its parsed form:

/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/

/*
    1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he
*/

(0
   (1 IP-MAT
             (2 NP-OB1 (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep))
             (7 VBD kepte)
             (8 NP-SBJ (9 PRO he))
             (10 ADVP (11 ADVR ful) (12 ADV mekly))
             (13 E_S ;))
   (ID CMCAPCHR,32.13))

The indices on the nodes of the labeled bracketing are intended to facilitate seeing how the token comes to match the query. They are present whenever the query preamble contains the line "print_indices: true."

Notice that the original text is surrounded by special markers, "/~*" and "*~/". When a search is run on the output file, CorpusSearch will find and record this block of data as the original text of the output sentence. In this way the entire original text is conserved, even when only bits and pieces of the original parse tree for the sentence appear in the output.
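As an illustration of how the special markers make the original text recoverable, here is a small Python sketch that pulls the text and the ID out of each ur_text block. It is not CorpusSearch code; the name ur_text_blocks is hypothetical, and the sketch assumes the block ends with an ID of the form (SOURCE,location), as in the example above.

```python
import re

# Text, then "(SOURCE,location)", all enclosed in /~* ... *~/ (assumed layout).
UR_TEXT = re.compile(r"/~\*\s*(.*?)\s*\((\S+?),(\S+?)\)\s*\*~/", re.DOTALL)

def ur_text_blocks(output_text):
    """Yield (text, source, location) for every ur_text block found."""
    for m in UR_TEXT.finditer(output_text):
        yield m.group(1), m.group(2), m.group(3)

sample = """/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/"""

blocks = list(ur_text_blocks(sample))
print(blocks)
```

Because the block is self-delimiting, the original sentence can be recovered this way even when the parse tree printed after it has been pruned.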

The first item in the result block is the boundary node (in this case, 1 IP-MAT), which matches the value of the "node: " line of the command file. The boundary node is followed by a colon to separate it from the rest of the list, which gives the structures that correspond to the "query: " line of the command file. The list is arranged so that no node is reported more than once.
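The layout of a result-block line can be made concrete with a short Python sketch that splits it into the boundary node and the list of reported nodes. This is an illustration only; parse_result_line is a hypothetical name, and the sketch assumes that no reported word is itself a comma or a colon.

```python
def parse_result_line(line):
    """Split '1 IP-MAT: 2 NP-OB1, 7 VBD kepte' into boundary node and node list."""
    boundary, _, rest = line.strip().partition(":")
    index, label = boundary.split()
    nodes = []
    for item in rest.split(","):
        parts = item.split()
        # parts is [index, label] for a phrase or [index, label, word] for a leaf
        word = parts[2] if len(parts) > 2 else None
        nodes.append((int(parts[0]), parts[1], word))
    return (int(index), label), nodes

boundary, nodes = parse_result_line(
    "1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he")
print(boundary)
for node in nodes:
    print(node)

# Every node index appears only once in the list.
indices = [n[0] for n in nodes]
print(len(indices) == len(set(indices)))
```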

For some queries, there may be many nodes that fit one search-function argument. In these cases CorpusSearch always reports the last legitimate fitting node. For instance, look at this part of the query:

(NP*|ADJP*|ADVP*|PP* iDominates !\*T*)

In the sentence above, (2 NP-OB1) immediately dominates the following nodes, neither of which, (3 NP-POS) or (6 N scheep), is \*T*:

                       (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep)

so it is the last node, (6 N scheep), that is reported in the result block.

The parsed version of the output sentence is indented to show the structure of the tree. Sisters have the same indentation (for instance, 2 NP-OB1 and 7 VBD kepte). Daughters are indented further than their mothers. If a node dominates only leaves, they are printed on the same line to save space.
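Since the indentation is only a display aid, the tree structure can be rebuilt from the brackets alone. The Python sketch below reads an indexed labeled bracketing into nested lists and then reads the words back off the leaves. It is a simplified illustration, not CorpusSearch's parser: parse_tree and words are hypothetical names, and the sketch assumes print_indices was true and that no word is itself a parenthesis.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse_tree(text):
    """Parse one labeled bracketing into nested lists of strings."""
    stack, root = [], None
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)   # attach daughter to its mother
            else:
                root = node              # outermost bracket closed
        else:
            stack[-1].append(tok)
    return root

def words(node):
    """Return the words of the tree in order, skipping the ID node."""
    children = [c for c in node if isinstance(c, list)]
    if not children:                     # leaf: [index, label, word]
        return [] if "ID" in node[:-1] else [node[-1]]
    out = []
    for child in children:
        out.extend(words(child))
    return out

sample = """
(0
   (1 IP-MAT
             (2 NP-OB1 (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep))
             (7 VBD kepte)
             (8 NP-SBJ (9 PRO he))
             (10 ADVP (11 ADVR ful) (12 ADV mekly))
             (13 E_S ;))
   (ID CMCAPCHR,32.13))
"""

tree = parse_tree(sample)
print(tree[1][:2])
print(" ".join(words(tree)))
```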

footer

/*
FOOTER
  source file, hits/tokens/total
  cmcapchr.m4.psd	220/220/4175
*/

The footer gives the statistics for hits, tokens, and total as found in that input file. The same information appears again as one line of the summary block.
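For readers who want to collect these statistics programmatically, a footer or summary line can be split with a few lines of Python. This is a sketch under the assumption that the file name contains no whitespace and that the three counts are separated by slashes, as in the footer above; Stats and parse_stats_line are hypothetical names.

```python
from typing import NamedTuple

class Stats(NamedTuple):
    source: str
    hits: int
    tokens: int
    total: int

def parse_stats_line(line: str) -> Stats:
    """Parse a line such as 'cmcapchr.m4.psd  220/220/4175'."""
    source, counts = line.split()
    hits, tokens, total = (int(n) for n in counts.split("/"))
    return Stats(source, hits, tokens, total)

row = parse_stats_line("  cmcapchr.m4.psd\t220/220/4175")
print(row)
```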

hits/tokens/total

CorpusSearch reports these statistics:

hits
    number of distinct boundary nodes containing the searched-for structure.
tokens
    number of independent parsed objects in which hits occurred.
total
    total number of independent parsed objects searched.

When you're searching a corpus file, most "tokens" are "matrix sentences", though some corpora have fragments and other material as independent tokens. In searches of corpus files it is very common to have "hits" greater than "tokens", since one matrix sentence may contain many distinct boundary nodes.

But suppose you follow these steps:

  1. Run a search on the corpus with "nodes_only" and "remove_nodes" set to "true." Call the output of this search "1.out".
  2. Now, run a second search on "1.out" with the same boundary node as used in the first search. Call the output of this second search "2.out".

In "2.out", "hits" and "tokens" will be the same number, because each token in "1.out" contained exactly one boundary node and thus can contain at most one hit in the second search.

summary block

/*
SUMMARY:  
source files, hits/tokens/total:
  cmaelr4.m4.psd    46/46/766
  cmcapchr.m4.psd   220/220/4175
  cmcapser.m4.psd   12/12/91
  cmedmund.m4.psd   2/2/300
  cmfitzja.m4.psd   14/14/228
  cmgregor.m4.psd   14/14/2631
  cminnoce.m4.psd   6/6/208
  cmkempe.m4.psd    203/202/3851
  cmmalory.m4.psd   214/213/4995
  cmreynar.m4.psd   36/36/547
  cmreynes.m4.psd   0/0/245
  cmsiege.m4.psd    6/6/731
whole search, hits/tokens/total
		773/771/18772
*/

The summary block gives the same information as the footer blocks for each input file, but brought together in one place. This summary block was produced by a search on all files in the Middle English corpus (PPCME2) whose titles contain "m4", meaning they are from the fourth chronological period (1420-1500).

using nodes_only and remove_nodes

Consider this query file, called ipmat-2vb.q:

begin_remark:
    This query searches for matrix clauses which contain a
    subject and at least two verbs.  The subject precedes
    both verbs.
end_remark

nodes_only: t
remove_nodes: t
node:  IP-MAT*
query: (((((IP-MAT* iDoms NP-SBJ*)
AND (NP-SBJ* precedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD))
AND (NP-SBJ* precedes VB|VAN|VBN|HV|HAN|HVN|DO|DAN|DON|BE|BEN))
AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iDoms ![1]\**))
AND (VB|VAN|VBN|HV|HAN|HVN|DO|DAN|DON|BE|BEN iDoms ![2]\**))

Because nodes_only and remove_nodes are set to "true," the output will print only the boundary nodes containing the structure, and irrelevant boundary nodes will be removed. These settings ensure that subsequent searches are conducted only on the matrix clauses that contain a subject preceding two verbs. Here's a sample output sentence. In Modern English, it would read: "He would have told you more if you had allowed him to."

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde
*/

(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0)
                                   (IP-SUB RMV:$ye_wolde_a...)))
                 (24 E_S .))(ID CMMALORY,35.1106))

Notice that the IP-SUB clause, "$ye wolde a suffirde hym", has been removed.

Suppose we run this output through a search for pronoun objects, using this query file, called "pro-obj.q":

begin_remark:
pronoun objects
end_remark

add_to_ignore: \**
query: (NP-OB* iDoms PRO)

The 35.1106 sentence shows up again, because it has a pronoun object "you":

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
*/

(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0)
                                   (16 IP-SUB RMV:$ye_wolde_a...)))
                 (17 E_S .))(ID CMMALORY,35.1106))

Notice that the result block describes one structure,

1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you

This structure will be counted as one hit in the final summary block.

Now suppose we run the same series of searches, but this time we change "nodes_only" to "false."

nodes_only: f

When "nodes_only" is false, "remove_nodes" is automatically false.

Here's how the 35.1106 sentence looks after running ipmat-2vb.q with nodes_only and remove_nodes false:

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde
*/

(0
(1 IP-MAT-SPE (2 CONJ and)
              (3 NP-OB1 (4 QR more))
              (5 NP-SBJ (6 PRO he))
              (7 MD wolde)
              (8 HV a)
              (9 VBN tolde)
              (10 NP-OB2 (11 PRO you))
              (12 PP (13 P and)
                     (14 CP-ADV (15 C 0)
                                (16 IP-SUB
                                           (17 NP-SBJ (18 PRO $ye))
                                           (19 MD wolde)
                                           (20 HV a)
                                           (21 VBN suffirde)
                                           (22 NP-OB1 (23 PRO hym)))))
              (24 E_S .))
(25 ID CMMALORY,35.1106))

Notice that the clause "$ye wolde a suffirde hym" is printed out in full.

Now we run pro-obj.q on this output. Here's the 35.1106 sentence in the output of this search:

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
 16 IP-SUB: 22 NP-OB1, 23 PRO hym
*/

(0
(1 IP-MAT-SPE (2 CONJ and)
              (3 NP-OB1 (4 QR more))
              (5 NP-SBJ (6 PRO he))
              (7 MD wolde)
              (8 HV a)
              (9 VBN tolde)
              (10 NP-OB2 (11 PRO you))
              (12 PP (13 P and)
                     (14 CP-ADV (15 C 0)
                                (16 IP-SUB
                                           (17 NP-SBJ (18 PRO $ye))
                                           (19 MD wolde)
                                           (20 HV a)
                                           (21 VBN suffirde)
                                           (22 NP-OB1 (23 PRO hym)))))
              (24 E_S .))
(25 ID CMMALORY,35.1106))

Notice that here the result block contains two different structures,

1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you 
16 IP-SUB: 22 NP-OB1, 23 PRO hym

The structure

16 IP-SUB: 22 NP-OB1, 23 PRO hym

is reported in this case because remove_nodes was false in the previous search. The pronoun object "hym" was found in a subordinate clause, not in the matrix clause that was of interest to the first search (ipmat-2vb.q).

Because the structures occur in two distinct boundary nodes (1 IP-MAT-SPE and 16 IP-SUB), this will count as two hits in the summary block, in contrast to the one hit that was found when remove_nodes was true. This explains why, after using "remove_nodes: true" in the initial search, a second search for pronoun objects finds fewer hits than when the initial search was conducted with "remove_nodes: false."
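The way hits are counted for a single token can be sketched in Python as the number of distinct boundary nodes named in its result block. This only illustrates the counting rule described in this chapter and is not CorpusSearch's implementation; count_hits is a hypothetical name, and the result lines are copied from the 35.1106 examples above.

```python
def count_hits(result_lines):
    """Count distinct boundary nodes among the lines of one token's result block."""
    boundaries = {line.split(":", 1)[0].strip()
                  for line in result_lines if line.strip()}
    return len(boundaries)

# pro-obj.q run on output made with remove_nodes true
with_removal = ["1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you"]

# pro-obj.q run on output made with remove_nodes false
without_removal = ["1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you",
                   "16 IP-SUB: 22 NP-OB1, 23 PRO hym"]

# ipmat-2vb.q: two structures inside the same boundary node
same_boundary = ["1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a",
                 "1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde"]

print(count_hits(with_removal))
print(count_hits(without_removal))
print(count_hits(same_boundary))
```

Each of the three result blocks belongs to a single token, so the token count is 1 in every case while the hit count varies with the number of distinct boundary nodes.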

Here's the summary block from the "remove_nodes: true" version:

/*
SUMMARY:
source files, hits/tokens/total:
  CMMALORY      177/176/875
whole search, hits/tokens/total
                177/176/875
*/

And here's the summary block from the "remove_nodes: false" version:

/*
SUMMARY:
source files, hits/tokens/total:
  CMMALORY      290/249/875
whole search, hits/tokens/total
                290/249/875
*/