Coding

Contents of this chapter:

What is coding?
a coding file example
an output file example
how to search coding strings
just the codes

What is coding?

Coding is used for creating input to multivariate analysis programs like Varbrul; general statistical programming environments like S, Splus, and R; and statistical analysis packages like Datadesk, JMP, SAS, and SPSS.

Coding string values in a coding file may be determined partly automatically, by coding queries, and partly by hand, in a text editor. The resulting files can then serve as input to further searches.

a coding file example

Here's an example of a basic coding file, called "obj.c". All coding file names must end with ".c". To simplify our discussion, we show only the first four columns of an originally more complicated coding system.

node: IP*
coding_query:

1: {
        s: (IP*SPE* iDoms NP-OB*)
        n: ELSE
   }

2: {
        m: (IP-MAT* iDoms NP-OB*)
        s: (IP-SUB* iDoms NP-OB*)
        i: (IP-INF* iDoms NP-OB*)
        e: ELSE
   }

3: {
        t: ((IP* iDoms NEG)
          AND (NEG iDoms !ne))
        p: (IP* iDoms !NEG)
        n: ELSE
   }

4: {
        \1: (NP-OB* domsWords 1)
        \2: (NP-OB* domsWords 2)
        \3: (NP-OB* domsWords> 2)
        \0: ELSE
   }

In general, coding files have this form:

<PREAMBLE>
coding_query:

column_number: {
	label: condition
	label: condition 
	.
	.
	.
	}

The coding file begins with the preamble commands (see Command File chapter), which must include the obligatory bounding node for the coding queries. The obligatory query specification "coding_query:" then introduces the coding queries for each column of the output coding string.

In the present example, column 1 of the coding string will contain an "s" if the condition (IP*SPE* iDoms NP-OB*) holds. In all other cases, because of the "ELSE" function (used only in coding queries), the column will contain an "n".

Note that when numerals (0-9) are used as codes, they must be introduced with the backslash character ("\"), as illustrated in column 4 above.
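
The way a column's labels and its ELSE default combine into a coding string can be sketched in Python. This is only an illustration, not CorpusSearch's implementation: the dictionary representation of a node, the helper names idoms and code_node, and the rule that the first matching label wins are all assumptions, and label wildcards are approximated with Python's fnmatch module. Only columns 1 and 2 of the example are modelled.

```python
from fnmatch import fnmatchcase


def idoms(parent_pattern, daughter_pattern):
    """Return a test that mimics '(parent iDoms daughter)' on a toy node.

    A toy node is a dict with a "label" and a list of daughter labels
    (a hypothetical representation chosen for this sketch).
    """
    def test(node):
        return fnmatchcase(node["label"], parent_pattern) and any(
            fnmatchcase(d, daughter_pattern) for d in node["daughters"]
        )
    return test


# Each column: an ordered list of (label, condition) pairs plus the ELSE label.
COLUMNS = [
    ([("s", idoms("IP*SPE*", "NP-OB*"))], "n"),
    ([("m", idoms("IP-MAT*", "NP-OB*")),
      ("s", idoms("IP-SUB*", "NP-OB*")),
      ("i", idoms("IP-INF*", "NP-OB*"))], "e"),
]


def code_node(node, columns=COLUMNS):
    """Build a colon-separated coding string for one bounding node."""
    codes = []
    for conditions, else_label in columns:
        # Assumption: the first condition that holds supplies the code;
        # if none holds, the ELSE label is used.
        label = next((lab for lab, test in conditions if test(node)), else_label)
        codes.append(label)
    return ":".join(codes)


node = {"label": "IP-SUB", "daughters": ["NP-SBJ", "VBD", "NP-OB1"]}
print(code_node(node))  # n:s
```

The result agrees with the first two columns of the coding string shown for the sample sentence in the output file example.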

Coding query files are alternatives to ordinary query files in a CorpusSearch run. So, to code a file, invoke CorpusSearch as follows:

java CorpusSearch <coding_file.c> <file_to_code>
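
When coding runs are launched from a script, the command above can be assembled programmatically. The sketch below is hypothetical: the function name build_coding_command and the file name corpus.psd are not part of CorpusSearch. It only checks the ".c" naming rule and returns the argument list, leaving the actual launch (and any Java classpath setup your installation needs) to the caller.

```python
from typing import List


def build_coding_command(coding_file: str, file_to_code: str) -> List[str]:
    """Return the argument list for a coding run, as described above."""
    # Coding file names must end with ".c".
    if not coding_file.endswith(".c"):
        raise ValueError("coding file name must end with '.c': " + coding_file)
    return ["java", "CorpusSearch", coding_file, file_to_code]


# "corpus.psd" is a made-up input file name used only for illustration.
command = build_coding_command("obj.c", "corpus.psd")
print(" ".join(command))  # java CorpusSearch obj.c corpus.psd
```

The returned list can be passed to subprocess.run if the run is to be started from Python.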

an output file example

Output files resulting from coding carry the extension .cod. They contain every token of the input file, with a coding node inserted at every bounding node. A coding node has the form:

(CODING-<node_label> <coding_string>)

The node label suffix of the CODING node is the full label of the bounding node to which the coding string has been added as a daughter. If a given sentence contains more than one instance of the bounding node, the output sentence will contain multiple coding nodes. Here's a sentence from the output file resulting from the above coding file:

/~*
knewe kyndes & complexciones of men & of bestus
(CMHORSES,85.2)
*~/

( (IP-SUB (CODING-IP-SUB n:s:p)
          (NP-SBJ *T*-1)
          (VBD knewe)
          (NP-OB1 (NS kyndes)
                  (CONJ &)
                  (NS complexciones)
                  (PP (PP (P of)
                          (NP (NS men)))
                      (CONJP (CONJ &)
                             (PP (P of)
                                 (NP (NS bestus)))))))
  (ID CMHORSES,85.2))
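
Coding nodes of this shape are easy to pull out of a .cod file with a script. The following Python sketch assumes that node labels and coding strings contain no whitespace or parentheses; the names CODING_NODE and extract_codings are illustrative only and not part of CorpusSearch.

```python
import re

# Matches "(CODING-<label> <coding_string>)" and captures label and string.
# Assumption: neither part contains whitespace or parentheses.
CODING_NODE = re.compile(r"\(CODING-([^\s()]+)\s+([^\s()]+)\)")


def extract_codings(text):
    """Return (bounding node label, coding string) pairs found in text."""
    return [(m.group(1), m.group(2)) for m in CODING_NODE.finditer(text)]


sample = """( (IP-SUB (CODING-IP-SUB n:s:p)
          (NP-SBJ *T*-1)
          (VBD knewe))
  (ID CMHORSES,85.2))"""

print(extract_codings(sample))  # [('IP-SUB', 'n:s:p')]
```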

The same file can be coded multiple times with different bounding node choices and different coding query files. These different coding strings can be printed out separately for statistical analysis (see below). If desired, the coding strings associated with different bounding nodes can be concatenated, using a corpus revision query. See the section on the Concat function in the corpus revision chapter. Concatenating coding strings allows information about the structures of different syntactic domains to be included in a single coding string for statistical analysis.

how to search coding strings

Coding strings may be searched with the "column" function. For instance, to find all bounding nodes whose coding string contains "m" or "p" in the 7th column, use this query:

query:  (CODING-XP column 7 m|p)

where "XP" stands for the node label suffix of the codes being searched.
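
The same test can be mimicked outside CorpusSearch when coding strings have already been extracted. This Python sketch assumes that codes are separated by colons, as in the sample output, and that columns are counted from 1; column_matches is a hypothetical helper name.

```python
def column_matches(coding_string, column, allowed):
    """Check whether the given 1-based column holds one of the allowed codes."""
    codes = coding_string.split(":")
    # A column that does not exist in the string never matches.
    if column < 1 or column > len(codes):
        return False
    return codes[column - 1] in allowed


# Analogue of "column 7 m|p" on a made-up seven-column coding string.
print(column_matches("n:s:p:3:a:b:m", 7, {"m", "p"}))  # True
print(column_matches("n:s:p", 7, {"m", "p"}))          # False
```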

just the codes

To obtain a file with all of the coding strings in a coded file and nothing else, use print_only as follows.

print_only: CODING*

The trailing asterisk is necessary here because the CODING nodes of a file may contain different node label suffixes (e.g., both CODING-IP-MAT and CODING-CP-QUE). In order to restrict your output to coding strings for particular categories, use a print_only command that makes explicit reference to the appropriate node label, as in the following:

print_only: CODING-IP*
          or
print_only: CODING-IP-MAT*

The extension of the resultant output file will be .ooo. If the line

add_IDs: t

is added to the query file, each coding string will have the ID of its sentence token appended to it.
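
Once the coding strings are in a file of their own, a first tabulation takes only a few lines of Python. The sketch assumes one colon-separated coding string per line with no appended ID; the exact layout of the .ooo file is not described here, so adjust the parsing to what your file actually contains. tabulate_columns is a hypothetical helper, not a CorpusSearch feature.

```python
from collections import Counter


def tabulate_columns(coding_strings):
    """Count how often each code occurs in each column.

    Returns a list of Counters, one per column (index 0 = column 1).
    """
    counts = []
    for line in coding_strings:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for i, code in enumerate(line.split(":")):
            if i == len(counts):
                counts.append(Counter())  # first time this column is seen
            counts[i][code] += 1
    return counts


# Made-up coding strings for illustration.
table = tabulate_columns(["n:s:p", "n:m:p", "s:m:t"])
print(dict(table[0]))  # {'n': 2, 's': 1}
```

Per-column counts of this kind are a quick sanity check before the strings are handed to a statistics package.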

Note on compatibility: In older versions of CorpusSearch (those before version 74), coding strings were labeled simply CODING and lacked a node label suffix. In consequence, the command

print_only: CODING

will create a file of coding strings from coded files created by such a version of CorpusSearch.
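
Which CODING nodes a given print_only argument selects can be illustrated with Python's fnmatch module, on the assumption that the trailing asterisk behaves like a shell-style wildcard. This is an analogy rather than a description of CorpusSearch's own matcher, and selected_labels is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def selected_labels(pattern, labels):
    """Return the labels that the wildcard pattern would select."""
    return [label for label in labels if fnmatchcase(label, pattern)]


# Labels from newer coded files, plus the bare label used by older versions.
labels = ["CODING-IP-MAT", "CODING-IP-SUB", "CODING-CP-QUE", "CODING"]

print(selected_labels("CODING*", labels))         # all four labels
print(selected_labels("CODING-IP*", labels))      # ['CODING-IP-MAT', 'CODING-IP-SUB']
print(selected_labels("CODING-IP-MAT*", labels))  # ['CODING-IP-MAT']
print(selected_labels("CODING", labels))          # ['CODING']
```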