cts:classify(
  $data-nodes as node()*,
  $classifier as element(cts:classifier),
  $options as element()?,
  $training-nodes as node()*
) as element(cts:label)*
Summary:
Classifies a sequence of nodes based on training data. The training data
is in the form of a classifier specification, which is generated by
cts:train. Returns a label for each input node, in the same
order as the input nodes.
|
Parameters:
$data-nodes
:
The sequence of nodes to be classified.
|
$classifier
:
An element node containing the classifier specification. This is typically
the output of cts:train, either run directly or saved in an
XML document in the database.
|
$options
:
An options element. The options for classification are passed
automatically from cts:train to the cts:classifier
specification as part of the classifier element so that they are
consistent with the parameters used in training. The following option
may be separately passed to cts:classify and is in the
cts:classify namespace:
<thresholds>
- A definition of the thresholds to use in classification. This is
a complex element with one or more <threshold> children.
You can specify both a global value and per-class values (as computed by
cts:thresholds). The global value applies to
any class for which a per-class value is not specified. For example:
<options xmlns="cts:classify">
  <thresholds>
    <threshold>-1.0</threshold>
    <threshold class="Example 1">-2.42</threshold>
  </thresholds>
</options>
|
$training-nodes
:
The sequence of training nodes used to train the classifier.
Required if the supports form of the classifier is used;
ignored if the weights form of the classifier is used.
|
|
Usage Notes:
cts:classify classifies a sequence of nodes using
the output from cts:train. The $data-nodes
and $classifier parameters are respectively the nodes to
be classified and the specification output from cts:train.
cts:classify can use either supports or
weights forms of the $classifier output
from cts:train (see Output
Formats). If the supports form is used, the training
nodes must be passed as the 4th parameter. The $options
parameter is an options element in the cts:classify namespace.
The output is a sequence of label elements of the form:
<cts:label>
  <cts:class name="Example 1" val="-0.003"/>
  <cts:class name="Example 2" val="1.4556"/>
  ...
</cts:label>
Each label corresponds to the data node in the corresponding position
in the input sequence. There is a <cts:class> child
for each class where the node passed the class threshold. The
val attribute gives the class membership value for the
data node in the given class. Values greater than zero indicate
likely class membership; values less than zero indicate likely
non-membership. Adjusting thresholds makes classification more or less
selective. Increasing the threshold makes classification more selective
(that is, it decreases the likelihood of classification in the
class). Decreasing the threshold makes classification less selective.
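Because each label can carry several class children, a common follow-up step is to pick the strongest class for each node. The following is a minimal sketch, not part of the API: local:best-class is a hypothetical helper, and the sample label simply repeats the shape shown above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: return the name of the class with the highest
   membership value in one label, or the empty sequence if no class
   passed its threshold. :)
declare function local:best-class(
  $label as element(cts:label)
) as xs:string?
{
  (
    for $class in $label/cts:class
    order by xs:double($class/@val) descending
    return fn:string($class/@name)
  )[1]
};

(: Sample label; in practice this comes from cts:classify. :)
let $label :=
  <cts:label>
    <cts:class name="Example 1" val="-0.003"/>
    <cts:class name="Example 2" val="1.4556"/>
  </cts:label>
return local:best-class($label)
(: => "Example 2" :)
```

Whether the highest value alone is a good decision rule depends on your thresholds; a value below zero still indicates likely non-membership.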
|
Example:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $secondhalf := xdmp:directory("/shakespeare/plays/", "1")[20 to 37]
let $classifier :=
  (: label each training play with the playtype found in its properties :)
  let $labels :=
    for $x in $firsthalf
    return
      <cts:label>
        <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                          //playtype/text()}"/>
      </cts:label>
  return
    cts:train($firsthalf, $labels,
      <options xmlns="cts:train">
        <classifier-type>supports</classifier-type>
      </options>)
return
  (: the supports form requires the training nodes as the 4th parameter :)
  cts:classify($secondhalf, $classifier,
    <options xmlns="cts:classify"/>,
    $firsthalf)
(: => ( <label>...</label>,... ) :)
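For comparison, the same flow can be sketched with the weights form of the classifier, where the training nodes are ignored at classification time. This variant reuses the directory and playtype property from the example above; passing the empty sequence as the fourth argument is a choice made here to match the signature shown at the top of this entry.

```xquery
xquery version "1.0-ml";

let $plays := xdmp:directory("/shakespeare/plays/", "1")
let $training := $plays[1 to 19]
let $unlabeled := $plays[20 to 37]
let $labels :=
  for $x in $training
  return
    <cts:label>
      <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                        //playtype/text()}"/>
    </cts:label>
let $classifier :=
  cts:train($training, $labels,
    <options xmlns="cts:train">
      <classifier-type>weights</classifier-type>
    </options>)
return
  (: weights form: the training nodes are ignored, so none are passed :)
  cts:classify($unlabeled, $classifier,
    <options xmlns="cts:classify"/>,
    ())
```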
|
|
|
|
cts:thresholds(
  $computed-labels as element(cts:label)*,
  $known-labels as element(cts:label)*,
  [$recall-weight as xs:double]
) as element(cts:thresholds)?
Summary:
Computes precision, recall, the F measure, and thresholds for the
classes computed by the classifier, by comparing the computed labels
with the known labels for the same set of documents.
|
Parameters:
$computed-labels
:
A sequence of element nodes containing the labels from classification
(the output from cts:classify) for a set of documents.
|
$known-labels
:
A sequence of element nodes containing the known labels for the same set
of documents.
|
$recall-weight
(optional):
The factor to use in the calculation of the F measure. The number should
be non-negative. A value of 0 means F is just precision and a value
of +INF means F is just recall. The default is 1, which gives the harmonic
mean between precision and recall.
|
|
Usage Notes:
You use the output of cts:thresholds to determine the best
threshold values for your data, based on a first pass through the first
part of your training data. The output of cts:thresholds
provides precision and recall measurements at the calculated
thresholds for each class. The following are the definitions of the
attributes of each class element in the thresholds element returned
by cts:thresholds:
name
- The name of the class.
threshold
- The threshold that is computed by the classifier to give the best
results. The threshold is used by
cts:classify when
classifying documents, and is defined to be the positive
or negative distance from the hyperplane which represents the edge of
the class.
precision
- A number which represents the fraction of nodes identified in a
class that are actually in that class. A value near 1 means few nodes
were wrongly assigned to the class; if recall is low at the same time,
the threshold is probably too selective (under-classification).
recall
- A number which represents the fraction of nodes in a class that
were identified by the classifier as being in that class. A value near 1
means few members of the class were missed; if precision is low at the
same time, the threshold is probably too permissive (over-classification).
F (the F-measure, reported in the f attribute)
- A combined measure of precision and recall at the given threshold.
How the two are weighted is set by $recall-weight. A weight of 1
gives precision and recall equal weight. A weight of 0.5
weights precision 2x recall. A weight of 2 weights recall 2x
precision. A weight of 0 makes F precision only, and a weight of +INF
(xs:double('+INF')) makes F recall only.
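The limiting cases described for $recall-weight (0 gives precision, +INF gives recall, 1 gives the harmonic mean) match the standard weighted F-measure. The formula below assumes that standard definition, with P for precision, R for recall, and β for the recall weight; this entry does not state the exact formula the server uses.

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R}
\qquad
\text{e.g. } \beta = 1,\ P = 1,\ R = 0.8:\quad
F_{1} = \frac{2 \cdot 1 \cdot 0.8}{1 + 0.8} \approx 0.888889
```

The worked case uses the COMEDY values from the sample output in the Example for this function, and reproduces the f value reported there.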
|
Example:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $labels :=
  for $x in $firsthalf
  return
    <cts:label>
      <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                        //playtype/text()}"/>
    </cts:label>
let $classifylabels :=
  let $classifier :=
    cts:train($firsthalf, $labels,
      <options xmlns="cts:train">
        <classifier-type>supports</classifier-type>
      </options>)
  return
    (: classify the training set itself, because its labels are known :)
    cts:classify($firsthalf, $classifier,
      <options xmlns="cts:classify"/>,
      $firsthalf)
return
  cts:thresholds($classifylabels, $labels)
(:
   This returns the computed thresholds for the plays in a
   Shakespeare database. For example:
   <thresholds xmlns="http://marklogic.com/cts">
     <class name="TRAGEDY" threshold="-0.0192594" precision="1"
        recall="1" f="1" count="8"/>
     <class name="COMEDY" threshold="0.934239" precision="1"
        recall="0.8" f="0.888889" count="5"/>
     <class name="HISTORY" threshold="0.101927" precision="1"
        recall="1" f="1" count="6"/>
   </thresholds>
:)
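To apply the computed thresholds in a later cts:classify call, the thresholds element has to be turned into an options element. The following is a minimal sketch of that conversion: local:threshold-options is a hypothetical helper, and the sample input repeats two of the classes from the output above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: copy each per-class threshold from the
   cts:thresholds output into a cts:classify options element. :)
declare function local:threshold-options(
  $thresholds as element(cts:thresholds)
) as element()
{
  <options xmlns="cts:classify">
    <thresholds>{
      for $class in $thresholds/cts:class
      return
        <threshold class="{$class/@name}">{
          fn:string($class/@threshold)
        }</threshold>
    }</thresholds>
  </options>
};

(: Sample input; in practice this comes from cts:thresholds. :)
let $thresholds :=
  <thresholds xmlns="http://marklogic.com/cts">
    <class name="TRAGEDY" threshold="-0.0192594" precision="1"
       recall="1" f="1" count="8"/>
    <class name="COMEDY" threshold="0.934239" precision="1"
       recall="0.8" f="0.888889" count="5"/>
  </thresholds>
return local:threshold-options($thresholds)
```

The result can be passed as the $options argument of cts:classify. The sketch sets no global threshold, so any class missing from the input falls back to the default.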
|
|
|
|
cts:train(
  $training-nodes as node()*,
  $labels as element(cts:label)*,
  [$options as element()?]
) as element(cts:classifier)?
Summary:
Produces a set of classifiers from a list of
labeled training documents.
|
Parameters:
$training-nodes
:
The sequence of training nodes. These are nodes that represent
members of the classes.
|
$labels
:
A sequence of labels for the training nodes, in the order corresponding
to the training nodes.
|
$options
(optional):
An XML representation of the options for defining the training
parameters. The options node must be in the cts:train
namespace. The following is a sample options node:
<options xmlns="cts:train">
  <classifier-type>supports</classifier-type>
  <kernel>geodesic</kernel>
</options>
The cts:train options include:
<classifier-type>
- A string defining the kind of classifier to produce, either
weights or supports. The default is
weights.
<kernel>
- A string defining which function to use for comparing documents.
The default is
sqrt. Normalization (the values that end in
-normalized) brings document vectors into the unit sphere,
which may improve the mathematical properties of the calculations.
Possible values are:
simple
- Model documents as 1 or 0 for presence or absence of each term.
simple-normalized
- Like
simple, but normalized by square root of
document length.
sqrt
- Model documents using the square root of the term frequencies.
sqrt-normalized
- Like
sqrt, but normalized by the sum of the term
frequencies.
linear-normalized
- Model documents as the term frequencies normalized by the
square root of the sum of the squares of the term frequencies.
gaussian
- Compare documents using the Gaussian of the term frequencies.
Requires a
classifier-type of supports.
geodesic
- Compare documents using the Riemann geodesic distance over
term frequencies. Requires a
classifier-type of
supports.
<max-terms>
- An integer defining the maximum number of terms to use to
represent each document. If a positive number M is given, then the
M most discriminating terms are used; other terms are dropped. The
default is 0 (unlimited).
<max-support>
- A double specifying the maximum influence a single training node
can have. This parameter has a strong influence on performance.
The default value of 1.0 should work well in most cases. Larger
values mean greater sensitivity and may improve accuracy on small
datasets, but give longer running times. Smaller values mean less
sensitivity, better resistance to mis-classified documents, and
shorter running times.
<min-weight>
- A double specifying the minimum weight a term can have and still
be considered for inclusion in the term vector. This parameter only applies
to the term weight form of the classifier. Smaller values mean longer
term vectors and, as a consequence, longer running times and greater memory
consumption during classification, but may also improve accuracy.
The default is 0.01.
<tolerance>
- How close the final solutions to the constraint equations must be.
Smaller values lead to a greater number of iterations and longer
running times. Larger values lead to less precise classification.
The default is 0.01.
<epsilon>
- How close a value must be to 0 to be counted as equal to 0.
Since double arithmetic is not precise, setting this value to exactly
0 will likely lead to non-convergence of the algorithm. Smaller
values lead to a greater number of iterations and longer running
times. Larger values lead to less precise classification.
The default is 0.01.
<max-iterations>
- The maximum number of iterations of the constraint satisfaction
algorithm to run. The algorithm usually converges very quickly,
so this parameter usually has no effect unless it is set very low.
The default is 500.
<thresholds>
- A definition of the thresholds to use in classification. This is
a complex element with one or more <threshold>
children. You can specify both a global value and per-class values
(as computed by cts:thresholds). The global value
applies to any class for which a per-class value is not
specified. For example:
<options xmlns="cts:train">
  <thresholds>
    <threshold>-1.0</threshold>
    <threshold class="Example 1">-2.42</threshold>
  </thresholds>
</options>
For the initial tuning phase of training, leave this parameter at
its default value, which is a very large negative
number (-10E30). This allows you to accurately compute the
threshold values when you run cts:thresholds on the initial
training data. You can then use the calculated threshold values
when you run the second pass, through the second part of your training
data.
The options element also includes indexing options in the
http://marklogic.com/xdmp/database namespace.
These control which terms to use. Note that the use of certain
options, such as fast-case-sensitive-searches, will not
impact final results unless the term vector size is limited with
the max-terms option. Other options, such as
phrase-throughs, will only generate terms if some
other option is also enabled (in this case
fast-phrase-searches).
These database options include the following (shown here with
a db prefix bound to the
http://marklogic.com/xdmp/database namespace):
<db:word-searches>
- Include terms for the words in the node.
<db:stemmed-searches>
- Include terms for the stems in the node.
<db:fast-case-sensitive-searches>
- Include terms for case-sensitive variations of the words in the
node.
<db:fast-diacritic-sensitive-searches>
- Include terms for diacritic-sensitive variations of the words in
the node.
<db:fast-phrase-searches>
- Include terms for two-word phrases in the node.
<db:phrase-throughs>
- If phrase terms are included, include terms for phrases that cross
the given elements.
<db:phrase-arounds>
- If phrase terms are included, include terms for phrases that skip
over the given elements.
<db:fast-element-word-searches>
- Include terms for words in particular elements.
<db:fast-element-phrase-searches>
- Include terms for phrases in particular elements.
<db:element-word-query-throughs>
- Include terms for words in sub-elements of the given elements.
<db:fast-element-character-searches>
- Include terms for characters in particular elements.
<db:range-element-indexes>
- Include terms for data values in specific elements.
<db:range-element-attribute-indexes>
- Include terms for data values in specific attributes.
<db:one-character-searches>
- Include terms for single characters.
<db:two-character-searches>
- Include terms for two-character sequences.
<db:three-character-searches>
- Include terms for three-character sequences.
<db:trailing-wildcard-searches>
- Include terms for trailing wildcards.
<db:fast-element-trailing-wildcard-searches>
- If trailing wildcard terms are included, include terms for
trailing wildcards by element.
<db:fields>
- Include terms for the defined fields.
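The training options and the indexing options can be combined in one options node. The sketch below is illustrative only. It assumes the db elements take the same true/false content they have in a database configuration, and the max-terms value of 1000 is an arbitrary choice; check both against your MarkLogic version.

```xquery
xquery version "1.0-ml";

(: Assumptions: the indexing options use true/false content as in a
   database configuration, and 1000 is an arbitrary cap on the number
   of terms per document. :)
let $options :=
  <options xmlns="cts:train"
           xmlns:db="http://marklogic.com/xdmp/database">
    <classifier-type>weights</classifier-type>
    <max-terms>1000</max-terms>
    <db:word-searches>true</db:word-searches>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
  </options>
return $options
```

The resulting node would be passed as the $options argument of cts:train.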
|
|
Usage Notes:
The elements in the label sequence should match one for one with the nodes
in the training node sequence. The first label element describes the first node
in the training node sequence, the second label element describes the second
node in the training node sequence, and so on.
If there are more labels than training nodes or more training nodes
than labels, an error is raised.
The format of each label element is:
<cts:label name="Node1">
  <cts:class name="Example1"/>
  <cts:class name="Example2" val="-1"/>
  : :
</cts:label>
Each class listed indicates whether the corresponding node in the training
sequence is in the given class. Examples are taken to be positive examples
unless specified otherwise (with a val attribute of -1).
The document is assumed to be a negative example of any classes that are
not explicitly listed.
The name attribute on the label element is an optional name for the labelled
node. It is purely for human consumption to help in tuning the classification
parameters.
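Labels in this format can be built programmatically. The following is a minimal sketch, assuming you already know which classes each training node belongs to; local:label is a hypothetical helper, not part of the API.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: build one label from the classes the node is a
   positive example of and the classes it is an explicit negative
   example of. :)
declare function local:label(
  $name as xs:string,
  $positive as xs:string*,
  $negative as xs:string*
) as element(cts:label)
{
  <cts:label name="{$name}">{
    (for $c in $positive return <cts:class name="{$c}"/>),
    (for $c in $negative return <cts:class name="{$c}" val="-1"/>)
  }</cts:label>
};

local:label("Node1", "Example1", "Example2")
(: => <cts:label name="Node1">
        <cts:class name="Example1"/>
        <cts:class name="Example2" val="-1"/>
      </cts:label> :)
```

Explicit negatives are optional, because a node is assumed to be a negative example of any class that is not listed.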
Output Formats
A linear classifier is defined by a weight vector w on terms, and
an offset value b. The <weights/> node encodes the weight vector
directly. Its children are the classes, and each class includes
a list of terms. The term node uses an internal id to identify the term
and a term weight:
<weights>
  <class name="Example1" offset="2.04">
    <term id="43587329645324245" val="0.3423432"/>
    <term id="47893427895432534" val="-0.12345556"/>
    : :
  </class>
  :
</weights>
The weight vector w is a linear combination of the training documents
themselves, so the classifier can also be expressed in terms of those
documents, which may be more convenient. For instance, if the number of
terms is not limited, the <weights/> node can be extremely large. The
weight vector form cannot be used if the classifier kernel is non-linear,
that is, with the gaussian or geodesic kernel.
The support vector representation of the classifier includes a
supports node that has <class/> children for each class. Here the
class elements contain a list of doc elements which identify the specific
training nodes using an internal key. This internal key is valid across
queries only for nodes in the database. Each doc element has a val
attribute encoding the weight of that document and an err attribute
which shows how well the document fit the classifier. Errors with a
magnitude greater than about 1.5 indicate potentially
mis-classified documents.
<supports>
  <class name="Example1" offset="2.04">
    <doc id="155584958759" name="Node102" val="-0.00334163" err="1.4"/>
    <doc id="594064848864" name="Node57" val="0.025341234" err="-2.3"/>
    : :
  </class>
  :
</supports>
Each class is identified by a unique name.
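The error values in the supports form can be used to list training nodes that may be mis-labelled. The following is a minimal sketch: local:suspect-docs and the suspect element are hypothetical, the 1.5 limit follows the rule of thumb above, and the wildcard name tests are used so the sketch makes no assumption about the namespace of the supports elements.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: report every support document whose error
   magnitude exceeds the given limit. :)
declare function local:suspect-docs(
  $classifier as element(),
  $limit as xs:double
) as element()*
{
  for $class in $classifier/descendant-or-self::*:supports/*:class
  for $doc in $class/*:doc[fn:abs(xs:double(@err)) gt $limit]
  return
    <suspect class="{$class/@name}" name="{$doc/@name}" err="{$doc/@err}"/>
};

(: Sample input in the shape shown above; in practice, pass the
   output of cts:train. :)
let $supports :=
  <supports>
    <class name="Example1" offset="2.04">
      <doc id="155584958759" name="Node102" val="-0.00334163" err="1.4"/>
      <doc id="594064848864" name="Node57" val="0.025341234" err="-2.3"/>
    </class>
  </supports>
return local:suspect-docs($supports, 1.5)
(: => <suspect class="Example1" name="Node57" err="-2.3"/> :)
```

The name attribute is only populated if the training labels carried names, which is one reason to name your labels during tuning.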
|
Example:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $labels :=
  for $x in $firsthalf
  return
    <cts:label>
      <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                        //playtype/text()}"/>
    </cts:label>
return
  cts:train($firsthalf, $labels,
    <options xmlns="cts:train">
      <classifier-type>supports</classifier-type>
    </options>)
(: => <cts:classifier>... :)
|
|
|