LSCLASS programming library
LSCLASS is Lingsoft's software library and application programming interface, written in C, for Lingsoft Classifier, a Bayesian classifier for text documents.
Bayesian classification is a form of supervised learning: before the classifier can assign class labels to new items, it must first be taught by exposing it to a set of pre-classified items. New items are then classified on the basis of statistical evidence gathered from that set of examples. Only class labels that appear in the example set can be assigned to new items. Using a Bayesian classifier thus divides into two distinct phases: first, the learning phase, and second, the classification of new data. LSCLASS implements the latter.
Using LSCLASS thus presupposes that the teaching phase has already been performed. Lingsoft performs that phase using specifications and material provided by the customer. The material consists of two parts: first, a class scheme; and second, text documents classified according to that scheme. The result is a packed class scheme file and a set of classifier files containing statistical data extracted from the material. These files are delivered to the customer along with the LSCLASS library.
If the customer has several class schemes, perhaps for different kinds of material, the teaching phase must be performed separately for each scheme and its teaching material set. However, one and the same LSCLASS library may be used for all of them in the classifying phase.
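To make the classifying phase more concrete, here is a minimal sketch of how a classifier can pick a label from statistics gathered during teaching. It assumes a naive Bayes model, which is only one kind of Bayesian classifier; this page does not say which model Lingsoft Classifier uses. The vocabulary, labels, numbers, and the function names word_index and classify are all hypothetical and are not part of the LSCLASS API or its file formats.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical toy model, NOT the LSCLASS classifier file format.
   A teaching phase would produce numbers like these from example documents. */
#define N_CLASSES 2
#define N_WORDS   3

static const wchar_t *const vocab[N_WORDS] = { L"goal", L"match", L"election" };
static const wchar_t *const labels[N_CLASSES] = { L"sports", L"politics" };

/* Natural logarithms of the class priors (both 0.5). */
static const double log_prior[N_CLASSES] = { -0.6931, -0.6931 };

/* Natural logarithms of P(word | class), rounded to four decimals. */
static const double log_likelihood[N_CLASSES][N_WORDS] = {
    { -0.6931, -0.9163, -2.3026 },   /* sports:   0.5, 0.4, 0.1 */
    { -2.3026, -1.6094, -0.3567 }    /* politics: 0.1, 0.2, 0.7 */
};

/* Return the vocabulary index of word w, or -1 if it is unknown. */
int word_index(const wchar_t *w)
{
    for (int i = 0; i < N_WORDS; i++) {
        if (wcscmp(vocab[i], w) == 0)
            return i;
    }
    return -1;
}

/* Return the index (into labels) of the class with the highest score.
   Scores are sums of log probabilities, which avoids numeric underflow
   on long documents. Words outside the vocabulary are ignored. */
int classify(const wchar_t *const *tokens, size_t n_tokens)
{
    assert(tokens != NULL || n_tokens == 0);

    int best = 0;
    double best_score = -DBL_MAX;

    for (int c = 0; c < N_CLASSES; c++) {
        double score = log_prior[c];
        for (size_t t = 0; t < n_tokens; t++) {
            int w = word_index(tokens[t]);
            if (w >= 0)
                score += log_likelihood[c][w];
        }
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }
    return best;
}
```

Given the tokens "goal" and "match", classify returns 0, so labels[0] ("sports") would be assigned. Note that only labels present in the model can ever be returned, which mirrors the restriction described above.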
Using appropriate linguistic tools may greatly improve LSCLASS accuracy, especially when the documents are written in a language with rich inflectional morphology. Lingsoft provides morphological analysis components for Finnish, Swedish, Norwegian (Bokmål), Norwegian (Nynorsk), Danish, and German. Components from other providers may also be used, although some additional customization programming may then be needed.
LSCLASS supports Unicode. Text to be classified is expected to be in wchar_t buffers.
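Applications that hold text as multibyte char strings need to convert it before it can be placed in a wchar_t buffer. The sketch below shows one way to do that with the standard C function mbstowcs. The helper name to_wide is hypothetical and not part of LSCLASS; how the resulting buffer is passed to the library is not shown, since this page does not describe the LSCLASS function signatures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not an LSCLASS function.
   Convert a multibyte string in the current locale's encoding into a
   newly allocated, null-terminated wchar_t buffer.
   Returns NULL on allocation failure or invalid input; the caller frees. */
wchar_t *to_wide(const char *mb)
{
    assert(mb != NULL);

    /* Every wide character consumes at least one input byte, so
       strlen(mb) + 1 elements always leave room for the terminator. */
    size_t cap = strlen(mb) + 1;
    wchar_t *buf = malloc(cap * sizeof *buf);
    if (buf == NULL)
        return NULL;

    size_t n = mbstowcs(buf, mb, cap);
    if (n == (size_t)-1) {          /* invalid multibyte sequence */
        free(buf);
        return NULL;
    }
    return buf;
}
```

For non-ASCII input, the program would normally call setlocale(LC_ALL, "") from <locale.h> at startup so that mbstowcs decodes the text according to the user's locale, for example UTF-8. Without that call the default "C" locale applies, and only plain ASCII is guaranteed to convert.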
Copyright ©1986-2017, Lingsoft Ltd.