Documentation and updates for `libbow' are available at http://www.cs.cmu.edu/~mccallum/bow
Rainbow is a C program that performs document classification using one of several different methods, including naive Bayes, TFIDF/Rocchio, K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's Probabilistic Indexing, and a simple-minded form of shrinkage with naive Bayes.
Rainbow's accompanying library, `libbow', is a library of C code intended for support of statistical text-processing programs. The current source distribution includes the library, a text classification front-end (rainbow), a simple TFIDF-based document retrieval front-end (arrow), an AltaVista-style document retrieval front-end (archer), and an unsupported document clustering front-end with hierarchical clustering and deterministic annealing (crossbow).
The library provides facilities for:
* Recursively descending directories, finding text files.
* Finding `document' boundaries when there are multiple docs per file.
* Tokenizing a text file, according to several different methods.
* Including N-grams among the tokens.
* Mapping strings to integers and back again, very efficiently.
* Building a sparse matrix of document/token counts.
* Pruning vocabulary by occurrence counts or by information gain.
* Building and manipulating word vectors.
* Setting word vector weights according to NaiveBayes, TFIDF, and a simple form of Probabilistic Indexing.
* Scoring queries for retrieval or classification.
* Writing all data structures to disk in a compact format.
* Reading the document/token matrix from disk in an efficient, sparse fashion.
* Performing test/train splits, and automatic classification tests.
* Operating in server mode, receiving and answering queries over a socket.
It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, IRIX and HP-UX. Six months ago, it compiled on Windows NT (with a GNU build environment); it would probably work again with little effort. Patches to the code are most welcome.
It is relatively efficient. Reading, tokenizing and indexing the raw text of 20,000 Usenet articles takes about 3 minutes. Building a naive Bayes classifier from 10,000 articles and classifying the other 10,000 takes about 1 minute.
The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).
The library does not:
* Have parsing facilities.
* Do smoothing across N-gram models.
* Claim to be finished.
* Have good documentation.
* Claim to be bug-free.
* ...many other things.
Pronunciation guide: "libbow" rhymes with "lib-low", not "lib-cow".
Notes from Devika:
* How to delimit documents.
* How to tag things; how to augment the lexers.
* Lead in gently, steps. Big picture... more and more interesting things.
* Variety of examples.
* Guide to sea of command-line references.
* Structure.
* When to consider using which switch.
* Sensible defaults.
Lexer buffers, Lexers
3.1 The Simple Lexer
3.2 The N-Gram Lexer
3.3 The Email/News Lexer
3.4 The HTML Lexer
3.5 Functions Useful for Writing Lexers
Apply the Porter stemming algorithm to modify word. Return 0 on success.
A function wrapper around POSIX's isalpha macro.
A function wrapper around POSIX's isgraph macro.
Return non-zero if word is on the stoplist.
Add to the stoplist the whitespace-delimited words from filename. Return the number of words added. If the file could not be opened, return -1.
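The stoplist behavior described above (membership test, and loading whitespace-delimited words from a file with -1 on open failure) can be sketched as follows. This is a minimal illustration, not libbow's implementation: every name beginning with sketch_ or SKETCH_ is hypothetical, the fixed-size table and linear search are simplifying assumptions, and skipping words already on the list is a choice made here rather than something the manual specifies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limits chosen for this sketch only. */
#define SKETCH_MAX_STOPWORDS 1024
#define SKETCH_MAX_WORD_LEN 63

static char sketch_stoplist[SKETCH_MAX_STOPWORDS][SKETCH_MAX_WORD_LEN + 1];
static int sketch_stoplist_count = 0;

/* Return non-zero if word is on the stoplist (linear search for brevity). */
int sketch_stoplist_present(const char *word)
{
  assert(word != NULL);
  for (int i = 0; i < sketch_stoplist_count; i++)
    if (strcmp(sketch_stoplist[i], word) == 0)
      return 1;
  return 0;
}

/* Add the whitespace-delimited words in filename to the stoplist.
   Return the number of words added, or -1 if the file cannot be opened.
   Words already on the list are skipped and not counted. */
int sketch_stoplist_add_from_file(const char *filename)
{
  FILE *fp = fopen(filename, "r");
  if (fp == NULL)
    return -1;

  char word[SKETCH_MAX_WORD_LEN + 1];
  int added = 0;
  /* "%63s" skips leading whitespace and reads one token; a token longer
     than 63 characters is split across successive reads. */
  while (sketch_stoplist_count < SKETCH_MAX_STOPWORDS
         && fscanf(fp, "%63s", word) == 1) {
    if (!sketch_stoplist_present(word)) {
      strcpy(sketch_stoplist[sketch_stoplist_count++], word);
      added++;
    }
  }
  fclose(fp);
  return added;
}
```

A lexer would typically call the membership test on each token and drop the token when the result is non-zero.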
4.1 Generic Maps between Integers and Strings
4.2 The Global Dictionary
Allocate, initialize and return a new int/string mapping structure. The parameter capacity is used as a hint about the number of words to expect; if you don't know or don't care about a capacity value, pass 0, and a default value will be used.
Given an integer index, return its corresponding string.
Given the char-pointer string, return its integer index. If this is the first time we're seeing string, add it to the mapping, assign it a new index, and return the new index.
Given the char-pointer string, return its integer index. If string is not yet in the mapping, return -1.
Write the int-str mapping to file-pointer fp.
Return a new int-str mapping, created by reading file-pointer fp.
Return a new int-string mapping, created by reading filename.
Free the memory held by the int-string mapping map.
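The difference between the two string-to-index lookups described above (one adds unseen strings, the other returns -1) is easiest to see in code. The following is a minimal sketch under stated assumptions: the sketch_ names and the structure layout are hypothetical and are not libbow's own, and the lookup is a linear scan for brevity where an efficient mapping would use hashing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure for this sketch; not libbow's own layout. */
typedef struct {
  char **strs;    /* index -> string */
  int count;      /* number of strings stored */
  int capacity;   /* allocated slots in strs */
} sketch_map;

/* Create a map. A capacity of 0 (or less) selects a default size hint. */
sketch_map *sketch_map_new(int capacity)
{
  sketch_map *m = malloc(sizeof *m);
  if (m == NULL)
    return NULL;
  m->capacity = capacity > 0 ? capacity : 64;
  m->count = 0;
  m->strs = malloc((size_t)m->capacity * sizeof *m->strs);
  if (m->strs == NULL) {
    free(m);
    return NULL;
  }
  return m;
}

/* Index -> string, or NULL if the index is out of range. */
const char *sketch_int2str(const sketch_map *m, int index)
{
  return (index >= 0 && index < m->count) ? m->strs[index] : NULL;
}

/* String -> index without adding; return -1 if the string is absent. */
int sketch_str2int_no_add(const sketch_map *m, const char *s)
{
  for (int i = 0; i < m->count; i++)
    if (strcmp(m->strs[i], s) == 0)
      return i;
  return -1;
}

/* String -> index; an unseen string is copied in and given the next index.
   Returns -1 only if memory allocation fails. */
int sketch_str2int(sketch_map *m, const char *s)
{
  assert(m != NULL && s != NULL);
  int i = sketch_str2int_no_add(m, s);
  if (i >= 0)
    return i;
  if (m->count == m->capacity) {
    int ncap = m->capacity * 2;
    char **grown = realloc(m->strs, (size_t)ncap * sizeof *grown);
    if (grown == NULL)
      return -1;
    m->strs = grown;
    m->capacity = ncap;
  }
  size_t len = strlen(s) + 1;
  char *copy = malloc(len);
  if (copy == NULL)
    return -1;
  memcpy(copy, s, len);
  m->strs[m->count] = copy;
  return m->count++;
}

/* Free the strings, the index array and the map itself. */
void sketch_map_free(sketch_map *m)
{
  if (m == NULL)
    return;
  for (int i = 0; i < m->count; i++)
    free(m->strs[i]);
  free(m->strs);
  free(m);
}
```

Indices are handed out in order of first appearance, so the same string always maps back to the same index for the lifetime of the map.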
Given a "word index" wi, return its word, according to the global word-int mapping.
Given a word, return its "word index," according to the global word-int mapping; if it's not yet in the mapping, add it.
Like bow_word2int(), except it also increments the occurrence count associated with word.
If this is non-zero, then bow_word2int() will return -1 when asked for the index of a word that is not already in the mapping. Essentially, setting this to non-zero makes bow_word2int() and bow_word2int_add_occurrence() behave like bow_str2int_no_add().
Add to the word occurrence counts by recursively descending directory dirname and lexing all the text files; skip any files matching exception_name.
Return the number of times bow_word2int_add_occurrence() was called with the word whose index is wi.
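The interplay between the lookup, the occurrence-counting lookup and the "do not add" flag can be sketched as follows. This is an illustration only: the sketch_ names, the fixed-size table and the linear search are assumptions made for brevity, and the flag variable here is a hypothetical stand-in for the one described above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical limits chosen for this sketch only. */
#define SKETCH_MAX_WORDS 256
#define SKETCH_MAX_LEN 31

static char sketch_words[SKETCH_MAX_WORDS][SKETCH_MAX_LEN + 1];
static int sketch_counts[SKETCH_MAX_WORDS];
static int sketch_num_words = 0;

/* When non-zero, lookups never add new words and return -1 instead. */
int sketch_do_not_add = 0;

/* Word -> index. Adds an unseen word unless sketch_do_not_add is set.
   Also returns -1 if the table is full or the word is too long. */
int sketch_word2int(const char *word)
{
  assert(word != NULL);
  for (int i = 0; i < sketch_num_words; i++)
    if (strcmp(sketch_words[i], word) == 0)
      return i;
  if (sketch_do_not_add || sketch_num_words == SKETCH_MAX_WORDS
      || strlen(word) > SKETCH_MAX_LEN)
    return -1;
  strcpy(sketch_words[sketch_num_words], word);
  sketch_counts[sketch_num_words] = 0;
  return sketch_num_words++;
}

/* Same lookup, but also bump the word's occurrence count when found. */
int sketch_word2int_add_occurrence(const char *word)
{
  int wi = sketch_word2int(word);
  if (wi >= 0)
    sketch_counts[wi]++;
  return wi;
}

/* Number of counted occurrences for index wi; 0 if wi is out of range. */
int sketch_occurrences_for_wi(int wi)
{
  return (wi >= 0 && wi < sketch_num_words) ? sketch_counts[wi] : 0;
}
```

Note that the plain lookup leaves the counts untouched, so only calls to the counting variant contribute to the total reported for an index.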
Replace the current word/int mapping with map.
Modify the int/word mapping by removing all words that occurred fewer than occur times. WARNING: This totally changes the word/int mapping; any wv's, wi2dvf's or barrel's you build with the old mapping will have bogus word indices afterward.
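To see why pruning invalidates previously stored word indices, consider the following sketch, which drops rare words and compacts the survivors. It is an assumption about one plausible way to prune, not a description of libbow's internals; the function name sketch_prune and the old2new table are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Remove every word whose count is below min_occur, compacting the
   surviving entries toward the front of both arrays. If old2new is not
   NULL, old2new[i] receives the new index of old word i, or -1 if that
   word was removed. Returns the number of words kept. */
int sketch_prune(const char **words, int *counts, int n,
                 int min_occur, int *old2new)
{
  assert(words != NULL && counts != NULL && n >= 0);
  int kept = 0;
  for (int i = 0; i < n; i++) {
    if (counts[i] >= min_occur) {
      words[kept] = words[i];
      counts[kept] = counts[i];
      if (old2new != NULL)
        old2new[i] = kept;
      kept++;
    } else if (old2new != NULL) {
      old2new[i] = -1;
    }
  }
  return kept;
}
```

In this sketch a word that was at index 2 can end up at index 1 once an earlier word is dropped, so any structure still holding the old index 2 would now refer to a different word or to nothing at all. That is the hazard the warning above describes.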
Return the total number of unique words in the int/word map.
Save the int/word map to file-pointer fp.
Same as above, but with a filename instead of a FILE*.
Read the int/word map from file-pointer fp.
Same as above, but with a filename instead of a FILE*.
Same as above, but don't bother rereading unless filename is different from the last one, or force_update is non-zero.
5.1 Creating a Word Vector from a Text File
5.2 Writing and Reading Word Vectors as Data Files
10.1 Arrays indexed by integers
10.2 Arrays indexed by strings
This document was generated by root on October, 27 2005 using texi2html 1.76.