A sequence of word filters is used to eliminate terms in the database that do not discriminate document content, resulting in a filtered word set and a topic word set whose members are highly predictive of content. These two word sets are then formed into a two-dimensional matrix whose entries are calculated as the conditional probability that a document will contain the word in a row given that it contains the word in a column. The matrix representation allows the resultant vectors to be used to interpret document contents.
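The conditional-probability matrix described above can be illustrated with a minimal sketch. It is not the patented implementation. The function name `conditional_matrix`, the representation of each document as a list of words, and the choice of which word set supplies the rows and which the columns are all assumptions made for illustration. Each entry is estimated by simple document counts: the number of documents containing both words divided by the number containing the column word.

```python
from typing import List, Sequence


def conditional_matrix(
    docs: Sequence[Sequence[str]],
    row_words: Sequence[str],
    col_words: Sequence[str],
) -> List[List[float]]:
    """Estimate P(document contains row word | document contains column word).

    Hypothetical helper: entries are plain count ratios over the documents.
    A column word that appears in no document yields 0.0 for its whole column.
    """
    # Word presence is all that matters, so reduce each document to a set.
    doc_sets = [set(doc) for doc in docs]

    matrix: List[List[float]] = []
    for row_word in row_words:
        row: List[float] = []
        for col_word in col_words:
            # Documents containing the column word (the conditioning event).
            with_col = [d for d in doc_sets if col_word in d]
            # Of those, the documents that also contain the row word.
            with_both = sum(1 for d in with_col if row_word in d)
            row.append(with_both / len(with_col) if with_col else 0.0)
        matrix.append(row)
    return matrix
```

In this sketch, each row of the returned matrix is a vector describing how one word co-occurs with every column word, which is the kind of vector the text refers to when it speaks of interpreting document contents.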
SYSTEM FOR INFORMATION DISCOVERY
This invention was made with Government support under Contract DE-AC06-76RLO 1830 awarded by the U.S. Department of Energy. The Government has certain rights in the invention.