A set of known protein sequences associated with an organism is identified, wherein each known protein sequence comprises a plurality of ordered residues. A set of scores associated with a set of residues of the plurality of ordered residues is identified, wherein each score indicates a frequency of a residue in its sequence context. A set of unique sub-sequences of the set of known protein sequences is identified. A plurality of protein signature residues is determined based on the set of scores associated with the set of residues and the set of unique sub-sequences.
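The steps summarized above can be illustrated with a minimal sketch. It is not the claimed method; several details the abstract leaves open are filled in here as assumptions. The "sequence context" of a residue is assumed to be its immediate left and right neighbors, the score is assumed to be the fraction of occurrences of that context in which the residue appears, "unique" is assumed to mean a fixed-length sub-sequence absent from a hypothetical background set of sequences, and a residue is assumed to qualify as a signature residue when it lies in a unique sub-sequence and its score meets a threshold. The names `context_scores`, `unique_subsequences`, `signature_residues`, `background`, `k`, and `min_score` are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield (start, sub-sequence) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def context_scores(sequences):
    """Assumed scoring: for each interior residue, the fraction of
    (left, right) neighbor contexts that are filled by that residue."""
    triple = Counter()   # counts of (left, residue, right)
    context = Counter()  # counts of (left, right), any residue between
    for seq in sequences:
        for _, (a, b, c) in kmers(seq, 3):
            triple[(a, b, c)] += 1
            context[(a, c)] += 1
    scores = {}
    for s_idx, seq in enumerate(sequences):
        for i, (a, b, c) in kmers(seq, 3):
            # The scored residue sits at index i + 1, the window's center.
            scores[(s_idx, i + 1)] = triple[(a, b, c)] / context[(a, c)]
    return scores


def unique_subsequences(sequences, background, k):
    """Assumed uniqueness: k-length sub-sequences of the known sequences
    that do not occur in any background sequence."""
    seen_in_background = {sub for seq in background for _, sub in kmers(seq, k)}
    candidates = {sub for seq in sequences for _, sub in kmers(seq, k)}
    return candidates - seen_in_background


def signature_residues(sequences, background, k=4, min_score=0.5):
    """Assumed selection rule: residues inside a unique sub-sequence
    whose context score is at least min_score."""
    scores = context_scores(sequences)
    unique = unique_subsequences(sequences, background, k)
    selected = set()
    for s_idx, seq in enumerate(sequences):
        for start, sub in kmers(seq, k):
            if sub not in unique:
                continue
            for pos in range(start, start + k):
                # Terminal residues have no two-sided context, so no score.
                if scores.get((s_idx, pos), 0.0) >= min_score:
                    selected.add((s_idx, pos, seq[pos]))
    return sorted(selected)


if __name__ == "__main__":
    known = ["MKTAYIAK", "MKTAYLAK"]  # toy sequences, not real proteins
    other = ["MKTAGGGG"]              # toy background set
    for entry in signature_residues(known, other, k=4, min_score=0.75):
        print(entry)
```

Each returned tuple holds a sequence index, a zero-based residue position, and the residue letter. The window length and threshold are arbitrary illustrative values; a different context definition or selection rule would change which residues are reported.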
STATEMENT REGARDING FEDERALLY FUNDED RESEARCH
This invention was made in the course of or under prime Contract No. DE-AC52-07NA27344 between the U.S. Department of Energy and Lawrence Livermore National Security, LLC. This Record of Invention is prepared for the Office of the Assistant General Counsel for Patents, U.S. Department of Energy.