alignment-free sequence comparison using spaced words
Spaced (Words) is a new approach to alignment-free sequence
comparison. While most alignment-free algorithms compare the
word composition of sequences, Spaced uses a pattern of match and
don't care positions. The occurrence of a spaced word in a sequence
is then defined by the characters at the match positions only, while
the characters at the don't care positions are ignored. Instead of
comparing the frequencies of contiguous words in the input sequences,
this new approach compares the frequencies of the spaced words according
to the pre-defined pattern. An information-theoretic distance measure
is then used to define pairwise distances on the set of input sequences
based on their spaced-word frequencies. Systematic test runs on real and
simulated sequence sets have shown that, for phylogeny reconstruction,
this multiple-spaced-words approach is far superior to the classical
alignment-free approach based on contiguous word frequencies.
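
The idea can be illustrated with a minimal Python sketch. It is not the
Spaced implementation: the function names, the '1'/'0' pattern encoding
(1 = match position, 0 = don't care position) and the choice of the
Jensen-Shannon divergence as the information-theoretic measure are all
assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def spaced_word_counts(seq, pattern):
    """Count spaced words in seq (hypothetical helper).

    pattern is a string of '1' (match) and '0' (don't care) characters;
    only the characters at match positions form the spaced word.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    counts = Counter()
    # Slide the pattern over every window of the sequence.
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_positions)
        counts[word] += 1
    return counts


def jensen_shannon(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) of two spaced-word profiles.

    Assumed here as one possible information-theoretic distance.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("sequence shorter than the pattern")
    js = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a  # relative frequency in sequence A
        q = counts_b[word] / total_b  # relative frequency in sequence B
        m = (p + q) / 2
        if p > 0:
            js += 0.5 * p * log2(p / m)
        if q > 0:
            js += 0.5 * q * log2(q / m)
    return js


if __name__ == "__main__":
    a = spaced_word_counts("ACGTACGTAC", "1101")
    b = spaced_word_counts("ACGTTCGTAC", "1101")
    print(jensen_shannon(a, b))
```

With the pattern "101", the windows ACG and AGG both yield the spaced
word AG, because the middle character is ignored. Pairwise distances
computed this way could then be passed to a standard distance-based
tree-building method.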