The first step of seeding against the protein database is done using
a reduced amino acid alphabet index, as used in RAPSearch
(Ye et al. 2011). The index is computed directly from the input
FASTA files and stored in memory. Given an input sequence,
overlapping k-mers are translated into the reduced set and looked up
in the index table. Candidates are ranked by the number of k-mer
matches that occur in the same order as in the query sequence, so
that the most promising ones come first. Several hundred candidates
are considered.
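
The seeding step can be sketched roughly as follows. This is a minimal
illustration, not the tool's actual code: the residue grouping in GROUPS,
the k-mer size, the cutoff `top`, and all function names are assumptions
made for the example, and the grouping is not the one used by RAPSearch.
Order-preserving matches are counted here as the longest increasing
subsequence of target positions, which is one plausible reading of
"matches that follow the same order".

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical reduced alphabet: residues grouped by rough chemical similarity.
GROUPS = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]
REDUCE = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def reduce_seq(seq):
    """Translate a protein sequence into the reduced alphabet ('x' = unknown)."""
    return "".join(REDUCE.get(aa, "x") for aa in seq.upper())


def build_index(db, k=4):
    """Map each reduced k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        red = reduce_seq(seq)
        for pos in range(len(red) - k + 1):
            index[red[pos:pos + k]].append((seq_id, pos))
    return index


def collinear_count(target_positions):
    """Length of the longest strictly increasing subsequence of target
    positions; the hits are expected to be listed in query order."""
    tails = []
    for p in target_positions:
        i = bisect_left(tails, p)
        if i == len(tails):
            tails.append(p)
        else:
            tails[i] = p
    return len(tails)


def rank_candidates(query, index, k=4, top=300):
    """Return up to `top` (sequence id, score) pairs, best first."""
    hits = defaultdict(list)
    red = reduce_seq(query)
    for qpos in range(len(red) - k + 1):
        # Reversed so that hits from one query k-mer arrive in decreasing
        # target order and contribute at most one match to the count.
        for seq_id, tpos in reversed(index.get(red[qpos:qpos + k], [])):
            hits[seq_id].append(tpos)
    scored = [(collinear_count(t), seq_id) for seq_id, t in hits.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(seq_id, score) for score, seq_id in scored[:top]]
```

A database would be passed as a dict of id to sequence, for example
`rank_candidates(query, build_index({"a": "MKTAYIAKQR"}))`.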
2. Cross-correlation filtering
The candidates and query sequence are next translated into three
vectors, representing the chemical properties of amino acids as
numerical values, namely side-chain polarity, side-chain charge, and
hydropathy. Cross-correlation, as implemented in Satsuma
(Grabherr et al. 2010), is then applied to detect the relative shift
positions between the sequences.
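
As an illustration of this step, the sketch below encodes two sequences
as three property vectors and sums their normalized cross-correlations
to find candidate shifts. It is a hedged example only: the hydropathy
values are the Kyte-Doolittle scale, while the binary polarity set, the
simplified charge values (histidine treated as neutral), and the
function names are assumptions for the example. It uses numpy's direct
`correlate` for brevity, which is not necessarily how Satsuma computes
the correlation.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified, illustrative encodings (assumptions, not the tool's values).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}
POLAR = set("RNDQEHKSTY")


def encode(seq):
    """Return a 3 x len(seq) array: polarity, charge, hydropathy."""
    seq = seq.upper()
    return np.array([
        [1.0 if aa in POLAR else 0.0 for aa in seq],
        [CHARGE.get(aa, 0.0) for aa in seq],
        [HYDROPATHY.get(aa, 0.0) for aa in seq],
    ])


def best_shifts(query, target, n_best=3):
    """Return the n_best (shift, score) pairs, highest score first.

    A shift s means query position n lines up with target position n + s.
    """
    total = np.zeros(len(query) + len(target) - 1)
    for qrow, trow in zip(encode(query), encode(target)):
        qrow = qrow - qrow.mean()
        trow = trow - trow.mean()
        norm = np.linalg.norm(qrow) * np.linalg.norm(trow)
        if norm > 0:  # skip channels with no variation
            total += np.correlate(trow, qrow, mode="full") / norm
    # Index i of the 'full' output corresponds to shift i - (len(query) - 1).
    shifts = np.arange(total.size) - (len(query) - 1)
    order = np.argsort(total)[::-1][:n_best]
    return [(int(shifts[i]), float(total[i])) for i in order]
```

Summing the three channels is one simple way to combine them; weighting
the properties differently would be an equally valid design choice.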
3. Final alignment
A dynamic programming (DP) alignment algorithm is applied at the end
to generate user-readable alignments. For efficiency, the DP is
restricted to the shift positions found by the cross-correlation,
resulting in fast, albeit approximate, alignments.
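
One way to restrict the DP to a known shift is a banded local alignment,
sketched below. This is an assumption-laden illustration rather than the
tool's implementation: the function name, the band width, and the simple
match/mismatch/gap scores are all hypothetical (a real protein aligner
would typically use a substitution matrix).

```python
def banded_local_align(query, target, shift, band=5,
                       match=2, mismatch=-1, gap=-2):
    """Local alignment limited to cells with |(j - i) - shift| <= band.

    Returns (score, aligned_query, aligned_target).
    """
    n, m = len(query), len(target)
    score = {}  # sparse table: only cells inside the band are stored
    best, best_cell = 0, None
    for i in range(1, n + 1):
        lo = max(1, i + shift - band)
        hi = min(m, i + shift + band)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            s = max(
                0,
                score.get((i - 1, j - 1), 0) + sub,  # diagonal step
                score.get((i - 1, j), 0) + gap,      # gap in target
                score.get((i, j - 1), 0) + gap,      # gap in query
            )
            score[(i, j)] = s
            if s > best:
                best, best_cell = s, (i, j)
    if best_cell is None:
        return 0, "", ""

    # Trace back from the best cell until the score drops to zero.
    out_q, out_t = [], []
    i, j = best_cell
    while i > 0 and j > 0 and score.get((i, j), 0) > 0:
        s = score[(i, j)]
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if s == score.get((i - 1, j - 1), 0) + sub:
            out_q.append(query[i - 1])
            out_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif s == score.get((i - 1, j), 0) + gap:
            out_q.append(query[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_q.append("-")
            out_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(out_q)), "".join(reversed(out_t))
```

Only about (2 * band + 1) cells per query position are filled, instead
of the full query-by-target table, which is where the speed-up comes
from. The price is that an alignment lying outside the band is missed.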
References
- Grabherr MG, Russell P, Meyer M, Mauceli E, Alföldi J, Di Palma F,
Lindblad-Toh K. (2010) Genome-wide synteny through highly sensitive
sequence alignment: Satsuma. Bioinformatics 26(9):1145-1151
- Ye Y, Choi JH, Tang H. (2011) RAPSearch: a fast protein similarity
search tool for short reads. BMC Bioinformatics 12:159