Protein alignment - Methods

1. Pre-Seeding

The first step, seeding against the protein database, uses an index built on a reduced amino acid alphabet, as in RAPSearch (Ye et al. 2011). The index is computed directly from the input FASTA files and held in memory. Overlapping k-mers of the input sequence are translated into the reduced alphabet and looked up in the index table. Candidates are ranked by the number of k-mer matches that occur in the same order as in the query sequence. Several hundred of the top-ranked candidates are considered.
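
The seeding step can be illustrated with a minimal sketch. The residue grouping, the k-mer length, and all function names below are assumptions made for illustration; they are not taken from RAPSearch or from this software. The ranking counts the longest chain of k-mer hits whose positions increase in both query and candidate, which is one plausible reading of "matches in the same order".

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical reduced alphabet: each group of residues maps to one symbol.
# This grouping is an illustrative assumption, not the one used by RAPSearch.
GROUPS = ["AST", "C", "DEN", "FY", "G", "H", "ILMV", "KQR", "P", "W"]
REDUCE = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def reduce_seq(seq):
    """Translate a protein sequence into the reduced alphabet."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def build_index(db, k=4):
    """Map every reduced k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        red = reduce_seq(seq)
        for pos in range(len(red) - k + 1):
            index[red[pos:pos + k]].append((name, pos))
    return index


def colinear_count(hits):
    """Size of the longest chain of (query_pos, target_pos) hits that
    strictly increases in both coordinates."""
    # Sorting ties on query position by descending target position prevents
    # two hits from the same query k-mer from being chained together.
    ordered = sorted(hits, key=lambda h: (h[0], -h[1]))
    tails = []
    for _, tpos in ordered:
        i = bisect_left(tails, tpos)
        if i == len(tails):
            tails.append(tpos)
        else:
            tails[i] = tpos
    return len(tails)


def rank_candidates(query, index, k=4, top=300):
    """Return up to `top` (name, score) pairs, best-scoring candidates first."""
    red = reduce_seq(query)
    hits = defaultdict(list)
    for qpos in range(len(red) - k + 1):
        for name, tpos in index.get(red[qpos:qpos + k], ()):
            hits[name].append((qpos, tpos))
    scores = {name: colinear_count(h) for name, h in hits.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

Because matching is done in the reduced alphabet, a query such as MRILSAGVT still seeds against a database entry MKVLAAGIT, although the two differ at four positions.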


2. Cross-correlation filtering

The candidates and the query sequence are next each translated into three numerical vectors representing chemical properties of the amino acids: side-chain polarity, side-chain charge, and hydropathy. Cross-correlation, as implemented in Satsuma (Grabherr et al. 2010), is then applied to detect the relative shift positions between the sequences.
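
The idea can be sketched as follows. Which scales the software uses is not stated here, so this sketch assumes the Kyte-Doolittle hydropathy scale, a +1/-1 polar/non-polar flag, and unit charges for K, R, D and E; all names are hypothetical. For brevity it evaluates the correlation sum directly at every shift and does not reproduce the Satsuma implementation.

```python
# Kyte-Doolittle hydropathy scale (assumed here for illustration).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charge: histidine is treated as neutral (assumption).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}
# Residues counted as polar in this sketch (assumption).
POLAR = set("RNDQEHKSTY")


def encode(seq):
    """Translate a sequence into (polarity, charge, hydropathy) vectors.
    Unknown residues contribute 0 in every vector."""
    seq = seq.upper()
    polarity = [
        (1.0 if aa in POLAR else -1.0) if aa in HYDROPATHY else 0.0
        for aa in seq
    ]
    charge = [CHARGE.get(aa, 0.0) for aa in seq]
    hydropathy = [HYDROPATHY.get(aa, 0.0) for aa in seq]
    return polarity, charge, hydropathy


def cross_correlate(query, target):
    """Return {shift: score}, where query[i] is paired with target[i + shift]
    and the score is summed over the three property vectors."""
    q_tracks, t_tracks = encode(query), encode(target)
    scores = {}
    for shift in range(-(len(query) - 1), len(target)):
        lo = max(0, -shift)                          # first overlapping query index
        hi = min(len(query), len(target) - shift)    # one past the last
        total = 0.0
        for q, t in zip(q_tracks, t_tracks):
            total += sum(q[i] * t[i + shift] for i in range(lo, hi))
        scores[shift] = total
    return scores


def best_shifts(query, target, n=1):
    """Return the n highest-scoring shifts, best first."""
    scores = cross_correlate(query, target)
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]
```

The direct sum costs time proportional to the product of the two sequence lengths. It is used here only to keep the sketch free of dependencies.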


3. Final alignment

Finally, a dynamic programming (DP) alignment algorithm is applied to generate user-readable alignments. For efficiency, the DP is restricted to the shift positions found by the cross-correlation, resulting in fast, albeit approximate, alignments.
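
One way to restrict the DP to a shift position is a banded local alignment that fills only the cells close to the corresponding diagonal. The sketch below is an illustration under stated assumptions, not the software's actual algorithm: the band width, the simple match/mismatch/gap scores (a real aligner would more likely use a substitution matrix), and the function name are all hypothetical.

```python
def banded_local_align(query, target, shift, band=3,
                       match=2, mismatch=-1, gap=-2):
    """Local alignment restricted to cells within `band` of the diagonal
    j - i == shift. Returns (score, aligned_query, aligned_target)."""
    score = {}                      # (i, j) -> best score of an alignment ending there
    best, best_cell = 0, None
    for i in range(1, len(query) + 1):
        j_lo = max(1, i + shift - band)
        j_hi = min(len(target), i + shift + band)
        for j in range(j_lo, j_hi + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Cells outside the band are absent and read as 0, so a gap move
            # from them scores below 0 and is never selected.
            s = max(0,
                    score.get((i - 1, j - 1), 0) + sub,
                    score.get((i - 1, j), 0) + gap,
                    score.get((i, j - 1), 0) + gap)
            score[(i, j)] = s
            if s > best:
                best, best_cell = s, (i, j)

    if best_cell is None:
        return 0, "", ""

    # Trace back from the best cell until the score drops to 0
    # or the path leaves the band.
    a, b = [], []
    i, j = best_cell
    while score.get((i, j), 0) > 0:
        s = score[(i, j)]
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if s == score.get((i - 1, j - 1), 0) + sub:
            a.append(query[i - 1])
            b.append(target[j - 1])
            i, j = i - 1, j - 1
        elif s == score.get((i - 1, j), 0) + gap:
            a.append(query[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

Only about (2 * band + 1) cells are filled per query residue, which is where the speed-up comes from. The price is that an alignment drifting more than `band` positions away from the detected shift cannot be found, hence the approximate result.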

4. Availability

The most recent version of the software is distributed with the Smörgås Alignment Web Server.

References
- Grabherr MG, Russell P, Meyer M, Mauceli E, Alföldi J, Di Palma F, Lindblad-Toh K. (2010) Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics 26(9):1145-51
- Ye Y, Choi JH, Tang H. (2011) RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinformatics 12:159