Manual

Smörgås Sequence Alignment Web Server - Documentation

Smörgås (or Smorgas) is a local web interface for protein and transcript alignments, utilizing the mongoose software (https://github.com/cesanta/mongoose). It takes protein databases in fasta format, making it ideal for local installations serving individual research groups, institutions, or resources dedicated to specific organisms, groups or taxa. While its interface can be easily customized for a specific look-and-feel, no web programming skills are required for installation and setup.

To get a quick overview, check out the Video Tutorial.

The source code repository is here: https://github.com/GrabherrGroup/Smorgas-proteins.

1. Installation

1.1. Clone the repository

To get the source code, type:

> git clone https://github.com/GrabherrGroup/Smorgas-proteins.git

1.2. Build executables

> cd Smorgas-proteins

> make

2. Prepare your protein database

Smörgås reads the database in multi-fasta format. For smaller databases (<300K entries), you can supply the entire file. For larger databases, we recommend splitting the file into equal-sized chunks, which both reduces memory usage and allows for parallel processing. You can split the database by typing

> SplitFasta -i <database.fasta> -n <how many chunks you would like>

Please note that the number of chunks must match the number of processes you use for parallel runs.
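
To check whether a database is below the 300K-entry mark mentioned above, one can count its header lines (those starting with ">"). The following Python sketch is only an illustration; the function names count_fasta_entries and needs_splitting are hypothetical helpers and not part of Smörgås:

```python
def count_fasta_entries(path):
    """Count records in a multi-fasta file by counting '>' header lines."""
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


def needs_splitting(path, threshold=300_000):
    """Return True if the file has at least `threshold` entries.

    The default threshold mirrors the rough 300K limit given in this manual.
    """
    return count_fasta_entries(path) >= threshold
```

If needs_splitting returns True, the database is a candidate for SplitFasta as described above.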

3. Run the web server

3.1 Data files

Smörgås needs a number of files to run a web server. All these files can be found in the smorgas_server_data/ directory, and must be located where the programs are executed. In this directory, there are also a few sample scripts (run_fp, run_plants, run_te), as well as some small databases.

To manually start the server with a single database, cd into the directory and run:

> ../SmorgasCommLayer -f refseq/refseq_plants.faa -server localhost > scl.out &

> ../TangerineServer -t taxonomy.txt > ts.out &

which initializes the server with the RefSeq plant database and tells it that the communication layer (which controls the processes) and the web server both run on the same physical machine (localhost).

***** IMPORTANT: ***** Please wait until the ProtServer processes are initialized (i.e. their CPU usage drops to 0) before using the service! This should take no more than a few minutes.

***** Please note: ***** the supplied taxonomy.txt text file (derived from the NCBI taxonomy database) is not current. For a more recent version, uncompress taxonomy_update.txt.gz via

> gzip -d taxonomy_update.txt.gz

and use this file instead.

To change the taxonomic resolution, e.g. if you work with a specific subset of species, you can supply custom colors for display with the -col <color_file.txt> option (defaults are used if it is not specified), where the file is a list of organisms, groups, or taxa. For example, when using a plant protein database, you can specify (see plant_colors.txt):

Streptophyta;
Chlorophyta;
Embryophyta;
Tracheophyta;
Spermatophyta;
Coniferopsida;
Magnoliophyta;
Viridiplantae;

To parallelize the database, supply a configuration file using the -c option:

> ../SmorgasCommLayer -c <your config file> -server localhost > scl.out &

The configuration file is a simple text file with two entries per line, a name and the location of a fasta file, e.g.:

nr00 /references/databases/nr/2014_09_15/nr.fasta.0
nr01 /references/databases/nr/2014_09_15/nr.fasta.1
nr02 /references/databases/nr/2014_09_15/nr.fasta.2
nr03 /references/databases/nr/2014_09_15/nr.fasta.3

The communication layer will launch a number of ProtServer instances according to the number of entries in the config file.
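
A configuration file of this shape can be generated from a list of chunk files. The Python sketch below is a hedged illustration: the helper names, the default "db" prefix, the two-digit numbering, and the single-space separator are assumptions modelled on the example above, not requirements stated by Smörgås:

```python
from pathlib import Path


def build_config_lines(chunk_paths, prefix="db"):
    """Return one 'name location' line per chunk, e.g. 'db00 /path/to/chunk'."""
    lines = []
    for index, chunk in enumerate(chunk_paths):
        # Two-digit, zero-padded index, as in the nr00..nr03 example (assumed convention).
        lines.append(f"{prefix}{index:02d} {chunk}")
    return lines


def write_config(chunk_paths, config_path, prefix="db"):
    """Write the config file and return the number of entries written."""
    lines = build_config_lines(chunk_paths, prefix)
    Path(config_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, build_config_lines(["/references/databases/nr/2014_09_15/nr.fasta.0"], prefix="nr") yields the first line of the example above.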

The Smörgås Web Server runs its own implementation of protein alignments by default. For information on how to swap in other aligners, see section 5.

4. Run in batch mode

4.1 Help screen and arguments

Type:

> ./runSatsumaProt

to bring up the following help:

./runSatsumaProt: Satsuma-based (cross-correlation) protein alignment tool.

Available arguments:

-t<string> : target protein fasta (def=)
-q<string> : query protein fasta

-db<string> : protein database in Spines binary format (def=)
-e<bool> : exhaustive search (no filtering) (def=0)
-k<int> : k-mer size (def=4)
-kmerStep<int> : Step size to be used in generating filteration kmers (def=1)
-filter<int> : Type of prefilter of hits to use- 1:fixed distance k-mer based 2:max k-mer based (def=1)
-allowFails<int> : Number of failures to allow before stopping the search (def=50)
-E-value<double> : show only alignments with better E-value (def=0.01)
-self<bool> : self-alignments (def=0)
-same<bool> : same alignments (def=0)
-block<int> : search only this subset (requires -e) (def=0)
-n_blocks<int> : number of blocks (w/ -e) (def=0)
-rna<bool> : Do RNA alignments (def=0)
-timestamps<bool> : Print time stamps (def=0)
-m<int> : number of results to display (def=50)
-cutoff<double> : show only alignments at this (ingapped) identity or higher (def=0)
-wSlide<int> : Filter Window slide, if set to 1 the window sliding will cover all kmers. (def=2)
-l<string> : Application logging file (def=application.log)

4.2 Running the aligner

To run with the default options, type:

> ./runSatsumaProt -t <protein database> -q <query fasta file> > <output file> &

The output will be written to stdout in human-readable format, including statistics and a summary line, e.g.:

**********************************************

Target sequence size: 368
Query sequence size: 757
Target offset: 1
Query offset: 200
Target aligned basepairs: 366
Query aligned basepairs: 416
Raw Score: -1
Identity score: 0.507212
Total Edit Count 171
Mean Contiguity length 0
Mod-Smith-waterman score: -2
Significance P-value: 1
***********************************************

Query: 200 LMNNSTGRSHVLAHPTGIDTIARSLAADNIKTKIAALEILGAVCLVPGGHKKVLTAMLNYQEYAAERARFQGIVNDLDKS 279

LMNNS GR+HVL+H I+ IA+SLA +NIKTK+A LEI+GAVCLVPGGH+K+L AML+YQ++A ER RFQ ++NDLD+S
Sbjct: 1 LMNNSQGRAHVLSHSESINIIAQSLATENIKTKVAVLEIMGAVCLVPGGHRKILEAMLHYQKFACERTRFQTLLNDLDRS 80

Query: 280 TGAYRDDVNLKTAIMSFINAVLNYGPGQENLEFRLHLRYEFLMLGIQPVIDKLRKHENETLNRHLDFFEMVRNEDEKELA 359

TG YRD+V+LKTAIMSFINA+L+ G G+ +LEFR+HLRYEFLMLGIQP+IDKLR H+N TL+RHLD+FEM+RN+DE LA
Sbjct: 81 TGRYRDEVSLKTAIMSFINAILSQGAGETSLEFRVHLRYEFLMLGIQPIIDKLRSHDNATLDRHLDYFEMLRNDDELALA 160

Query: 360 RKFNHEHVDTKSATAMFDLLRRKLSHSGAYPHLLSLLQHLLLLPHGG--PNAQHWLMFDRVVQQIVLQQEERPTSEIIDP 437

R+F H+DTKSA+ +FDL+R+K++H+ AYPH +S+L H LL+PH Q+WL+ DR+VQQ+VLQ +
Sbjct: 161 RRFESVHIDTKSASQVFDLIRKKMNHTDAYPHFMSVLHHCLLMPHKRSGNTVQYWLLLDRIVQQMVLQ--NDKGHDPDVT 238

Summary Seq_504 vs. UniRef90_B0W4A3_Disheveled-associated_activator_of_morphogenesis_2_n=1_Tax=Culex_quinquefasciatus_RepID=B0W4A3_CULQU score: 4.83175e-07 q-coords: 200 616 t-coords: 1 367 0.507212 4.83175e-07

Note that the summary line appears after the alignment.
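
For downstream processing, the summary line can be parsed with a small script. The Python sketch below is based only on the sample line shown above. The field names are labels chosen for illustration, sequence names are assumed to contain no whitespace, and the meaning of the final column (which repeats the score in the sample) is not documented here, so it is kept as last_field:

```python
import re

# Pattern derived from the sample summary line in this manual (an assumption
# about the general format, not a specification).
SUMMARY_RE = re.compile(
    r"^Summary\s+(?P<query>\S+)\s+vs\.\s+(?P<target>\S+)"
    r"\s+score:\s+(?P<score>\S+)"
    r"\s+q-coords:\s+(?P<q_start>\d+)\s+(?P<q_end>\d+)"
    r"\s+t-coords:\s+(?P<t_start>\d+)\s+(?P<t_end>\d+)"
    r"\s+(?P<identity>\S+)\s+(?P<last_field>\S+)$"
)


def parse_summary_line(line):
    """Parse one summary line into a dict, or return None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("q_start", "q_end", "t_start", "t_end"):
        fields[key] = int(fields[key])
    for key in ("score", "identity", "last_field"):
        fields[key] = float(fields[key])
    return fields
```

Lines that do not start with "Summary" (alignment rows, statistics) return None and can simply be skipped.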

4.3 Adjusting parameters for parallel run

Consider lowering the -m option (number of alignments to report) to increase speed. For example, if the desired final number of alignments is 50 and the program runs in 24 processes, you might set -m to 10.
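
One possible way to set up such a parallel run is to start one runSatsumaProt process per database chunk. The Python sketch below assumes that each process receives one chunk via -t and writes to its own output file; this layout, the helper names, and the ".out" suffix are assumptions for illustration, not a procedure prescribed by Smörgås. Only the -t, -q and -m options from the help screen are used:

```python
import subprocess


def build_batch_commands(chunk_paths, query_fasta, max_hits=10,
                         binary="./runSatsumaProt"):
    """Return (argv, output_path) pairs, one per database chunk."""
    jobs = []
    for chunk in chunk_paths:
        argv = [binary, "-t", str(chunk), "-q", str(query_fasta),
                "-m", str(max_hits)]
        # Hypothetical naming scheme: one output file next to each chunk.
        jobs.append((argv, f"{chunk}.out"))
    return jobs


def run_jobs(jobs):
    """Start all jobs at once, wait for them, and return their exit codes."""
    running = []
    for argv, out_path in jobs:
        out_handle = open(out_path, "w")
        running.append((subprocess.Popen(argv, stdout=out_handle), out_handle))
    exit_codes = []
    for proc, out_handle in running:
        exit_codes.append(proc.wait())
        out_handle.close()
    return exit_codes
```

Calling build_batch_commands alone only prepares the command lines, which makes it easy to inspect them before anything is launched with run_jobs.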

4.4 Interpret the output

Note that Smörgås does not print alignments in sorted order. To sort the results (summary line only), use:

> SortSmorgasOut -i <result file> -c <maximum # of alignments per query sequence> > <sorted output>

Note that the option -f, which defaults to 0.1, applies an additional f-value threshold.

For results from parallel runs, concatenate the output and run SortSmorgasOut.
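
The concatenation step can be done with any tool; as a minimal Python sketch (the function name concatenate_outputs and the file names are hypothetical):

```python
import shutil


def concatenate_outputs(part_paths, merged_path):
    """Append each per-process output file to merged_path, in the given order."""
    with open(merged_path, "wb") as merged:
        for part in part_paths:
            with open(part, "rb") as handle:
                shutil.copyfileobj(handle, merged)
    return merged_path
```

The merged file is then passed to SortSmorgasOut via -i as described above.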

5. Using a different alignment method

To switch to a different aligner, edit src/Smorgas/ProtServer.cc, which is a thin wrapper that links the communication layer to the aligner core, and replace SatsumaProt with the aligner of your choice. Note that the way parameters are passed through the system might have to change as well, and that the output of the aligner should adhere to the SatsumaProt (blastp-like) format.