Smörgås Sequence Alignment Web Server - Documentation
Smörgås (or Smorgas) is a local web interface for protein and transcript
alignments, utilizing the mongoose software (https://github.com/cesanta/mongoose).
It takes protein databases in fasta format, making it ideal for
local installations serving individual research groups,
institutions, or resources dedicated to specific organisms, groups
or taxa. While its interface can be easily customized for a
specific look-and-feel, no web programming skills are required for
installation and setup.
To get a quick overview, check out the Video
Tutorial.
The source code
repository is here: https://github.com/GrabherrGroup/Smorgas-proteins.
1. Installation
1.1. Clone the repository
To get the source code and compile it, type:
> git clone https://github.com/GrabherrGroup/Smorgas-proteins.git
1.2. Build executables
> cd Smorgas-proteins
> make
Smörgås reads the database in multi-fasta format. For smaller databases (<300K entries), you can supply the entire file. For larger databases, we recommend splitting the file into equal-sized chunks, which both reduces memory usage and allows for parallel processing. You can split the database by typing:
> SplitFasta -i <database.fasta> -n <how many chunks you would like>
Please note that the number of chunks must match the number of processes you use for parallel runs.
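To make the idea of chunking concrete, the following is a minimal Python sketch of splitting a multi-fasta file into a fixed number of chunks. It is not the SplitFasta implementation: the function names are hypothetical, and it assumes that "equal-sized" means an equal number of entries per chunk (sizes differ by at most one).

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line including '>', concatenated sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records: List[Record], n_chunks: int) -> List[List[Record]]:
    """Distribute records over n_chunks lists whose sizes differ by at most one.

    Assumption: chunks are balanced by number of entries, not by residues.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks
```

The number of chunks passed here would be the same number you later use for parallel processes.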
3. Run the web server
3.1 Data files
Smörgås needs a number of files to run a web server. All these files can be found in the smorgas_server_data/ directory, and must be located where the programs are executed. In this directory, there are also a few sample scripts (run_fp, run_plants, run_te), as well as some small databases.
To manually start the server with a single database, cd into the directory and run:
> ../SmorgasCommLayer -f refseq/refseq_plants.faa -server localhost > scl.out &
> ../TangerineServer -t taxonomy.txt > ts.out &
This initializes the server with the RefSeq plant database and tells it that the communication layer, which controls the processes, and the web server both run on the same physical machine (localhost).
***** IMPORTANT: ***** Please wait until the ProtServer processes are initialized (i.e. their CPU usage drops to 0) before using the service. This should take only a few minutes.
***** Please note: ***** the supplied taxonomy.txt text
file (derived from the NCBI taxonomy database) is not current. For
a more recent version, uncompress taxonomy_update.txt.gz via
> gzip -d taxonomy_update.txt.gz
and use this file instead.
To change the taxonomic resolution, e.g. if you work with a
specific subset of species, you can supply custom colors for
display using the -col <color_file.txt> option (defaults are used
if it is not specified), where the file is a list of organisms,
groups, or taxa. For example, when using a plant protein database,
you can specify (see plant_colors.txt):
Streptophyta;
Chlorophyta;
Embryophyta;
Tracheophyta;
Spermatophyta;
Coniferopsida;
Magnoliophyta;
Viridiplantae;
To parallelize the database, supply a configuration file using the -c option:
> ../SmorgasCommLayer -c <your config file> -server localhost > scl.out &
The configuration file is a simple text file with two entries per line, a name and the location of the fasta file, e.g.:
The communication layer will launch a number of ProtServer instances according to the number of entries in the config file.
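As an illustration of the configuration file described above, here is a hedged Python sketch that generates one from a list of fasta chunks. The exact file syntax is not shown in this document, so the sketch assumes that the name and the location are separated by a single space, and it derives each name from the file name; the function names and paths are hypothetical. Check the result against the files shipped with the repository before relying on it.

```python
from pathlib import Path
from typing import Iterable, List


def config_lines(fasta_paths: Iterable[str]) -> List[str]:
    """Build one "<name> <location>" line per fasta file.

    Assumption: entries are separated by a single space, and the name is
    taken from the file name without its extension.
    """
    lines = []
    for path in fasta_paths:
        name = Path(path).stem
        lines.append(f"{name} {path}")
    return lines


def write_config(fasta_paths: Iterable[str], config_path: str) -> int:
    """Write the config file and return the number of entries written."""
    lines = config_lines(fasta_paths)
    Path(config_path).write_text("\n".join(lines) + "\n")
    return len(lines)
```

Because one ProtServer instance is launched per entry, the number of lines written corresponds to the number of processes that will be started.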
The Smörgås Web Server runs its
own implementation of protein alignments by default. For
information on how to swap in other aligners, see section 5.
4. Run in batch mode
4.1 Help screen and arguments
Type:
> ./runSatsumaProt
to bring up the following help:
./runSatsumaProt: Satsuma-based (cross-correlation) protein alignment tool.
Available arguments:
-t<string> : target protein fasta (def=)
-q<string> : query protein fasta
4.2 Running the aligner
To run with the default options, type:
> ./runSatsumaProt -t <protein database> -q <query fasta file> > <output file> &
The output will be written to stdout in human-readable format, including statistics and a summary line, e.g.:
**********************************************
Query: 200 LMNNSTGRSHVLAHPTGIDTIARSLAADNIKTKIAALEILGAVCLVPGGHKKVLTAMLNYQEYAAERARFQGIVNDLDKS 279
LMNNS GR+HVL+H I+ IA+SLA +NIKTK+A LEI+GAVCLVPGGH+K+L AML+YQ++A ER RFQ ++NDLD+S
Query: 280 TGAYRDDVNLKTAIMSFINAVLNYGPGQENLEFRLHLRYEFLMLGIQPVIDKLRKHENETLNRHLDFFEMVRNEDEKELA 359
TG YRD+V+LKTAIMSFINA+L+ G G+ +LEFR+HLRYEFLMLGIQP+IDKLR H+N TL+RHLD+FEM+RN+DE LA
Query: 360 RKFNHEHVDTKSATAMFDLLRRKLSHSGAYPHLLSLLQHLLLLPHGG--PNAQHWLMFDRVVQQIVLQQEERPTSEIIDP 437
R+F H+DTKSA+ +FDL+R+K++H+ AYPH +S+L H LL+PH Q+WL+ DR+VQQ+VLQ +
Note that the summary line appears after the alignment.
4.3 Adjusting parameters for parallel run
Consider lowering the -m option (number of alignments to report) to increase speed. For example, if the desired final number of alignments is 50 and the program runs in 24 processes, you might set -m to 10.
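A parallel batch run can be scripted in several ways; the Python sketch below shows one possibility, launching one aligner process per database chunk with a reduced -m value. It rests on several assumptions: the chunk file names, the output file naming, and the helper function names are hypothetical, and -m is assumed to take its value as a separate argument in the same way as -t and -q.

```python
import subprocess
from typing import List


def build_commands(executable: str, chunk_files: List[str],
                   query: str, max_alignments: int) -> List[List[str]]:
    """Return one argument list per database chunk.

    Assumption: -m accepts an integer passed as the following argument.
    """
    return [
        [executable, "-t", chunk, "-q", query, "-m", str(max_alignments)]
        for chunk in chunk_files
    ]


def run_all(commands: List[List[str]], output_prefix: str = "result") -> List[int]:
    """Start all commands at once, each writing stdout to its own file.

    Returns the exit codes in the same order as the commands.
    """
    running = []
    for i, cmd in enumerate(commands):
        out = open(f"{output_prefix}_{i}.out", "w")
        running.append((subprocess.Popen(cmd, stdout=out), out))
    exit_codes = []
    for proc, out in running:
        exit_codes.append(proc.wait())
        out.close()
    return exit_codes
```

With 24 chunks and -m set to 10, calling run_all(build_commands("./runSatsumaProt", chunks, "query.fasta", 10)) would produce 24 output files, which can then be merged and sorted.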
4.4 Interpret the output
To sort the results and limit the number of alignments reported per query sequence, run:
> SortSmorgasOut -i <result file> -c <maximum # of alignments per query sequence> > <sorted output>
Note that the option -f, which defaults to 0.1, applies an additional f-value threshold.
For results from parallel runs, concatenate the output and
run SortSmorgasOut.
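The concatenation step for parallel runs can be done with any tool; as a minimal sketch, the Python function below merges several result files into one. The file names are hypothetical, and the only assumption about the format is that it is plain text, so a missing final newline is repaired to keep the outputs of different runs from running together.

```python
from pathlib import Path
from typing import Iterable


def concatenate_outputs(result_files: Iterable[str], merged_path: str) -> int:
    """Append each result file to merged_path; return the number of files merged."""
    count = 0
    with open(merged_path, "w") as merged:
        for path in result_files:
            text = Path(path).read_text()
            merged.write(text)
            # Ensure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                merged.write("\n")
            count += 1
    return count
```

The merged file is then passed to SortSmorgasOut through its -i option, as described above.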
5. Using a different alignment method