Data preparation

SCORPiOs is a flexible gene tree correction pipeline: it can either start from a set of precomputed, phylogeny-reconciled gene trees, or build such a set from gene multiple alignments using TreeBeST.

If you do not have gene trees or gene alignments readily available for your study species, please refer to the Building a dataset section.

Warning

Because SCORPiOs leverages local synteny similarity, i.e. the evolution of neighboring genes, it requires genome-wide data.

Input files

SCORPiOs requires four input files, which are:

  1. A set of phylogeny-reconciled gene trees as a single file in NHX format (extended Newick format, see example and description).

OR

  1. (bis) A genes-to-species mapping file, if starting the process from gene alignments only (see example and description).

  2. The gene multiple alignments corresponding to the gene trees, as a single file in FASTA format (can be compressed with gzip, see example and description).

  3. Gene coordinate files for each species in BED format (see example) or dyogen format (see example). See also description.

  4. A species tree in Newick format, with names of ancestral species indicated at internal nodes (see example and description).

For a detailed description of expected formats please refer to the Data file formats section.

Note

If starting from gene trees, SCORPiOs uses the NHX S (species name) tag to build the genes-to-species mapping. Otherwise, it uses the genes-to-species mapping file.
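To illustrate what reading the S tag involves, here is a minimal Python sketch that collects a gene-to-species mapping from the leaves of an NHX string. It is not SCORPiOs' own code: the function name genes_to_species, the regular expression and the toy tree are assumptions made for this example, and it assumes every leaf carries an [&&NHX:...] comment with an S tag.

```python
import re

# A leaf is a name that directly follows '(' or ',', optionally followed by a
# branch length, then an NHX comment such as [&&NHX:S=zebrafish].
# Internal nodes follow ')' and are therefore not matched.
LEAF_RE = re.compile(
    r"[(,]\s*([^():,;\[\]\s]+)(?::[^\[\](),]*)?\[&&NHX:([^\]]*)\]"
)


def genes_to_species(nhx_tree):
    """Map each leaf (gene) name to the value of its NHX S tag."""
    mapping = {}
    for gene, tags in LEAF_RE.findall(nhx_tree):
        # NHX tags are colon-separated key=value pairs.
        fields = dict(tag.split("=", 1) for tag in tags.split(":") if "=" in tag)
        if "S" in fields:
            mapping[gene] = fields["S"]
    return mapping


# Toy tree with hypothetical gene names, for illustration only.
example = (
    "((geneA:0.1[&&NHX:S=zebrafish],geneB:0.2[&&NHX:S=medaka])"
    ":0.3[&&NHX:S=Clupeocephala:D=N],geneC:0.5[&&NHX:S=spotted_gar])"
    "[&&NHX:S=Neopterygii:D=N];"
)
print(genes_to_species(example))
```

Only leaves are kept, because internal nodes of a reconciled tree may also carry an S tag naming an ancestral species.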

SCORPiOs parameters

All parameters for a SCORPiOs run have to be indicated in the YAML configuration file, as shown in config_example.yaml.

Two critical parameters are the position(s) of the WGD(s) in the species tree and the species to use as outgroup(s). Both are specified together under the WGDs keyword. The position of a WGD is given as the name of the last common ancestor of all duplicated species.

For instance, consider the species tree below:

(spotted_gar, (zebrafish, (medaka, (tetraodon, fugu)Tetraodontidae)Euteleosteomorpha)Clupeocephala)Neopterygii;
https://raw.githubusercontent.com/DyogenIBENS/SCORPIOS/master/doc/img/basic_sptree.png

The fish WGD occurred on the branch leading to the “Clupeocephala” ancestor, and we wish to use spotted_gar as the outgroup. This should be specified in the configuration file as:

WGDs:
  Clupeocephala: spotted_gar
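
Before launching a run, it can help to check that the WGDs entry is consistent with the species tree. The Python sketch below is one possible sanity check, not part of SCORPiOs: parse_species_tree and check_wgds are hypothetical helpers, the WGDs entry is passed as a plain dictionary (what a YAML loader would typically return), and the rule that an outgroup must lie outside the duplicated clade is an assumption of this example.

```python
def parse_species_tree(newick):
    """Return ({ancestor: set of descendant species}, set of all species)."""
    s = "".join(newick.split()).rstrip(";")  # drop whitespace and final ';'
    clades = {}
    pos = 0

    def read_node():
        nonlocal pos
        leaves = set()
        is_leaf = s[pos] != "("
        if not is_leaf:
            pos += 1  # skip '('
            while True:
                leaves |= read_node()
                if s[pos] == ",":
                    pos += 1  # another child follows
                else:
                    pos += 1  # skip ')'
                    break
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        name = s[start:pos].split(":")[0]  # ignore branch lengths, if any
        if is_leaf:
            leaves = {name}
        elif name:
            clades[name] = leaves
        return leaves

    return clades, read_node()


def check_wgds(newick, wgds):
    """Return a list of problems found in a WGDs mapping (empty if none)."""
    clades, species = parse_species_tree(newick)
    problems = []
    for ancestor, outgroups in wgds.items():
        if ancestor not in clades:
            problems.append(f"unknown ancestor: {ancestor}")
            continue
        for outgroup in (o.strip() for o in outgroups.split(",")):
            if outgroup not in species:
                problems.append(f"unknown outgroup: {outgroup}")
            elif outgroup in clades[ancestor]:
                problems.append(f"{outgroup} is inside {ancestor}")
    return problems


tree = ("(spotted_gar, (zebrafish, (medaka, (tetraodon, fugu)Tetraodontidae)"
        "Euteleosteomorpha)Clupeocephala)Neopterygii;")
print(check_wgds(tree, {"Clupeocephala": "spotted_gar"}))
```

An empty list means that the ancestor name exists in the tree and that every outgroup is a known species outside the duplicated clade.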

For a detailed description of all parameters available in SCORPiOs please refer to the Configuration file section.

Complex configurations

SCORPiOs can correct gene trees that contain more than one whole-genome duplication event.

In this case, each WGD is treated independently, starting from the most recent one (closest to the leaves) and going up towards the most ancient one (closest to the root). If the WGDs are nested, the subtrees from the more recent events are ignored while correcting for the older WGD event(s), and reinserted after correction using their outgroup as a branching point.
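
The processing order described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than SCORPiOs' implementation: order_wgds is a hypothetical helper, the clades dictionary (ancestor name to set of descendant species) is assumed to have been built from the species tree beforehand, and clade size is used as a proxy for recency, which is only guaranteed to be meaningful for nested WGDs. The salmonid species names are placeholders.

```python
def order_wgds(clades, wgd_ancestors):
    """Order WGDs from most recent to most ancient and list nested pairs.

    clades maps each ancestor name to the set of its descendant species.
    A WGD nested within another has a strictly smaller clade, so sorting by
    clade size places it first. The relative order of non-nested WGDs is
    arbitrary in this sketch.
    """
    order = sorted(wgd_ancestors, key=lambda anc: len(clades[anc]))
    nested = [
        (inner, outer)
        for i, inner in enumerate(order)
        for outer in order[i + 1:]
        if clades[inner] < clades[outer]  # strict subset: inner is nested
    ]
    return order, nested


# Toy clades; "salmonid_1" and "salmonid_2" are placeholder species names.
clades = {
    "Clupeocephala": {
        "Esox.lucius", "Gasterosteus.aculeatus", "Oryzias.latipes",
        "salmonid_1", "salmonid_2",
    },
    "Salmonidae": {"salmonid_1", "salmonid_2"},
}
order, nested = order_wgds(clades, ["Clupeocephala", "Salmonidae"])
print(order)   # most recent WGD first
print(nested)  # (inner, outer) pairs of nested WGDs
```

The nested pairs indicate which recent-WGD subtrees would have to be set aside while the older event is being corrected.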

SCORPiOs can also use more than one reference outgroup to correct gene trees. Outgroups, separated by commas if there is more than one, are to be indicated for each WGD.

For instance, in the example config_example.yaml, WGDs to correct are specified by:

WGDs:
  Clupeocephala: Lepisosteus.oculatus,Amia.calva
  Salmonidae: Esox.lucius,Gasterosteus.aculeatus,Oryzias.latipes

This specifies that gene trees have to be corrected for the teleost WGD (species below the Clupeocephala ancestor in the species tree) and for the salmonid WGD (species below the Salmonidae ancestor in the species tree). Lepisosteus.oculatus and Amia.calva are used as outgroups for the teleost WGD, and Esox.lucius, Gasterosteus.aculeatus and Oryzias.latipes as outgroups for the salmonid WGD.
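
As a small illustration of how such an entry can be read, the Python sketch below splits each comma-separated outgroup string into a list. The helper name split_outgroups is hypothetical, and the input dictionary stands in for the WGDs section as it would typically look once the YAML file has been loaded.

```python
def split_outgroups(wgds):
    """Turn {'Ancestor': 'sp1,sp2'} into {'Ancestor': ['sp1', 'sp2']}."""
    return {
        ancestor: [name.strip() for name in outgroups.split(",") if name.strip()]
        for ancestor, outgroups in wgds.items()
    }


# The WGDs section from the example above, written as a plain dictionary.
wgds = {
    "Clupeocephala": "Lepisosteus.oculatus,Amia.calva",
    "Salmonidae": "Esox.lucius,Gasterosteus.aculeatus,Oryzias.latipes",
}
for ancestor, outgroups in split_outgroups(wgds).items():
    print(ancestor, outgroups)
```

Stripping whitespace around each name makes the sketch tolerant of entries written with a space after each comma.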