scripts.lorelei package

scripts.lorelei.constrained_aore_lore_topologies module

Script that browses the SCORPiOs corrected tree forest to build constrained AORe and LORe tree topologies.

Example:

$ python -m scripts.lorelei.constrained_aore_lore_topologies TODO

scripts.lorelei.constrained_aore_lore_topologies.check_aore_consistent_tree(subtree, outgroups, dup_sp)

Checks that groups in synteny-consistent trees look correct and uses them to build the AORe tree.

Parameters
  • subtree (ete3.subtree) – AORe tree

  • outgroups (list of str) – list of outgroup species

  • dup_sp (list of str) – list of duplicated species

Returns

a tuple containing:

tree (ete3.Tree): resulting gene tree, None if the tree looks inconsistent

outgr_gene (str): one outgroup gene name, used to identify the tree, None if the tree looks inconsistent

Return type

tuple

scripts.lorelei.constrained_aore_lore_topologies.check_copy_number(tree, ref_species, sp_min=3, copy_max=2, sp_min_2copies=0, copy_in_ref=None, groups=None)

Checks the number of gene copies in an input tree.

Parameters
  • tree (ete3.Tree) – input tree

  • ref_species (list of str) – list of outgroup species

  • sp_min (int) – minimal number of species in the tree

  • copy_max (int) – maximal number of gene copies for any species in tree

  • sp_min_2copies (int) – minimal number of species with 2 gene copies

  • copy_in_ref (int) – number of gene copies expected in ref species

  • groups (list of str) – groups of species, if provided, sp_min_2copies has to be verified for all groups.

Returns

True if criteria are met, False otherwise

Return type

bool
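
The criteria above can be illustrated with a minimal sketch. It is not the module's code: `copy_number_ok` is a hypothetical helper that works on a plain list of species names (one per leaf) instead of an ete3 tree. It assumes that `groups` is a list of species lists and that only reference species present in the tree are checked against `copy_in_ref`.

```python
from collections import Counter


def copy_number_ok(leaf_species, ref_species, sp_min=3, copy_max=2,
                   sp_min_2copies=0, copy_in_ref=None, groups=None):
    """Hypothetical stand-in: leaf_species holds one species name per gene."""
    counts = Counter(leaf_species)

    # Enough distinct species in the tree.
    if len(counts) < sp_min:
        return False

    # No species exceeds the allowed number of copies.
    if any(n > copy_max for n in counts.values()):
        return False

    # Reference species that are present carry the expected copy number
    # (assumption: absent reference species are not penalised here).
    if copy_in_ref is not None:
        if any(counts[sp] != copy_in_ref for sp in ref_species if sp in counts):
            return False

    # Enough species with exactly two copies, overall or within every group.
    two_copies = {sp for sp, n in counts.items() if n == 2}
    for group in (groups or [set(counts)]):
        if len(two_copies & set(group)) < sp_min_2copies:
            return False
    return True
```

For instance, a tree with two copies in each of two duplicated species and one copy in a single outgroup passes with `sp_min_2copies=2` and `copy_in_ref=1`, while adding a third copy in any species makes it fail the default `copy_max=2`.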

scripts.lorelei.constrained_aore_lore_topologies.extract_subtrees(tree, ali, target_species, ref_species, treedir, outali, olore, oaore, species_groups, restrict_sp=None)

For a full gene tree, extracts subtrees and builds AORe and LORe gene tree topologies for them. Writes the AORe and LORe trees to files in nhx format, along with the corresponding multiple alignment in fasta format.

Parameters
  • tree (str) – tree file in nhx format for the considered gene family

  • ali (str) – alignment fasta file for the considered gene family

  • target_species (list of str) – duplicated and outgroup species

  • ref_species (list of str) – outgroup species

  • treedir (str) – directory with SCORPiOs constrained gene tree topologies

  • outali (str) – output directory for the alignment

  • olore (str) – output directory for the LORe topology (must already exist)

  • oaore (str) – output directory for the AORe topology (must already exist)

  • species_groups (list of str) – groups of species for the LORe topology

  • restrict_sp (list of str, optional) – restrict the set of duplicated species to this set

scripts.lorelei.constrained_aore_lore_topologies.get_scorpios_aore_tree(gene_list, treefile, outgroups, outgr_gene)

Loads the AORe gene tree built by SCORPiOs.

Parameters
  • gene_list (dict) – dict of gene_names (key) : species_names (value) to keep in the tree

  • treefile (str) – name of the input tree file

  • outgroups (list of str) – list of outgroup species to keep/add in tree

  • outgr_gene (str) – name of the outgroup gene

Returns

the loaded tree

Return type

ete3.Tree

scripts.lorelei.constrained_aore_lore_topologies.get_species_groups(speciestree, dup_anc, outgroups, restrict_sp=None, groups_by_anc=None)

Gets the two species groups diverging at a given speciation point (dup_anc), plus one group of outgroup species.

Parameters
  • speciestree (str) – filename for the newick species tree

  • dup_anc (str) – speciation point to consider

  • outgroups (list of str) – list of outgroup species to include in the outgroup group.

  • restrict_sp (list of str, optional) – restrict the set of duplicated species to this set

  • groups_by_anc (str) – names of the ancestors of the two species groups to extract, comma-separated (overrides the dup_anc argument).

Returns

the 3 groups of species

Return type

list of str

scripts.lorelei.constrained_aore_lore_topologies.make_tree_from_groups(subtree_leaves, species_groups, groups_are_genes=False)

Builds a gene tree from groups of species or groups of genes.

Parameters
  • subtree_leaves (list of ete3.nodes) – all genes to place in the tree

  • species_groups (list of str) – species to group together (first group is outgroup)

  • groups_are_genes (bool, optional) – set to True if species_groups are groups of genes

Returns

the resulting gene tree

Return type

ete3.Tree
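
As a rough illustration of grouping genes into clades by species group, the sketch below builds a Newick string with the standard library only. It is not the module's implementation: `newick_from_groups` and `clade` are hypothetical names, genes are given as (gene name, species) pairs rather than ete3 nodes, `species_groups` is assumed to be a list of species lists with the outgroup first, and each group is left as an unresolved polytomy.

```python
def clade(names):
    """A single gene stays a leaf; several genes form an unresolved clade."""
    return names[0] if len(names) == 1 else "(" + ",".join(names) + ")"


def newick_from_groups(genes, species_groups):
    """Hypothetical sketch: one clade per species group, outgroup group first.

    genes: list of (gene_name, species) pairs.
    species_groups: list of lists of species names.
    """
    clades = []
    for group in species_groups:
        members = sorted(name for name, sp in genes if sp in group)
        if members:  # skip groups without any gene in this family
            clades.append(clade(members))
    return "(" + ",".join(clades) + ");"


genes = [("gar1", "gar"), ("sal1", "salmon"), ("sal2", "salmon"), ("pike1", "pike")]
print(newick_from_groups(genes, [["gar"], ["salmon"], ["pike"]]))
# prints (gar1,(sal1,sal2),pike1);
```

The resulting string can be loaded by any Newick parser. Placing both copies of a species group in the same clade corresponds to the grouping by species that the LORe topology relies on.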

scripts.lorelei.fix_rideogram module

Fix RIdeogram karyotype figure by adding a legend and title to it.

Example:

$ python -m scripts.lorelei.fix_rideogram -i fig.svg -o out.svg -l AORe LORe [-c 2] [-t '']

scripts.lorelei.fix_rideogram.add_legend(input_svg, legend_svg, outfile)

Create a new svg by putting one svg on top of another.

Parameters
  • input_svg (str) – name for first svg file

  • legend_svg (str) – name for second svg file (will be drawn on top of first)

  • outfile (str) – name for the output figure file

scripts.lorelei.fix_rideogram.make_legend(outfilename, title, colors, labels)

Plots to file a matplotlib figure with only legend and title.

Parameters
  • outfilename (str) – name for the output figure file

  • title (str) – title for the figure

  • colors (list) – ordered list of colors for the legend

  • labels (list) – ordered list of labels for the legend

scripts.lorelei.homeologs_pairs_from_ancestor module

Loads gene tree classes and groups them by ancestrally duplicated chromosome pairs.

An ancestral karyotype or a non-duplicated outgroup can be used as a proxy to the ancestral pre-duplication genome.

Example:

$ python -m scripts.lorelei.homeologs_pairs_from_ancestor TODO

scripts.lorelei.homeologs_pairs_from_ancestor.load_acc(input_file)

Loads accepted corrections.

Parameters

input_file (str) – path to the input file.

Returns

all family ids (outgroup gene name) for which correction was accepted

Return type

set

scripts.lorelei.homeologs_pairs_from_ancestor.load_combin(input_file, genes)

Loads the SCORPiOs family combination file (gives family correspondence across multiple outgroups).

Parameters
  • input_file (str) – path to the family combination file

  • genes (set) – set of genes of the outgroup used as reference

Returns

genes in the reference outgroup to genes in the non-reference outgroups

Return type

dict

scripts.lorelei.homeologs_pairs_from_ancestor.load_outgr_fam(input_file, ctrees=None)

Loads SCORPiOs teleost families file.

Parameters
  • input_file (str) – path to the input file

  • ctrees (set, optional) – set of families to load; by default, everything is loaded.

Returns

for each family, identified by the outgroup gene (key, str), teleost genes (value, set)

Return type

dict

scripts.lorelei.homeologs_pairs_from_ancestor.load_pm(input_file, is_post_dup=False)

Loads predicted homeolog names for teleost gene families.

Parameters

input_file (str) – path to the input file (output from the paralogy_map pipeline)

Returns

for each gene family (key, a set of teleost genes), its corresponding homeolog chromosome (value).

Return type

dict

scripts.lorelei.homeologs_pairs_from_ancestor.load_summary(input_file, accepted)

Loads SCORPiOs summary of synteny-sequence trees inconsistencies.

Parameters
  • input_file (str) – path to the input file

  • accepted (str) – path to the file with accepted corrections (inconsistent trees that have been corrected are now considered consistent)

Returns

for each outgroup gene family identifier (key, str), whether trees and synteny predictions are consistent or inconsistent (value, str).

Return type

dict

scripts.lorelei.homeologs_pairs_from_ancestor.outgroup_genes_to_homeologs(fam_outgr, fam_homeo)

Combines teleost genes in SCORPiOs families and paralogy map result to assign SCORPiOs families to homeologs.

Parameters
  • fam_outgr (dict) – for each gene family (name of the outgroup gene, str, key) the set of genes (value)

  • fam_homeo (dict) – for each gene family (set of genes, set, key) its homeolog (value)

Returns

for each gene family (name of the outgroup gene, str, key), its homeolog (value)

Return type

dict
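
The combination step can be pictured with the following sketch. The matching rule is an assumption, not taken from the source: each SCORPiOs family is given the homeolog of the paralogy-map family with which it shares the most genes, and families with no shared gene are left out. `assign_homeologs` is a hypothetical name, and the keys of `fam_homeo` are assumed to be frozensets so that they are hashable.

```python
def assign_homeologs(fam_outgr, fam_homeo):
    """Hypothetical sketch: map outgroup gene -> homeolog by largest gene overlap.

    fam_outgr: {outgroup_gene: set of genes in the family}
    fam_homeo: {frozenset of genes: homeolog name}
    """
    assigned = {}
    for outgr_gene, genes in fam_outgr.items():
        best, best_overlap = None, 0
        for fam_genes, homeolog in fam_homeo.items():
            overlap = len(genes & fam_genes)
            if overlap > best_overlap:
                best, best_overlap = homeolog, overlap
        if best is not None:  # families without any shared gene stay unassigned
            assigned[outgr_gene] = best
    return assigned


fam_outgr = {"gar_g1": {"a", "b"}, "gar_g2": {"c"}, "gar_g3": {"z"}}
fam_homeo = {frozenset({"a", "b", "x"}): "1a", frozenset({"c", "d"}): "1b"}
print(assign_homeologs(fam_outgr, fam_homeo))
# prints {'gar_g1': '1a', 'gar_g2': '1b'}
```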

scripts.lorelei.homeologs_pairs_from_ancestor.write_counter(counter, output_file)

Writes a dict object to file.

Parameters
  • counter (dict) – input dict

  • output_file (str) – name of the output file

scripts.lorelei.homeologs_pairs_from_ancestor.write_output(d_homeo, ctrees, output_all, output_incons)

Writes the number of trees per homeolog.

Parameters
  • d_homeo (dict) – family to homeologs correspondence

  • ctrees (dict) – family to synteny-sequence conflicts

  • output_all (str) – output file to write all considered families

  • output_incons (str) – output file to write sequence-synteny inconsistent families

scripts.lorelei.homeologs_tree_conflicts module

Barplots and hypergeometric tests.

Example:

$ python -m scripts.lorelei.homeologs_tree_conflicts TODO

scripts.lorelei.homeologs_tree_conflicts.barplot(data, output, title='', xlabel='', ylabel='', avg=None, avg_lab='', highlight_over='', highlight_under='', sign_all=False, sign_up_only=False, sign_down_only=False)

Plots data as a barplot and highlights significant enrichment and/or depletion. Saves the plot to file.

Parameters
  • data (tuple of lists) – categorical input data: categories, proportion of observed counts, enrichment or depletion, whether the null hypothesis is rejected, and adjusted p-values.

  • output (str) – output file name for the figure

  • title (str, optional) – title for the plot

  • xlabel (str, optional) – label for the x axis

  • ylabel (str, optional) – label for the y axis

  • avg (float, optional) – average over bars, plotted as a dashed line if given

  • avg_lab (str, optional) – label to give to the average line

  • highlight_under (str, optional) – color for bars under average

  • highlight_over (str, optional) – color for bars over average

  • sign_all (bool, optional) – highlight significant enrichment & depletion with stars

  • sign_up_only (bool, optional) – highlight significant enrichment with stars

  • sign_down_only (bool, optional) – highlight significant depletion with stars

scripts.lorelei.homeologs_tree_conflicts.hypergeom_enrich_depl(data_obs, data_tot, alpha=0.05, multitest_adjust='fdr_bh')

Hypergeometric tests for enrichment and depletion with multiple testing correction.

Parameters
  • data_obs (dict) – for each category, observed counts

  • data_tot (dict) – for each category, total number of objects

  • alpha (float) – significance level

  • multitest_adjust (str) – method to adjust p-values for multiple testing

Returns

categories, corresponding proportion of observed counts, enrichment or depletion, whether the null hypothesis is rejected, and adjusted p-values.

Return type

tuple of lists
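
The sketch below shows one way such a test could be set up with the standard library only. It is an illustration under stated assumptions, not the module's code: the population is taken to be the sum of `data_tot`, the number of successes the sum of `data_obs`, each category is tested one-sided in the direction of its observed deviation, and p-values are adjusted with a hand-written Benjamini-Hochberg procedure (the method that the default `'fdr_bh'` name suggests). All function names are hypothetical, and every category total is assumed to be positive.

```python
from math import comb


def hypergeom_pmf(k, M, K, n):
    """P(X = k) when drawing n items from M, of which K are successes."""
    return comb(K, k) * comb(M - K, n - k) / comb(M, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich_depl_sketch(data_obs, data_tot, alpha=0.05):
    M = sum(data_tot.values())  # all objects
    K = sum(data_obs.values())  # all observed "successes"
    cats = sorted(data_tot)
    props, direction, pvals = [], [], []
    for cat in cats:
        n, k = data_tot[cat], data_obs.get(cat, 0)
        props.append(k / n)
        if k * M >= K * n:  # proportion at or above the overall proportion
            direction.append("enrichment")
            support = range(k, min(n, K) + 1)            # P(X >= k)
        else:
            direction.append("depletion")
            support = range(max(0, n - (M - K)), k + 1)  # P(X <= k)
        pvals.append(min(1.0, sum(hypergeom_pmf(x, M, K, n) for x in support)))
    adjusted = benjamini_hochberg(pvals)
    reject = [p <= alpha for p in adjusted]
    return cats, props, direction, reject, adjusted
```

With counts `{"chr1": 9, "chr2": 1, "chr3": 5}` out of 10 objects per category, this sketch flags chr1 as enriched and chr2 as depleted at `alpha=0.05`, and leaves chr3 unflagged.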

scripts.lorelei.homeologs_tree_conflicts.load_counts(input_file)

Loads input data for the hypergeometric test: each line is a count category pair, space-separated.

Parameters

input_file (str) – path to the input file

Returns

for each category (chromosomes), as keys, the count (value)

Return type

dict
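
A minimal parsing sketch for this kind of input, assuming the count comes first and the category second, as the description's wording suggests. `parse_counts` is a hypothetical name and takes an iterable of lines rather than a file path, so it can be tried without any file on disk.

```python
def parse_counts(lines):
    """Hypothetical sketch: turn 'count category' lines into {category: count}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        count, category = line.split(maxsplit=1)
        counts[category] = int(count)
    return counts


print(parse_counts(["12 chr1\n", "3 chr2\n", "\n"]))
# prints {'chr1': 12, 'chr2': 3}
```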

scripts.lorelei.homeologs_tree_conflicts.plot_sign(ax, to_highlight=None)

Adds a star for significant p-values on an existing barplot.

Parameters
  • ax (matplotlib.Axes) – matplotlib figure (axis object) to update

  • to_highlight (list of int) – x values for significant bars

scripts.lorelei.make_rideograms_inputs module

Writes files that can be read by RIdeogram to draw a karyotype with overlaid features.

Example:

$ python -m scripts.lorelei.make_rideograms_inputs TODO

scripts.lorelei.make_rideograms_inputs.features_to_ide(genome, features_file, karyo, output, to_load=None)

Writes the gene to gene family class to file, to use as input to RIdeogram.

Parameters
  • genome (scripts.synteny.mygenome.Genome) – genome of the species for which to extract classes

  • features_file (str) – path to the input file with gene family classes

  • karyo (list) – ordered list of chromosomes

  • output (str) – name for the output file

  • to_load (list, optional) – load only genes of given classes

scripts.lorelei.make_rideograms_inputs.load_features(genome, features_file, to_load=None)

Loads a 3-column tab-delimited file with gene_family_name, genes and gene_family class.

Parameters
  • genome (scripts.synteny.mygenome.Genome) – genome of the species for which to extract classes

  • features_file (str) – path to the input file with gene family classes

  • to_load (list, optional) – load only genes of given classes

Returns

for genes in the input genome (key) gives the gene family class (value)

Return type

dict

scripts.lorelei.make_rideograms_inputs.make_karyo(genesfile, output, fomt='bed', min_size=None)

Makes a karyotype file for drawing with RIdeogram from a bed file with gene coordinates.

Parameters
  • genesfile (str) – input file with genes coordinates

  • output (str) – output file name

  • fomt (str, optional) – input format, either bed or dyogen

Returns

a tuple containing:

genome (scripts.synteny.mygenome.Genome): genome of the species for which to extract classes

karyo (list): ordered set of chromosomes.

Return type

tuple
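
To make the idea concrete, the sketch below derives one karyotype row per chromosome from bed-like lines, using the largest gene end coordinate as the chromosome length. It assumes a three-column Chr, Start, End karyotype layout and keeps chromosomes in order of first appearance. The ordering used by the module, its handling of `min_size`, and the `Genome` object it returns are not covered here. `karyotype_rows` is a hypothetical name.

```python
def karyotype_rows(bed_lines):
    """Hypothetical sketch: (chrom, 0, max gene end) per chromosome."""
    ends = {}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip malformed or empty lines
            continue
        chrom, end = fields[0], int(fields[2])
        ends[chrom] = max(ends.get(chrom, 0), end)
    # dicts keep insertion order, so chromosomes appear as first seen
    return [(chrom, 0, end) for chrom, end in ends.items()]


bed = ["1\t100\t500\tgeneA\n", "2\t50\t300\tgeneB\n", "1\t800\t1200\tgeneC\n"]
for row in karyotype_rows(bed):
    print("\t".join(map(str, row)))
```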

scripts.lorelei.make_rideograms_inputs.strip_chr_name(chr_name)

Tries to strip a chromosome name so that it can be converted to an int for RIdeogram.

Parameters

chr_name (str) – chr name

Returns

stripped chr name

Return type

str
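
One plausible stripping rule, given only as an illustration: drop any leading non-digit prefix such as "chr" or "LG", and return the name unchanged when no digit follows. The rule actually applied by the module is not documented here, and `strip_prefix` is a hypothetical name.

```python
import re


def strip_prefix(chr_name):
    """Hypothetical sketch: remove a leading non-digit prefix, if digits follow."""
    stripped = re.sub(r"^\D+", "", chr_name)
    # names without any digit (for example "chrX") are returned untouched
    return stripped if stripped else chr_name


print(strip_prefix("chr12"))  # prints 12
print(strip_prefix("chrX"))   # prints chrX
```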

scripts.lorelei.write_ancgenes_treeclass module

Writes a 3-column file for gene families, giving the family_id, the genes in the family and its class. The class can be, for instance, LORe or AORe, a tree cluster, or synteny consistency.

Example:

$ python -m scripts.lorelei.write_ancgenes_treeclass TODO

scripts.lorelei.write_ancgenes_treeclass.load_gene_list(input_summary, input_acc=None)

Loads a tab-delimited summary of tree classes.

Parameters
  • input_summary (str) – path to the two-column tab-delimited input file, giving a family_id to tree class correspondence. The family_id should be the name of the corresponding tree file for write_ancgenes to work properly.

  • input_acc (str, optional) – if the input is a SCORPiOs-generated summary of sequence-synteny inconsistent trees, provide here the summary of accepted corrections. Gene trees that were initially found to be synteny-inconsistent but were later corrected should be defined as consistent.

Returns

for each gene family, the corresponding gene tree class

Return type

dict
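
The override described for `input_acc` can be sketched as follows. This is not the module's code: `load_classes` is a hypothetical name, the accepted corrections are passed as a set of family ids rather than read from a file (its format is not described here), and the label "consistent" is a placeholder for whatever class name the real summary uses.

```python
def load_classes(summary_lines, accepted_ids=frozenset()):
    """Hypothetical sketch: {family_id: class}, with accepted corrections reset.

    summary_lines: iterable of 'family_id<TAB>class' lines.
    accepted_ids: family ids whose correction was accepted.
    """
    classes = {}
    for line in summary_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        family_id, tree_class = line.split("\t")
        if family_id in accepted_ids:
            tree_class = "consistent"  # placeholder label, an assumption
        classes[family_id] = tree_class
    return classes


summary = ["fam1\tinconsistent\n", "fam2\tconsistent\n", "fam3\tinconsistent\n"]
print(load_classes(summary, accepted_ids={"fam3"}))
# prints {'fam1': 'inconsistent', 'fam2': 'consistent', 'fam3': 'consistent'}
```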

scripts.lorelei.write_ancgenes_treeclass.write_ancgenes(clustered_genes, treedir, out_ancgenes, clusters_to_load=None)

Writes the output 3-column file, tab-separated.

Parameters
  • clustered_genes (dict) – class of gene families

  • treedir (str) – path to the gene trees

  • out_ancgenes (str) – name of the output file

  • clusters_to_load (list, optional) – write only entries for these given family classes.

scripts.lorelei.write_ancgenes_treeclass.write_summary(summary_dict, output_file)

Writes a simpler 2-column file with family_id and family class.

Parameters
  • summary_dict (dict) – class of gene families

  • output_file (str) – name of the output file