scripts.lorelei package
scripts.lorelei.constrained_aore_lore_topologies module
Script that browses the SCORPiOs-corrected tree forest to build constrained AORe and LORe tree topologies.
- Example::
$ python -m scripts.lorelei.constrained_aore_lore_topologies TODO
-
scripts.lorelei.constrained_aore_lore_topologies.
check_aore_consistent_tree
(subtree, outgroups, dup_sp) Checks that groups in synteny-consistent trees look correct and uses them to build the AORe tree.
- Parameters
subtree (ete3.subtree) – AORe tree
outgroups (list of str) – list of outgroup species
dup_sp (list of str) – list of duplicated species
- Returns
- a tuple containing:
tree (ete3.Tree): resulting gene tree, None if the tree is fishy
outgr_gene (str) : one outgroup gene name, to identify the tree, None if the tree is fishy
- Return type
tuple
-
scripts.lorelei.constrained_aore_lore_topologies.
check_copy_number
(tree, ref_species, sp_min=3, copy_max=2, sp_min_2copies=0, copy_in_ref=None, groups=None) Checks the number of gene copies in an input tree.
- Parameters
tree (ete3.Tree) – input tree
ref_species (list of str) – list of outgroup species
sp_min (int) – minimal number of species in the tree
copy_max (int) – maximal number of gene copies for any species in tree
sp_min_2copies (int) – minimal number of species with 2 gene copies
copy_in_ref (int) – number of gene copies expected in ref species
groups (list of str) – groups of species, if provided, sp_min_2copies has to be verified for all groups.
- Returns
True if criteria are met, False otherwise
- Return type
bool
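The criteria listed above can be illustrated with a minimal sketch. It assumes the tree's leaves have already been flattened to (gene, species) pairs; the helper name `copy_number_ok` and this input shape are hypothetical, whereas the documented function takes an ete3.Tree. The `groups` argument is left out, and counting "2 copies" as exactly two is an assumption.

```python
from collections import Counter


def copy_number_ok(leaves, ref_species, sp_min=3, copy_max=2,
                   sp_min_2copies=0, copy_in_ref=None):
    """Hypothetical stand-in for check_copy_number on (gene, species) pairs."""
    # Number of gene copies found for each species.
    copies = Counter(species for _gene, species in leaves)

    # Enough species in the tree?
    if len(copies) < sp_min:
        return False
    # No species above the maximal copy number?
    if any(n > copy_max for n in copies.values()):
        return False
    # Enough species with exactly two copies (assumption: "2 copies" is exact)?
    two_copies = sum(1 for n in copies.values() if n == 2)
    if two_copies < sp_min_2copies:
        return False
    # Optional check on the copy number in every reference species.
    if copy_in_ref is not None:
        if any(copies.get(sp, 0) != copy_in_ref for sp in ref_species):
            return False
    return True
```

Each criterion returns False as soon as it fails, so the cheapest checks run first.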
-
scripts.lorelei.constrained_aore_lore_topologies.
extract_subtrees
(tree, ali, target_species, ref_species, treedir, outali, olore, oaore, species_groups, restrict_sp=None) For a full gene tree, extracts subtrees and builds AORe and LORe gene tree topologies for them. Writes the AORe and LORe trees to file in nhx format and the corresponding multiple alignment in fasta format.
- Parameters
tree (str) – tree file in nhx format for the considered gene family
ali (str) – alignment fasta file for the considered gene family
target_species (list of str) – duplicated+outgroup species
ref_species (list of str) – outgroup(s) species
treedir (str) – directory with SCORPiOs constrained gene tree topologies
outali (str) – output directory for the alignment
olore (str) – output directory for the lore topology (should exist)
oaore (str) – output directory for the aore topology (should exist)
species_groups (list of str) – groups of species for the LORe topology
restrict_sp (list of str, optional) – restrict the set of duplicated species to this set
-
scripts.lorelei.constrained_aore_lore_topologies.
get_scorpios_aore_tree
(gene_list, treefile, outgroups, outgr_gene) Loads the AORe gene tree built by SCORPiOs.
- Parameters
gene_list (dict) – dict of gene_names (key) : species_names (value) to keep in the tree
treefile (str) – name of the input tree file
outgroups (list of str) – list of outgroup species to keep/add in tree
outgr_gene (str) – name of the outgroup gene
- Returns
the loaded tree
- Return type
ete3.Tree
-
scripts.lorelei.constrained_aore_lore_topologies.
get_species_groups
(speciestree, dup_anc, outgroups, restrict_sp=None, groups_by_anc=None) Gets the 2 species groups diverging at a given speciation point (dup_anc) + 1 group of outgroup species.
- Parameters
speciestree (str) – filename for the newick species tree
dup_anc (str) – speciation point to consider
outgroups (list of str) – list of outgroup species to include in the outgroup group.
restrict_sp (list of str, optional) –
groups_by_anc (str) – ancestral name for the two species groups to extract, comma separated (overrules the dup_anc arg).
- Returns
the 3 groups of species
- Return type
list of str
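A rough sketch of the grouping idea, assuming the species tree is given as a plain `{parent: [children]}` mapping rather than a newick file. The names `leaves_under` and `species_groups`, the order of the returned groups, the use of lists of species for each group, and the way `restrict_sp` filters the two groups are all assumptions made for illustration; `groups_by_anc` is not covered.

```python
def leaves_under(children, node):
    """Return the leaf names below `node` in a {parent: [children]} mapping."""
    if node not in children:  # a node without children is a leaf
        return [node]
    found = []
    for child in children[node]:
        found.extend(leaves_under(children, child))
    return found


def species_groups(children, dup_anc, outgroups, restrict_sp=None):
    """Hypothetical sketch: the two groups below dup_anc, then the outgroups."""
    first, second = children[dup_anc]  # assumes a binary speciation node
    groups = [leaves_under(children, first), leaves_under(children, second)]
    if restrict_sp is not None:
        # Assumed behavior: keep only the requested species in each group.
        groups = [[sp for sp in grp if sp in restrict_sp] for grp in groups]
    return groups + [list(outgroups)]
```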
-
scripts.lorelei.constrained_aore_lore_topologies.
make_tree_from_groups
(subtree_leaves, species_groups, groups_are_genes=False) Builds a gene tree from groups of species or groups of genes.
- Parameters
subtree_leaves (list of ete3.nodes) – all genes to place in the tree
species_groups (list of str) – species to group together (first group is outgroup)
groups_are_genes (bool, optional) – set to True if species_groups are groups of genes
- Returns
resulting gene tree str : one outgroup gene name, to identify the tree
- Return type
ete3.Tree
scripts.lorelei.fix_rideogram module
Fix RIdeogram karyotype figure by adding a legend and title to it.
Example:
$ python -m scripts.lorelei.fix_rideogram -i fig.svg -o out.svg -l AORe LORe [-c 2] [-t '']
-
scripts.lorelei.fix_rideogram.
add_legend
(input_svg, legend_svg, outfile) Creates a new svg by putting one svg on top of another.
- Parameters
input_svg (str) – name for first svg file
legend_svg (str) – name for second svg file (will be drawn on top of first)
outfile (str) – name for the output figure file
-
scripts.lorelei.fix_rideogram.
make_legend
(outfilename, title, colors, labels) Plots to file a matplotlib figure with only legend and title.
- Parameters
outfilename (str) – name for the output figure file
title (str) – title for the figure
colors (list) – ordered list of colors for the legend
labels (list) – ordered list of labels for the legend
scripts.lorelei.homeologs_pairs_from_ancestor module
Loads gene tree classes and groups them by ancestrally duplicated chromosome pairs.
An ancestral karyotype or a non-duplicated outgroup can be used as a proxy to the ancestral pre-duplication genome.
Example:
$ python -m scripts.lorelei.homeologs_pairs_from_ancestor TODO
-
scripts.lorelei.homeologs_pairs_from_ancestor.
load_acc
(input_file) Loads accepted corrections.
- Parameters
input_file (str) – path to the input file.
- Returns
all family ids (outgroup gene name) for which correction was accepted
- Return type
set
-
scripts.lorelei.homeologs_pairs_from_ancestor.
load_combin
(input_file, genes) Loads the SCORPiOs family combination file (gives the family correspondence across multiple outgroups).
- Parameters
input_file (str) – path to the family combination file
genes (set) – set of the genes of the outgroup used as reference
- Returns
genes in the reference outgroup to genes in the non-reference outgroups
- Return type
dict
-
scripts.lorelei.homeologs_pairs_from_ancestor.
load_outgr_fam
(input_file, ctrees=None) Loads SCORPiOs teleost families file.
- Parameters
input_file (str) – path to the input file
ctrees (set, optional) – list of families to load, by default everything is loaded.
- Returns
for each family, identified by the outgroup gene (key, str), teleost genes (value, set)
- Return type
dict
-
scripts.lorelei.homeologs_pairs_from_ancestor.
load_pm
(input_file, is_post_dup=False) Loads predicted homeolog names for teleost gene families.
- Parameters
input_file (str) – path to the input file (output from the paralogy_map pipeline)
- Returns
for each gene family (key, a set of teleost genes) its corresponding homeolog chromosome (value).
- Return type
dict
-
scripts.lorelei.homeologs_pairs_from_ancestor.
load_summary
(input_file, accepted) Loads SCORPiOs summary of synteny-sequence tree inconsistencies.
- Parameters
input_file (str) – path to the input file
accepted (str) – path to the file with accepted correction (inconsistent trees which have been corrected are now consistent)
- Returns
for each outgroup gene family identifier (key, str) whether trees and synteny predictions are consistent or inconsistent (value, str).
- Return type
dict
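The relabeling rule described for `accepted` (inconsistent trees whose correction was accepted count as consistent) can be sketched on in-memory data. The helper name `apply_accepted` and the literal status strings "consistent" and "inconsistent" are assumptions; the documented function reads both inputs from files whose format is not shown here.

```python
def apply_accepted(summary, accepted):
    """Hypothetical helper: mark families with an accepted correction as consistent.

    summary  -- dict mapping family id to "consistent" or "inconsistent" (assumed labels)
    accepted -- set of family ids for which the correction was accepted
    """
    return {
        fam: ("consistent" if fam in accepted else status)
        for fam, status in summary.items()
    }
```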
-
scripts.lorelei.homeologs_pairs_from_ancestor.
outgroup_genes_to_homeologs
(fam_outgr, fam_homeo) Combines teleost genes in SCORPiOs families and paralogy map result to assign SCORPiOs families to homeologs.
- Parameters
fam_outgr (dict) – for each gene family (name of the outgroup gene, str, key) the set of genes (value)
fam_homeo (dict) – for each gene family (set of genes, set, key) its homeolog (value)
- Returns
for each gene family (name of the outgroup gene, str, key), its homeolog (value)
- Return type
dict
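One way to combine the two dictionaries is to index genes by homeolog and then look up each SCORPiOs family, as in the sketch below. It assumes the keys of `fam_homeo` are hashable gene collections (for instance frozensets) and that a family is assigned only when all its mapped genes agree on one homeolog; both points, and the name `assign_homeologs`, are assumptions rather than the documented behavior.

```python
def assign_homeologs(fam_outgr, fam_homeo):
    """Hypothetical sketch: map each outgroup-gene family id to one homeolog."""
    # Index every gene by the homeolog of its paralogy-map family.
    gene_to_homeo = {}
    for genes, homeolog in fam_homeo.items():
        for gene in genes:
            gene_to_homeo[gene] = homeolog

    result = {}
    for outgr_gene, genes in fam_outgr.items():
        homeologs = {gene_to_homeo[g] for g in genes if g in gene_to_homeo}
        # Assumption: keep only families with a single, unambiguous homeolog.
        if len(homeologs) == 1:
            result[outgr_gene] = homeologs.pop()
    return result
```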
-
scripts.lorelei.homeologs_pairs_from_ancestor.
write_counter
(counter, output_file) Writes a dict object to file.
- Parameters
counter (dict) – input dict
output_file (str) – name of the output file
-
scripts.lorelei.homeologs_pairs_from_ancestor.
write_output
(d_homeo, ctrees, output_all, output_incons) Writes the number of trees per homeolog.
- Parameters
d_homeo (dict) – family to homeologs correspondence
ctrees (dict) – family to synteny-sequence conflicts
output_all (str) – output file to write all considered families
output_incons (str) – output file to write sequence-synteny inconsistent families
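The counting step behind these two output files can be sketched with collections.Counter. The helper name `count_per_homeolog`, the "inconsistent" label, and the choice to skip families absent from `ctrees` are assumptions; writing the counts to file is omitted.

```python
from collections import Counter


def count_per_homeolog(d_homeo, ctrees):
    """Hypothetical sketch: count all and inconsistent families per homeolog."""
    all_counts = Counter()
    incons_counts = Counter()
    for fam, homeolog in d_homeo.items():
        if fam not in ctrees:  # assumption: ignore families without a status
            continue
        all_counts[homeolog] += 1
        if ctrees[fam] == "inconsistent":  # assumed status label
            incons_counts[homeolog] += 1
    return all_counts, incons_counts
```

The two counters have the shape expected by a per-category enrichment test: total objects and observed objects for each homeolog.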
scripts.lorelei.homeologs_tree_conflicts module
Barplots and hypergeometric tests.
Example:
$ python -m scripts.lorelei.homeologs_tree_conflicts TODO
-
scripts.lorelei.homeologs_tree_conflicts.
barplot
(data, output, title='', xlabel='', ylabel='', avg=None, avg_lab='', highlight_over='', highlight_under='', sign_all=False, sign_up_only=False, sign_down_only=False) Plots data as a barplot and highlights significant enrichment and/or depletion. Saves the plot to file.
- Parameters
data (tuple of lists) – categorical input data, with each tuple containing the category, the proportion of observed counts, enrichment or depletion, whether the null hypothesis is rejected, and the adjusted p-value.
output (str) – output file name for the figure
title (str, optional) – title for the plot
xlabel (str, optional) – label for the x axis
ylabel (str, optional) – label for the y axis
avg (float, optional) – average over bars, plotted as a dashed line if given
avg_lab (str, optional) – label to give to the average line
highlight_under (str, optional) – color for bars under average
highlight_over (str, optional) – color for bars over average
sign_all (bool, optional) – highlight significant enrichment & depletion with stars
sign_up_only (bool, optional) – highlight significant enrichment with stars
sign_down_only (bool, optional) – highlight significant depletion with stars
-
scripts.lorelei.homeologs_tree_conflicts.
hypergeom_enrich_depl
(data_obs, data_tot, alpha=0.05, multitest_adjust='fdr_bh') Hypergeometric tests for enrichment and depletion with multiple testing correction.
- Parameters
data_obs (dict) – for each category, observed counts
data_tot (dict) – for each category, total number of objects
alpha (float) – significance level
multitest_adjust (str) – method to adjust pvalues for multiple testing
- Returns
categories, corresponding proportion of observed counts, enrichment or depletion, whether null hyp. is rejected, and adjusted p-values.
- Return type
tuple of lists
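To make the test concrete, here is a standard-library sketch of one-sided hypergeometric p-values per category, followed by a Benjamini-Hochberg adjustment. Which libraries the module actually relies on is not shown here, so the function names below are hypothetical and the hand-written adjustment is a stand-in for whatever `multitest_adjust='fdr_bh'` selects. The sketch assumes each observed count is at most the category total.

```python
from math import comb


def hypergeom_pmf(k, M, n, N):
    """P(X = k) when drawing N objects out of M, n of which are 'observed'."""
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


def enrich_depl_pvalues(data_obs, data_tot):
    """Return {category: (p_enrichment, p_depletion)} (hypothetical helper)."""
    M = sum(data_tot.values())  # all objects, over all categories
    n = sum(data_obs.values())  # all observed objects, over all categories
    pvals = {}
    for cat, N in data_tot.items():
        k = data_obs.get(cat, 0)
        lo, hi = max(0, N - (M - n)), min(n, N)  # support of the distribution
        p_enrich = sum(hypergeom_pmf(i, M, n, N) for i in range(k, hi + 1))
        p_deplete = sum(hypergeom_pmf(i, M, n, N) for i in range(lo, k + 1))
        pvals[cat] = (p_enrich, p_deplete)
    return pvals


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A category would then be flagged when its adjusted p-value falls below `alpha`.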
-
scripts.lorelei.homeologs_tree_conflicts.
load_counts
(input_file) Loads input data for hypergeom test: each line is a count category pair, space-separated.
- Parameters
input_file (str) – path to the input file
- Returns
for each category (chromosomes), as keys, the count (value)
- Return type
dict
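The input layout described above (count first, then category, separated by a space) could be parsed as follows. The helper name `parse_counts` is hypothetical, and it takes an iterable of lines instead of a file path so that it can be tried without a file.

```python
def parse_counts(lines):
    """Hypothetical sketch: parse 'count category' lines into {category: count}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        count, category = line.split(maxsplit=1)
        counts[category] = int(count)
    return counts
```

With a real file, the same function can be called as `parse_counts(open(path))`, since a file object iterates over its lines.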
-
scripts.lorelei.homeologs_tree_conflicts.
plot_sign
(ax, to_highlight=None) Adds stars for significant p-values on an existing barplot.
- Parameters
ax (matplotlib.Axes) – matplotlib figure (axis object) to update
to_highlight (list of int) – x values for significant bars
scripts.lorelei.make_rideograms_inputs module
Writes files that can be read by RIdeogram to draw karyotypes with overlaid features.
Example:
$ python -m scripts.lorelei.make_rideograms_inputs TODO
-
scripts.lorelei.make_rideograms_inputs.
features_to_ide
(genome, features_file, karyo, output, to_load=None) Writes the gene to gene family class to file, to use as input to RIdeogram.
- Parameters
genome (scripts.synteny.mygenome.Genome) – genome of the species for which to extract classes
features_file (str) – path to the input file with gene family classes
karyo (list) – ordered set of chromosomes
output (str) – name for the output file
to_load (list, optional) – load only genes of given classes
-
scripts.lorelei.make_rideograms_inputs.
load_features
(genome, features_file, to_load=None) Loads a 3-columns tab-delimited file with gene_family_name, genes and gene_family class.
- Parameters
genome (scripts.synteny.mygenome.Genome) – genome of the species for which to extract classes
features_file (str) – path to the input file with gene family classes
to_load (list, optional) – load only genes of given classes
- Returns
for genes in the input genome (key) gives the gene family class (value)
- Return type
dict
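A sketch of how such a 3-column file might be read. It assumes the genes in the second column are space-separated and replaces the Genome object with a plain set of gene names; both choices, and the name `parse_features`, are assumptions made for illustration.

```python
def parse_features(lines, genome_genes, to_load=None):
    """Hypothetical sketch: map genes of one genome to their family class.

    lines        -- iterable of 'family<TAB>genes<TAB>class' lines
    genome_genes -- set of gene names in the genome of interest (assumed input)
    to_load      -- optional collection of classes to keep
    """
    gene_class = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _family, genes, fam_class = line.split("\t")
        if to_load is not None and fam_class not in to_load:
            continue
        for gene in genes.split():  # assumption: space-separated gene names
            if gene in genome_genes:
                gene_class[gene] = fam_class
    return gene_class
```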
-
scripts.lorelei.make_rideograms_inputs.
make_karyo
(genesfile, output, fomt='bed', min_size=None) Makes a karyotype file for drawing with RIdeogram from a bed file with gene coordinates.
- Parameters
genesfile (str) – input file with gene coordinates
output (str) – output file name
fomt (str, optional) – input format, either bed or dyogen
- Returns
- a tuple containing:
genome (scripts.synteny.mygenome.Genome): genome of the species for which to extract classes
karyo (list): ordered set of chromosomes.
- Return type
tuple
-
scripts.lorelei.make_rideograms_inputs.
strip_chr_name
(chr_name) Tries to strip a chr name so that it can be converted to int for RIdeogram.
- Parameters
chr_name (str) – chr name
- Returns
stripped chr name
- Return type
str
scripts.lorelei.write_ancgenes_treeclass module
Writes a 3-columns file for gene families, giving family_id, genes in the family and its class. Class can be for instance LORe or AORe, tree clustering, synteny consistency etc…
Example:
$ python -m scripts.lorelei.write_ancgenes_treeclass TODO
-
scripts.lorelei.write_ancgenes_treeclass.
load_gene_list
(input_summary, input_acc=None) Loads a tab-delimited summary of tree classes.
- Parameters
input_summary (str) – path to the two-columns tab-delimited input file, giving a family_id to tree class correspondence. The family_id should be the name of the corresponding tree file for write_ancgenes to work properly.
input_acc (str, optional) – if the input is the SCORPiOs-generated summary of sequence-synteny inconsistent trees, provide here the summary of accepted corrections. Gene trees that were initially found to be synteny-inconsistent but were later corrected should be defined as consistent.
- Returns
for each gene family, the corresponding gene tree class
- Return type
dict
-
scripts.lorelei.write_ancgenes_treeclass.
write_ancgenes
(clustered_genes, treedir, out_ancgenes, clusters_to_load=None) Writes the output 3-columns file, tab-separated.
- Parameters
clustered_genes (dict) – class of gene families
treedir (str) – path to the gene trees
out_ancgenes (str) – name of the output file
clusters_to_load (list, optional) – write only entries for these given family classes.
-
scripts.lorelei.write_ancgenes_treeclass.
write_summary
(summary_dict, output_file) Writes a simpler 2-columns file with family_id and family class.
- Parameters
summary_dict (dict) – class of gene families
output_file (str) – name of the output file