scripts.synteny package¶
scripts.synteny.duplicated_families module¶
Script to find all orthology relationships between a group of WGD-duplicated species and a non-duplicated outgroup. These ortholog groups define gene families.
Example:
$ python -m scripts.synteny.duplicated_families -t forest_v89.nhx -n Lepisosteus.oculatus
-d Clupeocephala -s Species_tree_v89.nwk -g genes89/genesST.%s.list.bed [-o out]
[-ow anc1,anc2] [-u ufile]
-
scripts.synteny.duplicated_families.
get_genes_positions
(genes, species, dict_genes)¶ Gets genomic position of given genes of a species.
- Parameters
species (str) – input species name
genes (list of str) – list of the genes to search
dict_genes (dict of str to GeneSpeciesPosition tuples) – genes location
- Returns
genes and their position as a list of GeneSpeciesPosition tuples
- Return type
list
-
scripts.synteny.duplicated_families.
orthologies_with_outgroup
(forest, duplicated_sp, outgroup, dict_genes, out)¶ Browses a gene tree forest and searches for orthologs with the outgroup. Writes genes without phylogenetic orthologs to a file. Also writes files with high-confidence orthologs and paralogs, which are used to optimize the synteny support threshold for calling orthology.
- Parameters
forest (str) – name of the gene trees forest file
duplicated_sp (list of str) – list of all duplicated species for the considered WGD
outgroup (str) – non-duplicated outgroup
dict_genes (dict of GeneSpeciesPosition tuples) – all gene positions for each species
out (str) – output file to write genes without phylogenetic orthologs
- Returns
orthologs of outgroup genes in each duplicated species
- Return type
dict
Note
#FIXME Written to work within scorpios as orthologs and paralogs file names are derived from output file patterns, assuming it contains an ‘_’.
-
scripts.synteny.duplicated_families.
print_out_stats
(stats_dict, wgd='')¶ Prints to stdout some statistics on the families in the phylogenetic Orthology Table.
- Parameters
stats_dict (dict) – a dict counting number of families and genes in the families
wgd (str, optional) – the wgd for which the Orthology Table was built
-
scripts.synteny.duplicated_families.
tag_duplicated_species
(leaves, duplicated)¶ Adds a tag to genes of duplicated species in an ete3.Tree instance, in-place.
- Parameters
leaves (list of ete3.TreeNode) – leaves of the tree
duplicated (list of str) – list of the names of all duplicated species
-
scripts.synteny.duplicated_families.
write_orthologs
(orthos, dicgenomes, dict_genes, outgroup, duplicated_sp, out, min_length=20)¶ Writes to a file gene orthologies between the non-duplicated species and all duplicated species (orthologytable), with all gene names and gene positions. All these gene families are ordered along the outgroup genome in the output.
- Parameters
orthos (dict of str to str to GeneSpeciesPosition tuples) – orthologs of outgroup genes in each duplicated species
dicgenomes (dict of str to mygenome.Genome) – genomes
dict_genes (dict of str to GeneSpeciesPosition tuples) – genes location
outgroup (str) – non-duplicated outgroup
duplicated_sp (list of str) – list of duplicated species to include in the results
out (str) – output file name for genes without orthologs
min_length (int, optional) – minimum length for a chromosome in the outgroup, gene families mapping to smaller chromosomes won’t be included
scripts.synteny.f1_score_optimization module¶
This script loads 2 scores distributions and finds the optimal discriminative threshold to separate distributions based on the F1-score, assuming true positives to recover are in the distribution of higher scores.
Inputs are python lists pickled in files, output is written to file with the --support
prefix, to call the script missed_orthologies.py in snakemake with the --support
arg.
Example:
$ python -m scripts.synteny.f1_score_optimization -i1 scores_1.pkl -i2 scores_2.pkl
[-out out]
-
scripts.synteny.f1_score_optimization.
compute_f1
(scores1, scores2, threshold)¶ Computes the F1-score for a given threshold.
- Parameters
scores1 (list) – list of scores 1
scores2 (list) – list of scores 2
threshold (float) – threshold value
- Returns
F1-score
- Return type
float
-
scripts.synteny.f1_score_optimization.
get_discriminant_threshold
(input1, input2, test_range=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])¶ Finds the most discriminative threshold between the two distributions based on F1-score.
- Parameters
input1, input2 (str) – paths to the pickled objects
test_range (list, optional) – list of thresholds to test
- Returns
optimized threshold based on F1-score
- Return type
int
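The threshold search can be sketched as follows. This is a minimal illustration, not the module's code: the names f1_score and best_threshold are hypothetical, the choice of counting a score as positive when it is greater than or equal to the threshold is an assumption, and the example scores are made up.

```python
def f1_score(positives, negatives, threshold):
    """F1-score when every score >= threshold is called positive (assumed rule)."""
    true_pos = sum(1 for s in positives if s >= threshold)
    false_pos = sum(1 for s in negatives if s >= threshold)
    false_neg = len(positives) - true_pos
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def best_threshold(positives, negatives, test_range=range(30)):
    """Return the first threshold in test_range that reaches the highest F1-score."""
    return max(test_range, key=lambda t: f1_score(positives, negatives, t))


high_scores = [5, 6, 7, 8, 9]  # made-up scores, distribution of higher scores
low_scores = [0, 1, 1, 2, 5]   # made-up scores, distribution of lower scores
print(best_threshold(high_scores, low_scores))
```

Because max keeps the first maximum it meets, ties are resolved in favour of the lowest threshold in test_range.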
-
scripts.synteny.f1_score_optimization.
load_scores
(input1, input2)¶ Unpickles the lists of scores.
- Parameters
input1 (str) – path to pickled object 1
input2 (str) – path to pickled object 2
- Returns
a tuple containing:
scores1, scores2: the unpickled lists
- Return type
tuple
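As a rough sketch of the expected input format, the snippet below pickles two lists and reads them back. The helper name load_pickled_lists and the score values are illustrative only; the file names echo the command-line example above.

```python
import os
import pickle
import tempfile


def load_pickled_lists(path1, path2):
    """Unpickle two files and return their contents as a tuple."""
    with open(path1, "rb") as handle1, open(path2, "rb") as handle2:
        return pickle.load(handle1), pickle.load(handle2)


with tempfile.TemporaryDirectory() as tmpdir:
    path1 = os.path.join(tmpdir, "scores_1.pkl")
    path2 = os.path.join(tmpdir, "scores_2.pkl")
    # Write two made-up score lists in the pickled format the script reads.
    for path, scores in ((path1, [4, 7, 9]), (path2, [0, 1, 2])):
        with open(path, "wb") as handle:
            pickle.dump(scores, handle)
    scores1, scores2 = load_pickled_lists(path1, path2)
```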
scripts.synteny.filter_no_synteny_genes module¶
This script identifies genes in the orthology table that never, in any of their sliding windows, have genes on the same chromosome in the orthology table. A new orthology table is written as output, where the genomic position of these genes is omitted, which forces the other SCORPiOs scripts not to use them in the synteny analysis.
Example:
$ python -m scripts.synteny.filter_no_synteny_genes -i OrthoTable.txt -chr Chr_outgr_file
[-o out] [-w 15]
-
scripts.synteny.filter_no_synteny_genes.
print_out_stats
(stats_dict, wgd='')¶ Prints to stdout some statistics on the genes without synteny support that will be ignored in the SCORPiOs synteny analysis.
- Parameters
stats_dict (dict) – a dict with the number of filtered genes per species
wgd (str, optional) – the wgd for which the filter was run
scripts.synteny.filter_regions module¶
Module with functions to extract gene families having updated synteny information in SCORPiOs iteration n versus iteration n-1.
-
scripts.synteny.filter_regions.
get_genes_to_keep
(orthotable, modified_fam, windowsize)¶ Extracts all families with updated synteny information after a SCORPiOs iteration, i.e. all families within the same window as a modified family.
- Parameters
orthotable (dict) – gene families at iteration n
modified_fam (dict) – modified gene families
- Returns
- a tuple containing:
dict: for each chromosome, families with updated synteny information
list: flat list of updated families (outgroup gene name)
- Return type
tuple
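One way to picture "within the same window as a modified family" is sketched below. It assumes a window is a run of windowsize consecutive families along an outgroup chromosome, so two families share a window when their indices differ by less than windowsize. The function name and the flat-list input are hypothetical simplifications of the dict-based arguments described above.

```python
def families_near_modified(ordered_families, modified, windowsize):
    """Keep families that share at least one window with a modified family."""
    modified_idx = [i for i, fam in enumerate(ordered_families) if fam in modified]
    keep = []
    for i, fam in enumerate(ordered_families):
        # Two families can fall in the same window of `windowsize` consecutive
        # families only if their indices differ by less than `windowsize`.
        if any(abs(i - j) < windowsize for j in modified_idx):
            keep.append(fam)
    return keep


chrom_families = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]  # made-up family ids
kept = families_near_modified(chrom_families, {"g4"}, windowsize=3)
```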
-
scripts.synteny.filter_regions.
get_modified_families
(orthotable, orthotable_prev, corrected_fam, mapping_fam=None)¶ For OrthologyTables of two successive SCORPiOs iterations, find families with updated homologies in iteration n compared to iteration n-1.
Updated families are either (i) a corrected tree or (ii) an outgroup gene in iteration n without duplicated species orthologs in iteration n-1.
- Parameters
orthotable (dict) – gene families at iteration n
orthotable_prev (dict) – gene families at iteration n-1
corrected_fam (list) – list of corrected families
mapping_fam (dict, optional) – when multiple outgroups are used, a dictionary with the correspondence of family ids across outgroups, useful when a tree was corrected using another outgroup
- Returns
modified gene families
- Return type
dict
-
scripts.synteny.filter_regions.
make_region_file
(orthotable_file, orthotable_file_previous, corrections_file, outfile, win=15, file_fam_no_graph='', wgd='', file_combin_graphs='')¶ Builds and writes a file with gene families having updated synteny information in SCORPiOs iteration n versus iteration n-1.
- Parameters
orthotable_file (str) – file with gene families at iteration n
orthotable_file_previous (str) – file with gene families at iteration n-1
corrections_file (str) – file with corrected families
outfile (str) – name of the output file.
win (int) – size of the SCORPiOs sliding window for synteny orthology predictions
file_fam_no_graph (str, optional) – file with families that can’t result in a synteny graph
wgd (str, optional) – the wgd for which the Orthology Table was built
file_combin_graphs (str, optional) – summary file of graphs combination across outgroups
-
scripts.synteny.filter_regions.
print_out_stats
(fam_up, file_fam_no_graph='', wgd='')¶ Prints to stdout some statistics on the families with updated synteny.
- Parameters
fam_up (dict) – a dict listing, for each outgroup chromosome, the updated families
file_fam_no_graph (str, optional) – file with families that can’t result in a synteny graph
wgd (str, optional) – the wgd for which the Orthology Table was built
Reads a file with gene families having updated synteny information in SCORPiOs iteration n versus iteration n-1. Returns a list of regions, which are bounds for the windows to consider in iteration n, and a list of genes whose orthologies can be updated. The two lists differ because a gene can be in a considered window without having updated synteny information.
- Parameters
region_file (str) – input file
chrom (str) – outgroup chromosome considered
windowsize (int) – size of the sliding window
- Returns
- a tuple containing:
regions (list of tuple): list of regions, as tuples (start_index, stop_index), corresponding to index in the OrthologyTable.
genes (list of str): list of genes with updated synteny information
- Return type
tuple
Note
If the region_file is empty, regions is set to None (and we don’t filter regions in SCORPiOs main). If the regions file is not empty, but no family has an updated synteny context, regions is set to [(0, 0)] i.e. no window will be computed on this chromosome.
-
scripts.synteny.filter_regions.
read_combin_file
(file_combin_graphs)¶ Reads the summary of graphs combination across outgroups. Corrected subtrees with another outgroup should be marked as an updated family for all outgroups.
- Parameters
file_combin_graphs (str) – input summary file
- Returns
for each gene in the current outgroup, the corresponding selected graph if from another outgroup
- Return type
dict
-
scripts.synteny.filter_regions.
write_regions_file
(fam_to_keep, outfile)¶ Writes a file with families that have updated synteny information after a SCORPiOs iteration.
- Parameters
fam_to_keep (dict) – for each outgroup chromosome, families with updated synteny info.
outfile (str) – name of the output file.
scripts.synteny.missed_orthologies module¶
This script finds potential orthologs between an outgroup and duplicated species based on synteny, for genes without obvious orthologs in trees.
Example:
$ python -m scripts.synteny.missed_orthologies -i Orthotable -u UncertainGenes -c Chroms
[-o output] [-wgd ''] [-w 0] [-f out]
-
scripts.synteny.missed_orthologies.
find_synteny_orthologs
(input_file, optimize=False, threshold=2.0, opt_fam=None)¶ Browses ingroup genes without phylogenetic orthologs in the outgroup and attempts to find synteny-supported orthologs.
- Parameters
input_file (str) – name of the input file storing genes without orthologs in ingroups
optimize (bool, optional) – option to use if the script is called to optimize the threshold
threshold (float, optional) – synteny support threshold
opt_fam (list, optional) – if defined, restricts the families used for optimization to the ones in this list
- Returns
identified synteny-supported orthologies, stored in nested dict with, for each outgroup gene with newly identified ortholog(s) (GeneSpeciesPosition tuple, key1) and for each duplicated species with such ortholog(s) (str, key2), orthologous gene as GeneSpeciesPosition tuple(s).
- Return type
dict
-
scripts.synteny.missed_orthologies.
load_genes
(genes, outgr=False)¶ Parses an entry in the “no phylogenetic ortholog” file and loads genes as GeneSpeciesPosition namedTuples.
- Parameters
genes (str) – a line of the input file
outgr (bool) – Whether entry of ingroups (True) or outgroup should be parsed (False)
- Returns
- a tuple containing:
dict_genes (dict): for each species (key), genes in the entry (value) as a GeneSpeciesPosition namedtuple
unplaced_genes (dict): stores genes with no gene position entry in the .bed file in a dict of similar structure as dict_genes
- Return type
tuple
-
scripts.synteny.missed_orthologies.
neighbour_outgr_ortholog
(ortho_neighbours, all_outgroup_candidates)¶ Searches for syntenic neighbours between ingroup and outgroup genes. Genes neighbouring ingroup genes have their orthologs in the outgroup stored in ortho_neighbours. This function checks whether these ortho_neighbours are in the neighbourhood of an outgroup gene from all_outgroup_candidates (genes in the same tree as ingroup genes).
- Parameters
ortho_neighbours (list) – list of orthologs of neighbours of ingroup genes, as tuples (chromosome, index)
all_outgroup_candidates (list) – all outgroup genes in the same tree, as a list of GeneSpeciesPosition tuples.
- Returns
list of outgroup genes in the same tree with at least one syntenic neighbour, with repetitions. The number of repetitions indicates the number of syntenic neighbours. For instance, [gene_a, gene_a, gene_b, gene_a, gene_a] indicates that gene_a has four syntenic neighbours with ingroup genes and gene_b has one.
- Return type
list
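The repetition-based encoding of the returned list can be tallied with the standard library. The snippet below is only an illustration of how to read such a list, using the example from the description; it is not taken from the module.

```python
from collections import Counter

# Example list from the description: one entry per syntenic neighbour.
syntenic_hits = ["gene_a", "gene_a", "gene_b", "gene_a", "gene_a"]

# Count how many syntenic neighbours support each outgroup candidate.
support = Counter(syntenic_hits)
best_gene, best_support = support.most_common(1)[0]
```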
-
scripts.synteny.missed_orthologies.
print_out_stats
(stats_dict, wgd='', file_fam_nograph='out_nog')¶ Prints to stdout some statistics on the families in the final Orthology Table.
- Parameters
stats_dict (dict) – a dict counting number of families and genes in the families
wgd (str, optional) – the wgd for which the Orthology Table was built
file_fam_nograph (str, optional) – file to write families that can’t result in a graph (won’t be in a large enough window or has too few genes)
-
scripts.synteny.missed_orthologies.
search_closest_neighbours
(ingroup_genes, dup_sp, all_genefam, all_outgroup_candidates)¶ Extracts orthologs, in the outgroup species, of genes in the neighbourhood of genes without phylogenetic orthologs in species dup_sp.
- Parameters
ingroup_genes (dict) – a clade of ingroup genes without phylogenetic orthologs, as a dict, giving, for each species, a list of GeneSpeciesPosition tuples.
dup_sp (str) – name of the considered duplicated species.
all_genefam (nested dict) – Pre-computed orthology table based on phylogenetic orthologs used to search for syntenic neighbours, represented by a nested dict, giving for each outgroup chromosome (key1) and each duplicated species (key2), a list of GeneFamily objects.
all_outgroup_candidates (list) – all outgroup genes in the same tree, as a list of GeneSpeciesPosition tuples.
- Returns
- a tuple containing:
ortho_neighbours (list): a list of orthologs of ingroup genes in the outgroup, in a tuple (chromosome, gene index)
skip (bool): If True, dup_sp should not be used to search for syntenic neighbours because one neighbour is orthologous to another outgroup gene in the same tree (i.e. a history of tandem duplication, which would artificially inflate the number of syntenic neighbours). Conservation of synteny in the case of tandem duplication is not proof of orthology.
- Return type
tuple
scripts.synteny.mygenome module¶
Module with functions to load a genome from a .bed (or a .bz2 in DYOGEN format) gene file.
-
class
scripts.synteny.mygenome.
ContigType
¶ Bases:
enum.Enum
Enum grouping all possible values describing the type of a contig.
-
Chromosome
= 'Chromosome'¶
-
Mitochondrial
= 'Mitochondrial'¶
-
Random
= 'Random'¶
-
Scaffold
= 'Scaffold'¶
-
-
class
scripts.synteny.mygenome.
Gene
(chromosome, beginning, end, names)¶ Bases:
tuple
-
property
beginning
¶ Alias for field number 1
-
property
chromosome
¶ Alias for field number 0
-
property
end
¶ Alias for field number 2
-
property
names
¶ Alias for field number 3
-
class
scripts.synteny.mygenome.
GenePosition
(chromosome, index)¶ Bases:
tuple
-
property
chromosome
¶ Alias for field number 0
-
property
index
¶ Alias for field number 1
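The "Alias for field number" entries are the standard documentation of namedtuple fields. A minimal sketch of equivalent definitions, with field order taken from the signatures above and made-up example values, could look like this:

```python
from collections import namedtuple

# Field order follows the documented signatures; values below are invented.
Gene = namedtuple("Gene", ["chromosome", "beginning", "end", "names"])
GenePosition = namedtuple("GenePosition", ["chromosome", "index"])

gene = Gene("LG1", 1200, 3400, ["geneA"])
position = GenePosition(chromosome="LG1", index=0)

# Fields are reachable by name or by their field number.
same_value = gene.beginning == gene[1]
```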
-
class
scripts.synteny.mygenome.
Genome
(fichier, file_format)¶ Bases:
object
Object representing genomic position of genes in a species, as loaded from a .bed (or in DYOGEN format) gene file. Can load bzipped (.bz2) files.
-
name
¶ name of the input gene file
- Type
str
-
genes_list
¶ For each chromosome (key), a list of Gene namedtuples.
- Type
dict
-
chr_list
¶ For each ContigType (key), list of chromosomes with this type (value).
- Type
dict
-
dict_genes
¶ For each gene name (key), its position in a GenePosition namedtuple.
- Type
dict
-
add_gene
(names, chromosome, beg, end)¶ Adds a gene to the genes_list.
- Parameters
names (list) – list of gene names
chromosome (str) – chromosome name
beg, end – start and end positions of the gene
-
init_other_attributes
()¶ Inits the genes and chromosomes dictionaries.
-
-
scripts.synteny.mygenome.
contig_type
(chr_name)¶ Deduces the type of a contig from its name.
- Arg:
chr_name (str): Name of the contig
- Returns
- The type of the contig, either Chromosome, Mitochondrial, Scaffold or
Random
- Return type
ContigType object
-
scripts.synteny.mygenome.
is_bz2
(filename)¶ Checks if file extension is bz2 (looks at file extension only, not its encoding, could be improved).
- Arg:
filename (str): input file name
- Returns
boolean: True if extension is bz2, False otherwise.
-
scripts.synteny.mygenome.
toint
(chr_name)¶ Converts the input to an integer, if possible. Otherwise leaves the name unchanged, as str.
- Parameters
chr_name (str) – String to convert, for instance a chromosome name.
- Returns
Converted input if possible, input otherwise
- Return type
int or str
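The described behaviour matches the usual try/except conversion idiom. The sketch below uses a hypothetical name, to_int_if_possible, and is not the module's own implementation.

```python
def to_int_if_possible(chr_name):
    """Return chr_name as an int when it parses as one, else return it unchanged."""
    try:
        return int(chr_name)
    except ValueError:
        return chr_name


numeric = to_int_if_possible("12")
named = to_int_if_possible("LG1")
```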
scripts.synteny.pairwise_orthology_synteny module¶
This script uses synteny conservation patterns to predict orthologous gene pairs in 2 wgd-duplicated species.
Example:
$ python -m scripts.synteny.pairwise_orthology_synteny -i OrthoTable.txt
-p Oryzias.latipes_Danio.rerio -chr LG1 -ortho TreesOrthologies/ [-o out] [-w 15]
[-cutoff 0] [-filter None]
-
scripts.synteny.pairwise_orthology_synteny.
find_best_threading
(dup_seg_sp1, dup_seg_sp2, tree_orthos)¶ For all threading possibilities for duplicated segments in sp1 and sp2, finds the most parsimonious scenario.
- Parameters
dup_seg_sp1, dup_seg_sp2 – duplicated segments in sp1 and sp2
tree_orthos (dict) – Orthologous gene pairs in sp1 and sp2, defined from molecular evolution
- Returns
- a tuple containing:
best (tuple): most parsimonious threading scenario for sp1 and for sp2
s_max (float): corresponding synteny similarity score (delta score)
- Return type
tuple
-
scripts.synteny.pairwise_orthology_synteny.
load_tree_orthologies
(orthology_file, rev=False)¶ Loads orthologies from a tab-separated orthology file giving pre-computed orthologous gene pairs in species1 and species2, based on molecular sequence evolution.
- Parameters
orthology_file (str) – name of the input orthology file
rev (bool, optional) – should species in column 1 and 2 be inverted (i.e use sp2 genes as dict keys)
- Returns
for each gene in sp1 (keys), a list of orthologous genes in sp2 (values), resp. sp2 and sp1 if rev is True.
- Return type
dict
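A minimal sketch of such a loader is shown below. It assumes that the first two tab-separated columns hold the sp1 and sp2 gene names (the real file may carry more columns), reads from any iterable of lines rather than a file name, and uses a hypothetical function name and made-up gene names.

```python
import io
from collections import defaultdict


def read_ortholog_pairs(lines, rev=False):
    """Map each gene in column 1 to its orthologs in column 2 (swapped if rev)."""
    orthologs = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_sp1, gene_sp2 = fields[0], fields[1]
        key, value = (gene_sp2, gene_sp1) if rev else (gene_sp1, gene_sp2)
        orthologs[key].append(value)
    return dict(orthologs)


example = io.StringIO("ola1\tdre1\nola1\tdre2\nola2\tdre3\n")
forward = read_ortholog_pairs(example)
example.seek(0)
reverse = read_ortholog_pairs(example, rev=True)
```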
-
scripts.synteny.pairwise_orthology_synteny.
synteny_orthology_prediction
(orthotable, sp1, sp2, chrom, tree_orthos, res_orthologies, win_size=15, cutoff=0, regions=None)¶ Compares synteny similarity of duplicated segments stored in the Orthology Table, for sp1 and sp2, using a sliding window on chromosome chrom of the outgroup. Gene pairs in similar syntenic context are predicted orthologs.
- Parameters
orthotable (str) – Name of the file with the Orthology Table
sp1, sp2 – names of the compared duplicated species
chrom (str) – Name of the outgroup chromosome
tree_orthos (dict) – Orthologous gene pairs in sp1 and sp2, defined from molecular evolution
res_orthologies (dict) – dict to store results
win_size (int, optional) – Size of the sliding window to browse the orthology table
cutoff (int, optional) – cutoff on synteny similarity delta scores to predict orthology
regions (list, optional) – List of regions on the outgroup chromosome to restrict the analysis on
- Returns
Synteny-predicted orthologous gene pairs
- Return type
dict
-
scripts.synteny.pairwise_orthology_synteny.
write_orthologies
(out, all_orthologies, sp1, sp2, filter_genes=None)¶ Writes synteny-predicted orthologies to file.
- Parameters
out (str) – name of the output file
all_orthologies (dict) – Synteny-predicted orthologous gene pairs
sp1, sp2 – names of the compared duplicated species
filter_genes (list of str, optional) – Restricted list of gene families to write (restrict the orthology prediction to some families)
scripts.synteny.syntenycompare module¶
Module with functions to compare duplicated segments in 2 duplicated species.
-
class
scripts.synteny.syntenycompare.
DupSegments
(family_ids, chromosomes, matrix, genes_dict)¶ Bases:
object
Object to represent a list of GeneFamilies (i.e entries of a given duplicated species in a window of the OrthologyTable). This object is used to perform duplicated segment threading and synteny similarity comparisons in pairs of duplicated species.
-
family_ids
¶ Names of gene families, given by the outgroup gene in the OrthologyTable
- Type
list of str
-
chromosomes
¶ Names of genomic segments with a gene copy in the duplicated species
- Type
list of str
-
matrix
¶ A binary matrix, representing absence/presence of a duplicated gene copy in each genomic segment. Columns are genomic segments, with order given in list (ii). Rows are gene families.
- Type
numpy.array
-
genes_dict
¶ For each ‘1’ in the matrix, the corresponding duplicated species gene name(s)
- Type
dict
-
all_reduce_in_two_blocks
()¶ Gives a list of all possible ways to thread duplicated segments together.
- Returns
all possible threadings.
For instance, a threading given by [[0, 2], [1, 3]] means segments 0 and 2 are threaded together to form an ancestrally duplicated region track1, and segments 1 and 3 are threaded together to form track2.
- Return type
nested list
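To make the nested-list format concrete, the sketch below enumerates every way to split segment indices into two tracks. It is one plausible reading, not the method's actual code: it assumes both tracks are non-empty and that swapping track1 and track2 does not count as a new threading. The function name is hypothetical.

```python
from itertools import combinations


def two_block_threadings(n_segments):
    """List all splits of segment indices 0..n-1 into two non-empty tracks."""
    if n_segments < 2:
        return []
    first, rest = 0, list(range(1, n_segments))
    threadings = []
    # Segment 0 always goes in track 1, which avoids listing mirrored splits.
    # At least one segment is left out of track 1 so that track 2 is non-empty.
    for size in range(len(rest)):
        for companions in combinations(rest, size):
            track1 = [first, *companions]
            track2 = [i for i in rest if i not in companions]
            threadings.append([track1, track2])
    return threadings


scenarios = two_block_threadings(4)
```

Under these assumptions, n segments give 2^(n-1) - 1 threadings, so the count grows quickly with the number of segments in a window.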
-
get_score
(dup_seg_sp2, tree_orthos, threadingsp1, threadingsp2)¶ Computes the two delta scores between tracks of threaded duplicated segments in 2 species.
- Parameters
dup_seg_sp2 (DupSegments) – Corresponding duplicated segments in species 2
tree_orthos (dict) – Orthologous gene pairs in sp1 and sp2, defined from molecular evolution
threadingsp1 (nested list) – duplicated segment threading for species 1
threadingsp2 (nested list) – duplicated segment threading for species 2
- Returns
tuple of 2 floats, delta score based on the ‘pattern of retentions and losses’ and delta score based on ‘syntenic neighbours’
- Return type
tuple
-
orthologs_score
(dup_seg_sp2, tree_orthos, threadingsp1, threadingsp2)¶ Computes delta score based on ‘syntenic neighbours’ between tracks of threaded duplicated segments in 2 species.
- Parameters
dup_seg_sp2 (DupSegments) – Corresponding duplicated segments in species 2
tree_orthos (dict) – Orthologous gene pairs in sp1 and sp2, defined from molecular evolution
threadingsp1, threadingsp2 – duplicated segments threading for each species
- Returns
delta score based on ‘syntenic neighbours’
- Return type
float
-
retention_loss_score
(dup_seg_sp2, threadingsp1, threadingsp2)¶ Computes delta score based on the ‘pattern of retentions and losses’ between tracks of threaded duplicated segments in 2 species.
- Parameters
dup_seg_sp2 (DupSegments) – Corresponding duplicated segments in species 2
threadingsp1, threadingsp2 – duplicated segments threading for each species
- Returns
delta score based on the ‘pattern of retentions and losses’
- Return type
float
-
sort
()¶ Orders duplicated segments by descending number of genes.
-
update_discard
(threading)¶ Updates the discard attribute.
- Parameters
threading (nested list) – threading scenario.
-
update_orthologies
(dup_seg_sp2, score, threading2sp, all_orthologies)¶ Stores genes in identified orthologous duplicated segment. Fills all_orthologies in-place.
- Parameters
dup_seg_sp2 (DupSegments) – corresponding duplicated segments in species 2
score (float) – delta score of synteny similarity (diff. of the 2 orthology scenarios)
threading2sp (nested list) – duplicated segment threading for each species
all_orthologies (dict) – stores orthologies, for each family (key) gives a tuple (value) with the confidence score and predicted orthologs.
-
-
scripts.synteny.syntenycompare.
check_orthology
(orthologous_chroms, dup_seg_sp2, ortho_genes, loc)¶ Checks if there is a pre-computed orthology relation between genes of matched duplicated segments in 2 species for the family loc.
- Parameters
orthologous_chroms (list) – list of orthologous segments in species 2
dup_seg_sp2 (DupSegments) – duplicated segments object for species 2
ortho_genes (list) – list of orthologous genes in species 2 for a gene in species 1
loc (int) – index of the gene family
- Returns
True if there is a pre-computed gene orthology, False otherwise.
- Return type
bool
-
scripts.synteny.syntenycompare.
to_dup_segments
(fams)¶ Transforms a list of GeneFamilies (i.e entries of a given duplicated species in a window of the OrthologyTable) into a DupSegments object.
A DupSegments object consists of:
(i): a list of names of each gene family, given by the corresponding outgroup gene in the OrthologyTable
(ii): a list of all genomic segments with a gene copy in the duplicated species
(iii): a binary matrix, representing absence/presence of a duplicated gene copy in each genomic segment. Columns are genomic segments, with order given in list (ii). Rows are gene families.
(iv): a dictionary, giving for each ‘1’ in the matrix, corresponding duplicated species gene names
(v): a list keeping track of discarded families segments threadings
- Arg:
fams (list of GeneFamily objects): object to transform
- Returns
the transformed object
- Return type
DupSegments
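Items (i) to (iv) can be illustrated with a small stand-alone sketch. Everything here is hypothetical: the input format (family id paired with a chromosome-to-gene-names dict), the function name, and the use of nested lists where the real attribute is a numpy.array. Item (v), the discard list, is left out.

```python
def build_presence_matrix(families):
    """families: list of (family_id, {chromosome: [gene names]}) pairs."""
    family_ids = [fam_id for fam_id, _ in families]
    chromosomes = sorted({chrom for _, copies in families for chrom in copies})
    column = {chrom: j for j, chrom in enumerate(chromosomes)}
    # Rows are gene families, columns are genomic segments.
    matrix = [[0] * len(chromosomes) for _ in families]
    genes_dict = {}
    for i, (_, copies) in enumerate(families):
        for chrom, names in copies.items():
            matrix[i][column[chrom]] = 1
            genes_dict[(i, column[chrom])] = list(names)
    return family_ids, chromosomes, matrix, genes_dict


window = [  # made-up window of three families
    ("outgr_gene1", {"chr3": ["dupA1"], "chr7": ["dupA2"]}),
    ("outgr_gene2", {"chr3": ["dupB1"]}),
    ("outgr_gene3", {"chr7": ["dupC1", "dupC2"]}),
]
family_ids, chromosomes, matrix, genes_dict = build_presence_matrix(window)
```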
scripts.synteny.utilities module¶
Module with functions to load and write a duplicated ingroups-outgroup orthology table.
-
class
scripts.synteny.utilities.
GeneFamily
(outgr_genename, outgr_chr, outgr_position, all_duplicate_genes, involved_chromosomes)¶ Bases:
object
Stores an entry in the orthology table for one duplicated species.
-
outgr_genename
¶ name of the outgroup gene, giving a unique ID to the family
- Type
str
-
outgr_chr
¶ name of the chromosome of the outgroup gene
- Type
str
-
outgr_position
¶ index of the outgroup gene on its chromosome
- Type
int
-
all_duplicate_genes
¶ gene copies in the duplicated species and their genomic location
- Type
list of GeneSpeciesPosition
-
involved_chromosomes
¶ list of chromosomes in the duplicated species with a gene copy
- Type
list of str
Note
No public method, used as a structure to store data. GeneFamily objects are manipulated in lists, using the functions on GeneFamily lists defined below for better readability.
-
-
scripts.synteny.utilities.
GeneSpeciesPosition
¶ alias of
scripts.synteny.utilities.GenePosition
-
scripts.synteny.utilities.
add_gene
(list_of_genefam, ind, gene)¶ Adds a gene copy member of the duplicated species in the orthology table (i.e in the corresponding GeneFamily).
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
ind (int) – family index to add the gene in the list
gene (GeneSpeciesPosition namedtuple) – gene to add
-
scripts.synteny.utilities.
complete_load_orthotable
(table_file, chrom_outgr, species, load_no_position_genes=False)¶ Loads entries for one duplicated species species in the orthologytable, corresponding to chromosome chrom_outgr in the outgroup, as a list of GeneFamily objects.
- Parameters
table_file (str) – Name of the orthologytable file
chrom_outgr (str) – Name of the considered outgroup chromosome
species (str) – Name of the considered duplicated species
- Returns
list of GeneFamily objects
-
scripts.synteny.utilities.
find_closest
(number, number_list, index=False)¶ Finds, in a list of int number_list, the closest integer to number, or its index. Assumes the list is sorted. If two values are equally close to number, gives the smallest.
- Parameters
number (int) – the input number to search
number_list (list) – the list of int to mine
index (bool, optional) – Whether the index of the closest element should be returned instead of its value.
- Returns
closest number in list (or its index if index is True)
- Return type
int
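Since the list is assumed sorted, the lookup can be done by binary search. The sketch below relies on bisect from the standard library and resolves ties towards the smaller value, as described; the function name is hypothetical and the module may implement this differently.

```python
from bisect import bisect_left


def closest_in_sorted(number, number_list, index=False):
    """Closest value to number in a sorted list (or its index); ties go low."""
    if not number_list:
        raise ValueError("number_list must not be empty")
    pos = bisect_left(number_list, number)
    if pos == 0:
        best = 0
    elif pos == len(number_list):
        best = pos - 1
    else:
        before, after = number_list[pos - 1], number_list[pos]
        # "<=" keeps the smaller value when both neighbours are equally close.
        best = pos - 1 if number - before <= after - number else pos
    return best if index else number_list[best]


positions = [1, 4, 6, 9]
tie_value = closest_in_sorted(5, positions)
near_index = closest_in_sorted(8, positions, index=True)
```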
-
scripts.synteny.utilities.
get_all_chromosome_and_position
(list_of_genefam)¶ Gets chromosome and chromosomal location index of all the duplicated species genes in a list of GeneFamily objects.
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
- Returns
for each chromosome (key), list of gene positions (value)
- Return type
dict
-
scripts.synteny.utilities.
get_all_chromosomes_involved
(list_of_genefam)¶ Gets all chromosome with a gene copy in a list of GeneFamily objects.
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
- Returns
list of chromosome names
- Return type
list of str
-
scripts.synteny.utilities.
get_all_outgr_names
(list_of_genefam)¶ Gets gene names of all outgroup genes in a list of GeneFamily objects.
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
- Returns
list of gene names
- Return type
list of str
-
scripts.synteny.utilities.
get_all_outgr_pos
(list_of_genefam)¶ Gets chromosomal location index of all outgroup genes in a list of GeneFamily objects.
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
- Returns
list of chromosomal indexes
- Return type
list of int
-
scripts.synteny.utilities.
insert_outgr_gene
(list_of_genefam, ind, gene)¶ Inserts an outgroup gene in the orthology table (i.e in the list of GeneFamily).
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
ind (int) – index to insert the gene in the list
gene (GeneSpeciesPosition namedtuple) – gene to add
-
scripts.synteny.utilities.
light_load_orthotable
(table_file)¶ Another simplified loading function for the orthologytable, in order to only get outgroup genes in the orthology table and their corresponding chromosomes.
- Parameters
table_file (str) – Input file name.
- Returns
Correspondence between each chromosome of the outgroup (key) and its genes in the orthology table (value). Genes are given in their order along each chromosome.
- Return type
names (dict)
-
scripts.synteny.utilities.
load_orthotable
(table_file)¶ Simplified loading function for the orthologytable, in order to get outgroup genes in the orthology table and all duplicated species gene copies in the corresponding family.
- Parameters
table_file (str) – Input file name.
- Returns
Correspondence between genes in the outgroup (key) and duplicated species genes in its family (value).
- Return type
orthotable (dict)
-
scripts.synteny.utilities.
outgr_chromosomes
(chr_file)¶ Reads a simple file with a single entry on each line (for instance chromosome names).
- Parameters
chr_file (str) – input file name
- Returns
entry on each line of the file
- Return type
list
-
scripts.synteny.utilities.
split_chr_with_ohnologs
(list_of_genefam)¶ Splits the duplicated species chromosomes into two separate regions if there are two ohnologs on the same chromosome but more than 100 genes apart. This can potentially be the result of different duplicated chromosomes that fused together.
- Parameters
list_of_genefam (list of GeneFamily) – input list of GeneFamily
- Returns
- list of tuples recording ohnologs more than 100 genes apart on the same chromosome
- splits (dict of dict): for each split chromosome (key1), each gene copy on it (key2) and
its corresponding after-split region (‘a’ or ‘b’)
- Return type
store (store)
-
scripts.synteny.utilities.
update_orthologytable
(all_genefam, res_dict, sp_list)¶ Updates the orthology table by adding all newly found orthologies between ingroups and the outgroup. Inserts the new family in the orthology table (i.e in the list of GeneFamily).
- Parameters
all_genefam (nested dict) – Stores the full orthology table. For each chromosome of the outgroup (key1), for each duplicated species (key2), a list of GeneFamily objects (value).
res_dict (nested dict) – Stores new orthologies. For each gene family (key1; represented by its family id, the outgroup gene name), for each duplicated species (key2), the corresponding gene copies in the duplicated species (value).
sp_list (list of str) – list of duplicated species
-
scripts.synteny.utilities.
write_updated_orthotable
(all_genefam, outgr, sp_list, chr_outgr, out, wsize=0, filt_genes=None)¶ Writes an orthology table file from data stored in all_genefam.
- Parameters
all_genefam (nested dict) – Stores the full orthology table. For each chromosome of the outgroup (key1), for each duplicated species (key2), a list of GeneFamily objects (value).
outgr (str) – name of the outgroup species
sp_list (list of str) – list of duplicated species
chr_outgr (str) – chromosome of the outgroup
out (str) – name of the output file