Python API¶
This page provides detailed documentation for the tskit
Python API.
Trees and tree sequences¶
The TreeSequence
class represents a sequence of correlated trees
output by a simulation. The Tree
class represents a single
tree in this sequence.
These classes are the interfaces used to interact with the trees
and mutational information stored in a tree sequence returned from a simulation.
There are also methods for loading data into these objects, either from the native
format using tskit.load()
, or from another sources
using tskit.load_text()
or TableCollection.tree_sequence()
.
Top levelclasses¶

class
tskit.
TreeSequence
¶ A single tree sequence, as defined by the data model. A TreeSequence instance can be created from a set of tables using
TableCollection.tree_sequence()
; or loaded from a set of text files usingload_text()
; or, loaded from a native binary file usingload()
.TreeSequences are immutable. To change the data held in a particular tree sequence, first get the table information as a
TableCollection
instance (usingdump_tables()
), edit those tables using the tables api, and create a new tree sequence usingTableCollection.tree_sequence()
.The
trees()
method iterates over all trees in a tree sequence, and thevariants()
method iterates over all sites and their genotypes.
Fst
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes “windowed” Fst between pairs of sets of nodes from
sample_sets
. Operates onk = 2
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.For sample sets
X
andY
, ifd(X, Y)
is thedivergence
betweenX
andY
, andd(X)
is thediversity
ofX
, then what is computed isFst = 1  2 * (d(X) + d(Y)) / (d(X) + 2 * d(X, Y) + d(Y))
What is computed for diversity and divergence depends on
mode
; see those functions for more details. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 2tuples.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

Tajimas_D
(sample_sets=None, windows=None, mode='site')¶ Computes Tajima’s D of sets of nodes from
sample_sets
in windows. Please see the oneway statistics section for details on how thesample_sets
argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, and return value. Operates onk = 1
sample sets at a time. For a sample setX
ofn
nodes, if andT
is the mean number of pairwise differing sites inX
andS
is the number of sites segregating inX
(computed withdiversity
andsegregating sites
, respectively, both not span normalised), then Tajima’s D isD = (T  S/h) / sqrt(a*S + (b/c)*S*(S1)) h = 1 + 1/2 + ... + 1/(n1) g = 1 + 1/2^2 + ... + 1/(n1)^2 a = (n+1)/(3*(n1)*h)  1/h^2 b = 2*(n^2 + n + 3)/(9*n*(n1))  (n+2)/(h*n) + g/h^2 c = h^2 + g
What is computed for diversity and divergence depends on
mode
; see those functions for more details. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 2tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
 Returns
A ndarray with shape equal to (num windows, num statistics).

Y1
(sample_sets, windows=None, mode='site', span_normalise=True)¶ Computes the ‘Y1’ statistic within each of the sets of nodes given by
sample_sets
. Please see the oneway statistics section for details on how thesample_sets
argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates onk = 1
sample set at a time.What is computed depends on
mode
. Each is computed exactly asY3
, except that the average is across a randomly chosen trio of samples(a1, a2, a3)
all chosen without replacement from the same sample set. SeeY3
for more details. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

Y2
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes the ‘Y2’ statistic between pairs of sets of nodes from
sample_sets
. Operates onk = 2
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. Each is computed exactly asY3
, except that the average across randomly chosen trios of samples(a, b1, b2)
, wherea
is chosen from the first sample set, andb1, b2
are chosen (without replacement) from the second sample set. SeeY3
for more details. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 2tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

Y3
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes the ‘Y’ statistic between triples of sets of nodes from
sample_sets
. Operates onk = 3
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. Each is an average across randomly chosen trios of samples(a, b, c)
, one from each sample set: “site”
The average density of sites at which
a
differs fromb
andc
, per unit of chromosome length. “branch”
The average length of all branches that separate
a
fromb
andc
(in units of time). “node”
For each node, the average proportion of the window on which
a
inherits from that node butb
andc
do not, or viceversa.
 Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 3tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

allele_frequency_spectrum
(sample_sets=None, windows=None, mode='site', span_normalise=True, polarised=False)¶ Computes the allele frequency spectrum (AFS) in windows across the genome for with respect to the specified
sample_sets
. See the statistics interface section for details on sample sets, windows, mode, span normalise, polarised, and return value. and see Allele frequency spectra for examples of how to use this method.Similar to other windowed stats, the first dimension in the returned array corresponds to windows, such that
result[i]
is the AFS in the ith window. The AFS in each window is a kdimensional numpy array, where k is the number of input sample sets, such thatresult[i, j0, j1, ...]
is the value associated with frequencyj0
insample_sets[0]
,j1
insample_sets[1]
, etc, in windowi
. From here, we will assume thatafs
corresponds to the result in a single window, i.e.,afs = result[i]
.If a single sample set is specified, the allele frequency spectrum within this set is returned, such that
afs[j]
is the value associated with frequencyj
. Thus, singletons are counted inafs[1]
, doubletons inafs[2]
, and so on. The zeroth entry counts alleles or branches not seen in the samples but that are polymorphic among the rest of the samples of the tree sequence; likewise, the last entry counts alleles fixed in the sample set but polymorphic in the entire set of samples. Please see the Zeroth and final entries in the AFS for an illustration.Warning
Please note that singletons are not counted in the initial entry in each AFS array (i.e.,
afs[0]
), but inafs[1]
.If
sample_sets
is None (the default), the allele frequency spectrum for all samples in the tree sequence is returned.If more than one sample set is specified, the joint allele frequency spectrum within windows is returned. For example, if we set
sample_sets = [S0, S1]
, then afs[1, 2] counts the number of sites that at singletons within S0 and doubletons in S1. The dimensions of the output array will be[num_windows] + [1 + len(S) for S in sample_sets]
.If
polarised
is False (the default) the AFS will be folded, so that the counts do not depend on knowing which allele is ancestral. If folded, the frequency spectrum for a single sample setS
hasafs[j] = 0
for allj > len(S) / 2
, so that alleles at frequencyj
andlen(S)  j
both add to the same entry. If there is more than one sample set, the returned array is “lower triangular” in a similar way.What is computed depends on
mode
: “site”
The number of sites at a given frequency within the specified sample sets for each window, per unit of sequence length. To obtain the total number of sites, set
span_normalise
to False. “branch”
The total length of branches in the trees subtended by subsets of the specified sample sets, per unit of sequence length. To obtain the total, set
span_normalise
to False. “node”
Not supported for this method (raises a ValueError).
 Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of samples to compute the joint allele frequency
windows (list) – An increasing list of breakpoints between windows along the genome.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A (k + 1) dimensional numpy array, where k is the number of sample sets specified.

aslist
()¶ Returns the trees in this tree sequence as a list. Each tree is represented by a different instance of
Tree
. As such, this method is inefficient and may use a large amount of memory, and should not be used when performance is a consideration. Thetrees()
method is the recommended way to efficiently iterate over the trees in a tree sequence. Returns
A list of the trees in this tree sequence.
 Return type

at
(position)¶ Returns the tree covering the specified genomic location. The returned tree will have
tree.interval[0]
<=position
<tree.interval[1]
. See alsoTree.seek()
.

at_index
(index)¶ Returns the tree at the specified index. See also
Tree.seek_index()
.

breakpoints
(as_array=False)¶ Returns the breakpoints along the chromosome, including the two extreme points 0 and L. This is equivalent to
>>> iter([0] + [t.interval[1] for t in self.trees()])
By default we return an iterator over the breakpoints as Python float objects; if
as_array
is True we return them as a numpy array.Note that the
as_array
form will be more efficient and convenient in most cases; the default iterator behaviour is mainly kept to ensure compatability with existing code. Parameters
as_array (bool) – If True, return the breakpoints as a numpy array.
 Returns
The breakpoints defined by the tree intervals along the sequence.
 Return type
iter or array

delete_intervals
(intervals, simplify=True, record_provenance=True)¶ Returns a copy of this tree sequence for which information in the specified list of genomic intervals has been deleted. Edges spanning these intervals are truncated or deleted, and sites and mutations falling within them are discarded.
Note that node IDs may change as a result of this operation, as by default
simplify()
is called on the returned tree sequence to remove redundant nodes. If you wish to map node IDs onto the same nodes before and after this method has been called, specifysimplify=False
.See also
keep_intervals()
. Parameters
intervals (array_like) – A list (start, end) pairs describing the genomic intervals to delete. Intervals must be nonoverlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, return a simplified tree sequence where nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If
True
, add details of this operation to the provenance information of the returned tree sequence. (Default:True
).
 Return type

delete_sites
(site_ids, record_provenance=True)¶ Returns a copy of this tree sequence with the specified sites (and their associated mutations) entirely removed. The site IDs do not need to be in any particular order, and specifying the same ID multiple times does not have any effect (i.e., calling
tree_sequence.delete_sites([0, 1, 1])
has the same effect as callingtree_sequence.delete_sites([0, 1])
.

divergence
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes mean genetic divergence between (and within) pairs of sets of nodes from
sample_sets
. Operates onk = 2
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.As a special case, an index
(j, j)
will compute thediversity
ofsample_set[j]
.What is computed depends on
mode
: “site”
Mean pairwise genetic divergence: the average across distinct, randomly chosen pairs of chromosomes (one from each sample set), of the density of sites at which the two carry different alleles, per unit of chromosome length.
 “branch”
Mean distance in the tree: the average across distinct, randomly chosen pairs of chromsomes (one from each sample set) and locations in the window, of the mean distance in the tree between the two samples (in units of time).
 “node”
For each node, the proportion of genome on which the node is an ancestor to only one of a random pair (one from each sample set), averaged over choices of pair.
 Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 2tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

diversity
(sample_sets=None, windows=None, mode='site', span_normalise=True)¶ Computes mean genetic diversity (also knowns as “Tajima’s pi”) in each of the sets of nodes from
sample_sets
. Please see the oneway statistics section for details on how thesample_sets
argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.Note that this quantity can also be computed by the
divergence
method.What is computed depends on
mode
: “site”
Mean pairwise genetic diversity: the average across distinct, randomly chosen pairs of chromosomes, of the density of sites at which the two carry different alleles, per unit of chromosome length.
 “branch”
Mean distance in the tree: the average across distinct, randomly chosen pairs of chromsomes and locations in the window, of the mean distance in the tree between the two samples (in units of time).
 “node”
For each node, the proportion of genome on which the node is an ancestor to only one of a random pair from the sample set, averaged over choices of pair.
 Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A numpy array.

dump
(path, zlib_compression=False)¶ Writes the tree sequence to the specified file path.

dump_tables
()¶ A copy of the tables defining this tree sequence.
 Returns
A
TableCollection
containing all tables underlying the tree sequence. Return type

dump_text
(nodes=None, edges=None, sites=None, mutations=None, individuals=None, populations=None, provenances=None, precision=6, encoding='utf8', base64_metadata=True)¶ Writes a text representation of the tables underlying the tree sequence to the specified connections.
If Base64 encoding is not used, then metadata will be saved directly, possibly resulting in errors reading the tables back in if metadata includes whitespace.
 Parameters
nodes (stream) – The filelike object (having a .write() method) to write the NodeTable to.
edges (stream) – The filelike object to write the EdgeTable to.
sites (stream) – The filelike object to write the SiteTable to.
mutations (stream) – The filelike object to write the MutationTable to.
individuals (stream) – The filelike object to write the IndividualTable to.
populations (stream) – The filelike object to write the PopulationTable to.
provenances (stream) – The filelike object to write the ProvenanceTable to.
precision (int) – The number of digits of precision.
encoding (string) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.

edge_diffs
()¶ Returns an iterator over all the edges that are inserted and removed to build the trees as we move from lefttoright along the tree sequence. The iterator yields a sequence of 3tuples,
(interval, edges_out, edges_in)
. Theinterval
is a pair(left, right)
representing the genomic interval (seeTree.interval
). Theedges_out
value is a tuple of the edges that were justremoved to create the tree covering the interval (henceedges_out
will always be empty for the first tree). Theedges_in
value is a tuple of edges that were just inserted to construct the tree convering the current interval. The iterator terminates after the final interval in the tree sequence (i.e. it does not report a final removal of all remaining edges); the number of iterations is thus always equal to the number of trees in the tree sequence.

edges
()¶ Returns an iterator over all the edges in this tree sequence. Edges are returned in the order required for a valid tree sequence. So, edges are guaranteed to be ordered such that (a) all parents with a given ID are contiguous; (b) edges are returned in nondescreasing order of parent time ago; (c) within the edges for a given parent, edges are sorted first by child ID and then by left coordinate.
 Returns
An iterator over all edges.
 Return type
iter(
Edge
)

f2
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes Patterson’s f3 statistic between two groups of nodes from
sample_sets
. Operates onk = 2
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. Each works exactly asf4
, except the average is across randomly chosen set of four samples(a1, b1; a2, b2)
, with a1 and a2 both chosen (without replacement) from the first sample set andb1
andb2
chosen randomly without replacement from the second sample set. Seef4
for more details. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 2tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

f3
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes Patterson’s f3 statistic between three groups of nodes from
sample_sets
. Operates onk = 3
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. Each works exactly asf4
, except the average is across randomly chosen set of four samples(a1, b; a2, c)
, with a1 and a2 both chosen (without replacement) from the first sample set. Seef4
for more details. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 3tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

f4
(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)¶ Computes Patterson’s f4 statistic between four groups of nodes from
sample_sets
. Operates onk = 4
sample sets at a time; please see the multiway statistics section for details on how thesample_sets
andindexes
arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. Each is an average across randomly chosen set of four samples(a, b; c, d)
, one from each sample set: “site”
The average density of sites at which
a
andc
agree but differs fromb
andd
, minus the average density of sites at whicha
andd
agree but differs fromb
andc
, per unit of chromosome length. “branch”
The average length of all branches that separate
a
andc
fromb
andd
, minus the average length of all branches that separatea
andd
fromb
andc
(in units of time). “node”
For each node, the average proportion of the window on which
a
andc
inherit from that node butb
andd
do not, or viceversa, minus the average proportion of the window on whicha
ancd
inherit from that node butb
andc
do not, or viceversa.
 Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
indexes (list) – A list of 4tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

first
()¶ Returns the first tree in this
TreeSequence
. To iterate over all trees in the sequence, use thetrees()
method. Returns
The first tree in this tree sequence.
 Return type
Tree
.

genealogical_nearest_neighbours
(focal, sample_sets, num_threads=0)¶ Return the genealogical nearest neighbours (GNN) proportions for the given focal nodes, with reference to two or more sets of interest, averaged over all trees in the tree sequence.
The GNN proportions for a focal node in a single tree are given by first finding the most recent common ancestral node \(a\) between the focal node and any other node present in the reference sets. The GNN proportion for a specific reference set, \(S\) is the number of nodes in \(S\) that descend from \(a\), as a proportion of the total number of descendant nodes in any of the reference sets.
For example, consider a case with 2 sample sets, \(S_1\) and \(S_2\). For a given tree, \(a\) is the node that includes at least one descendant in \(S_1\) or \(S_2\) (not including the focal node). If the descendants of \(a\) include some nodes in \(S_1\) but no nodes in \(S_2\), then the GNN proportions for that tree will be 100% \(S_1\) and 0% \(S_2\), or \([1.0, 0.0]\).
For a given focal node, the GNN proportions returned by this function are an average of the GNNs for each tree, weighted by the genomic distance spanned by that tree.
For an precise mathematical definition of GNN, see https://doi.org/10.1101/458067
Note
The reference sets need not include all the samples, hence the most recent common ancestral node of the reference sets, \(a\), need not be the immediate ancestor of the focal node. If the reference sets only comprise sequences from relatively distant individuals, the GNN statistic may end up as a measure of comparatively distant ancestry, even for tree sequences that contain many closely related individuals.
Warning
The interface for this method is preliminary and may be subject to backwards incompatible changes in the near future. The longterm stable API for this method will be consistent with other Statistics.
 Parameters
 Returns
An \(n\) by \(m\) array of focal nodes by GNN proportions. Every focal node corresponds to a row. The numbers in each row corresponding to the GNN proportion for each of the passedin reference sets. Rows therefore sum to one.
 Return type
numpy.ndarray

general_stat
(W, f, output_dim, windows=None, polarised=False, mode=None, span_normalise=True, strict=True)¶ Compute a windowed statistic from weights and a summary function. See the statistics interface section for details on windows, mode, span normalise, and return value. On each tree, this propagates the weights
W
up the tree, so that the “weight” of each node is the sum of the weights of all samples at or below the node. Then the summary functionf
is applied to the weights, giving a summary for each node in each tree. How this is then aggregated depends onmode
: “site”
Adds together the total summary value across all alleles in each window.
 “branch”
Adds together the summary value for each node, multiplied by the length of the branch above the node and the span of the tree.
 “node”
Returns each node’s summary value added across trees and multiplied by the span of the tree.
Both the weights and the summary can be multidimensional: if
W
hask
columns, andf
takes ak
vector and returns anm
vector, then the output will bem
dimensional for each node or window (depending on “mode”).Note
The summary function
f
should return zero when given both 0 and the total weight (i.e.,f(0) = 0
andf(np.sum(W, axis=0)) = 0
), unlessstrict=False
. This is necessary for the statistic to be unaffected by parts of the tree sequence ancestral to none or all of the samples, respectively. Parameters
W (ndarray) – An array of values with one row for each sample and one column for each weight.
f (function) – A function that takes a onedimensional array of length equal to the number of columns of
W
and returns a onedimensional array.output_dim (int) – The length of
f
’s return value.windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
polarised (bool) – Whether to leave the ancestral state out of computations: see Statistics for more details.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
strict (bool) – Whether to check that f(0) and f(total weight) are zero.
 Returns
A ndarray with shape equal to (num windows, num statistics).

genotype_matrix
(impute_missing_data=False)¶ Returns an \(m \times n\) numpy array of the genotypes in this tree sequence, where \(m\) is the number of sites and \(n\) the number of samples. The genotypes are the indexes into the array of
alleles
, as described for theVariant
class. The value 0 always corresponds to the ancestal state, and values > 0 represent distinct derived states.If there is missing data present the genotypes array will contain a special value
tskit.MISSING_DATA
(1) to identify these missing samples. Missing data is reported by default, but ifimpute_missing_data
is set to to True, missing data will be imputed such that all isolated samples are assigned the ancestral state. This was the default behaviour in versions prior to 0.2.0.Warning
This method can consume a very large amount of memory! If all genotypes are not needed at once, it is usually better to access them sequentially using the
variants()
iterator. Parameters
impute_missing_data (bool) – If True, the allele assigned to any isolated samples is the ancestral state; that is, we impute missing data as the ancestral state. Default: False.
 Returns
The full matrix of genotypes.
 Return type
numpy.ndarray (dtype=np.int8)

haplotypes
(impute_missing_data=False)¶ Returns an iterator over the haplotypes resulting from the trees and mutations in this tree sequence as a string. The iterator returns a total of \(n\) strings, each of which contains \(s\) characters (\(n\) is the sample size returned by
tskit.TreeSequence.num_samples
and \(s\) is the number of sites returned bytskit.TreeSequence.num_sites
). The first string returned is the haplotype for sample 0, and so on. For a given haplotypeh
, the value ofh[j]
is the observed allelic state at sitej
.See also the
variants()
iterator for sitecentric access to sample genotypes.This method is only supported for singleletter alleles.
As missing data cannot be represented directly in the haplotypes array, an error will be raised if missing data is present. However, if
impute_missing_data
set to True, missing data will be imputed such that all isolated samples are assigned the ancestral state. This was the default behaviour in versions prior to 0.2.0. Returns
An iterator over the haplotype strings for the samples in this tree sequence.
 Parameters
impute_missing_data (bool) – If True, the allele assigned to any isolated samples is the ancestral state; that is, we impute missing data as the ancestral state. Default: False.
 Return type
iter
 Raises
LibraryError if called on a tree sequence containing multiletter alleles.
 Raises
LibraryError if missing data is present and impute_missing_data is False

individual
(id_)¶ Returns the individual in this tree sequence with the specified ID.
 Return type

individuals
()¶ Returns an iterator over all the individuals in this tree sequence.
 Returns
An iterator over all individuals.
 Return type
iter(
Individual
)

keep_intervals
(intervals, simplify=True, record_provenance=True)¶ Returns a copy of this tree sequence which includes only information in the specified list of genomic intervals. Edges are truncated to lie within these intervals, and sites and mutations falling outside these intervals are discarded.
Note that node IDs may change as a result of this operation, as by default
simplify()
is called on the returned tree sequence to remove redundant nodes. If you wish to map node IDs onto the same nodes before and after this method has been called, specifysimplify=False
.See also
delete_intervals()
. Parameters
intervals (array_like) – A list (start, end) pairs describing the genomic intervals to keep. Intervals must be nonoverlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, return a simplified tree sequence where nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
 Return type

last
()¶ Returns the last tree in this
TreeSequence
. To iterate over all trees in the sequence, use thetrees()
method. Returns
The last tree in this tree sequence.
 Return type
Tree
.

property
max_root_time
¶ Returns time of the oldest root in any of the trees in this tree sequence. This is usually equal to
np.max(ts.tables.nodes.time)
but may not be since there can be nodes that are not present in any tree. Consistent with the definition of tree roots, if there are no edges in the tree sequence we return the time of the oldest sample. Returns
The maximum time of a root in this tree sequence.
 Return type

mean_descendants
(sample_sets)¶ Computes for every node the mean number of samples in each of the sample_sets that descend from that node, averaged over the portions of the genome for which the node is ancestral to any sample. The output is an array, C[node, j], which reports the total span of all genomes in sample_sets[j] that inherit from node, divided by the total span of the genome on which node is an ancestor to any sample in the tree sequence.
Warning
The interface for this method is preliminary and may be subject to backwards incompatible changes in the near future. The longterm stable API for this method will be consistent with other Statistics. In particular, the normalization by proportion of the genome that node is an ancestor to anyone may not be the default behaviour in the future.
 Parameters
sample_sets (list) – A list of lists of node IDs.
 Returns
An array with dimensions (number of nodes in the tree sequence, number of reference sets)

migrations
()¶ Returns an iterator over all the migrations in this tree sequence.
Migrations are returned in nondecreasing order of the
time
value. Returns
An iterator over all migrations.
 Return type
iter(
Migration
)

mutations
()¶ Returns an iterator over all the mutations in this tree sequence. Mutations are returned in order of nondecreasing site ID. See the
Mutation
class for details on the available fields for each mutation.The returned iterator is equivalent to iterating over all sites and all mutations in each site, i.e.:
>>> for site in tree_sequence.sites(): >>> for mutation in site.mutations: >>> yield mutation
 Returns
An iterator over all mutations in this tree sequence.
 Return type
iter(
Mutation
)

nodes
()¶ Returns an iterator over all the nodes in this tree sequence.
 Returns
An iterator over all nodes.
 Return type
iter(
Node
)

property
num_edges
¶ Returns the number of edges in this tree sequence.
 Returns
The number of edges in this tree sequence.
 Return type

property
num_individuals
¶ Returns the number of individuals in this tree sequence.
 Returns
The number of individuals in this tree sequence.
 Return type

property
num_migrations
¶ Returns the number of migrations in this tree sequence.
 Returns
The number of migrations in this tree sequence.
 Return type

property
num_mutations
¶ Returns the number of mutations in this tree sequence.
 Returns
The number of mutations in this tree sequence.
 Return type

property
num_nodes
¶ Returns the number of nodes in this tree sequence.
 Returns
The number of nodes in this tree sequence.
 Return type

property
num_populations
¶ Returns the number of populations in this tree sequence.
 Returns
The number of populations in this tree sequence.
 Return type

property
num_provenances
¶ Returns the number of provenances in this tree sequence.
 Returns
The number of provenances in this tree sequence.
 Return type

property
num_samples
¶ Returns the number of samples in this tree sequence. This is the number of sample nodes in each tree.
 Returns
The number of sample nodes in this tree sequence.
 Return type

property
num_sites
¶ Returns the number of sites in this tree sequence.
 Returns
The number of sites in this tree sequence.
 Return type

property
num_trees
¶ Returns the number of distinct trees in this tree sequence. This is equal to the number of trees returned by the
trees()
method. Returns
The number of trees in this tree sequence.
 Return type

pairwise_diversity
(samples=None)¶ Returns the pairwise nucleotide site diversity, the average number of sites that differ between a randomly chosen pair of samples. If samples is specified, calculate the diversity within this set.
Deprecated since version 0.2.0: please use
diversity()
instead. Since version 0.2.0 the error semantics have also changed slightly. It is no longer an error when there is one sample and a tskit.LibraryError is raised when nonsample IDs are provided rather than a ValueError. It is also no longer an error to compute pairwise diversity at sites with multiple mutations.

population
(id_)¶ Returns the population in this tree sequence with the specified ID.
 Return type

populations
()¶ Returns an iterator over all the populations in this tree sequence.
 Returns
An iterator over all populations.
 Return type
iter(
Population
)

provenances
()¶ Returns an iterator over all the provenances in this tree sequence.
 Returns
An iterator over all provenances.
 Return type
iter(
Provenance
)

sample_count_stat
(sample_sets, f, output_dim, windows=None, polarised=False, mode=None, span_normalise=True, strict=True)¶ Compute a windowed statistic from sample counts and a summary function. This is a wrapper around
general_stat()
for the common case in which the weights are all either 1 or 0, i.e., functions of the joint allele frequency spectrum. See the statistics interface section for details on sample sets, windows, mode, span normalise, and return value. Ifsample_sets
is a list ofk
sets of samples, thenf
should be a function that takes an argument of lengthk
and returns a onedimensional array. Thej
th element of the argument tof
will be the number of samples insample_sets[j]
that lie below the node thatf
is being evaluated for. Seegeneral_stat()
for more details.Here is a contrived example: suppose that
A
andB
are two sets of samples withnA
andnB
elements, respectively. Passing these as sample sets will givef
an argument of length two, giving the number of samples inA
andB
below the node in question. So, if we definedef f(x): pA = x[0] / nA pB = x[1] / nB return np.array([pA * pB])
then if all sites are biallelic,
ts.sample_count_stat([A, B], f, windows="site", polarised=False, mode="site")
would compute, for each site, the product of the derived allele frequencies in the two sample sets, in a (num sites, 1) array. If instead
f
returnsnp.array([pA, pB, pA * pB])
, then the output would be a (num sites, 3) array, with the first two columns giving the allele frequencies inA
andB
, respectively.Note
The summary function
f
should return zero when given both 0 and the sample size (i.e.,f(0) = 0
andf(np.array([len(x) for x in sample_sets]) = 0
). This is necessary for the statistic to be unaffected by parts of the tree sequence ancestral to none or all of the samples, respectively. Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
f (function) – A function that takes a onedimensional array of length equal to the number of sample sets and returns a onedimensional array.
output_dim (int) – The length of
f
’s return value.windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
polarised (bool) – Whether to leave the ancestral state out of computations: see Statistics for more details.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
strict (bool) – Whether to check that f(0) and f(total weight) are zero.
 Returns
A ndarray with shape equal to (num windows, num statistics).

samples
(population=None, population_id=None)¶ Returns an array of the sample node IDs in this tree sequence. If the
population
parameter is specified, only return sample IDs from this population.

segregating_sites
(sample_sets=None, windows=None, mode='site', span_normalise=True)¶ Computes the density of segregating sites for each of the sets of nodes from
sample_sets
, and related quantities. Please see the oneway statistics section for details on how thesample_sets
argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. For a sample setA
, computes: “site”
The sum over sites of the number of alleles found in
A
at each site minus one, per unit of chromosome length. If all sites have at most two alleles, this is the density of sites that are polymorphic inA
. To get the number of segregating minor alleles per window, passspan_normalise=False
. “branch”
The total length of all branches in the tree subtended by the samples in
A
, averaged across the window. “node”
The proportion of the window on which the node is ancestral to some, but not all, of the samples in
A
.
 Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of individuals to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

property
sequence_length
¶ Returns the sequence length in this tree sequence. This defines the genomic scale over which tree coordinates are defined. Given a tree sequence with a sequence length \(L\), the constituent trees will be defined over the halfclosed interval \([0, L)\). Each tree then covers some subset of this interval — see
tskit.Tree.interval
for details. Returns
The length of the sequence in this tree sequence in bases.
 Return type

simplify
(samples=None, filter_zero_mutation_sites=None, map_nodes=False, reduce_to_site_topology=False, filter_populations=True, filter_individuals=True, filter_sites=True, record_provenance=True, keep_unary=False)¶ Returns a simplified tree sequence that retains only the history of the nodes given in the list
samples
. Ifmap_nodes
is true, also return a numpy array mapping the node IDs in this tree sequence to their node IDs in the simplified tree tree sequence. If a nodeu
is not present in the new tree sequence, the value of this mapping will be NULL (1).In the returned tree sequence, the node with ID
0
corresponds tosamples[0]
, node1
corresponds tosamples[1]
, and so on. Besides the samples, node IDs in the returned tree sequence are then allocated sequentially in time order.If you wish to simplify a set of tables that do not satisfy all requirements for building a TreeSequence, then use
TableCollection.simplify()
.If the
reduce_to_site_topology
parameter is True, the returned tree sequence will contain only topological information that is necessary to represent the trees that contain sites. If there are zero sites in this tree sequence, this will result in an output tree sequence with zero edges. When the number of sites is greater than zero, every tree in the output tree sequence will contain at least one site. For a given site, the topology of the tree containing that site will be identical (up to node ID remapping) to the topology of the corresponding tree in the input tree sequence.If
filter_populations
,filter_individuals
orfilter_sites
is True, any of the corresponding objects that are not referenced elsewhere are filtered out. As this is the default behaviour, it is important to realise IDs for these objects may change through simplification. By setting these parameters to False, however, the corresponding tables can be preserved without changes. Parameters
samples (list) – The list of nodes for which to retain information. This may be a numpy array (or arraylike) object (dtype=np.int32).
filter_zero_mutation_sites (bool) – Deprecated alias for
filter_sites
.map_nodes (bool) – If True, return a tuple containing the resulting tree sequence and a numpy array mapping node IDs in the current tree sequence to their corresponding node IDs in the returned tree sequence. If False (the default), return only the tree sequence object itself.
reduce_to_site_topology (bool) – Whether to reduce the topology down to the trees that are present at sites. (Default: False)
filter_populations (bool) – If True, remove any populations that are not referenced by nodes after simplification; new population IDs are allocated sequentially from zero. If False, the population table will not be altered in any way. (Default: True)
filter_individuals (bool) – If True, remove any individuals that are not referenced by nodes after simplification; new individual IDs are allocated sequentially from zero. If False, the individual table will not be altered in any way. (Default: True)
filter_sites (bool) – If True, remove any sites that are not referenced by mutations after simplification; new site IDs are allocated sequentially from zero. If False, the site table will not be altered in any way. (Default: True)
record_provenance (bool) – If True, record details of this call to simplify in the returned tree sequence’s provenance information (Default: True).
keep_unary (bool) – If True, any unary nodes (i.e. nodes with exactly one child) that exist on the path from samples to root will be preserved in the output. (Default: False)
 Returns
The simplified tree sequence, or (if
map_nodes
is True) a tuple consisting of the simplified tree sequence and a numpy array mapping source node IDs to their corresponding IDs in the new tree sequence. Return type
TreeSequence or a (TreeSequence, numpy.array) tuple

sites
()¶ Returns an iterator over all the sites in this tree sequence. Sites are returned in order of increasing ID (and also position). See the
Site
class for details on the available fields for each site. Returns
An iterator over all sites.
 Return type
iter(
Site
)

property
tables
¶ A copy of the tables underlying this tree sequence. See also
dump_tables()
.Warning
This propery currently returns a copy of the tables underlying a tree sequence but it may return a readonly view in the future. Thus, if the tables will subsequently be updated, please use the
dump_tables()
method instead as this will always return a new copy of the TableCollection. Returns
A
TableCollection
containing all a copy of the tables underlying this tree sequence. Return type

trait_correlation
(W, windows=None, mode='site', span_normalise=True)¶ Computes the mean squared correlations between each of the columns of
W
(the “phenotypes”) and inheritance along the tree sequence. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on all samples in the tree sequence.This is computed as squared covariance in
trait_covariance
, but divided by \(p (1p)\), where p is the proportion of samples inheriting from the allele, branch, or node in question.What is computed depends on
mode
: “site”
The sum of squared correlations between presence/absence of each allele and phenotypes, divided by length of the window (if
span_normalise=True
). This is computed as thetrait_covariance
divided by the variance of the relevant column of W and by ;math:p * (1  p), where \(p\) is the allele frequency. “branch”
The sum of squared correlations between the split induced by each branch and phenotypes, multiplied by branch length, averaged across trees in the window. This is computed as the
trait_covariance
, divided by the variance of the column of w and by \(p * (1  p)\), where \(p\) is the proportion of the samples lying below the branch. “node”
For each node, the squared correlation between the property of inheriting from this node and phenotypes, computed as in “branch”.
Note that above we divide by the sample variance, which for a vector x of length n is
np.var(x) * n / (n1)
. Parameters
W (ndarray) – An array of values with one row for each sample and one column for each “phenotype”. Each column must have positive standard deviation.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

trait_covariance
(W, windows=None, mode='site', span_normalise=True)¶ Computes the mean squared covariances between each of the columns of
W
(the “phenotypes”) and inheritance along the tree sequence. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on all samples in the tree sequence.Concretely, if g is a binary vector that indicates inheritance from an allele, branch, or node and w is a column of W, normalised to have mean zero, then the covariance of g and w is \(\sum_i g_i w_i\), the sum of the weights corresponding to entries of g that are 1. Since weights sum to zero, this is also equal to the sum of weights whose entries of g are 0. So, \(cov(g,w)^2 = ((\sum_i g_i w_i)^2 + (\sum_i (1g_i) w_i)^2)/2\).
What is computed depends on
mode
: “site”
The sum of squared covariances between presence/absence of each allele and phenotypes, divided by length of the window (if
span_normalise=True
). This is computed as sum_a (sum(w[a])^2 / 2), where w is a column of W with the average subtracted off, and w[a] is the sum of all entries of w corresponding to samples carrying allele “a”, and the first sum is over all alleles. “branch”
The sum of squared covariances between the split induced by each branch and phenotypes, multiplied by branch length, averaged across trees in the window. This is computed as above: a branch with total weight w[b] below b contributes (branch length) * w[b]^2 to the total value for a tree. (Since the sum of w is zero, the total weight below b and not below b are equal, canceling the factor of 2 above.)
 “node”
For each node, the squared covariance between the property of inheriting from this node and phenotypes, computed as in “branch”.
 Parameters
W (ndarray) – An array of values with one row for each sample and one column for each “phenotype”.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

trait_regression
(W, Z=None, windows=None, mode='site', span_normalise=True)¶ For each trait w (i.e., each column of W), performs the leastsquares linear regression \(w ~ g + Z\), where \(g\) is inheritance in the tree sequence and the columns of \(Z\) are covariates, and computes the squared coefficient of \(g\) in this regression. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on all samples in the tree sequence.
Concretely, if g is a binary vector that indicates inheritance from an allele, branch, or node and w is a column of W, there are \(k\) columns of \(Z\), and the \(k+2\)vector \(b\) minimises \(\sum_i (w_i  b_0  b_1 g_i  b_2 z_{2,i}  ... b_{k+2} z_{k+2,i})^2\) then this returns the number \(b_1^2\). If \(g\) lies in the linear span of the columns of \(Z\), then \(b_1\) is set to 0. To perform the regression without covariates (only the intercept), set Z = None.
What is computed depends on
mode
: “site”
Computes the sum of \(b_1^2/2\) for each allele in the window, as above with \(g\) indicating presence/absence of the allele, then divided by the length of the window if
span_normalise=True
. (For biallelic loci, this number is the same for both alleles, and so summing over each cancels the factor of two.) “branch”
The squared coefficient b_1^2, computed for the split induced by each branch (i.e., with \(g\) indicating inheritance from that branch), multiplied by branch length and tree span, summed over all trees in the window, and divided by the length of the window if
span_normalise=True
. “node”
For each node, the squared coefficient b_1^2, computed for the property of inheriting from this node, as in “branch”.
 Parameters
W (ndarray) – An array of values with one row for each sample and one column for each “phenotype”.
Z (ndarray) – An array of values with one row for each sample and one column for each “covariate”, or None. Columns of Z must be linearly independent.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
 Returns
A ndarray with shape equal to (num windows, num statistics).

trees
(tracked_samples=None, sample_counts=True, sample_lists=False, tracked_leaves=None, leaf_counts=None, leaf_lists=None)¶ Returns an iterator over the trees in this tree sequence. Each value returned in this iterator is an instance of
Tree
. Upon successful termination of the iterator, the tree will be in the “cleared” null state.The
sample_counts
,sample_lists
andtracked_samples
parameters are passed to theTree
constructor, and control the options that are set in the returned tree instance. Warning
Do not store the results of this iterator in a list! For performance reasons, the same underlying object is used for every tree returned which will most likely lead to unexpected behaviour. If you wish to obtain a list of trees in a tree sequence please use
ts.aslist()
instead. Parameters
tracked_samples (list) – The list of samples to be tracked and counted using the
Tree.get_num_tracked_samples()
method.sample_counts (bool) – If True, support constant time sample counts via the
Tree.num_samples()
andTree.get_num_tracked_samples()
methods.sample_lists (bool) – If True, provide more efficient access to the samples beneath a give node using the
Tree.samples()
method.
 Returns
An iterator over the sparse trees in this tree sequence.
 Return type
iter

variants
(as_bytes=False, samples=None, impute_missing_data=False)¶ Returns an iterator over the variants in this tree sequence. See the
Variant
class for details on the fields of each returned object. By default thegenotypes
for the variants are numpy arrays, corresponding to indexes into thealleles
array. If theas_bytes
parameter is true, these allelic values are recorded directly into a bytes array.By default, genotypes are generated for all samples. The
samples
parameter allows us to specify the nodes for which genotypes are generated; output order of genotypes in the returned variants corresponds to the order of the samples in this list. It is also possible to provide nonsample nodes as an argument here, if you wish to generate genotypes for (e.g.) internal nodes. However,impute_missing_data
must be True in this case, as it is not possible to detect missing data for nonsample nodes.If missing data is present at a given site, the genotypes array will contain a special value
tskit.MISSING_DATA
(1) to identify these missing samples. See theVariant
class for more details on how missing data is reported.Missing data is reported by default, but if
impute_missing_data
is set to to True, missing data will be imputed such that all isolated samples are assigned the ancestral state. This was the default behaviour in versions prior to 0.2.0.Note
The
as_bytes
parameter is kept as a compatibility option for older code. It is not the recommended way of accessing variant data, and will be deprecated in a later release. Another method will be provided to obtain the allelic states for each site directly. Parameters
as_bytes (bool) – If True, the genotype values will be returned as a Python bytes object. This is useful in certain situations (i.e., directly printing the genotypes) or when numpy is not available. Otherwise, genotypes are returned as a numpy array (the default).
samples (array_like) – An array of sample IDs for which to generate genotypes, or None for all samples. Default: None.
impute_missing_data (bool) – If True, the allele assigned to any isolated samples is the ancestral state; that is, we impute missing data as the ancestral state. Default: False.
 Returns
An iterator of all variants this tree sequence.
 Return type
iter(
Variant
)

write_fasta
(output, sequence_ids=None, wrap_width=60)¶ Writes haplotype data for samples in FASTA format to the specified filelike object.
Default sequence_ids (i.e. the text immediately following “>”) are “tsk_{sample_number}” e.g. “tsk_0”, “tsk_1” etc. They can be set by providing a list of strings to the sequence_ids argument, which must equal the length of the number of samples. Please ensure that these are unique and compatible with fasta standards, since we do not check this. Default wrap_width for sequences is 60 characters in accordance with fasta standard outputs, but this can be specified. In order to avoid any linewrapping of sequences, set wrap_width = 0.
Example usage:
with open("output.fasta", "w") as fasta_file: ts.write_fasta(fasta_file)
This can also be achieved on the command line use the
tskit fasta
command, e.g.:$ tskit fasta example.trees > example.fasta
 Parameters
output (File) – The filelike object to write the fasta output.
sequence_ids (list(str)) – A list of string names to uniquely identify each of the sequences in the fasta file. If specified, this must be a list of strings of length equal to the number of samples which are output. Note that we do not check the form of these strings in any way, so that it is possible to output bad fasta IDs (for example, by including spaces before the unique identifying part of the string). The default is to output
tsk_j
for the jth individual.wrap_width (int) – This parameter specifies the number of sequence characters to include on each line in the fasta file, before wrapping to the next line for each sequence. Defaults to 60 characters in accordance with fasta standard outputs. To avoid any linewrapping of sequences, set wrap_width = 0. Otherwise, supply any positive integer.

write_vcf
(output, ploidy=None, contig_id='1', individuals=None, individual_names=None, position_transform=None)¶ Writes a VCF formatted file to the specified filelike object. If there is individual information present in the tree sequence (see Individual Table), the values for sample nodes associated with these individuals are combined into phased multiploid individuals and output.
If there is no individual data present in the tree sequence, synthetic individuals are created by combining adjacent samples, and the number of samples combined is equal to the specified ploidy value (1 by default). For example, if we have a ploidy of 2 and a sample of size 6, then we will have 3 diploid samples in the output, consisting of the combined genotypes for samples [0, 1], [2, 3] and [4, 5]. If we had genotypes 011110 at a particular variant, then we would output the diploid genotypes 01, 11 and 10 in VCF.
Each individual in the output is identified by a string; these are the VCF “sample” names. By default, these are of the form
tsk_0
,tsk_1
etc, up to the number of individuals, but can be manually specified using theindividual_names
argument. We do not check for duplicates in this array, or perform any checks to ensure that the output VCF is wellformed.The REF value in the output VCF is the ancestral allele for a site and ALT values are the remaining alleles. It is important to note, therefore, that for real data this means that the REF value for a given site may not be equal to the reference allele. We also do not check that the alleles result in a valid VCF—for example, it is possible to use the tab character as an allele, leading to a broken VCF.
The
position_transform
argument provides a way to flexibly translate the genomic location of sites in tskit to the appropriate value in VCF. There are two fundamental differences in the way that tskit and VCF define genomic coordinates. The first is that tskit uses floating point values to encode positions, whereas VCF uses integers. Thus, if the tree sequence contains positions at nonintegral locations there is an information loss incurred by translating to VCF. By default, we round the site positions to the nearest integer, such that there may be several sites with the same integer position in the output. The second difference between VCF and tskit is that VCF is defined to be a 1based coordinate system, whereas tskit uses 0based. However, how coordinates are transformed depends on the VCF parser, and so we do not account for this change in coordinate system by default.Example usage:
with open("output.vcf", "w") as vcf_file: tree_sequence.write_vcf(vcf_file, ploidy=2)
The VCF output can also be compressed using the
gzip
module, if you wish:import gzip with gzip.open("output.vcf.gz", "wt") as f: ts.write_vcf(f)
However, this gzipped VCF may not be fully compatible with downstream tools such as tabix, which may require the VCF use the specialised bgzip format. A general way to convert VCF data to various formats is to pipe the text produced by
tskit
intobcftools
, as done here:import os import subprocess read_fd, write_fd = os.pipe() write_pipe = os.fdopen(write_fd, "w") with open("output.bcf", "w") as bcf_file: proc = subprocess.Popen( ["bcftools", "view", "O", "b"], stdin=read_fd, stdout=bcf_file) ts.write_vcf(write_pipe) write_pipe.close() os.close(read_fd) proc.wait() if proc.returncode != 0: raise RuntimeError("bcftools failed with status:", proc.returncode)
This can also be achieved on the command line use the
tskit vcf
command, e.g.:$ tskit vcf example.trees  bcftools view O b > example.bcf
 Parameters
output (File) – The filelike object to write the VCF output.
ploidy (int) – The ploidy of the individuals to be written to VCF. This sample size must be evenly divisible by ploidy.
contig_id (str) – The value of the CHROM column in the output VCF.
individuals (list(int)) – A list containing the individual IDs to write out to VCF. Defaults to all individuals in the tree sequence.
individual_names (list(str)) – A list of string names to identify individual columns in the VCF. In VCF nomenclature, these are the sample IDs. If specified, this must be a list of strings of length equal to the number of individuals to be output. Note that we do not check the form of these strings in any way, so that is is possible to output malformed VCF (for example, by embedding a tab character within on of the names). The default is to output
tsk_j
for the jth individual.position_transform (callable) – A callable that transforms the site position values into integer valued coordinates suitable for VCF. The function takes a single positional parameter x and must return an integer numpy array the same dimension as x. By default, this is set to
numpy.round()
which will round values to the nearest integer. If the string “legacy” is provided here, the pre 0.2.0 legacy behaviour of rounding values to the nearest integer (starting from 1) and avoiding the output of identical positions by incrementing is used.


class
tskit.
Tree
(tree_sequence, tracked_samples=None, sample_counts=True, sample_lists=False)¶ A single tree in a
TreeSequence
. Please see the Moving along a tree sequence section for information on how efficiently access trees sequentially or obtain a list of individual trees in a tree sequence.The
sample_counts
andsample_lists
parameters control the features that are enabled for this tree. Ifsample_counts
is True, then it is possible to count the number of samples underneath a particular node in constant time using thenum_samples()
method. Ifsample_lists
is True a more efficient algorithm is used in theTree.samples()
method.The
tracked_samples
parameter can be used to efficiently count the number of samples in a given set that exist in a particular subtree using theTree.num_tracked_samples()
method. It is an error to use thetracked_samples
parameter when thesample_counts
flag is False.The
Tree
class is a statemachine which has a state corresponding to each of the trees in the parent tree sequence. We transition between these states by using the seek functions likeTree.first()
,Tree.last()
,Tree.seek()
andTree.seek_index()
. There is one more state, the socalled “null” or “cleared” state. This is the state that aTree
is in immediately after initialisation; it has an index of 1, and no edges. We can also enter the null state by callingTree.next()
on the last tree in a sequence, callingTree.prev()
on the first tree in a sequence or calling calling theTree.clear()
method at any time.The highlevel TreeSequence seeking and iterations methods (e.g,
TreeSequence.trees()
) are built on these lowlevel statemachine seek operations. We recommend these higher level operations for most users. Parameters
tree_sequence (TreeSequence) – The parent tree sequence.
tracked_samples (list) – The list of samples to be tracked and counted using the
Tree.num_tracked_samples()
method.sample_counts (bool) – If True, support constant time sample counts via the
Tree.num_samples()
andTree.num_tracked_samples()
methods.sample_lists (bool) – If True, provide more efficient access to the samples beneath a give node using the
Tree.samples()
method.

branch_length
(u)¶ Returns the length of the branch (in generations) joining the specified node to its parent. This is equivalent to
>>> tree.time(tree.parent(u))  tree.time(u)
The branch length for a node that has no parent (e.g., a root) is defined as zero.
Note that this is not related to the property .length which is a deprecated alias for the genomic
span
covered by a tree.

children
(u)¶ Returns the children of the specified node
u
as a tuple of integer node IDs. Ifu
is a leaf, return the empty tuple.

clear
()¶ Resets this tree back to the initial null state. Calling this method on a tree already in the null state has no effect.

copy
()¶ Returns a deep copy of this tree. The returned tree will have identical state to this tree.
 Returns
A copy of this tree.
 Return type

draw
(path=None, width=None, height=None, node_labels=None, node_colours=None, mutation_labels=None, mutation_colours=None, format=None, edge_colours=None, tree_height_scale=None, max_tree_height=None)¶ Returns a drawing of this tree.
When working in a Jupyter notebook, use the
IPython.display.SVG
function to display the SVG output from this function inline in the notebook:>>> SVG(tree.draw())
The unicode format uses unicode box drawing characters to render the tree. This allows rendered trees to be printed out to the terminal:
>>> print(tree.draw(format="unicode")) 6 ┏━┻━┓ ┃ 5 ┃ ┏━┻┓ ┃ ┃ 4 ┃ ┃ ┏┻┓ 3 0 1 2
The
node_labels
argument allows the user to specify custom labels for nodes, or no labels at all:>>> print(tree.draw(format="unicode", node_labels={})) ┃ ┏━┻━┓ ┃ ┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃
Note: in some environments such as Jupyter notebooks with Windows or Mac, users have observed that the Unicode box drawings can be misaligned. In these cases, we recommend using the SVG or ASCII display formats instead. If you have a strong preference for aligned Unicode, you can try out the solution documented here.
 Parameters
path (str) – The path to the file to write the output. If None, do not write to file.
width (int) – The width of the image in pixels. If not specified, either defaults to the minimum size required to depict the tree (text formats) or 200 pixels.
height (int) – The height of the image in pixels. If not specified, either defaults to the minimum size required to depict the tree (text formats) or 200 pixels.
node_labels (map) – If specified, show custom labels for the nodes that are present in the map. Any nodes not specified in the map will not have a node label.
node_colours (map) – If specified, show custom colours for the nodes given in the map. Any nodes not specified in the map will take the default colour; a value of
None
is treated as transparent and hence the node symbol is not plotted. (Only supported in the SVG format.)mutation_labels (map) – If specified, show custom labels for the mutations (specified by ID) that are present in the map. Any mutations not in the map will not have a label. (Showing mutations is currently only supported in the SVG format)
mutation_colours (map) – If specified, show custom colours for the mutations given in the map (specified by ID). As for
node_colours
, mutations not present in the map take the default colour, and those mapping toNone
are not drawn. (Only supported in the SVG format.)format (str) – The format of the returned image. Currently supported are ‘svg’, ‘ascii’ and ‘unicode’.
edge_colours (map) – If specified, show custom colours for the edge joining each node in the map to its parent. As for
node_colours
, unspecified edges take the default colour, andNone
values result in the edge being omitted. (Only supported in the SVG format.)tree_height_scale (str) – Control how height values for nodes are computed. If this is equal to
"time"
, node heights are proportional to their time values. If this is equal to"log_time"
, node heights are proportional to their log(time) values. If it is equal to"rank"
, node heights are spaced equally according to their ranked times. For SVG output the default is ‘time’scale whereas for text output the default is ‘rank’scale. Time scaling is not currently supported for text output.max_tree_height (str,float) – The maximum tree height value in the current scaling system (see
tree_height_scale
). Can be either a string or a numeric value. If equal to"tree"
, the maximum tree height is set to be that of the oldest root in the tree. If equal to"ts"
the maximum height is set to be the height of the oldest root in the tree sequence; this is useful when drawing trees from the same tree sequence as it ensures that node heights are consistent. If a numeric value, this is used as the maximum tree height by which to scale other nodes. This parameters is not currently supported for text output.
 Returns
A representation of this tree in the requested format.
 Return type

first
()¶ Seeks to the first tree in the sequence. This can be called whether the tree is in the null state or not.

property
index
¶ Returns the index this tree occupies in the parent tree sequence. This index is zero based, so the first tree in the sequence has index 0.
 Returns
The index of this tree.
 Return type

property
interval
¶ Returns the coordinates of the genomic interval that this tree represents the history of. The interval is returned as a tuple \((l, r)\) and is a halfopen interval such that the left coordinate is inclusive and the right coordinate is exclusive. This tree therefore applies to all genomic locations \(x\) such that \(l \leq x < r\).
 Returns
A tuple (l, r) representing the leftmost (inclusive) and rightmost (exclusive) coordinates of the genomic region covered by this tree.
 Return type

is_descendant
(u, v)¶ Returns True if the specified node u is a descendant of node v and False otherwise. A node \(u\) is a descendant of another node \(v\) if \(v\) is on the path from \(u\) to root. A node is considered to be a descendant of itself, so
tree.is_descendant(u, u)
will be True for any valid node. Parameters
 Returns
True if u is a descendant of v.
 Return type
 Raises
ValueError – If u or v are not valid node IDs.

is_internal
(u)¶ Returns True if the specified node is not a leaf. A node is internal if it has one or more children in the current tree.

is_leaf
(u)¶ Returns True if the specified node is a leaf. A node \(u\) is a leaf if it has zero children.

is_sample
(u)¶ Returns True if the specified node is a sample. A node \(u\) is a sample if it has been marked as a sample in the parent tree sequence.

last
()¶ Seeks to the last tree in the sequence. This can be called whether the tree is in the null state or not.

leaves
(u=None)¶ Returns an iterator over all the leaves in this tree that are underneath the specified node. If u is not specified, return all leaves in the tree.
 Parameters
u (int) – The node of interest.
 Returns
An iterator over all leaves in the subtree rooted at u.
 Return type
iterator

map_mutations
(genotypes, alleles)¶ Given observations for the samples in this tree described by the specified set of genotypes and alleles, return a parsimonious set of state transitions explaining these observations. The genotypes array is interpreted as indexes into the alleles list in the same manner as described in the
TreeSequence.variants()
method. Thus, if samplej
carries the allele at indexk
, then we havegenotypes[j] = k
. Missing data can be specified for a sample using the valuetskit.MISSING_DATA
(1). At least one nonmissing observation must be provided. A maximum of 63 alleles are supported.The current implementation uses the Fitch parsimony algorithm to determine the minimum number of state transitions required to explain the data. In this model, transitions between any of the nonmissing states is equally likely.
The returned values correspond directly to the data model for describing variation at sites using mutations. See the Site Table and Mutation Table definitions for details and background.
The state reconstruction is returned as twotuple,
(ancestral_state, mutations)
, whereancestral_state
is the allele assigned to the tree root(s) andmutations
is a list ofMutation
objects. For each mutation,node
is the tree node at the bottom of the branch on which the transition occurred, andderived_state
is the new state after this mutation. When multiple mutations are returned theparent
property contains the index of the previous mutation on the path to root (see the Mutation Table for more information on the concept of mutation parents). All other attributes of theMutation
object are undefined and should not be used.See the Parsimony section in the tutorial for examples of how to use this method.
 Parameters
genotypes (array_like) – The input observations for the samples in this tree.
 Returns
The inferred ancestral state and list of mutations on this tree that encode the specified observations.
 Return type
(str, list(tskit.Mutation))

mrca
(u, v)¶ Returns the most recent common ancestor of the specified nodes.

mutations
()¶ Returns an iterator over all the mutations in this tree. Mutations are returned in order of nondecreasing site ID. See the
Mutation
class for details on the available fields for each mutation.The returned iterator is equivalent to iterating over all sites and all mutations in each site, i.e.:
>>> for site in tree.sites(): >>> for mutation in site.mutations: >>> yield mutation

newick
(precision=14, root=None, node_labels=None)¶ Returns a newick encoding of this tree. If the
root
argument is specified, return a representation of the specified subtree, otherwise the full tree is returned. If the tree has multiple roots then seperate newick strings for each rooted subtree must be found (i.e., we do not attempt to concatenate the different trees).By default, leaf nodes are labelled with their numerical ID + 1, and internal nodes are not labelled. Arbitrary node labels can be specified using the
node_labels
argument, which maps node IDs to the desired labels.Warning
Node labels are not Newick escaped, so care must be taken to provide labels that will not break the encoding.
 Parameters
precision (int) – The numerical precision with which branch lengths are printed.
root (int) – If specified, return the tree rooted at this node.
node_labels (map) – If specified, show custom labels for the nodes that are present in the map. Any nodes not specified in the map will not have a node label.
 Returns
A newick representation of this tree.
 Return type

next
()¶ Seeks to the next tree in the sequence. If the tree is in the initial null state we seek to the first tree (equivalent to calling
first()
). Callingnext
on the last tree in the sequence results in the tree being cleared back into the null initial state (equivalent to callingclear()
). The return value of the function indicates whether the tree is in a nonnull state, and can be used to loop over the trees:# Iterate over the trees from lefttoright tree = tskit.Tree(tree_sequence) while tree.next() # Do something with the tree. print(tree.index) # tree is now back in the null state.
 Returns
True if the tree has been transformed into one of the trees in the sequence; False if the tree has been transformed into the null state.
 Return type

nodes
(root=None, order='preorder')¶ Returns an iterator over the nodes in this tree. If the root parameter is provided, iterate over the nodes in the subtree rooted at this node. If this is None, iterate over all nodes. If the order parameter is provided, iterate over the nodes in required tree traversal order.

property
num_mutations
¶ Returns the total number of mutations across all sites on this tree.
 Returns
The total number of mutations over all sites on this tree.
 Return type

property
num_nodes
¶ Returns the number of nodes in the
TreeSequence
this tree is in. Equivalent totree.tree_sequence.num_nodes
. To find the number of nodes that are reachable from all roots uselen(list(tree.nodes()))
. Return type

property
num_roots
¶ The number of roots in this tree, as defined in the
roots
attribute.Requires O(number of roots) time.
 Return type

num_samples
(u=None)¶ Returns the number of samples in this tree underneath the specified node (including the node itself). If u is not specified return the total number of samples in the tree.
If the
TreeSequence.trees()
method is called withsample_counts=True
this method is a constant time operation. If not, a slower traversal based algorithm is used to count the samples.

property
num_sites
¶ Returns the number of sites on this tree.
 Returns
The number of sites on this tree.
 Return type

num_tracked_samples
(u=None)¶ Returns the number of samples in the set specified in the
tracked_samples
parameter of theTreeSequence.trees()
method underneath the specified node. If the input node is not specified, return the total number of tracked samples in the tree.This is a constant time operation.
 Parameters
u (int) – The node of interest.
 Returns
The number of samples within the set of tracked samples in the subtree rooted at u.
 Return type
 Raises
RuntimeError – if the
TreeSequence.trees()
method is not called withsample_counts=True
.

parent
(u)¶ Returns the parent of the specified node. Returns the
NULL
if u is the root or is not a node in the current tree.

population
(u)¶ Returns the population associated with the specified node. Equivalent to
tree.tree_sequence.node(u).population
.

prev
()¶ Seeks to the previous tree in the sequence. If the tree is in the initial null state we seek to the last tree (equivalent to calling
last()
). Callingprev
on the first tree in the sequence results in the tree being cleared back into the null initial state (equivalent to callingclear()
). The return value of the function indicates whether the tree is in a nonnull state, and can be used to loop over the trees:# Iterate over the trees from righttoleft tree = tskit.Tree(tree_sequence) while tree.prev() # Do something with the tree. print(tree.index) # tree is now back in the null state.
 Returns
True if the tree has been transformed into one of the trees in the sequence; False if the tree has been transformed into the null state.
 Return type

property
root
¶ The root of this tree. If the tree contains multiple roots, a ValueError is raised indicating that the
roots
attribute should be used instead. Returns
The root node.
 Return type
 Raises
ValueError
if this tree contains more than one root.

property
roots
¶ The list of roots in this tree. A root is defined as a unique endpoint of the paths starting at samples. We can define the set of roots as follows:
roots = set() for u in tree_sequence.samples(): while tree.parent(u) != tskit.NULL: u = tree.parent(u) roots.add(u) # roots is now the set of all roots in this tree. assert sorted(roots) == sorted(tree.roots)
The roots of the tree are returned in a list, in no particular order.
Requires O(number of roots) time.
 Returns
The list of roots in this tree.
 Return type

property
sample_size
¶ Returns the sample size for this tree. This is the number of sample nodes in the tree.
 Returns
The number of sample nodes in the tree.
 Return type

samples
(u=None)¶ Returns an iterator over all the samples in this tree that are underneath the specified node. If u is a sample, it is included in the returned iterator. If u is not specified, return all samples in the tree.
If the
TreeSequence.trees()
method is called withsample_lists=True
, this method uses an efficient algorithm to find the samples. If not, a simple traversal based method is used. Parameters
u (int) – The node of interest.
 Returns
An iterator over all samples in the subtree rooted at u.
 Return type
iterator

seek
(position)¶ Sets the state to represent the tree that covers the specified position in the parent tree sequence. After a successful return of this method we have
tree.interval[0]
<=position
<tree.interval[1]
. Parameters
position (float) – The position along the sequence length to seek to.
 Raises
ValueError – If 0 < position or position >=
TreeSequence.sequence_length
.

seek_index
(index)¶ Sets the state to represent the tree at the specified index in the parent tree sequence. Negative indexes following the standard Python conventions are allowed, i.e.,
index=1
will seek to the last tree in the sequence. Parameters
index (int) – The tree index to seek to.
 Raises
IndexError – If an index outside the acceptable range is provided.

sites
()¶ Returns an iterator over all the sites in this tree. Sites are returned in order of increasing ID (and also position). See the
Site
class for details on the available fields for each site. Returns
An iterator over all sites in this tree.

property
span
¶ Returns the genomic distance that this tree spans. This is defined as \(r  l\), where \((l, r)\) is the genomic interval returned by
interval
. Returns
The genomic distance covered by this tree.
 Return type

time
(u)¶ Returns the time of the specified node in generations. Equivalent to
tree.tree_sequence.node(u).time
.

tmrca
(u, v)¶ Returns the time of the most recent common ancestor of the specified nodes. This is equivalent to:
>>> tree.time(tree.mrca(u, v))

property
total_branch_length
¶ Returns the sum of all the branch lengths in this tree (in units of generations). This is equivalent to
>>> sum(tree.branch_length(u) for u in tree.nodes())
Note that the branch lengths for root nodes are defined as zero.
As this is defined by a traversal of the tree, technically we return the sum of all branch lengths that are reachable from roots. Thus, this is the sum of all branches that are ancestral to at least one sample. This distinction is only important in tree sequences that contain ‘dead branches’, i.e., those that define topology not ancestral to any samples.
 Returns
The sum of lengths of branches in this tree.
 Return type

property
tree_sequence
¶ Returns the tree sequence that this tree is from.
 Returns
The parent tree sequence for this tree.
 Return type
Constants¶

tskit.NULL == 1
Special reserved value representing a null ID.

tskit.FORWARD == 1
Constant representing the forward direction of travel (i.e., increasing genomic coordinate values).

tskit.REVERSE == 1
Constant representing the reverse direction of travel (i.e., decreasing genomic coordinate values).
Simple container classes¶
These classes are simple shallow containers representing the entities defined
in the Definitions. These classes are not intended to be instantiated
directly, but are the return types for the various iterators provided by the
TreeSequence
and Tree
classes.

class
tskit.
Individual
¶ An individual in a tree sequence. Since nodes correspond to genomes, individuals are associated with a collection of nodes (e.g., two nodes per diploid). See Nodes, Genomes, or Individuals? for more discussion of this distinction.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
id (int) – The integer ID of this individual. Varies from 0 to
TreeSequence.num_individuals
 1.flags (int) – The bitwise flags for this individual.
location (numpy.ndarray) – The spatial location of this individual as a numpy array. The location is an empty array if no spatial location is defined.
nodes – The IDs of the nodes that are associated with this individual as a numpy array (dtype=np.int32). If no nodes are associated with the individual this array will be empty.

class
tskit.
Node
¶ A node in a tree sequence, corresponding to a single genome. The
time
andpopulation
are attributes of theNode
, rather than theIndividual
, as discussed in Nodes, Genomes, or Individuals?.Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
id (int) – The integer ID of this node. Varies from 0 to
TreeSequence.num_nodes
 1.flags (int) – The bitwise flags for this node.
time (float) – The birth time of this node.
population (int) – The integer ID of the population that this node was born in.
individual (int) – The integer ID of the individual that this node was a part of.

class
tskit.
Edge
¶ An edge in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
left (float) – The left coordinate of this edge.
right (float) – The right coordinate of this edge.
parent (int) – The integer ID of the parent node for this edge. To obtain further information about a node with a given ID, use
TreeSequence.node()
.child (int) – The integer ID of the child node for this edge. To obtain further information about a node with a given ID, use
TreeSequence.node()
.id (int) – The integer ID of this edge. Varies from 0 to
TreeSequence.num_edges
 1.

class
tskit.
Site
¶ A site in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
id (int) – The integer ID of this site. Varies from 0 to
TreeSequence.num_sites
 1.position (float) – The floating point location of this site in genome coordinates. Ranges from 0 (inclusive) to
TreeSequence.sequence_length
(exclusive).ancestral_state (str) – The ancestral state at this site (i.e., the state inherited by nodes, unless mutations occur).
mutations (list[
Mutation
]) – The list of mutations at this site. Mutations within a site are returned in the order they are specified in the underlyingMutationTable
.

class
tskit.
Mutation
¶ A mutation in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
id (int) – The integer ID of this mutation. Varies from 0 to
TreeSequence.num_mutations
 1.site (int) – The integer ID of the site that this mutation occurs at. To obtain further information about a site with a given ID use
TreeSequence.site()
.node (int) – The integer ID of the first node that inherits this mutation. To obtain further information about a node with a given ID, use
TreeSequence.node()
.derived_state (str) – The derived state for this mutation. This is the state inherited by nodes in the subtree rooted at this mutation’s node, unless another mutation occurs.
parent (int) – The integer ID of this mutation’s parent mutation. When multiple mutations occur at a site along a path in the tree, mutations must record the mutation that is immediately above them. If the mutation does not have a parent, this is equal to the
NULL
(1). To obtain further information about a mutation with a given ID, useTreeSequence.mutation()
.

class
tskit.
Variant
¶ A variant represents the observed variation among samples for a given site. A variant consists (a) of a reference to the
Site
instance in question; (b) the alleles that may be observed at the samples for this site; and (c) the genotypes mapping sample IDs to the observed alleles.Each element in the
alleles
tuple is a string, representing the actual observed state for a given sample. The first element of this tuple is guaranteed to be the same as the site’sancestral_state
value. The list of alleles is also guaranteed not to contain any duplicates. However, allelic values may be listed that are not referred to by any samples. For example, if we have a site that is fixed for the derived state (i.e., we have a mutation over the tree root), all genotypes will be 1, but the alleles list will be equal to('0', '1')
. Other than the ancestral state being the first allele, the alleles are listed in no particular order, and the ordering should not be relied upon (but see the notes on missing data below).The
genotypes
represent the observed allelic states for each sample, such thatvar.alleles[var.genotypes[j]]
gives the string allele for sample IDj
. Thus, the elements of the genotypes array are indexes into thealleles
list. The genotypes are provided in this way via a numpy array to enable efficient calculations.When missing data is present at a given site boolean flag
has_missing_data
will be True, at least one element of thegenotypes
array will be equal totskit.MISSING_DATA
, and the last element of thealleles
array will beNone
. Note that in this casevariant.num_alleles
will not be equal tolen(variant.alleles)
. The rationale for addingNone
to the end of thealleles
list is to help code that does not handle missing data correctly fail early rather than introducing subtle and hardtofind bugs. Astskit.MISSING_DATA
is equal to 1, code that decodes genotypes into allelic values without taking missing data into account would otherwise output the last allele in the list rather missing data.Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
site (
Site
) – The site object for this variant.alleles (tuple(str)) – A tuple of the allelic values that may be observed at the samples at the current site. The first element of this tuple is always the site’s ancestral state.
genotypes (numpy.ndarray) – An array of indexes into the list
alleles
, giving the state of each sample at the current site.has_missing_data (bool) – True if there is missing data for any of the samples at the current site.
num_alleles (int) – The number of distinct alleles at this site. Note that this may be greater than the number of distinct values in the genotypes array.

class
tskit.
Migration
¶ A migration in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
 Variables
left (float) – The left end of the genomic interval covered by this migration (inclusive).
right (float) – The right end of the genomic interval covered by this migration (exclusive).
node (int) – The integer ID of the node involved in this migration event. To obtain further information about a node with a given ID, use
TreeSequence.node()
.source (int) – The source population ID.
dest (int) – The destination population ID.
time (float) – The time at which this migration occured at.

class
tskit.
Population
¶ A population in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
Loading data¶
There are several methods for loading data into a TreeSequence
instance. The simplest and most convenient is the use the tskit.load()
function to load a tree sequence file. For small
scale data and debugging, it is often convenient to use the
tskit.load_text()
to read data in the text file format. The TableCollection.tree_sequence()
function
efficiently creates a TreeSequence
object from a set of tables
using the Tables API.

tskit.
load
(path)¶ Loads a tree sequence from the specified file path. This file must be in the tree sequence file format produced by the
TreeSequence.dump()
method. Parameters
path (str) – The file path of the
.trees
file containing the tree sequence we wish to load. Returns
The tree sequence object containing the information stored in the specified file path.
 Return type

tskit.
load_text
(nodes, edges, sites=None, mutations=None, individuals=None, populations=None, sequence_length=0, strict=True, encoding='utf8', base64_metadata=True)¶ Parses the tree sequence data from the specified filelike objects, and returns the resulting
TreeSequence
object. The format for these files is documented in the Text file formats section, and is produced by theTreeSequence.dump_text()
method. Further properties required for an input tree sequence are described in the Valid tree sequence requirements section. This method is intended as a convenient interface for importing external data into tskit; the binary file format using bytskit.load()
is many times more efficient than this text format.The
nodes
andedges
parameters are mandatory and must be filelike objects containing text with whitespace delimited columns, parsable byparse_nodes()
andparse_edges()
, respectively.sites
,mutations
,individuals
andpopulations
are optional, and must be parsable byparse_sites()
,parse_individuals()
,parse_populations()
, andparse_mutations()
, respectively.TODO: there is no method to parse the remaining tables at present, so only tree sequences not requiring Population and Individual tables can be loaded. This will be fixed: https://github.com/tskitdev/msprime/issues/498
The
sequence_length
parameter determines theTreeSequence.sequence_length
of the returned tree sequence. If it is 0 or not specified, the value is taken to be the maximum right coordinate of the input edges. This parameter is useful in degenerate situations (such as when there are zero edges), but can usually be ignored.The
strict
parameter controls the field delimiting algorithm that is used. Ifstrict
is True (the default), we require exactly one tab character separating each field. Ifstrict
is False, a more relaxed whitespace delimiting algorithm is used, such that any run of whitespace is regarded as a field separator. In most situations,strict=False
is more convenient, but it can lead to error in certain situations. For example, if a deletion is encoded in the mutation table this will not be parseable whenstrict=False
.After parsing the tables,
TableCollection.sort()
is called to ensure that the loaded tables satisfy the tree sequence ordering requirements. Note that this may result in the IDs of various entities changing from their positions in the input file. Parameters
nodes (stream) – The filelike object containing text describing a
NodeTable
.edges (stream) – The filelike object containing text describing an
EdgeTable
.sites (stream) – The filelike object containing text describing a
SiteTable
.mutations (stream) – The filelike object containing text describing a
MutationTable
.individuals (stream) – The filelike object containing text describing a
IndividualTable
.populations (stream) – The filelike object containing text describing a
PopulationTable
.sequence_length (float) – The sequence length of the returned tree sequence. If not supplied or zero this will be inferred from the set of edges.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (string) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
 Returns
The tree sequence object containing the information stored in the specified file paths.
 Return type
Tables¶
The tables API provides an efficient way of working
with and interchanging tree sequence data. Each table
class (e.g, NodeTable
, EdgeTable
) has a specific set of
columns with fixed types, and a set of methods for setting and getting the data
in these columns. The number of rows in the table t
is given by len(t)
.
Each table supports accessing the data either by row or column. To access the
row j
in table t
simply use t[j]
. The value returned by such an
access is an instance of collections.namedtuple()
, and therefore supports
either positional or named attribute access. To access the data in
a column, we can use standard attribute access which will return a numpy array
of the data. For example:
>>> import tskit
>>> t = tskit.EdgeTable()
>>> t.add_row(left=0, right=1, parent=10, child=11)
0
>>> t.add_row(left=1, right=2, parent=9, child=11)
1
>>> print(t)
id left right parent child
0 0.00000000 1.00000000 10 11
1 1.00000000 2.00000000 9 11
>>> t[0]
EdgeTableRow(left=0.0, right=1.0, parent=10, child=11)
>>> t[1]
EdgeTableRow(left=1.0, right=2.0, parent=9, child=11)
>>> t.left
array([ 0., 1.])
>>> t.parent
array([10, 9], dtype=int32)
>>> len(t)
2
>>>
Tables also support the pickle
protocol, and so can be easily
serialised and deserialised (for example, when performing parallel
computations using the multiprocessing
module).
>>> serialised = pickle.dumps(t)
>>> t2 = pickle.loads(serialised)
>>> print(t2)
id left right parent child
0 0.00000000 1.00000000 10 11
1 1.00000000 2.00000000 9 11
However, pickling will not be as efficient as storing tables in the native format.
Tables support the equality operator ==
based on the data
held in the columns:
>>> t == t2
True
>>> t is t2
False
>>> t2.add_row(0, 1, 2, 3)
2
>>> print(t2)
id left right parent child
0 0.00000000 1.00000000 10 11
1 1.00000000 2.00000000 9 11
2 0.00000000 1.00000000 2 3
>>> t == t2
False
Text columns¶
As described in the Encoding ragged columns, working with variable length columns is somewhat more involved. Columns encoding text data store the encoded bytes of the flattened strings, and the offsets into this column in two separate arrays.
Consider the following example:
>>> t = tskit.SiteTable()
>>> t.add_row(0, "A")
>>> t.add_row(1, "BB")
>>> t.add_row(2, "")
>>> t.add_row(3, "CCC")
>>> print(t)
id position ancestral_state metadata
0 0.00000000 A
1 1.00000000 BB
2 2.00000000
3 3.00000000 CCC
>>> t[0]
SiteTableRow(position=0.0, ancestral_state='A', metadata=b'')
>>> t[1]
SiteTableRow(position=1.0, ancestral_state='BB', metadata=b'')
>>> t[2]
SiteTableRow(position=2.0, ancestral_state='', metadata=b'')
>>> t[3]
SiteTableRow(position=3.0, ancestral_state='CCC', metadata=b'')
Here we create a SiteTable
and add four rows, each with a different
ancestral_state
. We can then access this information from each
row in a straightforward manner. Working with the data in the columns
is a little trickier, however:
>>> t.ancestral_state
array([65, 66, 66, 67, 67, 67], dtype=int8)
>>> t.ancestral_state_offset
array([0, 1, 3, 3, 6], dtype=uint32)
>>> tskit.unpack_strings(t.ancestral_state, t.ancestral_state_offset)
['A', 'BB', '', 'CCC']
Here, the ancestral_state
array is the UTF8 encoded bytes of the flattened
strings, and the ancestral_state_offset
is the offset into this array
for each row. The unpack_strings()
function, however, is a convient
way to recover the original strings from this encoding. We can also use the
pack_strings()
to insert data using this approach:
>>> a, off = tskit.pack_strings(["0", "12", ""])
>>> t.set_columns(position=[0, 1, 2], ancestral_state=a, ancestral_state_offset=off)
>>> print(t)
id position ancestral_state metadata
0 0.00000000 0
1 1.00000000 12
2 2.00000000
When inserting many rows with standard infinite sites mutations (i.e.,
ancestral state is “0”), it is more efficient to construct the
numpy arrays directly than to create a list of strings and use
pack_strings()
. When doing this, it is important to note that
it is the encoded byte values that are stored; by default, we
use UTF8 (which corresponds to ASCII for simple printable characters).:
>>> t_s = tskit.SiteTable()
>>> m = 10
>>> a = ord("0") + np.zeros(m, dtype=np.int8)
>>> off = np.arange(m + 1, dtype=np.uint32)
>>> t_s.set_columns(position=np.arange(m), ancestral_state=a, ancestral_state_offset=off)
>>> print(t_s)
id position ancestral_state metadata
0 0.00000000 0
1 1.00000000 0
2 2.00000000 0
3 3.00000000 0
4 4.00000000 0
5 5.00000000 0
6 6.00000000 0
7 7.00000000 0
8 8.00000000 0
9 9.00000000 0
>>> t_s.ancestral_state
array([48, 48, 48, 48, 48, 48, 48, 48, 48, 48], dtype=int8)
>>> t_s.ancestral_state_offset
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint32)
Here we create 10 sites at regular positions, each with ancestral state equal to
“0”. Note that we use ord("0")
to get the ASCII code for “0” (48), and create
10 copies of this by adding it to an array of zeros. We have done this for
illustration purposes: it is equivalent (though slower for large examples) to do
a, off = tskit.pack_strings(["0"] * m)
.
Mutations can be handled similarly:
>>> t_m = tskit.MutationTable()
>>> site = np.arange(m, dtype=np.int32)
>>> d, off = tskit.pack_strings(["1"] * m)
>>> node = np.zeros(m, dtype=np.int32)
>>> t_m.set_columns(site=site, node=node, derived_state=d, derived_state_offset=off)
>>> print(t_m)
id site node derived_state parent metadata
0 0 0 1 1
1 1 0 1 1
2 2 0 1 1
3 3 0 1 1
4 4 0 1 1
5 5 0 1 1
6 6 0 1 1
7 7 0 1 1
8 8 0 1 1
9 9 0 1 1
>>>
Binary columns¶
Columns storing binary data take the same approach as
Text columns to encoding
variable length data.
The difference between the two is
only raw bytes
values are accepted: no character encoding or
decoding is done on the data. Consider the following example:
>>> t = tskit.NodeTable()
>>> t.add_row(metadata=b"raw bytes")
>>> t.add_row(metadata=pickle.dumps({"x": 1.1}))
>>> t[0].metadata
b'raw bytes'
>>> t[1].metadata
b'\x80\x03}q\x00X\x01\x00\x00\x00xq\x01G?\xf1\x99\x99\x99\x99\x99\x9as.'
>>> pickle.loads(t[1].metadata)
{'x': 1.1}
>>> print(t)
id flags population time metadata
0 0 1 0.00000000000000 cmF3IGJ5dGVz
1 0 1 0.00000000000000 gAN9cQBYAQAAAHhxAUc/8ZmZmZmZmnMu
>>> t.metadata
array([ 114, 97, 119, 32, 98, 121, 116, 101, 115, 128, 3,
125, 113, 0, 88, 1, 0, 0, 0, 120, 113, 1,
71, 63, 15, 103, 103, 103, 103, 103, 102, 115, 46], dtype=int8)
>>> t.metadata_offset
array([ 0, 9, 33], dtype=uint32)
Here we add two rows to a NodeTable
, with different
metadata. The first row contains a simple
byte string, and the second contains a Python dictionary serialised using
pickle
. We then show several different (and seemingly incompatible!)
different views on the same data.
When we access the data in a row (e.g., t[0].metadata
) we are returned
a Python bytes object containing precisely the bytes that were inserted.
The pickled dictionary is encoded in 24 bytes containing unprintable
characters, and when we unpickle it using pickle.loads()
, we obtain
the original dictionary.
When we print the table, however, we see some data which is seemingly unrelated to the original contents. This is because the binary data is base64 encoded to ensure that it is printsafe (and doesn’t break your terminal). (See the Metadata section for more information on the use of base64 encoding.).
Finally, when we print the metadata
column, we see the raw byte values
encoded as signed integers. As for Text columns,
the metadata_offset
column encodes the offsets into this array. So, we
see that the first metadata value is 9 bytes long and the second is 24.
The pack_bytes()
and unpack_bytes()
functions are also useful
for encoding data in these columns.
Table classes¶
This section describes the methods and variables available for each table class. For description and definition of each table’s meaning and use, see the table definitions.

class
tskit.
IndividualTable
¶ A table defining the individuals in a tree sequence. Note that although each Individual has associated nodes, reference to these is not stored in the individual table, but rather reference to the individual is stored for each node in the
NodeTable
. This is similar to the way in which the relationship between sites and mutations is modelled. Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
flags (numpy.ndarray, dtype=np.uint32) – The array of flags values.
location (numpy.ndarray, dtype=np.float64) – The flattened array of floating point location values. See Encoding ragged columns for more details.
location_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the location column. See Encoding ragged columns for more details.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.

add_row
(flags=0, location=None, metadata=None)¶ Adds a new row to this
IndividualTable
and returns the ID of the corresponding individual. Parameters
flags (int) – The bitwise flags for the new node.
location (arraylike) – A list of numeric values or onedimensional numpy array describing the location of this individual. If not specified or None, a zerodimensional location is stored.
metadata (bytes) – The binaryencoded metadata for the new node. If not specified or None, a zerolength byte string is stored.
 Returns
The ID of the newly added node.
 Return type

append_columns
(flags=None, location=None, location_offset=None, metadata=None, metadata_offset=None)¶ Appends the specified arrays to the end of the columns in this
IndividualTable
. This allows many new rows to be added at once.The
flags
array is mandatory and defines the number of extra individuals to add to the table. Thelocation
andlocation_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information. Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each individual. Required.
location (numpy.ndarray, dtype=np.float64) – The flattened location array. Must be specified along with
location_offset
. If not specified or None, an empty location value is stored for each individual.location_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
location
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each individual.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

packset_location
(locations)¶ Packs the specified list of location values and updates the
location
andlocation_offset
columns. The length of the locations array must be equal to the number of rows in the table. Parameters
locations (list) – A list of locations interpreted as numpy float64 arrays.

packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table. Parameters
metadatas (list) – A list of metadata bytes values.

set_columns
(flags=None, location=None, location_offset=None, metadata=None, metadata_offset=None)¶ Sets the values for each column in this
IndividualTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
flags
array is mandatory and defines the number of individuals the table will contain. Thelocation
andlocation_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information. Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each individual. Required.
location (numpy.ndarray, dtype=np.float64) – The flattened location array. Must be specified along with
location_offset
. If not specified or None, an empty location value is stored for each individual.location_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
location
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each individual.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

class
tskit.
NodeTable
¶ A table defining the nodes in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a node table to be a part of a valid tree sequence.
 Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
time (numpy.ndarray, dtype=np.float64) – The array of time values.
flags (numpy.ndarray, dtype=np.uint32) – The array of flags values.
population (numpy.ndarray, dtype=np.int32) – The array of population IDs.
individual (numpy.ndarray, dtype=np.int32) – The array of individual IDs that each node belongs to.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.

add_row
(flags=0, time=0, population=1, individual=1, metadata=None)¶ Adds a new row to this
NodeTable
and returns the ID of the corresponding node. Parameters
flags (int) – The bitwise flags for the new node.
time (float) – The birth time for the new node.
population (int) – The ID of the population in which the new node was born. Defaults to
NULL
.individual (int) – The ID of the individual in which the new node was born. Defaults to
NULL
.metadata (bytes) – The binaryencoded metadata for the new node. If not specified or None, a zerolength byte string is stored.
 Returns
The ID of the newly added node.
 Return type

append_columns
(flags=None, time=None, population=None, individual=None, metadata=None, metadata_offset=None)¶ Appends the specified arrays to the end of the columns in this
NodeTable
. This allows many new rows to be added at once.The
flags
,time
andpopulation
arrays must all be of the same length, which is equal to the number of nodes that will be added to the table. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information. Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each node. Required.
time (numpy.ndarray, dtype=np.float64) – The time values for each node. Required.
population (numpy.ndarray, dtype=np.int32) – The population values for each node. If not specified or None, the
NULL
value is stored for each node.individual (numpy.ndarray, dtype=np.int32) – The individual values for each node. If not specified or None, the
NULL
value is stored for each node.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table. Parameters
metadatas (list) – A list of metadata bytes values.

set_columns
(flags=None, time=None, population=None, individual=None, metadata=None, metadata_offset=None)¶ Sets the values for each column in this
NodeTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
flags
,time
andpopulation
arrays must all be of the same length, which is equal to the number of nodes the table will contain. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information. Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each node. Required.
time (numpy.ndarray, dtype=np.float64) – The time values for each node. Required.
population (numpy.ndarray, dtype=np.int32) – The population values for each node. If not specified or None, the
NULL
value is stored for each node.individual (numpy.ndarray, dtype=np.int32) – The individual values for each node. If not specified or None, the
NULL
value is stored for each node.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

class
tskit.
EdgeTable
¶ A table defining the edges in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for an edge table to be a part of a valid tree sequence.
 Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
left (numpy.ndarray, dtype=np.float64) – The array of left coordinates.
right (numpy.ndarray, dtype=np.float64) – The array of right coordinates.
parent (numpy.ndarray, dtype=np.int32) – The array of parent node IDs.
child (numpy.ndarray, dtype=np.int32) – The array of child node IDs.

add_row
(left, right, parent, child)¶ Adds a new row to this
EdgeTable
and returns the ID of the corresponding edge.

append_columns
(left, right, parent, child)¶ Appends the specified arrays to the end of the columns of this
EdgeTable
. This allows many new rows to be added at once.All four parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of additional edges to add to the table).
 Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
parent (numpy.ndarray, dtype=np.int32) – The parent node IDs.
child (numpy.ndarray, dtype=np.int32) – The child node IDs.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

set_columns
(left=None, right=None, parent=None, child=None)¶ Sets the values for each column in this
EdgeTable
using the values in the specified arrays. Overwrites any data currently stored in the table.All four parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of edges the table will contain).
 Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
parent (numpy.ndarray, dtype=np.int32) – The parent node IDs.
child (numpy.ndarray, dtype=np.int32) – The child node IDs.

squash
()¶ Sorts, then condenses the table into the smallest possible number of rows by combining any adjacent edges. A pair of edges is said to be adjacent if they have the same parent and child nodes, and if the left coordinate of one of the edges is equal to the right coordinate of the other edge. The
squash
method modifies anEdgeTable
in place so that any set of adjacent edges is replaced by a single edge. The new edge will have the same parent and child node, a left coordinate equal to the smallest left coordinate in the set, and a right coordinate equal to the largest right coordinate in the set. The new edge table will be sorted in the canonical order (P, C, L, R).

class
tskit.
MigrationTable
¶ A table defining the migrations in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a migration table to be a part of a valid tree sequence.
 Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
left (numpy.ndarray, dtype=np.float64) – The array of left coordinates.
right (numpy.ndarray, dtype=np.float64) – The array of right coordinates.
node (numpy.ndarray, dtype=np.int32) – The array of node IDs.
source (numpy.ndarray, dtype=np.int32) – The array of source population IDs.
dest (numpy.ndarray, dtype=np.int32) – The array of destination population IDs.
time (numpy.ndarray, dtype=np.float64) – The array of time values.

add_row
(left, right, node, source, dest, time)¶ Adds a new row to this
MigrationTable
and returns the ID of the corresponding migration. Parameters
 Returns
The ID of the newly added migration.
 Return type

append_columns
(left, right, node, source, dest, time)¶ Appends the specified arrays to the end of the columns of this
MigrationTable
. This allows many new rows to be added at once.All six parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of additional migrations to add to the table).
 Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
node (numpy.ndarray, dtype=np.int32) – The node IDs.
source (numpy.ndarray, dtype=np.int32) – The source population IDs.
dest (numpy.ndarray, dtype=np.int32) – The destination population IDs.
time (numpy.ndarray, dtype=np.int64) – The time of each migration.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

set_columns
(left=None, right=None, node=None, source=None, dest=None, time=None)¶ Sets the values for each column in this
MigrationTable
using the values in the specified arrays. Overwrites any data currently stored in the table.All six parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of migrations the table will contain).
 Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
node (numpy.ndarray, dtype=np.int32) – The node IDs.
source (numpy.ndarray, dtype=np.int32) – The source population IDs.
dest (numpy.ndarray, dtype=np.int32) – The destination population IDs.
time (numpy.ndarray, dtype=np.int64) – The time of each migration.

class
tskit.
SiteTable
¶ A table defining the sites in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a site table to be a part of a valid tree sequence.
 Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
position (numpy.ndarray, dtype=np.float64) – The array of site position coordinates.
ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened array of ancestral state strings. See Text columns for more details.
ancestral_state_offset (numpy.ndarray, dtype=np.uint32) – The offsets of rows in the ancestral_state array. See Text columns for more details.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.

add_row
(position, ancestral_state, metadata=None)¶ Adds a new row to this
SiteTable
and returns the ID of the corresponding site. Parameters
 Returns
The ID of the newly added site.
 Return type

append_columns
(position, ancestral_state, ancestral_state_offset, metadata=None, metadata_offset=None)¶ Appends the specified arrays to the end of the columns of this
SiteTable
. This allows many new rows to be added at once.The
position
,ancestral_state
andancestral_state_offset
parameters are mandatory, and must be 1D numpy arrays. The length of theposition
array determines the number of additional rows to add the table. Theancestral_state
andancestral_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information). Parameters
position (numpy.ndarray, dtype=np.float64) – The position of each site in genome coordinates.
ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened ancestral_state array. Required.
ancestral_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
ancestral_state
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

packset_ancestral_state
(ancestral_states)¶ Packs the specified list of ancestral_state values and updates the
ancestral_state
andancestral_state_offset
columns. The length of the ancestral_states array must be equal to the number of rows in the table.

packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table. Parameters
metadatas (list) – A list of metadata bytes values.

set_columns
(position=None, ancestral_state=None, ancestral_state_offset=None, metadata=None, metadata_offset=None)¶ Sets the values for each column in this
SiteTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
position
,ancestral_state
andancestral_state_offset
parameters are mandatory, and must be 1D numpy arrays. The length of theposition
array determines the number of rows in table. Theancestral_state
andancestral_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information). Parameters
position (numpy.ndarray, dtype=np.float64) – The position of each site in genome coordinates.
ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened ancestral_state array. Required.
ancestral_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
ancestral_state
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

class
tskit.
MutationTable
¶ A table defining the mutations in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a mutation table to be a part of a valid tree sequence.
 Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
site (numpy.ndarray, dtype=np.int32) – The array of site IDs.
node (numpy.ndarray, dtype=np.int32) – The array of node IDs.
derived_state (numpy.ndarray, dtype=np.int8) – The flattened array of derived state strings. See Text columns for more details.
derived_state_offset (numpy.ndarray, dtype=np.uint32) – The offsets of rows in the derived_state array. See Text columns for more details.
parent (numpy.ndarray, dtype=np.int32) – The array of parent mutation IDs.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.

add_row
(site, node, derived_state, parent=1, metadata=None)¶ Adds a new row to this
MutationTable
and returns the ID of the corresponding mutation. Parameters
site (int) – The ID of the site that this mutation occurs at.
node (int) – The ID of the first node inheriting this mutation.
derived_state (str) – The state of the site at this mutation’s node.
parent (int) – The ID of the parent mutation. If not specified, defaults to
NULL
.metadata (bytes) – The binaryencoded metadata for the new node. If not specified or None, a zerolength byte string is stored.
 Returns
The ID of the newly added mutation.
 Return type

append_columns
(site, node, derived_state, derived_state_offset, parent=None, metadata=None, metadata_offset=None)¶ Appends the specified arrays to the end of the columns of this
MutationTable
. This allows many new rows to be added at once.The
site
,node
,derived_state
andderived_state_offset
parameters are mandatory, and must be 1D numpy arrays. Thesite
andnode
(alsoparent
, if supplied) arrays must be of equal length, and determine the number of additional rows to add to the table. Thederived_state
andderived_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information). Parameters
site (numpy.ndarray, dtype=np.int32) – The ID of the site each mutation occurs at.
node (numpy.ndarray, dtype=np.int32) – The ID of the node each mutation is associated with.
derived_state (numpy.ndarray, dtype=np.int8) – The flattened derived_state array. Required.
derived_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
derived_state
array.parent (numpy.ndarray, dtype=np.int32) – The ID of the parent mutation for each mutation.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

packset_derived_state
(derived_states)¶ Packs the specified list of derived_state values and updates the
derived_state
andderived_state_offset
columns. The length of the derived_states array must be equal to the number of rows in the table.

packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table. Parameters
metadatas (list) – A list of metadata bytes values.

set_columns
(site=None, node=None, derived_state=None, derived_state_offset=None, parent=None, metadata=None, metadata_offset=None)¶ Sets the values for each column in this
MutationTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
site
,node
,derived_state
andderived_state_offset
parameters are mandatory, and must be 1D numpy arrays. Thesite
andnode
(alsoparent
, if supplied) arrays must be of equal length, and determine the number of rows in the table. Thederived_state
andderived_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information). Parameters
site (numpy.ndarray, dtype=np.int32) – The ID of the site each mutation occurs at.
node (numpy.ndarray, dtype=np.int32) – The ID of the node each mutation is associated with.
derived_state (numpy.ndarray, dtype=np.int8) – The flattened derived_state array. Required.
derived_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
derived_state
array.parent (numpy.ndarray, dtype=np.int32) – The ID of the parent mutation for each mutation.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.

class
tskit.
PopulationTable
¶ A table defining the populations referred to in a tree sequence. The PopulationTable stores metadata for populations that may be referred to in the NodeTable and MigrationTable”. Note that although nodes may be associated with populations, this association is stored in the
NodeTable
: only metadata on each population is stored in the population table. Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
 Variables
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.

add_row
(metadata=None)¶ Adds a new row to this
PopulationTable
and returns the ID of the corresponding population.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table. Parameters
metadatas (list) – A list of metadata bytes values.

set_columns
(metadata=None, metadata_offset=None)¶ Sets the values for each column in this
Table
using values provided in numpy arrays. Overwrites any data currently stored in the table.

class
tskit.
ProvenanceTable
¶ A table recording the provenance (i.e., history) of this table, so that the origin of the underlying data and sequence of subsequent operations can be traced. Each row contains a “record” string (recommended format: JSON) and a timestamp.
Todo
The format of the record field will be more precisely specified in the future.
 Variables
record (numpy.ndarray, dtype=np.int8) – The flattened array containing the record strings. Text columns for more details.
record_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the record column. See Text columns for more details.
timestamp (numpy.ndarray, dtype=np.int8) – The flattened array containing the timestamp strings. Text columns for more details.
timestamp_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the timestamp column. See Text columns for more details.

add_row
(record, timestamp=None)¶ Adds a new row to this ProvenanceTable consisting of the specified record and timestamp. If timestamp is not specified, it is automatically generated from the current time.

asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.

clear
()¶ Deletes all rows in this table.

copy
()¶ Returns a deep copy of this table

packset_record
(records)¶ Packs the specified list of record values and updates the
record
andrecord_offset
columns. The length of the records array must be equal to the number of rows in the table.

packset_timestamp
(timestamps)¶ Packs the specified list of timestamp values and updates the
timestamp
andtimestamp_offset
columns. The length of the timestamps array must be equal to the number of rows in the table.

set_columns
(timestamp=None, timestamp_offset=None, record=None, record_offset=None)¶ Sets the values for each column in this
Table
using values provided in numpy arrays. Overwrites any data currently stored in the table.
Table Collections¶
Each of the table classes defines a different aspect of the structure of
a tree sequence. It is convenient to be able to refer to a set of these
tables which together define a tree sequence. We
refer to this grouping of related tables as a TableCollection
.
The TableCollection
and TreeSequence
classes are
deeply related. A TreeSequence
instance is based on the information
encoded in a TableCollection
. Tree sequences are immutable, and
provide methods for obtaining trees from the sequence. A TableCollection
is mutable, and does not have any methods for obtaining trees.
The TableCollection
class essentially exists to allow the
dynamic creation of tree sequences.

class
tskit.
TableCollection
(sequence_length=0)¶ A collection of mutable tables defining a tree sequence. See the Data model section for definition on the various tables and how they together define a
TreeSequence
. Arbitrary data can be stored in a TableCollection, but there are certain requirements that must be satisfied for these tables to be interpreted as a tree sequence.To obtain an immutable
TreeSequence
instance corresponding to the current state of aTableCollection
, please use thetree_sequence()
method. Variables
individuals (IndividualTable) – The individual table.
nodes (NodeTable) – The node table.
edges (EdgeTable) – The edge table.
migrations (MigrationTable) – The migration table.
sites (SiteTable) – The site table.
mutations (MutationTable) – The mutation table.
populations (PopulationTable) – The population table.
provenances (ProvenanceTable) – The provenance table.
sequence_length (float) – The sequence length defining the coordinate space.
file_uuid (str) – The UUID for the file this TableCollection is derived from, or None if not derived from a file.

asdict
()¶ Returns a dictionary representation of this TableCollection.
Note: the semantics of this method changed at tskit 1.0.0. Previously a map of table names to the tables themselves was returned.

build_index
()¶ Builds an index on this TableCollection. Any existing indexes are automatically dropped.

compute_mutation_parents
()¶ Modifies the tables in place, computing the
parent
column of the mutation table. For this to work, the node and edge tables must be valid, and the site and mutation tables must be sorted (seeTableCollection.sort()
). This will produce an error if mutations are not sorted (i.e., if a mutation appears before its mutation parent) unless the two mutations occur on the same branch, in which case there is no way to detect the error.The
parent
of a given mutation is the ID of the next mutation encountered traversing the tree upwards from that mutation, orNULL
if there is no such mutation.Note
note: This method does not check that all mutations result in a change of state, as required; see Mutation requirements.

copy
()¶ Returns a deep copy of this TableCollection.
 Returns
A deep copy of this TableCollection.
 Return type

deduplicate_sites
()¶ Modifies the tables in place, removing entries in the site table with duplicate
position
(and keeping only the first entry for each site), and renumbering thesite
column of the mutation table appropriately. This requires the site table to be sorted by position.

delete_intervals
(intervals, simplify=True, record_provenance=True)¶ Delete all information from this set of tables which lies within the specified list of genomic intervals. This is identical to
TreeSequence.delete_intervals()
but acts in place to alter the data in thisTableCollection
. Parameters
intervals (array_like) – A list (start, end) pairs describing the genomic intervals to delete. Intervals must be nonoverlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, run simplify on the tables so that nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default:True
).

delete_sites
(site_ids, record_provenance=True)¶ Remove the specified sites entirely from the sites and mutations tables in this collection. This is identical to
TreeSequence.delete_sites()
but acts in place to alter the data in thisTableCollection
.

drop_index
()¶ Drops an indexes present on this table collection. If the table are not currently indexed this method has no effect.

has_index
()¶ Returns True if this TableCollection is indexed.

keep_intervals
(intervals, simplify=True, record_provenance=True)¶ Delete all information from this set of tables which lies outside the specified list of genomic intervals. This is identical to
TreeSequence.keep_intervals()
but acts in place to alter the data in thisTableCollection
. Parameters
intervals (array_like) – A list (start, end) pairs describing the genomic intervals to keep. Intervals must be nonoverlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, run simplify on the tables so that nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default:True
).

map_ancestors
(samples, ancestors)¶ Returns an
EdgeTable
instance describing a subset of the genealogical relationships between the nodes in``samples`` andancestors
.Each row
parent, child, left, right
in the output table indicates thatchild
has inherited the segment[left, right)
fromparent
more recently than from any other node in these lists.In particular, suppose
samples
is a list of nodes such thattime
is 0 for each node, andancestors
is a list of nodes such thattime
is greater than 0.0 for each node. Then each row of the output table will show an interval[left, right)
over which a node insamples
has inherited most recently from a node inancestors
, or an interval over which one of theseancestors
has inherited most recently from another node inancestors
.The following table shows which
parent>child
pairs will be shown in the output ofmap_ancestors
. A node is a relevant descendant on a given interval if it also appears somewhere in theparent
column of the outputted table.Type of relationship
Shown in output of
map_ancestors
ancestor>sample
Always
ancestor1>ancestor2
Only if
ancestor2
has a relevant descendantsample1>sample2
Always
sample>ancestor
Only if
ancestor
has a relevant descendantThe difference between
samples
andancestors
is that information about the ancestors of a node inancestors
will only be retained if it also has a relevant descendant, while information about the ancestors of a node insamples
will always be retained. The node IDs inparent
andchild
refer to the IDs in the node table of the inputted tree sequence.The supplied nodes must be nonempty lists of the node IDs in the tree sequence: in particular, they do not have to be samples of the tree sequence. The lists of
samples
andancestors
may overlap, although adding a node fromsamples
toancestors
will not change the output. So, settingsamples
andancestors
to the same list of nodes will find all genealogical relationships within this list.If none of the nodes in
ancestors
orsamples
are ancestral tosamples
anywhere in the tree sequence, an empty table will be returned.

simplify
(samples=None, filter_zero_mutation_sites=None, reduce_to_site_topology=False, filter_populations=True, filter_individuals=True, filter_sites=True, keep_unary=False)¶ Simplifies the tables in place to retain only the information necessary to reconstruct the tree sequence describing the given
samples
. This will change the ID of the nodes, so that the nodesamples[k]
will have IDk
in the result. The resulting NodeTable will have only the firstlen(samples)
individuals marked as samples. The mapping from node IDs in the current set of tables to their equivalent values in the simplified tables is also returned as a numpy array. If an arraya
is returned by this function andu
is the ID of a node in the input table, thena[u]
is the ID of this node in the output table. For any nodeu
that is not mapped into the output tables, this mapping will equal1
.Tables operated on by this function must: be sorted (see
TableCollection.sort()
), have children be born strictly after their parents, and the intervals on which any individual is a child must be disjoint. Other than this the tables need not satisfy remaining requirements to specify a valid tree sequence (but the resulting tables will).This is identical to
TreeSequence.simplify()
but acts in place to alter the data in thisTableCollection
. Please see theTreeSequence.simplify()
method for a description of the remaining parameters. Parameters
samples (list[int]) – A list of node IDs to retain as samples. If not specified or None, use all nodes marked with the IS_SAMPLE flag.
filter_zero_mutation_sites (bool) – Deprecated alias for
filter_sites
.reduce_to_site_topology (bool) – Whether to reduce the topology down to the trees that are present at sites. (default: False).
filter_populations (bool) – If True, remove any populations that are not referenced by nodes after simplification; new population IDs are allocated sequentially from zero. If False, the population table will not be altered in any way. (Default: True)
filter_individuals (bool) – If True, remove any individuals that are not referenced by nodes after simplification; new individual IDs are allocated sequentially from zero. If False, the individual table will not be altered in any way. (Default: True)
filter_sites (bool) – If True, remove any sites that are not referenced by mutations after simplification; new site IDs are allocated sequentially from zero. If False, the site table will not be altered in any way. (Default: True)
keep_unary (bool) – If True, any unary nodes (i.e. nodes with exactly one child) that exist on the path from samples to root will be preserved in the output. (Default: False)
 Returns
A numpy array mapping node IDs in the input tables to their corresponding node IDs in the output tables.
 Return type
numpy array (dtype=np.int32)

sort
(edge_start=0)¶ Sorts the tables in place. This ensures that all tree sequence ordering requirements listed in the Valid tree sequence requirements section are met, as long as each site has at most one mutation (see below).
If the
edge_start
parameter is provided, this specifies the index in the edge table where sorting should start. Only rows with index greater than or equal toedge_start
are sorted; rows before this index are not affected. This parameter is provided to allow for efficient sorting when the user knows that the edges up to a given index are already sorted.The individual, node, population and provenance tables are not affected by this method.
Edges are sorted as follows:
time of parent, then
parent node ID, then
child node ID, then
left endpoint.
Note that this sorting order exceeds the edge sorting requirements for a valid tree sequence. For a valid tree sequence, we require that all edges for a given parent ID are adjacent, but we do not require that they be listed in sorted order.
Sites are sorted by position, and sites with the same position retain their relative ordering.
Mutations are sorted by site ID, and mutations with the same site retain their relative ordering. This does not currently rearrange tables so that mutations occur after their mutation parents, which is a requirement for valid tree sequences.
 Parameters
edge_start (int) – The index in the edge table where sorting starts (default=0; must be <= len(edges)).

tree_sequence
()¶ Returns a
TreeSequence
instance with the structure defined by the tables in thisTableCollection
. If the table collection is not in canonical form (i.e., does not meet sorting requirements) or cannot be interpreted as a tree sequence an exception is raised. Thesort()
method may be used to ensure that input sorting requirements are met. Returns
A
TreeSequence
instance reflecting the structures defined in this set of tables. Return type
Table functions¶

tskit.
parse_nodes
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)¶ Parse the specified filelike object containing a whitespace delimited description of a node table and returns the corresponding
NodeTable
instance. See the node text format section for the details of the required format and the node table definition section for the required properties of the contents.See
load_text()
for a detailed explanation of thestrict
parameter. Parameters
source (stream) – The filelike object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (string) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (NodeTable) – If specified write into this table. If not, create a new
NodeTable
instance.

tskit.
parse_edges
(source, strict=True, table=None)¶ Parse the specified filelike object containing a whitespace delimited description of a edge table and returns the corresponding
EdgeTable
instance. See the edge text format section for the details of the required format and the edge table definition section for the required properties of the contents.See
load_text()
for a detailed explanation of thestrict
parameter. Parameters
source (stream) – The filelike object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
table (EdgeTable) – If specified, write the edges into this table. If not, create a new
EdgeTable
instance and return.

tskit.
parse_sites
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)¶ Parse the specified filelike object containing a whitespace delimited description of a site table and returns the corresponding
SiteTable
instance. See the site text format section for the details of the required format and the site table definition section for the required properties of the contents.See
load_text()
for a detailed explanation of thestrict
parameter. Parameters
source (stream) – The filelike object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (string) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (SiteTable) – If specified write site into this table. If not, create a new
SiteTable
instance.

tskit.
parse_mutations
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)¶ Parse the specified filelike object containing a whitespace delimited description of a mutation table and returns the corresponding
MutationTable
instance. See the mutation text format section for the details of the required format and the mutation table definition section for the required properties of the contents.See
load_text()
for a detailed explanation of thestrict
parameter. Parameters
source (stream) – The filelike object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (string) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (MutationTable) – If specified, write mutations into this table. If not, create a new
MutationTable
instance.

tskit.
pack_strings
(strings, encoding='utf8')¶ Packs the specified list of strings into a flattened numpy array of 8 bit integers and corresponding offsets using the specified text encoding. See Encoding ragged columns for details of this encoding of columns of variable length data.
 Parameters
 Returns
The tuple (packed, offset) of numpy arrays representing the flattened input data and offsets.
 Return type
numpy.array (dtype=np.int8), numpy.array (dtype=np.uint32)

tskit.
unpack_strings
(packed, offset, encoding='utf8')¶ Unpacks a list of strings from the specified numpy arrays of packed byte data and corresponding offsets using the specified text encoding. See Encoding ragged columns for details of this encoding of columns of variable length data.
 Parameters
 Returns
The list of strings unpacked from the parameter arrays.
 Return type

tskit.
pack_bytes
(data)¶ Packs the specified list of bytes into a flattened numpy array of 8 bit integers and corresponding offsets. See Encoding ragged columns for details of this encoding.

tskit.
unpack_bytes
(packed, offset)¶ Unpacks a list of bytes from the specified numpy arrays of packed byte data and corresponding offsets. See Encoding ragged columns for details of this encoding.
Linkage disequilibrium¶
Note
This API will soon be deprecated in favour of multisite extensions to the Statistics API.

class
tskit.
LdCalculator
(tree_sequence)¶ Class for calculating linkage disequilibrium coefficients between pairs of mutations in a
TreeSequence
. This class requires the numpy library.This class supports multithreaded access using the Python
threading
module. Separate instances ofLdCalculator
referencing the same tree sequence can operate in parallel in multiple threads.Note
This class does not currently support sites that have more than one mutation. Using it on such a tree sequence will raise a LibraryError with an “Unsupported operation” message.
 Parameters
tree_sequence (TreeSequence) – The tree sequence containing the mutations we are interested in.

r2
(a, b)¶ Returns the value of the \(r^2\) statistic between the pair of mutations at the specified indexes. This method is not an efficient method for computing large numbers of pairwise; please use either
r2_array()
orr2_matrix()
for this purpose.

r2_array
(a, direction=1, max_mutations=None, max_distance=None)¶ Returns the value of the \(r^2\) statistic between the focal mutation at index \(a\) and a set of other mutations. The method operates by starting at the focal mutation and iterating over adjacent mutations (in either the forward or backwards direction) until either a maximum number of other mutations have been considered (using the
max_mutations
parameter), a maximum distance in sequence coordinates has been reached (using themax_distance
parameter) or the start/end of the sequence has been reached. For every mutation \(b\) considered, we then insert the value of \(r^2\) between \(a\) and \(b\) at the corresponding index in an array, and return the entire array. If the returned array is \(x\) anddirection
istskit.FORWARD
then \(x[0]\) is the value of the statistic for \(a\) and \(a + 1\), \(x[1]\) the value for \(a\) and \(a + 2\), etc. Similarly, ifdirection
istskit.REVERSE
then \(x[0]\) is the value of the statistic for \(a\) and \(a  1\), \(x[1]\) the value for \(a\) and \(a  2\), etc. Parameters
a (int) – The index of the focal mutation.
direction (int) – The direction in which to travel when examining other mutations. Must be either
tskit.FORWARD
ortskit.REVERSE
. Defaults totskit.FORWARD
.max_mutations (int) – The maximum number of mutations to return \(r^2\) values for. Defaults to as many mutations as possible.
max_distance (float) – The maximum absolute distance between the focal mutation and those for which \(r^2\) values are returned.
 Returns
An array of double precision floating point values representing the \(r^2\) values for mutations in the specified direction.
 Return type
numpy.ndarray
 Warning
For efficiency reasons, the underlying memory used to store the returned array is shared between calls. Therefore, if you wish to store the results of a single call to
get_r2_array()
for later processing you must take a copy of the array!

r2_matrix
()¶ Returns the complete \(m \times m\) matrix of pairwise \(r^2\) values in a tree sequence with \(m\) mutations.
 Returns
An 2 dimensional square array of double precision floating point values representing the \(r^2\) values for all pairs of mutations.
 Return type
numpy.ndarray
Provenance¶
We provide some preliminary support for validating JSON documents against the provenance schema. Programmatic access to provenance information is planned for future versions.

tskit.
validate_provenance
(provenance)¶ Validates the specified dictlike object against the tskit provenance schema. If the input does not represent a valid instance of the schema an exception is raised.
 Parameters
provenance (dict) – The dictionary representing a JSON document to be validated against the schema.
 Raises
ProvenanceValidationError

tskit.
ProvenanceValidationError
¶