Development

If you would like to add some features to tskit, this documentation should help you get set up and contributing. Please help us to improve the documentation by either opening an issue or pull request if you see any problems.

The tskit-dev team strives to create a welcoming and open environment for contributors; please see our code of conduct for details. We wish our code and documentation to be inclusive and in particular to be gender and racially neutral.

Project structure

Tskit is a multi-language project, which is reflected in the directory structure:

  • The python directory contains the Python library and command line interface, which is what most contributors are likely to be interested in. Please see the Python library section for details. The low-level Python C Interface is also defined here.

  • The c directory contains the high-performance C library code. Please see the C Library section for details on how to contribute.

  • The docs directory contains the source for this documentation, which covers both the Python and C APIs. Please see the Documentation section for details.

The remaining files in the root directory of the project configure the providers that run the Continuous Integration tests, or serve other administrative purposes.

Please see the Best Practices for Development section for an overview of how to contribute a new feature to tskit.

Getting started

Requirements

To develop the Python code you will need a working C compiler and a development installation of Python (>= 3.6). On Debian/Ubuntu we can install these with:

$ sudo apt install python3-dev build-essential doxygen

Python packages required for development are listed in python/requirements/development.txt. These can be installed using pip:

$ python3 -m pip install -r python/requirements/development.txt

You may wish to isolate your development environment using a virtualenv.

A few extra dependencies are required if you wish to work on the C library.

For OSX and Windows users we recommend using conda, and isolating development in a dedicated environment as follows:

$ conda create -q -n tskit-dev
$ source activate tskit-dev
$ conda install -c conda-forge --yes --file=python/requirements/conda-minimal.txt
$ conda install -c conda-forge doxygen
$ conda install -c bioconda --yes pysam
$ pip install -r python/requirements/development.txt

On macOS, conda builds are generally done using clang packages that are kept up to date:

$ conda install clang_osx-64  clangxx_osx-64

In order to make sure that these compilers work correctly (e.g., so that they can find other dependencies installed via conda), you need to compile tskit with this command on versions of macOS older than “Mojave”:

$ CONDA_BUILD_SYSROOT=/ python3 setup.py build_ext -i

On more recent macOS releases, you may omit the CONDA_BUILD_SYSROOT prefix.

Note

The use of the C toolchain on macOS is a moving target. The above advice was written on 23 January, 2020 and was validated by a few tskit contributors. Caveat emptor, etc.

Environment

To get a local git development environment, please follow these steps:

  • Make a fork of the tskit repo on GitHub

  • Clone your fork into a local directory, making sure that the submodules are correctly initialised:

    $ git clone git@github.com:YOUR_GITHUB_USERNAME/tskit.git --recurse-submodules
    

    For an already checked out repo, the submodules can be initialised using:

    $ git submodule update --init --recursive
    
  • Install the Pre-commit checks:

    $ pre-commit install
    

See the Git workflow section for detailed information on the recommended way to use git and GitHub.

Workflow

Git workflow

  1. Make your own fork of the tskit repository on GitHub, and clone a local copy as detailed in Environment.

  2. Make sure that your local repository has been configured with an upstream remote:

    $ git remote add upstream git@github.com:tskit-dev/tskit.git
    
  3. Create a “topic branch” to work on. One reliable way to do it is to follow this recipe:

    $ git fetch upstream
    $ git checkout upstream/master
    $ git checkout -b topic_branch_name
    
  4. As you work on your topic branch you can add commits to it. Once you’re ready to share this, you can then open a pull request. Your PR will be reviewed by some of the maintainers, who may ask you to make changes. If you’d like to open the PR but feel that the code isn’t ready for review yet, please use the “Draft” option on GitHub.

  5. If your topic branch has been around for a long time and has gotten out of date with the main repository, we might ask you to rebase or to “squash” your commits into one.

Please follow this guide for step-by-step instructions on rebasing and squashing commits.

Pre-commit checks

On each commit a pre-commit hook will run that checks for violations of code style (see the Code style section for details) and other common problems. Where possible, these hooks will try to fix any problems that they find (including reformatting your code to conform to the required style). In this case, the commit will not complete and will report that “files were modified by this hook”. To include the changes that the hooks made, git add any files that were modified and run git commit again (or use git commit -a to commit all changed files).

If you would like to run the checks without committing, use pre-commit run. To bypass the checks (to save or get feedback on work-in-progress), use git commit --no-verify.

Documentation

The documentation for tskit is written using Sphinx and contained in the docs directory. It is written in the reStructuredText format and deployed automatically to readthedocs. API documentation for both Python and C is generated automatically from source. For the C code, a combination of Doxygen and breathe is used to generate API documentation.

Please help us to improve the documentation!

Small edits

If you see a typo or some other small problem that you’d like to fix, this is most easily done through the GitHub UI.

If the typo is in a large section of text (like this page), go to the top of the page and click on the “Edit on GitHub” link at the top right. This will bring you to the page on GitHub for the RST source file in question. Then, click on the pencil icon on the right hand side. This will open a web editor allowing you to quickly fix the typo and submit a pull request with the changes. Fix the typo, add a commit message like “Fixed typo” and click on the green “Propose file change” button. Then follow the dialogues until you’ve created a new pull request with your changes, so that we can incorporate them.

If the change you’d like to make is in the API documentation for a particular function, then you’ll need to find where this function is defined first. The simplest way to do this is to click the green “[source]” link next to the function. This will show you an HTML-rendered version of the function, and the rest of the file that it is in. You can then navigate to this file on GitHub, and edit it using the same approach as above.

Significant edits

When making changes more substantial than typo fixes it’s best to check out a local copy. Follow the steps in the Git workflow to get a fork of tskit, a local clone and newly checked out feature branch. Then follow the steps in the Getting started section to get a working development environment.

Once you are ready to make edits to the documentation, cd into the docs directory and run make. This should build the HTML documentation in docs/_build/html/, which you can then view in your browser. As you make changes, run make regularly and view the final result to see if it matches your expectations.

Once you are happy with the changes, commit your updates and open a pull request on GitHub.

Tips and resources

  • The reStructuredText primer is a useful general resource on rst.

  • See also the sphinx and rst cheatsheet.

  • The Sphinx Python and C domains have extensive options for marking up code.

  • Make extensive use of cross referencing. When linking to sections in the documentation, use the :ref:`sec_some_section_label` form rather than matching on the section title (which is brittle). Use :meth:`.Tree.some_method`, :func:`some_function` etc to refer to parts of the API.
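As a rough illustration of the cross-referencing advice above, the sketch below shows a docstring that uses the :ref: and :meth: roles. The function branch_length, its parameters, and the referenced label and method are placeholders invented for this example; they are not part of the tskit API.

```python
def branch_length(times, parents, node):
    """
    Returns the length of the branch above the specified node.

    This is a hypothetical function used only to illustrate markup.
    See :ref:`sec_some_section_label` for background, and
    :meth:`.Tree.some_method` for a related method.

    :param list times: The time of each node, indexed by node ID.
    :param list parents: The parent of each node, indexed by node ID.
    :param int node: The node of interest (must not be a root).
    :return: The time of the parent minus the time of the node.
    :rtype: float
    """
    return times[parents[node]] - times[node]
```

Referring to a section by its label rather than its title means the link should keep working if the title is later reworded.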

Python library

The Python library is defined in the python directory. We assume throughout this section that you have cd’d into this directory. We also assume that the tskit package is built and run locally within this directory. That is, tskit is not installed into the Python installation using pip install -e or setuptools development mode. Please see the Troubleshooting section for help if you encounter problems with compiling or running the tests.

Getting started

After you have installed the basic Requirements and created a development environment, you will need to compile the low-level Python C Interface module. This is most easily done using make:

$ make

If this has completed successfully you should see a file _tskit.cpython-XXXXXX.so in the current directory (the suffix depends on your platform and Python version; with Python 3.6 on Linux it’s _tskit.cpython-36m-x86_64-linux-gnu.so).

To make sure that your development environment is working, run some tests.

Layout

Code for the tskit module is in the tskit directory. The code is split into a number of modules that are roughly split by function; for example, code for visualisation is kept in tskit/drawing.py.

Test code is contained in the tests directory. Tests are also roughly split by function, so that tests for the drawing module are in the tests/test_drawing.py file. This is not a one-to-one mapping, though.

The requirements directory contains descriptions of the requirements needed for development and for the various providers that run the Continuous Integration tests.

Code style

Python code in tskit is formatted using Black. Any code submitted as a pull request will be checked to see if it conforms to this format as part of the Continuous Integration tests. Black is very strict, which seems unhelpful and nitpicky at first but is actually very useful. This is because it can also automatically format code for you, removing tedious and subjective choices (and even more tedious and subjective discussions!) about exactly how a particular code block should be aligned and indented.

In addition to Black autoformatting, code is checked for common problems using flake8.

Black autoformatting and flake8 checks are performed as part of the pre-commit checks, which ensures that your code is always formatted correctly.

Vim users may find the black and vim-flake8 plugins useful for automatically formatting code and lint checking within vim. There is good support for Black in a number of other editors.

Tests

The tests are defined in the tests directory, and run using nose. If you want to run the tests in a particular module (say, test_tables.py), use:

$ python3 -m nose -vs tests/test_tables.py

To run all the tests in a particular class in this module (say, TestNodeTable) use:

$ python3 -m nose -vs tests/test_tables.py:TestNodeTable

To run a specific test case in this class (say, test_copy) use:

$ python3 -m nose -vs tests/test_tables.py:TestNodeTable.test_copy

When developing your own tests, it is much quicker to run the specific tests that you are developing rather than rerunning large sections of the test suite each time.

To run all of the tests, we can use:

$ python3 -m nose -vs

As tskit’s test suite is large, it is helpful to run the tests in parallel, e.g.:

$ python3 -m nose -vs --processes=8 --process-timeout=5000

All new code must have high test coverage, which will be checked as part of the Continuous Integration tests by CodeCov. All tests must pass for a PR to be accepted.
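A selector such as tests/test_tables.py:TestNodeTable.test_copy names a module, a class in that module, and a method on that class. The sketch below shows this shape with a stand-in class. The test body (which copies a plain list) and the helper run_single_test are assumptions made for illustration, not the contents of the real test module.

```python
import unittest


class TestNodeTable(unittest.TestCase):
    """Stand-in for a test class; the real one lives in tests/test_tables.py."""

    def test_copy(self):
        # A plain list stands in for a table here (an assumption).
        rows = [(0, 1.0), (1, 2.0)]
        copy = list(rows)
        self.assertEqual(rows, copy)
        self.assertIsNot(rows, copy)


def run_single_test(test_class, method_name):
    """Runs one test method and returns True if it passed."""
    suite = unittest.TestSuite([test_class(method_name)])
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling run_single_test(TestNodeTable, "test_copy") runs just that one method, which is the same idea as passing the full selector on the command line.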

Packaging

The tskit Python module follows the current best practices advocated by the Python Packaging Authority. The primary means of distribution is through PyPI, which provides the canonical source for each release.

A package for conda is also available on conda-forge.

Interfacing with low-level module

Much of the high-level Python code only exists to provide a simpler interface to the low-level _tskit module. As such, many objects (e.g. Tree) are really just a shallow layer on top of the corresponding low-level object. The usual convention here is to keep a reference to the low-level object via a private instance variable such as self._ll_tree.
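The convention described above can be sketched as follows. LowLevelTree is a hypothetical stand-in for a class exported by the low-level module, and the method names get_parent and parent are chosen for illustration only; the real classes expose a much larger interface.

```python
class LowLevelTree:
    """Hypothetical stand-in for a class exported by the low-level module."""

    def __init__(self, parents):
        self._parents = list(parents)

    def get_parent(self, node):
        return self._parents[node]


class Tree:
    """High-level class: a thin layer over the low-level object."""

    def __init__(self, ll_tree):
        # Keep a private reference to the low-level object.
        self._ll_tree = ll_tree

    def parent(self, u):
        # Delegate the real work to the low-level object.
        return self._ll_tree.get_parent(u)
```

For example, Tree(LowLevelTree([2, 2, -1])).parent(0) returns 2. Keeping the high-level method this thin means that documentation and argument handling live in Python while the computation stays in the low-level layer.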

Command line interface

The command line interface for tskit is defined in the tskit/cli.py file. The CLI has a single entry point (tskit_main), which is invoked to run the program. This entry point is registered with setuptools using the console_scripts argument in setup.py, which allows it to be deployed as a first-class executable program in a cross-platform manner.

The CLI can also be run using python3 -m tskit. This is the recommended approach for running the CLI during development.
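To illustrate how an entry point function of this kind is typically structured, here is a minimal sketch using argparse. The names example_main and get_parser, the program name, and its arguments are all hypothetical, and the real code in tskit/cli.py may be organised differently.

```python
import argparse

# In setup.py, a function like example_main would be registered with an
# entry of the form "example=example_pkg.cli:example_main" in the
# console_scripts list (package and command names are hypothetical).


def get_parser():
    parser = argparse.ArgumentParser(prog="example", description="Example CLI.")
    parser.add_argument("path", help="Path to an input file (hypothetical).")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


def example_main(arg_list=None):
    # With arg_list=None, argparse reads sys.argv[1:], which is what happens
    # when the function is invoked through a console_scripts entry point.
    args = get_parser().parse_args(arg_list)
    message = f"path={args.path} verbose={args.verbose}"
    print(message)
    return message
```

Accepting an optional argument list makes the entry point easy to call from tests, for example example_main(["trees.file", "-v"]), without touching sys.argv.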

Installing development versions

We strongly recommend that you do not install development versions of tskit and instead use versions released to PyPI and conda-forge. However, if you really need to be on the bleeding edge, you can use the following command to install:

python3 -m pip install git+https://github.com/tskit-dev/tskit.git#subdirectory=python

(Because the Python package is not defined in the project root directory, using pip to install directly from GitHub requires you to specify subdirectory=python.)

Troubleshooting

  • If make is giving you strange errors, or if tests are failing for strange reasons, try running make clean in the project root and then rebuilding.

  • If cryptic problems still persist, it may be that your git submodules are out of date. Try running git submodule update --init --recursive.

  • Beware of multiple versions of the Python library installed by different programs (e.g., pip versus installing locally from source)! In Python, tskit.__file__ will tell you the location of the package that is being used.
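The last check above can be wrapped in a small helper. This is a sketch only: module_location is a name invented here, and the standard library json package is used in the usage line so that the snippet runs even where tskit is not importable.

```python
import importlib


def module_location(name):
    """Imports the named module and returns the file it was loaded from."""
    module = importlib.import_module(name)
    return module.__file__


# Substitute "tskit" for "json" to check which copy of tskit is in use.
print(module_location("json"))
```

If the printed path points into site-packages rather than your local python directory, an installed copy is probably shadowing your development build.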

C Library

The Python module uses the high-performance tskit C API behind the scenes. All C code and associated development infrastructure is held in the c directory.

Requirements

We use the meson build system in conjunction with ninja-build to compile the C code. Unit tests use the CUnit library and we use clang-format to automatically format code. On Debian/Ubuntu, these can be installed using:

$ sudo apt install libcunit1-dev ninja-build meson clang-format

(A more recent version of meson can alternatively be installed using pip, if you wish.)

Conda users can install the basic requirements as follows:

$ conda install -c conda-forge ninja meson cunit

Unfortunately clang-format is not available on conda, but it is not essential.

Code style

C code is formatted using clang-format with a custom configuration. To ensure that your code is correctly formatted, you can run

clang-format -i c/tskit/* c/tests/*.c c/tests/*.h

before submitting a pull request.

Vim users may find the vim-clang-format plugin useful for automatically formatting code.

Building

We use meson and ninja-build to compile the C code. Meson keeps all compiled binaries in a build directory (this has many advantages such as allowing multiple builds with different options to coexist). The build configuration is defined in meson.build. To set up the initial build directory, run

$ cd c
$ meson build

To compile the code run

$ ninja -C build

All the tests and other artefacts are in the build directory. Individual test suites can be run, via (e.g.) ./build/test_trees. To run all of the tests, run

$ ninja -C build test

For vim users, the mesonic plugin simplifies this process and allows code to be compiled seamlessly within the editor.

Unit Tests

The C library has an extensive suite of unit tests written using CUnit. These tests aim to establish that the low-level APIs work correctly over a variety of inputs and, in particular, that they do not leak memory or access memory illegally. To make sure of this, all tests are run under valgrind as part of the Continuous Integration tests.

Tests are defined in the tests/*.c files. These are roughly split by source file, so that functionality in the tskit/tables.c file is tested in tests/test_tables.c. To run all the tests in the test_tables suite, run (e.g.) ./build/test_tables. To just run a specific test on its own, provide this test name as a command line argument, e.g.:

$ ./build/test_tables test_node_table

While 100% test coverage is not feasible for C code, we aim to cover all code that can be reached. (Some classes of error such as malloc failures and IO errors are difficult to simulate in C.) Code coverage statistics are automatically tracked using CodeCov.

Coding conventions

The code is written using the C99 standard. All variable declarations should be done at the start of a function, and functions kept short and simple where at all possible.

No global or module level variables are used for production code.

Please see the API structure section for more information about how the API is structured.

Error handling

A critical element of producing reliable C programs is consistent error handling and checking of return values. All return values must be checked! In tskit, all functions (except the most trivial accessors) return an integer to indicate success or failure. Any negative value is an error, and must be handled accordingly. The following pattern is canonical:

    ret = tsk_tree_do_something(self, argument);
    if (ret != 0) {
        goto out;
    }
    // rest of function
out:
    return ret;

Here we test the return value of tsk_tree_do_something and if it is non-zero, abort the function and return this same value from the current function. This is a bit like throwing an exception in higher-level languages, but discipline is required to ensure that the error codes are propagated back to the original caller correctly.

Particular care must be taken in functions that allocate memory, because we must ensure that this memory is freed in all possible success and failure scenarios. The following pattern is used throughout for this purpose:

    double *x = NULL;

    x = malloc(n * sizeof(double));
    if (x == NULL) {
        ret = TSK_ERR_NO_MEMORY;
        goto out;
    }
    // rest of function
out:
    tsk_safe_free(x);
    return ret;

It is vital here that x is initialised to NULL so that we are guaranteed correct behaviour in all cases. For this reason, the convention is to declare all pointer variables on a single line and to initialise them to NULL as part of the declaration.

Error codes are defined in core.h, and these can be translated into a message using tsk_strerror(err).

Type conventions

  • tsk_id_t is an ID for any entity in a table.

  • tsk_size_t refers to the size of a table or entity that can be stored in a table.

  • size_t is an OS size, e.g. the result of sizeof.

  • Error indicators (the return type of most functions) are int.

  • uint32_t etc should be avoided (any that exist are a leftover from older code that didn’t use tsk_size_t etc.)

  • int64_t and uint64_t are sometimes useful when working with bitstrings (e.g. to implement a set).

Python C Interface

Overview

The Python C interface is defined in the python directory and written using the Python C API. The source code for this interface is in the _tskitmodule.c file. When compiled, this produces the _tskit module, which is imported by the high-level Python code. The low-level Python module is not intended to be used directly by users and may change arbitrarily over time.

The usual pattern in the low-level Python API is to define a Python class which corresponds to a given “class” in the C API. For example, we define a TreeSequence class, which is essentially a thin wrapper around the tsk_treeseq_t type from the C library.

The _tskitmodule.c file follows the standard conventions given in the Python documentation.

Compiling and debugging

The setup.py file describes the requirements for the low-level _tskit module and how it is built from source. The simplest way to compile the low-level module is to run make in the python directory:

$ make

If make is not available, you can run the same command manually:

$ python3 setup.py build_ext --inplace

It is sometimes useful to specify compiler flags when building the low level module. For example, to make a debug build you can use:

$ CFLAGS='-Wall -O0 -g' make

If you need to track down a segfault etc, running some code through gdb can be very useful. For example, to run a particular test case, we can do:

$ gdb python3
(gdb) run -m nose -vs tests/test_lowlevel.py


(gdb) run  -m nose -vs tests/test_tables.py:TestNodeTable.test_copy
Starting program: /usr/bin/python3 -m nose -vs tests/test_tables.py:TestNodeTable.test_copy
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff1e48700 (LWP 1503)]
[New Thread 0x7fffef647700 (LWP 1504)]
[New Thread 0x7fffeee46700 (LWP 1505)]
[Thread 0x7fffeee46700 (LWP 1505) exited]
[Thread 0x7fffef647700 (LWP 1504) exited]
[Thread 0x7ffff1e48700 (LWP 1503) exited]
test_copy (tests.test_tables.TestNodeTable) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
[Inferior 1 (process 1499) exited normally]
(gdb)

Tracing problems in C code is many times more difficult when the Python C API is involved because of the complexity of Python’s memory management. It is nearly always best to start by making sure that the tskit C API part of your addition is thoroughly tested with valgrind before resorting to the debugger.

Testing for memory leaks

The Python C API can be subtle, and it is easy to get the reference counting wrong. The stress_lowlevel.py script makes it easier to track down memory leaks when they do occur. The script runs the unit tests in a loop, and outputs memory usage statistics.
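The general idea of running code in a loop while watching memory usage can be sketched as below. This is not the stress_lowlevel.py script itself: the function name is invented here, and the sketch assumes a Unix platform, since it relies on the standard library resource module (which reports ru_maxrss in kilobytes on Linux and in bytes on macOS).

```python
import resource


def peak_memory_per_iteration(func, iterations):
    """
    Calls func repeatedly, recording the peak resident set size of the
    process after each call.
    """
    readings = []
    for _ in range(iterations):
        func()
        usage = resource.getrusage(resource.RUSAGE_SELF)
        readings.append(usage.ru_maxrss)
    return readings
```

A peak that keeps climbing as the number of iterations grows suggests a leak, whereas a peak that levels off after the first few iterations is what we would expect from correct reference counting.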

Continuous Integration tests

A number of different continuous integration providers are used, which run different combinations of tests on different platforms, as well as running various checks for code quality.

  • A GitHub action runs some code style and quality checks.

  • CircleCI runs all Python tests using the apt-get infrastructure for system requirements. It also runs C tests, compiled using gcc and clang, and checks for memory leaks using valgrind. Documentation is also built to check for any errors.

  • Travis CI runs Python tests on Linux and OSX using the Conda infrastructure for the system level requirements.

  • AppVeyor runs Python tests on Windows using conda.

  • CodeCov tracks test coverage in Python and C.

Best Practices for Development

The following is a rough guide of best practices for contributing a function to the tskit codebase.

Note that this guide covers the most complex case of adding a new function to both the C and Python APIs.

  1. Open an issue with your proposed functionality. If consensus is reached that your proposed addition should be added to the codebase, proceed!

  2. Create a new branch on your fork of tskit-dev (see Getting started above). Then open a pull request on GitHub, with an initial description of your planned addition.

  3. Write your function in Python: in python/tests/ find the test module that pertains to the functionality you wish to add. For instance, the kc_distance metric was added to test_topology.py. Add a Python version of your function here.

  4. Create a new class in this module to write unit tests for your function: in addition to making sure that your function is correct, make sure it fails on inappropriate inputs. This can often require judgement. For instance, Tree.kc_distance() fails on a tree with multiple roots, but allows users to input parameter values that are nonsensical, as long as they don’t break functionality. See the TestKCMetric class for an example.

  5. Write your function in C: check out the C API for guidance. There are also many examples in the c directory. Your function will probably go in trees.c.

  6. Write a few tests for your function in C: write your tests in tskit/c/tests/test_trees.c. The key here is code coverage; you don’t need to worry as much about covering every corner case, as we will proceed to link this function to the Python tests you wrote earlier.

  7. Create a low-level definition of your function using Python’s C API: this will go in _tskitmodule.c.

  8. Test your low-level implementation in tskit/python/tests/test_lowlevel.py: again, these tests don’t need to be as comprehensive as your first Python tests. Instead, they should focus on the interface, e.g., does the function behave correctly on malformed inputs?

  9. Link your C function to the Python API: write a function in tskit’s Python API, for example the kc_distance function lives in tskit/python/tskit/trees.py.

  10. Modify your Python tests to test the new C-linked function: if you followed the example of other tests, you might need to only add a single line of code here. In this case, the tests are well factored so that we can easily compare the results from both the Python and C versions.

  11. Write a docstring for your function in the Python API: for instance, the kc_distance docstring is in tskit/python/tskit/trees.py. Ensure that your docstring renders correctly by building the documentation (see Documentation).

  12. Update your Pull Request (rebasing if necessary!) and let the community check your work.
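Steps 3, 4 and 10 above rely on comparing a simple reference implementation against the library implementation within the same tests. The sketch below shows that structure with invented names: reference_num_leaves plays the role of the Python version written in the test module, and library_num_leaves stands in for the method that would eventually call into C. Neither is part of tskit.

```python
import unittest


def reference_num_leaves(children, root):
    """Simple, easy-to-check version: recursive traversal."""
    if len(children.get(root, [])) == 0:
        return 1
    return sum(reference_num_leaves(children, c) for c in children[root])


def library_num_leaves(children, root):
    """Stand-in for the C-backed method; uses an explicit stack instead."""
    count = 0
    stack = [root]
    while len(stack) > 0:
        node = stack.pop()
        node_children = children.get(node, [])
        if len(node_children) == 0:
            count += 1
        stack.extend(node_children)
    return count


class TestNumLeaves(unittest.TestCase):
    def verify(self, children, root):
        # One shared check compares both implementations on the same input.
        self.assertEqual(
            reference_num_leaves(children, root),
            library_num_leaves(children, root),
        )

    def test_binary_tree(self):
        self.verify({0: [1, 2], 1: [3, 4]}, 0)

    def test_single_node(self):
        self.verify({}, 0)
```

With the comparison factored into a single verify method, switching the tests over to a new implementation is a one-line change, which is the situation step 10 describes.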

Releasing a new version

Tskit maintains separate versioning for the C API and Python package; each has its own release process.

C API

To release the C API, the TSK_VERSION_* macros should be updated, and the changelog updated with the release date and version. The changelog should also be checked for completeness. Comparing git log --follow --oneline -- c with git log --follow --oneline -- c/CHANGELOG.rst may help here. After the commit including these changes has been merged, tag a release on GitHub using the pattern C_MAJOR.MINOR.PATCH, with:

git tag -a C_MAJOR.MINOR.PATCH -m "C API version C_MAJOR.MINOR.PATCH"
git push upstream --tags

Then prepare a release for the tag on GitHub, copying across the changelog. After release, start a section in the changelog for new developments.

Python

To make a release, first prepare a pull request that sets the correct version number in tskit/_version.py following PEP440 format. For a normal release this should be MAJOR.MINOR.PATCH; for a beta release use MAJOR.MINOR.PATCHbX, e.g. 1.0.0b1. Update the Python CHANGELOG.rst, ensuring that all significant changes since the last release have been listed. Comparing git log --follow --oneline -- python with git log --follow --oneline -- python/CHANGELOG.rst may help here. Once this PR is merged, push a tag to GitHub:

git tag -a MAJOR.MINOR.PATCH -m "Python version MAJOR.MINOR.PATCH"
git push upstream --tags

This will trigger a build of the distribution artifacts for Python on GitHub Actions and deploy them to the test PyPI. Check that the release looks good there, then create a release on GitHub based on the tag you pushed. Copy the changelog into the release. Publishing this release will cause the GitHub action to deploy to the production PyPI. After release, start a section in the changelog for new developments.
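Before tagging, it can be handy to check that a version string has one of the two shapes described above. The following is a minimal sketch: the helper name is invented here, and the pattern covers only MAJOR.MINOR.PATCH and MAJOR.MINOR.PATCHbX rather than the full PEP440 grammar.

```python
import re

# Matches MAJOR.MINOR.PATCH, optionally followed by a beta suffix bX.
VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)(b\d+)?")


def is_release_version(version):
    """Returns True if version is MAJOR.MINOR.PATCH or MAJOR.MINOR.PATCHbX."""
    return VERSION_PATTERN.fullmatch(version) is not None
```

For example, is_release_version("1.0.0b1") is True, while is_release_version("1.0") is False.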