vcfsim 1.0

What is vcfsim?

vcfsim is a command-line tool for generating simulated VCF files (Variant Call Format, the standard text format for encoding genetic variation data). It pairs a coalescent simulation backend (msprime) with lightweight postprocessing to produce biologically realistic VCFs with parameterized missing data. A full simulated dataset can be created from just a few command-line arguments.

In particular, vcfsim makes it easy to simulate all-sites VCFs, which contain both variant and invariant sites. All-sites VCFs are required for unbiased estimation of nucleotide diversity (π) and absolute divergence (dxy) (see pixy), and they are typically expensive to obtain from real data. vcfsim is designed to drop straight into pixy-style workflows for testing, benchmarking, and methods development.
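
To illustrate why invariant sites matter, here is a minimal sketch of a π calculation that divides total pairwise differences by total valid comparisons across all sites, next to a naive version that assumes every site is fully called. The function names and toy genotypes are hypothetical and are not part of vcfsim or pixy; the sketch only mirrors the general idea behind pixy-style estimators.

```python
from itertools import combinations


def site_diffs_and_comps(alleles):
    """Count pairwise differences and valid comparisons at one site.

    `alleles` holds one allele code per haploid sample; None marks a missing call.
    """
    called = [a for a in alleles if a is not None]
    diffs = sum(1 for a, b in combinations(called, 2) if a != b)
    comps = len(called) * (len(called) - 1) // 2
    return diffs, comps


def pi_all_sites(sites):
    """pi = total differences / total comparisons, summed over every site."""
    total_diffs = total_comps = 0
    for alleles in sites:
        d, c = site_diffs_and_comps(alleles)
        total_diffs += d
        total_comps += c
    return total_diffs / total_comps if total_comps else float("nan")


def pi_assume_complete(sites, n_samples):
    """Naive estimate: treat every site as called in all samples."""
    comps_per_site = n_samples * (n_samples - 1) // 2
    total_diffs = sum(site_diffs_and_comps(a)[0] for a in sites)
    return total_diffs / (len(sites) * comps_per_site)


# Toy data (hypothetical): 4 haploid samples, 5 sites.
sites = [
    [0, 0, 0, 0],        # invariant, fully called
    [0, 1, 0, 1],        # variant, fully called
    [0, 0, None, None],  # invariant, two missing calls
    [0, 0, 0, 0],        # invariant, fully called
    [1, 0, None, 0],     # variant, one missing call
]

pi_correct = pi_all_sites(sites)
pi_naive = pi_assume_complete(sites, n_samples=4)
print(f"all-sites pi: {pi_correct:.4f}")
print(f"naive pi:     {pi_naive:.4f}")
```

In this toy case the naive estimate is lower, because missing genotypes are silently counted as comparisons that showed no difference. An all-sites VCF records which sites were actually called, so the denominator can be counted rather than assumed.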

vcfsim also supports two missing-data models (uniform and HMM-based spatial clustering), arbitrary ploidy, custom sample names, two-population splits, and multi-chromosome batch runs from a parameter file.
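
The difference between the two missing-data models can be sketched as follows. This is an illustration only: vcfsim's actual parameterization is not described here, and the simple two-state Markov chain below stands in for the HMM. All function and parameter names (`uniform_mask`, `clustered_mask`, `p_enter`, `p_exit`) are hypothetical.

```python
import random


def uniform_mask(n_sites, p_missing, rng):
    """Each site is missing independently with probability p_missing."""
    return [rng.random() < p_missing for _ in range(n_sites)]


def clustered_mask(n_sites, p_enter, p_exit, rng):
    """Two-state Markov chain over sites: 'called' <-> 'missing'.

    p_enter: probability of switching called -> missing at each site.
    p_exit:  probability of switching missing -> called at each site.
    The long-run missing fraction is p_enter / (p_enter + p_exit).
    """
    mask = []
    missing = False  # assumption: the chain starts in the 'called' state
    for _ in range(n_sites):
        if missing:
            missing = rng.random() >= p_exit
        else:
            missing = rng.random() < p_enter
        mask.append(missing)
    return mask


def run_lengths(mask):
    """Lengths of consecutive runs of missing sites."""
    runs, current = [], 0
    for is_missing in mask:
        if is_missing:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


rng = random.Random(42)
n_sites = 20000

# Both settings target roughly 20% missing sites overall.
u_mask = uniform_mask(n_sites, p_missing=0.2, rng=rng)
c_mask = clustered_mask(n_sites, p_enter=0.025, p_exit=0.1, rng=rng)

u_runs = run_lengths(u_mask)
c_runs = run_lengths(c_mask)
print("uniform:   mean missing run =", sum(u_runs) / len(u_runs))
print("clustered: mean missing run =", sum(c_runs) / len(c_runs))
```

With a similar overall missing fraction, the clustered mask produces much longer stretches of missing sites than the uniform mask, which is the spatial pattern that a clustering model is meant to capture.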

How should I cite vcfsim?

If you use vcfsim in your research, please cite the repository and the underlying coalescent simulator: