vcfsim 1.0
What is vcfsim?
vcfsim is a command-line tool for generating simulated VCFs (Variant Call Format files, which encode genetic variation data). It pairs a coalescent simulation backend (msprime) with lightweight postprocessing to produce biologically realistic VCFs with parameterized missing data. A full simulated dataset can be created from just a few command-line arguments.
In particular, vcfsim makes it easy to simulate all-sites VCFs, which contain both variant and invariant sites. All-sites VCFs are required for unbiased estimation of π and dxy (see pixy) and are typically expensive to obtain from real data, so vcfsim is designed to drop straight into pixy-style workflows for testing, benchmarking, and methods development.
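To make the idea of an all-sites VCF concrete, here is a minimal Python sketch that emits one data line per position, not only the variant ones. It is an illustration only and not vcfsim's own code: the function name `all_sites_lines`, the fixed reference base, and the diploid `0/0` genotypes at invariant sites are all assumptions made for this example.

```python
def all_sites_lines(chrom, length, variants, n_samples, ref_base="A"):
    """Yield one tab-separated VCF data line for every position 1..length.

    `variants` maps a 1-based position to (ref, alt, genotypes).
    Positions missing from the map are written as invariant sites:
    ALT is '.' and every sample is homozygous reference.
    (Hypothetical helper; diploid samples assumed.)
    """
    for pos in range(1, length + 1):
        if pos in variants:
            ref, alt, gts = variants[pos]
        else:
            ref, alt, gts = ref_base, ".", ["0/0"] * n_samples
        fields = [chrom, str(pos), ".", ref, alt, ".", "PASS", ".", "GT", *gts]
        yield "\t".join(fields)


# One variant at position 3 on a 5 bp chromosome with two samples.
variants = {3: ("A", "T", ["0/1", "1/1"])}
lines = list(all_sites_lines("chr1", 5, variants, n_samples=2))
for line in lines:
    print(line)
```

A variants-only VCF would contain just the line for position 3. The four invariant lines are what let a downstream tool tell "genotyped and monomorphic" apart from "not genotyped", which is the information needed for the denominator of π and dxy.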
vcfsim also supports two missing-data models (uniform and HMM-based spatial clustering), arbitrary ploidy, custom sample names, two-population splits, and multi-chromosome batch runs from a parameter file.
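The difference between the two missing-data models can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not vcfsim's implementation: the function names and the parameters `p_missing`, `p_enter`, and `p_exit` are hypothetical, and the clustered model is reduced to a plain two-state Markov chain standing in for the HMM-based approach.

```python
import random


def uniform_mask(n_sites, p_missing, rng):
    """Each site is missing independently with probability p_missing."""
    return [rng.random() < p_missing for _ in range(n_sites)]


def clustered_mask(n_sites, p_enter, p_exit, rng):
    """Two-state chain along the sequence: 'present' <-> 'missing'.

    A small p_exit keeps the chain in the missing state for long runs,
    so missing sites clump together instead of being scattered.
    """
    mask = []
    missing = False
    for _ in range(n_sites):
        if missing:
            if rng.random() < p_exit:
                missing = False
        elif rng.random() < p_enter:
            missing = True
        mask.append(missing)
    return mask


def apply_mask(genotypes, mask, ploidy=2):
    """Replace masked genotypes with the missing call for the given ploidy."""
    missing_gt = "/".join(["."] * ploidy)
    return [missing_gt if m else gt for gt, m in zip(genotypes, mask)]


rng = random.Random(42)  # fixed seed so runs are repeatable
scattered = uniform_mask(60, 0.2, rng)
clumped = clustered_mask(60, 0.05, 0.2, rng)
print("".join("x" if m else "." for m in scattered))
print("".join("x" if m else "." for m in clumped))
```

Printing both masks side by side shows the practical distinction: the uniform model spreads missing calls evenly, while the clustered model tends to produce contiguous missing blocks of the kind seen in low-coverage or hard-to-map regions.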
Documentation
How should I cite vcfsim?
If you use vcfsim in your research, please cite the repository and the underlying coalescent simulator:
Baumdicker, F. et al. (2022). Efficient ancestry and mutation simulation with msprime 1.0. Genetics, 220(3), iyab229. https://doi.org/10.1093/genetics/iyab229