************
About vcfsim
************
| **Maintainer:** Kieran Samuk (UC Riverside).
| **Original authors:** Paimon Goulart (UC Riverside) and Kieran Samuk (UC Riverside).
``vcfsim`` is a command-line tool that uses coalescent simulation to produce
realistic VCFs for use in methods development, benchmarking, and teaching. It
wraps msprime for the underlying simulation
and adds a thin postprocessing layer that:
* Emits a single, well-formed VCF (with an ``msprime``-style header).
* Inserts **invariant sites** so the result is a true all-sites VCF.
* Applies configurable **missing data** at both the site and genotype level
— either as uniform deterministic missingness, or as spatially clustered
missingness governed by a two-state Hidden Markov Model.
* Supports arbitrary **ploidy**, custom **sample names**, and a simple
two-population **demographic split**.
* Concatenates the output of multiple per-chromosome runs into one VCF
when given a parameter file.
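The invariant-site step in the list above can be pictured with a short sketch. This is not the ``vcfsim`` implementation: the function name ``fill_invariant_sites``, the placeholder ``REF`` allele ``A``, the ``.`` in the ``ALT`` column, and the phased ``0|0`` genotype layout are all assumptions made for illustration.

```python
from typing import Dict, List


def fill_invariant_sites(
    variant_rows: Dict[int, str],
    sequence_length: int,
    n_samples: int,
    ploidy: int = 2,
    chrom: str = "1",
) -> List[str]:
    """Return one VCF body row per 1-based position, in order.

    variant_rows maps a position to an already formatted VCF row.
    Positions with no variant get an invariant row (assumed layout).
    """
    # At an invariant site every sample is homozygous reference.
    hom_ref = "|".join(["0"] * ploidy)
    rows = []
    for pos in range(1, sequence_length + 1):
        if pos in variant_rows:
            rows.append(variant_rows[pos])
        else:
            # REF "A" and ALT "." are placeholders chosen for this sketch.
            fields = [chrom, str(pos), ".", "A", ".", ".", "PASS", ".", "GT"]
            rows.append("\t".join(fields + [hom_ref] * n_samples))
    return rows


# One variant at position 3 on a 5 bp sequence with two diploid samples.
variants = {3: "1\t3\t.\tA\tT\t.\tPASS\t.\tGT\t0|1\t1|1"}
rows = fill_invariant_sites(variants, sequence_length=5, n_samples=2)
for row in rows:
    print(row)
```

The point of the sketch is that every position contributes a row, so a downstream tool can count callable sites directly from the file instead of assuming that absent positions are invariant.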
Why simulated all-sites VCFs?
=============================
Population genetic summary statistics — π, d\ :sub:`xy`, F\ :sub:`ST`,
Watterson's θ, Tajima's *D* — are sensitive to missing data, and many tools
silently treat missing genotypes as homozygous reference. The result is
biased estimates that are hard to spot without ground truth.
``vcfsim`` exists to provide that ground truth. Because each output VCF
comes from a coalescent simulation with known parameters (Ne, μ, ploidy,
divergence time, missingness rate), the *true* value of each summary
statistic is known by construction. That makes ``vcfsim`` a natural
companion for:
* **Benchmarking statistical estimators** — verify that an estimator
recovers the correct value, and quantify the bias introduced by missing
data.
* **Building test fixtures** — small, deterministic VCFs for unit and
integration tests of downstream tools.
* **Teaching** — illustrate how missing data biases estimates of nucleotide
diversity and divergence.
* **Methods development** — generate VCFs with controlled, spatially
structured missingness to test mask-aware or block-aware estimators.
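To make the bias described in this section concrete, here is a minimal sketch of per-site nucleotide diversity at one biallelic site, computed once with missing calls excluded and once with missing calls treated as reference alleles. The helper names (``site_pi``, ``pi_excluding_missing``, ``pi_missing_as_ref``) and the toy allele calls are hypothetical and are not part of ``vcfsim`` or any other tool.

```python
from typing import List, Optional


def site_pi(alleles: List[int]) -> float:
    """Per-site nucleotide diversity for biallelic calls (0 = ref, 1 = alt)."""
    n = len(alleles)
    if n < 2:
        return 0.0
    k = sum(alleles)  # number of alt alleles
    # Differing pairs divided by all pairs: k * (n - k) / C(n, 2).
    return 2 * k * (n - k) / (n * (n - 1))


def pi_excluding_missing(calls: List[Optional[int]]) -> float:
    """Drop missing calls (None) before computing diversity."""
    return site_pi([a for a in calls if a is not None])


def pi_missing_as_ref(calls: List[Optional[int]]) -> float:
    """Mimic a tool that silently recodes missing calls as reference."""
    return site_pi([0 if a is None else a for a in calls])


# Six allele calls at one site; two of them are missing.
calls = [0, 1, 1, None, None, 0]
print(pi_excluding_missing(calls))  # 2/3 from the four observed alleles
print(pi_missing_as_ref(calls))     # 8/15 when missing is counted as ref
```

In this toy case, recoding the two missing calls as reference lowers the estimate from 2/3 to 8/15. The direction and size of the bias at a real site depend on which alleles are missing.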
Notable features
================
* **All-sites VCF output** by default — both variant and invariant sites
are written, so downstream tools that depend on callable-site denominators
(e.g. pixy) work out of the box.
* **Two missing-data models.** A simple uniform model (``--percent_missing_sites``)
and a two-state HMM that produces spatially clustered missingness
(``--hmm_baseline``, ``--hmm_multiplier``, ``--hmm_p_low_to_high``,
``--hmm_p_high_to_low``).
* **Independent site- and genotype-level missingness** — control the rate
at which whole rows go missing separately from the rate at which
individual genotypes go missing within a row.
* **Arbitrary ploidy** via ``--ploidy``.
* **Custom sample names** from the command line (``--samples A1 B1 C1``)
or from a file (``--samples_file names.txt``), with comma- or
whitespace-separated input. Numeric ``--sample_size`` is the default.
* **Two-population mode** (``--population_mode 2``) that simulates an
ancestral population *C* splitting into populations *A* and *B* at a
user-specified divergence time (``--div_time``).
* **Multi-chromosome batch mode** — point ``--chromosome_file`` at a
parameter file listing per-chromosome ploidy, sequence length, Ne, and
mutation rate, and ``vcfsim`` runs each row and concatenates the
results into a single VCF.
* **Reproducibility built in** — every run is keyed off a user-supplied
``--seed``. When more than one replicate is requested with
``--replicates``, the replicates are produced by incrementing the seed.
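The clustered missingness model listed above can be sketched as a two-state Markov chain that walks along the sequence and emits a missing/present flag at each site. The argument names mirror the ``--hmm_*`` flags, but their exact semantics here are assumptions: the low state is taken to drop sites with probability ``baseline``, the high state with probability ``baseline * multiplier`` (capped at 1), and the chain is assumed to start in the low state. The actual ``vcfsim`` implementation may differ.

```python
import random
from typing import List


def hmm_missing_mask(
    n_sites: int,
    baseline: float,
    multiplier: float,
    p_low_to_high: float,
    p_high_to_low: float,
    seed: int,
) -> List[bool]:
    """Return a per-site mask where True marks a missing site."""
    rng = random.Random(seed)
    # Assumed emission probabilities for the two hidden states.
    p_missing = {"low": baseline, "high": min(1.0, baseline * multiplier)}
    state = "low"  # assumed starting state
    mask = []
    for _ in range(n_sites):
        # Step the hidden state, then decide whether this site is missing.
        if state == "low":
            if rng.random() < p_low_to_high:
                state = "high"
        elif rng.random() < p_high_to_low:
            state = "low"
        mask.append(rng.random() < p_missing[state])
    return mask


mask = hmm_missing_mask(
    n_sites=60, baseline=0.05, multiplier=10.0,
    p_low_to_high=0.05, p_high_to_low=0.2, seed=1,
)
# Print the mask as a string: "x" = missing, "." = present.
print("".join("x" if m else "." for m in mask))
```

Because the hidden state persists for several sites at a time, missing sites tend to occur in runs rather than being scattered independently. Small transition probabilities give longer runs.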
Acknowledgements
================
``vcfsim`` is built on top of msprime and
tskit. The all-sites postprocessing and missing-data
models are designed to interoperate cleanly with the
pixy population genetics toolkit.