Missing data models
vcfsim offers two independent levers for controlling missing data:
Genotype-level missingness (
--percent_missing_genotypes) — a per-cell drop rate applied uniformly across the genotype matrix. Always required, regardless of the site-missingness model.Site-level missingness — controlled by either a uniform model (
--percent_missing_sites) or a Hidden Markov Model (--hmm_*flags). Exactly one of these must be specified.
The two layers compose: site-level missingness drops entire VCF rows; the genotype-level rate then applies to the cells of the rows that remain.
Uniform site missingness
The simplest model drops a fixed percentage of VCF rows uniformly at random:
vcfsim \
--chromosome 1 --replicates 1 --seed 1234 \
--sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
--percent_missing_sites 20 \
--percent_missing_genotypes 5 \
--sample_size 10 \
--output_file myvcf
Here roughly 20% of sites are dropped, and within the surviving sites roughly 5% of genotypes are marked missing. The realized percentages are close to but not exactly the requested values for small sample sizes.
HMM-based spatially clustered missingness
Real sequencing data rarely produces uniformly random missing sites —
poorly mappable regions and low-coverage stretches produce runs of
missing data instead. To simulate this pattern, leave
--percent_missing_sites unset and supply all four HMM parameters:
vcfsim \
--chromosome 1 --replicates 1 --seed 4000 \
--sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
--percent_missing_genotypes 0 \
--hmm_baseline 0.05 \
--hmm_multiplier 6 \
--hmm_p_low_to_high 0.002 \
--hmm_p_high_to_low 0.005 \
--sample_size 10 \
--output_file myvcf
How the HMM works
Each site is in one of two hidden states:
Low-missing ("good") — the per-site missingness probability is
--hmm_baseline.High-missing ("bad") — the per-site missingness probability is
--hmm_baseline × --hmm_multiplier.
The state evolves along the chromosome as a two-state Markov chain with
transition probabilities --hmm_p_low_to_high (low → high) and
--hmm_p_high_to_low (high → low). The expected fraction of sites in
the high-missing state is approximately
p_low_to_high / (p_low_to_high + p_high_to_low); combined with the
two per-state missingness probabilities, this determines the long-run
overall missing-site rate.
The expected mean length of a high-missing run is approximately
1 / p_high_to_low sites, so smaller --hmm_p_high_to_low values
produce longer contiguous bad regions.
Picking parameters
The example above produces clearly clustered missingness without overwhelming the dataset:
Baseline 5% missingness in good regions.
6× elevated rate (30%) in bad regions.
Bad regions are entered on average once per ~500 sites (
1 / 0.002).Bad regions persist on average ~200 sites (
1 / 0.005).
To increase clustering without changing the overall missing rate, lower
both transition probabilities by the same factor. To increase the overall
missing rate while keeping the same spatial structure, raise
--hmm_baseline and/or --hmm_multiplier.
Mixing the two models
If both --percent_missing_sites and HMM flags are passed, vcfsim
warns and uses the uniform model. To use the HMM, leave
--percent_missing_sites unset.