*******************
Missing data models
*******************

``vcfsim`` offers two independent levers for controlling missing data:

* **Genotype-level missingness** (``--percent_missing_genotypes``) — a
  per-cell drop rate applied uniformly across the genotype matrix. Always
  required, regardless of the site-missingness model.
* **Site-level missingness** — controlled by either a uniform model
  (``--percent_missing_sites``) or a Hidden Markov Model
  (``--hmm_*`` flags). Specify exactly one of these; if both are passed, the
  uniform model takes precedence.

The two layers compose: site-level missingness drops entire VCF rows; the
genotype-level rate then applies to the cells of the rows that remain.

Uniform site missingness
========================

The simplest model drops a fixed percentage of VCF rows uniformly at
random:

.. code:: console

    vcfsim \
        --chromosome 1 --replicates 1 --seed 1234 \
        --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
        --percent_missing_sites 20 \
        --percent_missing_genotypes 5 \
        --sample_size 10 \
        --output_file myvcf

Here roughly 20% of sites are dropped, and within the surviving sites
roughly 5% of genotypes are marked missing. The realized percentages are
close to, but not exactly equal to, the requested values; the discrepancy
is most noticeable when the number of sites or samples is small.
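
The way the two layers compose, and the gap between requested and realized
rates, can be illustrated with a minimal sketch. This is not ``vcfsim``'s
implementation: the function ``apply_missingness``, the use of ``None`` as a
missing-genotype placeholder, and the assumption that each row and cell is
dropped independently are all hypothetical choices made for illustration.

.. code:: python

    import random

    def apply_missingness(genotypes, pct_missing_sites, pct_missing_genotypes, seed=None):
        """Drop whole rows first, then mask cells in the rows that remain."""
        rng = random.Random(seed)
        # Site-level layer (assumed independent per row): drop entire rows.
        kept = [row for row in genotypes if rng.random() >= pct_missing_sites / 100]
        # Genotype-level layer: mask individual cells of the surviving rows.
        # None is a placeholder for a missing genotype in this sketch.
        return [
            [None if rng.random() < pct_missing_genotypes / 100 else g for g in row]
            for row in kept
        ]

    n_sites, n_samples = 2000, 10
    genotypes = [["0|1"] * n_samples for _ in range(n_sites)]
    result = apply_missingness(genotypes, 20, 5, seed=1234)

    site_drop_rate = 1 - len(result) / n_sites
    cell_missing_rate = (
        sum(g is None for row in result for g in row) / (len(result) * n_samples)
    )
    print(f"sites dropped: {site_drop_rate:.1%}, genotypes missing: {cell_missing_rate:.1%}")

Rerunning the sketch with a much smaller ``n_sites`` makes the realized
rates drift further from the requested 20% and 5%.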

HMM-based spatially clustered missingness
=========================================

Real sequencing data rarely produces uniformly random missing sites —
poorly mappable regions and low-coverage stretches produce *runs* of
missing data instead. To simulate this pattern, leave
``--percent_missing_sites`` unset and supply all four HMM parameters:

.. code:: console

    vcfsim \
        --chromosome 1 --replicates 1 --seed 4000 \
        --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
        --percent_missing_genotypes 0 \
        --hmm_baseline 0.05 \
        --hmm_multiplier 6 \
        --hmm_p_low_to_high 0.002 \
        --hmm_p_high_to_low 0.005 \
        --sample_size 10 \
        --output_file myvcf

How the HMM works
-----------------

Each site is in one of two hidden states:

* **Low-missing** ("good") — the per-site missingness probability is
  ``--hmm_baseline``.
* **High-missing** ("bad") — the per-site missingness probability is
  ``--hmm_baseline × --hmm_multiplier``.

The state evolves along the chromosome as a two-state Markov chain with
transition probabilities ``--hmm_p_low_to_high`` (low → high) and
``--hmm_p_high_to_low`` (high → low). The expected fraction of sites in
the high-missing state is approximately
``p_low_to_high / (p_low_to_high + p_high_to_low)``; combined with the
two per-state missingness probabilities, this determines the long-run
overall missing-site rate.
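
As a worked example, write :math:`b` for ``--hmm_baseline``, :math:`m` for
``--hmm_multiplier``, and :math:`p_{LH}`, :math:`p_{HL}` for the two
transition probabilities (these symbols are introduced here only for
illustration). Assuming the chain has reached its stationary distribution,
the long-run rate :math:`r` follows from:

.. math::

    \pi_{\text{high}} = \frac{p_{LH}}{p_{LH} + p_{HL}},
    \qquad
    r = (1 - \pi_{\text{high}})\, b + \pi_{\text{high}}\, b\, m

For the command above, :math:`\pi_{\text{high}} = 0.002 / 0.007 \approx 0.286`
and :math:`r \approx 0.714 \times 0.05 + 0.286 \times 0.30 \approx 0.121`,
so about 12% of sites are expected to be missing in the long run. On a short
sequence the realized rate can differ noticeably from this value, because
only a handful of good and bad regions fit in the sequence.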

The mean length of a high-missing run is approximately
``1 / p_high_to_low`` sites, so smaller ``--hmm_p_high_to_low`` values
produce longer contiguous bad regions.
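
The mechanism described above can be sketched in a few lines of Python. This
is an illustration under stated assumptions, not ``vcfsim``'s actual code:
the function name ``simulate_missing_sites`` is hypothetical, the chain is
assumed to start in the low-missing state, and each site is assumed to draw
its missingness before the state transition is applied.

.. code:: python

    import random

    def simulate_missing_sites(n_sites, baseline, multiplier,
                               p_low_to_high, p_high_to_low, seed=None):
        """Return a list of booleans, True where a site is marked missing."""
        rng = random.Random(seed)
        p_missing = {"low": baseline, "high": baseline * multiplier}
        state = "low"  # assumption: start in the low-missing state
        missing = []
        for _ in range(n_sites):
            # Emit: the site is missing with the current state's probability.
            missing.append(rng.random() < p_missing[state])
            # Transition: possibly switch state before the next site.
            if state == "low":
                if rng.random() < p_low_to_high:
                    state = "high"
            elif rng.random() < p_high_to_low:
                state = "low"
        return missing

    # A long chain, so the realized rate settles near the long-run value.
    mask = simulate_missing_sites(200_000, 0.05, 6, 0.002, 0.005, seed=4000)
    realized_rate = sum(mask) / len(mask)
    print(f"realized missing-site rate: {realized_rate:.3f}")

With far fewer sites, the realized rate depends strongly on how many bad
regions happen to occur, so individual replicates can vary widely.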

Picking parameters
------------------

The example above produces clearly clustered missingness without
overwhelming the dataset:

* Baseline 5% missingness in good regions.
* 6× elevated rate (30%) in bad regions.
* Good regions last on average ~500 sites (``1 / 0.002``) before a bad
  region is entered.
* Bad regions persist on average ~200 sites (``1 / 0.005``).

To increase clustering without changing the overall missing rate, lower
both transition probabilities by the same factor. To increase the overall
missing rate while keeping the same spatial structure, raise
``--hmm_baseline`` and/or ``--hmm_multiplier``.
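
To compare candidate parameter sets before running a simulation, the
long-run quantities can be computed directly. The helper below is a
hypothetical convenience function (``hmm_summary`` is not part of
``vcfsim``) and assumes the chain is at its stationary distribution.

.. code:: python

    def hmm_summary(baseline, multiplier, p_low_to_high, p_high_to_low):
        """Long-run summary statistics for a two-state missingness HMM."""
        frac_high = p_low_to_high / (p_low_to_high + p_high_to_low)
        overall = (1 - frac_high) * baseline + frac_high * baseline * multiplier
        return {
            "frac_high": frac_high,                # share of sites in the bad state
            "overall_rate": overall,               # long-run missing-site rate
            "mean_low_run": 1 / p_low_to_high,     # mean good-region length (sites)
            "mean_high_run": 1 / p_high_to_low,    # mean bad-region length (sites)
        }

    original = hmm_summary(0.05, 6, 0.002, 0.005)
    # Both transition probabilities divided by 5: same overall rate, longer runs.
    clustered = hmm_summary(0.05, 6, 0.0004, 0.001)

    for name, stats in (("original", original), ("clustered", clustered)):
        print(name, {k: round(v, 4) for k, v in stats.items()})

Dividing both transition probabilities by the same factor leaves
``overall_rate`` unchanged while the mean run lengths grow by that factor,
which is the sense in which clustering increases.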

Mixing the two models
=====================

If both ``--percent_missing_sites`` and HMM flags are passed, ``vcfsim``
warns and uses the uniform model. To use the HMM, leave
``--percent_missing_sites`` unset.