Multi-chromosome runs

For genome-scale simulations — or any time you want to vary parameters across chromosomes — pass a parameter file to --chromosome_file instead of specifying the per-chromosome parameters on the command line. vcfsim runs each row of the file as an independent simulation and concatenates the results into a single VCF.

The parameter file format

The file is whitespace-separated, headerless, and has exactly five columns:

chromosome   ploidy   sequence_length   Ne   mu

For example:

1   2   10000   100000   1e-6
2   2   15000   100000   1e-6
X   1    8000    50000   1e-6
  • chromosome — name written into the #CHROM column for the corresponding rows.

  • ploidy — must be a positive integer.

  • sequence_length, Ne, mu — may be plain numbers or simple Python numeric expressions (1e4, 1.7e6, 5.5e-9, 2*10000).

Each row is run as a separate msprime simulation; the per-chromosome ploidy, sequence length, Ne, and mutation rate can therefore vary freely. This is particularly useful for simulating sex chromosomes alongside autosomes — give the X-chromosome row ploidy = 1 and the autosome rows ploidy = 2.

Running with a parameter file

When using --chromosome_file, omit the per-chromosome flags (--chromosome, --replicates, --sequence_length, --ploidy, --Ne, --mu) — they come from the file instead. Everything else (--seed, sample specification, missingness model, --output_file) works as usual:

vcfsim \
    --seed 1234 \
    --percent_missing_sites 0 \
    --percent_missing_genotypes 0 \
    --output_file myvcf \
    --sample_size 10 \
    --chromosome_file input.txt

This writes a single concatenated VCF (myvcf.vcf) containing all chromosomes from input.txt. The header is written once at the top and the per-chromosome bodies are appended in order.

Combining with custom sample names

The sample-specification flags (--sample_size, --samples, --samples_file) work in batch mode exactly as they do for single-chromosome runs. To label the same set of samples across all chromosomes:

vcfsim \
    --seed 1234 \
    --percent_missing_sites 0 \
    --percent_missing_genotypes 0 \
    --output_file myvcf \
    --samples_file names.txt \
    --chromosome_file input.txt

The same sample names are used in every per-chromosome simulation, so the final concatenated VCF has a single, consistent column ordering.