Basic usage

This page walks through the most common invocations of vcfsim. For the full argument reference, see Arguments.

A first run

The minimal command needed to produce a VCF specifies a seed, a missing-data model (here --percent_missing_sites 0 and --percent_missing_genotypes 0, for a clean reference dataset), and the samples to simulate (here via --sample_size):

vcfsim \
    --chromosome 1 \
    --replicates 1 \
    --seed 1234 \
    --sequence_length 10000 \
    --ploidy 2 \
    --Ne 100000 \
    --mu 1e-6 \
    --percent_missing_sites 0 \
    --percent_missing_genotypes 0 \
    --output_file myvcf \
    --sample_size 10

This produces myvcf1234.vcf — the seed is appended to the prefix automatically. If --output_file is omitted, the VCF is written to stdout instead, which is convenient for piping into other tools.

Producing replicates

Asking for more than one replicate runs the simulator once per replicate, incrementing the seed by one each time. With --seed 1234 --replicates 3 and --output_file myvcf, vcfsim writes myvcf1234.vcf, myvcf1235.vcf, and myvcf1236.vcf. This is the recommended way to produce a set of independent replicates with known seeds, because it keeps the mapping from seed to file deterministic and inspectable.
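The seed-to-file mapping can be reproduced in the shell, for example to build the list of files a later step should expect. This is only a sketch: expected_vcf_names is a hypothetical helper, not part of vcfsim, and it assumes the prefix-plus-seed naming described on this page.

```shell
# Hypothetical helper (not part of vcfsim): print the file names a run
# should produce, assuming the <prefix><seed>.vcf naming described above.
# The body is wrapped in ( ... ) so its variables stay local to the call.
expected_vcf_names() (
    prefix=$1
    seed=$2
    replicates=$3
    i=0
    while [ "$i" -lt "$replicates" ]; do
        printf '%s%s.vcf\n' "$prefix" "$((seed + i))"
        i=$((i + 1))
    done
)

# Prints myvcf1234.vcf, myvcf1235.vcf, and myvcf1236.vcf, one per line.
expected_vcf_names myvcf 1234 3
```

Piping this list into a loop that tests each name with [ -e "$f" ] is one way to confirm that every replicate was written.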

Custom sample names

By default, samples are named tsk_0, tsk_1, and so on, up to tsk_(n-1) for n samples. To write explicit names into the VCF columns, pass --samples with the names directly:

vcfsim \
    --chromosome 1 --replicates 1 --seed 1234 \
    --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
    --percent_missing_sites 0 --percent_missing_genotypes 0 \
    --samples A1 B1 C1 D1 \
    --output_file myvcf

Names may be space-separated (as above) or comma-separated (--samples A1,B1,C1,D1). The sample size is set to the number of names provided, so --sample_size and --samples are mutually exclusive.
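To confirm that the names reached the VCF, the sample columns can be read back from the header. The sketch below relies only on the VCF format itself, where the #CHROM line carries nine fixed columns followed by one column per sample; vcf_sample_names is a hypothetical helper, not a vcfsim command.

```shell
# Hypothetical helper (not part of vcfsim): print one sample name per line
# from a VCF read on stdin. Columns 1-9 of the #CHROM header line are the
# fixed VCF fields; sample names start at column 10.
vcf_sample_names() {
    awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Usage, assuming the run above wrote myvcf1234.vcf:
#   vcf_sample_names < myvcf1234.vcf
```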

For larger sample sets, store the names in a file and use --samples_file:

vcfsim \
    --chromosome 1 --replicates 1 --seed 1234 \
    --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
    --percent_missing_sites 0 --percent_missing_genotypes 0 \
    --samples_file names.txt \
    --output_file myvcf

where names.txt contains comma- or whitespace-separated names:

A1 B1 C1 D1 E1
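For large sample sets the names file can be generated rather than typed by hand. The following sketch writes 50 space-separated names to names.txt; the S1 ... S50 naming pattern and the count are purely illustrative.

```shell
# Write 50 hypothetical sample names (S1 ... S50) to names.txt, separated
# by single spaces on one line. The S<n> pattern is illustrative only.
awk 'BEGIN {
    for (i = 1; i <= 50; i++)
        printf "S%d%s", i, (i < 50 ? " " : "\n")
}' > names.txt
```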

Writing to stdout

Omitting --output_file sends the VCF to stdout, so you can pipe the simulator directly into bgzip, bcftools, or any other VCF consumer:

vcfsim \
    --chromosome 1 --replicates 1 --seed 1234 \
    --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
    --percent_missing_sites 0 --percent_missing_genotypes 0 \
    --sample_size 10 \
    | bgzip -c > myvcf.vcf.gz

This is convenient for one-off runs and is the form to use when feeding vcfsim output into a downstream pipeline without leaving intermediate files on disk.
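A quick sanity check on a piped run is to count the variant records without writing a file. This is a minimal sketch using standard awk; count_vcf_records is a hypothetical helper, not part of vcfsim, and it treats every line that does not start with # as one record.

```shell
# Hypothetical helper (not part of vcfsim): count variant records in a VCF
# read on stdin by skipping header lines, which start with '#'.
count_vcf_records() {
    awk '!/^#/ { n++ } END { print n + 0 }'
}

# Usage: put the helper at the end of the pipeline in place of bgzip,
# keeping the vcfsim arguments from the example above unchanged:
#   vcfsim <arguments as above> | count_vcf_records
```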