Basic usage
This page walks through the most common invocations of vcfsim. For the
full argument reference, see Arguments.
A first run
The minimal command that produces a VCF must include a seed, a
missing-data model (here --percent_missing_sites 0 and
--percent_missing_genotypes 0, for a clean reference dataset), and a
sample specification:
vcfsim \
    --chromosome 1 \
    --replicates 1 \
    --seed 1234 \
    --sequence_length 10000 \
    --ploidy 2 \
    --Ne 100000 \
    --mu 1e-6 \
    --percent_missing_sites 0 \
    --percent_missing_genotypes 0 \
    --output_file myvcf \
    --sample_size 10
This produces myvcf1234.vcf; the seed is appended to the --output_file
prefix automatically. If --output_file is omitted, the VCF is written to
stdout instead, which is convenient for piping into other tools.
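A quick sanity check on the result is to count the sample columns in the header. The sketch below assumes only the standard VCF layout, in which the #CHROM header line has nine fixed columns followed by one column per sample. The helper name count_vcf_samples is hypothetical and not part of vcfsim.

```shell
# Hypothetical helper (not part of vcfsim): print the number of sample
# columns in a VCF file. Assumes the standard VCF layout, where the
# tab-separated #CHROM header line has 9 fixed columns followed by one
# column per sample.
count_vcf_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# Example, after the run above:
#   count_vcf_samples myvcf1234.vcf
```

For the run above, with --sample_size 10, the count should come out as 10 if each sample occupies one column.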
Producing replicates
Asking for more than one replicate runs the simulator once per replicate,
incrementing the seed by one each time. With --seed 1234 --replicates 3
and --output_file myvcf, vcfsim writes myvcf1234.vcf, myvcf1235.vcf, and
myvcf1236.vcf. This is the recommended way to produce a set of
independent replicates with known seeds, because it keeps the mapping
from seed to file deterministic and inspectable.
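Because the file names follow directly from the prefix, seed, and replicate count, a script can list the expected outputs before or after a run. The following is a minimal sketch of that naming rule as described above. The helper expected_vcfs is hypothetical and not part of vcfsim.

```shell
# Hypothetical helper (not part of vcfsim): list the file names expected
# from a run, using the naming rule described above
# (<prefix><seed>.vcf, with the seed incremented by one per replicate).
expected_vcfs() {
    prefix=$1
    seed=$2
    replicates=$3
    i=0
    while [ "$i" -lt "$replicates" ]; do
        printf '%s%d.vcf\n' "$prefix" "$((seed + i))"
        i=$((i + 1))
    done
}

# Prints myvcf1234.vcf, myvcf1235.vcf and myvcf1236.vcf, one per line.
expected_vcfs myvcf 1234 3
```

Looping over this list and testing each name with [ -s "$f" ] is one way to confirm that every replicate was written.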
Custom sample names
By default, the n samples are named tsk_0, tsk_1, ..., tsk_(n-1). To
write explicit names into the VCF columns, pass --samples with the
names directly:
vcfsim \
    --chromosome 1 --replicates 1 --seed 1234 \
    --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
    --percent_missing_sites 0 --percent_missing_genotypes 0 \
    --samples A1 B1 C1 D1 \
    --output_file myvcf
Names may be space-separated (as above) or comma-separated
(--samples A1,B1,C1,D1). The sample size is set to the number of
names provided, so --sample_size and --samples are mutually
exclusive.
For larger sample sets, store the names in a file and use
--samples_file:
vcfsim \
    --chromosome 1 --replicates 1 --seed 1234 \
    --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
    --percent_missing_sites 0 --percent_missing_genotypes 0 \
    --samples_file names.txt \
    --output_file myvcf
where names.txt contains comma- or whitespace-separated names:
A1 B1 C1 D1 E1
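For large sample sets, the names file can be generated rather than typed. The sketch below writes space-separated names on a single line, matching the layout shown above. The helper make_sample_names and the prefix-plus-index naming scheme are illustrative assumptions, not something vcfsim requires.

```shell
# Hypothetical helper (not part of vcfsim): print N sample names of the
# form <prefix>1 ... <prefix>N on one line, separated by single spaces.
make_sample_names() {
    prefix=$1
    count=$2
    i=1
    names=""
    while [ "$i" -le "$count" ]; do
        # Add a separating space only when a name is already present.
        names="$names${names:+ }$prefix$i"
        i=$((i + 1))
    done
    printf '%s\n' "$names"
}

# Example: write 50 names (S1 ... S50) to the file passed to --samples_file.
#   make_sample_names S 50 > names.txt
```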
Writing to stdout
Omitting --output_file sends the VCF to stdout, so you can pipe
the simulator directly into bgzip, bcftools, or any other VCF
consumer:
vcfsim \
    --chromosome 1 --replicates 1 --seed 1234 \
    --sequence_length 10000 --ploidy 2 --Ne 100000 --mu 1e-6 \
    --percent_missing_sites 0 --percent_missing_genotypes 0 \
    --sample_size 10 \
    | bgzip -c > myvcf.vcf.gz
This is convenient for one-off runs and is the form to use when feeding
vcfsim output into a downstream pipeline without leaving intermediate
files on disk.
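The same pattern works with plain shell tools when no VCF-specific program is at hand. As a minimal sketch, the hypothetical helper below (not part of vcfsim) counts the variant records arriving on stdin, assuming only that header lines start with '#' as in standard VCF.

```shell
# Hypothetical helper (not part of vcfsim): count the variant records in a
# VCF read from stdin, i.e. every non-empty line that does not start
# with '#'. Prints 0 when the stream contains only header lines.
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Example: replace the bgzip stage of the command above with
#   | count_vcf_records
```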