Advanced Usage:

Parameters:

The following table outlines user-controllable parameters that can be adjusted at run time:

Parameter Name      Default Value         Description
------------------  --------------------  -----------
infile              N/A, required         Prefix to plink library or .raw file to be used as input
out                 ‘chrY_hgs’            Prefix to .out and .all files generated by SNAPPY
min_hap_score       0.6                   Minimum match score for a haplogroup to be considered for assignment
min_deep_score      0.8                   Minimum score to switch from highest scoring haplogroup to the deepest haplogroup for assignment
ref_files_dir       ‘ref_data’            Directory where SNAPPY’s reference files are saved
id2pos              ‘id_to_pos.txt’       File listing SNP ids and corresponding positions
pos2allele          ‘pos_to_allele.txt’   File listing SNP positions and corresponding alleles
hg2snp              ‘y_hg_and_snps.sort’  File listing markers and haplogroups
tree_strct          ‘tree_structure.txt’  File listing haplogroup parent-child relationships for haplogroups that do not conform to naming conventions
ancestral_hg_depth  2                     Number of ancestral haplogroups to check when considering whether a haplogroup receives a score
truncate_haps       N/A                   File with list of haplogroups past which SNAPPY will not make assignments

All adjustable parameters can be listed at run time by calling SNAPPY followed by --help. To adjust a parameter, append a double hyphen (--) followed immediately by the parameter name, a space, and the desired value for that parameter.

Example:

snappy --infile plink_prefix --min_hap_score 0.7
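
The same flag convention can be applied when SNAPPY is launched from a script. The following is a minimal Python sketch, assuming the tool is invoked as snappy exactly as in the example above. The helper name build_snappy_command is hypothetical and not part of SNAPPY; it only assembles the command line and does not run it.

```python
import shlex


def build_snappy_command(infile, **overrides):
    """Assemble a snappy command line (hypothetical helper, not part of SNAPPY).

    Each override becomes "--<parameter name> <value>", following the
    convention described above. Parameter names are not validated here.
    """
    argv = ["snappy", "--infile", str(infile)]
    for name, value in overrides.items():
        argv += ["--" + name, str(value)]
    return argv


if __name__ == "__main__":
    cmd = build_snappy_command("plink_prefix", min_hap_score=0.7)
    # Prints: snappy --infile plink_prefix --min_hap_score 0.7
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if the command should be executed rather than printed.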

Auxiliary Tools:

SNAPPY’s installation includes three additional tools for building a custom SNP library from ISOGG data. They allow users to tailor the library to their genotyped sites and to make haplogroup calls with the most up-to-date data available. These tools are snappy-clean, snappy-qc, and snappy-build. Each has options that can be viewed at run time via the --help argument.

Creating Custom Reference Libraries:

To start creating a library, download the most recent SNP index table from ISOGG as a .tsv file. In the example below, the .tsv is saved as ISOGG_SNP_index.tsv. Use the downloaded table as input to snappy-clean as follows:

snappy-clean --infile ISOGG_SNP_index.tsv

Then run snappy-qc as follows:

snappy-qc --infile isogg_snps.txt

Finally, run snappy-build:

snappy-build --snp_list snp_qc.txt --pos_file genotyped_positions.txt --tree_file ref_files/tree_structure.txt

where genotyped_positions.txt is a file in which each row gives the position of a genotyped site in the data to be used for haplogroup assignment, and ref_files/tree_structure.txt is the tree structure file distributed with the default reference files for SNAPPY.
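
If the genotype data are stored as a plink library, one way to produce the positions file is to read the .bim file. The following is a minimal Python sketch under several assumptions: the .bim file follows the standard six-column plink layout (chromosome, variant id, genetic distance, base-pair position, allele 1, allele 2), the Y-chromosome is coded as 24, Y, or chrY, and snappy-build accepts one integer position per row with no header. The function names are hypothetical and not part of SNAPPY.

```python
def y_positions_from_bim(bim_lines, y_codes=("24", "Y", "chrY")):
    """Yield base-pair positions of Y-chromosome variants from .bim lines.

    Hypothetical helper: assumes the standard six-column plink .bim layout.
    """
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, bp = fields[0], fields[3]
        if chrom in y_codes:
            yield int(bp)


def write_positions(bim_path, out_path="genotyped_positions.txt"):
    """Write one Y-chromosome position per row (assumed input format)."""
    with open(bim_path) as bim, open(out_path, "w") as out:
        for pos in y_positions_from_bim(bim):
            out.write(f"{pos}\n")


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = [
        "24\tsnp_a\t0\t2655180\tA\tG",
        "1\tsnp_b\t0\t12345\tC\tT",
        "Y\tsnp_c\t0\t2657176\tC\tT",
    ]
    print(list(y_positions_from_bim(sample)))
```

Calling write_positions("plink_prefix.bim") would then create genotyped_positions.txt for use with the --pos_file argument shown above.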

Instructions on Uninstalling SNAPPY:

Simply run pip uninstall snappy from a terminal window. If prompted to confirm removal, press “y”.

Notes and Considerations:

  • All reference files included in the current distribution of SNAPPY use positions from human genome version GRCh37. Genotype positions from other versions of the human genome may produce inaccurate results.

  • Prior to running SNAPPY, it may be necessary to check for strand concordance with the Y-chromosome of GRCh37, and to flip and/or remove ambiguous sites and those whose variants correspond to genotyping from the non-reference strand.

  • A key aspect of SNAPPY’s accuracy is the robust nature of the Y-chromosome tree and the inclusion of informative variants on the Multi-Ethnic Genotyping Array (MEGA). SNAPPY’s current reference library was designed and tested using genotyping data from the MEGA, which includes over 11,000 variants on the Y-chromosome. SNAPPY should readily apply to other arrays, but care should be taken to ensure that a sufficient number of an array's genotyped loci are at haplogroup-informative sites.

  • Genotyping by sequencing (GBS) is increasingly popular, and data generated through GBS is compatible with SNAPPY, provided that all sites passing quality filters are included in the output genotypes during variant calling (this can be accomplished, for example, using the --emit-all argument in GATK’s variant calling pipeline). Otherwise, haplogroup-informative sites where the reference sequence used in variant calling has a derived allele may not be included in the genotype file.
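
The strand check mentioned in the notes above can be partly automated. The following is a minimal Python sketch, assuming that "ambiguous sites" means SNPs whose two alleles are complements of each other (A/T or C/G), so that strand cannot be inferred from the alleles alone. The function names and the example sites are hypothetical and not part of SNAPPY.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """Return True for A/T and C/G SNPs, which look the same on both strands."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def flip_strand(allele):
    """Return the complementary base; non-ACGT codes are returned unchanged."""
    return COMPLEMENT.get(allele.upper(), allele)


if __name__ == "__main__":
    # (position, allele 1, allele 2); made-up values for illustration only.
    sites = [(1000, "A", "T"), (2000, "A", "G"), (3000, "C", "G")]
    kept = [s for s in sites if not is_strand_ambiguous(s[1], s[2])]
    print(kept)  # only the A/G site remains
```

Sites that are not ambiguous but were genotyped on the non-reference strand can be converted with flip_strand before comparison against GRCh37.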