Advanced Usage:

Parameters:

The following table outlines user-controllable parameters that can be adjusted at run time:

Parameter Name      Default Value         Description
------------------  --------------------  -----------
infile              N/A, required         Prefix to plink library or .raw file to be used as input
out                 'chrY_hgs'            Prefix to .out and .all files generated by SNAPPY
min_hap_score       0.6                   Minimum match score for a haplogroup to be considered for assignment
min_deep_score      0.8                   Minimum score to switch from highest scoring haplogroup to the deepest haplogroup for assignment
ref_files_dir       'ref_data'            Directory where SNAPPY's reference files are saved
id2pos              'id_to_pos.txt'       File listing SNP ids and corresponding positions
pos2allele          'pos_to_allele.txt'   File listing SNP positions and corresponding alleles
hg2snp              'y_hg_and_snps.sort'  File listing markers and haplogroups
tree_strct          'tree_structure.txt'  File listing haplogroup parent-child relationships for haplogroups that do not conform to naming conventions
ancestral_hg_depth  2                     Number of ancestral haplogroups to check when considering whether a haplogroup receives a score
truncate_haps       N/A                   File with list of haplogroups past which SNAPPY will not make assignments
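As a rough illustration of how min_hap_score and min_deep_score might interact, the sketch below encodes one possible reading of the two descriptions above: haplogroups scoring below min_hap_score are discarded, the highest-scoring remaining haplogroup is the default choice, and the deepest remaining haplogroup is chosen instead when its own score reaches min_deep_score. The Candidate type, the choose_haplogroup function, and the example names and scores are hypothetical and are not taken from SNAPPY's source; the actual assignment logic may differ.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Hypothetical record for one scored haplogroup."""
    name: str     # haplogroup label
    score: float  # match score between 0 and 1
    depth: int    # depth of the haplogroup in the tree (larger = deeper)


def choose_haplogroup(
    candidates: Sequence[Candidate],
    min_hap_score: float = 0.6,
    min_deep_score: float = 0.8,
) -> Optional[str]:
    """Pick a haplogroup under one assumed reading of the two thresholds."""
    # Only haplogroups meeting the minimum match score are considered.
    eligible = [c for c in candidates if c.score >= min_hap_score]
    if not eligible:
        return None

    best = max(eligible, key=lambda c: c.score)
    deepest = max(eligible, key=lambda c: c.depth)

    # Switch to the deepest haplogroup only if it scores well enough.
    if deepest.score >= min_deep_score:
        return deepest.name
    return best.name


# Illustrative values only.
example = [
    Candidate("R", 0.95, 1),
    Candidate("R1b", 0.85, 3),
    Candidate("R1b1a", 0.50, 5),
]
print(choose_haplogroup(example))
```

Under this reading, raising min_deep_score makes the assignment more conservative: the deepest haplogroup has to be better supported before it replaces the highest-scoring one.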

All adjustable parameters can be listed at run time by calling SNAPPY followed by --help. To adjust a parameter, append a double hyphen (--) followed immediately by the parameter name, a space, and the desired value for that parameter.

Example:

python SNAPPY_v123.py --infile plink_prefix --min_hap_score 0.7
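When SNAPPY is launched from another Python script, the same rule (a double hyphen, the parameter name, a space, and the value) can be applied programmatically. The helper below is a hypothetical convenience wrapper and is not part of SNAPPY; it only assembles the argument list, and it assumes the script name SNAPPY_v123.py used in the example above.

```python
import shlex


def build_snappy_command(infile, script="SNAPPY_v123.py", **overrides):
    """Assemble a SNAPPY command line as a list of arguments.

    infile    -- prefix of the input files (required by SNAPPY)
    script    -- path to the SNAPPY script (assumed name)
    overrides -- any other parameters, e.g. min_hap_score=0.7
    """
    command = ["python", script, "--infile", str(infile)]
    for name, value in overrides.items():
        # Each parameter becomes "--name value".
        command.extend([f"--{name}", str(value)])
    return command


command = build_snappy_command("plink_prefix", min_hap_score=0.7)
print(shlex.join(command))
# python SNAPPY_v123.py --infile plink_prefix --min_hap_score 0.7
```

The returned list can be passed to subprocess.run(command, check=True) to execute SNAPPY. Parameters that are not overridden keep their default values.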

Notes and Considerations:

  • All reference files included in the current distribution of SNAPPY use positions from human genome version GRCh37. Genotype positions from other versions of the human genome may produce inaccurate results.

  • Prior to running SNAPPY, it may be necessary to check for strand concordance with the Y-chromosome of GRCh37, and to flip and/or remove ambiguous sites and those whose variants correspond to genotyping from the non-reference strand.

  • A key aspect of SNAPPY's success is the robust nature of the Y-chromosome tree and the inclusion of informative variants on the Multi-Ethnic Genotyping Array (MEGA). SNAPPY's current implementation was designed and tested using genotyping data from the MEGA, which includes over 11,000 variants on the Y-chromosome. SNAPPY should readily apply to other arrays, but care should be taken to ensure that those arrays include a sufficient number of the loci present in the reference library.

  • Genotyping by sequencing (GBS) is increasingly popular, and data generated through GBS is compatible with SNAPPY, provided that all sites passing quality filters are included in the output genotypes during variant calling (this can be accomplished, for example, with the emit-all-sites option in GATK's variant calling pipeline). Otherwise, haplogroup-informative sites where the reference sequence used in variant calling has a derived allele may not be included in the genotype file.
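The strand check mentioned in the notes above can be sketched as follows. This is a minimal illustration and is not part of SNAPPY: it assumes biallelic SNPs with single-letter alleles, and it assumes the caller supplies the GRCh37 reference-strand allele pair for each site. The function names and return labels are hypothetical.

```python
# Watson-Crick complements used to flip an allele to the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose strand cannot be inferred from alleles."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def classify_site(observed, reference):
    """Compare an observed allele pair with the reference-strand allele pair.

    Returns one of:
      "drop"     -- strand-ambiguous site (A/T or C/G)
      "keep"     -- alleles already match the reference strand
      "flip"     -- alleles match after complementing
      "mismatch" -- alleles do not match on either strand
    """
    obs = {a.upper() for a in observed}
    ref = {a.upper() for a in reference}

    if is_strand_ambiguous(*observed) or is_strand_ambiguous(*reference):
        return "drop"
    if obs == ref:
        return "keep"
    # Complement every observed allele and compare again.
    if {COMPLEMENT.get(a) for a in obs} == ref:
        return "flip"
    return "mismatch"


print(classify_site(("T", "C"), ("A", "G")))  # flip
print(classify_site(("A", "T"), ("A", "T")))  # drop
```

Sites labelled "flip" and "drop" could then be handed to a genotype-handling tool for flipping and exclusion, for example PLINK's --flip and --exclude options, before SNAPPY is run.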