SexFindR Workflow Overview

These pages contain the documentation necessary to run SexFindR on your species of interest. The pages are meant to used in conjunction with the folders on Github (https://github.com/phil-grayson/SexFindR), which contain the same structure. Commands for the analysis of tiger pufferfish (Takifugu rubripes) are present throughout to reproduce these analyses as a test case. When relevant, software versions and links to the source code are presented when the software is mentioned the first time.

Commands were run using a combination of SLURM systems and standard Linux servers. When possible, configurations are provided for the SLURM systems in .sh scripts (these are submitted with sbatch commands). Commands that finish quickly, or commands that could only be run locally, are simply provided as command line input. R code is also provided for relevant steps. File paths, module loads, SLURM accounts, etc. may need to be modified for your specific use case.

Please reach out to Dr. Phil Grayson (phil.d.grayson <at> gmail.com) or open a GitHub issue if you experience difficulty running any part of the SexFindR workflow.

_images/SexFindR.png

Figure 1. SexFindR workflow in three steps

Contents

Step 0. Mapping and variant calling

Prior to running Step 1 (Coverage-based analysis) or any of the analyses from Step 2 (Sequence-based analyses) except the reference-free k-mer analysis, you will need to map your reads to a reference genome.

In the SexFindR paper, we used Bowtie2 (v 2.3.4.3; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for read mapping and we provide some scripts and configuration for this within the GitHub repo and below. If you already have mapped reads from bwa-mem or another widely-used algorithm, please feel free to use those.

For all the sequence-based analyses mapped to a reference genome, SNP calling is also required. In the SexFindR paper, we used Platypus (commit 3e72641; https://github.com/andyrimmer/Platypus) to jointly call SNPs across all samples, and we provide some scripts and configurations for this within the GitHub and below. Again, if you have already called SNPs for your samples using GATK or another widely-used algorithm, please feel free to use those.

Platypus requires that the genome file be indexed with samtools (v1.10; https://github.com/samtools/samtools). We also filtered the raw vcf to only include sites that PASS the quality filters. This requires bcftools (v1.9; https://github.com/samtools/bcftools). vcftools is also used to filter for biallelic sites (v0.1.14; https://github.com/vcftools/vcftools).

Download raw reads from NCBI

Requires fasterq-dump from the SRA-tools (https://github.com/ncbi/sra-tools). In my experience, fasterq-dump can be fairly flakey depending on your system and connection, so this download might need to be repeated multiple times to get all the samples. You might want to separate out the acc_list.txt files or add something like if [ ! -f ${file}_1.fastq ] to the code below to avoid downloading the same file repeatedly.

for file in $(cat acc_list.txt); do echo $file; date; fasterq-dump $file -e 36; done

Download the genome

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/901/000/725/GCF_901000725.2_fTakRub1.2/GCF_901000725.2_fTakRub1.2_genomic.fna.gz
gunzip GCF_901000725.2_fTakRub1.2_genomic.fna.gz

Create a bowtie2 index

bash bowtie2_makeindex_linux.sh GCF_901000725.2_fTakRub1.2_genomic.fna fugu &> index_outerr.txt &

Map reads to the genome

For each sample, the reads must be mapped using a command like:

bash bowtie2_16_linux.sh SRR8585991_* fugu &> bt2_SRR8585991_outerr.txt &

bowtie2_16_long.sh is also included as a reference on SLURM systems.

Call variants using Platypus

Variants are jointly called (all at once) through the use of a bam list (e.g., 15M_14F_bams.txt). The following was run on a SLURM system:

samtools faidx GCF_901000725.2_fTakRub1.2_genomic.fna
sbatch platypus_all_region_1day.sh 15M_14F_bams.txt GCF_901000725.2_fTakRub1.2_genomic.fna

Filter for variant calls that PASS quality filters

This script will keep only those variants that have PASS in the FILTER field, removing low quality calls.

sbatch bcf_filter.sh fugu_14M_13F.vcf

Filter for biallelic sites

Some downstream analyses (e.g., Fst) require only biallelic sites to work properly (e.g., 0/0, 0/1, 1/1). Execute the following to filter for only biallelic sites:

vcftools --vcf filtered_PASS_fugu_14M_13F.vcf --max-alleles 2 --stdout --recode --recode-INFO-all | gzip -c > biallelic_filtered_PASS_fugu_14M_13F.vcf.gz

Step 1. Coverage-based analysis

_images/Step1.png

To determine if there are large sex-specific regions present in your species of interest, a coverage-based analysis is first carried out.

Running DifCover

DifCover (https://github.com/timnat/DifCover) requires only a single male and single female bam file aligned to a reference genome.

Example DifCover command with 1 male (SRR8585998_1.fastq.bam) versus 1 female (SRR8585999_1.fastq.bam).

bash run_difcover.sh SRR8585998_1.fastq.bam SRR8585999_1.fastq.bam 1 >> difCov_Male98_over_Female99_outerr.txt 2>&1

Note that you will need to modify run_difcover.sh to have a proper path for FOLDER_PATH on your system, and the 1 is the library-specific Adjustment Coefficient (AC) to account for differences in sequencing depth between samples. The correct AC value can be determined by taking the ratio of modal depths for the samples of interest through samtools stats as follows:

chmod u+rwx samtools_Modaldepth.sh
for file in $(ls *q.bam); do ./samtools_Modaldepth.sh $file; done

Once you have a modal depth for each sample you are interested in, AC is (modal coverage of sample 2 / modal coverage of sample 1). Of note, I have found that although this works well for some samples, others report a modal depth of 1, which is not helpful for determining AC. In these cases, I have have mixed success simply using the ratio of the bam file sizes for AC. This can be dialed in after the analysis as well by plotting and examining the center of the distribution (which should lie at 0 for autosomes if AC is properly set).

Analyzing DifCover Results

An example output file sample1_sample2.ratio_per_w_CC0_a10_A219_b10_B240_v1000_l500.log2adj_1.DNAcopyout is included in the GitHub repository alongside the R script Fugu_M98_F99_DifCover.R used to generate the figures in the main body of the paper. Given that fugu has only a single SNP that is fixed between males and females, coverage does not produce any outlier windows for this species.

As an example of a species that does exhibit significant differences in depth, the DifCover plot for chicken is also provided below. The code for chicken can be found in Supplemental Code on GitHub.

_images/fugu_difcover.png _images/chicken_difcover.png

Figure 2. DifCover fails (top) to identify coverage differences between a single male and single female in the fugu, indicating that there are no large regions of recombination suppression in the fugu genome (and/or suggesting that the fugu sex determining region is very young in evolutionary time). DifCover succeeds (bottom) at identify the Z and W chromosomes in chicken using only information on depth of coverage between a single male and single female

Step 2. Sequence-based analyses

_images/Step2.png

SNP Density

In nascent sex-linked sequences, we would expect to see differences in overall SNP density between males and females due to Y reads still mapping to the X (or W to the Z). This is due to the divergent evolutionary trajectories of the X and Y following recombination suppression on the Y. SNP density can identify regions where male or female-specific SNPs are still segregating within their respective populations.

This analysis requires vcftools and R.

Running SNP Density

We use vcftools SNPDensity to calculate SNP Density for each sample in 10kb windows across the genome. This requires individual vcf files for each sample. The individual files are created and the SNPDensity calculations carried out using the commands below.

mkdir individual_SNP_density

for file in $(cat snpDen_females.txt); do sbatch SNPdensity.sh $file individual_SNP_density/Female_${file%%.*}.vcf biallelic_filtered_PASS_fugu_14M_13F.vcf.gz; sleep 0.1; done

for file in $(cat snpDen_males.txt); do sbatch SNPdensity.sh $file individual_SNP_density/Male_${file%%.*}.vcf biallelic_filtered_PASS_fugu_14M_13F.vcf.gz; sleep 0.1; done

Here, snpDen_males.txt and snpDen_females.txt reference files are simply lists of the sample IDs (as found in the vcf header) that contain the sample from those sexes. The submission script above appends Male or Female to the SNPDensity output file, which we then use to parse these output files in R.

Analyzing SNP Density

SNP density can be analyzed from the R script SNPdensity_permutations_fugu.R. This file runs the user through reading in the individual SNP densities calculated above, then calculates the true male and female means (and differences) for each 10 kb window, before carrying out a small permutation test. This permutation test generates the p-values that allows us to determine the correct order for SNP Density windows in Step 3. A larger permutation test was carried out for both the lamprey and fugu in the main manuscript to further analyze the SNP Density data. Code and documentation for this test is included in the Supplemental Code folder on the GitHub repo.

Fst

Fst is an index of allelic fixation in populations, and so, if there are high levels of Fst within discrete genomic regions when comparing males to females, this would suggest that those regions differ between males and females and, therefore, are not recombining. For the SexFindR workflow, Fst is calculated from a biallelic vcf file using vcftools.

Example command
vcftools --gzvcf biallelic_filtered_PASS_fugu_14M_13F.vcf.gz --weir-fst-pop snpDen_males.txt --weir-fst-pop snpDen_females.txt --out biallelic_fst

Here, we make use of the same snpDen_males.txt and snpDen_females.txt reference files from the SNP Density step to assign individuals to their correct “population”.

Genome Wide Association Study (GWAS)

GWAS identifies associations between a phenotype and a genotype (Klein et al. 2005). Similar to Fst, by carrying out a GWAS with the male and female populations as two different phenotypes, it is possible to identify SNPs that are strongly or weakly associated with sex. These SNPs can be fixed, or nearly fixed, in either males or females.

This analysis requires vcftools and a combination of plink (v 1.07-x86_64; https://www.cog-genomics.org/plink/1.9/) and GEMMA (v0.98.1; https://github.com/genetics-statistics/GEMMA). We tested numerous GWAS algorithms and this one performed best across the sex chromosome divergence continuum.

Filtering the vcf

We use vcftools to convert the vcf to plink format, remove indels, remove sites with >50% missing data, and sites where the minor allele frequency is less than 5% or more than 95%.

vcftools --vcf filtered_PASS_fugu_14M_13F.vcf --plink --remove-indels --max-missing 0.5 --max-maf 0.95 --maf 0.05 --out gwas_fugu
Running the GWAS

As outlined in the GEMMA manual (https://github.com/genetics-statistics/GEMMA/blob/master/doc/manual.pdf), the program requires input files in the plink binary format. We first run plink (as recommended) to generate these files and then supply these files to GEMMA alongside the option -lm 2 to specify a likelihood ratio test.

/home/pgrayson/programs/plink-1.07-x86_64/plink --file gwas_fugu --pheno sex_fugu_meta.txt --make-bed --out gwas_fugu_plink --noweb --allow-no-sex
~/programs/gemma-0.98.1-linux-static -bfile gwas_fugu_plink -lm 2 -o fugu_gemma_out

sex_fugu_meta.txt is included in the GitHub repository as an example file. It contains 3 tab-delimited columns that contain the sample name (repeated in column 1 and column 2) and the phenotype (1 for male, 2 for female - or vice versa). An example is below:

SRR8585991_1.fastq      SRR8585991_1.fastq      1
SRR8585992_1.fastq      SRR8585992_1.fastq      2
SRR8585993_1.fastq      SRR8585993_1.fastq      2
Analysing the output

The commands above will generate fugu_gemma_out.assoc and fugu_gemma_out.log.txt. The .assoc file contains the GWAS results, including the genome position (rs) the number of individuals that had a call (n_obs) or missing data (n_mis) at that site, the allele frequency (af), and the p-value for the likelihood ratio test (p_lrt). This file can be searched and parsed to identify the top candidates from this analysis.

kmersGWAS

The kmersGWAS approach described below is possibly the most powerful single analysis within the SexFindR protocol. Apart from not requiring a reference genome, this single method has proved capable of detecting sex-linked sequences across the entire sex chromosome evolutionary continuum, from highly degenerate mammalian systems to takifugu, which has a single base that is heterozygous is males and homozygous in females.

After testing a few other k-mer association algorithms as well as a number of specific options within this algorithm, I settled on the following workflow given its consistent results across the sex chromosome divergence continuum. This protocol requires kmersGWAS (v0.2 beta; https://github.com/voichek/kmersGWAS), ABYSS (v2.2.5; https://github.com/bcgsc/abyss), plink (v 1.07-x86_64; https://www.cog-genomics.org/plink/1.9/), R and blast+ ( v2.10.0; https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.0/). I have presented the necessary steps and configurations for the fugu analysis below, but it should be noted that run times and memory usage vary greatly with genome size, coverage, and sample number. For fewer than 30 samples of fugu, the config provided in the SLURM scripts should be sufficient, but some of the kmersGWAS steps required days to run on large machines (200 GB+ memory) for lamprey, and other species.

Directory structure

kmersGWAS requires a specific directory structure and has many separate steps. These are presented below, but more information can be found in the kmersGWAS manual here (https://github.com/voichek/kmersGWAS/blob/master/manual.pdf).

Generally, you want the following:

  1. Files present in a top level directory

  2. Individual directories created within the top level directory corresponding to the base name of each sample (e.g., top level directory contains SRR8585991_1.fastq.gz, SRR8585991_2.fastq.gz, and the SRR8585991 directory)

  3. Within the base name subdirectories (e.g., SRR8585991), there are execution scripts and an input_files.txt file that contains the paths to the fastq files (e.g., ../SRR8585991_1.fastq.gz and ../SRR8585991_2.fastq.gz each on their own line).

Running kmersGWAS

The following is a set of example commands to set up the directory structure for the fugu samples, but these will need to be made specific to your file names, path, etc. Many of these steps were executed on a SLURM system and the execution scripts will need to be modified to match your system as well. This process begins with the fastq files and one copy of each of the execution scripts in the top level (fugu_kmerGWAS) directory. All commands are executed from that top level directory. I have included an example of all necessary text files in the GitHub repository to check against as well.

cd /home/pgrayson/scratch/fugu_kmerGWAS
for file in $(ls --color=none *_1.*gz); do baseName=$(echo $file | awk -F'[_]' '{print $1}'); mkdir $baseName; done
ls --color=none -d */ > dirlist.txt
sed 's/\///g' dirlist.txt > clean_dirlist.txt
for dir in $(cat clean_dirlist.txt); do cd $dir; echo $dir; ls --color=none ../*${dir}*gz > input_files.txt; cd ..; done
for dir in $(cat clean_dirlist.txt); do cp step1_kmerGWAS.sh $dir; done
for dir in $(cat clean_dirlist.txt); do cd $dir; sbatch step1_kmerGWAS.sh; cd ..; done

Once those first jobs complete, you can run the following (again with modification to the execution script):

for dir in $(cat clean_dirlist.txt); do cp strand_kmerGWAS.sh $dir; done
for dir in $(cat clean_dirlist.txt); do cd $dir; sbatch strand_kmerGWAS.sh; cd ..; done

We next need the individuals k-mers list file. As above, these should be generated following the kmersGWAS manual. An example set of commands used to generate this file for the fugu samples is provided below:

cd /home/pgrayson/
ll scratch/fugu_kmerGWAS/ | tail -n +2 | awk '{printf "/scratch/fugu_kmerGWAS/%s/kmers_with_strand\t%s\n", $NF,$NF}' > kmers_list_paths.txt
grep -v .gz kmers_list_paths.txt | grep SRR > kmers_clean_list_path.txt
sed 's/\/scratch/\/home\/pgrayson\/scratch/g' kmers_clean_list_path.txt > final_kmers_clean_list_path.txt
mv final_kmers_clean_list_path.txt ~/scratch/fugu_kmerGWAS

Example execution scripts are provided to combine the k-mers and create the k-mers table.

sbatch combine_kmersGWAS.sh
sbatch create_table_kmersGWAS.sh

Once these steps are completed, we generate plink binary files using the kmers_table_to_bed function. The phenotype.pheno file here contains two tab-delimited columns, accession_id and phenotype_value with the sample ID in accession_id and 1 or 2 in the phenotype value column for the male or female phenotype. -b here is batch size. Depending on the power of your machine and the number of samples included in your analysis, you might have to produce smaller batches (I ran on a Linux machine with 256 Gb of memory and had to batch other species with more samples). Batches here are the number of k-mers within the analysis for plink.

~/programs/kmerGWAS/bin/kmers_table_to_bed -t kmers_table -k 31 -p phenotype.pheno --maf 0.05 --mac 5 -b 1000000000 -o fugu_kmerGWAS_plink
Running ABYSS

Once you are happy with the filtered k-mer set, you can use ABYSS to assemble those k-mers into small contigs for blastn analysis (if a reference genome exists). A basic python script (plink_to_abyss_kmers.py) is included to parse the filtered plink output into ABYSS-ready input. The ABYSS input should be in fasta format, with the p-value as the sample ID and the k-mer as the sequence. e.g.,

>2.005e-13
AAAAAAAAAAAATCATTTCCCACCTCATCAA
>2.005e-13
AAAAAAAAAAATCATTTCCCACCTCATCAAT
>2.005e-13
AAAAAAAAAATCATTTCCCACCTCATCAATC
...

To generate this file, the following should work:

python plink_to_abyss_kmers.py most_significant_fugu_assoc.txt fugu_plink_abyss_input.txt

If your output does not match the example above, you might need to change the index positions in the python script to correctly grab the k-mer and the p-value columns (given different plink versions). This input file is then run through ABYSS to assemble small contigs.

ABYSS -k25 -c0 -e0 fugu_plink_abyss_input.txt -o fugu_plink_abyss_output.txt
Running blastn

If you have a reference genome, you might be interested in where the enriched k-mers map to. The easiest way to check this is through a blastn analysis.

You first need to create a blastdb:

makeblastdb -in GCF_901000725.2_fTakRub1.2_genomic.fna -dbtype nucl

Then you can run blastn:

blastn -query fugu_plink_abyss_output.txt -db GCF_901000725.2_fTakRub1.2_genomic.fna -outfmt 6

The output file from blastn will provide top candidate regions for each contig that was assembled from the k-mers. In the case of fugu, there are only 4 contigs that all blast to NC_042303.1, but in poplar and the golden monkey, these blast outputs were parsed with an R script (e.g., poplar_kmerGWAS_blast_results.R) to visualize which genomic regions the sex-assocaited k-mers map to. This script and the necessary input files have been included in the GitHub repo for poplar.

Step 3. Combined sequence-based analysis

_images/Step3.png

If your species of interest has a genetic basis for sex determination, but you have not been able to identify a consistent signal within your genomic data using Steps 1 and 2, the scripts for Step 3, Combined sequence-based analysis, should be able to help. If you have not identified a compelling candidate by the end of Step 3, I would suggest that there is not a genetic signal for sex determination in the genomic data you have analyzed, which was the case for sea lamprey (see the main manuscript).

In Step 3, non-overlapping signals from the reference-based methods are converted to 10kb windows and combined to increase the user’s power to focus on specific genomic regions. Once candidate windows are identified in Step 3, the user can return to GWAS, SNP density, and Fst results to determine if fixed or nearly-fixed differences exist between the sexes in the top candidate windows.

Once again, the example code Fugu_SexFindR.R and necessary input files are included in the GitHub repository in order to run the analysis for fugu and generate the plot in the SexFindR figure in the manuscript.

For the GWAS and Fst input data, parsing of the input data is carried out both in the R script, and through Python scripts and bash commands.

To generate gemma_window_count.txt which is required for the R script from fugu_gemma_sort_cut.txt (which is created in the R script from test_gemma_out.assoc.txt.zip), do the following on the command line:

sort -k1,1 -k2n fugu_gemma_sort_cut.txt > coordinated_sorted_gemma_cut.txt
python window_count.py coordinated_sorted_gemma_cut.txt gemma_window_count.txt 10000

This creates a window-based count of top GWAS hits (top 5%) within each 10 kb region across the genome.

Similarly, to generate Fstugu_window_count.txt from Fstugu_sort_cut.txt (created in R from biallelic_fst.weir.fst.zip),

sort -k1,1 -k2n Fstugu_sort_cut.txt > coordinated_sorted_Fstugu_cut.txt
python window_count.py coordinated_sorted_Fstugu_cut.txt Fstugu_window_count.txt 10000

This creates a window-based count of top Fst hits (top 5%) within each 10 kb region across the genome.

The SNP Denisty data is already in 10 kb windows from the SNP Density analysis (Step 2) and is able to be used without modifications outside of R.

With all these scripts and input data, you should be able to recreate this plot from the main manuscript in R.

_images/FuguSexFindR.png

Figure 3. SexFindR Step 3 combined results identified the fugu SDR (red dotted line) alongside 4 additional candidate windows (thin black dotted lines), all on NC_042303.1. Regions of high interest are darkly colored, whereas regions of low interest are colored in grey.

Supplemental Code

This section is a work in progress. Code for the analysis of lamprey and the other species from the main manuscript on the GitHub repository and documentation (when necessary or helpful) will be added.

Running 100k Permutations

For SNP density, the R code provided runs 1000 permutations, but I wanted to run more. For fugu and lamprey, custom Python and R code was written to carry out the permuations and pivot the resulting table to a useful format. Below is the example code for the lamprey run.

cd ~/SexFindR/Supplemental_Code/Lamprey/SNP\ Density/
unzip SNPdensity_rows_location.txt.gz
for run in $(cat runs.txt); do python SNP_density_permuter.py SNPdensity_rows_location.txt perm_${run}.txt 2500 $run & sleep 0.1; done

cat out_true_SNPden.txt perm* > true_plus_100k_perms.txt

# using R to pivot the dataframe
R
long <- read_tsv("true_plus_100k_perms.txt")
pivot <- t(long)
output <- as.data.frame(pivot) %>% rownames_to_column()
write_tsv(output,"trans_true_plus_100k_perms.txt",col_names = F)

python3 SNP_density_p_generator.py trans_true_plus_100k_perms.txt SNP_density_100k_permutations_p_values.txt &

Supplemental Text

This page contains two sections that were not included in the SexFindR manuscript or supplements, but could be of use for others.

1. Genetic Sexing of the Tiger Pufferfish (fugu)

In the tiger pufferfish (Takifugu rubripes), it had been previously shown that a single polymorphic site within the Amhr2 gene is consistently C/C in females and C/G in males (Kamiya et al. 2012). This SNP (NC_042303.1:12708100, hereafter referred to as fugu SDR) is reportedly the only fixed genomic difference between males and females of this species and is believed to be involved in sex determination. Recently, a group studying the evolution of salinity tolerance within the pufferfish clade, released a dataset containing whole-genome re-sequencing data for 29 tiger pufferfish (Zhang et al. 2020). Although these samples were deposited onto NCBI without phenotypic sex data, the strong support for the fugu SDR allowed us to genetically sex most of the samples using the following methods and criteria. The tiger pufferfish samples (table S4) were downloaded from NCBI SRA repository using fasterq-dump and mapped to the reference genome, fTakRub1.2 (https://www.ncbi.nlm.nih.gov/genome/63), using Bowtie2. The fugu SDR and surrounding alignment was pulled from the resulting BAM files and viewed in IGV. Of the 29 samples, 13 displayed C/C genotypes for all of their reads and were classified as females for subsequent analyses. Of the remaining 16 samples, given the expected 50/50 ratio of G:C for males, samples with an excess of C or an excess of G were removed from the analysis. To this end, we classified samples as males if at least 25% of their reads were G and at least 25% of their reads were C. This resulted in confident identification of 14 males and the exclusion of 2 samples; one which had 87% G-containing reads and one that had 10% G-containing reads. No other bases were reported at the fugu SDR in any of these samples, suggesting that sequencing errors within this region were minimal for these data. The genetically-assigned sexes are found in table S5.

Although there are numerous studies that demonstrate a perfect correlation between genetic and phenotypic sex through the fugu SDR in the tiger pufferfish (Kamiya et al. 2012; Matsunaga et al. 2014), there is at least one reported population from an aquaculture center where the gonads (testis or ovary) did not agree with the fugu SDR (Matsunaga et al. 2014). The purpose of the tiger pufferfish in our study is to determine whether the methods we employ are capable of identifying fixed differences between males and females, even in a very young sex determination system where a single SNP could govern phenotypic sex. Given this purpose, it is somewhat irrelevant whether the specific fugu samples downloaded from NCBI would have also followed a perfect correlation between genetic and phenotypic sex. We are primarily using these data to validate our ability to identify sex specific sequences as small as a single SNP to examine this possibility in sea lamprey. Although we could have generated simulated data for the same purpose, the fugu data is particularly valuable given that it represents true biological data from a living species.

2. The Value and Utility of kmersGWAS

As described in the main paper, k-mer based analyses have the potential to be incredibly powerful for uncovering sex-specific sequences. Since k-mer based analyses do not rely on an assembled genome for read mapping, they side-step many of the issues associated with standard analyses. One of the main reasons for carrying out coverage-based and population genomic analyses in Steps 1 and 2 of SexFindR is that the amount of divergence between an X and Y (or Z and W) will dictate the power of those different methods, and by only using one, you could easily miss an important signal. Coverage methods are key if there is sufficient divergence to prevent reads from one sex chromosome from mapping to the other, whereas standard population genomics methods like FST and GWAS have power only when the reads from both sex chromosomes map equally well to the reference, so that the fixed differences between males and females can be identified.

With k-mer based methods, this boundary does not exist, because mapping is not required to identify the sex-specific sequence. We already demonstrated that kmersGWAS was capable of identifying fixed differences in homomorphic systems (poplar and fugu), but we did not look at a system where coverage-alone would be sufficient to identify the sex chromosomes. To prove the utility of kmersGWAS across the divergence continuum, we downloaded re-sequencing data from the golden snub-nosed monkey (Rhinopithecus roxellana) which has the standard placental mammalian XY system with a degenerate Y chromosome (table S4).

Sex chromosomes were initially identified by running DifCover for the male reference (SRR1040959) used in genome assembly (Zhou et al. 2014) against individuals that had been previously re-sequenced and sexed (Yu et al. 2016; Kuang et al. 2019). In the ASM756505v1 assembly, (GCA_007565055.1 on NCBI), the X chromosome is NC_044555.1 (chromosome 7) and the Y chromosome is NC_044570.1 (chromosome 22) given our results (fig. S9). Sex of the remaining samples was confirmed by running samtools depth on 10kb of sequence selected from the X, the Y and an autosome (NC_044554.1). Males were expected to display similar coverage between the X and Y, but twice that level of coverage on the autosome (i.e., 1X, 1Y, 2A). Females were expected to show similar coverage between the X and autosome, but little-to-no coverage on the Y. Sexed predicted using this method were compared to those that were previously published (Kuang et al. 2019), and samples that did not agree were discarded (table S11).

_images/difcover_monkey.png

Fig. S9. DifCover analysis of the golden snub-nosed monkey. Each plot shows log2(coverage difference) on the X axis and the proportion of the full chromosome that that specific region occupies on the Y. Only full chromosomes (not scaffolds) are analyzed. The X and Y chromosomes are highlighted in red and teal, respectively, and all autosomes are coloured in yellow. A: male/female comparison with low male coverage on the X and high male coverage on the Y. B and C: female/female and male/male comparison, respectively, with no significant differences between X and Y chromosomes between the individuals of the same sex.

Once the sexing was completed, kmersGWAS was carried out using 12 males and 13 females. The top p-value achieved in this analysis was 1.537e-12 which appeared for 1,688,851 k-mers that were fixed in males and absent in females. These k-mers were assembled with ABYSS into 73,400 short contigs and blastn was used to map these contigs to the reference genome. 36,699 of these contigs found a position in the assembled genome using blastn, and this position was the Y chromosome for over 90% of the contigs (fig. S10).

_images/abyss_monkey.png

Fig. S10. kmersGWAS identified the Y chromosome in golden snub-nosed monkey within the class of k-mers that produced the smallest p-value. These male-specific k-mers map back to the Y chromosome almost exclusively.

Given this result, we have demonstrated that kmersGWAS is able to identify sex-specific across the entire divergence continuum, ranging from a single sex-specific SNP in fugu to a degenerate sex-specific Y chromosome in the golden snub-nosed monkey. kmersGWAS is a fast and straightforward method that does not rely on a reference genome and it is quite powerful for the detection of sex chromosomes.

References

Kamiya T, Kai W, Tasumi S, Oka A, Matsunaga T, Mizuno N, Fujita M, Suetake H, Suzuki S, Hosoya S, et al. 2012. A trans-species missense SNP in Amhr2 is associated with sex determination in the tiger Pufferfish, Takifugu rubripes (Fugu). PLoS Genet. 8(7). doi:10.1371/journal.pgen.1002798.

Kuang WM, Ming C, Li HP, Wu H, Frantz L, Roos C, Zhang YP, Zhang CL, Jia T, Yang JY, et al. 2019. The Origin and Population History of the Endangered Golden Snub-Nosed Monkey (Rhinopithecus roxellana). Mol Biol Evol. 36(3):487–499. doi:10.1093/molbev/msy220.

Matsunaga T, Ieda R, Hosoya S, Kuroyanagi M. 2014. An efficient molecular technique for sexing tiger pufferfish (fugu) and the occurrence of sex reversal in a hatchery population. :933–942. doi:10.1007/s12562-014-0768-0.

Yu L, Wang GD, Ruan J, Chen Y Bin, Yang CP, Cao X, Wu H, Liu YH, Du ZL, Wang XP, et al. 2016. Genomic analysis of snub-nosed monkeys (Rhinopithecus) identifies genes and processes related to high-altitude adaptation. Nat Genet. 48(8):947–952. doi:10.1038/ng.3615.

Zhang H, Hou J, Liu H, Zhu H, Xu G, Xu J. 2020. Adaptive evolution of low-salinity tolerance and hypoosmotic regulation in a euryhaline teleost, Takifugu obscurus. Mar Biol. 167(7):1–12. doi:10.1007/s00227-020-03705-x. https://doi.org/10.1007/s00227-020-03705-x. Zhou X, Wang B, Pan Q, Zhang J, Kumar S, Sun X, Liu Z, Pan H, Lin Y, Liu G, et al. 2014. Whole-genome sequencing of the snub-nosed monkey provides insights into folivory and evolutionary history. Nat Genet. 46(12):1303–1310. doi:10.1038/ng.3137.