kmersGWAS
The kmersGWAS
approach described below is possibly the most powerful single analysis within the SexFindR
protocol. Apart from not requiring a reference genome, this single method has proved capable of detecting sex-linked sequences across the entire sex chromosome evolutionary continuum, from highly degenerate mammalian systems to takifugu, which has a single base that is heterozygous is males and homozygous in females.
After testing a few other k-mer association algorithms as well as a number of specific options within this algorithm, I settled on the following workflow given its consistent results across the sex chromosome divergence continuum. This protocol requires kmersGWAS
(v0.2 beta; https://github.com/voichek/kmersGWAS), ABYSS
(v2.2.5; https://github.com/bcgsc/abyss), plink
(v 1.07-x86_64; https://www.cog-genomics.org/plink/1.9/), R
and blast+
( v2.10.0; https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.0/). I have presented the necessary steps and configurations for the fugu analysis below, but it should be noted that run times and memory usage vary greatly with genome size, coverage, and sample number. For fewer than 30 samples of fugu, the config provided in the SLURM scripts should be sufficient, but some of the kmersGWAS
steps required days to run on large machines (200 GB+ memory) for lamprey, and other species.
Directory structure
kmersGWAS
requires a specific directory structure and has many separate steps. These are presented below, but more information can be found in the kmersGWAS manual here (https://github.com/voichek/kmersGWAS/blob/master/manual.pdf).
Generally, you want the following:
Files present in a top level directory
Individual directories created within the top level directory corresponding to the base name of each sample (e.g., top level directory contains
SRR8585991_1.fastq.gz
,SRR8585991_2.fastq.gz
, and theSRR8585991
directory)Within the base name subdirectories (e.g.,
SRR8585991
), there are execution scripts and aninput_files.txt
file that contains the paths to thefastq
files (e.g.,../SRR8585991_1.fastq.gz
and../SRR8585991_2.fastq.gz
each on their own line).
Running kmersGWAS
The following is a set of example commands to set up the directory structure for the fugu samples, but these will need to be made specific to your file names, path, etc. Many of these steps were executed on a SLURM system and the execution scripts will need to be modified to match your system as well. This process begins with the fastq
files and one copy of each of the execution scripts in the top level (fugu_kmerGWAS
) directory. All commands are executed from that top level directory. I have included an example of all necessary text files in the GitHub repository to check against as well.
cd /home/pgrayson/scratch/fugu_kmerGWAS
for file in $(ls --color=none *_1.*gz); do baseName=$(echo $file | awk -F'[_]' '{print $1}'); mkdir $baseName; done
ls --color=none -d */ > dirlist.txt
sed 's/\///g' dirlist.txt > clean_dirlist.txt
for dir in $(cat clean_dirlist.txt); do cd $dir; echo $dir; ls --color=none ../*${dir}*gz > input_files.txt; cd ..; done
for dir in $(cat clean_dirlist.txt); do cp step1_kmerGWAS.sh $dir; done
for dir in $(cat clean_dirlist.txt); do cd $dir; sbatch step1_kmerGWAS.sh; cd ..; done
Once those first jobs complete, you can run the following (again with modification to the execution script):
for dir in $(cat clean_dirlist.txt); do cp strand_kmerGWAS.sh $dir; done
for dir in $(cat clean_dirlist.txt); do cd $dir; sbatch strand_kmerGWAS.sh; cd ..; done
We next need the individuals k-mers list file. As above, these should be generated following the kmersGWAS
manual. An example set of commands used to generate this file for the fugu samples is provided below:
cd /home/pgrayson/
ll scratch/fugu_kmerGWAS/ | tail -n +2 | awk '{printf "/scratch/fugu_kmerGWAS/%s/kmers_with_strand\t%s\n", $NF,$NF}' > kmers_list_paths.txt
grep -v .gz kmers_list_paths.txt | grep SRR > kmers_clean_list_path.txt
sed 's/\/scratch/\/home\/pgrayson\/scratch/g' kmers_clean_list_path.txt > final_kmers_clean_list_path.txt
mv final_kmers_clean_list_path.txt ~/scratch/fugu_kmerGWAS
Example execution scripts are provided to combine the k-mers and create the k-mers table.
sbatch combine_kmersGWAS.sh
sbatch create_table_kmersGWAS.sh
Once these steps are completed, we generate plink
binary files using the kmers_table_to_bed
function. The phenotype.pheno
file here contains two tab-delimited columns, accession_id
and phenotype_value
with the sample ID in accession_id
and 1
or 2
in the phenotype value
column for the male or female phenotype. -b
here is batch size. Depending on the power of your machine and the number of samples included in your analysis, you might have to produce smaller batches (I ran on a Linux machine with 256 Gb of memory and had to batch other species with more samples). Batches here are the number of k-mers within the analysis for plink
.
~/programs/kmerGWAS/bin/kmers_table_to_bed -t kmers_table -k 31 -p phenotype.pheno --maf 0.05 --mac 5 -b 1000000000 -o fugu_kmerGWAS_plink
Running plink
Next, we run these files through plink
to obtain p-values
for each k-mer, representing the association between k-mer presence and phenotypic sex. If you had to batch your k-mers (above), you will also need to execute this command for batch the other batches (e.g., fugu_kmerGWAS_plink.1, fugu_kmerGWAS_plink.2, etc.). Make sure to also append something new to the --out
name (e.g., fugu_kmers1
, fugu_kmers2
), so that previous results are not overwritten.
~/programs/plink-1.07-x86_64/plink --noweb --bfile fugu_kmerGWAS_plink.0 --allow-no-sex --assoc --out fugu_kmers
The resulting outfile (e.g., fugu_kmers.assoc
) can explored and parsed based on the P
column to identify and pull out the k-mers that have the highest association with sex. An example for fugu to obtain the k-mers with the most significant p-values
is:
awk '$9 < 0.000000000001' fugu_kmers.assoc > most_significant_fugu_assoc.txt
Results from different batches can be concatenated with cat fugu_kmers*.assoc
or something similar following analysis in plink
(although a simple cat
will also repeat the header line from each batch, so be aware that you will need to take this into account using tail
or parsing the output file from cat
to remove the extra header lines).
Running ABYSS
Once you are happy with the filtered k-mer set, you can use ABYSS
to assemble those k-mers into small contigs for blastn
analysis (if a reference genome exists). A basic python
script (plink_to_abyss_kmers.py
) is included to parse the filtered plink
output into ABYSS
-ready input. The ABYSS
input should be in fasta format, with the p-value
as the sample ID and the k-mer as the sequence. e.g.,
>2.005e-13
AAAAAAAAAAAATCATTTCCCACCTCATCAA
>2.005e-13
AAAAAAAAAAATCATTTCCCACCTCATCAAT
>2.005e-13
AAAAAAAAAATCATTTCCCACCTCATCAATC
...
To generate this file, the following should work:
python plink_to_abyss_kmers.py most_significant_fugu_assoc.txt fugu_plink_abyss_input.txt
If your output does not match the example above, you might need to change the index positions in the python script to correctly grab the k-mer and the p-value columns (given different plink
versions). This input file is then run through ABYSS to assemble small contigs.
ABYSS -k25 -c0 -e0 fugu_plink_abyss_input.txt -o fugu_plink_abyss_output.txt
Running blastn
If you have a reference genome, you might be interested in where the enriched k-mers map to. The easiest way to check this is through a blastn
analysis.
You first need to create a blastdb:
makeblastdb -in GCF_901000725.2_fTakRub1.2_genomic.fna -dbtype nucl
Then you can run blastn:
blastn -query fugu_plink_abyss_output.txt -db GCF_901000725.2_fTakRub1.2_genomic.fna -outfmt 6
The output file from blastn will provide top candidate regions for each contig that was assembled from the k-mers. In the case of fugu, there are only 4 contigs that all blast to NC_042303.1, but in poplar and the golden monkey, these blast outputs were parsed with an R
script (e.g., poplar_kmerGWAS_blast_results.R
) to visualize which genomic regions the sex-assocaited k-mers map to. This script and the necessary input files have been included in the GitHub repo for poplar.