Step 0. Mapping and variant calling
Prior to running Step 1 (Coverage-based analysis) or any of the analyses from Step 2 (Sequence-based analyses) except the reference-free k-mer analysis, you will need to map your reads to a reference genome. In the SexFindR paper, we used Bowtie2 (v 2.3.4.3; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for read mapping, and we provide some scripts and configuration for this within the GitHub repo and below. If you already have mapped reads from bwa-mem or another widely used aligner, please feel free to use those.
For all the sequence-based analyses that are mapped to a reference genome, SNP calling is also required. In the SexFindR paper, we used Platypus (commit 3e72641; https://github.com/andyrimmer/Platypus) to jointly call SNPs across all samples, and we provide some scripts and configuration for this within the GitHub repo and below. Again, if you have already called SNPs for your samples using GATK or another widely used caller, please feel free to use those.
Platypus requires that the genome file be indexed with samtools (v1.10; https://github.com/samtools/samtools). We also filtered the raw VCF to include only sites that PASS the quality filters, which requires bcftools (v1.9; https://github.com/samtools/bcftools). vcftools (v0.1.14; https://github.com/vcftools/vcftools) is also used to filter for biallelic sites.
Download raw reads from NCBI
Requires fasterq-dump from SRA-tools (https://github.com/ncbi/sra-tools). In my experience, fasterq-dump can be fairly flaky depending on your system and connection, so this download might need to be repeated multiple times to get all the samples. You might want to split the acc_list.txt file into smaller lists, or add a check like if [ ! -f ${file}_1.fastq ] to the code below, to avoid downloading the same file repeatedly.
for file in $(cat acc_list.txt); do echo $file; date; fasterq-dump $file -e 36; done
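The existence check suggested above can be folded directly into the loop. A sketch, assuming paired-end output where the first mate is written as ${file}_1.fastq:

```shell
# Re-runnable download loop: skip any accession whose first mate file
# already exists, so an interrupted run can be resumed safely.
for file in $(cat acc_list.txt); do
  if [ ! -f "${file}_1.fastq" ]; then
    echo "$file"; date
    fasterq-dump "$file" -e 36
  else
    echo "Skipping $file (already downloaded)"
  fi
done
```

Note that a partially written fastq will still be skipped by this check, so you may want to delete any obviously truncated files before re-running.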
Download the genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/901/000/725/GCF_901000725.2_fTakRub1.2/GCF_901000725.2_fTakRub1.2_genomic.fna.gz
gunzip GCF_901000725.2_fTakRub1.2_genomic.fna.gz
Create a bowtie2 index
bash bowtie2_makeindex_linux.sh GCF_901000725.2_fTakRub1.2_genomic.fna fugu &> index_outerr.txt &
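The indexing script is provided in the repo; its core is a single bowtie2-build call. A minimal sketch of what such a script looks like (the script's internal structure here is an assumption):

```shell
#!/bin/bash
# Usage: bash bowtie2_makeindex_linux.sh <genome.fna> <index_prefix>
# Builds a Bowtie2 index; the output is a set of .bt2 files
# named after the index prefix (e.g., fugu.1.bt2, fugu.2.bt2, ...).
genome=$1
prefix=$2
bowtie2-build "$genome" "$prefix"
```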
Map reads to the genome
For each sample, the reads must be mapped using a command like:
bash bowtie2_16_linux.sh SRR8585991_* fugu &> bt2_SRR8585991_outerr.txt &
bowtie2_16_long.sh is also included as a reference for SLURM systems.
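The mapping wrapped by these scripts boils down to a bowtie2 run piped into samtools sort. A sketch of the equivalent command for one sample, using 16 threads (the script name and exact thread split are assumptions; see the repo scripts for the versions used in the paper):

```shell
#!/bin/bash
# Usage: bash map_sample.sh <SRR_accession> <index_prefix>
# Maps paired-end reads and writes a coordinate-sorted, indexed BAM
# ready for joint variant calling.
acc=$1
idx=$2
bowtie2 -p 16 -x "$idx" -1 "${acc}_1.fastq" -2 "${acc}_2.fastq" \
  | samtools sort -@ 4 -o "${acc}.sorted.bam" -
samtools index "${acc}.sorted.bam"
```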
Call variants using Platypus
Variants are jointly called (all at once) through the use of a bam list (e.g., 15M_14F_bams.txt). The following was run on a SLURM system:
samtools faidx GCF_901000725.2_fTakRub1.2_genomic.fna
sbatch platypus_all_region_1day.sh 15M_14F_bams.txt GCF_901000725.2_fTakRub1.2_genomic.fna
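Inside the SLURM script, the central call is Platypus's callVariants command, which takes the bam list via --bamFiles. A sketch (the output file name and CPU count here are assumptions):

```shell
# Joint variant calling across all samples at once.
# --bamFiles accepts a text file listing one BAM path per line.
python Platypus.py callVariants \
  --bamFiles=15M_14F_bams.txt \
  --refFile=GCF_901000725.2_fTakRub1.2_genomic.fna \
  --output=fugu_joint_calls.vcf \
  --nCPU=16
```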
Filter for variant calls that PASS quality filters
This script keeps only those variants that have PASS in the FILTER field, removing low-quality calls.
sbatch bcf_filter.sh fugu_14M_13F.vcf
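The script wraps a bcftools view call; -f PASS keeps only records whose FILTER field is PASS. A sketch of the underlying command, with the output named to match the input expected by the biallelic filtering step below:

```shell
# Keep only variant records with FILTER == PASS,
# writing an uncompressed VCF for the next filtering step.
bcftools view -f PASS fugu_14M_13F.vcf -o filtered_PASS_fugu_14M_13F.vcf
```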
Filter for biallelic sites
Some downstream analyses (e.g., Fst) require biallelic sites (genotypes 0/0, 0/1, 1/1) to work properly. Execute the following to retain only biallelic sites:
vcftools --vcf filtered_PASS_fugu_14M_13F.vcf --min-alleles 2 --max-alleles 2 --stdout --recode --recode-INFO-all | gzip -c > biallelic_filtered_PASS_fugu_14M_13F.vcf.gz