--- title: "Getting better mapping efficiency out of Bismark" output: html_notebook --- We've been doing some bisulfite mapping via bismark, and one issue we noticed was pretty abysmal mapping efficiency. Hollie noted around 3%, while I got 11% out of some data Steven provided. The developers published a paper [here](http://www.epigenesys.eu/en/protocols/bio-informatics/483-quality-control-trimming-and-alignment-of-bisulfite-seq-data-prot-57) offering some suggestions to improve that, so I'm going to try those out. First, they suggest running some form of quality control tool on the data. They use FastQC, but say that something like FastX toolkit, Prinseq, or another tool would work also. I'm going to use FastQC, just for simplicity's sake. ```{r} ## Moving data files to the correct directory and unzipping them setwd("~/Documents/Bismark-Mapping-Eff/") system("mkdir ~/Documents/Bismark-Mapping-Eff/Data") system("mv ~/Documents/Bismark-Mapping-Eff/*.gz ~/Documents/Bismark-Mapping-Eff/Data") system("gunzip ~/Documents/Bismark-Mapping-Eff/Data/*.gz") ## Downloading fastqc system("wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip") ## gksudo is a graphical implementation of sudo, which gets around the whole inability to interact with the Linux command line via an R script. You can get it via sudo apt-get install gksu if you want to implement the script this way, otherwise you can just do this part in terminal. Unlike regular sudo, which will remember your password for a period of time after entering it, gksudo requires entry every time, which gets a little tedious. system("gksudo mkdir /home/shared/fastQC") system("gksudo mv fastqc_v0.11.5.zip /home/shared/fastQC/fastqc_v0.11.5.zip") system("gksudo unzip fastqc_v0.11.5.zip") system("gksudo mv /home/shared/fastQC/FastQC/* /home/shared/fastQC") system("gksudo chmod +x /home/shared/fastQC/fastqc") ``` Lets run some fastQC on our data files! ```{r} time.fastqc <- proc.time() system("gksudo /home/shared/fastQC/fastqc ~/Documents/Bismark-Mapping-Eff/Data/*.fastq") time.fastqc2 <- proc.time() - time.fastqc ``` ```{r} html_dir <- "~/Documents/Bismark-Mapping-Eff/Data/" html_file_names <- list.files(path = "~/Documents/Bismark-Mapping-Eff/Data/", pattern = "*.html") ``` I can't find a nice way to render html documents inside of an R notebook, so the output files from FastQC can be found [here](https://github.com/seanb80/seanb80.github.io/tree/master/images/fastQC) The FastQC reports look pretty rough, with some indication of adapter contamination and low quality reads. We'll run trimmomatic here, as it's what Hollie did in her notebook. The arguments she used are Leading = 3, Trailing = 3, and MinLen = 20. I also pulled the TruSeq adapter sequences from there. ```{r} setwd("~/Documents/Bismark-Mapping-Eff/Data") temp.file.list <- list.files(path = "~/Documents/Bismark-Mapping-Eff/Data", pattern = "*.fastq") trimo.time <- proc.time() for(i in 1:length(temp.file.list)) { system(paste0("TrimmomaticSE -threads 16 -phred33 ~/Documents/Bismark-Mapping-Eff/Data/", temp.file.list[i], " ~/Documents/Bismark-Mapping-Eff/Cleaned/cleaned_", temp.file.list[i], " ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20" )) } trimo.time2 <- proc.time() - trimo.time print(trimo.time2) ad <- 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' galore.time <- proc.time() for(i in 1:length(temp.file.list)) { system(paste0("/home/shared/trimgalore/trim_galore --fastqc -a '", toString(ad), "' -q 20 -clip_R1 3 -three_prime_clip_R1 3 --length 20 ", temp.file.list[i], " -o ~/Documents/Bismark-Mapping-Eff/Galore/")) } galore.time2 <- proc.time - galore.time ``` Now we'll re-run FastQC, to check if things have changed ```{r} time.clean.fastqc <- proc.time() system("/home/shared/fastQC/fastqc ~/Documents/Bismark-Mapping-Eff/Cleaned/*.fastq") time.clean.fastqc2 <- proc.time() - time.clean.fastqc print(time.clean.fastqc2) ``` The Babraham paper suggests trimming using Trim Galore with a --trim1 argument, but this is for paired end reads, and we have single end reads here, so that isn't applicable to us. update after one night running: I apparently don't have FastQC pathed propery for the --fastqc call to work in Trim Galore. Oops. So I guess we get to run FastQC on everything manually. Ah well. ```{r} setwd("~/Documents/Bismark-Mapping-Eff/Cleaned") system("/home/shared/fastQC/fastqc ~/Documents/Bismark-Mapping-Eff/Cleaned/*.fastq") setwd("~/Documents/Bismark-Mapping-Eff/Galore") system("/home/shared/fastQC/fastqc ~/Documents/Bismark-Mapping-Eff/Galore/*.fastq") ``` Moving on to the bismark steps. First, we prep the reference genome. ```{r} #system("/home/shared/Bismark/bismark_genome_preparation ~/Documents/Bismark-Mapping-Eff/Bowtie1/") #system("/home/shared/Bismark/bismark_genome_preparation --bowtie2 ~/Documents/Bismark-Mapping-Eff/Bowtie2/") ``` Now we're going to run the Bismark aligner a few different ways. I'm trying both Trimmomatic and Trim Galore trimmed files, with N arguments of 0 and 1 for Bismark. N indiates the number of high-quality mismatches, and may not have a measurable effect on mapping efficiencies, but it's worth a shot I suppose? ```{r} # trimmo.file.names <- list.files(path = "~/Documents/Bismark-Mapping-Eff/Cleaned", pattern = "*.fastq") # # galore.file.names <- list.files(path = "~/Documents/Bismark-Mapping-Eff/Galore", pattern = "*.fastq") # # for(i in 1:nrow(trimmo.file.names)) { # # system(paste0("/home/shared/Bismark/bismark -n 0 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/Bismark-Mapping-Eff/Cleaned/", trimmo.file.names[i]," --output_dir ~/Documents/Bismark-Mapping-Eff/Trimmo-Out-0 ")) # # } # # for(i in 1:nrow(trimmo.file.names)) { # # system(paste0("/home/shared/Bismark/bismark -n 1 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/Bismark-Mapping-Eff/Cleaned/", trimmo.file.names[i]," --output_dir ~/Documents/Bismark-Mapping-Eff/Trimmo-Out-1 ")) # # } # # for(i in 1:nrow(galore.file.names)) { # # system(paste0("/home/shared/Bismark/bismark -n 0 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/Bismark-Mapping-Eff/Cleaned/", galore.file.names[i]," --output_dir ~/Documents/Bismark-Mapping-Eff/Galore-Out-0 ")) # # } # # for(i in 1:nrow(galore.file.names)) { # # system(paste0("/home/shared/Bismark/bismark -n 1 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/Bismark-Mapping-Eff/Cleaned/", galore.file.names[i]," --output_dir ~/Documents/Bismark-Mapping-Eff/Galore-Out-1 ")) # # } ``` ```{r} file.names.list <- list.files(path = "~/Documents/test", pattern = "*.fastq") for(i in 1:length(file.names.list)) { time.0 <- proc.time() system(paste0("/home/shared/Bismark/bismark --multicore 8 -n 0 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/test/", file.names.list[i], " --output_dir ~/Documents/test/")) time.0 <- time.0 - proc.time() print(time.0) time.1 <- proc.time() system(paste0("/home/shared/Bismark/bismark --multicore 8 -n 1 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/test/", file.names.list[i], " --output_dir ~/Documents/test/")) time.1 <- time.1 - proc.time() print(time.1) time.2 <- proc.time() system(paste0("/home/shared/Bismark/bismark --multicore 8 -n 2 --genome ~/Documents/Bismark-Mapping-Eff/Bowtie2/ ~/Documents/test/", file.names.list[i], " --output_dir ~/Documents/test/")) time.2 <- time.2 - proc.time() print(time.2) } ``` ```{r} for(i in 1:length(file.names.list)) { system(paste0("bsmap -a ~/Documents/test/", file.names.list[i], " -d ~/Documents/Bismark-Mapping-Eff/Bowtie1/Ostrea_lurida-Scaff-10k.fa -o ~/Documents/test/", file.names.list[i], ".sam")) } ```