Trinity RNA-Seq Pipeline


Based on the khmer protocols (see the khmer protocols online documentation)

Starting with the following files:
filtered_106A_Female_Mix_GATCAG_L004_R1.fastq
filtered_106A_Female_Mix_GATCAG_L004_R2.fastq
filtered_106A_Male_Mix_TAGCTT_L004_R1.fastq
filtered_106A_Male_Mix_TAGCTT_L004_R2.fastq
filtered_108A_Female_Mix_GGCTAC_L004_R1.fastq
filtered_108A_Female_Mix_GGCTAC_L004_R2.fastq
filtered_108A_Male_Mix_AGTCAA_L004_R1.fastq
filtered_108A_Male_Mix_AGTCAA_L004_R2.fastq

Cleaning & Trimming
#!/bin/sh
# Build the list of sample basenames: everything before the first ".",
# with the trailing "_R1" stripped off by sed
FILES=`ls -1 *_R1.fastq | awk -F "." '{print $1}' | sed -e 's/...$//'`
 
for file in ${FILES}
do
 echo ${file}
 # Trim adapters; outputs are the still-paired reads (s1_pe, s2_pe)
 # and the orphaned reads whose mates were dropped (s1_se, s2_se)
 java -jar /usr/local/bioinformatics/Trimmomatic/trimmomatic-0.30.jar \
 PE ${file}_R1.fastq ${file}_R2.fastq \
 s1_pe s1_se s2_pe s2_se ILLUMINACLIP:illuminaClipping.fa:2:30:10
 
 # Interleave the surviving pairs; pool the orphans into one file
 interleave-reads.py s1_pe s2_pe > ${file}.pe.fq
 cat s1_se s2_se > ${file}.se.fq
 rm s1_pe s2_pe s1_se s2_se
 
 # Keep reads where at least 50% of bases have quality >= 30
 fastq_quality_filter -Q 33 -q 30 -p 50 -v -i ${file}.pe.fq \
 -o ${file}.pe.qc.fq
 
 fastq_quality_filter -Q 33 -q 30 -p 50 -v -i ${file}.se.fq \
 -o ${file}.se.qc.fq
 
 # Separate out any pairs broken by the quality filter
 extract-paired-reads.py ${file}.pe.qc.fq
 
done
The first step, Trimmomatic, trims the paired read files and removes the adapters listed in the illuminaClipping.fa file. Trimmomatic saves the results in four files: s1_pe and s2_pe hold the forward and reverse reads that are still paired, while s1_se and s2_se hold the orphaned forward and reverse reads whose mates were discarded.
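
In case it helps, here is a sketch of what an adapter file like illuminaClipping.fa can look like. The sequences below are the TruSeq paired-end prefixes that ship with Trimmomatic as TruSeq3-PE.fa; your library prep may use different adapters, so treat this purely as a format example.

cat > illuminaClipping.fa <<'EOF'
>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
EOF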

Next the surviving paired reads are interleaved into a single file and the leftover orphaned reads are concatenated into a single file.
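
To convince yourself the interleaving worked, a line count is a cheap check. This assumes standard 4-line FASTQ records; substitute your own basename:

lines=`wc -l < filtered_106A_Female_Mix_GATCAG_L004.pe.fq`
echo "$((lines / 8)) read pairs"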

Both the paired reads and the orphaned reads are then quality filtered using fastq_quality_filter; with these settings a read is kept only if at least 50% of its bases have a quality score of 30 or higher. The last step, extract-paired-reads.py, separates out any pairs newly broken during the QC step: it writes a still-paired file (ending in pe) and a new orphaned-read file (ending in se).

NOTE: Trimmomatic is capable of doing both of these steps itself, but I'm not sure why the people who came up with this pipeline decided to use fastq_quality_filter instead.
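
For what it's worth, here is a rough sketch of folding the quality filtering into the Trimmomatic call. SLIDINGWINDOW and MINLEN are standard Trimmomatic steps, but the thresholds below only approximate the -q 30 -p 50 filter above; they are not an exact equivalent.

java -jar /usr/local/bioinformatics/Trimmomatic/trimmomatic-0.30.jar \
 PE ${file}_R1.fastq ${file}_R2.fastq \
 s1_pe s1_se s2_pe s2_se \
 ILLUMINACLIP:illuminaClipping.fa:2:30:10 \
 SLIDINGWINDOW:4:30 MINLEN:25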

Files after cleaning
*.pe.fq
*.pe.qc.fq
*.pe.qc.fq.pe # cleaned/trimmed reads still paired after QC
*.pe.qc.fq.se # pairs broken during the QC step
*.se.fq
*.se.qc.fq # Trimmomatic orphans after cleaning/trimming
etc...

Cleaning up after trimming and filtering
#!/bin/sh
FILES=`ls -1 *_R1.fastq | awk -F "." '{print $1}' | sed -e 's/...$//'`
 
for file in ${FILES}
do
    echo ${file}
    # Pool the orphans from adapter trimming and from the QC step
    cat ${file}.pe.qc.fq.se ${file}.se.qc.fq > ${file}.combined.se.qc.fq
 
    rm ${file}.pe.fq
    rm ${file}.pe.qc.fq
    rm ${file}.se.fq
 
    echo "Gzip-ing ${file}_R1"
    gzip -c ${file}_R1.fastq > ${file}_R1.fq.gz
    echo "Gzip-ing ${file}_R2"
    gzip -c ${file}_R2.fastq > ${file}_R2.fq.gz
 
    mv ${file}.pe.qc.fq.pe ${file}.cleaned.pe.qc.fq
    rm ${file}.pe.qc.fq.se
    rm ${file}.se.qc.fq
done
The orphaned reads from both stages are combined into a single file, the intermediate files are removed, the raw files are compressed, and the QC'd paired-end file is renamed.

Your files should now look like this:
*.cleaned.pe.qc.fq # Cleaned PE sequences
*.combined.se.qc.fq # Cleaned SE sequences
*_R1.fastq # Original R1 file
*_R1.fq.gz # compressed R1 file
*_R2.fastq # Original R2 file
*_R2.fq.gz # compressed R2 file

You should be able to remove the original .fastq files because we now have the compressed .fq.gz copies.
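
If you do, something like this works; it assumes the only remaining .fastq files in the directory are the raw inputs, so check with ls first:

ls *.fastq # should list only the original raw files
rm *.fastq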

Digital Normalization
#!/bin/sh
 
# Sample basenames again: everything before the first "." in the cleaned files
FILES=`ls -1 *cleaned* | awk -F "." '{print $1}'`
 
for file in ${FILES}
do
    echo ${file}
 
    # Digi-Norm PE reads
    normalize-by-median.py \
        -p \
        -k 20 \
        -C 20 \
        -N 4 \
        -x 3e9 \
        --savehash ${file}.normC20k20.kh \
        ${file}.cleaned.pe.qc.fq
 
    # Digi-Norm SE reads
    normalize-by-median.py \
        -C 20 \
        -N 4 \
        -x 3e9 \
        --loadhash ${file}.normC20k20.kh \
        --savehash ${file}.normC20k20.kh \
        ${file}.combined.se.qc.fq
 
    # Trim off errors
    filter-abund.py \
        -V \
        ${file}.normC20k20.kh \
        ${file}.*.keep
 
done
The first step reads our cleaned PE file and keeps a read only if the median abundance of its k-mers is still below 20 (-C 20), so reads that are already well covered get discarded; -p keeps pairs together, and the k-mer counts are saved in the .kh file. The second step does the same thing for the SE sequences but loads the counts from our .kh file first, so coverage is judged across both files. Finally, filter-abund.py trims reads at low-abundance k-mers, which are most likely sequencing errors; the -V flag restricts trimming to high-coverage reads, which is the right behavior for variable-coverage data like transcriptomes.
*.cleaned.pe.qc.fq # cleaned PE file from previous steps
*.cleaned.pe.qc.fq.keep # sequences that made it past digital normalization
*.cleaned.pe.qc.fq.keep.abundfilt # the previous file with errors trimmed off
*.combined.se.qc.fq # cleaned SE file from previous steps
*.combined.se.qc.fq.keep # Same as above except for SE sequences
*.combined.se.qc.fq.keep.abundfilt # Same as above except for SE sequences
*.normC20k20.kh # temporary k-mer counting table
*_R1.fq.gz
*_R2.fq.gz
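
To see how much digital normalization discarded, you can compare read counts before and after. A sketch using one sample (4 lines per FASTQ record; substitute your own basename):

before=`wc -l < filtered_106A_Female_Mix_GATCAG_L004.cleaned.pe.qc.fq`
after=`wc -l < filtered_106A_Female_Mix_GATCAG_L004.cleaned.pe.qc.fq.keep`
echo "PE reads: $((before / 4)) before, $((after / 4)) after"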

Cleaning up after digital normalization
#!/bin/sh
 
FILES=`ls -1 *_R1.fq.gz | awk -F "." '{print $1}' | sed -e 's/...$//'`
 
for file in ${FILES}
do
    echo ${file}
    echo "Break out orphaned reads"
    extract-paired-reads.py ${file}.cleaned.pe.qc.fq.keep.abundfilt
 
    echo "Combining SE reads"
    cat ${file}.cleaned.pe.qc.fq.keep.abundfilt.se \
        ${file}.combined.se.qc.fq.keep.abundfilt \
        > ${file}.se.qc.keep.abundfilt.fq
 
    echo "Rename Compress"
    mv ${file}.cleaned.pe.qc.fq.keep.abundfilt.pe \
        ${file}.pe.qc.keep.abundfilt.fq
 
    gzip -c ${file}.pe.qc.keep.abundfilt.fq \
        > ${file}.pe.qc.keep.abundfilt.fq.gz
    gzip -c ${file}.se.qc.keep.abundfilt.fq \
        > ${file}.se.qc.keep.abundfilt.fq.gz
 
    echo "Removing Last files"
    rm ${file}.cleaned.pe.qc.fq.keep.abundfilt.se
    rm ${file}.pe.qc.keep.abundfilt.fq
    rm ${file}.se.qc.keep.abundfilt.fq
    rm ${file}.*.kh
    rm ${file}.*.keep
    rm ${file}.*.abundfilt
 
done

First we break out any pairs orphaned by the abundance filtering, then combine all of the SE sequences into a single file. Then we compress the digitally normalized files. Finally we remove any leftover files that we don't need. Files should now look like this (a quick integrity check follows the list).

*.cleaned.pe.qc.fq # Cleaned, Trimmed PE file from before
*.combined.se.qc.fq # Cleaned, Trimmed SE file from before
*.pe.qc.keep.abundfilt.fq.gz # Cleaned, Trimmed, Normalized & Compressed PE file
*.se.qc.keep.abundfilt.fq.gz # Cleaned, Trimmed, Normalized & Compressed SE file
*_R1.fq.gz
*_R2.fq.gz
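
Before deleting anything else, gzip can verify that the new archives are intact. This only tests the compression, not the FASTQ contents:

gzip -t *.qc.keep.abundfilt.fq.gz && echo "archives OK"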

Setup for Trinity
#!/bin/sh
 
for file in *.pe.qc.keep.abundfilt.fq.gz
do
    echo ${file}
    split-paired-reads.py ${file}
done
 
cat *.1 > left.fq
cat *.2 > right.fq
 
gunzip -c *.se.qc.keep.abundfilt.fq.gz >> left.fq

Break the PE files into left and right reads, then append the SE sequences to the left file. Now we're ready to run Trinity.
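
A quick tally confirms the split worked; left.fq should be bigger than right.fq because the SE reads were appended to it (this assumes standard 4-line FASTQ records):

echo "left:  $((`wc -l < left.fq` / 4)) reads"
echo "right: $((`wc -l < right.fq` / 4)) reads"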

Running Trinity
#!/bin/sh
 
Trinity.pl \
    --left left.fq \
    --right right.fq \
    --seqType fq \
    --JM 20G \
    --CPU 6 \
    > trinity.log 2>&1

Here we tell Trinity that we are using FASTQ files (--seqType fq) and that the paired reads are in the left and right files. --JM 20G caps the memory Trinity can use for the Jellyfish k-mer counting step at 20 GB, and --CPU 6 limits it to at most 6 cores. Finally, all output and error messages are redirected to the trinity.log file.
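
Trinity can run for hours to days, so it's worth keeping the job alive and watching the log. A minimal sketch, assuming you saved the block above as run_trinity.sh (a hypothetical name):

nohup sh run_trinity.sh &  # survives the terminal closing
tail -f trinity.log        # Ctrl-C stops the tail, not Trinity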

Once Trinity is finished you should see a folder called trinity_out_dir (unless you told Trinity to use something else). Inside it are a whole slew of files, but the important one is Trinity.fasta, which contains all the assembled contigs. There is also a Trinity.timing text file that shows how long each step took.
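
A one-liner gives a first feel for the assembly size, since each contig starts with a FASTA header line:

grep -c '^>' trinity_out_dir/Trinity.fasta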