{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BLAST and GOSlim annotation of the *Acropora palmata* transcriptome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workflow details the annotation of an *Acropora palmata* [transcriptome](https://usegalaxy.org/datasets/cb51c4a06d7ae94e/display?to_ext=fasta).\n", "\n", "The notebook requires the following:\n", "- [NCBI Blast: 2.2.3](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)\n", "- [SQLShare](https://sqlshare.escience.washington.edu/accounts/login/?next=/sqlshare/%3F__hash__%3D)\n", "\n", "The annotation also requires a UniProt/Swiss-Prot BLAST database. Instructions for setting up this database can be found [here](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/README.md).\n", "\n", "The original analysis was carried out on Mac OS X v10.10.3 running Python 2.7.9 and IPython 3.1.0.\n", "\n", "This workflow is structured so that anyone can reproduce the analysis by downloading the repository locally and executing this notebook."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/data/Apalm\n" ] } ], "source": [ "cd ../data/Apalm" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 65.5M 100 65.5M 0 0 5165k 0 0:00:12 0:00:12 --:--:-- 5210k\n" ] } ], "source": [ "#Obtain FASTA file (URL is quoted so the shell does not interpret the ?)\n", "!curl -O \"https://usegalaxy.org/datasets/cb51c4a06d7ae94e/display?to_ext=fasta\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Rename the downloaded file to Apalm.fasta\n", "!mv \"display?to_ext=fasta\" Apalm.fasta" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!head Apalm.fasta" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!tail Apalm.fasta" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#The FASTA file has stray double quotes (\") in front of some of the header markers (>); remove them\n", "!sed 's/\"//g' Apalm.fasta > Apalm2.fasta" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Check the cleaned file\n", "!head Apalm2.fasta" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Count number of seqs\n", "!fgrep -c \">\" Apalm2.fasta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blastx query" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Options (comments are kept above the command because text after a trailing backslash breaks the line continuation):\n", "#-query = FASTA file; -db = your blastx database address\n", "#-max_target_seqs 1 = maximum number of target sequences = 1; -max_hsps 1 = maximum number of high-scoring pairs = 1\n", "#-outfmt 6 = tabular output; -evalue 1E-05 = E-value of 10^-5; -num_threads 8 = number of threads = 8\n", "#-out = direct output to analyses directory; 2> = direct standard error output to its own file\n", "!blastx \\\n", "-query Apalm2.fasta \\\n", "-db ~blast/db/uniprot_sprot \\\n", "-max_target_seqs 1 \\\n", "-max_hsps 1 \\\n", "-outfmt 6 \\\n", "-evalue 1E-05 \\\n", "-num_threads 8 \\\n", "-out ../../analyses/Apalm/Apalm_blastx_uniprot.tab \\\n", "2> ../../analyses/Apalm/Apalm_blastx_uniprot.error" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/analyses/Apalm\n" ] } ], "source": [ "cd ../../analyses/Apalm" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "head: ../analyses/Apalm/Apalm_blastx_uniprot.tab: No such file or directory\r\n" ] } ], "source": [ "#Check the head of the output file\n", "!head -10 Apalm_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Comparing the tail with the original FASTA should give an idea of whether\n", "#the blast job is complete (note contig25409_16070 present in both)\n", "!tail -10 Apalm_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!wc Apalm_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Replace pipes with tabs to produce a tab-delimited file for SQLShare\n", "!tr '|' \"\\t\" < Apalm_blastx_uniprot.tab > Apalm_blastx_uniprot_sql.tab\n", "!head -1 Apalm_blastx_uniprot.tab\n", "!echo SQLShare ready version has Pipes converted to Tabs ....\n", "!head -1 Apalm_blastx_uniprot_sql.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "# Manually uploading Apalm_blastx_uniprot_sql.tab to SQLShare and joining with GOSlim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First upload dataset\n", "![screen shot1](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.01.38%20PM.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Then find the dataset, execute query, and download the new dataset\n", "![screen shot](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.29.18%20PM.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query (note: replace jldimond@washington.edu with your own SQLShare account)\n", "`SELECT DISTINCT Column1 AS ContigID, GOSlim_bin FROM\n", " [jldimond@washington.edu].[Apalm_blastx_uniprot_sql.tab] anno\n", " LEFT JOIN [sr320@washington.edu].[SPID and GO Numbers] go\n", " ON anno.Column3 = go.SPID\n", " LEFT JOIN [sr320@washington.edu].[GO_to_GOslim] slim\n", " ON go.GOID = slim.GO_id WHERE aspect LIKE 'P'`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Output file downloaded to ./analyses/Apalm/Apalm_GOSlim.csv" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "head: ./analyses/Apalm/Apalm_GOSlim.csv: No such file or directory\r\n" ] } ], "source": [ "!head -10 Apalm_GOSlim.csv" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/sh: ./analyses/Apalm/Apalm_GOSlim.csv: No such file or directory\r\n" ] } ], "source": [ "#Convert from comma-delimited to tab-delimited\n", "!tr ',' \"\\t\" < Apalm_GOSlim.csv > Apalm_GOSlim.tab" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ContigID\tGOSlim_bin\r", "\r\n", 
"contig135011_153678_153601\tcell organization and biogenesis\r", "\r\n", "contig135011_153678_153601\tother biological processes\r", "\r\n", "contig135011_153678_153601\tdevelopmental processes\r", "\r\n", "contig69684\tprotein metabolism\r", "\r\n", "contig113621\tprotein metabolism\r", "\r\n", "contig97647\tprotein metabolism\r", "\r\n", "contig199902\tprotein metabolism\r", "\r\n", "contig78855\tother biological processes\r", "\r\n", "contig8505_94477\tDNA metabolism\r", "\r\n" ] } ], "source": [ "!head -10 Apalm_GOSlim.tab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }