{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Annotation of *Acropora millepora* transcriptome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook details the annotation of an *Acropora millepora* [transcriptome](http://www.ncbi.nlm.nih.gov/nuccore?term=74409%5BBioProject%5D).\n", "\n", "The notebook requires the following:\n", "- [NCBI Blast: 2.2.3](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)\n", "- [SQLShare](https://sqlshare.escience.washington.edu/accounts/login/?next=/sqlshare/%3F__hash__%3D)\n", "\n", "The annotation also requires a UniProt/Swiss-Prot BLAST database. Instructions for setting up this database can be found [here](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/README.md).\n", "\n", "The original analysis was carried out on Mac OS X v10.10.3 running Python: 2.7.9 and IPython: 3.1.0.\n", "\n", "This workflow is structured so that anyone can reproduce the analysis by downloading the repository locally and executing the cells in order." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/sr320/git-repos/paper-Coral-CpG-ratio/data/Amil\n" ] } ], "source": [ "cd ../data/Amil" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Obtain the FASTA file\n", "\n", "The transcriptome is accessible from: http://www.ncbi.nlm.nih.gov/nuccore?term=74409[BioProject]\n", "\n", "Save it to the ./data/Amil directory and rename the FASTA file to Amil_Moya.fasta" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">gi|379072745|gb|JR970414.1| TSA: Acropora millepora Cluster034439.Acmimixed mRNA sequence\r\n", "TCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTG\r\n", "CGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCC\r\n", "ATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTG\r\n", "ACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTG\r\n", "CACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTA\r\n", "TTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGG\r\n", "TGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCA\r\n", "GCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGG\r\n", "AATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT\r\n" ] } ], "source": [ "!head Amil_Moya.fasta" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "52963\r\n" ] } ], "source": [ "#Count the number of sequences\n", "!fgrep -c \">\" Amil_Moya.fasta" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Run blastx against the local UniProt/Swiss-Prot database (comments sit above the command so they do not break the line continuations)\n", "# -query: input FASTA file\n", "# -db: local BLAST database path (adjust to your installation)\n", "# -max_target_seqs 1: maximum number of target sequences = 1\n", "# -max_hsps 1: maximum number of high-scoring pairs = 1\n", "# -outfmt 6: output format = tabular\n", "# -evalue 1E-05: E-value threshold = 10^-5\n", "# -num_threads 8: number of threads = 8\n", "# -out and 2>: write results and standard error to separate files in the analyses directory\n", "!blastx \\\n", "-query Amil_Moya.fasta \\\n", "-db ~blast/db/uniprot_sprot \\\n", "-max_target_seqs 1 \\\n", "-max_hsps 1 \\\n", "-outfmt 6 \\\n", "-evalue 1E-05 \\\n", "-num_threads 8 \\\n", "-out ../../analyses/Amil/Amil_blastx_uniprot.tab \\\n", "2> ../../analyses/Amil/Amil_blastx_uniprot.error" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cd ../../analyses/Amil" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "head: ../analyses/Amil/Amil_blastx_uniprot.tab: No such file or directory\r\n" ] } ], "source": [ "#Check the head of the output file.\n", "!head -10 Amil_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gi|379125674|gb|JT023344.1|\tsp|Q8WXW3|PIBF1_HUMAN\t45.09\t601\t320\t3\t2257\t464\t161\t754\t2e-161\t 493\r\n", "gi|379125690|gb|JT023360.1|\tsp|Q7ZXT3|EDC4_XENLA\t37.50\t216\t120\t5\t682\t59\t1175\t1383\t5e-28\t 121\r\n", "gi|379125695|gb|JT023365.1|\tsp|P02784|SFP1_BOVIN\t50.98\t51\t23\t1\t934\t782\t86\t134\t2e-09\t59.7\r\n", "gi|379125697|gb|JT023367.1|\tsp|P49285|MTR1A_CHICK\t26.10\t272\t191\t5\t845\t45\t47\t313\t1e-18\t89.4\r\n", "gi|379125698|gb|JT023368.1|\tsp|Q5ZIJ9|MIB2_CHICK\t37.87\t647\t379\t11\t186\t2075\t11\t651\t1e-121\t 392\r\n", "gi|379125699|gb|JT023369.1|\tsp|O76093|FGF18_HUMAN\t42.07\t145\t83\t1\t351\t782\t36\t180\t4e-29\t 120\r\n", "gi|379125700|gb|JT023370.1|\tsp|Q5NVI3|RED_PONAB\t50.00\t158\t70\t3\t959\t489\t390\t539\t6e-35\t 143\r\n", "gi|379125702|gb|JT023372.1|\tsp|P55268|LAMB2_HUMAN\t34.02\t388\t228\t9\t2070\t976\t264\t646\t3e-65\t 241\r\n", 
"gi|379125703|gb|JT023373.1|\tsp|Q91WQ5|TAF5L_MOUSE\t40.23\t599\t326\t9\t100\t1857\t2\t581\t3e-142\t 435\r\n", "gi|379125706|gb|JT023376.1|\tsp|Q9UBX3|DIC_HUMAN\t56.99\t193\t81\t2\t235\t810\t1\t192\t5e-63\t 217\r\n" ] } ], "source": [ "#Comparing the tail with the end of the original FASTA indicates whether\n", "#the blast job ran to completion\n", "!tail Amil_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 21027 252324 1944045 /Users/jay/Desktop/blast_jobs/Amil_blastx_uniprot3.tab\r\n" ] } ], "source": [ "!wc Amil_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gi|379072747|gb|JR970416.1|\tsp|Q7ZWG6|PCFT_DANRE\t32.69\t416\t264\t8\t297\t1532\t43\t446\t4e-61\t 220\n", "SQLShare ready version has Pipes converted to Tabs ....\n", "gi\t379072747\tgb\tJR970416.1\t\tsp\tQ7ZWG6\tPCFT_DANRE\t32.69\t416\t264\t8\t297\t1532\t43\t446\t4e-61\t 220\n" ] } ], "source": [ "#Replace pipes with tabs to produce a tab-delimited file for SQLShare\n", "!tr '|' \"\\t\" < Amil_blastx_uniprot.tab > Amil_blastx_uniprot_sql.tab\n", "!head -1 Amil_blastx_uniprot.tab\n", "!echo SQLShare ready version has Pipes converted to Tabs ....\n", "!head -1 Amil_blastx_uniprot_sql.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Manually upload Amil_blastx_uniprot_sql.tab to SQLShare and join with GOSlim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First upload the dataset\n", "![screen shot1](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.01.38%20PM.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Then find the dataset, execute the query, and download the new dataset\n", "![screen 
shot](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.29.18%20PM.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query (note: substitute your own SQLShare account for jldimond@washington.edu)\n", "`SELECT Distinct Column2 as ContigID, GOSlim_bin\n", " FROM [jldimond@washington.edu].[Amil_blastx_uniprot_sql.tab]anno\n", " left join [sr320@washington.edu].[SPID and GO Numbers]go\n", " on anno.Column7=go.SPID left join [sr320@washington.edu].[GO_to_GOslim]slim\n", " on go.GOID=slim.GO_id where aspect like 'P'`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output file downloaded to ./analyses/Amil/Amil_GOSlim.csv" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ContigID,GOSlim_bin\r", "\r\n", "379087027,protein metabolism\r", "\r\n", "379104054,other metabolic processes\r", "\r\n", "379088262,other biological processes\r", "\r\n", "379100196,RNA metabolism\r", "\r\n", "379125389,transport\r", "\r\n", "379073249,other biological processes\r", "\r\n", "379086301,other biological processes\r", "\r\n", "379074703,DNA metabolism\r", "\r\n", "379093891,death\r", "\r\n" ] } ], "source": [ "!head -10 Amil_GOSlim.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Replace commas with tabs\n", "!tr ',' \"\\t\" < Amil_GOSlim.csv > Amil_GOSlim.tab" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ContigID\tGOSlim_bin\r", "\r\n", "379087027\tprotein metabolism\r", "\r\n", "379104054\tother metabolic processes\r", "\r\n", "379088262\tother biological processes\r", "\r\n", "379100196\tRNA metabolism\r", "\r\n", "379125389\ttransport\r", "\r\n", "379073249\tother biological processes\r", "\r\n", 
"379086301\tother biological processes\r", "\r\n", "379074703\tDNA metabolism\r", "\r\n", "379093891\tdeath\r", "\r\n" ] } ], "source": [ "!head -10 Amil_GOSlim.tab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }