{ "cells": [ { "cell_type": "markdown", "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "# Examining FASTA from Assembly" ] }, { "cell_type": "markdown", "metadata": { "button": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Note that the FASTA file is not in the repository because it is larger than 100 MB. A zipped version, `Geoduck-transcriptome-v2.fasta.zip`, is included instead." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mGeoduck-transcriptome-v2.fasta\u001b[m\u001b[m* Geoduck-transcriptome-v2.fasta.zip\r\n" ] } ], "source": [ "ls ../data-results/" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">comp7_c0_seq1 len=210 path=[5082:0-45 293:46-209]\r\n", "TTAACCAAGGTGTGACGCCGACGCAAGGGTGAGTAGAATAGCTCTGTTTATTATCCGAAT\r\n", "AGTCGAGCTAAAAACACAAAGAATAAAGGTTTAACAGTTCTATCTGAAATATATATTTGG\r\n", "ATATCTATTGGTAAGGATACGTTTTATATTAAAAACAAACAATTTATAAAGCGCTCTCGC\r\n", "ACCTTGTTTTTGCATTATGAGCATATACAT\r\n", ">comp30_c0_seq1 len=201 path=[6331:0-200]\r\n", "AAGAAAATTGATTTGAAATTGACTCTGCTTGAATAGAAAAAAATGTTTTGTTCTTTTTTT\r\n", "CGAAGTGTAAATTGTAAATTACTTTATTAAAAAATTCATAGTTTCCGGGCAAGTTATTTT\r\n", "TAATATATTGTAAATGTTGTCATTCAGAGGTTTGTTACGAATATATTGTTTGACAGACAT\r\n", "GCTACTGTTGTACTACTATTG\r\n" ] } ], "source": [ "!head ../data-results/Geoduck-transcriptome-v2.fasta" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": 
[ { "name": "stdout", "output_type": "stream", "text": [ "154407\r\n" ] } ], "source": [ "!fgrep -c \">\" ../data-results/Geoduck-transcriptome-v2.fasta" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "154480\r\n" ] } ], "source": [ "!fgrep -c \">\" /Volumes/web/cnidarian/Geo-trinity/trinity_out_dir/Trinity.fasta" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "154407\r\n" ] } ], "source": [ "!fgrep -c \">\" /Volumes/web/cnidarian/Geo-Trinity2/trinity_out_dir/Trinity.fasta" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "0:999 \t127881\r\n", "1000:1999 \t18040\r\n", "2000:2999 \t5312\r\n", "3000:3999 \t1808\r\n", "4000:4999 \t773\r\n", "5000:5999 \t284\r\n", "6000:6999 \t139\r\n", "7000:7999 \t86\r\n", "8000:8999 \t21\r\n", "9000:9999 \t33\r\n", "10000:10999 \t7\r\n", "11000:11999 \t6\r\n", "12000:12999 \t3\r\n", "13000:13999 \t4\r\n", "14000:14999 \t4\r\n", "15000:15999 \t3\r\n", "16000:16999 \t0\r\n", "17000:17999 \t2\r\n", "18000:18999 \t1\r\n", "\r\n", "Total length of sequence:\t101836734 bp\r\n", "Total number of sequences:\t154407\r\n", "N25 stats:\t\t\t25% of total sequence length is contained in the 8036 sequences >= 2055 bp\r\n", "N50 stats:\t\t\t50% of total sequence length is contained in the 26074 sequences >= 1014 bp\r\n", "N75 stats:\t\t\t75% of total sequence length is contained in the 64502 sequences >= 445 bp\r\n", "Total GC 
count:\t\t\t37647852 bp\r\n", "GC %:\t\t\t\t36.97 %\r\n", "\r\n" ] } ], "source": [ "!perl ../scripts/count_fasta.pl \\\n", "-i 1000 \\\n", "../data-results/Geoduck-transcriptome-v2.fasta" ] }, { "cell_type": "markdown", "metadata": { "button": false, "collapsed": true, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "# BLAST" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 22933 321062 1883878 ../data-results/Geoduck-tranv2_blastx_sprot.tab\r\n" ] } ], "source": [ "# Output of one BLAST run (will repeat to confirm completeness)\n", "!wc -l ../data-results/Geoduck-tranv2_blastx_sprot.tab" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 18943 /Users/sr320/Dropbox/hummingbird-ipython-nbs/data/Geoduck-v2-b/blastx_sprot.sql\r\n" ] } ], "source": [ "!wc -l /Users/sr320/Dropbox/hummingbird-ipython-nbs/data/Geoduck-v2-b/blastx_sprot.sql" ] }, { "cell_type": "markdown", "metadata": { "button": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Had segmentation faults on the second BLAST run.\n", "![sf](https://www.evernote.com/l/AAqWIUNObqNMqoc2uhtLVizkw1m7PDL4Wz0B/image.png)" ] }, { "cell_type": "markdown", "metadata": { "button": false, "collapsed": true, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Redone, with a total of 23165 hits" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false }, "scrolled": true }, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ " 23165 ../data-results/Geoduck-tranv2-blastx_sprot.tab\n", "Sun Nov 1 08:43:55 PST 2015\n" ] } ], "source": [ "!wc -l ../data-results/Geoduck-tranv2-blastx_sprot.tab\n", "!date" ] }, { "cell_type": "markdown", "metadata": { "button": false, "collapsed": true, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "# Checking for non-geoduck sequences (i.e., looking for bacteria)" ] }, { "cell_type": "markdown", "metadata": { "button": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "```BASH\n", "%%bash\n", "for f in query.part*\n", "do\n", "blastn \\\n", "-query \"$f\" \\\n", "-db /Volumes/Data/blast_db/nt \\\n", "-evalue 1e-20 \\\n", "-max_target_seqs 1 \\\n", "-max_hsps 1 \\\n", "-outfmt \"6 std sskingdoms stitle staxids sscinames scomnames sblastnames\" \\\n", "-num_threads 16 \\\n", "-out blastout_\"$f\"_nt \\\n", "2> err_\"$f\"_nt\n", "done\n", "```" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1862 analyses/Geoduck_v2_blastn-NT.out\r\n" ] } ], "source": [ "!wc -l analyses/Geoduck_v2_blastn-NT.out" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "comp159055_c0_seq1\tgi\t531997082\tgb\tKC876030.1\t\t100.00\t223\t0\t0\t1\t223\t23516\t23738\t5e-112\t 412\tEukaryota\tHomo sapiens clone LA13_165F6 sequence\t9606\tHomo sapiens\thuman\tprimates\r\n", "comp159196_c0_seq1\tgi\t459351451\temb\tHF558646.1\t\t89.70\t165\t15\t2\t1\t163\t91\t255\t8e-51\t 209\tEukaryota\tMalassezia sympodialis ATCC 42132 complete mitochondrial 
genome\t1230383\tMalassezia sympodialis ATCC 42132\tMalassezia sympodialis ATCC 42132\tbasidiomycetes\r\n", "comp159331_c0_seq1\tgi\t388571211\tgb\tJN184768.1\t\t94.25\t261\t15\t0\t1\t261\t227\t487\t4e-108\t 399\tEukaryota\tOstrea edulis nucleoside diphosphate kinase mRNA, complete cds\t37623\tOstrea edulis\tOstrea edulis\tbivalves\r\n" ] } ], "source": [ "!tail -3 analyses/Geoduck_v2_blastn-NT.out" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 405\r\n" ] } ], "source": [ "!fgrep \"Bacteria\" analyses/Geoduck_v2_blastn-NT.out | wc -l" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "button": false, "collapsed": false, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "comp28250_c0_seq1\tgi\t514055706\tgb\tKC802228.1\t\t100.00\t181\t0\t0\t18\t198\t216\t36\t1e-88\t 335\tN/A\tSynthetic construct breast cancer binding peptide PC82 gene, partial cds\t32630\tsynthetic construct\tsynthetic construct\tother sequences\n", "comp61908_c0_seq1\tgi\t389297270\tgb\tJQ794641.1\t\t96.40\t222\t8\t0\t1\t222\t240\t19\t4e-98\t 366\tN/A\tUncultured Petrobacter sp. 
clone OTU-17 16S ribosomal RNA gene, partial sequence\t463796\tN/A\tN/A\tN/A\n", "comp95185_c0_seq1\tgi\t539360076\tgb\tKC989926.1\t\t100.00\t343\t0\t0\t1\t343\t1500\t1842\t1e-178\t 634\tN/A\tCloning vector pSTn5-KM, complete sequence\t1389801\tCloning vector pSTn5-KM\tCloning vector pSTn5-KM\tother sequences\n", "comp95185_c1_seq1\tgi\t499074117\tgb\tKC577243.1\t\t100.00\t226\t0\t0\t1\t226\t3832\t4057\t1e-113\t 418\tN/A\tCloning vector pR6KT-miniTn7T-P1eGFP-FK, complete sequence\t1332675\tCloning vector pR6KT-miniTn7T-P1eGFP-FK\tCloning vector pR6KT-miniTn7T-P1eGFP-FK\tother sequences\n", "comp95185_c1_seq2\tgi\t18150422\tgb\tAF409199.1\t\t99.05\t105\t1\t0\t180\t284\t4121\t4017\t1e-44\t 189\tN/A\tShuttle vector pCE320, partial sequence\t183765\tShuttle vector pCE320\tShuttle vector pCE320\tother sequences\n", "comp95185_c2_seq1\tgi\t459360454\tgb\tKC200570.1\t\t100.00\t268\t0\t0\t1\t268\t234\t501\t6e-137\t 496\tN/A\tBinary vector pYBA-300, complete sequence\t1301036\tBinary vector pYBA-300\tBinary vector pYBA-300\tother sequences\n", "comp98229_c0_seq1\tgi\t259116154\tgb\tGQ874257.1\t\t95.79\t214\t9\t0\t1\t214\t790\t1003\t5e-92\t 346\tN/A\tUncultured organism clone 1041059766404 genomic sequence\t155900\tuncultured organism\tuncultured organism\tN/A\n", "comp101927_c3_seq1\tgi\t195934828\tgb\tBC168400.1\t\t80.35\t173\t33\t1\t194\t366\t974\t803\t1e-26\t 130\tN/A\tSynthetic construct Mus musculus clone IMAGE:100068369, MGC:195913 tau tubulin kinase 2 (Ttbk2) mRNA, encodes complete protein\t32630\tsynthetic construct\tsynthetic construct\tother sequences\n", "comp104423_c0_seq1\tgi\t539360076\tgb\tKC989926.1\t\t99.85\t664\t1\t0\t1\t664\t2305\t2968\t0.0\t 1221\tN/A\tCloning vector pSTn5-KM, complete sequence\t1389801\tCloning vector pSTn5-KM\tCloning vector pSTn5-KM\tother sequences\n", "comp115679_c0_seq2\tgi\t259116154\tgb\tGQ874257.1\t\t94.87\t234\t8\t2\t1\t234\t1349\t1120\t5e-97\t 363\tN/A\tUncultured organism clone 1041059766404 genomic 
sequence\t155900\tuncultured organism\tuncultured organism\tN/A\n", "comp115969_c0_seq1\tgi\t512388800\temb\tHG315104.1\t\t100.00\t471\t0\t0\t1\t471\t1264\t1734\t0.0\t 870\tN/A\tStreptococcus sp. DSM 27088 partial 23S rRNA gene, strain DSM 27089, isolate 7746\t1345497\tN/A\tN/A\tN/A\n", "comp126119_c0_seq1\tgi\t371881539\temb\tFQ727577.1\t\t94.69\t245\t11\t2\t1\t244\t445\t202\t5e-102\t 379\tN/A\t16S rRNA amplicon fragment from a soil sample (ferralsol, Madagascar) resulting from a 16 days laboratory incubation experiment in the presence of 13C-enriched wheat-straw : Light-DNA fraction (DNA-SIP technique)\t32644\tunidentified\tunidentified\tN/A\n", "comp135476_c0_seq5\tgi\t254048722\tgb\tGQ233872.1\t\t89.88\t257\t26\t0\t1\t257\t803\t547\t2e-87\t 331\tN/A\tUncultured marine organism clone IOBCBE001_08-A08-SP6.ab1 genomic sequence\t360281\tuncultured marine organism\tuncultured marine organism\tN/A\n", "comp137358_c0_seq11\tgi\t364588385\tgb\tJN436381.1\t\t89.50\t1209\t109\t13\t65\t1269\t127\t1321\t0.0\t 1513\tN/A\tUncultured organism clone SBXZ_5221 16S ribosomal RNA gene, partial sequence\t155900\tuncultured organism\tuncultured organism\tN/A\n", "comp138387_c0_seq4\tgi\t168151307\temb\tCU674602.1\t\t78.25\t308\t65\t2\t1\t307\t322\t16\t1e-46\t 196\tN/A\tSynthetic construct Homo sapiens gateway clone IMAGE:100018300 5' read TUBB2A mRNA\t32630\tsynthetic construct\tsynthetic construct\tother sequences\n", "comp138387_c0_seq6\tgi\t168151367\temb\tCU674662.1\t\t78.99\t714\t150\t0\t1\t714\t730\t17\t3e-134\t 488\tN/A\tSynthetic construct Homo sapiens gateway clone IMAGE:100018301 5' read TUBB2B mRNA\t32630\tsynthetic construct\tsynthetic construct\tother sequences\n", "comp141713_c0_seq1\tgi\t312152669\tgb\tHQ448367.1\t\t75.84\t592\t131\t6\t1356\t1944\t663\t81\t2e-74\t 291\tN/A\tSynthetic construct Homo sapiens clone IMAGE:100071791; CCSB003826_02 polymerase (RNA) II (DNA directed) polypeptide E, 25kDa (POLR2E) gene, encodes complete protein\t32630\tsynthetic 
construct\tsynthetic construct\tother sequences\n", "comp142037_c1_seq1\tgi\t117645665\temb\tAM393421.1\t\t72.23\t1019\t246\t31\t1623\t2624\t1592\t2590\t1e-70\t 279\tN/A\tSynthetic construct Homo sapiens clone IMAGE:100001729 for hypothetical protein (CYFIP2 gene)\t32630\tsynthetic construct\tsynthetic construct\tother sequences\n", "comp144044_c1_seq9\tgi\t293651473\tdbj\tAB553833.1\t\t91.59\t416\t35\t0\t1\t416\t11771\t12186\t1e-160\t 575\tN/A\tHuman artificial chromosome vector 21HAC4 DNA, isolated from the short arm, clone: YAC/BAC#37-2\t751903\tHuman artificial chromosome vector 21HAC4\tHuman artificial chromosome vector 21HAC4\tother sequences\n", "comp153429_c0_seq1\tgi\t29825358\tgb\tAY238516.1\t\t99.19\t246\t2\t0\t1\t246\t678\t433\t2e-121\t 444\tN/A\tSynthetic construct triacylglycerol lipase gene, complete cds\t32630\tsynthetic construct\tsynthetic construct\tother sequences\n", " 20\n" ] } ], "source": [ "!fgrep \"N/A\" analyses/Geoduck_v2_blastn-NT.out \n", "!fgrep \"N/A\" analyses/Geoduck_v2_blastn-NT.out | wc -l" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "collapsed": true, "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }