{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculating CpG ratio for the *Acropora millepora* transcriptome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workflow calculates the CpG ratio (CpG O/E, the ratio of observed to expected CpG dinucleotides) for contigs in the *Acropora millepora* [transcriptome](http://www.ncbi.nlm.nih.gov/nuccore?term=74409%5BBioProject%5D). The CpG ratio serves as a proxy for germline DNA methylation: methylated cytosines in CpG dinucleotides tend to mutate to thymine over evolutionary time, so a low CpG O/E suggests a historically heavily methylated sequence.\n", "\n", "**NOTE: This particular workflow uses GenBank accession numbers as contig IDs, so that the resulting CpG O/E file can be joined with *A. millepora* gene expression data in `Amil_expression.ipynb`.**\n", "\n", "This workflow extends another IPython notebook workflow, `Amil_blast_anno.ipynb`, which generates an annotation of the same transcriptome. It assumes that you have already created the directories and files specified in the annotation workflow." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/data/Amil\n" ] } ], "source": [ "cd ../data/Amil" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "head: Amil_Moya.fasta: No such file or directory\n", "\n", "number of seqs =\n", "fgrep: Amil_Moya.fasta: No such file or directory\n" ] } ], "source": [ "#fasta file\n", "!head -2 Amil_Moya.fasta\n", "!echo \n", "!echo number of seqs =\n", "!fgrep -c \">\" Amil_Moya.fasta" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Converted 52963 FASTA records in 1149658 lines to tabular format\r\n", "Total sequence length: 71250280\r\n", "\r\n" ] } ], "source": [ "#Converting FASTA to tabular format and placing output file in analyses directory\n", "!perl -e 
'$count=0; $len=0; while(<>) {s/\\r?\\n//; s/\\t/ /g; if (s/^>//) { if ($. != 1) {print \"\\n\"} s/ |$/\\t/; $count++; $_ .= \"\\t\";} else {s/ //g; $len += length($_)} print $_;} print \"\\n\"; warn \"\\nConverted $count FASTA records in $. lines to tabular format\\nTotal sequence length: $len\\n\\n\";' \\\n", "Amil_Moya.fasta > ../../analyses/Amil/fasta2tab" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/analyses/Amil\n" ] } ], "source": [ "cd ../../analyses/Amil" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gi|379072745|gb|JR970414.1|\tTSA: Acropora millepora Cluster034439.Acmimixed mRNA sequence\tTCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTGCGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCCATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTGACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTGCACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTATTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGGTGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCAGCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGGAATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT\n", "gi|379072746|gb|JR970415.1|\tTSA: Acropora millepora Cluster034438.Acmimixed mRNA 
sequence\tCGCCCCCACGTCGTCATCTGACGTTCCTGTCCTGTTGCTAAATCAGCCTATTGATTGCGGGAACACATCAATCAACTACTAAACAACAGAATCCTGGGTTTTCAGACTTACAGTGTCTCTGCGATGAGGAGAATGTTCCCCTGTCACCACAAGCACTCCTGTCTGGGCACTAAGACAAATCAGCAATGAGACATTCTTGGCTTCCAAATCAATAAGTGCACATTAACTGGTGTTTGGAGAGACCAATCACCTATCTAGATATGGTCCACCATATTGCAGATTGAAACAATGAATAATAGAACACAAACAATACCCTAACTTGACCACAATAGAAGGTACAGGTTATAAGGACAAATAACAACAGAGGTCTGGAAAAGCCACAGGATTACTCAGTTTGAGGCAAGACATGCCACCTCATAAAATATCTTTGAACATCTATTATTGAATGTTTACATTAACCACCTGTAGATAAAGTGCTTAAGCCTCTTTGTAAAATACAAGAACACAAAACTATATATACACTAATTTGCAGTATCTCAAGTTGTTGTAACAGGCTACTCACTCAATCCTGTGTCCCTTCATATCTTTCATCAAATCAGAGCGAGCATTGGAATGCACACAATGTAGC\n", "gi|379125706|gb|JT023376.1|\tTSA: Acropora millepora Cluster011149.Acmimixed mRNA sequence\tGATATTGTCGTTAGTCTTGCTGTCATTTACAAAGCGCGTGTTTAGCAGCATGGAGCATATGACGCCATGAGTTCCCTAGACGCTTAGTAACCAAGCTACTGTTATCGACACGAGTGCAGTCGCTCTCGGACAAGCGATGATGCGGGTAATGTGAAGAGTCGATTCAATTCATTCTCCTAGAGGAACATGTCCGTTAGTTTTAAAGAGGGAGCTAGCCAAGGCTCTGCACAAACTATGAAAACCACGCCTAAAAAACACAGATGGTATCTTGGGGGTATCGCTTCTGCCATGGCGGCGTGCTGCACCCATCCATTGGATCTTCTTAAGGTTCATTTACAAACACAACAGCAAGCTACTCATAACCTTACATCCATGGGAATTCATGTTGTCCGTACGCAGGGTGTGTTAGCACTTTATAATGGACTTTCAGCCTCCGTAATGAGACAGTTAACATATTCCACAACACGCTATGGCTTATATGAAGTGGTAACAGCAGAGCTAAAGAAAACTAATGATCCAATACCCTTTTACCAGAAAATTGCTGTGGCAGCTGCAGCAGGTTTTGTGGGAGGAATTGTTGGAAACCCTGCAGATATGGTAAATGTAAGGATGCAAAATGATGTGAAAATGCTTGACTTGGCAAAAAGAAGAAACTACAACCACGTATTTGATGGTCTCTATCGAACAGCAACAGAAGAGGGTGTGAGCACATGGATGAGAGGTGTGACTATGACATCATCAAGGGCTCTTCTCATGACAGTGGCCCAGATAGCCTGCTATGACCAAGCAAAGCAGTTCTTACTTACAACAAGGTAAGGATAAACAAACATGGCAACTGTTTATTGATTGTTGTAATCATTGTGATTATCAACACACAGTATTTGTTGCAATGGTCACTCAACATTAGAATTGACATCAAAACCTTCCCCACTGATTGCAAAAAGTGTTATTTTCCTGTTATATAAACCTGTCAGCTGGCAATTAATTTTATATTTGCAGGTTTTTCAAGGACAATATCGTCACACATTTTACAGCGAGTTTCATAGCGGGTACAATTGCTACTAGCATTACACAGCCAGTTGATGTAATGAAAACAAGACTGATGGAGGCAAAACCTGGACAATATAAGAGTGTAGCCCATTGTATTCTTTATACAGCAAAACTTGGACCTCTTGGATTTTATAAGGGTTTTATTCCAGCTTGGGTTCGCTTAGCACCTCACACAATCCTCACATGGATTTTTCTGGAGCAGCTGAGGGTTTTC
TTTCCAATAAAGCAGTAATAGTAATTCTCAAAGTATTAACTATGATTAGATTTTGTTTAGAAATAAAAGAAAGAGTTATTTATTTGAAAATTTAATTAAATGGAGTACATAATATTGTTTCAATTCCCAAAGGTGGAGGTAGGGGAAATAGCAGCATTTGCTACTCTCATTAGTCTGAAAGGAGTCCTATCTTTGCCACCAAGTCTAAGAACAAGGTGAAAATAAAAACAGGTGTAATGAGAAAATGTCTGCTCAAAATCTAGATTTTGCATGGTCATTTTTTAAGATGGAATCAAATGACGAAGGAGGAGTTGCTTGTAATCCAAGTTTAACCAACACTAGGGCGAAATTTATAATAATAATAATGATAATAATAATAATAATAATAATAATAACAACAACAAGAGTTATTGTTGAAGGGATTATCCATTCAAACTGATTACTCTCTCGCAAAGCAATTTGATTGTGCAAAAGGAACAATTATTGAGTTGTGTGAGAATGCACTGGGTAATGACATTGAAATTCTGAAGAATTACTGTTTTAAAATTTTCTGTTTAGATTGGCATCATTTTATAAATTACTTTCAGTAAGAATCCTCAGAAAGGTGTGATTAAGTTGTAGTATTAATAATAATTATTATTGAGATGATAATTGCGTTTAAGTGTTGTAATACAGTTTTTCATATCCTGAGTAATACATGTTTATATGCAACGTCAGTGGCAAAAAGTCAATGGTTTAATAAATTGGTCGCCGTGCAACGTAATTTTATTAATTATTTAACAAATAAATAAGCCTAAGCCGTGTTGCTCT\n", "gi|379125707|gb|JT023377.1|\tTSA: Acropora millepora Cluster028086.Acmimixed mRNA sequence\tAAGGTTCCAGAGCTTAATTCTATTGCGTTAGAAATTTTCCAGAAATGTATGTTAAATGGGATCACTATTGATGTGAATTGGATCCCAAGAGATTTCAACAGTGTGGCAGATGAGATTAGTAAGATAATAGATTACGATGATTACACAATTAATGATGATATTTTTGCTTTTTTGGACAAATCATGGGGACCCCATACAGTCGATCGTTTTGCGTGTCACTATAATAAGAAGCTACCTTTATTTAATTCGAAGTTTTTTCAGCCAGGCACGAGTGGGGTGAATGCTTTCAGCCAAGACTGGGCTTTTGCCAATAATTGGTTATGTCCTCCCATTTATCTCACCGCGAGGGTAGTTAATCATTTGAAAGCTTGTAGAGCGGCTGGGACCCTTATCGTCCCTTTGTGGAGATCGGCACATTTTTGGCCAATTATTTGTGACGATGGAGTTCACTTTAGTAACTTTGTGCATGACTGGTATGTTTTGCCGCACATTCCTAATTTATTTATTAGAGGTAAAGCAAAGAATTGTATCTTCGGAAACGGCCAGTTGAAGTTTATAATGTTGGCATTAAGGATTGATTTTTCTGTCCCTCTTAGGTCCGCTGTACGAGGATTTTGTACTGAGTTTAAGCAACTCTGCACTGTCTGTACGACGTGACGTCGTTTGTTGTGTCTTGTAGGGCTTGAGGCCACGGTTGTGCGTGTTTTTTCACACGTCTAGGTGATAATAGTGATAACACTGAGAGATCTGTGTTTTGTTGGCTTGAGGCCATGGAGGTTGTTCCCTGCGTGGAACTGGAGTTTTGGCATTTTAGGCCACAGAGATTGACGTTCTGAGAACTGTCGTATTTTACTATTCCAGATTGTGTATTATGGCGGGGCTAGGGTTTCGCAACTCGTACTCCTTCTAGCGTAGCTAAAGTTTCATTCGTGTTCAATCGAATAAATCACAAGAAAGAGAATGGATTCGTCGG\n" ] } ], "source": [ "#Checking header on new tabular format file\n", "!head -2 fasta2tab\n", "!tail -2 fasta2tab" ] 
}, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gi\t379072745\tgb\tJR970414.1\t\tTSA: Acropora millepora Cluster034439.Acmimixed mRNA sequence\tTCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTGCGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCCATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTGACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTGCACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTATTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGGTGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCAGCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGGAATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT\r\n", "gi\t379072746\tgb\tJR970415.1\t\tTSA: Acropora millepora Cluster034438.Acmimixed mRNA sequence\tCGCCCCCACGTCGTCATCTGACGTTCCTGTCCTGTTGCTAAATCAGCCTATTGATTGCGGGAACACATCAATCAACTACTAAACAACAGAATCCTGGGTTTTCAGACTTACAGTGTCTCTGCGATGAGGAGAATGTTCCCCTGTCACCACAAGCACTCCTGTCTGGGCACTAAGACAAATCAGCAATGAGACATTCTTGGCTTCCAAATCAATAAGTGCACATTAACTGGTGTTTGGAGAGACCAATCACCTATCTAGATATGGTCCACCATATTGCAGATTGAAACAATGAATAATAGAACACAAACAATACCCTAACTTGACCACAATAGAAGGTACAGGTTATAAGGACAAATAACAACAGAGGTCTGGAAAAGCCACAGGATTACTCAGTTTGAGGCAAGACATGCCACCTCATAAAATATCTTTGAACATCTATTATTGAATGTTTACATTAACCACCTGTAGATAAAGTGCTTAAGCCTCTTTGTAAAATACAAGAACACAAAACTATATATACACTAATTTGCAGTATCTCAAGTTGTTGTAACAGGCTACTCACTCAATCCTGTGTCCCTTCATATCTTTCATCAAATCAGAGCGAGCATTGGAATGCACACAATGTAGC\r\n" ] } ], "source": [ "#Replacing pipes with tabs so that the accession becomes its own column\n", "!tr '|' \"\\t\" < fasta2tab > fasta2tab2\n", "!head -2 fasta2tab2" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Added column with length of column 6 for 52963 lines.\r\n", "\r\n" ] } ], "source": [ "#Add column with length of sequence (the sequence is tab-separated column 6, counting from 0)\n", 
"!perl -e '$col = 6;' -e 'while (<>) { s/\\r?\\n//; @F = split /\\t/, $_; $len = length($F[$col]); print \"$_\\t$len\\n\" } warn \"\\nAdded column with length of column $col for $. lines.\\n\\n\";' \\\n", "fasta2tab2 > tab_1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gi\t379072745\tgb\tJR970414.1\t\tTSA: Acropora millepora Cluster034439.Acmimixed mRNA sequence\tTCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTGCGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCCATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTGACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTGCACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTATTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGGTGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCAGCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGGAATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT\t628\n", " 52963 635556 76310959 tab_1\n" ] } ], "source": [ "!head -1 tab_1\n", "!wc tab_1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JR970414.1 \t TCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTGCGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCCATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTGACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTGCACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTATTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGGTGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCAGCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGGAATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT 
\t 628\r\n" ] } ], "source": [ "#Keeping accession ($4), sequence ($11), and length ($12). awk splits on any whitespace by default,\n", "#so the six-word description occupies fields 5-10 and the sequence lands in field 11\n", "!awk '{print $4, \"\\t\", $11, \"\\t\", $12}' tab_1 > tab_2\n", "!head -1 tab_2" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JR970414 \t TCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTGCGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCCATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTGACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTGCACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTATTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGGTGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCAGCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGGAATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT \t 628\r\n", "JR970415 \t CGCCCCCACGTCGTCATCTGACGTTCCTGTCCTGTTGCTAAATCAGCCTATTGATTGCGGGAACACATCAATCAACTACTAAACAACAGAATCCTGGGTTTTCAGACTTACAGTGTCTCTGCGATGAGGAGAATGTTCCCCTGTCACCACAAGCACTCCTGTCTGGGCACTAAGACAAATCAGCAATGAGACATTCTTGGCTTCCAAATCAATAAGTGCACATTAACTGGTGTTTGGAGAGACCAATCACCTATCTAGATATGGTCCACCATATTGCAGATTGAAACAATGAATAATAGAACACAAACAATACCCTAACTTGACCACAATAGAAGGTACAGGTTATAAGGACAAATAACAACAGAGGTCTGGAAAAGCCACAGGATTACTCAGTTTGAGGCAAGACATGCCACCTCATAAAATATCTTTGAACATCTATTATTGAATGTTTACATTAACCACCTGTAGATAAAGTGCTTAAGCCTCTTTGTAAAATACAAGAACACAAAACTATATATACACTAATTTGCAGTATCTCAAGTTGTTGTAACAGGCTACTCACTCAATCCTGTGTCCCTTCATATCTTTCATCAAATCAGAGCGAGCATTGGAATGCACACAATGTAGC \t 628\r\n", "JR970416 \t 
CGTCCTCGTGACTCATCATTGCTTTTGTCAATACACGAGAGTGAAAAGTCCCAGATAATTGGGAGGCTGGAGGATACTGTGATCATAATATTGTAACTATTAATAAAGTATGAACAATGTGGCTGCACTGGGAAGAGCACCAATATGGCCTGCATGGGTCAGGAGCCTGACTTGGGGCTGTGTGCTGGTTGAGTCTGTAACAAGGTTTTCTTCACTGTTCCAAGAGGTTTAATCTTCAGTTCTTATAGGTTTGTATTTGATAGTGAATCACTATGGCAACCAATTGGAGACAGCTGCTGACTGTAGAACCAGTGATGTTCTTTTATGCATATGGCTTGTTTATGGCAATGCCTGTTTTCCAGCAATATATCTACCATCGGCTTAGTGAAGAACATCATTTTCCATACAACTTTAAGGAACAAACCTCAAGTTGCGGAAGTTCCTTGAATGAATCAATGGAGAAACTTGAGAAAAAGGTGCAGTCTTCTGCTTCTTATGTTCAACTGGGCGTTGTCATGTTTTCTACCTTTCCATCGATTGTGATGACCCTCTTCATGGGTGGTTGGACAGATAAGGTGGGCCGACGCCCTGCTCTGATCATGCCACTTCTTGGCAGTGCACTGGATGCTGCTGTTGTACTTACTGTCATGTATGCCAAGTTGCCTGTGTACTGTCTCTTTATTGGCTCTTTTATCCATGGAGTTTGTGGATATTACACCACCATTCTCTTGGCCTGTTTGGCCTACATTGCGGACACCACTGAGCGGGGACATTTTGCATTTAGATTGGGTATTTTAGAAGCCATTGTTTTTGTGGGAGGAATGGTTGCCCAGTTAACAAGTGGGTTTTGGATTGAAAAGCTGGGATTCACTGCTCCGTATTGGTTCATATTTGGATGTGAAGTCTTTGCGCTGATTTATGCAGCTGTTCTTGTTCCTGAGTCAAAATGCCCATCCAAGGAAGAGAGAGGAAAGCTTTTCAGCTTGGATAACTTGAAGTCTTCTTGGAAAGTTTACAAAAAGGCTGTGGGCACTAAGAAGAGAAATTTAATCATTTTGACATTTTGCTGTGGTATCACGGCCATACCAATCATGGGTATACGTGGAGTTTCGAGCCTGTTTTTGCTTTATTCCCCACTCTGTTTCTCACCAGAACGTGTGGGATATTTTTCAGCCTTGCAGAATTCTGTTTATGGTGTTGGTGGTATTGTGACTATAAAAGCATTTGGAATGTGTCTTTCTCATGTCAATGTAGCGCGCATATCCATTCTATCATATCTAGGATTCCTCGTATACTTTGGATTTTCAAGAACTCTGCTCATGGTCTTTTTGAGTCCCTTGATAGGGATTCTTGGTGGAGCTGTAGCCCCTTTAATCAGAGCGATGATGTCCGAGATTGTCAGTTCAGATGACCAAGGTTCACTTTTTTCAGCCACATCATCTATGGAGGTTCTATTCACGTACCTCGGAGCTCTCCTATTGAACTCACTCTACGCGAAATCTCTAAAATTCAATGCTCCTGGGTTTGTGTTTTTCCTGGCTGCTGGCATACTATTGCTGCCCCTCGCTTTAACTTTCTGCTTAAAAGATCTGTCCATGTTTAAAATGGGAAGGAAGCTAATAAATAAAGCCAGTAGATATGAGAGCATAACAGACGAGGAAGACGGGAGAGAGGGCCAAACAGGATCTCCCGATTCACCGTATTCTGACATAACTGGCGATGATCTGCATGTTATTCCGGGCGGTGACGGGAGGAATGTGTAAACTGAGGAAGTCCGCGTACGATGGGATAAAAACAGGTTTCTCTTCTTGGGGTTCTAGTGCTAAAGTTGCAGCCAAATTCTGACCAAAAAATCACAGAAGACGCCATGATACTTCATTTTGATGTCGATCATTGTACTGTTTGATATATTTTGGAACGTAAGATATGGTCGCATGCAATGAAAACCTTATTTCCATGCTGATTTGGTGTCATATATTTTGGTTCAACCAATCAGATCT
GCTGTTAGATGAAAAAAAGGTGAAGCGTAGAAAATGATAAGCGTAACCAATAATAGCCAAAGAGGTCATTGTTCTTAAATGGAAACGAGGCTTCTGTCTCACGTGATCAAAGACTCGTTTGGTCTGTCTATCCACTCTTTGTAATGCGACCAGGAGCTTTACATCTCACATGTAGTTGTCTTTTCAAGGAAAAGGAATTTATTTAACACTGAAAATAGTTTGAACCCAGGAGTGACACGCCTTGATTGAATCTGATTGTATGGGCAAGTGGAATCCTGAGAAGAATTGTGGTTGGCTGTGACTGACGTTTCGACAACCTTTGCGGCAGAAATTTTCTGTGTCAAGTAATGATGTAATCAGTTGCTTTTATGTTCTTATTAATAAGTACTCCTCTCGCCAAGACGACCAGCTTTCATCAAGAAATTTTATTTACTTGTCGATTTGTAAATATTTCATAATCGACTGTCCCCAGTCGCTCAAAGGCTGGGTGGCACTTAGTCTATCCACTGGACAGAGACTTGGACATAGCGCTATCCACAGGTTGAACGACAGGGGCCTGGTATGTGTAAAATTAAGCTGAATATAGCAGATATCCGCGGACTTCGAACACCGATACGTTATTGAGATTAGGCGGTAGAAAGATAGATAGATAGAAAGATAGATAGATAGATAGATAGAT \t 2679\r\n", "JR970417 \t TCGACTTCGTCTTGGGAAACAGATATTCAAGGGAAACAAAATTCATTGTTCCCCAAAGGACCAGTCATTAAGTTATTTGTTGCATAGGAAAACAAAGGAAGAAAAAGCGTGCTGAGATTCCAGTGACAACAACAGGCCAACTTCAACGGCATGCTCAGATCACGTGTAGCAGCAGTCAACGTAGCCCGGGTAACAGTGAACTATTTCCCATTTTACGTTATCGTTTTCGCAATTGTTGCTGCTTGCGGCATTTGGCGGTGAACAGTTTCACAGTCAGAGGTCATGTGGCCATGAACTAGTCAATGAATGGGTGAGCCATGGCGGATGGCGGGAAGAACACCAGCTGTACAAAAATTTGTGTTAAGGTTGCATAAGTTTGATTCACAATTACAGTGCACAATAGAACTGAATCATAGGAAAACATGGACAAAGCACATCAAATCAAAACAAACATTGCTTGATATGCACCACAACGAAACTATTGTGAGAAGTCATTGAGACTCGGGTGGCTAAGTGTCTCAGTACCATATCCAGCAAAGTTGTATGAAAGGAGAGAATGATGCTCGCCCTTTATGTTATTTAGTCAGGGAGCCATTTAGTTTTTGGTGTGGTCACGCAACGCTACTCTC \t 629\r\n", "JR970418 \t TTGAGACGTAGCCTGGTGTAACAACCAATAAAATCCCTGTAAGCAAATCAGTTATTGCCAGGTTAAACAGCAGCATATTATGAGGCTTTCGTAACATTTCACGTTTCTTGAATAATACGAGACAAAACATGAGGTTGAACGTGAAGGCTGAGCCAGCGATGACGGTGAAAGCGGTTTCAAGCACACTCTCTGATAGCGACTCGCTTTGTTCGGTAAATGCAGTTGATTGGTTGTTTTCTTGTCCATTCAAGCTATATGATTGGCTTATTTCAGTGGCTGATATCAACGTTGATTGGTTGATTGCAGATTCCATGATGGCAGTTGATTGGCTTAATCAAAGCTTCGAACGCAGTGTCATTAGTTTCGTGCTTTTAGATGTGATGATGCCGAGTAAATTCTAAATCCTGTTTAGAAAGACCAGGCTCAAAGGTCAGTGTGACAAGTTATCTTCAAACAGATTCTACTTCTTTCAAGTGTTCTGGTTGTAGTCTTTCTGTGTCGAGCTTTTAAGATGTGTGCAGTGGGCCACCTCAATCAGTTACAACAGGATCACTTCCGGAAATGAAAATACAGGTTAATTCTCTGGAGTAAGCTTTGAGGTTGGATAAAACATTACTGACCGATGTGTG \t 
629\r\n", "JR970419 \t CGAACGTCCTCAAGCTGATTACCACTGCAAGAATTCAAAAATGTTTATTTTTGCACCTCAAGGTTGCTTGGAGGAGAACAAATAAGTTATAGTGCTTTGCGAAATACTCTTGAGCAAGCCAGTAAGTACCCACATAAAGCACTATTCACTTGTAACACATCAAACTGAAAAAATATATATATTGCCACAATTTTTGTACCACCTCAACAAACCCACTCCTTTAAATTGCTTGATTCTTTTTTTCACCAAAATACTGATGTTACTCCCACGACCTATGAAACTGTCACCTACTTCATTGACAAGCCTGAAATGCAGTAATATCTGCGGTAAACTTATAAAGGCAATAGCCAATGTTTTTTTGCACCACACATCCACCAACACTTTGTGAGAATATCCATACGATGTTTAAGGGTAAGGGCATTGGATCTGTATGCAGCTGCAGTAGGTCTCAATTCTGGTCTGATCGCTATGTATGGTGGACTGGATTTCAATTTGTGTCAACCTAATTATCCCAGAGTTTAGGATTAAACTCTGGGATCATCAATTCTTAACTATAAGGACTGTGTGCTTTGACTCAGAAGTATCAACTTAAAGCCCTAAATCCACAGAAGG \t 612\r\n", "JR970420 \t TTTTTTTTTTCTTCAATTTTTACTTCATTTTACGTCGTAGATAGCACTAGAGCGGCTGTGGAATGTACGAAAACGAAAAAAAGTTCTTACAAGGCGTGTGAAACTACTGCTTTCCATCGTTAAATATGCAAATTTGAGGTTTGGCTTCCTAATTACGGAACGATCTCCTCACCGCGAAAGAGGCCGCTTAATTTGAGCCATGAACTCTCGTTGAAATGGCTTTTCGGATAACAGATTCGCTAGTACTTTAAATATTGCACTGATTATTTATCAAATCAAAGCGGGTCGATATCAACCACATTGCCCACTAGAATACTTGATTTCTCGGACATACTCAATCACATTATCAGTAGCGGAGTCTGAGAGCAAAGCAAAGATGATGCATATGGCAAGCGGCTTTTACGGTTAGGGTTTAGGGAATAAGGGTTTAGGAATAGTCTAGTACTGTATCAAGAATACATCAAGTATAGAAGGACTTTACCACAAGTCGATTTTGTTGAGCCACGAAAACTAGAGAATCCCGTGAAAAACGCTCGAGTCATTGACTTCTAAACGCGCAAACCACAAAATGAAATAACAAAAAAGTCAAACCACAAAATGAAATAAAAAAAGTCATCTATTTAGTTTTC \t 629\r\n", "JR970421 \t ATTTACTTTATATTACTGAAGTTCTGCTCATGGCGCAACACTGTCCCTTTGAGTTCCAGGCCTCACATGGATTAGAGGAAAGTGTAGGCATGGGTTTTGCCCCCTTTTCATTTTAGTCATGGAATTCCTATCATTTAGGAACCTTTTTTCGGATGTTGTATTCGCTAGGAATGGCCATGAGGTGATATCAATCTGCATAAGAACAGTGGAAGCTTGCTCAGACCGAAATCGATTGCATATTTCTGCTGATAATAATTATAATGAATGGGCATGATCCTTGCAGTGTGTTAGCATTATTTAAGCAGTAACAAGAAAGGCCTGAATGGGGATTCGAACCTTGACCTCTGAGATGCCGGGGCAATGCTCTACCAGTTGAGCCAGTAGGTCACCCTGGAGCTGTTTGTTATGTGGCTTGATGATTAACTTGCAGATGGCCTGATAGCTCAATTGCAAGGATTTTACAAATTCATTTCAATCTGCAGTTCTAATGCATGATATATTTACACCAAATGTATATACGTTTTCCTGCATTGCAATTACTTAAATTCTCATTTACATACTGTAACTATTATATTCACCTTTAAGGTGGCTTACTGCAGTTTTATAAGGGTTCAAGAGGTGTATACCA \t 628\r\n", "JR970422 \t 
TAGTCTTTCAAGCTTTGGAATCGCGCAATGAAGAAAAAGAACGAGAAACGATGTTGGAAGCGCGAATTCACTTTCAAAAAAAGTACAATATAAGCCACGCCGACATGCAGACCTTCGTGAACAAGATCGAGGAGATAGTAGACCACGGCTTCAGCCAGCACTGGATGAAGAGATGGACGATACTGGGCTCCCTTTTCTTTGCCGGTACAGTTGTCACGACAATAGGTTCAAGGTTTTCCTCTTTTGCTTTGTAACCAGGTTATGGCCACGTGACTCCCTGTACCAATGCGGGTCGAATATTTTGCATTCTATACGCCTTAGTTGGAATACCGCTTACATGGCTCATGTTATCAACACTGGCGAAGCAGATCAATGAGAGGATAAAAAAAAAAAAAAAAAAAANNNNNNNNNNATGCTGTTACGAGCGATTCCTCCGAAGAAAACCCACAGGAATAGGGTTAAAAACTGCTTCAATTACTTTGATGATAAGCATAATGATGATTTTAATCATAGCTCTGTTTGGTTGTTATCTCGAAGGATGGCGCTACATAGACGGAGTTTATTTCGGATTTATAACCCTCACGACAATCGGTTTCGGGGATTTTGTCCCGCTTCACCCATCCCCCAG \t 628\r\n", "JR970423 \t AGTCGGTCCATTTTGGAACTCTAACAAGGTCCACGCGACCTCGCTCAATTGGGAATCCTTCACAGGTTCCTGGATGATAGACGGTGTAGTTGTCGAGTCCTTTACAATAGCTTAACTCGTGCCAACATTTGTTGTCGTATGTTGTACCATTGGCTGTGCACACTGGATCCTGATAGGATGGACAGTCTTCAAAACAGACACAACGTGCTTCGTGGGCACTGTATGTTCTACACACTCCAAACGATGGGCATTCAATGTCCAGGCAAGGATCTAGATCTCCGGTGACAATGTAGCTAACATTCAGAGGATCATGCAAATTTCCGGATCCACTGAGGTCCTTTACACAAATTGTCATTGACGTCAAACTGATGTCCTCCACCCAGGCTGTGATGATGTTGTTTTCTGGAAGGGAGTGGTGTCCTAATTGACGATCGTATTTGTGATGGACGGAGAGTAGGACCAGCGGTGGTGCGTAAAAAGTAGTGTTGAACTTCGTTTGTTGACAGAATCCATAGTTGTCGTCGGCTTTCGGGAAGCCACTGTTAGGAAAAAACTCTTCTCCCGCTAGCGTGAAATTAACACCAGGGACATTGCCTTCGGACGCGAACCAGTCCACAATGGTGTCTTG \t 628\r\n" ] } ], "source": [ "!sed 's/\\S*\\.1\\S*//g' tab_2 > tab_3\n", "!head tab_3" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Instead of using the step to deal with description name issues above, the file used to count Cs and Gs will only include the sequence\n", "!awk '{print $2}' tab_3 > tab_4" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#This counts CGs - both cases\n", "!echo \"CG\" | awk -F\\[Cc][Gg] '{print NF-1}' tab_4 > CG " ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Counts 
Cs\n", "!echo \"C\" | awk -F\\[Cc] '{print NF-1}' tab_4 > C " ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Counts Gs\n", "!echo \"G\" | awk -F\\[Gg] '{print NF-1}' tab_4 > G " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JR970414 \t TCATCATTATTTCTTTTTGTTTTTCTGTGATCTTCGGTCAATGCGATAGATCCTCGAGTTATCGTGACTGCGCTCCAGACCAAGTCTGTTGTAGGAGGCAGTGTTTTTACAGCTCGAACTGTTTATATCTGTCTTGCTCCATGGACTCCGATTGTTCAGTGAATGAAGTCTGTTGTAGCAGCAAGTGTCGTTCTGGTNNNNNNNNNNCTGACTGCAGTGGGGATTTTTGTCGCTCGAACAATGATTGCAGCGTTGGGCAAAAGTGTTGTGTGAATACCTGCACCAACTATGATTGTGAGGACCCTACCGTCGCCATTCTTATCGCGGTAGTGGGTTCGCTCGTGGGCTTATTTGTTGTTTTCATTTCAATTTACTACTGCCACAGAAGAGCTCGTTTGGGTCGTTCCGGTACAGCAGAGGTGGGAAGACAAGTTGCCCCAACCGATGCTATCACAACCCAATCAGCAAACCAACAAGGCTACGCATATCAGCAACTCCCTTAAATATCATCAGTATCAGACACCCATTTACAATCCGGAGACACAGAGACAACCAGGAGGAATACTTACTTCACATCGTGCATATGGTGAACTTCAAACCACTTGATCCCCATGTCCAGCGGAAGACT \t 628\t25\t145\t140\r\n", "JR970415 \t CGCCCCCACGTCGTCATCTGACGTTCCTGTCCTGTTGCTAAATCAGCCTATTGATTGCGGGAACACATCAATCAACTACTAAACAACAGAATCCTGGGTTTTCAGACTTACAGTGTCTCTGCGATGAGGAGAATGTTCCCCTGTCACCACAAGCACTCCTGTCTGGGCACTAAGACAAATCAGCAATGAGACATTCTTGGCTTCCAAATCAATAAGTGCACATTAACTGGTGTTTGGAGAGACCAATCACCTATCTAGATATGGTCCACCATATTGCAGATTGAAACAATGAATAATAGAACACAAACAATACCCTAACTTGACCACAATAGAAGGTACAGGTTATAAGGACAAATAACAACAGAGGTCTGGAAAAGCCACAGGATTACTCAGTTTGAGGCAAGACATGCCACCTCATAAAATATCTTTGAACATCTATTATTGAATGTTTACATTAACCACCTGTAGATAAAGTGCTTAAGCCTCTTTGTAAAATACAAGAACACAAAACTATATATACACTAATTTGCAGTATCTCAAGTTGTTGTAACAGGCTACTCACTCAATCCTGTGTCCCTTCATATCTTTCATCAAATCAGAGCGAGCATTGGAATGCACACAATGTAGC \t 628\t7\t145\t107\r\n", "JR970416 \t 
CGTCCTCGTGACTCATCATTGCTTTTGTCAATACACGAGAGTGAAAAGTCCCAGATAATTGGGAGGCTGGAGGATACTGTGATCATAATATTGTAACTATTAATAAAGTATGAACAATGTGGCTGCACTGGGAAGAGCACCAATATGGCCTGCATGGGTCAGGAGCCTGACTTGGGGCTGTGTGCTGGTTGAGTCTGTAACAAGGTTTTCTTCACTGTTCCAAGAGGTTTAATCTTCAGTTCTTATAGGTTTGTATTTGATAGTGAATCACTATGGCAACCAATTGGAGACAGCTGCTGACTGTAGAACCAGTGATGTTCTTTTATGCATATGGCTTGTTTATGGCAATGCCTGTTTTCCAGCAATATATCTACCATCGGCTTAGTGAAGAACATCATTTTCCATACAACTTTAAGGAACAAACCTCAAGTTGCGGAAGTTCCTTGAATGAATCAATGGAGAAACTTGAGAAAAAGGTGCAGTCTTCTGCTTCTTATGTTCAACTGGGCGTTGTCATGTTTTCTACCTTTCCATCGATTGTGATGACCCTCTTCATGGGTGGTTGGACAGATAAGGTGGGCCGACGCCCTGCTCTGATCATGCCACTTCTTGGCAGTGCACTGGATGCTGCTGTTGTACTTACTGTCATGTATGCCAAGTTGCCTGTGTACTGTCTCTTTATTGGCTCTTTTATCCATGGAGTTTGTGGATATTACACCACCATTCTCTTGGCCTGTTTGGCCTACATTGCGGACACCACTGAGCGGGGACATTTTGCATTTAGATTGGGTATTTTAGAAGCCATTGTTTTTGTGGGAGGAATGGTTGCCCAGTTAACAAGTGGGTTTTGGATTGAAAAGCTGGGATTCACTGCTCCGTATTGGTTCATATTTGGATGTGAAGTCTTTGCGCTGATTTATGCAGCTGTTCTTGTTCCTGAGTCAAAATGCCCATCCAAGGAAGAGAGAGGAAAGCTTTTCAGCTTGGATAACTTGAAGTCTTCTTGGAAAGTTTACAAAAAGGCTGTGGGCACTAAGAAGAGAAATTTAATCATTTTGACATTTTGCTGTGGTATCACGGCCATACCAATCATGGGTATACGTGGAGTTTCGAGCCTGTTTTTGCTTTATTCCCCACTCTGTTTCTCACCAGAACGTGTGGGATATTTTTCAGCCTTGCAGAATTCTGTTTATGGTGTTGGTGGTATTGTGACTATAAAAGCATTTGGAATGTGTCTTTCTCATGTCAATGTAGCGCGCATATCCATTCTATCATATCTAGGATTCCTCGTATACTTTGGATTTTCAAGAACTCTGCTCATGGTCTTTTTGAGTCCCTTGATAGGGATTCTTGGTGGAGCTGTAGCCCCTTTAATCAGAGCGATGATGTCCGAGATTGTCAGTTCAGATGACCAAGGTTCACTTTTTTCAGCCACATCATCTATGGAGGTTCTATTCACGTACCTCGGAGCTCTCCTATTGAACTCACTCTACGCGAAATCTCTAAAATTCAATGCTCCTGGGTTTGTGTTTTTCCTGGCTGCTGGCATACTATTGCTGCCCCTCGCTTTAACTTTCTGCTTAAAAGATCTGTCCATGTTTAAAATGGGAAGGAAGCTAATAAATAAAGCCAGTAGATATGAGAGCATAACAGACGAGGAAGACGGGAGAGAGGGCCAAACAGGATCTCCCGATTCACCGTATTCTGACATAACTGGCGATGATCTGCATGTTATTCCGGGCGGTGACGGGAGGAATGTGTAAACTGAGGAAGTCCGCGTACGATGGGATAAAAACAGGTTTCTCTTCTTGGGGTTCTAGTGCTAAAGTTGCAGCCAAATTCTGACCAAAAAATCACAGAAGACGCCATGATACTTCATTTTGATGTCGATCATTGTACTGTTTGATATATTTTGGAACGTAAGATATGGTCGCATGCAATGAAAACCTTATTTCCATGCTGATTTGGTGTCATATATTTTGGTTCAACCAATCAGATCT
GCTGTTAGATGAAAAAAAGGTGAAGCGTAGAAAATGATAAGCGTAACCAATAATAGCCAAAGAGGTCATTGTTCTTAAATGGAAACGAGGCTTCTGTCTCACGTGATCAAAGACTCGTTTGGTCTGTCTATCCACTCTTTGTAATGCGACCAGGAGCTTTACATCTCACATGTAGTTGTCTTTTCAAGGAAAAGGAATTTATTTAACACTGAAAATAGTTTGAACCCAGGAGTGACACGCCTTGATTGAATCTGATTGTATGGGCAAGTGGAATCCTGAGAAGAATTGTGGTTGGCTGTGACTGACGTTTCGACAACCTTTGCGGCAGAAATTTTCTGTGTCAAGTAATGATGTAATCAGTTGCTTTTATGTTCTTATTAATAAGTACTCCTCTCGCCAAGACGACCAGCTTTCATCAAGAAATTTTATTTACTTGTCGATTTGTAAATATTTCATAATCGACTGTCCCCAGTCGCTCAAAGGCTGGGTGGCACTTAGTCTATCCACTGGACAGAGACTTGGACATAGCGCTATCCACAGGTTGAACGACAGGGGCCTGGTATGTGTAAAATTAAGCTGAATATAGCAGATATCCGCGGACTTCGAACACCGATACGTTATTGAGATTAGGCGGTAGAAAGATAGATAGATAGAAAGATAGATAGATAGATAGATAGAT \t 2679\t65\t501\t623\r\n", "JR970417 \t TCGACTTCGTCTTGGGAAACAGATATTCAAGGGAAACAAAATTCATTGTTCCCCAAAGGACCAGTCATTAAGTTATTTGTTGCATAGGAAAACAAAGGAAGAAAAAGCGTGCTGAGATTCCAGTGACAACAACAGGCCAACTTCAACGGCATGCTCAGATCACGTGTAGCAGCAGTCAACGTAGCCCGGGTAACAGTGAACTATTTCCCATTTTACGTTATCGTTTTCGCAATTGTTGCTGCTTGCGGCATTTGGCGGTGAACAGTTTCACAGTCAGAGGTCATGTGGCCATGAACTAGTCAATGAATGGGTGAGCCATGGCGGATGGCGGGAAGAACACCAGCTGTACAAAAATTTGTGTTAAGGTTGCATAAGTTTGATTCACAATTACAGTGCACAATAGAACTGAATCATAGGAAAACATGGACAAAGCACATCAAATCAAAACAAACATTGCTTGATATGCACCACAACGAAACTATTGTGAGAAGTCATTGAGACTCGGGTGGCTAAGTGTCTCAGTACCATATCCAGCAAAGTTGTATGAAAGGAGAGAATGATGCTCGCCCTTTATGTTATTTAGTCAGGGAGCCATTTAGTTTTTGGTGTGGTCACGCAACGCTACTCTC \t 629\t19\t123\t147\r\n", "JR970418 \t 
TTGAGACGTAGCCTGGTGTAACAACCAATAAAATCCCTGTAAGCAAATCAGTTATTGCCAGGTTAAACAGCAGCATATTATGAGGCTTTCGTAACATTTCACGTTTCTTGAATAATACGAGACAAAACATGAGGTTGAACGTGAAGGCTGAGCCAGCGATGACGGTGAAAGCGGTTTCAAGCACACTCTCTGATAGCGACTCGCTTTGTTCGGTAAATGCAGTTGATTGGTTGTTTTCTTGTCCATTCAAGCTATATGATTGGCTTATTTCAGTGGCTGATATCAACGTTGATTGGTTGATTGCAGATTCCATGATGGCAGTTGATTGGCTTAATCAAAGCTTCGAACGCAGTGTCATTAGTTTCGTGCTTTTAGATGTGATGATGCCGAGTAAATTCTAAATCCTGTTTAGAAAGACCAGGCTCAAAGGTCAGTGTGACAAGTTATCTTCAAACAGATTCTACTTCTTTCAAGTGTTCTGGTTGTAGTCTTTCTGTGTCGAGCTTTTAAGATGTGTGCAGTGGGCCACCTCAATCAGTTACAACAGGATCACTTCCGGAAATGAAAATACAGGTTAATTCTCTGGAGTAAGCTTTGAGGTTGGATAAAACATTACTGACCGATGTGTG \t 629\t19\t111\t146\r\n" ] } ], "source": [ "#Combining counts\n", "!paste tab_3 \\\n", "CG \\\n", "C \\\n", "G \\\n", "> comb\n", "!head -5 comb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculating CpG O/E based on [Gavery and Roberts (2010)](http://www.biomedcentral.com/1471-2164/11/483)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CpG O/E is calculated as\n", "\n", "$$\\mathrm{CpG\\ O/E} = \\frac{\\mathrm{number\\ of\\ CpG}}{\\mathrm{number\\ of\\ C} \\times \\mathrm{number\\ of\\ G}} \\times \\frac{l^2}{l - 1}$$\n", "\n", "where $l$ is the number of nucleotides in the contig." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Fields in comb: $1 = accession, $2 = sequence, $3 = length, $4 = CG count, $5 = C count, $6 = G count\n", "!awk '{print $1, \"\\t\", (($4)/($5*$6))*(($3^2)/($3-1))}' comb > ID_CpG2 #use ^ instead of ** for exponent\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JR970414 \t 0.774633\r\n", "JR970415 \t 0.283791\r\n", "JR970416 \t 0.558113\r\n", "JR970417 \t 0.662023\r\n", "JR970418 \t 0.738617\r\n", "JR970419 \t 0.318018\r\n", "JR970420 \t 1.25641\r\n", "JR970421 \t 0.328632\r\n", "JR970422 \t 1.11106\r\n", "JR970423 \t 0.801276\r\n" ] } ], "source": [ "!head ID_CpG2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining CpG O/E to the annotation (the files must be sorted first)" 
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JR970414 \t 0.774633\r\n", "JR970415 \t 0.283791\r\n", "JR970416 \t 0.558113\r\n", "JR970417 \t 0.662023\r\n", "JR970418 \t 0.738617\r\n", "JR970419 \t 0.318018\r\n", "JR970420 \t 1.25641\r\n", "JR970421 \t 0.328632\r\n", "JR970422 \t 1.11106\r\n", "JR970423 \t 0.801276\r\n" ] } ], "source": [ "#Sorting CpG file\n", "!sort ID_CpG2 > ID_CpG2.sorted\n", "!head ID_CpG2.sorted" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }