# BLAST and GOSlim annotation of *Acropora palmata* transcriptome

This workflow details the annotation of an *Acropora palmata* [transcriptome](https://usegalaxy.org/datasets/cb51c4a06d7ae94e/display?to_ext=fasta).

This notebook requires the following:
- [NCBI Blast: 2.2.3](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)
- [SQLShare](https://sqlshare.escience.washington.edu/accounts/login/?next=/sqlshare/%3F__hash__%3D)

The annotation also requires a UniProt/Swiss-Prot BLAST database. Instructions for setting up this database can be found [here](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/README.md).

The original analysis was carried out on Mac OS X v10.10.3 running Python 2.7.9 and IPython 3.1.0.

This workflow is structured so that anyone can reproduce the analysis by downloading the repository locally and executing this notebook.

In [2]:
cd ../data/Apalm

/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/data/Apalm


In [7]:
#Obtain FASTA file (curl -O saves it under the remote name, display?to_ext=fasta)
!curl -O "https://usegalaxy.org/datasets/cb51c4a06d7ae94e/display?to_ext=fasta"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 65.5M  100 65.5M    0     0  5165k      0  0:00:12  0:00:12 --:--:-- 5210k


In [8]:
#Rename to Apalm.fasta (quoted so the shell does not treat ? as a wildcard)
!mv "display?to_ext=fasta" Apalm.fasta

In [None]:
!head Apalm.fasta

In [None]:
!tail Apalm.fasta

In [None]:
#Some header lines have a stray double quote (") in front of the >; remove all double quotes from the FASTA
!sed 's/"//g' Apalm.fasta > Apalm2.fasta

In [None]:
!head Apalm2.fasta

In [None]:
#Count number of seqs
!fgrep -c ">" Apalm2.fasta
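The quote-stripping and sequence-counting steps above can also be sketched in Python. This is a minimal illustration, not part of the original analysis: the function name `clean_and_count` and the toy records are hypothetical. Unlike `fgrep -c ">"`, which counts every line containing `>`, this sketch counts only lines that start with `>`.

```python
def clean_and_count(lines):
    """Strip double quotes from FASTA lines and count header lines.

    Returns (cleaned_lines, n_sequences).
    """
    # Equivalent of sed 's/"//g': drop every double quote
    cleaned = [line.replace('"', '') for line in lines]
    # A FASTA record starts with a header line beginning with '>'
    n_sequences = sum(1 for line in cleaned if line.startswith('>'))
    return cleaned, n_sequences


# Toy input (hypothetical records, not taken from the real transcriptome)
toy = ['">contig1\n', 'ACGT\n', '>contig2\n', 'GGCC\n']
cleaned, n = clean_and_count(toy)
print(n)  # 2
```

To apply it to the real file, pass the lines of `Apalm.fasta` (for example `open('Apalm.fasta').readlines()`) instead of the toy list.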

### Blastx query

In [None]:
#Options: -query = input FASTA; -db = your own blastx database path;
#-max_target_seqs 1 = keep at most one target sequence per query;
#-max_hsps 1 = keep at most one high-scoring segment pair (HSP) per query-target pair;
#-outfmt 6 = tabular output; -evalue 1E-05 = E-value cutoff of 10^-5;
#-num_threads 8 = use 8 threads; -out = results file in the analyses directory
#(../../analyses/Apalm relative to data/Apalm); 2> = standard error to its own file.
!blastx \
-query Apalm2.fasta \
-db ~blast/db/uniprot_sprot \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6 \
-evalue 1E-05 \
-num_threads 8 \
-out ../../analyses/Apalm/Apalm_blastx_uniprot.tab \
2> ../../analyses/Apalm/Apalm_blastx_uniprot.error

In [16]:
cd ../../analyses/Apalm

/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/analyses/Apalm


In [14]:
#Checking head and tail of the output file.
!head -10 Apalm_blastx_uniprot.tab



In [None]:
#Comparison of the tail with original FASTA should give an idea of whether
#the blast job is complete (note contig25409_16070 present in both)
!tail -10 Apalm_blastx_uniprot.tab
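The head/tail comparison described above can be approximated programmatically. The sketch below is an illustration under assumptions: the helper names `fasta_ids` and `blast_progress` are hypothetical, the toy records are invented, and it assumes BLAST writes hits in the same order as the queries in the FASTA file. A value below 1.0 does not by itself mean the job is unfinished, because contigs with no hit under the E-value cutoff never appear in the output.

```python
def fasta_ids(lines):
    """Return sequence IDs (first token after '>') in file order."""
    ids = []
    for line in lines:
        if line.startswith('>'):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def blast_progress(fasta_lines, blast_lines):
    """Fraction of the FASTA reached by the last query ID in the BLAST table."""
    ids = fasta_ids(fasta_lines)
    # First tab-separated column of outfmt 6 is the query ID
    queries = [line.split('\t')[0] for line in blast_lines if line.strip()]
    if not ids or not queries:
        return 0.0
    position = {seq_id: i for i, seq_id in enumerate(ids)}
    last_index = position.get(queries[-1], -1)
    return (last_index + 1) / float(len(ids))


# Toy data (hypothetical): four contigs, the last hit belongs to the third
toy_fasta = ['>c1\n', 'AC\n', '>c2\n', 'GG\n', '>c3\n', 'TT\n', '>c4\n', 'AA\n']
toy_blast = ['c1\tsp|P00001|TOY_A\n', 'c3\tsp|P00002|TOY_B\n']
print(blast_progress(toy_fasta, toy_blast))  # 0.75
```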

In [None]:
!wc Apalm_blastx_uniprot.tab

In [None]:
#Replace pipes with tabs to produce a tab-delimited file for SQLShare
!tr '|' '\t' < Apalm_blastx_uniprot.tab > Apalm_blastx_uniprot_sql.tab
!head -1 Apalm_blastx_uniprot.tab
!echo "SQLShare-ready version has pipes converted to tabs:"
!head -1 Apalm_blastx_uniprot_sql.tab
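The effect of the pipe-to-tab conversion on the column layout can be sketched as follows. Standard `-outfmt 6` output has 12 tab-separated fields. Assuming the subject ID has the form `sp|ACCESSION|ENTRY_NAME` (this depends on how the database was built), splitting on pipes turns it into three columns, so the accession ends up in the third column. The example line and the helper name `pipes_to_tabs` are hypothetical.

```python
# Default field order of BLAST tabular output (-outfmt 6)
OUTFMT6_FIELDS = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
                  'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']


def pipes_to_tabs(line):
    """Equivalent of tr '|' '\\t' for a single line."""
    return line.replace('|', '\t')


# Toy line (hypothetical values, not real output from this analysis)
toy_line = ('contig1\tsp|P00001|TOY_TEST\t50.0\t100\t50\t0\t'
            '1\t300\t1\t100\t1e-20\t95.0\n')

columns = pipes_to_tabs(toy_line).rstrip('\n').split('\t')
print(len(OUTFMT6_FIELDS))  # 12 fields before the conversion
print(len(columns))         # 14 columns after the conversion
print(columns[2])           # P00001, the accession now in the third column
```

Under that assumption, the third column is the one that can be matched against a table of Swiss-Prot IDs.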

# Manually uploading Apalm_blastx_uniprot_sql.tab to SQLShare and joining with GOSlim

### First upload dataset
![screen shot1](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.01.38%20PM.png?raw=true)

### Then find the dataset, execute query, and download the new dataset
![screen shot](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.29.18%20PM.png?raw=true)

## Query (note: replace jldimond@washington.edu with your own SQLShare account)
`SELECT DISTINCT Column1 AS ContigID, GOSlim_bin
  FROM [jldimond@washington.edu].[Apalm_blastx_uniprot_sql.tab] anno
  LEFT JOIN [sr320@washington.edu].[SPID and GO Numbers] go
    ON anno.Column3 = go.SPID
  LEFT JOIN [sr320@washington.edu].[GO_to_GOslim] slim
    ON go.GOID = slim.GO_id
  WHERE aspect LIKE 'P'`
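The logic of the query can be tried locally on toy data with Python's built-in `sqlite3` module. This is only a sketch: the table names `anno`, `go`, and `slim` stand in for the SQLShare datasets, all rows are invented, and placing the `aspect` column in the GOSlim table is an assumption. It shows that the `WHERE aspect LIKE 'P'` filter keeps only biological-process terms and drops contigs without a matching GO term, and that `DISTINCT` collapses repeated contig/bin pairs.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Toy stand-ins for the three SQLShare datasets (all values hypothetical)
cur.executescript("""
CREATE TABLE anno (Column1 TEXT, Column3 TEXT);
CREATE TABLE go   (SPID TEXT, GOID TEXT);
CREATE TABLE slim (GO_id TEXT, GOSlim_bin TEXT, aspect TEXT);
INSERT INTO anno VALUES ('contigA', 'P00001');
INSERT INTO anno VALUES ('contigB', 'P00002');
INSERT INTO anno VALUES ('contigC', 'P00003');
INSERT INTO go VALUES ('P00001', 'GO:1');
INSERT INTO go VALUES ('P00001', 'GO:2');
INSERT INTO go VALUES ('P00002', 'GO:3');
INSERT INTO slim VALUES ('GO:1', 'protein metabolism', 'P');
INSERT INTO slim VALUES ('GO:2', 'protein metabolism', 'P');
INSERT INTO slim VALUES ('GO:3', 'nucleus', 'C');
""")

rows = cur.execute("""
SELECT DISTINCT Column1 AS ContigID, GOSlim_bin
FROM anno
LEFT JOIN go   ON anno.Column3 = go.SPID
LEFT JOIN slim ON go.GOID = slim.GO_id
WHERE aspect LIKE 'P'
ORDER BY ContigID
""").fetchall()

print(rows)
conn.close()
```

In this toy case only `contigA` survives, with a single `protein metabolism` row: `contigB` maps to a term whose aspect is not `P`, and `contigC` has no GO term at all, so the `WHERE` clause removes both despite the left joins.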

### Output file downloaded to ./analyses/Apalm/Apalm_GOSlim.csv

In [9]:
!head -10 Apalm_GOSlim.csv



In [10]:
#Converting from comma to tab delimited
!tr ',' '\t' < Apalm_GOSlim.csv > Apalm_GOSlim.tab

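Note that `tr` replaces every comma, including any that occur inside a field. If the downloaded CSV quotes such fields (an assumption about the export format), a CSV-aware conversion keeps them intact. The sketch below uses Python's standard `csv` module and is written for Python 3; the function name `csv_to_tsv` and the toy row are hypothetical.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab-delimited text, respecting quoted fields."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerows(reader)
    return out.getvalue()


# Toy input (hypothetical): the second field contains a comma inside quotes
toy_csv = 'ContigID,GOSlim_bin\ncontig1,"transport, other"\n'
print(csv_to_tsv(toy_csv))
```

For the real file, the same function could be applied to the contents of `Apalm_GOSlim.csv` and the result written to `Apalm_GOSlim.tab`.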


In [19]:
!head -10 Apalm_GOSlim.tab

ContigID	GOSlim_bin
contig135011_153678_153601	cell organization and biogenesis
contig135011_153678_153601	other biological processes
contig135011_153678_153601	developmental processes
contig69684	protein metabolism
contig113621	protein metabolism
contig97647	protein metabolism
contig199902	protein metabolism
contig78855	other biological processes
contig8505_94477	DNA metabolism
