---
title: "Geoduck transcriptome"
output: html_notebook
---
Purpose of this notebook is to develop some graphs of CpGo/e for the Geoduck transcriptome.
lets load up some data!
```{r}
library(forestplot)
data <- read_delim("~/Documents/Roberts Lab/geoduck transcriptome/Geoduck-transcriptome_v3_bigtable_v08.tab.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
GOslim_bin <- read_delim("~/Documents/Roberts Lab/geoduck transcriptome/GOslim-bin.txt",
"\t", escape_double = FALSE, trim_ws = TRUE)
```
```{r}
dim(data)
head(data)
colnames(data)
dim(GOslim_bin)
head(GOslim_bin)
```
Lets weld these two together, at least where they match.
```{r}
data_merged <- merge(data, GOslim_bin, by.x = "ContigID", by.y = "Column1", all = FALSE)
```
```{r}
data_merged$GOSlim_bin <- as.factor(data_merged$GOSlim_bin)
GO_means[1] <- levels(data_merged$GOSlim_bin)
GO_means[2] <- as.data.frame(tapply(data_merged$`CpG_o/e`, data_merged$GOSlim_bin, mean))
GO_means[3] <- tapply(data_merged$`CpG_o/e`, data_merged$GOSlim_bin, sd)
GO_means[4] <- GO_means[3]/sqrt(summary(data_merged$GOSlim_bin))
GO_means[5] <- GO_means[2] - GO_means[4]
GO_means[6] <- GO_means[2] + GO_means[4]
colnames(GO_means) <- c("Gene Types", "CpG O/E Mean", "CpG O/E SD", "CpG O/E SEM", "Lower", "Upper")
head(GO_means)
```
Lets do a couple quick spot checks, to make sure my logic for calculating the SEM works (I have the right S and sqrt(N) lined up.)
```{r}
summary(data_merged$GOSlim_bin)
factor(data_merged$GOSlim_bin)[1]
transport_n <- length(data_merged$GOSlim_bin[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[1])])
transport_SEM <- sd(data_merged$`CpG_o/e`[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[1])]) / sqrt(transport_n)
transport_SEM == GO_means[14,4]
protein_met_n <- length(data_merged$GOSlim_bin[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])])
protein_met_SEM <- sd(data_merged$`CpG_o/e`[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])]) / sqrt(protein_met_n)
protein_met_SEM == GO_means[10,4]
```
Good deal, looks like it worked! Or I got really lucky twice.
Now to play with the forestplot command. I had some issues with the column gapping, originally it wanted to right justify the graph, leading to a large gap between the labels and the graph. I cheated around this by using the colgap argument with a negative number. This led to some weird spacing issues with the graph label. I'll likely look for a better tool for making the forest plot, but this is a functioning first pass.
```{r}
forestplot(labeltext = GO_means$`Gene Types`, mean = GO_means$`CpG O/E Mean`, lower = GO_means$Lower, upper = GO_means$Upper, xticks = c(0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75), boxsize = 0.08, grid = TRUE, lwd.ci = 5, ci.verticies = TRUE, lty.ci = 1, colgap = unit(-140, 'mm'), xlab = " CpG O/E")
```
Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.