---
title: "Geoduck transcriptome"
output: html_notebook
---

Purpose of this notebook is to develop some graphs of CpGo/e for the Geoduck transcriptome.

lets load up some data!

```{r}
library(forestplot)

data <- read_delim("~/Documents/Roberts Lab/geoduck transcriptome/Geoduck-transcriptome_v3_bigtable_v08.tab.txt", "\t", escape_double = FALSE, trim_ws = TRUE)

GOslim_bin <- read_delim("~/Documents/Roberts Lab/geoduck transcriptome/GOslim-bin.txt", 
    "\t", escape_double = FALSE, trim_ws = TRUE)
```


```{r}
dim(data)

head(data)

colnames(data)

dim(GOslim_bin)

head(GOslim_bin)

```

Lets weld these two together, at least where they match.

```{r}

data_merged <- merge(data, GOslim_bin, by.x = "ContigID", by.y = "Column1", all = FALSE)
```

```{r}
data_merged$GOSlim_bin <- as.factor(data_merged$GOSlim_bin)
GO_means[1] <- levels(data_merged$GOSlim_bin)
GO_means[2] <- as.data.frame(tapply(data_merged$`CpG_o/e`, data_merged$GOSlim_bin, mean))
GO_means[3] <- tapply(data_merged$`CpG_o/e`, data_merged$GOSlim_bin, sd)
GO_means[4] <- GO_means[3]/sqrt(summary(data_merged$GOSlim_bin))
GO_means[5] <- GO_means[2] - GO_means[4]
GO_means[6] <- GO_means[2] + GO_means[4]
colnames(GO_means) <- c("Gene Types", "CpG O/E Mean", "CpG O/E SD", "CpG O/E SEM", "Lower", "Upper")

head(GO_means)
```


Lets do a couple quick spot checks, to make sure my logic for calculating the SEM works (I have the right S and sqrt(N) lined up.)

```{r}
summary(data_merged$GOSlim_bin)
factor(data_merged$GOSlim_bin)[1]

transport_n <- length(data_merged$GOSlim_bin[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[1])])

transport_SEM <- sd(data_merged$`CpG_o/e`[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[1])]) / sqrt(transport_n)

transport_SEM == GO_means[14,4]

protein_met_n <- length(data_merged$GOSlim_bin[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])])

protein_met_SEM <- sd(data_merged$`CpG_o/e`[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])]) / sqrt(protein_met_n)

protein_met_SEM == GO_means[10,4]
```


Good deal, looks like it worked! Or I got really lucky twice.

Now to play with the forestplot command. I had some issues with the column gapping, originally it wanted to right justify the graph, leading to a large gap between the labels and the graph. I cheated around this by using the colgap argument with a negative number. This led to some weird spacing issues with the graph label. I'll likely look for a better tool for making the forest plot, but this is a functioning first pass. 

```{r}
forestplot(labeltext = GO_means$`Gene Types`, mean = GO_means$`CpG O/E Mean`, lower = GO_means$Lower, upper = GO_means$Upper, xticks = c(0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75), boxsize = 0.08, grid = TRUE, lwd.ci = 5, ci.verticies = TRUE, lty.ci = 1, colgap = unit(-140, 'mm'), xlab = "                                                                                                                                                                                                     CpG O/E")

```

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.