---
title: "Revamping the nightingales spreadsheet"
output: html_notebook
---

After talking today, we decided that we were going to change how the spreadsheet that organized and linked sequencing data produced by the Roberts Lab. This notebook begins that work by reworking the SeqID column, removing remenants of file name and the second of paired end reads as well as remaking the links to the files themselves.

First we read in the current Nightingales spreadsheet, as pulled from the Google sheets. A quick head and dim command make sure it looks roughly correct, which it does. 

```{r}
library(readr)
read_csv("~/Documents/dmp/Nightingales - master.csv") -> night

head(night)
dim(night)
```

Below we strip out anything that looks like a file name by removing anything on the right side of the first . (ex. sequence.fasta.gz -> sequence)

Then we remove any existing text containing _R1, leaving just the Sequence ID (full example, sequence_R1.fasta.gz would first go to sequence_R1, and then just sequence)

Finally, we delete any rows which have the second of a pair of reads, as we no longer need these with the current format of the spreadsheet. 
```{r}
night$SeqID <- gsub("\\..*", "", night$SeqID)
night$SeqID <- gsub("_R1.*", "", night$SeqID)
night <- night[-which(grepl("_R2", night$SeqID) == TRUE),]
```


This... looks like a mess. The loop below iterates through all remaining rows of the data frame, first setting the working directory to the Nightingales directory corresponding to the species. I'm using a mounted version of Owl, so the directory structure may look a little odd. 

After it sets the working directory, it creates a temporary file list of all files within the working directory that correspond to the sequence ID. 

It then checks if the Length variable has a "PE", indicating that the reads are paired, and if that is the case, writes the two files to the URL 1/2 and File Location 1/2 columns, with the correct directory structure/http address.

If it's a single end read, the else portion of the if/else just writes the single read file to URL and File Location 1, leaving the second read columns blank.
```{r} 
for(i in 1:nrow(night))   {
   setwd(paste0("~/Documents/owl/nightingales/", night$Primary_taxa[i]))
   temp_file_list <- list.files(".", pattern = paste0(night$SeqID[i]), include.dirs = TRUE, recursive = TRUE, ignore.case = TRUE)
  if(grepl("PE", night$Length[i]) == TRUE)   {
    night$FileLocation[i] <- paste0("/Volumes/web/nightingales/", night$Primary_taxa[i], "/", temp_file_list[1])
    night$FileLocation2[i] <- paste0("/Volumes/web/nightingales/", night$Primary_taxa[i], "/", temp_file_list[2])
    night$URL[i] <- paste0("http://owl.fish.washington.edu/nightingales/", night$Primary_taxa[i], "/", temp_file_list[1])
    night$URL2[i] <- paste0("http://owl.fish.washington.edu/nightingales/", night$Primary_taxa[i], "/", temp_file_list[2])
  } else   {
    night$FileLocation[i] <- paste0("/Volumes/web/nightingales/", night$Primary_taxa[i], "/", temp_file_list[1])
    night$URL[i] <- paste0("http://owl.fish.washington.edu/nightingales/", night$Primary_taxa[i], "/", temp_file_list[1])
    night$FileLocation2[i] <- ""
    night$URL2[i] <- ""
  }
}
```

This largely worked, but there were 53 rows that produced incorrect entries. Two were due to missing pairs, which is hard to fix. The majority were due to file names that didn't fit the standard SeqID_R1.* naming conventions. As an example. there are some O. keta reads that have the format S_1_1_sequence.txt.gz and S_1_2_sequence.txt.gz. As the paired read indicator is just a number, it's difficult to detect these algorithmically, and so it will likely be easier to fix these by hand. 600/653 isn't bad for an hour and a half's work. 

```{r}
setwd("~/Documents/dmp/")
write.csv(night, file = "night_test.csv")
```

From here, we need to get Library IDs and numbers over from the Next Gen Sequencing file, but that will be tomorrow's problem.