### A flexible platform for querying disparate oyster datasets (qDOD)

A significant challenge facing genome-enabled improvements in U.S. agriculture is aggregating and integrating large, disparate datasets to accurately identify the factors that contribute to commercially important traits. The goal of this proposed project is to facilitate the analysis, curation, and distribution of genomic data available for the Pacific oyster, *Crassostrea gigas*. The project merges experience in developing genomic resources (Roberts, UW SAFS) with computing science expertise (Howe and Halperin, UW eScience Institute) to use SQLShare ([http://escience.washington.edu/sqlshare](http://escience.washington.edu/sqlshare)) as a platform for aggregating and analyzing genomic data. SQLShare is a free, publicly available, web-based query-as-a-service platform designed to replace script-based scientific workflows with declarative queries. It provides a public data repository that accepts any tabular data (spreadsheets, GFF files, etc.) and allows users to derive new datasets via standard database operations in the Structured Query Language (SQL). The objectives of the proposed project are to **1) Centralize the datasets needed for effective analysis of oyster genomic data** and **2) Provide a series of modules that offer examples of relevant queries and export protocols**.
	

Datasets that will be aggregated and made immediately available to the public include data from NCBI (nt, nr, SRA, ESTdb, RefSeq), EMBL-EBI (Gene Ontology Association files), UniProt (UniProtKB Swiss-Prot), GigasDatabase, Crassostreome, OysterDB, OrthoDB, and RepBase, as well as supplemental data from publications and all of our own data. Some of these data are already available, including the oyster gene expression information published along with the genome. In fact, SQLShare has been used extensively with genomic data from a variety of species in the University of Washington course FISH546: Bioinformatics for Environmental Sciences.

The real power of the platform pertains to objective 2: the ability to query multiple datasets together instead of focusing on one at a time. One example is the four-line query at right, which obtains all oyster genes where the percent methylation in gill tissue is greater than 20%, the number of CGs is greater than 10, and the expression level is less than 10. For objective 2, a public wiki will be developed that provides a number of well-annotated and fully described queries to help users get started, and to which any user can contribute their own queries. Documentation will allow any user to run these queries as written or modify them for their own interests. Users can decide whether to make their data and queries public or private.

Because the oyster genome is now easily browsable on the USDA NRSP-8 Program Bioinformatics Coordination Project website, particular attention will be given to generating tabular data in GFF format that can be easily viewed on that site. Data we upload to SQLShare will be automatically and instantly shared with the public, and associated documentation, examples, and notes will be available on the project wiki. The next milestone of the project will be to provide a unified portal and experience (i.e. http://oystergen.es).
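
To make the multi-dataset query described above concrete (gill methylation above 20%, more than 10 CGs, expression below 10), here is a minimal sketch. It is not the project's actual query: the table names `gill_methylation` and `gene_expression`, their columns, and the toy rows are all hypothetical, and an in-memory SQLite database stands in for SQLShare so that the example runs on its own.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for SQLShare.
conn = sqlite3.connect(":memory:")

# Hypothetical schemas and made-up rows, for illustration only.
conn.executescript("""
CREATE TABLE gill_methylation (
    gene_id TEXT PRIMARY KEY,
    pct_methylation REAL,   -- percent methylation in gill tissue
    cg_count INTEGER        -- number of CG sites in the gene
);
CREATE TABLE gene_expression (
    gene_id TEXT PRIMARY KEY,
    expression REAL         -- expression level (units assumed)
);
INSERT INTO gill_methylation VALUES
    ('gene_A', 35.0, 42),
    ('gene_B',  5.0, 60),
    ('gene_C', 50.0,  8),
    ('gene_D', 28.5, 15);
INSERT INTO gene_expression VALUES
    ('gene_A',   3.2),
    ('gene_B',   1.0),
    ('gene_C',   2.0),
    ('gene_D', 120.0);
""")

# Join the two datasets on gene_id and apply the three thresholds
# named in the text.
QUERY = """
SELECT m.gene_id, m.pct_methylation, m.cg_count, e.expression
FROM gill_methylation AS m
JOIN gene_expression AS e ON e.gene_id = m.gene_id
WHERE m.pct_methylation > 20 AND m.cg_count > 10 AND e.expression < 10
ORDER BY m.gene_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is the join: the methylation and expression values live in separate tables and are combined declaratively on a shared gene identifier, with no intermediate script to merge files. On SQLShare the same pattern would apply to whatever dataset and column names the uploaded tables actually use.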