This experiment is in collaboration with Martin Morgan at Fred Hutch. The task is to count the reads that overlap each of a set of regions. The hypothesis is that a SQL implementation is competitive with the custom R and C code developed explicitly for this purpose. The schema is region(region_id, start, end) and read(start, end). The sample dataset includes 10,000 regions and 1,000,000 reads. The query is a [straightforward range query with an aggregation:](https://sqlshare.escience.washington.edu/sqlshare#s=query/mtmorgan%40washington.edu/counts) ```sql SELECT roi.id, count(rd.start) FROM [mtmorgan@washington.edu].[roi.csv] roi , [mtmorgan@washington.edu].[reads.csv] rd WHERE roi.start <= rd.start AND roi.[end] >= rd.[end] GROUP BY roi.id​ ``` The two datasets were loaded into SQLShare. The performance without any further tuning was not competitive. We made the following modifications: 1. We created a clustered index on (start, end) (using a separate table called indexed_reads. Recall that clustered just means "physically sorted by the specified columns." SQLShare guesses a default clustered index, but it chose (start) instead of (start, end), which prevented applying both predicates during the index scan. 1. We modified the query as follows. The reason is that by adding checking that the read start falls within the region and, separately, checking that the read end falls within the region, the index can be used to check "both sides" at once. That is, it can apply an efficient range condition on the start. Without this condition, there's no way the database can know that read.end is always greater than read.start, so it needs to consider a lot more possibilities. (I suspect this domain-specific knowledge is hard-coded into the custom tool.) ```sql SELECT roi.id, count(rd.start) as cnt FROM [mtmorgan@washington.edu].[roi.csv] roi , indexed_reads rd WHERE roi.start <= rd.start AND rd.start <= roi.[end] AND roi.start <= rd.[end] AND rd.[end] >= roi.[end] GROUP BY roi.id ``` This query runs in 3 seconds on a single processor on SQL Azure. Martin reports that the same task runs on a single processor on a Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz in about 6.8s. Unfortunately, there's no direct way to assert the CPU speed on SQL Azure since it varies based on load, but I don't think it is the dominant effect here.