I've just began working on my final year project, and was advised to document my progress in a diary or blog, so here we go.
My project is going to be based around genomic data that has been sequenced from three strains of yeast. The plan is to annotate the sequenced contigs of DNA, and then compare them against each other to see where the genes of the yeasts differ.
This will hopefully help the researchers understand what genes are responsible for certain characteristics of the yeasts, and enable them to modify one of the species to perform a more useful function.
Playing with alignment tools
After getting the data, I began to experiment with different tools and getting used to working with the fasta format.
The most commonly used tool for gene alignment is blast, however it was written in 1990 originally and is over 600MB's to install. So I would like to look into more modern tools, in the hopes of speeding up the time it takes to align the sequences.
I started by downloading the 66GB NCBI nr database, it is a fasta file that contains a collection of known genes that have been annotated. This is what I will use to query the yeast contigs against.
I started out by using diamond, which works just like blast, however it is much more modern and runs a claimed 20,000 times faster! The first step in using diamond for alignment is to take your reference sequences and create a database from that for your queries to be ran against.
This took me 33 minutes to do for the whole NCBI nr database, which didn't seem too bad. I then ran an alignment with the 16MB of sequenced yeast contigs, against this database.
That took 75 minutes to run and produced some useful (I hope!) data.
I should mention the hardware that I am running this on, as the times are meaningless with out that. My home PC has a i7-6700k @ 4.6Ghz, 16GB of RAM, a GTX 980 and unfortunately I didn't have an SSD big enough for this data, so I'm having to use my HDD which is probably causing a lot of bottlenecks.
Now that I know I can run an alignment with diamond in a reasonable time, I wanted to explore the options of using a GPU accelerated alignment tool in the hopes that this would allow for much faster alignment times. I first tried BarraCUDA, it installed fine but would appear to try and read the entire reference database (64GB) into my 16GB of RAM, and then crash, so this wasn't a good start.
I then tried CLAST, but was unable to successfully compile due to a C++ error that looked like a rabbit hole I really didn't want to go down. Next up was cudasw++, this installed fine but again, like BarraCUDA, it would attempt to read the whole reference file into memory.
Starting to see a theme here... Perhaps having one monolithic 64GB file isn't the best way to go about business. I decided to split the file up into six chunks to make it more manageable. Oddly this proved a lot more difficult than you would think. Several of the solutions I tried like pyfasta would read the whole file into memory again, which seems like a slight design oversight.
I then tried faSplit, which appeared to be working well, it created three 12GB files, and then hundreds of thousands of ~300byte files. This is how I found out that
rm can only take 100,000 arguments at a time. Not wanting to deal with cleaning up nearly a million files again I decided to try and use genome tools.
Mercifully, it worked perfectly. Although it did take quite a long time, but writing 64GB's to disk is never going to be fast with a mechanical drive.
I now had 11GB chunks of the nr database, so I decided to try cudasw++ again, however it segfaulted as it did before. I then went back to BarraCUDA and it also segfaulted. In one last ditch effort I split one of the 11GB files in half and tried again, but it still segfaulted in both applications.
Looks like there is a lot more work to go into getting GPU accelerated alignment than I had hoped. I'll probably just stick with diamond because of this.