Since being given the data, I have been using my limited biology knowledge to come up with theories as to how it was created, what it means, and how it would best be used.

This process has taken about three weeks, and I spent a good chunk of that time agonising over the Blast2GO outputs I had been given. I thought they were the key to the puzzle.

However, it soon became apparent that what I was in fact looking at was just BLAST results in XML format! Something I could have easily replicated in seconds. Reeling from this, I had a good sit down with my tutor and discovered that the data actually included all of the annotated proteins, and their coding sequences!

I had it all along!

Progress

Now that I knew I had pretty much all the data I needed, it was time to start building. But after talking with the researchers about how they use the Candida Genome Database (CGD), I thought it would be a good idea to ensure that any genes found could be linked back to it as a reference.

To get this data I used DIAMOND to blastx the coding sequences I had for each species against the proteins of the very maturely annotated C. albicans reference from the Candida Genome Database.

# Clean up outputs from any previous run (-f so a fresh run doesn't complain)
rm -f albicans_ref.dmnd
rm -f ../shehate/blastx_shehate_codingseq_albicans_proteins.json
rm -f ../tropicalis/blastx_tropicalis_codingseq_albicans_proteins.json
rm -f ../boidinii/blastx_boidinii_codingseq_albicans_proteins.json

# Build a DIAMOND database from the C. albicans reference proteins
./diamond makedb --in candida_albicans_proteins.fa -d albicans_ref -p 8

# blastx each species' coding sequences against the reference
# (-f 5 = BLAST XML output, -k 1 = keep only the best hit)
./diamond blastx -d albicans_ref.dmnd -q ../shehate/candida_shehate_codingseq.fa -o ../shehate/blastx_shehate_codingseq_albicans_proteins.xml -p 8 -f 5 -k 1
./diamond blastx -d albicans_ref.dmnd -q ../tropicalis/candida_tropicalis_codingseq.fa -o ../tropicalis/blastx_tropicalis_codingseq_albicans_proteins.xml -p 8 -f 5 -k 1
./diamond blastx -d albicans_ref.dmnd -q ../boidinii/candida_boidinii_codingseq.fa -o ../boidinii/blastx_boidinii_codingseq_albicans_proteins.xml -p 8 -f 5 -k 1

# Convert the XML results to JSON, then drop the intermediate XML
xml2json ../shehate/blastx_shehate_codingseq_albicans_proteins.xml ../shehate/blastx_shehate_codingseq_albicans_proteins.json
xml2json ../tropicalis/blastx_tropicalis_codingseq_albicans_proteins.xml ../tropicalis/blastx_tropicalis_codingseq_albicans_proteins.json
xml2json ../boidinii/blastx_boidinii_codingseq_albicans_proteins.xml ../boidinii/blastx_boidinii_codingseq_albicans_proteins.json

rm ../shehate/blastx_shehate_codingseq_albicans_proteins.xml
rm ../tropicalis/blastx_tropicalis_codingseq_albicans_proteins.xml
rm ../boidinii/blastx_boidinii_codingseq_albicans_proteins.xml

This gave me a link to CGD for almost every gene that had been found in the three species. With some mapping-fu, I was able to create a JSON file that maps every ID to a description, a name, and even a UniProt ID.

{"CGDID":"CAL0000196141","GeneName":"AAF1","GeneID":"C3_06470W_A","uniprot":"P46589"},
{"CGDID":"CAL0000194200","GeneName":"AAH1","GeneID":"C2_06970W_A","uniprot":"A0A1D8"},
{"CGDID":"CAL0000188193","GeneName":"AAP1","GeneID":"C3_03990C_A","uniprot":"A0A1D8"},
{"CGDID":"CAL0000198574","GeneName":"AAT1","GeneID":"C2_05250C_A","uniprot":"A0A1D8"},
{"CGDID":"CAL0000174616","GeneName":"AAT21","GeneID":"CR_07620W_A","uniprot":"Q59N40"},
{"CGDID":"CAL0000188482","GeneName":"AAT22","GeneID":"C4_01200C_A","uniprot":"A0A1D8"},

Now the fun can begin... I have all the data in one place!

Building the database

I had already created a couple of database import scripts, but now I had to modify them to accept the new data. This wasn't any trouble at all, but the coding sequences brought a new challenge: highlighting each coding sequence in its place in the contig.

This was one of the more difficult parts of the project, as I had to come up with a way to select the required regions in the contigs as efficiently, and as accurately, as possible. I played around with a few ideas, but the one that stuck was arguably the simplest.

By taking the first and last twelve bases of each coding sequence, searching for them in the contig, and marking their positions, I was able to get over half of the coding sequences highlighted in their contigs. Not ideal, but to get a better percentage I would have to account for any introns (non-coding regions), or use a more dynamic method of selection, which would have made it much less efficient. With my chosen method it only takes about two minutes to import all of the genes from the three species.
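In essence the method boils down to something like this (a minimal sketch; the function and parameter names are my own illustration, not the actual import code):

// Sketch: locate a coding sequence in its contig by matching the
// first and last twelve bases of the coding sequence.
function locateCodingSequence(contig, codingSeq, k = 12) {
  const head = codingSeq.slice(0, k);               // first twelve bases
  const tail = codingSeq.slice(-k);                 // last twelve bases
  const start = contig.indexOf(head);
  if (start === -1) return null;                    // prefix not found
  const tailPos = contig.indexOf(tail, start + k);  // search beyond the start
  if (tailPos === -1) return null;                  // suffix not found
  return { start, end: tailPos + k };               // region to mark for highlighting
}

The plain indexOf searches are what keep the import fast; anything fuzzier would trade away most of that speed.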

Front end needs some love

With the data in the database, my attention turned to the front end of the website. There were three clear tasks:

  • Make everything look a little prettier, and work well with mobile
  • Highlight coding regions in contigs in red
  • Copy the coding regions, with +/- bases around them, to the clipboard

For the first task I used Bootstrap to lay out my form and data tables; its CSS really makes a project look so much better just by adding it! I also made a logo for the project when my mind wasn't up to much more than colouring!

[Image: project logo]

Then I fixed a lot of the small niggles that you only notice when you interact with a webpage a lot: things like searches not persisting in the form fields after they have been run, and a couple of other small UX bugs.

Creating the highlighting proved a little tricky to get done nicely, but with a bit of regex and some span tags it was possible! I did have to go back and change my schema a little to include more information about the positions, but I'm rather pleased with my end solution.
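Conceptually it's just string surgery using the stored positions; something along these lines (a sketch, with illustrative names):

// Sketch: wrap the stored coding region in a span for highlighting.
// The start/end positions come from the import step.
function highlightContig(contigSeq, { start, end }) {
  return contigSeq.slice(0, start)
    + '<span class="coding-region">'  // styled red in the stylesheet
    + contigSeq.slice(start, end)
    + '</span>'
    + contigSeq.slice(end);
}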

The next feature is one I'm really happy with, as I don't think the researchers will have expected it, but I bet they will love it (fingers crossed!): a button that copies not just the coding sequence to the clipboard, but optionally a user-defined number of bases up- and downstream of it in the contig as well.
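Under the hood it's little more than a slice plus the browser Clipboard API (a sketch; the function name and parameters are illustrative):

// Sketch: copy the coding sequence plus `flank` bases either side.
function copyWithFlanks(contigSeq, start, end, flank) {
  const from = Math.max(0, start - flank);             // clamp at the contig start
  const to = Math.min(contigSeq.length, end + flank);  // clamp at the contig end
  return navigator.clipboard.writeText(contigSeq.slice(from, to));
}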

Adjusting the search

I also felt like the search could do with some love, so I updated how it works. Before, it was using a really slow and inelegant solution: a regex for every search field. A search might have looked like this:

{ description: /arabinose/i, name: /SEN1/i, uniprot: /KVY86X/i }

This had one upside: you could get a very granular search, looking for very specific combinations of fields across the entire database. The trouble is that it is terribly inefficient, as it effectively runs a regex over the whole database on every query!

Instead I opted for a single text search field backed by weighted indexes on name, Candida genome ID and description. This made one super search field that returns meaningful results for several kinds of string, without a huge slowdown.
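Assuming a MongoDB backend (which the regex queries above suggest), the index looks something like this; the field names and weights here are illustrative rather than my exact schema:

// Sketch: one weighted text index across the searchable fields.
db.genes.createIndex(
  { GeneName: "text", CGDID: "text", description: "text" },
  { weights: { GeneName: 10, CGDID: 5, description: 1 } }  // name matches rank highest
);

// The single super search field then becomes one $text query:
db.genes.find({ $text: { $search: "arabinose" } });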

This was then enhanced by adding a regex field for coding sequences, so you can narrow down your search with a sequence of nucleotide bases that you might find in a coding sequence. There is also now an option to filter the results by species, and to limit the number of results.
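Put together, a full query might look something like this (again a sketch with illustrative names):

// Sketch: text search narrowed by a coding-sequence regex,
// a species filter, and a result limit.
db.genes
  .find({
    $text: { $search: "arabinose" },
    codingSeq: /ATGGCTAA/,   // nucleotide bases to match
    species: "tropicalis",
  })
  .limit(20);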

Conclusions

I could go into a lot more detail about how this was all done, but I think I will save that for my report, which I can hopefully crack on with now that the project is at least in a semi-functional state. Although I do need to check with my tutor and the researchers that I haven't missed a key feature!

:wq