This is the set of steps I have followed in an attempt to annotate the data I have been given and put it into a database. At the moment it is almost entirely a manual task, but if the scope of my project allows I would like to make this into a fully automated process that is compatible with the web front end that I produce.

Raw contigs

The first set of data is the reads from the sequencer, assembled into many contigs.

This produces candida_boidinii.fa.

Diamond blast against NCBI non-redundant database

The second set of data is generated by using diamond to compare the contigs against the nr database to find known genes.

./diamond makedb --in nr -d nr_ref
./diamond blastx -d nr_ref.dmnd -q candida_boidinii.fa -o blastx_nr_boidinii.m8 -p 8

This produces blastx_nr_boidinii.m8.
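The .m8 file is the classic 12-column tabular BLAST output that diamond emits by default: query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, and bit score. A minimal sketch of a parser for one such line (the field names are my own, the example values are taken from the annotated header shown later):

```javascript
// Parse one line of diamond's 12-column tabular (.m8) output.
// Field order follows the classic BLAST tabular format.
const parseM8Line = (line) => {
  const f = line.trim().split(/\s+/)
  return {
    query: f[0],                      // contig ID, e.g. C76265
    subject: f[1],                    // matched protein, e.g. XP_002553495.1
    identity: parseFloat(f[2]),       // percent identity
    length: parseInt(f[3], 10),       // alignment length
    mismatches: parseInt(f[4], 10),
    gapOpens: parseInt(f[5], 10),
    qStart: parseInt(f[6], 10),
    qEnd: parseInt(f[7], 10),
    sStart: parseInt(f[8], 10),
    sEnd: parseInt(f[9], 10),
    evalue: parseFloat(f[10]),
    bitScore: parseFloat(f[11])
  }
}

const hit = parseM8Line(
  'C76265\tXP_002553495.1\t23.7\t219\t154\t5\t696\t1331\t25\t237\t4.1e-05\t58.9')
console.log(hit.subject, hit.evalue)
```

Something like this could feed the automated version of the pipeline instead of the awk one-liners below.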

Select best matches from blast

Using awk, pull out only the best match for each contig from the diamond output. I need a proper bioinformatician to tell me whether this is okay or not.

awk '!_[$1]++ {print}' blastx_nr_boidinii.m8 > boidinii_best.m8  

This produces boidinii_best.m8.
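The awk pattern `!_[$1]++` is true only the first time a given query ID (column 1) is seen, so this keeps the first line per contig; it assumes diamond lists the best hit first for each query. The same first-seen filter sketched in JavaScript (not part of the pipeline, just to show the logic):

```javascript
// Keep only the first line for each query ID (first column),
// mirroring awk '!_[$1]++' — assumes the best hit comes first.
const bestHits = (lines) => {
  const seen = new Set()
  return lines.filter((line) => {
    const query = line.split(/\s+/)[0]
    if (seen.has(query)) return false   // already kept a hit for this contig
    seen.add(query)
    return true
  })
}

// Hypothetical hits, best first per contig:
const input = [
  'C1 XP_001 90.0',  // best hit for C1 (kept)
  'C1 XP_002 60.0',  // weaker hit for C1 (dropped)
  'C2 XP_003 75.0'   // best hit for C2 (kept)
]
console.log(bestHits(input))
```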

Get uniprot & Kegg IDs and proteins

Use awk to get the list of RefSeq protein IDs from the blast results and copy it to the clipboard:

awk '{print $2}' boidinii_best.m8 | xclip  

Then paste it into the uniprot ID mapping tool and search for RefSeq Protein -> UniProtKB.

Download the Mapping Table file to get the list of IDs that correspond to uniprot IDs.

Also download the fasta format of all the proteins found.

This produces boidinii_uniprot.fa & boidinii_uniprot_ids.lst.

Take the uniprot IDs and get the conversion to kegg IDs in the same way. This produces boidinii_kegg_ids.lst.
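If the manual look-ups become a bottleneck, the two downloaded mapping tables could be chained into a single RefSeq -> kegg lookup in code. A sketch, assuming each table is two tab-separated columns (source ID, target ID) per line; the example rows reuse IDs from the annotated header shown later, everything else is hypothetical:

```javascript
// Build a Map from a two-column, tab-separated mapping table.
const toMap = (tsv) =>
  new Map(
    tsv.trim().split('\n')
      .map((line) => line.split('\t'))
      .filter((cols) => cols.length === 2)
  )

// Chain RefSeq -> uniprot -> kegg into one lookup table.
const chain = (refseqToUniprot, uniprotToKegg) => {
  const out = new Map()
  for (const [refseq, uniprot] of refseqToUniprot) {
    const kegg = uniprotToKegg.get(uniprot)
    if (kegg) out.set(refseq, { uniprot, kegg })
  }
  return out
}

const r2u = toMap('XP_002553495.1\tC5DFV2')
const u2k = toMap('C5DFV2\tlth:KLTH0D18194g')
const lookup = chain(r2u, u2k)
console.log(lookup.get('XP_002553495.1'))
```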

Merge the data into a working file

Using Vim macros it is simple to combine the found data into a copy of the original fasta file.

The gene is now annotated with the original scaffold, the blast result, and the uniprot & kegg ID.

>C76265 63.0 | XP_002553495.1 23.7  219 154 5 696 1331  25  237 4.1e-05 58.9 | C5DFV2 | lth:KLTH0D18194g

This produces boidinii_working.fa.
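Towards automating this step, the Vim macros could eventually be replaced by code that rewrites each fasta header from the lookup tables. A sketch of building a header in the pipe-separated format shown above (the IDs come from the example line; the function name is my own):

```javascript
// Build an annotated fasta header in the working-file format:
//   >contig | blast result | uniprot ID | kegg ID
const annotateHeader = (contig, blast, uniprot, kegg) =>
  ['>' + contig, blast, uniprot, kegg]
    .filter((field) => field)   // drop annotations that were not found
    .join(' | ')

const header = annotateHeader(
  'C76265 63.0',
  'XP_002553495.1 23.7  219 154 5 696 1331  25  237 4.1e-05 58.9',
  'C5DFV2',
  'lth:KLTH0D18194g'
)
console.log(header)
```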

Create a database with the found data

Now we have two files that need to be ingested into a database. The boidinii_working.fa & boidinii_uniprot.fa files should contain all the data that the website needs.

I have created a quick and nasty script to put this data into MongoDB.

const fasta2json = require('fasta2json')
    , MongoClient = require('mongodb').MongoClient
    , url = 'mongodb://localhost:27017/candida'

let proteins = fasta2json.ReadFasta('../data/boidinii/boidinii_uniprot.fa')
  , species = []

// Replace the library's parser so the annotated headers are split into fields
fasta2json.ParseFasta = (str) => {
  let fasta = []
    , seqs = str.split('>')

  for (let i = 1; i < seqs.length; i++) {
    seqs[i] = seqs[i].replace(/\r/g, '')

    let seq = {}
      , fas = seqs[i].split('\n')
      , head = fas[0]

    seq.contig = head.split('|')[0].trim()
    seq.blast = (head.split('|')[1] || 'No blast result').trim()
    seq.uniprot = (head.split('|')[2] || 'No uniprot match').trim()
    seq.kegg = (head.split('|')[3] || 'No kegg match').trim()
    seq.sequence = ''

    // Concatenate the sequence lines that follow the header
    for (let j = 1; j < fas.length; j++) {
      seq.sequence = seq.sequence + fas[j]
    }

    // Attach the full protein record from the uniprot fasta, if any
    seq.protein = proteins.filter((obj) => {
      let proteinId = obj.head.split('|')[1]
      return proteinId === seq.uniprot
    })[0] || 'No uniprot match'

    fasta.push(seq)
  }

  return fasta
}

species = fasta2json.ReadFasta('../data/boidinii/boidinii_working.fa')

MongoClient.connect(url, (err, db) => {
  if (err) console.log('ERROR: ' + err)
  let collection = db.collection('boidinii')

  collection.insertMany(species, (err, result) => {
    if (err) console.log('ERROR: ' + err)
    db.close()
  })
})

This provides the following data:

> db.boidinii.findOne({ uniprot: "K4AC16" })
{
        "_id" : ObjectId("58b96c298660061466e9be53"),
        "contig" : "C69229  4.0",
        "blast" : "XP_004983205.1 91.7  36  3 0 146 39  311 346 6.1e-09 68.2",
        "uniprot" : "K4AC16",
        "kegg" : "sita:101760397",
        "protein" : {
                "head" : "tr|K4AC16|K4AC16_SETIT Uncharacterized protein...",
                "seq" : "MNIASAALVFLAHCLLLHRCMGSEA..."
        }
}

Build a web front end

Then I just have to make a web front end to make this data accessible, including viewing, searching and comparing genes across the species.

This is my next big challenge. I will create a list of stories for this task.