This is the set of steps I have followed to annotate the data I have been given and put it into a database. At the moment it is almost entirely a manual task, but if the scope of my project allows I would like to turn it into a fully automated process that is compatible with the web front end I produce.
Raw contigs
The first set of data is the reads from the sequencer, assembled into many contigs.
This produces candida_boidinii.fa.
Diamond blast against NCBI non-redundant database
The second set of data is generated by using DIAMOND to compare the contigs against the NCBI nr database to find known genes.
./diamond makedb --in nr -d nr_ref
./diamond blastx -d nr_ref.dmnd -q candida_boidinii.fa -o blastx_nr_boidinii.m8 -p 8
This produces blastx_nr_boidinii.m8.
Select best matches from blast
Using awk, pull out only the best match for each contig from the DIAMOND output. The tabular output follows the BLAST m8 layout, with the query (contig) ID in column 1 and the subject (RefSeq protein) ID in column 2, and the hits for each query should be listed best-first, so keeping only the first line per query ID should keep the top hit. I need a proper bioinformatician to tell me whether this is okay or not.
awk '!_[$1]++ {print}' blastx_nr_boidinii.m8 > boidinii_best.m8
This produces boidinii_best.m8.
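Since I would eventually like the whole pipeline automated in Node, the same filter could also be written in JavaScript. This is only a sketch of the idea, relying on the same assumption as the awk one-liner (hits for each query are listed best-first), and best-hits.js is a made-up name.

// best-hits.js - sketch: keep only the first (best) hit per query in the DIAMOND output.
const fs = require('fs')

const lines = fs.readFileSync('blastx_nr_boidinii.m8', 'utf8').split('\n')
const seen = new Set()
const best = []

for (const line of lines) {
  if (!line.trim()) continue
  const query = line.split('\t')[0]  // column 1 is the query (contig) ID
  if (!seen.has(query)) {            // first line for a query is its best hit
    seen.add(query)
    best.push(line)
  }
}

fs.writeFileSync('boidinii_best.m8', best.join('\n') + '\n')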
Get UniProt & KEGG IDs and proteins
Use awk to get the list of RefSeq protein IDs from the blast results
awk '{print $2}' boidinii_best.m8 | xclip
Then paste the list into the UniProt ID mapping tool and map RefSeq Protein -> UniProtKB.
Download the mapping table file to get the list of RefSeq IDs and their corresponding UniProt IDs.
Also download all of the matched proteins in FASTA format.
This produces boidinii_uniprot.fa & boidinii_uniprot_ids.lst.
Take the UniProt IDs and get the conversion to KEGG IDs in the same way. This produces boidinii_kegg_ids.lst.
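If the scope allows for automating this step, UniProt also offers its ID mapping as a web service. The sketch below is only an idea: the endpoint and the from/to parameter values are assumptions based on UniProt's documented mapping service and would need checking, and boidinii_refseq_ids.lst is a made-up filename for the ID list that is currently piped to xclip.

// map-ids.js - rough idea for automating the RefSeq -> UniProt mapping step.
// The endpoint and parameter values below are assumptions and need checking
// against UniProt's current ID mapping documentation.
const fs = require('fs')
  , https = require('https')
  , querystring = require('querystring')

// Hypothetical file holding the RefSeq protein IDs (currently piped to xclip above).
const refseqIds = fs.readFileSync('boidinii_refseq_ids.lst', 'utf8').trim().split('\n')

const params = querystring.stringify({
  from: 'P_REFSEQ_AC',        // RefSeq Protein (assumed identifier code)
  to: 'ACC',                  // UniProtKB accession (assumed identifier code)
  format: 'tab',
  query: refseqIds.join(' ')  // a large list would need chunking or a POST request
})

https.get('https://www.uniprot.org/uploadlists/?' + params, (res) => {
  let body = ''
  res.on('data', (chunk) => body += chunk)
  res.on('end', () => fs.writeFileSync('boidinii_uniprot_ids.lst', body))
})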
Merge the data into a working file
Using Vim macros it is simple to combine the found data into a copy of the original FASTA file (a sketch of how this could be scripted instead follows this step).
Each gene is now annotated with the original scaffold, the blast result, and the UniProt & KEGG IDs:
>C76265 63.0 | XP_002553495.1 23.7 219 154 5 696 1331 25 237 4.1e-05 58.9 | C5DFV2 | lth:KLTH0D18194g
AAAAAAAAAAAAATGTTCAGTCAAAAATAAGCTAATTTACCGTACAATGGCATGCATATGCGACAAGGTTCTTTTTTTCTGTTGTTTAGCAAATGCAGTA
AACCAGTGGTTATACATTCATCATTAGGTGGTACTCTAAATCTGTCTTTATAATCCATCTTTTATCCATAAGTGAAGCTGAAAAGGCTGAAAGTCCTTTT
This produces boidinii_working.fa.
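To eventually replace the Vim macros, the merge could also be scripted. The sketch below is only my rough idea, not the method used above: it assumes the two mapping tables are plain two-column tab-separated files (RefSeq -> UniProt and UniProt -> KEGG), and build-working-fasta.js is a made-up name.

// build-working-fasta.js - hypothetical sketch of the merge step done in code.
// Assumes boidinii_uniprot_ids.lst and boidinii_kegg_ids.lst are two-column,
// tab-separated mapping tables; adjust if the downloaded format differs.
const fs = require('fs')

// Load a two-column tab-separated file into a Map of first column -> second column.
const loadMap = (path) => {
  const map = new Map()
  for (const line of fs.readFileSync(path, 'utf8').split('\n')) {
    const [from, to] = line.trim().split('\t')
    if (from && to) map.set(from, to)
  }
  return map
}

// Best DIAMOND hit per contig: column 1 is the contig, the rest becomes the blast field.
const blast = new Map()
for (const line of fs.readFileSync('boidinii_best.m8', 'utf8').split('\n')) {
  if (!line.trim()) continue
  const cols = line.split('\t')
  blast.set(cols[0], { refseq: cols[1], rest: cols.slice(1).join(' ') })
}

const refseqToUniprot = loadMap('boidinii_uniprot_ids.lst')
const uniprotToKegg = loadMap('boidinii_kegg_ids.lst')

// Rewrite each header of the original assembly with ' | blast | uniprot | kegg' appended.
const out = []
for (const entry of fs.readFileSync('candida_boidinii.fa', 'utf8').split('>')) {
  if (!entry.trim()) continue
  const lines = entry.split('\n')
  const head = lines[0]
  const contig = head.split(' ')[0]
  const hit = blast.get(contig)
  const uniprot = hit && refseqToUniprot.get(hit.refseq)
  const kegg = uniprot && uniprotToKegg.get(uniprot)
  let newHead = '>' + head
  if (hit) newHead += ' | ' + hit.rest
  if (uniprot) newHead += ' | ' + uniprot
  if (kegg) newHead += ' | ' + kegg
  out.push(newHead + '\n' + lines.slice(1).join('\n'))
}

fs.writeFileSync('boidinii_working.fa', out.join(''))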
Create a database with the found data
Now we have two files that need to be ingested into a database. The boidinii_working.fa & boidinii_uniprot.fa files should contain all the data that the website needs.
I have created a quick and nasty script to put this data into MongoDB.
const fasta2json = require('fasta2json')
  , MongoClient = require('mongodb').MongoClient
  , url = 'mongodb://localhost:27017/candida'

// Read the UniProt protein FASTA with the library's default parser,
// before that parser is replaced below.
let proteins = fasta2json.ReadFasta('../data/boidinii/boidinii_uniprot.fa')
  , species = []

// Replace the library's parser with one that understands the working-file
// headers: contig | blast result | uniprot ID | kegg ID
fasta2json.ParseFasta = (str) => {
  let fasta = []
    , seqs = str.split('>')
  for (let i = 1; i < seqs.length; i++) {
    seqs[i] = seqs[i].replace(/\r/g, '')
    let seq = {}
      , fas = seqs[i].split('\n')
      , head = fas[0]
    // Header fields are separated by '|'; missing fields get placeholder text.
    seq.contig = head.split('|')[0].trim()
    seq.blast = (head.split('|')[1] || 'No blast result').trim()
    seq.uniprot = (head.split('|')[2] || 'No uniprot match').trim()
    seq.kegg = (head.split('|')[3] || 'No kegg match').trim()
    // Everything after the header line is the nucleotide sequence.
    seq.sequence = ''
    for (let j = 1; j < fas.length; j++) {
      seq.sequence = seq.sequence + fas[j]
    }
    // Attach the matching UniProt protein record, if one was found.
    seq.protein = proteins.filter((obj) => {
      let proteinId = obj.head.split('|')[1]
      return proteinId === seq.uniprot
    })[0] || 'No uniprot match'
    fasta.push(seq)
  }
  return fasta
}

// Parse the annotated working file with the replacement parser.
species = fasta2json.ReadFasta('../data/boidinii/boidinii_working.fa')

MongoClient.connect(url, (err, db) => {
  if (err) return console.log('ERROR: ' + err)
  let collection = db.collection('boidinii')
  // Drop any previous copy of the collection before loading the new documents.
  collection.drop(() => {
    collection.insertMany(species, (err, result) => {
      if (err) console.log('ERROR: ' + err)
      db.close()
    })
  })
})
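With MongoDB running locally on the default port (the URL the script connects to), it can be run with something like node load_boidinii.js, where the filename is just a placeholder.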
This gives the following data:
> db.boidinii.findOne({ uniprot: "K4AC16" })
{
"_id" : ObjectId("58b96c298660061466e9be53"),
"contig" : "C69229 4.0",
"blast" : "XP_004983205.1 91.7 36 3 0 146 39 311 346 6.1e-09 68.2",
"uniprot" : "K4AC16",
"kegg" : "sita:101760397",
"sequence" : "GAGGATCCTAACAATCTAGTAAGGCTAG...",
"protein" : {
"head" : "tr|K4AC16|K4AC16_SETIT Uncharacterized protein...",
"seq" : "MNIASAALVFLAHCLLLHRCMGSEA..."
}
}
>
Build a web front end
Then I just have to build a web front end that makes this data accessible, including viewing, searching and comparing genes across the species.
This is my next big challenge. I will create a list of stories for this task.
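As a first thought on the server side of the front end, a search endpoint over the MongoDB collection might look roughly like the sketch below. This is only an idea, not something built yet: it assumes Express is used, and the route is made up, while the field names mirror the documents loaded above.

// server.js - hypothetical sketch of a search endpoint over the boidinii collection.
// Assumes Express and the mongodb driver; nothing here is built yet.
const express = require('express')
  , MongoClient = require('mongodb').MongoClient
  , url = 'mongodb://localhost:27017/candida'

MongoClient.connect(url, (err, db) => {
  if (err) return console.log('ERROR: ' + err)
  const app = express()

  // e.g. GET /genes/boidinii?uniprot=K4AC16 or /genes/boidinii?kegg=sita:101760397
  app.get('/genes/:species', (req, res) => {
    let query = {}
    if (req.query.uniprot) query.uniprot = req.query.uniprot
    if (req.query.kegg) query.kegg = req.query.kegg
    db.collection(req.params.species).find(query).toArray((err, genes) => {
      if (err) return res.status(500).send(err.toString())
      res.json(genes)
    })
  })

  app.listen(3000)
})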