Split fasta files based on header

I have 1,500 fasta files with many protein fragments in them. My goal is to separate these fragments into single files and to name these files something intuitive.

Here is an example of a fasta file that I have called plate9.H7.faa:

>39_fragment_4_295  (310978..311196)    1   None    hypothetical protein
MQTATKQETYDRTMKVTLAVKANGGSVTVQIQAGDNWITTDTFWKDGGYQLSIPPATIRYVPAAGAAFEVYA*
>39_fragment_4_296  (311193..312437)    1   VOG01158    REFSEQ hypothetical protein
MSLLVNPIPRRQPIRRGLGLLGDSFSGNCHTIAATAFGTEAYGYAGWIAARTGLFPSYVDNQGKLGDHTGQFLARLPACIASSTADLWLLLSRTNDSTTAGMSLADTKANVMKIVTAFLNTPGKYLIIGTGTPRFGSRALTGQALADAIAYKDWVLSYVSQFVPVVNIWDGFTEAMTVEGLHPNLLGAEFISSRVVPIITANFEFPGIPLPTDAGDIYSAIRPFGCLNANPLLAGTGGTLPAGVNAAAGSVLADGYKAVGSGLTGITTRWFKEPAAYGEAQCIELRGNMAAAGGYIYMQPTANVVQTNLAAGDVIEMVSAVEIMGSSRGILAWEAELTITKTVSGAASTFYYRSMDKYQEPFTMPASFSGALETQRGTIDLTETVITSRMGLYLAAGVPQDSTVKAAQFGIRKV*
>56_fragment_9_667  (768674..769846)    -1  K14059  int; integrase
MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*
>56_fragment_9_668  (770054..770281)    -1  PF02599.16  Global regulator protein family
MLCLSRRVGESIVIGDNIKITVISGRDGQIRLGIDAPAELAVDRSEVRTAKLATPCGIGLKLRTVAESGARDDEG*
>56_fragment_9_669  (770485..770697)    1   None    hypothetical protein
MECTTTADEVYGPRNAKLGKRAVDGNIWSGTTMIFRIIDDRVYSMHEQYLGRLKYGMAMTDRGELIFIVR*
>56_fragment_9_670  (770705..771487)    -1  VOG00563    sp|Q05292|VG77_BPML5 Gene 77 protein
MSESTIDPKKLERAIRKIKHCLALSQSSNENEAATAMRQAQALMREYHLTETDVKVSDVGEVESSMSRAARRPLWDQQLSAVVATVFNVKALRYTHWCETKKNRVERAKFVGVSPAQHIALYAYETLLAKLSQARNAYVAGVRAGKFRSSYSAPTAGDHFAIAWVFAVESKLQQLVPRGEENTTPEYKGAGPGLVAVEAQHQALIDSYLADKQVGKARKVRGSELDLNAQIAGMLAGTKVDLHAGLANGAEHAQVLPASA*

So far I have been able to split the files into many files with this command:

for x in *.faa; do csplit -z $x '/>/' '{*}'; done

And then rename them according to their fragment in the header:

for file in xx*; do mv "$file" `head -1 "$file" | cut -d$'\t' -f 1`_$x.fasta; done

And then rename each file to not have the ‘>’ from each file, along with assigning it the original filename:

for i in *.fasta; do mv $i `echo $i | cut -c 2-`; done

My problem is that this only works on a single file at a time, since the directory I am working in fills with temporary files called xx00, xx01, xx02, xx03, and so on.

I feel like my solution would be to loop through each fasta file and do all of these for loops in succession before starting the next fasta file, and I feel like that would have to be a nested for loop which I have never done myself. Any guidance for what I could do would be appreciated.

Answer

You will improve performance by using a tool that doesn’t require files to be opened and closed all the time. Awk is an excellent choice for this.

It seems to me that similar results to what you have written could be achieved with:

$ awk '/^>/ { file=substr($1,2) ".fasta" } { print > file }' *.faa

Note that unless you close() a file, awk leaves it open until the awk process is done. With the solution above, if the same fragment name appears in more than one input file, all of those records are appended to the same output file.
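If you want the source file name in each output name, as in the question, and would rather not keep every output file open, one possible variation closes the previous output before starting the next record. This is only a sketch: the function name split_fasta and the naming scheme (fragment ID, underscore, input name without .faa) are assumptions for illustration, not something taken from the question.

```shell
# Hypothetical helper: split each FASTA file given as an argument into
# one file per record, written to the current directory.
# Assumed output name: <first header field without '>'>_<input name without .faa>.fasta
split_fasta() {
    awk '
        FNR == 1 {                      # first line of each input file
            base = FILENAME
            sub(/.*\//, "", base)       # drop any directory part
            sub(/\.faa$/, "", base)     # drop the extension
        }
        /^>/ {                          # a header line starts a new record
            if (out != "") close(out)   # release the previous output file
            out = substr($1, 2) "_" base ".fasta"
        }
        out != "" { print > out }       # ignore anything before the first header
    ' "$@"
}
```

You would call it as split_fasta *.faa. Because each output is closed and later names are opened with ">", a fragment ID that occurred twice within the same input file would overwrite its earlier copy, so this assumes IDs are unique within a file.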

If you have a very large number of these (tens of thousands), then *.faa might expand to too many files for your shell to handle on one command line. If that’s the case, you could process things more slowly using find.
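A find-based version might look roughly like the following. It is a sketch that assumes all the .faa files sit in the current directory and that your find supports -maxdepth (GNU and BSD find do); the wrapper function name is hypothetical.

```shell
# Hypothetical wrapper: let find hand the .faa files to awk in batches,
# so that no single command line grows too long.
split_with_find() {
    find . -maxdepth 1 -type f -name '*.faa' -exec awk '
        /^>/ { file = substr($1, 2) ".fasta" }
        file != "" { print >> file }    # append, since awk may be run more than once
    ' {} +
}
```

The ">>" matters here: find may start awk several times, and a plain ">" would truncate an output file that an earlier batch had already written. For the same reason, run it in a directory that does not already contain .fasta output from a previous attempt.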