Here are some examples of how to set up correct rules for parsing FASTA headers.
Ensembl Arabidopsis thaliana cDNA:
>ATMG00010.1 cdna:known chromosome:TAIR10:Mt:273:734:-1 gene:ATMG00010 transcript:ATMG00010.1 description:"Uncharacterized mitochondrial protein AtMg00010"
This header will be interpreted as follows:
>(Value 1) cdna:(Value 2) chromosome:(Value 3):(Value 4):(Value 5):(Value 6):(Value 7) gene:(Value 8) transcript:(Value 9) description:"(Value 10)"
The most important parts of a header for parsing are not the values, because they vary from record to record. The separators between values (e.g. '>', ' cdna:', ' chromosome:') are constant, so it is essential to select each value in full: the text left between the selected values is what defines the separators.
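The idea of constant separators can be illustrated with a minimal Python sketch. It is not the tool's actual implementation; the separator list and the build_pattern helper are assumptions made for illustration. The sketch joins the constant separators with capture groups and recovers the ten values from the example header.

```python
import re

# Constant separators from the example header, in order of appearance
# (hypothetical list, written by hand for this illustration).
SEPARATORS = [
    ">", " cdna:", " chromosome:", ":", ":", ":", ":",
    " gene:", " transcript:", ' description:"', '"',
]

def build_pattern(separators):
    """Place a non-greedy capture group between each pair of separators."""
    body = "(.*?)".join(re.escape(sep) for sep in separators)
    return re.compile("^" + body + "$")

HEADER = (
    ">ATMG00010.1 cdna:known chromosome:TAIR10:Mt:273:734:-1 "
    "gene:ATMG00010 transcript:ATMG00010.1 "
    'description:"Uncharacterized mitochondrial protein AtMg00010"'
)

# One captured string per "(Value N)" slot of the template.
values = build_pattern(SEPARATORS).match(HEADER).groups()
for number, value in enumerate(values, start=1):
    print(f"Value {number}: {value}")
```

Because only the separators are fixed in the pattern, any header that keeps the same separators is split the same way, whatever its values are.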
Here is an example of the same header with wrong separators:
>ATMG00010.1 cdna:known chromosome:TAIR10:Mt:273:734:-1 gene:ATMG00010 transcript:ATMG00010.1 description:"Uncharacterized mitochondrial protein AtMg00010"
Because we selected only part of the transcript name (ATMG00010 instead of ATMG00010.1), the separator will be recognized as '.1 cdna:'. This means that if some transcripts have a name like ATMG00010.2, the system cannot recognize the separator, and the resulting values will be unpredictable.
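The effect of a wrongly chosen separator can be reproduced with a short Python sketch. The two regular expressions are assumptions that stand in for the correct and the wrong rule, and the second header is a made-up variant that differs only in its transcript version.

```python
import re

# Correct rule: the whole transcript name is a value, the separator is ' cdna:'.
GOOD = re.compile(r"^>(.*?) cdna:(.*?) chromosome:")
# Wrong rule: only 'ATMG00010' was selected, so '.1' became part of the separator.
BAD = re.compile(r"^>(.*?)\.1 cdna:(.*?) chromosome:")

first = ">ATMG00010.1 cdna:known chromosome:TAIR10:Mt:273:734:-1"
second = ">ATMG00010.2 cdna:known chromosome:TAIR10:Mt:273:734:-1"  # hypothetical header

good_first = GOOD.match(first)
good_second = GOOD.match(second)
bad_first = BAD.match(first)
bad_second = BAD.match(second)  # no '.1 cdna:' in this header, so there is no match

print(good_first.group(1), good_second.group(1))
print(bad_first.group(1), bad_second)
```

The wrong rule works on the header it was built from, which is why the mistake is easy to miss until a header with a different version suffix is parsed.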
TAIR protein header:
Here we can find two types of headers. The first one looks like this:
>AT1G51370.2 | Symbols: | F-box/RNI-like/FBD-like domains-containing protein | chr1:19045615-19046748 FORWARD LENGTH=346
And the second one includes symbols:
>AT1G75120.1 | Symbols: RRA1 | Nucleotide-diphospho-sugar transferase family protein | chr1:28197022-28198656 REVERSE LENGTH=402
If a file contains more than one type of header, we can create a separate parser for each type. Headers of different types can be added to parsers from the "Headers browser" page or from the "Converted headers browser" page, which appears right after the first conversion (that page shows how the headers have been parsed).
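One way to picture several parsers working on the same file is the Python sketch below. It is only an illustration under stated assumptions: the PARSERS table, the group names and the parse_header function are hypothetical and do not describe how the tool stores its rules. Each header is tried against every parser in turn, and the first one whose separators all match is used.

```python
import re

# Hypothetical parser definitions: one regular expression per header type.
PARSERS = {
    "tair_without_symbols": re.compile(
        r"^>(?P<id>.*?) \| Symbols: \| (?P<description>.*?) \| "
        r"(?P<location>.*?) (?P<strand>.*?) LENGTH=(?P<length>.*)$"
    ),
    "tair_with_symbols": re.compile(
        r"^>(?P<id>.*?) \| Symbols: (?P<symbols>.*?) \| (?P<description>.*?) \| "
        r"(?P<location>.*?) (?P<strand>.*?) LENGTH=(?P<length>.*)$"
    ),
}

def parse_header(header):
    """Return (parser name, values) for the first parser that matches."""
    for name, pattern in PARSERS.items():
        match = pattern.match(header)
        if match:
            return name, match.groupdict()
    return None, {}

HEADERS = [
    ">AT1G51370.2 | Symbols: | F-box/RNI-like/FBD-like domains-containing protein"
    " | chr1:19045615-19046748 FORWARD LENGTH=346",
    ">AT1G75120.1 | Symbols: RRA1 | Nucleotide-diphospho-sugar transferase family protein"
    " | chr1:28197022-28198656 REVERSE LENGTH=402",
]

results = [parse_header(header) for header in HEADERS]
for name, values in results:
    print(name, values)
```

A header that matches none of the parsers comes back with no parser name, which marks it as a candidate for a new parser.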
UniProtKB header:
>sp|Q8I6R7|ACN2_ACAGO Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana GN=acantho2 PE=1 SV=1