Architectural decision record 001 - Data Standardization
This document collects the data standardization strategies we considered for the various prediction tasks.
We list every option we considered and justify the final decision.
The final decision is marked in this document and can also be found in the data_standardization.md file in the base directory.
Standardization strategies overview
Residue -> Class
Residue -> Class standardization
1. 2 Fasta files (sequences.fasta, labels.fasta)
sequences.fasta
>Seq1
SEQWENCE
labels.fasta
>Seq1 SET=train VALIDATION=False
DVCDVVDD
PRO:
- Easy mapping of residue -> class
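The residue-to-class mapping from this option can be sketched as follows. This is a minimal illustration, not the project's actual loader: the helper names `parse_fasta` and `pair_residues_with_classes` are hypothetical, and the sketch assumes header attributes are whitespace-separated `KEY=VALUE` pairs as in the example above.

```python
def parse_fasta(text):
    """Parse FASTA text into {sequence_id: (attributes, body)}.

    Assumes headers look like '>ID KEY=VALUE KEY=VALUE' (hypothetical helper).
    """
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First token is the ID, remaining tokens are KEY=VALUE attributes.
            seq_id, *fields = line[1:].split()
            attributes = dict(field.split("=", 1) for field in fields)
            records[seq_id] = (attributes, "")
        elif seq_id is not None:
            # Body lines are concatenated so multi-line records also work.
            attributes, body = records[seq_id]
            records[seq_id] = (attributes, body + line)
    return records


def pair_residues_with_classes(sequences_text, labels_text):
    """Zip every residue with the class at the same position."""
    sequences = parse_fasta(sequences_text)
    labels = parse_fasta(labels_text)
    paired = {}
    for seq_id, (_, sequence) in sequences.items():
        attributes, classes = labels[seq_id]
        if len(sequence) != len(classes):
            raise ValueError(f"{seq_id}: sequence and label lengths differ")
        paired[seq_id] = (attributes, list(zip(sequence, classes)))
    return paired


sequences_fasta = ">Seq1\nSEQWENCE\n"
labels_fasta = ">Seq1 SET=train VALIDATION=False\nDVCDVVDD\n"
paired = pair_residues_with_classes(sequences_fasta, labels_fasta)
```

Because both files share the same sequence ID and the label string has one character per residue, the mapping reduces to a positional `zip`, which is what makes this option easy to consume.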
Residue -> Value
Residue -> Value standardization
1. 1 single CSV file
sequence, values, set, validation
PRTEIN, 0.5;0.3;0.2;0.1;1.5;0.01, train, False
PRO:
- Only one file
CON:
- The file becomes very large and hard to read
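Reading the single-CSV layout could look like the sketch below. It assumes the per-residue values are separated by semicolons inside the `values` column, as in the example row, and that fields may carry a space after each comma. The function name `read_residue_values` is hypothetical.

```python
import csv
import io


def read_residue_values(csv_text):
    """Parse rows of 'sequence, values, set, validation' (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma delimiter.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for row in reader:
        sequence = row["sequence"].strip()
        values = [float(value) for value in row["values"].split(";")]
        if len(values) != len(sequence):
            raise ValueError(f"{sequence}: expected one value per residue")
        rows.append(
            {
                "sequence": sequence,
                "values": values,
                "set": row["set"].strip(),
                "validation": row["validation"].strip() == "True",
            }
        )
    return rows


csv_text = (
    "sequence, values, set, validation\n"
    "PRTEIN, 0.5;0.3;0.2;0.1;1.5;0.01, train, False\n"
)
rows = read_residue_values(csv_text)
```

The length check makes the implicit contract explicit: each residue needs exactly one value, which is hard to verify by eye once sequences get long.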
Sequence -> Class && Sequence -> Value
Sequence -> Class standardization
1. 2 Fasta files (sequences.fasta, labels.fasta)
sequences.fasta
>Seq1
SEQWENCE
labels.fasta
>Seq1 SET=train VALIDATION=False
Glob
PRO:
- Compliant with residue -> class structure
CON:
- Fasta parsers might misinterpret the label "Glob" as the residue sequence "G, L, O, B"
- 2 files
2. 1 single Fasta file
sequences.fasta
>Seq1 TARGET=Glob SET=train VALIDATION=False
SEQWENCE
PRO:
- Only one file
- Readability
CON:
- Conversion from the FLIP format to the biotrainer format needed
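A reader for the single-file layout might look like the following sketch. It assumes every header carries `TARGET`, `SET` and `VALIDATION` as whitespace-separated `KEY=VALUE` pairs, exactly as in the example header. The function name `parse_annotated_fasta` is hypothetical and not part of any existing library.

```python
def parse_annotated_fasta(text):
    """Parse FASTA text with '>ID TARGET=... SET=... VALIDATION=...' headers."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split the header into the ID and its KEY=VALUE attributes.
            seq_id, *fields = line[1:].split()
            attributes = dict(field.split("=", 1) for field in fields)
            current = {
                "id": seq_id,
                "target": attributes["TARGET"],
                "set": attributes["SET"],
                "validation": attributes["VALIDATION"] == "True",
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence lines belong to the most recent header.
            current["sequence"] += line
    return records


fasta_text = ">Seq1 TARGET=Glob SET=train VALIDATION=False\nSEQWENCE\n"
records = parse_annotated_fasta(fasta_text)
```

Keeping the label in the header means the record body is always a residue sequence, so a class name such as "Glob" cannot be mistaken for residues.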
3. 1 single CSV file
sequence, label, set, validation
SEQWENCE, Glob, train, False
PRO:
- Only one file
- FLIP data format
CON:
- Poor readability for longer sequences
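To illustrate how the CSV layout relates to the single-Fasta layout, the sketch below converts one into the other. It is an assumption-laden example: the CSV carries no sequence IDs, so the hypothetical function `csv_to_annotated_fasta` invents IDs of the form `Seq1`, `Seq2`, and so on, and it assumes labels contain no whitespace.

```python
import csv
import io


def csv_to_annotated_fasta(csv_text):
    """Convert 'sequence, label, set, validation' rows to annotated FASTA text."""
    # skipinitialspace drops the blank that follows each comma delimiter.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    lines = []
    for index, row in enumerate(reader, start=1):
        # IDs are generated here because the CSV has no ID column (assumption).
        header = (
            f">Seq{index} TARGET={row['label'].strip()} "
            f"SET={row['set'].strip()} VALIDATION={row['validation'].strip()}"
        )
        lines.append(header)
        lines.append(row["sequence"].strip())
    return "\n".join(lines) + "\n"


csv_text = "sequence, label, set, validation\nSEQWENCE, Glob, train, False\n"
fasta_text = csv_to_annotated_fasta(csv_text)
```

Both layouts hold the same four pieces of information per sequence, so the conversion is a one-pass rewrite of each row.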