
Architectural decision record 001 - Data Standardization

This document collects the data standardization strategies for the various prediction tasks. It lists every option we considered and justifies the final decision. The final decision is marked in this document and is also recorded in the data_standardization.md file in the base directory.

Standardization strategies overview

Residue -> Class

Residue -> Class standardization

1. 2 Fasta files (sequences.fasta, labels.fasta)

sequences.fasta

>Seq1
SEQWENCE

labels.fasta

>Seq1 SET=train VALIDATION=False
DVCDVVDD

PRO:

  • Easy mapping of residue -> class
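The residue -> class mapping from the two files above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the helper names `read_fasta` and `pair_residues_with_classes` are hypothetical, and the sketch assumes that header attributes are whitespace-separated `KEY=value` tokens and that both files share the same sequence IDs.

```python
from io import StringIO


def read_fasta(handle):
    """Return {sequence_id: (attributes, body)} for a FASTA-like text handle."""
    records = {}
    seq_id, attributes, chunks = None, {}, []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records[seq_id] = (attributes, "".join(chunks))
            parts = line[1:].split()
            seq_id = parts[0]
            # Assumption: attributes look like SET=train VALIDATION=False
            attributes = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        records[seq_id] = (attributes, "".join(chunks))
    return records


def pair_residues_with_classes(sequences, labels):
    """Zip each residue with its class label, checking that lengths match."""
    paired = {}
    for seq_id, (_, sequence) in sequences.items():
        attributes, classes = labels[seq_id]
        if len(sequence) != len(classes):
            raise ValueError(f"{seq_id}: sequence and label lengths differ")
        paired[seq_id] = {
            "set": attributes.get("SET"),
            "validation": attributes.get("VALIDATION") == "True",
            "pairs": list(zip(sequence, classes)),
        }
    return paired


sequences = read_fasta(StringIO(">Seq1\nSEQWENCE\n"))
labels = read_fasta(StringIO(">Seq1 SET=train VALIDATION=False\nDVCDVVDD\n"))
paired = pair_residues_with_classes(sequences, labels)
```

Because labels and residues are aligned position by position, a length check is the only validation needed to catch a mismatched pair of files.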

Residue -> Value

Residue -> Value standardization

1. 1 single CSV file

sequence, values, set, validation
PRTEIN, 0.5;0.3;0.2;0.1;1.5;0.01, train, False

PRO:

  • Only one file

CON:

  • File becomes very large and hard to read
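Reading the CSV layout above could look roughly like this. It is a sketch under assumptions: the function name `read_residue_values` is hypothetical, the columns are taken from the sample header, per-residue values are assumed to be separated by `;`, and `skipinitialspace=True` is used because the sample puts a space after each comma.

```python
import csv
from io import StringIO

CSV_TEXT = """sequence, values, set, validation
PRTEIN, 0.5;0.3;0.2;0.1;1.5;0.01, train, False
"""


def read_residue_values(handle):
    """Parse rows of the residue -> value CSV into dictionaries."""
    rows = []
    for row in csv.DictReader(handle, skipinitialspace=True):
        sequence = row["sequence"].strip()
        values = [float(v) for v in row["values"].split(";")]
        # One value per residue is expected
        if len(values) != len(sequence):
            raise ValueError(f"{sequence}: expected {len(sequence)} values")
        rows.append({
            "sequence": sequence,
            "values": values,
            "set": row["set"].strip(),
            "validation": row["validation"].strip() == "True",
        })
    return rows


rows = read_residue_values(StringIO(CSV_TEXT))
```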

Sequence -> Class && Sequence -> Value

Sequence -> Class standardization

1. 2 Fasta files (sequences.fasta, labels.fasta)

sequences.fasta

>Seq1
SEQWENCE

labels.fasta

>Seq1 SET=train VALIDATION=False
Glob

PRO:

  • Compliant with residue -> class structure

CON:

  • Fasta parsers might misinterpret the label "Glob" as a residue sequence "G, L, O, B"
  • 2 files

2. 1 single Fasta file

sequences.fasta

>Seq1 TARGET=Glob SET=train VALIDATION=False
SEQWENCE

PRO:

  • Only one file
  • Readability

CON:

  • Conversion from FLIP to biotrainer needed
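With the single-file layout, the class label travels in the header, so a reader only has to split the header into an ID and its attributes. The sketch below illustrates that idea; `parse_header` and `read_labelled_fasta` are hypothetical names, and it assumes attributes are whitespace-separated `KEY=value` tokens as in the sample.

```python
from io import StringIO


def parse_header(header_line):
    """Split '>Seq1 KEY=value ...' into (sequence_id, {KEY: value})."""
    parts = header_line.lstrip(">").split()
    attributes = {}
    for part in parts[1:]:
        key, separator, value = part.partition("=")
        if separator:  # ignore tokens that are not KEY=value pairs
            attributes[key] = value
    return parts[0], attributes


def read_labelled_fasta(handle):
    """Return one dictionary per record, with header attributes as keys."""
    records = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id, attributes = parse_header(line)
            records.append({"id": seq_id, "sequence": "", **attributes})
        elif records:
            # Sequence bodies may span several lines
            records[-1]["sequence"] += line
    return records


records = read_labelled_fasta(
    StringIO(">Seq1 TARGET=Glob SET=train VALIDATION=False \nSEQWENCE\n")
)
```

Since the label never appears on a sequence line, it cannot be mistaken for residues.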

3. 1 single CSV file

sequence, label, set, validation
SEQWENCE, Glob, train, False

PRO:

  • Only one file
  • FLIP data format

CON:

  • Poor readability for long sequences
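For comparison, converting the CSV layout into the single-file Fasta layout of option 2 is a small transformation. The following is only a sketch: `csv_to_fasta` is a hypothetical helper, and because the CSV sample carries no ID column, the sequence IDs (`Seq1`, `Seq2`, ...) are generated from the row order, which is an assumption.

```python
import csv
from io import StringIO

CSV_TEXT = """sequence, label, set, validation
SEQWENCE, Glob, train, False
"""


def csv_to_fasta(handle):
    """Render each CSV row as a Fasta record with the label in the header."""
    lines = []
    reader = csv.DictReader(handle, skipinitialspace=True)
    for index, row in enumerate(reader, start=1):
        # Assumption: IDs are generated, since the CSV has no ID column
        lines.append(
            f">Seq{index} TARGET={row['label'].strip()} "
            f"SET={row['set'].strip()} VALIDATION={row['validation'].strip()}"
        )
        lines.append(row["sequence"].strip())
    return "\n".join(lines) + "\n"


fasta_text = csv_to_fasta(StringIO(CSV_TEXT))
```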