Dataset

Dataset: GenPept-Curated-2025
Total sequences: 11,000
Class balance: 5,500 AMP / 5,500 non-AMP
Length range: 10-200 amino acids
Training / validation / test: 7,698 / 989 / 2,313
Split design: Cluster-level 70/9/21 split with class-by-length-bin stratification
Leakage control: Homology-aware partitioning with zero cross-partition leakage after all-vs-all MMseqs2 verification.
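
The split design and leakage control can be illustrated with a minimal Python sketch. It is not the pipeline used to build this dataset. It assumes cluster assignments are already available (for example from a prior homology clustering run), and the record layout, the length-bin edges, the greedy assignment rule, and the function names `length_bin`, `cluster_split`, and `leaked_clusters` are all hypothetical. The leakage check here only confirms that no cluster spans two partitions; it does not replace an all-vs-all sequence comparison.

```python
import random
from collections import defaultdict

def length_bin(n, edges=(25, 50, 100)):
    """Index of the length bin for a sequence of length n (bin edges are hypothetical)."""
    return sum(n > e for e in edges)

def cluster_split(records, fractions=(0.70, 0.09, 0.21), seed=0):
    """Assign whole clusters to train/val/test, stratified by (label, length bin).

    records: list of (seq_id, cluster_id, label, length) tuples.
    Returns a dict mapping seq_id to a partition name.
    """
    names = ("train", "val", "test")
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[1]].append(rec)

    # Simplification: a cluster's stratum is taken from its first member.
    strata = defaultdict(list)
    for cid, members in clusters.items():
        _, _, label, length = members[0]
        strata[(label, length_bin(length))].append(cid)

    rng = random.Random(seed)
    assignment = {}
    for key in sorted(strata):
        cids = sorted(strata[key])
        rng.shuffle(cids)
        total = sum(len(clusters[c]) for c in cids)
        counts = [0, 0, 0]
        for cid in cids:
            # Greedy rule: place the cluster in the partition furthest below target.
            deficits = [fractions[i] * total - counts[i] for i in range(3)]
            i = deficits.index(max(deficits))
            counts[i] += len(clusters[cid])
            for seq_id, *_ in clusters[cid]:
                assignment[seq_id] = names[i]
    return assignment

def leaked_clusters(records, assignment):
    """Return cluster ids whose members ended up in more than one partition."""
    parts = defaultdict(set)
    for seq_id, cid, _, _ in records:
        parts[cid].add(assignment[seq_id])
    return [cid for cid, p in parts.items() if len(p) > 1]
```

Because clusters are assigned as units, partition sizes only approximate the target fractions, which is consistent with the reported counts of 7,698 / 989 / 2,313 (about 69.98% / 8.99% / 21.03% of 11,000) rather than an exact 70/9/21.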