Dataset
| Dataset | GenPept-Curated-2025 |
|---|---|
| Total sequences | 11,000 |
| Class balance | 5,500 AMP / 5,500 non-AMP |
| Length range | 10-200 amino acids |
| Training / validation / test | 7,698 / 989 / 2,313 |
| Split design | Cluster-level 70/9/21 split with class-by-length-bin stratification |
| Leakage control | Homology-aware partitioning; all-vs-all MMseqs2 verification found zero cross-partition leakage |
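
The split design and leakage control rows can be illustrated with a minimal sketch. It assumes cluster assignments already exist (for example, from a prior MMseqs2 clustering run) and are supplied as `(seq_id, cluster_id, stratum)` tuples. The function names `split_clusters` and `leaked_clusters`, the greedy largest-deficit assignment, the length bins, and the toy records are all hypothetical; this is not the dataset's actual pipeline.

```python
import random
from collections import defaultdict


def split_clusters(records, fractions=(0.70, 0.09, 0.21), seed=0):
    """Assign whole clusters to train/val/test, stratum by stratum.

    records: list of (seq_id, cluster_id, stratum) tuples.
    Returns a dict mapping seq_id -> split name.
    """
    names = ("train", "val", "test")
    rng = random.Random(seed)

    # Group sequences by cluster so a cluster is never divided.
    clusters = defaultdict(list)
    for seq_id, cluster_id, stratum in records:
        clusters[cluster_id].append((seq_id, stratum))

    # Label each cluster with its most common stratum (class, length bin).
    by_stratum = defaultdict(list)
    for cid, members in clusters.items():
        counts = defaultdict(int)
        for _, stratum in members:
            counts[stratum] += 1
        majority = max(sorted(counts), key=counts.get)
        by_stratum[majority].append(cid)

    assignment = {}
    for stratum in sorted(by_stratum):
        cids = by_stratum[stratum]
        rng.shuffle(cids)                           # random tie order
        cids.sort(key=lambda c: -len(clusters[c]))  # largest clusters first
        total = sum(len(clusters[c]) for c in cids)
        filled = [0, 0, 0]
        for cid in cids:
            # Give the cluster to the split furthest below its target size.
            deficits = [fractions[i] * total - filled[i] for i in range(3)]
            k = deficits.index(max(deficits))
            filled[k] += len(clusters[cid])
            for seq_id, _ in clusters[cid]:
                assignment[seq_id] = names[k]
    return assignment


def leaked_clusters(records, assignment):
    """Return the IDs of clusters whose members fall in more than one split."""
    seen = defaultdict(set)
    for seq_id, cluster_id, _ in records:
        seen[cluster_id].add(assignment[seq_id])
    return {cid for cid, splits in seen.items() if len(splits) > 1}


# Toy data: 200 hypothetical clusters of 1-5 sequences each.
toy_rng = random.Random(1)
records = []
for c in range(200):
    stratum = (toy_rng.choice(["AMP", "non-AMP"]),
               toy_rng.choice(["10-50", "51-200"]))  # illustrative length bins
    for m in range(toy_rng.randint(1, 5)):
        records.append((f"seq{c}_{m}", f"cluster{c}", stratum))

assignment = split_clusters(records)
print(sorted(leaked_clusters(records, assignment)))
```

Because clusters are assigned as units, the realized fractions only approximate the 70/9/21 targets. The check here only confirms that no cluster straddles partitions; it does not replace an all-vs-all sequence search across partitions, which can catch similar sequences that ended up in different clusters.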