Table 1 Details of the benchmark datasets used for evaluation

From: Clustering biological sequences with dynamic sequence similarity threshold

| Dataset | No. of sequences | Sequence length: mean (± standard deviation) | Sequence length: min | Sequence length: max |
| --- | --- | --- | --- | --- |
| AMR genes | 4027 | 939.93 (± 381.98) | 162 | 4359 |
| AMR proteins | 3891 | 312.53 (± 127.90) | 53 | 1452 |
| Plasmid nucleotides | 5005 | 1010.38 (± 1008.45) | 77 | 9511 |
| Viral nucleotides | 478,652 | 717.09 (± 837.21) | 13 | 9993 |
| Long viral nucleotides | 676 | 14,803.87 (± 12,048.56) | 10,002 | 262,388 |
| Viral amino acids | 469,835 | 242.64 (± 313.29) | 9 | 13,556 |