How Cluster Optimization Works

After candidate glycosylation sites are identified, HyperImmunISE runs a filtering and optimization step to produce a smaller final set of site combinations. The goal is to satisfy two user-defined limits:

  • number: the maximum number of final combinations to keep

  • size: the maximum number of sites allowed in each combination

This logic is implemented by calc_overlap(), refine_clusters(), reduce_clusters(), coverage_rank(), and iterate_clusters().

Step 1. Build the initial combinations

The workflow first builds candidate combinations at an initial clash radius of 17.5 Å. Two sites are treated as clashing if their pairwise distance is less than or equal to that radius. The initial list of compatible combinations is generated by calc_overlap().

At this point, the code does not yet enforce the final number or size limits. It simply enumerates combinations that can coexist under the current clash definition.
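
The clash rule and the enumeration can be illustrated with a minimal sketch. It is not the actual calc_overlap() code: the function names, the coordinate dictionary, and the choice to report only maximal clash-free sets are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def clashing_pairs(coords, radius=17.5):
    """Return site pairs whose distance is <= radius (treated as a clash)."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(coords), 2)
        if dist(coords[a], coords[b]) <= radius
    }


def compatible_combinations(coords, radius=17.5):
    """Enumerate maximal clash-free site sets.

    Hypothetical stand-in for calc_overlap(); coords maps a site label
    to an (x, y, z) tuple in angstroms.
    """
    clashes = clashing_pairs(coords, radius)
    sites = sorted(coords)
    # For each site, the other sites it can coexist with (no clash).
    compatible = {
        s: {t for t in sites if t != s and frozenset((s, t)) not in clashes}
        for s in sites
    }
    results = []

    def expand(current, candidates, excluded):
        # Bron-Kerbosch without pivoting: a set is reported once no
        # further site can be added to it.
        if not candidates and not excluded:
            results.append(sorted(current))
            return
        for s in sorted(candidates):
            expand(current | {s}, candidates & compatible[s], excluded & compatible[s])
            candidates = candidates - {s}
            excluded = excluded | {s}

    expand(set(), set(sites), set())
    return sorted(results)
```

Note that no number or size limit appears anywhere in this sketch, matching the description above: the only input besides the coordinates is the clash radius.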

Step 2. Keep the longest useful combinations

refine_clusters() reduces the raw combination list by preferring longer combinations. It first keeps only combinations with the maximum length, then checks whether every candidate site is still represented in the reduced set.

If all sites remain represented, that reduced set is kept. If some sites disappear entirely, the minimum allowed combination length is lowered by one, so that the next-longest combinations are also kept, and the check is repeated.
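
The length-relaxation loop described above might look like the following sketch. The function name refine_by_length and its arguments are hypothetical; the real refine_clusters() may differ in its signature and in how it handles edge cases.

```python
def refine_by_length(combos, all_sites):
    """Keep the longest combinations, lowering the length floor by one
    until every candidate site appears in at least one kept combination.

    Illustrative only; not the actual refine_clusters() implementation.
    """
    if not combos:
        return []
    min_len = max(len(c) for c in combos)
    while min_len >= 1:
        kept = [c for c in combos if len(c) >= min_len]
        represented = set().union(*kept)
        if represented >= set(all_sites):
            return kept
        min_len -= 1
    # Assumption: if some site never appears in any combination,
    # fall back to the unfiltered list.
    return list(combos)
```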

Step 3. Iteratively optimize until the constraints are met

The main optimization loop is handled by iterate_clusters(). At each recursion step, the code measures:

  • cur_i: the number of current combinations

  • cur_j: the size of the largest current combination

It then applies one of four cases.

Case A. Too many combinations, and combinations are also too large

Condition: cur_i > number and cur_j > size

  • Increase the clash radius by 1 Å

  • Call reduce_clusters() to split combinations that now contain clashes

  • Call coverage_rank() to keep only the top-ranked combinations, up to the number limit

Case B. Combinations are too large, but the count is acceptable

Condition: cur_j > size and cur_i <= number

  • Increase the clash radius by 1 Å

  • Call reduce_clusters()

  • Do not rank yet

This step only fixes oversized combinations.

Case C. Combinations are the right size, but there are too many

Condition: cur_i > number and cur_j <= size

  • Keep the current radius

  • Call coverage_rank()

  • Keep only the top-ranked combinations, up to the number limit

This step does not change the geometry. It only filters by predicted coverage.

Case D. Both limits are satisfied

Condition: cur_i <= number and cur_j <= size

Both limits are met, so the current combinations are returned as the final design set.
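
The four cases can be summarized as a small dispatch sketch. This is a hedged illustration rather than the iterate_clusters() source: it uses a loop instead of recursion, takes the reduce and rank steps as hypothetical callables (reduce_fn, rank_fn), and adds a max_rounds guard that is purely an assumption.

```python
def select_case(cur_i, cur_j, number, size):
    """Map the current counts to Case A, B, C, or D."""
    too_many = cur_i > number
    too_large = cur_j > size
    if too_many and too_large:
        return "A"
    if too_large:
        return "B"
    if too_many:
        return "C"
    return "D"


def iterate(combos, radius, number, size, reduce_fn, rank_fn,
            step=1.0, max_rounds=100):
    """Apply the four cases until both limits are met.

    reduce_fn(combos, radius) and rank_fn(combos, radius) are
    placeholders for reduce_clusters() and coverage_rank(); rank_fn is
    assumed to return combinations sorted best first.
    """
    for _ in range(max_rounds):
        cur_i = len(combos)
        cur_j = max((len(c) for c in combos), default=0)
        case = select_case(cur_i, cur_j, number, size)
        if case == "D":
            return combos, radius
        if case in ("A", "B"):
            # Geometry change: larger radius, then split clashing combos.
            radius += step
            combos = reduce_fn(combos, radius)
        if case in ("A", "C"):
            # Filtering: keep only the best-ranked combinations.
            combos = rank_fn(combos, radius)[:number]
    raise RuntimeError("limits not satisfied within max_rounds")
```

Only Cases A and C call the ranking step, which is why only those rounds would produce scoring output.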

How coverage ranking works

coverage_rank() estimates how much of the surface is covered by each candidate combination and ranks combinations by coverage fraction, that is, the number of covered surface residues divided by the total number of surface residues considered. For each site in a combination, nearby surface residues within the current radius are counted as covered. The highest-scoring combinations, up to the number limit, are retained.
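
A minimal sketch of this scoring, assuming sites and surface residues are each represented by a single (x, y, z) point and that "within the radius" means a distance less than or equal to it. Both function names and both assumptions are illustrative, not taken from the coverage_rank() source.

```python
from math import dist


def coverage_fraction(combo, site_coords, surface_coords, radius):
    """Fraction of surface residues within `radius` of any site in combo."""
    if not surface_coords:
        return 0.0
    covered = {
        residue
        for residue, xyz in surface_coords.items()
        if any(dist(xyz, site_coords[site]) <= radius for site in combo)
    }
    return len(covered) / len(surface_coords)


def rank_by_coverage(combos, site_coords, surface_coords, radius, number):
    """Sort combinations by coverage fraction (best first) and keep `number`."""
    ranked = sorted(
        combos,
        key=lambda c: coverage_fraction(c, site_coords, surface_coords, radius),
        reverse=True,
    )
    return ranked[:number]
```

A residue covered by two sites in the same combination is counted once, because coverage is collected as a set before dividing.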

How to read the optimization trajectory

HyperImmunISE writes a file named *_coverage_rank_searchTrajectory_scores.csv during this pruning step. This file records only recursion steps where coverage_rank() is called. If a round only increases the clash radius and runs reduce_clusters(), the recursion still advances, but no scoring rows are written for that step. For that reason, iteration numbers in the CSV may be discontinuous.

The columns are:

  • iteration: recursion step at which coverage scoring was recorded

  • coverage_fraction: fraction of all surface residues covered by at least one site in the combination at the current radius

  • covered_count: number of surface residues counted as covered by the combination

  • total_surface_residues: total number of surface residues considered in the coverage calculation

  • radius_A: radius, in angstroms, used for that scoring round

  • num_sites_in_combo: number of glycosylation sites in the candidate combination

  • sites: semicolon-separated list of residue identifiers in the candidate combination
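
As one way to inspect the trajectory, the sketch below reads the CSV with the standard library and reports the best-scoring combination per iteration, plus any iteration numbers that have no rows. It assumes the file has a header row using exactly the column names listed above, is comma-delimited, and stores iteration as an integer; the helper names are hypothetical.

```python
import csv


def best_per_iteration(handle):
    """Return {iteration: best record}, where best means the highest
    coverage_fraction among the rows written for that iteration."""
    best = {}
    for row in csv.DictReader(handle):
        iteration = int(row["iteration"])
        record = {
            "coverage_fraction": float(row["coverage_fraction"]),
            "radius_A": float(row["radius_A"]),
            "sites": row["sites"].split(";"),
        }
        current = best.get(iteration)
        if current is None or record["coverage_fraction"] > current["coverage_fraction"]:
            best[iteration] = record
    return best


def skipped_iterations(iterations):
    """Iteration numbers between the first and last recorded ones that
    have no scoring rows (rounds where coverage_rank() was not called)."""
    present = set(iterations)
    if not present:
        return []
    return [i for i in range(min(present), max(present) + 1) if i not in present]
```

To use it, open the *_coverage_rank_searchTrajectory_scores.csv file with newline="" and pass the file object to best_per_iteration(); passing the resulting dictionary to skipped_iterations() lists the rounds that only adjusted the radius.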

Example execution

Suppose the user requests:

  • at most 2 combinations

  • at most 2 sites per combination

If the current combinations at a radius of 17.5 Å are:

  • [A, B, C]

  • [A, D]

  • [B, E]

Then there are too many combinations (3 > 2) and the largest combination is also too large (3 sites > 2), so the algorithm enters Case A. The radius is increased to 18.5 Å and reduce_clusters() is run. If B and C now clash, [A, B, C] is split into smaller valid subsets, giving a combination list such as:

  • [A, B]

  • [A, C]

  • [A, D]

  • [B, E]

At this point, the size limit is satisfied, but with 4 combinations the count is still too high. As the final part of Case A, coverage_rank() keeps only the top 2 combinations. Both limits are then satisfied (Case D), and the optimization stops.
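
The splitting step in this example can be replayed with a short sketch. It is only an approximation of what reduce_clusters() might do: the function name, the representation of clashes as a set of site pairs, and the decision to drop any combination fully contained in a larger one are assumptions.

```python
def split_clashing(combos, clashes):
    """Split every combination that contains a clashing pair.

    clashes is a set of frozenset({site_1, site_2}) pairs. For each
    clashing pair found in a combination, two smaller combinations are
    produced, each dropping one member of the pair.
    """
    clash_free = set()
    stack = [frozenset(c) for c in combos]
    while stack:
        combo = stack.pop()
        pair = next((p for p in clashes if p <= combo), None)
        if pair is None:
            clash_free.add(combo)
        else:
            for site in pair:
                stack.append(combo - {site})
    # Assumption: drop combinations strictly contained in another one.
    maximal = [c for c in clash_free if not any(c < other for other in clash_free)]
    return sorted(sorted(c) for c in maximal)


combos = [["A", "B", "C"], ["A", "D"], ["B", "E"]]
after_split = split_clashing(combos, {frozenset(("B", "C"))})
print(after_split)  # [['A', 'B'], ['A', 'C'], ['A', 'D'], ['B', 'E']]
```

The four combinations printed here match the list in the example; which two survive the ranking step depends on the surface coordinates, which the example does not specify.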