How Cluster Optimization Works¶
After candidate glycosylation sites are identified, HyperImmunISE runs a filtering and optimization step to produce a smaller final set of site combinations. The goal is to satisfy two user-defined limits:
number: the maximum number of final combinations to keep
size: the maximum number of sites allowed in each combination
This logic is implemented by calc_overlap(), refine_clusters(), reduce_clusters(), coverage_rank(), and iterate_clusters().
Step 1. Build the initial combinations¶
The workflow first builds candidate combinations at an initial clash radius of 17.5 Å. Two sites are treated as clashing if their pairwise distance is less than or equal to that radius. The initial list of compatible combinations is generated by calc_overlap().
At this point, the code does not yet enforce the final number or size limits. It simply enumerates combinations that can coexist under the current clash definition.
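The clash rule and the enumeration described above can be illustrated with a minimal sketch. This is not the code of calc_overlap(): the helper names `clashes` and `compatible_combinations`, the use of one 3D point per site, and the brute-force search for maximal clash-free sets are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def clashes(a, b, radius=17.5):
    """Two sites clash if their distance is <= radius (angstroms)."""
    return dist(a, b) <= radius


def compatible_combinations(coords, radius=17.5):
    """Brute-force enumeration of maximal clash-free site sets.

    coords maps a site name to a hypothetical (x, y, z) point.
    Only practical for a small number of sites.
    """
    names = sorted(coords)
    valid = []
    # Try larger combinations first so that subsets can be skipped.
    for k in range(len(names), 0, -1):
        for combo in combinations(names, k):
            clash_free = all(
                not clashes(coords[p], coords[q], radius)
                for p, q in combinations(combo, 2)
            )
            # Skip anything already contained in an accepted combination.
            if clash_free and not any(set(combo) <= set(v) for v in valid):
                valid.append(combo)
    return valid
```

With three collinear sites at x = 0, 20, and 10, only the first two are more than 17.5 Å apart, so the sketch yields one pair and one singleton. No limit on count or size is applied here, which mirrors the description of this step.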
Step 2. Keep the longest useful combinations¶
refine_clusters() reduces the raw combination list by preferring longer combinations. It first keeps only combinations with the maximum length, then checks whether every candidate site is still represented in the reduced set.
If all sites remain represented, that reduced set is kept. If some sites disappear entirely, the minimum allowed combination length is lowered by one and the process is repeated.
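The length-based filtering described above can be sketched as follows. The function name `refine_by_length` and its arguments are hypothetical, and the sketch assumes a non-empty list of combinations; the actual refine_clusters() may differ in detail.

```python
def refine_by_length(combos, all_sites):
    """Keep the longest combinations, relaxing the length threshold
    until every candidate site appears in at least one kept combination."""
    wanted = set(all_sites)
    min_len = max(len(c) for c in combos)  # start at the maximum length
    while min_len >= 1:
        kept = [c for c in combos if len(c) >= min_len]
        represented = set().union(*kept)
        if represented >= wanted:
            return kept
        min_len -= 1  # some site disappeared: allow shorter combinations
    # Fallback (assumption): a site is in no combination at all.
    return list(combos)
```

For example, with combinations [A, B, C], [A, D], and [B, E], keeping only length 3 would drop sites D and E, so the threshold falls to 2 and all three combinations are kept.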
Step 3. Iteratively optimize until the constraints are met¶
The main optimization loop is handled by iterate_clusters(). At each recursion step, the code measures:
cur_i: the number of current combinations
cur_j: the size of the largest current combination
It then applies one of four cases.
Case A. Too many combinations, and combinations are also too large¶
Condition: cur_i > number and cur_j > size
Increase the clash radius by 1 Å
Call reduce_clusters() to split combinations that now contain clashes
Call coverage_rank() to keep only the top number combinations
Case B. Combinations are too large, but the count is acceptable¶
Condition: cur_j > size and cur_i <= number
Increase the clash radius by 1 Å
Call reduce_clusters()
Do not rank yet
This step only fixes oversized combinations.
Case C. Combinations are the right size, but there are too many¶
Condition: cur_i > number and cur_j <= size
Keep the current radius
Call coverage_rank()
Keep only the top number combinations
This step does not change the geometry. It only filters by predicted coverage.
Case D. Both limits are satisfied¶
If neither limit is exceeded (cur_i <= number and cur_j <= size), the current combinations are returned as the final design set.
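The four cases can be summarized in a short sketch. This is not the code of iterate_clusters(): it uses a loop instead of recursion, the names `choose_case` and `iterate` are hypothetical, and the splitting and ranking steps are passed in as callables (`reduce_fn`, `rank_fn`) because their real signatures are not described here. The `max_rounds` guard is also an assumption.

```python
def choose_case(combos, number, size):
    """Return 'A', 'B', 'C' or 'D' for the current set of combinations."""
    cur_i = len(combos)                              # number of combinations
    cur_j = max((len(c) for c in combos), default=0)  # largest combination
    if cur_i > number and cur_j > size:
        return "A"
    if cur_j > size:
        return "B"
    if cur_i > number:
        return "C"
    return "D"


def iterate(combos, number, size, radius, reduce_fn, rank_fn,
            step=1.0, max_rounds=100):
    """Apply the case rules until both limits hold (Case D)."""
    for _ in range(max_rounds):
        case = choose_case(combos, number, size)
        if case == "D":
            return combos, radius
        if case in ("A", "B"):
            radius += step                     # widen the clash radius
            combos = reduce_fn(combos, radius)  # split clashing combinations
        if case in ("A", "C"):
            combos = rank_fn(combos, radius)[:number]  # keep the top `number`
    raise RuntimeError("limits not satisfied within max_rounds")
```

Passing the two steps as callables keeps the control flow visible on its own: Cases A and B change the geometry, Cases A and C filter by rank, and Case D returns.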
How coverage ranking works¶
coverage_rank() estimates how much of the surface is covered by each candidate combination and ranks combinations by coverage fraction. For each site in a combination, nearby surface residues within the current radius are counted as covered. The highest-scoring combinations are retained.
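A minimal sketch of this scoring, under stated assumptions: each site and each surface residue is represented by a single 3D point, "within the current radius" is taken to mean distance <= radius, and the function names and arguments are hypothetical rather than the signature of coverage_rank().

```python
from math import dist


def coverage_fraction(combo, site_coords, surface_coords, radius):
    """Fraction of surface residues within `radius` of at least one site."""
    covered = sum(
        1
        for residue in surface_coords
        if any(dist(site_coords[s], residue) <= radius for s in combo)
    )
    return covered / len(surface_coords)


def rank_by_coverage(combos, site_coords, surface_coords, radius, number):
    """Sort combinations by coverage fraction (highest first), keep `number`."""
    ranked = sorted(
        combos,
        key=lambda c: coverage_fraction(c, site_coords, surface_coords, radius),
        reverse=True,
    )
    return ranked[:number]
```

A residue near two sites of the same combination is counted once, which matches the "covered by at least one site" definition used in the trajectory file.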
How to read the optimization trajectory¶
HyperImmunISE writes a file named *_coverage_rank_searchTrajectory_scores.csv during this pruning step. This file records only recursion steps where coverage_rank() is called. If a round only increases the clash radius and runs reduce_clusters(), the recursion still advances, but no scoring rows are written for that step. For that reason, iteration numbers in the CSV may be discontinuous.
The columns are:
iteration: recursion step at which coverage scoring was recorded
coverage_fraction: fraction of all surface residues covered by at least one site in the combination at the current radius
covered_count: number of surface residues counted as covered by the combination
total_surface_residues: total number of surface residues considered in the coverage calculation
radius_A: radius, in angstroms, used for that scoring round
num_sites_in_combo: number of glycosylation sites in the candidate combination
sites: semicolon-separated list of residue identifiers in the candidate combination
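One way to inspect the trajectory is sketched below. It assumes the file has a header row using exactly the column names listed above and that iteration holds integers; both helper functions are hypothetical and not part of HyperImmunISE.

```python
import csv


def best_per_iteration(lines):
    """Map each iteration to its row with the highest coverage_fraction.

    `lines` is any iterable of CSV text lines, such as an open file.
    """
    best = {}
    for row in csv.DictReader(lines):
        it = int(row["iteration"])
        score = float(row["coverage_fraction"])
        if it not in best or score > float(best[it]["coverage_fraction"]):
            best[it] = row
    return best


def missing_iterations(best):
    """Iteration numbers skipped between the first and last recorded step."""
    if not best:
        return []
    recorded = sorted(best)
    return [i for i in range(recorded[0], recorded[-1] + 1) if i not in best]
```

To use it on a real run, open the trajectory file with `open(path, newline="")` and pass the file object to `best_per_iteration`. Any numbers returned by `missing_iterations` correspond to rounds in which coverage_rank() was not called.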
Example execution¶
Suppose the user requests:
at most 2 combinations
at most 2 sites per combination
If the current combinations at radius 17.5 Å are:
[A, B, C]
[A, D]
[B, E]
Then there are too many combinations and the largest combination is also too large, so the algorithm enters Case A. The radius is increased to 18.5 Å and reduce_clusters() is run. If B and C now clash, [A, B, C] may be split into smaller valid subsets such as:
[A, B]
[A, C]
[A, D]
[B, E]
At this point, the size limit is satisfied, but there are four combinations where at most 2 are allowed. Still within Case A, coverage_rank() is used to keep only the top 2 combinations. On the next recursion step both limits are satisfied (Case D), so the optimization stops and those two combinations are returned.
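The splitting in this example can be reproduced with a small sketch. The rule used here (replace each combination by its maximal clash-free subsets) is an assumption chosen to match the example, not a description of how reduce_clusters() is written; `split_combo` and `reduce_all` are hypothetical names, and clashes are given directly as pairs rather than computed from coordinates.

```python
from itertools import combinations


def split_combo(combo, clashing_pairs):
    """Return the maximal subsets of `combo` that contain no clashing pair."""
    clashing = {frozenset(p) for p in clashing_pairs}
    result = []
    for k in range(len(combo), 0, -1):  # largest subsets first
        for sub in combinations(combo, k):
            clash_free = not any(
                frozenset(pair) in clashing for pair in combinations(sub, 2)
            )
            if clash_free and not any(set(sub) <= set(r) for r in result):
                result.append(sub)
    return result


def reduce_all(combos, clashing_pairs):
    """Split every combination and drop exact duplicates, keeping order."""
    out = []
    for combo in combos:
        for sub in split_combo(combo, clashing_pairs):
            if sub not in out:
                out.append(sub)
    return out


combos = [("A", "B", "C"), ("A", "D"), ("B", "E")]
print(reduce_all(combos, [("B", "C")]))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'E')]
```

The output has four combinations of size 2, which is the state the example reaches before ranking trims the list to 2.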