How Cluster Optimization Works
==============================

After candidate glycosylation sites are identified, HyperImmunISE runs a filtering and optimization step to produce a smaller final set of site combinations. The goal is to satisfy two user-defined limits:

- ``number``: the maximum number of final combinations to keep
- ``size``: the maximum number of sites allowed in each combination

This logic is implemented by ``calc_overlap()``, ``refine_clusters()``, ``reduce_clusters()``, ``coverage_rank()``, and ``iterate_clusters()``.

Step 1. Build the initial combinations
--------------------------------------

The workflow first builds candidate combinations at an initial clash radius of ``17.5`` Å. Two sites are treated as clashing if their pairwise distance is less than or equal to that radius. The initial list of compatible combinations is generated by ``calc_overlap()``.

At this point, the code does not yet enforce the final ``number`` or ``size`` limits. It simply enumerates combinations that can coexist under the current clash definition.
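
The clash test and the enumeration can be illustrated with a minimal brute-force sketch. It is not the actual ``calc_overlap()`` implementation: the helper names, the coordinate dictionary, and the exhaustive subset search are assumptions made for illustration only.

.. code-block:: python

    from itertools import combinations
    from math import dist

    CLASH_RADIUS = 17.5  # initial clash radius in angstroms, as stated above


    def clashes(p, q, radius=CLASH_RADIUS):
        """Two sites clash if their distance is less than or equal to the radius."""
        return dist(p, q) <= radius


    def compatible_combinations(coords, radius=CLASH_RADIUS):
        """Hypothetical helper: list every non-empty subset with no clashing pair."""
        names = sorted(coords)
        result = []
        for k in range(1, len(names) + 1):
            for combo in combinations(names, k):
                pairs = combinations(combo, 2)
                if all(not clashes(coords[a], coords[b], radius) for a, b in pairs):
                    result.append(list(combo))
        return result


    # Made-up coordinates (angstroms) for three illustrative sites.
    coords = {"A": (0.0, 0.0, 0.0), "B": (20.0, 0.0, 0.0), "C": (5.0, 0.0, 0.0)}
    combos = compatible_combinations(coords)

Here only ``A`` and ``B`` are farther apart than the radius, so ``[A, B]`` is the only multi-site combination that survives. No ``number`` or ``size`` limit is applied at this stage.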

Step 2. Keep the longest useful combinations
--------------------------------------------

``refine_clusters()`` reduces the raw combination list by preferring longer combinations. It first keeps only combinations with the maximum length, then checks whether every candidate site is still represented in the reduced set.

If all sites remain represented, that reduced set is kept. If some sites disappear entirely, the minimum allowed combination length is lowered by one and the process is repeated.
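
A minimal sketch of this length-lowering loop follows. It is an illustration rather than the real ``refine_clusters()`` code: the function name is hypothetical, and it assumes that the set of candidate sites is the union of all sites appearing in the raw combination list.

.. code-block:: python

    def refine_sketch(combos):
        """Hypothetical helper: prefer long combinations while keeping every site."""
        if not combos:
            return []
        # Assumption: the candidate sites are those seen in the raw list.
        all_sites = {site for combo in combos for site in combo}
        min_len = max(len(combo) for combo in combos)
        while min_len >= 1:
            kept = [combo for combo in combos if len(combo) >= min_len]
            represented = {site for combo in kept for site in combo}
            if represented == all_sites:
                return kept
            min_len -= 1  # some site disappeared, so allow shorter combinations
        return list(combos)


    raw = [["A", "B", "C"], ["A", "D"], ["D"], ["B"]]
    refined = refine_sketch(raw)

With this input, keeping only length-3 combinations would drop site ``D``, so the minimum length falls to 2 and ``[A, D]`` is retained alongside ``[A, B, C]``.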

Step 3. Iteratively optimize until the constraints are met
----------------------------------------------------------

The main optimization loop is handled by ``iterate_clusters()``. At each recursion step, the code measures:

- ``cur_i``: the number of current combinations
- ``cur_j``: the size of the largest current combination

It then applies one of four cases.

Case A. Too many combinations, and combinations are also too large
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Condition: ``cur_i > number`` and ``cur_j > size``

- Increase the clash radius by ``1`` Å
- Call ``reduce_clusters()`` to split combinations that now contain clashes
- Call ``coverage_rank()`` to keep only the top ``number`` combinations

Case B. Combinations are too large, but the count is acceptable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Condition: ``cur_j > size`` and ``cur_i <= number``

- Increase the clash radius by ``1`` Å
- Call ``reduce_clusters()``
- Do not rank yet

This step only fixes oversized combinations.

Case C. Combinations are the right size, but there are too many
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Condition: ``cur_i > number`` and ``cur_j <= size``

- Keep the current radius
- Call ``coverage_rank()``
- Keep only the top ``number`` combinations

This step does not change the geometry. It only filters by predicted coverage.

Case D. Both limits are satisfied
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Condition: ``cur_i <= number`` and ``cur_j <= size``

Neither limit is exceeded, so the current combinations are returned as the final design set.
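
The four conditions can be summarized as a small decision function. This is only a sketch of the case selection described above. The function names are hypothetical, and the real ``iterate_clusters()`` also performs the radius update, ``reduce_clusters()``, and ``coverage_rank()`` calls that belong to each case.

.. code-block:: python

    def measure(combos):
        """Return (cur_i, cur_j): combination count and largest combination size."""
        return len(combos), max((len(combo) for combo in combos), default=0)


    def pick_case(cur_i, cur_j, number, size):
        """Hypothetical helper: map the current measurements to Case A, B, C, or D."""
        if cur_i > number and cur_j > size:
            return "A"  # raise radius, reduce, then rank
        if cur_j > size:
            return "B"  # raise radius and reduce only
        if cur_i > number:
            return "C"  # rank only, radius unchanged
        return "D"      # both limits satisfied, return the result


    cur_i, cur_j = measure([["A", "B", "C"], ["A", "D"], ["B", "E"]])
    case = pick_case(cur_i, cur_j, number=2, size=2)

The cases are mutually exclusive, so exactly one branch applies at each recursion step.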

How coverage ranking works
--------------------------

``coverage_rank()`` estimates how much of the surface is covered by each candidate combination and ranks combinations by coverage fraction. For each site in a combination, nearby surface residues within the current radius are counted as covered. The highest-scoring combinations are retained.
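
The scoring idea can be sketched as follows. This is not the actual ``coverage_rank()`` implementation: the function names and coordinate inputs are hypothetical, and treating "within the radius" as a distance less than or equal to the radius is an assumption.

.. code-block:: python

    from math import dist


    def coverage_fraction(combo, site_coords, surface_coords, radius):
        """Fraction of surface residues within `radius` of at least one site."""
        if not surface_coords:
            return 0.0
        covered = sum(
            1
            for residue in surface_coords
            if any(dist(residue, site_coords[site]) <= radius for site in combo)
        )
        return covered / len(surface_coords)


    def rank_by_coverage(combos, site_coords, surface_coords, radius, number):
        """Hypothetical helper: keep the `number` combinations with highest coverage."""
        ranked = sorted(
            combos,
            key=lambda c: coverage_fraction(c, site_coords, surface_coords, radius),
            reverse=True,
        )
        return ranked[:number]


    # Made-up coordinates (angstroms) chosen only to make the arithmetic easy.
    sites = {"A": (0.0, 0.0, 0.0), "B": (30.0, 0.0, 0.0), "C": (100.0, 0.0, 0.0)}
    surface = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (29.0, 0.0, 0.0), (60.0, 0.0, 0.0)]
    top = rank_by_coverage([["A", "C"], ["B", "C"], ["A", "B"]], sites, surface, 5.0, 2)

In this toy input, ``[A, B]`` covers three of the four surface points and ``[A, C]`` covers two, so those two combinations are retained.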

How to read the optimization trajectory
---------------------------------------

HyperImmunISE writes a file named ``*_coverage_rank_searchTrajectory_scores.csv`` during this pruning step. This file records only recursion steps where ``coverage_rank()`` is called. If a round only increases the clash radius and runs ``reduce_clusters()``, the recursion still advances, but no scoring rows are written for that step. For that reason, iteration numbers in the CSV may be discontinuous.

The columns are:

- ``iteration``: recursion step at which coverage scoring was recorded
- ``coverage_fraction``: fraction of all surface residues covered by at least one site in the combination at the current radius
- ``covered_count``: number of surface residues counted as covered by the combination
- ``total_surface_residues``: total number of surface residues considered in the coverage calculation
- ``radius_A``: radius, in angstroms, used for that scoring round
- ``num_sites_in_combo``: number of glycosylation sites in the candidate combination
- ``sites``: semicolon-separated list of residue identifiers in the candidate combination
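
As a reading aid, the sketch below parses rows with these columns and reports the best-scoring combination per recorded iteration. The sample rows, site labels, and helper name are invented for illustration and are not output from a real run. Only the column names come from the list above.

.. code-block:: python

    import csv
    import io

    # Made-up rows; iteration 2 is absent, as it would be for a step that
    # only increased the radius and ran reduce_clusters().
    SAMPLE = "\n".join([
        "iteration,coverage_fraction,covered_count,total_surface_residues,"
        "radius_A,num_sites_in_combo,sites",
        "1,0.50,50,100,18.5,2,S1;S2",
        "1,0.40,40,100,18.5,2,S1;S3",
        "3,0.60,60,100,19.5,2,S2;S4",
    ])


    def best_per_iteration(handle):
        """Hypothetical helper: map iteration -> (best coverage_fraction, sites)."""
        best = {}
        for row in csv.DictReader(handle):
            iteration = int(row["iteration"])
            fraction = float(row["coverage_fraction"])
            site_list = row["sites"].split(";")
            if iteration not in best or fraction > best[iteration][0]:
                best[iteration] = (fraction, site_list)
        return best


    best = best_per_iteration(io.StringIO(SAMPLE))
    recorded_iterations = sorted(best)

To read a real trajectory file, pass an open file handle instead of the ``io.StringIO`` object. Gaps in ``recorded_iterations`` correspond to steps where ``coverage_rank()`` was not called.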

Example execution
-----------------

Suppose the user requests:

- at most ``2`` combinations
- at most ``2`` sites per combination

If the current combinations at a radius of ``17.5`` Å are:

- ``[A, B, C]``
- ``[A, D]``
- ``[B, E]``

Then there are too many combinations (3 > 2) and the largest combination is also too large (3 sites > 2), so the algorithm enters Case A. The radius is increased to ``18.5`` Å and ``reduce_clusters()`` is run. If ``B`` and ``C`` now clash, ``[A, B, C]`` may be split into smaller valid subsets, giving a combination list such as:

- ``[A, B]``
- ``[A, C]``
- ``[A, D]``
- ``[B, E]``

At this point, the size limit is satisfied, but the number of combinations (4) is still above the limit of ``2``. ``coverage_rank()``, the last action of Case A, is then used to keep only the top ``2`` combinations. With both limits satisfied, the next check falls into Case D and the optimization stops.
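
The splitting step of this walkthrough can be reproduced with a small sketch. It is not the real ``reduce_clusters()`` code: the function names are hypothetical, the clashing pair is supplied by hand rather than computed from coordinates, and the strategy of dropping one member of each clashing pair is an assumption.

.. code-block:: python

    def split_combo(combo, clashing_pairs):
        """Split a combination until it contains no clashing pair."""
        for a, b in clashing_pairs:
            if a in combo and b in combo:
                without_b = [site for site in combo if site != b]
                without_a = [site for site in combo if site != a]
                return (split_combo(without_b, clashing_pairs)
                        + split_combo(without_a, clashing_pairs))
        return [combo]


    def split_all(combos, clashing_pairs):
        """Hypothetical helper: split every combination and drop exact duplicates."""
        result = []
        for combo in combos:
            for piece in split_combo(list(combo), clashing_pairs):
                if piece not in result:
                    result.append(piece)
        return result


    start = [["A", "B", "C"], ["A", "D"], ["B", "E"]]
    after_split = split_all(start, clashing_pairs=[("B", "C")])
    cur_i = len(after_split)
    cur_j = max(len(combo) for combo in after_split)

After the split, ``cur_j`` is 2 and ``cur_i`` is 4, which matches the state described above. Which two combinations survive the ranking depends on the actual surface geometry, so that step is not simulated here.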