Functions
=========

surface_residues(pdb, min_SASA)
-------------------------------

This function takes a protein input and outputs a 
list of all the surface residues if  their SASA > min_SASA, set at 
default to 2.5 Å

based on: http://pymolwiki.org/index.php/FindSurfaceResidues
and modified to use freesasa module to find SASA of residues

freesasa is used for SASA calculation and cited here:

Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 5:189. (doi: 10.12688/f1000research.7931.1)

Parameters:

pdb : FILE
   pdb file of protein.

Returns:

surface_residues: [List] of surface residues in the protein.

residue_info(surface_residues, pdb, destination)
------------------------------------------------

This function generates dictionaries with information
about a list of input residues, including a dictionary of the 
type of each residue and the coordinates
made to reduce the number of times a pdb file needs to be parsed

Parameters:

surface_residues: LIST
   list of strings: 'resnum chain'
   surface residues of interest

pdb : FILE
   pdb file of protein.

Returns:

surface_dict: dictionary of {'resnum chain': residue type}

surface_coords: dictionary of {'resnum chain': coordinates}

map_dict: nested dictionary of {chain: {resnum: index}}

jwalk_map: dictionary that maps old pdb file numbering to new pdb file numbering::

   {chain: {original pdb num: jwalk pdb id}}

consecutive_residues(resMap)
----------------------------

This function finds three or more consecutive residues and 
adds them into a single separate list together

Parameters:

resMap, dictionary of residues by chain

Returns:

[list] of lists of consecutive numbers in the input residue numbers
or empty list if no consecutive residue numbers found

consensus_sequences_similarAA()
-------------------------------

This function creates possible combinations of N-glycosylation 
consensus sequences that may be found in an amino acid sequence.

Returns:

[list] of possible consensus sequences for N-linked glycosylation

identify_native_site(target, query, res_Dict)
---------------------------------------------

This function identify native glycosylation sites in EpitopeCA.txt by 
identifying whether the consensus sequence lies in its sequence.

Parameters:

target, list of consecutive residue numbers from EpitopeCA.txt

query, list of possible consensus sequence combinations

res_Dict, dictionary of {residue number: residue name} from CA file.

Returns:

nested [list] of residue numbers of native glycosylation sites.

id_double_mutation_similarAA(target, query, query_nat, res_Dict)
----------------------------------------------------------------

This function identify novel glycosylation sites two mutations
in EpitopeCA.txt by identifying whether 1 of three amino acids in 
the consensus sequence lies in its sequence.

Parameters:

target, list of consecutive residue numbers from EpitopeCA.txt

query,list of possible consensus sequence combinations

res_Dict, dictionary of {residue number: residue name} from CA file.

Returns:

nested [list] of residue numbers of glycosylation sites that can be singly mutated.

dist(a, b)
----------

This function calculates distance between two points (a, b)

Parameters:

a,  point 1 (x1, y1, z1)
b,  point 2 (x2, y2, z2)   

Returns:

Distance between two points as float.

distance_n_atoms(cords)
-----------------------

This function calculates distance between a set of points. 

Parameters:
cords, list of alpha carbon coordinates from possible glycosylation sites

Returns:
Coord_matrix, nested [dictionary] of distances between coordinates

allSASD(reslist, jwalk_pdbfile, Jwalk_path, jwalk_map)
------------------------------------------------------

This program uses Jwalk, cited here:

Sinnott et al., Combining Information from Crosslinks and Monolinks in the Modeling
of Protein Structures, Structure (2020), https://doi.org/10/1016/j.str.2020.05.012

The Importance of Non-accessible Crosslinks and Solvent Accessible Surface Distance in Modeling Proteins with Restraints From Crosslinking Mass Spectrometry.
J Bullock, J Schwab, K Thalassinos, M Topf. Mol Cell Proteomics. 15, 2491–2500, 2016

Purpose: This function uses Jwalk to find the SAS distances between a list
of residues 

Parameters:

reslist: list of surface residues to calculate SASD for

jwalk_pdbfile : FILE
   name of pdb file which contains the residues in JWALK renumbered format.

Jwalk_path: STRING
   string of the path location to Jwalk download

jwalk_map: dictionary that maps new pdb file numbering to original pdb file numbering 
   {chain: {jwalk pdb num: original num}}

Returns:

SASD_dict: dict of SASD for each residue pair in residue list
   key: (res1, res2) - strings
   value: SASD - float 

calc_overlap(sphere_rad, distances, priority_sites)
---------------------------------------------------

This function identifies and creates clusters 

Parameters:

sphere_rad, input radius of glycan size

distances, distance dictionary between coordinate pairs

priority_sites, list of residues that must be in each combo

Returns:

all_clusters, Dictionary of {site number: non overlapping sites}

reduce_clusters(sphere_rad, distances, combinations, priority_sites)
--------------------------------------------------------------------

This function reduces size of clusters until they have no remaining clashes under new clash radius

Parameters:

sphere_rad, input radius of glycan size

distances, distance dictionary between coordinate pairs

combinations, nested list of combinations for clusters created from previous overlap radius

priority_sites, list of residues that must be in each combo

Returns:

new_clusters, list of new combinations that have no clashes based on starting combination and sphere_rad

write_fasta(pdb, destination)
-----------------------------

Takes pdb file and write FASTA sequence manually to a txt file
(ensures that all residues that appear in PDB match those submitted to
NetNGlyc)

Parameters:

pdb: pdb file of protein
destination: path where user wants to send txt file of FASTA seq, should 
end in "/"

Returns:

None

mutate_fasta(res_to_mutate, allRes, map_dict, mutant_sites, original_fasta, destination)
----------------------------------------------------------------------------------------

Takes pdb file and write FASTA sequence manually to a txt file
with residue to mutate changed to appropriate residue to ensure consensus sequence

Parameters:

res_to_mutate: string ('number chain' e.g. '20 A') of residue to mutate (based on PDB numbering NOT mapped integer numbering)
   Provides the first position residue in set of 3, may not actually be the residue that will change (could be third residue)

allRes: dict of surface residues (number and chain) and corresponding amino acid values

map_dict: nested dict::

   nested dictionary of {chain: {resnum: index}} 

destination: path where user wants to send txt file of FASTA seq, should end in '/'

mutant_sites: list of list of sites that can be consensus sequence with mutation

original_fasta: txt file of fasta seq for original protein sequence

Returns:

None

run_netNglyc_all(fasta, map_dict, threshold = 6, netnglyc_loc)
--------------------------------------------------------------

Evaluates likelihood of glycosylation using NetNGlyc

Parameters:

fasta : file
   input file of protein fasta sequence.
map_dict : nested dict
   nested dictionary of {chain: {resnum: index}}
threshold : integer, optional
   number of neural nets that must agree for site to count as 'likely' glycosylated. The default is 6.
netnglyc_loc : string
   path to netnglyc folder

Returns:

predicted_sites: list
   list of strings ('resnum chain') predicted by netNglyc to be glycosylated.


run_netNglyc_chain(fasta, map_dict, threshold = 6,netnglyc_loc)
---------------------------------------------------------------

Evaluates likelihood of glycosylation using NetNGlyc

Parameters:

fasta : file
   input file of protein fasta sequence.
map_dict : nested dict
   nested dictionary of {chain: {resnum: index}}
threshold : integer, optional
   number of neural nets that must agree for site to count as 'likely' glycosylated. The default is 6.
netnglyc_loc : string
   path to netnglyc folder

Returns:

predicted_sites: list
   list of strings ('resnum chain') predicted by netNglyc to be glycosylated.

coverage_rank(distance_n_atoms, clusters, i, radius = 17.5)
-----------------------------------------------------------

Refine possible cluster combinations based on predicted
coverage -- only high coverage combinations remain 

Parameters:

distance_n_atoms : dictionary of {site: {residue: distance}} using linear
   distance calculation (all surface residues)
   
clusters : dict
   Dictionary of {site number: non overlapping sites}

radius: int
   distance cutoff of nearby residues considered to be covered by glycan

i: int
   final number of combinations needed

Returns:

refined_clusters: dict
   Dictionary of {site number: non overlapping sites} (i.e. combos/clusters) with lower-coverage combinations eliminated

iterate_clusters(distance_n_atoms, clusters, siteDistances, priority_sites, i, j, initial_r)
--------------------------------------------------------------------------------------------

Iteratively increase radius of overlap to reduce size of combinations in clusters
and check coverage to reduce number of combinations

Parameters:

distance_n_atoms : dictionary of {site: {residue: distance}} using linear distance calculation (all surface residues)

priority_sites : list
   list of N residue of user designated priority sites that must be glycosylated

clusters : dict
   Dictionary of {site number: non overlapping sites}


i: int
   final number of clusters (j different combinations)
j: int
   final size of each cluster (each cluster constitutes i or less residues)

initial_r: int
   first radius of overlap to determine overlapping residues in clusters


siteDistances: dict
   distance dictionary between coordinate pairs

Returns:

next_clusters: dict
   Dictionary of {site number: non overlapping sites} with j site numbers, combinations of size i

refine_clusters(cluster_list, max_len=0)
----------------------------------------

Reduce number of combinations in cluster_list so that only clusters of 
longer length are included

Parameters:

cluster_list: nested LIST
   Nested list of all combinations
max_len: integer, optional
   Minimum length of a combination. The default is 0.

Returns:

new_list: LIST
   Nested list of combinations with minimum length of max_len.

add_glycans(pdb, combo_list, glycan, native_sites, destination, model_glycans)
------------------------------------------------------------------------------

Generates glycosylated pdb files for each combination

Parameters:

pdb : file
   pdb file of protein.
combo_list : nested list
   list of list of site combinations for adding glycans.
glycan: string
   indicates glycan to be added
native_sites:
   list of native glycosylation sites identified previously
model_glycans:
   STRING 'Y' or 'N', only if 'Y' will have rosetta do glycan sampling of conformations

Returns:

updated_combos list
   list of lists of combos to account for any residues removed due to disulfide bonds.