Functions¶
surface_residues(pdb, min_SASA)¶
This function takes a protein input and outputs a list of all the surface residues if their SASA > min_SASA, set at default to 2.5 Å
based on: http://pymolwiki.org/index.php/FindSurfaceResidues and modified to use freesasa module to find SASA of residues
freesasa is used for SASA calculation and cited here:
Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 5:189. (doi: 10.12688/f1000research.7931.1)
Parameters:
- pdbFILE
pdb file of protein.
Returns:
surface_residues: [List] of surface residues in the protein.
residue_info(surface_residues, pdb, destination)¶
This function generates dictionaries with information about a list of input residues, including a dictionary of the type of each residue and the coordinates made to reduce the number of times a pdb file needs to be parsed
Parameters:
- surface_residues: LIST
list of strings: ‘resnum chain’ surface residues of interest
- pdbFILE
pdb file of protein.
Returns:
surface_dict: dictionary of {‘resnum chain’: residue type}
surface_coords: dictionary of {‘resnum chain’: coordinates}
map_dict: nested dictionary of {chain: {resnum: index}}
jwalk_map: dictionary that maps old pdb file numbering to new pdb file numbering:
{chain: {original pdb num: jwalk pdb id}}
consecutive_residues(resMap)¶
This function finds three or more consecutive residues and adds them into a single separate list together
Parameters:
resMap, dictionary of residues by chain
Returns:
[list] of lists of consecutive numbers in the input residue numbers or empty list if no consecutive residue numbers found
consensus_sequences_similarAA()¶
This function creates possible combinations of N-glycosylation consensus sequences that may be found in an amino acid sequence.
Returns:
[list] of possible consensus sequences for N-linked glycosylation
identify_native_site(target, query, res_Dict)¶
This function identify native glycosylation sites in EpitopeCA.txt by identifying whether the consensus sequence lies in its sequence.
Parameters:
target, list of consecutive residue numbers from EpitopeCA.txt
query, list of possible consensus sequence combinations
res_Dict, dictionary of {residue number: residue name} from CA file.
Returns:
nested [list] of residue numbers of native glycosylation sites.
id_double_mutation_similarAA(target, query, query_nat, res_Dict)¶
This function identify novel glycosylation sites two mutations in EpitopeCA.txt by identifying whether 1 of three amino acids in the consensus sequence lies in its sequence.
Parameters:
target, list of consecutive residue numbers from EpitopeCA.txt
query,list of possible consensus sequence combinations
res_Dict, dictionary of {residue number: residue name} from CA file.
Returns:
nested [list] of residue numbers of glycosylation sites that can be singly mutated.
dist(a, b)¶
This function calculates distance between two points (a, b)
Parameters:
a, point 1 (x1, y1, z1) b, point 2 (x2, y2, z2)
Returns:
Distance between two points as float.
distance_n_atoms(cords)¶
This function calculates distance between a set of points.
Parameters: cords, list of alpha carbon coordinates from possible glycosylation sites
Returns: Coord_matrix, nested [dictionary] of distances between coordinates
allSASD(reslist, jwalk_pdbfile, Jwalk_path, jwalk_map)¶
This program uses Jwalk, cited here:
Sinnott et al., Combining Information from Crosslinks and Monolinks in the Modeling of Protein Structures, Structure (2020), https://doi.org/10/1016/j.str.2020.05.012
The Importance of Non-accessible Crosslinks and Solvent Accessible Surface Distance in Modeling Proteins with Restraints From Crosslinking Mass Spectrometry. J Bullock, J Schwab, K Thalassinos, M Topf. Mol Cell Proteomics. 15, 2491–2500, 2016
Purpose: This function uses Jwalk to find the SAS distances between a list of residues
Parameters:
reslist: list of surface residues to calculate SASD for
- jwalk_pdbfileFILE
name of pdb file which contains the residues in JWALK renumbered format.
- Jwalk_path: STRING
string of the path location to Jwalk download
- jwalk_map: dictionary that maps new pdb file numbering to original pdb file numbering
{chain: {jwalk pdb num: original num}}
Returns:
- SASD_dict: dict of SASD for each residue pair in residue list
key: (res1, res2) - strings value: SASD - float
calc_overlap(sphere_rad, distances, priority_sites)¶
This function identifies and creates clusters
Parameters:
sphere_rad, input radius of glycan size
distances, distance dictionary between coordinate pairs
priority_sites, list of residues that must be in each combo
Returns:
all_clusters, Dictionary of {site number: non overlapping sites}
reduce_clusters(sphere_rad, distances, combinations, priority_sites)¶
This function reduces size of clusters until they have no remaining clashes under new clash radius
Parameters:
sphere_rad, input radius of glycan size
distances, distance dictionary between coordinate pairs
combinations, nested list of combinations for clusters created from previous overlap radius
priority_sites, list of residues that must be in each combo
Returns:
new_clusters, list of new combinations that have no clashes based on starting combination and sphere_rad
write_fasta(pdb, destination)¶
Takes pdb file and write FASTA sequence manually to a txt file (ensures that all residues that appear in PDB match those submitted to NetNGlyc)
Parameters:
pdb: pdb file of protein destination: path where user wants to send txt file of FASTA seq, should end in “/”
Returns:
None
mutate_fasta(res_to_mutate, allRes, map_dict, mutant_sites, original_fasta, destination)¶
Takes pdb file and write FASTA sequence manually to a txt file with residue to mutate changed to appropriate residue to ensure consensus sequence
Parameters:
- res_to_mutate: string (‘number chain’ e.g. ‘20 A’) of residue to mutate (based on PDB numbering NOT mapped integer numbering)
Provides the first position residue in set of 3, may not actually be the residue that will change (could be third residue)
allRes: dict of surface residues (number and chain) and corresponding amino acid values
map_dict: nested dict:
nested dictionary of {chain: {resnum: index}}
destination: path where user wants to send txt file of FASTA seq, should end in ‘/’
mutant_sites: list of list of sites that can be consensus sequence with mutation
original_fasta: txt file of fasta seq for original protein sequence
Returns:
None
run_netNglyc_all(fasta, map_dict, threshold = 6, netnglyc_loc)¶
Evaluates likelihood of glycosylation using NetNGlyc
Parameters:
- fastafile
input file of protein fasta sequence.
- map_dictnested dict
nested dictionary of {chain: {resnum: index}}
- thresholdinteger, optional
number of neural nets that must agree for site to count as ‘likely’ glycosylated. The default is 6.
- netnglyc_locstring
path to netnglyc folder
Returns:
- predicted_sites: list
list of strings (‘resnum chain’) predicted by netNglyc to be glycosylated.
run_netNglyc_chain(fasta, map_dict, threshold = 6,netnglyc_loc)¶
Evaluates likelihood of glycosylation using NetNGlyc
Parameters:
- fastafile
input file of protein fasta sequence.
- map_dictnested dict
nested dictionary of {chain: {resnum: index}}
- thresholdinteger, optional
number of neural nets that must agree for site to count as ‘likely’ glycosylated. The default is 6.
- netnglyc_locstring
path to netnglyc folder
Returns:
- predicted_sites: list
list of strings (‘resnum chain’) predicted by netNglyc to be glycosylated.
coverage_rank(distance_n_atoms, clusters, i, radius = 17.5)¶
Refine possible cluster combinations based on predicted coverage – only high coverage combinations remain
Parameters:
- distance_n_atomsdictionary of {site: {residue: distance}} using linear
distance calculation (all surface residues)
- clustersdict
Dictionary of {site number: non overlapping sites}
- radius: int
distance cutoff of nearby residues considered to be covered by glycan
- i: int
final number of combinations needed
Returns:
- refined_clusters: dict
Dictionary of {site number: non overlapping sites} (i.e. combos/clusters) with lower-coverage combinations eliminated
iterate_clusters(distance_n_atoms, clusters, siteDistances, priority_sites, i, j, initial_r)¶
Iteratively increase radius of overlap to reduce size of combinations in clusters and check coverage to reduce number of combinations
Parameters:
distance_n_atoms : dictionary of {site: {residue: distance}} using linear distance calculation (all surface residues)
- priority_siteslist
list of N residue of user designated priority sites that must be glycosylated
- clustersdict
Dictionary of {site number: non overlapping sites}
- i: int
final number of clusters (j different combinations)
- j: int
final size of each cluster (each cluster constitutes i or less residues)
- initial_r: int
first radius of overlap to determine overlapping residues in clusters
- siteDistances: dict
distance dictionary between coordinate pairs
Returns:
- next_clusters: dict
Dictionary of {site number: non overlapping sites} with j site numbers, combinations of size i
refine_clusters(cluster_list, max_len=0)¶
Reduce number of combinations in cluster_list so that only clusters of longer length are included
Parameters:
- cluster_list: nested LIST
Nested list of all combinations
- max_len: integer, optional
Minimum length of a combination. The default is 0.
Returns:
- new_list: LIST
Nested list of combinations with minimum length of max_len.
add_glycans(pdb, combo_list, glycan, native_sites, destination, model_glycans)¶
Generates glycosylated pdb files for each combination
Parameters:
- pdbfile
pdb file of protein.
- combo_listnested list
list of list of site combinations for adding glycans.
- glycan: string
indicates glycan to be added
- native_sites:
list of native glycosylation sites identified previously
- model_glycans:
STRING ‘Y’ or ‘N’, only if ‘Y’ will have rosetta do glycan sampling of conformations
Returns:
- updated_combos list
list of lists of combos to account for any residues removed due to disulfide bonds.