Functions

surface_residues(pdb, min_SASA)

This function takes a protein pdb file and returns a list of all surface residues whose SASA exceeds min_SASA (default 2.5 Å²).

Based on http://pymolwiki.org/index.php/FindSurfaceResidues, modified to use the freesasa module to compute the SASA of each residue.

freesasa is used for SASA calculation and cited here:

Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 5:189. (doi: 10.12688/f1000research.7931.1)

Parameters:

pdb: FILE

pdb file of protein.

Returns:

surface_residues: [List] of surface residues in the protein.

residue_info(surface_residues, pdb, destination)

This function builds dictionaries describing a list of input residues, including the type and the coordinates of each residue. They are built once to reduce the number of times the pdb file needs to be parsed.

Parameters:

surface_residues: LIST

list of strings: ‘resnum chain’ surface residues of interest

pdb: FILE

pdb file of protein.

Returns:

surface_dict: dictionary of {‘resnum chain’: residue type}

surface_coords: dictionary of {‘resnum chain’: coordinates}

map_dict: nested dictionary of {chain: {resnum: index}}

jwalk_map: dictionary that maps old pdb file numbering to new pdb file numbering:

{chain: {original pdb num: jwalk pdb id}}

consecutive_residues(resMap)

This function finds runs of three or more consecutive residue numbers and groups each run into its own list.

Parameters:

resMap, dictionary of residues by chain

Returns:

[list] of lists of consecutive numbers in the input residue numbers or empty list if no consecutive residue numbers found
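
The grouping described above can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes a flat list of integer residue numbers rather than the per-chain resMap dictionary, and the name consecutive_runs and its min_len argument are hypothetical.

```python
def consecutive_runs(numbers, min_len=3):
    """Group integers into runs of consecutive values with at least min_len members."""
    runs, current = [], []
    for n in sorted(set(numbers)):
        if current and n == current[-1] + 1:
            # n extends the run that is currently being built
            current.append(n)
        else:
            # the run is broken: keep it only if it is long enough
            if len(current) >= min_len:
                runs.append(current)
            current = [n]
    # do not forget the last run
    if len(current) >= min_len:
        runs.append(current)
    return runs


example = consecutive_runs([1, 2, 3, 7, 8, 10, 11, 12, 13])
print(example)
```

Applying the same helper to the residue numbers of each chain in turn would give a per-chain result.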

consensus_sequences_similarAA()

This function creates possible combinations of N-glycosylation consensus sequences that may be found in an amino acid sequence.

Returns:

[list] of possible consensus sequences for N-linked glycosylation
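
As an illustration of how such a list could be generated, the sketch below enumerates the canonical N-linked sequon N-X-S/T, where X is any amino acid except proline. Whether consensus_sequences_similarAA uses exactly this rule is an assumption; the name sequon_combinations is hypothetical.

```python
from itertools import product

# One-letter codes of the 20 standard amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequon_combinations():
    """Return every N-X-S/T triplet with X != P as a three-letter string."""
    middle = [aa for aa in AMINO_ACIDS if aa != "P"]  # proline is excluded at X
    return ["".join(combo) for combo in product("N", middle, "ST")]


sequons = sequon_combinations()
print(len(sequons), sequons[:4])
```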

identify_native_site(target, query, res_Dict)

This function identifies native glycosylation sites in EpitopeCA.txt by checking whether the consensus sequence occurs in its sequence.

Parameters:

target, list of consecutive residue numbers from EpitopeCA.txt

query, list of possible consensus sequence combinations

res_Dict, dictionary of {residue number: residue name} from CA file.

Returns:

nested [list] of residue numbers of native glycosylation sites.

id_double_mutation_similarAA(target, query, query_nat, res_Dict)

This function identifies novel glycosylation sites that require two mutations in EpitopeCA.txt by checking whether one of the three amino acids in the consensus sequence is already present in its sequence.

Parameters:

target, list of consecutive residue numbers from EpitopeCA.txt

query, list of possible consensus sequence combinations

res_Dict, dictionary of {residue number: residue name} from CA file.

Returns:

nested [list] of residue numbers of glycosylation sites that can be created with two mutations.

dist(a, b)

This function calculates distance between two points (a, b)

Parameters:

a, point 1 (x1, y1, z1)

b, point 2 (x2, y2, z2)

Returns:

Distance between two points as float.
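
A minimal sketch of this calculation, assuming the distance is the straight-line (Euclidean) distance between the two coordinate tuples. The name euclidean_distance is hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


d = euclidean_distance((0.0, 0.0, 0.0), (1.0, 2.0, 2.0))
print(d)
```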

distance_n_atoms(cords)

This function calculates the pairwise distances between a set of points.

Parameters:

cords, list of alpha carbon coordinates from possible glycosylation sites

Returns:

Coord_matrix, nested [dictionary] of distances between coordinates
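
One way to picture the nested dictionary is sketched below. The input shape is an assumption made for illustration: coordinates are passed as a dictionary keyed by ‘resnum chain’ labels rather than as a list, and the name pairwise_distances is hypothetical. It needs Python 3.8 or later for math.dist.

```python
import math


def pairwise_distances(coords):
    """Return {label: {other_label: distance}} for every pair of labelled points."""
    matrix = {}
    for name_a, point_a in coords.items():
        matrix[name_a] = {
            name_b: math.dist(point_a, point_b)
            for name_b, point_b in coords.items()
            if name_b != name_a  # skip the zero distance of a point to itself
        }
    return matrix


coords = {"10 A": (0.0, 0.0, 0.0), "25 A": (3.0, 4.0, 0.0), "40 A": (0.0, 0.0, 5.0)}
matrix = pairwise_distances(coords)
print(matrix["10 A"])
```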

allSASD(reslist, jwalk_pdbfile, Jwalk_path, jwalk_map)

This program uses Jwalk, cited here:

Sinnott et al., Combining Information from Crosslinks and Monolinks in the Modeling of Protein Structures, Structure (2020), https://doi.org/10.1016/j.str.2020.05.012

The Importance of Non-accessible Crosslinks and Solvent Accessible Surface Distance in Modeling Proteins with Restraints From Crosslinking Mass Spectrometry. J Bullock, J Schwab, K Thalassinos, M Topf. Mol Cell Proteomics. 15, 2491–2500, 2016

Purpose: This function uses Jwalk to find the solvent accessible surface distance (SASD) between each pair of residues in a list

Parameters:

reslist: list of surface residues to calculate SASD for

jwalk_pdbfile: FILE

name of pdb file which contains the residues in JWALK renumbered format.

Jwalk_path: STRING

string of the path location to Jwalk download

jwalk_map: dictionary that maps new pdb file numbering to original pdb file numbering

{chain: {jwalk pdb num: original num}}

Returns:

SASD_dict: dict of SASD for each residue pair in residue list

key: (res1, res2) - strings

value: SASD - float

calc_overlap(sphere_rad, distances, priority_sites)

This function identifies, for each site, the sites whose glycans would not overlap with it at the given radius, and groups them into clusters

Parameters:

sphere_rad, input radius of glycan size

distances, distance dictionary between coordinate pairs

priority_sites, list of residues that must be in each combo

Returns:

all_clusters, Dictionary of {site number: non overlapping sites}
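
The idea can be sketched roughly as below. The clash rule used here (two sites overlap when their distance is below sphere_rad) is an assumption, priority_sites handling is left out, and the name non_overlapping_sites is hypothetical, so treat this as an illustration of the output shape rather than the actual algorithm.

```python
def non_overlapping_sites(distances, sphere_rad):
    """Map each site to the sites lying at least sphere_rad away from it.

    distances is a nested dictionary {site: {other_site: distance}}.
    """
    clusters = {}
    for site, neighbours in distances.items():
        clusters[site] = sorted(
            other for other, d in neighbours.items() if d >= sphere_rad
        )
    return clusters


distances = {
    "10 A": {"25 A": 8.0, "40 A": 30.0},
    "25 A": {"10 A": 8.0, "40 A": 22.0},
    "40 A": {"10 A": 30.0, "25 A": 22.0},
}
clusters = non_overlapping_sites(distances, sphere_rad=15.0)
print(clusters)
```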

reduce_clusters(sphere_rad, distances, combinations, priority_sites)

This function reduces the size of clusters until no clashes remain under the new clash radius

Parameters:

sphere_rad, input radius of glycan size

distances, distance dictionary between coordinate pairs

combinations, nested list of combinations for clusters created from previous overlap radius

priority_sites, list of residues that must be in each combo

Returns:

new_clusters, list of new combinations that have no clashes based on starting combination and sphere_rad

write_fasta(pdb, destination)

Takes a pdb file and writes its FASTA sequence manually to a txt file (ensures that the residues that appear in the PDB match those submitted to NetNGlyc)

Parameters:

pdb: pdb file of protein

destination: path where user wants to send txt file of FASTA seq, should end in “/”

Returns:

None
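
Reading the sequence directly from the coordinate records might look like the sketch below. It assumes standard fixed-width PDB ATOM records, takes one residue per alpha carbon, ignores alternate locations and non-standard residues, and returns the sequences instead of writing a file. The names sequences_from_pdb_lines and ca_line are hypothetical, and the sample lines are truncated after the residue number.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequences_from_pdb_lines(pdb_lines):
    """Return {chain: one-letter sequence} built from the CA atoms of ATOM records."""
    sequences = {}
    for line in pdb_lines:
        # Fixed PDB columns: atom name 13-16, residue name 18-20, chain ID 22
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21:22]
            one_letter = THREE_TO_ONE.get(line[17:20], "X")  # X for unknown residues
            sequences[chain] = sequences.get(chain, "") + one_letter
    return sequences


def ca_line(serial, res_name, chain, res_num):
    """Build the first 26 columns of an ATOM record for an alpha carbon (sample data only)."""
    return f"ATOM  {serial:>5}  CA  {res_name} {chain}{res_num:>4}"


sample = [
    ca_line(1, "ASN", "A", 10),
    ca_line(2, "GLY", "A", 11),
    ca_line(3, "THR", "A", 12),
    ca_line(4, "MET", "B", 1),
]
sequences = sequences_from_pdb_lines(sample)
print(sequences)
```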

mutate_fasta(res_to_mutate, allRes, map_dict, mutant_sites, original_fasta, destination)

Writes the FASTA sequence manually to a txt file, with the residue to mutate changed to the residue needed to complete the consensus sequence

Parameters:

res_to_mutate: string (‘number chain’ e.g. ‘20 A’) of residue to mutate (based on PDB numbering NOT mapped integer numbering)

Provides the first position residue in set of 3, may not actually be the residue that will change (could be third residue)

allRes: dict of surface residues (number and chain) and corresponding amino acid values

map_dict: nested dict

nested dictionary of {chain: {resnum: index}}

destination: path where user wants to send txt file of FASTA seq, should end in ‘/’

mutant_sites: list of list of sites that can be consensus sequence with mutation

original_fasta: txt file of fasta seq for original protein sequence

Returns:

None

run_netNglyc_all(fasta, map_dict, threshold = 6, netnglyc_loc)

Evaluates likelihood of glycosylation using NetNGlyc

Parameters:

fasta: file

input file of protein fasta sequence.

map_dict: nested dict

nested dictionary of {chain: {resnum: index}}

threshold: integer, optional

number of neural nets that must agree for site to count as ‘likely’ glycosylated. The default is 6.

netnglyc_loc: string

path to netnglyc folder

Returns:

predicted_sites: list

list of strings (‘resnum chain’) predicted by netNglyc to be glycosylated.

run_netNglyc_chain(fasta, map_dict, threshold = 6,netnglyc_loc)

Evaluates likelihood of glycosylation using NetNGlyc

Parameters:

fasta: file

input file of protein fasta sequence.

map_dict: nested dict

nested dictionary of {chain: {resnum: index}}

threshold: integer, optional

number of neural nets that must agree for site to count as ‘likely’ glycosylated. The default is 6.

netnglyc_loc: string

path to netnglyc folder

Returns:

predicted_sites: list

list of strings (‘resnum chain’) predicted by netNglyc to be glycosylated.

coverage_rank(distance_n_atoms, clusters, i, radius = 17.5)

Refine possible cluster combinations based on predicted coverage – only high coverage combinations remain

Parameters:

distance_n_atoms: dictionary of {site: {residue: distance}} using linear distance calculation (all surface residues)

clusters: dict

Dictionary of {site number: non overlapping sites}

radius: float, optional

distance cutoff of nearby residues considered to be covered by glycan. The default is 17.5.

i: int

final number of combinations needed

Returns:

refined_clusters: dict

Dictionary of {site number: non overlapping sites} (i.e. combos/clusters) with lower-coverage combinations eliminated
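
A rough sketch of coverage-based ranking follows. It assumes coverage is the number of distinct surface residues within radius of at least one site in a combination, and that the best-scoring combinations are kept. The actual scoring used by coverage_rank may differ, and the names rank_by_coverage and keep are hypothetical.

```python
def rank_by_coverage(site_distances, clusters, keep, radius=17.5):
    """Keep the `keep` combinations that cover the most residues.

    site_distances: {site: {residue: distance}}
    clusters: {site number: list of sites in the combination}
    """
    def coverage(sites):
        covered = set()
        for site in sites:
            # a residue counts as covered if it lies within radius of the site
            covered.update(
                res for res, d in site_distances[site].items() if d <= radius
            )
        return len(covered)

    ranked = sorted(clusters, key=lambda key: coverage(clusters[key]), reverse=True)
    return {key: clusters[key] for key in ranked[:keep]}


site_distances = {
    "10 A": {"11 A": 4.0, "12 A": 9.0, "50 A": 30.0},
    "40 A": {"11 A": 25.0, "12 A": 20.0, "50 A": 6.0},
    "70 A": {"11 A": 40.0, "12 A": 35.0, "50 A": 16.0},
}
clusters = {0: ["10 A", "40 A"], 1: ["40 A", "70 A"], 2: ["70 A"]}
refined = rank_by_coverage(site_distances, clusters, keep=1)
print(refined)
```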

iterate_clusters(distance_n_atoms, clusters, siteDistances, priority_sites, i, j, initial_r)

Iteratively increase radius of overlap to reduce size of combinations in clusters and check coverage to reduce number of combinations

Parameters:

distance_n_atoms : dictionary of {site: {residue: distance}} using linear distance calculation (all surface residues)

priority_sites: list

list of N residue of user designated priority sites that must be glycosylated

clusters: dict

Dictionary of {site number: non overlapping sites}

i: int

final number of clusters (j different combinations)

j: int

final size of each cluster (each cluster consists of i or fewer residues)

initial_r: int

first radius of overlap to determine overlapping residues in clusters

siteDistances: dict

distance dictionary between coordinate pairs

Returns:

next_clusters: dict

Dictionary of {site number: non overlapping sites} with j site numbers, combinations of size i

refine_clusters(cluster_list, max_len=0)

Reduce number of combinations in cluster_list so that only clusters of longer length are included

Parameters:

cluster_list: nested LIST

Nested list of all combinations

max_len: integer, optional

Minimum length of a combination. The default is 0.

Returns:

new_list: LIST

Nested list of combinations with minimum length of max_len.
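
The filtering step can be illustrated with a one-line comprehension. This sketch assumes a combination is kept when its length is greater than or equal to the threshold; the names keep_long_combinations and min_len are hypothetical.

```python
def keep_long_combinations(cluster_list, min_len=0):
    """Return only the combinations that contain at least min_len sites."""
    return [combo for combo in cluster_list if len(combo) >= min_len]


combos = [["10 A"], ["10 A", "40 A"], ["10 A", "40 A", "70 A"]]
longer = keep_long_combinations(combos, min_len=2)
print(longer)
```

With the default threshold of 0 every combination is returned unchanged.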

add_glycans(pdb, combo_list, glycan, native_sites, destination, model_glycans)

Generates glycosylated pdb files for each combination

Parameters:

pdb: file

pdb file of protein.

combo_list: nested list

list of list of site combinations for adding glycans.

glycan: string

indicates glycan to be added

native_sites:

list of native glycosylation sites identified previously

model_glycans: STRING

‘Y’ or ‘N’; Rosetta samples glycan conformations only if ‘Y’

Returns:

updated_combos: list

list of lists of combos to account for any residues removed due to disulfide bonds.