Functions

surface_residues(pdb, min_SASA)

This function takes a protein pdb file and returns a list of all surface residues, i.e. residues whose SASA > min_SASA (default 2.5 Å²).

Based on http://pymolwiki.org/index.php/FindSurfaceResidues and modified to use the freesasa module to find the SASA of residues.

freesasa is used for SASA calculation and cited here:

Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 5:189. (doi: 10.12688/f1000research.7931.1)

Parameters:

pdb: FILE: pdb file of protein.
min_SASA: minimum SASA a residue must exceed to be counted as a surface residue (default 2.5).

Returns:

surface_residues: [List] of surface residues in the protein.
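The filtering step described above can be sketched as follows. This is a minimal illustration, not the package's code: it assumes per-residue SASA values have already been computed (for example with freesasa) and stored in a nested dictionary {chain: {resnum: SASA}}. The helper name filter_surface and that input shape are hypothetical; the output uses the 'resnum chain' string format described for residue_info.

```python
def filter_surface(sasa_by_chain, min_sasa=2.5):
    """Return 'resnum chain' labels for residues whose SASA exceeds min_sasa.

    sasa_by_chain is a hypothetical input: {chain: {resnum: SASA}}.
    """
    surface = []
    for chain, residues in sasa_by_chain.items():
        for resnum, sasa in residues.items():
            # Strict inequality, matching "SASA > min_SASA" in the description.
            if sasa > min_sasa:
                surface.append(f"{resnum} {chain}")
    return surface


example = {"A": {1: 40.2, 2: 0.0, 3: 2.5}, "B": {1: 12.7}}
print(filter_surface(example))  # ['1 A', '1 B']
```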

residue_info(surface_residues, pdb, destination)

This function generates dictionaries with information about a list of input residues, including the type and the coordinates of each residue. Building these dictionaries once reduces the number of times the pdb file needs to be parsed.

Parameters:

surface_residues: LIST: list of ‘resnum chain’ strings for the surface residues of interest
pdb: FILE: pdb file of protein.

Returns:

surface_dict: dictionary of {‘resnum chain’: residue type}

surface_coords: dictionary of {‘resnum chain’: coordinates}

map_dict: nested dictionary of {chain: {resnum: index}}

jwalk_map: dictionary that maps old pdb file numbering to new pdb file numbering:

{chain: {original pdb num: jwalk pdb id}}

consecutive_residues(resMap)

This function finds runs of three or more consecutive residues and collects each run in its own list

Parameters:

resMap, dictionary of residues by chain

Returns:

[list] of lists of consecutive numbers in the input residue numbers or empty list if no consecutive residue numbers found
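A minimal sketch of the run-finding step, assuming the residue numbers for one chain are available as a flat list of integers. The helper name consecutive_runs and the min_len parameter are illustrative, not part of the documented API.

```python
def consecutive_runs(numbers, min_len=3):
    """Group integers into runs of at least min_len consecutive values."""
    runs, current = [], []
    for n in sorted(set(numbers)):
        if current and n == current[-1] + 1:
            current.append(n)          # extends the current run
        else:
            if len(current) >= min_len:
                runs.append(current)   # close a run that is long enough
            current = [n]              # start a new run
    if len(current) >= min_len:
        runs.append(current)
    return runs


print(consecutive_runs([1, 2, 3, 5, 6, 9, 10, 11, 12]))
# [[1, 2, 3], [9, 10, 11, 12]]
```

An input with no run of three or more numbers yields an empty list, matching the documented return value.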

consensus_sequences_similarAA()

This function creates possible combinations of N-glycosylation consensus sequences that may be found in an amino acid sequence.

Returns:

[list] of possible consensus sequences for N-linked glycosylation
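As an illustration of how such a list can be enumerated, the sketch below generates the standard N-linked glycosylation sequon N-X-S/T, where X is any residue except proline. The function name nglyc_sequons is hypothetical, and the "similarAA" variant in this package may enumerate a different or larger set.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nglyc_sequons():
    """Enumerate N-X-S/T tripeptides where X is any residue except proline."""
    middle = [aa for aa in AMINO_ACIDS if aa != "P"]
    return ["N" + x + third for x, third in product(middle, "ST")]


sequons = nglyc_sequons()
print(len(sequons))  # 19 middle residues x 2 third residues = 38
```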

identify_native_site(target, query, res_Dict)

This function identifies native glycosylation sites in EpitopeCA.txt by checking whether a consensus sequence occurs in its sequence.

Parameters:

target, list of consecutive residue numbers from EpitopeCA.txt

query, list of possible consensus sequence combinations

res_Dict, dictionary of {residue number: residue name} from CA file.

Returns:

nested [list] of residue numbers of native glycosylation sites.
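The matching step can be sketched as a sliding window of three residues over each run of consecutive residue numbers. This is an assumption-laden illustration: it assumes res_dict maps residue numbers to one-letter codes and that query holds three-character strings such as 'NAT'. The helper name find_native_sites is hypothetical.

```python
def find_native_sites(target, query, res_dict):
    """Return residue-number triples whose residues spell a sequence in query.

    Assumes res_dict maps residue number -> one-letter code and query holds
    three-character strings such as 'NAS'; both are illustrative assumptions.
    """
    wanted = set(query)
    sites = []
    for run in target:
        # Slide a window of three consecutive residue numbers along the run.
        for start in range(len(run) - 2):
            window = run[start:start + 3]
            tripeptide = "".join(res_dict[n] for n in window)
            if tripeptide in wanted:
                sites.append(window)
    return sites


res_dict = {10: "G", 11: "N", 12: "A", 13: "T", 14: "K"}
print(find_native_sites([[10, 11, 12, 13, 14]], ["NAT", "NAS"], res_dict))
# [[11, 12, 13]]
```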

id_double_mutation_similarAA(target, query, query_nat, res_Dict)

This function identifies novel glycosylation sites that require two mutations in EpitopeCA.txt by checking whether one of the three amino acids in the consensus sequence is already present in its sequence.

Parameters:

target, list of consecutive residue numbers from EpitopeCA.txt

query, list of possible consensus sequence combinations

res_Dict, dictionary of {residue number: residue name} from CA file.

Returns:

nested [list] of residue numbers of glycosylation sites that can be singly mutated.

dist(a, b)

This function calculates the distance between two points (a, b)

Parameters:

a, point 1 (x1, y1, z1)

b, point 2 (x2, y2, z2)

Returns:

Distance between two points as float.
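Assuming the distance meant here is the ordinary Euclidean distance between two 3D points, the calculation can be sketched as:

```python
import math


def dist(a, b):
    """Euclidean distance between two 3D points given as (x, y, z)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


print(dist((0.0, 0.0, 0.0), (1.0, 2.0, 2.0)))  # 3.0
```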

distance_n_atoms(cords)

This function calculates the distances between all pairs of points in a set.

Parameters:

cords, list of alpha carbon coordinates from possible glycosylation sites

Returns:

Coord_matrix, nested [dictionary] of distances between coordinates
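A minimal sketch of building such a nested distance dictionary. It keys the result by list index, which is an assumption; the package may key by residue label instead. The helper name pairwise_distances is hypothetical, and math.dist requires Python 3.8 or later.

```python
import math


def pairwise_distances(coords):
    """Build {i: {j: distance}} for every ordered pair of distinct points.

    Keys are list indices here; the package may key by residue label instead.
    """
    matrix = {}
    for i, a in enumerate(coords):
        matrix[i] = {j: math.dist(a, b) for j, b in enumerate(coords) if j != i}
    return matrix


coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
distances = pairwise_distances(coords)
print(distances[0][1], distances[0][2])  # 5.0 2.0
```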

allSASD(reslist, jwalk_pdbfile, Jwalk_path, jwalk_map)

This program uses Jwalk, cited here:

Sinnott et al., Combining Information from Crosslinks and Monolinks in the Modeling of Protein Structures, Structure (2020), https://doi.org/10.1016/j.str.2020.05.012

The Importance of Non-accessible Crosslinks and Solvent Accessible Surface Distance in Modeling Proteins with Restraints From Crosslinking Mass Spectrometry. J Bullock, J Schwab, K Thalassinos, M Topf. Mol Cell Proteomics. 15, 2491–2500, 2016

Purpose: This function uses Jwalk to find the solvent accessible surface distance (SASD) between each pair of residues in a list

Parameters:

reslist: list of surface residues to calculate SASD for

jwalk_pdbfile: FILE: name of pdb file which contains the residues in JWALK renumbered format.
Jwalk_path: STRING: path to the Jwalk download
jwalk_map: dictionary that maps new pdb file numbering to original pdb file numbering: {chain: {jwalk pdb num: original num}}

Returns:

SASD_dict: dict of SASD for each residue pair in the residue list: {(res1, res2): SASD}, where res1 and res2 are strings and SASD is a float

calc_overlap(sphere_rad, distances, priority_sites)

This function identifies, for each site, the sites that do not overlap with it at the given glycan radius and groups them into clusters

Parameters:

sphere_rad, input radius of glycan size

distances, distance dictionary between coordinate pairs

priority_sites, list of residues that must be in each combo

Returns:

all_clusters, Dictionary of {site number: non overlapping sites}
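One way to picture the clustering is a greedy pass that, for each seed site, collects every site that does not clash with the sites already chosen. The sketch below is a guess at the general idea, not the package's algorithm. It assumes two sites clash when their distance is below sphere_rad, that distances is a nested dictionary {site: {other_site: distance}}, and that priority sites appear as keys in distances. The helper name non_overlapping_clusters is hypothetical.

```python
def non_overlapping_clusters(sphere_rad, distances, priority_sites=()):
    """For each seed site, greedily collect sites that clash with none chosen.

    Two sites are assumed to clash when their distance is below sphere_rad.
    distances is assumed to be a nested dict {site: {other_site: distance}}.
    """
    clusters = {}
    for seed in distances:
        chosen = list(priority_sites)  # priority sites are always included
        # Try the seed first, then every other site in dictionary order.
        order = [seed] + [s for s in distances if s != seed]
        for candidate in order:
            if candidate in chosen:
                continue
            if all(distances[candidate][c] >= sphere_rad for c in chosen):
                chosen.append(candidate)
        clusters[seed] = chosen
    return clusters


distances = {
    "A": {"B": 5.0, "C": 20.0},
    "B": {"A": 5.0, "C": 18.0},
    "C": {"A": 20.0, "B": 18.0},
}
print(non_overlapping_clusters(10.0, distances))
# {'A': ['A', 'C'], 'B': ['B', 'C'], 'C': ['C', 'A']}
```

Note that in this sketch a seed that clashes with a priority site is left out of its own cluster, since priority sites take precedence.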

reduce_clusters(sphere_rad, distances, combinations, priority_sites)

This function reduces the size of clusters until no clashes remain under the new clash radius

Parameters:

sphere_rad, input radius of glycan size

distances, distance dictionary between coordinate pairs

combinations, nested list of combinations for clusters created from previous overlap radius

priority_sites, list of residues that must be in each combo

Returns:

new_clusters, list of new combinations that have no clashes based on starting combination and sphere_rad

write_fasta(pdb, destination)

Takes a pdb file and writes its FASTA sequence manually to a txt file (ensures that all residues that appear in the PDB match those submitted to NetNGlyc)

Parameters:

pdb: pdb file of protein

destination: path where user wants to send txt file of FASTA seq, should end in “/”

Returns:

None
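Reading the sequence directly from the structure can be sketched as below: one residue is taken per alpha-carbon ATOM record, using the fixed-width PDB columns for atom name, residue name and chain. The helper names sequence_from_pdb_lines and ca_record are hypothetical, writing the result to a file is omitted, and unknown residue names are mapped to 'X' as an arbitrary choice.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_pdb_lines(lines):
    """Return {chain: one-letter sequence} read from alpha-carbon ATOM records."""
    chains = {}
    for line in lines:
        # Atom name is in columns 13-16, residue name in 18-20, chain in 22.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue = THREE_TO_ONE.get(line[17:20], "X")  # 'X' for unknowns
            chains[chain] = chains.get(chain, "") + residue
    return chains


def ca_record(serial, resname, chain, resseq):
    """Build a fixed-width ATOM record for a CA atom, cut off after resSeq."""
    return f"ATOM  {serial:>5}  CA  {resname} {chain}{resseq:>4}"


lines = [
    ca_record(2, "MET", "A", 1),
    ca_record(10, "ASN", "A", 2),
    ca_record(18, "GLY", "A", 3),
    ca_record(22, "SER", "B", 1),
    "HETATM 1000  O   HOH A 101",  # ignored: not an ATOM record
]
print(sequence_from_pdb_lines(lines))  # {'A': 'MNG', 'B': 'S'}
```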

mutate_fasta(res_to_mutate, allRes, map_dict, mutant_sites, original_fasta, destination)

Writes the FASTA sequence manually to a txt file, with the residue to mutate changed to the appropriate residue to ensure a consensus sequence

Parameters:

res_to_mutate: string (‘number chain’, e.g. ‘20 A’) of the residue to mutate, based on PDB numbering, NOT the mapped integer numbering. It gives the first-position residue in the set of 3, which may not be the residue that actually changes (it could be the third residue)

allRes: dict of surface residues (number and chain) and corresponding amino acid values

map_dict: nested dict: nested dictionary of {chain: {resnum: index}}

destination: path where user wants to send txt file of FASTA seq, should end in ‘/’

mutant_sites: list of lists of sites that can form a consensus sequence with mutation

original_fasta: txt file of fasta seq for original protein sequence

Returns:

None

run_netNglyc_all(fasta, map_dict, threshold = 6, netnglyc_loc)

Evaluates likelihood of glycosylation using NetNGlyc

Parameters:

fasta: file: input file of protein fasta sequence.
map_dict: nested dict: nested dictionary of {chain: {resnum: index}}
threshold: integer, optional: number of neural nets that must agree for site to count as ‘likely’ glycosylated. The default is 6.
netnglyc_loc: string: path to netnglyc folder

Returns:

predicted_sites: list: list of strings (‘resnum chain’) predicted by netNglyc to be glycosylated.

run_netNglyc_chain(fasta, map_dict, threshold = 6, netnglyc_loc)

Evaluates likelihood of glycosylation using NetNGlyc

Parameters:

fasta: file: input file of protein fasta sequence.
map_dict: nested dict: nested dictionary of {chain: {resnum: index}}
threshold: integer, optional: number of neural nets that must agree for site to count as ‘likely’ glycosylated. The default is 6.
netnglyc_loc: string: path to netnglyc folder

Returns:

predicted_sites: list: list of strings (‘resnum chain’) predicted by netNglyc to be glycosylated.

coverage_rank(distance_n_atoms, clusters, i, radius = 17.5)

Refine possible cluster combinations based on predicted coverage – only high coverage combinations remain

Parameters:

distance_n_atoms: dictionary of {site: {residue: distance}} using linear distance calculation (all surface residues)
clusters: dict: Dictionary of {site number: non overlapping sites}
radius: int: distance cutoff of nearby residues considered to be covered by glycan
i: int: final number of combinations needed

Returns:

refined_clusters: dict: Dictionary of {site number: non overlapping sites} (i.e. combos/clusters) with lower-coverage combinations eliminated
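A rough sketch of ranking by coverage, under stated assumptions: a residue counts as covered when any site in the combination lies within radius of it, coverage is the number of covered residues, and the i highest-coverage combinations are kept. The package's actual scoring may differ, and the helper name rank_by_coverage is hypothetical.

```python
def rank_by_coverage(distance_n_atoms, clusters, i, radius=17.5):
    """Keep the i clusters whose sites cover the most residues within radius.

    distance_n_atoms: {site: {residue: distance}}; clusters: {key: [sites]}.
    A residue counts as covered if any site in the cluster lies within radius.
    """
    def covered(sites):
        return {
            residue
            for site in sites
            for residue, d in distance_n_atoms[site].items()
            if d <= radius
        }

    ranked = sorted(clusters, key=lambda k: len(covered(clusters[k])), reverse=True)
    return {k: clusters[k] for k in ranked[:i]}


distance_n_atoms = {
    "s1": {"r1": 5.0, "r2": 30.0, "r3": 10.0},
    "s2": {"r1": 40.0, "r2": 8.0, "r3": 50.0},
    "s3": {"r1": 60.0, "r2": 60.0, "r3": 60.0},
}
clusters = {1: ["s1", "s2"], 2: ["s1", "s3"], 3: ["s3"]}
print(rank_by_coverage(distance_n_atoms, clusters, i=2))
# {1: ['s1', 's2'], 2: ['s1', 's3']}
```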

iterate_clusters(distance_n_atoms, clusters, siteDistances, priority_sites, i, j, initial_r)

Iteratively increase radius of overlap to reduce size of combinations in clusters and check coverage to reduce number of combinations

Parameters:

distance_n_atoms: dictionary of {site: {residue: distance}} using linear distance calculation (all surface residues)
priority_sites: list: list of N residues of user-designated priority sites that must be glycosylated
clusters: dict: Dictionary of {site number: non overlapping sites}
i: int: final number of clusters (j different combinations)
j: int: final size of each cluster (each cluster constitutes i or fewer residues)
initial_r: int: first radius of overlap to determine overlapping residues in clusters
siteDistances: dict: distance dictionary between coordinate pairs

Returns:

next_clusters: dict: Dictionary of {site number: non overlapping sites} with j site numbers, combinations of size i

refine_clusters(cluster_list, max_len=0)

Reduce number of combinations in cluster_list so that only clusters of longer length are included

Parameters:

cluster_list: nested LIST: Nested list of all combinations
max_len: integer, optional: Minimum length of a combination. The default is 0.

Returns:

new_list: LIST: Nested list of combinations with minimum length of max_len.
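The length filter described above can be sketched in one line. The helper name keep_long_combinations and its min_len parameter are illustrative; min_len plays the role of max_len in the documented signature.

```python
def keep_long_combinations(cluster_list, min_len=0):
    """Return only the combinations that contain at least min_len sites."""
    return [combo for combo in cluster_list if len(combo) >= min_len]


combos = [["10 A", "55 A"], ["10 A", "55 A", "90 B"], ["12 A"]]
print(keep_long_combinations(combos, min_len=2))
# [['10 A', '55 A'], ['10 A', '55 A', '90 B']]
```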

add_glycans(pdb, combo_list, glycan, native_sites, destination, model_glycans)

Generates glycosylated pdb files for each combination

Parameters:

pdb: file: pdb file of protein.
combo_list: nested list: list of lists of site combinations for adding glycans.
glycan: string: indicates glycan to be added
native_sites: list of native glycosylation sites identified previously
model_glycans: STRING: ‘Y’ or ‘N’; Rosetta performs glycan sampling of conformations only if ‘Y’

Returns:

updated_combos: list: list of lists of combos to account for any residues removed due to disulfide bonds.
