idpet package

Subpackages

Submodules

idpet.comparison module

idpet.comparison.all_vs_all_comparison(ensembles, score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]

Compare all pairs of ensembles using divergence scores. Implemented scores are approximate average Jensen–Shannon divergence (JSD) over several kinds of molecular features. The lower these scores are, the higher the similarity between the probability distributions of the features of the ensembles. JSD scores here range from a minimum of 0 to a maximum of log(2) ~= 0.6931.

Return type:: dict

Parameters

ensembles: List[Ensemble]: Ensemble objects to analyze.
score: str: Type of score used to compare ensembles. Choices: adaJSD (carbon Alfa Distance Average JSD), ramaJSD (RAMAchandran average JSD) and ataJSD (Alpha Torsion Average JSD). adaJSD scores the average JSD over all Ca-Ca distance distributions of residue pairs with sequence separation > 1. ramaJSD scores the average JSD over the phi-psi angle distributions of all residues. ataJSD scores the average JSD over all alpha torsion angles, which are the angles formed by four consecutive Ca atoms in a protein.
featurization_params: dict, optional: Optional dictionary to customize the featurization process for the above features.
bootstrap_iters: int, optional: Number of bootstrap iterations. The default is None, in which case IDPET compares each pair of ensembles i and j only once, using all of their conformers. If an integer is provided, each pair of ensembles i and j is compared bootstrap_iters times by randomly selecting (bootstrapping) conformations from them. Additionally, each ensemble is auto-compared with itself by subsampling conformers via bootstrapping. IDPET then performs a statistical test to establish whether the inter-ensemble (i != j) scores are significantly different from the intra-ensemble (i == j) scores. The test works as follows: for each ensemble pair i != j, IDPET takes the inter-ensemble comparison scores obtained in bootstrapping. It then takes the bootstrapping scores from the auto-comparisons of ensembles i and j, and selects the set with the higher mean as the reference intra-ensemble scores. Finally, the inter-ensemble and intra-ensemble scores are compared via a one-sided Mann-Whitney U test with the alternative hypothesis that inter-ensemble scores are stochastically greater than intra-ensemble scores. The p-values obtained in these tests are also returned. For small protein structural ensembles (fewer than 500 conformations), most comparison scores in IDPET are not robust estimators of divergence/distance. Bootstrapping helps estimate how ensemble size affects the robustness of the comparisons. Use values >= 50 when comparing small ensembles with fewer than 100 conformations. For ensembles with more than 100 conformations, 10 iterations are typically sufficient to distinguish the distributions of intra- and inter-ensemble comparison scores. For large ensembles with more than 1,000 conformations, bootstrapping can generally be omitted without loss of accuracy.
bootstrap_frac: float, optional: Fraction of the total conformations to sample when bootstrapping. Default value is 1.0, which results in bootstrap samples with the same number of conformations of the original ensemble.
bootstrap_replace: bool, optional: If True, bootstrap will sample with replacement. Default is True.
bins: Union[int, str], optional: Number of bins or bin assignment rule for JSD comparisons. See the documentation of idpet.comparison.get_num_comparison_bins for more information. In IDPET, JSD-based scores are computed by discretizing data into histograms. The number of bins can influence the results, so users can either set this parameter manually or rely on the bins="auto" option. With this setting, IDPET automatically determines the number of bins using the square-root rule (https://en.wikipedia.org/wiki/Histogram#Square-root_choice), based on the smallest ensemble size. If the smallest ensemble contains more than 2,500 conformers, the number of bins is capped at 50. A bin count greater than 10 generally provides sufficient resolution to discriminate interatomic distance or torsion angle distributions in JSD-based analyses of IDPs. Increasing the number of bins beyond 10 can improve resolution and discriminative power, but also requires larger ensemble sizes to avoid sparse histograms. For most protein ensembles, values above 50 rarely improve the resolution of the analysis, which is why the maximum bin number of the "auto" option is capped at this value.
random_seed: int, optional: Random seed used when performing bootstrapping.
verbose: bool, optional: If True, some information about the comparisons will be printed to stdout.

Returns

results: dict

A dictionary containing the following key-value pairs:

scores: a (M, M, B) NumPy array storing the comparison scores, where M is the number of ensembles being compared and B is the number of bootstrap iterations (B will be 1 if bootstrapping was not performed).
p_values: a (M, M) NumPy array storing the p-values obtained in the statistical test performed when using a bootstrapping strategy (see the bootstrap_iters argument). Returned only when bootstrapping is performed.
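
The bootstrap procedure described for bootstrap_iters can be sketched with the standard library alone. This is an illustrative sketch, not IDPET's implementation: the helper names bootstrap_indices, reference_intra_scores and mann_whitney_u are hypothetical, and only the U statistic is computed (a p-value could be obtained with, for example, scipy.stats.mannwhitneyu(inter, intra, alternative="greater")).

```python
import random
from statistics import mean


def bootstrap_indices(n_conformers, frac=1.0, replace=True, rng=random):
    """Draw conformer indices for one bootstrap sample (hypothetical helper)."""
    k = int(n_conformers * frac)  # sample size, as controlled by bootstrap_frac
    if replace:
        return [rng.randrange(n_conformers) for _ in range(k)]
    return rng.sample(range(n_conformers), k)


def reference_intra_scores(intra_i, intra_j):
    """Pick the auto-comparison scores with the higher mean as the reference."""
    return intra_i if mean(intra_i) >= mean(intra_j) else intra_j


def mann_whitney_u(inter, intra):
    """U statistic for 'inter > intra': count winning pairs, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in inter
        for b in intra
    )


rng = random.Random(0)
sample = bootstrap_indices(100, frac=0.5, replace=False, rng=rng)
reference = reference_intra_scores([0.01, 0.02], [0.03, 0.04])
u_stat = mann_whitney_u([0.30, 0.40], reference)
```

A U statistic equal to len(inter) * len(intra) means that every inter-ensemble score exceeds every reference intra-ensemble score.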

idpet.comparison.calc_freqs(x, bins)[source]

idpet.comparison.calc_jsd(p_h, q_h)[source]: Calculates JSD between distribution p and q. p_h: histogram frequencies for sample p. q_h: histogram frequencies for sample q.

idpet.comparison.calc_kld_for_jsd(x_h, m_h)[source]: Calculates KLD between distribution x and m. x_h: histogram frequencies for sample p or q. m_h: histogram frequencies for m = 0.5*(p+q).
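
As a worked illustration of how the two functions above relate, the JSD is the average of two KL divergences taken against the mixture m = 0.5*(p+q). The following is a minimal dependency-free sketch under that definition; the function names kld_for_jsd and jsd are illustrative and the code is not IDPET's source.

```python
import math


def kld_for_jsd(x_h, m_h):
    """KL divergence KL(x || m); empty bins of x contribute nothing."""
    return sum(x * math.log(x / m) for x, m in zip(x_h, m_h) if x > 0)


def jsd(p_h, q_h):
    """Jensen-Shannon divergence between two vectors of histogram frequencies."""
    m_h = [0.5 * (p + q) for p, q in zip(p_h, q_h)]  # mixture distribution
    return 0.5 * kld_for_jsd(p_h, m_h) + 0.5 * kld_for_jsd(q_h, m_h)


same = jsd([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = jsd([1.0, 0.0], [0.0, 1.0])  # no common support
```

Identical histograms give the minimum score of 0, while histograms with no overlapping bins give the maximum of log(2).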

idpet.comparison.check_feature_matrices(func)[source]

idpet.comparison.confidence_interval(theta_boot, theta_hat=None, confidence_level=0.95, method='percentile')[source]: Returns bootstrap confidence intervals. Adapted from: https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_resampling.py
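
For orientation, a percentile bootstrap interval takes the alpha/2 and 1 - alpha/2 percentiles of the bootstrap estimates, where alpha = 1 - confidence_level. The sketch below shows that idea with the standard library; the helpers percentile and percentile_ci are hypothetical, and the linear interpolation between ranks is an assumption rather than a statement about IDPET's code.

```python
def percentile(sorted_vals, q):
    """Percentile q (0-100) with linear interpolation between closest ranks."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def percentile_ci(theta_boot, confidence_level=0.95):
    """Percentile bootstrap confidence interval of the bootstrap estimates."""
    alpha = (1.0 - confidence_level) / 2.0
    vals = sorted(theta_boot)
    return percentile(vals, 100 * alpha), percentile(vals, 100 * (1 - alpha))


# 101 evenly spaced bootstrap estimates: 0, 1, ..., 100.
low, high = percentile_ci(list(range(101)), confidence_level=0.90)
```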

idpet.comparison.get_adaJSD_matrix(ens_1, ens_2, bins='auto', return_bins=False, featurization_params={}, *args, **kwargs)[source]

Utility function to calculate the adaJSD score between two ensembles and return a matrix with JSD scores for each pair of Ca-Ca distances.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]: Two Ensemble objects storing the ensemble data to compare.
return_bins: bool, optional: If True, also return the histogram bin edges used in the comparison.
**remaining: Additional arguments passed to idpet.comparison.score_adaJSD.

Returns

score: float: The overall adaJSD score between the two ensembles.
jsd_matrix: np.ndarray of shape (N, N): Matrix containing JSD scores for each Ca-Ca distance pair, where N is the number of residues.
bin_edges: np.ndarray, optional: Returned only if return_bins=True. The bin edges used in histogram comparisons.

idpet.comparison.get_ataJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]

Utility function to calculate the ataJSD score between two ensembles and return a profile with JSD scores for each alpha angle in the proteins.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]: Two Ensemble objects storing the ensemble data to compare.
return_bins: bool, optional: If True, also return the histogram bin edges used in the comparison.
**remaining: Additional arguments passed to idpet.comparison.score_ataJSD.

Returns

score: float: The overall ataJSD score between the two ensembles.
jsd_profile: np.ndarray of shape (N - 3,): JSD scores for individual α backbone angles, where N is the number of residues in the protein.
bin_edges: np.ndarray, optional: Returned only if return_bins=True. The bin edges used in histogram comparisons.
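
To make the (N - 3,) profile shape concrete: an alpha torsion angle is the dihedral formed by four consecutive Ca atoms, so a chain of N residues yields N - 3 angles. The sketch below computes them for a single conformation with the standard atan2 dihedral formula. The helper names and the sign convention are assumptions for illustration and may differ from IDPET's implementation.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in radians, in (-pi, pi], defined by four points."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)  # normals of the two planes
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    x = dot(n1, n2)
    return math.atan2(y, x)


def alpha_angles(ca_xyz):
    """One torsion angle for every run of four consecutive Ca atoms."""
    return [dihedral(*ca_xyz[i:i + 4]) for i in range(len(ca_xyz) - 3)]


# Toy chain of 5 Ca atoms: expect 5 - 3 = 2 angles.
ca = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
      (0.0, 1.0, 1.0), (0.0, 1.0, 2.0)]
angles = alpha_angles(ca)
```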

idpet.comparison.get_num_comparison_bins(bins, x=None)[source]

Get the number of bins to be used in a comparison between two ensembles using a histogram-based score (such as a JSD approximation).

Parameters

bins: Union[str, int]

Determines the number of bins to be used. When providing an int, the same value will simply be returned. When providing a string, the following rules to determine the bin value will be applied:

auto: applies sqrt if the size of the smallest ensemble is < idpet.comparison.min_samples_auto_hist. If it is >= this value, returns idpet.comparison.num_default_bins.
sqrt: applies the square root rule for determining the bin number using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Square-root_choice).
sturges: applies Sturges' formula for determining the bin number using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Sturges’s_formula).

x: List[np.ndarray], optional

List of M feature matrices (one for each ensemble) of shape (N_i, *). N_i values are the number of structures in each ensemble. The minimum N_i will be used to apply the bin assignment rule when the bins argument is a string.

Returns

num_bins: int: Number of bins.
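
The bin assignment rules can be sketched as follows. The constants 2500 and 50 are assumptions taken from the description of the bins argument of all_vs_all_comparison, the rounding (ceiling) is also an assumption, and num_comparison_bins is a hypothetical stand-in that takes ensemble sizes directly instead of feature matrices.

```python
import math

MIN_SAMPLES_AUTO_HIST = 2500  # assumed threshold for the "auto" rule
NUM_DEFAULT_BINS = 50         # assumed cap for the "auto" rule


def sqrt_rule(n):
    """Square-root choice: roughly sqrt(n) bins."""
    return math.ceil(math.sqrt(n))


def sturges_rule(n):
    """Sturges' formula: ceil(log2(n)) + 1 bins."""
    return math.ceil(math.log2(n)) + 1


def num_comparison_bins(bins, sizes=None):
    """Resolve an int or a rule name into a number of bins."""
    if isinstance(bins, int):
        return bins  # integers are returned unchanged
    n_min = min(sizes)  # the smallest ensemble drives the rule
    if bins == "auto":
        if n_min < MIN_SAMPLES_AUTO_HIST:
            return sqrt_rule(n_min)
        return NUM_DEFAULT_BINS
    if bins == "sqrt":
        return sqrt_rule(n_min)
    if bins == "sturges":
        return sturges_rule(n_min)
    raise ValueError(f"Unknown bin rule: {bins!r}")


small = num_comparison_bins("auto", sizes=[400, 1000])
large = num_comparison_bins("auto", sizes=[5000, 3000])
```

With these assumptions, a smallest ensemble of 400 conformers gets 20 bins, while ensembles of 2,500 or more conformers all get 50.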

idpet.comparison.get_ramaJSD_profile(ens_1, ens_2, bins='auto', return_bins=False, *args, **kwargs)[source]

Utility function to calculate the ramaJSD score between two ensembles and return a profile with JSD scores for the Ramachandran plots of each pair of corresponding residues in the proteins.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]: Two Ensemble objects storing the ensemble data to compare.
return_bins: bool, optional: If True, also return the histogram bin edges used in the comparison.
**remaining: Additional arguments passed to idpet.comparison.score_ramaJSD.

Returns

score: float: The overall ramaJSD score between the two ensembles.
jsd_profile: np.ndarray of shape (N - 2,): JSD scores for the Ramachandran distribution of each residue, where N is the number of residues in the protein.
bin_edges: np.ndarray, optional: Returned only if return_bins=True. The bin edges used in histogram comparisons.

idpet.comparison.percentile_func(a, q)[source]

idpet.comparison.process_all_vs_all_output(comparison_out, confidence_level=0.95)[source]: Takes as input a dictionary produced as output of the all_vs_all_comparison function. If a bootstrap analysis was performed in all_vs_all_comparison, this function will assign bootstrap confidence intervals.

idpet.comparison.score_adaJSD(ens_1, ens_2, bins='auto', return_bins=False, return_scores=False, featurization_params={}, *args, **kwargs)[source]

Utility function to calculate the adaJSD (carbon Alfa Distance Average JSD) score between two ensembles. The score evaluates the divergence between distributions of Ca-Ca distances of the ensembles.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]: Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
bins: Union[str, int], optional: Determines the number of bins to be used when constructing histograms. See idpet.comparison.get_num_comparison_bins for more information. See also idpet.comparison.all_vs_all_comparison().
return_bins: bool, optional: If True, returns the number of bins used in the calculation.
return_scores: bool, optional: If True, returns a tuple with (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
featurization_params: dict, optional: Optional dictionary to customize the featurization process to calculate Ca-Ca distances. See the Ensemble.get_features function for more information.

Returns

avg_score: float: The average JSD score across the F features.

If return_scores=True:

(avg_score, all_scores): Tuple[float, np.ndarray]: The average score and an array of JSD scores of shape (F,).

If return_bins=True:

(avg_score, num_bins): Tuple[float, int]: The average score and the number of bins used.

If both return_scores and return_bins are True:

((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]: The average score, array of per-feature scores, and number of bins used.

idpet.comparison.score_ataJSD(ens_1, ens_2, bins, return_bins=False, return_scores=False, *args, **kwargs)[source]

Utility function to calculate the ataJSD (Alpha Torsion Average JSD) score between two ensembles. The score evaluates the divergence between distributions of alpha torsion angles (the angles formed by four consecutive Ca atoms in a protein) of the ensembles.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]: Two Ensemble objects storing the ensemble data to compare.

bins: Union[str, int], optional: See idpet.comparison.all_vs_all_comparison().

Returns

avg_score: float: The average JSD score across the F features.

If return_scores=True:

(avg_score, all_scores): Tuple[float, np.ndarray]: The average score and an array of JSD scores of shape (F,).

If return_bins=True:

(avg_score, num_bins): Tuple[float, int]: The average score and the number of bins used.

If both return_scores and return_bins are True:

((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]: The average score, array of per-feature scores, and number of bins used.

idpet.comparison.score_avg_2d_angle_jsd(array_1, array_2, bins, return_scores=False, return_bins=False, *args, **kwargs)[source]

Takes as input two (*, F, 2) two-dimensional feature matrices and computes an average JSD score over all F two-dimensional features by discretizing them in 2D histograms. The features in this function are expected to be angles whose values range from -math.pi to math.pi. For example, in the score_ramaJSD function the F features represent the phi-psi values of F residues in a protein of length L=F+2 (the first and last residues don’t have both phi and psi values).

Parameters

array_1, array_2: np.ndarray: NumPy arrays of shape (*, F, 2) containing samples from F two-dimensional distributions to be compared.
bins: Union[int, str], optional: Determines the number of bins to be used when constructing histograms. See idpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be -math.pi to math.pi. Note that the effective number of bins used in the function will be the square of the number returned by idpet.comparison.get_num_comparison_bins, since a 2D histogram is built.
return_bins: bool, optional: If True, returns the square root of the effective number of bins used in the calculation.

Returns

results: Union[float, Tuple[float, np.ndarray]]: If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (same distribution) to log(2) (no common support). If return_bins is True, returns a tuple with the JSD score and the number of bins. If return_scores is True, it will also return the F scores used to compute the average JSD score.
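
To illustrate why the effective number of bins is the square of the bins value, the sketch below discretizes (phi, psi) pairs for one residue on a bins x bins grid spanning -math.pi to math.pi on both axes. The helper hist2d_freqs is hypothetical and only shows the discretization step, not the scoring.

```python
import math


def hist2d_freqs(angle_pairs, bins):
    """Flattened 2D histogram frequencies for (phi, psi) pairs in radians."""
    width = 2 * math.pi / bins
    counts = [[0] * bins for _ in range(bins)]
    for phi, psi in angle_pairs:
        # Shift to [0, 2*pi]; a value of exactly pi falls in the last bin.
        i = min(int((phi + math.pi) / width), bins - 1)
        j = min(int((psi + math.pi) / width), bins - 1)
        counts[i][j] += 1
    n = len(angle_pairs)
    return [c / n for row in counts for c in row]


pairs = [(-3.0, -3.0), (3.0, 3.0), (0.1, 0.1), (0.2, 0.2)]
freqs = hist2d_freqs(pairs, bins=4)  # 4 bins per axis, 16 cells in total
```

Because the number of cells grows quadratically, a 2D histogram needs considerably more conformations than a 1D histogram to avoid sparsely populated cells.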

idpet.comparison.score_avg_jsd(m1, m2, *args, **kwargs)[source]

idpet.comparison.score_histogram_jsd(p_data, q_data, limits, bins='auto', return_bins=False)[source]

Scores an approximation of the Jensen-Shannon divergence by discretizing into histograms the values of two 1D samples provided as input.

Return type:: Union[float, Tuple[float, ndarray]]

Parameters

p_data, q_data: np.ndarray

NumPy arrays of shape (*, ) containing samples from the two one-dimensional distributions to be compared.

limits: Union[str, Tuple[int]]

Define the method to calculate the minimum and maximum values of the range spanned by the bins. Accepted values are:

“m”: will use the minimum and maximum values observed by concatenating samples in p_data and q_data.

“p”: will use the minimum and maximum values observed in the samples of p_data. If q_data contains values outside that range, new bins of the same size will be added to cover all values of q. Currently, this is not used in any IDPET functionality. Note that the bins argument will determine only the bins originally spanned by p_data.

“a”: limits for scoring angular features. Will use a (-math.pi, math.pi) range for scoring such features.

(float, float): provide a custom range. Currently not used in any IDPET functionality.

bins: Union[int, str], optional

Determines the number of bins to be used when constructing histograms. See idpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be defined by the limits argument.

return_bins: bool, optional

If True, returns the bins used in the calculation.

Returns

results: Union[float, Tuple[float, np.ndarray]]: If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (same distribution) to log(2) (no common support). If return_bins is True, returns a tuple with the JSD score and the number of bins.

idpet.comparison.score_ramaJSD(ens_1, ens_2, bins, return_scores=False, return_bins=False)[source]

Utility function to calculate the ramaJSD (Ramachandran plot average JSD) score between two ensembles. The score evaluates the divergence between distributions of phi-psi torsion angles of every residue in the ensembles.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]: Two Ensemble objects storing the ensemble data to compare.

bins: Union[int, str], optional: See idpet.comparison.all_vs_all_comparison().

Returns

avg_score: float: The average JSD score across the F features.

If return_scores=True:

(avg_score, all_scores): Tuple[float, np.ndarray]: The average score and an array of JSD scores of shape (F,).

If return_bins=True:

(avg_score, num_bins): Tuple[float, int]: The average score and the number of bins used.

If both return_scores and return_bins are True:

((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]: The average score, array of per-feature scores, and number of bins used.

idpet.comparison.sqrt_rule(n)[source]

idpet.comparison.sturges_rule(n)[source]

idpet.coord module

idpet.coord.calc_chain_dihedrals(xyz, norm=False)[source]

idpet.coord.contact_probability_map(traj, scheme='ca', contact='all', threshold=0.8)[source]

idpet.coord.create_consecutive_indices_matrix(ca_indices)[source]: Takes the CA atom indices, an array of shape (L,), and builds all possible sets of 4 consecutive indices, returning an array of shape (L-3, 4).
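
The (L,) to (L-3, 4) mapping amounts to a sliding window of length 4 over the CA indices. A minimal sketch with plain lists follows; the function name consecutive_quadruplets is illustrative, and the actual function presumably returns an array rather than nested lists.

```python
def consecutive_quadruplets(ca_indices):
    """All windows of 4 consecutive entries: L indices give L - 3 rows."""
    return [list(ca_indices[i:i + 4]) for i in range(len(ca_indices) - 3)]


# Atom indices of the CA atoms of a 5-residue chain (made-up values).
rows = consecutive_quadruplets([4, 10, 16, 22, 28])
```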

idpet.coord.dict_phi_psi_normal_cases(dict_phi_psi)[source]

idpet.coord.get_contact_map(dmap, threshold=0.8, pseudo_count=0.01)[source]: Gets a trajectory of distance maps with shape (N, L, L) and returns a (L, L) contact probability map.

idpet.coord.get_distance_matrix(xyz)[source]: Gets an ensemble of xyz conformations with shape (N, L, 3) and returns the corresponding distance matrices with shape (N, L, L).
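
The relation between the two shapes above, (N, L, 3) coordinates to (N, L, L) distance maps to an (L, L) contact probability map, can be sketched with nested lists. This is an illustration only: the function names are hypothetical, the use of <= for the threshold is an assumption, and the pseudo_count argument of get_contact_map is deliberately left out because its exact role is not described here.

```python
import math


def distance_matrices(xyz):
    """Pairwise distances per conformation: (N, L, 3) -> (N, L, L)."""
    return [[[math.dist(a, b) for b in conf] for a in conf] for conf in xyz]


def contact_probability(dmaps, threshold=0.8):
    """Fraction of conformations in which each pair is within the threshold."""
    n, length = len(dmaps), len(dmaps[0])
    return [
        [sum(d[i][j] <= threshold for d in dmaps) / n for j in range(length)]
        for i in range(length)
    ]


# Two conformations of a 2-atom chain: in contact in one, apart in the other.
xyz = [
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
dmaps = distance_matrices(xyz)
cmap = contact_probability(dmaps)
```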

idpet.coord.site_specific_order_parameter(ca_xyz_dict)[source]

Return type:: dict

Computes site-specific order parameters for a set of protein conformations.

Parameters

ca_xyz_dict: dict: A dictionary where keys represent unique identifiers for proteins, and values are 3D arrays containing the coordinates of alpha-carbon (CA) atoms for different conformations of the protein.

Returns

dict: A dictionary where keys are the same protein identifiers provided in ca_xyz_dict, and values are one-dimensional arrays containing the site-specific order parameters computed for each residue of the protein.

idpet.coord.split_dictionary_phipsiangles(features_dict)[source]

idpet.coord.ss_measure_disorder(features_dict)[source]: Accepts the dictionary of phi-psi arrays stored in the featurized_data attribute and returns a flexibility parameter for each residue in the ensemble. Note: this function only works with phi/psi features.

idpet.dimensionality_reduction module

class idpet.dimensionality_reduction.DimensionalityReduction[source]

Bases: ABC

abstract fit(data)[source]

Fit the dimensionality reduction model to the data.

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Notes

This method fits the dimensionality reduction model to the input data.

abstract fit_transform(data)[source]

Fit the dimensionality reduction model to the data and then transform it.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method fits the dimensionality reduction model to the input data and then transforms it.

abstract transform(data)[source]

Transform the input data using the fitted dimensionality reduction model.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features) to be transformed.

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method transforms the input data using the fitted dimensionality reduction model.

class idpet.dimensionality_reduction.DimensionalityReductionFactory[source]

Bases: object

Factory class for creating instances of various dimensionality reduction algorithms.

Methods

get_reducer(method, *args, **kwargs): Get an instance of the specified dimensionality reduction algorithm.

static get_reducer(method, *args, **kwargs)[source]

Get an instance of the specified dimensionality reduction algorithm.

Return type:: DimensionalityReduction

Parameters

method: str: Name of the dimensionality reduction method.
*args: Positional arguments to pass to the constructor of the selected method.
**kwargs: Keyword arguments to pass to the constructor of the selected method.

Returns

DimensionalityReduction: Instance of the specified dimensionality reduction algorithm.
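
The combination of an abstract base class and a static factory method can be sketched generically as below. Every name in this sketch (Reducer, FirstColumns, ReducerFactory, the "first" method key) is hypothetical and chosen for illustration; the sketch makes no claim about which method names IDPET's factory accepts or how it resolves them.

```python
from abc import ABC, abstractmethod


class Reducer(ABC):
    """Interface mirroring fit / transform / fit_transform."""

    @abstractmethod
    def fit(self, data): ...

    @abstractmethod
    def transform(self, data): ...

    @abstractmethod
    def fit_transform(self, data): ...


class FirstColumns(Reducer):
    """Toy reducer that simply keeps the first n_components features."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, data):
        self.n_features_ = len(data[0])  # remember the input width
        return self

    def transform(self, data):
        return [row[:self.n_components] for row in data]

    def fit_transform(self, data):
        return self.fit(data).transform(data)


class ReducerFactory:
    """Maps a method name to a reducer class and instantiates it."""

    _registry = {"first": FirstColumns}

    @staticmethod
    def get_reducer(method, *args, **kwargs):
        if method not in ReducerFactory._registry:
            raise ValueError(f"Unsupported method: {method!r}")
        return ReducerFactory._registry[method](*args, **kwargs)


reducer = ReducerFactory.get_reducer("first", n_components=2)
reduced = reducer.fit_transform([[1, 2, 3], [4, 5, 6]])
```

Callers depend only on the shared interface, so a new algorithm can be supported by registering one more class.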

class idpet.dimensionality_reduction.KPCAReduction(circular=False, n_components=10, kernel='poly', gamma=None)[source]

Bases: DimensionalityReduction

Class for performing dimensionality reduction using Kernel PCA (KPCA) algorithm.

Parameters

circular: bool, optional: Whether to use circular metrics for angular features. Default is False. If True, it will override the kernel argument.
n_components: int, optional: Number of dimensions for the reduced space. Default is 10.
kernel: str, optional: Kernel used for PCA, as in the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html. Default is 'poly'.
gamma: float, optional: Kernel coefficient. Default is None.

fit(data)[source]

Fit the dimensionality reduction model to the data.

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Notes

This method fits the dimensionality reduction model to the input data.

fit_transform(data)[source]

Fit the dimensionality reduction model to the data and then transform it.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method fits the dimensionality reduction model to the input data and then transforms it.

transform(data)[source]

Transform the input data using the fitted dimensionality reduction model.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features) to be transformed.

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method transforms the input data using the fitted dimensionality reduction model.

class idpet.dimensionality_reduction.PCAReduction(n_components=10)[source]

Bases: DimensionalityReduction

Principal Component Analysis (PCA) for dimensionality reduction.

Parameters

n_components: int, optional: Number of components to keep. Default is 10.

fit(data)[source]

Fit the dimensionality reduction model to the data.

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Notes

This method fits the dimensionality reduction model to the input data.

fit_transform(data)[source]

Fit the dimensionality reduction model to the data and then transform it.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method fits the dimensionality reduction model to the input data and then transforms it.

transform(data)[source]

Transform the input data using the fitted dimensionality reduction model.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features) to be transformed.

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method transforms the input data using the fitted dimensionality reduction model.

class idpet.dimensionality_reduction.TSNEReduction(perplexity_vals=[30], metric='euclidean', circular=False, n_components=2, learning_rate='auto', range_n_clusters=range(2, 10), random_state=None)[source]

Bases: DimensionalityReduction

Class for performing dimensionality reduction using t-SNE algorithm.

Parameters

perplexity_vals: List[float], optional: The perplexity is related to the number of nearest neighbors that are used in the manifold learning. It can be interpreted as a smooth measure of the effective number of neighbors for each point. Typical values range from 5 to 50. Choosing a value too small may make the data appear too clustered, while a value too large may cause different clusters to merge. See also https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html. Default is [30].
metric: str, optional: Metric to use. Default is “euclidean”.
circular: bool, optional: Whether to use circular metrics. Default is False.
n_components: int, optional: Number of dimensions of the embedded space. Default is 2.
learning_rate: Union[float, str], optional: Learning rate. Default is 'auto'.
range_n_clusters: List[int], optional: Range of cluster values. Default is range(2, 10, 1). Highly disordered ensembles typically do not form more than ~10 distinct, visually separable clusters, so exploring more than 10 clusters is usually unnecessary. Users can still modify this parameter based on their specific datasets and research questions.
random_state: int, optional: Random seed for sklearn.

fit(data)[source]

Fit the dimensionality reduction model to the data.

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Notes

This method fits the dimensionality reduction model to the input data.

fit_transform(data)[source]

Fit the dimensionality reduction model to the data and then transform it.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method fits the dimensionality reduction model to the input data and then transforms it.

transform(data)[source]

Transform the input data using the fitted dimensionality reduction model.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features) to be transformed.

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method transforms the input data using the fitted dimensionality reduction model.

class idpet.dimensionality_reduction.UMAPReduction(n_components=2, n_neighbors=[15], circular=False, min_dist=0.1, metric='euclidean', range_n_clusters=range(2, 10), random_state=None)[source]

Bases: DimensionalityReduction

Class for performing dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) algorithm.

Parameters

n_components: int, optional: Number of dimensions for the reduced space. Default is 2.
n_neighbors: List[int], optional: Number of neighbors to consider for each point in the input data. Default is [15].
min_dist: float, optional: The minimum distance between embedded points. Default is 0.1.
metric: str, optional: The metric to use for distance calculation. Default is ‘euclidean’.
range_n_clusters: range or List, optional: Highly disordered ensembles typically do not form more than ~10 distinct, visually separable clusters, so exploring more than 10 clusters is usually unnecessary. Users can still modify this parameter based on their specific datasets and research questions. Default is range(2, 10, 1).
random_state: int, optional: Random state of the UMAP implementation.

cluster(embedding, n_neighbor)[source]

Perform clustering using KMeans algorithm for each number of clusters in the specified range.

Return type:: List[Tuple]

Returns

List[Tuple]: A list of tuples containing the number of clusters and the corresponding silhouette score for each clustering result.

fit(data)[source]

Fit the dimensionality reduction model to the data.

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Notes

This method fits the dimensionality reduction model to the input data.

fit_transform(data)[source]

Fit the dimensionality reduction model to the data and then transform it.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features).

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method fits the dimensionality reduction model to the input data and then transforms it.

transform(data)[source]

Transform the input data using the fitted dimensionality reduction model.

Return type:: ndarray

Parameters

data: np.ndarray: The input data array of shape (n_samples, n_features) to be transformed.

Returns

np.ndarray: The transformed data array of shape (n_samples, n_components).

Notes

This method transforms the input data using the fitted dimensionality reduction model.

idpet.dimensionality_reduction.unit_vector_distance(a0, a1, sqrt=True)[source]: Compute the sum of distances between two (*, N) arrays storing the values of N angles.

idpet.dimensionality_reduction.unit_vector_kernel(a1, a2, gamma)[source]: Compute unit vector kernel.

idpet.dimensionality_reduction.unit_vectorize(a)[source]

Convert an array with (*, N) angles in an array with (*, N, 2) sine and cosine values for the N angles.

Return type:: ndarray
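
The reason for mapping angles to unit vectors is that plain differences break at the periodic boundary: -pi and pi are the same direction but differ numerically by 2*pi. The sketch below shows the idea with plain lists. The (sine, cosine) ordering and the use of a summed Euclidean distance are assumptions for illustration and may not match the exact behavior of unit_vector_distance.

```python
import math


def unit_vectorize(angles):
    """Map N angles (radians) to N (sine, cosine) pairs on the unit circle."""
    return [(math.sin(a), math.cos(a)) for a in angles]


def unit_vector_distance(a0, a1):
    """Sum over angles of the distance between corresponding unit vectors."""
    v0, v1 = unit_vectorize(a0), unit_vectorize(a1)
    return sum(math.dist(u, w) for u, w in zip(v0, v1))


# Two angles on either side of the periodic boundary are close...
near = unit_vector_distance([math.pi - 0.01], [-math.pi + 0.01])
# ...while opposite directions are at the maximum distance of 2.
far = unit_vector_distance([0.0], [math.pi])
```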

idpet.ensemble module

class idpet.ensemble.Ensemble(code, data_path=None, top_path=None, database=None, chain_id=None, residue_range=None, fix_pbc=False)[source]

Bases: object

Represents a molecular dynamics ensemble.

Parameters

code: str: The code identifier of the ensemble.
data_path: str, optional: The path to the data file associated with the ensemble. It can be a path to a multi-model PDB file, a path to a folder containing one PDB file per model, or a path to a .xtc or .dcd trajectory file. Default is None.
top_path: str, optional: The path to the topology file associated with the ensemble. Used when data_path is a trajectory file. Default is None.
database: str, optional: The database from which to download the ensemble. Options are ‘ped’ and ‘atlas’. Default is None.
chain_id: str, optional: Chain identifier used to select a single chain to analyze in case multiple chains are loaded. Default is None.
residue_range: Tuple, optional: A tuple indicating the start and end of the residue range (inclusive), using 1-based indexing. Default is None.
fix_pbc: bool, optional: If True, IDPET automatically removes discontinuities arising from periodic boundary conditions using MDTraj’s built-in functions. Only takes effect when ‘data_path’ is a trajectory file.

Notes

If the database is ‘atlas’, the ensemble code should be provided as a PDB ID with a chain identifier separated by an underscore. Example: ‘3a1g_B’.
If the database is ‘ped’, the ensemble code should be in the PED ID format, which consists of a string starting with ‘PED’ followed by a numeric identifier, and ‘e’ followed by another numeric identifier. Example: ‘PED00423e001’.
The residue_range parameter uses 1-based indexing, meaning the first residue is indexed as 1.
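
The conventions in these notes can be checked with a few lines of standard-library Python. The regular expressions below are assumptions derived from the two examples given (‘PED00423e001’ and ‘3a1g_B’) and are not IDPET's own validation logic. The helper names are hypothetical.

```python
import re

# Assumed patterns, inferred from the examples in the notes above.
PED_CODE = re.compile(r"PED\d+e\d+")
ATLAS_CODE = re.compile(r"[0-9A-Za-z]{4}_[0-9A-Za-z]+")


def matches_database_format(code, database):
    """Check whether a code looks valid for the 'ped' or 'atlas' database."""
    pattern = {"ped": PED_CODE, "atlas": ATLAS_CODE}[database]
    return pattern.fullmatch(code) is not None


def residue_range_to_slice(residue_range):
    """Convert a 1-based inclusive (start, end) range to a 0-based slice."""
    start, end = residue_range
    return slice(start - 1, end)


residues = list(range(1, 11))  # residue numbers 1..10
selected = residues[residue_range_to_slice((2, 4))]
```

Since the range is inclusive on both ends, residue_range=(2, 4) selects three residues: 2, 3 and 4.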

extract_features(featurization, *args, **kwargs)[source]

Extract features from the trajectory using the specified featurization method.

Parameters

featurization: str: The method to use for feature extraction. Supported options: ‘ca_dist’, ‘phi_psi’, ‘a_angle’, ‘tr_omega’, ‘tr_phi’, and ‘ca_phi_psi’.
min_sep: int, optional: The minimum sequence separation for angle calculations. Required for certain featurization methods.
max_sep: int, optional: The maximum sequence separation for angle calculations. Required for certain featurization methods.

Notes

This method extracts features from the trajectory using the specified featurization method and updates the ensemble’s features attribute.

get_chains_from_pdb()[source]

Extracts unique chain IDs from a PDB file.

Raises

FileNotFoundError: If the specified PDB file or directory does not exist, or if no PDB file is found in the directory.
ValueError: If the specified file is not a PDB file and the path is not a directory.

get_features(featurization, normalize=False, *args, **kwargs)[source]

Get features from the trajectory using the specified featurization method.

Return type:: Sequence

Parameters

featurization: str: The method to use for feature extraction. Supported options: ‘ca_dist’, ‘phi_psi’, ‘a_angle’, ‘tr_omega’, ‘tr_phi’, ‘rg’, ‘prolateness’, ‘asphericity’, ‘sasa’, ‘end_to_end’.
min_sep: int: The minimum sequence separation for angle calculations.
max_sep: int: The maximum sequence separation for angle calculations.

Returns

features: Sequence: The extracted features.

Notes

This method extracts features from the trajectory using the specified featurization method.

get_num_residues()[source]

get_size()[source]

Return the number of conformations in an ensemble, if data has been loaded.

Return type:: int

load_trajectory(output_dir=None)[source]

Load a trajectory for the ensemble.

Parameters

output_dir: str, optional: The directory where the trajectory data is located or where generated trajectory files will be saved.

Notes

This method loads a trajectory for the ensemble based on the specified data path. It supports loading from various file formats such as PDB, DCD, and XTC. If the data path points to a directory, it searches for PDB files within the directory and generates a trajectory from them. If the data path points to a single PDB file, it loads that file and generates a trajectory. If the data path points to a DCD or XTC file along with a corresponding topology file (TOP), it loads both files to construct the trajectory. Additional processing steps include checking for coarse-grained models, selecting a single chain (if applicable), and selecting residues of interest based on certain criteria.

normalize_features(mean, std)[source]

Normalize the extracted features using the provided mean and standard deviation.

Parameters

mean: float: The mean value used for normalization.
std: float: The standard deviation used for normalization.

Notes

This method normalizes the ensemble’s features using the provided mean and standard deviation.
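As a rough illustration, normalization with a given mean and standard deviation usually means the standard z-score transform. The sketch below assumes that form; the function name is hypothetical and the exact operation applied by IDPET is not stated here.

```python
def zscore_normalize(values, mean, std):
    """Assumed z-score form: subtract the mean, then divide by the std."""
    return [(v - mean) / std for v in values]

# Values centred on the supplied mean map to zero.
normalized = zscore_normalize([1.0, 2.0, 3.0], mean=2.0, std=1.0)
```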

random_sample_trajectory(sample_size)[source]

Randomly sample frames from the original trajectory.

Parameters

sample_size: int: The number of frames to sample from the original trajectory.

Notes

This method samples frames randomly from the original trajectory and updates the ensemble’s trajectory attribute.
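A minimal sketch of random frame selection follows. It assumes sampling without replacement and returns sorted frame indices; both choices, and the helper name, are assumptions made for illustration rather than documented IDPET behavior.

```python
import random

def sample_frame_indices(n_frames, sample_size, seed=None):
    """Pick `sample_size` distinct frame indices out of `n_frames`.

    Sampling without replacement and sorting are assumptions of this sketch.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_frames), sample_size))

# Select 5 of 100 frames; a fixed seed makes the selection reproducible.
indices = sample_frame_indices(100, 5, seed=0)
```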

idpet.ensemble_analysis module

class idpet.ensemble_analysis.EnsembleAnalysis(ensembles, output_dir=None)[source]

Bases: object

Data analysis pipeline for ensemble data.

Initializes with a list of ensemble objects and a directory path for storing data.

Parameters

ensembles: List[Ensemble]: List of ensembles.
output_dir: str, optional: Directory path for storing data. If not provided, a directory named ${HOME}/.idpet/data will be created.

comparison_scores(score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]

Compare all pairs of ensembles using divergence/distance scores. See idpet.comparison.all_vs_all_comparison for more information.

Return type:: Tuple[ndarray, List[str]]
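The scores documented for idpet.comparison.all_vs_all_comparison are averages of Jensen-Shannon divergences (JSD) over many feature histograms. The sketch below shows the JSD for a single pair of discrete distributions using the natural logarithm, which is what gives the stated range of 0 to log(2). It is an illustration of the formula only, not IDPET's code, and the function name is hypothetical.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero weight contribute 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution halfway between p and q.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

identical = jsd([0.5, 0.5], [0.5, 0.5])  # same histograms
disjoint = jsd([1.0, 0.0], [0.0, 1.0])   # no overlap at all
```

Identical histograms give the minimum score and non-overlapping histograms give the maximum, log(2).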

property ens_codes: List[str]

Get the ensemble codes.

Returns

List[str]: A list of ensemble codes.

execute_pipeline(featurization_params, reduce_dim_params, subsample_size=None)[source]

Execute the data analysis pipeline end-to-end. The pipeline includes:

Download from database (optional)
Generate trajectories
Randomly sample a number of conformations from trajectories (optional)
Perform feature extraction
Perform dimensionality reduction

Parameters

featurization_params: Dict: Parameters for feature extraction. The only required parameter is “featurization”, which can be “phi_psi”, “ca_dist”, “a_angle”, “tr_omega” or “tr_phi”. Other method-specific parameters are optional.
reduce_dim_params: Dict: Parameters for dimensionality reduction. The only required parameter is “method”, which can be “pca”, “tsne” or “kpca”.
subsample_size: int, optional: Optional parameter that specifies the trajectory subsample size. Default is None.
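The two parameter dictionaries can be sketched as follows. The allowed values are copied from the parameter descriptions above; the checking function is a hypothetical helper written for this example and is not part of IDPET.

```python
# Allowed values as listed in the parameter descriptions above.
ALLOWED_FEATURIZATIONS = {"phi_psi", "ca_dist", "a_angle", "tr_omega", "tr_phi"}
ALLOWED_METHODS = {"pca", "tsne", "kpca"}

def check_pipeline_params(featurization_params, reduce_dim_params):
    """Hypothetical pre-flight check for the two required keys."""
    if featurization_params.get("featurization") not in ALLOWED_FEATURIZATIONS:
        raise ValueError("unsupported or missing 'featurization'")
    if reduce_dim_params.get("method") not in ALLOWED_METHODS:
        raise ValueError("unsupported or missing 'method'")

# Minimal dictionaries containing only the required keys.
featurization_params = {"featurization": "ca_dist"}
reduce_dim_params = {"method": "pca"}
check_pipeline_params(featurization_params, reduce_dim_params)
```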

exists_coarse_grained()[source]

Check if at least one of the loaded ensembles is coarse-grained after loading trajectories.

Return type:: bool

Returns

bool: True if at least one ensemble is coarse-grained, False otherwise.

extract_features(featurization, normalize=False, *args, **kwargs)[source]

Extract the selected feature.

Return type:: Dict[str, ndarray]

Parameters

featurization: str: Choose between “phi_psi”, “ca_dist”, “a_angle”, “tr_omega”, “tr_phi”, “rmsd”.
normalize: bool, optional: Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.
min_sep: int or None, optional: Minimum separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
max_sep: int, optional: Maximum separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.

Returns

Dict[str, np.ndarray]: A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.

property features: Dict[str, ndarray]

Get the features associated with each ensemble.

Returns

Dict[str, np.ndarray]: A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.

get_features(featurization, normalize=False, *args, **kwargs)[source]

Extract features for each ensemble without modifying any fields in the EnsembleAnalysis class.

Return type:: Dict[str, ndarray]

Parameters:

featurization: str: The type of featurization to be applied. Supported options are “phi_psi”, “tr_omega”, “tr_phi”, “ca_dist”, “a_angle”, “rg”, “prolateness”, “asphericity”, “sasa”, “end_to_end” and “flory_exponent”.
min_sep: int, optional: Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
max_sep: int or None, optional: Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.
normalize: bool, optional: Whether to normalize the extracted features. Normalization is only supported when featurization is “ca_dist”. Default is False.

Returns:

Dict[str, np.ndarray]: A dictionary containing the extracted features for each ensemble, where the keys are ensemble IDs and the values are NumPy arrays containing the features.

Raises:

ValueError: If featurization is not supported; if normalization is requested for a featurization method other than “ca_dist”; if normalization is requested and features from ensembles have different sizes; or if coarse-grained models are used with featurization methods that require atomistic detail.

get_features_summary_dataframe(selected_features=['rg', 'asphericity', 'prolateness', 'sasa', 'end_to_end', 'flory_exponent'], show_variability=True)[source]

Create a summary DataFrame for each ensemble.

The DataFrame includes the ensemble code and the average for each feature.

Return type:: DataFrame

Parameters

selected_features: List[str], optional: List of feature extraction methods to be used for summarizing the ensembles. Default is [“rg”, “asphericity”, “prolateness”, “sasa”, “end_to_end”, “flory_exponent”].
show_variability: bool, optional: If True, include a column with a measurement of variability for each feature (e.g., standard deviation or error).

Returns

pd.DataFrame: DataFrame containing the summary statistics (average and std) for each feature in each ensemble.

Raises

ValueError: If any feature in the selected_features is not a supported feature extraction method.

load_trajectories()[source]

Load trajectories for all ensembles.

This method iterates over each ensemble in the ensembles list and downloads data files if they are not already available. Trajectories are then loaded for each ensemble.

Return type:: Dict[str, Trajectory]

Returns

Dict[str, mdtraj.Trajectory]: A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.

Note

This method assumes that the output_dir attribute of the class specifies the directory where trajectory files will be saved or extracted.

random_sample_trajectories(sample_size)[source]

Sample a defined random number of conformations from the ensemble trajectory.

Parameters

sample_size: int: Number of conformations sampled from the ensemble.

property reduce_dim_data: Dict[str, ndarray]

Get the transformed data associated with each ensemble.

Returns

Dict[str, np.ndarray]: A dictionary where keys are ensemble IDs and values are the corresponding transformed (dimensionality-reduced) arrays.

reduce_features(method, fit_on=None, *args, **kwargs)[source]

Perform dimensionality reduction on the extracted features.

Return type:: ndarray

Parameters

method: str: Choose between “pca”, “tsne”, “kpca” and “umap”.
fit_on: List[str], optional: If method is “pca” or “kpca”, specifies on which ensembles the models should be fit. The model will then be used to transform all ensembles.

Additional Parameters

The following optional parameters apply based on the selected reduction method:

pca:
- n_components: int, optional
  Number of components to keep. Default is 10.
tsne:
- perplexity_vals: List[float], optional
  The perplexity is related to the number of nearest neighbors that are used in the manifold learning. It can be interpreted as a smooth measure of the effective number of neighbors for each point. Typical values range from 5 to 50. Choosing a value too small may make the data appear too clustered, while a value too large may cause different clusters to merge. See also https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html
- metric: str, optional
  Metric to use. Default is “euclidean”.
- circular: bool, optional
  Whether to use circular metrics. Default is False.
- n_components: int, optional
  Number of dimensions of the embedded space. Default is 2.
- learning_rate: float, optional
  Learning rate. Default is 100.0.
- range_n_clusters: List[int], optional
  Highly disordered ensembles typically do not form more than ~10 distinct, visually separable clusters, so exploring more than 10 clusters is usually unnecessary. Users can modify this parameter based on their specific datasets and research questions. Default is range(2, 10, 1).
- random_state: int, optional
  Random seed for sklearn.
umap:
- n_neighbors: List[int], optional
  List of number of neighbors. Default is [15].
- min_dist: float, optional
  Minimum distance between points in the embedded space. Default is 0.1.
- circular: bool, optional
  Whether to use circular metrics. Default is False.
- n_components: int, optional
  Number of dimensions of the embedded space. Default is 2.
- metric: str, optional
  Metric to use. Default is “euclidean”.
- random_state: int, optional
  Random seed for sklearn.
- range_n_clusters: List[int], optional
  Highly disordered ensembles typically do not form more than ~10 distinct, visually separable clusters, so exploring more than 10 clusters is usually unnecessary. Users can modify this parameter based on their specific datasets and research questions. Default is range(2, 10, 1).
kpca:
- circular: bool, optional
  Whether to use circular metrics. Default is False.
- n_components: int, optional
  Number of components to keep. Default is 10.
- gamma: float, optional
  Kernel coefficient. Default is None.

Returns

np.ndarray

Returns the transformed data.
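One detail of the documented defaults is easy to miss: range(2, 10, 1) excludes its stop value, so the default range_n_clusters covers 2 through 9 clusters. The sketch below expands the range and collects keyword arguments as they might be passed for method “tsne”. The perplexity values are illustrative picks within the typical 5 to 50 range; the remaining values are the documented defaults.

```python
# Documented default for range_n_clusters; the stop value 10 is excluded.
range_n_clusters = range(2, 10, 1)
cluster_counts = list(range_n_clusters)

# Keyword arguments for method="tsne". perplexity_vals is an illustrative
# choice; all other entries repeat the defaults listed above.
tsne_kwargs = {
    "perplexity_vals": [10, 30, 50],
    "metric": "euclidean",
    "circular": False,
    "n_components": 2,
    "learning_rate": 100.0,
    "range_n_clusters": range_n_clusters,
}
```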

For more information on each method, see the corresponding documentation:

property trajectories: Dict[str, Trajectory]

Get the trajectories associated with each ensemble.

Returns

Dict[str, mdtraj.Trajectory]: A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.

idpet.utils module

idpet.utils.set_verbosity(level, stream=None)[source]: Set the verbosity of IDPET.

idpet.visualization module

class idpet.visualization.Visualization(analysis)[source]

Bases: object

Visualization class for ensemble analysis.

Parameters:: analysis (EnsembleAnalysis): An instance of EnsembleAnalysis providing data for visualization.

alpha_angles(bins=50, save=False, ax=None)[source]

Alpha angles are the torsion angles formed by four consecutive Cα atoms along the protein backbone. This method calculates the alpha angles for each ensemble in the analysis and plots their distribution.

Return type:: Axes

Parameters

bins: int: The number of bins for the histogram. Default is 50.
save: bool, optional: If True, the plot will be saved as an image file. Default is False.
ax: plt.Axes, optional: The axes on which to plot. Default is None, which creates a new figure and axes.

Returns

plt.Axes: The Axes object containing the plot.
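To make the definition of an alpha angle concrete, the sketch below computes the torsion angle defined by four consecutive points using the standard atan2 formulation, in plain Python. It is a geometric illustration only: the function names are hypothetical, the sign convention is one common choice, and IDPET computes these angles from trajectories with its own code. A chain of N Cα atoms yields N - 3 alpha angles.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion(p0, p1, p2, p3):
    """Torsion angle (radians) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.atan2(_dot(_cross(b1, v), w), _dot(v, w))

def alpha_angles(ca_coords):
    """One torsion per window of four consecutive Ca positions."""
    return [torsion(*ca_coords[i:i + 4]) for i in range(len(ca_coords) - 3)]

# Planar arrangements: cis gives 0, trans gives +/- pi.
cis = torsion((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
trans = torsion((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -1.0, 0.0))
```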

asphericity(bins=50, hist_range=None, violin_plot=True, summary_stat='mean', dpi=96, save=False, color='lightblue', multiple_hist_ax=False, x_ticks_rotation=45, ax=None)[source]

Plot asphericity distribution in each ensemble. Asphericity is calculated based on the gyration tensor.

Return type:: Axes

Parameters

bins: int, optional: The number of bins for the histogram. Default is 50.
hist_range: Tuple, optional: A tuple with a min and max value for the histogram range. Default is None, which corresponds to using the min and max values across all data.
violin_plot: bool, optional: If True, a violin plot is visualized. Default is True.
summary_stat: str, optional: Specifies whether to display the “mean”, “median”, or “both” as reference lines on the plots. This applies when violin_plot is True or when multiple_hist_ax is True for histograms.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
color: str, optional: Color of the violin plot. Default is lightblue.
multiple_hist_ax: bool, optional: If True, each histogram will be plotted on separate axes. Default is False.
x_ticks_rotation: int, optional: The rotation angle of the x-axis tick labels for the violin plot. Default is 45.
ax: Union[None, plt.Axes, np.ndarray, List[plt.Axes]], optional: The axes on which to plot. Default is None, which creates a new figure and axes.

Returns

plt.Axes: The Axes object containing the plot.

comparison_matrix(score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, confidence_level=0.95, significance_level=0.05, bins=50, random_seed=None, verbose=False, ax=None, figsize=(6.0, 5.0), dpi=100, cmap='viridis_r', title=None, cbar_label=None, textcolors=('black', 'white'))[source]

Generates and visualizes the pairwise comparison matrix for the ensembles. This function computes the comparison matrix using the specified score type and feature. It then visualizes the matrix using a heatmap.

Return type:: dict

Parameters:

score, featurization_params, bootstrap_iters, bootstrap_frac, bootstrap_replace, bins, random_seed, verbose:

See the documentation of EnsembleAnalysis.comparison_scores for more information about these arguments.

ax: Union[None, plt.Axes], optional: Axes object where to plot the comparison heatmap. If None (the default value) is provided, a new Figure will be created.
figsize: Tuple[int], optional: The size of the figure for the heatmap. Default is (6.0, 5.0). Only takes effect if ax is None, that is, when a new figure is created.
dpi: int, optional: DPI of the figure for the heatmap. Default is 100. Only takes effect if ax is None, that is, when a new figure is created.

confidence_level, significance_level, cmap, title, cbar_label, textcolors:

See the documentation of idpet.visualization.plot_comparison_matrix for more information about these arguments.

Returns:

results: dict: A dictionary containing the following keys:

ax: the Axes object with the comparison matrix heatmap.
scores: comparison matrix. See EnsembleAnalysis.comparison_scores for more information.
codes: codes of the ensembles that were compared.
fig: Figure object, only returned when a new figure is created inside this function.

Notes:

The comparison matrix is annotated with the scores, and the axes are labeled with the ensemble labels.

contact_prob_maps(log_scale=True, avoid_zero_count=False, threshold=0.8, dpi=96, color='Blues', save=False, ax=None)[source]

Plot the contact probability map based on the threshold.

Return type:: Union[List[Axes], ndarray]

Parameters

log_scale: bool, optional: If True, use log scale range. Default is True.
avoid_zero_count: bool, optional: If True, avoid contacts with zero counts by adding a pseudocount of 1e-6 to all contacts.
threshold: float, optional: Distance threshold used to define a contact when calculating contact frequencies. Default is 0.8 nm.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
color: str, optional: The colormap to use for the contact probability map. Default is ‘Blues’.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
ax: Union[None, List[plt.Axes], np.ndarray], optional: The axes on which to plot. If None, new axes will be created. Default is None.

Returns

Union[List[plt.Axes], np.ndarray]: Returns a list or array of Axes objects representing the subplot grid.
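The quantity shown in a contact probability map can be sketched for a single residue pair: the fraction of conformations in which the pair is closer than the threshold. The helper names, the strict inequality, the base-10 logarithm, and the way the pseudocount is added are all assumptions of this sketch; only the 0.8 nm threshold and the 1e-6 pseudocount come from the parameter descriptions.

```python
import math

def contact_probability(distances, threshold=0.8):
    """Fraction of conformations with the pair closer than `threshold` (nm)."""
    return sum(d < threshold for d in distances) / len(distances)

def log_contact_probability(prob, pseudo_count=1e-6):
    """Log-scaled probability; the pseudocount keeps log(0) from occurring."""
    return math.log10(prob + pseudo_count)

# One residue pair observed in four conformations (distances in nm).
prob = contact_probability([0.5, 0.7, 1.2, 2.0])
never_in_contact = log_contact_probability(0.0)
```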

dimensionality_reduction_scatter(color_by='rg', save=False, ax=None, kde_by_ensemble=False, dpi=96, size=10, plotly=False, cmap_label='viridis', n_components=2)[source]

Plot the results of dimensionality reduction using the method specified in the analysis.

Return type:: List[Axes]

Parameters

color_by: str, optional: The feature extraction method used for coloring points in the scatter plot. Options are “rg”, “prolateness”, “asphericity”, “sasa”, and “end_to_end”. Default is “rg”.
save: bool, optional: If True, the plot will be saved in the data directory. Default is False.
ax: Union[None, List[plt.Axes]], optional: A list of Axes objects to plot on. Default is None, which creates new axes.
kde_by_ensemble: bool, optional: If True, the KDE plot will be generated for each ensemble separately. If False, a single KDE plot will be generated for the concatenated ensembles. Default is False.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
size: int, optional: The size of the points in the scatter plot. Default is 10.
plotly: bool, optional: If True, the plot will be generated using Plotly. Default is False.
cmap_label: str, optional: The colormap to use for the feature-colored labels. Default is ‘viridis’.
n_components: int, optional: The number of components for dimensionality reduction. Default is 2.

Returns

List[plt.Axes]: List containing Axes objects for the scatter plot of original labels, clustering labels, and feature-colored labels, respectively.

Raises

NotImplementedError: If the dimensionality reduction method specified in the analysis is not supported.

distance_maps(min_sep=2, max_sep=None, distance_type='both', get_names=True, inverse=False, color='plasma', dpi=96, save=False, ax=None)[source]

Plot CA and/or COM distance maps for one or more protein ensembles.

Return type:: List[Axes]

Parameters

min_sep: int, default=2: Minimum sequence separation between residues to consider.
max_sep: int or None, optional: Maximum sequence separation. If None, no upper limit is applied.
distance_type: {‘ca’, ‘com’, ‘both’}, default=‘both’: Specifies which type of distance map(s) to plot.
get_names: bool, default=True: Whether to return feature names from featurization (used internally).
inverse: bool, default=False: If True, compute and plot 1/distance instead of distance.
color: str, default=‘plasma’: Colormap to use for the distance maps.
dpi: int, default=96: The DPI (dots per inch) of the output figure.
save: bool, default=False: If True, the plot will be saved as an image file in the specified directory.
ax: matplotlib Axes or array-like, optional: Axes on which to plot. If None, a new figure and axes will be created.

Returns

List[matplotlib.axes.Axes]: List of axes objects used for plotting.

end_to_end_distances(rg_norm=False, bins=50, hist_range=None, violin_plot=True, summary_stat='mean', dpi=96, save=False, color='lightblue', multiple_hist_ax=False, x_ticks_rotation=45, ax=None)[source]

Plot end-to-end distance distributions.

Return type:: Union[Axes, List[Axes]]

Parameters

rg_norm: bool, optional: If True, normalize end-to-end distances by the average radius of gyration. Default is False.
bins: int, optional: The number of bins for the histogram. Default is 50.
hist_range: Tuple, optional: A tuple with a min and max value for the histogram. Default is None, which corresponds to using the min and max values across all data.
violin_plot: bool, optional: If True, a violin plot is visualized. Default is True.
summary_stat: str, optional: Specifies whether to display the “mean”, “median”, or “both” as reference lines on the plots. This applies when violin_plot is True or when multiple_hist_ax is True for histograms.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
ax: Union[None, plt.Axes, np.ndarray, List[plt.Axes]], optional: The axes on which to plot. Default is None, which creates a new figure and axes.
color: str, optional: Color of the violin plot. Default is lightblue.
multiple_hist_ax: bool, optional: If True, each histogram will be plotted on separate axes. Default is False.
x_ticks_rotation: int, optional: The rotation angle of the x-axis tick labels for the violin plot. Default is 45.

Returns

Union[plt.Axes, List[plt.Axes]]: The Axes object or a list of Axes objects containing the plot(s).

global_sasa(bins=50, hist_range=None, violin_plot=True, summary_stat='mean', save=False, dpi=96, color='lightblue', multiple_hist_ax=False, x_ticks_rotation=45, ax=None)[source]

Plot the distribution of SASA for each conformation within the ensembles.

Return type:: Axes

Parameters

bins: int, optional: The number of bins for the histogram. Default is 50.
hist_range: Tuple, optional: A tuple with a min and max value for the histogram. Default is None, which corresponds to using the min and max values across all data.
violin_plot: bool, optional: If True, a violin plot is visualized. Default is True.
summary_stat: str, optional: Specifies whether to display the “mean”, “median”, or “both” as reference lines on the plots. This applies when violin_plot is True or when multiple_hist_ax is True for histograms.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
color: str, optional: Color of the violin plot. Default is lightblue.
multiple_hist_ax: bool, optional: If True, each histogram will be plotted on separate axes. Default is False.
x_ticks_rotation: int, optional: The rotation angle of the x-axis tick labels for the violin plot. Default is 45.
ax: Union[None, plt.Axes, np.ndarray, List[plt.Axes]], optional: The matplotlib Axes object on which to plot. If None, a new Axes object will be created. Default is None.

Returns

plt.Axes: The Axes object containing the plot.

pca_1d_histograms(save=False, sel_components=0, bins=30, ax=None, dpi=96)[source]

Plot 1D histogram when the dimensionality reduction method is “pca” or “kpca”.

Return type:: List[Axes]

Parameters

save: bool, optional: If True, the plot will be saved in the data directory. Default is False.
sel_components: int, optional: Index of the component (dimension) whose histogram distribution is plotted. Default is 0 (first principal component in PCA).
bins: int, optional: Number of bins in the histograms. Default is 30.
ax: Union[None, List[plt.Axes]], optional: A list of Axes objects to plot on. Default is None, which creates new axes.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.

Returns

List[plt.Axes]: A list of plt.Axes objects representing the subplots created.

pca_2d_landscapes(save=False, sel_components=[0, 1], ax=None, dpi=96)[source]

Plot 2D landscapes when the dimensionality reduction method is “pca” or “kpca”.

Return type:: List[Axes]

Parameters

save: bool, optional: If True, the plot will be saved in the data directory. Default is False.
sel_components: List[int], optional: Indices of the selected principal components to analyze, starting from 0. The default components are the first and second.
ax: Union[None, List[plt.Axes]], optional: A list of Axes objects to plot on. Default is None, which creates new axes.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.

Returns

List[plt.Axes]: A list of plt.Axes objects representing the subplots created.

pca_cumulative_explained_variance(save=False, dpi=96, ax=None)[source]

Plot the cumulative explained variance. Only applicable when the dimensionality reduction method is “pca”.

Return type:: Axes

Parameters

save: bool, optional: If True, the plot will be saved in the data directory. Default is False.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
ax: Union[None, plt.Axes], optional: An Axes object to plot on. Default is None, which creates new axes.

Returns

plt.Axes, cumvar: The Axes object for the cumulative explained variance plot and a numpy array with the cumulative variance.
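The cumulative explained variance is the running sum of each component's share of the total variance. The sketch below assumes the input lists the variance of every component, so that the shares sum to 1; the function name is hypothetical and the numbers are made up for illustration.

```python
from itertools import accumulate

def cumulative_explained_variance(variances):
    """Running sum of each component's fraction of the total variance.

    Assumes `variances` covers all components, not only the retained ones.
    """
    total = sum(variances)
    return list(accumulate(v / total for v in variances))

# Three components with variances 2, 1 and 1 (illustrative values).
cumvar = cumulative_explained_variance([2.0, 1.0, 1.0])
```

Reading such a curve, the first index at which the value passes a chosen cutoff gives the number of components needed to explain that fraction of the variance.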

pca_residue_correlation(sel_components=[0, 1, 2], save=False, ax=None, dpi=96, cmap='RdBu', cmap_range=None, scale_loadings=False)[source]

Plot the loadings (weights) of each pair of residues for a list of principal components (PCs).

Return type:: List[Axes]

Parameters

sel_components: List[int], optional: A list of indices specifying the PCs to include in the plot.
save: bool, optional: If True, the plot will be saved as an image file. Default is False.
ax: Union[None, List[plt.Axes]], optional: A list of Axes objects to plot on. Default is None, which creates new axes.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
cmap: str, optional: Matplotlib colormap name.
cmap_range: Union[None, Tuple[float]], optional: Range of the colormap. Default is None, in which case the range is determined automatically. If a tuple, the first and second elements are the minimum and maximum of the range.
scale_loadings: bool, optional: Scale loadings by explained variance. Some definitions use correlation coefficients as loadings, when input features are standardized.

Returns

List[plt.Axes]: A list of plt.Axes objects representing the subplots created.

Notes

This method generates a correlation plot showing the weights of pairwise residue distances for selected PCA dimensions. The plot visualizes the correlation between residues based on the PCA weights.

The analysis is only valid on PCA and kernel PCA dimensionality reduction with ‘ca_dist’ feature extraction.

pca_rg_correlation(save=False, ax=None, dpi=96, sel_components=0)[source]

Examine and plot the correlation between a selected principal component (by default the first) and the radius of gyration (Rg). A high correlation is typically observed for IDPs/IDRs.

Return type:: List[Axes]

Parameters

save: bool, optional: If True, the plot will be saved in the data directory. Default is False.
ax: Union[None, List[plt.Axes]], optional: A list of Axes objects to plot on. Default is None, which creates new axes.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
sel_components: int, optional: Index of the selected principal component to analyze, defaults to 0 (first principal component).

Returns

List[plt.Axes], dict: A list of plt.Axes objects representing the subplots created and a dictionary with all the raw data being plotted.

per_residue_mean_sasa(probe_radius=0.14, n_sphere_points=960, figsize=(15, 5), dpi=96, size=3, auto_xticks=True, xtick_interval=5, pointer=None, save=False, ax=None)[source]

Plot the average solvent-accessible surface area (SASA) for each residue among all conformations in an ensemble. This function uses the Shrake–Rupley algorithm as implemented in MDTraj (mdtraj.shrake_rupley) to compute the solvent-accessible surface area.

Return type:: Axes

Parameters

probe_radius: float, optional: The radius of the probe sphere used in the Shrake–Rupley algorithm. Default is 0.14 nm.
n_sphere_points: int, optional: The number of points representing the surface of each atom; higher values lead to greater accuracy. Default is 960.
figsize: Tuple[int, int], optional: Tuple specifying the size of the figure. Default is (15, 5).
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
size: int, optional: The size of the marker points. Default is 3.
auto_xticks: bool, optional: If True, use matplotlib default xticks. Default is True.
xtick_interval: int, optional: If auto_xticks is False, this parameter defines the interval between displayed residue indices on the x-axis. Always start with 1, followed by every xtick_interval residues (e.g., 1, 5, 10, 15, … if `xtick_interval`=5).
pointer: List[int], optional: List of desired residues to highlight with vertical dashed lines. Default is None.
save: bool, optional: If True, the plot will be saved as an image file. Default is False.
ax: Union[None, plt.Axes], optional: The matplotlib Axes object on which to plot. If None, a new Axes object will be created. Default is None.

Returns

plt.Axes: Axes object containing the plot.
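The tick layout described for xtick_interval (start at 1, then every xtick_interval residues) can be sketched as below. The helper is hypothetical and only reproduces the pattern given in the parameter description; it is not IDPET's plotting code.

```python
def residue_xticks(n_residues, xtick_interval=5):
    """Tick positions: residue 1, then every `xtick_interval` residues."""
    ticks = [1] + list(range(xtick_interval, n_residues + 1, xtick_interval))
    # Remove the duplicate that appears when xtick_interval is 1.
    return sorted(set(ticks))

# A 20-residue chain with the interval used in the description above.
ticks = residue_xticks(20, xtick_interval=5)
```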

plot_histogram_grid(feature='ca_dist', ids=None, n_rows=2, n_cols=3, subplot_width=2.0, subplot_height=2.2, bins=None, dpi=90, save=False)[source]

Plot a grid of histograms for distance or angular features. Can only be used when analyzing ensembles of proteins with the same number of residues. The function will create a new matplotlib figure for the histogram grid.

Return type:: Axes

Parameters

feature: str, optional: Feature to analyze. Must be one of ca_dist (Ca-Ca distances), a_angle (alpha angles), phi or psi (phi or psi backbone angles).
ids: Union[list, List[list]], optional: Residue indices (integers starting from zero) to define the residues to analyze. For angular features it must be a 1d list with N residue indices. For distance features it must be a 2d list/array of shape (N, 2), in which N is the number of residue pairs to analyze and 2 is the number of indices in each pair. Each of the N indices (or pairs of indices) will be plotted in a histogram of the grid. If this argument is not provided, random indices will be sampled, which is useful for quickly comparing the distance or angle distributions of multiple ensembles.
n_rows: int, optional: Number of rows in the histogram grid.
n_cols: int, optional: Number of columns in the histogram grid.
subplot_width: int, optional: Use to specify the Matplotlib width of the figure. The size of the figure will be calculated as: figsize = (n_cols*subplot_width, n_rows*subplot_height).
subplot_height: int, optional: See the subplot_width argument.
bins: Union[str, int], optional: Number of bins in all the histograms.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 90.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.

Returns

ax: plt.Axes: The Axes object for the histogram grid.
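
The two layouts accepted by ids can be illustrated with a minimal sketch in plain Python. The helper name sample_ids, the seed argument and the way random indices are drawn are assumptions made for illustration only; they do not reproduce IDPET's internal sampling, and the sketch assumes one entry per subplot of the grid.

```python
import random


def sample_ids(n_residues, n_plots, feature, seed=0):
    """Hypothetical helper: build an `ids` value with the documented shape."""
    rng = random.Random(seed)
    if feature == "ca_dist":
        # Distance features: N pairs of residue indices, i.e. shape (N, 2).
        pairs = []
        while len(pairs) < n_plots:
            i, j = sorted(rng.sample(range(n_residues), 2))
            pairs.append([i, j])
        return pairs
    # Angular features (a_angle, phi, psi): a flat list of N residue indices.
    return rng.sample(range(n_residues), n_plots)


n_rows, n_cols = 2, 3
dist_ids = sample_ids(50, n_rows * n_cols, "ca_dist")   # 6 pairs [i, j]
angle_ids = sample_ids(50, n_rows * n_cols, "phi")      # 6 single indices
print(dist_ids)
print(angle_ids)
```

Passing explicit ids instead of relying on random sampling makes the grid reproducible across runs.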

plot_rama_grid(ids=None, n_rows=2, n_cols=3, subplot_width=2.0, subplot_height=2.2, dpi=90, save=False)[source]

Plot a grid of Ramachandran plots for different residues. Can only be used when analyzing ensembles of proteins with the same number of residues. The function will create a new matplotlib figure for the scatter plot grid.

Return type:: Axes

Parameters

ids: Union[list, List[list]], optional: Residue indices (integers starting from zero) to define the residues to analyze. It must be a 1d list with N indices of the residues. Each of the N indices will be plotted in a scatter plot in the grid. If this argument is not provided, random indices will be sampled, which is useful for quickly comparing features of multiple ensembles.
n_rows: int, optional: Number of rows in the scatter grid.
n_cols: int, optional: Number of columns in the scatter grid.
subplot_width: int, optional: Width of each subplot in the grid, in Matplotlib figure units (inches). The size of the figure will be calculated as: figsize = (n_cols*subplot_width, n_rows*subplot_height).
subplot_height: int, optional: See the subplot_width argument.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 90.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.

Returns

ax: plt.Axes: The Axes object for the scatter plot grid.
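
As a quick check of the figure size formula quoted for subplot_width, the following sketch evaluates it with the default values from the signature. The function name grid_figsize is hypothetical and not part of IDPET; the pixel size assumes the usual Matplotlib relation of inches times DPI.

```python
def grid_figsize(n_rows=2, n_cols=3, subplot_width=2.0, subplot_height=2.2):
    """Figure size in inches, following figsize = (n_cols*w, n_rows*h)."""
    return (n_cols * subplot_width, n_rows * subplot_height)


figsize = grid_figsize()
dpi = 90
# Approximate size of the rendered figure in pixels.
pixels = tuple(round(side * dpi) for side in figsize)
print(figsize, pixels)
```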

prolateness(bins=50, hist_range=None, violin_plot=True, summary_stat='mean', dpi=96, save=False, color='lightblue', multiple_hist_ax=False, x_ticks_rotation=45, ax=None)[source]

Plot prolateness distribution in each ensemble. Prolateness is calculated based on the gyration tensor.

Return type:: Axes

Parameters

bins: int, optional: The number of bins for the histogram. Default is 50.
hist_range: Tuple, optional: A tuple with a min and max value for the histogram. Default is None, which corresponds to using the min and max value across all data.
violin_plot: bool, optional: If True, a violin plot is visualized. Default is True.
summary_stat: str, optional: Specifies whether to display the “mean”, “median”, or “both” as reference lines on the plots. This applies when violin_plot is True or when multiple_hist_ax is True for histograms.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
color: str, optional: Color of the violin plot. Default is lightblue.
multiple_hist_ax: bool, optional: If True, each histogram will be plotted on separate axes. Default is False.
x_ticks_rotation: int, optional: The rotation angle of the x-axis tick labels for the violin plot. Default is 45.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
ax: Union[None, plt.Axes, np.ndarray, List[plt.Axes]], optional: The axes on which to plot. Default is None, which creates a new figure and axes.

Returns

plt.Axes: The Axes object containing the plot.

radius_of_gyration(bins=50, hist_range=None, multiple_hist_ax=False, violin_plot=True, x_ticks_rotation=45, summary_stat='mean', color='lightblue', dpi=96, save=False, ax=None)[source]

Plot the distribution of the radius of gyration (Rg) within each ensemble.

Return type:: Union[Axes, List[Axes]]

Parameters

bins: int, optional: The number of bins for the histogram. Default is 50.
hist_range: Tuple, optional: A tuple with a min and max value for the histogram. Default is None, which corresponds to using the min and max value across all data.
multiple_hist_ax: bool, optional: If True, it will plot each histogram in a different axis.
violin_plot: bool, optional: If True, a violin plot is visualized. Default is True.
x_ticks_rotation: int, optional: The rotation angle of the x-axis tick labels for the violin plot. Default is 45.
summary_stat: str, optional: Specifies whether to display the “mean”, “median”, or “both” as reference lines on the plots. This applies when violin_plot is True or when multiple_hist_ax is True for histograms.
color: str, optional: Color of the violin plot. Default is lightblue.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file. Default is False.
ax: Union[None, plt.Axes, np.ndarray, List[plt.Axes]], optional: The axes on which to plot. If None, new axes will be created. Default is None.

Returns

Union[plt.Axes, List[plt.Axes]]: Returns a single Axes object or a list of Axes objects containing the plot(s).

Notes

This method plots the distribution of the radius of gyration (Rg) within each ensemble in the analysis.

The Rg values are binned according to the specified number of bins (bins) and range (hist_range) and displayed as histograms. Additionally, dashed lines representing the mean and median Rg values are overlaid on each histogram.
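
For readers unfamiliar with the quantity being plotted, the sketch below computes an unweighted radius of gyration for toy coordinates in plain Python. It is an illustration of the general definition (root-mean-square distance of the atoms from their centroid), not IDPET's implementation; whether IDPET applies mass weighting or which atoms it selects is not stated here, and the function name rg_unweighted is hypothetical.

```python
import math


def rg_unweighted(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    sq_dists = [sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in coords]
    return math.sqrt(sum(sq_dists) / n)


# Two toy "conformers" with three atoms each (arbitrary length units).
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
extended = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]

rg_compact = rg_unweighted(compact)
rg_extended = rg_unweighted(extended)
print(rg_compact, rg_extended)
```

Computing one such value per conformer gives the per-ensemble distribution that the histograms or violin plots display.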

ramachandran_plots(two_d_hist=True, bins=(-180, 180, 80), dpi=96, color='viridis', log_scale=True, save=False, ax=None)[source]

Ramachandran plot. If two_d_hist=True it returns a 2D histogram for each ensemble. If two_d_hist=False it returns a simple scatter plot for all ensembles in one plot.

Return type:: Union[List[Axes], Axes]

Parameters

two_d_hist: bool, optional: If True, it returns a 2D histogram for each ensemble. Default is True.
bins: tuple, optional: You can customize the bins for 2D histogram. Default is (-180, 180, 80).
log_scale: bool, optional: If True, the histogram will be plotted on a logarithmic scale. Default is True.
color: str, optional: The colormap to use for the 2D histogram. Default is ‘viridis’.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
ax: Union[None, plt.Axes, np.ndarray, List[plt.Axes]], optional: The axes on which to plot. If None, new axes will be created. Default is None.

Returns

Union[List[plt.Axes], plt.Axes]: If two_d_hist=True, returns a list of Axes objects representing the subplot grid for each ensemble. If two_d_hist=False, returns a single Axes object representing the scatter plot for all ensembles.

relative_dssp_content(dssp_code='H', dpi=96, auto_xticks=False, xtick_interval=5, figsize=(10, 5), save=False, ax=None)[source]

Plot the relative secondary structure (ss) content in each ensemble for each residue.

Return type:: Axes

Parameters

dssp_code: str, optional: The DSSP code to plot. Choose between ‘H’ for helix, ‘C’ for coil and ‘E’ for strand. The method works with the simplified DSSP codes.
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
auto_xticks: bool, optional: If True, use matplotlib default xticks.
xtick_interval: int, optional: If auto_xticks is False, this parameter defines the interval between displayed residue indices on the x-axis. Residue 1 is always included, followed by every xtick_interval residues (e.g., 1, 5, 10, 15 if `xtick_interval`=5).
figsize: Tuple[float, float], optional: The size of the figure in inches. Default is (10, 5).
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
ax: plt.Axes, optional: The axes on which to plot. Default is None, which creates a new figure and axes.

Returns

plt.Axes: The Axes object containing the plot.
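
The per-residue quantity shown by this plot can be sketched as the fraction of conformers in which a residue carries the selected code. The example below is a plain-Python illustration under that assumption; the function name relative_content and the toy DSSP strings are hypothetical and do not come from IDPET.

```python
def relative_content(dssp_strings, dssp_code="H"):
    """Fraction of conformers with `dssp_code` at each residue position."""
    n_conformers = len(dssp_strings)
    n_residues = len(dssp_strings[0])
    return [
        sum(s[i] == dssp_code for s in dssp_strings) / n_conformers
        for i in range(n_residues)
    ]


# One simplified DSSP string per conformer (toy data, 5 residues).
dssp = ["CHHHC", "CHHCC", "CCHHC", "CCCCC"]
helix = relative_content(dssp, "H")
print(helix)  # [0.0, 0.5, 0.75, 0.5, 0.0]
```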

rg_vs_asphericity(dpi=96, save=False, size=4, ax=None, verbose=True)[source]

Plot the Rg versus Asphericity and calculate the Pearson correlation coefficient to evaluate the correlation between Rg and Asphericity.

Return type:: Axes

Parameters

dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
size: int, optional: The size of the scatter points. Default is 4.
ax: plt.Axes, optional: The axes on which to plot. Default is None, which creates a new figure and axes.
verbose: bool, optional: Verbosity for the output of the method. If True, shows the Pearson correlation coefficient for each ensemble.

Returns

plt.Axes: The Axes object containing the plot.
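
The Pearson correlation coefficient reported by this method can be illustrated with a short plain-Python sketch. The function name pearson_r and the toy Rg and asphericity values are made up for the example; IDPET's own computation is not shown here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy per-conformer values for one ensemble.
rg = [1.2, 1.5, 1.9, 2.4, 2.8]
asphericity = [0.10, 0.18, 0.22, 0.35, 0.41]
r = pearson_r(rg, asphericity)
print(round(r, 3))
```

A value close to 1 indicates that more expanded conformers also tend to be more aspherical; a value near 0 indicates no linear relationship.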

rg_vs_prolateness(dpi=96, size=4, save=False, ax=None, verbose=True)[source]

Plot the Rg versus Prolateness and get the Pearson correlation coefficient to evaluate the correlation between Rg and Prolateness.

Return type:: Axes

Parameters

dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
size: int, optional: The size of the scatter marker points. Default is 4.
save: bool, optional: If True, the plot will be saved as an image file in the specified directory. Default is False.
ax: plt.Axes, optional: The axes on which to plot. Default is None, which creates a new figure and axes.
verbose: bool, optional: Verbosity for the output of the method.

Returns

plt.Axes: The Axes object containing the plot.

site_specific_flexibility(pointer=None, auto_xticks=False, xtick_interval=5, dpi=96, figsize=(15, 5), save=False, ax=None)[source]

Generate a plot of the site-specific flexibility parameter.

This plot shows the site-specific measure of disorder, which is sensitive to local flexibility based on the circular variance of the Ramachandran angles φ and ψ for each residue in the ensemble. The score ranges from 0, for identical dihedral angles in all conformers at residue i, to 1, for a uniform distribution of dihedral angles at residue i. (For more information about this method, see https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4906)

Return type:: Axes

Parameters

pointer: List[int], optional: A list of desired residues. Vertical dashed lines will be added to point to these residues. Default is None.
auto_xticks: bool, optional: If True, use matplotlib default xticks.
xtick_interval: int, optional: If auto_xticks is False, this parameter defines the interval between displayed residue indices on the x-axis. Always start with 1, followed by every xtick_interval residues (e.g., 1, 5, 10, 15, … if `xtick_interval`=5).
figsize: Tuple[int, int], optional: The size of the figure. Default is (15, 5).
dpi: int, optional: The DPI (dots per inch) of the output figure. Default is 96.
save: bool, optional: If True, the plot will be saved as an image file. Default is False.
ax: Union[None, plt.Axes], optional: The matplotlib Axes object on which to plot. If None, a new Axes object will be created. Default is None.

Returns

plt.Axes: The matplotlib Axes object containing the plot.
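
The 0-to-1 range of the flexibility score follows from the behavior of circular variance, which the sketch below computes for a single dihedral angle in plain Python. This is only the textbook single-angle definition (one minus the length of the mean resultant vector); how IDPET combines φ and ψ into one per-residue value is not reproduced, and the function name circular_variance is an assumption.

```python
import math


def circular_variance(angles_deg):
    """1 minus the length of the mean resultant vector of the angles."""
    n = len(angles_deg)
    mean_cos = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    mean_sin = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return 1.0 - math.hypot(mean_cos, mean_sin)


# Same angle in every conformer: no spread, variance close to 0.
rigid = [-60.0] * 8
# Angles evenly spread around the circle: variance close to 1.
spread = [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]

print(circular_variance(rigid), circular_variance(spread))
```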

site_specific_order(pointer=None, auto_xticks=True, xtick_interval=5, dpi=96, figsize=(15, 5), save=False, ax=None)[source]

Generate a plot of the site-specific order parameter. The function computes and plots per-residue order parameters that quantify how consistently each residue’s backbone orientation is aligned with the rest of the chain across all conformers in an ensemble. The result is a per-residue value between 0 and 1: values near 1 indicate high orientational order (rigid or structured regions), while values near 0 reflect disorder (flexible or unstructured regions). This measure captures long-range orientational correlations in the backbone and is particularly useful for detecting weakly ordered segments in intrinsically disordered proteins. (For more information about this method, see https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4906)

Return type:: Axes

Parameters

pointer: List[int], optional: A list of desired residues. Vertical dashed lines will be added to point to these residues. Default is None.
auto_xticks: bool, optional: If True, use matplotlib default xticks.
xtick_interval: int, optional: If auto_xticks is False, this parameter defines the interval between displayed residue indices on the x-axis. Always start with 1, followed by every xtick_interval residues (e.g., 1, 5, 10, 15, … if `xtick_interval`=5).
figsize: Tuple[int, int], optional: The size of the figure. Default is (15, 5).
save: bool, optional: If True, the plot will be saved as an image file. Default is False.
ax: Union[None, plt.Axes], optional: The matplotlib Axes object on which to plot. If None, a new Axes object will be created. Default is None.

Returns

plt.Axes: The matplotlib Axes object containing the plot.
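
The tick placement described for xtick_interval (start at residue 1, then every xtick_interval residues) can be sketched as follows. The helper residue_ticks is hypothetical and only mirrors the example given in the parameter description; IDPET's actual tick logic may differ in edge cases.

```python
def residue_ticks(n_residues, xtick_interval=5):
    """Residue numbers to label: 1, then every multiple of the interval."""
    ticks = [1]
    ticks.extend(range(xtick_interval, n_residues + 1, xtick_interval))
    # Remove the duplicate that appears when xtick_interval is 1.
    return sorted(set(ticks))


print(residue_ticks(20, 5))  # [1, 5, 10, 15, 20]
```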

idpet.visualization.plot_comparison_matrix(ax, comparison_out, codes, confidence_level=0.95, significance_level=0.05, cmap='viridis_r', title='New Comparison', cbar_label='score', textcolors=('black', 'white'))[source]

Plot a matrix with all-vs-all comparison scores of M ensembles as a heatmap. If plotting the results of a regular all-vs-all analysis (no bootstrapping involved), it will just plot the M x M comparison scores, with empty values on the diagonal. If plotting the results of an all-vs-all analysis with bootstrapping, it will plot the M x M confidence intervals for the scores. The intervals are obtained by using the ‘percentile’ method. Additionally, it will plot an asterisk for those non-diagonal entries for which the inter-ensemble scores are significantly higher than the intra-ensemble scores according to a Mann–Whitney U test.

Parameters

ax: plt.Axes: Axes object where the heatmap should be created.
comparison_out: dict: A dictionary containing the output of the comparison_scores method of the dpet.ensemble_analysis.EnsembleAnalysis class. It must contain the following key-value pairs: scores: NumPy array with shape (M, M, B) containing the comparison scores for M ensembles and B bootstrap iterations. If no bootstrap analysis was performed, B = 1; otherwise B > 1. p_values (optional): used only when a bootstrap analysis was performed. An (M, M) NumPy array storing the p-values obtained by comparing the inter-ensemble and intra-ensemble comparison scores with a statistical test.
codes: List[str]: List of strings with the codes of the ensembles.
confidence_level: float, optional: Confidence level for the bootstrap intervals of the comparison scores.
significance_level: float, optional: Significance level for the statistical test used to compare inter- and intra-ensemble comparison scores.
cmap: str, optional: Matplotlib colormap name to use in the heatmap.
title: str, optional: Title of the heatmap.
cbar_label: str, optional: Label of the colorbar.
textcolors: Union[str, tuple], optional: Color of the text for each cell of the heatmap, specified as a string. By providing a tuple with two elements, the two colors will be applied to cells with color intensity above/below a certain threshold, so that lighter text can be plotted in darker cells and darker text can be plotted in lighter cells.

Returns

ax: plt.Axes: The same updated Axes object from the input. The comparison_out will be updated to store confidence intervals if performing a bootstrap analysis.

Notes

The comparison matrix is annotated with the scores, and the axes are labeled with the ensemble labels.
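
To make the ‘percentile’ method and the asterisk rule more concrete, here is a minimal plain-Python sketch for a single cell of the matrix. The helper names, the linear-interpolation percentile, the toy bootstrap scores and the strict comparison p_value < significance_level are all assumptions for illustration; the exact interpolation and comparison used by IDPET are not documented here.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of sorted values."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def percentile_interval(scores, confidence_level=0.95):
    """Two-sided percentile interval of a set of bootstrap scores."""
    vals = sorted(scores)
    alpha = 1.0 - confidence_level
    return (percentile(vals, 100.0 * alpha / 2.0),
            percentile(vals, 100.0 * (1.0 - alpha / 2.0)))


# Toy bootstrap scores for one ensemble pair (B = 10 iterations).
boot_scores = [0.10, 0.12, 0.11, 0.15, 0.13, 0.14, 0.12, 0.16, 0.11, 0.13]
low, high = percentile_interval(boot_scores, confidence_level=0.95)

# Toy p-value for the same pair; a cell is starred when it is significant.
p_value, significance_level = 0.003, 0.05
starred = p_value < significance_level
print(f"[{low:.3f}, {high:.3f}]" + ("*" if starred else ""))
```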

idpet.visualization.plot_histogram(ax, data, labels, bins=50, range=None, title='Histogram', xlabel='x', ylabel='Density', location=None)[source]

Plot a histogram for different features.

Parameters

ax: plt.Axes: Matplotlib axis object where the histograms will be plotted.
data: List[np.array]: List of NumPy arrays storing the data to be plotted.
labels: List[str]: List of strings with the labels of the arrays.
bins: Number of bins. Default is 50.
range: Tuple, optional: A tuple with a min and max value for the histogram. Default is None, which corresponds to using the min and max value across all data.
title: str, optional: Title of the axis object.
xlabel: str, optional: Label of the horizontal axis.
ylabel: str, optional: Label of the vertical axis.

Returns

plt.Axes: Axes object containing the histogram plot.
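
When range is None, all arrays share one range taken from the min and max across all data, which keeps the bins aligned between ensembles. The sketch below illustrates that idea in plain Python; the helper names and the density normalization (suggested by the default ylabel, but an assumption) are not taken from IDPET's code.

```python
def shared_range(data):
    """Min and max across every array in `data`."""
    return min(min(a) for a in data), max(max(a) for a in data)


def density_histogram(values, bins, value_range):
    """Histogram normalized so that the bar areas sum to 1."""
    lo, hi = value_range
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Values equal to the upper edge fall into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / (len(values) * width) for c in counts]


data = [[1.0, 2.0, 2.5], [2.0, 3.0, 5.0]]  # toy values for two ensembles
value_range = shared_range(data)           # (1.0, 5.0)
hists = [density_histogram(a, bins=4, value_range=value_range) for a in data]
print(value_range, hists)
```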

idpet.visualization.plot_violins(ax, data, labels, summary_stat='mean', title='Histogram', xlabel='x', color='blue', x_ticks_rotation=45)[source]

Make a violin plot.

Parameters

ax: plt.Axes: Matplotlib axis object where the violin plot will be drawn.
data: List[np.array]: List of NumPy arrays storing the data to be plotted.
labels: List[str]: List of strings with the labels of the arrays.
summary_stat: str, optional: Select between “median” or “mean” to show in the violin plot. Default value is “mean”.
title: str, optional: Title of the axis object.
xlabel: str, optional: Label of the horizontal axis.

Returns

plt.Axes: Axes object containing the violin plot.