ensemble_comparison
- idpet.comparison.all_vs_all_comparison(ensembles, score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]
Compare all pairs of ensembles using divergence scores. The implemented scores are approximate average Jensen–Shannon divergence (JSD) values over several kinds of molecular features. The lower these scores are, the higher the similarity between the probability distributions of the features of the ensembles. JSD scores here range from a minimum of 0 to a maximum of log(2) ~= 0.6931.
- Return type:
dict
Parameters
- ensembles: List[Ensemble]
Ensemble objects to analyze.
- score: str
Type of score used to compare ensembles. Choices: adaJSD (carbon Alpha Distance Average JSD), ramaJSD (Ramachandran average JSD) and ataJSD (Alpha Torsion Average JSD). adaJSD scores the average JSD over all Ca-Ca distance distributions of residue pairs with sequence separation > 1. ramaJSD scores the average JSD over the phi-psi angle distributions of all residues. ataJSD scores the average JSD over all alpha torsion angles, which are the angles formed by four consecutive Ca atoms in a protein.
- featurization_params: dict, optional
Optional dictionary to customize the featurization process for the above features.
- bootstrap_iters: int, optional
Number of bootstrap iterations. By default its value is None: in this case, IDPET will directly compare each pair of ensembles $i$ and $j$ using all of their conformers, performing the comparison only once. If an integer value is provided, each pair of ensembles $i$ and $j$ will instead be compared bootstrap_iters times by randomly selecting (bootstrapping) conformations from them. Additionally, each ensemble will be compared with itself by subsampling its conformers via bootstrapping. IDPET will then perform a statistical test to establish whether the inter-ensemble ($i != j$) scores are significantly different from the intra-ensemble ($i == j$) scores. The test works as follows: for each ensemble pair $i != j$, IDPET collects their inter-ensemble comparison scores obtained via bootstrapping. It then collects the bootstrapping scores from the auto-comparisons of ensembles $i$ and $j$; the set with the higher mean is selected as the reference intra-ensemble scores. Finally, the inter-ensemble and intra-ensemble scores are compared via a one-sided Mann-Whitney U test, with the alternative hypothesis being that inter-ensemble scores are stochastically greater than intra-ensemble scores. The p-values obtained in these tests are also returned. For small protein structural ensembles (fewer than 500 conformations), most comparison scores in IDPET are not robust estimators of divergence/distance; by performing bootstrapping, you can get an idea of how the size of your ensembles impacts the comparison. Use values >= 50 when comparing ensembles with very few conformations (fewer than 100). When comparing large ensembles (more than 1,000-5,000 conformations) you can safely avoid bootstrapping.
- bootstrap_frac: float, optional
Fraction of the total conformations to sample when bootstrapping. The default value is 1.0, which results in bootstrap samples with the same number of conformations as the original ensemble.
- bootstrap_replace: bool, optional
If True, bootstrap will sample with replacement. Default is True.
- bins: Union[int, str], optional
Number of bins or bin assignment rule for JSD comparisons. See the documentation of dpet.comparison.get_num_comparison_bins for more information.
- random_seed: int, optional
Random seed used when performing bootstrapping.
- verbose: bool, optional
If True, some information about the comparisons will be printed to stdout.
Returns
- results: dict
- A dictionary containing the following key-value pairs:
- scores: a (M, M, B) NumPy array storing the comparison scores, where M is the number of ensembles being compared and B is the number of bootstrap iterations (B will be 1 if bootstrapping was not performed).
- p_values: a (M, M) NumPy array storing the p-values obtained in the statistical test performed when using a bootstrapping strategy (see the bootstrap_iters argument). Returned only when bootstrapping is performed.
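The all-vs-all score matrix layout can be illustrated with a numpy-only toy sketch (this is not the idpet implementation; `hist_jsd` is a hypothetical helper, and the 1-D synthetic "ensembles" stand in for featurized conformers):

```python
import numpy as np

def hist_jsd(p_sample, q_sample, bins=50, limits=(0.0, 1.0)):
    """Approximate JSD between two 1-D samples via shared-bin histograms."""
    p_h, _ = np.histogram(p_sample, bins=bins, range=limits)
    q_h, _ = np.histogram(q_sample, bins=bins, range=limits)
    p = p_h / p_h.sum()
    q = q_h / q_h.sum()
    m = 0.5 * (p + q)
    def kld(x):
        mask = x > 0  # bins with zero frequency contribute nothing
        return np.sum(x[mask] * np.log(x[mask] / m[mask]))
    return 0.5 * kld(p) + 0.5 * kld(q)

rng = np.random.default_rng(0)
# Three toy "ensembles": 1-D feature samples in [0, 1].
ensembles = [
    rng.beta(2, 5, size=1000),
    rng.beta(2, 5, size=1000),   # same distribution as the first
    rng.beta(5, 2, size=1000),   # different distribution
]
M = len(ensembles)
scores = np.zeros((M, M))  # analogous to the (M, M, B) array with B = 1
for i in range(M):
    for j in range(M):
        scores[i, j] = hist_jsd(ensembles[i], ensembles[j])
```

Ensembles drawn from the same distribution get scores near 0, while dissimilar ones approach log(2), which is the pattern the Mann-Whitney test above probes when bootstrapping is enabled.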
- idpet.comparison.calc_jsd(p_h, q_h)[source]
Calculates the JSD between distributions p and q. p_h: histogram frequencies for sample p. q_h: histogram frequencies for sample q.
- idpet.comparison.calc_kld_for_jsd(x_h, m_h)[source]
Calculates the KLD between distributions x and m. x_h: histogram frequencies for sample p or q. m_h: histogram frequencies for m = 0.5*(p+q).
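The decomposition used by the two functions above, JSD(p, q) = 0.5*KLD(p||m) + 0.5*KLD(q||m) with m = 0.5*(p+q), can be sketched on histogram frequencies (a minimal numpy illustration, not the idpet source; the helpers are hypothetical):

```python
import numpy as np

def calc_kld_sketch(x_h, m_h):
    """KLD between a normalized histogram x_h and the mixture histogram m_h.
    Bins where x_h == 0 contribute nothing to the sum."""
    mask = x_h > 0
    return float(np.sum(x_h[mask] * np.log(x_h[mask] / m_h[mask])))

def calc_jsd_sketch(p_h, q_h):
    """JSD from two histograms over the same bins (normalized here)."""
    p = p_h / p_h.sum()
    q = q_h / q_h.sum()
    m = 0.5 * (p + q)
    return 0.5 * calc_kld_sketch(p, m) + 0.5 * calc_kld_sketch(q, m)

# Identical histograms -> 0; disjoint support -> log(2).
same = calc_jsd_sketch(np.array([1., 2., 3.]), np.array([1., 2., 3.]))
disjoint = calc_jsd_sketch(np.array([1., 0.]), np.array([0., 1.]))
```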
- idpet.comparison.confidence_interval(theta_boot, theta_hat=None, confidence_level=0.95, method='percentile')[source]
Returns bootstrap confidence intervals. Adapted from: https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_resampling.py
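The percentile method can be sketched as follows (a simplified illustration of the default method, assuming `theta_boot` holds bootstrap replicates of some statistic; it omits the bias-corrected variants the scipy code supports):

```python
import numpy as np

def percentile_ci(theta_boot, confidence_level=0.95):
    """Percentile bootstrap confidence interval from replicate estimates."""
    alpha = (1.0 - confidence_level) / 2.0
    low, high = np.quantile(theta_boot, [alpha, 1.0 - alpha])
    return low, high

rng = np.random.default_rng(42)
data = rng.normal(loc=0.3, scale=0.1, size=200)
# Bootstrap replicates of the sample mean.
boot = np.array([rng.choice(data, size=data.size, replace=True).mean()
                 for _ in range(1000)])
low, high = percentile_ci(boot, 0.95)
```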
- idpet.comparison.get_adaJSD_matrix(ens_1, ens_2, bins='auto', return_bins=False, featurization_params={}, *args, **kwargs)[source]
Utility function to calculate the adaJSD score between two ensembles and return a matrix with a JSD score for each Ca-Ca residue pair.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble objects storing the ensemble data to compare.
- return_bins: bool, optional
If True, also return the histogram bin edges used in the comparison.
- **remaining
Additional arguments passed to idpet.comparison.score_adaJSD.
Returns
- score: float
The overall adaJSD score between the two ensembles.
- jsd_matrix: np.ndarray of shape (N, N)
Matrix containing JSD scores for each Ca-Ca distance pair, where N is the number of residues.
- bin_edges: np.ndarray, optional
Returned only if return_bins=True. The bin edges used in histogram comparisons.
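The shape of the output matrix can be illustrated with a synthetic numpy sketch (not the idpet implementation; toy Gaussian "distance" samples replace real Ca-Ca distances, and `hist_jsd` is a hypothetical helper):

```python
import numpy as np
from itertools import combinations

def hist_jsd(p, q, bins=30):
    """Approximate JSD between two 1-D samples over shared bins."""
    lo, hi = min(p.min(), q.min()), max(p.max(), q.max())
    p_h, _ = np.histogram(p, bins=bins, range=(lo, hi))
    q_h, _ = np.histogram(q, bins=bins, range=(lo, hi))
    p_n, q_n = p_h / p_h.sum(), q_h / q_h.sum()
    m = 0.5 * (p_n + q_n)
    kld = lambda x: np.sum(x[x > 0] * np.log(x[x > 0] / m[x > 0]))
    return 0.5 * kld(p_n) + 0.5 * kld(q_n)

N = 6  # toy number of residues
# Residue pairs with sequence separation > 1, as in adaJSD.
pairs = [(i, j) for i, j in combinations(range(N), 2) if j - i > 1]
rng = np.random.default_rng(1)
# Synthetic Ca-Ca distance samples per pair, for two toy ensembles.
ens_1 = {p: rng.normal(5 + p[1] - p[0], 1.0, 500) for p in pairs}
ens_2 = {p: rng.normal(5 + p[1] - p[0], 1.0, 500) for p in pairs}

jsd_matrix = np.full((N, N), np.nan)  # NaN where no pair is scored
for (i, j) in pairs:
    s = hist_jsd(ens_1[(i, j)], ens_2[(i, j)])
    jsd_matrix[i, j] = jsd_matrix[j, i] = s
avg_score = np.nanmean(jsd_matrix)  # overall adaJSD-style average
```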
- idpet.comparison.get_ataJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]
Utility function to calculate the ataJSD score between two ensembles and return a profile with JSD scores for each alpha angle in the proteins.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble objects storing the ensemble data to compare.
- return_bins: bool, optional
If True, also return the histogram bin edges used in the comparison.
- **remaining
Additional arguments passed to dpet.comparison.score_ataJSD.
Returns
- score: float
The overall ataJSD score between the two ensembles.
- jsd_profile: np.ndarray of shape (N - 3,)
JSD scores for individual α backbone angles, where N is the number of residues in the protein.
- bin_edges: np.ndarray, optional
Returned only if return_bins=True. The bin edges used in histogram comparisons.
- idpet.comparison.get_num_comparison_bins(bins, x=None)[source]
Get the number of bins to be used in a comparison between two ensembles using a histogram-based score (such as a JSD approximation).
Parameters
- bins: Union[str, int]
Determines the number of bins to be used. When providing an int, the same value will simply be returned. When providing a string, the following rules are applied to determine the bin value:
- auto: applies the sqrt rule if the size of the smallest ensemble is < dpet.comparison.min_samples_auto_hist; if it is >= this value, returns dpet.comparison.num_default_bins.
- sqrt: applies the square-root rule using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Square-root_choice).
- sturges: applies Sturges' formula using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Sturges’s_formula).
- x: List[np.ndarray], optional
List of M feature matrices (one for each ensemble) of shape (N_i, *). The N_i values are the numbers of structures in each ensemble. The minimum N_i will be used to apply the bin assignment rule when the bins argument is a string.
Returns
- num_bins: int
Number of bins.
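The bin rules above can be sketched as follows (an illustration, not the idpet source; the constants `MIN_SAMPLES_AUTO_HIST` and `NUM_DEFAULT_BINS` are assumed placeholder values, and the actual dpet.comparison constants and rounding may differ):

```python
import math

# Hypothetical stand-ins for dpet.comparison.min_samples_auto_hist
# and dpet.comparison.num_default_bins.
MIN_SAMPLES_AUTO_HIST = 500
NUM_DEFAULT_BINS = 50

def num_comparison_bins(bins, sizes=None):
    """Resolve a bin count from an int or a rule name, using the
    size of the smallest ensemble, mirroring the documented behavior."""
    if isinstance(bins, int):
        return bins  # an int is simply returned
    n_min = min(sizes)
    if bins == "auto":
        if n_min < MIN_SAMPLES_AUTO_HIST:
            return math.ceil(math.sqrt(n_min))  # fall back to sqrt rule
        return NUM_DEFAULT_BINS
    if bins == "sqrt":
        return math.ceil(math.sqrt(n_min))
    if bins == "sturges":
        return math.ceil(math.log2(n_min)) + 1
    raise ValueError(f"unknown bin rule: {bins}")
```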
- idpet.comparison.get_ramaJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]
Utility function to calculate the ramaJSD score between two ensembles and return a profile with JSD scores for the Ramachandran distributions of each pair of corresponding residues in the proteins.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble objects storing the ensemble data to compare.
- return_bins: bool, optional
If True, also return the histogram bin edges used in the comparison.
- **remaining
Additional arguments passed to dpet.comparison.score_ramaJSD.
Returns
- score: float
The overall ramaJSD score between the two ensembles.
- jsd_profile: np.ndarray of shape (N - 2,)
JSD scores for the Ramachandran distribution of each residue, where N is the number of residues in the protein.
- bin_edges: np.ndarray, optional
Returned only if return_bins=True. The bin edges used in histogram comparisons.
- idpet.comparison.process_all_vs_all_output(comparison_out, confidence_level=0.95)[source]
Takes as input a dictionary produced as output of the all_vs_all_comparison function. If a bootstrap analysis was performed in all_vs_all_comparison, this function will assign bootstrap confidence intervals.
- idpet.comparison.score_adaJSD(ens_1, ens_2, bins='auto', return_bins=False, return_scores=False, featurization_params={}, *args, **kwargs)[source]
Utility function to calculate the adaJSD (carbon Alpha Distance Average JSD) score between two ensembles. The score evaluates the divergence between the distributions of Ca-Ca distances in the ensembles.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
- bins: Union[str, int], optional
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
- return_bins: bool, optional
If True, returns the number of bins used in the calculation.
- return_scores: bool, optional
If True, returns a tuple (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
- featurization_params: dict, optional
Optional dictionary to customize the featurization process to calculate Ca-Ca distances. See the Ensemble.get_features function for more information.
Returns
- avg_score: float
The average JSD score across the F features.
- If return_scores=True:
- (avg_score, all_scores): Tuple[float, np.ndarray]
The average score and an array of JSD scores of shape (F,).
- If return_bins=True:
- (avg_score, num_bins): Tuple[float, int]
The average score and the number of bins used.
- If both return_scores and return_bins are True:
- ((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]
The average score, array of per-feature scores, and number of bins used.
- idpet.comparison.score_ataJSD(ens_1, ens_2, bins, return_bins=False, return_scores=False, *args, **kwargs)[source]
Utility function to calculate the ataJSD (Alpha Torsion Average JSD) score between two ensembles. The score evaluates the divergence between distributions of alpha torsion angles (the angles formed by four consecutive Ca atoms in a protein) of the ensembles.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
- bins: Union[str, int]
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
- return_bins: bool, optional
If True, returns the number of bins used in the calculation.
- return_scores: bool, optional
If True, returns a tuple with (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
Returns
- avg_score: float
The average JSD score across the F features.
- If return_scores=True:
- (avg_score, all_scores): Tuple[float, np.ndarray]
The average score and an array of JSD scores of shape (F,).
- If return_bins=True:
- (avg_score, num_bins): Tuple[float, int]
The average score and the number of bins used.
- If both return_scores and return_bins are True:
- ((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]
The average score, array of per-feature scores, and number of bins used.
- idpet.comparison.score_avg_2d_angle_jsd(array_1, array_2, bins, return_scores=False, return_bins=False, *args, **kwargs)[source]
Takes as input two (*, F, 2) bidimensional feature matrices and computes an average JSD score over all F bidimensional features by discretizing them in 2d histograms. The features in this function are assumed to be angles whose values range from -math.pi to math.pi. For example, in the score_ramaJSD function the F features represent the phi-psi values of F residues in a protein of length L=F+2 (the first and last residues don't have both phi and psi values).
Parameters
- array_1, array_2: np.ndarray
NumPy arrays of shape (*, F, 2) containing samples from F bi-dimensional distributions to be compared.
- bins: Union[int, str], optional
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be -math.pi to math.pi. Note that the effective number of bins used in the function will be the square of the number returned by dpet.comparison.get_num_comparison_bins, since a 2d histogram is built.
- return_bins: bool, optional
If True, returns the square root of the effective number of bins used in the calculation.
Returns
- results: Union[float, Tuple[float, np.ndarray]]
If return_bins is False, only returns a float value for the JSD score. The score ranges from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins. If return_scores is True, it will also return the F scores used to compute the average JSD score.
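The 2d discretization described above can be sketched with numpy (an illustrative sketch, not the idpet implementation; `jsd_2d_angles` is a hypothetical helper and the angle arrays are synthetic):

```python
import numpy as np

def jsd_2d_angles(a1, a2, n_bins=10):
    """JSD between two samples of 2-D angles (e.g. phi/psi pairs),
    discretized on a shared (n_bins x n_bins) grid over [-pi, pi]^2.
    The effective number of bins is n_bins**2."""
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    p_h, _, _ = np.histogram2d(a1[:, 0], a1[:, 1], bins=[edges, edges])
    q_h, _, _ = np.histogram2d(a2[:, 0], a2[:, 1], bins=[edges, edges])
    p = (p_h / p_h.sum()).ravel()
    q = (q_h / q_h.sum()).ravel()
    m = 0.5 * (p + q)
    kld = lambda x: np.sum(x[x > 0] * np.log(x[x > 0] / m[x > 0]))
    return 0.5 * kld(p) + 0.5 * kld(q)

rng = np.random.default_rng(7)
# Two synthetic (n_samples, F=3, 2) angle arrays; average over the F features.
arr_1 = rng.uniform(-np.pi, np.pi, size=(800, 3, 2))
arr_2 = rng.uniform(-np.pi, np.pi, size=(800, 3, 2))
per_feature = [jsd_2d_angles(arr_1[:, f, :], arr_2[:, f, :]) for f in range(3)]
avg_score = float(np.mean(per_feature))
```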
- idpet.comparison.score_histogram_jsd(p_data, q_data, limits, bins='auto', return_bins=False)[source]
Scores an approximation of the Jensen-Shannon divergence by discretizing in a histogram the values of the two 1d samples provided as input.
- Return type:
Union[float, Tuple[float, np.ndarray]]
Parameters
- p_data, q_data: np.ndarray
NumPy arrays of shape (*, ) containing samples from the two mono-dimensional distributions to be compared.
- limits: Union[str, Tuple[int]]
Defines the method used to calculate the minimum and maximum values of the range spanned by the bins. Accepted values are:
- "m": uses the minimum and maximum values observed by concatenating the samples in p_data and q_data.
- "p": uses the minimum and maximum values observed in p_data. If q_data contains values outside that range, new bins of the same size will be added to cover all values of q. Currently, this is not used in any IDPET functionality. Note that the bins argument will determine only the bins originally spanned by p_data.
- "a": limits for scoring angular features; uses a (-math.pi, math.pi) range.
- (float, float): provides a custom range. Currently not used in any IDPET functionality.
- bins: Union[int, str], optional
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be defined by the limits argument.
- return_bins: bool, optional
If True, returns the bins used in the calculation.
Returns
- results: Union[float, Tuple[float, np.ndarray]]
If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins.
- idpet.comparison.score_ramaJSD(ens_1, ens_2, bins, return_scores=False, return_bins=False)[source]
Utility function to calculate the ramaJSD (Ramachandran plot average JSD) score between two ensembles. The score evaluates the divergence between the distributions of phi-psi torsion angles of every residue in the ensembles.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
- bins: Union[str, int]
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
- return_scores: bool, optional
If True, returns a tuple with (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
- return_bins: bool, optional
If True, returns the number of bins used in the calculation.
Returns
- avg_score: float
The average JSD score across the F features.
- If return_scores=True:
- (avg_score, all_scores): Tuple[float, np.ndarray]
The average score and an array of JSD scores of shape (F,).
- If return_bins=True:
- (avg_score, num_bins): Tuple[float, int]
The average score and the number of bins used.
- If both return_scores and return_bins are True:
- ((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]
The average score, array of per-feature scores, and number of bins used.