ensemble_comparison

idpet.comparison.all_vs_all_comparison(ensembles, score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]

Compare all pairs of ensembles using divergence scores. The implemented scores are approximate average Jensen–Shannon divergences (JSD) over several kinds of molecular features. The lower these scores are, the higher the similarity between the probability distributions of the features of the ensembles. JSD scores here range from a minimum of 0 to a maximum of log(2) ~= 0.6931.

Return type:

dict

Parameters

ensembles: List[Ensemble]

Ensemble objects to analyze.

score: str

Type of score used to compare ensembles. Choices: adaJSD (carbon Alpha Distance Average JSD), ramaJSD (Ramachandran Average JSD) and ataJSD (Alpha Torsion Average JSD). adaJSD scores the average JSD over all Ca-Ca distance distributions of residue pairs with sequence separation > 1. ramaJSD scores the average JSD over the phi-psi angle distributions of all residues. ataJSD scores the average JSD over all alpha torsion angles, which are the angles formed by four consecutive Ca atoms in a protein.

featurization_params: dict, optional

Optional dictionary to customize the featurization process for the above features.

bootstrap_iters: int, optional

Number of bootstrap iterations. By default its value is None. In this case, IDPET will directly compare each pair of ensembles $i$ and $j$ using all of their conformers and perform the comparison only once. If an integer value is provided, each pair of ensembles $i$ and $j$ will instead be compared bootstrap_iters times by randomly selecting (bootstrapping) conformations from them. Additionally, each ensemble will be compared with itself by subsampling conformers via bootstrapping. IDPET will then perform a statistical test to establish whether the inter-ensemble ($i != j$) scores are significantly different from the intra-ensemble ($i == j$) scores. The test works as follows: for each ensemble pair $i != j$, IDPET collects the inter-ensemble comparison scores obtained in bootstrapping. It then collects the bootstrapping scores from the auto-comparisons of ensembles $i$ and $j$, and the scores with the higher mean are selected as the reference intra-ensemble scores. Finally, the inter-ensemble and intra-ensemble scores are compared via a one-sided Mann-Whitney U test with the alternative hypothesis that the inter-ensemble scores are stochastically greater than the intra-ensemble scores. The p-values obtained in these tests are also returned. For small protein structural ensembles (fewer than 500 conformations), most comparison scores in IDPET are not robust estimators of divergence/distance. By performing bootstrapping, you can get an idea of how the size of your ensembles impacts the comparison. Use values >= 50 when comparing ensembles with very few conformations (fewer than 100). When comparing large ensembles (more than 1,000-5,000 conformations) you can safely avoid bootstrapping.

bootstrap_frac: float, optional

Fraction of the total conformations to sample when bootstrapping. The default value is 1.0, which results in bootstrap samples with the same number of conformations as the original ensemble.

bootstrap_replace: bool, optional

If True, bootstrap will sample with replacement. Default is True.

bins: Union[int, str], optional

Number of bins or bin assignment rule for JSD comparisons. See the documentation of idpet.comparison.get_num_comparison_bins for more information.

random_seed: int, optional

Random seed used when performing bootstrapping.

verbose: bool, optional

If True, some information about the comparisons will be printed to stdout.

Returns

results: dict

A dictionary containing the following key-value pairs:

scores: a (M, M, B) NumPy array storing the comparison scores, where M is the number of ensembles being compared and B is the number of bootstrap iterations (B will be 1 if bootstrapping was not performed).

p_values: a (M, M) NumPy array storing the p-values obtained in the statistical test performed when using a bootstrapping strategy (see the bootstrap_iters argument). Returned only when performing a bootstrapping strategy.
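The bootstrap comparison and significance test described for the bootstrap_iters argument can be sketched as follows. This is a minimal illustration with NumPy/SciPy, not the IDPET implementation: `bootstrap_scores` is a hypothetical helper and the score function is a stand-in for a real divergence score.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

def bootstrap_scores(score_fn, xs_i, xs_j, iters=50, frac=1.0, replace=True):
    """Score two conformer samples `iters` times on bootstrap subsamples."""
    scores = []
    for _ in range(iters):
        sub_i = xs_i[rng.choice(len(xs_i), int(frac * len(xs_i)), replace=replace)]
        sub_j = xs_j[rng.choice(len(xs_j), int(frac * len(xs_j)), replace=replace)]
        scores.append(score_fn(sub_i, sub_j))
    return np.array(scores)

# Stand-in divergence score: absolute difference of the sample means.
score_fn = lambda a, b: abs(a.mean() - b.mean())

ens_i = rng.normal(0.0, 1.0, size=500)  # hypothetical 1d feature per conformer
ens_j = rng.normal(0.5, 1.0, size=500)

inter = bootstrap_scores(score_fn, ens_i, ens_j)    # inter-ensemble (i != j)
intra_i = bootstrap_scores(score_fn, ens_i, ens_i)  # auto-comparisons (i == j)
intra_j = bootstrap_scores(score_fn, ens_j, ens_j)
# Reference intra-ensemble scores: the auto-comparison with the higher mean.
intra = intra_i if intra_i.mean() > intra_j.mean() else intra_j

# One-sided test: are inter-ensemble scores stochastically greater?
u_stat, p_value = mannwhitneyu(inter, intra, alternative="greater")
```

With clearly distinct ensembles, as here, the resulting p-value is small; for bootstrap subsamples of the same ensemble it would typically not be.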

idpet.comparison.calc_freqs(x, bins)[source]
idpet.comparison.calc_jsd(p_h, q_h)[source]

Calculates the JSD between distributions p and q. p_h: histogram frequencies for sample p. q_h: histogram frequencies for sample q.

idpet.comparison.calc_kld_for_jsd(x_h, m_h)[source]

Calculates the KLD between distributions x and m. x_h: histogram frequencies for sample p or q. m_h: histogram frequencies for m = 0.5*(p+q).
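The decomposition these two helpers rely on, JSD(p, q) = 0.5*KLD(p||m) + 0.5*KLD(q||m) with m = 0.5*(p+q), can be written directly from histogram frequencies. A minimal NumPy sketch of that math (not the library code):

```python
import numpy as np

def calc_kld_for_jsd(x_h, m_h):
    """KLD between a distribution x and the mixture m = 0.5*(p + q).

    x_h, m_h: normalized histogram frequencies.
    """
    mask = x_h > 0  # bins with zero frequency contribute 0 to the sum
    return np.sum(x_h[mask] * np.log(x_h[mask] / m_h[mask]))

def calc_jsd(p_h, q_h):
    """Jensen-Shannon divergence from histogram frequencies of p and q."""
    m_h = 0.5 * (p_h + q_h)
    return 0.5 * calc_kld_for_jsd(p_h, m_h) + 0.5 * calc_kld_for_jsd(q_h, m_h)

same = np.array([0.25, 0.25, 0.25, 0.25])
disjoint_p = np.array([0.5, 0.5, 0.0, 0.0])
disjoint_q = np.array([0.0, 0.0, 0.5, 0.5])
print(calc_jsd(same, same))              # 0.0: identical distributions
print(calc_jsd(disjoint_p, disjoint_q))  # log(2) ~= 0.6931: no common support
```

The two extremes match the score range stated above: 0 for identical distributions, log(2) for distributions with no common support.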

idpet.comparison.check_feature_matrices(func)[source]
idpet.comparison.confidence_interval(theta_boot, theta_hat=None, confidence_level=0.95, method='percentile')[source]

Returns bootstrap confidence intervals. Adapted from: https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_resampling.py

idpet.comparison.get_adaJSD_matrix(ens_1, ens_2, bins='auto', return_bins=False, featurization_params={}, *args, **kwargs)[source]

Utility function to calculate the adaJSD score between two ensembles and return a matrix with JSD scores for each pair of Ca-Ca distances.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]

Two Ensemble objects storing the ensemble data to compare.

return_bins: bool, optional

If True, also return the histogram bin edges used in the comparison.

*args, **kwargs

Additional arguments passed to idpet.comparison.score_adaJSD.

Returns

score: float

The overall adaJSD score between the two ensembles.

jsd_matrix: np.ndarray of shape (N, N)

Matrix containing JSD scores for each Ca-Ca distance pair, where N is the number of residues.

bin_edges: np.ndarray, optional

Returned only if return_bins=True. The bin edges used in histogram comparisons.

idpet.comparison.get_ataJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]

Utility function to calculate the ataJSD score between two ensembles and return a profile with JSD scores for each alpha angle in the proteins.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]

Two Ensemble objects storing the ensemble data to compare.

return_bins: bool, optional

If True, also return the histogram bin edges used in the comparison.

**remaining

Additional arguments passed to idpet.comparison.score_ataJSD.

Returns

score: float

The overall ataJSD score between the two ensembles.

jsd_profile: np.ndarray of shape (N - 3,)

JSD scores for the individual alpha torsion angles, where N is the number of residues in the protein.

bin_edges: np.ndarray, optional

Returned only if return_bins=True. The bin edges used in histogram comparisons.
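The alpha torsion angles that ataJSD compares are dihedrals over four consecutive Ca atoms, which is why a protein with N residues yields an (N - 3,) profile. A minimal sketch of that geometry with NumPy (`alpha_torsions` is a hypothetical helper, not part of the library):

```python
import numpy as np

def alpha_torsions(xyz):
    """Alpha torsion (dihedral) angles over four consecutive Ca atoms.

    xyz: array of shape (n_conformers, N, 3) with Ca coordinates.
    Returns an (n_conformers, N - 3) array of angles in (-pi, pi].
    """
    b1 = xyz[:, 1:-2] - xyz[:, :-3]  # bond vectors between consecutive Ca atoms
    b2 = xyz[:, 2:-1] - xyz[:, 1:-2]
    b3 = xyz[:, 3:] - xyz[:, 2:-1]
    n1 = np.cross(b1, b2)            # normals to the two planes of the dihedral
    n2 = np.cross(b2, b3)
    b2_hat = b2 / np.linalg.norm(b2, axis=-1, keepdims=True)
    m1 = np.cross(n1, b2_hat)
    x = np.sum(n1 * n2, axis=-1)
    y = np.sum(m1 * n2, axis=-1)
    return np.arctan2(y, x)

rng = np.random.default_rng(0)
xyz = rng.normal(size=(5, 10, 3))  # 5 hypothetical conformers, 10 residues
angles = alpha_torsions(xyz)       # shape (5, 7): N - 3 angles per conformer
```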

idpet.comparison.get_num_comparison_bins(bins, x=None)[source]

Get the number of bins to be used in a comparison between two ensembles using a histogram-based score (such as a JSD approximation).

Parameters

bins: Union[str, int]

Determines the number of bins to be used. When providing an int, the same value will simply be returned. When providing a string, the following rules will be applied to determine the bin value:

auto: applies the square root rule if the size of the smallest ensemble is < idpet.comparison.min_samples_auto_hist; if it is >= that value, returns idpet.comparison.num_default_bins.

sqrt: applies the square root rule using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Square-root_choice).

sturges: applies Sturges' formula using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Sturges’s_formula).

x: List[np.ndarray], optional

List of M feature matrices (one for each ensemble) of shape (N_i, *). The N_i values are the numbers of structures in each ensemble. The minimum N_i will be used to apply the bin assignment rule when the bins argument is a string.

Returns

num_bins: int

Number of bins.
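The square-root and Sturges rules referenced above can be sketched as follows. This is an illustration of the textbook formulas; the exact rounding used by IDPET is an assumption.

```python
import math

def sqrt_rule(n):
    """Square-root choice: number of bins for a sample of size n."""
    return int(math.ceil(math.sqrt(n)))

def sturges_rule(n):
    """Sturges' formula: number of bins for a sample of size n."""
    return int(math.ceil(math.log2(n))) + 1

print(sqrt_rule(100))     # 10
print(sturges_rule(100))  # 8
```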

idpet.comparison.get_ramaJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]

Utility function to calculate the ramaJSD score between two ensembles and return a profile with JSD scores for the Ramachandran distributions of each pair of corresponding residues in the proteins.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]

Two Ensemble objects storing the ensemble data to compare.

return_bins: bool, optional

If True, also return the histogram bin edges used in the comparison.

**remaining

Additional arguments passed to idpet.comparison.score_ramaJSD.

Returns

score: float

The overall ramaJSD score between the two ensembles.

jsd_profile: np.ndarray of shape (N - 2,)

JSD scores for the Ramachandran distribution of each residue, where N is the number of residues in the protein.

bin_edges: np.ndarray, optional

Returned only if return_bins=True. The bin edges used in histogram comparisons.

idpet.comparison.percentile_func(a, q)[source]
idpet.comparison.process_all_vs_all_output(comparison_out, confidence_level=0.95)[source]

Takes as input a dictionary produced as output of the all_vs_all_comparison function. If a bootstrap analysis was performed in all_vs_all_comparison, this function will assign bootstrap confidence intervals.

idpet.comparison.score_adaJSD(ens_1, ens_2, bins='auto', return_bins=False, return_scores=False, featurization_params={}, *args, **kwargs)[source]

Utility function to calculate the adaJSD (carbon Alpha Distance Average JSD) score between two ensembles. The score evaluates the divergence between the distributions of the Ca-Ca distances of the ensembles.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]

Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.

bins: Union[str, int], optional

Determines the number of bins to be used when constructing histograms. See idpet.comparison.get_num_comparison_bins for more information.

return_bins: bool, optional

If True, returns the number of bins used in the calculation.

return_scores: bool, optional

If True, returns a tuple (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.

featurization_params: dict, optional

Optional dictionary to customize the featurization process to calculate Ca-Ca distances. See the Ensemble.get_features function for more information.

Returns

avg_score: float

The average JSD score across the F features.

If return_scores=True:
(avg_score, all_scores): Tuple[float, np.ndarray]

The average score and an array of JSD scores of shape (F,).

If return_bins=True:
(avg_score, num_bins): Tuple[float, int]

The average score and the number of bins used.

If both return_scores and return_bins are True:
((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]

The average score, array of per-feature scores, and number of bins used.
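The Ca-Ca distance featurization that adaJSD averages over (all residue pairs with sequence separation > 1) can be sketched as follows; `ca_pair_distances` is a hypothetical helper operating on raw coordinates, not the library's featurizer.

```python
import numpy as np

def ca_pair_distances(xyz, min_sep=2):
    """Ca-Ca distances for residue pairs with sequence separation > 1.

    xyz: array of shape (n_conformers, n_residues, 3) with Ca coordinates.
    Returns an (n_conformers, F) feature matrix, one column per retained pair.
    """
    n_res = xyz.shape[1]
    pairs = [(i, j) for i in range(n_res) for j in range(i + min_sep, n_res)]
    dists = [np.linalg.norm(xyz[:, i] - xyz[:, j], axis=-1) for i, j in pairs]
    return np.stack(dists, axis=1)

rng = np.random.default_rng(0)
xyz = rng.normal(size=(100, 5, 3))  # 100 hypothetical conformers, 5 residues
feats = ca_pair_distances(xyz)
# 5 residues with |i - j| >= 2 give 6 pairs: (0,2) (0,3) (0,4) (1,3) (1,4) (2,4)
```

The adaJSD score is then the average of the per-column 1d JSD scores between the feature matrices of the two ensembles.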

idpet.comparison.score_ataJSD(ens_1, ens_2, bins, return_bins=False, return_scores=False, *args, **kwargs)[source]

Utility function to calculate the ataJSD (Alpha Torsion Average JSD) score between two ensembles. The score evaluates the divergence between distributions of alpha torsion angles (the angles formed by four consecutive Ca atoms in a protein) of the ensembles.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]

Two Ensemble objects storing the ensemble data to compare.

Returns

avg_score: float

The average JSD score across the F features.

If return_scores=True:
(avg_score, all_scores): Tuple[float, np.ndarray]

The average score and an array of JSD scores of shape (F,).

If return_bins=True:
(avg_score, num_bins): Tuple[float, int]

The average score and the number of bins used.

If both return_scores and return_bins are True:
((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]

The average score, array of per-feature scores, and number of bins used.

idpet.comparison.score_avg_2d_angle_jsd(array_1, array_2, bins, return_scores=False, return_bins=False, *args, **kwargs)[source]

Takes as input two (*, F, 2) bidimensional feature matrices and computes an average JSD score over all F bidimensional features by discretizing them in 2d histograms. The features in this function are assumed to be angles whose values range from -math.pi to math.pi. For example, in the score_ramaJSD function the F features represent the phi-psi values of F residues in a protein of length L=F+2 (the first and last residues don’t have both phi and psi values).

Parameters

array_1, array_2: np.ndarray

NumPy arrays of shape (*, F, 2) containing samples from F bidimensional distributions to be compared.

bins: Union[int, str], optional

Determines the number of bins to be used when constructing histograms. See idpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be -math.pi to math.pi. Note that the effective number of bins used in the function will be the square of the number returned by idpet.comparison.get_num_comparison_bins, since we are building a 2d histogram.

return_bins: bool, optional

If True, returns the square root of the effective number of bins used in the calculation.

Returns

results: Union[float, Tuple[float, np.ndarray]]

If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins. If return_scores is True it will also return the F scores used to compute the average JSD score.
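The per-feature 2d-histogram procedure described above can be sketched with NumPy; `avg_2d_angle_jsd` and `jsd_from_freqs` are hypothetical stand-ins illustrating the technique, not the library implementation.

```python
import numpy as np

def jsd_from_freqs(p_h, q_h):
    """JSD from two normalized frequency arrays."""
    m_h = 0.5 * (p_h + q_h)
    kld = lambda x_h: np.sum(x_h[x_h > 0] * np.log(x_h[x_h > 0] / m_h[x_h > 0]))
    return 0.5 * kld(p_h) + 0.5 * kld(q_h)

def avg_2d_angle_jsd(array_1, array_2, n_bins=12):
    """Average JSD over the F angular features of two (*, F, 2) matrices."""
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)  # effective bins: n_bins**2
    scores = []
    for f in range(array_1.shape[1]):
        p_h, _, _ = np.histogram2d(array_1[:, f, 0], array_1[:, f, 1], bins=[edges, edges])
        q_h, _, _ = np.histogram2d(array_2[:, f, 0], array_2[:, f, 1], bins=[edges, edges])
        scores.append(jsd_from_freqs(p_h.ravel() / p_h.sum(), q_h.ravel() / q_h.sum()))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
angles_1 = rng.uniform(-np.pi, np.pi, size=(1000, 3, 2))  # 3 hypothetical phi-psi features
angles_2 = rng.uniform(-np.pi, np.pi, size=(1000, 3, 2))
same_score = avg_2d_angle_jsd(angles_1, angles_1)  # identical samples -> 0.0
diff_score = avg_2d_angle_jsd(angles_1, angles_2)  # same distribution -> small score
```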

idpet.comparison.score_avg_jsd(m1, m2, *args, **kwargs)[source]
idpet.comparison.score_histogram_jsd(p_data, q_data, limits, bins='auto', return_bins=False)[source]

Scores an approximation of the Jensen-Shannon divergence by discretizing into histograms the values of the two 1d samples provided as input.

Return type:

Union[float, Tuple[float, ndarray]]

Parameters

p_data, q_data: np.ndarray

NumPy arrays of shape (*, ) containing samples from the two one-dimensional distributions to be compared.

limits: Union[str, Tuple[float, float]]

Defines the method used to calculate the minimum and maximum values of the range spanned by the bins. Accepted values are:

“m”: will use the minimum and maximum values observed by concatenating the samples in p_data and q_data.

“p”: will use the minimum and maximum values observed in p_data. If q_data contains values outside that range, new bins of the same size will be added to cover all values of q. Currently, this is not used in any IDPET functionality. Note that the bins argument will determine only the bins originally spanned by p_data.

“a”: limits for scoring angular features. Will use a (-math.pi, math.pi) range for scoring such features.

(float, float): provide a custom range. Currently, not used in any IDPET functionality.

bins: Union[int, str], optional

Determines the number of bins to be used when constructing histograms. See idpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be defined by the limits argument.

return_bins: bool, optional

If True, returns the bins used in the calculation.

Returns

results: Union[float, Tuple[float, np.ndarray]]

If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins.
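The 1d histogram-based JSD approximation can be sketched as follows. This minimal illustration implements the “m” and “a” limits rules plus a custom range; the adaptive “p” rule is omitted, and `histogram_jsd` is a stand-in, not the library code.

```python
import numpy as np

def histogram_jsd(p_data, q_data, limits="m", bins=50):
    """Approximate JSD between two 1d samples via histograms on a shared range."""
    if limits == "m":    # pooled min/max of both samples
        lo = min(p_data.min(), q_data.min())
        hi = max(p_data.max(), q_data.max())
    elif limits == "a":  # angular features
        lo, hi = -np.pi, np.pi
    else:                # custom (float, float) range
        lo, hi = limits
    p_h, _ = np.histogram(p_data, bins=bins, range=(lo, hi))
    q_h, _ = np.histogram(q_data, bins=bins, range=(lo, hi))
    p_h = p_h / p_h.sum()
    q_h = q_h / q_h.sum()
    m_h = 0.5 * (p_h + q_h)
    kld = lambda x_h: np.sum(x_h[x_h > 0] * np.log(x_h[x_h > 0] / m_h[x_h > 0]))
    return 0.5 * kld(p_h) + 0.5 * kld(q_h)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 2000)
b = rng.normal(0.0, 1.0, 2000)   # same distribution as a -> score near 0
c = rng.normal(10.0, 1.0, 2000)  # no common support with a -> score near log(2)
score_same = histogram_jsd(a, b)
score_diff = histogram_jsd(a, c)
```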

idpet.comparison.score_ramaJSD(ens_1, ens_2, bins, return_scores=False, return_bins=False)[source]

Utility function to calculate the ramaJSD (Ramachandran plot average JSD) score between two ensembles. The score evaluates the divergence between the distributions of the phi-psi torsion angles of every residue in the ensembles.

Parameters

ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]

Two Ensemble objects storing the ensemble data to compare.

Returns

avg_score: float

The average JSD score across the F features.

If return_scores=True:
(avg_score, all_scores): Tuple[float, np.ndarray]

The average score and an array of JSD scores of shape (F,).

If return_bins=True:
(avg_score, num_bins): Tuple[float, int]

The average score and the number of bins used.

If both return_scores and return_bins are True:
((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]

The average score, array of per-feature scores, and number of bins used.

idpet.comparison.sqrt_rule(n)[source]
idpet.comparison.sturges_rule(n)[source]