ensemble_comparison
- idpet.comparison.all_vs_all_comparison(ensembles, score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]
Compare all pairs of ensembles using divergence scores. The implemented scores are approximate average Jensen–Shannon divergence (JSD) values over several kinds of molecular features. The lower these scores are, the higher the similarity between the probability distributions of the features of the ensembles. JSD scores here range from a minimum of 0 to a maximum of log(2) ~= 0.6931.
- Return type:
dict
Parameters
- ensembles: List[Ensemble]
Ensemble objects to analyze.
- score: str
Type of score used to compare ensembles. Choices: adaJSD (carbon Alpha Distance Average JSD), ramaJSD (Ramachandran average JSD) and ataJSD (Alpha Torsion Average JSD). adaJSD scores the average JSD over all Ca-Ca distance distributions of residue pairs with sequence separation > 1. ramaJSD scores the average JSD over the phi-psi angle distributions of all residues. ataJSD scores the average JSD over all alpha torsion angles, which are the angles formed by four consecutive Ca atoms in a protein.
- featurization_params: dict, optional
Optional dictionary to customize the featurization process for the above features.
- bootstrap_iters: int, optional
Number of bootstrap iterations. By default its value is None: in this case, IDPET will directly compare each pair of ensembles $i$ and $j$ using all of their conformers, performing the comparison only once. If an integer value is provided, each pair of ensembles $i$ and $j$ will instead be compared bootstrap_iters times by randomly selecting (bootstrapping) conformations from them. Additionally, each ensemble will be compared with itself by subsampling its conformers via bootstrapping. IDPET will then perform a statistical test to establish whether the inter-ensemble ($i != j$) scores are significantly different from the intra-ensemble ($i == j$) scores. The test works as follows: for each ensemble pair $i != j$, IDPET collects their inter-ensemble comparison scores obtained via bootstrapping. It then collects the bootstrapping scores from the auto-comparisons of ensembles $i$ and $j$; the set with the higher mean is selected as the reference intra-ensemble scores. Finally, the inter-ensemble and intra-ensemble scores are compared via a one-sided Mann-Whitney U test, with the alternative hypothesis being that inter-ensemble scores are stochastically greater than intra-ensemble scores. The p-values obtained in these tests are also returned. For small protein structural ensembles (fewer than 500 conformations), most comparison scores in IDPET are not robust estimators of divergence/distance; by performing bootstrapping, you can get an idea of how the size of your ensembles impacts the comparison. Use values >= 50 when comparing ensembles with very few conformations (fewer than 100). When comparing large ensembles (more than 1,000-5,000 conformations) you can safely avoid bootstrapping.
- bootstrap_frac: float, optional
Fraction of the total conformations to sample when bootstrapping. The default value is 1.0, which results in bootstrap samples with the same number of conformations as the original ensemble.
- bootstrap_replace: bool, optional
If True, bootstrap will sample with replacement. Default is True.
- bins: Union[int, str], optional
Number of bins or bin assignment rule for JSD comparisons. See the documentation of dpet.comparison.get_num_comparison_bins for more information.
- random_seed: int, optional
Random seed used when performing bootstrapping.
- verbose: bool, optional
If True, some information about the comparisons will be printed to stdout.
Returns
- results: dict
- A dictionary containing the following key-value pairs:
- scores: a (M, M, B) NumPy array storing the comparison scores, where M is the number of ensembles being compared and B is the number of bootstrap iterations (B will be 1 if bootstrapping was not performed).
- p_values: a (M, M) NumPy array storing the p-values obtained in the statistical test performed when using a bootstrapping strategy (see the bootstrap_iters argument). Returned only when bootstrapping is performed.
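The all-vs-all score matrix layout can be illustrated with a numpy-only toy sketch (this is not the idpet implementation; `hist_jsd` is a hypothetical helper, and the 1-D synthetic "ensembles" stand in for featurized conformers):

```python
import numpy as np

def hist_jsd(p_sample, q_sample, bins=50, limits=(0.0, 1.0)):
    """Approximate JSD between two 1-D samples via shared-bin histograms."""
    p_h, _ = np.histogram(p_sample, bins=bins, range=limits)
    q_h, _ = np.histogram(q_sample, bins=bins, range=limits)
    p = p_h / p_h.sum()
    q = q_h / q_h.sum()
    m = 0.5 * (p + q)
    def kld(x):
        mask = x > 0  # bins with zero frequency contribute nothing
        return np.sum(x[mask] * np.log(x[mask] / m[mask]))
    return 0.5 * kld(p) + 0.5 * kld(q)

rng = np.random.default_rng(0)
# Three toy "ensembles": 1-D feature samples in [0, 1].
ensembles = [
    rng.beta(2, 5, size=1000),
    rng.beta(2, 5, size=1000),   # same distribution as the first
    rng.beta(5, 2, size=1000),   # different distribution
]
M = len(ensembles)
scores = np.zeros((M, M))  # analogous to the (M, M, B) array with B = 1
for i in range(M):
    for j in range(M):
        scores[i, j] = hist_jsd(ensembles[i], ensembles[j])
```

Ensembles drawn from the same distribution get scores near 0, while dissimilar ones approach log(2), which is the pattern the Mann-Whitney test above probes when bootstrapping is enabled.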
- idpet.comparison.calc_jsd(p_h, q_h)[source]
Calculates the JSD between distributions p and q. p_h: histogram frequencies for sample p. q_h: histogram frequencies for sample q.
- idpet.comparison.calc_kld_for_jsd(x_h, m_h)[source]
Calculates the KLD between distributions x and m. x_h: histogram frequencies for sample p or q. m_h: histogram frequencies for m = 0.5*(p+q).
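The decomposition used by the two functions above, JSD(p, q) = 0.5*KLD(p||m) + 0.5*KLD(q||m) with m = 0.5*(p+q), can be sketched on histogram frequencies (a minimal numpy illustration, not the idpet source; the helpers are hypothetical):

```python
import numpy as np

def calc_kld_sketch(x_h, m_h):
    """KLD between a normalized histogram x_h and the mixture histogram m_h.
    Bins where x_h == 0 contribute nothing to the sum."""
    mask = x_h > 0
    return float(np.sum(x_h[mask] * np.log(x_h[mask] / m_h[mask])))

def calc_jsd_sketch(p_h, q_h):
    """JSD from two histograms over the same bins (normalized here)."""
    p = p_h / p_h.sum()
    q = q_h / q_h.sum()
    m = 0.5 * (p + q)
    return 0.5 * calc_kld_sketch(p, m) + 0.5 * calc_kld_sketch(q, m)

# Identical histograms -> 0; disjoint support -> log(2).
same = calc_jsd_sketch(np.array([1., 2., 3.]), np.array([1., 2., 3.]))
disjoint = calc_jsd_sketch(np.array([1., 0.]), np.array([0., 1.]))
```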
- idpet.comparison.confidence_interval(theta_boot, theta_hat=None, confidence_level=0.95, method='percentile')[source]
Returns bootstrap confidence intervals. Adapted from: https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_resampling.py
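The percentile method can be sketched as follows (a simplified illustration of the default method, assuming `theta_boot` holds bootstrap replicates of some statistic; it omits the bias-corrected variants the scipy code supports):

```python
import numpy as np

def percentile_ci(theta_boot, confidence_level=0.95):
    """Percentile bootstrap confidence interval from replicate estimates."""
    alpha = (1.0 - confidence_level) / 2.0
    low, high = np.quantile(theta_boot, [alpha, 1.0 - alpha])
    return low, high

rng = np.random.default_rng(42)
data = rng.normal(loc=0.3, scale=0.1, size=200)
# Bootstrap replicates of the sample mean.
boot = np.array([rng.choice(data, size=data.size, replace=True).mean()
                 for _ in range(1000)])
low, high = percentile_ci(boot, 0.95)
```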
- idpet.comparison.get_adaJSD_matrix(ens_1, ens_2, bins='auto', return_bins=False, featurization_params={}, *args, **kwargs)[source]
Utility function to calculate the adaJSD score between two ensembles and return a matrix with a JSD score for each Ca-Ca residue pair.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble objects storing the ensemble data to compare.
- return_bins: bool, optional
If True, also return the histogram bin edges used in the comparison.
- **remaining
Additional arguments passed to idpet.comparison.score_adaJSD.
Returns
- score: float
The overall adaJSD score between the two ensembles.
- jsd_matrix: np.ndarray of shape (N, N)
Matrix containing JSD scores for each Ca-Ca distance pair, where N is the number of residues.
- bin_edges: np.ndarray, optional
Returned only if return_bins=True. The bin edges used in histogram comparisons.
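The shape of the output matrix can be illustrated with a synthetic numpy sketch (not the idpet implementation; toy Gaussian "distance" samples replace real Ca-Ca distances, and `hist_jsd` is a hypothetical helper):

```python
import numpy as np
from itertools import combinations

def hist_jsd(p, q, bins=30):
    """Approximate JSD between two 1-D samples over shared bins."""
    lo, hi = min(p.min(), q.min()), max(p.max(), q.max())
    p_h, _ = np.histogram(p, bins=bins, range=(lo, hi))
    q_h, _ = np.histogram(q, bins=bins, range=(lo, hi))
    p_n, q_n = p_h / p_h.sum(), q_h / q_h.sum()
    m = 0.5 * (p_n + q_n)
    kld = lambda x: np.sum(x[x > 0] * np.log(x[x > 0] / m[x > 0]))
    return 0.5 * kld(p_n) + 0.5 * kld(q_n)

N = 6  # toy number of residues
# Residue pairs with sequence separation > 1, as in adaJSD.
pairs = [(i, j) for i, j in combinations(range(N), 2) if j - i > 1]
rng = np.random.default_rng(1)
# Synthetic Ca-Ca distance samples per pair, for two toy ensembles.
ens_1 = {p: rng.normal(5 + p[1] - p[0], 1.0, 500) for p in pairs}
ens_2 = {p: rng.normal(5 + p[1] - p[0], 1.0, 500) for p in pairs}

jsd_matrix = np.full((N, N), np.nan)  # NaN where no pair is scored
for (i, j) in pairs:
    s = hist_jsd(ens_1[(i, j)], ens_2[(i, j)])
    jsd_matrix[i, j] = jsd_matrix[j, i] = s
avg_score = np.nanmean(jsd_matrix)  # overall adaJSD-style average
```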
- idpet.comparison.get_ataJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]
Utility function to calculate the ataJSD score between two ensembles and return a profile with JSD scores for each alpha angle in the proteins.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble objects storing the ensemble data to compare.
- return_bins: bool, optional
If True, also return the histogram bin edges used in the comparison.
- **remaining
Additional arguments passed to dpet.comparison.score_ataJSD.
Returns
- score: float
The overall ataJSD score between the two ensembles.
- jsd_profile: np.ndarray of shape (N - 3,)
JSD scores for individual α backbone angles, where N is the number of residues in the protein.
- bin_edges: np.ndarray, optional
Returned only if return_bins=True. The bin edges used in histogram comparisons.
- idpet.comparison.get_num_comparison_bins(bins, x=None)[source]
Get the number of bins to be used in a comparison between two ensembles using a histogram-based score (such as a JSD approximation).
Parameters
- bins: Union[str, int]
Determines the number of bins to be used. When providing an int, the same value will simply be returned. When providing a string, the following rules are applied to determine the bin value:
- auto: applies the sqrt rule if the size of the smallest ensemble is < dpet.comparison.min_samples_auto_hist; if it is >= this value, returns dpet.comparison.num_default_bins.
- sqrt: applies the square-root rule using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Square-root_choice).
- sturges: applies Sturges' formula using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Sturges’s_formula).
- x: List[np.ndarray], optional
List of M feature matrices (one for each ensemble) of shape (N_i, *). The N_i values are the numbers of structures in each ensemble. The minimum N_i will be used to apply the bin assignment rule when the bins argument is a string.
Returns
- num_bins: int
Number of bins.
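The bin rules above can be sketched as follows (an illustration, not the idpet source; the constants `MIN_SAMPLES_AUTO_HIST` and `NUM_DEFAULT_BINS` are assumed placeholder values, and the actual dpet.comparison constants and rounding may differ):

```python
import math

# Hypothetical stand-ins for dpet.comparison.min_samples_auto_hist
# and dpet.comparison.num_default_bins.
MIN_SAMPLES_AUTO_HIST = 500
NUM_DEFAULT_BINS = 50

def num_comparison_bins(bins, sizes=None):
    """Resolve a bin count from an int or a rule name, using the
    size of the smallest ensemble, mirroring the documented behavior."""
    if isinstance(bins, int):
        return bins  # an int is simply returned
    n_min = min(sizes)
    if bins == "auto":
        if n_min < MIN_SAMPLES_AUTO_HIST:
            return math.ceil(math.sqrt(n_min))  # fall back to sqrt rule
        return NUM_DEFAULT_BINS
    if bins == "sqrt":
        return math.ceil(math.sqrt(n_min))
    if bins == "sturges":
        return math.ceil(math.log2(n_min)) + 1
    raise ValueError(f"unknown bin rule: {bins}")
```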
- idpet.comparison.get_ramaJSD_profile(ens_1, ens_2, bins, return_bins=False, *args, **kwargs)[source]
Utility function to calculate the ramaJSD score between two ensembles and return a profile with JSD scores for the Ramachandran distributions of each pair of corresponding residues in the proteins.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble objects storing the ensemble data to compare.
- return_bins: bool, optional
If True, also return the histogram bin edges used in the comparison.
- **remaining
Additional arguments passed to dpet.comparison.score_ramaJSD.
Returns
- score: float
The overall ramaJSD score between the two ensembles.
- jsd_profile: np.ndarray of shape (N - 2,)
JSD scores for the Ramachandran distribution of each residue, where N is the number of residues in the protein.
- bin_edges: np.ndarray, optional
Returned only if return_bins=True. The bin edges used in histogram comparisons.
- idpet.comparison.process_all_vs_all_output(comparison_out, confidence_level=0.95)[source]
Takes as input a dictionary produced as output of the all_vs_all_comparison function. If a bootstrap analysis was performed in all_vs_all_comparison, this function will assign bootstrap confidence intervals.
- idpet.comparison.score_adaJSD(ens_1, ens_2, bins='auto', return_bins=False, return_scores=False, featurization_params={}, *args, **kwargs)[source]
Utility function to calculate the adaJSD (carbon Alpha Distance Average JSD) score between two ensembles. The score evaluates the divergence between the distributions of Ca-Ca distances in the ensembles.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
- bins: Union[str, int], optional
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
- return_bins: bool, optional
If True, returns the number of bins used in the calculation.
- return_scores: bool, optional
If True, returns a tuple (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
- featurization_params: dict, optional
Optional dictionary to customize the featurization process to calculate Ca-Ca distances. See the Ensemble.get_features function for more information.
Returns
- avg_score: float
The average JSD score across the F features.
- If return_scores=True:
- (avg_score, all_scores): Tuple[float, np.ndarray]
The average score and an array of JSD scores of shape (F,).
- If return_bins=True:
- (avg_score, num_bins): Tuple[float, int]
The average score and the number of bins used.
- If both return_scores and return_bins are True:
- ((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]
The average score, array of per-feature scores, and number of bins used.
- idpet.comparison.score_ataJSD(ens_1, ens_2, bins, return_bins=False, return_scores=False, *args, **kwargs)[source]
Utility function to calculate the ataJSD (Alpha Torsion Average JSD) score between two ensembles. The score evaluates the divergence between distributions of alpha torsion angles (the angles formed by four consecutive Ca atoms in a protein) of the ensembles.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
- bins: Union[str, int]
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
- return_bins: bool, optional
If True, returns the number of bins used in the calculation.
- return_scores: bool, optional
If True, returns a tuple with (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
Returns
- avg_score: float
The average JSD score across the F features.
- If return_scores=True:
- (avg_score, all_scores): Tuple[float, np.ndarray]
The average score and an array of JSD scores of shape (F,).
- If return_bins=True:
- (avg_score, num_bins): Tuple[float, int]
The average score and the number of bins used.
- If both return_scores and return_bins are True:
- ((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]
The average score, array of per-feature scores, and number of bins used.
- idpet.comparison.score_avg_2d_angle_jsd(array_1, array_2, bins, return_scores=False, return_bins=False, *args, **kwargs)[source]
Takes as input two (*, F, 2) bidimensional feature matrices and computes an average JSD score over all F bidimensional features by discretizing them in 2d histograms. The features in this function are assumed to be angles whose values range from -math.pi to math.pi. For example, in the score_ramaJSD function the F features represent the phi-psi values of F residues in a protein of length L=F+2 (the first and last residues don't have both phi and psi values).
Parameters
- array_1, array_2: np.ndarray
NumPy arrays of shape (*, F, 2) containing samples from F bi-dimensional distributions to be compared.
- bins: Union[int, str], optional
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be -math.pi to math.pi. Note that the effective number of bins used in the function will be the square of the number returned by dpet.comparison.get_num_comparison_bins, since a 2d histogram is built.
- return_bins: bool, optional
If True, returns the square root of the effective number of bins used in the calculation.
Returns
- results: Union[float, Tuple[float, np.ndarray]]
If return_bins is False, only returns a float value for the JSD score. The score ranges from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins. If return_scores is True, it will also return the F scores used to compute the average JSD score.
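The 2d discretization described above can be sketched with numpy (an illustrative sketch, not the idpet implementation; `jsd_2d_angles` is a hypothetical helper and the angle arrays are synthetic):

```python
import numpy as np

def jsd_2d_angles(a1, a2, n_bins=10):
    """JSD between two samples of 2-D angles (e.g. phi/psi pairs),
    discretized on a shared (n_bins x n_bins) grid over [-pi, pi]^2.
    The effective number of bins is n_bins**2."""
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    p_h, _, _ = np.histogram2d(a1[:, 0], a1[:, 1], bins=[edges, edges])
    q_h, _, _ = np.histogram2d(a2[:, 0], a2[:, 1], bins=[edges, edges])
    p = (p_h / p_h.sum()).ravel()
    q = (q_h / q_h.sum()).ravel()
    m = 0.5 * (p + q)
    kld = lambda x: np.sum(x[x > 0] * np.log(x[x > 0] / m[x > 0]))
    return 0.5 * kld(p) + 0.5 * kld(q)

rng = np.random.default_rng(7)
# Two synthetic (n_samples, F=3, 2) angle arrays; average over the F features.
arr_1 = rng.uniform(-np.pi, np.pi, size=(800, 3, 2))
arr_2 = rng.uniform(-np.pi, np.pi, size=(800, 3, 2))
per_feature = [jsd_2d_angles(arr_1[:, f, :], arr_2[:, f, :]) for f in range(3)]
avg_score = float(np.mean(per_feature))
```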
- idpet.comparison.score_histogram_jsd(p_data, q_data, limits, bins='auto', return_bins=False)[source]
Scores an approximation of the Jensen-Shannon divergence by discretizing in a histogram the values of the two 1d samples provided as input.
- Return type:
Union[float, Tuple[float, np.ndarray]]
Parameters
- p_data, q_data: np.ndarray
NumPy arrays of shape (*, ) containing samples from the two mono-dimensional distributions to be compared.
- limits: Union[str, Tuple[int]]
Defines the method used to calculate the minimum and maximum values of the range spanned by the bins. Accepted values are:
- "m": uses the minimum and maximum values observed by concatenating the samples in p_data and q_data.
- "p": uses the minimum and maximum values observed in p_data. If q_data contains values outside that range, new bins of the same size will be added to cover all values of q. Currently, this is not used in any IDPET functionality. Note that the bins argument will determine only the bins originally spanned by p_data.
- "a": limits for scoring angular features; uses a (-math.pi, math.pi) range.
- (float, float): provides a custom range. Currently not used in any IDPET functionality.
- bins: Union[int, str], optional
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be defined by the limits argument.
- return_bins: bool, optional
If True, returns the bins used in the calculation.
Returns
- results: Union[float, Tuple[float, np.ndarray]]
If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins.
- idpet.comparison.score_ramaJSD(ens_1, ens_2, bins, return_scores=False, return_bins=False)[source]
Utility function to calculate the ramaJSD (Ramachandran plot average JSD) score between two ensembles. The score evaluates the divergence between the distributions of phi-psi torsion angles of every residue in the ensembles.
Parameters
- ens_1, ens_2: Union[Ensemble, mdtraj.Trajectory]
Two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
- bins: Union[str, int]
Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
- return_scores: bool, optional
If True, returns a tuple with (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
- return_bins: bool, optional
If True, returns the number of bins used in the calculation.
Returns
- avg_score: float
The average JSD score across the F features.
- If return_scores=True:
- (avg_score, all_scores): Tuple[float, np.ndarray]
The average score and an array of JSD scores of shape (F,).
- If return_bins=True:
- (avg_score, num_bins): Tuple[float, int]
The average score and the number of bins used.
- If both return_scores and return_bins are True:
- ((avg_score, all_scores), num_bins): Tuple[Tuple[float, np.ndarray], int]
The average score, array of per-feature scores, and number of bins used.