ensemble_analysis

EnsembleAnalysis class

class idpet.ensemble_analysis.EnsembleAnalysis(ensembles, output_dir=None)

Bases: object

Data analysis pipeline for ensemble data.

Initializes with a list of ensemble objects and a directory path for storing data.

Parameters

ensembles: List[Ensemble]: List of ensembles.
output_dir: str, optional: Directory path for storing data. If not provided, a directory named ${HOME}/.idpet/data will be created.
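
The default-directory behaviour described above can be pictured along these lines. This is a minimal sketch, not idpet's own code: `resolve_output_dir` and its `create` flag are hypothetical, and how the library actually expands `${HOME}` is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(output_dir: Optional[str] = None, create: bool = False) -> Path:
    """Hypothetical helper: return the directory used for storing data.

    Falls back to ~/.idpet/data when no directory is given, mirroring the
    documented default. Creation is opt-in here so that the function can be
    called without touching the filesystem.
    """
    if output_dir:
        path = Path(output_dir).expanduser()
    else:
        path = Path.home() / ".idpet" / "data"
    if create:
        path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling `resolve_output_dir()` with no argument yields the `.idpet/data` folder under the user's home directory; passing an explicit path returns that path unchanged.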

comparison_scores(score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)

Compare all pairs of ensembles using divergence/distance scores. See dpet.comparison.all_vs_all_comparison for more information.

Return type: Tuple[ndarray, List[str]]
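
The bootstrap arguments in the signature (bootstrap_frac, bootstrap_replace, random_seed) suggest that each iteration scores a resampled subset of conformations. The sketch below shows one plausible way such a resample could be drawn. `bootstrap_indices` is a hypothetical helper, and the rounding and bounds checks are assumptions; the actual logic lives in the comparison module referenced above.

```python
import random
from typing import List, Optional


def bootstrap_indices(n_frames: int, frac: float = 1.0, replace: bool = True,
                      seed: Optional[int] = None) -> List[int]:
    """Hypothetical helper: draw one bootstrap resample of frame indices."""
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    rng = random.Random(seed)
    size = max(1, int(n_frames * frac))  # assumed rounding rule
    if replace:
        # Sampling with replacement: the same frame may be drawn repeatedly.
        return [rng.randrange(n_frames) for _ in range(size)]
    # Sampling without replacement: every index appears at most once.
    return rng.sample(range(n_frames), size)
```

With replace=False and frac=1.0 every resample is just a permutation of the same frames, so a fraction below 1 is the meaningful setting in that case.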

property ens_codes: List[str]

Get the ensemble codes.

Returns

List[str]: A list of ensemble codes.

execute_pipeline(featurization_params, reduce_dim_params, subsample_size=None)

Execute the data analysis pipeline end-to-end. The pipeline includes:

Download from database (optional)
Generate trajectories
Randomly sample a number of conformations from trajectories (optional)
Perform feature extraction
Perform dimensionality reduction

Parameters

featurization_params: Dict: Parameters for feature extraction. The only required parameter is “featurization”, which can be “phi_psi”, “ca_dist”, “a_angle”, “tr_omega” or “tr_phi”. Other method-specific parameters are optional.
reduce_dim_params: Dict: Parameters for dimensionality reduction. The only required parameter is “method”, which can be “pca”, “tsne” or “kpca”.
subsample_size: int, optional: Optional parameter that specifies the trajectory subsample size. Default is None.
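
As an illustration of how the two dictionaries fit together, the sketch below validates the required keys and then forwards them to execute_pipeline. `run_pipeline` is a hypothetical wrapper, not part of idpet; the allowed values are copied from the parameter descriptions above, and `analysis` is assumed to be an already constructed EnsembleAnalysis instance.

```python
from typing import Any, Dict, Optional

# Values taken from the parameter descriptions above.
FEATURIZATIONS = {"phi_psi", "ca_dist", "a_angle", "tr_omega", "tr_phi"}
REDUCTION_METHODS = {"pca", "tsne", "kpca"}


def run_pipeline(analysis: Any,
                 featurization_params: Dict[str, Any],
                 reduce_dim_params: Dict[str, Any],
                 subsample_size: Optional[int] = None) -> None:
    """Hypothetical wrapper: check the required keys, then run the pipeline."""
    if featurization_params.get("featurization") not in FEATURIZATIONS:
        raise ValueError("featurization_params needs a valid 'featurization' key")
    if reduce_dim_params.get("method") not in REDUCTION_METHODS:
        raise ValueError("reduce_dim_params needs a valid 'method' key")
    # Forward the arguments using the documented signature.
    analysis.execute_pipeline(
        featurization_params=featurization_params,
        reduce_dim_params=reduce_dim_params,
        subsample_size=subsample_size,
    )
```

A typical call might look like `run_pipeline(analysis, {"featurization": "ca_dist"}, {"method": "pca"}, subsample_size=500)`.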

exists_coarse_grained()

Check if at least one of the loaded ensembles is coarse-grained after loading trajectories.

Return type: bool

Returns

bool: True if at least one ensemble is coarse-grained, False otherwise.

extract_features(featurization, normalize=False, *args, **kwargs)

Extract the selected feature.

Return type: Dict[str, ndarray]

Parameters

featurization: str: Choose between “phi_psi”, “ca_dist”, “a_angle”, “tr_omega”, “tr_phi”, “rmsd”.
normalize: bool, optional: Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.
min_sep: int or None, optional: Minimum separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
max_sep: int, optional: Maximum separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.

Returns

Dict[str, np.ndarray]: A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.
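
The effect of min_sep and max_sep can be illustrated by enumerating the residue pairs they would select. The sketch assumes that "separation" means separation along the sequence (as stated for get_features), that both bounds are inclusive, and that max_sep=None means no upper limit; `residue_pairs` is a hypothetical helper, not an idpet function.

```python
from typing import List, Optional, Tuple


def residue_pairs(n_residues: int, min_sep: int = 2,
                  max_sep: Optional[int] = None) -> List[Tuple[int, int]]:
    """Hypothetical helper: list index pairs (i, j) with i < j whose
    sequence separation j - i lies in [min_sep, max_sep]."""
    # No upper bound means the largest possible separation in the chain.
    upper = n_residues - 1 if max_sep is None else max_sep
    return [(i, j)
            for i in range(n_residues)
            for j in range(i + min_sep, min(i + upper, n_residues - 1) + 1)]
```

For a five-residue chain with the default min_sep=2, this keeps six pairs and drops all pairs of sequence neighbours; adding max_sep=3 also removes the end-to-end pair (0, 4).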

property features: Dict[str, ndarray]

Get the features associated with each ensemble.

Returns

Dict[str, np.ndarray]: A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.

get_features(featurization, normalize=False, *args, **kwargs)

Extract features for each ensemble without modifying any fields in the EnsembleAnalysis class.

Return type: Dict[str, ndarray]

Parameters

featurization: str: The type of featurization to be applied. Supported options are “phi_psi”, “tr_omega”, “tr_phi”, “ca_dist”, “a_angle”, “rg”, “prolateness”, “asphericity”, “sasa”, “end_to_end” and “flory_exponent”.
min_sep: int, optional: Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
max_sep: int or None, optional: Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.
normalize: bool, optional: Whether to normalize the extracted features. Normalization is only supported when featurization is “ca_dist”. Default is False.

Returns

Dict[str, np.ndarray]: A dictionary containing the extracted features for each ensemble, where the keys are ensemble IDs and the values are NumPy arrays containing the features.

Raises

ValueError: If featurization is not supported; if normalization is requested for a featurization method other than “ca_dist”; if normalization is requested and features from ensembles have different sizes; or if coarse-grained models are used with featurization methods that require atomistic detail.
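
The error conditions listed above can be read as a pre-flight check. The sketch below mirrors three of them; `check_feature_request` is a hypothetical function written for illustration, and the coarse-grained case is left out because this page does not say which methods require atomistic detail.

```python
from typing import Dict

# Supported options as listed for get_features.
SUPPORTED = {"phi_psi", "tr_omega", "tr_phi", "ca_dist", "a_angle", "rg",
             "prolateness", "asphericity", "sasa", "end_to_end", "flory_exponent"}


def check_feature_request(featurization: str, normalize: bool,
                          feature_sizes: Dict[str, int]) -> None:
    """Hypothetical check mirroring the documented ValueError cases.

    feature_sizes maps each ensemble ID to the number of features it yields.
    """
    if featurization not in SUPPORTED:
        raise ValueError(f"Unsupported featurization: {featurization!r}")
    if normalize and featurization != "ca_dist":
        raise ValueError("Normalization is only supported for 'ca_dist'")
    if normalize and len(set(feature_sizes.values())) > 1:
        raise ValueError("Cannot normalize: ensembles have different feature sizes")
```

The size check matters because ensembles of proteins with different lengths produce distance vectors of different lengths, which cannot share one set of normalization statistics.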

get_features_summary_dataframe(selected_features=['rg', 'asphericity', 'prolateness', 'sasa', 'end_to_end', 'flory_exponent'], show_variability=True)

Create a summary DataFrame for each ensemble.

The DataFrame includes the ensemble code and the average for each feature.

Return type: DataFrame

Parameters

selected_features: List[str], optional: List of feature extraction methods to be used for summarizing the ensembles. Default is [“rg”, “asphericity”, “prolateness”, “sasa”, “end_to_end”, “flory_exponent”].
show_variability: bool, optional: If True, include a column with a measure of variability for each feature (e.g., standard deviation or error).

Returns

pd.DataFrame: DataFrame containing the summary statistics (average and std) for each feature in each ensemble.

Raises

ValueError: If any feature in the selected_features is not a supported feature extraction method.
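
The kind of table this method describes can be sketched with the standard library alone. The column names (`ensemble`, `<feature>_mean`, `<feature>_std`) and the use of the sample standard deviation are assumptions made for illustration; the real DataFrame layout may differ.

```python
import statistics
from typing import Dict, List, Sequence


def summarize(features: Dict[str, Dict[str, Sequence[float]]],
              show_variability: bool = True) -> List[dict]:
    """Hypothetical helper: one row per ensemble, one mean (and optionally
    one standard deviation) per feature.

    features maps ensemble code -> feature name -> per-conformation values.
    """
    rows = []
    for code, per_feature in features.items():
        row = {"ensemble": code}
        for name, values in per_feature.items():
            row[f"{name}_mean"] = statistics.fmean(values)
            if show_variability:
                # Sample standard deviation; needs at least two values.
                row[f"{name}_std"] = statistics.stdev(values)
        rows.append(row)
    return rows
```

A list of row dictionaries like this can be passed directly to `pandas.DataFrame` to obtain a tabular summary.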

load_trajectories()

Load trajectories for all ensembles.

This method iterates over each ensemble in the ensembles list and downloads data files if they are not already available. Trajectories are then loaded for each ensemble.

Return type: Dict[str, Trajectory]

Returns

Dict[str, mdtraj.Trajectory]: A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.

Note

This method assumes that the output_dir attribute of the class specifies the directory where trajectory files will be saved or extracted.

random_sample_trajectories(sample_size)

Randomly sample a defined number of conformations from the ensemble trajectory.

Parameters

sample_size: int: Number of conformations sampled from the ensemble.
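
Subsampling of this kind amounts to picking frame indices at random. The sketch below assumes sampling without replacement and a sample size no larger than the trajectory, neither of which is stated on this page; `sample_frame_indices` and its `seed` argument are hypothetical.

```python
import random
from typing import List, Optional


def sample_frame_indices(n_frames: int, sample_size: int,
                         seed: Optional[int] = None) -> List[int]:
    """Hypothetical helper: choose sample_size distinct frame indices."""
    if sample_size > n_frames:
        raise ValueError("sample_size exceeds the number of frames")
    rng = random.Random(seed)
    # Sorting keeps the selected frames in their original trajectory order.
    return sorted(rng.sample(range(n_frames), sample_size))
```

The returned indices could then be used to select the corresponding frames from a loaded trajectory.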

property reduce_dim_data: Dict[str, ndarray]

Get the transformed data associated with each ensemble.

Returns

Dict[str, np.ndarray]: A dictionary where keys are ensemble IDs and values are the corresponding transformed (dimensionality-reduced) arrays.

reduce_features(method, fit_on=None, *args, **kwargs)

Perform dimensionality reduction on the extracted features.

Return type: ndarray

Parameters

method: str: Choose between “pca”, “tsne”, “kpca” and “umap”.
fit_on: List[str], optional: If method is “pca” or “kpca”, specifies the ensembles on which the model should be fit. The fitted model is then used to transform all ensembles.

Additional Parameters

The following optional parameters apply based on the selected reduction method:

pca:
- n_components: int, optional
  Number of components to keep. Default is 10.
tsne:
- perplexity_vals: List[float], optional
  The perplexity is related to the number of nearest neighbors that are used in the manifold learning. It can be interpreted as a smooth measure of the effective number of neighbors for each point. Typical values range from 5 to 50. Choosing a value too small may make the data appear too clustered, while a value too large may cause different clusters to merge. See also https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html
- metric: str, optional
  Metric to use. Default is “euclidean”.
- circular: bool, optional
  Whether to use circular metrics. Default is False.
- n_components: int, optional
  Number of dimensions of the embedded space. Default is 2.
- learning_rate: float, optional
  Learning rate. Default is 100.0.
- range_n_clusters: List[int], optional
  Highly disordered ensembles typically do not form more than ~10 distinct, visually separable clusters, so exploring more than 10 clusters is usually unnecessary. Users can still modify this parameter based on their specific datasets and research questions. Default is range(2, 10, 1).
- random_state: int, optional
  Random seed for sklearn.
umap:
- n_neighbors: List[int], optional
  List of number of neighbors. Default is [15].
- min_dist: float, optional
  Minimum distance between points in the embedded space. Default is 0.1.
- circular: bool, optional
  Whether to use circular metrics. Default is False.
- n_components: int, optional
  Number of dimensions of the embedded space. Default is 2.
- metric: str, optional
  Metric to use. Default is “euclidean”.
- random_state: int, optional
  Random seed for sklearn.
- range_n_clusters: List[int], optional
  Highly disordered ensembles typically do not form more than ~10 distinct, visually separable clusters. Therefore, exploring more than 10 clusters is usually unnecessary. But users can modify this parameter based on their specific datasets and research questions. Default is range(2, 10, 1).
kpca:
- circular: bool, optional
  Whether to use circular metrics. Default is False.
- n_components: int, optional
  Number of components to keep. Default is 10.
- gamma: float, optional
  Kernel coefficient. Default is None.
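
Why a circular option exists for angular features can be seen by comparing a plain Euclidean distance with a periodicity-aware one. The chord distance below is one common choice and is an assumption here; this page does not specify which circular metric idpet uses.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Ordinary Euclidean distance, which ignores angle periodicity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def circular_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Chord distance between angle vectors (radians) on the unit circle.

    Each angle pair contributes 2 - 2*cos(x - y), the squared length of the
    chord between the two points, so angles near -pi and +pi count as close.
    """
    return math.sqrt(sum(2.0 - 2.0 * math.cos(x - y) for x, y in zip(a, b)))
```

Two dihedral angles just either side of the ±π boundary are almost identical conformationally, yet their Euclidean distance is close to 2π; the circular version reports them as near neighbours.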

Returns

np.ndarray: The transformed data.
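
The fit_on behaviour (fit a model on some ensembles, then transform all of them) follows a general pattern that can be shown without any reduction library. In the sketch below a trivial mean-centering model stands in for PCA or kernel PCA; `MeanCenterer` and `fit_on_subset` are hypothetical names, and treating fit_on=None as "fit on every ensemble" is an assumption.

```python
from typing import Dict, List, Optional, Sequence

Matrix = List[List[float]]


class MeanCenterer:
    """Trivial stand-in for a reduction model: learns per-feature means."""

    def fit(self, rows: Matrix) -> "MeanCenterer":
        n = len(rows)
        self.means_ = [sum(col) / n for col in zip(*rows)]
        return self

    def transform(self, rows: Matrix) -> Matrix:
        return [[x - m for x, m in zip(row, self.means_)] for row in rows]


def fit_on_subset(features: Dict[str, Matrix],
                  fit_on: Optional[Sequence[str]] = None) -> Dict[str, Matrix]:
    """Fit on the chosen ensembles only, then transform every ensemble."""
    codes = list(fit_on) if fit_on else list(features)
    # Pool the conformations of the selected ensembles as training data.
    training = [row for code in codes for row in features[code]]
    model = MeanCenterer().fit(training)
    return {code: model.transform(rows) for code, rows in features.items()}
```

Fitting on a reference ensemble and projecting the others into the same space makes their coordinates directly comparable; with scikit-learn, the stand-in class would be replaced by an estimator such as `sklearn.decomposition.PCA`, which offers the same fit/transform interface.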

For more information on each method, see the corresponding documentation:

property trajectories: Dict[str, Trajectory]

Get the trajectories associated with each ensemble.

Returns

Dict[str, mdtraj.Trajectory]: A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.