ensemble_analysis

EnsembleAnalysis class

class idpet.ensemble_analysis.EnsembleAnalysis(ensembles, output_dir=None)[source]

Bases: object

Data analysis pipeline for ensemble data.

Initializes with a list of ensemble objects and a directory path for storing data.

Parameters

ensembles: List[Ensemble]

List of ensembles.

output_dir: str, optional

Directory path for storing data. If not provided, a directory named ${HOME}/.idpet/data will be created.

comparison_scores(score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]

Compare all pairs of ensembles using divergence/distance scores. See dpet.comparison.all_vs_all_comparison for more information.

Return type:

Tuple[ndarray, List[str]]
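
The bootstrap_frac, bootstrap_replace and random_seed arguments suggest that each bootstrap iteration resamples conformations before scoring. The following is a minimal standard-library sketch of that resampling idea only; the helper name bootstrap_indices is hypothetical and this is not the library's implementation.

```python
import random
from typing import List, Optional


def bootstrap_indices(n_frames: int, frac: float = 1.0, replace: bool = True,
                      seed: Optional[int] = None) -> List[int]:
    """Pick conformation indices for one bootstrap iteration (illustrative only)."""
    rng = random.Random(seed)
    # Number of conformations to draw, as a fraction of the ensemble size.
    k = max(1, int(n_frames * frac))
    if replace:
        # Sampling with replacement: an index may appear more than once.
        return rng.choices(range(n_frames), k=k)
    # Sampling without replacement: every index is unique.
    return rng.sample(range(n_frames), k)


idx_with = bootstrap_indices(100, frac=0.5, replace=True, seed=0)
idx_without = bootstrap_indices(100, frac=0.5, replace=False, seed=0)
```

Fixing the seed makes a resampling run repeatable, which is presumably the purpose of random_seed.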

property ens_codes: List[str]

Get the ensemble codes.

Returns

List[str]

A list of ensemble codes.

execute_pipeline(featurization_params, reduce_dim_params, subsample_size=None)[source]

Execute the data analysis pipeline end-to-end. The pipeline includes:
  1. Download from database (optional)

  2. Generate trajectories

  3. Randomly sample a number of conformations from trajectories (optional)

  4. Perform feature extraction

  5. Perform dimensionality reduction

Parameters

featurization_params: Dict

Parameters for feature extraction. The only required parameter is “featurization”, which can be “phi_psi”, “ca_dist”, “a_angle”, “tr_omega” or “tr_phi”. Other method-specific parameters are optional.

reduce_dim_params: Dict

Parameters for dimensionality reduction. The only required parameter is “method”, which can be “pca”, “tsne” or “kpca”.

subsample_size: int, optional

Optional parameter that specifies the trajectory subsample size. Default is None.
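
As a rough illustration of the two dictionaries described above, the sketch below builds them and checks the required keys against the values listed in this section. The helper check_pipeline_params is hypothetical, and passing “min_sep” and “n_components” as extra keys is an assumption based on the method-specific parameters documented for extract_features and reduce_features.

```python
# Allowed values as listed in the execute_pipeline parameter descriptions.
FEATURIZATIONS = {"phi_psi", "ca_dist", "a_angle", "tr_omega", "tr_phi"}
REDUCTION_METHODS = {"pca", "tsne", "kpca"}


def check_pipeline_params(featurization_params: dict, reduce_dim_params: dict) -> None:
    """Hypothetical pre-flight check; not part of the library."""
    featurization = featurization_params.get("featurization")
    if featurization not in FEATURIZATIONS:
        raise ValueError(f"Unsupported featurization: {featurization!r}")
    method = reduce_dim_params.get("method")
    if method not in REDUCTION_METHODS:
        raise ValueError(f"Unsupported reduction method: {method!r}")


# Required key plus one assumed optional, method-specific key in each dict.
featurization_params = {"featurization": "ca_dist", "min_sep": 2}
reduce_dim_params = {"method": "pca", "n_components": 10}
check_pipeline_params(featurization_params, reduce_dim_params)

# With an EnsembleAnalysis instance named `analysis`, the call would then be:
# analysis.execute_pipeline(featurization_params, reduce_dim_params, subsample_size=100)
```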

exists_coarse_grained()[source]

Check if at least one of the loaded ensembles is coarse-grained after loading trajectories.

Return type:

bool

Returns

bool

True if at least one ensemble is coarse-grained, False otherwise.

extract_features(featurization, normalize=False, *args, **kwargs)[source]

Extract the selected feature.

Return type:

Dict[str, ndarray]

Parameters

featurization: str

Choose between “phi_psi”, “ca_dist”, “a_angle”, “tr_omega”, “tr_phi”, “rmsd”.

normalize: bool, optional

Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.

min_sep: int or None, optional

Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.

max_sep: int, optional

Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.

Returns

Dict[str, np.ndarray]

A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.
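
To make the min_sep and max_sep parameters concrete, the sketch below enumerates the residue pairs that a pairwise feature such as “ca_dist” could cover. It assumes both bounds are inclusive and that separation means the difference between residue indices; the library's exact convention is not stated here, and residue_pairs is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def residue_pairs(n_residues: int, min_sep: int = 2,
                  max_sep: Optional[int] = None) -> List[Tuple[int, int]]:
    """List (i, j) pairs with min_sep <= j - i <= max_sep (assumed inclusive)."""
    pairs = []
    for i in range(n_residues):
        for j in range(i + min_sep, n_residues):
            if max_sep is not None and j - i > max_sep:
                break  # larger j only increases the separation
            pairs.append((i, j))
    return pairs


all_pairs = residue_pairs(5, min_sep=2)
capped_pairs = residue_pairs(5, min_sep=2, max_sep=3)
```

Under these assumptions, a 5-residue chain with min_sep=2 yields six pairs, and capping max_sep at 3 drops the pair (0, 4).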

property features: Dict[str, ndarray]

Get the features associated with each ensemble.

Returns

Dict[str, np.ndarray]

A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.

get_features(featurization, normalize=False, *args, **kwargs)[source]

Extract features for each ensemble without modifying any fields in the EnsembleAnalysis class.

Return type:

Dict[str, ndarray]

Parameters:

featurization: str

The type of featurization to be applied. Supported options are “phi_psi”, “tr_omega”, “tr_phi”, “ca_dist”, “a_angle”, “rg”, “prolateness”, “asphericity”, “sasa”, “end_to_end” and “flory_exponent”.

min_sep: int, optional

Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.

max_sep: int or None, optional

Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.

normalize: bool, optional

Whether to normalize the extracted features. Normalization is only supported when featurization is “ca_dist”. Default is False.

Returns:

Dict[str, np.ndarray]

A dictionary containing the extracted features for each ensemble, where the keys are ensemble IDs and the values are NumPy arrays containing the features.

Raises:

ValueError:

If featurization is not supported; if normalization is requested for a featurization method other than “ca_dist”; if normalization is requested and features from ensembles have different sizes; or if coarse-grained models are used with featurization methods that require atomistic detail.

get_features_summary_dataframe(selected_features=['rg', 'asphericity', 'prolateness', 'sasa', 'end_to_end', 'flory_exponent'], show_variability=True)[source]

Create a summary DataFrame for each ensemble.

The DataFrame includes the ensemble code and the average for each feature.

Return type:

DataFrame

Parameters

selected_features: List[str], optional

List of feature extraction methods to be used for summarizing the ensembles. Default is [“rg”, “asphericity”, “prolateness”, “sasa”, “end_to_end”, “flory_exponent”].

show_variability: bool, optional

If True, include a column with a measurement of variability for each feature (e.g., standard deviation or error).

Returns

pd.DataFrame

DataFrame containing the summary statistics (average and std) for each feature in each ensemble.

Raises

ValueError

If any feature in the selected_features is not a supported feature extraction method.
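
The kind of table this method describes (one row per ensemble, an average per feature, and optionally a variability column) can be sketched with the standard library alone. The values and column names below are made up for illustration, and the use of the population standard deviation is an assumption; the real method returns a pandas DataFrame.

```python
import statistics
from typing import Dict, List

# Toy per-conformation values (invented for illustration only).
per_ensemble: Dict[str, Dict[str, List[float]]] = {
    "ens_A": {"rg": [1.0, 2.0, 3.0], "end_to_end": [4.0, 6.0]},
    "ens_B": {"rg": [2.0, 2.0, 2.0], "end_to_end": [5.0, 5.0]},
}


def summarize(data: Dict[str, Dict[str, List[float]]],
              show_variability: bool = True) -> List[dict]:
    """Build one summary row per ensemble (hypothetical column names)."""
    rows = []
    for code, feats in data.items():
        row = {"ensemble_code": code}
        for name, values in feats.items():
            row[f"{name}_mean"] = statistics.fmean(values)
            if show_variability:
                # Population standard deviation, chosen here as an assumption.
                row[f"{name}_std"] = statistics.pstdev(values)
        rows.append(row)
    return rows


summary = summarize(per_ensemble)
```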

load_trajectories()[source]

Load trajectories for all ensembles.

This method iterates over each ensemble in the ensembles list and downloads data files if they are not already available. Trajectories are then loaded for each ensemble.

Return type:

Dict[str, Trajectory]

Returns

Dict[str, mdtraj.Trajectory]

A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.

Note

This method assumes that the output_dir attribute of the class specifies the directory where trajectory files will be saved or extracted.

random_sample_trajectories(sample_size)[source]

Randomly sample a defined number of conformations from each ensemble trajectory.

Parameters

sample_size: int

Number of conformations sampled from the ensemble.
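
Conceptually, subsampling amounts to choosing sample_size frame indices at random. The sketch below assumes sampling without replacement and uses a hypothetical helper, sample_frame_indices; it does not reproduce how the library selects frames.

```python
import random
from typing import List, Optional


def sample_frame_indices(n_frames: int, sample_size: int,
                         seed: Optional[int] = None) -> List[int]:
    """Choose sample_size distinct frame indices, returned in ascending order."""
    rng = random.Random(seed)
    # random.sample draws without replacement and raises ValueError
    # if sample_size exceeds n_frames.
    return sorted(rng.sample(range(n_frames), sample_size))


chosen = sample_frame_indices(n_frames=1000, sample_size=10, seed=42)
```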

property reduce_dim_data: Dict[str, ndarray]

Get the transformed data associated with each ensemble.

Returns

Dict[str, np.ndarray]

A dictionary where keys are ensemble IDs and values are the corresponding transformed (dimensionality-reduced) arrays.

reduce_features(method, fit_on=None, *args, **kwargs)[source]

Perform dimensionality reduction on the extracted features.

Return type:

ndarray

Parameters

method: str

Choose between “pca”, “tsne”, “kpca” and “umap”.

fit_on: List[str], optional

If method is “pca” or “kpca”, specifies the ensembles on which the model should be fit. The fitted model is then used to transform all ensembles.

Additional Parameters

The following optional parameters apply based on the selected reduction method:

  • pca:
    • n_components: int, optional

      Number of components to keep. Default is 10.

  • tsne:
    • perplexity_vals: List[float], optional

      List of perplexity values. Default is range(2, 10, 2).

    • metric: str, optional

      Metric to use. Default is “euclidean”.

    • circular: bool, optional

      Whether to use circular metrics. Default is False.

    • n_components: int, optional

      Number of dimensions of the embedded space. Default is 2.

    • learning_rate: float, optional

      Learning rate. Default is 100.0.

    • range_n_clusters: List[int], optional

      Range of cluster values. Default is range(2, 10, 1).

    • random_state: int, optional

      Random seed for sklearn.

  • umap:
    • n_neighbors: List[int], optional

      List of number of neighbors. Default is [15].

    • min_dist: float, optional

      Minimum distance between points in the embedded space. Default is 0.1.

    • circular: bool, optional

      Whether to use circular metrics. Default is False.

    • n_components: int, optional

      Number of dimensions of the embedded space. Default is 2.

    • metric: str, optional

      Metric to use. Default is “euclidean”.

    • random_state: int, optional

      Random seed for sklearn.

    • range_n_clusters: List[int], optional

      Range of cluster values. Default is range(2, 10, 1).

  • kpca:
    • circular: bool, optional

      Whether to use circular metrics. Default is False.

    • n_components: int, optional

      Number of components to keep. Default is 10.

    • gamma: float, optional

      Kernel coefficient. Default is None.

Returns

np.ndarray

Returns the transformed data.
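
The fit_on behaviour (fit a model on a subset of ensembles, then transform every ensemble with it) can be sketched without any numerical library. In the sketch below a trivial mean-centering model stands in for PCA or kernel PCA, so only the fit/transform pattern is shown; the function names are hypothetical.

```python
from typing import Dict, List, Optional

Rows = List[List[float]]


def fit_mean(rows: Rows) -> List[float]:
    """'Fit' step: per-feature mean of the training rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def center(rows: Rows, mean: List[float]) -> Rows:
    """'Transform' step: subtract the fitted mean from each row."""
    return [[x - m for x, m in zip(row, mean)] for row in rows]


def fit_transform_all(features: Dict[str, Rows],
                      fit_on: Optional[List[str]] = None) -> Dict[str, Rows]:
    # Fit only on the selected ensembles (all of them when fit_on is None) ...
    codes = fit_on if fit_on is not None else list(features)
    training = [row for code in codes for row in features[code]]
    mean = fit_mean(training)
    # ... then apply the same fitted model to every ensemble.
    return {code: center(rows, mean) for code, rows in features.items()}


toy = {"a": [[0.0, 0.0], [2.0, 2.0]], "b": [[4.0, 4.0]]}
fitted_on_a = fit_transform_all(toy, fit_on=["a"])
fitted_on_all = fit_transform_all(toy)
```

Fitting on ensemble "a" alone places "b" relative to the centre of "a", whereas fitting on all ensembles shifts the reference point toward the pooled data.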

For more information on each method, see the corresponding documentation.

property trajectories: Dict[str, Trajectory]

Get the trajectories associated with each ensemble.

Returns

Dict[str, mdtraj.Trajectory]

A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.