ensemble_analysis
EnsembleAnalysis class
- class idpet.ensemble_analysis.EnsembleAnalysis(ensembles, output_dir=None)[source]
Bases: object
Data analysis pipeline for ensemble data.
Initializes with a list of ensemble objects and a directory path for storing data.
Parameters
- ensembles: List[Ensemble]
List of ensembles.
- output_dir: str, optional
Directory path for storing data. If not provided, a directory named ${HOME}/.idpet/data will be created.
- comparison_scores(score, featurization_params={}, bootstrap_iters=None, bootstrap_frac=1.0, bootstrap_replace=True, bins=50, random_seed=None, verbose=False)[source]
Compare all pairs of ensembles using divergence/distance scores. See dpet.comparison.all_vs_all_comparison for more information.
- Return type:
Tuple[ndarray, List[str]]
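The bootstrap arguments can be read as a resampling scheme over the conformations of each ensemble. The following is a minimal sketch, assuming that bootstrap_frac is the fraction of conformations drawn per iteration and that bootstrap_replace toggles sampling with replacement. The helper bootstrap_indices is hypothetical and not part of idpet; see the comparison module for the actual behaviour.

```python
import random


def bootstrap_indices(n_frames, iters, frac=1.0, replace=True, seed=None):
    """Yield one list of frame indices per bootstrap iteration (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: the sample size is a fraction of the available conformations.
    size = max(1, int(n_frames * frac))
    for _ in range(iters):
        if replace:
            # Sampling with replacement: indices may repeat.
            yield [rng.randrange(n_frames) for _ in range(size)]
        else:
            # Sampling without replacement: indices are unique.
            yield rng.sample(range(n_frames), size)


draws = list(bootstrap_indices(n_frames=100, iters=3, frac=0.5, seed=0))
unique_draws = list(
    bootstrap_indices(n_frames=100, iters=2, frac=0.5, replace=False, seed=0)
)
```

Each element of draws would select the conformations used to recompute the score in one iteration.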
- execute_pipeline(featurization_params, reduce_dim_params, subsample_size=None)[source]
- Execute the data analysis pipeline end-to-end. The pipeline includes:
Download from database (optional)
Generate trajectories
Randomly sample a number of conformations from trajectories (optional)
Perform feature extraction
Perform dimensionality reduction
Parameters
- featurization_params: Dict
Parameters for feature extraction. The only required parameter is “featurization”, which can be “phi_psi”, “ca_dist”, “a_angle”, “tr_omega” or “tr_phi”. Other method-specific parameters are optional.
- reduce_dim_params: Dict
Parameters for dimensionality reduction. The only required parameter is “method”, which can be “pca”, “tsne” or “kpca”.
- subsample_size: int, optional
Optional parameter that specifies the trajectory subsample size. Default is None.
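The listed steps map onto methods documented in this class. The following is a rough sketch of that ordering, assuming the required keys "featurization" and "method" are taken from the two dictionaries and the remaining entries are forwarded as keyword arguments. Whether execute_pipeline is implemented exactly this way is an assumption. run_pipeline_steps and the recording stub are hypothetical and only serve to make the example runnable without idpet.

```python
def run_pipeline_steps(analysis, featurization_params, reduce_dim_params,
                       subsample_size=None):
    """Call the documented methods in the order listed for execute_pipeline."""
    # Copy so the caller's dictionaries are not modified.
    feat_kwargs = dict(featurization_params)
    red_kwargs = dict(reduce_dim_params)
    featurization = feat_kwargs.pop("featurization")  # required key
    method = red_kwargs.pop("method")                 # required key

    analysis.load_trajectories()
    if subsample_size is not None:
        analysis.random_sample_trajectories(subsample_size)
    analysis.extract_features(featurization, **feat_kwargs)
    return analysis.reduce_features(method, **red_kwargs)


class _RecordingStub:
    """Hypothetical stand-in that records calls; not part of idpet."""

    def __init__(self):
        self.calls = []

    def load_trajectories(self):
        self.calls.append("load_trajectories")

    def random_sample_trajectories(self, sample_size):
        self.calls.append(("random_sample_trajectories", sample_size))

    def extract_features(self, featurization, **kwargs):
        self.calls.append(("extract_features", featurization, kwargs))

    def reduce_features(self, method, **kwargs):
        self.calls.append(("reduce_features", method, kwargs))
        return "reduced"


stub = _RecordingStub()
result = run_pipeline_steps(
    stub,
    {"featurization": "ca_dist", "min_sep": 2},
    {"method": "pca", "n_components": 2},
    subsample_size=100,
)
```

With a real EnsembleAnalysis instance in place of the stub, the same dictionaries could be passed directly to execute_pipeline.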
- exists_coarse_grained()[source]
Check if at least one of the loaded ensembles is coarse-grained after loading trajectories.
- Return type:
bool
Returns
- bool
True if at least one ensemble is coarse-grained, False otherwise.
- extract_features(featurization, normalize=False, *args, **kwargs)[source]
Extract the selected feature.
- Return type:
Dict[str, ndarray]
Parameters
- featurization: str
Choose between “phi_psi”, “ca_dist”, “a_angle”, “tr_omega”, “tr_phi”, “rmsd”.
- normalize: bool, optional
Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.
- min_sep: int or None, optional
Minimum separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
- max_sep: int, optional
Maximum separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.
Returns
- Dict[str, np.ndarray]
A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.
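To illustrate how min_sep and max_sep restrict the residue pairs that enter a pairwise feature such as “ca_dist”, the following sketch enumerates the selected pairs. It assumes that separation is measured as the difference in residue index, that both bounds are inclusive, and that min_sep is an integer. These are assumptions, not confirmed details of the library, and residue_pairs is a hypothetical helper.

```python
def residue_pairs(n_residues, min_sep=2, max_sep=None):
    """Return (i, j) pairs with i < j whose sequence separation is within bounds."""
    pairs = []
    for i in range(n_residues):
        # Start at the first residue that is at least min_sep positions away.
        for j in range(i + min_sep, n_residues):
            # Assumption: max_sep is an inclusive upper bound; None means no bound.
            if max_sep is not None and j - i > max_sep:
                break
            pairs.append((i, j))
    return pairs


pairs = residue_pairs(5, min_sep=2, max_sep=3)
# pairs == [(0, 2), (0, 3), (1, 3), (1, 4), (2, 4)]
```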
- property features: Dict[str, ndarray]
Get the features associated with each ensemble.
Returns
- Dict[str, np.ndarray]
A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.
- get_features(featurization, normalize=False, *args, **kwargs)[source]
Extract features for each ensemble without modifying any fields in the EnsembleAnalysis class.
- Return type:
Dict[str, ndarray]
Parameters:
- featurization: str
The type of featurization to be applied. Supported options are “phi_psi”, “tr_omega”, “tr_phi”, “ca_dist”, “a_angle”, “rg”, “prolateness”, “asphericity”, “sasa”, “end_to_end” and “flory_exponent”.
- min_sep: int, optional
Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
- max_sep: int or None, optional
Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.
- normalize: bool, optional
Whether to normalize the extracted features. Normalization is only supported when featurization is “ca_dist”. Default is False.
Returns:
- Dict[str, np.ndarray]
A dictionary containing the extracted features for each ensemble, where the keys are ensemble IDs and the values are NumPy arrays containing the features.
Raises:
- ValueError:
If featurization is not supported; if normalization is requested for a featurization method other than “ca_dist”; if normalization is requested and features from ensembles have different sizes; or if coarse-grained models are used with featurization methods that require atomistic detail.
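Because get_features raises ValueError in several distinct situations, a caller looping over many featurizations may prefer to record the failures rather than stop at the first one. The following is a minimal sketch of that pattern. It relies only on the get_features call documented above. try_featurizations and the stub class are hypothetical, and the stub merely mimics two of the documented error conditions.

```python
def try_featurizations(analysis, featurizations, **kwargs):
    """Collect features per method, recording ValueError messages instead of stopping."""
    results, errors = {}, {}
    for name in featurizations:
        try:
            results[name] = analysis.get_features(name, **kwargs)
        except ValueError as exc:
            errors[name] = str(exc)
    return results, errors


class _StubAnalysis:
    """Hypothetical stand-in mimicking part of the documented ValueError behaviour."""

    def get_features(self, featurization, normalize=False, **kwargs):
        if featurization not in {"ca_dist", "rg"}:
            raise ValueError(f"unsupported featurization: {featurization}")
        if normalize and featurization != "ca_dist":
            raise ValueError("normalization is only supported for ca_dist")
        return {"ens_1": [0.0]}


results, errors = try_featurizations(
    _StubAnalysis(), ["ca_dist", "rg", "bogus"], normalize=True
)
```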
- get_features_summary_dataframe(selected_features=['rg', 'asphericity', 'prolateness', 'sasa', 'end_to_end', 'flory_exponent'], show_variability=True)[source]
Create a summary DataFrame for each ensemble.
The DataFrame includes the ensemble code and the average for each feature.
- Return type:
DataFrame
Parameters
- selected_features: List[str], optional
List of feature extraction methods to be used for summarizing the ensembles. Default is [“rg”, “asphericity”, “prolateness”, “sasa”, “end_to_end”, “flory_exponent”].
- show_variability: bool, optional
If True, include a column with a measurement of variability for each feature (e.g., standard deviation or error).
Returns
- pd.DataFrame
DataFrame containing the summary statistics (average and std) for each feature in each ensemble.
Raises
- ValueError
If any feature in the selected_features is not a supported feature extraction method.
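The summary described above amounts to one row per ensemble, with an average and optionally a variability column per feature. The following sketch shows that reduction on toy data using only the standard library. The column names, the use of the sample standard deviation, and the summary_rows helper are assumptions for illustration; the actual DataFrame layout may differ.

```python
from statistics import mean, stdev


def summary_rows(per_feature, show_variability=True):
    """Build one summary row per ensemble from {feature: {ensemble_id: values}}."""
    ensemble_ids = sorted({e for values in per_feature.values() for e in values})
    rows = []
    for ens in ensemble_ids:
        row = {"ensemble": ens}
        for feat, values in per_feature.items():
            row[f"{feat}_mean"] = mean(values[ens])
            if show_variability:
                # Assumption: variability is reported as the sample standard deviation.
                row[f"{feat}_std"] = stdev(values[ens])
        rows.append(row)
    return rows


toy = {"rg": {"A": [1.0, 2.0, 3.0], "B": [2.0, 2.0, 2.0]}}
rows = summary_rows(toy)
```

A list of dictionaries like rows can be passed to pandas.DataFrame to obtain a table of the same shape.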
- load_trajectories()[source]
Load trajectories for all ensembles.
This method iterates over each ensemble in the ensembles list and downloads data files if they are not already available. Trajectories are then loaded for each ensemble.
- Return type:
Dict[str, Trajectory]
Returns
- Dict[str, mdtraj.Trajectory]
A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.
Note
This method assumes that the output_dir attribute of the class specifies the directory where trajectory files will be saved or extracted.
- random_sample_trajectories(sample_size)[source]
Randomly sample a defined number of conformations from each ensemble trajectory.
Parameters
- sample_size: int
Number of conformations sampled from the ensemble.
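Conceptually, subsampling selects a subset of frame indices from a trajectory. The following sketch assumes sampling without replacement and a fixed seed for reproducibility. Both are assumptions about the behaviour, and sample_frame_indices is a hypothetical helper rather than an idpet function.

```python
import random


def sample_frame_indices(n_frames, sample_size, seed=None):
    """Pick sample_size distinct frame indices out of n_frames, in ascending order."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so every index is unique.
    return sorted(rng.sample(range(n_frames), sample_size))


idx = sample_frame_indices(1000, 10, seed=42)
```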
- property reduce_dim_data: Dict[str, ndarray]
Get the transformed data associated with each ensemble.
Returns
- Dict[str, np.ndarray]
A dictionary where keys are ensemble IDs and values are the corresponding transformed (dimensionality-reduced) arrays.
- reduce_features(method, fit_on=None, *args, **kwargs)[source]
Perform dimensionality reduction on the extracted features.
- Return type:
ndarray
Parameters
- method: str
Choose between “pca”, “tsne”, “kpca” and “umap”.
- fit_on: List[str], optional
If method is “pca” or “kpca”, specifies the ensembles on which the model should be fit. The fitted model is then used to transform all ensembles.
Additional Parameters
The following optional parameters apply based on the selected reduction method:
- pca:
- n_components: int, optional
Number of components to keep. Default is 10.
- tsne:
- perplexity_vals: List[float], optional
List of perplexity values. Default is range(2, 10, 2).
- metric: str, optional
Metric to use. Default is “euclidean”.
- circular: bool, optional
Whether to use circular metrics. Default is False.
- n_components: int, optional
Number of dimensions of the embedded space. Default is 2.
- learning_rate: float, optional
Learning rate. Default is 100.0.
- range_n_clusters: List[int], optional
Range of cluster values. Default is range(2, 10, 1).
- random_state: int, optional
Random seed for sklearn.
- umap:
- n_neighbors: List[int], optional
List of number of neighbors. Default is [15].
- min_dist: float, optional
Minimum distance between points in the embedded space. Default is 0.1.
- circular: bool, optional
Whether to use circular metrics. Default is False.
- n_components: int, optional
Number of dimensions of the embedded space. Default is 2.
- metric: str, optional
Metric to use. Default is “euclidean”.
- random_state: int, optional
Random seed for sklearn.
- range_n_clusters: List[int], optional
Range of cluster values. Default is range(2, 10, 1).
- kpca:
- circular: bool, optional
Whether to use circular metrics. Default is False.
- n_components: int, optional
Number of components to keep. Default is 10.
- gamma: float, optional
Kernel coefficient. Default is None.
Returns
- np.ndarray
Returns the transformed data.
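The fit_on option follows a fit-on-subset, transform-all pattern: the model parameters are estimated from the selected ensembles only and then applied to every ensemble. The sketch below illustrates the pattern with plain mean-centering as a stand-in for PCA or kernel PCA, so it shows the data flow rather than the actual reduction. All names in it are hypothetical.

```python
from statistics import mean


def fit_center(rows):
    """'Fit' step: column means computed over the rows used for fitting."""
    return [mean(col) for col in zip(*rows)]


def transform_center(rows, center):
    """'Transform' step: subtract the fitted column means from every row."""
    return [[x - c for x, c in zip(row, center)] for row in rows]


# Toy feature arrays keyed by ensemble ID (one row per conformation).
features = {"A": [[0.0, 2.0], [2.0, 4.0]], "B": [[10.0, 10.0]]}
fit_on = ["A"]

# Fit only on the ensembles named in fit_on ...
fit_rows = [row for ens in fit_on for row in features[ens]]
center = fit_center(fit_rows)

# ... then transform every ensemble with the same fitted parameters.
transformed = {ens: transform_center(rows, center) for ens, rows in features.items()}
```

Here ensemble B is projected using parameters it did not contribute to, which is what allows ensembles to be compared in a space defined by a reference set.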
- For more information on each method, see the corresponding documentation: