gt4sd.frameworks.granular.dataloader.dataset module
Dataset module.
Summary
Classes:
AutoEncoderDataset | Autoencoder dataset.
CombinedGranularDataset | General dataset combining multiple granular datasets.
GranularDataset | A dataset wrapper for the granular framework.
LatentModelDataset | Latent model dataset.
SmilesTokenizationPreProcessingDataset | Dataset for SMILES/SELFIES preprocessing.
Functions:
build_architecture | Build architecture configuration for the selected model type and dataset.
build_data_columns | Build data columns from hyper-parameters.
build_dataset | Build a granular dataset.
build_dataset_and_architecture | Build a dataset and an architecture configuration.
Reference
- class GranularDataset(name, data)
 Bases: Dataset
 A dataset wrapper for the granular framework.
- __init__(name, data)
 Initialize a granular dataset.
- Parameters
 name (str) – dataset name.
 data (Dict[str, Any]) – dataset samples.
- __getitem__(index)
 Retrieve an item from the dataset by index.
- Parameters
 index (int) – index for the item.
- Return type
 Dict[str, Any]
- Returns
 an item.
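To make the `name`/`data` contract more concrete, here is a minimal sketch of a wrapper with the same constructor arguments and the same `Dict[str, Any]` item type. It is not the gt4sd implementation: the class name `SketchGranularDataset`, the column-oriented layout of `data`, and the name-prefixed keys are assumptions made for illustration.

```python
from typing import Any, Dict, List


class SketchGranularDataset:
    """Hypothetical stand-in for a dataset wrapper built from a name and samples."""

    def __init__(self, name: str, data: Dict[str, List[Any]]) -> None:
        self.name = name
        self.data = data

    def __len__(self) -> int:
        # Assumption: each column holds one value per sample.
        return min(len(column) for column in self.data.values())

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # Assumption: keys are prefixed with the dataset name so that items
        # from different datasets can later be merged without collisions.
        return {f"{self.name}_{key}": column[index] for key, column in self.data.items()}


dataset = SketchGranularDataset("qed", {"input": [0.1, 0.2], "target": [1.0, 0.0]})
item = dataset[1]
print(item)
```

Indexing returns one value per column, keyed by the prefixed column name.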
- class CombinedGranularDataset(datasets)
 Bases: Dataset
 General dataset combining multiple granular datasets.
- __init__(datasets)
 Initialize a general dataset.
- Parameters
 datasets (List[Dict[str, Any]]) – list of dataset configurations.
- __getitem__(index)
 Retrieve an item from the dataset by index.
- Parameters
 index (int) – index for the item.
- Return type
 Dict[str, Any]
- Returns
 an item.
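One way to read "combining multiple granular datasets" is that the item at a given index is the union of the items each underlying dataset returns for that index. The sketch below illustrates that idea only. The class name `SketchCombinedDataset` and the `"name"` and `"dataset"` keys of each configuration are hypothetical, and plain lists of dictionaries stand in for the wrapped datasets.

```python
from typing import Any, Dict, List


class SketchCombinedDataset:
    """Hypothetical combination of several index-aligned datasets."""

    def __init__(self, datasets: List[Dict[str, Any]]) -> None:
        # Assumption: each configuration stores the wrapped dataset under "dataset".
        self.datasets = datasets

    def __len__(self) -> int:
        # Only indices available in every dataset can be served.
        return min(len(config["dataset"]) for config in self.datasets)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        item: Dict[str, Any] = {}
        for config in self.datasets:
            # Merge the per-dataset items into a single dictionary.
            item.update(config["dataset"][index])
        return item


configs = [
    {"name": "smiles", "dataset": [{"smiles_input": "CCO"}, {"smiles_input": "CCN"}]},
    {"name": "qed", "dataset": [{"qed_target": 0.4}, {"qed_target": 0.5}]},
]
combined = SketchCombinedDataset(configs)
merged = combined[0]
print(merged)
```

Merging with `dict.update` silently overwrites duplicate keys, which is why the sketch relies on each dataset using distinct key names.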
- class SmilesTokenizationPreProcessingDataset(name, data_columns, input_smiles, target_smiles, tokenizer, set_seq_size=None)
 Bases: GranularDataset
 Dataset for SMILES/SELFIES preprocessing.
- __init__(name, data_columns, input_smiles, target_smiles, tokenizer, set_seq_size=None)
 Construct a SmilesTokenizationPreProcessingDataset.
- Parameters
 name (str) – dataset name.
 data_columns (Dict[str, Any]) – data columns mapping.
 input_smiles (DataFrame) – dataframe containing input SMILES.
 target_smiles (DataFrame) – dataframe containing target SMILES.
 tokenizer (Tokenizer) – a tokenizer defining the molecule representation used.
 set_seq_size (Optional[int]) – sequence size. Defaults to None, i.e., the size is derived from the input SMILES.
- smiles_to_ids(input_smiles=[], target_smiles=[])
 Process lists of input and target SMILES, generating examples by tokenizing the strings and converting them to tensors.
- Parameters
 input_smiles (List[str]) – list of input SMILES representations. Defaults to [].
 target_smiles (List[str]) – list of target SMILES representations. Defaults to [].
- Return type
 None
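The tokenize-then-convert step described for `smiles_to_ids` can be sketched as follows. This is an illustration under simplifying assumptions, not the library's code: it tokenizes character by character (real SMILES tokenizers treat multi-character atoms such as `Cl` as single tokens), it returns padded lists of integers rather than tensors, and the special token ids and helper names are hypothetical. When `seq_size` is None, the size is derived from the longest encoded input, mirroring the documented default of `set_seq_size`.

```python
from typing import Dict, List, Optional

PAD, START, END = 0, 1, 2  # hypothetical special token ids


def build_vocabulary(smiles: List[str]) -> Dict[str, int]:
    """Assign an id to every character, after the special tokens."""
    characters = sorted({character for string in smiles for character in string})
    return {character: index + 3 for index, character in enumerate(characters)}


def smiles_to_padded_ids(
    smiles: List[str], vocabulary: Dict[str, int], seq_size: Optional[int] = None
) -> List[List[int]]:
    """Encode each string as ids, then truncate or pad to a common length."""
    encoded = [
        [START] + [vocabulary[character] for character in string] + [END]
        for string in smiles
    ]
    if seq_size is None:
        # Derive the sequence size from the data when none is given.
        seq_size = max(len(ids) for ids in encoded)
    return [ids[:seq_size] + [PAD] * (seq_size - len(ids)) for ids in encoded]


smiles = ["CCO", "C"]
vocabulary = build_vocabulary(smiles)
padded = smiles_to_padded_ids(smiles, vocabulary)
print(padded)
```

Fixed-length sequences are what allow the examples to be stacked into a single batch tensor downstream.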
- class LatentModelDataset(name, data_columns, target_data, scaling=None)
 Bases: GranularDataset
 Latent model dataset.
- __init__(name, data_columns, target_data, scaling=None)
 Construct a LatentModelDataset.
- Parameters
 name (str) – dataset name.
 data_columns (Dict[str, Any]) – data columns mapping.
 target_data (DataFrame) – dataframe for targets.
 scaling (Optional[str]) – feature scaling process. Defaults to None, i.e., no scaling. Scaling is currently not supported.
- Raises
 NotImplementedError – if a scaling is selected.
- class AutoEncoderDataset(name, data_columns, input_data, target_data, scaling=None)
 Bases: GranularDataset
 Autoencoder dataset.
- __init__(name, data_columns, input_data, target_data, scaling=None)
 Construct an AutoEncoderDataset.
- Parameters
 name (str) – dataset name.
 data_columns (Dict[str, Any]) – data columns mapping.
 input_data (DataFrame) – dataframe for inputs.
 target_data (DataFrame) – dataframe for targets.
 scaling (Optional[str]) – feature scaling process. Defaults to None, i.e., no scaling. Feasible values: “onehot”, “min-max” and “standard”.
- Raises
 ValueError – if the requested scaling is not supported.
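For readers unfamiliar with the scaling options, the sketch below shows what "min-max" and "standard" scaling conventionally mean for a single numeric column, together with the `ValueError` behaviour for unknown values. It is a plain-Python illustration, not the library's code: the function name `scale_column` is hypothetical, "onehot" is left out for brevity, and the choice of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Optional


def scale_column(values: List[float], scaling: Optional[str] = None) -> List[float]:
    """Scale one numeric column; None leaves the values unchanged."""
    if scaling is None:
        return list(values)
    if scaling == "min-max":
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid dividing by zero for constant columns
        return [(value - low) / span for value in values]
    if scaling == "standard":
        center = mean(values)
        spread = pstdev(values) or 1.0  # population standard deviation (assumption)
        return [(value - center) / spread for value in values]
    raise ValueError(f"Unsupported scaling: {scaling!r}")


print(scale_column([0.0, 5.0, 10.0], "min-max"))
print(scale_column([0.0, 5.0, 10.0], "standard"))
```

Min-max scaling maps the column onto the interval [0, 1], while standard scaling centres it at zero with unit variance.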
- build_data_columns(hparams)
 Build data columns from hyper-parameters.
- Parameters
 hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Return type
 Dict[str, Any]
- Returns
 data columns.
- build_dataset(name, data, dataset_type, data_columns, hparams)
 Build a granular dataset.
- Parameters
 name (str) – dataset name.
 data (DataFrame) – dataframe representing the dataset.
 dataset_type (str) – dataset type. Feasible values: “latentmodel”, “smiles”, “selfies”, “big-smiles” and “autoencoder”.
 data_columns (Dict[str, Any]) – data columns mapping.
 hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Raises
 ValueError – if the requested dataset type is not supported.
- Return type
 GranularDataset
- Returns
 a granular dataset.
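The dispatch on `dataset_type` can be pictured as a lookup from the type string to one of the dataset classes documented above. The mapping in this sketch is an assumption inferred from the class descriptions (the three molecular string representations sharing the tokenization dataset), and `resolve_dataset_class` is a hypothetical helper that returns class names as strings so the example stays self-contained.

```python
from typing import Dict

# Assumed mapping from dataset type to the class that would handle it.
DATASET_CLASS_BY_TYPE: Dict[str, str] = {
    "latentmodel": "LatentModelDataset",
    "smiles": "SmilesTokenizationPreProcessingDataset",
    "selfies": "SmilesTokenizationPreProcessingDataset",
    "big-smiles": "SmilesTokenizationPreProcessingDataset",
    "autoencoder": "AutoEncoderDataset",
}


def resolve_dataset_class(dataset_type: str) -> str:
    """Return the class name for a dataset type or raise ValueError."""
    if dataset_type not in DATASET_CLASS_BY_TYPE:
        feasible = ", ".join(sorted(DATASET_CLASS_BY_TYPE))
        raise ValueError(
            f"Unsupported dataset type {dataset_type!r}. Feasible values: {feasible}"
        )
    return DATASET_CLASS_BY_TYPE[dataset_type]


print(resolve_dataset_class("selfies"))
```

Listing the feasible values in the error message makes a mistyped `dataset_type` easy to diagnose.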
- build_architecture(model_type, data_columns, dataset, hparams)
 Build architecture configuration for the selected model type and dataset.
- Parameters
 model_type (str) – model type. Feasible values: “vae_rnn”, “vae_trans”, “mlp_predictor”, “no_encoding”, “mlp_autoencoder” and “vae_mlp”.
 data_columns (Dict[str, Any]) – data columns mapping.
 dataset (GranularDataset) – a granular dataset.
 hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Raises
 ValueError – if the requested model type is not supported.
- Return type
 Dict[str, Any]
- Returns
 architecture configuration.
- build_dataset_and_architecture(name, data_path, data_file, dataset_type, model_type, hparams, **kwargs)
 Build a dataset and an architecture configuration.
- Parameters
 name (str) – dataset name.
 data_path (str) – path to the dataset.
 data_file (str) – data file name.
 dataset_type (str) – dataset type. Feasible values: “latentmodel”, “smiles”, “selfies”, “big-smiles” and “autoencoder”.
 model_type (str) – model type. Feasible values: “vae_rnn”, “vae_trans”, “mlp_predictor”, “no_encoding”, “mlp_autoencoder” and “vae_mlp”.
 hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Raises
 ValueError – if the data file has an unsupported extension/format.
- Return type
 Tuple[GranularDataset, Dict[str, Any]]
- Returns
 a tuple containing a granular dataset and a related architecture configuration.