gt4sd.frameworks.granular.dataloader.dataset module
Dataset module.
Summary
Classes:
AutoEncoderDataset | Autoencoder dataset.
CombinedGranularDataset | General dataset combining multiple granular datasets.
GranularDataset | A dataset wrapper for granular.
LatentModelDataset | Latent model dataset.
SmilesTokenizationPreProcessingDataset | Dataset for SMILES/SELFIES preprocessing.
Functions:
build_architecture | Build architecture configuration for the selected model type and dataset.
build_data_columns | Build data columns from hyper-parameters.
build_dataset | Build a granular dataset.
build_dataset_and_architecture | Build a dataset and an architecture configuration.
Reference
- class GranularDataset(name, data)
Bases: Dataset
A dataset wrapper for granular.
- __init__(name, data)
Initialize a granular dataset.
- Parameters
name (str) – dataset name.
data (Dict[str, Any]) – dataset samples.
- __getitem__(index)
Retrieve an item from the dataset by index.
- Parameters
index (int) – index for the item.
- Return type
Dict[str, Any]
- Returns
an item.
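The contract documented above (a name plus a dictionary of samples, indexed by integer, returning a Dict[str, Any]) can be illustrated with a minimal stand-in. This sketch does not import gt4sd or torch; the class name DictBackedDataset, the column-oriented layout of data, and the __len__ behaviour are assumptions made for illustration, not the library's implementation.

```python
from typing import Any, Dict, List


class DictBackedDataset:
    """Hypothetical stand-in for a named, dictionary-backed dataset."""

    def __init__(self, name: str, data: Dict[str, List[Any]]) -> None:
        self.name = name
        # Assumed layout: column name -> list of values, one per sample.
        self.data = data

    def __len__(self) -> int:
        # Assumes every column holds the same number of samples.
        return len(next(iter(self.data.values())))

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # Build one item by taking the index-th value of every column.
        return {key: values[index] for key, values in self.data.items()}


dataset = DictBackedDataset(
    name="toy",
    data={"smiles": ["CCO", "c1ccccc1"], "logp": [-0.31, 2.13]},
)
item = dataset[1]
```

Here item holds one value per column for the sample at index 1, which mirrors the Dict[str, Any] return type documented for __getitem__.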
- class CombinedGranularDataset(datasets)
Bases: Dataset
General dataset combining multiple granular datasets.
- __init__(datasets)
Initialize a general dataset.
- Parameters
datasets (List[Dict[str, Any]]) – list of dataset configurations.
- __getitem__(index)
Retrieve an item from the dataset by index.
- Parameters
index (int) – index for the item.
- Return type
Dict[str, Any]
- Returns
an item.
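One plausible way to combine several datasets into a single item is sketched below. The structure of each configuration (a dictionary with "name" and "data" keys), the class name CombinedDictDataset, and the merging rule are all assumptions for illustration; the documentation above only states that a list of dataset configurations is accepted and that an item is a Dict[str, Any].

```python
from typing import Any, Dict, List


class CombinedDictDataset:
    """Hypothetical stand-in that merges aligned samples from several datasets."""

    def __init__(self, datasets: List[Dict[str, Any]]) -> None:
        # Assumed configuration layout: {"name": str, "data": list of samples}.
        self.datasets = datasets

    def __len__(self) -> int:
        # Use the shortest dataset so every index is valid for all of them.
        return min(len(config["data"]) for config in self.datasets)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # Key each dataset's index-th sample by that dataset's name.
        return {config["name"]: config["data"][index] for config in self.datasets}


combined = CombinedDictDataset(
    [
        {"name": "smiles", "data": ["CCO", "CCN"]},
        {"name": "logp", "data": [-0.31, -0.57]},
    ]
)
first = combined[0]
```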
- class SmilesTokenizationPreProcessingDataset(name, data_columns, input_smiles, target_smiles, tokenizer, set_seq_size=None)
Bases: GranularDataset
Dataset for SMILES/SELFIES preprocessing.
- __init__(name, data_columns, input_smiles, target_smiles, tokenizer, set_seq_size=None)
Construct a SmilesTokenizationPreProcessingDataset.
- Parameters
name (str) – dataset name.
data_columns (Dict[str, Any]) – data columns mapping.
input_smiles (DataFrame) – dataframe containing input SMILES.
target_smiles (DataFrame) – dataframe containing target SMILES.
tokenizer (Tokenizer) – a tokenizer defining the molecule representation used.
set_seq_size (Optional[int]) – sequence size. Defaults to None, in which case it is defined using the input SMILES.
- smiles_to_ids(input_smiles=[], target_smiles=[])
Process input SMILES lists, generating examples by tokenizing the strings and converting them to tensors.
- Parameters
input_smiles (List[str]) – list of input SMILES representations. Defaults to [].
target_smiles (List[str]) – list of target SMILES representations. Defaults to [].
- Return type
None
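The general idea of turning SMILES strings into fixed-length id sequences can be sketched without the library. The example below uses a naive character-level vocabulary and plain Python lists instead of the Tokenizer object and tensors mentioned above; build_vocab, smiles_to_padded_ids and the "<pad>" token are hypothetical names. When seq_size is None, the sequence size is taken from the longest input string, which is one reading of the set_seq_size default.

```python
from typing import Dict, List, Optional

PAD = "<pad>"  # hypothetical padding token, mapped to id 0


def build_vocab(smiles: List[str]) -> Dict[str, int]:
    """Map every character seen in the SMILES strings to an integer id."""
    vocab = {PAD: 0}
    for string in smiles:
        for char in string:
            vocab.setdefault(char, len(vocab))
    return vocab


def smiles_to_padded_ids(
    smiles: List[str], vocab: Dict[str, int], seq_size: Optional[int] = None
) -> List[List[int]]:
    """Convert SMILES strings to id sequences of equal length."""
    # With seq_size=None, use the longest input string as the sequence size.
    size = seq_size if seq_size is not None else max(len(s) for s in smiles)
    rows = []
    for string in smiles:
        ids = [vocab[char] for char in string][:size]  # truncate long strings
        rows.append(ids + [vocab[PAD]] * (size - len(ids)))  # pad short ones
    return rows


smiles = ["CCO", "CN"]
vocab = build_vocab(smiles)
ids = smiles_to_padded_ids(smiles, vocab)
```

Character-level splitting mishandles multi-character atoms such as Cl or Br, so it is only a teaching device; in the class above, the molecule representation is defined by the tokenizer argument.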
- class LatentModelDataset(name, data_columns, target_data, scaling=None)
Bases: GranularDataset
Latent model dataset.
- __init__(name, data_columns, target_data, scaling=None)
Construct a LatentModelDataset.
- Parameters
name (str) – dataset name.
data_columns (Dict[str, Any]) – data columns mapping.
target_data (DataFrame) – dataframe for targets.
scaling (Optional[str]) – feature scaling process. Defaults to None, i.e. no scaling. Scaling is currently not supported.
- Raises
NotImplementedError – if a scaling is selected.
- class AutoEncoderDataset(name, data_columns, input_data, target_data, scaling=None)
Bases: GranularDataset
Autoencoder dataset.
- __init__(name, data_columns, input_data, target_data, scaling=None)
Construct an AutoEncoderDataset.
- Parameters
name (str) – dataset name.
data_columns (Dict[str, Any]) – data columns mapping.
input_data (DataFrame) – dataframe for inputs.
target_data (DataFrame) – dataframe for targets.
scaling (Optional[str]) – feature scaling process. Defaults to None, i.e. no scaling. Feasible values: "onehot", "min-max" and "standard".
- Raises
ValueError – if the requested scaling is not supported.
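What the three scaling options conventionally mean can be shown on a single column. This is a pure-Python sketch of the standard definitions (min-max rescaling to [0, 1], standardization to zero mean and unit variance, one-hot encoding of categories), with a ValueError for anything else. The helper name scale_column is hypothetical, and the library may compute these differently, for example with a fitted scaler object.

```python
import math
from typing import Any, List, Optional, Sequence


def scale_column(values: Sequence[Any], scaling: Optional[str] = None) -> List[Any]:
    """Scale one column according to the selected (hypothetical) option."""
    if scaling is None:
        return list(values)  # no scaling
    if scaling == "min-max":
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid division by zero on constant columns
        return [(v - low) / span for v in values]
    if scaling == "standard":
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(variance) or 1.0  # avoid division by zero
        return [(v - mean) / std for v in values]
    if scaling == "onehot":
        categories = sorted(set(values))
        return [[1.0 if v == c else 0.0 for c in categories] for v in values]
    raise ValueError(f"scaling={scaling!r} is not supported")


min_max = scale_column([0.0, 5.0, 10.0], "min-max")
standard = scale_column([1.0, 3.0], "standard")
onehot = scale_column(["a", "b", "a"], "onehot")
```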
- build_data_columns(hparams)
Build data columns from hyper-parameters.
- Parameters
hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Return type
Dict[str, Any]
- Returns
data columns.
- build_dataset(name, data, dataset_type, data_columns, hparams)
Build a granular dataset.
- Parameters
name (str) – dataset name.
data (DataFrame) – dataframe representing the dataset.
dataset_type (str) – dataset type. Feasible values: "latentmodel", "smiles", "selfies", "big-smiles" and "autoencoder".
data_columns (Dict[str, Any]) – data columns mapping.
hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Raises
ValueError – if the requested dataset type is not supported.
- Return type
GranularDataset
- Returns
a granular dataset.
- build_architecture(model_type, data_columns, dataset, hparams)
Build architecture configuration for the selected model type and dataset.
- Parameters
model_type (str) – model type. Feasible values: "vae_rnn", "vae_trans", "mlp_predictor", "no_encoding", "mlp_autoencoder" and "vae_mlp".
data_columns (Dict[str, Any]) – data columns mapping.
dataset (GranularDataset) – a granular dataset.
hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Raises
ValueError – if the requested model type is not supported.
- Return type
Dict[str, Any]
- Returns
architecture configuration.
- build_dataset_and_architecture(name, data_path, data_file, dataset_type, model_type, hparams, **kwargs)
Build a dataset and an architecture configuration.
- Parameters
name (str) – dataset name.
data_path (str) – path to the dataset.
data_file (str) – data file name.
dataset_type (str) – dataset type. Feasible values: "latentmodel", "smiles", "selfies", "big-smiles" and "autoencoder".
model_type (str) – model type. Feasible values: "vae_rnn", "vae_trans", "mlp_predictor", "no_encoding", "mlp_autoencoder" and "vae_mlp".
hparams (Dict[str, Any]) – hyper-parameters for the data columns.
- Raises
ValueError – if the data file has an unsupported extension/format.
- Return type
Tuple[GranularDataset, Dict[str, Any]]
- Returns
a tuple containing a granular dataset and a related architecture configuration.
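Because the dataset and model types are passed as plain strings, a caller may want to check them against the feasible values listed above before building anything. The sketch below does only that; check_types is a hypothetical helper and not part of the module, and the two tuples simply repeat the documented values.

```python
from typing import Tuple

# Feasible values as listed in this reference.
DATASET_TYPES = ("latentmodel", "smiles", "selfies", "big-smiles", "autoencoder")
MODEL_TYPES = (
    "vae_rnn",
    "vae_trans",
    "mlp_predictor",
    "no_encoding",
    "mlp_autoencoder",
    "vae_mlp",
)


def check_types(dataset_type: str, model_type: str) -> Tuple[str, str]:
    """Return the pair unchanged, or raise ValueError on an unknown value."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(
            f"dataset_type={dataset_type!r} not supported, choose from {DATASET_TYPES}"
        )
    if model_type not in MODEL_TYPES:
        raise ValueError(
            f"model_type={model_type!r} not supported, choose from {MODEL_TYPES}"
        )
    return dataset_type, model_type


checked = check_types("smiles", "vae_rnn")
```

Failing early with the list of accepted values in the message makes a typo such as "big_smiles" easier to spot than an error raised after the data file has been loaded.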