gt4sd.frameworks.granular.dataloader.dataset module

Dataset module.

Summary

Classes:

AutoEncoderDataset

Autoencoder dataset.

CombinedGranularDataset

General dataset combining multiple granular datasets.

GranularDataset

A dataset wrapper for granular data.

LatentModelDataset

Latent model dataset.

SmilesTokenizationPreProcessingDataset

Dataset for SMILES/SELFIES preprocessing.

Functions:

build_architecture

Build architecture configuration for the selected model type and dataset.

build_data_columns

Build data columns from hyper-parameters.

build_dataset

Build a granular dataset.

build_dataset_and_architecture

Build a dataset and an architecture configuration.

Reference

class GranularDataset(name, data)[source]

Bases: Dataset

A dataset wrapper for granular data.

__init__(name, data)[source]

Initialize a granular dataset.

Parameters
  • name (str) – dataset name.

  • data (Dict[str, Any]) – dataset samples.

__len__()[source]

Dataset length.

Return type

int

Returns

length of the dataset.

__getitem__(index)[source]

Retrieve an item from the dataset by index.

Parameters

index (int) – index for the item.

Return type

Dict[str, Any]

Returns

an item.

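
The wrapper's behaviour can be illustrated with a minimal, self-contained sketch (plain Python, not the actual gt4sd implementation): `data` maps column names to equal-length sequences of samples, so `__len__` reports the sample count and `__getitem__` assembles one value per column. The class and column names below are hypothetical.

```python
from typing import Any, Dict, List

class MiniGranularDataset:
    """Illustrative sketch of the GranularDataset contract (not the gt4sd code)."""

    def __init__(self, name: str, data: Dict[str, List[Any]]) -> None:
        self.name = name
        self.data = data

    def __len__(self) -> int:
        # all columns are assumed to hold the same number of samples
        return len(next(iter(self.data.values())))

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # one value per column for the requested sample
        return {column: values[index] for column, values in self.data.items()}

dataset = MiniGranularDataset("toy", {"smiles": ["CCO", "c1ccccc1"], "target": [0.5, 1.2]})
print(len(dataset))  # 2
print(dataset[1])    # {'smiles': 'c1ccccc1', 'target': 1.2}
```
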
class CombinedGranularDataset(datasets)[source]

Bases: Dataset

General dataset combining multiple granular datasets.

__init__(datasets)[source]

Initialize a general dataset.

Parameters

datasets (List[Dict[str, Any]]) – list of dataset configurations.

__len__()[source]

Dataset length.

Return type

int

Returns

length of the dataset.

__getitem__(index)[source]

Retrieve an item from the dataset by index.

Parameters

index (int) – index for the item.

Return type

Dict[str, Any]

Returns

an item.

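
A sketch of the combination logic, assuming the sub-datasets are aligned sample-for-sample; the `(name, rows)` representation and the name-prefixed key scheme are assumptions for illustration, not the gt4sd implementation.

```python
from typing import Any, Dict, List, Tuple

class MiniCombinedDataset:
    """Illustrative sketch (not the gt4sd code): each sub-dataset is a
    (name, rows) pair whose rows are aligned sample-for-sample."""

    def __init__(self, datasets: List[Tuple[str, List[Dict[str, Any]]]]) -> None:
        self.datasets = datasets

    def __len__(self) -> int:
        return len(self.datasets[0][1])

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # merge the aligned items; the name-prefixed keys are an assumption
        item: Dict[str, Any] = {}
        for name, rows in self.datasets:
            for key, value in rows[index].items():
                item[f"{name}_{key}"] = value
        return item

combined = MiniCombinedDataset(
    [
        ("encoder", [{"x": "CCO"}, {"x": "CCN"}]),
        ("predictor", [{"y": 0.5}, {"y": 1.2}]),
    ]
)
print(combined[0])  # {'encoder_x': 'CCO', 'predictor_y': 0.5}
```
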
class SmilesTokenizationPreProcessingDataset(name, data_columns, input_smiles, target_smiles, tokenizer, set_seq_size=None)[source]

Bases: GranularDataset

Dataset for SMILES/SELFIES preprocessing.

__init__(name, data_columns, input_smiles, target_smiles, tokenizer, set_seq_size=None)[source]

Construct a SmilesTokenizationPreProcessingDataset.

Parameters
  • name (str) – dataset name.

  • data_columns (Dict[str, Any]) – data columns mapping.

  • input_smiles (DataFrame) – dataframe containing input SMILES.

  • target_smiles (DataFrame) – dataframe containing target SMILES.

  • tokenizer (Tokenizer) – a tokenizer defining the molecule representation used.

  • set_seq_size (Optional[int]) – sequence size. Defaults to None, in which case it is derived from the input SMILES.

smiles_to_ids(input_smiles=[], target_smiles=[])[source]

Process input SMILES lists generating examples by tokenizing strings and converting them to tensors.

Parameters
  • input_smiles (List[str]) – list of input SMILES representations. Defaults to [].

  • target_smiles (List[str]) – list of target SMILES representations. Defaults to [].

Return type

None

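
The tokenize-and-pad step can be sketched with a toy character-level vocabulary (the vocabulary and special-token ids below are hypothetical; real gt4sd tokenizers are more elaborate). When `set_seq_size` is None it is derived from the longest input, mirroring the documented default.

```python
from typing import List, Optional

# toy character-level vocabulary; ids 0-2 are reserved for special tokens
PAD, SOS, EOS = 0, 1, 2
VOCAB = {ch: i + 3 for i, ch in enumerate("CNOc1()=#")}

def smiles_to_ids(smiles: List[str], set_seq_size: Optional[int] = None) -> List[List[int]]:
    """Tokenize SMILES strings and pad them to a common sequence size."""
    token_ids = [[SOS] + [VOCAB[ch] for ch in s] + [EOS] for s in smiles]
    if set_seq_size is None:
        # derive the sequence size from the longest tokenized input
        set_seq_size = max(len(ids) for ids in token_ids)
    return [ids + [PAD] * (set_seq_size - len(ids)) for ids in token_ids]

ids = smiles_to_ids(["CCO", "c1ccccc1"])
print(ids[0])  # [1, 3, 3, 5, 2, 0, 0, 0, 0, 0]
```
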
class LatentModelDataset(name, data_columns, target_data, scaling=None)[source]

Bases: GranularDataset

Latent model dataset.

__init__(name, data_columns, target_data, scaling=None)[source]

Construct a LatentModelDataset.

Parameters
  • name (str) – dataset name.

  • data_columns (Dict[str, Any]) – data columns mapping.

  • target_data (DataFrame) – dataframe for targets.

  • scaling (Optional[str]) – feature scaling process. Defaults to None, i.e., no scaling. Scaling is currently not supported.

Raises

NotImplementedError – in case a scaling is selected.

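
A minimal sketch of the documented target handling (the function name is hypothetical): target rows become numeric samples, and any requested scaling is rejected with NotImplementedError.

```python
from typing import List, Optional

def prepare_targets(target_rows: List[List[float]], scaling: Optional[str] = None) -> List[List[float]]:
    """Sketch of LatentModelDataset's target handling: scaling is rejected."""
    if scaling is not None:
        raise NotImplementedError(f"scaling={scaling!r} is not supported")
    return [[float(value) for value in row] for row in target_rows]

targets = prepare_targets([[0.5, 1], [1.2, 2]])
print(targets)  # [[0.5, 1.0], [1.2, 2.0]]
```
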
class AutoEncoderDataset(name, data_columns, input_data, target_data, scaling=None)[source]

Bases: GranularDataset

Autoencoder dataset.

__init__(name, data_columns, input_data, target_data, scaling=None)[source]

Construct an AutoEncoderDataset.

Parameters
  • name (str) – dataset name.

  • data_columns (Dict[str, Any]) – data columns mapping.

  • input_data (DataFrame) – dataframe for inputs.

  • target_data (DataFrame) – dataframe for targets.

  • scaling (Optional[str]) – feature scaling process. Defaults to None, i.e., no scaling. Feasible values: “onehot”, “min-max” and “standard”.

Raises

ValueError – in case requested scaling is not supported.

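
The numeric scaling options named above can be sketched for a single feature column (a pure-Python illustration, not the gt4sd implementation, which likely delegates to a library; “onehot” applies to categorical columns and is omitted here). Unsupported values raise ValueError, matching the documented behaviour.

```python
from typing import List

def scale_feature(values: List[float], scaling: str) -> List[float]:
    """Sketch of the documented numeric scaling options."""
    if scaling == "min-max":
        # rescale to the [0, 1] range
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]
    if scaling == "standard":
        # zero mean, unit (population) standard deviation
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values]
    raise ValueError(f"scaling={scaling!r} is not supported")

print(scale_feature([1.0, 2.0, 3.0], "min-max"))  # [0.0, 0.5, 1.0]
```
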
build_data_columns(hparams)[source]

Build data columns from hyper-parameters.

Parameters

hparams (Dict[str, Any]) – hyper-parameters for the data columns.

Return type

Dict[str, Any]

Returns

data columns.

build_dataset(name, data, dataset_type, data_columns, hparams)[source]

Build a granular dataset.

Parameters
  • name (str) – dataset name.

  • data (DataFrame) – dataframe representing the dataset.

  • dataset_type (str) – dataset type. Feasible values: “latentmodel”, “smiles”, “selfies”, “big-smiles” and “autoencoder”.

  • data_columns (Dict[str, Any]) – data columns mapping.

  • hparams (Dict[str, Any]) – hyper-parameters for the data columns.

Raises

ValueError – in case requested dataset type is not supported.

Return type

GranularDataset

Returns

a granular dataset.
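
The dispatch over `dataset_type` can be sketched as a lookup table keyed by the documented values; the `DATASET_BUILDERS` table and the placeholder builders below are hypothetical, only the accepted type strings and the ValueError are taken from the description above.

```python
from typing import Callable, Dict

# hypothetical builders keyed by the documented dataset_type values;
# placeholders stand in for the actual dataset constructors
DATASET_BUILDERS: Dict[str, Callable[[str], str]] = {
    "latentmodel": lambda name: f"LatentModelDataset({name})",
    "smiles": lambda name: f"SmilesTokenizationPreProcessingDataset({name})",
    "selfies": lambda name: f"SmilesTokenizationPreProcessingDataset({name})",
    "big-smiles": lambda name: f"SmilesTokenizationPreProcessingDataset({name})",
    "autoencoder": lambda name: f"AutoEncoderDataset({name})",
}

def build_dataset_sketch(name: str, dataset_type: str) -> str:
    """Dispatch on dataset_type, raising ValueError for unsupported types."""
    try:
        builder = DATASET_BUILDERS[dataset_type.lower()]
    except KeyError:
        raise ValueError(f"dataset_type={dataset_type!r} is not supported") from None
    return builder(name)
```
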

build_architecture(model_type, data_columns, dataset, hparams)[source]

Build architecture configuration for the selected model type and dataset.

Parameters
  • model_type (str) – model type. Feasible values: “vae_rnn”, “vae_trans”, “mlp_predictor”, “no_encoding”, “mlp_autoencoder” and “vae_mlp”.

  • data_columns (Dict[str, Any]) – data columns mapping.

  • dataset (GranularDataset) – a granular dataset.

  • hparams (Dict[str, Any]) – hyper-parameters for the data columns.

Raises

ValueError – in case requested model type is not supported.

Return type

Dict[str, Any]

Returns

architecture configuration.

build_dataset_and_architecture(name, data_path, data_file, dataset_type, model_type, hparams, **kwargs)[source]

Build a dataset and an architecture configuration.

Parameters
  • name (str) – dataset name.

  • data_path (str) – path to the dataset.

  • data_file (str) – data file name.

  • dataset_type (str) – dataset type. Feasible values: “latentmodel”, “smiles”, “selfies”, “big-smiles” and “autoencoder”.

  • model_type (str) – model type. Feasible values: “vae_rnn”, “vae_trans”, “mlp_predictor”, “no_encoding”, “mlp_autoencoder” and “vae_mlp”.

  • hparams (Dict[str, Any]) – hyper-parameters for the data columns.

Raises

ValueError – in case the data file has an unsupported extension/format.

Return type

Tuple[GranularDataset, Dict[str, Any]]

Returns

a tuple containing a granular dataset and a related architecture configuration.
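
The documented ValueError on unsupported data files can be sketched as an extension check before loading; the set of supported extensions below is an assumption for illustration, not the actual list accepted by gt4sd.

```python
import pathlib

# assumption: a small set of table-like formats is accepted
SUPPORTED_EXTENSIONS = {".csv", ".bz2"}

def resolve_data_file(data_path: str, data_file: str) -> str:
    """Sketch of the documented extension validation step."""
    filepath = pathlib.Path(data_path) / data_file
    if filepath.suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported data file format: {filepath.suffix!r}")
    return str(filepath)
```
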