gt4sd.algorithms.conditional_generation.regression_transformer.implementation module¶

Implementation of Regression Transformer conditional generators.

Summary¶

Classes:

`ChemicalLanguageRT`	Hybrid regression and conditional molecular generation model as implemented in https://arxiv.org/abs/2202.01338.
`ConditionalGenerator`	Main interface for a regression transformer.
`ProteinLanguageRT`	Hybrid regression and conditional protein generation model as implemented in https://arxiv.org/abs/2202.01338.

Reference¶

class ConditionalGenerator(resources_path, device=None, tolerance=20.0)[source]¶

Bases: object

Main interface for a regression transformer.

task: str¶

search: Search¶

property_collator: PropertyCollator¶

subs_format: str¶

batch_size: int = 8¶

sampling_wrapper: Dict[str, Any] = {}¶

__init__(resources_path, device=None, tolerance=20.0)[source]¶

Initialize the generator.

Parameters

resources_path (str) – directory where to find models and parameters.
device (Union[device, str, None]) – device on which inference runs, given either as a dedicated class or as a string. If not provided, it is inferred.
tolerance (Union[float, Dict[str, float]]) – percentage of tolerated deviation between desired and obtained property. Either a single float or a dictionary of floats, where the keys are the properties. Note that the tolerance is only used for post-hoc filtering of the generated molecules.
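
Since tolerance accepts either a single float or a per-property dictionary, it can help to see both forms reduced to one shape. The helper below is a minimal sketch and not the library's code: the name expand_tolerance and its error message are hypothetical, chosen for illustration only.

```python
from typing import Dict, List, Union


def expand_tolerance(
    tolerance: Union[float, Dict[str, float]], properties: List[str]
) -> Dict[str, float]:
    """Return one tolerance (in percent) per property token (hypothetical helper)."""
    if isinstance(tolerance, dict):
        # A dictionary must cover every property the model handles.
        missing = [p for p in properties if p not in tolerance]
        if missing:
            raise ValueError(f"No tolerance given for: {missing}")
        return {p: float(tolerance[p]) for p in properties}
    # A single number applies to every property.
    return {p: float(tolerance) for p in properties}


print(expand_tolerance(20.0, ["<logp>", "<scscore>"]))
# {'<logp>': 20.0, '<scscore>': 20.0}
```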

device: device¶

load_model(resources_path)[source]¶

Load an XLNetLMHeadModel, which constitutes the base of an RT model.

Parameters: resources_path (str) – path to the model.
Returns: base of a Regression Transformer model. XLNetConfig: configuration of the model.
Return type: XLNetLMHeadModel

load_inference(resources_path)[source]¶

Load and set up all parameters necessary for inference.

Parameters: resources_path (str) – path to the model folder.
Return type: None

denormalize(x, idx, precision=4)[source]¶

Denormalize from the model scale to the original scale.

Parameters

x (float) – normalized value (often in [0,1]).
idx (int) – index of the property.
precision (int) – optional rounding precision. Defaults to 4.

Returns

Value in regular scale.

Return type

float

normalize(x, idx, precision=3)[source]¶

Normalize from original scale to desired scale.

Parameters

x (str) – unnormalized input.
idx (int) – index of the property.
precision (int) – optional rounding precision.

Returns

Normalized value.

Return type

float
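
The relationship between normalize and denormalize can be illustrated with plain min-max scaling. This is a sketch under assumptions: the page does not state which scaling the model uses, and the PROPERTY_RANGES table below is invented for the example.

```python
from typing import List, Tuple

# Hypothetical (min, max) ranges, one entry per property index.
PROPERTY_RANGES: List[Tuple[float, float]] = [(-5.0, 10.0), (1.0, 5.0)]


def normalize(x: str, idx: int, precision: int = 3) -> float:
    """Map an unnormalized value (given as a string) to [0, 1] by min-max scaling."""
    low, high = PROPERTY_RANGES[idx]
    return round((float(x) - low) / (high - low), precision)


def denormalize(x: float, idx: int, precision: int = 4) -> float:
    """Map a value from the model scale back to the original scale."""
    low, high = PROPERTY_RANGES[idx]
    return round(x * (high - low) + low, precision)
```

With these assumed ranges, the two functions invert each other up to the rounding precision.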

validate_input(x)[source]¶

Sanity checking for formatting of the input string.

Parameters: x (str) – The string to be validated.
Raises: ValueError – If string was formatted incorrectly.
Return type: None

validate_input_molecule(sequence, input_type='SELFIES')[source]¶

Verifies that the non-numerical part of the input is a proper sequence.

Parameters: sequence (str) – input sequence to be validated.
Return type: None

validate_input_numerical(sequence)[source]¶

Verifies that the numeric part of the input sequence is valid.

Parameters: sequence (str) – input sequence to be validated.
Return type: None

safely_determine_task(x)[source]¶

Determines whether the passed sequence adheres to regression or generation task.

Parameters: x (str) – the user-provided input sequence for the model, including mask tokens.
Raises: ValueError – if the sequence does not adhere to the formatting rules.
Returns: the task, either ‘regression’ or ‘generation’.
Return type: str
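
One way to think about the task decision is to look at where the mask tokens sit: in the property part (regression) or in the text part (generation). The function below is a simplified sketch, assuming a [MASK] token and a | separator as in the examples on this page; the name determine_task and its exact rules are hypothetical and may differ from the real method.

```python
MASK = "[MASK]"
SEPARATOR = "|"


def determine_task(x: str) -> str:
    """Guess the task from the position of the mask tokens (illustrative only)."""
    if SEPARATOR not in x:
        raise ValueError(f"Expected '{SEPARATOR}' between properties and text: {x}")
    # Everything after the last separator is treated as the text part.
    properties, text = x.rsplit(SEPARATOR, 1)
    mask_in_properties = MASK in properties
    mask_in_text = MASK in text
    if mask_in_properties and not mask_in_text:
        return "regression"
    if mask_in_text and not mask_in_properties:
        return "generation"
    raise ValueError("Mask tokens must appear in exactly one part of the input.")
```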

generate_batch_regression(context)[source]¶

Predict the property of a sample.

Parameters

context (str) – a string with a masked property, a separator and an entity. E.g. <stab>[MASK][MASK][MASK][MASK]|GSQEVNSGTQTYKNASPEEAERIARKAGATTWTEKGNKWEIRI.

Returns

a list of (denormalized) predicted properties for the entity, stored as a Sequence (str), e.g., ‘<qed>0.727’.

Return type

List[Sequence]
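
A regression context like the one in the example above can be assembled with a small helper. This is only a sketch: build_regression_context is a hypothetical name, and the default of four mask tokens simply mirrors the example; the number of masks a given model expects is an assumption here.

```python
def build_regression_context(
    property_token: str, entity: str, num_masks: int = 4
) -> str:
    """Join a masked property, the separator and the entity into one string."""
    return f"{property_token}{'[MASK]' * num_masks}|{entity}"


context = build_regression_context(
    "<stab>", "GSQEVNSGTQTYKNASPEEAERIARKAGATTWTEKGNKWEIRI"
)
```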

compile_regression_result(input_ids, prediction)[source]¶

Postprocesses the prediction from the property task to obtain a float.

Parameters

input_ids (Tensor) – 2D Tensor of shape (batch_size, sequence_length).
prediction (Tensor) – 2D Tensor of shape (batch_size, sequence_length).

Returns

list of property sequences (one per sample). A sequence can contain multiple properties but has to be hashable, therefore Sequences are used, e.g., ‘<qed>0.727’ or ‘<logp>6.65<scscore>3.82’.

Return type

List[Sequence]
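
Downstream code often needs the numeric values back out of such a property sequence. The parser below is a minimal sketch based only on the two example strings shown above; the regular expression and the name parse_property_sequence are assumptions, not part of the documented API.

```python
import re
from typing import Dict

# A property token such as <qed>, followed by a signed decimal number.
PROPERTY_PATTERN = re.compile(r"(<\w+>)(-?\d+(?:\.\d+)?)")


def parse_property_sequence(sequence: str) -> Dict[str, float]:
    """Turn '<logp>6.65<scscore>3.82' into {'<logp>': 6.65, '<scscore>': 3.82}."""
    return {
        token: float(value) for token, value in PROPERTY_PATTERN.findall(sequence)
    }
```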

generate_batch_generation(sequence)[source]¶

Conditionally generate sequences given a continuous property value and a fixed sequence. This function first generates the novel sequences and then predicts their properties using the RT again. A novel sequence is returned only if its predicted property is within the tolerance range.

Parameters

sequence (str) – the input sequence with masked tokens on the text.

Returns

a tuple of tuples, each containing the generated sequence alongside its predicted property value.

Return type

Tuple[Tuple[str, float]]
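
The generate-then-filter step described above can be sketched as a post-hoc check on (sequence, predicted value) pairs. How the library converts the percentage into an absolute deviation is not stated on this page; the sketch assumes the tolerance is a percentage of the property's value range, and all names below are hypothetical.

```python
from typing import Iterable, Tuple


def within_tolerance(
    target: float,
    predicted: float,
    tolerance_percent: float,
    value_range: Tuple[float, float],
) -> bool:
    """Check the deviation against a percentage of the (assumed) value range."""
    low, high = value_range
    allowed = (tolerance_percent / 100.0) * (high - low)
    return abs(predicted - target) <= allowed


def filter_by_tolerance(
    candidates: Iterable[Tuple[str, float]],
    target: float,
    tolerance_percent: float,
    value_range: Tuple[float, float],
) -> Tuple[Tuple[str, float], ...]:
    """Keep only candidates whose predicted property is close enough to the target."""
    return tuple(
        (seq, pred)
        for seq, pred in candidates
        if within_tolerance(target, pred, tolerance_percent, value_range)
    )


kept = filter_by_tolerance(
    [("CCO", 0.80), ("CCN", 0.40)],
    target=0.85,
    tolerance_percent=20.0,
    value_range=(0.0, 1.0),
)
```

Here only the first candidate survives, since 0.40 deviates from the target by more than 20% of the assumed range.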

normalize_sequence(context)[source]¶

Take a sequence with unnormalized property score(s) and convert it to a sequence with a normalized score.

Parameters: context (str) – sequence with unnormalized property.
Return type: str
Returns: sequence with normalized property.

static isfloat(sequence)[source]¶

Safely determine whether a string can be converted to a float.

Parameters: sequence (str) – A string
Return type: bool
Returns: Whether it can be converted to a float

validate_numerical(sequences)[source]¶

Validate whether a list of sequences contains only numerical values.

Parameters: sequences (List[Any]) – a list of hopefully only numerical values.
Return type: Tuple[List[str], List[int]]
Returns: A tuple of two lists for the validated Sequences and their respective indices.
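
The two helpers above fit together naturally: one tests a single string, the other collects the items that pass along with their positions. The following is a simplified sketch and not the library's code; in particular, the real validate_numerical may handle property tokens that this version ignores.

```python
from typing import Any, List, Tuple


def isfloat(sequence: Any) -> bool:
    """Return True if the input can be converted to a float."""
    try:
        float(sequence)
        return True
    except (TypeError, ValueError):
        return False


def validate_numerical(sequences: List[Any]) -> Tuple[List[str], List[int]]:
    """Return the numerical items (as strings) and their indices in the input."""
    items: List[str] = []
    indices: List[int] = []
    for index, sequence in enumerate(sequences):
        if isfloat(sequence):
            items.append(str(sequence))
            indices.append(index)
    return items, indices
```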

validate_sampling_wrapper(context, property_goal={}, fraction_to_mask=0.2, tokens_to_mask=[], substructures_to_mask=[], substructures_to_keep=[], text_filtering=False)[source]¶

Validating whether the wrapper can be used for conditional generation of samples.

Parameters

context (str) – A string that is used as a seed. Has to be a SMILES or a block-copolymer (RegressionTransformerMolecules) or AAS (RegressionTransformerProteins).
property_goal (Dict[str, Any]) – Specifies the property conditions for the targeted generation. The keys are the properties and have to be aligned with the algorithm_version. For example, for the solubility model use {'<esol>': 1.23}, or for the logp_and_synthesizability model use {'<logp>': 1.23, '<synthesizability>': 2.34}. Defaults to {}, but it has to be specified.
fraction_to_mask (float) – The fraction of tokens that can be changed. Defaults to 0.2.
tokens_to_mask (List) – A list of atoms (or amino acids) that can be considered for masking. Defaults to [] meaning that all tokens can be masked. E.g., use [‘F’] to only mask fluorine atoms.
substructures_to_mask (List[str]) – Specifies a list of substructures that should be masked. Given in SMILES format. This is excluded from the stochastic masking. NOTE: The model operates on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
substructures_to_keep (List[str]) – Specifies a list of substructures that should definitely be kept. Given in SMILES format. This is excluded from the stochastic masking. NOTE: This keeps tokens even if they are included in tokens_to_mask. NOTE: The model operates on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
text_filtering (bool) – Generated sequences are post-hoc filtered for the presence of substructures_to_keep. This is done with RDKit substructure matches. If the substructure can't be converted to a mol object, this argument toggles whether the substructure should be ignored in post-hoc filtering (the default) or whether filtering should occur on a pure string level. NOTE: This does not affect the actual generation process. Defaults to False.

Return type

None
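
The constraints listed above can be summarized in a few checks on the wrapper dictionary. This is a sketch of the kind of validation involved, not the method's actual logic: check_sampling_wrapper is a hypothetical name, and the accepted range for fraction_to_mask is an assumption.

```python
from typing import Any, Dict


def check_sampling_wrapper(wrapper: Dict[str, Any]) -> None:
    """Raise ValueError if the wrapper cannot drive a conditional generation."""
    # property_goal defaults to {}, but it has to be specified.
    if not wrapper.get("property_goal", {}):
        raise ValueError("property_goal has to be specified.")
    fraction = wrapper.get("fraction_to_mask", 0.2)
    if not 0.0 <= fraction <= 1.0:  # assumed valid range
        raise ValueError("fraction_to_mask should lie in [0, 1].")
    for key in ("tokens_to_mask", "substructures_to_mask", "substructures_to_keep"):
        if not isinstance(wrapper.get(key, []), list):
            raise ValueError(f"{key} has to be a list.")


check_sampling_wrapper(
    {"fraction_to_mask": 0.5, "tokens_to_mask": [], "property_goal": {"<qed>": 0.85}}
)
```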

sample_sequence(seq)[source]¶

Assemble an RT-sequence from a seed SMILES/AAS sequence.

Parameters: seq (str) – A SMILES/AAS string used as seed.
Returns: An RT-sequence that uses SELFIES/AAS and incorporates the properties, e.g. <logp>1.234|<synthesizability>1.234|[C][C][O]
Return type: str
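
To make the shape of such an RT-sequence concrete, the sketch below joins property goals with already tokenized SELFIES/AAS tokens and masks a random fraction of them. It is an illustration under assumptions: the function name, the seed parameter and the rounding of the number of masked tokens are hypothetical, and the conversion from SMILES to SELFIES is left out.

```python
import random
from typing import Dict, List


def assemble_rt_sequence(
    property_goal: Dict[str, float],
    tokens: List[str],
    fraction_to_mask: float = 0.2,
    seed: int = 0,
) -> str:
    """Prefix the property goals and replace a fraction of tokens by [MASK]."""
    rng = random.Random(seed)
    num_masked = round(fraction_to_mask * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), num_masked))
    body = "".join(
        "[MASK]" if i in masked_positions else token
        for i, token in enumerate(tokens)
    )
    prefix = "|".join(f"{name}{value}" for name, value in property_goal.items())
    return f"{prefix}|{body}"
```

With fraction_to_mask=0.0 and the goals {'<logp>': 1.234, '<synthesizability>': 1.234}, the tokens [C], [C], [O] yield the example string shown above.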

get_maskable_tokens(tokens_to_mask)[source]¶

language_encoding(seq)[source]¶

filter_substructures(property_successes)[source]¶

validate_substructures(substructures_to_mask, substructures_to_keep)[source]¶

Validates the substructures that are ignored/kept for the masking when the sampling_wrapper is used.

Parameters

substructures_to_mask (List[str]) – List of substructures that should be masked.
substructures_to_keep (List[str]) – List of substructures that should be kept.

Raises

NotImplementedError – Implemented by the child classes.

class ChemicalLanguageRT(resources_path, context, search='sample', temperature=1.4, batch_size=8, tolerance=20.0, sampling_wrapper={}, device=None)[source]¶

Bases: ConditionalGenerator

Hybrid regression and conditional molecular generation model as implemented in https://arxiv.org/abs/2202.01338.

resources_path¶: path to the model.

context¶: user-specified input text for the model.

search¶

search key to instantiate a search via terminator.search.SEARCH_FACTORY.

Type: terminator.search.Search

temperature¶: the temperature parameter in case of a sample search.

batch_size¶

the batch size for the model, applicable only to generative task.

Type: int

tolerance¶: the tolerance for the property of the generated molecules.

subs_format: str = 'SMILES'¶

__init__(resources_path, context, search='sample', temperature=1.4, batch_size=8, tolerance=20.0, sampling_wrapper={}, device=None)[source]¶

Initialize the molecule generator.

Parameters

resources_path (str) – directory where to find models and parameters.
context (str) – user-specified input text for the model.
search (str) – search key to instantiate a search, defaults to sample.
temperature (float) – temperature for the sampling. Defaults to 1.4.
batch_size (int) – number of points sampled per call. Defaults to 8.
tolerance (Union[float, Dict[str, float]]) – the tolerance for the property of the generated molecules. Given in percent. Defaults to 20.
sampling_wrapper (Dict[str, Any]) – A high-level entry point that allows specifying a seed SMILES alongside some target conditions. NOTE: If this is used, the target needs to be a single SMILES string. Example: {'fraction_to_mask': 0.5, 'tokens_to_mask': [], 'property_goal': {'<qed>': 0.85}}
device (Union[device, str, None]) – device on which inference runs, given either as a dedicated class or as a string. If not provided, it is inferred.

validate_input_molecule(sequence, input_type=MoleculeFormat.selfies)[source]¶

Verifies that the non-numerical part of the input sequence is a molecule.

Parameters

sequence (str) – input sequence to be validated.
input_type (str) – whether the input is validated to be a SELFIES (default), SMILES or COPOLYMER.

Return type

None

validate_output(sequences)[source]¶

Validate the output of the RT model.

Parameters

sequences (List[Any]) – list of sequences to be validated.

Returns

A tuple of two elements:
- the validated items, a list of either Chem.rdchem.Mol objects (generation task) or Sequences denoting the predicted properties (regression task);
- a list of the valid indexes.

Return type

Tuple

get_maskable_tokens(tokens_to_mask)[source]¶

Convert a user-defined list of maskable tokens into a RT model-friendly format.

Parameters: tokens_to_mask (List[str]) – List of atoms specified in SMILES notation.
Return type: List[str]
Returns: List of atoms in SELFIES notation.

language_encoding(seq)[source]¶

Return type: str

filter_substructures(property_successes)[source]¶

Remove samples where user-required substructures are absent from generated samples.

Parameters: property_successes (Tuple[Tuple[str, str]]) – A tuple of samples that passed the property constraints. This is a tuple of tuples, where the first element is the SMILES and the second is the property prediction string.
Return type: Tuple[Tuple[str, str]]
Returns: A tuple of samples that passed the property constraints and the substructure constraints. Same format as input
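
The filtering contract (same tuple format in and out, samples lacking a required substructure dropped) can be shown with a pure string-level check. This sketch deliberately avoids RDKit, so it only mimics matching on a string level; the function name and the required argument are hypothetical.

```python
from typing import List, Tuple


def filter_by_substrings(
    property_successes: Tuple[Tuple[str, str], ...], required: List[str]
) -> Tuple[Tuple[str, str], ...]:
    """Keep samples whose sequence contains every required substring."""
    return tuple(
        (sequence, prediction)
        for sequence, prediction in property_successes
        if all(substructure in sequence for substructure in required)
    )


samples = (("CCOc1ccccc1", "<qed>0.71"), ("CCN", "<qed>0.40"))
kept = filter_by_substrings(samples, ["c1ccccc1"])
```

String matching on SMILES is fragile, because the same substructure can be written in several ways, which is one reason a chemistry-aware match is preferable where available.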

class ProteinLanguageRT(resources_path, context, search='sample', temperature=1.4, batch_size=32, tolerance=20.0, sampling_wrapper={}, device=None)[source]¶

Bases: ConditionalGenerator

Hybrid regression and conditional protein generation model as implemented in https://arxiv.org/abs/2202.01338. It generates peptides with a desired stability score or predicts the stability score of a given molecule. For details on the stability task see: https://doi.org/10.1126/science.aan0693

resources_path¶: path to the model.

context¶: user-specified input text for the model.

search¶

search key to instantiate a search via terminator.search.SEARCH_FACTORY.

Type: terminator.search.Search

temperature¶: the temperature parameter in case of a sample search.

batch_size¶

the batch size for the model, applicable only to generative task.

Type: int

tolerance¶: the tolerance for the property of the generated molecules.

subs_format: str = 'AAS'¶

__init__(resources_path, context, search='sample', temperature=1.4, batch_size=32, tolerance=20.0, sampling_wrapper={}, device=None)[source]¶

Initialize the protein generator.

Parameters

resources_path (str) – directory where to find models and parameters.
context (str) – user-specified input text for the model.
search (str) – search key to instantiate a search, defaults to sample.
temperature (float) – temperature for the sampling. Defaults to 1.4.
batch_size (int) – number of points sampled per call. Defaults to 32.
tolerance (Union[float, Dict[str, float]]) – the tolerance for the property of the generated molecules. Given in percent. Defaults to 20.0.
sampling_wrapper (Dict[str, Any]) – A high-level entry point that allows specifying a seed sequence (AAS) alongside some target conditions. NOTE: If this is used, the target needs to be a single AAS string. Example: {'fraction_to_mask': 0.5, 'tokens_to_mask': [], 'property_goal': {'<stab>': 0.85}}
device (Union[device, str, None]) – device on which inference runs, given either as a dedicated class or as a string. If not provided, it is inferred.

validate_input_molecule(sequence, input_type='')[source]¶

Verifies that the non-numerical part of the input sequence is a valid AAS.

Parameters

sequence (str) – input sequence to be validated.
input_type (str) – str argument that is ignored but needed for sibling class.

Return type

None

validate_output(sequences)[source]¶

Validate the output of the RT model.

Parameters

sequences (List[Any]) – list of sequences to be validated.

Returns

A tuple of two elements:
- the validated items, a list of either amino acid sequences (generation task) or Sequences denoting the predicted properties (regression task);
- a list of the valid indexes.

Return type

Tuple

get_maskable_tokens(tokens_to_mask)[source]¶

Convert a user-defined list of maskable tokens (amino acids) into an RT model-friendly format. Nothing has to be done for proteins, but the function is more complex for sibling classes.

Parameters: tokens_to_mask (List[str]) – List of amino acids specified in IUPAC convention.
Return type: List[str]
Returns: The same.

language_encoding(seq)[source]¶

Return type: str

filter_substructures(property_successes)[source]¶

Remove samples where user-required substructures are absent from generated samples.

Parameters: property_successes (Tuple[Tuple[str, str]]) – A tuple of samples that passed the property constraints. This is a tuple of tuples, where the first element is the sequence and the second the property prediction string.
Return type: Tuple[Tuple[str, str]]
Returns: A tuple of samples that passed the property constraints and the substructure constraints. Same format as input