gt4sd.algorithms.conditional_generation.regression_transformer.implementation module
Implementation of Regression Transformer conditional generators.
Summary
Classes:
ChemicalLanguageRT | Hybrid regression and conditional molecular generation model as implemented in https://arxiv.org/abs/2202.01338.
ConditionalGenerator | Main interface for a regression transformer.
ProteinLanguageRT | Hybrid regression and conditional protein generation model as implemented in https://arxiv.org/abs/2202.01338.
Reference
- class ConditionalGenerator(resources_path, device=None, tolerance=20.0)
Bases: object
Main interface for a regression transformer.
- task: str
- search: Search
- property_collator: PropertyCollator
- subs_format: str
- batch_size: int = 8
- sampling_wrapper: Dict[str, Any] = {}
- __init__(resources_path, device=None, tolerance=20.0)
Initialize the generator.
- Parameters
resources_path (str) – directory where to find models and parameters.
device (Union[device, str, None]) – device where the inference is running, either as a dedicated class or a string. If not provided, it is inferred.
tolerance (Union[float, Dict[str, float]]) – percentage of tolerated deviation between desired and obtained property. Either a single float or a dictionary of floats, where the keys are the properties. Note that the tolerance is only used for post-hoc filtering of the generated molecules.
- device: device
- load_model(resources_path)
Load an XLNetLMHeadModel, which constitutes the base of an RT model.
- Parameters
resources_path (str) – path to the model.
- Returns
base of a Regression Transformer model. XLNetConfig: configuration of the model.
- Return type
XLNetLMHeadModel
- load_inference(resources_path)
Load and set up all parameters necessary for inference.
- Parameters
resources_path (str) – path to the model folder.
- Return type
None
- denormalize(x, idx, precision=4)
Denormalize from the model scale to the original scale.
- Parameters
x (float) – normalized value (often in [0,1]).
idx (int) – index of the property.
precision (int) – optional rounding precision. Defaults to 4.
- Returns
Value in regular scale.
- Return type
float
- normalize(x, idx, precision=3)
Normalize from original scale to desired scale.
- Parameters
x (str) – unnormalized input.
idx (int) – index of the property.
precision (int) – optional rounding precision. Defaults to 3.
- Returns
Normalized value.
- Return type
float
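The page does not say which scaling `normalize` and `denormalize` apply. As a minimal sketch, assuming plain min-max scaling and a hypothetical `PROPERTY_RANGES` table (the real ranges would come from the model resources), the round trip could look like this:

```python
from typing import Dict, Tuple

# Hypothetical (min, max) range per property index; not taken from gt4sd.
PROPERTY_RANGES: Dict[int, Tuple[float, float]] = {0: (-5.0, 5.0)}


def normalize(x: str, idx: int, precision: int = 3) -> float:
    """Map a raw property value onto [0, 1] using the range of property `idx`."""
    low, high = PROPERTY_RANGES[idx]
    return round((float(x) - low) / (high - low), precision)


def denormalize(x: float, idx: int, precision: int = 4) -> float:
    """Invert `normalize`: map a model-scale value back to the original scale."""
    low, high = PROPERTY_RANGES[idx]
    return round(x * (high - low) + low, precision)
```

The differing default precisions mirror the signatures documented above; everything else is an assumption made for illustration.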
- validate_input(x)
Sanity checking for formatting of the input string.
- Parameters
x (str) – The string to be validated.
- Raises
ValueError – If string was formatted incorrectly.
- Return type
None
- validate_input_molecule(sequence, input_type='SELFIES')
Verifies that the non-numerical part of the input is a proper sequence.
- Parameters
sequence (str) – input sequence to be validated.
- Return type
None
- validate_input_numerical(sequence)
Verifies that the numeric part of the input sequence is valid.
- Parameters
sequence (str) – input sequence to be validated.
- Return type
None
- safely_determine_task(x)
Determines whether the passed sequence adheres to the regression or the generation task.
- Parameters
x (str) – the user-provided input sequence for the model, including mask tokens.
- Raises
ValueError – if the sequence does not adhere to the formatting rules.
- Returns
the task, either ‘regression’ or ‘generation’.
- Return type
str
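One way to picture the task decision: in the regression example on this page the `[MASK]` tokens replace the property value, while for generation they sit in the entity. The sketch below assumes that rule and a `|` separator before the entity; `determine_task` is a hypothetical helper, not the library's implementation.

```python
MASK = "[MASK]"


def determine_task(x: str) -> str:
    """Classify an input as 'regression' or 'generation' by where the masks sit."""
    if "|" not in x:
        raise ValueError(f"Expected a '|' separator before the entity: {x}")
    # Everything before the last separator is treated as the property part.
    properties, entity = x.rsplit("|", 1)
    if MASK in properties and MASK not in entity:
        return "regression"
    if MASK in entity and MASK not in properties:
        return "generation"
    raise ValueError(f"Masks must appear in exactly one of the two parts: {x}")
```

Inputs with masks in both parts, or in neither, are rejected here; whether the real method is equally strict is not stated on this page.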
- generate_batch_regression(context)
Predict the property of a sample.
- Parameters
context (str) – a string with a masked property, a separator and an entity. E.g. <stab>[MASK][MASK][MASK][MASK]|GSQEVNSGTQTYKNASPEEAERIARKAGATTWTEKGNKWEIRI.
- Returns
a list of (denormalized) predicted properties for the entity. Stored as a Sequence (str), e.g., ‘<qed>0.727’.
- Return type
List[Sequence]
- compile_regression_result(input_ids, prediction)
Postprocesses the prediction from the property task to obtain a float.
- Parameters
input_ids (Tensor) – 2D Tensor of shape (batch_size, sequence_length).
prediction (Tensor) – 2D Tensor of shape (batch_size, sequence_length).
- Returns
list of property sequences (one per sample). These can contain multiple properties but have to be hashable, therefore Sequences are used, e.g., ‘<qed>0.727’ or ‘<logp>6.65<scscore>3.82’.
- Return type
List[Sequence]
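Because the regression result is a string such as `<logp>6.65<scscore>3.82`, callers usually need to split it back into numbers. The following is a small sketch of one way to do that with a regular expression; `parse_property_sequence` is a hypothetical helper and not part of the documented API.

```python
import re
from typing import Dict

# A property token like <logp>, followed by an optionally signed decimal number.
PROPERTY_PATTERN = re.compile(r"(<\w+>)(-?\d+(?:\.\d+)?)")


def parse_property_sequence(sequence: str) -> Dict[str, float]:
    """Turn '<logp>6.65<scscore>3.82' into {'<logp>': 6.65, '<scscore>': 3.82}."""
    return {token: float(value) for token, value in PROPERTY_PATTERN.findall(sequence)}
```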
- generate_batch_generation(sequence)
Conditionally generate sequences given a continuous property value and a fixed sequence. This function first conditionally generates the novel sequences and then predicts their properties using the RT again. A novel sequence is returned only if its predicted property is within the tolerance range.
- Parameters
sequence (str) – the input sequence with masked tokens on the text.
- Returns
a tuple of tuples, each containing the generated sequence alongside its predicted property value.
- Return type
Tuple[Tuple[str, float]]
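The final filtering step can be sketched as follows. The page gives the tolerance as a percentage but does not state what it is a percentage of, so this sketch introduces a hypothetical `span` argument (for example, the width of the property range) as the reference. That choice is an assumption.

```python
from typing import Sequence, Tuple


def filter_by_tolerance(
    candidates: Sequence[Tuple[str, float]],
    target: float,
    tolerance: float = 20.0,
    span: float = 1.0,
) -> Tuple[Tuple[str, float], ...]:
    """Keep candidates whose predicted value lies within `tolerance` percent of `span` around `target`."""
    max_deviation = tolerance / 100.0 * span
    return tuple(
        (sequence, value)
        for sequence, value in candidates
        if abs(value - target) <= max_deviation
    )
```

With `target=0.85`, `tolerance=20.0` and `span=1.0`, predictions between 0.65 and 1.05 would pass.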
- normalize_sequence(context)
Take a sequence with unnormalized property score(s) and convert it to a sequence with a normalized score.
- Parameters
context (str) – sequence with unnormalized property.
- Return type
str
- Returns
sequence with normalized property.
- static isfloat(sequence)
Safely determine whether a string can be converted to a float.
- Parameters
sequence (str) – A string.
- Return type
bool
- Returns
Whether it can be converted to a float.
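A check like this is commonly written as a guarded conversion. The sketch below shows that pattern; it is an illustration, not a copy of the library source.

```python
def isfloat(sequence: str) -> bool:
    """Return True if `sequence` parses as a float, False otherwise."""
    try:
        float(sequence)
        return True
    except ValueError:
        return False
```

Note that Python's `float` also accepts strings such as `"nan"` and `"1e-3"`, so a check of this kind reports them as convertible.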
- validate_numerical(sequences)
Validate whether a list of sequences contains only numerical values.
- Parameters
sequences (List[Any]) – a list of hopefully only numerical values.
- Return type
Tuple[List[str], List[int]]
- Returns
A tuple of two lists for the validated Sequences and their respective indices.
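The return shape (valid items plus the positions they came from) can be illustrated with a short sketch. It assumes that an item counts as valid when `float()` accepts it; the real method may apply stricter rules.

```python
from typing import Any, List, Tuple


def validate_numerical(sequences: List[Any]) -> Tuple[List[str], List[int]]:
    """Return the items that parse as numbers and their indices in the input."""
    valid: List[str] = []
    indices: List[int] = []
    for index, item in enumerate(sequences):
        try:
            float(item)
        except (TypeError, ValueError):
            continue  # skip anything that is not numeric
        valid.append(str(item))
        indices.append(index)
    return valid, indices
```

Returning the indices lets the caller line the surviving predictions up with the inputs that produced them.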
- validate_sampling_wrapper(context, property_goal={}, fraction_to_mask=0.2, tokens_to_mask=[], substructures_to_mask=[], substructures_to_keep=[], text_filtering=False)
Validates whether the wrapper can be used for conditional generation of samples.
- Parameters
context (str) – A string that is used as a seed. Has to be a SMILES or a block-copolymer (RegressionTransformerMolecules) or AAS (RegressionTransformerProteins).
property_goal (Dict[str, Any]) – Specifies the property conditions for the targeted generation. The keys are the properties and have to be aligned with the algorithm_version. For example, for the solubility model use: {'<esol>': 1.23} or for the logp_and_synthesizability model use: {'<logp>': 1.23, '<synthesizability>': 2.34}. Defaults to {}, but it has to be specified.
fraction_to_mask (float) – The fraction of tokens that can be changed. Defaults to 0.2.
tokens_to_mask (List) – A list of atoms (or amino acids) that can be considered for masking. Defaults to [] meaning that all tokens can be masked. E.g., use ['F'] to only mask fluorine atoms.
substructures_to_mask (List[str]) – Specifies a list of substructures that should be masked. Given in SMILES format. This is excluded from the stochastic masking. NOTE: The model operates on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
substructures_to_keep (List[str]) – Specifies a list of substructures that should definitely be kept. Given in SMILES format. This is excluded from the stochastic masking. NOTE: This keeps tokens even if they are included in tokens_to_mask. NOTE: The model operates on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
text_filtering (bool) – Generated sequences are post-hoc filtered for the presence of substructures_to_keep. This is done with RDKit substructure matches. If the substructure can't be converted to a mol object, this argument toggles whether a substructure should be ignored from post-hoc filtering (this happens per default) or whether filtering should occur on a pure string level. NOTE: This does not affect the actual generation process. Defaults to False.
- Return type
None
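The constraints listed above can be checked before a wrapper dictionary is handed to a generator. The sketch below is a hypothetical pre-flight check built only from the parameter names and defaults on this page; the exception type and the exact rules of the real method are assumptions.

```python
from typing import Any, Dict

# Keys taken from the documented signature (all except `context`).
ALLOWED_KEYS = {
    "property_goal",
    "fraction_to_mask",
    "tokens_to_mask",
    "substructures_to_mask",
    "substructures_to_keep",
    "text_filtering",
}


def check_sampling_wrapper(wrapper: Dict[str, Any]) -> None:
    """Raise ValueError if the wrapper has unknown keys or lacks a usable goal."""
    unknown = sorted(set(wrapper) - ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"Unknown sampling_wrapper keys: {unknown}")
    if not wrapper.get("property_goal"):
        raise ValueError("property_goal defaults to {} but has to be specified")
    fraction = wrapper.get("fraction_to_mask", 0.2)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction_to_mask must lie in [0, 1], got {fraction}")


# Example: keep the benzene ring of a seed while targeting a solubility value.
check_sampling_wrapper(
    {"property_goal": {"<esol>": 1.23}, "substructures_to_keep": ["c1ccccc1"]}
)
```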
- sample_sequence(seq)
Assembles an RT-sequence from a seed SMILES/AAS sequence.
- Parameters
seq (str) – A SMILES/AAS string used as seed.
- Returns
An RT-sequence that uses SELFIES/AAS and incorporates the properties, e.g. <logp>1.234|<synthesizability>1.234|[C][C][O]
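To make the roles of `fraction_to_mask` and `tokens_to_mask` concrete, here is a sketch of stochastic masking over an already tokenized seed. `mask_tokens` and its `seed` argument are hypothetical; how the library samples mask positions internally is not described on this page.

```python
import random
from typing import List, Optional


def mask_tokens(
    tokens: List[str],
    fraction_to_mask: float = 0.2,
    tokens_to_mask: Optional[List[str]] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Replace a random subset of the eligible tokens with '[MASK]'."""
    rng = random.Random(seed)
    # An empty or missing tokens_to_mask means every position is eligible.
    candidates = [
        i for i, token in enumerate(tokens)
        if not tokens_to_mask or token in tokens_to_mask
    ]
    n_mask = min(len(candidates), round(fraction_to_mask * len(tokens)))
    chosen = set(rng.sample(candidates, n_mask))
    return ["[MASK]" if i in chosen else token for i, token in enumerate(tokens)]


# Assemble an RT-style input: property goal, separator, partially masked seed.
masked = mask_tokens(list("GSQEVNSGTQ"), fraction_to_mask=0.2, seed=0)
rt_sequence = "<stab>0.393|" + "".join(masked)
```

The property token and value in the last line are placeholders; only the overall layout follows the examples shown on this page.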
- validate_substructures(substructures_to_mask, substructures_to_keep)
Validates the substructures that are ignored/kept for the masking when the sampling_wrapper is used.
- Parameters
substructures_to_mask (List[str]) – List of substructures that should be masked.
substructures_to_keep (List[str]) – List of substructures that should be kept.
- Raises
NotImplementedError – Implemented by the child classes.
- class ChemicalLanguageRT(resources_path, context, search='sample', temperature=1.4, batch_size=8, tolerance=20.0, sampling_wrapper={}, device=None)
Bases: ConditionalGenerator
Hybrid regression and conditional molecular generation model as implemented in https://arxiv.org/abs/2202.01338.
- resources_path
path to the model.
- context
user-specified input text for the model.
- search
search key to instantiate a search via terminator.search.SEARCH_FACTORY.
- Type
terminator.search.Search
- temperature
the temperature parameter in case of a sample search.
- batch_size
the batch size for the model, applicable only to generative task.
- Type
int
- tolerance
the tolerance for the property of the generated molecules.
- subs_format: str = 'SMILES'
- __init__(resources_path, context, search='sample', temperature=1.4, batch_size=8, tolerance=20.0, sampling_wrapper={}, device=None)
Initialize the molecule generator.
- Parameters
resources_path (str) – directory where to find models and parameters.
context (str) – user-specified input text for the model.
search (str) – search key to instantiate a search, defaults to sample.
temperature (float) – temperature for the sampling. Defaults to 1.4.
batch_size (int) – number of points sampled per call. Defaults to 8.
tolerance (Union[float, Dict[str, float]]) – the tolerance for the property of the generated molecules. Given in percent. Defaults to 20.
sampling_wrapper (Dict[str, Any]) – A high-level entry point that allows specifying a seed SMILES alongside some target conditions. NOTE: If this is used, the target needs to be a single SMILES string. Example: {'fraction_to_mask': 0.5, 'tokens_to_mask': [], 'property_goal': {'<qed>': 0.85}}
device (Union[device, str, None]) – device where the inference is running, either as a dedicated class or a string. If not provided, it is inferred.
- validate_input_molecule(sequence, input_type=MoleculeFormat.selfies)
Verifies that the non-numerical part of the input sequence is a molecule.
- Parameters
sequence (str) – input sequence to be validated.
input_type (str) – whether the input is validated to be a SELFIES (default), SMILES or COPOLYMER.
- Return type
None
- validate_output(sequences)
Validate the output of the RT model.
- Parameters
sequences (List[Any]) – list of sequences to be validated.
- Returns
A tuple of validated items:
the validated items, a list of either Chem.rdchem.Mol (generation task) or Sequence denoting the predicted properties (regression task);
a list of valid indexes.
- get_maskable_tokens(tokens_to_mask)
Convert a user-defined list of maskable tokens into an RT model-friendly format.
- Parameters
tokens_to_mask (List[str]) – List of atoms specified in SMILES notation.
- Return type
List[str]
- Returns
List of atoms in SELFIES notation.
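For plain element symbols, the SELFIES token is the symbol wrapped in square brackets, so the conversion can be pictured with the sketch below. It is a simplification under that assumption: charges, aromatic atoms and bond-order prefixes are ignored, and the real method may rely on a SELFIES encoder instead.

```python
from typing import List


def atoms_to_selfies_tokens(tokens_to_mask: List[str]) -> List[str]:
    """Wrap plain element symbols such as 'F' or 'Cl' into SELFIES-style tokens."""
    return [f"[{atom}]" for atom in tokens_to_mask]
```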
- filter_substructures(property_successes)
Remove samples where user-required substructures are absent from generated samples.
- Parameters
property_successes (Tuple[Tuple[str, str]]) – A tuple of samples that passed the property constraints. This is a tuple of tuples, where the first element is the SMILES and the second is the property prediction string.
- Return type
Tuple[Tuple[str, str]]
- Returns
A tuple of samples that passed the property constraints and the substructure constraints. Same format as input.
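The documentation of text_filtering mentions that filtering can fall back to a pure string level. A sketch of that fallback only is shown here, with a hypothetical helper name; the RDKit-based substructure matching is deliberately left out.

```python
from typing import Sequence, Tuple


def filter_by_substring(
    property_successes: Sequence[Tuple[str, str]],
    substructures_to_keep: Sequence[str],
) -> Tuple[Tuple[str, str], ...]:
    """Keep samples whose sequence contains every required substructure as a substring."""
    return tuple(
        (sequence, prediction)
        for sequence, prediction in property_successes
        if all(sub in sequence for sub in substructures_to_keep)
    )
```

String matching is sensitive to how a molecule happens to be written, which is presumably why the graph-based match is the documented default for molecules.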
- class ProteinLanguageRT(resources_path, context, search='sample', temperature=1.4, batch_size=32, tolerance=20.0, sampling_wrapper={}, device=None)
Bases: ConditionalGenerator
Hybrid regression and conditional protein generation model as implemented in https://arxiv.org/abs/2202.01338. It generates peptides with a desired stability score or predicts the stability score of a given molecule. For details on the stability task see: https://doi.org/10.1126/science.aan0693
- resources_path
path to the model.
- context
user-specified input text for the model.
- search
search key to instantiate a search via terminator.search.SEARCH_FACTORY.
- Type
terminator.search.Search
- temperature
the temperature parameter in case of a sample search.
- batch_size
the batch size for the model, applicable only to generative task.
- Type
int
- tolerance
the tolerance for the property of the generated molecules.
- subs_format: str = 'AAS'
- __init__(resources_path, context, search='sample', temperature=1.4, batch_size=32, tolerance=20.0, sampling_wrapper={}, device=None)
Initialize the protein generator.
- Parameters
resources_path (str) – directory where to find models and parameters.
search (str) – search key to instantiate a search, defaults to sample.
temperature (float) – temperature for the sampling. Defaults to 1.4.
batch_size (int) – number of points sampled per call. Defaults to 32.
tolerance (Union[float, Dict[str, float]]) – the tolerance for the property of the generated molecules. Given in percent. Defaults to 20.0.
sampling_wrapper (Dict[str, Any]) – A high-level entry point that allows specifying a seed SMILES alongside some target conditions. NOTE: If this is used, the target needs to be a single SMILES string. Example: {'fraction_to_mask': 0.5, 'tokens_to_mask': [], 'property_goal': {'<stab>': 0.85}}
device (Union[device, str, None]) – device where the inference is running, either as a dedicated class or a string. If not provided, it is inferred.
- validate_input_molecule(sequence, input_type='')
Verifies that the non-numerical part of the input sequence is a valid AAS.
- Parameters
sequence (str) – input sequence to be validated.
input_type (str) – str argument that is ignored but needed for sibling class.
- Return type
None
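What counts as a valid AAS is not spelled out on this page. As a sketch, assuming the 20 canonical one-letter amino acid codes plus `[MASK]` tokens are the accepted alphabet, a validation could look like this (`validate_aas` is a hypothetical name):

```python
# One-letter codes of the 20 canonical amino acids (an assumed alphabet).
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_aas(sequence: str) -> None:
    """Raise ValueError if the sequence contains anything but residues and masks."""
    residues = sequence.replace("[MASK]", "")
    unknown = sorted(set(residues) - CANONICAL_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Unexpected characters in amino acid sequence: {unknown}")
```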
- validate_output(sequences)
Validate the output of the RT model.
- Parameters
sequences (List[Any]) – list of sequences to be validated.
- Returns
A tuple of validated items:
a list of validated items, either amino acid sequences (generation task) or Sequence denoting the predicted properties (regression task);
a list of valid indexes.
- get_maskable_tokens(tokens_to_mask)
Convert a user-defined list of maskable tokens (amino acids) into an RT model-friendly format. Nothing has to be done for proteins, but the function is more complex for sister classes.
- Parameters
tokens_to_mask (List[str]) – List of amino acids specified in IUPAC convention.
- Return type
List[str]
- Returns
The same list, unchanged.
- filter_substructures(property_successes)
Remove samples where user-required substructures are absent from generated samples.
- Parameters
property_successes (Tuple[Tuple[str, str]]) – A tuple of samples that passed the property constraints. This is a tuple of tuples, where the first element is the sequence and the second is the property prediction string.
- Return type
Tuple[Tuple[str, str]]
- Returns
A tuple of samples that passed the property constraints and the substructure constraints. Same format as input.