gt4sd.domains.materials.protein_encoding module¶
Data processing utilities.
Summary¶
Classes:
PrimarySequenceEncoder | Model-like class to create tape embeddings/encodings.
Reference¶
- class PrimarySequenceEncoder(model_type='transformer', from_pretrained='bert-base', model_config_file=None, tokenizer='iupac')[source]¶
Bases: Module
Model-like class to create tape embeddings/encodings.
This follows tape's implementation via run_embed closely, but removes any seed/device/cuda handling (of model and batch). This can be done in the training loop like for any other nn.Module.
Example
An example use with a protein sequence dataset from pytoda (requires mock/rdkit and pytoda>0.2), passing ids along with the primary sequences:
import logging
import sys
from mock import Mock
sys.modules['rdkit'] = Mock()
sys.modules['rdkit.Chem'] = Mock()
from torch.utils.data import DataLoader
from pytoda.datasets.protein_sequence_dataset import protein_sequence_dataset
from pytoda.datasets.tests.test_protein_sequence_dataset import (
    FASTA_CONTENT_GENERIC, TestFileContent
)
from pytoda.datasets.utils import keyed

with TestFileContent(FASTA_CONTENT_GENERIC) as a_test_file:
    sequence_dataset = keyed(protein_sequence_dataset(
        a_test_file.filename, filetype='.fasta', backend='lazy'
    ))
    batch_size = 5
    dataloader = DataLoader(sequence_dataset, batch_size=batch_size)

    encoder = PrimarySequenceEncoder(
        model_type='transformer',
        from_pretrained='bert-base',
        tokenizer='iupac',
        log_level=logging.INFO,
    )
    # sending encoder to cuda device should work, not tested

    loaded = next(iter(dataloader))
    print(loaded)
    encoded, ids = encoder.forward(loaded)
    print(ids)
    print(encoded)
However, the forward call also supports not passing ids; the batch still has to be wrapped as a list (of length 1):
encoded, dummy_ids = PrimarySequenceEncoder().forward(
    [
        ['MQNP', 'LLLLL'],  # type: Sequence[str]
        # sequence_ids may be missing here
    ]
)
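As noted above, device handling is intentionally left to the training loop, so moving the encoder to a GPU should work like for any other nn.Module; a minimal, untested sketch (the device choice is illustrative):

import torch

# The encoder does no cuda handling itself; place it like any other module.
# GPU use is stated as untested in the example above.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = PrimarySequenceEncoder().to(device)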
- __init__(model_type='transformer', from_pretrained='bert-base', model_config_file=None, tokenizer='iupac')[source]¶
Initialize the PrimarySequenceEncoder (a brief instantiation sketch follows the parameter list below).
- Parameters
  - model_type (str) – Which type of model to create (e.g. transformer, unirep, …). Defaults to 'transformer'.
  - from_pretrained (Optional[str]) – Either a string with the shortcut name of a pre-trained model to load from cache or download, e.g. bert-base-uncased, or a path to a directory containing model weights saved using tape.models.modeling_utils.ProteinConfig.save_pretrained(), e.g. ./my_model_directory/. Defaults to 'bert-base'.
  - model_config_file (Optional[str]) – A json config file that specifies hyperparameters. Defaults to None.
  - tokenizer (str) – Vocabulary name. Defaults to 'iupac'.
Note
tape's default seed would be 42 (see tape.utils.set_random_seeds).
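For illustration, a hedged instantiation sketch using the documented defaults and, alternatively, a local weights directory; ./my_model_directory/ is the placeholder path from the parameter description above, not a real checkpoint:

# Minimal sketch, assuming the documented defaults ('transformer', 'bert-base', 'iupac').
encoder = PrimarySequenceEncoder()

# Explicit form pointing from_pretrained at a local directory previously
# created with tape.models.modeling_utils.ProteinConfig.save_pretrained()
# (placeholder path, for illustration only).
encoder_local = PrimarySequenceEncoder(
    model_type='transformer',
    from_pretrained='./my_model_directory/',
    model_config_file=None,
    tokenizer='iupac',
)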
- forward(batch)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
  Tuple[Tensor, List[str]]
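A minimal sketch of the call pattern implied by the note above: invoke the module instance rather than forward() directly, and unpack the (Tensor, List[str]) result; the sequence ids here are illustrative labels, not part of the API:

encoder = PrimarySequenceEncoder()
encoded, ids = encoder(
    [
        ['MQNP', 'LLLLL'],   # primary sequences
        ['seq_a', 'seq_b'],  # sequence ids (illustrative)
    ]
)
# encoded: torch.Tensor of embeddings, ids: List[str]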