gt4sd.domains.materials.protein_encoding module

Data processing utilities.

Summary

Classes:

PrimarySequenceEncoder

Model-like class to create TAPE embeddings/encodings.

Reference

class PrimarySequenceEncoder(model_type='transformer', from_pretrained='bert-base', model_config_file=None, tokenizer='iupac')[source]

Bases: Module

Model-like class to create TAPE embeddings/encodings.

This follows TAPE's implementation via run_embed closely, but removes any seed/device/cuda handling (of model and batch). This can be done in the training loop, as for any other nn.Module.
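
For instance, moving the encoder to a device can be handled like for any other module (a minimal sketch; device placement is untested here, as the comment in the example below also notes):

encoder = PrimarySequenceEncoder()
encoder = encoder.to('cuda')  # standard nn.Module device handling, done in the training loop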

Example

An example using the protein sequence dataset from pytoda (requires mock/rdkit and pytoda>0.2), passing ids along with the primary sequences:

import sys
from mock import Mock
sys.modules['rdkit'] = Mock()
sys.modules['rdkit.Chem'] = Mock()
from torch.utils.data import DataLoader
from pytoda.datasets.protein_sequence_dataset import protein_sequence_dataset
from pytoda.datasets.tests.test_protein_sequence_dataset import (
    FASTA_CONTENT_GENERIC, TestFileContent
)
from pytoda.datasets.utils import keyed

with TestFileContent(FASTA_CONTENT_GENERIC) as a_test_file:
    sequence_dataset = keyed(protein_sequence_dataset(
        a_test_file.filename, filetype='.fasta', backend='lazy'
    ))
    batch_size = 5
    dataloader = DataLoader(sequence_dataset, batch_size=batch_size)

    encoder = PrimarySequenceEncoder(
        model_type='transformer',
        from_pretrained='bert-base',
        tokenizer='iupac',
    )
    # sending encoder to cuda device should work, not tested

    loaded = next(iter(dataloader))
    print(loaded)
    encoded, ids = encoder.forward(loaded)
    print(ids)
    print(encoded)

However, the forward call also supports omitting the ids; the batch still has to be wrapped as a list (of length 1):

encoded, dummy_ids = PrimarySequenceEncoder().forward(
    [
        ['MQNP', 'LLLLL'],  # type: Sequence[str]
        # sequence_ids may be missing here
    ]
)
__init__(model_type='transformer', from_pretrained='bert-base', model_config_file=None, tokenizer='iupac')[source]

Initialize the PrimarySequenceEncoder.

Parameters
  • model_type (str) – Which type of model to create (e.g. transformer, unirep, …). Defaults to ‘transformer’.

  • from_pretrained (Optional[str]) – either a string with the shortcut name of a pre-trained model to load from cache or download, e.g.: bert-base-uncased, or a path to a directory containing model weights saved using tape.models.modeling_utils.ProteinConfig.save_pretrained(), e.g.: ./my_model_directory/. Defaults to ‘bert-base’.

  • model_config_file (Optional[str]) – A json config file that specifies hyperparameters. Defaults to None.

  • tokenizer (str) – vocabulary name. Defaults to ‘iupac’.

Note

TAPE's default seed would be 42 (see tape.utils.set_random_seeds).
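
A minimal sketch of constructing the encoder from a local directory of saved weights, as described for from_pretrained above (the path ./my_model_directory/ is the hypothetical placeholder from that parameter description):

encoder = PrimarySequenceEncoder(
    model_type='transformer',
    from_pretrained='./my_model_directory/',  # hypothetical directory saved via save_pretrained()
    tokenizer='iupac',
)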

train(mode)[source]

Avoid switching the module to train mode.
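
A minimal sketch of the intended behavior, assuming the override keeps the encoder out of train mode (the assertion reflects the docstring, not a verified guarantee):

encoder = PrimarySequenceEncoder()
encoder.train()               # the override avoids switching to train mode
assert not encoder.training   # assumption: the module remains in evaluation mode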

generate_tokenized(batch)[source]
Return type

Iterator[Tuple[str, ndarray, ndarray]]

classmethod collate_fn(batch)[source]
Return type

Dict[str, Union[List[str], Tensor]]

from_collated_batch(batch)[source]
Return type

Dict[str, Tensor]

forward(batch)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tuple[Tensor, List[str]]
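
Following the note above, a minimal sketch that calls the module instance instead of forward directly, so registered hooks run (input format taken from the class example):

encoder = PrimarySequenceEncoder()
encoded, dummy_ids = encoder([['MQNP', 'LLLLL']])  # preferred over encoder.forward(...)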

get_dummy_ids(length)[source]
Return type

Tuple[str, ...]
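
A minimal sketch of generating placeholder ids for a batch that carries no sequence ids (the exact id values are an implementation detail and not specified here):

encoder = PrimarySequenceEncoder()
dummy_ids = encoder.get_dummy_ids(length=2)  # tuple of 2 placeholder id strings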
