gt4sd.frameworks.granular.tokenizer.tokenizer module

Tokenizer implementations.

Summary

Classes:

BasicSelfiesTokenizer

Basic SELFIES tokenizer.

BasicSmilesTokenizer

Basic SMILES tokenizer.

BasicTokenizer

Basic tokenizer.

BigSmilesTokenizer

Big-SMILES tokenizer that can build a vocabulary on the fly.

GenericTokenizer

Generic tokenizer that can build a vocabulary on the fly.

SelfiesTokenizer

SELFIES tokenizer that can build a vocabulary on the fly.

SmilesTokenizer

SMILES tokenizer that can build a vocabulary on the fly.

Tokenizer

Tokenizer that can build a vocabulary on the fly.

Functions:

load_vocab

Loads a vocabulary file into a dictionary.

selfies_alphabet

Legacy selfies 0.2.4 alphabet method.

Reference

selfies_alphabet()[source]

Legacy selfies 0.2.4 alphabet method.

Adapted from: https://github.com/aspuru-guzik-group/selfies/blob/84122855ae76a928e1cb7d58796b8b47385a4359/selfies/selfies.py#L4.

Return type

List[str]

Returns

SELFIES list of tokens.
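
A minimal usage sketch; the exact token set depends on the vendored legacy alphabet, so treat the contents as illustrative:

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import selfies_alphabet
>>> alphabet = selfies_alphabet()
>>> isinstance(alphabet, list)  # a plain list of SELFIES token strings such as '[C]'
True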

load_vocab(vocab_file)[source]

Loads a vocabulary file into a dictionary.

Parameters

vocab_file (str) – vocabulary file.

Return type

Dict[str, int]

Returns

vocabulary mapping tokens to indices.
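
A minimal sketch, assuming the common plain-text vocabulary layout of one token per line (the file path is hypothetical):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import load_vocab
>>> vocab = load_vocab("vocab.txt")  # hypothetical path; one token per line
>>> # tokens map to their line index, e.g. vocab['<pad>'] == 0 if '<pad>' is the first line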

class BasicTokenizer(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: object

Basic tokenizer.

__init__(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a BasicTokenizer.

Parameters
  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.

tokenize(text)[source]

Tokenize input text.

Parameters

text (str) – text to tokenize.

Return type

List[str]

Returns

list of tokens.

build_vocab(smiles, vocab_file)[source]

Build and save a vocabulary given a SMILES list.

Parameters
  • smiles (Iterable[str]) – iterable of SMILES.

  • vocab_file (str) – path to a file where the vocabulary is saved.

Return type

List[str]

Returns

a list of all tokens in the vocabulary.
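
A usage sketch: BasicTokenizer tokenizes at the character level (as noted in the Tokenizer constructor docs below), and build_vocab collects tokens from a SMILES corpus and writes them to the given file; the output path and corpus here are illustrative assumptions:

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicTokenizer
>>> tokenizer = BasicTokenizer()
>>> tokenizer.tokenize("CCO")  # character-level splitting
['C', 'C', 'O']
>>> tokens = tokenizer.build_vocab(["CCO", "CCN"], "vocab.txt")  # hypothetical output path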

class BasicSmilesTokenizer(regex_pattern='(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])', pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: BasicTokenizer

Basic SMILES tokenizer.

__init__(regex_pattern='(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])', pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a BasicSmilesTokenizer.

Parameters
  • regex_pattern (str) – regex pattern. Defaults to SMI_REGEX_PATTERN.

  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.

tokenize(text)[source]

Tokenize input text.

Parameters

text (str) – text to tokenize.

Return type

List[str]

Returns

list of tokens.
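
A tokenization sketch using the default regex pattern; with that pattern, multi-character atoms such as Br and Cl remain single tokens:

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicSmilesTokenizer
>>> tokenizer = BasicSmilesTokenizer()
>>> tokenizer.tokenize("BrCC(=O)Cl")
['Br', 'C', 'C', '(', '=', 'O', ')', 'Cl']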

class BasicSelfiesTokenizer(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: BasicTokenizer

Basic SELFIES tokenizer.

__init__(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a BasicSelfiesTokenizer.

Parameters
  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.

smiles_to_selfies(smiles)[source]

Convert a list of SMILES into SELFIES.

Parameters

smiles (Iterable[str]) – a list of SMILES.

Return type

List[str]

Returns

a list of SELFIES.
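
A conversion sketch; '[C][C][O]' is what the selfies encoder produces for ethanol, though exact output may vary across selfies versions:

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicSelfiesTokenizer
>>> tokenizer = BasicSelfiesTokenizer()
>>> tokenizer.smiles_to_selfies(["CCO"])
['[C][C][O]']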

tokenize(text)[source]

Tokenize input text.

Parameters

text (str) – text to tokenize.

Return type

List[str]

Returns

list of tokens.

build_vocab(smiles, vocab_file)[source]

Build and save a vocabulary given a SMILES list.

Parameters
  • smiles (Iterable[str]) – iterable of SMILES.

  • vocab_file (str) – path to a file where the vocabulary is saved.

Return type

List[str]

Returns

a list of all tokens in the vocabulary.

class Tokenizer(vocab_file, basic_tokenizer=<gt4sd.frameworks.granular.tokenizer.tokenizer.BasicTokenizer object>, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: object

Tokenizer that can build a vocabulary on the fly.

__init__(vocab_file, basic_tokenizer=<gt4sd.frameworks.granular.tokenizer.tokenizer.BasicTokenizer object>, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a Tokenizer.

Parameters
  • vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.

  • basic_tokenizer (BasicTokenizer) – a basic tokenizer. Defaults to a character-level BasicTokenizer.

  • smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.

  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.

property vocab_size: int

Size of the vocabulary.

Return type

int

Returns

vocabulary size.

property vocab_list: List[str]

Return vocabulary tokens.

Return type

List[str]

Returns

all tokens from the vocabulary.

tokenize(text)[source]

Tokenize a given text.

Parameters

text (str) – text to tokenize.

Return type

List[str]

Returns

list of tokens.

convert_tokens_to_ids(tokens)[source]

Convert tokens to indices.

Parameters

tokens (List[str]) – list of tokens.

Return type

List[int]

Returns

list of indices.

convert_token_to_id(token)[source]

Convert token to index.

Parameters

token (str) – a token.

Return type

int

Returns

index corresponding to the input token, or the unknown token index if the input token is not present in the vocabulary.

convert_id_to_token(index)[source]

Convert index to token.

Parameters

index (int) – an index.

Return type

str

Returns

token corresponding to the input index, or the unknown token if the input index is not found.

add_padding_tokens(token_ids, length, right=True)[source]

Add padding token indices to the provided token indices.

Parameters
  • token_ids (List[int]) – token indices.

  • length (int) – target length of the padded sequence.

  • right (bool) – whether the padding is applied on the right. Defaults to True; if False, the padding is applied on the left.

Return type

List[int]

Returns

the padded sequence.
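
Putting the pieces together, a minimal end-to-end sketch (the vocabulary path and corpus are hypothetical; if the file does not exist, it is built from the provided SMILES):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import Tokenizer
>>> tokenizer = Tokenizer("vocab.txt", smiles=["CCO", "CCN"])  # hypothetical path and corpus
>>> tokens = tokenizer.tokenize("CCO")
>>> token_ids = tokenizer.convert_tokens_to_ids(tokens)
>>> padded = tokenizer.add_padding_tokens(token_ids, length=16)  # right-pad to length 16
>>> tokenizer.convert_id_to_token(token_ids[0])  # round-trip back to a token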

class GenericTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: Tokenizer

Generic tokenizer that can build a vocabulary on the fly.

__init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a GenericTokenizer.

Parameters
  • vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.

  • smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.

  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.

class SmilesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: Tokenizer

SMILES tokenizer that can build a vocabulary on the fly.

__init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a SmilesTokenizer.

Parameters
  • vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.

  • smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.

  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.
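
Judging from the signatures above, the subclasses differ mainly in the basic tokenizer they wire into the Tokenizer base class; a construction sketch for SmilesTokenizer, with a hypothetical path and corpus (the same pattern applies to GenericTokenizer, BigSmilesTokenizer and SelfiesTokenizer):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import SmilesTokenizer
>>> tokenizer = SmilesTokenizer("smiles_vocab.txt", smiles=["c1ccccc1", "CCO"])
>>> tokenizer.vocab_size  # number of entries in the vocabulary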

class BigSmilesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: Tokenizer

Big-SMILES tokenizer that can build a vocabulary on the fly.

__init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a BigSmilesTokenizer.

Parameters
  • vocab_file (str) – path to vocabulary file. If the file is not present, the provided Big-SMILES list is used to generate one.

  • smiles (List[str]) – list of Big-SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.

  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.

class SelfiesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Bases: Tokenizer

SELFIES tokenizer that can build a vocabulary on the fly.

__init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]

Constructs a SelfiesTokenizer.

Parameters
  • vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.

  • smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.

  • pad_token (str) – padding token. Defaults to ‘<pad>’.

  • sos_token (str) – start of sequence token. Defaults to ‘<sos>’.

  • eos_token (str) – end of sequence token. Defaults to ‘</s>’.

  • unk_token (str) – unknown token. Defaults to ‘<unk>’.
