gt4sd.frameworks.granular.tokenizer.tokenizer module¶
Tokenizer implementations.
Summary¶
Classes:

BasicSelfiesTokenizer – Basic SELFIES tokenizer.
BasicSmilesTokenizer – Basic SMILES tokenizer.
BasicTokenizer – Basic tokenizer.
BigSmilesTokenizer – Big-SMILES tokenizer that can build a vocabulary on the fly.
GenericTokenizer – Generic tokenizer that can build a vocabulary on the fly.
SelfiesTokenizer – SELFIES tokenizer that can build a vocabulary on the fly.
SmilesTokenizer – SMILES tokenizer that can build a vocabulary on the fly.
Tokenizer – Tokenizer that can build a vocabulary on the fly.

Functions:

load_vocab – Loads a vocabulary file into a dictionary.
selfies_alphabet – Legacy selfies 0.2.4 alphabet method.
Reference¶
- selfies_alphabet()[source]¶
Legacy selfies 0.2.4 alphabet method.
Adapted from: https://github.com/aspuru-guzik-group/selfies/blob/84122855ae76a928e1cb7d58796b8b47385a4359/selfies/selfies.py#L4.
- Return type
List[str]
- Returns
SELFIES list of tokens.
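Example (a minimal sketch; the exact tokens returned depend on the legacy selfies 0.2.4 alphabet):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import selfies_alphabet
>>> alphabet = selfies_alphabet()  # list of legacy SELFIES tokens, e.g. '[Branch1_1]'
>>> isinstance(alphabet, list)
True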
- load_vocab(vocab_file)[source]¶
Loads a vocabulary file into a dictionary.
- Parameters
vocab_file (str) – vocabulary file.
- Return type
Dict[str, int]
- Returns
vocabulary mapping tokens to indices.
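Example (a minimal sketch; "vocab.txt" is a hypothetical path, and the one-token-per-line file format is an assumption consistent with vocabularies persisted by build_vocab below):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import load_vocab
>>> # assuming a plain-text file with one token per line, e.g.:
>>> # <pad>
>>> # <unk>
>>> # C
>>> vocab = load_vocab("vocab.txt")  # hypothetical path
>>> vocab["<pad>"]  # token -> index, with the line number as index
0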
- class BasicTokenizer(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
object
Basic tokenizer.
- __init__(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a BasicTokenizer.
- Parameters
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
- tokenize(text)[source]¶
Tokenize input text.
- Parameters
text (str) – text to tokenize.
- Return type
List[str]
- Returns
list of tokens.
- build_vocab(smiles, vocab_file)[source]¶
Build and save a vocabulary given a SMILES list.
- Parameters
smiles (Iterable[str]) – iterable of SMILES.
vocab_file (str) – path to a file where the vocabulary is saved.
- Return type
List[str]
- Returns
a list of all tokens in the vocabulary.
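Example (a minimal sketch; the character-level split is an assumption based on the Tokenizer entry below, which describes BasicTokenizer as a character tokenizer, and "vocab.txt" is a hypothetical path):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicTokenizer
>>> tokenizer = BasicTokenizer()
>>> tokenizer.tokenize("CCO")  # character-level split (assumed)
['C', 'C', 'O']
>>> tokens = tokenizer.build_vocab(["CCO", "CCN"], "vocab.txt")  # writes the vocabulary file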
- class BasicSmilesTokenizer(regex_pattern='(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])', pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
BasicTokenizer
Basic SMILES tokenizer.
- __init__(regex_pattern='(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])', pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a BasicSmilesTokenizer.
- Parameters
regex_pattern (str) – regex pattern. Defaults to SMI_REGEX_PATTERN.
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
- tokenize(text)[source]¶
Tokenize input text.
- Parameters
text (str) – text to tokenize.
- Return type
List[str]
- Returns
list of tokens.
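Example (a minimal sketch showing why the regex matters: multi-character atoms such as ‘Cl’ and ‘Br’ stay intact instead of being split into characters):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicSmilesTokenizer
>>> tokenizer = BasicSmilesTokenizer()
>>> tokenizer.tokenize("CC(=O)Cl")
['C', 'C', '(', '=', 'O', ')', 'Cl']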
- class BasicSelfiesTokenizer(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
BasicTokenizer
Basic SELFIES tokenizer.
- __init__(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a BasicSelfiesTokenizer.
- Parameters
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
- smiles_to_selfies(smiles)[source]¶
Convert a list of SMILES into SELFIES.
- Parameters
smiles (Iterable[str]) – a list of SMILES.
- Return type
List[str]
- Returns
a list of SELFIES.
- tokenize(text)[source]¶
Tokenize input text.
- Parameters
text (str) – text to tokenize.
- Return type
List[str]
- Returns
list of tokens.
- build_vocab(smiles, vocab_file)[source]¶
Build and save a vocabulary given a SMILES list.
- Parameters
smiles (Iterable[str]) – iterable of SMILES.
vocab_file (str) – path to a file where the vocabulary is saved.
- Return type
List[str]
- Returns
a list of all tokens in the vocabulary.
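Example (a minimal sketch; the exact SELFIES strings depend on the installed selfies version):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicSelfiesTokenizer
>>> tokenizer = BasicSelfiesTokenizer()
>>> selfies = tokenizer.smiles_to_selfies(["CCO"])  # e.g. ['[C][C][O]']
>>> tokens = tokenizer.tokenize(selfies[0])  # e.g. ['[C]', '[C]', '[O]']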
- class Tokenizer(vocab_file, basic_tokenizer=<gt4sd.frameworks.granular.tokenizer.tokenizer.BasicTokenizer object>, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
object
Tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, basic_tokenizer=<gt4sd.frameworks.granular.tokenizer.tokenizer.BasicTokenizer object>, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a Tokenizer.
- Parameters
vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
basic_tokenizer (BasicTokenizer) – a basic tokenizer. Defaults to a BasicTokenizer character tokenizer.
smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
- property vocab_size: int¶
Size of the vocabulary.
- Return type
int
- Returns
size of the vocabulary.
- property vocab_list: List[str]¶
Return vocabulary tokens.
- Return type
List[str]
- Returns
all tokens from the vocabulary.
- tokenize(text)[source]¶
Tokenize a given text.
- Parameters
text (str) – text to tokenize.
- Return type
List[str]
- Returns
list of tokens.
- convert_tokens_to_ids(tokens)[source]¶
Convert tokens to indices.
- Parameters
tokens (List[str]) – list of tokens.
- Return type
List[int]
- Returns
list of indices.
- convert_token_to_id(token)[source]¶
Convert token to index.
- Parameters
token (str) – a token.
- Return type
int
- Returns
index corresponding to the input token, or the unknown token index if the token is not present in the vocabulary.
- convert_id_to_token(index)[source]¶
Convert index to token.
- Parameters
index (int) – an index.
- Return type
str
- Returns
token corresponding to the input index, or the unknown token if the index is not found.
- add_padding_tokens(token_ids, length, right=True)[source]¶
Add padding token indices to the provided token indices.
- Parameters
token_ids (List[int]) – token indices.
length (int) – length of the padded sequence.
right (bool) – whether the padding is performed on the right. Defaults to True; if False, the padding happens on the left.
- Return type
List[int]
- Returns
the padded sequence.
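Example (a minimal end-to-end sketch of the encode/decode round trip; "vocab.txt" is a hypothetical path, and pad-to-target-length semantics for add_padding_tokens is an assumption based on the parameter description above):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import Tokenizer
>>> tokenizer = Tokenizer("vocab.txt", smiles=["CCO", "CCN"])  # builds the vocabulary when the file is missing
>>> tokens = tokenizer.tokenize("CCO")
>>> ids = tokenizer.convert_tokens_to_ids(tokens)
>>> padded = tokenizer.add_padding_tokens(ids, length=8)  # right-padding by default (assumed: pads to total length 8)
>>> token = tokenizer.convert_id_to_token(ids[0])  # maps back; unknown token for unseen indices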
- class GenericTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
Tokenizer
Generic tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a GenericTokenizer.
- Parameters
vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
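Example (a minimal sketch; despite the smiles parameter name, arbitrary strings work here since the underlying BasicTokenizer is character-level, and "generic_vocab.txt" is a hypothetical path):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import GenericTokenizer
>>> tokenizer = GenericTokenizer("generic_vocab.txt", smiles=["hello", "world"])
>>> tokens = tokenizer.tokenize("hello")  # expected: ['h', 'e', 'l', 'l', 'o']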
- class SmilesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
Tokenizer
SMILES tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a SmilesTokenizer.
- Parameters
vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
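Example (a minimal sketch; the difference from GenericTokenizer is regex-based SMILES tokenization, an assumption consistent with BasicSmilesTokenizer above, and "smiles_vocab.txt" is a hypothetical path):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import SmilesTokenizer
>>> tokenizer = SmilesTokenizer("smiles_vocab.txt", smiles=["CC(=O)Cl", "c1ccccc1"])
>>> ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("CC(=O)Cl"))  # 'Cl' stays one token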
- class BigSmilesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
Tokenizer
Big-SMILES tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a BigSmilesTokenizer.
- Parameters
vocab_file (str) – path to vocabulary file. If the file is not present, the provided Big-SMILES list is used to generate one.
smiles (List[str]) – list of Big-SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
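Example (a minimal sketch; the Big-SMILES strings below are illustrative polymer notation, not taken from the source, and "big_vocab.txt" is a hypothetical path):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import BigSmilesTokenizer
>>> tokenizer = BigSmilesTokenizer("big_vocab.txt", smiles=["{[$]CC[$]}", "{[$]CC(C)[$]}"])
>>> all_tokens = tokenizer.vocab_list  # inspect the vocabulary built from the Big-SMILES list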
- class SelfiesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Bases:
Tokenizer
SELFIES tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
Constructs a SelfiesTokenizer.
- Parameters
vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
pad_token (str) – padding token. Defaults to ‘<pad>’.
sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
eos_token (str) – end of sequence token. Defaults to ‘</s>’.
unk_token (str) – unknown token. Defaults to ‘<unk>’.
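Example (a minimal sketch; the assumption, consistent with BasicSelfiesTokenizer above, is that SMILES inputs are converted to SELFIES before the vocabulary is built, and "selfies_vocab.txt" is a hypothetical path):

>>> from gt4sd.frameworks.granular.tokenizer.tokenizer import SelfiesTokenizer
>>> tokenizer = SelfiesTokenizer("selfies_vocab.txt", smiles=["CCO", "CCN"])
>>> all_tokens = tokenizer.vocab_list  # SELFIES tokens such as '[C]' plus the special tokens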