gt4sd.frameworks.granular.tokenizer.tokenizer module¶
Tokenizers implementations.
Summary¶
Classes:

 BasicSelfiesTokenizer: Basic SELFIES tokenizer.
 BasicSmilesTokenizer: Basic SMILES tokenizer.
 BasicTokenizer: Basic tokenizer.
 BigSmilesTokenizer: Big-SMILES tokenizer that can build a vocabulary on the fly.
 GenericTokenizer: Generic tokenizer that can build a vocabulary on the fly.
 SelfiesTokenizer: SELFIES tokenizer that can build a vocabulary on the fly.
 SmilesTokenizer: SMILES tokenizer that can build a vocabulary on the fly.
 Tokenizer: Tokenizer that can build a vocabulary on the fly.

Functions:

 load_vocab: Loads a vocabulary file into a dictionary.
 selfies_alphabet: Legacy selfies 0.2.4 alphabet method.
Reference¶
- selfies_alphabet()[source]¶
 Legacy selfies 0.2.4 alphabet method.
Adapted from: https://github.com/aspuru-guzik-group/selfies/blob/84122855ae76a928e1cb7d58796b8b47385a4359/selfies/selfies.py#L4.
- Return type
 List[str]
- Returns
 SELFIES list of tokens.
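 Example (a minimal usage sketch; the exact symbols returned depend on the legacy selfies 0.2.4 alphabet):

    from gt4sd.frameworks.granular.tokenizer.tokenizer import selfies_alphabet

    alphabet = selfies_alphabet()
    print(len(alphabet))  # number of legacy SELFIES symbols
    print(alphabet[:5])   # first few symbols of the legacy alphabet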
- load_vocab(vocab_file)[source]¶
 Loads a vocabulary file into a dictionary.
- Parameters
 vocab_file (str) – vocabulary file.
- Return type
 Dict[str, int]
- Returns
 vocabulary mapping tokens to indices.
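 Example (a sketch assuming the common one-token-per-line vocabulary format; the file name is hypothetical):

    from gt4sd.frameworks.granular.tokenizer.tokenizer import load_vocab

    # Hypothetical vocabulary file: one token per line, index given by line order.
    with open("vocab.txt", "w") as fp:
        fp.write("\n".join(["<pad>", "<sos>", "</s>", "<unk>", "C", "O"]))

    vocab = load_vocab("vocab.txt")
    print(vocab["C"])  # 4, if indices follow line order as assumed above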
- class BasicTokenizer(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: object
 Basic tokenizer.
- __init__(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a BasicTokenizer.
- Parameters
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- tokenize(text)[source]¶
 Tokenize input text.
- Parameters
 text (str) – text to tokenize.
- Return type
 List[str]
- Returns
 list of tokens.
- build_vocab(smiles, vocab_file)[source]¶
 Build and save a vocabulary given a SMILES list.
- Parameters
 smiles (Iterable[str]) – iterable of SMILES.
 vocab_file (str) – path to a file where the vocabulary is saved.
- Return type
 List[str]
- Returns
 a list of all tokens in the vocabulary.
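 Example (a usage sketch; the character-level split follows the Tokenizer docstring below, and the file name is hypothetical):

    from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicTokenizer

    tokenizer = BasicTokenizer()
    tokens = tokenizer.tokenize("CCO")  # character tokenizer: ['C', 'C', 'O']
    vocab = tokenizer.build_vocab(["CCO", "CCN"], "basic_vocab.txt")  # also writes the file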
- class BasicSmilesTokenizer(regex_pattern='(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])', pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: BasicTokenizer
 Basic SMILES tokenizer.
- __init__(regex_pattern='(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])', pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a BasicSmilesTokenizer.
- Parameters
 regex_pattern (str) – regex pattern. Defaults to SMI_REGEX_PATTERN.
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- tokenize(text)[source]¶
 Tokenize input text.
- Parameters
 text (str) – text to tokenize.
- Return type
 List[str]
- Returns
 list of tokens.
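 Example (given the default regex above, multi-character atoms such as Br are kept as single tokens):

    from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicSmilesTokenizer

    tokenizer = BasicSmilesTokenizer()
    print(tokenizer.tokenize("c1ccccc1Br"))
    # expected with the default pattern: ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']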
- class BasicSelfiesTokenizer(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: BasicTokenizer
 Basic SELFIES tokenizer.
- __init__(pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a BasicSelfiesTokenizer.
- Parameters
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- smiles_to_selfies(smiles)[source]¶
 Convert a list of SMILES into SELFIES.
- Parameters
 smiles (Iterable[str]) – a list of SMILES.
- Return type
 List[str]
- Returns
 a list of SELFIES.
- tokenize(text)[source]¶
 Tokenize input text.
- Parameters
 text (str) – text to tokenize.
- Return type
 List[str]
- Returns
 list of tokens.
- build_vocab(smiles, vocab_file)[source]¶
 Build and save a vocabulary given a SMILES list.
- Parameters
 smiles (Iterable[str]) – iterable of SMILES.
 vocab_file (str) – path to a file where the vocabulary is saved.
- Return type
 List[str]
- Returns
 a list of all tokens in the vocabulary.
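 Example (a sketch of the SMILES-to-SELFIES path; the exact SELFIES strings depend on the installed selfies version):

    from gt4sd.frameworks.granular.tokenizer.tokenizer import BasicSelfiesTokenizer

    tokenizer = BasicSelfiesTokenizer()
    selfies = tokenizer.smiles_to_selfies(["CCO"])  # e.g. ['[C][C][O]']
    tokens = tokenizer.tokenize(selfies[0])         # e.g. ['[C]', '[C]', '[O]']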
- class Tokenizer(vocab_file, basic_tokenizer=<gt4sd.frameworks.granular.tokenizer.tokenizer.BasicTokenizer object>, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: object
 Tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, basic_tokenizer=<gt4sd.frameworks.granular.tokenizer.tokenizer.BasicTokenizer object>, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a Tokenizer.
- Parameters
 vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
 basic_tokenizer (BasicTokenizer) – a basic tokenizer. Defaults to a character-level BasicTokenizer.
 smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- property vocab_size: int¶
 Size of the vocabulary.
- Return type
 int
- Returns
 the size of the vocabulary.
- property vocab_list: List[str]¶
 Return vocabulary tokens.
- Return type
 List[str]
- Returns
 all tokens from the vocabulary.
- tokenize(text)[source]¶
 Tokenize a given text.
- Parameters
 text (str) – text to tokenize.
- Return type
 List[str]
- Returns
 list of tokens.
- convert_tokens_to_ids(tokens)[source]¶
 Convert tokens to indices.
- Parameters
 tokens (List[str]) – list of tokens.
- Return type
 List[int]
- Returns
 list of indices.
- convert_token_to_id(token)[source]¶
 Convert token to index.
- Parameters
 token (str) – a token.
- Return type
 int
- Returns
 index corresponding to the input token. Unknown token index if the input token is not present in the vocabulary.
- convert_id_to_token(index)[source]¶
 Convert index to token.
- Parameters
 index (int) – an index.
- Return type
 str
- Returns
 token corresponding to the input index. Unknown token if the input index is not found.
- add_padding_tokens(token_ids, length, right=True)[source]¶
 Add padding token indices to the provided token indices.
- Parameters
 token_ids (List[int]) – token indices.
 length (int) – length of the padded sequence.
 right (bool) – whether the padding is performed on the right. Defaults to True; if False, the padding happens on the left.
- Return type
 List[int]
- Returns
 the padded sequence.
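 Example (a roundtrip sketch covering vocabulary construction, encoding, and padding; 'vocab.txt' is a hypothetical path, created on the fly if missing):

    from gt4sd.frameworks.granular.tokenizer.tokenizer import Tokenizer

    tokenizer = Tokenizer("vocab.txt", smiles=["CCO", "CCN"])  # builds vocab.txt if absent
    token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("CCO"))
    padded = tokenizer.add_padding_tokens(token_ids, length=16)  # right padding by default
    tokens = [tokenizer.convert_id_to_token(i) for i in padded]
    print(tokenizer.vocab_size)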
- class GenericTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: Tokenizer
 Generic tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a GenericTokenizer.
- Parameters
 vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
 smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- class SmilesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: Tokenizer
 SMILES tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a SmilesTokenizer.
- Parameters
 vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
 smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- class BigSmilesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: Tokenizer
 Big-SMILES tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a BigSmilesTokenizer.
- Parameters
 vocab_file (str) – path to vocabulary file. If the file is not present, the provided Big-SMILES list is used to generate one.
 smiles (List[str]) – list of Big-SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.
- class SelfiesTokenizer(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Bases: Tokenizer
 SELFIES tokenizer that can build a vocabulary on the fly.
- __init__(vocab_file, smiles=[], pad_token='<pad>', sos_token='<sos>', eos_token='</s>', unk_token='<unk>')[source]¶
 Constructs a SelfiesTokenizer.
- Parameters
 vocab_file (str) – path to vocabulary file. If the file is not present, the provided SMILES list is used to generate one.
 smiles (List[str]) – list of SMILES. Defaults to an empty list; used only if the vocabulary file does not exist.
 pad_token (str) – padding token. Defaults to ‘<pad>’.
 sos_token (str) – start of sequence token. Defaults to ‘<sos>’.
 eos_token (str) – end of sequence token. Defaults to ‘</s>’.
 unk_token (str) – unknown token. Defaults to ‘<unk>’.