gt4sd.frameworks.cgcnn.data module

Data module.

Summary

Classes:

`AtomCustomJSONInitializer`	Initialize atom feature vectors using a JSON file, which is a Python dictionary mapping from element number to a list representing the feature vector of the element.
`AtomInitializer`	Base class for initializing the vector representation for atoms.
`CIFData`	The CIFData dataset is a wrapper for a dataset where the crystal structures are stored in the form of CIF files.
`GaussianDistance`	Expands the distance by Gaussian basis.

Functions:

`collate_pool`	Collate a list of data and return a batch for predicting crystal properties.
`get_train_val_test_loader`	Utility function for dividing a dataset into train, validation, and test datasets.

Reference

get_train_val_test_loader(dataset, collate_fn=<function default_collate>, batch_size=64, train_ratio=None, val_ratio=0.1, test_ratio=0.1, return_test=False, num_workers=1, pin_memory=False, **kwargs)

Utility function for dividing a dataset into train, validation, and test datasets.

!!! The dataset needs to be shuffled before using this function !!!

Parameters

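dataset (Dataset) – The full dataset to be divided.
collate_fn (Callable[[List[Any]], Any]) – Function passed to torch.utils.data.DataLoader to merge a list of samples into a batch.
batch_size (int) – Number of samples per batch.
train_ratio (Optional[float, None]) – Fraction of the dataset used for training.
val_ratio (float) – Fraction of the dataset used for validation.
test_ratio (float) – Fraction of the dataset used for testing.
return_test (bool) – Whether to return the test dataset loader. If False, the last test_size data will be hidden.
num_workers (int) – Number of worker processes used by each DataLoader.
pin_memory (bool) – Whether each DataLoader uses pinned memory.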

Returns

train_loader: torch.utils.data.DataLoader

DataLoader that randomly samples the training data.

val_loader: torch.utils.data.DataLoader

DataLoader that randomly samples the validation data.

(test_loader): torch.utils.data.DataLoader

DataLoader that randomly samples the test data. Returned only if return_test=True.
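
The shuffling warning and the note that "the last test_size data will be hidden" suggest a positional split. The sketch below illustrates that idea on plain indices only. It assumes training indices come from the front of the dataset, test indices from the end, and validation indices just before the test block; `split_indices` is a hypothetical helper, not part of gt4sd.

```python
def split_indices(total_size, val_ratio=0.1, test_ratio=0.1, train_ratio=None):
    """Hypothetical helper: positional split of range(total_size).

    Assumption: train comes first, test comes last, and validation
    sits immediately before the test block.
    """
    if train_ratio is None:
        train_ratio = 1.0 - val_ratio - test_ratio
    train_size = int(train_ratio * total_size)
    val_size = int(val_ratio * total_size)
    test_size = int(test_ratio * total_size)

    indices = list(range(total_size))
    test_start = total_size - test_size
    train_idx = indices[:train_size]
    val_idx = indices[test_start - val_size:test_start]
    test_idx = indices[test_start:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(100)
```

Because such a split depends only on position, any ordering in the dataset (for example, crystals sorted by target value) would leak into the three subsets, which is why shuffling beforehand matters.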

collate_pool(dataset_list)

Collate a list of data and return a batch for predicting crystal properties.

Parameters

dataset_list (List[Any]) –

List of tuples for each data point: (atom_fea, nbr_fea, nbr_fea_idx, target)

atom_fea: torch.Tensor shape (n_i, atom_fea_len).
nbr_fea: torch.Tensor shape (n_i, M, nbr_fea_len).
nbr_fea_idx: torch.LongTensor shape (n_i, M).
target: torch.Tensor shape (1, ).
cif_id: str or int.

Return type

Tuple[Tuple[Tensor, Tensor, Tensor, List[LongTensor]], Tensor, List[Any]]

Returns

N = sum(n_i) is the total number of atoms in the batch; N0 is the number of crystals in the batch.

batch_atom_fea: torch.Tensor shape (N, orig_atom_fea_len): Atom features from atom type.
batch_nbr_fea: torch.Tensor shape (N, M, nbr_fea_len): Bond features of each atom’s M neighbors.
batch_nbr_fea_idx: torch.LongTensor shape (N, M): Indices of M neighbors of each atom.
crystal_atom_idx: list of torch.LongTensor of length N0: Mapping from the crystal idx to atom idx.
target: torch.Tensor shape (N0, 1): Target value for prediction.
batch_cif_ids: list: The cif_id of each crystal in the batch.
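
To make the index bookkeeping concrete, here is a minimal sketch using plain Python lists instead of tensors. It assumes that each crystal's neighbor indices are local (starting at 0) and are shifted by the number of atoms already placed in the batch; `collate_indices` is a hypothetical helper written for illustration, not the library function.

```python
def collate_indices(nbr_idx_per_crystal):
    """Hypothetical helper: merge per-crystal neighbor indices into one batch.

    nbr_idx_per_crystal: one entry per crystal, each a list of n_i rows
    holding M local neighbor indices.
    """
    batch_nbr_idx = []
    crystal_atom_idx = []
    base = 0  # number of atoms already placed in the batch
    for nbr_idx in nbr_idx_per_crystal:
        n_i = len(nbr_idx)
        # Shift local indices so they point into the concatenated atom list.
        batch_nbr_idx.extend([[j + base for j in row] for row in nbr_idx])
        # Record which batch rows belong to this crystal.
        crystal_atom_idx.append(list(range(base, base + n_i)))
        base += n_i
    return batch_nbr_idx, crystal_atom_idx


# Two crystals with 2 and 3 atoms, M = 1 neighbor per atom.
batch_nbr_idx, crystal_atom_idx = collate_indices([[[1], [0]], [[1], [2], [0]]])
```

Here crystal_atom_idx has one entry per crystal (N0 = 2) while batch_nbr_idx has one row per atom (N = 5).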

class GaussianDistance(dmin, dmax, step, var=None)

Bases: object

Expands the distance by Gaussian basis.

Unit: angstrom

__init__(dmin, dmax, step, var=None)

Parameters

dmin (float) – Minimum interatomic distance.
dmax (float) – Maximum interatomic distance.
step (float) – Step size for the Gaussian filter.

expand(distances)

Apply the Gaussian distance filter to a numpy distance array.

Parameters

distances – np.array, n-d array. A distance matrix of any shape.

Returns

expanded_distance: np.array, (n+1)-d array. Expanded distance matrix with the last dimension of length len(self.filter).
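
The shape contract above can be illustrated with a short NumPy sketch. It assumes that the filter is a grid of Gaussian centers from dmin to dmax spaced by step, and that var falls back to step when it is None; neither detail is stated on this page, and `gaussian_expand` is a hypothetical stand-in for the class.

```python
import numpy as np


def gaussian_expand(distances, dmin, dmax, step, var=None):
    """Hypothetical stand-in: expand distances on a Gaussian basis.

    Assumptions: centers lie on a regular grid from dmin to dmax,
    and the Gaussian width defaults to the grid step.
    """
    centers = np.arange(dmin, dmax + step, step)
    width = step if var is None else var
    # Broadcasting adds one trailing axis of length len(centers).
    return np.exp(-((distances[..., np.newaxis] - centers) ** 2) / width**2)


distances = np.array([[0.0, 1.0], [2.0, 3.0]])  # angstrom
expanded = gaussian_expand(distances, dmin=0.0, dmax=4.0, step=1.0)
```

Each scalar distance becomes a vector that peaks at the center closest to it, so a 2-d input of shape (2, 2) yields a 3-d output of shape (2, 2, 5) in this example.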

class AtomInitializer(atom_types)

Bases: object

Base class for initializing the vector representation for atoms.

!!! Use one AtomInitializer per dataset !!!

__init__(atom_types)

get_atom_fea(atom_type)

load_state_dict(state_dict)

state_dict()

decode(idx)

class AtomCustomJSONInitializer(elem_embedding_file)

Bases: AtomInitializer

Initialize atom feature vectors using a JSON file, which is a Python dictionary mapping from element number to a list representing the feature vector of the element.

__init__(elem_embedding_file)

Parameters: elem_embedding_file (str) – The path to the .json file.
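
As a rough illustration of the file format described above, the sketch below parses a small embedding from a JSON string. The element numbers and feature values are made up, and converting the keys to int is an assumption about how such a file would typically be consumed (JSON object keys are always strings).

```python
import json

# Made-up embedding: element number -> feature vector.
raw = '{"1": [1, 0, 0], "8": [0, 1, 0]}'

# JSON keys are strings, so convert them back to element numbers.
embedding = {
    int(element): [float(value) for value in features]
    for element, features in json.loads(raw).items()
}
```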

class CIFData(root_dir, max_num_nbr=12, radius=8, dmin=0, step=0.2, random_seed=123, atom_initialization=None)

Bases: Dataset

The CIFData dataset is a wrapper for a dataset where the crystal structures are stored in the form of CIF files. The dataset should have the following directory structure:

root_dir
├── id_prop.csv
├── atom_init.json
├── id0.cif
├── id1.cif
├── …

id_prop.csv: a CSV file with two columns. The first column records a unique ID for each crystal, and the second column records the value of the target property.

atom_init.json: a JSON file that stores the initialization vector for each element.

ID.cif: a CIF file that records the crystal structure, where ID is the unique ID for the crystal.
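
A quick way to check a directory against this layout before building the dataset is sketched below, using only the standard library. The helpers `read_id_prop` and `missing_files` are hypothetical and not part of gt4sd; the sketch also assumes id_prop.csv has no header row.

```python
import csv
import os
import tempfile


def read_id_prop(root_dir):
    """Return (crystal_id, target) pairs from root_dir/id_prop.csv."""
    with open(os.path.join(root_dir, "id_prop.csv"), newline="") as handle:
        return [(row[0], float(row[1])) for row in csv.reader(handle) if row]


def missing_files(root_dir):
    """List expected files that are absent from root_dir."""
    expected = ["atom_init.json"]
    expected += [f"{cif_id}.cif" for cif_id, _ in read_id_prop(root_dir)]
    return [
        name for name in expected
        if not os.path.exists(os.path.join(root_dir, name))
    ]


# Build a throwaway dataset directory with placeholder (empty) files.
root_dir = tempfile.mkdtemp()
with open(os.path.join(root_dir, "id_prop.csv"), "w", newline="") as handle:
    csv.writer(handle).writerows([["id0", 1.5], ["id1", -0.3]])
for name in ("atom_init.json", "id0.cif"):
    open(os.path.join(root_dir, name), "w").close()

entries = read_id_prop(root_dir)
missing = missing_files(root_dir)  # id1.cif was deliberately left out
```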

__init__(root_dir, max_num_nbr=12, radius=8, dmin=0, step=0.2, random_seed=123, atom_initialization=None)

Parameters

root_dir (str) – The path to the root directory of the dataset.
max_num_nbr (int) – The maximum number of neighbors while constructing the crystal graph.
radius (int) – The cutoff radius for searching neighbors.
dmin (int) – The minimum distance for constructing GaussianDistance.
step (float) – The step size for constructing GaussianDistance.
random_seed (int) – Random seed for shuffling the dataset.
atom_initialization (Optional[AtomCustomJSONInitializer, None]) – The atom initializer for initializing the atom feature vectors. Defaults to None, in which case an atom_init.json file should be in root_dir.
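
How radius and max_num_nbr interact is not spelled out here. One plausible reading, sketched below under that assumption, is that candidate neighbors are first restricted to the cutoff radius and then the nearest max_num_nbr are kept. `select_neighbors` is a hypothetical helper and the distances are made up.

```python
def select_neighbors(neighbors, radius=8.0, max_num_nbr=12):
    """Hypothetical helper: keep the nearest neighbors within a cutoff.

    neighbors: list of (atom_index, distance) pairs for one atom.
    Assumption: filter by radius first, then keep the closest max_num_nbr.
    """
    within = [(idx, dist) for idx, dist in neighbors if dist <= radius]
    within.sort(key=lambda pair: pair[1])
    return within[:max_num_nbr]


candidates = [(0, 9.0), (1, 2.5), (2, 1.0), (3, 4.0)]
kept = select_neighbors(candidates, radius=8.0, max_num_nbr=2)
```

Under this reading, an atom with fewer than max_num_nbr neighbors inside the radius would need padding to reach the fixed width M used in the tensor shapes on this page; how the library handles that case is not covered by this sketch.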

__len__()

__getitem__(idx)

Parameters: idx (int) – Index of the data point.

Returns:

atom_fea: torch.Tensor shape (n_i, atom_fea_len).
nbr_fea: torch.Tensor shape (n_i, M, nbr_fea_len).
nbr_fea_idx: torch.LongTensor shape (n_i, M).
target: torch.Tensor shape (1, ).
cif_id: str or int.