Datasets
Graphormer supports training with both existing datasets in graph libraries and customized datasets.
Existing Datasets
Graphormer supports training with datasets in existing libraries.
Users can easily exploit datasets in these libraries by specifying the --dataset-source and --dataset-name parameters.
--dataset-source specifies the source of the dataset. It can be one of:
- dgl for DGL
- pyg for Pytorch Geometric
- ogb for OGB
--dataset-name specifies the dataset in the source.
For example, by specifying --dataset-source pyg and --dataset-name zinc, Graphormer will load the ZINC dataset from Pytorch Geometric.
When a dataset requires additional parameters to construct, the parameters are specified as <dataset_name>:<param_1>=<value_1>,<param_2>=<value_2>,...,<param_n>=<value_n>.
When the type of a parameter value is a list, the value is represented as a string with the list elements concatenated by +.
For example, to request the label_keys mu, alpha, and homo for the QM9 dataset,
set --dataset-name to qm9:label_keys=mu+alpha+homo.
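The naming convention above can be illustrated with a small parser. This is a minimal sketch of the documented syntax only, not Graphormer's actual parsing code; the function name parse_dataset_name is hypothetical, and all values are kept as strings.

```python
from typing import Dict, List, Tuple, Union

ParamValue = Union[str, List[str]]


def parse_dataset_name(spec: str) -> Tuple[str, Dict[str, ParamValue]]:
    """Split '<name>:<k1>=<v1>,<k2>=<v2>' into the name and a parameter dict.

    Hypothetical helper for illustration; not part of Graphormer.
    """
    name, _, param_str = spec.partition(":")
    params: Dict[str, ParamValue] = {}
    if not param_str:
        return name, params
    for item in param_str.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected <param>=<value>, got {item!r}")
        # A '+' marks a list value; values stay as strings in this sketch.
        params[key] = value.split("+") if "+" in value else value
    return name, params


print(parse_dataset_name("qm9:label_keys=mu+alpha+homo"))
# ('qm9', {'label_keys': ['mu', 'alpha', 'homo']})
```

Note that this sketch cannot tell a one-element list from a plain string (label_keys=mu yields the string "mu"); how the real loader resolves that case is not covered here.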
When the original dataset source does not define a split (train, valid, and test subsets), the full set is randomly partitioned
into train, valid, and test subsets with ratios 0.7, 0.2, and 0.1, respectively.
To use a different split, implement a customized dataset (see Customized Datasets).
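The default 0.7 / 0.2 / 0.1 partition can be pictured as follows. This is a minimal sketch assuming a simple shuffled-index split; the function name random_split, the seed, and the rounding are assumptions for illustration, and Graphormer's internal implementation may differ.

```python
import numpy as np


def random_split(num_graphs, ratios=(0.7, 0.2, 0.1), seed=0):
    """Randomly partition graph indices into train, valid, and test subsets.

    Hypothetical helper; the seed and rounding here are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_graphs)
    num_train = round(ratios[0] * num_graphs)
    num_valid = round(ratios[1] * num_graphs)
    train_idx = perm[:num_train]
    valid_idx = perm[num_train:num_train + num_valid]
    # The test subset takes whatever remains, so no graph is dropped.
    test_idx = perm[num_train + num_valid:]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = random_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 700 200 100
```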
Currently, only integer features of nodes and edges in the datasets are used.
A full list of supported datasets of each data source:
| Dataset Source | Dataset Name | Link | #Label/#Class |
|---|---|---|---|
| dgl | | QM7B dataset | 14 |
| dgl | qm9 | QM9 dataset | Depending on |
| dgl | | QM9Edge dataset | Depending on |
| dgl | | MiniGC dataset | 8 |
| dgl | | Graph Isomorphism Network dataset | 1 |
| dgl | | FakeNewsDataset dataset | 1 |
| pyg | | MoleculeNet dataset | 1 |
| pyg | zinc | ZINC dataset | 1 |
| ogb | | ogbg-molhiv dataset | 1 |
| ogb | | ogbg-molpcba dataset | 128 |
| ogb | | PCQM4M dataset | 1 |
| ogb | | PCQM4Mv2 dataset | 1 |
Customized Datasets
Users may create their own datasets. To use a customized dataset:
1. Create a folder (for example, named customized_dataset) and, inside it, a Python script with an arbitrary name.
2. In that script, define a function that returns the created dataset, and register the function with register_dataset. Here is a sample Python script,
which defines a QM9 dataset from dgl with a customized split.

    from graphormer.data import register_dataset
    from dgl.data import QM9
    import numpy as np
    from sklearn.model_selection import train_test_split

    @register_dataset("customized_qm9_dataset")
    def create_customized_dataset():
        dataset = QM9(label_keys=["mu"])
        num_graphs = len(dataset)

        # customized dataset split: hold out 1/10 of the graphs for testing,
        # then 1/5 of the full set for validation; the rest is for training
        train_valid_idx, test_idx = train_test_split(
            np.arange(num_graphs), test_size=num_graphs // 10, random_state=0
        )
        train_idx, valid_idx = train_test_split(
            train_valid_idx, test_size=num_graphs // 5, random_state=0
        )
        return {
            "dataset": dataset,
            "train_idx": train_idx,
            "valid_idx": valid_idx,
            "test_idx": test_idx,
            "source": "dgl"
        }

The function returns a dictionary in which dataset is the dataset object, train_idx, valid_idx, and test_idx are the graph indices used for
training, validation, and testing, and source records the underlying graph library used by the dataset.
3. When training, set --user-data-dir to customized_dataset and --dataset-name to customized_qm9_dataset.
Note that --user-data-dir should not be used together with --dataset-source. All datasets defined in any Python script under the customized_dataset
folder are registered automatically.
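Before training on a customized dataset, it can help to sanity-check the dictionary that the registered function returns. The following is a minimal sketch using only NumPy; check_split_dict is a hypothetical helper that is not part of Graphormer, and a plain list stands in for the real dataset object.

```python
import numpy as np

# Keys described in the text above.
REQUIRED_KEYS = ("dataset", "train_idx", "valid_idx", "test_idx", "source")


def check_split_dict(result, num_graphs):
    """Verify that all keys exist and the three index sets are valid and disjoint.

    Hypothetical helper for illustration; returns the size of each subset.
    """
    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    names = ("train_idx", "valid_idx", "test_idx")
    splits = {name: np.asarray(result[name]) for name in names}
    all_idx = np.concatenate(list(splits.values()))
    if all_idx.size and (all_idx.min() < 0 or all_idx.max() >= num_graphs):
        raise ValueError("graph index out of range")
    if np.unique(all_idx).size != all_idx.size:
        raise ValueError("train/valid/test indices overlap")
    return {name: int(idx.size) for name, idx in splits.items()}


# Stand-in data using the same split arithmetic as the sample script.
num_graphs = 100
perm = np.random.default_rng(0).permutation(num_graphs)
num_test, num_valid = num_graphs // 10, num_graphs // 5
result = {
    "dataset": list(range(num_graphs)),  # placeholder for a real dataset
    "test_idx": perm[:num_test],
    "valid_idx": perm[num_test:num_test + num_valid],
    "train_idx": perm[num_test + num_valid:],
    "source": "dgl",
}
print(check_split_dict(result, num_graphs))
# {'train_idx': 70, 'valid_idx': 20, 'test_idx': 10}
```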