Datasets
Graphormer supports training with both existing datasets in graph libraries and customized datasets.
Existing Datasets
Graphormer supports training with datasets from existing graph libraries.
Users can easily use datasets from these libraries by specifying the --dataset-source and --dataset-name parameters.
--dataset-source specifies the source of the dataset. It can be one of:

- dgl for DGL
- pyg for PyTorch Geometric
- ogb for OGB

--dataset-name specifies the dataset within the source.
For example, by specifying --dataset-source pyg and --dataset-name zinc, Graphormer will load the ZINC dataset from PyTorch Geometric.
When a dataset requires additional parameters to construct, the parameters are specified as <dataset_name>:<param_1>=<value_1>,<param_2>=<value_2>,...,<param_n>=<value_n>.
When the type of a parameter value is a list, the value is represented as a string with the list elements concatenated by +.
For example, to specify multiple label_keys with mu, alpha, and homo for the QM9 dataset, --dataset-name should be qm9:label_keys=mu+alpha+homo.
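To make the convention concrete, the following sketch (illustrative only, not Graphormer's actual parsing code; the helper name is hypothetical) shows how such a string decomposes into a dataset name and its constructor parameters:

```python
# Hypothetical helper that mimics the <dataset_name>:<param>=<value>,... convention.
def parse_dataset_name(spec: str):
    name, _, param_str = spec.partition(":")
    params = {}
    if param_str:
        for item in param_str.split(","):
            key, value = item.split("=")
            # '+' concatenates list elements, e.g. "mu+alpha+homo"
            params[key] = value.split("+") if "+" in value else value
    return name, params

print(parse_dataset_name("qm9:label_keys=mu+alpha+homo"))
# ('qm9', {'label_keys': ['mu', 'alpha', 'homo']})
```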
When the dataset split (train, valid, and test subsets) is not configured in the original dataset source, we randomly partition the full set into train, valid, and test subsets with ratios 0.7, 0.2, and 0.1, respectively.
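Conceptually, this default partition behaves like the following sketch (illustrative only, assuming 1000 graphs; not the library's actual code):

```python
import numpy as np

# Illustrative 0.7 / 0.2 / 0.1 random partition of graph indices.
num_graphs = 1000
perm = np.random.default_rng(0).permutation(num_graphs)

n_train = int(num_graphs * 0.7)
n_valid = int(num_graphs * 0.2)

train_idx = perm[:n_train]
valid_idx = perm[n_train:n_train + n_valid]
test_idx = perm[n_train + n_valid:]
```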
If you want a customized split of a dataset, you may implement a customized dataset (see Customized Datasets below).
Currently, only integer features of the nodes and edges in the datasets are used.
A full list of supported datasets for each dataset source:

| Dataset Source | Dataset | #Label/#Class |
|---|---|---|
| dgl | QM7b | 14 |
| dgl | QM9 | Depending on label_keys |
| dgl | QM9Edge | Depending on label_keys |
| dgl | MiniGC | 8 |
| dgl | Graph Isomorphism Network | 1 |
| dgl | FakeNewsDataset | 1 |
| pyg | MoleculeNet | 1 |
| pyg | ZINC | 1 |
| ogb | ogbg-molhiv | 1 |
| ogb | ogbg-molpcba | 128 |
| ogb | PCQM4M | 1 |
| ogb | PCQM4Mv2 | 1 |
Customized Datasets
Users may create their own datasets. To use a customized dataset:

1. Create a folder (for example, named customized_dataset), and a Python script with an arbitrary name in the folder.
2. In the created Python script, define a function which returns the created dataset, and register the function with register_dataset. Here is a sample Python script, which defines a QM9 dataset from dgl with a customized split:
```python
from graphormer.data import register_dataset
from dgl.data import QM9
import numpy as np
from sklearn.model_selection import train_test_split

@register_dataset("customized_qm9_dataset")
def create_customized_dataset():
    dataset = QM9(label_keys=["mu"])
    num_graphs = len(dataset)

    # customized dataset split
    train_valid_idx, test_idx = train_test_split(
        np.arange(num_graphs), test_size=num_graphs // 10, random_state=0
    )
    train_idx, valid_idx = train_test_split(
        train_valid_idx, test_size=num_graphs // 5, random_state=0
    )
    return {
        "dataset": dataset,
        "train_idx": train_idx,
        "valid_idx": valid_idx,
        "test_idx": test_idx,
        "source": "dgl"
    }
```
The function returns a dictionary. In the dictionary, dataset is the dataset object, train_idx contains the graph indices used for training, and similarly for valid_idx and test_idx. Finally, source records the underlying graph library used by the dataset.
3. Specify --user-data-dir as customized_dataset and --dataset-name as customized_qm9_dataset when training.
Note that --user-data-dir should not be used together with --dataset-source. All datasets defined in all Python scripts under the customized_dataset folder will be registered automatically.
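For example, a second script in the same folder could register another variant of the dataset, which would then be selectable with --dataset-name customized_qm9_homo_dataset. The sketch below is illustrative only; the script name, registered dataset name, and split ratios are assumptions, and it reuses only the register_dataset pattern shown above.

```python
# customized_dataset/qm9_homo.py (hypothetical second script in the same folder)
from graphormer.data import register_dataset
from dgl.data import QM9
import numpy as np
from sklearn.model_selection import train_test_split

@register_dataset("customized_qm9_homo_dataset")
def create_qm9_homo_dataset():
    # Same QM9 data, but regressing the "homo" label instead of "mu".
    dataset = QM9(label_keys=["homo"])
    num_graphs = len(dataset)

    # Roughly 80/10/10 split, to show that each registered dataset
    # can define its own partition.
    train_valid_idx, test_idx = train_test_split(
        np.arange(num_graphs), test_size=num_graphs // 10, random_state=0
    )
    train_idx, valid_idx = train_test_split(
        train_valid_idx, test_size=num_graphs // 10, random_state=0
    )
    return {
        "dataset": dataset,
        "train_idx": train_idx,
        "valid_idx": valid_idx,
        "test_idx": test_idx,
        "source": "dgl",
    }
```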