Datasets

Graphormer supports training with both existing datasets in graph libraries and customized datasets.

Existing Datasets

Graphormer supports training with datasets from existing graph libraries. Users can load a dataset from these libraries by specifying the --dataset-source and --dataset-name parameters.

--dataset-source specifies the source of the dataset, and can be one of:

  1. dgl for DGL

  2. pyg for PyTorch Geometric

  3. ogb for OGB

--dataset-name specifies the dataset within the source. For example, with --dataset-source pyg and --dataset-name zinc, Graphormer loads the ZINC dataset from PyTorch Geometric. When a dataset requires additional parameters to construct, they are specified as <dataset_name>:<param_1>=<value_1>,<param_2>=<value_2>,...,<param_n>=<value_n>. When a parameter value is a list, the list elements are concatenated with +. For example, to specify the label_keys mu, alpha, and homo for the QM9 dataset, --dataset-name should be qm9:label_keys=mu+alpha+homo.
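This encoding is straightforward to unpack. The following is an illustrative Python sketch of such a parser (not Graphormer's actual implementation), showing how a --dataset-name string decomposes into a dataset name and a parameter dictionary:

```python
def parse_dataset_name(spec):
    """Split a string like 'qm9:label_keys=mu+alpha+homo' into
    ('qm9', {'label_keys': ['mu', 'alpha', 'homo']}).

    A value containing '+' is treated as a list of elements."""
    name, _, param_str = spec.partition(":")
    params = {}
    if param_str:
        for pair in param_str.split(","):
            key, _, value = pair.partition("=")
            # '+' concatenates list elements inside a single value
            params[key] = value.split("+") if "+" in value else value
    return name, params
```

A plain name with no colon, such as zinc, parses to the name itself with an empty parameter dictionary.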

When a dataset split (train, valid, and test subsets) is not configured in the original dataset source, we randomly partition the full set into train, valid, and test subsets with ratios 0.7, 0.2, and 0.1, respectively. If you want a customized split of a dataset, you may implement a customized dataset (see Customized Datasets below). Currently, only integer features of nodes and edges in the datasets are used.
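As a sketch, the default random 0.7/0.2/0.1 partition can be reproduced along these lines (illustrative only, not Graphormer's actual code):

```python
import numpy as np

def random_split(num_graphs, seed=0):
    """Randomly partition graph indices into train/valid/test
    subsets with ratios 0.7 / 0.2 / 0.1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_graphs)
    n_train = int(num_graphs * 0.7)
    n_valid = int(num_graphs * 0.2)
    # remaining ~10% of the indices form the test subset
    return (
        idx[:n_train],
        idx[n_train:n_train + n_valid],
        idx[n_train + n_valid:],
    )
```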

A full list of supported datasets for each dataset source:

Dataset Source   Dataset Name    Link                                #Label/#Class
dgl              qm7b            QM7B dataset                        14
dgl              qm9             QM9 dataset                         Depending on label_keys
dgl              qm9edge         QM9Edge dataset                     Depending on label_keys
dgl              minigc          MiniGC dataset                      8
dgl              gin             Graph Isomorphism Network dataset   1
dgl              fakenews        FakeNewsDataset dataset             1
pyg              moleculenet     MoleculeNet dataset                 1
pyg              zinc            ZINC dataset                        1
ogb              ogbg-molhiv     ogbg-molhiv dataset                 1
ogb              ogbg-molpcba    ogbg-molpcba dataset                128
ogb              pcqm4m          PCQM4M dataset                      1
ogb              pcqm4mv2        PCQM4Mv2 dataset                    1

Customized Datasets

Users may create their own datasets. To use a customized dataset:

  1. Create a folder (for example, named customized_dataset), and a Python script with an arbitrary name inside it.

  2. In the created Python script, define a function that returns the created dataset, and register the function with register_dataset. Below is a sample script that defines a QM9 dataset from DGL with a customized split.

from graphormer.data import register_dataset
from dgl.data import QM9
import numpy as np
from sklearn.model_selection import train_test_split

@register_dataset("customized_qm9_dataset")
def create_customized_dataset():
    dataset = QM9(label_keys=["mu"])
    num_graphs = len(dataset)

    # customized dataset split
    train_valid_idx, test_idx = train_test_split(
        np.arange(num_graphs), test_size=num_graphs // 10, random_state=0
    )
    train_idx, valid_idx = train_test_split(
        train_valid_idx, test_size=num_graphs // 5, random_state=0
    )
    return {
        "dataset": dataset,
        "train_idx": train_idx,
        "valid_idx": valid_idx,
        "test_idx": test_idx,
        "source": "dgl",
    }

The function returns a dictionary. In the dictionary, dataset is the dataset object, and train_idx contains the graph indices used for training; valid_idx and test_idx are defined analogously for validation and test. Finally, source records the underlying graph library used by the dataset.
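For intuition, a register_dataset-style decorator typically maintains a mapping from dataset names to factory functions. The following is a minimal illustrative sketch; the names DATASET_REGISTRY and create_toy_dataset are hypothetical, and Graphormer's actual internals may differ:

```python
# Hypothetical registry; a sketch of how register_dataset could work.
DATASET_REGISTRY = {}

def register_dataset(name):
    """Decorator that records a dataset factory under the given name."""
    def wrapper(fn):
        if name in DATASET_REGISTRY:
            raise ValueError(f"dataset {name!r} is already registered")
        DATASET_REGISTRY[name] = fn
        return fn
    return wrapper

@register_dataset("toy_dataset")
def create_toy_dataset():
    # stand-in for a real dataset dictionary as described above
    return {"dataset": list(range(10)), "source": "dgl"}
```

At training time, the framework can then look up the factory by the value passed to --dataset-name and call it to obtain the dataset dictionary.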

  3. When training, specify --user-data-dir as customized_dataset and set --dataset-name as customized_qm9_dataset. Note that --user-data-dir should not be used together with --dataset-source. All datasets defined in all Python scripts under customized_dataset will be registered automatically.