
Hypergraph Convolution and Hypergraph Attention

Song Bai, Feihu Zhang, Philip H.S. Torr

Department of Engineering Science, University of Oxford, Oxford, United Kingdom. Correspondence to: Song Bai <songbai.site@gmail.com>. Preliminary work.

Abstract

Recently, graph neural networks have attracted great attention and achieved prominent performance in various research fields. Most of those algorithms assume pairwise relationships between objects of interest. However, in many real applications, the relationships between objects are of higher order, beyond a pairwise formulation. To efficiently learn deep embeddings on such high-order graph-structured data, we introduce two end-to-end trainable operators to the family of graph neural networks, i.e., hypergraph convolution and hypergraph attention. Whilst hypergraph convolution defines the basic formulation of performing convolution on a hypergraph, hypergraph attention further enhances the capacity of representation learning by leveraging an attention module. With the two operators, a graph neural network is readily extended to a more flexible model and applied to diverse applications where non-pairwise relationships are observed. Extensive experimental results with semi-supervised node classification demonstrate the effectiveness of hypergraph convolution and hypergraph attention.

Figure 1. The difference between a simple graph (a) and a hypergraph (b). In a simple graph, each edge, denoted by a line, only connects two vertices. In a hypergraph, each edge, denoted by a colored ellipse, connects more than two vertices.

1. Introduction

In the last decade, Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012) have led to a wide spectrum of breakthroughs in various research domains, such as visual recognition (He et al., 2016), speech recognition (Hinton et al., 2012), and machine translation (Bahdanau et al., 2015). Owing to their innate nature, CNNs hold an extremely strict assumption, namely that the input data must have a regular, grid-like structure. Such a limitation hinders the promotion and application of CNNs to the many tasks where data of irregular structure widely exists.

To handle such ubiquitous irregular data structures, there is a growing interest in Graph Neural Networks (GNNs) (Scarselli et al., 2009), a methodology for learning deep models with graph data. GNNs have a wide application in social science (Hamilton et al., 2017), knowledge graphs (Wang et al., 2018), recommendation systems (Ying et al., 2018a), geometrical computation (Bronstein et al., 2017), etc. Most existing methods assume that the relationships between objects of interest are pairwise; specifically, in a graph model this means that each edge only connects two vertices (see Fig. 1(a)).

However, in many real applications the object relationships are much more complex than pairwise. For instance, in recommendation systems an item may be commented on by multiple users. By taking the items as vertices and the ratings of users as edges of a graph, each edge may connect more than two vertices. In this case, the affinity relations are no longer dyadic (pairwise), but rather triadic, tetradic or of a higher order. This brings back the concept of the hypergraph (Agarwal et al., 2006; Zhang et al., 2017), a special graph model which leverages hyperedges to connect multiple vertices simultaneously (see Fig. 1(b)). Unfortunately, most existing variants of graph neural networks (Zhang et al., 2018; Zhou et al., 2018) are not applicable to the high-order structure encoded by hyperedges.

Although the machine learning community has witnessed the prominence of graph neural networks in learning patterns on simple graphs, the investigation of deep learning on hypergraphs is still in a very nascent stage. Considering its importance, we propose hypergraph convolution and hypergraph attention in this work, as two strong supplemental operators to graph neural networks. The advantages and contributions of our work are as follows.

1) Hypergraph convolution defines a basic convolutional operator in a hypergraph. It enables efficient information propagation between vertices by fully exploiting the high-order relationship and local clustering structure therein. We mathematically prove that graph convolution is a special case of hypergraph convolution when the non-pairwise relationship degenerates to a pairwise one.

2) Apart from hypergraph convolution, where the underlying structure used for propagation is pre-defined, hypergraph attention further exerts an attention mechanism to learn a dynamic connection of hyperedges. The information propagation and gathering is then done in task-relevant parts of the graph, thereby generating more discriminative node embeddings.

3) Both hypergraph convolution and hypergraph attention are end-to-end trainable, and can be inserted into most variants of graph neural networks whenever non-pairwise relationships are observed. Extensive experimental results on benchmark datasets demonstrate the efficacy of the proposed methods for semi-supervised node classification.

2. Related Work

Graphs are a classic kind of data structure, where vertices represent objects and edges linking two adjacent vertices describe the relationship between the corresponding objects.

Graph Neural Network (GNN) is a methodology for learning deep models or embeddings on graph-structured data, first proposed by (Scarselli et al., 2009). One key aspect of GNNs is to define the convolutional operator in the graph domain. (Bruna et al., 2013) first define convolution in the Fourier domain using the graph Laplacian matrix, and generate non-spatially localized filters with potentially intense computations. (Henaff et al., 2015) make the spectral filters spatially localized using a parameterization with smooth coefficients. (Defferrard et al., 2016) focus on the efficiency issue and use a Chebyshev expansion of the graph Laplacian to avoid an explicit use of the graph Fourier basis. (Kipf & Welling, 2016) further simplify the filtering by only using the first-order neighbors and propose the Graph Convolutional Network (GCN), which has demonstrated impressive efficiency and effectiveness on semi-supervised classification tasks.

Meanwhile, some spatial algorithms directly perform convolution on the graph. For instance, (Duvenaud et al., 2015) learn different parameters for nodes with different degrees, then average the intermediate embeddings over the neighborhood structures. (Niepert et al., 2016) propose the PATCHY-SAN architecture, which selects a fixed-length sequence of nodes as the receptive field and generates local normalized neighborhood representations for each of the nodes in the sequence. (Atwood & Towsley, 2016) demonstrate that diffusion-based representations can serve as an effective basis for node classification. (Zhuang & Ma, 2018) further explore a joint usage of diffusion and adjacency bases in a dual graph convolutional network. (Gilmer et al., 2017) define a unified framework via a message passing function, where each vertex sends messages based on its states and updates its states based on the messages of its immediate neighbors. (Hamilton et al., 2017) propose GraphSAGE, which customizes three aggregating functions, i.e., element-wise mean, long short-term memory and pooling, to learn embeddings in an inductive setting.

Moreover, some other works concentrate on gate mechanisms (Li et al., 2016), skip connections (Pham et al., 2017), jumping connections (Xu et al., 2018), attention mechanisms (Velickovic et al., 2018), sampling strategies (Chen et al., 2018a;b), hierarchical representations (Ying et al., 2018b), generative models (You et al., 2018b; Bojchevski et al., 2018), adversarial attacks (Dai et al., 2018), etc. As a thorough review is simply unfeasible due to the space limitation, we refer interested readers to surveys for more representative methods. For example, (Zhang et al., 2018) and (Zhou et al., 2018) present two systematic and comprehensive surveys of a series of variants of graph neural networks. (Bronstein et al., 2017) provide a review of geometric deep learning. (Battaglia et al., 2018) generalize and extend various approaches and show how graph neural networks can support relational reasoning and combinatorial generalization. (Lee et al., 2018) particularly focus on attention models for graphs, and introduce three intuitive taxonomies. (Monti et al., 2017) propose a unified framework called MoNet, which summarizes Geodesic CNN (Masci et al., 2015), Anisotropic CNN (Boscaini et al., 2016), GCN (Kipf & Welling, 2016) and Diffusion CNN (Atwood & Towsley, 2016) as its special cases.

As analyzed above, most existing variants of GNN assume pairwise relationships between objects, while our work operates on a high-order hypergraph (Berge, 1973; Li & Milenkovic, 2017) where the between-object relationships are beyond pairwise. Hypergraph learning methods differ in the structure of the hypergraph, e.g., clique expansion and star expansion (Zien et al., 1999), and in the definition of hypergraph Laplacians (Bolla, 1993; Rodriguez, 2003; Zhou et al., 2007). Following (Defferrard et al., 2016), (Feng et al., 2019) propose a hypergraph neural network using a Chebyshev expansion of the graph Laplacian. By analyzing the incident structure of a hypergraph, our work directly defines two differentiable operators, i.e., hypergraph convolution and hypergraph attention, which are intuitive and flexible in learning more discriminative deep embeddings.

3. Proposed Approach

In this section, we first give the definition of a hypergraph in Sec. 3.1, then elaborate the proposed hypergraph convolution and hypergraph attention in Sec. 3.2 and Sec. 3.3, respectively. At last, Sec. 3.4 provides a deeper analysis of the properties of our methods.

3.1. Hypergraph Revisited

Most existing works (Kipf & Welling, 2016; Velickovic et al., 2018) operate on a simple graph G = (V, E), where V = {v_1, v_2, ..., v_N} denotes the vertex set and E ⊆ V × V denotes the edge set. A graph adjacency matrix A ∈ R^{N×N} is used to reflect the pairwise relationship between every two vertices. The underlying assumption of such a simple graph is that each edge only links two vertices. However, as analyzed above, the relationships between objects are more complex than pairwise in many real applications.

To describe such complex relationships, a useful graph model is the hypergraph, where a hyperedge can connect more than two vertices. Let G = (V, E) be a hypergraph with N vertices and M hyperedges. Each hyperedge ε ∈ E is assigned a positive weight W_εε, with all the weights stored in a diagonal matrix W ∈ R^{M×M}. Whereas a simple graph is described by an adjacency matrix, the hypergraph G is in general represented by an incidence matrix H ∈ R^{N×M}. When the hyperedge ε ∈ E is incident with a vertex v_i ∈ V, in other words, when v_i is connected by ε, we set H_{iε} = 1, otherwise 0. Then, the vertex degree is defined as

    D_{ii} = \sum_{\epsilon=1}^{M} W_{\epsilon\epsilon} H_{i\epsilon},    (1)

and the hyperedge degree is defined as

    B_{\epsilon\epsilon} = \sum_{i=1}^{N} H_{i\epsilon}.    (2)

Note that D ∈ R^{N×N} and B ∈ R^{M×M} are both diagonal matrices.
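To make these definitions concrete, the following minimal sketch (ours, written with PyTorch tensors, since the experiments in Sec. 4.1 state that the method is implemented in PyTorch) builds the incidence matrix of the 5-vertex, 2-hyperedge example later shown in Fig. 2 and evaluates the degree definitions of Eq. (1) and Eq. (2); all variable names are illustrative assumptions.

import torch

# Toy hypergraph mirroring Fig. 2: N = 5 vertices, M = 2 hyperedges,
# hyperedge e1 connects {x1, x2, x3} and hyperedge e2 connects {x3, x4, x5}.
H = torch.tensor([[1., 0.],
                  [1., 0.],
                  [1., 1.],
                  [0., 1.],
                  [0., 1.]])            # incidence matrix, shape (N, M)

w = torch.ones(2)                       # diagonal of W: one positive weight per hyperedge

D = torch.diag(H @ w)                   # vertex degrees, Eq. (1): D_ii = sum_e W_ee * H_ie
B = torch.diag(H.sum(dim=0))            # hyperedge degrees, Eq. (2): B_ee = sum_i H_ie

print(torch.diagonal(D))                # tensor([1., 1., 2., 1., 1.])
print(torch.diagonal(B))                # tensor([3., 3.])

Both D and B are diagonal, a property that the normalization introduced in the next subsection relies on.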

In the following, we define the operator of convolution on the hypergraph G.

3.2. Hypergraph Convolution

The primary obstacle to defining a convolution operator on a hypergraph is to measure the transition probability between two vertices, with which the embeddings (or features) of each vertex can be propagated in a graph neural network. To achieve this, we hold two assumptions: 1) more propagation should be done between vertices connected by a common hyperedge, and 2) hyperedges with larger weights deserve more confidence in such propagation. Then, one step of hypergraph convolution is defined as

    x_i^{(l+1)} = \sigma\Big( \sum_{j=1}^{N} \sum_{\epsilon=1}^{M} H_{i\epsilon} H_{j\epsilon} W_{\epsilon\epsilon} x_j^{(l)} P \Big),    (3)

where x_i^{(l)} is the embedding of the i-th vertex in the l-th layer, \sigma(\cdot) is a non-linear activation function like LeakyReLU (Maas et al., 2013) or ELU (Clevert et al., 2016), and P \in R^{F^{(l)} \times F^{(l+1)}} is the weight matrix between the l-th and (l+1)-th layer. Eq. (3) can be written in matrix form as

    X^{(l+1)} = \sigma(H W H^\top X^{(l)} P),    (4)

where X^{(l)} \in R^{N \times F^{(l)}} and X^{(l+1)} \in R^{N \times F^{(l+1)}} are the input of the l-th and (l+1)-th layer, respectively.

However, H W H^T does not have a constrained spectral radius, which means that the scale of X^{(l)} may change. When optimizing a neural network, stacking multiple hypergraph convolutional layers like Eq. (4) can then lead to numerical instabilities and increase the risk of exploding/vanishing gradients. Therefore, a proper normalization is necessary, so we impose a symmetric normalization and arrive at our final formulation

    X^{(l+1)} = \sigma(D^{-1/2} H W B^{-1} H^\top D^{-1/2} X^{(l)} P).    (5)

Here, we recall that D and B are the degree matrices of the vertices and hyperedges of the hypergraph, respectively. It is easy to prove that the maximum eigenvalue of D^{-1/2} H W B^{-1} H^T D^{-1/2} is no larger than 1, which stems from the fact (Agarwal et al., 2006) that I - D^{-1/2} H W B^{-1} H^T D^{-1/2} is a positive semi-definite matrix, where I is an identity matrix of appropriate size.

Alternatively, a row-normalization is also viable:

    X^{(l+1)} = \sigma(D^{-1} H W B^{-1} H^\top X^{(l)} P),    (6)

which enjoys similar mathematical properties to Eq. (5), except that the propagation is directional and asymmetric in this case.

As X^{(l+1)} is differentiable with respect to X^{(l)} and P, we can use hypergraph convolution in end-to-end training and optimize it via gradient descent.
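As a reference point, here is a minimal dense PyTorch sketch of one hypergraph convolution layer with the symmetric normalization of Eq. (5). The module name, the choice of ELU as σ, and the dense-matrix formulation are our own assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HypergraphConv(nn.Module):
    # One layer of Eq. (5): X' = sigma(D^{-1/2} H W B^{-1} H^T D^{-1/2} X P).
    def __init__(self, in_features, out_features, act=F.elu):
        super().__init__()
        self.P = nn.Linear(in_features, out_features, bias=False)  # weight matrix P
        self.act = act

    def forward(self, X, H, w=None):
        # X: (N, F) vertex embeddings, H: (N, M) incidence matrix,
        # w: (M,) hyperedge weights (diagonal of W), defaulting to all ones.
        if w is None:
            w = X.new_ones(H.size(1))
        d = (H * w).sum(dim=1)                           # vertex degrees, Eq. (1)
        b = H.sum(dim=0)                                 # hyperedge degrees, Eq. (2)
        d_inv_sqrt = d.clamp(min=1e-12).pow(-0.5)
        # D^{-1/2} H W B^{-1} H^T D^{-1/2}, exploiting that D, W and B are diagonal
        left = d_inv_sqrt.unsqueeze(1) * H * (w / b.clamp(min=1e-12))
        transition = left @ (d_inv_sqrt.unsqueeze(1) * H).t()
        out = transition @ self.P(X)
        return self.act(out) if self.act is not None else out

# Example: propagate random 16-dimensional features on the toy hypergraph above.
# layer = HypergraphConv(16, 8)
# X_next = layer(torch.randn(5, 16), H)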
3.3. Hypergraph Attention

Hypergraph convolution has an innate attentional mechanism (Lee et al., 2018; Velickovic et al., 2018). As can be seen from Eq. (5) and Eq. (6), the transition probability between vertices is non-binary, which means that for a given vertex, the afferent and efferent information flows are explicitly assigned diverse magnitudes of importance. However, such an attentional mechanism is not learnable or trainable once the incidence matrix H is defined.
Figure 2. Schematic illustration of hypergraph convolution with 5 vertices and 2 hyperedges (in the incidence matrix, hyperedge e1 connects x1, x2 and x3, while e2 connects x3, x4 and x5). With an optional attention mechanism, hypergraph convolution upgrades to hypergraph attention.

One natural solution is to exert an attention learning module on H. In this circumstance, instead of treating each vertex as being connected by a certain hyperedge or not, the attention module presents a probabilistic model, which assigns non-binary, real values to measure the degree of connectivity. We expect that such a probabilistic model can learn more category-discriminative embeddings and that the relationships between vertices can be described more accurately.

Nevertheless, hypergraph attention is only feasible when the vertex set and the hyperedge set are from (or can be projected to) the same homogeneous domain, since only in this case are the similarities between vertices and hyperedges directly comparable. In practice, it depends on how the hypergraph G is constructed. For example, (Huang et al., 2010) apply hypergraph learning to image retrieval, where each vertex collects its k-nearest neighbors to form a hyperedge, which is also the way hypergraphs are constructed in our experiments. When the vertex set and the edge set are comparable, we define the procedure of hypergraph attention inspired by (Velickovic et al., 2018). For a given vertex x_i and its associated hyperedge x_j, the attention score is

    H_{ij} = \frac{\exp\big(\sigma(\mathrm{sim}(x_i P, x_j P))\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\sigma(\mathrm{sim}(x_i P, x_k P))\big)},    (7)

where \sigma(\cdot) is a non-linear activation function and \mathrm{sim}(\cdot) is a similarity function that computes the pairwise similarity between two vertices. \mathcal{N}_i is the neighborhood set of x_i, which can be pre-accessed on some benchmarks, such as the Cora and Citeseer datasets (Sen et al., 2008).

With the incidence matrix H enriched by the attention module, one can then follow Eq. (5) and Eq. (6) to learn the intermediate embeddings of vertices layer by layer. Note that hypergraph attention also propagates gradients to H, in addition to X^{(l)} and P.
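The following PyTorch sketch shows one possible realization of Eq. (7) when vertex and hyperedge representations live in the same domain. The GAT-style choice of sim(·) (a linear transform on the concatenated projections, followed by LeakyReLU) appears consistent with the implementation details reported later in Sec. 4.1, but the class name and tensor layout are our own assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HypergraphAttention(nn.Module):
    # Learns the non-binary incidence matrix of Eq. (7); the resulting H can then
    # be plugged into the propagation rules of Eq. (5) or Eq. (6).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.P = nn.Linear(in_features, out_features, bias=False)
        self.a = nn.Linear(2 * out_features, 1, bias=False)   # linear "sim" transform

    def forward(self, X, Xe, mask):
        # X:    (N, F) vertex features
        # Xe:   (M, F) hyperedge features from the same homogeneous domain
        #       (e.g. the features of the centroid vertex defining each hyperedge)
        # mask: (N, M) binary incidence matrix; scores are only kept for the
        #       hyperedges in each vertex's neighborhood N_i (every vertex is
        #       assumed to belong to at least one hyperedge)
        Xp, Xep = self.P(X), self.P(Xe)
        N, M = mask.shape
        pair = torch.cat([Xp.unsqueeze(1).expand(N, M, -1),
                          Xep.unsqueeze(0).expand(N, M, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pair).squeeze(-1), negative_slope=0.2)
        scores = scores.masked_fill(mask == 0, float('-inf'))
        return F.softmax(scores, dim=1)     # row-normalized, learnable H of Eq. (7)

Because the softmax is taken over each vertex's candidate hyperedges, gradients flow through both projection matrices and the learned H, which is exactly what distinguishes hypergraph attention from the fixed H used by hypergraph convolution.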
In some applications, the vertex set and the hyperedge set are from two heterogeneous domains. For instance, (Zhou et al., 2007) assume that attributes are hyperedges connecting objects such as newspapers or text. It is then problematic to directly learn an attention module over the incidence matrix H. We leave this issue for future work.

3.4. Summary and Remarks

The pipeline of the proposed hypergraph convolution and hypergraph attention is illustrated in Fig. 2. Both operators can be inserted into most variants of graph neural networks when non-pairwise relationships are observed, and used for end-to-end training.

As the only difference between hypergraph convolution and hypergraph attention is an optional attention module on the incidence matrix H, below we take hypergraph convolution as a representative to further analyze the properties of our methods. Note that the analyses also hold for hypergraph attention.

Relationship with Graph Convolution. We prove mathematically that graph convolution (Kipf & Welling, 2016) is a special case of hypergraph convolution.

Let A ∈ R^{N×N} be the adjacency matrix used in graph convolution. When each hyperedge only links two vertices, the hyperedge degree matrix is B = 2I. Assuming equal weights for all the hyperedges (i.e., W = I), we arrive at an interesting observation about hypergraph convolution. Based on Eq. (5), the definition of hypergraph convolution becomes

    X^{(l+1)} = \sigma\big( \tfrac{1}{2} D^{-1/2} H H^\top D^{-1/2} X^{(l)} P \big)
              = \sigma\big( \tfrac{1}{2} D^{-1/2} (A + D) D^{-1/2} X^{(l)} P \big)
              = \sigma\big( \tfrac{1}{2} (I + D^{-1/2} A D^{-1/2}) X^{(l)} P \big)
              = \sigma(\hat{A} X^{(l)} P),    (8)

where \hat{A} = \tfrac{1}{2}\tilde{A} and \tilde{A} = I + D^{-1/2} A D^{-1/2}.
As can be seen, Eq. (8) is exactly equivalent to the definition of graph convolution (see Equation 9 in (Kipf & Welling, 2016)). Note that \tilde{A} has eigenvalues in the range [0, 2]. To avoid scale changes, (Kipf & Welling, 2016) suggest a re-normalization trick, that is,

    \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}.    (9)

In the specific case of hypergraph convolution, we use a simpler solution, namely dividing \tilde{A} by 2.

With GCN as a bridge and springboard to the family of graph neural networks, it then becomes feasible to build connections with other frameworks, e.g., MoNet (Monti et al., 2017), and to develop the higher-order counterparts of those variants to deal with non-pairwise relationships.
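As a quick sanity check of this special case, the short script below (ours; the 4-vertex cycle graph is an arbitrary example) builds a hypergraph whose hyperedges each contain exactly two vertices and verifies that the propagation matrix of Eq. (5) coincides with (I + D^{-1/2} A D^{-1/2}) / 2 from Eq. (8).

import torch

# Each hyperedge of this toy hypergraph links exactly two vertices (W = I, B = 2I),
# so Eq. (8) predicts D^{-1/2} H W B^{-1} H^T D^{-1/2} = (I + D^{-1/2} A D^{-1/2}) / 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # a 4-vertex cycle, chosen arbitrarily
N, M = 4, len(edges)

H = torch.zeros(N, M)
for e, (i, j) in enumerate(edges):
    H[i, e] = H[j, e] = 1.0                     # one hyperedge per ordinary edge

A = H @ H.t() - torch.diag((H @ H.t()).diagonal())   # recover the adjacency matrix
D_inv_sqrt = torch.diag(H.sum(dim=1).pow(-0.5))      # D^{-1/2}, from the vertex degrees
B_inv = torch.diag(1.0 / H.sum(dim=0))               # B^{-1} = I / 2 here

hyper = D_inv_sqrt @ H @ B_inv @ H.t() @ D_inv_sqrt
gcn = 0.5 * (torch.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt)
print(torch.allclose(hyper, gcn))                    # True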
Implementation in Practice. The implementation of hypergraph convolution appears sophisticated, as 6 matrices are multiplied for symmetric convolution (see Eq. (5)) and 5 matrices are multiplied for asymmetric convolution (see Eq. (6)). However, it should be mentioned that D, W and B are all diagonal matrices, which makes it possible to implement the operator efficiently in commonly used deep learning platforms.

For asymmetric convolution, we have from Eq. (6) that

    D^{-1} H W B^{-1} H^\top = (D^{-1} H) W (H B^{-1})^\top,    (10)

where D^{-1} H and H B^{-1} perform L1 normalization of the incidence matrix H over rows and columns, respectively. In applications where memory is not a concern and dense matrix-form variables are allowed, the normalization can simply be done using standard built-in functions of public neural network packages.

In memory-hungry applications, one can readily implement a sparse version of hypergraph convolution as well. Since H is usually a sparse matrix, Eq. (10) does not necessarily decrease the sparsity too much. Hence, we can conduct the normalization only on the non-zero indices of H, resulting in a sparse transition matrix.

Symmetric hypergraph convolution, as defined in Eq. (5), can be implemented similarly, with a minor difference in the normalization using the vertex degree matrix D.
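A minimal sketch of the sparse route suggested above, using PyTorch sparse COO tensors: the row- and column-normalized copies of H are built only on its non-zero entries, and the dense transition matrix of Eq. (10) is never materialized. Function and argument names are assumptions of ours, not the authors' released implementation.

import torch

def hyperconv_asymmetric(X, H_sparse, P, w=None):
    # Asymmetric hypergraph convolution, Eq. (6)/(10), with a sparse incidence
    # matrix H (a torch.sparse_coo_tensor of shape (N, M)); X is (N, F) and P is
    # a dense (F_in, F_out) weight matrix. The nonlinearity sigma is left out.
    H = H_sparse.coalesce()
    idx, val = H.indices(), H.values()          # idx[0]: vertex ids, idx[1]: hyperedge ids
    N, M = H.shape
    w = torch.ones(M) if w is None else w

    d = torch.zeros(N).index_add_(0, idx[0], val * w[idx[1]])   # vertex degrees, Eq. (1)
    b = torch.zeros(M).index_add_(0, idx[1], val)               # hyperedge degrees, Eq. (2)

    row_norm = torch.sparse_coo_tensor(idx, val / d[idx[0]], (N, M))                        # D^-1 H
    col_norm_t = torch.sparse_coo_tensor(idx.flip(0), val * w[idx[1]] / b[idx[1]], (M, N))  # (H W B^-1)^T

    msg = torch.sparse.mm(col_norm_t, X @ P)    # gather vertex features onto hyperedges
    return torch.sparse.mm(row_norm, msg)       # scatter them back to the vertices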
Skip Connection. Hypergraph convolution can be integrated with a skip connection (He et al., 2016) as

    X_k^{(l+1)} = \mathrm{HConv}(X^{(l)}, H_k, P_k) + X^{(l)},    (11)

where \mathrm{HConv}(\cdot) represents the hypergraph convolution operator defined in Eq. (5) (or Eq. (6)). Some similar structures (e.g., the highway structure (Srivastava et al., 2015) adopted in Highway-GCN (Rahimi et al., 2018)) can also be applied.

It has been demonstrated (Kipf & Welling, 2016) that deep graph models cannot improve the performance even with skip connections, since the receptive field grows exponentially with respect to the model depth. In the experiments, we will verify the compatibility of the proposed operators with skip connections in model training.

Multi-head. To stabilize the learning process and improve the representative power of networks, a multi-head (a.k.a. multi-branch) architecture is suggested in relevant works, e.g., (Xie et al., 2017; Vaswani et al., 2017; Velickovic et al., 2018). Hypergraph convolution can also be extended in that way, as

    X_k^{(l+1)} = \mathrm{HConv}(X^{(l)}, H_k, P_k),
    X^{(l+1)} = \mathrm{Aggregate}_{k=1}^{K}\big(X_k^{(l+1)}\big),    (12)

where \mathrm{Aggregate}(\cdot) is a certain aggregation like concatenation or average pooling, and H_k and P_k are the incidence matrix and weight matrix corresponding to the k-th head, respectively. Note that only in hypergraph attention is H_k different across heads.
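Building on the HypergraphConv sketch from Sec. 3.2, a multi-head wrapper along the lines of Eq. (12) could look as follows; concatenation is the default aggregation and averaging is the alternative mentioned above. This is our illustrative composition, not the authors' code, and it reuses the earlier single-head sketch.

import torch
import torch.nn as nn

class MultiHeadHypergraphConv(nn.Module):
    # K independent heads of Eq. (12); in hypergraph convolution all heads share
    # the same incidence matrix H, whereas in hypergraph attention each head
    # would learn its own H_k.
    def __init__(self, in_features, out_features, num_heads=8, concat=True):
        super().__init__()
        self.heads = nn.ModuleList(
            [HypergraphConv(in_features, out_features) for _ in range(num_heads)])
        self.concat = concat

    def forward(self, X, H):
        outs = [head(X, H) for head in self.heads]          # one X_k^{(l+1)} per head
        return torch.cat(outs, dim=1) if self.concat else torch.stack(outs).mean(dim=0)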
4. Experiments

In this section, we evaluate the proposed hypergraph convolution and hypergraph attention on the task of semi-supervised node classification.

4.1. Experimental Setup

Datasets. We employ three citation network datasets, namely the Cora, Citeseer and Pubmed datasets (Sen et al., 2008), following previous representative works (Kipf & Welling, 2016; Velickovic et al., 2018). Table 1 presents an overview of the statistics of the three datasets.

Table 1. Overview of data statistics.
Dataset     #Nodes   #Edges   #Features   #Classes
Cora         2708     5429     1433        7
Citeseer     3327     4732     3703        6
Pubmed      19717    44338      500        3

The Cora dataset contains 2,708 scientific publications divided into 7 categories. There are 5,429 edges in total, with each edge being a citation link from one article to another. Each publication is described by a binary bag-of-words representation, where 0 (or 1) indicates the absence (or presence) of the corresponding word from the dictionary. The dictionary consists of 1,433 unique words.

Like the Cora dataset, the Citeseer dataset contains 3,327 scientific publications, divided into 6 categories and linked by 4,732 edges. Each publication is described by a binary bag-of-words representation of 3,703 dimensions.

The Pubmed dataset is comprised of 19,717 scientific publications divided into 3 classes. The citation network has 44,338 links. Each publication is described by a vectorial representation using Term Frequency-Inverse Document Frequency (TF-IDF), drawn from a dictionary with 500 terms.

As for the training-testing data split, we adopt the setting used in (Yang et al., 2016). In each dataset, 20 articles per category are used for model training, which means the size of the training set is 140 for Cora, 120 for Citeseer and 60 for Pubmed, respectively. Another 500 articles are used for validation purposes and 1,000 articles are used for performance evaluation.

Hypergraph Construction. Most existing methods interpret the citation network as the adjacency matrix of a simple graph via a certain kind of normalization, e.g., (Kipf & Welling, 2016). In this work, we instead construct a higher-order graph to enable hypergraph convolution and hypergraph attention. The whole procedure is divided into three steps: 1) all the articles constitute the vertex set of the hypergraph; 2) each article is taken as a centroid and forms a hyperedge that connects those articles which have citation links to it (either citing it or being cited); 3) the hyperedges are equally weighted for simplicity, but one can set non-equal weights to encode prior knowledge where it exists in other applications.
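These three construction steps can be realized in a few lines. The sketch below is our reading of them (in particular, whether the centroid article joins its own hyperedge is our assumption), with one hyperedge per article so that H is square.

import torch

def citation_hypergraph(citation_edges, num_articles):
    # Step 1: every article is a vertex. Step 2: every article is also a centroid
    # defining one hyperedge that contains itself and all articles linked to it by
    # a citation in either direction. Step 3: hyperedges are equally weighted, so
    # W = I and only the incidence matrix H is needed.
    H = torch.zeros(num_articles, num_articles)          # (N vertices, N hyperedges)
    H[torch.arange(num_articles), torch.arange(num_articles)] = 1.0   # centroid joins its own hyperedge (our assumption)
    for citing, cited in citation_edges:                 # directed (citing, cited) pairs
        H[citing, cited] = 1.0                           # the citing article joins the cited article's hyperedge
        H[cited, citing] = 1.0                           # and vice versa
    return H

# e.g. citation_hypergraph([(0, 1), (2, 1)], num_articles=3) places articles
# 0, 1 and 2 together in the hyperedge centered on article 1.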
Implementation Details. We implement the proposed hypergraph convolution and hypergraph attention using PyTorch. As for the parameter setting and network structure, we closely follow (Velickovic et al., 2018) without careful parameter tuning or model design.

In more detail, a two-layer graph model is constructed. The first layer consists of 8 branches of the same topology, and each branch generates an 8-dimensional hidden representation. The second layer, used for classification, is a single-branch topology and generates a C-dimensional feature (C is the number of classes). Each layer is followed by a nonlinear activation, for which we use the Exponential Linear Unit (ELU) (Clevert et al., 2016). L2 regularization is applied to the parameters of the network with λ = 0.0003 on the Cora and Citeseer datasets and λ = 0.001 on the Pubmed dataset, respectively.
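Putting the earlier sketches together, the stated configuration corresponds roughly to the model definition below (8 heads of 8 hidden dimensions, then a single-head classification layer). This is our reconstruction of the description above, reusing the MultiHeadHypergraphConv and HypergraphConv sketches, with dropout included since the setup closely follows (Velickovic et al., 2018); it is not the released code.

import torch.nn as nn
import torch.nn.functional as F

class TwoLayerHypergraphModel(nn.Module):
    # First layer: 8 heads x 8 hidden dims with ELU (applied inside HypergraphConv).
    # Second layer: a single head producing C logits for classification.
    def __init__(self, in_features, num_classes, hidden=8, heads=8, dropout=0.6):
        super().__init__()
        self.layer1 = MultiHeadHypergraphConv(in_features, hidden, num_heads=heads)
        self.layer2 = HypergraphConv(hidden * heads, num_classes, act=None)
        self.dropout = dropout

    def forward(self, X, H):
        X = F.dropout(X, self.dropout, training=self.training)
        X = self.layer1(X, H)
        X = F.dropout(X, self.dropout, training=self.training)
        return self.layer2(X, H)        # logits; trained with cross-entropy and Adam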
Specifically for hypergraph attention, dropout with a rate of 0.6 is applied to both the inputs of each layer and the attention transition matrix. As for the computation of the attention incidence matrix H in Eq. (7), we employ a linear transform as the similarity function sim(·), followed by a LeakyReLU nonlinearity (Maas et al., 2013) with the negative input slope set to 0.2. On the Pubmed dataset, we do not use 8 output attention heads for classification, in order to ensure the consistency of network structures.

We train the model by minimizing the cross-entropy loss on the training nodes using the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 0.005 on the Cora and Citeseer datasets and 0.01 on the Pubmed dataset, respectively. An early stopping strategy is adopted on the validation loss with a patience of 100 epochs. For all the experiments, we report the mean classification accuracy of 100 trials on the testing dataset. The standard deviation, generally smaller than 0.5%, is not reported due to the space limitation.

4.2. Analysis

We first analyze the properties of hypergraph convolution and hypergraph attention with a series of ablation studies. The comparison is primarily done with the Graph Convolutional Network (GCN) (Kipf & Welling, 2016) and the Graph Attention Network (GAT) (Velickovic et al., 2018), two recent representatives of graph neural networks that have close relationships with our methods.

For a fair comparison, we reproduce the performance of GCN and GAT with exactly the same experimental setting described above; thus, we denote them by GCN* and GAT* in the following. Moreover, we employ the same normalization strategy as GCN, i.e., symmetric normalization in Eq. (5), for hypergraph convolution, and the same strategy as GAT, i.e., asymmetric normalization in Eq. (6), for hypergraph attention. They are denoted by Hyper-Conv. and Hyper-Atten. for short, respectively.

We modify the model of GAT to implement GCN by removing the attention module and directly feeding the graph adjacency matrix with the normalization trick proposed in GCN. Two noteworthy comments are made here. First, although the architecture of GCN* differs from the original one, the principle of performing graph convolution is the same. Second, directly feeding the graph adjacency matrix is not equivalent to the constant attention described in GAT, as normalization is used in our case. In GAT, the constant attention weight is set to 1 without normalization.

Comparisons with Baselines. The comparison with baseline methods is given in Table 2.

Table 2. The comparison with baseline methods in terms of classification accuracy (%). “Hyper-Conv.” denotes hypergraph convolution and “Hyper-Atten.” denotes hypergraph attention.
Method                        Cora dataset   Citeseer dataset
GCN*                          81.80          70.29
Hyper-Conv. (ours)            82.19          70.35
GCN*+Hyper-Conv. (ours)       82.63          70.00
GAT*                          82.43          70.02
Hyper-Atten. (ours)           82.61          70.88
GAT*+Hyper-Atten. (ours)      82.74          70.12

We first observe that hypergraph convolution and hypergraph attention, as non-pairwise models, consistently outperform their corresponding pairwise models, i.e., the graph convolutional network (GCN*) and the graph attention network (GAT*). For example, on the Citeseer dataset, hypergraph attention achieves a classification accuracy of 70.88, an improvement of 0.86 over GAT*. This demonstrates the benefit of considering higher-order models in graph neural networks to exploit non-pairwise relationships and the local clustering structure parameterized by hyperedges.

Compared with hypergraph convolution, hypergraph attention adopts a data-driven learning module to dynamically estimate the strength of each link associated with vertices and hyperedges. Thus, the attention mechanism helps hypergraph convolution embed the non-pairwise relationships between objects more accurately. As presented in Table 2, the performance improvements brought by hypergraph attention are 0.42 and 0.53 over hypergraph convolution on the Cora and Citeseer datasets, respectively.

Although the non-pairwise models proposed in this work achieve improvements over pairwise models, one cannot hastily deduce that non-pairwise models are more capable of learning robust deep embeddings under all circumstances. A rational claim is that they are suitable for different applications, as real data may convey different structures. Some graph-structured data can only be modeled as a simple graph, some can only be modeled as a higher-order graph, and other data is suitable for both. Nevertheless, as analyzed in Sec. 3.4, our method presents a more flexible operator in graph neural networks, where graph convolution and graph attention are special cases of the non-pairwise models with guaranteed mathematical properties and performance.

One may also be interested in another question: does it bring performance improvements to use hypergraph convolution (or attention) in conjunction with graph convolution (or attention)? We investigate this by averaging, with equal weights, the transition probabilities learned by the non-pairwise and pairwise models, and report the results in Table 2. As shown there, a positive synergy is only observed on the Cora dataset, where the best results of the convolution operator and the attention operator are improved to 82.63 and 82.74, respectively. By contrast, our methods encounter a slight performance decrease on the Citeseer dataset. From another perspective, this also supports our claim above that different data may fit different structural graph models.

Analysis of Skip Connection. We study the influence of the skip connection (He et al., 2016) by adding an identity mapping to the first hypergraph convolution layer. We report in Table 3 two settings of the weight decay, i.e., λ=3e-4 (the default setting) and λ=1e-3.

Table 3. The compatibility of skip connection in terms of classification accuracy (%).
Method                Cora (λ=3e-4)   Citeseer (λ=3e-4)   Cora (λ=1e-3)   Citeseer (λ=1e-3)
GCN*                  79.96           69.24               80.52           70.15
Hyper-Conv. (ours)    82.22           69.46               82.66           70.83
GAT*                  80.84           68.96               81.33           69.69
Hyper-Atten. (ours)   81.85           70.37               82.34           71.19

As the table shows, both GCN* and GAT* report lower recognition rates when integrated with the skip connection compared with those reported in Table 2. In comparison, the proposed non-pairwise models, especially hypergraph convolution, seem to benefit from the skip connection. For instance, the best-performing trial of hypergraph convolution yields 82.66 on the Cora dataset, better than the 82.19 achieved without the skip connection, and yields 70.83 on the Citeseer dataset, better than the 70.35 achieved without the skip connection.

Such experimental results encourage us to further train a much deeper model implemented with hypergraph convolution or hypergraph attention (say, up to 10 layers). However, we then witness a performance deterioration both with and without the skip connection. This reveals that better training paradigms and architectures are still urgently required for graph neural networks.

Analysis of Hidden Representation. Table 4 presents the performance comparison between GCN* and hypergraph convolution with an increasing length of the hidden representation. The number of heads is set to 1.

Table 4. The influence of the length of the hidden representation on the Cora dataset (classification accuracy, %).
Method         2      4      8      16     24     36
GCN*           65.9   79.6   82.0   81.9   82.0   81.9
Hyper-Conv.    69.7   80.4   82.0   82.1   82.1   82.1

It is easy to see that the performance keeps increasing with the length of the hidden representation, then peaks when the length is 16. Moreover, hypergraph convolution consistently beats GCN* across a variety of feature dimensions. As the only difference between GCN* and hypergraph convolution is the graph structure used, the performance gain comes purely from a more robust way of establishing the relationships between objects. It firmly demonstrates the ability of our methods in graph knowledge embedding.

4.3. Comparison with State-of-the-art

We compare our method with state-of-the-art algorithms which have followed the experimental setting in (Yang et al., 2016) and reported classification accuracies on the Cora, Citeseer and Pubmed datasets. Note that the results are directly quoted from the original papers, instead of being re-implemented in this work.

Table 5. Comparison with the state-of-the-art methods in terms of classification accuracy (%). The best and second best results are marked
in red and blue, respectively.
Method Cora dataset Citeseer dataset Pubmed dataset
Multilayer Perceptron 55.1 46.5 71.4
Manifold Regularization (Belkin et al., 2006) 59.5 60.1 70.7
Semi-supervised Embedding (Weston et al., 2012) 59.0 59.6 71.7
Label Propagation (Zhu et al., 2003) 68.0 45.3 63.0
DeepWalk (Perozzi et al., 2014) 67.2 43.2 65.3
Iterative Classification Algorithm (Lu & Getoor, 2003) 75.1 69.1 73.9
Planetoid (Yang et al., 2016) 75.7 64.7 77.2
Chebyshev (Defferrard et al., 2016) 81.2 69.8 74.4
Graph Convolutional Network (Kipf & Welling, 2016) 81.5 70.3 79.0
MoNet (Monti et al., 2017) 81.7 - 78.8
Variance Reduction (Chen et al., 2018b) 82.0 70.9 79.0
Graph Attention Network (Velickovic et al., 2018) 83.0 72.5 79.0
Ours 82.7 71.2 78.4

Besides GCN and GAT, the selected algorithms include Manifold Regularization (Belkin et al., 2006), Semi-supervised Embedding (Weston et al., 2012), Label Propagation (Zhu et al., 2003), DeepWalk (Perozzi et al., 2014), the Iterative Classification Algorithm (Lu & Getoor, 2003), Planetoid (Yang et al., 2016), Chebyshev (Defferrard et al., 2016), MoNet (Monti et al., 2017), and Variance Reduction (Chen et al., 2018b).

As presented in Table 5, our method achieves the second best performance on the Cora and Citeseer datasets, slightly inferior to GAT (Velickovic et al., 2018) by 0.3 and 1.3, respectively. The performance gap is attributed to multiple factors, such as differences in deep learning platforms and better parameter tuning. As shown in Sec. 4.2, thorough experimental comparisons under the same setting have demonstrated the benefit of learning deep embeddings with the proposed non-pairwise models. Nevertheless, we emphasize again that pairwise and non-pairwise models have different application scenarios, and that existing variants of graph neural networks can easily be extended to their non-pairwise counterparts with the two proposed operators.

On the Pubmed dataset, hypergraph attention reports a classification accuracy of 78.4, better than the 78.1 achieved by GAT*. As described in Sec. 4.1, the original implementation of GAT adopts 8 output attention heads, while only 1 is used in GAT* to ensure the consistency of model architectures. Even so, hypergraph attention achieves a performance comparable with the state-of-the-art methods.

5. Conclusion and Future Work

In this work, we have contributed two end-to-end trainable operators to the family of graph neural networks, i.e., hypergraph convolution and hypergraph attention. While most variants of graph neural networks assume pairwise relationships between objects of interest, the proposed operators handle non-pairwise relationships modeled in a high-order hypergraph. We theoretically demonstrate that some recent representative works, e.g., the graph convolutional network (Kipf & Welling, 2016) and the graph attention network (Velickovic et al., 2018), are special cases of our methods. Hence, the proposed hypergraph convolution and hypergraph attention are more flexible in dealing with arbitrary orders of relationship and with diverse applications where both pairwise and non-pairwise formulations are likely to be involved. Thorough experimental results with semi-supervised node classification demonstrate the efficacy of the proposed methods.

There are still some challenging directions that can be further investigated. Some of them are inherited from the limitations of graph neural networks, such as training substantially deeper models with more than a hundred layers (He et al., 2016), handling dynamic structures (Zhang et al., 2018; Zhou et al., 2018), batch-wise model training, etc. Meanwhile, some issues are directly related to the proposed methods of high-order learning. For example, although hyperedges are equally weighted in our experiments, it is promising to exploit a proper weighting mechanism when extra knowledge of the data distribution is accessible, and even to adopt a learnable module in a neural network and optimize the weights with gradient descent. Furthermore, the current implementation of hypergraph attention cannot be executed when the vertex set and the hyperedge set come from two heterogeneous domains. One possible solution is to learn a joint embedding that projects the vertex features and edge features (Simonovsky & Komodakis, 2017; Schlichtkrull et al., 2018) into a shared latent space, which requires further exploration.

Moreover, it is also interesting to plug hypergraph convolution and hypergraph attention into other variants of graph neural networks, e.g., MoNet (Monti et al., 2017), GraphSAGE (Hamilton et al., 2017) and GCPN (You et al., 2018a), and to apply them to other domain-specific applications, e.g., 3D shape analysis (Boscaini et al., 2016), visual question answering (Narasimhan et al., 2018), chemistry (Gilmer et al., 2017; Liu et al., 2018) and NP-hard problems (Li et al., 2018).

References

Agarwal, S., Branson, K., and Belongie, S. Higher order learning with graphs. In ICML, pp. 17–24, 2006.

Atwood, J. and Towsley, D. Diffusion-convolutional neural networks. In NeurIPS, pp. 1993–2001, 2016.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

Berge, C. Graphs and Hypergraphs. 1973.

Bojchevski, A., Shchur, O., Zügner, D., and Günnemann, S. NetGAN: Generating graphs via random walks. In ICML, 2018.

Bolla, M. Spectra, euclidean representations and clusterings of hypergraphs. Discrete Mathematics, 117(1-3):19–39, 1993.

Boscaini, D., Masci, J., Rodolà, E., and Bronstein, M. Learning shape correspondence with anisotropic convolutional neural networks. In NeurIPS, pp. 3189–3197, 2016.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

Chen, J., Ma, T., and Xiao, C. FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018a.

Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. In ICML, pp. 941–949, 2018b.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.

Dai, H., Li, H., Tian, T., Huang, X., Wang, L., Zhu, J., and Song, L. Adversarial attack on graph structured data. In ICML, 2018.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852, 2016.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, pp. 2224–2232, 2015.

Feng, Y., You, H., Zhang, Z., Ji, R., and Gao, Y. Hypergraph neural networks. In AAAI, 2019.

Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. Protein interface prediction using graph convolutional networks. In NeurIPS, pp. 6530–6539, 2017.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In ICML, 2017.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Huang, Y., Liu, Q., Zhang, S., and Metaxas, D. N. Image retrieval via probabilistic hypergraph ranking. In CVPR, pp. 3376–3383, 2010.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105, 2012.
Lee, J. B., Rossi, R. A., Kim, S., Ahmed, N. K., and Koh, E. Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984, 2018.

Li, P. and Milenkovic, O. Inhomogeneous hypergraph clustering with applications. In NeurIPS, pp. 2308–2318, 2017.

Li, Y. and Gupta, A. Beyond grids: Learning graph representations for visual recognition. In NeurIPS, pp. 9245–9255, 2018.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. In ICLR, 2016.

Li, Z., Chen, Q., and Koltun, V. Combinatorial optimization with graph convolutional networks and guided tree search. In NeurIPS, pp. 537–546, 2018.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. In NeurIPS, pp. 7806–7815, 2018.

Lu, Q. and Getoor, L. Link-based classification. In ICML, pp. 496–503, 2003.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

Masci, J., Boscaini, D., Bronstein, M., and Vandergheynst, P. Geodesic convolutional neural networks on riemannian manifolds. In ICCVW, pp. 37–45, 2015.

Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017.

Narasimhan, M., Lazebnik, S., and Schwing, A. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In NeurIPS, pp. 2659–2670, 2018.

Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023, 2016.

Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In KDD, pp. 701–710, 2014.

Pham, T., Tran, T., Phung, D. Q., and Venkatesh, S. Column networks for collective classification. In AAAI, pp. 2485–2491, 2017.

Rahimi, A., Cohn, T., and Baldwin, T. Semi-supervised user geolocation via graph convolutional networks. In ACL, 2018.

Rodriguez, J. A. On the laplacian spectrum and walk-regular hypergraphs. Linear and Multilinear Algebra, 51(3):285–297, 2003.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607, 2018.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93, 2008.

Simonovsky, M. and Komodakis, N. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, pp. 5998–6008, 2017.

Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.

Wang, X., Ye, Y., and Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, pp. 6857–6866, 2018.

Weston, J., Ratle, F., Mobahi, H., and Collobert, R. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. 2012.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, pp. 5987–5995, 2017.

Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.

Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018a.

Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pp. 4805–4815, 2018b.

You, J., Liu, B., Ying, R., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In NeurIPS, 2018a.

You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. GraphRNN: Generating realistic graphs with deep auto-regressive models. In ICML, pp. 5694–5703, 2018b.

Zhang, C., Hu, S., Tang, Z. G., and Chan, T. Re-revisiting learning on hypergraphs: confidence interval and subgradient method. In ICML, pp. 4026–4034, 2017.

Zhang, M. and Chen, Y. Link prediction based on graph neural networks. In NeurIPS, pp. 5171–5181, 2018.

Zhang, Z., Cui, P., and Zhu, W. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202, 2018.

Zhou, D., Huang, J., and Schölkopf, B. Learning with hypergraphs: Clustering, classification, and embedding. In NeurIPS, pp. 1601–1608, 2007.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.

Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, pp. 912–919, 2003.

Zhuang, C. and Ma, Q. Dual graph convolutional networks for graph-based semi-supervised classification. In World Wide Web, pp. 499–508, 2018.

Zien, J. Y., Schlag, M. D., and Chan, P. K. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9):1389–1399, 1999.
