
Advanced Information and Knowledge Processing

Cen Wan

Hierarchical
Feature Selection
for Knowledge
Discovery
Application of Data Mining to the
Biology of Ageing
Advanced Information and Knowledge
Processing

Series editors
Lakhmi C. Jain
Bournemouth University, Poole, UK, and
University of South Australia, Adelaide, Australia

Xindong Wu
University of Vermont
Information systems and intelligent knowledge processing are playing an increasing
role in business, science and technology. Recently, advanced information systems
have evolved to facilitate the co-evolution of human and information networks
within communities. These advanced information systems use various paradigms
including artificial intelligence, knowledge management, and neural science as well
as conventional information processing paradigms. The aim of this series is to
publish books on new designs and applications of advanced information and
knowledge processing paradigms in areas including but not limited to aviation,
business, security, education, engineering, health, management, and science. Books
in the series should have a strong focus on information processing—preferably
combined with, or extended by, new results from adjacent sciences. Proposals for
research monographs, reference books, coherently integrated multi-author edited
books, and handbooks will be considered for the series and each proposal will be
reviewed by the Series Editors, with additional reviews from the editorial board and
independent reviewers where appropriate. Titles published within the Advanced
Information and Knowledge Processing series are included in Thomson Reuters’
Book Citation Index.

More information about this series at http://www.springer.com/series/4738


Cen Wan

Hierarchical Feature
Selection for Knowledge
Discovery
Application of Data Mining to the Biology
of Ageing

Cen Wan
Department of Computer Science
University College London
London, UK

ISSN 1610-3947 ISSN 2197-8441 (electronic)


Advanced Information and Knowledge Processing
ISBN 978-3-319-97918-2 ISBN 978-3-319-97919-9 (eBook)
https://doi.org/10.1007/978-3-319-97919-9

Library of Congress Control Number: 2018951201

© Springer Nature Switzerland AG 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is dedicated to my family.
Preface

This book is the first work that systematically discusses hierarchical feature
selection algorithms and the applications of those algorithms in cutting-edge
interdisciplinary research areas: Bioinformatics and the Biology of Ageing.
Hierarchical feature selection (HFS) is an under-explored subarea of data min-
ing. Unlike conventional (flat) feature selection algorithms, HFS algorithms work
by exploiting hierarchical (generalisation–specialisation) relationships between
features, in order to try to improve the predictive accuracy of classifiers. The basic
idea is to use an HFS algorithm to select a feature subset where the hierarchical
redundancy among features is eliminated or reduced, and then give only the
selected feature subset to a classification algorithm.
Apart from introducing HFS algorithms, this book also focuses on the
ageing-related gene function prediction problem, using Bioinformatics datasets
of ageing-related genes. This type of dataset is an interesting application for
data mining methods, due to the technical difficulty and ethical issues associated
with doing ageing experiments with humans, and due to the strategic importance of
research on the Biology of Ageing: age is the greatest risk factor for a
number of diseases, yet ageing is still a poorly understood biological process.
My research on hierarchical feature selection and Bioinformatics has been done
with the help of many people. I would like to acknowledge Prof. Alex A. Freitas,
who always inspires me, encourages me and offers me enormous support. I would
also like to sincerely acknowledge Prof. David T. Jones, who always encourages
me to explore the essence of life and offers enormous support. I would like to thank
Dr. João Pedro de Magalhães, Daniel Wuttke, Dr. Robi Tacutu and the
Bioinformatics group at UCL. Finally, I would like to thank my parents and the
whole family, who made it all worthwhile.

London, UK Cen Wan


June 2018

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Data Mining and Knowledge Discovery . . . . . . . . . . . . . . . . . . . . 1
1.2 Hierarchical Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Biology of Ageing and Bioinformatics . . . . . . . . . . . . . . . . . . . . . 4
1.4 The Organisation of This Book . . . . . . . . . . . . . . . . . . . . . . . . . . 5
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Data Mining Tasks and Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Association Rule Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 The Naïve Bayes and Semi-naïve Bayes Classifiers . . . . . . . . . . . 11
2.5.1 The Naïve Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Semi-naïve Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . 11
2.6 K-Nearest Neighbour Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 14
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Feature Selection Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Conventional Feature Selection Paradigms . . . . . . . . . . . . . . . . . . 17
3.1.1 The Wrapper Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 The Filter Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.3 The Embedded Approach . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Hierarchical Feature Selection Paradigms . . . . . . . . . . . . . . . . . . . 20
3.2.1 The Lazy Learning and Eager Learning Approaches
      for Classification Tasks . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Other Approaches for Enrichment Analysis
      and Regression Tasks . . . . . . . . . . . . . . . . . . . . . . . 21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


4 Background on Biology of Ageing and Bioinformatics . . . . . . . . . . . 25


4.1 Overview of Molecular Biology . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Overview of Biology of Ageing . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Introduction to Biology of Ageing . . . . . . . . . . . . . . . . . . 26
4.2.2 Some Possible Ageing-Related Factors . . . . . . . . . . . . . . . 27
4.2.3 Mysteries in Ageing Research . . . . . . . . . . . . . . . . . . . . . 28
4.3 Overview of Gene and Protein Function Prediction
    in Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
    4.3.1 Introduction to Bioinformatics . . . . . . . . . . . . . . . . . 29
    4.3.2 Gene and Protein Function Prediction . . . . . . . . . . . . . . 30
4.4 Related Work on The Machine Learning Approach Applied
to Biology of Ageing Research . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Biological Databases Relevant to This Book . . . . . . . . . . . . . . . . 35
4.5.1 The Gene Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.2 Human Ageing Genomic Resources (HAGR) . . . . . . . . . . 36
4.5.3 Dataset Creation Using Gene Ontology Terms
      and HAGR Genes . . . . . . . . . . . . . . . . . . . . . . . . . . 37
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Lazy Hierarchical Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 Hierarchical Redundancy in Lazy Learning Paradigm . . . . . . . . . . 45
5.2 Select Hierarchical Information-Preserving Features (HIP) . . . . . . . 47
5.3 Select Most Relevant Features (MR) . . . . . . . . . . . . . . . . . . . . . . 50
5.4 Select Hierarchical Information-Preserving and Most Relevant
    Features (HIP–MR) . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
    5.6.1 Statistical Analysis of GMean Value Differences Between
          HIP, MR, HIP–MR and Other Feature Selection
          Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
    5.6.2 Robustness Against the Class Imbalanced Problem . . . . . . . . . 78
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 Eager Hierarchical Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 81
6.1 Tree-Based Feature Selection (TSEL) . . . . . . . . . . . . . . . . . . . . . 81
6.2 Bottom-Up Hill Climbing Feature Selection (HC) . . . . . . . . . . . . . 85
6.3 Greedy Top-Down Feature Selection (GTD) . . . . . . . . . . . . . . . . 88
6.4 Hierarchy-Based Feature Selection (SHSEL) . . . . . . . . . . . . . . . . 91
6.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.6.1 Statistical Analysis of GMean Value Difference between
      Different Eager Learning-Based Feature Selection
      Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6.2 Robustness Against the Class Imbalance Problem . . . . . . . 102
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7 Comparison of Lazy and Eager Hierarchical Feature Selection
Methods and Biological Interpretation on Frequently Selected
Gene Ontology Terms Relevant to the Biology of Ageing . . . . . . . . . 105
7.1 Comparison of Different Feature Selection Methods Working
with Different Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 The Number of Selected Features by Different Methods . . . . . . . . 109
7.3 Interpretation on Gene Ontology Terms Selected by Hierarchical
Feature Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8 Conclusions and Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 115
8.1 General Remarks on Hierarchical Feature Selection Methods . . . . 115
8.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Notations

DAG Directed acyclic graph


TrainSet Training dataset
TestSet Testing dataset
Inst_t The t-th instance
TrainSet_SF The training dataset with selected features
Inst_t^SF The t-th testing instance with selected features
X A set of feature(s)/node(s)
Y A set of class
R(x) The relevance value of feature x
A(x) A set of ancestor node(s) for node x
A+(x) A set of ancestor node(s) for node x plus node x
D(x) A set of descendant node(s) for node x
D+(x) A set of descendant node(s) for node x plus node x
MRF A set of feature/node with maximum relevance value
P A set of paths
PI(x) A set of parent node(s) for node x
SF A set of selected feature(s)/node(s)
C(x) A set of child node(s) for node x
L(P) A set of leaf node(s) in paths P
R(P) A set of root node(s) in paths P
S(x) Marking status of feature/node x
Status(x) Selection status of feature/node x
Value(x_t) The value of feature/node x in the t-th instance
N The dimensions of original feature set
M The dimensions of candidate feature subset
lP A scaling coefficient
∑_{i∈D} |D_{i,c}| The concentration degree of instances belonging to c different groups in the dataset D
F_curr A set of currently selected feature(s)/node(s)
Cost_curr The cost value of currently selected feature subset


F_cand A set of candidate feature(s)/node(s)
Cost_cand The cost value of candidate feature subset
IG(x) Information Gain value of feature/node x
IG(p) The mean Information Gain value of all nodes in path p
HIP Select hierarchical information-preserving features
MR Select most relevant features
HIP–MR Select hierarchical information-preserving and most relevant features
EntHIP_n Entropy-based hybrid lazy/eager learning-based feature selection method with the same n selected features by HIP method
EntMR_n Entropy-based hybrid lazy/eager learning-based feature selection method with the same n selected features by MR method
EntHIP–MR_n Entropy-based hybrid lazy/eager learning-based feature selection method with the same n selected features by HIP–MR method
ReleHIP_n Relevance-based hybrid lazy/eager learning-based feature selection method with the same n selected features by HIP method
ReleMR_n Relevance-based hybrid lazy/eager learning-based feature selection method with the same n selected features by MR method
ReleHIP–MR_n Relevance-based hybrid lazy/eager learning-based feature selection method with the same n selected features by HIP–MR method
CFS Correlation-based feature selection
TSEL Tree-based feature selection
HC Bottom-up hill climbing feature selection
GTD Greedy top-down feature selection
SHSEL Hierarchy-based feature selection
Chapter 1
Introduction

1.1 Data Mining and Knowledge Discovery

Data mining (or machine learning) techniques have attracted considerable attention
from both academia and industry, due to their significant contributions to intelligent
data analysis. The importance of data mining and its applications is likely to increase
even further in the future, given that organisations keep collecting increasingly larger
amounts of data and more diverse types of data. Due to the rapid growth of data from
real world applications, it is timely to adopt Knowledge Discovery in Databases
(KDD) methods to extract knowledge or valuable information from data. Indeed,
KDD has already been successfully adopted in real world applications, both in science
and in business.
KDD is a field of inter-disciplinary research across machine learning, statistics,
databases, etc. [4, 8, 21]. Broadly speaking, the KDD process can be divided into
four phases. The first phase is selecting raw data from original databases according
to a specific knowledge discovery task, e.g. classification, regression, clustering or
association rule mining. Then the selected raw data will be input to the phase of data
pre-processing (the second phase), which aims at processing the data into a form that
could be efficiently used by the type of algorithm(s) to be applied in the data mining
phase - such algorithms are dependent on the chosen type of knowledge discovery
task. The data pre-processing phase includes data cleaning, data normalisation, fea-
ture selection and feature extraction, etc. The third phase is data mining, where a
model will be built by running learning algorithms on the pre-processed data. In this
book, we address the classification task, where the learning (classification) algorithm
builds a classification model or classifier as will be explained later. The final phase is
extracting the knowledge from the built classifier or model. Among those four phases
of KDD, the focus of this book is on the data pre-processing phase, in particular the
feature selection task, where the goal is to remove the redundant or irrelevant features
in order to improve the predictive performance of classifiers.


1.2 Hierarchical Feature Selection

In the context of the classification task, this book focuses on the feature selection
task. When the number of features is large (like in the datasets used in this research),
it is common to apply feature selection methods to the data. These methods aim at
selecting, out of all available features in the dataset being mined, a subset of the
most relevant and non-redundant features [12, 15] for classifying instances in that
dataset. There are several motivations for feature selection [12, 15], one of the main
motivations is to try to improve the predictive performance of classifiers. Another
motivation is to accelerate the training time for building the classifiers, since training
a classifier with the selected features should be considerably faster than training the
classifier with all original features, in general. Yet another motivation is that the
selected features may represent a type of knowledge or pattern by themselves, i.e.
users may be interested in knowing the most relevant features in their datasets.
Note that feature selection is a hard computational problem, since the number
of candidate solutions (feature subsets) grows exponentially with the number of
features. More precisely, the number of candidate solutions is 2^m − 1, where m is
the number of available features in the dataset being mined, and “1” is subtracted in
order to take into account that the empty subset of features is not a valid solution for
the classification task.
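As a small illustration of this count, the following Python sketch enumerates all non-empty subsets of a toy set of m = 3 features (the feature names are arbitrary placeholders), confirming that there are 2^m − 1 candidate solutions.

```python
from itertools import combinations

features = ["A", "B", "C"]  # a toy feature set with m = 3 features

# All non-empty subsets of the features: the candidate solutions for feature selection.
subsets = [set(c) for r in range(1, len(features) + 1)
           for c in combinations(features, r)]

print(len(subsets))            # 7
print(2 ** len(features) - 1)  # 2^m - 1 = 7, the closed-form count
```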
Although there are many types of feature selection methods for classification [7,
12, 15], in general these methods have the limitation that they do not exploit infor-
mation associated with the hierarchy (generalisation-specialisation relationships)
among features, which is present in some types of features. In the example shown
in Fig. 1.1, the features A–R are hierarchically structured as a Directed Acyclic
Graph (DAG), where feature I is a child of features F and Q, while F is a child of
features M and L, and Q is a child of feature O.
This type of hierarchical relationship is relatively common (although usually
ignored) in applications. In text mining, for instance, features usually represent the
presence or absence of words in a document, and words are involved in generalisation-
specialisation relationships [3, 14]; in Bioinformatics, which is the type of application
this book focuses on, the functions of genes or proteins are often described by using
a hierarchy of terms, where terms representing more generic functions are ancestors
of terms representing more specific functions. As another example of hierarchical
features, many datasets in financial or marketing applications (where instances rep-
resent customers) have the address of the customer as a feature. This feature can be
specified at several hierarchical levels, varying from the most detailed level (e.g. the
full post code) to more generic levels (e.g. the first two or first three digits of the post
code).
From another perspective, hierarchies of features can also be produced by using
hierarchical clustering algorithms [21] to cluster features, rather than to cluster
instances, based on a measure of similarity between features. The basic idea is that
each object to be clustered would be a feature, and the similarity between any two
features would be given by a measure of how similar the values of those features are

Fig. 1.1 Example of a small DAG of features

across all instances. For instance, consider a dataset where each instance represents
an email, and each binary feature represents the presence or absence of a word. Two
features (words) can be considered similar to the extent that they occur (or don’t
occur) in the same sets of emails. Then, a hierarchical clustering algorithm can be
used to produce a hierarchy of features, where each leaf cluster will consist of a
single word, and higher-level clusters will consist of a list of words connected by an
“or” logical operator. For example, if words “money” and “buy” were merged into a
cluster by the hierarchical clustering algorithm, when mapping the original features
to the hierarchical features created by the clustering algorithm, an email with word
“money” but without the word “buy” would be considered to have value “yes” for
feature “money”, value “no” for feature “buy”, and value “yes” for feature “money
or buy”. Note that in this example the “or” operator was used (as opposed to the
“and” operator) in order to make sure the feature hierarchy is a “is-a” hierarchy; i.e.
if an email has value “yes” for a feature, it will necessarily have value “yes” for all
ancestors of that feature in the hierarchy.
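As an illustrative sketch of this idea (assuming binary word features stored as a 0/1 matrix and the availability of SciPy; the data and word names are hypothetical, not taken from any real corpus), the code below clusters the feature columns with agglomerative hierarchical clustering under the Jaccard distance and shows how a merged “or” feature would be computed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Rows = emails (instances); columns = presence/absence of a word (binary features).
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
words = ["money", "buy", "meeting", "report"]

# Cluster the features (columns), not the instances: transpose X and use the
# Jaccard distance, so two words are similar if they occur in the same emails.
feature_distances = pdist(X.T.astype(bool), metric="jaccard")
feature_hierarchy = linkage(feature_distances, method="average")

# Each merge in the hierarchy defines a higher-level "or" feature: an email has
# value 1 for the merged feature iff it has value 1 for any word in the cluster.
money_or_buy = X[:, words.index("money")] | X[:, words.index("buy")]
print(money_or_buy)  # value of the "money or buy" feature for each email
```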
Intuitively, in datasets where such hierarchical relations among features exist,
ignoring such relationships seems a sub-optimal approach; i.e. these hierarchical rela-
tionships represent additional information about the features that could be exploited
to improve the predictive performance associated with feature selection methods –
i.e. the ability of these methods to select features that maximise the predictive accu-
racy to be obtained by classification algorithms using the selected features. This is
the basic idea behind the hierarchical feature selection methods discussed in this
book.
The hierarchical feature selection methods are categorised into two types, i.e.
lazy learning-based and eager learning-based. In the lazy learning-based approach,
the feature selection process is postponed until the moment when testing instances are
observed, rather than being performed in the training phase as in conventional learning methods (which

perform “eager learning”). Both the lazy learning-based and eager learning-based
methods discussed in this book are evaluated together with the well-known Bayesian
network classifiers and K-Nearest Neighbour classifier.

1.3 Biology of Ageing and Bioinformatics

In terms of applications of the proposed hierarchical feature selection methods, this


book focuses on analysing biological data about ageing-related genes [1, 2, 6, 9,
11, 18–20]. The causes and mechanisms of the biological process of ageing are a
mystery that has puzzled humans for a long time. Biological research has, however,
revealed some factors that seem associated with the ageing process.
For instance, caloric restriction – which consists of taking a reduced amount of
calories without undergoing malnutrition – extends the longevity of many species
[13]. In addition, research has identified that several biological pathways seem to
regulate the process of ageing (at least in model organisms), such as the well-known
insulin/insulin-like growth factor (IGF-1) signalling pathway [10]. It is also known
that mutations in some DNA repair genes lead to accelerated ageing syndromes [5].
Despite such findings, ageing is a highly complex biological process which is still
poorly understood, and much more research is needed in this area.
Unfortunately, conducting ageing experiments in humans is very difficult, due
to the complexity of the human genome, the long lifespan of humans, and the ethical
issues associated with experiments on humans. Therefore, research on the biology of
ageing is usually done with model organisms like yeast, worms, flies or mice, which
can be observed in an acceptable time and have considerably simpler genomes.
In addition, with the growing amount of ageing-related data on model organisms
available on the web, in particular related to the genetics of ageing, it is timely to
apply data mining methods to that data [20], in order to try to discover patterns that
may assist ageing research.
More precisely, in this book, the instances being classified are genes from four
major model organisms, namely: C. elegans, D. melanogaster, M. musculus and S.
cerevisiae. Each gene has to be classified into one of two classes: pro-longevity or
anti-longevity, based on the values of features indicating whether or not the gene is
associated with each of a number of Gene Ontology (GO) terms, where each term
refers to a type of biological process, molecular function or cellular component. Pro-
longevity genes are those whose decreased expression (due to knockout, mutations
or RNA interference) reduces lifespan and/or whose overexpression extends lifes-
pan; accordingly, anti-longevity genes are those whose decreased expression extends
lifespan and/or whose overexpression decreases it [16].
The GO terms are adopted as features to predict a gene's effect on longevity
because of the widespread use of the GO in gene and protein function prediction
and the fact that GO terms were explicitly designed to be valid across different
types of organisms [17]. GO terms are organised into a hierarchical structure where,
for each GO term t, its ancestors in the hierarchy denote more general terms (i.e.

more general biological processes, molecular function or cellular component) and


its descendants denote more specialised terms than t. It is important to consider
the hierarchical relationships among GO terms when performing feature selection,
because such relationships encode information about redundancy among GO terms.
In particular, if a given gene g is associated with a given GO term t, this logically
implies that it is also associated with all ancestors of t in the GO hierarchy. This kind
of redundancy can have a substantially negative effect on the predictive accuracy
of Bayesian network classification algorithms, such as Naïve Bayes [21]. This issue
will be discussed in detail later.
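The snippet below illustrates this hierarchical redundancy with a hypothetical mini-hierarchy (not the real GO): propagating a gene's direct annotations upwards shows that every ancestor term becomes a logically redundant feature for that gene.

```python
# Hypothetical mini-hierarchy: each term maps to its direct parent term(s) in the DAG.
parents = {
    "t_specific": ["t_mid1", "t_mid2"],
    "t_mid1": ["t_root"],
    "t_mid2": ["t_root"],
    "t_root": [],
}

def propagate(direct_annotations, parents):
    """Close a gene's annotation set under the ancestor relation: if the gene is
    associated with term t, it is also associated with every ancestor of t."""
    closed = set(direct_annotations)
    stack = list(direct_annotations)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, []):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed

print(propagate({"t_specific"}, parents))
# {'t_specific', 't_mid1', 't_mid2', 't_root'} -- the ancestor terms carry no extra
# information about this gene, which is the redundancy that hierarchical feature
# selection methods aim to remove.
```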

1.4 The Organisation of This Book

This book first introduces the background of data mining tasks and feature selection
paradigms in Chaps. 2 and 3. The background of Biology of Ageing and Bioin-
formatics is covered in Chap. 4. Chapter 5 discusses three types of lazy learning-
based hierarchical feature selection methods and Chap. 6 discusses four types of
eager learning-based hierarchical feature selection methods. The overall compari-
son between both lazy learning-based and eager learning-based hierarchical feature
selection methods is described in Chap. 7, where the frequently selected GO terms
by one of the best hierarchical feature selection methods are interpreted from the
perspective of Biology of Ageing. The conclusion and future research directions are
described in Chap. 8.

References

1. de Magalhães JP, Budovsky A, Lehmann G, Costa J, Li Y, Fraifeld V, Church GM (2009) The


human ageing genomic resources: online databases and tools for biogerontologists. Aging Cell
8(1):65–72
2. Fang Y, Wang X, Michaelis EK, Fang J (2013) Classifying aging genes into DNA repair or
non-DNA repair-related categories. In: Huang DS, Jo KH, Zhou YQ, Han K (eds) Lecture
notes in intelligent computing theories and technology. Springer, Berlin, pp 20–29
3. Fellbaum C (1998) WordNet. Blackwell Publishing Ltd, Hoboken
4. Freitas AA (2002) Data mining and knowledge discovery with evolutionary algorithms.
Springer, Berlin
5. Freitas AA, de Magalhães JP (2011) A review and appraisal of the DNA damage theory of
ageing. Mutat Res 728(1–2):12–22
6. Freitas AA, Vasieva O, de Magalhães JP (2011) A data mining approach for classifying DNA
repair genes into ageing-related or non-ageing-related. BMC Genomics 12(27):1–11
7. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn
Res 3:1157–1182
8. Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques: concepts and techniques.
Elsevier, San Francisco
9. Huang T, Zhang J, Xu ZP, Hu LL, Chen L, Shao JL, Zhang L, Kong XY, Cai YD, Chou KC
(2012) Deciphering the effects of gene deletion on yeast longevity using network and machine
learning approaches. Biochimie 94(4):1017–1025

10. Kenyon CJ (2010) The genetics of ageing. Nature 464(7288):504–512


11. Li YH, Dong MQ, Guo Z (2010) Systematic analysis and prediction of longevity genes in
caenorhabditis elegans. Mech Ageing Dev 131(11–12):700–709
12. Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspec-
tive. Springer, US
13. Masoro EJ (2005) Overview of caloric restriction and ageing. Mech Ageing Dev 126(9):913–
922
14. Miller GA, Beckwith R, Fellbaum C, Gross D, Miller KJ (1990) Introduction to WordNet: an on-line
lexical database. Int J Lexicogr 3(4):235–244
15. Pereira RB, Plastino A, Zadrozny B, de C Merschmann LH, Freitas AA (2011) Lazy attribute
selection: choosing attributes at classification time. Intell Data Anal 15(5):715–732
16. Tacutu R, Craig T, Budovsky A, Wuttke D, Lehmann G, Taranukha D, Costa J, Fraifeld VE,
de Magalhães JP (2013) Human ageing genomic resources: integrated databases and tools for
the biology and genetics of ageing. Nucleic Acids Res 41(D1):D1027–D1033
17. The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat
Genet 25(1):25–29
18. Wan C, Freitas AA (2013) Prediction of the pro-longevity or anti-longevity effect of Caenorhab-
ditis elegans genes based on Bayesian classification methods, pp 373–380
19. Wan C, Freitas AA (2015) Two methods for constructing a gene ontology-based feature selec-
tion network for a Bayesian network classifier and applications to datasets of aging-related
genes. In: Proceedings of the sixth ACM conference on bioinformatics, computational biology
and health informatics (ACM-BCB 2015), Atlanta, USA, pp 27–36
20. Wan C, Freitas AA, de Magalhães JP (2015) Predicting the pro-longevity or anti-longevity
effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM
Trans Comput Biol Bioinform 12(2):262–275
21. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and tech-
niques. Morgan Kaufmann, Burlington
Chapter 2
Data Mining Tasks and Paradigms

Data Mining tasks are types of problems to be solved by a data mining or machine
learning algorithm. The main types of data mining tasks can be categorised as clas-
sification, regression, clustering and association rule mining. The former two tasks
(classification and regression) are grouped under the supervised learning paradigm,
whereas clustering is categorised as unsupervised learning.
Supervised learning consists of learning a function from labelled training data
[19]. The supervised learning process consists of two phases, i.e. the training phase
and the testing phase. Accordingly, in the supervised learning process, the original
dataset is divided into training and testing datasets. In the training phase, only the
training dataset will be used for inferring the specific function by learning a specific
model, which will be evaluated by using the testing dataset in the testing phase.
Unlike supervised learning, unsupervised learning is usually defined as a process
of learning particular patterns from unlabelled data. In unsupervised learning, there
is no distinction between training and testing datasets, and all available data are used
to build the model. The usual application of unsupervised learning is to find groups
or patterns of similar instances, constituting a clustering problem. Differently from
both supervised learning and unsupervised learning, the task of association
rule mining is to discover valuable relationships between items in a large database.

2.1 Classification

The classification task is possibly the most studied task in data mining. It consists
of building a classification model or classifier to predict the class label (a nominal
or categorical value) of an instance by using the values of the features (predictor
attributes) of that instance [5, 9]. Actually, the essence of the classification process is
exploiting correlations between features and the class labels of instances in order to
find the border between class labels in the data space - a space where the position of an
instance is determined by the values of the features in that instance. The classification

Fig. 2.1 Example of data classification into two categories (Class A and Class B, separated by a boundary)

border is exemplified in Fig. 2.1, in the context of a problem with just two class labels,
where the found classification border (a black dashed line) distinguishes the instances
labelled as red deltas from those labelled as blue circles.
Many types of classification algorithms have been proposed, such as Bayesian
network classifiers, Decision Tree, Support Vector Machine (SVM), Artificial Neural
Networks (ANN), etc. From the perspective of interpretability of the classifier, those
classifiers can be categorised into two groups, i.e. “white box” and “black box”
classifiers. The “white box” classifiers, e.g. Bayesian network classifiers and Decision
Tree, have better interpretability than the latter ones, e.g. Support Vector Machine
(SVM) and Artificial Neural Networks (ANN) [6]. This book focuses on Bayesian
network classifiers [7, 23, 24, 26, 27] (more precisely, Naïve Bayes and Semi-naïve
Bayes classifiers), due to their good potential for interpretability; in addition to their
ability to cope with uncertainty in data – a common problem in Bioinformatics [8].

2.2 Regression

Regression analysis is a traditional statistical task with the theme of discovering the
association between predictive variables (features) and the target (response) variable.
As it is usually used for prediction, regression analysis can also be considered a type
of supervised learning task from the perspective of machine learning and data mining.
Overall, a regression method is capable of predicting the numeric (real-valued)
value of the target variable of an instance - unlike classification methods, which
predict nominal (categorical) values, as mentioned earlier. A typical example of a
conventional linear regression model for a dataset with just one feature x is shown
as Eq. 2.1, where xi denotes the value of the feature x for the ith instance,

Fig. 2.2 Example of regression for data, fitted by the line y = 0.65·x + 9.36·10⁻³

yi = β0 + β1 xi + ξi (2.1)

β0 and β1 denote the corresponding weights, and ξi denotes the error term. The most appropriate
values of the weights in Eq. 2.1 can be found using mathematical methods, such as
the well-known linear least squares method [16, 17, 21]. Then the predicted output value yi
is computed from the value of the input feature weighted by its corresponding coefficient.
As shown in the simple example of Fig. 2.2, the small distances between the line and
the data points indicate that Eq. 2.1 fits the data well. Regression analysis has been
well studied in the statistics area and widely applied in different domains.
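A minimal numerical sketch of fitting Eq. 2.1 by ordinary least squares with NumPy (the data points are illustrative, not those of Fig. 2.2):

```python
import numpy as np

# Illustrative one-feature dataset (x_i, y_i).
x = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
y = np.array([0.02, 0.15, 0.27, 0.40, 0.52, 0.66])

# Design matrix with an intercept column; solve for (beta_0, beta_1)
# by linear least squares, as in Eq. 2.1.
A = np.column_stack([np.ones_like(x), x])
(beta0, beta1), *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = beta0 + beta1 * x          # predicted values of the target variable
print(beta0, beta1)                 # fitted intercept and slope
print(np.mean((y - y_pred) ** 2))   # mean squared error of the fit
```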

2.3 Clustering

The clustering task mainly aims at finding patterns in the data by grouping similar
instances into clusters (or groups). The instances within the same cluster are more
similar to each other, and simultaneously more dissimilar from the instances in
other clusters. An example of clustering is shown in Fig. 2.3, where the left graph
represents the situation before clustering, where all data are unlabelled (in blue),
and the right graph represents the situation where all data are clustered into four
different groups, i.e. group A of data in red, group B of data in blue, group C of data
in green and group D of data in orange. Clustering has been widely studied in the
area of statistical data analysis, and applied on different domains, like information
retrieval, Bioinformatics, etc. Examples of well-known, classical clustering methods
are k-means [10] and k-medoids [14].
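As a brief usage sketch (synthetic two-dimensional data, and assuming scikit-learn is available), k-means can be used to group instances into four clusters, mirroring the situation in Fig. 2.3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D instances drawn around four centres.
centres = np.array([[-10, -10], [-10, 10], [10, -10], [10, 10]])
X = np.vstack([c + rng.normal(scale=2.0, size=(50, 2)) for c in centres])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignment of the first 10 instances
print(kmeans.cluster_centers_)    # the four learned cluster centres
```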

Fig. 2.3 Example of data clustered into four groups (Clusters A–D)

Fig. 2.4 Example of association rule mining based on transaction data

2.4 Association Rule Mining

Association rule mining aims at discovering valuable links between variables in
large databases. Basically, association rule mining methods apply thresholds on the
support and confidence values to select the most reliable rules among a set of candidate links
between variables, e.g. X → Y. Briefly, the support value measures the frequency of
the rule (i.e. of transactions containing both X and Y) in the database, while the confidence
value measures the fraction of records containing X that also contain Y. An example of
association rule mining is shown in Fig. 2.4, where the left table includes 10 transaction
records in a restaurant and the right table shows 6 example rules discovered from those
records. The rule involving fish and chips is ranked at the top due to its high support (0.75)
and confidence (1.00). Some well-known association rule mining methods (e.g. the
Apriori algorithm [1]) have been applied to business tasks, such as mining patterns
from large transaction databases.
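The sketch below computes the support and confidence of a candidate rule X → Y directly from their definitions, on a toy set of transactions chosen so that the fish-and-chips rule has support 0.75 and confidence 1.00, as in the example above:

```python
transactions = [
    {"fish", "chips"}, {"fish", "chips"}, {"fish", "chips", "cola"},
    {"fish", "chips"}, {"fish", "chips", "peas"}, {"fish", "chips"},
    {"burger", "cola"}, {"salad"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of the transactions containing the antecedent X that also
    contain the consequent Y, i.e. support(X ∪ Y) / support(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

x, y = {"fish"}, {"chips"}
print(support(x | y, transactions))   # 0.75: support of the rule {fish} -> {chips}
print(confidence(x, y, transactions)) # 1.00: confidence of the rule
```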

Fig. 2.5 An example Naïve Bayes network topology

2.5 The Naïve Bayes and Semi-naïve Bayes Classifiers

2.5.1 The Naïve Bayes Classifier

The Naïve Bayes classifier [5, 9, 18, 20, 25] is a type of Bayesian network classifier
that assumes that all features are independent from each other given the class attribute.
An example of this classifier’s network topology is shown in Fig. 2.5, where each
feature X i (i = 1, 2, . . . , 5) only depends on the class attribute. In the figure, this is
indicated by an edge pointing from the class node to each of the feature nodes. As
shown in Eq. 2.2,


P(y | x1, x2, ..., xn) ∝ P(y) · ∏_{i=1}^{n} P(xi | y)        (2.2)

where ∝ is the mathematical symbol for proportionality and n is the number of


features; the estimation of the probability of a class attribute value y given all predictor
features’ values xi of one instance can be obtained by calculating the product of the
individual probability of each feature value given a class attribute value and the prior
probability of that class attribute value. Naïve Bayes (NB) has been shown to have
relatively powerful predictive performance, compared with other Bayesian network
classifiers [7], even though it pays the price of ignoring the dependencies between
features.
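A compact Python sketch of Eq. 2.2 for binary features (illustrative only: probabilities are estimated with Laplace smoothing and combined in log space to avoid numerical underflow):

```python
import numpy as np

# Toy training data: rows = instances, columns = binary features; y = class labels.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

def train_nb(X, y, alpha=1.0):
    """Estimate the prior P(y) and the likelihoods P(x_i = 1 | y) with Laplace smoothing."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {c: (X[y == c].sum(axis=0) + alpha) /
                      ((y == c).sum() + 2 * alpha) for c in classes}
    return priors, likelihoods

def predict_nb(x, priors, likelihoods):
    """Pick the class maximising log P(y) + sum_i log P(x_i | y), as in Eq. 2.2."""
    scores = {}
    for c, prior in priors.items():
        p1 = likelihoods[c]                       # P(x_i = 1 | y = c)
        probs = np.where(x == 1, p1, 1.0 - p1)    # P(x_i | y = c) for each feature value
        scores[c] = np.log(prior) + np.log(probs).sum()
    return max(scores, key=scores.get)

priors, likelihoods = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), priors, likelihoods))
```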

2.5.2 Semi-naïve Bayes Classifiers

The Naïve Bayes classifier is very popular and has been applied in many domains
due to its advantages of simplicity and short learning time, compared with other
Bayesian classifiers. However, the assumption of conditional independence between
features is usually violated in practice. Therefore, many extensions of Naïve Bayes

Fig. 2.6 An example of TAN’s network topology

focus on approaches to relax the assumption of conditional independence [7, 15,


26]. This sort of classifier is called Semi-naïve Bayes classifier.
Both the Naïve Bayes classifier and Semi-naïve Bayes classifiers use estimation of
the prior probability of the class and the conditional probability of the features given
the class to obtain the posterior probability of the class given the features, as shown
in the Eq. 2.3 (i.e. the Bayes’ formula), where y denotes a class and x denotes the set
of features, i.e. {x1, x2, ..., xn}. However, different Semi-naïve Bayes classifiers
use different approaches to estimate the term P(x | y), as discussed in the next
subsections.
P(y | x) = P(x | y) · P(y) / P(x)        (2.3)

2.5.2.1 Tree Augmented Naïve Bayes (TAN)

TAN constructs a network in the form of a tree, where each feature node is allowed to
have at most one parent feature node in addition to the class node (which is a parent
of all feature nodes), as shown in Fig. 2.6, where each feature except the root feature
X 4 has only one non-class parent feature. TAN computes the posterior probability
of a class y using Eq. 2.4,


P(y | x1, x2, ..., xn) ∝ P(y) · ∏_{i=1}^{n} P(xi | Par(xi), y)        (2.4)

where the number of non-class parent features for each feature xi (i.e. the size of Par(xi)),
except for the root feature, equals 1. Hence, TAN represents a limited degree of dependency
among features.
In essence, the original TAN classifier firstly produces a rank of feature pairs
according to the conditional mutual information between the pair of features given
the class attribute. Then the Maximum Spanning Tree is built based on the rank.
Next, the algorithm randomly chooses a root feature and then sets all directions of
edges to other features from it. Finally, the constructed tree is used for classification.

Fig. 2.7 An example of BAN’s network topology

The concept of conditional mutual information proposed for building TAN clas-
sifiers is an extension of mutual information. The formula of conditional mutual
information is shown as Eq. 2.5,
I_P(Xi; Xj | Y) = ∑_{xi, xj, y} P(xi, xj, y) · log( P(xi, xj | y) / ( P(xi | y) · P(xj | y) ) )        (2.5)

where Xi and Xj (i ≠ j) are predictor features, Y is the class attribute, xi, xj, y are
the values of the corresponding features and the class attribute, P(xi, xj, y) denotes
the joint probability of xi, xj, y; P(xi, xj | y) denotes the joint probability of feature
values xi and xj given class value y; and P(xi | y) denotes the conditional probability
of feature value xi given class value y. Each pair of features “xi, xj” is taken into
account as a group, and then the conditional mutual information for each pair of features given the
class attribute is computed [7].
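The following sketch estimates the conditional mutual information of Eq. 2.5 from empirical frequencies for a pair of binary features and a binary class (toy data; maximum-likelihood estimates without smoothing). In TAN, such pairwise scores are then used as edge weights when building the maximum spanning tree over the features.

```python
from collections import Counter
from math import log2

def cond_mutual_info(xi, xj, y):
    """I(X_i; X_j | Y) estimated from empirical frequencies (Eq. 2.5)."""
    n = len(y)
    p_xixjy = Counter(zip(xi, xj, y))   # counts of (x_i, x_j, y)
    p_xiy = Counter(zip(xi, y))         # counts of (x_i, y)
    p_xjy = Counter(zip(xj, y))         # counts of (x_j, y)
    p_y = Counter(y)
    cmi = 0.0
    for (a, b, c), count in p_xixjy.items():
        p_abc = count / n                       # P(x_i, x_j, y)
        p_ab_given_c = count / p_y[c]           # P(x_i, x_j | y)
        p_a_given_c = p_xiy[(a, c)] / p_y[c]    # P(x_i | y)
        p_b_given_c = p_xjy[(b, c)] / p_y[c]    # P(x_j | y)
        cmi += p_abc * log2(p_ab_given_c / (p_a_given_c * p_b_given_c))
    return cmi

xi = [1, 1, 0, 0, 1, 0, 1, 0]
xj = [1, 1, 0, 1, 1, 0, 0, 0]
y  = [1, 1, 1, 1, 0, 0, 0, 0]
print(cond_mutual_info(xi, xj, y))  # higher values -> stronger dependence given Y
```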

2.5.2.2 Bayesian Network Augmented Naïve Bayes (BAN)

The BAN classifier is a more complicated type of Semi-naïve Bayes classifier, which
(unlike NB and TAN) can represent more complicated dependencies between features
[3, 7]. More precisely, in a BAN, in Eq. 2.4, the number of parent feature node(s)
for each node xi (i.e. Par (xi )) is allowed to be more than one. An example of this
classifier’s network topology is shown in Fig. 2.7, where each feature xi has the class
attribute as a parent, indicated by the dashed lines; and possibly other non-class
parent feature(s), as indicated by the solid lines. Node X4 has two non-class parent
nodes X1 and X5, while node X3 also has two non-class parent nodes X2 and X4.
There exist several approaches for constructing a BAN classifier from data that
have been shown to be relatively efficient to use, particularly when the number
of feature parents of a node is limited to a small integer number (a user-specified
parameter). However, in general, learning a BAN classifier tends to be much more
time consuming than learning a NB or TAN classifier, mainly due to the large time
taken to search for a good BAN network topology.

Fortunately, in the context of the Bioinformatics data used in this book, there are
strong dependency relationships between features (Gene Ontology terms), which
have been already defined by expert biologists in the form of a feature graph, con-
taining hierarchical relationships among features that are represented as directed
edges in the feature graph (i.e. Gene Ontology hierarchy, as will be explained in
detail later). Such hierarchical relationships provide a sophisticated representation
of biological knowledge that can be directly exploited by a BAN classifier. Hence, the
pre-defined hierarchical relationships retained in the data are adopted as the topology
of the BAN classifier network (i.e. the Gene Ontology-based BAN [22]), rather than
learning the BAN network topology from the data.

2.6 K-Nearest Neighbour Classifier

K-Nearest-Neighbour (KNN) is a type of instance-based classifier. It predicts the
class label of an individual testing instance by taking the majority class label among
the training instances with the closest distances to it [2, 4, 11]. KNN is also a type of lazy
learning-based classifier, i.e. a classifier is effectively trained for each individual testing instance.
In this book, the Jaccard similarity coefficient [12, 13] is adopted as the distance
metric, due to the binary feature values in the datasets discussed in this book. As
shown in Eq. 2.6, the Jaccard similarity coefficient calculates the ratio of the size of
the intersection over the size of the union of two feature sets. M11 denotes the total
number of features that have value “1” in both the ith (testing) and kth (nearest training)
instances; M10 denotes the total number of features that have value “1” in the ith
instance and value “0” in the kth instance; M01 denotes the total number of features
that have value “0” in the ith instance and value “1” in the kth instance.

Jaccard(i, k) = M11 / (M11 + M10 + M01)        (2.6)
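A minimal sketch of the lazy KNN classifier using the Jaccard similarity of Eq. 2.6 over binary feature vectors (toy data; majority vote among the k most similar training instances, with k assumed odd to avoid ties):

```python
import numpy as np

def jaccard(a, b):
    """Eq. 2.6: M11 / (M11 + M10 + M01) for two binary vectors."""
    m11 = np.sum((a == 1) & (b == 1))
    m10 = np.sum((a == 1) & (b == 0))
    m01 = np.sum((a == 0) & (b == 1))
    denom = m11 + m10 + m01
    return m11 / denom if denom > 0 else 1.0   # two all-zero vectors are identical

def knn_predict(x_test, X_train, y_train, k=3):
    """Majority class among the k training instances most similar to x_test."""
    sims = np.array([jaccard(x_test, x) for x in X_train])
    nearest = np.argsort(-sims)[:k]            # indices of the k most similar instances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0]])
y_train = np.array(["pro-longevity", "pro-longevity",
                    "anti-longevity", "anti-longevity"])
print(knn_predict(np.array([1, 0, 1, 0]), X_train, y_train, k=3))
```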

References

1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of
the 20th international conference on very large data bases (VLDB 1994), Santiago, Chile, pp
487–499
2. Aha DW (1997) Lazy learning. Kluwer Academic Publishers, Norwell
3. Cheng J, Greiner R (1999) Comparing Bayesian network classifiers. In: Proceedings of the
fifteenth conference on uncertainty in artificial intelligence, Stockholm, Sweden, pp 101–108
4. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory
13(1):21–27
5. Freitas AA (2002) Data mining and knowledge discovery with evolutionary algorithms.
Springer, Berlin

6. Freitas AA (2013) Comprehensible classification models - a position paper. ACM SIGKDD


Explor 15(1):1–10
7. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29(2–
3):131–163
8. Ghahramani Z (2015) Probabilistic machine learning and artificial intelligence. Nature
521(7553):452–459
9. Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques: concepts and techniques.
Elsevier, San Francisco
10. Hartigan JA, Wong MA (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc
Ser C (Appl Stat) 28(1):100–108
11. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining,
inference, and prediction. Springer, Berlin
12. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, Englewood Cliffs
13. Jain AK, Zongker D (1997) Representation and recognition of handwritten digits using
deformable templates. IEEE Trans Pattern Anal Mach Intell 19(12):1386–1391
14. Jin X, Han J (2010) Encyclopedia of machine learning. Springer, US
15. Kononenko I (1991) Semi-naive Bayesian classifier. In: Proceedings of machine learning-
European working session on learning, Porto, Portugal, pp 206–219
16. Lawson CL, Hanson RJ (1974) Solving least squares problems. Prentice-Hall, Englewood
Cliffs
17. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic, New York
18. Minsky M (1961) Steps toward artificial intelligence. In: Proceedings of the IRE, pp 8–30
19. Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. MIT Press,
Cambridge
20. Peot MA (1996) Geometric implications of the naive Bayes assumption. In: Proceedings of
the twelfth international conference on uncertainty in artificial intelligence, Portland, USA, pp
414–419
21. Strutz T (2010) Data fitting and uncertainty (A practical introduction to weighted least squares
and beyond). Vieweg+Teubner, Wiesbaden
22. Wan C, Freitas AA (2015) Two methods for constructing a gene ontology-based feature selec-
tion network for a Bayesian network classifier and applications to datasets of aging-related
genes. In: Proceedings of the sixth ACM conference on bioinformatics, computational biology
and health informatics (ACM-BCB 2015), Atlanta, USA, pp 27–36
23. Wang Z, Webb GI (2002) Comparison of lazy Bayesian rule, and tree-augmented Bayesian
learning. In: Proceedings of IEEE international conference on data mining (ICDM 2002),
Maebashi, Japan, pp 490–497
24. Webb GI, Boughton JR, Wang Z (2005) Not so naive Bayes: aggregating one-dependence
estimators. Mach Learn 58(1):5–24
25. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and tech-
niques. Morgan Kaufmann, Burlington
26. Zheng F, Webb GI (2005) A comparative study of semi-naive Bayes methods in classifica-
tion learning. In: Proceedings of the fourth australasian data mining conference (AusDM05),
Sydney, Australia, pp 141–155
27. Zheng F, Webb GI (2006) Efficient lazy elimination for averaged one-dependence estimators.
In: Proceedings of the twenty-third international conference on machine learning (ICML 2006),
Pittsburgh, USA, pp 1113–1120
Chapter 3
Feature Selection Paradigms

3.1 Conventional Feature Selection Paradigms

Feature selection is a type of data pre-processing task that consists of removing


irrelevant and redundant features in order to improve the predictive performance of
classifiers. The dataset with the full set of features is input to the feature selection
method, which will select a subset of features to be used for building the classi-
fier. Then the built classifier will be evaluated, by measuring its predictive accuracy.
Irrelevant features can be defined as features which are not correlated with the class
variable, and so removing such features will not be harmful for the predictive per-
formance. Redundant features can be defined as those features which are strongly
correlated with other features, so that removing those redundant features should also
not be harmful for the predictive performance.
Generally, feature selection methods can be categorised into three groups, i.e.
wrapper approaches, filter approaches and embedded approaches as discussed next.

3.1.1 The Wrapper Approach

The wrapper feature selection approach decides which features should be selected
from the original full set of features based on the predictive performance of the clas-
sifier with different candidate feature subsets. In the wrapper approach, the training
dataset is divided into a “building” (or “learning”) set and a validation-set. The best
subset of features to be selected is decided by iteratively getting a candidate feature
subset, building the classifier from the learning-set, using only the candidate feature
subset, and measuring accuracy on the validation-set. A Boolean stopping criterion then checks
whether the selected subset of features achieves the expected improvement in pre-
dictive performance. If not, another candidate feature subset is selected and evaluated;
otherwise, the feature selection stage terminates, and the best


subset of features will be used for building the classifier, which is finally evaluated
on the testing dataset.
The wrapper approach selects features that tend to be tailored to the classification
algorithm, since the feature selection process was guided by the algorithm’s accuracy.
However, the wrapper approach has relatively higher time complexity than the filter
and embedded approaches, since in the wrapper approach the classification algorithm
has to be run many times.
One feature selection method following the wrapper approach is Backward
Sequential Elimination (BSE). It starts with the full set of features and then iteratively
uses leave-one-out cross-validation to check whether removing the feature whose
elimination most reduces the error on the validation-set improves predictive accuracy.
It repeats this process until the improvement in accuracy ends [18].
The opposite approach, named Forward Sequential Selection (FSS), starts with
the empty set of features and then iteratively adds the feature that most improves
accuracy on the validation dataset to the set of selected features. This iterative process
is repeated until the predictive accuracy starts to decrease [9]. Both wrapper feature
selection methods just discussed have a very high processing time because they
perform many iterations and each iteration involves measuring predictive accuracy
on the validation dataset by running a classification algorithm.
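A schematic Python sketch of Forward Sequential Selection is given below. The `evaluate` argument stands for any routine that builds the classifier on the learning-set using only the candidate features and returns the accuracy measured on the validation-set; the toy evaluation function at the end is purely hypothetical, included only to make the sketch runnable.

```python
def forward_sequential_selection(all_features, evaluate):
    """Greedy wrapper search: repeatedly add the feature that most improves
    validation accuracy; stop when no addition helps."""
    selected = []
    best_acc = evaluate(selected)          # accuracy with no features (baseline)
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Accuracy obtained by adding each remaining feature in turn.
        scores = {f: evaluate(selected + [f]) for f in candidates}
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_acc:     # no improvement -> terminate
            break
        selected.append(best_f)
        best_acc = scores[best_f]
    return selected, best_acc

# Hypothetical evaluation function: pretends features "a" and "c" are the useful ones.
toy_scores = {frozenset(): 0.5, frozenset("a"): 0.7, frozenset("c"): 0.65,
              frozenset("ac"): 0.8}
def toy_evaluate(features):
    return toy_scores.get(frozenset(features), 0.6)

print(forward_sequential_selection(list("abc"), toy_evaluate))  # (['a', 'c'], 0.8)
```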

3.1.2 The Filter Approach

Unlike the wrapper approach, the filter approach conducts the feature selection pro-
cess by evaluating the quality of a feature or feature subset using a quality mea-
sure that is independent from the classification algorithm that will be applied to the
selected features. The subset of features is chosen from the original full set of fea-
tures according to a certain selection criterion (or feature relevance measure). The
selected feature subset is then input into the classification algorithm, the classifier is
built and then the predictive accuracy is measured on the testing dataset and reported
to the user. Note that the classifier is built and evaluated only once at the end of the
process, rather than being iteratively built and evaluated in a loop, like in the wrapper
approach. This means the filter approach is much faster than the wrapper approach
in general. All hierarchical feature selection methods discussed in this book are filter
approaches, which will be described in detail in Chaps. 5, 6 and 7.
Filter feature selection methods can be mainly categorised into two groups. The
first group focuses on measuring the quality (relevance) of each individual feature
without taking into account the interaction with other features. Basically, the rele-
vance of each feature will be evaluated by a certain criterion, such as the mutual
information with the class variable, the information gain [14], etc. Then all features
will be ranked in descending order according to the corresponding relevance mea-
sure. Only the top-n features will be selected for the classification stage, where n is a

user-defined parameter. This type of method is simple, but it ignores the interaction
between features, and therefore it can select redundant features.
The second group of filter methods aims at selecting a subset of features to be
used for classification by considering the interaction between features within each
evaluated candidate subset of features. For example, one of the most well-known mul-
tivariate filter feature selection methods is called Correlation-based Feature Selection
(CFS) [3, 4, 16], which is based on the following hypothesis:

A good feature subset is one that contains features highly correlated with
(predictive of) the class, yet uncorrelated with (not predictive of) each other –
Hall [3].

The approach used by the CFS method for evaluating the relevance (Merit) of
a candidate subset of features based on the above hypothesis is based on Eq. 3.1,
which is based on Pearson’s linear correlation coefficient (r) used for standardised
numerical feature values. In Eq. 3.1, k denotes the number of features in the
Merit_S = (k · r_cf) / √(k + k(k − 1) · r_ff)        (3.1)

current feature subset; r_cf denotes the average correlation between class and features
in that feature subset; r_ff denotes the average correlation between all pairs of features
in that subset. The numerator measures the predictive power of all features within
that subset, which is to be maximised; while the denominator measures the degree
of redundancy among those features in the subset, which is to be minimised.
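As a concrete illustration of Eq. 3.1, the sketch below computes the merit of a candidate feature subset from a matrix of standardised feature values and a class vector, using Pearson correlation for both the feature-class and feature-feature terms. The function name and the use of absolute correlations are illustrative choices, not part of the CFS definition itself.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Compute the CFS merit (Eq. 3.1) of a candidate feature subset.

    X: 2D array (instances x features) of standardised numerical values.
    y: 1D array of class values (encoded numerically).
    subset: list of column indices of the candidate features.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    # Average absolute feature-class correlation (r_cf).
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    # Average absolute feature-feature correlation over all pairs (r_ff).
    pairs = [(subset[i], subset[j]) for i in range(k) for j in range(i + 1, k)]
    r_ff = np.mean([abs(np.corrcoef(X[:, f1], X[:, f2])[0, 1])
                    for f1, f2 in pairs]) if pairs else 0.0
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)
```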
Another component of CFS is the search strategy used to explore the feature subset
space. Many heuristic search methods have been applied, e.g. Hill-climbing search,
Best First search and Beam search [12], and more recently genetic algorithms [7, 8].
However, the CFS method based on genetic algorithms addresses the task of multi-
label classification, where an instance can be assigned two or more class labels
simultaneously, a more complex type of classification task which is out of the scope
of this book.
The search strategy implemented in the Weka version of CFS, used in the experiments reported in other chapters, is Backward-Greedy-Stepwise, which conducts a backward greedy search in the feature subset space. The termination criterion is met when the deletion of any remaining feature leads to a decrease in the subset evaluation value.
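A minimal sketch of such a backward greedy stepwise search is shown below; it takes an arbitrary merit function (for example, the cfs_merit sketch above) and is only an illustration of the idea, not Weka's actual implementation.

```python
def backward_greedy_stepwise(all_features, merit):
    """Backward greedy search over feature subsets.

    all_features: list of feature indices (or names).
    merit: callable mapping a list of features to a quality score
           (e.g. the cfs_merit sketch above).
    """
    selected = list(all_features)
    current_merit = merit(selected)
    while len(selected) > 1:
        # Try deleting each remaining feature and keep the best resulting subset.
        candidates = [[f for f in selected if f != removed] for removed in selected]
        best = max(candidates, key=merit)
        best_merit = merit(best)
        if best_merit <= current_merit:
            break  # no single deletion improves the merit, so stop
        selected, current_merit = best, best_merit
    return selected

# Example usage with hypothetical data X, y and the cfs_merit sketch above:
# selected = backward_greedy_stepwise(list(range(X.shape[1])),
#                                     lambda s: cfs_merit(X, y, s))
```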

3.1.3 The Embedded Approach

Embedded feature selection methods conduct the feature selection process within
the process of building the classifier, rather than conducting feature selection before
building the classifier. For example, within the process of building a Decision Tree
classifier, each feature is evaluated as a candidate for splitting the set of instances
in the current tree node based on the values of that feature. Another example of an embedded feature selection method is the well-known Least Absolute Shrinkage and Selection Operator (LASSO) [5, 13], which is a linear regression method that performs embedded feature selection. In general, LASSO aims to find the parameters (regression coefficients) of a linear model that minimise both the value of a loss function and the value of a regularisation term, which penalises models with large feature weights. The need to minimise the regularisation term forces the construction of sparse models, in which many features are assigned a weight of "0" and are thereby eliminated. Therefore, LASSO effectively selects a subset of relevant features.
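As an illustration of this behaviour, the sketch below fits scikit-learn's Lasso estimator on toy data and reads the embedded feature selection off the non-zero coefficients (assuming scikit-learn is available; the regularisation strength alpha = 0.1 and the toy data are arbitrary choices).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy regression data: 100 instances, 10 standardised numerical features,
# where only features 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.standard_normal(100)

# Fit a LASSO model; the L1 regularisation term drives many coefficients to zero.
lasso = Lasso(alpha=0.1).fit(X, y)

# The embedded feature selection is read off the non-zero coefficients.
selected = np.flatnonzero(lasso.coef_)
print("Selected feature indices:", selected)
```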

3.2 Hierarchical Feature Selection Paradigms

Hierarchical feature selection methods are a specific type of feature selection methods
based on the principle of exploiting the hierarchical relationships among features in
order to improve the quality of the selected feature subset. This type of feature
selection method is the theme of this book.
There has been very little research so far on hierarchical feature selection, i.e.
on feature selection methods that exploit the generalisation-specialisation relation-
ships in the feature hierarchy to decide which features should be selected. This book
discusses seven types of hierarchical feature selection methods for the task of clas-
sification. Those seven methods are further categorised as lazy learning-based or
eager learning-based, according to the type of hierarchical redundancy they cope with.

3.2.1 The Lazy Learning and Eager Learning Approaches for Classification Tasks

Data mining or machine learning methods can be categorised into two general
paradigms, depending on when the learning process is performed, namely: lazy
learning and eager learning. A lazy learning-based classification algorithm builds a
specific classification model for each individual testing instance to be classified [1,
11]. This is in contrast to the eager learning approach, which performs the learning
process during the training phase, i.e. learning the classifier (or classification model)
using the whole training dataset before any testing instance is observed. Then the
classifier is used to classify all testing instances.
In the context of feature selection, lazy learning-based methods select a specific set of features for each individual testing instance, whilst eager learning-based methods select a single set of features for all testing instances. In general, both lazy learning-based and eager learning-based hierarchical feature selection methods aim at removing the hierarchical redundancy included in the generalisation-specialisation relationships between features. The former cope with the hierarchical redundancy included in a single testing instance. For example, as shown in Fig. 1.1, features M, L, F, I, Q and O are all redundant to feature C, since C is a descendant of all those 6 features and all those features have the same value "1" in that instance. Eager learning-based methods cope with the hierarchical redundancy without considering the values of features in individual instances, considering only the general relationships between features in the hierarchy, e.g. removing a parent feature from the feature set if its child feature has higher relevance to the class attribute. Those seven hierarchical feature selection methods are discussed in detail in Chaps. 5, 6 and 7.
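As a small illustration of the lazy case, the sketch below removes, for a single instance, every ancestor of a feature whose value is "1". The chain of parent links is a hypothetical fragment consistent with the Fig. 1.1 example, and this simple rule is only one possible formulation of removing such redundancy, not one of the specific methods of Chaps. 5, 6 and 7.

```python
def remove_redundant_ancestors(instance, parents):
    """For a single instance (dict: feature -> 0/1), drop every ancestor of a
    feature whose value is 1, since those ancestors carry redundant information."""
    redundant = set()
    for feature, value in instance.items():
        if value == 1:
            # Climb the hierarchy, marking all ancestors as redundant.
            frontier = list(parents.get(feature, ()))
            while frontier:
                anc = frontier.pop()
                if anc not in redundant:
                    redundant.add(anc)
                    frontier.extend(parents.get(anc, ()))
    return [f for f in instance if f not in redundant]

# Hypothetical fragment of the Fig. 1.1 hierarchy: C is a child of O, O of Q, etc.
parents = {"C": ["O"], "O": ["Q"], "Q": ["I"], "I": ["F"], "F": ["L"], "L": ["M"]}
instance = {"C": 1, "O": 1, "Q": 1, "I": 1, "F": 1, "L": 1, "M": 1, "X": 0}
print(remove_redundant_ancestors(instance, parents))  # ['C', 'X']
```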

3.2.2 Other Approaches for Enrichment Analysis and Regression Tasks

Apart from those seven methods for the classification task, hierarchical feature selection
methods have also been proposed for the task of selecting enriched Gene Ontology
terms (terms that occur significantly more often than expected by chance) [2] and
the task of learning linear models for regression, where the target variable to be
predicted is continuous [6, 10, 15, 17]. Note that these tasks are quite different from
the classification task addressed in this book, where the goal is to predict the value
of a categorical (or nominal) class variable for an instance based on the values of
features describing properties of that instance. In any case, a brief review of these
methods is presented next.
Alexa et al. (2006) [2] proposed two methods to identify enriched Gene Ontology
(GO) terms in a group of genes using the dependency information retained in the GO
hierarchy. The first proposed method exploits the hierarchical dependencies between GO terms, i.e. the calculation of the p-value for each GO term starts from the bottom-most level of the GO graph. If a GO term is found to be significant based on its p-value, then all genes associated with that GO term are removed from the sets of genes associated with its ancestor terms. This significance test is applied until all GO terms have been processed. The second method calculates the significance score
of GO terms using the weights of their associated genes. The adjustment of the weights for an individual GO term takes into account the significance scores of its child GO terms. If the significance score of a child GO term is greater than that of its parent GO term, then the weights for that parent term and all ancestor GO terms are increased, and the weight of that child GO term is also re-computed. This adjustment process is iteratively executed until there is no child GO term whose weight is greater than any of its ancestors' weights. Both methods
showed better performance than competing methods.
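A minimal sketch of the first (elim-style) strategy is shown below, under simplifying assumptions: Fisher's exact test (via scipy) is used as the significance test, the GO graph is given as a dictionary of direct parents, and terms are supplied in a bottom-up order. It illustrates the idea described above rather than reproducing Alexa et al.'s actual implementation.

```python
from scipy.stats import fisher_exact

def all_ancestors(term, parents):
    """Collect every ancestor of a GO term by following parent links upwards."""
    seen, frontier = set(), list(parents.get(term, ()))
    while frontier:
        t = frontier.pop()
        if t not in seen:
            seen.add(t)
            frontier.extend(parents.get(t, ()))
    return seen

def elim_enrichment(term_genes, parents, study_genes, all_genes,
                    bottom_up_order, alpha=0.01):
    """Elim-style enrichment: test GO terms from the most specific level upwards;
    when a term is significant, remove its genes from all of its ancestors.
    term_genes maps each GO term to a set of genes; study_genes and all_genes
    are sets of gene identifiers; bottom_up_order lists terms, leaves first."""
    term_genes = {t: set(g) for t, g in term_genes.items()}  # working copy
    p_values = {}
    for term in bottom_up_order:
        annotated = term_genes[term]
        a = len(annotated & study_genes)          # study genes annotated with the term
        b = len(annotated) - a                    # background genes with the term
        c = len(study_genes) - a                  # study genes without the term
        d = len(all_genes - study_genes) - b      # background genes without the term
        _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        p_values[term] = p
        if p < alpha:
            for anc in all_ancestors(term, parents):
                if anc in term_genes:
                    term_genes[anc] -= annotated  # decorrelate the ancestors
    return p_values
```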
Variations of the LASSO method also perform hierarchical feature selection by
using regularisation terms that consider the feature hierarchy. Briefly, a feature can be
added into the set of selected features only if its parent feature is also included in that
set. LASSO could be seen as one type of embedded feature selection method, since
it removes features during the stage of model training. LASSO has been successful
in various applications such as biomarker selection, biological network construction,
and magnetic resonance imaging [15].

References

1. Aha DW (1997) Lazy learning. Kluwer Academic Publishers, Norwell
2. Alexa A, Rahnenführer J, Lengauer T (2006) Improved scoring of functional groups from gene
expression data by decorrelating GO graph structure. Bioinformatics 22(13):1600–1607
3. Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, The
University of Waikato
4. Hall MA, Smith LA (1997) Feature subset selection: a correlation based filter approach. In:
Proceedings of 1997 international conference on neural information processing and intelligent
information systems, pp 855–858
5. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining,
inference, and prediction. Springer, Berlin
6. Jenatton R, Audibert JY, Bach F (2011) Structured variable selection with sparsity-inducing
norms. J Mach Learn Res 12:2777–2824
7. Jungjit S, Freitas AA (2015) A new genetic algorithm for multi-label correlation-based feature
selection. In: Proceedings of the twenty-third european symposium on artificial neural net-
works, computational intelligence and machine learning (ESANN-2015), Bruges, Belgium, pp
285–290
8. Jungjit S, Freitas AA (2015) A lexicographic multi-objective genetic algorithm for multi-label
correlation-based feature selection. In: Proceedings of the companion publication of work-
shop on evolutionary rule-based machine learning at the genetic and evolutionary computation
conference (GECCO 2015), Madrid, Spain, pp 989–996
9. Langley P, Sage S (1994) Induction of selective Bayesian classifiers. In: Proceedings of the tenth
international conference on uncertainty in artificial intelligence, Seattle, USA, pp 399–406
10. Martins AFT, Smith NA, Aguiar PMQ, Figueiredo MAT (2011) Structured sparsity in structured
prediction. In: Proceeding of the 2011 conference on empirical methods in natural language
processing (EMNLP 2011). Edinburgh, UK, pp 1500–1511
11. Pereira RB, Plastino A, Zadrozny B, de C Merschmann LH, Freitas AA (2011) Lazy attribute
selection: choosing attributes at classification time. Intell Data Anal 15(5):715–732
12. Rich E, Knight K (1991) Artificial intelligence. McGraw-Hill Publishing Co., New York
13. Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B
(Methodol) 58(1):267–288
14. Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization.
In: Proceedings of the fourteenth international conference on machine learning (ICML 1997),
Nashville, USA, pp 412–420
15. Ye J, Liu J (2012) Sparse methods for biomedical data. ACM SIGKDD Explor Newsl 14(1):4–
15
16. Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter
solution. In: Proceedings of the twentieth international conference on machine learning (ICML
2003), Washington, DC, USA
17. Zhao P, Rocha G, Yu B (2009) The composite absolute penalties family for grouped and
hierarchical variable selection. Ann Stat 37(6):3468–3497
18. Zheng F, Webb GI (2005) A comparative study of semi-naive Bayes methods in classifica-
tion learning. In: Proceedings of the fourth australasian data mining conference (AusDM05),
Sydney, Australia, pp 141–155
Chapter 4
Background on Biology of Ageing
and Bioinformatics

Ageing is an ancient research topic that has attracted scientists’ attention for a long
time, not only for its practical implications for extending the longevity of human
beings, but also due to its high complexity. With the help of modern biological
science, it is possible to start to reveal the mysteries of ageing. This book focuses
on research about the biology of ageing, which is the application domain of the hierarchical feature selection methods described in the next three chapters. This chapter briefly reviews basic concepts of Molecular Biology, the Biology of Ageing and Bioinformatics.

4.1 Overview of Molecular Biology

Molecular Biology is defined by the Oxford Dictionary as “the branch of biology


that deals with the structure and function of the macromolecules essential to life”.
More precisely, molecular biology focuses on understanding the interactions between
DNA, RNA and proteins, including the regulation of the systems consisting of those
macromolecules.
Such regulation mechanisms include the process of gene expression, which can
be divided into three main stages, i.e. transcription, translation and protein folding.
At the stage of transcription, Deoxyribonucleic acid (DNA), which is a type of
nucleic acid that contains the genetic information, is transcribed into messenger
RNA (mRNA), then the mRNA will be translated into the amino acid sequence of a
protein, which is finally folded into a 3D structure in the cell.
The basic units of DNA consist of adenine (A), guanine (G), cytosine (C) and
thymine (T), and a DNA sequence can be represented by the combination of A, G,
C, and T, such as ATAAGCTC [57]. The 3D structure of DNA is a double helix,
where one strand governs the synthesis of a complementary RNA molecule during
the transcription process [52].

RNA, which is another type of nucleic acid, plays an important role in the process
of protein production. RNA has basic units that are the same units of DNA with the
exception that thymine (T) in DNA is replaced by uracil (U) in RNA. The structure of
RNA is represented as a chain of nucleotides, which is different from DNA having a
double helix structure. There exist different types of RNA, e.g. mRNA, tRNA, rRNA,
etc. Among those types of RNA, mRNA performs its function during the stage of
transcription, which is defined as the synthesis of RNA based on a DNA template
[57] or the process of copying one of the DNA strands into an RNA [52]. Then
the next step is translation, by which the linear sequence of information retained in
mRNA is decoded and used for producing linear chains of amino acids, which are
the basic components of proteins and determine the structure of proteins [52].
A gene is considered a segment/unit of DNA that contains hereditary information
and defines particular characteristics/functions of proteins [52]. Briefly, one specific
gene controls different functions of proteins, and therefore affects particular functions
of organisms, such as the effect on the metabolism rate, which is possibly an ageing-
related factor that will be discussed later.
Proteins are large biological molecules that carry out almost all of living cells’
functions, most of which are determined by the ability of proteins to recognise other
molecules through binding [9]. The functions of proteins can be categorised into three
major broad groups: structural proteins, which are considered as the organism’s basic
building blocks; enzymes, which regulate biochemical reactions; and transmembrane
proteins that maintain the cellular environment [11]. Proteins consist of 20 different
types of amino acids that are joined together to compose a linear sequence named
poly-peptide chain [11]. Proteins have four types of structure.
The primary structure is a linear amino acid sequence which determines all other
three types of structures. The secondary structure consists of α helices and β sheets.
The tertiary structure is a 3D structure that is built according to the spontaneous
folding of poly-peptides in the cell environment. It is made by α helices, β sheets,
other minor secondary structures and connecting loops [57]. The quaternary structure
is composed of two or more poly-peptide chains, held together by the same forces that
stabilise the tertiary structure [57].
This book focuses on ageing-related genes. Recall that one specific gene
controls certain functions for organisms by producing certain proteins. The next
section will review some factors associated with ageing, including some discovered
age-related genes and their related biological processes.

4.2 Overview of Biology of Ageing

4.2.1 Introduction to Biology of Ageing

Ageing is a complex and stochastic process of progressive function loss for an organ-
ism with time [38], and the accumulation of function losses leads to the mortality of
the organism. The speed of ageing and the longevity of organisms differ between
species. For example, C. elegans’ lifespan is around 2–3 weeks [35], whereas the
ocean quahog has a longevity of 400 years. In terms of human longevity, the longest
age record is 122.5 years and the average longevity measured in 2009 was 79.4 years
in the UK [62].
The mystery of ageing is a sophisticated issue that has puzzled humans for thousands of years, as reflected by the many stories about failed quests for immortality. Nowadays, with the help of molecular biology, some possible
factors related to ageing have been found, as discussed next.

4.2.2 Some Possible Ageing-Related Factors

Some ageing-related factors have been revealed with the help of molecular biology,
such as genetic factors, environmental factors, etc. From the perspective of molecular
biology, those factors have an effect on ageing through their regulation of ageing-
related biological pathways.
A biological pathway is a series of actions among molecules in a cell that leads to a
certain product or a change in a cell [31]. Biological pathway analysis is considered an approach for researching the molecular mechanisms of ageing. In particular, the pathways related to the regulation of growth, energy metabolism, nutrition sensing and reproduction seem associated with the process of ageing [59].
Genetic factors have been shown to be one of the most important types of factors that impact on the biological pathways related to the ageing process. The mutation of a gene (or genes) changes the effects of pathways on organisms. For instance, it has been found that a gene called daf-2 is highly related to the extension of lifespan in C. elegans (a worm). The mutation of daf-2 affects the activation of FOXO proteins, which can activate cell maintenance and stress resistance mechanisms [36]. It was also found that mutations that increase oxidative damage can shorten lifespan. For example, ctl-1 mutations shorten lifespan and prevent the lifespan extension of daf-2 mutants, which is accompanied by the accumulation of age-associated lipofuscin granules [27]. This point of view is also supported by another possible ageing-related pathway, i.e. the target of rapamycin (TOR) pathway. TOR kinase stimulates growth and blocks salvage pathways [36] that are related to autophagy (a basic repair mechanism that degrades damaged cell components), which can alleviate the accumulation of damage in cells.
Nutritional level is another type of environmental factor. This was discovered in 1935 by McCay et al. [47] in well-executed studies, which showed that the longevity of rats can be extended by a dietary control approach. Several later findings showed that the method of dietary control for extending longevity can be applied to other species, such as yeast, fish, hamster, etc. [46]. Caloric restriction was found to be helpful for extending lifespan, possibly because it attenuates oxidative damage. The joint impact of a reduced rate of reactive oxygen molecule generation and an increased efficiency of protective processes might alleviate the accumulation of oxidative damage; evidence for this was found in isolated mitochondria and microsomes from calorie-restricted rodents [46].
In addition, some diseases (in particular, most types of cancer) are also factors that are highly related to ageing. Cancer cells could be seen as immortal, in contrast to normal cells, which have an intrinsic process of senescence. Some research revealed that cell senescence might be a mechanism of tumour suppression [58]. Experiments observing the function of p53 (a gene that prevents cancer) supported that hypothesis. Finkel et al. (2007) [20] found that mice which over-expressed p53 were resistant to cancer, but aged prematurely; and reduction of p53 expression prevents telomere- or damage-induced senescence [12]. A possible reason is that p53 helps to avoid or reduce
However, the relationship between ageing and cancer is very complex and has not
been precisely understood so far.
The evolutionary history theory of ageing is a popular explanation for the differences in longevity between species. Firstly, the natural selection principle plays an essential role in the development of a species' lifespan. The rate of ageing changes concomitantly with changes in the force of natural selection [38]. Especially in hazardous environments, the surviving individuals would promote their somatic maintenance ability and propagate their gene variants [59]. Also, a deleterious mutation will not be easily passed to offspring via reproduction, since the effect of a mutation usually appears in early life [24], before the individual has a chance to reproduce. On the other hand, if a mutation has a deleterious effect that occurs only in late life, long after the organism has reproduced, there is little selection pressure to eliminate that kind of mutation (since it does not affect the reproduction of the organism). Secondly, the competition between species suppresses the growth of longevity expectation for the weaker competitor, as limited resources would not support the energy consumption in harsh environmental conditions [37], and the weaker competitor usually does not have enough time for evolution. For example, the observation of a mainland population and an island population of Didelphis virginiana revealed that the latter has longer longevity, since they have reduced exposure to predators compared with the former [4]. The evolutionary history hypothesis provides a macro-perspective on the development of lifespan expectation for different species.

4.2.3 Mysteries in Ageing Research

Although some findings about the possible causes of the process of ageing have
been revealed, several mysteries about ageing remain unsolved. To start with,
the actual biological mechanisms leading to ageing are still not clear. For example,
the actual function of longevity-associated genes with respect to the stress resistance
is unknown [19] and the answer about how different ageing-related biological path-
ways interact and cooperate is still absent [59]. Moreover, it is not clear how gene
mutations affect ageing-related cellular degeneration [59]. Furthermore, the diversity
between species limits the universality of support from those hypotheses about the
reasons of ageing. In terms of the caloric restriction theory, which caloric restric-
tion approach extends the lifespan and the actual molecular mechanism underlying
that extension are still debated, and whether caloric restriction extends longevity in
long-lived species is unknown [28]. Therefore, discovering answers to the mysteries
of ageing is challenging, as a vast variety of ageing-related factors interact with
each other, and the answers are still a long way off.

4.3 Overview of Gene and Protein Function Prediction in Bioinformatics

4.3.1 Introduction to Bioinformatics

Bioinformatics is an inter-disciplinary field that integrates computer science, math-
ematics, statistics, etc., with the purpose of assisting biological research. Bioinfor-
matics can be defined as follows:

The science of collecting and analysing complex biological data such as
genetic codes. - Oxford Dictionary

The main subareas of Bioinformatics consist of biological data management, bio-
logical data analysis software development and research on biological data analysis
methods.
In terms of biological data management, there exist many biological databases
with different types of biological data. For example, the well-known GenBank
database is a collection of publicly available nucleotide sequences [7]; the Biological
General Repository for Interaction Datasets (BioGRID) is a repository of data about
physical and genetic interactions from model organisms [54]; and REACTOME is
a curated database about human pathways and reactions [15]. Those Bioinformat-
ics databases foster the development of Bioinformatics and also promote biology
research, since the biological data in these databases are well stored, integrated or
managed.
Based on those biological databases, many applications have been developed to support biology research, e.g. gene and protein function prediction [10, 23, 40, 51, 53], protein structure prediction [5, 30, 33, 34, 39], etc. In this book, hierarchical feature selection methods are applied to the tasks of ageing-related gene function prediction and ageing-related biological pattern discovery.
4.3.2 Gene and Protein Function Prediction

As one of the main tasks in Bioinformatics, protein function prediction has been highly valued due to its advantages of saving time and reducing cost, since it can be used to guide the design of biological experiments that confirm whether a protein has a certain function. A biologist can then conduct experiments focusing only on the few specific proteins whose functions have been predicted with high confidence, rather than conducting a large number of slow and expensive biological experiments. The methods for gene and protein function prediction can be categorised
into three main broad groups, i.e. sequence alignment analysis, 3D structure similarity
analysis, and machine learning-based methods. Those three groups of methods will
be reviewed in the next three subsections.

4.3.2.1 Sequence Alignment Analysis Methods

Sequence Alignment Analysis is the most conventional approach to predict the func-
tions of proteins and genes. A well-known Sequence Alignment Analysis-based
method, named Basic Local Alignment Search Tool (BLAST), has been highly val-
ued and widely applied to protein and gene function prediction. The basic principle
of BLAST is measuring the degree of similarity between the amino acid sequence
of a protein with unknown function and the amino acid sequence of a set of proteins
with known functions. The motivation for this approach is that a protein’s amino acid
sequence dictates the protein’s 3D structure, which further determines the function
of the protein. In this approach, an unknown-function protein is predicted to have
the functions of its most similar known-function proteins.
In detail, BLAST employs a measure of local similarity, called the maximal segment pair (MSP) score, between two sequences, and uses a dynamic programming algorithm to detect whether the score can be improved by extending or shortening the segment pair [3]. Then a user-defined threshold is used to filter the most reliable MSPs. Based on this basic principle, BLAST has been extended to fit more applications, such as Primer-BLAST [63], IgBLAST [64], etc.
Although BLAST has dominated in the area of protein/gene function prediction,
it has several limitations, as follows [22]. Firstly, BLAST is only applicable for
predicting the function of proteins or genes which are similar to known-function
proteins and genes. Secondly, similar amino acid sequences do not guarantee similar
functions between proteins, because of the difference of their 3D structure. Therefore,
the high score obtained by BLAST might not be quite reliable. Thirdly, in the context
of coping with hierarchical protein function data, such as the data consisting of
generalisation-specialisation relationships discussed in this book, BLAST has the
limitation of ignoring such hierarchical relationships.
4.3.2.2 3D Structure Analysis-Based Protein Function Prediction

In a cell, the folds of proteins will spontaneously change depending on cellular
environment factors. Therefore, it is uncertain that a high degree of similarity between
amino acid sequences will lead to similar functions. In general, the information about
protein structure is more valuable in terms of protein function prediction. The second
group of methods for protein function prediction is based on protein 3D structure
analysis. There are some protein folds that are associated with multiple functions,
but most folds have been found to represent a unique function [23]. Algorithms based solely on the knowledge of folds do not meet the expectation of high accuracy. To overcome that shortcoming, a more reliable strategy has been proposed, which analyses the structural patterns of proteins, i.e. spatial regions within the protein structure that serve as unique markers for specific functions [23].
The basic concept of a 3D structure analysis-based protein function prediction
algorithm consists of two parts: 3D motif library generation and a searching algo-
rithm for matching motifs between two proteins [23]. For example, a well-known
3D structure analysis-based protein function prediction server ProFunc [40] detects
the possible function of unknown proteins by using a graph-matching algorithm to
compare the secondary structure elements (SSEs) between target proteins and the
proteins whose SSEs are known and stored in the databases. In addition, ProFunc
further analyses the cleft size, residue type and other details of structural informa-
tion about the protein. 3D structure analysis has attracted attention due to its highly
reliable predictive results. There are several tools based on structure analysis that
are available to be used by the Bioinformatics community, such as SuMo, PINTS,
PDBFun, etc.

4.3.2.3 The Machine Learning-Based Approach

Machine learning methods have been widely applied in Bioinformatics research, such
as in the task of protein and gene function prediction. Unlike the popular sequence
similarity-based methods, such as BLAST, the machine learning approach can be
called a model induction or alignment-free approach. Briefly, this approach treats
protein function prediction as a classification task, where the protein functions are
classes and the predictor attributes (or features) are properties or characteristics of
proteins. One of the advantages of machine learning-based protein function prediction
methods (more precisely, classification methods) is that they can predict the func-
tions of a given protein without being given existing similar proteins (i.e. proteins with
amino acid sequences similar to the protein being classified). More precisely, classi-
fication methods take into account the variables (attributes) denoting different types
of biological properties that might be associated with protein function prediction.
A lot of different types of classifiers have been adopted for different tasks of protein
and gene function prediction and have shown powerful predictive performance. For
example, Support Vector Machine (SVM), which is a type of classifier that obtains
very good predictive performance in general, has been widely used. For instance, the
well-known protein sequence Feature-based Function Prediction (FFPred) method
[13, 45, 48] exploits different protein biophysical properties to train a library of SVM
classifiers for predicting proteins' Gene Ontology annotations; Borgwardt et al. (2005)
[10] classified proteins into functional classes by applying SVM with graph kernels;
and Bhardwaj et al. (2005) [8] used SVM to predict DNA-binding proteins. Note,
however, that SVMs have the disadvantage of producing “black-box” classification
models, which in general cannot be interpreted by biologists.
Bayesian network and tree-based classifiers (e.g. Decision Tree and Random
Forests) are another group of classifiers that are widely applied in protein func-
tion prediction, due to their advantage of producing probabilistic graphic models
that can be interpreted by biologists. For example, Yousef et al. (2007) [65] used
Naïve Bayes to predict microRNA targets. As another example, Barutcuoglu et al.
(2006) [6] proposed to use a Bayesian network to cope with the prediction incon-
sistency problem that happens in a hierarchical classifier. Inconsistent hierarchical
predictions occur, e.g. when a classifier predicts for a given instance, a certain class
y, but not an ancestor of class y in the hierarchy. This is inconsistent, assuming the
class hierarchy is a “is-a” hierarchy, so that an instance assigned to a class must be
assigned to its ancestor classes. That Bayesian network calculates the most probable
prediction results by Bayes’ theorem. More specifically, they trained an individual
SVM classifier for each class, so that the different SVMs can make inconsistent pre-
dictions across the class hierarchy, and then combined the predictions of all those
SVMs by using a Bayesian network. Most recently, Wan et al. (2017) improved the
predictive performance of FFPred-fly [61] by adopting the Random Forests classi-
fication algorithm, which also reveals links between different biological processes
and certain developmental stages of Drosophila melanogaster.
Apart from classifiers, feature selection methods also play an important role on
protein function prediction, due to their capacity of improving the predictive per-
formance of classifiers by providing the classification algorithm with a subset of
very relevant features, removing features with little relevance or containing redun-
dant information for classification purposes. For example, Glaab et al. (2012) [26]
adopted three different types of eager learning-based feature selection algorithms,
i.e. partial least squares-based feature selection (PLSS), correlation-based feature
selection and random forest-based feature selection, working with rule-based evo-
lutionary machine learning systems to tackle the microarray data classification task.
The experimental results show that PLSS outperforms other non-univariate feature
selection methods and indicate that the feature independence assumption could be
beneficial for microarray gene selection tasks. Note that those three types of fea-
ture selection methods select a feature subset for classifying all testing instances,
following the eager learning paradigm. Al-Shahib et al. (2005) [1] adopted a type
of wrapper feature selection method with a genetic search algorithm combined with
SVM, Decision Tree and Naïve Bayes classifiers for predicting protein functions for
the Neisseria gonorrhoea proteome. In another work, Al-Shahib et al. (2005) [2] proposed a new feature selection approach, which first ranks all features according to their corresponding p-values, calculated by the Wilcoxon rank sum test between each feature and the class variable, and then removes redundant features by scanning the ranking from top to bottom. The method used to detect redundancy is based on the correlation coefficient. Li et al. (2012) [43] adopted the mRMR (minimal-redundancy-maximal-relevance) method [49] to select the optimal subset of features for predicting protein domains. This method first ranks all features according to the quality measure computed by the mRMR method, and then evaluates the predictive performance of different subsets of features by adding one feature at a time to the current feature subset, in descending order of the features' ranking. In addition, Leijoto
et al. (2014) [42] adopted genetic algorithms to select a subset of physical-chemical
features to predict protein functions.

4.3.2.4 A Comparison Between Three Approaches for Gene and Protein Function Prediction

Comparing machine learning-based methods and sequence alignment analysis meth-
ods, the latter seem to have more limited reliability in general. As mentioned in the previous section, although the primary structure broadly determines the functions of proteins, it is also possible that two proteins have different functions while their primary structures are quite similar. That means a high score obtained by sequence alignment does not guarantee a high degree of similarity between the functions of the aligned proteins. For example, according to research on Gene Ontology term annotation errors, the error rate of annotations inferred by sequence similarity reaches 49% in some cases [22]. In addition, the sequence alignment methods have the drawback of
not discovering relationships between biochemical properties and protein functions,
which would be valuable for biologists.
Comparing machine learning-based methods and 3D structure analysis methods,
the latter show high accuracy in terms of protein function prediction. However, the
obvious limitation of 3D structure analysis methods is that there are many proteins
whose 3D structure is unknown. Therefore, in the case of predicting functions of an
unknown protein, the prediction method’s accuracy is limited by the availability of
proteins that not only have a known 3D structure, but also have a 3D structure similar
to the current unknown protein.
Although machine learning-based methods show advantages of flexibility and the potential for discovering comprehensible models compared with the other two approaches, the model induction approach also has the limitation of sometimes not producing comprehensible models, when the choice of machine learning algorithm(s) is not appropriate. More precisely, owing to their high predictive accuracy, black-box classifiers attract most researchers' attention in the Bioinformatics community. In particular, artificial neural networks and support vector machines are widely used as protein function prediction methods. However, as mentioned earlier, in general, those classifiers cannot be interpreted by users and they cannot reveal valuable insight on relationships between protein features (properties) and protein function.
Therefore, white-box (interpretable) classifiers, such as Bayesian network classifiers, Decision Trees, etc., should receive more attention in the area of protein function prediction.

4.4 Related Work on the Machine Learning Approach Applied to Biology of Ageing Research

There exist few works applying the machine learning approach to ageing-related
protein and gene function prediction. The use of classification meth-
ods for predicting the functions of ageing-related proteins and genes has been inves-
tigated by the Bioinformatics community only in the last few years, so there is a
broad space for research in this area. The relevant articles in this research topic are
briefly reviewed as follows.
Freitas et al. (2011) [21] addressed the classification of DNA repair genes into
ageing-related or non-ageing related by applying conventional data mining tech-
niques on datasets which consisted of ageing-related protein/gene data and several
types of features. The experiments revealed that protein-protein interaction informa-
tion, which was obtained from the HPRD (Human Protein Reference Database) [50],
is helpful for prediction. Other predictor features, such as biological process Gene
Ontology (GO) terms, evolutionary gene change rate, and types of DNA repair path-
way were used for the prediction task. After comparing the results of two different
classification algorithms, Naïve Bayes outperformed J48 (a Decision Tree algorithm)
in terms of predictive accuracy. However, with the help of the J48 algorithm, some interesting and interpretable IF-THEN rules were found, which can be used to classify a DNA repair gene as ageing-related or non-ageing-related. Similarly,
Fang et al. (2013) [17] addressed the classification of ageing-related genes into DNA
repair or non-DNA repair genes. Both studies used GO terms as features, in addition
to other types of features. GO terms are particularly relevant for this book, since they
are the type of feature to which the hierarchical feature selection methods discussed
in this book were applied. Hence, GO terms will be discussed separately in the next
section.
Li et al. (2010) [44] classified C. elegans genes into longevity and non-longevity
genes by adopting a support vector machine (SVM). They firstly created a functional
network by adopting information about gene sequences, genetic interactions, pheno-
types, physical interactions and predicted interactions from wormnet [41]. Then they
derived graph features from the functional network, such as a node’s degree, longevity
neighbour ratio, etc. Huang et al. (2012) [29] proposed a K-Nearest Neighbour-based
method using the information about the effect of a gene’s deletion on lifespan to pre-
dict whether the deletion of a specific gene will affect the organism’s longevity. The
three effect classes were: no effect on lifespan, increased or decreased lifespan. They
adopted network features, biochemical and physicochemical features, and functional
features obtained from the deletion network, which was constructed by mapping the
information about gene deletion and protein-protein interaction data (obtained from
the STRING database [32]). In addition, Fernandes et al. (2016) [18] discovered
links between ageing and age-related diseases with the help of hierarchical feature
selection methods. Most recently, Fabris et al. (2017) reviewed the published research on applying supervised machine learning methods to ageing research.
These works regarding ageing-related gene classification/prediction shed light
on ageing-related knowledge discovery based on machine learning or data mining
approaches. However, given the small number of works in this research topic, there
is still much space for further research, not only in terms of optimising the predictive
accuracy, but also finding new clues that help to solve or reduce the mystery of
ageing, by discovering knowledge that can be interpreted by biologists.

4.5 Biological Databases Relevant to This Book

4.5.1 The Gene Ontology

The Gene Ontology (GO) project aims to provide dynamic, structured, unified and
controlled vocabularies for the annotation of genes [56]. To minimise the incon-
sistent annotations of individual genes between different biological databases, it is
required that a centralised public resource provides universal access to the ontologies,
annotation datasets and software tools. In addition, an ontology can facilitate com-
munication during research cooperation and improve the interoperability between
different systems. The initial members/contributors of the Gene Ontology Consortium were FlyBase, the Saccharomyces Genome Database and the Mouse Genome Informatics project, whereas the number of member databases has now risen to around 36.
The information resources of GO consist of documentation-supported links between database objects and GO terms, with experimental evidence from the published literature for each piece of source information, in order to provide high-quality GO annotations. In addition, the GO annotation standard specifies that GO terms should not be species-specific.
There are three categories of GO terms, each implemented as a separate ontology:
biological process, molecular function, and cellular component [56]. The biological
process represents a biological objective to which a gene product contributes, such
as regulation of DNA recombination, regulation of mitotic recombination, etc. The
process might be accomplished by one or more assemblies of functions. Note that
the meaning of a biological process is not necessarily consistent with the meaning of
a biological pathway. The molecular function ontology represents the biochemical
level of gene functions, regardless of the location or when that function occurs, such
as lactase activity. The cellular component refers to a location where the gene product
is active, such as ribosome, nuclear membrane, etc.
Fig. 4.1 A visualised Gene Ontology directed acyclic graph starting from the root term GO:0008150

In terms of structure of the GO information, there are hierarchical relationships
between GO terms. The hierarchical relationships are composed mainly by “is-a”
relationships, which is the type of hierarchical relationship considered in this book.
That is, the process, function or location represented by a GO term is a specific
instance of the process, function or location represented by its parent GO term(s).
Hence, these hierarchical relationships are effectively generalisation-specialisation
relationships. Examples of such hierarchical relationships are shown in the example
graph with GO:0008150 (biological process) as the root term, shown in Fig. 4.1, where GO:0040007 (growth), GO:0032502 (developmental process) and GO:0065007 (biological regulation) are all direct children of GO:0008150 (biological process), and GO:0050789 (regulation of biological process) is a child of GO:0065007 and the parent of GO:0048518. These hierarchical relationships can be used for building a Directed Acyclic Graph (DAG) composed of GO terms.
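To make this DAG representation concrete, the sketch below encodes the "is-a" relationships named above as a dictionary of direct parents and retrieves all ancestors of a term; the dictionary-based encoding is an illustrative choice only, not the representation used in later chapters.

```python
# Direct "is-a" parents for the GO terms mentioned in Fig. 4.1.
is_a_parents = {
    "GO:0040007": {"GO:0008150"},  # growth is-a biological process
    "GO:0032502": {"GO:0008150"},  # developmental process is-a biological process
    "GO:0065007": {"GO:0008150"},  # biological regulation is-a biological process
    "GO:0050789": {"GO:0065007"},  # regulation of biological process
    "GO:0048518": {"GO:0050789"},  # child of GO:0050789, as in Fig. 4.1
}

def all_ancestors(term, parents):
    """Return every ancestor of a GO term by following its is-a edges upwards."""
    ancestors, frontier = set(), list(parents.get(term, ()))
    while frontier:
        current = frontier.pop()
        if current not in ancestors:
            ancestors.add(current)
            frontier.extend(parents.get(current, ()))
    return ancestors

print(all_ancestors("GO:0048518", is_a_parents))
# -> {'GO:0050789', 'GO:0065007', 'GO:0008150'} (in some order)
```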

4.5.2 Human Ageing Genomic Resources (HAGR)

The HAGR is a high-quality biological database that specifically focuses on the
biology or genetics of ageing. The HAGR database consists of four main groups of
data, namely GenAge, AnAge, GenDR and DAA (Digital Ageing Atlas).
Firstly, GenAge is a database of ageing/longevity-associated genes for humans
and model organisms, such as mice, worms, fruit flies and yeast. GenAge includes
high-quality curated information of genes that have been shown to have noticeable
effect on changes in the ageing phenotype and/or longevity [16]. GenAge consists
of three sections, i.e. (1) a set of ageing-associated genes for human, (2) a set of
longevity-associated genes for model organisms, (3) a set of mammalian genes whose
expression is commonly altered during ageing in multiple tissues.
Secondly, AnAge is a database that focuses on animal ageing and longevity. The
reason for building this database is to provide sufficient data for conducting comparative analyses of ageing mechanisms between different species. AnAge contains longevity-related data about 4,205 species, comprising mammals, birds, reptiles, amphibians and fishes, in Build 12 [55]. The data
included in AnAge is of high quality and confidence, based on data from authorita-
tive sources and checked by curators.
Thirdly, HAGR includes GenDR, which is a database designed for the analysis of
how caloric restriction extends lifespan, consisting of data about dietary restriction-
essential genes, which are defined as those genes that interfere with dietary restriction
lifespan extension after being genetically modified, but do not have impact on the
lifespan of animals under the condition of an ad libitum diet [16]. In addition, as com-
plementary information, GenDR includes a set of mammalian genes differentially
expressed under dietary restriction condition.
In addition, DAA is a centralised collection of human ageing-related changes that
integrates data from various biological levels, e.g. molecular, cellular, physiologi-
cal, etc. [14]. DAA provides a system-level and comprehensive platform for ageing
research, focusing on ageing-associated changes.
Overall, GenAge offers a Bioinformatics platform where ageing-associated genes
can be found through a user-friendly interface, and is a way of integrating information
about ageing-related genes, for the purpose of functional genomics and systems
biology analysis. Also, as an overall picture of ageing-associated genes, GenAge
provides sufficient data for conducting data mining research, which will be discussed
in a later section.

4.5.3 Dataset Creation Using Gene Ontology Terms and HAGR Genes

Twenty-eight datasets were created for four model organisms by integrating data from the Human Ageing Genomic Resources (HAGR) GenAge database (Build 17) [16] and the Gene Ontology (GO) database (version: 2014-06-13) [56]. For each model organism, 7 datasets about the effect of genes on the organism's longevity were generated, using all possible subsets of the three GO term types, i.e. one dataset for each single type of GO term (BP, MF, CC), one dataset for each pair of GO term types (BP and MF, BP and CC, MF and CC), and one dataset with all three GO term types (BP, MF and CC).

Fig. 4.2 Structure of the created dataset

HAGR provides longevity-related gene data for four model organisms,
i.e. Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Saccha-
romyces cerevisiae. To begin with, the data from the HAGR database contains, as one
of the identifiers for each gene, the EntrezID, which is adopted as the unique key for
mapping from the HAGR data to the gene2go file [25], which contains information
about GO terms associated with each gene. Then the integrated dataset created by
retrieving data from the HAGR database and the gene2go file has been merged with
the data from the GO database for the purpose of obtaining the relationship between
each GO term and its ancestor GO terms. In addition, an iterative method was implemented in order to collect all ancestor GO terms for each gene in the dataset; i.e. for each GO term associated with a gene, we get that GO term's parent GO term(s), then the parent(s) of that parent GO term(s), etc., until the root GO term is reached (note that the root GO terms, i.e. GO:0008150 (biological process), GO:0003674 (molecular function) and GO:0005575 (cellular component), are not included in the created datasets, due to their uselessness for prediction). The structure of the newly created
dataset is represented as shown in Fig. 4.2, where the feature value “1” means the
occurrence of a GO term with respect to each gene. In the class variable, the values
of “Pro” and “Anti” mean “pro-longevity” and “anti-longevity”. Pro-longevity genes
are those whose decreased expression (due to knock-out, mutations or RNA interfer-
ence) reduces lifespan and/or whose overexpression extends lifespan; accordingly,
anti-longevity genes are those whose decreased expression extends lifespan and/or
whose over-expression decreases it [55].
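A minimal sketch of this ancestor-propagation and binary-encoding step is shown below; the gene annotations and parent links are toy placeholders standing in for the gene2go file and the GO database described above, and the function names are illustrative only.

```python
ROOT_TERMS = {"GO:0008150", "GO:0003674", "GO:0005575"}  # excluded: useless for prediction

def expand_with_ancestors(terms, parents):
    """Return a gene's GO terms plus all of their ancestors (root terms excluded)."""
    expanded, frontier = set(terms), list(terms)
    while frontier:
        for parent in parents.get(frontier.pop(), ()):
            if parent not in expanded:
                expanded.add(parent)
                frontier.append(parent)
    return expanded - ROOT_TERMS

# Toy inputs (hypothetical): direct annotations per gene and direct parents per term.
gene_annotations = {"gene_A": {"GO:0048518"}, "gene_B": {"GO:0040007"}}
parents = {"GO:0048518": {"GO:0050789"}, "GO:0050789": {"GO:0065007"},
           "GO:0065007": {"GO:0008150"}, "GO:0040007": {"GO:0008150"}}

expanded = {g: expand_with_ancestors(t, parents) for g, t in gene_annotations.items()}
features = sorted(set().union(*expanded.values()))
# Binary feature rows, as in Fig. 4.2: 1 if the GO term occurs for the gene, else 0.
rows = {g: [1 if f in terms else 0 for f in features] for g, terms in expanded.items()}
print(features)  # ['GO:0040007', 'GO:0048518', 'GO:0050789', 'GO:0065007']
print(rows)      # {'gene_A': [0, 1, 1, 1], 'gene_B': [1, 0, 0, 0]}
```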
The GO terms that have only one associated gene would be useless for building a classification model, because they are too specifically related to an individual gene, and a model including these GO terms would be confronted with the over-fitting problem. However, in terms of the biological information contained in GO terms, those GO terms associated with only a few genes might be valuable for discovering knowledge, since they might represent specific biological information. Therefore, as suggested by [60], the GO term frequency threshold (the minimum number of genes that must be associated with a GO term for that term to be kept as a feature) was set to 3, which retains more biological information than higher thresholds while still leading to high predictive accuracy.
Table 4.1 Main characteristics of the created datasets with GO term frequency threshold = 3
Caenorhabditis elegans
Property BP MF CC BP+MF BP+CC MF+CC BP+MF+CC
No. of features 830 218 143 1048 973 361 1191
No. of edges 1437 259 217 1696 1654 476 1913
No. of instances 528 279 254 553 557 432 572
No. (%) of pro-longevity instances 209 121 98 213 213 170 215
39.6% 43.4% 38.6% 38.5% 38.2% 39.4% 37.6%
No. (%) of anti-longevity instances 319 158 156 340 344 262 357
60.4% 56.6% 61.4% 61.5% 61.8% 60.6% 62.4%
Degree of class imbalance 0.345 0.234 0.372 0.374 0.381 0.351 0.398
Drosophila melanogaster
Property BP MF CC BP+MF BP+CC MF+CC BP+MF+CC
No. of features 698 130 75 828 773 205 903
No. of edges 1190 151 101 1341 1291 252 1442
No. of instances 127 102 90 130 128 123 130
No. (%) of pro-longevity instances 91 68 62 92 91 85 92
71.7% 66.7% 68.9% 70.8% 71.1% 69.1% 70.8%
No. (%) of anti-longevity instances 36 34 28 38 37 38 38
28.3% 33.3% 31.1% 29.2% 28.9% 30.9% 29.2%
Degree of class imbalance 0.604 0.500 0.548 0.587 0.593 0.553 0.587
Mus musculus
Property BP MF CC BP+MF BP+CC MF+CC BP+MF+CC
No. of features 1039 182 117 1221 1156 299 1338
No. of edges 1836 205 160 2041 1996 365 2201
No. of instances 102 98 100 102 102 102 102
No. (%) of pro-longevity instances 68 65 66 68 68 68 68
66.7% 66.3% 66.0% 66.7% 66.7% 66.7% 66.7%
No. (%) of anti-longevity instances 34 33 34 34 34 34 34
33.3% 33.7% 34.0% 33.3% 33.3% 33.3% 33.3%
Degree of class imbalance 0.500 0.492 0.485 0.500 0.500 0.500 0.500
Saccharomyces cerevisiae
Property BP MF CC BP+MF BP+CC MF+CC BP+MF+CC
No. of features 679 175 107 854 786 282 961
No. of edges 1223 209 168 1432 1391 377 1600
No. of instances 215 157 147 222 234 226 238
No. (%) of pro-longevity instances 30 26 24 30 30 29 30
14.0% 16.6% 16.3% 13.5% 12.8% 12.8% 12.6%
No. (%) of anti-longevity instances 185 131 123 192 204 197 208
86.0% 83.4% 83.7% 86.5% 87.2% 87.2% 87.4%
Degree of class imbalance 0.838 0.802 0.805 0.844 0.853 0.853 0.856
The detailed information about the created datasets is shown in Table 4.1, where
the numbers of features, edges, instances and the degree of class imbalance are
reported. The degree of class imbalance (D) is calculated by Eq. 4.1, where D equals the complement of the ratio of the number of instances belonging to the minority class (Inst(Minor)) over the number of instances belonging to the majority class (Inst(Major)).

$$D = 1 - \frac{Inst(Minor)}{Inst(Major)} \qquad (4.1)$$
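For instance, for the Caenorhabditis elegans BP dataset in Table 4.1, D = 1 − 209/319 ≈ 0.345, matching the value reported in the table. A one-function sketch of this calculation:

```python
def class_imbalance(n_pro, n_anti):
    """Degree of class imbalance (Eq. 4.1): 1 minus the minority/majority instance ratio."""
    minority, majority = min(n_pro, n_anti), max(n_pro, n_anti)
    return 1 - minority / majority

# C. elegans BP dataset in Table 4.1: 209 pro-longevity vs 319 anti-longevity instances.
print(round(class_imbalance(209, 319), 3))  # 0.345
```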

References

1. Al-Shahib A, Breitling R, Gilbert D (2005) Feature selection and the class imbalance problem
in predicting protein function from sequence. Appl Bioinform 4(3):195–203
2. Al-Shahib A, Breitling R, Gilbert D (2005) Franksum: new feature selection method for protein
function prediction. Int J Neural Syst 15(4):259–275
3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search
tool. J Mol Biol 215(3):403–410
4. Austad SN (1993) Retarded senescence in an insular population of virginia opossums (Didelphis
virginiana). J Zool 229(4):695–708
5. Bacardit J, Widera P, Márquez-Chamorro A, Divina F, Aguilar-Ruiz JS, Krasnogor N (2012)
Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple
predicted structural features. Bioinformatics 28(19):2441–2448
6. Barutcuoglu Z, Schapire RE, Troyanskaya OG (2006) Hierarchical multi-label prediction of
gene function. Bioinformatics 22(7):830–836
7. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW
(2013) Genbank. Nucleic Acids Res 41:D36–D42
8. Bhardwaj N, Langlois RE, Zhao G, Lu H (2005) Kernel-based machine learning protocol for
predicting DNA-binding proteins. Nucleic Acids Res 33(20):6486–6493
9. Bolsover SR, Hyams JS, Jones S, Shephard EA, White HA (1997) From genes to cells. Wiley-
Liss, New York
10. Borgwardt KM, Ong CS, Schönauer S, Vishwanathan SVN, Smola AJ, Kriegel HP (2005)
Protein function prediction via graph kernels. Bioinformatics 21(suppl 1):i47–i56
11. Brazma A, Parkinson H, Schlitt T, Shojatalab M (2012) A quick introduction to elements
of biology-cells, molecules, genes, functional genomics, microarrays. http://www.ebi.ac.uk/
microarray/biology-intro.html. Accessed 11 Nov 2012
12. Campisi J, di Fagagna FDA (2007) Cellular senescence: when bad things happen to good cells.
Nat Rev Mol Cell Biol 8(9):729–740
13. Cozzetto D, Minneci F, Currant H, Jones D (2015) FFPred 3: feature-based function prediction
for all gene ontology domains. Sci Rep 6:31865
14. Craig T, Smelick C, Tacutu R, Wuttke D, Wood SH, Stanley H, Janssens G, Savitskaya E,
Moskalev A, Arking R, de Magalhães JP (2015) The digital ageing atlas: integrating the diver-
sity of age-related changes into a unified resource. Nucleic Acids Res 43:D873–D878
15. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Stein L (2011) Reactome: a
database of reactions, pathways and biological processes. Nucleic Acids Res 39:D691–D697
16. de Magalhães JP, Budovsky A, Lehmann G, Costa J, Li Y, Fraifeld V, Church GM (2009) The
human ageing genomic resources: online databases and tools for biogerontologists. Aging Cell
8(1):65–72
17. Fang Y, Wang X, Michaelis EK, Fang J (2013) Classifying aging genes into DNA repair or
non-DNA repair-related categories. In: Huang DS, Jo KH, Zhou YQ, Han K (eds) Lecture
notes in intelligent computing theories and technology. Springer, Berlin, pp 20–29
18. Fernandes M, Wan C, Tacutu R, Barardo D, Rajput A, Wang J, Thoppil H, Yang C, Freitas AA,
de Magalhães JP (2016) Systematic analysis of the gerontome reveals links between aging and
age-related diseases. Hum Mol Genet 25(21):4804–4818
19. Finkel T, Holbrook NJ (2000) Oxidants, oxidative stress and the biology of ageing. Nature
408:239–247
20. Finkel T, Serrano M, Blasco MA (2007) The common biology of cancer and ageing. Nature
448(7155):767–774
21. Freitas AA, Vasieva O, de Magalhães JP (2011) A data mining approach for classifying DNA
repair genes into ageing-related or non-ageing-related. BMC Genomics 12(27):1–11
22. Freitas AA, Wieser DC, Apweiler R (2010) On the importance of comprehensible classification
models for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform 7(1):172–
182
23. Friedberg I (2006) Automated protein function prediction-the genomic challenge. Brief Bioin-
form 7(3):225–242
24. Gavrilov LA, Gavrilova NS (2002) Evolutionary theories of aging and longevity. Sci World J
2:339–356
25. Gene2go file (2012). ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz. Accessed 13 Dec
2012
26. Glaab E, Bacardit J, Garibaldi JM, Krasnogor N (2012) Using rule-based machine learning for
candidate disease gene prioritization and sample classification of cancer gene expression data.
PLoS One 7:e39932
27. Guarente L, Kenyon C (2000) Genetic pathways that regulate ageing in model organisms.
Nature 408(6809):255–262
28. Heilbronn LK, Ravussin E (2003) Calorie restriction and aging: review of the literature and
implications for studies in humans. Am J Clin Nutr 78(3):361–369
29. Huang T, Zhang J, Xu ZP, Hu LL, Chen L, Shao JL, Zhang L, Kong XY, Cai YD, Chou KC
(2012) Deciphering the effects of gene deletion on yeast longevity using network and machine
learning approaches. Biochimie 94(4):1017–1025
30. Hurwitz N, Pellegrini-Calace M, Jones DT (2006) Towards genome-scale structure prediction
for transmembrane proteins. Philos Trans R Soc Lond B: Biol Sci 361(1467):465–475
31. National Human Genome Research Institute (2012) Biological pathways. http://www.genome.gov/27530687. Accessed
19 June 2013
32. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth
A, Simonovic M, Bork P, von Mering C (2009) String 8-a global view on proteins and their
functional interactions in 630 organisms. Nucleic Acids Res 37(suppl 1):D412–D416
33. Jones DT (2000) A practical guide to protein structure prediction. In: Protein structure prediction.
Humana Press, Totowa
34. Jones DT, Buchan DWA, Cozzetto D, Pontil M (2012) PSICOV: precise structural contact
prediction using sparse inverse covariance estimation on large multiple sequence alignments.
Bioinformatics 28(2):184–190
35. Kaletsky R, Murphy CT (2010) The role of insulin/igf-like signaling in C. elegans longevity
and aging. Dis Model Mech 3(7–8):415–419
36. Kenyon CJ (2010) The genetics of ageing. Nature 464(7288):504–512
37. Kirkwood TBL (2005) Understanding the odd science of aging. Cell 120(4):437–447
38. Kirkwood TBL, Austad SN (2000) Why do we age? Nature 408(6809):233–238
39. Kosciolek T, Jones DT (2014) De Novo structure prediction of globular proteins aided by
sequence variation-derived contacts. PLoS One 9(3):e92197

40. Laskowski RA, Watson JD, Thornton JM (2005) Protein function prediction using local 3D
templates. J Mol Biol 351(3):614–626
41. Lee I, Lehner B, Crombie C, Wong W, Fraser AG, Marcotte EM (2008) A single gene network
accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat
Genet 40:181–188
42. Larissa LF, de Oliveira Rodrigues TA, Zaratey LE, Nobre CN (2014) A genetic algorithm
for the selection of features used in the prediction of protein function. In: Proceedings of
2014 IEEE international conference on bioinformatics and bioengineering (BIBE-2014), Boca
Raton, USA, pp 168–174
43. Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC (2012) Prediction of protein domain with
mRMR feature selection and analysis. PLoS One 7(6):e39308
44. Li YH, Dong MQ, Guo Z (2010) Systematic analysis and prediction of longevity genes in
caenorhabditis elegans. Mech Ageing Dev 131(11–12):700–709
45. Lobley A, Nugent T, Orengo C, Jones D (2008) FFPred: an integrated feature based function
prediction server for vertebrate proteomes. Nucleic Acids Res 1(36):W297–W302
46. Masoro EJ (2005) Overview of caloric restriction and ageing. Mech Ageing Dev 126(9):913–
922
47. McCay CM, Crowell MF, Maynard LA (1935) The effect of retarded growth upon the length
of life span and upon the ultimate body size. J Nutr 10(1):63–79
48. Minneci F, Piovesan D, Cozzetto D, Jones DT (2013) FFPred 2.0: improved homology-
independent prediction of gene ontology terms for eukaryotic protein sequences. PLoS One
8(5):e63754
49. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of
max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell
27(8):1226–1238
50. Prasad TSK (2009) Human protein reference database - 2009 update. Nucleic Acids Res
37(suppl 1):D767–D772
51. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Schaefer C (2013) A
large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–
227
52. Reece RJ (2004) Analysis of genes and genomes. Wiley, Chichester
53. Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Mol Syst
Biol 3(1):88
54. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) Biogrid: a general
repository for interaction datasets. Nucleic Acids Res 34:D535–D539
55. Tacutu R, Craig T, Budovsky A, Wuttke D, Lehmann G, Taranukha D, Costa J, Fraifeld VE,
de Magalhães JP (2013) Human ageing genomic resources: integrated databases and tools for
the biology and genetics of ageing. Nucleic Acids Res 41(D1):D1027–D1033
56. The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nat
Genet 25(1):25–29
57. Turner PC, McLennan AG, Bates AD, White MRH (2000) Molecular biology, 2nd edn. BIOS
Scientific Publishers Ltd, Oxford
58. Tyner SD, Venkatachalam S, Choi J, Jones S, Ghebranious N, Igelmann H, Lu X, Soron G,
Cooper B, Brayton C, Park SH, Thompson T, Karsenty G, Bradley A, Donehower LA (2002)
P53 mutant mice that display early ageing-associated phenotypes. Nature 415(6867):45–53
59. Vijg J, Campisi J (2008) Puzzles, promises and a cure for ageing. Nature 454(7208):1065–1071
60. Wan C, Freitas AA, de Magalhães JP (2015) Predicting the pro-longevity or anti-longevity
effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM
Trans Comput Biol Bioinform 12(2):262–275
61. Wan C, Lees JG, Minneci F, Orengo C, Jones D (2017) Analysis of temporal transcrip-
tion expression profiles reveal links between protein function and developmental stages of
Drosophila melanogaster. PLOS Comput Biol 13(10):e1005791
62. Wieser D, Papatheodorou I, Ziehm M, Thornton JM (2011) Computational biology for ageing.
Philos Trans R Soc B: Biol Sci 366(1561):51–63

63. Ye J, Coulouris G, Zaretskaya I, Cutcutache I, Rozen S, Madden TL (2012) Primer-BLAST: a
tool to design target-specific primers for polymerase chain reaction. BMC Bioinform 13(1):134
64. Ye J, Ma N, Madden TL, Ostell JM (2013) IgBLAST: an immunoglobulin variable domain
sequence analysis tool. Nucleic Acids Res W34–40
65. Yousef M, Jung S, Kossenkov AV, Showe LC, Showe MK (2007) Naive Bayes for microRNA
target predictions-machine learning for microrna targets. Bioinformatics 23(22):2987–2992
Chapter 5
Lazy Hierarchical Feature Selection

This chapter describes three different lazy hierarchical feature selection methods,
namely Select Hierarchical Information-Preserving Features (HIP) [5, 6], Select
Most Relevant Features (MR) [5, 6] and the hybrid Select Hierarchical Information-
Preserving and Most Relevant Features (HIP—MR) [3, 6]. These three hierarchical
feature selection methods are categorised as filter methods (discussed in Chap. 2),
i.e. feature selection is conducted before the classifier’s learning process.

5.1 Hierarchical Redundancy in Lazy Learning Paradigm

These three lazy hierarchical feature selection methods aim to eliminate or alleviate
one particular type of hierarchical redundancy, which is a key concept underlying
them. In the lazy learning paradigm, hierarchical redundancy is defined as the scenario
where two or more features are related via a specialisation-generalisation relationship
and have the same value (i.e. either “0” or “1”) in the current instance. In the example
shown in Fig. 5.1, the
features can be grouped into two sets, i.e. a set of features having value “1” (the left
four features: E, F, G, C), and another set of features having value “0” (the right
four features: H, A, B, D). In terms of features E, F, G, C, feature E is the parent
of F, which is the parent of G. Feature G has the child C. It means that the value
“1” of C logically implies the value “1” of G, whose value implies the value of F,
and the value of F implies the value of E. Therefore, it can be noted that feature E
is hierarchically redundant with respect to F, G and C; feature F is hierarchically
redundant with respect to G and C; and feature G is hierarchically redundant with
respect to C.
Analogously to the set of features having value “1”, the other set of features, having
value “0”, contains a similar type of hierarchical redundancy. In detail, the
value “0” of feature H logically implies the value “0” of A, whose value implies


Fig. 5.1 Example of a set of hierarchically redundant features (features E, F, G and C have value “1”; features H, A, B and D have value “0”)

Fig. 5.2 Example of a set of hierarchically redundant features structured as a DAG (features C, E, F, G, I, L, M, O and Q have value “1”; features A, B, D, H, J, K, N, P and R have value “0”)

the value of B, and the value of B implies the value of D. Therefore, it can be
noted that feature D is hierarchically redundant with respect to B, A and H; feature
B is hierarchically redundant with respect to feature A and H; and feature A is
hierarchically redundant with respect to H.
This type of hierarchical redundancy also arises in a more complicated scenario,
i.e. a given directed acyclic graph (DAG) structure of features. As shown in Fig. 5.2,
the DAG is composed of a set of different paths, where each individual path contains
a set of hierarchically structured features. Note that some features are shared by
more than one path, e.g. feature F is shared by 6 paths, feature I is shared by 5 paths,
feature A is shared by 3 paths, etc. This scenario of hierarchically structured features,
with hierarchical redundancy as defined earlier, is the core problem addressed in this
chapter, and the feature selection methods discussed later remove or at least reduce
the hierarchical redundancy among features.
Note that this type of hierarchical redundancy scenario fits well with the lazy
learning paradigm, i.e. the hierarchical redundancy occurs in the context of the values
of features in an individual instance. For instance, Fig. 5.3 is an example testing
dataset matrix, where each individual row represents one testing instance consisting
of the value of the class attribute (in the last column) and the values of a set of
features (in all other columns). The set of features in this example testing dataset
matrix retains the hierarchical dependencies associated with the feature DAG shown
in Fig. 5.2. For example, in the first row, the value of feature C equals 1, so the
values of features I, F, M, L, Q and O are all equal to 1; analogously, the value
of feature A equals 0, so the values of features D, H, N, P and R are all equal
to 0. Therefore, all lazy hierarchical feature selection methods and the classifiers
discussed in this chapter are based on the lazy learning scenario.

Fig. 5.3 Example of a testing dataset matrix containing hierarchical dependency information
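To make this definition concrete, the following minimal Python sketch checks whether two features are hierarchically redundant in a given instance; the parent links reproduce only the two chains of Fig. 5.1 (parent to child: E–F–G–C and H–A–B–D), not the full edge set of Fig. 5.2.

parents = {"F": {"E"}, "G": {"F"}, "C": {"G"},
           "A": {"H"}, "B": {"A"}, "D": {"B"},
           "E": set(), "H": set()}

def ancestors(feature):
    # All ancestors of a feature, following parent links transitively.
    found, stack = set(), list(parents[feature])
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents[p])
    return found

def hierarchically_redundant(f1, f2, instance):
    # True if f1 and f2 lie on the same ancestor-descendant path and share a value.
    related = f1 in ancestors(f2) or f2 in ancestors(f1)
    return related and instance[f1] == instance[f2]

instance = {"E": 1, "F": 1, "G": 1, "C": 1, "H": 0, "A": 0, "B": 0, "D": 0}
print(hierarchically_redundant("E", "C", instance))  # True:  E is an ancestor of C and both are "1"
print(hierarchically_redundant("E", "H", instance))  # False: E and H are not hierarchically related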

5.2 Select Hierarchical Information-Preserving Features (HIP)

The Select Hierarchical Information-Preserving Features (HIP) method focuses only
on eliminating the hierarchical redundancy in the set of selected features, ignoring
the relevance values of individual features. Recall that two features are hierarchically
redundant, in a given instance, if they have the same value in that instance and are
located in the same path from a root to a leaf node in the feature graph (for more
details on hierarchical redundancy, see Sect. 5.1). The motivation for eliminating
the hierarchical redundancy among selected features is that some types of classifi-
cation algorithms, like Naïve Bayes, are particularly sensitive to redundancy among
features, as discussed earlier.
The pseudocode of the HIP method is shown as Algorithm 1, where TrainSet
and TestSet denote the training dataset and testing dataset, and they consist of
all input features; A(xi ) and D(xi ) denote the set of ancestors and descendants
(respectively) of the feature xi ; Status(xi ) denotes the selection status (“Selected”
or “Removed”) of the feature xi ; Instt denotes the current instance being classified
in TestSet; Value(xi,t ) denotes the value of feature xi (“1” or “0”) in that instance;
Ai j denotes the jth ancestor of the feature xi ; Di j denotes the jth descendant of the
feature xi ; TrainSet_SF denotes the shorter version of the training dataset that contains
only the features whose status is “Selected”; and Inst_SFt denotes the shorter version of
instance t that consists only of features whose status is “Selected”.
In the first part of Algorithm 1 (lines: 1–8), the algorithm first constructs the DAG of features,
finds all ancestors and descendants of each feature in the DAG, and initialises the
status of each feature as “Selected”. During the execution of the algorithm, some
features will have their status set to “Removed”, whilst other features will keep
their status set to “Selected” throughout the algorithm’s execution. When the algorithm
terminates, the set of features with status “Selected” is returned as the set of selected
features.

Algorithm 1 Select Hierarchical Information Preserving Features (HIP)


1: Initialize DAG with all features X in Dataset;
2: Initialize TrainSet;
3: Initialize TestSet;
4: for each feature xi ∈ X do
5: Initialize A(xi ) in DAG;
6: Initialize D(xi ) in DAG;
7: Initialize Status(xi ) ← “Selected”;
8: end for
9: for each Inst t ∈ TestSet do
10: for each feature xi ∈ X do
11: if V alue(xi,t ) = 1 then
12: for each ancestor Ai j ∈ A(xi ) do
13: Status(Ai j ) ← “Removed”;
14: end for
15: else
16: for each descendant Di j ∈ D(xi ) do
17: Status(Di j ) ← “Removed”;
18: end for
19: end if
20: end for
21: Re-create TrainSet_SF with all features X  where Status(X  ) = “Selected”;
22: Re-create Inst_SFt with all features X  where Status(X  ) = “Selected”;
23: Classifier(TrainSet_SF, Inst_SFt );
24: for each feature xi ∈ X do
25: Re-assign Status(xi ) ← “Selected”;
26: end for
27: end for

In the second part of Algorithm 1 (lines: 9–27), the algorithm performs feature selection for
each testing instance in turn, using a lazy learning approach. For each instance, for
each feature xi , the algorithm checks its value in that instance. If xi has value “1”, all
its ancestors in the DAG have their status set to “Removed” – since the value “1” of
each ancestor is redundant, being logically implied by the value “1” of xi . If xi has
value “0”, all its descendants have their status set to “Removed” – since the value
“0” of each descendant is redundant, being logically implied by the value “0” of xi .
To show how the second part of Algorithm 1 works, we use the same hypothetical
DAG and testing instance example discussed in Sect. 5.1, shown in Fig. 5.4a, which
consists of just 18 features, denoted by the letters A–R. The value (“1” or “0”)
for each feature is shown on the right of the node representing that feature. Note that
the HIP feature selection method uses only the hierarchical dependency information

Fig. 5.4 Example of select hierarchical information-preserving features (panels a–e)

about the features and their corresponding values contained in the testing dataset
matrix.
With respect to the example DAG in Fig. 5.4a, lines 10–20 of Algorithm 1 work
as follows. When feature C is processed, the selection status of its ancestor features
I, F, M, L, Q and O will be assigned as “Removed” (lines: 12–14), since the value
“1” of C logically implies the value “1” of all of C’s ancestors. Analogously, when
feature A is processed, the selection status of its descendant features D, H, N, P and
R will be assigned as “Removed” (lines: 16–18), since the value “0” of A logically

implies the value “0” of all of A’s descendants. When feature G (with value “1”) is
processed, its ancestor E has its status set to “Removed”. And so on, processing one
feature at a time.
Note that the status of a feature may be set to “Removed” more than once, as
happens for feature F when processing features C and I. However, once the status
of a feature is set to “Removed”, it cannot be re-set to “Selected” whilst the current
instance is being processed. Hence,
the result of Algorithm 1 does not depend on the order in which the features are
processed.
After processing all features in the example DAG, the features selected by the
loop in lines 10–20 are A, B, C, G and K. Note that these five core features (in
blue colour) contain the complete hierarchical information associated with all the
features in the DAG of Fig. 5.4b, in the sense that the observed values of these five
core features logically imply the values of all other features in that DAG.
Next, the training dataset and the current testing instance are reduced to contain only
features whose status is “Selected” (lines: 21–22). As shown in Fig. 5.4d, e, the
blue columns denote the selected features’ values in the training dataset and testing
instance, which are used by the classifier (line: 23). Finally, the status of all features
is reassigned to “Selected” (lines: 24–26), in preparation for feature selection for
the next testing instance.
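For illustration, the per-instance selection loop of Algorithm 1 (lines 10–20) can be sketched in Python as follows; representing the DAG by precomputed ancestor and descendant sets is an assumption of the sketch, and the small demo uses only the two chains of Fig. 5.1 rather than the full 18-feature DAG of Fig. 5.4.

def hip_select(instance, ancestors, descendants):
    # Lazy HIP selection for one instance (Algorithm 1, lines 10-20).
    # instance: feature -> 0/1 value; ancestors/descendants: feature -> set of features.
    removed = set()
    for feature, value in instance.items():
        if value == 1:
            removed |= ancestors[feature]    # the ancestors' value "1" is implied by this "1"
        else:
            removed |= descendants[feature]  # the descendants' value "0" is implied by this "0"
    return set(instance) - removed           # features whose status remains "Selected"

# Demo on the two chains of Fig. 5.1 (parent to child: E-F-G-C and H-A-B-D).
anc = {"E": set(), "F": {"E"}, "G": {"E", "F"}, "C": {"E", "F", "G"},
       "H": set(), "A": {"H"}, "B": {"H", "A"}, "D": {"H", "A", "B"}}
dec = {f: {g for g in anc if f in anc[g]} for f in anc}
inst = {"E": 1, "F": 1, "G": 1, "C": 1, "H": 0, "A": 0, "B": 0, "D": 0}
print(sorted(hip_select(inst, anc, dec)))  # ['C', 'H']

On this small example only C and H remain selected, since their observed values logically imply the values of all other features in their respective chains.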

5.3 Select Most Relevant Features (MR)

The Select Most Relevant Features (MR) method performs feature selection considering
both the relevance values of individual features and the hierarchical redundancy
among features. Like the HIP method, for each feature xi in the current instance
being classified, MR first identifies the set of features whose values are implied by
the value of xi in that instance – i.e. either the ancestors of xi, if xi has value “1”,
or the descendants of xi, if xi has value “0” – for each path from the current node to
a root or a leaf node of the feature DAG, respectively. Next, MR compares the relevance
of xi and of all features in each identified path. Among all those features (including xi),
MR marks all features for removal, except the most relevant one. If there is more than one
feature with the same maximum relevance value in a given path, as a tie-breaking
criterion, MR retains the most specific (deepest) feature among the set of features
with value “1”, or the most generic (shallowest) feature among the set of features
with value “0” – since those features’ values logically imply the largest number of
other features’ values among the set of features being compared.
As a part of our feature selection method, we use Eq. 5.1 to measure the relevance
(R), or predictive power of a binary feature xi taking value xi1 or xi2 ,


R(xi) = Σ_{c=1}^{n} [P(yc | xi1) − P(yc | xi2)]²    (5.1)

where yc is the c-th class and n is the number of classes. A general form of Eq. 5.1
was originally used in [2] in the context of Nearest Neighbour algorithms, and here
it has been adjusted to be used as a feature relevance measure for feature selection
algorithms. In this work, n = 2, xi is a feature, and Eq. 5.1 is expanded to Eq. 5.2,
where the two terms being added on the right-hand side of the equation are equal, as shown
in Theorem 5.1, followed by the corresponding proof.

R(xi) = [P(y = 1 | xi = 1) − P(y = 1 | xi = 0)]² + [P(y = 0 | xi = 1) − P(y = 0 | xi = 0)]²    (5.2)

Equation 5.2 calculates the relevance of each feature as a function of the difference
in the conditional probabilities of each class given different values (“1” or “0”) of a
feature, indicating whether or not an instance is annotated with that feature.
Theorem 5.1 In Eq. 5.1,
if n = 2, so that R(xi ) =[P(y1 |xi1 ) − P(y1 |xi2 )]2 + [P(y2 |xi1 ) − P(y2 |xi2 )]2 ,
we have: [P(y1 |xi1 ) − P(y1 |xi2 )]2 = [P(y2 |xi1 ) − P(y2 |xi2 )]2 .
Proof

∵ [P(y1 |xi1 ) + P(y2 |xi1 ) = 1] ∧ [P(y1 |xi2 ) + P(y2 |xi2 ) = 1]


∴[P(y1 |xi1 ) − P(y1 |xi2 )]2 = [(1 − P(y2 |xi1 )) − (1 − P(y2 |xi2 ))]2
= [1 − P(y2 |xi1 ) − 1 + P(y2 |xi2 )]2
= [−P(y2 |xi1 ) + P(y2 |xi2 )]2
= [−(P(y2 |xi1 ) − P(y2 |xi2 ))]2
= [P(y2 |xi1 ) − P(y2 |xi2 )]2
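To make Eq. 5.2 concrete, a minimal Python sketch of the relevance measure for a binary feature and binary class is given below; the toy training set is hypothetical, and returning a probability of 0 for a feature value that never occurs in the training set is an assumption of the sketch, not part of the original definition.

def relevance(feature_values, class_labels):
    # Relevance R(xi) of a binary feature (Eq. 5.2), given parallel lists of
    # feature values and class labels over the training set.
    def p_class1_given(value):
        group = [c for f, c in zip(feature_values, class_labels) if f == value]
        return sum(group) / len(group) if group else 0.0  # assumed fallback for unseen values
    diff = p_class1_given(1) - p_class1_given(0)
    # By Theorem 5.1 the two squared terms of Eq. 5.2 are equal, so R = 2 * diff^2.
    return 2 * diff * diff

# Hypothetical toy training set with six instances.
print(relevance([1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]))  # 2 * (1 - 1/3)^2 = 0.888...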


The pseudocode of the MR method is shown as Algorithm 2, where R(xi ) denotes
the value of relevance for the ith feature; A+ (xi,k ) and D+ (xi,k ) denote the set of
features containing both the ith feature and its ancestors or descendants (respectively)
in the kth path; MRF denotes the most relevant feature among the set of features
in A+ (xi,k ) or D+ (xi,k ); Ai, j,k+ and Di, j,k+ denote the jth feature in A+ (xi,k ) and
D+ (xi,k ), respectively.
In the first part of Algorithm 2 (i.e. lines 1–9), the DAG is first constructed,
then A+ (xi,k ) and D+ (xi,k ) for each feature xi on each path k are initialised, and
the relevance (R) value of each feature is calculated. In the second part of the
algorithm (i.e. lines 10–34), the feature selection process is conducted for each
testing instance, using a lazy learning approach.
To show how the second part of Algorithm 2 works, we again use as an example the
DAG shown in Fig. 5.5a, where the relevance value for each feature is shown on the
left of each node. When feature C (with value “1”) is processed (lines: 13–18), the
features in three paths, i.e. path (a) containing features C, I, F and M; path (b)

Fig. 5.5 Example of select most relevant features (panels a–e)

containing features C, I, F and L; and path (c) containing features C, I, Q and O are
processed. In path (a), the features having maximum relevance value are C and M;
but only feature C is selected as the MRF (line: 14), since it is deeper than feature M
in that path. In path (b), only feature C is selected as MRF, since it has the maximum
relevance value. In path (c), feature O is the MRF. Hence, after processing feature C,
all features contained in the three paths have their status set to “Removed”, except
feature O (lines: 15–17).

Analogously, when feature A (with value “0”) is processed, the features in three
paths, i.e. path (a) containing features A, D and H; path (b) containing features A, N
and P; and path (c) containing features A, N and R will be processed. In path (a), both
features D and H have maximum relevance value, but D will be selected as the MRF
(line: 21) since it is shallower than H. In path (b), feature P is selected as the MRF
since it has the maximum relevance value among all features in that path. In path (c),
feature R is selected as the MRF, since it also has the maximum relevance value.
Therefore, after processing feature A, the selection status for all features contained in
those three paths will be assigned as “Removed”, except features D, P and R (lines:
22–24).
After processing all features in that example DAG, the selected features are K,
J, D, P, R, G and O. Next, as shown in Fig. 5.5d, e, the training dataset and the
current testing instance are reduced to contain only those seven selected features in
line 28–29 of Algorithm 2, and that reduced instance is classified in line 30. Finally,
the status of all features is reassigned to “Selected” in line 31–33, as a preparation
for feature selection for the next instance.
Note that, for each set of features being compared when MR decides which features
will have their status set to “Removed”, this decision is based both on the
relevance values of the features being compared and on the hierarchical redundancy
among features, as explained earlier. Thus, in general the MR method does not select
all core features with complete hierarchical information on feature values, as selected
by HIP (see Sect. 5.2). Consider, e.g., the core feature C = “1”, whose value implies
that features I, F, M, L, Q and O have value “1”; and the core feature A = “0”, which
implicitly contains the hierarchical information that features D, H, N, P and R have value
“0”. The features C and A were selected by the HIP method, but neither C nor A
is selected by the MR method, because the relevance value of A is smaller than the
relevance values of features D, P and R, and the relevance value of C is smaller than
the relevance value of feature O. Hence, we lose the information about the values of
features C and A, whose values are not implied by the values of features K, J, D, P, R,
G and O (nor implied by any other feature in the DAG).
On the other hand, the MR method has the advantage that in general it selects
features with higher relevance values than the features selected by the HIP method
(which ignores feature relevance values). For instance, in the case of our example
DAG in Figs. 5.4b and 5.5b, the five features selected by HIP (A, B, C, G and K)
have on average a relevance value of 0.282, whilst the seven features selected by MR
(K, J, D, P, R, G and O) have on average a relevance value of 0.344.

Algorithm 2 Select Most Relevant Features (MR)


1: Initialize DAG with all features X in Dataset;
2: Initialize TrainSet;
3: Initialize TestSet;
4: for each feature xi on path k in DAG do
5: Initialize A+ (xi,k ) in DAG;
6: Initialize D+ (xi,k ) in DAG;
7: Initialize Status(xi ) ← “Selected”;
8: Calculate R(xi ) in TrainSet;
9: end for
10: for each Inst t ∈ TestSet do
11: for each feature xi ∈ DAG do
12: if V alue(xi,t ) = 1 then
13: for each path k from xi to root in DAG do
14: Find MRF in A+ (xi,k );
15: for each ancestor Ai, j,k+ except MRF do
16: Status(Ai, j,k+ ) ← “Removed”;
17: end for
18: end for
19: else
20: for each path k from xi to leaf in DAG do
21: Find MRF in D+ (xi,k );
22: for each descendant Di, j,k+ except MRF do
23: Status(Di, j,k+ ) ← “Removed”;
24: end for
25: end for
26: end if
27: end for
28: Re-create TrainSet_SF with all features X  where Status(X  ) = “Selected”;
29: Re-create Inst_SFt with all features X  where Status(X  ) = “Selected”;
30: Classifier(TrainSet_SF, Inst_SFt );
31: for each feature xi ∈ X do
32: Re-assign Status(xi ) ← “Selected”;
33: end for
34: end for
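A minimal Python sketch of the per-path selection of Algorithm 2 (lines 11–27) is given below; representing the feature DAG as explicit root-to-leaf paths and passing precomputed relevance values (Eq. 5.2) are assumptions of the sketch, and the demo path and relevance values are hypothetical rather than those of Fig. 5.5.

def mr_select(instance, paths, relevance):
    # Lazy MR selection for one instance (Algorithm 2, lines 11-27).
    # instance: feature -> 0/1; paths: root-to-leaf lists of features (index 0 = root);
    # relevance: feature -> R value computed on the training set (Eq. 5.2).
    removed = set()
    for feature, value in instance.items():
        for path in (p for p in paths if feature in p):
            pos = path.index(feature)
            if value == 1:
                group = path[:pos + 1]   # the feature and its ancestors on this path
                # MRF: highest relevance, ties broken in favour of the deepest feature
                best = max(group, key=lambda f: (relevance[f], path.index(f)))
            else:
                group = path[pos:]       # the feature and its descendants on this path
                # MRF: highest relevance, ties broken in favour of the shallowest feature
                best = max(group, key=lambda f: (relevance[f], -path.index(f)))
            removed |= set(group) - {best}
    return set(instance) - removed

# Hypothetical demo: a single path E -> F -> G -> C, all features with value "1".
paths = [["E", "F", "G", "C"]]
rel = {"E": 0.10, "F": 0.40, "G": 0.25, "C": 0.40}
inst = {"E": 1, "F": 1, "G": 1, "C": 1}
print(sorted(mr_select(inst, paths, rel)))  # ['C'] - C ties with F on relevance but is deeper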

5.4 Select Hierarchical Information-Preserving and Most Relevant Features (HIP—MR)

Although both HIP and MR select a set of features without hierarchical redun-
dancy, HIP has the limitation of ignoring the relevance of features, and MR has
the limitation that it does not necessarily select all core features with the complete
hierarchical information (features whose observed values logically imply the val-
ues of all other features for the current instance). The hybrid Select Hierarchical
Information-Preserving and Most Relevant Features (HIP—MR) method addresses
these limitations, by both considering feature relevance (like MR) and selecting all
core features with the complete hierarchical information (like HIP). The price paid
for considering both these criteria is that, unlike HIP and MR, HIP—MR typically
selects a large subset of features having some hierarchical redundancy (although less
redundancy than the original full set of features), as will be discussed later.
For each feature xi in the instance being classified, HIP—MR first identifies the
features whose values are implied by the value of xi in the instance – i.e. the set of
features which are ancestors or descendants of xi , depending on whether xi has value
“1” or “0”, respectively. Then, HIP—MR removes features by combining ideas from
the HIP and MR methods, as follows. If feature xi has value “1”, HIP—MR removes
the ancestors of xi whose relevance values are not greater than the relevance value
of xi . If feature xi has value “0”, HIP—MR removes the descendants of xi whose
relevance values are not greater than the relevance value of xi .
Therefore, HIP—MR selects a set of features where each feature has the property
of being needed to preserve the complete hierarchical information associated
with the instance being classified (the kind of feature selected by HIP), or has
a relatively high relevance in the context of its ancestors or descendants (the kind
of feature selected by MR). Hence, the set of features selected by the HIP—MR
method tends to include the union of the sets of features selected by the HIP and
MR methods separately, making HIP—MR a considerably more “inclusive” feature
selection method.
The pseudocode is shown as Algorithm 3. In the first part of the algorithm (lines:
1–9), firstly the DAG is constructed, the ancestors and descendants of each feature
are found, and the relevance value of each feature is calculated by Eq. 5.1. In the
second part of the algorithm (lines: 10–32), the feature selection process is carried
out by combining ideas of the HIP and MR methods, as explained earlier, for each
testing instance, following a lazy learning approach.
In the case of our example feature DAG in Fig. 5.6, when feature C (with value
“1”) is processed, its relevance value is compared with the relevance values of all its
ancestor features I, F, M, L, Q and O. Then, features I, F, M and L are marked for
removal, since their relevance values are not greater than the relevance of C. Next,
when feature A (with value “0”) is processed, only one of its descendant features is
marked for removal (i.e. feature N), since the relevance values of its other descendant
features are greater than the relevance value of A. This process is repeated for all

Algorithm 3 Select Hierarchical Information-Preserving and Most Relevant Features (HIP—MR)
1: Initialize DAG with all features X in Dataset;
2: Initialize TrainSet;
3: Initialize TestSet;
4: for each feature xi in X do
5: Initialize A(xi ) in DAG;
6: Initialize D(xi ) in DAG;
7: Initialize Status(xi ) ← “Selected”;
8: Calculate R(xi ) in TrainSet;
9: end for
10: for each Inst t ∈ TestSet do
11: for each feature xi ∈ DAG do
12: if V alue(xi,t ) = 1 then
13: for each ancestor Ai j ∈ A(xi ) do
14: if R(Ai j ) ≤ R(xi ) then
15: Status(Ai j ) ← “Removed”;
16: end if
17: end for
18: else
19: for each descendant Di j ∈ D(xi ) do
20: if R(Di j ) ≤ R(xi ) then
21: Status(Di j ) ← “Removed”;
22: end if
23: end for
24: end if
25: end for
26: Re-create TrainSet_SF with all features X  where Status(X  ) = “Selected”;
27: Re-create Inst_SFt with all features X  where Status(X  ) = “Selected”;
28: Classifier(TrainSet_SF, Inst_SFt );
29: for each feature xi ∈ X do
30: Re-assign Status(xi ) ← “Selected”;
31: end for
32: end for

other features in the instance being classified. At the end of this process, the selected
features are: C, R, K, O, G, Q, B, J, A, D and P.
Note that in this example HIP—MR selects all features selected by HIP or MR.
Actually, HIP—MR tends to select substantially more features than the number of
features selected by HIP and MR together. Note also that, although HIP—MR selects
a feature subset with less hierarchical redundancy than the original full feature set,
the features selected by HIP—MR still have some redundancy, unlike the features
selected by HIP and MR. This is because HIP—MR can select a redundant feature
xi if xi has higher relevance than another selected feature whose value logically implies the value of xi. For
instance, in the above example, HIP—MR selects feature Q, which is redundant with
respect to selected feature C, but Q has higher relevance than C.
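The removal rule of Algorithm 3 (lines 11–25) can be sketched in Python as follows; precomputed ancestor/descendant sets and relevance values are again assumed, and the demo chain and relevance values are hypothetical, chosen so that a redundant feature is kept because of its higher relevance (analogous to feature Q in the example above).

def hip_mr_select(instance, ancestors, descendants, relevance):
    # Lazy HIP-MR selection for one instance (Algorithm 3, lines 11-25): a feature
    # is removed only if its value is implied by some feature xi AND its relevance
    # does not exceed the relevance of xi.
    removed = set()
    for feature, value in instance.items():
        implied = ancestors[feature] if value == 1 else descendants[feature]
        removed |= {f for f in implied if relevance[f] <= relevance[feature]}
    return set(instance) - removed

# Hypothetical demo on the chain E -> F -> G -> C (all features with value "1").
anc = {"E": set(), "F": {"E"}, "G": {"E", "F"}, "C": {"E", "F", "G"}}
dec = {f: {g for g in anc if f in anc[g]} for f in anc}
rel = {"E": 0.10, "F": 0.50, "G": 0.25, "C": 0.40}
inst = {"E": 1, "F": 1, "G": 1, "C": 1}
print(sorted(hip_mr_select(inst, anc, dec, rel)))  # ['C', 'F'] - F kept despite redundancy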

Fig. 5.6 Example of select hierarchical information-preserving and most relevant features (panels a–e)

5.5 Experimental Results

This section reports the comparison of the HIP, MR and HIP—MR methods with two
“flat” feature selection methods, i.e. Hybrid-lazy/eager-entropy-based feature selection [1]
and Hybrid-lazy/eager-relevance-based feature selection. The main char-
acteristics of the feature selection methods involved in the experiments are sum-
marised in Table 5.1. The Hybrid-lazy/eager-entropy-based feature selection and
Hybrid-lazy/eager-relevance-based feature selection methods follow the lazy learn-
ing scenario, i.e. conducting feature selection for each individual testing instance,

although these two methods also have an “eager” learning component, as discussed
next. In essence, these two methods measure the quality of each feature, and then
produce a ranking of all the features based on that measure and select the top n
features in that ranking.
The difference between those two methods is the feature quality measure: one uses
entropy, as shown in Eq. 5.3 [1]. This method calculates two versions of a feature’s
entropy: in the lazy version, the entropy is calculated using only the training instances
with the value v j (“1” or “0”) of the feature A j observed in the current testing instance
being classified; whilst in the eager version, the entropy is calculated using all training
instances, regardless of the value v j observed in the current testing instance. Then
the method chooses the smaller of these two entropy values as the feature’s quality
measure.
Ent_hybrid(Aj, vj) = min(Ent_lazy(Aj, vj), Ent_eager(Aj))    (5.3)

The other method uses the relevance measure given by Eq. 5.1, which follows the
eager scenario, i.e. calculating the relevance value of each feature using all training
instances. This is a hybrid lazy/eager method because the measure of relevance is
calculated using the whole training dataset in an “eager” approach, but it selects the
top-n ranked features for each testing instance, in a “lazy” approach.
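A minimal Python sketch of this relevance-based hybrid baseline is given below; the relevance values and the value of n in the demo are hypothetical, whereas in the experiments n is taken, for each testing instance, from the number of features selected by HIP, MR or HIP—MR (as explained next).

def top_n_by_relevance(relevance, n):
    # Flat relevance-based selection: rank all features by R (Eq. 5.1), computed
    # eagerly on the whole training set, and keep the top n for the current instance.
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:n]

# Hypothetical relevance values; n would come from the paired hierarchical method.
rel = {"A": 0.12, "B": 0.45, "C": 0.31, "D": 0.08}
print(top_n_by_relevance(rel, 2))  # ['B', 'C']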
For both methods, the parameter n, representing the number of features selected
for each instance, equals the number of features selected by the HIP, MR or HIP—MR
method, respectively. That is, for each testing instance, the Hybrid-lazy/eager-
entropy-based feature selection method and the Hybrid-lazy/eager-relevance-based
feature selection method select the same number of features as selected by HIP, MR
or HIP—MR. This adds a lazy criterion to both these methods, since HIP, MR and
HIP—MR are lazy methods. In addition, the HFS+GO—BAN [4] method is adopted
for constructing the BAN classifier by using the features selected by different feature
selection methods.
Tables 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 5.10, 5.11, 5.12 and 5.13 report the
results for the hierarchical and “flat” feature selection methods working with the
Naïve Bayes, Tree Augmented Naïve Bayes, Bayesian Network Augmented Naïve
Bayes and K-Nearest Neighbour classifiers. In these tables, the numbers after the
symbol “±” denote standard errors. We also show, in the box-plots of Figs. 5.7, 5.8
and 5.9, the distribution of ranks based on the GMean values for different feature
selection methods working with different classifiers.
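As a reminder of how the GMean (GM) columns relate to the sensitivity (Sen.) and specificity (Spe.) columns, a short sketch is given below; it assumes GM is the geometric mean of sensitivity and specificity, which is consistent with the values reported in the tables.

from math import sqrt

def gmean(sensitivity, specificity):
    # Geometric mean of sensitivity and specificity (both given as percentages).
    return sqrt(sensitivity * specificity)

# Example taken from the first row of Table 5.2 (NB without feature selection, BP dataset).
print(round(gmean(50.2, 69.0), 1))  # 58.9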
Tables 5.2, 5.3, 5.4 and 5.5 compare the predictive accuracies obtained by NB,
TAN, BAN and KNN when using HIP or the different “flat” feature selection methods,
i.e. EntHIP_n and ReleHIP_n. Generally, HIP+NB most often obtains the
highest GMean value for predicting all four model organisms’ genes, i.e. 6 out
of 7 times for predicting Caenorhabditis elegans genes and 5 out of 7 times for
predicting Drosophila melanogaster, Mus musculus and Saccharomyces cerevisiae
genes, respectively.
As shown in Fig. 5.7, the HIP+NB method obtains the best average rank of 1.357,
while the second best rank (2.054) was obtained by Naïve Bayes without feature

Table 5.1 Summary of characteristics of feature selection methods working with different lazy classification algorithms

Feature selection method | Learning approach | Annotations | Classification algorithms
No feature selection | Eager | – | NB, TAN, BAN, KNN
HIP | Lazy | – | NB, TAN, BAN, KNN
MR | Lazy | – | NB, TAN, BAN, KNN
HIP—MR | Lazy | – | NB, TAN, BAN, KNN
Entropy-based (HIP_n) | Hybrid | Select the same n of features selected by HIP | NB, TAN, BAN, KNN
Entropy-based (MR_n) | Hybrid | Select the same n of features selected by MR | NB, TAN, BAN, KNN
Entropy-based (HIP—MR_n) | Hybrid | Select the same n of features selected by HIP—MR | NB, TAN, BAN, KNN
Relevance-based (HIP_n) | Hybrid | Select the same n of features selected by HIP | NB, TAN, BAN, KNN
Relevance-based (MR_n) | Hybrid | Select the same n of features selected by MR | NB, TAN, BAN, KNN
Relevance-based (HIP—MR_n) | Hybrid | Select the same n of features selected by HIP—MR | NB, TAN, BAN, KNN

selection method. The average rank of ReleHIP_n+NB is 2.732, whereas EntHIP_n+NB
obtained the worst average rank (3.857) in terms of GMean value.
Table 5.3 reports the results for the hierarchical and “flat” feature selection methods
working with the Tree Augmented Naïve Bayes classifier. Analogously to the case of
the Naïve Bayes classifier, HIP+TAN performs best in predicting genes of all four
model organisms, since it obtains the highest GMean value 5 times for predicting
Caenorhabditis elegans genes, 6 times for predicting Drosophila melanogaster genes,
and all 7 times for predicting Saccharomyces cerevisiae genes. For predicting Mus
musculus genes, HIP+TAN and TAN without feature selection show competitive
performance, since both obtain the highest GMean value 3 out of 7 times.
HIP+TAN obtains the best results with an average rank of 1.411, while the second
best rank (2.500) was obtained by ReleHIP_n+TAN. TAN without feature selection
obtains the third best average rank (2.679), while EntHIP_n+TAN still obtains
the worst average rank (3.411).
Table 5.4 reports the results for the different feature selection methods working with
the Bayesian Network Augmented Naïve Bayes classifier. Obviously, HIP+BAN obtains

Table 5.2 Predictive accuracy for Naïve Bayes with the hierarchical HIP method and baseline
“flat” feature selection methods
Feature Type | NB without Feature Selection | Lazy HIP + NB | Lazy/Eager EntHIP_n + NB | Lazy/Eager ReleHIP_n + NB

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 50.2 ± 3.6 69.0 ± 2.6 58.9 54.1 ± 3.4 75.5 ± 2.8 63.9 34.4 ± 3.0 84.0 ± 2.0 53.8 35.9 ± 2.8 81.2 ± 2.6 54.0

MF 57.9 ± 4.1 46.2 ± 5.5 51.7 45.5 ± 4.7 51.9 ± 5.1 48.6 36.4 ± 2.8 65.2 ± 4.4 48.7 66.9 ± 7.7 43.7 ± 5.8 54.1

CC 43.9 ± 5.7 70.5 ± 3.4 55.6 58.2 ± 4.9 60.9 ± 4.0 59.5 20.4 ± 3.0 83.3 ± 2.6 41.2 25.5 ± 4.2 79.5 ± 3.4 45.0

BP+MF 54.0 ± 1.8 70.3 ± 3.0 61.6 53.5 ± 3.6 76.2 ± 1.9 63.8 30.5 ± 1.5 85.6 ± 1.3 51.1 38.5 ± 3.8 79.4 ± 2.3 55.3

BP+CC 52.6 ± 3.9 68.3 ± 2.6 59.9 57.7 ± 3.7 73.0 ± 2.6 64.9 27.7 ± 2.7 85.5 ± 2.4 48.7 37.6 ± 2.7 81.1 ± 2.1 55.2

MF+CC 51.2 ± 2.8 64.1 ± 4.3 57.3 54.7 ± 3.3 66.0 ± 4.1 60.1 39.4 ± 4.2 80.5 ± 3.5 56.3 37.6 ± 3.3 76.3 ± 3.5 53.6

BP+MF+CC 52.1 ± 4.4 70.0 ± 2.3 60.4 55.3 ± 3.6 71.7 ± 2.7 63.0 29.3 ± 3.4 84.9 ± 1.8 49.9 45.6 ± 3.9 80.1 ± 2.0 60.4

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 74.7 ± 3.5 36.1 ± 9.5 51.9 73.6 ± 4.1 44.4 ± 9.0 57.2 93.4 ± 2.5 2.8 ± 2.5 16.2 76.9 ± 3.2 47.2 ± 8.2 60.2

MF 82.4 ± 4.6 35.3 ± 8.6 53.9 69.1 ± 6.1 52.9 ± 7.3 60.5 97.1 ± 2.3 32.4 ± 6.3 56.1 92.6 ± 3.4 32.4 ± 9.5 54.8

CC 87.1 ± 4.1 50.0 ± 10.2 66.0 80.6 ± 6.5 46.4 ± 11.4 61.2 91.9 ± 2.7 25.0 ± 7.1 47.9 85.5 ± 5.2 39.3 ± 8.7 58.0

BP+MF 77.2 ± 3.9 50.0 ± 10.2 62.1 72.8 ± 5.6 57.9 ± 9.3 64.9 95.7 ± 2.5 15.8 ± 7.6 38.9 84.8 ± 3.0 44.7 ± 10.8 61.6

BP+CC 76.9 ± 5.1 48.6 ± 9.8 61.1 73.6 ± 4.9 64.9 ± 8.3 69.1 91.2 ± 3.5 2.7 ± 2.5 15.7 78.0 ± 4.0 40.5 ± 10.2 56.2

MF+CC 89.4 ± 3.2 57.9 ± 5.3 71.9 82.4 ± 6.1 63.2 ± 6.7 72.2 95.3 ± 2.5 34.2 ± 5.5 57.1 91.8 ± 3.1 47.4 ± 4.5 66.0

BP+MF+CC 81.5 ± 5.3 55.3 ± 8.2 67.1 76.1 ± 4.9 68.4 ± 5.3 72.1 96.7 ± 1.7 21.1 ± 8.7 45.2 87.0 ± 3.2 50.0 ± 8.3 66.0

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 82.4 ± 4.7 44.1 ± 5.9 60.3 72.1 ± 4.8 70.6 ± 5.1 71.3 95.6 ± 2.2 29.4 ± 4.1 53.0 91.2 ± 3.2 44.1 ± 7.0 63.4

MF 69.2 ± 7.4 48.5 ± 11.2 57.9 78.5 ± 4.4 45.5 ± 12.2 59.8 87.7 ± 3.0 30.3 ± 10.8 51.5 84.6 ± 3.7 36.4 ± 11.9 55.5

CC 75.8 ± 2.3 52.9 ± 10.0 63.3 80.3 ± 3.0 47.1 ± 11.2 61.5 81.8 ± 3.3 32.4 ± 11.7 51.5 75.8 ± 3.2 41.2 ± 11.9 55.9

BP+MF 83.8 ± 3.4 44.1 ± 7.0 60.8 70.6 ± 4.8 70.6 ± 8.1 70.6 94.1 ± 2.3 32.4 ± 6.4 55.2 86.8 ± 4.5 44.1 ± 7.2 61.9

BP+CC 79.4 ± 6.1 50.0 ± 8.4 63.0 66.2 ± 5.0 73.5 ± 9.3 69.8 97.1 ± 1.9 32.4 ± 8.9 56.1 88.2 ± 4.7 38.2 ± 10.3 58.0

MF+CC 75.0 ± 5.0 64.7 ± 12.5 69.7 79.4 ± 4.2 58.8 ± 11.8 68.3 91.2 ± 3.3 32.4 ± 8.9 54.4 83.8 ± 5.0 47.1 ± 10.5 62.8

BP+MF+CC 82.4 ± 4.2 47.1 ± 9.3 62.3 73.5 ± 5.1 73.5 ± 9.8 73.5 92.6 ± 4.4 35.3 ± 9.4 57.2 85.3 ± 4.3 41.2 ± 9.1 59.3

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 40.0 ± 8.3 84.9 ± 3.5 58.3 63.3 ± 6.0 78.4 ± 3.1 70.4 3.3 ± 3.3 100.0 ± 0.0 18.2 40.0 ± 6.7 84.3 ± 3.7 58.1

MF 11.5 ± 6.1 81.7 ± 4.8 30.7 5.0 ± 5.0 83.2 ± 3.4 20.4 0.0 ± 0.0 98.5 ± 1.0 0.0 7.7 ± 4.4 90.8 ± 3.3 26.4

CC 25.0 ± 7.1 86.2 ± 3.0 46.4 29.2 ± 10.2 82.9 ± 4.2 49.2 16.7 ± 7.0 95.1 ± 1.7 39.9 20.8 ± 6.9 87.8 ± 3.1 42.7

BP+MF 33.3 ± 11.1 85.4 ± 1.7 53.3 76.7 ± 7.1 74.0 ± 3.3 75.3 0.0 ± 0.0 100.0 ± 0.0 0.0 50.0 ± 7.5 85.9 ± 1.9 65.5

BP+CC 53.3 ± 8.9 85.8 ± 3.0 67.6 70.0 ± 7.8 79.4 ± 3.2 74.6 0.0 ± 0.0 100.0 ± 0.0 0.0 36.7 ± 10.5 85.3 ± 2.5 56.0

MF+CC 34.5 ± 10.5 87.3 ± 2.1 54.9 31.0 ± 8.0 82.2 ± 3.5 50.5 6.9 ± 5.7 95.9 ± 1.3 25.7 24.1 ± 9.7 89.8 ± 1.7 46.5

BP+MF+CC 36.7 ± 9.2 85.6 ± 2.7 56.0 70.0 ± 10.5 75.0 ± 2.6 72.5 0.0 ± 0.0 100.0 ± 0.0 0.0 30.0 ± 9.2 87.0 ± 1.7 51.1

Table 5.3 Predictive accuracy for Tree Augmented Naïve Bayes with the hierarchical HIP method
and baseline “flat” feature selection methods
Feature Type | TAN without Feature Selection | Lazy HIP + TAN | Lazy/Eager EntHIP_n + TAN | Lazy/Eager ReleHIP_n + TAN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 34.0 ± 3.2 79.6 ± 2.3 52.0 52.2 ± 2.3 67.7 ± 3.5 59.4 34.9 ± 3.4 84.3 ± 1.7 54.2 32.1 ± 2.3 83.1 ± 2.3 51.6

MF 37.2 ± 5.8 61.4 ± 5.0 47.8 43.0 ± 5.6 50.6 ± 4.5 46.6 38.0 ± 5.1 66.5 ± 5.2 50.3 15.7 ± 3.6 82.9 ± 3.3 36.1

CC 39.8 ± 3.0 78.2 ± 2.2 55.8 44.9 ± 2.7 62.2 ± 4.7 52.8 28.6 ± 5.0 80.8 ± 3.0 48.1 29.6 ± 4.0 76.9 ± 3.6 47.7

BP+MF 35.2 ± 1.9 80.3 ± 2.2 53.2 54.5 ± 3.2 72.1 ± 2.4 62.7 38.0 ± 4.3 82.1 ± 1.5 55.9 35.2 ± 3.4 82.6 ± 2.1 53.9

BP+CC 42.7 ± 3.1 81.7 ± 2.7 59.1 59.2 ± 3.9 69.2 ± 2.9 64.0 42.3 ± 3.3 82.3 ± 2.3 59.0 35.2 ± 2.4 83.7 ± 1.9 54.3

MF+CC 40.6 ± 3.4 74.4 ± 3.6 55.0 45.3 ± 2.2 67.2 ± 3.5 55.2 37.6 ± 3.2 74.4 ± 3.5 52.9 39.4 ± 3.7 75.2 ± 3.4 54.4

BP+MF+CC 39.5 ± 2.8 80.1 ± 2.6 56.2 60.0 ± 5.5 71.4 ± 2.2 65.5 37.7 ± 2.7 79.0 ± 1.7 54.6 39.5 ± 4.1 81.0 ± 1.7 56.6

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 92.3 ± 2.9 19.4 ± 8.4 42.3 58.2 ± 6.5 72.2 ± 5.4 64.8 94.5 ± 1.9 2.8 ± 2.5 16.3 85.7 ± 3.3 25.0 ± 5.9 46.3

MF 91.2 ± 3.3 20.6 ± 5.0 43.3 73.5 ± 5.5 32.4 ± 7.1 48.8 91.2 ± 3.3 26.5 ± 6.0 49.2 91.2 ± 2.5 35.3 ± 7.2 56.7

CC 90.3 ± 3.6 32.1 ± 11.6 53.8 79.0 ± 3.6 50.0 ± 11.3 62.8 95.2 ± 2.4 25.0 ± 7.1 48.8 90.3 ± 2.6 35.7 ± 9.9 56.8

BP+MF 92.4 ± 3.3 23.7 ± 6.9 46.8 52.2 ± 4.0 73.7 ± 5.8 62.0 96.7 ± 2.4 13.2 ± 4.2 35.7 85.9 ± 4.1 28.9 ± 7.9 49.8

BP+CC 86.8 ± 4.0 18.9 ± 7.6 40.5 59.3 ± 5.7 67.6 ± 7.2 63.3 94.5 ± 1.8 8.1 ± 4.7 27.7 82.4 ± 3.7 29.7 ± 8.5 49.5

MF+CC 90.6 ± 3.3 31.6 ± 5.0 53.5 76.5 ± 4.9 60.5 ± 9.3 68.0 96.5 ± 2.3 28.9 ± 6.9 52.8 92.9 ± 2.5 39.5 ± 5.5 60.6

BP+MF+CC 92.4 ± 2.4 18.4 ± 5.3 41.2 60.9 ± 7.6 78.9 ± 6.9 69.3 97.8 ± 1.5 13.2 ± 6.7 35.9 91.3 ± 2.2 42.1 ± 8.4 62.0

Mus musculus Datasets


Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM

BP 89.7 ± 3.7 41.2 ± 4.9 60.8 42.6 ± 5.3 73.5 ± 7.2 56.0 98.5 ± 1.4 32.4 ± 5.5 56.5 85.3 ± 4.8 38.2 ± 8.5 57.1

MF 89.2 ± 4.0 33.3 ± 9.4 54.5 69.2 ± 7.7 66.7 ± 7.6 67.9 86.2 ± 2.6 36.4 ± 12.9 56.0 89.2 ± 3.8 30.3 ± 9.5 52.0

CC 75.8 ± 4.4 41.2 ± 8.3 55.9 72.7 ± 5.1 50.0 ± 10.1 60.3 83.3 ± 3.3 26.5 ± 7.3 47.0 81.8 ± 4.7 32.4 ± 9.3 51.5

BP+MF 86.8 ± 3.4 35.3 ± 5.4 55.4 42.6 ± 4.9 79.4 ± 9.3 58.2 97.1 ± 1.9 32.4 ± 6.4 56.1 91.2 ± 3.2 38.2 ± 6.2 59.0

BP+CC 88.2 ± 3.6 47.1 ± 9.7 64.5 48.5 ± 4.4 82.4 ± 6.8 63.2 97.1 ± 1.9 29.4 ± 5.2 53.4 80.9 ± 7.1 47.1 ± 9.7 61.7

MF+CC 88.2 ± 4.2 41.2 ± 10.0 60.3 63.2 ± 3.1 64.7 ± 12.7 63.9 89.7 ± 3.7 38.2 ± 9.4 58.5 89.7 ± 4.3 50.0 ± 10.2 67.0

BP+MF+CC 91.2 ± 3.2 41.2 ± 8.6 61.3 45.6 ± 8.0 82.4 ± 5.2 61.3 94.1 ± 2.3 35.3 ± 8.4 57.6 89.7 ± 3.0 41.2 ± 7.9 60.8

Saccharomyces cerevisiae Datasets


Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM

BP 3.3 ± 3.3 98.9 ± 1.1 18.1 56.7 ± 10.0 68.6 ± 2.0 62.4 0.0 ± 0.0 99.5 ± 0.5 0.0 16.7 ± 7.5 94.1 ± 1.7 39.6

MF 0.0 ± 0.0 97.7 ± 1.2 0.0 26.9 ± 6.2 78.6 ± 2.7 46.0 0.0 ± 0.0 98.5 ± 1.0 0.0 0.0 ± 0.0 96.2 ± 1.3 0.0

CC 16.7 ± 7.0 95.9 ± 2.1 40.0 25.0 ± 10.6 85.4 ± 4.0 46.2 5.0 ± 5.0 97.6 ± 1.2 22.1 12.5 ± 6.9 95.1 ± 2.1 34.5

BP+MF 3.3 ± 3.3 99.0 ± 0.7 18.1 63.3 ± 9.2 67.7 ± 3.1 65.5 0.0 ± 0.0 99.5 ± 0.5 0.0 13.3 ± 5.4 94.3 ± 0.9 35.4

BP+CC 10.0 ± 5.1 99.0 ± 0.7 31.5 63.3 ± 6.0 73.5 ± 3.8 68.2 0.0 ± 0.0 100.0 ± 0.0 0.0 20.0 ± 7.4 95.6 ± 1.6 43.7

MF+CC 5.0 ± 5.0 98.5 ± 0.8 22.2 31.0 ± 9.9 81.7 ± 2.5 50.3 0.0 ± 0.0 100.0 ± 0.0 0.0 10.3 ± 6.1 95.4 ± 1.6 31.3

BP+MF+CC 0.0 ± 0.0 99.0 ± 0.6 0.0 70.0 ± 9.2 69.7 ± 3.0 69.8 0.0 ± 0.0 99.5 ± 0.5 0.0 16.7 ± 7.5 93.8 ± 1.3 39.6

Table 5.4 Predictive accuracy for Bayesian Network Augmented Naïve Bayes with the hierarchical
HIP method and baseline “flat” feature selection methods
Feature Type | BAN without Feature Selection | Lazy HIP + BAN | Lazy/Eager EntHIP_n + BAN | Lazy/Eager ReleHIP_n + BAN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 28.7 ± 2.2 86.5 ± 1.8 49.8 54.5 ± 3.2 73.4 ± 2.7 63.2 31.6 ± 3.5 85.9 ± 2.3 52.1 35.9 ± 2.4 82.1 ± 2.4 54.3

MF 34.7 ± 4.5 66.5 ± 4.5 48.0 43.8 ± 4.5 52.5 ± 5.2 48.0 37.2 ± 4.7 65.8 ± 4.1 49.5 16.5 ± 3.5 81.6 ± 4.1 36.7

CC 33.7 ± 4.5 81.4 ± 2.2 52.4 55.1 ± 5.0 63.5 ± 4.0 59.2 20.4 ± 4.6 82.1 ± 3.0 40.9 31.6 ± 4.7 75.6 ± 3.4 48.9

BP+MF 30.0 ± 2.7 84.7 ± 1.7 50.4 55.9 ± 3.2 74.1 ± 2.5 64.4 36.6 ± 2.3 83.2 ± 2.0 55.2 39.0 ± 3.6 80.0 ± 1.6 55.9

BP+CC 29.1 ± 2.1 86.6 ± 1.7 50.2 58.7 ± 3.6 72.7 ± 2.5 65.3 38.5 ± 3.3 85.8 ± 2.4 57.5 37.6 ± 2.3 81.4 ± 2.1 55.3

MF+CC 35.3 ± 2.9 80.2 ± 3.2 53.2 55.9 ± 3.1 64.5 ± 3.6 60.0 35.3 ± 4.0 79.4 ± 3.9 52.9 43.5 ± 3.4 73.7 ± 3.0 56.6

BP+MF+CC 31.2 ± 2.9 85.2 ± 1.5 51.6 58.1 ± 3.8 73.4 ± 2.6 65.3 34.0 ± 3.4 83.2 ± 1.4 53.2 46.0 ± 4.6 78.2 ± 2.0 60.0

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 100.0 ± 0.0 0.0 ± 0.0 0.0 75.8 ± 4.4 52.8 ± 8.6 63.3 96.7 ± 1.7 0.0 ± 0.0 0.0 78.0 ± 3.2 38.9 ± 11.0 55.1

MF 91.2 ± 3.3 26.5 ± 3.4 49.2 64.7 ± 7.2 50.0 ± 10.0 56.9 92.6 ± 3.4 29.4 ± 7.2 52.2 85.3 ± 3.1 38.2 ± 9.3 57.1

CC 93.5 ± 2.6 28.6 ± 11.1 51.7 79.0 ± 6.6 46.4 ± 11.4 60.5 95.2 ± 2.5 25.0 ± 7.1 48.8 91.9 ± 3.7 28.6 ± 9.9 51.3

BP+MF 97.8 ± 1.5 0.0 ± 0.0 0.0 72.8 ± 3.9 63.2 ± 9.3 67.8 95.7 ± 3.4 10.5 ± 5.5 31.7 88.0 ± 3.1 42.1 ± 10.4 60.9

BP+CC 98.9 ± 1.1 0.0 ± 0.0 0.0 73.6 ± 4.7 62.2 ± 8.4 67.7 92.3 ± 2.4 2.7 ± 2.7 15.8 82.4 ± 3.8 40.5 ± 7.6 57.8

MF+CC 95.3 ± 1.9 31.6 ± 5.3 54.9 80.0 ± 6.2 60.5 ± 7.6 69.6 95.3 ± 2.5 28.9 ± 8.2 52.5 91.8 ± 2.4 44.7 ± 3.3 64.1

BP+MF+CC 98.9 ± 1.1 2.6 ± 2.5 16.0 73.9 ± 4.7 68.4 ± 5.3 71.1 98.9 ± 1.1 7.9 ± 5.5 28.0 90.2 ± 2.0 50.0 ± 9.9 67.2

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 98.5 ± 1.4 26.5 ± 5.0 51.1 75.0 ± 5.1 70.6 ± 5.1 72.8 98.5 ± 1.4 29.4 ± 5.2 53.8 89.7 ± 4.3 44.1 ± 7.0 62.9

MF 90.8 ± 3.3 27.3 ± 10.0 49.8 84.6 ± 3.0 45.5 ± 12.2 62.0 87.7 ± 3.0 24.2 ± 9.0 46.1 89.2 ± 3.1 24.2 ± 9.0 46.5

CC 86.4 ± 3.3 35.3 ± 11.2 55.2 80.3 ± 3.0 50.0 ± 10.1 63.4 84.8 ± 2.9 23.5 ± 10.4 44.6 83.3 ± 3.3 41.2 ± 11.9 58.6

BP+MF 98.5 ± 1.4 29.4 ± 6.4 53.8 69.1 ± 5.8 70.6 ± 8.1 69.8 98.5 ± 14.3 26.5 ± 7.0 51.1 92.6 ± 3.2 35.3 ± 7.3 57.2

BP+CC 98.5 ± 1.4 29.4 ± 6.4 53.8 66.2 ± 6.0 76.5 ± 8.0 71.2 97.1 ± 1.9 29.4 ± 5.2 53.4 88.2 ± 5.1 44.1 ± 9.6 62.4

MF+CC 91.2 ± 3.2 26.5 ± 8.8 49.2 79.4 ± 4.2 61.8 ± 12.5 70.0 91.2 ± 3.3 23.5 ± 7.4 46.3 89.7 ± 3.0 32.4 ± 7.7 53.9

BP+MF+CC 98.5 ± 1.4 26.5 ± 10.5 51.1 70.6 ± 6.0 76.5 ± 8.8 73.5 97.1 ± 1.9 29.4 ± 10.0 53.4 92.6 ± 2.4 41.2 ± 8.6 61.8

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 0.0 ± 0.0 100.0 ± 0.0 0.0 63.3 ± 6.0 76.8 ± 3.1 69.7 3.3 ± 3.3 100.0 ± 0.0 18.2 16.7 ± 7.5 91.9 ± 2.7 39.2

MF 0.0 ± 0.0 99.2 ± 0.8 0.0 23.1 ± 6.7 80.2 ± 3.9 43.0 0.0 ± 0.0 99.2 ± 0.8 0.0 0.0 ± 0.0 94.7 ± 1.6 0.0

CC 12.5 ± 6.1 99.2 ± 0.8 35.2 29.2 ± 10.2 83.7 ± 4.1 49.4 8.3 ± 5.7 99.2 ± 0.8 28.7 16.7 ± 7.0 91.9 ± 2.4 39.2

BP+MF 0.0 ± 0.0 100.0 ± 0.0 0.0 73.3 ± 6.7 71.9 ± 3.0 72.6 0.0 ± 0.0 100.0 ± 0.0 0.0 26.7 ± 6.7 95.8 ± 1.3 50.6

BP+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 63.3 ± 10.5 78.4 ± 2.9 70.4 0.0 ± 0.0 100.0 ± 0.0 0.0 23.3 ± 7.1 95.6 ± 1.4 47.2

MF+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 41.4 ± 8.3 80.7 ± 3.0 57.8 3.4 ± 3.4 100.0 ± 0.0 18.4 10.3 ± 6.1 93.4 ± 2.0 31.0

BP+MF+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 76.7 ± 7.1 73.6 ± 2.8 75.1 0.0 ± 0.0 100.0 ± 0.0 0.0 13.3 ± 5.4 97.1 ± 1.1 35.9

Table 5.5 Predictive accuracy for K-Nearest Neighbour with the hierarchical HIP method and
baseline “flat” feature selection methods
Feature Type | KNN without Feature Selection | Lazy HIP + KNN | Lazy/Eager EntHIP_n + KNN | Lazy/Eager ReleHIP_n + KNN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 48.3 ± 4.8 74.0 ± 3.0 59.8 51.7 ± 2.8 77.4 ± 3.5 63.3 36.4 ± 4.1 75.9 ± 3.1 52.6 76.6 ± 9.1 35.4 ± 8.3 52.1

MF 41.3 ± 3.3 54.4 ± 4.4 47.4 36.4 ± 4.4 53.2 ± 4.5 44.0 41.3 ± 4.8 59.5 ± 4.9 49.6 13.2 ± 2.8 84.2 ± 2.8 33.3

CC 39.8 ± 6.5 67.9 ± 3.3 52.0 40.8 ± 4.0 68.6 ± 2.9 52.9 25.5 ± 5.8 76.3 ± 2.9 44.1 31.6 ± 5.4 74.4 ± 5.3 48.5

BP+MF 49.3 ± 3.5 72.9 ± 1.2 59.9 52.6 ± 3.4 74.1 ± 1.7 62.4 38.0 ± 2.5 74.7 ± 2.4 53.3 75.6 ± 8.0 32.1 ± 7.6 49.3

BP+CC 42.7 ± 3.4 72.7 ± 2.7 55.7 45.1 ± 3.2 77.0 ± 1.9 58.9 41.8 ± 3.7 71.8 ± 2.6 54.8 77.9 ± 5.6 34.6 ± 8.0 51.9

MF+CC 44.7 ± 2.7 68.3 ± 2.6 55.3 47.1 ± 2.5 71.4 ± 2.9 58.0 36.5 ± 3.0 73.7 ± 2.9 51.9 37.1 ± 4.0 75.2 ± 2.8 52.8

BP+MF+CC 47.9 ± 3.6 72.0 ± 2.4 58.7 47.4 ± 3.9 75.1 ± 1.7 59.7 39.5 ± 2.7 75.4 ± 1.8 54.6 70.7 ± 7.9 37.3 ± 6.8 51.4

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 80.2 ± 4.9 38.9 ± 7.5 55.9 84.6 ± 3.8 50.0 ± 10.0 65.0 93.4 ± 2.4 13.9 ± 4.8 36.0 27.5 ± 4.4 63.9 ± 9.6 41.9

MF 77.9 ± 5.6 32.4 ± 5.2 50.2 69.1 ± 5.7 44.1 ± 7.0 55.2 73.5 ± 6.3 41.2 ± 8.8 55.0 4.4 ± 3.0 97.1 ± 2.5 20.7

CC 83.9 ± 5.6 46.4 ± 10.0 62.4 82.3 ± 4.7 46.4 ± 12.2 61.8 83.9 ± 5.0 35.7 ± 8.6 54.7 45.2 ± 7.0 60.7 ± 11.2 52.4

BP+MF 79.3 ± 5.1 42.1 ± 9.9 57.8 78.3 ± 4.7 52.6 ± 9.7 64.2 95.7 ± 2.5 18.4 ± 8.4 42.0 33.7 ± 5.2 71.1 ± 6.2 48.9

BP+CC 78.0 ± 5.4 37.8 ± 8.9 54.3 83.5 ± 3.0 51.4 ± 6.1 65.5 90.1 ± 3.5 16.2 ± 4.4 38.2 27.5 ± 4.4 62.2 ± 8.1 41.4

MF+CC 91.8 ± 3.1 42.1 ± 6.7 62.2 82.4 ± 5.2 57.9 ± 5.3 69.1 89.4 ± 3.8 39.5 ± 5.5 59.4 25.9 ± 4.1 78.9 ± 5.8 45.2

BP+MF+CC 81.5 ± 3.8 52.6 ± 6.9 65.5 84.8 ± 3.0 63.2 ± 7.7 73.2 94.6 ± 2.5 23.7 ± 7.5 47.3 38.0 ± 4.1 65.8 ± 9.2 50.0

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 86.8 ± 3.4 41.2 ± 4.7 59.8 82.4 ± 5.9 64.7 ± 8.8 73.0 97.1 ± 1.9 29.4 ± 7.4 53.4 82.4 ± 3.6 50.0 ± 8.6 64.2

MF 78.5 ± 4.5 39.4 ± 10.4 55.6 89.2 ± 5.1 39.4 ± 8.1 59.3 89.2 ± 3.2 30.3 ± 10.7 52.0 83.1 ± 4.1 27.3 ± 9.9 47.6

CC 74.2 ± 7.7 41.2 ± 9.4 55.3 75.8 ± 4.4 38.2 ± 10.2 53.8 80.3 ± 5.4 26.5 ± 8.9 46.1 80.3 ± 5.7 32.4 ± 8.9 51.0

BP+MF 83.8 ± 4.0 47.1 ± 7.3 62.8 83.8 ± 4.0 52.9 ± 11.7 66.6 94.1 ± 2.3 23.5 ± 6.2 47.0 85.3 ± 5.2 44.1 ± 5.0 61.3

BP+CC 86.8 ± 5.8 47.1 ± 10.1 63.9 77.9 ± 5.3 50.0 ± 9.1 62.4 95.6 ± 2.2 20.6 ± 7.4 44.4 80.9 ± 4.8 47.1 ± 7.5 61.7

MF+CC 77.9 ± 4.3 61.8 ± 6.9 69.4 80.9 ± 4.8 50.0 ± 8.9 63.6 89.7 ± 3.7 32.4 ± 8.1 53.9 83.8 ± 5.4 47.1 ± 12.1 62.8

BP+MF+CC 83.8 ± 4.5 50.0 ± 10.8 64.7 85.3 ± 6.4 55.9 ± 8.5 69.1 94.1 ± 2.3 29.4 ± 8.9 52.6 85.3 ± 3.7 52.9 ± 11.6 67.2

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 10.0 ± 5.1 95.7 ± 1.9 30.9 10.0 ± 5.1 91.4 ± 1.8 30.2 0.0 ± 0.0 98.4 ± 1.1 0.0 26.7 ± 4.4 93.0 ± 2.0 49.8

MF 11.5 ± 6.9 90.1 ± 3.0 32.2 3.8 ± 3.8 96.2 ± 1.3 19.1 0.0 ± 0.0 97.7 ± 1.2 0.0 7.7 ± 6.7 95.4 ± 1.2 27.1

CC 12.5 ± 6.9 93.5 ± 2.1 34.2 12.5 ± 6.9 93.5 ± 2.1 34.2 8.3 ± 6.7 95.9 ± 2.6 28.2 8.3 ± 5.7 94.3 ± 1.7 28.0

BP+MF 13.3 ± 5.4 94.8 ± 1.8 35.5 16.7 ± 7.5 93.8 ± 1.5 39.6 0.0 ± 0.0 99.5 ± 0.5 0.0 30.0 ± 6.0 93.8 ± 1.5 53.0

BP+CC 20.0 ± 5.4 96.6 ± 1.1 44.0 26.7 ± 6.7 97.1 ± 0.8 50.9 0.0 ± 0.0 99.5 ± 0.5 0.0 26.7 ± 6.7 95.1 ± 1.5 50.4

MF+CC 17.2 ± 8.0 94.9 ± 1.3 40.4 13.8 ± 11.4 95.9 ± 1.7 36.4 6.9 ± 5.7 99.0 ± 0.7 26.1 6.9 ± 4.4 95.4 ± 2.1 25.7

BP+MF+CC 20.0 ± 7.4 95.7 ± 1.1 43.7 30.0 ± 9.2 97.1 ± 1.5 54.0 0.0 ± 0.0 99.5 ± 0.5 0.0 30.0 ± 7.8 93.3 ± 1.9 52.9

Table 5.6 Predictive accuracy for Naïve Bayes with the hierarchical MR method and baseline
“flat” feature selection methods
Feature Type | NB without Feature Selection | Lazy MR + NB | Lazy/Eager EntMR_n + NB | Lazy/Eager ReleMR_n + NB

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 50.2 ± 3.6 69.0 ± 2.6 58.9 51.2 ± 3.5 75.5 ± 2.6 62.2 32.1 ± 1.8 83.7 ± 2.6 51.8 46.4 ± 3.0 73.4 ± 2.4 58.4

MF 57.9 ± 4.1 46.2 ± 5.5 51.7 38.8 ± 2.9 63.3 ± 3.8 49.6 47.9 ± 3.4 58.2 ± 5.2 52.8 76.0 ± 6.9 35.4 ± 6.3 51.9

CC 43.9 ± 5.7 70.5 ± 3.4 55.6 42.9 ± 4.0 71.2 ± 3.0 55.3 22.4 ± 3.2 80.8 ± 3.2 42.5 37.8 ± 5.3 73.1 ± 3.2 52.6

BP+MF 54.0 ± 1.8 70.3 ± 3.0 61.6 62.9 ± 3.5 73.2 ± 1.8 67.9 31.5 ± 1.8 80.9 ± 2.0 50.5 57.3 ± 4.4 71.5 ± 2.1 64.0

BP+CC 52.6 ± 3.9 68.3 ± 2.6 59.9 55.4 ± 2.8 73.8 ± 2.2 63.9 32.9 ± 2.7 81.7 ± 2.1 51.8 50.2 ± 3.1 75.6 ± 2.1 61.6

MF+CC 51.2 ± 2.8 64.1 ± 4.3 57.3 47.6 ± 3.6 68.3 ± 4.2 57.0 39.4 ± 4.4 77.5 ± 4.1 55.3 48.2 ± 2.4 70.2 ± 2.9 58.2

BP+MF+CC 52.1 ± 4.4 70.0 ± 2.3 60.4 55.8 ± 3.6 70.6 ± 2.4 62.8 31.6 ± 3.9 81.5 ± 2.2 50.7 54.9 ± 3.3 74.8 ± 2.5 64.1

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 74.7 ± 3.5 36.1 ± 9.5 51.9 79.1 ± 4.1 38.9 ± 11.0 55.5 94.5 ± 2.5 8.3 ± 4.3 28.0 79.1 ± 2.4 38.9 ± 10.9 55.5

MF 82.4 ± 4.6 35.3 ± 8.6 53.9 80.9 ± 4.2 44.1 ± 7.6 59.7 95.6 ± 2.5 29.4 ± 7.2 53.0 89.7 ± 3.8 35.3 ± 10.1 56.3

CC 87.1 ± 4.1 50.0 ± 10.2 66.0 83.9 ± 5.6 53.6 ± 8.7 67.1 95.2 ± 2.4 21.4 ± 7.4 45.1 87.1 ± 4.1 39.3 ± 8.7 58.5

BP+MF 77.2 ± 3.9 50.0 ± 10.2 62.1 79.3 ± 4.3 44.7 ± 8.2 59.5 94.6 ± 3.4 10.5 ± 4.1 31.5 81.5 ± 3.4 44.7 ± 11.5 60.4

BP+CC 76.9 ± 5.1 48.6 ± 9.8 61.1 80.2 ± 4.3 56.8 ± 11.2 67.5 93.4 ± 2.8 2.7 ± 2.5 15.9 81.3 ± 4.0 40.5 ± 9.0 57.4

MF+CC 89.4 ± 3.2 57.9 ± 5.3 71.9 83.5 ± 4.4 57.9 ± 7.5 69.5 96.5 ± 1.8 34.2 ± 6.7 57.4 92.9 ± 1.9 44.7 ± 5.0 64.4

BP+MF+CC 81.5 ± 5.3 55.3 ± 8.2 67.1 77.2 ± 4.5 63.2 ± 7.7 69.9 95.7 ± 1.8 13.2 ± 5.5 35.5 85.9 ± 3.7 50.0 ± 7.5 65.5

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 82.4 ± 4.7 44.1 ± 5.9 60.3 80.9 ± 5.2 50.0 ± 7.9 63.6 97.1 ± 1.9 32.4 ± 4.5 56.1 89.7 ± 3.7 44.1 ± 7.0 62.9

MF 69.2 ± 7.4 48.5 ± 11.2 57.9 83.1 ± 4.1 39.4 ± 10.7 57.2 87.7 ± 3.0 24.2 ± 9.0 46.1 84.6 ± 3.7 39.4 ± 13.0 57.7

CC 75.8 ± 2.3 52.9 ± 10.0 63.3 81.8 ± 3.6 41.2 ± 11.9 58.1 81.8 ± 3.3 29.4 ± 11.0 49.0 75.8 ± 2.3 44.1 ± 11.1 57.8

BP+MF 83.8 ± 3.4 44.1 ± 7.0 60.8 82.4 ± 4.2 50.0 ± 10.2 64.2 97.1 ± 1.9 35.3 ± 7.3 58.5 86.8 ± 4.0 38.2 ± 6.2 57.6

BP+CC 79.4 ± 6.1 50.0 ± 8.4 63.0 73.5 ± 5.1 52.9 ± 9.6 62.4 95.6 ± 3.0 29.4 ± 8.7 53.0 88.2 ± 5.1 47.1 ± 9.7 64.5

MF+CC 75.0 ± 5.0 64.7 ± 12.5 69.7 83.8 ± 5.0 55.9 ± 13.3 68.4 91.2 ± 3.3 29.4 ± 8.1 51.8 80.9 ± 5.2 52.9 ± 11.3 65.4

BP+MF+CC 82.4 ± 4.2 47.1 ± 9.3 62.3 85.3 ± 4.3 50.0 ± 6.9 65.3 94.1 ± 3.2 35.3 ± 9.4 57.6 85.3 ± 4.3 44.1 ± 8.9 61.3

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 40.0 ± 8.3 84.9 ± 3.5 58.3 33.3 ± 8.6 85.9 ± 2.9 53.5 0.0 ± 0.0 98.4 ± 1.2 0.0 36.7 ± 10.5 86.5 ± 2.8 56.3

MF 11.5 ± 6.1 81.7 ± 4.8 30.7 0.0 ± 0.0 93.9 ± 2.4 0.0 0.0 ± 0.0 98.5 ± 1.0 0.0 3.8 ± 3.3 90.1 ± 2.9 18.5

CC 25.0 ± 7.1 86.2 ± 3.0 46.4 20.8 ± 6.9 91.9 ± 2.7 43.7 16.7 ± 7.0 95.1 ± 1.7 39.9 20.8 ± 7.5 87.8 ± 2.2 42.7

BP+MF 33.3 ± 11.1 85.4 ± 1.7 53.3 23.3 ± 5.1 89.1 ± 2.5 45.6 0.0 ± 0.0 97.9 ± 0.8 0.0 36.7 ± 6.0 84.9 ± 1.5 55.8

BP+CC 53.3 ± 8.9 85.8 ± 3.0 67.6 40.0 ± 8.3 84.8 ± 2.7 58.2 10.0 ± 5.1 99.0 ± 0.7 31.5 43.3 ± 8.7 88.7 ± 1.6 62.0

MF+CC 34.5 ± 10.5 87.3 ± 2.1 54.9 17.2 ± 6.3 89.8 ± 2.3 39.3 13.8 ± 6.3 94.9 ± 1.3 36.2 20.7 ± 10.0 88.8 ± 1.6 42.9

BP+MF+CC 36.7 ± 9.2 85.6 ± 2.7 56.0 30.0 ± 9.2 86.5 ± 2.6 50.9 0.0 ± 0.0 99.5 ± 0.5 0.0 43.3 ± 11.2 90.9 ± 1.4 62.7

Table 5.7 Predictive accuracy for Tree Augmented Naïve Bayes with the hierarchical MR method
and baseline “flat” feature selection methods
Feature Type | TAN without Feature Selection | Lazy MR + TAN | Lazy/Eager EntMR_n + TAN | Lazy/Eager ReleMR_n + TAN

Caenorhabditis elegans Datasets


Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM

BP 34.0 ± 3.2 79.6 ± 2.3 52.0 55.0 ± 2.4 73.0 ± 1.8 63.4 30.1 ± 3.8 83.7 ± 2.8 50.2 36.8 ± 3.1 80.6 ± 2.4 54.5

MF 37.2 ± 5.8 61.4 ± 5.0 47.8 33.1 ± 3.5 65.2 ± 4.0 46.5 40.5 ± 5.2 64.6 ± 5.3 51.1 24.8 ± 3.5 75.3 ± 5.0 43.2

CC 39.8 ± 3.0 78.2 ± 2.2 55.8 37.8 ± 3.4 74.4 ± 2.7 53.0 30.6 ± 3.5 76.9 ± 3.6 48.5 33.7 ± 5.4 75.0 ± 2.8 50.3

BP+MF 35.2 ± 1.9 80.3 ± 2.2 53.2 61.0 ± 4.3 71.8 ± 2.3 66.2 37.1 ± 4.1 83.8 ± 1.7 55.8 43.7 ± 3.6 81.2 ± 2.7 59.6

BP+CC 42.7 ± 3.1 81.7 ± 2.7 59.1 56.3 ± 3.0 77.3 ± 2.2 66.0 34.7 ± 4.5 82.6 ± 2.5 53.5 44.1 ± 2.1 82.6 ± 1.3 60.4

MF+CC 40.6 ± 3.4 74.4 ± 3.6 55.0 45.9 ± 3.8 70.6 ± 3.0 56.9 35.9 ± 3.2 73.7 ± 2.9 51.4 40.0 ± 3.3 74.0 ± 3.4 54.4

BP+MF+CC 39.5 ± 2.8 80.1 ± 2.6 56.2 54.4 ± 4.2 76.5 ± 2.3 64.5 34.0 ± 3.0 82.6 ± 1.5 53.0 44.2 ± 3.8 80.7 ± 1.6 59.7

Drosophila melanogaster Datasets


Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM

BP 92.3 ± 2.9 19.4 ± 8.4 42.3 76.9 ± 3.6 50.0 ± 9.6 62.0 95.6 ± 2.5 2.8 ± 2.5 16.4 87.9 ± 3.8 27.8 ± 7.5 49.4

MF 91.2 ± 3.3 20.6 ± 5.0 43.3 83.8 ± 4.5 41.2 ± 7.4 58.8 92.6 ± 3.4 32.4 ± 6.3 54.8 89.7 ± 2.4 35.3 ± 6.1 56.3

CC 90.3 ± 3.6 32.1 ± 11.6 53.8 75.8 ± 6.6 42.9 ± 8.3 57.0 95.2 ± 2.4 25.0 ± 7.1 48.8 88.7 ± 4.3 35.7 ± 9.9 56.3

BP+MF 92.4 ± 3.3 23.7 ± 6.9 46.8 80.4 ± 2.8 47.4 ± 9.5 61.7 96.7 ± 2.4 13.2 ± 4.2 35.7 87.0 ± 3.6 23.7 ± 6.9 45.4

BP+CC 86.8 ± 4.0 18.9 ± 7.6 40.5 82.4 ± 3.8 40.5 ± 8.0 57.8 94.5 ± 2.3 10.8 ± 5.2 31.9 83.5 ± 4.3 27.0 ± 9.0 47.5

MF+CC 90.6 ± 3.3 31.6 ± 5.0 53.5 72.9 ± 6.4 52.6 ± 6.9 61.9 96.5 ± 2.4 23.7 ± 6.9 47.8 92.9 ± 2.5 42.1 ± 3.8 62.5

BP+MF+CC 92.4 ± 2.4 18.4 ± 5.3 41.2 77.2 ± 4.5 60.5 ± 8.5 68.3 98.9 ± 1.1 13.2 ± 6.7 36.1 92.4 ± 2.4 42.1 ± 8.4 62.4

Mus musculus Datasets


Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM

BP 89.7 ± 3.7 41.2 ± 4.9 60.8 73.5 ± 7.1 50.0 ± 10.0 60.6 97.1 ± 1.9 26.5 ± 3.4 50.7 83.8 ± 4.5 41.2 ± 7.4 58.8

MF 89.2 ± 4.0 33.3 ± 9.4 54.5 83.1 ± 6.6 54.5 ± 9.1 67.3 89.2 ± 3.2 33.3 ± 12.5 54.5 87.7 ± 3.6 39.4 ± 11.2 58.8

CC 75.8 ± 4.4 41.2 ± 8.3 55.9 74.2 ± 4.3 44.1 ± 9.8 57.2 86.4 ± 4.0 23.5 ± 10.4 45.1 78.8 ± 4.0 26.5 ± 10.2 45.7

BP+MF 86.8 ± 3.4 35.3 ± 5.4 55.4 79.4 ± 4.3 55.9 ± 8.6 66.6 95.6 ± 2.2 26.5 ± 4.5 50.3 89.7 ± 3.7 35.3 ± 5.4 56.3

BP+CC 88.2 ± 3.6 47.1 ± 9.7 64.5 70.6 ± 5.9 58.8 ± 8.9 64.4 98.5 ± 1.4 32.4 ± 6.4 56.5 80.9 ± 7.1 41.2 ± 10.5 57.7

MF+CC 88.2 ± 4.2 41.2 ± 10.0 60.3 82.4 ± 3.6 55.9 ± 11.5 67.9 92.6 ± 3.2 38.2 ± 9.4 59.5 88.2 ± 4.7 50.0 ± 10.2 66.4

BP+MF+CC 91.2 ± 3.2 41.2 ± 8.6 61.3 75.0 ± 5.7 58.8 ± 7.9 66.4 94.1 ± 2.3 29.4 ± 6.4 52.6 89.7 ± 3.0 41.2 ± 9.9 60.8

Saccharomyces cerevisiae Datasets


Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM Sens. Spec. GM

BP 3.3 ± 3.3 98.9 ± 1.1 18.1 30.0 ± 7.8 87.0 ± 2.7 51.1 0.0 ± 0.0 100.0 ± 0.0 0.0 10.0 ± 5.1 93.0 ± 3.1 30.5

MF 0.0 ± 0.0 97.7 ± 1.2 0.0 0.0 ± 0.0 87.8 ± 2.9 0.0 0.0 ± 0.0 98.5 ± 1.0 0.0 0.0 ± 0.0 96.2 ± 1.3 0.0

CC 16.7 ± 7.0 95.9 ± 2.1 40.0 20.8 ± 6.9 95.1 ± 2.1 44.5 5.0 ± 5.0 95.9 ± 1.8 21.9 12.5 ± 6.9 94.3 ± 2.4 34.3

BP+MF 3.3 ± 3.3 99.0 ± 0.7 18.1 20.0 ± 7.4 93.2 ± 1.4 43.2 0.0 ± 0.0 100.0 ± 0.0 0.0 16.7 ± 5.6 95.3 ± 1.5 39.9

BP+CC 10.0 ± 5.1 99.0 ± 0.7 31.5 30.0 ± 9.2 89.2 ± 2.1 51.7 6.7 ± 4.4 100.0 ± 0.0 25.9 13.3 ± 5.4 94.1 ± 1.6 35.4

MF+CC 5.0 ± 5.0 98.5 ± 0.8 22.2 10.3 ± 6.1 93.4 ± 2.5 31.0 6.9 ± 5.7 99.5 ± 0.5 26.2 10.3 ± 6.1 95.4 ± 1.8 31.3

BP+MF+CC 0.0 ± 0.0 99.0 ± 0.6 0.0 36.7 ± 9.2 89.4 ± 2.1 57.3 3.3 ± 3.3 100.0 ± 0.0 18.2 10.0 ± 5.1 96.6 ± 1.3 31.1

Table 5.8 Predictive accuracy for Bayesian Network Augmented Naïve Bayes with the hierarchical
MR method and baseline “flat” feature selection methods
Feature Type | BAN without Feature Selection | Lazy MR + BAN | Lazy/Eager EntMR_n + BAN | Lazy/Eager ReleMR_n + BAN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 28.7 ± 2.2 86.5 ± 1.8 49.8 52.2 ± 3.1 74.0 ± 2.2 62.2 28.7 ± 2.2 86.5 ± 1.6 49.8 40.2 ± 2.9 78.7 ± 2.8 56.2

MF 34.7 ± 4.5 66.5 ± 4.5 48.0 35.5 ± 3.0 63.3 ± 3.4 47.4 36.4 ± 4.8 65.2 ± 4.4 48.7 24.8 ± 4.2 74.7 ± 5.3 43.0

CC 33.7 ± 4.5 81.4 ± 2.2 52.4 40.8 ± 4.3 73.1 ± 2.6 54.6 24.5 ± 4.0 82.1 ± 2.5 44.8 39.8 ± 4.9 71.2 ± 4.2 53.2

BP+MF 30.0 ± 2.7 84.7 ± 1.7 50.4 63.8 ± 2.2 73.2 ± 2.1 68.3 34.3 ± 2.7 83.8 ± 1.3 53.6 47.4 ± 4.3 77.6 ± 1.9 60.6

BP+CC 29.1 ± 2.1 86.6 ± 1.7 50.2 54.0 ± 2.8 74.7 ± 2.3 63.5 32.4 ± 4.2 86.0 ± 2.0 52.8 45.5 ± 2.9 79.4 ± 1.6 60.1

MF+CC 35.3 ± 2.9 80.2 ± 3.2 53.2 47.1 ± 3.4 70.2 ± 3.9 57.5 33.5 ± 3.9 79.8 ± 3.4 51.7 44.7 ± 2.9 72.9 ± 3.5 57.1

BP+MF+CC 31.2 ± 2.9 85.2 ± 1.5 51.6 55.3 ± 4.0 72.0 ± 2.6 63.1 31.6 ± 3.6 84.0 ± 1.5 51.5 50.2 ± 3.9 76.5 ± 2.5 62.0

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 100.0 ± 0.0 0.0 ± 0.0 0.0 80.2 ± 3.5 44.4 ± 10.2 59.7 97.8 ± 1.5 0.0 ± 0.0 0.0 82.4 ± 3.2 33.3 ± 10.2 52.4

MF 91.2 ± 3.3 26.5 ± 3.4 49.2 80.9 ± 5.2 47.1 ± 9.1 61.7 92.6 ± 3.4 32.4 ± 6.3 54.8 86.8 ± 3.4 41.2 ± 8.3 59.8

CC 93.5 ± 2.6 28.6 ± 11.1 51.7 85.5 ± 4.6 42.9 ± 10.2 60.6 96.8 ± 2.0 21.4 ± 7.4 45.5 90.3 ± 3.6 32.1 ± 10.5 53.8

BP+MF 97.8 ± 1.5 0.0 ± 0.0 0.0 80.4 ± 3.7 44.7 ± 8.2 59.9 98.9 ± 1.1 7.9 ± 3.8 28.0 88.0 ± 3.5 36.8 ± 8.5 56.9

BP+CC 98.9 ± 1.1 0.0 ± 0.0 0.0 80.2 ± 4.1 51.4 ± 10.9 64.2 97.8 ± 1.5 2.7 ± 2.5 16.2 82.4 ± 4.5 32.4 ± 4.9 51.7

MF+CC 95.3 ± 1.9 31.6 ± 5.3 54.9 83.5 ± 4.9 55.3 ± 8.2 68.0 95.3 ± 2.5 26.3 ± 6.5 50.1 91.8 ± 2.9 39.5 ± 4.1 60.2

BP+MF+CC 98.9 ± 1.1 2.6 ± 2.5 16.0 81.5 ± 3.7 63.2 ± 7.7 71.8 97.8 ± 1.5 7.9 ± 5.5 27.8 89.1 ± 2.3 39.5 ± 5.5 59.3

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 98.5 ± 1.4 26.5 ± 5.0 51.1 88.2 ± 4.7 44.1 ± 7.7 62.4 98.5 ± 1.4 29.4 ± 4.1 53.8 88.2 ± 4.7 41.2 ± 7.4 60.3

MF 90.8 ± 3.3 27.3 ± 10.0 49.8 87.7 ± 3.0 39.4 ± 10.6 58.8 89.2 ± 3.2 24.2 ± 9.0 46.5 87.7 ± 4.2 24.2 ± 9.0 46.1

CC 86.4 ± 3.3 35.3 ± 11.2 55.2 78.8 ± 3.8 44.1 ± 11.1 58.9 86.4 ± 3.3 23.5 ± 10.4 45.1 80.3 ± 3.0 41.2 ± 11.9 57.5

BP+MF 98.5 ± 1.4 29.4 ± 6.4 53.8 86.8 ± 4.0 41.2 ± 9.6 59.8 98.5 ± 1.4 26.5 ± 5.0 51.1 94.1 ± 3.2 35.3 ± 7.3 57.6

BP+CC 98.5 ± 1.4 29.4 ± 6.4 53.8 77.9 ± 5.3 52.9 ± 9.6 64.2 97.1 ± 1.9 29.4 ± 5.2 53.4 89.7 ± 4.8 38.2 ± 7.5 58.5

MF+CC 91.2 ± 3.2 26.5 ± 8.8 49.2 83.8 ± 5.0 58.8 ± 13.1 70.2 94.1 ± 3.2 23.5 ± 7.4 47.0 88.2 ± 3.6 38.2 ± 8.3 58.0

BP+MF+CC 98.5 ± 1.4 26.5 ± 10.5 51.1 86.8 ± 4.0 50.0 ± 6.9 65.9 97.1 ± 1.9 29.4 ± 10.0 53.4 92.6 ± 2.4 44.1 ± 8.9 63.9

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 0.0 ± 0.0 100.0 ± 0.0 0.0 33.3 ± 8.6 89.7 ± 2.5 54.7 3.3 ± 3.3 99.5 ± 0.5 18.1 13.3 ± 7.4 92.4 ± 2.5 35.1

MF 0.0 ± 0.0 99.2 ± 0.8 0.0 0.0 ± 0.0 90.8 ± 3.0 0.0 0.0 ± 0.0 99.2 ± 0.8 0.0 0.0 ± 0.0 96.2 ± 1.3 0.0

CC 12.5 ± 6.1 99.2 ± 0.8 35.2 20.8 ± 6.9 93.5 ± 2.7 44.1 8.3 ± 5.7 99.2 ± 0.8 28.7 16.7 ± 7.0 93.5 ± 2.4 39.5

BP+MF 0.0 ± 0.0 100.0 ± 0.0 0.0 23.3 ± 7.1 89.6 ± 2.6 45.7 0.0 ± 0.0 100.0 ± 0.0 0.0 13.3 ± 7.4 97.4 ± 0.9 36.0

BP+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 40.0 ± 8.3 87.3 ± 2.5 59.1 3.3 ± 3.3 100.0 ± 0.0 18.2 13.3 ± 5.4 97.5 ± 0.8 36.0

MF+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 13.8 ± 6.3 88.8 ± 2.3 35.0 0.0 ± 0.0 100.0 ± 0.0 0.0 10.3 ± 6.1 95.4 ± 1.4 31.3

BP+MF+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 33.3 ± 5.0 87.0 ± 2.5 53.8 0.0 ± 0.0 100.0 ± 0.0 0.0 13.3 ± 5.4 99.0 ± 0.6 36.3

Table 5.9 Predictive accuracy for K-Nearest Neighbour (k = 3) with the hierarchical MR method
and baseline “flat” feature selection methods
Feature Type | KNN without Feature Selection | Lazy MR + KNN | Lazy/Eager EntMR_n + KNN | Lazy/Eager ReleMR_n + KNN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 48.3 ± 4.8 74.0 ± 3.0 59.8 47.4 ± 2.9 73.4 ± 2.2 59.0 30.6 ± 3.3 79.6 ± 3.2 49.4 70.3 ± 8.6 44.8 ± 6.6 56.1

MF 41.3 ± 3.3 54.4 ± 4.4 47.4 40.5 ± 4.0 62.0 ± 5.9 50.1 33.1 ± 3.7 60.8 ± 5.5 44.9 20.7 ± 4.0 75.9 ± 4.2 39.6

CC 39.8 ± 6.5 67.9 ± 3.3 52.0 34.7 ± 7.5 64.1 ± 1.9 47.2 31.6 ± 3.9 75.0 ± 2.3 48.7 38.8 ± 4.5 65.4 ± 5.2 50.4

BP+MF 49.3 ± 3.5 72.9 ± 1.2 59.9 49.3 ± 3.1 74.7 ± 1.9 60.7 37.6 ± 5.0 76.5 ± 1.8 53.6 62.9 ± 5.1 48.8 ± 4.6 55.4

BP+CC 42.7 ± 3.4 72.7 ± 2.7 55.7 43.7 ± 4.3 74.1 ± 2.2 56.9 41.8 ± 2.6 78.8 ± 2.0 57.4 71.4 ± 3.9 50.3 ± 5.6 59.9

MF+CC 44.7 ± 2.7 68.3 ± 2.6 55.3 44.7 ± 2.0 67.9 ± 3.1 55.1 33.5 ± 3.5 74.8 ± 2.0 50.1 35.3 ± 3.2 72.5 ± 3.5 50.6

BP+MF+CC 47.9 ± 3.6 72.0 ± 2.4 58.7 48.8 ± 4.3 74.5 ± 1.5 60.3 37.2 ± 4.0 77.6 ± 2.2 53.7 60.0 ± 6.1 50.4 ± 4.3 55.0

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 80.2 ± 4.9 38.9 ± 7.5 55.9 68.1 ± 5.4 63.9 ± 8.3 66.0 92.3 ± 3.7 19.4 ± 7.7 42.3 47.3 ± 6.0 61.1 ± 7.2 53.8

MF 77.9 ± 5.6 32.4 ± 5.2 50.2 61.8 ± 5.2 41.2 ± 5.5 50.5 82.4 ± 6.2 35.3 ± 7.9 53.9 10.3 ± 3.7 79.4 ± 6.0 28.6

CC 83.9 ± 5.6 46.4 ± 10.0 62.4 79.0 ± 6.2 53.6 ± 12.4 65.1 88.7 ± 4.3 42.9 ± 9.7 61.7 58.1 ± 7.3 60.7 ± 10.0 59.4

BP+MF 79.3 ± 5.1 42.1 ± 9.9 57.8 71.7 ± 4.4 57.9 ± 7.5 64.4 94.6 ± 3.4 15.8 ± 7.6 38.7 58.7 ± 5.6 65.8 ± 6.7 62.1

BP+CC 78.0 ± 5.4 37.8 ± 8.9 54.3 78.0 ± 3.2 56.8 ± 7.3 66.6 90.1 ± 3.5 16.2 ± 6.8 38.2 51.6 ± 5.3 62.2 ± 9.2 56.7

MF+CC 91.8 ± 3.1 42.1 ± 6.7 62.2 76.5 ± 6.8 44.7 ± 8.4 58.5 92.9 ± 3.0 36.8 ± 4.2 58.5 38.8 ± 6.7 71.1 ± 6.2 52.5

BP+MF+CC 81.5 ± 3.8 52.6 ± 6.9 65.5 80.4 ± 4.6 63.2 ± 9.3 71.3 94.6 ± 1.9 26.3 ± 7.5 49.9 57.6 ± 4.3 73.7 ± 6.5 65.2

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 86.8 ± 3.4 41.2 ± 4.7 59.8 86.8 ± 4.0 47.1 ± 8.9 63.9 95.6 ± 2.2 26.5 ± 7.9 50.3 86.8 ± 4.5 47.1 ± 10.4 63.9

MF 78.5 ± 4.5 39.4 ± 10.4 55.6 84.6 ± 3.3 45.5 ± 10.0 62.0 89.2 ± 4.1 27.3 ± 8.6 49.3 84.6 ± 3.3 30.3 ± 10.7 50.6

CC 74.2 ± 7.7 41.2 ± 9.4 55.3 65.2 ± 6.4 50.0 ± 9.0 57.1 78.8 ± 5.0 26.5 ± 8.9 45.7 80.3 ± 5.7 38.2 ± 7.4 55.4

BP+MF 83.8 ± 4.0 47.1 ± 7.3 62.8 86.8 ± 4.0 55.9 ± 8.2 69.7 95.6 ± 2.2 26.5 ± 7.6 50.3 88.2 ± 3.6 47.1 ± 5.9 64.5

BP+CC 86.8 ± 5.8 47.1 ± 10.1 63.9 86.8 ± 4.0 58.8 ± 6.8 71.4 94.1 ± 2.3 20.6 ± 7.4 44.0 83.8 ± 4.5 50.0 ± 7.6 64.7

MF+CC 77.9 ± 4.3 61.8 ± 6.9 69.4 73.5 ± 4.7 50.0 ± 11.6 60.6 89.7 ± 3.7 29.4 ± 6.1 51.4 85.3 ± 4.8 55.9 ± 12.0 69.1

BP+MF+CC 83.8 ± 4.5 50.0 ± 10.8 64.7 80.9 ± 3.7 58.8 ± 10.8 69.0 97.1 ± 1.9 29.4 ± 10.2 53.4 89.7 ± 3.7 52.9 ± 11.3 68.9

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 10.0 ± 5.1 95.7 ± 1.9 30.9 26.7 ± 8.3 92.4 ± 2.3 49.7 10.0 ± 5.1 96.8 ± 1.2 31.1 16.7 ± 7.5 94.1 ± 2.0 39.6

MF 11.5 ± 6.9 90.1 ± 3.0 32.2 7.7 ± 4.4 91.6 ± 1.8 26.6 7.7 ± 5.7 96.9 ± 1.3 27.3 15.4 ± 7.0 92.4 ± 1.6 37.7

CC 12.5 ± 6.9 93.5 ± 2.1 34.2 12.5 ± 6.9 93.5 ± 2.0 34.2 12.5 ± 6.9 97.6 ± 1.2 34.9 12.5 ± 10.2 95.1 ± 1.7 34.5

BP+MF 13.3 ± 5.4 94.8 ± 1.8 35.5 26.7 ± 6.7 95.8 ± 1.3 50.6 3.3 ± 3.3 100.0 ± 0.0 18.2 23.3 ± 8.7 96.4 ± 1.6 47.4

BP+CC 20.0 ± 5.4 96.6 ± 1.1 44.0 16.7 ± 5.6 92.6 ± 1.8 39.3 3.3 ± 3.3 98.5 ± 0.8 18.0 33.3 ± 7.0 95.1 ± 1.5 56.3

MF+CC 17.2 ± 8.0 94.9 ± 1.3 40.4 13.8 ± 6.3 94.9 ± 1.7 36.2 10.3 ± 6.1 98.5 ± 0.8 31.9 10.3 ± 6.1 95.9 ± 1.7 31.4

BP+MF+CC 20.0 ± 7.4 95.7 ± 1.1 43.7 13.3 ± 7.4 94.7 ± 1.5 35.5 10.0 ± 5.1 99.5 ± 0.5 31.5 26.7 ± 4.4 96.6 ± 1.9 50.8

Table 5.10 Predictive accuracy for Naïve Bayes with the hierarchical HIP—MR method and
baseline “flat” feature selection methods
Feature Type | NB without Feature Selection | Lazy HIP–MR + NB | Lazy/Eager EntHIP–MR_n + NB | Lazy/Eager ReleHIP–MR_n + NB

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 50.2 ± 3.6 69.0 ± 2.6 58.9 50.7 ± 3.6 77.7 ± 3.4 62.8 34.9 ± 2.3 80.9 ± 3.1 53.1 55.0 ± 2.5 70.8 ± 2.9 62.4

MF 57.9 ± 4.1 46.2 ± 5.5 51.7 47.9 ± 3.4 58.2 ± 4.4 52.8 51.2 ± 2.6 57.0 ± 5.1 54.0 75.2 ± 4.7 41.8 ± 5.9 56.1

CC 43.9 ± 5.7 70.5 ± 3.4 55.6 39.8 ± 5.8 76.9 ± 2.3 55.3 29.6 ± 4.9 75.6 ± 3.2 47.3 41.8 ± 4.2 73.1 ± 3.4 55.3

BP+MF 54.0 ± 1.8 70.3 ± 3.0 61.6 52.0 ± 3.9 75.3 ± 2.2 62.6 35.2 ± 2.6 81.2 ± 2.5 53.5 58.2 ± 3.4 70.0 ± 2.1 63.8

BP+CC 52.6 ± 3.9 68.3 ± 2.6 59.9 42.7 ± 2.6 77.0 ± 2.6 57.3 40.8 ± 2.8 77.9 ± 2.6 56.4 53.0 ± 2.1 70.9 ± 2.1 61.3

MF+CC 51.2 ± 2.8 64.1 ± 4.3 57.3 42.9 ± 3.8 73.7 ± 4.6 56.2 42.9 ± 3.6 72.5 ± 4.3 55.8 55.3 ± 3.3 65.6 ± 3.8 60.2

BP+MF+CC 52.1 ± 4.4 70.0 ± 2.3 60.4 43.7 ± 5.0 77.6 ± 2.4 58.2 36.7 ± 3.6 79.3 ± 2.0 53.9 54.9 ± 4.0 70.9 ± 3.0 62.4

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 74.7 ± 3.5 36.1 ± 9.5 51.9 83.5 ± 3.2 27.8 ± 9.9 48.2 91.2 ± 2.8 22.2 ± 9.8 45.0 75.8 ± 3.1 41.7 ± 10.2 56.2

MF 82.4 ± 4.6 35.3 ± 8.6 53.9 77.9 ± 5.2 52.9 ± 8.5 64.2 94.1 ± 2.6 26.5 ± 9.3 49.9 83.8 ± 5.0 32.4 ± 10.0 52.1

CC 87.1 ± 4.1 50.0 ± 10.2 66.0 88.7 ± 5.6 50.0 ± 11.3 66.6 90.3 ± 3.6 32.1 ± 9.2 53.8 87.1 ± 4.1 46.4 ± 10.2 63.6

BP+MF 77.2 ± 3.9 50.0 ± 10.2 62.1 84.8 ± 3.8 39.5 ± 7.6 57.9 93.5 ± 2.4 10.5 ± 4.1 31.4 77.2 ± 4.3 50.0 ± 10.2 62.1

BP+CC 76.9 ± 5.1 48.6 ± 9.8 61.1 81.3 ± 4.1 32.4 ± 8.1 51.4 91.2 ± 3.2 16.2 ± 5.8 38.5 78.0 ± 5.3 45.9 ± 9.4 59.9

MF+CC 89.4 ± 3.2 57.9 ± 5.3 71.9 88.2 ± 3.1 57.9 ± 5.3 71.5 95.3 ± 2.7 36.8 ± 4.2 59.2 90.6 ± 3.0 47.4 ± 5.8 65.5

BP+MF+CC 81.5 ± 5.3 55.3 ± 8.2 67.1 83.7 ± 3.4 42.1 ± 9.2 59.4 93.5 ± 1.8 23.7 ± 6.5 47.1 80.4 ± 4.3 50.0 ± 8.3 63.4

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 82.4 ± 4.7 44.1 ± 5.9 60.3 92.6 ± 3.2 35.3 ± 6.0 57.2 94.1 ± 2.3 35.3 ± 5.7 57.6 88.2 ± 3.6 47.1 ± 6.2 64.5

MF 69.2 ± 7.4 48.5 ± 11.2 57.9 81.5 ± 3.0 42.4 ± 12.3 58.8 87.7 ± 3.0 33.3 ± 11.4 54.0 80.0 ± 6.0 48.5 ± 11.2 62.3

CC 75.8 ± 2.3 52.9 ± 10.0 63.3 74.2 ± 2.9 50.0 ± 10.1 60.9 80.3 ± 3.3 38.2 ± 10.5 55.4 74.2 ± 2.0 50.0 ± 10.1 60.9

BP+MF 83.8 ± 3.4 44.1 ± 7.0 60.8 91.2 ± 3.8 35.3 ± 7.3 56.7 95.6 ± 2.2 35.3 ± 7.3 58.1 86.8 ± 4.0 41.2 ± 6.6 59.8

BP+CC 79.4 ± 6.1 50.0 ± 8.4 63.0 85.3 ± 4.8 44.1 ± 9.6 61.3 94.1 ± 3.2 35.3 ± 8.4 57.6 82.4 ± 5.6 50.0 ± 8.4 64.2

MF+CC 75.0 ± 5.0 64.7 ± 12.5 69.7 79.4 ± 3.7 55.9 ± 13.3 66.6 88.2 ± 3.7 38.2 ± 10.3 58.0 79.4 ± 5.6 55.9 ± 12.1 66.6

BP+MF+CC 82.4 ± 4.2 47.1 ± 9.3 62.3 91.2 ± 2.3 35.3 ± 9.4 56.7 91.2 ± 3.2 35.3 ± 9.4 56.7 83.8 ± 4.0 47.1 ± 9.0 62.8

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 40.0 ± 8.3 84.9 ± 3.5 58.3 6.7 ± 4.4 96.8 ± 1.4 25.4 0.0 ± 0.0 97.8 ± 1.4 0.0 43.3 ± 10.0 87.0 ± 2.7 61.4

MF 11.5 ± 6.1 81.7 ± 4.8 30.7 0.0 ± 0.0 93.9 ± 1.9 0.0 0.0 ± 0.0 94.7 ± 2.2 0.0 3.8 ± 3.3 86.3 ± 3.5 18.1

CC 25.0 ± 7.1 86.2 ± 3.0 46.4 20.8 ± 7.5 91.1 ± 2.6 43.5 20.8 ± 7.5 92.7 ± 1.9 43.9 25.0 ± 7.1 85.4 ± 3.1 46.2

BP+MF 33.3 ± 11.1 85.4 ± 1.7 53.3 3.3 ± 3.3 96.9 ± 0.8 18.0 3.3 ± 3.3 97.4 ± 1.2 18.0 43.3 ± 7.1 84.9 ± 2.1 60.7

BP+CC 53.3 ± 8.9 85.8 ± 3.0 67.6 13.3 ± 5.4 98.5 ± 0.8 36.2 16.7 ± 5.6 98.5 ± 0.8 40.6 50.0 ± 9.0 86.8 ± 2.5 65.9

MF+CC 34.5 ± 10.5 87.3 ± 2.1 54.9 10.3 ± 6.1 94.4 ± 1.2 31.2 20.7 ± 10.0 93.9 ± 1.3 44.1 34.5 ± 10.5 87.3 ± 2.2 54.9

BP+MF+CC 36.7 ± 9.2 85.6 ± 2.7 56.0 10.0 ± 5.1 97.6 ± 0.8 31.2 13.3 ± 7.4 98.1 ± 0.8 36.1 46.7 ± 11.3 88.5 ± 2.3 64.3

Table 5.11 Predictive accuracy for Tree Augmented Naïve Bayes with the hierarchical HIP–MR method and baseline "flat" feature selection methods
Feature Type | TAN without Feature Selection | Lazy HIP–MR + TAN | Lazy/Eager EntHIP–MR_n + TAN | Lazy/Eager ReleHIP–MR_n + TAN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 34.0 ± 3.2 79.6 ± 2.3 52.0 43.5 ± 2.9 76.8 ± 2.8 57.8 29.2 ± 2.4 83.7 ± 2.6 49.4 40.7 ± 3.5 80.3 ± 1.9 57.2

MF 37.2 ± 5.8 61.4 ± 5.0 47.8 41.3 ± 5.9 55.1 ± 4.8 47.7 37.2 ± 5.9 63.3 ± 4.0 48.5 35.5 ± 4.2 69.0 ± 4.3 49.5

CC 39.8 ± 3.0 78.2 ± 2.2 55.8 41.8 ± 3.3 69.2 ± 3.9 53.8 29.6 ± 4.2 78.2 ± 3.0 48.1 35.7 ± 3.1 78.8 ± 3.1 53.1

BP+MF 35.2 ± 1.9 80.3 ± 2.2 53.2 44.6 ± 2.1 82.7 ± 2.5 60.7 31.8 ± 3.0 83.8 ± 1.5 51.6 40.4 ± 2.8 82.1 ± 2.6 57.6

BP+CC 42.7 ± 3.1 81.7 ± 2.7 59.1 52.1 ± 4.2 80.2 ± 2.4 64.6 32.9 ± 4.1 83.2 ± 3.0 52.3 41.8 ± 2.4 81.4 ± 1.9 58.3

MF+CC 51.2 ± 2.8 64.1 ± 4.3 57.3 40.0 ± 3.5 70.6 ± 3.2 53.1 40.6 ± 4.0 74.4 ± 3.5 55.0 41.8 ± 3.1 71.8 ± 3.0 54.8

BP+MF+CC 39.5 ± 2.8 80.1 ± 2.6 56.2 46.9 ± 3.7 79.8 ± 1.8 61.2 32.7 ± 3.5 83.8 ± 2.0 52.3 40.0 ± 3.6 81.2 ± 2.1 57.0

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 92.3 ± 2.9 19.4 ± 8.4 42.3 82.4 ± 4.1 30.6 ± 8.7 50.2 94.5 ± 3.0 8.3 ± 3.8 28.1 90.1 ± 2.6 25.0 ± 8.8 47.5

MF 91.2 ± 3.3 20.6 ± 5.0 43.3 77.9 ± 4.2 32.4 ± 6.4 50.2 91.2 ± 3.3 29.4 ± 5.2 51.8 91.2 ± 3.9 35.3 ± 7.9 56.7

CC 90.3 ± 3.6 32.1 ± 11.6 53.8 85.5 ± 3.9 35.7 ± 10.9 55.2 90.3 ± 3.6 28.6 ± 9.7 50.8 90.3 ± 3.7 39.3 ± 10.5 59.6

BP+MF 92.4 ± 3.3 23.7 ± 6.9 46.8 79.3 ± 3.5 39.5 ± 6.7 56.0 95.7 ± 2.5 13.2 ± 5.6 35.5 88.0 ± 2.6 31.6 ± 6.5 52.7

BP+CC 86.8 ± 4.0 18.9 ± 7.6 40.5 81.3 ± 3.7 40.5 ± 8.6 57.4 95.6 ± 1.8 8.1 ± 4.3 27.8 86.8 ± 3.6 32.4 ± 9.9 53.1

MF+CC 90.6 ± 3.3 31.6 ± 5.0 53.5 83.5 ± 4.8 47.4 ± 8.7 62.9 95.3 ± 2.5 28.9 ± 6.2 52.5 92.9 ± 2.7 44.7 ± 5.0 64.4

BP+MF+CC 92.4 ± 2.4 18.4 ± 5.3 41.2 79.3 ± 4.4 42.1 ± 7.6 57.8 95.7 ± 1.8 18.4 ± 6.2 42.0 88.0 ± 2.6 36.8 ± 6.7 56.9

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 89.7 ± 3.7 41.2 ± 4.9 60.8 77.9 ± 6.4 47.1 ± 5.3 60.6 95.6 ± 2.2 29.4 ± 4.1 53.0 86.8 ± 4.5 44.1 ± 5.9 61.9

MF 89.2 ± 4.0 33.3 ± 9.4 54.5 81.5 ± 4.5 54.5 ± 9.2 66.6 89.2 ± 3.2 36.4 ± 12.9 57.0 87.7 ± 3.0 45.5 ± 11.8 63.2

CC 75.8 ± 4.4 41.2 ± 8.3 55.9 72.7 ± 5.1 41.2 ± 8.3 54.7 84.8 ± 2.1 32.4 ± 7.0 52.4 80.3 ± 3.0 23.5 ± 10.4 43.4

BP+MF 86.8 ± 3.4 35.3 ± 5.4 55.4 77.9 ± 4.3 47.1 ± 8.8 60.6 94.1 ± 2.3 32.4 ± 6.4 55.2 91.2 ± 3.2 41.2 ± 6.8 61.3

BP+CC 88.2 ± 3.6 47.1 ± 9.7 64.5 70.6 ± 7.7 52.9 ± 9.6 61.1 91.2 ± 4.4 29.4 ± 5.2 51.8 89.7 ± 3.7 35.3 ± 6.5 56.3

MF+CC 88.2 ± 4.2 41.2 ± 10.0 60.3 80.9 ± 5.6 44.1 ± 9.6 59.7 91.2 ± 4.4 38.2 ± 10.6 59.0 88.2 ± 4.7 44.1 ± 10.0 62.4

BP+MF+CC 91.2 ± 3.2 41.2 ± 8.6 61.3 74.3 ± 5.1 54.0 ± 6.8 63.3 95.7 ± 2.2 33.3 ± 9.6 56.5 92.9 ± 2.4 47.0 ± 11.5 66.1

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 3.3 ± 3.3 98.9 ± 1.1 18.1 16.7 ± 9.0 89.7 ± 2.2 38.7 0.0 ± 0.0 100.0 ± 0.0 0.0 13.3 ± 8.9 95.1 ± 1.9 35.6

MF 0.0 ± 0.0 97.7 ± 1.2 0.0 0.0 ± 0.0 93.1 ± 2.4 0.0 0.0 ± 0.0 93.9 ± 1.5 0.0 0.0 ± 0.0 96.9 ± 1.3 0.0

CC 16.7 ± 7.0 95.9 ± 2.1 40.0 20.8 ± 10.6 91.1 ± 2.2 43.5 12.5 ± 6.9 94.3 ± 1.8 34.3 8.3 ± 5.7 93.5 ± 2.4 27.9

BP+MF 3.3 ± 3.3 99.0 ± 0.7 18.1 16.7 ± 5.6 94.8 ± 1.7 39.7 0.0 ± 0.0 100.0 ± 0.0 0.0 13.3 ± 7.4 97.9 ± 0.9 36.1

BP+CC 10.0 ± 5.1 99.0 ± 0.7 31.5 30.0 ± 7.8 90.2 ± 2.1 52.0 6.7 ± 4.4 99.5 ± 0.5 25.8 13.3 ± 5.4 96.6 ± 1.3 35.8

MF+CC 5.0 ± 5.0 98.5 ± 0.8 22.2 10.3 ± 6.1 97.5 ± 0.8 31.7 6.9 ± 5.7 99.5 ± 0.5 26.2 6.9 ± 5.7 98.0 ± 1.1 26.0

BP+MF+CC 0.0 ± 0.0 99.0 ± 0.6 0.0 23.3 ± 7.1 93.8 ± 1.8 46.7 3.3 ± 3.3 99.0 ± 0.6 18.1 16.7 ± 7.5 97.6 ± 1.1 40.4

Table 5.12 Predictive accuracy for Bayesian Network Augmented Naïve Bayes with the hierarchical HIP–MR method and baseline "flat" feature selection methods
Feature Type | BAN without Feature Selection | Lazy HIP–MR + BAN | Lazy/Eager EntHIP–MR_n + BAN | Lazy/Eager ReleHIP–MR_n + BAN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 28.7 ± 2.2 86.5 ± 1.8 49.8 37.0 ± 3.3 81.2 ± 2.7 54.8 25.8 ± 2.7 86.8 ± 1.8 47.3 41.1 ± 3.8 79.3 ± 2.2 57.1

MF 34.7 ± 4.5 66.5 ± 4.5 48.0 40.5 ± 4.5 57.0 ± 5.4 48.0 33.9 ± 4.3 63.9 ± 4.6 46.5 33.9 ± 4.4 70.9 ± 4.9 49.0

CC 33.7 ± 4.5 81.4 ± 2.2 52.4 40.8 ± 4.8 73.7 ± 2.3 54.9 25.5 ± 4.0 80.8 ± 1.9 45.4 37.8 ± 3.3 75.6 ± 3.2 53.4

BP+MF 30.0 ± 2.7 84.7 ± 1.7 50.4 39.9 ± 2.7 81.8 ± 2.3 57.1 28.6 ± 2.7 85.3 ± 1.4 49.4 47.9 ± 3.6 78.5 ± 2.3 61.3

BP+CC 29.1 ± 2.1 86.6 ± 1.7 50.2 42.2 ± 3.8 81.4 ± 2.0 58.6 31.9 ± 3.4 86.0 ± 1.5 52.4 46.5 ± 3.1 79.0 ± 1.7 60.6

MF+CC 35.3 ± 2.9 80.2 ± 3.2 53.2 44.1 ± 2.8 75.2 ± 3.0 57.6 37.6 ± 5.2 79.8 ± 3.7 54.8 46.5 ± 3.4 74.8 ± 3.8 59.0

BP+MF+CC 31.2 ± 2.9 85.2 ± 1.5 51.6 41.0 ± 2.8 79.8 ± 2.2 57.2 33.1 ± 3.4 83.1 ± 1.3 52.4 46.0 ± 3.7 75.9 ± 2.0 59.1

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 100.0 ± 0.0 0.0 ± 0.0 0.0 91.0 ± 3.2 25.8 ± 9.9 48.5 96.7 ± 2.4 2.5 ± 2.5 15.5 85.9 ± 3.5 29.2 ± 10.7 50.1

MF 91.2 ± 3.3 26.5 ± 3.4 49.2 79.4 ± 4.8 41.2 ± 7.2 57.2 94.1 ± 3.4 32.4 ± 6.3 55.2 83.8 ± 4.0 35.3 ± 7.1 54.4

CC 93.5 ± 2.6 28.6 ± 11.1 51.7 90.3 ± 3.6 42.9 ± 12.4 62.2 91.9 ± 3.6 28.6 ± 11.1 51.3 93.5 ± 2.6 32.1 ± 10.5 54.8

BP+MF 97.8 ± 1.5 0.0 ± 0.0 0.0 93.5 ± 2.7 25.0 ± 4.8 48.3 98.9 ± 0.9 10.0 ± 3.5 31.4 89.1 ± 3.4 27.5 ± 7.1 49.5

BP+CC 98.9 ± 1.1 0.0 ± 0.0 0.0 86.7 ± 3.3 20.8 ± 6.7 42.5 96.7 ± 1.5 3.3 ± 3.0 17.9 84.4 ± 3.1 27.5 ± 3.6 48.2

MF+CC 95.3 ± 1.9 31.6 ± 5.3 54.9 82.4 ± 4.8 50.0 ± 9.1 64.2 96.5 ± 2.4 31.6 ± 6.5 55.2 95.3 ± 2.5 44.7 ± 6.2 65.3

BP+MF+CC 98.9 ± 1.1 2.6 ± 2.5 16.0 91.1 ± 2.5 25.0 ± 5.8 47.7 97.8 ± 1.3 10.0 ± 5.0 31.3 90.0 ± 1.8 37.5 ± 6.9 58.1

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 98.5 ± 1.4 26.5 ± 5.0 51.1 92.9 ± 2.4 34.5 ± 6.0 56.6 98.6 ± 1.4 22.0 ± 6.6 46.6 94.3 ± 2.3 41.2 ± 5.5 62.3

MF 90.8 ± 3.3 27.3 ± 10.0 49.8 84.6 ± 3.0 36.4 ± 11.9 55.5 90.8 ± 2.5 27.3 ± 11.1 49.8 87.7 ± 3.0 27.3 ± 11.1 48.9

CC 86.4 ± 3.3 35.3 ± 11.2 55.2 83.3 ± 4.0 44.1 ± 11.1 60.6 83.3 ± 3.3 32.4 ± 11.7 52.0 83.3 ± 3.8 41.2 ± 11.9 58.6

BP+MF 98.5 ± 1.4 29.4 ± 6.4 53.8 89.0 ± 4.2 36.8 ± 7.7 57.2 98.6 ± 1.4 26.2 ± 7.0 50.8 94.3 ± 3.2 34.8 ± 7.3 57.3

BP+CC 98.5 ± 1.4 29.4 ± 6.4 53.8 90.0 ± 3.7 42.0 ± 7.9 61.5 97.1 ± 1.9 24.5 ± 6.6 48.8 91.4 ± 3.8 42.0 ± 7.9 62.0

MF+CC 91.2 ± 3.2 26.5 ± 8.8 49.2 85.3 ± 3.1 41.2 ± 11.5 59.3 89.7 ± 3.2 23.5 ± 7.4 45.9 86.8 ± 3.4 32.4 ± 7.7 53.0

BP+MF+CC 98.5 ± 1.4 26.5 ± 10.5 51.1 92.9 ± 2.9 40.0 ± 8.5 61.0 98.6 ± 1.4 29.2 ± 10.5 53.7 92.9 ± 2.4 44.0 ± 8.3 63.9

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 0.0 ± 0.0 100.0 ± 0.0 0.0 3.3 ± 3.3 98.9 ± 0.7 18.1 10.0 ± 10.0 100.0 ± 0.0 31.6 10.0 ± 7.1 98.9 ± 0.7 31.4

MF 0.0 ± 0.0 99.2 ± 0.8 0.0 0.0 ± 0.0 97.7 ± 1.1 0.0 0.0 ± 0.0 98.5 ± 1.0 0.0 0.0 ± 0.0 96.2 ± 1.7 0.0

CC 12.5 ± 6.1 99.2 ± 0.8 35.2 12.5 ± 6.1 95.1 ± 1.3 34.5 12.5 ± 6.9 99.2 ± 0.8 35.2 16.7 ± 7.0 94.3 ± 2.1 39.7

BP+MF 0.0 ± 0.0 100.0 ± 0.0 0.0 17.8 ± 12.0 81.5 ± 11.7 38.1 20.0 ± 13.3 82.5 ± 11.8 40.6 24.4 ± 11.7 84.5 ± 10.6 45.4

BP+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 6.7 ± 4.4 98.5 ± 0.8 25.7 0.0 ± 0.0 100.0 ± 0.0 0.0 13.3 ± 5.4 99.0 ± 0.7 36.3

MF+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 3.4 ± 3.4 98.5 ± 0.8 18.3 3.4 ± 3.4 100.0 ± 0.0 18.4 6.9 ± 5.7 98.5 ± 0.8 26.1

BP+MF+CC 0.0 ± 0.0 100.0 ± 0.0 0.0 0.0 ± 0.0 99.0 ± 0.6 0.0 0.0 ± 0.0 100.0 ± 0.0 0.0 10.0 ± 5.1 100.0 ± 0.0 31.6

Table 5.13 Predictive accuracy for K-Nearest Neighbour (k = 3) with the hierarchical HIP—MR
method and baseline “flat” feature selection methods
Feature Type | KNN without Feature Selection | Lazy HIP–MR + KNN | Lazy/Eager EntHIP–MR_n + KNN | Lazy/Eager ReleHIP–MR_n + KNN

Caenorhabditis elegans Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 48.3 ± 4.8 74.0 ± 3.0 59.8 45.9 ± 2.6 73.4 ± 2.9 58.0 36.8 ± 4.0 76.8 ± 3.5 53.2 65.6 ± 5.6 53.6 ± 4.0 59.3

MF 41.3 ± 3.3 54.4 ± 4.4 47.4 35.5 ± 4.0 60.1 ± 3.7 46.2 27.3 ± 3.5 58.9 ± 5.1 40.1 33.9 ± 3.7 69.0 ± 3.9 48.4

CC 39.8 ± 6.5 67.9 ± 3.3 52.0 35.7 ± 4.0 72.4 ± 2.6 50.8 32.7 ± 4.0 75.6 ± 2.8 49.7 45.9 ± 4.8 66.7 ± 2.7 55.3

BP+MF 49.3 ± 3.5 72.9 ± 1.2 59.9 51.2 ± 2.7 75.6 ± 1.9 62.2 37.6 ± 4.0 79.7 ± 2.1 54.7 61.5 ± 4.9 62.1 ± 3.7 61.8

BP+CC 42.7 ± 3.4 72.7 ± 2.7 55.7 46.0 ± 3.2 79.1 ± 2.3 60.3 36.6 ± 3.6 77.6 ± 1.6 53.3 65.7 ± 3.3 60.5 ± 3.6 63.0

MF+CC 44.7 ± 2.7 68.3 ± 2.6 55.3 39.4 ± 3.3 71.4 ± 2.8 53.0 37.1 ± 2.9 73.7 ± 3.0 52.3 49.4 ± 3.4 65.3 ± 3.3 56.8

BP+MF+CC 47.9 ± 3.6 72.0 ± 2.4 58.7 39.1 ± 3.0 77.3 ± 2.0 55.0 33.0 ± 3.2 77.9 ± 2.2 50.7 62.3 ± 4.0 59.7 ± 2.8 61.0

Drosophila melanogaster Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 80.2 ± 4.9 38.9 ± 7.5 55.9 82.4 ± 4.8 38.9 ± 8.4 56.6 90.1 ± 2.6 27.8 ± 8.1 50.0 71.4 ± 4.5 52.8 ± 9.1 61.4

MF 77.9 ± 5.6 32.4 ± 5.2 50.2 69.1 ± 5.9 35.3 ± 6.2 49.4 85.3 ± 3.1 32.4 ± 8.0 52.6 36.8 ± 7.0 55.9 ± 7.9 45.4

CC 83.9 ± 5.6 46.4 ± 10.0 62.4 83.9 ± 5.6 53.6 ± 9.0 67.1 80.6 ± 6.0 46.4 ± 8.7 61.2 72.6 ± 5.7 57.1 ± 11.3 64.4

BP+MF 79.3 ± 5.1 42.1 ± 9.9 57.8 79.3 ± 4.6 50.0 ± 9.1 63.0 90.2 ± 5.1 18.4 ± 6.5 40.7 68.5 ± 4.0 60.5 ± 5.6 64.4

BP+CC 78.0 ± 5.4 37.8 ± 8.9 54.3 83.5 ± 4.5 48.6 ± 9.3 63.7 89.0 ± 4.4 29.7 ± 8.9 51.4 68.1 ± 4.6 51.4 ± 8.2 59.2

MF+CC 91.8 ± 3.1 42.1 ± 6.7 62.2 89.4 ± 3.3 44.7 ± 6.5 63.2 90.6 ± 4.1 47.4 ± 5.8 65.5 69.4 ± 5.4 60.5 ± 4.1 64.8

BP+MF+CC 81.5 ± 3.8 52.6 ± 6.9 65.5 89.1 ± 2.9 47.4 ± 5.8 65.0 91.3 ± 2.8 28.9 ± 6.2 51.4 73.9 ± 4.1 65.8 ± 7.6 69.7

Mus musculus Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 86.8 ± 3.4 41.2 ± 4.7 59.8 92.6 ± 3.4 41.2 ± 6.6 61.8 94.1 ± 2.6 26.5 ± 6.1 49.9 91.2 ± 3.2 50.0 ± 8.6 67.5

MF 78.5 ± 4.5 39.4 ± 10.4 55.6 90.8 ± 3.6 27.3 ± 10.0 49.8 87.7 ± 3.9 30.3 ± 8.1 51.5 87.7 ± 3.9 48.5 ± 9.2 65.2

CC 74.2 ± 7.7 41.2 ± 9.4 55.3 75.8 ± 7.2 44.1 ± 11.0 57.8 75.8 ± 4.4 26.5 ± 7.3 44.8 77.3 ± 5.1 44.1 ± 7.0 58.4

BP+MF 83.8 ± 4.0 47.1 ± 7.3 62.8 85.3 ± 3.7 50.0 ± 10.9 65.3 97.1 ± 1.9 29.4 ± 8.4 53.4 80.9 ± 4.3 41.2 ± 7.0 57.7

BP+CC 86.8 ± 5.8 47.1 ± 10.1 63.9 83.8 ± 5.8 38.2 ± 7.1 56.6 95.6 ± 2.2 32.4 ± 7.2 55.7 82.4 ± 5.6 47.1 ± 6.0 62.3

MF+CC 77.9 ± 4.3 61.8 ± 6.9 69.4 80.9 ± 5.7 47.1 ± 9.0 61.7 85.3 ± 3.7 44.1 ± 11.4 61.3 82.4 ± 5.1 52.9 ± 8.8 66.0

BP+MF+CC 83.8 ± 4.5 50.0 ± 10.8 64.7 85.3 ± 3.7 44.1 ± 10.6 61.3 95.6 ± 2.2 35.3 ± 11.7 58.1 79.4 ± 5.3 58.8 ± 7.4 68.3

Saccharomyces cerevisiae Datasets


Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM Sen. Spe. GM

BP 10.0 ± 5.1 95.7 ± 1.9 30.9 10.0 ± 5.1 95.1 ± 1.5 30.8 10.0 ± 5.1 97.8 ± 0.9 31.3 23.3 ± 7.1 96.2 ± 1.8 47.3

MF 11.5 ± 6.9 90.1 ± 3.0 32.2 3.8 ± 3.8 94.7 ± 2.0 19.0 7.7 ± 5.7 95.4 ± 1.7 27.1 7.7 ± 5.7 93.1 ± 1.7 26.8

CC 12.5 ± 6.9 93.5 ± 2.1 34.2 12.5 ± 6.9 91.9 ± 2.1 33.9 12.5 ± 6.9 97.6 ± 1.2 34.9 12.5 ± 10.2 95.1 ± 1.7 34.5

BP+MF 13.3 ± 5.4 94.8 ± 1.8 35.5 13.3 ± 5.4 97.4 ± 1.2 36.0 0.0 ± 0.0 98.4 ± 0.8 0.0 16.7 ± 7.5 95.8 ± 1.1 40.0

BP+CC 20.0 ± 5.4 96.6 ± 1.1 44.0 16.7 ± 7.5 98.5 ± 0.8 40.6 10.0 ± 5.1 97.5 ± 0.8 31.2 20.0 ± 7.4 96.1 ± 1.0 43.8

MF+CC 17.2 ± 8.0 94.9 ± 1.3 40.4 6.9 ± 5.7 97.5 ± 0.8 25.9 10.3 ± 6.1 98.5 ± 0.8 31.9 6.9 ± 5.7 96.4 ± 0.8 25.8

BP+MF+CC 20.0 ± 7.4 95.7 ± 1.1 43.7 20.0 ± 7.4 97.6 ± 0.8 44.2 10.0 ± 5.1 99.0 ± 0.7 31.5 10.0 ± 5.1 96.6 ± 1.6 31.1

[Four boxplot panels of GMean-based ranks (scale 1.0 to 4.0): (a) EntHIP_n + NB, ReleHIP_n + NB, NB and HIP+NB; (b) EntHIP_n + TAN, TAN, ReleHIP_n + TAN and HIP+TAN; (c) BAN, EntHIP_n + BAN, ReleHIP_n + BAN and HIP+BAN; (d) EntHIP_n + KNN, ReleHIP_n + KNN, KNN and HIP+KNN]

Fig. 5.7 Boxplots showing the distributions of ranks obtained by HIP and other feature selection
methods working with different Lazy classifiers

the highest GMean value the largest number of times, over almost all 28 datasets, except for the Caenorhabditis elegans and Drosophila melanogaster datasets that use molecular function terms as features.
HIP+BAN obtains the best average rank (1.089), followed in order by ReleHIP_n + BAN (2.214), EntHIP_n + BAN (3.321) and BAN without feature selection (3.375).
As shown in Table 5.5, HIP+KNN obtains the highest GMean value most often when predicting the genes of three model organisms, i.e. 6 times for Caenorhabditis elegans, 6 times for Drosophila melanogaster and 4 times for Mus musculus, while for Saccharomyces cerevisiae both HIP+KNN and KNN without feature selection obtain the highest GMean value 3 out of 7 times.
HIP+KNN obtains the best average rank (1.446), while KNN without feature selection obtains the second best average rank (1.982), followed by ReleHIP_n + KNN (3.071) and EntHIP_n + KNN (3.500).
Tables 5.6, 5.7, 5.8 and 5.9 compare the predictive accuracies obtained by NB, TAN, BAN and KNN when using MR or different "flat" feature selection methods (i.e. EntMR_n and ReleMR_n) in a pre-processing phase.

[Four boxplot panels of GMean-based ranks (scale 1.0 to 4.0): (a) EntMR_n + NB, ReleMR_n + NB, MR+NB and NB; (b) EntMR_n + TAN, TAN, ReleMR_n + TAN and MR+TAN; (c) EntMR_n + BAN, BAN, ReleMR_n + BAN and MR+BAN; (d) EntMR_n + KNN, ReleMR_n + KNN, KNN and MR+KNN]

Fig. 5.8 Boxplots showing the distributions of ranks obtained by MR and other feature selection
methods working with different lazy classifiers

[Four boxplot panels of GMean-based ranks (scale 1.0 to 4.0): (a) EntHIP–MR_n + NB, HIP–MR+NB, NB and ReleHIP–MR_n + NB; (b) EntHIP–MR_n + TAN, TAN, ReleHIP–MR_n + TAN and HIP–MR+TAN; (c) BAN, EntHIP–MR_n + BAN, HIP–MR+BAN and ReleHIP–MR_n + BAN; (d) EntHIP–MR_n + KNN, HIP–MR+KNN, KNN and ReleHIP–MR_n + KNN]

Fig. 5.9 Boxplots showing the distributions of ranks obtained by HIP—MR and other feature
selection methods working with different lazy classifiers

As shown in Table 5.6, the MR+NB method obtains the highest GMean value 3 out of 7 times for predicting Caenorhabditis elegans genes, while the other methods each obtain the highest GMean value only 1 or 2 times. MR+NB also obtains the highest GMean value most often (5 out of 7 times) when predicting the Drosophila melanogaster genes. However, it obtains the highest GMean value only 3 times and 0 times, respectively, for predicting Mus musculus and Saccharomyces cerevisiae genes, whereas Naïve Bayes without feature selection obtains the highest GMean value 3 and 5 times, respectively.
As shown in Fig. 5.8, the NB without feature selection method obtains the best overall results, with an average rank of 1.857, while the second best rank (2.071) is obtained by the MR+NB method. The average rank for ReleMR_n + NB is 2.232, whereas EntMR_n + NB obtains the worst average rank (3.839) in terms of GMean value. NB performs well even without feature selection, since it ranks first on 11 out of 28 datasets, as indicated by the boldfaced GMean values in Table 5.6. However, MR+NB shows competitive performance, since it obtains the highest GMean value on 10 out of 28 datasets.
Table 5.7 reports the results for the MR and other "flat" feature selection methods working with the Tree Augmented Naïve Bayes classifier. MR+TAN performs best in predicting the genes of all four model organisms, obtaining the highest GMean value 5 out of 7 times for Caenorhabditis elegans, Mus musculus and Saccharomyces cerevisiae, and 6 out of 7 times for Drosophila melanogaster.
MR+TAN obtains the best average rank (1.304), while the second best rank (2.304) is obtained by ReleMR_n + TAN. TAN without feature selection obtains the third best average rank (2.714), while EntMR_n + TAN obtains the worst average rank (3.679).
Table 5.8 reports the results for the different feature selection methods working with the Bayesian Network Augmented Naïve Bayes classifier. MR+BAN obtains the highest GMean value on almost all 28 datasets, except for the Caenorhabditis elegans and Saccharomyces cerevisiae datasets that use molecular function terms as features.
MR+BAN obtains the best average rank (1.125), while ReleMR_n + BAN obtains the second best rank (2.161). The Bayesian Network Augmented Naïve Bayes without feature selection and EntMR_n + BAN methods both obtain the worst average rank (3.357).
Table 5.9 reports the results for the different feature selection methods working with the K-Nearest Neighbour classifier. MR+KNN obtains the highest GMean value most often for three model organisms, i.e. Caenorhabditis elegans (3 times), Drosophila melanogaster (5 times) and Mus musculus (6 times), while KNN without feature selection also obtains the highest GMean value 3 out of 7 times for predicting Caenorhabditis elegans genes, and ReleMR_n + KNN obtains the highest GMean value 3 times for predicting Saccharomyces cerevisiae genes.
MR+KNN obtains the best average rank (1.804), while KNN without feature selection obtains the second best average rank (2.304), followed by ReleMR_n + KNN and EntMR_n + KNN.

Tables 5.10, 5.11, 5.12 and 5.13 compare the predictive accuracies obtained by NB, TAN, BAN and KNN when using HIP–MR or different "flat" feature selection methods in a pre-processing phase: two hybrid lazy/eager "flat" (non-hierarchical) feature selection methods, namely the hybrid lazy/eager entropy-based method (selecting the same number n of features as HIP–MR), denoted EntHIP–MR_n, and the hybrid lazy/eager relevance-based method (selecting the same number n of features as HIP–MR), denoted ReleHIP–MR_n. The tables also report results for NB, TAN, BAN and KNN without any feature selection, as a natural baseline.
ReleHIP–MR_n + NB obtains the highest GMean value 5 out of 7 times on the Caenorhabditis elegans datasets, while HIP–MR+NB and Naïve Bayes without feature selection each obtain the highest GMean value only once. For Drosophila melanogaster, HIP–MR+NB and ReleHIP–MR_n + NB each obtain the highest GMean value 2 out of 7 times, whereas Naïve Bayes without feature selection obtains the highest GMean value 4 times. For Mus musculus and Saccharomyces cerevisiae, ReleHIP–MR_n + NB obtains the highest GMean value 4 out of 7 times in each case, while Naïve Bayes without feature selection obtains the highest GMean value 3 and 4 times, respectively.
As shown in Fig. 5.9, the ReleHIP–MR_n + NB method obtains the best overall results, with an average rank of 1.661, while the second best rank (1.786) is obtained by Naïve Bayes without feature selection. The average rank for HIP–MR+NB is 2.893, whereas EntHIP–MR_n + NB obtains the worst average rank (3.661) in terms of GMean value. The ReleHIP–MR_n + NB method clearly performs best, since it ranks first on 15 out of 28 datasets, as indicated by the boldfaced GMean values in Table 5.10.
Table 5.11 reports the results for the HIP–MR and "flat" feature selection methods working with the Tree Augmented Naïve Bayes classifier. HIP–MR performs best in predicting the genes of three model organisms, obtaining the highest GMean value 4 times for both Caenorhabditis elegans and Drosophila melanogaster, and 6 out of 7 times for Saccharomyces cerevisiae. For predicting Mus musculus genes, ReleHIP–MR_n + TAN obtains the highest GMean value 4 out of 7 times.
HIP–MR+TAN obtains the best average rank (1.732), while the second best rank (2.054) is obtained by ReleHIP–MR_n + TAN. TAN without feature selection obtains the third best average rank (2.750), and EntHIP–MR_n + TAN obtains the worst average rank (3.464).
Table 5.12 reports the results for the different feature selection methods working with the Bayesian Network Augmented Naïve Bayes classifier. ReleHIP–MR_n + BAN obtains the highest GMean value 6 out of 7 times for predicting Caenorhabditis elegans genes, and 5 out of 7 times for predicting both Drosophila melanogaster and Saccharomyces cerevisiae genes. It also obtains the highest GMean value 4 times for predicting Mus musculus genes, while HIP–MR+BAN obtains the highest GMean value 3 times. Note that EntHIP–MR_n + BAN and the Bayesian Network Augmented Naïve Bayes without feature selection obtain the highest GMean value only 1 and 0 times, respectively, over all 28 datasets.

ReleHIP–MR_n + BAN obtains the best average rank (1.411), while HIP–MR+BAN obtains the second best rank (2.036). Both EntHIP–MR_n + BAN and the Bayesian Network Augmented Naïve Bayes without feature selection obtain the worst results, with average ranks of 3.143 and 3.411, respectively.
As shown in Table 5.13, which is analogous to Table 5.12, ReleHIP–MR_n + KNN obtains the highest GMean value most often for three model organisms, i.e. Caenorhabditis elegans (5 times), Drosophila melanogaster (3 times) and Mus musculus (4 times), while KNN without feature selection obtains the highest GMean value 3 out of 7 times for predicting Saccharomyces cerevisiae genes.
ReleHIP–MR_n + KNN obtains the best average rank (1.786), while KNN without feature selection obtains the second best average rank (2.250), followed by HIP–MR+KNN (2.571) and EntHIP–MR_n + KNN (3.393).

5.6 Discussion

5.6.1 Statistical Analysis of GMean Value Differences Between HIP, MR, HIP–MR and Other Feature Selection Methods

The Friedman test with the Holm post-hoc correction was used to test the statistical significance of the differences between the GMean values of the feature selection methods working with the NB, TAN, BAN and KNN classifiers. The results are shown in Table 5.14, where, for each classifier, the table presents the feature selection method, its average rank (R̄), the corresponding p-value and the adjusted significance level (α) obtained with the Holm post-hoc method. A result is significant at the corresponding adjusted significance level when its p-value is smaller than the adjusted α.
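To illustrate the procedure, the following minimal Python sketch (an illustration only, not the implementation used for the experiments in this book) runs the Friedman test over per-dataset GMean values of four methods and then applies the Holm step-down correction to the pairwise comparisons against the control (best-ranked) method. The array gmean_scores and the method names are hypothetical placeholders, and ties in the ranking are ignored for simplicity.

import numpy as np
from scipy.stats import friedmanchisquare, norm

methods = ["HIP+NB", "No FS + NB", "ReleHIP_n + NB", "EntHIP_n + NB"]  # hypothetical labels
gmean_scores = np.random.rand(28, 4) * 100  # placeholder GMean values: 28 datasets x 4 methods

# Friedman test: do the methods' GMean values differ significantly across datasets?
stat, p_value = friedmanchisquare(*[gmean_scores[:, j] for j in range(4)])
print("Friedman chi-square = %.3f, p = %.3g" % (stat, p_value))

# Average ranks per method (rank 1 = highest GMean on a dataset; ties ignored).
ranks = np.argsort(np.argsort(-gmean_scores, axis=1), axis=1) + 1
avg_ranks = ranks.mean(axis=0)
control = int(np.argmin(avg_ranks))  # control method = best (lowest) average rank

# Pairwise z-tests of each method against the control, with Holm's step-down correction.
n_datasets, k = gmean_scores.shape
se = np.sqrt(k * (k + 1) / (6.0 * n_datasets))
p_vals = {}
for j in range(k):
    if j != control:
        z = (avg_ranks[j] - avg_ranks[control]) / se
        p_vals[methods[j]] = 2.0 * (1.0 - norm.cdf(abs(z)))

alpha = 0.05
for i, (name, p) in enumerate(sorted(p_vals.items(), key=lambda kv: kv[1])):
    adjusted_alpha = alpha / (len(p_vals) - i)  # Holm: alpha/m, alpha/(m-1), ..., alpha/1
    print("%s: p = %.3g, adjusted alpha = %.3g, significant: %s"
          % (name, p, adjusted_alpha, p < adjusted_alpha))

With three comparisons against the control, the adjusted significance levels become 1.67e-02, 2.50e-02 and 5.00e-02, which matches the α values reported in Table 5.14.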
As shown in the HIP group (top part) of Table 5.14, HIP is the control method when working with all four classifiers. It significantly outperforms all other feature selection methods with every classifier, and it also shows its capacity to significantly improve the predictive performance of NB, TAN and BAN.
As shown in the MR group (middle part) of Table 5.14, MR is the control method when working with the TAN, BAN and KNN classifiers. In detail, it significantly outperforms all other methods when working with TAN; it significantly improves the predictive performance of BAN and significantly outperforms the EntMR_n method when working with the BAN classifier; and it significantly outperforms EntMR_n when working with KNN. Note that, when working with the NB classifier, there is no significant difference in predictive performance among the different methods.
Table 5.14 Statistical test results of the methods' GMean values according to the non-parametric Friedman test with the Holm post-hoc correction (for each classifier: feature selection method, average rank R̄, p-value, adjusted significance level α)

HIP group | NB | TAN | BAN | KNN
1 | HIP 1.357 - - | HIP 1.411 - - | HIP 1.089 - - | HIP 1.446 - -
2 | No FS 2.054 2.17e-02 5.00e-02 | ReleHIP_n 2.500 7.97e-04 5.00e-02 | ReleHIP_n 2.214 5.55e-04 5.00e-02 | No FS 1.982 6.01e-02 5.00e-02
3 | ReleHIP_n 2.732 3.36e-05 2.50e-02 | No FS 2.679 1.19e-04 2.50e-02 | EntHIP_n 3.321 4.90e-11 2.50e-02 | ReleHIP_n 3.071 1.24e-06 2.50e-02
4 | EntHIP_n 3.857 2.15e-13 1.67e-02 | EntHIP_n 3.411 3.38e-09 1.67e-02 | No FS 3.375 1.72e-11 1.67e-02 | EntHIP_n 3.500 1.31e-09 1.67e-02
MR group | NB | TAN | BAN | KNN
1 | No FS 1.857 - - | MR 1.304 - - | MR 1.125 - - | MR 1.804 - -
2 | MR 2.071 2.68e-01 5.00e-02 | ReleMR_n 2.304 1.87e-03 5.00e-02 | ReleMR_n 2.161 1.34e-03 5.00e-02 | No FS 2.304 7.37e-02 5.00e-02
3 | ReleMR_n 2.232 1.39e-01 2.50e-02 | No FS 2.714 2.18e-05 2.50e-02 | No FS 3.357 4.90e-11 2.50e-02 | ReleMR_n 2.446 3.14e-02 2.50e-02
4 | EntMR_n 3.839 4.62e-09 1.67e-02 | EntMR_n 3.679 2.91e-12 1.67e-02 | EntMR_n 3.357 4.90e-11 1.67e-02 | EntMR_n 3.446 9.73e-07 1.67e-02
HIP–MR group | NB | TAN | BAN | KNN
1 | ReleHIP–MR_n 1.661 - - | HIP–MR 1.732 - - | ReleHIP–MR_n 1.411 - - | ReleHIP–MR_n 1.786 - -
2 | No FS 1.786 3.59e-01 5.00e-02 | ReleHIP–MR_n 2.054 1.75e-01 5.00e-02 | HIP–MR 2.036 3.50e-02 5.00e-02 | No FS 2.250 8.93e-02 5.00e-02
3 | HIP–MR 2.893 1.78e-04 2.50e-02 | No FS 2.750 1.58e-03 2.50e-02 | EntHIP–MR_n 3.143 2.58e-07 2.50e-02 | HIP–MR 2.571 1.15e-02 2.50e-02
4 | EntHIP–MR_n 3.661 3.38e-09 1.67e-02 | EntHIP–MR_n 3.464 2.58e-07 1.67e-02 | No FS 3.411 3.38e-09 1.67e-02 | EntHIP–MR_n 3.393 1.60e-06 1.67e-02

As shown in the HIP–MR group (bottom part) of Table 5.14, ReleHIP–MR_n is the control method when working with three of the classifiers (NB, BAN and KNN), compared with the other feature selection methods. The results show that ReleHIP–MR_n significantly outperforms the two other types of feature selection, HIP–MR and EntHIP–MR_n, whereas it does not show a significant difference in GMean value compared with the NB and KNN classifiers without feature selection. When working with the TAN classifier, HIP–MR is the control method and significantly outperforms both TAN without feature selection and EntHIP–MR_n + TAN.
In conclusion, although HIP–MR outperforms the EntHIP–MR_n method, it cannot select a more powerful feature subset than a method that only considers the relevance values of features without using the hierarchical information, since ReleHIP–MR_n performs better. However, the MR and HIP methods, which both consider the hierarchical information, successfully select feature subsets with greater predictive power and improve the predictive performance of the NB, TAN, BAN and KNN classifiers.

5.6.2 Robustness Against the Class Imbalance Problem

Recall that Table 4.1 reports the degree of class imbalance for all 28 datasets. In general, the degree of class imbalance in the datasets ranges from 0.35 to 0.84, where the Saccharomyces cerevisiae datasets have the highest degree of class imbalance and the Caenorhabditis elegans datasets have the lowest. Therefore, this chapter further evaluates the robustness of the different feature selection methods against the imbalanced class distribution.
The linear correlation coefficient r between the degree of class imbalance and the GMean values was calculated for each combination of feature selection method and classifier. As shown in Fig. 5.10, NB, TAN, BAN and KNN without feature selection all show negative r values, indicating that a higher degree of class imbalance leads to lower predictive performance. However, the correlation coefficients for HIP with the different Bayesian classifiers are close to 0, indicating its capacity to improve robustness against the imbalanced class distribution problem. MR also shows a capacity to improve the classifiers' robustness, since it obtains better correlation coefficient values with all classifiers (except NB) than the corresponding classifiers without feature selection. Analogously, HIP–MR also improves the robustness of the TAN and KNN classifiers, although the change in the correlation coefficient for the KNN classifier is small (an increase of only 0.025 in the r value).
The correlation coefficient between the difference (Diff) between sensitivity and specificity, as defined in Eq. 5.4, and the degree of class imbalance was also calculated, in order to further investigate the negative correlation between GMean value and degree of class imbalance observed for most methods. As shown in Fig. 5.11, the r values for the HIP method working with NB, TAN and BAN range from 0.208 to 0.332, which are lower than the r values obtained by all other methods.

[Scatter plots of GMean value (y-axis) versus degree of class imbalance (x-axis, 0.2 to 0.9) for each combination of feature selection method and classifier, with linear correlation coefficients: no feature selection: NB r = -0.258, TAN r = -0.801, BAN r = -0.789, KNN r = -0.671; HIP: NB r = -0.035, TAN r = 0.088, BAN r = 0.103, KNN r = -0.541; MR: NB r = -0.483, TAN r = -0.515, BAN r = -0.463, KNN r = -0.554; HIP–MR: NB r = -0.747, TAN r = -0.592, BAN r = -0.790, KNN r = -0.646]

Fig. 5.10 Linear relationship between the degree of class imbalance and GMean values obtained by different eager feature selection methods and classifiers

This fact indicates that HIP with the NB, TAN and BAN classifiers tends to obtain similar values of sensitivity and specificity, leading to stronger robustness against the class imbalance issue.

Diff = Max(Sen, Spe) − Min(Sen, Spe) (5.4)
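As a concrete illustration of how these quantities are obtained, the short Python sketch below computes GMean and Diff from sensitivity and specificity values and then measures their linear correlation with the degree of class imbalance across datasets. The arrays are hypothetical placeholders rather than the actual experimental values.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-dataset results for one feature selection method + classifier pair (values in %).
sensitivity = np.array([50.2, 57.9, 43.9, 54.0, 52.6, 51.2, 52.1])
specificity = np.array([69.0, 46.2, 70.5, 70.3, 68.3, 64.1, 70.0])
imbalance   = np.array([0.35, 0.42, 0.48, 0.55, 0.63, 0.71, 0.84])  # degree of class imbalance

gmean = np.sqrt(sensitivity * specificity)   # geometric mean of sensitivity and specificity
diff  = np.abs(sensitivity - specificity)    # Eq. 5.4: Max(Sen, Spe) - Min(Sen, Spe)

r_gmean, _ = pearsonr(imbalance, gmean)      # linear correlation between imbalance and GMean
r_diff, _  = pearsonr(imbalance, diff)       # linear correlation between imbalance and Diff
print("r(imbalance, GMean) = %.3f, r(imbalance, Diff) = %.3f" % (r_gmean, r_diff))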



[Scatter plots of Diff (y-axis) versus degree of class imbalance (x-axis, 0.2 to 0.9) for each combination of feature selection method and classifier, with linear correlation coefficients: no feature selection: NB r = 0.793, TAN r = 0.946, BAN r = 0.884, KNN r = 0.936; HIP: NB r = 0.332, TAN r = 0.208, BAN r = 0.292, KNN r = 0.870; MR: NB r = 0.790, TAN r = 0.798, BAN r = 0.786, KNN r = 0.797; HIP–MR: NB r = 0.910, TAN r = 0.882, BAN r = 0.916, KNN r = 0.886]

Fig. 5.11 Linear relationship between the degree of class imbalance and difference between sen-
sitivity and specificity values obtained by different eager feature selection methods and classifiers

Chapter 6
Eager Hierarchical Feature Selection

This chapter discusses four different eager hierarchical feature selection methods: Tree-based Feature Selection (TSEL) [1], Bottom-up Hill Climbing Feature Selection (HC) [5], Greedy Top-down Feature Selection (GTD) [2] and Hierarchy-based Feature Selection (SHSEL) [3]. All four of these hierarchical feature selection methods are also categorised as filter methods. They aim to alleviate feature redundancy by considering both the hierarchical structure between features and the predictive power of features (e.g. information gain). Unlike the lazy hierarchical feature selection methods discussed in the last chapter, these eager hierarchical feature selection methods only consider the relevance values of the features calculated from the training dataset and the hierarchical information, without considering the actual feature values of each individual testing instance.

6.1 Tree-Based Feature Selection (TSEL)

The Tree-based feature selection (TSEL) [1] method selects, for each individual path in the DAG, a feature that has high predictive power and a specific definition (i.e. a node that is as close as possible to the leaf node). TSEL first selects one representative node with the highest lift value for each single path in the DAG. As shown in Eq. 6.1, the lift value of feature x is defined as the conditional probability of class y = 1 given feature x = 1, divided by the probability of feature x = 1. TSEL then checks, for each selected node, whether any of its descendant nodes was also selected after processing the other paths. If so, the node is removed and a new representative node is selected from the subtree of that node. The process terminates when no selected node has a descendant node that is also selected.

lift(x) = P(y = 1 | x = 1) / P(x = 1)        (6.1)
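A minimal sketch of how the lift value of Eq. 6.1 could be computed from a binary training set is given below; the function name and variables are hypothetical, and the probabilities are simply estimated by frequency counts.

import numpy as np

def lift(x_column, y_column):
    # Lift of a binary feature x with respect to class y = 1, as in Eq. 6.1.
    x = np.asarray(x_column)
    y = np.asarray(y_column)
    p_x1 = np.mean(x == 1)                    # P(x = 1)
    if p_x1 == 0:
        return 0.0                            # the feature never takes value 1
    p_y1_given_x1 = np.mean(y[x == 1] == 1)   # P(y = 1 | x = 1)
    return p_y1_given_x1 / p_x1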


The pseudocode of the TSEL method is shown in Algorithm 4. In the first part of the algorithm (lines 1–9), the GO-DAG (DAG), the set of paths (P), the descendant sets of the nodes (D(X)) and the direct children sets of the nodes (C(X)) are initialised. The second part of the algorithm (lines 10–14) first obtains the leaf node L(pi) and root node R(pi) of each individual path, and then selects the representative nodes using the function RepresentativeFeatures(R(pi), L(pi), P), which outputs the node having the maximum lift value for a single path. Note that, if more than one node has the same maximum lift value, the one located at the shallower position in the path is selected as the representative node. After generating the set of representative nodes SF, in lines 15–33, Algorithm 4 checks, for each node xi, whether any of its descendants di in D(xi) was also selected. If at least one descendant node was also selected, the node xi is removed from the representative feature set SF in line 21. In order to obtain the representative node(s) for the subtree of xi, every child node ci in C(xi) is used as the root node given as input to the function RepresentativeFeature(ci, L(pci), P), where L(pci) denotes the set of leaf nodes of the paths that also contain the node ci. Finally, in lines 34–36, the classifier is trained and tested after regenerating the new training and testing sets using the set of selected features SF.
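A compact sketch of the core selection logic of Algorithm 4 is shown below. It assumes a precomputed dictionary of lift values, a list of paths (each a list of feature names ordered from root to leaf) and a mapping from each feature to the set of its descendants; it only illustrates the two main steps (per-path representative selection and removal of features whose descendants are also selected) and is not a full reimplementation of TSEL.

def representative_feature(path, lift_values):
    # Highest-lift node in the path; ties are broken in favour of the shallower node,
    # which is the first one encountered when scanning from the root towards the leaf.
    best = path[0]
    for node in path[1:]:
        if lift_values[node] > lift_values[best]:
            best = node
    return best

def tsel_select(paths, lift_values, descendants):
    selected = {representative_feature(p, lift_values) for p in paths}
    changed = True
    while changed:
        changed = False
        for node in list(selected):
            if descendants[node] & selected:   # some descendant of node is also selected
                selected.remove(node)
                # re-select representatives from the portions of the paths below this node
                for p in paths:
                    if node in p:
                        sub_path = p[p.index(node) + 1:]
                        if sub_path:
                            selected.add(representative_feature(sub_path, lift_values))
                changed = True
                break
    return selected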
Figure 6.1 shows an example of applying the TSEL method to a DAG, where the numbers to the left of the nodes denote the lift values of the corresponding nodes. The initial DAG is shown in Fig. 6.1a, where TSEL selects the representative node for each individual path, i.e. node M for the path from node M to K; node F for the path from node L to K; node B for the paths from node M to J and from node L to J; node Q for the path from node O to J; node C for the paths from node M to H, from node L to H and from node O to H; node A for the path from node E to H; and node N for the paths from node E to P or R. TSEL then removes nodes M, Q and A (red nodes in Fig. 6.1c): node M is an ancestor of node F, which was also selected in the previous steps; node Q is an ancestor of node B, which was also selected as the representative node of another path; and node A is removed because its descendant node N was also selected in the previous steps. Note that node D is shown as a selected node in this step, since it has the highest lift value in the path from node D to H, while nodes F and C are still the representative nodes of their corresponding paths. Finally, after checking nodes F and C, only node D is selected as the representative node of the paths from node M to H, from node L to H and from node O to H. The feature subset selected by TSEL consists of features B, D and N, which are used for recreating the datasets for training and testing the classifier (as shown in Fig. 6.1e).

Algorithm 4 Tree-based Feature Selection (TSEL)


1: Initialize TrainSet;
2: Initialize TestSet;
3: Initialize DAG with all features X in Dataset;
4: Initialize P with DAG;
5: Initialize SF;
6: for each feature xi ∈ X do
7: Initialize D(xi ) in DAG;
8: Initialize C(xi ) in DAG;
9: end for
10: for each path pi ∈ P do
11: L(pi ) in DAG;
12: R(pi ) in DAG;
13: SF = SF ∪ RepresentativeFeatures(R(pi ), L(pi ), P);
14: end for
15: Initialize checkUpdated ← True;
16: while checkUpdated==True do
17: checkUpdated ← False;
18: for each feature xi ∈ SF do
19: for di in D(xi ) do
20: if di ∈ SF then
21: SF = SF - xi ;
22: for each feature ci ∈ C(xi ) do
23: SF = SF ∪ RepresentativeFeature(ci , L(pci ), P);
24: end for
25: checkUpdated ← True;
26: Break;
27: end if
28: end for
29: if checkUpdated==True then
30: Break;
31: end if
32: end for
33: end while
34: Re-create TrainSet_SF with all features X′ in SF;
35: Re-create TestSet_SF with all features X′ in SF;
36: Classifier(TrainSet_SF, TestSet_SF);


Fig. 6.1 Example of tree-based feature selection method



6.2 Bottom-Up Hill Climbing Feature Selection (HC)

Bottom-up hill climbing feature selection (HC) [5] is an eager hierarchical feature selection method that searches for an optimal feature subset by exploiting the hierarchical information. The initial feature subset of the hill climbing search consists of all leaf nodes of the individual paths in the Directed Acyclic Graph (DAG). HC evaluates the predictive information of the initial feature subset using Eq. 6.2, where the left part of the equation computes a coefficient based on the proportion between the number of candidate features (M) and the number of features in the original full feature set (N), while the right part of the equation evaluates the degree of concentration of instances belonging to the same class when using the candidate feature subset, i.e. a lower value of Σ_{i∈D} |D_{i,c}| indicates a better description of the different groups of instances belonging to different classes. As suggested by [5], the value of k is set to 5 and the value of μ is chosen as 0.05. The construction of a candidate feature subset consists of replacing a single leaf node by its corresponding parent node. Due to the natural hierarchical redundancy constraint, the child nodes of that added parent node are also removed from the feature subset. Note that, when handling the Gene Ontology hierarchy, a single leaf node may have more than one parent node. In that case, HC evaluates the cost values obtained with each parent node and uses the parent node leading to the highest cost value. The search process terminates when the cost value of the candidate feature subset is no greater than that of the feature subset evaluated in the previous step.
 
$$
f(S) = \left(1 + \mu\,\frac{N - M}{N}\right)\sum_{i \in D} |D_{i,c}|, \qquad \text{where } D_{i,c} \subseteq D_{i,k},\; \mu > 0
\tag{6.2}
$$
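As a concrete illustration, a minimal Python sketch of this cost function could look as follows; the per-instance counts |D_{i,c}| are assumed to be computed elsewhere (e.g. from each instance's group D_{i,k} with k = 5) and passed in, so only the arithmetic of Eq. 6.2 is shown.

```python
# Hedged sketch of Eq. 6.2: d_counts holds one |D_{i,c}| value per training
# instance, n_candidate is M (candidate subset size) and n_total is N
# (full feature set size); mu defaults to the value suggested by [5].
def hc_cost(d_counts, n_candidate, n_total, mu=0.05):
    coefficient = 1.0 + mu * (n_total - n_candidate) / n_total
    return coefficient * sum(d_counts)
```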

As shown in Algorithm 5, the HC method first initialises the GO-DAG (DAG), the set of paths (P), the set of parents of each feature in the DAG (PI(xi)), the set of descendants of each feature in the DAG (D(xi)) and the checking status of each feature (S(xi)). HC generates an initial candidate feature subset (Fcurr) by selecting all leaf nodes in the DAG and calculates the cost value (Costcurr) of that candidate feature set (lines 12–15). The hill climbing search is conducted in lines 16–42, where each feature in the current candidate feature set is checked to determine whether replacing it with its direct parent node (lines 23–26) leads to a higher cost value (lines 27–34). Note that, in the GO-DAG, each node may have more than one direct parent node. Therefore, each parent is processed in turn, and only the optimal direct parent, i.e. the one leading to the highest cost value of the candidate feature subset (Costcurr′ for Fcurr′), is retained. In addition, after replacing feature xi with its direct parent node pij, all descendant nodes of that parent node pij are removed from the candidate feature subset. After the current feature subset has been updated, the hill climbing search restarts, checking whether replacing other features with their corresponding optimal parent nodes leads to a higher cost value. The search terminates when replacing any feature with its parent no longer increases the cost value. Finally, in lines 43–45, after finishing the

Algorithm 5 Bottom-up Hill Climbing Feature Selection (HC)


1: Initialize TrainSet;
2: Initialize TestSet;
3: Initialize DAG with all features X in Dataset;
4: Initialize P with DAG;
5: Initialize Fcurr ;
6: Initialize Fcand ;
7: for each feature xi ∈ X do
8: Initialize PI(xi ) in DAG;
9: Initialize D(xi ) in DAG;
10: Initialize S(xi ) ← "Unmarked";
11: end for
12: for each path pi ∈ P do
13: Fcurr = Fcurr ∪ L(pi );
14: end for
15: Costcurr = costFunction(Fcurr );
16: Initialize Update ← True;
17: while Update == True do
18: Update ← False;
19: Fcurr  ← Fcurr ;
20: Costcurr  ← Costcurr ;
21: for each feature xi ∈ Fcurr do
22: for each Parent pij ∈ PI(xi ) do
23: if S(pij ) != "Marked" then
24: Fcand ← pij ∪ Fcurr ;
25: Fcand ← Fcand - D(pij );
26: end if
27: Costcand = costFunction(Fcand );
28: if Costcand > Costcurr then
29: Update ← True;
30: if Costcand > Costcurr  then
31: Costcurr  ← Costcand ;
32: Fcurr  ← Fcand ;
33: end if
34: end if
35: end for
36: Fcurr ← Fcurr  ;
37: Costcurr ← Costcurr  ;
38: if Update == True then
39: Break;
40: end if
41: end for
42: end while
43: Re-create TrainSet_SF with all features X  in Fcurr ;
44: Re-create Inst_SF with all features X  in Fcurr ;
45: Classifier(TrainSet_SF, Inst_SF);
Fig. 6.2 Example of bottom-up hill climbing feature selection method



hill climbing search, the classifier is trained and tested using the datasets regenerated from the final selected feature set.
Figure 6.2 shows an example of applying the HC method to a DAG. The example DAG in Fig. 6.2a is used to generate the initial candidate feature subset by selecting the leaf nodes of all paths, i.e. nodes K, J, H, P and R in Fig. 6.2b. HC then decides to replace nodes P and R with node N (Fig. 6.2c), since the resulting cost value is greater than that of the initial candidate feature subset. After replacing node J with node B and node H with node D, as shown in Fig. 6.2d, the hill climbing search terminates, since the cost value of the candidate feature subset consisting of nodes K, B, D and N cannot be further improved by replacing any of those nodes with its direct parent nodes. Figure 6.2e shows that the training and testing datasets are then recreated using only those four selected features.
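The candidate-generation move at the heart of this search can be sketched as follows, assuming parents and descendants dictionaries describing the feature DAG and a cost_of callable implementing Eq. 6.2 (for example, built on the hc_cost sketch above); this illustrates the move itself and is not the book's reference implementation.

```python
# One bottom-up hill climbing step: try swapping each selected node for one
# of its direct parents, dropping that parent's descendants, and keep the
# best cost-improving neighbour (or the current subset if none improves).
def hc_step(current, parents, descendants, cost_of):
    best_subset, best_cost = current, cost_of(current)
    for node in current:
        for parent in parents.get(node, ()):       # GO nodes may have several parents
            candidate = (current | {parent}) - descendants[parent]
            cost = cost_of(candidate)
            if cost > best_cost:
                best_subset, best_cost = candidate, cost
    return best_subset, best_cost
```

The full search would call hc_step repeatedly, stopping as soon as the returned subset (and its cost) no longer changes.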

6.3 Greedy Top-Down Feature Selection (GTD)

The greedy top-down feature selection (GTD) method [2] searches for an optimal feature subset by traversing all individual paths in the DAG. In detail, when processing each path, GTD iteratively selects the top-ranked feature (according to the Gain Ratio metric shown in Eq. 6.3) into the selected feature subset and removes its ancestor and descendant nodes from the candidate feature subset. The Gain Ratio (Eq. 6.3) is calculated by dividing the Information Gain by the Information Value. The Information Gain, calculated by Eq. 6.4, measures the informativeness of feature X for predicting the target category of instances (e.g. y = 0 or y = 1) when X is absent (x = 0) or present (x = 1) in the instances. The Information Value, shown in Eq. 6.5, is calculated from the frequencies of the different feature values (e.g. x = 0 and x = 1) and the size of the dataset (i.e. M). After processing all individual paths, the final selected feature subset is used for recreating the training and testing datasets.
$$
\text{Gain Ratio}(X) = \frac{\text{Information Gain}(X)}{\text{Information Value}(X)}
\tag{6.3}
$$

$$
\text{Information Gain}(X) = -\sum_{j=0}^{c} P(y_j)\log P(y_j) + \sum_{i=0}^{v} P(x_i)\sum_{j=0}^{c} P(y_j \mid x_i)\log P(y_j \mid x_i)
\tag{6.4}
$$

$$
\text{Information Value}(X) = -\sum_{i=0}^{v} \frac{x_i}{M}\log\frac{x_i}{M}
\tag{6.5}
$$
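The three quantities above can be computed directly from a feature column and the class vector. The following Python sketch is a hedged illustration of Eqs. 6.3–6.5 for a binary feature and binary class; the names and the choice of log base are illustrative, and the book's actual implementation may differ.

```python
from collections import Counter
from math import log2

def gain_ratio(feature_values, class_values):
    """Gain Ratio of one feature column against the class vector (Eqs. 6.3-6.5)."""
    m = len(class_values)
    # -sum_j P(y_j) log P(y_j): entropy of the class distribution.
    class_entropy = -sum((c / m) * log2(c / m) for c in Counter(class_values).values())

    info_gain, info_value = class_entropy, 0.0
    for value, count in Counter(feature_values).items():
        p_x = count / m
        cond = Counter(y for x, y in zip(feature_values, class_values) if x == value)
        # + sum_i P(x_i) sum_j P(y_j|x_i) log P(y_j|x_i)   (Eq. 6.4)
        info_gain += p_x * sum((c / count) * log2(c / count) for c in cond.values())
        # -sum_i (x_i / M) log (x_i / M)                    (Eq. 6.5)
        info_value -= p_x * log2(p_x)

    return info_gain / info_value if info_value > 0 else 0.0
```

For example, gain_ratio([0, 1, 1, 0], [0, 1, 0, 0]) scores a single binary GO-term column against a binary class vector.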

As shown in Algorithm 6, in lines 1–10, the DAG (DAG), the set of paths (P), the selected feature subset (SF), and the ancestor and descendant sets of each feature (A(xi), D(xi)) are initialised. GTD then processes each individual path in lines 11–27, where lines 13–17 create a candidate feature subset consisting of all “available” features in that path, and the top-ranked feature, i.e. the one with the maximum Gain Ratio (GR) value, is selected

Algorithm 6 Greedy Top-down Feature Selection (GTD)


1: Initialize TrainSet;
2: Initialize TestSet;
3: Initialize DAG with all features X in Dataset;
4: Initialize P with DAG;
5: Initialize SF;
6: for each feature xi ∈ X do
7: Initialize A(xi ) in DAG;
8: Initialize D(xi ) in DAG;
9: Initialize Status(xi ) ← “Available”;
10: end for
11: for each pi ∈ P do
12: Initialize Fcand ;
13: for each xj in path pi do
14: if Status(xj ) == “Available” then
15: Fcand = Fcand ∪ xj ;
16: end if
17: end for
18: x ← max(Fcand , GR);
19: SF = SF ∪ x ;
20: Status(x ) ← “Removed”;
21: for each ai in A(x ) do
22: Status(ai ) ← “Removed”;
23: end for
24: for each di in D(x ) do
25: Status(di ) ← “Removed”;
26: end for
27: end for
28: Re-create TrainSet_SF with all features X  in SF;
29: Re-create Inst_SF with all features X  in SF;
30: Classifier(TrainSet_SF, Inst_SF);

(line 18). In lines 19–26, the selected feature is added into the selected feature subset (SF), all of its ancestor and descendant nodes are removed from the current candidate feature subset, and their selection status is set to “Removed”. Therefore, those removed features are no longer considered when processing other paths. After traversing all paths, the regenerated training and testing datasets are used for training and testing the classifier (lines 28–30).
Figure 6.3 shows an example of applying the GTD feature selection method to a DAG, where the numbers to the left of the nodes denote the Gain Ratio value of each node. As shown in Fig. 6.3b, GTD first processes the path from node M to
Fig. 6.3 Example of greedy top-down feature selection (GTD) method



K, where node K is selected into the selected feature subset due to its highest Gain Ratio value; hence, all its ancestors M, F and L are removed from the candidate feature subset. Node I is then selected as the top-ranked feature in the path from node O to J, and all its ancestors (O and Q) and descendants (B, J, C, D and H) are removed from the candidate feature subset. Finally, as shown in Fig. 6.3d, node A is moved into the selected feature subset, since it is the top-ranked feature in the path from node E to P, and nodes E, G, N, P and R are removed from the candidate feature subset.
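The per-path selection loop of GTD can be summarised in a short Python sketch. It is a hedged illustration only: gr maps each node to its Gain Ratio value (e.g. computed with the gain_ratio sketch above), paths holds root-to-leaf node lists, and ancestors/descendants describe the DAG.

```python
# Simplified GTD loop: in each path, pick the highest-Gain-Ratio node that is
# still available, then mark it and all of its ancestors and descendants as
# removed so they are not considered again in later paths.
def gtd_select(paths, gr, ancestors, descendants):
    status = {node: "Available" for path in paths for node in path}
    selected = []
    for path in paths:
        candidates = [n for n in path if status[n] == "Available"]
        if not candidates:
            continue                                # path already fully covered
        best = max(candidates, key=lambda n: gr[n])
        selected.append(best)
        for node in {best} | ancestors[best] | descendants[best]:
            if node in status:
                status[node] = "Removed"
    return selected
```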

6.4 Hierarchy-Based Feature Selection (SHSEL)

Hierarchy-based feature selection (SHSEL) [3] selects an optimal feature subset by processing the pairs of child-parent features in the individual paths of the DAG. In detail, in the first stage, SHSEL traverses all individual paths, starting by comparing the similarity between each leaf node and its direct parent node. If the similarity is greater than or equal to a pre-defined threshold (τ = 0.99, as suggested by [3]), the leaf node is removed from the hierarchy and from the candidate feature subset. The processing of a single path terminates once SHSEL has compared all pairs of child-parent nodes, iteratively using the processed parent node as the new child node. After processing all paths, SHSEL conducts a further step that removes features whose Information Gain (IG) values are below the average Information Gain value of the corresponding path. Note that the calculation of the average Information Gain value of an individual path only considers the features that have not been removed in the first stage. Finally, all remaining features are used for regenerating the datasets for training and testing the classifier.
As shown in Algorithm 7, in lines 1–13, SHSEL first initialises the training and testing datasets, the Directed Acyclic Graph (DAG), the set of paths P in the DAG, the set of leaf nodes L in the DAG, the set of parent nodes of each individual node PI(X) and the selection status of each individual node Status(X). In lines 14–23, SHSEL conducts the first stage of processing. For each leaf node li in the DAG, SHSEL compares the similarity with its parent node against the threshold τ, and removes the leaf node if the similarity is greater than or equal to τ (lines 17–19). This process compares all pairs of child-parent nodes from the leaf to the root of a single path by iteratively treating the processed parent node as the new child node (line 20). After processing all paths, SHSEL conducts the second stage of selection (lines 24–32). It compares the Information Gain value of each feature remaining after the first stage with the average Information Gain value of the corresponding path; if the former is lower than the latter, the feature is removed from the candidate feature subset. Finally, in lines 33–40, SHSEL regenerates the datasets for training and testing the classifier.
Figure 6.4 shows an example of applying SHSEL to a DAG, where the numbers to the left of the nodes denote the Information Gain values (Eq. 6.4). As shown in Fig. 6.4b, node F is removed after SHSEL processes the paths from M to K and from L to K, since

Algorithm 7 Hierarchy-based Feature Selection (SHSEL)


1: Initialize TrainSet;
2: Initialize TestSet;
3: Initialize DAG with all features X in Dataset;
4: Initialize P with DAG;
5: Initialize L;
6: Initialize SF;
7: for each path pi ∈ P do
8: L = L ∪ Lpi ;
9: end for
10: for each feature xi ∈ X do
11: Initialize PI(xi ) ∈ DAG;
12: Initialize Status(xi ) ← “Available”;
13: end for
14: for each leaf feature li ∈ L do
15: for each path pk ⊆ li do
16: for each node xj ∈ pk starting from pili & Status(xj ) == “Available” do
17: if 1 - |IG(pili ) - IG(li )| ≥ τ then
18: Status(li ) ← “Removed”;
19: end if
20: li ← pili ;
21: end for
22: end for
23: end for
24: Re-create P with all features X where Status(X ) == “Available”;
25: for each path pi ∈ P do
26: I G(pi ) ← average_Information_Gain(pi );
27: for each feature xij in path pi do
28: if I G(xij ) < I G(pi ) then
29: Status(xij ) ← “Removed”;
30: end if
31: end for
32: end for
33: for each feature xi ∈ X do
34: if Status(xi ) == “Available” then
35: SF = SF ∪ xi ;
36: end if
37: end for
38: Re-create TrainSet_SF with all features X  in SF;
39: Re-create Inst_SF with all features X  in SF;
40: Classifier(TrainSet_SF, Inst_SF);
Fig. 6.4 Example of hierarchy-based feature selection (SHSEL) method



node F has a similar Information Gain value to node M and the Information Gain difference between nodes L and K is below the threshold τ. Analogously, node J is removed after being compared with node B, while node I is removed after being compared with node Q. After traversing all paths, the remaining nodes are M, L, K, O, Q, B, C, E, A, P and R. The second stage of SHSEL checks whether the Information Gain of each remaining node is greater than the average Information Gain value of the corresponding path. As shown in Fig. 6.4c, node K is removed since its Information Gain is lower than the average of nodes K and L. Analogously, nodes O, Q, C, A and R are removed after being compared with the average Information Gain of their corresponding paths. Finally, only nodes M, L, B, E and P are selected (Fig. 6.4d) for regenerating the datasets (Fig. 6.4e) for training and testing the classifier.
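A compact Python sketch of SHSEL's two stages is given below, assuming root-to-leaf paths and an ig map of Information Gain values; the child-parent similarity is taken as 1 − |IG(parent) − IG(child)|, as in line 17 of Algorithm 7. Names and data structures are illustrative.

```python
# Hedged SHSEL sketch: stage 1 removes children too similar to their direct
# parent; stage 2 removes survivors whose IG falls below the average IG of
# the surviving features in their path.
def shsel_select(paths, ig, tau=0.99):
    status = {node: "Available" for path in paths for node in path}

    for path in paths:                                   # stage 1, bottom-up
        bottom_up = list(reversed(path))
        for child, parent in zip(bottom_up, bottom_up[1:]):
            if status[child] == "Available" and 1 - abs(ig[parent] - ig[child]) >= tau:
                status[child] = "Removed"

    for path in paths:                                   # stage 2, per-path filter
        alive = [n for n in path if status[n] == "Available"]
        if not alive:
            continue
        avg_ig = sum(ig[n] for n in alive) / len(alive)
        for n in alive:
            if ig[n] < avg_ig:
                status[n] = "Removed"

    return [n for n, s in status.items() if s == "Available"]
```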

6.5 Experimental Results

The predictive performance of the different eager learning-based hierarchical feature selection methods is evaluated using the NB, TAN, BAN and KNN classifiers. The four eager hierarchical feature selection methods are also compared with a “flat” feature selection method (CFS) and with the different classifiers without feature selection, used as the natural baseline. Note that the CFS+GO-BAN method [4] is adopted for constructing the BAN classifier using the features selected by the different eager learning-based feature selection methods. The characteristics of the different feature selection methods are summarised in Table 6.1. The experimental results are reported in Tables 6.2, 6.3, 6.4 and 6.5, where each table displays the results of the different feature selection methods working with the NB, TAN, BAN and KNN classifiers, respectively. The box-plots in Fig. 6.5a–d show the distributions of rankings based on the GMean values obtained by the different feature selection methods working with the different classifiers.
Table 6.2 compares the predictive accuracies obtained by NB when using the TSEL, HC, SHSEL, GTD or CFS feature selection methods in a pre-processing phase, and without feature selection as the natural baseline. In brief, different eager learning-based hierarchical feature selection methods perform best when predicting the function of different model organisms' genes. In detail, GTD+NB obtains the highest GMean value 6 out of 7 times and 7 out of 7 times on the Caenorhabditis elegans and Saccharomyces cerevisiae datasets, respectively. HC+NB obtains the highest GMean value 3 out of 7 times on the Drosophila melanogaster datasets, while SHSEL+NB obtains the highest GMean value 4 out of 7 times on the Mus musculus datasets. Overall, GTD+NB is ranked in the first position on 16 out of 28 datasets in total, as indicated by the boldfaced GMean values in Table 6.2.
As shown in Fig. 6.5a, the GTD+NB method also obtains the best overall result, with an average rank of 1.786, while the second best average ranking (2.857) is obtained by Naïve Bayes without feature selection. The average rankings of SHSEL+NB and CFS+NB are 3.018 and 3.696, respectively, whereas HC+NB

Table 6.1 Summary of characteristics of feature selection methods working with different eager learning-based classification algorithms

| Feature selection method | Learning approach | Annotations | Classification algorithms |
|---|---|---|---|
| No feature selection | Eager | | NB, TAN, BAN, KNN |
| TSEL | Eager | Tree-based feature selection | NB, TAN, BAN, KNN |
| HC | Eager | Bottom-up hill climbing feature selection | NB, TAN, BAN, KNN |
| GTD | Eager | Greedy top-down feature selection | NB, TAN, BAN, KNN |
| SHSEL | Eager | Hierarchy-based feature selection | NB, TAN, BAN, KNN |
| CFS | Eager | Correlation-based feature selection | NB, TAN, BAN, KNN |

and TSEL+NB obtain the worst average rankings (4.589 and 5.054, respectively) in terms of GMean value.
Table 6.3 compares the predictive accuracies obtained by the TSEL, HC, SHSEL, GTD or CFS feature selection methods working with the TAN classifier. Analogously to the case of NB, different feature selection methods perform best when predicting the function of different model organisms' genes. GTD+TAN and CFS+TAN both obtain the highest GMean value 3 out of 7 times on the Caenorhabditis elegans datasets, while HC+TAN and CFS+TAN also both obtain the highest GMean value 3 out of 7 times on the Drosophila melanogaster datasets. The former also performs best on the Mus musculus datasets, obtaining the highest GMean value most often (3 out of 7 times), while the latter obtains the highest GMean value most often (3 out of 7 times) on the Saccharomyces cerevisiae datasets. Overall, CFS obtains the highest GMean value 9 out of 28 times in total.
As shown in Fig. 6.5b, the CFS+TAN method also obtains the best average rank (2.321), while the second best rank (2.839) is obtained by GTD+TAN, which is substantially better than the ranks of all other hierarchical feature selection methods, i.e. 3.268 for SHSEL+TAN, 3.821 for HC+TAN, 4.161 for TSEL+TAN and 4.589 for TAN without feature selection.
Table 6.4 compares the predictive performance obtained by the different feature selection methods working with the BAN classifier. In general, CFS+BAN obtains the highest GMean value most often (4 out of 7 times) on both the Caenorhabditis elegans and Saccharomyces cerevisiae datasets, while SHSEL+BAN obtains the highest GMean value most often (5 out of 7 times) on both the Drosophila melanogaster and Mus musculus datasets. Overall, SHSEL+BAN obtains the highest GMean value most often in total, 13 out of 28 times.

Table 6.2 Predictive accuracy (%) for Naïve Bayes with eager hierarchical feature selection meth-
ods TSEL, HC, SHSEL, GTD and “flat” feature selection method CFS

Table 6.3 Predictive accuracy (%) for TAN with eager hierarchical feature selection methods
TSEL, HC, SHSEL, GTD and “flat” feature selection method CFS

Table 6.4 Predictive accuracy (%) for BAN with eager hierarchical feature selection methods
TSEL, HC, SHSEL, GTD and “flat” feature selection method CFS

Table 6.5 Predictive accuracy (%) for KNN (k = 3) with eager hierarchical feature selection
methods TSEL, HC, SHSEL, GTD and “flat” feature selection method CFS

As shown in Fig. 6.5c, the SHSEL+BAN method also obtains the best average rank (2.071), which is better than the second best rank (2.232) obtained by CFS+BAN. TSEL+BAN obtains the third best rank (3.536), which is substantially better than the ranks obtained by the other methods, i.e. 3.661 for GTD+BAN, 4.482 for HC+BAN and 5.018 for BAN without feature selection.
Table 6.5 compares the predictive performance obtained by the different feature selection methods working with the KNN classifier. In general, GTD+KNN obtains the highest GMean value most often: 5 out of 7, 3 out of 7 and 5 out of 7 times on the Caenorhabditis elegans, Mus musculus and Saccharomyces cerevisiae datasets, respectively, while TSEL+KNN obtains the highest GMean value most often (3 out of 7 times) on the Drosophila melanogaster datasets. Overall, GTD+KNN obtains the highest GMean value on 16 out of the 28 datasets in total.

Fig. 6.5 Boxplots showing the distributions of ranks obtained by different eager feature selection methods working with different eager classifiers (panels a–d: NB, TAN, BAN and KNN, respectively)
As shown in Fig. 6.5d, the GTD+KNN method obtains the best overall average rank (2.054), which is better than the second best rank (2.946) obtained by SHSEL+KNN. KNN without feature selection obtains the third best average rank (3.607), which is substantially better than the average ranks obtained by the other methods, i.e. 3.839 for CFS+KNN, 4.214 for TSEL+KNN and 4.339 for HC+KNN.

Table 6.6 Statistical test results of GMean values obtained by different eager hierarchical feature
selection methods according to the non-parametric Friedman test with the Holm post-hoc correction

6.6 Discussion

6.6.1 Statistical Analysis of GMean Value Difference between


Different Eager Learning-Based Feature Selection
Methods

The statistical significance test results (shown in Table 6.6) further confirm that GTD obtains significantly better predictive performance than all other feature selection methods when working with the NB classifier. Analogously, when working with the KNN classifier, GTD significantly outperforms the other feature selection methods except the SHSEL method. Both GTD and SHSEL perform well when working with the TAN classifier; although the CFS method obtains the best average rank, it does not show significantly better predictive performance than those two methods. SHSEL shows significantly better predictive performance than all other feature selection methods except the CFS method when working with the BAN classifier. In addition, note that all top-ranked feature selection methods significantly improve the predictive accuracy of the four different classifiers.
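The Friedman-plus-Holm protocol behind Table 6.6 (and the analogous tables in this book) can be reproduced with standard Python libraries. The sketch below is illustrative only: the gmeans array is a random stand-in for the real 28-dataset results, and the pairwise z-test of each method against the best-ranked one is one common way of implementing the Holm-corrected post-hoc step, not necessarily the exact procedure used here.

```python
import numpy as np
from scipy.stats import friedmanchisquare, norm, rankdata
from statsmodels.stats.multitest import multipletests

# Rows = datasets, columns = feature selection methods (higher GMean is better);
# a random placeholder is used instead of the real 28 x 6 result matrix.
rng = np.random.default_rng(0)
gmeans = rng.random((28, 6))

# Global Friedman test across the competing methods.
stat, p_value = friedmanchisquare(*gmeans.T)

# Average ranks (rank 1 = best GMean on a given dataset).
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, gmeans)
avg_rank = ranks.mean(axis=0)

# Post-hoc step: z-tests of every method against the best-ranked method,
# with Holm correction of the resulting p-values.
n_data, k = gmeans.shape
best = int(np.argmin(avg_rank))
se = np.sqrt(k * (k + 1) / (6.0 * n_data))
raw_p = [2 * (1 - norm.cdf(abs((avg_rank[j] - avg_rank[best]) / se)))
         for j in range(k) if j != best]
reject, holm_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
```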
Fig. 6.6 Linear relationship between the degree of class imbalance and the GMean values obtained by different eager feature selection methods and classifiers. The correlation coefficients (r) shown in the individual panels are: NB -0.258, TAN -0.801, BAN -0.789, KNN -0.671; TSEL+NB -0.587, TSEL+TAN -0.500, TSEL+BAN -0.587, TSEL+KNN -0.572; HC+NB -0.535, HC+TAN -0.555, HC+BAN -0.825, HC+KNN -0.689; SHSEL+NB -0.453, SHSEL+TAN -0.521, SHSEL+BAN -0.488, SHSEL+KNN -0.558; GTD+NB -0.198, GTD+TAN -0.668, GTD+BAN -0.830, GTD+KNN -0.405

6.6.2 Robustness Against the Class Imbalance Problem

Analogously to Sect. 5.6.2, this section also discusses the robustness of different
feature selection methods against the class imbalance problem.
Overall, as shown in Fig. 6.6, all eager learning-based hierarchical feature selection methods show negative correlation coefficients between the GMean value and the degree of class imbalance, indicating that a higher degree of class imbalance leads to a lower GMean value. However, among the eager learning-based hierarchical feature selection methods, GTD shows the strongest robustness when working with both the NB and KNN classifiers, due to its highest (i.e. least negative) r values, while TSEL and SHSEL show the strongest robustness when working with the TAN and BAN classifiers, respectively. In addition, all eager learning-based hierarchical feature selection methods improve the robustness of the TAN classifier against the class imbalance problem.
The correlation coefficients between the degree of class imbalance and the difference between the sensitivity and specificity values further confirm that an imbalanced distribution of instances over the class labels directly leads to a large difference between the sensitivity and specificity values. As shown in Fig. 6.7, all pairs of Diff and Degree show positive correlation coefficients, and the r values are all greater than 0.600.

Fig. 6.7 Linear relationship between the degree of class imbalance and the difference between the sensitivity and specificity values obtained by different eager feature selection methods and classifiers. The correlation coefficients (r) shown in the individual panels are: NB 0.793, TAN 0.946, BAN 0.884, KNN 0.936; TSEL+NB 0.712, TSEL+TAN 0.623, TSEL+BAN 0.712, TSEL+KNN 0.749; HC+NB 0.807, HC+TAN 0.724, HC+BAN 0.935, HC+KNN 0.944; SHSEL+NB 0.670, SHSEL+TAN 0.756, SHSEL+BAN 0.712, SHSEL+KNN 0.737; GTD+NB 0.678, GTD+TAN 0.838, GTD+BAN 0.892, GTD+KNN 0.881
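For completeness, the kind of correlation analysis behind Figs. 6.6 and 6.7 can be reproduced with a few lines of Python. The numbers in the sketch below are fabricated purely to illustrate the computation, and GMean is assumed to denote the geometric mean of sensitivity and specificity (expressed as a percentage).

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative stand-ins for the 28 per-dataset results of one method;
# GMean is assumed here to be the geometric mean of sensitivity and specificity.
imbalance = np.linspace(0.2, 0.9, 28)           # degree of class imbalance
sensitivity = 0.9 - 0.5 * imbalance             # fabricated example values
specificity = np.full_like(imbalance, 0.9)

gmean = np.sqrt(sensitivity * specificity) * 100
diff = np.abs(sensitivity - specificity) * 100

r_gmean, _ = pearsonr(imbalance, gmean)         # negative, as in Fig. 6.6
r_diff, _ = pearsonr(imbalance, diff)           # positive, as in Fig. 6.7
print(f"r(imbalance, GMean) = {r_gmean:.3f}, r(imbalance, Diff) = {r_diff:.3f}")
```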

References

1. Jeong Y, Myaeng S (2013) Feature selection using a semantic hierarchy for event recognition
and type classification. In: Proceedings of the international joint conference on natural language
processing, Nagoya, Japan, pp 136–144
2. Lu S, Ye Y, Tsui R, Su H, Rexit R, Wesaratchakit S, Liu X, Hwa R (2013) Domain ontology-
based feature reduction for high dimensional drug data and its application to 30-day heart
failure readmission prediction. In: Proceedings of the international conference on
collaborative computing, Austin, USA, pp 478–484
3. Ristoski P, Paulheim H (2014) Feature selection in hierarchical feature spaces. In: Proceedings
of the international conference on discovery science (DS 2014), pp 288–300
4. Wan C, Freitas AA (2015) Two methods for constructing a gene ontology-based feature selection
network for a Bayesian network classifier and applications to datasets of aging-related genes. In:
Proceedings of the sixth ACM conference on bioinformatics, computational biology and health
informatics (ACM-BCB 2015), Atlanta, USA, pp 27–36
5. Wang BB, Mckay RIB, Abbass HA, Barlow M (2003) A comparative study for domain ontology
guided feature extraction. In: Proceedings of the 26th Australasian computer science conference,
Darlinghurst, Australia, pp 69–78
Chapter 7
Comparison of Lazy and Eager
Hierarchical Feature Selection Methods
and Biological Interpretation on
Frequently Selected Gene Ontology
Terms Relevant to the Biology of Ageing

This chapter compares the predictive performance of all the different hierarchical feature selection methods working with different classifiers on 28 datasets. The numbers of features selected by the different feature selection methods are also reported. Finally, the features (GO terms) selected by the optimal hierarchical feature selection methods are interpreted in order to reveal potential patterns relevant to the biology of ageing.

7.1 Comparison of Different Feature Selection Methods


Working with Different Classifiers

All the different feature selection methods are compared when working with the different classifiers. As shown in Fig. 7.1, the box-plots report the distributions of ranks according to the GMean values obtained by the feature selection methods with the corresponding classifiers. In general, HIP obtains the best average ranking when working with the NB and BAN classifiers, and the second best average ranking when working with the TAN and KNN classifiers. MR obtains the best average ranking when working with the TAN classifier and the second best when working with the BAN classifier. Analogously, GTD obtains the best average ranking when working with the KNN classifier and the second best when working with the NB classifier.

Fig. 7.1 Boxplots showing the distributions of ranks obtained by all different feature selection methods working with individual classifiers
The Friedman test (with Holm post-hoc correction) results shown in Table 7.1 further confirm that HIP significantly outperforms all other feature selection methods except GTD and Rele_HIP–MR_n when working with the NB classifier. It also obtains significantly better predictive accuracy than all other feature selection methods except the MR method when working with BAN. When working with the TAN classifier, although MR obtains the overall best predictive performance, it does not obtain significantly higher accuracy than the HIP, CFS, HIP–MR and GTD methods. Analogously, GTD obtains the overall best predictive accuracy when working with the KNN classifier, but
its performance does not differ significantly from that of the HIP, MR, Rele_HIP–MR_n and SHSEL methods, or from KNN without feature selection.
Moreover, all the different combinations of feature selection methods and classifiers are compared with each other according to the GMean values obtained on all 28 datasets. As shown in Table 7.2, HIP obtains the highest GMean value on 15 out of 28 datasets, while MR and GTD obtain the highest GMean value 5 and 3 times, respectively. Note that one type of “flat” feature selection method, Rele_HIP–MR_n, also obtains the highest GMean value twice.
In detail, the HIP, MR and Rele_HIP–MR_n methods each obtain the highest GMean value on 2 out of the 7 Caenorhabditis elegans datasets, while GTD obtains the highest GMean value on the Caenorhabditis elegans dataset using only BP terms as features. Among those 7 Caenorhabditis elegans datasets, MR+BAN obtains the highest GMean value (68.3), on the dataset using BP and MF terms as features. The HIP and GTD methods each obtain the optimal GMean value on 2 out of the 7 Drosophila melanogaster datasets, while MR, HIP–MR and HC each obtain the highest GMean value on 1 type of Drosophila melanogaster dataset.

Table 7.1 Statistical test results comparing the GMean values obtained by different hierarchical feature selection methods working with different classifiers, according to the non-parametric Friedman test with the Holm post-hoc correction

Table 7.2 Summary of the best prediction method for each dataset

| Feature type | Optimal method | GMean | Optimal method | GMean |
|---|---|---|---|---|
| | Caenorhabditis elegans datasets | | Drosophila melanogaster datasets | |
| BP | GTD + NB | 64.5 | MR + KNN | 66.0 |
| MF | Rele_HIP–MR_n + NB | 56.1 | HIP–MR + NB | 64.2 |
| CC | HIP + NB | 59.5 | GTD + NB | 69.0 |
| BP + MF | MR + BAN | 68.3 | HC + TAN | 72.5 |
| BP + CC | MR + TAN | 66.0 | HIP + NB | 69.1 |
| MF + CC | Rele_HIP–MR_n + NB | 60.2 | GTD + NB | 72.9 |
| BP + MF + CC | HIP + TAN | 65.5 | HIP + KNN | 73.2 |
| | Mus musculus datasets | | Saccharomyces cerevisiae datasets | |
| BP | HIP + KNN | 73.0 | HIP + NB | 70.4 |
| MF | TSEL + NB/BAN | 69.6 | HIP + TAN | 46.0 |
| CC | HIP + BAN | 63.4 | HIP + BAN | 49.4 |
| BP + MF | HIP + NB | 70.6 | HIP + NB | 75.3 |
| BP + CC | MR + KNN | 71.4 | HIP + NB | 74.6 |
| MF + CC | MR + BAN | 70.2 | HIP + BAN | 57.8 |
| BP + MF + CC | HIP + NB/BAN | 73.5 | HIP + BAN | 75.1 |

Among all those 7 types of Drosophila melanogaster datasets, the one using BP, MF and CC terms as features leads to the overall highest GMean value, obtained by the HIP+KNN method. HIP obtains the highest GMean value on 4 out of the 7 Mus musculus datasets, while MR and TSEL obtain the highest GMean value 2 times and 1 time, respectively. The Mus musculus dataset using BP, MF and CC terms as features leads to the highest overall GMean value, obtained by the HIP+NB or HIP+BAN methods.
Analogously to the Mus musculus datasets, where HIP obtains the highest GMean value on the majority of datasets, HIP also obtains the highest GMean value on all 7 Saccharomyces cerevisiae datasets. The overall highest GMean value is obtained by the HIP+NB method on the dataset using BP and MF terms as features.
In terms of the best prediction method (i.e. the combination of feature selection method and classifier), HIP+NB obtains the highest GMean value most often (on 7 out of 28 datasets). HIP+BAN obtains the highest GMean value on 5 out of 28 datasets, while GTD+NB obtains the best predictive results on 3 out of 28 datasets. Figure 7.2 shows the boxplot of the distribution of rankings obtained by all 60 prediction methods considered in this book. HIP+BAN, HIP+NB and GTD+NB are still the three best methods, obtaining the highest mean ranking results.

Fig. 7.2 Boxplots showing the distributions of ranks obtained by all different prediction methods (the combinations of different feature selection methods and different classifiers)

7.2 The Number of Selected Features by Different Methods

The numbers of features selected by the HIP, MR, HIP–MR, HC, TSEL, GTD, SHSEL and CFS methods on the 28 different datasets are reported in Fig. 7.3. In general, GTD selects the most features, while HIP–MR selects the second most. TSEL, MR and HC all select similar numbers of features over the 28 datasets, whereas HIP, SHSEL and CFS all select much smaller numbers of features. SHSEL selects the fewest features on the majority of the Caenorhabditis elegans and Saccharomyces cerevisiae datasets, while CFS selects the fewest features on all Drosophila melanogaster and Mus musculus datasets.

Fig. 7.3 Average number of features selected by HIP, MR, HIP–MR, TSEL, HC, SHSEL, GTD and CFS for each of the feature (GO term) types

7.3 Interpretation on Gene Ontology Terms Selected by


Hierarchical Feature Selection Methods

As discussed in Sect. 7.1, the Caenorhabditis elegans and Saccharomyces cerevisiae datasets both lead to the highest predictive accuracy when using BP and MF terms as features, while the Drosophila melanogaster and Mus musculus datasets both lead to the best predictive accuracy when using BP, MF and CC terms as features. This indicates that the feature subsets selected by the corresponding hierarchical feature selection methods from the full feature sets of either BP+MF or BP+MF+CC for the corresponding model organisms are the most informative for describing the function of ageing-related genes. Hence, this section discusses the biological meaning of those GO
terms that are not only selected with high frequency but are also located at deep positions in the GO-DAG.
However, note that, due to the naturally limited knowledge about the ageing mechanisms of different model organisms, the degree of specificity of gene function annotation varies between model organisms. This makes it difficult to find GO terms that simultaneously bear a more specific biological definition and are selected for the majority of genes. The correlation coefficient between the depth of the GO terms in the GO-DAG and the total number of selections reveals that the Caenorhabditis

Table 7.3 Selected GO terms located in deep position of GO-DAG

| GO_ID | Description | Domain | Depth | Selection frequency (Prob) | Degree (trees) |
|---|---|---|---|---|---|
| GO:1902600 | Proton transmembrane transport | BP | 9 | 553 (100%) | 1322 |
| GO:0006352 | DNA-templated transcription, initiation | BP | 9 | 553 (100%) | 615 |
| GO:0003924 | GTPase activity | BP | 8 | 553 (100%) | 1324 |
| GO:0006310 | DNA recombination | BP | 8 | 553 (100%) | 1083 |
| GO:0007265 | Ras protein signal transduction | BP | 8 | 553 (100%) | 1071 |
| GO:0071103 | DNA conformation change | BP | 8 | 553 (100%) | 996 |
| GO:0043065 | Positive regulation of apoptotic process | BP | 8 | 553 (100%) | 871 |
| GO:0006412 | Translation | MF | 7 | 553 (100%) | 2175 |
| GO:0004713 | Protein tyrosine kinase activity | BP | 7 | 553 (100%) | 1575 |
| GO:0005506 | Iron ion binding | MF | 7 | 553 (100%) | 1374 |

elegans dataset bears the most specific annotation of gene function, based on a comparison of the r values obtained for the different model organisms. In detail, the CE dataset obtains the best r value (−0.09), which is better than the r values obtained for all other model organisms' datasets (i.e. r(DM) = −0.56, r(MM) = −0.53 and r(SC) = −0.48). Therefore, this book only focuses on discussing the ageing-related biological meaning of Caenorhabditis elegans genes.
As shown in Table 7.3, those 10 GO terms are selected by the MR method with high frequency and are located in layers 7–9 of the GO-DAG. Recall that the lazy learning-based TAN classifier builds a tree using the selected features for each individual testing instance. Therefore, the GO terms that are frequently connected with those 10 highly selected GO terms are also relevant to the discussion of the biology of ageing. Figure 7.4 shows a network created from those 10 core GO terms and all other terms connecting with them; those connecting terms are also selected by the MR hierarchical feature selection method. The red edges denote pairs of GO terms that are frequently included in the trees built for the individual testing instances (i.e. the weight denotes the number of instances whose tree contains the edge). Table 7.4 summarises the information of those highly relevant terms connecting with the 10 core terms.

Fig. 7.4 The reconstructed network consisting of the 10 core GO terms and their connecting GO terms, according to the trees learned by the TAN classifier
In general, the patterns revealed by those 10 GO terms can be interpreted in terms of three popular hypotheses about ageing mechanisms. To begin with, GTPase activity (GO:0003924), Ras protein signal transduction (GO:0007265) and protein tyrosine kinase activity (GO:0004713) are highly relevant to each other, due to their well-known roles in controlling signal transduction during cell growth and division. The pathways related to cell growth and division have also been found to be related to certain ageing processes. For example, it was found that changes in the gene daf-2 are related
with insulin/insulin-like growth factor-1 (IGF-1) signalling. The former is a hormone that regulates the metabolism of glucose, while the latter primarily controls growth [5]. It was found that inhibiting insulin/IGF-1 signalling extends lifespan [3]. Therefore, it is possible to speculate that gene mutations, especially changes in the sensitivity of the insulin/IGF-1 receptor, can enhance resistance to environmental stress [3]. In support of this inference, a relationship between stress responsiveness and lifespan was also found for age-1 mutants in C. elegans [1].
DNA damage is another ageing-related factor, which links “DNA-templated transcription, initiation” (GO:0006352), DNA recombination (GO:0006310) and DNA conformation change (GO:0071103). One possible source of DNA damage is oxidative stress. In essence, the role of oxidative stress in longevity regulation is related to reactive oxygen species (ROS), which are a byproduct of normal metabolism [4]. It was discovered that the balance between ROS and the antioxidant defence system controls the degree of oxidative stress, which is associated with modifications of cellular proteins, lipids and DNA [1]. Other research also revealed that a cycle of growing DNA damage is caused

Table 7.4 GO terms frequently connected with those 10 core GO terms in Table 7.3

| Target | Connected | Description | Domain | Weight |
|---|---|---|---|---|
| GO:1902600 | GO:0009055 | Electron transfer activity | MF | 9 |
| GO:1902600 | GO:0015078 | Proton transmembrane transporter activity | MF | 9 |
| GO:1902600 | GO:0009124 | Nucleoside monophosphate biosynthetic process | BP | 9 |
| GO:0006352 | GO:0010628 | Positive regulation of gene expression | BP | 9 |
| GO:0003924 | GO:0031683 | G-protein beta/gamma-subunit complex binding | MF | 8 |
| GO:0003924 | GO:1901068 | Guanosine-containing compound metabolic process | BP | |
| GO:0006310 | GO:0016358 | Dendrite development | BP | 8 |
| GO:0006310 | GO:0071103 | DNA conformation change | BP | |
| GO:0007265 | GO:0010172 | Embryonic body morphogenesis | BP | 8 |
| GO:0071103 | GO:0008094 | DNA-dependent ATPase activity | MF | 8 |
| GO:0071103 | GO:0006310 | DNA recombination | BP | |
| GO:0043065 | GO:0044389 | Ubiquitin-like protein ligase binding | MF | 8 |
| GO:0006412 | GO:0003743 | Translation initiation factor activity | MF | 7 |
| GO:0006412 | GO:0000003 | Reproduction | BP | 7 |
| GO:0006412 | GO:0003735 | Structural constituent of ribosome | MF | 7 |
| GO:0004713 | GO:0018212 | Peptidyl-tyrosine modification | BP | 7 |
| GO:0005506 | GO:0009055 | Electron transfer activity | MF | 7 |
| GO:0005506 | GO:0046906 | Tetrapyrrole binding | MF | 7 |
| GO:0005506 | GO:0016705 | Oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen | MF | 7 |

by damaged mitochondria, which leads to increased ROS production [1]. ROS can damage and crosslink DNA, proteins and lipids [5] and affect the formation of base adducts of mutation- and canceration-related DNA [2]. Therefore, the damage caused by oxidation reactions, cell or DNA self-repair mechanisms and resistance to environmental stress are probably interacting factors that affect the process of ageing, and all of them are supported by the theory that the reduction in energy intake associated with calorie restriction helps to extend longevity.
Apart from the terms discussed above, other frequently selected terms, such as proton transmembrane transport (GO:1902600) and iron ion binding (GO:0005506), may need further study in order to reveal potentially novel ageing-related mechanisms. In addition, the full ranking of GO terms according to selection frequency can be found via the online resources of this book.

References

1. Finkel T, Holbrook NJ (2000) Oxidants, oxidative stress and the biology of ageing. Nature
408:239–247
2. Heilbronn LK, Ravussin E (2003) Calorie restriction and aging: review of the literature and
implications for studies in humans. Am J Clin Nutr 78(3):361–369
3. Kenyon CJ (2010) The genetics of ageing. Nature 464(7288):504–512
4. Raha S, Robinson BH (2000) Mitochondria, oxygen free radicals, disease and ageing. Trends
Biochem Sci 25(10):502–508
5. Vijg J, Campisi J (2008) Puzzles, promises and a cure for ageing. Nature 454(7208):1065–1071
Chapter 8
Conclusions and Research Directions

8.1 General Remarks on Hierarchical Feature Selection


Methods

Overall, the hierarchical feature selection methods (especially the lazy learning-based ones) show the capacity to improve the predictive performance of different classifiers. Their better performance also shows that exploiting the hierarchical dependency information as a type of search constraint usually leads to a feature subset with higher predictive power. However, note that those hierarchical feature selection methods still have some drawbacks. For example, as one of the top-performing methods, HIP eliminates hierarchical redundancy and selects a feature subset that retains all hierarchical information, but it ignores the relevance of individual features, since it does not consider any measure of association between a feature and the class attribute. Analogously, the MR method eliminates hierarchical redundancy and selects features by considering both the hierarchical information and the features' relevance, but the selected features might not retain the complete hierarchical information.
From the perspective of ageing-related gene function prediction methods (i.e. a hierarchical feature selection method combined with a classification algorithm), there is still room for obtaining higher predictive accuracy by developing better hierarchical feature selection methods. For example, recall that, in Table 7.2, the highest GMean values obtained by the optimal prediction methods for the individual model organisms are 68.3 (CE), 73.2 (DM), 73.5 (MM) and 75.3 (SC); all those values are below 80.0.
In terms of the interpretability of the features selected by hierarchical feature selection methods, although the data quality (e.g. the degree of specificity of the gene function annotation) is strongly correlated with the value of the discovered biological patterns, the MR method still shows its capacity to discover biological patterns that have been successfully linked to existing experimental findings. It is expected that more valuable ageing-related biological patterns will be discovered by applying the hierarchical feature selection methods to data that include more specific gene function annotations over time.
This book also contributes to research on the biology of ageing. Firstly, a set of ageing-related datasets covering four model organisms is made freely available to other researchers. In those datasets, genes are classified as pro-longevity or anti-longevity, using Gene Ontology (GO) terms as predictive features. Secondly, as another contribution of this book, a ranked list of highly frequently selected GO terms reveals potentially insightful information for ageing research. All those materials, together with Java implementations of the algorithms discussed in this book, can be found via https://github.com/HierarchicalFeatureSelection.

8.2 Future Research Directions

The future research directions suggested in this book can be categorised into six
types. The first type includes research directions that are direct extensions of the work
described in this book. To begin with, going beyond GO terms, the proposed hierar-
chical feature selection methods are generic enough to be applicable to any dataset
with hierarchically organised features, as long as the hierarchical relationships rep-
resent generalisation-specialisation relationships. Hence, the proposed hierarchical
methods should be further evaluated in other types of datasets too. For instance, these
methods can be evaluated in text mining datasets, where instances represent docu-
ments, features typically represent the presence or absence of words in a document,
and classes represent, for instance, the topic or subject of the document. Words also
obey hierarchical, generalisation-specialisation relationships (as captured e.g. in the
WordNet system [2]), making text mining another natural application domain for the
proposed hierarchical feature selection methods.
The second type of future research direction consists of proposing new embedded hierarchical feature selection methods based on lazy learning versions of other types of Bayesian network classifiers. For example, Wan and Freitas (2016) [5] proposed a novel embedded hierarchical feature selection method, HRE-TAN, which successfully improves the predictive performance of the TAN classifier. However, due to the lack of evidence that HRE-TAN performs better than other feature selection methods, the discussion of HRE-TAN is not included in this book. Therefore, it is natural to further investigate the predictive performance of HRE-TAN and to propose other embedded hierarchical feature selection methods based on other semi-naïve Bayesian classifiers, e.g. the AODE [6] classifier. More precisely, hierarchically redundant features could be removed for each individual One-Dependence Estimator (ODE) during the training phase of AODE. The classification phase of the conventional AODE classifier would then remain the same, i.e. the class predictions computed by the set of ODEs would be used for classifying a new testing instance.
The third type of future research direction consists of proposing a new lazy version of the CFS method [3], and then further extending lazy CFS to eliminate hierarchical redundancy according to the pre-defined DAG, in a way analogous to HIP and MR. In order to design lazy CFS, the calculation of the correlation coefficient between a pair of features, or between a feature and the class variable, can be adapted to consider only the actual values of the features in the current testing instance. Then, in order to incorporate hierarchical redundancy elimination into lazy CFS, during the heuristic search for the most appropriate subset of features, the search space can be substantially reduced by removing features that are hierarchically redundant with respect to the features in the current candidate feature subset.
The fourth type of future research direction is an extension to the scenario in which the classes or feature values are non-binary. The proposed hierarchical feature selection methods can be directly adopted for the multi-class classification task, where there are more than two class values; however, their performance in this scenario still needs to be evaluated.
The fifth type of future research direction is evaluating the usefulness of a feature hierarchy as a form of pre-defined expert knowledge in the context of the classification task. As an example, in order to evaluate the usefulness of the Gene Ontology as a feature hierarchy, the proposed hierarchical feature selection methods could be applied to randomly generated variations of the feature hierarchy, e.g. by randomly permuting the dependencies between GO terms.
Finally, in terms of future research directions on the application of hierarchical feature selection methods to the biology of ageing, it is suggested to create further datasets containing other types of hierarchical features of genes or proteins, such as ageing-related pathway information obtained by integrating data from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [4], Reactome [1], etc.

References

1. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, ..., Stein L (2011) Reactome: a


database of reactions, pathways and biological processes. Nucl Acids Res 39:D691–D697
2. Fellbaum C (1998) WordNet. Blackwell Publishing Ltd
3. Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, The
University of Waikato
4. Kanehisa M, Goto S (2000) Kegg: Kyoto encyclopedia of genes and genomes. Nucl Acids Res
28(1):27–30
5. Wan C, Freitas AA (2016) A new hierarchical redundancy eliminated tree augmented naive bayes
classifier for coping with gene ontology-based features. In: Proceedings of the 33rd international
conference on machine learning (ICML 2016) workshop on computational biology, New York,
USA
6. Webb G, Boughton J, Wang Z (2005) Not so naive bayes: aggregating one-dependence estima-
tors. Mach Learn 58(1)
Index

A
Ageing mechanisms, 111
Ageing-related genes, 4
Anti-longevity, 4
Artificial Neural Networks, 8
Association rule mining, 1, 7

B
Backward Sequential Elimination, 18
Bayesian network, 8, 11
Bayesian Network Augmented Naïve Bayes, 13, 58
Bioinformatics, 2, 5, 8, 14
Biology of Ageing, 5
Bottom-up Hill Climbing Feature Selection (HC), 81, 85

C
Caloric restriction, 4
Classification, 1, 2, 7, 20, 21
Class imbalance, 78, 102
Clustering, 1, 7, 9
Correlation coefficient, 78, 102

D
Data mining, 1, 7, 8
Decision Tree, 8, 20
Directed Acyclic Graph, 2

E
Eager hierarchical feature selection, 81, 94
Eager learning, 5, 20, 21
Embedded approach, 17

F
Feature selection, 1, 2, 5, 17, 20
Filter approach, 17, 18
Forward Sequential Selection, 18

G
Gene Ontology, 14, 35
Gene Ontology hierarchy, 14
Gene Ontology terms, 21
Generalisation-specialisation relationships, 2, 20, 21, 36
GMean values, 76
Greedy Top-down Feature Selection (GTD), 81, 88

H
Hierarchical feature selection, 5, 18, 20, 21, 45
Hierarchical redundancy, 20, 21, 45, 50
Hierarchical relationships, 2, 5, 14, 20, 36
Hierarchical structure, 4
Hierarchy-based Feature Selection (SHSEL), 81, 91
Human Ageing Genomic Resources, 37

J
Jaccard similarity coefficient, 14

K
K-means, 9
K-medoids, 9
K-Nearest Neighbour, 14, 58
Knowledge Discovery in Databases, 1

L
Lazy hierarchical feature selection, 45
Lazy learning, 5, 14, 20
Least Absolute Shrinkage and Selection Operator, 20
Longevity, 4

M
Machine learning, 1, 8

N
Naïve Bayes, 8, 11, 12, 58

P
Predictive accuracy, 17
Predictive performance, 1, 2, 17, 94, 105
Pro-longevity, 4
Pseudocode, 47

R
Regression, 1, 7, 8

S
Select Hierarchical Information-Preserving and Most Relevant Features (HIP–MR), 45, 55
Select Hierarchical Information-Preserving Features (HIP), 45, 47
Select Most Relevant Features (MR), 45, 50
Semi-naïve Bayes, 8, 11–13
Specialisation-generalisation relationship, 45
Statistical significance test, 76
Supervised learning, 7, 8
Support Vector Machine, 8

T
Tree Augmented Naïve Bayes, 12, 58
Tree-based Feature Selection (TSEL), 81

U
Unsupervised learning, 7

W
Wrapper approach, 17, 18