International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) -Special Issue

ISSN 2278-6856

National Conference on Architecture, Software systems and Green computing-2013(NCASG2013)

Early Diagnosis of Lung Cancer using a Mining Tool
Juliet R Rajan1, Jefrin J Prakash2

1 Assistant Professor, Department of Computer Science and Engineering, Jerusalem College of Engineering,
Pallikaranai, Chennai, Tamil Nadu, India
julietrajan@gmail.com

2 Project Manager, Infosys Limited, Chennai, Tamil Nadu, India
jef_kash@yahoo.com

Abstract- As the amount of data grows day by day, there is a
pressing need to extract knowledge from it. Data mining has
contributed much to this need and has found applications in
diverse fields such as the stock market, banking, information
technology and medicine. Data mining is the process of sifting
through data and extracting the patterns that underlie it. With
the growth in population and disease, there is a need to apply
data mining in the health care industry. Studies have shown
that cancer is one of the most widespread fatal diseases today,
and among cancers, lung cancer and breast cancer account for
the largest share. If the disease is diagnosed at an early
stage, the survival rate of the patient can be improved, but
most of the time it is diagnosed at a later stage. This paper
proposes a data mining methodology which could predict lung
cancer at an early stage, thereby increasing the survival of
the patient by five years. The tool works intelligently to
pre-diagnose lung cancer at Stage 1. It is constructed using
Artificial Neural Networks.
Index Terms- Artificial Intelligence, Biomarkers, Clinical
diagnosis, Data mining, Expert System, Pattern analysis

1. INTRODUCTION
Lung cancer is the leading cause of cancer death in both men
and women. The disease is characterized by the uncontrolled
growth of cells. If it is not diagnosed and treated early, the
cancer can metastasize to other parts of the body such as the
brain, bone, liver and adrenal glands. According to CancerCare,
a widely accepted tool for early lung cancer detection is not
yet available. Current techniques such as chest X-ray, Computed
Tomography (CT) scan, sputum cytology, biopsy, bronchoscopy,
needle aspiration, the electronic nose [1] and others not only
require extensive infrastructure and high cost, but have also
proved effective only at stage 4, when the tumor has
metastasized to other parts of the body. Also, it has been
found that 0.4% of current cancers in the US are due to CT
scans performed in the past, and this may increase to as high
as 1.5-2% as per the 2007 report [2]. The ionizing radiation
emitted by a CT scan can damage DNA in ways that the cellular
repair mechanism cannot correct. Biomarker detection can also
help in lung cancer detection, but lung cancer does not yet
have any specific biomarkers and researchers are still working
on that [3]. In spite of the existing techniques, most of the
time lung cancer is detected only after it has passed stage 1.
As the volume of data grows in proportion to the increase in
population, there is a greater need to extract knowledge from
the data. Data mining contributes much towards this and finds
application in diverse fields, including the healthcare
industry. Lung cancer, being a disease whose diagnosis depends
heavily on previous data, can make use of data mining for its
early detection. Data mining tools have proved successful in
disease diagnosis [4]. Data mining has already found
application in cancer diagnosis, for example in cancer lesion
detection [5], pulmonary nodule detection [6], classification
of cancer stage from free-text histology reports [7], breath
biomarker detection [8] and so on. Most of the diagnostic
methods employed so far are based on image mining. In this
paper, we propose a data mining method which operates on the
patients' causes and symptoms rather than on images or
biomarkers. The expert system that we define here takes the
various causes and symptoms of the patients as input
parameters and classifies each case into the positive or
negative category of lung cancer using Artificial Neural
Networks (ANN).

2. DATA MINING STEPS
Data mining is a process of finding and extracting hidden
patterns of correlation in data which cannot be found by
ordinary statistical methods. It is an iterative and
interactive process. For the successful extraction of
patterns, a step-by-step procedure has to be followed.
2.1 Data Integration
The reports of patients suffering from lung cancer are
collected from various sources and integrated. Heterogeneous
reports from different health care centers tend to give better
results after the mining process.

ISBN NO: 978-93-80609-14-0

2.2 Data Cleaning
The various causes and symptoms relevant to the mining
process are retrieved from the heterogeneous reports, thus
generating the dataset required for learning and testing. The
dataset thus generated is likely to contain missing
information, erroneous data, noise or inconsistent data.
Missing values for an attribute are filled in based on domain
knowledge. We ignore records that have more than 40% of their
values missing. In future, a SOM-based fuzzy map model could be
used for data mining with incomplete datasets [9]. Table 1
presents some of the causes identified.
Table 1 Some Lung Cancer Causes Attributes
Attribute                     Type
Age                           Numeric
Gender                        Nominal
Height                        Numeric
Weight                        Numeric
Smoking habit                 Nominal
Secondhand smoke              Nominal
Radon gas                     Nominal
Asbestos                      Nominal
Air pollution                 Nominal
Radiation therapy to lungs    Nominal
HIV or AIDS                   Nominal
Organ Transplant              Nominal
Women with HRT                Nominal
Symptoms of the patients are classified as primary and
secondary symptoms. Table 2 and Table 3 present some
of the primary and secondary symptoms identified.
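
The cleaning rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dictionaries, a missing value is represented by `None`, and the `defaults` table stands in for the domain knowledge used to fill gaps. The function names are hypothetical and are not taken from the tool itself.

```python
MISSING_LIMIT = 0.40  # records with more than 40% missing values are ignored


def missing_fraction(record):
    """Share of attributes whose value is None (assumed marker for 'missing')."""
    if not record:
        return 1.0  # an empty record carries no information at all
    return sum(value is None for value in record.values()) / len(record)


def drop_sparse_records(records, limit=MISSING_LIMIT):
    """Keep only the records whose missing share does not exceed the limit."""
    return [r for r in records if missing_fraction(r) <= limit]


def fill_missing(records, defaults):
    """Fill remaining gaps from a table of domain-supplied default values.

    Attributes without an entry in `defaults` are left as None.
    """
    return [
        {key: (defaults.get(key) if value is None else value)
         for key, value in record.items()}
        for record in records
    ]
```

A typical call would first apply `drop_sparse_records` to the integrated reports and then `fill_missing` with defaults agreed with a clinician; how those defaults are chosen is outside the scope of this sketch.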

2.3 Data Transformation
The attributes identified have to be transformed into a form
that is understandable by both humans and the machine. Some of
the parameters, such as age, height and weight, are normalized
for computational efficiency using the following formula:

(1)

The attributes with nominal values are then converted into
numeric or discrete variables. After the normalization and
discretization process, the records of the patients are
represented in the form of a matrix.

(2)

where p is the total number of training records and n is the
number of attributes identified.
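
As a rough illustration of the transformation step, the sketch below normalizes numeric attributes, converts nominal attributes to integer codes, and stacks the result into a p x n matrix. Min-max scaling to [0, 1] is an assumption made here and may differ from formula (1); the integer coding of nominal values and the function names are likewise hypothetical.

```python
import numpy as np


def min_max(values):
    """Scale a numeric column to [0, 1] (assumed normalization)."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    # A constant column has no spread, so map it to zeros.
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span


def encode_nominal(values):
    """Map each distinct nominal value to an integer code, in sorted order."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return np.array([codes[v] for v in values], dtype=float)


def build_matrix(records, numeric, nominal):
    """Stack transformed columns into a p x n matrix (p records, n attributes)."""
    cols = [min_max([r[a] for r in records]) for a in numeric]
    cols += [encode_nominal([r[a] for r in records]) for a in nominal]
    return np.column_stack(cols)
```

For example, `build_matrix(records, numeric=["age", "height", "weight"], nominal=["smoking"])` would place the three scaled numeric columns first, followed by the coded smoking column.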
The dataset is then divided into two parts such that 80% of
the data are used for learning and the remaining 20% for
testing. Fig 1 shows the general structure of unsupervised
learning.
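
The 80/20 division can be sketched as follows. The text does not say whether records are shuffled before the split; the random shuffle and the fixed seed used here are assumptions, and `split_train_test` is a hypothetical helper name.

```python
import numpy as np


def split_train_test(X, train_share=0.8, seed=0):
    """Shuffle the rows of X and split them into learning and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # random row order (assumption)
    cut = int(round(train_share * len(X)))   # number of learning records
    return X[order[:cut]], X[order[cut:]]
```

With 10 records this yields 8 rows for learning and 2 for testing; every record lands in exactly one of the two parts.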

Table 2 Some of Lung Cancer Primary Symptoms Attributes
Attribute                       Type
Chest pain                      Nominal
Cough                           Nominal
Coughing of blood               Nominal
Fatigue                         Nominal
Losing weight without trying    Nominal
Loss of appetite                Nominal
Shortness of breath             Nominal
Wheezing                        Nominal

Table 3 Some of Lung Cancer Secondary Symptoms Attributes
Attribute                       Type
Bone pain or tenderness         Nominal
Eyelid drooping                 Nominal
Facial Paralysis                Nominal
Hoarseness or changing voice    Nominal
Joint pain                      Nominal
Nail problems                   Nominal
Shoulder pain                   Nominal
Swallowing difficulty           Nominal
Swelling of face or arms        Nominal
Weakness                        Nominal
Fever                           Nominal

Fig.1 Unsupervised Learning (block labels: Training Set, Learning Set, Predicted y)


2.4 Data Mining
From the literature survey carried out for the proposed
system, it was found that the Kohonen map could give high
performance for the development of the tool, since it can
provide statistical insights and models for larger data sets
[10]. The Kohonen map is one of the powerful techniques for
data mining through cluster analysis [11]. The Kohonen map is
capable of imitating the human
brain, thereby learning from past experience (or data) and
then making the classification. Since this ANN follows an
unsupervised learning method, it has a high potential to learn
more complex and larger models. Also, the learning can proceed
hierarchically from the observations into ever more abstract
levels of representation.
At first, the Euclidean distance of i1, where i1 is a patient
record, from the weight vector wj associated with each output
node is calculated:
$$d_j = \sqrt{\sum_{k=1}^{n} (i_{1,k} - w_{j,k})^2}$$ (3)
Select the output node j* whose weight vector gives the
minimum value of the formula defined in (3). Then, update the
weights of all nodes within a topological distance D(t) from
j* using the weight update rule:
$$w_j(t+1) = w_j(t) + \eta(t)\,(i_1 - w_j(t))$$ (4)
where \eta(t) is the learning rate. This process is repeated
for all the input vectors. The learning rate generally
decreases with time.
An output vector of length 2, with one element for the
cancer-positive category and the other for the cancer-negative
category, will be identified.

Each of the p vectors in the training data is classified as
falling into one of the clusters. Random weight values are
assigned from the inputs to the outputs. The Euclidean
distance is then calculated and the corresponding weight
values are updated. This process is repeated for a number of
iterations until the system learns. The accuracy of the tool
increases with the number of training data.
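
The training procedure described in this section can be sketched as a small one-dimensional Kohonen map with two output nodes. This is a minimal sketch, not the implementation of the tool: the linear decay of the learning rate, the linear shrinking of D(t), the default parameter values and the optional `init_weights` argument are all assumptions introduced for illustration.

```python
import numpy as np


def train_som(X, n_nodes=2, n_epochs=50, eta0=0.5, d0=1, seed=0, init_weights=None):
    """Train a 1-D Kohonen map; returns weights of shape (n_nodes, n_features)."""
    rng = np.random.default_rng(seed)
    if init_weights is None:
        W = rng.random((n_nodes, X.shape[1]))      # random initial weights
    else:
        W = np.array(init_weights, dtype=float)    # caller-supplied start (assumption)
    for t in range(n_epochs):
        eta = eta0 * (1.0 - t / n_epochs)               # learning rate decays with time
        radius = int(round(d0 * (1.0 - t / n_epochs)))  # D(t) shrinks with time
        for x in X[rng.permutation(len(X))]:
            dist = np.linalg.norm(W - x, axis=1)   # Euclidean distance to each w_j
            winner = int(np.argmin(dist))          # output node j* with minimum distance
            lo = max(0, winner - radius)
            hi = min(n_nodes, winner + radius + 1)
            W[lo:hi] += eta * (x - W[lo:hi])       # move the neighbourhood towards x
    return W


def assign_cluster(W, X):
    """Return, for every row of X, the index of the nearest weight vector."""
    dist = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return np.argmin(dist, axis=1)
```

After `W = train_som(X_train)`, calling `assign_cluster(W, X_test)` gives one of two cluster indices per record. Which index corresponds to the cancer-positive category is not determined by the unsupervised training itself and would have to be decided by comparing the clusters with known diagnoses. Note that with only two output nodes, any radius of 1 or more updates both nodes at once, so the neighbourhood only becomes selective once D(t) has shrunk to 0.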

After the training has been completed in the above step, the
identified testing samples are given as input to the trained
network. The classification is validated against the actual
value and the accuracy is calculated.

Fig.2. Classification of cancer patient

2.4 Data Mining
The hidden pattern representing the knowledge from the
Kohonen map is extracted by means of the resulting weight
vector. The weight values, along with their corresponding
causes and symptoms, are then analyzed by the doctors so that
the root cause of the disease can be found. The patient can
then be treated according to the diagnosis result.

3. CONCLUSION
In this paper, we have proposed a learning method based on
unsupervised learning which can be used in building a
predictive model for early detection of lung cancer. We also
showed that this ANN can be used to predict the disease even
with the occurrence of new symptoms. Also, the disease can be
further analyzed by extracting the resultant weight vector
after the training process.

References
[1] P. Wang, X. Chen, F. Xu, D. Lu, W. Cai, K. Ying,
Y. Wang and Y. Hu, Development of Electronic Nose for
Diagnosis of Lung Cancer at Early Stage, Proceedings of the
5th International Conference on Information Technology and
Application in Biomedicine, Shenzhen, China, May 30-31, 2008.
[2] R. Smith-Bindman, J. Lipson, R. Marcus, et al. (December
2009). Radiation dose associated with common computed
tomography examinations and the associated lifetime
attributable risk of cancer. Arch. Intern. Med. 169
(22):2078-86.
[3] C. Je-Yoel and S. Hye-Jin, Proteomic approaches in lung
cancer biomarker development, PubMed, 2009 Feb;6(1):27-42.
doi: 10.1586/14789450.6.1.27.
[4] S. Mai, T. Tim, S. Rob, Using Data Mining Techniques in
Heart Disease Diagnosis and Treatment, Japan-Egypt Conference
on Electronics, Communications and Computers, 2012.
[5] T. Jia, Y. Wei, D. Wu, A Lung Cancer Lesions Detection
Scheme Based on CT Image, 2nd International Conference on
Signal Processing Systems (ICSPS), 2012.
[6] L. Yang, Y. Jinzhu, Z. Dazhe, A Method of Pulmonary Nodule
Detection utilizing multiple Support Vector Machine,
International Conference on Computer Application and System
Modelling, 2010.

[7] M. Iain, M. Darren, F. Mary-Jane, Classification of
Cancer Stage from Free-text Histology Reports, Proceedings of
the 28th IEEE EMBS Annual International Conference, New York
City, USA, Aug 30-Sept 3, 2006.
[8] D. Siqi, H. Tianlin, S. Yang, L. Chun, H. Yuanqing,
Detection of Lung Cancer with Breath Biomarkers Based on SVM
Regression, Fifth International Conference on Natural
Computation, 2009.
[9] D. J. Hand, Data mining: Statistics and more?, The
American Statistician, Vol. 52, No. 2, May 1998, pp. 112-118.
[10] O. Jason, A. Syed, Data Mining Using Self Organizing
Kohonen maps: A Technique for Effective Data Clustering &
Visualization, International Conference on Artificial
Intelligence (IC-AI'99), June 28-July 1, 1999, Las Vegas.
[11] T. Kohonen, Self-Organization and Associative Memory,
3rd ed., Berlin: Springer-Verlag, 1989.
