
AIML 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt

WEB MINING BASED ON GENETIC ALGORITHM

M. H. Marghny and A. F. Ali, Dept. of Computer Science, Faculty of Computers and Information, Assiut University, Egypt. Email: marghny@acc.aun.edu.eg

Abstract

As the web continues to increase in size, the relative coverage of any single web search engine is decreasing, and search tools that combine the results of multiple search engines are becoming more valuable. We propose a framework for web mining (the application of data mining and knowledge discovery techniques to data collected on the World Wide Web (WWW)) and a genetic search strategy for search engines, showing that important relations exist between statistical studies of the web and standard optimization techniques. It is straightforward to define an evaluation function that is a mathematical formulation of the user request, and to define a steady-state genetic algorithm (GA) that evolves a population of pages with binary tournament selection. Creation of individuals is performed by querying standard search engines. The crossover operator, applied with crossover probability Pc, selects two parent individuals (web pages) from the population, chooses one crossover position within the page at random, and exchanges the links after that position between the two individuals. We present a comparative evaluation performed with the same protocol as used in optimization. Our tool leads to pages of qualities that are significantly better than those of the standard search engines.

Keywords: search engines, metasearch, crossover, genetic algorithm, web mining.

1. Introduction

The WWW is an information environment made of a very large distributed database of heterogeneous documents, using a wide area network (WAN) and a client-server protocol. The structure of this environment is that of a graph, where nodes (web pages) are connected by edges (hyperlinks). The typical strategy for accessing information on the WWW is to navigate across documents through hyperlinks, retrieving the information of interest along the way. A metasearch engine searches the web by making requests to multiple search engines such as AltaVista, Yahoo, etc.


The results of the individual search engines are combined into a single result set. The advantages of metasearch engines include a consistent interface to multiple engines and improved coverage. Genetic search is characterized by the fact that a number N of potential solutions of an optimization problem simultaneously sample the search space. We assume that it is possible to perform additional computation on the results from standard search engines, a capability that standard search engines themselves lack. This may consist, for instance, in formulating a "richer" request to download the pages in order to analyze their content more thoroughly [1], in proposing a textual clustering of the results [2], or in performing additional search with a given strategy [3]. In this paper we deal with the last point, and we make use of the optimality of genetic algorithms [4] with respect to finding the most interesting pages for the user. From this intuitive view, we show that GAs, and more generally evolutionary algorithms, can positively contribute to the problem of defining an efficient search strategy on the web. Section 2 formalizes the problem: we treat it as an optimization problem by relating concepts used in optimization to concepts used in studies of web statistical properties and of web search. Section 3 contains the principles of our GA that evolves a population of web pages. Section 4 reports the experimental tests, comments, and comparisons with metasearch. Section 5 contains conclusions.

2. Web Search As An Optimization Problem

We argue that web search can be seen as a standard optimization problem, and may thus benefit from knowledge gained in previous studies in optimization. We establish a parallel between web search and the general problem of function optimization. Recent statistical studies have modeled the web as a graph in which the nodes are web pages and the edges are the links that exist between these pages [5-6]. The search space S of our optimization problem is the set of web pages, and it is structured with a neighborhood relationship V: S → S^k (the links going out of a page). We associate to S an evaluation or fitness function which can numerically evaluate web pages. A search engine tries to output pages which maximize this function, and thus tries to solve that optimization problem. To scan S, optimization algorithms and search engines both make use of the following similar search operators:

Creation operators that initialize points from S. In optimization, random generation is a common creation operator; in the web context, randomly generating IP addresses has already been studied [1] for other purposes, but it only yields a valid web server with a low chance (about one in several hundred), so this kind of random creation operator does not seem suitable for web search. In optimization, another example of a creation operator is the use of a heuristic that builds a solution from the description of the problem. Many search engines, whether based on metasearch or on agents, use such an operator for the web by querying one or more index-based search engines and outputting the obtained links. From the evolutionary computation point of view, this operator would be used for the initial generation of the population.

Operators that modify existing points in the population. Web robots, and more generally web agents [7-8], use such a strategy by exploring the links found in pages. From this point of view, a standard heuristic in optimization such as hill climbing can be directly adapted to the web: starting from a given page, explore its links and select the best one according to the presented fitness function F in order to define a new starting point. A minimal sketch of this adaptation appears below.
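To make the adaptation concrete, here is a minimal C++ sketch of such a web hill climber; it is our illustration, not code from the paper. The web is abstracted as an adjacency map and the fitness F as a callback, both of which are assumptions made for self-containment.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// The web graph abstracted as page -> out-links (our simplification).
using WebGraph = std::map<std::string, std::vector<std::string>>;
// The fitness function F of Section 3.1, abstracted as a callback.
using Fitness = std::function<double(const std::string&)>;

// Start from a page, repeatedly move to its best-scoring out-link,
// and stop at a local optimum of F.
std::string hillClimb(const WebGraph& web, const Fitness& f,
                      std::string current) {
    double best = f(current);
    for (;;) {
        auto it = web.find(current);
        if (it == web.end()) break;             // no known out-links
        std::string bestNeighbor;
        double bestScore = best;
        for (const std::string& link : it->second) { // neighborhood V
            double s = f(link);
            if (s > bestScore) { bestScore = s; bestNeighbor = link; }
        }
        if (bestNeighbor.empty()) break;        // local optimum reached
        current = std::move(bestNeighbor);      // new starting point
        best = bestScore;
    }
    return current;
}
```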

3. The Proposed GA

The main steps of the proposed GA are:

1- Get the user request and define the evaluation function F.
2- t ← 0 (iteration no. = 0, pop size = 0).
3- Initialize P(t).
4- Evaluate P(t) (pages from the standard search engines).
5- Generate an offspring page O.
6- t ← t + 1 (new population).
7- Select P(t) from P(t-1).
8- Crossover P(t).
9- Evaluate P(t).
10- Go to step 5 while the termination condition (number of iterations) is not met.
11- Sort P(t) (sort the pages given to the user in descending order according to their quality values).
12- Stop and give the output P(t) to the user.

This algorithm combines the concepts described in the previous section with those of a steady-state GA [9]. An individual in the population is a web page that can be numerically evaluated with a fitness function. Initially, the first individuals are mostly generated with a heuristic creation operator which queries standard search engines to obtain pages. Then the individuals can be selected or deleted according to their fitness, and can give birth to offspring through the selection/crossover operators. Crossover of parent pages proceeds in two stages:

Page selection steps:

1- For each page in the population, generate a random floating-point number r in the interval [0,1] (one random number per page).
2- If r < Pc, select the page for crossover.
3- If the number of selected pages is odd, remove one selected page (this choice is made randomly).

Link selection steps of the parent pages:

1- Generate a random integer d ranging from 1 to the total number of links per page, to determine the crossover point.
2- Exchange the links after the crossover point between the two selected parent pages.

A sketch of both stages is given below.
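The following minimal C++ sketch illustrates the two stages; it is our illustration, not the authors' code. We assume a page is represented simply by its list of links, and we draw d so that at least one link is always exchanged, whereas the paper only states that d ranges over the links of a page.

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Our minimal page representation: an individual is its list of links.
struct Page { std::vector<std::string> links; };

// Stage 1: select pages with probability Pc; trim an odd selection.
std::vector<std::size_t> selectForCrossover(std::size_t popSize, double pc,
                                            std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < popSize; ++i)
        if (u(rng) < pc) selected.push_back(i);   // step 2: r < Pc
    if (selected.size() % 2 != 0) {               // step 3: need pairs
        std::uniform_int_distribution<std::size_t> pick(0, selected.size() - 1);
        selected.erase(selected.begin() + pick(rng));
    }
    return selected;
}

// Stage 2: one-point crossover exchanging the links after point d.
void singlePointCrossover(Page& a, Page& b, std::mt19937& rng) {
    std::size_t m = std::min(a.links.size(), b.links.size());
    if (m < 2) return;                            // nothing to exchange
    std::uniform_int_distribution<std::size_t> dDist(1, m - 1);
    for (std::size_t i = dDist(rng); i < m; ++i)  // swap tail links
        std::swap(a.links[i], b.links[i]);
}
```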

From an intuitive point of view, the behavior of this search algorithm can range from a metasearch engine (with Pc = 0), which only analyzes/evaluates the results of standard search engines, to a search engine which explores in parallel as many local links as possible (Pc = 1), with selective pressure guiding the search through the links. When Pc = 0, the selection of the GA alone decides about the survival of a page in the population and about its number of offspring. Pc thus controls the intensity with which pages are explored (e.g. Pc = 0.25 means we select 25% of the pages for the crossover operator). As far as we know, other applications of GAs to web-centered problems include [7-8,10-14]. For instance, [7] presented an adaptive search with a population of agents, where agents are selected according to the relevance of the documents returned to the user. Our approach models the problem at a level that is closer to the fitness landscape: the GA does not optimize the parameters of searching agents but rather deals directly with points in the search space. The sketch below ties the pieces together.
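Here is one plausible shape of the overall loop (steps 1-12 above), a sketch under our assumptions rather than the authors' implementation: it reuses Page, selectForCrossover and singlePointCrossover from the previous listing, and a pageQuality function implementing Eq. (2) (see the listing in Section 3.1). Pairing the selected pages consecutively and using a fixed iteration count are our simplifications.

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Assumes Page, selectForCrossover, singlePointCrossover (above) and
// pageQuality (Section 3.1 listing) are in scope.
void geneticSearch(std::vector<Page>& pop,
                   const std::vector<std::string>& keywords,
                   double pc, int iterations, std::mt19937& rng) {
    for (int t = 0; t < iterations; ++t) {          // steps 5-10
        auto mates = selectForCrossover(pop.size(), pc, rng);
        for (std::size_t i = 0; i + 1 < mates.size(); i += 2)
            singlePointCrossover(pop[mates[i]], pop[mates[i + 1]], rng);
    }
    // Step 11: present pages in descending order of quality F(P).
    std::sort(pop.begin(), pop.end(),
              [&keywords](const Page& a, const Page& b) {
                  return pageQuality(a.links, keywords) >
                         pageQuality(b.links, keywords);
              });
}
```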

3.1. Evaluation Function According To The User Request

The fitness function F that evaluates web pages is a mathematical formulation of the user query, and numerous evaluation functions are possible. We have used a function F close to the evaluation functions used in standard search engines. First, let us define the following quantities in their simplest forms, for practical considerations.

1) Link quality F(L)

F(L) = \sum_{i=1}^{n} \#(k_i) \quad (1)

where n is the total number of input keywords, \#(k_i) is the number of occurrences of keyword k_i in the link, and k_1, k_2, k_3, … are the keywords given by the user.

2) Page quality F(P)

F(P) = \sum_{j=1}^{m} F_j(L) \quad (2)

where m is the total number of links per page.

3) Mean quality function M_q

M_q = \frac{F_{\max}(P) + F_{\min}(P)}{2} \quad (3)

where F_{\max}(P) and F_{\min}(P) are the maximum and minimum values of the page qualities, respectively, after applying the GA. It should be noted that the upper value of F_{\max}(P) is m·n and the least value of F_{\min}(P) is zero; hence the upper limit of M_q is (m·n)/2. Application of the GA to web pages will increase the qualities of some pages and decrease those of others.
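As an illustration, Eqs. (1)-(3) translate directly into code. The C++ sketch below is ours; in particular, matching keywords by raw substring search is an assumption, since the paper does not specify how occurrences are counted.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Eq. (1): F(L) = sum over the n keywords of #(k_i) in the link text.
int linkQuality(const std::string& link,
                const std::vector<std::string>& keywords) {
    int q = 0;
    for (const std::string& k : keywords) {
        if (k.empty()) continue;         // guard against empty keywords
        for (std::size_t pos = link.find(k); pos != std::string::npos;
             pos = link.find(k, pos + k.size()))
            ++q;                         // one occurrence of k_i found
    }
    return q;
}

// Eq. (2): F(P) = sum of F(L) over the m links of the page.
int pageQuality(const std::vector<std::string>& links,
                const std::vector<std::string>& keywords) {
    int q = 0;
    for (const std::string& l : links) q += linkQuality(l, keywords);
    return q;
}

// Eq. (3): Mq = (Fmax(P) + Fmin(P)) / 2 over the population's qualities.
double meanQuality(const std::vector<int>& qualities) {
    auto [mn, mx] = std::minmax_element(qualities.begin(), qualities.end());
    return (*mx + *mn) / 2.0;            // caller ensures non-empty input
}
```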

3.2. Crossover Operators And Other Search Engines

We use a heuristic creation operator that outputs a web page from the results given by four standard search engines (AltaVista, Google, MSN, Yahoo). It consists of querying each search engine with the keywords (k_1, k_2, …) and extracting the results. The links found are stored in a list sorted in the order given by the engines (1st link of the 1st engine, 1st link of the 2nd engine, …, then the 2nd link of the 1st engine, and so on). Each time the creation operator is called, the next link on this list is given as output. When none of the engines can provide further links, the creation operator is no longer used and is replaced by the crossover operator. This creation operator allows the genetic search to start from points of good quality; as the results below show, these heuristically generated individuals (pages) can be greatly improved by the crossover operator. From two selected parent pages, the crossover operator generates offspring by combining the parents (exchanging the links between the two pages after the crossover point). In order to speed up page evaluation, only mismatched links are considered, and links having maximum quality are transferred directly to the output list before applying the GA. A sketch of the list-interleaving step follows.
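The round-robin merge of the four result lists can be sketched as below; the container-based interface is our assumption, and actually querying the engines is outside the scope of this sketch.

```cpp
#include <algorithm>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Interleave the engines' result lists round-robin: 1st link of each
// engine in turn, then 2nd link of each engine, and so on.
std::deque<std::string> interleaveResults(
        const std::vector<std::vector<std::string>>& engineResults) {
    std::size_t longest = 0;
    for (const auto& r : engineResults)
        longest = std::max(longest, r.size());
    std::deque<std::string> merged;
    for (std::size_t rank = 0; rank < longest; ++rank)
        for (const auto& r : engineResults)
            if (rank < r.size()) merged.push_back(r[rank]);
    return merged;
}

// Each call to the creation operator pops the next link; an empty
// string signals that the list is exhausted and crossover takes over.
std::string nextCreatedLink(std::deque<std::string>& merged) {
    if (merged.empty()) return {};
    std::string link = std::move(merged.front());
    merged.pop_front();
    return link;
}
```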

4. Results And Comments

4.1. Settings of The Experiment

The proposed GA has been implemented in C++. The program was tested on a PC with a Pentium IV 2.8 GHz processor, 256 MB of RAM, and an 80 GB, 7200 rpm hard disk. For each problem, we used 300 pages downloaded from the four standard search engines (Yahoo, Google, AltaVista, MSN); they were stored on the hard disk for further operations. Tabulated results are averaged over 5 runs.

4.2. Results

Table 1 shows the results of ten queries with different keywords at different values of Pc. Pc = 0 means the results come from the standard search engines without applying the GA; the other values (Pc = 0.25, 0.5, 0.75, 1) are obtained after applying our algorithm. These results show the population averaged mean quality for different population sizes after 3000 iterations applied on 100 pages. Referring to Fig. 1 for query no. 1, we note that if the population size is too small (for example 20 pages), the GA decreases the quality of the results, while if the population size is large (for example 100 pages), the binary selection does not concentrate the search on the important pages. We also note that when Pc is large (i.e. Pc = 0.75 or 1), the search algorithm spends more time on unsuccessful link exchanges; to improve the results at these values of Pc, the number of iterations should be increased, and as a result the execution time increases rapidly, as shown in Fig. 5. Figures 2, 3, 4 and 5 illustrate these observations.

5. Conclusions

We have shown experimentally the relevance of our approach on the presented queries by comparing the qualities of the output pages with those of the originally downloaded pages. As the number of iterations increases, better results are obtained, still within reasonable execution time. A small number of pages and a small Pc limit the chances of improving the page qualities, while reducing the execution time at a specified number of iterations. It should be noted that the results depend on how the pages under test are constituted: here, a page consists of links downloaded sequentially from Yahoo, AltaVista, Google and MSN (one link per search engine each time).



Table 1. Comparative results for M_q (values are the averaged mean quality).

| No | Keywords (k_1, k_2, …) | Pc=0 | Pc=0.25 | Pc=0.5 | Pc=0.75 | Pc=1 |
|----|------------------------|------|---------|--------|---------|------|
| 1  | Web mining with genetic algorithm | 12.5 | 17.1 | 18 | 17.5 | 15.9 |
| 2  | Low pass filters operational amplifier | 14 | 18.1 | 19.3 | 18.7 | 16.2 |
| 3  | Network security notes | 11 | 15.3 | 16.1 | 15.6 | 13.1 |
| 4  | Improving search engine results with genetic algorithm | 17.1 | 23.1 | 25.3 | 24.1 | 19.2 |
| 5  | Egyptian football players | 11.1 | 15.1 | 16.5 | 15.7 | 13.5 |
| 6  | Implementing and supporting Microsoft Windows XP Professional | 18.3 | 24.1 | 26.3 | 25.3 | 21.1 |
| 7  | Microsoft Internet Security and Acceleration Server 2000 (ISA) | 21 | 27.6 | 29.1 | 28.1 | 24.1 |
| 8  | Implementing, managing and maintaining a Microsoft Windows Server 2003 network infrastructure | 28 | 36.3 | 39.1 | 34.2 | 30.1 |
| 9  | Planning, implementing and maintaining a Microsoft Windows Server 2003 Active Directory | 29 | 38.1 | 42.3 | 35.1 | 32 |
| 10 | Oracle internet application developer track | 14 | 19.1 | 21.1 | 20 | 16.1 |

[Plot omitted: mean quality (y-axis) vs. number of downloaded pages (x-axis, 0-200), with curves for Pc = 0, 0.25, 0.5, 0.75 and 1.]

Figure 1. Population averaged mean quality for different values of population size at 3000 iterations.

[Plot omitted: mean quality (y-axis) vs. number of iterations (x-axis, 0-12000), with curves for Pc = 0.25, 0.5, 0.75 and 1.]

Figure 2. Population averaged mean quality for different numbers of iterations at 20 pages.


[Plot omitted: mean quality (y-axis) vs. number of iterations (x-axis, 0-12000), with curves for Pc = 0.25, 0.5, 0.75 and 1.]

Figure 3. Population averaged mean quality for different numbers of iterations at 120 pages.

[Plot omitted: mean quality (y-axis) vs. number of iterations (x-axis, 0-12000), with curves for Pc = 0.25, 0.5, 0.75 and 1.]

Figure 4. Population averaged mean quality for different numbers of iterations at 250 pages.

[Plot omitted: time in seconds (y-axis, 0-300) vs. number of iterations (x-axis, 500-3000), with curves for Pc = 0.25, 0.5, 0.75 and 1.]

Figure 5. Variation of time (sec.) with number of iterations using 250 pages.


6. References

[1] Lawrence S. and Giles C.L. Text and image meta-search on the web. International Conference on Parallel and Distributed Processing Techniques and Applications, 1999.

[2] Zamir O. and Etzioni O. Grouper: a dynamic clustering interface to web search results. Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.

[3] Picarougne F., Monmarché N., Oliver A. and Venturini G. Search of information on the Internet by evolutionary algorithm, 2002.

[4] Holland J.H. Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press, 1975.

[5] Albert R., Jeong H. and Barabási A.-L. Diameter of the World Wide Web. Nature, 401:130-131, 1999.

[6] Broder A., Kumar R., Maghoul F., Raghavan P., Rajagopalan S., Stata R., Tomkins A. and Wiener J. Graph structure in the Web. Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.

[7] Menczer F., Belew R.K. and Willuhn W. Artificial life applied to adaptive information agents. AAAI Spring Symposium on Information Gathering from Distributed, Heterogeneous Databases, AAAI Press, 1995.

[8] Moukas A. Amalthaea: information discovery and filtering using a multiagent evolving ecosystem. Applied Artificial Intelligence, 11(5):437-457, 1997.

[9] Whitley D. The GENITOR algorithm and selective pressure: why rank-based allocation of reproductive trials is best. Proceedings of the Third International Conference on Genetic Algorithms, J.D. Schaffer (Ed.), Morgan Kaufmann, 1989, pp. 116-124.

[10] Fan W., Gordon M.D. and Pathak P. Automatic generation of a matching function by genetic programming for effective information retrieval. Proceedings of the 1999 Americas Conference on Information Systems, pp. 49-51.

[11] Monmarché N., Nocent G., Slimane M. and Venturini G. Imagine: a tool for generating HTML style sheets with an interactive genetic algorithm based on genes frequencies. 1999 IEEE International Conference on Systems, Man, and Cybernetics (SMC'99), Interactive Evolutionary Computation session, October 12-15, 1999, Tokyo, Japan.

[12] Morgan J.J. and Kilgour A.C. Personalising information retrieval using evolutionary modelling. Proceedings of PolyModel: Applications of Artificial Intelligence, A.O. Moscardini and P. Smith (Eds.), pp. 142-149, 1996.

[14] Sheth B.D. A learning approach to personalized information filtering. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, 1994.

[15] Vakali A. and Manolopoulos Y. Caching objects from heterogeneous information sources. Technical Report TR99-03, Data Engineering Lab, Department of Informatics, Aristotle University, Greece, 1999.