
Journal of Informatics & Electronics, Vol.1, No.2, pp.49-59.

October, 2006

An Effective Content-based Recommendation Method for Web Browsing Based on Keyword Context Matching
Feng-Hsu Wang, *Shih-Yao Jian
Department of Computer Science and Information Engineering, Ming Chuan University
E-mail: fhwang@mcu.edu.tw, r1450063@ss23.mcu.edu.tw

Abstract This paper presents a novel content-based recommendation method which recommends web pages resembling a user's recent interests in a web site. Traditionally, a web page is recommended based on a comparison between a user's profile and web contents that are represented as a set of feature keywords. This paper proposes a new approach to representing and extracting page features called keyword contexts. A keyword context is a set of discriminating words that often occur together in the web pages and can therefore capture more semantic information than a single keyword. The keyword contexts are extracted by analyzing web pages with a new feature extraction method that combines IR (Information Retrieval) and association mining techniques. Three methods for establishing a user's interest profile based on the keyword contexts of recently visited web pages are proposed and compared. Web pages most similar to the user's interest profile in terms of keyword contexts are then recommended. Finally, an application of this recommendation mechanism to an e-learning web site is presented. The experimental results show that the method outperforms the pure-keyword approach.

Keywords: Web browsing, recommender system, content-based recommendation, association mining, keyword context

1. INTRODUCTION
The Internet has spurred the fast development of web sites equipped with rich resources in a variety of application sectors. However, on-line readers often get lost in such an environment due to its complicated structure and huge amount of information. Therefore, a design method that can adapt a Web site to user needs is of great importance for improving the usability and user retention of the Web site. The success of such an adaptation feature, also called Web personalization [32], relies heavily on the system's capability to anticipate users' future needs. Web personalization already finds important applications in e-business (such as Amazon.com and google.com), e-learning and so on. The problem of personalized recommendation according to personal needs has been studied extensively [2][3][6][9][13][16][26][42][47][49]. Two major paradigms have emerged. In the content-based recommendation paradigm [1][6][13][21][31], one tries to

recommend pages that are similar to those the user was interested in (e.g., browsed) in the past, while in the collaborative recommendation paradigm [4][12][20][23], one identifies the users whose interests are similar to the target user's and then recommends the pages those similar users are interested in. Trust-based recommendation algorithms are another approach to improving the robustness of collaborative recommendation tasks [34][35]. Several efforts to combine the two mechanisms have also been made [2][16][26]. While each approach has its own advantages and disadvantages, this paper focuses on the exploration of a content-based recommendation method which recommends web pages according to a user's interest profile. Specifically, in the content-based recommendation paradigm, a web page is recommended based on a matching between the user's profile and web contents that are both represented as a set of feature keywords. To improve the effectiveness of the matching task, it was suggested that the original document keywords be expanded by adding related or associated terms not originally present in the available


text samples [38]. These associative text processing methods may include thesaurus operations [43], automatic term associations [11] and term phrase generation [10]. In very small collections with narrow, well-defined domains of discourse and small vocabularies, NLP methods using linguistically parsed phrases could also produce significant retrieval performance improvements [15]. However, evaluation results for the keyword expansion approaches in more general areas, by either thesaurus operations or automatic term associations, were disappointing [37]. Throughout the last three decades in the Information Retrieval arena, there have been repeated attempts to incorporate the concept of phrases, i.e., combinations of words, in the automatic indexing and retrieval processes for full-text applications [39][40][17][8]. However, the improvements gained through the use of term phrases were counter-intuitively shown to be quite limited [8]. Croft (1991) showed that the most successful automatic phrase generation system they tested uses Ken Church's stochastic tagger [7], which was employed by Croft only to generate noun phrases. The stochastic tagger is based on so-called "shallow" lexical analysis as well as probability tables of contextual occurrence to assign parts of speech to words. Unfortunately, for larger unrestricted collections, most algorithmic phrase production methods that produce indexing features composed of more than one word have not had much success [38][10][45]. In summary, the basic theories needed to construct useful term grouping schemes and thesauruses valid for particular subject areas are not sufficiently developed. As a result, the effectiveness of associative retrieval techniques based on term grouping and vocabulary expansion leaves something to be desired [40].

This paper proposes a novel keyword context approach that is similar to the term phrase generation approach [10], but differs in the way associative terms are found. The approach adopts keyword contexts to represent page features. A keyword context is a set of discriminating words that often occur together in the web pages and can capture more semantic information than a single keyword. The keyword contexts are extracted by analyzing web pages with a new feature extraction method that combines IR (Information Retrieval) and association mining techniques. Another important issue in content-based recommendation is establishing the user interest profile; approaches that respond properly to user interest changes are required. This paper proposes and compares three mechanisms for establishing a user profile according to the navigation path of a user. Finally, the personalized recommendation method is applied as part of an intelligent navigation guider in an e-learning web site, and different recommendation policies and effectiveness measures are investigated. The effectiveness of the recommendation with the keyword context approach is compared to that with the pure keyword approach. The results show that the recommendation model built with keyword contexts is more effective.

The subsequent sections are organized as follows. Section 2 depicts the basics of recommendation systems, keyword extraction and association mining. Section 3 presents the details of the personalized recommendation based on the keyword context approach.
Section 4 gives a description of the performance criteria. Section 5 describes the various recommendation strategies and the preprocessing of navigation sessions. Section 6 depicts the application of the personalized recommendation method to an e-learning web site, where experiments are conducted and analyzed on real-world data sets collected from the site. Finally, we make some remarks on the limitations of the method and portray some future work.

2. BACKGROUND
2.1 Recommendation Systems
The popularization of computers and the Internet has resulted in an explosion in the amount of digital information. As a result, it becomes more important and more difficult to retrieve the proper information adapted to user preferences [32][36]. In general, there are two types of recommendation systems: collaborative filtering systems [33][30] and content-based filtering systems [1][31][29][21].

2.1.1 Collaborative Recommendation
In collaborative filtering, items (e.g., web pages) are recommended to a particular user when other similar users also prefer them. The definition of similarity among users depends on the application. For example, the similarity may be defined as users having similar ratings of items or users having similar navigation behavior. This kind of recommendation system was the first to use artificial intelligence techniques for personalization [36]. A collaborative filtering system collects information about users' activities on the web site and calculates the similarity among the users. If some users have similar behavior, they are categorized into the same user group. When a user logs into the web site again, the


system will first compute the group most similar to the user, using methods like the k-nearest neighbors, and then recommend to the user the items that the group members prefer. Examples of collaborative recommendation systems include the Amazon Net Book Store, Tapestry, Firefly, Referral Web, PHOAKS, Siteseer, GroupLens, Ringo and so on. However, a pure collaborative filtering system has several drawbacks and issues: the coverage of item ratings can be very sparse, yielding poor recommendation efficiency; it is difficult to provide services for users who have unusual tastes; and it is hard to cluster and classify users with changing and/or evolving preferences.
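As a rough illustration of the collaborative scheme sketched above (it is not part of the method proposed in this paper), the following Python fragment scores items for a target user from the ratings of the k most similar users; the rating matrix, the user identifiers and the choice of cosine similarity are all illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 if either is all zeros)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def knn_recommend(ratings, target, k=2, top_n=2):
    """User-based collaborative filtering: find the k users most similar to
    `target` and rank the items they liked that the target has not rated."""
    sims = {u: cosine(ratings[target], r)
            for u, r in ratings.items() if u != target}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    scores = np.zeros_like(ratings[target], dtype=float)
    for u in neighbors:
        scores += sims[u] * ratings[u]          # similarity-weighted votes
    scores[ratings[target] > 0] = -np.inf       # drop items already seen
    return np.argsort(scores)[::-1][:top_n]

# Illustrative 4-user x 5-item rating matrix (rows are hypothetical users).
ratings = {
    "u1": np.array([5, 4, 0, 0, 1]),
    "u2": np.array([4, 5, 1, 0, 0]),
    "u3": np.array([0, 1, 5, 4, 0]),
    "u4": np.array([0, 0, 4, 5, 3]),
}
print(knn_recommend(ratings, "u1"))   # -> [2 3], indices of unseen items
```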
Table 2.1 Comparison between content-based recommendation and collaborative recommendation [44].

Advantages of content-based recommendation:
1. A user can receive proper recommendations without help from other users.
2. It is more feasible to tackle the problems of multiple user interests and interest transference by monitoring the change and evolution of user profiles.

Advantages of collaborative recommendation:
1. A user may have the chance to receive items that s/he has never contacted before, but which may be of his/her potential interest.
2. It facilitates the sharing of knowledge and/or experiences among users having similar interests.

Limitations of content-based recommendation:
1. Some types of item content (e.g., multimedia) are not easy to analyze.
2. A user can only receive items that are similar to his/her past experiences.

Limitations of collaborative recommendation:
1. It is hard to provide recommendations for users who have unusual preferences.
2. It is hard to cluster and classify users with changing and/or evolving preferences.

2.1.2 Content-based Recommendation Systems
Content-based recommendation techniques are based on a content analysis of the target items. For example, term frequency analysis for text documents is a well-known content analysis method. In content-based filtering systems, recommendations are provided to a user based solely on a user profile built by analyzing the content of items that the user rated in the past and/or the user's personal information and preferences. The user's profile can be constructed by analyzing the user's responses to a questionnaire, item ratings, or the user's navigation information to infer the user's preferences and/or interests. Examples of content-based systems include InfoFinder [19], WebWatcher, NewsWeeder [22] and so on. However, a pure content-based filtering system also has several shortcomings and open issues: only a very shallow analysis of specific kinds of content (text documents, etc.) is available; users can receive only recommendations similar to their earlier experiences; and item rating information suffers from the sparseness problem [20][23]. As a summary, Table 2.1, adapted from [44], shows a brief comparison between the two recommendation methods.



2.2 Keyword Extraction from Text Documents


One important research issue related to content-based recommendation is the keyword analysis of text documents, so that their characterization can be extracted and represented. Often some weighting scheme is used to select discriminating words [29]. Some researchers adopt the multinomial text model [28], in which a document is modeled as an ordered sequence of word events drawn from the same vocabulary set. A naive Bayesian text classifier is

trained to represent user interests and to produce rankings of books that conform to the user's preferences [31]. The naive Bayes assumption states that the probability of each word event is dependent on the document class but independent of the word's context and position. While this assumption might be valid for their book recommendation case, it is not applicable to the web page recommendation situation considered in this paper, since no pre-defined document classes are specified for each content page. Three keyword extraction approaches have been proposed in the literature. The first is the dictionary approach, which exploits a pre-established term dictionary to perform keyword matching against the documents [5][14]. The advantages of this approach are fast extraction speed and easy implementation. However, the size and domain of the dictionary determine the success of this approach, and building and maintaining a large dictionary is a costly and time-consuming task. The second is the linguistic approach, which exploits natural language processing techniques to parse the sentences in


a document and extract phrases including nouns, verbs, pronouns and so on. Some linguistic heuristics are then adopted to extract meaningful phrases, filtering out unsuitable terms. This approach relies on the existence of a term dictionary, so it has the same drawbacks as the dictionary approach. The last is the statistical approach, which extracts keywords using numerical information about the text such as term frequencies, the co-occurrence degree between words and so forth. The most noticeable advantage of this approach is its independence from dictionaries and linguistic parsers, so new words can be extracted without referencing any dictionary. In principle, keywords are viewed as representative and meaningful vocabularies (and/or phrases) of documents. However, the identification of meaningful vocabularies is subjective and dependent on the subject area. In the literature on natural language processing of text documents, meaningful vocabularies can be processed by first applying term determination methods to extract important terms and/or term phrases, which can be obtained by semantic matching and analysis of terms. Term semantic matching is the basis of all semantic matching methods. Traditionally, terms are viewed as the smallest unit of semantics, but there is now a trend to view terms as combinations of many semantic features. The semantic analysis of terms includes the determination and linguistic tagging of terms, syntactic analysis and semantic analysis [24][25].

The establishment of user profiles is another important factor that has significant effects on recommendation effectiveness. In [41], the user navigation history was used as the major information source for establishing user interest profiles. Their system, called Letizia, is able to predict changes of user interests by observing the recent preferences of a user. The user profile consists of a set of weighted keywords obtained through tf.idf analysis of Web pages. The established user profile can be exploited to facilitate the recommendation of web pages by matching the user profile against content pages. More complicated systems might combine both content-based and collaborative-filtering recommendation mechanisms, such as the Fab system [1], which matches the user profile against those of other users in the database to find similar users, and then recommends content pages from the lists of pages of the similar users.

3. RECOMMENDATION MECHANISM BASED ON KEYWORD CONTEXTS


The main contribution of this paper is a process to extract important information from web pages, which might be in Chinese, and a process to establish a user interest profile from the user's navigation history. The user profile is updated to reflect the most recent state and changes of the user's interest so that the content pages that best meet the user's needs can be recommended. Specifically, the novel approach exploits keyword contexts, instead of keywords, as the basic information unit to represent content characteristics. This approach is similar to the term phrase approach in the literature [10], but is distinct in the way that phrases are formulated. The N-gram approach [10] formulates term phrases from consecutive words in the document, and then tests their significance with some statistical computations. Our method generates the keyword contexts by the association mining technique, which reveals highly frequent co-occurrences of a set of words in a sentence. Figure 3.1 shows the web-page recommendation process based on the keyword contexts.
Figure 3.1 The recommendation process based on keyword context matching. Keyword contexts are extracted from the web content pages into a list of keyword contexts; together with a user's navigation sequence, they are used to model the user interest profile based on keyword contexts; page recommendation is then performed by profile matching, and the recommended page list is presented to the browsing user.

k11, k12, ..., k1x; k21, k22, ..., k2y; ...; km1, km2, ..., kmw
Figure 3.2 The source keywords in a page, where a semicolon represents the end of a sentence.



3.1 Extraction of Keyword Contexts


To extract keyword contexts from web content pages, which might be written in Chinese, we first perform sentence punctuation and word segmentation using the CKIP Lexicon and Chinese Grammar system [27], developed by the Chinese Knowledge and Information Processing Group (CKIP) of Academia Sinica in Taiwan. After the word segments (keyword candidates) are generated, the tf.idf technique is applied to filter out unimportant keywords. After the filtering process, all pages are represented only by the remaining set of keywords, retaining the original sentence structure, as shown in Figure 3.2. Each sentence forms a context unit of the source keywords. Association data mining is then applied to these sentences to discover frequently co-occurring keywords within sentences. Such a set of frequently co-occurring keywords in one sentence is called a keyword context. Specifically, the extraction of keyword contexts follows these steps:

Step 1: The web pages in a web site may include documents in a variety of forms such as text, images and so on. We focus on analyzing the text information in an HTML web page, such as the text phrases in a paragraph, the caption titles of an embedded figure and so on.

Step 2: Divide each page into a set of sentences separated by the various punctuation marks such as the dot ("."), the full stop ("。"), the semicolon ("；") and the question mark ("?").

Step 3: Perform word segmentation on each sentence using the CKIP Lexicon and Chinese Grammar system [27]. The result is a set of keyword candidates for each sentence.

Step 4: Apply the TFIDF and stop-word list techniques to the keyword candidates to filter out unimportant ones. The remaining keywords form the set of elementary keywords, denoted as E, for building keyword contexts.

Step 5: Apply association data mining to the elementary keywords with the IBM Intelligent Miner, viewing each sentence as a transaction of keywords. For example, if keywords A and B are found to occur together frequently in one sentence, they can be viewed as a keyword context.

Step 6: Collect the frequent keyword sets and build the set of keyword contexts CX according to the following rules. Let Cm be the set of frequent keyword sets discovered in the association mining phase of Step 5, and Ek be the set of elementary keywords in page Pk. Then,

Case 1: Each frequent keyword set in Cm and all of its subsets belong to the set CX. For example, if Cm = {(rule, system)}, then (rule), (system) and (rule, system) are all elements of the keyword context set CX.

Case 2: For each page Pk, if c ∈ Ek is an elementary keyword such that c ∉ x for any x ∈ Cm, then add (c) into CX. For example, let the elementary keywords of some page Pk be Ek = {expert, system} and Cm = {(rule, database)}; then the single-keyword contexts {expert} and {system} will be added into CX so that the keywords "expert" and "system" can be used to index the page Pk, which would not otherwise be indexed by the keyword contexts derived from Cm.

Step 7: Finally, compute the weight W_{c,k} of each keyword context c in CX with respect to page Pk by the following formula:

W_{c,k} = ( count(c, P_k) / count(P_k) ) · log( NP / NP(c) ),    (3.1)

where count(c, Pk) denotes the number of occurrences of keyword context c in page Pk, count(Pk) denotes the total number of keyword context occurrences in Pk, NP is the total number of pages, and NP(c) is the number of pages containing the keyword context c.

Step 8: Normalize the keyword context weights into the range between 0 and 1 by the following formula:

W_{c,k} = W_{c,k} / max_{i,j} W_{i,j},    (3.2)

where the maximum is taken over every keyword context i and every page Pj, j = 1, ..., P, and P is the total number of content pages.

Step 9: Each page Pk is now represented as a document vector [W_{c,k}], c ∈ CX.
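To make Steps 5-9 concrete, here is a minimal Python sketch of the extraction pipeline. It substitutes a naive support count over single keywords and keyword pairs for the IBM Intelligent Miner run described in Step 5, assumes the pages have already been reduced to sentences of elementary keywords (Steps 1-4), and uses illustrative function and variable names that do not come from the paper.

```python
import math
from itertools import combinations
from collections import Counter

def frequent_keyword_sets(pages, min_support=0.01):
    """Naive stand-in for the association mining phase (Step 5): count 1- and
    2-keyword co-occurrences per sentence (transaction) and keep frequent ones."""
    sentences = [frozenset(s) for page in pages for s in page]
    counts = Counter()
    for s in sentences:
        for k in s:
            counts[frozenset([k])] += 1
        for pair in combinations(sorted(s), 2):
            counts[frozenset(pair)] += 1
    n = len(sentences)
    return {c for c, cnt in counts.items() if cnt / n >= min_support}

def build_contexts(pages, Cm):
    """Step 6: keyword contexts = frequent sets and their subsets (Case 1),
    plus single-keyword contexts not covered by Cm (Case 2)."""
    CX = set()
    for c in Cm:
        for r in range(1, len(c) + 1):
            CX.update(frozenset(sub) for sub in combinations(c, r))
    covered = set().union(*Cm) if Cm else set()
    for page in pages:
        for k in {k for s in page for k in s} - covered:
            CX.add(frozenset([k]))
    return CX

def weight_pages(pages, CX):
    """Steps 7-9: tf.idf-style weights (Eq. 3.1), normalized to [0, 1] (Eq. 3.2)."""
    def occurs(c, page):
        return sum(1 for s in page if c <= set(s))
    NP = len(pages)
    W = []
    for page in pages:
        counts = {c: occurs(c, page) for c in CX}
        total = sum(counts.values()) or 1
        W.append({c: (cnt / total) * math.log(NP / sum(1 for p in pages if occurs(c, p)))
                  for c, cnt in counts.items() if cnt})
    wmax = max((w for doc in W for w in doc.values()), default=0.0) or 1.0
    return [{c: w / wmax for c, w in doc.items()} for doc in W]

# Tiny illustrative corpus: each page is a list of keyword-only sentences.
pages = [[["rule", "system"], ["expert", "system"]],
         [["rule", "database"], ["rule", "system"]]]
Cm = frequent_keyword_sets(pages, min_support=0.5)
CX = build_contexts(pages, Cm)
vectors = weight_pages(pages, CX)   # one weight dictionary per page
```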

3.2 User Interest Transfer Model

We suppose there is an interest transfer during a user's web navigation within a web site. The interest transfer in this paper means the change of keyword contexts in the keyword profile during a browsing session in the web site. We adopt the following formula to model the interest transfer of a user during his/her navigation of a web site:

I_i = I_{i-1} · (1 - r) + P_i · r,    (3.3)

where I_i is the user's interest vector, consisting of numerical values between 0 and 1 over the keyword contexts at the ith browsing step, r is the interest decay rate, and P_i is the keyword-context vector representation of the page browsed at the ith browsing step. For example, suppose there is a browsing session P1 P2 ..., and there are five keyword contexts in total. Initially, the user's interest vector I1 is the document vector of the first page P1, say [0.2, 0.01, 0.4, 0.02, 0.1], which is focused more on keyword context 3 (with value 0.4). Suppose the interest decay rate is r = 0.1 and P2 = [0.05, 0.6, 0.1, 0.04, 0.6], which is focused more on keyword contexts 2 and 5. Then the user's interest vector at step 2 will be I2 = I1·0.9 + P2·0.1 = [0.19, 0.07, 0.37, 0.02, 0.15], in which the user's interest has moved slightly toward keyword contexts 2 and 5.

Content pages whose document vectors are similar to the user's interest vector are selected for recommendation. The similarity metric adopted in this paper is the cosine-based measure shown in Eq. (3.4):

sim(I, C) = cos(I, C) = (I · C) / (||I|| ||C||),    (3.4)

where I is the user's interest vector and C is the document vector of a web page.
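The interest transfer model and the cosine matching can be summarized in a few lines of Python; the sketch below reproduces the worked example above under the same decay rate r = 0.1, with all names and the candidate pages being illustrative assumptions.

```python
import numpy as np

def update_interest(I_prev, page_vec, r=0.1):
    """Eq. (3.3): decay the previous interest vector and blend in the
    keyword-context vector of the newly browsed page."""
    return I_prev * (1 - r) + page_vec * r

def cosine_sim(I, C):
    """Eq. (3.4): cosine similarity between interest and document vectors."""
    denom = np.linalg.norm(I) * np.linalg.norm(C)
    return float(np.dot(I, C) / denom) if denom else 0.0

I1 = np.array([0.2, 0.01, 0.4, 0.02, 0.1])   # document vector of P1
P2 = np.array([0.05, 0.6, 0.1, 0.04, 0.6])   # document vector of P2
I2 = update_interest(I1, P2, r=0.1)          # -> approx [0.19, 0.07, 0.37, 0.02, 0.15]

# Rank hypothetical candidate pages by similarity to the current interest vector.
candidates = {"P3": np.array([0.0, 0.5, 0.1, 0.0, 0.5]),
              "P4": np.array([0.3, 0.0, 0.4, 0.1, 0.0])}
ranking = sorted(candidates, key=lambda p: cosine_sim(I2, candidates[p]), reverse=True)
```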

4. EVALUATION CRITERIA

There are several evaluation strategies for recommendation systems [12]. In this paper, two measurements are used to evaluate the effectiveness of the prediction knowledge: one is the precision rate and the other is the recall rate. The precision measures the system's ability to provide correct predictions, while the recall rate measures the system's ability to provide as many correct predictions as users need. Specifically, the traditional definitions of the two evaluation criteria are as follows.
1. Precision rate: the ratio of the recommended items that users actually need to the total number of items in the recommendation list.
2. Recall rate: the ratio of the recommended items that are needed by users to the total number of items that users actually need.
For example, suppose a user actually needs the items {I1, I2, I3, I4, I5}. Given a recommendation list of {I4, I5, I6, I7}, two items of the list, {I4, I5}, are actually what the user needs; therefore, the precision rate of the recommendation list is 2/4, in which 4 is the size of the recommendation list. On the other hand, the recall rate is 2/5, in which 5 is the size of the item set that the user actually needs.

These traditional evaluation criteria do not take practical navigational behavior into consideration. In practice, items recommended at top positions are more likely to be checked by users, so it is always better to place items that users actually need at the top positions of the recommendation list. Therefore, this research gives different weights to the item positions in the recommendation list. Accordingly, in this paper we adopt the weighted precision rate [4] and weighted recall rate to evaluate the effectiveness of the recommendation algorithm. For the sake of self-consistency, we cite the definitions of the weighted precision rate and recall rate from [44] below.

4.1 Weighted Precision Rates

Let A_i = (n_1, ..., n_{|A_i|}) denote the set of items a user actually needs during navigation session i, and let R_i = [r_1, ..., r_{|R_i|}] denote the ordered list of items that the system recommends to the user at some stage of navigation session i. Let W_j be the weight of the top jth position in the recommendation list, which is defined as

W_j = (1/2)^{(j-1)/(α-1)},    (4.1)

where |R_i| is the length of the recommendation list, and α is a parameter that specifies the item position at which a user has a 50-50 chance of viewing the located item. In this research we assume α = 10, which indicates that the top 10 item positions have a probability of 0.5 or above of being explored by users [4]. The weighted precision rate WP_i for the ith user session is then defined as

WP_i = ( Σ_{j=1..|R_i|} H_j · W_j ) / ( Σ_{j=1..|R_i|} W_j ),    (4.2)

where H_j = 1 if the jth item of the recommendation list R_i is in the set of user needs A_i (a hit), and 0 otherwise. Besides, as the length of a recommendation list varies from session to session, we would like to know how well the system can possibly do in precision for the ith session under the given list length. The best situation happens when the items located contiguously in the front part of the list are exactly what the user actually needs. So define

WP_i^{max} = ( Σ_{j=1..min(|A_i|,|R_i|)} W_j ) / ( Σ_{j=1..|R_i|} W_j ).    (4.3)

Therefore, a normalized average of the weighted precision rates over the total sessions is obtained by

AWP = ( Σ_{i=1..S} WP_i ) / ( Σ_{i=1..S} WP_i^{max} ),    (4.4)

where S is the total number of user sessions. For example, suppose a user actually needs the items {I3, I4, I5}. Given a recommendation list of {I4, I6, I5, I7}, the corresponding weights of the four positions are 1, 0.93, 0.86 and 0.79, respectively, according to Eq. (4.1). Two items of the list, {I4, I5}, are actually what the user needs, but they are located in the first and third places of the recommendation list; therefore, according to Eq. (4.2), the weighted precision rate of the recommendation list is WP = (1+0+0.86+0)/(1+0.93+0.86+0.79) = 1.86/3.58 = 0.52, in which the denominator is the total weight of the recommendation list. On the other hand, the best the recommendation can do in this case is to put the needed items at the top positions of the list, so according to Eq. (4.3) we have WPmax = (1+0.93+0.86)/(1+0.93+0.86+0.79) = 0.79, in which the numerator is the sum of the weights of three positions, as the size of the item set that the user actually needs is smaller than the size of the recommendation list. Therefore, the precision performance of this recommendation is nearly 0.52/0.79 = 66% of the best it could do for the user's needs.

4.2 Weighted Recall Rates

The same position weight formula, Eq. (4.1), given above is applied. The conventional recall rate is the ratio of the recommended items that are needed by users to the total number of items that users actually need. A weighted version of the recall rate is then given as

WR_i = ( Σ_{j=1..|R_i|} H_j · W_j ) / ( Σ_{j=1..|A_i|} W_j ),    (4.5)

where H_j = 1 if the jth item of R_i is in A_i (a hit), and 0 otherwise. Besides, as the number of items that the user needs varies from session to session, we would like to know how well the system can possibly do in recall for the ith session under the amount of the user's needs (|A_i|). The best situation happens when what the user actually needs is located contiguously in the front part of the list. So define

WR_i^{max} = ( Σ_{j=1..min(|A_i|,|R_i|)} W_j ) / ( Σ_{j=1..|A_i|} W_j ).    (4.6)

Therefore, a normalized average of the weighted recall rates over the total sessions is obtained by

AWR = ( Σ_{i=1..S} WR_i ) / ( Σ_{i=1..S} WR_i^{max} ),    (4.7)

where S is the total number of user sessions. For example, suppose again that a user actually needs the items {I3, I4, I5} and is given the recommendation list {I4, I6, I5, I7}, with position weights 1, 0.93, 0.86 and 0.79 according to Eq. (4.1). Two items of the list, {I4, I5}, are actually what the user needs, but they are located in the first and third places of the recommendation list; therefore, according to Eq. (4.5), the weighted recall rate of the recommendation list is WR = (1+0+0.86+0)/(1+0.93+0.86) = 1.86/2.78 = 0.67, in which the denominator is the total weight of the item set the user actually needs. On the other hand, the best the recommendation can do in this case is to put the needed items at the top positions of the list, so according to Eq. (4.6) we have WRmax = (1+0.93+0.86)/(1+0.93+0.86) = 1, in which the numerator is the sum of the weights of three positions, as the size of the item set that the user actually needs is smaller than the size of the recommendation list. Therefore, the recall performance of this recommendation is nearly 67% of the best it could do for the user's needs.
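As a concrete illustration of Eqs. (4.1)-(4.7), the following Python sketch computes the position weights, the weighted precision and recall rates, their per-session upper bounds, and the normalized averages; it reproduces the worked examples above up to rounding. The function names are illustrative and not from the paper.

```python
def position_weight(j, alpha=10):
    """Eq. (4.1): weight of the j-th list position (1-indexed)."""
    return 0.5 ** ((j - 1) / (alpha - 1))

def weighted_pr(needed, recommended, alpha=10):
    """Eqs. (4.2)-(4.3) and (4.5)-(4.6): weighted precision/recall for one
    session together with their per-session upper bounds."""
    W = [position_weight(j, alpha) for j in range(1, len(recommended) + 1)]
    hits = sum(w for w, item in zip(W, recommended) if item in needed)
    top = sum(W[:min(len(needed), len(recommended))])
    wa = sum(position_weight(j, alpha) for j in range(1, len(needed) + 1))
    wp, wp_max = hits / sum(W), top / sum(W)
    wr, wr_max = hits / wa, top / wa
    return wp, wp_max, wr, wr_max

def normalized_averages(sessions, alpha=10):
    """Eqs. (4.4) and (4.7): AWP and AWR over all sessions,
    where `sessions` is a list of (needed_set, recommended_list) pairs."""
    stats = [weighted_pr(a, r, alpha) for a, r in sessions]
    awp = sum(s[0] for s in stats) / sum(s[1] for s in stats)
    awr = sum(s[2] for s in stats) / sum(s[3] for s in stats)
    return awp, awr

# Worked example from the text: A = {I3, I4, I5}, R = [I4, I6, I5, I7].
wp, wp_max, wr, wr_max = weighted_pr({"I3", "I4", "I5"}, ["I4", "I6", "I5", "I7"])
# wp ~ 0.52, wr ~ 0.67, wr_max == 1.0
```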

5. RECOMMENDATION STRATEGIES
5.1 Preprocessing of Navigation Session
As a matter of fact, it is a difficult task to capture the real interests of a user while he/she is browsing a web site. Users may browse intentionally or unintentionally by clicking around in a web site. To get a better understanding of the effect of the interest model in predicting a user's interest, we propose three approaches to interpreting a user's browsing session:
(1) Single-intentional view (SIV): The entire session is interpreted as one consecutive intentional browsing sequence. In this sense, a repeated page in the session is viewed as a re-focus on the page, which transfers the user's interest back to previous ones. In this case a browsing session is not preprocessed any further, and is fed directly into the recommendation process.
(2) Most-recently intentional view (MRIV): A repeated page in a browsing session is interpreted as a "backward" action which marks the end of a previous intentional browsing and starts a new one. The last intentional browsing is believed to reflect the most recent interests of the user. Therefore, only the last intentional sub-session is fed into the recommendation process.
(3) Multiple-intentional view (MIV): As in MRIV, a repeated page in a browsing session is interpreted as a "backward" action which marks the end of a previous intentional browsing and starts a new one. In contrast to the MRIV approach, the MIV approach considers all the intentional sub-sessions to give a final recommendation. All intentional sub-sessions are fed into the recommendation process separately, and the resulting recommendation lists are combined in a most-recent-first manner. For example, suppose there is a browsing session for some user: P1 P2 P3 P2 P4 P2 P5 P6. According to the MIV approach, the session is divided into three sub-sessions: S1 = P1 P2 P3, S2 = P2 P4 and S3 = P2 P5 P6. Furthermore, suppose three recommendation lists R1 = P8 P13 P11, R2 = P10 P7 P12 and R3 = P7 P8 P9 are derived by feeding the three sub-sessions S1, S2 and S3 into the recommendation process, respectively. The combining priority is R3 > R2 > R1; therefore, the resulting recommendation list under the MIV strategy is Rcombine = P7 P8 P9 P10 P12 P13 P11.
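The three session views can be sketched as follows in Python, under the assumption that a "repetitive page" is one that already occurred in the current intentional sub-session and that the recommender itself is supplied by the caller; names and the toy recommender are illustrative.

```python
def split_subsessions(session):
    """Cut a browsing session into intentional sub-sessions: a page that
    already occurred in the current sub-session starts a new one."""
    subs, current = [], []
    for page in session:
        if page in current:            # "backward" action detected
            subs.append(current)
            current = [page]
        else:
            current.append(page)
    subs.append(current)
    return subs

def recommend_miv(session, recommend, top_n=7):
    """MIV: recommend for every sub-session and merge the lists in a
    most-recent-first manner, dropping duplicates."""
    lists = [recommend(s) for s in split_subsessions(session)]
    combined = []
    for lst in reversed(lists):        # most recent sub-session first
        combined += [p for p in lst if p not in combined]
    return combined[:top_n]

def recommend_mriv(session, recommend):
    """MRIV: use only the last intentional sub-session."""
    return recommend(split_subsessions(session)[-1])

# SIV simply calls recommend(session) on the raw, unsplit session.
# Example with a hypothetical recommender that returns fixed lists:
fixed = {("P1", "P2", "P3"): ["P8", "P13", "P11"],
         ("P2", "P4"): ["P10", "P7", "P12"],
         ("P2", "P5", "P6"): ["P7", "P8", "P9"]}
rec = lambda s: fixed.get(tuple(s), [])
print(recommend_miv(["P1", "P2", "P3", "P2", "P4", "P2", "P5", "P6"], rec))
# -> ['P7', 'P8', 'P9', 'P10', 'P12', 'P13', 'P11']
```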

5.2 Absolute Nearest Neighbor (ANN) Recommendation Policy

Page contents that are most similar to the user's interest vector are selected and recommended to the user. A fixed, properly chosen similarity threshold is pre-selected so that only pages whose similarity to the user's interest exceeds the threshold are recommended, which improves the accuracy. However, it is often the case that page contents have little similarity with the user's interest, so an improper (too high) selection of the threshold often results in no recommendations at all, and hence lowers the recall rate.
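A minimal sketch of the ANN policy under these assumptions: the cosine measure of Eq. (3.4) is reused, and the 0.1 threshold is the illustrative setting reported later in the experiments rather than a prescribed constant.

```python
import numpy as np

def ann_recommend(interest, page_vectors, threshold=0.1, top_n=10):
    """Absolute nearest neighbor policy: keep only pages whose cosine
    similarity to the interest vector exceeds the threshold, then rank."""
    def cos(a, b):
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / d) if d else 0.0
    scored = [(url, cos(interest, v)) for url, v in page_vectors.items()]
    kept = [(url, s) for url, s in scored if s > threshold]
    return [url for url, _ in sorted(kept, key=lambda x: x[1], reverse=True)][:top_n]
```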

6. EXPERIMENTS AND RESULTS


6.1 The Data Source
In a Web-based Virtual Classroom at Ming Chuan University, students can choose and browse material according to the topic indices and perform further study by following the hyperlinks embedded in the documents, or they can browse specific material through the system's search engine utility. All documents are displayed in a browsing window. A client agent is designed to track the users' activities, including the URLs of the pages shown in the browsing window, and sends them back to the behavior-tracking database on the server side. Since the study focuses on browsing-related activities, all other unrelated log data are filtered out, including the activities of teachers as well as browsing records with short staying times. In particular, browsing records with short staying times are often caused by pages that contain intermediate hyperlinks between web documents. For example, a student may intend to browse page B, but has to browse page A first because only through the hyperlink in page A can he/she reach page B. In such a situation, page A is often called a pass-by page.




Table 5.1 The record format of logged learning activities.
Student id: identifier of the student.
Page URL: URL of the referenced page.
Activity Type: activity type such as login, browsing, group discussion and so on.
Start Time: start time of the activity.
Stay Time: staying time of the activity (in seconds).

On the other hand, students' short references to pages may also be caused by mistaking some pages as useful for their learning purposes. This kind of reference can also be filtered out by checking a minimal page residence time (say, 10 seconds). Furthermore, the raw log data have to be reconfigured for further analysis. Table 5.1 shows the record format of the logged learning activities. All browsing records are sorted in ascending order with the user id as the major key and the starting time tag as the minor key. Sessions are identified by packing the continuous records that follow a login-type record until the next login record. Specifically, the browsing records picked up between two successive login-type records are grouped into a browsing-session record.
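The preprocessing just described might be sketched as follows, assuming each log record is a dictionary with the fields of Table 5.1 (the field names are our own), a 10-second minimal residence time, and that a login record opens a new session.

```python
MIN_STAY_SECONDS = 10   # minimal page residence time for a "real" reference

def sessionize(records):
    """Sort log records by (student id, start time), drop short-stay
    browsing records, and group records between login events into sessions."""
    records = sorted(records, key=lambda r: (r["student_id"], r["start_time"]))
    sessions, current = [], None
    for r in records:
        if r["activity"] == "login":
            if current:
                sessions.append(current)
            current = []                       # a login starts a new session
        elif r["activity"] == "browsing" and current is not None:
            if r["stay_time"] >= MIN_STAY_SECONDS:
                current.append(r["page_url"])  # keep non-pass-by page views
    if current:
        sessions.append(current)
    return sessions

# Hypothetical records:
log = [
    {"student_id": "s1", "activity": "login", "start_time": 0, "stay_time": 0, "page_url": ""},
    {"student_id": "s1", "activity": "browsing", "start_time": 10, "stay_time": 3, "page_url": "/A"},
    {"student_id": "s1", "activity": "browsing", "start_time": 13, "stay_time": 120, "page_url": "/B"},
    {"student_id": "s1", "activity": "login", "start_time": 500, "stay_time": 0, "page_url": ""},
    {"student_id": "s1", "activity": "browsing", "start_time": 510, "stay_time": 45, "page_url": "/C"},
]
print(sessionize(log))   # -> [['/B'], ['/C']]
```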

6.2 Design of the Experiments

The pages referenced in the browsing records may include local and external web pages. Pages that contain nothing but figures are excluded from the data source due to the limitation of the content-based analysis method adopted in this paper. Browsing sessions are pre-processed according to the SIV, MRIV and MIV methods, respectively. As a comparative basis, we also establish a pure keyword document model that applies Steps 1-4 and 7-9 of Section 3.1 to represent each page Pk as another document vector [W_{c,k}], c ∈ E. The same user interest transfer model is used, except that the interest vector is composed of elementary keywords. Two sets of experiments, based on traditional keyword matching and on keyword context matching, are then conducted. There are a total of 1678 browsing sessions in the database. Half of the sessions are randomly selected to build the recommendation model, and the other half are used for testing the recommendation performance. Each experiment is conducted ten times, and the average performance is reported and compared. Furthermore, although in practice the recommendation system could provide recommendation services at every stage of a user session, in the following experiments we take the performance results of the recommendations provided at three typical stages of a user session: the first, middle and last ones. Recommendations provided at the first stage of a user session are given when only one item of the current session is available; those provided at the middle stage are given when half of the items in the user session are available; and recommendations provided at the last stage of a user session are given when all but one item of the current session are available. The total service performance for a user session is then computed as the average of the performance results at the three stages of the session. In the following experiments, the interest decay rate r is 0.1, the support threshold applied in association mining is 0.01, and the similarity threshold is 0.1.

6.3 Results and Discussion

Table 6.1 shows the experimental results of the recommendation performance for the different session preprocessing approaches and keyword matching methods. From Table 6.1, the recommendation performance based on keyword context matching is better than that of pure keyword matching, in terms of both the average weighted precision (AWP) and the average weighted recall (AWR) rates. Furthermore, it can be seen that the MRIV approach to session preprocessing results in the best performance among the three approaches. This seems to imply that the most recently browsed pages are good enough to model the user's browsing interest and produce better recommendation performance.

7. CONCLUDING REMARKS
In this paper, we propose a novel content-based recommendation mechanism based on a keyword context matching scheme. A new keyword context extraction method based on TFIDF and association mining is also presented. Based on this new featuring scheme, a model for constructing user interest profiles is derived. Three approaches to building user interest profiles from browsing sessions are presented and compared. In the case study, the experimental results showed that the MRIV profile-building approach produced the best performance. Furthermore, the new recommendation mechanism produces better recommendation performance than the pure keyword matching scheme. However, the experimental results are preliminary and conservative because the data sets used in the experiments are of moderate scale. Larger-scale


experiments will be conducted to further confirm the effectiveness of the method. Another direction for future work is to probe the effect of a recommendation model built by combining the user interest profiles with collaborative filtering mechanisms [13].

Table 6.1 Experimental results of the recommendation performance under different intentional views and matching methods.
Keyword Context Matching: SIV AWP 0.38, AWR 0.43; MRIV AWP 0.56, AWR 0.58; MIV AWP 0.44, AWR 0.55.
Pure Keyword Matching: SIV AWP 0.19, AWR 0.21; MRIV AWP 0.26, AWR 0.35; MIV AWP 0.24, AWR 0.31.

REFERENCES
[1] M. Balabanovic and Y. Shoham (1997). Fab: Content-based, Collaborative Recommendation. Communications of the ACM, 40(3), 66-72.
[2] J. Basilico and T. Hofmann (2004). Unifying Collaborative and Content-based Filtering. Proceedings of ICML'04, Twenty-first International Conference on Machine Learning, ACM Press, New York.
[3] J. Basilico and T. Hofmann (2004). A Joint Framework for Collaborative and Content Filtering. Proceedings of SIGIR'04, Sheffield, South Yorkshire, UK, 550-551.
[4] J. S. Breese, D. Heckerman and C. Kadie (1998). Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 43-52.
[5] Chen, et al. (1993). Some Distributional Properties of Mandarin Chinese: a Study Based on the Academia Sinica Corpus. Proceedings of the First Pacific Asia Conference on Formal & Computational Linguistics, 81-95.
[6] J. Chen, J. Yin and J. Huang (2005). Automatic Content-Based Recommendation in e-Commerce. Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service, 748-753.
[7] K. Church (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, Texas, 136-143.
[8] B. Croft (1991). The Use of Phrases and Structured Queries in Information Retrieval. Proceedings of the 14th ACM SIGIR Conference on R&D in Information Retrieval, ACM, Chicago, IL, 32-45.
[9] M. Eirinaki, C. Lampos, S. Paulakis and M. Vazirgiannis (2004). Web Personalization Integrating Content Semantics and Navigational Patterns. Proceedings of WIDM'04, Washington, DC, 72-79.
[10] J. L. Fagan (1985). Automatic Phrase Indexing for Text Passage Retrieval and Printed Subject Indexes. Technical Report, Department of Computer Science, Cornell University, Ithaca, NY.
[11] V. E. Giuliano (1962). Automatic Message Retrieval by Associative Techniques. Joint Man-Computer Languages, Mitre Corporation Report SS-10, Bedford, MA.
[12] J. L. Herlocker, J. A. Konstan, L. G. Terveen and J. T. Riedl (2004). Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, 22(1), 5-53.
[13] Y. Hijikata, K. Iwahama, K. Takegawa and S. Nishida (2006). Content-based Music Filtering System with Editable User Profile. Proceedings of SAC'06, Dijon, France, 1050-1057.
[14] Ho, et al. (1993). Using Syntactic Markers and Semantic Frame Knowledge Representation in Automated Chinese Text Abstraction. Proceedings of the First Pacific Asia Conference on Formal & Computational Linguistics, 122-131.
[15] P. Jacobs (1992). Joining Statistics With Natural Language Parsing for Text Categorization. Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 78-185.
[16] X. Jin, Y. Zhou and B. Mobasher (2005). A Maximum Entropy Web Recommendation System: Combining Collaborative and Content Features. Proceedings of KDD'05, Chicago, Illinois, USA, 612-617.
[17] P. Jacobs and L. Rau, eds. (1992). Innovations in Text Interpretation in Artificial Intelligence. North Holland, Amsterdam.
[18] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon and J. Riedl (1997). GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3), 77-87.
[19] B. Krulwich and C. Burkey (1996). Learning User Information Interests Through Extraction of Semantically Significant Phrases. Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, CA.
[20] M. Kwak and D. S. Cho (2001). Collaborative Filtering With Automatic Rating for Recommendation. Proceedings of ISIE 2001, IEEE International Symposium on Industrial Electronics, 625-628.
[21] J. W. Kwak and N-I. Cho (2003). Relevance Feedback in the Content-based Image Retrieval System by Selective Region Growing in the Feature Space. Signal Processing: Image Communication, 18(9), 787-799.
[22] K. Lang (1995). NewsWeeder: Learning to Filter Netnews. Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA.
[23] C-H. Lee, Y-H. Kim and P-K. Rhee (2001). Web Personalization Expert with Combining Collaborative Filtering and Association Rule Mining Technique. Expert Systems with Applications, 21, 131-137.
[24] C. N. Lee (1999). Understanding the Text Book of Primary School based on How-net. Master's Thesis, Institute of Computer Science and Information Engineering, National Cheng Kung University.
[25] K-L. Lee (1999). Intention Extraction and Semantic Matching for Internet FAQ Retrieval. Master's Thesis, Institute of Computer Science and Information Engineering, National Cheng Kung University.
[26] S. Luo and J. Rong (2004). Unified Filtering by Combining Collaborative Filtering and Content-Based Filtering via Mixture Model and Exponential Model. Proceedings of CIKM'04, Washington, DC, USA, 156-157.
[27] W-Y. Ma and K-J. Chen (2003). Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff. http://dats.ndap.org.tw/docu/paper-92/92-10.pdf (visited March 2003).
[28] A. McCallum and K. Nigam (1998). A Comparison of Event Models for Naive Bayes Text Classification. Proceedings of the AAAI 1998 Workshop on Text Categorization, Madison, WI, 41-48.
[29] N. K. Mimouni, F. Marir and F. Meziane (2000). An Intelligent Agent for Content-based Retrieval of Documents. Proceedings of the CBIR2000 Conference, Brighton, UK.
[30] B. Mobasher, R. Cooley and J. Srivastava (2000). Automatic Personalization Based on Web Usage Mining. Communications of the ACM, 43(8), 142-151.
[31] R. J. Mooney and L. Roy (2000). Content-based Book Recommending Using Learning for Text Categorization. Proceedings of the ACM Conference on Digital Libraries, 195-204.
[32] M. D. Mulvenna, S. S. Anand and A. G. Buchner (2000). Personalization on the Net Using Web Mining. Communications of the ACM, 43(8), 123-125.
[33] D. M. Nichols (1997). Implicit Rating and Filtering. Proceedings of the Fifth Workshop on Filtering and Collaborative Filtering, 31-36.
[34] J. O'Donovan and B. Smyth (2006). Is Trust Robust? An Analysis of Trust-Based Recommendation. Proceedings of IUI'06, 101-108.
[35] J. O'Donovan and B. Smyth (2005). Trust in Recommender Systems. Proceedings of IUI'05, San Diego, California, USA, 167-174.
[36] D. Riecken (2000). Personalized Views of Personalization. Communications of the ACM, 43(8), 27-28.
[37] S. Robertson, C. J. van Rijsbergen and M. F. Porter (1981). Probabilistic Models of Indexing and Searching. In Information Retrieval Research, Oddy, R. N., Robertson, S. E., van Rijsbergen, C. J. and Williams, P. W., eds., Butterworths, London, 35-56.
[38] G. Salton (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 207-208.
[39] G. Salton and M. J. McGill (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.
[40] G. Salton (1986). On the Use of Term Associations in Automatic Information Retrieval. Proceedings of the 11th Conference on Computational Linguistics, Bonn, Germany, 380-386.
[41] I. Schwab, W. Pohl and I. Koychev (2000). Learning to Recommend From Positive Evidence. Proceedings of Intelligent User Interfaces, ACM Press, 241-247.
[42] G. Shani, D. Heckerman and R. I. Brafman (2005). An MDP-Based Recommender System. Journal of Machine Learning Research, 6, 1265-1295.
[43] M. E. Stevens (1965). Automatic Indexing: a State of the Art Report. NBS Monograph 91, National Bureau of Standards, Washington, DC.
[44] F. H. Wang and H. M. Shao (2004). Effective Personalized Recommendation based on Time-framed Navigation Clustering and Association Mining. Expert Systems with Applications, 27(3), 365-377.
[45] M. F. Wyle and H. P. Frei (1991). Retrieval Algorithm Effectiveness in a Wide Area Network Information Filter. Proceedings of the 14th ACM SIGIR Conference on R&D in Information Retrieval, ACM, Chicago, IL, 114-122.
[46] Z. Zhang and O. Nasraoui (2006). Hybrid Query Session and Content-Based Recommendations for Enhanced Search. Proceedings of the IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada.
[47] C. Yu, J. Xu and X. Du (2006). Recommendation Algorithm Combining the User-Based Classified Regression and the Item-Based Filtering. Proceedings of ICEC'06, Fredericton, Canada, 574-578.
[48] Y. Zheng, W. L. Moreau and N. R. Jennings (2005). A Market-Based Approach to Recommender Systems. ACM Transactions on Information Systems, 23(3), 227-266.
[49] C. Ziegler, S. M. McNee, J. A. Konstan and G. Lausen (2005). Improving Recommendation Lists Through Topic Diversification. Proceedings of WWW 2005, 22-32.

Received 2005/11/11 Revised 2006/08/12 Accepted 2006/09/30

