
Master's degree thesis

Topic: Personalized citation recommendation based on user preference

1101213892
Research direction: search engines and Internet information mining

May 2014
Personalized citation recommendation based on user
preference

Liu Yaning

Computer Science and Technology

Instructor: Associate Professor Hongfei Yan

Summary
Citations are an essential part of any paper or book. They show respect for the
original work on which an author builds, and they let readers trace the origin
and development of an idea. However, as scientific research deepens and the
number of researchers grows, the number of papers is expanding sharply. As a
result, authors must spend a great deal of time identifying and adding citations
while writing, a cumbersome process that involves little creativity. This thesis
constructs an automated citation recommendation system to solve this problem.
Given a citation context, the system automatically recommends a list of papers
to cite, which saves researchers a great deal of time and has considerable
practical value for scientific writing. In addition, a citation recommendation
system can be understood as a combination of a retrieval system and a
recommender system, which gives the subject considerable research significance.
Citation recommendation is a relatively new problem. Past research treated it as
a variant of retrieval and made recommendations according to the content of the
citation context alone. This thesis argues that citation recommendation is in
fact a personalized recommendation process: recommendations should depend not
only on content but also on the preferences of individual researchers.
This thesis constructs a personalized citation recommendation model, the PCR
model, which uses the user's publication and citation history together with
existing content-based recommendation methods. By combining user citation
tendency with content relevance, the PCR model improves on the latest
content-based translation model by 31.67% on recall@10 and by 27.65% on MAP.

Keywords: citation recommendation, personalization, retrieval system,
recommender system, SVM
Personalized Citation Recommendation Based on User Preference

Yaning Liu (Computer Science)

Directed by Professor Hongfei Yan

Abstract
Citations are a necessary part of papers and books; they show the author's
respect for original work. Citations also help readers find related information.
But with the development of science, more and more researchers are at work, and
the number of papers is growing rapidly. When authors write their papers, they
need a lot of time to add and confirm citations. It is a time-consuming procedure
that involves little creativity. This thesis builds an automatic citation
recommendation system to solve the problem.
Our system analyzes the citation context and automatically recommends a list of
papers for the researcher to cite. The system can save researchers a lot of
time, which gives it great practical value. Citation recommendation can also be
understood as a combination of a retrieval system and a recommender system,
which gives it considerable research significance.
Citation recommendation is a relatively new problem. In the past, researchers
treated it as a variant of retrieval and made recommendations according to the
content of the citation context. Here, we treat citation recommendation as a
personalized recommendation procedure: we also take the author's preferences
into consideration.
In this thesis, we use the author's publication and citation history, combined
with existing content-based recommendation methods, to build a personalized
citation recommendation model, the PCR model. The model obtains a 31.67%
improvement in recall@10 and a 27.65% improvement in MAP compared with the
state-of-the-art method.

Keywords: Citation Recommendation, Personalization, Retrieval System,
Recommender System, SVM
Directory
1.1 Research background
1.2 Research content
1.2.1 Obtaining user data and building user information
1.2.2 Modeling user information
1.2.3 Combining user information with content-based methods
1.3 Thesis organization
2.1 Paper recommendation
2.2 Citation recommendation
2.3 Support Vector Machine (SVM)
3.1 Related concepts
3.2 Citation behavior analysis
3.3 User Tendency Degree (UTD)
3.3.1 Building user information
3.3.2 Using user information to build UTD
3.4 Content Relevant Degree (CRD)
3.4.1 Language model
3.4.2 Translation model
3.5 Combining UTD and CRD
3.5.1 Fill the value divide
3.5.2 Combining multiple scores
4.1 Data set
4.1.1 Data requirements
4.1.2 Data acquisition procedure
4.1.3 MAS API
4.1.4 Data acquisition tips
4.1.5 Data preprocessing
4.2 Evaluation method
4.3 Experiment framework
4.4 Experiment results
4.5 Parameter tuning
4.6 Feature analysis
5.1 Summary
5.2 Future work

Figure Directory
Figure 1.1 Schematic of the citation recommendation system
Figure 1.2 Total number of papers over time
Figure 1.3 A citation context sample
Figure 1.4 User Reference Preferences location
Figure 5.8 Recall of results as position changes

Tables Directory
Table 3.1 Abbreviation descriptions
Table 3.2 Different levels of recommendations and different ways of extension
Table 3.2 MAS API
Chapter 1 Introduction

1 Introduction
Researchers need to understand the background and latest progress of their
field and to identify the shortcomings of current work before they can carry out
their own research. Current research is rarely separable from the results of
predecessors: it either relates to earlier results or builds directly on them.
Therefore, when researchers publish their results as papers, they usually need
to cite many works, both to show respect for earlier work and to give readers
sufficient background information.
However, citing well is not easy. With the development of science and
technology, information overload has reached the literature as well. Among a
huge number of papers, it is hard to find the exact work that a result mentioned
in one's paper should cite. Researchers learn about a result in one of two ways.
The first is through books, courses, lectures, or other channels than the
original paper. In this case, finding the original source when they want to cite
the result is laborious: they often have to read many papers, follow the
citation links between them, and spend a lot of time tracing the source. The
second way is to read the original paper that presents the result. In this case
citing is relatively easy, but because many researchers have read a large number
of papers, recalling the specific paper behind a result is difficult. Even
junior researchers who have read only a few papers cannot clearly remember the
title, authors, and venue of each one. As a result, almost every researcher who
needs to add a citation has to look through many sources by hand, spends a lot
of time searching for papers, and often does a great deal of unnecessary
extended reading. This is a waste of researchers' valuable time.
Several large-scale systems hold huge amounts of paper data and offer features
such as citation counts and lists of the articles published at a conference, for
example Google Scholar and Microsoft Academic Search. Google Scholar is the
paper retrieval system most familiar to researchers; it offers free paper search
and indexes most of the world's published academic journals. Google cooperates
with ACM, Nature, IEEE, OCLC, and many other publishers, and its paper retrieval
performs very well in both data volume and query speed. For users who simply
want to look up a paper, its functionality is sufficient. For citations,
however, Google Scholar provides only two simple features: citation format
generation and citation counting. Microsoft Academic Search is also an excellent
paper search engine and provides more functionality, such as classification of
papers by field and the history of each article's citation count. For citations,
Microsoft Academic Search provides stronger support: it extracts the list of
citations from each paper and the citation contexts in which other papers cite
it, as described in more detail later in this thesis. Besides these two large
paper search engines, there are many smaller citation search engines with
similar features, such as CiteSeer. However, no system is available for citation
recommendation.

Figure 1.1 Schematic of the citation recommendation system

Imagine such a system, as shown in Figure 1.1. When writing a paper, you simply
describe existing research results from what you already know and mark the
locations in the manuscript where citations are needed. Once the manuscript is
fed into the citation recommendation system, the system automatically
recommends, for each location, a list of candidate citations based on what you
wrote. Even when you cannot remember exactly which paper you want to cite, you
only need to write a short description of its content, and the system can
recommend a list of possible citations. To add a citation, you only need to
select a paper from the candidate list. This process saves labor, is convenient,
greatly reduces the workload, and improves accuracy. With accurate citation
recommendation, citations can be added quickly while writing a paper. In
addition, an extension of the technique could recommend research with similar
ideas based on a researcher's brief description of an idea, helping researchers
understand the frontier of a specific research direction. Such a system
therefore has many uses.

1.1 Research background

With the development of science, information is exploding, and information
overload is visible in every field, including the academic literature. Take
computer science as an example: not only do universities carry out research
year after year, but large companies also have strong research capabilities, so
the number of published papers is growing rapidly. Figure 1.2 shows the growth
in the total number of papers in the field. In recent years about $number new
papers have been published in computer science every year, and for the
foreseeable future the number will continue to grow very quickly. The growth in
the number of papers is inevitable, and the rate of growth is itself increasing.
For example, in year $day the number of new papers produced in computer science
was almost three times that of a decade earlier [1]. The growing volume of
papers and its explosive growth rate will make adding citations more and more
time-consuming, and citation recommendation will become more and more important.
The huge amount of data caused by information overload in the literature makes
citation recommendation both necessary and challenging.

Figure 1.2 Total number of papers over time

The sentences near a citation marker that describe the cited article are called
the citation context. Figure 1.3 shows an example. Researchers spend a great
deal of time working out exactly which papers a marker such as "[10, 6]" in the
figure should point to. Finding the articles that need to be cited often takes a
long time. Paradoxically, the more papers an author knows, the harder it is to
locate the source of an idea. The total volume of papers is huge and still
growing very fast, so finding a paper that a user may want to cite is difficult.

Figure 1.3 A citation context sample

Researchers have proposed several methods for the citation recommendation
problem. For example, citation recommendation can be treated as an information
retrieval problem: the citation context is taken as a query string, and the
system retrieves and returns the corresponding papers. This process resembles
retrieval in a standard search engine. Within this framework there are many ways
to measure the relevance between a citation context and a paper, such as the
language model [2] and the translation model [3]. The citation context can also
be treated as anchor text, in which case the system determines which paper a
citation context points to by comparing it with other known citation contexts.
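The retrieval view described above can be illustrated with a minimal sketch of
query-likelihood ranking, in which the citation context plays the role of the
query. This is only an illustration under stated assumptions: the whitespace
tokenizer, the Jelinek-Mercer smoothing weight `lam`, and the function names
`tokenize`, `log_query_likelihood` and `rank_papers` are hypothetical choices
made here, not the formulation used in [2] or in this thesis.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def log_query_likelihood(context_tokens, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Return log P(context | paper) under a smoothed unigram language model.

    Jelinek-Mercer smoothing (an assumption of this sketch) mixes the paper's
    word distribution with the distribution of the whole collection.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for word in context_tokens:
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts[word] / coll_len if coll_len else 0.0
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p > 0.0:  # words that never occur in the collection are skipped
            score += math.log(p)
    return score


def rank_papers(context, papers, lam=0.5):
    """Rank paper ids by how likely each paper is to generate the context."""
    tokenized = {pid: tokenize(text) for pid, text in papers.items()}
    coll_counts = Counter()
    for tokens in tokenized.values():
        coll_counts.update(tokens)
    coll_len = sum(coll_counts.values())
    query = tokenize(context)
    scores = {
        pid: log_query_likelihood(query, tokens, coll_counts, coll_len, lam)
        for pid, tokens in tokenized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_papers` with a citation context and a dictionary that maps paper
ids to their text returns the ids ordered from most to least relevant. Such a
ranking uses content only, so every user receives the same list for the same
context.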
In short, all previous methods analyze only the content of the paper and the
citation context. Whatever optimizations they apply, they all produce general
recommendations. Their common shortcoming is that none of them considers the
user's citation preferences. In reality, different users have different reading
scopes and citation habits, so different users may choose different citations
for the same citation context. Because general recommendation ignores the user,
its effectiveness is hard to improve further.

1.2 Research content

Because general citation recommendation is hard to adjust for different users,
this thesis seeks a recommendation framework that considers content relevance
and also takes the user's personal information into account, so that
recommendations reflect the user's preferences. In this thesis, a citation
recommendation method that considers both content and user preference is called
"personalized citation recommendation". This thesis is the first to propose
personalized methods for citation recommendation, and it achieves good results.
Personalized citation recommendation is new work, so it poses many challenges
and problems; solving them is the main content of this thesis. The following
subsections briefly list the problems encountered in personalized citation
recommendation and their solutions.

1.2.1 Obtaining user data and building user information

Any personalized system must first build user information, which requires a
certain amount of user data as support. Existing large-scale systems can judge a
user's preferences from the user's activities. Two relatively mature
personalized applications are personalized search and personalized
recommendation. For personalized search, M. Speretta et al. [4] analyze the
user's search history to learn the user's habits and interests and so provide
personalized search services. Moreover, because the user's behavior in the
system can be recorded, the user's clicks can be used to adjust each user's
model, so the system keeps improving at runtime and offers a better user
experience. For personalized recommendation, Jae Kyeong Kim et al. [5] use the
user's previous behavior on a website, together with the behavior of other
users, to determine the user's long-term and recent interests, infer the
products the user may be interested in, and recommend them. In short, these
personalized applications need a system that is already running and can collect
user data continuously. Some methods even need to collect all of the user's Web
browsing information.
For citation recommendation, however, no site offers an open interface for
extracting such user information. This is the biggest hurdle for personalized
citation recommendation: without user information, the user's preferences
cannot be obtained, and recommendations cannot be personalized.
This thesis divides researchers into two categories: junior researchers with no
publication history and senior researchers with a publication history. Junior
researchers have not been doing research for long and usually have not yet
formed citation preferences, so their demand for personalized citation
recommendation is weak. Researchers with a publication history generally do have
citation preferences, which should be considered when recommending citations.
The target users who actually need personalized citation recommendation are
therefore senior researchers with a publication history, and that history is
their most representative personal information. This thesis is the first to use
researchers' past publications to build user information and to make
recommendations with it.
A user's past publications are relatively easy to obtain. Many open resources
on the Internet let researchers download them, and Microsoft Academic Search and
Google Scholar provide supporting functionality. The user's citation preferences
can be analyzed from these existing publications. In this way, when the citation
recommendation system goes online, it can provide personalized citation
recommendation without user intervention. Existing methods can then be used to
keep adjusting the user information while the system runs and so further
optimize the recommendation algorithm.

1.2.2 Modeling user information

This thesis builds user information from the user's publications. However,
publications reflect a user's interests and preferences only indirectly. How to
use this indirect information to guide citation recommendation directly is
another research question of this thesis.
A paper published by a user appears simple but actually contains a lot of
information. In terms of meta information, there are the user's collaborators,
the conferences or journals in which the user's papers were published, the
papers the user cited, and so on. Each cited paper has the same kinds of meta
information. Besides meta information, the content of the user's published
papers also reflects the user's personal characteristics.
A great deal of information can be derived in this way. Because the content of
the papers overlaps with what content-based methods already use, this thesis
focuses on the meta information of the user's published papers. Three kinds of
information can be obtained from it:
1. From the perspective of the user's collaborators: the number of
collaborators, the number of collaborations, the publications of the
collaborators, the authority of the collaborators, and so on.
2. From the perspective of the conferences or journals in which the user has
published: their topics, years, number of publications, authority in the
relevant field, and information about their review process.
3. From the perspective of the papers the user has cited: information about
each paper's authors, its conference or journal, and the paper itself.
This information is large and complex, and a very large number of direct and
indirect values can be computed from it as measures of user personalization.
This thesis aims to select information that is logical, easy to implement, and
effective for personalized citation recommendation.
This thesis approaches the problem from a relatively simple, easy-to-implement
point of view. A user who has just entered a research field has essentially no
citation preferences; as the user's research deepens, citation preferences and
habits gradually form. This formation is made up of countless recommendation
events. For example, a user's collaborators recommend papers to the user, and a
paper that the user has read carefully likewise recommends similar papers. This
thesis quantifies the recommendation events that shape the user's interests,
yielding a quantitative measure with which user information can be modeled.

1.2.3 Combining user information with content-based methods

The personalized citation recommendation model should take advantage of
existing content-based work, because the most direct information available to
citation recommendation is the content of the citation context, and previous
methods have already explored how to use it. This thesis aims for a personalized
recommendation framework that reuses the results of previous work directly while
also using the user's preference information, so that the two are combined
simply and reasonably.

Figure 1.4 User Reference Preferences location

There are many ways to combine different kinds of information. Two kinds of
information can be understood as standing in a prior-posterior relationship and
combined with Bayes' formula. Alternatively, they can be understood as the
inputs of a function: each kind of information is treated as a separate
dimension, and an SVM determines how the dimensions are combined. As shown in
Figure 1.4, the user's citation preference is in fact one of the user's writing
habits and results from the user's past experience. These preferences are
long-term information that has a lasting influence on the user. The content of a
citation context, by contrast, describes a problem at the moment the user writes
the paper and is instantaneous. This thesis therefore assumes that the user
already has a preference for each paper before writing, and it uses Bayes'
formula to combine the user's preference with the content of the citation
context.
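One way to read this combination is to treat the user's preference for a paper
d as a prior P(d | u) and the content match of the citation context c as a
likelihood P(c | d), so that P(d | u, c) is proportional to P(c | d) * P(d | u).
The sketch below is a hedged illustration of that reading, assuming the context
is conditionally independent of the user given the paper. The function name
`combine_scores`, the dictionary inputs, and the `default_prior` floor for
papers without a user score are hypothetical; the actual definitions of the
user tendency degree and content relevant degree are given later in the thesis.

```python
def combine_scores(content_likelihood, user_prior, default_prior=1e-6):
    """Combine content and preference scores with Bayes' rule.

    content_likelihood: paper id -> P(context | paper), a content-based score.
    user_prior:         paper id -> P(paper | user), a preference score.
    default_prior:      assumed small prior for papers the user model has not
                        scored, so that they are not ruled out entirely.
    Returns paper id -> normalized posterior P(paper | user, context).
    """
    unnormalized = {}
    for paper, likelihood in content_likelihood.items():
        prior = user_prior.get(paper, default_prior)
        unnormalized[paper] = likelihood * prior
    total = sum(unnormalized.values())
    if total == 0.0:
        return unnormalized
    return {paper: value / total for paper, value in unnormalized.items()}
```

With content scores `{"a": 0.2, "b": 0.3}` alone, paper `b` ranks first. If the
same user's preference scores are `{"a": 0.6, "b": 0.1}`, the products are 0.12
and 0.03, so paper `a` moves to the top. This is the intended effect of
personalization: the same citation context can yield different rankings for
different users.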
Besides combining user information with content-based methods, the various
citation preference indicators must themselves be combined. These indicators
reflect user preferences, which form over a long period and can be regarded as
the joint result of many kinds of information, so this thesis combines them with
an SVM. The main components of the framework are thus Bayes' formula and the
SVM, both of which are described in detail later.

1.3 Thesis organization

The rest of this thesis is divided into four parts, organized as follows:
Chapter 2 introduces work related to citation recommendation, including paper
recommendation and citation recommendation. Because the algorithm uses an SVM,
the SVM is also introduced.
Chapter 3 introduces the experimental design: the requirements, acquisition,
and preprocessing of the data set, the evaluation methods, and the overall
experimental framework.
Chapter 4 introduces the model of this thesis. It first introduces the concepts
used and analyzes users' citation behavior, and then introduces the user
tendency degree, the content relevant degree, and their combination in the
model. It then presents the experimental results, analyzes parameter tuning, and
finally analyzes the features used.
Chapter 5 summarizes the work of this thesis and describes future work.

Chapter 2 Related work

2 Related work
This chapter introduces research related to citation recommendation. It first
introduces several works in the broader area of paper recommendation, then the
existing work on citation recommendation itself, and finally the SVM used in
this thesis.

2.1 Paper recommendation

Citation recommendation can be regarded as a special application of paper
recommendation, which is a relatively mature research topic. A great deal of
work exists on general paper recommendation; several representative works are
introduced here.
For recommender systems, the most commonly used method is collaborative
filtering [6]. Collaborative filtering assumes that people who had similar
interests in the past will show similar interests in future choices. When making
recommendations, it does not decide directly whether a user is interested in an
item. Instead, it finds users similar to the target user, picks out those who
have acted on the item, and infers the target user's preference for the item
from their behavior. Collaborative filtering is effective, simple, and suitable
for large-scale applications. Its most obvious drawback is the "cold start"
problem: when a system has just been established, it has few users and little
recorded behavior, so "similar users" are hard to find and cannot be used to
make recommendations. The cold start problem does not disappear after launch; it
accompanies the system permanently. Whenever a new item or a new user arrives,
the corresponding information is missing and no similar entities can be found
for it. Besides cold start, data sparsity and imbalance can also make the
recommendations of collaborative filtering unsatisfactory. For these reasons,
applying collaborative filtering directly to paper recommendation often fails to
achieve the desired results, and methods better suited to papers are needed.
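The user-based variant of collaborative filtering described above can be
sketched in a few lines. This is a minimal illustration rather than the method
of [6]: the set-based cosine similarity, the neighbourhood size `k`, and the
names `set_cosine` and `recommend_items` are assumptions made for the example.

```python
import math


def set_cosine(items_a, items_b):
    """Cosine similarity between two users, each given as a set of item ids."""
    if not items_a or not items_b:
        return 0.0
    return len(items_a & items_b) / math.sqrt(len(items_a) * len(items_b))


def recommend_items(target, history, k=2, top_n=3):
    """Recommend items that the users most similar to `target` have acted on.

    history maps each user id to the set of items that user has acted on.
    """
    own = history.get(target, set())
    similarities = [
        (set_cosine(own, items), user)
        for user, items in history.items()
        if user != target
    ]
    # Keep only the k most similar users that share at least one item.
    neighbours = sorted((s for s in similarities if s[0] > 0.0), reverse=True)[:k]
    scores = {}
    for similarity, user in neighbours:
        for item in history[user] - own:
            scores[item] = scores.get(item, 0.0) + similarity
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

The sketch also shows the cold start problem: a user whose history is an empty
set has zero similarity to everyone, so `recommend_items` returns an empty list
for that user.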
Because papers are text, their content can be used for recommendation [7]. Some
researchers construct a set of words describing the user's interests and analyze
their frequency in each article to make recommendations, which is in effect a
variant of the TF*IDF technique used in search engines. The language model [2]
is also often used in content-based methods: it ranks candidate papers by the
probability that each paper generates the user's interest words. Other
researchers use the vector space model [8] to calculate the similarity between
the user's interest vocabulary and each article's vocabulary and so determine
how strongly to recommend each article.
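The TF*IDF and vector space ideas above can be sketched together: each article
and the user's interest words become weighted term vectors, and articles are
ranked by cosine similarity to the interest vector. The smoothed IDF formula,
the whitespace tokenization, and the names `tfidf_vectors`, `vector_cosine` and
`recommend_by_content` are assumptions made for this illustration and are not
taken from [7] or [8].

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a TF*IDF vector (term -> weight) per document, plus the IDF table."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    vectors = {
        doc_id: {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return vectors, idf


def vector_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend_by_content(interest_words, docs):
    """Rank document ids by similarity to the user's interest words."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(interest_words.lower().split())
    profile = {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}
    return sorted(docs, key=lambda d: vector_cosine(profile, vectors[d]), reverse=True)
```

Interest words that occur in no document receive zero weight here, which is one
reason such methods depend heavily on vocabulary overlap between the user's
description and the candidate articles.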
Beyond these basic methods, several innovative paper recommendation methods
have appeared in recent years. K. Chandrasekaran et al. [9] refine content-based
recommendation further, recommending papers to users based on the information
users have in CiteSeer. They do not represent users or articles directly with a
bag-of-words model, nor do they use cosine similarity to compare users and
articles. Instead, they represent the user and the article as hierarchical
concept trees and measure the similarity between them by edit distance. Compared
with traditional content-based article recommendation, this method brings a
large improvement.
B. Shaparenko et al. [10] are not satisfied with recommending the paper itself;
they want to extract the original part of a paper and offer that to users to
read. They solve the problem with an unsupervised method: they analyze the text
with a language model, approximate the solution with convex programming, and
finally compute cosine similarity over the full text to obtain the main body of
the paper and recommend it. S. McNee et al. [one] aim to solve the cold start
problem when building a paper recommendation system. They use the citation
relationships between existing researchers, the citation relationships among
papers, and other link information as the initial data for collaborative
filtering, so that the system can make good recommendations from its first run.
K. Sugiyama et al. [a] use the citing and cited information of the papers a
user has published to build the user's neighboring papers and neighboring
authors, and combine these with the user's collaborators and the content of the
published papers to build the user's profile. The analysis focuses on the
user's recent interests, and the similarity between the user's profile and
other documents serves as the main basis for recommendation.
D. Zhou et al. [to] build three graphs from the relationships between
authors, the relationships between papers, and the relationships between
authors and papers, and combine the three graphs with item-based collaborative
filtering. When measuring the similarity between items they use a
low-dimensional representation of the data, turn the problem into an
optimization problem, and build the model with semi-supervised learning. T.
Tang et al. [to], working in an online learning system, recommend papers to
users with model-driven and hybrid collaborative filtering. Their work takes
into account users' different interests and levels of knowledge; after each
recommendation has been made and read, the user's knowledge level is updated so
that the system can keep recommending suitable articles.

2.2 Citation Recommendation

Citation recommendation is a research topic that has become active in recent
years. Little work has been done so far in this specialized direction, but its
significance has gradually been recognized. Trevor Strohman et al. [copy] made
the first attempt at citation recommendation. They take the entire manuscript
as the system input, treat it as one long query string, search the paper
library with it, and return the retrieved papers as recommendations. The search
is divided into two steps. In the first step, the top $number papers are
retrieved from a collection of millions of papers by content similarity alone.
In the second step, the papers cited by these $number papers are added to the
candidate list, so that the list expands to 1000~3000 papers. The candidates
are then re-ranked using simple features: publication time, content similarity,
common citations, common authors, Katz distance, and citation count. The weight
of each feature is found by gradient ascent, and the recommendation score of
each paper is obtained from a weighted linear model. The model is intuitive and
effective, but it cannot make recommendations for a specific citation context.
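The two-step search and the weighted linear re-ranking described above can be sketched as follows. This is only an illustration under assumed inputs: `top_hits` stands for the result of the first content-only retrieval, `cites` is a hypothetical map from a paper to the papers it cites, and the feature names and weights are invented for the example rather than taken from the cited work.

```python
def expand_candidates(top_hits, cites):
    """Step 2: add every paper cited by a first-pass hit to the candidates."""
    candidates = list(top_hits)
    seen = set(top_hits)
    for paper in top_hits:
        for cited in cites.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                candidates.append(cited)
    return candidates

def rerank(candidates, features, weights):
    """Re-rank candidates by a weighted linear combination of their features."""
    def score(paper):
        return sum(weights[name] * value
                   for name, value in features[paper].items())
    return sorted(candidates, key=score, reverse=True)
```

In the described method the weights would be learned (the text mentions gradient ascent); here they are simply passed in.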
J. Tang et al. [in] do not apply heuristics when making citation
recommendations; instead they recommend through topic similarity. Given a
collection of papers with citation relationships, they propose a two-layer
RBM (restricted Boltzmann machine) model that learns topic distributions from
paper content and citation relationships. Given a citation context, the topic
model matches it against the candidate papers and ranks the papers by degree
of match. Compared with the previous method, this method brings a certain
improvement.
Y. Lu et al. [3] argue that the wording of a citation context differs
greatly from that of the cited paper, so a direct similarity search between
them is not appropriate. They use a translation model that treats the citation
context and the paper itself as two different languages. The translation
probabilities between citation contexts and papers are estimated statistically
from existing data, and for a given citation context the probability of each
paper is computed and used to rank the recommended papers. This method bridges
the gap between the citation context and the paper and achieves good results.
Q. He et al. have studied citation recommendation for a long time. In
[notes] they first point out that the content of an article is noisy and does
not summarize the article well, whereas the citation context written when
another article cites it is usually a concise description that captures its
main points. Therefore, instead of retrieving papers directly from the article
library, they judge content similarity against the library of citation
contexts of each paper and recommend according to that similarity. This method
resembles the anchor text technique in search engines and has achieved good
results.
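The idea of describing a paper by the citation contexts that point to it, much as anchor text describes a web page, can be sketched as below. The term-overlap score is a deliberately simple stand-in chosen for this example; the similarity measure actually used in the cited work is not given here, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_context_profiles(citations):
    """Merge all citation contexts that cite a paper into one word profile.

    citations: iterable of (cited_paper_id, context_tokens) pairs.
    """
    profiles = defaultdict(Counter)
    for paper_id, tokens in citations:
        profiles[paper_id].update(tokens)
    return profiles

def rank_by_context_overlap(query_tokens, profiles):
    """Rank papers by word overlap between the query and their profiles."""
    query = Counter(query_tokens)

    def overlap(profile):
        return sum(min(n, profile[w]) for w, n in query.items())

    return sorted(profiles, key=lambda pid: (-overlap(profiles[pid]), pid))
```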
Q. He et al. then extended the work of [review]. In [post], the positions
of the citation placeholders are no longer given; instead, the system finds
the places in the whole manuscript where a citation is needed and
automatically supplies citations based on the surrounding context. In this
way, users can write without considering citations at all. To locate citation
contexts, they first collect vocabulary statistics over all citation contexts
in the document set to learn which words tend to need a citation. They then
split the user's manuscript into sentences of a fixed number of words, compute
the similarity between each sentence and the citation-prone vocabulary
obtained from the paper library, and use the resulting hot spots across the
manuscript to decide which sentences need citations. This way of locating
citations achieves good accuracy and is very convenient for writers.
These methods share a common problem: citation contexts with similar
content always receive similar recommendation lists. However, reading range
and citation preferences differ greatly from person to person, so for similar
content different people actually want to cite different articles, and
personalized processing is needed. This thesis models the user's information
and makes recommendations based on both the user's preferences and the content
the user has written, with the aim of further improving recommendation
accuracy and user satisfaction.

2.3 Support Vector Machine (SVM)

The citation recommendation problem can be understood as a classification
problem: for a given citation context, papers fall into two classes according
to whether or not they are cited. The ranking of the recommended papers can
then be determined from the confidence given by the classifier.
The classifiers commonly used in research fall into three main categories:
classification algorithms based on information theory, methods based on TFIDF
weights, and classification algorithms based on knowledge learning. Each
category contains many specific methods. Among those based on information
theory, the most common are the Naive Bayes (NB) algorithm and the Maximum
Entropy algorithm. The most commonly used methods based on TFIDF weights are
the TFIDF algorithm and the k-nearest neighbors (KNN) algorithm. The most
common algorithms based on knowledge learning are decision trees, artificial
neural networks (ANN), and the support vector machine (SVM) classification
method [to] used in this thesis.
Many features are needed when building user preferences, but when they are
finally merged with the other models for ranking, only a single value can be
used. The multiple features therefore need to be consolidated.
The support vector machine regards each instance as a point in space, with
as many dimensions as there are features. For ease of presentation, take two
dimensions as an example. Each instance is then a point in a two-dimensional
plane, as shown in Figure 2.1, where the circular and square dots represent two
different classes. An SVM first requires a training set to train the model,
which can be expressed as:

$$S = ((x_1, y_1), \ldots, (x_l, y_l)) \subseteq (X \times Y)^l \tag{2-1}$$

where $l$ is the number of samples, $x_i$ is a sample, $y_i$ is the label of
that sample, $X$ is the input space, and $Y$ is the output space. The two kinds
of points in Figure 2.1 form a data set on which a support vector machine can
be trained.
H in Figure 2.1 is the separating surface of the support vector machine. H1
and H2 are two planes parallel to H, each passing through sample points of a
different class, with no sample lying between H1 and H2. The distances from H
to H1 and from H to H2 are equal, and all training samples are correctly
separated. H is then called the optimal hyperplane and can be expressed by the
following formula:

$$\langle w \cdot x \rangle + b = 0 \tag{2-2}$$

where $w$ is the normal vector of H and $b$ is the offset of H. The sample
points lying on H1 and H2 are called support vectors. H, which lies midway
between H1 and H2, performs the classification.

Figure 2.1 Schematic of the support vector machine (SVM) principle

The principle of the SVM is to compute an optimal hyperplane that gives the
best classification of the training samples; the classifier that uses this
hyperplane is the classification tool used in this thesis. Given a training
sample S, training amounts to solving the following optimization problem:

$$\min_{w,b} \; \langle w \cdot w \rangle \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1, \quad i = 1, \ldots, l \tag{2-3}$$

The goal is a hyperplane $(w, b)$ for which the geometric margin
$\gamma = 1 / \|w\|_2$ takes its maximum value.
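This is a constrained quadratic programming problem. The Lagrange
multiplier method can be used to transform it into its dual problem, and the
dual form can then be expressed through the following conditions:

$$0 = w - \sum_{i=1}^{l} y_i \alpha_i x_i, \quad (\alpha_i \ge 0), \quad i = 1, \ldots, l \tag{2-4}$$

$$0 = \sum_{i=1}^{l} y_i \alpha_i, \quad (\alpha_i \ge 0), \quad i = 1, \ldots, l \tag{2-5}$$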
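For reference, the Lagrangian from which the dual is derived is not written out in this section. Assuming the standard hard-margin formulation (an assumption, since the thesis does not show it), it would read:

```latex
L(w, b, \alpha) = \frac{1}{2} \langle w \cdot w \rangle
  - \sum_{i=1}^{l} \alpha_i \left[ y_i \left( \langle w \cdot x_i \rangle + b \right) - 1 \right],
  \qquad \alpha_i \ge 0
```

Under that assumption, setting the partial derivatives with respect to $w$ and $b$ to zero yields $w = \sum_{i=1}^{l} y_i \alpha_i x_i$ and $\sum_{i=1}^{l} y_i \alpha_i = 0$.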

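Substituting these into the original Lagrangian gives:

$$L(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \tag{2-6}$$

The dual optimization problem is therefore:

$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \quad \text{subject to} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \quad (\alpha_i \ge 0), \quad i = 1, \ldots, l \tag{2-7}$$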

Assume $\alpha^*$ and $b^*$ are the solution to the above optimization
problem; the decision rule is then given by:

$$\operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i^* \langle x_i \cdot x \rangle + b^* \right) \tag{2-8}$$

This formula determines whether a sample belongs to a class: given a new
sample $x$, substitute it into the formula above; if the result is 1 the sample
belongs to the class, otherwise it does not.
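The decision rule (2-8) can be evaluated directly once the multipliers and the offset are known. The sketch below assumes they have already been obtained by some solver; the function names are hypothetical, and the raw value before taking the sign is also returned separately because a ranking needs a confidence rather than only a class.

```python
def dot(a, b):
    """Inner product <a . b> of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decision_value(x, support_vectors, labels, alphas, b):
    """Value inside sgn(...) in (2-8): sum_i y_i * alpha_i * <x_i . x> + b."""
    return sum(y * a * dot(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

def classify(x, support_vectors, labels, alphas, b):
    """Return +1 if x is assigned to the class, otherwise -1."""
    value = decision_value(x, support_vectors, labels, alphas, b)
    return 1 if value >= 0 else -1
```

For two support vectors (1, 1) with label +1 and (-1, -1) with label -1, the multipliers 0.25 and 0.25 with offset 0 satisfy the constraints of (2-7) and place both points exactly on the margin.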
The two-dimensional case is used here as an example; the multidimensional
case is similar, with H a hyperplane that performs the classification. During
classification, support vector machines often map the vectors into a
higher-dimensional space and construct the maximum-margin hyperplane there.
In this work, the user's tendency toward a paper is divided into three
aspects, which extend to nine dimensions. Different samples are therefore
different points in a nine-dimensional space, and the support vector machine is
used to integrate the dimensions that reflect user preference into a single
ranking criterion. The next chapter describes the model of this thesis, the PCR
model, in detail.

3 PCR

PCR: Personalized Citation Recommendation

3.1 Related Concepts

Before introducing the method, we first explain the concepts and
abbreviations used in this thesis:

Table 3.1 Description of abbreviations

Name                         Abbreviation
Content Relevance Degree     CRD
    Measures the relevance between a citation context and an article when
    only content is considered. This value can be computed with any
    content-based model, such as a common language model.
User Tendency Degree         UTD
    Measures the tendency of a user to cite an article. It is the focus of
    this thesis.
Cite Possibility Degree      CPD
    The likelihood that a citation context cites a paper. It is the final
    measure used for recommendation.

With these concepts, the input and output of the system constructed in this
thesis are:
Input: 1) the metadata collection P of the papers, where the metadata must
contain each paper's content, author information, conference information, and
citation information; 2) a set of citation contexts for which no citation has
been given, together with the authors of these citation contexts.
Output: for each citation context, a list of papers sorted by CPD.

3.2 Citation Behavior Analysis

To better understand why users need personalization, first consider in
detail the whole process by which a user u cites a paper t, as shown in the
following figure.

Figure 3.1 The process by which a user's citation behavior occurs

First, u must know that t exists, become interested in t, and then read it.
Later, when u is writing a paper p, u recalls that t is relevant; at this point
u may write a description d of t and finally cite t. Even before p is written,
t therefore has a higher probability of being cited than a paper that u has not
read.
The value between d and t is the CRD, and the value between u and t is the
UTD. Previous work often used the CRD as the CPD and ignored the UTD, which
greatly limited further improvement of the recommendations. In fact, users
first have different UTD values for different papers and then produce
personalized citation behavior based on these values. In other words, in
citation recommendation the UTD can be regarded as a prior of the CPD. Their
relationship can be described by the following formula:

$$CPD = UTD \times CRD \tag{3-1}$$

This work therefore focuses on the UTD. It uses CRD models completed by
previous researchers and adds the UTD on top of them in order to further
improve citation recommendation. The following sections describe in detail how
the UTD and the CRD are constructed.

3.3 UTD (User Tendency Degree)

3.3.1 Build user information

As mentioned earlier, a UTD value must be computed for each "paper - user"
pair. The first step is to build the user information. Users are divided into
two types: junior researchers who have no publication history, and senior
researchers who do.
For junior researchers it is difficult to build user information, and
without user data personalization has nothing to draw on. From another point of
view, however, not everyone needs personalization. Junior researchers are not
yet deeply involved in research and have usually not formed stable interests
and citation preferences, so personalization adds little value to their
recommendations. In addition, the range of work that junior researchers know is
narrow; restricting recommendations to what they already know would only limit
their knowledge further. Therefore, for junior researchers the recommendation
simply follows the CRD, that is, the results of previous research can be used
directly.
Senior researchers have been engaged in research for many years and have
made their own contributions to a field. They have gradually formed their own
interests and citation preferences, and these preferences influence their
citation behavior, so these researchers need personalized treatment. The papers
that senior researchers have published are the key to building their user
information, and this is the focus of the PCR model. For most researchers who
have worked for a number of years, the required information can be obtained
from their publication history, including:
1) The set of all papers the user has cited. This uses the user's past
citations as the basis for inferring the user's citation preferences.
2) The set of all authors who have collaborated with the user. This accounts
for the influence of the user's social circle on the user's citation
preferences.
3) The set of all authors the user has cited. This is the user's circle of
attention and also reflects the user's preferences.
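The three sets listed above can be collected from a publication history as in the following sketch. The metadata layout (a dict from paper id to `authors`, `venue`, and `refs`) and the function name are assumptions made for this example; excluding the user from the set of cited authors is likewise a choice of the sketch, not something stated in the thesis.

```python
def build_user_profile(user, papers):
    """Collect cited papers, collaborators, and cited authors for one user.

    papers: dict paper_id -> {"authors": [...], "venue": str, "refs": [...]}
    (a hypothetical layout for the paper metadata).
    """
    cited_papers, collaborators, cited_authors = set(), set(), set()
    for meta in papers.values():
        if user not in meta["authors"]:
            continue
        # Co-authors of the user's own papers are the user's collaborators.
        collaborators.update(a for a in meta["authors"] if a != user)
        for ref in meta["refs"]:
            cited_papers.add(ref)
            # Authors are only known for cited papers that have metadata.
            if ref in papers:
                cited_authors.update(papers[ref]["authors"])
    cited_authors.discard(user)
    return cited_papers, collaborators, cited_authors
```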

3.3.2 Using user information to build UTD

Given a user u and a target paper t, how is the UTD between u and t
measured once this user information is available? This thesis argues that users
have different UTD values for different papers because of certain
recommendation and extension relationships. Recommenders come from three
levels: the user, the user's collaborators, and the authors cited by the user.
People at these three levels recommend papers to the user in two ways, writing
and citing, and both are regarded as recommendation behavior. For example, when
an author writes a paper p or cites a paper p once, the author is considered to
have recommended p once. After receiving recommendations from the three levels
of people, the user extends them: besides noticing the paper itself, the user
also notices the authors of the paper and the conference in which the paper was
published. The three levels of recommenders and three kinds of extension give
3x3=9 prior features that can be used as the UTD (Table 3.2).
Table 3.2 Different levels of recommenders and different ways of extension

Recommender \ Extension      Paper itself   Paper author   Paper conference
User                         UTD1_1         UTD1_2         UTD1_3
User's collaborators         UTD2_1         UTD2_2         UTD2_3
Authors cited by the user    UTD3_1         UTD3_2         UTD3_3

Next, each UTD feature is described in detail, with supporting statistics
from the data where available and the formula used to compute it. In the
formulas below, $count(x, y)$ is the number of times $x$ recommends $y$, where
$x$ is a user and $y$ may be a paper, an author, or a conference. The total
number of times user $x$ recommends papers is written here as
$count_{paper}(x)$, and the total number of times user $x$ recommends authors
as $count_{author}(x)$. For example, if $x$ recommends only one paper, which has
three authors, then $count_{paper}(x)$ is 1 and $count_{author}(x)$ is 3. The
variable $u$ is the current user and $t$ is the target paper. $A$ is the set of
authors of the target paper, and $v$ is the conference or journal in which $t$
was published. The set of people who have collaborated with $u$ is written here
as $Col(u)$, and the set of people cited by $u$ as $Ref(u)$.
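The counting convention above, in which writing a paper and citing a paper each count as one recommendation, can be sketched as follows. The metadata layout and the function name are assumptions made for this example, and papers without metadata contribute to the paper counts only, since their authors and conference are unknown.

```python
from collections import Counter

def recommendation_counts(user, papers):
    """Count how often `user` recommends each paper, author, and conference.

    papers: dict paper_id -> {"authors": [...], "venue": str, "refs": [...]}
    (a hypothetical layout). Returns three Counters keyed by paper id,
    author name, and conference name.
    """
    paper_counts, author_counts, venue_counts = Counter(), Counter(), Counter()
    for pid, meta in papers.items():
        if user not in meta["authors"]:
            continue
        # Writing pid is one recommendation; each citation is one more.
        for target in [pid] + list(meta["refs"]):
            paper_counts[target] += 1
            target_meta = papers.get(target)
            if target_meta:
                author_counts.update(target_meta["authors"])
                venue_counts[target_meta["venue"]] += 1
    return paper_counts, author_counts, venue_counts
```

With this layout, the total number of paper recommendations is `sum(paper_counts.values())` and the total number of author recommendations is `sum(author_counts.values())`, which reproduces the one-paper, three-author example from the text.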
First comes the user's own recommendation behavior. Users are obviously
more familiar with papers they have written or cited, and have a stronger
tendency to cite them. This tendency affects not only the user's citing of the
paper itself but also extends to the paper's authors and conference. This gives
UTD1_1 to UTD1_3.
1. UTD1_1: the number of times the user has recommended paper t, as a
proportion of the user's total number of paper recommendations. This feature
takes into account two kinds of user behavior: citing one's own published
papers, and citing again a paper one has cited before. In the data set of this
thesis, the number of paper metadata records is 55823, and the average number
of papers per author is "2" for each article, which is written by the author of
3. $number, and the average number of times an author again cites a paper that
they have written or cited is 5. In other words, for an author, the probability
of any given paper being cited is 0.079%, while the probability of a paper that
the author has written or cited being cited is 29%, which is $number times the
former. A paper that the user has written or cited therefore has a high
probability of being cited by the user again.
This phenomenon is not surprising. First, each researcher's research area
is relatively concentrated, and a researcher's current work is usually closely
related to past work, so previously published or cited papers are more likely
to be relevant to the current work. Second, users are more familiar with their
own work and with the work they have cited. These two factors lead to a higher
citation tendency. Therefore, given a paper, the user's own recommendation
behavior is the first feature to be considered. The formula is as follows:

$$UTD_{1\_1} = \frac{count(u, t)}{count_{paper}(u)} \tag{3-2}$$

2. UTD1_2: the number of times the user has recommended an author a in the
author set A of paper t, as a proportion of the user's total number of author
recommendations. This feature takes into account the behavior of citing papers
by an author the user has cited before. In the data set of this thesis, the
average number of $number times per user appears in this case, and the total
author of each user reference is $number. $number; in other words, a user who
has $number will cite an author of a previously cited article. If a user has
cited an author many times, the user is familiar with and recognizes that
author's work, and is then very likely to be familiar with most or all of the
author's papers. This feature extends the focus to the authors of the target
paper, with the following formula:

$$UTD_{1\_2} = \frac{count(u, a)}{count_{author}(u)}, \quad a \in A \tag{3-3}$$

3. UTD1_3: the number of times the user has recommended the conference v of
paper t, as a proportion of the user's total number of recommendations. In the
data set of this thesis, the average number of times a user cites a previously
cited conference is $number, and the number of times a conference is cited is
$number, which means that a user has a $number probability of citing a paper
from a conference the user has cited before. This feature takes into account
the habit of reading and citing papers from familiar conferences. The more
often a user cites papers from a conference or publishes in it, the more
familiar the user is with the conference and the more likely the user is to
cite a paper from it. This feature extends in the conference direction and is
computed as follows:

$$UTD_{1\_3} = \frac{count(u, v)}{count_{paper}(u)} \tag{3-4}$$

The three formulas above all consider the user's own behavior. Besides the
user, the user's collaborators also act as recommenders and influence the
user's future citation behavior, so the papers published or cited by the user's
collaborators also play a role in this model. The following three prior
features consider the recommendation behavior of the user's collaborators.
4. UTD2_1: the number of times the user's collaborator set Col(u) has
recommended paper t, as a proportion of the collaborators' total number of
recommendations. This feature takes into account the following behavior: users
are familiar with the work of their collaborators and read their papers. In the
data set of this thesis, the average number of collaborators per user is 9.
$number, the average number of citations to collaborators is 5.4, and the
number of papers referenced by collaborators is 33.69. times, the average
recommendation of each partner resulted in the 2. the secondary reference, and
the average of any other user only has 0. The Times, is the partner's Ten. 86%.
The recommendation power of collaborators is thus much higher than that of
other authors. Users usually have close contact with their collaborators, so
they are very likely to be familiar with their collaborators' work and to read
their papers. A paper recommended by a collaborator can therefore have a
certain influence on the user, and UTD2_1 measures this influence, as shown in
the following formula:

$$UTD_{2\_1} = \frac{count(Col(u), t)}{count_{paper}(Col(u))} \tag{3-5}$$

5. UTD2_2: the number of times the user's collaborator set Col(u) has
recommended an author a in the author set A of paper t, as a proportion of the
collaborators' total number of author recommendations. Once the collaborators'
recommendations have influenced the user's view of a paper, the user may extend
that influence to the authors of the recommended papers. The formula is as
follows:

$$UTD_{2\_2} = \frac{count(Col(u), a)}{count_{author}(Col(u))}, \quad a \in A \tag{3-6}$$

6. UTD2_3: the number of times the user's collaborator set Col(u) has
recommended the conference v of paper t, as a proportion of the collaborators'
total number of recommendations. As with UTD1_3, the conference dimension can
also be extended from the perspective of the user's collaborators. This feature
is described by the following formula:

$$UTD_{2\_3} = \frac{count(Col(u), v)}{count_{paper}(Col(u))} \tag{3-7}$$

In addition to the user and the user's collaborators, the authors cited by
the user can also recommend to the user. If an article, an author, or a
conference has been recommended many times by the authors the user cites, the
user has a higher probability of citing it. The next three prior features
consider the behavior of the authors cited by the user along the three kinds of
extension (the paper itself, the authors of the paper, and the conference in
which the paper was published). They are computed in the same way as UTD2_1 to
UTD2_3.
7. UTD3_1: the number of times the set Ref(u) of authors cited by the user
has recommended paper t, as a proportion of their total number of
recommendations. This feature considers only the target paper itself and is
computed as:

$$UTD_{3\_1} = \frac{count(Ref(u), t)}{count_{paper}(Ref(u))} \tag{3-8}$$

8. UTD3_2: the number of times the set Ref(u) of authors cited by the user
has recommended an author a in the author set A of paper t, as a proportion of
their total number of author recommendations. This feature extends to the
author dimension with the following formula:

$$UTD_{3\_2} = \frac{count(Ref(u), a)}{count_{author}(Ref(u))}, \quad a \in A \tag{3-9}$$

9. UTD3_3: the number of times the set Ref(u) of authors cited by the user
has recommended the conference v of paper t, as a proportion of their total
number of recommendations. This feature extends to the conference dimension
with the following formula:

$$UTD_{3\_3} = \frac{count(Ref(u), v)}{count_{paper}(Ref(u))} \tag{3-10}$$

These nine features are in effect priors of the CPD along nine dimensions.
Each of the nine priors can be multiplied by the CRD computed by a previous
model, which gives nine different CPD values. Combining these nine CPD values
yields the criterion that is finally used to rank the candidate papers for a
given citation context.
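The nine priors and their products with the CRD can be sketched as below. The sketch assumes that each recommender is summarized by three counters (papers, authors, conferences), that a group such as the collaborators is pooled into one recommender by adding counters, and that a single author of the target paper is scored at a time; these are simplifications of this example, and the function names are hypothetical. How the nine values are merged into one score is outside this sketch.

```python
from collections import Counter

def pool(count_triples):
    """Add up the (paper, author, conference) counters of a group of people."""
    papers, authors, venues = Counter(), Counter(), Counter()
    for p, a, v in count_triples:
        papers.update(p)
        authors.update(a)
        venues.update(v)
    return papers, authors, venues

def level_features(counts, paper, author, venue):
    """Three ratios for one recommender level: paper, author, conference."""
    papers, authors, venues = counts
    total_papers = sum(papers.values())
    total_authors = sum(authors.values())

    def ratio(n, d):
        return n / d if d else 0.0

    return [ratio(papers[paper], total_papers),
            ratio(authors[author], total_authors),
            ratio(venues[venue], total_papers)]

def nine_cpd_components(own, collaborators, cited, paper, author, venue, crd):
    """Multiply each of the nine prior features by the content score CRD."""
    features = []
    for counts in (own, pool(collaborators), pool(cited)):
        features.extend(level_features(counts, paper, author, venue))
    return [f * crd for f in features]
```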

3.4 CRD (Content Relevance Degree)

This thesis uses a unigram language model [2] and a translation model
improved on abstracts [3] as the CRD part of the algorithm; both also serve as
comparison methods. The following describes how these two methods are applied
to the data set of this thesis.

3.4.1 Language Model

The language model was proposed in the $number year. It is mainly applied
in information retrieval, where it has achieved good results, and researchers
have since applied it to speech recognition, handwriting recognition, machine
translation, and part-of-speech tagging. The language model holds that the
appearance of a word depends on the words that precede it and not on the words
that follow. Language models are classified by the number of preceding words
they take into account: if that number is n-1, the model is called an n-gram
language model. In practical systems, for reasons of efficiency, a unigram or
bigram language model is usually used. This thesis therefore uses a unigram
language model as a CRD.
The citation recommendation process takes a user's citation context, scores
each paper in the collection by CPD, and recommends papers in descending order
of CPD. Let the paper be D and the citation context be C. When only content is
considered, the CPD is expressed as the probability $p(D|C)$, which by Bayes'
theorem is:

$$p(D|C) = \frac{p(C|D)\,p(D)}{p(C)} \tag{3-11}$$
Because only the ranking matters and not the specific values, for the same
C we have, in the ranking sense, $p(D|C) \propto p(C|D)\,p(D)$, where $p(D)$ is
the prior probability of paper D. Here the papers can be assumed to follow a
uniform distribution, that is, all papers are equally probable, so $p(D)$ is
the same for every paper and does not affect the final ranking. The CPD can
therefore use the value of $p(C|D)$ directly. This thesis estimates $p(C|D)$
with a unigram language model. By definition, the unigram model assumes that
words occur independently of each other, regardless of the preceding words; it
is in effect a "bag-of-words model". It can be expressed as:

$$p(C|D) = \prod_{i=1}^{n} p(w_i|D) \tag{3-12}$$

where $w_i$ is the i-th word of the citation context and $n$ is the total
number of words in the citation context. $p(w_i|D)$ is the probability of word
$w_i$ under the word distribution of paper D. Suppose the total number of words
in paper D is $|D|$ and a word $w$ occurs $tf_{w,D}$ times in the paper; the
probability of the word is then:

$$p(w|D) = \frac{tf_{w,D}}{|D|} \tag{3-13}$$

However, no paper contains every word, and it is unreasonable for the score
of a document to drop to 0 as soon as one word of the citation context does not
appear in it. This is the very common data sparsity problem of language models,
and it is generally solved with smoothing. The main idea of smoothing is to
slightly reduce the probability of the words that do appear in the paper and to
give the probability removed to the words that do not appear. This thesis uses
Dirichlet smoothing. Suppose the whole paper collection is S, the total number
of words in the collection is $|S|$, and a word $w$ occurs $|w|$ times in it;
the probability of the word in the collection is then:

$$p(w|S) = \frac{|w|}{|S|} \tag{3-14}$$

Given a citation context, the probability of each word is computed not only
from the current document, $p(w|D)$, but also from the document collection,
$p(w|S)$, and the two are balanced by the parameter $\alpha$:

$$p'(w|D) = (1 - \alpha)\,p(w|D) + \alpha\,p(w|S) \tag{3-15}$$

where $\alpha$ is a variable that depends on the document length:

$$\alpha = \frac{\mu}{|D| + \mu} \tag{3-16}$$

where $\mu$ is a constant set empirically. Substituting $p'$ for $p$ in
formula (3-12) gives the following formula for computing the CPD:

$$p(C|D) = \prod_{i=1}^{n} \left[ \left(1 - \frac{\mu}{|D| + \mu}\right) p(w_i|D) + \frac{\mu}{|D| + \mu}\, p(w_i|S) \right] \tag{3-17}$$

The result of the above formula is then used as the basis for ranking the
candidate papers.
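A minimal sketch of the Dirichlet-smoothed unigram score is given below. It works in log space to avoid underflow on long contexts, which is an implementation choice of this example rather than something stated in the thesis; the function name, the default value of `mu`, and the precomputed collection statistics are likewise assumptions.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(context, doc, collection_tf, collection_len, mu=1000.0):
    """Log of p(C|D) under a Dirichlet-smoothed unigram language model.

    context, doc: lists of tokens. collection_tf: dict word -> occurrences in
    the whole collection S. collection_len: total number of words in S.
    """
    tf = Counter(doc)
    alpha = mu / (len(doc) + mu)  # weight of the collection model
    log_p = 0.0
    for w in context:
        p_doc = tf[w] / len(doc) if doc else 0.0
        p_col = collection_tf.get(w, 0) / collection_len
        p = (1 - alpha) * p_doc + alpha * p_col
        if p == 0.0:
            # The word occurs nowhere in the collection.
            return float("-inf")
        log_p += math.log(p)
    return log_p
```

Candidate papers would then be sorted by this value in descending order; because the logarithm is monotonic, the ranking is the same as ranking by the product itself.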

3.4.2 Translation Model

As early as the $number year, Warren Weaver proposed the basic idea of the
translation model. In the $number year, Berger et al. applied translation
models to information retrieval to bridge the gap between query words and Web
pages. The citation context and the paper itself also differ in wording, so
Yang Lu et al. [3] used the translation model for citation recommendation and
achieved good results.
The translation model works at the level of words and computes the
probability that a word w1 translates to a word w2. This thesis uses all
citation contexts C in the data set together with the papers D they cite to
form a training set t = {(C, D)}. Estimating the model T amounts to maximizing
the likelihood of C given D:

$$T = \arg\max_{T} \prod_{(C,D) \in t} p(C \mid D, T) \tag{3-18}$$

The model T is estimated with a heuristic method [to]:

$$p(w_c | w_d) = \frac{count(w_c, w_d)}{count(w_d)} \tag{3-19}$$

where $w_d$ is a word in paper D, $w_c$ is a word in citation context C,
$count(w_c, w_d)$ is the number of times $w_c$ and $w_d$ occur together in a
citation relationship, and $count(w_d)$ is the number of times $w_d$ occurs in
citation relationships. In the data processed by the translation model, the
same word is too rarely linked to itself across a citation relationship, a
phenomenon called low self-translation probability. This thesis corrects it
with the method proposed by Xue et al. [21], which gives:

$$p'(w_c | w_d) = \beta \cdot 1(w_c = w_d) + (1 - \beta)\, p(w_c | w_d) \tag{3-20}$$

where 1(w_c = w_d) is 1 when w_c and w_d are the same word and 0 otherwise. This gives the translation probability between a word in the citation context and a word in the paper. Given a citation context C, the probability that a paper D is referenced is:

$$p(C \mid D) = \prod_{w_c \in C} p(w_c \mid D) \tag{3-21}$$

where

$$p(w_c \mid D) = \lambda\, p(w_c \mid S) + (1 - \lambda) \sum_{w_d \in D} p(w_c \mid w_d)\, p(w_d \mid D) \tag{3-22}$$

Here p(w_c|S) and p(w_d|D) are the maximum likelihood estimates of a word in the collection of all papers S and in paper D, respectively, and p(w_c|w_d) is the probability that a word in the paper translates to a word in the citation context, computed as in (3-19). The translation model thus bridges the word gap between the citation context and the paper. According to [3], the model performs better when the translation probabilities in (3-20) are estimated on paper abstracts, so the translation model trained on abstracts is used as the CRD and as one of the comparison methods for the final results of this article.
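The heuristic estimate (3-19) and the self-translation correction (3-20) can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in this work: the function name, the interpretation of the counts as the number of (citation context, paper) pairs in which the words occur, and the default weight beta are all assumptions.

```python
from collections import Counter


def train_translation(pairs, beta=0.5):
    """Estimate word translation probabilities from (context, paper) pairs.

    pairs -- iterable of (context_words, paper_words); hypothetical input
    beta  -- weight of the self-translation term in (3-20); placeholder value
    Returns a function p(wc, wd) giving the corrected probability.
    """
    co_occurrence = Counter()   # count(w_c, w_d): pairs containing both words
    occurrence = Counter()      # count(w_d): pairs whose paper contains w_d
    for context_words, paper_words in pairs:
        context_set, paper_set = set(context_words), set(paper_words)
        for wd in paper_set:
            occurrence[wd] += 1
            for wc in context_set:
                co_occurrence[(wc, wd)] += 1

    def p(wc, wd):
        # Heuristic estimate, formula (3-19).
        base = co_occurrence[(wc, wd)] / occurrence[wd] if occurrence[wd] else 0.0
        # Self-translation correction, formula (3-20).
        same = 1.0 if wc == wd else 0.0
        return beta * same + (1 - beta) * base

    return p
```

With beta greater than zero, a word always keeps some probability of translating to itself, which is the effect the correction is meant to achieve.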

3.5 Combining UTD and CRD

Once the values of UTD and CRD are available, the next difficulty is how to combine them. For a given citation context and paper, a single score is required to measure the CPD, which is then used to sort the list of candidate papers. The combination process runs into four problems: the value gap, multiple scores, multiple authors, and the imbalance between positive and negative examples. This article addresses them as follows:

3.5.1 Bridging the value gap

In formula (3-1), UTD and CRD are multiplied. However, calculating CRD means multiplying the probabilities of all the words in the citation context. The citation context is generally long, its average length is 4, and the probabilities of different words differ greatly, so the value of CRD varies enormously across citation context and paper pairs. Take as an example the citation context "sprint [28] and rainforest propose two scalable techniques for decision tree building ..." from the paper []. With the language model, its maximum CRD over the candidate set is 6.726 × 10^-58 and its minimum CRD is 3.361 × 10^-137, so the largest value is 2 × 10^78 times the smallest. With the translation model, its maximum CRD over the candidate set is 8.86 × 10^-28 and its minimum CRD is 5.936 × 10^-77, so the largest value is 5 × 10^48 times the smallest. For the UTD that varies most for this citation context, the maximum value is 0.0438 and the smallest non-zero value is 3.977 × 10^-6, so the maximum is only 11017 times the minimum. Therefore, when UTD is multiplied by CRD in this case, the role of UTD is almost negligible, and the final ranking is almost identical to the one obtained with CRD alone.
The magnitude of UTD depends primarily on the size of the dataset, while the magnitude of CRD depends on the length of the citation context. The citation context can be extracted with a relatively fixed length. The size of the dataset, however, changes with the size of the crawl, so a dynamic tuning method is needed to balance the impact of the dataset size.
Here, a common data-balancing method is used to solve this problem: a shrink variable α is introduced and an exponent is used to adjust for the impact of different data sizes. Formula (3-1) is updated to the following formula:

CPD = 1 (3-23)

A better result can be obtained by tuning the parameter α, and α can be re-adjusted whenever the dataset size changes.

3.5.2 Combining multiple scores

The model in this article analyzes the user's citation preference from three angles in each of three aspects, so there are 9 different priors. After these 9 UTD values are combined with CRD, there are 9 different scores. To get a final score for sorting, these 9 scores must be combined, which is the second problem mentioned at the beginning of this section. The combination also runs into the remaining two problems mentioned there.
Combining the scores is a multidimensional combination problem, and this article uses the SVM described above to solve it. Citation recommendation is in essence a citation prediction problem, and this article treats it as a classification problem. Each citation context and paper pair has only two possible relationships: cited or not cited. This article represents each pair by its 9 scores, so each pair is a point in a 9-dimensional space. Some of these points are positive examples (the citation context cites the paper) and the others are negative examples. This is a standard problem for an SVM classifier. Given a citation context, it is paired with every candidate paper to obtain multiple points. These points are classified, and the papers are sorted by the SVM's confidence that each point is a positive example. In this way multiple UTD values can be combined with a single CRD.
The third problem is multiple authors. For a citation context and paper pair, there is one score for each author of the citation context. Papers are often written by several authors, so many citation contexts have multiple authors, and each author has a different UTD for the same paper. In the experiments, this article tried four methods: the maximum value, the minimum value, the mean, and decay by author rank. The mean was the most stable and performed well. Therefore, this article takes the mean over the authors as the final score of the citation context and paper pair.
The last problem is the imbalance between positive and negative examples when training the SVM. A citation context cites only one paper, and every other paper paired with that citation context forms a negative example. Negative examples therefore make up the overwhelming majority of the training data, which greatly degrades the performance of the SVM classifier. This article resolves the problem with the following steps:
1). Add all of the positive examples to the training set, and randomly select an equal number of negative examples to join the training set.
2). Train an SVM model on this training set.
3). Obtain a result from the model trained in 2.
4). Repeat steps 1 to 3 n times to get n results, and take their average as the final score. The result is stable once n reaches 5, and larger values of n have no further effect, so n is set to 5. This balances the positive and negative examples well and gives a more stable result.
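The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in this work: the function names are hypothetical, and a simple nearest-centroid scorer stands in for the SVM so that the example runs with the standard library only. In practice `train_fn` would wrap the SVM training described below.

```python
import random


def balanced_ensemble_scores(positives, negatives, test_points, train_fn,
                             n_rounds=5, seed=0):
    """Average the scores of n_rounds models trained on balanced samples."""
    rng = random.Random(seed)
    totals = [0.0] * len(test_points)
    for _ in range(n_rounds):
        # Step 1: all positives plus an equal number of random negatives.
        sampled = rng.sample(negatives, min(len(positives), len(negatives)))
        training = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
        # Step 2: train a model (train_fn returns a scoring function).
        score = train_fn(training)
        # Step 3: score every test point with this model.
        for i, x in enumerate(test_points):
            totals[i] += score(x)
    # Step 4: the final score is the average over all rounds.
    return [t / n_rounds for t in totals]


def centroid_trainer(training):
    """Stand-in for the SVM (an assumption): higher score = closer to positives."""
    def mean(points):
        return [sum(coords) / len(points) for coords in zip(*points)]

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    pos = mean([x for x, label in training if label == 1])
    neg = mean([x for x, label in training if label == 0])
    return lambda x: dist(x, neg) - dist(x, pos)
```

Passing the trainer as a parameter keeps the sampling and averaging logic independent of the classifier that is actually used.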
Once the solutions to the above problems are clear, the model can be trained directly with an open-source SVM tool. This work uses LIBSVM [23], a fast and effective SVM software package with a good interface and good efficiency. Model training and testing with LIBSVM proceed as follows:
1). After selecting the positive and negative examples, combine the UTD and CRD of every dimension using formula (3-23), and write the resulting nine values as one row of a file. The class label (cited or not cited) is written at the beginning of the row.
2). Scale the data so that its values lie within $number.
3). Select the C-support vector classifier and train it with the radial basis function kernel to obtain the model file. The model file can then be used in the actual running system.
4). Construct a test file in the same way as in 1, classify it with the model obtained in 3, and output the classification results and confidence.
5). Sort the papers according to the classification results of 4 and evaluate them against the actual citations.
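Step 1 writes each citation context and paper pair as one row in the LIBSVM sparse text format, `<label> <index>:<value> ...`, with feature indices starting at 1. The helper below is a hypothetical sketch of that formatting step only. The function name and the choice of +1 and -1 as class labels are assumptions, and scaling, training, and prediction are left to the LIBSVM tools themselves.

```python
def to_libsvm_line(label, features):
    """Format one example as a LIBSVM input row.

    label    -- +1 for "cited", -1 for "not cited" (label choice is an assumption)
    features -- the nine combined scores of one citation context/paper pair
    """
    # Feature indices are 1-based and written in ascending order.
    parts = [f"{index}:{value:g}" for index, value in enumerate(features, start=1)]
    return " ".join([f"{label:+d}"] + parts)
```

The `g` format keeps six significant digits, which is an arbitrary choice for this sketch. A different precision can be used if the scores need it.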
This chapter has introduced the theory and rationale behind the system and the experiments. The following chapters use experiments to verify the validity of the model. The next chapter focuses on how to build a suitable dataset for the experiments, then introduces the experimental framework, describes the parameters introduced above, and analyzes the features used in the model.

The first 4 chapter experimental design

4 Experiment Design

4.1 Data Set

Since no researcher has previously tried to personalize the citation recommendation problem, the existing datasets cannot meet the requirements of this article, and a new dataset must be built. This section first describes the data requirements and then the data acquisition method that follows from them. After acquisition, the data is preprocessed to facilitate the later experiments. These steps are described in turn below.

4.1.1 Data Requirements

Unlike many research problems, this problem does not require labeled data. Many recommendation tasks require users to label their degree of preference in order to obtain training and test data and to measure recommendation accuracy. For citation recommendation, the data can be divided into two parts using time as a natural partition. The citation relationships in each part directly reflect the users' citation behavior: those before the split time can be used as training data, and those after it as test data. There are no labeling requirements, but citation recommendation has several requirements of its own:
1. An easy-to-process content format. Papers are usually available in PDF format, which is inconvenient for direct text processing, so the contents of the PDF must be converted to text.
2. Clear and unambiguous citation targets. Because each conference uses a different template, paper formats differ, and neither the citation list nor the papers it points to can be obtained by a uniform method. However, accurate citation targets have an important effect on training and testing in citation recommendation.
3. A reasonably accurate citation context. Most papers cite a reference with a marker such as [1, 3], which is called a citation placeholder. However, the form of placeholders is inconsistent, as in the citation formats of the article [text] shown in Figure 4.1, so a variety of citation placeholders must be handled. In addition, the position of the citation context relative to the citation placeholder varies widely: some authors place the placeholder after the cited content, others in the middle or before it. The length of the citation context also varies, so extra handling is needed to extract a more accurate citation context.

Figure 4.1 Schematic of different citation placeholders

4. A relatively dense set of mutually citing papers. The total number of papers is very large, and an author is likely to cite papers on a variety of topics, so the range of cited papers is wide. The complete set of papers cannot be obtained, and if the subset is chosen poorly, the mutual citation rate within it will be too low and the citation graph too sparse to achieve good experimental results. The data acquisition strategy of this article therefore needs to obtain a subset of papers with a high mutual citation rate.
5. Accurate extraction of meta information. Personalized recommendation needs more than the content of the paper. Accurate meta information is even more important, including the paper's author list, the conference in which it was published, and so on.
6. Reasonably accurate identification of individuals. For personalized citation recommendation, the user's preferences and interests must be learned from the user's publication history, so accurate identification of individuals is very important. Because different people may share a name and one person may publish under several names, a user cannot be identified directly by name. Different individuals need to be told apart.

4.1.2 Data acquisition procedure

If a dataset is built by randomly crawling existing resources on the Web, the 6 data requirements above are almost impossible to meet. Each of the 6 requirements is a sub-problem that could be studied in depth and would require a great deal of effort. However, a good deal of previous work can help this article meet these requirements. Combined with some additional work, a dataset that satisfies all 6 requirements can be constructed.
This article relies on three tools: the Microsoft Academic Search (MAS) Web site, the MAS API¹, and PDFMiner.
The two most influential academic search tools today are Google Scholar and MAS. Both sites index a large number of papers and are easy to use. In addition to paper retrieval, MAS offers more functionality and exposes part of it through an API. After applying to Microsoft for API access, one can use the MAS API to develop applications and request data from the MAS server. The services provided by the MAS API cover the following three areas:
1). Send a text query to the MAS API to get related objects, for example a list of papers matching a query.
2). Get the details of a related object, such as the authors and conference of a paper.
3). Explore relationships between different objects, such as which papers belong to a conference.

1 http://academic.research.microsoft.com/
MAS provides data access through JSON and SOAP interfaces. This article uses JSON to request and retrieve data, which is then parsed to construct the dataset. For example, given a conference ID, a JSON request returns the list of papers of that conference. The results returned by the MAS API are fairly complex, and the official documentation neither explains them in detail nor describes the problems in the API, so Section 4.1.3 parses and explains the returned JSON results.

Figure 4.2 MAS API

MAS attaches rich metadata to each paper, such as the paper itself, its authors, its conference, and its field. Each metadata item is an entity identified by a unique ID. After a collection of papers has been crawled, papers with the same author ID can be aggregated to obtain each author's publication list, from which the author's citation preference can be analyzed. This satisfies requirements 5 and 6 above.
In addition to the MAS API, which meets requirements 5 and 6, the MAS Web site provides a very important feature called "citation context". When searching for a paper, one can also see the text surrounding the citation placeholder in the papers that cite it. The result of searching for the paper [{] is shown in Figure 4.3. The boxed areas show the text other papers use when citing this paper, and the site identifies both the citation placeholder and the citation context well.


Figure 4.3 Schematic of the MAS site citation context feature

By inspecting the HTML source of the page above, the citation relationship can be pinned down by locating the ID of the citing paper. The boxed area in Figure 4.4 shows the ID of the paper that corresponds to the first citation context in Figure 4.3. With this feature, the data can satisfy requirements 2 and 3.


Figure 4.4 MAS Web page source schematic

The PDF of each paper can be downloaded from the URLs provided by the MAS API. The third tool, PDFMiner², is a Python tool with powerful PDF text extraction capabilities. It accurately extracts the text of each paper from the PDF and writes it to a separate text file, which satisfies requirement 1.
To meet requirement 4, a collection of papers with a high mutual citation rate is needed. Papers in similar fields cite one another more often, and authors tend to cite more influential papers. Therefore, this article selects the 10 most important conferences related to the data mining field as seed conferences for collecting papers. The process is as follows:
1). Select the seed conferences: ACL, CIKM, EMNLP, ICDE, ICDM, KDD, SIGIR, VLDB, WSDM, and WWW.
2). Obtain from the MAS API the metadata of all papers published at these ten conferences from $number to $time. The metadata includes the paper's MAS ID, title, abstract, publication time, conference, authors, the ID list of the papers it references, and the URL of the paper. After obtaining the metadata, this article uses the MAS Web site to get the citation contexts that cite these papers.

2 http://www.unixuser.org/~euske/python/pdfminer

3). Obtain the metadata of all papers referenced by the papers obtained in 2.
4). Download the papers in PDF format from all available paper URLs.
5). Obtain all citation relationships and citation contexts in the dataset from the MAS Web site. Since a paper may cite another paper in several places, the citation relationship between two papers can occur several times.
After collecting this data, this article selects the authors with relatively complete personal information, takes each author's most recent paper, randomly picks one citation context from it, and puts it in the test set. The papers cited by the citation contexts in the test set constitute the candidate set for recommendation. All remaining data is used for feature extraction and model training. The resulting dataset meets the 6 requirements above. The following table shows the amount of data obtained at each step:
Table 4.1 Data volume statistics

Number  Dataset generation step                              Amount of data
a       Paper metadata obtained from the seed conferences    9492
b       a plus all papers referenced in a                    55823
c       The part of b whose original PDF can be downloaded   20171
d       The part of c that comes from a                      4537
e       Reference relationships from d                       73236
f       Test set authors and citation contexts               1000

4.1.3 MAS API

When a query is submitted to the MAS API, the result is returned in JSON format. JSON is a lightweight data interchange format based on a subset of the JavaScript language. It represents a result as a mapping of names to values: data items are separated by commas, curly braces hold objects, and square brackets hold arrays. A request to the MAS API returns a deeply nested, fairly complex result, as follows:
{"d": {
    "__type": "Response:http:\/\/research.microsoft.com",
    "Author": null,
    "Conference": null,
    "Domain": null,
    "Journal": null,
    "Keyword": null,
    "Organization": null,
    "Publication": {
        "__type": "PublicationResponse:http:\/\/research.microsoft.com",
        "EndIdx": 0,
        "StartIdx": 1,
        "TotalItem": 0,
        "Result": []
    },
    "ResultCode": 0,
    "Trend": null,
    "Version": "1.1"
}}

The results returned by the MAS API can be understood as a tree in which each node is a key and value pair, where the value may be a single value or a list. For any query, MAS returns the same structure at the top level, with "d" as the root of the result. The following table lists the top-level fields and their meanings.

Table 4.2 MAS API top-level result fields (under the root d)

Field         Explanation                          Field          Explanation
Version       Returned version number              Journal        Journal results
ResultCode    Result code; 0 represents success    Organization   Organization results
Publication   Publication results                  Domain         Domain (field) results
Author        Author results                       Keyword        Keyword results
Conference    Conference results                   __type         Return type

Every response has the same structure, but a query usually asks for only one type of result. For example, when only the publication list is queried, the other fields, such as the Author field, are null.
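As an illustration of this structure, the following Python sketch parses the sample response shown above with the standard `json` module and reads one top-level field. The helper name `extract_results` and its error handling are assumptions made for the example, and only fields that appear in the sample are used.

```python
import json

# The sample response from this section, copied verbatim.
SAMPLE = (
    r'{"d":{"__type":"Response:http:\/\/research.microsoft.com","Author":null,'
    r'"Conference":null,"Domain":null,"Journal":null,"Keyword":null,'
    r'"Organization":null,"Publication":{"__type":'
    r'"PublicationResponse:http:\/\/research.microsoft.com","EndIdx":0,'
    r'"StartIdx":1,"TotalItem":0,"Result":[]},"ResultCode":0,"Trend":null,'
    r'"Version":"1.1"}}'
)


def extract_results(raw, field="Publication"):
    """Return (TotalItem, Result list) for one top-level field of a response."""
    root = json.loads(raw)["d"]          # "d" is the root of every response
    if root.get("ResultCode") != 0:      # 0 represents success
        raise ValueError(f"MAS API returned ResultCode {root.get('ResultCode')}")
    section = root.get(field)
    if section is None:                  # fields that were not queried are null
        return 0, []
    return section.get("TotalItem", 0), section.get("Result") or []
```

Treating a null field as an empty result keeps the calling code uniform regardless of which field was queried.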
The seven fields Publication, Author, Conference, Journal, Organization, Domain, and Keyword share the same result format, shown in the following table.

Table 4.3 Field result format (common to the 7 result-type fields)

Field       Explanation         Field    Explanation
TotalItem   Number of results   Result   Results list
StartIdx    Start index         __type   Return type
EndIdx      End index

The results themselves are returned as a list in the Result field, and result types may be nested. For example, each item in a returned list of papers has an Author field, which is itself a list. The fields of the different result types are collated in the following table:
Table 4.4 Format of the items in each field's result list

Category       Field                    Explanation
Publication    ID                       Publication unique ID
               CitationCount            Number of citations
               ReferenceCount           Number of references
               Type                     Type of publication
               Year                     Year of publication
               Title                    Title
               Abstract                 Abstract
               Author                   Author list
               Conference               Conference ID and name
               Journal                  Journal ID and name
               Keyword                  Keyword list
               DOI                      Digital Object Identifier
               FullVersionURL           List of downloadable addresses
               __type                   Return type
Author         ID                       Author unique ID
               PublicationCount         Number of publications
               CitationCount            Number of citations
               FirstName                First name
               MiddleName               Middle name
               LastName                 Last name
               NativeName               Non-English name
               HomepageURL              Homepage address
               Affiliation              Affiliation (work unit)
               DisplayPhotoURL          Photo address
               HIndex                   H-index
               GIndex                   G-index
               ResearchInterestDomain   List of the author's areas of interest
               __type                   Return type
Conference     ID                       Conference unique ID
               PublicationCount         Number of publications
               CitationCount            Number of citations
               FullName                 Full name
               ShortName                Abbreviated name
               ResearchInterestDomain   List of areas of study
               __type                   Return type
Journal        ID                       Journal unique ID
               PublicationCount         Number of publications
               CitationCount            Number of citations
               FullName                 Full name
               ShortName                Abbreviated name
               ResearchInterestDomain   List of areas of study
               ISSN                     International Standard Serial Number
               __type                   Return type
Organization   ID                       Organization unique ID
               PublicationCount         Number of publications
               AuthorCount              Number of authors
               CitationCount            Number of citations
               Name                     Name
               HomepageURL              Homepage address
               ResearchInterestDomain   List of areas of study
               __type                   Return type
Domain         DomainID                 Research field unique ID
               SubDomainID              Sub-domain ID
               PublicationCount         Number of publications
               CitationCount            Number of citations
               Name                     Name
               __type                   Return type
Keyword        ID                       Keyword unique ID
               PublicationCount         Number of publications
               CitationCount            Number of citations
               Name                     Name
               __type                   Return type

As the table shows, the fields contain rich information. However, because some fields are hard to populate, the completeness of the data returned by the MAS API varies, and many fields are returned as null and need further processing. In addition, every returned item has a __type field that gives its result type. Its value has the form "Response:http:\/\/research.microsoft.com", where the part before the first colon changes with the result type: Response denotes the root; PublicationResponse, AuthorResponse, and so on denote the returned section of each field; and Publication, Author, and so on denote each returned item. In this way, the results returned by the MAS API can be used to distinguish entities and obtain rich metadata.


4.1.4 Data acquisition tips

Different data sources pose different problems during data acquisition. The MAS API is unstable: a request may return an empty result or raise an error because of server instability. Because this program sends a large number of requests to the MAS API, server failures must be handled in a fault-tolerant way. When a result is empty or an error occurs, the request is retried after a random wait of 0.1 to 10 seconds. If it fails three times in a row, the request is placed in a failed-request queue. Requests in this queue are retried after a random wait of ten minutes to 5 hours, and a request is discarded if the retry fails three times. With this scheme, essentially all API requests return the required results.
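The retry scheme described above can be sketched in Python as follows. This is an illustrative sketch, not the crawler used in this work: the function names are hypothetical, `fetch` stands for whatever function actually calls the MAS API, and the `sleep` parameter exists only so that the waiting can be replaced in tests.

```python
import random
import time
from collections import deque


def fetch_with_retry(fetch, request, wait_range, attempts=3,
                     wait_first=False, sleep=time.sleep):
    """Try a request up to `attempts` times; return None if all attempts fail."""
    for attempt in range(attempts):
        if wait_first or attempt > 0:
            sleep(random.uniform(*wait_range))   # random wait before a retry
        try:
            result = fetch(request)
        except Exception:
            result = None                        # treat errors like empty results
        if result:
            return result
    return None


def crawl(fetch, requests, sleep=time.sleep):
    """Fetch all requests; failed ones get a second, much slower round."""
    results, failed = {}, deque()
    for request in requests:
        result = fetch_with_retry(fetch, request, (0.1, 10.0), sleep=sleep)
        if result is None:
            failed.append(request)               # failed three times in a row
        else:
            results[request] = result
    while failed:
        request = failed.popleft()
        # Ten minutes to five hours between retries of queued requests.
        result = fetch_with_retry(fetch, request, (600.0, 5 * 3600.0),
                                  wait_first=True, sleep=sleep)
        if result is not None:
            results[request] = result            # otherwise the request is dropped
    return results
```

In this sketch the failed-request queue is processed only after the main pass finishes, which is one possible reading of the description. A real crawler could also process the queue in a background thread.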
Data acquisition for this article also requires crawling the MAS Web site, where the crawler runs into the site's anti-crawler mechanism. Web sites typically have such a mechanism to prevent machines from accessing the site too often and taking up bandwidth, which would keep normal users from accessing it. Crawling the references and the citation contexts on the MAS Web site means frequently visiting the server responsible for citations, and when one IP accesses the server frequently, the server suspects malicious access and imposes a temporary or even permanent block. Because the amount of data required in this article is large, this problem must be taken into account when crawling. This article employs two strategies:
1. Wait a random time of ten seconds to ten minutes between requests. This greatly reduces the access frequency from a single IP, so the requests show no sign of excessive access and the server has no reason to block the IP of the crawling machine.
2. With method 1 the crawler can run for a long time, but it is too slow to finish the crawl within an acceptable timeframe. To increase speed, crawling from a single IP is not enough, and multiple IPs are required. Hundreds of machines could be used, but that would require too many resources, and gathering the data afterwards would be tedious. This article uses a lightweight method instead. Most requests to the MAS Web site go through a proxy, which has the same effect as accessing the site from a new machine, and only occasionally is the crawler's own IP used. Free proxies are available on the Internet. This article first collects free proxies, tests them to filter out the usable ones, and then rotates through the usable proxies. Assuming there are x proxies, the waiting time is only 1/x of the original, which greatly improves the crawl speed. During the crawl, this article used $number proxies, which increased the crawl speed to $number times that of method 1. This method satisfies the data acquisition needs of this article.
In addition, unexpected events during data acquisition, such as power outages, the program being killed by mistake, or bugs in the program itself, can terminate the program. Because crawling is a lengthy process, restarting the crawl from scratch after an unexpected termination would waste a lot of time and server resources. Data acquisition in this article therefore supports resuming from a breakpoint: each time the crawler starts, it must know the previous crawl progress and continue from there. The method used here is to write a log file while crawling. Each time the crawler starts, it first reads the log file to determine the crawl progress and then continues from that point.
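Resuming from a log file can be sketched in Python as follows. The log format (one finished paper ID per line) and the function names are assumptions made for this example, since the format of the actual log is not specified here.

```python
import os


def load_done(log_path):
    """Read the IDs that were already crawled in earlier runs."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as log:
        return {line.strip() for line in log if line.strip()}


def crawl_with_resume(paper_ids, fetch, log_path):
    """Crawl every ID once, skipping IDs recorded in the log file."""
    done = load_done(log_path)
    with open(log_path, "a", encoding="utf-8") as log:
        for paper_id in paper_ids:
            if paper_id in done:
                continue                 # finished in a previous run
            fetch(paper_id)              # may raise; the ID is then not logged
            log.write(paper_id + "\n")
            log.flush()                  # keep the log current if the crawler dies
            done.add(paper_id)
    return done
```

An ID is written to the log only after its fetch succeeds, so a crash in the middle of a request causes that request to be repeated on the next start rather than skipped.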
With these techniques, the required data can be obtained smoothly, so that further experiments can be carried out.

4.1.5 Data preprocessing

The collected paper data often contains a lot of noise because of different writing habits and the different formatting requirements of the papers. The text data must be preprocessed first to reduce noise and improve the effectiveness of the method. The preprocessing consists of four steps.
1. Convert all letters to lowercase. This is a common step in information retrieval and is done before any parsing. Letter case interferes with machine processing. The most common case is a capitalized first letter at the start of a sentence: the word means the same as elsewhere, and if it cannot be matched because of the capital letter, some information is lost. In addition, authors have different capitalization habits for the same word. For example, some authors write Google and others write google, with no difference in meaning. There are special cases where a change of case changes the meaning: she is usually a pronoun, while SHE may refer to a music group. But this situation rarely occurs in the dataset, so converting all letters to lowercase does more good than harm and is the first preprocessing step.
2. Keep only letters and digits and remove all other symbols. The meaningful content of a paper consists essentially of letters and digits, while punctuation and other characters carry no literal meaning, so they are removed to reduce noise. Some researchers also remove digits, but digits often carry important meaning in a paper, as in "four-color problem", which would lose information if the digits were removed.
3. Stem all words. English has many tenses and voices: verbs have third-person singular, present, past, and perfect forms, and nouns have singular and plural forms. These variants differ in morphology but are almost identical in meaning, so all words are reduced to their stems for better analysis. Another tool, NLTK³, is used here. NLTK is short for Natural Language Toolkit, an open-source Python library for natural language processing that can perform stemming and tokenization.
4. Append the citation contexts that cite each paper to the end of the paper, and add the MAS ID of the citing paper after each citation context.
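The first three preprocessing steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the stemmer is passed in as a parameter so that the example runs with the standard library only, and the default leaves words unchanged. With NLTK installed, `nltk.stem.PorterStemmer().stem` could be passed as `stem`; whether this work used the Porter stemmer is not stated and is an assumption here.

```python
import re


def preprocess(text, stem=lambda word: word):
    """Lowercase, keep only letters and digits, then stem every token."""
    lowered = text.lower()                         # step 1: lowercase
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)  # step 2: drop other symbols
    return [stem(token) for token in cleaned.split()]  # step 3: stem
```

Note that the pattern above also removes non-ASCII letters, which is acceptable for English papers but would need to be widened for other languages.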
After these four steps and some simple tidying, each paper is in an easy-to-process text form with the following format. Each part is delimited by a box: the first part is the paper title, the second part is the abstract, and the third part is the body, which is omitted here because it is too long. The fourth part contains the citation contexts that cite this paper, each with the MAS ID of the citing paper. Thus, from one paper file one can obtain its content as well as the IDs and citation contexts of the papers that cite it, and the citation graph of the dataset can be built by matching these IDs against the papers already in the library.

Figure 4.5 Schematic of a preprocessed paper

3 http://www.nltk.org/


4.2 Evaluation Method

Common metrics for evaluating retrieval results include precision, recall, NDCG, and Mean Average Precision (MAP). In the problem studied here, each citation context actually cites only one paper, so it is a retrieval problem with a single relevant item, which differs slightly from traditional retrieval. To present the experimental results more conveniently, this article uses slightly adapted versions of recall at the top k positions (recall@k) and MAP for evaluation.
For each citation context, each model produces a ranked list of papers. The
only correct answer is the paper that the citation context actually cites.
Top-k recall (recall@k) here means the percentage of citation contexts whose
single correct answer is returned within the top k results. The formula is as
follows:

$$\mathrm{recall@}k = \frac{\sum_{q \in Q} \mathbb{1}\{x_q \le k\}}{|Q|} \tag{4-1}$$

where $x_q$ is the position of the correct answer in the result list for query
$q$, $Q$ is the set of all test queries, and $|Q|$ is the total number of
queries. In practice, results far down the ranking are ignored by users and
have little reference value, so only small values of k are of interest.
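The other evaluation metric, mean average precision, is defined as

$$\mathrm{MAP}(q_1, q_2, \dots, q_n) = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{d} R_{q_i}(d)\, P_{q_i}(d)}{\sum_{d} R_{q_i}(d)}, \qquad P_{q_i}(d) = \frac{\sum_{d' :\, \mathrm{rank}_{q_i}(d') \le \mathrm{rank}_{q_i}(d)} R_{q_i}(d')}{\mathrm{rank}_{q_i}(d)} \tag{4-2}$$

Here $R_{q_i}(d)$ is a Boolean function that indicates whether citation context
$q_i$ cites paper $d$, and $\mathrm{rank}_{q_i}(d)$ is the position of $d$ in
the ranked list for $q_i$. Since only one paper is relevant to each citation
context, MAP degenerates to MRR (mean reciprocal rank): if the paper actually
cited is at position x in the recommended list for a citation context, the
score for that context is 1/x, and the average score over all citation contexts
in the test set is the evaluation metric used here. The formula is as follows:

$$\mathrm{MAP} = \frac{\sum_{q \in Q} \frac{1}{x_q}}{|Q|} \tag{4-3}$$

where $x_q$ is the position of the correct paper for query $q$ and $Q$ is the
set of test queries.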

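A simple example is shown in Figure 4.6.

Figure 4.6 Example of result evaluation

The papers with a solid background are the ones actually cited. In this case,

$$\mathrm{recall@}1 = \frac{1}{3} \approx 0.33, \qquad \mathrm{MAP} = \frac{\frac{1}{3} + \frac{1}{1} + \frac{1}{4}}{3} \approx 0.53$$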
In the experimental results of this thesis, both the recall@k and MAP metrics
improve.
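The two metrics are easy to compute once the rank of the single correct paper is known for each citation context. The sketch below is illustrative only; the function names are not from the thesis, and positions are assumed to be 1-based. It reproduces the worked example, where the correct papers sit at positions 3, 1, and 4.

```python
from typing import Sequence


def recall_at_k(positions: Sequence[int], k: int) -> float:
    """Fraction of queries whose single correct paper is ranked in the top k.

    `positions` holds the 1-based rank of the correct paper for each query.
    """
    return sum(1 for x in positions if x <= k) / len(positions)


def mean_reciprocal_rank(positions: Sequence[int]) -> float:
    """MAP with exactly one relevant paper per query, i.e. MRR."""
    return sum(1.0 / x for x in positions) / len(positions)


positions = [3, 1, 4]  # ranks of the correct paper in the three example lists
print(round(recall_at_k(positions, 1), 2))        # 0.33
print(round(mean_reciprocal_rank(positions), 2))  # 0.53
```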

4.3 Experimental Framework

With the data acquisition process and the evaluation method explained, this
section briefly describes the experimental framework of this study.
1. Use the MAS API to collect the metadata of the required papers and obtain
their PDF download links, and obtain the citation context information for the
relevant papers from the MAS website. Download the PDF of each paper from its
link and convert it to text. This step must take into account the anti-crawler
mechanisms of MAS and of the various PDF source websites, and must handle
server instability transparently.
2. Convert the text of the collected papers to lowercase, remove punctuation,
stem the words, and append the crawled citation contexts and their IDs to the
end of each paper.
3. Extract the user-preference features. Only the metadata of the user's
papers is available here, so it must be exploited fully and extracted in an
organized way; this is the focus of this thesis.
4. Following chronological order, select $number authors as test users, put
the citation contexts of each test user's latest paper into the test set, and
use the remaining data as the training set. Because the dataset covers only
part of a much larger body of papers, the test users must be ones whose
personal information in the dataset is relatively complete.
5. On the training data, fit a content-relevance model using some
content-based framework, and train an SVM model on the user-preference
features. Combining the outputs of the two models gives the unified model. The
content-relevance models chosen in this thesis are the language model and the
translation model, which also serve as baselines.
6. For every citation context in the test set from step 4, remove its citation
information, and use the model obtained in step 5 to score and rank all papers
in the candidate set of $number papers.
7. Evaluate the rankings with MAP and Recall@10.
The overall experimental framework is shown in Figure 4.7.

Figure 4.7 Experiment process (flowchart; surviving labels: MAS API, MAS, ID,
PCR, MAP, Recall@10)

Following these steps completes the evaluation of the PCR model. The next
section presents and analyzes the experimental results.
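The chronological train/test split in step 4 can be sketched as below, under one reading of that step: for each test user, the citation contexts of the most recent paper are held out, and everything else is training data. The record layout `Paper` and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (publication year, paper ID, citation contexts in it).
Paper = Tuple[int, str, List[str]]
Triple = Tuple[str, str, str]  # (user, paper ID, citation context)


def split_by_latest_paper(
    papers_by_user: Dict[str, List[Paper]], test_users: List[str]
) -> Tuple[List[Triple], List[Triple]]:
    """Return (train, test); test holds contexts of each test user's latest paper."""
    train: List[Triple] = []
    test: List[Triple] = []
    for user, papers in papers_by_user.items():
        ordered = sorted(papers, key=lambda p: p[0])  # oldest first
        for index, (_year, paper_id, contexts) in enumerate(ordered):
            is_latest = index == len(ordered) - 1
            bucket = test if (user in test_users and is_latest) else train
            bucket.extend((user, paper_id, c) for c in contexts)
    return train, test


papers_by_user = {
    "u1": [(2012, "a", ["c1"]), (2010, "b", ["c2", "c3"])],
    "u2": [(2011, "c", ["c4"])],
}
train, test = split_by_latest_paper(papers_by_user, ["u1"])
print(test)
```

Holding out only the latest paper keeps every training example earlier in time than the test contexts of the same user.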

4.4 Experimental Results

This thesis uses a random algorithm, the unigram language model, and the
translation model as comparison methods. The proposed model uses the unigram
language model and the translation model improved by translating on abstracts
as its CRD component. Evaluation follows the methods described above. Table 4.5
gives the final results of the experiment, and Figure 4.8 shows in detail how
recall changes with position.
Table 4.5 Comparison of the effects of different models

Model       RDM     LM      LM_PCR   TM      TM_PCR
MAP         0.007   0.299   0.509    0.504   0.644
Recall@10   0.000   0.376   0.634    0.594   0.782

Figure 4.8 Recall as a function of result position

In Figure 4.8 and Table 4.5, RDM denotes random recommendation, LM denotes the
unigram language model, LM_PCR denotes the PCR model with the unigram language
model as CRD, TM denotes the translation model improved by translating on
abstracts, and TM_PCR denotes the PCR model with that improved translation
model as CRD. The results above show:
1. TM_PCR performs best, and LM_PCR also outperforms the unigram language
   model, which does not include personalization information.
2. For both the language model and the translation model, performance improves
   under the PCR model, which introduces personalization information. The
   language model improves by 70.234% on MAP and by 68.617% on Recall@10. The
   translation model improves by 27.778% on MAP and by 31.650% on Recall@10.
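The relative improvements can be recomputed directly from the MAP and Recall@10 values in Table 4.5. The short check below is only a convenience sketch; the dictionary layout and the helper name `relative_gain` are illustrative.

```python
# Values copied from Table 4.5.
results = {
    "LM":     {"MAP": 0.299, "Recall@10": 0.376},
    "LM_PCR": {"MAP": 0.509, "Recall@10": 0.634},
    "TM":     {"MAP": 0.504, "Recall@10": 0.594},
    "TM_PCR": {"MAP": 0.644, "Recall@10": 0.782},
}


def relative_gain(base: float, improved: float) -> float:
    """Relative improvement of `improved` over `base`, in percent."""
    return (improved - base) / base * 100.0


for base in ("LM", "TM"):
    for metric in ("MAP", "Recall@10"):
        gain = relative_gain(results[base][metric], results[base + "_PCR"][metric])
        print(f"{base} -> {base}_PCR, {metric}: +{gain:.1f}%")
```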
The PCR model thus improves significantly on both the language model and the
translation model, because it exploits the user's preferences in addition to
content relevance. On the one hand, the PCR model preserves the results for
citation contexts that already do well with content information alone; on the
other hand, it improves the results for citation contexts that do poorly with
content information alone. The PCR model is therefore effective and can
significantly improve traditional citation recommendation models.

4.5 Parameter Tuning

The model in this thesis has one free parameter, the alpha introduced in
4.2.1. Its value cannot be computed directly because alpha depends on the
dataset, so it must be tuned by sweeping over candidate values and evaluating
the results. As stated previously, two factors determine the size of alpha: the
size of the dataset and the length of the citation contexts. The size of the
dataset has been described in 5.1. The citation contexts for this work are
obtained from the MAS website, and after stop words are removed, the average
citation context length is 4 words.

The effect of the model is shown by substituting different values of alpha.
Figure 4.9 shows how the performance of the PCR model changes as alpha varies
from 0.1 to 0.9. The best performance is obtained with alpha set to 0.8,
whether the PCR model is combined with the translation model or with the
language model.
When the PCR model is applied to a different dataset, alpha must be tuned
again, because the size of the data and the length of the citation contexts
change.
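Such a sweep amounts to an exhaustive search over a small grid of candidate values. In the sketch below, `evaluate` is a hypothetical callback standing in for "build the PCR model with this alpha and return its MAP on held-out data". The toy function used to exercise it has an arbitrary peak and is not the thesis's measurement.

```python
from typing import Callable, Sequence, Tuple


def sweep_alpha(
    evaluate: Callable[[float], float], alphas: Sequence[float]
) -> Tuple[float, float]:
    """Return (best_alpha, best_score) by trying every candidate value."""
    scored = [(evaluate(alpha), alpha) for alpha in alphas]
    best_score, best_alpha = max(scored)
    return best_alpha, best_score


# Candidate grid 0.1, 0.2, ..., 0.9, matching the range explored in the text.
alphas = [round(0.1 * i, 1) for i in range(1, 10)]


def toy_evaluate(alpha: float) -> float:
    # Placeholder curve with an arbitrary peak; NOT real experimental data.
    return 1.0 - (alpha - 0.5) ** 2


best_alpha, best_score = sweep_alpha(toy_evaluate, alphas)
print(best_alpha, best_score)
```

Because alpha depends on the dataset, the same sweep would be rerun whenever the data changes, ideally on a validation split kept apart from the final test set.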

Figure 4.9 Tuning of the parameter alpha

4.6 Feature Analysis

The model in this thesis has a total of 3x3=9 features. Are all of these
features effective, and which ones work best? This section studies the effect
of the 9 features experimentally.
Figure 4.10 shows the performance of the model after removing each feature in
turn. The item x_y in the figure gives the performance after removing feature
x_y. The last item, "none", means that no feature is removed, that is, the
final result of the model.
Figure 4.10 shows that removing any feature weakens the experimental results,
which indicates that every feature contributes to recommendation quality.
Removing features 1_1 to 1_3 affects the results the most, especially 1_1. This
shows that the user's own historical behavior has the greatest impact on their
future citation behavior. That behavior includes the papers the user has
written, the user's collaborators, the venues the user has published in, the
papers the user has cited, the authors the user has cited, and the venues the
user has cited. Among these, the user citing their own past papers or papers
they have cited before is the most prominent feature.
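The leave-one-out procedure behind this analysis can be sketched as follows. `evaluate` is a hypothetical callback that trains and scores the model with a given set of active features. The toy version below only counts features, so its numbers are placeholders rather than results.

```python
from typing import Callable, Dict, FrozenSet, List


def ablation(
    features: List[str], evaluate: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Score the model with each feature removed in turn, plus the full model."""
    full = frozenset(features)
    scores = {f"remove_{name}": evaluate(full - {name}) for name in features}
    scores["none"] = evaluate(full)  # nothing removed
    return scores


# The 3x3 feature grid, named 1_1 ... 3_3 as in the figure labels.
features = [f"{x}_{y}" for x in range(1, 4) for y in range(1, 4)]


def toy_evaluate(active: FrozenSet[str]) -> float:
    # Stand-in for "train and evaluate with these features"; not real results.
    return len(active) / 9.0


scores = ablation(features, toy_evaluate)
print(sorted(scores))
```

The feature whose removal causes the largest drop relative to "none" is the one the model depends on most.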

Figure 4.10 Analysis of each feature

5 Summary and Future Work

5.1 Summary

This thesis takes each user's individual characteristics into account when
recommending citations, with the aim of improving citation recommendation by
adding user preference to the decision. To this end, it proposes a personalized
citation recommendation model, the PCR model.
Different users have different tendencies to cite the same paper. The model
quantifies this tendency and builds each user's personalized information from
the papers they have published. This information is used to measure, along 9
dimensions, the probability that a user cites a given paper. The factors
include the user themselves, the user's collaborators, the venues the user has
published in, the papers the user has cited, and so on. These factors are
combined using an SVM and then merged with a unigram language model and with a
translation model improved by translating on abstracts. The model also uses a
shrinkage parameter to address the large discrepancy between the traditional
CRD model and the newly introduced UTD part. The imbalance between positive and
negative examples in SVM training is handled by randomly sampling an equal
number of negative examples, repeating this several times, and averaging.
Finally, once UTD is incorporated, both the traditional unigram language model
and the more recent translation model improved by translating on abstracts
perform significantly better. The method makes reasonable use of known
information, better captures the user's citation behavior, and makes citation
recommendation more accurate.
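The class-imbalance handling mentioned in the summary can be sketched as repeated balanced sampling. This is one plausible reading, not the thesis's code: each round draws as many random negatives as there are positives, a model is trained and scored on that balanced set, and the scores are averaged. `train_and_score` is a hypothetical callback (for example, one that trains an SVM and returns a validation score), and the sketch assumes there are at least as many negatives as positives.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def averaged_over_balanced_samples(
    positives: Sequence[T],
    negatives: Sequence[T],
    train_and_score: Callable[[List[T], List[T]], float],
    rounds: int = 5,
    seed: int = 0,
) -> float:
    """Average the score over `rounds` balanced resamplings of the negatives."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(rounds):
        # Assumption: len(negatives) >= len(positives).
        sampled = rng.sample(list(negatives), k=len(positives))
        scores.append(train_and_score(list(positives), sampled))
    return sum(scores) / len(scores)


def toy_train_and_score(pos: List[int], neg: List[int]) -> float:
    # Placeholder: reports the negative/positive ratio instead of training.
    return len(neg) / len(pos)


print(averaged_over_balanced_samples([1, 2, 3], list(range(100)), toy_train_and_score))
```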

5.2 Future work

In future work, we hope to make further use of the information already
available to improve the accuracy of citation recommendation. For example,
papers could be separated by field, the characteristics of papers in different
fields studied, and general rules derived. The authority and popularity of a
paper could also serve as important factors, and other channels could be used
to obtain the latest research hotspots in order to infer what the user is
inclined to cite. Besides publication history, the user's interests could be
judged from other data, such as browsing and reading data. Once the system is
in use, the user's preferences could be modeled further by collecting the
user's behavior data, making the recommendations more accurate.
Beyond additional information, new models could be tried to improve accuracy,
such as combining collaborative filtering with CRD for citation
recommendation, or using link prediction techniques on graph models to predict
citation relationships that may appear in the future.
Furthermore, besides recommending which papers to cite, the system could make
personalized decisions about where citations belong, so that after the user
submits a manuscript, the system not only recommends the papers needed but also
places citation placeholders at the appropriate positions. Analyzing the
different citation habits of different users could improve the accuracy of
citation position prediction.
As this work deepens, I believe the effectiveness of citation recommendation
can be improved further, making citation recommendation systems more
convenient, accurate, and fast for users.

Research achievements during the Master's degree

Published papers

[1] Yaning Liu, Rui Yan, Hongfei Yan. Guess What You Will Cite: Personalized
Citation Recommendation Based on Users Preference [M]//Information
Retrieval Technology. Springer Berlin Heidelberg, 2013: 428-439.

[2] , , , . 64 [J].
, 2014, 40(2): 71-76.

[3] Liu Yaning, Ching, Hongfei. Personalized citation recommendation based on
user preference and language model. CCIR2013, $number (recommended to the
Chinese Journal of Information)

Optimization system

[1] Web infomall: http://www.infomall.cn/

[2] Paradise: http://e.pku.edu.cn/

Acknowledgements


Thank you, Teacher Hongfei. You led me steadily forward in an unfamiliar
research field; your guidance was like a light in the night that kept
showing me the way ahead. Your profound professional knowledge, clear
approach to scholarship, and meticulous, precise guidance let me gain not
only knowledge but also the ability to explore. Your elegant demeanor,
humble attitude, and sincere concern for students let me feel the warmth
of the laboratory family and taught me a great deal about scholarship and
about how to conduct myself.
Thank you, Teacher Li. Your strong and wide-ranging interest in science,
your deep knowledge, and your scientific vision have led the Skynet group
ever farther and higher on the road of research. Every discussion with
you helped me find the key to the problem. Your rigorous scholarship and
approachable manner showed me what a true master is like.
Thank you, Teacher Bo. Your course gave me a solid foundation of
knowledge and prepared me to improve my research ability further. Thank
you, Teacher Xie Zhengmao. You provided a stable cluster for the lab, and
during our cooperation your deep understanding of the project and your
ingenious solutions to problems left a deep impression on me.
Thank you, Brother Shing, for your great help throughout this work. Our
discussions not only gave me a clear research idea but also clarified the
detailed steps of the experiments. After the draft was written, your
revisions added the finishing touches and made the thesis more accurate
and fluent. Thanks to the single building: working on the project with
you gave me a more comprehensive understanding of search engine
principles and frameworks, and your strong coding ability and focused
attitude to work are admirable. Thank you, Brother Xin. Your solid
research foundation and serious attitude toward research influenced me
deeply, and whenever my research ran into difficulties, your analysis
helped me find a solution. Thank you, Brother Wang Jinpeng. You
understand computers deeply, and whenever a tedious problem came up, you
could always find the answer.


Thank you, Brothers Maucian, the tree Berhan, Liu Xiaobing, Chen Zhi
flash, Lu Ying, and Xiangwenqing. Although you have graduated, you
created the excellent tradition that the Skynet group carries on. Thank
you, Zhang and Yin Yu; you are my classmates in the same year and also my
role models. Thanks to Yan, Wu Yuexin, Chiangkhan, Chen Wei, Huangda, and
Liu Yan. Thanks to Skynet, thanks for the network. Thank you, Wang Yu,
Kesong Yu, and Zhang Yi, for sharing three happy years of dormitory life
with me.
Finally, there are four people to thank: Dr. Liu Yurei, my parents, and
myself. Before writing this thesis, I underwent anterior cruciate
ligament reconstruction and meniscus repair surgery, which requires
grueling rehabilitation training afterward. Because my recovery has not
gone smoothly, the rehabilitation training has now lasted nearly five
months. During rehabilitation I endured physical pain I had never
experienced before, and anxiety and fear I had never known. Throughout
this period, my surgeon, Dr. Liu Yurei, patiently guided my recovery, my
parents took care of my daily life, and I struggled on tenaciously,
writing this thesis while keeping up the rehabilitation training. As the
thesis gradually took shape, my legs slowly recovered. Perhaps only the
doctor, my parents, and I know clearly how many tears and how much sweat
went into this thesis. Thank you to my doctor, to my parents, and to
myself.
