
Understanding Reward-Based Learning

Krista L. Roze

Boston College
UNDERSTANDING REWARD-BASED LEARNING 1

Introduction

Learning is a cognitive process crucial to humans, from learning in the past what was safe to eat and where food could be found, to learning in schools in the present day. Almost always, this learning is the result of some incentive: we continue performing an action due

to the promise of some reward. We alter our expectations of events with experience, tailoring our actions

to seek reward and avoid punishment (Schultz, Dayan & Montague, 1997). Reward-based learning

perhaps began with the evolutionary drive to seek out food. Researchers over the past 20 or so years

have used this adaptive behavior in rats to determine the neural substrates underlying reward-based

learning. Evidence points to the orbitofrontal cortex (OFC) as a key region utilized in learning through

incentive. When the OFC was lesioned in a group of rats, for example, the lesioned rats took longer than non-lesioned controls to learn a reward reversal, in which they had to respond to a stimulus that was previously unreinforced (Izquierdo, 2017). The rat OFC's role in reward-based learning has also been linked to its projections to the striatum, a known reward region of the brain (Gourley et al., 2013).

To expand the literature for human subjects, researchers have started testing reward-based

learning using rewards other than primary incentives like food. Researchers have found that using an

abstract reward like money as an incentive activates regions of the brain that significantly overlap with

those associated with primary incentives. Elliott et al. found that in addition to activity in the OFC and

dopaminergic midbrain, the amygdala and ventral striatum also showed increased activity in the presence

of rewards (Elliott, Newman, Longe, & Deakin, 2003). The amygdala and OFC in particular have been

shown to respond adaptively to stimuli that predict rewards. When a stimulus previously learned to lead to a reward was changed to no longer do so, the responses elicited by the amygdala and the OFC decreased, demonstrating the flexibility of the brain in learning and re-learning which stimuli lead to rewards (Gottfried, O’Doherty, & Dolan, 2003). These results have been echoed in the caudate nucleus as

well. When task difficulty was altered by controlling the probability of reward, activity in the caudate

nucleus still “paralleled the magnitude of a subject’s behavioral change during learning” (Haruno et al.,

2004). This reiterates the fact that the human brain is adaptive in learning based on reward incentives.

Our experiment stemmed from a study of risk-taking behavior, which also involves

predicting future reward or punishment. The target article aimed to define the neural basis of risk-taking

by having participants complete the Balloon Analog Risk Task (BART) while being scanned in an fMRI

machine (Schonberg et al., 2012). The escalating risk and cumulative rewards earned in the BART have

been shown to produce a realistic model for naturalistic risk-taking behavior. The researchers considered

two models of risk-taking behavior specific to the BART: either participants would view each consecutive

balloon pump as an accumulated reward relative to the beginning of the trial, or they would view each

balloon pump as an increased chance at a loss, updating the risk of each balloon pump as they went

through the task. They hypothesized that if the accumulating reward model was used by participants,

there would be increasing ventro-medial prefrontal cortex (vmPFC) activity with each pump. If, however,

the participants used the possible loss model to gauge rewards, the researchers hypothesized decreasing

vmPFC activity with each pump. Upon conclusion of the experiment, the researchers found that vmPFC

activity decreased with each successive pump, supporting the model that participants viewed each pump

as a higher risk of loss rather than as a possible accumulation of reward.

Due to the limitations of PsychoPy’s programming capabilities, we adapted the target article’s experiment to explore reward-based learning. We performed the following

experiment to see if participants would respond with more correct answers over trials in a task, as they

learned which answers yielded a reward. We wanted to explore how well a response could be learned,

based on the probability, tied to each stimulus, that one of two answers was correct. We hypothesized that

over trials, participants would have more correct responses due to learning the probabilities (especially

probabilities further from chance) for the correct response for each stimulus.

Method

Participants

A total of 14 participants were chosen as a convenience sample from the Cognitive Neuroscience

Research Practicum class; this included 13 Boston College students and one Boston College professor.

All participants were exposed to all stimuli. Participants were randomly assigned to one of three different

versions of the experiment as a form of counterbalancing.

Experimental Task

The experimental task was a modification of the original Balloon Analog Risk Task (BART) used

in the target article. Due to the coding limitations of PsychoPy, the task was changed to measure reward

learning instead of risk-taking behaviors. Participants were first presented with instructions for the task. A

screen with the word “READY?” was presented to focus the participant's attention before the stimuli

appeared. The participant was then shown either a red, white, or blue balloon. They were prompted to

choose either the left or right arrow key. One key would “inflate” the balloon, and the participant was given

a point, which showed up on the screen as a green “1” after they pressed the key. The other key would

“pop” the balloon, and the participant was not given any points, which showed up on the screen as a red

“0” after they pressed the key. The experiment then immediately moved on to the next balloon. Each balloon color was tied to a unique probability that the left versus the right key would inflate the balloon. For example, in one version of the experiment, the right key inflated the red balloon on 80% of trials, and the left key on the other 20%. In the same version, the right key inflated the blue balloon on 40% of trials, and the left key on the other 60%. Lastly, as a control, the left and right keys each inflated the white balloon on 50% of trials. The order in which the different colored balloons appeared was randomized. Each balloon color was presented 60 times, for a total of 180 trials. The dependent variable was the number of correct key responses for each balloon color.
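The trial logic described above can be sketched in Python. This is an illustrative reconstruction, not our actual PsychoPy script; the probability mapping shown is the example version described in the text, and all names are hypothetical.

```python
import random

# Hypothetical mapping from one version of the experiment:
# probability that the RIGHT arrow key inflates each colored balloon.
P_RIGHT_INFLATES = {"red": 0.80, "blue": 0.40, "white": 0.50}

def run_trial(color, key_pressed, rng=random):
    """Return 1 (balloon inflated, one point) or 0 (balloon popped, no points)."""
    # On each trial the rewarded key is drawn according to the balloon's
    # probability, so the same key press can inflate the balloon on one
    # trial and pop it on the next.
    rewarded_key = "right" if rng.random() < P_RIGHT_INFLATES[color] else "left"
    return 1 if key_pressed == rewarded_key else 0

# Build the 180-trial sequence: 60 balloons of each color, in random order.
trials = [color for color in P_RIGHT_INFLATES for _ in range(60)]
random.shuffle(trials)
```

Because the rewarded key is re-drawn on every trial, a participant pressing the "better" key for the red balloon still pops it on roughly 20% of red trials, which is what makes the probabilities learnable only gradually.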

Design

In order to more specifically focus on the effect of the probabilities of the balloons on reward

learning rather than the effect of color, three versions of the same experiment were created. Throughout

the versions, no balloon color was paired with the same probability twice (the set of probabilities itself remained the same across versions). This counterbalancing allowed for color to

be controlled to ensure that only an effect of probability would be tested. Participants were randomly

assigned to complete one of the three versions of the experiment. As all participants were exposed to all

stimuli, this was a within-subjects design.
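The counterbalancing scheme amounts to a Latin square: across the three versions, each color is paired with each probability exactly once while the probabilities themselves stay fixed. A minimal sketch (variable names are illustrative, not from our materials):

```python
# Three rotations of the color list pair each color with each
# probability exactly once across versions, while the set of
# probabilities stays fixed.
COLORS = ["red", "white", "blue"]
PROBS = [0.80, 0.50, 0.40]  # P(right key inflates the balloon)

versions = [dict(zip(COLORS[i:] + COLORS[:i], PROBS)) for i in range(3)]
```

With this construction, for example, the red balloon is the 80/20 balloon in one version, the 50/50 balloon in another, and the 40/60 balloon in the third.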

The target article’s original BART was used to observe neural activity during risk-taking

behaviors. Participants would be prompted to inflate a balloon and would earn a certain amount of money

with each successful inflation. At any point during the trial, there was an opportunity to stop inflating the

balloon and cash out the rewards. There was also a chance that the balloon would pop with too many

inflations, and the rewards accumulated during that trial would be lost. This task was modified due to the

constraints of PsychoPy, as accumulating rewards and having multiple inflations per balloon could not be

properly programmed. Instead, each balloon was shown only once, and there were two outcomes: either

the balloon was inflated, or it popped. A successful inflation was rewarded one point, while a popped

balloon did not receive any points and did not take any points away. This simplified version of the BART

was used to test learning over time, according to each balloon's probability of the correct answer being

the left or right key. Therefore, instead of looking at when participants cashed out, the number of correct responses over time was observed in the form of participant hit rates.

To quantify the reward learning over time, the average participant hit rate (number of correct key

responses out of the total number of trials) was split into “early” and “late” trials. The average hit rate for

the first 90 trials was compared to the average hit rate for the last 90 trials using a paired t-test. A paired

t-test was used because the early and late trials were compared within participants.
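The hit-rate computation and paired t-test can be sketched as follows. This uses only the Python standard library rather than our actual analysis software, and the participant values shown are illustrative, not the study data.

```python
import math
import statistics

def hit_rate(responses):
    """Proportion of correct key presses (1 = inflated, 0 = popped)."""
    return sum(responses) / len(responses)

def paired_t(early, late):
    """Paired t statistic for per-participant early vs. late hit rates."""
    # A paired test operates on within-participant differences.
    diffs = [e - l for e, l in zip(early, late)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative values for three hypothetical participants:
# early-trial and late-trial hit rates for one balloon probability.
early = [0.52, 0.58, 0.55]
late = [0.60, 0.63, 0.57]
t = paired_t(early, late)  # negative t: late hit rates exceed early ones
```

The degrees of freedom for such a test are the number of participants minus one, which is why the statistics reported below are on 13 degrees of freedom for 14 participants.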

Results

No data was excluded from analysis. To operationalize reward learning over time, the hit rates

were calculated for each participant during “early” and “late” trials. Out of 180 trials, the first 90 trials were

labelled as “early” and the last 90 trials were labelled as “late”. Hit rates for early and late trials were

calculated separately for each participant, by taking the number of correct key responses over the total

number of key responses. This resulted in two hit rates per participant: one for early trials and one for late trials. These early and late hit rates were then compared in the analysis.

The research question asked whether participants would learn the probabilities of the correct key for each balloon, as evidenced by the number of correct responses they gave over trials. We

hypothesized that, over trials, participants would have more correct key responses as a result of learning

the different probabilities for the correct key for each balloon.

The data we obtained did not support our hypothesis, and so we cannot conclude from this

experiment that participants were able to learn the probabilities for correct key responses over trials. The

data was separated by trials into the three different probabilities so that early and late trials could be

compared within the trials that had the same probability. Once separated into these three groups, the

averages and standard deviations of the early trial hit rates and the late trial hit rates were calculated.

Further, a paired t-test was used to compare the hit rates of the early trials and the late trials.

Balloons with an 80/20 probability (the right key being correct 80% of the time) had an average hit

rate of 0.586 for early trials, and an average hit rate of 0.621 for late trials. Balloons with a 40/60

probability (the right key being correct 40% of the time) had an average hit rate of 0.483 for early trials,

and an average hit rate of 0.517 for late trials. Balloons with a 50/50 probability (each key being correct

50% of the time) had an average hit rate of 0.476 for early trials, and an average hit rate of 0.490 for late

trials.

Table 1
Means and Standard Deviations of Hit Rates, Split by Early and Late Trials

Condition       Mean    Standard Deviation
80/20; Early    0.586   0.083
80/20; Late     0.621   0.567
40/60; Early    0.483   0.533
40/60; Late     0.517   0.433
50/50; Early    0.476   0.067
50/50; Late     0.490   0.050



Figure 1. Average hit rates of early versus late trials, split by balloon probability.

There was no significant change in number of correct responses over time for the 80/20 probability balloons, t(13) = -1.37, p = 0.193.



Figure 2. Variability of hit rates in early versus late trials for 80/20 probability balloons. Bars denote median and quartiles for each plot.

There was no significant change in number of correct responses over time for the 40/60 probability balloons, t(13) = -0.95, p = 0.358.



Figure 3. Variability of hit rates in early versus late trials for 40/60 probability balloons. Bars denote median and quartiles for each plot.

There was no significant change in number of correct responses over time for the 50/50 probability balloons, t(13) = -0.57, p = 0.576.



Figure 4. Variability of hit rates in early versus late trials for 50/50 probability balloons. Bars denote median and quartiles for each plot.

For all probabilities, the p-value was greater than 0.05, so none of the comparisons can be deemed significant. Therefore, our hypothesis that participants would give more correct key responses over trials was not supported. Although not significant, it should be noted that with a greater discrepancy between the probabilities of each key being correct (i.e., 80/20), the p-value was lower than that of the control probability (50/50), thus coming closer to significance.

Discussion

The results observed in this experiment did not support our hypothesis. While not quite reaching

significance, the results do present an interesting distinction between the hit rates for each probability.

When analyzed using a paired t-test, the p-value for the more predictable 80/20 balloons (i.e., those whose correct key was more likely to be learned) turned out to be lower than the p-value of the control 50/50 balloons.
Focusing on reward-based learning, overall, our results did not echo the findings of previous

literature. However, the relatively better learning for the higher probability balloons is a similar finding to

literature stating that the brain’s reward-based learning networks are flexible when presented with

different predictive stimuli and difficulty of the task (Gottfried et al., 2003; Haruno et al., 2004).

Participants simultaneously responded to balloons of different probabilities, learning the more easily

predicted outcomes (the 80/20 balloons) more distinctly than the other balloons.

If we were to run this experiment using an fMRI machine, I would expect there to be activity

particularly centered around the OFC and the caudate nucleus. The OFC has continued to show activity

throughout studies concerning reward-based learning (Gottfried et al., 2003; Gourley et al., 2013;

Izquierdo, 2017). Although our participants may not have overtly known they were learning something, I

predict that activity in the OFC would spike nonetheless. In addition, as the discrepancy between key probabilities increases (e.g., 80/20 versus 50/50), activity in the caudate nucleus would be expected to increase as well, due to the change in

behavior participants exhibited to gain more reward. The magnitude of the change in behavior, as shown

in previous literature, is correlated with the amount of activity in this area (Haruno et al., 2004).

A strength of our study was the use of randomization and counterbalancing. Randomizing the

order in which the balloons were presented and creating three versions of the experiment with the

probabilities tied to different colored balloons diminished any effect the balloons' colors may have had on the results, leaving probability as the variable of interest.

A limitation of our study was the absence of realistic rewards in order to incentivise learning.

Participants were given a point for each correct response given, but they did not receive any tangible

reward for each correct response as in the target article. We had planned to distribute candy as a reward

for each participant, but in reality we had no way of efficiently looking at the number of correct responses

for each participant and giving the correct number of candies in the time allotted.

In conclusion, our simplified BART to measure reward-based learning yielded no significant

results of participants learning correct key responses over trials.


References

Elliott, R., Newman, J. L., Longe, O. A., & Deakin, J. F. W. (2003). Differential Response Patterns in the Striatum and Orbitofrontal Cortex to Financial Reward in Humans: A Parametric Functional Magnetic Resonance Imaging Study. Journal of Neuroscience, 23(1), 303-307. https://doi.org/10.1523/JNEUROSCI.23-01-00303.2003

Gottfried, J. A., O’Doherty, J., & Dolan, R. J. (2003). Encoding Predictive Reward Value in Human Amygdala and Orbitofrontal Cortex. Science, 301(5636), 1104-1107. https://doi.org/10.1126/science.1087919

Gourley, S. L., Olevska, A., Zimmermann, K. S., Ressler, K. J., DiLeone, R. J., & Taylor, J. R. (2013). The orbitofrontal cortex regulates outcome-based decision-making via the lateral striatum. European Journal of Neuroscience, 38(3), 2382-2388. https://doi.org/10.1111/ejn.12239

Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., … & Kawato, M. (2004). A Neural Correlate of Reward-Based Behavioral Learning in Caudate Nucleus: A Functional Magnetic Resonance Imaging Study of a Stochastic Decision Task. Journal of Neuroscience, 24(7), 1660-1665. https://doi.org/10.1523/JNEUROSCI.3417-03.2004

Izquierdo, A. (2017). Functional Heterogeneity within Rat Orbitofrontal Cortex in Reward Learning and Decision Making. Journal of Neuroscience, 37(44), 10529-10540. https://doi.org/10.1523/JNEUROSCI.1678-17.2017

Schonberg, T., Fox, C. R., Mumford, J. A., Congdon, E., Trepel, C., & Poldrack, R. A. (2012). Decreasing ventromedial prefrontal cortex activity during sequential risk-taking: An fMRI investigation of the balloon analog risk task. Frontiers in Neuroscience, 6, 80. https://doi.org/10.3389/fnins.2012.00080

Schultz, W., Dayan, P., & Montague, P. R. (1997). A Neural Substrate of Prediction and Reward. Science, 275(5306), 1593-1599. https://doi.org/10.1126/science.275.5306.1593
