
Understanding Reward-Based Learning

Krista L. Roze

Boston College
UNDERSTANDING REWARD-BASED LEARNING 1

Introduction

Learning is a cognitive process crucial to humans, from learning in the past what was safe to eat and where food could be found, to learning in schools in the present day. Almost always, this learning is the result of some incentive: we continue performing an action due

to the promise of some reward. We alter our expectations of events with experience, tailoring our actions

to seek reward and avoid punishment (Schultz, Dayan & Montague, 1997). Reward-based learning

perhaps began with the evolutionary drive to seek out food. Researchers over the past 20 or so years

have used this adaptive behavior in rats to determine the neural substrates underlying reward-based

learning. Evidence points to the orbitofrontal cortex (OFC) as a key region utilized in learning through

incentive. When the OFC was lesioned in a group of rats, for example, the lesioned rats took longer than non-lesioned controls to learn a reward reversal, in which they had to respond to a stimulus that was previously unreinforced (Izquierdo, 2017). The rat OFC's role in reward-based learning has also been linked to its projections to the striatum, a known reward region of the brain (Gourley et al., 2013).

To expand the literature for human subjects, researchers have started testing reward-based

learning using rewards other than primary incentives like food. Researchers have found that using an

abstract reward like money as an incentive activates regions of the brain that significantly overlap with

those associated with primary incentives. Elliott et al. found that in addition to activity in the OFC and

dopaminergic midbrain, the amygdala and ventral striatum also showed increased activity in the presence

of rewards (Elliott, Newman, Longe, & Deakin, 2003). The amygdala and OFC in particular have been

shown to respond adaptively to stimuli that predict rewards. When a stimulus previously learned to lead to a reward was changed to no longer do so, the responses elicited by the amygdala and the OFC decreased, demonstrating the flexibility of the brain in learning and re-learning which stimuli lead to rewards (Gottfried, O’Doherty, & Dolan, 2003). These results have been echoed in the caudate nucleus as

well. When task difficulty was altered by controlling the probability of reward, activity in the caudate

nucleus still “paralleled the magnitude of a subject’s behavioral change during learning” (Haruno et al.,

2004). This reiterates the fact that the human brain is adaptive in learning based on reward incentives.

Our experiment stemmed from a study of risk-taking behavior, which also involves

predicting future reward or punishment. The target article aimed to define the neural basis of risk-taking

by having participants complete the Balloon Analog Risk Task (BART) while being scanned in an fMRI

machine (Schonberg et al., 2012). The escalating risk and cumulative rewards earned in the BART have

been shown to produce a realistic model for naturalistic risk-taking behavior. The researchers considered

two models of risk-taking behavior specific to the BART: either participants would view each consecutive

balloon pump as an accumulated reward relative to the beginning of the trial, or they would view each

balloon pump as an increased chance at a loss, updating the risk of each balloon pump as they went

through the task. They hypothesized that if the accumulating reward model was used by participants,

there would be increasing ventro-medial prefrontal cortex (vmPFC) activity with each pump. If, however,

the participants used the possible loss model to gauge rewards, the researchers hypothesized decreasing

vmPFC activity with each pump. Upon conclusion of the experiment, the researchers found that vmPFC

activity decreased with each successive pump, supporting the model that participants viewed each pump

as a higher risk of loss rather than as a possible accumulation of reward.

Due to the limitations of PsychoPy’s programming capabilities, we adapted the target article’s experiment to explore reward-based learning. We performed the following

experiment to see if participants would respond with more correct answers over trials in a task, as they

learned which answers yielded a reward. We wanted to explore how well a response could be learned,

based on the probability, tied to each stimulus, that one of two answers was correct. We hypothesized that

over trials, participants would have more correct responses due to learning the probabilities (especially

probabilities further from chance) for the correct response for each stimulus.

Method

Participants

A total of 14 participants were chosen as a convenience sample from the Cognitive Neuroscience

Research Practicum class; this included 13 Boston College students and one Boston College professor.

All participants were exposed to all stimuli. Participants were randomly assigned to one of three different

versions of the experiment as a form of counterbalancing.

Experimental Task

The experimental task was a modification of the original Balloon Analog Risk Task (BART) used

in the target article. Due to the coding limitations of PsychoPy, the task was changed to measure reward

learning instead of risk-taking behaviors. Participants were first presented with instructions for the task. A

screen with the word “READY?” was presented to focus the participant's attention before the stimuli

appeared. The participant was then shown either a red, white, or blue balloon. They were prompted to

choose either the left or right arrow key. One key would “inflate” the balloon, and the participant was given

a point, which showed up on the screen as a green “1” after they pressed the key. The other key would

“pop” the balloon, and the participant was not given any points, which showed up on the screen as a red

“0” after they pressed the key. The experiment then immediately moved on to the next balloon. Each balloon color was tied to a unique probability that the left versus the right key would inflate the balloon. For example, in one version of the experiment, the right key inflated the red balloon on 80% of trials, and the left key on the other 20%. In the same version, the right key inflated the blue balloon on 40% of trials, and the left key on the other 60%. Lastly, as a control, the left and right keys each inflated the white balloon on 50% of trials. The order in which the different colored balloons appeared was randomized. Each balloon color was presented 60 times, for a total of 180 trials. The dependent variable was the number of correct key responses for each balloon color.
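The trial logic described above can be sketched in Python. This is an illustrative reconstruction, not our actual PsychoPy script; the probability mapping shown is the example version described in the text, and all names are hypothetical.

```python
import random

# Hypothetical mapping from one version of the experiment:
# probability that the RIGHT arrow key inflates each colored balloon.
P_RIGHT_INFLATES = {"red": 0.80, "blue": 0.40, "white": 0.50}

def run_trial(color, key_pressed, rng=random):
    """Return 1 (balloon inflated, one point) or 0 (balloon popped, no points)."""
    # On each trial the rewarded key is drawn according to the balloon's
    # probability, so the same key press can inflate the balloon on one
    # trial and pop it on the next.
    rewarded_key = "right" if rng.random() < P_RIGHT_INFLATES[color] else "left"
    return 1 if key_pressed == rewarded_key else 0

# Build the 180-trial sequence: 60 balloons of each color, in random order.
trials = [color for color in P_RIGHT_INFLATES for _ in range(60)]
random.shuffle(trials)
```

Because the rewarded key is re-drawn on every trial, a participant pressing the "better" key for the red balloon still pops it on roughly 20% of red trials, which is what makes the probabilities learnable only gradually.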

Design

In order to more specifically focus on the effect of the probabilities of the balloons on reward

learning rather than the effect of color, three versions of the same experiment were created. Throughout

the versions, no balloon color was paired with the same probability twice (the set of probabilities itself remained the same across versions). This counterbalancing allowed for color to

be controlled to ensure that only an effect of probability would be tested. Participants were randomly

assigned to complete one of the three versions of the experiment. As all participants were exposed to all

stimuli, this was a within-subjects design.
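The counterbalancing scheme amounts to a Latin square: across the three versions, each color is paired with each probability exactly once while the probabilities themselves stay fixed. A minimal sketch (variable names are illustrative, not from our materials):

```python
# Three rotations of the color list pair each color with each
# probability exactly once across versions, while the set of
# probabilities stays fixed.
COLORS = ["red", "white", "blue"]
PROBS = [0.80, 0.50, 0.40]  # P(right key inflates the balloon)

versions = [dict(zip(COLORS[i:] + COLORS[:i], PROBS)) for i in range(3)]
```

With this construction, for example, the red balloon is the 80/20 balloon in one version, the 50/50 balloon in another, and the 40/60 balloon in the third.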

The target article’s original BART was used to observe neural activity during risk-taking

behaviors. Participants would be prompted to inflate a balloon and would earn a certain amount of money

with each successful inflation. At any point during the trial, there was an opportunity to stop inflating the

balloon and cash out the rewards. There was also a chance that the balloon would pop with too many

inflations, and the rewards accumulated during that trial would be lost. This task was modified due to the

constraints of PsychoPy, as accumulating rewards and having multiple inflations per balloon could not be

properly programmed. Instead, each balloon was shown only once, and there were two outcomes: either

the balloon was inflated, or it popped. A successful inflation was rewarded one point, while a popped

balloon did not receive any points and did not take any points away. This simplified version of the BART

was used to test learning over time, according to each balloon's probability of the correct answer being

the left or right key. Therefore, instead of looking at when participants cashed out, the number of correct responses over time was observed in the form of participant hit rates.

To quantify the reward learning over time, the average participant hit rate (number of correct key

responses out of the total number of trials) was split into “early” and “late” trials. The average hit rate for

the first 90 trials was compared to the average hit rate for the last 90 trials using a paired t-test. A paired

t-test was used because the early and late trials were compared within participants.
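The hit-rate computation and paired t-test can be sketched as follows. This uses only the Python standard library rather than our actual analysis software, and the participant values shown are illustrative, not the study data.

```python
import math
import statistics

def hit_rate(responses):
    """Proportion of correct key presses (1 = inflated, 0 = popped)."""
    return sum(responses) / len(responses)

def paired_t(early, late):
    """Paired t statistic for per-participant early vs. late hit rates."""
    # A paired test operates on within-participant differences.
    diffs = [e - l for e, l in zip(early, late)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative values for three hypothetical participants:
# early-trial and late-trial hit rates for one balloon probability.
early = [0.52, 0.58, 0.55]
late = [0.60, 0.63, 0.57]
t = paired_t(early, late)  # negative t: late hit rates exceed early ones
```

The degrees of freedom for such a test are the number of participants minus one, which is why the statistics reported below are on 13 degrees of freedom for 14 participants.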

Results

No data was excluded from analysis. To operationalize reward learning over time, the hit rates

were calculated for each participant during “early” and “late” trials. Out of 180 trials, the first 90 trials were

labelled as “early” and the last 90 trials were labelled as “late”. Hit rates for early and late trials were

calculated separately for each participant, by taking the number of correct key responses over the total

number of key responses. This resulted in two hit rates per participant: one for early trials and one for late trials. These early and late hit rates were then compared in the analysis.

The research question asked whether participants would learn the probabilities of the correct key for each balloon, as evidenced by the number of correct responses they gave over trials. We

hypothesized that, over trials, participants would have more correct key responses as a result of learning

the different probabilities for the correct key for each balloon.

The data we obtained did not support our hypothesis, and so we cannot conclude from this

experiment that participants were able to learn the probabilities for correct key responses over trials. The

data was separated by trials into the three different probabilities so that early and late trials could be

compared within the trials that had the same probability. Once separated into these three groups, the

averages and standard deviations of the early trial hit rates and the late trial hit rates were calculated.

Further, a paired t-test was used to compare the hit rates of the early trials and the late trials.

Balloons with an 80/20 probability (the right key being correct 80% of the time) had an average hit

rate of 0.586 for early trials, and an average hit rate of 0.621 for late trials. Balloons with a 40/60

probability (the right key being correct 40% of the time) had an average hit rate of 0.483 for early trials,

and an average hit rate of 0.517 for late trials. Balloons with a 50/50 probability (each key being correct

50% of the time) had an average hit rate of 0.476 for early trials, and an average hit rate of 0.490 for late

trials.

Table 1
Means and Standard Deviations of Hit Rates, Split by Early and Late Trials

Condition       Mean    Standard Deviation
80/20; Early    0.586   0.083
80/20; Late     0.621   0.567
40/60; Early    0.483   0.533
40/60; Late     0.517   0.433
50/50; Early    0.476   0.067
50/50; Late     0.490   0.050



Figure 1. Average hit rates of early versus late trials, split by balloon probability.

There was no significant change in number of correct responses over time for the 80/20 probability balloons, t(13) = -1.37, p = 0.193.



Figure 2. Variability of hit rates in early versus late trials for 80/20 probability balloons. Bars denote median and quartiles for each plot.

There was no significant change in number of correct responses over time for the 40/60 probability balloons, t(13) = -0.95, p = 0.358.



Figure 3. Variability of hit rates in early versus late trials for 40/60 probability balloons. Bars denote median and quartiles for each plot.

There was no significant change in number of correct responses over time for the 50/50 probability balloons, t(13) = -0.57, p = 0.576.



Figure 4. Variability of hit rates in early versus late trials for 50/50 probability balloons. Bars denote median and quartiles for each plot.

For all probabilities, the p-value was greater than 0.05, so none of the comparisons can be deemed significant. Therefore, our hypothesis that participants would give more correct key responses over trials was not supported. Although not significant, it should be noted that with a greater discrepancy between the probabilities of each key being correct (i.e., 80/20), the p-value was lower than that of the control probability (50/50), thus coming closer to significance.

Discussion

The results observed in this experiment did not support our hypothesis. While not quite reaching

significance, the results do present an interesting distinction between the hit rates for each probability.

When analyzed using a paired t-test, the p-value for the more predictable 80/20 balloons (i.e., those whose correct key was more likely to be learned) turned out to be lower than the p-value of the control 50/50 balloons.
Focusing on reward-based learning, overall, our results did not echo the findings of previous

literature. However, the relatively better learning for the higher probability balloons is a similar finding to

literature stating that the brain’s reward-based learning networks are flexible when presented with

different predictive stimuli and difficulty of the task (Gottfried et al., 2003; Haruno et al., 2004).

Participants simultaneously responded to balloons of different probabilities, learning the more easily

predicted outcomes (the 80/20 balloons) more distinctly than the other balloons.

If we were to run this experiment using an fMRI machine, I would expect there to be activity

particularly centered around the OFC and the caudate nucleus. The OFC has continued to show activity

throughout studies concerning reward-based learning (Gottfried et al., 2003; Gourley et al., 2013;

Izquierdo, 2017). Although our participants may not have overtly known they were learning something, I

predict that activity in the OFC would spike nonetheless. In addition, as the discrepancy between key probabilities increases (e.g., 80/20 versus 50/50), activity in the caudate nucleus would be expected to increase as well, due to the change in

behavior participants exhibited to gain more reward. The magnitude of the change in behavior, as shown

in previous literature, is correlated with the amount of activity in this area (Haruno et al., 2004).

A strength of our study was the use of randomization and counterbalancing. Randomizing the

order in which the balloons were presented and creating three versions of the experiment with the

probabilities tied to different colored balloons diminished any effect the balloons' colors may have had on the results, leaving probability as the variable of interest.

A limitation of our study was the absence of realistic rewards in order to incentivise learning.

Participants were given a point for each correct response given, but they did not receive any tangible

reward for each correct response as in the target article. We had planned to distribute candy as a reward

for each participant, but in reality we had no way of efficiently looking at the number of correct responses

for each participant and giving the correct number of candies in the time allotted.

In conclusion, our simplified BART to measure reward-based learning yielded no significant

results of participants learning correct key responses over trials.


References

Elliott, R., Newman, J. L., Longe, O. A., & Deakin, J. F. W. (2003). Differential Response Patterns in the Striatum and Orbitofrontal Cortex to Financial Reward in Humans: A Parametric Functional Magnetic Resonance Imaging Study. Journal of Neuroscience, 23(1), 303-307. https://doi.org/10.1523/JNEUROSCI.23-01-00303.2003

Gottfried, J. A., O’Doherty, J., & Dolan, R. J. (2003). Encoding Predictive Reward Value in Human Amygdala and Orbitofrontal Cortex. Science, 301(5636), 1104-1107. https://doi.org/10.1126/science.1087919

Gourley, S. L., Olevska, A., Zimmermann, K. S., Ressler, K. J., DiLeone, R. J., & Taylor, J. R. (2013). The orbitofrontal cortex regulates outcome-based decision-making via the lateral striatum. European Journal of Neuroscience, 38(3), 2382-2388. https://doi.org/10.1111/ejn.12239

Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., … & Kawato, M. (2004). A Neural Correlate of Reward-Based Behavioral Learning in Caudate Nucleus: A Functional Magnetic Resonance Imaging Study of a Stochastic Decision Task. Journal of Neuroscience, 24(7), 1660-1665. https://doi.org/10.1523/JNEUROSCI.3417-03.2004

Izquierdo, A. (2017). Functional Heterogeneity within Rat Orbitofrontal Cortex in Reward Learning and Decision Making. Journal of Neuroscience, 37(44), 10529-10540. https://doi.org/10.1523/JNEUROSCI.1678-17.2017

Schonberg, T., Fox, C. R., Mumford, J. A., Congdon, E., Trepel, C., & Poldrack, R. A. (2012). Decreasing ventromedial prefrontal cortex activity during sequential risk-taking: An fMRI investigation of the balloon analog risk task. Frontiers in Neuroscience, 6, 80. https://doi.org/10.3389/fnins.2012.00080

Schultz, W., Dayan, P., & Montague, P. R. (1997). A Neural Substrate of Prediction and Reward. Science, 275(5306), 1593-1599. https://doi.org/10.1126/science.275.5306.1593
