
Introduction

Critical clues can be gained from studies of lower organisms, by virtue of simplifying the mechanism and minimizing distractions. For instance, Eric Kandel became a Nobel laureate owing to his studies elucidating the inner workings of memory formation in sea snails (Kandel & Schwartz, 1982). In this manner, rodents' behavior in the operant chamber can be one candidate for interpreting human nature. There are two different learning strategies in instrumental conditioning. At the beginning of training, the actions of rodents in the experimental chamber appear to be goal-directed, governed by the relationship between their actions and their specific consequences. After a period of training, however, control over their behavior appears to shift to a stimulus-response process, which can be considered performance without thinking (Balleine & Dickinson, 1998). These phenomena can be extended to human behavior.

As animal behavior can be governed in two distinct ways, by a well-calculated, goal-directed process and by a stimulus-response habit mechanism, human behavior can likewise be driven by series of actions aimed at specific goals and by automated habitual activities, involving these two types of learning. This seemingly complicated behavioral system allows us to focus on more important things during daily life: we can avoid sudden obstacles in our way because we do not have to think about how to walk. However, this hypothesis demands empirical applications to be studied.

As noted above, studying this requires simplifying the concept. Among many attempts, the formulation of model-free and model-based learning approaches in computational modeling is one adequate tool to capture this competition between goal-directed and habitual behavior (Daw et al., 2005). In the model-free approach, the reasoning behind each action is based on a simple connection between reward prediction errors and a single signal cue (Dayan & Niv, 2008). The computational cost per action is minimized, hence this strategy is less capable of adjusting to current goals (Dayan & Niv, 2008; Keramati et al., 2011). On the other hand, the continuous adjustment of actions in the model-based approach gives us a diverse, goal-optimized repertoire of behaviors, at the expense of vigorous and consecutive computational work in the brain prior to the decision for a specific action (Daw et al., 2005; Dayan & Niv, 2008; Keramati et al., 2011).
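
This distinction can be made concrete with a toy calculation. The Python sketch below is purely illustrative and not part of the study's methods; the cue names, transition probabilities, and the learning rate `alpha` are our own assumptions. A model-free learner caches one value per cue and nudges it by the reward prediction error, whereas a model-based learner recomputes action values from an explicit model of the task on every decision.

```python
import random

alpha = 0.1  # learning rate (assumed value, for illustration only)

# --- Model-free: cache a value per cue, update it by the prediction error ---
q = {"cue_A": 0.0, "cue_B": 0.0}

def model_free_update(cue, reward):
    # delta is the reward prediction error that the text links to the N200/RP
    delta = reward - q[cue]
    q[cue] += alpha * delta
    return delta

# --- Model-based: evaluate actions by planning over a known task model ---
# transition[action] -> {state: probability}; reward_prob[state] -> p(reward)
transition = {"left": {"state1": 0.7, "state2": 0.3},
              "right": {"state1": 0.3, "state2": 0.7}}
reward_prob = {"state1": 0.8, "state2": 0.2}

def model_based_value(action):
    # expected reward, recomputed from the model: costlier but goal-flexible
    return sum(p * reward_prob[s] for s, p in transition[action].items())

if __name__ == "__main__":
    random.seed(0)
    for _ in range(100):
        model_free_update("cue_A", 1.0 if random.random() < 0.8 else 0.0)
    print(round(q["cue_A"], 2))        # drifts toward the true rate, 0.8
    print(model_based_value("left"))   # 0.7 * 0.8 + 0.3 * 0.2 = 0.62
```

The cached value in the model-free learner is cheap to use but slow to change, which is exactly the trade-off the citations above describe.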

Another background of this study is that these model-based and model-free strategies can be subsumed under reinforcement learning (Dayan & Niv, 2008). Reinforcement learning is a way of optimizing behavior based on the prediction of consequences (Sutton & Barto, 1998). There are many approaches to this branch of mathematical psychology. One representative method is the event-related brain potential (ERP). A myriad of ERP studies (Falkenstein et al., 1991; Gehring et al., 1993; Gehring & Willoughby, 2002; Holroyd & Coles, 2002) have supported the idea that reinforcement learning can be visualized in the ERP waveform as a negative deflection in the ERP from positive feedback (reward) to negative feedback (non-reward).

This negative deflection of the ERP occurs approximately 250 ms after the feedback, and its peak is found at frontal-central recording electrodes (Miltner et al., 1997; Holroyd et al., 2009). The presumable source of this negative deflection is the anterior cingulate cortex (ACC), which receives signals from the midbrain dopamine system to evaluate previous actions (Rushworth et al., 2004). This hypothesis is especially well developed in animal studies (Schweimer & Hauber, 2006).

One thing that should be recalled in this context is that there is another ERP component sharing its timing, polarity, and location: the N200. The cognitive representation of the N200 has been interpreted as the detection of a mismatch in the analysis of auditory stimuli (Folstein & Van Petten, 2008). However, based on Holroyd et al. (2008), it was suggested that this negative deflection and the N200 are actually the same. Accordingly, the N200 represents the mismatch between prediction and feedback (Baker & Holroyd, 2011; Baker et al., 2016). In other words, the prediction errors from unexpected reward or punishment are possibly displayed as the N200 ERP component, and the amplitude of the N200 can contribute to the negative deflection (Baker & Holroyd, 2011; Baker et al., 2016). This brought about the idea of the feedback correct-related positivity, or reward positivity (RP), which is obtained from the difference between the amplitudes of the negative deflection, i.e., the N200 (Cohen et al., 2007; Holroyd et al., 2008; Baker & Holroyd, 2011; Baker et al., 2016).

In sum, reinforcement learning can be divided in two, model-free and model-based learning, and it can also be recorded in the ERP waveform. However, the connection between model-free and model-based learning and the ERP recording still needs to be elucidated. Hence, in this study, the N200, as the indicator of the feedback error-related negativity, was measured from six healthy individuals during a model-free and model-based learning task paradigm to study the link between them. A modified version of the two-step probabilistic learning paradigm of Smittenaar et al. (2013) was used, which can induce model-free and model-based learning by providing stochastic chances of reward (Fig. 1).

In the more frequently presented case (HF), the chance of getting a reward is highly biased toward the left choice, whereas the reward is more random and unbiased in the less frequently displayed case (LF). In addition, by switching the background color and displaying previous choices, we tried to give participants information for discriminating the HF from the LF (Fig. 1). Subsequently, we obtained significantly different trends in the N200 between the LF and the HF. Moreover, learning progress in the LF is less predictable, in the sense that performance accuracy in the task is less correlated with the RP. Therefore, we conclude that model-free learning is presumably mediated differently in our brain, and that this process can be measured in the ERP waveform.

– Participants

Seven healthy individuals participated in the experiment (1 male and 6 female; age range 18 to 32 years; mean = 22.43, SD = 4.76). All participants had normal or corrected-to-normal vision. Any participant with a record of a psychiatric or neurological disorder was excluded. In advance of the experiment, all participants provided written informed consent, as approved by the local research ethics committee. The experiment was conducted in accordance with the ethical standards prescribed in the 1964 Declaration of Helsinki.

– Reinforcement learning task

The task was modified from Smittenaar et al. (2013).

On each trial, two fractals were presented to participants for a choice, each of which more frequently (70%; Fig. 1) led to a particular fractal at a second step. At the second step, a coin (25 cents) was displayed on the screen based on the reward probability (20% to 80%; Fig. 1) associated with the participant's choice at that step. Conversely, a red cross was presented in the case of non-reward, also based on that probability. Choices at the first stage less frequently (30%; Fig. 1) led to the alternative second state. In this alternative state, the reward coin was given with a weaker bias (40% to 60%; Fig. 1) than in the other state. There was no explicit indication on the screen of which fractal was assigned a high chance of reward. Hence, participants were encouraged to use a model-based learning strategy, which is sensitive not only to prior reward but also to the transition structure of the task, unlike a model-free learning strategy focusing only on whether the last action was rewarded.

Prior to the experimental task, participants were trained on the task with written instructions on the screen, 10 demo trials showing the probabilistic association between the second-stage fractals and coin rewards, and verbal explanation during these demo trials from an assistant seated next to the participant.

Participants were asked to respond within 2.5 s by pushing keys (left: 1; right: 0) following the presentation of the first-state choice. If no response was made within this time period, the red-colored words "no response" appeared at the center of the monitor screen, and the task moved on to the next trial. If the response was made on time, the selected fractal was resized and placed at the top center of the screen to remind the participant of their first-stage choice, and the participant saw a different background color based on the choice made at the first stage. At the second step, the response window was reduced by 1 s, and based on the probability (20% or 80%) a reward coin or the red cross appeared on the screen.
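
As a reading aid, the contingencies just described can be sketched as a small simulation. This is a minimal reconstruction from the percentages given above, not the actual experiment code; the mapping of choices to states and the exact per-fractal reward probabilities are assumptions within the stated ranges.

```python
import random

# First-stage transitions: each choice leads to its 'common' second state 70%
# of the time and to the alternative state 30% of the time (Fig. 1).
COMMON_P = 0.7

# Assumed per-state reward probabilities within the ranges stated above: the
# HF state is strongly biased (20% vs 80%), the LF state mildly (40% vs 60%).
REWARD_P = {"HF": {"left": 0.8, "right": 0.2},
            "LF": {"left": 0.6, "right": 0.4}}

def run_trial(first_choice, second_choice):
    """Simulate one two-step trial; return (second_state, rewarded)."""
    # assume 'left' commonly leads to the HF state and 'right' to the LF state
    common = "HF" if first_choice == "left" else "LF"
    rare = "LF" if common == "HF" else "HF"
    state = common if random.random() < COMMON_P else rare
    rewarded = random.random() < REWARD_P[state][second_choice]
    return state, rewarded

if __name__ == "__main__":
    random.seed(1)
    outcomes = [run_trial("left", "left") for _ in range(1000)]
    n_hf = sum(s == "HF" for s, _ in outcomes)
    print(n_hf / 1000)                                      # ~0.7 common transitions
    print(sum(r for s, r in outcomes if s == "HF") / n_hf)  # ~0.8 reward in HF
```

A model-based strategy can exploit the 70/30 transition structure of this simulation, whereas a model-free strategy only tracks whether the last response was rewarded.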

– Electrophysiological recordings

The electroencephalogram (EEG) was recorded with a montage of 36 electrodes placed according to the extended international 10-20 system (Jasper, 1958). Readings were obtained through Ag/AgCl ring electrodes placed in a nylon electrode cap. A conductive gel (Falk Minow Services, Herrsching, Germany) was applied to the scalp, and inter-electrode impedances were kept below 10 kΩ by this application. Signals were amplified by low-noise differential amplifiers with a frequency response of DC 0.017 to 67.5 Hz (90 dB octave roll-off) and digitized at a sampling rate of 250 Hz. The digitized signals were recorded to disk using Brain Vision Recorder software (Brain Products GmbH, Munich, Germany). For artifact detection, the vertical electrooculogram (EOG) was computed from a recording beneath the participant's right eye and electrode channel Fp2. The horizontal EOG was recorded from the external canthi of both eyes.

During recording, the average reference was used; reference electrodes were placed on the left and right mastoids. The ground electrode was placed at channel AFz.

– Data processing and calculating reward positivity

Brain Vision Analyzer (Brain Products GmbH, Munich, Germany) was used for post-processing and data visualization. A 4th-order digital Butterworth filter with a passband of 0.1 to 20 Hz was used to filter the digitized readings. The readings were segmented into 1000-ms epochs extending from 200 ms before stimulus onset to 800 ms after it. The segmented evoked potentials were re-referenced to the mastoid electrodes.

Baseline correction was performed by subtracting from each epoch the mean amplitude at the corresponding electrode within the 200-ms interval preceding stimulus onset. Blinks and saccades were corrected with the eye-movement correction algorithm of Gratton et al. (1983). Trials with muscular and other artifacts were rejected using a ±150 µV level threshold and a ±35 µV step threshold. Then, averaging the single-trial EEG of each participant rendered the event-related potentials (ERPs). The ERPs were sorted by feedback type and by the frequency of reward. Reward positivity (RP) was calculated by assessing the difference between the ERPs to positive and negative feedback.

To do this, a difference wave was computed by subtracting the reward-feedback ERPs from the no-reward-feedback ERPs (Sambrook & Goslin, 2015; Holroyd & Krigolson, 2007). The size of the RP was determined by peak amplitude detection of the N200 within a 200 to 400 ms window after feedback onset. This peak amplitude detection was conducted at channel FCz, where the RP reaches its maximum value, enabling the statistical analysis to be verified. Afterward, the ERPs and scalp maps were revised with Illustrator CS5 (Adobe). The significance of the ERPs was calculated from the peak amplitude values with SPSS (IBM) and Excel (Microsoft).
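
The processing chain described in this section (bandpass filtering, epoching, baseline correction, threshold-based artifact rejection, averaging, and the difference-wave peak measure) can be summarized in a few lines of Python. The sketch below is a schematic re-implementation with NumPy/SciPy on synthetic data under assumed array shapes, not the Brain Vision Analyzer pipeline itself; the ocular-correction step (Gratton et al., 1983) is omitted.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250                # sampling rate (Hz), as reported above
PRE, POST = 0.2, 0.8    # epoch window: -200 to +800 ms around feedback onset

def preprocess(eeg, events):
    """eeg: 1-D array for one channel (e.g., FCz); events: sample indices."""
    # 4th-order Butterworth bandpass, 0.1-20 Hz, applied zero-phase
    sos = butter(4, [0.1, 20.0], btype="bandpass", fs=FS, output="sos")
    eeg = sosfiltfilt(sos, eeg)
    pre, post = int(PRE * FS), int(POST * FS)
    epochs = np.array([eeg[e - pre:e + post] for e in events])
    # baseline correction: subtract the mean of the 200 ms before onset
    epochs -= epochs[:, :pre].mean(axis=1, keepdims=True)
    # reject trials exceeding a +/-150 uV level threshold
    return epochs[np.abs(epochs).max(axis=1) < 150.0]

def reward_positivity(erp_reward, erp_noreward):
    """Difference-wave peak in the 200-400 ms post-feedback window."""
    diff = erp_noreward - erp_reward   # no-reward minus reward, as in the text
    t0, t1 = int((PRE + 0.2) * FS), int((PRE + 0.4) * FS)
    window = diff[t0:t1]
    return window[np.argmax(np.abs(window))]   # signed peak amplitude

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eeg = rng.normal(0.0, 10.0, 60 * FS)        # one minute of synthetic data
    events = np.arange(2 * FS, 55 * FS, FS)     # one mock event per second
    epochs = preprocess(eeg, events)
    erp_a, erp_b = epochs[::2].mean(axis=0), epochs[1::2].mean(axis=0)
    print(reward_positivity(erp_a, erp_b))
```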

– Negative deflection is more salient during the model-based learning process than during the model-free strategy

The feedback-evoked ERPs at channel FCz, sorted by the conditions inducing model-free and model-based learning, and the scalp distributions for each are presented in Figure 2. A significantly different waveform occurred in the less frequently presented situation inducing model-based learning (Fig. 2C; one-tailed t-test at 320 ms, p = 0.033; LF-PF: mean = 6.80 µV, SD = 4.24; LF-NF: mean = 2.59 µV, SD = 1.21). Based on the scalp distributions, the reward-related negative deflection appears to be a center-oriented process (Fig. 2B and 2D).
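
The comparison reported here, feedback conditions contrasted at the N200 peak, amounts to a paired one-tailed t-test across participants. A minimal sketch follows; the amplitude values are placeholders, not the study's data.

```python
from scipy.stats import ttest_rel

# Per-participant N200 peak amplitudes (uV) at FCz; placeholder numbers only.
lf_pf = [9.1, 3.2, 5.8, 12.4, 4.7, 5.6]   # LF condition, positive feedback
lf_nf = [3.9, 1.4, 2.2, 4.1, 2.0, 1.9]    # LF condition, negative feedback

# One-tailed paired t-test (PF > NF), mirroring the contrast in the text.
t, p = ttest_rel(lf_pf, lf_nf, alternative="greater")
print(f"t = {t:.2f}, one-tailed p = {p:.3f}")
```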

These observations are congruent with previous studies (Baker & Holroyd, 2011; Baker et al., 2016). In the HF condition, although the reason is not clear, the ERPs do not show significantly different N200 amplitudes (Fig. 2A). In addition, although it is hard to identify the source, there is a slight timing incongruence between PF and NF (Fig. 2A and 2C). The timings for PF and NF are aligned across the model-free and model-based learning conditions (Fig. 5). The RP occurs earlier for HF than for LF (Fig. 2A and 2C). Moreover, the trends in the scalp distributions are clearly divided by positive and negative feedback, yet this pattern is consistent between the model-free and model-based strategies (Fig. 2B and 2D).

– Difference wave and the relationship between performance accuracy and reward positivity show that the model-free setting is more in agreement with the previous ACC-related reward task paradigm

The difference waves and scalp distributions for each condition are displayed in Figure 3. The subtractions between the PF and NF scalp distributions agree with the previous literature (Fig. 3B and 3D; Baker & Holroyd, 2011; Baker et al., 2016). However, according to Figures 3A and 3C, the ERPs from the task paradigm provoking model-based learning are less congruent with the previous hypotheses of reinforcement learning studies (Baker & Holroyd, 2011; Baker et al., 2016). Unlike the waveforms in the model-free conditions, the LF waveform changes amplitude more than that of HF (Fig. 3C). Furthermore, around the P300 the difference wave crosses the zero line, reversing polarity (Fig. 3C). The cause remains to be elucidated, but this can be evidence of an additional neural process beyond the ACC reward-related mechanism.

In addition, the correlation between performance accuracy and RP indicates that the model-based learning process is less predictable under the current hypothesis that the RP reflects reinforcement learning progress through the amount of negative deflection (Fig. 4A and 4B; Baker & Holroyd, 2011).

Performance accuracy was calculated as the percentage of rewarded trials over the total number of trials. This result should be revisited with a larger sample size.
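
Performance accuracy and its relation to the RP, as plotted in Figure 4, reduce to two short computations, sketched below with placeholder per-participant values (our assumption, not the recorded data).

```python
import numpy as np
from scipy.stats import pearsonr

def accuracy(rewarded_trials, total_trials):
    # percentage of rewarded trials over all trials, as defined above
    return 100.0 * rewarded_trials / total_trials

# Placeholder per-participant values: rewarded-trial counts and RP amplitudes.
acc = np.array([accuracy(r, 200) for r in (132, 118, 141, 126, 109, 137)])
rp = np.array([3.1, 1.8, 4.0, 2.6, 1.2, 3.4])   # uV, placeholder

r, p = pearsonr(acc, rp)   # accuracy-RP correlation (cf. Fig. 4)
print(f"r = {r:.2f}, p = {p:.3f}")
```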

– The difference between model-free and model-based learning in this task paradigm seems to come from LF-NF cues

To figure out which factor drives these differences between the model-free and model-based learning signal cues, the comparison between HF and LF was made for positive feedback and negative feedback, respectively (Fig. 5). The center-oriented activity pattern of the scalp distributions is consistent across conditions (Fig. 5B and 5D). The timing of the N200 is slightly faster for positive feedback (Fig. 5A and 5C). Moreover, the overall time points of the ERP components, such as the P100, N200, and P300, are better aligned than in the comparison between PF and NF (Fig. 5A and 5C).

In PF, the ERP components do not show a significant difference from each other, whereas from the N200 onward, including the following P300 component, the waveforms are significantly different (Fig. 5A and 5C; one-tailed t-test at 320 ms, p = 0.029; HF-NF: mean = 4.69 µV, SD = 1.70; LF-NF: mean = 2.59 µV, SD = 1.21). In particular, around the P300 there is a consistently large difference between the more-frequent and less-frequent conditions (Fig. 5C).

This fits well with the classical P300 studies showing that decision making, or learning from rarely occurring events, evokes this ERP (Donchin, 1981).

The ERPs were recorded during a two-step probabilistic learning paradigm, which induces model-free and model-based learning processes. The N200 and the RP were the major subjects of scrutiny because they can indicate reinforcement learning (Baker & Holroyd, 2011). The LF signal cues triggering the model-based strategy show a clearer negative deflection effect relative to positive feedback.

Moreover, it could be assumed that the task with the more commonly presented event, having a less stochastic reward, is more in accordance with the previous account of the relationship between the N200 and the ACC in reward evaluation. Moreover, the scalp distribution of negative feedback presumably implies an additional mechanism affecting the previously described inner workings of the ACC and the midbrain in the reward trials. Last but not least, based on the statistics, the difference between model-free and model-based learning in this task paradigm seems biased toward LF-NF cues.

– There is a possibility of additional brain activities beyond the relationship between the ACC and the midbrain dopamine system

The N200 indicates many different pieces of information. There are several N2 sub-components according to their characteristics, such as automated ones and ones requiring conscious attention (Naatanen & Picton, 1986). The N2b, which differs from the rest of the N2 family in that it is responsive not only to auditory cues but also to visual and template changes, can be seen in a central cortical distribution related to ACC activity only during conscious stimulus attention (Pritchard et al., 1991; Baker & Holroyd, 2011). In addition, according to the animal studies indicating the relationship between midbrain dopamine activity and ACC activity (Rushworth et al., 2004; Schweimer & Hauber, 2006), we previously hypothesized that the N200 negative deflection can visualize reinforcement learning through reward prediction errors (Baker & Holroyd, 2011). However, although clear causation is lacking, three factors imply extra neural activity beyond reward prediction errors in this setting: the constant latency between reward and non-reward feedback, the difference in the P300, and the different polarity between the model-free and model-based approaches after the P300.

Of course, it is possible that these elements appeared simply because of our small sample size. Nonetheless, we can still entertain another plausible account based on the previous literature. Human aging studies have shown that N2b latency differences can be caused by a general decay of attentional processes with age (Czigler et al., 1997; Amenedo & Diaz, 1998).

Among our participants, there is little chance of such age-related decline of attentional processes (1 male and 6 female; age range 18 to 32 years; mean = 22.43, SD = 4.76). Yet there is a possibility of distraction by the stochastic chance of reward at negative feedback.

This may indicate that a factor disturbed participants' attention during the non-reward cues of the task paradigm in this study. Moreover, this conclusion fits well with the general idea of a delayed computational process in goal-directed behavior compared with habitual learning (Dayan & Niv, 2008). Furthermore, as Figures 2, 3, and 5 consistently show, the P300 components are significantly different between reward and non-reward feedback. Based on classical studies, the P300 corresponds to broad recognition and memory updating in response to rarely occurring events (Sutton et al., 1965; Donchin, 1981; Naatanen, 1990). This led us to a conclusion similar to the previous one: because of a possible additional cognitive process, we need to extend our scope to other components, such as the P300, in this context.

– There are subcortical areas to explain the aspect of subconscious processing during reinforcement learning

There are two major subclasses of reinforcement learning, model-free and model-based, which involve subconscious computation (Dayan & Niv, 2008).

Surely a complex mechanism shapes our behavior, but one factor is reward evaluation by the midbrain dopamine pathway (Doya, 2008). Therefore, the N200 was measured in this study on the basis of the connection between the ACC and the reward prediction errors of midbrain dopamine cells (Brown & Braver, 2005). One more helpful factor to add is the striatum. One study showed that the striatum also responds topologically to habitual learning and goal-directed learning, respectively (Yin et al., 2005). Many attempts have been made to illuminate the cortico-striatal connection in humans.

One of them combines computational modeling and ERPs (Santesso et al., 2009). This leads to the final conclusion that, although the N200 and the RP are a great tool for assessing model-free and model-based learning, we still need to improve the task and the data analysis with the aid of computational modeling.