
Journal of Neuroscience Methods 293 (2018) 359–374


EEG-Annotate: Automated identification and labeling of events in continuous signals with applications to EEG

Kyung-min Su a, W. David Hairston b, Kay Robbins a,∗

a Department of Computer Science, University of Texas-San Antonio, San Antonio, TX 78249, USA
b Human Research and Engineering Directorate, US Army Research Laboratory, Aberdeen Proving Ground, MD 21005, USA

Highlights

• Proposes a framework for a new "reverse inference" problem for EEG – one that tries to identify every place that a particular response occurs.
• Shows that it is possible to perform non-time-locked EEG classification on unlabeled data using data from other subjects.
• Provides a complete open-source implementation of a working system that includes visualization and analysis tools.
• Includes an open-source EEG data collection and an associated Data Note.

Article history: Received 8 August 2017; Received in revised form 11 October 2017; Accepted 13 October 2017; Available online 20 October 2017

Keywords: EEG; Single-trial ERP; Time-locked; Event identification; Stimulus-response

Abstract

Background: In controlled laboratory EEG experiments, researchers carefully mark events and analyze subject responses time-locked to these events. Unfortunately, such markers may not be available or may come with poor timing resolution for experiments conducted in less-controlled naturalistic environments.

New method: We present an integrated event-identification method for identifying particular responses that occur in unlabeled continuously recorded EEG signals based on information from recordings of other subjects potentially performing related tasks. We introduce the idea of timing slack and timing-tolerant performance measures to deal with jitter inherent in such non-time-locked systems. We have developed an implementation available as an open-source MATLAB toolbox (http://github.com/VisLab/EEG-Annotate) and have made test data available in a separate data note.

Results: We applied the method to identify visual presentation events (both target and non-target) in data from an unlabeled subject using labeled data from other subjects with good sensitivity and specificity. The method also identified actual visual presentation events in the data that were not previously marked in the experiment.

Comparison with existing methods: Although the method uses traditional classifiers for initial stages, the problem of identifying events based on the presence of stereotypical EEG responses is the converse of the traditional stimulus-response paradigm and has not been addressed in its current form.

Conclusions: In addition to identifying potential events in unlabeled or incompletely labeled EEG, these methods also allow researchers to investigate whether particular stereotypical neural responses are present in other circumstances. Timing tolerance has the added benefit of accommodating inter- and intra-subject timing variations.

© 2017 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

∗ Corresponding author.
E-mail addresses: kyungmin.su@utsa.edu (K.-m. Su), william.d.hairston4.civ@mail.mil (W.D. Hairston), kay.robbins@utsa.edu (K. Robbins).
https://doi.org/10.1016/j.jneumeth.2017.10.011

1. Introduction

A large body of traditional EEG neuroscience research uses "stimulus-response" paradigms in which a stimulus is presented to a subject and the subject's neurological response, as measured by EEG or other imaging, is recorded. Analysis begins by extracting windows or epochs consisting of channels × time data-points that are time-locked to particular known stimulus events. ERP (event-related potential) analysis averages the epochs associated with each type of stimulus event and then examines differences in the average patterns. In order for averaging to work, precise signal time-locking to the events is critical and depends on having an accurate representation of event onset times (Larson and Carbine, 2017). More sophisticated time-locked methods based on regression (Rivet et al., 2009; Souloumiac and Rivet, 2013; Congedo et al., 2016) account not only for the temporal jitter due to variability of individual responses, but also for temporal overlap of responses due to closely spaced stimuli. However, all of these methods rely on precise knowledge of the time of the stimulus with a goal of modeling "stereotypical" responses to specified stimuli.

Single-trial time-locked analyses (Jung et al., 2001) and BCI (brain-computer interface) systems (Müller et al., 2008) also map epochs time-locked to stimuli into feature vectors labeled by the stimulus event type. These methods build classifiers (Blankertz et al., 2011) or use other machine learning techniques (Lemm et al., 2011) to map feature vectors into class labels for use in later stimulus identification of time-locked unlabeled data. Single-trial analysis is also used in regression and other types of statistical modeling (Kiebel and Friston, 2004a,b; Maris and Oostenveld, 2007; Rousselet et al., 2009). Frishkoff et al. (2007) have used features extracted from ERPs with decision-tree classifiers to generate rules for categorizing responses to stimuli across studies. They have developed NEMO (Neural ElectroMagnetic Ontologies), a formal ontology for the ERP domain (Frishkoff et al., 2011), and methods of matching ERPs across heterogeneous datasets (Liu et al., 2012a). Time-locked classification, time-frequency, and ERP analyses have many applications in laboratory and clinical settings where scientists are trying to understand neurological processes associated with particular stimuli (Roach and Mathalon, 2008; Dietrich and Kanso, 2010; Mumtaz et al., 2015; Lotte et al., 2007).

As EEG experiments move to more naturalistic settings (Kellihan et al., 2013; McDowell et al., 2013), neither time-locked analysis nor expert identification of stereotypical patterns may be possible. In a typical laboratory event-related experiment involving visual stimuli, the subject focuses on the screen, and researchers assume that subjects "see" what is presented at the time of presentation. For subjects moving in natural environments, a myriad of stimuli compete for attention, and the mere existence of a visible line-of-sight does not guarantee perception. Even if eye-tracking data indicates a fixation on the visual target, the experimenter cannot be certain that a subject actually "saw" the stimulus and "paid attention". Clearly, the analysis of neural responses in EEG acquired under complex naturalistic experimental conditions requires alternative approaches for understanding stimulus-response relationships.

This paper introduces a new class of methods for analysis of EEG data, which we refer to as "automated event identification". Automated event identification matches responses observed in continuous unlabeled EEG data to responses known to be associated with particular stimuli in other datasets. While traditional ERP analysis tries to identify the stereotypical responses resulting from a particular stimulus, event identification tries to find stereotypical responses and hence locate when potential stimuli might have occurred. We employ machine-learning approaches to train multiple classifiers using data from experiments that follow well-controlled time-locked stimulus-response paradigms and then apply these classifiers to time windows extracted from continuous data in other settings to determine whether the same types of neural responses occurred in these new settings. Ideally such systems could allow researchers to tackle the inverse of the "stimulus-response" problem and ask, "Which stimuli trigger a particular response?" rather than "What response is triggered by a particular stimulus?"

In this paper, we apply our automated detection method to show that not only does the expected neural response to visual stimulus occur in predictable ways across subjects in a visually evoked potential experiment, but also that similar responses can be evoked by incidental visual stimuli during the course of an experiment. Included in the supplemental material is a complete implementation of the method as a MATLAB toolbox (available at http://github.com/VisLab/EEG-Annotate) along with the labeled data used in this paper in order to allow researchers to label their own data using these tools.

2. Materials and methods

2.1. Method overview

In traditional time-locked EEG analysis, we know the stimulus times exactly and analyze responses by extracting fixed-length portions of the signal, called windows or epochs, starting at known times relative to the stimulus events, as illustrated schematically by the yellow window (x) of Fig. 1. We can transform response epochs to feature vectors labeled by stimulus type and build classifiers to identify the stimulus based on the response.

In the context of automated event identification, however, we have no knowledge of where stimulus events are located or even if a particular brain response was actually associated with a "stimulus". Our goal is to identify stereotypical neural responses and then to determine whether these responses were associated with identifiable stimuli.

Fig. 1. Time-locked vs. non-time-locked decomposition of EEG signals into fixed sized windows or epochs. A time-locked epoch (x) used in traditional ERP or classification
analysis appears in yellow. Non-time-locked epochs (a through l) generated using fixed offsets (dotted lines) appear in blue and apricot. Apricot windows (d through k)
partially overlap the epoch time-locked to this stimulus event. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of
this article.)

Table 1
Summary of the event annotation algorithm.

Step  Section  Event annotation algorithm
a)    2.2      Extract epochs from the training and test datasets.
b)    2.3      Convert epochs to feature vectors.
c)    2.4      Train multiple classifiers using training and possibly test features.
d)    2.5      Use each classifier to score each unlabeled, non-time-locked test feature.
e)    2.6      Use a threshold adapted for imbalance to shift scores.
f)    2.7      Shift and rescale to bring window scores between 0 and 1.
g)    2.8      Apply the weights of Table 2 to smooth window scores.
h)    2.9      Apply greedy zero-out to obtain sub-window or sample scores.
i)    2.10     Combine multiple classifier scores by averaging and smoothing.
j)    2.10     Apply greedy zero-out to obtain candidate events.
k)    2.11     Histogram the resulting scores and apply an adaptive threshold.
l)    2.11     Rerank if appropriate.

To approach this problem, we decompose the continuous channels × time EEG signal into windows (epochs), each offset by a fixed time, as illustrated by the blue and apricot windows in Fig. 1. We refer to the portion of an epoch from its start to the start of the next epoch as a sample. We choose the sample width (the window offset) to evenly divide the window length. The choice of sample offsets that integrally divide the length of windows (epochs) simplifies overlap calculations. We transform these overlapping windows into feature vectors and apply a federation of classifiers trained on time-locked data from other labeled datasets. We then combine the results, accounting for variations in timing, to identify the epochs most likely to contain potential stimuli. Table 1 summarizes the steps of the method.

On the surface, this approach seems similar to ordinary classification with multiple datasets for training and a new dataset for testing, but there are crucial differences. Ordinary classification treats the epochs as independent of each other. In contrast, adjacent non-time-locked epochs extracted from continuous data are highly correlated, particularly for features that are relatively insensitive to time locking. If we choose an epoch size of one second with epoch offsets of 125 ms, 15 non-time-locked epochs overlap to some degree with each time-locked epoch. Further, the epoch offset or sample size (125 ms) is smaller than the time scale over which the relevant event process occurs. Thus, the classifiers output hits for clusters of adjacent, overlapping epochs that actually correspond to a single stimulus event. Another way to view this redundancy is that the individual predictions align to within a specified timing error.

A second difference is that in the automated identification of stimulus events, such as those that occur in naturalistic settings, we are mainly interested in the most significant events and assume that events (positive labels) occur relatively infrequently. After making and combining the initial predictions, we compare scores across the dataset as a whole to determine which labels are the most likely to correspond to events (true positives). Event identification in its second phase is similar to retrieval systems, which seek to identify the most relevant matches first. Precision and recall, which do not depend on true negatives (non-events), are the relevant measures.

The remainder of this section provides details about a particular implementation we provide with the annotation toolbox. Researchers can easily modify these choices to better suit their problems.

2.2. Extracting epochs or windows (Step a)

As described in Fig. 1, we extract windows (epochs) starting at the beginning of the dataset from the continuous channels × time EEG signals of both training and test datasets. The windows are one second in length, and each window is offset from the previous window by 125 ms. The "positive" instances of a given event class correspond to windows that contain an event of that class in their first sub-window. The "negative" instances are those windows that do not overlap at all with the positive instances.
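A minimal MATLAB sketch of this windowing and labeling step follows. It is illustrative only and is not the EEG-Annotate API; `data` is assumed to be a channels × frames array and `eventLatencies` a vector of event onsets in seconds.

```matlab
% Sketch of Step a: overlapping 1-s windows at 125-ms offsets, labeled from
% known event onsets. Positive windows contain an event in their first
% sub-window; negative windows overlap no positive window at all.
srate    = 512;                               % sampling rate in Hz (assumed)
subLen   = round(0.125 * srate);              % sub-window length in frames
winLen   = 8 * subLen;                        % 1-s window = 8 sub-windows
nPts     = size(data, 2);                     % data is channels x frames
starts   = 1:subLen:(nPts - winLen + 1);      % window start frames
eventPts = round(eventLatencies * srate) + 1; % event onsets in frames

isPos = arrayfun(@(s) any(eventPts >= s & eventPts < s + subLen), starts);
posStarts = starts(isPos);
overlapsPos = arrayfun(@(s) any(abs(posStarts - s) < winLen), starts);

labels = zeros(size(starts));                 % 0 = excluded from training
labels(isPos) = 1;                            % positive instances
labels(~overlapsPos) = -1;                    % negative instances
```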
2.3. Transforming epochs to feature vectors (Step b)

In order to perform classification, we must convert epochs in the training and test datasets into feature vectors. We have selected spectral power features, which capture spectral, temporal, and spatial information without heavy dependence on exact timing. We form a feature vector of length channels × 8 × 8 for an epoch by computing for each channel the spectral power in each of eight frequency sub-bands on eight one-second time intervals offset by multiples of 125 ms from the start of the epoch.

Specifically, after high-pass filtering the EEG signals at 1.0 Hz, we band-pass filter using EEGLAB's pop_eegfiltnew function (Delorme and Makeig, 2004a) to form eight frequency sub-bands of approximately 4 Hz in width in the range 1–32 Hz. These bands, which start at 0.75, 4, 8, 12, 16, 20, 24, and 28 Hz, respectively, align to well-known neurologically relevant frequency bands (Klimesch, 1999). For example, brain-computer interface systems (BCIs) often use Mu (8–12 Hz) signals detected over motor cortex areas (Arroyo et al., 1993), while EEG drowsiness monitoring systems sometimes use Theta (4–7 Hz) signals originating over the cerebral cortex (Lin et al., 2008). Theta rhythms originating from the hippocampus are generally too deep for EEG surface recordings to detect (Buzsáki, 2002). Alpha waves (also 8–12 Hz) found in the occipital lobe are associated with attention lapses (Palva and Palva, 2007), while Beta band activity (16–32 Hz) is associated with motor and cognitive control (Engel and Fries, 2010).
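A rough sketch of this feature computation is shown below, assuming an EEGLAB EEG structure that has already been high-pass filtered at 1.0 Hz and the window starts and sub-window length from the previous step; the exact sub-intervals used by the toolbox may differ slightly from this arrangement.

```matlab
% Sketch of Step b: log spectral power per channel, sub-band, and sub-interval.
edges  = [0.75 4 8 12 16 20 24 28 32];       % sub-band boundaries (Hz)
nBands = numel(edges) - 1;
nChans = size(EEG.data, 1);
features = zeros(numel(starts), nChans * 8 * nBands);

for b = 1:nBands
    EEGband = pop_eegfiltnew(EEG, edges(b), edges(b + 1));  % EEGLAB band-pass
    for w = 1:numel(starts)
        for s = 1:8                                          % 8 sub-intervals
            seg = EEGband.data(:, starts(w) + (s - 1) * subLen + (0:subLen - 1));
            col = ((b - 1) * 8 + (s - 1)) * nChans + (1:nChans);
            features(w, col) = log(mean(seg .^ 2, 2) + eps)'; % log band power
        end
    end
end
```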

2.4. Training the classifiers (Step c)

Each type of event generates a separate one-against-all classification problem. The question is whether an event of a particular type occurs in an epoch, not which of a selected set of events occurs in that epoch. We illustrate the approach using two classifier algorithms – simple linear discriminant analysis (LDA) (Zhang et al., 2013a) and a more sophisticated domain adaptation classifier called ARRLS, which was originally introduced by Long et al. (2014) and later modified to handle imbalanced data by Su et al. (2016). LDA classifiers do not use information about the test set, so an annotation system can train a single classifier for each event type-training set combination and store the classifier along with the training set data for a simple and fast implementation. Domain adaptation classifiers use the distribution of the unlabeled test data to improve performance, so we must train a different classifier for each event type-training set-test set combination.

LDA classifiers find linear combinations of features that maximize the variance between classes while minimizing the variance within classes for the training set. LDA assumes that the samples within a class come from the same normal distribution and that the distributions of different classes have different means but the same covariance matrices. LDA also assumes that the test set has the same class distributions as the training set. We use the MATLAB fitcdiscr (fit discriminant analysis classifier) function with a linear discriminant type and empirical prior probabilities to train the LDA classifiers. After training an LDA classifier, we use the MATLAB predict function in the Statistics Toolbox to estimate the posterior probabilities used to score the test epochs.
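A minimal sketch of this LDA step using standard Statistics Toolbox calls follows; Xtrain/ytrain and Xtest are assumed to be the feature matrices and labels produced above, with numeric labels and 1 marking the positive class.

```matlab
% Sketch of Step c with LDA: train on labeled windows from other subjects
% and score the unlabeled test windows by posterior probability.
mdl = fitcdiscr(Xtrain, ytrain, 'DiscrimType', 'linear', 'Prior', 'empirical');
[~, posterior] = predict(mdl, Xtest);          % class posterior probabilities
scores = posterior(:, mdl.ClassNames == 1);    % assumes numeric labels, 1 = positive
```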
In contrast to LDA, domain adaptation adapts the training distributions to the test distributions in order to achieve better accuracy during learning transfer (Liu et al., 2012b; Wu et al., 2014). The basic ARRLS algorithm proposed by Long adds several regularization terms to standard regression-based classification. One term, based on maximum mean discrepancy, tries to minimize the difference between the joint distributions of the training and test sets. The joint distribution is a combination of the marginal distribution of the feature vectors and the conditional distributions of the feature vectors for the different classes. ARRLS also uses manifold regularization to assure that the local geometry of the marginal distributions is similar in both the training and test sets. ARRLS finds an optimal solution by minimizing the least-squared classification error with a standard ridge regression term for regularization. In order to approximate the conditional distributions of the classes in the test set, ARRLS uses a logistic regression classifier trained with the training data to calculate pseudo-labels for each element of the test data. We always balance the training data for this initial classification.

ARRLS assumes that the training data and test data have similar class balances. The highly imbalanced nature of event annotation (∼100:1) creates an intrinsic difficulty in estimating the conditional distributions, since the test set labels are completely unknown and not likely to be the same as those of the training data. We have developed a modification of ARRLS, called ARRLSimb, to handle highly imbalanced datasets (Su et al., 2016). To estimate the structural risk and the joint distributions, ARRLSimb makes an initial prediction of the class labels for the test set to generate pseudo-labels. ARRLSimb uses the pseudo-labels of the test set along with the actual labels of the training set to weight the class samples when generating the conditional distributions.

2.5. Classifying test epochs or windows (Step d)

We apply the one-against-all classifiers developed for each event type-training dataset combination (LDA) or each event type-training dataset-test dataset combination (ARRLSimb) to produce initial prediction scores for the individual overlapping epochs in each test dataset. LDA sets the class label using a threshold based on posterior probabilities. The original ARRLS algorithm produces two scores for each test data item, one corresponding to a positive class label and the other corresponding to a negative class label. ARRLS selects the positive class if the difference in scores (positive – negative) is positive. Unfortunately, ARRLS performance is very sensitive to class imbalance. ARRLSimb uses an adaptive thresholding method described in the next subsection to compensate for imbalance when determining the class label.

2.6. Adaptive thresholding to adjust labels for imbalance (Step e)

ARRLSimb fits the distribution of difference scores based on the assumption that the test set is highly imbalanced, with far more negative class instances. ARRLSimb assumes that the distribution of difference scores is the sum of two Gaussian distributions, with the Gaussian representing the negative class having far more points than the Gaussian representing the positive class. Instead of trying to fit two Gaussians directly, ARRLSimb fits a single Gaussian corresponding to the largest peak in the score histogram, removes an approximation to this distribution from the histogram, and then fits a second Gaussian to the remaining positive differences. We use MATLAB scripts provided by Long et al. (2014), modified for imbalance as described in Su et al. (2016), for the ARRLSimb classifications of this paper.
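The following is our own rough approximation of this two-stage fit, not the ARRLSimb code; `diffScores` is assumed to hold the positive-minus-negative score differences for all test windows, and the final threshold rule is a heuristic placed between the two fitted peaks.

```matlab
% Rough sketch of the adaptive threshold (Step e).
[counts, edges] = histcounts(diffScores, 100);
centers = (edges(1:end-1) + edges(2:end)) / 2;

% Gaussian 1: dominant (negative-class) peak, width from half-maximum points
[pkHeight, pkIdx] = max(counts);
halfPts = centers(counts >= pkHeight / 2);
mu1 = centers(pkIdx);
sigma1 = max((max(halfPts) - min(halfPts)) / 2.355, eps);   % FWHM -> sigma
gauss1 = pkHeight * exp(-(centers - mu1) .^ 2 / (2 * sigma1 ^ 2));

% Gaussian 2: fit to the leftover mass on the positive side of the main peak
residual = max(counts - gauss1, 0);
residual(centers <= mu1) = 0;
if sum(residual) > 0
    mu2 = sum(centers .* residual) / sum(residual);
    sigma2 = sqrt(sum(residual .* (centers - mu2) .^ 2) / sum(residual));
    threshold = mu1 + (mu2 - mu1) * sigma1 / (sigma1 + sigma2);  % between peaks
else
    threshold = mu1 + 3 * sigma1;            % fall back to a tail cutoff
end
isPositive = diffScores > threshold;
```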
2.7. Scaling epoch or window scores (Step f)

The ranges of score differences for ARRLS differ widely across classifiers, preventing straightforward combination of scores from multiple classifiers. We have experimented with various types of scaling, such as z-scoring, prior to combination. To avoid undue influence by outliers, we subtract the adaptive threshold from the score difference computed as described in the previous section and set all negative scores to 0. We then divide by the 98th percentile value of the non-zero scores and set all scores above 1 to 1. This shift and scaling causes the scores from the individual classifiers to lie between 0 and 1.
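In code, this shift-and-clip scaling amounts to a few lines (a sketch; it assumes at least some scores exceed the adaptive threshold):

```matlab
% Sketch of Step f: shift by the adaptive threshold, clip at 0, then scale
% by the 98th percentile of the surviving scores and cap at 1.
shifted = diffScores - threshold;
shifted(shifted < 0) = 0;
p98 = prctile(shifted(shifted > 0), 98);      % Statistics Toolbox
windowScores = min(shifted / p98, 1);
```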
2.8. Smoothing the window scores (Step g)

The algorithm combines scores from overlapping epochs (windows) to produce a single score for each sample (sub-window) by each classifier, accounting for the contribution of each epoch (window) to many samples (sub-windows), as illustrated schematically in Fig. 2. Fig. 2A shows a plot of the raw window scores produced by Step f plotted against the starting position of the window. A window consisting of N sub-windows (eight in our case) overlaps and contributes to the scores of 2 × N − 1 sub-windows.

Fig. 2. A schematic of the weighted averaging and zero-out procedure for producing sub-window scores from overlapping window scores. A) The initial window scores plotted against the position of the first sub-window in a window. B) Sub-window scores produced by calculating the weighted averages of overlapping window scores. C) Events are identified in isolated sub-windows by applying greedy zero-out.

If X_k is the window starting at sub-window k, we compute the score Y_i of sub-window i as:

Y_i = \sum_{j=-N}^{N} W_j \times X_{i+j}    (1)

W_j is the weight applied to the window offset by j sub-windows from the center of sub-window i. Table 2 shows the particular weight vector used in this work. (We omit division by the sum of the weights for readability.)

Weighted averages account for the contributions of different windows and smooth local variations in the scores, as illustrated schematically in Fig. 2B, which depicts the sub-window scores after window weighting.

2.9. Scoring of samples or sub-windows (Step h)

A sub-window with a high score is likely to have neighbors with a high score. Under the assumption that distinct events have some temporal separation, high-scoring sub-windows in a neighborhood are likely to designate the same event. To avoid false positives due to double counting of immediate neighbors with a high score, we apply a greedy masking procedure to remove the conflicts. The procedure is as follows: first find the largest sub-window score and annotate the associated sub-window as a hit. Apply the zero-out mask (Table 2) centered on the labeled sub-window and update by multiplying the sub-window scores by the mask. This masking procedure eliminates nearby high scores, as shown schematically in Fig. 2C. Repeat this procedure until there are no unmarked non-zero scores.

In summary, the predictions of a target tend to cluster near a target, but usually spread over adjacent sub-windows due to timing variations and the contribution of each window to multiple sub-windows. Applying a weighted window average followed by thresholding gives better predictions than simply thresholding each score. Furthermore, since sub-windows adjacent to an event have high scores, we use a zero-out procedure to avoid double counting.
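The weighted smoothing of Eq. (1) and the greedy zero-out of Section 2.9 can be sketched as follows, using the Table 2 weights and a mask half-width of seven sub-windows (a sketch, not the toolbox implementation; `windowScores` is the per-window score vector from Step f):

```matlab
% Sketch of Steps g-h: smooth window scores into sub-window scores with the
% Table 2 weights, then greedily keep local maxima and zero out +/-7 neighbors.
W = [0.5 0.5 0.5 0.5 0.5 1 3 8 3 1 0.5 0.5 0.5 0.5 0.5] / 21;  % Table 2 weights
half = 7;                                     % mask half-width in sub-windows
n = numel(windowScores);
subScores = conv(windowScores, W, 'same');    % Eq. (1): weighted average (W symmetric)

hits = [];                                    % greedy zero-out (Table 2 mask)
working = subScores;
while any(working > 0)
    [~, k] = max(working);
    hits(end + 1) = k;                        %#ok<AGROW> record this sub-window
    working(max(1, k - half):min(n, k + half)) = 0;  % remove hit and neighbors
end
```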
2.10. Combining multiple classifiers to obtain candidate events (Step i and Step j)

At this point, we have an array of sub-window scores for each classifier. Due to the zero-out procedure, the array contains isolated non-zero values. We average all of the sub-window score arrays for each test set-event class combination. We then apply the smoothing function of Table 2 and then the zero-out procedure. The non-zero values are the scores for the candidate events.

2.11. Using adaptive thresholding to select the actual events (Step k)

Finally, we use adaptive thresholding to fit the score distribution of the candidate events and separate high-scoring candidates (likely events) from low-scoring candidates (likely noise). When the histogram of this distribution separates well into low and high scores, the annotation is likely to have been successful. However, if there is no clear separation between high and low scores, one should be conservative in accepting the annotation results. In some cases, a re-ranking procedure, such as reclassifying the samples that have non-zero scores in the annotation process, may improve the separation and the effectiveness of the annotation.

2.12. Metrics and timing tolerance

We use balanced accuracy as an initial measure of classification performance:

Accuracy = 0.5 \times \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)    (2)

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively. However, in the context of annotation, as in retrieval, we are primarily interested in how well the algorithm annotates (retrieves) relevant items. In this context, precision, recall, and related measures are more useful. The annotation process determines how many items to annotate (retrieve) based on quality scores. Precision is the ratio of the number of correct or relevant items annotated (retrieved) over the total number annotated (retrieved). Recall is the ratio of the number of correct or relevant items annotated (retrieved) over the total number of relevant items. We also use the ranked average precision (RAP) to measure annotation quality.

The ranked average precision (RAP) is the sum of P(r_i) for all correctly retrieved items divided by the total number of correct or relevant items. Here P(r_i) is the precision calculated by annotating all of the items whose scores are at least that of the ith top-scoring correct item. RAP effectively assigns a zero precision to correct items that are not annotated (retrieved) and takes on a value of 1 when all correct items and only correct items are annotated (retrieved). RAP is similar to the average precision (AP) used in information retrieval (Deselaers et al., 2007). However, since the term "average precision" in classification usually refers to averaging values over different folds in cross-validation, we use the term RAP to avoid confusion.
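As a small worked sketch, RAP can be computed directly from this definition; `isCorrect` marks, in decreasing score order, which annotated items fall within the timing tolerance of a true event, and `nRelevant` is the total number of true events.

```matlab
% Sketch of ranked average precision (RAP). Correct items that were never
% annotated contribute zero precision through the division by nRelevant.
isCorrect = isCorrect(:);
precisionAt = cumsum(isCorrect) ./ (1:numel(isCorrect))';
rap = sum(precisionAt(isCorrect)) / nRelevant;
```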

Timing presents a difficulty for evaluation because if a prediction is off from the true location by even one sub-window, the metrics count the item as a fail no matter how close the prediction is to a true event. To account for these timing variations, we modify the algorithms for determining precision and recall to inspect the neighboring samples of a prediction. The non-time-locked versions of precision and recall count a sample as a positive if it is within a specified number of sub-windows (the tolerance) of a true positive event. Table 3 illustrates the process. True events occur during sub-windows 3, 6, and 10. Depending on the specified timing tolerance, the algorithm may or may not count predictions for neighboring sub-windows as hits. We also modify the RAP calculation to use timing precisions with a specified tolerance expressed as a number of sub-windows. Because of the zero-out procedure, each prediction will not have another prediction in its immediate neighborhood.
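The tolerance-based matching of Table 3 can be sketched as follows (`truePos` and `predPos` are sub-window indices of true events and predictions, and `tol` is the tolerance in sub-windows):

```matlab
% Sketch of timing-tolerant scoring: a prediction is a hit if some true event
% lies within tol sub-windows; a true event is recalled if some prediction does.
isHit     = arrayfun(@(p) any(abs(truePos - p) <= tol), predPos);
isFound   = arrayfun(@(t) any(abs(predPos - t) <= tol), truePos);
precision = sum(isHit)   / max(numel(predPos), 1);
recall    = sum(isFound) / max(numel(truePos), 1);
```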
2.13. Calculations of statistical significance

We used an empirical bootstrapping technique to answer the question of whether the performance of annotation in labeling m samples from a subject's n labeled samples was significantly better than random. To generate a random base annotation, we assigned a random score between 0 and 1 to each element of an n-element vector. We selected the m annotated samples using a greedy procedure (the remaining sample with the highest non-zero score is selected next). However, after each selection, we applied the zero-out procedure described in Section 2.9 to remove the neighbors of the selected sample from further consideration. We generated 10,000 random annotations and evaluated each performance metric on the annotations with a specified timing tolerance to form a base distribution. We applied the MATLAB ztest using the mean and standard deviation of this distribution to determine the statistical significance of the annotation for each subject. We also performed a non-parametric test of significance using the empirical distribution of the metrics based on random annotations.
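A condensed sketch of this bootstrap, combining the greedy zero-out selection of Section 2.9 with MATLAB's ztest, is given below; n, m, truePos, tol, and the observed metric are assumed from the surrounding text, the metric shown is timing-tolerant precision, and m is assumed small relative to n.

```matlab
% Sketch of the significance test: compare the observed metric against the
% distribution of the same metric for random annotations with zero-out.
nBoot = 10000;
nullMetric = zeros(nBoot, 1);
for b = 1:nBoot
    working = rand(1, n);                     % random scores for n samples
    picked = zeros(1, m);                     % choose m annotations greedily
    for i = 1:m
        [~, k] = max(working);
        picked(i) = k;
        working(max(1, k - 7):min(n, k + 7)) = 0;   % zero-out neighbors
    end
    hit = arrayfun(@(p) any(abs(truePos - p) <= tol), picked);
    nullMetric(b) = sum(hit) / m;             % precision at tolerance tol
end
[~, pValue] = ztest(observedPrecision, mean(nullMetric), std(nullMetric));
```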
2.14. Test data

To evaluate the proposed annotation system, we used previously collected, anonymized data containing no personally identifiable information from a standard Visual Evoked Potential (VEP) oddball task acquired during a cross-headset comparison study compiled by the U.S. Army Research Laboratory (Ries et al., 2014; Hairston et al., 2014). The voluntary, fully informed consent of the persons used in this research was obtained in written form. The document used to obtain informed consent was approved by the U.S. Army Research Laboratory's Institutional Review Board (IRB) in accordance with 32 CFR 219 and AR 70-25 and also in compliance with the Declaration of Helsinki. The study was reviewed and approved (approval #ARL 14-042) by the U.S. Army Research Laboratory's IRB before the study began. Table 4 summarizes the properties of this dataset.

Eighteen subjects were presented with a sequence of two types of images: a U.S. soldier (Friend events) and an enemy combatant (Foe events). The images were presented in random order at a frequency of approximately 0.5 +/− 0.1 Hz. Subjects were instructed to identify each image with a button press. The data used in this study was acquired using a 64-channel Biosemi Active2 EEG headset. The experiments recorded events for approximately 560 s. Longer records were trimmed to include only the portion containing the actual experiment with recorded events.

We applied the PREP preprocessing pipeline (Bigdely-Shamlo et al., 2015) to remove line noise and interpolate bad channels. We then removed the remaining noise and subject-generated artifacts using the Multiple Artifact Rejection Algorithm (MARA) (Winkler et al., 2011), which is an ICA-based EEG artifact removal tool. Both PREP and MARA are completely automated pipelines.

Each dataset contains approximately 34 Foe events and 235 Friend events. Typical time-locked classification analyses extract one-second epochs time-locked to these events and label epochs as "friend" or "foe". In these cases, a visual oddball response to the enemy combatants (foe) is expected.

In contrast to classification, annotation processes the two different types of events in two separate analyses, each based on one-against-all classification. The "positive" samples are either the Friend events or the Foe events. To form the "negative" samples we select the samples remaining after eliminating the windows with any overlap with the positive events. Approximately 1000 negative samples remain for the "friend" problem and 4000 negative samples remain for the "foe" problem.

To predict labels for a test subject, we combine the scores from the 17 other subjects and predict labels for the test subject based on the combined score. Since the training and test datasets are distinct, we use all of the samples of the training subjects for training and all of the samples of the test subject for testing.

Note that the classification problems described above (Friend versus others and Foe versus others) are extremely imbalanced. In each classification problem, we balance the training datasets by oversampling the positive samples before training the classifiers.

3. Results

To test the EEG annotation system, we perform two separate one-against-all classifications within the annotation process: the first annotates the appearance of U.S. soldiers (Friend events) and the second annotates enemy combatants (Foe events). Unlike the typical classification analyses, we use non-time-locked samples for both training and testing. A positive sample is a window whose first sub-window contains the targeted event. The negative samples are non-time-locked windows that do not overlap with the positive samples. Therefore, the negative samples depend on the type of positive samples. If positive samples are friends, then the negative samples are non-time-locked samples that do not overlap with the friend samples. Note that, in this case, the negative samples overlap with the foe samples. However, in annotation, the question is not whether the response between these two types of events is distinct, but rather at what times such events were likely to have occurred.

3.1. Pairwise classification accuracy

As an initial point of reference, we compare the performance of LDA (no domain adaptation), ARRLS (with domain adaptation), and ARRLSimb (ARRLS modified for imbalanced data) for non-time-locked classification using one subject as the training subject and another as the test subject. No labeled data from the test subject is included, but ARRLS and ARRLSimb account for the distribution of the test data. Because there are 18 subjects, we tested 306 pairs, consisting of 17 training subjects for each of 18 test subjects. Fig. 3 compares the balanced accuracies for the two inter-subject classification problems: Friend vs. Others and Foe vs. Others. In the plot, each dot represents the accuracy of one training-test pair.

As indicated by Fig. 3, both ARRLS and ARRLSimb consistently outperform LDA in the pairwise classification task. As expected, the foe classification task is more difficult, in part because of the limited training set available for foe classification. Based on 306 training-test subject pairs, the average balanced classification accuracy for LDA for Friend vs. Others was 0.60 (std 0.08) and for Foe vs. Others was 0.52 (std 0.06). ARRLS had an average classification accuracy of 0.76 (std 0.11) for Friend vs. Others and 0.66 (std 0.14) for Foe vs. Others. ARRLSimb had an average classification accuracy of 0.75 (std 0.10) for Friend vs. Others and 0.78 (std 0.1) for Foe vs. Others.

Fig. 3. Comparison of LDA, ARRLS and ARRLSimb classification accuracies for pairwise classification with zero timing tolerance. The left plot shows Friend vs. Others and the
right plot shows Foe vs. Others.

A one-sided paired t-test shows ARRLS and ARRLSimb are significantly better than LDA for Friend vs. Others (T(305) = 34.5, p < 0.001) and (T(305) = 33.6, p < 0.001), respectively. For Foe vs. Others, ARRLS and ARRLSimb are also significantly better (T(305) = 17.7, p < 0.001) and (T(305) = 43.7, p < 0.001). ARRLS and ARRLSimb have similar performance for Friend vs. Others, but ARRLSimb is significantly better than ARRLS for Foe vs. Others (T(305) = 18.0, p < 0.001).

The 18 subjects fall into three distinct groups based on the average of the Friend and Foe balanced accuracies of ARRLSimb. Subjects 6, 12, 14, and 17 have average accuracy less than or equal to 0.66. Subjects 1, 3, 7, 10, 13, 15, and 16 have average accuracies between 0.7 and 0.8, while subjects 2, 4, 5, 8, 9, 11, and 18 have average accuracies of 0.80 or above.

The ARRLS classification results are similar to those reported in Wu et al. (2015) on the same data collection using features consisting of raw EEG on selected channels.

3.2. Raw window scores

The initial classification step calculates a score for each window. Fig. 4 overplots the difference between the positive and negative raw class scores output from all ARRLSimb training-test set pairs. Friend and foe classification problems are considered separately. The window scores are aligned so that the sub-windows containing the actual event appear at position zero. The thick black line corresponds to the average of the scores for the Friend and Foe events, respectively. The overlay of the initial window scores shows the expected elevation in score for the positive events in each classification task, confirming the need for zero-out.

3.3. Predicted annotations and timing errors

Fig. 5 summarizes the success of prediction of the true events (Friend and Foe separately) with respect to timing errors. In this case, the tolerance refers to the distance of the sub-window containing each actual event from the sub-window of the nearest prediction of that event. ARRLS and ARRLSimb predict approximately 50% of all events with zero timing error, and 75% of the events with a timing error of one sub-window or less (in this case, 125 ms). In either case, 85% of the events have a timing error of four sub-windows or less. ARRLS and ARRLSimb perform similarly for both friend and foe events. LDA, on the other hand, performs much more poorly, predicting 35% of friend events and 10% of the foe events with zero timing error.

The average timing error for Friend events is 1.52 (std 1.64) sub-windows (equivalent to 190 ms) for LDA, 0.91 (std 1.38) sub-windows (114 ms) for ARRLS, and 0.95 (std 1.36) sub-windows (119 ms) for ARRLSimb. Foe events have an average timing error of 2.54 (std 1.70) sub-windows (318 ms) for LDA, 0.68 (std 1.02) sub-windows (85 ms) for ARRLS, and 0.66 (std 0.93) sub-windows (83 ms) for ARRLSimb.

Fig. 6 shows the annotation performance metrics for the individual subjects in the VEP datasets. In each case, the other 17 subjects formed the training data for the annotation. The left column shows Friend vs. Others classification, and the right column shows Foe vs. Others classification. The Foe classification problem is very difficult and extremely imbalanced. Fig. 6 shows the statistical significance using the bootstrap ztest described in Section 2.13. Highly significant results (p < 0.001) are marked with black squares, and results that are not statistically significant (p > 0.05) are marked with red crosses. The remaining results have 0.001 < p < 0.05. Subjects 6, 12, 14, and 17, which are the subjects with poor pairwise classification accuracies, have relatively poor performance results, although many of them are statistically significant. All results using the non-parametric test were significant.

In general, performance is an increasing function of timing tolerance, while statistical significance is a decreasing function of timing tolerance. Intuitively, the latter is true because it is easier to hit a true positive randomly within, say, two sub-windows than it is to hit within one sub-window. Subjects 1, 3, 7, 10, 15, and 16, which have moderate pairwise classification accuracy, have much better performance at timing tolerances of one than at zero. Performance for a timing tolerance of two is virtually the same as for one for all subjects. We also calculated performance after eliminating Subject 17 from the training set and noticed very little change in performance. Results here include Subject 17.

Fig. 4. Difference between positive and negative ARRLSimb classifier window scores for friend and foe events aligned at sub-window 0.

Fig. 5. Cumulative fraction of positive events that have predictions within the specified timing error. The timing tolerances are in sub-windows, and the events from all 18
subjects are combined.

Table 5 summarizes the average performance in leave-one-subject-out annotation for VEP for different timing tolerances. Subjects are grouped by their average pairwise classification accuracy.

The average performance for Friend training is quite high for the Good group, especially for a timing tolerance of one sub-window, while performance for the Poor group is low. As expected, Foe events are more difficult to annotate, most likely because of the very limited amount of training data available.

A closer examination of the results revealed that many false positives were due to hits on the other class. For this reason, we also computed performance by scoring a hit if an event of either class was present within the specified timing tolerance. Table 5 displays these results in parentheses. As expected, precision improved. For example, when the training class was Friend, the precision went from 0.83 to 0.93 for the Good group and from 0.65 to 0.75 for the Moderate group when the timing tolerance was one. When the training class was Foe, the precision went from 0.25 to 0.83 for the Good group and from 0.20 to 0.66 for the Moderate group when the timing tolerance was one. The recall when the training class was Friend did not change very much. As expected, recall was much lower for combined classes when the training class was Foe.

Fig. 6. Performance of annotation by subject for VEP datasets for different timing tolerances. Red ×'s mark values that are not statistically significant. Square boxes mark values that are highly significant (p < 0.001). All other values have 0.001 < p < 0.05.

This is because the number of actual positive events increases by a factor of seven. The statistical significance of the results for combined-class hits is slightly lower due to an increased probability of hitting a positive event at random.

3.4. Observation of annotation patterns

Misalignments in event response timing exist both within a subject and across subjects. The sliding window annotation process introduces another level of temporal "jitter". Fig. 7 illustrates some sample misalignments of predicted labels using Subject 1 of the VEP dataset as the test subject. As mentioned previously, we train ARRLSimb classifiers for each of the 17 other subjects using either Friend or Foe samples for the positive class and non-overlapping other samples for the negative class. We then extract the predicted labels for a given test-training subject pair by using the window weighting and zero-out procedure.

Each short vertical black line in the upper 17 rows of each panel in Fig. 7 marks a label predicted by one of the 17 training subjects. The top panel corresponds to annotation of Friend events, and the bottom panel corresponds to annotation of Foe events. The short vertical red lines in the bottom row show the timing of actual Foe samples. The blue vertical lines in the bottom row mark Friend samples. As expected, the predicted timings are not exactly aligned, and we must consider these timing variations when combining the predictions. Because of zero-out, predicted events are isolated. The data drawn in Fig. 7 are actually the normalized scores estimated by each classifier as shown in Fig. 3.

Several interesting features are apparent here. In the experiment, subjects were asked to identify friend or foe images, with the expectation that relatively rare Foe events would elicit an oddball response. Images were presented roughly at 2-s intervals and approximately one-seventh of the images were foes. Oddball responses at the beginning of the experiments were present

Table 2
Weights W and mask used to compute sub-window scores.

Timing (j) −7 −6 −5 −4 −3 −2 −1 0 1 2 3 4 5 6 7

Wj × 21 0.5 0.5 0.5 0.5 0.5 1 3 8 3 1 0.5 0.5 0.5 0.5 0.5
Mask 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Table 3
Evaluation of predictions for different timing error tolerances in units of sub-windows. Sub-windows with true events are marked with a T and those with predicted events
are marked with a P. Sub-windows for which no prediction is made are marked with 0.

Sub-window 1 2 3 4 5 6 7 8 9 10

True event 0 0 T 0 0 T 0 0 0 T
Predicted event P 0 0 0 0 P 0 0 P 0
Timing tolerance 0 Fail Hit Fail
Timing tolerance 1 Fail Hit Hit
Timing tolerance 2 Hit Hit Hit

Fig. 7. Misaligned predicted labels for a representative sample when Subject 1 is the test subject and the other 17 subjects are the training subjects for Friend versus others
and Foe versus others, respectively. The true Foe events appear as red vertical lines and Friend events appear as green vertical lines in the bottom row. The blue arrows at the
top indicate interesting or unexpected predictions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

in several datasets. Anecdotal feedback from other experimenters suggests that experimentalists sometimes exclude the beginning events in an experimental record for this reason. Many of the subjects show an initial response before the first event (first blue vertical line in the bottom row in Fig. 7 at around 16 s). The experimenter indicated that an image cueing the start of the experiment was displayed prior to the first visual stimulus presentation.

Fig. 7 shows clear oddball responses annotated at around 38 s and 48 s (marked with the first two blue arrows at the top). Notice that expected event presentations were missing from the experimental record. Events corresponding to subject button presses occurred after each of these "missing" events. Foe events, or at least some events, were roughly "due" at that time. The annotation system also detected "missing" events in other experimental records.

Based on this observation, we re-examined the original raw data and the log of experimental event codes. In doing so, we uncovered that a variable hardware error had caused a small number of trial event codes to be omitted from the recording of the data stream, although they corresponded to actual presentations to the subject. The "missing" codes that were highly predicted corresponded to missing log entries for an oddball event. This hardware error did not affect standard ERP analysis, because epoching only occurs around events explicitly marked in the data.

We examined the entire experimental record and also found that many subjects showed an oddball response after each session block ended. A closer examination of the experimental protocol revealed that subjects were presented with a congratulatory "You've made it through the block" image at the end of these sessions, which corresponded to these unanticipated oddball responses. The fact that the method found very few oddball responses that could not be explained supports the usefulness of this approach for analyzing unlabeled data.

The annotation system identifies samples in the data that are likely to be related to the positive class in a set of labeled samples. After annotating, we can order the sub-windows by their overall score. Fig. 8 shows the high-scoring sub-windows of subjects 9 (top row) and 14 (bottom row), respectively. Subject 9 had the best average pairwise classification accuracy (Fig. 3), while subject 14 had the worst. Each row of each panel corresponds to a candidate event

Fig. 8. Predicted Friend (left panel) or Foe (right panel) events for subjects 9 (top row) and 14 (bottom row). The scores appear in decreasing order using a timing tolerance of
two sub-windows. The predicted events are aligned with respect to their sub-window 0, color-coded by their closest true event/subject response. The marked sub-windows
on either side of zero correspond to other true events that appear within +/− 2 s (16 sub-windows) on either side of the predicted events. All non-zero scores are displayed,
with dotted lines marking the retrieval threshold.

(step j of Table 1) for either Friend (left panel) or Foe (right panel) as the positive class. The sub-windows for the candidate events are aligned at zero, with candidate events arranged vertically in order of decreasing score.

Fig. 8 represents each candidate event by a short horizontal bar, marking the sub-window during which it occurs by color-coding based on the relationship of this event to the closest true event as described in Table 6.

Predicted events displayed in red are within two sub-windows of an actual foe event and have been recognized by the subject as foe by a correct button press (CR, Correct Response). If the closest true event is a Foe event that the subject incorrectly identified (IR, Incorrect Response) as a Friend event, the predicted event appears in orange. If the closest true event was a Foe event in which the subject subsequently missed the button press (NR, No Response), the sub-window appears in pink. Similarly, if the closest true event is a friend, the response options appear in blue (CR), aqua (IR), or purple (NR), respectively. If no true event is within two sub-windows of this predicted event, the sub-window appears in black. The "No event" condition marks a situation in which no Friend or Foe event was marked in the data. The panels only display the scores corresponding to the candidate events of step j of the annotation algorithm of Table 1. The dotted line marks the score corresponding to the adaptive threshold cutoff of step k of the algorithm.

Table 4
Characteristics of the VEP data collection and preprocessing.

Variable                       Values
Channels                       64
Sampling rate                  512 Hz
Number of subjects             18
Dataset length                 563 s (average length/subject)
Friend vs. others samples      235 Friend vs. 1070 others (average/subject)
Foe vs. others samples         34 Foe vs. 3991 others (average/subject)
Sample window size             8 sub-windows (1 s)
Sub-window length              0.125 s
Window offset                  One sub-window (0.125 s)
Sub-bands                      8 bands
Feature size                   64 channels × 8 sub-windows × 8 sub-bands
Bad channels and referencing   PREP pipeline
Artifact removal               MARA

Table 6
True event categories broken down by subject response.

Code  Category      Description
CRa   Foe w/CR      Foe image with correct response
IRa   Foe w/IR      Foe image with incorrect response
NRa   Foe w/NR      Foe image with no response
CRb   Friend w/CR   Friend image with correct response
IRb   Friend w/IR   Friend image with incorrect response
NRb   Friend w/NR   Friend image with no response
No    No event      Neither friend nor foe

In addition to the predicted events, which are aligned at sub-window 0, Fig. 8 also displays all of the other true events within +/− 2 s (16 sub-windows) of the predicted event. Since subjects viewed images at roughly 2-s intervals, the display area of each graph in Fig. 8 roughly corresponds to three true events. The true events directly before and after the high-scoring predicted events align well with the actual timing of the true events. In the case of the Foe events, a number of the lower-scoring predictions do not correspond to true events, but the predictions seem to be directly out of phase with actual events of any type. Very few predicted events appear in areas where no event of either type occurs. Subject 9 only has 33 Foe events, and most Foe predictions are clustered at the top of the scoring order. The challenge for annotation when the events are truly unknown is to decide where to set the annotation threshold with no prior knowledge of how many events actually appear in the data.

Table 7 shows the results for Subject 9 in more detail. Many of the incorrect predictions correspond to actual events from the other class. The problem with the Foe events in Table 7 is that the default classifier thresholds include more events than actually exist in the data. Thresholds based on predicting the top 30 events give much better Foe event accuracies.

Fig. 9 shows how the distribution of annotation scores can give an indication of the success from the unlabeled data. The top left graph, corresponding to Friend events for Subject 9, is typical of annotations with high performance. The Gaussians representing Negative and Positive annotations are well-separated. The second row in the first column, representing Friend events for Subject 1, is typical of moderate performance. The two Gaussians representing Negative and Positive scores are moderately well-separated, and the leftover distribution for the Positive fit is reasonably Gaussian in shape. The remaining cases have much lower performance. These types of distributions indicate that one should be more conservative in accepting annotations with scores near the cutoff.

3.5. The EEG-Annotate toolbox

We have released an open-source MATLAB toolbox called EEG-Annotate on Github (http://github.com/VisLab/EEG-Annotate) and have released as a data note the VEP dataset used for training and testing. The toolbox supports four base classification methods: LDA, ARRLS, ARRLSMod, and ARRLSimb. ARRLSMod and ARRLSimb are our modifications of the original ARRLS algorithm to better generate initial pseudo-labels and to handle highly imbalanced datasets. The classifiers are designed for batch processing on a directory of datasets and produce score data structures that can then be processed by the annotator. The toolbox includes report-generating functions to create the graphs and tables presented in this paper and to evaluate the statistical precision using bootstrapping. The toolbox also contains functions for preprocessing EEG and generating features. The annotation pipeline does not require that training and testing datasets correspond to EEG. The only dependence of the toolbox on EEG occurs in the preprocessing and feature generation steps of the process. The toolbox depends on EEGLAB (Delorme and Makeig, 2004b), PREP (Bigdely-Shamlo et al., 2015), and MARA (Winkler et al., 2014). We have found that for EEG annotation, artifact removal is essential to prevent the annotator from being confounded by artifacts such as blinks. The pipeline is fully automated and requires no user intervention. We are making the data used in this study available on NITRC (https://www.nitrc.org/)

Table 5
Average performance for leave-one-subject-out annotation of VEP with subjects grouped by pairwise classification accuracy: Good (2, 4, 5, 8, 9, 11, 18), Moderate (1, 3, 7, 10,
13, 15, 16), and Poor (6, 12, 14, 17). Numbers in parentheses are performance when hits are counted for both classes.

Training class Metric Tolerance Subjects grouped by pairwise accuracy

Good Moderate Poor All

Friend Precision 0 0.65 (0.72) 0.41 (0.45) 0.18 (0.20) 0.45 (0.50)
1 0.83 (0.94) 0.67 (0.77) 0.29 (0.34) 0.65 (0.74)
2 0.83 (0.95) 0.69 (0.80) 0.37 (0.43) 0.68 (0.78)
Recall 0 0.71 (0.69) 0.35 (0.34) 0.14 (0.13) 0.44 (0.43)
1 0.90 (0.90) 0.59 (0.59) 0.23 (0.24) 0.63 (0.63)
2 0.91 (0.91) 0.61 (0.62) 0.28 (0.29) 0.65 (0.65)
RAP 0 0.57 (0.58) 0.19 (0.20) 0.04 (0.04) 0.30 (0.31)
1 0.82 (0.89) 0.44 (0.50) 0.09 (0.11) 0.51 (0.57)
2 0.83 (0.90) 0.47 (0.54) 0.14 (0.17) 0.54 (0.59)
Foe Precision 0 0.19 (0.45) 0.12 (0.32) 0.05 (0.15) 0.13 (0.34)
1 0.25 (0.80) 0.20 (0.65) 0.09 (0.33) 0.20 (0.63)
2 0.25 (0.83) 0.22 (0.74) 0.12 (0.45) 0.21 (0.71)
Recall 0 0.72 (0.30) 0.45 (0.16) 0.21 (0.08) 0.50 (0.19)
1 0.92 (0.49) 0.79 (0.32) 0.42 (0.19) 0.76 (0.36)
2 0.93 (0.51) 0.84 (0.37) 0.51 (0.26) 0.80 (0.40)
RAP 0 0.38 (0.18) 0.16 (0.08) 0.02 (0.01) 0.22 (0.11)
1 0.59 (0.45) 0.42 (0.25) 0.07 (0.06) 0.41 (0.28)
2 0.60 (0.47) 0.45 (0.30) 0.09 (0.11) 0.43 (0.32)

Table 7
Breakdown of predicted positives by category of true event with +/− two sub-windows for Subject 9. The event category codes are from Table 6. The fraction correctly
annotated appears in the last column. Subject 9 has 237 Friend events and 33 Foe events.

Class   CRa  IRa  NRa  CRb  IRb  NRb  No  Correct

Friend  29   4    0    226  5    0    4   0.86
Foe     29   4    0    202  5    0    6   0.13

Fig. 9. Distribution of non-zero annotation scores for Friend (left column) or Foe (right column) events for subjects 9 (top row), 1 (middle row), and 14 (bottom row). A Gaussian is fit to the largest peak (medium gray line, denoted Negative) and removed. A second Gaussian is fit to the remaining scores (denoted Positive). Amplitudes of the fitted Gaussians are scaled to the maximum value of their respective distributions.
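A minimal sketch of the two-Gaussian decomposition described in this caption appears below. It uses a crude moment-based fit with assumed variable names (scores holds the non-zero annotation scores) and is meant only to illustrate the idea, not to reproduce the fitting routine used for Fig. 9.

    % Minimal sketch: fit a Gaussian to the dominant peak, remove it, fit a second.
    [counts, edges] = histcounts(scores, 50);
    centers = (edges(1:end-1) + edges(2:end)) / 2;
    gauss = @(x, a, mu, sig) a * exp(-(x - mu).^2 / (2 * sig^2));
    [aNeg, iMax] = max(counts);                       % dominant (Negative) peak
    muNeg  = centers(iMax);
    sigNeg = sqrt(sum(counts .* (centers - muNeg).^2) / sum(counts));
    residual = max(counts - gauss(centers, aNeg, muNeg, sigNeg), 0);
    [aPos, jMax] = max(residual);                     % remaining (Positive) component
    muPos  = centers(jMax);
    sigPos = sqrt(sum(residual .* (centers - muPos).^2) / sum(residual));

Well-separated Negative and Positive components, as in the Subject 9 panels, are the signature of high annotation performance discussed in the text.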

4. Discussion

Event identification and annotation present several challenges: one rarely has sufficient time-locked data for a single subject to develop an accurate classifier to label continuous data not used to develop the classifier. In addition, inter-subject classification is notoriously difficult for EEG because of the variability of individual subject responses. Further, it is difficult to establish the accuracy of the annotations when only unlabeled data is available.

This work demonstrates the feasibility of automated annotation by developing a specific practical implementation and applying it to an 18-subject dataset: we annotate unlabeled data from one subject based on labels from the other 17 subjects, using the time-locked stimulus-response labels as "ground truth" for verification. We demonstrate that even a system based on simple features can provide useful and detailed annotations for unlabeled EEG and present exemplar cases where the system found event codes that were erroneously missing from a dataset. We emphasize that many choices could be made for the required components, and multiple approaches can be federated to improve accuracy.
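Schematically, the leave-one-subject-out protocol used for this verification can be organized as in the sketch below; trainAnnotator, annotateSubject, and evaluateAnnotation are hypothetical placeholders rather than the toolbox's actual interface.

    % Hypothetical leave-one-subject-out annotation over the 18-subject collection.
    subjects = 1:18;
    for test = subjects
        trainSubjects = setdiff(subjects, test);               % the other 17 subjects
        model = trainAnnotator(trainSubjects);                 % uses only labeled data
        annotation = annotateSubject(model, test);             % test data treated as unlabeled
        results(test) = evaluateAnnotation(annotation, test);  % compare to withheld event codes
    end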

Many implementations of the annotation system outlined above are possible, and success depends on factors such as the features selected, the classifiers used, the method of balancing the training sets, and the similarity of the tasks and subjects. However, annotation can label not only events in the data for subsequent time-locked analysis, but can also identify times when a subject's brain response is "similar" to the pattern exhibited by other subjects during other tasks. Thus, we might apply this type of automated brain response labeling in much the same way as researchers use automated genome annotation systems to conditionally identify genes and their potential functions (Yandell and Ence, 2012). In genomics, these provisional annotations have led to many experiments designed to verify the annotations and to characterize individual genomic variability (Boyle et al., 2012). Similarly, brain response annotation may lead to new insights into brain function, connectivity, and association of response with tasks, once sufficient annotated data becomes available. At the very least, these methods can be used to improve the overall robustness of existing labels, and eventually whole databases, by leveraging their inherent redundancy to provide means to identify cases where data and metadata may be missing or aberrant.

In this work, we start the sub-windows at the beginning of the dataset rather than time-locking the positive samples, treating the labeled training and unlabeled test data in the same fashion. Because of the lack of time-locking in the test data, we focus on features that are not particularly sensitive to exact timing. Examples of time-insensitive features include total power, bag-of-words (BOW), autoregressive (AR), and spectral features. We have also experimented with both time-locked and non-time-locked positive training windows and found very little difference in the annotation results for these particular features. Other choices of features may be more sensitive to positive sample time-locking. For simplicity, in this work we use features based on total power by channel, since the datasets we consider have roughly the same channel configurations.
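As an illustration of a per-channel total power feature of this kind (a sketch under assumed settings; the sub-window length and normalization actually used may differ), one can slide a fixed-length sub-window along the recording and compute the mean squared amplitude in each channel.

    % Minimal sketch: total power per channel in consecutive sub-windows.
    % EEG.data is channels x frames and EEG.srate is in Hz (EEGLAB-style fields).
    subLen = round(0.125 * EEG.srate);              % assumed 0.125-s sub-window length
    nSub   = floor(size(EEG.data, 2) / subLen);
    nChan  = size(EEG.data, 1);
    features = zeros(nSub, nChan);
    for w = 1:nSub
        seg = EEG.data(:, (w-1)*subLen + 1 : w*subLen);
        features(w, :) = (sum(seg.^2, 2) / subLen).';   % mean squared amplitude per channel
    end

Because such features summarize each sub-window without reference to an event marker, they can be computed identically for the labeled training data and the continuous unlabeled test data.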
Simple power features contain temporal and frequency information, but the spatial information is headset dependent. For cross-headset leave-one-subject-out annotation, bag-of-words (BOW) features give higher classification accuracies on the VEP cross-headset collection (Su and Robbins, 2015). We are exploring the possibility of using common dictionaries across collections to annotate a variety of tasks and headsets using the same core classifiers.

Many choices of parameters can be manipulated beyond the choice of features. Selection of classifier is one such choice. We have used several classifiers: Linear Discriminant Analysis (LDA), Adaptation Regularization using Regularized Least Squares (ARRLS), ARRLSMod (ARRLS modified for pseudo-labels), and ARRLSimb (ARRLS modified to handle imbalance). More recently we have experimented with deep learning classification using convolutional networks and power features, but the results of pairwise classification, particularly for the Foe vs. others case, are poor. Work on artificially expanding the training set and using more sophisticated network approaches is ongoing.

We use straightforward methods for computing and combining sub-window scores after the initial classification. Predictions of a target tend to cluster near the target but usually spread over adjacent sub-windows due to timing variations and the contribution of each sub-window to multiple windows. Applying a weighted window average and then thresholding gives better predictions than simply thresholding each score. Furthermore, since sub-windows adjacent to an event have high scores, we use a zero-out procedure to avoid double counting. Other weight matrices are possible, and the results appear to be insensitive to the exact choice of weights. We have considered only symmetric weights, but asymmetric weights are also possible and may be more appropriate in some cases.
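One simple way to realize the weighting, thresholding, and zero-out steps just described is sketched below; the weight vector and the threshold rule are assumptions chosen for illustration rather than the exact values used in our implementation.

    % Minimal sketch: weighted averaging of sub-window scores, thresholding,
    % and zeroing out neighbors so that each event is counted only once.
    % subScores is an assumed vector of per-sub-window classifier scores.
    w = [0.25 0.5 1 0.5 0.25];                        % assumed symmetric weight window
    smoothed = conv(subScores, w / sum(w), 'same');   % weighted window average
    threshold = mean(smoothed) + 2 * std(smoothed);   % assumed threshold rule
    remaining = smoothed;
    events = [];
    while true
        [peak, idx] = max(remaining);
        if peak < threshold, break; end
        events(end+1) = idx;                          % sub-window of a detected event
        lo = max(1, idx - 2);
        hi = min(numel(remaining), idx + 2);
        remaining(lo:hi) = 0;                         % zero out neighbors to avoid double counting
    end
    events = sort(events);

An asymmetric weight vector could be substituted in the first line of the sketch without changing the rest of the procedure.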
Classifier fusion is a classical problem in machine learning, and there are many potential ways to combine results from multiple classifiers (Ho et al., 1994; Kuncheva et al., 2001; Hou et al., 2014; Woźniak et al., 2014; Britto et al., 2014; Cruz et al., 2015). We have only considered the most straightforward approaches, but more sophisticated methods may be useful. For example, one could use many different annotation methods and then apply weight-of-evidence approaches based on Dempster-Shafer theory (Liu et al., 2014) or Bayesian networks (Zhang et al., 2013b). Recent work in spectral ranking (Parisi et al., 2014) and variational methods (Nguyen et al., 2016) also shows potential for improving the output from federations of classifiers beyond simple addition of scores or votes. The goal of this work is not to provide an exhaustive examination of the possibilities, but rather to demonstrate the feasibility of annotation of EEG data. Much work has also been done in the image retrieval and web search communities on re-ranking algorithms (Mei et al., 2014), and this work is potentially applicable to EEG annotation.
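For reference, the "simple addition of scores or votes" mentioned above amounts to something like the following sketch, where scoreMatrix is an assumed sub-windows-by-classifiers matrix of annotation scores (an illustration only, not code from the toolbox).

    % Minimal sketch: combining several classifiers by normalized score addition or voting.
    z = (scoreMatrix - mean(scoreMatrix)) ./ std(scoreMatrix);  % z-score each classifier column
    combined = sum(z, 2);                                        % additive fusion of scores
    votes = sum(scoreMatrix > 0, 2);                             % or count positive votes

More elaborate fusion schemes, such as the weight-of-evidence approaches cited above, would replace these sums with learned or probabilistic combinations.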
While annotation generally places more emphasis on precision (reducing the number of false positives), good recall results can be useful input for downstream re-ranking or for additional voting schemes that combine results from other types of features or methods.

Automated annotation of EEG is a new direction in EEG analysis that is oriented towards enabling discovery in non-ideal scenarios. As EEG moves into more realistic cases or non-laboratory locations (McDowell et al., 2013), researchers can no longer clearly mark and time-lock to events. Analyses will become increasingly challenging for multiple reasons. First, consistent, precise time locking is extremely difficult, if not impossible, to achieve in many situations due to the complexity of real-world events, where event markers may have to be added post hoc during manual review (such as identification of events from a video feed taken in the field). Additionally, with added experimental or environmental complexity, the probability increases that events are missed or overlooked due to faulty equipment or human error during post-hoc encoding. The approaches developed here can help with these challenges and facilitate analyses through proper annotation.

Note also that a data scenario such as the VEP can provide "prototypical" oddball training to search for oddball responses in other types of data. More advanced annotation systems can use multiple feature sets in conjunction with other types of classifiers and combination methods as input to weight-of-evidence approaches. One can envision the creation of EEG databases with provisional annotations, in much the same way as genomic databases have used machine learning to identify potential genes. With such large-scale databases at hand, one might use this information to identify potentially similar subjects and prototypical brain responses. This could also enable examining enrichment approaches: determining when features and responses tend to co-occur and beginning to isolate the features that characterize these responses. Such databases could also provide the core infrastructure for organizing knowledge about brain responses and behavior.

This paper focused on a relatively clean dataset with well-defined events known to have prototypical responses. Even with extremely simple spectral power features, the system was able to successfully identify most events without using any labeled data and with no knowledge of when the events occurred in the test dataset. Our next step is to apply these algorithms on a larger scale to determine how often such a system is likely to identify such events in EEG that are not associated with a stereotypical oddball experiment. We believe that this system is the first step in applying large-scale annotation to build EEG databases of brain responses.
The EEG-Annotate MATLAB toolbox is freely available at http://github.com/VisLab/EEG-Annotate. The VEP dataset is available in a companion data note.

Conflict of interest statement

All three authors ascertain that they have no conflicts of interest in this work.

Acknowledgments

The authors would like to thank Scott Kerick and Amar Marathe of the Army Research Laboratory for their comments and suggestions. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0022. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Arroyo, S., Lesser, R.P., Gordon, B., Uematsu, S., Jackson, D., Webber, R., 1993. Functional significance of the mu rhythm of human cortex: an electrophysiologic study with subdural electrodes. Electroencephalogr. Clin. Neurophysiol. 87 (September (3)), 76–87.
Bigdely-Shamlo, N., Mullen, T., Kothe, C., Su, K.-M., Robbins, K.A., 2015. The PREP pipeline: standardized preprocessing for large-scale EEG analysis. Front. Neuroinf. 9.
Blankertz, B., Lemm, S., Treder, M., Haufe, S., Müller, K.-R., 2011. Single-trial analysis and classification of ERP components — a tutorial. Neuroimage 56 (May (2)), 814–825.
Boyle, A.P., et al., 2012. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22 (September (9)), 1790–1797.
Britto Jr., A.S., Sabourin, R., Oliveira, L.E.S., 2014. Dynamic selection of classifiers—a comprehensive review. Pattern Recognit. 47 (November (11)), 3665–3680.
Buzsáki, G., 2002. Theta oscillations in the hippocampus. Neuron 33 (January (3)), 325–340.
Congedo, M., Korczowski, L., Delorme, A., Lopes da Silva, F., 2016. Spatio-temporal common pattern: a companion method for ERP analysis in the time domain. J. Neurosci. Methods 267 (July), 74–88.
Cruz, R.M.O., Sabourin, R., Cavalcanti, G.D.C., Ing Ren, T., 2015. META-DES: a dynamic ensemble selection framework using meta-learning. Pattern Recognit. 48 (May (5)), 1925–1935.
Delorme, A., Makeig, S., 2004a. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134 (March (1)), 9–21.
Delorme, A., Makeig, S., 2004b. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134 (March (1)), 9–21.
Deselaers, T., Keysers, D., Ney, H., 2007. Features for image retrieval: an experimental comparison. Inf. Retr. 11 (December (2)), 77–107.
Dietrich, A., Kanso, R., 2010. A review of EEG, ERP, and neuroimaging studies of creativity and insight. Psychol. Bull. 136 (September (5)), 822–848.
Engel, A.K., Fries, P., 2010. Beta-band oscillations — signalling the status quo? Curr. Opin. Neurobiol. 20 (April (2)), 156–165.
Frishkoff, G.A., Frank, R.M., Rong, J., Dou, D., Dien, J., Halderman, L.K., 2007. A framework to support automated classification and labeling of brain electromagnetic patterns. Comput. Intell. Neurosci. 2007 (December), e14567.
Frishkoff, G., et al., 2011. Minimal Information for Neural Electromagnetic Ontologies (MINEMO): a standards-compliant method for analysis and integration of event-related potentials (ERP) data. Stand. Genomic Sci. 5 (November (2)), 211–223.
Hairston, W.D., et al., 2014. Usability of four commercially-oriented EEG systems. J. Neural Eng. 11 (August (4)), 046018.
Ho, T.K., Hull, J.J., Srihari, S.N., 1994. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 16 (January (1)), 66–75.
Hou, J., E, X., Xia, Q., Qi, N.-M., 2014. Evaluating classifier combination in object classification. Pattern Anal. Appl. 18 (February (4)), 799–816.
Jung, T.-P., Makeig, S., Westerfield, M., Townsend, J., Courchesne, E., Sejnowski, T.J., 2001. Analysis and visualization of single-trial event-related potentials. Hum. Brain Mapp. 14 (November (3)), 166–185.
Kellihan, B., et al., 2013. A real-world neuroimaging system to evaluate stress. In: Schmorrow, D.D., Fidopiastis, C.M. (Eds.), Foundations of Augmented Cognition. Springer, Berlin Heidelberg, pp. 316–325.
Kiebel, S.J., Friston, K.J., 2004a. Statistical parametric mapping for event-related potentials: I. Generic considerations. Neuroimage 22 (June (2)), 492–502.
Kiebel, S.J., Friston, K.J., 2004b. Statistical parametric mapping for event-related potentials (II): a hierarchical temporal model. Neuroimage 22 (June (2)), 503–520.
Klimesch, W., 1999. EEG alpha and theta oscillations reflect cognitive and memory performance: a review and analysis. Brain Res. Brain Res. Rev. 29 (April (2–3)), 169–195.
Kuncheva, L.I., Bezdek, J.C., Duin, R.P.W., 2001. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognit. 34 (February (2)), 299–314.
Larson, M.J., Carbine, K.A., 2017. Sample size calculations in human electrophysiology (EEG and ERP) studies: a systematic review and recommendations for increased rigor. Int. J. Psychophysiol. 111 (January), 33–41.
Lemm, S., Blankertz, B., Dickhaus, T., Müller, K.-R., 2011. Introduction to machine learning for brain imaging. Neuroimage 56 (May (2)), 387–399.
Lin, C.-T., Pal, N.R., Chuang, C.-Y., Jung, T.-P., Ko, L.-W., Liang, S.-F., 2008. An EEG-based subject- and session-independent drowsiness detection. IEEE International Joint Conference on Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence), 3448–3454.
Liu, H., Frishkoff, G., Frank, R., Dou, D., 2012a. Sharing and integration of cognitive neuroscience data: metric and pattern matching across heterogeneous ERP datasets. Neurocomputing 92 (September), 156–169.
Liu, Y., Chen, W., Yang, S., Huang, K., 2012b. Domain adaptation to automatic classification of neonatal amplitude-integrated EEG. 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA), 131–136.
Liu, Y., Wang, X., Liu, K., 2014. Network anomaly detection system with optimized DS evidence theory. Sci. World J. 2014 (August), e753659.
Long, M., Wang, J., Ding, G., Pan, S.J., Yu, P.S., 2014. Adaptation regularization: a general framework for transfer learning. IEEE Trans. Knowl. Data Eng. 26 (May (5)), 1076–1089.
Lotte, F., Congedo, M., Lécuyer, A., Lamarche, F., Arnaldi, B., 2007. A review of classification algorithms for EEG-based brain–computer interfaces. J. Neural Eng. 4 (2), R1.
Müller, K.-R., Tangermann, M., Dornhege, G., Krauledat, M., Curio, G., Blankertz, B., 2008. Machine learning for real-time single-trial EEG-analysis: from brain–computer interfacing to mental state monitoring. J. Neurosci. Methods 167 (January (1)), 82–90.
Maris, E., Oostenveld, R., 2007. Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods 164 (August (1)), 177–190.
McDowell, K., et al., 2013. Real-world neuroimaging technologies. IEEE Access 1, 131–149.
Mei, T., Rui, Y., Li, S., Tian, Q., 2014. Multimedia search reranking: a literature survey. ACM Comput. Surv. 46 (January (3)), 38:1–38:38.
Mumtaz, W., Malik, A.S., Yasin, M.A.M., Xia, L.K., 2015. Review on EEG and ERP predictive biomarkers for major depressive disorder. Biomed. Signal Process. Control 22 (September), 85–98.
Nguyen, T.T., Nguyen, T.T.T., Pham, X.C., Liew, A.W.-C., 2016. A novel combining classifier method based on Variational Inference. Pattern Recognit. 49 (January), 198–212.
Palva, S., Palva, J.M., 2007. New vistas for α-frequency band oscillations. Trends Neurosci. 30 (April (4)), 150–158.
Parisi, F., Strino, F., Nadler, B., Kluger, Y., 2014. Ranking and combining multiple predictors without labeled data. Proc. Natl. Acad. Sci. 111 (January (4)), 1253–1258.
Ries, A.J., Touryan, J., Vettel, J., McDowell, K., Hairston, W.D., 2014. A comparison of electroencephalography signals acquired from conventional and mobile systems. J. Neurosci. Neuroeng. 3 (February (1)), 10–20.
Rivet, B., Souloumiac, A., Attina, V., Gibert, G., 2009. xDAWN algorithm to enhance evoked potentials: application to brain-computer interface. IEEE Trans. Biomed. Eng. 56 (August (8)), 2035–2043.
Roach, B.J., Mathalon, D.H., 2008. Event-related EEG time-frequency analysis: an overview of measures and an analysis of early gamma band phase locking in schizophrenia. Schizophr. Bull. 34 (September (5)), 907–926.
Rousselet, G.A., Husk, J.S., Pernet, C.R., Gaspar, C.M., Bennett, P.J., Sekuler, A.B., 2009. Age-related delay in information accrual for faces: evidence from a parametric, single-trial EEG approach. BMC Neurosci. 10 (1), 114.
Souloumiac, A., Rivet, B., 2013. Improved estimation of EEG evoked potentials by jitter compensation and enhancing spatial filters. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 1222–1226.
Su, K.M., Robbins, K.A., 2015. Space-time frequency bag of words models for capturing EEG variability: a comprehensive study. UTSA-CS-TR-2015-001.
Su, K.M., Hairston, W.D., Robbins, K.A., 2016. Adaptive thresholding and reweighting to improve domain transfer learning for unbalanced data with applications to EEG imbalance. 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 320–325.
Winkler, I., Haufe, S., Tangermann, M., 2011. Automatic classification of artifactual ICA-components for artifact removal in EEG signals. Behav. Brain Funct. 7 (1), 1–12.
Winkler, I., Brandl, S., Horn, F., Waldburger, E., Allefeld, C., Tangermann, M., 2014. Robust artifactual independent component classification for BCI practitioners. J. Neural Eng. 11 (3), 035013.
Woźniak, M., Graña, M., Corchado, E., 2014. A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16 (March), 3–17.
Wu, D., Lance, B., Lawhern, V., 2014. Transfer learning and active transfer learning for reducing calibration data in single-trial classification of visually-evoked potentials. 2014 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2801–2807.
Wu, D., Lawhern, V.J., Lance, B.J., 2015. Reducing offline BCI calibration effort using weighted adaptation regularization with source domain selection. 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 3209–3216.
Yandell, M., Ence, D., 2012. A beginner's guide to eukaryotic genome annotation. Nat. Rev. Genet. 13 (May (5)), 329–342.
Zhang, R., Xu, P., Guo, L., Zhang, Y., Li, P., Yao, D., 2013a. Z-Score linear discriminant analysis for EEG based brain-computer interfaces. PLoS One 8 (September (9)), e74433.
Zhang, H., Yang, H., Guan, C., 2013b. Bayesian learning for spatial filtering in an EEG-based brain–computer interface. IEEE Trans. Neural Netw. Learn. Syst. 24 (July (7)), 1049–1060.
