
Master's in New Journalisms, Political Communication and Knowledge Society
Course: Research Methodology in Communication and Journalism

Introduction to Data Analysis:
Research Examples
Tomás Baviera
Universitat Politècnica de València
October 2019
Is it possible to quantify something as complex as personality?
1978: NEO (3 factors)
1985: NEO-PI (5 factors)

Robert R. McCrae Paul Costa


Methodology used to obtain computer-based judgments and estimate the self-other agreement.

Participants and their Likes are represented as a matrix, where entries are set to
1 if there exists an association between a participant and a Like and 0 otherwise
(second panel). The matrix is used to fit five LASSO linear regression models
(16), one for each self-rated Big Five personality trait (third panel). A 10-fold
cross-validation is applied to avoid overfitting. The models are built on participants
having at least 20 Likes.
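The pipeline in this caption (binary user-by-Like matrix, one LASSO model per trait, 10-fold cross-validation, accuracy reported as a correlation) can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic, the names `fit_lasso` and `cross_val_predict` are hypothetical helpers, the penalty `alpha=0.02` is an arbitrary assumption, and only one trait is modelled instead of five. Only the 20-Like inclusion rule and the 10 folds come from the caption.

```python
import random

MIN_LIKES = 20  # inclusion rule quoted in the caption


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the LASSO shrinkage step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso(X, y, alpha, n_iter=30):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]  # residuals of the current model
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # this Like never occurs in the training fold
            # Covariance of Like j with the residual that ignores Like j
            rho = sum(X[i][j] * (resid[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                w[j] = new_w
        shift = sum(resid) / n  # refit the unpenalised intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, w


def predict(b0, w, row):
    return b0 + sum(wj * xj for wj, xj in zip(w, row))


def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def cross_val_predict(X, y, alpha, k=10):
    """Predict every user with a model that never saw that user."""
    n = len(X)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        b0, w = fit_lasso([X[i] for i in train], [y[i] for i in train], alpha)
        for i in range(fold, n, k):
            preds[i] = predict(b0, w, X[i])
    return preds


# Synthetic stand-in data: 1 = the user liked the page, 0 = no association.
rng = random.Random(0)
likes = [[1 if rng.random() < 0.55 else 0 for _ in range(40)] for _ in range(150)]
likes = [row for row in likes if sum(row) >= MIN_LIKES]
# Hypothetical self-rated trait driven by three Likes plus noise.
trait = [1.0 * r[0] + 0.8 * r[1] - 0.6 * r[2] + rng.gauss(0, 0.3) for r in likes]

predicted = cross_val_predict(likes, trait, alpha=0.02, k=10)
accuracy_r = pearson_r(predicted, trait)
print(f"users kept: {len(likes)}, out-of-fold r = {accuracy_r:.2f}")
```

Scoring each user with a model fitted on the other nine folds is what keeps the reported correlation from being inflated by overfitting.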
Computer-based personality judgment accuracy (y axis), plotted against the number of Likes
available for prediction (x axis).

The red line represents the average accuracy of computers’ judgment across the
five personality traits. The five-trait average accuracy of human judgments is
positioned onto the computer accuracy curve. For example, the accuracy of an
average human individual (r = 0.49) is matched by that of the computer models
based on around 90–100 Likes. The gray ribbon represents the 95% CI.
The external validity of personality judgments and self-ratings across the range of life outcomes,
expressed as correlation (continuous variables; Upper) or AUC (dichotomous variables; Lower).

The red, yellow, and blue bars indicate the external validity of self-ratings, human
judgments, and computer judgments, respectively. For example, self-rated scores
allow predicting network size with accuracy of r = 0.23, human judgments achieve
r = 0.17 accuracy (or 0.06 less than self-ratings). Compound variables (variables
represented across a few subvariables) are marked with an asterisk.
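For the dichotomous outcomes, the caption reports AUC rather than a correlation. A minimal sketch of what that number means, assuming made-up scores and labels: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the example values are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical personality-based scores and a yes/no life outcome.
example_scores = [0.1, 0.4, 0.35, 0.8]
example_labels = [0, 0, 1, 1]
print(auc(example_scores, example_labels))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why it suits yes/no outcomes where a correlation coefficient is harder to interpret.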
European Commission - Press release

A Europe that protects: EU reports on progress in fighting disinformation ahead of European Council

Brussels, 14 June 2019

Today the Commission and the High Representative report on the progress achieved in the
fight against disinformation and the main lessons drawn from the European elections, as a
contribution to the discussions by EU leaders next week.

Protecting our democratic processes and institutions from disinformation is a major challenge for
societies across the globe. To tackle this, the EU has demonstrated leadership and put in place a robust
framework for coordinated action, with full respect for European values and fundamental rights.
Today's joint Communication sets out how the Action Plan against Disinformation and the Elections
Package have helped to fight disinformation and preserve the integrity of the European Parliament
elections.

High Representative/Vice President Federica Mogherini, Vice-President for the Digital Single Market
Andrus Ansip, Commissioner for Justice, Consumers and Gender Equality Věra Jourová,
Commissioner for the Security Union Julian King, and Commissioner for the Digital Economy and
Society Mariya Gabriel said in a joint statement:

“The record high turnout in the European Parliament elections has underlined the increased interest of
citizens in European democracy. Our actions, including the setting-up of election networks at national
and European level, helped in protecting our democracy from attempts at manipulation.

We are confident that our efforts have contributed to limit the impact of disinformation operations,
including from foreign actors, through closer coordination between the EU and Member States.
However, much remains to be done. The European elections were not after all free from
disinformation; we should not accept this as the new normal. Malign actors constantly change their
strategies. We must strive to be ahead of them. Fighting disinformation is a common, long-term
challenge for EU institutions and Member States.

Ahead of the elections, we saw evidence of coordinated inauthentic behaviour aimed at spreading
divisive material on online platforms, including through the use of bots and fake accounts. So online
platforms have a particular responsibility to tackle disinformation. With our active support, Facebook,
Google and Twitter have made some progress under the Code of Practice on disinformation. The latest
monthly reports, which we are publishing today, confirm this trend. We now expect online platforms to
maintain momentum and to step up their efforts and implement all commitments under the Code.”

While it is still too early to draw final conclusions about the level and impact of disinformation in the
recent European Parliament elections, it is clear that the actions taken by the EU – together
with
How can we tell a real account from an automated one?
Structure
1,150 numerical values associated with the account, which represent its behavior
[…] identified a large, representative sample of users by monitoring a Twitter stream, accounting
for approximately 10% of public tweets, for 3 months starting in October 2015. This approach
avoids known biases of other methods such as snowball and breadth-first sampling, which rely on
the selection of an initial group of users (Gjoka et al. 2010; Morstatter et al. 2013). We focus
on English speaking users as they represent the largest group on Twitter (Mocanu et al. 2013).
To restrict our sample to recently active users, we introduce the further criteria that they must
have produced at least 200 tweets in total and 90 tweets during the three-month observation
window (one per day on average). Our final sample includes approximately 14 million user accounts
that meet both criteria. For each of these accounts, we collected their tweets through the
Twitter Search API. We restricted the collection to the most recent 200 tweets and 100 mentions
of each user, as described earlier. Owing to Twitter API limits, this greatly improved our data
collection speed. This choice also reduces the response time of our service and API. However the
limitation adds noise to the features, due to the scarcity of data available to compute them.

Figure 1: ROC curves of models trained and tested on different datasets. Accuracy is measured by AUC.

Manual Annotations
We computed classification scores for each of the active accounts using our initial classifier
trained on the honeypot dataset. We then grouped accounts by their bot scores, allowing us to
evaluate our system across the spectrum of human and bot accounts without being biased by the
distribution of bot scores. We randomly sampled 300 accounts from each bot-score decile. The
resulting balanced set of 3000 accounts were manually annotated by inspecting their public Twitter
profiles. Some accounts have obvious flags, such as using a stock profile image or retweeting
every message of another account within seconds. In general, however, there is no simple set of
rules to assess whether an account is human or bot. With the help of four volunteers, we analyzed
profile appearance, content produced and retweeted, and interactions with other users in terms of
retweets and mentions. Annotators were not given a precise set of instructions to perform the
classification task, but rather shown a consistent number of both positive and negative examples.
The final decisions reflect each annotator’s opinion and are restricted to: human, bot, or
undecided. Accounts labeled as undecided were eliminated from further analysis.
We annotated all 3000 accounts. We will refer to this set of accounts as the manually annotated
data set. Each annotator was assigned a random sample of accounts from each decile. We enforced a
minimum 10% overlap between annotations to assess the reliability of each annotator. This yielded
an average pairwise agreement of 75% and moderate inter-annotator agreement (Cohen’s κ = 0.41). We
also computed the agreement between annotators and classifier outcomes, assuming that a
classification score above 0.5 is interpreted as a bot. This resulted in an average pairwise
agreement of 79% and a moderately high Cohen’s κ = 0.5. These results suggest high confidence in
the annotation process, as well as in the agreement between annotations and model predictions.

Evaluating Models Using Annotated Data
To evaluate our classification system trained on the honeypot dataset, we examined the
classification accuracy separately for each bot-score decile of the manually annotated dataset.
We achieved classification accuracy greater than 90% for the accounts in the (0.0, 0.4) range,
which includes mostly human accounts. We also observe accuracy above 70% for scores in the
(0.8, 1.0) range (mostly bots). Accuracy for accounts in the grey-area range (0.4, 0.8) fluctuates
between 60% and 80%. Intuitively, this range contains the most challenging accounts to […]
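The Manual Annotations passage reports both raw pairwise agreement (75%) and Cohen's κ (0.41). The sketch below shows how the two differ, using made-up labels from two hypothetical annotators: κ subtracts the agreement expected if each annotator labelled at random with their own label frequencies, so it is always lower than raw agreement when some agreement is expected by chance. The function names and the example labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter


def pairwise_agreement(labels_a, labels_b):
    """Share of items on which the two annotators gave the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = pairwise_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement if each annotator labelled independently at their own rates
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight overlapping accounts.
annotator_1 = ["bot", "bot", "human", "human", "human", "bot", "human", "human"]
annotator_2 = ["bot", "human", "human", "human", "bot", "bot", "human", "human"]

print(pairwise_agreement(annotator_1, annotator_2))      # 6 of 8 agree -> 0.75
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.47
```

In this toy case the annotators agree on 75% of accounts, yet κ is only about 0.47 because roughly 53% agreement would be expected by chance alone.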
Figure 2: Distribution of classifier score for human and bot accounts in the two datasets.

• Annotation: We […] annotated accounts and labels assigned by the majority of
annotators. This yields 0.89 AUC, a reasonable accuracy
considering that the dataset contains recent and possibly
sophisticated bots.
• Merged: We merged the honeypot and annotation
datasets for training and testing. The resulting classifier
achieves 0.94 AUC, only slightly worse than the honeypot
(training and test) model although the merged dataset
contains a variety of more recent bots.
• Mixture: Using mixtures with different ratios of accounts
from the manually annotated and honeypot datasets, we
obtain an accuracy ranging between 0.90 and 0.94 AUC.
In Fig 2, we plot the distributions of classification scores
for human and bot accounts according to each dataset. The
mixture model trained on 2K annotated and 10K honeypot
accounts is used to compute the scores. Human accounts
in both datasets have similar distributions, peaked around
0.1. The difference between bots in the two datasets is more
prominent. The distribution of simple, honeypot bots peaks
around 0.9. The newer bots from the manually annotated
dataset have typically smaller scores, with a distribution
peaked around 0.6. They are more sophisticated, and exhibit
characteristics more similar to human behavior. This
raises the issue of how to properly set a threshold on the
score when a strictly binary classification between human
and bots is needed. To infer a suitable threshold, we compute
classification accuracies for varying thresholds considering
all accounts scoring below each threshold as human,
and then select the threshold that maximizes accuracy.

Figure 3: Comparison of scores for different models. Each
account is represented as a point in the scatter plot with a
color determined by its category. Test points are randomly
sampled from our large-scale collection. Pearson correlations
between scores are also reported, along with estimated
thresholds and corresponding accuracies.

We compared scores for accounts in the manually annotated
dataset by pairs of models (i.e. trained with different
mixtures) for labeled human, bot, and a random subset of
accounts (Fig. 3). As expected, both models assign lower
scores for humans and higher for bots. High correlation
coefficients indicate agreement between the models.
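The threshold selection described above (treat every account scoring below the threshold as human, the rest as bot, and keep the threshold with the highest accuracy) can be sketched in a few lines. The scores, labels and candidate grid below are invented for illustration, and the helper names `accuracy_at` and `best_threshold` are hypothetical.

```python
def accuracy_at(threshold, scores, is_bot):
    """Accounts scoring below the threshold are called human, the rest bot."""
    correct = sum((s >= threshold) == b for s, b in zip(scores, is_bot))
    return correct / len(scores)


def best_threshold(scores, is_bot, candidates):
    """Return the candidate threshold with the highest classification accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, is_bot))


# Hypothetical bot scores: six annotated humans followed by six annotated bots.
bot_scores = [0.05, 0.08, 0.12, 0.25, 0.45, 0.72,
              0.22, 0.58, 0.62, 0.74, 0.88, 0.93]
is_bot = [False] * 6 + [True] * 6
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

chosen = best_threshold(bot_scores, is_bot, candidates)
print(chosen, accuracy_at(chosen, bot_scores, is_bot))  # 0.5, with 10 of 12 correct
```

With sophisticated bots scoring closer to humans, the accuracy-maximising threshold can shift, which is why it is estimated from labelled data rather than fixed in advance.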
Feature Importance Analysis
To compare the usefulness of different features, we trained
models using each class of features alone. We achieved the
best performance with user meta-data features; content features
are also effective. Both yielded AUC above 0.9. Other
feature classes yielded AUC above 0.8.
We analyzed the importance of single features using the
Gini impurity score produced by our Random Forests model.
To rank the top features for a given dataset, we randomly select
a subset of 10,000 accounts and compute the top features
across 100 randomized experiments. The top 10 features are
sufficient to reach performance of 0.9 AUC. Sentiment and
content of mentioned tweets are important features along
with the statistical properties of retweet networks. Features
of the friends with whom a user interacts are strong predictors
as well. We observed the redundancy among many correlated
features, such as distribution-type features (cf. Table 1),
especially in the content and sentiment categories.
Further analysis of feature importance is the subject of ongoing
investigation.
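The Gini impurity score mentioned above measures how mixed a group of labels is, and a tree-based model credits a feature with the impurity it removes when it splits the data. The sketch below computes that reduction for a single split on toy data. It is a simplified illustration of the idea, not the Random Forests procedure used in the paper, and the feature names and values are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = human, 1 = bot)."""
    if not labels:
        return 0.0
    p_bot = sum(labels) / len(labels)
    return 1.0 - p_bot ** 2 - (1.0 - p_bot) ** 2


def impurity_decrease(values, labels, cut):
    """Impurity removed by splitting the accounts at values < cut."""
    left = [lab for v, lab in zip(values, labels) if v < cut]
    right = [lab for v, lab in zip(values, labels) if v >= cut]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical accounts: four humans (0) and four bots (1).
account_labels = [0, 0, 0, 0, 1, 1, 1, 1]
retweet_ratio = [0.1, 0.2, 0.3, 0.2, 0.8, 0.9, 0.7, 0.95]  # separates the classes
name_length = [1, 5, 2, 6, 1, 5, 2, 6]                      # unrelated to the label

gain_retweet = impurity_decrease(retweet_ratio, account_labels, cut=0.5)
gain_name = impurity_decrease(name_length, account_labels, cut=3)
print(gain_retweet, gain_name)  # informative feature removes 0.5, the other removes 0.0
```

A forest sums such reductions over every split and every tree, so features that repeatedly produce purer groups end up at the top of the ranking.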
Hypothetical cultural network in which nodes represent actors engaged in conversation
about an advocacy issue and edges between them describe similarities in the content of their
messages.

I argue that advocacy organizations are most likely to stimulate comments from
new social media audiences if they create “cultural bridges,” or produce messages
that connect discursive themes that are seldom discussed together. Such
messages may not only provoke comments from multiple audiences but also put
these audiences into conversations with one another, creating new, hybrid
conversational themes, or “cultural trellises,” within a social media advocacy field.
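A cultural network of the kind described above can be sketched by turning each organization's messages into word counts, linking organizations whose wording is similar, and then looking for a node that connects actors who are not linked to each other. This is a toy illustration under strong assumptions: the organization names and messages are invented, similarity is plain bag-of-words cosine, and the 0.4 cut-off is arbitrary. The paper's own text-processing pipeline is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical messages from five advocacy organizations.
messages = {
    "org_a": "research science funding trials",
    "org_b": "research science funding genetics",
    "org_c": "family support community services",
    "org_d": "family support community schools",
    "org_bridge": "research funding family support",
}
vectors = {name: Counter(text.split()) for name, text in messages.items()}

SIMILARITY_CUTOFF = 0.4  # assumed value; an edge means "similar enough content"
neighbors = {name: set() for name in vectors}
for u, v in combinations(sorted(vectors), 2):
    if cosine(vectors[u], vectors[v]) >= SIMILARITY_CUTOFF:
        neighbors[u].add(v)
        neighbors[v].add(u)

# A bridge links at least two actors that are not linked to each other.
bridges = [
    name for name in sorted(neighbors)
    if any(v not in neighbors[u] for u, v in combinations(sorted(neighbors[name]), 2))
]
print(bridges)  # ['org_bridge']
```

Here the research-themed and family-themed organizations form two separate clusters, and only the message that mixes both vocabularies connects them, which is the structural position the cultural-bridge argument is about.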
Three-stage sampling process used to recruit advocacy organizations to install Facebook
application.

Christopher Andrew Bail PNAS 2016;113:42:11823-11828

©2016 by National Academy of Sciences


Combining natural language processing and network analysis to create cultural networks
between organizations.



Regression models predicting the number of unique people who made substantial comments
about autism advocacy organizations’ posts each day who had not previously commented
upon the organization’s posts, August 2, 2011–December 18, 2012 (n = 18,483 organization
per day observations).
Red circles, standardized coefficients; thick blue lines, 95% confidence intervals; thin blue
lines, 90% confidence intervals.