Association Rule Mining in Peer-to-Peer Systems

Ran Wolff and Assaf Schuster



Technion – Israel Institute of Technology

Email: {ranw,assaf}@cs.technion.ac.il

Abstract

We extend the problem of association rule mining – a key data mining problem – to systems in which the database is partitioned among a very large number of computers that are dispersed over a wide area. Such computing systems include GRID computing platforms, federated database systems, and peer-to-peer computing environments. The scale of these systems poses several difficulties, such as the impracticality of global communications and global synchronization, dynamic topology changes of the network, on-the-fly data updates, the need to share resources with other applications, and the frequent failure and recovery of resources.

We present an algorithm by which every node in the system can reach the exact solution, as if it were given the combined database. The algorithm is entirely asynchronous, imposes very little communication overhead, transparently tolerates network topology changes and node failures, and quickly adjusts to changes in the data as they occur. Simulations of up to 10,000 nodes show that the algorithm is local: all rules, except for those whose confidence is about equal to the confidence threshold, are discovered using information gathered from a very small vicinity, whose size is independent of the size of the system.

1 Introduction

The problem of association rule mining (ARM) in large transactional databases was first introduced in 1993 [1]. The input to the ARM problem is a database in which objects are grouped by context. An example would be a list of items grouped by the transaction in which they were bought. The objective of ARM is to find sets of objects which tend to associate with one another. Given two distinct sets of objects, X and Y, we say X is associated with Y if the appearance of X in a certain context usually implies that Y will appear in that context as well. The output of an ARM algorithm is a list of all the association rules that appear frequently in the database and for which the association is confident.

ARM has been the focus of great interest among data mining researchers and practitioners. It is today widely accepted to be one of the key problems in the data mining field. Over the years many variations were described for ARM, and a wide range of applications were developed. The overwhelming majority of these deal with sequential ARM algorithms. Distributed association rule mining (D-ARM) was defined in [2], not long after the definition of ARM, and was also the subject of much research (see, for example, [2, 5, 15, 7, 8]).

In recent years, database systems have undergone major changes. Databases are now detached from the computing servers and have become distributed in most cases. The natural extension of these two changes is the development of federated databases – systems which connect many different databases and present a single database image. The trend toward ever more distributed databases goes hand in hand with an ongoing trend in large organizations toward ever greater integration of data. For example, health maintenance organizations (HMOs) envision their medical records, which are stored in thousands of clinics, as one database. This integrated view of the data is imperative for essential data analysis applications ranging from epidemic control and ailment and treatment pattern discovery to the detection of medical fraud or misconduct. Similar examples of this imperative are common in other fields, including credit card companies, large retail networks, and more.

An especially interesting example of large scale distributed databases are peer-to-peer systems. These systems include GRID computing environments such as Condor [10] (20,000 computers), specific area computing systems such as SETI@home [12] (1,800,000 computers) or UnitedDevices [14] (2,200,000 computers), general purpose peer-to-peer platforms such as Entropia [6] (60,000 peers), and file sharing networks such as Kazza (1.8 million peers). Like any other system, large scale distributed systems maintain and produce operational data.

* This work was supported in part by Microsoft Academic Foundation.
However, in contrast to other systems, that data is distributed so widely that it will usually not be feasible to collect it for central processing. It must be processed in place by distributed algorithms suitable to this kind of computing environment.

Consider, for example, mining user preferences over the Kazza file sharing network. The files shared through Kazza are usually rich media files such as songs and videos. Participants in the network reveal the files they store on their computers to the system and gain access to files shared by their peers in return. Obviously, this database may contain interesting knowledge which is hard to come by using other means. It may be discovered, for instance, that people who download The Matrix also look for songs by Madonna. Such knowledge can then be exploited in a variety of ways, much like the well known data mining example stating that "customers who purchase diapers also buy beer".

The large-scale distributed association rule mining (LSD-ARM) problem is very different from the D-ARM problem, because a database that is composed of thousands of partitions is very different from a small scale distributed database. The scale of these systems introduces a plethora of new problems which have not yet been addressed by any ARM algorithm. The first such problem is that in a system that large there can be no global synchronization. This has two important consequences for any algorithm proposed for the problem: The first is that the nodes must act independently of one another; hence their progress is speculative, and intermediate results may be overturned as new data arrives. The second is that there is no point in time in which the algorithm is known to have finished; thus, nodes have no way of knowing that the information they possess is final and accurate. At each point in time, new information can arrive from a far-away branch of the system and overturn the node's picture of the correct result. The best that can be done in these circumstances is for each node to maintain an assumption of the correct result and update it whenever new data arrives. Algorithms that behave this way are called anytime algorithms.

Another problem is that global communication is costly in large scale distributed systems. This means that for all practical purposes the nodes should compute the result through local negotiation. Each node can only be familiar with a small set of other nodes – its immediate neighbors. It is by exchanging information with their immediate neighbors concerning their local databases that nodes investigate the combined, global database.

A further complication comes from the dynamic nature of large scale systems. If the mean time between failures of a single node is 20,000 hours¹, a system consisting of 100,000 nodes could easily fail five times per hour. Moreover, many such systems are purposely designed to support the dynamic departure of nodes. This is because a system that is based on utilizing free resources on non-dedicated machines should be able to withstand scheduled shutdowns for maintenance, accidental turnoffs, or an abrupt decrease in availability when the user comes back from lunch. The problem is that whenever a node departs, the database on that node may disappear with it, changing the global database and the result of the computation. A similar problem occurs when nodes join the system in mid-computation.

Obviously none of the distributed ARM algorithms developed for small-scale distributed systems can manage a system with the aforementioned features. These algorithms focus on achieving parallelization induced speed-ups. They use basic operators, such as broadcast, global synchronization, and a centralized coordinator, none of which can be managed in large-scale distributed systems. To the best of our knowledge, no D-ARM algorithm presented so far acknowledges the possibility of failure. Some relevant work was done in the context of incremental ARM, e.g., [13], and similar algorithms. In these works the set of rules is adjusted following changes in the database. However, we know of no parallelizations of those algorithms, even for small-scale distributed systems.

In this paper we describe an algorithm which solves LSD-ARM. Our first contribution is the inference that the distributed association rule mining problem is reducible to the well-studied problem of distributed majority votes. Building on this inference, we develop an algorithm which combines sequential association rule mining, executed locally at each node, with a majority voting protocol to discover, at each node, all of the association rules that exist in the combined database. During the execution of the algorithm, which in a dynamic system may never actually terminate, each node maintains an ad hoc solution. If the system remains static, then the ad hoc solution of most nodes will quickly converge toward an exact solution. This is the same solution that would be reached by a sequential ARM algorithm had all the databases been collected and processed. If the static period is long enough, then all nodes will reach this solution. However, in a dynamic system, where nodes dynamically join or depart and the data changes over time, the changes are quickly and locally adjusted to, and the solution continues to converge. It is worth mentioning that no previous ARM algorithm was proposed which mines rules (not itemsets) on the fly. This contribution may affect other kinds of ARM algorithms, especially those intended for data streams [9].

The majority voting protocol, which is at the crux of our algorithm, is in itself a significant contribution. It requires no synchronization between the computing nodes.

¹ This figure is accepted for hardware; for software the estimate is usually a lot lower.
Each node communicates only with its immediate neighbors. Moreover, the protocol is local: in the overwhelming majority of cases, each node computes the majority – i.e., identifies the correct rules – based upon information arriving from a very small surrounding environment. Locality implies that the algorithm is scalable to very large networks. Another outcome of the algorithm's locality is that the communication load it produces is small and roughly uniform, thus making it suitable for non-dedicated environments.

2 Problem Definition

The association rule mining (ARM) problem is traditionally defined as follows: Let I = {i_1, ..., i_m} be the items in a certain domain. An itemset is some subset X ⊆ I. A transaction t is also a subset of I, associated with a unique transaction identifier TID. A database DB is a list that contains |DB| transactions. Given an itemset X and a database DB, Support(X, DB) is the number of transactions in DB which contain all the items of X, and Freq(X, DB) = Support(X, DB) / |DB|. For some frequency threshold 0 < MinFreq < 1, we say that an itemset X is frequent in a database DB if Freq(X, DB) ≥ MinFreq, and infrequent otherwise. For two distinct frequent itemsets X and Y, and a confidence threshold 0 < MinConf < 1, we say the rule X ⇒ Y is confident in DB if Freq(X ∪ Y, DB) ≥ MinConf · Freq(X, DB). We will call confident rules between frequent itemsets correct and the remaining rules false. The solution of the ARM problem is R[DB], the list of all the correct rules in the given database.

When the database is dynamically updated, that is, transactions are added to it or deleted from it over time, we denote by DB^t the database at time t. Now consider that the database is also partitioned among an unknown number of share-nothing machines (nodes); we denote by DB_u^t the partition of node u at time t. Given an infrastructure through which those machines may communicate data, we denote by G(u,t) the group of machines reachable from u at time t. We will assume symmetry, i.e., v ∈ G(u,t) ⇔ u ∈ G(v,t). Nevertheless, G(u,t) may or may not include all of the machines. The solution to the large-scale distributed association rule mining (LSD-ARM) problem for node u at time t is the set of rules which are correct in the combined databases of the machines in G(u,t); we denote this solution R(u,t).

Since both G(u,t) and DB_v^t of each v ∈ G(u,t) are free to vary with time, so does R(u,t). It is thus imperative that u calculate not only the eventual solution but also approximated, ad hoc, solutions. We term ~R(u,t) the ad hoc solution of node u at time t. It is common practice to measure the performance of an anytime algorithm according to its recall, |~R(u,t) ∩ R(u,t)| / |R(u,t)|, and its precision, |~R(u,t) ∩ R(u,t)| / |~R(u,t)|. We require that if both G(u,t) and DB_v^t of each v ∈ G(u,t) remain static long enough, then the approximated solution ~R(u,t) will converge to R(u,t). In other words, both the precision and the recall of u converge to a hundred percent.

Throughout this work we make some simplifying assumptions. We assume that node connectivity is a forest – for each u and v there could be either one route from u to v or none. Trees in the forest may split, join, grow, and shrink, as a result of node crash and recovery, or departure and join. We assume the failure model of computers is fail-stop, and that a node is informed of changes in the status of adjacent nodes.
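The definitions above can be illustrated with a short sketch. The helper names below (`support`, `freq`, `is_confident`) and the toy database are illustrative, not part of the paper:

```python
def support(itemset, db):
    """Support(X, DB): number of transactions that contain every item of X."""
    s = set(itemset)
    return sum(1 for t in db if s <= set(t))

def freq(itemset, db):
    """Freq(X, DB) = Support(X, DB) / |DB|."""
    return support(itemset, db) / len(db)

def is_confident(x, y, db, min_conf):
    """The rule X => Y is confident if Freq(X u Y, DB) >= MinConf * Freq(X, DB)."""
    return freq(set(x) | set(y), db) >= min_conf * freq(x, db)

# a toy database of four transactions
db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
```

For instance, with this database `support({"a", "b"}, db)` is 2, and the rule {a} ⇒ {b} is confident at MinConf = 0.6 since Freq({a, b}) = 0.5 while 0.6 · Freq({a}) = 0.45.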
3 An ARM Algorithm for Large-Scale Distributed Systems

As previously described, our algorithm is comprised of two rather independent components: Each node executes a sequential ARM algorithm which traverses the local database and maintains the current result. Additionally, each node participates in a distributed majority voting protocol which makes certain that all nodes that are reachable from one another converge toward the correct result according to their combined databases. We will begin by describing the protocol and then proceed to show how the full algorithm is derived from it.

3.1 LSD-Majority Protocol

It has been shown in [11] that a distributed ARM algorithm can be viewed as a decision problem in which the participating nodes must decide whether or not each itemset is frequent. However, the algorithm described in that work extensively uses broadcast and global synchronization; hence it is only suitable for small-scale distributed systems. We present here an entirely different majority voting protocol – LSD-Majority – which works well for large-scale distributed systems. In the interest of clarity we describe the protocol assuming the data at each node is a single bit. We will later show how the protocol can easily be generalized for frequency counts.

As in LSD-ARM, the purpose of LSD-Majority is to ensure that each node converges toward the correct majority. Since the majority problem is binary, we measure the recall as the proportion of nodes whose ad hoc solution is one when the majority in G(u,t) is of set bits, or zero when the majority in G(u,t) is of unset bits. The protocol dictates how nodes react when the data changes, a message is received, or a neighboring node is reported to have detached or joined.
The nodes communicate by sending messages that contain two integers: count, which stands for the number of bits this message reports, and sum, which is the number of those bits which are equal to one. Each node u will record, for every neighbor v, the last message it sent to v – ⟨sum^{uv}, count^{uv}⟩ – and the last message it received from v – ⟨sum^{vu}, count^{vu}⟩. Node u calculates the following two functions of these messages and its own local bit:

Δ^u = s^u + Σ_{vu ∈ E_u} sum^{vu} − λ (c^u + Σ_{vu ∈ E_u} count^{vu})

Δ^{uv} = sum^{uv} + sum^{vu} − λ (count^{uv} + count^{vu})

Here E_u is the set of edges colliding with u, s^u is the value of the local bit, and c^u is, for now, one. Δ^u measures the number of excess set bits u has been informed of. Δ^{uv} measures the number of excess set bits u and v have last reported to one another. Each time s^u changes, a message is received, or a node connects to u or disconnects from u, Δ^u is recalculated; Δ^{uv} is recalculated each time a message is sent to or received from v.

Each node performs the protocol independently with each of its immediate neighbors. Node u coordinates its majority decision with node v by maintaining the same Δ^{uv} value (note that Δ^{uv} = Δ^{vu}) and making certain that Δ^{uv} will not mislead v into believing that the global majority is larger than it actually is. As long as Δ^u ≥ Δ^{uv} ≥ 0 and Δ^v ≥ Δ^{vu} ≥ 0, there is no need for u and v to exchange data. They both calculate the majority of the bits to be set; thus, the majority in their combined data must be of set bits. If, on the other hand, Δ^{uv} > Δ^u, then v might mistakenly calculate Δ^v ≥ 0 because it has not received the updated data from u. Thus, in this case the protocol dictates that u send a message with sum^{uv} = s^u + Σ_{wu ∈ E_u, w ≠ v} sum^{wu} and count^{uv} = c^u + Σ_{wu ∈ E_u, w ≠ v} count^{wu}. Note that after this message is sent, Δ^{uv} = Δ^u.

The opposite case is almost the same. Again, if 0 > Δ^{uv} ≥ Δ^u and 0 > Δ^{vu} ≥ Δ^v, there is no need for u and v to exchange data. However, when Δ^u > Δ^{uv}, the protocol dictates that u send v a message calculated the same way. The only difference is that when no messages were sent or received, u assumes, by default, that Δ^{uv} < 0 and Δ^{vu} < 0. Thus, unless Δ^u ≥ 0, u does not send messages to v, because the majority bits in their combined data cannot be set.

Algorithm 1 LSD-Majority

Input for node u: the set of edges that collide with it, E_u; a bit s^u; and the majority ratio λ.
Output: The algorithm never terminates. Nevertheless, at each point in time, if Δ^u ≥ 0 then the output is 1; otherwise it is 0.
Definitions: Δ^u = s^u + Σ_{vu ∈ E_u} sum^{vu} − λ (c^u + Σ_{vu ∈ E_u} count^{vu}), Δ^{uv} = sum^{uv} + sum^{vu} − λ (count^{uv} + count^{vu})
Initialization: For each vu ∈ E_u set sum^{vu}, count^{vu}, sum^{uv}, count^{uv} to 0. Set c^u = 1.
On edge vu recovery: Add vu to E_u. Set sum^{vu}, count^{vu}, sum^{uv}, count^{uv} to 0.
On failure of edge vu ∈ E_u: Remove vu from E_u.
On message ⟨sum, count⟩ received over edge vu: Set sum^{vu} = sum, count^{vu} = count.
On change in s^u, edge failure or recovery, or the receiving of a message:
  For each vu ∈ E_u:
    If (count^{uv} + count^{vu} = 0 and Δ^u ≥ 0), or (count^{uv} + count^{vu} > 0 and either (Δ^{uv} < 0 and Δ^u > Δ^{uv}) or (Δ^{uv} ≥ 0 and Δ^u < Δ^{uv})):
      Set sum^{uv} = s^u + Σ_{wu ∈ E_u, w ≠ v} sum^{wu} and count^{uv} = c^u + Σ_{wu ∈ E_u, w ≠ v} count^{wu}
      Send ⟨sum^{uv}, count^{uv}⟩ over vu to v

The pseudocode of the LSD-Majority protocol is given in Algorithm 1. It is easy to see that when the protocol dictates that no node needs to send any message, either Δ^v ≥ 0 for all nodes v ∈ G(u,t), or Δ^v < 0 for all of them. If there is disagreement in G(u,t), then there must be disagreement between two immediate neighbors, in which case at least one node must send data, which will cause count^{uv} + count^{vu} to increase. This number is bounded by the size of G(u,t); hence, the protocol always reaches consensus in a static state. It is less trivial to show that the conclusion the nodes arrive at is the correct one; this proof is too long to include in this context.

In order to generalize LSD-Majority for frequency counts, c^u need only be set to the size of the local database and s^u to the local support of an itemset. Then, if we substitute MinFreq for λ, the resulting protocol will decide whether an itemset is frequent or infrequent in G(u,t). Deciding whether a rule X ⇒ Y is confident is also straightforward using this protocol: c^u should now count the number of transactions in the local database that include X, s^u should count the number of these transactions that include both X and Y, and λ should be replaced with MinConf.

Deciding whether a rule is correct or false requires that each node run two instances of the protocol: one to decide whether the rule is frequent and the other to decide whether it is confident. Note, however, that for all rules of the form X ⇒ Z ∖ X, only one instance of the protocol should be performed to decide whether Z is frequent.
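The per-neighbor bookkeeping of the protocol can be sketched as a minimal single-node class. This is an illustrative sketch, not the paper's code: the class and method names are mine, and the majority ratio λ is fixed at 1/2.

```python
LAMBDA = 0.5  # majority ratio lambda

class Node:
    def __init__(self, bit):
        self.s, self.c = bit, 1  # local bit s^u and local count c^u
        self.sent = {}           # neighbor -> (sum_uv, count_uv), last message sent
        self.recv = {}           # neighbor -> (sum_vu, count_vu), last message received

    def delta_u(self):
        """Excess of set bits the node has been informed of, relative to lambda."""
        s = self.s + sum(m[0] for m in self.recv.values())
        c = self.c + sum(m[1] for m in self.recv.values())
        return s - LAMBDA * c

    def delta_uv(self, v):
        """Excess of set bits u and v have last reported to one another."""
        su, cu = self.sent.get(v, (0, 0))
        sv, cv = self.recv.get(v, (0, 0))
        return (su + sv) - LAMBDA * (cu + cv)

    def message_for(self, v):
        """Return the message u must send to v, or None if none is required."""
        du, duv = self.delta_u(), self.delta_uv(v)
        exchanged = self.sent.get(v, (0, 0))[1] + self.recv.get(v, (0, 0))[1]
        must_send = (exchanged == 0 and du >= 0) or \
                    (exchanged > 0 and ((duv < 0 and du > duv) or
                                        (duv >= 0 and du < duv)))
        if not must_send:
            return None
        # report everything u knows, except what v itself reported
        s = self.s + sum(m[0] for w, m in self.recv.items() if w != v)
        c = self.c + sum(m[1] for w, m in self.recv.items() if w != v)
        self.sent[v] = (s, c)
        return (s, c)
```

With this sketch, a fresh node holding a set bit must report it (Δ^u = 0.5 ≥ 0), while a node holding an unset bit stays silent by default, matching the prose above.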
The strength of the protocol lies in its behavior when the average of the bits over G(u,t) is somewhat different than the majority threshold λ. Defining the significance of the input as the normalized distance of the average bit from the threshold, (Σ_v s^v / Σ_v c^v − λ) / λ, we will show in Section 4.1 that even a minor significance, on the scale of ±0.1, is sufficient for making a correct decision using data from just a small number of nodes. In other words, even a minor significance is sufficient for the algorithm to become local. Another strength of the protocol is that during static periods, most of the nodes will make the correct majority decision very quickly. These two features make LSD-Majority especially well-suited for LSD-ARM, in which the overwhelming majority of the candidates are far from significant.

3.2 Majority-Rule Algorithm

LSD-Majority efficiently decides whether a candidate rule is correct or false. It remains to show how candidates are generated and how they are counted in the local database. The full algorithm must satisfy two requirements: First, each node must take into account not only the local data, but also data brought to it by LSD-Majority, as this data may indicate that additional rules are correct and thus further candidates should be generated. Second, unlike other algorithms, which produce rules after they have finished discovering all itemsets, an algorithm which never really finishes discovering all itemsets must generate rules on the fly. Therefore the candidates it uses must be rules, not itemsets. We now present an algorithm – Majority-Rule – which satisfies both requirements.

The first requirement is rather easy to satisfy. We simply increment the counters of each rule according to the data received. Additionally, we employ a candidate generation approach that is not levelwise: as in the DIC algorithm [4], we periodically consider all the correct rules, regardless of when they were discovered, and attempt to use them for generating new candidates.

The second requirement, mining rules directly rather than mining itemsets first and producing rules when the algorithm terminates, has not, to the best of our knowledge, been addressed in the literature. To satisfy this requirement we generalize the candidate generation procedure of Apriori [3]. Apriori generates candidate itemsets in two ways: Initially, it generates candidate itemsets of size one, {i} for every i ∈ I. Later, candidates of size k + 1 are generated by finding pairs of frequent itemsets of size k that differ only by the last item – X ∪ {i_1} and X ∪ {i_2} – and validating that all of the subsets of X ∪ {i_1, i_2} are also frequent before making that itemset a candidate. In this way, Apriori generates the minimal candidate set which must be generated by any deterministic algorithm.

Algorithm 2 Majority-Rule

Input for node u: the set of edges that collide with it, E_u; the local database DB_u; MinFreq; MinConf; H.
Initialization: Set C = {(∅ ⇒ {i}) : i ∈ I}. For each r ∈ C set r.sum = r.count = 0 and r.λ = MinFreq. For each r ∈ C and every vu ∈ E_u set r.sum^{uv} = r.count^{uv} = r.sum^{vu} = r.count^{vu} = 0.
Upon receiving ⟨r.id, sum, count⟩ from a neighbor v: If r = (X ⇒ Y) ∉ C, add it to C. If r' = (∅ ⇒ X ∪ Y) ∉ C, add it too. Set r.sum^{vu} = sum, r.count^{vu} = count.
On edge vu recovery: Add vu to E_u. For all r ∈ C set r.sum^{uv} = r.count^{uv} = r.sum^{vu} = r.count^{vu} = 0.
On failure of edge vu ∈ E_u: Remove vu from E_u.
Main: Repeat the following forever:
  Read the next transaction, t. If it is the last one in DB_u, iterate back to the first one.
  For every r = (X ⇒ Y) ∈ C which was generated after this transaction was last read:
    If X ⊆ t, increase r.count
    If X ∪ Y ⊆ t, increase r.sum
  Once every H transactions:
    Set ~R(u,t) to the set of rules r = (X ⇒ Y) ∈ C such that Δ^u(r) ≥ 0 and, for r' = (∅ ⇒ X ∪ Y), Δ^u(r') ≥ 0
    For every r = (X ⇒ Y) ∈ ~R(u,t) such that X = ∅, and every i ∈ Y: if r' = (Y ∖ {i} ⇒ {i}) ∉ C, insert r' into C with r'.sum = r'.count = 0, r'.λ = MinConf, and r'.id = a unique rule id
    For each r_1 = (X ⇒ Y ∪ {i_1}) and r_2 = (X ⇒ Y ∪ {i_2}) ∈ ~R(u,t) such that i_2 > i_1: if r' = (X ⇒ Y ∪ {i_1, i_2}) ∉ C and, for every i' ∈ Y ∪ {i_1, i_2}, (X ⇒ Y ∪ {i_1, i_2} ∖ {i'}) ∈ ~R(u,t), add r' to C with r'.sum = r'.count = 0, r'.λ = r_1.λ, and r'.id = a unique rule id
    For each r = (X ⇒ Y) ∈ C and for every vu ∈ E_u:
      If (r.count^{uv} + r.count^{vu} = 0 and Δ^u(r) ≥ 0), or (r.count^{uv} + r.count^{vu} > 0 and either (Δ^{uv}(r) < 0 and Δ^u(r) > Δ^{uv}(r)) or (Δ^{uv}(r) ≥ 0 and Δ^u(r) < Δ^{uv}(r))):
        Set r.sum^{uv} = r.sum + Σ_{wu ∈ E_u, w ≠ v} r.sum^{wu} and r.count^{uv} = r.count + Σ_{wu ∈ E_u, w ≠ v} r.count^{wu}
        Send ⟨r.id, r.sum^{uv}, r.count^{uv}⟩ over vu to v
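The candidate generation steps in the main loop above can be sketched as follows. This is an illustrative sketch of the generalized Apriori criterion, not the paper's code: the function name and the (lhs, rhs) frozenset representation are my own.

```python
def new_candidates(correct_rules):
    """Candidate rules derived from the current ad hoc set of correct rules.

    correct_rules holds (lhs, rhs) pairs of frozensets.
    """
    cands = set()
    # from each correct rule {} => Z, propose Z \ {i} => {i} for every i in Z
    for lhs, rhs in correct_rules:
        if not lhs:
            for i in rhs:
                cands.add((rhs - {i}, frozenset({i})))
    # pair rules X => Y + {i1} and X => Y + {i2}; keep X => Y + {i1, i2} only
    # if every rule X => (Y + {i1, i2}) \ {i'} is already known to be correct
    rules = list(correct_rules)
    for l1, r1 in rules:
        for l2, r2 in rules:
            if l1 == l2 and r1 != r2 and len(r1) == len(r2) \
                    and len(r1 | r2) == len(r1) + 1:
                union = r1 | r2
                if all((l1, union - {i}) in correct_rules for i in union):
                    cands.add((l1, union))
    return cands - set(correct_rules)
```

For example, once ∅ ⇒ {a, b} is believed correct, the sketch proposes {a} ⇒ {b} and {b} ⇒ {a}; and once X ⇒ {a} and X ⇒ {b} are both believed correct, it proposes X ⇒ {a, b}.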
In the case of Majority-Rule, the dynamic nature of the system means that it is never certain whether an itemset is frequent or a rule is correct. Thus, it is impossible to guarantee that no superfluous candidates are generated. Nevertheless, at any point t during execution, it is worthwhile to use the ad hoc set of rules, ~R(u,t), to try and limit the number of candidate rules. Our candidate generation criterion is thus a generalization of Apriori's criterion. Each node generates initial candidate rules of the form ∅ ⇒ {i} for each i ∈ I. Then, for each rule ∅ ⇒ Z ∈ ~R(u,t), it generates Z ∖ {i} ⇒ {i} candidate rules for all i ∈ Z. In addition to these initial candidate rules, the node will look for pairs of rules in ~R(u,t) which have the same left-hand side, and right-hand sides that differ only in the last item – X ⇒ Y ∪ {i_1} and X ⇒ Y ∪ {i_2}. The node will verify that the rules X ⇒ Y ∪ {i_1, i_2} ∖ {i'}, for every i' ∈ Y ∪ {i_1, i_2}, are also correct, and then generate the candidate X ⇒ Y ∪ {i_1, i_2}. It can be inductively proved that if ~R(u,t) contains only correct rules, then no superfluous candidate rules are ever generated using this method.

The rest of Majority-Rule is straightforward. Whenever a candidate is generated, the node will begin to count its support and confidence in the local database. At the same time, the node will also begin two instances of LSD-Majority, one for the candidate's frequency and one for its confidence, and these will determine whether this rule is globally correct. Since each node runs multiple instances of LSD-Majority concurrently, messages must carry, in addition to sum and count, the identification of the rule they refer to, r.id. We will denote by Δ^u(r) and Δ^{uv}(r) the result of the previously defined functions when they refer to the counters r.sum and r.count of candidate r. Finally, r.λ is the majority threshold that applies to r. We set r.λ to MinFreq for rules with an empty left-hand side and to MinConf for all other rules.

The pseudocode of Majority-Rule is detailed in Algorithm 2.

4 Experimental Results

To evaluate Majority-Rule's performance, we implemented a simulator capable of running thousands of simulated computers. We simulated 1600 such computers, connected in a random tree overlaid on a 40 × 40 grid. We also implemented a simulator for a stand-alone instance of the LSD-Majority protocol and ran simulations of up to 10,000 nodes on a 100 × 100 grid. The simulations were run in lock-step, not because the algorithm requires that the computers work in lock-step – the algorithm poses no such limitations – but rather because properties such as convergence and locality are best demonstrated when all processors have the same speed and all messages are delivered in unit time.

We used synthetic databases generated by the standard tool from the IBM Quest data mining group [3]. We generated three synthetic databases – T5.I2, T10.I4 and T20.I6 – where the number after T is the average transaction length and the number after I is the average pattern length. The combined size of each of the three databases is 10,000,000 transactions. Other than the number of transactions, the only change we made from the defaults was reducing the number of patterns. This was reduced so as to increase the proportion of correct rules from one in ten thousand to one in a hundred candidates. Because our algorithm performs better for false rules than for correct ones, this change does not impair the validity of the results.

4.1 Locality of LSD-Majority and Majority-Rule

The LSD-Majority protocol, and consequently the Majority-Rule algorithm, are local algorithms in the sense that a correct decision will usually require that only a small subset of the data be gathered. We measure the locality of an algorithm by the average and maximum size of the environment of nodes. The environment is defined in LSD-Majority as the number of input bits received by the node, and in Majority-Rule as the percent of the global database reported to the node, until system stabilization. Our experiments with LSD-Majority show that its locality strongly depends on the significance of the input, (Σ_v s^v / Σ_v c^v − λ) / λ.

Figure 1(a) describes the results of a simulation of 10,000 nodes in a random tree over a grid, with various percentages of set input bits at the nodes. It shows that when the significance is ±0.1 (i.e., 45% or 55% of the nodes have set input bits), the protocol already has good locality: the maximal environment is about 1200 nodes and the average size a little over 300. If the percentage of set input bits is closer to the threshold, a large portion of the data would have to be collected in order to find the majority. In the worst possible case, when the number of set input bits is equal to the number of unset input bits plus one, at least one node would have to collect all of the input bits before the solution could be reached. On the other hand, if the percentage of set input bits is further from the threshold, then the average environment size becomes negligible. In many cases different regions of the grid may not exchange any messages at all. In Figure 1(c) these results repeat themselves for Majority-Rule.

Further analysis – Figure 1(b) – shows that the size of a node's environment depends on the significance in a small region around the node. That is, if the inputs of nodes are independent of one another, then the environment size will be random. This makes our algorithms fair: a node's performance is not determined by its connectivity or location but rather by the data.
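To make the locality measure concrete, here is a small sketch of the significance computation, assuming, as defined above, that significance is the normalized distance of the average input bit from the majority threshold; the function name is illustrative:

```python
def significance(bits, majority_ratio=0.5):
    """(average of the input bits - lambda) / lambda."""
    avg = sum(bits) / len(bits)
    return (avg - majority_ratio) / majority_ratio

# 55% set input bits at lambda = 0.5 corresponds to a significance of 0.1,
# and 45% set bits to a significance of -0.1
bits = [1] * 55 + [0] * 45
```

This matches the ±0.1 example used in the experiments: a 45%/55% split already yields enough significance for a local decision.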
[Figure omitted. Panels: (a) worst, average, and best environment sizes vs. percentage of set bits; (b) environment sizes at 40% set bits; (c) locality of T20.I6 on a 40 by 40 grid vs. rule significance.]

Figure 1. The locality of LSD-Majority (a) and of Majority-Rule (c) depends on the significance. The distribution of environment sizes (b) depends on the local significance and is hence random.

4.2 Convergence and Cost of Majority-Rule

In addition to locality, the other two important characteristics of Majority-Rule are its convergence rate and its communication cost. We measure convergence by calculating the recall – the percentage of rules uncovered – and the precision – the percentage of correct rules among all rules assumed correct – vis-a-vis the number of transactions scanned. Figure 2 describes the convergence of the recall (a) and of the precision (b). In (c) the convergence of stand-alone LSD-Majority is given, for various percentages of set input bits.

To understand the convergence of Majority-Rule, one must look at how the candidate generation and the majority voting interact. Rules which are very significant are expected to be generated early and agreed upon fast. The same holds for false candidates with extremely low significance. They too are generated early, because they usually arise from noise, which subsides rapidly as a greater portion of the local database is scanned; the convergence of LSD-Majority will be as quick for them as for rules with high significance. This leaves us with the group of marginal candidates, those that are very near the threshold. These marginal candidates are hard to agree upon, and in some cases, if one of their subsets is also marginal, they may only be generated after the algorithm has been working for a long time. We remark that marginal candidates are as difficult for other algorithms as they are for Majority-Rule. For instance, DIC may suffer from the same problem: if all rules were marginal, then its number of database scans would be as large as that of Apriori.

An interesting feature of LSD-Majority convergence is that the number of nodes that assume a majority of set bits always increases in the first few rounds. This results in a sharp reduction in accuracy in the case of a majority of unset bits, and an overshoot, above the otherwise exponential convergence, in the case of a majority of set bits. This occurs because our protocol operates in expanding wavefronts, convincing more and more nodes that there is a certain majority, and then retreating, with many nodes being convinced that the majority is the opposite. Since we assume by default a majority of zeros, the first wavefront that expands is always about a majority of ones. Interestingly enough, the same pattern can be seen in the convergence of Majority-Rule (more clearly for the precision than for the recall).

Figure 3 presents the communication cost of LSD-Majority vis-a-vis the percentage of set input bits, and of Majority-Rule vis-a-vis rule significance. For rules that are very near the threshold, a lot of communication is required, on the scale of the grid diameter. For significant rules the communication load is about ten messages per rule per node. However, for false candidates the communication load drops very fast to nearly no messages at all. It is important to keep in mind that we count every pair of integers we send as a message. In a realistic scenario, a message would contain many such integer pairs.
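The recall and precision measures used above are straightforward to compute. In this sketch, rule sets are represented simply as sets of identifiers, which is a hypothetical encoding chosen for illustration:

```python
def recall_precision(mined, correct):
    """Recall: fraction of the correct rules that were uncovered.
    Precision: fraction of the rules assumed correct that really are."""
    mined, correct = set(mined), set(correct)
    true_pos = mined & correct
    recall = len(true_pos) / len(correct) if correct else 1.0
    precision = len(true_pos) / len(mined) if mined else 1.0
    return recall, precision

# Two of three correct rules were found, and two of three mined
# rules are correct: recall = precision = 2/3.
r, p = recall_precision({"A->B", "B->C", "C->D"},
                        {"A->B", "B->C", "D->E"})
```

Plotting these two quantities against the number of transactions scanned yields convergence curves of the kind shown in Figure 2.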

[Figure 2: (a) average recall for T5.I2, T10.I4, and T20.I6 vs. number of database scans; (b) precision for T5.I2, T10.I4, and T20.I6 vs. number of database scans; (c) convergence on 10,000 nodes (percent of nodes vs. steps), for various percentages of set bits.]

Figure 2. Convergence of the recall and precision of Majority-Rule, and of LSD-Majority.

[Figure 3: (a) per-node messages regarding a rule vs. rule significance; (b) max, average, and min per-node messages vs. percentage of set bits.]

Figure 3. Communication characteristics of Majority-Rule (a), and of LSD-Majority (b). Each message here is a pair of integers.
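Since the cost accounting counts every transmitted integer pair as one message, a real implementation would amortize overhead by batching many pairs into a single network packet. A minimal sketch of such batching follows; the packet capacity is a hypothetical parameter, not a value taken from the paper:

```python
def batch_pairs(pairs, pairs_per_packet=100):
    """Group (rule_id, count) integer pairs into fixed-size packets,
    reducing the per-message overhead of a real network."""
    return [pairs[i:i + pairs_per_packet]
            for i in range(0, len(pairs), pairs_per_packet)]

# 250 integer pairs fit into 3 packets (100 + 100 + 50).
pairs = [(rule, 1) for rule in range(250)]
packets = batch_pairs(pairs)
```

Under such batching, the per-node message counts reported in Figure 3 would translate into an even smaller number of actual packets.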

on the size of the output – the number of correct rules. That number is controllable via user-supplied parameters, namely MinFreq and MinConf.

References

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM SIGMOD Int'l. Conference on Management of Data, pages 207–216, Washington, D.C., June 1993.
[2] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, 1996.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l. Conference on Very Large Databases (VLDB'94), pages 487–499, Santiago, Chile, September 1994.
[4] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD Record, 6(2):255–264, June 1997.
[5] D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In Proc. of the 1996 Int'l. Conference on Parallel and Distributed Information Systems, pages 31–44, Miami Beach, Florida, December 1996.
[6] Entropia. http://www.entropia.com.
[7] E.-H. S. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3):352–377, 2000.
[8] J.-L. Lin and M. H. Dunham. Mining association rules: Anti-skew algorithms. In Proceedings of the 14th Int'l. Conference on Data Engineering (ICDE'98), pages 486–493, 1998.
[9] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China, August 2002.
[10] The Condor Project. http://www.cs.wisc.edu/condor/.
[11] A. Schuster and R. Wolff. Communication-efficient distributed mining of association rules. In Proc. of the 2001 ACM SIGMOD Int'l. Conference on Management of Data, pages 473–484, Santa Barbara, California, May 2001.
[12] Seti@home. http://setiathome.ssl.berkeley.edu/.
[13] S. Thomas and S. Chakravarthy. Incremental mining of constrained associations. In HiPC, pages 547–558, 2000.
[14] United Devices Inc. http://www.ud.com/home.htm.
[15] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1(4):343–373, 1997.
