
Subgroup discovery

  We are not seeking a global model but rather “interesting subgroups”
  Identification of these groups can
–  Provide better understanding of domain
–  Allow for targeted action

Data mining Subgroup Discovery 1 Data mining Subgroup Discovery 2

Requirements for subgroups

  Shows deviating behavior of response variable compared to the whole population
–  E.g. high mean value of response variable
  But as large as possible
  Simple description of group
–  Usually conjunctions of conditions

Algorithms

  CN2-SD
  Data Surveyor
  PRIM
  …
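As a minimal sketch of the requirements for subgroups listed above, the following Python computes the support and the mean response of a subgroup described by a conjunction of conditions, next to the mean in the whole population. The function name `describe_subgroup`, the `(column, operator, value)` encoding of a condition, and the dictionary-of-arrays data layout are assumptions made for illustration; they do not come from the slides.

```python
import operator

import numpy as np

# Comparison operators allowed in a condition (an illustrative choice).
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def describe_subgroup(conditions, data, response):
    """Evaluate a conjunction of (column, op, value) conditions.

    data maps column names to NumPy arrays; response is a NumPy array.
    Returns (support, mean in subgroup, mean in whole population).
    """
    mask = np.ones(len(response), dtype=bool)
    for column, op, value in conditions:
        mask &= OPS[op](data[column], value)   # AND the conditions together
    support = mask.mean()                      # fraction of instances covered
    subgroup_mean = response[mask].mean() if mask.any() else float("nan")
    return support, subgroup_mean, response.mean()
```

A subgroup is interesting under these requirements when its mean deviates from the population mean while its support stays large.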


Patient Rule Induction Method (PRIM)

Procedure FindSubgroup
  Begin with a box B that covers all the data
  Iterate
o  Remove the sub-box b of B that results in the highest mean in B − b of the response variable
o  B ← B − b
  Stopping rule: stop when the support within the current box falls below the threshold value
  If more subgroups are required:
–  Remove instances in last subgroup found
–  Repeat Procedure FindSubgroup

Illustration of PRIM with binary response

[Figure legend: Adverse outcome]
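Procedure FindSubgroup can be sketched in Python as follows. This is a simplified illustration of the peeling loop only, under several assumptions that are not in the slides: candidate sub-boxes are the lowest or highest `alpha` fraction of one numeric variable, support is measured as a fraction of the data passed in, and the names `find_subgroup`, `find_subgroups`, `alpha`, and `min_support` are hypothetical. Full PRIM implementations include refinements that are omitted here.

```python
import numpy as np

def find_subgroup(X, y, alpha=0.1, min_support=0.1):
    """Peel a box B until any further peel would push support below min_support."""
    n_total = len(y)
    in_box = np.ones(n_total, dtype=bool)        # B starts by covering all the data
    trajectory = [(1.0, float(y.mean()))]        # (support, mean) after each peel

    while True:
        best_mean, best_mask = -np.inf, None
        for j in range(X.shape[1]):
            xj = X[in_box, j]
            # Two candidate sub-boxes b per variable: its low end and its high end.
            keep_after_low_peel = X[:, j] > np.quantile(xj, alpha)
            keep_after_high_peel = X[:, j] < np.quantile(xj, 1.0 - alpha)
            for keep in (keep_after_low_peel, keep_after_high_peel):
                remaining = in_box & keep        # candidate B - b
                size = remaining.sum()
                if size == 0 or size / n_total < min_support:
                    continue                     # peel would violate the stopping rule
                mean = y[remaining].mean()
                if mean > best_mean:
                    best_mean, best_mask = mean, remaining
        if best_mask is None:
            break                                # no admissible peel is left
        in_box = best_mask                       # B <- B - b
        trajectory.append((in_box.sum() / n_total, float(best_mean)))

    # Describe the final box by the range of the points it still contains.
    lower, upper = X[in_box].min(axis=0), X[in_box].max(axis=0)
    return lower, upper, in_box, trajectory

def find_subgroups(X, y, n_subgroups=2, **kwargs):
    """Covering loop: remove the instances of the last subgroup and repeat."""
    boxes = []
    remaining = np.ones(len(y), dtype=bool)
    for _ in range(n_subgroups):
        if not remaining.any():
            break
        lower, upper, in_box, _ = find_subgroup(X[remaining], y[remaining], **kwargs)
        boxes.append((lower, upper))
        remaining[np.flatnonzero(remaining)[in_box]] = False
    return boxes
```

The returned `trajectory` records support and mean after every peel, so a user can pick a box by trading off the two rather than accepting the last one.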
Remove observations and repeat

Mean versus support trajectory


Preventing overfitting

  When deciding whether to narrow a box, we look at the observations in the current box from a held-out “inspection” dataset. If performance does not increase, we stop.
  For validation we still need a separate test dataset to assess the performance.

[Figure: data partition labeled Developmental set, Inspection, Test set, Training set]

Case study

Subgroups at high risk of hyperglycemia

Nannings B, Abu-Hanna A, Bosman. Int J Medical Informatics. 2007.
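The inspection-set check described under Preventing overfitting can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the case study: the split fractions, the representation of a box as a `(lower, upper)` pair of bound arrays, and the names `split_indices`, `box_mean`, and `last_improving_box` are all hypothetical.

```python
import numpy as np

def split_indices(n, frac_dev=0.5, frac_insp=0.25, seed=0):
    """Randomly split n instances into developmental, inspection and test indices."""
    perm = np.random.default_rng(seed).permutation(n)
    n_dev, n_insp = int(frac_dev * n), int(frac_insp * n)
    return perm[:n_dev], perm[n_dev:n_dev + n_insp], perm[n_dev + n_insp:]

def box_mean(lower, upper, X, y):
    """Mean response of the instances of (X, y) that fall inside the box."""
    inside = np.all((X >= lower) & (X <= upper), axis=1)
    return y[inside].mean() if inside.any() else -np.inf

def last_improving_box(boxes, X_insp, y_insp):
    """Walk a sequence of successively narrower boxes found on the developmental set.

    Returns the index of the last box whose mean on the inspection set
    still increased; narrowing stops at the first box that does not improve.
    """
    best = 0
    best_mean = box_mean(*boxes[0], X_insp, y_insp)
    for k in range(1, len(boxes)):
        mean = box_mean(*boxes[k], X_insp, y_insp)
        if mean <= best_mean:
            break
        best, best_mean = k, mean
    return best
```

The test indices are never touched while boxes are being chosen, so the mean of the selected box on the test set gives an unbiased performance estimate.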


Glucose Management

  IC patients receive insulin to regulate their blood glucose levels
  Clinicians follow new guidelines
  Yet some patients run the risk of hyper- and hypoglycemia
  Who are these patients?

Approach

  For each glucose measurement use its “history” as attribute
  Use PRIM to discover subgroups
  Perform sensitivity analysis over time
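One way to turn the “history” of a glucose measurement into attributes is to summarize other signals over a time window that ends at the measurement. The sketch below assumes times expressed in hours, a window that excludes the measurement time itself, and the mean as the only summary; the names `window_mean` and `history_attributes` and the attribute naming scheme are hypothetical and not taken from the case study.

```python
import numpy as np

def window_mean(times, values, t, hours=6.0):
    """Mean of the values recorded in the `hours` before time t (t itself excluded)."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    in_window = (times >= t - hours) & (times < t)
    return float(values[in_window].mean()) if in_window.any() else float("nan")

def history_attributes(glucose_times, signals, hours=6.0):
    """Build one attribute row per glucose measurement.

    signals maps a signal name to a (times, values) pair for one patient.
    """
    rows = []
    for t in glucose_times:
        rows.append({
            f"mean_{name}_last_{hours:g}h": window_mean(ts, vs, t, hours)
            for name, (ts, vs) in signals.items()
        })
    return rows
```

Each row can then be paired with its glucose value as the response variable before running the subgroup search.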



Results: Rule1

  Mean Body Temperature < 35.5 °C last 6h
  Bicarbonate < 14.9 mmol/l in last 6h
→ Mean Glucose = 12.5 mmol/l

Rule2

  Bicarbonate < 20.5 mmol/l last 6h
  Admission type = medical
  Urine < 2 l OR Urine > 4 l
  21.5 < Albumin < 38.5 g/l last 24h
  Mean Body Temperature < 36.85 °C last 6h
→ Mean Glucose = 9.1 mmol/l


Sensitivity Analysis

If we allow for (historical) glucose as prognostic factor

  Previous glucose > 13.2 mmol/l
  Bicarbonate < 26 mmol/l last 6h
→ Glucose = 15.1 mmol/l

Mean within subgroup over time

[Figure label: Mean in whole sample]


Summary

  Subgroup discovery seeks interesting subgroups, not global models.
  PRIM is a popular subgroup discovery algorithm
–  Provides easily understandable group descriptions using conjunctions
–  Finds groups with high value of the response variable
–  Supports interaction with the user

