
Final Report - EGEE 597

Analysis of OSHA Health Samples database to improve detection of violations using Machine Learning Methods

Amey Vidvans
Problem Description:

Machine learning is being used in a wide range of fields to extract hidden knowledge from data. The idea in this project is to use a dataset of Chemical Exposure Health Data, containing information about inspections and the chemicals tested, to improve the efficiency of inspections and to enable OSHA inspectors to take advantage of accumulated knowledge during inspections. There is an urgent need to improve the testing and inspection process given certain information, such as the type of company, because the problem of occupational chemical exposure is serious. Critics of OSHA contend that it is not doing enough to enforce regulations at workplaces. According to [1], [2], occupational chemical exposure causes more than 50,000-100,000 premature deaths per year in the US alone, and it would take OSHA 600 years to test less than half of the industrial facilities. There is also a subjective element to how inspections are carried out [3]. When a complaint regarding an industrial facility is received, a visit may be arranged depending on the circumstances, and once a visit is arranged it is up to the OSHA inspectors to expand or curtail the inspection to include or exclude substances and sites. Thus, there is a need to apply machine learning methods to aid OSHA inspectors in determining the sites and substances to be tested based on the complaint received. The techniques discussed here can be applied, and an exhaustive list of association rules can be generated, to build a knowledge-based system for judging the likelihood of violations.

Dataset Description:

The dataset as mentioned earlier is a Chemical Exposure Dataset obtained from OSHA

recording all relevant information regarding an inspection. Thus, each data record in the

dataset corresponds to an actual inspection. There are 25 data fields or attributes


describing each data record. A few are mentioned below along with comments regarding

their relevance. The rest may be accessed at [4].

Establishment Name: Sampled Establishment

State: Identifies the site state where the inspection was performed.

Zip Code: Identifies the site zip code where the inspection was performed.

SIC Code: Indicates the 4-digit Standard Industrial Classification (SIC) code from the 1987 version of the SIC manual that most closely applies.

Date Sampled: Date sample was taken.

Eight Hour TWA Calc: Eight-hour time-weighted average (TWA) calculation used.

IMIS Substance Code: IMIS substance code number is the substance code assigned by

OSHA to each substance.

Substance: Chemical name, primarily listed as it appears in the OSHA PELs (29 CFR 1910.1000, Tables Z-1-A, Z-2, Z-3), the ACGIH TLVs, or by common name.

Sample Result: Sample result from laboratory analysis for each sample submitted with a

unique field number.

Unit of Measurement: Unit of measurement (UOM) from IMIS manual. Values: [M -

mg/m3, X - micrograms, P - Parts per million, Y - milligrams, F - fibers/cc, % - percentage]

The dataset is a consolidation of inspections from 6/15/1984 to 1/3/2011. The dataset in its raw form needed to be processed to glean the relevant information. The data fields


were extracted with relevant queries in MS Access and were discretized into categorical variables. For example, the date on which a sample was taken was converted into a Season categorical variable, and the SIC code classification was used to derive the categorical variable Type of Industry. Finally, based on the sample results, the target categorical variable Violation or Not was created using cutoff values taken from the ACGIH (American Conference of Governmental Industrial Hygienists). Note that these are the most stringent exposure limits, stricter even than the OSHA PELs (Permissible Exposure Limits). Overall, of the more than 700 substances tested, only 434 had acceptable limits. Where an ACGIH limit was not available but the substance had an OSHA PEL or NIOSH limit, that cutoff value was used instead. A full listing of substances and their limits, with an explanation of the cutoff values, is attached in the Appendix.

The main drawback of this dataset is that it is not comprehensive: not every industrial facility is included, and not every substance is tested. There was also a great deal of missing data in the units of measurement, along with the use of non-standard measuring units. Here, the last available unit of measurement in the dataset was carried forward until a suitable description of the unit appeared again; for example, if the last recorded unit was M (mg/m3), it was used to determine violations for intermediate rows with missing values until the next unit appeared. Thus, last-value imputation was used. A significant portion of the dataset was nevertheless lost, which can impact the inferences that can be made: from a total of 2,114,613 entries initially, the number of usable data records went down to 800,145. ZIP Code was discarded from the analysis as it was thought that the random variable State would adequately capture geographic effects. Thus, the final variables are State, Season, Substance, Type of Industry, and the target variable Violation or Not. A short sketch of these preprocessing steps is given below.
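The sketch below illustrates these steps in R under stated assumptions: the raw-export column names (date_sampled, unit_of_measurement, substance, sample_result) and the file names are hypothetical placeholders, since the actual extraction and discretization were carried out with MS Access queries as described above.

# Minimal preprocessing sketch (hypothetical column and file names)
library(zoo)   # na.locf() performs last-observation-carried-forward imputation

raw <- read.csv("chem_exposure_raw.csv", stringsAsFactors = FALSE)   # hypothetical file

# Date Sampled -> Season categorical variable
month <- as.integer(format(as.Date(raw$date_sampled, format = "%m/%d/%Y"), "%m"))
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
raw$Season <- factor(season_of[month])

# Missing units of measurement: last-value imputation, as described above
raw$unit_of_measurement[raw$unit_of_measurement == ""] <- NA
raw$unit_of_measurement <- na.locf(raw$unit_of_measurement, na.rm = FALSE)

# Target variable from the ACGIH/OSHA/NIOSH cutoffs (hypothetical cutoff file;
# the full substance/limit listing is in the Appendix). Assumes sample results
# and limits are expressed in the same units.
cutoffs <- read.csv("exposure_cutoffs.csv")            # columns: substance, limit
raw <- merge(raw, cutoffs, by = "substance")
raw$Violation.or.Not <- factor(ifelse(raw$sample_result > raw$limit,
                                      "Violation", "No Violation"))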


Literature Survey:

No prior study was found with an objective similar to this one. In fact, the author could find only one instance of this dataset being used in research [5], and that study was limited to a single substance, asbestos. Thus, building a single model to describe all substances and geographic attributes was a challenge. Examining the dataset, however, it was determined that it could be treated as binary sparse data: the probability of a violation was roughly 10% in the dataset, the presence or absence of a substance could be represented as a binary variable, and so on. Plasse et al. [6] studied binary sparse matrices with numerical attributes for determining association rules in the automobile industry using generated clusters; their work combined clustering and association rule mining to determine relevant links. Thus, the work presented in this study is indeed novel. The model constructed here could also be utilized for accident prediction analysis using OSHA injury reports with a few changes to the inputs and variables. Association rule mining has long been used to discover patterns in datasets [7].

Methods:

The techniques used in this work were Random Forests, Bayesian Networks, and the WEKA association engine. The author determined that starting with a Random Forest, treating the problem as a classification problem, was the natural first step. The initial estimates of variable importance are also useful for building the Bayesian Network. Random Forests are a robust technique and a useful benchmark model for classification problems, and they give posterior probabilities based on observed vote counts, which form the basis for classification. A random forest with 100 decision trees, an in-bag fraction of 0.5, and sampling of all predictors


at once (because the author feels that leaving out any variable is not advisable) is built. The relevant arguments to be set are shown in the code section; see also the sketch below.
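As a rough R counterpart of the MATLAB TreeBagger call in the Appendix, the sketch below fits the same kind of forest. The ranger package is used here only because the substance factor has several hundred levels, which the classic randomForest package cannot handle; this is an illustrative sketch under those assumptions, not the model actually fitted for the results below.

# Random forest sketch: 100 trees, in-bag fraction 0.5, all 4 predictors per split
library(ranger)

rf <- ranger(Violation.or.Not ~ state + Season + substance + Type.of.Industry,
             data = QueryAmey1,
             num.trees = 100,             # 100 decision trees
             mtry = 4,                    # sample all four predictors at each split
             sample.fraction = 0.5,       # in-bag fraction of 0.5
             importance = "permutation")  # permutation-based predictor importance

rf$confusion.matrix        # out-of-bag confusion matrix (cf. Table 1)
rf$variable.importance     # predictor importance (cf. Fig. 1)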

For building the Bayesian Network, the R statistical language is used. A BIC-score-based hill-climbing algorithm is used, with blacklisted and whitelisted edges supplied as inputs determined from domain knowledge. Once the conditional probability tables are created after compilation, queries about events given evidence can be made. Bayesian Networks are a relevant technique in this situation because the question of whether an observation is a violation is not really a classification problem but a probabilistic one: the probability of an observation being a violation depends on the values taken by the four categorical predictor variables. The conditional probability tables were modelled as multinomial distributions, which are like binomial distributions except that they can model multiple discrete outcomes. The hill-climbing algorithm decides which links or edges are appropriate based on the change in the overall BIC score after adding each edge, and the network with the best BIC score is finally selected. However, from domain knowledge we know that certain edges must be included or excluded, which is enforced through whitelisting and blacklisting. The Bayesian network is then fitted to the data and inferences can be made; a condensed sketch is shown below, with the full session in the Appendix. Validation of this Bayesian Network was done on a subjective basis, based on expert judgement.
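The sketch below condenses this workflow, mirroring the console session in the Appendix (where whitelisted variants were also explored and the blacklisted model was the one finally fitted); it assumes the processed data frame QueryAmey1.

library(bnlearn)

# Domain knowledge: forbid a direct edge from state to the target
# (a whitelist, e.g. Season -> Type.of.Industry, can be supplied the same way)
blacklist <- data.frame(from = "state", to = "Violation.or.Not")

# Score-based structure learning: hill climbing with the BIC score
dag <- hc(QueryAmey1, score = "bic", blacklist = blacklist)

# Fit the multinomial conditional probability tables and query the network
fit_bn <- bn.fit(dag, data = QueryAmey1)
cpquery(fit_bn, event = (Violation.or.Not == "Violation"),
        evidence = (substance == "Formaldehyde"))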

For generating the association rules, the association engine from the WEKA Associator is used. The criteria for generating association rules are Delta, minimum support, the type of metric, and the number of rules required. Minimum support is the number of instances in the last association divided by the size of the dataset, expressed as a fraction. The support value starts off at a large value and is decreased by Delta (0.05) at every run of the engine. We need as large a support and as large a lift as possible for an association rule to be


significant. Looking at the dataset, the minimum support is initially estimated at 0.01, i.e. about 8,000 instances, in order to generate association rules that include Violations (which number roughly 80,000). This value is successively lowered to obtain more interesting associations. The type of metric used is Lift, which is defined as follows:

Lift = P(A ∩ B) / (P(A) · P(B))

Thus, if the lift is greater than 1, the events A and B are not independent and depend on each other. The question to be answered here is whether there are significant relationships or general trends in the dataset. Inspiration for such an analysis was taken from the market-basket analysis of Agrawal et al. [8], solved with the Apriori algorithm. The Apriori algorithm works by generating large item sets determined by the inputs and combining them into larger and larger item sets to determine association rules, while maintaining the requirements of the chosen metric, which may be Confidence or Lift. The author determined that an association rule with a confidence greater than 0.1 is useful, as it will give a better result than randomly trying to guess the outcome. The values used as inputs are presented with the results, and a sketch of an equivalent rule-mining run is given below.
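The rules reported in the Results were generated with the WEKA Apriori associator; the sketch below shows a roughly equivalent run in R using the arules package. It assumes the processed columns are already factors, and WEKA's scheme of decreasing the support by Delta on each cycle has no direct analogue here, so a single support threshold is used instead.

library(arules)

# Factor columns become binary items, giving the sparse binary representation
trans <- as(QueryAmey1, "transactions")

rules <- apriori(trans,
                 parameter = list(support = 0.01,    # ~8,000 of the ~800,000 records
                                  confidence = 0.1,  # "better than guessing" threshold
                                  minlen = 2))

# Rank the mined rules by lift, as in the WEKA runs
inspect(sort(rules, by = "lift")[1:10])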

Results and discussion:

The random forest model gave the confusion matrix shown below.

                         Predicted No Violation    Predicted Violation
Actual No Violation      714146 (TN)               5725 (FP)
Actual Violation         69833 (FN)                10440 (TP)

Table 1: Confusion matrix for the random forest.
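As a quick arithmetic check, the rates quoted in the next paragraph can be recomputed directly from the counts in Table 1 (a minimal sketch using only those four numbers):

# Counts from Table 1
TN <- 714146; FP <- 5725; FN <- 69833; TP <- 10440

TP / (TP + FN)                    # fraction of violations detected: ~0.13
(TP + FP) / (TN + FP + FN + TP)   # fraction of records flagged as violations: ~0.02
(TP + FN) / (TN + FP + FN + TP)   # base rate of violations in the data: ~0.10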

Thus, we can see from the confusion matrix that the random forest, which is considered a benchmark classifier, detects only 13% of all violations. The fraction of records flagged as violations by the random forest is roughly 2%, whereas the probability of randomly picking a violation from the dataset is roughly 10%; in this sense the decision trees perform worse than randomly picking a data record as a violation. One common solution in such a situation is resampling to reduce the class imbalance; in this case, however, that is not an acceptable solution. The cross-validation error was 0.0893. The posterior probabilities obtained from the random forest are based on majority vote counts across trees and are heavily influenced by the splitting criterion determined by the algorithm. Thus, a Bayesian view of the posterior probabilities based on Bayes' rule is required; Bayes' rule infers posterior probabilities of events from prior probabilities. A Naïve Bayes classifier was also tried on the dataset, but it performed even worse than the random forest in terms of sensitivity, specificity, and cross-validation error. However, we can obtain the predictor importance of the variables from the random forest, represented below.


Fig 1: Predictor Importance from Random Forest

The random variable Substance plays the most important role in determining the possibility of a violation, which was expected. This information was used to determine the direction of the conditional probabilities. It also indicates that the assumption that the random variables are independent is not valid and that there are conditional dependencies present in the data structure, which we know to be true from experience and expert knowledge. A Bayesian Network is therefore constructed in R, as shown below. Child nodes are taken to be dependent on their parent nodes; the last variable, Violation or Not, depends only on Substance once its value is known and is conditionally independent of Type of Industry and its other parent nodes.

Fig 2: Bayesian Network constructed.


The Bayesian Network is now ready for inference, and queries can be made to it to determine probabilities. A sensitivity analysis, inputting different values of the random variables, is presented next, and a few interesting results are discovered; a sketch of the query procedure follows.
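The sketch below shows one way the by-state probabilities in Fig. 3 can be reproduced with bnlearn; fittedbn is the fitted network from the Appendix, and likelihood weighting (method = "lw") is assumed here because it lets the evidence be passed as a named list inside a loop. Being a Monte Carlo estimate, the values vary slightly from run to run.

# Posterior probability of a violation for each state (cf. Fig. 3)
states <- levels(QueryAmey1$state)
p_violation <- sapply(states, function(s)
  cpquery(fittedbn,
          event    = (Violation.or.Not == "Violation"),
          evidence = list(state = s),
          method   = "lw"))

barplot(sort(p_violation, decreasing = TRUE), las = 2, cex.names = 0.5)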


Fig 3: Sensitivity Analysis by State.

Iowa, it turns out, has a significantly greater chance of producing violations than the other states. The rest are centered around the expected probability of 0.1.


Fig 4: Sensitivity Analysis w.r.t Season

Again, the probabilities are centered around 0.1, with the Fall season having a slightly higher probability.



Fig 5: Sensitivity Analysis according to substance.

The figure shows the posterior probabilities of the various substances given the knowledge that a violation has occurred. Three substances turn out to have a 100% probability of producing a violation: Nicotine, 1,3,5-Triglycidyl Isocyanurate, and Hexachloroethane. However, this observation is not very reliable, as the number of observations for each of these substances is probably too low for the result to be significant; going back to the original dataset confirms that the counts are indeed small. These probabilities are therefore not strong evidence of likelihood and should be treated only as rough indicators.

Fig 6: Sensitivity Analysis per type of industry.

As can be seen in Fig 6, the Retail Trade industry has the highest likelihood of producing violations. This is counterintuitive, but it can be attributed to the lax attitude adopted by companies not traditionally associated with occupational exposure.


We now move to generating association rules using a technique based on K-medoid clustering. To increase the relevance of the rules and the chance of finding strong ones, a K-medoid clustering algorithm was first run on the dataset, grouping the attributes into two clusters: the first containing State, and the second containing Season, Type of Industry, Substance, and Violation or Not. A sketch of how such an attribute clustering can be reproduced is shown below.
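The report does not give the exact clustering implementation; the sketch below shows one way an attribute-level K-medoid clustering could be reproduced in R, under the assumption that the attributes are one-hot encoded and the resulting binary columns are clustered with PAM (a row sample is taken only to keep the distance computation manageable).

library(cluster)

# One-hot encode the categorical variables on a sample of rows
idx <- sample(nrow(QueryAmey1), 50000)
X <- model.matrix(~ state + Season + substance + Type.of.Industry +
                    Violation.or.Not - 1, data = QueryAmey1[idx, ])

# Dissimilarity between attribute columns, then K-medoid (PAM) with two clusters
D <- dist(t(X), method = "binary")
clusters <- pam(D, k = 2)
split(colnames(X), clusters$clustering)   # attribute membership in the two clusters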

The Association Engine was then run on both clusters to discover previously unknown relations. The results are useful during inspections to improve violation detection; some of the discovered rules are presented along with the parameters used to generate them. From the first cluster, we get:

Minimum support: 0.01 (8001 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 110

Type of Industry=Construction Violation or Not=Violation 67399 ==> substance=Manganese Fume (as Mn) 12211    conf:(0.18) < lift:(3.39)> lev:(0.01)
substance=Lead, Inorganic (as Pb) Type of Industry=Construction 59221 ==> Violation or Not=Violation 13941    conf:(0.24) < lift:(2.35)> lev:(0.01)

Minimum support: 0 (800 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 20


substance=Asbestos (all forms) Season=Summer Type of Industry=Construction 6464 ==> Violation or Not=Violation 2267    conf:(0.35) < lift:(3.5)> lev:(0)
substance=Mercury (Vapor) (as Hg) Type of Industry=Construction 2098 ==> Violation or Not=Violation 1355    conf:(0.65) < lift:(6.44)> lev:(0)

Minimum support: 0 (80 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 10

Type of Industry=Finance,Insurance,Real Estate Violation or Not=Violation 4113 ==> substance=Formaldehyde Season=Summer 251    conf:(0.06) < lift:(13.63)> lev:(0)
substance=Carbon Black 440 ==> Season=Summer Type of Industry=Construction Violation or Not=Violation 92    conf:(0.21) < lift:(6.16)> lev:(0)
substance=p-Dichlorobenzene 328 ==> Season=Summer Type of Industry=Construction Violation or Not=Violation 127    conf:(0.39) < lift:(11.41)> lev:(0)
Season=Winter Type of Industry=Construction Violation or Not=Violation 22373 ==> substance=Asbestos (all forms) 2205    conf:(0.1) < lift:(3.2)> lev:(0)
substance=Cadmium Dust (as Cd) 1671 ==> Violation or Not=Violation 566    conf:(0.34) < lift:(3.38)> lev:(0)

Now moving to the 2nd cluster,

state=MT 10710 ==> Violation or Not=Violation 1824 conf:(0.17) < lift:(1.7)> lev:(0)


state=RI 11900 ==> Violation or Not=Violation 1803 conf:(0.15) < lift:(1.51)> lev:(0)

state=NJ 51510 ==> Violation or Not=Violation 7175 conf:(0.14) < lift:(1.39)> lev:(0)

state=CT 11910 ==> Violation or Not=Violation 1591 conf:(0.13) < lift:(1.33)> lev:(0)

state=ME 12573 ==> Violation or Not=Violation 1644 conf:(0.13) < lift:(1.3)> lev:(0)

The arrows denote the direction of implication of each rule. The confidence and lift values shown next to each association rule are measures of uncertainty that also describe the strength of the rule. One interesting outcome of the exercise was that as the minimum support went down, the number of rules generated increased exponentially; it is therefore recommended to keep the minimum support and lift as high as possible. The meaning of these association rules is self-explanatory. Summer seems to be the season in which many associations are formed, suggesting a link between summer and the possibility of a violation. A significant number of substances appear frequently in the association rules and should be taken note of. All of these association rules are highly significant.

One of the most interesting applications of the principles of association rules and Bayesian Networks was in exploring synergistic effects of the various substances. Synergistic effects of exposure occur when the human body is exposed to several substances that attack the same organ. Here we target only chemicals that cause neurologic and neurobehavioral damage (Class HE 7). As an example, Manganese Fume (A) and Lead, Inorganic (as Pb) (B) are selected for testing synergy. The procedure for calculating the lift is demonstrated in the example below for the construction industry.


Table 2: Calculating lift from the synergistic relationship.

Lift = P(A ∩ B) / (P(A) · P(B)) = 0.3257 / (0.015 × 0.017) = 1277.25

Thus, it can be seen that the likelihood of a violation for Lead, Inorganic (as Pb), given the knowledge that Manganese Fume has already caused a violation, is very high. Similar synergistic effects can be explored for the other substances too, e.g. with the short calculation sketched below.
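As a minimal numeric check, the lift quoted above can be recomputed from the probabilities in Table 2:

# Lift from the joint and marginal probabilities quoted in Table 2
lift <- function(p_ab, p_a, p_b) p_ab / (p_a * p_b)
lift(0.3257, 0.015, 0.017)   # ~1277.25, matching the value reported above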

Conclusion:

Through this study we now have a framework to address questions about the likelihood of violations given certain information. We have also generated a set of association rules to act as guidelines when looking for new potential sites or when a complaint is received, depending on the circumstances. Finally, we have described a procedure to test for synergy in the dataset using Bayesian Networks and association rules.

Future Work:

We can look for trends to explore the effect of time on violation probability, i.e., are industries improving their health and safety records? Is a substance causing more violations as time progresses? These questions would reveal whether attitudes towards health and safety in industry are improving or deteriorating.


Classifying substances by their type may give interesting results, since similar substances are handled in similar ways. This clustering can be done based on their chemical property information, which is readily available.

Acknowledgements:

The basic research idea for this study was proposed by Prof Gernand. I am grateful to

him for giving me the chance to work on this dataset.

References:

[1] OSHA enforcement data. [Online]. Available: http://oshadata.peer.org

[2] M. Karlin, "The US Government Must Address Toxic Chemical Exposure in the Workplace," 2015. [Online]. Available: http://www.truth-out.org/buzzflash/commentary/us-government-needs-to-do-more-abouttoxic-chemical-exposure-in-the-workplace/19344-us-government-needs-to-do-more-about-toxicchemical-exposure-in-the-workplace

[3] National Council for Occupational Safety and Health, "The OSHA Inspection: A Step-by-Step Guide." [Online]. Available: https://www.osha.gov/dte/grant_materials/fy10/sh-20853-10/osha_inspections.pdf

[4] US Dept of Labor, "Chemical Exposure Health Data." [Online]. Available: https://www.osha.gov/opengov/healthsamples.html

[5] D. M. Cowan, T. J. Cheng, M. Ground, J. Sahmel, A. Varughese, and A. K. Madl, "Analysis of workplace compliance measurements of asbestos by the U.S. Occupational Safety and Health Administration (1984-2011)," Regul. Toxicol. Pharmacol., vol. 72, no. 3, pp. 615-629, Aug. 2015.

[6] M. Plasse, N. Niang, G. Saporta, A. Villeminot, and L. Leblond, "Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set," Comput. Stat. & Data Anal., vol. 52, no. 1, pp. 596-613, Sept. 2007.


[7] T. Djatna and I. Alitu, "An Application of Association Rule Mining in Total Productive Maintenance Strategy: An Analysis and Modelling in Wooden Door Manufacturing Industry," Ind. Eng. & Serv. Sci., vol. 4, pp. 336-343, 2015.

[8] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), Santiago, Chile, Sept. 1994, pp. 487-499.


Appendix:
Code for R Bayesian Network

# Load the processed data and drop the ZIP code column
QueryAmey1 <- read.csv("~/QueryAmey1.csv")
View(QueryAmey1)
QueryAmey1$zip_code <- NULL

# Structure learning requires the bnlearn package (hc() is not in base R)
library(bnlearn)

# Hill-climbing structure learning with the default (BIC) score
res <- hc(QueryAmey1)
plot(res)

# Variant with a whitelisted edge Season -> Type of Industry
whitelist <- data.frame(from = c("Season"), to = c("Type.of.Industry"))
res2 <- hc(QueryAmey1, whitelist = whitelist)


plot(res2)

# Variant with a whitelisted edge Season -> Violation or Not
whitelist <- data.frame(from = c("Season"), to = c("Violation.or.Not"))
res3 <- hc(QueryAmey1, whitelist = whitelist)
plot(res3)

# Final model: blacklist the direct edge state -> Violation or Not
blacklist <- data.frame(from = c("state"), to = c("Violation.or.Not"))
res4 <- hc(QueryAmey1, blacklist = blacklist)
plot(res4)

# Fit the conditional probability tables and query the network
fittedbn <- bn.fit(res4, data = QueryAmey1)
cpquery(fittedbn, (Violation.or.Not == "Violation"), (substance == "Formaldehyde"))
# [1] 0.2036199

Code for Random Forest

%% Data set imported using the Import Data tab as a Table (very important)
% Extract the predictor and response columns from the imported table
ZIP   = QueryAmey1(1:size(QueryAmey1,1),2);   % ZIP code (not used in the final model)
STATE = QueryAmey1(1:size(QueryAmey1,1),1);   % State
SUB   = QueryAmey1(1:size(QueryAmey1,1),3);   % Substance
SEA   = QueryAmey1(1:size(QueryAmey1,1),4);   % Season
TYPE  = QueryAmey1(1:size(QueryAmey1,1),5);   % Type of Industry
BIN   = QueryAmey1(1:size(QueryAmey1,1),6);   % Target: Violation or Not

%f=fitctree([STATE ZIP SUB SEA TYPE],BIN,'CategoricalPredictors','all');   % single tree (unused)

% Random forest: 100 trees, in-bag fraction 0.5, all 4 predictors sampled at each split
DD=TreeBagger(100,[STATE SUB SEA TYPE],BIN,'CategoricalPredictors','all','InBagFraction',0.5,'Method','classification','NumPredictorsToSample',4,'OOBPredictorImportance','on','OOBPrediction','on');

% Naive Bayes classifier for comparison
d=fitcnb([STATE SUB SEA TYPE],BIN);

X=[STATE SUB SEA TYPE];
[label,Posterior,Cost]=predict(d,X);   % Naive Bayes predictions
[Yfit,scores]=predict(DD,X);           % random forest predictions


%cc=fitcsvm([STATE ZIP SIC SUB SEA],BIN);   % SVM (unused)

% Cross-validation of both models
CV=d;
CV1=DD;
defaultCVMdl = crossval(CV);
defaultLoss = kfoldLoss(defaultCVMdl);      % Naive Bayes cross-validation error
defaultCVMdl2 = crossval(CV1,'KFold',10);
defaultLoss1 = kfoldLoss(defaultCVMdl2)     % random forest cross-validation error

%view(f)

bar(d.predictorImportance)
% Permutation-based predictor importance from the random forest (cf. Fig. 1)
bar(DD.OOBPermutedPredictorDeltaError);

Dataset:

Submitted after Presentation.

List of States

levels(QueryAmey1$state) according to index number used

[1] "AK" "AL" "AR" "AS" "AZ" "CA" "CO" "CT" "CZ" "DC" "DE" "FL" "FN" "GA" "GU" "HI"
"IA" "ID" "IL" "IN" "JQ" "KS" "KY" "LA"

[25] "MA" "MD" "ME" "MI" "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ"
"NM" "NV" "NY" "OH" "OK" "OR" "PA" "PI" "PR" "RI"

[49] "SC" "SD" "TN" "TX" "UT" "VA" "VI" "VT" "WA" "WI" "WV" "WY"

