
Final Report - EGEE 597

Analysis of OSHA Health Samples database to improve detection of violations using Machine Learning Methods

Amey Vidvans
Problem Description:

Machine learning is being used in a wide range of fields to extract hidden knowledge from data. The idea in this project is to use a dataset of Chemical Exposure Health Data, containing information about inspections and the chemicals tested, to improve the efficiency of inspections and to enable OSHA inspectors to take advantage of accumulated knowledge during inspections. There is an urgent need to improve the testing and inspection process given certain information, such as the type of company, because the problem of occupational chemical exposure is serious. Critics of OSHA contend that it is not doing enough to enforce regulations at workplaces. According to [1], [2], occupational chemical exposure causes more than 50,000-100,000 premature deaths per year in the US alone, and it would take OSHA 600 years to test less than half of the industrial facilities. There is also a subjective element to how inspections are carried out [3]. When a complaint regarding an industrial facility is received, a visit may be arranged depending on the circumstances, and once a visit is arranged it is up to the OSHA inspectors to expand or curtail the inspection to include or exclude substances and sites. Thus, there is a need to apply machine learning methods to aid OSHA inspectors in determining the sites and substances to be tested based on the complaint received. The techniques discussed here can be applied, and an exhaustive list of association rules can be generated, to build a knowledge-based system for judging the likelihood of violations.

Dataset Description:

The dataset as mentioned earlier is a Chemical Exposure Dataset obtained from OSHA

recording all relevant information regarding an inspection. Thus, each data record in the

dataset corresponds to an actual inspection. There are 25 data fields or attributes


describing each data record. A few are mentioned below along with comments regarding

their relevance. The rest may be accessed at [4].

Establishment Name: Sampled Establishment

State: Identifies the site state where the inspection was performed.

Zip Code: Identifies the site zip code where the inspection was performed.

SIC Code: Indicates the 4-digit Standard Industrial Classification (SIC) code from the 1987 version of the SIC manual that most closely applies.

Date Sampled: Date sample was taken.

Eight Hour TWA Calc: Eight-hour time-weighted average (TWA) calculation used.

IMIS Substance Code: IMIS substance code number is the substance code assigned by

OSHA to each substance.

Substance: Chemical name, primarily listed as it appears in the OSHA PELs (29 CFR 1910.1000, Tables Z-1-A, Z-2, Z-3), the ACGIH TLVs, or by common name.

Sample Result: Sample result from laboratory analysis for each sample submitted with a

unique field number.

Unit of Measurement: Unit of measurement (UOM) from IMIS manual. Values: [M -

mg/m3, X - micrograms, P - Parts per million, Y - milligrams, F - fibers/cc, % - percentage]

The dataset is a consolidation of inspections from 6/15/1984 to 1/3/2011. The dataset in its raw form needed to be processed to glean the relevant information. The data fields


were extracted with relevant queries in MS Access and were discretized into categorical variables. For example, the date on which a sample was taken was converted into a Season categorical variable, and the SIC code classification was used to derive the categorical variable Type of Industry. Finally, based on the sample results, the target categorical variable Violation or Not was created using cutoff values taken from the ACGIH (American Conference of Governmental Industrial Hygienists). Note that these are the most stringent exposure limits, stricter even than the OSHA PELs (Permissible Exposure Limits). Overall, of the more than 700 substances tested, only 434 had acceptable limits. Where an ACGIH limit was not available but the substance had an OSHA PEL or NIOSH limit, that cutoff value was used instead. A full listing of substances and their limits, with an explanation of the cutoff values, is attached in the Appendix.

The main drawback of this dataset is that it is not comprehensive: not every industrial facility is included, and not every substance is tested. There was also a great deal of missing data in the units of measurement, along with the use of non-standard measuring units. Here, the last available unit of measurement in the dataset was carried forward until a suitable description of the unit appeared again; for example, if the last recorded unit was M (mg/m3), it was used to determine violations for intermediate rows with missing values until the next unit appeared. Thus, last-value imputation was used. A significant portion of the dataset was nevertheless lost, which can impact the inferences that can be made: from a total of 2,114,613 entries initially, the number of usable data records went down to 800,145. ZIP Code was discarded from the analysis as it was thought that the random variable State would adequately capture geographic effects. Thus, the final variables are State, Season, Substance, Type of Industry, and the target variable Violation or Not. A short sketch of these preprocessing steps is given below.
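The sketch below illustrates these steps in R under stated assumptions: the raw-export column names (date_sampled, unit_of_measurement, substance, sample_result) and the file names are hypothetical placeholders, since the actual extraction and discretization were carried out with MS Access queries as described above.

# Minimal preprocessing sketch (hypothetical column and file names)
library(zoo)   # na.locf() performs last-observation-carried-forward imputation

raw <- read.csv("chem_exposure_raw.csv", stringsAsFactors = FALSE)   # hypothetical file

# Date Sampled -> Season categorical variable
month <- as.integer(format(as.Date(raw$date_sampled, format = "%m/%d/%Y"), "%m"))
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
raw$Season <- factor(season_of[month])

# Missing units of measurement: last-value imputation, as described above
raw$unit_of_measurement[raw$unit_of_measurement == ""] <- NA
raw$unit_of_measurement <- na.locf(raw$unit_of_measurement, na.rm = FALSE)

# Target variable from the ACGIH/OSHA/NIOSH cutoffs (hypothetical cutoff file;
# the full substance/limit listing is in the Appendix). Assumes sample results
# and limits are expressed in the same units.
cutoffs <- read.csv("exposure_cutoffs.csv")            # columns: substance, limit
raw <- merge(raw, cutoffs, by = "substance")
raw$Violation.or.Not <- factor(ifelse(raw$sample_result > raw$limit,
                                      "Violation", "No Violation"))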


Literature Survey:

No prior study was found with an objective similar to this one. In fact, the author could find only one instance of this dataset being used in research [5], and that study was limited to a single substance, asbestos. Thus, building a single model to describe all substances and geographic attributes was a challenge. Examining the dataset, however, it was determined that it could be treated as binary sparse data: the probability of a violation was roughly 10% in the dataset, the presence or absence of a substance could be represented as a binary variable, and so on. Plasse et al. [6] studied binary sparse matrices with numerical attributes for determining association rules in the automobile industry using generated clusters; their work combined clustering and association rule mining to determine relevant links. Thus, the work presented in this study is indeed novel. The model constructed here could also be utilized for accident prediction analysis using OSHA injury reports with a few changes to the inputs and variables. Association rule mining has long been used to discover patterns in datasets [7].

Methods:

The techniques used in this work were Random Forests, Bayesian Networks, and the WEKA association engine. The author determined that starting with a Random Forest, treating the problem as a classification problem, was the natural first step. The initial estimates of variable importance are also useful for building the Bayesian Network. Random Forests are a robust technique and a useful benchmark model for classification problems, and they give posterior probabilities based on observed vote counts, which form the basis for classification. A random forest with 100 decision trees, an in-bag fraction of 0.5, and sampling of all predictors


at once (because the author feels that leaving out any variable is not advisable) is built. The relevant arguments to be set are shown in the code section; see also the sketch below.
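As a rough R counterpart of the MATLAB TreeBagger call in the Appendix, the sketch below fits the same kind of forest. The ranger package is used here only because the substance factor has several hundred levels, which the classic randomForest package cannot handle; this is an illustrative sketch under those assumptions, not the model actually fitted for the results below.

# Random forest sketch: 100 trees, in-bag fraction 0.5, all 4 predictors per split
library(ranger)

rf <- ranger(Violation.or.Not ~ state + Season + substance + Type.of.Industry,
             data = QueryAmey1,
             num.trees = 100,             # 100 decision trees
             mtry = 4,                    # sample all four predictors at each split
             sample.fraction = 0.5,       # in-bag fraction of 0.5
             importance = "permutation")  # permutation-based predictor importance

rf$confusion.matrix        # out-of-bag confusion matrix (cf. Table 1)
rf$variable.importance     # predictor importance (cf. Fig. 1)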

For building the Bayesian Network, the R statistical language is used. A BIC-score-based hill-climbing algorithm is used, with blacklisted and whitelisted edges supplied as inputs determined from domain knowledge. Once the conditional probability tables are created after compilation, queries about events given evidence can be made. Bayesian Networks are a relevant technique in this situation because the question of whether an observation is a violation is not really a classification problem but a probabilistic one: the probability of an observation being a violation depends on the values taken by the four categorical predictor variables. The conditional probability tables were modelled as multinomial distributions, which are like binomial distributions except that they can model multiple discrete outcomes. The hill-climbing algorithm decides which links or edges are appropriate based on the change in the overall BIC score after adding each edge, and the network with the best BIC score is finally selected. However, from domain knowledge we know that certain edges must be included or excluded, which is enforced through whitelisting and blacklisting. The Bayesian network is then fitted to the data and inferences can be made; a condensed sketch is shown below, with the full session in the Appendix. Validation of this Bayesian Network was done on a subjective basis, based on expert judgement.
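The sketch below condenses this workflow, mirroring the console session in the Appendix (where whitelisted variants were also explored and the blacklisted model was the one finally fitted); it assumes the processed data frame QueryAmey1.

library(bnlearn)

# Domain knowledge: forbid a direct edge from state to the target
# (a whitelist, e.g. Season -> Type.of.Industry, can be supplied the same way)
blacklist <- data.frame(from = "state", to = "Violation.or.Not")

# Score-based structure learning: hill climbing with the BIC score
dag <- hc(QueryAmey1, score = "bic", blacklist = blacklist)

# Fit the multinomial conditional probability tables and query the network
fit_bn <- bn.fit(dag, data = QueryAmey1)
cpquery(fit_bn, event = (Violation.or.Not == "Violation"),
        evidence = (substance == "Formaldehyde"))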

For generating the association rules, the association engine from the WEKA Associator is used. The criteria for generating association rules are Delta, minimum support, the type of metric, and the number of rules required. Minimum support is the number of instances in the last association divided by the size of the dataset, expressed as a fraction. The support value starts off at a large value and is decreased by Delta (0.05) at every run of the engine. We need as large a support and as large a lift as possible for an association rule to be


significant. Looking at the dataset, the minimum support is initially estimated at 0.01, i.e. about 8,000 instances, in order to generate association rules that include Violations (which number roughly 80,000). This value is successively lowered to obtain more interesting associations. The type of metric used is Lift, which is defined as follows:

Lift = P(A ∩ B) / (P(A) · P(B))

Thus, if the lift is greater than 1, the events A and B are not independent and depend on each other. The question to be answered here is whether there are significant relationships or general trends in the dataset. Inspiration for such an analysis was taken from the market-basket analysis of Agrawal et al. [8], solved with the Apriori algorithm. The Apriori algorithm works by generating large item sets determined by the inputs and combining them into larger and larger item sets to determine association rules, while maintaining the requirements of the chosen metric, which may be Confidence or Lift. The author determined that an association rule with a confidence greater than 0.1 is useful, as it will give a better result than randomly trying to guess the outcome. The values used as inputs are presented with the results, and a sketch of an equivalent rule-mining run is given below.
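The rules reported in the Results were generated with the WEKA Apriori associator; the sketch below shows a roughly equivalent run in R using the arules package. It assumes the processed columns are already factors, and WEKA's scheme of decreasing the support by Delta on each cycle has no direct analogue here, so a single support threshold is used instead.

library(arules)

# Factor columns become binary items, giving the sparse binary representation
trans <- as(QueryAmey1, "transactions")

rules <- apriori(trans,
                 parameter = list(support = 0.01,    # ~8,000 of the ~800,000 records
                                  confidence = 0.1,  # "better than guessing" threshold
                                  minlen = 2))

# Rank the mined rules by lift, as in the WEKA runs
inspect(sort(rules, by = "lift")[1:10])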

Results and discussion:

The random forest model gave the confusion matrix shown below.

                         Predicted No Violation    Predicted Violation
Actual No Violation      714146 (TN)               5725 (FP)
Actual Violation         69833 (FN)                10440 (TP)

Table 1: Confusion matrix for the random forest.
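As a quick arithmetic check, the rates quoted in the next paragraph can be recomputed directly from the counts in Table 1 (a minimal sketch using only those four numbers):

# Counts from Table 1
TN <- 714146; FP <- 5725; FN <- 69833; TP <- 10440

TP / (TP + FN)                    # fraction of violations detected: ~0.13
(TP + FP) / (TN + FP + FN + TP)   # fraction of records flagged as violations: ~0.02
(TP + FN) / (TN + FP + FN + TP)   # base rate of violations in the data: ~0.10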

Thus, we can see from the confusion matrix that the random forest, which is considered a benchmark classifier, detects only 13% of all violations. The fraction of records flagged as violations by the random forest is roughly 2%, whereas the probability of randomly picking a violation from the dataset is roughly 10%; in this sense the decision trees perform worse than randomly picking a data record as a violation. One common solution in such a situation is resampling to reduce the class imbalance; in this case, however, that is not an acceptable solution. The cross-validation error was 0.0893. The posterior probabilities obtained from the random forest are based on majority vote counts across trees and are heavily influenced by the splitting criterion determined by the algorithm. Thus, a Bayesian view of the posterior probabilities based on Bayes' rule is required; Bayes' rule infers posterior probabilities of events from prior probabilities. A Naïve Bayes classifier was also tried on the dataset, but it performed even worse than the random forest in terms of sensitivity, specificity, and cross-validation error. However, we can obtain the predictor importance of the variables from the random forest, represented below.


Fig 1: Predictor Importance from Random Forest

The random variable Substance plays the most important role in determining the possibility of a violation, which was expected. This information was used to determine the direction of the conditional probabilities. It also indicates that the assumption that the random variables are independent is not valid and that there are conditional dependencies present in the data structure, which we know to be true from experience and expert knowledge. A Bayesian Network is therefore constructed in R, as shown below. Child nodes are taken to be dependent on their parent nodes; the last variable, Violation or Not, depends only on Substance once its value is known and is conditionally independent of Type of Industry and its other parent nodes.

Fig 2: Bayesian Network constructed.


The Bayesian Network is now ready for inference, and queries can be made to it to determine probabilities. A sensitivity analysis, inputting different values of the random variables, is presented next, and a few interesting results are discovered; a sketch of the query procedure follows.
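The sketch below shows one way the by-state probabilities in Fig. 3 can be reproduced with bnlearn; fittedbn is the fitted network from the Appendix, and likelihood weighting (method = "lw") is assumed here because it lets the evidence be passed as a named list inside a loop. Being a Monte Carlo estimate, the values vary slightly from run to run.

# Posterior probability of a violation for each state (cf. Fig. 3)
states <- levels(QueryAmey1$state)
p_violation <- sapply(states, function(s)
  cpquery(fittedbn,
          event    = (Violation.or.Not == "Violation"),
          evidence = list(state = s),
          method   = "lw"))

barplot(sort(p_violation, decreasing = TRUE), las = 2, cex.names = 0.5)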


Fig 3: Sensitivity Analysis by State.

Iowa, it turns out, has a significantly greater chance of producing violations than the other states. The rest are centered around the expected probability of 0.1.


Fig 4: Sensitivity Analysis w.r.t Season

Again, the probabilities are centered around 0.1, with the Fall season having a slightly higher probability.



Fig 5: Sensitivity Analysis according to substance.

The figure shows the posterior probabilities of the various substances given the knowledge that a violation has occurred. Three substances turn out to have a 100% probability of producing a violation: Nicotine, 1,3,5-Triglycidyl Isocyanurate, and Hexachloroethane. However, this observation is not very reliable, as the number of observations for each of these substances is probably too low for the result to be significant; going back to the original dataset confirms that the counts are indeed small. These probabilities are therefore not strong evidence of likelihood and should be treated only as rough indicators.

Fig 6: Sensitivity Analysis per type of industry.

As can be seen in Fig 6, the Retail Trade industry has the highest likelihood of producing violations. This is counterintuitive, but it can be attributed to the lax attitude adopted by companies not traditionally associated with occupational exposure.


We now move to generating association rules using a technique based on K-medoid clustering. To increase the relevance of the rules and the chance of finding strong ones, a K-medoid clustering algorithm was first run on the dataset, grouping the attributes into two clusters: the first containing State, and the second containing Season, Type of Industry, Substance, and Violation or Not. A sketch of how such an attribute clustering can be reproduced is shown below.
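The report does not give the exact clustering implementation; the sketch below shows one way an attribute-level K-medoid clustering could be reproduced in R, under the assumption that the attributes are one-hot encoded and the resulting binary columns are clustered with PAM (a row sample is taken only to keep the distance computation manageable).

library(cluster)

# One-hot encode the categorical variables on a sample of rows
idx <- sample(nrow(QueryAmey1), 50000)
X <- model.matrix(~ state + Season + substance + Type.of.Industry +
                    Violation.or.Not - 1, data = QueryAmey1[idx, ])

# Dissimilarity between attribute columns, then K-medoid (PAM) with two clusters
D <- dist(t(X), method = "binary")
clusters <- pam(D, k = 2)
split(colnames(X), clusters$clustering)   # attribute membership in the two clusters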

The Association Engine was then run on both clusters to discover previously unknown relations. The results are useful during inspections to improve violation detection; some of the discovered rules are presented along with the parameters used to generate them. From the first cluster, we get:

Minimum support: 0.01 (8001 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 110

Type of Industry=Construction Violation or Not=Violation 67399 ==> substance=Manganese Fume (as Mn) 12211    conf:(0.18) < lift:(3.39)> lev:(0.01)
substance=Lead, Inorganic (as Pb) Type of Industry=Construction 59221 ==> Violation or Not=Violation 13941    conf:(0.24) < lift:(2.35)> lev:(0.01)

Minimum support: 0 (800 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 20


substance=Asbestos (all forms) Season=Summer Type of Industry=Construction 6464 ==> Violation or Not=Violation 2267    conf:(0.35) < lift:(3.5)> lev:(0)
substance=Mercury (Vapor) (as Hg) Type of Industry=Construction 2098 ==> Violation or Not=Violation 1355    conf:(0.65) < lift:(6.44)> lev:(0)

Minimum support: 0 (80 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 10

Type of Industry=Finance,Insurance,Real Estate Violation or Not=Violation 4113 ==> substance=Formaldehyde Season=Summer 251    conf:(0.06) < lift:(13.63)> lev:(0)
substance=Carbon Black 440 ==> Season=Summer Type of Industry=Construction Violation or Not=Violation 92    conf:(0.21) < lift:(6.16)> lev:(0)
substance=p-Dichlorobenzene 328 ==> Season=Summer Type of Industry=Construction Violation or Not=Violation 127    conf:(0.39) < lift:(11.41)> lev:(0)
Season=Winter Type of Industry=Construction Violation or Not=Violation 22373 ==> substance=Asbestos (all forms) 2205    conf:(0.1) < lift:(3.2)> lev:(0)
substance=Cadmium Dust (as Cd) 1671 ==> Violation or Not=Violation 566    conf:(0.34) < lift:(3.38)> lev:(0)

Now moving to the 2nd cluster,

state=MT 10710 ==> Violation or Not=Violation 1824 conf:(0.17) < lift:(1.7)> lev:(0)


state=RI 11900 ==> Violation or Not=Violation 1803 conf:(0.15) < lift:(1.51)> lev:(0)

state=NJ 51510 ==> Violation or Not=Violation 7175 conf:(0.14) < lift:(1.39)> lev:(0)

state=CT 11910 ==> Violation or Not=Violation 1591 conf:(0.13) < lift:(1.33)> lev:(0)

state=ME 12573 ==> Violation or Not=Violation 1644 conf:(0.13) < lift:(1.3)> lev:(0)

The arrows denote the direction of implication of each rule. The confidence and lift values shown next to each association rule are measures of uncertainty that also describe the strength of the rule. One interesting outcome of the exercise was that as the minimum support went down, the number of rules generated increased exponentially; it is therefore recommended to keep the minimum support and lift as high as possible. The meaning of these association rules is self-explanatory. Summer seems to be the season in which many associations are formed, suggesting a link between summer and the possibility of a violation. A significant number of substances appear frequently in the association rules and should be taken note of. All of these association rules are highly significant.

One of the most interesting applications of the principles of association rules and Bayesian Networks was in exploring synergistic effects of the various substances. Synergistic effects of exposure occur when the human body is exposed to several substances that attack the same organ. Here we target only chemicals that cause neurologic and neurobehavioral damage (Class HE 7). As an example, Manganese Fume (A) and Lead, Inorganic (as Pb) (B) are selected for testing synergy. The procedure for calculating the lift is demonstrated in the example below for the construction industry.


Table 2: Calculating lift from the synergistic relationship.

Lift = P(A ∩ B) / (P(A) · P(B)) = 0.3257 / (0.015 × 0.017) = 1277.25

Thus, it can be seen that the likelihood of a violation for Lead, Inorganic (as Pb), given the knowledge that Manganese Fume has already caused a violation, is very high. Similar synergistic effects can be explored for the other substances too, e.g. with the short calculation sketched below.
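As a minimal numeric check, the lift quoted above can be recomputed from the probabilities in Table 2:

# Lift from the joint and marginal probabilities quoted in Table 2
lift <- function(p_ab, p_a, p_b) p_ab / (p_a * p_b)
lift(0.3257, 0.015, 0.017)   # ~1277.25, matching the value reported above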

Conclusion:

Through this study we now have a framework to address questions about the likelihood of violations given certain information. We have also generated a set of association rules to act as guidelines when looking for new potential sites or when a complaint is received, depending on the circumstances. Finally, we have described a procedure to test for synergy in the dataset using Bayesian Networks and association rules.

Future Work:

We can look for trends to explore the effect of time on violation probability, i.e., are industries improving their health and safety records? Is a substance causing more violations as time progresses? These questions would reveal whether attitudes towards health and safety in industry are improving or deteriorating.


Classifying substances by their type may give interesting results, since similar substances are handled in similar ways. This clustering can be done based on their chemical property information, which is readily available.

Acknowledgements:

The basic research idea for this study was proposed by Prof Gernand. I am grateful to

him for giving me the chance to work on this dataset.

References:

[1] OSHA enforcement data. [Online]. Available: http://oshadata.peer.org

[2] M. Karlin, "The US Government Must Address Toxic Chemical Exposure in the Workplace," 2015. [Online]. Available: http://www.truth-out.org/buzzflash/commentary/us-government-needs-to-do-more-abouttoxic-chemical-exposure-in-the-workplace/19344-us-government-needs-to-do-more-about-toxicchemical-exposure-in-the-workplace

[3] National Council for Occupational Safety and Health, "The OSHA Inspection: A Step-by-Step Guide." [Online]. Available: https://www.osha.gov/dte/grant_materials/fy10/sh-20853-10/osha_inspections.pdf

[4] US Dept of Labor, "Chemical Exposure Health Data." [Online]. Available: https://www.osha.gov/opengov/healthsamples.html

[5] D. M. Cowan, T. J. Cheng, M. Ground, J. Sahmel, A. Varughese, and A. K. Madl, "Analysis of workplace compliance measurements of asbestos by the U.S. Occupational Safety and Health Administration (1984-2011)," Regul. Toxicol. Pharmacol., vol. 72, no. 3, pp. 615-629, Aug. 2015.

[6] M. Plasse, N. Niang, G. Saporta, A. Villeminot, and L. Leblond, "Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set," Comput. Stat. & Data Anal., vol. 52, no. 1, pp. 596-613, Sept. 2007.


[7] T. Djatna and I. Alitu, "An Application of Association Rule Mining in Total Productive Maintenance Strategy: An Analysis and Modelling in Wooden Door Manufacturing Industry," Ind. Eng. & Serv. Sci., vol. 4, pp. 336-343, 2015.

[8] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), Santiago, Chile, Sept. 1994, pp. 487-499.


Appendix:
Code for R Bayesian Network

# Load the processed data and drop the ZIP code column
QueryAmey1 <- read.csv("~/QueryAmey1.csv")
View(QueryAmey1)
QueryAmey1$zip_code <- NULL

# Structure learning requires the bnlearn package (hc() is not in base R)
library(bnlearn)

# Hill-climbing structure learning with the default (BIC) score
res <- hc(QueryAmey1)
plot(res)

# Variant with a whitelisted edge Season -> Type of Industry
whitelist <- data.frame(from = c("Season"), to = c("Type.of.Industry"))
res2 <- hc(QueryAmey1, whitelist = whitelist)


plot(res2)

# Variant with a whitelisted edge Season -> Violation or Not
whitelist <- data.frame(from = c("Season"), to = c("Violation.or.Not"))
res3 <- hc(QueryAmey1, whitelist = whitelist)
plot(res3)

# Final model: blacklist the direct edge state -> Violation or Not
blacklist <- data.frame(from = c("state"), to = c("Violation.or.Not"))
res4 <- hc(QueryAmey1, blacklist = blacklist)
plot(res4)

# Fit the conditional probability tables and query the network
fittedbn <- bn.fit(res4, data = QueryAmey1)
cpquery(fittedbn, (Violation.or.Not == "Violation"), (substance == "Formaldehyde"))
# [1] 0.2036199

Code for Random Forest

%% Data set imported using the Import Data tab as a Table (very important)
% Extract the predictor and response columns from the imported table
ZIP   = QueryAmey1(1:size(QueryAmey1,1),2);   % ZIP code (not used in the final model)
STATE = QueryAmey1(1:size(QueryAmey1,1),1);   % State
SUB   = QueryAmey1(1:size(QueryAmey1,1),3);   % Substance
SEA   = QueryAmey1(1:size(QueryAmey1,1),4);   % Season
TYPE  = QueryAmey1(1:size(QueryAmey1,1),5);   % Type of Industry
BIN   = QueryAmey1(1:size(QueryAmey1,1),6);   % Target: Violation or Not

%f=fitctree([STATE ZIP SUB SEA TYPE],BIN,'CategoricalPredictors','all');   % single tree (unused)

% Random forest: 100 trees, in-bag fraction 0.5, all 4 predictors sampled at each split
DD=TreeBagger(100,[STATE SUB SEA TYPE],BIN,'CategoricalPredictors','all','InBagFraction',0.5,'Method','classification','NumPredictorsToSample',4,'OOBPredictorImportance','on','OOBPrediction','on');

% Naive Bayes classifier for comparison
d=fitcnb([STATE SUB SEA TYPE],BIN);

X=[STATE SUB SEA TYPE];
[label,Posterior,Cost]=predict(d,X);   % Naive Bayes predictions
[Yfit,scores]=predict(DD,X);           % random forest predictions


%cc=fitcsvm([STATE ZIP SIC SUB SEA],BIN);   % SVM (unused)

% Cross-validation of both models
CV=d;
CV1=DD;
defaultCVMdl = crossval(CV);
defaultLoss = kfoldLoss(defaultCVMdl);      % Naive Bayes cross-validation error
defaultCVMdl2 = crossval(CV1,'KFold',10);
defaultLoss1 = kfoldLoss(defaultCVMdl2)     % random forest cross-validation error

%view(f)

bar(d.predictorImportance)
% Permutation-based predictor importance from the random forest (cf. Fig. 1)
bar(DD.OOBPermutedPredictorDeltaError);

Dataset:

Submitted after Presentation.

List of States

levels(QueryAmey1$state) according to index number used

[1] "AK" "AL" "AR" "AS" "AZ" "CA" "CO" "CT" "CZ" "DC" "DE" "FL" "FN" "GA" "GU" "HI"
"IA" "ID" "IL" "IN" "JQ" "KS" "KY" "LA"

[25] "MA" "MD" "ME" "MI" "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ"
"NM" "NV" "NY" "OH" "OK" "OR" "PA" "PI" "PR" "RI"

[49] "SC" "SD" "TN" "TX" "UT" "VA" "VI" "VT" "WA" "WI" "WV" "WY"

