Learning Methods
Amey Vidvans
EGEE 597
Final Report-EGEE 597
Amey Vidvans
Problem Description:
Machine learning is used in a wide range of fields to uncover hidden knowledge in
data. The idea of this project is to use a dataset of Chemical Exposure Health
Data, containing information about inspections and the chemicals tested, to improve the
knowledge available during inspections. There is also an urgent need to improve the testing and
inspection process given information such as the type of company, because
occupational chemical exposure is a serious issue. Critics of OSHA contend that it is not
doing enough to enforce regulations at workplaces. According to [1], [2], occupational
chemical exposure causes more than 50,000-100,000 premature deaths per year in the US
alone. Also, it would take 600 years for OSHA to test fewer than half of the industrial
facilities. There is also a subjective element to how inspections are carried out [3]. When
expand or curtail the inspection to include or exclude substances, sites. Thus, there is a
need to apply machine learning methods to aid OSHA inspectors in determining sites
and substances to be tested based on the complaint received. The techniques discussed
here can be applied and an exhaustive list of association rules can be created to create
Dataset Description:
The dataset, as mentioned earlier, is a Chemical Exposure Dataset obtained from OSHA
that records all relevant information regarding an inspection. Thus, each data record in the
describing each data record. A few are mentioned below along with comments regarding
State: Identifies the site state where the inspection was performed.
Zip Code: Identifies the site zip code where the inspection was performed.
SIC Code: Indicates the 4-digit Standard Industrial Classification Code from the 1987
IMIS Substance Code: IMIS substance code number is the substance code assigned by
Substance: Chemical name; primarily lists substances as they appear in the OSHA PELs, 29
CFR 1910.1000, TABLES Z-1-A, Z-2, Z-3; the ACGIH TLVs; or by common name.
Sample Result: Sample result from laboratory analysis for each sample submitted with a
raw form needed to be processed to glean relevant required information. The datafields
were extracted with queries in MS Access and discretized into categorical
variables. For example, the date on which a sample was taken was converted into a Season
categorical variable. Using the SIC code classification, the SIC code was converted into the
categorical variable Type of Industry. Finally, based on the sample results, the target
categorical variable Violation or Not was created using cutoff values taken from
are the most stringent exposure limits, more stringent even than the OSHA
PELs (Permissible Exposure Limits). Overall, of the more than 700 substances tested,
only 434 had acceptable limits. Where ACGIH limits were not available and the
substance had an OSHA PEL or NIOSH limit, that cutoff value was used
instead. A full listing of substances and their limits, with an explanation of each cutoff value,
is attached in the Appendix. The main drawback of this dataset is that it is not
comprehensive: not every industrial facility is included, and not every substance is
tested. There was also a lot of missing data in the
form of missing units of measurement and non-standard measuring units. Here, the last
available unit of measurement in the dataset was carried forward until the next description of the unit
appeared in the dataset. For example, if the last recorded unit was M
(micrograms per m3), it was used to determine violations in the intermediate rows with
missing values until the next unit appeared. Thus, last value imputation was used. A
significant portion of the dataset is nevertheless lost, which can impact the inferences that can be made:
from a total of 2114613 entries initially, the data records went down to 800145. ZIP Code
was discarded from the analysis as it was thought that the variable State would
adequately explain geographic effects. Thus, the final variables are State, Season,
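The preprocessing steps described in this section can be sketched as follows. This is a minimal illustration in Python, not the MS Access queries actually used. The field names, the month-to-season convention, the use of SIC divisions for Type of Industry, the "result exceeds cutoff" rule, and all data values are assumptions made for the example.

```python
from datetime import date

# Assumed convention (meteorological seasons); the report does not state one.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter", 3: "Spring", 4: "Spring",
           5: "Spring", 6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}

# 1987 SIC divisions by 2-digit major group; using divisions as
# "Type of Industry" is an assumption about the report's mapping.
SIC_DIVISIONS = [
    (1, 9, "Agriculture, Forestry, Fishing"), (10, 14, "Mining"),
    (15, 17, "Construction"), (20, 39, "Manufacturing"),
    (40, 49, "Transportation and Public Utilities"),
    (50, 51, "Wholesale Trade"), (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance, Real Estate"), (70, 89, "Services"),
    (91, 99, "Public Administration"),
]


def industry_of(sic_code):
    """Map a 4-digit SIC code to its division name."""
    major_group = int(sic_code) // 100
    for low, high, name in SIC_DIVISIONS:
        if low <= major_group <= high:
            return name
    return "Unknown"


def fill_units_forward(records):
    """Last-value imputation: carry the last seen unit into rows with no unit.

    Rows that appear before any unit has been seen cannot be imputed
    and are dropped.
    """
    filled, last_unit = [], None
    for rec in records:
        unit = rec.get("unit") or last_unit
        if unit is None:
            continue
        last_unit = unit
        filled.append({**rec, "unit": unit})
    return filled


def label_violation(rec, cutoffs):
    """Label a row against the cutoff for its (substance, unit) pair.

    Returns None when no cutoff is known, so the caller can discard the row.
    """
    limit = cutoffs.get((rec["substance"], rec["unit"]))
    if limit is None:
        return None
    return "Violation" if rec["result"] > limit else "No Violation"


# Invented rows and an invented cutoff; these are not values from the dataset.
rows = [
    {"substance": "X", "result": 0.8, "unit": "M", "sic": 5411,
     "date": date(2010, 7, 4)},
    {"substance": "X", "result": 0.1, "unit": None, "sic": 2011,
     "date": date(2010, 12, 1)},
]
cutoffs = {("X", "M"): 0.5}

processed = [
    {"season": SEASONS[r["date"].month],
     "industry": industry_of(r["sic"]),
     "violation": label_violation(r, cutoffs)}
    for r in fill_units_forward(rows)
]
print(processed)
```

The second row has no unit, so it inherits "M" from the row before it. This is the last value imputation described in the text.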
Literature Survey:
No relevant study was found with an objective similar to that of this study. In fact, the
author could find only one instance of this dataset being used in research [5], and that
was for a single substance, asbestos. Thus, building a single model to describe all
substances and geographic attributes was a challenge. However, on examining the dataset,
it was determined that it could be treated as binary sparse data: each
substance, for example, could be represented as a binary variable. Plasse et al. [6] studied
binary sparse matrices with numerical attributes to determine association rules in the
automobile industry using generated clusters. Their work combined clustering and
association rule mining to determine relevant links. Thus, the work presented in this
study is indeed novel. The model constructed for this study could easily be used for
accident prediction analysis using OSHA Injury Reports, with a few changes to the inputs
and variables. Association rule mining has long been used to determine patterns in
datasets [7].
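To illustrate what treating the dataset as binary sparse data means, the sketch below turns each record into a set of attribute=value items. Each set corresponds to one row of a 0/1 indicator matrix in which only a few columns are set. The records are invented for illustration, and the helper names are hypothetical.

```python
def to_items(record):
    """Encode a record as a set of 'attribute=value' items (its non-zero columns)."""
    return frozenset(f"{key}={value}" for key, value in record.items())


def to_indicator_matrix(records):
    """Build the dense 0/1 matrix, only to show what the sparse form stands for."""
    transactions = [to_items(r) for r in records]
    columns = sorted(set().union(*transactions))
    matrix = [[1 if col in t else 0 for col in columns] for t in transactions]
    return columns, matrix


# Invented records, not taken from the OSHA dataset.
records = [
    {"state": "IA", "season": "Summer", "substance": "Formaldehyde"},
    {"state": "NJ", "season": "Fall", "substance": "Formaldehyde"},
]
columns, matrix = to_indicator_matrix(records)
density = sum(map(sum, matrix)) / (len(matrix) * len(columns))
print(columns)
print(matrix, density)
```

With hundreds of substances and dozens of states, every row still has exactly one item per attribute, so the density of the matrix falls quickly as the number of distinct values grows.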
Methods:
The techniques used in this work were Random Forest, Bayesian Networks, and the WEKA
Association Engine. The author determined that the first step was to treat the problem as a
classification problem and start with a Random Forest. Its initial estimates
of variable importance are also useful for building the Bayesian Network. It is a
robust technique and very useful as a benchmark model for classification problems. It also
gives posterior probabilities based on observed counts, which are the basis for classification. A
random forest with 100 decision trees, an in-bag fraction of 0.5, and sampling all predictors
at once (because the author feels that leaving out any variable is not advisable) is built. The
relevant arguments to be set are found in the code section.
For building the Bayesian Network, the R statistical language is used. A BIC score based hill
climbing algorithm is used, with blacklisted and whitelisted edges supplied as inputs
that were determined from domain knowledge. Once the network is compiled and the
conditional probability tables are created, queries about events given evidence
can be made. Bayesian
Networks are a relevant technique in this situation because the question of whether an observation
is a violation is not really a classification problem but a probabilistic one. The
probability of an observation being a violation depends upon the values taken on by the
modelled as multinomial distributions, which are like binomial distributions except that
they can model multiple discrete outcomes. The hill climbing algorithm determines
which links or edges are appropriate based on the cumulative BIC score after adding that
edge. The network with the lowest BIC score is finally selected. However, from domain
knowledge we know that certain edges must be included or excluded, which is done through
whitelisting and blacklisting respectively. The Bayesian network is then fitted over the data and
inferences can be made. Validation of this Bayesian Network was done on a subjective
For generating the association rules, the WEKA Associator engine is
used. The criteria for generating association rules are Delta, Min Support, Type of Metric,
and the number of rules required. Minimum support is the number of instances covered by an
association divided by the size of the dataset, expressed as a percentage. The support threshold
starts at a large value and is decreased by Delta (0.05) at every run of the engine. We
need as large a support and as large a lift as possible for an association rule to be
association rules including Violations (which are roughly 80000 in number). This number
will be successively lowered to get more interesting associations. The type of metric used
$$\text{Lift} = \frac{P(A \cap B)}{P(A) \cdot P(B)}$$
Thus, if lift is greater than 1, the events A, B are not independent and obviously depend
relationship or general trends in the dataset. Inspiration for such an analysis was taken
from the market basket analysis problem [8], solved by Agrawal et al. with the Apriori
algorithm. The Apriori algorithm works by generating large item sets that satisfy the
input thresholds and extending them to larger and larger item sets to determine association
rules, while maintaining the requirements of the metric used, which may be Confidence or
Lift. The author determined that an association rule with a confidence greater than 0.1 is
useful, since it does better than randomly guessing the outcome. The values
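The rule generation procedure described in this section can be sketched as a level-wise (Apriori-style) search that is rerun with a lower support threshold until enough rules are found. This is a simplified Python illustration, not WEKA's implementation. The function names, the default thresholds other than Delta = 0.05 and confidence 0.1, and the toy transactions are all assumptions.

```python
from itertools import combinations

EPS = 1e-12  # tolerance for floating-point comparisons of supports


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    while level:
        size = len(level[0]) + 1
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: k / n for c, k in counts.items() if k / n >= min_support - EPS}
        frequent.update(kept)
        # Candidates one item larger are unions of two frequent itemsets.
        level = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})
    return frequent


def rules_for(frequent, consequent, min_confidence):
    """Rules 'antecedent ==> consequent' as (antecedent, confidence, lift)."""
    target = frozenset([consequent])
    if target not in frequent:
        return []
    rules = []
    for itemset, support in frequent.items():
        antecedent = itemset - target
        if consequent in itemset and antecedent:
            confidence = support / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((set(antecedent), confidence, confidence / frequent[target]))
    return rules


def mine(transactions, consequent, wanted, min_confidence=0.1,
         start=1.0, delta=0.05, floor=0.05):
    """Lower the support threshold by delta until `wanted` rules are found."""
    steps = int(round((start - floor) / delta))
    for step in range(steps + 1):
        support = round(start - step * delta, 10)
        found = rules_for(frequent_itemsets(transactions, support),
                          consequent, min_confidence)
        if len(found) >= wanted:
            return support, found
    return None, []


# Toy transactions (invented): 10 records over two attributes.
transactions = (
    [frozenset({"s=Cd", "v=Yes"})] * 3
    + [frozenset({"s=Cd", "v=No"})]
    + [frozenset({"s=Pb", "v=No"})] * 6
)
support, rules = mine(transactions, "v=Yes", wanted=1)
print(support, rules)
```

Because every subset of a frequent item set is itself frequent, the antecedent of each rule is always present in the `frequent` dictionary when its confidence is computed.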
The random forest model gave the confusion matrix shown below.
Thus, we can see from the confusion matrix that the random forest, which is considered a
benchmark classifier, predicts only 13% of all violations. The probability of a violation predicted by the
random forest is roughly 2%. However, the probability of randomly picking a violation
from the dataset is roughly 10%. Thus, decision trees perform worse than randomly picking
reduce class imbalance. This, however, is not an acceptable solution in this case. The
cross-validation error was 0.0893. The posterior probabilities obtained from the random forests
are based on democratic (vote) counts and are heavily influenced by the splitting criterion
on Bayes Rule is required. Bayes Rule is used to infer posterior probabilities of events
based on a priori probabilities. A Naive Bayes classifier was also tried on the dataset.
It, however, performed even worse than the random forest in terms of sensitivity,
specificity, and cross-validation error. However, we can get predictor importance of the
The random variable substance plays the most important role in determining the
possibility of violation which was expected. This information was used to determine the
Thus, this indicates that the assumption that the random variables are independent is not
valid and there are conditional dependencies present in the data structure. We know this
as shown below in R. The child nodes are taken to be dependent on their parent nodes.
However, the last variable, Violation or Not, depends only on Substance once its
value is known, and does not depend on Type of Industry or its parent nodes.
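One consequence of this structure is that the conditional probability table for Violation or Not only needs to be indexed by Substance. As a rough sketch, such a table can be estimated from relative frequencies as shown below. This is a plain maximum-likelihood estimate in Python on invented records, not the bnlearn fit used in the study, and the function and field names are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_table(records, child, parent):
    """Estimate P(child | parent) from relative frequencies in the records."""
    joint = Counter((r[parent], r[child]) for r in records)
    parent_totals = Counter(r[parent] for r in records)
    table = defaultdict(dict)
    for (parent_value, child_value), count in joint.items():
        table[parent_value][child_value] = count / parent_totals[parent_value]
    return dict(table)


# Invented records: substance "A" has 1 violation in 4 samples,
# substance "B" has 1 violation in 2 samples.
records = (
    [{"substance": "A", "violation": "Yes"}]
    + [{"substance": "A", "violation": "No"}] * 3
    + [{"substance": "B", "violation": "Yes"},
       {"substance": "B", "violation": "No"}]
)
cpt = conditional_table(records, child="violation", parent="substance")
print(cpt)
```

A query such as "probability of a violation given the substance" is then a lookup in this table; queries that condition on other variables need the full network.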
The Bayesian Network is now ready for inference and queries can be made to it to
Iowa, it turns out, has a significantly greater chance of producing violations than the other
states. The rest are centered around the 0.1 probability, which is expected.
Again, this is also centered around the 0.1 probability with the Fall Season having a
The figure shows posterior probabilities of various substances based on knowledge that
a violation has occurred. Three substances, it turns out, have a 100% chance of producing a
the observation is not very reliable, as the number of observations for each substance is
probably too low to consider this a significant result. Going back to the original dataset,
this suspicion is confirmed. Thus, these probabilities are not very strong indicators
As can be seen in Fig 6, the industry Retail Trade has the highest likelihood of producing
violations. This is counterintuitive but can be attributed to the lax attitude adopted by
clustering. To increase the relevance of rules and the possibility of finding strong rules, a
K-medoid clustering algorithm was run on the dataset. The variables were grouped into 2
clusters: the first cluster contains State, and the second contains Season, Type of
Industry, Substance, and Violation or Not. The Association Engine was run on both
clusters to discover previously unknown relations. The results were very interesting and
are useful during inspections to improve violation detection. Some of the rules
discovered are presented along with the parameters used to generate those rules. From
substance=Cadmium Dust (as Cd) 1671 ==> Violation or Not=Violation 566 conf:(0.34) < lift:(3.38)> lev:(0)
state=MT 10710 ==> Violation or Not=Violation 1824 conf:(0.17) < lift:(1.7)> lev:(0)
state=RI 11900 ==> Violation or Not=Violation 1803 conf:(0.15) < lift:(1.51)> lev:(0)
state=NJ 51510 ==> Violation or Not=Violation 7175 conf:(0.14) < lift:(1.39)> lev:(0)
state=CT 11910 ==> Violation or Not=Violation 1591 conf:(0.13) < lift:(1.33)> lev:(0)
state=ME 12573 ==> Violation or Not=Violation 1644 conf:(0.13) < lift:(1.3)> lev:(0)
The arrows denote the direction of implication, from antecedent to consequent. The confidence and lift values shown
next to each association rule are measures of uncertainty that also describe the
strength of the rule. One interesting outcome of the exercise was that as the
minimum support went down, the number of rules generated increased exponentially. It
is recommended to have as high a value of Min Support and Lift as possible. The meaning
of these association rules is self-explanatory. Summer seems to be the season in which a lot
of associations are formed; this link between summer and
association rules should be noted. All these association rules are highly
significant.
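As a check on how to read the rule listings, the sketch below parses two of the rules quoted in this report and recomputes their confidence from the counts. In this output format the number after the antecedent is the count of records matching it, and the number after the consequent is the count of records matching both sides. The regular expression and the field names are assumptions written for these lines only, not a general WEKA parser.

```python
import re

# Pattern written for the rule lines quoted in this report (an assumption,
# not an official grammar for WEKA output).
RULE = re.compile(
    r"^(?P<lhs>.+?) (?P<n_lhs>\d+) ==> (?P<rhs>.+?) (?P<n_both>\d+)\s+"
    r"conf:\((?P<conf>[\d.]+)\)\s*<\s*lift:\((?P<lift>[\d.]+)\)>"
)


def parse_rule(line):
    """Parse one rule line and recompute its confidence from the counts."""
    match = RULE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized rule line: {line!r}")
    confidence = int(match["n_both"]) / int(match["n_lhs"])
    reported_lift = float(match["lift"])
    return {
        "antecedent": match["lhs"],
        "consequent": match["rhs"],
        "confidence": confidence,
        "reported_conf": float(match["conf"]),
        "reported_lift": reported_lift,
        # lift = confidence / P(consequent), so this recovers P(consequent).
        "implied_base_rate": confidence / reported_lift,
    }


lines = [
    "substance=Cadmium Dust (as Cd) 1671 ==> Violation or Not=Violation 566 "
    "conf:(0.34) < lift:(3.38)> lev:(0)",
    "state=MT 10710 ==> Violation or Not=Violation 1824 "
    "conf:(0.17) < lift:(1.7)> lev:(0)",
]
parsed = [parse_rule(line) for line in lines]
for rule in parsed:
    print(rule["antecedent"], round(rule["confidence"], 4),
          round(rule["implied_base_rate"], 4))
```

For both rules the implied base rate of violations comes out close to 0.1, which agrees with the roughly 10% violation rate reported for the dataset.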
One of the most interesting applications of the principles of association rules and
effects of exposure occur when the human body is exposed to various substances that
Targeting only chemicals that cause neurologic and neurobehavioral damage (Class HE
7). As an example, Manganese Fume (A) and Lead, Inorganic (as Pb) (B) are selected for
testing synergy. The procedure for calculating the lift is demonstrated in the example
$$\text{Lift} = \frac{P(A \cap B)}{P(A) \cdot P(B)} = \frac{0.3257}{0.015 \times 0.017} = 1277.25$$
Thus, it can be seen that the likelihood of a violation for Lead, Inorganic (as Pb), given
the knowledge that Manganese Fume has already caused a violation, is very high. Thus,
similar synergistic effects can be explored for the other substances too.
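Extending the same calculation to other substance pairs could look like the sketch below. It first reproduces the arithmetic of the worked example from the three probabilities given in the report, then computes the lift for every pair of substances that occur together. Grouping violations by inspection, the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def lift(p_ab, p_a, p_b):
    """Lift of two events from their joint and marginal probabilities."""
    return p_ab / (p_a * p_b)


def pairwise_lift(inspections):
    """Lift for every pair of substances that co-occur in an inspection.

    `inspections` is a list of sets, each holding the substances found in
    violation during one inspection (a hypothetical grouping of the data).
    """
    n = len(inspections)
    single = Counter(s for found in inspections for s in found)
    pairs = Counter(
        pair for found in inspections for pair in combinations(sorted(found), 2)
    )
    return {
        (a, b): lift(count / n, single[a] / n, single[b] / n)
        for (a, b), count in pairs.items()
    }


# Arithmetic of the worked example, using the probabilities given in the report.
report_lift = lift(0.3257, 0.015, 0.017)

# Invented inspections, for illustration only.
inspections = [{"Mn", "Pb"}, {"Mn", "Pb"}, {"Pb"}, {"Zn"}]
lifts = pairwise_lift(inspections)
print(round(report_lift, 2), lifts)
```

Pairs with a lift well above 1 are candidates for closer study; pairs seen in only a handful of inspections should be treated with the same caution as the low-count substances discussed earlier.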
Conclusion:
Thus, through this study we have a framework to address the questions about likelihood
of violations given certain information. We have also generated a set of association rules
to act as guidelines when looking for new potential sites or when a complaint is received
depending on the circumstances. We have also looked at a procedure to test for synergy
Future Work:
We can look for trends to explore the effect of time on violation probability. For example, are industries
improving their health and safety records? Is a substance causing more violations as time
progresses? These questions will reveal if attitudes towards health and safety in industry
Classifying substances by their type may give interesting results, since similar substances
are handled in similar manners. This clustering can be done based on their chemical
Acknowledgements:
The basic research idea for this study was proposed by Prof Gernand. I am grateful to
References:
[2] Mark Karlin (2015). The US Government Must Address Toxic Chemical
Exposure in the Workplace [online]. Available:
http://www.truth-out.org/buzzflash/commentary/us-government-needs-to-do-more-abouttoxic-chemical-exposure-in-the-workplace/19344-us-government-needs-to-do-more-about-toxicchemical-exposure-in-the-workplace
[8] R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large
databases. Proceedings of the 20th International Conference on Very Large Data
Bases (VLDB), pp. 487-499, Santiago, Chile, Sept. 1994.
Appendix:
Code for R Bayesian Network
> library(bnlearn)
> View(QueryAmey1)
> QueryAmey1$zip_code <- NULL   # drop ZIP code; State is used for geography
> res <- hc(QueryAmey1)         # hill-climbing structure learning
> plot(res)
> plot(res2)
> plot(res3)
> plot(res4)
> fittedbn <- bn.fit(res4, data = QueryAmey1)
> cpquery(fittedbn, event = (Violation.or.Not == "Violation"), evidence = (substance == "Formaldehyde"))
[1] 0.2036199
% MATLAB: split the imported table into one variable per column
STATE = QueryAmey1(:,1);
ZIP   = QueryAmey1(:,2);
SUB   = QueryAmey1(:,3);
SEA   = QueryAmey1(:,4);
TYPE  = QueryAmey1(:,5);
BIN   = QueryAmey1(:,6);
[label,Posterior,Cost]=predict(d,X);
[Yfit,scores]=predict(DD,X);
CV=d;
CV1=DD;
defaultCVMdl = crossval(CV);
defaultLoss = kfoldLoss(defaultCVMdl);
defaultCVMdl2 = crossval(CV1,'KFold',10);
defaultLoss1 = kfoldLoss(defaultCVMdl2)
%view(f)
bar(d.predictorImportance)
bar(DD.OOBPermutedPredictorDeltaError);
Dataset:
List of States
[1] "AK" "AL" "AR" "AS" "AZ" "CA" "CO" "CT" "CZ" "DC" "DE" "FL" "FN" "GA" "GU" "HI"
"IA" "ID" "IL" "IN" "JQ" "KS" "KY" "LA"
[25] "MA" "MD" "ME" "MI" "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ"
"NM" "NV" "NY" "OH" "OK" "OR" "PA" "PI" "PR" "RI"
[49] "SC" "SD" "TN" "TX" "UT" "VA" "VI" "VT" "WA" "WI" "WV" "WY"