Interpretable Automated Machine Learning in Maana Knowledge Platform™

Alexander Elkholy, Fangkai Yang, Steven Gustafson
Maana, Inc.
Bellevue, WA
aelkholy@maana.io

arXiv:1905.02168v1 [cs.LG] 6 May 2019

Abstract—Machine learning is becoming an essential part of developing solutions for many industrial applications, but the lack of interpretability hinders wide industry adoption to rapidly build, test, deploy and validate machine learning models, in the sense that the insights gained while developing machine learning solutions are not structurally encoded, justified and transferred. In this paper we describe the Maana Meta-learning Service, an interpretable and interactive automated machine learning service residing in the Maana Knowledge Platform that performs machine-guided, user-assisted pipeline search and hyper-parameter tuning and generates structured knowledge about the decisions made during pipeline profiling and selection. The service is shipped with the Maana Knowledge Platform and is validated on benchmark datasets. Furthermore, its capability of deriving knowledge from pipeline search facilitates various inference tasks and transfer to similar data science projects.

I. INTRODUCTION

Machine learning is becoming an essential part of developing solutions for many industrial applications. Developers of such applications need to rapidly build, test, deploy and validate machine learning models. The validation of models is a key capability that will enable industries to more widely adopt machine learning for business decision making; however, this process suffers from a lack of interpretability. First, in most cases, validation begins with understanding how a machine learning model is developed - the pipeline from source data, through data processing and featurization, to model building and parameter tuning. The capability of understanding machine learning models also represents a key piece of domain knowledge: data scientists who understand how to make successful domain-specific machine learning pipelines will be in high demand across that domain. Unfortunately, this level of understanding remains somewhat of a "dark art" in that the knowledge and judgment used to find good domain-specific machine learning pipelines is usually found in the heads of the data scientists. Therefore, while it is possible to see the final machine learning pipeline, the steps the data scientist went through, and the compromises and decisions they made, are not captured. Once the model is delivered, most insights and assumptions related to the development of the solution are lost, making long-term sustainment difficult. Second, there is no clear way of encoding the empirical experience data scientists derive from developing data science solutions so that it can be applied to similar projects efficiently and shared among a group of data scientists in the organization; this causes repetitive work, low efficiency and inconsistency in the quality of solutions.

To facilitate rapid development and deployment of data science solutions, automated machine learning (AutoML) has gained more interest recently due to the availability of public dataset repositories and open source machine learning code bases. AutoML systems such as Auto-WEKA [1], Auto-SKLEARN [2] and TPOT [3] attempt to optimize the entire machine learning pipeline, which can consist of independent steps such as featurization, which encodes data in numeric form, or feature selection, which attempts to pick the best subset of features to train a model from data. Given sufficient computation resources, these systems can achieve good accuracy in building machine learning pipelines, but they do not provide clear explanations to justify the choice of models that can be verified by data scientists, and consequently the problem of interpretability remains unsolved.
To address this challenge, in this paper we describe the Maana Meta-learning service, which provides interpretable automated machine learning. The goal of this project is twofold. First, we hope that the efficiency of developing data science solutions can be improved by leveraging an automated search and profiling algorithm such that a baseline solution can be automatically generated for the data scientists to fine-tune. Second, we hope that such an automated search process is transparent to human users, and that through the learning process the service can return interpretable insights on the choice of models and hyper-parameters and encode them as knowledge. Contrasted with most AutoML systems that provide end-to-end solutions, the Maana Meta-learning service is an interactive assistant to data scientists that performs user-guided, machine-assisted automated machine learning. Data scientists specify a pre-determined search space, and the Meta-learning service then goes through several stages to perform model selection, pipeline profiling and hyper-parameter tuning. During this process, it returns intermediate results, and users can inject feedback to steer the search. Finally, it generates an optimal pipeline along with structured knowledge encoding the decision making process, leading to an interpretable automated machine learning process.

The Maana Meta-learning service features two components: (1) a knowledge representation that captures the domain knowledge of data scientists and (2) an AutoML algorithm that generates machine learning pipelines, evaluates their efficacy by sampling hyper-parameters, and encodes all the information about the choices made and the subsequent performance and parameters into the knowledge representation. The knowledge representation is defined using GraphQL (https://graphql.org/). Developed by Facebook as an alternative to the popular REST interface [4], GraphQL provides only a single API endpoint for data access, backed by a structured, hierarchical type system. Consequently, it allows us to define a knowledge taxonomy to capture concepts of machine learning pipelines, seamlessly populate facts into the predefined knowledge graph and reason with them. The AutoML algorithm, in charge of generating and choosing which pipelines to pursue, is based on the PEORL framework [5], an integration of symbolic planning [6] and hierarchical reinforcement learning [7]. Symbolic plans generated from a pre-defined symbolic formulation of a dynamic domain are used to guide reinforcement learning, and recently this approach has been generalized to improve the interpretability of deep reinforcement learning. In the setting of AutoML, generating machine learning pipelines is treated as a symbolic planning problem on an action description in action language BC [8] that contains actions such as preprocessing, featurizing, cross validation, training and prediction. The pipeline is then sent to execution, where each symbolic action is mapped to primitive actions in a Markov Decision Process [9] (MDP) space - ML pipeline components instantiated with random hyper-parameters - in order to learn the quality of the actions in the pipeline. The learning process is value iteration on R-learning [10], [11], where the cross-validation accuracy of the pipeline is used as the reward. After the quality of the current pipeline is measured, an improved ML pipeline is generated using the learned values, and the interaction with learning continues until no better pipeline can be found. This step is called model profiling. After that, a more systematic parameter sweeping is performed, i.e., model searching. This allows us to describe the pipeline steps in an intuitive representation and explore the program space more systematically and efficiently with the help of reinforcement learning.

In this paper, we demonstrate that Maana Meta-learning provides a decent baseline on a variety of data sets, involving both binomial and multinomial classification tasks on various data types. Furthermore, when knowledge instances are filled into the pre-defined knowledge schema, the insights derived from the Meta-learning process can be visualized as a knowledge graph, improving interpretability and facilitating knowledge sharing, sustainment and transfer to similar tasks. We show that an interactive process that leverages domain knowledge and user feedback to populate a structured knowledge graph addresses the interpretable automated machine learning sought after by industrial applications of data science.

II. RELATED WORK

In contrast with optimizing the selection of model parameters, the goal of the AutoML task is to optimize an entire machine learning pipeline.
That is, starting from the raw data, it concerns itself with everything, including the optimal selection of featurization, the selection of the algorithm and hyper-parameters, as well as the cohesive collection of these as an ensemble. The most recent and most relevant approaches to the AutoML paradigm are Auto-WEKA and Auto-SKLEARN. Auto-WEKA [1], [12] calls this the combined algorithm selection and hyperparameter optimization problem, or CASH. The approach is formalized as a Bayesian optimization problem where different pipelines are sequentially tested based on the performance of the last, using what is called a sequential model-based optimization (SMBO) formulation [13]. In combination with the algorithms available in the WEKA library [14], they provide a complete package targeted toward non-expert users, allowing them to build machine learning solutions without necessarily knowing the details required to do so. Auto-SKLEARN [2] approaches the AutoML task in much the same way, by treating it as a Bayesian optimization problem. However, the authors claim that by giving a "warmstart" to the optimization procedure, the time to reach performant pipelines is significantly reduced. That is, they pre-select possibly good configurations to begin the procedure; their goal is thus to increase efficiency and reduce the time to build. Additionally, instead of the WEKA library, they use the scikit-learn library [15]. The more recent system KeystoneML [16] uses techniques similar to database query optimization to optimize machine learning pipelines end-to-end, where each ML operator has a declarative logical representation. By comparison, our work has a different focus and scope. Instead of directly outputting the best machine learning pipeline as a one-shot solution, we focus on an interactive process where data scientists use the service to explore their predefined search space and refine their decisions. In this setting, Meta-learning provides "user-guided, machine-assisted" automated search, facilitates encoding knowledge and the decision making process, and addresses the challenge of interpretability of data science solutions. The interpretability of automated machine learning and the knowledge derived from the search algorithm are enabled by leveraging the PEORL framework [5], a combination of symbolic planning and reinforcement learning. Symbolic planning [6] generates possible sequences of actions to achieve a goal, which are pre-defined and logic-based (e.g. only certain data types are compatible with certain featurizers). Combined with R-Learning [10], feedback on actions taken is learned in order to generate new plans.

III. PRELIMINARIES

A. GraphQL and Maana Knowledge Platform

GraphQL is a unified layer for data access and manipulation. In a distributed system, it is located at the same layer as REST, SOAP and XML-RPC, that is, it is used as an abstraction layer to hide the database internals. A GraphQL schema consists of a hierarchical definition of types and the operations that can be applied on them, i.e., queries and mutations. GraphQL's type system is very expressive and supports features like inheritance, interfaces, lists, custom types and enumerated types. By default, every type is nullable, i.e., not every value specified in the type system or query has to be provided. Every GraphQL type system must specify a special root type called Query, which serves as the entry point for the query's validation and execution. One example of a GraphQL schema definition is shown as follows. It contains two types: Person, which contains a field name (the exclamation mark ! denotes non-nullable fields), a field age, a list of instances of books (denoted by brackets []) and a list of instances friends; and a type Book with a field title and a list of persons as authors. Furthermore, there are 3 queries that retrieve an instance of Person by name, an instance of Book by title, and a list of books by applying a filter. There is also a mutation that adds a person by providing a name.

type Person {
  name : String!
  age : Integer
  books(favorite: Boolean) : [Book]
  friends : [Person]
}
type Book {
  title : String!
  authors : [Person]
}
type Query {
  person(name : String!) : Person
  book(title : String!) : Book
  books(filter : String!) : [Book]
}
type Mutation {
  addPerson(name: String!) : Boolean
}

Such a schema provides a representational abstraction of the operations and the data they manipulate, and connects front-end query/mutation calls with back-end implementation details.
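For illustration, a query against the schema above can be posted to the single endpoint from any HTTP client. The following is a minimal sketch in Python, assuming a hypothetical local server at http://localhost:8080/graphql and the requests package; it is not the Maana production API:

import requests

# One endpoint serves all queries and mutations; the query itself
# selects which fields of the hierarchical type system to return.
# (Hypothetical local endpoint, for illustration only.)
ENDPOINT = "http://localhost:8080/graphql"

query = """
query {
  person(name: "Ada") {
    name
    books(favorite: true) { title }
  }
}
"""

resp = requests.post(ENDPOINT, json={"query": query})
resp.raise_for_status()

# GraphQL responses mirror the shape of the query under the "data" key.
print(resp.json()["data"]["person"])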
The Maana Knowledge Platform (https://www.maana.io/knowledge-platform/) is architected on GraphQL-based microservices whose type systems are connected with each other to form a Computational Knowledge Graph (CKG). Different from traditional semantic systems based on ontology and description logic [17], the CKG separates the conceptual modeling of data, the content of the data and the operations on the data. This separation enables a fluidity of modeling, allowing data from any source and in any format to be seamlessly integrated, modeled, searched, analyzed, operationalized and re-purposed. Each resulting model is a unique combination of three key components - subject-matter expertise, relevant data from silos, and the right algorithm - all of which are instrumental in optimizing operations and decision flows. Furthermore, the CKG is also dynamic, which means that it can represent conceptual and computational models. In addition, it can be used to perform complex transformations and calculations at interactive speeds, making it a game-changing technology for agile development of AI-driven knowledge applications.

B. PEORL Framework

PEORL [5] is a framework that integrates symbolic planning with reinforcement learning [7]. Using a symbolic formulation to capture high-level domain dynamics and planning with it, a symbolic plan is used to guide reinforcement learning to explore the domain, instead of performing random trial-and-error. Because domain knowledge significantly reduces the search space, this approach accelerates learning and also improves the robustness and adaptability of symbolic plans for sequential decision making. One instantiation of this framework in [5] uses action language BC to formulate the dynamic domain through a set of causal laws, i.e., preconditions and effects of actions and static relationships between properties (fluents) of a state. In particular, PEORL requires causal laws formulating a cumulative effect (plan quality) defined on a sequence of actions. For an action a executed at state s, such causal laws have the form

a causes quality = C + Z if s, ρ(s, a) = Z, quality = C

where ρ is a value that will be further updated by reinforcement learning. Reinforcement learning is achieved by R-Learning [10], [11], i.e., performing the value iteration

R_{t+1}(s_t, a_t) \xleftarrow{\alpha_t} r_t - \rho_t(s_t) + \max_a R_t(s_{t+1}, a),
\rho_{t+1}(s_t) \xleftarrow{\beta_t} r_t + \max_a R_t(s_{t+1}, a) - \max_a R_t(s_t, a)    (1)

to approximate the policy that achieves the maximal long-term average reward, using R for the relative action values and ρ for the gain (average) reward.

At any time t, given an action description in BC, an initial state I and a goal state G, PEORL uses an answer set solver such as CLINGO to generate a plan Π_t, i.e., a sequence of actions that transits the state from I to G. The actions are then sent to execution one by one, and value iteration (1) is performed. After that, the ρ values for all states s in plan Π_t are summed up to obtain the quality of the plan,

quality(\Pi_t) = \sum_{\langle s,a,s' \rangle \in \Pi_t} \rho(s)

and the ρ(s, a) values for all transitions ⟨s, a, s′⟩ are used to update the facts in the action description. A plan Π_{t+1} is then generated that not only satisfies the goal condition G, but also has a plan quality greater than quality(Π_t). This process terminates when the plan cannot be further improved.

Meta-learning concerns generating a machine learning pipeline with proper hyper-parameters to meet an objective, such as accuracy. This problem can be formulated as an interplay between generating a reasonable machine learning pipeline, viewed as a symbolic plan generated from a domain formulation of commonsense knowledge of data science, and evaluating the machine learning pipeline, viewed as executing actions and receiving rewards from the environment, derived from the objective. This approach allows the use of interpretable, explicitly represented expert knowledge to delineate the search space for proper pipelines along with their hyper-parameters, and also allows users to change their specification of the search space at run time, leading to more interpretable and transparent meta-learning. The details of the algorithm are described in Section IV-B.
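For concreteness, the R-learning update in (1) can be sketched in a few lines of Python. This is a minimal tabular illustration with fixed learning rates, not the service's actual implementation; the names and the reward source are assumptions:

from collections import defaultdict

# Tabular R-learning, following equation (1).
# R[s][a]: relative action value; rho[s]: average-reward (gain) estimate.
R = defaultdict(lambda: defaultdict(float))
rho = defaultdict(float)
ALPHA, BETA = 0.1, 0.05  # illustrative learning rates alpha_t, beta_t

def r_learning_update(s, a, reward, s_next, actions):
    best_next = max(R[s_next][a2] for a2 in actions)
    best_here = max(R[s][a2] for a2 in actions)
    # R(s,a) <- R(s,a) + alpha * (r - rho(s) + max_a' R(s',a') - R(s,a))
    R[s][a] += ALPHA * (reward - rho[s] + best_next - R[s][a])
    # rho(s) <- rho(s) + beta * (r + max_a' R(s',a') - max_a' R(s,a') - rho(s))
    rho[s] += BETA * (reward + best_next - best_here - rho[s])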
IV. METHODOLOGY

A. Knowledge Schema for Data Science

We first show the knowledge schema defined to capture the concepts and relationships in a data science solution. First, we define a machine learning interface

interface MachineLearningModel {
  id: ID
  algorithm: MachineLearningAlgorithm
  features: [Feature]
  preprocessor: Preprocessor
  saved: Boolean
  accuracy: Float
  timeToLearnInSeconds: Float
  labels: [Label]
}

where MachineLearningAlgorithm, Feature and Preprocessor are enumeration types, such as

enum MachineLearningAlgorithm {
  random_forest_classifier
  linear_svc_classifier
  gaussian_nb_classifier
  multinomial_nb_classifier
  logistic_classifier
  sgd_classifier
  gradient_boosting_classifier
}

The interface is implemented by every classifier, which is defined as a type that includes an ID and all values for its hyper-parameters, for instance

type LogisticClassifier
    implements MachineLearningModel {
  norm: String
  tolerance: Float
  C: Float
  balance: Boolean
  solver: String
  maxIterations: Int
}

In order to train a classifier, the training input consists of training data specified by a URL or path. Field inputs consist of user-defined preferences for applying featurizers to a column:

input FieldInput {
  name: String!
  type: FeatureType!
  featurizerName: [FeaturizerAlgorithm]
}

The user can also specify the minimal accuracy required for the search to stop based on a selection criterion (accuracy, F1, precision, recall), candidate models and candidate preprocessors. The full definition of the training input type is

input TrainingInput {
  modelId: ID
  minimumAccuracy: Float
  targetName: String!
  dataInput: TrainingDataInput!
  fields: [FieldInput]
  folds: Int
  selectionCriteria: Metric
  candidateModels: [MachineLearningAlgorithm]
  candidatePreprocessors: [PreprocessorAlgorithm]
  modelProfilingEpisode: Int
  modelSearchEpisode: Int
}

The mutation that triggers Meta-learning accepts a training input and outputs an instance of MachineLearningModel:

trainClassifier(input: TrainingInput!):
    MachineLearningModel

After a machine learning model is trained, it can be used to classify new input data in JSON format:

classifyInstances(modelID: ID!, data: JSON!)

which returns instances of the type [Label].

The schema defined above can be viewed as a taxonomy, or structure, of the data that will be used to store the results generated by the automated machine learning algorithm (Section IV-B). Encoding the knowledge this way enables the pipeline search process to be better understood, insights about deriving the optimal pipeline to be encoded, and inference tasks on Meta-learning to be performed.
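As an illustration, the trainClassifier mutation is invoked like any other GraphQL call. The sketch below assumes a hypothetical endpoint URL; since the paper only describes TrainingDataInput as "a URL or path", the dataInput shape and the selectionCriteria enum literal are illustrative assumptions rather than the actual API:

import requests

# Hypothetical endpoint; the real service runs inside the Maana platform.
ENDPOINT = "http://localhost:8080/graphql"

# The {url: ...} field below is an illustrative assumption about
# TrainingDataInput, which the paper abbreviates as "a URL or path".
mutation = """
mutation {
  trainClassifier(input: {
    targetName: "field_salary"
    dataInput: { url: "file:///data/adult.csv" }
    folds: 10
    selectionCriteria: accuracy
    candidateModels: [logistic_classifier, random_forest_classifier]
    candidatePreprocessors: [pca, noop]
    modelProfilingEpisode: 10
    modelSearchEpisode: 20
  }) {
    id
    algorithm
    accuracy
  }
}
"""

result = requests.post(ENDPOINT, json={"query": mutation}).json()
print(result["data"]["trainClassifier"])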
B. Automated Machine Learning

The automated machine learning algorithm is based on the PEORL framework, where pipeline generation is viewed as a symbolic planning problem, and pipeline evaluation as a reinforcement learning problem in which the actions manipulate data and the rewards are derived from cross validation scores. This approach significantly accelerates learning by incorporating domain knowledge to guide exploration, enables online injection of user feedback that changes the pipeline search in a flexible and easy way, and yields interpretable learning in the sense that learning is performed only on reasonable pipelines generated from pre-defined symbolic knowledge.

1) Representing Domain Knowledge: We use action language BC to represent the dynamic domain of ML operations and translate the causal laws into a corresponding answer set program (ASP). We first introduce three types of objects:

• Preprocessors, including
  – matrix decompositions (truncatedSVD, pca, kernelPCA, fastICA),
  – kernel approximation (rbfsampler, Nystroem),
  – feature selection (selectkbest, selectpercentile),
  – scaling (minmaxscaler, robustscaler, absscaler), and
  – no preprocessing (noop),
• Featurizers, including two standard featurizers for text classification, i.e., CountVectorizer and TfidfVectorizer.
• Classifiers, including logistic regression, Gaussian naive Bayes, linear SVM, random forest, multinomial naive Bayes and stochastic gradient descent.

We treat each operation in the pipeline as an action and describe its causal laws accordingly. First of all, the representation includes facts about compatibility with sparse vectors, such as

acceptsparse(random_forest_classifier)

facts about operators and data types, such as

compatible(integer,std_scaler)...

and actions such as data import, train and predict. The actions related to the configuration of machine learning pipelines are described as follows:

• Select featurizers for each column. The following rules describe the effects of initializing a featurizer for each column; after each column has one featurizer selected, the cumulative quality increments.

{initfeaturizer(F,Field,Y,k)} :-
    feature(F),
    has_attr(Field,Y,k),
    compatible(Type,F),
    has_type(Field,Type),
    datatype(Y,train),
    not modeltrained(Y,k),
    not featurizerinitialized(F,Field,Y,k).

featurizerinitialized(F,Field,Y,k) :-
    initfeaturizer(F,Field,Y,k-1),
    feature(F).
cost(Q+R,k) :-
    featurizationcompleted(k),
    ro(R,k), cost(Q,k-1).
cost(Q+10,k) :-
    featurizationcompleted(k),
    #count{R:ro(R,k)}=0, cost(Q,k-1).
featurizationcompleted(k) :-
    #count{Field:featurizerinitialized(_,Field,_,k)}=X1,
    #count{Field:has_field(_,Field)}=X2,
    X1=X2.

• Select preprocessor:

{initpreprocessor(P,Y,k)} :-
    featurizationcompleted(k),
    preprocessor(P),
    not modeltrained(Y,k),
    datatype(Y,train).

preprocessorinitialized(P,Y,k) :-
    initpreprocessor(P,Y,k-1),
    featurizationcompleted(k).
cost(Q+R,k) :-
    initpreprocessor(P,Y,k-1),
    ro(P,R,k-1), cost(Q,k-1).
cost(Q+10,k) :-
    initpreprocessor(P,Y,k-1),
    #count{R:ro(P,R,k-1)}=0, cost(Q,k-1).

• Crossvalidate. If we have tokens and labels from the data, we can cross validate the pipeline by choosing featurizers, a preprocessor and a classifier, with the effect that the model is validated. If the preprocessor or the classifier does not accept sparse vectors, the feature matrix needs to be transformed into a dense vector.

sparse :- has_type(X,text),
    #count{T:has_type(X1,T)}=1.
{crossvalidate(C,P,dense,T,k)} :-
    classifier(C),
    preprocessorinitialized(P,Y,k),
    has_attr(T,Y,k),
    not sparse,
    has_targetfield(data,T).
{crossvalidate(C,P,sparse,T,k)} :-
    classifier(C),
    has_attr(T,Y,k),
    preprocessorinitialized(P,Y,k),
    acceptsparse(P),
    acceptsparse(C),
    sparse.
modelvalidated(C,P,S,T,k) :-
    datatype(Y,train),
    crossvalidate(C,P,S,T,k-1).
modelvalidated(T,k) :-
    datatype(Y,train),
    crossvalidate(C,P,S,T,k-1).
cost(Q+R,k) :-
    ro(P,C,R,k-1),
    cost(Q,k-1),
    crossvalidate(C,P,S,T,k-1).
cost(Q+10,k) :-
    crossvalidate(C,P,S,T,k-1),
    #count{R:ro(P,C,R,k-1)}=0,
    cost(Q,k-1).

Besides the causal laws described above, all fluents are declared inertial, and concurrent execution of actions is prohibited except for initfeaturizer. Furthermore, to facilitate fast profiling, we use an empirical assignment of featurizers to data types, unless it is overridden by the user. This capability is enabled by leveraging the non-monotonic reasoning capability of answer set programming. The default applications of featurizers are: for the categorical data type, apply one_hot; for the float data type, apply std_scaler; for the integer type, apply min_max_scaler; for text, apply hashing_vectorizer (bag of words). One example of these rules is

:- initfeaturizer(F,Field,Y,k),
   has_type(Field,categorical),
   F!=one_hot,
   #count{F1:use_featurizer(Field,F1)}=0.

2) Generation of Pipeline: The generation of a pipeline is treated as a symbolic planning problem given the above formulation of the dynamic domain. The initial condition comes from two sources: (1) the data schema and (2) the configuration of the pipeline search space. The data schema consists of the column names and their data types extracted from a data source (e.g., a CSV file in which each column has been pre-labelled with its data type and one column is designated as the classification target). For instance, the following ASP file is generated for the Adult dataset (https://archive.ics.uci.edu/ml/datasets/Adult), where field_salary is the classification target:

datatype(adult_data,train).
datatype(adult_test,test).
has_field(data,field_age).
has_type(field_age,integer)...
has_targetfield(adult_data,field_salary).

The configuration of the pipeline search space comes from the user's specification of candidate featurizers, candidate preprocessors and candidate models. This information is translated into the ASP file as well. For example, the following file configures the pipeline search to be performed among the listed classifiers, preprocessors and featurizers:

classifier(linear_svc_classifier)...
feature(one_hot)...
preprocessor(pca)...

Furthermore, the user can override the default application choices of featurizers by specifying their own preference. For instance, if the user wants to use robustscaler for the column Age, then the following fact is appended to the ASP file:

use_featurizer(robust_scaler,field_age).

The planning goal to train a classifier is defined as

:- not modeltrained(adult_data,k), query(k).
:- query(k), cost(Q,k), Q<=1.

Alternatively, given a test file, the goal can also be classifying the test data, with the first constraint replaced by

:- not has_attr(adult_test,field_salary,k).

Given the initial condition and a goal, a plan can be generated by translating the action description above into ASP and running the answer set solver CLINGO:

1: import_train(adult_data)
2: initfeaturizer(one_hot,field_sex,adult_data)
   initfeaturizer(one_hot,field_race,adult_data)
   initfeaturizer(one_hot,field_education,adult_data)
   initfeaturizer(one_hot,field_workclass,adult_data)
   initfeaturizer(robust_scaler,field_age,adult_data)
3: initpreprocessor(nystroem,adult_data)
4: crossvalidate(gradient_boosting_classifier,nystroem,dense,field_salary)
5: train(gradient_boosting_classifier,nystroem,dense,field_salary)

The above output is a plan that achieves the goal from the initial state. The machine learning pipeline is encoded into operations step by step, including initializing the featurizers in step 2, initializing the preprocessor in step 3, and picking a classifier to perform cross validation in step 4.
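Plan generation of this kind can be reproduced with CLINGO's Python API. The following is a minimal sketch using a toy fragment of the domain - a single choice rule standing in for the full encoding above - rather than the actual service code:

import clingo  # pip install clingo

# Illustrative fragment only: a choice rule plus two facts,
# standing in for the full pipeline domain encoding.
program = """
feature(one_hot). feature(std_scaler).
1 { initfeaturizer(F) : feature(F) } 1.
"""

ctl = clingo.Control(["1"])        # ask for one answer set (one plan)
ctl.add("base", [], program)
ctl.ground([("base", [])])

# Each stable model corresponds to one candidate plan.
ctl.solve(on_model=lambda m: print("plan atoms:", m.symbols(shown=True)))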
In practice, the plan request is sent from a front-end UI or a GraphQL query, where there are other hyper-parameters that need to be specified, as described in Section V.

Currently we only allow one featurizer to be applied to a column. In the future, we will allow the user to specify multiple featurizers to be applied to a column, to enable more flexible feature specification and increase the expressivity of the search space of pipelines.

3) Pipeline Learning: The evaluation of a pipeline is performed using the PEORL-based procedure shown in Algorithm 1. The algorithm accepts a set of candidate classifiers, a set of candidate preprocessors, the user-specified application of featurizers to columns, and a training dataset. The input also includes the number of model profiling episodes and the number of cross validation folds. First of all, the initial state of the planning problem is generated (line 2) and a pipeline is generated (line 5). Each action in the pipeline is executed by calling a library of methods that wraps SCIKIT-LEARN; for featurization and preprocessing actions, a reward of -1 is given to promote shorter pipelines (line 11) and the value iteration of R-learning is performed (line 12). For a cross-validation action, random parameters are picked for the preprocessor, featurizers and classifier (line 16) to assemble a pipeline (line 17) and perform cross validation (line 18); this involves converting the feature matrix to dense form if specified. After that, a reward proportional to the average cross validation score is derived (line 19), and the R and ρ values are updated using R-Learning (line 20). After the whole pipeline is profiled, the pipeline quality is evaluated (line 27), the planning goal is updated (line 28) and the ρ values are written back into the symbolic formulation (line 29) to generate a new pipeline. When the pipeline cannot be further improved, the optimal pipeline is returned (line 7). In our application, the pipeline output takes the form of a serialized pickle object.

Algorithm 1 Pipeline Profiling

Require: candidate classifiers C, candidate preprocessors P, user-specified featurizer-column applications F, planning goal G = (:- not modeltrained(data), ∅), domain representation D, profiling episodes profiling_episodes, and cross-validation folds v.
 1: P_0 ⇐ ∅, Π ⇐ ∅
 2: generate initial state I from C, P, F
 3: while True do
 4:   Π_o ⇐ Π
 5:   solve planning problem Π ⇐ CLINGO.solve(I, G, D ∪ P_t)
 6:   if Π = Π_o then
 7:     return Π_o
 8:   end if
 9:   for action ⟨s, a, s′⟩ ∈ Π do
10:     if a ∈ {initfeaturizer(F, Col, Y), initpreprocessor(P, Col, Y)} where F ∈ F, P ∈ P, Col is a column name and Y is the training dataset name then
11:       reward ⇐ -1
12:       update R(a, s) and ρ_t^a(s) for action a
13:     end if
14:     if a ∈ {crossvalidate(C, P, D, Col, Y)} where C ∈ C, Col is a column name and Y is the training dataset name then
15:       for i < profiling_episodes do
16:         instantiate C, P, F by randomly sampling their hyper-parameters
17:         assemble a pipeline using C, P, F
18:         perform v-fold cross validation using C, F, P
19:         obtain reward ∝ cv_score
20:         update R(a, s) and ρ_t^a(s) for action a
21:         i ← i + 1
22:       end for
23:     else
24:       execute a
25:     end if
26:   end for
27:   quality(Π) ← Σ_{⟨s,a,s′⟩∈Π} ρ^a(s)
28:   update planning goal G ⇐ (A, quality > quality(Π))
29:   update facts P_t ⇐ {ρ(a) = z : ρ_t^a(s) = z}
30: end while

4) Interactive Process of Meta-Learning: Meta-Learning is by its nature a long-running job, especially with larger datasets. The PEORL algorithm applied to machine learning by itself does not scale well. Because of this, we have incorporated two features with the intention of improving performance. First, before running the PEORL algorithm, we have a broad search phase where we test each classifier once. We then restrict the algorithm to only the best performing classifier and continue the search process.
1) Model Selection (Phase 1). For each classifier C, call Algorithm 1 with C = {C}, using the chosen feature set and P = {noop}, and record the performance. Select the classifier C_0 based on the predefined selection criterion (accuracy, F1, precision, recall).
2) Pipeline Learning (Phase 2). Call Algorithm 1 with C = {C_0}, and P and F being the user-specified preprocessors and featurizers, and generate the optimal pipeline Π_1.
3) Parameter Sweeping (Phase 3). Perform grid search or random search over the hyper-parameters of Π_1 and return the final pipeline Π_2.

We also allow the user to gradually refine their preferences during the process. When the user sees the results, they may inject feedback by overriding any preset configurations above, at any time; this information will be picked up by the Meta-Learning search algorithm, changing its behavior towards the user's feedback in the next episode of planning and learning. The user can remove or add possible algorithms to test, cancel the current pipeline, stop a phase with the current best classifier, or stop the entire process and use the best classifier found.
V. SYSTEM IMPLEMENTATION

In the Maana Knowledge Platform, CSV files can be uploaded; each column becomes a field, and field types are automatically identified. The user can trigger the Meta-learning service by submitting a query through the GraphQL endpoint. The GraphQL input is used to generate part of the initial state for planning, and the Meta-learning service is triggered for pipeline search. Throughout the pipeline search process, the results are constantly written to the Maana Knowledge Platform according to the knowledge schema. The service is implemented in Python with the Graphene library to enable the GraphQL server and endpoints. It is deployed as a Docker image along with the other components of the Maana Knowledge Platform.

Additionally, another feature we use to improve performance is the parallelization of model building. Because the model profiling and model search episodes can be done in parallel, we use an asynchronous approach, where multiple workers are launched and each performs its own parameter sampling and cross validation on the dataset; the results are returned to a dispatcher that performs value iteration.

A. Example: Classify Spam Email

We show how the Meta-learning service performs on an example dataset obtained from the UCI machine learning repository: spam message detection (https://archive.ics.uci.edu/ml/datasets/Spambase). The dataset contains 4601 data entries with 57 float and integer features to detect whether a message is spam or not.

Once the Meta-learning service is running, the user can load a CSV file into a workspace in a Maana project. After that, the user launches the service through the GraphQL endpoint, specifying the feature fields and their related types, the candidate classifiers (logistic regression, random forest, linear SVC, SGD classifier) and the candidate preprocessors (noop, random trees embedding, truncated SVD, PCA, Nystroem, kernel PCA). The service performs 10-fold cross validation, 10 episodes of model profiling and 20 episodes of model search.

After the service is launched, it goes into the first phase, model selection. In this phase, it does not apply any preprocessors, applies only the default featurizers to the columns, and picks 10 sets of random hyper-parameters for each model to calculate the average cross validation accuracy. By the end of model selection (Phase 1), it stores the metric information into the knowledge graph, visualized in the first column of Fig. 1b. It shows that the most accurate model, based on cross validation results, is logistic regression. At this point, the user is notified that logistic regression has been selected, based on the predefined selection criterion. In Pipeline Learning (Phase 2), Meta-learning tries to find the best combination of preprocessors and featurizers using the selected classifier, following Algorithm 1. Since the user does not override any default selection of featurizers, one_hot_encoder is applied to categorical fields, and min_max_scaler is applied to integer fields. During this process, the ASP-based planner generates pipelines using the selected classifier, and the reinforcement learner evaluates the generated pipeline on the data, using rewards derived from the cross validation accuracy. The pipeline is gradually improved to the point where it no longer changes. By the end of Phase 2, the performance of selecting different preprocessors with the classifier is output in Fig. 1b.
Fig. 1: Meta-learning Service Overview. (a) The Meta-learning Service Architecture. (b) Results of the three stages. (c) The results in the Knowledge Graph.

Dataset               Featurizer                Preprocessor            Classifier           CV accuracy
Reuters 50/50         hashing vectorizer        none                    linear SVC           0.848
IMDB                  hashing vectorizer        none                    SGD classifier       0.879
Adult                 one hot, min max scaling  random trees embedding  logistic regression  0.8523
Spam detection        min max scaling           none                    logistic regression  0.927
Parkinsons detection  std scaler                none                    random forest        0.887
Abalone               std scaler                Nystroem                random forest        0.552
Car                   one hot                   Nystroem                gradient boosting    0.938

TABLE I: Baseline Pipelines Learned on Datasets

From the result, not performing any preprocessing has the best performance when used with logistic regression. Combined with the default featurizers, a pipeline using a min max scaler for integer fields, a one hot encoder for categorical fields, a random trees embedding as preprocessor and a logistic regression classifier is learned. Finally, during parameter sweeping (Phase 3), hyper-parameters are swept, leading to the final results in the third column. All of the intermediate search results are stored in the knowledge graph, shown as a snapshot in Fig. 1c. The upper part of the screenshot shows the knowledge schema organized as a knowledge graph; on clicking each schema node, the data instances are shown in the lower part of the workspace.
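For reference, the learned spam pipeline corresponds roughly to the following scikit-learn construction. This is a sketch of the resulting pipeline rather than the service's generated code; the CSV path and label column name are illustrative:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Illustrative path; Spambase has 57 numeric features plus a binary label.
df = pd.read_csv("spambase.csv")
X, y = df.drop(columns=["is_spam"]), df["is_spam"]

# Default numeric featurization (min-max scaling), no preprocessor,
# and a logistic regression classifier, as reported for spam detection.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# 10-fold cross validation, matching the service's evaluation setting.
print(cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean())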
During this process, based on the pre-defined candidate models, preprocessors and profiling episodes, the system evaluated 4 pipelines (each parameterized with 10 sets of hyper-parameters) in the model selection phase. During the pipeline learning phase, the 8 pipelines based on the selected classifier (logistic regression) and the candidate preprocessors were further evaluated (with 10 hyper-parameter sets tested in a single learning episode) until the optimal pipeline converged. This process does not provide systematic pipeline optimization and search. Instead, it leverages the decision space pre-defined by the data scientist to perform quick profiling and provide evidence for the data scientist to further refine their decisions.

B. Evaluation on Datasets

We evaluate the Meta-learning service on classification tasks using the default settings for featurizers. The datasets used include:

• The Reuters 50/50 dataset (https://archive.ics.uci.edu/ml/datasets/Reuter_50_50) contains 2,500 texts (50 per author) for author identification.
• The IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) contains 25,000 movie reviews obtained from IMDB. The classification task is to predict whether a movie review is positive or negative.
• The Adult dataset (https://archive.ics.uci.edu/ml/datasets/Adult) contains 48842 instances. Each instance has 14 fields, including age (integer), working class (categorical), education (categorical), capital gain (float), etc., that constitute the feature space to predict one of the two classes: salary > 50k or <= 50k.
• The spam email detection dataset (https://archive.ics.uci.edu/ml/datasets/Spambase) contains 4601 data entries with 57 float and integer features to detect whether a message is spam or not.
• The Parkinson's detection dataset (https://archive.ics.uci.edu/ml/datasets/Parkinsons) contains 197 instances with 22 float features to detect whether a person has Parkinson's disease based on vocal characteristics.
• The Abalone dataset (https://archive.ics.uci.edu/ml/datasets/Abalone) contains 4177 instances with 8 float and integer attributes to detect the sex of abalone.
• The Car evaluation dataset (https://archive.ics.uci.edu/ml/datasets/Car+Evaluation) contains 1728 instances with 6 categorical data fields to classify purchasing decisions.

Detailed results are shown in Table I. The results show that the Meta-learning service generates competitive baseline pipelines for the data scientist to further work on.

VI. CONCLUSION AND FUTURE WORK

The Meta-learning service provides a novel framework for machine learning pipeline search that is transparent, interpretable and interactive. It serves as a profiling tool for data scientists: by incorporating human knowledge, the Meta-learning service performs efficient pipeline generation and profiling in the search space delineated by the data scientists, allows feedback to be injected mid-process to alter the search space, and provides useful feedback for the data scientist to understand the best machine learning pipeline for the dataset of interest. While the Meta-learning service will by no means replace data scientists in finishing data science projects automatically, it can save a large amount of time on manual search and tuning. It is currently deployed in the Maana Knowledge Platform to help data scientists build machine learning solutions faster, with better insight, and to facilitate knowledge management and sharing across different projects.

This framework leaves several paths for improvement. Up until now we have only applied featurization to data based on its type. However, it is possible to perform some level of automated feature extraction and clean-up of the data. It should also be possible to use the knowledge gleaned from the meta-attributes to guide the algorithm. More advanced parameter optimization can also be applied.

REFERENCES

[1] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms," in Proc. of KDD-2013, 2013, pp. 847-855.
[2] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," in Advances in Neural Information Processing Systems, 2015, pp. 2962-2970.
[3] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, "Evaluation of a tree-based pipeline optimization tool for automating data science," in Proceedings of the Genetic and Evolutionary Computation Conference 2016. ACM, 2016, pp. 485-492.
[4] R. T. Fielding and R. N. Taylor, Architectural Styles and the Design of Network-Based Software Architectures. University of California, Irvine, doctoral dissertation, 2000, vol. 7.
[5] F. Yang, D. Lyu, B. Liu, and S. Gustafson, "PEORL: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making," in International Joint Conference on Artificial Intelligence (IJCAI), 2018.
[6] A. Cimatti, M. Pistore, and P. Traverso, "Automated planning," in Handbook of Knowledge Representation, F. van Harmelen, V. Lifschitz, and B. Porter, Eds. Elsevier, 2008.
[7] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[8] J. Lee, V. Lifschitz, and F. Yang, "Action language BC: A preliminary report," in International Joint Conference on Artificial Intelligence (IJCAI), 2013.
[9] M. L. Puterman, Markov Decision Processes. New York, USA: Wiley Interscience, 1994.
[10] A. Schwartz, "A reinforcement learning method for maximizing undiscounted rewards," in Proc. 10th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.
[11] S. Mahadevan, "Average reward reinforcement learning: Foundations, algorithms, and empirical results," Machine Learning, vol. 22, pp. 159-195, 1996.
[12] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 826-830, 2017.
[13] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in International Conference on Learning and Intelligent Optimization. Springer, 2011, pp. 507-523.
[14] E. Frank, M. Hall, and I. Witten, "The WEKA workbench. Online appendix for 'Data Mining: Practical Machine Learning Tools and Techniques'," Morgan Kaufmann, 2016.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[16] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht, "KeystoneML: Optimizing pipelines for large-scale advanced analytics," in Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 2017, pp. 535-546.
[17] F. Baader, D. Calvanese, D. McGuinness, P. Patel-Schneider, and D. Nardi, The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003.
