IT Shared
Where Knowledge meets Sharing
19 October 2015
Data Science Interview Questions
Posted by Alexey Grigorev
Source: Data Science: An Introduction
Our IT4BI Master's studies have finished, and the next logical step after graduation is finding a job. I was interested in Data Science jobs, and this post is a summary of my interview experience and preparation.
The term Data Science is not yet well established, so interviews for Data Science jobs might include a very broad range of questions, depending on how a particular company interprets the term. In this post I attempt to organize Data Science interview questions in some usable form, but it might also be biased by how I see Data Science myself. I hope you can also find it useful.
The sources of the questions are:
links that I discovered on the Internet,
my own data science interviews (being on the interviewee side).
The questions are without answers. First of all, the answers that I would write could be bad or wrong, and second, the post would be too big. Also, going through the list and looking for the answers yourself is a good exercise to prepare for an interview.
This list might look scary at first, but it's very unlikely that all of these questions will be asked during one interview. Very few jobs require applicants to know all of these points. So it's rather a broad overview of things that may potentially be asked. Don't let this list of questions discourage you if you don't know the answer to some of them: chances are that these questions are not important for your interview.
So, let's get started.
Table of Contents
Background Questions
Careers
http://www.itshared.org/2015/10/data-science-interview-questions.html 1/13
2/9/2016 IT Shared: Data Science Interview Questions
General Questions
Process
Mathematics
Linear Algebra
Other Areas
Probability and Statistics
Basic Probability
Distributions
Basic Statistics
Experiment Design
Point Estimates
Testing
A/B Tests
Bayesian Statistics
Time Series
Advanced
Machine Learning
General ML Questions
Regression
Classification
Regularization
Dimensionality Reduction
Cluster Analysis
Optimization
Recommendation
Feature Engineering
Natural Language Processing
Meta Learning
Miscellanea
Computer Science
Libraries and Tools
Databases
Distributed Systems and Big Data
Hands-On
Problem to Solve
Coding
Papers
Sources
Useful Links
Background Questions
Usually, interviews start with background questions: the interviewers can ask you to talk about yourself. This can also happen at the telephone interview stage.
Careers
For background questions, be ready to give a summary of your career.
Summarize your experience
What companies have you worked at? What was your role?
Do you have a project portfolio? What projects have you implemented? Discuss some of them in detail
For graduating students: Tell me about your master's thesis
For aspiring data scientists: Why do you want a career in data science?
What are your career goals?
General Questions
There may also be some questions not directly related to the projects you did, but rather to your (self-)education. For example:
What have you done to improve your data analysis knowledge in the past year?
What is the latest paper or book you read? Why did you read it and what did you learn?
What data science blogs do you follow?
Have you taken any data science related online courses? If yes, how many did you complete with a certificate?
Process
All Machine Learning, Data Mining and Data Science projects should follow some process, so there can be questions about it:
Can you outline the steps in an analytics project?
Have you heard of CRISP-DM (Cross-Industry Standard Process for Data Mining)?
CRISP-DM defines the following steps:
Problem Definition (called Business Understanding in CRISP-DM)
Data Understanding (or Data Exploration)
Data Preparation
Modeling
Evaluation
Deployment (to production)
So next you may discuss each of these steps in detail
What is the goal of each step?
What are possible activities at each step?
Mathematics
Some background in mathematics is necessary for doing Data Science, so you should expect math-related questions.
Linear Algebra
Basic Linear Algebra questions might include:
What is Ax = b? How to solve it?
How do we multiply matrices?
What is an Eigenvalue? And what is an Eigenvector? What is Eigenvalue Decomposition or the Spectral Theorem?
What is Singular Value Decomposition?
You can expect tons of Linear Algebra questions in the Machine Learning part of the interview (see below).
If you are interested in learning or refreshing Linear Algebra, see Best Time to Learn Linear Algebra is Now!
Other Areas
Discrete Mathematics and Logic are not that important for Data Science
Probability and Statistics are core skills and are discussed in the next section
Calculus and Optimization are usually discussed in the Machine Learning part, typically when talking about a particular algorithm
Probability and Statistics
Probability and Statistics is an important part of an interview, because it's the basis for Machine Learning. It is also useful if the company is doing some marketing or website optimization, so they could ask about related concepts such as A/B tests.
Basic Probability
You may get a couple of simple questions to check your understanding of probability.
For example:
Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
A simple question on Bayes' rule: Imagine a test with a true positive rate of 100% and a false positive rate of 5%. Imagine a population with a 1/1000 rate of having the condition the test identifies. Given a positive test, what is the probability of having that condition?
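One way to check your reasoning on the two questions above is to enumerate the outcomes and apply Bayes' rule directly. The following is only a small sketch: the function names are hypothetical, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def dice_sum_probability(target):
    """Probability that two fair six-sided dice sum to `target`."""
    outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs
    favourable = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(favourable, len(outcomes))


def posterior_given_positive(prior, true_positive_rate, false_positive_rate):
    """Bayes' rule: P(condition | positive test)."""
    # Total probability of a positive test, over people with and without the condition.
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_positive


p4 = dice_sum_probability(4)  # pairs (1,3), (2,2), (3,1)
p8 = dice_sum_probability(8)  # pairs (2,6), (3,5), (4,4), (5,3), (6,2)
posterior = posterior_given_positive(
    prior=Fraction(1, 1000),
    true_positive_rate=Fraction(1),
    false_positive_rate=Fraction(5, 100),
)
print(p4, p8, posterior, float(posterior))
```

The posterior comes out at roughly 2%, far below what intuition suggests, because false positives from the large healthy group outnumber the true positives.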
Distributions
You can expect questions about probability distributions:
What is the normal distribution? Give an example of some variable that follows this distribution
What about log-normal?
Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
How to check if a distribution is close to Normal? Why would you want to check it? What is a Q-Q plot?
Give examples of data that does not have a Gaussian distribution, or log-normal.
Do you know what the exponential family is?
Do you know the Dirichlet distribution? The multinomial distribution?
Basic Statistics
What is the Law of Large Numbers? The Central Limit Theorem?
Why are they important for Statistics?
What summary statistics do you know?
Experiment Design
Designing experiments is an important part of Statistics, and it's especially useful for doing A/B tests.
Sampling and Randomization
Why do we need to sample and how?
Why is randomization important in experimental design?
Some third-party organization randomly assigned people to control and experiment groups. How can you verify that the assignment truly was random?
How do you calculate the needed sample size?
Power analysis. What is it?
Biases
When you sample, what bias are you inflicting?
How do you control for biases?
What are some of the first things that come to mind when I do X in terms of biasing your data?
Other questions
What are confounding variables?
Point Estimates
Confidence intervals
What is a point estimate? What is a confidence interval for it?
How are they constructed?
Why do you need to standardize?
How to interpret confidence intervals?
Testing
Hypothesis tests
Why do we need hypothesis testing? What is a p-value?
What is the null hypothesis? How do we state it?
Do you know what Type I / Type II errors are?
What is a t-test / F-test / ANOVA? When to use it?
How would you test if two populations have the same mean? What if you have 3 or 4 populations?
You applied ANOVA and it says that the means are different. How do you identify the populations where the means are different?
What is the distribution of p-values, in general?
A/B Tests
What is A/B testing? How is it different from usual hypothesis testing?
How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? How familiar are you with A/B testing?
How can we tell whether our website is improving?
What are the metrics to evaluate a website? A search engine?
What kind of metrics would you track for your music streaming website?
Common metrics: engagement / retention rate, conversion, similar products / duplicates matching, how to measure them.
Real-life numbers and intuition: expected user behavior, reasonable ranges for user signup / retention rate, session length / count, registered / unregistered users, deep / top-level engagement, spam rate, complaint rate, ads efficiency.
Bayesian Statistics
In my interviews I didn't have any questions about Bayesian Stats, nor did I find a lot of questions on the Internet. But here are some:
Have you ever seen Bayes' Theorem?
Do you know what a conjugate prior is?
You might also get questions about Bayesian non-parametric models, but I'm not sure if it's common.
Time Series
What is a time series?
What is the difference between data for usual statistical analysis and time series data?
Have you used any of the following: time series models, cross-correlations with time lags, correlograms, spectral analysis, signal processing and filtering techniques? If yes, in which context?
In time series modeling, how can we deal with multiple types of seasonality, like weekly and yearly seasonality?
Advanced
Resampling
Explain what resampling methods are and why they are useful. Also explain their limitations.
Bootstrapping: how and why is it used?
How to use resampling for hypothesis testing? Have you heard of Permutation Tests?
How would you apply resampling to time series data?
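To make the resampling questions above more concrete, here is a minimal sketch of a two-sided permutation test for the difference of two means. It assumes the two samples are exchangeable under the null hypothesis, which is exactly why it should not be applied unchanged to autocorrelated time series. The function name and its parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_test(sample_a, sample_b, n_permutations=2000, seed=0):
    """Approximate two-sided p-value for H0: both samples share one distribution."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed labelling itself and avoid a p-value of 0.
    return (extreme + 1) / (n_permutations + 1)


control = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
treatment = [110, 112, 111, 113, 112, 111, 110, 112, 113, 111]
p_value = permutation_test(control, treatment)
print(p_value)
```

The bootstrap follows the same pattern, except that it draws samples with replacement from the observed data instead of shuffling labels.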
Machine Learning
In my experience, the Machine Learning part is usually the largest part of the interview. It may be a few basic questions, but it's helpful to be prepared for more in-depth Machine Learning questions, especially if you claim on your CV to have worked with it.
General ML Questions
The ML part may start with something like:
What is the difference between supervised and unsupervised learning? Which algorithms are supervised learning and which are not? Why?
What is your favorite ML algorithm and why?
And then go into details
Regression
Describe the regression problem. Is it supervised learning? Why?
What is linear regression? Why is it called linear?
Discuss the bias-variance tradeoff.
Linear Regression:
What is Ordinary Least Squares Regression? How can it be learned?
Can you derive the OLS Regression formula? (For the one-step solution)
Is the model [MathProcessingError] still linear? Why?
Do we always need the intercept term? When do we need it and when do we not?
What is collinearity and what to do with it? How to remove multicollinearity?
What if the design matrix is not full rank?
What is overfitting a regression model? What are ways to avoid it?
What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
What is Lasso regression? How is it different from OLS and Ridge?
Linear Regression assumptions:
What are the assumptions required for linear regression?
What if some of these assumptions are violated?
Significant features in Regression
You would like to find significant features. How would you do that?
You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. Why can it happen?
Your model considers one feature significant and another one not, but you expected the opposite result. Why can it happen?
Evaluation
How to check if the regression model fits the data well?
Other algorithms for regression
Decision trees for regression
k-Nearest Neighbors for regression. When to use?
Do you know others? E.g. Splines? LOESS/LOWESS?
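For the OLS derivation and the Ridge question above, the standard one-step (closed-form) solution can be sketched as follows. The notation is an assumption, not something fixed by this post: X is the n-by-p design matrix, y the response vector, β the coefficient vector and λ > 0 the ridge penalty.

```latex
\begin{aligned}
S(\beta) &= \lVert y - X\beta \rVert_2^2 = (y - X\beta)^\top (y - X\beta) \\
\nabla_\beta S(\beta) &= -2\, X^\top (y - X\beta) = 0 \\
X^\top X \beta &= X^\top y \\
\hat\beta_{\mathrm{OLS}} &= (X^\top X)^{-1} X^\top y
  \quad \text{(requires } X \text{ to have full column rank)} \\[4pt]
S_\lambda(\beta) &= \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\hat\beta_{\mathrm{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top y
\end{aligned}
```

Because X^T X + λI is positive definite for any λ > 0, the ridge solution exists even when the design matrix is not full rank, which connects the rank question to the Ridge question.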
Classification
Basic:
Can you describe what the classification problem is?
What is the simplest classification algorithm?
What classification algorithms do you know? Which one do you like the most?
Decision trees:
What is a decision tree?
What are some business reasons you might want to use a decision tree model?
How do you build it?
What impurity measures do you know?
Describe some of the different splitting rules used by different decision tree algorithms.
Is a big bushy tree always good? Why would you want to prune it?
Is it a good idea to combine multiple trees?
What is Random Forest? Why is it good?
Logistic regression:
What is logistic regression?
How do we train a logistic regression model?
How do we interpret its coefficients?
Support Vector Machines
What is the maximal margin classifier? How can this margin be achieved and why is it beneficial?
How do we train an SVM? What about hard-margin and soft-margin SVM?
What is a kernel? Explain the kernel trick
Which kernels do you know? How to choose a kernel?
Neural Networks
What is an Artificial Neural Network?
How to train an ANN? What is backpropagation?
How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?
What is deep learning? What is a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network)?
Other models:
What other models do you know?
How can we use a Naive Bayes classifier for categorical features? What if some features are numerical?
Tradeoffs between different types of classification models. How to choose the best one?
Compare logistic regression with decision trees and neural networks.
Regularization
What is Regularization?
Which problem does Regularization try to solve?
What does it mean (practically) for a design matrix to be ill-conditioned?
When might you want to use ridge regression instead of traditional linear regression?
What is the difference between L1 and L2 regularization?
Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
Let us go through the derivation of OLS or Logistic Regression. What happens when we add L2 regularization? How do the derivations change? What if we replace L2 regularization with L1 regularization?
Dimensionality Reduction
Basics:
What is the purpose of dimensionality reduction and why do we need it?
Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised?
What ways of reducing dimensionality do you know?
Is feature selection a dimensionality reduction technique?
What is the difference between feature selection and feature extraction?
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
Principal Component Analysis:
What is Principal Component Analysis (PCA)? What is the problem it solves? How is it related to eigenvalue decomposition (EVD)?
What's the relationship between PCA and SVD? When is SVD better than EVD for PCA?
Under what conditions is PCA effective?
Why do we need to center data for PCA and what can happen if we don't do it? Do we need to scale data for PCA?
Is PCA a linear model or not? Why?
Other Dimensionality Reduction techniques:
Do you know other Dimensionality Reduction techniques?
What is Independent Component Analysis (ICA)? What's the difference between ICA and PCA?
Suppose you have a very sparse matrix where rows are highly dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
Have you heard of Kernel PCA or other non-linear dimensionality reduction techniques? What about LLE (Locally Linear Embedding) or t-SNE (t-distributed Stochastic Neighbor Embedding)?
What is Fisher Discriminant Analysis? How is it different from PCA? Is it supervised or not?
Cluster Analysis
What is the cluster analysis problem?
Which cluster analysis methods do you know?
Describe K-Means. What is the objective of K-Means? Can you describe Lloyd's algorithm?
How do you select K for K-Means?
How can you modify K-Means to produce soft class assignments?
How to assess the quality of clustering?
Describe any other cluster analysis method. E.g. DBSCAN.
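As a reference point for the K-Means questions above, Lloyd's algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a plain-Python illustration under simplifying assumptions (points are tuples of numbers, initial centroids are drawn at random from the data, empty clusters keep their old centroid); the names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, n_iterations=100, seed=0):
    """Return (centroids, assignments) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initialisation from the data
    assignments = None
    for _ in range(n_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = lloyd_kmeans(points, k=2)
print(centroids, assignments)
```

Each step can only lower the within-cluster sum of squared distances, so the procedure converges, but only to a local optimum that depends on the initialisation.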
Optimization
You may get some basic questions about optimization:
What is the difference between a convex function and a non-convex one?
What is the Gradient Descent method?
Will Gradient Descent methods always converge to the same point?
What is a local optimum?
Is it always bad to have local optima?
What is Newton's method?
What kind of problems are well suited for Newton's method? BFGS? SGD?
What are slack variables?
Describe a constrained optimization problem and how you would tackle it.
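The questions about convergence and local optima can be explored with a few lines of code. The sketch below runs plain gradient descent with a fixed step size on an illustrative non-convex function, f(x) = x^4 - 3x^2 + x, chosen here only as an example; the learning rate and step count are likewise assumptions.

```python
def gradient_descent(gradient, start, learning_rate=0.01, n_steps=2000):
    """Plain gradient descent with a fixed step size on a one-dimensional function."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * gradient(x)
    return x


def gradient(x):
    """Derivative of the illustrative function f(x) = x**4 - 3*x**2 + x."""
    return 4 * x ** 3 - 6 * x + 1


# Two different starting points end up in two different local minima.
left = gradient_descent(gradient, start=-2.0)
right = gradient_descent(gradient, start=2.0)
print(left, right)
```

On a convex function both runs would reach the same minimum; here the end point depends on where the descent starts.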
Recommendation
What is a recommendation engine? How does it work?
Do you know about the Netflix Prize problem? How would you approach it?
How to do customer recommendation?
What is Collaborative Filtering?
How would you generate related searches for a search engine?
How would you suggest followers on Twitter?
Feature Engineering
How to apply Machine Learning to audio data, images, texts, graphs, etc.?
What is Feature Engineering? Can you give an example? Why do we need it?
How to go from categorical variables to numerical ones?
Natural Language Processing
If the company deals with text data, you can expect some questions on NLP and Information Retrieval:
What is NLP? How is it related to Machine Learning?
How would you turn unstructured text data into structured data usable for ML models?
What is the Vector Space Model?
What is TF-IDF?
Which distances and similarity measures can we use to compare documents? What is cosine similarity?
Why do we remove stop words? When do we not remove them?
Language Models. What are N-Grams?
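The Vector Space Model, TF-IDF and cosine similarity fit together in a few lines. The following sketch uses whitespace tokenisation and one common TF-IDF variant (relative term frequency times log(N / document frequency)); both are assumptions, since several weighting schemes exist, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Map each document to a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    document_frequency = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply",
]
vectors = tf_idf_vectors(docs)
similar = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
print(similar, unrelated)
```

Documents that share no terms get a similarity of exactly zero, while a term that occurs in every document gets weight zero, which is one argument for (and sometimes against) removing stop words.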
Meta Learning
Feature Selection:
Are all features equally good?
What are the downsides of using too many or too few variables?
How many features should you use? How do you select the best features?
What is Feature Selection and why do we need it?
Describe several feature selection methods. Do these methods depend on the model or not?
Model selection:
You have built several different models. How would you select the best one?
You have one model and want to find the best set of parameters for this model. How would you do that?
How would you look for the best parameters? Do you know something else apart from grid search?
What is Cross-Validation?
What is 10-Fold CV?
What is the difference between holding out a validation set and doing 10-Fold CV?
Model evaluation
How do you know if your model overfits?
How do you assess the results of a logistic regression?
Which evaluation metrics do you know? Something apart from accuracy?
Which is better: too many false positives or too many false negatives?
What are precision and recall?
What is a ROC curve? What is AU ROC (AUC)? How to interpret the curve and AU ROC?
Do you know about Concordance or Lift?
Discussion Questions:
You have a marketing campaign and you want to send emails to users. You developed a model for predicting if a user will reply or not. How can you evaluate this model? Is there a chart you can use?
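For the evaluation questions above, it helps to be able to compute the metrics by hand. The sketch below computes precision, recall and the area under the ROC curve, the latter through its rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative. The labels, scores and the 0.5 threshold are made-up illustration data.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold of 0.5

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, auc)
```

Precision and recall depend on the chosen threshold, while AUC summarises the ranking quality over all thresholds, which is why the two can tell different stories about the same model.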
Miscellanea
Curse of Dimensionality
What is the Curse of Dimensionality? How does it affect distance and similarity measures?
What are the problems of a large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
What dimensionality reduction techniques can be used for preprocessing the data?
What is the difference between density-sparse data and dimensionally sparse data?
Others
You are training an image classifier with limited data. What are some ways you can augment your dataset?
Computer Science
Knowledge of Computer Science is as important for Data Science as knowledge of Machine Learning. So you may get the same type of questions as for any software developer position, but possibly with lower expectations on your answers.
I was a Java developer for quite some time, and I prepared a list of questions I asked (and often was asked) in Java interviews: Java Interview questions. This list can also be helpful when preparing for a Data Science interview.
Libraries and Tools
Apart from the basics of Java/Scala/Python/etc., you may be asked about libraries for data analysis:
Which libraries for data analysis do you know in Python/R/Java?
Have you used numpy, scipy, pandas, sklearn?
What are some features of the sklearn API that differentiate it from fitting models in R?
What are some features of pandas/scipy that you like? Hate? Same questions for R.
Why is vectorization such a powerful method for optimizing numerical code? What is going on that makes the code faster relative to alternatives like nested for loops?
When is it better to write your own code than to use a data science software package?
State any 3 positive and negative aspects of your favorite statistical software.
Describe a difficult bug you've encountered and how you resolved it.
How does floating point affect the precision of calculations? Equality tests?
What is BLAS? LAPACK?
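The floating point question above is easy to demonstrate. This short sketch uses only the Python standard library and shows why exact equality tests on floats are fragile and how a tolerance-based comparison or compensated summation helps.

```python
import math

total = 0.1 + 0.2

# 0.1 and 0.2 have no exact binary representation, so the sum is not exactly 0.3.
exact_equal = (total == 0.3)

# Comparing with a tolerance is the usual remedy for equality tests.
close = math.isclose(total, 0.3)

# Rounding errors also accumulate in long sums; math.fsum tracks the lost low-order bits.
naive_sum = sum([0.1] * 10)
accurate_sum = math.fsum([0.1] * 10)

print(exact_equal, close, naive_sum, accurate_sum)
```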
Databases
Have you been involved in database design and data modeling?
SQL-related questions: e.g. what's GROUP BY?
Or, given some DB schema, you may be asked to write a simple SQL query.
What is a star schema? A snowflake schema?
Describe different NoSQL technologies you're familiar with, what they are good at, and what they are bad at.
Distributed Systems and Big Data
Basic Big Data questions:
What is the biggest dataset that you have processed and how did you process it? What was the result?
Have you used Apache Hadoop, Apache Spark, Apache Flink? Why? Have you used Apache Mahout?
MapReduce
What are the advantages/disadvantages of a shared-nothing architecture?
What is MapReduce? Why is it a shared-nothing architecture?
Can you implement word count in MapReduce? What about something a bit more complex like TF-IDF? Naive Bayes?
What is load balancing? How to make sure a MapReduce application has good load balance?
Can you give examples where MapReduce does not work?
What are examples of embarrassingly parallelizable algorithms?
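Word count is the usual first MapReduce exercise. The sketch below imitates the three phases (map, shuffle, reduce) in a single Python process; it does not use Hadoop or any real MapReduce API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    # Each document can be mapped independently, which is what makes the job parallel.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["a b a", "b c"])
print(counts)
```

Mappers and reducers share no state, so in a real cluster each could run on a different machine; a very frequent word would send most values to a single reducer, which is where the load balancing question comes in.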
Implementation questions
How would you estimate the median of a dataset that is too big to hold in memory?
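One possible answer to the median question is to keep a uniform random sample of fixed size while streaming over the data (reservoir sampling) and take the median of that sample. This is only one of several approaches and gives an approximation; the function names and the sample size below are assumptions.

```python
import random
import statistics


def reservoir_sample(stream, sample_size, seed=0):
    """Uniform random sample of `sample_size` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, value in enumerate(stream):
        if index < sample_size:
            reservoir.append(value)  # fill the reservoir first
        else:
            # Keep the new item with probability sample_size / (index + 1).
            j = rng.randint(0, index)
            if j < sample_size:
                reservoir[j] = value
    return reservoir


def approximate_median(stream, sample_size=1001, seed=0):
    """Estimate the median of a stream from a fixed-size random sample."""
    return statistics.median(reservoir_sample(stream, sample_size, seed))


# range() stands in for a dataset that is read once and never held in memory.
estimate = approximate_median(range(100_000))
print(estimate)
```

Memory use depends only on the sample size, not on the length of the stream, and a larger sample gives a tighter estimate.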
There are some posts that you may find useful when preparing for the Big Data part:
Hadoop and MapReduce
Naive Bayes on Apache Flink
Hands-On
Also, many interviews have a part which I call hands-on: you are given some problem description and you are asked to solve it. You can just talk the interviewers through your solution, or you may even be asked to sit down and implement some parts. Sometimes there is also a test assignment to be done at home (prior to the interview).
Problem to Solve
For example:
Assume that you are asked to lead a project on churn detection, and you have a dataset of known users who stopped using the service and ones who are still using it. This data includes demographics and other features.
Do the following:
1. Describe the methodology and model that you will choose to identify churn, and describe your thought process.
2. Think about how you would communicate the results to the CEO.
3. Suppose in the dataset only 0.025 of users churned. How would you make it more balanced?
Also:
How would you implement it if you had one day? One month? One year?
How would your approach scale?
Other problems:
How would you approach identifying plagiarism?
How to find individual paid accounts shared by multiple users?
How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
Usually the domain of the problem is related to what the company is doing. If they're doing marketing, it will most likely be marketing-related.
Additionally, you may be asked:
How would you approach collecting the data if you didn't have the dataset?
Coding
Sometimes you may even be presented with a small dataset and asked to do a particular task with any tool.
For example,
write a script to extract features,
then do some exploratory data analysis and
finally apply some ML algorithm to this dataset.
Or just the last two, with a ready-to-use dataset in tabular form.
Papers
It's also possible that you'll be asked to read some ML paper and share your thoughts on it, and then discuss the proposed algorithm, its time complexity, and how it can be implemented and improved.
I wasn't asked to do it myself, but based on my experience working as an ML developer, I believe that reading papers and being able to understand them is an important skill, so don't be surprised if somebody tries to check this ability.
Sources
I had to work through a lot of sources to make this compilation. I did not include all the questions I came across, just the ones that made sense or ones I really got during my interviews. It also, of course, includes my own interviews.
Here is the list of sources I used:
http://www.quora.com/Whatisatypicaldatascientistinterviewlike
http://www.quora.com/Whataretheinterviewquestionsonregressionmodeling
http://www.quora.com/HowshouldIprepareforstatisticsquestionsforadatascience
interview
http://www.quora.com/ABTesting/WhatkindofABtestingquestionsshouldIexpectin
adatascientistinterviewandhowshouldIprepareforsuchquestions
http://www.quora.com/Whatare20questionstodetectfakedatascientists
http://www.quora.com/WhataresomecommonMachineLearninginterviewquestions
http://www.quora.com/Whatarethebestinterviewquestionstoevaluateamachine
learningresearcher
https://www.quora.com/AreCSquestionspartofadatascientistinterviewatFacebook
http://www.quora.com/DataScience/HowshouldIprepareforstatisticsquestionsfora
datascienceinterview
http://stats.stackexchange.com/questions/5465/statisticsinterviewquestions
http://www.reddit.com/r/datascience/comments/2nhb4k/what_interview_questions_have_you_been_asked/
http://www.reddit.com/r/statistics/comments/310h76/i_have_an_interview_for_a_parttime_data_analyst/
https://www.reddit.com/r/datascience/comments/3fsz54/my_top10_technical_questions_for_job_candidates/
https://www.reddit.com/r/datascience/comments/3kzf69/data_scientist_interview_questions_on_pca_svm/
http://www.reddit.com/r/MachineLearning/comments/392nwy/interview_questions_for_data_scientist_positions/
http://blog.udacity.com/2015/04/datascienceinterviewquestions.html
http://alyaabbott.wordpress.com/2014/10/01/howtoaceadatascienceinterview/
http://www.marketingdistillery.com/2014/09/03/howtosuccessfullyrecruitadatascientist/
http://www.edureka.co/blog/frequentlyaskeddatascienceinterviewquestions
http://www.galvanize.it/blog/howtonailadatascienceinterview
http://analyticsindiamag.com/commonanalyticsinterviewquestions/
http://www.datasciencecentral.com/profiles/blogs/66jobinterviewquestionsfordata
scientists
Useful Links
If you are preparing for a Data Science interview, you may also find the following links useful:
http://www2.udacity.com/rs/udacity/images/Ultimate%20Skills%20Checklist%20For%20Your%20First%20Data%20Analyst%20Job.pdf
http://www.quora.com/Whataresomeimportantquestionstoaskarecruiterwhen
interviewingforadatasciencejob
http://www.quora.com/InadatascientistinterviewshouldIusePythonorC++for
algorithmdatastructurequestions
http://www.quora.com/HowdoIprepareforadatascientistinterview
http://datascienceinterview.quora.com/DataScienceInterviewPreparation
http://datascienceinterview.quora.com/Answers1
https://github.com/gkamradt/LessonsLearnedDataScienceInterviews
http://mathewanalytics.com/2015/08/18/homeworkduringthehiringprocessnothanks/
https://medium.com/@D33B/interviewquestionsfordatascientistpositions
5ad3c5d5b8bd
https://medium.com/@D33B/interviewquestionsfordatascientistpositionspartii
ac294c2c7241
http://www.jasq.org/justanotherscalaquant/newageyinterviewsatthegrocerystartup
http://www.erinshellman.com/crusheditlandingadatasciencejob/
http://treycausey.com/data_science_interviews.html
http://nadbordrozd.github.io/interviews/
The End
Even though the post was lengthy, I hope you enjoyed it and found this information useful. Happy interviewing! And please do let us know if you got any interesting questions that we should add.
Labels: by Alexey, Data Science, Interview Questions, Machine Learning