Data science is deep knowledge discovery through data inference and exploration. This discipline often
involves using mathematical and algorithmic techniques to solve some of the most analytically complex business
problems, leveraging troves of raw information to uncover hidden insight that lies beneath the surface. It centers
around evidence-based analytical rigor and building robust decision capabilities.
Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It
is all about adding substantial enterprise value by learning from data.
The variety of projects that a data scientist may be engaged in is incredibly broad. Here are a few examples:
automated decision engines e.g. automated fraud detection, and even self-driving cars
The objectives of these types of initiatives may be clear, but the problems require extensive quantitative expertise to
solve. They may require building predictive models, attribution models, segmentation models, heuristics for deep
pattern-discovery in data, etc. This demands exhaustive knowledge of all sorts of machine-learning
algorithms and sharp technical ability. As you might guess, these are not the easiest skills to pick up.
Mathematics Expertise
At the heart of deriving insight from data is the ability to view the data through a quantitative lens. There are textures,
patterns, dimensions, and correlations in data that can be expressed numerically, and discovering inference from
data becomes a brain teaser of mathematical techniques. Solutions to many business problems often involve building
analytic models that are deeply grounded in hard math theory, and being able to understand how models work is
as important as knowing the process to build them (building without knowing the math is dangerous).
Also, a big misconception is that data science is all about statistics. While statistics is important, it is not the only type
of mathematics that should be well understood by a data scientist. First, there are two main branches of statistics:
classical statistics and Bayesian statistics. When most people refer to stats they are generally referring to classical
stats, but knowledge of both types is very helpful. Furthermore, many inferential techniques and machine learning
algorithms lean heavily on knowledge of linear algebra. For example, key data science processes like SVD (used for
dimension reduction / latent variable discovery) are grounded in matrix mathematics and have much less to do with
classical statistics. Overall, data scientists should have substantial breadth and depth in their knowledge of math.
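To make the SVD remark concrete, here is a minimal sketch in plain Python. It is not a full SVD: it uses power iteration (a technique swapped in here for brevity) to estimate only the largest singular value and its two singular vectors, which is the single most important latent direction in the data. The ratings matrix and the function name are made up for illustration; in practice a linear algebra library would normally be used instead.

```python
import math

def leading_singular_triplet(matrix, iterations=100):
    """Estimate the largest singular value and its vectors by power iteration on A^T A."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary unit-length starting vector
    for _ in range(iterations):
        # Compute A v, then A^T (A v), then renormalise.
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        atav = [sum(matrix[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))  # the length of A v is the singular value
    u = [x / sigma for x in av]
    return sigma, u, v

# Hypothetical data: three customers rating two products. The second column is
# exactly twice the first, so a single latent factor explains every rating.
ratings = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
sigma, u, v = leading_singular_triplet(ratings)

# Dimension reduction: each two-number row collapses to one coordinate.
reduced = [sigma * ui for ui in u]
# Rank-1 reconstruction from the latent factor: sigma * u_i * v_j.
rebuilt = [[sigma * ui * vj for vj in v] for ui in u]
```

Because the made-up matrix has only one underlying factor, the one-number-per-customer summary in `reduced` loses nothing; with real data the reconstruction would only be approximate.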
Training
While solid math skills are necessary, there is a glaring misconception out there that you need a Ph.D. in Statistics to
become a legitimate data scientist. That view completely misses the point that data science is multidisciplinary; years
of study in academia may not leave graduates with the right set of experience and abilities to excel. For example, a Ph.D.
statistician may not have the nimble hacking skills or strategic business intuition to complete the trifecta.
As a matter of fact, data science is such a relatively new and rising discipline that universities have not caught up in
developing comprehensive data science degree programs, meaning that no one can really claim to have "done all
the schooling" to become a data scientist. Where does much of the training come from? The unyielding intellectual
curiosity that data scientists possess drives them to be passionate autodidacts, motivated to learn skills on their own
with deep determination (Read: where can you find people like this?).
What is Analytics?
Analytics has risen quickly in popular business lingo over the past several years; the term is used loosely, but
is generally meant to describe critical thinking that is quantitative in nature. Technically, analytics is the "science of
analysis" or, put another way, the practice of analyzing information to make decisions.
Is "analytics" the same thing as data science? Depends on context. Sometimes it is synonymous with the definition of
data science that we have described, and sometimes it represents something else. A data scientist using raw data to
build a predictive behavior model falls into the scope of analytics. At the same time, a general business user
interpreting pre-built dashboard reports (e.g. GA) is also in the realm of analytics, but does not cross into the
specialized skill needed in data science. Analytics has come to have a fairly broad meaning, though at the end of the
day, the semantics don't matter much.
"Analyst" is somewhat of an ambiguous term that can represent many different types of roles (marketing analyst,
operations analyst, portfolio analyst, financial analyst, etc.). Is an analyst the same as a data scientist? We've
described a fairly strict definition of the data scientist as an expert role with requisite talents in math,
technology, and strategy consulting. Let's just say that some analysts are definitely data-scientists-in-training. As
represented in this visual, there is a place in the middle where the distinction can blur a bit.
clustering algorithms that mine for natural similarities between different customers
neural nets that can recognize what image patterns look like
Data scientists work intimately with machine learning techniques to build algorithms that automate elements of their
problem-solving. It is a requisite part of the data science toolset, needed to tackle some of the most complex
data-driven projects.
What is machine learning? You probably use it dozens of times a day without even
knowing it. Each time you do a web search on Google or Bing, that works so
well because their machine learning software has figured out how to rank web
pages. When Facebook or Apple's photo application recognizes your friends in your
pictures, that's also machine learning. Each time you read your email and a spam
filter saves you from having to wade through tons of spam, again, that's because
your computer has learned to distinguish spam from non-spam email. So, that's
machine learning. It's the science of getting computers to learn without being
explicitly programmed. One of the research projects that I'm working on is getting
robots to tidy up the house. How do you go about doing that? Well what you can do
is have the robot watch you demonstrate the task and learn from that. The robot
can then watch what objects you pick up and where you put them and try to do the
same thing even when you aren't there. For me, one of the reasons I'm excited
about this is the AI, or artificial intelligence, problem: building truly intelligent
machines that can do just about anything that you or I can do. Many scientists think
the best way to make progress on this is through learning algorithms called neural
networks, which mimic how the human brain works.
Is this A or B?
This family is formally known as two-class classification. It's useful for any question that has just two possible
answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like
this or can be re-phrased to fit this form. It's the simplest and most commonly asked data science question. Here are
a few typical examples.
Will this customer renew their subscription?
Is this an image of a cat or a dog?
Will this customer click on the top link?
Will this tire fail in the next thousand miles?
Does the $5 coupon or the 25% off coupon result in more return customers?
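As a rough illustration of how a two-class question gets answered, here is a minimal sketch using a perceptron, one of the simplest two-class learners (chosen here only for illustration; the text does not name a specific algorithm). The customer features and the data are hypothetical, and the data is deliberately easy to separate.

```python
def predict(weights, bias, features):
    """Answer the yes/no question: 1 for "yes", 0 for "no"."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(samples, labels, epochs=50, learning_rate=0.1):
    """Learn weights and a bias that separate label 1 from label 0 with a straight line."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            # Nudge the boundary only when the current answer is wrong.
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical data: [logins per week, support tickets filed] -> renewed (1) or not (0).
customers = [[9, 0], [7, 1], [8, 2], [1, 4], [2, 5], [0, 3]]
renewed = [1, 1, 1, 0, 0, 0]

weights, bias = train_perceptron(customers, renewed)
training_answers = [predict(weights, bias, c) for c in customers]
new_customer_renews = predict(weights, bias, [8, 1])
```

The model only ever returns one of two answers, which is exactly the shape of the questions listed above.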
Is this A or B or C or D?
This algorithm family is called multi-class classification. As its name implies, it answers a question that has
several (or even many) possible answers: which flavor, which person, which part, which company, which candidate.
Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here are a few
typical examples.
Which animal is in this image?
Which aircraft is causing this radar signature?
What is the topic of this news article?
What is the mood of this tweet?
Who is the speaker in this recording?
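The remark that multi-class algorithms are often extensions of two-class ones can be sketched with the one-vs-rest strategy: train one "this class or not?" model per class, then pick the class whose model is most confident. The two-class model below (compare distances to the centre of each side) is a deliberately simple stand-in, and the animal features are made up for illustration.

```python
def mean_point(points):
    """Coordinate-wise average of a list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_two_class(positives, negatives):
    """A very simple two-class model: remember the centre of each side."""
    return mean_point(positives), mean_point(negatives)

def two_class_score(model, point):
    """Positive when the point is closer to the "yes" centre than to the "no" centre."""
    yes_centre, no_centre = model
    return squared_distance(point, no_centre) - squared_distance(point, yes_centre)

def train_one_vs_rest(examples_by_class):
    """Train one "is it this class or not?" model per class."""
    models = {}
    for name, positives in examples_by_class.items():
        negatives = [p for other, pts in examples_by_class.items() if other != name for p in pts]
        models[name] = train_two_class(positives, negatives)
    return models

def classify(models, point):
    """Ask every two-class model and keep the most confident "yes"."""
    return max(models, key=lambda name: two_class_score(models[name], point))

# Hypothetical data: two made-up numeric features extracted from each image.
animals = {
    "cat": [[1, 1], [2, 1], [1, 2]],
    "dog": [[8, 1], [9, 2], [8, 2]],
    "bird": [[5, 8], [4, 9], [5, 9]],
}
models = train_one_vs_rest(animals)
guess = classify(models, [2, 2])
```

Any two-class learner could be dropped into `train_two_class` and `two_class_score`; the one-vs-rest wrapper around it would stay the same.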
Is this Weird?
This family of algorithms performs anomaly detection. They identify data points that are not normal. If you are
paying close attention, you noticed that this looks like a binary classification question. It can be answered yes or no.
The difference is that binary classification assumes you have a collection of examples of both yes and no cases.
Anomaly detection doesn't. This is particularly useful when what you are looking for occurs so rarely that you haven't
had a chance to collect many examples of it, like equipment failures. It's also very helpful when there is a lot of variety
in what constitutes "not normal," as there is in credit card fraud detection. Here are some typical anomaly detection
questions.
Are these voltages normal for this season and time of day?
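The key point above, that anomaly detection learns only from normal examples, can be sketched with a simple standard-deviation rule. This is one basic approach among many, and the readings and the threshold of 3 are assumptions for illustration. A fuller version would keep a separate profile per season and time of day, which this sketch ignores.

```python
import statistics

def fit_normal_profile(normal_readings):
    """Summarise what "normal" looks like using only normal examples."""
    return statistics.mean(normal_readings), statistics.stdev(normal_readings)

def is_anomaly(reading, profile, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the normal mean."""
    mean, spread = profile
    return abs(reading - mean) / spread > threshold

# Hypothetical voltage readings collected while the equipment behaved normally.
normal_voltages = [229.0, 231.0, 230.0, 230.5, 229.5, 230.0, 231.5, 228.5]
profile = fit_normal_profile(normal_voltages)

# No example of a failure was needed to decide which new readings look weird.
flags = {v: is_anomaly(v, profile) for v in (230.2, 245.0, 212.0)}
```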
How Much / How Many?
When you are looking for a number instead of a class or category, the algorithm family to use is regression.
"Which van in my fleet needs servicing the most?" can be rephrased as "How badly does each van in my
fleet need servicing?"
"Which 5% of my customers will leave my business for a competitor in the next year?" can be rephrased as
"How likely is each of my customers to leave my business for a competitor in the next year?"
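The rephrasing above can be sketched in code: a regression model produces a number for every van, and sorting those numbers answers the original "which one?" question. The sketch uses ordinary least squares with a single input; the mileage figures, wear scores, and van names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical history: thousands of miles since last service -> wear score (0 to 10).
miles = [1.0, 2.0, 3.0, 4.0, 5.0]
wear = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(miles, wear)

# "How badly does each van need servicing?" is one number per van...
fleet = {"van_a": 2.5, "van_b": 4.5, "van_c": 0.5}
need = {van: slope * m + intercept for van, m in fleet.items()}

# ...and sorting those numbers answers "Which van needs servicing the most?"
ranking = sorted(need, key=need.get, reverse=True)
```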
Two-Class Classification as Regression
It may not come as a surprise that binary classification problems can also be reformulated as regression. (In fact,
under the hood some algorithms reformulate every binary classification as regression.) This is especially helpful when
an example can belong partly to A and partly to B, or has a chance of going either way. When an answer can be partly yes
and no, probably on but possibly off, then regression can reflect that. Questions of this type often begin "How likely"
or "What fraction".
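One common way to treat a yes/no question as regression is logistic regression, sketched below under stated assumptions: the model predicts a number between 0 and 1 ("how likely?"), and the yes/no answer is just that number compared with a cut-off. The data, the step count, and the learning rate are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=5000, learning_rate=0.1):
    """Fit p(yes | x) = sigmoid(w * x + b) by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Hypothetical data: months as a customer -> renewed (1) or not (0).
# The classes overlap on purpose (month 4 renewed, month 5 did not).
months = [1, 2, 3, 4, 5, 6, 7, 8]
renewed = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(months, renewed)

def renewal_probability(m):
    """The regression answer: how likely is renewal after m months?"""
    return sigmoid(w * m + b)

# The classification answer is the same number compared with a 0.5 cut-off.
likely_to_renew = {m: renewal_probability(m) >= 0.5 for m in (1, 8)}
```

Customers near the overlap get probabilities close to one half, which is how the model expresses "could go either way".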
Which groups of sensors in this jet engine tend to vary with (and against) each other?
What leadership practices do successful CEOs have in common?
What are the most common patterns in gasoline price changes across the US?
What groups of words tend to occur together in this set of documents? (What are the topics they cover?)
If your goal is to summarize, simplify, condense, or distill a collection of data, dimensionality reduction and clustering
are your tools of choice.
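As a small illustration of clustering, here is a sketch of k-means, one widely used clustering algorithm (the text does not single out a specific one). It starts from the first k points, which is a naive choice made here to keep the example deterministic; the customer data is hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def nearest_centre(point, centres):
    """Index of the centre closest to the point."""
    return min(range(len(centres)), key=lambda i: squared_distance(point, centres[i]))

def k_means(points, k, iterations=10):
    """Alternate between assigning points to centres and moving centres to their group mean."""
    centres = [list(p) for p in points[:k]]  # naive deterministic start: the first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_centre(p, centres)].append(p)
        # Keep the old centre if a group ended up empty.
        centres = [mean_point(g) if g else centres[i] for i, g in enumerate(groups)]
    labels = [nearest_centre(p, centres) for p in points]
    return centres, labels

# Hypothetical data: [store visits per month, average basket size] per customer.
customers = [[1, 2], [9, 8], [2, 1], [8, 9], [1, 1], [9, 9]]
centres, labels = k_means(customers, k=2)
```

No labels were supplied; the two groups come purely from which customers sit close together.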
What Should I Do Now?
A third extended family of ML algorithms focuses on taking actions. These are called reinforcement learning (RL)
algorithms. They are a little different from the supervised and unsupervised learning algorithms. A regression algorithm
might predict that the high temperature will be 98 degrees tomorrow, but it doesn't decide what to do about it. An RL
algorithm goes the next step and chooses an action, such as pre-refrigerating the upper floors of the office building
while the day is still cool.
RL algorithms were originally inspired by how the brains of rats and humans respond to punishment and rewards.
They choose actions, trying very hard to choose the action that will earn the greatest reward. You have to provide
them with a set of possible actions, and they need to get feedback after each action on whether it was good, neutral,
or a huge mistake.
Typically RL algorithms are a good fit for automated systems that have to make a lot of small decisions without a
human's guidance. Elevators, heating, cooling, and lighting systems are excellent candidates. RL was originally
developed to control robots, so anything that moves on its own, from inspection drones to vacuum cleaners, is fair
game. Questions that RL answers are always about what action should be taken, although the action is usually taken
by a machine.
Where should I place this ad on the webpage so that the viewer is most likely to click it?
Should I adjust the temperature higher, lower, or leave it where it is?
Should I vacuum the living room again or stay plugged in to my charging station?
How many shares of this stock should I buy right now?
Should I continue driving at the same speed, brake, or accelerate in response to that yellow light?
RL usually requires more effort to get working than other algorithm types because it's so tightly integrated with the
rest of the system. The upside is that most RL algorithms can start working without any data. They gather data as
they go, learning from trial and error.
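The idea of starting with no data and learning from trial and error can be sketched with an epsilon-greedy bandit, one of the simplest RL-style methods (chosen here for brevity; it ignores state and long-term consequences). The ad placements and their click probabilities are invented, and the environment is simulated with a seeded random number generator.

```python
import random

def run_epsilon_greedy(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Learn which action earns the most reward purely by trial and error."""
    rng = random.Random(seed)
    actions = list(click_rates)
    value = {a: 0.0 for a in actions}  # running average reward per action
    pulls = {a: 0 for a in actions}    # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)           # explore: try something at random
        else:
            action = max(actions, key=value.get)   # exploit: best estimate so far
        # Simulated feedback: 1.0 for a click, 0.0 for no click.
        reward = 1.0 if rng.random() < click_rates[action] else 0.0
        pulls[action] += 1
        value[action] += (reward - value[action]) / pulls[action]  # incremental mean
    return value, pulls

# Hypothetical environment: true click probabilities, hidden from the learner.
true_click_rates = {"header": 0.05, "sidebar": 0.10, "inline": 0.50}
value, pulls = run_epsilon_greedy(true_click_rates)
best_placement = max(value, key=value.get)
```

The learner begins with empty estimates and gathers its own data one action at a time; the small `epsilon` keeps it occasionally trying the other placements in case its current favourite is wrong.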