In industry, a data science team can consist of multiple roles, and
it becomes essential for the organization to have smooth hand-offs
between those roles. In other words, the research and models
produced by the data scientists should be able to go into production quickly,
without major re-writing of the code.
http://101.datascience.community/2016/11/28/data-scientists-data-engineers-software-engineers-the-difference-according-to-linkedin/
https://www.slideshare.net/continuumio/journey-to-open-data-science
General “BIG DATA/ML” Architecture
For example, a model developed in TensorFlow might look like this when deployed as a product (e.g.
as an app for your phone that tells whether an image contains a cat or a dog).
DOCKER
https://www.docker.com/survey-2016
http://www.somatic.io/blog/docker-and-deep-learning-a-bad-match
Unfortunately that is wrong for deep learning applications. For any serious deep learning application, you need NVIDIA
graphics cards, otherwise it could take months to train your models. NVIDIA requires both the host driver and the docker
image's driver to be exactly the same. If the version is off by a minor number, you will not be able to use the NVIDIA card; it
will refuse to run. I don't know how much of the binary code changes between minor versions, but I would rather have the card
try to run instructions and get a segmentation fault than die because of a version mismatch.
We build our docker images based off the NVIDIA card and driver along with the software needed. We essentially have the
same docker image for each driver version. To help manage this, we have a test platform that makes sure all of
our code runs on all the different docker images.
This issue is mostly in NVIDIA's court; they can modify their drivers to be able to work across different versions. I'm not
sure if there is anything that Docker can do on their side. I think it's something they should figure out though: the
combination of docker and deep learning could help a lot more people get started faster, but right now it's
an empty promise.
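To make the version-mismatch problem concrete, here is a minimal sketch of my own (not from the quoted post) that a test platform could run inside each image: it reads the host driver version exposed by the NVIDIA kernel module and fails fast if it differs from the version the image was built against. The EXPECTED_DRIVER value is a hypothetical example.

import re
import sys

EXPECTED_DRIVER = "375.26"  # hypothetical driver version baked into this Docker image

def host_driver_version(path="/proc/driver/nvidia/version"):
    # Parse the driver version reported by the NVIDIA kernel module on the host.
    with open(path) as f:
        text = f.read()
    match = re.search(r"Kernel Module\s+([\d.]+)", text)
    return match.group(1) if match else None

if __name__ == "__main__":
    found = host_driver_version()
    if found != EXPECTED_DRIVER:
        sys.exit("Driver mismatch: image built for %s, host runs %s" % (EXPECTED_DRIVER, found))
    print("Driver versions match:", found)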
The biggest impact on data science right now is not coming from a new
algorithm or statistical method. It’s coming from Docker containers.
Containers solve a bunch of tough problems simultaneously: they make it easy to
use libraries with complicated setups; they make your output reproducible; they
make it easier to share your work; and they can take the pain out of the Python data
science stack.
.pwc.com/us/en/technology-forecast/2014
Benefits of microservices
1) Code can be broken out into smaller microservices that are easier to learn, release
and update.
2) Individual microservices can be written using the best tools for the job.
3) Releasing a new service doesn't require synchronization across a whole company.
4) New technology stacks have lower risk since the service is relatively small.
5) Developers can run containers locally, rebuilding and verifying after each commit on a
system that mirrors production.
6) Both Docker and Kubernetes are open source and free to use.
7) Access to Docker Hub leverages the work of the open-source community.
8) Service isolation without the heavyweight VM. Adding a service to a server does not
affect other services on the server.
9) Services can be more easily run on a large cluster of nodes, making them more reliable.
10) Some clients will only host in private and not on public clouds.
11) Lends itself to immutable infrastructure, so services can be reloaded without losing
state when a server goes down.
12) Immutable containers improve security since data can only be mutated in specified
volumes; rootkits often can't be installed even if the system is penetrated.
13) Increasing support for new hardware, like GPUs in containers, means even GPGPU
tasks like deep learning can be containerized.
14) There is a cost to running microservices - the build and runtime become more
complex. This is part of the price to pay, and if you've made the right decision in your
context, then the benefits will exceed the costs.
Costs of microservices
• Managing multiple services tends to be more costly.
• New ways for network and servers to fail.
Conclusion
In the right circumstances, the benefits of microservices outweigh the extra cost of
management. https://www.nextplatform.com/2016/09/13/will-containers-total-package-hpc/
Docker vs AWS Lambda ”In General”
https://www.quora.com/Are-there-any-alternatives-to-Amazon-Lambda
http://blog.xebia.com/create-the-smallest-possible-docker-container/
https://www.ctl.io/developers/blog/post/optimizing-docker-images/
“Docker images can get really big. Many are over 1G in size. How do they get so big? Do they really
need to be this big? Can we make them smaller without sacrificing functionality?
Here at CenturyLink we've spent a lot of time recently building different docker images. As we began
experimenting with image creation one of the things we discovered was that our custom images
were ballooning in size pretty quickly (it wasn't uncommon to end up with images that weighed-in at
1GB or more). Now, it's not too big a deal to have a couple gigs worth of images sitting on your local
system, but it becomes a bit of pain as soon as you start pushing/pulling these images across the
network on a regular basis.”
https://ypereirareis.github.io/blog/2016/02/15/docker-image-size-optimization/
https://github.com/microscaling/imagelayers-graph
ImageLayers.io is a project maintained by Microscaling Systems since September 2016. The project
was developed by the team at CenturyLink Labs. This utility provides a browser-based visualization
of user-specified Docker Images and their layers. This visualization provides key information on the
composition of a Docker Image and any commonalities between them. ImageLayers.io allows
Docker users to easily discover best practices for image construction, and aid in determining which
images are most appropriate for their specific use cases.

Deploying in Kubernetes: please see deployment/README.md

https://blog.replicated.com/2016/02/05/refactoring-a-dockerfile-for-image-size/
“There’s been a welcome focus in the Docker community recently around image
size. Smaller image sizes are being championed by Docker and by the community.
When many images clock in at multi-100 MB and ship with a large ubuntu base, it’s
greatly needed.”
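A small sketch for inspecting where an image balloons, using the Docker SDK for Python (docker-py); it assumes a local Docker daemon and that the named image has already been pulled. The image name is only an example.

import docker

def print_layer_sizes(image_name):
    # Print the size of each layer in the image's history, largest contributors first to spot.
    client = docker.from_env()
    image = client.images.get(image_name)
    for layer in image.history():
        size_mb = layer.get("Size", 0) / (1024.0 * 1024.0)
        created_by = (layer.get("CreatedBy") or "")[:60]
        print("%8.1f MB  %s" % (size_mb, created_by))

print_layer_sizes("ubuntu:16.04")  # hypothetical image name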
What is lambda architecture anyway?
kappa-architecture.com
https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Kappa Architecture is a simplification of Lambda Architecture. A
Kappa Architecture system is like a Lambda Architecture system with the batch
processing system removed. To replace batch processing, data is simply fed
through the streaming system quickly.

Kappa Architecture revolutionizes database migrations and reorganizations: just
delete your serving layer database and populate a new copy from the canonical
store! Since there is no batch processing layer, only one set of code needs to
be maintained.

The Lambda Architecture is an approach to building stream processing applications
on top of MapReduce and Storm or similar systems. This has proven to be a
surprisingly popular idea, with a dedicated website and an upcoming book.
I like that the Lambda Architecture emphasizes retaining the input data unchanged. I
think the discipline of modeling data transformation as a series of materialized stages from an
original input has a lot of merit. I also like that this architecture highlights the problem of
reprocessing data (processing input data over again to re-derive output).
The problem with the Lambda Architecture is that maintaining code that needs to produce
the same result in two complex distributed systems is exactly as painful as it seems like it
would be. I don’t think this problem is fixable. Ultimately, even if you can avoid coding your
application twice, the operational burden of running and debugging two systems is
going to be very high. And any new abstraction can only provide the features supported by
the intersection of the two systems. Worse, committing to this new uber-framework walls off
the rich ecosystem of tools and languages that makes Hadoop so powerful (Hive, Pig,
Crunch, Cascading, Oozie, etc).
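To make the Kappa reprocessing idea above concrete, here is a purely illustrative Python sketch of my own: a single stream-processing function, where "migration" is just replaying the canonical append-only log through the same code into a fresh serving table.

canonical_log = [                      # append-only log of raw events (e.g. a Kafka topic)
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": 7},
    {"user": "alice", "amount": 3},
]

def process(event, serving_table):
    # The single stream-processing job: maintain per-user running totals.
    serving_table[event["user"]] = serving_table.get(event["user"], 0) + event["amount"]

# Normal operation: events are processed as they arrive.
serving_v1 = {}
for event in canonical_log:
    process(event, serving_v1)

# "Migration": drop the serving table and rebuild it by replaying the same log
# through the same code -- there is no separate batch layer to keep in sync.
serving_v2 = {}
for event in canonical_log:
    process(event, serving_v2)

assert serving_v1 == serving_v2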
Managing containers
Enter Kubernetes (from Google)
DOCKER Management → enter Kubernetes
www.computerweekly.com/feature/Demystifying-Kubernete
With a single command the Docker environment is set up and you can docker run until
you drop. But what if you have to run Docker containers across two hosts? How about
50 hosts? Or how about 10,000 hosts? Now, you may ask why one would want to do
this. There are some good reasons why:
nextplatform.com/2016/03/03
https://www.nextplatform.com/2015/09/29/why-containers-at-scale-is-hard/
Two founders of the Kubernetes project at Google, Craig McLuckie and Joe Beda, today
announced their new company, Heptio. The company has raised $8.5 million in a series A
investment round led by Accel, with participation from Madrona Venture Group.
www.sdxcentral.com
Kubernetes
Kubernetes is an open-source system for automating
deployment, scaling, and management of containerized
applications.
It groups containers that make up an application into logical units for easy
management and discovery. Kubernetes builds upon
15 years of experience of running production workloads at Google, combined with
best-of-breed ideas and practices from the community.
https://www.youtube.com/watch?v=21hXNReWsUU
http://cloud9.nebula.fi/app.html http://nshani.blogspot.co.uk/2016/02/getting-started-with-kubernetes.html
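To make the "logical units" idea concrete, a small sketch with the official Kubernetes Python client; it assumes a cluster is reachable through your local kubeconfig and simply lists the pods Kubernetes is managing.

from kubernetes import client, config

config.load_kube_config()              # use ~/.kube/config credentials
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    # Print namespace, name and phase (Running, Pending, ...) for every pod in the cluster.
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)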
Kubernetes concepts
http://www.slideshare.net/arungupta1/package-your-ja
va-ee-application-using-docker-and-kubernetes
http://www.slideshare.net/jawnsy/kubernetes-my-bff
Inference can be very resource intensive. Our server executes the following
TensorFlow graph to process every classification request it receives. The
Inception-v3 model has over 27 million parameters and runs 5.7 billion floating
point operations per inference.
blog.kubernetes.io/2016/03
linkedin.com/pulse
Alternatives
https://news.ycombinator.com/item?id=10438273
https://www.oreilly.com/ideas/swarm-v-fleet-v-kubernetes-v-mesos
Conclusion
There are clearly a lot of choices for orchestrating, clustering, and managing
containers. That being said, the choices are generally well differentiated. In terms of
orchestration, we can say the following:
medium.com/@mustwin

Swarm has the advantage (and disadvantage) of using the standard Docker interface.
Whilst this makes it very simple to use Swarm and to integrate it into existing
workflows, it may also make it more difficult to support the more complex scheduling
that may be defined in custom interfaces.

Fleet is a low-level and fairly simple orchestration layer that can be used as a base for
running higher level orchestration tools, such as Kubernetes or custom systems.

Kubernetes is an opinionated orchestration tool that comes with service discovery and
replication baked-in. It may require some re-designing of existing applications, but used
correctly will result in a fault-tolerant and scalable system.

Mesos is a low-level, battle-hardened scheduler that supports several frameworks for
container orchestration including Marathon, Kubernetes, and Swarm. At the time of
writing, Kubernetes and Mesos are more developed and stable than Swarm. In terms
of scale, only Mesos has been proven to support large-scale systems of hundreds or
thousands of nodes. However, when looking at small clusters of, say, less than a dozen
nodes, Mesos may be an overly complex solution.

Bare Metal
Most schedulers, with the notable exception of Cloud Foundry, can be
installed on “bare metal” or physical machines inside your datacenter. This
can save you big on hypervisor licensing fees.

Volume Mounts
Volume mounts allow you to persist data across container
deployments. This is a key differentiator depending on your applications’
needs. Mesos is the leader here, and Kubernetes is slowly catching up.
Kubernetes Still on top?
https://news.ycombinator.com/item?id=12462261 https://github.com/kubernetes/kubernetes
http://www.infoworld.com/article/3118345/cloud-computing/why-kubernetes-is-winning-the-container-war.html
After all, Kubernetes is a mere two years old (as a public open source
project), whereas Apache Mesos has clocked seven years in market.
Docker Swarm is younger than Kubernetes, and it comes with the
backing of the center of the container universe, Docker Inc. Yet the
orchestration rivals pale in comparison to Kubernetes' community,
which -- now under management by the
Cloud Native Computing Foundation -- is exceptionally large and
diverse.
• Kubernetes is one of the top projects on GitHub: in the top 0.01
percent in stars and No. 1 in terms of activity.
• While documentation is subpar, Kubernetes has a significant Slack
and Stack Overflow community that steps in to answer questions
and foster collaboration, with growth that dwarfs that of its rivals.
• More professionals list Kubernetes in their LinkedIn profile than
any other comparable offering by a wide margin.
• Perhaps most glaring, data from OpenHub shows Apache Mesos
dwindling since its initial release and Docker Swarm starting to
slow. In terms of raw community contributions, Kubernetes is
exploding, with 1,000-plus contributors and 34,000 commits --
more than four times those of nearest rival Mesos.

I would argue that general-purpose clusters like those managed by Google
Kubernetes are better for hosting Internet businesses depending on artificial
intelligence technologies than special-purpose clusters like NVIDIA DGX-1.

Consider the case where an experiment model training job is using all 100 GPUs in the cluster. A
production job gets started and asks for 50 GPUs. If we use MPI, we'd have to kill the experiment job
so as to release enough resources to run the production job. This tends to give the owner of the
experiment job the impression that he is doing "second-class" work.

Kubernetes is smarter than MPI, as it can kill, or preempt, only 50 workers of the experiment job,
so as to allow both jobs to run at the same time. With Kubernetes, people have to build their programs
into Docker images that run as Docker containers. Each container has its own filesystem and
network port space. When a program runs as a container, it removes only files in its own directory. This is to
some extent like defining C++ classes in namespaces, which helps us remove class name
conflicts.

An example: a typical Kubernetes cluster running an automatic speech recognition (ASR) business
might be running the following jobs:
1) The speech service, with as many instances as needed to serve many simultaneous user requests.
2) The Kafka system, where each channel collects a certain log stream of the speech service.
3) Kafka channels are followed by Storm jobs for online data processing. For example, a Storm job joins the
utterance log stream and the transcription stream.
4) The joined result, namely the session log stream, is fed to an ASR model trainer that updates the model.
5) This trainer notifies the ASR server when it writes updated models into Ceph.
6) Researchers might change the training algorithm and run some experiment training jobs, which serve testing
ASR service jobs.
The famous 'classical big data' on Spark
Apache Spark has emerged as the de facto framework for big data
analytics with its advanced in-memory programming model and upper-level
libraries for scalable machine learning, graph analysis, streaming and
structured data processing. It is a general-purpose cluster computing
framework with language-integrated APIs in Scala, Java, Python and R. As
a rapidly evolving open source project, with an increasing number of
contributors from both academia and industry, it is difficult for researchers
to comprehend the full body of development and research behind Apache
Spark, especially those who are beginners in this area.
http://advancedspark.com/
https://www.youtube.com/watch?v=PFK6gsnlV5E
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/
The goal of this workshop is to build an end-to-end, streaming data analytics and recommendations pipeline on your local machine using Docker and the latest streaming analytics
tools. First, we create a data pipeline to interactively analyze, approximate, and visualize streaming data using modern tools such as Apache Spark, Kafka, Zeppelin, iPython, and
ElasticSearch.
Dask as an alternative to apache spark #1
https://youtu.be/1kkFZ4P-XHg
http://dask.pydata.org/en/latest/spark.html https://www.quora.com/Is-https-github-com-blaze-dask-an-alternative-to-Spark
Spark is mature and all-inclusive. If you want a single project that does everything and you’re
already on Big Data hardware then Spark is a safe bet, especially if your use cases are typical
ETL + SQL and you’re already using Scala.

Dask is lighter weight and is easier to integrate into existing code and hardware. If
your problems vary beyond typical ETL + SQL and you want to add flexible parallelism to
existing solutions then Dask may be a good fit, especially if you are already using Python
and associated libraries like NumPy and Pandas.

If you are looking to manage a terabyte or less of tabular CSV or JSON data then you
should forget both Spark and Dask and use Postgres or MongoDB.

Dask seems to be aimed at parallelism of only certain operations (some parts of NumPy and
Pandas) on larger-than-memory data on a single machine. Spark is a general purpose
computing engine that can work across a cluster of machines and has many libraries
optimized for distributed computing (machine learning, graph, etc.).

The advantages of Dask seem to be that it is a drop-in replacement for NumPy and Pandas.
Granted, given the prevalence of those two libraries that isn't a small advantage.
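A hedged sketch of the "drop-in replacement" point: the same groupby you would write with Pandas, but on a lazily evaluated dask.dataframe chunked across many files. The CSV path and column names are hypothetical.

import dask.dataframe as dd

df = dd.read_csv("transactions-*.csv")            # larger-than-memory data spread over many files
per_user = df.groupby("user_id")["amount"].mean() # builds a task graph, nothing runs yet
print(per_user.compute())                         # .compute() triggers the actual (parallel) work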
TensorFlow Basics
Weights Persistence. Save and Restore a model.
Fine-Tuning. Fine-Tune a pre-trained model on a new task.
Using HDF5. Use HDF5 to handle large datasets.
Using DASK. Use DASK to handle large datasets.

GPU Computing with Apache Spark and Python, by Continuum Analytics, .slideshare.net
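As a concrete illustration of the "Weights Persistence" item above, a minimal save-and-restore sketch using plain TensorFlow 1.x tf.train.Saver (the underlying checkpoint mechanism); the checkpoint path is hypothetical.

import tensorflow as tf

w = tf.Variable(tf.random_normal([10, 2]), name="weights")
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    save_path = saver.save(sess, "/tmp/model.ckpt")   # write the checkpoint to disk

with tf.Session() as sess:
    saver.restore(sess, "/tmp/model.ckpt")            # load the weights back into the graph
    print(sess.run(w).shape)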
Kubernetes + Dask
Running on Kubernetes on Google Container Engine (recorditblog.com)
Dask Cluster Deployments: http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments
Using the different software above, an application can be deployed, scaled
easily and accessed from the outside world in a few seconds. But what about
the data? Structured content would probably be stored in a distributed
database, like MongoDB, for example. Unstructured content is traditionally
stored in either a local file system, a NAS share or in Object Storage. A local
file system doesn't work, as a container can be deployed on any node in the
cluster.
http://www.slideshare.net/kubecon/kubecon-eu-2016-kubernetes-storage-101

On the other side, Object Storage can be used by any application from any
container, is highly available due to the use of load balancers, doesn't require
any provisioning and accelerates the development cycle of the applications.
Why? Because a developer doesn't have to think about the way data should
be stored, manage a directory structure, and so on.
The fact that the picture is uploaded directly to the Object Storage platform
means that the web application is not in the data path. This allows the
application to scale without deploying hundreds of instances. This web
application can also be used to display all the pictures stored in the
corresponding Amazon S3 bucket.

The URL displayed below each picture shows that the picture is downloaded
directly from the Object Storage platform, which again means that the web
application is not in the data path. This is another reason why Object
Storage is the de facto standard for web scale applications.

Persistent Volumes Walkthrough
http://kubernetes.io/docs/user-guide/persistent-volumes/walkthrough/
The purpose of this guide is to help you become familiar with Kubernetes Persistent Volumes.
By the end of the guide, we'll have nginx serving content from your persistent volume.
You can view all the files for this example in the docs repo here.
This guide assumes knowledge of Kubernetes fundamentals and that you have a cluster up and
running. See Persistent Storage design document for more information.
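A hedged sketch of the object-storage pattern described above, using boto3: the web application only hands out pre-signed URLs, and the browser uploads/downloads pictures straight to and from the object store, keeping the application out of the data path. Bucket and key names are hypothetical.

import boto3

s3 = boto3.client("s3")
# Short-lived URL the browser can use to PUT the picture directly into the bucket.
upload_url = s3.generate_presigned_url(
    "put_object", Params={"Bucket": "photo-demo", "Key": "cat.jpg"}, ExpiresIn=300)
# Short-lived URL the browser can use to GET the picture back, again bypassing the web app.
download_url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": "photo-demo", "Key": "cat.jpg"}, ExpiresIn=300)
print(upload_url)
print(download_url)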
Data Lakes vs data warehouses #1
“A data lake is a storage repository that holds a vast amount of raw data in its
native format, including structured, semi-structured, and unstructured data. The
data structure and requirements are not defined until the data is needed.”
The table below helps flesh out this definition. It also highlights a few of the key
differences between a data warehouse and a data lake. This is, by no means, an
exhaustive list, but it does get us past this “been there, done that” mentality:
www.kdnuggets.com/2015/09
http://www.smartdatacollective.com/all/13556
Data. A data warehouse only stores data that has been modeled/structured, while a data lake is no respecter of data. It stores it all—structured, semi-structured, and unstructured. [See my
big data is not new graphic. The data warehouse can only store the orange data, while the data lake can store all the orange and blue data.]
Processing. Before we can load data into a data warehouse, we first need to give it some shape and structure—i.e., we need to model it. That’s called schema-on-write. With a data lake, you just load
in the raw data, as-is, and then when you’re ready to use the data, that’s when you give it shape and structure. That’s called schema-on-read. Two very different approaches.
Storage. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low as compared to the data warehouse. There are two key reasons for this: First,
Hadoop is open source software, so the licensing and community support is free. And second, Hadoop is designed to be installed on low-cost commodity hardware.
Agility. A data warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied
to it. A data lake, on the other hand, lacks the structure of a data warehouse—which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and
apps on-the-fly.
Security. Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data
warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a
question of if, but when.
Users. For a long time, the rally cry has been BI and analytics for everyone! We’ve built the data warehouse and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it
the same cry for the data lake? Will we build the data lake and invite everyone to come? Not if you’re smart. Trust me, a data lake, at this point in its maturity, is best suited for the data scientists.
Data Lakes Medical Examples
searchhealthit.techtarget.com
Montefiore also recently started another program using the data lake to do cardio-
genetic predictive analytics to determine the degrees of possibility of patients having
sudden cardiac death based on their genetic background.
Deploying Deep learning models
Small-scale inference
Data science pipeline development and deployment
https://www.continuum.io/blog/developer-blog/productionizing-and-deploying-data-science-projects
“Keras Inception V3 image classification model Prediction with deployment on Compute Engine
w/ Docker & Google Container Registry (like Docker Hub), using Flask Front-end Webserver.”
Note! If you are using the TensorFlow (TF) backend, you can
use the TF computation graph (i.e. the trained model) trained
“natively” on TensorFlow with the Keras deployment, for
example when your data scientist(s) keep re-training the
model: you can then just replace the computation graph
(.ckpt, checkpoint) with the new weights.

A Flask web server allows fast creation of a front-end, which
you can use for example for a quick demo of your minimum
viable product (MVP), a progress report for your boss, or an
interactive front-end for showing scientific results to your PI
(principal investigator).

Keras, TensorFlow, Docker, Flask, Gunicorn, Nginx,
Supervisor tech stack in deployment
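A hedged sketch of the Flask front-end described above: load a Keras InceptionV3 model once at startup, then classify images POSTed to a prediction endpoint. The route name and form field name are my own choices, not from the quoted project.

import numpy as np
from PIL import Image
from flask import Flask, request, jsonify
from keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions

app = Flask(__name__)
model = InceptionV3(weights="imagenet")   # pre-trained ImageNet weights, loaded once

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded image, resize to InceptionV3's expected input size and normalize.
    img = Image.open(request.files["file"].stream).convert("RGB").resize((299, 299))
    x = preprocess_input(np.expand_dims(np.asarray(img, dtype="float32"), axis=0))
    preds = decode_predictions(model.predict(x), top=3)[0]
    return jsonify([{"label": label, "prob": float(prob)} for (_, label, prob) in preds])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)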
TensorFlow Serving Tensorflow for production
How to deploy Machine Learning models with TensorFlow. Part 1 —
make your model ready for serving.
https://medium.com/towards-data-science/how-to-deploy-machine-learning-models-with-tensorflow-part-1-make-your-model-ready-for-serving-776a14ec3198
By Vitaly Bezgachev
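A hedged sketch of the "make your model ready for serving" step: exporting an already-trained TensorFlow 1.x graph as a SavedModel, the format TensorFlow Serving loads. The tensor names, signature key and export directory are hypothetical, not taken from the article.

import tensorflow as tf

def export_for_serving(sess, input_tensor, output_tensor, export_dir="/tmp/serving/1"):
    # Bundle the graph and variables with a prediction signature that Serving can look up.
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"images": input_tensor}, outputs={"scores": output_tensor})
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={"predict": signature})
    builder.save()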
Streaming Machine learning Google
How to use Google Cloud Dataflow with TensorFlow for batch
predictive analysis
Justin Kestelyn, Google Cloud Platform, April 18, 2017
https://cloud.google.com/blog/big-data/2017/04/how-to-use-google-cloud-dataflow-with-tensorflow-for-batch-predictive-analysis
Add EFS storage to the cluster
● Programmatically add an EFS File System and Mount Points into your nodes
● Verify that it works by adding a PV / PVC in k8s.

Automating TensorFlow deployment
● Data ingest code & package
● Training code & package
● Evaluation code & package
● Serving code & package
● Deployment process

So what is a Deep Learning pipeline exactly? Well, my definition is a 4-step pipeline, with a
potential retro-action loop, that consists of:

Data Ingest: This step is the accumulation and pre-processing of the data that will be used to
train our model(s). This step is maybe the least fun, but it is one of the most important. Your model
will be as good as the data you train it on.

Training: this is the part where you play God and create the Intelligence. From your data, you will
create a model that you hope will be representative of reality and capable of describing it (and
even, why not, generating it).

Evaluation + Monitoring: Unless you can prove your model is good, it is worth just about
nothing. The evaluation phase aims at measuring the distance between your model and reality.
This can then be fed back to a human being to adjust parameters. In advanced setups, this can
essentially be your CI/CD test of the model, and help auto-tune the model even without human
intervention.

Serving: there is a good chance you will want to update your model from time to time. If you want
to recognize threats on the network, for example, it is clear that you will have to learn new malware
signatures and patterns of behavior, or you can close your business immediately. Serving is the
phase where you expose the model for consumption, and make sure your customers always enjoy
the latest model.
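A purely illustrative skeleton of my own showing how the four steps and the retro-action loop fit together; every function body is a toy placeholder, not a real implementation.

def ingest():
    # Accumulate and pre-process raw data into a training set (toy data here).
    return [(x, 2 * x) for x in range(100)]

def train(dataset):
    # "Fit" a toy model: estimate the slope from the data.
    pairs = [(x, y) for x, y in dataset if x]
    return sum(y / x for x, y in pairs) / len(pairs)

def evaluate(model, dataset):
    # Measure the distance between model and reality (here: mean absolute error).
    return sum(abs(y - model * x) for x, y in dataset) / len(dataset)

def serve(model):
    # Expose the latest model for consumption (placeholder).
    print("deploying model with slope", model)

data = ingest()
model = train(data)
error = evaluate(model, data)
if error < 0.1:            # retro-action loop: only promote models that pass evaluation
    serve(model)
else:
    print("model rejected; adjust parameters and retrain (error=%s)" % error)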
Jupyter notebooks vs IDE development
What is the practical difference?
Jupyter notebooks What are they all about
If you are a beginner in data science (and with Python) you might have noticed that there
are a lot of tutorials implemented as Jupyter notebooks (formerly referred to as IPython notebooks).
The notebooks allow easy embedding of formatted text and images alongside executable code, which
makes them very useful to provide with scientific papers and as walkthrough example code.
Idea of a notebook
Both in academia (e.g. a non-tech-savvy old-school PI vs. a new-generation
data wizard) and in industry, one may want to communicate
concisely the rationale for the computation together with the results.

Compare this to “Excel data science”, where a person manually
manipulates the data, destroying the end-to-end processing pipeline and
making reproducible research impossible.

In a Jupyter notebook, one can include both the data preparation and
the data analysis parts, so the grad student could for example do the
dirty work and ask for insight from more senior researchers.

In theory, the collaborators would quickly see what the notebook is
about, allowing quick experimentation that would be easily usable by
the poor grad student as well, eliminating the Excel →
Matlab/R/Python → Excel circus probably experienced in some
research labs.

The downside is that notebooks are not ‘plug’n’play’ executables: they
can depend on a lot of standard libraries plus custom libraries written
by the grad student, and are even worse for a non-tech-savvy PI.
Jupyter notebook Features
4) The Paper of the Future by Alyssa Goodman et al. (Authorea Preprint, 2017). This article
explains and shows with demonstrations how scholarly "papers" can morph into long-lasting
rich records of scientific discourse, enriched with deep data and code linkages, interactive
figures, audio, video, and commenting. It includes an interactive d3.js visualization and has an
astronomical data figure with an IPython Notebook "behind" it.
Data-driven journalism
The Need for Openness in Data Journalism, by Brian Keegan:
analysis for the article “The Ferguson Area Is Even More Segregated Than
You Probably Guessed” by Jeremy Singer-Vine.

Screenshot of the first 3D PDF published in Nature in
2009 (Goodman 2009). A video demonstration of how
users can interact with the figure is on YouTube, here.
Open the PDF here in Adobe Acrobat to interact.
Jupyter notebooks how do they compare to IDEs
“Real-life” development is typically a lot easier with a
proper IDE (Integrated Development Environment) than with notebooks.
https://www.slant.co/versus/1240/15716/~pycharm_vs_jupyter

Compare to the RStudio IDE and its notebooks. You can also use Jupyter notebooks with
PyCharm:
https://www.jetbrains.com/help/pycharm/using-ipython-jupyter-notebook-with-pycharm.html
IDE PyCharm
PyCharm is the most commonly used IDE, and could be a safe bet to start with.
https://www.kaggle.com/general/4308
Rodeo IDE has the feel of RStudio if you are coming from R
https://www.r-bloggers.com/rstudio-clone-for-python-rodeo/
http://www.marsja.se/rstudio-like-python-ides-rodeo-spyder/ (by Erik Marsja)
Deploying Deep learning models
Large-scale inference (+Training) with streaming
Streaming Machine learning ”Heavier” tech stacks
“STREAMING”
i.e. continuous stream of data
for example from credit card
transactions, Uber cars, Internet
of Things devices such as
electricity meters or next-
generation vital monitoring at
hospitals.
https://www.youtube.com/watch?v=UmCB9ycz55Q | https://github.com/fluxcapacitor/pipeline/wiki
Streaming Machine learning Confluent #1
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
Kai Wähner, Technology Evangelist at Confluent, Published on Mar 24, 2017
https://www.slideshare.net/KaiWaehner/r-spark-tensorflow-spark-applied-to-streaming-analytics
Big Data vs. Fast Data
SQL Server, Oracle, MySQL, Teradata, Netezza, JDBC/ODBC, Hadoop, SFDC, PostgreSQL, RapidMiner, TIBCO Spotfire
Streaming Machine learning Confluent #2a
Machine Learning and Deep Learning Applied to Real Time with
Apache Kafka Streams
Kai Wähner, Technology Evangelist at Confluent, Published on May 23, 2017
https://www.slideshare.net/KaiWaehner/apache-kafka-streams-machine-learning-deep-learning
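A hedged Python analogue of the pattern in the slides above (the slides use Kafka Streams on the JVM): consume events with kafka-python, score them with a pre-trained model, and publish predictions to another topic. Topic names, the broker address and the pickled model are made-up assumptions.

import json
import pickle
from kafka import KafkaConsumer, KafkaProducer

model = pickle.load(open("churn_model.pkl", "rb"))          # hypothetical pre-trained model
consumer = KafkaConsumer("transactions", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda m: json.loads(m.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda m: json.dumps(m).encode("utf-8"))

for message in consumer:
    # Score each incoming event and forward the prediction downstream.
    features = [message.value["amount"], message.value["age"]]
    score = float(model.predict_proba([features])[0][1])    # assumes a scikit-learn classifier
    producer.send("predictions", {"id": message.value.get("id"), "score": score})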
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
https://www.slideshare.net/databricks/integrating-deep-learning-libraries-with-apache-spark
https://github.com/databricks/spark-deep-learning
Deep Learning Pipelines on Apache Spark enables fast transfer learning with a
Featurizer (that transforms images into numeric features).
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

# Pre-trained InceptionV3 turns each image into a fixed-length feature vector...
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
# ...and a simple logistic regression is trained on top of those features (transfer learning).
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3,
                        labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)  # train_df: a Spark DataFrame of labeled images
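Once fitted, the pipeline is a regular Spark ML model. A hedged usage sketch, assuming a test_df DataFrame of labeled images analogous to train_df above:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = p_model.transform(test_df)            # featurize + classify in one pass
accuracy = MulticlassClassificationEvaluator(
    metricName="accuracy", labelCol="label").evaluate(predictions)
print("Test accuracy:", accuracy)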
Industry
Use cases
Kubernetes in deep learning #1
https://openai.com/blog/infrastructure-for-deep-learning/
In this post, we'll share how deep learning research usually proceeds, describe the
infrastructure choices we've made to support it, and open-source kubernetes-ec2-autoscaler,
a batch-optimized scaling manager for Kubernetes. We hope you find this post useful in
building your own deep learning infrastructure.

Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our
infrastructure for small- and large-scale jobs, and we're actively solidifying our toolkit for making distributed
use-cases as accessible as local ones.

Kubernetes requires each job to be a Docker container, which gives us dependency isolation and
code snapshotting. However, building a new Docker container can add precious extra seconds to a
researcher's iteration cycle, so we also provide tooling to transparently ship code from a researcher's laptop
into a standard image. We expose Kubernetes's flannel network directly to researchers' laptops, allowing
users seamless network access to their running jobs. This is especially useful for accessing monitoring
services such as TensorBoard. (Our initial approach — which is cleaner from a strict isolation perspective —
required people to create a Kubernetes Service for each port they wanted to expose, but we found that it
added too much friction.)

We're releasing kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. It runs as a
normal Pod on Kubernetes and requires only that your worker nodes are in Auto Scaling groups.

Our infrastructure aims to maximize the productivity of deep learning
researchers, allowing them to focus on the science. We're building tools to
further improve our infrastructure and workflow, and will share these in upcoming
weeks and months. We welcome help to make this go even faster!
Kubernetes in deep learning #2
https://news.ycombinator.com/item?id=12391505
--
Docker is an open platform for developers and system administrators to build, ship
and run distributed applications. With Docker, IT organizations shrink application
delivery from months to minutes, frictionlessly move workloads between data
centers and the cloud and can achieve up to 20X greater efficiency in their use of
computing resources. Inspired by an active community and by transparent, open
source innovation, Docker containers have been downloaded more than 700
million times and Docker is used by millions of developers across thousands of the
world’s most innovative organizations, including eBay, Baidu, the BBC, Goldman
Sachs, Groupon, ING, Yelp, and Spotify. Docker’s rapid adoption has catalyzed an
active ecosystem, resulting in more than 180,000 “Dockerized” applications, over
40 Docker-related startups and integration partnerships with AWS, Cloud Foundry,
Google, IBM, Microsoft, OpenStack, Rackspace, Red Hat and VMware.
https://www.youtube.com/watch?v=wf4Jg-9gv9Q
Business data into value
8 ways to turn data into value with Apache Spark machine learning
OCTOBER 18, 2016 by Alex Liu Chief Data Scientist, Analytics Services, IBM
http://www.ibmbigdatahub.com/blog/8-ways-turn-data-value-apache-spark-machine-learning
1. Obtain a holistic view of business
In today's competitive world, many corporations work hard to gain a holistic view or a 360
degree view of customers, for many of the key benefits
as outlined by data analytics expert Mr. Abhishek Joshi. In many cases, a holistic view was not
obtained, partially due to the lack of capabilities to organize huge amounts of data and then to
analyze them. But Apache Spark’s ability to compute quickly while using data frames to
organize huge amounts of data can help researchers quickly develop analytical models that
provide a holistic view of the business, adding value to related business operations. To realize
this value, however, an analytical process, from data cleaning to modeling, must
still be completed.
https://thinkbiganalytics.com/big_data_solutions/data-science/
Losing customers means losing revenue. Not surprisingly, then, companies strive to detect
potential customer churn through predictive modeling, allowing them to implement
interventions aimed at retaining customers. This might sound easy, but it can actually be very
complicated: Customers leave for reasons that are as divergent as the customers themselves
are, and products and services can play an important, but hidden, role in all this. What’s more,
merely building models to predict churn for different customer segments—and with regard to
different products and services—isn’t enough; we must also design interventions, then
select the intervention judged most likely to prevent a particular customer from departing. Yet
even doing this requires the use of analytics to evaluate the results achieved—and,
eventually, to select interventions from an analytical standpoint. Amid this morass of choices,
Apache Spark’s distributed computing capabilities can help solve previously baffling problems.
http://dx.doi.org/10.1186/s40165-015-0014-6
Recommendations for purchases of products and services can be very powerful when made
appropriately, and they have become expected features of e-commerce platforms, with
many customers relying on recommendations to guide their purchases. Yet developing
recommendations at all means developing recommendations for each customer—or, at the very
least, for small segments of customers. Apache Spark can make this possible by offering the
distributed computing and streaming analytics capabilities that have become invaluable tools
for this purpose.

ebaytechblog: Spark is helping eBay create value from its data, and so the future is bright
for Spark at eBay. In the meantime, we will continue to see adoption of Spark increase at
eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product
announcements, industry chatter, and Spark’s own strengths and capabilities.
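A hedged sketch of the recommendation use case with Spark ML's ALS collaborative filtering; the tiny ratings table and its column names are made up for illustration, and in practice would come from purchase or click logs.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recs-sketch").getOrCreate()
# Toy ratings table: (userId, productId, rating).
ratings_df = spark.createDataFrame(
    [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0), (2, 12, 1.0)],
    ["userId", "productId", "rating"])

als = ALS(userCol="userId", itemCol="productId", ratingCol="rating",
          rank=5, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings_df)
model.recommendForAllUsers(3).show(truncate=False)   # top-3 product recommendations per user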
Open data and Apache Spark
-----Jump to Topic-----
00:00:06 - Workshop Intro & Environment Setup
00:13:06 - Brief Intro to Spark
00:17:32 - Analysis Overview: SF Fire Department Calls for Service
00:23:22 - Analysis with PySpark DataFrames API
00:29:32 - Doing Date/Time Analysis
00:47:53 - Memory, Caching and Writing to Parquet
01:00:40 - SQL Queries
01:21:11 - Convert a Spark DataFrame to a Pandas DataFrame
-----Q & A-----
01:24:43 - Spark DataFrames vs. SQL: Pros and Cons?
01:26:57 - Workflow for Chaining Databricks notebooks into Pipeline?
01:30:27 - Is Spark 2.0 ready to use in production?
https://www.youtube.com/watch?v=iiJq8fvSMPg
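A hedged sketch following the workshop outline above (PySpark DataFrames API, date/time analysis, conversion to a Pandas DataFrame); the CSV path, column name and date format are hypothetical stand-ins for the SF Fire Department dataset.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fire-calls").getOrCreate()
df = spark.read.csv("fire_department_calls.csv", header=True, inferSchema=True)

# Date/time analysis: count calls per year.
calls_per_year = (df
    .withColumn("year", F.year(F.to_date("CallDate", "MM/dd/yyyy")))
    .groupBy("year").count()
    .orderBy("year"))

pandas_df = calls_per_year.toPandas()     # small aggregate, safe to collect to the driver
print(pandas_df.head())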
Internet of things (IoT)
To prove that our IoT platform is really independent of the application environment, we took one IoT
gateway (Raspberry Pi 2) from the city project and put it into the Austin Convention Center during the
OpenStack Summit, together with an IQRF-based mesh network connecting sensors that measure
humidity, temperature and CO2 levels. This demonstrates that the IoT gateway can manage or
collect data from any technology, like IQRF, Bluetooth, GPIO, and any other communication
standard supported on Linux-based platforms.

We deployed 20 sensors and 20 routers on 3 conference floors with a single active IoT gateway
receiving data from the entire IQRF mesh network and relaying it to a dedicated time-series database,
in this case Graphite. The collector is an MQTT-Java bridge running inside a Docker container managed by
Kubernetes.

The following screenshot shows real-time CO2 values from different rooms on 2
floors. The historical graph shows values from Monday. You can easily recognize
when the main keynote session started and when the lunch period was.
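A hedged Python sketch of the collector role described above (the real one is an MQTT-Java bridge): subscribe to sensor topics with paho-mqtt and forward each reading to Graphite's plaintext port. Broker/host names, topic layout and metric naming are my own assumptions.

import socket
import time
import paho.mqtt.client as mqtt

GRAPHITE_HOST, GRAPHITE_PORT = "graphite", 2003   # Graphite plaintext protocol port

def on_message(client, userdata, msg):
    # Turn an MQTT message (e.g. topic sensors/room1/co2, payload "612.0") into a Graphite metric.
    value = float(msg.payload)
    metric = "iot." + msg.topic.replace("/", ".")
    line = "%s %f %d\n" % (metric, value, int(time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker", 1883)
client.subscribe("sensors/#")              # all sensor topics
client.loop_forever()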
Healthcare
https://www.healthdirect.gov.au/
https://www.youtube.com/watch?v=ePp54ofRqRs
http://dx.doi.org/10.1016/j.conb.2015.04.002
“Cloud deployment also makes it easier to build tools that run identically for all users,
especially with virtual machine platforms like Docker. However, cloud deployment for
neuroscience does require transferring data to cloud storage, which may become a bottleneck.
Deploying on academic clusters requires at least some support from cluster administrators but
keeps the data closer to the computation. … There is also rapidly growing interest in the ‘‘data
analysis notebook’’. These notebooks – the Jupyter notebook being a particularly popular
example – combine executable code blocks, notes, and graphics in an interactive document that
runs in a web browser, and provides a seamless front-end to a computer, or a large
cluster of computers if running against a framework like Spark. Notebooks are a particularly
appealing way to disseminate information; a recent neuroimaging paper, for example, provided all
of its analyses in a version-controlled repository hosted on GitHub with Jupyter notebooks that
generate all the figures in the paper [45]—a clear model for the future of reproducible
science.”
https://www.docker.com/customers/docker-helps-varian-medical-systems-battle-cancer
https://dx.doi.org/10.12688/f1000research.7536.1
http://dx.doi.org/10.1371/journal.pone.0152686
http://dx.doi.org/10.1186/s13742-015-0087-0
Most large-scale analytics, whether in industry or neuroscience, involve common patterns. Raw data
are massive in size. Often, they are processed so as to extract signals of interest, which are then used for
statistical analysis, exploration, and visualization. But raw data can be analyzed or visualized directly (top
arrow). And the results of each successive step inform how to perform the earlier ones (feedback loops).
http://homolog.us/blogs/blog/2015/09/22/is-docker-for-suckers/
Icons below highlight some of the technologies, discussed in this essay, that are core to the modern large-
scale analysis workflow.
Neuroscience streaming data with Spark
http://dx.doi.org/10.1016/j.conb.2015.04.002
http://t-redactyl.io/blog/2016/10/a-crash-course-in-reproducible-research-in-python.html
http://dx.doi.org/10.1038/nj7622-703a
http://conference.scipy.org/proceedings/scipy2016/pdfs/christian_oxvig.pdf
Anaconda
Enterprise
Notebooks
continuum.io