FINRA
Building a Secure Data Science Platform on AWS
Safeguard information with a high degree of security and least-privilege access
SEPARATE INFRASTRUCTURE SERVICES
SCALE THE DATA PLANT
Considerations
Scale compute and storage separately.
Resiliency and disaster recovery
Flexibility of instance types
Data discovery through an enterprise data catalog
Security
Virtual private cloud (VPC) & encryption
Separation of duties
DevOps: Automate everything
Least privilege and no catch-all rules
Centralized monitoring for total transparency
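The "no catch-all rules" item can be illustrated with a small audit check. The following is a minimal sketch, not FINRA's tooling: it assumes policies follow the standard AWS IAM JSON layout (`Statement`, `Effect`, `Action`, `Resource`), and the function name and the bucket in the example are hypothetical.

```python
def find_catch_all_statements(policy):
    """Return the Allow statements that use a bare "*" for Action or Resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a plain object
        statements = [statements]

    def as_list(value):
        # IAM allows either a single string or a list of strings
        return value if isinstance(value, list) else [value]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged


# Hypothetical policy: one narrowly scoped statement, one catch-all
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/curated/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}
flagged = find_catch_all_statements(example_policy)
```

A check like this could run in the automated DevOps pipeline so that a catch-all statement is rejected before deployment; it only detects the bare `*` wildcard and would need extending to cover patterns such as `s3:*`.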
CENTRALIZED DATA MANAGEMENT
Unified catalog
Schemas
Versions
Encryption type
Storage policies
Shared Metastore
Common definition of tables & partitions
Use with Spark, Presto, Hive, etc.
Faster instantiation of clusters
http://finraos.github.io/herd
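To make the unified catalog idea concrete, here is a minimal in-memory sketch of a catalog that records schema, version, encryption type, and storage policy per dataset. It is an illustration only and does not reflect herd's actual API or data model; every class, field, and value below is a hypothetical name chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    version: int
    schema: tuple           # (column name, type) pairs
    partition_keys: tuple   # columns the data is partitioned by
    encryption_type: str
    storage_policy: str


class Catalog:
    """Toy catalog keyed by (dataset name, version)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[(entry.name, entry.version)] = entry

    def latest(self, name):
        versions = [v for (n, v) in self._entries if n == name]
        if not versions:
            raise KeyError(name)
        return self._entries[(name, max(versions))]


catalog = Catalog()
catalog.register(CatalogEntry(
    name="trades", version=1,
    schema=(("trade_id", "bigint"), ("price", "double")),
    partition_keys=("trade_date",),
    encryption_type="SSE-KMS", storage_policy="archive-after-90-days"))
catalog.register(CatalogEntry(
    name="trades", version=2,
    schema=(("trade_id", "bigint"), ("price", "double"), ("venue", "string")),
    partition_keys=("trade_date",),
    encryption_type="SSE-KMS", storage_policy="archive-after-90-days"))

current = catalog.latest("trades")
```

Because every engine resolves a table through the same entry, Spark, Presto, and Hive clusters would see one definition of columns and partitions rather than each keeping its own copy.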
EFFECTS OF CLOUD CHANGE
REMAINING PAIN POINTS
No standard setup
DATA SCIENCE TOOLING: BEFORE UDSP
SOLUTION: UNIVERSAL DATA SCIENCE PLATFORM
UDSP V1
Secure
Technology controls and curates content
Self-Service
Users manage their machines
Scalable Compute
Size machines to your needs
Turnkey
Libraries pre-built and installed
NO USERS, WHY?
Needs driven by technology
IT: Reduce costs
Users: Need more compute
Secure but inflexible
Local machines were more flexible
Install any package and experiment
Data availability
On-premises databases not reachable
Setup still required
Driver configuration to connect to databases
Technology in the way
Technology required to install any new package
UDSP V2
Flexible
Download/Install any package
Data Availability
No additional setup necessary
On-premises and cloud data accessible
Ownership
Changes proposed and vetted through the data science forum
ADOPTION METRICS
INVENTORY
R 3.2.5, Python (2.7.12 and 3.4.3)
Packages
R: 300+, Python: 100+
Tools for Building Packages
gcc, gfortran, make, java, maven, ant
IDEs
Jupyter, RStudio Server
Deep Learning
CUDA, CuDNN (if GPU present)
Theano, Caffe, Torch
TensorFlow
SELF SERVICE
Completely self service, no technology administration
Users are associated with groups, which determine AWS billing tags and machine selection choices
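The group-based self-service model can be sketched as a lookup from user to group, where the group supplies both the billing tags and the list of machines the user may choose. This is a hedged illustration under assumed names: the groups, users, tag keys, and helper function are all hypothetical, and the returned dictionary only mirrors the general shape of an EC2 launch request rather than calling AWS.

```python
# Hypothetical group definitions: billing tags plus permitted instance types
GROUPS = {
    "market-analytics": {
        "billing_tags": {"CostCenter": "MA-001", "Team": "market-analytics"},
        "instance_types": ["m4.large", "m4.4xlarge"],
    },
    "deep-learning": {
        "billing_tags": {"CostCenter": "DL-007", "Team": "deep-learning"},
        "instance_types": ["m4.large", "p2.xlarge"],
    },
}

# Hypothetical user-to-group assignment
USER_GROUPS = {"alice": "market-analytics", "bob": "deep-learning"}


def build_launch_request(user, instance_type):
    """Validate the user's choice against the group and attach billing tags."""
    group = GROUPS[USER_GROUPS[user]]
    if instance_type not in group["instance_types"]:
        raise PermissionError(
            f"{instance_type} is not offered to group {USER_GROUPS[user]}")
    tags = [{"Key": key, "Value": value}
            for key, value in sorted(group["billing_tags"].items())]
    return {"InstanceType": instance_type, "Tags": tags}


request = build_launch_request("alice", "m4.large")
```

Keeping the tags and machine choices on the group, not the user, means a launch never needs an administrator: the user picks from a pre-approved list and cost attribution is applied automatically.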
UDSP: CREATE AND LAUNCH
UDSP: MONITOR RUNNING INSTANCES
UDSP: USE TOOLS WITH BROWSER
MAINTAINING THE UDSP
Community Driven Experimentation
Data scientists can install any package to try it out
No technologist necessary to administer installation
New library (or version) is proposed for next release
Releases have been monthly
Envision quarterly releases
Philosophy: Support last major release (most recent patch)
Example: R 3.3.1 is available but still receiving patches, so UDSP ships R 3.2.5
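The release philosophy can be expressed as a small selection rule. This sketch assumes one reading of the slide: a "major release" is an `x.y` series (as in R 3.2 versus 3.3), and the platform ships the newest patch of the series before the newest one, since the newest is still changing. The function names and the version list are illustrative.

```python
def parse(version):
    """Turn "3.2.5" into (3, 2, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def supported_version(available):
    """Pick the latest patch of the previous x.y series.

    Falls back to the only series when just one exists.
    """
    if not available:
        raise ValueError("no versions available")
    parsed = [parse(v) for v in available]
    series = sorted({v[:2] for v in parsed})
    target = series[-2] if len(series) >= 2 else series[-1]
    best = max(v for v in parsed if v[:2] == target)
    return ".".join(str(part) for part in best)


# Illustrative list of R versions on offer at release time
r_versions = ["3.2.3", "3.2.4", "3.2.5", "3.3.0", "3.3.1"]
chosen = supported_version(r_versions)
```

With the list above the rule selects 3.2.5, matching the slide's example; parsing into integer tuples avoids the string-comparison trap where "3.10.0" would sort before "3.2.5".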
THE ROAD AHEAD
SURVEILLANCE PLATFORM
RECAP
Each improvement brought pressures to legacy ways of working
RELATED SESSIONS
Other FINRA Sessions:
ABOUT THE PRESENTERS
Scott Donaldson
Scott.Donaldson@finra.org
https://www.linkedin.com/in/scottdonaldson
Vincent Saulys
Vincent.Saulys@finra.org
www.linkedin.com/in/vincentsaulys
QUESTIONS?
Learn more at
http://technology.finra.org
Thank you!