Benvenuto in Scribd!

What Does Mean?: Scalable

Caricato da

Il 0% ha trovato utile questo documento (0 voti)

11 visualizzazioni26 pagine

The document discusses the concept of scalability in operations, algorithms, and data sizes. Operationally, scalability now means being able to utilize thousands of cheap computers rather than just fitting in main memory. Algorithmically, scalable algorithms should perform operations proportional to N log N rather than Nm as data sizes increase. With large datasets like telescopic images, algorithms may only get one pass through streaming data so must make the most of that pass. Relational databases can help solve "needle in haystack" problems scalably by indexing large datasets. However, some tasks like trimming sequences require touching all data records so cannot leverage indexing for improved scalability.

Descrizione originale:

015_scalability_dataScience

Titolo originale

015_scalability_dataScience

Copyright

Formati disponibili

PDF, TXT o leggi online da Scribd

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Segnala questo documento

Copyright:

Formati disponibili

Scarica in formato PDF, TXT o leggi online su Scribd

Segnala contenuti inappropriati

Il 0% ha trovato utile questo documento (0 voti)

11 visualizzazioni26 pagine

What Does Mean?: Scalable

Caricato da

Daniel Softjoy Montjoy

Copyright:

Formati disponibili

Scarica in formato PDF, TXT o leggi online su Scribd

Segnala contenuti inappropriati

Salta alla pagina

Sei sulla pagina 1di 26

Cerca all'interno del documento

What Does Scalable Mean?

Operationally:
In the past: Works even if data doesnt fit in main memory
Now: Can make use of 1000s of cheap computers

Algorithmically:
In the past: If you have N data items, you must do no more

than Nm operations -- polynomial time algorithms

Now: If you have N data items, you must do no more than Nm/k
operations, for some large k
Polynomial-time algorithms must be parallelized

Soon: If you have N data items, you should do no more than N *

log(N) operations

As data sizes go up, you may only get one pass at the data
The data is streaming -- you better make that one pass count
Ex: Large Synoptic Survey Telescope (30TB / night)
5/15/13

Bill Howe, eScience Institute

Example: Find matching DNA sequences

Given a set of sequences
Find all sequences equal to
GATTACGATATTA

5/15/13

Bill Howe, eScience Institute

TACCTGCCGTAA

GATTACGATATTA

TACCTGCCGTAA = GATTACGATATTA?

No.

time = 0
5/15/13

Bill Howe, eScience Institute

GATTACGATATTA

CCCCCAATGAC = GATTACGATATTA?

No.

time = 1
5/15/13

Bill Howe, eScience Institute

GATTACGATATTA

GATTACGATATTA contains GATTACGATATTA?

Yes!
Send it to the output.

time = 17
5/15/13

Bill Howe, eScience Institute

GATTACGATATTA

40 records, 40 comparions
N records, N comparisons
The algorithmic complexity is order N: O(N)

5/15/13

Bill Howe, eScience Institute

TTTTCGTAATT
AAAATCCTGCA
AAACGCCTGCA

TTTACGTCAA
GATTACGATATTA

What if we sort the sequences?

Start at the 50% mark

CTGTACACAACCT
GATTACGATATTA

100%

CTGTACACAACCT < GATTACGATATTA

time = 0

No match.
Skip to 75% mark

GATTACGATATTA
GGATACACATTTA

100%

GGATACACATTTA > GATTACGATATTA

time = 1

No match.
Go back to 62.5% mark

GATATTTTAAGC
GATTACGATATTA

100%

GATATTTTAAGC < GATTACGATATTA

No match.
Skip back to 68.75% mark

GATTACGATATTA

100%

GATTACGATATTA = GATTACGATATTA
Match!
Walk through the records until we
fail to match.

GATTACGATATTA

100%

How many comparisons did we do?

40 records, only 4 comparisons
N records, log(N) comparisons
This algorithm is O(log(N))

Far better scalability

Relational Databases

Databases are good at Needle in Haystack problems:

Extracting small results from big datasets
Transparently provide old style scalability
Your query will always* finish, regardless of dataset size.
Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = GATTACGATATTA;

5/15/13

Bill Howe, eScience Institute

*almost

New task: Read Trimming

Given a set of DNA sequences
Trim the final n bps of each sequence
Generate a new dataset

5/15/13

Bill Howe, eScience Institute

TACCTGCCGTAA

GATTACGATATTA

TACCTGCCGTAA becomes TACCT

time = 0

CCCCCAATGAC becomes CCCCC

time = 1

GATTACGATATTA becomes GATTA

time = 17

Can we use an index?

No. We have to touch every record no matter what.
The task is fundamentally O(N)
Can we do any better?

time = 0

time = 1

time = 2

time = 3

How much time did this take?

7 cycles

time = 7

40 records, 6 workers
O(N/k)

Potrebbero piacerti anche

Improving Test Coverage.2
Documento8 pagine
Improving Test Coverage.2
whatisinname
100% (2)
Lab Task UTCN
Documento3 pagine
Lab Task UTCN
Cristian-Vlad Pop
Nessuna valutazione finora
Sas Data Governance Framework 107325
Documento12 pagine
Sas Data Governance Framework 107325
Daniel Softjoy Montjoy
Nessuna valutazione finora
Awesome Big Data Algorithms
Documento37 pagine
Awesome Big Data Algorithms
Yamabushi
Nessuna valutazione finora
Earthquake Prediction
Documento20 pagine
Earthquake Prediction
Gnr
Nessuna valutazione finora
Computing For Research I: Spring 2012
Documento34 pagine
Computing For Research I: Spring 2012
sakiaslam
Nessuna valutazione finora
Advanced Data Structures
Documento263 pagine
Advanced Data Structures
lennon757
100% (1)
Some Uses of Hashing in Networking Problems: Michael Mitzenmacher
Documento44 pagine
Some Uses of Hashing in Networking Problems: Michael Mitzenmacher
Kank Riyan
Nessuna valutazione finora
Traffic Analysis Using Streaming Queries: Mike Fisk Los Alamos National Laboratory
Documento27 pagine
Traffic Analysis Using Streaming Queries: Mike Fisk Los Alamos National Laboratory
Ramesh Paramasivam
Nessuna valutazione finora
DFT Questions
Documento8 pagine
DFT Questions
Adhi Suruli
Nessuna valutazione finora
PAGE8
Documento5 pagine
PAGE8
praveenmorampudi42
Nessuna valutazione finora
Teaching With Stata: Peter A. Lachenbruch & Alan C. Acock Oregon State University
Documento28 pagine
Teaching With Stata: Peter A. Lachenbruch & Alan C. Acock Oregon State University
Ahmed SBM
Nessuna valutazione finora
EEB 435 Python LECTURE 3
Documento34 pagine
EEB 435 Python LECTURE 3
osward
Nessuna valutazione finora
DFT Questions
Documento8 pagine
DFT Questions
Naga Nithesh
Nessuna valutazione finora
Analysis of Algorithms
Documento43 pagine
Analysis of Algorithms
dogan20021907
Nessuna valutazione finora
Digital Signal Processing Fundamentals
Documento40 pagine
Digital Signal Processing Fundamentals
Kapilachander Thangavel
Nessuna valutazione finora
Tuesday Exercises PDF
Documento38 pagine
Tuesday Exercises PDF
for_you882002
Nessuna valutazione finora
Unit Ii BD
Documento74 pagine
Unit Ii BD
keerthanavelmurugan02
Nessuna valutazione finora
Biological Database (BIT2002)
Documento4 pagine
Biological Database (BIT2002)
kumarkl
Nessuna valutazione finora
Automation High Througput
Documento41 pagine
Automation High Througput
Katia Conceição
Nessuna valutazione finora
SPC Basics: Presented By: Tariq Khurshid
Documento50 pagine
SPC Basics: Presented By: Tariq Khurshid
tkhurshid3997
Nessuna valutazione finora
Task 1: You Need To Do Only 1 Task. Feel Free To Do All of Them, It Might Be A Good Learning Opportunity
Documento9 pagine
Task 1: You Need To Do Only 1 Task. Feel Free To Do All of Them, It Might Be A Good Learning Opportunity
MARIA FERNANDA MARTINEZ VAZQUEZ
Nessuna valutazione finora
Analysis of Algorithms Into Coursenotes
Documento68 pagine
Analysis of Algorithms Into Coursenotes
bsitler
Nessuna valutazione finora
CSCE626 Amato LN PerformanceAnalysisMethodology
Documento19 pagine
CSCE626 Amato LN PerformanceAnalysisMethodology
Raj Kumar Yadav
Nessuna valutazione finora
Short Read Alignment Algorithms: Raluca Gordân
Documento47 pagine
Short Read Alignment Algorithms: Raluca Gordân
Biswa
Nessuna valutazione finora
Cluster Computer For Bioinformatics Applications: Nile University, Bioinformatics Group
Documento52 pagine
Cluster Computer For Bioinformatics Applications: Nile University, Bioinformatics Group
abhaytech8451
Nessuna valutazione finora
Digital Signal Processing Fundamentals
Documento40 pagine
Digital Signal Processing Fundamentals
jokerbuddy01
Nessuna valutazione finora
05 Timing
Documento7 pagine
05 Timing
Aditya Soni
Nessuna valutazione finora
Introduction To Big Data & Basic Data Analysis
Documento47 pagine
Introduction To Big Data & Basic Data Analysis
maninaga
Nessuna valutazione finora
Concurrency Control: R&G - Chapter 17
Documento24 pagine
Concurrency Control: R&G - Chapter 17
raw.junk
Nessuna valutazione finora
More Bi Go
Documento25 pagine
More Bi Go
Ices Matara
Nessuna valutazione finora
Assignment 2 Grandhi Vyasa Maharshi
Documento7 pagine
Assignment 2 Grandhi Vyasa Maharshi
RAKESH KUMAR SINGH
Nessuna valutazione finora
Oracle
Documento28 pagine
Oracle
donnie
Nessuna valutazione finora
Our Course: Prof. David Hay Rothberg A - 411 Dhay@cs - Huji.ac - Il
Documento13 pagine
Our Course: Prof. David Hay Rothberg A - 411 Dhay@cs - Huji.ac - Il
IjazKhan
Nessuna valutazione finora
Quantum Computers: Chapter-1
Documento20 pagine
Quantum Computers: Chapter-1
ravitejakokkula
Nessuna valutazione finora
A "How To" Tutorial On Logic Analyzer Basics For Digital Design
Documento4 pagine
A "How To" Tutorial On Logic Analyzer Basics For Digital Design
shinojose
Nessuna valutazione finora
User Traffic Modeling For Future Mobile Systems
Documento27 pagine
User Traffic Modeling For Future Mobile Systems
Anonymous SPEYHIjc1
Nessuna valutazione finora
ANN Doc
Documento2 pagine
ANN Doc
sonali Pradhan
Nessuna valutazione finora
Petsc Tutorials
Documento231 pagine
Petsc Tutorials
latec
Nessuna valutazione finora
Data Flow Architecure
Documento11 pagine
Data Flow Architecure
Saravanan Kumarasamy
Nessuna valutazione finora
DAA or Algorithms in 9 Hours
Documento344 pagine
DAA or Algorithms in 9 Hours
somanishelendra3
Nessuna valutazione finora
Kappa
Documento24 pagine
Kappa
Avinash_Negi_7301
Nessuna valutazione finora
Catep Tutorial
Documento38 pagine
Catep Tutorial
Samyabrata Saha
Nessuna valutazione finora
Synthesys Intro
Documento35 pagine
Synthesys Intro
Nikunj Vadodaria
Nessuna valutazione finora
2021 09 24 Scalable-Monitoring-Tools
Documento49 pagine
2021 09 24 Scalable-Monitoring-Tools
leufm
Nessuna valutazione finora
Streaming Algorithms: Ajinkya Potdar Hemanga Krishna Borah
Documento47 pagine
Streaming Algorithms: Ajinkya Potdar Hemanga Krishna Borah
selva
Nessuna valutazione finora
Research Citation Notes
Documento35 pagine
Research Citation Notes
Web Best Wabii
Nessuna valutazione finora
Data Pre-Processing: - Data Cleaning - Data Integration - Data Transformation - Data Reduction - Data Discretization
Documento55 pagine
Data Pre-Processing: - Data Cleaning - Data Integration - Data Transformation - Data Reduction - Data Discretization
Chanda Test
Nessuna valutazione finora
Six Sigma Training Posters
Documento32 pagine
Six Sigma Training Posters
Denise Cheung
Nessuna valutazione finora
Congestion Control Algorithms
Documento4 pagine
Congestion Control Algorithms
Ami Verma
Nessuna valutazione finora
CASTEP Startup
Documento38 pagine
CASTEP Startup
Muraleetharan Boopathi
Nessuna valutazione finora
Teaching With Stata: Peter A. Lachenbruch & Alan C. Acock Oregon State University
Documento28 pagine
Teaching With Stata: Peter A. Lachenbruch & Alan C. Acock Oregon State University
Hector Garcia
Nessuna valutazione finora
Clock and Data Recovery PHD Thesis
Documento6 pagine
Clock and Data Recovery PHD Thesis
azyxppzcf
100% (2)
Mining Data Streams
Documento33 pagine
Mining Data Streams
Hashir Khan
Nessuna valutazione finora
Mmd04A Streams
Documento78 pagine
Mmd04A Streams
ammuachunew1234
Nessuna valutazione finora
Algorithm Design
Documento16 pagine
Algorithm Design
Eyob Wondimkun
Nessuna valutazione finora
Time Series Analysis
Documento49 pagine
Time Series Analysis
Lars Larson
Nessuna valutazione finora
Bucket Sort: ECE 250 Algorithms and Data Structures
Documento22 pagine
Bucket Sort: ECE 250 Algorithms and Data Structures
khyatinirmal
Nessuna valutazione finora
Netezza Tips and Tricks
Documento11 pagine
Netezza Tips and Tricks
CharanTeja
Nessuna valutazione finora
Random Sample Consensus: Robust Estimation in Computer Vision
Da Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
Nessuna valutazione finora
Validated Numerics: A Short Introduction to Rigorous Computations
Da Everand
Validated Numerics: A Short Introduction to Rigorous Computations
Warwick Tucker
Nessuna valutazione finora
Certified CMMI Associate Exam Was Completed by Daniel Geray Montjoy Pita On Aug 21, 2017 2:56:34 PM
Documento8 pagine
Certified CMMI Associate Exam Was Completed by Daniel Geray Montjoy Pita On Aug 21, 2017 2:56:34 PM
Daniel Softjoy Montjoy
Nessuna valutazione finora
Building Blocks of Deep Neural Networks
Documento7 pagine
Building Blocks of Deep Neural Networks
Daniel Softjoy Montjoy
Nessuna valutazione finora
Digital Transformation Strategy
Documento6 pagine
Digital Transformation Strategy
Daniel Softjoy Montjoy
Nessuna valutazione finora
Introdatasciencepython Week1
Documento5 pagine
Introdatasciencepython Week1
Daniel Softjoy Montjoy
Nessuna valutazione finora
Lecture 13 Quiz - Coursera
Documento5 pagine
Lecture 13 Quiz - Coursera
Daniel Softjoy Montjoy
Nessuna valutazione finora
DG at Chevron Gom - Pnec17
Documento34 pagine
DG at Chevron Gom - Pnec17
Daniel Softjoy Montjoy
Nessuna valutazione finora
Competive On Analitycs Resumen
Documento12 pagine
Competive On Analitycs Resumen
Daniel Softjoy Montjoy
Nessuna valutazione finora
10 Kinds of Stories To Tell With Data
Documento5 pagine
10 Kinds of Stories To Tell With Data
Daniel Softjoy Montjoy
Nessuna valutazione finora
RH 120e
Documento8 pagine
RH 120e
Sawadogo Gustave Napinguebson
Nessuna valutazione finora
Green ICT: A Study of Awareness, Attitude and Adoption Among IT/Computer Engineering Students of LDRP-ITR, Gandhinagar
Documento13 pagine
Green ICT: A Study of Awareness, Attitude and Adoption Among IT/Computer Engineering Students of LDRP-ITR, Gandhinagar
AHMAD ARESYAD
Nessuna valutazione finora
Prescriptions For Closing The Seven Service Quality Gaps
Documento1 pagina
Prescriptions For Closing The Seven Service Quality Gaps
Reema Negi
Nessuna valutazione finora
Simulation of Inventory System PDF
Documento18 pagine
Simulation of Inventory System PDF
hmsohag
Nessuna valutazione finora
Wbuhs PG Thesis
Documento7 pagine
Wbuhs PG Thesis
gbx272pg
100% (2)
Pre Post and Infix Notations
Documento12 pagine
Pre Post and Infix Notations
Golla Girija
Nessuna valutazione finora
Class 12 Physics Mcqs Chapter: 6 Electromagnetic Induction: Answer
Documento12 pagine
Class 12 Physics Mcqs Chapter: 6 Electromagnetic Induction: Answer
Diksha T
Nessuna valutazione finora
Experiment #3 Venturi Meter: Home Unquantized Projects
Documento7 pagine
Experiment #3 Venturi Meter: Home Unquantized Projects
Eddy Kimathi
Nessuna valutazione finora
Evaluation of A Systematic Approach To Matrix Acidizing On An Oil Producing Well
Documento6 pagine
Evaluation of A Systematic Approach To Matrix Acidizing On An Oil Producing Well
Trần Hoàng Chương
Nessuna valutazione finora
Switch Rotatorios
Documento12 pagine
Switch Rotatorios
Roberto Andrés
Nessuna valutazione finora
GS Ep Tec 260 en
Documento61 pagine
GS Ep Tec 260 en
Cesar
Nessuna valutazione finora
Case Analysis, Case 1
Documento2 pagine
Case Analysis, Case 1
Aakarsha Maharjan
Nessuna valutazione finora
April 7-9 2022-WPS Office
Documento3 pagine
April 7-9 2022-WPS Office
Allen Antolin
Nessuna valutazione finora
International Trade Syllabus
Documento3 pagine
International Trade Syllabus
Dialee Flor Dael Baladjay
Nessuna valutazione finora
Microwave Project Report
Documento30 pagine
Microwave Project Report
Md Rakib
Nessuna valutazione finora
PSY502 OLd Papers
Documento6 pagine
PSY502 OLd Papers
cs619finalproject.com
100% (4)
Almugea or Proper Face
Documento5 pagine
Almugea or Proper Face
Valentin Badea
Nessuna valutazione finora
Introduction To Multistage Car Parking System
Documento4 pagine
Introduction To Multistage Car Parking System
International Journal of Application or Innovation in Engineering & Management
Nessuna valutazione finora
PVTP Complete PDF Free
Documento680 pagine
PVTP Complete PDF Free
maxlira
Nessuna valutazione finora
M 02 0001
Documento3 pagine
M 02 0001
Miguel ruiz
Nessuna valutazione finora
Unit 3 Approaches To The Study of Medieval Urbanisation : Structure
Documento20 pagine
Unit 3 Approaches To The Study of Medieval Urbanisation : Structure
Sunil Sunil
Nessuna valutazione finora
Inspection List For Electrical Portable
Documento25 pagine
Inspection List For Electrical Portable
Arif Fuadianto
Nessuna valutazione finora
English Action Plan 2023-2024
Documento5 pagine
English Action Plan 2023-2024
Gina Daligdig
Nessuna valutazione finora
ACFrOgDVly789-6Z8jIbi7pBoLupubEgMyOp7PczEvUguHoW3uj oR2PKzDvuhRzzkIhacYjxXRrU6iA7sHt t6MhtpZFq0t uZL2pF5Ra NNZ kmcl5w7BCQeUegKhjRhNuou88XxLodzWwbsr
Documento14 pagine
ACFrOgDVly789-6Z8jIbi7pBoLupubEgMyOp7PczEvUguHoW3uj oR2PKzDvuhRzzkIhacYjxXRrU6iA7sHt t6MhtpZFq0t uZL2pF5Ra NNZ kmcl5w7BCQeUegKhjRhNuou88XxLodzWwbsr
John Steven Llorca
Nessuna valutazione finora
Oracle Pac 2nd Key
Documento48 pagine
Oracle Pac 2nd Key
Krishna Kumar Gupta
Nessuna valutazione finora
Prestressed Concrete Problem
Documento9 pagine
Prestressed Concrete Problem
Prantik Adhar Samanta
Nessuna valutazione finora
Serie 20 Sauer Danfoss
Documento18 pagine
Serie 20 Sauer Danfoss
Cristian
100% (1)
Missing Person Project
Documento9 pagine
Missing Person Project
Laiba Waheed
Nessuna valutazione finora
PDF - Gate Valve OS and Y
Documento10 pagine
PDF - Gate Valve OS and Y
LENINROMEROH4168
Nessuna valutazione finora
Guide For H Nmr-60 MHZ Anasazi Analysis: Preparation of Sample
Documento7 pagine
Guide For H Nmr-60 MHZ Anasazi Analysis: Preparation of Sample
conker4
Nessuna valutazione finora