Assistant Professor
Lecture # 2
Software Engineering Department
Sir Syed University of Engineering & Technology

Content
• Idea of Big Data
• Example Models
• Relevance of Big Data
• Key Computing Resources for Big Data
• Scalability — Scale Up & Scale Out
• Techniques towards Big Data
• Why Big Data now?
• Contrasting Approaches in Adopting High-Performance Capabilities
• Big Data Market

Idea of Big Data
• Methods of obtaining knowledge: theory (model), hypothesis, experiment, analysis (repeat)
▫ Explorative: start the theory from observations of phenomena
▫ Constructivism: start from axioms and reason about their implications
• (Big) Data + Analytics ⇒ Insight (prediction of the future)
▫ For industry: insight = business advantage and money...
• Analytics: follow an explorative approach and study the data
▫ To infer knowledge, use statistics / machine learning
• Construct a theory (model) and validate it with the data

Example Models
• Similarity is a (very) simplistic model and predictor for the world
▫ Humans use this approach in their cognitive process
▫ Takes advantage of Big Data
• Weather prediction
▫ You may develop and rely on complex models of physics
▫ Or use a simple model for a particular day, e.g. expect it to be similar to the weather on that calendar day over the last X years
• Preferences of humans
▫ Identify a set of people who liked the items you like
▫ Predict that you will also like the other items those people like (items you haven't rated so far)

Relevance of Big Data
• Big Data Analytics is emerging
• Relevance increases compared to supercomputing
[Figure: Google Search Trends, relative searches]
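
The two similarity-based models from the Example Models slide can be sketched in a few lines of Python. This is only an illustrative sketch, not a production method: the function names, the `history` dictionary keyed by `(year, month, day)`, and the `ratings` dictionary of liked-item sets are all assumptions made for this example, and "similarity" between two people is simply taken to be the number of liked items they share.

```python
from statistics import mean


def predict_temperature(history, month, day):
    """Simple weather model: average the value observed on the same
    calendar day in all past years found in `history`.

    `history` maps (year, month, day) -> temperature (hypothetical layout).
    """
    same_day = [t for (_, m, d), t in history.items() if (m, d) == (month, day)]
    if not same_day:
        raise ValueError("no observations for that calendar day")
    return mean(same_day)


def predict_liked_items(ratings, user):
    """Simple preference model: recommend items liked by people who
    share at least one liked item with `user`.

    `ratings` maps a person -> set of liked items (hypothetical layout).
    """
    mine = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # similarity = number of shared liked items
        if overlap == 0:
            continue
        for item in theirs - mine:  # only items `user` has not rated so far
            scores[item] = scores.get(item, 0) + overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Both functions get better as more observations are available (more past years, more people), which is the sense in which such simple models take advantage of Big Data.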
Key Computing Resources for Big Data
("Big Data Analytics", David Loshin, 2013)
• Processing capability: CPU, processor, or node
• Memory
• Storage
• Network

Scalability — Scale Up & Scale Out
• Scale out
▫ Use more resources to distribute the workload in parallel
▫ Typically incurs higher data access latency
• Scale up
▫ Use the available resources efficiently
▫ Architecture-aware algorithm design

Scalability — Scale Up & Scale Out
• For independent data ==> scale up may have no obvious advantage over scale out
• For linked data ==> use scale up as much as possible before scaling out

Techniques towards Big Data
• Massive Parallelism
• Data Mining and Analytics
• Huge Data Volumes Storage
• Data Retrieval
• Data Distribution
• Machine Learning
• High-Speed Networks
• Data Visualization
• High-Performance Computing
• Task and Thread Management
➔ These techniques have existed for years to decades. Why is Big Data hot now?

Why Big Data now?
• More data are being collected and stored
• Open source code
• Commodity hardware / Cloud
➔
• High-Volume
• High-Velocity
• High-Variety
➔
• Artificial Intelligence

Contrasting Approaches in Adopting High-Performance Capabilities

| Aspect | Typical Scenario | Big Data |
|---|---|---|
| Application Development | Applications that take advantage of massive parallelism, developed by specialized developers skilled in high-performance computing, performance optimization, and code tuning. | A simplified application execution model encompassing a distributed file system, application programming model, distributed database and program scheduling is packaged within Hadoop, an open source framework for reliable, scalable, distributed and parallel computing. |
| Platform | Uses high-cost massively parallel processing (MPP) computers, utilizing high-bandwidth networks and massive I/O devices. | Innovative methods of creating scalable and yet elastic virtualized platforms take advantage of clusters of commodity hardware components (cloud-based utility computing services) coupled with open source tools and technology. |
| Data Management | Limited to file-based or relational database management systems (RDBMS) using standard row-oriented data layouts. | Alternate models for data management (i.e. NoSQL) provide a variety of methods for managing information to best suit specific business process needs, such as in-memory data management (for rapid access), columnar layouts to speed query response, and graph databases (for social network analytics). |
| Resources | Requires large capital investment in purchasing high-end hardware to be installed and managed in-house. | The ability to deploy systems like Hadoop on virtualized platforms allows small and medium businesses to utilize cloud-based environments. |

Big Data Market

Human Brain is a Graph/Network of 100 B nodes and 1 T edges

Summary
• Idea of Big Data
• Example Models
• Relevance of Big Data
• Key Computing Resources for Big Data
• Scalability — Scale Up & Scale Out
• Techniques towards Big Data
• Why Big Data now?
• Contrasting Approaches in Adopting High-Performance Capabilities
• Big Data Market
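
Worked example: scale out on independent data. As a small illustration of the scale-out idea from the Scalability slides, the sketch below splits a list of text records into independent partitions, counts words in each partition in parallel, and then merges the partial results, in the spirit of a map and reduce step. It is a toy under stated assumptions and not how Hadoop is implemented: the function names are hypothetical, and local threads from Python's standard `concurrent.futures` module stand in for the separate nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n_partitions):
    """Deal the records round-robin into n_partitions independent chunks."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def count_words(partition):
    """Map step: a worker counts words in its own partition only."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def scale_out_word_count(records, n_workers=4):
    """Run the map step on every partition in parallel, then merge."""
    partitions = split_into_partitions(records, n_workers)
    # Threads stand in for cluster nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:  # reduce step: combine the independent results
        total.update(partial)
    return total
```

Because no partition needs data from another, adding workers only requires splitting the records into more chunks. With linked data (for example a graph), workers would have to exchange information, which is why the slides suggest scaling up first in that case.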