
University of Management & Technology

Assignment # 1 (Hadoop, MapReduce, OpenStack)

Course: Distributed Systems
Submitted to: Mr. Asif Subhani
Submitted by: M. Usman Riaz (ID: 101620043)

The Anatomy of a Large-Scale Hypertextual Web Search Engine


Sergey Brin and Lawrence Page designed Google to build a search engine that can crawl and index the web quickly and efficiently, and that can deal effectively with huge, uncontrolled hypertext collections. One of the main goals was to improve the quality and scalability of search. Another goal was to set up a system that can support novel research activities on large-scale web data, and that a reasonable number of people can actually use for their academic research. Google makes efficient use of storage space to store the index, which allows the quality of the search to scale effectively as the web grows. Its data structures are optimized for fast and efficient access.

To achieve high precision, Google uses the link structure of the web to calculate a quality ranking for each web page, called PageRank. The probability that a random surfer visits a page is its PageRank. The ranking also involves a damping factor, the probability that at each page the random surfer will get bored and request another random page. PageRank allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. The text of a link is associated both with the page the link is on and with the page the link points to. This idea of anchor text propagation provides better quality search, but the challenge was to use it efficiently because of the heavy data processing it requires. Along with PageRank, Google keeps track of location information for all hits and some visual presentation details, and it stores the full raw HTML of pages in a repository.

Most of Google's architecture is implemented in C or C++ for efficiency and can run on either Solaris or Linux. Google's data structures include BigFiles, a document index, a lexicon, forward and inverted indexes, and a huge repository. They are optimized for cost by avoiding disk seeks whenever possible. Google has a fast distributed crawling system, in which the URLserver and the crawlers are implemented in Python. Each crawler maintains a DNS cache to reduce the number of DNS lookups, and uses asynchronous I/O and a number of queues. The steps involved in indexing are parsing, indexing documents into barrels using multiple indexers running in parallel, and sorting. Google's ranking system is designed so that no particular factor can have too much influence. The dot product of the vector of count-weights with the vector of type-weights is used to compute an IR score for the document.
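In the original paper, the PageRank of a page A that is linked to by pages T1 ... Tn is defined as PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where C(T) is the number of links going out of page T and d is the damping factor (typically around 0.85). As a rough illustration only, the Python sketch below iterates this formula over a made-up three-page link graph; the graph, starting values, and iteration count are invented for the example and are not part of the paper.

links = {                       # page -> pages it links to (invented example)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                        # damping factor: chance the surfer keeps following links
pr = {page: 1.0 for page in links}

for _ in range(50):             # iterate until the values settle
    pr = {
        page: (1 - d) + d * sum(pr[src] / len(outs)
                                for src, outs in links.items() if page in outs)
        for page in links
    }

print(pr)                       # pages linked to by important pages end up ranked higher

The real system runs this over hundreds of millions of pages, which is why the storage layout and crawling efficiency described above matter so much.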

Finally, the IR score is combined with PageRank to give a final rank to the document. For multi-word searches, Google uses a more complex algorithm that also takes the proximity of the hits into account. Google also considers feedback from trusted users when updating the ranking of web pages. Google can produce better results than the major commercial search engines for most searches. Google has evolved to overcome a number of bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network I/O during various operations. Because crawling and indexing are efficient, the information can be kept up to date and major changes can be tested relatively quickly. Google does not yet have optimizations such as query caching or sub-indices on common terms. The inventors intended to speed Google up considerably through distribution and through hardware, software, and algorithmic improvements. They wished to make Google a high-quality search tool for searchers and researchers all around the world, sparking the next generation of search engine technology.

Hadoop
Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users, and it is licensed under the Apache License 2.0. The Hadoop platform was designed to solve problems where you have a lot of data, perhaps a mixture of complex and structured data, that doesn't fit nicely into tables. It is for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That is exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

Hadoop applies to many markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine, but Hadoop can handle them. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.

Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, the software busts that data into pieces that it then spreads across your different servers. There is no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically re-replicated from a known good copy. In a centralized database system, you've got one big disk connected to four or eight or 16 big processors, but that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That is MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
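To make the splitting and replication idea concrete, here is a small toy model in Python. It is not HDFS itself; the block size, replica count, and server names are made up purely for illustration.

BLOCK_SIZE = 64                      # bytes here; real HDFS blocks are e.g. 128 MB
REPLICAS = 3                         # each block is kept on three different servers
SERVERS = [f"server-{i}" for i in range(8)]

def place_blocks(data):
    """Split data into blocks and record which servers hold each block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for n, _block in enumerate(blocks):
        # simple round-robin placement; a real system also considers rack locality
        placement[n] = [SERVERS[(n + r) % len(SERVERS)] for r in range(REPLICAS)]
    return blocks, placement

blocks, placement = place_blocks(b"x" * 300)
print(placement)                     # e.g. {0: ['server-0', 'server-1', 'server-2'], ...}

The framework plays the role of the placement table here: it remembers where every piece lives, so if one server dies, the surviving copies can be used to re-replicate the lost blocks.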

MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
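A minimal in-process sketch of the model, using the students-by-first-name example above; a real MapReduce framework would run the map and reduce functions on different machines and handle the shuffle step itself.

from collections import defaultdict

def map_phase(records):
    """Map: emit a (first_name, 1) pair for every student record."""
    for student in records:
        yield student["first_name"], 1

def shuffle(pairs):
    """Group values by key, like the framework's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: count how many students are queued under each first name."""
    return {name: sum(ones) for name, ones in groups.items()}

students = [{"first_name": "Ada"}, {"first_name": "Alan"}, {"first_name": "Ada"}]
print(reduce_phase(shuffle(map_phase(students))))    # {'Ada': 2, 'Alan': 1}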

Example:
Let's look at a simple example. Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the temperature recorded in that city on various measurement days. Of course we've made this example very simple so it's easy to follow. You can imagine that a real application won't be quite so simple, as it's likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small the amount of data you need to analyze, the key principles covered here remain the same. Either way, in this example, city is the key and temperature is the value:

Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18

Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files, goes through the data, and returns the maximum temperature for each city. For example, the results produced by one mapper task for the data above would look like this:

(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)

Let's assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)

All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows:

(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)

As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times: the census bureau would dispatch its people to each city in the empire, each census taker would be tasked with counting the number of people in that city, and the results would be returned to the capital. There, the results from all the cities would be reduced to a single count (the sum over all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining of the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion.
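Sketched in plain Python (no cluster, no Hadoop), the same computation could look like the following; in a real job each mapper call would run on a separate node against its own file.

from collections import defaultdict

def mapper(lines):
    """One map task: scan one file's lines and keep the max temperature per city."""
    maxima = {}
    for line in lines:
        city, temp = [field.strip() for field in line.split(",")]
        temp = int(temp)
        if city not in maxima or temp > maxima[city]:
            maxima[city] = temp
    return maxima

def reducer(per_file_maxima):
    """Reduce task: merge the per-file maxima into a single value per city."""
    final = defaultdict(lambda: float("-inf"))
    for result in per_file_maxima:
        for city, temp in result.items():
            final[city] = max(final[city], temp)
    return dict(final)

file1 = ["Toronto, 20", "Whitby, 25", "New York, 22", "Rome, 32",
         "Toronto, 4", "Rome, 33", "New York, 18"]
# With all five files this would print the final (Toronto, 32) (Whitby, 27) ... result.
print(reducer([mapper(file1)]))      # {'Toronto': 20, 'Whitby': 25, 'New York': 22, 'Rome': 33}

This decomposition works because taking a maximum is associative and commutative, so the reducer can merge the partial maxima from the five mappers in any order.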

OpenStack
OpenStack is an open source platform that lets you build an Infrastructure as a Service (IaaS) cloud that runs on commodity hardware and scales massively. The main components of OpenStack are:

Object Store (codenamed Swift) provides object storage. It allows you to store or retrieve files (but not mount directories like a fileserver).
Image (codenamed Glance) provides a catalog and repository for virtual disk images. These disk images are most commonly used in OpenStack Compute.
Compute (codenamed Nova) provides virtual servers on demand.
Dashboard (codenamed Horizon) provides a modular web-based user interface for all the OpenStack services. With this web GUI, you can perform most operations on your cloud, such as launching an instance, assigning IP addresses, and setting access controls.
Identity (codenamed Keystone) provides authentication and authorization for all the OpenStack services. It also provides a catalog of the services within a particular OpenStack cloud.
Network (codenamed Neutron, known as Quantum in earlier releases) provides network connectivity as a service between interface devices managed by other OpenStack services (most likely Nova). The service works by allowing users to create their own networks and then attach interfaces to them.
Block Storage (codenamed Cinder) provides persistent block storage to guest VMs.
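As a rough sketch of how these pieces fit together from a client's point of view, the snippet below uses the openstacksdk Python library to ask Nova for a server built from a Glance image on a Neutron network, authenticating through Keystone. The cloud name, image, flavor, and network names are placeholders and assume credentials are already configured in a clouds.yaml file.

import openstack

# Authenticate against Keystone using a clouds.yaml entry ("mycloud" is a placeholder).
conn = openstack.connect(cloud="mycloud")

# Ask Nova for a virtual server built from a Glance image, sized by a flavor,
# and plugged into a Neutron network; all of these names are examples.
server = conn.create_server(
    name="demo-vm",
    image="ubuntu-22.04",
    flavor="m1.small",
    network="private",
    wait=True,
)
print(server.status)                 # ACTIVE once the instance has booted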

Flavors:
An OpenStack virtual machine is created from a virtual hardware template called a flavor. The flavor defines the number of virtual CPU cores (each a share of a physical CPU core assigned to the virtual machine), the amount of memory, and the storage used by the virtual machine. The default OpenStack flavors (m1.tiny through m1.xlarge) cover a range of fixed vCPU, memory, and disk sizes.

Additional flavors can be created to match the recommended hardware requirements of an application. For an existing application, in the near term, a flavor can be specified that matches the performance of the dedicated hardware previously used. In the future, an application's engineering rules can be modified to work with smaller units of compute resources, allowing operators to be more granular in their choice of OpenStack flavor size. It should be possible to fit an equation to the performance curve so that the size of the OpenStack flavor can be left up to the operator. In either case, an agent could be developed to validate the resource allocation (CPU, RAM, storage, disk swapping, etc.) incorporated into the application. Note that non-persistent (ephemeral) storage is local disk space: if the VM is terminated for any reason, that disk space is lost. Care must be taken to ensure persistent data is kept in permanent storage; in the OpenStack case, persistent block storage is provided by Cinder.
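For illustration, a custom flavor matching an application's recommended hardware could be defined through the openstacksdk admin API roughly as follows; the flavor name and sizes are made-up example values.

import openstack

conn = openstack.connect(cloud="mycloud")     # placeholder cloud name

# Creating flavors requires admin rights; the sizes below are example numbers only.
flavor = conn.create_flavor(
    name="app.medium",     # hypothetical flavor name
    ram=8192,              # MB of memory
    vcpus=4,               # virtual CPU cores
    disk=80,               # GB of ephemeral disk
)
print(flavor.id)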

Virtual Instances:
Virtual instances are hosted on an OpenStack compute node. Ideally, all virtual instances on a compute node dedicated to a single service will be the same flavor. The scheduler will create virtual instances on a compute node until either the virtual CPU core limit or the virtual memory limit of that compute node is reached. The default CPU overcommit ratio (increasing the number of virtual CPU cores on a compute node at the cost of reduced performance) is 16 virtual cores to 1 physical core, but for CPU-intensive applications a ratio of 4:1 or 2:1 is more applicable, and if warranted it could be set to 1:1. Increasing the number of virtual instances on a node will adversely affect the performance of those instances. For an existing application that provides engineering rules with precise performance modeling, a 1:1 ratio can be mandated for deterministic behavior.
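A back-of-the-envelope sketch of those limits: the toy numbers below (node size, overcommit ratios, and flavor) are made up for the example and simply show how many instances of one flavor fit on a compute node before the vCPU or memory ceiling is reached.

PHYSICAL_CORES = 16
CPU_OVERCOMMIT = 16                  # default 16:1; use 2 or 1 for CPU-bound workloads
RAM_MB = 128 * 1024
RAM_OVERCOMMIT = 1.5                 # memory is usually overcommitted far less aggressively

flavor = {"vcpus": 4, "ram_mb": 8192}            # hypothetical flavor

vcpu_capacity = PHYSICAL_CORES * CPU_OVERCOMMIT  # 256 virtual cores at 16:1
ram_capacity = RAM_MB * RAM_OVERCOMMIT

fit_by_cpu = vcpu_capacity // flavor["vcpus"]
fit_by_ram = int(ram_capacity // flavor["ram_mb"])
# The scheduler stops at whichever limit is hit first.
print(min(fit_by_cpu, fit_by_ram))               # 24 instances here: memory is the ceiling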

Data storage:
Data storage options available to a virtual instance in OpenStack are:

Local file system (managed by Nova) on the compute node
Block storage volumes (Cinder) on the block storage node
Software image repository (Glance) on an image server

A bottleneck with virtual machines is local disk I/O through the hypervisor, since all virtual instances have their local file systems on the compute node's local hard disk. The problem can be mitigated by putting an SSD with a high IOPS (I/O operations per second) rate in the compute node and by migrating critical data to a block storage node. If the compute node goes down and the applications running on its virtual instances must be recovered in their current state, then critical data must be stored outside the virtual instance on an external block storage node. The block storage is attached to the virtual machine via iSCSI. Having a separate storage node also allows block storage to be scaled and made redundant independently of the compute node. The block storage node should have a high IOPS rate and be reliably accessible over the network. The Glance component can store snapshots of a virtual instance, which can be used to back up and quickly restore a virtual instance together with its application. A database, such as Oracle RAC, is treated as an entity external to the OpenStack environment and is accessed through standard approaches such as JDBC and ODBC. In the longer term, the database server could be run in a virtual instance.
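As a sketch of keeping critical data on Cinder rather than on the compute node's local disk, the openstacksdk calls below create a volume and attach it to a running instance; the cloud name, server name, and volume size are placeholders.

import openstack

conn = openstack.connect(cloud="mycloud")        # placeholder cloud name
server = conn.get_server("demo-vm")              # hypothetical instance name

# Create a persistent Cinder volume and attach it to the instance; inside the
# guest it appears as a new block device (for example /dev/vdb) served over iSCSI.
volume = conn.create_volume(size=100, name="app-data", wait=True)
conn.attach_volume(server, volume, wait=True)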

