Universal Scale
FILE STORAGE AT THE CROSSROADS
Contents
Executive summary
A perfect storm
Universal-scale file storage
Rethinking the file storage industry
The limitations of on-premises storage appliances
Inadequacy of cloud-based file solutions
Qumulo File Fabric (QF2)
How QF2 works
The QF2 file system
Real-time quotas
Snapshots
Continuous replication
Scalable Block Store (SBS)
Object and file storage
Conclusion
As a result, new requirements for file-based storage are emerging. The new requirements
point to the need for a universal-scale file storage system. For example, such a system has
no upper limit on the number of files it can manage no matter their size, and it can run
anywhere, whether on-premises or in the cloud.
Qumulo File Fabric (QF2) is a modern, highly scalable file storage system that spans the data
center and the public cloud. It scales to billions of files, costs less and has lower TCO than
legacy storage appliances. It is also the highest performance file storage system on premises
and in the cloud. Real-time analytics let administrators easily manage data no matter how
large the footprint or where it’s located globally. QF2’s continuous replication enables data
to move where it’s needed when it’s needed, for example, between on-premises clusters and
clusters running in the cloud.
QF2 runs on industry-standard hardware and was designed from the ground up to meet all
of today’s requirements for scale. QF2 is the world’s first universal-scale file storage system,
allowing the modern enterprise to easily represent and manage file sets numbering in the
billions of files, in any operating environment, anywhere in the world.
A perfect storm
IDC predicts that the amount of data created will reach 40 zettabytes (a zettabyte is a billion
terabytes) by 2020, and that there will be more than 163 zettabytes by 2025. This is ten times
the data generated in 2016.1 Approximately 90% of this growth will be for file and object
storage.
Machine-generated data, virtually all of which is file based, is one of the primary factors
behind the dramatic acceleration of data growth. Life sciences researchers developing the
latest medical breakthroughs use vast amounts of file data for genome sequences and share
that data with colleagues around the world; oil and gas companies' greatest assets are their
file-based seismic data used for natural gas and oil discovery; every movie and television
program we watch is increasingly produced on computers and stored as files. Text-based log
files—data about machines, created by machines—have proliferated to the point of becoming
big data.

One drop of human blood creates enough data to fill an entire laptop computer, and some
research projects require a million drops.
1. http://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf
There is also a trend toward higher resolution digital assets. Uncompressed 4K video is the
new standard in media and entertainment. The resolution of digital sensors and scientific
equipment is constantly increasing. Higher resolution causes file sizes to grow more than
linearly. A doubling of the resolution of a digital photograph increases its size by four times.
As the world demands more fidelity from digital assets, its storage requirements grow.
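To make that arithmetic concrete, the short Python sketch below estimates uncompressed frame sizes; the dimensions and bit depth are illustrative assumptions for the example, not QF2 specifics.

    # Back-of-envelope estimate of uncompressed image/frame sizes.
    # Dimensions and bit depth are illustrative assumptions, not QF2 specifics.

    def frame_bytes(width, height, bytes_per_pixel=3):
        """Uncompressed size of a single frame: width x height x bytes per pixel."""
        return width * height * bytes_per_pixel

    hd_frame  = frame_bytes(1920, 1080)   # ~6.2 MB per uncompressed 8-bit HD frame
    uhd_frame = frame_bytes(3840, 2160)   # doubling both dimensions gives 4x the bytes

    print(uhd_frame / hd_frame)           # 4.0: size grows with the square of linear resolution

    # At 24 frames per second, one hour of uncompressed 4K video:
    print(uhd_frame * 24 * 3600 / 1e12, "TB")  # roughly 2.1 TB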
At the same time, there have been huge advances in data analysis and machine learning over
the past decade. These advances have suddenly made data more valuable over time rather
than less. Scrambling to adapt to the new landscape of possibilities, businesses are forced
into a “better to keep it than miss it later” philosophy.
The trend toward massive data footprints and the development of sophisticated analytical
tools were paralleled by the advent of the public cloud. Its arrival overturned many basic
assumptions about how storage should work.
The cloud meant that elastic compute resources and global reach were now achievable
without building data centers across the world. Consequently, new ways of working have
arrived and are here to stay. All businesses realize that, in the future, they will no longer
be running their workloads out of single, self-managed data centers. Instead, they will be
moving to multiple data centers, with one or more in the public cloud. This flexibility will
help them adapt to a world with distributed employees and business partners. Companies
will focus their resources on their core lines of business instead of IT expenditures. Most
will improve their disaster recovery and business continuity plans, and many will do this
by taking advantage of the elasticity provided by the cloud.
All these conditions have come together to form a perfect storm of storage disruption that
existing large-scale file systems will find hard to weather.
Users of legacy scale-up and scale-out file systems, the workhorses of file-based data, find
that those systems are inadequate for a future dominated by big data. A core part of the
problem is that the metadata of large file systems—their directory structures and file
attributes—has itself become big data. All existing solutions rely on brute force to
give insight into the storage system, and brute force has been defeated by scale. For example,
tree walks, the sequential processes that scan nested directories as part of routine manage-
ment tasks, have become computationally infeasible. Brute force methods are fundamental
to the way legacy file systems are designed and cannot be fixed with patches.
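The sketch below (standard-library Python, not Qumulo code) shows the kind of brute-force tree walk such management tasks depend on; because it must visit every file, its cost grows with the total number of entries in the namespace, which is what becomes infeasible at billions of files.

    import os

    def directory_usage(root):
        """Brute-force capacity scan: visits every directory and stats every file.

        Runtime is proportional to the total number of files and directories,
        so on a billion-file system a routine query like this becomes infeasible.
        """
        total_bytes = 0
        total_files = 0
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    total_bytes += os.path.getsize(os.path.join(dirpath, name))
                    total_files += 1
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        return total_bytes, total_files

    # Example: answering "how much does /projects use?" requires touching every file.
    # print(directory_usage("/projects"))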
Against this backdrop of profound change, users of file storage still need to maintain and
safely manage large-scale, complex workflows that rely on collaborations between many
distinct computer programs and humans. Moreover, the traditional buying criteria of price,
performance, ease-of-use and reliability remain as important as ever, no matter how much
the landscape has changed.
Storage is at the crossroads, and the new problems must be faced. Until that happens, users
of large-scale file storage will continue to struggle to understand what is going on inside
their systems. They will struggle to cope with massive amounts of data. They will struggle
to meet the demands for global reach, with few good options for file-based data that spans
the data center and the public cloud.
Universal-scale file storage scales to billions of files. The notion that capacity is only
measured in terms of bytes of raw storage is giving way to a broader understanding that
capacity is just as often defined by the number of digital assets that can be stored. Modern
file-based workflows include a mix of large and small files, especially if they involve any
amount of machine-generated data. As legacy file systems reach the limits in the number
of digital assets they can effectively store, buyers can no longer assume that they will have
adequate file capacity.
Companies want to benefit from the rapid technical and economic advances of standard
hardware, such as denser drives and lower-cost components. They want to reduce the
complexity of hardware maintenance through standardization and streamlined configura-
tions. The trend of smart software on standard hardware outpacing proprietary hardware
will only increase. Users of large-scale file storage require their storage systems to run on a
variety of operating environments and not be tied to proprietary hardware.
Universal-scale file storage scales across geographic locations with data mobility.
Businesses are increasingly global. Their file-based storage systems must now scale across
geographic locations. This may involve multiple data centers, and almost certainly the public
cloud. A piecemeal approach and a label that says “Cloud Ready” won’t work. True mobility
and geographic reach are now required.
Universal-scale file storage gives access to rapid innovation. Modern file storage needs a
simple, elegant design and advanced engineering. Companies who develop universal-scale
file storage will use Agile development processes that emphasize rapid release cycles and
continual access to innovation. Three-year update cycles, a result of cumbersome “waterfall”
development processes, are a relic of the past that customers can no longer tolerate.
Universal-scale file storage enables elastic consumption of file storage. As the needs of
lines of business surpass what central IT can provide in a reasonable time frame, access to
elastic compute resources has become a requirement. A flexible, on-demand usage model is
a hallmark of the public cloud. However, the shift to cloud has stranded users of large-scale
file storage, who have no effective way to harness the power the cloud offers.
Legacy systems are expensive, and their inefficiency adds even more to their cost. Generally,
only 70% to 80% of the provisioned storage capacity is actually available. Performance
suffers if the system gets any fuller. Another problem is that legacy systems were not designed
for the higher drive densities that are now available. Rebuild times in the event of a failed
disk can stretch into days.
Finally, there is no visibility into the data. Getting information about how the system is
being used is clumsy and slow. It can take so long to get the information that it is outdated
even before the administrator sees it.
The efforts of legacy storage appliance vendors to pivot to the cloud have resulted in
solutions with limited capacity and no scalable performance. This inflexibility negates the
opportunities for elastic resources that are the very reason people are turning to the cloud.
Gateway products are an alternative to cloud and legacy vendor solutions. These facilitate
the connection between on-premises file storage and cloud compute instances, but they
don’t store files where they are needed, especially for long-lived workloads.
None of the solutions provide visibility and control of the data footprint in the cloud, which
leads to overprovisioning of capacity, performance or both. In general, current solutions for
file storage in the cloud are piecemeal approaches that address only parts of the problem.
Customers are stranded in their attempts to integrate file-based
workloads with the cloud.
QF2 has a unique ability to scale. Here are some of the capabilities that set it apart from
legacy file storage solutions.
QF2 scales to billions of files. With QF2, you can use any mix of large and small files and
store as many files as you need. There is no practical limit with Qumulo’s advanced file-sys-
tem technology. Many Qumulo customers have data footprints in excess of a billion files.
This is in stark contrast to legacy scale-out storage which was not designed to handle
modern workflows with mixed file sizes, and which bogs down and becomes inefficient when
there are many small files. QF2 is vastly more efficient at representing and protecting small
files than legacy scale-out NAS, typically requiring one third the storage capacity and half
the protection overhead.

Our research organization falls between the cracks for most storage vendors, with giant
imaging sets and millions of tiny genetic sequencing scraps. Finding a system that reasonably
handled all our complex workflows was difficult, and in the end only QF2 was the right fit.
QF2 has the highest performance. QF2 is the highest performance file storage system
on premises and in the cloud. It provides twice the price/performance compared to
legacy storage systems. On premises, QF2 is optimized for standard hardware with SSDs and
HDDs, which cost less than proprietary hardware. In the cloud, QF2 intelligently trades off
between low-latency block resources and higher-latency, lower-cost block options. Its
built-in, block-based tiering of hot and cold data delivers flash performance at hard disk
prices.

With critical high-profile projects, you want to know exactly what you’re going to be leaning
on for successful delivery. When the La La Land project came around it was make or break,
and we were never down for a moment. Qumulo is our rock, allowing us to focus on the visual
effects with absolute confidence the data is safe.
— Tim LeDoux, Founder/VFX Supervisor, Crafty Apes

QF2 has lower cost. QF2 costs less and has a lower TCO than legacy storage appliances on a
capacity basis, as measured by cost per usable terabyte. QF2’s cost advantage comes from
efficient use of storage capacity and from its use of standard hardware.
Although it may not seem immediately obvious, QF2’s cost efficiencies also make it extreme-
ly reliable. Storage system reliability is usually measured in terms of mean time to data loss
(MTTDL). MTTDL is the average number of years a given cluster will survive before there’s a
hardware failure that causes a significant loss of data. At a minimum, MTTDLs should be
measured in the tens of thousands of years.
Reprotect times matter because the longer it takes to reprotect the cluster, the more vulnera-
ble the cluster is to other failures and the poorer the MTTDL. As disks become denser, data
footprints increase, and clusters grow, a legacy storage system’s reprotect times can turn into
weeks.
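As a rough illustration of why reprotect time dominates reliability, the sketch below applies a textbook single-fault MTTDL approximation; the formula, drive count, and failure rates are generic assumptions chosen for the example, not Qumulo's protection model.

    # Textbook approximation of MTTDL for a system that tolerates one concurrent
    # disk failure: MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR).
    # Generic illustration only; the drive count and MTTF below are assumptions.

    def mttdl_years(n_drives, drive_mttf_hours, reprotect_hours):
        mttdl_hours = drive_mttf_hours ** 2 / (n_drives * (n_drives - 1) * reprotect_hours)
        return mttdl_hours / (24 * 365)

    drives = 100
    mttf = 1_000_000  # hours, a commonly published drive MTTF

    print(mttdl_years(drives, mttf, reprotect_hours=72))  # ~160 years with a 3-day rebuild
    print(mttdl_years(drives, mttf, reprotect_hours=4))   # ~2,900 years with a 4-hour reprotect

In this simplified model the eighteen-fold improvement in MTTDL comes entirely from shrinking the reprotect window, which is the point the paragraph above makes: faster reprotection directly buys reliability.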
QF2 uses sophisticated data protection techniques that enable the fastest reprotect times in
the industry. They are measured in hours, not days or weeks. When reprotect times are fast,
reliability increases. Better reliability means that administrators can greatly reduce the
level of redundancy they need to achieve target MTTDL standards, which in turn increases
storage efficiency and lowers cost.

QF2 makes 100% of user-provisioned capacity available for user files, in contrast to legacy
scale-up and scale-out NAS that only recommend using 70% to 80%.

For a critical digital media archive, QF2 is the safest place I can think to put it short of
directly in a backup vault. Soon we won’t need anything else but backup, high-speed virtual
storage, and QF2.
— Joel Hsia, Assistant Head for Systems Development, Marriott Library, University of Utah
QF2 provides real-time control at scale. The inability to manage large file systems (as
opposed to simply storing them) is the Achilles heel of legacy systems. For example, as
file systems get larger, a simple directory query can take days to run, which means even
standard management tasks, such as setting quotas, become impossible.
Another drawback to legacy systems is that analytics is not an integral part of the software.
It is a separate software package that is installed on external hardware and has its own
management interface. There is also no single place where administrators can get an over-
view of their system and then drill down into the details.
QF2 has real-time analytics that tell you what’s happening in your file system now. Analytics
is a part of the QF2 codebase; it is not an afterthought. Instead of running multiple com-
mands, parsing through pages of log files, and running separate programs, an administrator
can simply look at the GUI and understand what’s happening. For example, an administra-
tor can immediately see if a process or user is hogging system resources and, in real-time,
apply a capacity quota.
QF2 gives you the freedom to store and access your data anywhere. QF2 is hardware
independent and can run both in the data center and in the cloud, while still offering the
same interface and capabilities to users, no matter if they are on-premises, off-premises, or
spanning both. Administrators have the freedom to take advantage of the elastic compute
resources that the cloud offers and then move data back to their data centers.
QF2 has industry-leading support. Many storage customers are dissatisfied with the
support they receive from their vendors. They find them to be unresponsive and reactive
rather than proactive. Qumulo offers responsive, personal customer support, with some
of the highest Net Promoter Scores (NPS) in the industry.
QF2 pricing is based on a single, simple subscription service that covers everything, includ-
ing software, updates and support.
QF2 provides cloud-based monitoring and trends. A QF2 subscription includes cloud-
based monitoring that proactively detects potential problems, such as disk failures.
Administrators can also access the QF2 trends service, which provides historical data about
how the system is being used. This information can help lower costs and optimize workflows.

QF2 provides access to innovation. Qumulo follows Agile and other modern development
practices, which means it has many small releases that steadily improve the product and
keep it on the leading edge of what’s possible. This is in contrast to legacy storage vendors
that have infrequent releases that can keep customers waiting for improvements for years.

We use the same agile methodology at Sinclair, and I’ve seen first-hand the ability to drive
good products into production, so much faster than with traditional 18-month monolithic
releases. Given Qumulo’s existing lead on its competitors, I knew that fast development pace
would help keep it out in front of our needs.
— Nathan Larsen, Director of IT, Sinclair Oil Corporation
QF2 provides a fully programmable REST API. Customers get programmatic access to any
feature or administrative setting in QF2. The QF2 REST API is built for developers. The API is
suitable for DevOps and Agile operating approaches, which are how modern application
stacks are constructed and managed, particularly in the cloud. For example, you can use
tools such as Terraform and CloudFormation to automatically spin up QF2 clusters in the
cloud.
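As a hedged illustration of what that programmatic access can look like, the Python sketch below drives a REST endpoint with the requests library; the host, credentials, endpoint paths, and response fields are placeholders rather than documented QF2 routes, so consult the QF2 API reference for the real ones.

    # Illustrative only: a minimal sketch of driving a file-storage REST API from
    # Python. The host, credentials, and endpoint paths are placeholders, not
    # documented QF2 routes; see the actual QF2 API reference for real ones.
    import requests

    BASE = "https://qf2-cluster.example.com:8000"   # placeholder cluster address

    def login(username, password):
        resp = requests.post(f"{BASE}/session/login",           # hypothetical path
                             json={"username": username, "password": password})
        resp.raise_for_status()
        return resp.json()["bearer_token"]                      # assumed response shape

    def cluster_capacity(token):
        resp = requests.get(f"{BASE}/file-system/capacity",     # hypothetical path
                            headers={"Authorization": f"Bearer {token}"})
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        token = login("admin", "********")
        print(cluster_capacity(token))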
QF2 is unique in how it approaches the problems of scalability. Its design implements
principles similar to those used by modern, large-scale, distributed databases. The result is
a file system with unmatched scale characteristics.
QumuloDB is built-in and fully integrated with the file system itself. In contrast, metadata
queries in legacy storage appliances are answered outside of the core file system by an
unrelated software component.
Real-time quotas
Just as real-time aggregation of metadata enables QF2’s real-time analytics, it also enables
real-time capacity quotas. Quotas allow administrators to specify how much capacity a given
directory is allowed to use for files.
Unlike legacy systems, in QF2 quotas are deployed immediately and do not have to be
provisioned. They are enforced in real time, and changes to their capacities are immediately
implemented. Quotas can be specified at any level of the directory tree.
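The toy sketch below illustrates the general idea of keeping per-directory aggregates current as writes happen, so a quota check or capacity query reads a single number instead of walking the tree; it is a conceptual illustration under those assumptions, not QF2's actual implementation.

    # Toy sketch of real-time capacity aggregation: each directory carries a
    # running total that is updated as writes happen, so quota checks and
    # capacity queries avoid tree walks. Conceptual only, not QF2 internals.

    class Directory:
        def __init__(self, name, parent=None, quota_bytes=None):
            self.name = name
            self.parent = parent
            self.quota_bytes = quota_bytes   # optional quota at any tree level
            self.used_bytes = 0              # aggregate, kept current on every write

        def charge(self, delta_bytes):
            """Propagate a size change up the tree and enforce quotas in real time."""
            node = self
            while node is not None:          # first pass: check every ancestor's quota
                if node.quota_bytes is not None and node.used_bytes + delta_bytes > node.quota_bytes:
                    raise OSError(f"quota exceeded on /{node.name}")
                node = node.parent
            node = self
            while node is not None:          # second pass: apply the change
                node.used_bytes += delta_bytes
                node = node.parent

    root = Directory("")
    projects = Directory("projects", parent=root, quota_bytes=10**12)  # 1 TB quota
    projects.charge(250 * 10**9)     # a 250 GB write is charged instantly
    print(root.used_bytes)           # capacity query reads one aggregate, no tree walk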
Snapshots
Snapshots let system administrators capture the state of a file system or directory at a given
point in time. If a file or directory is modified or deleted unintentionally, users or adminis-
trators can revert it to its saved state.
Snapshots in QF2 have an extremely efficient and scalable implementation. A single QF2
cluster can have a virtually unlimited number of concurrent snapshots without perfor-
mance or capacity degradation.
• Efficient transactions that allow QF2 clusters to scale to many hundreds of nodes
• Built-in tiering of hot/cold data that gives flash performance at archive prices
The virtualized protected block functionality of the Scalable Block Store (SBS) is a huge
advantage for the QF2 file system. In legacy storage systems that do not have SBS, protection
occurs on a file-by-file
basis or using fixed RAID groups, which introduce many difficult problems such as long
rebuild times, inefficient storage of small files and costly management of disk layouts.
Object storage also leaves unhandled the problem of organizing data. Instead, users are
encouraged to index the data themselves in some sort of external database. This may suffice
for the storage needs of standalone applications, but it complicates collaboration between
applications and between humans and those applications. Modern workflows almost always
involve applications that were developed independently but work together by exchanging
file-based data, an interop scenario that is simply not possible with object storage. Further,
object stores don’t offer the benefits of a file system for governance.

A surprising amount of valuable business logic is encoded in the directory structure of
enterprise file systems.
QF2 opens new possibilities for its customers, who are makers in every sense of the word.
With QF2, meeting the release date of a major animated motion picture gets easier. With
QF2, it’s feasible to achieve medical breakthroughs from multi-petabyte experimental
datasets. With QF2, identifying security threats in a billion-file network log is a daily reality.
Qumulo is a different kind of storage company. As the creators of the world’s most advanced
file storage system, our own team of innovators puts what we believe into practice every day.
File storage at universal scale is our vision and our passion.