CASE STUDY | PageFreezer

World's cleverest website archiving tool
was supported by technologies and brainpower from Redwerk

Redwerk is a custom software development agency founded in 2005 that helps
conceptualize, test, and launch software development projects for our customers
in the USA, Canada, and EU.

Challenge

PageFreezer is an online SaaS archiving tool meant to help people save website
pages, blogs, and even social media channels like Twitter, Facebook or LinkedIn.
Users can set up an automated crawling process and customize its schedule
according to their current needs. The net result is an interactive copy of a
desired web page.

The service is not just a convenient way to preserve today's web resources for
future generations, but also an easy-to-use tool for judicial protection,
regulatory compliance or marketing needs. Being an enterprise-class solution,
PageFreezer is able to crawl even the most complex pages and is easy to
implement for both individuals and companies of all stripes.

Beyond simple page archiving, you can also run archived pages live whenever
needed, as if they had never gone down.

PageFreezer is the name of a technology start-up and also a web service which
archives websites in a convenient and easy-to-use way, according to flexible
schedules defined by the user. Any website, blog, or even Facebook and Twitter
profile can be preserved for future generations in an interactive way, going
much further than common screenshots.

Redwerk's part in PageFreezer development consisted of supporting the
underlying technology and building a SaaS application that would let users save
their website or social profile content continuously and in evidentiary
quality. One of the key features was the ability to use the captured data as if
it were still live. The tool also had to crawl web pages of different types and
complexity using the same integrated platform. Crawling processes were to run
automatically, on time intervals defined by the user. Saved data had to be
searchable as well.

Hence, PageFreezer's main features were:

- Automatic archiving
- Public records compliance
- Live replay/browsing of archives
- Search for contents
- Digital signatures
- Data export
- Data access through API

Technologies

Java, Spring, Solr, PHP, WordPress, Python, HTML5, CSS3, JavaScript, jQuery,
Bootstrap, AngularJS, Linux, Solaris, Tomcat, Apache, Nginx, MySQL

"I've been working with Redwerk almost continuously since 2006 on various
complex software development projects (C++, Java, JSP, Spring, Django, iPhone).
This company provides excellent software application development services for a
great price. They are very flexible, customer-focused, responsive and
communicative. I would warmly recommend that other companies hire them for
their software development projects."

Solution

Website Crawling

For PageFreezer, we created a proprietary, highly advanced web crawler that
takes into account the minor peculiarities of known web server and web browser
software. It's a Java library that integrates well with any project and
provides interfaces to override various behaviors.

In order to monitor the crawling processes as conveniently as possible, we
created an informative admin interface. We made it possible to crawl and
capture images as well as text, and even Flash animations, even when they were
hosted on different domains. An extra URL list was created for this purpose.

Include, exclude and advanced website settings were introduced, making it
even more convenient for users who wish to crawl only certain URLs depending on
keywords. Flexible user agent selection for crawling was also added. The
mechanism was designed to crawl web pages at moments when they are not
under high load. Clients can also use the crawling speed option to set the
number of crawl workers for each individual task and so reduce the load on the
website.
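
As a rough illustration of keyword-based include and exclude settings, here is a minimal Python sketch. It is not the PageFreezer crawler (which is a Java library). The class name `CrawlRules`, its fields, and the rule that excludes take priority over includes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlRules:
    """Hypothetical per-site crawl settings; all names are illustrative."""
    include_keywords: list = field(default_factory=list)
    exclude_keywords: list = field(default_factory=list)
    workers: int = 2  # fewer crawl workers means less load on the target site

    def allows(self, url: str) -> bool:
        """Return True if the URL passes the exclude and include filters."""
        lowered = url.lower()
        # Assumption: an exclude keyword always wins over an include keyword.
        if any(k.lower() in lowered for k in self.exclude_keywords):
            return False
        # With no include keywords configured, everything not excluded passes.
        if not self.include_keywords:
            return True
        return any(k.lower() in lowered for k in self.include_keywords)
```

A task configured with `CrawlRules(include_keywords=["/blog/"], exclude_keywords=["logout"], workers=1)` would then crawl only blog URLs, skip logout links, and use a single worker.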

Redwerk also implemented a standard sitemap XML crawling feature to
reduce the time it takes to crawl large websites, because only modified pages
and their contents are crawled and archived.
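
One way such a feature can decide which pages were modified is to compare each sitemap entry's `<lastmod>` date with the time of the previous crawl. The Python sketch below assumes that approach. The source does not say how PageFreezer detects modified pages, and the function name `modified_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def modified_urls(sitemap_xml: str, last_crawl: datetime) -> list:
    """Return URLs whose <lastmod> is newer than last_crawl, or is missing.

    last_crawl must be timezone-aware.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        if not lastmod:
            changed.append(loc)  # no date given, so recrawl to be safe
            continue
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)  # assumption: UTC
        if modified > last_crawl:
            changed.append(loc)
    return changed
```

Pages without a `<lastmod>` value are recrawled here rather than skipped, which trades some crawl time for completeness.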

A number of technologically advanced crawling options were also made
available:
- parsing links out of XML files using XSLT templates
- generic authentication mechanism allowing crawlers to authorize on almost
any website

All of these features make PageFreezer a much more technologically advanced
solution compared to the competition.

Website Playback

One of the main goals and most impressive usage scenarios was that users had
to be able to browse copies of websites as if they were live now.

This was perhaps the key challenge, and involved a lot of complex thinking and
innovative approaches. It is based on hyperlink resolution and on-the-fly
substitution, JavaScript and redirect interception and much more.

"It is legal-level archiving: the sort of archiving that has to be done for
compliance by public companies and governments, so that they can prove that
their website said and did exactly what they're saying it did at any one
particular time."

To let users reach a desired point in time, we created a convenient calendar
that highlights the dates on which snapshots were taken. To show the site
structure, we created a simple navigation tree which reflects the URL
hierarchy. All the tree nodes are clickable and open the corresponding site
page.
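
A navigation tree that mirrors the URL hierarchy can be derived directly from the list of archived URLs. The Python sketch below shows one possible way to do it. The function names `build_url_tree` and `render` are hypothetical and do not come from the PageFreezer code base.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Nest archived URLs into a dict of dicts keyed by host, then path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
        node = tree
        for segment in segments:
            # Create the child node on first sight, then descend into it.
            node = node.setdefault(segment, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented lines, with children sorted alphabetically."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines
```

In a real interface each node would also carry the full URL so that clicking it opens the archived page. This sketch keeps only the names.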

Social Media

Crawling social media profiles was a much harder challenge, as different rules
apply to them compared to conventional websites. PageFreezer's link
extraction was initially built on regular expressions and content parsers, but
Twitter, Facebook and most other social networks build their pages dynamically
with JavaScript. As they were all different, it was very exhausting to build
the framework and extend it to additional social networks. The whole solution
was unreliable at this stage, and every future change made by these social
networks would have had to be mirrored in the system, too. In the end, it was
decided to develop a social network adapter based on third-party social
network client libraries in Java. Spring Social was identified as meeting our
requirements.

Data Storage

One of the most difficult tasks in this project was to select the best storage
option, which had to be very scalable. The project started with approximately
500 sites, but had to be prepared for many more. We toyed with the idea of
using S3 or Google for some time, but those proved to be too slow to access
and too expensive. So Redwerk had to come up with a more flexible,
custom-tailored idea, and after some benchmarking we built a simple yet
scalable custom storage cloud from scratch, based on a database and an NFS
file system.

Data Integrity

It was essential, as always, to ensure that no information was lost if any
part of the system failed. We implemented logic which makes crawlers stop and
wait if the database or the file system is unavailable. When these components
come back, no information gathered by the crawlers is lost, and the use of
checksums helps maintain the integrity of all stored data.
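
The wait-and-resume behavior and the checksum check can be sketched together in Python as follows. This is an illustration under stated assumptions, not the actual implementation: `StorageUnavailable`, `store_with_retry` and `verify` are hypothetical names, and SHA-256 is assumed as the checksum algorithm because the source does not name one.

```python
import hashlib
import time


class StorageUnavailable(Exception):
    """Raised by a (hypothetical) storage backend that is temporarily down."""


def store_with_retry(write, payload, wait_seconds=5.0, sleep=time.sleep):
    """Block until `write` accepts the payload, then return its checksum.

    `write` is any callable taking (payload, checksum); it should raise
    StorageUnavailable while the database or file system is down.
    """
    checksum = hashlib.sha256(payload).hexdigest()
    while True:
        try:
            write(payload, checksum)
            return checksum
        except StorageUnavailable:
            # Keep the crawled payload in memory and wait for storage to return.
            sleep(wait_seconds)


def verify(payload, checksum):
    """Return True if the payload still matches the checksum stored with it."""
    return hashlib.sha256(payload).hexdigest() == checksum
```

The `sleep` parameter exists only so that the waiting can be replaced in tests. A production crawler would more likely bound the wait or raise an alert after repeated failures.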

Awarded: Red Herring Top 100 Global Finalist

Digital Signatures

Digital signatures are a set of algorithms and other methods for validating
digital documents or messages. They are used in almost all sectors of the
economy to detect forgery or tampering, which makes them a fundamental
security tool.

The PageFreezer service is no exception. Here, Redwerk opted for TSA, which
PageFreezer uses to digitally sign all crawled content. Hash data of crawled
content, verified certificates, user keys and timestamps are all used when
signing through TSA. A valid TSA signature therefore gives PageFreezer clients
grounds to believe that the original web page was crawled at a particular
moment in time. PageFreezer data can even be used as evidence in court thanks
to this implementation.

Once the system is enabled, all snapshots available to the user will be signed
through TSA, and the signature can be verified on the browsing page at any
time.

Security

To protect data from destructive forces and the unwanted actions of
unauthorized users, we use a rock-solid combination of firewalls, fail2ban,
back-ups and slave database servers. Generally speaking, the system was
created to be as modular and scalable as possible. The components do not
affect each other's performance. Crawlers are separate processes, and
different modules were designed for logged-in users and guests.

Results
All in all, PageFreezer is one of the projects we are renowned for. Over the
last couple of years, the Redwerk team successfully prototyped and built the
product, along with a couple of redesigns to keep it up to date.

We strove for perfection and kept adding new functionality to satisfy users'
needs. Our team was responsible for full system maintenance, including
administrative tasks such as upgrades and backups of the database and the
archived content. Today, PageFreezer is one of the top online content
archiving solutions, and we are proud to say that Redwerk's technology and
know-how contributed greatly to its success!

