Hanzo:web — social web archiving

Only you can save the Web!

mark.middleton@hanzoman.com
peter.ferne@hanzoman.com

Hanzo

A memory-less medium

stone tablet

  • Varying methods of preservation have preceded what is now a refined and specialised field of knowledge and application. Today the Internet, particularly the Web, is a central reference base for knowledge and culture, business, scientific and technical publications, as well as a forum for personal commentary and interaction on all subjects and aspects of life.
  • Unlike other forms of publishing and expression, the digital world, especially the web, requires constant energy in the form of server resources, bandwidth provision and so on, in order to maintain its very existence.
  • In this way, web content is forever dependent on publishers maintaining this energy: so-called "permanent publishing".
  • In this fragile ecology it is very easy to delete; content updates are often made at the expense of pre-existing content (content at the same URI).

404

404 Not Found

  • The average half-life of web pages is less than 2 years.
  • This implies that for around a billion web users in the world, 30% of favoured bookmarks and saved links will fail each year. This takes place despite 90% of users actively trying to keep these pages safe.
  • The problem affects longer-lived pages too: as their environment (outlinks) ceases to exist, their own interest and value degrade ("decay").
  • A new strain, "blog-rot": commentary that survives without the original it comments on.
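
As a rough check on the figures above, here is how a half-life translates into a yearly loss rate. This assumes simple exponential decay, which is an assumption made for illustration; the slide does not state a decay model.

```latex
\[
S(t) = 2^{-t/T_{1/2}}, \qquad
1 - S(1) = 1 - 2^{-1/2} \approx 0.29
\quad \text{for } T_{1/2} = 2 \text{ years}
\]
```

Under that assumption, a two-year half-life means roughly 30% of links fail each year, and a shorter half-life gives a higher rate.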

In search of stability

Stability

How can we stabilise the "permanent publishing" of content?

the "who":
decoupling publishing artefacts from the publisher — users or other third-party organisations take some responsibility for the content in the long term
the "how":
capture the content, store it in a time-structured archive, make it accessible on the web
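
To make "time-structured archive" concrete, here is a minimal in-memory sketch. The class and method names (`TimeStructuredArchive`, `capture`, `lookup`) are hypothetical and chosen for illustration; they are not Hanzo's implementation. The point is only that a new capture of a URI is added alongside earlier ones rather than replacing them.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


class TimeStructuredArchive:
    """Toy archive: every capture of a URI is kept, ordered by time."""

    def __init__(self):
        # uri -> parallel, time-sorted lists of timestamps and captured bodies
        self._times = defaultdict(list)
        self._bodies = defaultdict(list)

    def capture(self, uri, when, body):
        """Store a new capture without overwriting earlier ones."""
        pos = bisect_right(self._times[uri], when)
        self._times[uri].insert(pos, when)
        self._bodies[uri].insert(pos, body)

    def lookup(self, uri, when):
        """Return the latest capture taken at or before `when`, or None."""
        times = self._times.get(uri, [])
        pos = bisect_right(times, when)
        return self._bodies[uri][pos - 1] if pos else None


archive = TimeStructuredArchive()
archive.capture("http://example.com/", datetime(2005, 1, 1), "old page")
archive.capture("http://example.com/", datetime(2006, 1, 1), "new page")

mid_2005 = archive.lookup("http://example.com/", datetime(2005, 6, 1))
before_any = archive.lookup("http://example.com/", datetime(2004, 1, 1))
```

Here `mid_2005` is the 2005 capture even though the same URI was captured again later, and `before_any` is `None` because nothing had been archived yet.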

So far, so...?

The Internet Archive

  • Internet Archive: since 1996, IA has archived a large-scale snapshot of the surface web every couple of months, and this is accessible online (great!)
  • But, the question remains, who decides what should be preserved for the future? IA (and other traditional libraries) do not accept the users' or participants' input on the selection of content. Nor do they accept input on the timing of the archiving.
  • Not my archive! The IA's archive is not designed for users' individual organisation and sharing of content, comments, tagging, etc.

So far, so...?

HTTrack

  • Web copiers: make a local copy of content, but...
  • ...they modify the content (naming, linking) to make the saved copy navigable.
  • Local :-o No sharing

So far, so...?

Furl Spurl Yahoo MyWeb

  • Page-level online archiving services (Furl, Spurl, Yahoo MyWeb).
  • Offer only a restricted scope (pages in isolation) and produce fragmented archives (the saved pages are not connected to one another).
  • Do not scale up to a global archive: the result remains a collection of fragmented and isolated sets of pages.
  • Corporate or institutional dependence.

The Future

A user-driven global archive!

  • Users decide what gets archived, and when!
  • Expansive scope: pages, clusters of pages (context), sites, threads, feeds (RSS), etc.
  • Global cross-linking of the archived content: the result is one global archive.

Is everybody ready?

blogging tagging

  • The success of blogging and social bookmarking tools has shown that users do care about content: they spend time and energy on tagging, commenting and primitive forms of archiving, such as file saving and printing!
  • Problem: so far, there is no tool for doing this job in a way consistent with the "increasing participation" net culture.

What do we need?

serious plumbing

An infrastructure to...

  • Commodify the capture of web content: hide the complexity of it!
  • Commodify the storage of and access to archived web content (index/search, web serving, content rendering, archive interaction, link redirection, etc.)
  • APIs for front-end(s) and applications to exploit this infrastructure.
  • Preservation platform for the long-term availability of the content.

Introducing Hanzo

Hanzo:web architecture

sharp corners do things, rounded corners hold data

  • collectors — 1 or more collectors per crawl server
  • archive — spread across many machines
  • ingester, index, metadata, hanzo:web, hanzo:services, archive manager — currently 1 machine each
  • the archive manager tells the collectors what to do, and...
  • ...they pass the results to the ingester
  • Back-end services: archiving tools & infrastructure
  • User-friendly web UI
  • Open API
  • Client-side tools coming soon
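
The control flow in the bullets above (the archive manager tells the collectors what to do, and the collectors pass their results to the ingester) can be sketched with two queues. This is a single-process illustration only: the function names, the `fake_fetch` stand-in and the dict used as an archive are all hypothetical, and the real components run on separate machines.

```python
from queue import Queue


def archive_manager(urls, jobs):
    """Tell the collectors what to do by queueing one job per URL."""
    for url in urls:
        jobs.put(url)


def collector(jobs, results, fetch):
    """Drain the job queue, fetch each URL, and hand the result on."""
    while not jobs.empty():
        url = jobs.get()
        results.put((url, fetch(url)))


def ingester(results, archive):
    """Move collected results into the archive (a plain dict here)."""
    while not results.empty():
        url, body = results.get()
        archive.setdefault(url, []).append(body)


def fake_fetch(url):
    # Stand-in for a real crawl; no network access in this sketch.
    return f"<html>content of {url}</html>"


jobs, results, archive = Queue(), Queue(), {}
archive_manager(["http://example.com/a", "http://example.com/b"], jobs)
collector(jobs, results, fake_fetch)
ingester(results, archive)
```

Keeping a queue between each stage mirrors the diagram's split between things that act and things that hold data, and it is what would let several collectors feed one ingester.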

Demo

Photo by Dave Morris

/api/collect

$ curl -d "\
> key=_your_hanzoweb_developer_key_&\
> url=http://hobix.com/textile/&\
> name=Textile%20Reference&\
> tags=textile%20syntax%20cheatsheet&\
> description=Why's%20guide%20to%20Textile&\
> scope=context&\
> visibility=public\
> " \
> http://hanzoweb.com/api/collect
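
The same request could be assembled from a script. The sketch below uses only the Python standard library and the field names shown in the curl command; the helper name `build_collect_request` and the default values for `scope` and `visibility` are assumptions taken from this one example, not documented API behaviour. It builds the request without sending it, so it runs offline.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

API_URL = "http://hanzoweb.com/api/collect"  # endpoint as shown on the slide


def build_collect_request(key, url, name, tags, description,
                          scope="context", visibility="public"):
    """Build (but do not send) the POST request for /api/collect."""
    fields = {
        "key": key,
        "url": url,
        "name": name,
        "tags": " ".join(tags),  # the slide passes tags space-separated
        "description": description,
        "scope": scope,
        "visibility": visibility,
    }
    # quote_via=quote encodes spaces as %20, matching the curl example.
    body = urlencode(fields, quote_via=quote).encode("ascii")
    return Request(API_URL, data=body)


req = build_collect_request(
    key="_your_hanzoweb_developer_key_",
    url="http://hobix.com/textile/",
    name="Textile Reference",
    tags=["textile", "syntax", "cheatsheet"],
    description="Why's guide to Textile",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

Unlike the hand-written curl data, `urlencode` also percent-encodes the `url` value itself, which is the safer form if the target URL ever contains `&` or `=`.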

/api/collect

<?xml version="1.0" encoding="UTF-8"?>
<response status="ok">
    <item id="{item-id}">
        <instance id="{instance-id}" />
        <status level="0">Queued</status>
    </item>
</response>
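
A client would typically pull the item id, instance id and queue status out of this response. The following is a minimal parsing sketch using Python's standard `xml.etree.ElementTree`. The function name is hypothetical, the `{item-id}` and `{instance-id}` placeholders are kept exactly as on the slide, and treating any status other than "ok" as a failure is an assumption, since the slide shows only the success case.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<response status="ok">
    <item id="{item-id}">
        <instance id="{instance-id}" />
        <status level="0">Queued</status>
    </item>
</response>"""


def parse_collect_response(xml_bytes):
    """Extract the ids and queue status from a /api/collect response."""
    root = ET.fromstring(xml_bytes)
    if root.get("status") != "ok":
        # Assumed error convention; the failure format is not shown.
        raise ValueError(f"collect failed: status={root.get('status')!r}")
    item = root.find("item")
    status = item.find("status")
    return {
        "item_id": item.get("id"),
        "instance_id": item.find("instance").get("id"),
        "status_level": int(status.get("level")),
        "status_text": status.text,
    }


result = parse_collect_response(SAMPLE)
```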
