About Me

What is Web Archiving

  • Web = HTTP traffic
  • Archiving = Preserving at high fidelity, saving it all

  • Different from Scraping, Extraction
  • 'Lossless' preservation, HTTP headers + full HTTP content

Why Web Archive

Source: Ten years of the UK web archive: what have we saved?

Web content changes and disappears over time

Who is Web Archiving?

Wayback Machine

Crawling since 1996, public access since 2001

Created by Alexa Internet and the Internet Archive

What are the components?

  • Crawler (mostly Heritrix) preserving HTTP traffic to WARC files
  • An index of urls and their locations in WARC files.
  • A web app performing url rewriting and retrieval of content from WARC files.

Other 'Wayback Machines'

Many other, lesser-known public web archives:

International Internet Preservation Consortium group

CommonCrawl

Crawling HTML content primarily for analysis

Publicly available, since 2008

What does it provide?

  • Crawling via Apache Nutch crawler into WARC files
  • A url index of all urls and their locations in WARC files
  • Link and Metadata files provided (WAT)
  • Extracted Text Files provided (WET).

Webrecorder

On-demand high fidelity web archiving

What does it do?

  • Records all traffic in real time (to WARC) as the user interacts with a web page.
  • Immediately 'replays' what has been recorded.
  • Allows user to create public or private collections.
  • New project, a lot more coming soon!

These projects all share...

... a common format

WARC (Web ARChive)

The WARC (Web ARChive) Format

  • Standardized, almost ubiquitous across web archiving initiatives.
  • Created in collaboration between the Internet Archive and many national libraries
    • Improvement on previous ARC format
  • Designed to fully store HTTP request and response traffic, and to support deduplication, metadata, and other arbitrary resources
  • WARC 1.0 an ISO standard (ISO 28500) since 2009
  • WARC 1.1 revision in progress: https://github.com/iipc/warc-specifications

WARC Format: Details

  • A WARC file contains one or more concatenated records
  • Each record can be (often is) gzip compressed
  • .warc.gz extension if records are gzip compressed
  • .warc extension if not gzip compressed
  • The entire file is NOT gzip compressed as a single stream
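
Since each record is its own gzip member and members are simply concatenated, the standard library can stream through a whole .warc.gz in one pass. A minimal sketch (not from the talk), assuming a local file named example.warc.gz:

    import gzip

    # gzip transparently reads across concatenated members, so this walks
    # every record in the file and prints the archived target urls.
    with gzip.open('example.warc.gz', 'rb') as stream:
        for line in stream:
            if line.startswith(b'WARC-Target-URI:'):
                print(line.decode('utf-8', 'replace').strip())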

WARC Format: Details

  • Each record contains MIME-style WARC headers, followed by HTTP headers, followed by HTTP payload
  • HTTP response record, WARC-Type: response

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2013-12-04T16:47:32Z
    WARC-Record-ID:
    WARC-Payload-Digest: sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    WARC-Target-URI: http://example.com/
    Content-Length: 200
    Content-Type: application/http; msgtype=response
    ...

    HTTP/1.0 200 OK
    Server: nginx
    Content-Type: text/html
    Content-Length: 100
    ...

    <html>
    ...

WARC Format: Details

  • HTTP Request record, WARC-Type: request

    WARC/1.0
    WARC-Type: request
    WARC-Record-ID:
    WARC-Date: 2014-01-03T03:03:41Z
    Content-Length: 320
    Content-Type: application/http; msgtype=request
    ...

    GET / HTTP/1.0
    ...

  • Supports any other HTTP verb, includes payload if necessary.
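
For working with these records programmatically, one option (not covered in this talk) is the warcio Python library from the Webrecorder project; a minimal sketch, assuming a local example.warc.gz:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over all WARC records, printing basic info for the HTTP traffic.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type in ('request', 'response'):
                uri = record.rec_headers.get_header('WARC-Target-URI')
                date = record.rec_headers.get_header('WARC-Date')
                # record.http_headers holds the HTTP status line and headers;
                # record.content_stream() yields the HTTP payload itself.
                print(record.rec_type, date, uri)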

WARC Format: Deduplication

  • revisit record indicates no new content, a duplicate of another response record
  • Duplicate by exact digest of HTTP payload (not headers)
  • HTTP revisit record, WARC-Type: revisit

    WARC/1.0
    WARC-Type: revisit
    WARC-Target-URI: http://www.duplicate.example.com/
    WARC-Record-ID:
    WARC-Date: 2013-12-05T16:47:32Z
    Content-Length: 0
    Content-Type: application/http; msgtype=response
    WARC-Payload-Digest: sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    WARC-Refers-To-Target-URI: http://example.com/
    WARC-Refers-To-Date: 2013-07-02T19:54:02Z

  • Need to find the response record stored elsewhere, by url and date, with the same digest (where?) -- a digest-keyed lookup is sketched below
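
The payload digest is, by convention, a base32-encoded SHA-1 of the HTTP payload, so the write-side deduplication decision can be sketched (illustrative only) as a lookup keyed on that digest:

    import base64
    import hashlib

    def payload_digest(payload):
        """WARC-style payload digest: 'sha1:' + base32(SHA-1(payload))."""
        return 'sha1:' + base64.b32encode(hashlib.sha1(payload).digest()).decode('ascii')

    # digest -> (original target uri, original WARC-Date): the record a later
    # 'revisit' record would refer back to.
    seen = {}

    def record_type_for(uri, date, payload):
        digest = payload_digest(payload)
        if digest in seen:
            orig_uri, orig_date = seen[digest]
            # write a revisit record with WARC-Refers-To-Target-URI / -Date
            return ('revisit', orig_uri, orig_date)
        seen[digest] = (uri, date)
        return ('response', uri, date)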

WARC Format: Other Records

  • metadata -- any metadata about an existing record
  • resource -- any other resource, non-HTTP data
  • conversion -- transformation of existing record
  • WARCs can conform to an additional schema and contain not just raw HTTP traffic
  • WAT and WET Formats

Limitations of the WARC Format

  • No url or record index in the spec!
  • Not easily splittable (chunked gzip)
  • Need to build an external url index or read the file linearly

CDX (Capture Index)

  • Plain-text index, developed at IA many years ago
  • Space-delimited, alpha sorted text index
  • De-facto standard, but has changed over the years
  • Supported by web archive replay tools.
  • CDX Format Notes

CDX Format (Example)

  • Sample CDX Line (11 fields)

    com,example)/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz

  • IA has billions of entries for the Wayback Machine
  • Binary search to look up a url in Wayback (sketched below)
  • Improvement: Use Compression
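
Before turning to compression, here is a rough sketch of that lookup: binary search by byte offset over the sorted plain-text file, realigning to a full line after each seek (illustrative only):

    import os

    def cdx_lookup(path, key):
        """Return the first line of a sorted CDX file whose key is >= `key`."""
        with open(path, 'rb') as f:
            lo, hi = 0, os.path.getsize(path)
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()              # skip the partial line we landed in
                line = f.readline()
                if not line or line.split(b' ', 1)[0] >= key:
                    hi = mid
                else:
                    lo = mid + 1
            f.seek(lo)
            if lo:
                f.readline()              # realign to the start of a full line
            return f.readline().decode()

    # e.g. cdx_lookup('index.cdx', b'com,example)/?example=1')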

Improvement: CDX Compression (ZipNum)

  • gzip compress CDX lines in blocks of 3000
  • Plain text, secondary index of compressed blocks
  • Bin-search secondary index, read compressed blocks
  • Compressed index can be split into shards
  • Format sometimes called ZipNum ('zipped N number of lines')
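
A rough sketch of the lookup path, assuming summary-index lines of the form '<surt key> <timestamp>\t<part>\t<offset>\t<length>\t<seq>' (the layout used by pywb-style ZipNum clusters; summary_lines and open_part below are illustrative stand-ins, not a real API):

    import zlib

    def lookup_zipnum(summary_lines, key, open_part):
        # Find the last block whose starting key is <= the search key.
        candidate = summary_lines[0]
        for line in summary_lines:
            if line.split(b'\t', 1)[0] <= key:
                candidate = line
            else:
                break
        _, part, offset, length, _ = candidate.split(b'\t')
        with open_part(part.decode()) as f:   # could also be an HTTP Range request
            f.seek(int(offset))
            block = zlib.decompress(f.read(int(length)), zlib.MAX_WBITS | 16)
        # Scan the ~3000 decompressed CDX(J) lines; a real implementation may
        # need to read a small range of adjacent blocks as well.
        return [l for l in block.split(b'\n') if l.startswith(key)]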

Improvement: Use JSON

  • CDX + JSON = CDXJ
  • Instead of space-delimited, use sorted key plus JSON line

    com,example)/?example=1 20140103030321 {"url": "http://example.com?example=1", "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A", "length": "1043", "offset": "333", "filename": "example.warc.gz"}

  • Still alpha sortable, more flexible, optional fields
  • Required fields are WARC filename, offset and length of WARC record
  • More on CDXJ
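
Parsing a CDXJ line is then just a split plus json.loads; a minimal sketch:

    import json

    def parse_cdxj(line):
        """Split a CDXJ line into (searchable key, timestamp, JSON fields)."""
        key, timestamp, json_part = line.split(' ', 2)
        return key, timestamp, json.loads(json_part)

    key, ts, fields = parse_cdxj(
        'com,example)/?example=1 20140103030321 '
        '{"url": "http://example.com?example=1", "length": "1043", '
        '"offset": "333", "filename": "example.warc.gz"}')
    print(key, ts, fields['filename'], fields['offset'], fields['length'])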

CDX API for Url Queries
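
One public example is the Wayback Machine's CDX server; a minimal query sketch (the endpoint and parameters are outside these slides, so treat them as illustrative):

    import json
    from urllib.request import urlopen

    # Ask the Wayback Machine's CDX server for a few captures of example.com.
    url = ('https://web.archive.org/cdx/search/cdx'
           '?url=example.com&output=json&limit=5')
    with urlopen(url) as resp:
        rows = json.load(resp)

    if rows:
        header, *captures = rows      # first row holds the field names
        for capture in captures:
            print(dict(zip(header, capture)))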

Creating CDX from WARCs

Use the pywb cdx-indexer tool

  1. pip install pywb
  2. cdx-indexer -s -j warcfile.warc.gz > index.cdxj

pywb also offers other options, such as automated indexing

Creating CDX from WARCs at Scale

Use webarchive-indexing tools

  • Runs on Hadoop, EMR, or locally (via mrjob)
  • Indexes WARCs, samples the index, and creates a compressed ZipNum cluster of CDXJ
  • Used to create the CommonCrawl Index

An Alternative: Warcbase

  • Built on top of Hadoop + HBase
  • Store WARC records into HBase
  • Indexing and access provided by HBase; no need for CDX
  • New project, but gaining momentum for analysis
  • Downside: Requires HBase, Hadoop stack

Taking a step back: How are WARCs Created

Traditional approach: Crawling (very simplified)

  • Crawler starts with a list of seed urls
  • Crawler visits urls, creates WARC files from HTTP request/responses
  • Crawler discovers new urls, repeats following specific rules.
  • Crawl stops when some condition is met.
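
A deliberately minimal sketch of that loop (illustrative only; real crawlers such as Heritrix add politeness, scoping, robots handling, and WARC writing):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href attributes from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.links.extend(v for k, v in attrs if k == 'href' and v)

    def crawl(seeds, limit=50):
        seen, frontier = set(seeds), deque(seeds)
        while frontier and len(seen) <= limit:       # stop condition
            url = frontier.popleft()
            try:
                with urlopen(url, timeout=10) as resp:
                    body = resp.read()               # a real crawler would write the
            except OSError:                          # request/response to WARC here
                continue
            parser = LinkParser()
            parser.feed(body.decode('utf-8', 'replace'))
            for link in parser.links:                # discover new urls
                absolute = urljoin(url, link)
                if absolute.startswith('http') and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)

    crawl(['http://example.com/'])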

Crawling: How Effective

This worked very well with the early web.

It still works, but maybe not as well...

Still the best way to amass a large quantity of web data

But...

Problems with traditional crawling:

  • JAVASCRIPT
  • Contextualized pages: your "twitter.com" is different from my "twitter.com"
  • Complex user interactions -- more JAVASCRIPT!
  • The archive captures what the crawler sees, not necessarily what the user sees.

Symmetrical Web Archiving

  • Record content the same way it is accessed.
  • If a user views the content through a browser, why not archive through a browser also?
  • Webrecorder does exactly this!
  • Writes WARCs, indexes WARCs and replays WARCs in real time
  • Small Data -- "Quality over quantity"

Url Rewriting

  • Access archived urls from a different domain by transforming urls.
  • /replay/[timestamp]/[url] -- access url archived at timestamp
  • /record/[url] -- record/archive url right now.
  • Urls rewritten from /[url] -> /[prefix]/[url] for both record (write) and replay (read).
  • What about Javascript?
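
A heavily simplified sketch of the server-side HTML rewriting step (real replay tools such as pywb handle far more cases; the Javascript question is taken up next):

    import re

    def rewrite_html(html, prefix):
        """Prefix absolute http(s) urls in href/src attributes so links stay
        inside the archive, e.g. prefix='/replay/20140103030321/'."""
        pattern = re.compile(r'(href|src)=(["\'])(https?://[^"\']+)\2', re.I)
        return pattern.sub(lambda m: '{}={}{}{}{}'.format(
            m.group(1), m.group(2), prefix, m.group(3), m.group(2)), html)

    print(rewrite_html('<a href="http://example.com/page">link</a>',
                       '/replay/20140103030321/'))
    # <a href="/replay/20140103030321/http://example.com/page">link</a>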

Client Side Rewriting -- Wombat.js

Challenges Ahead

Even setting aside JS, many challenges remain.

  • Contextual content -- when date and url are not enough
  • Web applications -- client-side state cannot be archived
  • Websockets
  • HTTP/2

Alternative to Url Rewriting

  • HTTP Proxy Mode
  • Requires manual setup by the user; can't link directly to an archived page.
  • Unless the browser is running elsewhere, pre-configured.
  • Emulated Browsers connected to web archives via HTTP proxy.
  • oldweb.today does this.

Thank you!

Questions?