About Me

What is Web Archiving

  • Web = HTTP traffic
  • Archiving = Preserving at high fidelity, saving it all

  • Different from Scraping, Extraction
  • 'Lossless' preservation, HTTP headers + full HTTP content

Why Web Archive

Source: Ten years of the UK web archive: what have we saved?

Web content changes and disappears over time

Who is Web Archiving?

Wayback Machine

Crawling since 1996, public access since 2001

Created by Alexa Internet and the Internet Archive

What are the components?

  • Crawler (mostly Heritrix) preserving HTTP traffic to WARC files
  • An index of urls and their locations in WARC files.
  • A web app performing url rewriting and retrieval of content from WARC files.

Other 'Wayback Machines'

Many other, lesser-known public web archives:

International Internet Preservation Consortium group

CommonCrawl

Crawling HTML content primarily for analysis

Publicly available, since 2008

What does it provide?

  • Crawling via Apache Nutch crawler into WARC files
  • A url index of all urls and their locations in WARC files
  • Link and Metadata files provided (WAT)
  • Extracted Text Files provided (WET).

Webrecorder

On-demand high fidelity web archiving

What does it do?

  • Records all traffic in real time (to WARC) as the user interacts with a web page.
  • Immediately 'replays' what has been recorded.
  • Allows user to create public or private collections.
  • New project, a lot more coming soon!

These projects all share...

... a common format

WARC (Web ARChive)

The WARC (Web ARChive) Format

  • Standardized, almost ubiquitous across web archiving initiatives.
  • Created in collaboration between the Internet Archive and many national libraries
    • Improvement on previous ARC format
  • Designed to fully store HTTP request and response traffic, and to support deduplication, metadata, and other arbitrary resources
  • WARC 1.0 an ISO standard (ISO 28500) since 2009
  • WARC 1.1 revision in progress: https://github.com/iipc/warc-specifications

WARC Format: Details

  • A WARC file contains one or more concatenated records
  • Each record can be (often is) gzip compressed
  • .warc.gz extension if records are gzip compressed
  • .warc extension if not gzip compressed
  • The entire file is NOT gzip compressed as a single stream
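
Since each record is its own gzip member and members are simply concatenated, the standard library can stream through a whole .warc.gz in one pass. A minimal sketch (not from the talk), assuming a local file named example.warc.gz:

    import gzip

    # gzip transparently reads across concatenated members, so this walks
    # every record in the file and prints the archived target urls.
    with gzip.open('example.warc.gz', 'rb') as stream:
        for line in stream:
            if line.startswith(b'WARC-Target-URI:'):
                print(line.decode('utf-8', 'replace').strip())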

WARC Format: Details

  • Each record contains MIME-style WARC headers, followed by HTTP headers, followed by HTTP payload
  • HTTP response record, WARC-Type: response

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2013-12-04T16:47:32Z
    WARC-Record-ID:
    WARC-Payload-Digest: sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    WARC-Target-URI: http://example.com/
    Content-Length: 200
    Content-Type: application/http; msgtype=response
    ...

    HTTP/1.0 200 OK
    Server: nginx
    Content-Type: text/html
    Content-Length: 100
    ...

    <html>
    ...

WARC Format: Details

  • HTTP Request record, WARC-Type: request

    WARC/1.0
    WARC-Type: request
    WARC-Record-ID:
    WARC-Date: 2014-01-03T03:03:41Z
    Content-Length: 320
    Content-Type: application/http; msgtype=request
    ...

    GET / HTTP/1.0
    ...

  • Supports any other HTTP verb, includes payload if necessary.
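
For working with these records programmatically, one option (not covered in this talk) is the warcio Python library from the Webrecorder project; a minimal sketch, assuming a local example.warc.gz:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over all WARC records, printing basic info for the HTTP traffic.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type in ('request', 'response'):
                uri = record.rec_headers.get_header('WARC-Target-URI')
                date = record.rec_headers.get_header('WARC-Date')
                # record.http_headers holds the HTTP status line and headers;
                # record.content_stream() yields the HTTP payload itself.
                print(record.rec_type, date, uri)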

WARC Format: Deduplication

  • revisit record indicates no new content, a duplicate of another response record
  • Duplicate by exact digest of HTTP payload (not headers)
  • HTTP revisit record, WARC-Type: revisit

    WARC/1.0
    WARC-Type: revisit
    WARC-Target-URI: http://www.duplicate.example.com/
    WARC-Record-ID:
    WARC-Date: 2013-12-05T16:47:32Z
    Content-Length: 0
    Content-Type: application/http; msgtype=response
    WARC-Payload-Digest: sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    WARC-Refers-To-Target-URI: http://example.com/
    WARC-Refers-To-Date: 2013-07-02T19:54:02Z

  • Need to find the response record stored elsewhere, by url and date, with the same digest (where?) -- a digest-keyed lookup is sketched below
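
The payload digest is, by convention, a base32-encoded SHA-1 of the HTTP payload, so the write-side deduplication decision can be sketched (illustrative only) as a lookup keyed on that digest:

    import base64
    import hashlib

    def payload_digest(payload):
        """WARC-style payload digest: 'sha1:' + base32(SHA-1(payload))."""
        return 'sha1:' + base64.b32encode(hashlib.sha1(payload).digest()).decode('ascii')

    # digest -> (original target uri, original WARC-Date): the record a later
    # 'revisit' record would refer back to.
    seen = {}

    def record_type_for(uri, date, payload):
        digest = payload_digest(payload)
        if digest in seen:
            orig_uri, orig_date = seen[digest]
            # write a revisit record with WARC-Refers-To-Target-URI / -Date
            return ('revisit', orig_uri, orig_date)
        seen[digest] = (uri, date)
        return ('response', uri, date)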

WARC Format: Other Records

  • metadata -- any metadata about an existing record
  • resource -- any other resource, non-HTTP data
  • conversion -- transformation of existing record
  • WARCs can conform to an additional schema and contain not just raw HTTP traffic
  • WAT and WET Formats

Limitations of the WARC Format

  • No url or record index in the spec!
  • Not easily splittable (chunked gzip)
  • Need to build an external url index or read the file linearly

CDX (Capture Index)

  • Plain-text index, developed at IA many years ago
  • Space-delimited, alpha sorted text index
  • De-facto standard, but has changed over the years
  • Supported by web archive replay tools.
  • CDX Format Notes

CDX Format (Example)

  • Sample CDX Line (11 fields)

    com,example)/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz

  • IA has billions of entries for the Wayback Machine
  • Binary search to look up a url in Wayback (sketched below)
  • Improvement: Use Compression
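
Before turning to compression, here is a rough sketch of that lookup: binary search by byte offset over the sorted plain-text file, realigning to a full line after each seek (illustrative only):

    import os

    def cdx_lookup(path, key):
        """Return the first line of a sorted CDX file whose key is >= `key`."""
        with open(path, 'rb') as f:
            lo, hi = 0, os.path.getsize(path)
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()              # skip the partial line we landed in
                line = f.readline()
                if not line or line.split(b' ', 1)[0] >= key:
                    hi = mid
                else:
                    lo = mid + 1
            f.seek(lo)
            if lo:
                f.readline()              # realign to the start of a full line
            return f.readline().decode()

    # e.g. cdx_lookup('index.cdx', b'com,example)/?example=1')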

Improvement: CDX Compression (ZipNum)

  • gzip compress CDX lines in blocks of 3000
  • Plain text, secondary index of compressed blocks
  • Bin-search secondary index, read compressed blocks
  • Compressed index can be split into shards
  • Format sometimes called ZipNum ('zipped N number of lines')
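
A rough sketch of the lookup path, assuming summary-index lines of the form '<surt key> <timestamp>\t<part>\t<offset>\t<length>\t<seq>' (the layout used by pywb-style ZipNum clusters; summary_lines and open_part below are illustrative stand-ins, not a real API):

    import zlib

    def lookup_zipnum(summary_lines, key, open_part):
        # Find the last block whose starting key is <= the search key.
        candidate = summary_lines[0]
        for line in summary_lines:
            if line.split(b'\t', 1)[0] <= key:
                candidate = line
            else:
                break
        _, part, offset, length, _ = candidate.split(b'\t')
        with open_part(part.decode()) as f:   # could also be an HTTP Range request
            f.seek(int(offset))
            block = zlib.decompress(f.read(int(length)), zlib.MAX_WBITS | 16)
        # Scan the ~3000 decompressed CDX(J) lines; a real implementation may
        # need to read a small range of adjacent blocks as well.
        return [l for l in block.split(b'\n') if l.startswith(key)]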

Improvement: Use JSON

  • CDX + JSON = CDXJ
  • Instead of space-delimited, use sorted key plus JSON line

    com,example)/?example=1 20140103030321 {"url": "http://example.com?example=1", "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A", "length": "1043", "offset": "333", "filename": "example.warc.gz"}

  • Still alpha sortable, more flexible, optional fields
  • Required fields are WARC filename, offset and length of WARC record
  • More on CDXJ
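
Parsing a CDXJ line is then just a split plus json.loads; a minimal sketch:

    import json

    def parse_cdxj(line):
        """Split a CDXJ line into (searchable key, timestamp, JSON fields)."""
        key, timestamp, json_part = line.split(' ', 2)
        return key, timestamp, json.loads(json_part)

    key, ts, fields = parse_cdxj(
        'com,example)/?example=1 20140103030321 '
        '{"url": "http://example.com?example=1", "length": "1043", '
        '"offset": "333", "filename": "example.warc.gz"}')
    print(key, ts, fields['filename'], fields['offset'], fields['length'])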

CDX API for Url Queries
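
One public example is the Wayback Machine's CDX server; a minimal query sketch (the endpoint and parameters are outside these slides, so treat them as illustrative):

    import json
    from urllib.request import urlopen

    # Ask the Wayback Machine's CDX server for a few captures of example.com.
    url = ('https://web.archive.org/cdx/search/cdx'
           '?url=example.com&output=json&limit=5')
    with urlopen(url) as resp:
        rows = json.load(resp)

    if rows:
        header, *captures = rows      # first row holds the field names
        for capture in captures:
            print(dict(zip(header, capture)))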

Creating CDX from WARCs

Use the pywb cdx-indexer tool

  1. pip install pywb
  2. cdx-indexer -s -j warcfile.warc.gz > index.cdxj

pywb also offers other options, such as automated indexing

Creating CDX from WARCs at Scale

Use webarchive-indexing tools

  • Runs on Hadoop, EMR, or locally (via mrjob)
  • Indexes WARCs, samples the index, and creates a compressed ZipNum cluster of CDXJ
  • Used to create the CommonCrawl Index

An Alternative: Warcbase

  • Built on top of Hadoop + HBase
  • Store WARC records into HBase
  • Indexing and access provided by HBase; no need for CDX
  • New project, but gaining momentum for analysis
  • Downside: Requires HBase, Hadoop stack

Taking a step back: How are WARCs Created

Traditional approach: Crawling (very simplified)

  • Crawler starts with a list of seed urls
  • Crawler visits urls, creates WARC files from HTTP request/responses
  • Crawler discovers new urls, repeats following specific rules.
  • Crawl stops when some condition is met.
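
A deliberately minimal sketch of that loop (illustrative only; real crawlers such as Heritrix add politeness, scoping, robots handling, and WARC writing):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href attributes from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.links.extend(v for k, v in attrs if k == 'href' and v)

    def crawl(seeds, limit=50):
        seen, frontier = set(seeds), deque(seeds)
        while frontier and len(seen) <= limit:       # stop condition
            url = frontier.popleft()
            try:
                with urlopen(url, timeout=10) as resp:
                    body = resp.read()               # a real crawler would write the
            except OSError:                          # request/response to WARC here
                continue
            parser = LinkParser()
            parser.feed(body.decode('utf-8', 'replace'))
            for link in parser.links:                # discover new urls
                absolute = urljoin(url, link)
                if absolute.startswith('http') and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)

    crawl(['http://example.com/'])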

Crawling: How Effective

This worked very well with the early web.

It still works, but maybe not as well...

Still the best way to amass a large quantity of web data

But...

Problems with traditional crawling:

  • JAVASCRIPT
  • Contextualized pages: your "twitter.com" is different from my "twitter.com"
  • Complex user interactions -- more JAVASCRIPT!
  • The archive captures what the crawler sees, not necessarily what the user sees.

Symmetrical Web Archiving

  • Record content the same way it is accessed.
  • If a user views the content through a browser, why not archive through a browser also?
  • Webrecorder does exactly this!
  • Writes WARCs, indexes WARCs and replays WARCs in real time
  • Small Data -- "Quality over quantity"

Url Rewriting

  • Access archived urls from a different domain by transforming urls.
  • /replay/[timestamp]/[url] -- access url archived at timestamp
  • /record/[url] -- record/archive url right now.
  • Urls rewritten from /[url] -> /[prefix]/[url] for both record (write) and replay (read).
  • What about Javascript?
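
A heavily simplified sketch of the server-side HTML rewriting step (real replay tools such as pywb handle far more cases; the Javascript question is taken up next):

    import re

    def rewrite_html(html, prefix):
        """Prefix absolute http(s) urls in href/src attributes so links stay
        inside the archive, e.g. prefix='/replay/20140103030321/'."""
        pattern = re.compile(r'(href|src)=(["\'])(https?://[^"\']+)\2', re.I)
        return pattern.sub(lambda m: '{}={}{}{}{}'.format(
            m.group(1), m.group(2), prefix, m.group(3), m.group(2)), html)

    print(rewrite_html('<a href="http://example.com/page">link</a>',
                       '/replay/20140103030321/'))
    # <a href="/replay/20140103030321/http://example.com/page">link</a>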

Client Side Rewriting -- Wombat.js

Challenges Ahead

Even setting aside JS, many challenges remain.

  • Contextual content -- when date and url are not enough
  • Web applications -- client-side state cannot be archived
  • Websockets
  • HTTP/2

Alternative to Url Rewriting

  • HTTP Proxy Mode
  • Requires manual setup by the user; can't link directly to an archived page.
  • Unless the browser is running elsewhere, pre-configured.
  • Emulated Browsers connected to web archives via HTTP proxy.
  • oldweb.today does this.

Thank you!

Questions?