Scientific Computing in the Open Web Platform

or

What I've Been Doing with Henry for the Past 4 Years

ht@inf.ed.ac.uk

Institute for Language, Cognition and Computation

The Open Web Platform

The Open Web Platform (OWP) is a platform for innovation, consolidation and cost efficiencies, focused on those things that happen within, or intersect with, the actions of the Web browser.

It's a contract readable by developers and authors!

Recommendations, standards, notes, ...
Algorithms, technologies, implementations, ...
APIs, vocabularies, ...
Shared expectations!

And so why doesn't this work for Science on the Web? Or does it?

Problem Statement

We want to lower the bar for publishing scientific data on the Web so that we enable the network effect while still retaining some aspect of semantics and interoperability.

~500K datasets from data.gov, May 8th, 2012

many data sets are tabular
representation formats predate the Web
incompatible with browser technologies
archives of data sets
too large to be processed

Changing the Paradigm

Let's use the Web "as is"

A methodology for publishing scientific data sets onto the Web so that they are accessible in Web-oriented formats.
HTML has table markup so why can't we use it?
Name things with URIs! Develop a usable naming strategy for data sets on the Web; identifiable and easily retrievable.

Why not just use XML?

The International Virtual Observatory Alliance (IVOA) and others have tried that.
Random XML has its limitations (poor support?) on the Web and in the browser.
That's a problem.

The Three Little Problems

Once upon a time, there were three little problems...

Too big: Data sets are typically too large to be processed by the typical Open Web Platform (OWP) implementation as one large Web resource.

Too dumb: HTML table markup lacks the constructs to convey all the information coded within typical tabular data sets.

Too forgetful: Accessing data may require formulating complex queries or URIs, which is error-prone. Users can request too much data, which results in failures or requires paging through results.

Where's the smart pig with the brick house?

Or was he dinner?

The PAN Methodology

Partition the data set along properties inherent in the data (e.g. time, geospatial coordinates, etc.) into reasonably sized subsets suitable for Web applications.
Annotate the data according to some ontology and encode it in a common syntax (HTML) using RDFa.
Name each data partition with a unique URI using a consistent naming scheme that can be traced back to your partitioning scheme from the first step.
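
The Name step can be made mechanical by deriving the URI from the partition itself. The sketch below is an assumption, not the scheme used by any particular site: the function names, the `example.org` base and the row-major quadrangle numbering are all hypothetical. It snaps a timestamp to the start of its time window and numbers fixed-size latitude/longitude quadrangles, so the same observation always maps to the same URI.

```javascript
// Hypothetical naming helpers; the numbering scheme is an assumption.

// Number fixed-size quadrangles row by row, starting at 1 in the
// north-west corner (90N, 180W). Points on the far edges are clamped
// into the last row or column.
function quadrangleNumber(lat, lon, size) {
  const columns = Math.round(360 / size);
  const rows = Math.round(180 / size);
  const row = Math.min(Math.floor((90 - lat) / size), rows - 1);
  const col = Math.min(Math.floor((lon + 180) / size), columns - 1);
  return row * columns + col + 1;
}

// Snap a Date to the start of its fixed-length window (in minutes, UTC).
function partitionStart(date, minutes) {
  const ms = minutes * 60 * 1000;
  return new Date(Math.floor(date.getTime() / ms) * ms);
}

// Build a stable URI: base / quadrangle size / quadrangle number / start.
function partitionUri(base, size, lat, lon, date, minutes) {
  const start = partitionStart(date, minutes)
    .toISOString()
    .replace(/\.\d{3}Z$/, "Z"); // drop milliseconds
  return base + "/q/" + size + "/n/" +
    quadrangleNumber(lat, lon, size) + "/" + start;
}
```

Under these assumptions, `partitionUri("http://example.org/data", 5, 37.5, -127.5, new Date("2014-02-12T06:17:42Z"), 30)` gives `http://example.org/data/q/5/n/731/2014-02-12T06:00:00Z`. Because nothing in the URI depends on a query, the name stays stable as long as the partitioning does.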

Seems obvious? Not to some...

Simple is good but questions remain:

Partitioning is fixed. No paging! No custom queries!
Yet which choices of partitions are correct for your data?
Annotation via RDFa is new! What is the right way? Whose ontology?
If partitioning is fixed, naming must be stable.

PAN Partitions

Table Annotations

<table typeof="Table">
<thead>
<tr>
...
  <th property="column" typeof="Column">
     <span property="title">Temperature</span>
     <span property="property" resource="w:airTemperature"/>
     <span property="valueSpace" typeof="ValueDescription">
        (°<span property="symbol">C</span>)
        <span property="datatype" resource="xsd:double"/>
        <span property="quantity" resource="quantity:ThermodynamicTemperature"/>
        <span property="unit" resource="unit:DegreeCelsius"/>
     </span>
  </th>
...
</tr>
</thead>
<tbody>
   <tr>
...
     <td>22.2</td>
...
   </tr>

Partition Annotations

<a typeof="Partition" rel="nearby" 
   href="http://www.mesonet.info/data/q/5/n/767/2014-02-12T06:00:00Z">
767
<span property="range" typeof="FacetPartition">
  <span property="facet" resource="/data/#latitude"/>
  <span property="facet" resource="/data/#longitude"/>
  <span property="shape" typeof="schema:GeoShape">
    <span property="schema:box" content="40 -130 35 -130 35 -125 40 -125"/>
  </span>
</span>
<span property="range" typeof="FacetPartition">
   <span property="facet" resource="/data/#receivedTime">Received</span>
   <span property="valueType" resource="xsd:dateTime"/>
   from <span property="start">2014-02-12T06:00:00Z</span> 
   to <span property="end">2014-02-12T06:30:00Z</span>
   (<span property="length">PT30M</span>)
</span>
</a>
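
A consumer will usually want the bounding box as numbers rather than as the `content` string. A minimal sketch, assuming the string is a space-separated list of latitude/longitude pairs as in the example above; the names `parseBox` and `contains` are hypothetical:

```javascript
// Hypothetical helper: turn "lat lon lat lon ..." into a bounding box.
function parseBox(content) {
  const numbers = content.trim().split(/\s+/).map(Number);
  if (numbers.length < 2 || numbers.length % 2 !== 0 ||
      numbers.some((n) => Number.isNaN(n))) {
    throw new Error("Expected an even number of numeric coordinates");
  }
  // Even positions are assumed to be latitudes, odd positions longitudes.
  const lats = numbers.filter((_, i) => i % 2 === 0);
  const lons = numbers.filter((_, i) => i % 2 === 1);
  return {
    south: Math.min(...lats),
    north: Math.max(...lats),
    west: Math.min(...lons),
    east: Math.max(...lons)
  };
}

// Test whether a point falls inside the box (edges included).
function contains(box, lat, lon) {
  return lat >= box.south && lat <= box.north &&
         lon >= box.west && lon <= box.east;
}
```

This is enough to decide which "nearby" partitions overlap a region of interest before fetching any of them.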

PAN in Practice: CWOP & mesonet.info

CWOP: Citizen Weather Observation Program
53.6 million weather reports per month

Input: APRS, Output: PAN-enabled Web Pages of tabular data.

DW3904>APRS,TCPXX*,qAX,CWOP:@090158z5132.18N/00043.53W_061/000g001t030r000p000P000h87b10389L000.DsVP
CW1604>APRS,TCPXX*,qAX,CWOP:@090158z4444.70N/06531.17W_204/004g009t027r000p000P000h80b10204.DsVP
DW6741>APRS,TCPXX*,qAX,CWOP:@090158z3749.55N/08000.08W_296/005g...t036r...p...P008h74b10188.DsVP
DW6916>APRS,TCPXX*,qAX,CWOP:@090158z4310.23N/10818.40W_238/001g002t027r000p000P000h58b10189.DsVP
DW6011>APRS,TCPXX*,qAX,CWOP:@090158z4307.07N/08756.60W_261/002g006t028r000p000P000h55b10249.DsVP
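
To show what the conversion from input to tabular output involves, here is a rough sketch of pulling a few fields out of one of the lines above. It assumes the usual APRS weather report layout (position in degrees and decimal minutes, wind direction/speed, `g` gust in mph, `t` temperature in °F, with `...` meaning "not reported") and parses only a subset of the fields. The function name and the returned property names are hypothetical.

```javascript
// Station, timestamp, position, wind, gust and temperature only.
const REPORT = new RegExp(
  "^([^>]+)>.*:@(\\d{2})(\\d{2})(\\d{2})z" +           // station, day, hour, minute (UTC)
  "(\\d{2})(\\d{2}\\.\\d{2})([NS])/" +                 // latitude: DDMM.mm
  "(\\d{3})(\\d{2}\\.\\d{2})([EW])_" +                 // longitude: DDDMM.mm
  "(\\d{3}|\\.{3})/(\\d{3}|\\.{3})" +                  // wind direction / speed
  "g(\\d{3}|\\.{3})t(-\\d{2}|\\d{3}|\\.{3})"           // gust, temperature
);

// "..." marks a value the station did not report.
function optionalNumber(text) {
  return text === "..." ? null : Number(text);
}

function parseWeatherReport(line) {
  const m = REPORT.exec(line);
  if (!m) return null;
  const lat = Number(m[5]) + Number(m[6]) / 60;
  const lon = Number(m[8]) + Number(m[9]) / 60;
  const fahrenheit = optionalNumber(m[14]);
  return {
    station: m[1],
    day: Number(m[2]), hour: Number(m[3]), minute: Number(m[4]),
    latitude: m[7] === "S" ? -lat : lat,
    longitude: m[10] === "W" ? -lon : lon,
    windDirection: optionalNumber(m[11]),
    windSpeedMph: optionalNumber(m[12]),
    gustMph: optionalNumber(m[13]),
    // Convert to Celsius, rounded to one decimal place.
    temperatureC: fahrenheit === null
      ? null
      : Math.round((fahrenheit - 32) * 5 / 9 * 10) / 10
  };
}
```

Each parsed report then becomes one table row, and its position and timestamp determine which partition the row belongs to.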

Applying a Duality to Web Resources

PAN-enabled Web Resources are:

Pages you can just view.
Data you can process.

The data is not:

Duplicated,
Stored in JSON or other alternate formats,
unstructured,
or scraped via formatting assumptions.

The data is annotated with RDFa.

Building Blocks

For geospatial data, partitioning provides a good baseline for algorithms.

Accessing Tabular Data

Green Turtle implements RDFa API - W3C Note - July 2012

API Details

Find a table of data:

// (1) Find the element that holds the partition
var datasets = document.getElementsByType("pan:Partition");

// (2) Use the subject to find the partition's item subjects
var items = document.data.getValues(datasets[0].data.id,"pan:item");

// (3) Access the first item (a table)
var table = document.getElementsBySubject(items[0])[0];

Find a column:

var columns = document.data.getValues(table.data.id,"pan:column");
var column = null;  // A variable to hold the subject URI.

for (var i=0; !column && i<columns.length; i++) {
  // Find the column labeled with the air temperature property
  if (document.data.getValues(columns[i],"pan:property")
        .indexOf("http://mesonet.info/airTemperature")>=0) {
     column = columns[i];
  }
}

// Find the index by finding the column element by subject URI.
var index = document.getElementsBySubject(column)[0].cellIndex;
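
With the column index in hand, the values themselves can be read with the standard DOM table properties (`tBodies`, `rows`, `cells`). A minimal sketch: the helper name `columnValues` is hypothetical, and it assumes each cell's text is a plain decimal number, as in the `<td>22.2</td>` cell of the table example.

```javascript
// Hypothetical helper: collect the numeric values of one column.
function columnValues(table, index) {
  const values = [];
  for (const body of Array.from(table.tBodies)) {
    for (const row of Array.from(body.rows)) {
      const cell = row.cells[index];
      if (!cell) continue;                       // short row: no such cell
      const value = parseFloat(cell.textContent);
      if (!Number.isNaN(value)) values.push(value); // skip empty or non-numeric cells
    }
  }
  return values;
}
```

In a page, `columnValues(table, index)` would use the `table` and `index` found above; cells that are empty or not numeric are skipped rather than treated as zero.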

Processing Data with Map / Reduce

We can map resources containing data to resultants.
We can reduce resultants to answers.
...and we can do it all in the browser!
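
As a rough illustration of the two steps, the sketch below treats each partition as an array of numbers that has already been extracted from its table. In a browser each partition would first be fetched by its URI; that part is omitted here, and all function names are hypothetical. The map step turns a partition into a small partial result, and the reduce step merges partial results into one answer.

```javascript
const EMPTY = { sum: 0, count: 0, min: Infinity, max: -Infinity };

// Merge two partial results into one.
function combine(a, b) {
  return {
    sum: a.sum + b.sum,
    count: a.count + b.count,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max)
  };
}

// Map: one partition's values -> one partial result.
function mapPartition(values) {
  return values.reduce(
    (acc, v) => combine(acc, { sum: v, count: 1, min: v, max: v }),
    EMPTY
  );
}

// Reduce: all partial results -> one answer.
function summarize(partitions) {
  const total = partitions.map(mapPartition).reduce(combine, EMPTY);
  return {
    count: total.count,
    min: total.min,
    max: total.max,
    mean: total.count > 0 ? total.sum / total.count : NaN
  };
}
```

Because `combine` is associative, partial results can be merged in whatever order the partitions happen to arrive.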

The smart pig uses map / reduce!

Query & paging gets you all tied up.

Barnes Interpolation

Iterative weighted averages based on observed values:

$g_{k}^{0} (x, y) = \frac{\sum_{i}^{n} w_{i} o_{i}}{\sum_{i}^{n} w_{i}}$,
$g_{k}^{1} (x, y) = g_{k}^{0} (x, y) + \frac{\sum_{i}^{n} w_{i} (o_{i} - g_{i}^{0} (x, y))}{\sum_{i}^{n} w_{i}}$, ...,
$g_{k}^{n + 1} (x, y) = g_{k}^{n} (x, y) + \frac{\sum_{i}^{n} w_{i} (o_{i} - g_{i}^{n} (x, y))}{\sum_{i}^{n} w_{i}}$

where $w_{i} = \exp \left( \frac{- d_{i}^{2}}{L^{2} C} \right)$
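
The formulas can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation behind the demo: distances are plain Euclidean distances in the same units as `L`, the same weights are reused on every pass as in the formulas above, and all names are hypothetical. Starting every estimate at zero makes the first pass the plain weighted average and each later pass a correction by the weighted residuals.

```javascript
// w = exp(-d^2 / (L^2 * C)) for a squared distance dx^2 + dy^2.
function weight(dx, dy, L, C) {
  return Math.exp(-(dx * dx + dy * dy) / (L * L * C));
}

// Weighted mean of `values` (one per observation) as seen from (x, y).
// Returns NaN if every weight underflows to zero.
function weightedMean(obs, values, x, y, L, C) {
  let num = 0;
  let den = 0;
  obs.forEach((o, i) => {
    const w = weight(o.x - x, o.y - y, L, C);
    num += w * values[i];
    den += w;
  });
  return den > 0 ? num / den : NaN;
}

// obs: [{x, y, value}]; passes = 1 gives g^0, passes = 2 gives g^1, ...
function barnes(obs, x, y, L, C, passes) {
  let residuals = obs.map((o) => o.value); // o_i minus an initial estimate of 0
  let atObs = obs.map(() => 0);            // estimate at each observation
  let atTarget = 0;                        // estimate at (x, y)
  for (let pass = 0; pass < passes; pass++) {
    atTarget += weightedMean(obs, residuals, x, y, L, C);
    atObs = atObs.map((g, i) =>
      g + weightedMean(obs, residuals, obs[i].x, obs[i].y, L, C));
    residuals = obs.map((o, i) => o.value - atObs[i]);
  }
  return atTarget;
}
```

Evaluating `barnes` at every point of a grid gives the surface that is then colored for display; each additional pass pulls the surface closer to the observed values.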

There will be a test later!

It produces the typical colored gradient surfaces for temperature, etc., for use in visualizations (e.g. over maps).

Think: the weather report on the 10pm news.

Interpolation Process via Map / Reduce

Run it Live!

[Interactive demo form: a pair of values in degrees (°), a time and a duration in minutes, a coloring range (°C to °C), and a quadrangle size (°).]

Polar Vortex - 2014-01-23

22 seconds to render, 13.6 seconds data access.

240 Partitions

Polar Vortex Animation

Concluding Remarks

One cannot overestimate the value of view source in the development of the Web.

We want to extend this to scientific data:

enables the copy and modify model
allows good constructs to go viral on the Web

What will the hacker in the corner do with scientific data?