r. alexander miłowski, geek



NASA OpenNEX Challenge and Processing HDF5 in Python

I've recently been playing around a lot with HDF5 data as the second part of the NASA Challenge: New Ways to Use, Visualize, and Analyze OpenNEX Climate and Earth Science Data. Henry Thompson and I won a small award in the first challenge for our proposal to apply the PAN Methodology to the OpenNEX data. Specifically, we are looking at the NASA Earth Exchange (NEX) data sets hosted on Amazon AWS, which provide climate projections both retrospectively (1950-2005) and prospectively (2006-2099).

The model is partitioned into files stored on Amazon S3 in 60-month segments (5 years). Each file contains a grid at 1/120° resolution of temperature values from the climate model. The extent of the data set roughly covers the area between 24° and 49° latitude and -127° and -66° longitude (think: North America). In total, that is 60 copies of a matrix of 3105 rows by 7025 columns, or 1,308,757,500 temperature values. Thus, each file is roughly 1.3GB in size.

In my experiments with the data, there are values only over land areas, and I've calculated that roughly 45% of the data is fill values: values where the model has no prediction but the HDF5 format requires a value anyway. As such, there are only about 700 million actual values to process and less than 700MB of actual non-fill data.

Processing HDF5 is Difficult

There are many obstacles to processing HDF5. To start, there are many different variants: is it netCDF, HDF5, HDF5-EOS, or HDF4? You can't necessarily tell by looking at the files, there isn't a media type that will tell you which variant you have, and they are often incompatible with each other (you can't read HDF4 with HDF5 tools).

To make things worse, HDF5 is really a library. There is a specification, but there is really only one implementation, so in practice the format is the C implementation. That means you have to make sure you have the right library version installed for the version of HDF you are processing.

I originally discovered that I could dump HDF5 into XML using a tool called h5dump but that proved to be pointless. The choice of markup is verbose and you can't subset the data.

You can get slices of the data via the h5dump tool, but the output is in HDF5's Data Description Language (DDL) format. As such, I would need to find a parser for that format as well. At that point, I gave up on using h5dump.

Python to the Rescue

I did discover that you can process HDF5 data in Python via a module called h5py. It uses numpy to give you vectorized access to the multi-dimensional data set.

Access to the OpenNEX data couldn't be easier:

>>> import numpy
>>> import h5py
>>> f = h5py.File("tasmax_quartile75_amon_rcp85_CONUS_202601-203012.nc","r")
>>> f["tasmax"]
<HDF5 dataset "tasmax": shape (60, 3105, 7025), type "<f4">
>>> 

The file becomes a hashtable of data sets that are numpy multi-dimensional arrays underneath. You can access the numpy array directly:

>>> dset = f["tasmax"].value
>>> dset
array([[[  1.00000002e+20,   1.00000002e+20,   1.00000002e+20, ...,
           1.00000002e+20,   1.00000002e+20,   1.00000002e+20],
        [  1.00000002e+20,   1.00000002e+20,   1.00000002e+20, ...,
           1.00000002e+20,   1.00000002e+20,   1.00000002e+20],
        [  1.00000002e+20,   1.00000002e+20,   1.00000002e+20, ...,
           1.00000002e+20,   1.00000002e+20,   1.00000002e+20],
        ..., 
...

Now it becomes a task of accessing the numpy array properly to do your processing. Given the size of my data, if you do this wrong, you'll spend a great deal of time iterating over data. Because I'm new to numpy, I spent a great deal of time iterating over data before I got something that worked reasonably well.

One of the tasks I need to do is to summarize the data and reduce the resolution. This is a process of collecting chunks of adjacent cells and averaging them into a single value per chunk. There is the additional nuance that fill values of 1.00000002e+20 are used instead of a regular zero value, and that makes the problem a bit harder.

The solution looks pretty neat:

from functools import reduce  # reduce is a builtin on Python 2; the import is needed on Python 3

months  = dset.shape[0]  # 60 months
rows    = dset.shape[1]  # 3105 rows
columns = dset.shape[2]  # 7025 columns
m = 0         # just month 1 (January, start of 60 month period)
chunkSize = 5 # summarizing 5 by 5 grid subsets

# the reduced grid: 3105/5 by 7025/5, i.e. (621, 1405)
d = numpy.empty((rows//chunkSize,columns//chunkSize),numpy.float32)

for i in range(0,d.shape[0]):
   for j in range(0,d.shape[1]):
      # the chunkSize by chunkSize block of month m starting at (i*chunkSize, j*chunkSize)
      s = dset[m,i*chunkSize:i*chunkSize+chunkSize,j*chunkSize:j*chunkSize+chunkSize]
      # accumulate (count, sum) over the block, ignoring fill values
      reduction = reduce(lambda r,x: (r[0]+1,r[1]+x) if x<1.00000002e20 else r, s.flat,(0,0))
      d[i,j] = 0 if reduction[0] == 0 else reduction[1] / reduction[0]

Along the way I discovered tuples, which doubled the speed. To calculate the average, I need to count the number of grid entries that actually have values. You can do that in two steps:

count = reduce(lambda r,x: r+1 if x<1.00000002e20 else r, s.flat,0)
sum = reduce(lambda r,x: r+x if x<1.00000002e20 else r, s.flat,0)
d[i,j] = 0 if count==0 else sum / count

or as one step:

reduction = reduce(lambda r,x: (r[0]+1,r[1]+x) if x<1.00000002e20 else r, s.flat,(0,0))
d[i,j] = 0 if reduction[0] == 0 else reduction[1] / reduction[0]

and the one-step version traverses the data only once, so it is faster.
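For comparison, numpy can also do this block averaging without an explicit Python loop. The following is only a sketch, assuming the same dset, month index m, chunk size, and fill value as above (621 and 1405 are just 3105/5 and 7025/5); it reshapes one month of data into 5 by 5 blocks and averages the non-fill cells of each block:

FILL   = 1.00000002e20
month  = dset[m,:,:]                                  # one month, shape (3105, 7025)
blocks = month.reshape(621,chunkSize,1405,chunkSize)  # group the grid into 5 by 5 blocks
valid  = blocks < FILL                                # mask of non-fill cells
counts = valid.sum(axis=(1,3))                        # number of non-fill cells per block
sums   = numpy.where(valid,blocks,0).sum(axis=(1,3))  # sum of non-fill cells per block
d      = numpy.where(counts>0,sums/numpy.maximum(counts,1),0).astype(numpy.float32)

Whether it beats the reduce-based loop on the full data set would need to be measured, but it avoids the per-element Python function calls entirely.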

Is Python Slow?

Python seemed a bit slow to me. In fact, when manipulating large arrays of data without numpy, it was painfully slow. That makes me question why it is used so often for processing scientific data.
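As a rough illustration of the kind of gap I mean, here is a hypothetical micro-benchmark (not a measurement from the OpenNEX processing) that sums a large array with a pure Python loop and then with numpy's vectorized sum:

import timeit
import numpy

values = numpy.random.rand(1000000).astype(numpy.float32)
print(timeit.timeit(lambda: sum(float(x) for x in values), number=1))  # element-by-element in Python
print(timeit.timeit(lambda: values.sum(), number=1))                   # vectorized in numpy

The vectorized version does the same arithmetic in compiled code, which is where the difference comes from.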

The interactive nature of Python allowed me to experiment with the HDF5 data in ways that I could not with pre-canned tools like h5dump, and to understand a bit more about the size of the task. That experimentation enabled me to think about the problem at hand rather than brute-force a solution with code.

While this process would be enormously faster in C++, it would be harder to write. I would have had to learn the low-level HDF5 C API and handle many issues myself, which would have made the prototyping process much more complex. Yet, if I really need this to run fast, I might attempt to do so.

In looking around for rationales about Python and processing speed, I found a Stack Exchange question titled How slow is python really? (Or how fast is your language?) about the speed of Python in comparison to other languages. There are a lot of interesting results, including how fast Rust is as a language. That makes me wonder whether other languages can better balance the experimentation enabled by dynamically typed interpreted languages against the speed afforded by statically typed compiled languages.

Because this is a one-time operation, I am less worried about the speed of the process. The processing time is reasonable enough to do what I need to do with the OpenNEX data set.

Do Elements have URIs?

I was discussing, with a learned colleague of mine whose opinions I value greatly, a problem with the triples generated from RDFa in the in-browser applications I have developed using Green Turtle. In short, I wanted to duplicate the kinds of processing I'm doing in the browser so I can run it through XProc and do more complicated processing of the documents. Yet, I rely on the origin of triples in the document for my application to work.

His response was to just generate a URI, and I pushed back a bit. I don't think the origin of a triple is something that is easily named with a URI, and I need to explain why I believe that.

Why do I care? Because I do things with RDFa in the browser (a simple example and a complicated one) and sometimes I want to do the same thing outside of the browser; other tools are failing me right now.

A Bit of History

Some of you might remember the XPointer Framework, which provided a mechanism for embedding a pointer into the fragment identifier. In theory, you can point to specific elements by using tumblers (e.g., /1/2 is the second child of the first element) or by a pointer (i.e., an XPath expression), but you might need to deal with the complexity of whatever namespaces your document uses. The result is something that might not be so easy to parse, cut-n-paste, or otherwise manipulate, but it should work.
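For illustration, fragment identifiers using the XPointer schemes look something like the following (these are sketches of the syntax, not pointers into any real document):

#element(/1/2)
#xpointer(/article/section[2]/p[1])
#xmlns(ex=http://example.org/ns)xpointer(//ex:note[3])

The first uses the element() scheme's child sequence (a tumbler), the second the xpointer() scheme with an XPath expression, and the third adds a namespace binding with the xmlns() scheme.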

Yet, we really don't have XPointer functionality in browsers except possibly in some minimal form necessary for SVG's use of XLink. Some of it might have to do with the complexity involved and diminishing returns. That is, people have gotten along with naked fragment identifiers and the id attribute for quite a while. Others have usurped the fragment portion of the URI for other nefarious purposes (e.g., application state).

Nothing is Free

In the browser, there is no support other than for naked fragment identifiers that map to HTML's id attribute. We don't even have consistent xml:id support within the browsers. Not to mention, there is the conflict between HTML's id attribute and xml:id when serializing as XML syntax. Keep in mind that developers have to implement whatever we cook up, and time and mind share are not on XPointer's side.

The net is that we get nothing for free and we have little to rely upon.

Fragile Pointers

There is probably an implicit rule for the Web:

If you want someone to point to your content, you should give it an identifier.

We learned that with links on the Web and gave things unique URIs. Then we learned that we need to assign identifiers to portions of content within Web resources for similar reasons. Extra identifiers don't hurt and they give people the ability to point at specific things in your content. Thus, having a good scheme for a liberal sprinkling of identifiers is a good idea.

Unfortunately, thoughtful content doesn't always happen. Some might say that it rarely happens. As such, if you want to point to specific things and they don't have identifiers, you are out of luck. XPointer was supposed to help solve that, and you didn't get it.

But my original problem is not about linking; it is about tracking origins during the processing of the document. The RDFa API that Green Turtle implements provides the ability to get elements by their type or by specific property values. This makes it possible to write applications that process elements based on their type and various other annotations, going back and forth between the annotation graph of triples and the document to make things happen in that very same document.

I don't want to generate a URI, nor a pointer, and doing so feels like a workaround. It is the result of a system that isn't designed to track origin or, dare I say, provenance.

Provenance?

In my opinion, the origin of a triple isn't what the term provenance commonly means in many Semantic Web communities. Often, provenance means the Web resource from whence the triple was generated, not the element node. To complicate things, provenance can also mean the earliest known history of something, so the term is very overloaded.

A triple in RDFa originates from a particular element. In a few cases (e.g., typeof attributes with more than one type), an element can generate more than one triple. Meanwhile, in reverse, every triple from RDFa annotations has a single element node that is its origin.
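As a small illustration (a sketch using the schema.org vocabulary, not markup from any of my applications), a single element with two types:

<div prefix="schema: http://schema.org/" resource="#coit-tower"
     typeof="schema:Place schema:TouristAttraction">Coit Tower</div>

generates two triples, both of which originate from that one div element:

<#coit-tower> a schema:Place, schema:TouristAttraction .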

Thus, I prefer origin over provenance so that I can avoid the overloaded and confusing use of the word provenance in both industry and research.

Interoperability?

From any Web resource annotated with RDFa you can generate Turtle or JSON-LD output that represents the triples harvested from the document. Unfortunately, we lose any information about the origin of a triple unless we generate more information. Such additional information would need to have a URI or pointer to the element from which the triple was generated. That brings us full circle and leaves us holding an empty bag.

Any tool that processes the RDFa directly has this information when it harvests the triples. Within that context, we can use that information, just like Green Turtle does, to provide the application a way to traverse between the annotation graph of triples and the document from whence they came. Unfortunately, this seems to be a different model from what many systems have implemented.

In the end, I am less concerned about interoperability, mainly because it is my own tool chain that I am using to process information. I'll use whatever tools work, and I don't intend to expose the intermediate states onto the Web. Those might be famous last words, so I'll take some I told you so tokens in advance.

Still Searching for a Solution

I don't have a solution to do this right now. I'm tempted to use PhantomJS or node.js to run my application as if it were in the browser and then process the output with XProc. This would satisfy my main use case of post-processing the results into static variants for various purposes.

I would like to put this content into MarkLogic and run some of the processing there, but they don't support RDFa and they don't have a notion of an origin of a triple. It would be ideal to have this within the context of a database because the origin is a node and storing an internal reference should be straightforward (but I'm guessing). I bet I could hack up eXist and make it do this for me too.

Right now, I have too much to do. The applications work in the browser and I'll let the dust settle for the rest of it. Maybe I'll find a clever solution somewhere in the near future.

GeoJSON to the Rescue (or not)!

This is the fourth entry in my series on my PhD dissertation titled Enabling Scientific Data on the Web. In this entry, we will explore GeoJSON as an alternate approach to geospatial scientific data.

What is GeoJSON?

GeoJSON is a format developed for encoding a variety of geographic data structures. It is feature-oriented, just like KML, and can replace the use of KML in many, but not all, Web applications. The encoding conforms to standard JSON syntax, with an expected structure and set of property names.

A GeoJSON Object Containing Two San Francisco Landmarks
{ "type": "FeatureCollection",
  "features": [
      {"type": "Feature",
       "properties": {
          "name": "AT&T Park",
          "amenity": "Baseball Stadium",
          "description": "This is where the SF Giants play!"
       },
       "geometry": {
          "type": "Point",
          "coordinates": [-122.389283, 37.778788 ]
       }
      },
      {"type": "Feature",
       "properties": {
          "name": "Coit Tower"
       },
       "geometry": {
          "type": "Point",
          "coordinates": [ -122.405896, 37.802266 ]
       }
    }
  ]
}

A GeoJSON object like the one above starts with a feature collection, and each feature is a tuple of a geometry object, an optional identifier, and a properties object. The geometry object describes a point, line, polygon, arrays of such objects, or collections of mixed geometry objects.

The properties property of a feature is any JSON object value. In the example shown above, it defines a set of metadata for each point that describes a location in San Francisco. If the property names match the expectations of the consuming application, they may affect the rendering (e.g., a map marker might be labeled with the feature name). There is no standardization of what the properties property may contain beyond it being a legal JSON object value.

GeoJSON at the USGS

The US Geological Survey (USGS) provides many different feeds of earthquakes around the world in GeoJSON format. Each feature is a single point (the epicenter), and an extensive set of properties is provided that describes the earthquake. The property definitions are documented on the USGS website, but their use is not standardized.

An Earthquake Feed Example
{"type":"FeatureCollection",
 "metadata":{
     "generated":1401748792000,
     "url":"http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_day.geojson",
     "title":"USGS Significant Earthquakes, Past Day",
     "status":200,
     "api":"1.0.13",
     "count":1
  },
  "features":[
     {"type":"Feature",
      "properties":{
         "mag":4.16,
         "place":"7km NW of Westwood, California",
         "time":1401676603930,
         "updated":1401748647446,
         "tz":-420,
         "url":"http://earthquake.usgs.gov/earthquakes/eventpage/ci15507801",
         "detail":"http://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/ci15507801.geojson",
         "felt":3290,
         "cdi":5.4,
         "mmi":5.36,
         "alert":"green",
         "status":"reviewed",
         "tsunami":null,
         "sig":806,
         "net":"ci",
         "code":"15507801",
         "ids":",ci15507801,",
         "sources":",ci,",
         "types":",cap,dyfi,focal-mechanism,general-link,geoserve,losspager,moment-tensor,nearby-cities,origin,phase-data,scitech-link,shakemap,",
         "nst":100,
         "dmin":0.0317,
         "rms":0.22,
         "gap":43,
         "magType":"mw",
         "type":"earthquake",
         "title":"M 4.2 - 7km NW of Westwood, California"
      },
      "geometry":{
         "type":"Point",
         "coordinates":[-118.4911667,34.0958333,4.36]
      },
      "id":"ci15507801"
    }
  ]
}

It is quite easy to see that, when this data is encountered outside of the context of the USGS, the property names have little meaning and there is no syntax that identifies them as belonging to the USGS.

Out with the Old, in with the New

Just replacing KML's XML syntax and legacy structures from Keyhole with a JSON syntax doesn't accomplish much other than making it easier for JavaScript developers to access the data. There are plenty of mapping toolkits, written in JavaScript, that can readily do things with GeoJSON data with minimal effort, and that is generally a good thing. Many can also consume KML, so we haven't necessarily improved access.

The format is still oriented towards map features. If you look at the example above, you'll see that the non-geometry information overwhelms the geometry information. If you want to process just the properties, you need to enumerate all the features and then extract the data. Because parsing JSON gives you a data structure directly, GeoJSON makes this a bit easier than KML, and that is an obvious win for the format.
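For instance, pulling the property "table" out of the USGS feed shown above takes only a few lines of Python. This is a sketch assuming Python 3 and the significant-earthquakes feed URL that appears in the example; the property names come from the USGS documentation:

import json
import urllib.request

url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_day.geojson"
with urllib.request.urlopen(url) as response:
    collection = json.loads(response.read().decode("utf-8"))

# enumerate the features and pull out the non-geometry data
for feature in collection["features"]:
    properties = feature["properties"]
    longitude, latitude = feature["geometry"]["coordinates"][:2]
    print(properties["mag"], properties["place"], latitude, longitude)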

Remember that we are still looking at scientific data sets, and scientists love to make tables of data. The USGS earthquake feed is a table of data that happens to have two columns of geospatial information (the epicenter) and 26 other columns of data. Yet, we are forced into a map-feature view of this data set by the choice of GeoJSON.

Keep in mind that the OGC says this about KML:

The OGC KML Standard is an XML grammar to encode and transport representations of geographic data for display in an earth browser, such as a 3D virtual globe, 2D web browser application, or 2D mobile application. Put simply: KML encodes what to show in an earth browser, and how to show it. [OGC Reference Model, 2011]

We could say almost the same thing about GeoJSON except that it doesn't say what to do with the properties. There is only an implied aspect of GeoJSON that the features are rendered into map features and then the properties are displayed somehow. That somehow is left up to the Website developer to code in JavaScript.

Does JSON-LD Help?

GeoJSON is fine for what it does and doesn't do, but it probably shouldn't be used to exchange scientific data. It lacks any ability to standardize what to expect as data for each feature, and such standardization isn't the purview of the good folks who developed it. We might be able to place something in the value of the properties property to facilitate syntactic recognition of specific kinds of data.

One new thing that I am considering exploring is a mixed model where the properties object value is assumed to be a JSON-LD object. This allows the data to have much richer annotation and opens the door to standardization. Unfortunately, this is still on my TODO list.
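As a sketch of what that might look like (using the schema.org vocabulary purely for illustration; nothing here is standardized), the Coit Tower feature from earlier could carry JSON-LD in its properties value:

{ "type": "Feature",
  "properties": {
     "@context": "http://schema.org/",
     "@type": "TouristAttraction",
     "name": "Coit Tower"
  },
  "geometry": {
     "type": "Point",
     "coordinates": [ -122.405896, 37.802266 ]
  }
}

A JSON-LD-aware consumer could recognize the properties as schema.org metadata, while an ordinary GeoJSON consumer would still see a legal JSON object value.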

What is next?

I'm just about done with formats for scientific data. There are many, many more formats out there, and they suffer from many of the same pitfalls. Up next, I want to address what it means to be on the Web, cover some architectural principles, and describe some qualities we want for Web-oriented scientific data.
