NGDA Second-year Goals and Tasks


Facet I: Research

Goal— Characterize the "landscape" of geospatial data from the perspective of long-term preservation. Why— There's little awareness in the library community of the complexity of geospatial data (either its intrinsic complexity or its production-related complexity).

  1. Finish Banning's analyses of the USGS DOQ and DRG, CaSIL shapefile, and NASA Landsat 7 products.
  2. Analyze preservation of remote-sensing and seismographic geospatial data. Look at data sizes and production rates, data processing levels, availability of format- and semantics-defining specifications, ability to reproduce products from raw data, and the needs of the scientific communities that use (and will continue to use) the data. Interview data producers and consumers.
    • Consider the role federal agencies play in preservation.
    • Host a workshop?
  3. Write up and publish the results.

Facet II: Registry

Goal— Develop a working, populated registry for format specifications and other semantics-defining specifications. Why— This represents one of our two principal strategies for long-term preservation.

  1. Develop a registry system that models dependencies between specifications and enforces format "recoverability."
  2. Develop a web interface that supports incremental population and long-term maintenance of the registry.
    • Consider the implications of registry maintenance by a distributed community.
  3. Populate the registry with all formats encountered in practice, including dependent formats.
  4. As the registry is populated, develop a data model for formats and a vocabulary of format relationships.
  5. Acquire and ingest (possibly embargoed) ESRI format specifications.
  6. Participate in GDFR discussions.
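
As an illustration of task 1, here is a minimal sketch of a registry that models dependencies between specifications. It assumes that "recoverability" means every specification a format depends on, directly or transitively, is itself present in the registry; that reading is an assumption, not a definition taken from the project. The `Spec` and `Registry` classes and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """A registered specification (hypothetical model, not the NGDA schema)."""
    name: str
    depends_on: list = field(default_factory=list)  # names of other specs


class Registry:
    """Holds specifications and checks their dependency closure."""

    def __init__(self):
        self.specs = {}

    def add(self, spec):
        self.specs[spec.name] = spec

    def missing_dependencies(self, name):
        """Return every spec reachable from `name` that is not registered."""
        missing, seen, stack = set(), set(), [name]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.add(current)
            spec = self.specs.get(current)
            if spec is None:
                missing.add(current)
                continue
            stack.extend(spec.depends_on)
        return missing

    def is_recoverable(self, name):
        """Assumed meaning: the whole dependency chain is in the registry."""
        return not self.missing_dependencies(name)


# Illustrative entries only.
registry = Registry()
registry.add(Spec("TIFF"))
registry.add(Spec("GeoTIFF", depends_on=["TIFF"]))
registry.add(Spec("Shapefile", depends_on=["dBASE"]))  # dBASE not yet registered

print(registry.is_recoverable("GeoTIFF"))
print(registry.missing_dependencies("Shapefile"))
```

A check like `missing_dependencies` could also drive task 3, since its result names the dependent formats that still need to be added.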

Facet III: Archive

Goal— Develop an operational archive and ingest system. Why— To archive at-risk content and validate proposed approaches to long-term preservation.

  1. Complete development of the initial NGDA system.
  2. Implement validation and registry-related constraint checks.
  3. Evaluate Fedora as an archival platform.
  4. Investigate distributed storage systems and approaches.

Facet IV: Access

Goal— Develop multiple access mechanisms for archived content. Why— Inaccessible content is useless, and access is needed to make project accomplishments visible.

  1. Develop a "simple" access mechanism such that archive objects are located at canonical URLs (e.g., http://archive/id); object manifests are retrievable as HTML documents; object components are downloadable as MIME-typed files; and the archive as a whole is crawlable by Internet search engines.
  2. ADL:
    1. Develop ADL ingest services. Automate index-building.
    2. Develop a crawler/mapper component that crawls archive collections, then maps and ingests them into ADL using the ingest services above.
  3. Provide OAI access to archive metadata.
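
The "simple" access mechanism in task 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the base URL reuses the example host from task 1, component files are assumed to live under the object's URL, and all function names are hypothetical. MIME types are guessed from file extensions with Python's standard `mimetypes` module.

```python
import html
import mimetypes
from urllib.parse import quote

BASE_URL = "http://archive"  # host taken from the example URL in task 1


def canonical_url(object_id):
    """Canonical URL of an archive object."""
    return f"{BASE_URL}/{quote(object_id, safe='')}"


def component_url(object_id, filename):
    """URL of one component (assumed layout: object URL plus file name)."""
    return f"{canonical_url(object_id)}/{quote(filename, safe='')}"


def content_type(filename):
    """MIME type for a component, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def manifest_html(object_id, filenames):
    """Render a manifest as plain HTML with one link per component."""
    items = "\n".join(
        '<li><a href="{url}">{name}</a> ({mime})</li>'.format(
            url=html.escape(component_url(object_id, name), quote=True),
            name=html.escape(name),
            mime=content_type(name),
        )
        for name in filenames
    )
    title = html.escape(object_id)
    return f"<html><body><h1>{title}</h1>\n<ul>\n{items}\n</ul>\n</body></html>"


# Hypothetical object ID and file names.
print(manifest_html("doq-1234", ["scene.tif", "README"]))
```

Because the manifest uses ordinary hyperlinks, a search-engine crawler that reaches an object's URL can follow them to each component without any special support.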

Facet V: Content

Goal— Archive at-risk content. Why— In addition to being the ultimate purpose of the project, this provides needed feedback on the other facets.

  1. Complete ingest of the CaSIL collections.
  2. Identify additional at-risk collections.
  3. Archive those collections.
  4. Develop a collection development policy.
    • Define "at-risk."
    • Consider value over time, urgency, and ephemerality.

Facet VI: Legal

Goal— Determine the legal (contractual, copyright, and other) ramifications of long-term archival of geospatial data. Why— It's a necessary evil.

  1. Research the impact of CRADAs on access to government data.
  2. Develop prototype provider/archive contract(s).

Greg Janée
Created: 2006-01-25
Last modified: 2007-01-09 13:58