Creating EML Metadata For Geospatial Data Using AI Chatbot Agents

Background

The Ecological Metadata Language (EML) specification is widely used across domains in environmental research, conservation, and management. While there are tools for generating metadata for tabular and vector data using the EML specification, there is a higher barrier to entry for geospatial data types (e.g., TIFF, GeoJSON, Shapefile). Here, we assess the capacity of AI chatbot agents (e.g., ChatGPT, Claude, DeepSeek, CoPilot) to take geospatial data files as input, extract metadata from the files, format the metadata according to the EML specification, and produce output that validates successfully in the ezEML web user interface (UI).

What is ezEML?

ezEML is a web UI developed by the Environmental Data Initiative (EDI) designed to simplify the creation of metadata in the EML format. The ezEML UI guides users through the process of creating EML metadata, making it accessible even to those without extensive technical expertise. By streamlining the metadata creation and validation process, ezEML helps ensure that datasets are findable, accessible, interoperable, and reusable, facilitating data sharing and collaboration within the scientific community.

How to: Upload XML To ezEML UI For Metadata Validation

  1. Access Web UI & Log In: If you don't have an account, you may need to register or sign in using an institutional login if available.
  2. Navigate To Upload Section: Look for an option or tab labeled "Upload" or "Import". This is typically found in the main navigation menu or dashboard.
  3. Upload XML File: Click on the "Upload" or "Import" button to start the process. You will be prompted to select the XML file from your computer. Click on "Choose File" or "Browse" to locate your XML file. Select the XML file you wish to upload and click "Open".
  4. Validate XML File: After uploading, ezEML will automatically begin the validation process. The tool will check the XML file against the EML schema to ensure it is correctly formatted and complete.
  5. Review Results: Once the validation is complete, ezEML will display the results. If there are any errors or warnings, they will be listed with details about what needs to be corrected. Review the feedback carefully. You may need to edit your XML file to address any issues.
  6. Edit & Re-upload (If Necessary): If errors were found, use an XML editor to make the necessary corrections to your file. Save the changes and repeat the upload process to validate the corrected file.
  7. Finalize and Save: Once your XML file passes validation without errors, you can proceed to save or publish your metadata within ezEML. Follow any additional prompts to complete the process, such as adding metadata details or linking to datasets.
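
Before uploading, a quick local check can catch some problems that would otherwise cost an upload round trip. The following is a minimal sketch, not part of the ezEML tool: the function name `missing_dataset_elements` and the sample XML are hypothetical, and the list of required elements (`title`, `creator`, `contact`) reflects our reading of the EML 2.2 schema. ezEML may flag additional issues that this check does not cover.

```python
import xml.etree.ElementTree as ET

# Children of <dataset> that, to our understanding, EML 2.2 requires at least once.
REQUIRED_DATASET_CHILDREN = ("title", "creator", "contact")


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def missing_dataset_elements(xml_text: str) -> list[str]:
    """List required <dataset> children that are absent from an EML document."""
    root = ET.fromstring(xml_text.strip())
    dataset = next((el for el in root if local_name(el.tag) == "dataset"), None)
    if dataset is None:
        return ["dataset"]
    present = {local_name(child.tag) for child in dataset}
    return [name for name in REQUIRED_DATASET_CHILDREN if name not in present]


# Hypothetical chatbot output that omits the creator element.
SAMPLE_EML = """
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0">
  <dataset>
    <title>Example dataset</title>
    <contact><organizationName>Example Lab</organizationName></contact>
  </dataset>
</eml:eml>
"""

print(missing_dataset_elements(SAMPLE_EML))
```

An empty list means the three elements are present; it does not mean the file will pass ezEML validation.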

Experiments

Across the experiments, the following variables were varied:

  • Geospatial file type
  • AI Chatbot Agent
  • Prompt

The following method was followed for each experiment:

  • Upload geospatial file to AI (if possible)
  • Type prompt and execute
  • Retrieve EML output either as downloadable XML file or text block that can be copied and saved as an XML File
  • Visually inspect the outputs for general adherence to XML structure
  • Upload the XML, without modification, to the ezEML web UI
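
The visual inspection step can be supplemented with a quick programmatic check. The following is a minimal sketch, assuming the chatbot's reply has been copied into a Python string; the helper name `check_eml_text` and the sample strings are hypothetical and were not part of the experiments. It only tests whether the text is well-formed XML with an `eml` root element. It does not validate against the EML schema, which is left to ezEML.

```python
import xml.etree.ElementTree as ET


def check_eml_text(xml_text: str) -> tuple[bool, str]:
    """Return (ok, message) for a block of chatbot-generated XML text."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as err:
        return False, f"not well-formed XML: {err}"
    # Strip any namespace, e.g. '{https://eml.ecoinformatics.org/eml-2.2.0}eml'.
    root_name = root.tag.rsplit("}", 1)[-1]
    if root_name != "eml":
        return False, f"unexpected root element: {root_name}"
    return True, "well-formed XML with an eml root element"


# Hypothetical examples: one acceptable reply and one with an unclosed tag.
good_reply = '<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"><dataset/></eml:eml>'
bad_reply = "<eml><dataset></eml>"

print(check_eml_text(good_reply))
print(check_eml_text(bad_reply))
```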

Input Dataset Descriptions

LeafyLandscapesPlots

  • This is a collection of observations of plant traits and biomass in a series of 211 one-square-meter plots, collected in 2024 by Ian Breckheimer at Rocky Mountain Biological Laboratory. The dataset contains polygon geometries for each plot along with summarized field measurements for each plot.

IceTrafficDataFrame

This is a shapefile containing a polygon grid summarizing sea ice concentration data (CryoSat-2/SMOS Merged Product) and marine vessel counts (generated from Automatic Identification System vessel tracking data). Data are summarized on a monthly basis from January 2015 to December 2022 with one column per month. Data are projected in Alaska Albers (EPSG: …). There are >2,000 rows (one row per polygon) and >200 columns, making metadata generation a challenge.
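
With one column per variable per month, most of the attribute entries differ only in their date, so they can be generated in a loop rather than typed by hand. The following is a minimal sketch under stated assumptions: the column prefixes (`ice`, `ves`) and the `prefix_YYYYMM` naming pattern are hypothetical, since the shapefile's actual column names are not listed here. It fills in only `attributeName` and `attributeDefinition`; storage type, measurement scale, and units would still need to be added before the output is complete EML.

```python
import xml.etree.ElementTree as ET

# Hypothetical column prefixes and labels; the real shapefile's names may differ.
VARIABLES = {
    "ice": "Sea ice concentration",
    "ves": "Marine vessel count",
}


def monthly_attribute_list(first_year: int = 2015, last_year: int = 2022) -> ET.Element:
    """Build an attributeList with one attribute per variable per month."""
    attribute_list = ET.Element("attributeList")
    for prefix, label in VARIABLES.items():
        for year in range(first_year, last_year + 1):
            for month in range(1, 13):
                attribute = ET.SubElement(attribute_list, "attribute")
                name = ET.SubElement(attribute, "attributeName")
                # Ten characters, which fits the shapefile field-name limit.
                name.text = f"{prefix}_{year}{month:02d}"
                definition = ET.SubElement(attribute, "attributeDefinition")
                definition.text = f"{label} for {year}-{month:02d}"
    return attribute_list


attributes = monthly_attribute_list()
print(len(attributes))  # 2 variables x 8 years x 12 months = 192
print(ET.tostring(attributes[0], encoding="unicode"))
```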

HLSL30.020_B01

This is a single-layer raster dataset representing one band, Band 2 (Blue), of a satellite image from the Harmonized Landsat Sentinel-2 (HLS) dataset. The image was collected in spring 2024 and shows a small mountainous area in western Colorado. Although the source data are public, this subset was extracted with the NASA AppEEARS web tool and has different properties from data already in the public domain.

SeaIceVessel

These data represent a monthly time series of sea ice concentration data (CryoSat-2/SMOS Merged Product) and marine vessel counts (generated from Automatic Identification System vessel tracking data). Data are summarized on a monthly basis from January 2015 to December 2022 with one row per pixel per month. Vessel activity is measured as the total number of ships per pixel in a given month.

Results

Agent Compatibility By Geospatial File Type

File Type | chatGPT-4o | Claude             | DeepSeek           | CoPilot
Shapefile | Yes        | No                 | No                 | No
GeoJSON   | Yes        | No (limited size)  | Yes (limited size) | No
TIFF      | Yes        | No                 | Yes                | No
CSV       | Yes        | Yes (limited size) | Yes                | No

Results by Experiment

File Type | Agent | Source Dataset | Prompt | Output | ezEML Validation | Notes
GeoJSON | GPT4o | LeafyLandscapesPlots | Please generate metadata in the EML version 2.2 format for the LeafyLandscapesPlots geoJSON file. | XML embedded text | Partial | Lacking bounding box coordinates
Shapefile | GPT4o | IceTrafficDataFrame | Please generate metadata in the EML version 2.2 format for the IceTrafficDataFrame shapefile. | XML download | Partial | Bounding box in incorrect projection; lacking creator and contributor
TIFF | DeepSeek-V3 | HLSL30.020_B01 | Please generate metadata in the EML version 2.2 format for the attached data | XML embedded text | Minimal | Supports the TIFF format and produced valid EML metadata, but the metadata were not well populated
TIFF | GPT4o | HLSL30.020_B01 | Please generate metadata in the EML version 2.2 format for the HLSL30.020_B01.tif TIFF file. | XML embedded text | Minimal | Only image size information extracted; hallucinated image date
CSV | GPT4o | SeaIceVessel | Please generate metadata in the EML version 2.2 format for the SeaIceVessel csv | XML embedded text | Partial | Missed some column names
CSV | DeepSeek | SeaIceVessel | Please generate metadata in the EML version 2.2 format for the SeaIceVessel csv | XML embedded text | Partial | Complete columns, but put "dimensionless" ratio for numeric data
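
Two of the runs above returned EML in which the bounding box was missing or misreported. For GeoJSON inputs, the bounding coordinates can be computed directly from the file rather than left to the chatbot. The following is a minimal sketch under several assumptions: the file is a FeatureCollection whose coordinates are longitude/latitude in decimal degrees (the default in the current GeoJSON specification), every feature has a non-null geometry with a `coordinates` member, and the sample polygon and the function names are hypothetical. Data in a projected coordinate system would need to be reprojected first.

```python
import json
import xml.etree.ElementTree as ET


def iter_positions(coords):
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def bounding_coordinates(geojson_text: str) -> ET.Element:
    """Build an EML boundingCoordinates element from a GeoJSON FeatureCollection."""
    collection = json.loads(geojson_text)
    xs, ys = [], []
    for feature in collection["features"]:
        for x, y in iter_positions(feature["geometry"]["coordinates"]):
            xs.append(x)
            ys.append(y)
    box = ET.Element("boundingCoordinates")
    limits = (
        ("westBoundingCoordinate", min(xs)),
        ("eastBoundingCoordinate", max(xs)),
        ("northBoundingCoordinate", max(ys)),
        ("southBoundingCoordinate", min(ys)),
    )
    for tag, value in limits:
        ET.SubElement(box, tag).text = str(value)
    return box


# Hypothetical one-polygon collection; not taken from the LeafyLandscapesPlots file.
SAMPLE_GEOJSON = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-107.0, 38.9], [-106.9, 38.9], [-106.9, 39.0],
                             [-107.0, 39.0], [-107.0, 38.9]]],
        },
    }],
})

box = bounding_coordinates(SAMPLE_GEOJSON)
print(ET.tostring(box, encoding="unicode"))
```

The resulting element can be pasted into the `geographicCoverage` section of the chatbot-generated EML before uploading it for validation.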

Screenshots

Experiment: GeoJSON uploaded to GPT4o

Prompt:

[Screenshot: LeafyLandscapesPlots-GPT4o-GeoJSON]

ezEML UI:

[Screenshot: LeafyLandscapesPlots-GPT4o-GeoJSON_ezEML_validation]

Experiment: Shapefile uploaded to GPT4o

Prompt:

[Screenshot: IceTrafficDataFrame-GPT4o-shp-GPToutput]

ezEML UI:

[Screenshot: IceTrafficDataFrame-GPT4o-shp-ezEML]

Discussion

We found a wide variety of capabilities for EML generation across LLMs. chatGPT-4o was able to ingest the largest number of spatial data file types, followed by DeepSeek. In several cases, LLMs were capable of extracting spatial information (projection, bounding box) from within the provided spatial data format.

The EML documents produced by LLMs provided a starting point for metadata generation, but without supplemental information from the author, the LLMs were prone to hallucinations. However, for large datasets with many columns, an LLM could save valuable time otherwise spent manually entering a large number of data fields. As with many other LLM use cases, we found it useful to think of them as a competent assistant: very helpful with tedious tasks, but in need of supervision.

Future tests would benefit from further engagement and prompt engineering alongside LLMs. It may be possible to refine EML content by providing additional context or correcting the initial LLM output in an iterative fashion.

Conclusion

LLMs are a useful tool for lowering the barrier to entry for metadata generation by automating tedious processes and creating templates for (often unfamiliar) file formats. However, users must exercise caution when using LLMs for metadata creation given their tendency to hallucinate important information, especially when that information is not supplied by the user.

Acknowledgements & Contributors

This tutorial was generated by a group of collaborators participating in the 2025 Environmental Data Science Summit hosted by the National Center for Ecological Analysis and Synthesis in Santa Barbara, California, which took place February 4-6, 2025.

Authors

  • Ian Breckheimer, Rocky Mountain Biological Laboratory, ikb@rmbl.org
  • Kelly Kapsar, Michigan State University, kelly.kapsar@gmail.com
  • Unis Lebbie, University of Southampton, lebbieunis@gmail.com
  • Zachary Nickerson, National Ecological Observatory Network, nickerson@battelleecology.org