
Darwin Core Archive Geography Cleaner

John Wieczorek edited this page Oct 7, 2016 · 5 revisions

This workflow:

The files produced by this workflow are:

  • count_country.csv - the distinct values of the country field and the number of times they appeared in the extracted core file.
  • count_geography.csv - the distinct combinations of values of the higher geography fields and the number of times each appeared in the extracted core file.
  • dwca_extracted_occurrences.txt - the core file of the downloaded Darwin Core Archive, extracted as a TXT file.
  • dwca_extracted_occurrences_geography_standardized.txt - a copy of dwca_extracted_occurrences.txt in which the higher geography fields are replaced by standard values from lookup_geography. The original higher geography values are copied to new fields whose names have '_orig' appended.
  • dwca.zip - the Darwin Core Archive file downloaded from the given URL.
  • lookup_country.txt - downloaded copy of the country lookup file.
  • lookup_geography.txt - downloaded copy of the geography lookup file.
  • new_country.csv - file containing the country values not found in the country lookup file.
  • new_geography.csv - file containing the distinct combinations of higher geography not found in the geography lookup file.
  • recommended_geography.csv - file containing the recommendations to standardize distinct combinations of higher geography.
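
The relationship between the count, new-value, and standardized files can be illustrated with a minimal Python sketch. It is not the workflow's actual code. The field list, the tab-delimited input, the in-memory `lookup` dictionary, and the function names `count_values`, `new_values`, and `standardize` are all assumptions made for illustration; the real lookup files and actors may be structured differently.

```python
import csv
import io
from collections import Counter

# Assumption: a small subset of higher geography fields, chosen for illustration.
GEOGRAPHY_FIELDS = ["continent", "country", "stateProvince"]


def count_values(rows, fields):
    """Count each distinct combination of values for the given fields."""
    return Counter(tuple(row.get(f, "") for f in fields) for row in rows)


def new_values(counts, lookup):
    """Return the counted combinations that have no entry in the lookup."""
    return sorted(key for key in counts if key not in lookup)


def standardize(row, fields, lookup):
    """Copy originals to '<field>_orig' and substitute standard values when known."""
    out = dict(row)
    key = tuple(row.get(f, "") for f in fields)
    standard = lookup.get(key)
    for i, field in enumerate(fields):
        out[field + "_orig"] = row.get(field, "")
        if standard is not None:
            out[field] = standard[i]
    return out


# Hypothetical sample data standing in for dwca_extracted_occurrences.txt.
sample = (
    "continent\tcountry\tstateProvince\n"
    "\tUSA\tCalif.\n"
    "\tUSA\tCalif.\n"
    "\tArgentina\tSalta\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Hypothetical lookup: original combination -> standardized combination.
lookup = {("", "USA", "Calif."): ("North America", "United States", "California")}

counts = count_values(rows, GEOGRAPHY_FIELDS)   # analogous to count_geography.csv
unknown = new_values(counts, lookup)            # analogous to new_geography.csv
cleaned = [standardize(r, GEOGRAPHY_FIELDS, lookup) for r in rows]
```

In this sketch, combinations missing from the lookup pass through unchanged but still receive `_orig` copies, so every output row has the same set of columns.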

References

Workflow configuration file: https://github.com/kurator-org/kurator-validation/blob/master/packages/kurator_dwca/workflows/dwca_geography_cleaner.yaml

Field Value Count Report explanation: https://github.com/kurator-org/kurator-validation/wiki/Field-Value-Count-Report

Darwin Core Controlled Value lookup files: https://github.com/kurator-org/kurator-validation/tree/master/packages/kurator_dwca/data/vocabularies

Geography Recommendation Report: https://github.com/kurator-org/kurator-validation/wiki/Geography-Recommendation-Report
