Darwin Core Archive Geography Cleaner
John Wieczorek edited this page Oct 7, 2016 · 5 revisions
This workflow:
- creates a given directory as a workspace
- downloads a Darwin Core Archive from a given URL
- downloads a geography lookup file from https://github.com/kurator-org/kurator-validation/tree/master/packages/kurator_dwca/data/vocabularies
- downloads a country lookup file from https://github.com/kurator-org/kurator-validation/tree/master/packages/kurator_dwca/data/vocabularies
- extracts the core file of the Darwin Core Archive to a tab-separated text file
- creates a report of counts of distinct values of the combination of higher geography fields
- creates a report of counts of distinct values of the country field
- creates a report of recommended values for geography
- creates a report of geography combinations not found in the geography lookup file
- creates a report of country values not found in the country lookup file
- creates a new occurrences file with the standardized geography incorporated and the original geography saved in new fields
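
The two count reports can be illustrated with a minimal sketch. It is not the workflow's own code: the function name `count_field_combinations`, the sample data, and the list of higher geography fields are assumptions made for illustration (the field names are Darwin Core terms, but the set the workflow actually combines may differ).

```python
import csv
import io
from collections import Counter

# Assumed field list; the workflow's actual higher geography fields may differ.
GEOGRAPHY_FIELDS = ["continent", "country", "countryCode", "stateProvince", "county"]


def count_field_combinations(tsv_text, fields):
    """Count distinct combinations of the given fields in tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        # Missing or empty fields are counted as empty strings.
        key = tuple((row.get(f) or "").strip() for f in fields)
        counts[key] += 1
    return counts


# Hypothetical extract of a core file, for illustration only.
sample = (
    "occurrenceID\tcontinent\tcountry\tcountryCode\tstateProvince\tcounty\n"
    "1\tNorth America\tUnited States\tUS\tCalifornia\tAlameda\n"
    "2\tNorth America\tUnited States\tUS\tCalifornia\tAlameda\n"
    "3\t\tUSA\t\tCalif.\t\n"
)

country_counts = count_field_combinations(sample, ["country"])
geography_counts = count_field_combinations(sample, GEOGRAPHY_FIELDS)

# List each distinct geography combination with its count, most frequent first.
for combo, n in geography_counts.most_common():
    print("|".join(combo), n)
```

Counting on the full tuple of fields, rather than on each field separately, is what lets a report distinguish "California" in one country from the same string paired with a different or missing country.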
The files produced by this workflow are:
- count_country.csv - the distinct values of the country field and the number of times they appeared in the extracted core file.
- count_geography.csv - the distinct combinations of values of the higher geography fields and the number of times they appeared in the extracted core file.
- dwca_extracted_occurrences.txt - the core file of the downloaded Darwin Core Archive as a tab-separated text file
- dwca_extracted_occurrences_geography_standardized.txt - a copy of the file dwca_extracted_occurrences.txt with higher geography fields replaced by standard values from lookup_geography.txt and with the original higher geography values copied to new fields whose names have '_orig' appended.
- dwca.zip - the Darwin Core Archive file downloaded from the given URL
- lookup_country.txt - downloaded copy of the country lookup file
- lookup_geography.txt - downloaded copy of the geography lookup file
- new_country.csv - file containing the country values not found in the country lookup file.
- new_geography.csv - file containing the distinct combinations of higher geography not found in the geography lookup file.
- recommended_geography.csv - file containing the recommendations to standardize distinct combinations of higher geography.
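
The relationship between the standardized occurrences file, the '_orig' fields, and new_geography.csv can be sketched as follows. This is an illustration under stated assumptions, not the workflow's implementation: the function `standardize`, the two-field `FIELDS` list, and the in-memory `lookup` dictionary are hypothetical, and the real lookup file format and matching rules are not described on this page.

```python
# Assumed subset of higher geography fields, for illustration only.
FIELDS = ["country", "stateProvince"]


def standardize(rows, lookup, fields=FIELDS):
    """Return (standardized rows, sorted combinations not found in lookup).

    lookup is a hypothetical dict mapping a tuple of original values
    to a tuple of standardized values, in the same field order.
    """
    standardized = []
    unmatched = set()
    for row in rows:
        out = dict(row)
        key = tuple(row.get(f, "") for f in fields)
        # Preserve the original values in fields with '_orig' appended.
        for f in fields:
            out[f + "_orig"] = row.get(f, "")
        if key in lookup:
            # Replace the values with the standardized combination.
            for f, value in zip(fields, lookup[key]):
                out[f] = value
        else:
            # Combination absent from the lookup: report it as new.
            unmatched.add(key)
        standardized.append(out)
    return standardized, sorted(unmatched)


# Hypothetical data, for illustration only.
rows = [
    {"occurrenceID": "1", "country": "USA", "stateProvince": "Calif."},
    {"occurrenceID": "2", "country": "Atlantis", "stateProvince": ""},
]
lookup = {("USA", "Calif."): ("United States", "California")}

cleaned, new_geography = standardize(rows, lookup)
```

In this sketch, rows whose combination is not in the lookup keep their values unchanged and still receive the '_orig' fields, so every row in the output has the same set of columns.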
Workflow configuration file: https://github.com/kurator-org/kurator-validation/blob/master/packages/kurator_dwca/workflows/dwca_geography_cleaner.yaml
Field Value Count Report explanation: https://github.com/kurator-org/kurator-validation/wiki/Field-Value-Count-Report
Darwin Core Controlled Value lookup files: https://github.com/kurator-org/kurator-validation/tree/master/packages/kurator_dwca/data/vocabularies
Geography Recommendation Report: https://github.com/kurator-org/kurator-validation/wiki/Geography-Recommendation-Report