1.2.2 Configuration files

The folder structure is as follows:

 MYPROJECT
   ├── config
   │   ├── analysis
   │   │   ├── dataset
   │   │   │   └── # --->'dataset configuration' yaml files
   │   │   ├── # --->'analysis configuration' yaml files
   │   ├── # ---> 'general configuration' yaml files
   └── data
       └── DATASET1_data
           ├── # --->  .csv files

*DATASET1 (here in upper case) represents an experiment; replace it with a meaningful name.

There are three types of configuration files, all of them in .yaml format:

dataset configuration: provides the quantifications and metadata file names that are themselves present in the data directory.
analysis configuration: indicates which analysis (e.g. "differential analysis") is to be run for which data file and with which parameters.
general configuration: configures the analysis to be run, both by pointing to the analysis configuration and by providing general information such as the output folder names.

How to organize the configuration files

The user must place the configuration files in the config folder, as follows:

the dataset configuration files, inside the config/analysis/dataset/ subfolder
the analysis configuration files, inside the config/analysis/ subfolder
the general configuration files, directly inside the config folder

Example of a structure of the config folder:

config
├── analysis
│   ├── dataset
│   │   ├── dataset1_data.yaml
│   │   └── dataset1_data_integrate.yaml  # <-- if suitable
│   ├── abundance_plot_dataset1.yaml
│   ├── differential_analysis_pairwise_dataset1.yaml
│   ├── enrichment_lineplot_dataset1.yaml
│   ├── isotopologues_plot_dataset1.yaml
│   ├── metabologram_abundance_dataset1.yaml
│   ├── pca_plot_dataset1.yaml
│   └── timecourse_analysis_dataset1.yaml
├── general_config_abundance_plot_dataset1.yaml
├── general_config_differential_analysis_dataset1.yaml
├── general_config_enrichment_lineplot_dataset1.yaml
├── general_config_isotopologues_plot_dataset1.yaml
├── general_config_metabologram_abundance_dataset1.yaml
├── general_config_pca_plot_dataset1.yaml
└── general_config_timecourse_analysis_dataset1.yaml

We encourage the user to download the examples from Zenodo (Zenodo links available at 1 Using DIMet in the command line) as these examples contain all our types of config files. The user only needs to apply minor modifications to the config files, guided by the present documentation, and run DIMet with ease!

The next sections explain each of these file types:

The dataset configuration

For each given dataset, a corresponding dataset configuration file must be created, which is located inside the config/analysis/dataset/ folder. This file describes the metadata and quantification files, and sets the ordering of the conditions.

Important

For every configuration file, all referenced files' names must be written without the extension.

The name of each configuration file must be meaningful; otherwise, your configurations might fail to run.

The dataset configuration file must contain the following parameters:

_target_: str, mandatory. Defines the object class, which must always be set to dimet.data.DatasetConfig. Do not modify this value.
label: str, mandatory. The name of the dataset, which must be coherent with the corresponding subfolder name.
subfolder: str, mandatory. The name of the folder that the user defined inside data/ and that contains the dataset to be analyzed (e.g. "experiment1_data").
name: str, mandatory. A short description of the data (e.g. "data from experiment testing response to doxorubicin").
metadata: str, mandatory. The name of the metadata file.
abundances: str, optional. The name of the file containing the metabolites' total abundances.
mean_enrichment: str, optional. The name of the file containing the metabolites' ¹³C (or other tracer) mean enrichment.
isotopologues: str, optional. The name of the file with the isotopologues' measures (per metabolite) in absolute values.
isotopologue_proportions: str, optional. The name of the file with the isotopologues' proportions (values in the interval [0,1]).
conditions: List, mandatory. A list of strings corresponding to the conditions, where the control is listed first. The control condition can also be named 'WT' (wild-type) or 'untreated', depending on your experimental setup. The ordering of this list is taken into account when running the visualizations.

At least one type of quantification file is required for running DIMet.

A template of a dataset configuration is shown below. The # <- comment indicates the parameters that the user must fill:

_target_: dimet.data.DatasetConfig

name:  # <- short description of your dataset, fill after the colon
label:  # <- name of your dataset, fill after the colon
subfolder:  DATASET1_data  # <- subfolder name in the data folder, change after the colon

# ALWAYS WITHIN THE data/dataset_data SUBFOLDER
metadata:    # <- file name, fill after the colon
abundances:    # <- file name, fill after the colon
mean_enrichment:    # <- file name, fill after the colon
isotopologue_proportions:    # <- file name, fill after the colon
isotopologues:    # <- file name, fill after the colon

conditions :
  - cond1 # <- first must be control, replace by your condition
  - cond2 # <- replace by your condition
  # the rest of the conditions must be vertically listed

This dataset configuration will be used by all the types of analysis except the omics integration, see the section Special case of dataset configuration: the integration dataset configuration for that purpose.
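For illustration, a filled-in dataset configuration might look like the sketch below. The label, the file names (metadata_dataset1, abundances_dataset1, mean_enrichment_dataset1) and the condition names are hypothetical placeholders, not files shipped with DIMet; the description is the example given in the parameter list. Note that the file names carry no extension, and that only the available quantification types are declared.

```yaml
_target_: dimet.data.DatasetConfig

label: dataset1            # hypothetical; kept coherent with the subfolder name
name: data from experiment testing response to doxorubicin
subfolder: dataset1_data   # hypothetical folder, expected at data/dataset1_data

# Hypothetical file names, written without the .csv extension
metadata: metadata_dataset1
abundances: abundances_dataset1
mean_enrichment: mean_enrichment_dataset1

conditions:
  - Control     # the control is listed first
  - Treated1    # hypothetical condition name
```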

Special case of dataset configuration: the integration dataset configuration (for omics integration)

When performing the omics integration, an integration dataset configuration file must be created that points to your dataset_data subfolder. This .yaml file must have the following parameters:

_target_: str, mandatory. Defines the object class and must always be set to dimet.data.DataIntegrationConfig. Do not modify this value.
label, name, subfolder, conditions, metadata: must be defined exactly as explained in the section The dataset configuration above.
abundances: str, mandatory. The name of the file containing the metabolites' total abundances.
mean_enrichment: str, optional. The name of the file containing the metabolites' ¹³C (or other isotope) mean enrichment.
transcripts: List, mandatory. A list of strings corresponding to the names of the files that contain the differential expression results. The content of the files is explained in the section Data files, subsection Data files for the omics integration. The ordering of the list must be coherent with the order of the comparisons to be defined in the analysis configuration .yaml file (see the following section).
pathways: Dict, mandatory. Its two keys specify the names of the two files with the pathways information. For the specific format of the pathways files, see the section Data files, subsection Data files for the omics integration. Each file name must be written under its respective key:
- metabolites: str, mandatory. The file name of the pathways and metabolites' identifiers correspondences.
- transcripts: str, mandatory. The file name of the pathways and gene symbols correspondences.

Neither the isotopologues nor the isotopologues' proportions are accepted in the integration dataset configuration.

A template of an integration dataset configuration .yaml file is shown below. The # <- comment indicates the parameters that the user must fill:

_target_: dimet.data.DataIntegrationConfig  

label: integrate_DATASET1   # <- change after the colon
name:     # <- short description of your dataset
subfolder: DATASET1_data  # <- subfolder name in the data folder, change after the colon

conditions :
 - cond1 # <- first must be control, replace by yours
 - cond2 # <- replace by yours

# ALWAYS WITHIN THE data/dataset_data SUBFOLDER
metadata:    # <- file name, fill after the colon
abundances:    # <- file name, fill after the colon
mean_enrichment:    # <- file name, fill after the colon

# WITHIN THE data/dataset_data SUBFOLDER
transcripts:  
  - myDEG_1   # <- file name, replace by yours
  - myDEG_2  # <- other file name (if any), replace by yours

pathways: 
  metabolites:   # <- file name, fill after the colon
  transcripts:    # <- file name, fill after the colon
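As a sketch of how the template could be completed, the example below fills every field. The metadata, abundances and pathways file names and the description are hypothetical placeholders; myDEG_1 and myDEG_2 are the placeholder names used in the template. The comments on the transcripts list restate the ordering rule described in the parameter list.

```yaml
_target_: dimet.data.DataIntegrationConfig

label: integrate_dataset1
name: transcriptome and metabolome from the same experiment   # hypothetical description
subfolder: dataset1_data   # hypothetical folder, expected at data/dataset1_data

conditions:
  - cond1   # control first
  - cond2

# Hypothetical file names, written without extension
metadata: metadata_dataset1
abundances: abundances_dataset1

transcripts:
  - myDEG_1   # differential expression results for the 1st comparison
  - myDEG_2   # differential expression results for the 2nd comparison

pathways:
  metabolites: pathways_metabolites   # hypothetical file name
  transcripts: pathways_transcripts   # hypothetical file name
```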

The analysis configuration

The analysis configuration is located inside the config/analysis/ folder. One analysis configuration file must be created for each analysis to be performed. It indicates the type of analysis to run, the dataset to which it applies, and the parameters that are specific to that analysis.

Recall

For every configuration file, all referenced files' names must be written without the extension.

The name of each configuration file must be meaningful; otherwise, your configurations might fail to run.

We explain below each type of analysis configuration file:

The configuration for the Principal Component Analysis (PCA)

The method pca_analysis automatically processes the total metabolite abundances, or the mean enrichment, or both (abundances and/or mean enrichment being defined in the dataset configuration).

The configuration file for the pca_analysis must contain the following parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be pca_analysis, do not change this value.

A template of a pca_analysis configuration file is shown below; the # <- comment indicates the parameters that the user must fill:

label: pca-table-my-data-set  # <- change after the colon

defaults:
  - dataset:  # <-  name of the dataset configuration file; fill after the colon
  - method: pca_analysis

The pca_analysis computes the tables (with the eigenvalues and the explained variances) only. For the visualization (PCA figures) see The configuration for the pca plot.

The configuration for the pairwise differential analysis

The pairwise differential analysis compares 2 groups and accepts all types of quantifications (abundances and/or mean enrichment and/or isotopologues and/or isotopologues' proportions).

The configuration file for the pairwise differential analysis must contain the following parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be differential_analysis, do not change this value.
comparisons: List, mandatory. A list of items where each item is a nested list representing a comparison. Each item therefore defines two groups: [[condition, timepoint], [condition_r, timepoint_r]], where the second group ([condition_r, timepoint_r]) is the "reference", so the first group is compared against that reference.
statistical_test: Dict, mandatory. Specific statistical test to apply to each type of quantification, by setting the key-value pairs as follows:
- abundances: str, optional. Test to apply to the metabolites' total abundances.
- mean_enrichment: str, optional. Test to apply to mean enrichment.
- isotopologues: str, optional. Test to apply to isotopologues' absolute values.
- isotopologue_proportions: str, optional. Test to apply to isotopologue proportions.

The statistical tests currently supported are shown in the section 2 Statistical tests.

A template of a pairwise differential analysis configuration is shown below; the # <- comment indicates the parameters that the user must fill:

label: differential-analysis-DATASET1   # <- change after the colon

defaults:
 - dataset:      # <- name of the dataset configuration file; fill after the colon
 - method: differential_analysis  

comparisons :
  -  [[<condition>, <timepoint>], [<condition_r>, <timepoint_r>]] # <-  interest vs reference
  # Vertically list the rest of the comparisons.     
 
statistical_test:
  abundances:    # <- see statistic test options, fill after the colon
  mean_enrichment:    # <- see statistic options, fill after the colon
  isotopologues:     # <- see statistic options, fill after the colon
  isotopologue_proportions:     # <- see statistic options, fill after the colon
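A filled-in version may make the comparison syntax easier to read. The sketch below assumes a dataset configuration file named dataset1_data.yaml, hypothetical conditions cond1 (control) and cond2, and hypothetical timepoints T24 and T48. The test value disfit is the one mentioned in this documentation; whether it suits a given quantification type should be checked in the section 2 Statistical tests. Quantification types that are not analyzed are simply left out.

```yaml
label: differential-analysis-dataset1   # hypothetical label

defaults:
  - dataset: dataset1_data      # dataset configuration file name, no extension
  - method: differential_analysis

comparisons:
  - [[cond2, T24], [cond1, T24]]   # cond2 at T24 compared against the reference cond1 at T24
  - [[cond2, T48], [cond1, T48]]   # same conditions, later timepoint

statistical_test:
  abundances: disfit
  mean_enrichment: disfit
```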

Correction method for multiple tests

The method for correction of multiple tests is set by default as fdr_bh (Benjamini-Hochberg).
Note that if the statistical test is disfit (fitting of a distribution to the z-scores), correcting for multiple tests is not meaningful, so no correction is applied.

DIMet relies on the correction methods from statsmodels (https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html); the command-line version of DIMet uses the same abbreviations. The default can be modified by doing a local install and then editing the field correction_method in the file src/dimet/config/analysis/method/differential_analysis.yaml.
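As a minimal sketch of such a local edit, the fragment below shows only the correction_method field, set here to fdr_by (Benjamini-Yekutieli), which is one of the method abbreviations accepted by statsmodels' multipletests. The position of this field within the file and the other fields of the file are not shown and are assumed to stay untouched.

```yaml
# Fragment of src/dimet/config/analysis/method/differential_analysis.yaml (local install)
# Only this field is changed; the default value is fdr_bh.
correction_method: fdr_by
```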

The configuration for the multi-group comparison analysis

The multi_group_comparison analysis performs the Kruskal-Wallis test to compare 3 or more groups. The configuration file for the multi_group_comparison must set the following parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be multi_group_comparison, do not change this value.
conditions: List, mandatory. The groups to be included in the analysis, defined as a list of items where each item is a group. A group is defined as a pair [<condition>, <timepoint>].
datatypes: List, mandatory. A list of the types of quantification, the user can set abundances, mean_enrichment, isotopologues and isotopologue_proportions. The list must contain at least one of these types.

A template is shown below; the # <- comment indicates the parameters that the user must fill:

label: multi-group-comparison-my-dataset # <- replace after the colon

defaults:
  - dataset:   # <- name of the dataset configuration file; fill after the colon
  - method: multi_group_comparison

conditions:
  - [Control, T0h]  # <- replace

datatypes: [abundances]

For the correction for multiple tests see the pairwise differential_analysis subsection above.
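Since the Kruskal-Wallis test compares 3 or more groups, a complete configuration lists at least three [condition, timepoint] pairs. The sketch below assumes a dataset configuration file named dataset1_data.yaml and hypothetical condition names Treated1 and Treated2; Control and T0h are taken from the template.

```yaml
label: multi-group-comparison-dataset1   # hypothetical label

defaults:
  - dataset: dataset1_data      # dataset configuration file name, no extension
  - method: multi_group_comparison

conditions:
  - [Control, T0h]
  - [Treated1, T0h]   # hypothetical condition
  - [Treated2, T0h]   # hypothetical condition

datatypes: [abundances, mean_enrichment]   # at least one type is required
```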

The configuration for the time-course analysis

The time_course_analysis performs the statistical comparison of consecutive timepoints. The configuration file must define the following parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be time_course_analysis, do not change this value.
statistical_test: Dict, mandatory. Specific statistical test to apply to each type of quantification, by setting the key-value pairs as follows:
- abundances: str, optional. Test to apply to the metabolites' total abundances.
- mean_enrichment: str, optional. Test to apply to mean enrichment.
- isotopologues: str, optional. Test to apply to isotopologues' absolute values.
- isotopologue_proportions: str, optional. Test to apply to isotopologue proportions.
For the available statistical tests, and correction method, see the pairwise differential_analysis subsection above.

A template is shown below; the # <- comment indicates the parameters that the user must fill:

label: time-course-my-data-set   # <- replace after the colon

defaults:
  - dataset:  # <- name of the dataset configuration file; fill after the colon
  - method: time_course_analysis
 
statistical_test:
  isotopologue_proportions:  # <- fill after the colon

The configuration for the bi-variate analysis

The bivariate_analysis performs MDV profiles comparison and metabolites time-course profiles comparison, using the correlation test chosen by the user. DIMet offers both Spearman and Pearson correlation tests. Three types of comparisons are performed automatically: (i) MDV profile between conditions, (ii) MDV profile between time-points, and (iii) metabolite (total abundances and/or mean enrichment) time course profiles between conditions. For the first two types of comparisons, MDV (Mass Distribution Vector) arrays are extracted automatically from the isotopologue proportions, following the MDV definition given here.

The configuration file must define the following parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be bivariate_analysis, do not change this value.
conditions: List, mandatory. The list of conditions to include in the analysis.
statistical_test: str, optional. The name of the correlation test, that must be spearman or pearson, all lowercase. If this parameter is absent (i.e. this entire line is omitted) in the config file, the Spearman test is performed by default.

A template is shown below; the # <- comment indicates the parameters that the user must fill:

label: bi-variate-analysis-mydataset-n   # <- replace after the colon

defaults:
  - dataset:  # <- name of the dataset configuration file; fill after the colon
  - method: bivariate_analysis 
 
conditions:   
  - Control # <- list vertically the conditions that will enter into the analysis
  - Treated1

statistical_test: spearman   # <- replace after the colon with pearson, if desired

Notes: Time-points are detected automatically. Consecutive time-points are compared for the MDV profile bi-variate analysis.

The configuration for the pca plot (visualization)

The method pca_plot automatically processes the total metabolite abundances, or the mean enrichment, or both (abundances and/or mean enrichment being defined in the dataset configuration).

Parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be pca_plot, do not change this value.

A template is shown below; the # <- comment indicates the parameters that the user must fill:

label: pca-plot-data-n  # <- change after the colon

defaults:
  - dataset: LDHAB-Control_data  # <- change after the colon
  - method: pca_plot

The configuration for the abundance bars (visualization)

Parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be abundance_plot, do not change this value.
timepoints: List, mandatory. The list of the time-points to be included across all the figures.
width_each_subfig: float, mandatory. The width of each independent figure.

The abundance_plot method strictly requires the definition of the abundances file in the dataset configuration.

A template is shown below; the # <- comment indicates the parameters that the user must fill:

label: abundance-bars-for-dataset-n  # <- change after the colon

defaults:
  - dataset: LDHAB-Control_data  # <- change after the colon
  - method: abundance_plot

timepoints:
  - T48  # <- change; provide your time points vertically listed

width_each_subfig: !!float 3.7  # <- modify the number only

It generates one figure per metabolite in .svg format.

The configuration for the isotopologues bars (visualization)

Parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be isotopologue_proportions_plot, do not change this value.
timepoints: List, mandatory. The list of the time-points to be included across all the figures.
width_each_stack: float, mandatory. The width of each stacked bar.
inner_numbers_size: float, mandatory. The font size of the numbers that appear inside each bar segment; each number corresponds to the average proportion computed over the biological replicates for each isotopologue.

A template is shown below; the # <- comment indicates the parameters that the user must fill:

label: isotopologues-stacked-for-n-data  # <- change after the colon

defaults:
  - dataset: # <- fill after the colon
  - method: isotopologue_proportions_plot  

timepoints:
  - T48  # <- change; provide your time points vertically listed

width_each_stack: !!float 1.6  # <- modify the number only
inner_numbers_size : !!float 16  # <- modify the number only

The isotopologue_proportions_plot method strictly requires the definition of the isotopologue_proportions file in the dataset configuration. It generates one figure per metabolite in .svg format.

The configuration for the enrichment line-plot (visualization)

Parameters:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be mean_enrichment_line_plot, do not change this value.
width_subplot: float, mandatory. The width of each independent figure.

The mean_enrichment_line_plot method strictly requires the definition of the mean_enrichment file in the dataset configuration. It automatically includes all timepoints, as the line-plot is a popular visualization choice for the follow-up of the mean enrichment across time.

A template is shown below:

label:  # <- fill after the colon

defaults:
  - dataset: # <- fill after the colon
  - method: mean_enrichment_line_plot

width_subplot : !!float 3.1  # <- change the number

The configuration for the omics integration (Metabologram)

DIMet integrates the differential omics (transcripts and metabolites), producing Metabologram(s). Some parameters are similar to those of the pairwise differential analysis, but the settings are not identical:

label: str, mandatory. The name of the analysis and the name of the user dataset.
defaults: Dict, mandatory. It has two sub-keys:
- dataset: str, mandatory. The name of the dataset configuration file.
- method: str, mandatory. The value must be metabologram_integration, do not change this value.
comparisons: List, mandatory. A list of items where each item is a nested list representing a comparison. Each item therefore defines two groups: [[condition, timepoint], [condition_r, timepoint_r]], where the second group ([condition_r, timepoint_r]) is the "reference", so the first group is compared against that reference.

In this configuration for the omics integration, the order of the comparisons must be coherent with the order of the differential expression files written in the integration dataset configuration.

statistical_test: Dict, mandatory. Specific test to apply to each type of quantification. One single key, abundances or mean_enrichment, must be set, but not both in the same .yaml file. The value must be one of the currently supported tests (see sub-section The configuration for the pairwise differential analysis).
columns_metabolites: Dict, mandatory. The specific column names of the differential metabolome file(s) to define ID and values:
- ID: is set as "metabolite", do not change this value.
- values: must be set as log2FC or FC depending on the user preferences.
columns_transcripts: Dict, mandatory. The specific column names of the differential transcriptome file(s) to define ID and values:
- ID: is the column with the gene symbols.
- values: must be set depending on how the files were generated (e.g. log2FoldChange if generated by DESeq2).
compartment: Dict, mandatory. Only one compartment name as unique key is accepted.

Template for the configuration for the omics integration:

label: metabologram-using-abundance-DATASET1  # <- change after the colon

defaults:
 - dataset:   # <- integration dataset config, fill after the colon
 - method: metabologram_integration

comparisons :
 - [[cond2, T24], [cond1, T24]]   # <-  see documentation, replace

# running for total abundances
statistical_test:
 abundances:    # <-  can be abundances OR mean_enrichment, fill after the colon

columns_metabolites:
 ID : metabolite
 values :    # <- log2FC or FC, fill after the colon
 
columns_transcripts:
 ID:   # <- the gene symbols column name, fill after the colon
 values:   # <- the numeric column name,  fill after the colon
 
compartment:
 en

If the user needs to run two integrations, one using differential abundances, and another using the differential mean enrichment, then two separate analysis configuration files must be created.
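The sketch below shows one way the template could be completed for an integration based on total abundances. It assumes an integration dataset configuration file named dataset1_data_integrate.yaml whose transcripts list contains two files, in this order: myDEG_1, then myDEG_2. The conditions, timepoints and the gene_symbol column name are hypothetical; log2FoldChange is the DESeq2 column mentioned above, and disfit is the test value mentioned in this documentation.

```yaml
label: metabologram-using-abundance-dataset1   # hypothetical label

defaults:
  - dataset: dataset1_data_integrate   # integration dataset configuration, no extension
  - method: metabologram_integration

comparisons:
  - [[cond2, T24], [cond1, T24]]   # 1st comparison, paired with the 1st transcripts file (myDEG_1)
  - [[cond2, T48], [cond1, T48]]   # 2nd comparison, paired with the 2nd transcripts file (myDEG_2)

statistical_test:
  abundances: disfit   # a single key: abundances OR mean_enrichment

columns_metabolites:
  ID: metabolite
  values: log2FC

columns_transcripts:
  ID: gene_symbol          # hypothetical column name holding the gene symbols
  values: log2FoldChange   # as produced by DESeq2

compartment:
  en
```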

The general configuration

The general configuration file determines the working directory, the analysis configuration to be run and the output folders. It is located directly inside the config folder.

The name of the general configuration file must be meaningful and unambiguously similar to the analysis configuration one. Using a name with the convention general_configuration_<analysis config file>.yaml is highly recommended.

Parameters:

hydra: Dict, mandatory. Dictionary that defines the directories for the analysis. It contains two dictionaries:
- job: its key chdir (bool) controls the working directory; if true, the working directory is changed to the output directory of the run. Do not change this value.
- run: its key dir (str) defines the output directory, following the convention outputs/DATE/HOUR/ plus the dataset label and the method label joined by a hyphen. It is generated automatically by declaring outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${analysis.dataset.label}-${analysis.method.label}. Do not change this value.
defaults: List, mandatory. A list of key-value pairs, only one must be set currently:
- analysis: str, mandatory. The file name of the analysis configuration file.
figure_path: str, mandatory. The name of the figures output folder; "figures" is recommended.
table_path: str, mandatory. The name of the tables output folder; "tables" is recommended.

A template is shown below; the # <- comment indicates the only part that the user must fill:

hydra:
  job:
    chdir: true
  run:
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${analysis.dataset.label}-${analysis.method.label}

defaults:
  - analysis:   # <-  the analysis configuration file name, fill after the colon

figure_path: figures
table_path: tables

DIMet, via Hydra, will generate the full output folders and files names automatically. Log files with information about the run (and errors if any) are also automatically saved to .log files in the output folders.
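To make the chain of references concrete, the sketch below shows a general configuration and the analysis configuration it points to, written as two YAML documents separated by ---; in practice they are two separate files. The file names are those of the example config folder shown earlier, the label is hypothetical, and the timepoint and width values are copied from the abundance bars template.

```yaml
# File: config/general_config_abundance_plot_dataset1.yaml
hydra:
  job:
    chdir: true
  run:
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${analysis.dataset.label}-${analysis.method.label}

defaults:
  - analysis: abundance_plot_dataset1   # -> config/analysis/abundance_plot_dataset1.yaml

figure_path: figures
table_path: tables
---
# File: config/analysis/abundance_plot_dataset1.yaml
label: abundance-bars-dataset1   # hypothetical label

defaults:
  - dataset: dataset1_data   # -> config/analysis/dataset/dataset1_data.yaml
  - method: abundance_plot

timepoints:
  - T48

width_each_subfig: !!float 3.7
```

Each reference is a file name without extension, resolved one folder level deeper than the file that contains it: the general configuration names the analysis configuration, which in turn names the dataset configuration.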
