1.2.2 Configuration files
As shown in the folder structure:
MYPROJECT
├── config
│   ├── analysis
│   │   ├── dataset
│   │   │   └── # ---> 'dataset configuration' yaml files
│   │   └── # ---> 'analysis configuration' yaml files
│   └── # ---> 'general configuration' yaml files
└── data
    └── DATASET1_data
        └── # ---> .csv files
DATASET1 (here in upper case) represents an experiment; replace it with a meaningful name.
There are three types of configuration files, all of them in .yaml format:
- dataset configuration: provides the names of the quantification and metadata files, which are themselves located in the data directory.
- analysis configuration: indicates which analysis (e.g. "differential analysis") is to be run on which data file and with which parameters.
- general configuration: configures the run by pointing to the analysis configuration and by providing general information such as the output folder names.
The user must place the configuration files in the config folder, as follows:
- the dataset configuration files, inside the config/analysis/dataset/ subfolder
- the analysis configuration files, inside the config/analysis/ subfolder
- the general configuration files, directly inside the config folder
Example of a structure of the config folder:
config
├── analysis
│   ├── dataset
│   │   ├── dataset1_data.yaml
│   │   └── dataset1_data_integrate.yaml  # <-- if suitable
│   ├── abundance_plot_dataset1.yaml
│   ├── differential_analysis_pairwise_dataset1.yaml
│   ├── enrichment_lineplot_dataset1.yaml
│   ├── isotopologues_plot_dataset1.yaml
│   ├── metabologram_abundance_dataset1.yaml
│   ├── pca_plot_dataset1.yaml
│   └── timecourse_analysis_dataset1.yaml
├── general_config_abundance_plot_dataset1.yaml
├── general_config_differential_analysis_dataset1.yaml
├── general_config_enrichment_lineplot_dataset1.yaml
├── general_config_isotopologues_plot_dataset1.yaml
├── general_config_metabologram_abundance_dataset1.yaml
├── general_config_pca_plot_dataset1.yaml
└── general_config_timecourse_analysis_dataset1.yaml
We encourage the user to download the examples from Zenodo (Zenodo links available at 1 Using DIMet in the command line) as these examples contain all our types of config files. The user only needs to apply minor modifications to the config files, guided by the present documentation, and run DIMet with ease!
The next sections explain each of these file types:
The dataset configuration
For each given dataset, a corresponding dataset configuration file must be created, which is located inside the config/analysis/dataset/
folder. This file describes the metadata and quantification files, and sets the ordering of the conditions.
Important
- For every configuration file, all referenced files' names must be written without the extension.
- The name of each configuration file must be meaningful; otherwise, your configurations might fail to run.
The dataset configuration file must contain the following parameters:
- _target_: str, mandatory. Defines the object class, which must always be set as dimet.data.DatasetConfig. Do not modify this value.
- label: str, mandatory. The name of the dataset, which must be coherent with the corresponding subfolder name.
- subfolder: str, mandatory. The name of the folder that the user defined inside data/ and that contains the dataset to be analyzed (e.g. "experiment1_data").
- name: str, mandatory. A short description of the data (e.g. "data from experiment testing response to doxorubicin").
- metadata: str, mandatory. The name of the metadata file.
- abundances: str, optional. The name of the file containing the metabolites' total abundances.
- mean_enrichment: str, optional. The name of the file containing the metabolites' 13C (or other tracer) mean enrichment.
- isotopologues: str, optional. The name of the file with the isotopologues' measures (per metabolite) in absolute values.
- isotopologue_proportions: str, optional. The name of the file with the isotopologues' proportions (values in the interval [0,1]).
- conditions: List, mandatory. A list of strings corresponding to the conditions, with the control listed first. Depending on your experimental setup, the control condition may also be named 'WT' (wild-type) or 'untreated'. The ordering of this list is taken into account when running the visualizations.
At least one type of quantification file is required for running DIMet.
A template of a dataset configuration is shown below. The # <- comment indicates the parameters that the user must fill:
_target_: dimet.data.DatasetConfig
name: # <- short description of your dataset, fill after the colon
label: # <- name of your dataset, fill after the colon
subfolder: DATASET1_data # <- subfolder name in the data folder, change after the colon
# ALWAYS WITHIN THE data/dataset_data SUBFOLDER
metadata: # <- file name, fill after the colon
abundances: # <- file name, fill after the colon
mean_enrichment: # <- file name, fill after the colon
isotopologue_proportions: # <- file name, fill after the colon
isotopologues: # <- file name, fill after the colon
conditions:
- cond1 # <- first must be control, replace by your condition
- cond2 # <- replace by your condition
# the rest of the conditions must be vertically listed
This dataset configuration is used by all types of analysis except the omics integration; for that case, see the section Special case of dataset configuration: the integration dataset configuration.
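As an illustration of the standard dataset configuration, a filled-in version might look like the sketch below. Every value is hypothetical (the label, the subfolder and all file names are placeholders, not names shipped with DIMet); only the keys and the `_target_` value come from the parameter list above. The example assumes the data files are .csv files stored in data/dataset1_data/ and, as required, writes their names without the extension.

```yaml
# Hypothetical file: config/analysis/dataset/dataset1_data.yaml
_target_: dimet.data.DatasetConfig        # fixed value, do not modify
label: dataset1                           # hypothetical; chosen to match the subfolder name
name: "data from experiment testing response to doxorubicin"
subfolder: dataset1_data                  # hypothetical folder, i.e. data/dataset1_data/
metadata: metadata_dataset1               # hypothetical; stands for metadata_dataset1.csv
abundances: abundances_dataset1           # hypothetical; stands for abundances_dataset1.csv
mean_enrichment: mean_enrichment_dataset1 # hypothetical file name
isotopologue_proportions: isotopologue_proportions_dataset1  # hypothetical file name
conditions:
- Control                                 # the control condition is listed first
- Treated1
```

The isotopologues key is left out here on the assumption that optional quantification keys can simply be omitted when the corresponding file is not available.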
Special case of dataset configuration: the integration dataset configuration (for omics integration)
When performing the omics integration, an integration dataset configuration file must be created that points to your dataset_data subfolder. This .yaml file must have the following parameters:
- _target_: str, mandatory. Defines the object class and must always be set as dimet.data.DataIntegrationConfig. Do not modify this value.
- label, name, subfolder, conditions, metadata: must be defined exactly as explained in the section The dataset configuration above.
- abundances: str, mandatory. The name of the file containing the metabolites' total abundances.
- mean_enrichment: str, optional. The name of the file containing the metabolites' 13C (or other isotope) mean enrichment.
- transcripts: List, mandatory. A list of strings corresponding to the names of the files that contain the differential expression results. The content of the files is explained in the section Data files, subsection Data files for the omics integration. The ordering of the list must be coherent with the order of the comparisons to be defined in the analysis configuration .yaml file (see the following section).
- pathways: Dict, mandatory. Two keys specify the names of the two files with the pathways information. For the specific format of the pathways' files, see the section Data files, subsection Data files for the omics integration. Each file name must be written under its respective key:
  - metabolites: str, mandatory. The name of the file with the correspondences between pathways and metabolites' identifiers.
  - transcripts: str, mandatory. The name of the file with the correspondences between pathways and gene symbols.
Neither the isotopologues nor the isotopologues' proportions are accepted in the integration dataset configuration.
A template of an integration dataset configuration .yaml file is shown below. The # <- comment indicates the parameters that the user must fill:
_target_: dimet.data.DataIntegrationConfig
label: integrate_DATASET1 # <- change after the colon
name: # <- short description of your dataset
subfolder: DATASET1_data # <- subfolder name in the data folder, change after the colon
conditions:
- cond1 # <- first must be control, replace by yours
- cond2 # <- replace by yours
# ALWAYS WITHIN THE data/dataset_data SUBFOLDER
metadata: # <- file name, fill after the colon
abundances: # <- file name, fill after the colon
mean_enrichment: # <- file name, fill after the colon
# WITHIN THE data/dataset_data SUBFOLDER
transcripts:
- myDEG_1 # <- file name, replace by yours
- myDEG_2 # <- other file name (if any), replace by yours
pathways:
  metabolites: # <- file name, fill after the colon
  transcripts: # <- file name, fill after the colon
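For illustration, a filled-in integration dataset configuration could look like the following sketch. All values are hypothetical placeholders (conditions, label and every file name are invented for the example); only the keys and the `_target_` value are taken from the parameter list above. Note that the nested transcripts key under pathways is distinct from the top-level transcripts list.

```yaml
# Hypothetical file: config/analysis/dataset/dataset1_data_integrate.yaml
_target_: dimet.data.DataIntegrationConfig   # fixed value, do not modify
label: integrate_dataset1                    # hypothetical label
name: "metabolome and transcriptome of experiment 1"   # hypothetical description
subfolder: dataset1_data                     # hypothetical folder, i.e. data/dataset1_data/
conditions:
- Control                                    # the control condition is listed first
- Treated1
metadata: metadata_dataset1                  # hypothetical file name, no extension
abundances: abundances_dataset1              # mandatory for the integration
transcripts:                                 # one differential expression file per comparison
- DEG_Treated1_T24_vs_Control_T24            # hypothetical; paired with the 1st comparison
- DEG_Treated1_T48_vs_Control_T48            # hypothetical; paired with the 2nd comparison
pathways:
  metabolites: pathways_metabolites          # hypothetical file name
  transcripts: pathways_transcripts          # hypothetical file name
```

Naming each differential expression file after the comparison it belongs to is only a suggestion; it makes it easier to keep the list in the same order as the comparisons of the analysis configuration.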
The analysis configuration
The analysis configuration is located inside the config/analysis/ folder.
For each analysis to be performed, one analysis configuration file must be created.
It indicates the type of analysis to run, the dataset on which the analysis is applied, and the parameters that are specific to that analysis.
Recall
- For every configuration file, all referenced files' names must be written without the extension.
- The name of each configuration file must be meaningful; otherwise, your configurations might fail to run.
We explain below each type of analysis configuration file:
The configuration for the Principal Component Analysis (PCA)
The method pca_analysis automatically processes the total metabolite abundances, or the mean enrichment, or both (abundances and/or mean enrichment being defined in the dataset configuration).
The configuration file for the pca_analysis must contain the following parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be pca_analysis; do not change this value.
A template of a pca_analysis configuration file is shown below; the # <- comment indicates the parameters that the user must fill:
label: pca-table-my-data-set # <- change after the colon
defaults:
- dataset: # <- name of the dataset configuration file; fill after the colon
- method: pca_analysis
The pca_analysis computes the tables (with the eigenvalues and the explained variances) only. For the visualization (PCA figures) see The configuration for the pca plot.
The configuration for the pairwise differential analysis
The pairwise differential analysis compares 2 groups and accepts all types of quantifications (abundances and/or mean enrichment and/or isotopologues and/or isotopologues' proportions).
The configuration file for the pairwise differential analysis must contain the following parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be differential_analysis; do not change this value.
- comparisons: List, mandatory. A list of items where each item is a nested list representing a comparison. Each item therefore defines two groups: [[condition, timepoint], [condition_r, timepoint_r]], where the second group ([condition_r, timepoint_r]) is the "reference", so the first group is compared against that reference.
- statistical_test: Dict, mandatory. The specific statistical test to apply to each type of quantification, set through the following key-value pairs:
  - abundances: str, optional. Test to apply to the metabolites' total abundances.
  - mean_enrichment: str, optional. Test to apply to mean enrichment.
  - isotopologues: str, optional. Test to apply to isotopologues' absolute values.
  - isotopologue_proportions: str, optional. Test to apply to isotopologue proportions.
The statistical tests currently supported are shown in the section 2 Statistical tests.
A template of a pairwise differential analysis configuration is shown below; the # <- comment indicates the parameters that the user must fill:
label: differential-analysis-DATASET1 # <- change after the colon
defaults:
- dataset: # <- name of the dataset configuration file; fill after the colon
- method: differential_analysis
comparisons:
- [[<condition>, <timepoint>], [<condition_r>, <timepoint_r>]] # <- interest vs reference
# Vertically list the rest of the comparisons.
statistical_test:
  abundances: # <- see statistic test options, fill after the colon
  mean_enrichment: # <- see statistic options, fill after the colon
  isotopologues: # <- see statistic options, fill after the colon
  isotopologue_proportions: # <- see statistic options, fill after the colon
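To make the comparisons syntax concrete, here is a sketch of a filled-in pairwise differential analysis configuration with two comparisons. The condition names, time-point names, label and dataset name are hypothetical. The test name disfit is one that appears in this documentation; the full list of valid test names is in the section 2 Statistical tests. The example also assumes that keys of statistical_test for quantification types that are not analyzed can be left out, since they are described as optional.

```yaml
# Hypothetical file: config/analysis/differential_analysis_pairwise_dataset1.yaml
label: differential-analysis-dataset1    # hypothetical label
defaults:
- dataset: dataset1_data                 # hypothetical dataset configuration file, no extension
- method: differential_analysis          # fixed value, do not change
comparisons:
# each item is [[condition, timepoint], [condition_r, timepoint_r]]
- [[Treated1, T24], [Control, T24]]      # Treated1 at T24 compared against the reference Control at T24
- [[Treated1, T48], [Control, T48]]      # Treated1 at T48 compared against the reference Control at T48
statistical_test:
  abundances: disfit                     # test applied to the total abundances
```

In each comparison the reference group comes second, so swapping the two inner lists inverts the direction of the comparison.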
Correction method for multiple tests
The method for correction of multiple tests is set by default to fdr_bh (Benjamini-Hochberg).
Note that if the statistical test is disfit (fitting of a distribution to the z-scores), correcting for multiple tests is not meaningful, so it is not applied.
DIMet relies on the correction methods from statsmodels (https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html): the abbreviations are written the same way in the command-line version of DIMet. The default parameter in DIMet can be modified by doing a local install and then editing the field correction_method in the file src/dimet/config/analysis/method/differential_analysis.yaml.
The configuration for the multi-group comparison analysis
The multi_group_comparison analysis performs the Kruskal-Wallis test to compare 3 or more groups. The configuration file for the multi_group_comparison must set the following parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be multi_group_comparison; do not change this value.
- conditions: List, mandatory. The groups to be included in the analysis, defined as a list of items where each item is a group. A group is defined as a pair [<condition>, <timepoint>].
- datatypes: List, mandatory. A list of the types of quantification; the user can set abundances, mean_enrichment, isotopologues and isotopologue_proportions. The list must contain at least one of these types.
A template is shown below; the # <- comment indicates the parameters that the user must fill:
label: multi-group-comparison-my-dataset # <- replace after the colon
defaults:
- dataset: # <- name of the dataset configuration file; fill after the colon
- method: multi_group_comparison
conditions:
- [Control, T0h] # <- replace
datatypes: [abundances]
For the correction for multiple tests see the pairwise differential_analysis subsection above.
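Since the template above lists a single group while the Kruskal-Wallis test compares 3 or more groups, a filled-in configuration may be easier to read. The sketch below uses hypothetical condition names, time-point names, label and dataset name; only the keys and the method value come from the parameter list above.

```yaml
# Hypothetical file: config/analysis/multi_group_comparison_dataset1.yaml
label: multi-group-comparison-dataset1   # hypothetical label
defaults:
- dataset: dataset1_data                 # hypothetical dataset configuration file, no extension
- method: multi_group_comparison         # fixed value, do not change
conditions:
# each group is a pair [condition, timepoint]; three groups are listed here
- [Control, T24]
- [Treated1, T24]
- [Treated2, T24]
datatypes: [abundances, mean_enrichment] # at least one quantification type is required
```

Here the three groups share one time-point and differ by condition; groups that differ by time-point are written the same way.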
The configuration for the time-course analysis
The time_course_analysis performs the statistical comparison of consecutive timepoints. The configuration file must define the following parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be time_course_analysis; do not change this value.
- statistical_test: Dict, mandatory. The specific statistical test to apply to each type of quantification, set through the following key-value pairs:
  - abundances: str, optional. Test to apply to the metabolites' total abundances.
  - mean_enrichment: str, optional. Test to apply to mean enrichment.
  - isotopologues: str, optional. Test to apply to isotopologues' absolute values.
  - isotopologue_proportions: str, optional. Test to apply to isotopologue proportions.
For the available statistical tests and the correction method, see the pairwise differential_analysis subsection above.
A template is shown below; the # <- comment indicates the parameters that the user must fill:
label: time-course-my-data-set # <- replace after the colon
defaults:
- dataset: # <- name of the dataset configuration file; fill after the colon
- method: time_course_analysis
statistical_test:
  isotopologue_proportions: # <- fill after the colon
The configuration for the bi-variate analysis
The bivariate_analysis performs MDV profiles comparison and metabolites time-course profiles comparison, using the correlation test chosen by the user. DIMet offers both Spearman and Pearson correlation tests. Three types of comparisons are performed automatically: (i) MDV profile between conditions, (ii) MDV profile between time-points, and (iii) metabolite (total abundances and/or mean enrichment) time course profiles between conditions. For the first two types of comparisons, MDV (Mass Distribution Vector) arrays are extracted automatically from the isotopologue proportions, following the MDV definition given here.
The configuration file must define the following parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be bivariate_analysis; do not change this value.
- conditions: List, mandatory. The list of conditions to be included in the analysis.
- statistical_test: str, optional. The name of the correlation test, which must be spearman or pearson, all lowercase. If this parameter is absent from the config file (i.e. the entire line is omitted), the Spearman test is performed by default.
A template is shown below; the # <- comment indicates the parameters that the user must fill:
label: bi-variate-analysis-mydataset-n # <- replace after the colon
defaults:
- dataset: # <- name of the dataset configuration file; fill after the colon
- method: bivariate_analysis
conditions:
- Control # <- list vertically the conditions that will enter into the analysis
- Treated1
statistical_test: spearman # <- replace after the colon with pearson, if desired
Notes: Time-points are detected automatically. Consecutive time-points are compared for the MDV profile bi-variate analysis.
The configuration for the pca plot (visualization)
The method pca_plot automatically processes the total metabolite abundances, or the mean enrichment, or both (abundances and/or mean enrichment being defined in the dataset configuration).
Parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be pca_plot; do not change this value.
A template is shown below; the # <- comment indicates the parameters that the user must fill:
label: pca-plot-data-n # <- change after the colon
defaults:
- dataset: LDHAB-Control_data # <- change after the colon
- method: pca_plot
The configuration for the abundance bars (visualization)
Parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be abundance_plot; do not change this value.
- timepoints: List, mandatory. The list of the time-points to be included across all the figures.
- width_each_subfig: float, mandatory. The width of each independent figure.
The abundance_plot method strictly requires the definition of the abundances file in the dataset configuration.
A template is shown below; the # <- comment indicates the parameters that the user must fill:
label: abundance-bars-for-dataset-n # <- change after the colon
defaults:
- dataset: LDHAB-Control_data # <- change after the colon
- method: abundance_plot
timepoints:
- T48 # <- change; provide your time points vertically listed
width_each_subfig: !!float 3.7 # <- modify the number only
It generates one figure per metabolite in .svg format.
The configuration for the isotopologues bars (visualization)
Parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be isotopologue_proportions_plot; do not change this value.
- timepoints: List, mandatory. The list of the time-points to be included across all the figures.
- width_each_stack: float, mandatory. The width of each independent figure.
- inner_numbers_size: float, mandatory. The font size of the numbers that appear inside each bar segment; each number corresponds to the average proportion computed over the biological replicates for each isotopologue.
A template is shown below; the # <- comment indicates the parameters that the user must fill:
label: isotopologues-stacked-for-n-data # <- change after the colon
defaults:
- dataset: # <- fill after the colon
- method: isotopologue_proportions_plot
timepoints:
- T48 # <- change; provide your time points vertically listed
width_each_stack: !!float 1.6 # <- modify the number only
inner_numbers_size: !!float 16 # <- modify the number only
The isotopologue_proportions_plot method strictly requires the definition of the isotopologue_proportions file in the dataset configuration.
It generates one figure per metabolite in .svg format.
The configuration for the enrichment line-plot (visualization)
Parameters:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be mean_enrichment_line_plot; do not change this value.
- width_subplot: float, mandatory. The width of each independent figure.
The mean_enrichment_line_plot method strictly requires the definition of the mean_enrichment file in the dataset configuration.
It automatically includes all time-points, as the line-plot is a popular choice for following the mean enrichment across time.
A template is shown below:
label: # <- fill after the colon
defaults:
- dataset: # <- fill after the colon
- method: mean_enrichment_line_plot
width_subplot: !!float 3.1 # <- change the number
The configuration for the omics integration (Metabologram)
DIMet integrates the differential omics (transcripts and metabolites), producing one or more Metabolograms. Some parameters are similar to those of the pairwise differential analysis, but the settings are not identical:
- label: str, mandatory. The name of the analysis and the name of the user dataset.
- defaults: Dict, mandatory. It has two sub-keys:
  - dataset: str, mandatory. The name of the dataset configuration file.
  - method: str, mandatory. The value must be metabologram_integration; do not change this value.
- comparisons: List, mandatory. A list of items where each item is a nested list representing a comparison. Each item therefore defines two groups: [[condition, timepoint], [condition_r, timepoint_r]], where the second group ([condition_r, timepoint_r]) is the "reference", so the first group is compared against that reference. In this configuration for the omics integration, the order of the comparisons must be coherent with the order of the differential expression files written in the integration dataset configuration.
- statistical_test: Dict, mandatory. The specific test to apply to the quantification. One single key, abundances or mean_enrichment, must be set, but not both in the same .yaml file. The value must be one of the currently supported tests (see sub-section The configuration for the pairwise differential analysis).
- columns_metabolites: Dict, mandatory. The specific column names of the differential metabolome file(s) that define the ID and the values:
  - ID: is set as "metabolite"; do not change this value.
  - values: must be set as log2FC or FC, depending on the user preferences.
- columns_transcripts: Dict, mandatory. The specific column names of the differential transcriptome file(s) that define the ID and the values:
  - ID: is the column with the gene symbols.
  - values: must be set depending on how the files were generated (e.g. log2FoldChange if generated by DESeq2).
- compartment: Dict, mandatory. Only one compartment name as unique key is accepted.
Template for the configuration for the omics integration:
label: metabologram-using-abundance-DATASET1 # <- change after the colon
defaults:
- dataset: # <- integration dataset config, fill after the colon
- method: metabologram_integration
comparisons:
- [[cond2, T24], [cond1, T24]] # <- see documentation, replace
# running for total abundances
statistical_test:
  abundances: # <- can be abundances OR mean_enrichment, fill after the colon
columns_metabolites:
  ID: metabolite
  values: # <- log2FC or FC, fill after the colon
columns_transcripts:
  ID: # <- the gene symbols column name, fill after the colon
  values: # <- the numeric column name, fill after the colon
compartment:
  en
If the user needs to run two integrations, one using differential abundances, and another using the differential mean enrichment, then two separate analysis configuration files must be created.
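The requirement that the comparisons follow the same order as the differential expression files can be sketched as follows. The example assumes a hypothetical integration dataset configuration named dataset1_data_integrate whose transcripts list holds two files, the first for Treated1 vs Control at T24 and the second for the same contrast at T48. Condition names, time-point names, the label and the gene symbol column name are all hypothetical; disfit is a test name that appears in this documentation, and the compartment value is copied unchanged from the template above.

```yaml
# Hypothetical file: config/analysis/metabologram_abundance_dataset1.yaml
label: metabologram-using-abundance-dataset1   # hypothetical label
defaults:
- dataset: dataset1_data_integrate       # hypothetical integration dataset configuration
- method: metabologram_integration       # fixed value, do not change
comparisons:
- [[Treated1, T24], [Control, T24]]      # 1st comparison, paired with the 1st file under transcripts
- [[Treated1, T48], [Control, T48]]      # 2nd comparison, paired with the 2nd file under transcripts
statistical_test:
  abundances: disfit                     # a single key: abundances OR mean_enrichment
columns_metabolites:
  ID: metabolite                         # fixed value, do not change
  values: log2FC                         # log2FC or FC
columns_transcripts:
  ID: gene_symbol                        # hypothetical name of the gene symbols column
  values: log2FoldChange                 # e.g. when the files were generated by DESeq2
compartment:
  en                                     # copied as-is from the template
```

A second file with mean_enrichment as the only key under statistical_test would cover the integration based on the differential mean enrichment.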
The general configuration
The general configuration file determines the working directory,
the analysis configuration to be run and the output folders.
It is located directly inside the config
folder.
The name of the general configuration file must be meaningful and unambiguously similar to that of the analysis configuration file. Using a name with the convention general_configuration_<analysis config file>.yaml is highly recommended.
Parameters:
- hydra: Dict, mandatory. Dictionary that defines the directories for the analysis. It contains two dictionaries:
  - job: its key chdir (bool) defines the working directory; if True, the current directory is set. Do not change this value.
  - run: its key dir (str) defines the output directory, following the convention outputs/DATE/HOUR/ followed by the dataset and method labels. This path is generated automatically by declaring outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${analysis.dataset.label}-${analysis.method.label}. Do not change this value.
- defaults: List, mandatory. A list of key-value pairs; currently only one must be set:
  - analysis: str, mandatory. The file name of the analysis configuration file.
- figure_path: str, mandatory. The name of the figures output folder; "figures" is recommended.
- table_path: str, mandatory. The name of the tables output folder; "tables" is recommended.
A template is shown below; the # <- comment indicates the only part that the user must fill:
hydra:
  job:
    chdir: true
  run:
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${analysis.dataset.label}-${analysis.method.label}
defaults:
- analysis: # <- the analysis configuration file name, fill after the colon
figure_path: figures
table_path: tables
DIMet, via Hydra, will generate the full output folder and file names automatically. Log files with information about the run (and errors, if any) are also automatically saved to .log files in the output folders.
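As an illustration, the general configuration that launches the pairwise differential analysis could be filled in as sketched below. The two file names are taken from the example config folder structure shown earlier in this page; which analysis file a given project actually uses is of course an assumption of the example.

```yaml
# Example file: config/general_config_differential_analysis_dataset1.yaml
hydra:
  job:
    chdir: true      # do not change this value
  run:
    # do not change this value; the folder name is built from the date, the hour and the labels
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${analysis.dataset.label}-${analysis.method.label}
defaults:
# refers to config/analysis/differential_analysis_pairwise_dataset1.yaml, written without the extension
- analysis: differential_analysis_pairwise_dataset1
figure_path: figures # recommended folder name
table_path: tables   # recommended folder name
```

Only the analysis line differs from one general configuration file to the next, which is why keeping one general configuration per analysis configuration, with matching names, makes the pairing easy to follow.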