- 1. What is SDRF-Proteomics?
- 2. Understanding Templates
- 3. Understanding Column Types
- 4. Step 1: Choose Your Template
- 5. Step 2: Fill in Sample Information
- 6. Step 3: Fill in Data File Information
- 7. Step 4: Define Your Experimental Variables
- 8. Step 5: Validate Your File
- 9. Complete Example
- 10. Common Scenarios
- 11. Common Mistakes to Avoid
- 12. Finding the Right Terms (Ontologies)
- 13. Getting Help
- 14. Next Steps
This tutorial will guide you through creating your first SDRF file step-by-step. By the end, you’ll understand the format and have a working file for your experiment.
Estimated time: 10-15 minutes
SDRF (Sample and Data Relationship Format) is a simple tab-separated text file that you can open and edit in spreadsheet software such as Excel. It describes your proteomics experiment by connecting your biological samples to your mass spectrometry data files.
Think of SDRF as a table where:
- Each row = one sample-to-file relationship
- Each column = one piece of information about that sample or file
That’s it! No programming, no complex formats — just a spreadsheet.
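Although no programming is required, it can help to see how a script would view the same table. The following is a minimal sketch, assuming a tiny three-column table held in a string (the `SDRF_TEXT` content and the `read_sdrf` helper are illustrative, not part of any SDRF tool):

```python
import csv
import io

# A tiny SDRF-like table held in a string; a real file would be read from disk.
SDRF_TEXT = (
    "source name\tcharacteristics[organism]\tcomment[data file]\n"
    "patient_001\thomo sapiens\tpatient_001.raw\n"
    "control_001\thomo sapiens\tcontrol_001.raw\n"
)


def read_sdrf(text):
    """Parse tab-separated text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_sdrf(SDRF_TEXT)
for row in rows:
    # Each row links one sample to one data file.
    print(row["source name"], "->", row["comment[data file]"])
```

Each dictionary corresponds to one row of the spreadsheet, and each key corresponds to one column header.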
ℹ️ Why use SDRF? When you submit data to repositories like PRIDE, SDRF ensures your experiment is fully described and can be reanalyzed by others. It’s becoming a standard requirement for proteomics data submission. For complete details, see the full specification.
Templates are pre-made SDRF files with the right columns already set up for your experiment type. Instead of figuring out which columns you need, just pick a template and fill in your data.
Core templates define the basic biological information needed for your organism:
| Template | Description |
|---|---|
| human | Includes columns for age, sex, ancestry. |
| vertebrates | For mouse, rat, zebrafish, etc. |
| invertebrates | For insects, worms, etc. |
| plants | Includes plant-specific metadata. |
Specialized templates add extra columns for specific experimental workflows:
| Template | Description |
|---|---|
| cell-lines | Adds cell line identifiers and Cellosaurus accessions. |
| dia-acquisition | DIA-specific parameters. |
| immunopeptidomics | MHC typing and related metadata. |
| crosslinking | XL-MS specific columns. |
| single-cell | Single-cell proteomics metadata. |
💡 You can combine templates! Start with a core template (e.g., "human") and add columns from specialized templates as needed.
Download all templates from: GitHub Templates
SDRF columns follow a naming pattern that tells you what kind of information they contain:
Characteristics columns describe the biological sample:
- characteristics[organism]: Species name
- characteristics[disease]: Disease or "normal"
- characteristics[organism part]: Tissue or organ
See Sample Metadata Guidelines for all available characteristics.
Comment columns describe the data file or MS run:
- comment[data file]: Raw file name
- comment[instrument]: Mass spectrometer
- comment[label]: Labeling type
See MS-Proteomics Template for all available comments.
Factor value columns mark the experimental variable you’re comparing:
- factor value[disease]: Comparing disease states
- factor value[compound]: Drug treatment study
- factor value[time]: Time course experiment
❗ Column names are case-sensitive and spacing matters!
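The naming pattern can be made concrete with a small classifier. This is a sketch only: the `classify_column` helper and its regular expression are assumptions written for illustration, covering just the three bracketed column families described above.

```python
import re

# Prefix, then [name] with no space in between. Matching is case-sensitive on purpose.
COLUMN_PATTERN = re.compile(r"^(characteristics|comment|factor value)\[([^\[\]]+)\]$")


def classify_column(header):
    """Return (kind, name) for a bracketed column, or ("other", header) otherwise."""
    match = COLUMN_PATTERN.match(header)
    if match:
        return match.group(1), match.group(2)
    return "other", header


for header in [
    "characteristics[organism]",
    "comment[data file]",
    "factor value[disease]",
    "source name",
    "Characteristics[organism]",   # wrong case, so not recognized
    "characteristics [organism]",  # extra space, so not recognized
]:
    print(header, "->", classify_column(header))
```

Plain columns such as source name fall into the "other" group, as do headers with the wrong case or an extra space.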
Open your template in Excel, Google Sheets, or any spreadsheet software. For each sample, fill in:
| Column | What to Write | Example | Notes |
|---|---|---|---|
| source name | A unique identifier for your sample | patient_001 | Must be unique across the file |
| characteristics[organism] | Species name (lowercase) | homo sapiens | Use scientific name from NCBI Taxonomy |
| characteristics[organism part] | Tissue or body part | liver | Use terms from UBERON |
| characteristics[disease] | Disease name, or "normal" | hepatocellular carcinoma | Use "normal" for healthy samples |
💡 Don’t stress about finding exact ontology terms. Write the common name (e.g., "liver", "breast cancer") and the validator will check it for you. You can always refine later.
For each row, also fill in information about the raw file:
| Column | What to Write | Example | Notes |
|---|---|---|---|
| assay name | A name for this MS run | run_001 | Often same as source name |
| comment[label] | Type of labeling | label free sample | Or TMT126, TMT127N, etc. |
| comment[instrument] | Mass spectrometer used | Q Exactive HF | From PSI-MS ontology |
| comment[data file] | Your raw file name | sample_001.raw | Exact filename including extension |
|
ℹ️
|
One row = one sample-to-file relationship. In multiplexed experiments (TMT/iTRAQ), multiple samples share the same file, so you’ll have multiple rows pointing to the same raw file. In fractionated experiments, one sample spans multiple files, so you’ll have multiple rows for the same sample. For more details, see SDRF File Format in the specification. |
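To see how this row-per-relationship rule plays out, the sketch below groups rows by data file and by sample. The `summarize_layout` helper and the demo rows are hypothetical; note that a sample with several files may be fractionated or may simply have technical replicates, so the second list is only a hint.

```python
from collections import defaultdict


def summarize_layout(rows):
    """Group rows by data file and by sample to reveal shared files and multi-file samples."""
    samples_per_file = defaultdict(set)
    files_per_sample = defaultdict(set)
    for row in rows:
        samples_per_file[row["comment[data file]"]].add(row["source name"])
        files_per_sample[row["source name"]].add(row["comment[data file]"])
    # Files used by more than one sample (multiplexed runs).
    shared_files = sorted(f for f, s in samples_per_file.items() if len(s) > 1)
    # Samples that span more than one file (fractions or repeated injections).
    multi_file_samples = sorted(s for s, f in files_per_sample.items() if len(f) > 1)
    return shared_files, multi_file_samples


demo_rows = [
    {"source name": "sample_A", "comment[data file]": "multiplex_1.raw"},
    {"source name": "sample_B", "comment[data file]": "multiplex_1.raw"},
    {"source name": "sample_001", "comment[data file]": "sample_001_F01.raw"},
    {"source name": "sample_001", "comment[data file]": "sample_001_F02.raw"},
]
shared_files, multi_file_samples = summarize_layout(demo_rows)
print("files shared by several samples:", shared_files)
print("samples spanning several files:", multi_file_samples)
```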
Factor values tell analysis tools what you’re comparing in your experiment. This is crucial for downstream analysis!
A factor value is the experimental variable you’re studying. If your experiment compares cancer vs. healthy tissue, then disease is your factor. The values would be "hepatocellular carcinoma" and "normal".
| Experiment Type | Factor Value Column | Example Values |
|---|---|---|
| Disease vs. healthy | factor value[disease] | cancer, normal |
| Drug treatment | factor value[compound] | aspirin, DMSO |
| Time course | factor value[time] | 0 hour, 6 hour, 24 hour |
| Tissue comparison | factor value[organism part] | liver, kidney, heart |
| Multiple variables | Multiple factor columns | Both disease AND time |
🔥 Factor values often duplicate information from characteristics columns, and that’s correct! The factor value explicitly marks which characteristic is the experimental variable.
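When a factor value does mirror a characteristic, a disagreement between the two columns is usually a copy-paste slip. The check below is a sketch under that assumption (it is not a rule taken from the specification, and `factor_mismatches` is a hypothetical helper):

```python
def factor_mismatches(rows, variable):
    """List (row_number, characteristic, factor) for rows where the two columns disagree."""
    char_col = f"characteristics[{variable}]"
    factor_col = f"factor value[{variable}]"
    return [
        (number, row[char_col], row[factor_col])
        for number, row in enumerate(rows, start=1)
        if row[char_col] != row[factor_col]
    ]


demo_rows = [
    {"characteristics[disease]": "hepatocellular carcinoma",
     "factor value[disease]": "hepatocellular carcinoma"},
    # Likely a copy-paste slip: the factor value does not match the characteristic.
    {"characteristics[disease]": "normal",
     "factor value[disease]": "hepatocellular carcinoma"},
]
mismatches = factor_mismatches(demo_rows, "disease")
print(mismatches)
```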
Save your file as .sdrf.tsv and validate it:
```bash
# Install the validator
pip install sdrf-pipelines
# Validate your file
parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv
```

```bash
# Validate against a specific template
parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv --template human
```

This checks that all required columns for your template are present.
Validation checks for:
- Correct column names and formatting
- Valid ontology terms (organism, disease, etc.)
- Required columns present
- No empty cells where values are required
For more validation options, see Tool Support.
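Two of these checks (required columns and empty cells) are simple enough to sketch by hand. The `precheck` function below is a hypothetical helper for a quick look before running the real validator; it does not replace `parse_sdrf`, it flags every empty cell rather than only required ones, and it knows nothing about ontology terms.

```python
import csv
import io


def precheck(text, required_columns):
    """Return a list of problems: missing required columns, ragged rows, empty cells."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    problems = [f"missing column: {col}" for col in required_columns if col not in header]
    for line_no, cells in enumerate(reader, start=2):
        if len(cells) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} cells, found {len(cells)}")
        for col, value in zip(header, cells):
            if not value.strip():
                problems.append(f"line {line_no}: empty cell in '{col}'")
    return problems


demo_text = (
    "source name\tcharacteristics[organism]\n"
    "patient_001\t\n"  # the organism cell was left empty
)
problems = precheck(demo_text, ["source name", "comment[data file]"])
for problem in problems:
    print(problem)
```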
Here’s a minimal valid SDRF file for a human liver cancer study, including all required columns from the human template:
| source name | characteristics[organism] | characteristics[organism part] | characteristics[disease] | characteristics[biological replicate] | characteristics[age] | characteristics[sex] | assay name | technology type | comment[proteomics data acquisition method] | comment[label] | comment[instrument] | comment[cleavage agent details] | comment[fraction identifier] | comment[technical replicate] | comment[data file] | factor value[disease] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient_001 | homo sapiens | liver | hepatocellular carcinoma | 1 | 55Y | male | run_001 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | patient_001.raw | hepatocellular carcinoma |
| patient_002 | homo sapiens | liver | hepatocellular carcinoma | 2 | 62Y | female | run_002 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | patient_002.raw | hepatocellular carcinoma |
| control_001 | homo sapiens | liver | normal | 1 | 48Y | male | run_003 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | control_001.raw | normal |
| control_002 | homo sapiens | liver | normal | 2 | 51Y | female | run_004 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | control_002.raw | normal |
Required columns in this example:
- Sample metadata: source name, organism, organism part, disease, biological replicate, age, sex
- Data file metadata: assay name, technology type, proteomics data acquisition method, label, instrument, cleavage agent, fraction identifier, technical replicate, data file
- Factor value: the experimental variable being compared (disease)

What this example tells us:
- 2 biological replicates per condition (numbered 1-2 within each factor value group)
- No fractionation (fraction identifier = 1 for all)
- Single injection per sample (technical replicate = 1)
- Label-free DDA proteomics with trypsin digestion on Q Exactive HF
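If you prefer to generate the file rather than type it, the same example can be written with Python's standard `csv` module. This is a sketch: the column names and values are copied from the table above, while the helper names (`build_rows`, `to_tsv`) are illustrative.

```python
import csv
import io

COLUMNS = [
    "source name", "characteristics[organism]", "characteristics[organism part]",
    "characteristics[disease]", "characteristics[biological replicate]",
    "characteristics[age]", "characteristics[sex]", "assay name", "technology type",
    "comment[proteomics data acquisition method]", "comment[label]",
    "comment[instrument]", "comment[cleavage agent details]",
    "comment[fraction identifier]", "comment[technical replicate]",
    "comment[data file]", "factor value[disease]",
]

# Values that are identical for every row in this example.
SHARED = {
    "characteristics[organism]": "homo sapiens",
    "characteristics[organism part]": "liver",
    "technology type": "proteomic profiling by mass spectrometry",
    "comment[proteomics data acquisition method]": "Data-dependent acquisition",
    "comment[label]": "label free sample",
    "comment[instrument]": "Q Exactive HF",
    "comment[cleavage agent details]": "NT=Trypsin;AC=MS:1001251",
    "comment[fraction identifier]": "1",
    "comment[technical replicate]": "1",
}

# (source name, disease, biological replicate, age, sex, assay name)
SAMPLES = [
    ("patient_001", "hepatocellular carcinoma", "1", "55Y", "male", "run_001"),
    ("patient_002", "hepatocellular carcinoma", "2", "62Y", "female", "run_002"),
    ("control_001", "normal", "1", "48Y", "male", "run_003"),
    ("control_002", "normal", "2", "51Y", "female", "run_004"),
]


def build_rows():
    """Combine shared and per-sample values into one dictionary per row."""
    rows = []
    for source, disease, bio_rep, age, sex, assay in SAMPLES:
        row = dict(SHARED)
        row.update({
            "source name": source,
            "characteristics[disease]": disease,
            "characteristics[biological replicate]": bio_rep,
            "characteristics[age]": age,
            "characteristics[sex]": sex,
            "assay name": assay,
            "comment[data file]": f"{source}.raw",
            # The factor value repeats the disease characteristic.
            "factor value[disease]": disease,
        })
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(build_rows())
print(tsv_text)
```

Writing `tsv_text` to a file ending in .sdrf.tsv gives you something to pass to the validator.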
For multiplexed experiments, multiple samples share the same raw file. Each sample gets its own row with a different label:
| source name | comment[label] | comment[data file] |
|---|---|---|
| sample_A | TMT126 | multiplex_1.raw |
| sample_B | TMT127N | multiplex_1.raw |
| sample_C | TMT127C | multiplex_1.raw |
For complete TMT/iTRAQ documentation, see Isobaric Labelling in the specification.
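The multiplexed layout above can be produced with a short loop. This is a sketch with a hypothetical helper, `multiplexed_rows`, that emits only the three columns shown in the table; a real file would need the remaining required columns as well.

```python
def multiplexed_rows(samples_by_label, data_file):
    """Return one row per sample; every row points at the same raw file."""
    return [
        {"source name": sample, "comment[label]": label, "comment[data file]": data_file}
        for label, sample in samples_by_label.items()
    ]


tmt_rows = multiplexed_rows(
    {"TMT126": "sample_A", "TMT127N": "sample_B", "TMT127C": "sample_C"},
    "multiplex_1.raw",
)
for row in tmt_rows:
    print(row["source name"], row["comment[label]"], row["comment[data file]"], sep="\t")
```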
If you fractionated your sample before MS, add a comment[fraction identifier] column:
| source name | comment[fraction identifier] | comment[data file] |
|---|---|---|
| sample_001 | 1 | sample_001_F01.raw |
| sample_001 | 2 | sample_001_F02.raw |
| sample_001 | 3 | sample_001_F03.raw |
For more details, see Fractions in the specification.
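Fraction rows follow a regular pattern, so they are easy to generate. The sketch below assumes raw files are named like the example above (sample name, then `_F` and a two-digit fraction number); that naming scheme and the `fraction_rows` helper are illustrative, not something SDRF requires.

```python
def fraction_rows(sample, n_fractions):
    """Return one row per fraction, all sharing the same source name."""
    return [
        {
            "source name": sample,
            "comment[fraction identifier]": str(number),
            # Assumed naming scheme: <sample>_F<two-digit fraction>.raw
            "comment[data file]": f"{sample}_F{number:02d}.raw",
        }
        for number in range(1, n_fractions + 1)
    ]


fractions = fraction_rows("sample_001", 3)
for row in fractions:
    print(row["source name"], row["comment[fraction identifier]"],
          row["comment[data file]"], sep="\t")
```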
Same sample run multiple times? Use the same source name with different assay names and data files:
| source name | assay name | comment[technical replicate] | comment[data file] |
|---|---|---|---|
| sample_001 | sample_001_rep1 | 1 | sample_001_rep1.raw |
| sample_001 | sample_001_rep2 | 2 | sample_001_rep2.raw |
For cell lines, include the cell line name and Cellosaurus accession:
| source name | characteristics[cell line] | characteristics[cellosaurus accession] |
|---|---|---|
| hela_001 | HeLa | CVCL_0030 |
| hek_001 | HEK293 | CVCL_0045 |
Find accessions at Cellosaurus.
For the complete cell lines template, see Cell Lines Template.
| Mistake | Correct | Explanation |
|---|---|---|
| | | Column names must be lowercase |
| | | No space before the bracket |
| | | Use "normal" for healthy tissue/samples |
| Empty cells | | Never leave cells empty |
| | | Two words separated by a space |
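The rules in the table can be turned into a small linter. This is a sketch only: the offending strings it looks for (such as "healthy" or "factorvalue") are illustrative assumptions rather than the table's own examples, the reading of the last rule as applying to the "factor value" prefix is also an assumption, and `lint_header` and `lint_cell` are hypothetical helpers.

```python
import re


def lint_header(header):
    """Return the rules from the table that a column header appears to break."""
    problems = []
    if header != header.lower():
        problems.append("column names must be lowercase")
    if re.search(r"\s\[", header):
        problems.append("no space before the bracket")
    # Assumed reading of the last rule: "factor value" written as one word.
    if re.match(r"^factor[_-]?value\[", header.lower()):
        problems.append("'factor value' is two words separated by a space")
    return problems


def lint_cell(value):
    """Return the rules from the table that a cell value appears to break."""
    problems = []
    if not value.strip():
        problems.append("never leave cells empty")
    # "healthy" is an assumed example of a term that should be "normal".
    if value.strip().lower() == "healthy":
        problems.append('use "normal" for healthy tissue/samples')
    return problems


for header in ["Characteristics[organism]", "characteristics [organism]",
               "factorvalue[disease]", "factor value[disease]"]:
    print(repr(header), lint_header(header))
for value in ["", "healthy", "normal"]:
    print(repr(value), lint_cell(value))
```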
SDRF uses ontology terms to ensure consistency across datasets. Here’s where to find them:
| For This Field | Look Here | Examples |
|---|---|---|
| Organism names | NCBI Taxonomy | homo sapiens, mus musculus |
| Tissue/organ names | UBERON | liver, brain, blood |
| Disease names | | breast cancer, diabetes |
| Cell types | | T cell, hepatocyte |
| Instruments & methods | PSI-MS | Q Exactive, Orbitrap |
| Cell lines | Cellosaurus | HeLa (CVCL_0030), HEK293 |
💡 Don’t worry about finding the exact ontology term initially. Just write the common name (e.g., "liver", "breast cancer") and the validator will check it for you.
Browse real SDRF files from published datasets in ProteomeXchange.
- Open an issue on GitHub to reach the bigbio team
For advanced use cases and complete details:
Once you’re comfortable with the basics:
- Explore templates for your specific experiment type: All Templates
- Read the metadata guidelines for detailed field descriptions:
- Learn about tool support for converting SDRF to analysis pipelines: Tool Support
- See the full specification for advanced use cases: SDRF-Proteomics Specification