SDRF-Proteomics Quick Start Tutorial

This tutorial will guide you through creating your first SDRF file step-by-step. By the end, you’ll understand the format and have a working file for your experiment.

Estimated time: 10-15 minutes

1. What is SDRF-Proteomics?

SDRF (Sample and Data Relationship Format) is a tab-separated text file, editable in a spreadsheet program such as Excel, that describes your proteomics experiment. It connects your biological samples to your mass spectrometry data files.

1.1. The Core Concept

Think of SDRF as a table where:

  • Each row = one sample-to-file relationship

  • Each column = one piece of information about that sample or file

That’s it! No programming, no complex formats — just a spreadsheet.
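Programming is optional, but because the file is plain tab-separated text, a script can read it too. A minimal sketch using only the Python standard library; the two-row content and the `read_sdrf` helper are made up for illustration, and real files have more columns:

```python
import csv
import io

# Hypothetical two-row SDRF content; real files have more columns.
SDRF_TEXT = (
    "source name\tcharacteristics[organism]\tcomment[data file]\n"
    "patient_001\thomo sapiens\tpatient_001.raw\n"
    "control_001\thomo sapiens\tcontrol_001.raw\n"
)

def read_sdrf(text):
    """Return the rows of a tab-separated SDRF text as a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = read_sdrf(SDRF_TEXT)
for row in rows:
    # Each row links one sample (source name) to one data file.
    print(row["source name"], "->", row["comment[data file]"])
```

Note that `csv.DictReader` keeps only the last of any repeated column names, so a file that repeats a column would need `csv.reader` instead.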

ℹ️

Why use SDRF? When you submit data to repositories like PRIDE, SDRF ensures your experiment is fully described and can be reanalyzed by others. It’s becoming a standard requirement for proteomics data submission.

For complete details, see the full specification.

2. Understanding Templates

Templates are pre-made SDRF files with the right columns already set up for your experiment type. Instead of figuring out which columns you need, just pick a template and fill in your data.

2.1. Core Templates (Organism-based)

These define the basic biological information needed based on your organism:

| Template | Description |
| --- | --- |
| human | Includes columns for age, sex, ancestry. View template |
| vertebrates | For mouse, rat, zebrafish, etc. View template |
| invertebrates | For insects, worms, etc. View template |
| plants | Includes plant-specific metadata. View template |

2.2. Specialized Templates (Experiment-based)

Add extra columns for specific experimental workflows:

| Template | Description |
| --- | --- |
| cell-lines | Adds cell line identifiers and Cellosaurus accessions. View template |
| dia-acquisition | DIA-specific parameters. View template |
| immunopeptidomics | MHC typing and related metadata. View template |
| crosslinking | XL-MS specific columns. View template |
| single-cell | Single-cell proteomics metadata. View template |

💡
You can combine templates! Start with a core template (e.g., "human") and add columns from specialized templates as needed.

Download all templates from: GitHub Templates

3. Understanding Column Types

SDRF columns follow a naming pattern that tells you what kind of information they contain:

3.1. characteristics[…​] — Sample Metadata

Describe the biological sample:

  • characteristics[organism] — Species name

  • characteristics[disease] — Disease or "normal"

  • characteristics[organism part] — Tissue or organ

See Sample Metadata Guidelines for all available characteristics.

3.2. comment[…​] — Data File Metadata

Describe the data file or MS run:

  • comment[data file] — Raw file name

  • comment[instrument] — Mass spectrometer

  • comment[label] — Labeling type

See MS-Proteomics Template for all available comments.

3.3. factor value[…​] — Experimental Variables

The experimental variable you’re comparing:

  • factor value[disease] — Comparing disease states

  • factor value[compound] — Drug treatment study

  • factor value[time] — Time course experiment

Column names are case-sensitive and spacing matters!

  • characteristics[organism] — Correct

  • Characteristics[organism] — Wrong (capital C)

  • characteristics [organism] — Wrong (space before bracket)
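The naming rules above can be checked mechanically. The following is a rough sketch, not the official validator: the regular expression and the `PLAIN` set are assumptions that cover only the column names shown in this tutorial.

```python
import re

# Bracketed columns: a lowercase prefix immediately followed by [name].
BRACKETED = re.compile(r"^(characteristics|comment|factor value)\[[^\[\]]+\]$")

# Plain columns used in this tutorial; an illustrative subset, not the full list.
PLAIN = {"source name", "assay name", "technology type"}

def is_valid_column(name):
    """Return True if the column name follows the pattern described above."""
    return name in PLAIN or BRACKETED.match(name) is not None

for candidate in [
    "characteristics[organism]",
    "Characteristics[organism]",
    "characteristics [organism]",
]:
    print(repr(candidate), is_valid_column(candidate))
```

Only the first of the three candidates passes; the other two fail on the capital letter and on the space before the bracket.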

4. Step 1: Choose Your Template

Answer this question to find your template:

| Your Sample Type | Template | Link |
| --- | --- | --- |
| Human samples | human | Download |
| Mouse, rat, zebrafish | vertebrates | Download |
| Insects, worms | invertebrates | Download |
| Plants | plants | Download |
| Cell lines | cell-lines | Download |
| Other / Not sure | ms-proteomics | Download |

5. Step 2: Fill in Sample Information

Open your template in Excel, Google Sheets, or any spreadsheet software. For each sample, fill in:

| Column | What to Write | Example | Notes |
| --- | --- | --- | --- |
| source name | A unique identifier for your sample | patient_001 | Unique to each sample; repeat it only when that sample spans several rows |
| characteristics[organism] | Species name (lowercase) | homo sapiens | Use scientific name from NCBI Taxonomy |
| characteristics[organism part] | Tissue or body part | liver | Use terms from UBERON |
| characteristics[disease] | Disease name, or "normal" | hepatocellular carcinoma | Use "normal" for healthy samples |

💡
Don’t stress about finding exact ontology terms. Write the common name (e.g., "liver", "breast cancer") and the validator will check it for you. You can always refine later.

6. Step 3: Fill in Data File Information

For each row, also fill in information about the raw file:

| Column | What to Write | Example | Notes |
| --- | --- | --- | --- |
| assay name | A name for this MS run | run_001 | Often same as source name |
| comment[label] | Type of labeling | label free sample | Or TMT126, TMT127N, etc. |
| comment[instrument] | Mass spectrometer used | Q Exactive HF | From PSI-MS ontology |
| comment[data file] | Your raw file name | sample_001.raw | Exact filename including extension |

ℹ️

One row = one sample-to-file relationship. In multiplexed experiments (TMT/iTRAQ), multiple samples share the same file, so you’ll have multiple rows pointing to the same raw file. In fractionated experiments, one sample spans multiple files, so you’ll have multiple rows for the same sample.

For more details, see SDRF File Format in the specification.
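To see how the one-row-per-relationship rule plays out, the sketch below groups rows by data file and by sample. The rows and the `group_rows` helper are hypothetical and use only the three columns needed for the illustration.

```python
from collections import defaultdict

# Hypothetical rows: two TMT-labelled samples share a file, one sample has two files.
rows = [
    {"source name": "sample_A", "comment[label]": "TMT126", "comment[data file]": "multiplex_1.raw"},
    {"source name": "sample_B", "comment[label]": "TMT127N", "comment[data file]": "multiplex_1.raw"},
    {"source name": "sample_C", "comment[label]": "label free sample", "comment[data file]": "sample_C_F01.raw"},
    {"source name": "sample_C", "comment[label]": "label free sample", "comment[data file]": "sample_C_F02.raw"},
]

def group_rows(rows):
    """Map each data file to its samples and each sample to its data files."""
    samples_per_file = defaultdict(set)
    files_per_sample = defaultdict(set)
    for row in rows:
        samples_per_file[row["comment[data file]"]].add(row["source name"])
        files_per_sample[row["source name"]].add(row["comment[data file]"])
    return samples_per_file, files_per_sample

samples_per_file, files_per_sample = group_rows(rows)

# Files shared by several samples (multiplexing).
multiplexed = sorted(f for f, s in samples_per_file.items() if len(s) > 1)
# Samples spread over several files (fractions or repeated injections).
multi_file = sorted(s for s, f in files_per_sample.items() if len(f) > 1)

print("shared files:", multiplexed)
print("multi-file samples:", multi_file)
```

The grouping alone cannot tell fractions from technical replicates; that distinction comes from the fraction identifier and technical replicate columns.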

7. Step 4: Define Your Experimental Variables

Factor values tell analysis tools what you’re comparing in your experiment. This is crucial for downstream analysis!

7.1. What is a Factor Value?

A factor value is the experimental variable you’re studying. If your experiment compares cancer vs. healthy tissue, then disease is your factor. The values would be "hepatocellular carcinoma" and "normal".

| Experiment Type | Factor Value Column | Example Values |
| --- | --- | --- |
| Disease vs. healthy | factor value[disease] | cancer, normal |
| Drug treatment | factor value[compound] | aspirin, DMSO |
| Time course | factor value[time] | 0 hour, 6 hour, 24 hour |
| Tissue comparison | factor value[organism part] | liver, kidney, heart |
| Multiple variables | Multiple factor columns | Both disease AND time |

🔥
Factor values often duplicate information from characteristics columns — and that’s correct! The factor value explicitly marks which characteristic is the experimental variable.
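Since the factor value usually mirrors an existing characteristic, the column can be derived rather than typed by hand. A small sketch, assuming rows are held as dictionaries; `add_factor_value` is a hypothetical helper, not part of any SDRF tool:

```python
def add_factor_value(rows, variable):
    """Copy characteristics[variable] into factor value[variable] for every row."""
    source = f"characteristics[{variable}]"
    target = f"factor value[{variable}]"
    result = []
    for row in rows:
        new_row = dict(row)            # leave the input rows untouched
        new_row[target] = row[source]  # appended last, after the existing columns
        result.append(new_row)
    return result

samples = [
    {"source name": "patient_001", "characteristics[disease]": "hepatocellular carcinoma"},
    {"source name": "control_001", "characteristics[disease]": "normal"},
]
with_factor = add_factor_value(samples, "disease")
for row in with_factor:
    print(row["source name"], row["factor value[disease]"])
```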

8. Step 5: Validate Your File

Save your file with the .sdrf.tsv extension, then validate it.

8.1. Option 1: Basic Validation

# Install the validator
pip install sdrf-pipelines

# Validate your file
parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv

8.2. Option 2: Validate Against a Template

# Validate against a specific template
parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv --template human

This checks that all required columns for your template are present.

Validation checks for:

  • Correct column names and formatting

  • Valid ontology terms (organism, disease, etc.)

  • Required columns present

  • No empty cells where values are required

For more validation options, see Tool Support.
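Before running the validator, a quick local pre-check can catch the two simplest problems from the list above: missing columns and empty cells. This is only a sketch and no substitute for `parse_sdrf`; the `precheck` function and the `required` list passed to it are assumptions chosen for illustration, and ontology terms are not checked at all.

```python
import csv
import io

def precheck(text, required):
    """Return a list of problems: missing columns, short or long rows, empty cells."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    problems = [f"missing column: {name}" for name in required if name not in header]
    for line_no, cells in enumerate(reader, start=2):
        if not cells:
            continue  # skip completely blank lines
        if len(cells) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} cells, found {len(cells)}")
        for name, value in zip(header, cells):
            if not value.strip():
                problems.append(f"line {line_no}: empty cell in '{name}'")
    return problems

# Hypothetical file with one empty cell and one missing column.
sample_text = "source name\tcharacteristics[organism]\npatient_001\t\n"
required_columns = ["source name", "characteristics[organism]", "comment[data file]"]

problems = precheck(sample_text, required_columns)
for problem in problems:
    print(problem)
```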

9. Complete Example

Here’s a minimal valid SDRF file for a human liver cancer study, including all required columns from the human template:

| source name | characteristics[organism] | characteristics[organism part] | characteristics[disease] | characteristics[biological replicate] | characteristics[age] | characteristics[sex] | assay name | technology type | comment[proteomics data acquisition method] | comment[label] | comment[instrument] | comment[cleavage agent details] | comment[fraction identifier] | comment[technical replicate] | comment[data file] | factor value[disease] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| patient_001 | homo sapiens | liver | hepatocellular carcinoma | 1 | 55Y | male | run_001 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | patient_001.raw | hepatocellular carcinoma |
| patient_002 | homo sapiens | liver | hepatocellular carcinoma | 2 | 62Y | female | run_002 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | patient_002.raw | hepatocellular carcinoma |
| control_001 | homo sapiens | liver | normal | 1 | 48Y | male | run_003 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | control_001.raw | normal |
| control_002 | homo sapiens | liver | normal | 2 | 51Y | female | run_004 | proteomic profiling by mass spectrometry | Data-dependent acquisition | label free sample | Q Exactive HF | NT=Trypsin;AC=MS:1001251 | 1 | 1 | control_002.raw | normal |

Required columns in this example:

  • Sample metadata: source name, organism, organism part, disease, biological replicate, age, sex

  • Data file metadata: assay name, technology type, proteomics data acquisition method, label, instrument, cleavage agent, fraction identifier, technical replicate, data file

  • Factor value: the experimental variable being compared (disease)

What this example tells us:

  • 2 biological replicates per condition (numbered 1-2 within each factor value group)

  • No fractionation (fraction identifier = 1 for all)

  • Single injection per sample (technical replicate = 1)

  • Label-free DDA proteomics with trypsin digestion on Q Exactive HF
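For readers who prefer to generate the file rather than type it, the example can be produced with Python's standard `csv` module. This is a sketch: the column names and values are taken from the example, while the helper names (`build_rows`, `to_tsv`) and the output filename in the comment are made up.

```python
import csv
import io

HEADER = [
    "source name",
    "characteristics[organism]",
    "characteristics[organism part]",
    "characteristics[disease]",
    "characteristics[biological replicate]",
    "characteristics[age]",
    "characteristics[sex]",
    "assay name",
    "technology type",
    "comment[proteomics data acquisition method]",
    "comment[label]",
    "comment[instrument]",
    "comment[cleavage agent details]",
    "comment[fraction identifier]",
    "comment[technical replicate]",
    "comment[data file]",
    "factor value[disease]",
]

# (source name, disease, biological replicate, age, sex, assay name)
SAMPLES = [
    ("patient_001", "hepatocellular carcinoma", "1", "55Y", "male", "run_001"),
    ("patient_002", "hepatocellular carcinoma", "2", "62Y", "female", "run_002"),
    ("control_001", "normal", "1", "48Y", "male", "run_003"),
    ("control_002", "normal", "2", "51Y", "female", "run_004"),
]

def build_rows():
    """Expand each sample into a full row; values shared by all runs are filled in here."""
    rows = []
    for source, disease, bio_rep, age, sex, assay in SAMPLES:
        rows.append([
            source, "homo sapiens", "liver", disease, bio_rep, age, sex,
            assay, "proteomic profiling by mass spectrometry",
            "Data-dependent acquisition", "label free sample", "Q Exactive HF",
            "NT=Trypsin;AC=MS:1001251", "1", "1",
            f"{source}.raw",
            disease,  # the factor value repeats characteristics[disease]
        ])
    return rows

def to_tsv(header, rows):
    """Serialize the header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

tsv_text = to_tsv(HEADER, build_rows())
print(tsv_text)
# To save it, write tsv_text to a file such as liver_study.sdrf.tsv (hypothetical name).
```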

10. Common Scenarios

10.1. TMT/iTRAQ Multiplexed Samples

For multiplexed experiments, multiple samples share the same raw file. Each sample gets its own row with a different label:

| source name | comment[label] | comment[data file] |
| --- | --- | --- |
| sample_A | TMT126 | multiplex_1.raw |
| sample_B | TMT127N | multiplex_1.raw |
| sample_C | TMT127C | multiplex_1.raw |

For complete TMT/iTRAQ documentation, see Isobaric Labelling in the specification.
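The multiplexed layout can be sketched as one generated row per sample and label, all pointing at the same file. The `multiplex_rows` helper below is hypothetical, and only the three columns from the table are filled in:

```python
def multiplex_rows(samples, labels, data_file):
    """Pair each sample with one label; every row points at the same raw file."""
    if len(samples) > len(labels):
        raise ValueError("more samples than available labels")
    return [
        {"source name": sample, "comment[label]": label, "comment[data file]": data_file}
        for sample, label in zip(samples, labels)
    ]

tmt_rows = multiplex_rows(
    ["sample_A", "sample_B", "sample_C"],
    ["TMT126", "TMT127N", "TMT127C"],
    "multiplex_1.raw",
)
for row in tmt_rows:
    print(row)
```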

10.2. Fractionated Samples

If you fractionated your sample before MS, add a comment[fraction identifier] column:

| source name | comment[fraction identifier] | comment[data file] |
| --- | --- | --- |
| sample_001 | 1 | sample_001_F01.raw |
| sample_001 | 2 | sample_001_F02.raw |
| sample_001 | 3 | sample_001_F03.raw |

For more details, see Fractions in the specification.
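With many fractions, the rows are tedious to type, so they can be generated instead. A sketch, assuming raw files follow the `<sample>_F<two-digit number>.raw` naming used in the table above; that naming is only this example's convention, not an SDRF requirement, and `fraction_rows` is a hypothetical helper.

```python
def fraction_rows(sample, n_fractions):
    """Create one row per fraction of a single sample, numbered from 1."""
    return [
        {
            "source name": sample,
            "comment[fraction identifier]": str(i),
            "comment[data file]": f"{sample}_F{i:02d}.raw",  # naming is an assumption
        }
        for i in range(1, n_fractions + 1)
    ]

frac_rows = fraction_rows("sample_001", 3)
for row in frac_rows:
    print(row["comment[fraction identifier]"], row["comment[data file]"])
```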

10.3. Technical Replicates

Same sample run multiple times? Use the same source name with different assay names and data files:

| source name | assay name | comment[technical replicate] | comment[data file] |
| --- | --- | --- | --- |
| sample_001 | sample_001_rep1 | 1 | sample_001_rep1.raw |
| sample_001 | sample_001_rep2 | 2 | sample_001_rep2.raw |
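A quick way to confirm this layout is to check which columns contain repeated values: the source name should repeat, while assay names and data files should not. A minimal sketch with a hypothetical `duplicated_values` helper:

```python
from collections import Counter

def duplicated_values(rows, column):
    """Return the values that occur more than once in the given column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

replicate_rows = [
    {"source name": "sample_001", "assay name": "sample_001_rep1",
     "comment[technical replicate]": "1", "comment[data file]": "sample_001_rep1.raw"},
    {"source name": "sample_001", "assay name": "sample_001_rep2",
     "comment[technical replicate]": "2", "comment[data file]": "sample_001_rep2.raw"},
]

print(duplicated_values(replicate_rows, "source name"))
print(duplicated_values(replicate_rows, "assay name"))
print(duplicated_values(replicate_rows, "comment[data file]"))
```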

10.4. Cell Line Experiments

For cell lines, include the cell line name and Cellosaurus accession:

| source name | characteristics[cell line] | characteristics[cellosaurus accession] |
| --- | --- | --- |
| hela_001 | HeLa | CVCL_0030 |
| hek_001 | HEK293 | CVCL_0045 |

Find accessions at Cellosaurus.
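Typos in accessions are easy to make, so a format check can help before validation. The pattern below is an assumption based on the accessions shown here (the prefix `CVCL_` followed by four uppercase letters or digits); it checks the shape only and cannot confirm that an accession exists in Cellosaurus.

```python
import re

# Assumed shape: "CVCL_" followed by four uppercase letters or digits.
CVCL_PATTERN = re.compile(r"^CVCL_[A-Z0-9]{4}$")

def looks_like_cellosaurus_accession(value):
    """Return True if the value has the assumed accession shape."""
    return CVCL_PATTERN.match(value) is not None

for value in ["CVCL_0030", "CVCL_0045", "cvcl_0030", "CVCL-0030"]:
    print(value, looks_like_cellosaurus_accession(value))
```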

For the complete cell lines template, see Cell Lines Template.

11. Common Mistakes to Avoid

| Mistake | Correct | Explanation |
| --- | --- | --- |
| Source Name (capitalized) | source name (lowercase) | Column names must be lowercase |
| characteristics [organism] (space) | characteristics[organism] (no space) | No space before the bracket |
| control for healthy samples | normal for healthy samples | Use "normal" for healthy tissue/samples |
| Empty cells | not available or not applicable | Never leave cells empty |
| sourcename (no space) | source name (with space) | Two words separated by a space |
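Some of these mistakes can be repaired automatically. The sketch below handles capitalized headers, a space before the bracket, and empty cells; the helper names are made up, and it does not attempt fixes that need judgment, such as `sourcename` or choosing between "not available" and "not applicable".

```python
import re

def fix_header(name):
    """Lowercase a column name and remove any space before the opening bracket."""
    fixed = name.strip().lower()
    return re.sub(r"\s+\[", "[", fixed)

def fill_empty(value, placeholder="not available"):
    """Replace an empty or whitespace-only cell with a placeholder."""
    return value if value.strip() else placeholder

print(fix_header("Source Name"))
print(fix_header("characteristics [organism]"))
print(fill_empty(""))
print(fill_empty("liver"))
```

Whether "not available" or "not applicable" is the right placeholder depends on the field, so review filled cells by hand.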

12. Finding the Right Terms (Ontologies)

SDRF uses ontology terms to ensure consistency across datasets. Here’s where to find them:

| For This Field | Look Here | Examples |
| --- | --- | --- |
| Organism names | NCBI Taxonomy | homo sapiens, mus musculus |
| Tissue/organ names | UBERON | liver, brain, blood |
| Disease names | MONDO | breast cancer, diabetes |
| Cell types | Cell Ontology | T cell, hepatocyte |
| Instruments & methods | PSI-MS | Q Exactive, Orbitrap |
| Cell lines | Cellosaurus | HeLa (CVCL_0030), HEK293 |

💡
Don’t worry about finding the exact ontology term initially. Just write the common name (e.g., "liver", "breast cancer") and the validator will check it for you.

13. Getting Help

13.1. Examples

Browse real SDRF files from published datasets in ProteomeXchange.

13.2. Questions

  • Open an issue on GitHub to reach the bigbio team

13.3. Full Documentation

For advanced use cases and complete details, see the full SDRF-Proteomics specification.

14. Next Steps

Once you’re comfortable with the basics:

  1. Explore templates for your specific experiment type: All Templates

  2. Read the metadata guidelines for detailed field descriptions: Sample Metadata Guidelines

  3. Learn about tool support for converting SDRF to analysis pipelines: Tool Support

  4. See the full specification for advanced use cases: SDRF-Proteomics Specification