Bioinformatics Project Template

Author

David Martín-Gálvez

Published

January 12, 2026

1 Introduction

This document describes a generic, flexible template to structure any bioinformatics analysis project. It is designed to work across different project types (RNA-seq, ChIP-seq, metagenomics, variant calling, proteomics, single-cell analysis, long-read sequencing, computational chemistry, etc.) and different computing environments (local workstations, HPC clusters, cloud).

The template implements:

  • A standard folder layout for organizing code, data, and results.
  • Minimal software requirements and installation instructions.
  • Initialization scripts to quickly scaffold new projects.
  • Git + SSH key setup and basic cluster access.
  • Separation of shared configuration and machine-specific configuration.
  • A portable path system based on BASE_DIR and REPO_DIR.
  • A repository that stores scripts and configuration only, never large data.
  • A project structure designed for use on local machines, cluster, and NAS.
  • SLURM array job architecture for scalable HPC execution.
  • Multi-environment storage strategy (macOS, HOME, fstrat, NAS).

1.1 Configuration System

The configuration system consists of:

  • config.yaml → shared, versioned configuration; same for all users and machines.
  • config_local.yaml → private, unversioned configuration; machine-specific.
  • load_config.sh → loader that reads both YAML files with yq and exposes variables to Bash/SLURM.

This ensures complete reproducibility, portability, and clarity across different computing environments.

1.2 Key Principles

  • Scripts never hard-code paths — all paths are derived from configuration.
  • Data lives outside the repository — only scripts and configuration are version-controlled.
  • One-sample Bash scripts can be tested locally or used in SLURM array jobs.
  • Clear separation between DATA (inputs) and ANALYSES (outputs).
  • Portable across environments — only BASE_DIR and REPO_DIR change between machines.

1.3 Why This Template is Generic

This template is not specific to any analysis type. Instead, it provides:

  • Flexible naming for scripts and directories (adapt 01_analysis_one_sample.sh to your analysis).
  • Customizable paths in config.yaml (use raw_data, sequences, alignments, or your own naming).
  • Agnostic architecture (works with any tool: FastQC, STAR, samtools, bcftools, Mothur, QIIME2, R, Python, etc.).
  • Scalable design (from single-machine to HPC array jobs).

Examples of how to adapt for different projects:

Project Type      Sample Definition   Input Files       Analysis Steps
RNA-seq           Sample ID           FASTQ files       QC → Trim → Map → Count → DE
ChIP-seq          Replicate ID        BAM files         QC → Peak call → Annotate
Metagenomics      Sample ID           FASTQ files       QC → Denoise → Taxonomy → Abundance
Variant calling   Sample ID           FASTQ/BAM files   QC → Map → Call → Annotate → Filter
Single-cell       Cell barcode        h5ad/mtx files    QC → Normalize → Cluster → Annotate

1.4 How to Use This Template

This repository is configured as a GitHub Template Repository, which allows you to create new projects without copying the full Git history.

1.4.1 Creating a New Project from Template

  1. On GitHub:

    • Navigate to this page.
    • Click the green “Use this template” button (top right)
    • Select “Create a new repository”
    • Fill in:
      • Owner: Your username or organization
      • Repository name: Your project name (e.g., sparrow-rnaseq-analysis)
      • Description: Brief project description
      • Visibility: Public or Private (your choice)
    • Click “Create repository”
  2. Clone your new repository:

    git clone git@github.com:<your-username>/<your-project>.git
    cd <your-project>
  3. Customize for your project:

    • Edit README.md: Update title, description, and project-specific information
    • Edit config/config.yaml: Adjust paths and parameters for your analysis type
    • Create config/config_local.yaml: Add your machine-specific paths
    • Update docs/: Rename QMD files and customize documentation
    • Adapt scripts in scripts/<step>/: Rename and modify for your specific analysis steps
    • Create sample list: Generate config/samples_names.txt with your sample identifiers
  4. Initialize project structure:

    # Specify parent path and analysis folder name
    bash scripts/utils/init_project.sh -p /path/to/parent -n analysis_name
    
    # Example: Create analysis folder 'run_20260112' in your project area
    bash scripts/utils/init_project.sh -p /fstrat/username/my-project -n run_20260112

1.4.2 Advantages of Template Repository vs Manual Cloning

Aspect              Template Repository     Manual Clone
Git history         ✅ Clean start          ❌ Inherits all template history
GitHub connection   ✅ Automatic            ❌ Requires remote reset
Simplicity          ✅ One-click creation   ❌ Manual rm -rf .git needed
Updates             ✅ Independent          ❌ Conflicts with template updates
Best practice       ✅ Recommended          ❌ Not recommended

1.4.3 After Creating Your Project

  1. Configure Git identity (if not already set globally):

    git config user.name "Your Name"
    git config user.email "your.email@example.com"
  2. Set up SSH keys (see Section 4 for detailed instructions)

  3. Initialize the analysis folder (see Section 10, Initialize the Project, for detailed instructions):

    # Create your analysis folder with appropriate name
    bash scripts/utils/init_project.sh -p /path/to/parent -n analysis_folder

    This will automatically create config/config_local.yaml with correct paths.

  4. Start developing: Add your data paths, customize scripts, and begin analysis


2 Minimal Software

2.1 Required Software

  • Shell & CLI:
    • bash ≥ 4.0, coreutils, awk, sed, grep, rsync, tar, gzip
  • YAML:
    • yq (mikefarah/yq) to read YAML with bash
  • Documentation:
    • Quarto (optional but recommended): https://quarto.org
  • VCS:
    • Git ≥ 2.30
  • Optional (depending on analysis):
    • Conda/Mamba for reproducible environments
    • Snakemake or Nextflow for workflows
    • R (≥4.2) and/or Python (≥3.10)

2.2 Quick Install (macOS)

First, install Homebrew if you don’t have it (https://brew.sh):

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then install the required packages:

brew install yq git quarto rsync gnu-sed gawk

3 Initialize Repository

3.1 Initialize and Connect to the Remote

Initialize and connect to remote:

git init
# create the remote on GitHub/GitLab, then add it
# (SSH form once keys are set up: git@github.com:<user>/<repo>.git)
git remote add origin https://github.com/<user>/<repo>.git
git config user.name "Your Name"
git config user.email "you@email"
git add .
git commit -m "Initial template"
git push -u origin main

3.2 Configure .gitignore

Ignore local files:

config/config_local.yaml
analyses/
data/processed/
logs/

Add to .gitignore:

echo "config/config_local.yaml" >> .gitignore

Important: config_local.yaml must never be committed as it contains machine-specific paths.
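The entries above can be appended in one pass. A minimal sketch that skips lines already present, so it is safe to re-run:

```shell
# Append each ignore rule to .gitignore unless it is already there
for entry in "config/config_local.yaml" "analyses/" "data/processed/" "logs/"; do
    grep -qxF "$entry" .gitignore 2>/dev/null || echo "$entry" >> .gitignore
done
```

Re-running the loop leaves .gitignore unchanged, so it can live in a setup script.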


4 SSH Keys (Git, Cluster, and NAS)

SSH keys enable secure, password-free authentication to remote servers. This section covers setup for GitHub, HPC clusters, and NAS storage.

4.1 Generate SSH Keys

Generate an ED25519 key pair (modern, secure standard):

ssh-keygen -t ed25519 -C "your-email@example.com" -f ~/.ssh/id_ed25519
# Start the SSH agent (eval applies the environment it prints) and add the key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Display public key (copy this to remote servers)
cat ~/.ssh/id_ed25519.pub

Important: Keep ~/.ssh/id_ed25519 private. Only share ~/.ssh/id_ed25519.pub.

4.2 Add Public Key to Remote Services

4.2.1 GitHub/GitLab SSH Setup

  1. Copy your public key from above (cat ~/.ssh/id_ed25519.pub)
  2. Add to GitHub:
    • Go to GitHub → Settings → SSH and GPG keys
    • Click “New SSH key” and paste your public key
  3. Test connection:
ssh -T git@github.com
# Expected: "Hi <username>! You've successfully authenticated..."

4.2.2 HPC Cluster SSH Setup

Add your public key to the cluster’s ~/.ssh/authorized_keys:

# On your local machine, copy key to cluster
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@cluster.example.com
# Or manually:
cat ~/.ssh/id_ed25519.pub | ssh user@cluster.example.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

4.2.3 NAS Storage SSH Setup (Restricted Access)

For security on shared NAS systems, create a restricted user with limited permissions:

# On NAS (typically done by admin)
# Create user without an interactive login shell
useradd -m -s /usr/sbin/nologin -c "Restricted SSH user" data_ssh

# Set ACL permissions to restrict access to specific folders
setfacl -m u:data_ssh:rx /volume/shared_data/
# This user can only read/execute, not write to parent directories

4.3 Configure SSH Client (~/.ssh/config)

Create or edit ~/.ssh/config to manage multiple SSH connections with appropriate settings for HPC:

# GitHub
Host github.com
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes

# HPC Cluster (Picasso, XSEDE, etc.)
Host cluster
    HostName cluster.example.com
    User your_username
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
    # Keep connection alive (prevent disconnections)
    ServerAliveInterval 60
    ServerAliveCountMax 5
    # Prevent "Too many authentication methods" error
    PreferredAuthentications publickey
    # Required for VS Code Remote-SSH
    ForwardAgent yes

# NAS Storage
Host nas
    HostName nas.example.com
    User data_ssh
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
    StrictHostKeyChecking accept-new

4.4 Test SSH Connections

# Test cluster
ssh cluster
# Should connect without password

# Test NAS
ssh nas
# Should connect without password

# Test GitHub
ssh -T git@github.com

4.5 Securing Permissions

Ensure SSH directory permissions are correct:

# On both local and remote systems
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_ed25519
chmod 644 ~/.ssh/id_ed25519.pub
chmod 600 ~/.ssh/authorized_keys
chmod 644 ~/.ssh/config

5 Repository Structure

This repository contains only scripts and configuration, never data or results.

bioinfo-project-repo/
│
├── config/
│   ├── config.yaml              # Shared, versioned, portable paths
│   ├── config_local.yaml        # Local-only, ignored by Git
│   └── samples_names.txt         # Sample identifiers list
│
├── scripts/
│   ├── 01_qc/                   # Quality control analysis scripts
│   │   ├── slurm_out/           # SLURM log files
│   │   ├── 01_fastqc_one_sample.sh
│   │   ├── 01_fastqc_array.sbatch
│   │   └── README.md
│   ├── templates/               # Reusable script templates
│   │   ├── analysis_script_template.sh
│   │   ├── array_job_template.sbatch
│   │   ├── job_slurm_template.sh
│   │   └── slurm_out/
│   └── utils/
│       ├── init_project.sh      # Scaffolds folders and templates
│       ├── load_config.sh       # Loads YAML configuration
│       └── test_load_config.sh
│
├── docs/
│   ├── bioinformatics_project_template.qmd  # Project documentation
│   ├── bioinformatics_project_template.html
│   └── index.html
│
├── logs/                        # Job logs and outputs
│
└── README.md

5.1 Key Principles

  • No data and no results in the repository.
  • Scripts are portable and reusable across machines.
  • Configuration is split between shared (config.yaml) and local (config_local.yaml).
  • config_local.yaml is never committed (add to .gitignore).

6 Configuration System

6.1 config.yaml (Shared, Versioned)

Path: config/config.yaml

This file contains only relative paths and pipeline parameters, never absolute paths.

Generic example (adapt to your analysis type):

paths:
  fastq_raw:       "1_DATA/FASTQ"
  fastq_processed: "2_ANALYSES/Results/02_preprocessing"
  reference:       "1_DATA/REFERENCE"
  analyses:        "2_ANALYSES"

cluster:
  threads: 8
  memory: "32G"
  queue: "short"
  time_limit: "02:00:00"

6.1.1 Key Rules

  • No absolute paths.
  • Stable across macOS, HOME, fstrat, NAS.
  • Versioned in Git.
  • Customize path names for your specific analysis type.

6.2 config_local.yaml (Machine-Specific, Not Versioned)

Path: config/config_local.yaml

Examples:

6.2.1 macOS

base_dir: "/Users/david/my-analysis"
repo_dir: "/Users/david/Repositories/my-analysis-repo"

6.2.2 Cluster HOME

base_dir: "/mnt/home/users/.../my-analysis"
repo_dir: "/mnt/home/users/.../my-analysis-repo"

6.2.3 fstrat (execution workspace)

base_dir: "/fstrat/dmartin/my-analysis/run_20251209"
repo_dir: "/mnt/home/users/dba_001_uma/dmartin/my-analysis-repo"

6.2.4 Add to .gitignore:

echo "config/config_local.yaml" >> .gitignore

This file must never be committed.

6.3 load_config.sh

Path: scripts/utils/load_config.sh

#!/bin/bash

MAIN_CONFIG="config/config.yaml"
LOCAL_CONFIG="config/config_local.yaml"

if [[ ! -f "$MAIN_CONFIG" ]]; then
    echo "ERROR: Missing config.yaml" >&2; exit 1
fi
if [[ ! -f "$LOCAL_CONFIG" ]]; then
    echo "ERROR: Missing config_local.yaml" >&2; exit 1
fi

# Machine-specific base directory
BASE_DIR=$(yq -r '.base_dir' "$LOCAL_CONFIG")
REPO_DIR=$(yq -r '.repo_dir' "$LOCAL_CONFIG")

# Relative paths from config.yaml
FASTQ_RAW_REL=$(yq -r '.paths.fastq_raw' "$MAIN_CONFIG")
FASTQ_PROC_REL=$(yq -r '.paths.fastq_processed' "$MAIN_CONFIG")
REF_REL=$(yq -r '.paths.reference' "$MAIN_CONFIG")
ANALYSES_REL=$(yq -r '.paths.analyses' "$MAIN_CONFIG")

# Build absolute paths
FASTQ_RAW_DIR="$BASE_DIR/$FASTQ_RAW_REL"
FASTQ_PROC_DIR="$BASE_DIR/$FASTQ_PROC_REL"
REFERENCE_DIR="$BASE_DIR/$REF_REL"
ANALYSES_DIR="$BASE_DIR/$ANALYSES_REL"

# Cluster parameters
THREADS=$(yq -r '.cluster.threads' "$MAIN_CONFIG")
MEMORY=$(yq -r '.cluster.memory' "$MAIN_CONFIG")
QUEUE=$(yq -r '.cluster.queue' "$MAIN_CONFIG")
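To illustrate the composition performed above, here is the same logic with the yq lookups replaced by literal placeholder values:

```shell
# Pure-bash illustration of the path composition (no yq needed)
BASE_DIR="/fstrat/user/my-analysis"     # would come from config_local.yaml
FASTQ_RAW_REL="1_DATA/FASTQ"            # would come from config.yaml
FASTQ_RAW_DIR="$BASE_DIR/$FASTQ_RAW_REL"
echo "$FASTQ_RAW_DIR"
```

Changing base_dir in config_local.yaml is therefore enough to relocate every derived path on a new machine.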

7 Project Structure (Data + Results)

This structure exists outside the repository, typically on NAS or local storage. For instance:

my-analysis/
│
├── 1_DATA/
│   ├── FASTQ/                # Original FASTQ files (read-only)
│   │   ├── sample_1_R1.fastq.gz
│   │   ├── sample_1_R2.fastq.gz
│   │   ├── sample_2_R1.fastq.gz
│   │   └── sample_2_R2.fastq.gz
│   │
│   ├── REFERENCE/            # Reference data and indices
│   │   ├── GENOME/           # Reference sequences (FASTA)
│   │   ├── ANNOTATION/       # Gene annotations (GTF/GFF3/BED)
│   │   └── INDEXES/          # Pre-built indices (tool-specific)
│
├── 2_ANALYSES/
│   ├── Scripts/              # symlink to repository scripts
│   └── Results/
│       ├── 01_qc/            # Quality control outputs (FastQC, etc.)
│       ├── 02_preprocessing/ # Processed FASTQ, trimming outputs
│       ├── 03_main_analysis/ # Alignment, quantification, etc.
│       ├── 04_tables/        # Results tables
│       └── 05_figures/       # Plots and visualizations

Note: create these folders incrementally, as analysis steps are executed and actually need them. There is no need to pre-create the entire structure; scripts can create required paths on demand (for example, init_project.sh and analysis scripts create 2_ANALYSES/Results/<step> when appropriate). This avoids empty directories, reduces organizational errors, and improves traceability.
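The on-demand creation described above amounts to an idempotent mkdir -p inside each script; a minimal sketch:

```shell
# Create a step's output directory only when the step actually runs.
# mkdir -p is idempotent: re-running never fails or clobbers existing files.
OUTDIR="2_ANALYSES/Results/01_qc"
mkdir -p "$OUTDIR"
```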

7.1 Key Principles

  • DATA (1_DATA/): Contains inputs, normally not modified by the pipeline.
  • ANALYSES (2_ANALYSES/): Contains all outputs, can be regenerated.
  • Clear separation enables reproducibility and clean HPC workflows.

8 Script Architecture

8.1 One-Sample Bash Scripts

Path: scripts/<step>/<step>_one_sample.sh (e.g., scripts/01_qc/01_fastqc_one_sample.sh)

Each script processes a single sample and is independent of SLURM. This enables:

  • Local testing: Test on macOS before cluster submission
  • Reusability: Same script runs locally or in HPC arrays
  • Debugging: Easy to troubleshoot single-sample issues

8.1.1 Usage

# Test locally
bash scripts/01_qc/01_fastqc_one_sample.sh sample_1

# This reads from: 1_DATA/FASTQ/sample_1*.fastq.gz
# And writes to:  2_ANALYSES/Results/01_qc/

8.1.2 Example Implementation

#!/bin/bash
set -euo pipefail

SAMPLE_ID="$1"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$SCRIPT_DIR/../.."
cd "$REPO_ROOT" || exit 1

source scripts/utils/load_config.sh

# Create output directory
OUTDIR="$ANALYSES_DIR/Results/01_qc"
mkdir -p "$OUTDIR"

# Load module if available
if command -v module &> /dev/null; then
    module load fastqc
fi

# Run analysis
fastqc -t "$THREADS" -o "$OUTDIR" "$FASTQ_RAW_DIR/${SAMPLE_ID}"*.fastq.gz

echo "FastQC completed for $SAMPLE_ID"

8.1.3 Sample List Format

Create config/samples_names.txt with one sample ID per line:

sample_1
sample_2
sample_3

Array jobs use this file to determine:

  • Number of tasks: --array=1-N where N = line count
  • Sample assignment: Each task gets one line based on SLURM_ARRAY_TASK_ID
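One way to generate this file is to derive IDs from the R1 FASTQ file names. The directory and the `_R1.fastq.gz` suffix below are assumptions, so adapt them to your naming scheme (the touch lines only fabricate demo inputs):

```shell
# Demo setup only: fabricate two empty R1 files to list
FASTQ_RAW_DIR="demo_fastq"
mkdir -p "$FASTQ_RAW_DIR" config
touch "$FASTQ_RAW_DIR/sample_1_R1.fastq.gz" "$FASTQ_RAW_DIR/sample_2_R1.fastq.gz"

# Strip directory and suffix from each R1 file: one sample ID per line
for f in "$FASTQ_RAW_DIR"/*_R1.fastq.gz; do
    basename "$f" _R1.fastq.gz
done | sort -u > config/samples_names.txt
```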


8.2 SLURM Array Jobs

Path: scripts/<step>/<step>_array.sbatch (e.g., scripts/01_qc/01_fastqc_array.sbatch)

Array jobs run one-sample scripts in parallel on HPC clusters. The structure is simple:

scripts/<step>/
├── slurm_out/                # Log files only
├── <step>_one_sample.sh      # Bash script - process one sample
└── <step>_array.sbatch       # SLURM script - submit array job

8.2.1 Quick Start

# Submit all samples as parallel tasks (from repository root)
cd /path/to/repository
sbatch scripts/01_qc/01_fastqc_array.sbatch

# Monitor
squeue -j <job_id>
tail -f scripts/01_qc/slurm_out/fastqc_array_<job_id>_1.out

8.2.2 How It Works

The array job:

  1. Reads sample names from config/samples_names.txt
  2. Creates N parallel SLURM tasks (N = sample count)
  3. Each task calls: bash 01_fastqc_one_sample.sh <SAMPLE_ID>
  4. Results are combined in 2_ANALYSES/Results/01_qc/
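Rather than editing --array=1-N by hand, the array size can be computed from the sample list at submission time; a sketch (the echo stands in for the real sbatch call, and the printf fabricates a demo list):

```shell
# Demo list; in practice this is config/samples_names.txt
printf 'sample_1\nsample_2\nsample_3\n' > samples_names.txt

# Count lines; arithmetic expansion strips any whitespace wc may emit
N=$(( $(wc -l < samples_names.txt) ))
echo "would run: sbatch --array=1-${N} scripts/01_qc/01_fastqc_array.sbatch"
```

Passing --array on the command line overrides the #SBATCH --array directive in the script.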

8.2.3 Example SLURM Script

#!/bin/bash
#SBATCH -J fastqc_array
#SBATCH -o scripts/01_qc/slurm_out/%x_%A_%a.out
#SBATCH -e scripts/01_qc/slurm_out/%x_%A_%a.err
#SBATCH -c 8
#SBATCH --mem=16G
#SBATCH -t 3-00:00:00
#SBATCH --constraint=cal
#SBATCH --array=1-N      # N = number of samples

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$SCRIPT_DIR/../.."
cd "$REPO_ROOT" || exit 1

source scripts/utils/load_config.sh

SAMPLES_FILE="config/samples_names.txt"
mapfile -t SAMPLES < "$SAMPLES_FILE"
TASK_ID=$((SLURM_ARRAY_TASK_ID - 1))
SAMPLE_ID="${SAMPLES[$TASK_ID]}"

echo "Processing sample: $SAMPLE_ID (task ${SLURM_ARRAY_TASK_ID})"

# Call one-sample script
bash scripts/01_qc/01_fastqc_one_sample.sh "$SAMPLE_ID"

8.2.4 Key Features

  • One sample per task: SLURM_ARRAY_TASK_ID indexes into sample list
  • Parallel execution: All samples run simultaneously on HPC
  • Simple submission: sbatch scripts/01_qc/01_fastqc_array.sbatch
  • Reusability: Same one-sample script for local testing and HPC
  • Log separation: Each task has its own log file in slurm_out/

8.2.5 Important: Log Directory Requirements

⚠️ Critical: SLURM requires the log directory to exist before job submission.

  • The repository includes .gitkeep files in each slurm_out/ directory
  • This ensures slurm_out/ exists when you clone the repository
  • SLURM cannot create the log directory automatically
  • If the directory doesn’t exist, job submission will fail

Why .gitkeep? Git doesn’t track empty directories. The .gitkeep file forces Git to include the slurm_out/ directory in the repository, ensuring it exists for all users.

For new analysis steps: When creating a new step (e.g., scripts/02_preprocessing/), always:

mkdir -p scripts/02_preprocessing/slurm_out
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add scripts/02_preprocessing/slurm_out/.gitkeep

9 Storage Architecture (macOS, HOME, fstrat, NAS)

9.1 Summary

  • macOS: Development, testing, documentation.
  • HOME: Repository + lightweight files.
  • fstrat: High-performance execution workspace (temporary).
  • NAS: Permanent project archive.

All scripts behave identically because they rely on BASE_DIR and REPO_DIR.

9.2 fstrat Execution Workspace (Per-Run)

Within each run directory:

/fstrat/dmartin/my-analysis/run_20251209/
│
├── 1_DATA/        # minimal replicated inputs
└── 2_ANALYSES/
    ├── Scripts/   # symlink to repository scripts
    └── Results/   # all outputs generated here

Create symlink:

ln -s "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts/scripts_repo"

This ensures traceability without duplicating code.
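A minimal sketch of that symlink step with placeholder directories; ln -sfn replaces a stale link instead of failing, and an absolute target keeps the link valid from any working directory:

```shell
# Placeholder layout standing in for REPO_DIR and BASE_DIR
REPO_DIR="$PWD/demo_repo"
BASE_DIR="$PWD/demo_base"
mkdir -p "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts"

# -s symbolic, -f force-replace an existing link, -n treat link name as a file
ln -sfn "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts/scripts_repo"
```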

9.3 Synchronizing Results from fstrat to NAS

After execution:

rsync -av /fstrat/.../2_ANALYSES/ /NAS/my-analysis/2_ANALYSES/

This pattern allows:

  • Fast execution on cluster.
  • Durable storage on NAS.
  • Clean reproducibility.

9.4 Environment-Dependent Configuration

Only two variables change across machines:

base_dir: "path/to/data_and_results"   # fstrat, NAS, or local path
repo_dir: "path/to/repository"         # HOME on cluster, local path on macOS

All scripts use these paths indirectly through load_config.sh.


10 Initialize the Project

Use scripts/utils/init_project.sh to create folders and templates:

# View help and options
bash scripts/utils/init_project.sh --help

# Initialize project with specified parent path and folder name
bash scripts/utils/init_project.sh -p <parent-path> -n <folder-name>

# Example: Create analysis folder 'run_20260112' in '/fstrat/dmartin/sparrow-analysis'
bash scripts/utils/init_project.sh -p /fstrat/dmartin/sparrow-analysis -n run_20260112

Required Arguments:

  • -p, --parent-path PATH: Parent directory where the analysis folder will be created
  • -n, --folder-name NAME: Name of the analysis folder (e.g., run_20260112, experiment_01)

What the script does:

  1. Creates the analysis folder structure:

    <parent-path>/<folder-name>/
    ├── 1_DATA/              # Minimal replicated inputs
    └── 2_ANALYSES/
        ├── Scripts/         # Symlink to repository scripts
        └── Results/         # All outputs generated here
  2. Creates a symbolic link: 2_ANALYSES/Scripts/ → repository scripts/ folder

  3. Creates repository configuration files if missing:

    • config/config.yaml → shared, versioned configuration
    • config/config_local.yaml → private, machine-specific configuration
  4. Creates script templates in scripts/templates/:

    • array_job_template.sbatch → SLURM array job template
    • analysis_script_template.sh → Generic one-sample script template

Structure Philosophy:

  • Repository folder (this repo): Contains scripts, config, docs (version-controlled)
  • Analysis folder (created by script): Contains data and results (not version-controlled)
  • Separation of concerns: Code vs. Data
  • Portability: Only analysis folder location changes between machines
  • Symlink advantage: Scripts always up-to-date with repository changes

11 Creating New Analysis Steps

To add a new analysis step (e.g., 02_preprocessing):

11.1 Create Directory Structure

mkdir -p scripts/02_preprocessing/slurm_out
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add scripts/02_preprocessing/slurm_out/.gitkeep

Note: The .gitkeep file is required so Git tracks the slurm_out/ directory. SLURM needs this directory to exist before job submission.

11.2 Create One-Sample Script

Copy and adapt the template:

cp scripts/templates/analysis_script_template.sh scripts/02_preprocessing/02_trim_one_sample.sh

Edit the script to:

  • Process your specific analysis (trimming, alignment, etc.)
  • Use appropriate input/output paths from load_config.sh
  • Load required modules or tools

11.3 Create SLURM Array Script

Copy and adapt the template:

cp scripts/templates/array_job_template.sbatch scripts/02_preprocessing/02_trim_array.sbatch

Update:

  • Job name: #SBATCH -J trim_array
  • Array size: #SBATCH --array=1-N (N = sample count)
  • Script call: bash scripts/02_preprocessing/02_trim_one_sample.sh "$SAMPLE_ID"

11.4 Test Locally

bash scripts/02_preprocessing/02_trim_one_sample.sh sample_1

11.5 Submit to HPC

# Execute from repository root
cd /path/to/repository
sbatch scripts/02_preprocessing/02_trim_array.sbatch

Important: Always execute sbatch from the repository root directory. Log paths in SLURM directives (#SBATCH -o and -e) are relative to the repository root.


12 Best Practices

  • Always source scripts/utils/load_config.sh at the start of analysis scripts.
  • Never modify 1_DATA/ during analyses; write outputs to 2_ANALYSES/.
  • Track only code and generic config; keep large data out of the repo.
  • Document in docs/ and publish reports with Quarto.
  • Use SLURM array jobs for scalable parallel processing.
  • Test one-sample scripts locally before submitting array jobs.
  • Synchronize results from fstrat to NAS after each run.

13 Summary

  • config.yaml → shared, versioned, general settings.
  • config_local.yaml → local, private, machine-specific.
  • load_config.sh → merges both and exposes variables.
  • All scripts use source load_config.sh for consistent paths.
  • One-sample Bash scripts can be tested locally or used in SLURM arrays.
  • Clear separation between DATA (inputs) and ANALYSES (outputs).
  • Portable across macOS ↔︎ HOME ↔︎ fstrat ↔︎ NAS.
  • Ensures full reproducibility, separation of concerns, and portability.

14 Next Steps

Before you start:

  • Configure SSH keys for secure, password-free access to GitHub, cluster, and NAS (see Section 4).

  • Set up .ssh/config to manage multiple connections efficiently.

  • Install VS Code Remote-SSH for seamless cluster interaction.

For your project:

  • Adapt scripts to your specific analysis pipeline.

  • Add workflows (Snakemake/Nextflow) for complex dependencies.

  • Create reproducible environments (environment.yml, renv.lock, or requirements.txt).

  • Document analysis steps and parameters in Quarto reports.