FASTQ to Final Report: The Files of NGS

If you’re diving into the world of Next-Generation Sequencing (NGS) bioinformatics, you’ve probably noticed that your terminal quickly fills up with an alphabet soup of file extensions: .fastq, .bam, .vcf, .bed, and more.

A standard NGS pipeline is essentially a sophisticated data transformation journey. At each step, data is refined, compressed, or analyzed, generating specific file types along the way. Today, we’re going to walk through a standard DNA sequencing pipeline and demystify the files you’ll encounter at each stage.

Step 1: Raw Data Generation

The Files: BCL and FASTQ

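When your samples finish running on an Illumina sequencing machine, the rawest form of data you'll normally see is the BCL (Base Call) file. These binary files store the base call and quality score for each cluster on the flow cell at each sequencing cycle, which the instrument's software derives from the fluorescent signals it images.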

Because BCL files are heavily instrument-specific, the very first step in the pipeline (often done automatically by the sequencing facility) is demultiplexing and converting them into our first major bioinformatics format: the FASTQ file (.fq or .fastq).

  • What it does: FASTQ files store the raw biological sequences (reads) alongside their corresponding quality scores.

  • Inside the file: Every single read gets exactly four lines:

    1. A sequence identifier (always starts with @).

    2. The raw nucleotide sequence (A, C, T, G, N).

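    3. A separator line (always starts with +, optionally followed by the identifier again).

    4. The Phred quality scores, encoded as ASCII characters, one per base, which tell you how confident the machine is in each base call. A score of Q means an estimated error probability of 10^(-Q/10), so Q30 is roughly a 1-in-1,000 chance that the base is wrong.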
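To make the four-line layout concrete, here is a minimal Python sketch that reads FASTQ records and decodes the quality string. It assumes the Phred+33 ASCII offset and strict four-line records; the names FastqRecord, parse_fastq, and error_probability are made up for this example, and the read itself is toy data.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines, assuming Phred+33 quality encoding."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Zipping the same iterator with itself groups the stream into 4-line chunks.
    for header, sequence, separator, quality in zip(stripped, stripped, stripped, stripped):
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33, so subtract 33 from each character's ASCII code.
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


def error_probability(phred: int) -> float:
    """Convert a Phred score Q into an error probability, p = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# Toy record, invented for illustration.
example = ["@read1\n", "GATTACA\n", "+\n", "IIIII#5\n"]
records = list(parse_fastq(example))
print(records[0].qualities)  # [40, 40, 40, 40, 40, 2, 20]
```

Real FASTQ files are usually gzip-compressed and very large, so in practice you would stream them (for example through gzip.open in text mode) rather than hold every record in a list.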

Step 2: Providing the Map

The File: FASTA

Before we can figure out what our raw reads mean, we need a reference genome to map them against. Enter the FASTA file (.fa or .fasta).

  • What it does: Unlike FASTQ, FASTA files only contain sequence data, with no quality scores.

  • Inside the file: It consists of a header line starting with a > (e.g., >chr1), followed by lines of the nucleotide or amino acid sequence.
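As a rough illustration of that header-plus-sequence layout, the sketch below reads FASTA text into a dictionary keyed by sequence name. The function name read_fasta is hypothetical, and the sketch assumes the name is the first whitespace-separated word after the >.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences by name; sequence lines are joined across line breaks."""
    sequences: Dict[str, str] = {}
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            fields = line[1:].split()
            # Assumption: the first word of the header is the sequence name.
            name = fields[0] if fields else ""
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before the first header line")
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


# Toy input, invented for illustration.
genome = read_fasta([">chr1 example contig", "ACGT", "ACGT", ">chr2", "NNNN"])
print(genome["chr1"])  # ACGTACGT
```

A real reference genome is gigabytes of text, so loading it whole like this only makes sense for small files; indexed access is the usual approach for anything human-sized.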

Step 3: Alignment and Mapping

The Files: SAM, BAM, and CRAM

Now the real computational heavy lifting begins. Aligners take your millions of short FASTQ reads and figure out exactly where they belong on the FASTA reference genome.

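  • SAM (Sequence Alignment Map): The output of this alignment process. It’s a human-readable, tab-delimited text file that tells you where each read mapped, its mapping quality, and how it lines up against the reference. That last part is the CIGAR string, which records aligned bases, insertions, deletions, and clipping. Note that a plain M in the CIGAR covers both matches and mismatches; per-base mismatches are usually reported in optional tags such as NM and MD.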

  • BAM (Binary Alignment Map): SAM files are massive. A BAM file is the exact same data, but compressed into a binary format. Computers read BAMs much faster, and they save enormous amounts of storage space. You will almost always convert SAMs to BAMs immediately.

  • CRAM: The modern evolution of BAM. CRAM files offer even more extreme compression through reference-based encoding: they record only how a read differs from the reference genome, rather than saving the whole read. The trade-off is that you need access to the same reference to decode them.
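To see what a SAM record actually carries, here is a small Python sketch that splits one alignment line into its mandatory fields and interprets the CIGAR string. The helper names (parse_cigar, reference_span, parse_sam_line) and the example read are invented for illustration; for real BAM or CRAM files you would reach for an established library rather than parse by hand.

```python
import re
from typing import Dict, List, Tuple

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance along the reference sequence.
REFERENCE_CONSUMING = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '5M1I3M' into (length, operation) pairs."""
    if cigar == "*":
        return []  # CIGAR unavailable, e.g. for unmapped reads
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_CONSUMING)


def parse_sam_line(line: str) -> Dict[str, object]:
    """Extract a few of the 11 mandatory SAM fields from one alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    flag = int(fields[1])
    return {
        "qname": fields[0],
        "flag": flag,
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
    }


# Toy alignment, invented for illustration.
sam_line = "read1\t0\tchr1\t100\t60\t5M1I3M2D4M\t*\t0\t0\tACGTACGTACGTA\tIIIIIIIIIIIII"
record = parse_sam_line(sam_line)
end = record["pos"] + reference_span(record["cigar"]) - 1
print(record["rname"], record["pos"], end)  # chr1 100 113
```

In this example the read is 13 bases long but covers 14 reference bases, because the 2-base deletion consumes reference while the 1-base insertion does not.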

Step 4: Defining Regions of Interest (Optional but Common)

The File: BED

Sometimes you don't care about the whole genome. If you did Whole Exome Sequencing (WES) or a targeted gene panel, you only want to analyze specific regions.

  • What it does: The BED file (.bed) is a simple tab-delimited file used to define genomic coordinates.

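  • Inside the file: It requires at least three columns: Chromosome, Start Position, and End Position (e.g., chr7 140453132 140453150). Coordinates are 0-based and half-open: the start is counted from zero and the end position itself is not included, so this example spans 18 bases. You feed this to your variant caller to tell it exactly where to look.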
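BED coordinates are a classic source of off-by-one bugs, because BED starts counting at 0 while SAM and VCF positions start at 1. The sketch below, with hypothetical helper names, parses a three-column BED line and checks whether a 1-based position falls inside the interval.

```python
from typing import Tuple

Interval = Tuple[str, int, int]


def parse_bed_line(line: str) -> Interval:
    """Read the three required BED columns: chromosome, start, end."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least three columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return chrom, start, end


def interval_length(interval: Interval) -> int:
    """Half-open intervals make the length a plain subtraction."""
    _, start, end = interval
    return end - start


def contains_position(interval: Interval, chrom: str, pos_1based: int) -> bool:
    """Check a 1-based position (as used in SAM/VCF) against a 0-based BED interval."""
    bed_chrom, start, end = interval
    # The 0-based half-open range [start, end) covers 1-based positions start+1 .. end.
    return bed_chrom == chrom and start < pos_1based <= end


target = parse_bed_line("chr7\t140453132\t140453150")
print(interval_length(target))  # 18
print(contains_position(target, "chr7", 140453132))  # False
print(contains_position(target, "chr7", 140453133))  # True
```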

Step 5: Variant Calling

The Files: VCF and gVCF

This is the "eureka" moment of the pipeline. Variant callers compare your aligned reads (BAM) against the reference (FASTA) to find where your sample differs from it. These differences are your SNPs, insertions, and deletions.

  • VCF (Variant Call Format): A text file that lists all the genetic variations found in your sample.

  • Inside the file: It starts with a heavy header (meta-information lines starting with ##, then a single column-header line starting with #CHROM) describing the file format, filters, and tools used. The data lines that follow give the Chromosome, Position, ID, Reference allele, Alternate allele, a quality score, a filter status, and an INFO column of statistical metrics about the variant.

  • gVCF (Genomic VCF): A special type of VCF used for joint calling across large cohorts. While a standard VCF lists only the variant sites, a gVCF covers every position in the genome (or in your target regions), typically by collapsing stretches that confidently match the reference into blocks. This lets a joint caller tell "no variant here" apart from "no data here".
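Here is a hedged Python sketch of how the VCF layout can be read: it separates the ## meta lines and the #CHROM line from the data lines, unpacks the fixed columns, and labels each variant as a SNP, insertion, or deletion. The function names are invented, the classification is deliberately simplified (it compares allele lengths only), and the two variants are toy data.

```python
from typing import Dict, Iterable, List, Tuple, Union

InfoValue = Union[str, bool]


def parse_info(field: str) -> Dict[str, InfoValue]:
    """Turn 'DP=30;DB' into {'DP': '30', 'DB': True}; flags have no value."""
    info: Dict[str, InfoValue] = {}
    if field == ".":
        return info
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], List[dict]]:
    """Return (meta lines, records) from VCF text, using the 8 fixed columns."""
    meta: List[str] = []
    records: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#") or not line:
            continue  # the #CHROM column-header line, or a blank line
        else:
            f = line.split("\t")
            records.append({
                "chrom": f[0],
                "pos": int(f[1]),  # 1-based position
                "id": f[2],
                "ref": f[3],
                "alts": f[4].split(","),
                "qual": None if f[5] == "." else float(f[5]),
                "filter": f[6],
                "info": parse_info(f[7]),
            })
    return meta, records


def classify(ref: str, alt: str) -> str:
    """Simplified labelling by allele length; not a full normalization."""
    if alt.startswith("<") or alt == "*":
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "mnp_or_complex"


# Toy VCF content, invented for illustration.
vcf_text = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr3\t45000\t.\tA\tT\t50\tPASS\tDP=30;DB",
    "chr3\t45010\t.\tAG\tA\t.\tPASS\t.",
]
meta, variants = parse_vcf(vcf_text)
for v in variants:
    print(v["chrom"], v["pos"], classify(v["ref"], v["alts"][0]))
```

Per-sample genotype columns (FORMAT and beyond) are ignored here to keep the sketch short.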

Step 6: Annotation

The Files: Annotated VCF, MAF, TSV

A raw VCF tells you there is an A to T mutation at position 45,000 on Chromosome 3. But what does that mean? Annotation tools add biological context to your VCF.

  • Annotated VCF: The same VCF format, but the INFO column is now packed with data telling you if the variant is in a gene (e.g., BRAF), if it changes a protein sequence, and if it's found in disease databases like ClinVar.

  • MAF (Mutation Annotation Format): Frequently used in cancer genomics (like The Cancer Genome Atlas), this is a more human-readable, tab-delimited summary of the mutations, making it easier to load into R or Python for downstream visualization.
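Because a MAF is just a tab-delimited table, the Python standard library is enough for a first look. The sketch below counts mutations per gene; it assumes the file uses the common Hugo_Symbol column name and may begin with # comment lines, and the rows are invented toy data rather than real results.

```python
import csv
import io
from collections import Counter
from typing import Iterable


def count_mutations_per_gene(lines: Iterable[str]) -> Counter:
    """Count rows per gene symbol in MAF-style tab-delimited text."""
    # Skip comment lines (such as a version line) before handing rows to csv.
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    # Assumption: the gene symbol lives in a column named Hugo_Symbol.
    return Counter(row["Hugo_Symbol"] for row in reader)


# Toy MAF content, invented for illustration.
maf_text = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tVariant_Classification\tTumor_Sample_Barcode\n"
    "BRAF\tchr7\t140453136\tMissense_Mutation\tSAMPLE_A\n"
    "BRAF\tchr7\t140453136\tMissense_Mutation\tSAMPLE_B\n"
    "TP53\tchr17\t7577120\tMissense_Mutation\tSAMPLE_A\n"
)
counts = count_mutations_per_gene(maf_text)
print(counts.most_common())  # [('BRAF', 2), ('TP53', 1)]
```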

The TL;DR Pipeline Summary:

  1. Machine outputs BCL.

  2. Converted to raw reads: FASTQ.

  3. Aligned to reference (FASTA) to create SAM/BAM/CRAM.

  4. Scanned for mutations (often guided by BED) to generate VCF.

  5. Annotated with biological databases to give you your final results.

Mastering this pipeline is a rite of passage for every bioinformatician. Once you understand the flow of data from .fastq to .vcf, you hold the keys to unlocking the secrets hidden within the genome.
