BIAAS

BIOINFORMATICS AS A SERVICE

What is bioinformatics as a service?

Our service is designed for scientists who want easy access to just the right kind of bioinformatics expertise. Watch the video to learn how it works.

Omics data analysis

We analyse omics data from sequencing, array and mass-spec experiments

DNA sequencing enables identifying mutations, assembling genomes and studying genetic variation in populations of any species.

Transcriptome-wide expression analyses are the standard approach to study molecular mechanisms in biological systems from single cells to complex microbiomes.

Single-cell RNA sequencing enables cataloging cells at a scale and resolution unmatched by bulk sequencing.

Epigenomics characterizes the chromatin state down to minuscule chemical modifications.

Proteomics and metabolomics reveal the functional state of a biological system.

DNA SEQUENCING DATA ANALYSIS

Understand the effects of genetic variation and mutations with DNA sequencing data analysis.

DNA-SEQUENCING comes in many forms. Whole-genome sequencing (WGS), whole-exome sequencing (WES) and targeted sequencing enable studying heritable and somatic DNA variants. In addition to NGS data, SNP and CGH arrays can be used to identify genetic polymorphisms and copy-number variants, respectively. Metagenomic whole-genome sequencing of microbial communities allows analyzing their compositions and functions.

We routinely analyze DNA sequence data to address research questions in both basic biology and biomedical settings. Below we present some of the typical DNA-sequencing data analyses. If you would like to learn how we can help you get the most out of your DNA-seq data, leave us a message and we will book you a short call with our expert.

Variant Analysis

IN MOST CASES, DNA sequencing is employed in order to identify and analyze genetic variants. These variants can be small nucleotide substitutions, insertions, deletions, copy-number alterations or structural variants. Furthermore, they may be heritable polymorphisms or somatic mutations.

Variant analysis typically starts with the quality control of raw DNA-sequencing data and aligning the sequencing reads against a reference genome. Variants that differ between the sample and public reference — or between different samples — can then be computationally identified.

A crucial part of variant analysis is annotating the detected variants. Annotations such as allele frequencies (both in-sample and in public databases such as gnomAD), predicted effects on protein structure or gene regulation and predicted pathogenicity allow for flexible selection or ranking of variants for downstream analyses and interpretation.
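
As a rough illustration of annotation-based selection and ranking, the sketch below keeps rare variants with a potentially damaging predicted effect and ranks them by a pathogenicity score. The field names, the consequence vocabulary, the score scale and the 1% frequency cutoff are all assumptions made for the example, not the output format of any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str
    public_af: float      # allele frequency in a public database (assumed field)
    consequence: str      # predicted effect (assumed vocabulary)
    pathogenicity: float  # predicted pathogenicity, assumed to lie in [0, 1]


# Hypothetical set of consequences treated as potentially damaging.
DAMAGING = {"missense", "stop_gained", "frameshift"}


def select_rare_damaging(variants, max_af=0.01):
    """Keep rare, potentially damaging variants, most pathogenic first."""
    kept = [
        v for v in variants
        if v.public_af <= max_af and v.consequence in DAMAGING
    ]
    return sorted(kept, key=lambda v: v.pathogenicity, reverse=True)


# Toy variants, invented for illustration only.
variants = [
    Variant("chr1:1000A>G", 0.20, "missense", 0.30),
    Variant("chr2:2000C>T", 0.001, "stop_gained", 0.95),
    Variant("chr3:3000G>A", 0.0005, "synonymous", 0.05),
    Variant("chr4:4000T>C", 0.002, "missense", 0.60),
]
ranked = select_rare_damaging(variants)
print([v.variant_id for v in ranked])
```

In practice the filters depend on the research question; a study of common polymorphisms would, for instance, invert the frequency criterion.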

Variant analysis in cancer research often focuses on identifying somatic mutations which accelerate tumorigenesis (driver mutations) or that can be used to diagnose a patient or predict their course of disease. Learn more about mutation analysis in cancer research.

Tumor Evolutionary Analysis

TUMOR EVOLUTION underpins cancer's ability to adapt under selective pressures imposed by therapies. Somatic mutations at a subclonal level can be used to reveal the clonal structure of a tumor and track it through processes such as relapse and metastasis.

Learn more about tumor clonality analysis.

Genome Assembly

FOR ORGANISMS WITH NO REFERENCE GENOMES or highly dynamic genomes, DNA-sequencing data analysis starts with assembling a genome de novo. Genome assembly benefits from deep whole-genome sequencing.

An assembled genome is annotated based on sequence homology, predicted gene sequences and, if available, RNA-sequencing data from the same organism. If annotated genomes for close relative species exist, the annotation can be improved by transferring gene information to the newly assembled genome.

The quality of an assembled genome is assessed using metrics such as N50, L50 and completeness with regard to highly conserved orthologs. A new high-quality genome enables analyses into pan-genomes, population genetics and much more!
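
To make the contiguity metrics concrete: N50 is the length of the shortest contig among the largest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. A minimal sketch, with a made-up list of contig lengths:

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest contig needed to reach half the assembly.
            return length, count
    raise ValueError("no contigs given")


# Toy assembly of five contigs (total length 300).
n50, l50 = n50_l50([100, 80, 60, 40, 20])
print(n50, l50)
```

Here the two largest contigs (100 + 80) already cover half of the 300 bases, so N50 is 80 and L50 is 2.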

Metagenomics

METAGENOMICS OFFERS AN UNBIASED VIEW into the microbial diversity of ecological niches including samples from host organisms and soil. Using shot-gun whole-genome sequencing data, reads are assembled into contigs and assigned to species or operational taxonomic units (OTUs).

Identified species or OTUs are organized into a phylogeny and quantified. The functions brought about by individual genes or multi-gene pathways present in the sequenced community can be identified using public databases.

Note that 16S amplicon sequencing, a cost-effective alternative to metagenomic sequencing, can be used to identify species and build phylogenies, but it does not allow for high-quality functional analyses.

Population Genetics

GENOME-WIDE MEASUREMENTS of individuals sampled from related populations contain rich information on the populations’ structure, genealogy and history. Population genetic analyses of non-model organisms often begin with genome assembly and annotation, and proceed to identifying genetic polymorphisms in the sampled populations. The downstream analyses based on these polymorphisms and their allele frequencies help in studying evolutionary phenomena such as speciation and adaptation.

Typical analyses involve principal component analysis, analysis of genetic variation within and between populations to identify loci affected by evolutionary selection, and analyses of population admixtures, phylogeny and demographic histories.

Genome-wide Association analysis

BIOMEDICALLY MOTIVATED POPULATION-SCALE GENETIC ANALYSES aim to identify genes and variants associated with relevant phenotypes or diseases. Apart from the few diseases which are monogenic and strongly heritable, most diseases require large, population-level sample sizes to achieve sufficient statistical power to find associations. Such genome-wide association studies (GWAS) are based on SNP-array or DNA-sequencing data from biobanks or other large repositories.

GWAS results in summary statistics on the association between each individual variant and the studied disease. In the case of polygenic diseases, individual variants may have very weak effect sizes even when the disease is strongly heritable. In such cases, polygenic risk scores (PRS) can be used to sum the effect of a large number of variants, resulting in a combined risk score with potential clinical utility.
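
In its simplest form, a polygenic risk score is a weighted sum: each variant's effect size from the GWAS summary statistics is multiplied by the number of effect alleles the individual carries. The sketch below assumes exactly that additive form; the variant identifiers and effect sizes are invented, and real PRS methods typically add steps such as variant selection and effect-size shrinkage.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum effect size times allele dosage over variants present in both inputs."""
    return sum(
        effect_sizes[variant] * dosage
        for variant, dosage in dosages.items()
        if variant in effect_sizes
    )


# Hypothetical per-variant effect sizes (e.g., log odds ratios).
effect_sizes = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}

# Effect-allele counts (0, 1 or 2) for one individual; "rs4" has no known effect.
dosages = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

score = polygenic_risk_score(dosages, effect_sizes)
print(round(score, 3))
```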

RNA SEQUENCING DATA ANALYSIS

RNA sequencing data analysis brings to light the intricate mechanisms of gene regulation.

TRANSCRIPTOME-WIDE ANALYSES of gene expression are extremely popular among researchers studying gene regulation in biological systems ranging from single cells to tissues and complex microbiomes. RNA-seq data allows for a wide range of analyses to address countless research questions across the fields of biology and biomedicine.

Below we present some of the most common analyses we perform on RNA-seq data. The explorative, differential expression and pathway analyses largely apply to other high-throughput expression data as well, such as expression microarray or proteomic data.

We hope that the examples below inspire you to appreciate just how rich the world of RNA-sequencing is. If you are planning an RNA-seq experiment and wish to learn how we can help you to get the most out of your data, leave us a message and we will book you a short call with our expert.

Exploratory Gene Expression Analysis

EVERY RNA-SEQ EXPRESSION STUDY incorporates an exploratory analysis. After the raw sequencing reads of an RNA-seq experiment have been quality controlled and gene counts derived, the data set is visualized using Principal Component Analysis (PCA) and expression heatmaps to unveil its general patterns. These visualizations help us answer questions such as:

  • Do the biological replicates resemble each other with regards to their expression profiles?
  • Do distinct sample groups (e.g., different tissues, treatments or time points) form separate clusters?
  • Are there outlier samples?

Differential Expression Analysis

DIFFERENTIAL EXPRESSION ANALYSIS is a statistical comparison of two sample groups. It results in differential expression statistics for each detected transcript, such as the fold change and statistical significance. These statistics are typically visualized using a volcano plot. The genes which are found to be up- or down-regulated can be further visualized as heatmaps or boxplots, for instance.

As a statistical analysis, this phase of an expression study benefits from the statistical power brought by biological replicates. Three biological replicates per condition is a common “rule-of-thumb” minimum, but it only allows for reliable detection of genes with relatively large expression differences. With a careful experimental design and sufficient sample size, subtler differences can be detected and confounding factors controlled for.
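
The two per-gene statistics mentioned above can be sketched for a single gene as follows. This is a deliberately simplified illustration that assumes already-normalized expression values, a pseudocount of 1 and a Welch t statistic; dedicated differential expression tools model count data far more carefully, and the numbers below are invented.

```python
import math
from statistics import mean, variance


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def welch_t(treated, control):
    """Welch's t statistic; needs at least two replicates per group."""
    standard_error = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return (mean(treated) - mean(control)) / standard_error


# Three biological replicates per condition for one hypothetical gene.
control = [6, 7, 8]
treated = [29, 31, 33]

lfc = log2_fold_change(treated, control)
t_stat = welch_t(treated, control)
print(lfc, round(t_stat, 2))
```

With only three replicates per group the variance estimates are unstable, which is one reason why small designs reliably detect only large expression differences.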

Pathway Analysis

PATHWAY ANALYSIS puts genes from a differential expression analysis into broader biological context. Simple pathway analyses compare the up- and down-regulated genes statistically to predetermined gene lists. These lists are annotated to biologically meaningful terms, such as a biological process, signaling pathway or a specific disease.

Such analyses may rely either on over-representation analysis or gene set enrichment analysis, which both result in a list of enriched gene sets with relevant statistics and annotations.
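
Over-representation analysis is commonly formulated as a one-sided hypergeometric test: given how many genes were measured, how many belong to a gene set and how many were differentially expressed, how surprising is the observed overlap? A minimal sketch under that assumption, with toy numbers:

```python
from math import comb


def ora_p_value(n_universe, n_gene_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when drawing n_selected genes at random."""
    largest_possible = min(n_gene_set, n_selected)
    favourable = sum(
        comb(n_gene_set, k) * comb(n_universe - n_gene_set, n_selected - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return favourable / comb(n_universe, n_selected)


# Toy example: 20 measured genes, a gene set of 5, and 4 differentially
# expressed genes of which 3 belong to the gene set.
p_value = ora_p_value(n_universe=20, n_gene_set=5, n_selected=4, n_overlap=3)
print(round(p_value, 4))
```

When many gene sets are tested, the resulting p-values need to be corrected for multiple testing before interpretation.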

More mechanistic pathway analyses rely on experimentally validated interactions between genes. They enable identifying not just which pathways are represented by the differentially expressed genes, but also shed light on whether the pathways are activated or inhibited, and by which genes.

For the more advanced pathway analyses, we use Ingenuity Pathway Analysis (IPA, QIAGEN). IPA enables a wide range of in-depth analyses into known and novel gene regulatory networks.

Transcriptome Assembly

FOR NON-MODEL ORGANISMS, and those with very dynamic genomes, e.g., microbes, we typically start RNA sequencing data analysis by assembling a transcriptome de novo and annotating it using homologues from related species and computational gene predictions.

A new reference transcriptome is an invaluable resource for your further research, and that of the entire research community. Once a high-quality reference transcriptome has been established, the door opens to most downstream analyses which are routinely used with model organisms.

Single-Cell Expression Analysis

SINGLE-CELL RNA-SEQUENCING (scRNA-seq) experiments allow for cataloguing cell types and uncovering differentiation trajectories at a scale and resolution unmatched by bulk RNA sequencing.

Used particularly to study the composition and development of complex tissues, scRNA-seq data sets typically comprise thousands of individual cells. Most approaches used to analyze bulk RNA-seq data can be tailored for single-cell RNA-seq data as well.

MicroRNA Data Analysis

SMALL RNA-SEQUENCING enables studying various species of short RNAs, and microRNAs in particular. MicroRNA-seq analysis is largely similar to that of mRNAs, but pathway and regulatory analyses make use of predicted and/or previously validated microRNA target genes.

Sequencing both mRNA and small RNA from the matched samples enables estimating the regulatory relationship between microRNAs and their targets. To identify genes subject to microRNA-mediated regulation in a given condition, argonaute CLIP-sequencing (and related protocols) can be employed.

Alternative Splicing Analysis

IN ADDITION TO STUDYING EXPRESSION ON THE LEVEL OF GENES, RNA-sequencing allows for a more detailed view: splice-variant level expression. Reliable identification of alternative splicing events benefits from deeper sequencing than the typical gene-level expression analysis.

Depending on the quantity and quality of the data, alternative splicing analyses may focus on quantifying expression levels of known, previously annotated splice isoforms, or on detecting novel splicing events as well.

Fusion Gene Detection

IN CANCER, CERTAIN STRUCTURAL VARIANTS are known to cause fusion genes. Two separate genes fused together in the DNA may lead to a fusion transcript. The fusion transcript, in turn, may lead to a fusion protein with a novel, potentially cancer-driving combination of regulation and function.

Fusion genes can be detected from RNA-sequencing data with tools that identify and analyze discordantly mapping RNA-seq reads or read pairs.

Integrating RNA-seq and Epigenomic Data

PERFORMING RNA-SEQ AND EPIGENOMIC SEQUENCING (such as ChIP or ATAC-seq) on the same samples enables integrative analyses to study gene regulatory programs genome-wide.

Regulatory connections can be identified between enhancers and their target genes, as well as transcription factors and their targets, building on evidence from both gene expression and the epigenomic status of regulatory elements.

“How do you integrate RNA-seq and ChIP-seq data?”

I’ve heard biologists ask the question above countless times. (You may replace ChIP-seq with ATAC-seq, bisulphite-seq or any other epigenomic data type.)

It makes a lot of sense to ask that question.

Generating and analyzing data from a single NGS-based assay such as RNA-seq or ChIP-seq is not as rare a skill as it was a few years ago. This is due to a new "NGS native" generation of biologists who have acquired basic ‘omics data analysis skills early in their training, largely obviating the need for biologists to walk to the department of statistics or — god forbid! — computer science to knock on the doors of code-savvy researchers, suggesting a “collaboration” to get their data analyzed.

However, integrating different data modalities is a different matter, and this is the phase where research projects often stall.

The idea is simple: if you see, smell and taste a wine, your brain may be able to integrate these multi-sensory inputs and infer just which river valley the grapes originate from, way better than if it had to rely on just one sense.

So, what is the multiomic brain that takes all possible NGS data you generate and spits out insight?

The wrong answer, I have learned, is “it really depends on your research question”. The correct one is “correlation”. That is the short answer — the longer one is “careful analysis of individual data types, correlation, filtering, visualization, interpretation — iterate a couple times — and you might arrive at some very fine results!”

What follows is an even longer answer.

The last time I heard The Question, I got inspired enough to chat about it with my colleague Grigorios who is a bit of an expert on the topic. Grigorios presented to me the analysis he had run for a customer with ChIP-seq data for several histone markers as well as RNA-seq data, all from a time course experiment involving different genotypes and treatments on certain cultured cells. The approach to integrating all this data follows the integrative chromatin accessibility and expression analyses in his paper on erythroid differentiation (Georgolopoulos et al., 2021, Nat Commun).

The workflow

For a walk-through of this data integration, let us assume an experiment visualized below, with RNA-sequencing and an epigenomic sequencing assay performed at a few timepoints, and a treatment administered after the first time point. The epigenomic assay could be ChIP-seq (or CUT&Tag) for one or more histone modifications, or a chromatin accessibility assay such as ATAC-seq.

(The integrative analyses discussed here do not require time series data; one could analyze the expression and epigenetic states across a pseudotime trajectory using single-cell data or, simply, comparing single-timepoint data from bulk experiments in different conditions.)

The question is, what are the molecular mechanisms between the treatment and an altered cellular state at the end? Can we give a multi-step description of events cascading through the network of genes and gene products that reprograms the cell to adapt to the perturbation?

In the context of translational research, identifying the critical elements, such as transcription factors or enhancers, that enable a cell's progression to a diseased state offers possible targets for new therapies.

Below we see a workflow for identifying active cis- and trans-regulatory paths in such a cascade. It begins with processing epigenomic and transcriptomic data separately, and brings the two modalities together by correlating the expression of each gene to the signal from its putative cis-regulatory elements (CREs). It then proceeds to identify transcription factors (TFs) which drive the chromatin changes, through identifying TF binding motifs within the CREs and correlating TF expression to the state of these putative binding sites.

The central steps in the workflow include:

  • Classifying cis-regulatory elements. The peaks, interpreted as CREs, can be grouped by their temporal pattern by cluster analysis. This may yield several categories of CREs, whose epigenetic state can be classified based on the observed pattern (e.g., activated, transiently activated, constitutively active) and the epigenomic signal that was measured (e.g., "accessible" in the case of ATAC-seq, "active" in the case of H3K27ac ChIP-seq). The CRE clusters are further annotated by enriched binding motifs.
  • Grouping genes by their temporal expression pattern. The genes are likewise grouped by cluster analysis and classified based on the observed patterns (e.g., "up-regulated", "constitutively expressed"). Gene clusters are annotated by gene set enrichment analysis to link them to biological functions and processes.
  • Linking CREs to genes. Linking a putative CRE to a gene relies on genomic proximity between the two as well as correlation between the CRE's epigenomic activity and the gene's expression.
  • Linking TFs to target genes. Establishing a link between a TF and a target gene relies on information on TF-CRE links (binding motifs, correlation), CRE-target gene links (see above) and TF-target correlation. Such multiomic filtering of possible TF-target links enables identifying a network of all such active trans-regulatory paths.
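
The CRE-to-gene linking step can be sketched as a two-part filter: genomic proximity first, then correlation across timepoints. The 100 kb window, the 0.8 correlation cutoff, the use of Pearson correlation and all identifiers below are assumptions chosen for illustration; they are not the parameters of the analysis described above.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def link_cres_to_genes(cres, genes, max_distance=100_000, min_r=0.8):
    """Pair each CRE with nearby genes whose expression tracks its signal."""
    links = []
    for cre_id, (cre_chrom, cre_pos, signal) in cres.items():
        for gene_id, (gene_chrom, tss, expression) in genes.items():
            if cre_chrom != gene_chrom or abs(cre_pos - tss) > max_distance:
                continue  # fails the proximity criterion
            r = pearson(signal, expression)
            if r >= min_r:
                links.append((cre_id, gene_id, round(r, 3)))
    return links


# Toy data: (chromosome, position, signal or expression at four timepoints).
cres = {
    "cre1": ("chr1", 50_000, [1, 2, 4, 8]),
    "cre2": ("chr1", 900_000, [1, 2, 4, 8]),
}
genes = {
    "geneA": ("chr1", 60_000, [10, 20, 40, 80]),
    "geneB": ("chr1", 70_000, [80, 40, 20, 10]),
}
links = link_cres_to_genes(cres, genes)
print(links)
```

Only the pair cre1 and geneA passes both filters: cre2 lies too far from either gene, and geneB is anticorrelated with cre1. The same pattern extends to TF-CRE links by replacing proximity with the presence of a binding motif.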

What are the follow-up experiments?

The approach described above results in a rich description of regulatory programs involved in a studied process. There are several ways to further enrich and validate the findings, such as:

  • Validating CRE-target interactions by chromosome conformation capture. Methods such as Hi-C enable establishing evidence of physical interaction between distant loci, genome-wide.
  • Validating the CRE-gene interactions with genome editing. The gold standard experiment to validate a regulatory element's role in driving a gene's expression is to delete the CRE, using CRISPR-Cas9, and quantify the target gene's expression in wild-type and edited cells.
  • Validating TF-CRE interactions using ChIP-seq. The mere presence of a binding site in an apparently active regulatory element is not direct evidence of the TF-CRE interaction. A ChIP-seq (or CUT&RUN or CUT&Tag) experiment using an antibody specific for a TF of interest can be used to verify the factor's physical presence.

SINGLE-CELL RNA SEQUENCING DATA ANALYSIS

Single-cell RNA sequencing enables cataloging and studying cellular identities at a scale and resolution unmatched by bulk sequencing.

Single-cell RNA sequencing (scRNA-seq) is one of the most rapidly advancing and diversifying technologies in molecular biology. The ability to study gene expression on the resolution of single cells has been as transformative as the advent of bulk RNA-sequencing previously.

In addition to single-cell RNA-seq, a number of other next-generation sequencing (NGS)-based assays have been adapted to single-cell protocols. These include genomic, proteomic and epigenetic assays, notably single-cell ATAC-sequencing, which is commonly performed in conjunction with scRNA-seq.

Platforms and protocols for scRNA-seq vary in their throughput (number of cells) and transcript coverage (3'/5' tag-based vs whole-transcript). Our team has experience working with several technologies, such as 10X Genomics, Drop-Seq, BD Rhapsody system and protocols of the CEL-Seq and Smart-Seq families.

Here we present typical single-cell analyses, focusing on scRNA-seq but covering also its integration with other common single-cell assays. We also list single-cell papers that we have published.

Quality control and preprocessing

As with any NGS data, the analysis of single-cell sequencing data starts with quality control and preprocessing.

Raw sequencing reads are quality-tested and metrics such as cell quality, accuracy, and diversity are generated. Reads are then aligned to an applicable reference genome or transcriptome, and additional metrics such as the number of cells, reads per cell, genes per cell, sequencing saturation and fraction of mitochondrial transcripts are plotted and inspected.

These QC metrics inform us about the total quality of the libraries and the usability of the samples and enable identifying and removing low-quality cells.

Further preprocessing is often carried out to remove unwanted signal, or noise, from certain downstream analyses. These include

  • Imputation to estimate read counts for dropouts, or genes with zero transcripts due to technical, rather than biological, reasons,
  • Normalization to remove biases due to e.g., differences in cell sizes, and
  • Reducing the data to representative variables such as highly-variable genes or principal components.
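
Two of the steps above, removing low-quality cells and normalization, can be sketched as follows. The thresholds, the "MT-" prefix used to recognize mitochondrial genes (a human gene-symbol convention) and the scale factor of 10,000 are assumptions for the example; suitable values depend on the tissue and protocol.

```python
import math


def filter_cells(cells, min_genes=200, max_mito_fraction=0.2):
    """Keep cells with enough detected genes and a low mitochondrial fraction."""
    kept = {}
    for cell_id, counts in cells.items():
        total = sum(counts.values())
        detected = sum(1 for count in counts.values() if count > 0)
        mito = sum(c for gene, c in counts.items() if gene.startswith("MT-"))
        if total > 0 and detected >= min_genes and mito / total <= max_mito_fraction:
            kept[cell_id] = counts
    return kept


def log_normalize(counts, scale=10_000):
    """Scale counts to a common library size, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(c / total * scale) for gene, c in counts.items()}


# Toy count matrix with three cells and three genes.
cells = {
    "cell1": {"GAPDH": 50, "ACTB": 40, "MT-CO1": 10},
    "cell2": {"GAPDH": 5, "ACTB": 0, "MT-CO1": 45},   # mostly mitochondrial
    "cell3": {"GAPDH": 3, "ACTB": 0, "MT-CO1": 0},    # too few detected genes
}
kept = filter_cells(cells, min_genes=2)  # tiny threshold to suit the toy data
normalized = {cell_id: log_normalize(c) for cell_id, c in kept.items()}
print(list(kept))
```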

Exploratory Analysis

Preprocessed single-cell RNA-seq data is clustered to identify groups of similar cells and visualized using non-linear dimensionality reduction algorithms such as t-SNE and UMAP, along with correlation heatmaps, to unveil general patterns of cell heterogeneity.

These visualizations help us answer technical questions such as:

  • Do the biological replicates resemble each other?
  • Are there outlier samples or cells?
  • Are the cell clusters distinct?

...and biological questions such as:

  • How heterogeneous are the underlying cell types/states?
  • Do distinct samples (e.g., different tissues, treatments or time points) form separate clusters?

Cell Type Identification

Identifying and characterizing cell types (and more refined cell states) is the most central part of most single-cell projects.

It all starts with identifying features (e.g., genes, proteins, accessible regions) that are specific to each cell cluster. These markers are defined by differential expression (DE) comparison of each cell cluster and the remaining ones, yielding DE statistics such as fold change and statistical significance.

The cluster markers can be visualized using scatter plots, violin plots, and heatmaps.

Markers are further annotated to biologically meaningful terms, such as biological processes, signaling pathways or specific diseases. Such analyses may rely either on over-representation analysis or gene set enrichment analysis, which both result in a list of enriched gene sets with relevant statistics and annotations.

Single-cell datasets are typically also integrated with publicly available data in order to exploit the cell-type information from already annotated datasets or cell atlases. This enables transferring cell labels into the analyzed dataset.

The transferred cell labels and identified markers and their annotations are used, together with prior information on cell-type/state markers, to identify the captured cell types.

Trajectory Analysis

In addition to characterizing distinct cellular identities, single-cell data lends itself to identifying continuums of gradual change in cell state, or trajectories. Uncovering such continuums is also called pseudotime analysis — while all cells are sampled at the same time point, individual cells may represent different stages in a temporal process such as differentiation.

De novo reconstruction of lineage differentiation and cell maturation trajectories allows exploring cellular dynamics, delineating cell developmental lineages, and characterizing transitions between cell states along a latent pseudotime dimension.

An ensemble of trajectory inference algorithms may be used for robust identification of root and terminal cellular states, branching points, and lineages. Single cells are ranked across deterministic or probabilistic lineages, and their ranking indicates their progression in a dynamic process of interest.

This type of analysis may also utilize the ratio of processed and unprocessed transcripts to infer whether a gene's expression is increasing or decreasing in a given cell. Combining this information from all quantified genes at a given state enables inferring the direction and pace of change in states. This is called RNA velocity analysis.
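
The idea can be illustrated with one simple formulation that assumes a steady-state relationship between unprocessed (unspliced) and processed (spliced) transcripts: if a cell holds more unspliced transcripts than the steady state predicts, the gene's expression is taken to be increasing. The ratio `gamma` below is a hypothetical, user-supplied parameter; dedicated tools estimate it from the data and use more elaborate models.

```python
def rna_velocity(unspliced, spliced, gamma):
    """Excess of unspliced transcripts over the assumed steady-state level."""
    return unspliced - gamma * spliced


def direction(velocity, tolerance=1e-9):
    """Translate a velocity value into a qualitative trend."""
    if velocity > tolerance:
        return "increasing"
    if velocity < -tolerance:
        return "decreasing"
    return "steady"


# One gene in three hypothetical cells, assuming gamma = 0.5.
for unspliced, spliced in [(3.0, 4.0), (2.0, 4.0), (1.0, 4.0)]:
    print(direction(rna_velocity(unspliced, spliced, gamma=0.5)))
```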

Integrative Single-cell Analyses

Integrative single-cell analyses bring together different datasets, including different data types and species. This enables more accurate and detailed cell labeling and mechanistic insight into gene regulation in the studied system. Such analyses rely on common properties, or anchors, between the datasets, such as matched features (e.g., genes or homologues) or matched cells.

Integrating Multiple Single-Cell RNA-seq Datasets

Perhaps the most common integration of single-cell datasets takes place between scRNA-seq datasets from different sources or technology platforms. Using genes as anchors, a successful integration removes the technical bias while retaining biological variance of the datasets.

Combining different scRNA-seq datasets is particularly helpful when there is a well-characterized public expression atlas available for a relevant tissue or organism.

Integrating Single-Cell RNA-Seq And Epigenomics

Integrating single-cell RNA-seq data with single-cell ATAC-seq or single-cell methylation data often relies on matched cells as anchors (when the measurements derive from the same cells as in, e.g., 10X Genomics Multiome technology).

Combining expression data with chromatin accessibility or methylation profiles enables more robust identification of cell types and allows for quantifying the effect of chromatin state on expression in individual cell types.

Read more about integrating epigenomics and transcriptomics

Integrating Single-Cell RNA-Seq And Proteomics

Since proteins, rather than transcripts, are key drivers of cellular functions, single-cell proteomics complements scRNA-seq experiments with more accurate estimates of cells' functional states.

Single-cell proteomic profiling (CITE-seq, flow cytometry, mass cytometry, and mass spectrometry) comes in different degrees of throughput (number of proteins quantified) and can be targeted specifically to surface proteins, as in CITE-seq, which involves a panel of surface proteins quantified from cells with matched scRNA-seq reads.

Surface proteins are particularly useful in cell type identification, while the inclusion of cytosolic proteins enables better characterization of pathway and gene-regulatory activities.

Cross-Species Integration

Cross-species integrative analysis enables the identification of cell-type phylogenies that define the relationships of evolutionary and developmental mechanisms between different organisms. Shared homologues are used as anchors in cross-species integration.

This is particularly helpful when a disease/organ is better characterized on a single-cell resolution in an animal model than in human.

Ligand-Receptor Analysis

Ligand-receptor (LR) analysis uncovers cell-cell interactions that coordinate homeostasis, development, and other system-level functions. Changes and dysfunction in such interactions may go unnoticed in an analysis limited to the internal state of individual cells or cell types.

Ligand-receptor analysis identifies and quantifies intercellular interactions based on the expression of known receptors and their ligands. The interactions may take place within or between tissues, and the strength of this interaction is compared between biological conditions of interest, such as patient groups, disease states, and treatments.
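
One simple way to quantify such an interaction, used here purely as an illustrative assumption, is the product of the ligand's mean expression in the sending cell type and the receptor's mean expression in the receiving cell type. The cell types, gene names and values below are invented, and published methods add statistical testing on top of scores like this.

```python
from statistics import mean


def lr_score(expression, sender, receiver, ligand, receptor):
    """Mean ligand expression in the sender times mean receptor expression in the receiver."""
    ligand_mean = mean(expression[sender][ligand])
    receptor_mean = mean(expression[receiver][receptor])
    return ligand_mean * receptor_mean


# Per-cell expression values grouped by cell type (hypothetical genes LIG1, REC1).
expression = {
    "T cell": {"LIG1": [2.0, 4.0], "REC1": [0.0, 0.0]},
    "Macrophage": {"LIG1": [0.0, 0.0], "REC1": [1.0, 3.0]},
}

forward = lr_score(expression, "T cell", "Macrophage", "LIG1", "REC1")
reverse = lr_score(expression, "Macrophage", "T cell", "LIG1", "REC1")
print(forward, reverse)
```

Computing the same score separately per condition (e.g., per patient group) gives the values that are then compared between conditions.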

Spatial Transcriptomic Analysis

Spatially resolved single-cell transcriptomic assays couple expression data with the cell's positional context in a tissue or organ. This is particularly useful in the study of complex solid tissues, such as tumors and their microenvironment.

Spatial transcriptomic analysis involves cell/spot clustering in space, identification of spatially variable genes and resolving cell types in space.

Retaining the positional information of sequenced cells adds to the accuracy of identifying cell types and ligand-receptor interactions. It also enables spatial visualization of gene expression or chromatin accessibility (in the case of scATAC-seq) and integrating imaging-based data into the analysis.

Even in the case of lower-resolution assays, like 10X Visium, multimodal spatial analysis helps in correcting gene expression values and imputing dropout events.

EPIGENOMIC DATA ANALYSIS

Uncover epigenetic mechanisms of gene regulation in development and disease.

EPIGENOMICS CHARACTERIZES THE CHROMATIN STATE down to minuscule chemical modifications. Epigenetic changes to the DNA and associated proteins affect gene expression and may lead to altered cellular states, including diseases.

We analyze a wide range of epigenomic sequencing data in order to gain deeper understanding of intra-cellular molecular mechanisms and to identify biomarkers for diseases.

Below we discuss common epigenomic data types and analyses, and present some of our past work involving epigenomic data analysis. To discuss your epigenomic bioinformatics needs, just leave us a message.

Epigenomic Assays

HIGH-THROUGHPUT ASSAYS FOR EPIGENOMIC PROFILING are numerous, and new protocols are being developed continuously. The most common epigenomic assays focus on DNA methylation, DNA-binding proteins, histone modifications, chromatin accessibility or the 3D conformation of the chromatin.

  • DNA methylation. DNA methylation assays based on bisulphite-treated DNA enable identifying methylation events at the highest resolution. Such assays use next-generation sequencing (whole-genome or reduced representation bisulphite-sequencing) or microarrays. An alternative approach, MeDIP-sequencing, relies on immunoprecipitation and suffers from lower resolution.
  • Transcription factor binding and histone modifications. Assays to identify DNA-bound proteins such as transcription factors, as well as chemical modifications to the histone proteins, make use of antibodies. ChIP-sequencing is the most common method, but newer alternatives with better resolution have been developed. These include ChIP-exo, ChIPmentation, CUT&RUN and CUT&Tag.
  • Chromatin accessibility. The gold standard assay for mapping regions of open chromatin is ATAC-sequencing. ATAC-seq has largely replaced previous methods such as DNase-seq and FAIRE-seq.
  • Chromatin conformation. The importance of the chromatin's three-dimensional conformation has gained particular appreciation recently. Chromatin conformation assays are used to study the physical interactions between genes and their distal regulatory elements as well as the proteins that cause such looping of the chromatin. Hi-C is a typical assay for the former, while ChIA-PET can be applied to the latter.

To study the epigenome's direct effect on gene expression, epigenomic measurements are often complemented with RNA-sequencing experiments in the same setting.

Single-cell experiments, particularly single-cell ATAC-sequencing, are increasingly performed as co-assays with single-cell RNA-sequencing. This yields gene expression and chromatin accessibility profiles from the same individual cells.

Peak Calling And Annotation

THE ANALYSIS WORKFLOW for most sequencing-based epigenomic data (particularly ChIP-seq, ATAC-seq and related experiments) involves identifying, annotating and analysing peaks, or genomic regions with signal of interest.

The raw sequencing reads are first quality-controlled and aligned to a reference genome, after which possible control libraries (pre-IP input and IP with non-specific antibody, in the case of ChIP-seq) are used to normalize the read coverage signal.

Peaks in the signal are identified using a peak caller tool. This phase may require careful parameter tuning to optimize the analysis for the protocol used.
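
To give a feel for what a peak caller does, the sketch below flags runs of genomic bins whose coverage is enriched over a control library. It is a deliberately simplified illustration, not how any production peak caller works: real tools fit statistical models to the read counts. The function name, the fold-enrichment rule and all parameter values are assumptions made for this example.

```python
def call_peaks(treatment, control, min_enrichment=2.0, min_bins=2, pseudocount=1.0):
    """Toy peak caller over binned read coverage.

    Returns (start_bin, end_bin) intervals, end exclusive, where the
    treatment/control ratio stays at or above `min_enrichment` for at
    least `min_bins` consecutive bins. All thresholds are illustrative.
    """
    if len(treatment) != len(control):
        raise ValueError("treatment and control must cover the same bins")

    peaks = []
    start = None
    for i, (t, c) in enumerate(zip(treatment, control)):
        # The pseudocount avoids division by zero in empty control bins.
        enriched = (t + pseudocount) / (c + pseudocount) >= min_enrichment
        if enriched and start is None:
            start = i                      # a new candidate peak opens
        elif not enriched and start is not None:
            if i - start >= min_bins:      # keep only sufficiently wide runs
                peaks.append((start, i))
            start = None
    # Close a peak that runs to the end of the region.
    if start is not None and len(treatment) - start >= min_bins:
        peaks.append((start, len(treatment)))
    return peaks


# Hypothetical binned coverage for an IP library and its input control.
treatment = [2, 3, 12, 15, 14, 3, 2, 9, 2, 11, 13]
control = [2, 2, 3, 3, 2, 2, 3, 2, 2, 2, 3]
peaks = call_peaks(treatment, control)
print(peaks)
```

Raising `min_bins` or `min_enrichment` makes the toy caller stricter, which mirrors the kind of parameter tuning mentioned above.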

To enable further analysis, peaks are annotated with relevant information such as read statistics, and near or overlapping features such as genes, regulatory elements and binding motifs.
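
As a minimal sketch of one annotation step, the example below assigns each peak to the nearest transcription start site (TSS) on the same chromosome. The gene names, coordinates and the function itself are hypothetical; dedicated annotation tools also take strand, gene models and regulatory element databases into account.

```python
from bisect import bisect_left


def annotate_nearest_gene(peaks, tss_by_chrom):
    """Annotate peaks with the nearest TSS.

    peaks: list of (chrom, start, end)
    tss_by_chrom: {chrom: list of (tss_position, gene_name), sorted by position}
    Returns (chrom, start, end, gene, offset) tuples, where offset is the
    TSS position minus the peak midpoint (None if the chromosome has no TSS).
    """
    annotated = []
    for chrom, start, end in peaks:
        midpoint = (start + end) // 2
        tss_list = tss_by_chrom.get(chrom, [])
        if not tss_list:
            annotated.append((chrom, start, end, None, None))
            continue
        positions = [pos for pos, _ in tss_list]
        idx = bisect_left(positions, midpoint)
        # Only the TSSs just left and right of the midpoint can be nearest.
        candidates = tss_list[max(idx - 1, 0): idx + 1]
        pos, gene = min(candidates, key=lambda item: abs(item[0] - midpoint))
        annotated.append((chrom, start, end, gene, pos - midpoint))
    return annotated


# Hypothetical TSS table and peak list.
tss_by_chrom = {"chr1": [(1000, "GENE_A"), (5000, "GENE_B")]}
peaks = [("chr1", 1200, 1400), ("chr1", 4000, 4500), ("chr2", 10, 20)]
annotated = annotate_nearest_gene(peaks, tss_by_chrom)
for row in annotated:
    print(row)
```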

Annotating peaks with genes enables gene set enrichment analyses for further interpretation of downstream effects.

Exploratory Analysis

ANNOTATED PEAKS ACROSS THE SAMPLE SET are visualized using PCA (and UMAP or t-SNE algorithms for single-cell data) and heatmaps. These visualizations help in optimizing the peak calling process and answer questions such as:

  • Do the biological replicates resemble each other with regard to their epigenomic profiles?
  • Do distinct sample groups (e.g., different tissues, treatments or time points) form separate clusters?
  • Are there outlier samples?
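
The first of these questions can be approximated with a simple pairwise correlation of peak signals, sketched below. This is a stand-in for the PCA and heatmap visualizations named above, not a replacement for them, and the sample names and signal values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Correlate every pair of samples; samples maps name -> peak signals."""
    names = list(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}


# Hypothetical normalized signal at the same four peaks in three samples.
samples = {
    "control_rep1": [10, 50, 5, 40],
    "control_rep2": [12, 48, 6, 42],
    "treated_rep1": [45, 8, 38, 9],
}
corr = correlation_matrix(samples)
print(round(corr[("control_rep1", "control_rep2")], 3))
print(round(corr[("control_rep1", "treated_rep1")], 3))
```

Replicates of the same condition should correlate strongly with each other, while a sample that correlates poorly with every other sample is a candidate outlier.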

Differential Peak Analysis

TO COMPARE DIFFERENT CONDITIONS, the identified peaks can be statistically compared — or, more commonly, differential peaks can be directly called from the respective read coverage signals.

Similar to differential gene expression analysis, differential peak analysis yields estimates of effect size and statistical significance for each peak. These statistics can be visualized as a volcano plot.
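
A volcano plot places each peak at its log2 fold change on the x-axis and the negative log10 of its p-value on the y-axis. The sketch below computes these coordinates and a simple label for each peak. The cutoffs, the labels and the input values are assumptions chosen for illustration, and the effect sizes and adjusted p-values are taken as given rather than estimated here.

```python
from math import log10


def volcano_points(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Turn (peak_id, log2_fold_change, adjusted_p) rows into plot points.

    Returns (peak_id, x, y, label), where x is the log2 fold change and
    y is -log10 of the adjusted p-value. Cutoffs are illustrative.
    """
    points = []
    for peak_id, log2_fc, p_adj in results:
        significant = p_adj < p_cutoff and abs(log2_fc) >= lfc_cutoff
        if not significant:
            label = "not significant"
        elif log2_fc > 0:
            label = "gained"
        else:
            label = "lost"
        # Floor the p-value so that a reported 0 does not break the log.
        y = -log10(max(p_adj, 1e-300))
        points.append((peak_id, log2_fc, y, label))
    return points


# Hypothetical differential peak statistics.
results = [
    ("peak_1", 2.5, 1e-6),
    ("peak_2", -1.8, 0.001),
    ("peak_3", 0.3, 0.0001),
    ("peak_4", 3.0, 0.2),
]
points = volcano_points(results)
for point in points:
    print(point)
```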

As genome-wide epigenomic measurements yield a continuous signal across the genome, such analyses may also focus on specific regions of interest, such as promoters or known binding sites of a protein of interest. Density heatmaps are used to visualize the signal at sites of interest in different conditions.

Furthermore, overlapping binding motifs at the peaks can be statistically compared between conditions and visualized as volcano plots.

Transcription Factor Binding Site Analyses

CHIP-SEQ AND RELATED PROTOCOLS can be used to identify transcription factor (TF) binding sites across the genome. Such assays rely on antibodies specific to the protein of interest, and this approach thus enables identifying binding sites of just one TF. ATAC-seq data, on the other hand, can be used to identify binding sites of all DNA-bound proteins in parallel, through an analysis called TF footprinting.

In TF footprinting, narrow drops in the chromatin accessibility signal are interpreted as protein binding sites. The identity of the TF may be indirectly inferred from binding motifs. Coupled with RNA-seq data, TF footprinting can be used to study the combined effects of TFs on gene expression in a very high-throughput manner.
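
The idea of a footprint as a narrow drop in accessibility can be sketched as a comparison between a short window and its flanking regions. The example below is a toy heuristic under stated assumptions: the window width, flank size and ratio threshold are arbitrary, and dedicated footprinting tools additionally correct for enzyme sequence bias and use statistical scoring.

```python
def find_footprints(signal, width=3, flank=3, max_ratio=0.5):
    """Find candidate footprints in a per-base accessibility signal.

    A window of `width` bases is reported (by its 0-based start) when its
    mean signal is at most `max_ratio` times the mean of the weaker of the
    two flanking regions. All parameters are illustrative.
    """
    def mean(values):
        return sum(values) / len(values)

    hits = []
    for start in range(flank, len(signal) - width - flank + 1):
        center = mean(signal[start:start + width])
        left = mean(signal[start - flank:start])
        right = mean(signal[start + width:start + width + flank])
        weaker_flank = min(left, right)
        # Require accessible flanks, so closed chromatin is not reported.
        if weaker_flank > 0 and center <= max_ratio * weaker_flank:
            hits.append(start)
    return hits


# Hypothetical cut counts: open chromatin with a protected stretch at 4-6.
signal = [9, 10, 11, 10, 2, 1, 2, 10, 11, 9, 10]
footprints = find_footprints(signal)
print(footprints)
```

Each reported position would then be scanned for known binding motifs to suggest which TF occupies the site.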

DNA Methylation Data Analysis

THE ANALYSIS OF DNA METHYLATION DATA starts with the quality control and alignment of sequencing reads (or QC and normalization of array data), and proceeds to calling the methylated sites.

Detected methylated sites are used to identify larger regions of DNA methylation or differentially methylated regions (DMRs) between samples. These regions can be annotated in the same way as peaks in other epigenomic data.
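
The step from per-site methylation calls to DMRs can be illustrated as follows. For bisulphite sequencing, the methylation level of a site is commonly expressed as the fraction of reads supporting methylation. The sketch then groups consecutive sites that differ between two samples in the same direction. The thresholds and read counts are invented, and the grouping rule is a simplification: real DMR callers also consider genomic distance, coverage and statistical testing across replicates.

```python
def methylation_levels(counts):
    """counts: {position: (methylated_reads, total_reads)} -> {position: fraction}."""
    return {pos: m / n for pos, (m, n) in counts.items() if n > 0}


def call_dmrs(levels_a, levels_b, min_delta=0.3, min_sites=3):
    """Group consecutive sites whose methylation differs between two samples.

    Returns (first_position, last_position, mean_difference) for each run of
    at least `min_sites` shared sites that change by `min_delta` or more in
    the same direction. Thresholds are illustrative.
    """
    shared = sorted(set(levels_a) & set(levels_b))
    dmrs, run = [], []

    def flush():
        # Store the current run if it is long enough, then start over.
        if len(run) >= min_sites:
            mean_delta = sum(d for _, d in run) / len(run)
            dmrs.append((run[0][0], run[-1][0], round(mean_delta, 3)))
        run.clear()

    for pos in shared:
        delta = levels_b[pos] - levels_a[pos]
        if abs(delta) < min_delta:
            flush()                      # an unchanged site ends the run
            continue
        if run and (delta > 0) != (run[-1][1] > 0):
            flush()                      # a change of direction ends the run
        run.append((pos, delta))
    flush()
    return dmrs


# Hypothetical (methylated, total) read counts per CpG position.
control = methylation_levels({100: (2, 20), 120: (3, 20), 150: (2, 20), 400: (10, 20), 430: (18, 20)})
treated = methylation_levels({100: (16, 20), 120: (15, 20), 150: (18, 20), 400: (11, 20), 430: (17, 20)})
dmrs = call_dmrs(control, treated)
print(dmrs)
```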

Possible downstream analyses for DNA methylation data include:

  • Integration with gene expression data. When RNA-seq or other gene expression data is available from the same setting, the association of promoter methylation and gene expression can be studied.
  • Epigenetic biomarker discovery. DNA methylation data from patient samples enables discovering clinically relevant epigenetic markers.
  • Biological age analysis. Epigenetic models of biological aging have been developed for DNA methylation data. Such models can be used to estimate the biological, as opposed to chronological, age of an individual or specific tissue within an individual.
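
Many published epigenetic age models are penalized linear regressions over a selected set of CpG sites, so applying one amounts to a weighted sum of methylation levels. The sketch below shows only that general form. The CpG identifiers, weights and intercept are invented placeholders and do not come from any published clock; some clocks also apply a transformation to the result, which is omitted here.

```python
def predict_epigenetic_age(betas, weights, intercept):
    """Apply a linear age model: intercept + sum(weight * methylation level).

    betas: {cpg_id: methylation level between 0 and 1} for one sample
    weights: {cpg_id: coefficient} of the (here hypothetical) model
    """
    missing = [cpg for cpg in weights if cpg not in betas]
    if missing:
        raise KeyError(f"sample lacks model CpG sites: {missing}")
    return intercept + sum(w * betas[cpg] for cpg, w in weights.items())


# Placeholder model coefficients, NOT taken from any real clock.
toy_weights = {"cpg_A": 40.0, "cpg_B": -25.0, "cpg_C": 10.0}
toy_intercept = 30.0

# Hypothetical methylation levels for one sample.
sample_betas = {"cpg_A": 0.5, "cpg_B": 0.2, "cpg_C": 0.8}
predicted_age = predict_epigenetic_age(sample_betas, toy_weights, toy_intercept)
print(predicted_age)
```

Comparing the predicted value with the chronological age of the donor gives the age difference that such analyses typically report.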

Integrating RNA-Seq And Epigenomic Data

PERFORMING RNA-SEQ AND EPIGENOMIC SEQUENCING (such as ChIP-seq or ATAC-seq) on the same samples enables integrative analyses to study gene regulatory programs genome-wide.

Regulatory connections can be identified between enhancers and their target genes, as well as transcription factors and their targets, building on evidence from both gene expression and the epigenomic status of regulatory elements.

PROTEOMIC AND METABOLOMIC DATA ANALYSIS

Proteomics and metabolomics reveal the functional state of a biological system.

COMPUTATIONAL ANALYSIS OF PROTEINS AND METABOLITES addresses fundamental questions of biochemistry: Which reactions take place? What is being built? How is energy produced and used?

While transcriptomics is commonly used to infer activities of signaling and metabolic pathways, proteomics and metabolomics give a more direct view into the key molecules of such pathways and individual reactions.

Proteins and metabolites are typically identified and quantified by means of mass spectrometry (MS). Other methods, relying on antibodies (for proteins) or nuclear magnetic resonance (NMR; for metabolites), provide lower-throughput or less quantitative data, often at a lower cost than MS.

In addition to pathway analyses, proteomic and metabolomic data from patient samples are particularly suitable for biomarker discovery.

Scroll down to learn more and to browse our references in proteomics, metabolomics and lipidomics.

Proteomic Data Analysis

BIOINFORMATIC ANALYSIS OF PROTEOMES starts with identifying proteins and quantifying their abundances — absolute or relative, depending on the experiment.

As with the analysis of any gene expression data, the next step is an exploratory analysis to study the variance and grouping of the data set with principal component analysis (PCA) or similar dimensionality reduction approaches.

More focused analyses may consist of differential expression and pathway analyses to characterize differences between samples.

These analyses can also be performed on proteins enriched for specific post-translational modifications such as phosphorylation. Quantitative data from the total proteome and enriched subset (e.g., phosphoproteome) can be processed in parallel or in an integrative manner to gain a more detailed view of pathway activities.
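
One simple form of such integration is to correct the change observed at a phosphosite for the change in the abundance of its parent protein, so that the remaining difference reflects phosphorylation rather than protein level. The sketch below illustrates this on a log2 scale. The intensities are invented, and the function names are assumptions made for this example, not part of any specific proteomics package.

```python
from math import log2


def log2_fold_change(control, treated):
    """Log2 ratio of mean intensities (treated over control)."""
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return log2(mean_treated / mean_control)


def protein_corrected_change(site_lfc, protein_lfc):
    """Phosphosite change that is not explained by protein abundance."""
    return site_lfc - protein_lfc


# Hypothetical intensities for one protein and one of its phosphosites.
protein_lfc = log2_fold_change([100, 120, 110], [210, 230, 220])
site_lfc = log2_fold_change([10, 12, 8], [80, 75, 85])
corrected = protein_corrected_change(site_lfc, protein_lfc)
print(protein_lfc, site_lfc, corrected)
```

In this toy case the protein doubles while the phosphosite signal rises eightfold, leaving a fourfold (log2 of 2) increase attributable to phosphorylation itself.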

Metabolomic Data Analysis

THE METABOLOME consists of an almost endless catalogue of endogenous and exogenous small molecules that partake in reactions in an organism.

While proteomics studies the catalysts of these reactions, metabolomics is concerned with their substrates, intermediates and products.

As with proteomics, high-throughput metabolomic data is typically used either to quantify and study metabolic pathways or to identify clinically relevant molecules such as biomarkers. Thus, the exploratory and statistical analyses are very similar to those of proteomics.

A special case of metabolomics, lipidomics focuses on the vast diversity of lipid molecules in an organism. Lipidomic analyses often aim at characterizing the (dys)function of lipid metabolism and trafficking, particularly in metabolic diseases.