Metagenomics technology and microbial community diversity analysis methods

Most microorganisms in nature cannot be grown in pure culture under laboratory conditions, so the methods of traditional microbiology limit research on environmental microorganisms. The rapid development of high-throughput omics technologies has given researchers an unprecedented understanding of the complex microbial communities in various ecosystems.

Starting from amplicon sequencing and metagenomic sequencing, this paper introduces the basic analysis workflow of metagenomics in microbial community detection. It points out that using big data analysis techniques to overcome the difficulties of metagenomic data analysis, and presenting the analysis results in a more understandable form, are the focus and difficulty of future research.

The emergence of omics technology has made it possible to detect and analyze environmental microorganisms and their functions at the molecular level, providing an effective way to understand the complete picture of environmental microorganisms.

Microbiomics is usually a general term for systems biology techniques and methods such as metagenomics, metatranscriptomics, metaproteomics, and metametabolomics. These approaches address scientific questions about the microbial community as a whole, as well as the association between community structure and ecosystems.

Among them, metagenomics based on high-throughput sequencing technology is currently the most critical and mature omics method, and also provides a research basis for other omics research.

Metagenomics Sequencing Analysis Process

Metagenome refers to the sum of the genomes of all microorganisms in a specific environment. Metagenomics research studies the genetic, functional, and ecological characteristics of microbial communities by directly analyzing the DNA of microorganisms in the environment.

Current metagenomic research relies heavily on high-throughput sequencing technologies, including amplicon sequencing and metagenomic sequencing.

Amplicon sequencing mainly targets ribosomal RNA genes (rDNA) and functional genes. The former amplifies molecular markers such as bacterial or archaeal 16S rDNA, fungal 18S rDNA, and internal transcribed spacer (ITS) sequences, while the latter amplifies specific functional genes of certain microorganisms.

Metagenome sequencing sequences all of the DNA in the environment. Its sequencing cost is high, and the computational resources required for downstream data analysis are also relatively high.

In contrast, amplicon technology has become the main method of environmental microbiomics research due to its low cost of sequencing and analysis.

Amplicon Sequencing Analysis Workflow

Random sampling in environmental microbial community studies leads to low reproducibility of amplicon analyses. This can be partly compensated for by increasing the number of biological replicates and by deleting sequences that occur only once in a single sample.
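The filtering step mentioned above can be sketched in a few lines of Python. This is only an illustration on a hypothetical toy table (the sequence and sample names are invented), and it assumes one common reading of the rule: a sequence is dropped when its total count across all samples is exactly 1.

```python
from typing import Dict

# Hypothetical toy table: sequence ID -> {sample name -> read count}.
Table = Dict[str, Dict[str, int]]


def drop_singletons(table: Table) -> Table:
    """Remove sequences whose total count across all samples is exactly 1."""
    return {
        seq_id: counts
        for seq_id, counts in table.items()
        if sum(counts.values()) > 1
    }


table = {
    "seq_a": {"S1": 120, "S2": 95},
    "seq_b": {"S1": 1, "S2": 0},  # singleton: seen once, in one sample
    "seq_c": {"S1": 0, "S2": 4},
}
filtered = drop_singletons(table)
print(sorted(filtered))
```

Other thresholds (for example, a minimum count per sample) are also used in practice; the cut-off here is an assumption, not a rule taken from the paper.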

Researchers have defined a number of indices to quantify biodiversity and to provide a way of comparing the biodiversity of different communities.

After the rise of microbial ecology, many research methods and means of macroecology have been gradually applied to the study of microbial ecology, providing new research ideas for it.

Many analysis methods exist for microbial amplicon sequencing, and their workflows differ.

Taking 16S rDNA sequencing as an example, the main methods and processes of amplicon sequencing analysis

As sequencing technology has evolved, the Illumina sequencing platform has come to occupy most of the market for microbial amplicon sequencing; currently, a paired-end 250 bp sequencing strategy is most commonly used.

Sequences assembled under this scheme can be used as starting files for selecting representative sequences.

At present, OTUs (operational taxonomic units) and ASVs (amplicon sequence variants) are the two main forms of representative sequences.

Metagenome sequencing analysis process

Shotgun sequencing of the total genomic DNA of microbial communities has become increasingly common in recent years, as sequencing technologies continue to improve in throughput and read length and their costs continue to fall.

The large amount of metagenomic sequencing data requires more specialized algorithms and software to process and analyze. Metagenomics analysis often has a set of general procedures.

Common analysis workflows for metagenomic sequencing analysis

Although metagenomic sequencing analyses follow a similar workflow, the large volume of sequencing data means that standard tools for unified processing are still lacking. Analysis tools and methods vary greatly in performance and speed, especially across different types of microbiome data, and often need to be adjusted for the individual data set.

With the spread of third-generation sequencing technology, software for each step of metagenomic data analysis is developing rapidly, both domestically and internationally.

Microbial community diversity analysis methods

In microbial ecology, microbial diversity is divided into levels according to the scale at which species are described, usually taxonomic diversity, phylogenetic diversity, genetic diversity, and functional diversity.

Taxonomic diversity and functional diversity are often measured by analyzing the distribution of taxa, functional genes, or pathways across environments; phylogenetic diversity is measured by calculating how closely related different taxa are at the phylogenetic level; and genetic diversity has to be described using finer-scale omics techniques.

Quantitative description of microbial diversity

Following the conventions of macroecology, diversity is often divided into three categories by spatial scale: α-diversity describes the diversity within a local community or patch, β-diversity describes the differences in species between communities (or across a whole landscape), and γ-diversity describes diversity at larger regional scales.

For the study of microbial ecology, diversity analysis often focuses on alpha-diversity and beta-diversity.

Because of the randomness of sampling and sequencing, the analysis results cannot fully reflect the true state of the community. In species accumulation curves for this type of data, the number of sequences increases linearly with sample size, while the number of observed species accumulates at a decreasing, roughly logarithmic rate.

Rarefaction allows species accumulation curves to be compared across different sample sizes. The accumulation curve drawn with this method is called a rarefaction curve: the percentage composition of OTUs in the sample is held constant, and the species accumulation curve is constructed for samples with the same OTU composition but different sample sizes.

When the tail of a sample's rarefaction curve flattens out, sampling and sequencing of that sample are generally considered to be approximately complete.

The disadvantage of rarefaction is that information about rare species, among other things, is distorted. It is therefore generally held that a rarefaction curve works well only when the species in the sample follow a random or uniform distribution.

After obtaining amplicon data, and when calculating α-diversity from them, it is generally necessary to construct rarefaction curves and resample (rarefy) the OTU table to reduce the influence of sample size on comparisons between diversity indices.
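The resampling step described above (subsampling each sample's reads to a common depth and tracking how many OTUs are observed) can be illustrated with a short Python sketch. The function names, the toy counts, and the fixed random seed are all assumptions made for this example; it draws reads without replacement, which is one common way of doing it rather than the paper's stated procedure.

```python
import random
from typing import Dict, List


def rarefy(counts: Dict[str, int], depth: int, seed: int = 0) -> Dict[str, int]:
    """Subsample `depth` reads without replacement from one sample's OTU counts."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand the counts into one label per read, then draw without replacement.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    picked = random.Random(seed).sample(reads, depth)
    subsample: Dict[str, int] = {}
    for otu in picked:
        subsample[otu] = subsample.get(otu, 0) + 1
    return subsample


def rarefaction_curve(counts: Dict[str, int], depths: List[int], seed: int = 0) -> List[int]:
    """Observed OTU richness at each subsampling depth (one draw per depth)."""
    return [len(rarefy(counts, depth, seed)) for depth in depths]


# Hypothetical sample with 1000 reads spread over five OTUs.
sample = {"OTU1": 500, "OTU2": 300, "OTU3": 150, "OTU4": 45, "OTU5": 5}
sub = rarefy(sample, 200)
curve = rarefaction_curve(sample, [10, 50, 200, 1000])
print(sub, curve)
```

A smoother curve would average many random draws per depth; a single draw is used here only to keep the sketch short.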

Quantitative description of alpha-diversity

α-Diversity is mainly quantified in terms of species richness and the distribution of species abundances. Taking OTUs as an example, the number of observed OTUs (Sobs) in the OTU table can be used as the observed index of species richness.

In addition to the indices in the above table, the Hill number is an important index for describing community α-diversity. It is a family of diversity indices that integrates relative abundance and species richness while avoiding some of the defects of other indices.

Hill numbers obey the replication principle: when two equally diverse communities that share no species are pooled in equal proportions, the Hill number of the pooled community equals the sum of the Hill numbers of the two communities.
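As a small illustration, the Hill number of order q can be computed from relative abundances p_i as (sum of p_i^q)^(1/(1-q)), with the q = 1 case taken as the exponential of Shannon entropy. The sketch below uses this standard definition; the function name and the toy communities are assumptions for the example. It pools two communities with identical abundance structure and no shared OTUs, so the pooled value should be twice that of either community.

```python
import math
from typing import Sequence


def hill_number(counts: Sequence[float], q: float) -> float:
    """Hill number (effective number of species) of order q."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


# Two hypothetical communities with the same abundance structure
# but entirely different OTUs; pooling simply concatenates the counts.
community_a = [50, 30, 20]
community_b = [50, 30, 20]
pooled = community_a + community_b

for q in (0, 1, 2):
    print(q, hill_number(community_a, q), hill_number(pooled, q))
```

Order q = 0 gives richness, q = 1 the exponential of the Shannon index, and q = 2 the inverse Simpson index, so larger q puts more weight on abundant OTUs.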

Quantitative description of beta-diversity

Beta-diversity is concerned with the similarity or dissimilarity among multiple microbial communities or samples.

In the quantitative description of β-diversity, complementarity is an important perspective. It refers to the number of species found in one of two samples but not in the other. The more complementary two samples are, the higher their β-diversity.

The calculation, description, and extension of complementarity mostly mirror the corresponding rules of set theory, and complementarity can be visualized with Venn diagrams. Likewise, the similarity or dissimilarity between samples is calculated from the species they share and the species unique to each.

The dissimilarity between samples can be measured by a distance index. For the OTU table, the matrix formed by the pairwise distances of all samples is called the distance matrix or dissimilarity matrix.

The data in a typical OTU table represent the number of sequences under each OTU in each sample; that is, each sample carries not only information on which OTUs are present but also the abundance of each OTU. This type of data is often referred to as quantitative data. Another type of data, which does not contain abundance information for each OTU, is often referred to as presence-absence data, also known as 1-0 data.

There are many commonly used similarity-dissimilarity indices, and each form of index is calculated differently for quantitative data and presence-absence data.

Among them, the Jaccard distance is a typical distance index for presence-absence data; it was originally proposed as a similarity coefficient. Compared with the Jaccard index, the Sørensen index gives more weight to the OTUs shared by the two samples.

As for the “double-zero problem” in ecological data analysis (an OTU being absent from both samples), double zeros do not enter the calculation of the Jaccard and Sørensen indices, so these are called asymmetric indices.
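A short Python sketch may help make the two presence-absence indices concrete. It uses the standard set-based definitions (Jaccard = shared / union, Sørensen = 2 × shared / sum of the two richness values); the sample data and function names are hypothetical. Note that OTU4, absent from both samples, never enters either calculation.

```python
from typing import Dict, Set


def presence(counts: Dict[str, int]) -> Set[str]:
    """Convert quantitative counts to presence-absence (set of detected OTUs)."""
    return {otu for otu, n in counts.items() if n > 0}


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    union = a | b
    # Convention assumed here: two empty samples are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def sorensen_similarity(a: Set[str], b: Set[str]) -> float:
    size = len(a) + len(b)
    # Shared OTUs are counted twice, which is what gives them extra weight.
    return 2 * len(a & b) / size if size else 1.0


# Hypothetical samples; OTU4 is a double zero.
s1 = {"OTU1": 10, "OTU2": 0, "OTU3": 5, "OTU4": 0}
s2 = {"OTU1": 3, "OTU2": 7, "OTU3": 0, "OTU4": 0}

a, b = presence(s1), presence(s2)
jaccard_distance = 1 - jaccard_similarity(a, b)
sorensen_distance = 1 - sorensen_similarity(a, b)
print(jaccard_distance, sorensen_distance)
```

Here the samples share one OTU out of three detected in total, so the Jaccard similarity is 1/3 while the Sørensen similarity is 2/4.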

Bray-Curtis dissimilarity is a distance index for quantitative data, and abundance information enters the calculation. It is also an asymmetric index, because double zeros likewise do not contribute to it.
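For quantitative data, Bray-Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the sum of all abundances in both samples. The following sketch applies that standard formula to a hypothetical three-sample OTU table and builds the pairwise dissimilarity matrix; the sample names and helper functions are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def bray_curtis(x: Sequence[float], y: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def distance_matrix(samples: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Pairwise Bray-Curtis dissimilarities, in the order of the dict keys."""
    names = list(samples)
    return [[bray_curtis(samples[r], samples[c]) for c in names] for r in names]


# Hypothetical OTU table: each vector holds counts for OTU1..OTU4.
otu_table = {
    "S1": [10, 0, 5, 0],
    "S2": [3, 7, 0, 0],
    "S3": [0, 0, 0, 8],
}
matrix = distance_matrix(otu_table)
for name, row in zip(otu_table, matrix):
    print(name, [round(v, 3) for v in row])
```

Dropping an OTU that is zero in both samples changes neither the numerator nor the denominator, which is why the index is unaffected by double zeros.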

Faith proposed the concept of phylogenetic diversity (PD) and defined it as the total branch length of the minimum path connecting the observed species on the phylogenetic tree. PD takes into account the differences between species at the evolutionary level, and so also reflects information such as species' phenotypic traits and ecological niches.

Phylogenetic α-diversity is calculated from Faith’s basic definition of PD. For phylogenetic β-diversity, the UniFrac index is commonly used; it calculates the dissimilarity between communities from the shared and unique phylogenetic structure (branches) they contain.
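Both measures reduce to sums of branch lengths, which a toy example can show. The sketch below is a simplified illustration, not a reference implementation: the tree, its node names, and its branch lengths are invented, the tree is stored as a child-to-parent map, PD is taken to include the path to the root (a common convention, assumed here), and only the unweighted form of UniFrac is shown.

```python
from typing import Dict, Iterable, Set, Tuple

# Rooted tree as: node -> (parent, length of the branch leading to the node).
Tree = Dict[str, Tuple[str, float]]


def covered_branches(tree: Tree, tips: Iterable[str]) -> Set[str]:
    """Nodes whose parent branch lies on a path from any given tip to the root."""
    covered: Set[str] = set()
    for tip in tips:
        node = tip
        # Stop at the root, or as soon as we reach an already-covered branch.
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def faith_pd(tree: Tree, tips: Iterable[str]) -> float:
    """Sum of branch lengths spanned by the observed tips (root included)."""
    return sum(tree[n][1] for n in covered_branches(tree, tips))


def unweighted_unifrac(tree: Tree, tips_a: Iterable[str], tips_b: Iterable[str]) -> float:
    """Branch length unique to one community / branch length seen in either."""
    a, b = covered_branches(tree, tips_a), covered_branches(tree, tips_b)
    unique = sum(tree[n][1] for n in a ^ b)
    total = sum(tree[n][1] for n in a | b)
    return unique / total


# Hypothetical tree: ((A:1, B:1)n1:0.5, C:1.5)n2:0.5 and D:2 under the root.
tree: Tree = {
    "A": ("n1", 1.0), "B": ("n1", 1.0), "n1": ("n2", 0.5),
    "C": ("n2", 1.5), "n2": ("root", 0.5), "D": ("root", 2.0),
}
pd_ab = faith_pd(tree, ["A", "B"])
unifrac_ab_cd = unweighted_unifrac(tree, ["A", "B"], ["C", "D"])
print(pd_ab, unifrac_ab_cd)
```

Two communities with identical membership have a UniFrac distance of 0, and communities on entirely separate branches approach 1.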

Analytical methods for community structure

Most microbial ecology studies focus on how microbial communities change across habitats or along environmental gradients. A high-throughput amplicon data set is a set of observations of microorganisms across multiple samples in space and time.

Data sets generated by such studies are now mostly analyzed using multivariate statistical methods.

Exploratory methods

Exploratory methods reveal the main gradients of sample variation and the degree of similarity between samples, but even if the samples show some regularity after analysis, that pattern still needs to be verified.

Principal component analysis (PCA) is one of the most common and widely used multivariate statistical methods. Mathematically, PCA is a dimensionality reduction procedure. PCA uses Euclidean distance to measure the difference between samples, but when the gradient covered by the samples is too long (that is, many OTUs are absent from many of the samples, so samples at opposite ends of the gradient share few OTUs), problems such as the horseshoe effect appear.

Correspondence analysis (CA) is often used to measure the differences between sample communities reflected in the OTU data. CA circumvents the horseshoe effect, but CA ordinations are often accompanied by an arch effect; detrended correspondence analysis (DCA) can be used to minimize it.

Principal coordinate analysis (PCoA) is conceptually derived from PCA and follows the same basic idea of dimensionality reduction, compressing and projecting the sample space into a low-dimensional space. Because it works from the pairwise dissimilarity matrix between samples, the PCoA ordination axes have no direct relationship to the original variables, but the variance they explain can still be obtained from the corrected eigenvalues of the dissimilarity matrix.
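The core of PCoA is short enough to sketch directly, assuming NumPy is available. The example below follows the classical procedure (double-centering the squared distances, then an eigendecomposition); the function name and toy matrix are assumptions. It simply ignores negative eigenvalues when computing explained variance rather than applying a formal eigenvalue correction, which is a simplification relative to the corrected eigenvalues mentioned above.

```python
import numpy as np


def pcoa(dist, n_axes: int = 2):
    """Classical PCoA: return sample coordinates and explained-variance ratios."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centering of the squared distance matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Coordinates are eigenvectors scaled by the square root of the
    # (non-negative part of the) eigenvalues.
    scale = np.sqrt(np.clip(eigvals[:n_axes], 0.0, None))
    coords = eigvecs[:, :n_axes] * scale
    # Simplification: share of each axis among the positive eigenvalues only.
    explained = eigvals[:n_axes] / eigvals[eigvals > 0].sum()
    return coords, explained


# Hypothetical distances between three samples lying on a line at 0, 1 and 3.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
coords, explained = pcoa(d)
print(coords[:, 0], explained)
```

Because these toy distances are exactly Euclidean in one dimension, the first axis recovers them and carries essentially all of the variance; real Bray-Curtis or UniFrac matrices usually spread it over several axes.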

Non-metric multidimensional scaling (NMDS) is a special ordination method. The ordination is usually iterated to obtain the smallest possible stress value (a quantitative measure of how much the dissimilarities between the original samples have been distorted); a stress value below 0.15 is generally considered acceptable. NMDS preserves only the rank order of the original dissimilarities between samples, not their magnitudes, and the ordination axes do not explain any share of the variance in sample dissimilarity, so no percentage of variance explained can be assigned to the axes of an NMDS plot.

Explanatory methods

When analyzing the differences in microbial communities between sample groups, the environmental factors that cause these differences are often also of interest: the differences in microbial communities are treated as response variables (dependent variables), and the environmental factors as explanatory variables (independent variables).

Thus, explanatory methods add a set of explanatory variables to the exploratory methods. The component of an explanatory variable on each ordination axis represents that variable’s contribution to the distribution of the samples along that axis.

Redundancy analysis (RDA) and canonical correspondence analysis (CCA) are two typical explanatory ordination methods.

RDA can be regarded as an extension of PCA. Once explanatory variables are added, the ordination axes (principal components) are constrained to be linear combinations of the explanatory variables. Like PCA, RDA is not suitable for data sets with long sample gradients.

CCA is a better choice when RDA is not applicable; it is a canonical form of correspondence analysis in which explanatory variables constrain the response variables.

RDA and CCA are displayed in the same way as a PCA ordination plot, with added vectors (quantitative variables) or points (categorical variables) representing the explanatory variables.

Statistical test methods

Common methods for statistically testing differences between sample groups are ANOSIM (analysis of similarities), PERMANOVA (permutational multivariate analysis of variance), and MRPP (multi-response permutation procedure).
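To give a sense of how such permutation tests work, here is a minimal one-factor PERMANOVA sketch in plain Python. It computes the pseudo-F statistic from a distance matrix by partitioning sums of squared distances, then shuffles the group labels to build a null distribution. The function names, the toy distances, the group labels, and the seed are assumptions for illustration; established packages should be preferred for real analyses.

```python
import random
from typing import List, Sequence, Tuple


def pseudo_f(dist: Sequence[Sequence[float]], groups: Sequence[str]) -> float:
    """Pseudo-F statistic from total and within-group sums of squared distances."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for label in labels:
        idx = [i for i, g in enumerate(groups) if g == label]
        pairs = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(dist, groups: Sequence[str], n_perm: int = 999, seed: int = 0) -> Tuple[float, float]:
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, groups)
    shuffled: List[str] = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)


# Hypothetical data: 4 "soil" and 4 "water" samples, close within a group
# (distance 0.2) and far apart between groups (distance 0.9).
groups = ["soil"] * 4 + ["water"] * 4
dist = [[0.0 if i == j else (0.2 if groups[i] == groups[j] else 0.9)
         for j in range(8)] for i in range(8)]
f_obs, p_value = permanova(dist, groups)
print(f_obs, p_value)
```

With so few samples the number of distinct label arrangements is small, which puts a floor on the achievable p-value; this is one reason biological replication matters for these tests.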

However, traditional correlation coefficient tests often perform poorly when testing the correlation between environmental factors and the differences between sample communities.

Variation partitioning analysis (VPA) uses the idea of partial analysis to divide the total variance in a response data set into the independent and joint contributions of the explanatory variables. After controlling for the other factors, the contribution of each environmental factor to the differences between communities can be further explained.

Conclusion

The combined application of multi-omics technologies has gradually become an important means of understanding environmental microbial communities and their functions. Through omics technologies, researchers have gradually realized that the phylogenetic diversity and functional complexity of microorganisms living in soil, freshwater, seawater, air, and even the human body far exceed previous understanding.

At present, how to use rapidly developing big data analysis techniques to overcome the difficulties of metagenomic data analysis, and how to present the analysis results in a form that is easier to understand and work with, is a challenge shared with researchers in statistics.

The full text of the paper is published in “Science and Technology Herald” Issue 3, 2022

