Introduction
The genes encoding the RNA component of the small subunit of ribosomes, commonly known as 16S rRNA in bacteria and archaea, are among the most conserved across all kingdoms of life. Nevertheless, they contain regions that are less evolutionarily constrained and whose sequences are indicative of their phylogeny. Amplifying these genomic regions by PCR from an environmental sample and then sequencing a sufficiently large number of individual amplicons allows the diversity of clades in the sample to be analysed and their relative abundance to be roughly estimated. The analytical process is known as "16S rDNA diversity analysis" and is the focus of this SOP.
The SOP describes the essential steps for processing 16S rRNA gene sequences. The procedure and tools are only recommendations; it is up to the user to evaluate what works best for their needs.
Glossary of Terms and Jargon
- 16S rRNA gene
  - The gene that codes for 16S ribosomal RNA. The gene is used in the reconstruction of phylogenies.
- Barcodes
  - Short nucleotide sequences added to the ends of the DNA fragments that are to be sequenced. They allow sample indexing so that several DNA libraries can be pooled in one sequencing lane.
- Variable region
  - 16S rRNA gene sequences contain hypervariable regions that can provide clade-specific signature sequences useful for bacterial identification.
- Demultiplexing
  - The process of binning reads based on their barcodes, mainly used to split reads between samples.
- Operational taxonomic unit (OTU)
  - An operational taxonomic unit is an operational definition of a species or group of species, often used when only DNA sequence data are available.
- Rarefaction analysis
  - Rarefaction is a process used to estimate the true diversity of a sample by randomly subsampling its sequences. The analysis estimates diversity for subsets of different sizes and extrapolates to determine whether the sequencing depth achieved (the number of reads per sample) is sufficient to capture the diversity within a sample. A plot is generated that shows the number of species (or another diversity metric) as a function of the number of reads sampled. A curve that approaches an asymptote indicates that no further diversity would be expected if the sequencing depth were increased.
- Phred quality score or Q score
  - A measure of sequencing accuracy, logarithmically related to the probability that the sequencer called a base incorrectly. Example: a Phred score of 30 (Q30) means that the probability of an incorrect call for that base is 1 in 1000 and the base call accuracy is 99.9%. Base call accuracy is 90% at Q10, 99% at Q20, 99.9% at Q30, 99.99% at Q40 and 99.999% at Q50.
- Adapter
  - A platform-specific nucleotide sequence added to the ends of the DNA molecule to facilitate sequencing. For example, in Illumina sequencing the adapter facilitates binding to the complementary target sequences immobilised on the flow cell.
- Chimera
  - A PCR artifact. Chimeras can form during PCR when an incompletely extended DNA fragment anneals to a different template and is extended in a later cycle, producing a sequence derived from two parents.
- Alpha diversity
  - Diversity within a single sample. Diversity can be characterised using the number of different species (richness), the abundance of each species (evenness), indices that combine richness and evenness, and divergence-based methods (phylogenetic diversity).
- Beta diversity
  - Comparison of diversity between samples.
- UniFrac
  - A beta diversity distance metric based on the phylogenetic distance between members of the communities/samples. UniFrac captures the total amount of evolution that is unique to each sample.
- ASV
  - Amplicon sequence variant
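The relationship between Phred scores and error probabilities given in the glossary can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the quality offset of 33 assumes the Sanger / Illumina 1.8+ FASTQ encoding.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1): Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_line: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality_line]


# Reproduce the accuracy table from the glossary entry
for q in (10, 20, 30, 40, 50):
    print(f"Q{q}: accuracy {100 * (1 - phred_to_error_prob(q)):.3f}%")
```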
Schematic workflow of analysis
Figure 1. Steps in the 16S analysis workflow
Phase 1: Preprocessing of reads
It is essential to preprocess raw reads before submitting them to subsequent analysis. Preprocessing includes removing low-quality bases, ambiguous bases and adapter sequences, stitching paired reads together, and detecting chimeric reads. Sequencing errors, ambiguous base calls and chimeras can produce spurious OTUs if they are not removed.
Input: Raw reads (multiplexed or demultiplexed)
Output: High-quality reads ready for OTU picking
QC plots and statistics
The first step of data preprocessing is to check the quality of the bases in all reads. Once the quality spectrum of the reads is understood, the parameters for trimming low-quality bases can be chosen. If the raw reads are not demultiplexed, demultiplex them before continuing with the next step. Illumina tools, mothur or QIIME can perform the demultiplexing.
Software: FastQC, MultiQC, PRINSEQ, SolexaQA
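To illustrate what demultiplexing does, the sketch below bins reads by a barcode at the start of each read. It assumes in-line barcodes at the 5' end and exact matching only; the sample names and barcodes are made up, and production tools additionally handle separate index reads and barcode error correction.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Bin (sequence, quality) reads by the barcode at the start of each read.

    Only exact barcode matches are accepted. Matched reads have the barcode
    stripped; unmatched reads are kept unchanged under "unassigned".
    """
    bins = defaultdict(list)
    for seq, qual in reads:
        sample = barcode_to_sample.get(seq[:barcode_len])
        if sample is None:
            bins["unassigned"].append((seq, qual))
        else:
            bins[sample].append((seq[barcode_len:], qual[barcode_len:]))
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = [
    ("ACGTTTGACC", "IIIIIIIIII"),
    ("TGCATTGACC", "IIIIIIIIII"),
    ("GGGGTTGACC", "IIIIIIIIII"),
]
bins = demultiplex(reads, barcodes, barcode_len=4)
for sample, sample_reads in bins.items():
    print(sample, len(sample_reads))
```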
Trim and filter reads
Data received from sequencing facilities may still contain sequence artifacts, so these should be removed or the affected reads filtered out. For example, adapter sequences from library preparation may remain at the 3' end of reads. These adapter bases should be removed and low-quality bases should be trimmed. Bokulich et al. [1] recommend a minimum quality score of 3 for trimming low-quality bases at the ends of reads, with a window size of 4 bases, and removing any read shorter than 75% of the original read length. It is also recommended that reads containing ambiguous bases (N) are discarded.
Software: Trimmomatic, PRINSEQ, SolexaQA, USEARCH
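The trimming and filtering rules above can be sketched as follows. This is one possible reading of the recommendation, not the exact algorithm of any listed tool: the read is cut at the first window of 4 bases whose mean quality falls below 3, then discarded if less than 75% of the original length remains or if it contains an N. Function and parameter names are illustrative.

```python
def sliding_window_cut(qualities, window=4, min_quality=3):
    """Return the number of bases to keep.

    The read is cut at the start of the first window whose mean quality
    is below min_quality; reads shorter than the window are left uncut.
    """
    for start in range(max(len(qualities) - window + 1, 0)):
        win = qualities[start:start + window]
        if sum(win) / window < min_quality:
            return start
    return len(qualities)


def trim_and_filter(seq, qualities, window=4, min_quality=3, min_fraction=0.75):
    """Quality-trim the 3' end, then drop reads that are too short or contain N.

    Returns (trimmed_seq, trimmed_qualities), or None if the read is discarded.
    """
    keep = sliding_window_cut(qualities, window, min_quality)
    trimmed_seq, trimmed_qual = seq[:keep], qualities[:keep]
    if len(trimmed_seq) < min_fraction * len(seq):
        return None  # lost more than 25% of the original length
    if "N" in trimmed_seq.upper():
        return None  # ambiguous base present
    return trimmed_seq, trimmed_qual


# A 16-base read whose last 4 bases are of very low quality
print(trim_and_filter("ACGTACGTACGTACGT", [30] * 12 + [2] * 4))
```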
Paired read stitching
When the combined length of the reads sequenced from both ends of a DNA fragment is greater than the fragment size, the paired reads overlap. The read pairs can then be joined, based on the overlap information, to generate a single sequence. During stitching, the higher-quality base can be selected at each overlapping position to improve the quality of the stitched read.
Software: PEAR, PANDAseq, FLASH, USEARCH (fastq_mergepairs)
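A minimal sketch of the stitching idea follows. It assumes the reverse read is given as sequenced (so it is reverse-complemented first), takes the longest overlap within a mismatch tolerance, and keeps the higher-quality base at each overlapping position. The thresholds are illustrative; real tools score candidate overlaps statistically rather than accepting the first that passes.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd_seq, fwd_qual, rev_seq, rev_qual,
               min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair on a suffix/prefix overlap.

    Qualities are lists of integer Phred scores. Returns (sequence,
    qualities) or None when no acceptable overlap is found.
    """
    rev_seq = reverse_complement(rev_seq)
    rev_qual = rev_qual[::-1]
    # Try the longest possible overlap first
    for overlap in range(min(len(fwd_seq), len(rev_seq)), min_overlap - 1, -1):
        a, b = fwd_seq[-overlap:], rev_seq[:overlap]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches > max_mismatch_frac * overlap:
            continue
        merged_seq = list(fwd_seq[:-overlap])
        merged_qual = list(fwd_qual[:-overlap])
        offset = len(fwd_seq) - overlap
        for i in range(overlap):
            fq, rq = fwd_qual[offset + i], rev_qual[i]
            # Keep whichever base call has the higher quality
            if fq >= rq:
                merged_seq.append(a[i])
                merged_qual.append(fq)
            else:
                merged_seq.append(b[i])
                merged_qual.append(rq)
        merged_seq.extend(rev_seq[overlap:])
        merged_qual.extend(rev_qual[overlap:])
        return "".join(merged_seq), merged_qual
    return None


# A 30-base fragment read as two 20-base reads overlapping by 10 bases
fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
fwd = fragment[:20]
rev = reverse_complement(fragment[10:])
merged = merge_pair(fwd, [30] * 20, rev, [20] * 20)
print(merged[0] == fragment)
```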
Chimera detection
Chimeras are PCR artifacts. They form during PCR cycles when two or more DNA templates from different parents are joined. If chimeras are not removed, they may be recognised as novel sequences during the alignment process and thus mislead interpretation. At present, no tool can completely remove chimeras without also discarding some non-chimeric sequences. Of the various tools available, UCHIME was found to perform better than ChimeraSlayer, which was the best chimera detection program before UCHIME was developed [2]. To remove chimeras from 454 sequences, Perseus [3] can also be used.
Software: UCHIME, ChimeraSlayer, Perseus
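The idea behind chimera detection can be shown with a toy two-parent model. The sketch asks whether a query is explained much better by the left part of one parent joined to the right part of another than by any single parent. It assumes all sequences are already aligned to the same length, and the `min_gain` threshold is arbitrary; it is not the scoring scheme of UCHIME, ChimeraSlayer or Perseus.

```python
from itertools import permutations


def count_mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def check_chimera(query, parents, min_gain=3):
    """Compare the best single-parent fit with the best two-parent fit.

    parents maps a name to a sequence aligned to the same length as query.
    Returns (is_chimera, best_single_mismatches, best_model), where
    best_model is (mismatches, left_parent, right_parent, breakpoint).
    """
    best_single = min(count_mismatches(query, p) for p in parents.values())
    best_model = None
    for (left_name, left), (right_name, right) in permutations(parents.items(), 2):
        for bp in range(1, len(query)):
            mm = (count_mismatches(query[:bp], left[:bp])
                  + count_mismatches(query[bp:], right[bp:]))
            if best_model is None or mm < best_model[0]:
                best_model = (mm, left_name, right_name, bp)
    is_chimera = best_model is not None and best_single - best_model[0] >= min_gain
    return is_chimera, best_single, best_model


# Made-up parent sequences; the query joins the first half of p1 to the second half of p2
parents = {
    "p1": "ACGTACGTACGTACGTACGT",
    "p2": "TTGGCCAATTGGCCAATTGG",
}
query = parents["p1"][:10] + parents["p2"][10:]
print(check_chimera(query, parents))
```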
Phase 2: OTU picking, classification and phylogenetic tree generation
During this phase, reads are processed so that comparisons can be made between samples. The first step is to cluster the reads into OTUs based on similarity and to select a representative sequence for each OTU. Each OTU is then classified by comparison with a reference database, and phylogenetic inference is performed based on sequence alignment and the construction of a phylogenetic tree.
Input: High-quality reads
Output: OTUs, representative sequences, OTU table with classification and abundance of each OTU, heatmap, sequence alignment and phylogenetic tree
OTU picking
OTU picking is the clustering of preprocessed reads into OTUs. The clusters are formed based on sequence identity, and the user can define the identity threshold. Sequences that are more than 97% identical are conventionally assumed to derive from the same bacterial species/OTU. Other identity percentages can be used, depending on the desired granularity of the clusters and the known divergence in the 16S gene of the OTUs of interest. Three approaches to OTU picking exist: 1) de novo OTU picking clusters sequences based on pairwise sequence identity; 2) closed-reference OTU picking aligns and clusters sequences against a reference database, and sequences that are not >97% identical to a known reference are discarded; 3) open-reference OTU picking starts with alignment against a reference database, but a read that does not match a known sequence is not discarded and is instead sent to de novo OTU picking. After the sequences have been clustered into OTUs and counted to estimate OTU abundance, a representative sequence is chosen for each OTU. Each OTU is therefore represented by a single sequence, which speeds up downstream analysis. Options for the representative sequence include the most abundant sequence or a random sequence.
Software: UPARSE
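De novo OTU picking at a 97% threshold can be sketched as a greedy, abundance-sorted clustering in which the most abundant sequence of each cluster serves as its representative. The sketch assumes equal-length, pre-aligned sequences so that identity is a simple position-by-position comparison; real tools compute identity from pairwise alignments, and this is not the UPARSE algorithm itself.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same (aligned) length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(seq_counts, threshold=0.97):
    """Cluster unique sequences into OTUs.

    Sequences are processed from most to least abundant. Each sequence
    joins the first OTU whose centroid it matches at >= threshold
    identity; otherwise it becomes the centroid of a new OTU.
    """
    otus = []
    ordered = sorted(seq_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        for otu in otus:
            if identity(seq, otu["centroid"]) >= threshold:
                otu["members"].append(seq)
                otu["abundance"] += count
                break
        else:
            otus.append({"centroid": seq, "members": [seq], "abundance": count})
    return otus


# Made-up 100-base sequences: v2 differs from base at 2 positions, v5 at 5
base = "ACGT" * 25
v2 = "TT" + base[2:]
v5 = "CATGC" + base[5:]
otus = greedy_otu_clustering({base: 10, v2: 3, v5: 2})
for otu in otus:
    print(len(otu["members"]), otu["abundance"])
```

Here `v2` (98% identical to `base`) joins the first OTU, while `v5` (95% identical) seeds a second one.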
ASV prediction
Exact ASV prediction has proven to be a good alternative to OTU picking (Callahan et al. 2016 [4]). The process infers the exact sequences present in the sample and resolves differences of as little as one nucleotide. This approach allows a more precise taxonomic classification and has additional benefits, such as the reuse of ASVs from previously processed samples in future projects (Callahan et al. 2017 [5]). DADA2 is an R package that contains a complete pipeline for read processing, ASV prediction and classification. QIIME2 has a DADA2 interface, although there may be limitations on which settings can be configured when running through QIIME2 rather than through R.
Software: DADA2, QIIME2
Classification
Here a taxonomic identity is assigned to each representative sequence. Taxonomies are drawn from a reference set. There are three main reference databases with aligned, validated and annotated 16S rRNA genes: Greengenes, the Ribosomal Database Project (RDP) and Silva. Each of these databases has strengths and weaknesses that should be taken into consideration, and all are commonly used. There are several methods for assigning taxonomy against these reference databases, including UCLUST, the RDP Classifier and RTAX.
Databases: Silva, Greengenes, RDP
Software: UCLUST, RDP Classifier, RTAX
Alignment
To understand the evolutionary relationships between the sequences in the sample and to perform diversity analysis, a phylogenetic tree of the OTUs must be generated. The first step in generating the tree is to build a multiple alignment of the representative OTU sequences. PyNAST aligns the sequences against a reference alignment of 16S sequences. Infernal uses covariance models, which extend profile hidden Markov models to incorporate secondary structure information.
Software: PyNAST, Infernal
Create phylogenetic tree
The phylogenetic tree represents the relationships between sequences in terms of their evolutionary distance from a common ancestor. In downstream analysis this tree is used, for example, when calculating UniFrac distances.
Software: FastTree
An alternative option for most of the steps mentioned in Phase 1 and Phase 2 is to run IM-TORNADO (Jeraldo et al. 2014, under review). It stitches paired reads, removes chimeras, generates the OTU table and phylogenetic tree, and assigns taxonomy. The unique feature of IM-TORNADO is that it can analyse paired reads that do not overlap. This is useful in 16S rRNA studies that try to use the information in two variable regions, rather than the single variable region of a standard 16S rRNA study, to define OTUs.
Phase 3: Measure diversity and other statistical analyses
OTU information (number of OTUs, abundance of OTUs) and the phylogenetic tree generated in Phase 2 are used to estimate diversity within and between samples. Additional statistical analysis can also be performed to test the significance of differences in diversity.
Input: Classified OTU table with abundance, phylogenetic tree and sample metadata
Output: Alpha and beta diversity metrics, distance matrix, statistical tests, rarefaction plots, PCoA plots, heatmaps
Determine alpha diversity
Alpha diversity is a measure of diversity within a sample. The sequencing depth must be high enough to capture the true diversity within a sample. Samples with a larger number of reads tend to show higher observed diversity than samples with fewer reads, simply because more reads were sampled. Rarefaction analysis is therefore required to understand the true diversity within a sample and to determine whether the sequencing effort was sufficient and the total diversity in the sample has been captured. Both mothur and QIIME have tools to generate multiple rarefactions and then measure alpha diversity on the rarefied OTU tables. Popular alpha diversity measures available in mothur and QIIME are: Shannon index, Chao1, observed species and phylogenetic diversity (PD) whole tree.
Software: mothur, QIIME
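For intuition, three of the alpha diversity measures named above and a simple rarefaction curve can be computed directly from a vector of OTU counts. This is a sketch with illustrative function names, not the mothur or QIIME implementation; Chao1 is written in its bias-corrected form, and the Shannon index uses the natural logarithm (some tools report it in base 2).

```python
import math
import random
from collections import Counter


def observed_otus(counts):
    """Richness: number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed_otus(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from the OTU counts."""
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    sampled = Counter(rng.sample(pool, depth))
    return [sampled.get(i, 0) for i in range(len(counts))]


def rarefaction_curve(counts, depths, n_iter=10, seed=0):
    """Mean observed OTUs at each depth, averaged over n_iter subsamples."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        mean = sum(observed_otus(rarefy(counts, depth, rng))
                   for _ in range(n_iter)) / n_iter
        curve.append((depth, mean))
    return curve


# Made-up OTU counts for one sample
sample = [50, 30, 10, 5, 2, 1, 1, 1]
print(observed_otus(sample), round(shannon_index(sample), 3), chao1(sample))
print(rarefaction_curve(sample, depths=[10, 25, 50, 100]))
```

A curve that flattens as depth increases suggests that additional sequencing would reveal few new OTUs.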
Determine beta diversity
Beta diversity is a measure of diversity between samples. One of the most commonly used metrics is the UniFrac distance, which compares samples using phylogenetic information. A distance matrix for one or several beta diversity metrics is computed between all samples in the study and can be visualised in different ways, such as a tree, graph, network or PCoA plot.
Software: UniFrac for distance metrics, mothur, QIIME
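Unweighted UniFrac can be sketched as the fraction of branch length that leads to OTUs from only one of the two samples. The tree representation below (a list of branches, each with its length and the set of tip OTUs beneath it) and the toy tree are assumptions for illustration; real implementations parse a tree file and also offer an abundance-weighted variant.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Unweighted UniFrac distance between two samples.

    branches: list of (length, set of tip OTU ids below the branch).
    sample_a, sample_b: sets of OTU ids present in each sample.
    Returns unique branch length / total branch length covered by
    either sample.
    """
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total if total else 0.0


# Toy tree ((otu1:1, otu2:1):1, (otu3:1, otu4:1):1)
branches = [
    (1.0, {"otu1"}),
    (1.0, {"otu2"}),
    (1.0, {"otu3"}),
    (1.0, {"otu4"}),
    (1.0, {"otu1", "otu2"}),
    (1.0, {"otu3", "otu4"}),
]
print(unweighted_unifrac(branches, {"otu1", "otu2"}, {"otu2", "otu3"}))
```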
Other statistical analyses
Additional statistical tests can be performed between samples or groups of samples in QIIME. For alpha diversity, a parametric or non-parametric test can be performed at a rarefied sequence depth. For beta diversity, Mantel correlation, partial Mantel and Mantel correlogram matrix correlation can be used to compare distance matrices. Multivariate analyses are also available to test the significance of the relationship between the distance matrix and other factors. Available statistical methods include adonis, ANOSIM, BEST, Moran's I, MRPP, PERMDISP and db-RDA. R and R packages such as phyloseq and ade4 can also be considered for this type of analysis.
Software: QIIME, R packages (phyloseq, ade4)
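The Mantel correlation mentioned above can be sketched as a permutation test: the Pearson correlation between the upper triangles of two distance matrices is compared with the correlations obtained after jointly shuffling the rows and columns of one matrix. The sketch is one-sided, uses only the standard library and made-up distances, and is not the QIIME implementation.

```python
import math
import random


def upper_triangle(matrix):
    """Values above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel_test(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test between two symmetric distance matrices.

    Returns (r, p_value), where p_value is the fraction of permutations
    (plus the observed arrangement) with a correlation >= the observed r.
    """
    rng = random.Random(seed)
    n = len(d1)
    y = upper_triangle(d2)
    r_obs = pearson(upper_triangle(d1), y)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of d1 together
        permuted = [[d1[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), y) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Made-up distance matrices for four samples; d2 is d1 scaled by 2
d1 = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
d2 = [[2 * v for v in row] for row in d1]
r, p = mantel_test(d1, d2)
print(round(r, 3), p)
```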
Additional notes
In this SOP we refer to both QIIME and QIIME2. QIIME2 is more of a platform / command line interface than the original QIIME, which consisted of a set of Python wrapper scripts. The QIIME developers suggest migrating to QIIME2.
VSEARCH is an open source alternative to USEARCH, and our tests showed that it works equally well on the H3ABioNet test dataset. VSEARCH is freely available as a 64-bit program, so it does not have the memory limit of the free 32-bit version of USEARCH.
Appendices
Tools referred to in the SOP
- FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc
- MultiQC - https://multiqc.info/
- PRINSEQ - http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi
- SolexaQA - http://www.biomedcentral.com/1471-2105/11/485
- PEAR - http://bioinformatics.oxfordjournals.org/content/early/2013/10/18/bioinformatics.btt593.full.pdf
- PANDAseq - http://www.biomedcentral.com/1471-2105/13/31
- FLASH - http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507.full.pdf
- UCHIME - http://drive5.com/usearch/manual/uchime_algo.html
- ChimeraSlayer - http://nebc.nox.ac.uk/bioinformatics/docs/chimeraslayer.html
- Perseus - http://www.biomedcentral.com/1471-2105/12/38/
- UPARSE - http://www.drive5.com/uparse/
- VSEARCH - https://github.com/torognes/vsearch
- UCLUST - http://www.drive5.com/uclust/downloads1_2_22q.html
- RDP Classifier - http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/
- RTAX - https://github.com/davidsoergel/rtax
- PyNAST - http://www.ncbi.nlm.nih.gov/pubmed/19914921
- Infernal - http://infernal.janelia.org/
- FastTree - http://www.microbesonline.org/fasttree/
- IM-TORNADO - http://sourceforge.net/projects/imtornado/
- mothur - http://www.mothur.org/
- QIIME - http://qiime.org/
- QIIME2 - https://qiime2.org/
- UniFrac - http://bmf.colorado.edu/unifrac/
- R
- phyloseq - http://www.bioconductor.org/packages/release/bioc/html/phyloseq.html
- ade4 - http://cran.r-project.org/web/packages/ade4/index.html
- DADA2 - https://benjjneb.github.io/dada2
Databases referred to in the SOP
- Silva - http://www.arb-silva.de/
- Greengenes - http://greengenes.lbl.gov/
- RDP - http://rdp.cme.msu.edu/
H3ABioNet assessment exercises
Practice dataset
The datasets and input metadata can be accessed here.
Questions on the input data
- Were the number, length and quality of the reads obtained in line with what would be expected from the sequencing platform used?
- Was the input dataset of sufficiently good quality to perform the analysis?
- How did the quality and GC content of the reads affect the way the analysis was run?
Operational questions
- At each step of the workflow, describe which software was used and why:
  - Was the choice affected by the nature and/or quality of the reads?
  - Was the choice made because of the time and cost of the analysis?
  - What are the accuracy and performance considerations for the chosen software?
- For each software, describe which input parameters were chosen and why:
  - Was the choice affected by the nature and/or quality of the reads?
  - Did the available hardware play a role in the choice of parameters?
  - How did the purpose of the study affect the choice of parameters?
- For each step of the workflow, how do you know that it completed successfully and that the results can be used for the next step?
Runtime analysis
This is useful information for making predictions for clients and collaborators.
- How much time and disk space did each step of the workflow take?
- How did the underlying hardware perform? Was it possible to do other things or run other analyses on the same computer at the same time?
Analysis of results
- What percentage of reads was removed during quality trimming? Did all samples have a similar number of reads after the preprocessing steps? What were the mean, maximum and minimum read counts per sample? How many reads were discarded because of ambiguous bases?
- What percentage of reads could not be stitched? Were those reads retained or discarded?
- How many chimeras were detected?
- How does the trimming or filtering strategy affect the number of OTUs picked and the classification and phylogenetic analysis of the OTUs?
- How does the similarity threshold used during OTU picking affect the number of OTUs identified and the classification and phylogenetic analysis of the OTUs?
- How many OTUs were picked? What percentage of OTUs could be classified to genus and species level? What percentage of OTUs could only be assigned to taxonomic ranks higher than genus? What is the confidence threshold for the classifications?
- Does the use of a different 16S rRNA classification database affect the results (for example, were fewer or more OTUs classified at lower taxonomic ranks (genus, species))? Were any OTUs classified differently?
- Did the samples have sufficient sequencing depth to capture the diversity? Did the rarefaction curve flatten? Should any sample be excluded because of a low read count?
- Was there any difference in alpha diversity between samples in the different metadata categories (for example, higher phylogenetic diversity in treatment 1 versus treatment 2)?
- When groups of samples were compared (for example, treatment 1 versus treatment 2) based on distance metrics such as UniFrac, was any specific clustering pattern observed?
- Were any of the OTUs significantly correlated with any of the treatments or other metadata?
Bibliography
- [1] Bokulich, Nicholas A., et al. "Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing." Nature Methods 10.1 (2013): 57.
- [2] Edgar, Robert C., et al. "UCHIME improves sensitivity and speed of chimera detection." Bioinformatics 27.16 (2011): 2194-2200.
- [3] Quince, Christopher, et al. "Accurate determination of microbial diversity from 454 pyrosequencing data." Nature Methods 6.9 (2009): 639.
- [4] Callahan, B.J., et al. "DADA2: High-resolution sample inference from Illumina amplicon data." Nature Methods 13.7 (2016): 581-3.
- [5] Callahan, B.J., et al. "Exact sequence variants should replace operational taxonomic units in marker-gene data analysis." The ISME Journal 11.12 (2017): 2639-2643.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.