1. Given the accession number J00306, design a primer pair for the exon spanning positions 1231 to 1368. Validate the
quality of the primers you obtain.
Step 1: Go to NCBI and enter the accession number J00306.
Step 2: Press Ctrl + F and type “1231..1368” to locate the exon for which we want to design primers. (If you cannot
see the picture clearly, adjust the zoom.)
Step 3: Open “Pick Primers” on the right side.
Step 4: Enter the exon range in the box.
Step 5: Get the primers (make sure the highlighted option is checked).
The first primer pair is usually the best one.
In the general settings, a good primer should meet criteria such as:
- The length is 18-30 bases.
- The melting temperature (Tm) is 50-60 degrees Celsius.
- The GC content is between 45% and 55%.
- The maximum Tm difference between the two primers is 3 degrees Celsius.
Both primers are 20 bases long. The forward primer starts at position 1184, while the reverse primer starts at position
1463. The melting temperatures of the forward and reverse primers are 54.6 °C and 55.9 °C, respectively. Both have a GC
content of 50%. Neither primer forms stable hairpins or dimers. However, neither primer has a GC clamp at its 3' end.
Overall, these primers can be considered a good pair.
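The criteria above can be checked mechanically. The following is a minimal sketch, assuming the values reported by the primer tool are typed in by hand; the function names and the "last base is G or C" clamp definition are illustrative choices, not part of the tool.

```python
def gc_content(seq: str) -> float:
    """Return the GC content of a primer as a percentage."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def has_gc_clamp(seq: str) -> bool:
    """True if the 3'-terminal base is G or C (a simple clamp definition)."""
    return seq.upper()[-1] in "GC"


def check_pair(fwd: dict, rev: dict) -> dict:
    """Evaluate the general criteria listed above for a primer pair.

    Each primer is a dict with 'length' (bases), 'tm' (deg C) and 'gc' (%).
    """
    def within_limits(p: dict) -> dict:
        return {
            "length": 18 <= p["length"] <= 30,
            "tm": 50.0 <= p["tm"] <= 60.0,
            "gc": 45.0 <= p["gc"] <= 55.0,
        }

    return {
        "forward": within_limits(fwd),
        "reverse": within_limits(rev),
        "tm_difference": abs(fwd["tm"] - rev["tm"]) <= 3.0,
    }


# Values reported above for the first primer pair.
forward = {"length": 20, "tm": 54.6, "gc": 50.0}
reverse = {"length": 20, "tm": 55.9, "gc": 50.0}
report = check_pair(forward, reverse)
```

With the reported values every check passes. Hairpins and dimers still have to be checked separately.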
2. Given the accession number NC_000009.12 and the sequence range 94603133 to 94640249, what does this
sequence encode? List the values of the BLAST output.
Step 1: Go to NCBI to inspect what NC_000009.12 is.
Step 2: Run BLAST.
Step 3: Choose blastn, enter the sequence range, and select the Human genomic plus transcript (Human G + T) database.
According to the BLAST result, the sequence from 94603133 to 94640249 of accession number NC_000009.12
encodes (taking only the first 3 results with a query cover higher than 99%):
- Homo sapiens fructose-bisphosphatase 1 (FBP1), RefSeqGene on chromosome 9
- Human DNA sequence from clone RP11-342C23 on chromosome 9, complete sequence
- Homo sapiens fructose-1,6-bisphosphatase 1 (FBP1) gene, complete cds
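As a side check, the length of the queried region and a download link for it can be worked out in a few lines. This is a sketch only: it assumes the NCBI E-utilities `efetch` endpoint with the `seq_start` and `seq_stop` parameters, and it only builds the URL without contacting NCBI. The helper name `efetch_region_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_region_url(accession: str, start: int, stop: int) -> str:
    """Build (but do not request) a URL for a FASTA sub-sequence."""
    params = {
        "db": "nuccore",
        "id": accession,
        "seq_start": start,   # 1-based, inclusive
        "seq_stop": stop,     # 1-based, inclusive
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


start, stop = 94603133, 94640249
region_length = stop - start + 1          # both ends are included
url = efetch_region_url("NC_000009.12", start, stop)
```

The region is 37,117 bases long, which is worth keeping in mind when reading the query cover values in the BLAST output.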
3. The INS gene encodes insulin in humans. Given the accession number NC_000011.10 with the range 2159779 to
2161209, find a primer pair with a product length of 500 ± 50 bp.
4. Given the accession number NC_000009.12 and the sequence range 94603133 to 94640249, answer the
following questions:
a. Using the human G+T database for BLAST, what does this sequence encode?
b. Briefly describe how your BLAST output supports your answer to question a.
In the first result, we get the following information:
Max score: 89.8
Total score: 743
Query cover: 0%
Identity: 100%
Accession number: NC_000006.12
Gaps: 1/67 (1%)
Meaning:
The E value is a parameter that describes the number of hits one can "expect" to see by chance when searching a
database of a particular size. The smaller the E value, the more significant the match. An E value of 0 means that
the hit is very unlikely to be due to chance, so the result can be used.
In this case, E value = 1e-13 << 0.05, so the alignment is a significant match.
The identity is 91% > 75%, which shows that mouse GULO (chromosome 14) is highly similar to human GULO
(chromosome 6).
The first result is the most similar gene.
c. Use the mRNA of your sequence (from your BLAST result) to design a primer pair and report your
result.
5. The following picture shows the phylogenetic and modular analysis of C. militaris (CCM) polyketide synthases
(PKSs) compared with those involved in the production of human mycotoxins. (a) A neighbor-joining tree
showing the relationship of ketoacyl CoA synthase (KS) domain sequences. (b) Modular comparison of C.
militaris PKSs with those involved in the production of mycotoxins. Domain definitions: ACP, acyl carrier protein
domain; AT, acyltransferase domain; CYC, cyclase domain; DH, dehydratase domain; TE, thioesterase domain.
CCM_00603 lacks the gene cluster for patulin biosynthesis.
a. Starting from raw samples of C. militaris, what bioinformatics approaches can be used to reconstruct the
phylogenetic tree in figure (a)?
b. Researchers are concerned about the possibility of harmful side effects of the Chinese ... support this
hypothesis? Explain your answer.
6. The table below shows values of different coding statistics in the 223-bp second coding exon of the human β-globin gene, and in a 223-bp sequence from the middle of the second intron of the same gene.
                  Position asymmetry   Periodic asymmetry index   Average mutual information   Fourier spectrum
Exon sequence     0.0957               1.159                      0.00681                      2.278
Intron sequence   0.0211               1.009                      0.000344                     0.892
a. What are the methods in this table used for?
They are sequence-based measures indicative of protein-coding function in genomic DNA.
A good knowledge of the core coding statistics is important for understanding how gene identification programs work and
for interpreting their predictions.
The main distinction here is between measures dependent on a model of coding DNA, and measures independent of
such a model. The model of coding DNA is always probabilistic, allowing us to compute the probability of a DNA sequence,
given that the sequence is coding. Although in practice the values (scores) of a given coding statistic in a query
sequence can be computed in a number of different ways, here for the model-based coding statistics we will compute
scores based on such a probability. Indeed, given a query sequence, we will compute the probability of the sequence
under the model of coding DNA, and under an alternative model of non-coding DNA (which here, for illustration
purposes, will simply be random DNA). We will take the logarithm of the ratio of these two probabilities, the log-likelihood ratio, as the score of the coding statistic in the query sequence.
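The log-likelihood ratio described above can be sketched for a codon-based model. This is a toy illustration under stated assumptions: the codon probabilities in `toy_model` are invented for the example (they are not real codon usage values), unseen codons get an arbitrary small floor probability, and the random model gives every codon probability 1/64.

```python
import math


def log_likelihood_ratio(seq: str, coding_codon_probs: dict) -> float:
    """Score = log( P(seq | coding model) / P(seq | random DNA) ).

    The sequence is read in frame 0, codon by codon; trailing bases that do
    not fill a codon are ignored.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        p_coding = coding_codon_probs.get(codon, 1e-4)  # floor for unseen codons
        p_random = 1.0 / 64.0                           # uniform random DNA
        score += math.log(p_coding / p_random)
    return score


# Hypothetical toy codon probabilities, for illustration only.
toy_model = {"ATG": 0.05, "GCC": 0.04, "AAG": 0.04, "TAA": 0.002}

favoured = log_likelihood_ratio("ATGGCCAAG", toy_model)
disfavoured = log_likelihood_ratio("TAATAATAA", toy_model)
```

A positive score means the sequence is more probable under the coding model than under random DNA; a negative score means the opposite.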
b. Based on this result, why is average mutual information the most sensitive method?
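One simple way to approach this question is to compare, for each statistic, the exon value with the intron value. The sketch below uses the exon-to-intron ratio as a crude measure of separation; this choice of measure is an assumption, and the course may define sensitivity differently.

```python
# Values copied from the table in question 6.
exon = {
    "position asymmetry": 0.0957,
    "periodic asymmetry index": 1.159,
    "average mutual information": 0.00681,
    "Fourier spectrum": 2.278,
}
intron = {
    "position asymmetry": 0.0211,
    "periodic asymmetry index": 1.009,
    "average mutual information": 0.000344,
    "Fourier spectrum": 0.892,
}

# Ratio of the exon value to the intron value for each coding statistic.
ratios = {name: exon[name] / intron[name] for name in exon}
most_separating = max(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: exon/intron = {ratio:.2f}")
```

Under this measure, average mutual information shows the largest relative difference between the exon and the intron (roughly a 20-fold ratio), while the periodic asymmetry index changes the least.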
7. The protein VanA, with the help of two other proteins, adds a lactate group instead of alanine to the end of the
peptidoglycan chain. This helps bacteria resist vancomycin. Given two VanA structures,
from a modern sample and from a 30,000-year-old DNA sample, with PDB IDs 1E4E and 3SE7 respectively, use
appropriate bioinformatics tool(s) to answer the following questions:
a. Which class does VanA belong to (according to shape and secondary structure)?
VanA is a D-alanine-D-lactate ligase, meaning that it adds lactate to the growing peptidoglycan chain.
b. How different are their primary structures (single chain only)?
The enzyme that makes the normal peptidoglycan is a D-alanine-D-alanine ligase, which adds alanine to
the chain.
Surprisingly, the ancient VanA is very similar to the VanA made by modern bacteria, showing that this war of
antibiotics and resistance began long before medical science discovered the utility of antibiotics.
c. How different are their tertiary structures (single chain only)?
We compare the ancient and modern proteins using the Structure Comparison Tool.
http://www.rcsb.org/pdb/workbench/showPrecalcAlignment.do?action=pw_fatcat&name1=1E4E.A&name2=3SE7.A
d. In terms of evolution, make at least 2 assumptions based on your previous comparisons.
VanA reconstructed from a 30,000-year-old bacterium, with bound ATP.
8. There are several types of genetic variation, such as tandem repeat polymorphisms, insertion/deletion polymorphisms,
and single nucleotide polymorphisms (SNPs). In your opinion, explain why researchers focus so extensively on SNPs
nowadays.
Tandem repeat polymorphism: Tandem repeats, or variable number of tandem repeats (VNTRs), are a very common class
of polymorphism, consisting of variable-length sequence motifs that are repeated in tandem in a variable copy number.
VNTRs are subdivided into two subgroups based on the size of the tandem repeat unit. Microsatellites, or short tandem
repeats (STRs), have repeat units of 1-6 bases (for example, the dinucleotide repeat CACACACACACA). Minisatellites have
repeat units of 14-100 bases. For example, spinocerebellar ataxia type 10 (SCA10) (OMIM: +603516) is caused by the largest
tandem repeat seen in the human genome. The normal population has 10-22 copies of the pentanucleotide ATTCT repeat in
intron 9 of the SCA10 gene, whereas SCA10 patients have 800-4500 repeat units, which makes the disease allele up to
22.5 kb larger than the normal one.
Insertion/deletion polymorphism: Insertion/deletion (INDEL) polymorphisms are quite common and widely
distributed throughout the human genome. Sequence repetitiveness in the form of direct or inverted tandem repeats has
been shown to predispose DNA to localized rearrangements between homologous repeats. Such rearrangements are
thought to be one of the mechanisms that create INDEL polymorphisms. For example, an association between coronary heart
disease and a 287 bp indel polymorphism located in intron 16 of the angiotensin converting enzyme (ACE) gene has been
reported (OMIM 106180). This indel, known as ACE/ID, is responsible for 50% of the inter-individual variability of
plasma ACE concentration.
In silico estimates put the number of potentially polymorphic VNTRs at over 100,000 across the human genome. Short
insertions/deletions are very difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs.
Single nucleotide polymorphisms (SNPs): SNPs are responsible for 90% of all human genetic variation. A SNP occurs every
100-300 base pairs. There are currently almost 12 million SNPs in the NCBI SNP database. SNPs may lie within genes
(coding SNPs, cSNPs) or outside genes (non-coding SNPs, the majority). They may or may not cause amino acid changes; a SNP
that causes an amino acid change is called non-synonymous (nsSNP). Most SNPs are not responsible for a
disease. Like microsatellites, they are used as markers for pinpointing a disease on the genome map. SNPs make
particularly good markers because they occur frequently throughout the genome, and are older and more stable
genetically. The most common polymorphisms (or genetic differences) in the human genome are single base-pair
differences. When two different haploid genomes are compared, SNPs occur, on average, about every 1,000 bases.
SNP-based studies require no prior biological assumptions and can identify novel genes and pathways. They offer an
excellent chance of identifying risk alleles and are useful in individual risk assessment.
SNPs are important for:
The organism: SNPs are mutations, so they can alter DNA function. Depending on where they are, this can potentially
cause critical illness by altering an important genetic feature. At the other end of the spectrum, they may have no
discernible impact. Based on population genetics theory, SNPs with severe disease-causing effects are likely to be bred
out of gene pools.
Genetic epidemiologists: Genetic epidemiologists use SNPs as genetic markers to track disease. Large studies called
genome-wide association studies examine the SNPs in tens to hundreds of thousands of people and find associations
between particular SNPs and disease.
Most SNPs have no effect on health or development. Some of these genetic differences, however, have proven to be very
important in the study of human health. Researchers have found SNPs that may help predict an individual’s response to
certain drugs, susceptibility to environmental factors such as toxins, and risk of developing particular diseases. SNPs
can also be used to track the inheritance of disease genes within families. Future studies will work to identify SNPs
associated with complex diseases such as heart disease, diabetes, and cancer.
9. What is GWAS? In general, how do scientists conduct a GWAS study? At least how many patients are needed
for a GWAS?
A genome-wide association study is defined as any study of genetic variation across the entire human genome that is
designed to identify genetic associations with observable traits (such as blood pressure or weight), or the presence or
absence of a disease (such as cancer) or condition. It is an approach that involves rapidly scanning markers across the
complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once
new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and
prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex
diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.
A GWAS needs at least 1,000-3,000 patients and at most 100,000-200,000 patients.
In general, to conduct a GWAS, scientists first collect a large cohort of cases and controls.
Second, microarray-based SNP genotyping is performed. After haplotypes are derived,
association signals are detected. The association signals are then fine-mapped.
Finally, the association is replicated and biologically validated.
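The "detection of association signals" step can be illustrated for a single SNP with an allelic chi-square test on a 2x2 table of allele counts in cases and controls. This is a minimal sketch with hypothetical counts; real GWAS pipelines test many SNPs and apply a much stricter genome-wide threshold than the single-test value of 3.84 (p < 0.05, 1 degree of freedom).

```python
def allelic_chi_square(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int) -> float:
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of allele counts."""
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: 1000 cases and 1000 controls, i.e. 2000 alleles per group.
chi2_assoc = allelic_chi_square(1200, 800, 1000, 1000)   # allele A enriched in cases
chi2_null = allelic_chi_square(1000, 1000, 1000, 1000)   # identical frequencies
```

In the first table the statistic is about 40.4, far above 3.84, so this hypothetical SNP would count as an association signal to follow up by fine mapping and replication.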
10. Briefly list out the procedure of genome assembly and the specific software needed in
each step (if any).
11. Explain why repetitive sequences are a challenge for genome assembly.
Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly
half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly
programs. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can
produce biases and errors when interpreting results. Repetitive DNA consists of sequences that are similar or identical to
sequences elsewhere in the genome. Although some repeats appear to be nonfunctional, others have played a part in
human evolution, at times creating novel functions, but also acting as independent, ‘selfish’ sequence elements. Repeats
arise from a variety of biological mechanisms that result in extra copies of a sequence being produced and inserted into
the genome. Repeats come in all shapes and sizes: they can be widely interspersed repeats, tandem repeats or nested
repeats, they may comprise just two copies or millions of copies, and they can range in size from 1–2 bases (mono- and
dinucleotide repeats) to millions of bases. Repeats can also take the form of large-scale segmental duplications, such as
those found on some human chromosomes, and even whole-genome duplication. For de novo assembly, repeats that are
longer than the read length create gaps in the assembly. In addition to creating gaps, repeats can be erroneously
collapsed on top of one another and can cause complex, misassembled rearrangements. This happens because the genome is
sequenced as many small fragments: reads from repeat regions can align to the wrong place or overlap ambiguously.
Repetitive sequences are a huge challenge because the reads associated with them cannot be assigned to just one location
in the genome. Each copy of a repetitive element is flanked on each side by a unique sequence.
Consider that half the human genome consists of repetitive DNA and other genomes have even more; transposable elements
span over 80% of the maize genome. Beyond assembly, this also leads to a tremendous technical challenge for alignment
to a reference genome.
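The ambiguity can be seen on a toy example. In the sketch below the genome and reads are invented for illustration: a read that lies entirely inside a repeat matches two locations, while a read that also covers unique flanking sequence matches only one.

```python
def find_all(genome: str, read: str) -> list:
    """Return every 0-based position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome: two copies of a repeat separated by unique sequence.
repeat = "ACGTACGTAC"
genome = "TTGCA" + repeat + "GGATC" + repeat + "CCTAG"

read_in_repeat = "GTACGT"       # lies entirely inside the repeat
read_spanning = "GCAACGTAC"     # unique flank plus the start of the repeat

print(find_all(genome, read_in_repeat))   # two candidate locations
print(find_all(genome, read_spanning))    # one location
```

This is why reads longer than the repeat, or reads anchored in the unique flanks, are needed to place repeat copies correctly.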
12. FASTQ is considered the raw data generated by a next-generation sequencing
machine. What is the difference between FASTA format and FASTQ format? Briefly describe the structure of FASTQ
format.
The difference between FASTQ format and FASTA format:
FASTQ: FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide
sequence) and its corresponding quality scores. Both the sequence letter and the quality score are each
encoded with a single ASCII character for brevity. Like the FASTA format, the FASTQ format
includes a sequence string, consisting of the nucleotide sequence of each read. FASTQ also
includes an associated quality score for every base, making it appropriate for reads from an
Illumina machine or other brands.
FASTA: A text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter
codes. The format also allows for sequence names and comments to precede the
sequences.
Structure of the FASTQ format: Each FASTQ file has records that are in blocks four lines long.
The first line, beginning with the @ symbol, holds the unique sequence name that identifies the record. It may
optionally include information about the sequence length or the machine used for sequencing.
The second line has the sequence (in upper case), including the nucleotides G, A, T, C, and (as is the case
here in the second position) there may be an N for an unknown nucleotide.
The third line begins with the + symbol and typically contains just that character (as in this case), or it can
have more information.
The fourth line includes the quality scores (ASCII characters) corresponding to every base. Each quality
score is assigned a single character, and the length of the entire quality score string must equal the length of the
sequence string.
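The four-line structure can be sketched as a small parser. The record below is made up for illustration, and the code assumes the Phred+33 quality encoding (each quality character's ASCII code minus 33), which is one common convention; other offsets exist.

```python
def parse_fastq_record(lines: list) -> dict:
    """Parse one four-line FASTQ record and decode its quality string."""
    name_line, sequence, plus_line, quality = [ln.rstrip("\n") for ln in lines]
    if not name_line.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not plus_line.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so quality score Q = ASCII code - 33.
    phred = [ord(ch) - 33 for ch in quality]
    return {"name": name_line[1:], "sequence": sequence, "phred": phred}


# Hypothetical record with an unknown base (N) of low quality.
record = parse_fastq_record(["@read1", "GNTTACGA", "+", "I#IIIHHG"])
error_probabilities = [10 ** (-q / 10) for q in record["phred"]]
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the N with Q = 2 is far less reliable than the neighbouring bases with Q = 40.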
13. Briefly describe how an Illumina sequencing machine works.
Illumina sequencing works on the principle of cyclic reversible termination.
(a) Genomic DNA is purified and then randomly fragmented. This can be accomplished mechanically by methods such as
sonication, shearing, or nebulization, often followed by size selection of the randomly fragmented DNA. Adapters are
attached to both ends.
(b) Single-stranded DNA fragments are covalently attached to the surface of flow cell channels.
(c) The addition of DNA polymerase and unlabeled deoxynucleotides creates solid-phase “bridge amplification” in which
the template DNA makes U-shaped loops with both ends attached to the surface of the channel.
(d) Double-stranded bridges are formed. The double-stranded molecules are denatured and then amplified to generate
dense clusters of template DNA.
(e) Four labeled reversible terminators are added (with primer and DNA polymerase). Only a single reversible terminator
will be added to each template in a given cycle. As with Sanger sequencing, chain termination will occur at specific bases
that cannot elongate.
(f) Following laser excitation, the identity of the first base is recorded.
(g) For the second cycle, the reversible terminators are removed (by deprotection). All four labeled reversible terminators
and the polymerase are again added to the flow cell. The cycles are repeated.
Sequencing over multiple chemistry cycles: The sequencing cycles are repeated to determine the sequence of bases in a
fragment, one base at a time.
Align data: The data are aligned and compared to a reference, and sequencing differences are identified.
14. How many kinds of data does an Illumina sequencing machine generate? What is the
function of paired-end reads?
Illumina machines generate two kinds of data: single-end reads and paired-end reads.
Paired-end reads are useful for identifying deletions (as well as insertions) because such reads have an expected distance
(depending on the size of the library inserts) and orientation. Paired-end sequencing allows users to sequence both ends of
a fragment and generate high-quality, alignable sequence data. Paired-end sequencing facilitates detection of genomic
rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts. Paired-end reads also
provide superior alignment across DNA regions containing repetitive sequences, and produce longer contigs for de novo
sequencing by filling gaps in the consensus sequence.
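The "expected distance" argument can be sketched as follows. The coordinates, read length, insert size and tolerance are hypothetical, and the sketch ignores read orientation, which real tools also check.

```python
def classify_pair(left_pos: int, right_pos: int, read_len: int,
                  expected_insert: int, tolerance: int) -> str:
    """Compare the span of a mapped read pair with the expected insert size.

    left_pos and right_pos are the reference start positions of the two reads.
    """
    observed = right_pos + read_len - left_pos
    if observed > expected_insert + tolerance:
        # The reads map too far apart: the sample may lack reference sequence.
        return "possible deletion in sample"
    if observed < expected_insert - tolerance:
        # The reads map too close together: the sample may carry extra sequence.
        return "possible insertion in sample"
    return "concordant"


# Hypothetical library: 300 bp inserts, 100 bp reads, 50 bp tolerance.
print(classify_pair(1000, 1200, 100, 300, 50))   # concordant
print(classify_pair(1000, 2200, 100, 300, 50))   # possible deletion in sample
print(classify_pair(1000, 1050, 100, 300, 50))   # possible insertion in sample
```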
15. What are the methods for structural prediction of protein and their drawback?
In structural biology, there are two main approaches to determining protein structure:
X-ray crystallography; and nuclear magnetic resonance spectroscopy (NMR). Structures
can also be predicted computationally using three approaches: homology modeling, threading,
and ab initio prediction.
Structure prediction is a major goal of proteomics. There are three principal ways to
predict the structure of a protein. First, for a protein target that shares substantial similarity to
other proteins of known structure, homology modeling (also called comparative modeling) is
applied. Second, for proteins that share folds but are not necessarily homologous, threading is a
major approach. Proteins that are analogous (related by convergent evolution rather than
homology) can be studied this way. Third, for targets lacking identifiable homology (or analogy)
to proteins of known structure, ab initio approaches are applied.
Homology modeling (comparative modeling)
There are several principal types of errors that occur in comparative modeling (see Marti-Renom
et al., 2000):
• errors in side-chain packing;
• distortions within correctly aligned regions;
• errors in regions of a target that lack a match to a template;
• errors in sequence alignment; and
• use of incorrect templates.
Each target undergoes comparative modeling using an existing experimental structure as a
guide that may be superimposed on the target.
Fold recognition (threading)
The target might assume a fold that occurs in a characterized protein because of convergent
evolution, or because the two proteins are homologous but extremely distantly related. An input
sequence is parsed into subfragments and “threaded” onto a library of known folds. Scoring
functions allow an assessment of how compatible the sequence is with known structures
Ab initio prediction (template-free modeling)
The resolution of ab initio methods is generally low. Knowledge-based approaches would fail in the following
conditions:
Structural homologues are not available.
A possible undiscovered new fold exists.
Anfinsen’s theory: The native structure of a protein corresponds to the state with the lowest free energy of the
protein-solvent system.
Limitations of de novo prediction methods
o A major limitation of de novo protein prediction methods is the extraordinary amount of computer time
required to successfully solve for the native conformation of a protein.
o Another way of circumventing the computational power limitations is to use coarse-grained modeling.
Coarse-grained protein models allow for de novo structure prediction of small proteins, or large protein
fragments, in a short computational time.
Gene Prediction: The principle
• Identify common genetic features of known genes.
• Generate genetic profiles.
• Compare the profiles to an uncharacterized gene as a prediction.
• Test and validate the prediction.
Computational Methods for Gene Prediction
• Gene prediction methods
Extrinsic/homology method: Based on sequence similarity.
The assumptions of the homology method:
- Coding regions evolve slower than non-coding regions.
- Homologous sequences reflect a common ancestry and therefore gene structure.
Software: AAT, EbEST, GeneSeqer, ORFgene2, SYM4, GeneWise, SYNCOD
Intrinsic/ab initio method: Based on statistical profiles.
Predicts genes based on the statistical properties of the uncharacterized sequence.
- Software: FGENESH, GeneID, GeneMark.hmm, GENSCAN, Xpound, VEIL, TWINSCAN, HMMgene.
- Challenges in eukaryotes:
  - Protein-coding genes are separated by intergenic regions.
  - The presence of exons and introns.
  - Signal sequences are difficult to identify.
Features for gene prediction in prokaryotes
Promoter elements:
- -35 region.
- -10 region.
- Transcriptional start site.
- ORFs.
- Translation stop sites.
THE EXISTENCE OF CONSENSUS SEQUENCES (ESPECIALLY PROMOTER SEQUENCES) FACILITATES GENE PREDICTION IN
PROKARYOTES.
• Software: AMIGene, EasyGene, GeneMark.hmm-P, Glimmer, SG inder, MEDstart, REGANOR, TICO, ZCURVE.
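The ORF feature in the list above can be illustrated with a very simple scan for open reading frames. This is a sketch only: it looks at the forward strand, uses ATG as the only start codon, and applies an arbitrary minimum length, whereas the programs listed above rely on statistical models and further signals. The toy sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list:
    """Return (start, end) pairs of ATG...stop ORFs on the forward strand.

    Positions are 0-based and the end is exclusive (the stop codon is included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the stop codon
    return orfs


toy = "CCATGAAATTTGGGTAACC"        # hypothetical sequence
orfs = find_orfs(toy)
print([toy[s:e] for s, e in orfs])
```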
Challenges in prokaryotic gene prediction
• Prokaryotes pose difficulties due to high gene density and simple gene structure:
– Little information from short genes.
– Reduced detection accuracy due to overlapping genes.
Features for gene prediction in eukaryotes
• Prediction for eukaryotes is a whole lot more complicated than for prokaryotes.
• This is because of the large amount of information involved:
– splice sites,
– start and stop codons,
– branch points, promoters,
– terminators, polyA sites,
– ribosomal binding sites,
– topoisomerase II binding sites,
– topoisomerase I cleavage sites,
– transcriptional factor binding sites,
– etc.
Software: FGENESH, GeneID, GeneMark.hmm, GENSCAN, Xpound, VEIL, TWINSCAN, HMMgene.
Challenges in eukaryotic gene prediction
• Low gene density and complex gene structure.
• Presence of alternative splicing mechanisms.
• Presence of pseudogenes.
Why Is Gene Prediction Difficult?
• DNA signals have low information content (degenerate and highly unspecific).
• It is difficult to discriminate real signals.
• Sequencing errors.
16. According to your understanding of read mapping analysis, is it possible to estimate the number of duplications by
using coverage? Briefly explain your answer.
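One common line of reasoning is that, if reads are sampled roughly uniformly, a region present in several copies in the sample but only once in the reference attracts proportionally more mapped reads. The sketch below illustrates that arithmetic with hypothetical depths; it assumes uniform coverage and ignores GC bias and mapping ambiguity, so it is a rough estimate rather than a definitive method.

```python
def estimate_copy_number(region_depths: list, baseline_depth: float,
                         baseline_copies: int = 1) -> float:
    """Estimate copy number as (mean depth in region) / (depth per copy)."""
    mean_region = sum(region_depths) / len(region_depths)
    return baseline_copies * mean_region / baseline_depth


# Hypothetical per-base depths in a candidate duplicated region,
# with a genome-wide single-copy depth of 30x.
region = [88, 92, 90, 91, 89]
estimate = estimate_copy_number(region, baseline_depth=30.0)
print(round(estimate))
```

Here the region has about three times the baseline depth, suggesting about three copies under these assumptions.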
17. The functional site (or active site) of a protein consists of only a few amino acids. Use your understanding of the
biochemical nature of proteins to discuss the functional importance of the other residues that do NOT belong to
the functional site.
18. Algorithms in genome assembly largely depend on overlapping sequences of the reads (known sequences of
DNA fragments after sequencing). Based on the Human Genome Project, briefly discuss the genomic feature(s) of
the human genome that may challenge the process of genome assembly.
Assembling the human genome, with approximately 35 million reads, needed large computing farms and distributed
computing. Since 2006, the Illumina (previously Solexa) technology has been available and can generate about 100 million
reads per run on a single sequencing machine. Compare this to the 35 million reads of the Human Genome Project, which
needed several years to be produced on hundreds of sequencing machines.
Human contamination in other mammalian genome sequences will be particularly problematic, as such contamination is
expected to be common due to handling of the samples. For parts of a de novo-sequenced mammalian genome, the best
BLAST hit will be against a human or mouse sequence simply because the region in question has not been sequenced and
annotated in any other mammal.
[Genome assembly]
_ Definition of genome assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to
reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go,
but rather reads small pieces of between 20 and 30000 bases, depending on the technology used. Typically the short
fragments, called reads, result from shotgun sequencing of genomic DNA or gene transcripts (ESTs).
Genome assemblies offer a consensus representation of a genome, spanning all the chromosomes (and
extrachromosomal elements such as organellar genomes and plasmids). When next-generation sequencing is performed
on a previously assembled genome (e.g., when we sequence a person’s
genome) alignment to the reference genome is performed, but that human reference has already been assembled so
further assembly is not required. In contrast, when we sequence the genome of a species that has not previously been
characterized, de novo (“from new”) assembly is required.
Genome assembly: Challenges
Errors in assembly are important because we rely on each assembly for all aspects of the
genomic landscape, including the locations of genes.
Genomes can be assembled de novo (“anew,” without referring to other completed genomes) or
by mapping reads onto a reference genome.
The assembly process involves the collection of individual sequences, the closing of gaps, and
the lowering of the error rate
The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a
shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces.
Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated
paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be
added in, and some shreds may be completely unrecognizable.
Whole-genome assembly involves fragmenting genomic DNA from an organism, then constructing libraries of various
sizes (often from 2 kb to 50 kb or even >100 kb). In one approach the ends of cloned inserts are sequenced (producing
mate pair reads). As reads are aligned they are organized into contigs such as those found in the Whole-Genome
Shotgun (WGS) division of NCBI. Contigs can be ordered and oriented to assemble scaffolds (also called supercontigs).
These may contain gaps whose sizes can be estimated. Global statistics for assemblies include: (1) the total number of
scaffolds (including those with or without known placement or orientation); (2) the scaffold N50 (the length in base pairs
such that scaffolds of this length or longer include 50% of the bases in the assembly); (3) the total number of contigs;
and (4) the contig N50 (here the length such that contigs of this length or longer include 50% of the bases in the
assembly). N50 is therefore a measure of contiguity, with larger values denoting more complete assemblies.
The Genome Reference Consortium (GRC), which is responsible for human genome assemblies, lists the N50 for each
human chromosome. For chromosome 11 (harboring the HBB gene cluster) the N50 is about 41.5 megabases, while in
earlier assemblies (such as NCBI35) it was millions of base pairs shorter.
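The N50 definition above can be computed directly from a list of contig (or scaffold) lengths. A minimal sketch with made-up lengths:

```python
def n50(lengths: list) -> int:
    """Return the length L such that contigs of length >= L hold at least
    50% of all bases in the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (in bases); the total is 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

Here the two longest contigs (80 + 70 = 150) already hold half of the 300 bases, so the N50 is 70.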
_ Genome assembly procedure or pipeline
Flowchart describing assembly and annotation procedures: the steps involved in creating a high-quality genome.
Sequencing can include the conventional Sanger technique and/or several NextGen
technologies including 454, Illumina, and Ion Torrent (see Table 1). Contig and scaffold assembly can
utilize several assemblers including: Atlas (Havlak et al. 2004), ABySS (Simpson et al. 2009),
ALLPATHS-LG (Gnerre et al. 2011), Celera assembler (Myers et al. 2000), MaSuRCA (accessed on July
19, 2013), and SOAPdenovo (Li et al. 2010). Chromosome mapping can use genetic information,
radiation hybrids or fluorescence in situ hybridization (FISH). “Breaking” misassembled scaffolds
and placing them on chromosomes can involve extensive manual work. Expressed sequence tags
(ESTs) are usually partial transcripts obtained from Sanger sequencing. mRNA-seq is often performed
with Illumina technology but can also be conducted with Ion Torrent machines.
OR
_ Types of read data
_ De novo assembly and reference mapping assembly (de novo = "new", reference = "something that already
exists". So one assembly is built based on a known genome, the other is built based on nothing but itself.)
In sequence assembly, two different types can be distinguished:
1. de novo: assembling short reads to create full-length (sometimes novel) sequences (see de novo sequence
assemblers, de novo transcriptome assembly)
2. mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not
necessarily identical to the backbone sequence
In terms of complexity and time requirements, de novo assemblies are orders of magnitude slower and more memory
intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every
read with every other read (an operation that has a naive time complexity of O(n^2); using a hash this can be reduced
significantly). Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies
one would have a very similar book as a template (perhaps with the names of the main characters and a few locations
changed), de novo assemblies are more hardcore in a sense, as one would not know beforehand whether this would
become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every
other shred.
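The naive "every read against every other read" comparison can be sketched as follows. The reads and the minimum overlap length are hypothetical; real assemblers use indexing or hashing to avoid the quadratic number of comparisons.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_pairs_overlaps(reads: list, min_len: int = 3):
    """Compare every ordered pair of distinct reads: n * (n - 1) comparisons."""
    overlaps = {}
    comparisons = 0
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            comparisons += 1
            k = overlap(a, b, min_len)
            if k:
                overlaps[(i, j)] = k
    return overlaps, comparisons


reads = ["ATGGCGT", "GCGTACA", "ACAGGTT"]   # hypothetical reads
overlaps, comparisons = all_pairs_overlaps(reads)
print(overlaps, comparisons)
```

With 3 reads there are 6 ordered comparisons; with millions of reads this count grows quadratically, which is the cost described above.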
Copy your primer sequences in, one at a time, then click Hairpin and Self-Dimer to check them.
A hairpin is when the primer folds back on itself through hydrogen bonds.
When a primer folds up, it cannot bind to our gene segment to pick out the target gene,
so hairpins must be limited.
As in this picture, the primer is 20 nt long and has two loops, which is not acceptable.
We can also rely on delta G: the delta G of a hairpin must not be lower than -1.
This is the self-dimer result.
It means that primers of the same type bind to each other.
The delta G of a self-dimer must not be lower than -9; if it is, the primer will not pass validation.
Self-dimers must be limited,
because if primers of the same type all bind to each other, they will not bind to the gene during PCR.
Pairing in the middle of the primers is not a problem,
but what must be limited and avoided most of all is the two 3' ends pairing with each other.
Once the two 3' ends have paired with each other, they cannot bind to the gene during PCR.
For hairpin analysis, you can change the default concentrations provided to match your reaction conditions. The
most valuable piece of information on this screen is the Tm for each of your structures. If the Tm of the structure is
lower than your reaction temperature, then this structure will not cause any problems. If it is higher, this oligo may be
problematic and should be redesigned.
For self-dimer analysis, click on 'Self-Dimer' to bring up a new window with each possible self-dimer your oligo can
form. For each diagram you will be able to see the calculated delta G value for this secondary structure. If you have
a strong delta G (-9 kcal/mol or more negative) this oligo could be problematic.
For hetero-dimer analysis, enter the sequence of your forward primer into the sequence box, and then click
'Hetero-Dimer.' This will open a second box below the original sequence box, in which you enter the sequence of your
reverse primer. Then click the "Calculate" button below the second box. In general, a primer pair with a delta G of
-9 kcal/mol or more negative will be problematic.
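The point about 3' ends pairing with each other can be checked with a simple sequence test. The sketch below only looks for perfect complementarity between the last k bases of two primers; it does not compute delta G and is not a substitute for the dimer analysis described above. The primer sequences and the choice k = 4 are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def three_prime_pairing(primer_a: str, primer_b: str, k: int = 4) -> bool:
    """True if the last k bases of the two primers can pair antiparallel."""
    return primer_a.upper()[-k:] == reverse_complement(primer_b[-k:])


# Hypothetical primers.
primer_1 = "ATGCATTAGGCC"   # 3' end GGCC is self-complementary
primer_2 = "TTGACCATGAAC"

print(three_prime_pairing(primer_1, primer_1))   # self-dimer at the 3' end
print(three_prime_pairing(primer_1, primer_2))   # hetero-dimer check
```

Passing the same primer twice tests for a 3' self-dimer; passing the forward and reverse primers tests for a 3' hetero-dimer.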