Dissertação apresentada para obtenção do grau de doutor thesis... · 2016. 7. 25. · Os...


Dissertation presented to obtain the degree of Doctor

in Evolutionary Biology

from the Instituto de Tecnologia Química e Biológica

of the Universidade Nova de Lisboa.

This work was financially supported by the FCT and the FSE

under the Community Support Framework,

PhD fellowship no. SFRH/BD/15856/2005.

Acknowledgments

I would like to thank Arcadi Navarro and Isabel Gordo for agreeing to supervise this

PhD and Arcadi Navarro for the opportunity to collaborate in other projects, three of

which resulted in the publications found in the Appendices section.

I am also grateful to the Unitat de Biologia Evolutiva of the Universitat Pompeu

Fabra, now part of the Institut de Biologia Evolutiva, for hosting me during this work,

and its members for making me feel welcome.

A very special thank you to the many people I was lucky to meet over

these years who took the time to discuss my (and their own) projects with me and

contributed helpful comments and ideas that greatly improved my work, even

if that work didn’t make it into the thesis.

The work presented here would not have been possible without the financial support

from the Portuguese Fundação para a Ciência e a Tecnologia through a PhD

fellowship (SFRH/BD/15856/2005), and the excellent training and education provided

by the Programa Gulbenkian de Doutoramento em Biomedicina.

Index

Summary
Resumo
INTRODUCTION
  Historical perspective
    Genes in pieces
    Not much room for doubt
    First impressions
  Evolutionary perspective
    Four kinds of introns
      tRNA and archaeal introns
      Self-splicing introns
      Spliceosomal introns
    Introns early vs late
    Mechanisms of intron gain and loss
      Intron loss
      Intron gain
  Splicing
    The spliceosome
    Splicing signals and the assembly of the spliceosome
    The minor form of spliceosome
    Finding the correct pair of splice sites
    Alternative splicing
  Why should we care about introns?
    Boost mRNA quality
    Increase recombination
    Source of functional diversity
    Repositories of functional elements
  References
RESULTS
  Publication I: Intronic mutational constraints in Primates
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
    Conclusions
    Acknowledgments
    References
  Publication II: Accelerated evolution in Human introns
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
    Acknowledgments
    Supplementary Tables
    References
GENERAL DISCUSSION AND CONCLUSIONS
  Constraints on the evolution of intronic sequences
  Accelerated evolution of intronic sequences
  References

Summary

Spliceosomal introns, the most common class of introns in eukaryotes, found in the

protein coding genes in the nucleus of these organisms, are commonly described as

regions in the primary transcript that need to be excised in order to produce the

functional mRNA molecule. Yet, they are also regions in the RNA transcript, and the

corresponding genomic regions, with a high number of functional elements that act

either at the RNA or DNA level and help regulate important cellular processes such as

splicing and gene expression.

With the exception of the core splicing signals, whose sequence motifs and location

within the intron are relatively well defined, most of the other cis-acting functional

elements in introns are located at variable distances from the splice sites and contain

degenerate sequence motifs with low information content, which makes them much

harder to locate within the introns. Given the critical roles played by these elements,

it is likely that many evolve under selective pressure to maintain function, which will

affect intron sequence conservation levels. Thus, sequence conservation can help in

the task of finding these cis-regulatory elements, as the most constrained regions in

introns are their most likely location.

In our first study we examined the sequence conservation along primate introns

(human, chimpanzee and macaque) and identified regions where functional elements

involved in splicing (within 400 base pairs of the splice sites) and transcription

regulation (up to several kilobase pairs from the donor splice site in the first intron)

are more likely to occur, and intronic regions which evolve mostly unconstrained

(central portions of introns left after removing the constrained regions described

above). The results from this study are of particular interest for defining target

regions in studies of functional elements present in introns (either computational

scans of over-represented motifs or functional experiments), and for studies using

introns as neutrally evolving sequences in order to, for instance, estimate genetic

distances between species or detect selective events.

Given the potential of alternative splicing to generate proteins with diverse functions

(sometimes even opposite roles) from the same gene, and the contribution of both

tissue-specific alternative splicing events and transcription regulation to organism

complexity, it is plausible that some of the cis-acting functional elements found in

introns evolved under positive selection and are responsible for organismal

differences between species.

In our second study we performed a genome-wide scan for introns with evidence of

having evolved under positive selection in the human branch and found 86

candidates, mostly belonging to different genes. Our results indicate that functional

sequences in these fast evolving introns are more likely to have a role in the control

of transcription and gene expression than in the regulation of alternative splicing.

Since our functional analysis of the genes containing our candidate introns did not

identify any particular biological process or molecular function, we suggest that

positive selection acting upon introns has been largely decoupled from the functions

of the genes to which these introns belong. Instead, it is possible that a significant

portion of the fast evolving elements in our candidate introns are distant

transcription regulatory elements acting on neighboring genes, which often have

unrelated functions.

Resumo

Spliceosome-dependent introns, the most common class of introns in

eukaryotes, present in the protein-coding genes found in the nucleus of these

organisms, are frequently described as the regions of primary transcripts

that need to be removed for a functional messenger RNA molecule to

form. However, introns are also regions in the RNA transcript, and in the

corresponding genomic regions, that contain a large number of functional

elements, which act at the RNA or DNA level and contribute to the

regulation of important cellular processes such as splicing and gene expression.

With the exception of the core splicing signals, whose sequence patterns and location

within the intron are relatively well defined, most of the functional

elements present in introns lie at variable distances from the splice

sites and contain degenerate sequence patterns with low information

content, which makes them difficult to identify. Given their importance, it is likely

that many evolve under selective pressure to maintain their function, which will be

reflected in conservation levels along introns. Conservation levels can therefore

help in the task of finding these regulatory elements, since

the most conserved regions in introns are the ones most likely to

contain them.

In a first study we examined conservation along intronic sequences

of primates (human, chimpanzee and macaque) and identified regions more

likely to contain functional elements involved in the regulation of splicing

(in the 400 base pairs adjacent to the splice sites) and of transcription (several

kilobases from the 5’ splice site of the first intron), as well as regions that

evolve mostly without constraint (the central portions of introns that

remain after excluding the constrained regions described above). The results

of this work are of particular importance both for defining regions of

interest in studies of functional elements present in introns (computational

searches for over-represented motifs or functional experiments) and

for studies that use introns as neutrally evolving sequences to,

for example, estimate genetic distances between species or detect selection

events.

Considering that the mechanism of alternative splicing can

generate different proteins from the same gene (sometimes even with

antagonistic functions), and the contribution of both tissue-specific alternative

splicing events and transcription regulation to

organism complexity, it is possible that some of the functional elements

present in introns have evolved under positive selection and are responsible

for organism-level differences between species.

In the second study we searched the whole genome for introns with evidence

of having evolved under positive selection in the human branch, and found 86 candidate

introns, most of which belong to distinct genes. Our

results indicate that the functional sequences present

in these introns are more likely to be involved in the control of transcription and gene

expression than in the regulation of alternative splicing. Since the functional

analysis of the genes to which our candidate introns belong did not highlight

any particular biological process or molecular function, we suggest that the

positive selection acting on introns is largely decoupled from the

functions of the genes to which the introns belong. It is possible that a significant

portion of the fast-evolving elements in our candidate introns

are involved in the long-distance regulation of transcription of neighboring genes,

which frequently have distinct functions.

Introduction

"A week of hard work can sometimes save you an hour of thought."

Unknown author.

Historical perspective

(Stumbling into introns)

The word intron was coined in 1978 by Walter Gilbert (Gilbert 1978)

as an abbreviation for intragenic region. The discovery of introns itself, however, was

made the year before, in 1977, and is now commonly attributed to Richard J. Roberts

and Phillip A. Sharp, who, in 1993, were awarded the Nobel Prize in Physiology or

Medicine for their discovery of "split genes".

Genes in pieces?

By the mid-1970s, genes were seen as “transcribed code” (Gerstein et al. 2007) –

continuous stretches of DNA that are copied into RNA – and messenger RNA (mRNA)

was thought to be a direct copy of the gene sequence. This view was based mainly on

studies with bacteria and bacteriophages, which dominated the field at the time, but

the collinearity and continuity of the DNA, RNA and protein sequences were assumed

to be universal. Therefore, the finding that mRNA can derive from physically separate

sections along the DNA came as a shock¹, and at the time it looked like genes were

split in pieces by introns, which were initially referred to as intervening DNA, inserts,

spacer sequences or spacers.

By 1976 it was already known that the primary transcripts of all major classes of RNA

(ribosomal, transfer and messenger) undergo some processing before they become

the functionally competent, mature forms of RNA. There was also considerable

evidence that eukaryotic mRNAs are initially transcribed as much larger molecules –

the heterogeneous nuclear RNAs (hnRNAs) – that are subsequently shortened. Based

on observations that mRNA and hnRNA share the same polyadenylation site, it was

proposed that the mRNA segment was placed at the 3’-end of the hnRNA. When it

was later found that caps are also present at the 5’-end of both mRNA and hnRNA,

researchers reasoned that, in some cases, the mRNA segment was located at the 5’

terminus of its precursor (Perry 1976). It was assumed that one or the other end of the

initial transcript was cut off; no one expected that the discarded segments could

come from the middle of the RNA (Marx 1977; Rogers 1978; Marx 1978).

¹ James Watson actually used the word “bombshell” to describe this finding in the ‘Foreword’ to the 1977 Cold Spring Harbor Symposia on Quantitative Biology – where the first results were presented a few months before they were published – and words such as ‘amazing’ and ‘baroque’ were used in the titles of scientific articles and communications.

Not much room for doubt

Not only was the discovery of introns surprising and unexpected, it also happened at

a breathtaking pace (Figure 1).

The finding was first reported at the Cold Spring Harbor Symposia on Quantitative

Biology, at the beginning of June 1977. Several groups of investigators, including

Sharp’s and Roberts’ groups, presented their independent discovery that a number

of mRNAs of animal viruses consist of sequences complementary to widely separated

portions of the viral genome. The importance of this work was immediately

recognized, and it was featured in the news sections of journals such as Nature and

Science (Sambrook 1977; Marx 1977) even before the original research articles

(Berget et al. 1977; Chow et al. 1977; Klessig 1977; Dunn and Hassell 1977; Lewis et

al. 1977; Aloni et al. 1977; Kitchingman et al. 1977; Hsu and Ford 1977) were

published.

Although the discovery was made in viral messengers, researchers suspected that the

same could be happening with mRNAs from animal cells, since the viruses use the

enzymes of the nucleated cells they infect to produce their own mRNA. Their

hypothesis was confirmed by other groups in November (Doel et al. 1977;

Breathnach et al. 1977), only three months after the first publication of the discovery

in viruses. As with the work on viruses, the discovery of introns in eukaryotic

messengers was made almost simultaneously by several independent groups.

Figure 1 Timeline of events regarding the discovery of introns. Above the timeline are events discussed

in the main text and below are some of the main advances that allowed the discovery of introns. Temin,

Baltimore, Smith, Berg, Sharp and Roberts were all later awarded Nobel prizes for the discoveries

mentioned in the figure. Line width is proportional to the number of publications. *The study on

Drosophila rRNA was published in February 1977.

[Figure 1 labels. Time axis: 1970 to 1979, with February 1977 to February 1978 expanded by month.

Above the timeline: At the Cold Spring Harbor Symposia several groups studying animal viruses present evidence that mRNAs are complementary to noncontiguous regions of the viral genome, and the surprising discovery is featured in the News section of Nature and Science. Sharp and Roberts are among the first to publish their results (in August and September, respectively), closely followed by the other groups. Studies on ovalbumin, beta-globin, immunoglobulin, rRNA and tRNA genes soon demonstrate that the phenomenon is widespread among eukaryotes and is not limited to messenger RNA. Walter Gilbert coined the terms intron and exon.

Below the timeline: Paul Berg constructed the first recombinant-DNA molecule. Howard Temin and David Baltimore simultaneously discover reverse transcriptase. Hamilton Smith purified a restriction enzyme (HindII) and first showed that it cuts DNA with a specific sequence. R-loop technique is described. Southern blotting technique is developed.]

In the following months the list of species in which introns were observed grew

quickly and introns were found to be present in the precursors not only of mRNA but

also of ribosomal (rRNA) and transfer (tRNA) RNA. It soon became clear that in

eukaryotes genes with introns were not the exception but the rule.

First impressions

Remarkably, introns were immediately assumed to have a function. Very early on,

just as the first examples in eukaryotes were found, and even before it was known

for sure that introns are transcribed, researchers postulated that introns could have

regulatory functions, including determining chromatin conformation during the

control of transcription (Williamson 1977), and regulating protein synthesis after

transcription (Marx 1978).

Another early speculation was that introns would be important for the evolution of

the genome. Perhaps the most influential article on this matter was Walter Gilbert’s

“news and views” piece early in 1978 (Gilbert 1978). In just about one thousand

words, Gilbert coins the terms intron and exon, predicts that introns account for far

more DNA than exons and foresees the disappearance of the one gene-one

polypeptide dogma. He also proposes that the presence of introns in genes can

speed up evolution by allowing rearrangements of the coding regions (also proposed

by Rogers, 1978), or by enabling single base pair changes to generate novel proteins

(instead of only changing a single amino acid), due to the deletion or addition of

whole sequences of amino acids, if those mutations occur near the splice sites and

alter the splicing pattern. He continues by speculating that splicing does not need to

be a hundred per cent efficient so that, in his own words, “evolution can seek new

solutions without destroying the old”.

Evolutionary perspective

(Learning to live with introns)

Soon after the discovery of introns it became apparent that, although they had never

been observed in bacteria, they were widespread in eukaryotes. But when and how

introns appeared and why they became so successful in eukaryotic genomes was a

mystery. Three decades later there are many models, several hypotheses, but no

definitive answers.

Four kinds of introns

In the literature (and the remainder of this book) the word intron is frequently used

to refer to the prolific nuclear mRNA spliceosomal introns. There are, however, three

other less abundant classes of introns, known as group I, group II and tRNA and/or

archaeal introns, which differ in the mechanism by which they are spliced out.

tRNA and archaeal introns

Introns in tRNA, rRNA and mRNA genes of archaea and in tRNA genes in the nucleus

of eukaryotes share a splicing mechanism with a characteristic that sets them apart

from all the other classes of introns: they are spliced by protein enzymes, without

any RNA catalyst (Calvin and Li 2008).

First, a splicing endonuclease excises the intron, probably guided not by sequences in

the RNA, but by RNA structural features, and then a ligase joins the two exons.

Although the ligation reaction differs, the cleavage step is conserved in eukaryotes

and archaea. The similarity of the cleavage reaction, the sequence homology of the

splicing endonucleases and the shared preferential location of the intron in the tRNA

genes all support a common origin for these introns in the two

domains/superkingdoms of cellular organisms (Archaea and Eukaryota) (Lykke-

Andersen et al. 1997).

Organisms from the other domain/superkingdom (Bacteria) have neither this class of

introns nor the splicing endonuclease. In these organisms, introns found in tRNA

genes belong to the group I class of self-splicing introns (Fujishima et al. 2010).

Self-splicing introns

Group I and II introns were originally described in fungal mitochondrial genes (Michel

et al. 1982) but have since been found in mitochondria from other eukaryotes and

also in chloroplast and bacterial genomes. Group I introns are also present in the

nuclear genome of eukaryotes from diverse phyla and, in fact, most of the

approximately 2,900 group I introns described so far are found in rRNA genes in the

nucleus, mainly of fungi. On the other hand, group II introns have been found in a

genus of archaea (Lambowitz and Zimmerly 2004), and most of the roughly 750 group

II introns described are found in the chloroplasts of green plants and algae (Cannone et al.

2002).

Introns in these two classes are capable of self-splicing, that is, they can extract

themselves from the RNA molecule without the help of proteins or other RNAs². They

do so by folding themselves into specific three-dimensional structures that bring the

intron-exon junctions into close proximity and allow precisely positioned reactive

groups to perform the splicing reactions³. The folding itself occurs due to the

presence of conserved partially complementary sequence stretches in the RNA

molecules (Alberts et al. 2002, 6).

Group I and group II introns can be distinguished based on their conserved sequences

and secondary structures, on the splicing reaction requirements (group I introns use

a free guanosine, while group II introns use an especially reactive adenine residue in

the intron sequence itself to initiate self-splicing) and on the structure of the released

intron, which has the shape of a lariat in group II introns (Cech and Bass 1986; Vicens and

Cech 2006). These fundamental differences, besides justifying their classification into

separate groups, suggest that the two groups originated independently.

Self-splicing group I introns have very well conserved primary and secondary

structures, which supports the idea that they share a common origin. Their

widespread but sporadic distribution in nature suggested that they may have spread

by horizontal transfer. Phylogenetic analyses confirmed this hypothesis when it was

shown that introns located at homologous gene sites in different organisms tend to

be more closely related than those at heterologous sites within the same organism

(Hoshina and Imamura 2009).

² Nonetheless, in the cell, self-splicing introns are normally aided by proteins that speed up the reaction.

³ Self-splicing introns were actually the first example of RNA molecules with catalytic function. Up to then all known biocatalysts were proteins and RNA was seen simply as the transmitter of genetic information from DNA to protein. The discovery that RNA can also be a biocatalyst earned Sidney Altman and Thomas Cech the 1989 Nobel Prize in Chemistry and made the RNA world hypothesis plausible.

In contrast, the observed distribution of group II introns – mainly in bacteria,

mitochondria and chloroplasts – suggests that they originated in bacteria and have

been retained in the organelles that descend from bacterial endosymbionts. The

few group II introns found in archaea are instead likely to be the result of

lateral transfer from bacteria (Lambowitz and Zimmerly 2004).

Both groups of introns are still capable of horizontal transfer through homing (a

process by which an intron spreads to a homologous position in an intronless allele)

and reverse splicing, and are thus currently viewed also as mobile genetic elements.

Group II introns, in particular, have been proposed to be ancestors of non-LTR

retrotransposons (Lambowitz and Zimmerly 2004).

About one-third of the introns in each group contain internal open reading frames

(ORFs) that may still code for proteins with endonuclease (group I) and/or reverse

transcriptase activity (group II), which promote their mobility. Interestingly, some of

those genes embedded in the self-splicing introns, particularly homing endonuclease

genes (HEGs), are mobile genetic elements themselves. By inserting into introns

they avoid disrupting host gene function, and the introns in turn see their

mobility increased. What’s more, this intron-HEG relationship seems to have

strengthened during evolution since some of those intron-encoded proteins have

evolved to function also as maturases that assist in the splicing of their host intron.

Because the ability of self-splicing introns to remove themselves from the RNA

transcript in a precise manner partially explains their (and their embedded ORFs’)

success in spreading to new genes and new species – as it potentially renders them

neutral to the host – it was a change that benefited both (Lambowitz and Zimmerly

2004; Haugen et al. 2005; Stoddard 2005).

Spliceosomal introns

Introns have been found in all three domains/superkingdoms of cellular organisms

(Archaea, Bacteria and Eukaryota), in different types of genes (protein, rRNA and tRNA

coding genes) and in various eukaryotic organelles (nucleus, mitochondria and

chloroplast). The previous classes of introns can be found in at least two domains,

types of gene and/or organelles, but spliceosomal introns are only found in nuclear,

protein coding, eukaryotic genes. Yet, they are present in most, if not all, nuclear

eukaryotic genomes characterized to date and are by far the most common class of

introns in these organisms, reaching hundreds of thousands per genome in

vertebrates and plants (Roy and Gilbert 2006).

Contrary to group I and group II introns, spliceosomal introns do not fold into specific

three-dimensional structures and they are completely dependent on both proteins

and other RNAs (which form a large complex that gives them their name: the

spliceosome) for their extraction. Nevertheless, the chemistries of their splicing

reactions are very similar to group II introns, with spliceosomal introns being also

released in a lariat structure, and the RNA molecules at the core of the spliceosome

closely resemble a number of critical RNA domains of group II introns (Valadkhan and

Jaladat 2010). Because of the striking similarities between these two classes of

introns it has been proposed that spliceosomal introns evolved from group II introns

(Cech 1986) by the transfer of the splicing ability to other molecules and loss of the

conserved sequences that formed the typical secondary structures. As a

consequence, much more of the intron sequence is left free to diverge and many

more RNAs could be spliced (Alberts et al. 2002, 6).

Introns early vs late (when and where)

Two main theories regarding the origin of (spliceosomal⁴) introns have been

proposed, and they are the object of a long-standing debate.

⁴ As explained before, spliceosomal introns will frequently be referred to simply as introns.

According to the Introns Early (IE) theory, introns were present in the ancestor of

prokaryotes and eukaryotes: the last universal common ancestor (LUCA). In this

ancestor, introns were initially just genomic regions between genes that coded for

small proteins, genes that were later concatenated to form modern multiple-domain proteins. It

was hypothesized that in this primitive organism the information copying

mechanisms were error prone and, in order to prevent information loss, LUCA’s

genome had to be highly redundant. Therefore, coding sequences would be present

in multiple copies undergoing rapid information decay, and recombination within

introns would enable the joining of functional exon copies. With the improvement in

fidelity of the information copying mechanisms introns became less relevant and

were eventually lost in prokaryotes as they evolved towards increased metabolic

economy. In eukaryotes they were kept by gaining new functions (Rodríguez-Trelles

et al. 2006).

This origin of introns in LUCA would avoid the deleterious effect of inserting

functionless sequences into previously continuous genes. Yet, it implies that massive

intron losses occurred independently across all prokaryote lineages.

A more parsimonious explanation, that spliceosomal introns only appeared in

eukaryotes, is defended by the Introns Late (IL) theory. This theory proposes that

some of the many genes that were transferred to the nucleus from the bacterial

endosymbionts that gave rise to eukaryotic organelles contained self-splicing group II

intron-like elements. In the eukaryotic nucleus they spread and the spliceosome

evolved through the fragmentation of a group II intron (Belshaw and Bensasson

2006).

The debate on whether spliceosomal introns were present in the eukaryote-

prokaryote ancestor, and were then extensively lost, or rather evolved from group II

introns after these invaded the nucleus, and greatly increased in numbers, is still

active (Coulombe-Huntington and Majewski 2007; Basu et al. 2008). What seems to

have reached consensus is that spliceosomal introns and the spliceosome arose

before the most recent common ancestor of living eukaryotes and that since then

introns have been gained and lost differently in different lineages making it hard to

infer the ancestral condition.

Mechanisms of intron gain and loss

Intron loss

Two main models of intron loss have been proposed. The first, genomic deletion, can

remove parts of introns, and sometimes of adjacent coding regions, or, if it occurs by

nonhomologous recombination between short direct repeats at both ends of the

intron, it can excise introns exactly. The second, recombination with a reverse-

transcribed copy of mRNA, will delete one or more adjacent introns in an exact

manner.

Because an mRNA intermediate is needed in the second model, it should mainly

affect genes expressed in germline cells. Additionally, since reverse transcription

occurs from the 3’ end to the 5’ end of the RNA template and often terminates

prematurely, intron loss by this method is predicted to be 3’ biased. And finally,

because recombination can involve regions spanning multiple intron positions,

concerted loss of adjacent introns is expected by the second model. Despite all the

different predictions made by the two models of intron loss, results have not been

conclusive on the relative contribution of each mechanism. Some studies have found

concerted loss of adjacent introns and 5’ intron location bias in intron-sparse genes

and genomes, which support the recombination with reverse-transcribed mRNA

model. Yet, this biased location of the introns could have resulted from the

preferential retention of 5’ introns if they are particularly enriched in functional

elements, and many studies, particularly with intron-rich organisms, find

neither of these patterns (Belshaw and Bensasson 2006; Rodríguez-Trelles et al.

2006; Roy and Gilbert 2006).
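The location bias discussed above can be made concrete with a small calculation. The sketch below is only an illustration, not the method of any of the cited studies: it expresses each intron position as a fraction of the total exonic length of a gene (0 at the 5’ end, 1 at the 3’ end) and averages these fractions. A mean well below 0.5 would correspond to the 5’ location bias expected after 3’-biased loss. The function names and the toy exon lengths are hypothetical.

```python
def relative_intron_positions(exon_lengths):
    """Return intron positions as fractions of total exonic length.

    exon_lengths: lengths (in nucleotides) of consecutive exons, 5' to 3'.
    One intron is assumed at each internal exon-exon boundary.
    """
    total = sum(exon_lengths)
    positions = []
    cumulative = 0
    for length in exon_lengths[:-1]:  # no intron after the last exon
        cumulative += length
        positions.append(cumulative / total)
    return positions


def mean_position(exon_lengths):
    """Mean relative intron position, or None for an intronless gene."""
    positions = relative_intron_positions(exon_lengths)
    return sum(positions) / len(positions) if positions else None


# Toy gene (hypothetical): two short 5' exons followed by one long 3' exon.
gene = [100, 100, 800]
print(relative_intron_positions(gene))  # [0.1, 0.2]
print(mean_position(gene))              # well below 0.5: introns sit near the 5' end
```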

Intron gain

Five main models have been proposed to explain the origin of new introns.

The most popular one, the intron transposition model, involves the duplication of an

existing intron in a way similar to how Group II introns self-propagate. According to

this model, an RNA intron sequence that has been spliced out of a transcript is

reverse-spliced into a new position of the same or a different mRNA. The new intron is finally

inserted into the genome by recombination of a reverse-transcribed copy of the

intron-acquiring transcript with its genomic template. As with the second model of

intron loss described in the previous section, this mechanism should be 3’ biased.

Yet, because the recombination of the reverse-transcribed mRNA with the new

intron can at the same time involve loss of neighboring existing introns, it would not

necessarily lead to a bias in intron location towards the 3’ end of genes. According to

this model though, the new introns should show sequence similarity to their intron

sources but, so far, studies that have found new introns that resemble older introns

in the same genome are scarce and the regions showing inter-intron homology are

generally enriched in palindromic repetitive sequences that are also found in

intergenic regions, raising the possibility that they instead resulted from the spread of

transposons with palindromic sequences.

Other models for intron gain include: Transposon insertion, in which a transposable

element inserts into an exonic portion of a gene and is removed from the RNA

transcript by the spliceosome, thus converting into a spliceosomal intron; Tandem

genomic duplication, where the duplicated region contains cryptic splice signals with

an AGGT sequence and the two copies of this sequence are recognized by the

spliceosome as the donor and acceptor splicing sites, restoring the original coding

sequence; and Intron transfer among paralogs, in which an intron-containing paralog

transfers a copy of its intron to a paralog previously lacking an intron at that site

through homologous recombination.
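The tandem genomic duplication model is easy to check on a toy sequence. The following sketch assumes a simplified picture in which the intron runs from the GT of the first AGGT copy to the AG of the second copy; the sequence, the duplication coordinates and the function names are hypothetical and chosen only for illustration.

```python
def tandem_duplicate(seq, start, end):
    """Insert a second copy of seq[start:end] immediately after the first."""
    return seq[:end] + seq[start:end] + seq[end:]


def splice_between_aggt_copies(seq):
    """Remove the segment from the GT of the first AGGT to the AG of the second.

    Returns (spliced_sequence, intron), or None if fewer than two AGGT
    copies are present.
    """
    first = seq.find("AGGT")
    second = seq.find("AGGT", first + 1) if first != -1 else -1
    if second == -1:
        return None
    donor = first + 2          # intron starts at the GT of the first copy
    acceptor_end = second + 2  # intron ends with the AG of the second copy
    intron = seq[donor:acceptor_end]
    return seq[:donor] + seq[acceptor_end:], intron


original = "ATGGCCAGGTCCGATTAA"                   # one AGGT at positions 6-9
duplicated = tandem_duplicate(original, 4, 12)  # duplicate the region around AGGT
spliced, intron = splice_between_aggt_copies(duplicated)
print(intron)               # GTCCCCAG
print(spliced == original)  # True
```

In this toy case the excised segment starts with GT and ends with AG, and its removal leaves the original coding sequence unchanged, which is the property the model relies on.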

Of the four models described so far, only the intron transposition and intron transfer mechanisms

ensure that the inserted sequence includes the necessary signals for correct splicing.

Although all of these models can explain current spliceosomal intron proliferation,

none of them can account for how spliceosomal introns first arose, since all require

the existence of a functional spliceosome. Only a fifth model for intron gain,

Conversion of group II introns, includes a mechanism for the origin of the

spliceosome. According to this model, group II introns from organellar genes were

transferred to the nucleus, where they were inserted into previously intronless sites.

With time, their splicing ability got transferred to trans-acting RNAs and other

molecules, with consequent degradation of their internal RNA structure and loss of

their ORFs, rendering them dependent on a common splicing apparatus: the

spliceosome (Rodríguez-Trelles et al. 2006; Roy and Gilbert 2006).

Splicing

(Getting rid of introns)

Most protein coding genes in the nucleus of eukaryotes produce transcripts with

intronic sequences that need to be removed in order to form a functional mRNA

molecule. The process by which they are extracted, pre-mRNA splicing, involves two

consecutive phosphoryl-transfer reactions, known as transesterifications, which join

the two exons and release the intron in the shape of a lariat.

In the first reaction, the 2’-OH of a specific adenine nucleotide in the intron attacks

the 5’ (donor) splice site breaking the sugar-phosphate backbone of the RNA and

thus separating the upstream exon from the intron. In the process the 5’ end of the

intron gets covalently linked to the adenine nucleotide, creating the loop in the lariat.

In the second reaction, the free 3’-OH at the end of the upstream exon attacks the 3’

(acceptor) splice site separating the intron from the downstream exon, and joining

the two exons together. After this second reaction the intron is released in the shape

of a lariat that ultimately gets degraded.
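The bookkeeping of the two reactions can be followed on a sequence string. This is a minimal sketch that tracks only which nucleotides end up in which product; it does not model any chemistry. The coordinates convention, the function name and the toy pre-mRNA are assumptions made for the example.

```python
def splice(pre_mrna, donor, branch, acceptor):
    """Mimic the products of the two transesterification steps.

    donor    : index of the first intron nucleotide (5' splice site)
    branch   : index of the branch-point adenine inside the intron
    acceptor : index of the first nucleotide of the downstream exon
    Returns (mrna, lariat_loop, lariat_tail).
    """
    if not (donor < branch < acceptor <= len(pre_mrna)):
        raise ValueError("expected donor < branch < acceptor")
    if pre_mrna[branch] != "A":
        raise ValueError("branch point must be an adenine")

    # Step 1: the branch A attacks the 5' splice site, freeing the upstream exon.
    upstream_exon = pre_mrna[:donor]
    intron_and_exon = pre_mrna[donor:]

    # Step 2: the upstream exon attacks the 3' splice site, joining the exons.
    intron = intron_and_exon[: acceptor - donor]
    downstream_exon = intron_and_exon[acceptor - donor:]
    mrna = upstream_exon + downstream_exon

    # The intron 5' end is linked to the branch A: loop up to the A, tail after it.
    loop = intron[: branch - donor + 1]
    tail = intron[branch - donor + 1:]
    return mrna, loop, tail


exon1, intron_seq, exon2 = "AUGGCC", "GUAAGUCUAACUUUUCCAG", "GGAUAA"
pre = exon1 + intron_seq + exon2
mrna, loop, tail = splice(pre, donor=len(exon1), branch=len(exon1) + 9,
                          acceptor=len(exon1) + len(intron_seq))
print(mrna)        # AUGGCCGGAUAA
print(loop, tail)  # GUAAGUCUAA CUUUUCCAG
```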

The spliceosome

These splicing reactions are performed by the spliceosome: one of the largest

molecular machines in the cell, a complex assembly of RNA and protein molecules

whose composition and structure change along the splicing process.

As with the self-splicing introns, RNA, not protein, plays the main role in splicing.

These RNA molecules, known as snRNAs (small nuclear RNAs), lie at the core of the

spliceosome and both recognize the splice sites and participate in the chemistry of

splicing. In the major form of the spliceosome (the minor form is described

later in this section) there are five snRNAs, named U1, U2, U4, U5, and U6, and each

forms complexes with at least seven protein subunits. Together, the snRNA and its

associated proteins form a snRNP (small nuclear ribonucleoprotein). Including the

proteins that form the snRNPs, over 150 proteins make up the spliceosome in

humans (Alberts et al. 2002, 6; Valadkhan and Jaladat 2010).

This large machine is assembled on the pre-mRNA as its snRNAs find complementary

sequences in the pre-mRNA, the splicing signals.

Splicing signals and the assembly of the spliceosome

There are three main splicing signals: the 5’ splice site, where the upstream exon

ends and the intron starts; the branch site, containing the adenine nucleotide

involved in the first transesterification and that forms the branch point of the lariat

produced by splicing; and the polypyrimidine tract/3’ splice site, at the 3’ end of the

intron, just before the downstream exon (Schwartz et al. 2008).

The spliceosome recognizes them largely by base-pairing between the snRNAs and

conserved sequence motifs in the splicing signals. This recognition is done multiple

times along the process of the spliceosome assembly, as new components join the

ribonucleoprotein complex and replace previously bound molecules, so that the RNA

sequences are checked multiple times before the chemical reaction takes place.

In mammals there are four distinct spliceosomal complexes that vary in their snRNP

and auxiliary protein composition: E, A, B and C (in temporal order).

Early in the spliceosomal assembly pathway the U1 snRNA and U1C, a U1-specific

protein, recognize the 5’ splice site, and the U2 snRNA together with the U2 auxiliary

factor U2AF recognize the branch site and the polypyrimidine tract and 3’ splice site.

At this point, before the use of ATP, the interaction of the U2 snRNP with this region

of the pre-mRNA is loose; this stage is splicing complex E. With the use of ATP

the association of U2 with this region is remodeled and strengthened, forming the

splicing complex A.

In the next step the U4/U6•U5 tri-snRNP enters the spliceosome, giving rise to the B

complex. In this triple snRNP, the U4 and U6 snRNAs are held firmly together by

base-pair interactions that keep U6 in an inactive conformation, and the U5 snRNP is

more loosely associated. Once the tri-snRNP joins the spliceosome several RNA-RNA

rearrangements break the U4-U6 basepairing, U1 and U4 leave the complex, U2

replaces U4 as the basepairing partner of U6 and U6 replaces U1 at the 5’ splice

junction as the B complex becomes catalytically active.

After the first transesterification reaction major structural rearrangements lead to

the formation of spliceosomal complex C. In this step the U5 snRNA forms base-pair

interactions with exon sequences at both the 5′ and 3′ splice sites, bringing the two

exons into close proximity for the second transesterification.

Once the second splicing step is completed, the spliceosome complex disassembles,

the spliced mRNA and the excised intron are released and the spliceosome

components are recycled for further rounds of splicing, closing the spliceosomal cycle

of assembly, catalysis, disassembly and recycling (Valadkhan and Jaladat 2010; Will

and Lührmann 2001).

The minor form of spliceosome

A small fraction of introns in more complex eukaryotes, such as flies, mammals and

plants, have different conserved splicing motifs and are removed by a distinct

spliceosome. At the core of this spliceosome there are also five snRNPs, of which only

one, the U5 snRNP, is shared by both spliceosomes. The other four, U11, U12, U4atac

and U6atac, are low-abundance snRNPs functionally analogous to the major

spliceosome U1, U2, U4 and U6 snRNPs, respectively, making the same types of RNA-

RNA interactions with the pre-mRNA and with each other as do the major snRNPs.

This functional correspondence between major and minor class snRNPs is reflected in

the similarity of their secondary structures, but not their nucleotide sequence. It thus

seems that the low-abundance minor snRNPs are not simply a variant of the major

snRNPs and that the similarities reflect analogy rather than homology. In fact, both

models proposed for the origin of these two splicing systems assume they evolved

from self-splicing group II introns but that the differences existed already in the

progenitor of higher eukaryotes. According to one of the models the two

spliceosome types derive from two different group-II-like introns, while the other

model proposes that they evolved in separate lineages that later fused in the

ancestor of higher eukaryotes.

The introns spliced out by this spliceosome are known both as U12-type introns –

due to their dependency on that snRNP for splicing (while introns extracted by the

major form of spliceosome are named U2-type) – and as AT-AC introns – after the

first examples of this class of introns (which turned out not to be representative of

this class) that started with an AT and ended with an AC dinucleotide instead of the

canonical GT-AG. Although these introns are scarce nowadays, with only a few in the

genome of any given species, it is thought that they were much more frequent earlier

in evolution and have been either lost or converted to U2-type introns over time. Yet,

their persistence in homologous genes in highly diverged species and presence in

virtually all of metazoan evolution indicate that they must have an important

cellular function (Patel and Steitz 2003).

Finding the correct pair of splice sites

Even though the conserved sequence motifs in the splicing signals are read multiple

times by different components of the spliceosome – which increases the accuracy of

splice site selection – these motifs are short and degenerate enough that, if splice

sites were recognized by these motifs alone, there would be numerous

splicing errors. The pairing of non-consecutive splice sites, for instance, would lead to

the exclusion of one or more exons from the spliced mRNA, an error known as exon

skipping, and the use of cryptic splice sites (locations in the pre-mRNA whose

nucleotide sequence resembles the one found in true splice sites) would lead to exon

truncation or incorporation of intronic sequence in the mature mRNA.
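A rough calculation shows why short motifs alone cannot specify splice sites. The sketch below assumes, purely for illustration, a random sequence with equal base frequencies and independent positions, which real introns do not follow. It compares the expected number of chance occurrences of a dinucleotide such as GT or AG with the counts in one simulated 10,000-nucleotide sequence. The function names are hypothetical.

```python
import random


def count_motif(seq, motif):
    """Count (possibly overlapping) occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def expected_count(length, motif_len):
    """Expected occurrences of one fixed motif in a uniform random sequence."""
    return (length - motif_len + 1) / 4 ** motif_len


rng = random.Random(0)  # fixed seed so the simulated sequence is reproducible
random_seq = "".join(rng.choice("ACGT") for _ in range(10_000))

print(expected_count(10_000, 2))  # 624.9375
print(count_motif(random_seq, "GT"), count_motif(random_seq, "AG"))
```

Under these assumptions a sequence of this length carries hundreds of GT and AG dinucleotides by chance alone, so additional information is needed to tell true splice sites from cryptic ones.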

Besides the classical splicing signals there are other cis-acting elements with less

clearly identifiable consensus sequences, found both in introns (ISR, intronic splicing

regulators) and exons (ESR, exonic splicing regulators), which are important for

correct splice site identification. These elements are recognized by SR proteins

(serine- and arginine-rich proteins), hnRNPs (heterogeneous nuclear

ribonucleoproteins) and other proteins, which interact with the spliceosome either

enhancing or silencing splicing (Cartegni et al. 2002).

Two other factors are thought to help in choosing the correct splice site pair: co-

transcriptional assembly of the spliceosome and pairing of the splice sites across an

exon. As with other pre-mRNA processing factors (involved in 5’ end capping and 3’

end polyadenylation) some splicing factors are carried on the RNA polymerase II tail

during transcription and get transferred onto the nascent RNA at appropriate

locations. This way, a snRNP at the donor splice site has only one acceptor splice site

to choose from while the downstream acceptor sites have not yet been synthesized

(Alberts et al. 2002, 6).

The second mechanism that helps identify the correct pair of splice sites has been

proposed particularly for large introns. While exon size tends to be fairly uniform

across eukaryotes, with an average of approximately 150 nucleotides, introns tend to

be much longer, typically hundreds to thousands of nucleotides or more, and vary

enormously in size even within a single organism. This makes locating splice sites

across long introns remarkably difficult compared to pairing splice sites across

evenly sized exons. Thus, the exon definition model proposes that first, splice sites are

paired across the exons and then, consecutive exon units are paired as the

spliceosome machinery assembles on the intervening intron. The pairing of splice

sites across exons is helped by SR proteins that bind to exonic sequence and help

recruit spliceosomal components and stabilize interactions (Berget 1995; Lim and

Burge 2001; Wang and Burge 2008).

Alternative splicing

What the correct pair of splice sites is can actually change with time and tissue.

The use of different splice site pairs can lead to complete exons being skipped or

included in the mature mRNA, exons being shortened or elongated by the use of

alternative 3’ and/or 5’ splice sites and introns being kept in the processed transcript.

This variation on how a particular RNA transcript is spliced, named alternative

splicing, leads to different parts of the primary transcript being present in the mature

mRNA and can thus generate diverse peptides from a single gene.
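The way exon skipping multiplies the transcripts of a gene can be sketched by enumeration. The example below considers only skipped (cassette) exons and treats each one as independently included or excluded, which is a simplifying assumption; alternative splice sites and intron retention are ignored. The exon names and the function name are hypothetical.

```python
from itertools import product


def exon_skipping_isoforms(exons, cassette):
    """List every isoform obtained by including or skipping each cassette exon.

    exons    : exon names in transcript order
    cassette : set of exon names that may be skipped
    """
    choices = [(True, False) if name in cassette else (True,) for name in exons]
    isoforms = []
    for keep in product(*choices):
        isoforms.append([name for name, kept in zip(exons, keep) if kept])
    return isoforms


isoforms = exon_skipping_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"})
for isoform in isoforms:
    print("-".join(isoform))
print(len(isoforms))  # 4
```

With k independent cassette exons the number of isoforms is 2 to the power k, so even a handful of such exons yields many distinct mature mRNAs.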

Alternative splicing, which may have been present already in early eukaryotes,

gained prominence along eukaryotic evolution: it is more abundant in higher

eukaryotes than lower eukaryotes and occurs in more genes in higher vertebrates

than in invertebrates (Keren et al. 2010). In humans, microarray profiling studies

estimate that about two-thirds of our genes contain one or more alternatively spliced

exons, and studies using high-throughput sequencing, a more sensitive technology,

bring the estimate of alternatively spliced genes to more than 90% (Castle et al.

2008; Pan et al. 2008). This can dramatically increase the number of proteins a

genome is capable of synthesizing.

Some of these genes are constitutively alternatively spliced and the different mRNA

isoforms are present in all the tissues in which that gene is expressed, but the

majority (over 60%) of alternative splicing events are tissue-specific, lending support

to the hypothesis that alternative splicing is a major contributor to phenotypic

complexity in higher vertebrates (Wang et al. 2008).

This flexibility in the pairing of the acceptor and donor splice sites that allows for

alternative splicing is achieved by relying less on the classic splice site motifs, which

tend to be weaker in alternatively spliced exons, and depending more on exonic and

intronic splicing regulators (ESR and ISR, described in the previous section), which

tend to be more conserved in these exons (Keren et al. 2010).

Although it is not clear what portion of alternative transcripts is functional, there is

no doubt that alternative splicing is a highly regulated process, as producing the

wrong transcript in the wrong place or at the wrong time can be deleterious to the

cell.

Why should we care about introns?

While it is still widely discussed whether introns flourished in eukaryotes due to

selection for some advantageous trait, like their potential to speed up evolution

initially proposed by Gilbert in 1978, or by a neutral process involving random genetic

drift (Lynch 2006), it is clear that introns now carry out many functions that are under

selection, most of which are probably the result of intron ‘domestication’ by

eukaryotic genomes and thus, not the reason for their initial spread.

Boost mRNA quality

DNA rearrangements, frameshifts, nonsense mutations, transcriptional errors or

incorrect splicing can all lead to the production of mRNAs with premature

termination codons (PTCs) that could generate non-functional or deleterious

truncated proteins. Cells, from yeast to human, have an mRNA surveillance

mechanism, known as nonsense-mediated decay (NMD), which targets these

prematurely terminated mRNAs for degradation, thus increasing mRNA quality.

Introns play a role in this process because NMD recognition of PTCs relies on the

spatial relationship between the stop codon and the introns: generally a termination

codon should only occur after all the introns. When introns are removed by splicing,

proteins in the nucleus bind to and thereby mark the exon-exon junctions. If one of

these junctions is found after a termination codon it triggers NMD (Cartegni et al.

2002).
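The rule described above can be written as a simple check. The sketch below implements only the simplified criterion stated in the text, namely that an exon-exon junction located after the first in-frame termination codon triggers NMD; any distance requirement between the stop codon and the junction is deliberately ignored. The function names and the toy transcripts are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def exon_junction_positions(exon_lengths):
    """Positions of exon-exon junctions in the spliced mRNA (0-based boundaries)."""
    positions = []
    cumulative = 0
    for length in exon_lengths[:-1]:
        cumulative += length
        positions.append(cumulative)
    return positions


def first_stop_codon_end(mrna, cds_start=0):
    """Index just past the first in-frame stop codon, or None if there is none."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


def triggers_nmd(exons, cds_start=0):
    """True if any exon-exon junction lies downstream of the first stop codon."""
    mrna = "".join(exons)
    stop_end = first_stop_codon_end(mrna, cds_start)
    if stop_end is None:
        return False
    junctions = exon_junction_positions([len(e) for e in exons])
    return any(junction >= stop_end for junction in junctions)


print(triggers_nmd(["AUGGCU", "GCAUAA"]))     # False: stop codon in the last exon
print(triggers_nmd(["AUGUGAGCU", "GCAUAA"]))  # True: junction after a premature stop
```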

Increase recombination

Linked loci interfere with each other's response to selection (Hill-Robertson effect),

which can lead to the loss of beneficial mutations – since beneficial mutations

occurring in different haplotypes have to compete among each other – and to the

long-term accumulation of deleterious mutations (Muller's ratchet). By breaking

down linkage disequilibrium, recombination increases the efficacy of natural

selection (Felsenstein 1974).

Introns increase the rate of intragenic meiotic crossing over, generally reducing

linkage disequilibrium between adjacent exons, and thus allow for more efficient

selection of mutations within the gene (Duret 2001).

Source of functional diversity

Thanks to alternative splicing, a single gene can encode many proteins. For instance

in humans, the approximately 24,000 protein-coding genes in the genome are

estimated to produce around 100,000 different proteins (Keren et al. 2010). In fact,

through alternative splicing, a single gene can generate more transcripts than the

number of genes in an entire genome (Graveley 2001).

Many cases of alternative splicing are tissue specific, and the alternative transcript

isoforms are differentially expressed in at least one tissue (Castle et al. 2008; Pan et

al. 2008; Wang et al. 2008), which greatly contributes to organism complexity.

Repositories of functional elements

Introns contain several regulatory elements, highly conserved sequences, and even

other genes.

Many noncoding RNAs, including microRNAs and small nucleolar RNAs (snoRNAs), are

encoded in introns of protein coding genes. After transcription, the intron removed

by splicing is processed to form these untranslated RNAs that play a role in a number

of cellular regulatory mechanisms (Brown et al. 2008).

Introns also contain about half of the ultraconserved elements found in genes. These

DNA sequences of more than 200 base pairs in length that have been perfectly

conserved for more than 85 million years are thought to play a role in the regulation

of early development (Bejerano et al. 2004; Visel et al. 2008).
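Perfect conservation of this kind can be detected by scanning a multiple alignment for long runs of identical columns. The sketch below is a minimal illustration and not the procedure used by the cited authors; it assumes pre-aligned sequences of equal length with gaps written as "-", and the length threshold is a parameter (the toy call uses a small value so that the example stays short). The function name is hypothetical.

```python
def perfectly_conserved_runs(aligned, min_len=200):
    """Find runs of gap-free columns identical in all aligned sequences.

    aligned : list of equal-length aligned sequences (gaps as '-')
    min_len : minimum run length to report
    Returns a list of (start, end) pairs, end exclusive.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None
    for i in range(length):
        column = {seq[i] for seq in aligned}
        identical = len(column) == 1 and "-" not in column
        if identical and start is None:
            start = i                      # a conserved run begins
        elif not identical and start is not None:
            if i - start >= min_len:       # run just ended; keep it if long enough
                runs.append((start, i))
            start = None
    if start is not None and length - start >= min_len:
        runs.append((start, length))       # run extends to the end of the alignment
    return runs


toy_alignment = ["ACGTACGTAA", "ACGTACCTAA"]
print(perfectly_conserved_runs(toy_alignment, min_len=3))  # [(0, 6), (7, 10)]
print(perfectly_conserved_runs(toy_alignment, min_len=5))  # [(0, 6)]
```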

Finally, the most common functional elements found in introns are involved in

regulating splicing and transcription. Splicing regulatory elements are essential in

alternative splicing to regulate splicing in a developmental and/or cell-type-specific

fashion as this complexity cannot be achieved by the classical splicing signals alone,

but they are also needed to recognize legitimate splice sites in general, particularly in

species with long introns, and thus they must be present in the majority of introns in

species like ours (Cartegni et al. 2002).

As to the elements that regulate gene expression, their presence in introns was

noticed just after introns themselves were discovered (Gruss et al. 1979) when it was

observed that the expression profile of intronless versions of genes differed from that of the

original intron-containing version. It is now known that introns influence many stages

of mRNA metabolism besides splicing, such as transcription, editing and

polyadenylation, nuclear export, translation and mRNA decay, all of which can affect

the expression of a gene (Le Hir et al. 2003).

Interestingly, some of these elements in introns are functional at the DNA level (like

the ultraconserved elements) while others function at the RNA level (noncoding

RNAs, for example), and some of the processes introns help regulate require

elements from both levels. For instance, in regulating transcription, intronic

transcription regulatory elements in the form of cis-acting transcription factor

binding sites, as well as nucleosome-positioning elements (that can regulate

transcription by controlling DNA accessibility) act at the DNA level, while splicing

signals in the introns act after transcription, thus at the RNA level, and can affect both

transcription initiation and elongation (Le Hir et al. 2003). Also in splicing, both levels

seem to play a role, as it was recently proposed that introns contain pentamers that

disfavor nucleosome binding (Schwartz et al. 2009) and thus help position

nucleosomes preferentially in exons (at the DNA level). This in turn may help exon

recognition and selection in the RNA transcript either by slowing RNA polymerase II

as it reaches the exon and thus facilitating the transfer of splicing factors carried by

the RNA polymerase II tail onto the nascent RNA, or by the interaction of particular

histone modifications on the nucleosomes located in the exons with the splicing

machinery thus influencing its function (Tilgner et al. 2009; Keren et al. 2010). This

last model has in fact been proposed to explain the establishment of alternative

splicing patterns during development and cell differentiation just as the level of

activity of a gene is also determined: through the epigenetic memory contained in

histone modifications (Luco et al. 2010).

* * *

In summary, introns’ functions start before transcription and do not end with their

removal from the transcript. Despite the diversity of functions attributed to

them so far, it is possible that new, surprising functions remain to be

discovered, as the new-found interest in non-coding sequences continues to bear

fruit.

Given the critical roles introns play in several mechanisms in the cell, it is expected

that selection modulates their evolution. Intron spatial distribution can be under

pressure to maximize NMD (Lynch and Kewalramani 2003), intron size under

selection, for instance, for its effect on recombination (Duret 2001), and intron

sequence influenced by the great variety of regulatory motifs and other functional

elements introns harbor.

This thesis concerns this last level of selection on introns, looking, in the first chapter

of the Results section, at sequence conservation to identify general intronic regions

with higher density and/or higher impact functional elements, and, in the second

chapter, at individual accelerated introns in the human lineage which may set us

apart from other primates.

References

Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. 2002. Molecular Biology of the Cell. 4th ed. Garland Science.

Aloni Y, Dhar R, Laub O, Horowitz M, Khoury G. 1977. Novel mechanism for RNA maturation: the leader sequences of simian virus 40 mRNA are not transcribed adjacent to the coding sequences. Proc. Natl. Acad. Sci. U.S.A. 74: 3686-3690.

Basu MK, Rogozin IB, Deusch O, Dagan T, Martin W, Koonin EV. 2008. Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues. Mol. Biol. Evol. 25: 111-119.

Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. 2004. Ultraconserved elements in the human genome. Science. 304: 1321-1325.

Belshaw R, Bensasson D. 2006. The rise and falls of introns. Heredity. 96: 208-213.

Berget SM. 1995. Exon recognition in vertebrate splicing. J. Biol. Chem. 270: 2411-2414.

Berget SM, Moore C, Sharp PA. 1977. Spliced segments at the 5’ terminus of adenovirus 2 late mRNA. Proc. Natl. Acad. Sci. U.S.A. 74: 3171-3175.

Breathnach R, Mandel JL, Chambon P. 1977. Ovalbumin gene is split in chicken DNA. Nature. 270: 314-319.

Brown JWS, Marshall DF, Echeverria M. 2008. Intronic noncoding RNAs and splicing. Trends in Plant Science. 13: 335-342.

Calvin K, Li H. 2008. RNA-splicing endonuclease structure and function. Cell. Mol. Life Sci. 65: 1176-1185.

Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Müller KM, et al. 2002. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 3: 2.

Cartegni L, Chew SL, Krainer AR. 2002. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat. Rev. Genet. 3: 285-298.

Castle JC, Zhang C, Shah JK, Kulkarni AV, Kalsotra A, Cooper TA, Johnson JM. 2008. Expression of 24,426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines. Nat. Genet. 40: 1416-1425.

Cech TR. 1986. The generality of self-splicing RNA: relationship to nuclear mRNA splicing. Cell. 44: 207-210.

Cech TR, Bass BL. 1986. Biological catalysis by RNA. Annu. Rev. Biochem. 55: 599-629.

Chow LT, Gelinas RE, Broker TR, Roberts RJ. 1977. An amazing sequence arrangement at the 5’ ends of adenovirus 2 messenger RNA. Cell. 12: 1-8.

Coulombe-Huntington J, Majewski J. 2007. Characterization of intron loss events in mammals. Genome Res. 17: 23-32.

Doel MT, Houghton M, Cook EA, Carey NH. 1977. The presence of ovalbumin mRNA coding sequences in multiple restriction fragments of chicken DNA. Nucleic Acids Res. 4: 3701-3713.

Dunn AR, Hassell JA. 1977. A novel method to map transcripts: evidence for homology between an adenovirus mRNA and discrete multiple regions of the viral genome. Cell. 12: 23-36.

Duret L. 2001. Why do genes have introns? Recombination might add a new piece to the puzzle. Trends Genet. 17: 172-175.

Felsenstein J. 1974. The evolutionary advantage of recombination. Genetics. 78: 737-756.

Fujishima K, Sugahara J, Tomita M, Kanai A. 2010. Large-scale tRNA intron transposition in the archaeal order Thermoproteales represents a novel mechanism of intron gain. Mol. Biol. Evol. 27: 2233-2243.

Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M. 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17: 669-681.

Gilbert W. 1978. Why genes in pieces? Nature. 271: 501.

Graveley BR. 2001. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17: 100-107.


Gruss P, Lai CJ, Dhar R, Khoury G. 1979. Splicing as a requirement for biogenesis of functional 16S mRNA of simian virus 40. Proc. Natl. Acad. Sci. U.S.A. 76: 4317-4321.

Haugen P, Simon DM, Bhattacharya D. 2005. The natural history of group I introns. Trends Genet. 21: 111-119.

Le Hir H, Nott A, Moore MJ. 2003. How introns influence and enhance eukaryotic gene expression. Trends Biochem. Sci. 28: 215-220.

Hoshina R, Imamura N. 2009. Phylogenetically close group I introns with different positions among Paramecium bursaria photobionts imply a primitive stage of intron diversification. Mol. Biol. Evol. 26: 1309-1319.

Hsu MT, Ford J. 1977. Sequence arrangement of the 5’ ends of simian virus 40 16S and 19S mRNAs. Proc. Natl. Acad. Sci. U.S.A. 74: 4982-4985.

Keren H, Lev-Maor G, Ast G. 2010. Alternative splicing and evolution: diversification, exon definition and function. Nat. Rev. Genet. 11: 345-355.

Kitchingman GR, Lai SP, Westphal H. 1977. Loop structures in hybrids of early RNA and the separated strands of adenovirus DNA. Proc. Natl. Acad. Sci. U.S.A. 74: 4392-4395.

Klessig DF. 1977. Two adenovirus mRNAs have a common 5’ terminal leader sequence encoded at least 10 kb upstream from their main coding regions. Cell. 12: 9-21.

Lambowitz AM, Zimmerly S. 2004. Mobile group II introns. Annu. Rev. Genet. 38: 1-35.

Lewis JB, Anderson CW, Atkins JF. 1977. Further mapping of late adenovirus genes by cell-free translation of RNA selected by hybridization to specific DNA fragments. Cell. 12: 37-44.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. U.S.A. 98: 11193-11198.

Luco RF, Pan Q, Tominaga K, Blencowe BJ, Pereira-Smith OM, Misteli T. 2010. Regulation of alternative splicing by histone modifications. Science. 327: 996-1000.


Lykke-Andersen J, Aagaard C, Semionenkov M, Garrett RA. 1997. Archaeal introns: splicing, intercellular mobility and evolution. Trends Biochem. Sci. 22: 326-331.

Lynch M. 2006. The origins of eukaryotic gene structure. Mol. Biol. Evol. 23: 450-468.

Lynch M, Kewalramani A. 2003. Messenger RNA surveillance and the evolutionary proliferation of introns. Mol. Biol. Evol. 20: 563-571.

Marx JL. 1977. Viral messenger structure: some surprising new developments. Science. 197: 853-923.

Marx JL. 1978. Gene structure: more surprising developments. Science. 199: 517-518.

Michel F, Jacquier A, Dujon B. 1982. Comparison of fungal mitochondrial introns reveals extensive homologies in RNA secondary structure. Biochimie. 64: 867-881.

Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. 2008. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40: 1413-1415.

Patel AA, Steitz JA. 2003. Splicing double: insights from the second spliceosome. Nat. Rev. Mol. Cell Biol. 4: 960-970.

Perry RP. 1976. Processing of RNA. Annu. Rev. Biochem. 45: 605-630.

Rodríguez-Trelles F, Tarrío R, Ayala FJ. 2006. Origins and evolution of spliceosomal introns. Annu. Rev. Genet. 40: 47-76.

Rogers J. 1978. Genes in pieces. New Scientist. 5 January: 18-20.

Roy SW, Gilbert W. 2006. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet. 7: 211-221.

Sambrook J. 1977. Adenovirus amazes at Cold Spring Harbor. Nature. 268: 101-104.

Schwartz SH, Silva J, Burstein D, Pupko T, Eyras E, Ast G. 2008. Large-scale comparative analysis of splicing signals and their corresponding splicing factors in eukaryotes. Genome Res. 18: 88-103.

Schwartz S, Meshorer E, Ast G. 2009. Chromatin organization marks exon-intron structure. Nat. Struct. Mol. Biol. 16: 990-995.


Stoddard BL. 2005. Homing endonuclease structure and function. Q. Rev. Biophys. 38: 49-95.

Tilgner H, Nikolaou C, Althammer S, Sammeth M, Beato M, Valcárcel J, Guigó R. 2009. Nucleosome positioning as a determinant of exon recognition. Nat. Struct. Mol. Biol. 16: 996-1001.

Valadkhan S, Jaladat Y. 2010. The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics. 10: 4128-4141.

Vicens Q, Cech TR. 2006. Atomic level architecture of group I introns revealed. Trends Biochem. Sci. 31: 41-51.

Visel A, Prabhakar S, Akiyama JA, Shoukry M, Lewis KD, Holt A, Plajzer-Frick I, Afzal V, Rubin EM, Pennacchio LA. 2008. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40: 158-160.

Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature. 456: 470-476.

Wang Z, Burge CB. 2008. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA. 14: 802-813.

Will CL, Lührmann R. 2001. Spliceosomal UsnRNP biogenesis, structure and function. Curr. Opin. Cell Biol. 13: 290-301.

Williamson B. 1977. DNA insertions and gene structure. Nature. 270: 295-297.


Results


“All models are wrong but some are useful.”

George E. P. Box, 1979.

Box GEP. 1979. Robustness in the strategy of scientific model building. In Launer RL, Wilkinson GN, eds. Robustness in statistics. New York: Academic Press. p. 201-236.


PUBLICATION I

Intronic mutational constraints in Primates

Olga Fernando1,2, Arcadi Navarro1,3,4

1Institut de Biologia Evolutiva (CSIC-UPF), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Spain.

2Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal.

3National Institute for Bioinformatics, Universitat Pompeu Fabra, Barcelona, Spain.

4Institució Catalana de Recerca i Estudis Avançats (ICREA), Catalonia, Spain.

[Submitted]


The author of the thesis collected the data, performed the analyses and drafted the manuscript.


ABSTRACT

Introns are known to contain a variety of functional elements, most commonly related to splicing and transcription. Many of these elements occur at variable locations within the intron, have sequence motifs with low information content, and act in a context-dependent way, which makes them difficult to identify and characterize. In the present study we examine the frequency of substitutions along human-chimpanzee-macaque orthologous introns in order to define regions in which these elements are more likely to occur. We find a clear signal of the core splicing elements in the first and last few base pairs of introns, but also a significant signal of other conserved elements, most likely related to splicing, up to 400 bp from the closest splice site. We show that first introns, defined as the 5’-most intron in the gene, form a separate class with a distinct substitution pattern and biological role. In these introns conservation extends for several kilobases from the donor splice site, most likely due to the presence of elements involved in transcription. The regions described here can be used to define target regions when studying functional elements in introns (either computational scans for over-represented motifs or functional experiments), and to select intronic regions in studies that use introns as neutrally evolving sequences, from which these more conserved regions should be excluded.

INTRODUCTION

Although the first sequence motifs involved in splicing were found almost at the same time as introns themselves (Breathnach et al. 1978), 30 years later we are still a long way from being able to predict splicing accurately from the DNA sequence alone (Guigó et al. 2006). This is partly because the core splicing signals, which are relatively easy to identify (namely the 5’ splice site, branch site, polypyrimidine tract (PPT) and 3’ splice site), contain only about half of the information necessary to locate even short human introns (Lim and Burge 2001). Much of the other half is expected to come from a large variety of short cis-acting sequence elements that are much harder to identify.

These splicing regulatory elements (SREs) are located at variable distances from splice sites (SSs) in both introns and exons, and enhance or inhibit splicing in a context-dependent way (i.e. the same element can act as an enhancer or an inhibitor depending on its location) (Wang and Burge 2008). This complex regulation of splicing, together with the low information content of SRE motifs, makes it hard to locate SREs accurately, despite their high frequency in human genes (Fairbrother et al. 2002). Defining regions in which these elements are more likely to occur would facilitate their study with both experimental approaches and computational screens for overrepresented motifs.

The presence of functional elements should affect sequence conservation, which in turn can be used to predict the regions where they are more likely to be found. In this study, we use levels of conservation along primate introns to locate highly conserved regions that are more likely to be functionally relevant.

We focus on introns because little attention has been given to intronic SREs in comparison with exonic elements (Sorek and Ast 2003) and, more importantly, because introns may contain a higher proportion of SREs, just as they contain the great majority of the sequence information at splice junctions (Stephens and Schneider 1992).

Nonetheless, introns contain other functional elements besides splicing-related sequences that can also affect conservation. Transcriptional regulatory elements are common, mainly in first introns (Majewski and Ott 2002), and it has recently been proposed that introns also contain sequences that help position nucleosomes preferentially in exons (Schwartz et al. 2009). Thus, the patterns we obtain will also reflect the presence of these and possibly other unidentified functional elements in introns. Additionally, knowing which regions within introns have a higher probability of containing functional elements is very important for population genetics, historical inference, and other studies that use introns as neutrally evolving sequences (Hare and Palumbi 2003).

MATERIALS AND METHODS

Genomic Sequences and Gene Annotations

Whole-genome DNA sequences for human (hg18), chimpanzee (panTro2) and macaque (rheMac2), together with chimpanzee and macaque sequence quality scores, were downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/).

Human gene annotations and one-to-one orthology information were obtained from Ensembl (http://www.ensembl.org/) release 48.

Gene Alignments

Full sequences of genes with at least one intron in the human gene annotation were extracted from the corresponding chromosome sequence file of each species according to the one-to-one orthology information. Nucleotides with quality scores below 40 were masked in the chimpanzee and macaque gene sequences, which leaves high-confidence sequence with an error rate of at most 1/10,000. A three-species alignment was then produced with TBA (Blanchette et al. 2004) for each gene.
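As a rough illustration of the masking step, the sketch below assumes that the quality scores are Phred-scaled, so that a score Q corresponds to an error probability of 10^(-Q/10) and Q = 40 corresponds to 1/10,000. The function names and the use of N as the masking character are assumptions made for this example and do not describe the scripts actually used.

```python
def phred_error_probability(q):
    """Error probability implied by a Phred-scaled quality score q (assumed scale)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq, quals, min_q=40):
    """Replace every base whose quality score is below min_q with 'N'."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))


# Hypothetical four-base example: only bases with a score of 40 or more survive.
masked = mask_low_quality("ACGT", [50, 39, 40, 10])
```

With this threshold, every retained base has an expected error probability no greater than the value returned by `phred_error_probability(40)`.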

Data Filtering

For genes with multiple transcripts, the transcript with the highest exon coverage (that is, the one with the longest sequence resulting from the concatenation of all its exons) was chosen to represent the gene. As a further measure to ensure that each locus is present only once in the final dataset, overlapping genes were excluded from the analysis.
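The choice of a representative transcript can be sketched as follows. The data layout (a mapping from transcript identifiers to lists of exon coordinates, 1-based and inclusive) and the function names are hypothetical, and the handling of ties is an arbitrary choice that the text does not specify.

```python
def exon_coverage(exons):
    """Total exon length for a list of (start, end) pairs, 1-based inclusive."""
    return sum(end - start + 1 for start, end in exons)


def pick_representative(transcripts):
    """Return the ID of the transcript with the largest exon coverage.

    transcripts maps a transcript ID to its list of exons. Ties go to the
    transcript that comes first in the mapping (an arbitrary choice).
    """
    return max(transcripts, key=lambda tid: exon_coverage(transcripts[tid]))


# Hypothetical gene with two annotated transcripts.
example = {
    "t1": [(1, 100), (201, 300)],            # 200 bp of exons
    "t2": [(1, 100), (201, 250), (401, 600)]  # 350 bp of exons
}
representative = pick_representative(example)
```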

To avoid possible annotation errors, we excluded genes with incorrect splice sites, with coding sequences (CDS) whose length is not a multiple of three, without a start or a stop codon, with nonsense mutations, or with introns smaller than 20 bp⁵. Additionally, in genes suspected of having incomplete annotation because they lack a 5’ or 3’ UTR, the first or last intron of the gene, respectively, was excluded to avoid possible misclassifications in the first, last and single intron classes.

Finally, introns whose aligned chimpanzee or macaque sequence contained more than 50% Ns and/or gaps were excluded to avoid possible false orthology, leaving 9,106 genes with 74,756 introns for analysis.
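A minimal sketch of this last filter is given below, assuming that each aligned intron is available as a string in which masked bases appear as N and alignment gaps as "-". The function names are hypothetical.

```python
def uninformative_fraction(aligned_seq):
    """Fraction of an aligned sequence made up of Ns and gap characters."""
    if not aligned_seq:
        return 1.0  # an empty alignment carries no information
    bad = sum(1 for ch in aligned_seq.upper() if ch in "N-")
    return bad / len(aligned_seq)


def keep_intron(chimp_aln, macaque_aln, max_fraction=0.5):
    """Keep an intron only if neither species exceeds max_fraction of Ns/gaps."""
    return all(
        uninformative_fraction(seq) <= max_fraction
        for seq in (chimp_aln, macaque_aln)
    )
```

An intron with exactly 50% Ns and gaps is kept here, since the criterion in the text is "more than 50%".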

Data Analysis and Plotting

We studied introns on a position-by-position basis. Each position along an intron was labeled with the distance of that nucleotide from the closest splice site (SS). For each position, we counted the total number of introns in which that nucleotide was present in our alignments (alignment columns with Ns or gaps were deemed uninformative) and measured the percentage of introns in which the sequence of at least one of the other species differed from the human sequence. That percentage is an estimate of the degree of conservation of each nucleotide along an intron.
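The per-position measure can be sketched as follows. The example assumes that each intron is given as three equally long aligned strings (human, chimpanzee, macaque) and, for simplicity, that alignment columns can be used directly as intron positions. Distances from the acceptor SS are given as negative values, as in the figures. All names are hypothetical.

```python
from collections import defaultdict


def distance_to_closest_ss(i, length):
    """Signed distance of 0-based index i: positive from the donor SS,
    negative from the acceptor SS."""
    from_donor = i + 1
    from_acceptor = length - i
    return from_donor if from_donor <= from_acceptor else -from_acceptor


def substitution_profile(alignments):
    """Percentage of introns with a substitution at each signed position.

    alignments is an iterable of (human, chimp, macaque) aligned strings.
    Columns containing anything other than A, C, G or T are skipped.
    """
    informative = defaultdict(int)
    substituted = defaultdict(int)
    for human, chimp, macaque in alignments:
        length = len(human)
        columns = zip(human.upper(), chimp.upper(), macaque.upper())
        for i, column in enumerate(columns):
            if any(base not in "ACGT" for base in column):
                continue  # uninformative column (N or gap)
            pos = distance_to_closest_ss(i, length)
            informative[pos] += 1
            if column[1] != column[0] or column[2] != column[0]:
                substituted[pos] += 1
    return {pos: 100.0 * substituted[pos] / n for pos, n in informative.items()}


# Two hypothetical 4-bp "introns".
profile = substitution_profile([
    ("ACGT", "ACGT", "ACGA"),
    ("ACGT", "TCGT", "ACNT"),
])
```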

Fisher's exact test was performed with the R (R Development Core Team 2009) function fisher.test, and the resulting estimates of the odds ratio and p-value under a two-sided alternative hypothesis were used to produce Figure 4 and Figure 6. The number of substitutions observed in a given window of size k is simply the sum of the number of substitutions observed for each of the k nucleotides in that window. To account for multiple comparisons resulting from testing several windows on the same intron classes, p-values were conservatively adjusted using the Bonferroni correction. The significance thresholds for the 50, 100 and 500 bp window analyses were, respectively, 0.05, 0.01 and 0.001, to accommodate the fact that counts for wider windows tend to be higher (being the sum of a larger number of observations/nucleotides) and thus yield smaller p-values.

⁵ 20 bp is approximately the length of the smallest spliceosomal introns described (Gilson and McFadden 1996) and the minimum sequence length containing essential splicing signals (Wieringa et al. 1984).
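The window comparison described above can be sketched in plain Python. The layout of the 2x2 table (substituted versus unsubstituted informative positions in two windows) is an assumption made for this example, since the text does not spell it out. The odds ratio computed here is the simple sample odds ratio, whereas R's fisher.test reports a conditional maximum likelihood estimate, so the two values can differ.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample_odds_ratio, p_value). The p-value sums the probabilities
    of all tables with the same margins that are no more likely than the
    observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def log_prob(x):
        return (_log_comb(row1, x) + _log_comb(row2, col1 - x)
                - _log_comb(row1 + row2, col1))

    log_observed = log_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        math.exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= log_observed + 1e-7  # small tolerance for ties
    )
    # The sample odds ratio is reported as infinite when b * c is zero.
    odds_ratio = math.inf if b * c == 0 else (a * d) / (b * c)
    return odds_ratio, min(1.0, p_value)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: (substituted, unsubstituted) positions in two windows.
odds_ratio, p_value = fisher_exact_two_sided(3, 1, 1, 3)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

An adjusted p-value would then be compared with the threshold chosen for the corresponding window size.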

Sequence logos (Schneider and Stephens 1990) were created with WebLogo (Crooks et al. 2004) from intronic sequences aligned at the closest SS.
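For reference, the information content plotted in a sequence logo can be approximated for a single alignment column as 2 bits minus the Shannon entropy of the observed base frequencies. The sketch below is a simplified illustration with a hypothetical function name. It omits the small-sample correction that logo tools may apply, so its values need not match WebLogo output exactly.

```python
import math
from collections import Counter


def information_bits(column):
    """Information content (bits) of one DNA alignment column.

    Computed as 2 - H, where H is the Shannon entropy of the base
    frequencies. Characters other than A, C, G and T are ignored.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - entropy
```

A fully conserved column therefore carries 2 bits, and a column in which the four bases are equally frequent carries none.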

RESULTS

Conservation at the ends of introns extends up to 400 bp

The percentage of substitutions observed in human-chimpanzee-macaque orthologous introns is shown up to 1 kb from the SSs in Figure 2. A low percentage of introns with substitutions at a given position implies that the nucleotide at that location has been conserved throughout the evolution of the three species in almost all introns, regardless of what that nucleotide is.

Two general patterns stand out in Figure 2: (i) the 3’ and 5’ ends of introns are approximately symmetrical, except for the ~100 bp closest to the SS; and (ii) after a sharp initial increase, the number of substitutions continues to increase steadily up to 400 bp towards the center of introns, where it stabilizes.

Given that the ends of introns contain sequence motifs essential for splicing, and that these motifs are not equally distributed between the two ends, they could be causing the asymmetry found in Figure 2. We thus compared sequence conservation across human introns due to the presence of the 5’ SS, PPT and 3’ SS sequence motifs (right y-axis in Figure 3, measured in bits of information) with conservation across species (left y-axis in Figure 3, measured as the percentage of introns without substitutions, which is the complement of the percentage of introns with substitutions on the y-axis of Figure 2). Only the human sequence logos are shown in Figure 3, since they are identical to the chimpanzee and macaque logos (and to previously published human logos, e.g. Stephens and Schneider 1992). Thus, the striking relationship between the two measures seen in Figure 3 is actually present in all three species.

Figure 2 Distribution of substitutions in the first and last 1,000 bp of introns. Positions along the intron are given as a distance from the closest SS, either the donor (red) or the acceptor (blue) SS. The inset shows a close-up of the outermost 70 bp of introns; grey was used where the two colors overlapped.

To confirm the second pattern drawn from Figure 2, we compared the number of substitutions observed in consecutive windows along the introns and found that, up to the expected 400 bp from the closest SS, windows tend to have significantly fewer substitutions than the adjacent window closer to the center of the intron (Figure 4, “All” intron class).


Figure 3 Conserved motifs in human intron ends and sequence conservation in the three species (human, chimpanzee and macaque). The total height of each stack of letters corresponds to the amount of information at that position measured in bits (y-axis on the right). Within each stack, letters are sorted so that the most frequent appear on top, and their height within the stack is proportional to their relative frequency. Black dashes mark the percentage (y-axis on the left) of introns with the same nucleotide in the three species (regardless of what the actual base, A, C, T or G, is) in the first ten and last 30 nucleotides of introns.

First introns have a different substitution profile

Because first introns are reported to have more regulatory elements than other introns (Majewski and Ott 2002; Keightley and Gaffney 2003) and have been shown to have different substitution rates from other introns (Gazave et al. 2007), we looked at their substitution profile separately. Contrary to the pattern seen with all introns, in first introns, after the sharp increase within the first 50 bp, the number of substitutions starts dropping until, at around 750 bp from the 5’ SS, it begins to increase slowly (Figure 5, top panel, and Figure 4, “1st” intron class).

These differences in substitution profiles translate into significant differences between the two classes of introns (Figure 6, “1st_x_Rest” series). First introns have on average more substitutions in the first 200 bp, and fewer from that point up to 3.5 kb, although only the first 2.5 kb are significantly different from other introns.


Figure 4 Differences in the number of substitutions in consecutive windows along introns belonging to several classes. The number of substitutions in 500 bp (top panel), 100 bp (middle panel) and 50 bp windows (bottom panel) is compared with that of the subsequent window (left of the dashed vertical line) or the previous window (right of the dashed vertical line). The magnitude of the differences, represented by the odds ratio (see Materials and Methods), is color-coded according to the thresholds indicated in the figure legend. Windows colored in blue have fewer substitutions than the contiguous window they were compared to, and windows colored in orange have more substitutions. Black borders were drawn around windows with significant differences according to Fisher's exact test. Windows that could not be studied (involving short introns) were colored grey, and windows with a mean of fewer than 100 intron alignments were hatched. Grey polygons between panels emphasize the overlap in the x-axes. As in previous plots, distance from the acceptor SS is given in negative values. Intron classes: All, all introns in the study; Long, introns longer than 1,455 bp; Short, introns shorter than 1,456 bp; 1st, the first intron in a gene; 1st_long, first intron in a gene if longer than 1,455 bp; 1st_short, first intron in a gene if shorter than 1,456 bp; 1st_CDR, first intron in a gene if located in the coding region; 1st_5'UTR, first intron in a gene if located in the 5’ UTR; CDR'1st_other, the first intron found in the coding region but not first in the gene; 5'UTR_other, introns in the 5’ UTR other than the first; 4th, the fourth intron in the gene; Last, last intron in the gene; Single, introns from genes with only two exons. Single introns were not included in the first or the last intron classes.

At their 3’ end, first introns are not strikingly different from other introns (Figure 5, top panel, and Figure 6, “1st_x_Rest” series), except for a tendency towards a higher number of substitutions, which is also present, and more evident, in the central part of large (> 8 kb long) first introns.

To check that the profile we see in our ‘first intron’ class is not actually characteristic of coding-region (CDR) first introns (that is, the first intron found after the start codon), which constitute the majority (73%) of our ‘first intron’ class and could thus be driving the pattern, we focused on the first introns found in the CDR and separated them into two groups, depending on whether or not they were also the first intron in the gene. While CDR first introns that are also gene first introns show the same pattern as our ‘first intron’ class, CDR first introns that come after the gene first intron do not (Figure 5, middle panel, and Figure 4, “1st_CDR” and “CDR’1st_other” intron classes), and there are significant differences between these two classes (Figure 6, “CDR’1st--1st_x_Other”).

Figure 5 Distribution of substitutions along the first and last 5 kb of introns. In the top panel, introns were separated into two classes, one with the first introns of genes and the other with the rest of the introns. In the middle panel, only the first introns in the CDR are shown, separated into two classes depending on whether or not they are also the first intron in the gene. In the bottom panel, first introns of genes are separated according to their location, the 5’ UTR or the CDR. Open circles represent the raw data, one circle per bp, to which a LOESS curve was fitted.

Another possibility was that first introns in the 5’ UTR showed a different pattern from those in the CDR, perhaps one common to all introns in the 5’ UTR. As shown in the bottom panel of Figure 5, their substitution pattern is very similar to that of CDR first introns that are also gene first introns, except for the 3’ end, which shows significantly fewer substitutions (Figure 6, “1st--5’UTR_x_CDR”). Moreover, 5’ UTR first introns (which, by definition, are also gene first introns) differ from other introns in the 5’ UTR (Figure 6, “5’UTR--1st_x_Other”).

Short introns evolve faster

As done by other authors (Haddrill et al. 2005; Gazave et al. 2007), we classified introns as short or long according to the median length of all the introns studied. In our dataset, that median was 1,455 bp, which of course differs from the median in other organisms. As when all introns were considered, in both the short and long intron classes there is an increase in the number of substitutions up to 400 bp from each SS (Figure 4, “Short” and “Long”), but when compared with each other, short introns exhibit significantly more substitutions in virtually all comparable windows along their length (Figure 6, “Short_x_Long”).
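The length classification can be sketched as a simple median split. In this hypothetical helper, introns no longer than the median are treated as short and the remainder as long, which is consistent with the class definitions given in the Figure 4 legend (shorter than 1,456 bp versus longer than 1,455 bp).

```python
import statistics


def split_by_median(intron_lengths):
    """Return (median, short, long): lengths <= median and lengths > median."""
    median = statistics.median(intron_lengths)
    short = [n for n in intron_lengths if n <= median]
    long_ = [n for n in intron_lengths if n > median]
    return median, short, long_


# Hypothetical lengths in bp.
median, short, long_ = split_by_median([100, 1455, 3000])
```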

When we divided first introns into long and short based on the same length threshold, we found that the substitution profile of long first introns is essentially the same as that of the whole first intron class, whereas short first introns show no clear pattern (Figure 4, “1st_long” and “1st_short”). Yet, when compared with long first introns, short first introns tend to have more substitutions in the first half of their length and fewer in the second half (Figure 6, “1st--Short_x_Long”).


Figure 6 Differences in the number of substitutions in equivalent windows of distinct classes of introns. Plot annotations are as in Figure 4. Comparisons: 1st_x_Rest, the first introns in a gene compared with introns in other positions along the gene; CDR'1st--1st_x_Other, among the first introns found in the coding region, those that are also the first intron in the gene compared with those that are not (1st_CDR vs CDR'1st_other); 5'UTR--1st_x_Other, among the introns found in the 5’ UTR, those that are the first intron in the gene compared with those that are not (1st_5'UTR vs 5'UTR_other); 1st--5'UTR_x_CDR, first introns in the gene located in the 5’ UTR compared with those located in the CDR (1st_5'UTR vs 1st_CDR); Single_x_Rest, single introns compared with all the other introns; Single_x_1st, single introns vs first introns in the gene; Last_x_NON_1st, last introns in the gene compared with the other introns in the gene except the first; 4th_x_NON_1st, fourth intron in the gene compared with the other introns in the gene except the first; Short_x_Long, introns shorter than 1,456 bp compared with introns longer than 1,455 bp; 1st--Short_x_Long, among the first introns in a gene, those shorter than 1,456 bp vs those longer than 1,455 bp.

Other intron classes

The first kilobases of single introns are more conserved than those of the rest of the introns studied, including first introns (Figure 6, “Single_x_Rest” and “Single_x_1st”). In fact, although few significant differences are found between single and first introns, single introns do not even show the higher number of substitutions in the initial 200 bp that is typical of first introns when compared with other introns.

Last introns do not differ significantly from other non-first introns in the gene, as expected for a random non-first intron such as, for example, the fourth intron in the gene (Figure 6, “Last_x_NON_1st” and “4th_x_NON_1st”). Still, last introns longer than 3 kb seem to accumulate fewer substitutions in their central portion.

DISCUSSION

We looked at intron conservation to find regions where functional elements are more likely to occur, and found signs of evolutionary constraint up to 400 bp from both SSs. This distance is strikingly longer than that found in many previous reports that used different methods (e.g. 200 bp in Majewski and Ott 2002), but is still reasonable according to studies on conserved intronic SREs (Yeo et al. 2007), some of which are found throughout these 400 bp regions.


In these first and last 400 bp, the percentage of introns with substitutions increases gradually with the distance to the closest SS, except for the nucleotides neighboring the SS, where the increase is steep. Looking at intron sequence conservation at single-base-pair resolution, we see that this sharp increase and the high conservation in the first 6 and last 20 bp of an intron are explained by the presence of core splicing motifs that are shared by the three species (Figure 3).

Due to its variable distance from the 3’ SS, the core splicing motif corresponding to the branch site is not apparent in our sequence logos. Nevertheless, there is a clear local decrease in the number of substitutions upstream of, and marginally overlapping, the PPT motif (inset of Figure 2, and Figure 3), which almost perfectly coincides with the reported preferential location of the branch site 18-37 nucleotides upstream of the 3’ SS (Green 1986). Likewise, the several SREs, which are also present at variable distances from the SSs, are expected to increase sequence conservation at their preferred locations.

Accordingly, we interpret the slow increase in the number of substitutions following the core splicing signals as the result of a gradual decrease in the combined frequency of distinct SREs. In fact, both SREs (Majewski and Ott 2002) and intronic sequences disfavoring nucleosome binding (Schwartz et al. 2009) are expected to be more frequent close to the SSs. Two scenarios, which are not mutually exclusive, can explain the observed pattern. If the majority of the motifs decrease in frequency with the distance to the SS, this would produce the gradual decrease in conservation we found. Alternatively, the same result can be obtained if different SREs have frequency peaks at different distances from the SS but there is a negative correlation between the distance to the SS and the number of SREs that peak at that distance.

The 5’ end of first introns is an exception to this 400 bp rule. In the intron closest to

the transcription start site, the first 2.5 kb are significantly more conserved than the

corresponding region in other introns. The fact that these are the intronic regions


closest to the start of transcription immediately suggests a role, not in splicing, but

instead in transcription regulation. In fact, first introns are known to be enriched in

transcriptional regulatory elements, especially in their 5’ end (Majewski and Ott

2002). Thus, according to our data, cis-regulatory elements involved in transcription

are frequent in primate introns up to 2.5 kb from the 5’ SS, a distance similar to that

found by Keightley & Gaffney (2003) comparing rat and mouse.

There is some confusion in the literature on what the term ‘exon’ refers to. The word

was first used to name the regions left after the removal of introns (Gilbert 1978),

but it has since also been used as a synonym of coding sequence (Zhang 2002). The latter usage fails to account for exons in UTRs, with implications for what is called the first intron. According to our data showing that first introns, defined as the 5’-most

intron in the gene, form a class with a distinct substitution pattern, the original

definition of exon makes more sense from a biological point of view.

We classified introns into short or long based on the median intron length. The

conservation up to 400 bp from each SS is present in both classes, suggesting that the

same mechanism is used to recognize short and long introns. At first sight this might

seem unexpected, as short introns are thought to be spliced via an “intron definition”

and long introns via an “exon definition” mechanism (McGuire et al. 2008; Lim &

Burge 2001). However, our threshold length is somewhat artificial and, if there are in fact two such classes of introns in human genes, the threshold is likely to be much lower (less than 134 bp; Lim & Burge 2001). This would mean that the majority of introns in our short intron class actually function as long introns, which would explain the lack of difference between these two classes.

Among first introns, those that belong to the short intron class do not exhibit the

substitution pattern typical of first introns. Since first introns tend to be longer

(Hawkins 1988), it is possible that our short first intron class is enriched with introns

misclassified as first in the gene, despite our efforts to identify genes with incomplete

annotation. Still, true short first introns will not display the substitution pattern


described for all first introns since the extent of conservation at the 5’ end is almost

twice as long as the longest intron in the short class.

Finally, we find that short introns evolve faster than long introns, both within and

outside the extreme-most 400 bp. Besides primates (in the present study), rodents

(Gaffney and Keightley 2006), Drosophila (Haddrill et al. 2005) and rice (Guo et al.

2007) also show higher conservation in longer introns, which seems to indicate that

this is a general trend among eukaryotes. A simple explanation could be that shorter

introns need fewer regulatory motifs to be correctly removed by splicing. Additionally, long introns may harbor a higher number of other regulatory motifs not necessarily related to splicing, such as the multispecies conserved sequence (MCS) elements

found mainly in longer introns by Sironi et al. (2005).

Overall, introns contain a variety of functional elements that constrain their evolution.

Some elements are present in all introns (splicing related) while others are present

only in some – such as transcriptional regulatory elements, present mainly in first

introns, and a great variety of genes for non-coding RNAs, encoded in occasional introns. By pooling introns together, our method mainly detects elements that are shared by many of those introns and that therefore produce general trends of sequence conservation. This

information is useful for defining target regions when studying functional elements

present in introns, but also for selecting intronic regions in studies using introns as

neutrally evolving sequences.

Based on this assumption of neutrality, introns have been used to estimate genetic

distances between species (Castresana 2002), estimate the neutral rate of nucleotide

substitution (Hoffman and Birney 2007) and detect positive selection in exons (Resch et al. 2007; Ke et al. 2008), among other applications. Many of these studies recognized the existence of conserved regions in introns and excluded them from the rest of the analysis. Yet,

according to our study, they greatly underestimated the length of those regions, thus

failing to exclude a large portion of constrained sequence.


CONCLUSIONS

We find that sequence constraints at the 5’ and 3’ ends of introns in primates extend farther than was found in most previous reports, up to 400 bp from each

splice site in most introns and for several kilobases from the donor splice site in first

introns. Knowing the extent of these regions is crucial for studies using introns as

neutrally evolving sequences, since including these regions can lead to wrong

estimates of the neutral mutation rate and generate false positives in tests of

positive selection. Because these regions are also the most likely location of intronic

regulatory sequences, involved, for instance, in splicing and transcription regulation,

our results are also relevant for defining target regions when studying functional

elements present in introns and for interpreting results of association studies when

the phenotype-causing variant is found in introns past the core splicing signals.

ACKNOWLEDGMENTS

OF was supported by a PhD fellowship (SFRH/BD/15856/2005) from the Fundação

para a Ciência e a Tecnologia (Portugal). Financial support was provided by the

Spanish Ministry of Science and Innovation (Grant BFU2009-13409-C02-02 to AN) and

the Spanish National Institute for Bioinformatics (INB, www.inab.org).


REFERENCES

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708-715.

Breathnach R, Benoist C, O’Hare K, Gannon F, Chambon P. 1978. Ovalbumin gene: evidence for a leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proc. Natl. Acad. Sci. U.S.A. 75: 4853-4857.

Castresana J. 2002. Estimation of genetic distances from human and mouse introns. Genome Biology. 3: research0028.1 - research0028.7.

Crooks GE, Hon G, Chandonia J-M, Brenner SE. 2004. WebLogo: a sequence logo generator. Genome Res. 14: 1188-1190.

Fairbrother WG, Yeh R-F, Sharp PA, Burge CB. 2002. Predictive Identification of Exonic Splicing Enhancers in Human Genes. Science. 297: 1007-1013.

Gaffney DJ, Keightley PD. 2006. Genomic selective constraints in murid noncoding DNA. PLoS Genet. 2: e204.

Gazave E, Marqués-Bonet T, Fernando O, Charlesworth B, Navarro A. 2007. Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol. 8: R21.

Gilbert W. 1978. Why genes in pieces? Nature. 271: 501.

Gilson PR, McFadden GI. 1996. The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns. Proc. Natl. Acad. Sci. U.S.A. 93: 7737-7742.

Green MR. 1986. Pre-mRNA splicing. Annu. Rev. Genet. 20: 671-708.

Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al. 2006. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 7: S2.1-31.

Guo X, Wang Y, Keightley P, Fan L. 2007. Patterns of selective constraints in noncoding DNA of rice. BMC Evolutionary Biology. 7: 208.


Haddrill PR, Charlesworth B, Halligan DL, Andolfatto P. 2005. Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content. Genome Biol. 6: R67.

Hare MP, Palumbi SR. 2003. High intron sequence conservation across three mammalian orders suggests functional constraints. Mol. Biol. Evol. 20: 969-978.

Hawkins JD. 1988. A survey on intron and exon lengths. Nucleic Acids Res. 16: 9893-9908.

Hoffman MM, Birney E. 2007. Estimating the neutral rate of nucleotide substitution using introns. Mol. Biol. Evol. 24: 522-531.

Ke S, Zhang XH-F, Chasin LA. 2008. Positive selection acting on splicing motifs reflects compensatory evolution. Genome Res. 18: 533-543.

Keightley PD, Gaffney DJ. 2003. Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc. Natl. Acad. Sci. U.S.A. 100: 13402-13406.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. U.S.A. 98: 11193-11198.

Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.

McGuire AM, Pearson MD, Neafsey DE, Galagan JE. 2008. Cross-kingdom patterns of alternative splicing and splice recognition. Genome Biol. 9: R50.

R Development Core Team. 2009. R: A Language and Environment for Statistical Computing. Vienna, Austria. http://www.R-project.org.

Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. 2007. Widespread positive selection in synonymous sites of mammalian genes. Mol. Biol. Evol. 24: 1821-1831.

Schneider TD, Stephens RM. 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097-6100.

Schwartz S, Meshorer E, Ast G. 2009. Chromatin organization marks exon-intron structure. Nat. Struct. Mol. Biol. 16: 990-995.


Sironi M, Menozzi G, Comi GP, Bresolin N, Cagliani R, Pozzoli U. 2005. Fixation of conserved sequences shapes human intron size and influences transposon-insertion dynamics. Trends Genet. 21: 484-488.

Sorek R, Ast G. 2003. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13: 1631-1637.

Stephens RM, Schneider TD. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol. 228: 1124-1136.

Wang Z, Burge CB. 2008. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA. 14: 802-813.

Wieringa B, Hofer E, Weissmann C. 1984. A minimal intron length but no specific internal sequence is required for splicing the large rabbit beta-globin intron. Cell. 37: 915-925.

Yeo GW, Van Nostrand EL, Liang TY. 2007. Discovery and analysis of evolutionarily conserved intronic splicing regulatory elements. PLoS Genet. 3: e85.

Zhang MQ. 2002. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3: 698-709.


PUBLICATION II

Accelerated evolution in Human introns

Olga Fernando¹,², Arcadi Navarro¹

¹ Institut de Biologia Evolutiva (CSIC-UPF), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Spain.

² Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal.

[In preparation]


The author of the thesis collected the data, performed the analyses and drafted the

manuscript.


ABSTRACT

Non-protein-coding regions of the genome contain the majority of the nucleotides

under selection in mammals and have been proposed to harbor a great part of the

differences that separate humans from other hominoids. Within non-protein-coding

regions, introns contain a variety of functional elements which when disrupted can

have dramatic effects. Many of these functional elements are involved in the

regulation of splicing and gene expression and could thus be responsible for some of

the organismal differences between great apes.

We performed a genome-wide scan for introns with evidence of having evolved

under positive/directional selection in the human lineage (PSIs) by performing a

maximum likelihood test using the models described in Haygood et al. (2007), with

chimpanzee and macaque as the background lineages, and found 86 candidate

introns in 83 genes. Analysis of the distribution of these introns along the gene and

comparisons with the results of an independent study of positive selection on

promoter regions indicate that the functional sequences in these fast-evolving

introns are likely to have a role in the control of transcription and gene expression.

Regulation of alternative splicing on the other hand does not seem to be a major

source of PSIs. Functional analysis of genes containing these introns did not identify

any particular biological process or molecular function of interest, which can happen

if these sequences in the intron are selected by the effect they have on a neighboring

gene instead of the gene where the intron lies.

INTRODUCTION

Perhaps partially because most of the biochemical methods available at the time of

the first evolutionary studies calculating genetic distances were based on comparing


proteins, much of the attention since then has been dedicated to non-synonymous

variation. Yet, as noticed by King and Wilson already in the 1970s (King and Wilson

1975), genetic distances between humans and chimpanzees seemed too small to

account for all the organismal differences observed between these species, which led

them to propose that most of those differences could be due to changes in the

expression of genes rather than in the sequence of the protein.

Current results, based on DNA sequencing techniques and thus not limited to protein

coding regions of the genome, seem to support a smaller role for protein sequence

changes. For instance, top signals in genome-wide association studies of human

diseases and variable traits often occur at DNA sites that do not encode amino acids

(Lomelin et al. 2010) and, although only around 1.2% of the genome encodes proteins, the estimated fraction of constrained nucleotides in mammals is 3 to 6

percent, meaning that the majority of these sites under selection do not encode

amino acids (Koonin and Wolf 2010).

Among non-protein-coding sequences, introns are a likely location for a good portion

of these nucleotides since they harbor a variety of functional elements involved in

critical processes such as splicing and gene expression, both processes highly

regulated in the cell.

Incorrect splicing is estimated to account for at least 15% (Krawczak et al. 1992), considering only changes in canonical splice signals, and up to 50% (Wang and Cooper 2007) of human diseases caused by mutations. This translates into 8% to 27% of human deaths being the result of mutations that affect splicing (Lynch 2010). Most of the core splice site motifs (5’ splice site, branch point sequence, polypyrimidine tract and 3’ splice site) and of the cis-regulatory elements (both enhancers and silencers) that regulate splicing are found in the intronic portions of the transcript.

The effect of introns on gene expression was noticed soon after introns themselves

were discovered and they are now known to affect directly or indirectly, in the act of

their removal, almost every step of mRNA metabolism (Le Hir et al. 2003). Through


their regulation of splicing they can suppress gene expression by introducing

premature termination codons that activate nonsense-mediated mRNA decay, a

process that can be quite common since one-third of alternatively spliced transcripts

is estimated to contain a premature termination codon (Wang and Cooper 2007).

Last but not least, the ENCODE project (Birney et al. 2007) revealed that sequences

involved in regulating transcription, such as transcription factor binding sites, are

symmetrically distributed around transcription start sites and can be found thousands

of base pairs away from the transcription start site. This means that a good portion of

the information we usually associate with promoters is actually present in the first

introns of genes.

Another, more surprising, finding of the ENCODE project was that many of the

experimentally found functional elements are not evolutionarily constrained in

mammals and may serve as a reservoir of elements for natural selection to model in

a lineage-specific way. This would mean that differences between species, some of

which are adaptive, would accumulate in regulatory regions, supporting King and

Wilson’s initial proposal.

In the present study we apply a maximum likelihood test, performed using the Null and Alternative Models described in Haygood et al. (2007) and with chimpanzee and macaque as the background lineages, to compare rates of evolution along the human lineage between an intron and nearby putatively neutral intronic sequences. We search for introns with fast-evolving sites in the human lineage, since these can contain regulatory elements under positive selection that could account for part of the organismal differences between humans and our closest relatives that are not explained by studies focused on protein sequence evolution. We are

encouraged by a similar study done on promoter regions, which found evidence for

positive selection in human promoters of neural- and nutrition-related genes

(Haygood et al. 2007), by recent findings that a considerable portion of fast-evolving

regions is located in introns (Pollard et al. 2006; Kim and Pritchard 2007), and by the


classical example of positive selection in human populations for the ability to digest

lactose into adulthood. This lactase persistence trait, lactase being the enzyme that

breaks down lactose into absorbable sugars, results from the continued expression of

its gene, LCT, which would normally become inactive around the age of 12 (Wooding 2007). The mutations responsible for this phenotype eluded researchers for decades after the mapping of the LCT gene and they were finally found to be located in the

introns of a neighboring gene, MCM6, with unrelated functions (Tishkoff et al. 2007;

Ingram et al. 2009).

MATERIALS AND METHODS

Gene alignments

We downloaded whole genome DNA sequences for human (hg18), chimpanzee

(panTro2) and macaque (rheMac2), and sequence quality scores for chimpanzee and

macaque, from the UCSC Genome Browser (http://genome.ucsc.edu/). Human gene

annotations and one-to-one orthology information were retrieved from Ensembl

(http://www.ensembl.org/) release 48 using BioMart

(http://www.ensembl.org/biomart/).

For all genes with one-to-one orthologs in all three species, and at least one intron

annotated in humans (14,286 genes), the full sequence was extracted from the

corresponding chromosome sequence file in each species. Gene sequences were

then aligned with TBA (Blanchette et al. 2004), after masking all nucleotides with

quality scores of less than 40 (finished sequence standard, comparable to human

(Schmutz et al. 2004)) in the chimpanzee and macaque sequences.

Reference and Test sets

Since our method is based on comparing 'test' introns against carefully selected

neutral intron fragments we used the annotation of all the human genes in Ensembl


release 48 to produce a list of coordinates of central parts of introns, for the

Reference Set (RS), and a list of coordinates of full introns for the Test Set (TS). We

define the central part of an intron as the part that is left after excluding 400 bp from

each end of the intron and, in the case of first introns, after excluding another 3,100

bp (3,500 bp in total) from the 5’ end (see Figure 7), which tend to be more

constrained (Fernando and Navarro Submitted). From the coordinates for the RS we

removed all positions that were annotated as exons or non-central parts of introns in

other transcripts. After discarding duplicated entries, the list for the RS consisted of

non-overlapping genomic coordinates for strict central parts of introns.
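As an illustration only, the definition of the central part of an intron given above can be sketched in Python. The function name, the 0-based half-open coordinate convention and the strand argument are assumptions made for this sketch and are not taken from the original pipeline. The sketch also omits the subsequent step of removing positions annotated as exons or non-central intron parts in other transcripts.

```python
from typing import Optional, Tuple

END_TRIM = 400             # bp excluded from each intron end (from the text)
FIRST_INTRON_EXTRA = 3100  # additional bp excluded from the 5' end of first introns


def central_part(start: int, end: int, strand: str,
                 is_first: bool) -> Optional[Tuple[int, int]]:
    """Return the central part of an intron, or None if nothing is left.

    Coordinates are 0-based, half-open [start, end) on the forward strand
    (an assumed convention, not stated in the text).
    """
    five_prime_trim = END_TRIM + (FIRST_INTRON_EXTRA if is_first else 0)
    three_prime_trim = END_TRIM
    if strand == "+":
        new_start, new_end = start + five_prime_trim, end - three_prime_trim
    elif strand == "-":
        # On the minus strand the 5' end of the intron has the higher coordinate.
        new_start, new_end = start + three_prime_trim, end - five_prime_trim
    else:
        raise ValueError("strand must be '+' or '-'")
    if new_start >= new_end:
        return None
    return new_start, new_end
```

Under these assumptions, an intron shorter than 800 bp (or a first intron shorter than 3,900 bp) contributes nothing to the Reference Set.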

Figure 7 Schematic representation of a portion of the genome. In the upper part of the figure white

boxes represent genes. In the bottom part, a close-up on Gene B, taller boxes represent exons and

shorter ones introns. After removing the intron portions defined in the main text the red intronic

portions remain. These were used to construct the Reference Set.


When several transcripts included the same intron, only one set of coordinates was

kept in the list for the TS, but the information regarding the transcripts containing

that intron was kept. Overlapping introns were kept as long as at least one of the

start or end coordinates was different.

Both lists of coordinates were then filtered to include only coordinates represented

in the gene alignments. To minimize possible false orthologs in the TS, gene

alignments with less than 75% of the CDS aligned in all three species were not used.

In order to avoid false positives, the TS was further filtered to exclude introns

without support from any valid transcript after checking for possible annotation

errors, namely: incorrect splice sites, CDS not multiple of three, lack of the start or

the stop codon, presence of nonsense mutations, or introns smaller than 20 bp⁶.

Each intron left in the TS was extracted from the corresponding gene alignment and

windows of 51 ungapped and unmasked sites with at least 12 differences between

human and chimpanzee or 17 differences between human and macaque were

masked (similar to Haygood et al. 2007). Introns with either more than 0.06% of thus

masked bases, more than 30% gaps, or more than 10% low quality score nucleotides

were excluded, also with the aim to avoid false positives in our results.
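The divergence masking step can be sketched as a sliding window over usable alignment columns. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, gaps are encoded as "-" and previously masked bases as "N", and only the usable columns inside a flagged window are masked. The thresholds (51 sites, 12 and 17 differences) are those given in the text.

```python
def divergence_mask(human: str, chimp: str, macaque: str,
                    window: int = 51,
                    min_hc_diffs: int = 12,
                    min_hm_diffs: int = 17) -> list:
    """Return one boolean per alignment column, True where it should be masked."""
    if not (len(human) == len(chimp) == len(macaque)):
        raise ValueError("aligned sequences must have equal length")
    skip = set("-Nn")  # assumed encoding of gaps and masked bases
    # Columns that are ungapped and unmasked in all three species.
    usable = [i for i in range(len(human))
              if human[i] not in skip
              and chimp[i] not in skip
              and macaque[i] not in skip]
    hc = [int(human[i].upper() != chimp[i].upper()) for i in usable]
    hm = [int(human[i].upper() != macaque[i].upper()) for i in usable]

    masked = [False] * len(human)
    hc_count = sum(hc[:window])
    hm_count = sum(hm[:window])
    for left in range(len(usable) - window + 1):
        if left > 0:
            # Slide the window by one usable column.
            hc_count += hc[left + window - 1] - hc[left - 1]
            hm_count += hm[left + window - 1] - hm[left - 1]
        if hc_count >= min_hc_diffs or hm_count >= min_hm_diffs:
            for k in range(left, left + window):
                masked[usable[k]] = True
    return masked
```

The fraction `sum(masked) / len(masked)` could then be compared with the exclusion threshold for divergence-masked bases described above.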

A reference sequence alignment was constructed for each intron in the TS by

concatenating all segments in the RS within a 100 kb window centered on that intron

excluding all segments overlapping the intron itself.
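The selection of reference segments for one test intron might look like the sketch below. Several details are assumptions rather than statements from the text: the window is taken to be centered on the intron midpoint, a segment is used only if it lies entirely inside the window, and coordinates are 0-based and half-open. The function name is hypothetical.

```python
def reference_segments(intron, rs_segments, window=100_000):
    """Pick Reference Set segments near an intron.

    intron      -- (start, end) of the test intron
    rs_segments -- iterable of (start, end) Reference Set segments
    window      -- total window size in bp, centered on the intron midpoint
    """
    start, end = intron
    mid = (start + end) // 2
    lo, hi = mid - window // 2, mid + window // 2
    selected = []
    for seg_start, seg_end in sorted(rs_segments):
        inside = seg_start >= lo and seg_end <= hi
        overlaps_intron = seg_start < end and seg_end > start
        if inside and not overlaps_intron:
            selected.append((seg_start, seg_end))
    return selected
```

The aligned columns of the selected segments would then be concatenated, in order, to form the reference alignment.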

Finally, all columns in the alignments of both the Reference and Test Sets with gaps

or masked bases were removed, and only introns with alignments longer than 20 bp

and corresponding reference alignment longer than 7,000 bp were analyzed.

⁶ 20 bp is approximately the length of the smallest spliceosomal introns described (Gilson and McFadden 1996) and the minimum sequence length containing essential splicing signals (Wieringa et al. 1984).


Positive selection test

A maximum likelihood test was performed using the Null and Alternative Models

described in Haygood et al. (2007), fitted with HyPhy (Pond et al. 2005) to our introns

in the TS and corresponding reference sequence in the RS. The two models, of single-

nucleotide substitutions, allow for different classes of intron sites, so that the test

can detect positive selection even if it is acting on only a limited number of sites, and

can also distinguish between positive selection and relaxation of negative selection

(accommodated for in the Null Model).

Following the strategy of Haygood et al. (2007), we fitted each model to our data ten

times, starting from random points, to guard against local maxima of the likelihood

function. The likelihood ratio test was done by comparing twice the difference between the best log likelihoods of the two models with a χ² distribution with one degree

of freedom. Additionally, for each intron in the TS, we constructed 100 bootstrap

replicates over the corresponding reference sequence in the RS. For each bootstrap

replicate we fitted the two models ten times and calculated the P value as described

for the original reference sequence. The median of all P values was then chosen as

the representative P value for that intron.
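The arithmetic of the test can be illustrated with a short Python sketch. It assumes that the log likelihoods of the repeated fits are already available (the model fitting itself was done with HyPhy and is not reproduced here), and the function names are hypothetical. For one degree of freedom, the upper tail of the χ² distribution has the closed form erfc(√(x/2)), so no external library is needed.

```python
import math
from statistics import median


def lrt_p_value(null_lnls, alt_lnls):
    """P value of a likelihood ratio test with one degree of freedom.

    Each argument holds the log likelihoods from repeated fits of one
    model; the best (largest) value of each model is used.
    """
    statistic = 2.0 * (max(alt_lnls) - max(null_lnls))
    # Guard against small negative values caused by optimizer noise.
    statistic = max(statistic, 0.0)
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def representative_p_value(bootstrap_fits):
    """Median P value over bootstrap replicates of the reference sequence.

    bootstrap_fits -- iterable of (null_lnls, alt_lnls) pairs, one per replicate
    """
    return median(lrt_p_value(null, alt) for null, alt in bootstrap_fits)
```

Taking the median across replicates, as described above, makes the reported P value less sensitive to any single resampling of the reference sequence.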

To account for multiple testing, false discovery rate (FDR) Q values were calculated

with the qvalue package in R (R Development Core Team 2009) using the bootstrap

method and we considered introns to have significant evidence of positive selection

when Q < 0.05.
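To make the idea of an FDR-adjusted value concrete, the sketch below implements the Benjamini-Hochberg adjustment. Note that this is a deliberately simpler stand-in and not the estimator used in this study: the qvalue package additionally estimates the proportion of true null hypotheses, so its Q values are expected to be equal to or smaller than the values computed here. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values are monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Introns whose adjusted value falls below 0.05 would be retained as candidates under this stand-in procedure.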

Data Analysis and Plotting

Fisher's exact test, Spearman's rank correlation and Mann-Whitney tests were

performed with R (R Development Core Team 2009).


Functional analysis

We used PANTHER’s “Gene Expression Data Analysis” tools (Thomas et al. 2006),

both the binomial statistics tool and the Mann-Whitney U Test tool, and GOstat

(Beissbarth and Speed 2004) and its variant Rank GOstat, to look for statistically over-

and under-represented biological processes, molecular functions, cellular

components and pathways among the genes whose introns were analyzed in this

study. The “Gene Expression Data Analysis” tools use the PANTHER database

(Thomas et al. 2003) while GOstat and its variant use the Gene Ontology (GO)

database (Ashburner et al. 2000) annotations. The multiple testing correction option

was used in all tools.

RESULTS

After applying several filters to control for potential annotation errors and for the

quality of the alignments (see Methods) we were left with 87,631 introns in 17,859

valid transcripts belonging to 8,979 genes, each intron having an associated reference alignment of at least 7,000 bp drawn from less than 50 kb to each side of the intron.

For more than half of the introns the reference alignment contained sequences

coming from at least two different genes.

P values showed a weak correlation with intron length (Spearman's rank correlation

rS = -0.101, two-tailed P << 0.001), but no or very weak correlation with the

percentage of possible indicators of bad sequence or alignment quality, such as gaps,

divergence masked bases or low quality score nucleotides, or with the length of the

reference alignment (rS = -0.076, -0.003, -0.056 and -0.013, two-tailed P <<0.001,

=0.373, <<0.001 and <<0.001, respectively). The frequency of GC and of CG-

susceptible sites (Keightley and Gaffney 2003) also had no correlation with P values

(both rS = -0.007, two-tailed P = 0.043 and 0.049, respectively).


Because genes can have more than one transcript and overlap other genes, some of

the tested introns belong to more than one transcript (or gene) and others overlap to

various degrees. Introns shared between transcripts were tested only once, but

overlapping introns (12,040) were tested for positive selection independently.

Positively selected introns

The likelihood ratio test (LRT) based on the branch-site models described in Haygood

et al. (2007) identified 86 introns with evidence for positive selection in the human

branch (PSIs) after correcting for multiple testing (Q < 0.05; Supplementary Table 1).

These introns are distributed over 83 genes, with three genes containing two PSIs

each.

Since some of the introns tested for positive selection overlap, their results are

expected to be correlated. In fact, considering all 9,549 possible pairs of introns that

overlap, there is a negative correlation between the percentage of overlap and the

absolute difference in HyPhy parameter estimates (such as the transition to

transversion ratio: rS = -0.469, two-tailed P << 0.001), and also the absolute

difference in P values (rS = -0.324, two-tailed P << 0.001) of those introns. Among the

86 PSIs there are two pairs of overlapping introns, each pair belonging to the same

gene. In other words, the PSIs in two of the genes with multiple PSIs overlap.

If for some reason overlapping introns tended to have smaller or larger P values than

non-overlapping introns, our number of PSIs could be overestimated or

underestimated, respectively. This was not the case, as the observed number of overlapping introns with Q < 0.05 was slightly lower than, but not significantly different from, the expected number (Fisher's exact test, two-tailed P = 0.753), nor was there a

significant difference in Q values between the overlapping and non-overlapping sets

(Mann-Whitney test, two-tailed P = 0.338). Repeating the analysis with introns with P

< 0.05 (4,185 high scoring introns, HSIs) we reach the same conclusions (Fisher's

exact test, two-tailed P = 0.241, and Mann-Whitney test, two-tailed P = 0.192).
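The comparison of observed and expected counts can be sketched as a two-tailed Fisher's exact test on a 2 x 2 table (overlapping or not, by Q < 0.05 or not). The implementation below is a minimal standard-library version; the counts are placeholders, and the two-tailed rule used (summing all tables no more probable than the observed one) is a common convention assumed here, not one stated in the text.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = table_probability(a)
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        p
        for p in map(table_probability, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts: rows are overlapping / non-overlapping introns,
# columns are Q < 0.05 / Q >= 0.05.
p_two_tailed = fisher_exact_two_tailed(3, 1, 1, 3)
```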

First introns have lower P and Q values

Several reports indicate that first introns are enriched in functional elements

(Chamary and Hurst 2004 and references therein) and previous results from our

group (see previous chapter) show that first introns have a distinct conservation

profile. With this in mind we tested if there was an enrichment of first introns in PSIs

or HSIs and if P or Q values are different in first introns compared to other introns.

We repeated this analysis with second introns, for comparison purposes, and other

classes of introns of interest, namely, last introns and introns in UTRs. The results are

summarized in Table 1.

Table 1 Distribution of P and Q values by several classes of introns.

                    HSIs (a)                      PSIs (a)
Class (b)   N       OR (c)     ∆ Mean P (d)       OR      ∆ Mean Q (d)
First       6696    1.23 **    -3.39 × 10^-2 **   1.24    -3.64 × 10^-3 **
Second      8435    1.06       -1.15 × 10^-2 **   0.96     7.58 × 10^-4
Last        9119    0.98        1.04 × 10^-3      0.76     4.33 × 10^-4
5’UTR       7412    1.06       -1.23 × 10^-2 **   1.26    -1.07 × 10^-3
3’UTR       3361    0.90        7.88 × 10^-3      0.60     3.27 × 10^-4

(a) Introns with P (HSIs) or Q (PSIs) < 0.05 compared to the remaining introns.
(b) First, second and last introns in the gene and introns in the 5’ or 3’ UTRs compared to introns in other locations in the gene.
(c) Odds ratio. A value larger than one indicates that more HSIs or PSIs were found in that class (b) than expected. Significant Fisher’s exact tests are marked with asterisks.
(d) Difference between the mean P or Q values in that class of introns and the mean of all the other introns not in that class. Significant Mann-Whitney tests are marked with asterisks.
Fisher or Mann-Whitney two-tailed P < 0.05 (*) or < 0.001 (**).
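The two summary statistics defined in the table notes, the odds ratio and the difference in means, can be computed as in the following sketch. The input values are invented, and `class_summary` is a hypothetical helper name introduced only for this illustration.

```python
def class_summary(p_values, in_class, threshold=0.05):
    """Odds ratio and difference in mean P for one intron class.

    p_values: one P (or Q) value per intron.
    in_class: one boolean per intron, True if it belongs to the class.
    Assumes both off-diagonal cells of the 2 x 2 table are non-zero.
    """
    a = b = c = d = 0
    inside, outside = [], []
    for p, member in zip(p_values, in_class):
        below = p < threshold
        if member:
            inside.append(p)
            a += below       # in class, below threshold
            b += not below   # in class, at or above threshold
        else:
            outside.append(p)
            c += below       # other introns, below threshold
            d += not below   # other introns, at or above threshold
    odds_ratio = (a * d) / (b * c)
    delta_mean = sum(inside) / len(inside) - sum(outside) / len(outside)
    return odds_ratio, delta_mean


# Hypothetical data: four first introns followed by six other introns.
toy_p = [0.01, 0.03, 0.20, 0.40, 0.02, 0.30, 0.50, 0.70, 0.90, 0.60]
toy_first = [True] * 4 + [False] * 6
toy_or, toy_delta = class_summary(toy_p, toy_first)
```

An odds ratio above one together with a negative difference in means corresponds to the pattern reported for first introns.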

Contrary to other intron classes, first introns have significantly more HSIs than

expected and lower P and Q values. Second introns and introns in the 5’UTR (the

majority of which are first and second introns in the gene) also have significantly

lower P values.
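The comparison of P value distributions between an intron class and the remaining introns relies on the Mann-Whitney test. A minimal sketch follows; it uses the normal approximation without tie or continuity correction, which is a simplification assumed here and may differ from the software used in the study. The input values are invented.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for sample x, two-tailed P) using a normal approximation.

    Simplification: no tie correction and no continuity correction.
    """
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        members_of_x = sum(1 for k in range(i, j + 1) if pooled[k][1] == 0)
        rank_sum_x += mean_rank * members_of_x
        i = j + 1
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))


# Hypothetical P values for first introns and for all other introns.
first_intron_p = [0.01, 0.02, 0.03, 0.04]
other_intron_p = [0.5, 0.6, 0.7, 0.8, 0.9]
u_stat, p_value = mann_whitney_u(first_intron_p, other_intron_p)
```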

Since the introns being studied can belong to more than one transcript, intron

classification is not always straightforward. The results reported in Table 1 were

obtained by including introns in a given class as long as at least one of its transcripts

supported that classification. We repeated the analysis using a different classification

criterion in which introns were put in a given class only if all the transcripts they

belong to support that decision. The results, in Supplementary Table 2, are very

similar to the ones presented here.
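The two classification criteria can be made concrete with a small sketch. It assumes that each intron is described by its rank and the total number of introns in every transcript that contains it; this data layout and the function name are hypothetical.

```python
def is_in_class(transcript_ranks, intron_class, criterion="any"):
    """Decide whether an intron belongs to a positional class.

    transcript_ranks: list of (rank, total) pairs, one per transcript
        containing the intron, where rank is 1-based from the 5' end.
    criterion: "any" if one supporting transcript is enough,
        "all" if every transcript must support the classification.
    """
    tests = {
        "first": lambda rank, total: rank == 1,
        "second": lambda rank, total: rank == 2,
        "last": lambda rank, total: rank == total,
    }
    supports = [tests[intron_class](rank, total) for rank, total in transcript_ranks]
    return any(supports) if criterion == "any" else all(supports)


# Hypothetical intron that is first in one transcript and second in another.
shared_intron = [(1, 5), (2, 6)]
first_by_any = is_in_class(shared_intron, "first", criterion="any")
first_by_all = is_in_class(shared_intron, "first", criterion="all")
```

Under the permissive criterion this intron counts as both a first and a second intron; under the strict criterion it counts as neither, which is why class sizes shrink in Supplementary Table 2.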

Functional analysis

We used both the PANTHER and GO ontologies to explore the function of the genes

containing PSIs.

In a first approach we used PANTHER’s binomial statistics tool and GOstat to

compare the list of genes with PSIs against the list of the other genes with analyzed

introns. With the PANTHER annotation no term was significantly over- or under-

represented in the group of genes with PSIs after correction for multiple testing.

Using GOstat, 14 biological process terms were significantly overrepresented in the 79

genes with at least one GO annotation out of the 83 genes with PSIs. Eleven of those

terms are parents of two of the other significant terms: “positive regulation of

interleukin-10 biosynthetic process” (GO:0045082) and “T-helper cell differentiation”

(GO:0042093). The remaining significant term is “pyrimidine deoxyribonucleotide

metabolic process” (GO:0009219). However, all the significant immunity related

terms contain only two genes (BCL3 and IRF4) and the remaining significant term is

also due to the presence of only two genes (TYMS and DUT).

We thus tried another common strategy which, for each term with analyzed genes,

tests if there is an enrichment in lower or higher P values relative to the overall P

value distribution and is implemented in both PANTHER’s “Gene Expression Data

Analysis” tools and Rank GOstat. In order to do this, each gene must have a single P

value so, in genes with multiple introns, one needs to choose one P value to

represent the gene.

Our first approach was to choose the lowest P value among the introns in the gene,

which resulted in several significant terms both with PANTHER and GO. One problem

with this approach is that genes with more introns analyzed tend to have lower P

values (the median number of analyzed introns per gene in genes with P < 0.05 is 12,

twice the median in other genes; Mann-Whitney test, two-tailed P << 0.001) and the

two variables are strongly correlated (rS = -0.565, two-tailed P << 0.001). The number

of analyzed introns itself is very strongly correlated with the total number of introns

in the gene (rS = 0.872, two-tailed P << 0.001), so that the genes with P < 0.05 are

more intron-rich (median of 17 introns per gene versus 10 in the other genes; Mann-

Whitney test, two-tailed P << 0.001) and there is also a strong correlation between

the number of introns a gene has and its P value (rS = -0.478, two-tailed P << 0.001).

In an attempt to reduce this bias we multiplied each gene P value by the number of

analyzed introns in the gene. This nearly eliminated the correlation between gene P values and

the number of analyzed introns per gene (rS = -0.036, two-tailed P < 0.001), but genes

with smaller P values still have more introns (median of 10 versus 7 analyzed introns

per gene; Mann-Whitney test, two-tailed P << 0.001).
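The first two ways of assigning a single P value to a gene can be sketched as follows. Capping the scaled value at 1 is an assumption added here so that the result remains a valid probability; the text only mentions the multiplication. Gene names and values are placeholders.

```python
def gene_p_values(intron_p_by_gene):
    """Return {gene: (lowest intron P, lowest P times number of introns)}.

    The second value is capped at 1.0, which is an assumption of this
    sketch rather than something stated in the text.
    """
    summary = {}
    for gene, p_values in intron_p_by_gene.items():
        lowest = min(p_values)
        scaled = min(1.0, lowest * len(p_values))
        summary[gene] = (lowest, scaled)
    return summary


# Hypothetical genes with different numbers of analyzed introns.
toy_genes = {
    "GENE_A": [0.01, 0.4, 0.8],
    "GENE_B": [0.03],
    "GENE_C": [0.6, 0.7],
}
toy_summary = gene_p_values(toy_genes)
```

In this toy case GENE_A has the lowest raw P value only because it has more introns; after scaling it is no longer ahead of the single-intron GENE_B.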

Finally, we corrected the gene P value taking into account the number of analyzed

introns in the gene (N) by sampling N introns, without replacement, 1,000,000 times,

from the total 87,631 introns analyzed, and keeping the smallest of the N sampled P

values. The proportion of times this sampled minimum was smaller than the uncorrected

gene P value was then used as the corrected gene P value, which is no longer associated

(median 8 versus 8, Mann-Whitney test, two-tailed P = 0.382) or correlated (rS =

0.017, two-tailed P = 0.112) with the number of analyzed introns.
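The resampling correction can be sketched as below. It reads the corrected value as the proportion of resampled minima that are at least as small as the observed gene P value, which is an interpretation assumed here. The pool of intron P values is synthetic, and the number of resamples is reduced from the 1,000,000 used in the study to keep the example fast.

```python
import random


def resampled_gene_p(gene_p, n_introns, all_intron_p, n_samples=10000, seed=0):
    """Correct a gene's lowest intron P value for its number of introns.

    Draws n_introns P values without replacement from the pool of all
    intron P values, keeps the smallest, and returns the fraction of
    draws whose minimum is at least as small as gene_p.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sampled_minimum = min(rng.sample(all_intron_p, n_introns))
        if sampled_minimum <= gene_p:
            hits += 1
    return hits / n_samples


# Synthetic pool standing in for the P values of all analyzed introns.
toy_pool = [i / 1000 for i in range(1, 1001)]
# The same lowest P value is less surprising in a gene with ten introns
# than in a gene with a single intron.
corrected_single = resampled_gene_p(0.05, 1, toy_pool)
corrected_ten = resampled_gene_p(0.05, 10, toy_pool)
```

Because the null distribution is built from the observed intron P values themselves, this correction does not assume that intron P values are uniformly distributed.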

With the PANTHER annotation only "Other homeostasis activities" in the "Biological

Process" ontology was marginally significant (P = 0.040) with an enrichment in genes

with lower P values. With the GO annotation "RNA metabolic process" (GO:0016070)

and "regulation of metabolic process" (GO:0019222) in the "Biological process"

ontology showed a marginally significant (P = 0.035) enrichment in genes with higher

P values, and in the "Cellular component" ontology, "intracellular part" (GO:0044424)

and two of its parental terms were enriched for genes with low P values.

Additionally, first introns are enriched in sequences involved in regulating

transcription of their own gene, so elements under positive selection in these introns

are more likely to affect the gene the intron belongs to than elements in introns

farther from the gene’s transcription start site (TSS). We therefore performed a

functional analysis based only on first introns, taking the first intron’s P value as

the gene P value.

Of the 5,271 genes with a first intron analyzed, after correcting for the number of

genes tested by FDR, 11 genes had Q < 0.05. The only significant result with

PANTHER’s binomial statistics tool was an enrichment of genes with Q < 0.05 in the

"De novo pyrimidine deoxyribonucleotide biosynthesis" Pathway, but only 2 of the 11

genes fitted in that category. With GOstat, 65 biological process terms were

significantly enriched in genes with Q < 0.05, including the eleven terms identified

initially using all genes with PSIs. Yet, except for two terms related to nucleotide

metabolic process (GO:0055086 and GO:0009117) which contained the same four

genes, all other significant terms were due to a single gene each. Eight of these 65

terms had also significantly lower P values according to Rank GOstat, but all of them

were due to gene IRF4. Another 15 “Molecular function” GO terms were significantly

enriched in genes with Q < 0.05, all of them again with only one gene, except for

“magnesium ion binding” (GO:0000287) and “pyrophosphatase activity”

(GO:0016462) plus two of its parental terms, with three genes each (two of them

shared by all four terms). In the "Cellular component" ontology we got the same

results as when the resampling-corrected P values were used. With PANTHER’s

Mann-Whitney U Test tool no term was significantly enriched in higher or lower P values.

Overlap with other non-coding regions under positive selection

We compared our results in introns with those obtained by Haygood et al. (2007) for

promoters on the human branch, since elements that regulate transcription can be

found in both types of non-coding sequences. At first sight positive selection seems

to affect introns and promoters independently, as neither the number of genes with

both introns and promoters under positive selection, nor the number of genes which

have HSIs and also P < 0.05 in the promoter study, are significantly different from

those expected if the two are independent (Fisher's exact test, two-tailed P = 1 and 0.839,

respectively). Yet, when we consider only first introns, there are significantly more

genes with P < 0.05 in both studies than expected by chance (odds ratio = 2.607;

Fisher's exact test, two-tailed P < 0.001).
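The gene-level comparison between the two studies amounts to cross-tabulating two gene sets over the genes tested in both. A minimal sketch, with invented gene identifiers and a hypothetical helper name, is:

```python
def overlap_table(tested_genes, high_scoring_introns, high_scoring_promoters):
    """Build the 2 x 2 table and odds ratio for two gene sets.

    Only genes present in tested_genes (tested in both studies) count.
    Returns ((both, intron_only, promoter_only, neither), odds_ratio).
    Assumes intron_only and promoter_only are both non-zero.
    """
    tested = set(tested_genes)
    introns = set(high_scoring_introns) & tested
    promoters = set(high_scoring_promoters) & tested
    both = len(introns & promoters)
    intron_only = len(introns - promoters)
    promoter_only = len(promoters - introns)
    neither = len(tested - introns - promoters)
    odds_ratio = (both * neither) / (intron_only * promoter_only)
    return (both, intron_only, promoter_only, neither), odds_ratio


# Hypothetical gene identifiers.
toy_tested = ["g%d" % i for i in range(1, 11)]
toy_intron_hits = {"g1", "g2", "g3", "g4"}
toy_promoter_hits = {"g1", "g2", "g5"}
toy_counts, toy_odds_ratio = overlap_table(toy_tested, toy_intron_hits, toy_promoter_hits)
```

The four counts are the input to Fisher's exact test; an odds ratio above one indicates more shared genes than expected under independence.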

DISCUSSION

Although, in absolute terms, non-protein-coding regions have more nucleotides in

functional elements compared to protein coding regions, the relative frequency of

these nucleotides is much lower in the former. It is thus not surprising that the

number of PSIs is relatively small considering the number of tested introns. The fact

that the P values were not correlated with the percentage of gaps, low quality, or

divergence masked bases, which could indicate poor sequence or alignment quality,

or with the frequency of GC or CG-susceptible sites, gives us confidence that these

are true PSIs. The weak negative correlation found between P values and intron length

may actually be expected since as more intronic sites are analyzed, more sites under

positive selection may be included. We note though that, due to the presence of

overlapping introns in our test set, our correlation estimates may be inflated.

Since most of our overlapping introns result from alternative splicing, the lack of a

significant difference in the P or Q values between the overlapping and non-

overlapping sets of introns and of the expected and observed numbers of PSI and HSI

in these two sets indicates that introns involved in alternative splicing events are not

contributing disproportionately to the PSI and HSI classes, and thus that regulation of

alternative splicing is not a particular target of positive selection in introns.

Our finding that the 5’-most introns in the gene (first, second and 5’ UTR introns)

have significantly lower P values and first introns in particular have also significantly

lower Q values and more HSIs than other introns indicates instead that these fast

evolving intronic sequences are more likely to be involved in the control of gene

expression, as elements involved in the regulation of gene expression are more

frequent in those introns closer to the transcription start site (Majewski and Ott

2002).

The comparison of the results from this study with those from Haygood et al. (2007)

provided additional compelling evidence for the role of the accelerated elements in

first introns in regulating gene expression. In that other study the authors identified

genes whose promoter region upstream of the TSS showed evidence of positive

selection. Since elements involved in regulating transcription are also found

downstream of the TSS, mainly in the first intron, and positive selection on the

regulation of gene expression may act simultaneously on multiple regulatory

elements of the same gene, one might expect to find a significant overlap of genes

with high scoring (P < 0.05) promoters and first introns in particular, which is exactly

what was found.

In order to determine if our PSIs belonged to genes with particular functions we have

to take into account that functional information in the PANTHER and GO databases is

provided per gene but, by studying the genes’ introns, we are testing genes with

multiple introns several times, such that the more introns a gene has, the more likely

it is to contain a PSI and to have a lower P value. We found that applying a resampling

strategy to correct gene P values effectively removed both the association between

low gene P values and high number of analyzed introns and the correlation between

these two variables.

Only a few PANTHER and GO terms stood out from our functional analysis, mostly

related to nucleotide metabolism and immunity. Yet, they were all supported by a

very small number of genes and, thus, are not reliable. This lack of association

between the selection in introns and the function of the protein coded by the gene

they are in is consistent with previous observations that the evolution of protein

sequences is decoupled from the evolution of non-protein-coding sequences (Resch

et al. 2007). It is possible that the accelerated elements in PSIs act on a neighboring

gene of unrelated function (Kleinjan and van Heyningen 2005), either close to the

gene containing the PSI, such as in the case of introns in MCM6 affecting the

activation of the LCT gene (Tishkoff et al. 2007), or even a distant gene, as in the case

of intron 5 of LMBR1 which contains a long-range regulatory element of the SHH

gene (He et al. 2008).

ACKNOWLEDGMENTS

We thank Ralph Haygood and Olivier Fedrigo for providing their HyPhy Batch Language scripts

and the HyPhy team for teaching OF how to use their software.

OF was supported by a PhD fellowship (SFRH/BD/15856/2005) from the Fundação

para a Ciência e a Tecnologia (Portugal).

Supplementary Table 2 Distribution of P and Q values by several classes of introns.

                    HSIs (a)                      PSIs (a)
Class (b)   N       OR (c)     ∆ Mean P (d)       OR      ∆ Mean Q (d)
First       5763    1.24 **    -3.66 × 10^-2 **   1.06    -3.70 × 10^-3 *
Second      5897    1.05       -1.04 × 10^-2 *    0.86     8.32 × 10^-4
Last        8164    0.98       -6.88 × 10^-4      0.60     5.42 × 10^-4
5’UTR       2838    1.08       -1.72 × 10^-2 **   1.08    -1.80 × 10^-3
3’UTR       388     0.86        3.36 × 10^-2 *    2.65    -1.84 × 10^-3

(a) Introns with P (HSIs) or Q (PSIs) < 0.05 compared to the remaining introns.
(b) First, second and last introns in the gene and introns in the 5’ or 3’ UTRs compared to introns in other locations in the gene.
(c) Odds ratio. A value larger than one indicates that more HSIs or PSIs were found in that class (b) than expected. Significant Fisher’s exact tests are marked with asterisks.
(d) Difference between the mean P or Q values in that class of introns and the mean of all the other introns not in that class. Significant Mann-Whitney tests are marked with asterisks.
Fisher or Mann-Whitney two-tailed P < 0.05 (*) or < 0.001 (**).

REFERENCES

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene Ontology: tool for the unification of biology. Nat. Genet. 25: 25-29.

Beissbarth T, Speed TP. 2004. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 20: 1464-1465.

Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 447: 799-816.

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708-715.

Chamary J-V, Hurst LD. 2004. Similar rates but different modes of sequence evolution in introns and at exonic silent sites in rodents: evidence for selectively driven codon usage. Mol. Biol. Evol. 21: 1014-1023.

Fernando O, Navarro A. Submitted. Intronic mutational constraints in Primates.

Gilson PR, McFadden GI. 1996. The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns. Proc. Natl. Acad. Sci. U.S.A. 93: 7737-7742.

Haygood R, Fedrigo O, Hanson B, Yokoyama K-D, Wray GA. 2007. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat. Genet. 39: 1140-1144.

He F, Wu D-D, Kong Q-P, Zhang Y-P. 2008. Intriguing balancing selection on the intron 5 region of LMBR1 in human population. PLoS ONE. 3: e2948.

Le Hir H, Nott A, Moore MJ. 2003. How introns influence and enhance eukaryotic gene expression. Trends Biochem. Sci. 28: 215-220.

Ingram C, Raga T, Tarekegn A, Browning S, Elamin M, Bekele E, Thomas M, Weale M, Bradman N, Swallow D. 2009. Multiple rare variants as a cause of a common phenotype: several different lactase persistence associated alleles in a single ethnic group. J. Mol. Evol. http://www.ncbi.nlm.nih.gov/pubmed/19937006 (Accessed November 26, 2009).

Keightley PD, Gaffney DJ. 2003. Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc. Natl. Acad. Sci. U.S.A. 100: 13402-13406.

Kim SY, Pritchard JK. 2007. Adaptive evolution of conserved noncoding elements in mammals. PLoS Genet. 3: 1572-1586.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science. 188: 107-116.

Kleinjan DA, van Heyningen V. 2005. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76: 8-32.

Koonin EV, Wolf YI. 2010. Constraints and plasticity in genome and molecular-phenome evolution. Nat. Rev. Genet. 11: 487-498.

Krawczak M, Reiss J, Cooper DN. 1992. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum. Genet. 90: 41-54.

Lomelin D, Jorgenson E, Risch N. 2010. Human genetic variation recognizes functional elements in noncoding sequence. Genome Res. 20: 311-319.

Lynch M. 2010. Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. U.S.A. 107: 961-968.

Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.

Pollard KS, Salama SR, King B, Kern AD, Dreszer T, Katzman S, Siepel A, Pedersen JS, Bejerano G, Baertsch R, et al. 2006. Forces shaping the fastest evolving regions in the human genome. PLoS Genet. 2: e168.

Pond SLK, Frost SDW, Muse SV. 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 21: 676-679.

R Development Core Team. 2009. R: A Language and Environment for Statistical Computing. Vienna, Austria. http://www.R-project.org.

Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. 2007. Widespread positive selection in synonymous sites of mammalian genes. Mol. Biol. Evol. 24: 1821-1831.

Schmutz J, Wheeler J, Grimwood J, Dickson M, Yang J, Caoile C, Bajorek E, Black S, Chan YM, Denys M, et al. 2004. Quality assessment of the human genome sequence. Nature. 429: 365-368.

Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. 2003. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 13: 2129-2141.

Thomas PD, Kejariwal A, Guo N, Mi H, Campbell MJ, Muruganujan A, Lazareva-Ulitsky B. 2006. Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucl. Acids Res. 34: W645-650.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39: 31-40.

Wang G-S, Cooper TA. 2007. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 8: 749-761.

Wieringa B, Hofer E, Weissmann C. 1984. A minimal intron length but no specific internal sequence is required for splicing the large rabbit beta-globin intron. Cell. 37: 915-925.

Wooding SP. 2007. Following the herd. Nat. Genet. 39: 7-8.

General discussion and conclusions

Although at the time of their discovery introns were already expected to have a role

in many cellular functions and even the evolution of genomes (Williamson 1977;

Marx 1978; Gilbert 1978), the three decades that have passed since have confirmed

many of those hypotheses and increased the repertoire of intronic functions beyond

what was initially imagined.

We now know that introns contain a variety of functional elements and even other

genes. Besides the majority of the core splicing signals, introns also contain

regulatory elements essential for splicing and transcription which are expected to

affect the evolution of these sequences by being a target for negative/purifying or

positive/directional selection.

Constraints on the evolution of intronic sequences

Several studies have found that intron nucleotides closer to the splice sites show a

higher degree of conservation, but the reported length of these conserved regions

varies greatly in the literature (Majewski and Ott 2002; Hare and Palumbi 2003; Sorek

and Ast 2003; Kaufmann et al. 2004). Inconsistencies among the different studies are

likely to be the result of differences in the methods used to estimate conservation,

the species studied and the subsets of introns used.

We were interested in determining the length of these constrained regions at the 5’

and 3’ ends of introns in primates because they are the most likely location of

intronic regulatory sequences, and because by defining these regions we also

identify the complementary regions, in the middle of the intron, that are most likely

to be evolving neutrally.

In order to do that, we looked at the frequency of substitutions along

human-chimpanzee-macaque orthologous introns from each splice site and found that

sequence constraints extend farther than was found in most previous

reports, up to 400 bp from each splice site. In the first (5’-most) intron of the gene,

conservation of the 5’ end extends up to several kilobases from the donor splice site,

most likely due to the presence of regulatory elements involved in transcription,

which tend to be located close to the transcription start site.
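The idea of measuring substitution frequency as a function of distance from the splice site can be illustrated with a simplified sketch. It compares two aligned sequences per intron instead of the three-species alignments used in the study, skips gaps and ambiguous bases, and pools positions into windows; the window size, sequences and function name are all assumptions made for illustration.

```python
def substitution_profile(alignments, window=100):
    """Fraction of mismatching sites per distance window.

    alignments: list of (seq_a, seq_b) pairs of equal-length aligned
        strings, each starting at the splice site.
    Returns {window start (distance from splice site): mismatch fraction}.
    Alignment columns with a gap ("-") or an "N" are ignored.
    """
    mismatches, compared = {}, {}
    for seq_a, seq_b in alignments:
        for position, (base_a, base_b) in enumerate(zip(seq_a, seq_b)):
            if base_a in "-N" or base_b in "-N":
                continue
            index = position // window
            compared[index] = compared.get(index, 0) + 1
            if base_a != base_b:
                mismatches[index] = mismatches.get(index, 0) + 1
    return {
        index * window: mismatches.get(index, 0) / compared[index]
        for index in sorted(compared)
    }


# Two toy alignments, pooled in windows of four sites.
toy_alignments = [
    ("ACGTACGT", "ACGTACGA"),
    ("ACGT-CGT", "ACGTTCTT"),
]
toy_profile = substitution_profile(toy_alignments, window=4)
```

A constrained region would show up as a run of windows near the splice site with a lower mismatch fraction than windows farther into the intron.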

The knowledge of the extent of these regions is useful for defining target regions

when studying functional elements present in introns (either computational scans of

over-represented motifs or functional experiments), and also for selecting intronic

regions in studies using introns as neutrally evolving sequences (from which these

more conserved regions should be excluded) such as to estimate genetic distances

between species or to detect positive selection.
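Excluding the constrained ends of an intron to obtain a putatively neutral central region can be sketched as follows. The 400 bp default comes from the text; the half-open coordinate convention, the assumption that the 5’ end has the lower coordinate (forward strand), and the larger flank used for the first-intron example are assumptions of this sketch.

```python
def central_region(intron_start, intron_end, flank_5=400, flank_3=400):
    """Return the (start, end) of the intron with both ends trimmed.

    Coordinates are half-open [start, end) on the forward strand, so the
    5' end is intron_start. Returns None if nothing remains.
    """
    start = intron_start + flank_5
    end = intron_end - flank_3
    if start >= end:
        return None
    return (start, end)


# A 2,000 bp intron keeps its central 1,200 bp.
long_intron = central_region(1000, 3000)
# A 700 bp intron is shorter than the two flanks combined.
short_intron = central_region(1000, 1700)
# Hypothetical larger 5' flank for a first intron (value is a placeholder).
first_intron = central_region(1000, 9000, flank_5=3000)
```

Introns for which nothing remains after trimming would simply contribute no neutral sites.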

Accelerated evolution of intronic sequences

It has been suggested that the majority of changes that separate humans from their

closest relatives lie in regulatory regions rather than in protein coding sequences, and

it is possible that many of these changes are adaptations. Since introns carry so many

regulatory elements involved in several steps of splicing and transcription control,

they are a promising location for these adaptive changes in different lineages.

We performed a genome-wide scan for introns with evidence of having evolved

under positive selection in the human lineage using the central part of introns (after

excluding the constrained regions identified in our previous study) as our neutrally

evolving sequences to which we compare the substitution rates in our test introns.

Traditionally, synonymous sites in protein-coding regions and ancestral repeats have

been used for this purpose, but evidence has been accumulating that selection also

acts on these regions (Lomelin et al. 2010; Hellmann et al. 2003; Hirsh et al. 2005;

Chamary et al. 2006; Imamura et al. 2009; Faulkner and Carninci 2009). Our decision

to use the central portions of introns comes from our observations in the previous

study, from examples of successful use of intronic sequences in independent studies

(Haygood et al. 2007; Parsch et al. 2010; Hoffman and Birney 2007; Resch et al. 2007;

Ke et al. 2008) and from the need to use sequences from the same genomic region as

the sequences being tested to minimize differences in the mutation rate, which can

vary along the genome.

We found evidence for positive selection in 86 human introns mostly belonging to

different genes. Our functional analysis of the genes to which these introns belong

did not yield any biological process or molecular function particularly enriched with

these genes, which might not be an unexpected result if the selected sequences in

these introns act on a neighboring gene of unrelated function, likely as a distant

transcription regulatory element. In fact, there is evidence that many genes require

distant cis-regulatory elements for their correct spatial and temporal expression, and

that these elements can be found up to one megabase from the gene, often

embedded within another gene, generally within its introns, that fulfills a very

different function from the regulated gene (Kleinjan and van Heyningen 2005).

We were still able to infer that transcription regulation is a more likely target of

positive selection in introns than regulation of alternative splicing given that

overlapping introns (which mainly result from alternative splicing events) were not

particularly enriched in PSIs, but introns closer to the TSS (which are enriched for

transcription regulatory elements), especially the first intron, were. The fact that

genes with fast-evolving promoter regions were more likely to also have fast-evolving

first introns further supports the notion that accelerated elements in first introns are

likely regulating gene expression.

REFERENCES

Chamary JV, Parmley JL, Hurst LD. 2006. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat. Rev. Genet. 7: 98-108.

Faulkner GJ, Carninci P. 2009. Altruistic functions for selfish DNA. Cell Cycle. 8: 2895-2900.

Gilbert W. 1978. Why genes in pieces? Nature. 271: 501.

Hare MP, Palumbi SR. 2003. High intron sequence conservation across three mammalian orders suggests functional constraints. Mol. Biol. Evol. 20: 969-978.

Haygood R, Fedrigo O, Hanson B, Yokoyama K-D, Wray GA. 2007. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat. Genet. 39: 1140-1144.

Hellmann I, Zollner S, Enard W, Ebersberger I, Nickel B, Paabo S. 2003. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13: 831-837.

Hirsh AE, Fraser HB, Wall DP. 2005. Adjusting for selection on synonymous sites in estimates of evolutionary distance. Mol. Biol. Evol. 22: 174-177.

Hoffman MM, Birney E. 2007. Estimating the neutral rate of nucleotide substitution using introns. Mol. Biol. Evol. 24: 522-531.

Imamura H, Karro J, Chuang J. 2009. Weak preservation of local neutral substitution rates across mammalian genomes. BMC Evol. Biol. 9: 89.

Kaufmann D, Kenner O, Nurnberg P, Vogel W, Bartelt B. 2004. In NF1, CFTR, PER3, CARS and SYT7, alternatively included exons show higher conservation of surrounding intron sequences than constitutive exons. Eur. J. Hum. Genet. 12: 139-149.

Ke S, Zhang XH-F, Chasin LA. 2008. Positive selection acting on splicing motifs reflects compensatory evolution. Genome Res. 18: 533-543.

Kleinjan DA, van Heyningen V. 2005. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76: 8-32.


Lomelin D, Jorgenson E, Risch N. 2010. Human genetic variation recognizes functional elements in noncoding sequence. Genome Res. 20: 311-319.

Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.

Marx JL. 1978. Gene structure: more surprising developments. Science. 199: 517-518.

Parsch J, Novozhilov S, Saminadin-Peter SS, Wong KM, Andolfatto P. 2010. On the utility of short intron sequences as a reference for the detection of positive and negative selection in Drosophila. Mol. Biol. Evol. 27: 1226-1234.

Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. 2007. Widespread positive selection in synonymous sites of mammalian genes. Mol. Biol. Evol. 24: 1821-1831.

Sorek R, Ast G. 2003. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13: 1631-1637.

Williamson B. 1977. DNA insertions and gene structure. Nature. 270: 295-297.
