UNIVERSIDADE FEDERAL DO ABC
CENTRO DE CIÊNCIAS NATURAIS E HUMANAS
PROGRAMA DE PÓS-GRADUAÇÃO EM EVOLUÇÃO E DIVERSIDADE
Max Mario Fuhlendorf
LOCI IMAGE GENERATOR AND THE EVOLUTION OF TRYPANOSOMATID ECOTIN:
CUSTOMIZED SOFTWARE AS A TOOL FOR EVOLUTIONARY ANALYSIS
Santo André – SP
2018
Dissertation presented to the Programa de Pós-graduação em Evolução e Diversidade of the Universidade Federal do ABC as a requirement for obtaining the degree of Master. Research line: bioinformatics and gene evolution.
Advisor: Márcia Aparecida Sperança
Co-advisor: Arlei Marcili
for Cleide, Max and Josepha
Acknowledgements
I am deeply grateful to my parents, Cleide M. Fuhlendorf and Max Fuhlendorf Jr.
They went far beyond what most parents do for their children,
and no amount of thanks could express my affection.
My advisor, friend and mentor Márcia Sperança taught me to think
like a scientist without ever losing empathy, and she will always be a reference in
my personal and professional life.
The faculty of UFABC is for the most part made up of dynamic,
absurdly capable people who are willing to go out of their way to
foster their students' learning and curiosity. To the dozens of them who
left a mark on me personally and professionally, I offer my sincere gratitude.
Finally, I thank my lab colleagues, friends for all-night
study sessions, existential discussions and hugs when I needed them. Artur, Bianca,
Daubian, Douglas, Stephanie and Thiago: you have a permanent place in my life.
“Slow though the process of selection may be, ... I can see no limit to the amount of change, to the beauty and infinite complexity of the coadaptations between all organic beings, one with another and with their physical conditions of life, which may be effected in the long course of time by nature's power of selection.”
DARWIN, Charles. 1859
“All that you touch You Change.
All that you Change Changes you.
The only lasting truth Is Change.
God Is Change. And hidden within Change
Is surprise, delight, Confusion, pain, Discovery, loss,
Opportunity, and growth. As always, God exists
To shape And to be shaped.”
BUTLER, Octavia. 1993
Resumo
Modern molecular phylogenetics goes beyond simple sequence
analyses. Genomic context is emerging as an increasingly
relevant tool for inferring and understanding evolutionary and functional
relationships between genes. Here we propose a simple online software program for
identifying loci of interest in annotated genomic sequences and generating
publication-ready images. The program was tested with a unique case, a phylogenetic
analysis of a trypanosomatid protein that was horizontally transferred from
bacteria in the group's common ancestor, does not occur in other extant
eukaryotes and is relevant to the parasite's virulence.
Keywords: Genomics; Trypanosomatids; GenBank; Bioinformatics.
Abstract
Modern molecular phylogenetic research goes beyond simply analyzing
sequence data. Genomic context is increasingly relevant as a tool to infer and
understand evolutionary and functional relationships between genes. Here we
propose a simple web-based software program for labelling loci of interest in
annotated genomic sequences and generating ready-for-publication imagery. The
program was tested with a unique use case, a phylogenetic analysis of a
trypanosomatid protein that was laterally transferred from bacteria in the group’s
common ancestor, occurs in no other extant eukaryotes and is very important to
the parasite’s virulence.
Keywords: Genomics; Trypanosomatids; GenBank; Bioinformatics.
List of acronyms
BLAST Basic Local Alignment Search Tool
CDS Coding sequence
CHP Conserved Hypothetical Protein
CSS Cascading Style Sheets
CSS3 Cascading Style Sheets (version 3)
FASTA FAST(ALL)
GenBank National Institutes of Health's genetic sequence database
HTML Hypertext Markup Language
HTML5 Hypertext Markup Language (version 5)
ISP Inhibitor of Serine Peptidase Enzymes
NCBI National Center for Biotechnology Information
PHP PHP: Hypertext Preprocessor
px pixel
RefSeq National Center for Biotechnology Information's Reference Sequence database
SQL Structured Query Language
tBLASTn Translated Nucleotide Basic Local Alignment Search Tool
UCSC University of California Santa Cruz
WAMP Windows Apache MySQL PHP
Contents
Initial remarks
Chapter I — Online database-driven loci image generator
    Introduction
    Methodology
    Algorithm
        input.php
        parser.php
        insert.php
        fasta.php
        image.php
    Results and sample images
    Future improvements and functionalities
    Conclusion
Chapter II — Ecotin in Kinetoplastida
    Introduction
        General aspects of the Trypanosomatida order
        Serine Protease Inhibitors: ecotins and Trypanosomatida ISPs
    Methodology
    Results and discussion
Final remarks
References
Glossary
Initial remarks
Trypanosomatids are wildly successful agents of neglected tropical diseases.
They are neglected in part because of their occurrence, concentrated in
underdeveloped countries, and in part because they are very complex and
specialized organisms that are anything but easy to study.
When it was discovered in the early 2000s that they carry genes for a
protein found in no other eukaryotes, a protein that is essential for the virulence
of a large number of bacterial species, a small group of researchers started looking
into this protein, ecotin, a broad-specificity serine protease inhibitor.
Here we try to improve the knowledge in this area with a novel algorithmic
approach. A software tool developed in-house to help resolve the
phylogeny of a group of trypanosomatid ecotins took on a life of its own to become
a multi-purpose loci analysis tool. Using this new methodology and traditional
phylogenetic tools, we aim to shed a little light on the evolution and functional
importance of ecotins in trypanosomatids.
Chapter I — Online database-driven loci image
generator
Introduction
When studying the evolution of proteins and other genomic features it is
very important to consider not only the specific nucleotide sequences, but also
their relative positions to one another in the chromosomes of living beings. Spatial
relationships can often provide clues to shared evolutionary history, as well as
being indicative of functional relationships between genes and regulatory
sequences of the genome working in tandem.
For this reason, since the early 2000s, a variety of software has been written
with the express purpose of allowing easy visualization and editing of genomic
data, usually organized as stacked linear sequences in an interactive web-based
interface. These so-called genome browsers are extremely useful tools for a variety
of tasks that go well beyond simply visualizing gene loci (SKINNER et al., 2009; STEIN
et al., 2002). In a way, though, the flexibility that is the strength of most genome
browsers is also a hindrance in some specific use cases. When one's goal is simply to
display and compare specific gene loci between different species, the available genome
browsers can become very cumbersome to use. Some are specialized and useful only
in certain contexts, such as the human or Drosophila genome browsers; others are
complex and difficult to use for people without specialized training. Among these
challenges, one factor stands out: with most available software, comparing loci
from different species is not trivial, requiring a robust knowledge of the genome
browser’s settings and configuration procedures. The most used programs, like the
UCSC Genome Browser [https://genome.ucsc.edu/cgi-bin/hgGateway], are heavily
dependent on extensive databases of carefully annotated genomes, which make
them good tools for research on the most studied model species (CASPER et al.,
2018; UCSC GENOME BROWSER GATEWAY, [s.d.]).
For smaller-scale studies in which gene loci are important, especially if the
species in question are less commonly studied, these tools are lacking. In some
cases, when the species being studied don't have extensive and manually
annotated genomes available, the more widely used genome browsers can
become almost useless. It is not uncommon for researchers to manually sketch loci
on a piece of paper, using automatically annotated genomes as a reference. When
their research is concluded, those sketches need to be painstakingly transformed
into informative image files for publication, arduous repetitive work that could
easily be automated.
Here we propose a simple web-based software program that can organize
GenBank formatted genomic data, offering database-driven search of coding
sequences identified by user provided labels. In its current incarnation, the main
function of the software is the creation of beautiful graphical representations of
target genomic loci, easily customizable and ready for publication. It is also able to
create customized amino acid FASTA files with custom labels, ready for alignment
and analysis. The software can store very large amounts of data in a standard
MySQL database, from which multiple views can be very easily created. The use of
a PHP/JavaScript implementation for the server/client development guarantees
flexibility and scalability for future uses. Even in its current beta stage, the software
can be useful for researchers working with genomic loci in non-model organisms,
providing a very easy to use tool for visualization, data storage and retrieval.
Usability is key: the user only needs to know how to copy and paste text, and how
to input data into browser forms.
In the pages below, we present a brief description of the software creation
process, its algorithm and future planned improvements for the release version.
We also present a sample project created to analyze the occurrence of ecotin-like
serine peptidase inhibitors in various species of Leishmania and Trypanosoma
parasites. The current version of this software, though functional, is still in beta
testing, and further interface improvements are ongoing before its general
deployment.
Methodology
Once the lack of proper tools for creating comparative genomic loci
diagrams was identified, the software was conceived as an in-house solution at the
Pathogenic Agents Laboratory at the Federal University of ABC (UFABC). The entire
project was created based on reliable and widely used modern internet
technologies, all either open-source or free-to-use so as to guarantee future-
proofing and flexibility: the code can be reused and modified for different uses and
by other programmers.
Web page structure and design were created using basic HTML5 and CSS,
taking advantage of the new canvas element that permits on-the-fly image creation
(CASCADING STYLE SHEETS, [s.d.]; HTML 5.1 2ND EDITION, [s.d.]; DUCKETT, 2014).
Server-side programming was written in the PHP interpreted language, using
version 5 to maintain compatibility with older web hosting services (PHP:
HYPERTEXT PREPROCESSOR, [s.d.]; WELLING; THOMSON, 2016). Client-side
scripting was done in JavaScript, mostly to manipulate the canvas elements used
for image creation (DUCKETT, 2014; JAVASCRIPT W3SCHOOLS, [s.d.]; WELLING;
THOMSON, 2016). All data was stored and manipulated in MySQL, a simple and
very efficient open source relational database implementation using the
industry-standard SQL database manipulation language (BEAULIEU, 2009; MYSQL, [s.d.];
SQL W3SCHOOLS, [s.d.]; WELLING; THOMSON, 2016).
The code for all the languages above was edited using Notepad++, an
open-source text editor capable of formatting, annotating, validating and doing
code-completion for a variety of programming and design languages, including all those
cited above (NOTEPAD++, [s.d.]). Testing and development were done offline, with
the WampServer all-in-one web development platform that includes an Apache
web server, the PHP interpreter and MySQL database in a single installable file for
Windows operating systems; the acronym Wamp stands for
Windows, Apache, MySQL, PHP (PHILLIPS, 2012; WAMPSERVER, [s.d.]). Online testing
was done by manually transferring the code files to a personal Yahoo Small Business
Web Hosting account that supports PHP and MySQL (YAHOO SMALL BUSINESS WEB
HOSTING, [s.d.]). Both online and offline testing were done on a variety of modern
web browsers compatible with HTML5, CSS3 and JavaScript, with a heavier focus
on the Google Chrome browser because it is the market leader by a large margin,
being the browser of choice for almost two thirds of internet users as of May 2018
(STATCOUNTER GLOBAL STATS - BROWSER USAGE SHARE, [s.d.]).
The software is currently working with no apparent bugs, and is going
through the beta testing stage, in which it is tested by a select group of target users
before full deployment (FINE, 2002). It is important to note that the references given
above in this section are only a small portion of the myriad sources consulted both
online and offline, and should not be considered definitive in any sense. Modern web
software development is a very dynamic enterprise: the platforms and languages
are in a constant state of flux with new features and bug fixes being introduced
daily, and programmers need to consult a wide variety of internet sources including
tutorials, discussion forums, language reference manuals and others on an almost
hourly basis.
Algorithm
The algorithm is spread across a number of different files, as is customary
in web-based PHP applications. Each file behaves like a web page and performs a
specific task, such as data acquisition, data processing, data retrieval and image
generation. In the final version, each of these functional web pages will be
organized and accessible from a single master page that will serve as the main
interface. In the current beta form, the software works in a more sequential
manner: the user pastes the GenBank data in the input.php page; this data is then
treated, formatted and prepared for database storage in the parser.php page,
where the user is also asked to label all the CDS features (coding sequences) of
interest; the insert.php page performs the actual database creation if necessary,
further data checks for consistency, and gives positive confirmation for the user
that the data insertion was successful; the fasta.php page simply creates an amino acid
FASTA-formatted file from labelled sequences from different species, useful for
alignment and further evolutionary studies; and, finally, the image.php page actually
creates the visual genomic loci representation from a list of protein labels provided
by the user, with some customization options available such as a choice of full
chromosome view or a view zoomed into the area of interest contained between
labelled proteins, and a choice of image size in pixels.
An additional file called functions.php is not directly accessible by the user,
as instead of performing a specific task, it contains the procedural functions for all
repetitive tasks used in the other pages, such as database connections, treatment
of warnings and error messages for the user, etc., and will not be explicitly detailed
below. The following sections explain these algorithms in detail, and the entire beta
source code of each of these pages is available as annexes at the end of this
document.
input.php
This page is very simple, containing only a very large input text box and a
submit button. The user only has to paste a complete GenBank record into the text
box and click on the submit button. No programming is employed at this stage, only
HTML and CSS for data input and page layout.
In the final version of the software this page will also include a login system
and a project selection/creation option, to allow for multiple users with more than
one project each.
parser.php
Most of the data manipulation and validation occurs in this page. GenBank
records are very precisely structured (SAMPLE GENBANK RECORD, [s.d.]), a great
feature for computational parsing, but since genomes are such complicated
entities the number of required data validations is large. These are the steps
performed at this stage:
1. The GenBank data input by the user is first split into an array of text
lines, with each line of the original text record assigned to an array
item;
2. The algorithm analyzes each of the array’s entries to determine its
nature, using regular expression text validations. In this first pass two
main areas are identified: the header containing sequence,
organism, chromosome and other assembly data, and the much
larger features section with all the annotated genomic features;
3. The header area of the array is looped through first with crucial
information being extracted, such as organism name, organism
definition and RefSeq id, essential as a unique sequence identifier
(PRUITT; BROWN; MURPHY, 2018);
4. The next step is another loop over the last part of the array, this time
over all the lines containing genomic features as identified in step 2.
Each complete feature is stored as a new text item in a newly created
features array;
5. The features array is looped through, and each item is again
processed using regular expressions to create a structured
representation of the feature in question to replace the text version.
Each feature can have a number of qualifiers, which are details such
as product name and translation for mRNA and CDS features. Since
the list of possible qualifiers for all different types of features is very
large, a separate array is created during this step containing unique
qualifier labels, as this data will be crucial for database creation and
data insertion in the next page;
6. After data processing is complete, the algorithm prints a complete
list of GenBank features in an HTML table with a corresponding text
field attached to the details of each feature, and a visual
representation of its relative position in the nucleotide sequence. In
the text field the user should label the features of interest, as those
will be used by the loci visualization tool later in the page image.php;
7. When all labels are typed by the user, they click on the submit button
and the page submits the header data, the features array and the
qualifiers array for database inclusion, which happens in the next
page, insert.php.
Future versions of this page will have mostly cosmetic refinements to make
it more user friendly.
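The parsing steps above can be sketched as follows. This is an illustrative Python stand-in, not the PHP code of parser.php: the function name `parse_genbank`, the regular expressions and the dictionary keys are all assumptions, based only on the public GenBank flat-file layout (feature keys indented by 5 spaces, qualifiers indented by 21 spaces and starting with `/`). Multi-line locations and multi-line DEFINITION fields are deliberately ignored to keep the sketch short.

```python
import re

# Assumed layout: 5 spaces, a feature key, padding, then the location.
FEATURE_START = re.compile(r"^ {5}(\S+)\s+(\S+)$")
# Assumed layout: 21 spaces, "/name" and an optional "=value".
QUALIFIER = re.compile(r"^ {21}/(\w+)(?:=(.*))?$")
HEADER_FIELD = re.compile(r"^\s*(DEFINITION|VERSION|ORGANISM)\s+(.*)$")


def parse_genbank(text):
    """Return (header, features, qualifier_names) for one GenBank record."""
    lines = text.splitlines()                       # step 1: one item per line
    # Step 2: locate the header area and the features area.
    start = next(i for i, line in enumerate(lines) if line.startswith("FEATURES"))
    end = next((i for i, line in enumerate(lines) if line.startswith("ORIGIN")),
               len(lines))

    header = {}                                     # step 3: header fields
    for line in lines[:start]:
        match = HEADER_FIELD.match(line)
        if match:
            header[match.group(1).lower()] = match.group(2).strip()

    features, qualifier_names = [], set()           # steps 4 and 5
    current, last_name = None, None
    for line in lines[start + 1:end]:
        match = FEATURE_START.match(line)
        if match:
            numbers = [int(n) for n in re.findall(r"\d+", match.group(2))]
            if not numbers:                         # location without coordinates
                current = None
                continue
            current = {"type": match.group(1),
                       "start": min(numbers), "end": max(numbers),
                       "complement": "complement" in match.group(2),
                       "qualifiers": {}}
            features.append(current)
            last_name = None
            continue
        if current is None or not line.strip():
            continue
        match = QUALIFIER.match(line)
        if match:
            last_name = match.group(1)
            current["qualifiers"][last_name] = (match.group(2) or "").strip('"')
            qualifier_names.add(last_name)
        elif last_name is not None:
            # Continuation line: sequences are joined directly, text with a space.
            joiner = "" if last_name == "translation" else " "
            current["qualifiers"][last_name] += joiner + line.strip().strip('"')
    return header, features, qualifier_names
```

The returned set of qualifier names plays the role of the separate qualifiers array described in step 5, which the next page needs in order to decide which database columns must exist.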
insert.php
This page requires almost no user interaction. It validates the data provided
by the parser.php algorithm, verifies the database status and creates tables and columns
as needed, checks for duplicate data and alerts the user, and finally inserts the data
into the database if no errors are found. These are the steps performed at this stage:
1. First the algorithm checks if the database tables have already been
created, creating them if needed: the main table to contain sequence
header data and the features table to contain all genomic features
from the sequence, as well as the user provided labels;
2. Using the list of qualifiers created in step 5 of the parser.php
execution, the algorithm checks for any missing columns in the
features table and creates them;
3. Then the algorithm inserts the header with the sequence details
obtained in the last page into the main table, including RefSeq id; if a
duplication is detected, the algorithm fails with an error message to
the user indicating the reason: no duplicate RefSeq ids are allowed;
4. If the insertion of header data occurs correctly, the structured list of
features is composed into a large single insert SQL statement which
is then passed to the database; if any errors occur the insert is
cancelled and the user alerted;
5. If no errors are detected in step 4, the user is informed that the
database insert was successful.
Future versions of this page will have cosmetic refinements to make it more
user friendly, and probably an option to modify or delete previously inserted data
if the user tries to insert duplicates.
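A minimal sketch of the insertion steps above, under several assumptions: it uses Python's built-in sqlite3 module in place of the MySQL/PHP combination of the actual software, and the table names (`sequences`, `features`), the `q_` column prefix, the dictionary keys and the function name `insert_record` are all hypothetical. It only illustrates the flow: create tables if missing, add columns for new qualifiers, reject a duplicate RefSeq id, and cancel the whole insert on any error.

```python
import re
import sqlite3


def insert_record(conn, header, features, qualifier_names):
    """Store one parsed record; raise ValueError on a duplicate RefSeq id."""
    cur = conn.cursor()
    # Step 1: create the main table and the features table on first use.
    cur.execute("CREATE TABLE IF NOT EXISTS sequences ("
                "refseq_id TEXT PRIMARY KEY, organism TEXT, definition TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS features ("
                "id INTEGER PRIMARY KEY, refseq_id TEXT, type TEXT, "
                "start_bp INTEGER, end_bp INTEGER, label TEXT)")
    # Step 2: add one column per qualifier that has not been seen before.
    existing = {row[1] for row in cur.execute("PRAGMA table_info(features)")}
    for name in sorted(qualifier_names):
        column = "q_" + name
        # Column names cannot be bound as parameters, so they are validated.
        if not re.fullmatch(r"\w+", column):
            raise ValueError("unexpected qualifier name: " + name)
        if column not in existing:
            cur.execute(f"ALTER TABLE features ADD COLUMN {column} TEXT")
    try:
        # Step 3: the header row; the primary key rejects duplicate ids.
        cur.execute("INSERT INTO sequences VALUES (?, ?, ?)",
                    (header["version"], header.get("organism"),
                     header.get("definition")))
        # Step 4: every feature, inside the same transaction as the header.
        for feature in features:
            names = sorted(feature["qualifiers"])
            columns = (["refseq_id", "type", "start_bp", "end_bp", "label"]
                       + ["q_" + n for n in names])
            values = ([header["version"], feature["type"], feature["start"],
                       feature["end"], feature.get("label")]
                      + [feature["qualifiers"][n] for n in names])
            marks = ", ".join("?" * len(columns))
            cur.execute(f"INSERT INTO features ({', '.join(columns)}) "
                        f"VALUES ({marks})", values)
        conn.commit()                               # step 5: success
    except sqlite3.IntegrityError:
        conn.rollback()
        raise ValueError("duplicate RefSeq id: " + header["version"])
    except sqlite3.Error:
        conn.rollback()
        raise
```

The text describes composing all features into one large insert statement; the sketch issues one statement per feature inside a single transaction instead, which gives the same all-or-nothing behavior.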
fasta.php
This algorithm is very simple: it runs a single database query for all features
with a user-specified label and prints the results as text in the FASTA file
format, ready for analysis in other programs.
The user can specify the desired label by adding ?label=xxxx to the HTTP
request string, where xxxx is the label string. For example, to get a FASTA file with
all CDSs labelled ‘CHP1’ one simply writes ‘fasta.php?label=CHP1’ in the browser
address bar.
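The formatting half of this page can be illustrated with a short Python sketch. The function name `to_fasta`, the row layout and the header format (`>label|RefSeq id|organism`) are assumptions made for the example; the text does not specify how fasta.php composes its header lines. Only the general FASTA convention is taken as given: a `>` header line followed by the sequence wrapped to a fixed width.

```python
def to_fasta(rows, width=60):
    """Format (organism, refseq_id, label, translation) rows as FASTA text."""
    entries = []
    for organism, refseq_id, label, translation in rows:
        # Hypothetical header layout; spaces are replaced to keep one token.
        header = f">{label}|{refseq_id}|{organism.replace(' ', '_')}"
        # Wrap the amino acid sequence into fixed-width lines.
        chunks = [translation[i:i + width]
                  for i in range(0, len(translation), width)]
        entries.append("\n".join([header] + chunks))
    return "".join(entry + "\n" for entry in entries)
```

In the software itself the rows would come from the database query for the requested label; here they are simply passed in as tuples.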
image.php
This page contains the most important part of the software as initially
designed: the creation of image files clearly representing loci of labeled sequences
in a number of related species for visual inspection and publication. Up to 18
feature labels can be displayed simultaneously, using highly contrasting colors for
differentiation (TRUBETSKOY, 2017). Its rendered output can be adjusted by a few
parameters given as variables in the HTTP request string, adding a ? character after
image.php in the address bar, followed by these optional variables:
● &width=####px — where #### is any number larger than 500
This will set the image width in pixels, and affects the
resolution and quality of the final output, with larger numbers
resulting in better-looking but larger images. The user can
test different sizes to determine the ideal resolution for their use
case. If this option is absent or set with a value smaller than 500px,
the algorithm defaults to 1900px, a number which is sufficient for
printing the resulting images in portrait orientation with over 300 dpi
of resolution, the minimum standard required by most publications.
● &zoom=# — where # can be either 0 or 1
This will determine whether the algorithm shows the loci in
relation to the entire sequence for each GenBank entry in the
database, the standard behavior when this option is not set or set to
0. When this option is set to 1, each GenBank sequence will be
truncated between the first and last occurrences of the features
custom labelled by the user at step 6 of parser.php, restricted by the
list provided by the &feat option explained below. This is especially
useful in cases where a GenBank record is very large and the locus of
interest is small, as it effectively provides a simple zooming ability.
● &feat=xxx,xxx,xxx — where xxx are the names of features of interest
This is the list of feature labels shown by the image, comma-
separated. It can be no longer than 18 labels; any extra labels will be
ignored due to design constraints. The user may have tagged tens or
hundreds of features with many different labels when inserting the
GenBank record, and showing all the distinct labels at once may not
be useful or even possible inside one image. If the user wishes to
compare different loci in different images, here is where it can be
done, by providing the algorithm with the specific list of labels to be
used for each image, and generating different images. If the user
does not use this optional parameter, the algorithm retrieves the first
18 distinct labels from the database.
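The validation rules for these three parameters can be summarized in a short Python sketch. It is not the PHP code of image.php: the function name `read_options` and the returned dictionary are hypothetical, and treating exactly 500px as valid is an assumption, since the text only says that smaller values fall back to the default.

```python
import re
from urllib.parse import parse_qs

DEFAULT_WIDTH = 1900   # px, the default stated in the text
MIN_WIDTH = 500        # px, smaller or missing values fall back to the default
MAX_LABELS = 18        # any extra labels are ignored


def read_options(query_string):
    """Validate the optional width, zoom and feat request parameters."""
    params = parse_qs(query_string)

    # width: digits with an optional "px" suffix, otherwise the default.
    match = re.fullmatch(r"(\d+)(?:px)?", params.get("width", [""])[0])
    width = int(match.group(1)) if match else DEFAULT_WIDTH
    if width < MIN_WIDTH:
        width = DEFAULT_WIDTH

    # zoom: only the literal value 1 switches the zoomed-in view on.
    zoom = params.get("zoom", ["0"])[0] == "1"

    # feat: comma-separated labels; an empty list means "not provided",
    # in which case the labels would be read from the database instead.
    raw = params.get("feat", [""])[0]
    labels = [label.strip() for label in raw.split(",") if label.strip()]
    return {"width": width, "zoom": zoom, "feat": labels[:MAX_LABELS]}
```

For example, `read_options("width=550px&zoom=1&feat=ecotin,katanin-like")` yields a width of 550, zoom switched on and two labels. As a check on the default, 1900 pixels printed at 300 dpi cover 1900 / 300 ≈ 6.3 inches, roughly 16 cm.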
These are the steps done for image creation:
1. All the different formatting information has to be set up first: using
the pixel width obtained from the user via the optional &width
parameter (or the standard 1900px value if there was no user input),
the program determines all the other relative values, such as font
size, image canvas height, indicator triangles height, etc. It also
verifies if the user has passed optional parameters &feat, &width or
&zoom and validates them;
2. Then the algorithm retrieves all GenBank records from the database
that contain the labels provided by the user with the optional &feat
parameter, or all records if no parameter was passed;
3. If the user passed the optional parameter &zoom=1, the program
finds the lower and upper bounds in each GenBank record
corresponding to the first and last occurrences of labelled features in
the sequence; if no &zoom parameter was passed, the lower and
upper drawing bounds are set to 0 and to the GenBank record size in
base pairs, respectively;
4. With all necessary data prepared, the program creates the HTML
canvas element and iterates through all the feature labels (set by the
user or retrieved from the database), creating color-coded
references at the top of the image for each label. 18 contrasting
colors are preset;
5. At the top left corner of the image a scale bar is printed, an essential
visual feature for comparing GenBank records of different sizes. The
scale is calculated to increase in increments of 1 kbp from 1 to 9 kbp
and in increments of 10 kbp from 10 kbp upward;
6. The printing of the actual genomic data is done in several sub-steps:
a. First, each GenBank record’s species name is printed on the
left panel, with the genus name abbreviated, in bold italics. In
a smaller font the RefSeq id of the sequence is printed below
the species name, ensuring a very compact and easy to read
label that nonetheless is uniquely identified by the RefSeq id;
b. Then the unlabeled features are printed to the right of the
species’ names, coding sequences (CDS) in dark gray and
other types of features in light gray;
c. The labelled features are then printed in color according to
the legend printed at step 4, with slightly taller bars and
triangular indicators on the top or bottom: triangles are
placed at the top or bottom in an alternating fashion, avoiding
excessive clumping in very large sequences with closely
spaced labelled features;
d. Finally, a small rotated number is printed at the start and end
of each feature sequence, indicating the exact base-pair
position of start and end in the original GenBank record.
This algorithm is very close to its final stage, pending only suggestions from
the beta testers and a few refinements to the zoom mechanism.
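The geometric part of steps 3 and 5 can be sketched in Python as follows. All three function names are hypothetical. The rule that the scale bar covers about one tenth of the drawn span is an assumption introduced only to make the example concrete; the text states the increments (1 kbp below 10 kbp, 10 kbp above) but not how the bar length is chosen. The linear mapping from base pairs to pixels is likewise an assumption about how features are placed.

```python
def drawing_bounds(record_length, labelled, zoom):
    """Return (lower, upper) drawing bounds in base pairs for one record.

    labelled is a list of (start, end) pairs of the labelled features.
    """
    if zoom and labelled:
        # Zoomed view: from the first to the last labelled feature.
        return min(s for s, _ in labelled), max(e for _, e in labelled)
    # Full view: the whole GenBank record.
    return 0, record_length


def scale_bar_length(span_bp, divisor=10):
    """Pick a scale bar length in bp using 1 kbp or 10 kbp increments."""
    target = span_bp // divisor        # assumed: about a tenth of the span
    if target < 10_000:
        return max(1_000, target // 1_000 * 1_000)
    return target // 10_000 * 10_000


def to_pixel(position, lower, upper, left_px, right_px):
    """Map a base-pair position linearly onto the horizontal drawing area."""
    if upper == lower:
        return float(left_px)
    return left_px + (position - lower) / (upper - lower) * (right_px - left_px)
```

With these definitions a record whose only labelled feature spans positions 100 to 200 gets zoomed bounds of (100, 200), so that feature fills the whole drawing area, which matches the behavior described for labels that occur only once.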
Results and sample images
We present below several examples of loci images generated by
image.php, with comments on specific use cases and usage. The output of
fasta.php is a simple standard FASTA file with no new features, so it suffices to say
that the algorithm is producing the expected output and no samples are shown in
these results.
Three species with fully annotated genomes were used for this
demonstration, Leishmania major [RefSeq NC_007256.2], L. mexicana [RefSeq
NC_018319.1], and L. donovani [RefSeq NC_018242.1] (DOWNING et al., 2011;
ROGERS et al., 2011). Four classes of proteins were selected and labelled using the
software as described in the previous section: ecotin-like serine peptidase
inhibitors labelled “ecotin”, a “katanin-like” microtubule-severing protein, a
ribonucleoprotein labelled “ribonucleoproteinP18” and a membrane associated
protein, which was purposefully labelled with the very long name “tb-292
membrane associated protein-like protein” for display purposes. All images below
were generated from the same dataset.
The first image, fig. 1 below, is an example of an export done
without any parameters given by the user, with the default values: no zoom,
1900px image and all the feature labels displayed. Dark gray bars represent coding
sequences (CDS), light gray bars represent non-CDS features, and colored bars are
labelled CDSs. Note how the algorithm gracefully handles excessively long feature
labels by compressing them laterally.
Figure 1. Sample loci visualization generated without any parameters passed
Figures 2 and 3 represent the exact same data (with full-sequence and
zoomed-in versions displayed together), but with the &width parameter passed
with values 550px and 4400px respectively, to demonstrate that the relative sizes
of fonts and most other image components do not change with the provided pixel
width: they are stable because they are all set as proportions of the user-provided
pixel width. The overblown borders in colored components vary in size more
abruptly because of rounding errors in their border widths, and this is not currently
fixable due to limitations in the HTML canvas element used for rendering. The
difference in quality due to resolution is quite visible, making clear that sub-1000px
resolutions should not be used except for extremely simple genomes:
Figure 2. Sample loci visualization - 550px wide
Figure 3. Sample loci visualization - 4400px wide
Figures 4 to 7 represent the same data, but with the user-specified &feat
parameter selecting a single label in each image pair (all shown with and without
the &zoom parameter). Note how the first color is always red, for maximum
contrast:
Figure 4. Sample loci visualization - “ecotin” label only
It is important to note that for coding sequences appearing only once in
each GenBank record, such as the katanin-like protein shown in fig. 5, it makes no
sense to use the zoomed-in view as it will simply show a large bar for each species
and the scale bar will be compressed into a single vertical bar.
Figure 5. Sample loci visualization - “katanin-like” label only
Likewise, for very closely spaced features like those in fig. 6, the zoomed-in
view might be too restrictive in this version of the software, as the zoom level is
not user-controllable.
Figure 6. Sample loci visualization - “ribonucleoproteinP18” label only
Fig. 7 further illustrates that only GenBank sequences that have a labelled
feature will be included in the image, so as to avoid duplications: only L. mexicana
has a feature with the label “tb-292 membrane associated protein-like protein”, so
it’s the only GenBank record shown.
Figure 7. Sample loci visualization - “tb-292 membrane associated protein-like protein” label only
Figures 8, 9 and 10 represent combinations of 2 and 3 different labels to
illustrate the potential variety of visual representations that can be generated from
a single dataset, once it is imported into the system and labelled. They also make
clear that the legend colors depend only on the label order, so the first label in the
user-passed list will be red, the second one will be green, the third yellow, the
fourth blue and so on.
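The order-based color rule can be illustrated with a few lines of Python. The function name `assign_colors` is hypothetical, and the palette below contains only the four colors named in the text; the remaining preset colors are not listed there, so they are left out rather than guessed.

```python
# Only the first four legend colors are named in the text.
PALETTE = ["red", "green", "yellow", "blue"]


def assign_colors(labels, palette=PALETTE):
    """Map each label to a color by its position; extra labels get none."""
    return dict(zip(labels, palette))
```

Because the mapping depends only on position, passing the same labels in a different order changes their colors: `assign_colors(["ecotin", "katanin-like"])` makes “ecotin” red, while `assign_colors(["katanin-like", "ecotin"])` makes it green.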
Figure 8. Sample loci visualization - “tb-292 membrane associated protein-like protein” and “ecotin” labels
Figure 9. Sample loci visualization - “ecotin”, “tb-292 membrane associated protein-like protein” and “ribonucleoprotein P18” labels
Figure 10 is a useful illustration of the care needed when using the zoomed-
in view: since one of the labels only occurs once in L. major, it is shown with no
genomic context.
Figure 10. Sample loci visualization - “tb-292 membrane associated protein-like protein” and “ribonucleoprotein P18” labels
Future improvements and functionalities
Although very useful as it is, this software has ample room for
improvement and for the creation of new functionalities. Below we present a short list
of some of the upgrades planned for the near future.
● Writing a module for labeling and modifying labels of GenBank
sequences already imported into the database, allowing users to first
ingest a large amount of data and then work on labeling the areas of
interest inside the software itself;
● Writing a module that allows labels to be sorted into classes,
facilitating multiple analyses of the same data;
● Making various alterations to the image.php module, to make it
interactive, with real-time zooming and the addition of feature detail
mouse-overs: when the user moves the mouse over the features in
the image, a pop-up pane will display the details of the feature;
● Integration with various online APIs (Application Programming
Interfaces, tools that facilitate data exchange between different
systems), allowing access to tools such as sequence alignment and
Blast searches.
Conclusion
The original problem that motivated this project was successfully solved: the
loci viewer is working exactly as planned, and it is proving very useful for the
intended use case. It is still in beta and with an incomplete user interface, but as
these issues are being resolved, it can grow to become a more mature package,
with multiple modules feeding off the same database. The use of free and open-
source components guarantees that it has the flexibility to be altered and
expanded not only by its original designer, but by anyone else with programming
skills.
This still unnamed software project is a good example of the kind of
interdisciplinary effort that should make up the bulk of modern science but
unfortunately still doesn’t. Computer software is absolutely essential for modern
genomics, but very few biologists are well trained in its inner workings. Software
that works as a black box is dangerous in science: if the user does not understand
the algorithmic process well, he or she can get beautiful results that are completely
disconnected from reality. Small in-house projects like this one are a way to both get
programming experience and foster interdisciplinarity in biology labs.
Chapter II — Ecotin in Kinetoplastida
Introduction
General aspects of the Trypanosomatida order
Kinetoplastida are a class of unicellular eukaryotes characterized by the
presence of the kinetoplast, a feature formed by circular concatenated DNA
molecules (kDNA) inside a solitary mitochondrion. Trypanosomatida is an order of
kinetoplastids composed of obligatory parasites of invertebrates, plants and all
classes of vertebrates (BORGHESAN et al., 2013; MARCILI, 2008). This order
includes genera that are exclusive arthropod parasites, such as Crithidia
(CAMARGO et al., 1992) and Herpetomonas (BORGHESAN et al., 2013), and also
genera with heteroxenous life cycles (involving more than one host), such as
Leishmania and Trypanosoma. These two genera are of great medical importance
as they contain species that are etiological agents for serious human diseases, such
as sleeping sickness, Chagas disease, visceral and cutaneous leishmaniasis. Some
Trypanosomatida genera have obligate endosymbiont bacteria, having been used
as model organisms in studies on unicellular symbiosis (ALVES et al., 2013; MOTTA
et al., 2010).
Species in the Leishmania genus are flagellated protozoa with heteroxenous
life cycles. Their life cycle has two main phases, each with morphological and
physiological characteristics adapted to life inside invertebrate and vertebrate
hosts. Promastigotes are flagellated and motile, adapted to extracellular survival in
the digestive tract of vector insects, while amastigotes are unable to move,
adapted to intracellular survival inside vertebrate macrophages (GRIMALDI; TESH,
1993; REY, 2008). Leishmania insect vectors belong to the order Diptera, the
common flies (REY, 2008).
Species in the Trypanosoma genus, also heteroxenous, usually have more
complex life cycles (RAMOS et al., 2005). In trypanosomatids belonging to the
Stercoraria section, such as Trypanosoma cruzi, the metacyclic trypomastigote
infective form penetrates the skin or skin lesion of the vertebrate host after being
expelled with the feces of the insect. In members of the Salivaria section, such as
Trypanosoma brucei, the metacyclic trypomastigote is injected directly from the
salivary gland of the insect host into the bloodstream of the vertebrate host. Once
inside the vertebrate host the infective metacyclic trypomastigotes transform into
bloodstream trypomastigotes which invade cells of various tissues and, in the case
of T. cruzi, transform into amastigotes, static forms that multiply inside the cells by
binary fission. After a number of division cycles, T. cruzi amastigotes transform into
bloodstream trypomastigotes and are released into the circulatory system,
infecting other cells in the body. T. brucei trypomastigotes can multiply directly by
binary fission. When a hematophagous insect vector feeds on the blood of an
infected vertebrate, it consumes bloodstream trypomastigote forms, which
differentiate inside the insect into procyclic trypomastigotes (in some species) and
then into epimastigotes, which are capable of multiplying by binary fission. These
life cycle stages of the parasites populate the posterior digestive tract (in section
Stercoraria) or the salivary gland (in section Salivaria) of the host. The cycle is
closed with the transformation of epimastigote forms into metacyclic
trypomastigote forms that infect a new vertebrate host (HOARE, 1972; REY, 2008).
Species in this genus can have different arthropod vectors: the Chagas disease agent
T. cruzi and all other stercorarian trypanosomes are transmitted by hemipteran
insects in the Reduviidae family, while the sleeping sickness agent T. brucei and all
other salivarians are transmitted by dipteran vectors (REY, 2008).
The phylogeny of trypanosomatids is constructed based mostly on
molecular studies, because the external morphology of these unicellular organisms
is often indistinguishable between species (STEVENS et al., 1999; STEVENS; NOYES;
GIBSON, 1998). The most common molecular marker used for barcoding
trypanosomatids is the V7V8 region of the 18S rDNA, which is widely used in the
description of inter- and intraspecific relations in this clade (CORTEZ et al., 2006; DA
SILVA et al., 2004; FERREIRA et al., 2007; LIMA et al., 2012; MARCILI et al., 2009;
RODRIGUES et al., 2006; TEIXEIRA et al., 2011; VIOLA et al., 2009). Other genes used
in trypanosomatid evolutionary studies are the gGAPDH gene for the
glyceraldehyde-3-phosphate dehydrogenase protein (HAMILTON et al., 2004;
HAMILTON; GIBSON; STEVENS, 2007) and the CATL gene encoding the cathepsin
enzyme (PÉREZ, 2012).
Recent studies indicate that the heteroxenous life cycle in trypanosomatids
may have an evolutionary history beginning before the start of the Tertiary period,
but the overall Kinetoplastida phylogeny is still filled with uncertainty, even more
so among trypanosomatids (VOTÝPKA et al., 10/2015). The existence of bacterial
endosymbionts in the group is of marked interest to researchers, possibly being
related to the transition from free to parasitic life cycles or being involved in lateral
gene transfers between bacteria and eukaryotes (ALVES et al., 2013).
Trypanosomatids have polycistronic DNA transcription which tends to keep coding
sequences conserved in contiguous groups, resulting in multiple gene loci being
preserved between different species (JACKSON, 02/2015).
Serine Protease Inhibitors: ecotins and Trypanosomatida ISPs
Ecotins are serine protease inhibitors initially described in E. coli bacteria
and named for their capacity to inhibit the digestive enzyme trypsin: E. coli trypsin
inhibitor (CHUNG et al., 1983). The E. coli ecotin has a molecular weight of 18 kDa
and is expressed in the cellular periplasm with a homodimeric active form. It
inhibits serine proteases of family S1A including trypsin, chymotrypsin, neutrophil
elastase and cathepsin G (CHUNG et al., 1983; MCGRATH; GILLMOR; FLETTERICK,
1995; YANG et al., 1998). Ecotin activity protects cells against exogenous serine
peptidases involved in various biological processes, including coagulation and
fibrinolysis; this capacity for inhibiting a considerable number of different proteins
differentiates ecotin from most other serine protease inhibitors, which generally
are highly specific (EGGERS et al., 5/2001, 2004; LAUWEREYS; LAMBEIR, 1996).
Trypanosomatids are the only eukaryotes with genes coding for ecotin
analogs, described for the first time in 2005 by Ivens et al. in L. major, with three
variants that were named ISP1, ISP2 and ISP3 (IVENS et al., 2005). In L. major the
ISP1 and ISP2 ecotins have molecular weights of 16.5 and 17.5 kDa respectively, and while structurally
similar to the E. coli ecotin, their amino acid sequence identity is only 36%
(MORRISON et al., 2012), and they have different patterns of expression and
inhibitory activity in the various stages of the parasite life cycle (ESCHENLAUER et
al., 2009). In L. major the ISP1 variant is expressed in larger quantities in the life
cycle forms living in the insect host, and knockout studies with this gene suggest
that it has endogenous functions, mainly in the flagellar formation process
(MORRISON et al., 2012). Also in L. major, ISP2 is expressed in all life cycle
stages, and there is evidence that this inhibitor participates in the infection of
host macrophages by inhibiting vertebrate serine proteases such as
neutrophil elastase; there is also evidence that Leishmania parasites with
knocked down ISP2 undergo more intense phagocytosis by host macrophages
(ESCHENLAUER et al., 2009; FARIA et al., 2011; RIBEIRO-GOMES et al., 2007). E. coli
and other bacteria that have periplasmic ecotin use it to evade hosts’ immune
systems, and L. major employs its ISP2 inhibitor in a similar fashion (EGGERS et al.,
2004; IRELAND et al., 02/2014). The lack of genes coding for ecotin target enzymes
(the S1A family of serine proteases) in both E. coli and L. major is a strong indicator
of the probable role of ecotins in these species’ interactions with vertebrate hosts
(ESCHENLAUER et al., 2009).
An ISP2 homolog has been found in T. cruzi with a high degree of sequence
similarity to the L. major gene (EL-SAYED et al., 2005). BLAST searches in the NCBI
GenBank database reveal that other members of the genus Trypanosoma also
possess ISP2 homologues, as do close relatives in the order Trypanosomatida
such as Leptomonas spp., although most papers published on the subject have focused on
Leishmania ISPs. It is probable that, due to both their conservation in various species
of trypanosomatids and their flexible functional properties, ecotin homologs have
offered some fitness gain to trypanosomatids with vertebrate hosts. Also, the
conservation of ISPs in various species indicates an origin in the common ancestor
of Trypanosomatida. The similarity between trypanosomatid ISPs and bacterial
ecotins leads us to raise the hypothesis of a lateral gene transfer between E. coli-like
bacteria and the common ancestor of the various Trypanosomatida genera as the
origin of ISPs (ESCHENLAUER et al., 2009; OPPERDOES; MICHELS, 2007). Recent
research suggests that this kind of lateral gene transfer has been essential in this
group’s evolutionary history (ALVES et al., 2013). The bacterial endosymbionts in
Kinetoplastida are in class Betaproteobacteria (ANDRADE et al., 2011); this group
contains vertebrate infecting species that not only possess ecotin encoding genes,
but that depend on those ecotins being expressed to maintain their virulence
(IRELAND et al., 02/2014), which may be another hint of ancestral lateral gene
transfers between Betaproteobacteria and Trypanosomatida.
Leishmania and Trypanosoma parasites are responsible for a number of
severe neglected tropical diseases, as officially listed by the World Health
Organization (MOLYNEUX, 2013; NEGLECTED TROPICAL DISEASES, 2015). Multiple
sources indicate that ecotin and its homologs are connected to these parasites’
infective capacity, but research on this specific subject is still scarce, especially in the
Trypanosoma genus: neglected tropical diseases like Chagas disease (caused by T.
cruzi) and African sleeping sickness (caused by T. brucei) are neglected for socio-
historical reasons, as these afflictive diseases rarely, if ever, occur in developed
countries.
Studying the evolution of the genes encoding trypanosomatid ISPs can
shed light not only on the group’s evolutionary history, but also on the overall
importance of this inhibitor for future researchers. In the next few pages we show
evidence for a common ancestry of ISPs in extant trypanosomatids using both
phylogenetic inferences and a novel method for gene loci analysis.
Methodology
Amino acid sequences of ISPs were obtained from the NCBI database
(National Center for Biotechnology Information, U.S. National Library of Medicine)
with its BLAST search package, using E. coli ecotin as the BLAST query (BORATYN
et al., 2013). Sequence alignment with the MUSCLE algorithm and maximum
likelihood phylogenetic analysis were done in SeaView (GOUY; GUINDON; GASCUEL,
2010), with selection of best-fit amino acid substitution matrices done with
ProtTest 3 (DARRIBA et al., 2011). Resulting tree files were manually edited to
standardize terminal labels, and cladogram image files were exported using the iTOL
web tree tool (LETUNIC; BORK, 2011).
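The terminal labels in this work were standardized by hand. For larger trees the same step could be scripted, as in the Python sketch below. It is not the procedure used here, and it assumes that the tree is stored in Newick format with unquoted leaf names and no whitespace around them. The tree, the mapping and the function name are placeholders.

```python
import re
from typing import Dict

# A leaf name starts right after "(" or "," and runs up to ":", "," or ")".
LEAF = re.compile(r"(?<=[(,])([^(),:;]+)(?=[:,)])")


def rename_leaves(newick: str, mapping: Dict[str, str]) -> str:
    """Replace terminal labels found in mapping; leave all others untouched."""
    return LEAF.sub(lambda m: mapping.get(m.group(1), m.group(1)), newick)


# Placeholder tree and names, not taken from the thesis data.
tree = "((seq1:0.1,seq2:0.2)0.9:0.05,seq3:0.3);"
names = {"seq1": "L_major_ISP1", "seq2": "T_cruzi_ISP2", "seq3": "E_coli_ecotin"}
print(rename_leaves(tree, names))
```

Branch lengths and internal support values are left untouched because they follow ":" or ")" rather than "(" or ",". Quoted labels would need a proper Newick parser.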
For the loci viewer project database input, complete genomic sequences for
loci analysis were obtained from the same database using tBLASTn, a tool that
takes amino acid sequences as input and searches for corresponding nucleotide
sequences, with the search limited to RefSeq annotated genomes (BORATYN et al.,
2013). Data input and manipulation were done as described in the algorithm section
in Chapter I. The complete list of sequences, with their respective web links and NCBI
GenBank IDs, is given in Table 1. After data input was completed, the database was
manually queried with SQL to identify genes close to the left or right
of the ecotin homologs in various species, aided by the generated visual map,
which showed that ISPs occurred in two different loci around fifty kbp apart in most
species. One of these genes is a putative katanin-encoding gene, and the other
three are conserved hypothetical protein-encoding genes that were called CHP1, 2
and 3. Using the L. braziliensis genome as a reference, these four amino acid
sequences were run through the NCBI tBLASTn tool with the same settings used
for the ecotin homologs, and the resulting CDSs were manually labelled CHP1, CHP2,
CHP3 and katanin-like in the loci viewer database.
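A query of the kind used to find neighbouring genes might look like the sketch below. It uses Python's built-in sqlite3 module and a one-table schema invented for this illustration (`feature`, with `record`, `label`, `start_bp` and `end_bp` columns). The actual loci viewer schema is the one described in Chapter I, and the rows are toy data.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical one-table schema and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature (record TEXT, label TEXT, "
        "start_bp INTEGER, end_bp INTEGER)"
    )
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [
            ("record_A", "gene_left", 100, 900),
            ("record_A", "ecotin", 1200, 1700),
            ("record_A", "gene_right", 2000, 2600),
            ("record_B", "ecotin", 500, 1000),
        ],
    )
    return conn


def flanking(conn, record, start_bp, end_bp):
    """Nearest feature ending before start_bp and nearest starting after end_bp."""
    left = conn.execute(
        "SELECT label, start_bp, end_bp FROM feature "
        "WHERE record = ? AND end_bp < ? ORDER BY end_bp DESC LIMIT 1",
        (record, start_bp),
    ).fetchone()
    right = conn.execute(
        "SELECT label, start_bp, end_bp FROM feature "
        "WHERE record = ? AND start_bp > ? ORDER BY start_bp ASC LIMIT 1",
        (record, end_bp),
    ).fetchone()
    return left, right


conn = demo_db()
for record, start_bp, end_bp in conn.execute(
    "SELECT record, start_bp, end_bp FROM feature WHERE label = 'ecotin'"
).fetchall():
    print(record, flanking(conn, record, start_bp, end_bp))
```

A record with no annotated neighbours, such as `record_B` in the toy data, returns `None` on both sides. That mirrors the situation of fragmentary contigs, where flanking genes may simply be absent from the record.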
Table 1 - List of RefSeq genomic records inserted into the loci image generator database
Organism and strain                       RefSeq accession  URL
Leishmania braziliensis MHOM/BR/75/M2904  NC_009307.2       https://www.ncbi.nlm.nih.gov/nuccore/NC_009307.2
Leishmania braziliensis MHOM/BR/75/M2904  NC_018242.1       https://www.ncbi.nlm.nih.gov/nuccore/NC_018242.1
Leishmania infantum JPCM5                 NC_009399.2       https://www.ncbi.nlm.nih.gov/nuccore/NC_009399.2
Leishmania major strain Friedlin          NC_007256.2       https://www.ncbi.nlm.nih.gov/nuccore/NC_007256.2
Leishmania mexicana MHOM/GT/2001/U1103    NC_018319.1       https://www.ncbi.nlm.nih.gov/nuccore/NC_018319.1
Leishmania panamensis                     NC_025860.1       https://www.ncbi.nlm.nih.gov/nuccore/NC_025860.1
Leptomonas pyrrhocoris                    NW_015438382.1    https://www.ncbi.nlm.nih.gov/nuccore/NW_015438382.1
Leptomonas pyrrhocoris                    NW_015438394.1    https://www.ncbi.nlm.nih.gov/nuccore/NW_015438394.1
Trypanosoma brucei brucei TREU927         NC_007278.1       https://www.ncbi.nlm.nih.gov/nuccore/NC_007278.1
Trypanosoma brucei gambiense DAL972       NC_026738.1       https://www.ncbi.nlm.nih.gov/nuccore/NC_026738.1
Trypanosoma cruzi strain CL Brener        NW_001849489.1    https://www.ncbi.nlm.nih.gov/nuccore/NW_001849489.1
Trypanosoma grayi                         NW_008825978.1    https://www.ncbi.nlm.nih.gov/nuccore/NW_008825978.1
Trypanosoma grayi                         NW_008826261.1    https://www.ncbi.nlm.nih.gov/nuccore/NW_008826261.1
Although the main focus of this research is the analysis of pre-existing data, we
also expressed an ISP2 ecotin homolog from T. cruzi to test its in vitro activity
against a range of serine proteases. Enzymatic activity tests are still ongoing. For
this part of the project, supervised by Prof. Márcia Aparecida Sperança, PhD, the ISP2
coding gene of T. cruzi strain Y was amplified by PCR from purified DNA with
custom-designed primers. It was cloned and expressed at the Federal University of
ABC’s Pathogenic Agents Laboratory using standard recombinant expression
techniques (SAMBROOK; RUSSELL, 2001).
Results and discussion
Recombinant T. cruzi ISP2 was successfully expressed and purified and is
currently undergoing enzymatic reaction tests at the Federal University of ABC’s
Pathogenic Agents Lab by a separate study group led by Prof. Sperança. Preliminary
results indicate that the protein inhibits serine proteases such as trypsin and
neutrophil elastase very strongly, at least as strongly as E. coli ecotin. Numeric
data are not yet available because these assays are still in progress, but the preliminary
results suggest that T. cruzi ISP2 probably acts similarly to
Leishmania spp. ISP2, protecting the parasite by inhibiting host macrophage action
(FARIA et al., 2011).
Maximum likelihood phylogenetic inference resulted in the tree
represented in fig. 11, with E. coli ecotin as the outgroup and color-coded ISPs 1, 2
and 3. The tree topology strongly indicates that the ISP ecotin homologs
differentiated from each other a long time ago, and it is probable that at least ISP1
and ISP2 have been present in these organisms’ genomes since at least the split between
the Leishmania and Trypanosoma genera.
Figure 11. Maximum likelihood cladogram of ecotin homologs (ISPs), with ISP1 labelled in orange, ISP2 labelled in green and ISP3 labelled in blue
Using the loci image generation software described in detail in Chapter I
resulted in the images in figures 12 and 13, showing the full chromosomes and a
zoomed-in area of interest respectively. Figure 12 is only useful for comparing
the overall chromosomal position of features among Leishmania records: ecotin homologs
occur only in chromosome 15 of these species, and the data for other species are
either incomplete or badly annotated, resulting in huge contig sequences. Figure
13 clears the image a little, but data for the T. grayi and Leptomonas pyrrhocoris
sequences are still fragmentary.
Figure 12. Loci image generator result showing full GenBank records and all selected labels
The images show up to five ecotin homologs in various Leishmania species.
This duplication was not reported in previous papers. Sequence data analysis and
visual inspection of the genomic context give strong support to the idea that ecotin
homologs underwent various duplications and/or multiple lateral gene
transfer events before the differentiation of modern Trypanosomatida genera.
Close examination of fig. 13 shows that Leishmania spp. mostly retained all
five ISP copies, while trypanosomes lost at least a few of them. To facilitate this
analysis, a second zoomed-in image (fig. 14) was generated after manual database
manipulation, changing all flanking protein labels to delim and identifying the
ecotin homologs, as well as possible, as ISP1, ISP2 and ISP3. In these figures, the
CHP1-3 and katanin-like labels are accessory labels: they serve to delimit the complete
ecotin loci and were helpful in assigning the homologs to ISP1, ISP2 and
ISP3.
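The relabelling step and the per-record count of ISP copies could be scripted along the following lines. This is a hedged Python/sqlite3 sketch, not the manual procedure actually used: the `feature` table is a hypothetical schema, the rows are toy data, and only the label names (CHP1, CHP2, CHP3, katanin-like, delim, ISP1 to ISP3) come from the text.

```python
import sqlite3

# Accessory labels named in the text; the schema below is hypothetical.
FLANKING = ("CHP1", "CHP2", "CHP3", "katanin-like")


def relabel_flanking(conn: sqlite3.Connection) -> int:
    """Set every flanking label to 'delim'; return the number of rows changed."""
    marks = ", ".join("?" for _ in FLANKING)
    cur = conn.execute(
        f"UPDATE feature SET label = 'delim' WHERE label IN ({marks})", FLANKING
    )
    conn.commit()
    return cur.rowcount


def isp_copies(conn: sqlite3.Connection) -> dict:
    """Count features whose label starts with 'ISP', per record."""
    rows = conn.execute(
        "SELECT record, COUNT(*) FROM feature "
        "WHERE label LIKE 'ISP%' GROUP BY record ORDER BY record"
    )
    return dict(rows.fetchall())


# Toy rows only; record names and the labels per record are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (record TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [
        ("record_A", "CHP1"), ("record_A", "ISP1"), ("record_A", "ISP2"),
        ("record_A", "katanin-like"), ("record_B", "ISP2"), ("record_B", "CHP2"),
    ],
)
print(relabel_flanking(conn))
print(isp_copies(conn))
```

Working on a copy of the database would be the safer choice in practice, since the UPDATE overwrites the accessory labels in place.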
Figure 13. Loci image generator result showing GenBank records and all selected labels zoomed into the area of interest
We know that ISPs in Kinetoplastida came from bacteria by horizontal gene
transfer because they appear in no other eukaryotes. The unanswered
question is how. Looking at these zoomed-in images and keeping the cladogram in
fig. 11 in mind, we can form a hypothesis for how trypanosomatids acquired ecotin
homologs. The next paragraph is speculative, but given the evidence it probably is
not too far off-base:
Figure 14. Loci image generator result created with the same data as Fig. 13, but with individual ISPs identified as ISP1, ISP2 and ISP3 and with flanking proteins identified simply as delim. Note that the blue delim proteins are not the same; they were kept in this image only to illustrate the gene loci correctly
The ancestor of all trypanosomatids either participated in multiple lateral
gene transfers with ecotin-possessing bacteria, or this event occurred only once
and was followed by multiple gene duplications. If multiple gene transfers
occurred, they probably happened no more than three times, once each for ISPs 1, 2 and 3,
and the additional ISP2 and ISP3 copies carried by Leishmania spp. are the result of
a subsequent duplication. The positions of ISP1 and ISP2 in T. brucei, in the first and
second ecotin loci respectively, with the ISP2 probably being homologous to T. cruzi
ISP2, point to an early locus duplication, occurring before the two genera split. In
this scenario, Trypanosoma spp. subsequently lost copies of the gene; their
sequences show a much more compact genome when compared to Leishmania
spp. in the images presented, leading to the suspicion that more deletions occurred
in Trypanosoma species than in Leishmania, which would lend credence to the idea
that T. cruzi and T. brucei lost some of their ecotin copies. A possible sequence of
events based on this limited dataset is this: the common Trypanosomatida ancestor
had three ecotin copies in the first locus (at around position 120 kbp in
chromosome 15), acquired either via multiple lateral transfers from bacteria or via a single
lateral transfer followed by contiguous duplication. The ISP2 and ISP3 ancestors in
this locus then underwent a simultaneous duplication event, creating the second
locus at around 190 kbp. Subsequently, various species lost some of these copies.
The preservation of ISP2 in almost all species is an interesting fact, and
makes sense given the ample evidence of its importance against hosts’ immune
systems. Another interesting fact is that T. brucei parasites preserved the ISP1
variant in all cases, while T. cruzi lost the ISP1 gene. Since ISP1 seems to be involved
in the development of the flagellum and of motility in promastigotes inside
the insect vector in Leishmania species (MORRISON et al., 2012), this could be a
reason for its preservation in T. brucei and loss in T. cruzi. These species are
members of section Salivaria and Stercoraria respectively, with different life cycles
and methods of transmission. While T. cruzi is transmitted by hemipterans, with
infecting parasites deposited with their feces on the vertebrate host, T. brucei lives
in the salivary gland of dipteran insects and is injected by their proboscis like the
Leishmania species. It could be that the ISP1 ecotin variant gives some advantage
to trypanosomatids with dipterans as their arthropod hosts. This association could
result in an interesting future research project, with potential public health
applications.
These speculations are given to encourage further research. As tempting as
it is to affirm their validity, our data set is too limited in scope, and in some cases of
too poor quality, to support bold claims. Automated genomic annotation can
only go so far, and some of these sequences have errors, omissions and other
problems. Looking closely at the L. mexicana data in fig. 13, for example, it seems
that the coding sequences between the first ISP occurrence and the CHP2 labelled
gene should clearly be labelled as ISP2 and ISP3, but in the automated annotation
they appear as “unknown proteins”. Nevertheless, since the amount of available
genomic data grows so fast, these speculations can be further developed as more
data becomes available.
One thing this work clearly shows is the ubiquity of large
amounts of unreviewed genomic data online. The amount of retrievable
information at very little monetary cost and using free-to-use bioinformatics tools
is huge, and in this world of big data and exponentially falling sequencing costs, this
fact will become more obvious as time passes. The next generation of budding
biologists may well have to learn programming before they learn the names of all
the plant and animal families.
Final remarks
Biology is the most interdisciplinary of hard sciences. A biologist cannot get
by with only an encyclopedic knowledge of plants and animals, as was common in
the last century.
Today, working or aspiring biologists must contend with the fact that
modern biology is as much about large amounts of data as it is about physiology
and nomenclature. Embracing bioinformatic tools and methods is the only way
forward for those in the field not wanting to be left behind, especially in genetics,
ecology and other fields where the amount of data can clearly outstrip anyone’s
capacity for unaided analysis.
Going forward, we must ourselves adapt to this brave new world of big data
and big data analysis, and spur our peers to do the same, or risk fast becoming
obsolete.
References
ALVES, J. M. P. et al. Endosymbiosis in trypanosomatids: the genomic cooperation between bacterium and host in the synthesis of essential amino acids is heavily influenced by multiple horizontal gene transfers. BMC evolutionary biology, [s. l.], v. 13, n. 1, p. 190, 2013. Available at: <http://bmcevolbiol.biomedcentral.com/articles/10.1186/1471-2148-13-190>. Accessed: 15 Nov. 2017.
ANDRADE, I. d. S. et al. Characterization of a porin channel in the endosymbiont of the trypanosomatid protozoan Crithidia deanei. Microbiology, [s. l.], v. 157, n. 10, p. 2818–2830, 2011. Available at: <http://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.049247-0>. Accessed: 15 Nov. 2017.
BEAULIEU, A. Learning SQL. [s.l.] : O’Reilly Media, Incorporated, 2009. Available at: <https://market.android.com/details?id=book-1KybAgAAQBAJ>
BORATYN, G. M. et al. BLAST: a more efficient report with usability improvements. Nucleic acids research, [s. l.], v. 41, n. W1, p. W29–W33, 2013. Available at: <http://academic.oup.com/nar/article/41/W1/W29/1091045/BLAST-a-more-efficient-report-with-usability>. Accessed: 16 Nov. 2017.
BORGHESAN, T. C. et al. Molecular phylogenetic redefinition of Herpetomonas (Kinetoplastea, Trypanosomatidae), a genus of insect parasites associated with flies. Protist, [s. l.], v. 164, n. 1, p. 129–152, 2013. Available at: <http://dx.doi.org/10.1016/j.protis.2012.06.001>
CAMARGO, E. P. et al. Ribosomal DNA restriction analysis and synthetic oligonucleotide probing in the identification of genera of lower trypanosomatids. The Journal of parasitology, [s. l.], v. 78, n. 1, p. 40–48, 1992. Available at: <https://www.ncbi.nlm.nih.gov/pubmed/1310733>
Cascading Style Sheets. [s.d.]. Available at: <https://www.w3.org/Style/CSS/Overview.en.html>. Accessed: 2 Jun. 2018.
CASPER, J. et al. The UCSC Genome Browser database: 2018 update. Nucleic acids research, [s. l.], v. 46, n. D1, p. D762–D769, 2018. Available at: <http://dx.doi.org/10.1093/nar/gkx1020>
CHUNG, C. H. et al. Purification from Escherichia coli of a periplasmic protein that is a potent inhibitor of pancreatic proteases. The Journal of biological chemistry, [s. l.], v. 258, n. 18, p. 11032–11038, 1983. Available at: <https://www.ncbi.nlm.nih.gov/pubmed/6411724>
CORTEZ, A. P. et al. The taxonomic and phylogenetic relationships of Trypanosoma vivax from South America and Africa. Parasitology, [s. l.], v. 133, n. Pt 2, p. 159–169, 2006. Available at: <http://dx.doi.org/10.1017/S0031182006000254>
DARRIBA, D. et al. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics, [s. l.], v. 27, n. 8, p. 1164–1165, 2011. Available at: <http://dx.doi.org/10.1093/bioinformatics/btr088>
DA SILVA, F. M. et al. Phylogeny, taxonomy and grouping of Trypanosoma rangeli isolates from man, triatomines and sylvatic mammals from widespread geographical origin based on SSU and ITS ribosomal sequences. Parasitology, [s. l.], v. 129, n. Pt 5, p. 549–561, 2004. Available at: <https://www.ncbi.nlm.nih.gov/pubmed/15552400>
DOWNING, T. et al. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome research, [s. l.], v. 21, n. 12, p. 2143–2156, 2011. Available at: <http://dx.doi.org/10.1101/gr.123430.111>
DUCKETT, J. Web Design with HTML, CSS, JavaScript and jQuery Set. [s.l.] : Wiley, 2014. Available at: <https://market.android.com/details?id=book-zb9EnwEACAAJ>
EGGERS, C. T. et al. The role of ecotin dimerization in protease inhibition. Journal of molecular biology, [s. l.], v. 308, n. 5, p. 975–991, 5/2001. Available at: <http://linkinghub.elsevier.com/retrieve/pii/S0022283601947540>. Accessed: 26 Jul. 2015.
EGGERS, C. T. et al. The periplasmic serine protease inhibitor ecotin protects bacteria against neutrophil elastase. Biochemical Journal, [s. l.], v. 379, n. Pt 1, p. 107–118, 2004. Available at: <http://dx.doi.org/10.1042/BJ20031790>
EL-SAYED, N. M. et al. The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science, [s. l.], v. 309, n. 5733, p. 409–415, 2005. Available at: <http://dx.doi.org/10.1126/science.1112631>
ESCHENLAUER, S. C. P. et al. Influence of parasite encoded inhibitors of serine peptidases in early infection of macrophages with Leishmania major. Cellular microbiology, [s. l.], v. 11, n. 1, p. 106–120, 2009. Available at: <http://dx.doi.org/10.1111/j.1462-5822.2008.01243.x>
FARIA, M. S. et al. Leishmania inhibitor of serine peptidase 2 prevents TLR4 activation by neutrophil elastase promoting parasite survival in murine macrophages. Journal of immunology, [s. l.], v. 186, n. 1, p. 411–422, 2011. Available at: <http://dx.doi.org/10.4049/jimmunol.1002175>
FERREIRA, R. C. et al. Morphological and molecular diversity and phylogenetic relationships among anuran trypanosomes from the Amazonia, Atlantic Forest and Pantanal biomes in Brazil. Parasitology, [s. l.], v. 134, n. Pt 11, p. 1623–1638, 2007. Available at: <http://dx.doi.org/10.1017/S0031182007003058>
FINE, M. Beta Testing for Better Software. [s.l.] : Wiley, 2002. Available at: <https://market.android.com/details?id=book-rZa6DoPlAsgC>
GOUY, M.; GUINDON, S.; GASCUEL, O. SeaView version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular biology and evolution, [s. l.], v. 27, n. 2, p. 221–224, 2010. Available at: <http://dx.doi.org/10.1093/molbev/msp259>
GRIMALDI, G.; TESH, R. B. Leishmaniases of the New World: current concepts and implications for future research. Clinical microbiology reviews, [s. l.], v. 6, n. 3, p. 230–250, 1993. Available at: <https://www.ncbi.nlm.nih.gov/pubmed/8358705>
HAMILTON, P. B. et al. Trypanosomes are monophyletic: evidence from genes for glyceraldehyde phosphate dehydrogenase and small subunit ribosomal RNA. International journal for parasitology, [s. l.], v. 34, n. 12, p. 1393–1404, 2004. Disponível em: <http://dx.doi.org/10.1016/j.ijpara.2004.08.011>
HAMILTON, P. B.; GIBSON, W. C.; STEVENS, J. R. Patterns of co-evolution between trypanosomes and their hosts deduced from ribosomal RNA and protein-coding gene phylogenies. Molecular phylogenetics and evolution, [s. l.], v. 44, n. 1, p. 15–25, 2007. Disponível em: <http://dx.doi.org/10.1016/j.ympev.2007.03.023>
HOARE, C. A. The trypanosomes of mammals: a zoological monograph. Oxford: Blackwell Scientific Publ, 1972.
HTML 5.1 2nd Edition. [s.d.]. Disponível em: <https://www.w3.org/TR/html51/>. Acesso em: 2 jun. 2018.
IRELAND, P. M. et al. The serine protease inhibitor Ecotin is required for full virulence of Burkholderia pseudomallei. Microbial pathogenesis, [s. l.], v. 67-68, p. 55–58, 2014. Disponível em: <http://linkinghub.elsevier.com/retrieve/pii/S0882401014000059>. Acesso em: 15 nov. 2017.
IVENS, A. C. et al. The genome of the kinetoplastid parasite, Leishmania major. Science, [s. l.], v. 309, n. 5733, p. 436–442, 2005. Disponível em: <http://dx.doi.org/10.1126/science.1112680>
JACKSON, A. P. Genome evolution in trypanosomatid parasites. Parasitology, [s. l.], v. 142, n. S1, p. S40–S56, 2015. Disponível em: <http://www.journals.cambridge.org/abstract_S0031182014000894>. Acesso em: 15 nov. 2017.
JavaScript W3Schools. [s.d.]. Disponível em: <https://www.w3schools.com/jS/default.asp>. Acesso em: 2 jun. 2018.
LAUWEREYS, M. J.; LAMBEIR, A.-M. V. R. Ecotin as a factor Xa, XIa, and XIIa inhibitor, 5585259 A, 1996.
LETUNIC, I.; BORK, P. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic acids research, [s. l.], v. 39, n. Web Server issue, p. W475–8, 2011. Disponível em: <http://dx.doi.org/10.1093/nar/gkr201>
LIMA, L. et al. Evolutionary insights from bat trypanosomes: morphological, developmental and phylogenetic evidence of a new species, Trypanosoma (Schizotrypanum) erneyi sp. nov., in African bats closely related to Trypanosoma (Schizotrypanum) cruzi and allied species. Protist, [s. l.], v. 163, n. 6, p. 856–872, 2012. Disponível em: <http://dx.doi.org/10.1016/j.protis.2011.12.003>
MARCILI, A. Trypanosoma cruzi: Diversidade, relações filogenéticas e padrões ecogeográficos de isolados silvestres. 2008. Universidade de São Paulo, São Paulo, 2008.
MARCILI, A. et al. A new genotype of Trypanosoma cruzi associated with bats evidenced by phylogenetic analyses using SSU rDNA, cytochrome b and Histone H2B genes and genotyping based on ITS1 rDNA. Parasitology, [s. l.], v. 136, n. 6, p. 641–655, 2009. Disponível em: <http://dx.doi.org/10.1017/S0031182009005861>
MCGRATH, M. E.; GILLMOR, S. A.; FLETTERICK, R. J. Ecotin: lessons on survival in a protease-filled world. Protein science: a publication of the Protein Society, [s. l.], v. 4, n. 2, p. 141–148, 1995. Disponível em: <http://dx.doi.org/10.1002/pro.5560040201>
MOLYNEUX, D. Neglected tropical diseases. Community eye health / International Centre for Eye Health, [s. l.], v. 26, n. 82, p. 21–24, 2013. Disponível em: <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756642/>. Acesso em: 27 jul. 2015.
MORRISON, L. S. et al. Ecotin-like serine peptidase inhibitor ISP1 of Leishmania major plays a role in flagellar pocket dynamics and promastigote differentiation. Cellular microbiology, [s. l.], v. 14, n. 8, p. 1271–1286, 2012. Disponível em: <http://dx.doi.org/10.1111/j.1462-5822.2012.01798.x>
MOTTA, M. C. M. et al. The Bacterium Endosymbiont of Crithidia deanei Undergoes Coordinated Division with the Host Cell Nucleus. PloS one, [s. l.], v. 5, n. 8, p. e12415, 2010. Disponível em: <http://dx.plos.org/10.1371/journal.pone.0012415>. Acesso em: 15 nov. 2017.
MySQL. [s.d.]. Disponível em: <https://www.mysql.com/>. Acesso em: 2 jun. 2018.
Neglected tropical diseases. 2015. Disponível em: <http://www.who.int/neglected_diseases/diseases/en/>. Acesso em: 8 out. 2015.
Notepad++. [s.d.]. Disponível em: <https://notepad-plus-plus.org/>. Acesso em: 2 jun. 2018.
OPPERDOES, F. R.; MICHELS, P. A. M. Horizontal gene transfer in trypanosomatids. Trends in parasitology, [s. l.], v. 23, n. 10, p. 470–476, 2007. Disponível em: <http://dx.doi.org/10.1016/j.pt.2007.08.002>
PÉREZ, H. A. G. Diagnóstico, caracterização molecular e epidemiologia de tripanossomas de ungulados. 2012. Universidade de São Paulo, São Paulo, 2012. Disponível em: <http://www.teses.usp.br/teses/disponiveis/42/42135/tde-20092012-084016/publico/HeraklesAntonioGarciaPerez_Doutorado.pdf>. Acesso em: 7 out. 2015.
PHILLIPS, A. WAMP It Up! Kindle ed. [s.l.] : Pintop Productions, 2012. Disponível em: <https://www.amazon.com/WAMP-Up-Arelthia-Phillips-ebook/dp/B007QOBZGG/ref=sr_1_1?s=books&ie=UTF8&qid=1529210716&sr=1-1&keywords=wamp>
PHP: Hypertext Preprocessor. [s.d.]. Disponível em: <http://php.net/>. Acesso em: 2 jun. 2018.
PRUITT, K.; BROWN, G.; MURPHY, M. RefSeq Frequently Asked Questions (FAQ). [s.l.] : National Center for Biotechnology Information (US), 2018. Disponível em: <https://www.ncbi.nlm.nih.gov/books/NBK50679/>. Acesso em: 17 jun. 2018.
RAMOS, F. et al. E. dispar strain: analysis of polymorphism as a tool for study of geographic distribution. Molecular and biochemical parasitology, [s. l.], v. 141, n. 2, p. 175–177, 2005. Disponível em: <http://dx.doi.org/10.1016/j.molbiopara.2005.02.010>
REY, L. Parasitologia. 4a. ed. Rio de Janeiro: Guanabara Koogan, 2008.
RIBEIRO-GOMES, F. L. et al. Neutrophils activate macrophages for intracellular killing of Leishmania major through recruitment of TLR4 by neutrophil elastase. Journal of immunology, [s. l.], v. 179, n. 6, p. 3988–3994, 2007. Disponível em: <https://www.ncbi.nlm.nih.gov/pubmed/17785837>
RODRIGUES, A. C. et al. Phylogeny of Trypanosoma (Megatrypanum) theileri and related trypanosomes reveals lineages of isolates associated with artiodactyl hosts diverging on SSU and ITS ribosomal sequences. Parasitology, [s. l.], v. 132, n. Pt 2, p. 215–224, 2006. Disponível em: <http://dx.doi.org/10.1017/S0031182005008929>
ROGERS, M. B. et al. Chromosome and gene copy number variation allow major structural change between species and strains of Leishmania. Genome research, [s. l.], v. 21, n. 12, p. 2129–2142, 2011. Disponível em: <http://dx.doi.org/10.1101/gr.122945.111>
SAMBROOK, J.; RUSSELL, D. W. Molecular cloning: a laboratory manual. 3rd ed. Cold Spring Harbor, N.Y: Cold Spring Harbor Laboratory Press, 2001.
Sample GenBank Record. [s.d.]. Disponível em: <https://www.ncbi.nlm.nih.gov/genbank/samplerecord/>. Acesso em: 2 jun. 2018.
SKINNER, M. E. et al. JBrowse: a next-generation genome browser. Genome research, [s. l.], v. 19, n. 9, p. 1630–1638, 2009. Disponível em: <http://dx.doi.org/10.1101/gr.094607.109>
SQL W3Schools. [s.d.]. Disponível em: <https://www.w3schools.com/sql/>. Acesso em: 2 jun. 2018.
StatCounter Global Stats - Browser Usage Share. [s.d.]. Disponível em: <http://gs.statcounter.com/>. Acesso em: 2 jun. 2018.
STEIN, L. D. et al. The generic genome browser: a building block for a model organism system database. Genome research, [s. l.], v. 12, n. 10, p. 1599–1610, 2002. Disponível em: <http://dx.doi.org/10.1101/gr.403602>
STEVENS, J. R. et al. The ancient and divergent origins of the human pathogenic trypanosomes, Trypanosoma brucei and T. cruzi. Parasitology, [s. l.], v. 118, n. Pt 1, p. 107–116, 1999. Disponível em: <https://www.ncbi.nlm.nih.gov/pubmed/10070668>
STEVENS, J. R.; NOYES, H.; GIBSON, W. The Evolution of Trypanosomes Infecting Humans and Primates. Memórias do Instituto Oswaldo Cruz, [s. l.], v. 93, n. 5, p. 669–676, 1998. Disponível em: <http://dx.doi.org/10.1590/S0074-02761998000500019>
TEIXEIRA, M. M. G. et al. Phylogenetic validation of the genera Angomonas and Strigomonas of trypanosomatids harboring bacterial endosymbionts with the description of new species of trypanosomatids and of proteobacterial symbionts. Protist, [s. l.], v. 162, n. 3, p. 503–524, 2011. Disponível em: <http://dx.doi.org/10.1016/j.protis.2011.01.001>
TRUBETSKOY, S. List of 20 Simple, Distinct Colors. 2017. Disponível em: <https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/>. Acesso em: 17 jun. 2018.
UCSC Genome Browser Gateway. [s.d.]. Disponível em: <https://genome.ucsc.edu/cgi-bin/hgGateway>. Acesso em: 2 jun. 2018.
VIOLA, L. B. et al. Phylogenetic analyses based on small subunit rRNA and glycosomal glyceraldehyde-3-phosphate dehydrogenase genes and ultrastructural characterization of two snake Trypanosomes: Trypanosoma serpentis n. sp. from Pseudoboa nigra and Trypanosoma cascavelli from Crotalus durissus terrificus. The Journal of eukaryotic microbiology, [s. l.], v. 56, n. 6, p. 594–602, 2009. Disponível em: <http://dx.doi.org/10.1111/j.1550-7408.2009.00444.x>
VOTÝPKA, J. et al. New Approaches to Systematics of Trypanosomatidae: Criteria for Taxonomic (Re)description. Trends in parasitology, [s. l.], v. 31, n. 10, p. 460–469, 2015. Disponível em: <http://linkinghub.elsevier.com/retrieve/pii/S147149221500152X>. Acesso em: 15 nov. 2017.
WampServer. [s.d.]. Disponível em: <http://www.wampserver.com/en/>. Acesso em: 2 jun. 2018.
WELLING, L.; THOMSON, L. PHP and MySQL Web Development. [s.l.] : Addison-Wesley, 2016. Disponível em: <https://market.android.com/details?id=book-zlqOMQEACAAJ>
Yahoo Small Business Web Hosting. [s.d.]. Disponível em: <https://smallbusiness.yahoo.com/>. Acesso em: 2 jun. 2018.
YANG, S. Q. et al. Ecotin: a serine protease inhibitor with two distinct and interacting binding sites. Journal of molecular biology, [s. l.], v. 279, n. 4, p. 945–957, 1998. Disponível em: <http://dx.doi.org/10.1006/jmbi.1998.1748>
Glossary
BLAST: Standard algorithm for aligning, searching and comparing nucleotide and protein sequence data.
CDS: Coding sequence; the sequence of nucleotides in a gene that encodes a protein.
CHP: A gene that appears to encode a complete protein of unknown function or origin, deduced by computational analysis of genomic data.
CSS: The standard style sheet language used to apply formatting to HTML pages.
CSS3: CSS version 3, the current version as of 2018.
FASTA: Standard text-based format for representing nucleotide or amino acid sequence data.
GenBank: The National Institutes of Health's genetic sequence database, publicly and freely available online. The name also describes the standardized text file format in which individual records are presented.
HTML: Standard markup language for creating Web pages.
HTML5: HTML version 5, the current version as of 2018.
ISP: Inhibitor of serine peptidase; the name used in the literature for the ecotin homologs found in trypanosomatids.
JavaScript: Scripting language that runs locally in web browsers to execute commands from dynamic web pages. It is the de facto standard for modern browsers.
MySQL: Widely used open-source relational database system using the SQL interface.
Notepad++: Freeware text editor with syntax highlighting for dozens of common programming languages.
PHP: One of the most widely used interpreted server-side programming languages.
SQL: Standardized query language for relational database interactions.
tBLASTn: A type of BLAST search that takes a protein sequence as input and matches it against the translations, in all six reading frames, of stored nucleotide sequence data.
WAMP: A software stack for web server deployment on the Windows operating system, combining the Apache web server, the PHP server-side interpreted programming language and the MySQL relational database system.
WampServer: A freely available implementation of WAMP.