Cancer ID
Automated System for Identification of Circulating Tumor Cells
Rita Gonçalves Pires Antunes Angélico
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Prof. Maria Margarida Campos da Silveira, Dr. Christoph Brune
Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Maria Margarida Campos da Silveira
Member of the Committee: Prof. João Miguel Raposo Sanches
May 2016
“Make a radical change in your lifestyle and begin to boldly do things which you may previously
never have thought of doing, or been too hesitant to attempt. So many people live within
unhappy circumstances and yet will not take the initiative to change their situation because
they are conditioned to a life of security, conformity, and conservatism, all of which may appear
to give one peace of mind, but in reality nothing is more damaging to the adventurous spirit
within a man than a secure future. The very basic core of a man’s living spirit is his passion for
adventure. The joy of life comes from our encounters with new experiences, and hence there is
no greater joy than to have an endlessly changing horizon, for each day to have a new and
different sun. If you want to get more out of life, you must lose your inclination for monotonous
security and adopt a helter-skelter style of life that will at first appear to you to be crazy. But
once you become accustomed to such a life you will see its full meaning and its incredible
beauty.”
Jon Krakauer, Into the Wild.
Acknowledgments
The journey of knowledge accretion is never a solo trip. Directly or indirectly, several people and
institutions contributed to this thesis, and to them I address my deepest gratitude.
First and foremost, I would like to thank my supervisor, Professor Margarida Silveira, not only for
accepting this challenge (a self-proposed topic, supervised while I was on Erasmus), but also for constantly
questioning me and demanding that I step up my game, and for her incredible availability, perseverance,
support, advice and knowledge.
Second, I want to express my sincerest thanks to Doctor Christoph Brune, Leonie Zeune
and Doctor Guus van Dalum, members of the Cancer ID team, for accepting and welcoming me into their
team during my Erasmus period. Without them this thesis would never have existed; their
support and knowledge were essential to the development of this project.
I am also grateful to two great institutions. First, Instituto Superior Técnico, for teaching me every day
that “O ensino de amadores não cria profissionais” (Alfredo Bensaúde; translation: amateur education
does not shape professionals) and for giving me the opportunity to grow not only academically and
professionally, but also on a personal level. As scary as it may seem, at least to me, 24% of my life to
date was lived in this school. Second, the University of Twente, for the incredible conditions provided
and for accepting me as an exchange student. I would also like to thank the Erasmus Programme for
providing me with this opportunity.
Now, I would like to address my inmost gratitude to my father, who, along with all the flaws a daughter
sees in a father, has raised me to be the person I am today, has given me support and the freedom
to live life to the fullest, and at whom I look with admiration. I also want to address a big thank you to my
brother, in a rare moment of kindness between us, for being my everyday challenger, but also a huge supporter.
Last but not least, I would like to thank all my friends. I would like to thank my Erasmus
family (Andrea Gambuti, Claudia Ruffoni, Ilmari Ahonen, Mert Imre and Ophelie Haurou-Bejottes) for one
of the greatest times of my life, for treating my craziness as something totally socially acceptable,
and for the constant support. To Nuno Pereira, who, no matter what, no matter where, for the past 4 years
has been there for me like no other person has ever been. To Mónica, who after 8 years of friendship
hasn't given up and deals with all my stress and panic moments. To João Satiro and Olek, who over the
last 5 years, and António for the past 3, have worked side by side with me, and especially Satiro, who
has been a great support. I would also like to address my sweetest thank you to Inês Godet, a force of
nature and a great support throughout the past year and a half. I would also like to thank Cristiano for all
the hugs, contagious good mood and friendship. Rita and Sancho also deserve a big thank you for being
not only a great support but also the most humble and kind people I know.
This was a challenging project, and I could not end it without paying homage to all who are fighting
or have fought any kind of cancer, to everyone supporting them, and to everyone working to save them
or ease their pain.
Resumo
Actualmente, existem diversas opções para tratamento de cancro. O estudo de Células Tumorais em
Circulação (CTCs) fornece dados relevantes sobre a eficácia do tratamento e a progressão da doença,
permitindo assim um melhor ajuste dos tratamentos. Assim sendo, o desenvolvimento de um sistema
automático de classificação, que usa como fonte de informação imagens com 4 canais de amostras
sanguíneas, é de elevado interesse. Este trabalho foca-se no estudo de imagens, adquiridas com o
sistema CellSearch, de pacientes com cancro do pulmão de células pequenas (SCLC), para o qual ainda
não existe nenhum sistema automático para a sua enumeração. Este sistema é composto por dois blocos
principais: processamento de imagem e aprendizagem automática. O bloco de processamento de
imagem consiste em: normalização de imagens, segmentação e extracção de features (morfológicas,
de intensidade e texturais). O bloco de aprendizagem automática foi desenvolvido tendo em conta o
facto do número de não-CTCs ser largamente superior ao número de CTCs, usando técnicas como o
bootstrapping e algoritmos de boosting. Neste projecto, algoritmos convencionais, como Support Vector
Machines (já usado no passado no âmbito de projectos semelhantes) e k-Nearest Neighbor, foram
implementados, bem como algoritmos mais recentes, nunca aplicados neste contexto, especificamente
o AdaBoost e o RUSBoost. Embora diversas novas abordagens tenham sido testadas, não foi possível
desenvolver um sistema automático para enumeração de CTCs para SCLC fidedigno.
Palavras-chave: Células Tumorais em Circulação, Classes não-balanceadas, k-Nearest
Neighbor, Support Vector Machines, Ensemble methods, Cancro do pulmão de células pequenas.
Abstract
Nowadays, there are several treatment options for cancer. The study of Circulating Tumor Cells
(CTCs) provides great insight into treatment effectiveness and disease progression, allowing for better
treatment adjustment. Therefore, the development of an automated classification system, which uses
4-channel images of blood samples (a non-invasive biopsy) as its source of information, is of great interest.
This work focused on the study of images, acquired with the CellSearch system, of patients with Small-Cell
Lung Cancer (SCLC), for which no automated enumeration system has been developed to date. This
system has two main building blocks: image processing and machine learning. The image processing
block consists of image normalization, segmentation and feature extraction (morphological, intensity-related
and texture features). The machine learning block was developed taking into account the fact that
non-CTCs highly outnumber CTCs, using techniques such as bootstrapping and ensemble methods.
In this thesis, conventional algorithms were implemented, namely Support Vector Machines (which have
been used in this context before) and k-Nearest Neighbor, along with recent algorithms, never studied
before in this field, specifically AdaBoost and RUSBoost. Even though several new approaches were
tested, it was not possible to develop a reliable automated CTC enumeration system for SCLC.
Keywords: Circulating Tumor Cells, Class imbalance, k-Nearest Neighbor, Support Vector Machines,
Ensemble methods, Small-Cell Lung Cancer.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Cancer and its impact on Society . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Circulating Tumor Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.3 Detection of Circulating Tumor Cells . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Original Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 State of the Art 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Machine Learning and Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Image Processing 15
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Image Preprocessing & ROI Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Image Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Morphological Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Intensity Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Classification and Performance Evaluation 21
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.3 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.1 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.2 RUSBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5.1 Nested Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5.2 Receiver-Operator Characteristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Results 33
5.1 Dataset - Fluorescence microscopy for blood cell analysis . . . . . . . . . . . . . . . . . . 33
5.2 Discussion on Image Processing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusions and Future Work 43
Bibliography 45
A Histograms of Feature Distributions 51
A.1 Morphological Features Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Intensity Features Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.3 Texture Features Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
B Classification Results 55
B.1 Dataset 1 - Patient A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
B.2 Dataset 2 - Patient B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
C Noise Analysis 59
List of Tables
2.1 Summary of different cytometric approaches for CTC enumeration. Adapted from [18] . . 10
2.2 Performance of different automated CTC enumeration systems. Acronyms: Accuracy (ACC),
Sensitivity (SENS), Specificity (SPEC), Region of Interest (ROI), Castration-Resistant Prostate
Cancer (CRPC), Apoptotic (Apop.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Summary of the extracted features (P2A - Perimeter to Area Ratio, Max. - Maximum, ch.
- channel, HOG - Histogram of oriented gradients) . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Clinical characteristics of 59 patients with small-cell lung cancer. (ED-extensive disease
stage; LD-limited disease stage) [15]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Set of features of each category used for classification. . . . . . . . . . . . . . . . . . . . 37
5.3 Area Under the Curve (AUC) of each of the algorithms tested for the total of the 3 datasets. 42
B.1 Area Under the Curve (AUC) of each of the algorithms tested for patient A. . . . . . . . . 55
B.2 Area Under the Curve (AUC) of each of the algorithms tested for patient B. . . . . . . . . 57
List of Figures
1.1 Estimated Number of New Cancer Cases by World Area 2012 [2]. . . . . . . . . . . . . . 2
1.2 Estimated New Cancer Cases (left) and Deaths Worldwide (right) for Leading Cancer
Sites by Level of Economic Development, 2012. (*Excluding non-melanoma skin cancer.
Estimates may not sum to worldwide total due to rounding) [2]. . . . . . . . . . . . . . . . 3
1.3 Relative contribution of external factors to cancer incidence. Adapted from [3]. . . . . . . 3
1.4 The metastatic process: cells detach from a primary tumor, penetrate the surrounding
tissue, enter nearby blood vessels (intravasation) and circulate in the vascular system.
Some of these cells eventually adhere to blood vessel walls and are able to extravasate
and migrate into the local tissue, where they can form a secondary tumor. [7]. . . . . . . . 4
1.5 CellSearch thumbnail gallery. The software of the CellSearch CellTracks displays thumb-
nails of all objects that are positive for both DAPI and CK. Events 337, 340, and 341 show
a CTC: positive for DAPI and PE and negative for CD45. Note the weak CD45-staining of
several white blood cells in events 340 and 341 [6]. . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Proposed approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Comparison of common thresholding procedures. Two original images containing a small
(1A) and a large number of objects (1B) were thresholded using three methods: triangle
(2A and 2B), otsu (3A and 3B), and isodata (4A and 4B). The three methods give similar
results on an image with a large number of objects, but triangle finds the correct number
of objects in images which contain a small number of objects. Image 1A is shown using a
logarithmic intensity scale to show the texture in the background; the left part of the image
is part of the cartridge border. [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Detail of a PE image (1), and masks as thresholded by the triangle (2), otsu (3), and
isodata (4) methods. [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 “Example of selection of cartridge scan area. 1: original FITC images of one side of
a cartridge stitched together after application of linear convolution filter to border images
(arrow indicates an air bubble), 2: border enhanced image by gradient magnitude filtering,
3: Binary image of thresholded borders (red color), 4: Selected scan area (red color) after
inversion of image 3, binary propagation of center square, and size verification.” [12] . . . 17
3.2 “Determination of the global search threshold for each picture. The threshold (THR) was
selected by normalizing the height and dynamic range of the intensity histogram, locating
point A as shown, and then adding a fixed offset.” [24] . . . . . . . . . . . . . . . . . . . . 18
4.1 Example of a 2D linearly separable binary problem, where patterns of one class are represented
by diamonds and those of the other by circles. The optimal separating hyperplane (strong
full line) maximizes the distance between the support vectors of each class (darker data
points, limiting the margins). Adapted from [33]. . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Examples of the use of the two adaptations of the SVM. 4.2(a) is an illustrative example
of the use of a kernel (φ(x1, x2) = x1² + x2²) when the data is separable, although no
hyperplane in the input space is able to separate it. Therefore, the data was mapped
into a feature space, where the decision surface was computed. 4.2(b) presents
a situation where the data is not linearly separable in the input space. Two options
are available: either the decision surface is the one presented by the dotted line, which
may lead to overfitting and poor generalization, or one allows for errors to be committed
(the soft-margin concept) and the dashed line is used as the separating hyperplane. Adapted
from [37]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 AdaBoost (Pseudocode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 RUSBoost (Pseudocode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Nested Cross-Validation (Pseudocode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1 Example of a full cartridge (left, presented vertically) and one image from that dataset
(right, horizontally). Overlay corresponds to the 3 channels superimposed. DNA corre-
sponds to DAPI-DNA channel, CK to CK-PE channel and CD45 to CD45-APC channel. . 34
5.2 Example of two cells. Figure 5.2(a) is an example of a CTC, whereas Figure
5.2(b) is an example of a non-CTC. The red and green contours represent the
contours resulting from segmentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Example of a non-CTC. Given the way the manual classification was performed, there
is no way of knowing whether this is just one cell or a cluster of two cells. The green contour
represents the contour resulting from segmentation. . . . . . . . . . . . . . . . . . . . . . 35
5.4 Example of a non-CTC, present in Figure 5.1. This example highlights two problems: first,
the segmentation is not able to create two distinct areas when objects are close together. Second,
given the way the expert reviewer performed the classification, it is not possible to know whether
the element on the right (inside the contour) is just a smudge, a cell or an apoptotic cell. . . . 36
5.5 Receiver operator curves for classification of CTC with k -Nearest Neighbor with boot-
strapping 5.5(a), with Prior Probabilities 5.5(b) and k -NN with the optimal amount of
neighbors 5.5(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK
maximum intensity and CD45 intensity standard deviation; Morphological - area, eccen-
tricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local
entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard
deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and
intensity features for the correspondent channel). . . . . . . . . . . . . . . . . . . . . . . . 39
5.6 Receiver operator curves for classification of CTC with Support Vector Machines, using
Linear 5.6(a) and Gaussian (RBF) 5.6(b) Kernels. (All - all features; User - area, ec-
centricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard
deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Tex-
ture - median local contrast, median local entropy and HOG features of the 3 channels;
Intensity - mean, maximum and standard deviation of the intensity signal and mass of the
3 channels; DNA, CK, CD45 - texture and intensity features for the correspondent channel). 40
5.7 Receiver operator curves for classification of CTC with Ensemble methods, AdaBoost
5.7(a) and RUSBoost 5.7(b). (All - all features; User - area, eccentricity, DNA maximum
intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological -
area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast,
median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and
standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 -
texture and intensity features for the correspondent channel). . . . . . . . . . . . . . . . . 41
A.1 Histogram of distributions of CTC and non-CTC of morphological features. . . . . . . . . . 51
A.2 Histogram of distributions of CTC and non-CTC of intensity features. . . . . . . . . . . . . 52
A.3 Histogram of distributions of CTC and non-CTC of texture features, except HOG features. 53
B.1 Receiver operator curves for classification of CTC with k -Nearest Neighbor with boot-
strapping B.1(a), with Prior Probabilities B.1(b) and k -NN with the optimal amount of
neighbors B.1(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK
maximum intensity and CD45 intensity standard deviation; Morphological - area, eccen-
tricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local
entropy and HOG features for the 3 channels; Intensity - mean, maximum and standard
deviation of the intensity signal and mass for the 3 channels; DNA, CK, CD45 - texture
and intensity features for the correspondent channel). . . . . . . . . . . . . . . . . . . . . 55
B.2 Receiver operator curves for classification of CTC with Support Vector Machines, using
Linear B.2(a) and Gaussian (RBF) B.2(b) Kernels. (All - all features; User - area, ec-
centricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard
deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Tex-
ture - median local contrast, median local entropy and HOG features for the 3 channels;
Intensity - mean, maximum and standard deviation of the intensity signal and mass for the
3 channels; DNA, CK, CD45 - texture and intensity features for the correspondent channel). 56
B.3 Receiver operator curves for classification of CTC with Ensemble methods, AdaBoost
B.3(a) and RUSBoost B.3(b). (All - all features; User - area, eccentricity, DNA maximum
intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological -
area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast,
median local entropy and HOG features for the 3 channels; Intensity - mean, maximum
and standard deviation of the intensity signal and mass for the 3 channels; DNA, CK,
CD45 - texture and intensity features for the correspondent channel). . . . . . . . . . . . . 56
B.4 Receiver operator curves for classification of CTC with k -Nearest Neighbor with boot-
strapping B.4(a), with Prior Probabilities B.4(b) and k -NN with the optimal amount of
neighbors B.4(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK
maximum intensity and CD45 intensity standard deviation; Morphological - area, eccen-
tricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local
entropy and HOG features for the 3 channels; Intensity - mean, maximum and standard
deviation of the intensity signal and mass for the 3 channels; DNA, CK, CD45 - texture
and intensity features for the correspondent channel). . . . . . . . . . . . . . . . . . . . . 57
B.5 Receiver operator curves for classification of CTC with Support Vector Machines, using
Linear B.5(a) and Gaussian (RBF) B.5(b) Kernels. (All - all features; User - area, ec-
centricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard
deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Tex-
ture - median local contrast, median local entropy and HOG features for the 3 channels;
Intensity - mean, maximum and standard deviation of the intensity signal and mass for the
3 channels; DNA, CK, CD45 - texture and intensity features for the correspondent channel). 57
B.6 Receiver operator curves for classification of CTC with Ensemble methods, AdaBoost
B.6(a) and RUSBoost B.6(b). (All - all features; User - area, eccentricity, DNA maximum
intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological -
area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast,
median local entropy and HOG features for the 3 channels; Intensity - mean, maximum
and standard deviation of the intensity signal and mass for the 3 channels; DNA, CK,
CD45 - texture and intensity features for the correspondent channel). . . . . . . . . . . . . 58
C.1 Distribution of Noise, by channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Glossary
Antigen Molecule capable of inducing an immune response on the part of the host organism;
the presence of this molecule can also be used for the detection of specific cells by enrichment.
Apoptosis Process of programmed cell death.
Biopsy Sample of tissue taken from the body for further
examination.
CTCs Circulating Tumor Cells are cancerous cells present in the circulatory system.
Cancer Set of diseases characterized by uncontrolled
growth and spread of abnormal cells.
Cytokeratin Class of fibrous proteins that form the intermediate filaments of the cytoplasm of
epithelial cells, provide structural support to the cytoskeleton and play a role in various
cellular functions.
DNA Deoxyribonucleic acid is a molecule that contains genetic information and is typically
found in the nucleus of a cell.
ED Extensive Disease stage, used in the context of
cancer to refer to metastasised cancer.
Epithelial cell One of the closely packed cells forming the epithelium (membranous tissue
covering internal organs and other internal surfaces of the body).
Ferrofluid Liquid that becomes strongly magnetized in the
presence of a magnetic field.
HOG Histogram of oriented gradients is a feature descriptor.
LD Limited disease stage, used in the context of
cancer to refer to localized cancer.
Leukocyte Cell of the immune system present in the bloodstream; also designated a white
blood cell.
Metastasis Spread of a disease from one location to another not directly connected with it.
ROI Region of Interest is a segmented part of an image, from which features are extracted.
SCLC Small-Cell Lung Cancer is a type of lung cancer.
SVM Support Vector Machines are a set of supervised learning methods used for classification,
regression and outlier detection.
TIFF Tagged Image File Format is a computer file format for storing raster graphics images.
Tumor Lesion or lump formed in the body due to abnormal cellular growth; not necessarily
cancerous.
XML Extensible Markup Language is a system for annotating documents that defines a set of
rules for encoding them in a format which is both human-readable and machine-readable.
In this project, XML files encode notes regarding each dataset.
k-NN k-Nearest Neighbor is a non-parametric classification algorithm.
Chapter 1
Introduction
1.1 Motivation
1.1.1 Cancer and its impact on Society
Cancer is the name given to a set of diseases characterized by the uncontrolled growth and spread of
abnormal cells. This continuous and unrestrained cell division can result in the death of the patient [1].
Cancer is a major health problem (Figure 1.1). It is the second leading cause of death in high-income
countries, and the third in low- and middle-income countries. Cancer is responsible for more deaths
than AIDS, tuberculosis and malaria combined: one in seven deaths worldwide is due to cancer [2].
In 2012, there were 14.1 million new cancer cases, of which 8 million occurred in economically
developing countries, and an estimated 8.2 million cancer deaths (approximately 22,000 people
per day). Worldwide, cancer of the lung, bronchus and trachea is the leading cause of cancer death
among males, followed by liver cancer. Among females, breast cancer leads, followed by cancer of the
lung, bronchus and trachea (Figure 1.2) [2].
Besides the enormous impact cancer has on the number of people it affects, it also represents an
immense economic burden. In 2010, the 13.3 million new cases of cancer were estimated to cost the
world US$290 billion: approximately 53% in medical costs, 24% in income losses and the remainder in
non-medical expenses. This value is expected to rise to US$458 billion in 2030, accounting for 21.5
million new cases of cancer [3, 4] (no reliable source of more recent worldwide cancer costs was found).
About 5% of all cancers are associated with an inherited genetic alteration that might lead to one or
more specific types of cancer. However, most cancers result from damage to genes occurring during a
person's lifetime. This damage can be caused by internal or external factors (Figure 1.3) [1].
1.1.2 Circulating Tumor Cells
Cancer involves the malfunction of genes that control the growth and division of cells. These cells
are less specialized than normal cells and are able to ignore signals that would either prevent their
division or promote their apoptosis [5]. Cancer cells can induce nearby cells to form blood vessels that
supply tumors with oxygen and nutrients while at the same time removing waste, providing ideal growth
conditions.
Figure 1.1: Estimated Number of New Cancer Cases by World Area 2012 [2].
Cells from a primary tumor detach and travel through the circulatory or lymphatic systems, and are
therefore called Circulating Tumor Cells (CTCs). The microscopic observation of these cells was first
described in 1869 by Thomas Ashworth [6]. These cells can generate new colonies in sites far from
where the first tumor was located, a process designated metastasis [7] (Figure 1.4).
In several tumors, this process has already occurred when the primary tumor is detected [5] leading
to a high rate of compromised treatments. Approximately 90% of the deaths in cancer patients are due
to metastasis [8, 9, 10].
The presence of CTCs in metastatic cancer patients is associated with poor survival prospects.
Improvements in treatment and progress in early-stage diagnosis can translate into higher survival rates.
However, the increasing number of treatment options (chemotherapy, radiation therapy, surgery, targeted
therapy, immunotherapy, etc.) has raised the need for methods that determine whether the intended therapy
is being effective [11]. Ideally, these methods would be non-invasive and provide a real-time analysis of
the tumor activity. Several studies have disclosed that a change in the CTC count could be an indicator
of treatment effectiveness [12]; therefore, assessment of CTCs may satisfy this need. In case the tumor
was not completely eliminated from the body, tumor cells will remain dormant or expand. When they
Figure 1.2: Estimated New Cancer Cases (left) and Deaths Worldwide (right) for Leading Cancer Sitesby Level of Economic Development, 2012. (*Excluding non-melanoma skin cancer. Estimates may notsum to worldwide total due to rounding) [2].
Figure 1.3: Relative contribution of external factors to cancer incidence. Adapted from [3].
Figure 1.4: The metastatic process: cells detach from a primary tumor, penetrate the surrounding tissue,enter nearby blood vessels (intravasation) and circulate in the vascular system. Some of these cellseventually adhere to blood vessel walls and are able to extravasate and migrate into the local tissue,where they can form a secondary tumor. [7].
form a detectable metastasis, the cells may no longer be as sensitive as before to the same therapies,
and in some cases actually display resistance. This creates the need for a biopsy to assess the
best treatment options. Biopsies are invasive, difficult and not always possible from metastatic sites, thus
the possibility to isolate tumor cells from the blood provides a "real-time liquid biopsy". The study of
circulating tumor cells can provide game-changing methods to guide personalized therapies, increasing
the survival rate of patients [6].
This project addresses the problem of automated identification of CTCs in Small-Cell Lung Cancer
(SCLC). This type of cancer accounts for about 13-20% of all lung cancer cases and, without treatment,
leads to death within 2 to 4 months [13]. SCLC is strongly associated with cigarette smoking [14],
and is characterized by a high propensity for widespread metastases, often present at an early
disease stage. In limited-stage disease (localized disease), the 5-year survival rate is approximately
10% (maximum 26%), whereas in extensive-stage SCLC (metastasised disease) there is a high initial
response to chemotherapy, although few patients survive beyond the first two years [15].
1.1.3 Detection of Circulating Tumor Cells
There are several systems for CTC detection; the most widely used is the CellSearch system
(Janssen Diagnostics, LLC; Raritan, NJ), which has been thoroughly validated in patients with metastatic
cancer [15]. The system enriches cells from 7.5 ml of blood expressing the epithelial cell adhesion
molecule (EpCAM) antigen and identifies CTCs as nucleated cells (DAPI-DNA) expressing cytokeratin
8/18 or 19 (CK-PE) and lacking the leukocyte antigen (CD45-APC). Several reports suggest that CTCs
can be effectively detected with this test system, also in SCLC [6]. An alternative system for data
collection (in the form of images) is the functionalized and structured medical wire (FSMW) [11], which
will be reviewed later in chapter 2, along with other systems for CTC detection.
After the images are acquired by this system, expert reviewers classify a Region of Interest (ROI) as
a CTC if it has an oval or cell-like morphology, is DAPI and CK positive, CD45 negative and greater
than 4 µm (Figure 1.5) [12, 6]. The biggest challenge in the classification of circulating tumor cells is the
heterogeneity in morphology, partially caused by the large diversity in the viability or apoptotic stage of
the CTCs, which makes it difficult to set criteria on what can be considered a CTC. Extensive training
is needed to keep the variations (inter- and intra-reviewer) in assigning objects as CTCs to a minimum.
Inter-reviewer variability in CTC enumeration can range from 4% to 31% (median 14%) [12]. Additionally, CTCs
are very rare cells in blood: in patients with metastatic cancer there is approximately 1 CTC per mL of
blood, surrounded by approximately 5 × 10^6 white blood cells and 5 × 10^9 red blood cells [6].
Figure 1.5: CellSearch thumbnail gallery. The software of the CellSearch CellTracks displays thumbnailsof all objects that are positive for both DAPI and CK. Events 337, 340, and 341 show a CTC: positivefor DAPI and PE and negative for CD45. Note the weak CD45-staining of several white blood cells inevents 340 and 341 [6].
Automated classification of CTCs is relevant to provide the "real-time liquid biopsy" and treatment
assessment mentioned in section 1.1.2, to eliminate operator error in classification and to make the
process more time efficient [11].
1.2 Proposed Approach
The image data used in this thesis was provided by the Cancer ID project (http://www.cancer-id.eu/)
team of the University of Twente. The dataset consists of images from blood samples obtained from 59 pa-
tients with SCLC; blood collection was done before chemotherapy, after one cycle and at the end of chemotherapy.
Each blood sample corresponds to one cartridge, i.e. 175 four-channel TIFFs, acquired with a fluorescence-
based microscopy system (CellTracks™ Analyzer II), using a 10X NA 0.45 objective with filters for DAPI,
PE, APC and FITC (a biomarker not used for feature extraction). This dataset was previously described
and analysed by Hiltermann et al. [15]. The dataset will be detailed in chapter 5.
In the present work, a system composed of two main components is proposed for CTC identification:
image processing and machine learning (Figure 1.6).
The image processing block contemplates a solution for edge removal, image normalization, image
segmentation (triangle threshold method), ROI analysis and the extraction of morphological features (area,
eccentricity, perimeter, perimeter-to-area ratio), quantitative intensity-related features (mean and maxi-
mum intensity, standard deviation of the intensity signal, mass) and texture-related features (local contrast,
local entropy, histogram of oriented gradients). Before stepping into the classification part, outliers (for
example a ROI with an area too big to be considered a cell) were removed from the dataset. The machine
Figure 1.6: Proposed approach.
learning block aims to compare the performance of four different classification algorithms: Support Vec-
tor Machine (SVM), k-Nearest Neighbor (k-NN), AdaBoost and RUSBoost. Parameter estimation was
performed within a nested cross-validation procedure.
1.3 Original Contribution
In this project we propose innovative methods for automated CTC enumeration in both of the main
components (image processing and classification) and also regarding the type of cancer. The auto-
mated identification of CTCs in previous works (using either the CellSearch System or the FSMW) was
for breast [11, 16, 17], colorectal [17], non-small-cell lung [11] and castration-resistant prostate
cancer [12].
Regarding feature extraction, we present a new texture feature: the histogram of oriented gradients. For
classification, Support Vector Machines (along with Naive Bayes classifiers) have been previously
used on the FSMW (breast and non-small-cell lung cancer), with color histograms as features [11]. Using
the CellSearch technology, the classifier studied was the Random Forest, with morphological, texture,
quantitative and correlation features; however, those images were acquired with a camera of im-
proved resolution (Time Delay and Integration camera using a 40X 0.6 NA objective) [17]. Thus, the use
of SVMs on the CellSearch System and the use of the k-NN, AdaBoost and RUSBoost algorithms introduce a
new approach for the problem at hand.
Additionally, this project tries to deal with data imbalance, a problem that has not been addressed
before in the context of automated CTC enumeration.
1.4 Thesis Outline
The remainder of this dissertation is organized in the following way: chapter 2 presents the state of
the art, highlighting the most relevant contributions to CTC enumeration along with the most important
contributions to the algorithms used. In chapter 3, each technique used in feature extraction is
thoroughly described. Then, chapter 4 explores each classification algorithm used, covering its
fundamentals. Chapter 5 follows, covering the experimental design and its results. Finally,
chapter 6 concludes this thesis, summarizing the results and highlighting future work.
Chapter 2
State of the Art
2.1 Introduction
In the past 12 years, there has been a growing interest in developing systems for the enumeration of
Circulating Tumor Cells with the help of expert reviewers [18]. These systems are of high relevance
to assess disease progression, treatment effectiveness and survival prognosis without being invasive.
Only in recent years has the automation of these systems been a focus of study, an extremely important
topic due to the high dependence on the reviewers' expertise, the inter- and intra-reviewer variability and
the impact these factors have on patients' diagnosis [12, 16, 11, 17].
This chapter reviews the main trends and most important contributions in this field. First, section
2.2 presents a short overview of the available systems for CTC enumeration. Section 2.3 then highlights
the major contributions on feature extraction applied to the study of CTCs. Section 2.4 sum-
marizes the machine learning and classification techniques that successfully distinguished CTCs from
other possible classes. Finally, section 2.5 briefly describes the most important systems for automated
detection of CTCs and summarizes the existing solutions.
2.2 Biomarker
There are two kinds of systems for CTC enumeration: PCR-based and cytometry-based.
The CellSearch System is the only FDA-approved system of the latter type. Table 2.1 presents a summary
of the advantages and disadvantages of the several cytometric approaches. Currently, the most
reviewed system in the literature is the CellSearch System [6, 19, 20, 21, 22].
In addition to the different systems for CTC enumeration, there is also the possibility of using different
markers. In this project, ferrofluids with EpCAM (epithelial cell adhesion molecule, to select cells of ep-
ithelial origin) and the staining reagents DAPI-DNA (4′,6-diamidino-2-phenylindole, dihydrochloride, for
a nuclear stain), PE-CK (cytokeratin 8, 18 Phycoerythrin and cytokeratin 19 Phycoerythrin) and CD45-
APC (CD45-allophycocyanin, to label leukocytes) were used. However, the replacement of cytokeratin
antibodies with other staining reagents that target certain molecules allows a better assessment of
specific CTCs, for example: a staining reagent for Her-2 for breast cancer, Bcl-2 for non-small-cell lung cancer
and non-Hodgkin's lymphomas, and/or AR for castration-resistant prostate cancer [23].
Table 2.1: Summary of different cytometric approaches for CTC enumeration. Adapted from [18]

CellSearch
  Advantages: Semi-automated; high sensitivity; CTC quantification; reproducible; recognition of a fixed marker (EpCAM, CKs, CD45); visual confirmation of CTCs; FDA approved.
  Disadvantages: Only EpCAM+/CK+/CD45- CTCs detected; subjective image interpretation; no further analysis possible.

CTC-chip
  Advantages: 98% cell viability; visual confirmation of CTCs; high detection rate; further analysis possible.
  Disadvantages: Only EpCAM-positive CTCs detected; not commercially available; subjective CTC analysis; lack of validation studies in clinical settings.

EPISPOT
  Advantages: Analysis only on viable cells; high sensitivity.
  Disadvantages: CTC isolation not possible, thus no further analysis possible; need of active protein secretion; no morphological analysis possible; technically challenging.

FAST
  Advantages: Scan analysis of large volume of sample; cell loss minimised; no enrichment needed; quick analysis (up to 300,000 cells/s).
  Disadvantages: Subjective CTC analysis; lack of validation studies in clinical settings.

FISH
  Advantages: Genetic analysis.
  Disadvantages: Further analysis not possible.

Flow Cytometry
  Advantages: High specificity; multiple parameters.
  Disadvantages: Low sensitivity.

FSMW
  Advantages: CE certified; in vivo samples; screening of large blood volume.
  Disadvantages: Subjective analysis; technically challenging.

LSC
  Advantages: Fast; no enrichment needed; visual confirmation of CTCs; high specificity.
  Disadvantages: Subjective analysis; technically challenging; low sensitivity.
2.3 Image Processing
Regarding image processing and feature extraction, there are several focus points to be considered.
First, the selection of the analysis area: when processing images from a cartridge, several of them contain
the edge of the cartridge, which should be removed. To date, one solution has been proposed for the
detection and removal of the sample border, via thresholding of the FITC channel (the fourth channel,
which is not used as a marker), a necessary step to obtain the true imaging area [12].
Second, if the images are retrieved with different machines, under different light conditions, or present
too much noise, there might be a need for image normalization. The Naka-Rushton filter was intro-
duced in the analysis of circulating tumor cells by Svensson et al. [11], for the enhancement of foreground
objects and the suppression of background noise. The use of top-hat background subtraction algorithms
can lead to the presence of negative values and/or the formation of extra contrast; the proper
background subtraction method would thus be to record a black image with no objects present
and subtract it from the images with objects. However, most of the time this black image
is not available [12]. Following edge removal, Svensson et al. [11] also implement a Gaussian blurring
filter for image smoothing.
To locate objects and their outlines, segmentation techniques need to be implemented. These can
be divided into two classes: contour-based and region-based. Contour-based techniques require edge
enhancement steps to find the contours or edges of objects. Region-based techniques comprise texture analysis,
watershedding and intensity thresholding (local or global). Svensson et al. [11] applied the watershed
algorithm to the DNA channel, followed by the use of a random forest to decide whether or not the ROI
should be considered a candidate for further classification. Ligthart et al. [12] implemented several
algorithms for image segmentation in the study of CTCs, such as Zack's triangle threshold via the channel
image histogram, Otsu's threshold and the isodata algorithm (Figures 2.1 and 2.2).
Figure 2.1: Comparison of common thresholding procedures. Two original images containing a small(1A) and large number of objects (1B) were thresholded using three methods: triangle (2A and 2B),otsu (3A and 3B), and isodata (4A and 4B). The three methods give similar results on an image witha large number of objects, but triangle finds the correct number of objects in images which contain asmall number of objects. Image A1 is shown using a logarithmic intensity scale to show the texture inthe background; the left part of the image is part of the cartridge border. [12].
Figure 2.2: Detail of a PE image (1), and masks as thresholded by the triangle (2), otsu (3), and isodata(4) methods. [12].
Finally, in order to analyse each cell, several different features have been studied, such as color
histograms [11] and quantitative [17, 12, 16], correlation [17], texture [17, 12, 16] and morphological
features [17, 12, 16]. Further details regarding these are presented in the column "Features" of Table 2.2.
2.4 Machine Learning and Performance Evaluation
In order for the CTC enumeration system to be automatic, some kind of classification is needed,
which can be achieved by the implementation of machine learning algorithms. Classifiers
fall into the category of supervised learning machines and can be divided into two categories: generative
models and discriminative models. The generative approach focuses mainly on trying to learn the prob-
ability functions behind the problem and classifies a given pattern based on the most probable output
label. A discriminative approach focuses directly on the prediction.
Before stepping into the actual classification of each cell as CTC or non-CTC, Svensson et al. [11]
proposed the implementation of a Random Forest classifier to identify relevant ROIs and, only after this,
to proceed to the classification itself. In this step, the features used were area and perimeter-to-area ratio.
To date, several classification approaches have been presented for the automated classification
of CTCs. The first classification method is not a machine learning implementation; it is based on nu-
meric inclusion (for example: the size is within a certain range of values, the peak intensity in the DAPI-DNA
channel and the standard deviation of the CK-PE channel are bigger than specified thresholds, and the peak in-
tensity of the CD45-APC channel is smaller than a determined constant) [12, 16]. In recent years, more
advanced techniques have been explored. Regarding generative models, both Naive Bayes classifiers
[11] and Random Forests [17] have been successfully implemented. Support Vector Machines [11], a
discriminative method, have also been studied for this problem and performed well.
For performance evaluation, cross-validation is the procedure used by Svensson et al. [11]. When
evaluating the performance of a classification algorithm applied to the identification of CTCs, it should
be taken into consideration that the dataset is highly imbalanced due to the incredibly low number of
CTCs when compared to the number of non-CTCs in a sample, as explored in section 1.2. Therefore,
accuracy might not always be the most informative measurement.
The class imbalance problem has not been addressed before in this context.
2.5 Summary
This chapter reviewed the existing work on biomarkers for the enumeration of CTCs, on image processing
techniques and on classification algorithms used in automated CTC identification. Table 2.2
lists, chronologically, the most relevant studies in the development of an automated system for
the enumeration of Circulating Tumor Cells.
Table 2.2: Performance of different automated CTC enumeration systems. Acronyms: Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Region of Interest (ROI), Castration Resistant Prostate Cancer (CRPC), Apoptotic (Apop.).

Ligthart et al., 2011 and 2013 [12, 16]
  Biomarker/Enrichment: CellSearch; EpCAM, CK-PE, CD45-APC, DAPI-DNA
  Camera: 10x/.45NA
  Features: standard deviation CK-PE; peak DNA-DAPI; peak CD45-APC; size
  Classification technique: numerical inclusion (gating)
  Participants: 100 CRPC
  Results (%): error rate by class

Scholtens et al., 2012 [17]
  Biomarker/Enrichment: CellSearch; EpCAM, CD45-APC, CK-PE, DAPI-DNA
  Camera: TDI 40x/.6NA
  Features: for each channel: area, perimeter, circularity, max. caliper, contrast mean, correlation range, homogeneity mean, entropy mean, total intensity, standard deviation, maximum value; correlation between channels (DAPI/PE, PE/APC, APC/DAPI); total intensity ratio; R2; slope
  Classification technique: Random Forest (5 classes: CTC, Apop. CTC, CTC Debris, Leukocytes, Debris)
  Participants: 31 primary breast or colorectal cancer; 37 metastatic breast or colorectal cancer; 9 healthy
  Results (%): CTC: 10.2; Apop. CTC: 34.1; CTC Deb.: 9.5; Leuk: 4.0; Debris: 10.8; Total: 9.6

Svensson et al., 2014 [11]
  Biomarker/Enrichment: FSMW; EpCAM, CK, CD45, Hoechst (nuclear dye)
  Camera: 10x/.3NA, 20x/.5NA, 40x/1NA
  Features: area; perimeter-to-area; RGB histograms
  Classification technique: Random Forest (ROI identification); SVM (RBF kernel); NBC (unsupervised); NBC (semi-supervised)
  Participants: 617 ROIs
  Results (%), in the order of the classifiers above — ACC: 99, 89, 87, 88; SENS: 51, 87, 85, 85; SPEC: 96, 93, 92, 93
Chapter 3
Image Processing
3.1 Introduction
The goal of this thesis is to build and study a system for automated enumeration of CTCs, using 4-
channel TIFF images acquired with the CellSearch System. In section 3.2, the algorithms for image
normalization, edge detection and image segmentation are described. The following section (section
3.3) concerns the extracted features. Lastly, section 3.5 summarizes the implemented approaches.
A great deal of the code used (and partly adapted) in this chapter was developed and provided
by the Cancer ID team of the University of Twente.
3.2 Image Preprocessing & ROI Identification
3.2.1 Image Normalization
An essential step in order to quantitatively compare objects is image normalization. This was per-
formed in the following way: "all imported 8-bit multipage TIFF images were scaled from 0–255 and
had to be re-scaled to pseudo 12-bit using information stored in the TIFF-header" [12], namely an offset and a
maximum value related to the IMMC/Veridex TIFF scaling, using equation 3.1:

ImageToSegment = Offset + OriginalImage × (MaximumValue − Offset) / max(OriginalImage)    (3.1)

This solution has been proposed, validated and implemented by Ligthart [12] for the type of
datasets used in this thesis.
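To make the rescaling concrete, equation 3.1 can be sketched in a few lines of NumPy. The offset and maximum value are plain arguments here; in practice they would be read from the TIFF header (the numeric values in the example below are illustrative, not taken from an actual cartridge).

```python
import numpy as np

def rescale_to_pseudo_12bit(original, offset, maximum_value):
    """Rescale an 8-bit image to a pseudo 12-bit range following Eq. (3.1).

    `offset` and `maximum_value` would come from the TIFF header
    (IMMC/Veridex scaling); here they are plain arguments.
    """
    original = original.astype(np.float64)
    scale = (maximum_value - offset) / original.max()
    return offset + original * scale

# Example: an 8-bit image mapped into a 12-bit-like range [64, 4095].
img8 = np.array([[0, 128], [64, 255]], dtype=np.uint8)
img12 = rescale_to_pseudo_12bit(img8, offset=64.0, maximum_value=4095.0)
```

Note that the brightest original pixel maps exactly to the header maximum and a zero pixel maps to the offset, which is what makes intensities comparable across cartridges.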
3.2.2 Edge Detection
Each dataset corresponds to one blood sample, therefore one cartridge (one scan), which corre-
sponds 175 images. Some of these images have present the cartridge border. For correct ROI seg-
mentation it was necessary to detected the sample border and exclude the outside area from further
15
analysis. This was accomplished via thresholding in the FITC channel (a debris channel, not used for
pattern extraction), however cartridges have very irregular edges, specially at the corners, making it
necessary to compare the total selected area of the whole cartridge to a training set that was acquired
manually. The algorithm is presented below [16]:
1. All FITC images from one dataset are sub-sampled by a factor of eight (neglecting small details
and avoiding unnecessary memory requirements);
2. Images containing an oriented border are convolved with a line-shaped filter to enhance that
orientation and close the border (gaps and different intensities were common issues);
3. In order to construct an image of the total cartridge, images were connected to each other, Figure
3.1, panel 1;
4. Edge boosting: gradient magnitude filter (using a gaussian derivative with a width of 8 pixels),
Figure 3.1, panel 2;
5. Edge detection: the border only takes up a small part of the image, so it does not show a large
peak in the histogram, therefore the triangle threshold method (detailed in subsection 3.2.3) was
used with the total image histogram for edge detection, Figure 3.1, panel 3;
6. The thresholded mask was inverted and the holes in the image were filled, to obtain the selec-
tion area where cells are located;
7. The result was validated by comparison to the possible area range, between 72 and 92 mm² (see
Figure 3.1, panel 4). If the detected area failed this verification, boundaries were estimated using
results from a fixed set of previously analysed cartridges.
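The steps above can be condensed into a heavily simplified, NumPy-only sketch. This is an illustration, not the pipeline actually used: `np.gradient` stands in for the Gaussian-derivative gradient magnitude filter, a fixed percentile stands in for the triangle threshold, the convolution and hole-filling steps are omitted, and `mm2_per_pixel` is an assumed calibration constant.

```python
import numpy as np

def border_mask(stitched, area_range_mm2=(72.0, 92.0), mm2_per_pixel=1e-3):
    """Sketch of the cartridge-border detection of Sec. 3.2.2 (simplified)."""
    sub = stitched[::8, ::8]                       # step 1: sub-sample by 8
    gy, gx = np.gradient(sub.astype(np.float64))   # step 4: edge boosting
    mag = np.hypot(gx, gy)
    edges = mag > np.percentile(mag, 95)           # step 5: threshold stand-in
    scan_area = ~edges                             # step 6: invert (hole filling omitted)
    area_mm2 = scan_area.sum() * mm2_per_pixel     # step 7: size verification
    ok = area_range_mm2[0] <= area_mm2 <= area_range_mm2[1]
    return scan_area, ok
```

When the size verification fails (`ok` is False), the real pipeline falls back to boundaries estimated from previously analysed cartridges, as described in step 7.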
3.2.3 Image Segmentation
Given that every object present in the images is slightly visible above the background, a
basic histogram-based thresholding algorithm is enough to segment the image. The algorithm chosen
to perform this task was Zack's triangle threshold method.
This geometric method assumes a maximum peak near one end of the histogram of pixel intensities
and searches towards the other end, as presented in Figure 3.2. A region of the image with a higher
intensity than the defined threshold was considered an object of interest. By adjusting the search
threshold until the average brightness of the pixels contiguous to the segmented object was within a
small fixed offset of the average background intensity, one can account for the variations in staining
intensity [24]. In cases where the maximum is not near one of the histogram's extremes, the algorithm
searches for the threshold within the largest range.
The segmentation was performed over the DNA channel.
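The triangle method itself can be written compactly. The sketch below, assuming a 1-D intensity histogram as input, draws a line from the histogram peak to the far end of the larger search range and returns the bin with the greatest perpendicular distance to that line:

```python
import numpy as np

def triangle_threshold(hist):
    """Zack's triangle threshold on an intensity histogram (a sketch)."""
    hist = np.asarray(hist, dtype=np.float64)
    peak = int(hist.argmax())
    nonzero = np.nonzero(hist)[0]
    first, last = nonzero[0], nonzero[-1]
    # Search on the longer side of the peak, as described in the text.
    end = last if (last - peak) >= (peak - first) else first
    bins = np.arange(min(peak, end), max(peak, end) + 1)
    # Perpendicular distance of each histogram point to the peak-to-end line.
    x0, y0, x1, y1 = peak, hist[peak], end, hist[end]
    num = np.abs((y1 - y0) * bins - (x1 - x0) * hist[bins] + x1 * y0 - y1 * x0)
    den = np.hypot(y1 - y0, x1 - x0)
    return int(bins[np.argmax(num / den)])
```

Applied to a histogram with a sharp peak near zero and a long tail (the typical shape of the DNA-channel histogram), the returned bin lands just past the "knee" of the distribution.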
Figure 3.1: ”Example of selection of cartridge scan area. 1: original FITC images of one side of acartridge stitched together after application of linear convolution filter to border images (arrow indicatesan air bubble), 2: border enhanced image by gradient magnitude filtering, 3: Binary image of thresholdedborders (red color), 4: Selected scan area (red color) after inversion of image 3, binary propagation ofcenter square, and size verification.” [12]
Figure 3.2: ”Determination of the global search threshold for each picture. The threshold (THR) wasselected by normalizing the height and dynamic range of the intensity histogram, locating point A asshown, and then adding a fixed offset.” [24]
3.3 Feature Extraction
In CTC analysis, several features have been tested (Table 3.1). Below, the ones this project focuses
on are presented.
3.3.1 Morphological Features
In the field of cell analysis, morphology can reveal important information about the type of cell we
might be dealing with. Some Circulating Tumor Cells might be within a range of sizes or have a specific
shape. Additionally, shape-related features, like eccentricity, can give an insight into whether a ROI
is a cell or not (for example, white blood cells and Circulating Tumor Cells are typically not
rectangular).
The morphological features extracted and analysed in this project were: area, perimeter, eccentricity
and perimeter-to-area ratio (P2A). The latter was computed as follows:

P2A = Perimeter² / (4π · Area)    (3.2)
All the other features were extracted using regionprops, a MATLAB function available in the Image
Processing Toolbox. Regionprops takes as input the segmentation mask, the original
image and the features intended to be extracted.
The area corresponds to the sum of the number of pixels in a certain region, and the perimeter is the
distance between each adjoining pair of pixels around the border of a region. The eccentricity is given
by the ratio of the distance between the foci of the ellipse and its major axis length. The value is between
0 and 1, where 0 corresponds to a circle and 1 to a line segment.
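A NumPy stand-in for this regionprops-based extraction might look as follows. It is a sketch, not the MATLAB implementation used in the thesis: the perimeter is approximated by counting boundary pixels (regionprops uses a distance-based estimate), and the eccentricity is derived from the second central moments of the pixel coordinates.

```python
import numpy as np

def morphological_features(mask):
    """Area, perimeter, eccentricity and P2A for one ROI (Sec. 3.3.1 sketch)."""
    mask = mask.astype(bool)
    area = mask.sum()
    # Boundary pixels: object pixels with at least one 4-neighbour outside.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~interior).sum()
    # Eccentricity of the ellipse with the same second moments.
    ys, xs = np.nonzero(mask)
    cov = np.cov(np.vstack([ys, xs]))
    lmax, lmin = np.linalg.eigvalsh(cov)[::-1]
    eccentricity = np.sqrt(max(1.0 - lmin / lmax, 0.0)) if lmax > 0 else 0.0
    p2a = perimeter ** 2 / (4.0 * np.pi * area)    # Eq. (3.2)
    return area, perimeter, eccentricity, p2a
```

For a perfectly symmetric region such as a filled square, the two moment eigenvalues coincide and the eccentricity is 0, matching the circle-like end of the regionprops scale.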
3.3.2 Intensity Features
The intensity-related features provide a quantitative analysis of the cells. The first obvious outcome
of intensity-related features is information on the representativeness of a certain object. The features
extracted were the mean and maximum intensity of the ROIs, and two others that are not strictly intensity
features: the standard deviation of the intensity signal and the mass of the ROI. These two were considered
intensity features in order to have a more balanced number of features in each test done later in chapter
5. The standard deviation of the intensity signal can also be used as a texture descriptor, as a measure
of average contrast. The mass, defined as the sum of the intensities of all pixels present in a
ROI, is a feature that is also related to the morphology of the cells.
Maximum and mean intensity were extracted using regionprops, a MATLAB function available in
the Image Processing Toolbox. Additionally, the pixel values (the intensity of each pixel in the
ROI) were also extracted using this function, in order to compute both the standard deviation and the
mass.
For each ROI, these four features were obtained for the DNA, CK and CD45 channels.
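As a sketch, the four intensity features for one channel can be computed directly from the ROI's pixel values; the channel names and the tiny example values below are purely illustrative.

```python
import numpy as np

def intensity_features(roi_pixels):
    """The four intensity-related features of Sec. 3.3.2 for one channel."""
    roi_pixels = np.asarray(roi_pixels, dtype=np.float64)
    return {
        "mean": roi_pixels.mean(),
        "max": roi_pixels.max(),
        "std": roi_pixels.std(),   # also usable as an average-contrast texture measure
        "mass": roi_pixels.sum(),  # sum of intensities over the ROI
    }

# One feature set per channel (DNA, CK, CD45); example values are illustrative.
features = {ch: intensity_features(vals)
            for ch, vals in {"DNA": [1, 2, 3], "CK": [4, 4], "CD45": [0, 8]}.items()}
```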
3.3.3 Texture Features
Texture descriptors can also give some deeper insight when analysing an object of interest. The
extracted texture features were the median of the local contrast, the median of the local entropy, the
median of the gradient amplitude and the Histogram of Oriented Gradients (HOG).
The median was computed for the local contrast, local entropy and gradient amplitude because
each ROI had a different size and, for classification, each input vector was required to have the
same size.
The local contrast is the range value in a specified neighborhood around the corresponding pixel in
the input ROI. The range value is determined as the maximum intensity value minus the minimum intensity
value in a 3-by-3 neighborhood.
The local entropy measures the randomness of an image and is computed as follows:

e = −Σ_{i=0}^{L−1} p(z_i) log₂ p(z_i),    (3.3)

where z_i indicates the intensity, p(z) is the histogram of the intensity levels in a region and L is the
number of possible intensity levels [25].
The gradient of an image represents a directional change in intensity. The gradient amplitude
encodes edges and local contrast. Using a Sobel filter, the directional gradients Gx and
Gy are first computed with respect to each of the image axes (x and y). The gradient magnitude and direction
are then computed from these orthogonal components.
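A loop-based, NumPy-only sketch of the three median texture features follows; this is an illustration rather than the MATLAB implementation used in the thesis, and it only visits interior pixels (border handling is omitted for brevity).

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)

def _neighborhoods(img):
    """Yield the 3x3 neighborhood around every interior pixel."""
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            yield img[i - 1:i + 2, j - 1:j + 2]

def texture_medians(roi, levels=256):
    """Median local contrast, local entropy and Sobel gradient magnitude
    for one ROI (Sec. 3.3.3 sketch)."""
    roi = np.asarray(roi, dtype=np.float64)
    contrast, entropy, magnitude = [], [], []
    for nb in _neighborhoods(roi):
        contrast.append(nb.max() - nb.min())        # 3x3 range filter
        p = np.bincount(nb.astype(int).ravel(), minlength=levels) / nb.size
        p = p[p > 0]
        entropy.append(-(p * np.log2(p)).sum())     # Eq. (3.3)
        gx = (SOBEL_X * nb).sum()                   # directional gradients
        gy = (SOBEL_X.T * nb).sum()
        magnitude.append(np.hypot(gx, gy))
    return np.median(contrast), np.median(entropy), np.median(magnitude)
```

On a perfectly uniform ROI all three medians are zero, which is the expected degenerate case: no range, no randomness, no edges.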
Lastly, for each ROI a 10-bin Histogram of Oriented Gradients (HOG) was extracted. The idea be-
hind the HOG feature descriptor is that an object's appearance and shape can be described by its distribution
of intensity gradients or edge directions, and it presents a certain degree of invariance to transformations
or rotations. To compute the HOG, an image is divided into small connected regions (cells); a
histogram of gradient directions is then obtained for the pixels within each cell. The descriptor is the
result of the concatenation of these histograms [26].
Dalal and Triggs [26] presented the following steps for the computation of the HOG. The
first step is the computation of the gradient values by applying 1-D centered derivative masks in both the
horizontal ([−1, 0, 1]) and vertical ([−1, 0, 1]ᵀ) directions. This step is followed by the creation of cell his-
tograms: based on the values obtained in step one, each pixel within a cell contributes to an orientation-
based histogram with a weighted vote based on the gradient magnitude. The cells have a square shape
and the histogram values range from 0 to 180 degrees. The final step is the construction of descriptor
blocks: cells are grouped together into larger blocks and gradient strengths are normalized locally,
thereby accounting for changes in illumination and contrast. The HOG descriptor is the concate-
nated vector resulting from the normalized cell histograms.
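The Dalal-Triggs steps above can be condensed into a minimal NumPy sketch. Block normalization is simplified here to a single L2 normalization over the concatenated cell histograms, so this is an illustration of the idea rather than a full HOG implementation; the cell size is an assumed parameter.

```python
import numpy as np

def hog_10bins(roi, cell=8):
    """Minimal 10-bin HOG sketch: centred [-1, 0, 1] derivative masks,
    unsigned orientations (0-180 degrees), magnitude-weighted votes per
    cell, and one global L2 normalization standing in for block normalization."""
    roi = np.asarray(roi, dtype=np.float64)
    gx = np.zeros_like(roi); gy = np.zeros_like(roi)
    gx[:, 1:-1] = roi[:, 2:] - roi[:, :-2]        # horizontal mask [-1, 0, 1]
    gy[1:-1, :] = roi[2:, :] - roi[:-2, :]        # vertical mask [-1, 0, 1]^T
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hists = []
    for i in range(0, roi.shape[0] - cell + 1, cell):
        for j in range(0, roi.shape[1] - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hists.append(np.bincount((a / 18.0).astype(int) % 10,
                                     weights=m, minlength=10))
    h = np.concatenate(hists)
    norm = np.linalg.norm(h)
    return h / norm if norm > 0 else h
```

A vertical step edge, for instance, produces purely horizontal gradients, so all the descriptor energy falls into the 0-degree bin.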
3.4 Post-Processing
In order to remove samples that might not be relevant for this system, ROIs were visually inspected,
and it was decided that every ROI with an area smaller than 9 pixels or larger than 3000 pixels would be
excluded from further analysis.
In addition, after the data was separated into train and test sets, the features were normalized so
that each feature had zero mean and unit standard deviation. This step was necessary for classifiers
based on distances, such as the k-NN, to perform correctly.
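The normalization step can be sketched as follows; the key point is that the mean and standard deviation are estimated on the training set only and then applied unchanged to the test set, so no information leaks from test to train.

```python
import numpy as np

def zscore_train_test(train, test, eps=1e-12):
    """Zero-mean / unit-variance feature scaling using training statistics only."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + eps   # eps guards against constant features
    return (train - mu) / sd, (test - mu) / sd
```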
3.5 Conclusion
This chapter summarized the image processing approaches implemented in this thesis. Section 3.2
described the normalization, edge detection and image segmentation algorithms. The following section
highlighted the extraction of features. Table 3.1 summarizes the extracted features.
Table 3.1: Summary of the extracted features (P2A - Perimeter to Area Ratio, Max. - Maximum, ch. - channel, HOG - Histogram of Oriented Gradients)

Morphological: Area; Eccentricity; Perimeter; P2A
Intensity: Mean Intensity (DNA, CK, CD45 ch.); Max. Intensity (DNA, CK, CD45 ch.); Standard Deviation Int. (DNA, CK, CD45 ch.); Mass (DNA, CK, CD45 ch.)
Texture: Median of Local Entropy (DNA, CK, CD45 ch.); Median of Local Contrast (DNA, CK, CD45 ch.); Median of Gradient Amplitude (DNA, CK, CD45 ch.); HOG (DNA, CK, CD45 ch.)
Chapter 4

Classification and Performance Evaluation
4.1 Introduction
In automated pattern recognition the system has to learn a model from the training instances and be
capable of classifying future unseen data based on the previously formed model. Given the problem at
hand, several approaches were analysed, namely, k -Nearest Neighbor, Support Vector Machines and
Boosting.
k-Nearest Neighbor (k-NN), despite its simplicity, has been successful in a large number of classification problems, and it was implemented for that reason. Notwithstanding, when dealing with imbalanced datasets, techniques that tackle this issue may have to be applied; in this project, both bootstrapping and the introduction of prior probabilities for each class were explored. Section 4.2 describes these techniques as well as the k-NN algorithm. The following section provides a comprehensive description of the concepts and mathematics behind Support Vector Machines, used in this thesis both because they are a popular discriminative method for classification and because they were successfully implemented before in the context of CTC enumeration. Boosting algorithms (AdaBoost and RUSBoost), which perform classification based on an ensemble of multiple simple classifiers, called "weak" classifiers, are also studied in this project due to their simplicity and high performance, and are addressed in section 4.4.
Evaluating classification performance is a necessary step in analysing the viability of this system, as well as in understanding which classifier is the most adequate for automated CTC enumeration. As such, in section 4.5, the technique known as Cross-Validation (CV) is used to estimate, in an unbiased fashion, performance measures such as balanced accuracy, sensitivity, specificity and Receiver-Operator Characteristic curves. Finally, section 4.6 concludes this chapter.
4.2 k-Nearest Neighbors
4.2.1 Basic Concepts
In 1951, Fix and Hodges presented the Nearest Neighbor (NN) Rule [27]. The NN decision rule is
quite intuitive: an unclassified data point is attributed the classification of the nearest point in the set of
previously classified points.
Later, in 1967, Cover and Hart [28] proved that when different sample classes do not overlap in the input space, the NN rule is asymptotically optimal, since the optimal Bayes probability of error is then equal to zero. Stone then introduced the k-nearest neighbor rule (k-NN), overcoming the sub-optimality of the NN rule in the common situation where classes do overlap: a new point is classified as the class most frequent amongst its k nearest neighbors [29]. The k-Nearest Neighbor algorithm is an instance-based learning method; it does not construct a general internal model, it simply stores the instances of the training data. When a new point is presented to the k-NN, it finds a predefined number of training samples closest to this new data point and predicts its label by majority voting, i.e., the query point is assigned to the class with the most representatives within the nearest neighbors.
Real-life datasets have a finite number of training samples and, most commonly, different classes do overlap. Therefore, it is essential that the distance metric used is suitable for the problem at hand. Several tuning parameters can be used to improve the k-NN's performance. In this project, different numbers of neighbors were analysed and, to tackle the class imbalance problem, bootstrapping and the use of prior probabilities were tested.
Throughout this work, the k-NN implementation used is the fitcknn function, available in the Statistics and Machine Learning Toolbox of MATLAB R2014a.
4.2.2 Mathematics
The theory behind NN is quite intuitive: consider an input space R^d, where d is the dimension of that space; patterns that are nearby in the input space most likely belong to the same class. Given a set of examples D_n = (x_1, y_1), ..., (x_n, y_n), where x_i ∈ R^d represents the input vectors and y_i the class labels, the NN rule classifies a query pattern X to the class of its nearest neighbour in the training data D_n. Without any prior knowledge, the Euclidean distance is typically used, which is defined as follows:

d(\mathbf{X}, \mathbf{x}_q) = \left( \sum_{k=1}^{d} \left| X_k - x_{q,k} \right|^2 \right)^{\frac{1}{2}}, \qquad (4.1)
where xq is a new unclassified pattern. When prior probabilities are added, a weight is assigned to
each class when computing the Euclidean distance.
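A minimal sketch of the unweighted k-NN rule under the Euclidean distance of equation 4.1 (Python for illustration; the thesis uses MATLAB's fitcknn, and the toy coordinates and labels here are made up):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    samples under the Euclidean distance."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D feature vectors: one cluster per class
train_x = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
train_y = ['non-CTC', 'non-CTC', 'CTC', 'CTC']
label = knn_predict(train_x, train_y, (4.8, 5.2), k=3)
```

Introducing class priors, as mentioned above, would amount to weighting these distances per class before voting.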
4.2.3 Bootstrap
Bootstrap, presented by Efron in 1979 [30], is a data-resampling strategy with several applications. In this project, it was used as follows: the bootstrap resamples, with replacement, our dataset into smaller new datasets with a more balanced ratio of CTC vs. non-CTC. For each new dataset a k-NN model is generated, which can be used to predict the class of a new pattern. In the end, the class of a new pattern is the mode of the predicted classes given by the individual k-NN models.
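The bootstrap-ensemble idea can be sketched as follows; the class-balanced resampling scheme and all parameter values below are assumptions made for illustration, not the exact procedure used in the thesis:

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=1):
    """1-NN/k-NN majority vote over (x, y) pairs."""
    dists = sorted((math.dist(x, query), y) for x, y in train)
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def balanced_bootstrap_knn(data, query, n_models=5, per_class=2, k=1, seed=0):
    """Resample, with replacement, the same number of instances from each
    class (a more balanced CTC vs. non-CTC ratio), fit one k-NN model per
    resample, and return the mode of the per-model predictions."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    preds = []
    for _ in range(n_models):
        resample = [rng.choice(members)
                    for members in by_class.values()
                    for _ in range(per_class)]
        preds.append(knn_predict(resample, query, k))
    return Counter(preds).most_common(1)[0][0]     # mode of the predictions

data = [((0.0, 0.0), 'non-CTC'), ((0.2, 0.1), 'non-CTC'), ((0.1, 0.3), 'non-CTC'),
        ((5.0, 5.0), 'CTC'), ((5.2, 4.9), 'CTC')]
pred = balanced_bootstrap_knn(data, (5.1, 5.1))
```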
4.3 Support Vector Machines
4.3.1 Basic concepts
In the early 90s, Boser, Guyon and Vapnik published the first paper presenting Support Vector Machines [31], a generalization to nonlinear models of the Generalized Portrait algorithm [32]. In 1995, Cortes and Vapnik introduced a notion vital for non-separable cases, the soft margin [33]. In 1998, Shawe-Taylor et al. [34] and Bartlett (2006) [35] proposed a rigorous bound on the generalization ability of the hard margin SVM and, in 2000, Shawe-Taylor et al. [36] presented the same bound for the soft margin case.
Assume a binary classification of linearly separable data, as depicted in Figure 4.1. The SVM al-
gorithm tries to find the hyperplane that maximizes the distance to the closest training vectors of each
class, the support vectors.
Figure 4.1: Example of a 2D linearly separable binary problem, where patterns of one class are represented by diamonds and the other by circles. The optimal separating hyperplane (solid line) maximizes the distance between the support vectors of each class (darker data points, limiting the margins). Adapted from [33].
The use of hard margin SVMs does not allow the fitted model to have any errors, which might lead to poor generalization ability. To overcome this issue two approaches are proposed: the use of kernels and the use of soft margins. The use of kernels supports the separation of non-linearly separable data through the mapping of the input space into a higher dimensional space, called the feature space (figure 4.2(a)). The latter extension of SVMs, the concept of soft margins, allows errors on the training instances while trying to minimize them, by relaxing the separability constraint (figure 4.2(b)). When data is not easily separable, either a highly complex kernel is applied, which might lead to overfitting and therefore poor generalization, or the concept of soft margins is applied.
(a) Kernel (b) Soft-margin

Figure 4.2: Examples of the use of the two adaptations of the SVM. 4.2(a) is an illustrative example of the use of a kernel (φ(x_1, x_2) = x_1^2 + x_2^2) when the data is separable, although no hyperplane in the input space is able to separate it; therefore, the data was mapped into a feature space, where the decision surface was computed. 4.2(b) presents a situation where the data is not linearly separable in the input space. Two options are available: either the decision surface is the one presented by the dotted line, which may lead to overfitting and poor generalization, or one allows for errors to be committed (the soft-margin concept) and the dashed line is used as the separating hyperplane. Adapted from [37].
4.3.2 Mathematics
In this section, the implementation of both hard and soft margin SVMs will be presented mathematically, as well as the proposed solution for the unbalanced dataset problem. First, SVMs with hard margins will be described, followed by the use of kernels. Later, the soft margin concept and, finally, the unbalanced dataset problem will be dealt with [31, 33].
Consider, again, a linearly separable dataset. In this problem, the training set D has K instances, where x_k represents one instance, an N-dimensional feature vector. Each instance belongs to one of two classes:

D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_K, y_K)\}, \quad \mathbf{x}_k \in \mathbb{R}^N, \; y_k \in \{-1, 1\}. \qquad (4.2)
The separating hyperplane, which correctly classifies all the training instances, is given by the following decision function:

f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b), \qquad (4.3)
where b is a constant and w is the normal vector that parametrizes the hyperplane. The requirement that all training instances are correctly classified can be written as:

y_k (\mathbf{w} \cdot \mathbf{x}_k + b) \geq 1, \quad \forall k. \qquad (4.4)

Notice that the constant on the right side of the inequality above can be any strictly positive number, by virtue of the fact that any hyperplane defined by (w, b) may also be represented by any positively scaled pair (λw, λb), with λ ∈ R+. In addition, by changing the scale factor λ, any separating hyperplane can be represented in a way that equation 4.4 is met with equality for the nearest training sample(s).
The distance between the hyperplane and the nearest vector is then given by:

d((\mathbf{w}, b), \mathbf{x}_k) = \frac{y_k (\mathbf{w} \cdot \mathbf{x}_k + b)}{\|\mathbf{w}\|} \qquad (4.5)
= \frac{1}{\|\mathbf{w}\|}. \qquad (4.6)
In order to select the optimal hyperplane from the infinite set of separating ones, one must maximize its margin. Ergo, it is possible to conclude that the best hyperplane can be computed by minimizing ‖w‖. Taking into consideration the constraints presented in equation 4.4, one can write the following quadratic problem:
\min_{\mathbf{w}, b} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} \quad \text{subject to} \quad y_k (\mathbf{w} \cdot \mathbf{x}_k + b) \geq 1 \;\; \forall k. \qquad (4.7)
To solve the problem presented in equation 4.7, consider the Lagrangian dual formulation. The Lagrangian function is:

L(\mathbf{w}, b, \Lambda) = \frac{1}{2} \mathbf{w}^T \mathbf{w} - \sum_{k=1}^{K} \alpha_k \left[ y_k \left( \mathbf{w}^T \mathbf{x}_k + b \right) - 1 \right], \qquad (4.8)
where Λ = (α_1, ..., α_K) is the vector of non-negative Lagrangian multipliers associated with the constraints presented in equation 4.4. Next, the quantity that the dual problem maximizes (the infimum of L(w, b, Λ) with respect to w and b) can be obtained by setting ∇_{w,b} L(w, b, Λ) = 0 and applying the results to equation 4.8. The conditions imposed result in:
\mathbf{w} = \sum_{k=1}^{K} \alpha_k y_k \mathbf{x}_k \qquad (4.9)

and

\sum_{k=1}^{K} \alpha_k y_k = 0, \qquad (4.10)
which, after some manipulation, leads to:

\inf_{\mathbf{w}, b} \{ L(\mathbf{w}, b, \Lambda) \} = \sum_{k=1}^{K} \alpha_k - \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_k \alpha_l y_k y_l \left( \mathbf{x}_k^T \mathbf{x}_l \right). \qquad (4.11)
Finally, solving the dual maximization problem for the Lagrangian coefficients, one can obtain the desired solution. Using vector notation we have the following:

\max_{\Lambda} \; \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda \quad \text{subject to} \quad \Lambda \geq 0, \;\; \Lambda^T \mathbf{y} = 0, \qquad (4.12)

where 1 = (1, ..., 1) is a K-dimensional vector of ones, D is a symmetric matrix with elements D_{kl} = y_k y_l (x_k^T x_l) and y = (y_1, ..., y_K) is the vector of labels, for k and l ∈ {1, ..., K}. This problem is still quadratic; however, it scales with the number of training instances, as opposed to before, where the problem scaled with the number of dimensions of the feature space.
From the complementary slackness condition of the Karush-Kuhn-Tucker theorem, it is possible to conclude that, when a solution of 4.12 is found, one of two cases holds for each instance x_k: either the associated Lagrangian multiplier α_k is zero, or the corresponding constraint is active, in which case x_k is a support vector. Therefore, the optimal hyperplane can be computed, using equation 4.9, as a linear combination of the support vectors. Finally, the bias b can be obtained from the constraints 4.4.
Now, we will explore the concept and implementation of kernels [31, 33]. Kernels allow a non-linearly separable dataset to be separated by a non-linear surface (Figure 4.2(a)). This is achieved by mapping (z = φ(x)) the original N-dimensional input space into a new, higher-dimensional feature space in which the hyperplane will try to separate the transformed data {(φ(x_k), y_k)}. For some types of kernels, the dimension of the feature space can be infinite, leading to a computational problem. Nevertheless, one can avoid the need of explicitly calculating the mapping of the input vectors by defining the inner product in the feature space. When one replaces all the x by φ(x), each φ(x_k) appears only in a dot product with some φ(x_l), making it possible to compute only the inner product in the feature space, as stated before. The matrix D in 4.12 becomes D_{kl} = y_k y_l (φ(x_k) · φ(x_l)); additionally, if we write w as a linear combination of the support vectors in the feature space, we obtain w = Σ_{k=1}^{K} α_k y_k φ(x_k). Finally, by replacing it in equation 4.3, the decision function becomes:
f(\mathbf{x}) = \operatorname{sign} \left( \sum_{k=1}^{K} \alpha_k y_k \left( \phi(\mathbf{x}_k) \cdot \phi(\mathbf{x}) \right) + b \right). \qquad (4.13)
The kernel function provides the inner product:
K (xk,xl) = φ (xk) · φ (xl) . (4.14)
The next step is the choice of a kernel. In low dimensional spaces one might be able to create, by inspection, a function that would separate the training set in the feature space. However, this task becomes harder as the dimensionality of the problem increases, and several kernels have proven successful in many classification problems. In this project, two were tested, one being the linear kernel (i.e., the plain dot product), and the other the RBF (Gaussian Radial Basis Function) kernel, defined as:
K(\mathbf{x}_k, \mathbf{x}_l) = e^{-\gamma \| \mathbf{x}_k - \mathbf{x}_l \|^2}. \qquad (4.15)
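The two kernels tested can be sketched directly from their definitions, the plain dot product and equation 4.15 (Python for illustration; γ = 0.5 is an arbitrary illustrative value):

```python
import math

def linear_kernel(xk, xl):
    """Linear kernel: the plain dot product."""
    return sum(a * b for a, b in zip(xk, xl))

def rbf_kernel(xk, xl, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||xk - xl||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xk, xl))
    return math.exp(-gamma * sq_dist)

x1, x2 = (1.0, 0.0), (0.0, 1.0)
```

Note that the RBF kernel of any point with itself is 1, and it decays towards 0 as the two points move apart, at a rate controlled by γ.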
After exploiting the concept of kernels, consider the case where the data is not linearly separable even in the feature space, which brings the need of detailing the implementation of the soft-margin concept (figure 4.2(b)). In this case, the dual problem presented earlier becomes unbounded and no solution can be found, so a small number of errors must be accepted in the training phase. In order to solve this, one must relax the constraints through the introduction of a positive slack variable ξ_k in each constraint [33], obtaining:
yk (w · xk + b) ≥ 1− ξk, ∀k. (4.16)
To minimize the errors introduced by the slack variables, they are weighted in the cost function, leading to the new primal problem:
\min_{\mathbf{w}, b} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{k=1}^{K} \xi_k \quad \text{subject to} \quad y_k (\mathbf{w} \cdot \mathbf{x}_k + b) \geq 1 - \xi_k, \;\; \xi_k \geq 0 \;\; \forall k, \qquad (4.17)
where C is a tuning parameter that regulates the misclassification cost. Following a reasoning similar to the one used for the separable case, one can solve this convex quadratic problem. Lastly, we obtain the following dual problem:
\max_{\Lambda} \; \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda \quad \text{subject to} \quad 0 \leq \Lambda \leq C, \;\; \Lambda^T \mathbf{y} = 0. \qquad (4.18)
The last problem to take into consideration is the fact that we are dealing with an unbalanced dataset. To overcome this adversity, the use of different penalties for each class has been proposed [38, 39]. Gathering the two extensions above (the use of kernels and soft margins) along with this solution, we obtain the following SVM formulation for a binary classification problem:
\min_{\mathbf{w}, b, \xi} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C^{+} \sum_{y_k = +1} \xi_k + C^{-} \sum_{y_k = -1} \xi_k \quad \text{subject to} \quad y_k \left( \mathbf{w}^T \phi(\mathbf{x}_k) + b \right) \geq 1 - \xi_k, \;\; \xi_k \geq 0 \;\; \forall k, \qquad (4.19)
where C+ = w+ × C and C− = w− × C, with w+ and w− being the weights associated with the positive and negative classes, respectively.
In this thesis, the dual problem of the soft margin SVM algorithm with an unbalanced dataset was solved numerically using LIBSVM, publicly available software (http://www.csie.ntu.edu.tw/~cjlin/libsvm) developed by Chang and Lin [40].
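As an illustration of the weighted soft-margin objective of equation 4.19 in the linear-kernel case, the following sketch minimizes the primal by plain subgradient descent (the thesis instead solves the dual with LIBSVM; the learning rate, epoch count and toy data are assumptions made for the example):

```python
def train_weighted_linear_svm(data, c_pos, c_neg, lr=0.01, epochs=200):
    """Subgradient descent on 1/2 ||w||^2 + C+ * hinge(+) + C- * hinge(-),
    where the slack xi_k equals the hinge loss max(0, 1 - y_k (w.x_k + b))."""
    d = len(data[0][0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            c = c_pos if y == 1 else c_neg      # class-dependent penalty
            if margin < 1:                      # hinge active for this sample
                w = [wi - lr * (wi - c * y * xi) for wi, xi in zip(w, x)]
                b = b + lr * c * y
            else:                               # only the regularizer acts
                w = [wi - lr * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# two separable toy clusters
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 2.0), 1), ((2.0, 3.0), 1)]
w, b = train_weighted_linear_svm(data, c_pos=1.0, c_neg=1.0)
```

Increasing `c_pos` relative to `c_neg` makes errors on the minority (CTC) class more costly, which is the role of the w+ and w− weights above.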
4.4 Boosting
Boosting is a general method for improving the performance of any learning algorithm. In 1984, Valiant proposed probably approximately correct (PAC) learning, a theoretical framework for studying machine learning [41], which provided the background for boosting. Later, Kearns and Valiant [42, 43] questioned whether a "weak" learning algorithm, one that under normal circumstances would be just marginally better than random classification, could be boosted into an accurate "strong" algorithm. In 1989, Schapire [44] presented the first provable polynomial-time boosting algorithm. Freund [45] then developed a more efficient algorithm, which still had some drawbacks.
4.4.1 AdaBoost
In 1997, Freund and Schapire [46] presented AdaBoost, a boosting algorithm that dealt with the difficulties of the previously presented boosting algorithms. The AdaBoost classifier starts by fitting a classifier on the dataset; it then fits copies of the classifier on the same data, but with the weights of incorrectly classified instances adjusted, so that in the following iterations the classifiers can focus on the harder cases [47, 48].
Consider a training set (x_1, y_1), ..., (x_k, y_k), where x_i represents a pattern in the input space X and y_i is the corresponding label. For the sake of simplicity, and given that we are dealing with binary classification in this project, let us assume y_i ∈ Y = {−1, +1}. The algorithm takes the training set as input and calls a "weak" learner repeatedly in a series of rounds t = 1, ..., T. In the beginning all weights are set equally, and at each round they are updated in such a way that the weak learner is forced to focus on the hard examples of the training set, i.e., incorrectly classified examples have their weight increased. The weight on training pattern i at round t is D_t(i). The weak learner then has to find a weak hypothesis h_t : X → {−1, +1} adequate for the distribution D_t. The quality of the hypothesis is measured by its error with respect to the distribution D_t on which the weak learner was trained, given by:
\varepsilon_t = \Pr_{i \sim D_t} \left[ h_t(x_i) \neq y_i \right] = \sum_{i : h_t(x_i) \neq y_i} D_t(i). \qquad (4.20)
After the weak hypothesis has been found, a parameter α_t, which measures the importance of h_t, is computed according to the equation presented in Figure 4.3. This step is followed by an update of the weight distribution so that the classifier focuses on hard examples, meaning the weight of examples misclassified by h_t increases and the weight of correctly classified examples decreases. In the end, a weighted majority vote of the T weak hypotheses (with α_t the weight assigned to hypothesis h_t) gives us the final hypothesis H.
Figure 4.3 AdaBoost [47].

1: Given (x_1, y_1), ..., (x_k, y_k), where x_i ∈ X, y_i ∈ Y = {−1, +1}
2: Initialize D_1(i) = 1/k
3: for t = 1, ..., T do
4:   Train weak learner using distribution D_t
5:   Get weak hypothesis h_t : X → {−1, +1} with error ε_t (equation 4.20)
6:   Choose α_t = (1/2) ln((1 − ε_t)/ε_t)
7:   Update:
8:   D_{t+1}(i) = (D_t(i)/Z_t) × { e^{−α_t} if h_t(x_i) = y_i ; e^{α_t} if h_t(x_i) ≠ y_i }
9:             = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t,
10:  where Z_t is a normalization factor, chosen so that D_{t+1} will be a distribution.
11: Output the final hypothesis:
12: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).
In this project, the implementation of AdaBoost.M1 used is the Ensemble Learning framework of MATLAB R2014a, present in the Statistics and Machine Learning Toolbox.
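The AdaBoost loop of Figure 4.3 can be sketched with 1-D decision stumps as the weak learners (a Python illustration; the thesis uses MATLAB's ensemble framework, and the stump learner and toy data are assumptions made for the example):

```python
import math

def best_stump(xs, ys, weights):
    """Pick the 1-D threshold/polarity stump with minimum weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= thr else -polarity for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ys, T=5):
    """Reweight the training set each round so the next weak learner
    focuses on the examples misclassified so far (Figure 4.3)."""
    k = len(xs)
    D = [1.0 / k] * k                       # uniform initial distribution
    ensemble = []
    for _ in range(T):
        err, thr, pol = best_stump(xs, ys, D)
        err = max(err, 1e-10)               # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # multiplicative update, then renormalize so D stays a distribution
        D = [w * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for w, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [w / Z for w in D]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the T weak hypotheses."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys, T=3)
```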
4.4.2 RUSBoost
Traditional machine learning algorithms tend to favour classifying patterns into the majority class when one of the classes highly outnumbers the other. One technique to overcome this problem is data sampling, an approach that balances the class distribution of the training set by either undersampling (removing samples from the overrepresented class) or oversampling (adding examples to the minority class) until the desired balance is achieved. These techniques can be as simple as random selection or more advanced. Undersampling has the disadvantage of leading to loss of information, due to the deletion of examples [49]. On the other hand, oversampling can lead to overfitting and increases the model training time [50]. Proposed in 2010 by Seiffert et al. [51], RUSBoost has its roots in the SMOTEBoost algorithm (which is based on the AdaBoost algorithm, detailed in section 4.4.1). Both algorithms add a data sampling technique to the AdaBoost algorithm: SMOTEBoost uses an oversampling technique that creates new minority class samples by interpolating between the existing ones, a method called the synthetic minority oversampling technique (SMOTE) [52]. The RUSBoost algorithm uses an approach that has proved to be simple, fast and with good performance [53]. RUS (Random UnderSampling) simply removes, at random, examples of the majority class until the desired distribution of classes is achieved. The full RUSBoost algorithm is explained in the form of pseudo-code in Figure 4.4.
Figure 4.4 RUSBoost [51, 54].

1: Given: set S of examples (x_1, y_1), ..., (x_k, y_k) with minority class y_r ∈ Y, |Y| = 2
2: Weak learner, WeakLearn
3: Number of iterations, T
4: Desired percentage of total instances to be represented by the minority class, M
5: Initialize D_1(i) = 1/k for all i.
6: for t = 1, ..., T do
7:   Create a temporary training dataset S'_t with distribution D'_t using random undersampling
8:   Call WeakLearn, providing it with examples S'_t and their weights D'_t
9:   Get back a hypothesis h_t : X × Y → {0, 1}
10:  Compute the pseudo-loss (for S and D_t):
11:  ε_t = Σ_{(i,y): y_i ≠ y} D_t(i) (1 − h_t(x_i, y_i) + h_t(x_i, y)).
12:  Calculate the weight update parameter:
13:  α_t = ε_t / (1 − ε_t)
14:  Update D_t:
15:  D_{t+1}(i) = D_t(i) α_t^{(1/2)(1 + h_t(x_i, y_i) − h_t(x_i, y : y ≠ y_i))}.
16:  Normalize D_{t+1}: let Z_t = Σ_i D_{t+1}(i)
17:  D_{t+1}(i) = D_{t+1}(i) / Z_t.
18: Output the final hypothesis:
19: H(x) = argmax_{y ∈ Y} Σ_{t=1}^{T} h_t(x, y) log(1/α_t).
In this project, the implementation of RUSBoost used is the Ensemble Learning framework of MATLAB R2014a, present in the Statistics and Machine Learning Toolbox.
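The RUS step itself can be sketched as follows; the target class fraction and the labels are illustrative assumptions:

```python
import random
from collections import Counter

def random_undersample(data, minority_frac=0.5, seed=0):
    """RUS: randomly delete majority-class examples until the minority
    class makes up `minority_frac` of the temporary training set."""
    rng = random.Random(seed)
    counts = Counter(y for _, y in data)
    minority = min(counts, key=counts.get)
    minority_set = [d for d in data if d[1] == minority]
    majority_set = [d for d in data if d[1] != minority]
    # keep just enough majority samples for the desired class ratio
    n_majority = round(len(minority_set) * (1 - minority_frac) / minority_frac)
    rng.shuffle(majority_set)
    return minority_set + majority_set[:n_majority]

# 90 non-CTCs vs. 10 CTCs, undersampled to a 50/50 split
data = ([((i,), 'non-CTC') for i in range(90)]
        + [((i,), 'CTC') for i in range(10)])
balanced = random_undersample(data)
```

Within RUSBoost this resampling is repeated at every boosting round, so different majority examples are discarded each time and the information loss of plain undersampling is mitigated.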
4.5 Performance Evaluation
When developing an automated classification system, there is a need to assess the performance of the proposed classifiers. However, different performance metrics carry different meanings and trade-offs, and one classifier can be optimal in one metric and suboptimal in another. The most commonly used metric is accuracy; however, in this project plain accuracy can be highly uninformative, since as much as 99% of the dataset can be composed of non-CTCs. More adequate, and also widely used, metrics are balanced accuracy, sensitivity and specificity. Nonetheless, these require us to decide at which point of the Receiver-Operator Characteristic (ROC) curve we want to position ourselves in order to consider the classifier good. Therefore, the main performance evaluation metrics used were the ROC curve and the Area Under the Curve (AUC).
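Sensitivity, specificity and balanced accuracy follow directly from the confusion-matrix counts. A sketch, where the all-non-CTC toy example illustrates why plain accuracy is uninformative here:

```python
def classification_metrics(y_true, y_pred, positive='CTC'):
    """Sensitivity (TPR), specificity (TNR) and balanced accuracy, which
    averages the two and is therefore robust to class imbalance."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2

# 1 CTC among 9 non-CTCs: predicting everything as non-CTC gives 90%
# plain accuracy but only 50% balanced accuracy.
y_true = ['CTC'] + ['non-CTC'] * 9
y_pred = ['non-CTC'] * 10
sens, spec, bal_acc = classification_metrics(y_true, y_pred)
```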
4.5.1 Nested Cross Validation
Nested Cross Validation is not a performance metric; rather, it is a tool used to evaluate the performance of supervised learning algorithms. If the performance of an algorithm is analysed on the same dataset that was used to train the classifier, the result can be optimistically biased. To evaluate the generalization ability of a model in an unbiased fashion, one must have a test or validation set that was never used in the learning phase. However, if the number of samples in the input dataset is small, it is not advisable to leave data out of the training set. The Cross Validation (CV) procedure prevents this problem by randomly partitioning the full dataset into k disjoint sets. One of the k sets is chosen as the validation set, and the remaining k − 1 are used to train the model. The process is repeated k times, until every fold has been used as the validation set. Another task one might have to deal with is tuning a classifier by choosing its parameters, such as the adequate number of neighbors in the k-NN algorithm, the γ of the RBF kernel, the C associated with the soft margins of the SVM, or the number of weak learners in the ensemble methods. All of these might affect the final classification performance. Varma and Simon [55] proposed the Nested Cross Validation method, which not only circumvents the same problem as CV, but also tackles the problem of parameter tuning. For the Nested Cross Validation algorithm in pseudo-code, please refer to Figure 4.5.
Figure 4.5 Nested Cross-Validation [56].

1: Split the set D of K available samples into k disjoint sets D_i, i = 1, ..., k of size K/k /* outer cross-validation */
2: for i = 1 to k do
3:   D := D \ D_i, K := |D|
4:   for each parameter set p do /* parameter selection */
5:     Split the set D of K available samples into k disjoint sets D_j, j = 1, ..., k of size K/k /* inner cross-validation */
6:     for j = 1 to k do
7:       Train the classifier on the training set D_t = D \ D_j
8:       Compute test error e_j on the parameter test set D_j
9:     Compute inner CV test error
10:  Select parameter set p with minimum error
11:  Train classifier with selected parameter set on D_t = D \ D_i
12:  Compute test error on the test set D_i
13: Calculate outer CV test error.
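The pseudo-code of Figure 4.5 can be sketched as follows; the threshold "classifier" used in the demonstration is a deliberately trivial stand-in for the real learners:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint folds (round-robin)."""
    return [list(range(i, n, k)) for i in range(k)]

def nested_cv(data, params, train_fn, error_fn, k_outer=3, k_inner=3):
    """Nested CV: the inner loop selects the parameter set with minimum
    inner-CV error; the outer loop estimates the error on folds never
    touched during parameter selection."""
    outer_errors = []
    for test_idx in k_fold_indices(len(data), k_outer):
        held_out = set(test_idx)
        dev = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]

        def inner_error(p):
            errs = []
            for val_idx in k_fold_indices(len(dev), k_inner):
                val_set = set(val_idx)
                train = [d for i, d in enumerate(dev) if i not in val_set]
                errs.append(error_fn(train_fn(train, p),
                                     [dev[i] for i in val_idx]))
            return sum(errs) / len(errs)

        best_p = min(params, key=inner_error)      # inner model selection
        outer_errors.append(error_fn(train_fn(dev, best_p), test))
    return sum(outer_errors) / len(outer_errors)

# Demonstration on 1-D data: the tuned "parameter" is simply a threshold.
data = [(float(i), i >= 6) for i in range(12)]
err = nested_cv(
    data,
    params=[2.0, 6.0, 10.0],
    train_fn=lambda train, p: p,
    error_fn=lambda thr, rows: sum((x >= thr) != y for x, y in rows) / len(rows),
)
```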
4.5.2 Receiver-Operator Characteristic
In order to evaluate the performance of the classifiers described above, Receiver Operator Characteristic graphs will be used. These have been used in signal detection theory [57], diagnostic systems [58] and medicine [59]. The area under a ROC curve has a baseline rate that is independent of the data, while for some other metrics the baseline is data dependent [60]. Fawcett and Provost [61] did a thorough study on the use of ROC curves for the evaluation of classifiers.
The true positive rate (TPR) is defined as:

TPR = p(Y \mid +) \approx \frac{\text{positives correctly classified}}{\text{total positives}} \qquad (4.21)

and the false positive rate (FPR) as follows:

FPR = p(Y \mid -) \approx \frac{\text{negatives incorrectly classified}}{\text{total negatives}}, \qquad (4.22)
where + and − are the positive and negative instance classes, respectively, and p(+ | x_i) is the posterior probability of the instance x_i being positive. A ROC curve plots the TPR on the Y axis and the FPR on the X axis, bringing the advantage of presenting the behaviour of a classifier regardless of class distribution or error cost. In order to choose the best classifier based on a ROC curve analysis, one must maximize (1 − FPR) · TPR, which corresponds to selecting the classifier with the highest area under the curve (AUC). This approach measures the average performance of the classifier over the entire performance space [58, 59].
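The ROC construction and the trapezoidal AUC can be sketched as follows (ties between scores are ignored for simplicity, and the score values are made up):

```python
def roc_curve(scores, labels):
    """Sweep the decision threshold from most to least confident score
    and collect (FPR, TPR) points, per equations 4.21-4.22."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# a scorer that ranks every positive above every negative has AUC 1
points = roc_curve([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```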
4.6 Conclusion
This chapter covers all the machine learning algorithms used in this thesis for designing an automated classification system for CTC enumeration: the k-NN along with a bootstrapping technique, the SVM and its extensions and, finally, AdaBoost and RUSBoost. It also presents the tuning and validation framework (Nested Cross-Validation), along with the metrics used to assess the performance of the different supervised learning algorithms studied.
Chapter 5
Results
The two previous chapters (Chapters 3 and 4) describe the approaches implemented in order to construct an automated classification system for CTC detection. In this chapter, the results regarding these approaches are presented. Section 5.1 describes the dataset used, followed by section 5.2, which presents a practical discussion of the image processing approaches. Section 5.3 describes implementation choices. Then section 5.4 gives an insight into the performance of each implemented classifier. Finally, section 5.5 summarizes the results.
5.1 Dataset - Fluorescence microscopy for blood cells analysis
The fluorescence microscopy images used in the development of this thesis were provided by the Cancer-ID consortium. They come from a multicenter study consisting of 59 patients with Small-Cell Lung cancer. From each patient, three blood samples were retrieved: one before chemotherapy (designated as baseline), one after one cycle of chemotherapy and one at the end of chemotherapy. Each blood withdrawal corresponds to one cartridge and each cartridge corresponds to 175 4-channel TIFF images. The images were obtained using the CellSearch System and manually classified by expert reviewers. Table 5.1 summarizes the most important demographic and clinical information about the subjects.
Table 5.1: Clinical characteristics of 59 patients with small-cell lung cancer (ED - extensive disease stage; LD - limited disease stage) [15].

Characteristic | All patients | LD | ED
Age, years (minimum-maximum) | 64 (47-84) | 67 (47-84) | 62 (47-81)
Male/Female | 35/24 | 12/9 | 23/15
Stage, n (%) | - | 21 (36) | 38 (64)
CTCs at baseline, n (median; minimum-maximum) | 59 (16; 0-14 040) | 21 (6; 0-220) | 38 (63; 0-14 040)
CTCs after one cycle, n (median; minimum-maximum) | 37 (0; 0-1681) | 18 (0; 0-6) | 19 (1; 0-1681)
CTCs after four cycles, n (median; minimum-maximum) | 34 (1; 0-117) | 16 (0; 0-3) | 18 (1; 0-117)
Overall survival days, n (median; minimum-maximum) | 59 (280; 5-1424) | 21 (356; 9-1424) | 38 (213; 5-818)
Ferrofluids with EpCAM (epithelial cell adhesion molecule) antibodies were added to the blood samples in order to select cells of epithelial origin, and the cells were stained with DAPI (4',6-diamidino-2-phenylindole, dihydrochloride) as a nuclear stain, PE-CK (cytokeratin 8 and 18 Phycoerythrin and cytokeratin 19 Phycoerythrin) and CD45-APC (CD45-allophycocyanin) to label leukocytes. The microscope objective was a 10x/0.45NA and it had filters for DAPI, CK, CD45 and FITC, corresponding respectively to each of the 4 channels of the 4-page TIFF. The FITC channel was used only for the removal of the edge of the cartridge.
Each cartridge was classified by an expert reviewer. There is no information regarding non-CTCs; when a CTC is present, it is registered that within a given square area there is a CTC.
Along with each cartridge set (175 images), there is an XML file with the position of each CTC relative to the whole cartridge, which is then transformed into a position relative to the image being analysed. Additionally, the TIFF header of each image has two values corresponding to an offset and a maximum value related to the conditions in which the image was obtained; this information was used for image normalization (equation 3.1).
Due to limited computational capacity, three datasets, randomly chosen from all the available ones, were used for testing. In total, 525 images were processed and 141 634 ROIs were classified, 18 822 of which were Circulating Tumor Cells.
5.2 Discussion on Image Processing Results
Features are extracted image by image from a full cartridge. An example of a cartridge, resulting from the concatenation of 175 images, is presented (left, vertically) in Figure 5.1. In the same figure, one image from that dataset is presented (on the right, horizontally): the overlay of the 3 channels, the DAPI-DNA channel, the CK-PE channel and the CD45-APC channel (designated as DNA, CK and CD45, respectively).
Figure 5.1: Example of a full cartridge (left, presented vertically) and one image from that dataset (right, horizontally). Overlay corresponds to the 3 channels superimposed. DNA corresponds to the DAPI-DNA channel, CK to the CK-PE channel and CD45 to the CD45-APC channel.
Each image was then normalized. After this step, the average background intensities of the images had a negligible difference between each other. Then the edges were successfully removed.
Now let us consider Figure 5.2, where an example of a CTC (Figure 5.2(a)) is presented side by side with a non-CTC example (Figure 5.2(b)): there is no obvious visual difference between a CTC and a non-CTC in any of the channels. Observing the full dataset, ROI by ROI, it is hard to find, by visual inspection, a clear and obvious pattern that allows a non-expert reviewer to distinguish a CTC from a non-CTC.
(a) CTC (b) Non-CTC

Figure 5.2: Example of two cells. Figure 5.2(a) is an example of a CTC, whereas Figure 5.2(b) is an example of a non-CTC. The red and green contours represent the contours resulting from segmentation.
The segmentation of these two cells (Figure 5.2) was correctly accomplished. However, there are some situations in which one cannot be sure of the quality of the segmentation. Considering Figure 5.3, it is not possible to know, at least for a non-expert reviewer, whether we are dealing with a cluster of two cells, in which case the segmentation performs poorly, or with just one big cell.
Figure 5.3: Example of a non-CTC. Given the way the manual classification was performed, there is no way of knowing if this is just one cell or a cluster of two cells. The green contour represents the contour resulting from segmentation.
Now consider the situation presented in Figure 5.4: here the segmentation algorithm does not perform
up to expectation. On the left we have a cell; on the right, an element whose nature is not clear,
as it could be a smudge, debris or an apoptotic cell. Nonetheless, the segmentation
algorithm was not able to separate the two objects.
Figure 5.4: Example of a non-CTC, present in Figure 5.1. This example highlights two problems: first, the segmentation is not able to create two distinct areas when the objects are close; second, given the way the expert reviewer did the classification, it is not possible to know if the element on the right (inside the contour) is a smudge, a cell or an apoptotic cell.
Overall, most of the objects were correctly segmented; however, no solution was implemented for the
problems exemplified above.
Appendix A presents the histograms of the distributions of the features, by class (CTC and non-CTC)
and by type (morphological, intensity and texture features), for the 3 datasets. By inspection, one can
conclude that no single feature clearly distinguishes one class from another.
Only outliers with abnormal size were removed, based on inspection: all ROIs with an area
≤ 9 or ≥ 0.3 × 10^4 pixels were excluded from classification.
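The exclusion rule above amounts to a simple area filter; a minimal sketch (the threshold names and helper function are ours, not from the thesis):

```python
# Sketch of the area-based outlier filter described above: keep only
# ROIs whose pixel area lies strictly between the two thresholds.
MIN_AREA = 9          # ROIs this small or smaller are treated as debris
MAX_AREA = 0.3e4      # ROIs this large or larger are treated as artifacts

def keep_roi(area_px: float) -> bool:
    """Return True if an ROI of this area survives the outlier filter."""
    return MIN_AREA < area_px < MAX_AREA

rois = [4, 120, 850, 3500, 9]
kept = [a for a in rois if keep_roi(a)]   # -> [120, 850]
```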
5.3 Experimental Design
The goal of this project is to assess the viability of an automated classification system for the identification
of circulating tumor cells. With this purpose in mind, several classifiers were tested, namely k-NN, k-NN
with bootstrapping, k-NN using prior probabilities, SVM with a linear kernel, SVM with an RBF
kernel, AdaBoost and RUSBoost. Along with testing several classifiers, it was also analysed which set of
features yielded the best results. Thus, each of the algorithms was tested for the sets of features designated
as All, Morphological, User (features related to the ones expert reviewers usually take into account when
performing manual classification), Intensity, Texture, DNA (intensity and texture features of this channel),
CK (intensity and texture features of this channel) and CD45 (intensity and texture features of this channel);
please refer to Table 5.2 for a more detailed description of each category.
All the algorithms were first tested with one dataset (one cartridge), followed by a second test using
another cartridge, and finally with the concatenation of the three datasets.
The parameters C (for both the linear and the RBF kernel), γ (for the RBF kernel), k (the number of
neighbors) for k-NN, and the number of weak learners used in both ensemble methods were estimated
using nested cross-validation, with 10 folds in the outer loop and 7 in the inner loop.

Table 5.2: Set of features of each category used for classification.

All - all extracted features.
Morphological - Area; Eccentricity; Perimeter; Perimeter to Area ratio.
User - Area; Eccentricity; Max. Intensity DNA; Max. Intensity CK; Standard Deviation Int. CD45.
Intensity - Mean Intensity, Max. Intensity, Standard Deviation Int. and Mass, for each of the DNA, CK and CD45 channels.
Texture - Median of Local Entropy, Median of Local Contrast, Median of Gradient Amplitude and HOG, for each of the DNA, CK and CD45 channels.
DNA (Intensity+Texture) - the Intensity and Texture features of the DNA channel.
CK (Intensity+Texture) - the Intensity and Texture features of the CK channel.
CD45 (Intensity+Texture) - the Intensity and Texture features of the CD45 channel.
Other algorithm specifications are presented below:
• k-NN - k-NN was trained assuming k ∈ {1, 3, 5, 7, 9}.
• k-NN with bootstrapping - k-NN had k = 3 and bootstrapping was performed in such a way that
each bootstrap set contained the same number of CTCs and non-CTCs.
• k-NN with Prior Probabilities - k-NN was trained with the following pairs of prior probabilities:
{(.50, .50); (.60, .40); (.75, .25); (.85, .15); (.95, .05); (.99, .01); (.995, .005); (.45, .55); (.35, .65);
(.30, .70); (.10, .90); (.01, .99)}.
• SVM Kernels - Both linear and RBF kernels were tested with weights w0 = 1 and w1 = #non-CTCs / #CTCs,
corresponding to the weights of the non-CTC and CTC classes, respectively. C was assumed to
take the values {2^−16, 2^−14, 2^−12, 2^−10, 2^−8, 2^−6, 2^−4, 2^−2, 2^0, 2^2, 2^4} and γ ∈ {2^−18, 2^−14, 2^−10,
2^−6, 2^−2, 2^2, 2^6, 2^10}.
• Ensemble methods - Both AdaBoost and RUSBoost were tested using a decision tree as weak
classifier, and the number of weak classifiers was {100, 200, 300, 400, 500, 600, 700, 800, 1000}.
• Nested Cross-Validation - The outer loop had 10 folds, and the inner loop 7.
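The nested cross-validation procedure described above (10 outer folds for performance estimation, 7 inner folds for hyper-parameter selection) can be sketched generically; `fit_score` is a hypothetical stand-in for training and scoring any of the classifiers listed above:

```python
def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]

def nested_cv(X, y, param_grid, fit_score, outer_k=10, inner_k=7):
    """Generic nested cross-validation sketch, as used in the thesis.

    `fit_score(train_idx, test_idx, param, X, y)` is any function that
    trains with one parameter value and returns a score on the test
    fold -- a hypothetical stand-in for the k-NN / SVM / boosting fits.
    """
    idx = list(range(len(y)))
    outer_scores = []
    for outer in k_folds(idx, outer_k):
        train = [i for i in idx if i not in outer]
        # Inner loop: pick the parameter with the best mean inner score.
        best_param, best = None, float("-inf")
        for p in param_grid:
            inner_scores = []
            for inner in k_folds(train, inner_k):
                inner_train = [i for i in train if i not in inner]
                inner_scores.append(fit_score(inner_train, inner, p, X, y))
            mean = sum(inner_scores) / len(inner_scores)
            if mean > best:
                best, best_param = mean, p
        # Outer loop: unbiased estimate with the selected parameter.
        outer_scores.append(fit_score(train, outer, best_param, X, y))
    return sum(outer_scores) / len(outer_scores)
```

Because the hyper-parameter is chosen only from inner folds, the outer-fold estimate is not biased by the selection, which is the motivation for nesting [55].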
5.4 Classification Results
In this section, the performance of each implemented classification algorithm is presented.
Please note that no statistical hypothesis test was used for comparison purposes; the comparison was
based solely on the ROC curves and the AUC (Area Under the Curve).
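For reference, the AUC used throughout can be computed directly from classifier scores as the probability that a randomly chosen CTC outranks a randomly chosen non-CTC (the Mann-Whitney formulation, equivalent to integrating the ROC curve); a minimal sketch:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve, computed as the probability that a
    random positive (CTC) score exceeds a random negative (non-CTC)
    score; ties count one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect ranking gives 1.0, and a classifier whose ROC curve sits on the 45-degree diagonal gives 0.5, which is the baseline the curves below should be read against.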
As stated before, the algorithms were first tested with just one dataset, then with another, and finally
with the three datasets concatenated. The results for the first two tests are presented in Appendix B;
the results for the three concatenated datasets are presented in this chapter. The results were quite
similar across tests, except for dataset 1, where the best classifier was the SVM with a linear kernel
using all features.
Figure 5.5 presents the ROC curves of the three implementations of k-NN. Overall it is possible to
observe that the classification performed by the k-NN, in all of the situations, is quite weak. All curves
present an almost constant growth, meaning that any increase in sensitivity will be accompanied by a
linearly proportional decrease in specificity. Furthermore, the curves are very close to the 45-degree
diagonal and, as a result, these classifiers behave nearly as random classifiers. Among the three k-NN
implementations, as expected, the k-NN coupled with the bootstrapping technique (Figure 5.5(a))
performed slightly better than the other two, since it is implemented in a way that tackles the problem of
class imbalance. For this classifier the best set of features was the Intensity set. In all cases the worst
set of features was DNA (Intensity+Texture). In the implementations of k-NN with Prior Probabilities
(Figure 5.5(b)) and plain k-NN (Figure 5.5(c)) the best set of features was CK (Intensity+Texture).
Figure 5.6 displays the results of the SVM with Linear Kernel and the SVM with RBF Kernel. Both performed
better than the k-NN implementations. The SVM with RBF kernel (Figure 5.6(b)) produced better results
than the one with a linear kernel (Figure 5.6(a)) and, in both cases, the best set of features was CK
(Intensity+Texture) and the worst was CD45 (Intensity+Texture).
The ROC curves for the Ensemble methods are depicted in Figure 5.7. Unexpectedly, on average
the AdaBoost performed better than the RUSBoost. The best set for the AdaBoost (Figure 5.7(a)) was
CK (Intensity+Texture) and the worst was CD45 (Intensity+Texture). In the case of the RUSBoost
(Figure 5.7(b)), the set that yielded the best results was Intensity and the worst was the Morphological
feature set.
5.5 Summary
Overall (considering all classifiers, the tests done with the two datasets separately and the test done
with the three datasets concatenated), the set of features that yielded the worst classification results was
CD45 (Intensity+Texture). This is somewhat odd, given that this is the exclusion marker, but it might
be explained by the poor quality of the images from this channel. The set of features that generated the
best results for the concatenation of the three datasets was CK (Intensity+Texture), followed by
the Intensity feature set. The best classifier was the AdaBoost, followed by the SVM with RBF Kernel;
however, neither result met the expectations. The results of all implemented classifiers are
summarized, in the form of AUC, in Table 5.3.
38
(a) k-NN + Bootstrapping
(b) k-NN with Prior Probabilities
(c) k-NN
Figure 5.5: Receiver operating characteristic curves for classification of CTCs with k-Nearest Neighbors: with bootstrapping 5.5(a), with Prior Probabilities 5.5(b), and with the optimal number of neighbors 5.5(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
39
(a) SVM Linear Kernel
(b) SVM RBF Kernel
Figure 5.6: Receiver operating characteristic curves for classification of CTCs with Support Vector Machines, using Linear 5.6(a) and Gaussian (RBF) 5.6(b) kernels. (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
40
(a) AdaBoost
(b) RUSBoost
Figure 5.7: Receiver operating characteristic curves for classification of CTCs with Ensemble methods, AdaBoost 5.7(a) and RUSBoost 5.7(b). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
41
Table 5.3: Area Under the Curve (AUC) of each of the algorithms tested for the total of the 3 datasets.

Area Under the Curve All Morph. User Int. Tex. DNA (Int+Tex.) CK (Int+Tex.) CD45 (Int+Tex.)
k-NN+Bootstrapping .6573 .5947 .6837 .6997 .6240 .5849 .6970 .5898
k-NN w/ Prior P. .6498 .5786 .6808 .6787 .6218 .5739 .6944 .5795
k-NN Neigh. .6399 .5935 .6566 .6614 .6106 .5656 .6748 .5768
SVM Linear .7267 .6379 .7313 .7225 .6664 .6495 .7268 .5767
SVM RBF .7246 .6553 .7235 .7297 .6748 .6553 .7305 .6299
AdaBoost .7331 .6578 .7327 .7331 .6728 .6476 .7387 .6267
RUSBoost .6910 .5697 .6904 .6962 .6060 .5754 .6933 .5838
42
Chapter 6
Conclusions and Future Work
The main goal of this thesis was to study several approaches to building an automated classification
system for Circulating Tumor Cell enumeration. To date, the interpretation of blood samples
analysed by the CellSearch system still depends on the expertise of a trained reviewer, and there has
been a growing interest in developing automated systems that enumerate CTCs in a reliable fashion.
Interest in this topic has also been growing due to the increasing number of biomarkers, detection and
physical isolation systems currently being studied and developed in order to perform real-time
biopsies on cancer patients. This work presented a brief summary of these systems, but
focused on the CellSearch System.
One of the objectives of this project was to study features that could be extracted from each cell and
the impact they had on classification. The features analysed were related to the morphology, intensity
and texture of each ROI. These were then grouped into sets in order to evaluate which ones were
more informative and produced better classification results. It was concluded that the three
best sets of features were the combination of all the extracted features, the set of features extracted from
the CK channel (a combination of intensity and texture features from this channel) and the set of intensity
features. The sets that generated the worst results were the set of morphological features and the texture
and intensity features of both the DNA and the CD45 channels.
The second purpose of the current project was to evaluate and compare several pattern recognition
systems. The large number of non-CTCs compared with the scarce number of CTCs poses a
problem that jeopardizes the classification systems, so several approaches that tackle class imbalance
were implemented and tested. The three machine learning algorithms that performed best were the
two Support Vector Machines (which deal with class imbalance by associating a weight with each of the
classes) and the AdaBoost. The three worst were the three implementations of k-Nearest Neighbors,
and even among these the one that performed best was the implementation with bootstrapping.
Overall, all the implementations and results under-performed. Based on the results of this thesis,
it is not possible to conclude that an automated system for CTC enumeration in Small-Cell
Lung Cancer can be built.
To boost the results in CTC classification for SCLC several options can be studied and developed:
43
• Development of a more coherent and detailed ground-truth:
– Stricter definition of what should be considered a CTC (different reviewers can assign the
same object to different classes, and even the same reviewer, at different moments, can classify
the same object once as a CTC and another time as a non-CTC);
– Manual classification of CTCs after image segmentation (currently, in the manual classifica-
tion, a CTC is a rectangle-shaped area that might contain one or more ROIs, not necessarily
all of them CTCs);
– Manual classification into more classes than just CTC, for example also CTC debris and apoptotic CTC.
These two can present very different morphology and signal intensities compared to a
normal CTC and, currently, they are classified by an expert reviewer as CTCs;
– Classification of non-CTCs: the non-CTC class is everything else in the dataset, which creates
a class with very vague characteristics; it can be a white blood cell, debris, an apoptotic cell,
a smudge, etc.;
• Development of an automated classification system with more classes than just CTC and
non-CTC;
• Using clustering and/or a learning algorithm for outlier removal;
• Studying the use of color histograms and other features;
• Using feature selection algorithms to better assess the informativeness of each feature, such as
correlation and mutual information algorithms;
• Implementation of a noise reduction algorithm (noise analysis of the images studied in this project
is presented in appendix C);
• Improvement of the segmentation algorithm.
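As an illustration of the feature selection suggestion above, mutual information between a (discretized) feature and the class label can be estimated as follows; this is a generic sketch, not code from the thesis:

```python
import math
from collections import Counter

def mutual_information(feature_bins, labels):
    """Estimate I(feature; class) in bits from discretized feature
    values, a simple way to rank features by informativeness.
    (Continuous features would first be binned, e.g. into quantiles.)"""
    n = len(labels)
    pxy = Counter(zip(feature_bins, labels))   # joint counts
    px = Counter(feature_bins)                 # marginal feature counts
    py = Counter(labels)                       # marginal class counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * log2( p_joint / (p(x) * p(y)) )
        mi += p_joint * math.log2(c * n / (px[x] * py[y]))
    return mi
```

A perfectly informative feature yields 1 bit for a balanced two-class problem, while a feature independent of the class yields 0; ranking features by this score is one concrete way to pursue the suggestion above.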
From a more general point of view several other improvements can be made:
• Implementation of online-learning algorithms;
• Implementation of semi-supervised solutions;
• Study of biomarkers (dyes) for different types of CTC.
The topic of automated classification of Circulating Tumor Cells is still quite recent, has great potential
and a huge impact on the study of cancer, and there is still a lot of room for development.
44
Bibliography
[1] American Cancer Society. Cancer Facts & Figures 2012. Health Policy, 1:1–68, 2012.
[2] American Cancer Society. Global Cancer Facts & Figures 3rd Edition. American Cancer Society,
(800):1–64, 2015.
[3] American Association for Cancer Research. AACR Cancer progress report. pages S1–S100, 2012.
[4] D. E. Bloom, E. Cafiero, E. Jane-Llopis, S. Abrahams-Gessel, L. Reddy Bloom, S. Fathima, A. B.
Feigl, T. Gaziano, A. Hamandi, M. Mowafi, D. O’Farrell, E. Ozaltin, A. Pandya, K. Prettner, L. Rosen-
berg, B. Seligman, A. Z. Stein, C. Weinstein, and J. Weiss. The Global Economic Burden of Non-
communicable Diseases. (September):1–46, 2012.
[5] D. A. Haber and J. Settleman. Cancer: drivers and passengers. Nature, 446(7132):145–146, 2007.
[6] S. de Wit, G. van Dalum, and L. W. M. M. Terstappen. Detection of Circulating Tumor Cells.
Scientifica, 2014.
[7] D. Wirtz, K. Konstantopoulos, and P. C. Searson. The physics of cancer: the role of physical
interactions and mechanical forces in metastasis. Nature reviews. Cancer, 11:512–522, 2011.
[8] B. Weigelt, J. L. Peterse, and L. J. van’t Veer. Breast cancer metastasis: markers and models.
Nature Reviews Cancer, 5(August):591–602, 2005.
[9] B. Weigelt, J. L. Peterse, and L. J. van’t Veer. Breast cancer metastasis: markers and models. Nature
Reviews Cancer, 5(8):591–602, 2005.
[10] R. Weinberg. The biology of cancer. Garland Science, 2013.
[11] C.-M. Svensson, S. Krusekopf, J. Lucke, and M. Thilo Figge. Automated detection of circulating
tumor cells with naive Bayesian classifiers. Cytometry. Part A : the journal of the International
Society for Analytical Cytology, 85(23):501–511, 2014.
[12] S. T. Ligthart, F. a. W. Coumans, G. Attard, A. M. Cassidy, J. S. de Bono, and L. W. M. M. Terstappen.
Unbiased and automated identification of a circulating tumour cell definition that associates with
overall survival. PloS one, 6(11), 2011.
[13] G. R. Simon. Management of Small Cell Lung Cancer. CHEST Journal, 132:324S, 2007.
45
[14] A. Rossi, P. Maione, G. Palazzolo, P. C. Sacco, M. L. Ferrara, M. Falanga, and C. Gridelli. New
targeted therapies and small-cell lung cancer. Clinical lung cancer, 9(5):271–9, 2008.
[15] T. J. N. Hiltermann, M. M. Pore, a. van den Berg, W. Timens, H. M. Boezen, J. J. W. Liesker, J. H.
Schouwink, W. J. a. Wijnands, G. S. M. a. Kerner, F. a. E. Kruyt, H. Tissing, a. G. J. Tibbe, L. W.
M. M. Terstappen, and H. J. M. Groen. Circulating tumor cells in small-cell lung cancer: a predictive
and prognostic factor. Annals of Oncology, 23(June):2937–2942, 2012.
[16] S. T. Ligthart, F. C. Bidard, C. Decraene, T. Bachelot, S. Delaloge, E. Brain, M. Campone, P. Viens,
J. Y. Pierga, and L. W. M. M. Terstappen. Unbiased quantitative assessment of Her-2 expression
of circulating tumor cells in patients with metastatic and non-metastatic breast cancer. Annals of
Oncology, 24:1231–1238, 2013.
[17] T. M. Scholtens, F. Schreuder, S. T. Ligthart, J. F. Swennenhuis, J. Greve, and L. W. M. M. Terstap-
pen. Automated identification of circulating tumor cells by image cytometry. Cytometry. Part A : the
journal of the International Society for Analytical Cytology, 81:138–48, 2012.
[18] M. Alunni-Fabbroni and M. T. Sandri. Circulating tumour cells in clinical practice: Methods of de-
tection and possible characterization. Methods, 50(4):289–297, 2010.
[19] D. R. Shaffer, M. a. Leversha, D. C. Danila, O. Lin, R. Gonzalez-Espinoza, B. Gu, A. Anand,
K. Smith, P. Maslak, G. V. Doyle, L. W. M. M. Terstappen, H. Lilja, G. Heller, M. Fleisher, and
H. I. Scher. Circulating tumor cell analysis in patients with progressive castration-resistant prostate
cancer. Clinical Cancer Research, 13(7):2023–2029, 2007.
[20] External quality assurance of circulating tumor cell enumeration using the CellSearch system: A
feasibility study. Cytometry Part B - Clinical Cytometry, 80 B(June 2010):112–118, 2011.
[21] C. Alix-Panabieres and K. Pantel. Circulating tumor cells: Liquid biopsy of cancer. Clinical Chem-
istry, 59:110–118, 2013.
[22] Z. S. Lalmahomed, J. Kraan, J. W. Gratama, B. Mostert, S. Sleijfer, and C. Verhoef. Circulating
tumor cells and sample size: The more, the better. Journal of Clinical Oncology, 28(17):288–289,
2010.
[23] S. Sleijfer, J. W. Gratama, A. M. Sieuwerts, J. Kraan, J. W. M. Martens, and J. a. Foekens. Circu-
lating tumour cell detection on its way to routine diagnostic implementation? European Journal of
Cancer, 43:2645–2650, 2007.
[24] G. Zack, W. Rogers, and S. Latt. Automatic measurement of sister chromatid exchange frequency.
Journal of Histochemistry & Cytochemistry, 25(7):741–753, 1977.
[25] R. C. Gonzalez, R. E. Woods, and S. L. Eddins. Digital Image Processing Using MATLAB, chapter 11. Prentice
Hall, 2003.
46
[26] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. volume 1, pages
886–893. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June
2005.
[27] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination: consistency proper-
ties. Tech. Rep. 4, 1951.
[28] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information
Theory, 13:21–27, 1967.
[29] C. J. Stone. Consistent Nonparametric Regression. The Annals of Statistics, 5(4):595–620, 1977.
[30] B. Efron. Bootstrap methods: Another look at the Jackknife. The Annals of Statistics, 7:1–26, 1979.
[31] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A Training Algorithm for Optimal Margin Classifiers.
Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–
152, 1992.
[32] V. N. Vapnik and A. Lerner. Pattern recognition using Generalized Portrait method. Automation and
Remote Control, 24, 1963.
[33] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[34] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over
data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[35] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the
weights is more important than the size of the network. IEEE Transactions on Information Theory,
44(2):525–536, 1998.
[36] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. Advances in Large Margin
Classifiers, pages 349–358, 2000.
[37] P. M. Morgado Marabilha. Automated Diagnosis of Alzheimer’s Disease using PET Images A study
of alternative procedures for feature extraction and selection. Master’s thesis, Instituto Superior
Tecnico, 2012.
[38] V. Vapnik. Statistical Learning Theory, chapter 10.9. Wiley, 1998.
[39] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. AI Memo 1602,
1997.
[40] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2
(3):1–27, 2011.
[41] L. G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, 1984.
47
[42] M. J. Kearns and L. G. Valiant. Learning Boolean formulae or finite automata is as hard as factoring.
Harvard University, Center for Research in Computing Technology, Aiken Computation Laboratory,
1988.
[43] M. Kearns and L. Valiant. Cryptographic Limitations on Learning Boolean Formulae and Finite
Automata. J. ACM, 41(1):67–95, 1994.
[44] R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2):197–227, 1990.
[45] Y. Freund. Boosting a weak learning algorithm by majority. Information and computation, 121(2):
256–285, 1995.
[46] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application
to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[47] R. E. Schapire. A brief introduction to boosting. IJCAI International Joint Conference on Artificial
Intelligence, 2(5):1401–1406, 1999.
[48] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Pro-
ceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 1:511–518, 2001.
[49] G. E. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for
balancing machine learning training data. ACM Sigkdd Explorations Newsletter, 6(1):20–29, 2004.
[50] C. Drummond, R. C. Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling
beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11. Citeseer,
2003.
[51] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to
alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics Part A:Systems
and Humans, 40(1):185–197, 2010.
[52] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-
sampling technique. Journal of artificial intelligence research, pages 321–357, 2002.
[53] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from
imbalanced data. In Proceedings of the 24th international conference on Machine learning, pages
935–942. ACM, 2007.
[54] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: improving classifi-
cation performance when training data is skewed. In Pattern Recognition, 2008. ICPR 2008. 19th
International Conference on, pages 1–4. IEEE, 2008.
[55] S. Varma and R. Simon. Bias in error estimation when using cross-validation for model selection.
BMC bioinformatics, 7(1):91, 2006.
48
[56] C. Petersohn. Temporal Video Segmentation, page 34. Vogt Verlag, 2010.
[57] J. P. Egan. Signal detection theory and ROC analysis. 1975.
[58] J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 240(4857):1285–1293, 1988.
[59] J. R. Beck and E. K. Shultz. The use of relative operating characteristic (roc) curves in test perfor-
mance evaluation. Archives of pathology & laboratory medicine, 110(1):13–20, 1986.
[60] R. Caruana. An Empirical Comparison of Supervised Learning Algorithms. pages 161–168, 2006.
[61] T. Fawcett and F. Provost. Analysis and Visualization of Classifier Performance: Comparison under
Imprecise Class and Cost Distributions. pages 43–48, 1997.
49
50
Appendix A
Histograms of Feature Distributions
A.1 Morphological Features Histograms
Figure A.1 presents the distributions of morphological features (area, eccentricity, perimeter and
perimeter to area ratio).
(a) Area (b) Eccentricity
(c) Perimeter (d) Perimeter to Area Ratio
Figure A.1: Histograms of the distributions of morphological features, for CTCs and non-CTCs.
51
A.2 Intensity Features Histograms
Figure A.2 presents the distributions of intensity features.
(a) DNA Mean Intensity (b) CK Mean Intensity (c) CD45 Mean Intensity
(d) DNA Maximum Intensity (e) CK Maximum Intensity (f) CD45 Maximum Intensity
(g) DNA Standard Deviation of inten-sity signal
(h) CK Standard Deviation of intensitysignal
(i) CD45 Standard Deviation of inten-sity signal
(j) DNA Mass (k) CK Mass (l) CD45 Mass
Figure A.2: Histograms of the distributions of intensity features, for CTCs and non-CTCs.
52
A.3 Texture Features Histograms
Figure A.3 presents the distributions of texture features, except HOG Features.
(a) Median of DNA Local Entropy (b) Median of CK Local Entropy (c) Median of CD45 Local Entropy
(d) Median of DNA Local Contrast (e) Median of CK Local Contrast (f) Median of CD45 Local Contrast
(g) Median of DNA Gradient Amplitude (h) Median of CK Gradient Amplitude (i) Median of CD45 Gradient Amplitude
Figure A.3: Histograms of the distributions of texture features (except HOG features), for CTCs and non-CTCs.
53
54
Appendix B
Classification Results
A total of three datasets were analysed in this thesis. The result for the three of them together
was presented in Section 5.4; however, a first test was performed in which two of the datasets were analysed
separately. For the sake of simplicity we will designate them as patient A and patient B. Section B.1 presents
the classification results of dataset 1 (patient A), and Section B.2 those of dataset 2 (patient B).
B.1 Dataset 1 - Patient A
(a) k-NN + Bootstrapping (b) k-NN with Prior Probabilities (c) k-NN with Optimal Number of Neighbors
Figure B.1: Receiver operating characteristic curves for classification of CTCs with k-Nearest Neighbors: with bootstrapping B.1(a), with Prior Probabilities B.1(b), and with the optimal number of neighbors B.1(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
Table B.1: Area Under the Curve (AUC) of each of the algorithms tested for patient A.

Area Under the Curve All Morph. User Int. Tex. DNA (Int+Tex.) CK (Int+Tex.) CD45 (Int+Tex.)
k-NN+Bootstrapping .6531 .5572 .5854 .6587 .6615 .5562 .5773 .5145
k-NN w/ Prior P. .6557 .5919 .6312 .6314 .6629 .5608 .6245 .4972
k-NN Neigh. .6397 .5918 .6077 .5961 .6166 .5373 .6476 .5127
SVM Linear .8183 .6272 .6387 .7418 .7340 .6349 .7295 .6303
SVM RBF .7639 .6501 .5917 .7086 .6945 .6244 .7062 .5517
AdaBoost .7364 .5018 .5937 .6524 .6821 .5865 .6933 .6100
RUSBoost .6739 .6019 .5663 .6791 .6301 .6628 .6708 .6364
55
(a) SVM Linear (b) SVM RBF
Figure B.2: Receiver operating characteristic curves for classification of CTCs with Support Vector Machines, using Linear B.2(a) and Gaussian (RBF) B.2(b) kernels. (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
(a) AdaBoost (b) RUSBoost
Figure B.3: Receiver operating characteristic curves for classification of CTCs with Ensemble methods, AdaBoost B.3(a) and RUSBoost B.3(b). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
56
B.2 Dataset 2 - Patient B
(a) k-NN + Bootstrapping (b) k-NN with Prior Probabilities (c) k-NN with Optimal Number of Neighbors
Figure B.4: Receiver operating characteristic curves for classification of CTCs with k-Nearest Neighbors: with bootstrapping B.4(a), with Prior Probabilities B.4(b), and with the optimal number of neighbors B.4(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
(a) SVM Linear (b) SVM RBF
Figure B.5: Receiver operating characteristic curves for classification of CTCs with Support Vector Machines, using Linear B.5(a) and Gaussian (RBF) B.5(b) kernels. (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
Table B.2: Area Under the Curve (AUC) of each of the algorithms tested for patient B.

Area Under the Curve All Morph. User Int. Tex. DNA (Int+Tex.) CK (Int+Tex.) CD45 (Int+Tex.)
k-NN+Bootstrapping .6547 .5995 .6874 .6852 .6185 .5648 .6842 .5878
k-NN w/ Prior P. .6463 .5775 .6768 .6825 .6140 .5506 .6811 .5751
k-NN Neigh. .6428 .5658 .6743 .6662 .6163 .5604 .6999 .5767
SVM Linear .7267 .6379 .7313 .7225 .6664 .6495 .7268 .5767
SVM RBF .7255 .6549 .7239 .7280 .6758 .6557 .7310 .6304
AdaBoost .7347 .6579 .7366 .7351 .6758 .6504 .7411 .6247
RUSBoost .6910 .5634 .6901 .6951 .6410 .5567 .6845 .5742
57
(a) AdaBoost (b) RUSBoost
Figure B.6: Receiver operating characteristic curves for classification of CTCs with Ensemble methods, AdaBoost B.6(a) and RUSBoost B.6(b). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
58
Appendix C
Noise Analysis
Given that no algorithm was implemented to reduce the noise of the images, this data is presented
in this appendix only for purposes of future work. The noise appears to be Gaussian (Figure C.1).
The DNA channel (Figure C.1(a)) has a mean of 0.0104 and a standard deviation of 0.0127, the CK
channel (Figure C.1(b)) a mean of 0.1888 and a standard deviation of 0.0420, and the CD45 channel
(Figure C.1(c)) a mean of 0.0585 and a standard deviation of 0.0153.
(a) Channel DNA Noise (b) Channel CK Noise (c) Channel CD45 Noise
Figure C.1: Distribution of Noise, by channel.
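A sketch of how per-channel noise statistics such as those above could be estimated, assuming a binary mask of the segmented ROIs is available (the function and argument names are hypothetical, not from the thesis):

```python
import numpy as np

def background_noise_stats(img: np.ndarray, foreground_mask: np.ndarray):
    """Estimate the mean and standard deviation of the background
    (noise) pixels of one channel, i.e. the pixels lying outside every
    segmented ROI.  Hypothetical sketch of how the per-channel figures
    quoted above could be obtained."""
    background = img[~foreground_mask]   # pixels not covered by any ROI
    return float(background.mean()), float(background.std())
```

Fitting a Gaussian to these background pixels, channel by channel, would also allow checking the Gaussian-noise assumption stated above before designing a denoising step.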
59
60