Cancer ID
Automated System for Identification of Circulating Tumor Cells
Rita Gonçalves Pires Antunes Angélico
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Prof. Maria Margarida Campos da Silveira, Dr. Christoph Brune
Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Maria Margarida Campos da Silveira
Member of the Committee: Prof. João Miguel Raposo Sanches
May 2016
“Make a radical change in your lifestyle and begin to boldly do things which you may previously
never have thought of doing, or been too hesitant to attempt. So many people live within
unhappy circumstances and yet will not take the initiative to change their situation because
they are conditioned to a life of security, conformity, and conservatism, all of which may appear
to give one peace of mind, but in reality nothing is more damaging to the adventurous spirit
within a man than a secure future. The very basic core of a man’s living spirit is his passion for
adventure. The joy of life comes from our encounters with new experiences, and hence there is
no greater joy than to have an endlessly changing horizon, for each day to have a new and
different sun. If you want to get more out of life, you must lose your inclination for monotonous
security and adopt a helter-skelter style of life that will at first appear to you to be crazy. But
once you become accustomed to such a life you will see its full meaning and its incredible
beauty.”
Jon Krakauer, Into the Wild.
Acknowledgments
The journey of knowledge accretion is never a solo trip. Directly or indirectly, several people and
institutions contributed to this thesis, and to them I address my deepest gratitude.
First and foremost, I would like to thank my supervisor, Professor Margarida Silveira, not only for
accepting this challenge (a self-proposed topic, supervised while I was on Erasmus), but also for constantly
questioning me and demanding that I step up my game, and for her incredible availability, perseverance,
support, advice and knowledge.
Second, I want to express my sincerest thanks to Doctor Christoph Brune, Leonie Zeune
and Doctor Guus van Dalum, members of the Cancer ID team, for accepting and welcoming me into their
team during my Erasmus period. Without them this thesis would never have existed; their
support and knowledge were essential to the development of this project.
I am also grateful to two great institutions. First, Instituto Superior Técnico, for teaching me every day
that “O ensino de amadores não cria profissionais” (Alfredo Bensaúde; translation: amateur education
does not shape professionals) and for giving me the opportunity to grow not only academically and
professionally, but also on a personal level. As scary as it may seem, at least to me, 24% of my life to
date was lived in this school. Second, the University of Twente, for the incredible conditions provided
and for accepting me as an exchange student. I would also like to thank the Erasmus Programme for
providing me with this opportunity.
Now, I would like to address my inmost gratitude to my father, who, along with all the flaws a daughter
sees in a father, has raised me to be the person I am today, has given me support and the freedom
to live life to the fullest, and at whom I look with admiration. I also want to address a big thank you to my
brother, in a rare moment of kindness between us, for being my everyday challenger, but also a huge supporter.
Last but not least, I would like to thank all my friends. I would like to thank my Erasmus
family (Andrea Gambuti, Claudia Ruffoni, Ilmari Ahonen, Mert Imre and Ophelie Haurou-Bejottes) for one
of the greatest times of my life, for treating my craziness as something totally socially acceptable,
and for the constant support. To Nuno Pereira, who, no matter what, no matter where, for the past 4 years
has been there for me like no other person has ever been. To Mónica, who after 8 years of friendship
hasn't given up and deals with all my stress and panic moments. To João Satiro and Olek, who over the
last 5 years, and António for the past 3, have worked side by side with me, and especially Satiro, who
has been a great support. I would also like to address my sweetest thank you to Inês Godet, a force of
nature and a great support throughout the past year and a half. I would also like to thank Cristiano for all
the hugs, contagious good mood and friendship. Rita and Sancho also deserve a big thank you for being
not only a great support but also the most humble and kind people I know.
This was a challenging project, and I could not end it without paying homage to all who are fighting
or have fought any kind of cancer, to everyone supporting them, and to everyone working to save them
or ease their pain.
Resumo
Actualmente, existem diversas opções para tratamento de cancro. O estudo de Células Tumorais em
Circulação (CTCs) fornece dados relevantes sobre a eficácia do tratamento e a progressão da doença,
permitindo assim um melhor ajuste dos tratamentos. Assim sendo, o desenvolvimento de um sistema
automático de classificação, que usa como fonte de informação imagens com 4 canais de amostras
sanguíneas, é de elevado interesse. Este trabalho foca-se no estudo de imagens, adquiridas com o
sistema CellSearch, de pacientes com cancro do pulmão de células pequenas (SCLC), para o qual ainda
não existe nenhum sistema automático para a sua enumeração. Este sistema é composto por dois blocos
principais: processamento de imagem e aprendizagem automática. O bloco de processamento de
imagem consiste em: normalização de imagens, segmentação e extracção de features (morfológicas,
de intensidade e texturais). O bloco de aprendizagem automática foi desenvolvido tendo em conta o
facto do número de não-CTCs ser largamente superior ao número de CTCs, usando técnicas como o
bootstrapping e algoritmos de boosting. Neste projecto, algoritmos convencionais, como Support Vector
Machines (já usado no passado no âmbito de projectos semelhantes) e k-Nearest Neighbor, foram
implementados, bem como algoritmos mais recentes, nunca aplicados neste contexto, especificamente
o AdaBoost e o RUSBoost. Embora diversas novas abordagens tenham sido testadas, não foi possível
desenvolver um sistema automático para enumeração de CTCs para SCLC fidedigno.
Palavras-chave: Células Tumorais em Circulação, Classes não-balanceadas, k-Nearest
Neighbor, Support Vector Machines, Ensemble methods, Cancro do pulmão de células pequenas.
Abstract
Nowadays, there are several treatment options for cancer. The study of Circulating Tumor Cells
(CTCs) provides great insight into treatment effectiveness and disease progression, allowing for better
treatment adjustment. Therefore, the development of an automated classification system, which uses
4-channel images of blood samples (a non-invasive biopsy) as its source of information, is of great interest.
This work focused on the study of images, acquired with the CellSearch system, of patients with Small-Cell
Lung Cancer (SCLC), for which no automated enumeration system has been developed to date. This
system has two main building blocks: image processing and machine learning. The image processing
block consists of image normalization, segmentation and feature extraction (morphological, intensity-related
and texture features). The machine learning block was developed taking into account the fact that
non-CTCs highly outnumber CTCs, using techniques such as bootstrapping and ensemble methods.
In this thesis, conventional algorithms were implemented, namely Support Vector Machines (which have
been used in this context before) and k-Nearest Neighbor, along with recent algorithms, never studied
before in this field, specifically AdaBoost and RUSBoost. Even though several new approaches were
tested, it was not possible to develop a reliable automated CTC enumeration system for SCLC.
Keywords: Circulating Tumor Cells, Class imbalance, k-Nearest Neighbor, Support Vector Machines,
Ensemble methods, Small-Cell Lung Cancer.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Cancer and its impact on Society . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Circulating Tumor Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.3 Detection of Circulating Tumor Cells . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Original Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 State of the Art 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Machine Learning and Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Image Processing 15
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Image Preprocessing & ROI Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Image Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Morphological Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Intensity Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Classification and Performance Evaluation 21
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.3 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.1 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.2 RUSBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5.1 Nested Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5.2 Receiver-Operator Characteristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Results 33
5.1 Dataset - Fluorescence microscopy for blood cell analysis . . . . . . . . . . . . . . . . . . 33
5.2 Discussion on Image Processing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusions and Future Work 43
Bibliography 45
A Histograms of Feature Distributions 51
A.1 Morphological Features Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Intensity Features Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.3 Texture Features Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
B Classification Results 55
B.1 Dataset 1 - Patient A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
B.2 Dataset 2 - Patient B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
C Noise Analysis 59
List of Tables
2.1 Summary of different cytometric approaches for CTC enumeration. Adapted from [18] . . 10
2.2 Performance of different automated CTC enumeration systems. Acronyms: Accuracy (ACC),
Sensitivity (SENS), Specificity (SPEC), Region of Interest (ROI), Castration-Resistant Prostate
Cancer (CRPC), Apoptotic (Apop.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Summary of the extracted features (P2A - Perimeter to Area Ratio, Max. - Maximum, ch.
- channel, HOG - Histogram of oriented gradients) . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Clinical characteristics of 59 patients with small-cell lung cancer. (ED-extensive disease
stage; LD-limited disease stage) [15]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Set of features of each category used for classification. . . . . . . . . . . . . . . . . . . . 37
5.3 Area Under the Curve (AUC) of each of the algorithms tested for the total of the 3 datasets. 42
B.1 Area Under the Curve (AUC) of each of the algorithms tested for patient A. . . . . . . . . 55
B.2 Area Under the Curve (AUC) of each of the algorithms tested for patient B. . . . . . . . . 57
List of Figures
1.1 Estimated Number of New Cancer Cases by World Area 2012 [2]. . . . . . . . . . . . . . 2
1.2 Estimated New Cancer Cases (left) and Deaths Worldwide (right) for Leading Cancer
Sites by Level of Economic Development, 2012. (*Excluding non-melanoma skin cancer.
Estimates may not sum to worldwide total due to rounding) [2]. . . . . . . . . . . . . . . . 3
1.3 Relative contribution of external factors to cancer incidence. Adapted from [3]. . . . . . . 3
1.4 The metastatic process: cells detach from a primary tumor, penetrate the surrounding
tissue, enter nearby blood vessels (intravasation) and circulate in the vascular system.
Some of these cells eventually adhere to blood vessel walls and are able to extravasate
and migrate into the local tissue, where they can form a secondary tumor. [7]. . . . . . . . 4
1.5 CellSearch thumbnail gallery. The software of the CellSearch CellTracks displays thumb-
nails of all objects that are positive for both DAPI and CK. Events 337, 340, and 341 show
a CTC: positive for DAPI and PE and negative for CD45. Note the weak CD45-staining of
several white blood cells in events 340 and 341 [6]. . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Proposed approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Comparison of common thresholding procedures. Two original images containing a small
(1A) and a large number of objects (1B) were thresholded using three methods: triangle
(2A and 2B), otsu (3A and 3B), and isodata (4A and 4B). The three methods give similar
results on an image with a large number of objects, but triangle finds the correct number
of objects in images which contain a small number of objects. Image 1A is shown using a
logarithmic intensity scale to show the texture in the background; the left part of the image
is part of the cartridge border. [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Detail of a PE image (1), and masks as thresholded by the triangle (2), otsu (3), and
isodata (4) methods. [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 “Example of selection of cartridge scan area. 1: original FITC images of one side of
a cartridge stitched together after application of linear convolution filter to border images
(arrow indicates an air bubble), 2: border enhanced image by gradient magnitude filtering,
3: Binary image of thresholded borders (red color), 4: Selected scan area (red color) after
inversion of image 3, binary propagation of center square, and size verification.” [12] . . . 17
3.2 “Determination of the global search threshold for each picture. The threshold (THR) was
selected by normalizing the height and dynamic range of the intensity histogram, locating
point A as shown, and then adding a fixed offset.” [24] . . . . . . . . . . . . . . . . . . . . 18
4.1 Example of a 2D linearly separable binary problem, where patterns of one class are represented
by diamonds and those of the other by circles. The optimal separating hyperplane (strong
full line) maximizes the distance between the support vectors of each class (darker data
points, limiting the margins). Adapted from [33]. . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Examples of the use of the two adaptations of the SVM. 4.2(a) is an illustrative example
of the use of a kernel (φ(x1, x2) = x1² + x2²) when the data is separable, although no
hyperplane in the input space is able to separate it. Therefore, the data was mapped
into a feature space, where the decision surface was computed. 4.2(b) presents
a situation where the data is not linearly separable in the input space. Two options
are available: either the decision surface is the one presented by the dotted line, which
may lead to overfitting and poor generalization, or one allows for errors to be committed
(the soft-margin concept) and the dashed line is used as the separating hyperplane. Adapted
from [37]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 AdaBoost (Pseudocode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 RUSBoost (Pseudocode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Nested Cross-Validation (Pseudocode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1 Example of a full cartridge (left, presented vertically) and one image from that dataset
(right, horizontally). Overlay corresponds to the 3 channels superimposed. DNA corre-
sponds to DAPI-DNA channel, CK to CK-PE channel and CD45 to CD45-APC channel. . 34
5.2 Example of two cells. Figure 5.2(a) is an example of a CTC, whereas Figure
5.2(b) is an example of a non-CTC. The red and green contours represent the
contours resulting from segmentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Example of a non-CTC. Given the way the manual classification was performed, there
is no way of knowing whether this is just one cell or a cluster of two cells. The green contour
represents the contour resulting from segmentation. . . . . . . . . . . . . . . . . . . . . . 35
5.4 Example of a non-CTC, present in Figure 5.1. This example highlights two problems: first,
the segmentation is not able to create two distinct areas when objects are close together. Second,
given the way the expert reviewer performed the classification, it is not possible to know whether
the element on the right (inside the contour) is just a smudge, a cell or an apoptotic cell. . . . 36
5.5 Receiver operator curves for classification of CTC with k -Nearest Neighbor with boot-
strapping 5.5(a), with Prior Probabilities 5.5(b) and k -NN with the optimal amount of
neighbors 5.5(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK
maximum intensity and CD45 intensity standard deviation; Morphological - area, eccen-
tricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local
entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard
deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and
intensity features for the correspondent channel). . . . . . . . . . . . . . . . . . . . . . . . 39
5.6 Receiver operator curves for classification of CTC with Support Vector Machines, using
Linear 5.6(a) and Gaussian (RBF) 5.6(b) Kernels. (All - all features; User - area, ec-
centricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard
deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Tex-
ture - median local contrast, median local entropy and HOG features of the 3 channels;
Intensity - mean, maximum and standard deviation of the intensity signal and mass of the
3 channels; DNA, CK, CD45 - texture and intensity features for the correspondent channel). 40
5.7 Receiver operator curves for classification of CTC with Ensemble methods, AdaBoost
5.7(a) and RUSBoost 5.7(b). (All - all features; User - area, eccentricity, DNA maximum
intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological -
area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast,
median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and
standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 -
texture and intensity features for the correspondent channel). . . . . . . . . . . . . . . . . 41
A.1 Histogram of distributions of CTC and non-CTC of morphological features. . . . . . . . . . 51
A.2 Histogram of distributions of CTC and non-CTC of intensity features. . . . . . . . . . . . . 52
A.3 Histogram of distributions of CTC and non-CTC of texture features, except HOG features. 53
B.1 Receiver operator curves for classification of CTC with k -Nearest Neighbor with boot-
strapping B.1(a), with Prior Probabilities B.1(b) and k -NN with the optimal amount of
neighbors B.1(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK
maximum intensity and CD45 intensity standard deviation; Morphological - area, eccen-
tricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local
entropy and HOG features for the 3 channels; Intensity - mean, maximum and standard
deviation of the intensity signal and mass for the 3 channels; DNA, CK, CD45 - texture
and intensity features for the correspondent channel). . . . . . . . . . . . . . . . . . . . . 55
B.2 Receiver operator curves for classification of CTC with Support Vector Machines, using
Linear B.2(a) and Gaussian (RBF) B.2(b) Kernels. (All - all features; User - area, ec-
centricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard
deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Tex-
ture - median local contrast, median local entropy and HOG features for the 3 channels;
Intensity - mean, maximum and standard deviation of the intensity signal and mass for the
3 channels; DNA, CK, CD45 - texture and intensity features for the correspondent channel). 56
B.3 Receiver operator curves for classification of CTC with Ensemble methods, AdaBoost
B.3(a) and RUSBoost B.3(b). (All - all features; User - area, eccentricity, DNA maximum
intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological -
area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast,
median local entropy and HOG features for the 3 channels; Intensity - mean, maximum
and standard deviation of the intensity signal and mass for the 3 channels; DNA, CK,
CD45 - texture and intensity features for the correspondent channel). . . . . . . . . . . . . 56
B.4 Receiver operator curves for classification of CTC with k -Nearest Neighbor with boot-
strapping B.4(a), with Prior Probabilities B.4(b) and k -NN with the optimal amount of
neighbors B.4(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK
maximum intensity and CD45 intensity standard deviation; Morphological - area, eccen-
tricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local
entropy and HOG features for the 3 channels; Intensity - mean, maximum and standard
deviation of the intensity signal and mass for the 3 channels; DNA, CK, CD45 - texture
and intensity features for the correspondent channel). . . . . . . . . . . . . . . . . . . . . 57
B.5 Receiver operator curves for classification of CTC with Support Vector Machines, using
Linear B.5(a) and Gaussian (RBF) B.5(b) Kernels. (All - all features; User - area, ec-
centricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard
deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Tex-
ture - median local contrast, median local entropy and HOG features for the 3 channels;
Intensity - mean, maximum and standard deviation of the intensity signal and mass for the
3 channels; DNA, CK, CD45 - texture and intensity features for the correspondent channel). 57
B.6 Receiver operator curves for classification of CTC with Ensemble methods, AdaBoost
B.6(a) and RUSBoost B.6(b). (All - all features; User - area, eccentricity, DNA maximum
intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological -
area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast,
median local entropy and HOG features for the 3 channels; Intensity - mean, maximum
and standard deviation of the intensity signal and mass for the 3 channels; DNA, CK,
CD45 - texture and intensity features for the correspondent channel). . . . . . . . . . . . . 58
C.1 Distribution of Noise, by channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Glossary
Antigen Molecule capable of inducing an immune response on the part of the host organism;
the presence of this molecule can also be used for the detection of specific cells by enrichment.
Apoptosis Process of programmed cell death.
Biopsy Sample of tissue taken from the body for further
examination.
CTCs Circulating Tumor Cells are cancerous cells present in the circulatory system.
Cancer Set of diseases characterized by uncontrolled
growth and spread of abnormal cells.
Cytokeratin Class of fibrous proteins that form the intermediate filaments of the cytoplasm of
epithelial cells, provide structural support to the cytoskeleton and play a role in various
cellular functions.
DNA Deoxyribonucleic acid is a molecule that contains genetic information and is typically
found in the nucleus of a cell.
ED Extensive Disease stage, used in the context of
cancer to refer to metastasised cancer.
Epithelial cell One of the closely packed cells forming the epithelium (membranous tissue
covering internal organs and other internal surfaces of the body).
Ferrofluid Liquid that becomes strongly magnetized in the
presence of a magnetic field.
HOG Histogram of oriented gradients is a feature descriptor.
LD Limited disease stage, used in the context of
cancer to refer to localized cancer.
Leukocyte Cell of the immune system present in the bloodstream; also designated a white
blood cell.
Metastasis Spread of a disease from one location to another not directly connected with it.
ROI Region of Interest is a segmented part of an image, from which features are extracted.
SCLC Small-Cell Lung Cancer is a type of lung cancer.
SVM Support Vector Machines are a set of supervised learning methods used for classification,
regression and outlier detection.
TIFF Tagged Image File Format is a computer file format for storing raster graphics images.
Tumor Lesion or lump formed in the body due to abnormal cellular growth; not necessarily
cancerous.
XML Extensible Markup Language is a system for annotating documents that defines a set of
rules for encoding them in a format which is both human-readable and machine-readable.
In this project, XML files encode notes regarding each dataset.
k-NN k-Nearest Neighbor is a non-parametric classification algorithm.
Chapter 1
Introduction
1.1 Motivation
1.1.1 Cancer and its impact on Society
Cancer is the name given to a set of diseases characterized by the uncontrolled growth and spread of
abnormal cells. This continuous and unrestrained cell division can result in the death of the patient [1].
Cancer is a major health problem (Figure 1.1). It is the second leading cause of death in high-income
countries, and the third in low- and middle-income countries. Cancer is responsible for more deaths
than AIDS, tuberculosis and malaria combined: one in seven deaths worldwide is due to cancer [2].
In 2012, there were 14.1 million new cancer cases, of which 8 million occurred in economically
developing countries, and an estimated 8.2 million cancer deaths (approximately 22,000 people
per day). Worldwide, cancer of the lung, bronchus and trachea is the leading cause of cancer death
among males, followed by liver cancer. Among females, breast cancer leads, followed by cancer of the
lung, bronchus and trachea (Figure 1.2) [2].
Besides the enormous impact cancer has on the number of people it affects, it also represents an
immense economic burden. In 2010, the 13.3 million new cases of cancer were estimated to cost the
world US$290 billion: approximately 53% in medical costs, 24% in income losses and the remainder in
non-medical expenses. This value is expected to rise to US$458 billion in 2030, accounting for 21.5
million new cases of cancer [3, 4] (no reliable source of more recent worldwide cancer costs was found).
About 5% of all cancers are associated with an inherited genetic alteration that might lead to one or
more specific types of cancer. However, most cancers result from damage to genes occurring during a
person's lifetime. This damage can be caused by internal or external factors (Figure 1.3) [1].
1.1.2 Circulating Tumor Cells
Cancer involves the malfunction of genes that control the growth and division of cells. These cells
are less specialized than normal cells and are able to ignore signals that would either prevent their
division or promote their apoptosis [5]. Cancer cells can induce nearby cells to form blood vessels that
supply tumors with oxygen and nutrients while at the same time removing waste, providing ideal growth
conditions.
Figure 1.1: Estimated Number of New Cancer Cases by World Area 2012 [2].
Cells from a primary tumor detach and travel through the circulatory or lymphatic systems, and are
therefore called Circulating Tumor Cells (CTCs). The microscopic observation of these cells was first
described in 1869 by Thomas Ashworth [6]. These cells can generate new colonies in sites far from
where the first tumor was located, a process designated metastasis [7] (Figure 1.4).
In several tumors, this process has already occurred when the primary tumor is detected [5] leading
to a high rate of compromised treatments. Approximately 90% of the deaths in cancer patients are due
to metastasis [8, 9, 10].
The presence of CTCs in metastatic cancer patients is associated with poor survival prospects.
Improvements in treatment and progress in early-stage diagnosis can translate into higher survival rates.
However, the increasing number of treatment options (chemotherapy, radiation therapy, surgery, targeted
therapy, immunotherapy, etc.) has raised the need for methods that determine whether the intended therapy
is being effective [11]. Ideally, these methods would be non-invasive and provide a real-time analysis of
the tumor activity. Several studies have disclosed that a change in the CTC count could be an indicator
of treatment effectiveness [12]; therefore, assessment of CTCs may satisfy this need. In case the tumor
was not completely eliminated from the body, tumor cells will remain dormant or expand. When they
Figure 1.2: Estimated New Cancer Cases (left) and Deaths Worldwide (right) for Leading Cancer Sitesby Level of Economic Development, 2012. (*Excluding non-melanoma skin cancer. Estimates may notsum to worldwide total due to rounding) [2].
Figure 1.3: Relative contribution of external factors to cancer incidence. Adapted from [3].
Figure 1.4: The metastatic process: cells detach from a primary tumor, penetrate the surrounding tissue,enter nearby blood vessels (intravasation) and circulate in the vascular system. Some of these cellseventually adhere to blood vessel walls and are able to extravasate and migrate into the local tissue,where they can form a secondary tumor. [7].
form a detectable metastasis, the cells may no longer be as sensitive as before to the same therapies,
and in some cases actually display resistance. This creates the need for a biopsy to assess the
best treatment options. Biopsies are invasive, difficult and not always possible from metastatic sites, thus
the possibility to isolate tumor cells from the blood provides a "real-time liquid biopsy". The study of
circulating tumor cells can provide game-changing methods to guide personalized therapies, increasing
the survival rate of patients [6].
This project addresses the problem of automated identification of CTCs in Small-Cell Lung Cancer
(SCLC). This type of cancer accounts for about 13-20% of all lung cancer cases and, without treatment,
leads to death within 2 to 4 months [13]. SCLC is strongly associated with cigarette smoking [14],
and is characterized by a high propensity for widespread metastases, often present at an early
disease stage. In limited-stage disease (localized disease), the 5-year survival rate is approximately
10% (maximum 26%), whereas in extensive-stage SCLC (metastasised disease) there is a high initial
response to chemotherapy, although few patients survive beyond the first two years [15].
1.1.3 Detection of Circulating Tumor Cells
There are several systems for CTC detection; the most widely used is the CellSearch system
(Janssen Diagnostics, LLC; Raritan, NJ), which has been thoroughly validated in patients with metastatic
cancer [15]. The system enriches cells from 7.5 ml of blood expressing the epithelial cell adhesion
molecule (EpCAM) antigen and identifies CTCs as nucleated cells (DAPI-DNA) expressing cytokeratin
8/18 or 19 (CK-PE) and lacking the leukocyte antigen (CD45-APC). Several reports suggest that CTCs
can be effectively detected with this test system, also in SCLC [6]. An alternative system for data
collection (in the form of images) is the functionalized and structured medical wire (FSMW) [11], which
will be reviewed later in chapter 2, along with other systems for CTC detection.
After the images are acquired by this system, expert reviewers classify a Region of Interest (ROI) as
a CTC if it has an oval or cell-like morphology, is DAPI and CK positive, CD45 negative and greater
than 4 µm (Figure 1.5) [12, 6]. The biggest challenge in the classification of circulating tumor cells is the
heterogeneity in morphology, partially caused by the large diversity in the viability or apoptotic stage of
the CTCs, which makes it difficult to set criteria on what can be considered a CTC. Extensive training
is needed to keep the variations (inter- and intra-reviewer) in assigning objects as CTCs to a minimum.
Inter-reviewer variability in CTC enumeration can range from 4% to 31% (median 14%) [12]. Additionally, CTCs
are very rare cells in blood: in patients with metastatic cancer there is approximately 1 CTC per mL of
blood, surrounded by approximately 5 × 10^6 white blood cells and 5 × 10^9 red blood cells [6].
Figure 1.5: CellSearch thumbnail gallery. The software of the CellSearch CellTracks displays thumbnailsof all objects that are positive for both DAPI and CK. Events 337, 340, and 341 show a CTC: positivefor DAPI and PE and negative for CD45. Note the weak CD45-staining of several white blood cells inevents 340 and 341 [6].
Automated classification of CTCs is relevant to provide the "real-time liquid biopsy" and treatment
assessment mentioned in section 1.1.2, to eliminate operator error in classification and to make the
process more time efficient [11].
1.2 Proposed Approach
The image data used in this thesis was provided by the Cancer ID project (http://www.cancer-id.eu/)
team of the University of Twente. The dataset consists of images from blood samples obtained from 59 pa-
tients with SCLC; blood collection was done before chemotherapy, after one cycle and at the end of chemotherapy.
Each blood sample corresponds to one cartridge, i.e. 175 four-channel TIFFs, acquired with a fluorescence-
based microscopy system (CellTracks™ Analyzer II), using a 10X NA 0.45 objective with filters for DAPI,
PE, APC and FITC (a biomarker not used for feature extraction). This dataset was previously described
and analysed by Hiltermann et al. [15]. The dataset will be detailed in chapter 5.
In the present work, a system composed of two main components is proposed for CTC identification:
image processing and machine learning (Figure 1.6).
The image processing block contemplates a solution for edge removal, image normalization, image
segmentation (triangle threshold method), ROI analysis and the extraction of morphological features (area,
eccentricity, perimeter, perimeter-to-area ratio), quantitative intensity-related features (mean and maxi-
mum intensity, standard deviation of the intensity signal, mass) and texture-related features (local contrast,
local entropy, histogram of oriented gradients). Before stepping into the classification part, outliers (for
example a ROI with an area too big to be considered a cell) were removed from the dataset. The machine
Figure 1.6: Proposed approach.
learning block aims to compare the performance of four different classification algorithms: Support Vec-
tor Machine (SVM), k-Nearest Neighbor (k-NN), AdaBoost and RUSBoost. Parameter estimation was
performed within a nested cross-validation procedure.
1.3 Original Contribution
In this project we propose innovative methods for automated CTC enumeration in both of the main
components (image processing and classification) and also regarding the type of cancer. The auto-
mated identification of CTCs in previous works (using either the CellSearch System or the FSMW) was
for breast [11, 16, 17], colorectal [17], non-small-cell lung [11] and castration-resistant prostate
cancer [12].
Regarding feature extraction, we present a new texture feature: the histogram of oriented gradients. For
classification, Support Vector Machines (along with Naive Bayes classifiers) have been previously
used on the FSMW (breast and non-small-cell lung cancer), with color histograms as features [11]. Using
the CellSearch technology, the classifier studied was the Random Forest, with morphological, texture,
quantitative and correlation features; however, those images were acquired with a camera of im-
proved resolution (Time Delay and Integration camera using a 40X 0.6 NA objective) [17]. Thus, the use
of SVMs on the CellSearch System and the use of the k-NN, AdaBoost and RUSBoost algorithms introduce a
new approach for the problem at hand.
Additionally, this project tries to deal with data imbalance, a problem that has not been addressed
before in the context of automated CTC enumeration.
1.4 Thesis Outline
The remainder of this dissertation is organized in the following way: chapter 2 presents the state of
the art, highlighting the most relevant contributions to CTC enumeration along with the most important
contributions to the algorithms used. In chapter 3, each technique used in feature extraction is
thoroughly described. Then, chapter 4 explores each classification algorithm used, covering its
fundamentals. Chapter 5 follows, covering the experimental design and its results. Finally,
chapter 6 concludes this thesis, summarizing the results and highlighting future work.
Chapter 2
State of the Art
2.1 Introduction
In the past 12 years, there has been a growing interest in developing systems for the enumeration of
Circulating Tumor Cells with the help of expert reviewers [18]. These systems are of high relevance
to assess disease progression, treatment effectiveness and survival prognosis without being invasive.
Only in recent years has the automation of these systems been a focus of study, an extremely important
topic due to the high dependence on the reviewers' expertise, the inter- and intra-reviewer variability and
the impact these factors have on patients' diagnosis [12, 16, 11, 17].
This chapter reviews the main trends and most important contributions in this field. First, section
2.2 presents a short overview of the available systems for CTC enumeration. Section 2.3 then highlights
the major contributions on feature extraction applied to the study of CTCs. Section 2.4 sum-
marizes the machine learning and classification techniques that successfully distinguished CTCs from
other possible classes. Finally, section 2.5 briefly describes the most important systems for automated
detection of CTCs and summarizes the existing solutions.
2.2 Biomarker
There are two kinds of systems for CTC enumeration: PCR-based and cytometry-based.
The CellSearch System is the only FDA-approved system of the latter type. Table 2.1 presents a summary
of the advantages and disadvantages of the several cytometric approaches. Currently, the most
reviewed system in the literature is the CellSearch System [6, 19, 20, 21, 22].
In addition to the different systems for CTC enumeration, there is also the possibility of using different
markers. In this project, ferrofluids with EpCAM (epithelial cell adhesion molecule, to select cells of ep-
ithelial origin) and the staining reagents DAPI-DNA (4′,6-diamidino-2-phenylindole, dihydrochloride, for
a nuclear stain), PE-CK (cytokeratin 8, 18 Phycoerythrin and cytokeratin 19 Phycoerythrin) and CD45-
APC (CD45-allophycocyanin, to label leukocytes) were used. However, the replacement of cytokeratin
antibodies with other staining reagents that target certain molecules allows a better assessment of
specific CTCs, for example: a staining reagent for Her-2 for breast cancer, Bcl-2 for non-small-cell lung cancer
and non-Hodgkin's lymphomas, and/or AR for castration-resistant prostate cancer [23].
Table 2.1: Summary of different cytometric approaches for CTC enumeration. Adapted from [18]

CellSearch
  Advantages: Semi-automated; high sensitivity; CTC quantification; reproducible; recognition of a fixed marker (EpCAM, CKs, CD45); visual confirmation of CTCs; FDA approved.
  Disadvantages: Only EpCAM+/CK+/CD45- CTCs detected; subjective image interpretation; no further analysis possible.

CTC-chip
  Advantages: 98% cell viability; visual confirmation of CTCs; high detection rate; further analysis possible.
  Disadvantages: Only EpCAM-positive CTCs detected; not commercially available; subjective CTC analysis; lack of validation studies in clinical settings.

EPISPOT
  Advantages: Analysis only on viable cells; high sensitivity.
  Disadvantages: CTC isolation not possible, thus no further analysis possible; need of active protein secretion; no morphological analysis possible; technically challenging.

FAST
  Advantages: Scan analysis of large volume of sample; cell loss minimised; no enrichment needed; quick analysis (up to 300,000 cells/s).
  Disadvantages: Subjective CTC analysis; lack of validation studies in clinical settings.

FISH
  Advantages: Genetic analysis.
  Disadvantages: Further analysis not possible.

Flow Cytometry
  Advantages: High specificity; multiple parameters.
  Disadvantages: Low sensitivity.

FSMW
  Advantages: CE certified; in vivo samples; screening of large blood volume.
  Disadvantages: Subjective analysis; technically challenging.

LSC
  Advantages: Fast; no enrichment needed; visual confirmation of CTCs; high specificity.
  Disadvantages: Subjective analysis; technically challenging; low sensitivity.
2.3 Image Processing
Regarding image processing and feature extraction, there are several focus points to be considered.
First, the selection of the analysis area: when processing images from a cartridge, several of them contain
the edge of the cartridge, which should be removed. To date, one solution has been proposed for the
detection and removal of the sample border, via thresholding of the FITC channel (the fourth channel,
which is not used as a marker), a necessary step to obtain the true imaging area [12].
Second, if the images are retrieved with different machines, under different light conditions, or present
too much noise, there might be a need for image normalization. The Naka-Rushton filter was intro-
duced in the analysis of circulating tumor cells by Svensson et al. [11], for the enhancement of foreground
objects and the suppression of background noise. The use of top-hat background subtraction algorithms
can lead to the presence of negative values and/or the formation of extra contrast; the proper
background subtraction method would thus be to record a black image with no objects present
and subtract it from the images with objects. However, most of the time this black image
is not available [12]. Following edge removal, Svensson et al. [11] also implement a Gaussian blurring
filter for image smoothing.
To locate objects and their outlines, segmentation techniques need to be implemented. These can
be divided into two classes: contour-based and region-based. Contour-based techniques require edge
enhancement steps to find the contours or edges of objects. Region-based techniques comprise texture analysis,
watershedding and intensity thresholding (local or global). Svensson et al. [11] applied the watershed
algorithm to the DNA channel, followed by the use of a random forest to decide whether or not the ROI
should be considered a candidate for further classification. Ligthart et al. [12] implemented several
algorithms for image segmentation in the study of CTCs, such as Zack's triangle threshold via the channel
image histogram, Otsu's threshold and the isodata algorithm (Figures 2.1 and 2.2).
Figure 2.1: Comparison of common thresholding procedures. Two original images containing a small(1A) and large number of objects (1B) were thresholded using three methods: triangle (2A and 2B),otsu (3A and 3B), and isodata (4A and 4B). The three methods give similar results on an image witha large number of objects, but triangle finds the correct number of objects in images which contain asmall number of objects. Image A1 is shown using a logarithmic intensity scale to show the texture inthe background; the left part of the image is part of the cartridge border. [12].
Figure 2.2: Detail of a PE image (1), and masks as thresholded by the triangle (2), otsu (3), and isodata(4) methods. [12].
Finally, in order to analyse each cell, several different features have been studied, such as color
histograms [11] and quantitative [17, 12, 16], correlation [17], texture [17, 12, 16] and morphological
features [17, 12, 16]. Further details regarding these are presented in the column "Features" of Table 2.2.
2.4 Machine Learning and Performance Evaluation
In order for the CTC enumeration system to be automatic, some kind of classification is needed,
which can be achieved by the implementation of machine learning algorithms. Classifiers
fall into the category of supervised learning machines and can be divided into two categories: generative
models and discriminative models. The generative approach focuses mainly on trying to learn the prob-
ability functions behind the problem and classifies a given pattern based on the most probable output
label. A discriminative approach focuses directly on the prediction.
Before stepping into the actual classification of each cell as CTC or non-CTC, Svensson et al. [11]
proposed the implementation of a Random Forest classifier to identify relevant ROIs and, only after this,
to proceed to the classification itself. In this step, the features used were area and perimeter-to-area ratio.
To date, several classification approaches have been presented for the automated classification
of CTCs. The first classification method is not a machine learning implementation; it is based on nu-
meric inclusion (for example: the size is within a certain range of values, the peak intensity in the DAPI-DNA
channel and the standard deviation of the CK-PE channel are bigger than specified thresholds, and the peak in-
tensity of the CD45-APC channel is smaller than a determined constant) [12, 16]. In recent years, more
advanced techniques have been explored. Regarding generative models, both Naive Bayes classifiers
[11] and Random Forests [17] have been successfully implemented. Support Vector Machines [11], a
discriminative method, have also been studied for this problem and performed well.
For performance evaluation, cross-validation is the procedure used by Svensson et al. [11]. When
evaluating the performance of a classification algorithm applied to the identification of CTCs, it should
be taken into consideration that the dataset is highly imbalanced due to the incredibly low number of
CTCs when compared to the number of non-CTCs in a sample, as explored in section 1.2. Therefore,
accuracy might not always be the most informative measurement.
The class imbalance problem has not been addressed before in this context.
2.5 Summary
This chapter reviewed the existing work on biomarkers for the enumeration of CTCs, on image processing
techniques and on classification algorithms used in automated CTC identification. Table 2.2
lists, chronologically, the most relevant studies in the development of an automated system for
the enumeration of Circulating Tumor Cells.
Table 2.2: Performance of different automated CTC enumeration systems. Acronyms: Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Region of Interest (ROI), Castration Resistant Prostate Cancer (CRPC), Apoptotic (Apop.).

Ligthart et al., 2011 and 2013 [12, 16]
  Biomarker/Enrichment: CellSearch; EpCAM, CK-PE, CD45-APC, DAPI-DNA
  Camera: 10x/.45NA
  Features: standard deviation CK-PE; peak DNA-DAPI; peak CD45-APC; size
  Classification technique: numerical inclusion (gating)
  Participants: 100 CRPC
  Results (%): error rate by class

Scholtens et al., 2012 [17]
  Biomarker/Enrichment: CellSearch; EpCAM, CD45-APC, CK-PE, DAPI-DNA
  Camera: TDI 40x/.6NA
  Features: for each channel: area, perimeter, circularity, max. caliper, contrast mean, correlation range, homogeneity mean, entropy mean, total intensity, standard deviation, maximum value; correlation between channels (DAPI/PE, PE/APC, APC/DAPI); total intensity ratio; R2; slope
  Classification technique: Random Forest (5 classes: CTC, Apop. CTC, CTC Debris, Leukocytes, Debris)
  Participants: 31 primary breast or colorectal cancer; 37 metastatic breast or colorectal cancer; 9 healthy
  Results (%): CTC: 10.2; Apop. CTC: 34.1; CTC Deb.: 9.5; Leuk: 4.0; Debris: 10.8; Total: 9.6

Svensson et al., 2014 [11]
  Biomarker/Enrichment: FSMW; EpCAM, CK, CD45, Hoechst (nuclear dye)
  Camera: 10x/.3NA, 20x/.5NA, 40x/1NA
  Features: area; perimeter-to-area; RGB histograms
  Classification technique: Random Forest (ROI identification); SVM (RBF kernel); NBC (unsupervised); NBC (semi-supervised)
  Participants: 617 ROIs
  Results (%), in the order of the classifiers above — ACC: 99, 89, 87, 88; SENS: 51, 87, 85, 85; SPEC: 96, 93, 92, 93
Chapter 3
Image Processing
3.1 Introduction
The goal of this thesis is to build and study a system for automated enumeration of CTCs, using 4-
channel TIFF images acquired with the CellSearch System. In section 3.2, the algorithms for image
normalization, edge detection and image segmentation are described. The following section (section
3.3) concerns the extracted features. Lastly, section 3.5 summarizes the implemented approaches.
A great deal of the code used (and partly adapted) in this chapter was developed and provided
by the Cancer ID team of the University of Twente.
3.2 Image Preprocessing & ROI Identification
3.2.1 Image Normalization
An essential step in order to quantitatively compare objects is image normalization. This was per-
formed in the following way: "all imported 8-bit multipage TIFF images were scaled from 0–255 and
had to be re-scaled to pseudo 12-bit using information stored in the TIFF-header" [12], namely an offset and a
maximum value related to the IMMC/Veridex TIFF scaling, using equation 3.1:

ImageToSegment = Offset + OriginalImage × (MaximumValue − Offset) / max(OriginalImage)    (3.1)

This solution has been proposed, validated and implemented by Ligthart [12] for the type of
datasets used in this thesis.
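To make the rescaling concrete, equation 3.1 can be sketched in a few lines of NumPy. The offset and maximum value are plain arguments here; in practice they would be read from the TIFF header (the numeric values in the example below are illustrative, not taken from an actual cartridge).

```python
import numpy as np

def rescale_to_pseudo_12bit(original, offset, maximum_value):
    """Rescale an 8-bit image to a pseudo 12-bit range following Eq. (3.1).

    `offset` and `maximum_value` would come from the TIFF header
    (IMMC/Veridex scaling); here they are plain arguments.
    """
    original = original.astype(np.float64)
    scale = (maximum_value - offset) / original.max()
    return offset + original * scale

# Example: an 8-bit image mapped into a 12-bit-like range [64, 4095].
img8 = np.array([[0, 128], [64, 255]], dtype=np.uint8)
img12 = rescale_to_pseudo_12bit(img8, offset=64.0, maximum_value=4095.0)
```

Note that the brightest original pixel maps exactly to the header maximum and a zero pixel maps to the offset, which is what makes intensities comparable across cartridges.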
3.2.2 Edge Detection
Each dataset corresponds to one blood sample, therefore one cartridge (one scan), which corre-
sponds 175 images. Some of these images have present the cartridge border. For correct ROI seg-
mentation it was necessary to detected the sample border and exclude the outside area from further
15
analysis. This was accomplished via thresholding in the FITC channel (a debris channel, not used for
pattern extraction), however cartridges have very irregular edges, specially at the corners, making it
necessary to compare the total selected area of the whole cartridge to a training set that was acquired
manually. The algorithm is presented below [16]:
1. All FITC images from one dataset are sub-sampled by a factor of eight (neglecting small details
and avoiding unnecessary memory requirements);
2. Images containing an oriented border are convolved with a line-shaped filter to enhance that
orientation and close the border (gaps and different intensities were common issues);
3. In order to construct an image of the total cartridge, images were connected to each other, Figure
3.1, panel 1;
4. Edge boosting: gradient magnitude filter (using a gaussian derivative with a width of 8 pixels),
Figure 3.1, panel 2;
5. Edge detection: the border only takes up a small part of the image, so it does not show a large
peak in the histogram, therefore the triangle threshold method (detailed in subsection 3.2.3) was
used with the total image histogram for edge detection, Figure 3.1, panel 3;
6. The thresholded mask was inverted and the holes in the image were filled, to obtain the selec-
tion area where cells are located;
7. The result was validated by comparison to the possible area range, between 72 and 92 mm² (see
Figure 3.1, panel 4). If the detected area failed this verification, boundaries were estimated using
results from a fixed set of previously analysed cartridges.
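The steps above can be condensed into a heavily simplified, NumPy-only sketch. This is an illustration, not the pipeline actually used: `np.gradient` stands in for the Gaussian-derivative gradient magnitude filter, a fixed percentile stands in for the triangle threshold, the convolution and hole-filling steps are omitted, and `mm2_per_pixel` is an assumed calibration constant.

```python
import numpy as np

def border_mask(stitched, area_range_mm2=(72.0, 92.0), mm2_per_pixel=1e-3):
    """Sketch of the cartridge-border detection of Sec. 3.2.2 (simplified)."""
    sub = stitched[::8, ::8]                       # step 1: sub-sample by 8
    gy, gx = np.gradient(sub.astype(np.float64))   # step 4: edge boosting
    mag = np.hypot(gx, gy)
    edges = mag > np.percentile(mag, 95)           # step 5: threshold stand-in
    scan_area = ~edges                             # step 6: invert (hole filling omitted)
    area_mm2 = scan_area.sum() * mm2_per_pixel     # step 7: size verification
    ok = area_range_mm2[0] <= area_mm2 <= area_range_mm2[1]
    return scan_area, ok
```

When the size verification fails (`ok` is False), the real pipeline falls back to boundaries estimated from previously analysed cartridges, as described in step 7.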
3.2.3 Image Segmentation
Given that every object present in the images is slightly visible above the background, a
basic histogram-based thresholding algorithm is enough to segment the image. The algorithm chosen
to perform this task was Zack's triangle threshold method.
This geometric method assumes a maximum peak near one end of the histogram of pixel intensities
and searches towards the other end, as presented in Figure 3.2. A region of the image with a higher
intensity than the defined threshold was considered an object of interest. By adjusting the search
threshold until the average brightness of the pixels contiguous to the segmented object was within a
small fixed offset of the average background intensity, one can account for the variations in staining
intensity [24]. In cases where the maximum is not near one of the histogram's extremes, the algorithm
searches for the threshold within the largest range.
The segmentation was performed over the DNA channel.
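The triangle method itself can be written compactly. The sketch below, assuming a 1-D intensity histogram as input, draws a line from the histogram peak to the far end of the larger search range and returns the bin with the greatest perpendicular distance to that line:

```python
import numpy as np

def triangle_threshold(hist):
    """Zack's triangle threshold on an intensity histogram (a sketch)."""
    hist = np.asarray(hist, dtype=np.float64)
    peak = int(hist.argmax())
    nonzero = np.nonzero(hist)[0]
    first, last = nonzero[0], nonzero[-1]
    # Search on the longer side of the peak, as described in the text.
    end = last if (last - peak) >= (peak - first) else first
    bins = np.arange(min(peak, end), max(peak, end) + 1)
    # Perpendicular distance of each histogram point to the peak-to-end line.
    x0, y0, x1, y1 = peak, hist[peak], end, hist[end]
    num = np.abs((y1 - y0) * bins - (x1 - x0) * hist[bins] + x1 * y0 - y1 * x0)
    den = np.hypot(y1 - y0, x1 - x0)
    return int(bins[np.argmax(num / den)])
```

Applied to a histogram with a sharp peak near zero and a long tail (the typical shape of the DNA-channel histogram), the returned bin lands just past the "knee" of the distribution.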
Figure 3.1: ”Example of selection of cartridge scan area. 1: original FITC images of one side of acartridge stitched together after application of linear convolution filter to border images (arrow indicatesan air bubble), 2: border enhanced image by gradient magnitude filtering, 3: Binary image of thresholdedborders (red color), 4: Selected scan area (red color) after inversion of image 3, binary propagation ofcenter square, and size verification.” [12]
Figure 3.2: ”Determination of the global search threshold for each picture. The threshold (THR) wasselected by normalizing the height and dynamic range of the intensity histogram, locating point A asshown, and then adding a fixed offset.” [24]
3.3 Feature Extraction
In CTC analysis, several features have been tested (Table 3.1). Below, the ones this project focuses
on are presented.
3.3.1 Morphological Features
In the field of cell analysis, morphology can reveal important information about the type of cell we
might be dealing with. Some Circulating Tumor Cells might be within a range of sizes or have a specific
shape. Additionally, shape-related features, like eccentricity, can give an insight into whether a ROI
is a cell or not (for example, white blood cells and Circulating Tumor Cells are typically not
rectangular).
The morphological features extracted and analysed in this project were: area, perimeter, eccentricity
and perimeter-to-area ratio (P2A). The latter was computed as follows:

P2A = Perimeter² / (4π · Area)    (3.2)
All the other features were extracted using regionprops, a MATLAB function available in the Image
Processing Toolbox. Regionprops takes as input the segmentation mask, the original
image and the features intended to be extracted.
The area corresponds to the sum of the number of pixels in a certain region, and the perimeter is the
distance between each adjoining pair of pixels around the border of a region. The eccentricity is given
by the ratio of the distance between the foci of the ellipse and its major axis length. The value is between
0 and 1, where 0 corresponds to a circle and 1 to a line segment.
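A NumPy stand-in for this regionprops-based extraction might look as follows. It is a sketch, not the MATLAB implementation used in the thesis: the perimeter is approximated by counting boundary pixels (regionprops uses a distance-based estimate), and the eccentricity is derived from the second central moments of the pixel coordinates.

```python
import numpy as np

def morphological_features(mask):
    """Area, perimeter, eccentricity and P2A for one ROI (Sec. 3.3.1 sketch)."""
    mask = mask.astype(bool)
    area = mask.sum()
    # Boundary pixels: object pixels with at least one 4-neighbour outside.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~interior).sum()
    # Eccentricity of the ellipse with the same second moments.
    ys, xs = np.nonzero(mask)
    cov = np.cov(np.vstack([ys, xs]))
    lmax, lmin = np.linalg.eigvalsh(cov)[::-1]
    eccentricity = np.sqrt(max(1.0 - lmin / lmax, 0.0)) if lmax > 0 else 0.0
    p2a = perimeter ** 2 / (4.0 * np.pi * area)    # Eq. (3.2)
    return area, perimeter, eccentricity, p2a
```

For a perfectly symmetric region such as a filled square, the two moment eigenvalues coincide and the eccentricity is 0, matching the circle-like end of the regionprops scale.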
3.3.2 Intensity Features
The intensity-related features provide a quantitative analysis of the cells. The first obvious outcome
of intensity-related features is information on the representativeness of a certain object. The features
extracted were the mean and maximum intensity of the ROIs, and two others that are not strictly intensity
features: the standard deviation of the intensity signal and the mass of the ROI. These two were considered
intensity features in order to have a more balanced number of features in each test done later in chapter
5. The standard deviation of the intensity signal can also be used as a texture descriptor, as a measure
of average contrast. The mass, defined as the sum of the intensities of all pixels present in a
ROI, is a feature that is also related to the morphology of the cells.
Maximum and mean intensity were extracted using regionprops, a MATLAB function available in
the Image Processing Toolbox. Additionally, the pixel values (the intensity of each pixel in the
ROI) were also extracted using this function, in order to compute both the standard deviation and the
mass.
For each ROI, these four features were obtained for the DNA, CK and CD45 channels.
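As a sketch, the four intensity features for one channel can be computed directly from the ROI's pixel values; the channel names and the tiny example values below are purely illustrative.

```python
import numpy as np

def intensity_features(roi_pixels):
    """The four intensity-related features of Sec. 3.3.2 for one channel."""
    roi_pixels = np.asarray(roi_pixels, dtype=np.float64)
    return {
        "mean": roi_pixels.mean(),
        "max": roi_pixels.max(),
        "std": roi_pixels.std(),   # also usable as an average-contrast texture measure
        "mass": roi_pixels.sum(),  # sum of intensities over the ROI
    }

# One feature set per channel (DNA, CK, CD45); example values are illustrative.
features = {ch: intensity_features(vals)
            for ch, vals in {"DNA": [1, 2, 3], "CK": [4, 4], "CD45": [0, 8]}.items()}
```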
3.3.3 Texture Features
Texture descriptors can also give some deeper insight when analysing an object of interest. The
extracted texture features were the median of the local contrast, the median of the local entropy, the
median of the gradient amplitude and the Histogram of Oriented Gradients (HOG).
The median was computed for the local contrast, local entropy and gradient amplitude because
each ROI had a different size and, for classification, each input vector was required to have the
same size.
The local contrast is the range value in a specified neighborhood around the corresponding pixel in
the input ROI. The range value is determined as the maximum intensity value minus the minimum intensity
value in a 3-by-3 neighborhood.
The local entropy measures the randomness of an image and is computed as follows:

e = −Σ_{i=0}^{L−1} p(z_i) log₂ p(z_i),    (3.3)

where z_i indicates the intensity, p(z) is the histogram of the intensity levels in a region and L is the
number of possible intensity levels [25].
The gradient of an image represents a directional change in intensity. The gradient amplitude
encodes edges and local contrast. Using a Sobel filter, the directional gradients Gx and
Gy are first computed with respect to each of the image axes (x and y). The gradient magnitude and direction
are then computed from these orthogonal components.
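A loop-based, NumPy-only sketch of the three median texture features follows; this is an illustration rather than the MATLAB implementation used in the thesis, and it only visits interior pixels (border handling is omitted for brevity).

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)

def _neighborhoods(img):
    """Yield the 3x3 neighborhood around every interior pixel."""
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            yield img[i - 1:i + 2, j - 1:j + 2]

def texture_medians(roi, levels=256):
    """Median local contrast, local entropy and Sobel gradient magnitude
    for one ROI (Sec. 3.3.3 sketch)."""
    roi = np.asarray(roi, dtype=np.float64)
    contrast, entropy, magnitude = [], [], []
    for nb in _neighborhoods(roi):
        contrast.append(nb.max() - nb.min())        # 3x3 range filter
        p = np.bincount(nb.astype(int).ravel(), minlength=levels) / nb.size
        p = p[p > 0]
        entropy.append(-(p * np.log2(p)).sum())     # Eq. (3.3)
        gx = (SOBEL_X * nb).sum()                   # directional gradients
        gy = (SOBEL_X.T * nb).sum()
        magnitude.append(np.hypot(gx, gy))
    return np.median(contrast), np.median(entropy), np.median(magnitude)
```

On a perfectly uniform ROI all three medians are zero, which is the expected degenerate case: no range, no randomness, no edges.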
Lastly, for each ROI a 10-bin Histogram of Oriented Gradients (HOG) was extracted. The idea be-
hind the HOG feature descriptor is that an object's appearance and shape can be described by its distribution
of intensity gradients or edge directions, and it presents a certain degree of invariance to transformations
or rotations. To compute the HOG, an image is divided into small connected regions (cells); a
histogram of gradient directions is then obtained for the pixels within each cell. The descriptor is the
result of the concatenation of these histograms [26].
Dalal and Triggs [26] presented the following steps for the computation of the HOG. The
first step is the computation of the gradient values by applying 1-D centered derivative masks in both the
horizontal ([−1, 0, 1]) and vertical ([−1, 0, 1]ᵀ) directions. This step is followed by the creation of cell his-
tograms: based on the values obtained in step one, each pixel within a cell contributes to an orientation-
based histogram with a weighted vote based on the gradient magnitude. The cells have a square shape
and the histogram values range from 0 to 180 degrees. The final step is the construction of descriptor
blocks: cells are grouped together into larger blocks and gradient strengths are normalized locally,
thereby accounting for changes in illumination and contrast. The HOG descriptor is the concate-
nated vector resulting from the normalized cell histograms.
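The Dalal-Triggs steps above can be condensed into a minimal NumPy sketch. Block normalization is simplified here to a single L2 normalization over the concatenated cell histograms, so this is an illustration of the idea rather than a full HOG implementation; the cell size is an assumed parameter.

```python
import numpy as np

def hog_10bins(roi, cell=8):
    """Minimal 10-bin HOG sketch: centred [-1, 0, 1] derivative masks,
    unsigned orientations (0-180 degrees), magnitude-weighted votes per
    cell, and one global L2 normalization standing in for block normalization."""
    roi = np.asarray(roi, dtype=np.float64)
    gx = np.zeros_like(roi); gy = np.zeros_like(roi)
    gx[:, 1:-1] = roi[:, 2:] - roi[:, :-2]        # horizontal mask [-1, 0, 1]
    gy[1:-1, :] = roi[2:, :] - roi[:-2, :]        # vertical mask [-1, 0, 1]^T
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hists = []
    for i in range(0, roi.shape[0] - cell + 1, cell):
        for j in range(0, roi.shape[1] - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hists.append(np.bincount((a / 18.0).astype(int) % 10,
                                     weights=m, minlength=10))
    h = np.concatenate(hists)
    norm = np.linalg.norm(h)
    return h / norm if norm > 0 else h
```

A vertical step edge, for instance, produces purely horizontal gradients, so all the descriptor energy falls into the 0-degree bin.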
3.4 Post-Processing
In order to remove samples that might not be relevant for this system, ROIs were visually inspected,
and it was decided that every ROI with an area smaller than 9 pixels or larger than 3000 pixels would be
excluded from further analysis.
In addition, after the data was separated into train and test sets, the features were normalized so
that each feature had zero mean and unit standard deviation. This step was necessary for classifiers
based on distances, such as the k-NN, to perform correctly.
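The normalization step can be sketched as follows; the key point is that the mean and standard deviation are estimated on the training set only and then applied unchanged to the test set, so no information leaks from test to train.

```python
import numpy as np

def zscore_train_test(train, test, eps=1e-12):
    """Zero-mean / unit-variance feature scaling using training statistics only."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + eps   # eps guards against constant features
    return (train - mu) / sd, (test - mu) / sd
```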
3.5 Conclusion
This chapter summarized the image processing approaches implemented in this thesis. Section 3.2
described the normalization, edge detection and image segmentation algorithms. The following section
highlighted the extraction of features. Table 3.1 summarizes the extracted features.
Table 3.1: Summary of the extracted features (P2A - Perimeter to Area Ratio, Max. - Maximum, ch. - channel, HOG - Histogram of Oriented Gradients)

Morphological: Area; Eccentricity; Perimeter; P2A
Intensity: Mean Intensity (DNA, CK, CD45 ch.); Max. Intensity (DNA, CK, CD45 ch.); Standard Deviation Int. (DNA, CK, CD45 ch.); Mass (DNA, CK, CD45 ch.)
Texture: Median of Local Entropy (DNA, CK, CD45 ch.); Median of Local Contrast (DNA, CK, CD45 ch.); Median of Gradient Amplitude (DNA, CK, CD45 ch.); HOG (DNA, CK, CD45 ch.)
Chapter 4

Classification and Performance Evaluation
4.1 Introduction
In automated pattern recognition the system has to learn a model from the training instances and be
capable of classifying future unseen data based on the previously formed model. Given the problem at
hand, several approaches were analysed, namely, k -Nearest Neighbor, Support Vector Machines and
Boosting.
k-Nearest Neighbor (k-NN), despite its simplicity, has been successful in a large number of classification problems, and it was implemented for that reason. Notwithstanding, when dealing with imbalanced datasets, techniques that tackle this issue may have to be applied; in this project, both bootstrapping and the introduction of prior probabilities for each class were explored. Section 4.2 describes these techniques as well as the k-NN algorithm. The following section provides a comprehensive description of the concepts and mathematics behind Support Vector Machines, used in this thesis both because they are a popular discriminative method for classification and because they were successfully implemented before in the context of CTC enumeration. Boosting algorithms (AdaBoost and RUSBoost), which perform classification based on an ensemble of multiple simple classifiers, called "weak" classifiers, are also studied in this project due to their simplicity and high performance, and are addressed in section 4.4.
Evaluating classification performance is a necessary step in analysing the viability of this system, as well as in understanding which classifier is the most adequate for automated CTC enumeration. As such, in section 4.5, the technique known as Cross-Validation (CV) is used to estimate, in an unbiased fashion, performance measures such as balanced accuracy, sensitivity, specificity and Receiver-Operator Characteristic curves. Finally, section 4.6 concludes this chapter.
4.2 k-Nearest Neighbors
4.2.1 Basic Concepts
In 1951, Fix and Hodges presented the Nearest Neighbor (NN) Rule [27]. The NN decision rule is
quite intuitive: an unclassified data point is attributed the classification of the nearest point in the set of
previously classified points.
Later, in 1967, Cover and Hart [28] proved that when different sample classes do not overlap in the input space, the NN rule is asymptotically optimal, since the optimal Bayes probability of error is then equal to zero. Stone then introduced the k-nearest neighbor rule (k-NN), overcoming the sub-optimality of the NN rule in the common situation where classes do overlap: a new point is classified as the class most frequent amongst its k nearest neighbors [29]. The k-Nearest Neighbor algorithm is an instance-based learning method; it does not construct a general internal model, it simply stores the instances of the training data. When a new point is presented to the k-NN, it finds a predefined number of training samples closest to this new data point and predicts its label by majority voting, i.e., the query point is assigned to the class with the most representatives within the nearest neighbors.
Real-life datasets have a finite number of training samples and, most commonly, different classes do overlap. Therefore, it is essential that the distance metric used is suitable for the problem at hand. Several tuning parameters can be used to improve the k-NN's performance. In this project, different numbers of neighbors were analysed and, to tackle the class imbalance problem, bootstrapping and the use of prior probabilities were tested.
Throughout this work, the k-NN implementation used is the fitcknn function, available in the Statistics and Machine Learning Toolbox of MATLAB R2014a.
4.2.2 Mathematics
The theory behind NN is quite intuitive: consider an input space R^d, where d is the dimension of that space; patterns that are nearby in the input space most likely belong to the same class. Given a set of examples D_n = (x_1, y_1), ..., (x_n, y_n), where x_i ∈ R^d represents the input vectors and y_i the class labels, the NN rule classifies a query pattern X to the class of its nearest neighbour in the training data D_n. Without any prior knowledge, the Euclidean distance is typically used, which is defined as follows:

d(\mathbf{X}, \mathbf{x}_q) = \left( \sum_{k=1}^{d} \left| X_k - x_{q,k} \right|^2 \right)^{\frac{1}{2}}, \qquad (4.1)
where xq is a new unclassified pattern. When prior probabilities are added, a weight is assigned to
each class when computing the Euclidean distance.
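A minimal sketch of the unweighted k-NN rule under the Euclidean distance of equation 4.1 (Python for illustration; the thesis uses MATLAB's fitcknn, and the toy coordinates and labels here are made up):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    samples under the Euclidean distance."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D feature vectors: one cluster per class
train_x = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
train_y = ['non-CTC', 'non-CTC', 'CTC', 'CTC']
label = knn_predict(train_x, train_y, (4.8, 5.2), k=3)
```

Introducing class priors, as mentioned above, would amount to weighting these distances per class before voting.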
4.2.3 Bootstrap
Bootstrap, presented by Efron in 1979 [30], is a data-resampling strategy with several applications. In this project, it was used as follows: the bootstrap resamples, with replacement, our dataset into smaller new datasets with a more balanced ratio of CTC vs. non-CTC. For each new dataset a k-NN model is generated, which can be used to predict the class of a new pattern. In the end, the class of a new pattern is the mode of the predicted classes given by the individual k-NN models.
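The bootstrap-ensemble idea can be sketched as follows; the class-balanced resampling scheme and all parameter values below are assumptions made for illustration, not the exact procedure used in the thesis:

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=1):
    """1-NN/k-NN majority vote over (x, y) pairs."""
    dists = sorted((math.dist(x, query), y) for x, y in train)
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def balanced_bootstrap_knn(data, query, n_models=5, per_class=2, k=1, seed=0):
    """Resample, with replacement, the same number of instances from each
    class (a more balanced CTC vs. non-CTC ratio), fit one k-NN model per
    resample, and return the mode of the per-model predictions."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    preds = []
    for _ in range(n_models):
        resample = [rng.choice(members)
                    for members in by_class.values()
                    for _ in range(per_class)]
        preds.append(knn_predict(resample, query, k))
    return Counter(preds).most_common(1)[0][0]     # mode of the predictions

data = [((0.0, 0.0), 'non-CTC'), ((0.2, 0.1), 'non-CTC'), ((0.1, 0.3), 'non-CTC'),
        ((5.0, 5.0), 'CTC'), ((5.2, 4.9), 'CTC')]
pred = balanced_bootstrap_knn(data, (5.1, 5.1))
```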
4.3 Support Vector Machines
4.3.1 Basic concepts
In the early 90s, Boser, Guyon and Vapnik published the first paper presenting Support Vector Machines [31], a generalization to nonlinear models of the Generalized Portrait algorithm [32]. In 1995, Cortes and Vapnik introduced a notion vital for non-separable cases, the soft margin [33]. In 1998, Shawe-Taylor et al. [34] and Bartlett (2006) [35] proposed a rigorous bound on the generalization ability of the hard margin SVM and, in 2000, Shawe-Taylor et al. [36] presented the same bound for the soft margin case.
Assume a binary classification of linearly separable data, as depicted in Figure 4.1. The SVM al-
gorithm tries to find the hyperplane that maximizes the distance to the closest training vectors of each
class, the support vectors.
Figure 4.1: Example of a 2D linearly separable binary problem, where patterns of one class are represented by diamonds and the other by circles. The optimal separating hyperplane (solid line) maximizes the distance between the support vectors of each class (darker data points, limiting the margins). Adapted from [33].
The use of hard margin SVMs does not allow the fitted model to have any errors, which might lead to poor generalization ability. To overcome this issue two approaches are proposed: the use of kernels and the use of soft margins. The use of kernels supports the separation of non-linearly separable data through the mapping of the input space into a higher dimensional space, called the feature space (figure 4.2(a)). The latter extension of SVMs, the concept of soft margins, allows errors on the training instances while trying to minimize them, by relaxing the separability constraint (figure 4.2(b)). When data is not easily separable, either a highly complex kernel is applied, which might lead to overfitting and therefore poor generalization, or the concept of soft margins is applied.
(a) Kernel (b) Soft-margin

Figure 4.2: Examples of the use of the two adaptations of the SVM. 4.2(a) is an illustrative example of the use of a kernel (φ(x_1, x_2) = x_1^2 + x_2^2) when the data is separable, although no hyperplane in the input space is able to separate it; therefore, the data was mapped into a feature space, where the decision surface was computed. 4.2(b) presents a situation where the data is not linearly separable in the input space. Two options are available: either the decision surface is the one presented by the dotted line, which may lead to overfitting and poor generalization, or one allows for errors to be committed (the soft-margin concept) and the dashed line is used as the separating hyperplane. Adapted from [37].
4.3.2 Mathematics
In this section, the implementation of both hard and soft margin SVMs will be presented mathematically, as well as the proposed solution for the unbalanced dataset problem. First, SVMs with hard margins will be described, followed by the use of kernels. Later, the soft margin concept and, finally, the unbalanced dataset problem will be dealt with [31, 33].
Consider, again, a linearly separable dataset. In this problem, the training set D has K instances, where x_k represents one instance, an N-dimensional feature vector. Each instance belongs to one of two classes:

D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_K, y_K)\}, \quad \mathbf{x}_k \in \mathbb{R}^N, \; y_k \in \{-1, 1\}. \qquad (4.2)
The separating hyperplane, which correctly classifies all the training instances, is given by the following decision function:

f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b), \qquad (4.3)
where b is a constant and w is the normal vector that parametrizes the hyperplane. The requirement that all training instances are correctly classified can be written as:

y_k (\mathbf{w} \cdot \mathbf{x}_k + b) \geq 1, \quad \forall k. \qquad (4.4)

Notice that the constant on the right side of the inequality above can be any strictly positive number, by virtue of the fact that any hyperplane defined by (w, b) may also be represented by any positively scaled pair (λw, λb), with λ ∈ R+. In addition, by changing the scale factor λ, any separating hyperplane can be represented in a way that equation 4.4 is met with equality for the nearest training sample(s).
The distance between the hyperplane and the nearest vector is then given by:

d((\mathbf{w}, b), \mathbf{x}_k) = \frac{y_k (\mathbf{w} \cdot \mathbf{x}_k + b)}{\|\mathbf{w}\|} \qquad (4.5)
= \frac{1}{\|\mathbf{w}\|}. \qquad (4.6)
In order to select the optimal hyperplane from the infinite set of separating ones, one must maximize its margin. Ergo, it is possible to conclude that the best hyperplane can be computed by minimizing ‖w‖. Taking into consideration the constraints presented in equation 4.4, one can write the following quadratic problem:
\min_{\mathbf{w}, b} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} \quad \text{subject to} \quad y_k (\mathbf{w} \cdot \mathbf{x}_k + b) \geq 1 \;\; \forall k. \qquad (4.7)
To solve the problem presented in equation 4.7, consider the Lagrangian dual formulation. The Lagrangian function is:

L(\mathbf{w}, b, \Lambda) = \frac{1}{2} \mathbf{w}^T \mathbf{w} - \sum_{k=1}^{K} \alpha_k \left[ y_k \left( \mathbf{w}^T \mathbf{x}_k + b \right) - 1 \right], \qquad (4.8)
where Λ = (α_1, ..., α_K) is the vector of non-negative Lagrangian multipliers associated with the constraints presented in equation 4.4. Next, the quantity that the dual problem maximizes (the infimum of L(w, b, Λ) with respect to w and b) can be obtained by setting ∇_{w,b} L(w, b, Λ) = 0 and applying the results to equation 4.8. The conditions imposed result in:
\mathbf{w} = \sum_{k=1}^{K} \alpha_k y_k \mathbf{x}_k \qquad (4.9)

and

\sum_{k=1}^{K} \alpha_k y_k = 0, \qquad (4.10)
which, after some manipulation, leads to:

\inf_{\mathbf{w}, b} \{ L(\mathbf{w}, b, \Lambda) \} = \sum_{k=1}^{K} \alpha_k - \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_k \alpha_l y_k y_l \left( \mathbf{x}_k^T \mathbf{x}_l \right). \qquad (4.11)
Finally, solving the dual maximization problem for the Lagrangian coefficients, one can obtain the desired solution. Using vector notation we have the following:

\max_{\Lambda} \; \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda \quad \text{subject to} \quad \Lambda \geq 0, \;\; \Lambda^T \mathbf{y} = 0, \qquad (4.12)

where 1 = (1, ..., 1) is a K-dimensional vector of ones, D is a symmetric matrix with elements D_{kl} = y_k y_l (x_k^T x_l) and y = (y_1, ..., y_K) is the vector of labels, for k and l ∈ {1, ..., K}. This problem is still quadratic; however, it scales with the number of training instances, as opposed to before, where the problem scaled with the number of dimensions of the feature space.
From the complementary slackness condition of the Karush-Kuhn-Tucker theorem, it is possible to conclude that, when a solution of 4.12 is found, one of two cases holds for each instance x_k: either the associated Lagrangian multiplier α_k is zero, or the corresponding constraint is active, in which case x_k is a support vector. Therefore, the optimal hyperplane can be computed, using equation 4.9, as a linear combination of the support vectors. Finally, the bias b can be obtained from the constraints 4.4.
Now, we will explore the concept and implementation of kernels [31, 33]. Kernels allow a non-linearly separable dataset to be separated by a non-linear surface (Figure 4.2(a)). This is achieved by mapping (z = φ(x)) the original N-dimensional input space into a new, higher-dimensional feature space in which the hyperplane will try to separate the transformed data {(φ(x_k), y_k)}. For some types of kernels, the dimension of the feature space can be infinite, leading to a computational problem. Nevertheless, one can avoid the need of explicitly calculating the mapping of the input vectors by defining the inner product in the feature space. When one replaces all the x by φ(x), each φ(x_k) appears only in a dot product with some φ(x_l), making it possible to compute only the inner product in the feature space, as stated before. The matrix D in 4.12 becomes D_{kl} = y_k y_l (φ(x_k) · φ(x_l)); additionally, if we write w as a linear combination of the support vectors in the feature space, we obtain w = Σ_{k=1}^{K} α_k y_k φ(x_k). Finally, by replacing it in equation 4.3, the decision function becomes:
f(\mathbf{x}) = \operatorname{sign} \left( \sum_{k=1}^{K} \alpha_k y_k \left( \phi(\mathbf{x}_k) \cdot \phi(\mathbf{x}) \right) + b \right). \qquad (4.13)
The kernel function provides the inner product:
K (xk,xl) = φ (xk) · φ (xl) . (4.14)
The next step is the choice of a kernel. In low dimensional spaces one might be able to create, by inspection, a function that would separate the training set in the feature space. However, this task becomes harder as the dimensionality of the problem increases, and several kernels have proven successful in many classification problems. In this project, two were tested, one being the linear kernel (i.e., the plain dot product), and the other the RBF (Gaussian Radial Basis Function) kernel, defined as:
K(\mathbf{x}_k, \mathbf{x}_l) = e^{-\gamma \| \mathbf{x}_k - \mathbf{x}_l \|^2}. \qquad (4.15)
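The two kernels tested can be sketched directly from their definitions, the plain dot product and equation 4.15 (Python for illustration; γ = 0.5 is an arbitrary illustrative value):

```python
import math

def linear_kernel(xk, xl):
    """Linear kernel: the plain dot product."""
    return sum(a * b for a, b in zip(xk, xl))

def rbf_kernel(xk, xl, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||xk - xl||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xk, xl))
    return math.exp(-gamma * sq_dist)

x1, x2 = (1.0, 0.0), (0.0, 1.0)
```

Note that the RBF kernel of any point with itself is 1, and it decays towards 0 as the two points move apart, at a rate controlled by γ.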
After exploiting the concept of kernels, consider the case where the data is not linearly separable even in the feature space, which brings the need of detailing the implementation of the soft-margin concept (figure 4.2(b)). In this case, the dual problem presented earlier becomes unbounded and no solution can be found, so a small number of errors must be accepted in the training phase. In order to solve this, one must relax the constraints through the introduction of a positive slack variable ξ_k in each constraint [33], obtaining:
yk (w · xk + b) ≥ 1− ξk, ∀k. (4.16)
To minimize the errors introduced by the slack variables, they are weighted in the cost function, leading to the new primal problem:
\min_{\mathbf{w}, b} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{k=1}^{K} \xi_k \quad \text{subject to} \quad y_k (\mathbf{w} \cdot \mathbf{x}_k + b) \geq 1 - \xi_k, \;\; \xi_k \geq 0 \;\; \forall k, \qquad (4.17)
where C is a tuning parameter that regulates the misclassification cost. Following a reasoning similar to the one used for the separable case, one can solve this convex quadratic problem. Lastly, we obtain the following dual problem:
\max_{\Lambda} \; \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda \quad \text{subject to} \quad 0 \leq \Lambda \leq C, \;\; \Lambda^T \mathbf{y} = 0. \qquad (4.18)
The last problem to take into consideration is the fact that we are dealing with an unbalanced dataset. To overcome this adversity, the use of different penalties for each class has been proposed [38, 39]. Gathering the two extensions above (the use of kernels and soft margins) along with this solution, we obtain the following SVM formulation for a binary classification problem:
\min_{\mathbf{w}, b, \xi} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C^{+} \sum_{y_k = +1} \xi_k + C^{-} \sum_{y_k = -1} \xi_k \quad \text{subject to} \quad y_k \left( \mathbf{w}^T \phi(\mathbf{x}_k) + b \right) \geq 1 - \xi_k, \;\; \xi_k \geq 0 \;\; \forall k, \qquad (4.19)
where C+ = w+ × C and C− = w− × C, with w+ and w− being the weights associated with the positive and negative classes, respectively.
In this thesis, the dual problem of the soft margin SVM algorithm with an unbalanced dataset was solved numerically using LIBSVM, publicly available software (http://www.csie.ntu.edu.tw/~cjlin/libsvm) developed by Chang and Lin [40].
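As an illustration of the weighted soft-margin objective of equation 4.19 in the linear-kernel case, the following sketch minimizes the primal by plain subgradient descent (the thesis instead solves the dual with LIBSVM; the learning rate, epoch count and toy data are assumptions made for the example):

```python
def train_weighted_linear_svm(data, c_pos, c_neg, lr=0.01, epochs=200):
    """Subgradient descent on 1/2 ||w||^2 + C+ * hinge(+) + C- * hinge(-),
    where the slack xi_k equals the hinge loss max(0, 1 - y_k (w.x_k + b))."""
    d = len(data[0][0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            c = c_pos if y == 1 else c_neg      # class-dependent penalty
            if margin < 1:                      # hinge active for this sample
                w = [wi - lr * (wi - c * y * xi) for wi, xi in zip(w, x)]
                b = b + lr * c * y
            else:                               # only the regularizer acts
                w = [wi - lr * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# two separable toy clusters
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 2.0), 1), ((2.0, 3.0), 1)]
w, b = train_weighted_linear_svm(data, c_pos=1.0, c_neg=1.0)
```

Increasing `c_pos` relative to `c_neg` makes errors on the minority (CTC) class more costly, which is the role of the w+ and w− weights above.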
4.4 Boosting
Boosting is a general method for improving the performance of any learning algorithm. In 1984, Valiant proposed probably approximately correct (PAC) learning, a theoretical framework for studying machine learning [41], which provided the background for boosting. Later, Kearns and Valiant [42, 43] questioned whether a "weak" learning algorithm, one that under normal circumstances would be just marginally better than random classification, could be boosted into an accurate "strong" algorithm. In 1989, Schapire [44] presented the first provable polynomial-time boosting algorithm. Freund [45] then developed a more efficient algorithm, which still had some drawbacks.
4.4.1 AdaBoost
In 1997, Freund and Schapire [46] presented AdaBoost, a boosting algorithm that dealt with the difficulties of the previously presented boosting algorithms. The AdaBoost classifier starts by fitting a classifier on the dataset; it then fits copies of the classifier on the same data, but with the weights of incorrectly classified instances adjusted, so that in the following iterations the classifiers can focus on the harder cases [47, 48].
Consider a training set (x_1, y_1), ..., (x_k, y_k), where x_i represents a pattern in the input space X and y_i is the corresponding label. For the sake of simplicity, and given that we are dealing with binary classification in this project, let us assume y_i ∈ Y = {−1, +1}. The algorithm takes the training set as input and calls a "weak" learner repeatedly in a series of rounds t = 1, ..., T. In the beginning all weights are set equally, and at each round they are updated in such a way that the weak learner is forced to focus on the hard examples of the training set, i.e., incorrectly classified examples have their weight increased. The weight on training pattern i at round t is D_t(i). The weak learner then has to find a weak hypothesis h_t : X → {−1, +1} adequate for the distribution D_t. The quality of the hypothesis is measured by its error with respect to the distribution D_t on which the weak learner was trained, given by:
\varepsilon_t = \Pr_{i \sim D_t} \left[ h_t(x_i) \neq y_i \right] = \sum_{i : h_t(x_i) \neq y_i} D_t(i). \qquad (4.20)
After the weak hypothesis has been found, a parameter α_t, which measures the importance of h_t, is computed according to the equation presented in Figure 4.3. This step is followed by an update of the weight distribution so that the classifier focuses on hard examples, meaning the weight of examples misclassified by h_t increases and the weight of correctly classified examples decreases. In the end, a weighted majority vote of the T weak hypotheses (with α_t the weight assigned to hypothesis h_t) gives us the final hypothesis H.
Figure 4.3 AdaBoost [47].

1: Given (x_1, y_1), ..., (x_k, y_k), where x_i ∈ X, y_i ∈ Y = {−1, +1}
2: Initialize D_1(i) = 1/k
3: for t = 1, ..., T do
4:   Train weak learner using distribution D_t
5:   Get weak hypothesis h_t : X → {−1, +1} with error ε_t (equation 4.20)
6:   Choose α_t = (1/2) ln((1 − ε_t)/ε_t)
7:   Update:
8:   D_{t+1}(i) = (D_t(i)/Z_t) × { e^{−α_t} if h_t(x_i) = y_i ; e^{α_t} if h_t(x_i) ≠ y_i }
9:             = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t,
10:  where Z_t is a normalization factor, chosen so that D_{t+1} will be a distribution.
11: Output the final hypothesis:
12: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).
In this project, the implementation of AdaBoost.M1 used is the Ensemble Learning framework of MATLAB R2014a, present in the Statistics and Machine Learning Toolbox.
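The AdaBoost loop of Figure 4.3 can be sketched with 1-D decision stumps as the weak learners (a Python illustration; the thesis uses MATLAB's ensemble framework, and the stump learner and toy data are assumptions made for the example):

```python
import math

def best_stump(xs, ys, weights):
    """Pick the 1-D threshold/polarity stump with minimum weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= thr else -polarity for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ys, T=5):
    """Reweight the training set each round so the next weak learner
    focuses on the examples misclassified so far (Figure 4.3)."""
    k = len(xs)
    D = [1.0 / k] * k                       # uniform initial distribution
    ensemble = []
    for _ in range(T):
        err, thr, pol = best_stump(xs, ys, D)
        err = max(err, 1e-10)               # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # multiplicative update, then renormalize so D stays a distribution
        D = [w * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for w, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [w / Z for w in D]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the T weak hypotheses."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys, T=3)
```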
4.4.2 RUSBoost
Traditional machine learning algorithms tend to favour classifying patterns into the majority class when one of the classes highly outnumbers the other. One technique to overcome this problem is data sampling, an approach that balances the class distribution of the training set by either undersampling (removing samples from the overrepresented class) or oversampling (adding examples to the minority class) until the desired balance is achieved. These techniques can be as simple as random selection or more advanced. Undersampling has the disadvantage of leading to loss of information, due to the deletion of examples [49]. On the other hand, oversampling can lead to overfitting and increases the model training time [50]. Proposed in 2010 by Seiffert et al. [51], RUSBoost has its roots in the SMOTEBoost algorithm (which is based on the AdaBoost algorithm, detailed in section 4.4.1). Both algorithms add a data sampling technique to the AdaBoost algorithm: SMOTEBoost uses an oversampling technique that creates new minority class samples by interpolating between the existing ones, a method called the synthetic minority oversampling technique (SMOTE) [52]. The RUSBoost algorithm uses an approach that has proved to be simple, fast and with good performance [53]. RUS (Random UnderSampling) simply removes, at random, examples of the majority class until the desired distribution of classes is achieved. The full RUSBoost algorithm is explained in the form of pseudo-code in Figure 4.4.
Figure 4.4 RUSBoost [51, 54].

1: Given: set S of examples (x_1, y_1), ..., (x_k, y_k) with minority class y_r ∈ Y, |Y| = 2
2: Weak learner, WeakLearn
3: Number of iterations, T
4: Desired percentage of total instances to be represented by the minority class, M
5: Initialize D_1(i) = 1/k for all i.
6: for t = 1, ..., T do
7:   Create a temporary training dataset S'_t with distribution D'_t using random undersampling
8:   Call WeakLearn, providing it with examples S'_t and their weights D'_t
9:   Get back a hypothesis h_t : X × Y → {0, 1}
10:  Compute the pseudo-loss (for S and D_t):
11:  ε_t = Σ_{(i,y): y_i ≠ y} D_t(i) (1 − h_t(x_i, y_i) + h_t(x_i, y)).
12:  Calculate the weight update parameter:
13:  α_t = ε_t / (1 − ε_t)
14:  Update D_t:
15:  D_{t+1}(i) = D_t(i) α_t^{(1/2)(1 + h_t(x_i, y_i) − h_t(x_i, y : y ≠ y_i))}.
16:  Normalize D_{t+1}: let Z_t = Σ_i D_{t+1}(i)
17:  D_{t+1}(i) = D_{t+1}(i) / Z_t.
18: Output the final hypothesis:
19: H(x) = argmax_{y ∈ Y} Σ_{t=1}^{T} h_t(x, y) log(1/α_t).
In this project, the implementation of RUSBoost used is the Ensemble Learning framework of MATLAB R2014a, present in the Statistics and Machine Learning Toolbox.
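The RUS step itself can be sketched as follows; the target class fraction and the labels are illustrative assumptions:

```python
import random
from collections import Counter

def random_undersample(data, minority_frac=0.5, seed=0):
    """RUS: randomly delete majority-class examples until the minority
    class makes up `minority_frac` of the temporary training set."""
    rng = random.Random(seed)
    counts = Counter(y for _, y in data)
    minority = min(counts, key=counts.get)
    minority_set = [d for d in data if d[1] == minority]
    majority_set = [d for d in data if d[1] != minority]
    # keep just enough majority samples for the desired class ratio
    n_majority = round(len(minority_set) * (1 - minority_frac) / minority_frac)
    rng.shuffle(majority_set)
    return minority_set + majority_set[:n_majority]

# 90 non-CTCs vs. 10 CTCs, undersampled to a 50/50 split
data = ([((i,), 'non-CTC') for i in range(90)]
        + [((i,), 'CTC') for i in range(10)])
balanced = random_undersample(data)
```

Within RUSBoost this resampling is repeated at every boosting round, so different majority examples are discarded each time and the information loss of plain undersampling is mitigated.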
4.5 Performance Evaluation
When developing an automated classification system, there is a need to assess the performance of the proposed classifiers. However, different performance metrics carry different meanings and trade-offs, and one classifier can be optimal in one metric and suboptimal in another. The most commonly used metric is accuracy; however, in this project plain accuracy can be highly uninformative, since as much as 99% of the dataset can be composed of non-CTCs. More adequate, and also widely used, metrics are balanced accuracy, sensitivity and specificity. Nonetheless, these require us to decide at which point of the Receiver-Operator Characteristic (ROC) curve we want to position ourselves in order to consider the classifier good. Therefore, the main performance evaluation metrics used were the ROC curve and the Area Under the Curve (AUC).
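Sensitivity, specificity and balanced accuracy follow directly from the confusion-matrix counts. A sketch, where the all-non-CTC toy example illustrates why plain accuracy is uninformative here:

```python
def classification_metrics(y_true, y_pred, positive='CTC'):
    """Sensitivity (TPR), specificity (TNR) and balanced accuracy, which
    averages the two and is therefore robust to class imbalance."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2

# 1 CTC among 9 non-CTCs: predicting everything as non-CTC gives 90%
# plain accuracy but only 50% balanced accuracy.
y_true = ['CTC'] + ['non-CTC'] * 9
y_pred = ['non-CTC'] * 10
sens, spec, bal_acc = classification_metrics(y_true, y_pred)
```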
4.5.1 Nested Cross Validation
Nested Cross Validation is not a performance metric; rather, it is a tool used to evaluate the performance of supervised learning algorithms. If the performance of an algorithm is analysed on the same dataset that was used to train the classifier, the result can be optimistically biased. To evaluate the generalization ability of a model in an unbiased fashion, one must have a test or validation set that was never used in the learning phase. However, if the number of samples in the input dataset is small, it is not advisable to leave data out of the training set. The Cross Validation (CV) procedure prevents this problem by randomly partitioning the full dataset into k disjoint sets. One of the k sets is chosen as the validation set, and the remaining k − 1 are used to train the model. The process is repeated k times, until every fold has been used as the validation set. Another task one might have to deal with is tuning a classifier by choosing its parameters, such as the adequate number of neighbors in the k-NN algorithm, the γ of the RBF kernel, the C associated with the soft margins of the SVM, or the number of weak learners in the ensemble methods. All of these might affect the final classification performance. Varma and Simon [55] proposed the Nested Cross Validation method, which not only circumvents the same problem as CV, but also tackles the problem of parameter tuning. For the Nested Cross Validation algorithm in pseudo-code, please refer to Figure 4.5.
Figure 4.5 Nested Cross-Validation [56].

1: Split the set D of K available samples into k disjoint sets D_i, i = 1, ..., k of size K/k /* outer cross-validation */
2: for i = 1 to k do
3:   D := D \ D_i, K := |D|
4:   for each parameter set p do /* parameter selection */
5:     Split the set D of K available samples into k disjoint sets D_j, j = 1, ..., k of size K/k /* inner cross-validation */
6:     for j = 1 to k do
7:       Train the classifier on the training set D_t = D \ D_j
8:       Compute test error e_j on the parameter test set D_j
9:     Compute inner CV test error
10:  Select parameter set p with minimum error
11:  Train classifier with selected parameter set on D_t = D \ D_i
12:  Compute test error on the test set D_i
13: Calculate outer CV test error.
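The pseudo-code of Figure 4.5 can be sketched as follows; the threshold "classifier" used in the demonstration is a deliberately trivial stand-in for the real learners:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint folds (round-robin)."""
    return [list(range(i, n, k)) for i in range(k)]

def nested_cv(data, params, train_fn, error_fn, k_outer=3, k_inner=3):
    """Nested CV: the inner loop selects the parameter set with minimum
    inner-CV error; the outer loop estimates the error on folds never
    touched during parameter selection."""
    outer_errors = []
    for test_idx in k_fold_indices(len(data), k_outer):
        held_out = set(test_idx)
        dev = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]

        def inner_error(p):
            errs = []
            for val_idx in k_fold_indices(len(dev), k_inner):
                val_set = set(val_idx)
                train = [d for i, d in enumerate(dev) if i not in val_set]
                errs.append(error_fn(train_fn(train, p),
                                     [dev[i] for i in val_idx]))
            return sum(errs) / len(errs)

        best_p = min(params, key=inner_error)      # inner model selection
        outer_errors.append(error_fn(train_fn(dev, best_p), test))
    return sum(outer_errors) / len(outer_errors)

# Demonstration on 1-D data: the tuned "parameter" is simply a threshold.
data = [(float(i), i >= 6) for i in range(12)]
err = nested_cv(
    data,
    params=[2.0, 6.0, 10.0],
    train_fn=lambda train, p: p,
    error_fn=lambda thr, rows: sum((x >= thr) != y for x, y in rows) / len(rows),
)
```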
4.5.2 Receiver-Operator Characteristic
In order to evaluate the performance of the classifiers described above, Receiver Operator Characteristic graphs will be used. These have been used in signal detection theory [57], diagnostic systems [58] and medicine [59]. The area under a ROC curve has a baseline rate that is independent of the data, while for some other metrics the baseline is data dependent [60]. Fawcett and Provost [61] did a thorough study on the use of ROC curves for the evaluation of classifiers.
The true positive rate (TPR) is defined as:

TPR = p(Y \mid +) \approx \frac{\text{positives correctly classified}}{\text{total positives}} \qquad (4.21)

and the false positive rate (FPR) as follows:

FPR = p(Y \mid -) \approx \frac{\text{negatives incorrectly classified}}{\text{total negatives}}, \qquad (4.22)
where + and − are the positive and negative instance classes, respectively, and p(+ | x_i) is the posterior probability of the instance x_i being positive. A ROC curve plots the TPR on the Y axis and the FPR on the X axis, bringing the advantage of presenting the behaviour of a classifier regardless of class distribution or error cost. In order to choose the best classifier based on a ROC curve analysis, one must maximize (1 − FPR) · TPR, which corresponds to selecting the classifier with the highest area under the curve (AUC). This approach measures the average performance of the classifier over the entire performance space [58, 59].
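The ROC construction and the trapezoidal AUC can be sketched as follows (ties between scores are ignored for simplicity, and the score values are made up):

```python
def roc_curve(scores, labels):
    """Sweep the decision threshold from most to least confident score
    and collect (FPR, TPR) points, per equations 4.21-4.22."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# a scorer that ranks every positive above every negative has AUC 1
points = roc_curve([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```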
4.6 Conclusion
This chapter covers all the machine learning algorithms used in this thesis for designing an automated classification system for CTC enumeration: the k-NN along with a bootstrapping technique, the SVM and its extensions and, finally, AdaBoost and RUSBoost. It also presents the tuning and validation framework (Nested Cross-Validation), along with the metrics used to assess the performance of the different supervised learning algorithms studied.
Chapter 5
Results
The two previous chapters (Chapters 3 and 4) describe the approaches implemented in order to construct an automated classification system for CTC detection. In this chapter, the results regarding these approaches are presented. Section 5.1 describes the dataset used, followed by section 5.2, which presents a practical discussion of the image processing approaches. Section 5.3 describes implementation choices. Then section 5.4 gives an insight into the performance of each implemented classifier. Finally, section 5.5 summarizes the results.
5.1 Dataset - Fluorescence microscopy for blood cells analysis
The fluorescence microscopy images used in the development of this thesis were provided by the Cancer-ID consortium. They come from a multicenter study consisting of 59 patients with Small-Cell Lung cancer. From each patient, three blood samples were retrieved: one before chemotherapy (designated as baseline), one after one cycle of chemotherapy and one at the end of chemotherapy. Each blood withdrawal corresponds to one cartridge and each cartridge corresponds to 175 4-channel TIFF images. The images were obtained using the CellSearch System and manually classified by expert reviewers. Table 5.1 summarizes the most important demographic and clinical information about the subjects.
Table 5.1: Clinical characteristics of 59 patients with small-cell lung cancer (ED - extensive disease stage; LD - limited disease stage) [15].

Characteristic | All patients | LD | ED
Age, years (minimum-maximum) | 64 (47-84) | 67 (47-84) | 62 (47-81)
Male/Female | 35/24 | 12/9 | 23/15
Stage, n (%) | - | 21 (36) | 38 (64)
CTCs at baseline, n (median; minimum-maximum) | 59 (16; 0-14 040) | 21 (6; 0-220) | 38 (63; 0-14 040)
CTCs after one cycle, n (median; minimum-maximum) | 37 (0; 0-1681) | 18 (0; 0-6) | 19 (1; 0-1681)
CTCs after four cycles, n (median; minimum-maximum) | 34 (1; 0-117) | 16 (0; 0-3) | 18 (1; 0-117)
Overall survival days, n (median; minimum-maximum) | 59 (280; 5-1424) | 21 (356; 9-1424) | 38 (213; 5-818)
Ferrofluids with EpCAM (epithelial cell adhesion molecule) antibodies were added to the blood samples in order to select cells of epithelial origin, and the cells were stained with DAPI (4',6-diamidino-2-phenylindole, dihydrochloride) as a nuclear stain, PE-CK (cytokeratin 8 and 18 Phycoerythrin and cytokeratin 19 Phycoerythrin) and CD45-APC (CD45-allophycocyanin) to label leukocytes. The microscope objective was a 10x/0.45NA and it had filters for DAPI, CK, CD45 and FITC, corresponding respectively to each of the 4 channels of the 4-page TIFF. The FITC channel was used only for the removal of the edge of the cartridge.
Each cartridge was classified by an expert reviewer. There is no information regarding non-CTCs; when a CTC is present, it is registered that within a given square area there is a CTC.
Along with each cartridge set (175 images), there is an XML file with the position of each CTC relative to the whole cartridge, which is then transformed into a position relative to the image being analysed. Additionally, the TIFF header of each image has two values corresponding to an offset and a maximum value related to the conditions in which the image was obtained; this information was used for image normalization (equation 3.1).
Due to limited computational capacity, three datasets, randomly chosen from all the available ones, were used for testing. In total, 525 images were processed and 141 634 ROIs were classified, 18 822 of which were Circulating Tumor Cells.
5.2 Discussion on Image Processing Results
Features are extracted image by image from a full cartridge. An example of a cartridge, resulting from the concatenation of 175 images, is presented (left, vertically) in Figure 5.1. In the same figure, one image from that dataset is presented (on the right, horizontally): the overlay of the 3 channels, the DAPI-DNA channel, the CK-PE channel and the CD45-APC channel (designated as DNA, CK and CD45, respectively).
Figure 5.1: Example of a full cartridge (left, presented vertically) and one image from that dataset (right, horizontally). Overlay corresponds to the 3 channels superimposed. DNA corresponds to the DAPI-DNA channel, CK to the CK-PE channel and CD45 to the CD45-APC channel.
Each image was then normalized. After this step, the average background intensities of the images had a negligible difference between each other. Then the edges were successfully removed.
Now let us consider Figure 5.2, where an example of a CTC (Figure 5.2(a)) is presented side by side with a non-CTC example (Figure 5.2(b)): there is no obvious visual difference between a CTC and a non-CTC in any of the channels. Observing the full dataset, ROI by ROI, it is hard to find, by visual inspection, a clear and obvious pattern that allows a non-expert reviewer to distinguish a CTC from a non-CTC.
(a) CTC (b) Non-CTC

Figure 5.2: Example of two cells. Figure 5.2(a) is an example of a CTC, whereas Figure 5.2(b) is an example of a non-CTC. The red and green contours represent the contours resulting from segmentation.
The segmentation of these two cells (Figure 5.2) was correctly accomplished. However, there are some situations in which one cannot be sure of the quality of the segmentation. Considering Figure 5.3, it is not possible to know, at least for a non-expert reviewer, whether we are dealing with a cluster of two cells, in which case the segmentation performs poorly, or with just one big cell.
Figure 5.3: Example of a non-CTC. Given the way the manual classification was performed, there is no way of knowing if this is just one cell or a cluster of two cells. The green contour represents the contour resulting from segmentation.
Now consider the situation presented in Figure 5.4: here the segmentation algorithm does not perform
up to expectation. On the left we have a cell; on the right, an element whose nature is not clear,
as it could be a smudge, debris or an apoptotic cell. Nonetheless, the segmentation
algorithm was not able to separate the two objects.
Figure 5.4: Example of a non-CTC, present in Figure 5.1. This example highlights two problems: first, the segmentation is not able to create two distinct areas when the objects are close; second, given the way the expert reviewer did the classification, it is not possible to know if the element on the right (inside the contour) is a smudge, a cell or an apoptotic cell.
Overall, most of the objects were correctly segmented; however, no solution was implemented for the
problems exemplified above.
Appendix A presents the histograms of the distributions of the features, by class (CTC and non-CTC)
and by type (morphological, intensity and texture features), for the 3 datasets. By inspection, one can
conclude that no single feature clearly distinguishes one class from another.
Only outliers with abnormal size were removed, based on inspection: all ROIs with an area
≤ 9 or ≥ 0.3 × 10^4 pixels were excluded from classification.
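The exclusion rule above amounts to a simple area filter; a minimal sketch (the threshold names and helper function are ours, not from the thesis):

```python
# Sketch of the area-based outlier filter described above: keep only
# ROIs whose pixel area lies strictly between the two thresholds.
MIN_AREA = 9          # ROIs this small or smaller are treated as debris
MAX_AREA = 0.3e4      # ROIs this large or larger are treated as artifacts

def keep_roi(area_px: float) -> bool:
    """Return True if an ROI of this area survives the outlier filter."""
    return MIN_AREA < area_px < MAX_AREA

rois = [4, 120, 850, 3500, 9]
kept = [a for a in rois if keep_roi(a)]   # -> [120, 850]
```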
5.3 Experimental Design
The goal of this project is to assess the viability of an automated classification system for the identification
of circulating tumor cells. With this purpose in mind, several classifiers were tested, namely k-NN, k-NN
with bootstrapping, k-NN using prior probabilities, SVM with a linear kernel, SVM with an RBF
kernel, AdaBoost and RUSBoost. Along with testing several classifiers, it was also analysed which set of
features yielded the best results. Thus, each of the algorithms was tested for the sets of features designated
as All, Morphological, User (features related to the ones expert reviewers usually take into account when
performing manual classification), Intensity, Texture, DNA (intensity and texture features of this channel),
CK (intensity and texture features of this channel) and CD45 (intensity and texture features of this channel);
please refer to Table 5.2 for a more detailed description of each category.
All the algorithms were first tested with one dataset (one cartridge), followed by a second test using
another cartridge, and finally with the concatenation of the three datasets.
The parameters C (for both the linear and the RBF kernel), γ (for the RBF kernel), k (the number of
neighbors) for k-NN, and the number of weak learners used in both ensemble methods were estimated
using nested cross-validation, with 10 folds in the outer loop and 7 in the inner loop.

Table 5.2: Set of features of each category used for classification.

All - all extracted features.
Morphological - Area; Eccentricity; Perimeter; Perimeter to Area ratio.
User - Area; Eccentricity; Max. Intensity DNA; Max. Intensity CK; Standard Deviation Int. CD45.
Intensity - Mean Intensity, Max. Intensity, Standard Deviation Int. and Mass, for each of the DNA, CK and CD45 channels.
Texture - Median of Local Entropy, Median of Local Contrast, Median of Gradient Amplitude and HOG, for each of the DNA, CK and CD45 channels.
DNA (Intensity+Texture) - the Intensity and Texture features of the DNA channel.
CK (Intensity+Texture) - the Intensity and Texture features of the CK channel.
CD45 (Intensity+Texture) - the Intensity and Texture features of the CD45 channel.
Other algorithm specifications are presented below:
• k-NN - k-NN was trained assuming k ∈ {1, 3, 5, 7, 9}.
• k-NN with bootstrapping - k-NN had k = 3 and bootstrapping was performed in such a way that
each bootstrap set contained the same number of CTCs and non-CTCs.
• k-NN with Prior Probabilities - k-NN was trained with the following pairs of prior probabilities:
{(.50, .50); (.60, .40); (.75, .25); (.85, .15); (.95, .05); (.99, .01); (.995, .005); (.45, .55); (.35, .65);
(.30, .70); (.10, .90); (.01, .99)}.
• SVM Kernels - Both linear and RBF kernels were tested with weights w0 = 1 and w1 = #non-CTCs / #CTCs,
corresponding to the weights of the non-CTC and CTC classes, respectively. C was assumed to
take the values {2^−16, 2^−14, 2^−12, 2^−10, 2^−8, 2^−6, 2^−4, 2^−2, 2^0, 2^2, 2^4} and γ ∈ {2^−18, 2^−14, 2^−10,
2^−6, 2^−2, 2^2, 2^6, 2^10}.
• Ensemble methods - Both AdaBoost and RUSBoost were tested using a decision tree as weak
classifier, and the number of weak classifiers was {100, 200, 300, 400, 500, 600, 700, 800, 1000}.
• Nested Cross-Validation - The outer loop had 10 folds, and the inner loop 7.
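The nested cross-validation procedure described above (10 outer folds for performance estimation, 7 inner folds for hyper-parameter selection) can be sketched generically; `fit_score` is a hypothetical stand-in for training and scoring any of the classifiers listed above:

```python
def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]

def nested_cv(X, y, param_grid, fit_score, outer_k=10, inner_k=7):
    """Generic nested cross-validation sketch, as used in the thesis.

    `fit_score(train_idx, test_idx, param, X, y)` is any function that
    trains with one parameter value and returns a score on the test
    fold -- a hypothetical stand-in for the k-NN / SVM / boosting fits.
    """
    idx = list(range(len(y)))
    outer_scores = []
    for outer in k_folds(idx, outer_k):
        train = [i for i in idx if i not in outer]
        # Inner loop: pick the parameter with the best mean inner score.
        best_param, best = None, float("-inf")
        for p in param_grid:
            inner_scores = []
            for inner in k_folds(train, inner_k):
                inner_train = [i for i in train if i not in inner]
                inner_scores.append(fit_score(inner_train, inner, p, X, y))
            mean = sum(inner_scores) / len(inner_scores)
            if mean > best:
                best, best_param = mean, p
        # Outer loop: unbiased estimate with the selected parameter.
        outer_scores.append(fit_score(train, outer, best_param, X, y))
    return sum(outer_scores) / len(outer_scores)
```

Because the hyper-parameter is chosen only from inner folds, the outer-fold estimate is not biased by the selection, which is the motivation for nesting [55].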
5.4 Classification Results
In this section, the performance of each implemented classification algorithm is presented.
Please note that no statistical hypothesis test was used for comparison purposes; the comparison was
based solely on the ROC curves and the AUC (Area Under the Curve).
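For reference, the AUC used throughout can be computed directly from classifier scores as the probability that a randomly chosen CTC outranks a randomly chosen non-CTC (the Mann-Whitney formulation, equivalent to integrating the ROC curve); a minimal sketch:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve, computed as the probability that a
    random positive (CTC) score exceeds a random negative (non-CTC)
    score; ties count one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect ranking gives 1.0, and a classifier whose ROC curve sits on the 45-degree diagonal gives 0.5, which is the baseline the curves below should be read against.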
As stated before, the algorithms were first tested with just one dataset, then with another, and finally
with the three datasets concatenated. The results for the first two tests are presented in Appendix B;
the results for the three concatenated datasets are presented in this chapter. The results were quite
similar across tests, except for dataset 1, where the best classifier was the SVM with a linear kernel
using all features.
Figure 5.5 presents the ROC curves of the three implementations of k-NN. Overall it is possible to
observe that the classification performed by the k-NN, in all of the situations, is quite weak. All curves
present an almost constant growth, meaning that any increase in sensitivity will be accompanied by a
linearly proportional decrease in specificity. Furthermore, the curves are very close to the 45-degree
diagonal and, as a result, these classifiers behave nearly as random classifiers. Among the three k-NN
implementations, as expected, the k-NN coupled with the bootstrapping technique (Figure 5.5(a))
performed slightly better than the other two, since it is implemented in a way that tackles the problem of
class imbalance. For this classifier the best set of features was the Intensity set. In all cases the worst
set of features was DNA (Intensity+Texture). In the implementations of k-NN with Prior Probabilities
(Figure 5.5(b)) and plain k-NN (Figure 5.5(c)) the best set of features was CK (Intensity+Texture).
Figure 5.6 displays the results of the SVM with Linear Kernel and the SVM with RBF Kernel. Both performed
better than the k-NN implementations. The SVM with RBF kernel (Figure 5.6(b)) produced better results
than the one with a linear kernel (Figure 5.6(a)) and, in both cases, the best set of features was CK
(Intensity+Texture) and the worst was CD45 (Intensity+Texture).
The ROC curves for the Ensemble methods are depicted in Figure 5.7. Unexpectedly, on average
the AdaBoost performed better than the RUSBoost. The best set for the AdaBoost (Figure 5.7(a)) was
CK (Intensity+Texture) and the worst was CD45 (Intensity+Texture). In the case of the RUSBoost
(Figure 5.7(b)), the set that yielded the best results was Intensity and the worst was the Morphological
feature set.
5.5 Summary
Overall (considering all classifiers, the tests done with the two datasets separately and the test done
with the three datasets concatenated), the set of features that yielded the worst classification results was
CD45 (Intensity+Texture). This is somewhat odd, given that this is the exclusion marker, but it might
be explained by the poor quality of the images from this channel. The set of features that generated the
best results for the concatenation of the three datasets was CK (Intensity+Texture), followed by
the Intensity feature set. The best classifier was the AdaBoost, followed by the SVM with RBF Kernel;
however, neither result met the expectations. The results of all implemented classifiers are
summarized, in the form of AUC, in Table 5.3.
38
(a) k-NN + Bootstrapping
(b) k-NN with Prior Probabilities
(c) k-NN
Figure 5.5: Receiver operating characteristic curves for classification of CTCs with k-Nearest Neighbors: with bootstrapping 5.5(a), with Prior Probabilities 5.5(b), and with the optimal number of neighbors 5.5(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
39
(a) SVM Linear Kernel
(b) SVM RBF Kernel
Figure 5.6: Receiver operating characteristic curves for classification of CTCs with Support Vector Machines, using Linear 5.6(a) and Gaussian (RBF) 5.6(b) kernels. (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
40
(a) AdaBoost
(b) RUSBoost
Figure 5.7: Receiver operating characteristic curves for classification of CTCs with Ensemble methods, AdaBoost 5.7(a) and RUSBoost 5.7(b). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
41
Table 5.3: Area Under the Curve (AUC) of each of the algorithms tested for the total of the 3 datasets.

Area Under the Curve All Morph. User Int. Tex. DNA (Int+Tex.) CK (Int+Tex.) CD45 (Int+Tex.)
k-NN+Bootstrapping .6573 .5947 .6837 .6997 .6240 .5849 .6970 .5898
k-NN w/ Prior P. .6498 .5786 .6808 .6787 .6218 .5739 .6944 .5795
k-NN Neigh. .6399 .5935 .6566 .6614 .6106 .5656 .6748 .5768
SVM Linear .7267 .6379 .7313 .7225 .6664 .6495 .7268 .5767
SVM RBF .7246 .6553 .7235 .7297 .6748 .6553 .7305 .6299
AdaBoost .7331 .6578 .7327 .7331 .6728 .6476 .7387 .6267
RUSBoost .6910 .5697 .6904 .6962 .6060 .5754 .6933 .5838
42
Chapter 6
Conclusions and Future Work
The main goal of this thesis was to study several approaches to building an automated classification
system for Circulating Tumor Cell enumeration. To date, the interpretation of blood samples
analysed by the CellSearch system still depends on the expertise of a trained reviewer, and there has
been a growing interest in developing automated systems that enumerate CTCs in a reliable fashion.
Interest in this topic has also been growing due to the increasing number of biomarkers, detection and
physical isolation systems currently being studied and developed in order to perform real-time
biopsies on cancer patients. This work presented a brief summary of these systems, but
focused on the CellSearch System.
One of the objectives of this project was to study features that could be extracted from each cell and
the impact they had on classification. The features analysed were related to the morphology, intensity
and texture of each ROI. These were then grouped into sets in order to evaluate which ones were
more informative and produced better classification results. It was concluded that the three
best sets of features were the combination of all the extracted features, the set of features extracted from
the CK channel (a combination of intensity and texture features from this channel) and the set of intensity
features. The sets that generated the worst results were the set of morphological features and the texture
and intensity features of both the DNA and the CD45 channels.
The second purpose of the current project was to evaluate and compare several pattern recognition
systems. The large number of non-CTCs compared with the scarce number of CTCs poses a
problem that jeopardizes the classification systems, so several approaches that tackle class imbalance
were implemented and tested. The three machine learning algorithms that performed best were the
two Support Vector Machines (which deal with class imbalance by associating a weight with each of the
classes) and the AdaBoost. The three worst were the three implementations of k-Nearest Neighbors,
and even among these the one that performed best was the implementation with bootstrapping.
Overall, all the implementations and results under-performed. Based on the results of this thesis,
it is not possible to conclude that an automated system for CTC enumeration in Small-Cell
Lung Cancer can be built.
To boost the results in CTC classification for SCLC several options can be studied and developed:
43
• Development of a more coherent and detailed ground-truth:
– Stricter definition of what should be considered a CTC (different reviewers can assign the
same object to different classes, and even the same reviewer, at different moments, can classify
the same object once as a CTC and another time as a non-CTC);
– Manual classification of CTCs after image segmentation (currently, in the manual classifica-
tion, a CTC is a rectangle-shaped area that might contain one or more ROIs, not necessarily
all of them CTCs);
– Manual classification into more classes than just CTC, for example also CTC debris and apoptotic CTC.
These two can present very different morphology and signal intensities compared to a
normal CTC and, currently, they are classified by an expert reviewer as CTCs;
– Classification of non-CTCs: the non-CTC class is everything else in the dataset, which creates
a class with very vague characteristics; it can be a white blood cell, debris, an apoptotic cell,
a smudge, etc.;
• Development of an automated classification system with more classes than just CTC and
non-CTC;
• Using clustering and/or a learning algorithm for outlier removal;
• Studying the use of color histograms and other features;
• Using feature selection algorithms to better assess the informativeness of each feature, such as
correlation and mutual information algorithms;
• Implementation of a noise reduction algorithm (noise analysis of the images studied in this project
is presented in appendix C);
• Improvement of the segmentation algorithm.
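As an illustration of the feature selection suggestion above, mutual information between a (discretized) feature and the class label can be estimated as follows; this is a generic sketch, not code from the thesis:

```python
import math
from collections import Counter

def mutual_information(feature_bins, labels):
    """Estimate I(feature; class) in bits from discretized feature
    values, a simple way to rank features by informativeness.
    (Continuous features would first be binned, e.g. into quantiles.)"""
    n = len(labels)
    pxy = Counter(zip(feature_bins, labels))   # joint counts
    px = Counter(feature_bins)                 # marginal feature counts
    py = Counter(labels)                       # marginal class counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * log2( p_joint / (p(x) * p(y)) )
        mi += p_joint * math.log2(c * n / (px[x] * py[y]))
    return mi
```

A perfectly informative feature yields 1 bit for a balanced two-class problem, while a feature independent of the class yields 0; ranking features by this score is one concrete way to pursue the suggestion above.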
From a more general point of view several other improvements can be made:
• Implementation of online-learning algorithms;
• Implementation of semi-supervised solutions;
• Study of biomarkers (dyes) for different types of CTC.
The topic of automated classification of Circulating Tumor Cells is still quite recent, has great potential
and a huge impact on the study of cancer, and there is still a lot of room for development.
44
Bibliography
[1] American Cancer Society. Cancer Facts & Figures 2012. Health Policy, 1:1–68, 2012.
[2] American Cancer Society. Global Cancer Facts & Figures 3rd Edition. American Cancer Society,
(800):1–64, 2015.
[3] American Association for Cancer Research. AACR Cancer progress report. pages S1–S100, 2012.
[4] D. E. Bloom, E. Cafiero, E. Jane-Llopis, S. Abrahams-Gessel, L. Reddy Bloom, S. Fathima, A. B.
Feigl, T. Gaziano, A. Hamandi, M. Mowafi, D. O’Farrell, E. Ozaltin, A. Pandya, K. Prettner, L. Rosen-
berg, B. Seligman, A. Z. Stein, C. Weinstein, and J. Weiss. The Global Economic Burden of Non-
communicable Diseases. (September):1–46, 2012.
[5] D. A. Haber and J. Settleman. Cancer: drivers and passengers. Nature, 446(7132):145–146, 2007.
[6] S. de Wit, G. van Dalum, and L. W. M. M. Terstappen. Detection of Circulating Tumor Cells.
Scientifica, 2014.
[7] D. Wirtz, K. Konstantopoulos, and P. C. Searson. The physics of cancer: the role of physical
interactions and mechanical forces in metastasis. Nature reviews. Cancer, 11:512–522, 2011.
[8] B. Weigelt, J. L. Peterse, and L. J. van’t Veer. Breast cancer metastasis: markers and models.
Nature Reviews Cancer, 5(August):591–602, 2005.
[9] B. Weigelt, J. L. Peterse, and L. J. van’t Veer. Breast cancer metastasis: markers and models. Nature
Reviews Cancer, 5(8):591–602, 2005.
[10] R. Weinberg. The biology of cancer. Garland Science, 2013.
[11] C.-M. Svensson, S. Krusekopf, J. Lucke, and M. Thilo Figge. Automated detection of circulating
tumor cells with naive Bayesian classifiers. Cytometry. Part A : the journal of the International
Society for Analytical Cytology, 85(23):501–511, 2014.
[12] S. T. Ligthart, F. a. W. Coumans, G. Attard, A. M. Cassidy, J. S. de Bono, and L. W. M. M. Terstappen.
Unbiased and automated identification of a circulating tumour cell definition that associates with
overall survival. PloS one, 6(11), 2011.
[13] G. R. Simon. Management of Small Cell Lung Cancer. CHEST Journal, 132:324S, 2007.
45
[14] A. Rossi, P. Maione, G. Palazzolo, P. C. Sacco, M. L. Ferrara, M. Falanga, and C. Gridelli. New
targeted therapies and small-cell lung cancer. Clinical lung cancer, 9(5):271–9, 2008.
[15] T. J. N. Hiltermann, M. M. Pore, a. van den Berg, W. Timens, H. M. Boezen, J. J. W. Liesker, J. H.
Schouwink, W. J. a. Wijnands, G. S. M. a. Kerner, F. a. E. Kruyt, H. Tissing, a. G. J. Tibbe, L. W.
M. M. Terstappen, and H. J. M. Groen. Circulating tumor cells in small-cell lung cancer: a predictive
and prognostic factor. Annals of Oncology, 23(June):2937–2942, 2012.
[16] S. T. Ligthart, F. C. Bidard, C. Decraene, T. Bachelot, S. Delaloge, E. Brain, M. Campone, P. Viens,
J. Y. Pierga, and L. W. M. M. Terstappen. Unbiased quantitative assessment of Her-2 expression
of circulating tumor cells in patients with metastatic and non-metastatic breast cancer. Annals of
Oncology, 24:1231–1238, 2013.
[17] T. M. Scholtens, F. Schreuder, S. T. Ligthart, J. F. Swennenhuis, J. Greve, and L. W. M. M. Terstap-
pen. Automated identification of circulating tumor cells by image cytometry. Cytometry. Part A : the
journal of the International Society for Analytical Cytology, 81:138–48, 2012.
[18] M. Alunni-Fabbroni and M. T. Sandri. Circulating tumour cells in clinical practice: Methods of de-
tection and possible characterization. Methods, 50(4):289–297, 2010.
[19] D. R. Shaffer, M. a. Leversha, D. C. Danila, O. Lin, R. Gonzalez-Espinoza, B. Gu, A. Anand,
K. Smith, P. Maslak, G. V. Doyle, L. W. M. M. Terstappen, H. Lilja, G. Heller, M. Fleisher, and
H. I. Scher. Circulating tumor cell analysis in patients with progressive castration-resistant prostate
cancer. Clinical Cancer Research, 13(7):2023–2029, 2007.
[20] External quality assurance of circulating tumor cell enumeration using the CellSearch system: A
feasibility study. Cytometry Part B - Clinical Cytometry, 80 B(June 2010):112–118, 2011.
[21] C. Alix-Panabieres and K. Pantel. Circulating tumor cells: Liquid biopsy of cancer. Clinical Chem-
istry, 59:110–118, 2013.
[22] Z. S. Lalmahomed, J. Kraan, J. W. Gratama, B. Mostert, S. Sleijfer, and C. Verhoef. Circulating
tumor cells and sample size: The more, the better. Journal of Clinical Oncology, 28(17):288–289,
2010.
[23] S. Sleijfer, J. W. Gratama, A. M. Sieuwerts, J. Kraan, J. W. M. Martens, and J. a. Foekens. Circu-
lating tumour cell detection on its way to routine diagnostic implementation? European Journal of
Cancer, 43:2645–2650, 2007.
[24] G. Zack, W. Rogers, and S. Latt. Automatic measurement of sister chromatid exchange frequency.
Journal of Histochemistry & Cytochemistry, 25(7):741–753, 1977.
[25] R. C. Gonzalez, R. E. Woods, and S. L. Eddins. Digital Image Processing Using MATLAB, chapter 11. Prentice
Hall, 2003.
46
[26] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. volume 1, pages
886–893. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June
2005.
[27] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination: consistency proper-
ties. Tech. Rep. 4, 1951.
[28] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information
Theory, 13:21–27, 1967.
[29] C. J. Stone. Consistent Nonparametric Regression. The Annals of Statistics, 5(4):595–620, 1977.
[30] B. Efron. Bootstrap methods: Another look at the Jackknife. The Annals of Statistics, 7:1–26, 1979.
[31] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A Training Algorithm for Optimal Margin Classifiers.
Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–
152, 1992.
[32] V. N. Vapnik and A. Lerner. Pattern recognition using Generalized Portrait method. Automation and
Remote Control, 24, 1963.
[33] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[34] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over
data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[35] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the
weights is more important than the size of the network. IEEE Transactions on Information Theory,
44(2):525–536, 1998.
[36] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. Advances in Large Margin
Classifiers, pages 349–358, 2000.
[37] P. M. Morgado Marabilha. Automated Diagnosis of Alzheimer’s Disease using PET Images A study
of alternative procedures for feature extraction and selection. Master’s thesis, Instituto Superior
Tecnico, 2012.
[38] V. Vapnik. Statistical Learning Theory, chapter 10.9. Wiley, 1998.
[39] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. AI Memo 1602,
1997.
[40] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2
(3):1–27, 2011.
[41] L. G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, 1984.
47
[42] M. J. Kearns and L. G. Valiant. Learning Boolean formulae or finite automata is as hard as factoring.
Harvard University, Center for Research in Computing Technology, Aiken Computation Laboratory,
1988.
[43] M. Kearns and L. Valiant. Cryptographic Limitations on Learning Boolean Formulae and Finite
Automata. J. ACM, 41(1):67–95, 1994.
[44] R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2):197–227, 1990.
[45] Y. Freund. Boosting a weak learning algorithm by majority. Information and computation, 121(2):
256–285, 1995.
[46] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application
to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[47] R. E. Schapire. A brief introduction to boosting. IJCAI International Joint Conference on Artificial
Intelligence, 2(5):1401–1406, 1999.
[48] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Pro-
ceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 1:511–518, 2001.
[49] G. E. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for
balancing machine learning training data. ACM Sigkdd Explorations Newsletter, 6(1):20–29, 2004.
[50] C. Drummond, R. C. Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling
beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11. Citeseer,
2003.
[51] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to
alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics Part A:Systems
and Humans, 40(1):185–197, 2010.
[52] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-
sampling technique. Journal of artificial intelligence research, pages 321–357, 2002.
[53] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from
imbalanced data. In Proceedings of the 24th international conference on Machine learning, pages
935–942. ACM, 2007.
[54] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: improving classifi-
cation performance when training data is skewed. In Pattern Recognition, 2008. ICPR 2008. 19th
International Conference on, pages 1–4. IEEE, 2008.
[55] S. Varma and R. Simon. Bias in error estimation when using cross-validation for model selection.
BMC bioinformatics, 7(1):91, 2006.
48
[56] C. Petersohn. Temporal Video Segmentation, page 34. Vogt Verlag, 2010.
[57] J. P. Egan. Signal detection theory and ROC analysis. 1975.
[58] J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 240(4857):1285–1293, 1988.
[59] J. R. Beck and E. K. Shultz. The use of relative operating characteristic (roc) curves in test perfor-
mance evaluation. Archives of pathology & laboratory medicine, 110(1):13–20, 1986.
[60] R. Caruana. An Empirical Comparison of Supervised Learning Algorithms. pages 161–168, 2006.
[61] T. Fawcett and F. Provost. Analysis and Visualization of Classifier Performance: Comparison under
Imprecise Class and Cost Distributions. pages 43–48, 1997.
49
50
Appendix A
Histograms of Feature Distributions
A.1 Morphological Features Histograms
Figure A.1 presents the distributions of morphological features (area, eccentricity, perimeter and
perimeter to area ratio).
(a) Area (b) Eccentricity
(c) Perimeter (d) Perimeter to Area Ratio
Figure A.1: Histograms of the distributions of morphological features, for CTCs and non-CTCs.
51
A.2 Intensity Features Histograms
Figure A.2 presents the distributions of intensity features.
(a) DNA Mean Intensity (b) CK Mean Intensity (c) CD45 Mean Intensity
(d) DNA Maximum Intensity (e) CK Maximum Intensity (f) CD45 Maximum Intensity
(g) DNA Standard Deviation of inten-sity signal
(h) CK Standard Deviation of intensitysignal
(i) CD45 Standard Deviation of inten-sity signal
(j) DNA Mass (k) CK Mass (l) CD45 Mass
Figure A.2: Histograms of the distributions of intensity features, for CTCs and non-CTCs.
52
A.3 Texture Features Histograms
Figure A.3 presents the distributions of texture features, except HOG Features.
(a) Median of DNA Local Entropy (b) Median of CK Local Entropy (c) Median of CD45 Local Entropy
(d) Median of DNA Local Contrast (e) Median of CK Local Contrast (f) Median of CD45 Local Contrast
(g) Median of DNA Gradient Amplitude (h) Median of CK Gradient Amplitude (i) Median of CD45 Gradient Amplitude
Figure A.3: Histograms of the distributions of texture features (except HOG features), for CTCs and non-CTCs.
53
54
Appendix B
Classification Results
A total of three datasets were analysed in this thesis. The result for the three of them together
was presented in Section 5.4; however, a first test was performed in which two of the datasets were analysed
separately. For the sake of simplicity we will designate them as patient A and patient B. Section B.1 presents
the classification results of dataset 1 (patient A), and Section B.2 those of dataset 2 (patient B).
B.1 Dataset 1 - Patient A
(a) k-NN + Bootstrapping (b) k-NN with Prior Probabilities (c) k-NN with Optimal Number of Neighbors
Figure B.1: Receiver operating characteristic curves for classification of CTCs with k-Nearest Neighbors: with bootstrapping B.1(a), with Prior Probabilities B.1(b), and with the optimal number of neighbors B.1(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
Table B.1: Area Under the Curve (AUC) of each of the algorithms tested for patient A.

Area Under the Curve All Morph. User Int. Tex. DNA (Int+Tex.) CK (Int+Tex.) CD45 (Int+Tex.)
k-NN+Bootstrapping .6531 .5572 .5854 .6587 .6615 .5562 .5773 .5145
k-NN w/ Prior P. .6557 .5919 .6312 .6314 .6629 .5608 .6245 .4972
k-NN Neigh. .6397 .5918 .6077 .5961 .6166 .5373 .6476 .5127
SVM Linear .8183 .6272 .6387 .7418 .7340 .6349 .7295 .6303
SVM RBF .7639 .6501 .5917 .7086 .6945 .6244 .7062 .5517
AdaBoost .7364 .5018 .5937 .6524 .6821 .5865 .6933 .6100
RUSBoost .6739 .6019 .5663 .6791 .6301 .6628 .6708 .6364
55
(a) SVM Linear (b) SVM RBF
Figure B.2: Receiver operating characteristic curves for classification of CTCs with Support Vector Machines, using Linear B.2(a) and Gaussian (RBF) B.2(b) kernels. (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
(a) AdaBoost (b) RUSBoost
Figure B.3: Receiver operating characteristic curves for classification of CTCs with Ensemble methods, AdaBoost B.3(a) and RUSBoost B.3(b). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
56
B.2 Dataset 2 - Patient B
(a) k-NN + Bootstrapping (b) k-NN with Prior Probabilities (c) k-NN with Optimal Number of Neighbors
Figure B.4: Receiver operating characteristic curves for classification of CTCs with k-Nearest Neighbors: with bootstrapping B.4(a), with Prior Probabilities B.4(b), and with the optimal number of neighbors B.4(c). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
(a) SVM Linear (b) SVM RBF
Figure B.5: Receiver operating characteristic curves for classification of CTCs with Support Vector Machines, using Linear B.5(a) and Gaussian (RBF) B.5(b) kernels. (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
Table B.2: Area Under the Curve (AUC) of each of the algorithms tested for patient B.

Area Under the Curve All Morph. User Int. Tex. DNA (Int+Tex.) CK (Int+Tex.) CD45 (Int+Tex.)
k-NN+Bootstrapping .6547 .5995 .6874 .6852 .6185 .5648 .6842 .5878
k-NN w/ Prior P. .6463 .5775 .6768 .6825 .6140 .5506 .6811 .5751
k-NN Neigh. .6428 .5658 .6743 .6662 .6163 .5604 .6999 .5767
SVM Linear .7267 .6379 .7313 .7225 .6664 .6495 .7268 .5767
SVM RBF .7255 .6549 .7239 .7280 .6758 .6557 .7310 .6304
AdaBoost .7347 .6579 .7366 .7351 .6758 .6504 .7411 .6247
RUSBoost .6910 .5634 .6901 .6951 .6410 .5567 .6845 .5742
57
(a) AdaBoost (b) RUSBoost
Figure B.6: Receiver operating characteristic curves for classification of CTCs with Ensemble methods, AdaBoost B.6(a) and RUSBoost B.6(b). (All - all features; User - area, eccentricity, DNA maximum intensity, CK maximum intensity and CD45 intensity standard deviation; Morphological - area, eccentricity, perimeter and perimeter to area ratio; Texture - median local contrast, median local entropy and HOG features of the 3 channels; Intensity - mean, maximum and standard deviation of the intensity signal and mass of the 3 channels; DNA, CK, CD45 - texture and intensity features for the corresponding channel).
58
Appendix C
Noise Analysis
Given that no algorithm was implemented to reduce the noise of the images, this data is presented
in this appendix only for purposes of future work. The noise appears to be Gaussian (Figure C.1).
The DNA channel (Figure C.1(a)) has a mean of 0.0104 and a standard deviation of 0.0127, the CK
channel (Figure C.1(b)) a mean of 0.1888 and a standard deviation of 0.0420, and the CD45 channel
(Figure C.1(c)) a mean of 0.0585 and a standard deviation of 0.0153.
(a) Channel DNA Noise (b) Channel CK Noise (c) Channel CD45 Noise
Figure C.1: Distribution of Noise, by channel.
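A sketch of how per-channel noise statistics such as those above could be estimated, assuming a binary mask of the segmented ROIs is available (the function and argument names are hypothetical, not from the thesis):

```python
import numpy as np

def background_noise_stats(img: np.ndarray, foreground_mask: np.ndarray):
    """Estimate the mean and standard deviation of the background
    (noise) pixels of one channel, i.e. the pixels lying outside every
    segmented ROI.  Hypothetical sketch of how the per-channel figures
    quoted above could be obtained."""
    background = img[~foreground_mask]   # pixels not covered by any ROI
    return float(background.mean()), float(background.std())
```

Fitting a Gaussian to these background pixels, channel by channel, would also allow checking the Gaussian-noise assumption stated above before designing a denoising step.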
59
60