Introduction to Machine Learning


Machine Learning and Large Data Sets

Prof. Dr. Thomas de Araujo Buck

September 12, 2011

Contents

1 Types of algorithms
  1.1 Deterministic
    1.1.1 "Game trees"
  1.2 Adaptive
    1.2.1 Some examples

2 The enormous avalanche of data
  2.1 Data centers
  2.2 Data processing

3 Machine Learning
  3.1 Typical data mining task
  3.2 Problems too difficult to be programmed
  3.3 Software that customizes to user

4 Large data sets
  4.1 Other related topics
  4.2 Examples
    4.2.1 KDD (with SVM)
    4.2.2 Images
    4.2.3 Videos
    4.2.4 Medical field

5 Conclusions

Page 2: Introdução à Machine Learning

1 Types of algorithms

1. Deterministic (or classical, conventional)

2. Adaptive (or stochastic, "advanced")

1.1 Deterministic

• Collision detection

• Prime factorization

• Inversion of (sparse) matrices

• Sorting (quicksort, mergesort)

• PageRank

• Somewhat more advanced

– A*

– "Game trees"


1.1.1 "Game trees"

• Tic-tac-toe

– How many possibilities are there in total?

∗ 9 × 8 × ... × 1 = 9! = 362,880 (an upper bound: it counts every ordering of the nine squares, including games that would already have ended)

• Checkers
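The 9! figure for tic-tac-toe counts every ordering of the nine squares. As a rough check, the sketch below walks the actual game tree and stops a branch as soon as a player completes a line. The names `LINES`, `has_winner` and `count_games` are illustrative and do not come from the slides.

```python
from math import factorial

# The eight winning lines on a 3x3 board whose cells are indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def has_winner(board):
    """Return True if any line holds three identical, non-empty marks."""
    return any(board[a] is not None and board[a] == board[b] == board[c]
               for a, b, c in LINES)


def count_games(board, player):
    """Count the complete games reachable from `board` with `player` to move."""
    if has_winner(board):
        return 1  # the previous move ended the game
    free = [i for i, cell in enumerate(board) if cell is None]
    if not free:
        return 1  # board is full: a draw
    total = 0
    for i in free:
        board[i] = player                      # play the move
        total += count_games(board, "O" if player == "X" else "X")
        board[i] = None                        # undo it before trying the next one
    return total


upper_bound = factorial(9)                     # 9! orderings of the squares
games = count_games([None] * 9, "X")           # games that respect early wins
print(upper_bound, games)
```

Because many games end before the board is full, the exhaustive count comes out below 9!. The same depth-first walk is what becomes infeasible for games with larger trees.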


• Chess [5, 6]


• How many possibilities are there in total?¹

– For a search depth P and a branching factor R, the number of possible nodes N is given by the formula

N = R^P

– An average chess game lasts 50 moves, that is, 100 plies (half-moves): 50 played by White and 50 by Black.

– Since the branching factor averages about 35, the number of nodes in the tree for one game can be estimated as N = 35^100 ≈ 2.55155207 × 10^154.

– A computer examining two million positions per second would need on the order of 4 × 10^140 years to exhaust the whole tree (35^100 positions at 2 × 10^6 per second, with about 3.16 × 10^7 seconds per year).

• This raises the famous question: what is an "intelligent" program?

• Who remembers the match of man (Garry Kasparov) against machine (IBM Deep Blue) [7, 8]?

• One more question: is chess, in this sense, the most "difficult" game ever created by humans?

¹ Answer obtained from the internet.
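The node-count estimate N = R^P for chess can be reproduced in a few lines. The sketch below uses the figures from the slide (R = 35, P = 100, two million positions per second). The function names and the choice of a 365.25-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: one year = 365.25 days


def game_tree_nodes(branching, depth):
    """N = R^P, computed exactly with Python's arbitrary-precision integers."""
    return branching ** depth


def years_to_exhaust(nodes, positions_per_second):
    """Time needed to visit every node once at a fixed search speed."""
    return nodes / positions_per_second / SECONDS_PER_YEAR


nodes = game_tree_nodes(branching=35, depth=100)
years = years_to_exhaust(nodes, positions_per_second=2_000_000)

print(f"N     = {nodes:.8e}")
print(f"years = {years:.2e}")
```

Even a search a billion times faster would remove only nine of the roughly 140 orders of magnitude, which is why exhaustive enumeration is not an option for chess.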


• Go [11]

• See also [9, 10]

• There are signs of hope [12]


1.2 Adaptive

• What is an "intelligent" program?

• Is it a program "that learns"?

1.2.1 Some examples

• Face recognition

• Credit analysis

• Autonomous navigation

• Medical diagnosis

• Financial forecasting (prognosis)

• Recommender systems

• Logistics


• Text processing

– Spam

– News

– Plagiarism

• Machine learning

– Supervised (learns from examples), which has 2 phases: training and operation

∗ NN

∗ Classification (Linear Discriminant - LD)

∗ Regression [66, 67]

– Unsupervised (learns on its own), which has only the operation phase

∗ Cluster analysis (K-means clustering)
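The contrast between the two kinds of learning listed above can be sketched with toy one-dimensional data. In the supervised case a least-squares line is fitted to labelled examples (training) and then applied to a new input (operation). In the unsupervised case K-means groups unlabelled points directly. All function names and data values are made up for illustration, and the regression shown is ordinary least squares, which may differ from the treatment in [66, 67].

```python
from statistics import mean


def fit_line(xs, ys):
    """Training phase: least-squares fit of y = a*x + b from labelled examples."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def predict(model, x):
    """Operation phase: apply the trained model to an unseen input."""
    a, b = model
    return a * x + b


def k_means_1d(points, centers, iterations=10):
    """Unsupervised: group unlabelled points around the given initial centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Supervised: the examples follow y = 2x + 1 (toy data).
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = predict(model, 10.0)

# Unsupervised: no labels, only points and a guess for the initial centers.
centers, clusters = k_means_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 5.0])
print(model, prediction, centers)
```

The supervised functions need the answers `ys` before they can be used, whereas `k_means_1d` is handed only the points themselves.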


2 The enormous avalanche of data


• Article in The Economist [4]


2.1 Data centers

• Google [73]

• Facebook [72]

2.2 Data processing

• What should be done with all this data? Just store it? Index it?

• Or should useful information be extracted from it?


3 Machine Learning

• Definition of Machine Learning (ML): see [38]

• Another definition of ML: see [39]

• On Support Vector Machines (SVM): see [38, 51]

– Support vector machines represent a powerful new class of models invented by Vladimir Vapnik in the early 1990s


• 3 examples of ML applications [39]

3.1 Typical data mining task

• Credit risk analysis


3.2 Problems too difficult to be programmed

• The DARPA Grand Challenge competition: urban version [42, 43, 44, 45]

• The Google Car experiment [41]


• A few more details

• A small problem?

• Other references [52, 54, 55]


3.3 Software that customizes to user


4 Large data sets

• Data analysis

– Manual

– Automatic

4.1 Other related topics

• Data mining

– Manual

∗ Visual data mining [63]

– Automatic

4.2 Examples

• Credit risk analysis

• The IBM Watson experiment [40, 46, 47]


4.2.1 KDD (with SVM)

• See [38]


4.2.2 Images

• Content-based access [13, 14, 15, 16, 17, 20, 21, 24]

• PhotoLib [19]

• Games with a purpose (GWAP) [18, 26]

• Pixazza → Luminate

• Semantics [22, 23]

• Learning [23, 25]

4.2.3 Videos

• Analysis


4.2.4 Medical field

• Mammography

• Colonoscopy [30, 31, 35]

– The generations of computed tomography equipment

– Types: conventional and "virtual" - advantages and drawbacks / limitations

– Simple visualization [29]


• Display modes for CT colonography [32, 33]

• Computer-Aided Diagnosis (CAD): detecting polyps at CT colonography [34]


• Quantification of Distention in CT Colonography [36]

• Computerized Detection of Colonic Polyps at CT Colonography [37]


5 Conclusions

• Computational processing of large amounts of data is an opportunity, according to the consulting firm McKinsey [27, 28]


References

[1] Richard G. Baraniuk. More is less: Signal processing and the data deluge. Science, 331:717–719, February 2011.

[2] Peter Fox and James Hendler. Changing the equation on scientific data visualization. Science, 331:705–708, February 2011.

[3] Trudie Lang. Advancing global health research through digital technology and sharing data. Science, 331:714–717, February 2011.

[4] Kenneth Cukier. Data, data everywhere. The Economist, February 2010.

[5] Philip Ross. The meaning of computers and chess. Spectrum, March 2003.

[6] Philip E. Ross. Silicon shows its mettle. Spectrum, 40(3):24–26, March 2003.

[7] Feng-Hsiung Hsu. Behind Deep Blue. Princeton University Press, 2002.

[8] Yasser Seirawan, Herbert A. Simon, and Toshinori Munakata. The implications of Kasparov vs. Deep Blue. Communications of the ACM, 40(8):21–25, August 1997.

[9] John A. Bate. A beginners introduction to Go. Technical report, Department of Computer Science, University of Manitoba, 1997.

[10] James Hendler. Computers play chess; humans play Go. IEEE Intelligent Systems, 21(4):2–3, July/August 2006.

[11] Feng-Hsiung Hsu. Cracking Go. Spectrum, 44(10):44–49, October 2007.

[12] Kirk L. Kroeker. A new benchmark for artificial intelligence. Communications of the ACM, 54(8):13–15, August 2011.

[13] Charles A. Jacobs, Adam Finkelstein, and David H. Salesin. Fast multiresolution image querying. Computer Graphics, pages 277–286, August 6–11 1995. ACM SIGGRAPH Annual Conference Series.

[14] Everest Mathias and Aura Conci. Comparing the influence of color spaces and metrics in content-based image retrieval. In Luciano da Fontoura Costa and Gilberto Câmara, editors, SIBGRAPI 98 – International Symposium on Computer Graphics, Image Processing and Vision, pages 371–378, Rio de Janeiro – RJ, October 20–23 1998. Sociedade Brasileira de Computação (SBC) – Instituto de Matemática Pura e Aplicada (IMPA).

[15] N. Sebe, M. Lew, and D. P. Huijsmans. Towards optimal ranking metrics. In Luciano da Fontoura Costa and Gilberto Câmara, editors, SIBGRAPI 98 – International Symposium on Computer Graphics, Image Processing and Vision, pages 379–386, Rio de Janeiro – RJ, October 20–23 1998. Sociedade Brasileira de Computação (SBC) – Instituto de Matemática Pura e Aplicada (IMPA).

[16] Henry Lieberman, Elizabeth Rozenweig, and Push Singh. Aria: An agent for annotating and retrieving images. Computer, 34(7):57–62, July 2001.

[17] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, and Peter Yanker. Query by image content: the QBIC system. Computer, 28(9):23–32, September 1995.

[18] Luis von Ahn. Games with a purpose. Computer, 39(6):92–94, June 2006.

[19] Ben Shneiderman, Benjamin B. Bederson, and Steven M. Drucker. Find that photo!: interface strategies to annotate, browse, and share. Communications of the ACM, 49(4):69–71, April 2006.

[20] Sixto Ortiz Jr. Searching the visual web. Computer, 40(6):12–14, June 2007.


[21] Ricardo da Silva Torres and Alexandre Xavier Falcão. Content-based image retrieval: Theory and applications. Revista de Informática Teórica e Aplicada (RITA), 13(2):161–185, 2006.

[22] Nuno Vasconcelos. From pixels to semantic spaces: Advances in content-based image retrieval. Computer, 40(7):20–26, July 2007.

[23] Victor Lavrenko, R. Manmatha, and Jiwoon Jeon. A model for learning the semantics of pictures. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2003.

[24] Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, December 2000.

[25] Tao Wang, Yong Rui, Shi-Min Hu, and Jia-Guang Sun. Adaptive tree similarity learning for image retrieval. Multimedia Systems, 9(2):131–143, August 2003.

[26] Luis von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58–67, August 2008.

[27] McKinsey Global Institute. The challenge – and opportunity – of 'big data'. The McKinsey Quarterly, May 2011.

[28] McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company, May 2011.

[29] Suya You, Lichan Hong, Ming Wan, Kittiboon Junyaprasert, Arie Kaufman, Shigeru Muraki, Yong Zhou, Mark Wax, and Zhengrong Liang. Interactive volume rendering for virtual colonoscopy. In Roni Yagel and Hans Hagen, editors, Visualization '97, pages 433–436, Phoenix, AZ, October 19–24 1997. IEEE Computer Society Press.

[30] Lichan Hong, Shigeru Muraki, Arie Kaufman, Dirk Bartz, and Taosong He. Virtual voyage: Interactive navigation in the human colon. Computer Graphics, pages 27–34, August 3–8 1997. ACM SIGGRAPH Annual Conference Series.

[31] David Essex. Did somebody say virtual colonoscopy? Communications of the ACM, 52(4):16–18, April 2009.

[32] Chandu Karadi, Christopher F. Beaulieu, R. Brooke Jeffrey Jr., David S. Paik, and Sandy Napel. Display modes for CT colonography: Part I. Synthesis and insertion of polyps into patient CT data. Radiology, 212(1):195–201, July 1999.

[33] Christopher F. Beaulieu, R. Brooke Jeffrey Jr., Chandu Karadi, David S. Paik, and Sandy Napel. Display modes for CT colonography: Part II. Blinded comparison of axial CT and virtual endoscopic and panoramic endoscopic volume-rendered studies. Radiology, 212(1):203–212, July 1999.

[34] Hiroyuki Yoshida, Janne Nappi, Peter MacEneaney, David T. Rubin, and Abraham H. Dachman. Computer-aided diagnosis scheme for detection of polyps at CT colonography. RadioGraphics, 22(4):963–979, July–August 2002.

[35] Arie E. Kaufman, Sarang Lakare, Kevin Kreeger, and Ingmar Bitter. Virtual colonoscopy. Communications of the ACM, 48(2):37–41, February 2005.

[36] Peter W. Hung, David S. Paik, Sandy Napel, Judy Yee, R. Brooke Jeffrey Jr., Andreas Steinauer-Gebauer, Juno Min, Ashwin Jathavedam, and Christopher F. Beaulieu. Quantification of distention in CT colonography: Development and validation of three computer algorithms. Radiology, 222:543–554, February 2002.

[37] Hiroyuki Yoshida, Yoshitaka Masutani, Peter MacEneaney, David T. Rubin, and Abraham H. Dachman. Computerized detection of colonic polyps at CT colonography on the basis of volumetric features: Pilot study. Radiology, 222:327–336, February 2002.


[38] Lutz H. Hamel. Knowledge Discovery with Support Vector Machines. Wiley-Interscience, 2009.

[39] Tom Mitchell. Machine Learning. McGraw-Hill, 1997.

[40] Kirk L. Kroeker. Weighing Watson's impact. Communications of the ACM, 54(7):13–15, July 2011.

[41] Alex Wright. Automotive autonomy. Communications of the ACM, 54(7):16–18, July 2011.

[42] L. D. Jackel, Douglas Hackett, Eric Krotkov, Michal Perschbacher, James Pippine, and Charles Sullivan. How DARPA structures its robotics programs to improve locomotion and navigation. Communications of the ACM, 50(11):55–59, November 2007.

[43] Sebastian Thrun. Why we compete in DARPA's Urban Challenge autonomous robot race. Communications of the ACM, 50(10):29–31, October 2007.

[44] Jean Kumagai. Dusted: No winners in DARPA's $1 million robotic race across the Mojave Desert. Spectrum, March 2004.

[45] Guna Seetharaman, Arun Lakhotia, and Erik Philip Blasch. Unmanned vehicles come of age: The DARPA Grand Challenge. Computer, 39(12):26–29, December 2006.

[46] Stephen Baker. The programmer's dilemma: Building a Jeopardy! champion. The McKinsey Quarterly, February 2011.

[47] Greg Lindsay. Changing the game: "How I beat Watson and came out a different player". The McKinsey Quarterly, February 2011.

[48] Gary Anthes. Automated translation of Indian languages. Communications of the ACM, 53(1):24–26, January 2010.

[49] Thomas Lengauer, Andre Altman, Alexander Thielen, and Rolf Kaiser. Chasing the AIDS virus. Communications of the ACM, 53(3):66–74, March 2010.

[50] Joseph MacInnes, Stephanie Santosa, and William Wright. Visual classification: Expert knowledge guides machine learning. IEEE Computer Graphics and Applications, 30(1):8–14, January/February 2010.

[51] Robert P. Schumaker and Hsinchun Chen. A discrete stock price prediction engine based on financial news. Computer, 43(1):51–56, January 2010.

[52] Leslie Pack Kaelbling. New bar set for intelligent vehicles. Communications of the ACM, 53(4):98, April 2010.

[53] Gregory Goth. Turning data into knowledge. Communications of the ACM, 53(11):13–15, November 2010.

[54] Sebastian Thrun. Toward robotic cars. Communications of the ACM, 53(4):99–106, April 2010.

[55] Kathy Kowalenko. Keeping cars from crashing. The Institute, 34(3):5, September 2010.

[56] Ariel Bleicher. Eyes in the sky that see too much. Spectrum, 47(10):16, October 2010.

[57] Yair Weiss and Judea Pearl. Belief propagation. Communications of the ACM, 53(10):94, October 2010.

[58] Erik B. Sudderth, Alexander T. Ihler, Michael Isard, William T. Freeman, and Allan S. Willsky. Nonparametric belief propagation. Communications of the ACM, 53(10):95–103, October 2010.

[59] Fernando Pereira. A model sequence memoizer. Communications of the ACM, 54(2):90, February 2011.


[60] Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James, and Yee Whye Teh. The sequence memoizer. Communications of the ACM, 54(2):91–98, February 2011.

[61] Gabor Szabo and Bernardo A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8):80–88, August 2010.

[62] J. M. Mendel and K. S. Fu. Adaptive, Learning and Pattern Recognition Systems. Academic Press, 1970.

[63] Kwan-Liu Ma. Machine learning to boost the next generation of visualization technology. IEEE Computer Graphics and Applications, 27(5):6–9, September/October 2007.

[64] Fabio A. Gonzalez and Eduardo Romero. Biomedical Image Analysis and Machine Learning Technologies: Applications and Techniques. IGI Global, 2010.

[65] Amos J. Storkey. Machine learning and pattern recognition: Introduction. Technical report, Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 2009.

[66] Amos J. Storkey. Machine learning and pattern recognition: Regression and linear parameter models. Technical report, Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 2009.

[67] Leônidas Conceição Barroso, Magali Maria de Araújo Barroso, Frederico Ferreira Campos Filho, Márcio Luiz Bunte de Carvalho, and Miriam Lourenço Maia. Cálculo Numérico (com Aplicações). Editora Harbra Ltda., 1987.

[68] Amos J. Storkey. Machine learning and pattern recognition: Preliminaries. Technical report, Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 2009.

[69] Garrett Birkhoff and Saunders MacLane. Álgebra Moderna Básica. Guanabara Dois, 1980.

[70] Jieping Ye, Teresa Wu, Jing Li, and Kewei Chen. Machine learning approaches for the neuroimaging study of Alzheimer's disease. Computer, 44(4):99–101, April 2011.

[71] Ting Liu, Charles Rosenberg, and Henry A. Rowley. Clustering billions of images with large scale nearest neighbor search. In Eighth IEEE Workshop on Applications of Computer Vision (WACV), 2007.

[72] David Schneider. Under the hood at Google and Facebook. Spectrum, 48(5):54–59, May 2011.

[73] Randy H. Katz. Tech titans building boom. Spectrum, 46(2):36–39; 46–49, February 2009.
