TDC2016SP - Trilha Data Science
-
Upload
tdc-globalcode -
Category
Education
-
view
136 -
download
3
Transcript of TDC2016SP - Trilha Data Science
![Page 1: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/1.jpg)
Você não precisa estar 100% certo
o tempo todo!
Algoritmos para consultar terabytes de dados
Fabiane Bizinella NardonTailTarget
@fabianenardon
![Page 2: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/2.jpg)
SUAS RESPOSTAS DEVEM ESTAR SEMPRE CERTAS
![Page 3: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/3.jpg)
ÀS VEZES 100% CERTAS
![Page 4: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/4.jpg)
ÀS VEZES 98% CERTAS
![Page 5: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/5.jpg)
2 Bilhões de tuplas / 71min 31 Milhões de tuplas / 3 seg
Fonte: http://highscalability.com/blog/2016/2/25/when-should-approximate-query-processing-be-used.html
![Page 6: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/6.jpg)
QUAL O TAMANHO DA AMOSTRA QUE EU
PRECISO?
![Page 7: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/7.jpg)
DEPENDE DE QUANTO VOCÊ ACEITA ERRAR
![Page 8: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/8.jpg)
MARGEM DE ERRO:INTERVALO NO QUAL
ESPERO ENCONTRAR O DADO QUE QUERO MEDIR.
![Page 9: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/9.jpg)
EXEMPLO:NÚMERO DE MULHERES NA INTERNET
COM 5% DE MARGEM DE ERRO. SE O NÚMERO ESTIMADO É DE 100 MILHÕES, ESPERO QUE O NÚMERO
REAL ESTEJA ENTRE 100 MILHÕES +/- 5%
![Page 10: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/10.jpg)
NÍVEL DE CONFIANÇA:CERTEZA QUE O DADO QUE BUSCAMOS ESTÁ DENTRO
DA MARGEM DE ERRO
![Page 11: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/11.jpg)
EXEMPLO:95% DE FATOR DE CERTEZA SIGNIFICA QUE EM 95% DOS
CASOS, O NÚMERO DE MULHERES ESTARÁ DENTRO
DE 100 MILHÕES +/- 5%
![Page 12: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/12.jpg)
MARGEM DE ERRO
NÍVEL DE CONFIANÇA
TAMANHODA AMOSTRA
![Page 13: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/13.jpg)
MARGEM DE ERRO NÍVEL DE
CONFIANÇA
TAMANHODA AMOSTRA
![Page 14: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/14.jpg)
MARGEM DE ERRO
NÍVEL DE CONFIANÇ
A
TAMANHODA AMOSTRA
![Page 15: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/15.jpg)
n = (z * stddev /mde)2
n = tamanho da amostrastddev = desvio padrão da população
mde = margem de erroz = Z-score
![Page 16: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/16.jpg)
nreal = (n * p) / (n + p – 1)
n = tamanho da amostranreal= tamanho real da amostra
P = tamanho da população
![Page 17: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/17.jpg)
n = (z * stddev /mde)2
Z-Score:
90% = 1.64595% = 1.9699% = 2.576
![Page 18: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/18.jpg)
n = (z * stddev /mde)2
n = (2.576 * 0.5 / 0.01)2
n = 16.590
n = tamanho da amostrastddev = desvio padrão da população
mde = margem de erroz = Z-score
![Page 19: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/19.jpg)
nreal = (n * p) / (n + p – 1)
nreal = (16590 * 100000) / (16590 + 100000 -1)
nreal = 14.229
n = tamanho da amostranreal= tamanho real da amostra
P = tamanho da população
![Page 20: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/20.jpg)
P = 100.000 -> nreal = 14.229
P = 200.000 -> nreal = 15.317 P = 400.000 -> nreal = 15.927
P = 1.000.000 -> nreal = 16.317P = 10.000.000 -> nreal = 16.560
P = 200.000.000 -> nreal = 16.586P = 400.000.000 -> nreal = 16.587
P = 1.000.000.000 -> nreal = 16.587
![Page 21: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/21.jpg)
POPULAÇÃO: 488.333.067
PROPORÇÃO POR GÊNERO:
AMOSTRA: 32.000M: 156.556.529 (32.05%)H: 95.148.646 (19.48%)
AMOSTRA: 16.000M: 158.036.788 (32.36%)H: 96.018.489 (19.66%)
![Page 22: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/22.jpg)
COMO ESCOLHER OS ITENS DA AMOSTRA
![Page 23: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/23.jpg)
SAMPLING WITH REPLACEMENT: ESCOLHE UM ITEM ALEATÓRIO
SUCESSIVAMENTE. MESMO ITEM PODE APARECER MAIS DE UMA
VEZ
![Page 24: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/24.jpg)
SAMPLING WITHOUT REPLACEMENT:
ESCOLHE UM ITEM ALEATÓRIO E GARANTE QUE ELE SÓ APARECE
UMA VEZ NA AMOSTRA.
![Page 25: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/25.jpg)
Random Sampling: Escolhe amostras aleatórias,
todas com o mesmo peso
![Page 26: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/26.jpg)
Stratified Sampling: Divide a população em stratas
e faz uma amostra de cada strata separadamente
![Page 27: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/27.jpg)
AMOSTRA RANDÔMICA
![Page 28: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/28.jpg)
PARA CADA GRUPO AMOSTRAL, GUARDE:
TAMANHO DA AMOSTRATAMANHO REAL
IDENTIFICADOR DO GRUPO AMOSTRAL
![Page 29: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/29.jpg)
EXTRAINDO AMOSTRAS DA BASE DE DADOS
![Page 30: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/30.jpg)
ESTRATÉGIA 1: GRAVE NÚMERO RANDÔMICO
INDEXADO NO BANCO DE DADOS
{"_id": 1,"gender": "f","age": 24,"selector": 0.12412423
}
* MongoDB
![Page 31: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/31.jpg)
int selector = 0;int sampleCount = 0;List<User> sample = new ArrayList<>();double prevSelector = -1;while (sampleCount < SAMPLE_SIZE) { /* Add to sample result of db.Users.find({ selector: { $gt: prevSelector}
}).limit(100);
Set last selector found to prevSelector */
sampleCount += sample.size();}
![Page 32: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/32.jpg)
ESTRATÉGIA 2: USE MECANISMO DE
SAMPLING DO SEU BD
![Page 33: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/33.jpg)
{"aggregations": { "sample": { "sampler": { "shard_size": 16000 }, "aggregations": { "views": { "terms": { "field": "gender“ }, "aggregations": { "unique_users": { "cardinality": { "field": "uid" } } } } } } }} * ElasticSearch
![Page 34: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/34.jpg)
ESTRATÉGIA 3: USANDO SHARDING
![Page 35: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/35.jpg)
SHARDING 1 SHARDING 2 SHARDING 3
ITEMS POR SHARDING = TAMANHO DA AMOSTRA / NÚMERO DE SHARDINGS
RANDOMKEYs RANDOMKEYs RANDOMKEYs
* Redis
![Page 36: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/36.jpg)
PARA CADA GRUPO AMOSTRAL, GUARDE:
TAMANHO DA AMOSTRATAMANHO REAL
IDENTIFICADOR DO GRUPO AMOSTRAL
![Page 37: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/37.jpg)
Ta = Tamanho da AmostraTp = Tamanho da PopulaçãoNa = Número de Itens na AmostraNp = Número de Itens na população
Np = Na * Tp / Ta
Exemplo:Ta = 1000Tp = 100.000Na = 400 Mulheres
Np = 400 * 100.000 / 1000 = 40.000
![Page 38: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/38.jpg)
NÃO SEI O TAMANHO DA POPULAÇÃO.
E AGORA?
![Page 39: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/39.jpg)
RESERVOIR SAMPLING
![Page 40: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/40.jpg)
1 2 3 4 5
A B C D E
F
Random (0..1): 0.7K = Ta / i K = 5 / 6 = 0.83Se K > Random => TROCA!
![Page 41: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/41.jpg)
RESERVOIR SAMPLINGDISTRIBUÍDO
![Page 42: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/42.jpg)
1 2 3 4 5
A B C D E F G H I J K L M N O P Q R S T U V X Y W Z
![Page 43: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/43.jpg)
1 2 3 4 5
A B C D E F G H I J K L M N O P Q R S T U V X Y W Z
![Page 44: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/44.jpg)
1 2 3 4 5
A:0.1 B:0.3 C:0.2 D:0.7 E:0.9 F:0.11 G:0.4 H:0.6 I:0.76
J:0.8 K:0.2 L:0.54 M:0.4 N:0.21 O:0.33 P:0.56 Q:0.32 R:0.23
S:0.21 T:0.32 U:0.22 V:0.7 X:0.12 Y:0.23 W:0.3 Z:0.76
private SortedMap<Double, MyObject> reservoir;
...if (reservoir.size() < SAMPLE_SIZE) { reservoir.put(score, myObject);} else if (score > reservoir.firstKey()) { reservoir.remove(reservoir.firstKey()); reservoir.put(score, myObject);}
![Page 45: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/45.jpg)
O L P I Z
1 2 3 4 5
A:0.1 B:0.3 C:0.2 D:0.7 E:0.9 F:0.11 G:0.4 H:0.6 I:0.76
J:0.8 K:0.2 L:0.54 M:0.4 N:0.21 O:0.33 P:0.56 Q:0.32 R:0.23
S:0.21 T:0.32 U:0.22 V:0.7 X:0.12 Y:0.23 W:0.3 Z:0.76
H:0.6 D:0.7 E:0.9 F:0.11 I:0.76 R:0.23 Q:0.32 O:0.33 L:0.54 P:0.56 S:0.21 U:0.22 Y:0.23 T:0.32 Z:0.76
COMBINER
![Page 46: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/46.jpg)
import org.apache.crunch.lib.Sample;
Sample.reservoirSample(PCollection<T> input, int sampleSize)
Apache Crunch:
![Page 47: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/47.jpg)
CONSULTAS APROXIMADAS
AGREGAÇÕESPROPORÇÕES
![Page 48: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/48.jpg)
COUNT
Np = Na * Tp / Ta
400 * 100.000 / 1000=
40.000
![Page 49: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/49.jpg)
SUM
Sp = Sa * Tp / Ta
600 * 100.000 / 1000=
60.000
![Page 50: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/50.jpg)
AVERAGE
Ap = (Sa * Tp / Ta) / (Na * Tp / Ta)
(600 * 100.000 / 1000) /
(400 * 100.000 / 1000)=
1.5
![Page 51: TDC2016SP - Trilha Data Science](https://reader036.fdocumentos.com/reader036/viewer/2022062412/58cfd4c11a28ab13238b4c27/html5/thumbnails/51.jpg)
Você não precisa estar 100% certo
o tempo todo!
Algoritmos em Java para consultar terabytes de dados
Fabiane Bizinella NardonTailTarget
@fabianenardon