Eisenlohr 2013 - Desafios Em Análise de Dados Armadilhas e Sugestões Para Uma Rotina Estatística...

How to cite: EISENLOHR, P.V. Challenges in Data Analysis: pitfalls and suggestions for a statistical routine in Vegetation Ecology. Brazilian Journal of Botany. DOI: 10.1007/s40415-013-0002-9
Obs: The final publication is available at link.springer.com.

Point of View

CHALLENGES IN DATA ANALYSIS: PITFALLS AND SUGGESTIONS FOR A STATISTICAL ROUTINE IN VEGETATION ECOLOGY

PEDRO V. EISENLOHR 1

Running Title: Challenges and pitfalls in data analysis

1 Universidade Federal de Minas Gerais, Instituto de Ciências Biológicas, Departamento de Botânica. Av. Presidente Antônio Carlos, 6627, Pampulha, Belo Horizonte, MG, 31270-901. [email protected]


ABSTRACT – (Challenges in data analysis: pitfalls and suggestions for a statistical routine in Vegetation Ecology). The data analysis step of a scientific study is not always a friendly universe. Here I provide suggestions and warn of five pitfalls in a proposed statistical routine focused on the selection of predictor variables for multiple regression – a simple model used to answer questions commonly raised in Vegetation Ecology – and on the verification of the assumptions of this method. I believe that this manuscript will clarify important points in the data analysis process and, therefore, contribute to making studies in Vegetation Ecology more competitive in the international scientific scenario.

Key words: multiple regression, numerical ecology, variable selection

RESUMO – (Desafios em análise de dados: armadilhas e sugestões para uma rotina estatística em Ecologia da Vegetação). A etapa de análise de dados em um trabalho


“‘What is the best model to use?’ is the critical question in making valid inference from data in the biological sciences” (Burnham & Anderson 2002)

Introduction

Any scientific investigation requires that the data generated during fieldwork or in the laboratory be analyzed appropriately. This stage of analysis is often a difficult and tricky task. With the advent of modern statistical techniques, especially those related to the assumption of spatial independence (Diniz-Filho et al. 2003), analyzing data has become an increasingly stimulating practice, although for some it can also be a harsh one.

I have noticed that researchers in Vegetation Ecology generally do not feel very comfortable in this statistical universe. I therefore believe that we need texts that encourage them to perform their numerical analyses properly. Herein, I intend to


In the academic universe, everything starts with an interesting question guiding the research. Suppose a question like this: how could the species distribution or richness of a particular group of plants vary between different regions according to climate and to topographic and soil variables? This is an important question for making decisions about the susceptibility of this group of species to the environmental variation found in different locations. Some remarkable papers have addressed this issue (Oliveira-Filho & Fontes 2000, Oliveira-Filho et al. 2006, Santos et al. 2012, among others). After outlining the sampling and/or collection sites from the literature (see general recommendations in Felfili et al. 2011a), it is necessary to obtain and select predictors that explain the variation in the species distribution or richness. For this, it is imperative to consider variables that are, at least in theory, biologically important for the group of species considered. It is also necessary to consider the temporal and spatial scales in which the researcher is interested. After all, reasonable behavior and statistics should always go


algorithm used and, where there are many sample sites, the intensive computational effort, even with efficient processing.

There are many methods currently available for the selection of variables (Burnham & Anderson 2002). Methods such as "Best Subsets", in which all possible models are fitted and the best one is chosen based on objective criteria, have been particularly advocated, because they avoid the bias of stepwise selection, in which the order of selection of the variables influences the next variable to be added to or removed from the model (Whittingham et al. 2006). However, when automatic selection is performed on a large number of variables, the total running time can reach days, and there is still the risk of the program “crashing”. Another fundamental question arises here: is it essential to use all these variables, considering the differences in spatial and temporal scales?
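
To make the computational cost concrete, the short sketch below counts how many candidate models an exhaustive ("Best Subsets") search would have to fit for a given number of predictors. It assumes that every non-empty subset of predictors is one candidate model and that the intercept-only model is not counted; the function name is hypothetical.

```python
from math import comb


def count_candidate_models(n_predictors: int) -> int:
    """Count the non-empty predictor subsets an exhaustive search has to fit."""
    # Sum over subset sizes 1..p of "p choose k", which equals 2**p - 1.
    return sum(comb(n_predictors, k) for k in range(1, n_predictors + 1))


for p in (5, 10, 20, 30):
    print(f"{p:2d} predictors -> {count_candidate_models(p):,} candidate models")
```

The count doubles with every additional predictor, which is why reducing the number of variables before the automatic selection pays off so quickly.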

Thus, if the number of potentially relevant variables from the biological point of


of models and one would be able to directly verify the assumptions of regression. If the PCA is not sufficient to greatly reduce the dimensionality, it can at least be useful to detect and remove clearly collinear variables, which will help to reduce the number of variables for the next step. Another interesting and useful procedure would be to use the joint plot function available in some software, which overlays the most “explanatory” variables (the cutoff level is usually defined by the researcher) on the ordination axes. With this procedure, the researcher achieves greater clarity (even graphically) about which variables are most strongly correlated with the ordination axes of interest.
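
The idea of inspecting which variables are most correlated with the ordination axes can be sketched numerically. The example below is only an illustration under stated assumptions: the PCA is run on standardized predictors (that is, on the correlation matrix), the data are synthetic, and the variable names and the 0.7 cutoff are hypothetical choices, not values taken from the text.

```python
import numpy as np


def pca_on_correlation(X):
    """PCA of standardized predictors (i.e. on the correlation matrix) via SVD.

    Returns the proportion of variance per axis, the site scores, and the
    correlation of every original variable with every axis.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s**2 / (Z.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    scores = U * s
    # For standardized data, eigenvector * sqrt(eigenvalue) is the
    # variable-axis correlation.
    loadings = Vt.T * np.sqrt(eigenvalues)
    return explained, scores, loadings


# Synthetic data: 50 sites and three hypothetical predictors.
rng = np.random.default_rng(0)
temperature = rng.normal(size=50)
rainfall = -temperature + rng.normal(scale=0.1, size=50)  # nearly redundant
slope = rng.normal(size=50)                               # unrelated
names = ["temperature", "rainfall", "slope"]
X = np.column_stack([temperature, rainfall, slope])

explained, scores, loadings = pca_on_correlation(X)
cutoff = 0.7  # researcher-defined, in the spirit of a joint plot
for axis in range(2):
    strong = [n for n, r in zip(names, loadings[:, axis]) if abs(r) >= cutoff]
    print(f"axis {axis + 1}: {explained[axis]:.0%} of variance, strong variables: {strong}")
```

Two variables that load strongly on the same axis, as the two climatic variables do here, are candidates for being treated as collinear, so that only one of them goes on to the next step.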

Step 2: Obtaining a useful model

With the reduced dimensionality, we come to what I shall call hereafter the “full regression model” (note that I am assuming that only one response variable is used). We can then contemplate the selection of the best model. However, three points are


2012). If the researcher is interested in diversity patterns, species richness can be used as the operational response variable, among others.

For point (ii), one must check whether the full regression model shows significant spatial structure, because this can affect the automatic selection (Diniz-Filho et al. 2008 and references therein). To this end, a correlogram can be built for the residuals (Diniz-Filho et al. 2003). In correlograms, significance values (P) of the spatial structure are generated for different distance classes (Legendre & Fortin 1989). Two coefficients can be used as indicators of spatial structure: Moran’s I and Geary’s c (Legendre & Fortin 1989). In order to decide whether a correlogram is globally significant, a corrective procedure should be applied to the values of P, since multiple tests are performed on the same data set. The corrective procedure traditionally used is Bonferroni but, as it is very conservative, some authors suggest using the sequential Bonferroni correction, computed for each distance class separately


of predictor variables can be made by keeping the selected filters fixed, i.e., as variables that will necessarily be selected (Diniz-Filho et al. 2008).
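
The residual correlogram discussed under point (ii) can be sketched as follows. This is a minimal illustration, not the procedure of any particular software: it assumes binary weights per distance class, a two-sided permutation test for Moran's I, and the plain Bonferroni criterion for global significance (the sequential variant is not implemented). The "residuals" are synthetic values with a deliberate spatial gradient, and all function names and distance breaks are hypothetical. Each distance class is assumed to contain at least one pair of sites.

```python
import numpy as np


def morans_i(values, weights):
    """Moran's I for one binary weight matrix (1 = pair falls in the distance class)."""
    z = values - values.mean()
    return (len(values) / weights.sum()) * (z @ weights @ z) / (z @ z)


def correlogram(coords, values, breaks, n_perm=499, seed=0):
    """Moran's I and a two-sided permutation P for each distance class."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    expected = -1.0 / (len(values) - 1)  # expectation of I without autocorrelation
    rows = []
    for lower, upper in zip(breaks[:-1], breaks[1:]):
        w = ((dist > lower) & (dist <= upper)).astype(float)
        observed = morans_i(values, w)
        permuted = np.array(
            [morans_i(rng.permutation(values), w) for _ in range(n_perm)]
        )
        extreme = np.sum(np.abs(permuted - expected) >= abs(observed - expected))
        rows.append((lower, upper, observed, (extreme + 1) / (n_perm + 1)))
    return rows


def globally_significant(rows, alpha=0.05):
    """Bonferroni criterion: at least one class with P below alpha / number of classes."""
    return any(row[3] < alpha / len(rows) for row in rows)


# Synthetic example: 8 x 8 grid of sites, "residuals" with an east-west gradient.
gx, gy = np.meshgrid(np.arange(8), np.arange(8))
coords = np.column_stack([gx.ravel(), gy.ravel()])
rng = np.random.default_rng(1)
residuals = 0.5 * coords[:, 0] + rng.normal(scale=0.5, size=len(coords))

rows = correlogram(coords, residuals, breaks=[0.0, 1.5, 3.0, 4.5])
for lower, upper, i_value, p_value in rows:
    print(f"{lower:.1f}-{upper:.1f}: I = {i_value:+.2f}, P = {p_value:.3f}")
print("globally significant:", globally_significant(rows))
```

With real data, `residuals` would be the residuals of the full regression model and `coords` the site coordinates. A globally significant correlogram would then suggest adding spatial filters before the automatic selection.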

For point (iii), choosing the model with the lowest value of AICc (Corrected Akaike Information Criterion) is an attractive approach, as it combines the principle of parsimony with the descriptive accuracy of the data (Burnham & Anderson 2002). There are, however, many other methods of automatic selection (for a review, see Burnham & Anderson 2002).
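
As a minimal sketch of selection by AICc, the example below fits every non-empty subset of predictors by ordinary least squares and ranks the subsets by AICc. It assumes the least-squares form of the criterion, AIC = n ln(RSS/n) + 2k with AICc = AIC + 2k(k + 1)/(n - k - 1), where k counts the regression coefficients (including the intercept) plus the residual variance. The data are synthetic and the function names are hypothetical.

```python
import itertools

import numpy as np


def aicc_ols(y, X):
    """AICc of an ordinary least-squares fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # coefficients (with intercept) + residual variance
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)


def rank_subsets(y, X):
    """Fit every non-empty predictor subset and sort by increasing AICc."""
    n_predictors = X.shape[1]
    ranking = []
    for size in range(1, n_predictors + 1):
        for subset in itertools.combinations(range(n_predictors), size):
            ranking.append((aicc_ols(y, X[:, list(subset)]), subset))
    return sorted(ranking)


# Synthetic data: only predictors 0 and 2 actually drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)

ranking = rank_subsets(y, X)
for aicc, subset in ranking[:3]:
    print(f"predictors {subset}: AICc = {aicc:.2f}")
```

Models whose AICc values are close to the minimum are often treated as similarly supported, so it is worth inspecting the top of the ranking rather than only the first row.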

Step 3: Verifying the assumptions

A conventional multiple regression model makes some important assumptions: (i) independence of residuals, (ii) normality of residuals, (iii) homogeneity of variances, (iv) linearity and (v) absence of collinearity (Quinn & Keough 2002). Moreover, global models, among which the conventional multiple regression is included, assume that (vi)


pitfall: finding significant spatial structures in the response variable does not necessarily imply violation of the assumption of spatial independence. This violation occurs only if the spatial structure is significant in the residuals of the model under investigation (Diniz-Filho et al. 2003, Diniz-Filho & Bini 2005).

A relatively simple strategy to deal with spatial structure would be to include additional spatial variables (filters), as previously emphasized. Either way, it is important to prepare a correlogram for the response variable because it allows researchers to detect possible gradients, patches and gaps in the distribution of the variable over space, which is generally valuable for ecological discussions (Legendre & Fortin 1989).

The (ii) normality of the residuals can be tested through the Shapiro-Wilk and D'Agostino-Pearson tests, among others. The null hypothesis of these tests is that the residuals are normally distributed. A graphic analysis of the distribution of residuals in a


after carrying out these measures, or having to work with non-parametric regressions. Note that we have here another pitfall: I often see researchers simply transforming the data without checking the assumptions again.
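
The pitfall above amounts to a simple rule: whatever check was run before the transformation must be run again after it. The sketch below illustrates this for the normality of residuals only, using the Shapiro-Wilk test from SciPy (`scipy.stats.shapiro`). The data are synthetic (a response that is log-normal by construction), the log transformation is just one possible choice, and the helper names are hypothetical.

```python
import numpy as np
from scipy import stats


def ols_residuals(y, x):
    """Residuals of a least-squares regression of y on x (with intercept)."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


def shapiro_report(residuals, alpha=0.05):
    """Shapiro-Wilk test; the null hypothesis is that residuals are normal."""
    w, p = stats.shapiro(residuals)
    return {"W": float(w), "P": float(p), "normality_rejected": bool(p < alpha)}


# Synthetic data: the response is log-normal, so raw residuals are skewed.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=120)
y = np.exp(0.3 * x + rng.normal(scale=0.6, size=120))

before = shapiro_report(ols_residuals(y, x))
# Transforming is not the end of the routine: refit and test again.
after = shapiro_report(ols_residuals(np.log(y), x))

print("raw response:        ", before)
print("log-transformed data:", after)
```

The same refit-and-recheck loop applies to the other assumptions (homogeneity of variances, linearity, spatial independence of residuals), which are not covered by this sketch.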

It is also important (v) to check that the final model is free of collinearity. One of the most practical ways of doing this is to compute the Variance Inflation Factor (VIF). Several authors (e.g. Myers 1986, Quinn & Keough 2002) consider that the maximum allowed VIF should be 10 (or, alternatively, that the minimum tolerance value, the inverse of VIF, should be 0.1). A potential pitfall would be to eliminate all predictor variables with VIF above the cutoff. This is usually not necessary. I think this inadequate procedure stems from the fact that researchers are often not clear about which variables actually show collinearity. For this, an examination of a PCA and a correlation matrix will certainly help. Generally, eliminating one or a few of these variables is enough to remove the collinearity.
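
The VIF pitfall can be illustrated with a small numerical sketch. It assumes the standard definition VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors; this equals the j-th diagonal element of the inverse of the predictors' correlation matrix, which is what the code computes. The data are synthetic and the variable names are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic predictors: temperature is almost a linear function of altitude.
rng = np.random.default_rng(7)
n = 100
altitude = rng.normal(size=n)
temperature = -altitude + rng.normal(scale=0.05, size=n)
soil_ph = rng.normal(size=n)
names = ["altitude", "temperature", "soil_ph"]
X = np.column_stack([altitude, temperature, soil_ph])

vif_all = dict(zip(names, variance_inflation_factors(X)))
print("all predictors:   ", {k: round(float(v), 1) for k, v in vif_all.items()})

# Dropping ONE variable of the collinear pair is enough here.
keep = [0, 2]  # altitude and soil_ph
vif_reduced = dict(
    zip([names[i] for i in keep], variance_inflation_factors(X[:, keep]))
)
print("temperature dropped:", {k: round(float(v), 1) for k, v in vif_reduced.items()})
```

Both members of the collinear pair exceed the cutoff of 10 in the first run, yet removing only one of them brings every remaining VIF close to 1. Removing both would have discarded a predictor for no reason.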


There are many precautions that must be taken in model selection and in the verification of multiple regression assumptions, and here I have only sought to highlight some aspects that I consider crucial. However, before starting any routine analysis, we must be clear that neither scientific research nor knowledge production can happen without an interesting guiding question and an appropriate sampling design; without these, no analysis is justified, even if it is the most powerful and robust one. I believe that the points highlighted in this manuscript will help readers to have more clarity on important points in the data analysis process and, therefore, contribute to making studies in Vegetation Ecology more competitive in the international scientific scenario.

Acknowledgments

I thank the two anonymous reviewers for their valuable contributions. I am especially


Kupfer JA, Farris CA. 2007. Incorporating spatial non-stationarity of regression coefficients into predictive vegetation models. Landscape Ecology 22:837-852.

Legendre P, Fortin M-J. 1989. Spatial pattern and ecological analysis. Vegetatio 80:107-138.

Myers RH. 1986. Classical and Modern Regression with Applications. Duxbury Press, Boston.

Oliveira-Filho AT, Fontes MAL. 2000. Patterns of floristic differentiation among Atlantic Forests in southeastern Brazil and the influence of climate. Biotropica 32:793-810.

Oliveira-Filho AT, Jarenkow JA, Rodal MJN. 2006. Floristic relationships of seasonally dry forests of eastern South America based on tree species distribution patterns. In Neotropical savannas and dry forests: plant diversity, biogeography and conservation (RT Pennington, JA Ratter, GP Lewis, eds.). CRC Press, Boca Raton, p.159-192.


Table 1. Steps followed in the text, their goals and common pitfalls.

Step | Aims | Common pitfalls
1 | Reducing dimensionality of predictors. | 1.1 Select, as predictors, the PCA axes that synthesize very distinctly different variables.
2 | Obtaining a useful model. | 2.1 Select all possible filters in order to eliminate the spatial structure in the residuals of the model.
3 | Verifying the assumptions. | 3.1 Consider that the assumption of spatial independence was violated after finding significant spatial structures only in the response variable.
  |  | 3.2 Do not check again the assumptions after transforming the data.
  |  | 3.3 Eliminate all predictor variables with Variance Inflation Factor (VIF) above the established cut-off.