Model selection for genetic and epidemiological data

Thursday 29th March 2012

International Biometric Society - British and Irish Region (IBS-BIR) Spring meeting and LSHTM Centre for Statistical Methodology Meeting

Title: Model selection for genetic and epidemiological data

Date: 29 March 2012, 1:30PM-5PM

Location: Manson Lecture theatre, LSHTM

Cost and Registration: £20 for International Biometric Society - British and Irish Region members, £40 for non-members, and free for student members (payment by PayPal, or by cheque on site). Note that it is free for students to join the Biometric Society, and full membership costs £40.


13:30 - 14:15 Stijn Vansteelandt, Ghent University and LSHTM: Challenges for model selection in etiologic studies

Over the past 3 decades, enormous progress has been made in understanding and relaxing the conditions under which causal inferences can be drawn from observational studies. Most available procedures assume that a set of covariates is available which is sufficient – in some sense – to adjust for confounding of the association between exposure and outcome. Because this set may be high-dimensional, some reduction is often necessary in samples of typical size. Interestingly, this important and widespread problem has been largely ignored in the causal inference literature.

In this talk, I will reflect on the challenges for model/covariate selection in etiologic studies. I will argue that routinely applied variable selection procedures – while potentially relevant for the construction of outcome prediction models – are sub-optimal for selecting covariates in causal analyses, in view of which I will propose a procedure directly targeting the quality of the exposure effect estimator. I will discuss the roles of causal inference procedures based on outcome regression models versus propensity score models. It will be found that certain strategies for inferring causal effects have the desirable features (a) of producing (approximately) valid confidence intervals, even when the covariate-selection process is ignored, and (b) of being robust against certain forms of misspecification of the association of covariates with both exposure and outcome. I will conclude with possible directions for future research in this area.

14:15 - 15:00 Christian Robert, Université Paris-Dauphine: ABC model choice and relevant summary statistics

Approximate Bayesian computation (ABC) has become an essential tool for the analysis of complex stochastic models. Having implemented ABC-based model choice in a wide range of phylogenetic models in the DIYABC software (Cornuet et al., 2008), we first present theoretical background as to why a generic use of ABC for model choice is ungrounded, since it depends on an unknown amount of information loss induced by the use of a summary statistic (Robert et al., 2011). We then present necessary and sufficient conditions on the summary statistics for an ABC-based model choice procedure to be consistent, a solution that avoids the need for additional empirical verification of the performance of the ABC procedure, such as that available in DIYABC and advocated in Ratmann et al. (2011).
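For readers unfamiliar with the technique, the following is a minimal sketch of rejection-based ABC model choice, not the procedure implemented in DIYABC or analysed in the talk. The two toy models (Gaussian versus Laplace, both with unit variance), the priors, the function names and the tuning values are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(model, theta, n, rng):
    """Draw n observations with location theta from a toy model (assumed)."""
    if model == 0:
        # Model 0: Gaussian with unit variance.
        return [rng.gauss(theta, 1.0) for _ in range(n)]
    # Model 1: Laplace with unit variance, built as the difference of two
    # exponentials with rate sqrt(2) (scale 1/sqrt(2), variance 2 * scale**2).
    rate = 2.0 ** 0.5
    return [theta + rng.expovariate(rate) - rng.expovariate(rate)
            for _ in range(n)]


def abc_model_choice(observed, summary, n_sims=5000, keep=100, seed=0):
    """Approximate posterior model probabilities given the summary only."""
    rng = random.Random(seed)
    s_obs = summary(observed)
    draws = []
    for _ in range(n_sims):
        model = rng.randrange(2)      # uniform prior on the model index
        theta = rng.gauss(0.0, 2.0)   # assumed prior on theta, both models
        s_sim = summary(simulate(model, theta, len(observed), rng))
        dist = sum((a - b) ** 2 for a, b in zip(s_sim, s_obs))
        draws.append((dist, model))
    draws.sort()                      # closest simulated summaries first
    accepted = [m for _, m in draws[:keep]]
    p1 = sum(accepted) / keep
    return {0: 1.0 - p1, 1: p1}
```

The returned frequencies approximate the model probabilities conditional on the chosen summary, not on the full data. Passing, say, `lambda x: (statistics.fmean(x), statistics.pvariance(x))` as `summary` gives a statistic on which the two toy models barely differ, which is the kind of information loss the abstract refers to.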

Note: this is joint work with J.M. Cornuet, J.M. Marin, N. Pillai and J. Rousseau.


15:00 - 15:30 Tea/Coffee break
15:30 - 16:15 Doug Speed, University College London: Improved Heritability Estimation using Linear Mixed Models

There is continued discussion regarding the so-called "missing heritability" problem. By applying a linear mixed model to whole-genome SNP data, a series of papers headed by Yang et al. has presented strong evidence that many complex traits are highly polygenic, so that while common variants can explain most of the heritability, each on average makes such a small contribution that its detection by a standard-size GWAS is almost impossible. We have investigated use of the linear mixed model for heritability estimation, finding that it is highly sensitive to the correlations induced by linkage disequilibrium (LD). In particular, it will struggle to pick up the variance explained by rarer variants, even if typed, as their signals will on average be more poorly represented by the SNP array. We have devised a solution to this problem, allowing unbiased estimates of heritability in spite of LD. Using our revised method, we have been able to show that almost all of the heritability for epilepsy can be explained by common SNPs, suggesting that effective prediction models should be possible.

At first glance, the linear mixed model, which assumes a polygenic model where every SNP contributes to the phenotype, would appear completely unsuited to model selection, where we hope to find the relatively small number of SNPs which have the (most) influence on the outcome. But this is not the case, as a subsequent paper involving Yang proceeded to demonstrate by partitioning the genome and assessing the heritability explained by each region. For example, they were able to examine the contributions of individual chromosomes or test whether the causal variants were most likely to be intergenic or intragenic. We have extended their analysis, showing how localised heritability estimates can be used to guide model selection algorithms. Additionally, we have used the mixed model to develop tests of homogeneity across traits. For example, demonstrating a concordance between epilepsy and other neurological conditions provides justification for combining studies to improve detection power.

Common SNPs explain a large proportion of the heritability for human height; J. Yang, P. Visscher et al., Nature Genetics 2010

Genome partitioning of genetic variation for complex traits using common SNPs; J. Yang, P. Visscher et al., Nature Genetics 2011


16:15 - 17:00 David Clayton, University of Cambridge: Link functions in multi-locus genetic models

"Complex" diseases are, by definition, influenced by multiple causes, both genetic and environmental, and statistical work on the joint action of multiple risk factors has, for more than 40 years, been dominated by the generalized linear model. In genetics, models for dichotomous traits have traditionally been approached via the model of an underlying, normally distributed, liability. This corresponds to the generalized linear model with binomial errors and a probit link function. Elsewhere in epidemiology, however, the logistic regression model, a GLM with logit link function, has been the tool of choice, largely because of its convenient properties in case-control studies.

The choice of link function has usually been dictated by mathematical convenience, but it has some important implications for (a) the choice of association test statistic in the presence of existing strong risk factors, (b) the ability to predict disease from genotype given its heritability, and (c) the definition and interpretation of epistasis (or epistacy). I will review these issues and propose a new association test.
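The correspondence between the liability model and the probit link can be checked numerically. The sketch below is illustrative only and is not the speaker's method: the function names, the threshold of zero and the effect sizes are assumptions. It also shows that two risk factors acting additively on the probit scale produce a non-zero interaction contrast on the logit scale.

```python
import math
import random


def probit_risk(eta):
    """Inverse probit link: standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))


def liability_risk(eta, n=200_000, seed=0):
    """Monte Carlo risk under a liability model (assumed threshold of zero):
    disease occurs when eta plus standard normal noise exceeds zero."""
    rng = random.Random(seed)
    cases = sum(eta + rng.gauss(0.0, 1.0) > 0.0 for _ in range(n))
    return cases / n


def logit_interaction(base, effect_a, effect_b):
    """Interaction contrast on the logit scale for risks that are exactly
    additive (no interaction) on the probit scale."""
    p00 = probit_risk(base)
    p10 = probit_risk(base + effect_a)
    p01 = probit_risk(base + effect_b)
    p11 = probit_risk(base + effect_a + effect_b)
    return logit(p11) - logit(p10) - logit(p01) + logit(p00)
```

For example, `liability_risk(0.5)` should agree with `probit_risk(0.5)` up to Monte Carlo error, while `logit_interaction(-2.0, 1.0, 1.0)` is non-zero even though the same risks show no interaction on the probit scale, so whether epistasis is judged present depends on the link.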

