Tuesday, 26 December 2017

Intuitive explanations of statistical concepts for novices #4

The p-value is widely used but widely misunderstood. I'll demonstrate this in the context of intervention studies. The key question is how confident we can be that an apparently beneficial effect of treatment reflects a change due to the intervention, rather than arising just through the play of chance. The p-value gives one way of deciding that. There are other approaches, including those based on Bayesian statistics, which are preferred by many statisticians. But I will focus here on the traditional null hypothesis significance testing (NHST) approach, which dominates statistical reporting in many areas of science, and which uses p-values.

As illustrated in my previous blogpost, where our measures include random noise, the distorting effects of chance mean that we can never be certain whether or not a particular pattern of data reflects a real difference between groups. However, we can compute the probability of obtaining data like ours if there were no effect of intervention.

There are two ways to do this. One way is by simulation. If you repeatedly run the kind of simulation described in my previous blogpost, specifying no mean difference between groups, each time taking a new sample, for each result you can compute a standardized effect size. Cohen's d is the mean difference between groups expressed in standard deviation units, which can be computed by subtracting the group A mean from the group B mean, and dividing by the pooled standard deviation (i.e. the square root of the average of the variances for the two groups). You then see how often the simulated data give an effect size at least as large as the one observed in your experiment.
 Figure 1: Histograms of effect sizes obtained by repeatedly sampling from a population where there is no difference between groups*
Figure 1 shows the distribution of effect sizes for two different studies: the first has 10 participants per group, and the second has 80 per group. For each study, 10,000 simulations were run; on each run, a fresh sample was taken from the population, and the standardized effect size, d, computed for that run. The peak of each distribution is at zero: we expect this, as we are simulating the case of no real difference between groups – the null hypothesis. But note that, though the shape of the distribution is the same for both studies, the scale on the x-axis covers a broader range for the study with 10 per group than the study with 80 per group. This relates to the phenomenon shown in Figure 5 of the previous blogpost, whereby estimates of group means jump around much more when there is a small sample.
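The simulation just described can be sketched in a few lines of R. This is a minimal sketch, not the script linked in the footnote: the helper name simulate_null_d and the seed are assumptions made for illustration, and the cutoffs are computed with quantile rather than drawn on a histogram.

```r
# Simulate Cohen's d when there is no true difference between groups.
# simulate_null_d is a hypothetical helper name, not taken from the original script.
simulate_null_d <- function(n_per_group, n_sims = 10000) {
  replicate(n_sims, {
    a <- rnorm(n_per_group, mean = 0, sd = 1)  # group A
    b <- rnorm(n_per_group, mean = 0, sd = 1)  # group B, drawn from the same population
    pooled_sd <- sqrt((var(a) + var(b)) / 2)   # square root of the average variance
    (mean(b) - mean(a)) / pooled_sd            # Cohen's d for this run
  })
}

set.seed(1)  # arbitrary seed, only so that the run can be repeated
d_small <- simulate_null_d(10)
d_large <- simulate_null_d(80)

# Cutoffs for the top 5%, 1% and 0.1% of simulated effect sizes
cutoffs <- rbind(
  n10 = quantile(d_small, c(0.95, 0.99, 0.999)),
  n80 = quantile(d_large, c(0.95, 0.99, 0.999))
)
round(cutoffs, 2)
```

Every cutoff should be further from zero for the study with 10 per group, which is the broader x-axis range noted above.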

The dotted red lines show the cutoff points that identify the top 5%, 1% and 0.1% of the effect sizes. Suppose we ran a study with 10 people per group and it gave a standardized effect size of 0.3. We can see from the figure that a value in this range is fairly common when there is no real effect: around 25% of the simulations gave an effect size of at least 0.3. However, if our study had 80 people per group, then the simulation tells us this is an improbable result to get if there really is no effect of intervention: only 2.7% of simulations yield an effect size as big as this.

The p-value is the probability of obtaining a result at least as extreme as the one that is observed, if there really is no difference between groups. So for the study with N = 80, p = .027. Conventionally, a level of p < .05 has been regarded as 'statistically significant', but this is entirely arbitrary. There is an inevitable trade-off between false positives (type I errors) and false negatives (type II errors). If it is very important to avoid false positives, and you do not mind sometimes missing a true effect, then a stringent p-value is desirable. If, however, you do not want to miss any finding of potential interest, even if it turns out to be a false positive, then you could adopt a more lenient criterion.
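As a rough illustration of that definition, a simulated p-value is just the proportion of null simulations that give an effect size at least as large as the observed one. The sketch below assumes the observed effect size of 0.3 from the example; the helper name simulated_p and the seed are hypothetical, and the exact proportions will differ slightly from run to run and from the figures quoted in the text.

```r
set.seed(2)        # arbitrary seed
observed_d <- 0.3  # effect size from the worked example in the text

# Proportion of null simulations giving an effect size at least as large as observed.
# simulated_p is a hypothetical helper name.
simulated_p <- function(n_per_group, observed_d, n_sims = 10000) {
  null_d <- replicate(n_sims, {
    a <- rnorm(n_per_group)  # group A, no intervention effect
    b <- rnorm(n_per_group)  # group B, same population
    (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)
  })
  mean(null_d >= observed_d)
}

p_n10 <- simulated_p(10, observed_d)
p_n80 <- simulated_p(80, observed_d)
c(n10 = p_n10, n80 = p_n80)
```

With 10 per group the proportion should come out at roughly a quarter; with 80 per group it should be only a few percent.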

The comparison between the two sample sizes in Figure 1 should make it clear that statistical significance is not the same thing as practical significance. Statistical significance simply tells us how improbable a given result would be if there was no true effect. The larger the sample size, the smaller the effect size that would be detected at a threshold such as p < .05. Small samples are generally a bad thing, because they only allow us to reliably detect very large effects. But very large samples have the opposite problem: they allow us to detect as 'significant' effects that are so small as to be trivial. The key point is that the researcher who is conducting an intervention study should start by considering how big an effect would be of practical interest, given the cost of implementing the intervention. For instance, you may decide that staff training and time spent on a vocabulary intervention would only be justified if it boosted children's vocabulary by at least 10 words. If you knew how variable children's scores were on the outcome measure, the sample size could then be determined so that the study has a good chance of detecting that effect while minimising false positives. I will say more about how to do that in a future post.
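One standard way to turn that reasoning into a sample size is the power.t.test function that comes with R. The sketch below is an assumption-laden illustration rather than the method of the promised future post: the standard deviation of 15 words and the 80% power target are invented values, and only the 10-word minimum effect comes from the example above.

```r
# Assumed values for illustration only: neither the SD nor the power target comes from the post
smallest_effect <- 10  # smallest vocabulary gain (in words) judged worth the cost
assumed_sd <- 15       # hypothetical SD of children's scores on the outcome measure

plan <- power.t.test(delta = smallest_effect, sd = assumed_sd,
                     sig.level = 0.05, power = 0.80,
                     type = "two.sample", alternative = "two.sided")
ceiling(plan$n)  # participants needed per group under these assumptions
```

If scores were more variable than assumed, the same call with a larger sd would return a larger sample size.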

I've demonstrated p-values using simulations in the hope that this will give some insight into how they are derived and what they mean. In practice, we would not normally derive p-values this way, as there are much simpler ways to do this, using statistical formulae. Provided that data are fairly normally distributed, we can use statistical approaches such as ANOVA, t-tests and linear regression to compute probabilities of observed results (see this blogpost). Simulations can, however, be useful in two situations. First, if you don't really understand how a statistic works, you can try running an analysis with simulated data. You can either simulate the null hypothesis by creating data from two groups that do not differ, or you can add a real effect of a given size to one group. Because you know exactly what effect size was used to create the simulated data, you can get a sense of whether particular statistics are sensitive enough to detect real effects, and how this might vary with sample size.
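A minimal sketch of this first use of simulation is shown below. It assumes a two-group t-test, a true effect of 0.5 SD and sample sizes of 20 and 80 per group; all of these are illustrative choices, and prop_significant is a hypothetical helper name.

```r
set.seed(3)  # arbitrary seed

# Proportion of simulated studies in which a t-test gives p < .05.
# prop_significant is a hypothetical helper name.
prop_significant <- function(n_per_group, effect, n_sims = 2000) {
  p_values <- replicate(n_sims, {
    a <- rnorm(n_per_group)           # control group
    b <- rnorm(n_per_group) + effect  # intervention group with a known true effect
    t.test(b, a, var.equal = TRUE)$p.value
  })
  mean(p_values < 0.05)
}

rate_null <- prop_significant(20, effect = 0)    # no true effect: false positive rate
rate_n20  <- prop_significant(20, effect = 0.5)  # true effect, small sample
rate_n80  <- prop_significant(80, effect = 0.5)  # true effect, larger sample
c(null = rate_null, n20 = rate_n20, n80 = rate_n80)
```

Because the true effect is known, the first value shows how often the test raises a false alarm, and the other two show how much more often a real effect is detected as the sample grows.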

The second use of simulations is for situations where the assumptions of statistical tests are not met – for instance, if data are not normally distributed, or if you are using a complex design that incorporates multiple interacting variables. If you can simulate a population of data that has the properties of your real data, you can then repeatedly sample from this and compute the probability of obtaining your observed result to get a direct estimate of a p-value, just as was done above.
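As a hedged illustration of this second use, the sketch below assumes that scores follow a skewed (exponential) distribution, that there are 15 participants per group, and that the observed mean difference was 0.5. All three values are hypothetical and stand in for the properties of your real data.

```r
set.seed(5)            # arbitrary seed
n_per_group <- 15      # hypothetical sample size
observed_diff <- 0.5   # hypothetical observed difference in group means

# Repeatedly sample two groups from the same skewed population (null hypothesis)
null_diffs <- replicate(10000, {
  a <- rexp(n_per_group, rate = 1)
  b <- rexp(n_per_group, rate = 1)
  mean(b) - mean(a)
})

# Direct estimate of the p-value: how often chance alone gives a difference this large
p_sim <- mean(null_diffs >= observed_diff)
p_sim
```

The same pattern applies to more complex designs: only the code that generates one simulated dataset and computes the statistic of interest needs to change.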

The key point to grasp about a p-value is that it tells you how likely your observed evidence is, if the null hypothesis is true. The most widely used threshold is .05: if the p-value in your study is less than .05, then the chance of data at least as extreme as yours arising when the intervention had no effect is less than 1 in 20. You may decide on that basis that it's worth implementing the intervention, or at least investing in the costs of doing further research on it.

The most common mistake is to think that the p-value tells you how likely the null hypothesis is given the evidence. But that is something else. The probability of A (observed data) given B (null hypothesis) is not the same as the probability of B (null hypothesis) given A (observed data). As I have argued in another blogpost, the probability that if you are a man you are a criminal is not high, but if you are a criminal, the probability that you are a man is much higher. This may seem fiendishly complicated, but a concrete example can help.

Suppose Bridget Jones has discovered three weight loss pills: if taken for a month, pill A is a totally ineffective placebo, pill B leads to a modest weight loss of 2 lb, and pill C leads to an average loss of 7 lb. We do studies with three groups of 20 people; in each group, half are given one of the pills (A, B or C) and the remainder are untreated controls. We discover that after a month, one of the treated groups has an average weight loss of 3 lb, whereas their control group has lost no weight at all. We don't know which pill this group received. If we run a statistical test, we find the p-value is .45. This means we cannot reject the null hypothesis of no effect – which is what we'd expect if this group had been given the placebo pill, A. But the result is also compatible with the participants having received pills B or C. This is demonstrated in Figure 2, which shows the probability density function for each scenario - in effect, the outline of the histogram. The red dotted line corresponds to our obtained result, and it is clear it is highly probable regardless of which pill was used. In short, this result doesn't tell us how likely the null hypothesis is – only that the null hypothesis is compatible with the evidence that we have.
 Figure 2: Probability density function for weight loss pills A, B and C, with red line showing observed result

Many statisticians and researchers have argued we should stop using p-values, or at least adopt more stringent levels of p. My view is that p-values can play a useful role in contexts such as the one I have simulated here, where you want to decide whether an intervention is worth adopting, provided you understand what they tell you. It is crucial to appreciate how dependent a p-value is on sample size, and to recognise that the information it provides is limited to telling you whether an observed difference could just be due to chance. In a later post I'll go on to discuss the most serious negative consequence of misunderstanding of p-values: the generation of false positive findings by the use of p-hacking.

*The R script to generate Figures 1 and 2 can be found here.

Thursday, 21 December 2017

Intuitive explanations of statistical concepts for novices #3

I'll be focusing here on the kinds of stats needed if you conduct an intervention study. Suppose we measured the number of words children could define on a 20-word vocabulary task. Words were selected so that at the start of training, none of the children knew any of them. At the end of 3 months of training, every child in the vocabulary training group (B) knew four words, whereas those in a control group (A) knew three words. If we had 10 children per group, the plot of final scores would look like Figure 1 panel 1.
 Figure 1. Fictional data to demonstrate concept of random error (noise)

In practice, intervention data never look like this. There is always unexplained variation in intervention outcomes, and real results look more like panel 2 or panel 3. That is, in each group, some children learn more than average and some less than average. Such fluctuations can reflect numerous sources of uncontrolled variation: for instance, random error will be large if we use unreliable measures, or there may be individual differences in responsiveness to intervention in the people included in the study, as well as things that can fluctuate from day to day or even moment to moment, such as people's mood, health, tiredness and so on.

The task for the researcher is to detect a signal – the effect of intervention – from noise – the random fluctuations. It is important to carefully select our measures and our participants to minimise noise, but we will never eliminate it entirely.

There are two key concepts behind all the statistics we do: (a) data will contain random noise, and (b) when we do a study we are sampling from a larger population. We can make these ideas more concrete through simulation.

The first step is to generate a large quantity of random numbers. Random numbers can be easily generated using the free software package R: if you have this installed, you can follow this demo by typing in the commands shown in italic at your console. R has a command, rnorm, that generates normally distributed random numbers. For instance:

rnorm(10,0,1)

will generate 10 z-scores, i.e. random numbers with mean of 0 and standard deviation of 1. You get new random numbers each time you submit the command (unless you explicitly set something known as the random number seed to be the same each time). Now let's use R to generate 100,000 random numbers, and plot the output in a histogram. Figure 2 can be generated with the commands:

myz = rnorm(100000,0,1)
hist(myz)

 Figure 2: Distribution of z-scores simulated with rnorm

This shows that numbers close to zero are most common, and the further we get from zero in either direction, the lower the frequency of the number. The bell-shaped curve is a normal distribution, which we get because we opted to generate random numbers following a normal distribution using rnorm. (Other distributions of random numbers are also possible; you can see some options here.)
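The random number seed mentioned earlier can be fixed with set.seed, which makes a simulation repeatable. The short sketch below uses an arbitrary seed value of 42 and also checks that the simulated z-scores have roughly the mean and standard deviation we asked for.

```r
set.seed(42)                 # any integer will do; 42 is an arbitrary choice
first <- rnorm(10, 0, 1)

set.seed(42)                 # resetting the seed replays the same random numbers
second <- rnorm(10, 0, 1)

identical(first, second)     # TRUE: same seed, same numbers

myz <- rnorm(100000, 0, 1)
c(mean = mean(myz), sd = sd(myz))  # close to 0 and 1, but not exactly
```

Without the set.seed lines, the two sets of ten numbers would differ.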

So you might be wondering what we do with this list of numbers. Well, we can simulate experimental data based on this population of numbers by specifying two things:
1. The sample size
2. The effect size – i.e., Cohen's d, the mean difference between groups in standard deviation (SD) units.

Suppose we want two groups, A and B, each with a sample size of 10, where group B has scores that are on average 1 SD larger than group A. First we select 20 values at random from myz:

mydata = sample(myz, 20)

Next we create a variable, mygroup, that labels each value with its group by combining ten repeats of 'A' with ten repeats of 'B':

mygroup = c(rep('A', 10), rep('B', 10))

Next we add the effect size, 1, to the last 10 numbers, i.e. those for group B:

mydata[11:20] = mydata[11:20] + 1

Now we can plot the individual points clustered by group. First install and activate the beeswarm package to make a nice plot format:

install.packages('beeswarm')
library(beeswarm)

Then you can make the plot with the command:

beeswarm(mydata ~ mygroup)

The resulting plot will look something like one of the graphs in Figure 3. It won't be exactly the same as any of them because your random sample will be different from the ones we have generated. In fact, this is one point of this exercise: to show you how numbers will vary from one occasion to another when you sample from a population.

If you just repeatedly run these lines, you will see how things vary just by chance:

mydata = sample(myz, 20)
mydata[11:20] = mydata[11:20] + 1
beeswarm(mydata ~ mygroup)

 Figure 3: Nine runs of simulated data from 2 groups: A comes from population with mean score of 0 and B from population with mean score of 1
Note how in Figure 3, the difference between groups A and B is far more marked in runs 7 and 9 than in runs 4 and 6, even though each dataset was generated by the same script. This is what is meant by the 'play of chance' affecting experimental data.
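To put numbers on this variation without plotting, the same steps can be wrapped in a loop that records the observed difference between group means on each run. This is a minimal sketch: the seed and the name mean_diffs are assumptions, and the nine values will not match the runs shown in Figure 3.

```r
set.seed(3)  # arbitrary seed
myz <- rnorm(100000, 0, 1)
mygroup <- c(rep('A', 10), rep('B', 10))

mean_diffs <- numeric(9)  # observed B minus A difference for each of nine runs
for (run in 1:9) {
  mydata <- sample(myz, 20)
  mydata[11:20] <- mydata[11:20] + 1           # true effect of 1 SD for group B
  group_means <- tapply(mydata, mygroup, mean)
  mean_diffs[run] <- group_means['B'] - group_means['A']
}
round(mean_diffs, 2)
```

The true difference is 1 in every run, yet the observed differences scatter around that value purely by chance.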

Now let's look at Figure 4, which gives output from another nine runs of a simulation. This time, some runs were set so that there was a true effect of intervention (by adding .6 to values for group B) and some were set with no difference between groups. Can you tell which simulations were based on a real effect?

 Figure 4: Some of these runs were generated with effect size of .6, others had no difference between A and B
The answer is that runs 1, 2, 4, 8 and 9 came from runs where there was a real effect of .6 (which, by the standard of most intervention studies, is a large effect). You may have identified some of these runs correctly, but you may also have falsely selected run 3 as showing an effect. This would be a false positive, where we wrongly conclude there is an intervention effect when the apparent superiority of the intervention group is just down to chance. This type of error is known as a type I error. Run 2 looks like a false negative – we are likely to conclude there is no effect of intervention, when in fact there was one. This is a type II error. One way to remember this distinction is that a type I error is when you think you've won (1) but you haven't.

The importance of sample size
Figures 3 and 4 demonstrate that, when inspecting data from intervention trials, you can't just rely on the data in front of your eyes. Sometimes, they will suggest a real effect when the data are really random (type I error) and sometimes they will fail to reveal a difference when the intervention is really effective (type II error). These anomalies arise because data incorporate random noise, which can generate spurious effects or mask real effects. This masking is particularly problematic when samples are small.

Figure 5 shows two sets of data: the top and bottom panels were derived from the same simulation, the only difference being the sample size: 10 per group in the top panels, and 80 per group in the bottom panels. In both cases, the simulation specified that group B scores were drawn from a population that had higher scores than group A, with an effect size of 0.6. The bold line shows the group average. The figure shows that the larger the sample, the more closely the results from the sample agree with the population from which it was drawn.
 Figure 5: Five runs of simulation where true effect size = .6

When samples are small, estimates of the means will jump around much more than when samples are large. Note, in particular, that with the small sample size, on the third run, the mean difference between A and B is overestimated by about 50%, whereas in the fourth run, the mean for B is very close to that for A.

In the population from which these samples are taken the mean difference between A and B is 0.6, but if we just take a sample from this population, by chance we may select atypical cases, and these will have a much larger impact on the observed mean when the sample is small.
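The size of this 'jumping around' can be estimated by repeating the simulation many times and looking at the spread of the observed mean differences. The sketch below assumes the same true effect of 0.6 and the two sample sizes from Figure 5; the helper name sim_mean_diff, the seed and the number of runs are illustrative choices.

```r
set.seed(8)  # arbitrary seed

# Observed B minus A mean difference over many simulated studies.
# sim_mean_diff is a hypothetical helper name.
sim_mean_diff <- function(n_per_group, effect = 0.6, n_sims = 5000) {
  replicate(n_sims, {
    a <- rnorm(n_per_group, mean = 0, sd = 1)
    b <- rnorm(n_per_group, mean = effect, sd = 1)
    mean(b) - mean(a)
  })
}

diff_n10 <- sim_mean_diff(10)
diff_n80 <- sim_mean_diff(80)

rbind(n10 = c(mean = mean(diff_n10), sd = sd(diff_n10)),
      n80 = c(mean = mean(diff_n80), sd = sd(diff_n80)))
```

Both sample sizes recover a mean difference close to 0.6 on average, but the spread of individual estimates is far wider with 10 per group than with 80.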

In my next post, I will show how we can build on these basic simulations to get an intuitive understanding of p-values.

P.S. Scripts for generating the figures in this post can be found here.

Monday, 27 November 2017

Reproducibility and phonics: necessary but not sufficient

Over a hotel breakfast at an unfeasibly early hour (I'm a clock mutant) I saw two things on Twitter that appeared totally unrelated but which captured my interest for similar reasons.

The two topics were the phonics wars and the reproducibility crisis. For those of you who don't work on children's reading, the idea of phonics wars may seem weird. But sadly, there we have it: those in charge of the education of young minds locked in battle over how to teach children to read. Andrew Old (@oldandrewuk), an exasperated teacher, sounded off this week about 'phonics denialists', who are vehemently opposed to phonics instruction, despite a mountain of evidence indicating this is an important aspect of teaching children to read. He analysed three particular arguments used to defend an anti-phonics stance. I won't summarise the whole piece, as you can read what Andrew says in his blogpost. Rather, I just want to note one of the points that struck a chord with me. It's the argument that 'There's more to phonics than just decoding'. As Andrew points out, those who say this want to imply that those who teach phonics don't want to do anything else.
'In this fantasy, phonics denialists are the only people saving children from 8 hours a day, sat in rows, being drilled in learning letter combinations from a chalkboard while being banned from seeing a book or an illustration.'
This is nonsense: see, for instance, this interview with my colleague Kate Nation, who explains how phonics knowledge is necessary but not sufficient for competent reading.

So what has this got to do with reproducibility in science? Well, another of my favourite colleagues, Dick Passingham, started a little discussion on Twitter - in response to a tweet about a Radiolab piece on replication. Dick is someone I enjoy listening to because he is a fount of intelligence and common sense, but on this occasion, what he said made me a tad irritated:

This has elements of the 'more to phonics than just decoding' style of argument. Of course scientists need to know more than how to make their research reproducible. They need to be able to explore, to develop new theories and to see how to interpret the unexpected. But it really isn't an either/or. Just as phonics is necessary but not sufficient for learning to read, so are reproducible practices necessary but not sufficient for doing good science. Just as phonics denialists depict phonics advocates as turning children into bored zombies who hate books, those trying to fix reproducibility problems are portrayed as wanting to suppress creative geniuses and turn the process of doing research into a tedious and mechanical exercise. The winds of change that are blowing through psychology won't stop researchers being creative, but they will force them to test their ideas more rigorously before going public.

For those, like Dick, who were trained to do rigorous science from the outset, the focus on reproducibility may seem like a distraction from the important stuff. But the incentive structure has changed dramatically in recent decades with the rewards favouring the over-hyped sensational result over the careful, thoughtful science that he favours. The result is an enormous amount of waste - of resources, of time and careers. So I'm not going to stop 'obsessing about the reproducibility crisis.' As I replied rather sourly to Dick:

Friday, 24 November 2017

Intuitive explanations of statistical concepts for novices #2

In my last post, I gave a brief explainer of what the term 'Analysis of variance' actually means – essentially you are comparing how much variation in a measure is associated with a group effect and how much with within-group variation.

The use of t-tests and ANOVA by psychologists is something of a historical artefact. These methods have been taught to generations of researchers in their basic statistics training, and they do the business for many basic experimental designs. Many statisticians, however, prefer variants of regression analysis. The point of this post is to explain that, if you are just comparing two groups, all three methods – ANOVA, t-test and linear regression – are equivalent. None of this is new but it is often confusing to beginners.

Anyone learning basic statistics probably started out with the t-test. This is a simple way of comparing the means of two groups, and, just like ANOVA, it looks at how big that mean difference is relative to the variation within the groups. You can't conclude anything by knowing that group A has a mean score of 40 and group B has a mean score of 44. You need to know how much overlap there is in the scores of people in the two groups, and that is related to how variable they are. If scores in group A range from 38 to 42 and those in group B range from 43 to 45 we have a massive difference with no overlap between groups – and we don't really need to do any statistics! But if group A ranges from 20 to 60 and group B ranges from 25 to 65, then a 4-point difference in means is not going to excite us. The t-test gives a statistic that reflects how big the mean difference is relative to the within-group variation. What many people don't realise is that the t-test is computationally equivalent to the ANOVA. If you square the value of t from a t-test, you get the F-ratio*.
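The relationship between t and F is easy to check with simulated data. The sketch below is not the script linked later in this post: the group means of 40 and 44 come from the example above, while the within-group SD of 10, the group size of 20 and the seed are assumptions. No correction for unequal variances is applied (var.equal = TRUE), which is the case in which the match is exact.

```r
set.seed(4)  # arbitrary seed
group <- factor(rep(c('A', 'B'), each = 20))
score <- c(rnorm(20, mean = 40, sd = 10),   # group A
           rnorm(20, mean = 44, sd = 10))   # group B

# t from a t-test with no adjustment for unequal variances
t_value <- unname(t.test(score ~ group, var.equal = TRUE)$statistic)

# F from a one-way ANOVA on the same data
f_value <- summary(aov(score ~ group))[[1]][['F value']][1]

c(t_squared = t_value^2, F = f_value)
all.equal(t_value^2, f_value)  # TRUE
```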

 Figure 1: Simulated data from experiments A, B, and C.  Mean differences for two intervention groups are the same in all three experiments, but within-group variance differs

Now let's look at regression. Consider Figure 1. This is similar to the figure from my last post, showing three experiments with similar mean differences between groups, but very different within-group variance. These could be, for instance, scores out of 80 on a vocabulary test. Regression analysis focuses on the slope of the line between the two means, shown in black, which is referred to as b. If you've learned about regression, you'll probably have been taught about it in the context of two continuous variables, X and Y, where the slope, b, tells you how much change there is in Y for every unit change in X. But if we have just two groups (coded as 0 and 1), b is equivalent to the difference in means.

So, how can it be that regression is equivalent to ANOVA, if the slopes are the same for A, B and C? The answer is that, just as illustrated above, we can't interpret b unless we know about the variation within each group. Typically, when you run a regression analysis, the output includes a t-value that is derived by dividing b by a measure known as the standard error, which is an index of the variation within groups.
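These relationships can be verified directly with lm. The sketch below uses simulated data with assumed values (20 per group, means of 40 and 44, within-group SD of 10); the variable names are illustrative only.

```r
set.seed(10)  # arbitrary seed
group <- factor(rep(c('A', 'B'), each = 20))
score <- c(rnorm(20, mean = 40, sd = 10), rnorm(20, mean = 44, sd = 10))

fit <- lm(score ~ group)
coefs <- summary(fit)$coefficients
b <- coefs['groupB', 'Estimate']                  # slope for group
se_b <- coefs['groupB', 'Std. Error']             # its standard error
t_from_regression <- coefs['groupB', 't value']

mean_difference <- mean(score[group == 'B']) - mean(score[group == 'A'])
t_from_ttest <- unname(t.test(score[group == 'B'], score[group == 'A'],
                              var.equal = TRUE)$statistic)

all.equal(b, mean_difference)               # slope equals the difference in means
all.equal(b / se_b, t_from_regression)      # t is b divided by its standard error
all.equal(t_from_regression, t_from_ttest)  # and matches the t-test
```

Each comparison should return TRUE, provided the t-test is run without a correction for unequal variances.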

An alternative way to show how it works is to transform data from the three experiments to be on the same scale, in a way that takes into account the within-group variation. We achieve this by transforming the data into z-scores. All three experiments now have the same overall mean (0) and standard deviation (1). Figure 2 shows the transformed data – and you see that after the data have been rescaled in this way, the y-axis now ranges from -3 to +3, and the slope is considerably larger for Experiment C than Experiment A. The slope for z-transformed data is known as beta, or the standardized regression coefficient.
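The effect of the z-transformation on the slope can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of Figures 1 and 2: the within-group spreads (10, 5 and 2) and the 4-point mean difference are invented, and the noise is centred within each group so that the raw slope is identical in all three experiments.

```r
set.seed(11)  # arbitrary seed
group <- rep(c(0, 1), each = 20)  # 0 = group A, 1 = group B
raw_noise <- rnorm(40)
# Centre the noise within each group so every experiment has exactly the same mean difference
noise <- ave(raw_noise, group, FUN = function(x) x - mean(x))

within_sds <- c(A = 10, B = 5, C = 2)  # hypothetical within-group spreads
slopes <- sapply(within_sds, function(s) {
  score <- 40 + 4 * group + s * noise  # same 4-point mean difference in every experiment
  z <- as.numeric(scale(score))        # z-scores: overall mean 0, SD 1
  c(b = unname(coef(lm(score ~ group))[2]),
    slope_z = unname(coef(lm(z ~ group))[2]))
})
round(slopes, 2)
```

The row labelled b is the same for A, B and C, whereas the slope for the z-transformed scores grows as the within-group variation shrinks.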

 Figure 2: Same data as from Figure 1, converted to z-scores

The goal of this blogpost is to give an intuitive understanding of the relationship between ANOVA, t-tests and regression, so I am avoiding algebra as far as possible. The key point is that when you are comparing two groups, t and F are different ways of representing the ratio between variation between groups and variation within groups, and t can be converted into F by simply squaring the value. You can derive t from linear regression by dividing the b or beta by its standard error - and this is automatically done by most stats programmes. If you are nerdy enough to want to use algebra to transform beta into F, or to see how Figures 1 and 2 were created, see the script Rftest_with_t_and_b.r here.

How do you choose which statistics to do? For a simple two-group comparison it really doesn't matter and you may prefer to use the method that is most likely to be familiar to your readers. The t-test has the advantage of being well-known – and most stats packages also allow you to make an adjustment to the t-value which is useful if the variances in your two groups are different. The main advantage of ANOVA is that it works when you have more than two groups. Regression is even more flexible, and can be extended in numerous ways, which is why it is often preferred.

Further explanations can be found here:

*It might not be exactly the same if your software does an adjustment for unequal variances between groups, but it should be close. It is identical if no correction is done.

Monday, 20 November 2017

Intuitive explanations of statistical concepts for novices #1

Lots of people use Analysis of Variance (Anova) without really understanding how it works, so I thought I'd have a go at explaining the basics in an intuitive fashion.

Consider three experiments, A, B and C, each of which compares the impact of an intervention on an outcome measure. The three experiments each have 20 people in a control group and 20 in an intervention group. Figure 1 shows the individual scores on an outcome measure for the two groups as blobs, and the mean score for each group as a dotted black line.

 Figure 1: Simulated data from 3 intervention studies

In terms of average scores of control and intervention groups, the three experiments look very similar, with the intervention group about .4 to .5 points higher than the control group. But we can't interpret this difference without having an idea of how variable scores are in the two groups.

For experiment A, there is considerable variation within each group, which swamps the average difference between the groups. In contrast, for experiment C, the scores within each group are tightly packed. Experiment B is somewhere in between.

If you enter these data into a one-way Anova, with group as a between-subjects factor, you get out an F-ratio, which can then be evaluated in terms of a p-value which gives the probability of obtaining such an extreme result if there is really no impact of the intervention. As you will see, the F-ratios are very different for A, B, and C, even though the group mean differences are the same. And in terms of the conventional .05 level of significance, the result from experiment A is not significant, experiment C is significant at the .001 level, and experiment B shows a trend (p = .051).

So how is the F-ratio computed? It just involves computing a number that reflects the ratio between the variance of the means of the groups, and the average variance within each group. When we just have two groups, as here, the first value just reflects how far away the two group means are from the overall mean. This is the Between Groups term, which is just the Variance of the two means multiplied by the number in each group (20). That will be similar for A, B and C, because the means for the two groups are similar and the numbers in each group are the same.

But the Within Groups term will differ substantially for A, B, and C, because it is computed as the average variance for the two groups. The F-ratio is obtained by just dividing the between groups term by the within groups term. If the within groups term is big, F is small, and vice versa.
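The calculation described above can be done by hand and checked against R's aov function. This is a minimal sketch with simulated data, assuming 20 people per group, a true difference of 0.5 and a within-group SD of 1; it is separate from the script that generated Figure 1.

```r
set.seed(12)  # arbitrary seed
n <- 20
control <- rnorm(n, mean = 0, sd = 1)
intervention <- rnorm(n, mean = 0.5, sd = 1)

# Between Groups term: variance of the two group means, multiplied by the number per group
between_groups <- n * var(c(mean(control), mean(intervention)))

# Within Groups term: average of the two within-group variances
within_groups <- mean(c(var(control), var(intervention)))

f_by_hand <- between_groups / within_groups

# The same F-ratio from a one-way Anova
score <- c(control, intervention)
group <- factor(rep(c('control', 'intervention'), each = n))
f_from_aov <- summary(aov(score ~ group))[[1]][['F value']][1]

all.equal(f_by_hand, f_from_aov)  # TRUE when group sizes are equal
```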

The R script used to generate Figure 1 can be found here: https://github.com/oscci/intervention/blob/master/Rftest.R

PS. 20/11/2017. Thanks to Jan Vanhove for providing code to show means rather than medians in Fig 1.

Friday, 3 November 2017

Prisons, developmental language disorder, and base rates

There's been some interesting discussion on Twitter about the high rate of developmental language disorder (DLD) in the prison population. Some studies give an estimate as high as 50 percent (Anderson et al, 2016), and this has prompted calls for speech-language therapy services to be involved in working with offenders. Work by Pam Snow and others has documented the difficulties of navigating the justice system if your understanding and ability to express yourself are limited.

This is important work, but I have worried from time to time about the potential for misunderstanding. In particular, if you are a parent of a child with DLD, should you be alarmed at the prospect that your offspring will be incarcerated? So I wanted to give a brief explainer that offers some reassurance.

The simplest way to explain it is to think about gender. I've been delving into the latest national statistics for this post, and found that the UK prison population this year contained 82,314 men, but a mere 4,013 women. That's a staggering difference, but we don't conclude that because most criminals are men, therefore most men are criminals. This is because we have to take into account base rates: the proportion of the general population who are in prison. Another set of government statistics estimates the UK population as around 64.6 million, about half of whom are male, and 81% are adults. So a relatively small proportion of the adult population is in prison, and non-criminal men vastly outnumber criminal men.

I did similar sums for DLD, using data from Norbury et al (2016) to estimate a population prevalence of 7% in adult males, and plugging in that relatively high figure of 50% of prisoners with DLD. The figures look like this.

 Numbers (in thousands) assuming 7% prevalence of DLD and 50% DLD in prisoners*
As you can see, according to this scenario, the probability of going to prison is much greater for those with DLD than for those without DLD (2.24% DLD vs 0.17% without DLD), but the absolute probability is still very low – 98% of those with DLD will not be incarcerated.
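The arithmetic behind these percentages can be sketched from the figures quoted above. This is not the script linked in the footnote, and it assumes that all male prisoners are counted against the adult male population; the variable names are illustrative.

```r
uk_population <- 64.6e6
adult_males <- uk_population * 0.5 * 0.81  # about half male, 81% adults
male_prisoners <- 82314

dld_prevalence <- 0.07  # assumed population prevalence of DLD in adult males
dld_in_prison <- 0.50   # assumed proportion of prisoners with DLD

dld_total <- adult_males * dld_prevalence
dld_prisoners <- male_prisoners * dld_in_prison

# Probability of being in prison, with and without DLD
p_prison_given_dld <- dld_prisoners / dld_total
p_prison_given_no_dld <- (male_prisoners - dld_prisoners) / (adult_males - dld_total)

round(100 * c(with_dld = p_prison_given_dld, without_dld = p_prison_given_no_dld), 2)
```

The probability of DLD given prison is set at 50%, yet the probability of prison given DLD comes out at a little over 2%, in line with the figures in the text (small differences reflect rounding).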

The so-called base rate fallacy is a common error in logical reasoning. It seems natural to conclude that if A is associated with B, then B must be associated with A. Statistically, that is true, but if A is much rarer than B, then the probability of A given B can be considerably less than the probability of B given A. Here, incarceration is far rarer than DLD, so the probability of prison given DLD is much lower than the probability of DLD given prison.

So I don't think we need to seek explanations for the apparent inconsistency that's being flagged up on Twitter between rates of incarceration in studies of those with DLD and rates of DLD in those who are incarcerated. It could just be the consequence of the low base rate of incarceration.

References
Anderson et al (2016) Language impairments among youth offenders: A systematic review. Children and Youth Services Review, 65, 195-203.

Norbury, C. F., et al. (2016). The impact of nonverbal ability on prevalence and clinical presentation of language disorder: evidence from a population study. Journal of Child Psychology and Psychiatry, 57, 1247-1257.

*An R script for generating this figure can be found here.

Postscript - 4th November 2017
The Twitter discussion has continued and drawn attention to further sources of information on rates of language and related problems in prison populations. Happy to add these here if people can send sources:

Talbot, J. (2008). No One Knows: Report and Final Recommendations. Report by Prison Reform Trust.

House of Commons Justice Committee (2016) The Treatment of Young Adults in the Criminal Justice System. Report HC 169.

Tuesday, 17 October 2017

Citing the research literature: the distorting lens of memory

 Corticogenesis: younger neurons migrate past older ones using radial glia as a scaffolding. Figure from https://en.wikipedia.org/wiki/Neural_development#/media/File:Corticogenesis_in_a_wild-type_mouse.png
"Billy was a likable twelve-year old boy whose major areas of difficulty were described by his parents as follows: 1) marked difficulty in reading and retaining what he read; 2) some trouble with arithmetic; 3) extreme slowness in completing homework with writing and spelling of poor quality; 4) slowness in learning to tell time (learned only during the past year); 5) lapses of attention with staring into space; 6) "dizzy spells" with "blackouts"; 7) recurring left frontal headaches always centering around and behind the left eye; 8) occasional enuresis until recently; 9) disinterest in work; 10) sudden inappropriate temper outbursts which were often violent; 11) enjoyment of irritating people; and 12) tendency to cry readily." Drake (1968), p. 488

Poor Billy would have been long forgotten, were it not for the fact that he died suddenly shortly after he had undergone extensive assessment for his specific learning difficulties. An autopsy found that death was due to a brain haemorrhage caused by an angioma in the cerebellum, but the neuropathologist also remarked on some unusual features elsewhere in his brain:

"In the cerebral hemispheres, anomalies were noted in the convolutional pattern of the parietal lobe bilaterally. The cortical pattern was disrupted by penetrating deep gyri that appeared disconnected. Related areas of the corpus callosum appeared thin (Figure 2). Microscopic examination revealed the cause of the hemorrhage to be a cerebellar angioma of the type known as capillary telangiectases (Figure 3). The cerebral cortex was more massive than normal, the lamination tended to be columnar, the nerve cells were spindle-shaped, and there were numerous ectopic neurons in the white matter that were not collected into distinct heterotopias (Figure 4)." p. 496*

I had tracked down this article in the course of writing a paper with colleagues on the neuronal migration account of dyslexia – a topic I have blogged about previously. The 'ectopic neurons' referred to by Drake are essentially misplaced neurons that, because of disruptions of very early development, have failed to migrate to their usual location in the brain.

I realised that my hazy memory of this paper was quite different from the reality: I had thought the location of the ectopic neurons was consistent with those reported in later post mortem studies by Galaburda and colleagues. In fact, Drake says nothing about their location, other than that it is in white matter – which contrasts with the later reports.

This made me curious to see how this work had been reported by others. This was not a comprehensive exercise: I did this by identifying from Web of Science all papers that cited Drake's article, and then checking what they said about the results if I could locate an online version of the article easily. Here's what I found:

Out of a total of 45 papers, 18 were excluded: they were behind a paywall or not readily traceable online, or (1 case) did not mention neuroanatomical findings. A further 10 papers included the Drake study in a bunch of references referring to neuroanatomical abnormalities in dyslexia, without singling out any specific results. Thus they were not inaccurate, but just vague.

The remaining 17 could be divided up as follows:

Seven papers gave a broadly accurate account of the neuroanatomical findings. The most detailed accurate account was by Galaburda et al (1985) who noted:

"Drake published neuropathological findings in a well-documented case of developmental dyslexia. He described a thinned corpus callosum particularly involving the parietal connections, abnormal cortical folding in the parietal regions, and, on microscopical examination, excessive numbers of neurons in the subcortical white matter. The illustrations provided did not show the parietal lobe, and the portion of the corpus callosum that could be seen appeared normal. No mention was made as to whether the anomalies were asymmetrically distributed." p. 227.

Four (three of them from the same research group) cited Drake as though there were two patients, rather than one, and focussed only on the corpus callosum, without mentioning ectopias.

Six gave an inaccurate account of the findings. The commonest error was to be specific about the location of the ectopias, which (as is clear from the Galaburda quote above) was not apparent in the text or figures of the original paper. Five of these articles located the ectopias in the left parietal lobe, one more generally in the parietal lobe, and one in the cerebellum (where the patient's stroke had been).

So, if we discount those available articles that just gave a rather general reference to Drake's study, over half of the remainder got some information wrong – and the bias was in the direction of making this early study consistent with later research.

The paper is hard to get hold of**, and when you do track it down, it is rather long-winded. It is largely concerned with the psychological evaluation of the patient, including aspects, such as Oedipal conflicts, that seem fanciful to modern eyes, and the organisation of material is not easy to follow. Perhaps it is not so surprising that people make errors when reporting the findings. But if nothing else, this exercise reminded me of the need to check sources when you cite them. It is all too easy to think you know what is in a paper – or to rely on someone else's summary. In fact, these days I am often dismayed to discover I have a false memory of what is in my own old papers, let alone those by other people. But once in the literature, errors can propagate, and we need to be vigilant to prevent a gradual process of distortion over time. It is all too easy to hurriedly read a secondary source or an abstract: we (and I include myself here) need to slow down.

References
Drake, W. E. (1968). Clinical and pathological findings in a child with a developmental learning disability. Journal of Learning Disabilities, 1(9), 486-502.
Galaburda, A. M., Sherman, G. F., Rosen, G. D., Aboitiz, F., & Geschwind, N. (1985). Developmental dyslexia: four consecutive cases with cortical anomalies. Annals of Neurology, 18, 222-233.

* I assume the figures are copyrighted so am not reproducing them here

Sunday, 1 October 2017

Pre-registration or replication: the need for new standards in neurogenetic studies

This morning I did a very mean thing. I saw an author announce to the world on Twitter that they had just published this paper, and I tweeted a critical comment. This does not make me happy, as I know just how proud and pleased one feels when a research project at last makes it into print, and to immediately pounce on it seems unkind. Furthermore, the flaws in the paper are not all that unusual: they characterise a large swathe of literature. And the amount of work that has gone into the paper is clearly humongous, with detailed analysis of white matter structural integrity that probably represents many months of effort. But that, in a sense, is the problem. We just keep on and on doing marvellously complex neuroimaging in contexts where the published studies are likely to contain unreliable results.

Why am I so sure that this is unreliable? Well, yesterday saw the publication of a review that I had led on, which was highly relevant to the topic of the paper – genetic variants affecting brain and behaviour. In our review we closely scrutinised 30 papers on this topic that had been published in top neuroscience journals. The field of genetics was badly burnt a couple of decades ago when it was discovered that study after study reported results that failed to replicate. These days, it's not possible to publish a genetic association in a genetics journal unless you show that the finding holds up in a replication sample. However, neuroscience hasn't caught up and seems largely unaware of why this is a problem.

The focus of this latest paper was on a genetic variant known as the COMT Val158Met SNP. People can have one of three versions of this genotype: Val/Val, Val/Met and Met/Met, but it's not uncommon for researchers to just distinguish people with Val/Val from Met carriers (Val/Met and Met/Met). This COMT polymorphism is one of the most-studied genetic variants in relation to human cognition, with claims of associations with all kinds of things: intelligence, memory, executive functions, emotion, response to anti-depressants, to name just a few. Few of these, however, have replicated, and there is reason to be dubious about the robustness of findings (Barnett, Scoriels & Munafo, 2008).

In this latest COMT paper – and many, many other papers in neurogenetics – the sample size is simply inadequate. There were 19 participants (12 males and 7 females) with the COMT Val/Val version of the variant, compared with 63 (27 males and 36 females) who had either Met/Met or Val/Met genotype. The authors reported that significant effects of genotype on corpus callosum structure were found in males only. As we noted in our review, effects of common genetic variants are typically very small. In this context, an effect size (standardized difference between means of two genotypes, Cohen's d) of 0.2 would be really large. Yet this study has power of .08 to detect such an effect in males – that is, if there really is a difference of 0.2 SDs between the two genotypes, and you repeatedly ran studies with this sample size, then you'd fail to see the effect in 92% of studies. To look at it another way, the true effect size would need to be enormous (around 1 SD difference between groups) to have an 80% chance of being detectable, given the sample size.
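To get a feel for where numbers like these come from, here is a rough Python sketch using a normal approximation to the two-sample test. It is not the calculation used for the review (an exact computation would use the noncentral t distribution and give slightly lower power); the function names are hypothetical, and the group sizes are the 12 Val/Val and 27 Met-carrier males mentioned above, with a two-tailed alpha of .05 assumed.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def approx_power(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power to detect a true effect size d
    with group sizes n1 and n2 (normal approximation, not exact t)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n1 * n2 / (n1 + n2))  # expected value of the test statistic
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def detectable_d(power, n1, n2, alpha=0.05):
    """Approximate true effect size needed to reach the given power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (z_crit + Z.inv_cdf(power)) / sqrt(n1 * n2 / (n1 + n2))


power_males = approx_power(0.2, 12, 27)
d_for_80 = detectable_d(0.80, 12, 27)

print(f"Approximate power for d = 0.2: {power_males:.2f}")
print(f"Effect size needed for 80% power: {d_for_80:.2f}")
```

With these inputs the approximation gives power a little under .09 for d = 0.2, and an effect size close to 1 SD for 80% power, in line with the figures quoted above.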

When confronted with this kind of argument, people often say that maybe there really are big effect sizes. After all, the researchers were measuring characteristics of the brain, which are nearer to the gene than the behavioural measures that are often used. Unfortunately, there is another much more likely explanation for the result, which is that it is a false positive arising from a flexible analytic pipeline.

The problem is that both neuroscience and genetics are a natural environment for analytic flexibility. Put the two together, and you need to be very very careful to control for spurious false positive results. In the papers we evaluated for our review, there were numerous sources of flexibility: often researchers adopted multiple comparisons corrections for some of these, but typically not for all. In the COMT/callosum paper, the authors addressed the multiple comparisons issue using permutation testing. However, one cannot tell from a published paper how many subgroupings/genetic variants/phenotypes/analysis pathways etc were tried but not reported. If, as in mainstream genetics, the authors had included a direct replication of this result, that would be far more convincing. Perhaps the best way for the field to proceed would be by adopting pre-registration as standard. Pre-registration means you commit yourself to a specific hypothesis and analytic plan in advance; hypotheses can then be meaningfully tested using standard statistical methods. If you don’t pre-register and there are many potential ways of looking at the data, it is very easy to fool yourself into finding something that looks 'significant'.

I am sufficiently confident that this finding will not replicate that I hereby undertake to award a prize of £1000 to anyone who does a publicly preregistered replication of the El-Hage et al paper and reproduces their finding of a statistically significant male-specific effect of COMT Val158Met polymorphism on the same aspects of corpus callosum structure.

I emphasise that, though the new COMT/callosum paper is the impetus for this blogpost, I do not intend this as a specific criticism of the authors of that paper. The research approach they adopted is pretty much standard in the field, and the literature is full of small studies that aren't pre-registered and don't include a replication sample. I don't think most researchers are being deliberately misleading, but I do think we need a change of practices if we are to amass a research literature that can be built upon. Either pre-registration or replication should be conditions of publication.

PS. 3rd October 2017
An anonymous commentator (below) drew my attention to a highly relevant preprint on bioRxiv by Jahanshad and colleagues from the ENIGMA-DTI consortium, entitled 'Do Candidate Genes Affect the Brain's White Matter Microstructure? Large-Scale Evaluation of 6,165 Diffusion MRI Scans'. They included COMT as one of the candidate genes, although they did not look at gender-specific effects. The Abstract makes for sobering reading: 'Regardless of the approach, the previously reported candidate SNPs did not show significant associations with white matter microstructure in this largest genetic study of DTI to date; the negative findings are likely not due to insufficient power.'

In addition, Kevin Mitchell (@WiringTheBrain) on Twitter alerted me to a blogpost from 2015 in which he made very similar points about neuroimaging biomarkers. Let's hope that funders and mainstream journals start to get the message.

Sunday, 10 September 2017

Bishopblog catalogue (updated 10 Sept 2017)

 Source: http://www.weblogcartoons.com/2008/11/23/ideas/

Those of you who follow this blog may have noticed a lack of thematic coherence. I write about whatever is exercising my mind at the time, which can range from technical aspects of statistics to the design of bathroom taps. I decided it might be helpful to introduce a bit of order into this chaotic melange, so here is a catalogue of posts by topic.

Language impairment, dyslexia and related disorders
The common childhood disorders that have been left out in the cold (1 Dec 2010) What's in a name? (18 Dec 2010) Neuroprognosis in dyslexia (22 Dec 2010) Where commercial and clinical interests collide: Auditory processing disorder (6 Mar 2011) Auditory processing disorder (30 Mar 2011) Special educational needs: will they be met by the Green paper proposals? (9 Apr 2011) Is poor parenting really to blame for children's school problems? (3 Jun 2011) Early intervention: what's not to like? (1 Sep 2011) Lies, damned lies and spin (15 Oct 2011) A message to the world (31 Oct 2011) Vitamins, genes and language (13 Nov 2011) Neuroscientific interventions for dyslexia: red flags (24 Feb 2012) Phonics screening: sense and sensibility (3 Apr 2012) What Chomsky doesn't get about child language (3 Sept 2012) Data from the phonics screen (1 Oct 2012) Auditory processing disorder: schisms and skirmishes (27 Oct 2012) High-impact journals (Action video games and dyslexia: critique) (10 Mar 2013) Overhyped genetic findings: the case of dyslexia (16 Jun 2013) The arcuate fasciculus and word learning (11 Aug 2013) Changing children's brains (17 Aug 2013) Raising awareness of language learning impairments (26 Sep 2013) Good and bad news on the phonics screen (5 Oct 2013) What is educational neuroscience? (25 Jan 2014) Parent talk and child language (17 Feb 2014) My thoughts on the dyslexia debate (20 Mar 2014) Labels for unexplained language difficulties in children (23 Aug 2014) International reading comparisons: Is England really doing so poorly? (14 Sep 2014) Our early assessments of schoolchildren are misleading and damaging (4 May 2015) Opportunity cost: a new red flag for evaluating interventions (30 Aug 2015) The STEP Physical Literacy programme: have we been here before? (2 Jul 2017)

Autism
Autism diagnosis in cultural context (16 May 2011) Are our ‘gold standard’ autism diagnostic instruments fit for purpose? (30 May 2011) How common is autism? (7 Jun 2011) Autism and hypersystematising parents (21 Jun 2011) An open letter to Baroness Susan Greenfield (4 Aug 2011) Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011) Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012) The ‘autism epidemic’ and diagnostic substitution (4 Jun 2012) How wishful thinking is damaging Peta's cause (9 June 2014)

Developmental disorders/paediatrics
The hidden cost of neglected tropical diseases (25 Nov 2010) The National Children's Study: a view from across the pond (25 Jun 2011) The kids are all right in daycare (14 Sep 2011) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Changing the landscape of psychiatric research (11 May 2014)

Genetics
Where does the myth of a gene for things like intelligence come from? (9 Sep 2010) Genes for optimism, dyslexia and obesity and other mythical beasts (10 Sep 2010) The X and Y of sex differences (11 May 2011) Review of How Genes Influence Behaviour (5 Jun 2011) Getting genetic effect sizes in perspective (20 Apr 2012) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Genes, brains and lateralisation (22 Dec 2012) Genetic variation and neuroimaging (11 Jan 2013) Have we become slower and dumber? (15 May 2013) Overhyped genetic findings: the case of dyslexia (16 Jun 2013) Incomprehensibility of much neurogenetics research (1 Oct 2016) A common misunderstanding of natural selection (8 Jan 2017) Sample selection in genetic studies: impact of restricted range (23 Apr 2017)

Neuroscience
Neuroprognosis in dyslexia (22 Dec 2010) Brain scans show that… (11 Jun 2011) Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012) Neuronal migration in language learning impairments (2 May 2012) Sharing of MRI datasets (6 May 2012) Genetic variation and neuroimaging (11 Jan 2013) The arcuate fasciculus and word learning (11 Aug 2013) Changing children's brains (17 Aug 2013) What is educational neuroscience? (25 Jan 2014) Changing the landscape of psychiatric research (11 May 2014) Incomprehensibility of much neurogenetics research (1 Oct 2016)

Reproducibility
Accentuate the negative (26 Oct 2011) Novelty, interest and replicability (19 Jan 2012) High-impact journals: where newsworthiness trumps methodology (10 Mar 2013) Who's afraid of open data? (15 Nov 2015) Blogging as post-publication peer review (21 Mar 2013) Research fraud: More scrutiny by administrators is not the answer (17 Jun 2013) Pressures against cumulative research (9 Jan 2014) Why does so much research go unpublished? (12 Jan 2014) Replication and reputation: Whose career matters? (29 Aug 2014) Open code: not just data and publications (6 Dec 2015) Why researchers need to understand poker (26 Jan 2016) Reproducibility crisis in psychology (5 Mar 2016) Further benefit of registered reports (22 Mar 2016) Would paying by results improve reproducibility? (7 May 2016) Serendipitous findings in psychology (29 May 2016) Thoughts on the Statcheck project (3 Sep 2016) When is a replication not a replication? (16 Dec 2016) Reproducible practices are the future for early career researchers (1 May 2017) Which neuroimaging measures are useful for individual differences research? (28 May 2017) Prospecting for kryptonite: the value of null results (17 Jun 2017)

Statistics
Book review: biography of Richard Doll (5 Jun 2010) Book review: the Invisible Gorilla (30 Jun 2010) The difference between p < .05 and a screening test (23 Jul 2010) Three ways to improve cognitive test scores without intervention (14 Aug 2010) A short nerdy post about the use of percentiles (13 Apr 2011) The joys of inventing data (5 Oct 2011) Getting genetic effect sizes in perspective (20 Apr 2012) Causal models of developmental disorders: the perils of correlational data (24 Jun 2012) Data from the phonics screen (1 Oct 2012) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Flaky chocolate and the New England Journal of Medicine (13 Nov 2012) Interpreting unexpected significant results (7 June 2013) Data analysis: Ten tips I wish I'd known earlier (18 Apr 2014) Data sharing: exciting but scary (26 May 2014) Percentages, quasi-statistics and bad arguments (21 July 2014) Why I still use Excel (1 Sep 2016) Sample selection in genetic studies: impact of restricted range (23 Apr 2017) Prospecting for kryptonite: the value of null results (17 Jun 2017)

Journalism/science communication
Orwellian prize for scientific misrepresentation (1 Jun 2010) Journalists and the 'scientific breakthrough' (13 Jun 2010) Science journal editors: a taxonomy (28 Sep 2010) Orwellian prize for journalistic misrepresentation: an update (29 Jan 2011) Academic publishing: why isn't psychology like physics? (26 Feb 2011) Scientific communication: the Comment option (25 May 2011) Publishers, psychological tests and greed (30 Dec 2011) Time for academics to withdraw free labour (7 Jan 2012) 2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012) Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012) Communicating science in the age of the internet (13 Jul 2012) How to bury your academic writing (26 Aug 2012) High-impact journals: where newsworthiness trumps methodology (10 Mar 2013) A short rant about numbered journal references (5 Apr 2013) Schizophrenia and child abuse in the media (26 May 2013) Why we need pre-registration (6 Jul 2013) On the need for responsible reporting of research (10 Oct 2013) A New Year's letter to academic publishers (4 Jan 2014) Journals without editors: What is going on? (1 Feb 2015) Editors behaving badly? (24 Feb 2015) Will Elsevier say sorry? (21 Mar 2015) How long does a scientific paper need to be? (20 Apr 2015) Will traditional science journals disappear? (17 May 2015) My collapse of confidence in Frontiers journals (7 Jun 2015) Publishing replication failures (11 Jul 2015) Psychology research: hopeless case or pioneering field? (28 Aug 2015) Desperate marketing from J. Neuroscience (18 Feb 2016) Editorial integrity: publishers on the front line (11 Jun 2016) When scientific communication is a one-way street (13 Dec 2016) Breaking the ice with buxom grapefruits: Pratiques de publication and predatory publishing (25 Jul 2017)

Social Media
A gentle introduction to Twitter for the apprehensive academic (14 Jun 2011) Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011) Will I still be tweeting in 2013? (2 Jan 2012) Blogging in the service of science (10 Mar 2012) Blogging as post-publication peer review (21 Mar 2013) The impact of blogging on reputation (27 Dec 2013) WeSpeechies: A meeting point on Twitter (12 Apr 2014) Email overload (12 Apr 2016)