What is complete pooling, no-pooling, and partial pooling, and how to use data.table for no-pooling models (fit the same model, but separately to each group)
Get source code for this RMarkdown script here.
Donate and become a patron: If you find value in what I do and have learned something from my site, please consider becoming a patron. It takes me many hours to research, learn, and put together tutorials. Your support really matters.
data.table
package to repeatedly fit a model to separate groups in a datasetBefore the invention of multi-level models (aka linear mixed effects models), how did people analyze nested or repeated-measures data? They aggregated the data, used repeated-measures ANOVA, and did other things.
One interesting approach is to repeatedly fit the same model to each group separately, and this approach seems to be first mentioned in this article (Lorch and Myers 1990). If you have 10 data points per subject, you have nested data (data points nested within subjects), and you could fit 10 linear regressions, one for each subject. This approach is still quite useful nowadays (I use it regularly in my research), and I’ll demonstrate how to do it easily with data.table
and my summaryh()
function from my hausekeep
package.
# load packages
library(hausekeep); library(data.table)
Convert the built-in dataset ChickWeight
to data.table
class so we can leverage its extensive functionalities.
dt1 <- data.table(ChickWeight) # convert builtin dataset to data.table class
dt1 # print dataset
weight Time Chick Diet
1: 42 0 1 1
2: 51 2 1 1
3: 59 4 1 1
4: 64 6 1 1
5: 76 8 1 1
---
574: 175 14 50 4
575: 205 16 50 4
576: 234 18 50 4
577: 264 20 50 4
578: 264 21 50 4
summary(dt1) # check dataset
weight Time Chick Diet
Min. : 35.0 Min. : 0.00 13 : 12 1:220
1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
Median :103.0 Median :10.00 20 : 12 3:120
Mean :121.8 Mean :10.72 10 : 12 4:118
3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
Max. :373.0 Max. :21.00 19 : 12
(Other):506
dt1[, Chick := as.numeric(as.character(Chick))] # recode factor as numeric
Let’s look at just the first 10 chicks (and to illustrate data.table
syntax; see my data.table tutorials). We then fit a linear regression weight ~ Time
to the data from these 10 chicks, ignoring that the data came from 10 different chicks (i.e., completing pooling). That is, we pretend all the data came from one chick.
# filter rows where Chick is <= 10 and fit linear regression model
dt1[Chick <= 10, ] # syntax works only with data.table
weight Time Chick Diet
1: 42 0 1 1
2: 51 2 1 1
3: 59 4 1 1
4: 64 6 1 1
5: 76 8 1 1
---
115: 96 14 10 1
116: 101 16 10 1
117: 112 18 10 1
118: 120 20 10 1
119: 124 21 10 1
model <- lm(weight ~ Time, data = dt1[Chick <= 10])
summary(model)
Call:
lm(formula = weight ~ Time, data = dt1[Chick <= 10])
Residuals:
Min 1Q Median 3Q Max
-85.907 -11.399 -1.485 11.291 121.093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.2094 5.2189 5.788 6.07e-08 ***
Time 7.3189 0.4091 17.892 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.08 on 117 degrees of freedom
Multiple R-squared: 0.7323, Adjusted R-squared: 0.73
F-statistic: 320.1 on 1 and 117 DF, p-value: < 2.2e-16
If you use summaryh()
from my hausekeep package, you can show formatted output that can be copied-pasted into your APA manuscript.
summaryh(model)
term results
1: (Intercept) b = 30.21, SE = 5.22, t(117) = 5.79, p < .001, r = 0.47
2: Time b = 7.32, SE = 0.41, t(117) = 17.89, p < .001, r = 0.86
Or fit the same model within data.table
dt1[Chick <= 10, summaryh(lm(weight ~ Time))] # results are same as above
term results
1: (Intercept) b = 30.21, SE = 5.22, t(117) = 5.79, p < .001, r = 0.47
2: Time b = 7.32, SE = 0.41, t(117) = 17.89, p < .001, r = 0.86
The more appropriate way to fit this model is to account for the nested structure is to use linear mixed models (lme4 package). This approach is ‘partial pooling’, because it assumes hierarchy in the data structure.
library(lme4); library(lmerTest)
lme1 <- lmer(weight ~ Time + (1 | Chick), data = dt1[Chick %in% 1:10])
summaryh(lme1)
term results
1: (Intercept) b = 30.33, SE = 7.39, t(15) = 4.10, p = .001, r = 0.73
2: Time b = 7.30, SE = 0.32, t(108) = 22.74, p < .001, r = 0.91
summary(lme1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: weight ~ Time + (1 | Chick)
Data: dt1[Chick %in% 1:10]
REML criterion at convergence: 1104.9
Scaled residuals:
Min 1Q Median 3Q Max
-2.5405 -0.5187 -0.0682 0.6075 3.6354
Random effects:
Groups Name Variance Std.Dev.
Chick (Intercept) 378.8 19.46
Residual 556.4 23.59
Number of obs: 119, groups: Chick, 10
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 30.329 7.392 14.747 4.103 0.000972 ***
Time 7.300 0.321 108.052 22.741 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
Time -0.470
What if you want to fit a separate model for each chick/group, so you know how time was associated with weight for each chick? You could write a for loop to loop through each chick, but you can do it with data.table
syntax really easily.
Now we can extend the syntax above with by = Chick
to fit the model to each chick! Just one line of code is needed.
# fit linear regression model to first 10 Chicks
dt1[Chick %in% 1:10, summaryh(lm(weight ~ Time)), by = Chick]
Chick term results
1: 1 (Intercept) b = 24.47, SE = 6.73, t(10) = 3.64, p = .005, r = 0.75
2: 1 Time b = 7.99, SE = 0.52, t(10) = 15.25, p < .001, r = 0.98
3: 2 (Intercept) b = 24.72, SE = 4.93, t(10) = 5.01, p < .001, r = 0.85
4: 2 Time b = 8.72, SE = 0.38, t(10) = 22.72, p < .001, r = 0.99
5: 3 (Intercept) b = 23.18, SE = 5.08, t(10) = 4.56, p = .001, r = 0.82
6: 3 Time b = 8.49, SE = 0.40, t(10) = 21.45, p < .001, r = 0.99
7: 4 (Intercept) b = 32.87, SE = 4.01, t(10) = 8.21, p < .001, r = 0.93
8: 4 Time b = 6.09, SE = 0.31, t(10) = 19.53, p < .001, r = 0.99
9: 5 (Intercept) b = 16.90, SE = 7.56, t(10) = 2.24, p = .049, r = 0.58
10: 5 Time b = 10.06, SE = 0.59, t(10) = 17.10, p < .001, r = 0.98
11: 6 (Intercept) b = 44.12, SE = 7.35, t(10) = 6.00, p < .001, r = 0.88
12: 6 Time b = 6.38, SE = 0.57, t(10) = 11.15, p < .001, r = 0.96
13: 7 (Intercept) b = 5.84, SE = 11.42, t(10) = 0.51, p = .620, r = 0.16
14: 7 Time b = 13.21, SE = 0.89, t(10) = 14.85, p < .001, r = 0.98
15: 8 (Intercept) b = 43.73, SE = 3.70, t(9) = 11.82, p < .001, r = 0.97
16: 8 Time b = 4.83, SE = 0.31, t(9) = 15.44, p < .001, r = 0.98
17: 9 (Intercept) b = 52.09, SE = 4.79, t(10) = 10.88, p < .001, r = 0.96
18: 9 Time b = 2.66, SE = 0.37, t(10) = 7.15, p < .001, r = 0.91
19: 10 (Intercept) b = 38.70, SE = 1.13, t(10) = 34.28, p < .001, r = 1.00
20: 10 Time b = 4.07, SE = 0.09, t(10) = 46.29, p < .001, r = 1.00
Return specific results (e.g., t, p, SE, beta) in separate columns
dt1[Chick %in% 1:10, summaryh(lm(weight ~ Time), showTable = T)$results2, by = Chick]
Chick term estimate std.error df statistic p.value es.r es.d es.r.squared es.adj.r.squared
1: 1 (Intercept) 24.465 6.728 10 3.636 0.005 0.755 2.300 0.959 0.955
2: 1 Time 7.988 0.524 10 15.255 0.000 0.979 9.648 0.959 0.955
3: 2 (Intercept) 24.725 4.931 10 5.014 0.001 0.846 3.171 0.981 0.979
4: 2 Time 8.720 0.384 10 22.721 0.000 0.990 14.370 0.981 0.979
5: 3 (Intercept) 23.180 5.083 10 4.560 0.001 0.822 2.884 0.979 0.977
6: 3 Time 8.487 0.396 10 21.455 0.000 0.989 13.569 0.979 0.977
7: 4 (Intercept) 32.866 4.005 10 8.206 0.000 0.933 5.190 0.974 0.972
8: 4 Time 6.089 0.312 10 19.532 0.000 0.987 12.353 0.974 0.972
9: 5 (Intercept) 16.896 7.556 10 2.236 0.049 0.577 1.414 0.967 0.964
10: 5 Time 10.055 0.588 10 17.098 0.000 0.983 10.814 0.967 0.964
11: 6 (Intercept) 44.123 7.351 10 6.002 0.000 0.885 3.796 0.926 0.918
12: 6 Time 6.378 0.572 10 11.147 0.000 0.962 7.050 0.926 0.918
13: 7 (Intercept) 5.843 11.422 10 0.512 0.620 0.160 0.324 0.957 0.952
14: 7 Time 13.205 0.889 10 14.855 0.000 0.978 9.395 0.957 0.952
15: 8 (Intercept) 43.727 3.698 9 11.824 0.000 0.969 7.882 0.964 0.960
16: 8 Time 4.827 0.313 9 15.444 0.000 0.982 10.296 0.964 0.960
17: 9 (Intercept) 52.094 4.786 10 10.885 0.000 0.960 6.884 0.836 0.820
18: 9 Time 2.663 0.372 10 7.150 0.000 0.915 4.522 0.836 0.820
19: 10 (Intercept) 38.695 1.129 10 34.285 0.000 0.996 21.684 0.995 0.995
20: 10 Time 4.066 0.088 10 46.289 0.000 0.998 29.276 0.995 0.995
Why won’t summary()
work?
dt1[Chick <= 10, summary(lm(weight ~ Time))] # summary() doesn't return a table
dt1[Chick <= 10, summary(lm(weight ~ Time)), by = Chick] # doesn't work!
You can also use the lmList()
function from the lme4
package to fit this ‘no-pooling’ model, but you don’t get p-values, effect sizes, and other useful information!
lmList(weight ~ Time | Chick, data = dt1[Chick <= 10])
Call: lmList(formula = weight ~ Time | Chick, data = dt1[Chick <= 10])
Coefficients:
(Intercept) Time
1 24.465436 7.987899
2 24.724853 8.719861
3 23.179549 8.487370
4 32.865678 6.088640
5 16.895628 10.055362
6 44.123431 6.378006
7 5.842535 13.205264
8 43.727273 4.827273
9 52.094086 2.663137
10 38.695054 4.066102
Degrees of freedom: 119 total; 99 residual
Residual standard error: 11.44853
Support my work and become a patron here!
Lorch, Robert F., and Jerome L. Myers. 1990. “Regression Analyses of Repeated Measures Data in Cognitive Research.” Journal of Experimental Psychology: Learning, Memory, and Cognition 16 (1): 149–57. https://www.ncbi.nlm.nih.gov/pubmed/2136750.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hauselin/rtutorialsite, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lin (2019, Feb. 18). Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code). Retrieved from https://hausetutorials.netlify.com/posts/2019-02-18-fit-models-to-every-group/
BibTeX citation
@misc{lin2019complete, author = {Lin, Hause}, title = {Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)}, url = {https://hausetutorials.netlify.com/posts/2019-02-18-fit-models-to-every-group/}, year = {2019} }