What is complete pooling, no-pooling, and partial pooling, and how to use data.table for no-pooling models (fit the same model, but separately to each group)
Get source code for this RMarkdown script here.
Donate and become a patron: If you find value in what I do and have learned something from my site, please consider becoming a patron. It takes me many hours to research, learn, and put together tutorials. Your support really matters.
data.table
package to repeatedly fit a model to separate groups in a datasetBefore the invention of multi-level models (aka linear mixed effects models), how did people analyze nested or repeated-measures data? They aggregated the data, used repeated-measures ANOVA, and did other things.
One interesting approach is to repeatedly fit the same model to each group separately, and this approach seems to be first mentioned in this article
R.F. Lorch, J.L. Myers.
Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol 16(1), pp. 149–157. 1990. data.table
and my summaryh()
function from my hausekeep
package.
# load packages
library(hausekeep); library(data.table)
Convert the built-in dataset ChickWeight
to data.table
class so we can leverage its extensive functionalities.
dt1 <- data.table(ChickWeight) # convert builtin dataset to data.table class
dt1 # print dataset
weight Time Chick Diet 1: 42 0 1 1 2: 51 2 1 1 3: 59 4 1 1 4: 64 6 1 1 5: 76 8 1 1 --- 574: 175 14 50 4 575: 205 16 50 4 576: 234 18 50 4 577: 264 20 50 4 578: 264 21 50 4
summary(dt1) # check dataset
weight Time Chick Diet Min. : 35.0 Min. : 0.00 13 : 12 1:220 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120 Median :103.0 Median :10.00 20 : 12 3:120 Mean :121.8 Mean :10.72 10 : 12 4:118 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12 Max. :373.0 Max. :21.00 19 : 12 (Other):506
dt1[, Chick := as.numeric(as.character(Chick))] # recode factor as numeric
Let’s look at just the first 10 chicks (and to illustrate data.table
syntax; see my data.table tutorials). We then fit a linear regression weight ~ Time
to the data from these 10 chicks, ignoring that the data came from 10 different chicks (i.e., completing pooling). That is, we pretend all the data came from one chick.
# filter rows where Chick is <= 10 and fit linear regression model
dt1[Chick <= 10, ] # syntax works only with data.table
weight Time Chick Diet 1: 42 0 1 1 2: 51 2 1 1 3: 59 4 1 1 4: 64 6 1 1 5: 76 8 1 1 --- 115: 96 14 10 1 116: 101 16 10 1 117: 112 18 10 1 118: 120 20 10 1 119: 124 21 10 1
model <- lm(weight ~ Time, data = dt1[Chick <= 10])
summary(model)
Call: lm(formula = weight ~ Time, data = dt1[Chick <= 10]) Residuals: Min 1Q Median 3Q Max -85.907 -11.399 -1.485 11.291 121.093 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 30.2094 5.2189 5.788 6.07e-08 *** Time 7.3189 0.4091 17.892 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 30.08 on 117 degrees of freedom Multiple R-squared: 0.7323, Adjusted R-squared: 0.73 F-statistic: 320.1 on 1 and 117 DF, p-value: < 2.2e-16
If you use summaryh()
from my hausekeep package, you can show formatted output that can be copied-pasted into your APA manuscript.
summaryh(model)
term results 1: (Intercept) b = 30.21, SE = 5.22, t(117) = 5.79, p < .001, r = 0.47 2: Time b = 7.32, SE = 0.41, t(117) = 17.89, p < .001, r = 0.86
Or fit the same model within data.table
dt1[Chick <= 10, summaryh(lm(weight ~ Time))] # results are same as above
term results 1: (Intercept) b = 30.21, SE = 5.22, t(117) = 5.79, p < .001, r = 0.47 2: Time b = 7.32, SE = 0.41, t(117) = 17.89, p < .001, r = 0.86
The more appropriate way to fit this model is to account for the nested structure is to use linear mixed models (lme4 package). This approach is ‘partial pooling’, because it assumes hierarchy in the data structure.
library(lme4); library(lmerTest)
lme1 <- lmer(weight ~ Time + (1 | Chick), data = dt1[Chick %in% 1:10])
summaryh(lme1)
term results 1: (Intercept) b = 30.33, SE = 7.39, t(15) = 4.10, p = .001, r = 0.73 2: Time b = 7.30, SE = 0.32, t(108) = 22.74, p < .001, r = 0.91
summary(lme1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest'] Formula: weight ~ Time + (1 | Chick) Data: dt1[Chick %in% 1:10] REML criterion at convergence: 1104.9 Scaled residuals: Min 1Q Median 3Q Max -2.5405 -0.5187 -0.0682 0.6075 3.6354 Random effects: Groups Name Variance Std.Dev. Chick (Intercept) 378.8 19.46 Residual 556.4 23.59 Number of obs: 119, groups: Chick, 10 Fixed effects: Estimate Std. Error df t value Pr(>|t|) (Intercept) 30.329 7.392 14.747 4.103 0.000972 *** Time 7.300 0.321 108.052 22.741 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Correlation of Fixed Effects: (Intr) Time -0.470
What if you want to fit a separate model for each chick/group, so you know how time was associated with weight for each chick? You could write a for loop to loop through each chick, but you can do it with data.table
syntax really easily.
Now we can extend the syntax above with by = Chick
to fit the model to each chick! Just one line of code is needed.
# fit linear regression model to first 10 Chicks
dt1[Chick %in% 1:10, summaryh(lm(weight ~ Time)), by = Chick]
Chick term results 1: 1 (Intercept) b = 24.47, SE = 6.73, t(10) = 3.64, p = .005, r = 0.75 2: 1 Time b = 7.99, SE = 0.52, t(10) = 15.25, p < .001, r = 0.98 3: 2 (Intercept) b = 24.72, SE = 4.93, t(10) = 5.01, p < .001, r = 0.85 4: 2 Time b = 8.72, SE = 0.38, t(10) = 22.72, p < .001, r = 0.99 5: 3 (Intercept) b = 23.18, SE = 5.08, t(10) = 4.56, p = .001, r = 0.82 6: 3 Time b = 8.49, SE = 0.40, t(10) = 21.45, p < .001, r = 0.99 7: 4 (Intercept) b = 32.87, SE = 4.01, t(10) = 8.21, p < .001, r = 0.93 8: 4 Time b = 6.09, SE = 0.31, t(10) = 19.53, p < .001, r = 0.99 9: 5 (Intercept) b = 16.90, SE = 7.56, t(10) = 2.24, p = .049, r = 0.58 10: 5 Time b = 10.06, SE = 0.59, t(10) = 17.10, p < .001, r = 0.98 11: 6 (Intercept) b = 44.12, SE = 7.35, t(10) = 6.00, p < .001, r = 0.88 12: 6 Time b = 6.38, SE = 0.57, t(10) = 11.15, p < .001, r = 0.96 13: 7 (Intercept) b = 5.84, SE = 11.42, t(10) = 0.51, p = .620, r = 0.16 14: 7 Time b = 13.21, SE = 0.89, t(10) = 14.85, p < .001, r = 0.98 15: 8 (Intercept) b = 43.73, SE = 3.70, t(9) = 11.82, p < .001, r = 0.97 16: 8 Time b = 4.83, SE = 0.31, t(9) = 15.44, p < .001, r = 0.98 17: 9 (Intercept) b = 52.09, SE = 4.79, t(10) = 10.88, p < .001, r = 0.96 18: 9 Time b = 2.66, SE = 0.37, t(10) = 7.15, p < .001, r = 0.91 19: 10 (Intercept) b = 38.70, SE = 1.13, t(10) = 34.28, p < .001, r = 1.00 20: 10 Time b = 4.07, SE = 0.09, t(10) = 46.29, p < .001, r = 1.00
Return specific results (e.g., t, p, SE, beta) in separate columns
dt1[Chick %in% 1:10, summaryh(lm(weight ~ Time), showTable = T)$results2, by = Chick]
Chick term estimate std.error df statistic p.value es.r es.d es.r.squared es.adj.r.squared 1: 1 (Intercept) 24.465 6.728 10 3.636 0.005 0.755 2.300 0.959 0.955 2: 1 Time 7.988 0.524 10 15.255 0.000 0.979 9.648 0.959 0.955 3: 2 (Intercept) 24.725 4.931 10 5.014 0.001 0.846 3.171 0.981 0.979 4: 2 Time 8.720 0.384 10 22.721 0.000 0.990 14.370 0.981 0.979 5: 3 (Intercept) 23.180 5.083 10 4.560 0.001 0.822 2.884 0.979 0.977 6: 3 Time 8.487 0.396 10 21.455 0.000 0.989 13.569 0.979 0.977 7: 4 (Intercept) 32.866 4.005 10 8.206 0.000 0.933 5.190 0.974 0.972 8: 4 Time 6.089 0.312 10 19.532 0.000 0.987 12.353 0.974 0.972 9: 5 (Intercept) 16.896 7.556 10 2.236 0.049 0.577 1.414 0.967 0.964 10: 5 Time 10.055 0.588 10 17.098 0.000 0.983 10.814 0.967 0.964 11: 6 (Intercept) 44.123 7.351 10 6.002 0.000 0.885 3.796 0.926 0.918 12: 6 Time 6.378 0.572 10 11.147 0.000 0.962 7.050 0.926 0.918 13: 7 (Intercept) 5.843 11.422 10 0.512 0.620 0.160 0.324 0.957 0.952 14: 7 Time 13.205 0.889 10 14.855 0.000 0.978 9.395 0.957 0.952 15: 8 (Intercept) 43.727 3.698 9 11.824 0.000 0.969 7.882 0.964 0.960 16: 8 Time 4.827 0.313 9 15.444 0.000 0.982 10.296 0.964 0.960 17: 9 (Intercept) 52.094 4.786 10 10.885 0.000 0.960 6.884 0.836 0.820 18: 9 Time 2.663 0.372 10 7.150 0.000 0.915 4.522 0.836 0.820 19: 10 (Intercept) 38.695 1.129 10 34.285 0.000 0.996 21.684 0.995 0.995 20: 10 Time 4.066 0.088 10 46.289 0.000 0.998 29.276 0.995 0.995
Why won’t summary()
work?
dt1[Chick <= 10, summary(lm(weight ~ Time))] # summary() doesn't return a table
dt1[Chick <= 10, summary(lm(weight ~ Time)), by = Chick] # doesn't work!
You can also use the lmList()
function from the lme4
package to fit this ‘no-pooling’ model, but you don’t get p-values, effect sizes, and other useful information!
lmList(weight ~ Time | Chick, data = dt1[Chick <= 10])
Call: lmList(formula = weight ~ Time | Chick, data = dt1[Chick <= 10]) Coefficients: (Intercept) Time 1 24.465436 7.987899 2 24.724853 8.719861 3 23.179549 8.487370 4 32.865678 6.088640 5 16.895628 10.055362 6 44.123431 6.378006 7 5.842535 13.205264 8 43.727273 4.827273 9 52.094086 2.663137 10 38.695054 4.066102 Degrees of freedom: 119 total; 99 residual Residual standard error: 11.44853
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hauselin/rtutorialsite, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lin (2019, Feb. 18). Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code). Retrieved from https://hausetutorials.netlify.com/posts/2019-02-18-fit-models-to-every-group/
BibTeX citation
@misc{lin2019complete, author = {Lin, Hause}, title = {Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)}, url = {https://hausetutorials.netlify.com/posts/2019-02-18-fit-models-to-every-group/}, year = {2019} }