Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)

Hause Lin

Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)

What is complete pooling, no-pooling, and partial pooling, and how to use data.table for no-pooling models (fit the same model, but separately to each group)

Author

Affiliation

Hause Lin

Published

Feb. 18, 2019

Citation

Lin, 2019

Get source code for this RMarkdown script here.

Consider being a patron and supporting my work?

Donate and become a patron: If you find value in what I do and have learned something from my site, please consider becoming a patron. It takes me many hours to research, learn, and put together tutorials. Your support really matters.

How to use `data.table` package to repeatedly fit a model to separate groups in a dataset

Before the invention of multi-level models (aka linear mixed effects models), how did people analyze nested or repeated-measures data? They aggregated the data, used repeated-measures ANOVA, and did other things.

One interesting approach is to repeatedly fit the same model to each group separately, and this approach seems to be first mentioned in this article

Regression analyses of repeated measures data in cognitive research. [link]
R.F. Lorch, J.L. Myers.
Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol 16(1), pp. 149–157. 1990.

[1]

. If you have 10 data points per subject, you have nested data (data points nested within subjects), and you could fit 10 linear regressions, one for each subject. This approach is still quite useful nowadays (I use it regularly in my research), and I’ll demonstrate how to do it easily with data.table and my summaryh() function from my hausekeep package.

# load packages
library(hausekeep); library(data.table)

Convert the built-in dataset ChickWeight to data.table class so we can leverage its extensive functionalities.

dt1 <- data.table(ChickWeight) # convert builtin dataset to data.table class
dt1 # print dataset

     weight Time Chick Diet
  1:     42    0     1    1
  2:     51    2     1    1
  3:     59    4     1    1
  4:     64    6     1    1
  5:     76    8     1    1
 ---                       
574:    175   14    50    4
575:    205   16    50    4
576:    234   18    50    4
577:    264   20    50    4
578:    264   21    50    4

Dataset contains weights of 50 chicks that were fed one of four diets. Their weights were tracked over time. Type ?ChickWeight in the console for more info about this dataset.

summary(dt1) # check dataset

     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506

dt1[, Chick := as.numeric(as.character(Chick))] # recode factor as numeric

Completing pooling approach

Let’s look at just the first 10 chicks (and to illustrate data.table syntax; see my data.table tutorials). We then fit a linear regression weight ~ Time to the data from these 10 chicks, ignoring that the data came from 10 different chicks (i.e., completing pooling). That is, we pretend all the data came from one chick.

# filter rows where Chick is <= 10 and fit linear regression model
dt1[Chick <= 10, ] # syntax works only with data.table

     weight Time Chick Diet
  1:     42    0     1    1
  2:     51    2     1    1
  3:     59    4     1    1
  4:     64    6     1    1
  5:     76    8     1    1
 ---                       
115:     96   14    10    1
116:    101   16    10    1
117:    112   18    10    1
118:    120   20    10    1
119:    124   21    10    1

model <- lm(weight ~ Time, data = dt1[Chick <= 10])
summary(model)

Call:
lm(formula = weight ~ Time, data = dt1[Chick <= 10])

Residuals:
    Min      1Q  Median      3Q     Max 
-85.907 -11.399  -1.485  11.291 121.093 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.2094     5.2189   5.788 6.07e-08 ***
Time          7.3189     0.4091  17.892  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.08 on 117 degrees of freedom
Multiple R-squared:  0.7323,    Adjusted R-squared:   0.73 
F-statistic: 320.1 on 1 and 117 DF,  p-value: < 2.2e-16

If you use summaryh() from my hausekeep package, you can show formatted output that can be copied-pasted into your APA manuscript.

summaryh(model)

          term                                                 results
1: (Intercept) b = 30.21, SE = 5.22, t(117) = 5.79, p < .001, r = 0.47
2:        Time b = 7.32, SE = 0.41, t(117) = 17.89, p < .001, r = 0.86

Or fit the same model within data.table

dt1[Chick <= 10, summaryh(lm(weight ~ Time))] # results are same as above

          term                                                 results
1: (Intercept) b = 30.21, SE = 5.22, t(117) = 5.79, p < .001, r = 0.47
2:        Time b = 7.32, SE = 0.41, t(117) = 17.89, p < .001, r = 0.86

summaryh() returns results in a data.table.

Partial pooling approach

The more appropriate way to fit this model is to account for the nested structure is to use linear mixed models (lme4 package). This approach is ‘partial pooling’, because it assumes hierarchy in the data structure.

library(lme4); library(lmerTest)
lme1 <- lmer(weight ~ Time + (1 | Chick), data = dt1[Chick %in% 1:10])
summaryh(lme1)

          term                                                 results
1: (Intercept)  b = 30.33, SE = 7.39, t(15) = 4.10, p = .001, r = 0.73
2:        Time b = 7.30, SE = 0.32, t(108) = 22.74, p < .001, r = 0.91

summary(lme1)

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: weight ~ Time + (1 | Chick)
   Data: dt1[Chick %in% 1:10]

REML criterion at convergence: 1104.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.5405 -0.5187 -0.0682  0.6075  3.6354 

Random effects:
 Groups   Name        Variance Std.Dev.
 Chick    (Intercept) 378.8    19.46   
 Residual             556.4    23.59   
Number of obs: 119, groups:  Chick, 10

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)   30.329      7.392  14.747   4.103 0.000972 ***
Time           7.300      0.321 108.052  22.741  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr)
Time -0.470

No pooling approach

What if you want to fit a separate model for each chick/group, so you know how time was associated with weight for each chick? You could write a for loop to loop through each chick, but you can do it with data.table syntax really easily.

Now we can extend the syntax above with by = Chick to fit the model to each chick! Just one line of code is needed.

# fit linear regression model to first 10 Chicks
dt1[Chick %in% 1:10, summaryh(lm(weight ~ Time)), by = Chick]

    Chick        term                                                 results
 1:     1 (Intercept)  b = 24.47, SE = 6.73, t(10) = 3.64, p = .005, r = 0.75
 2:     1        Time  b = 7.99, SE = 0.52, t(10) = 15.25, p < .001, r = 0.98
 3:     2 (Intercept)  b = 24.72, SE = 4.93, t(10) = 5.01, p < .001, r = 0.85
 4:     2        Time  b = 8.72, SE = 0.38, t(10) = 22.72, p < .001, r = 0.99
 5:     3 (Intercept)  b = 23.18, SE = 5.08, t(10) = 4.56, p = .001, r = 0.82
 6:     3        Time  b = 8.49, SE = 0.40, t(10) = 21.45, p < .001, r = 0.99
 7:     4 (Intercept)  b = 32.87, SE = 4.01, t(10) = 8.21, p < .001, r = 0.93
 8:     4        Time  b = 6.09, SE = 0.31, t(10) = 19.53, p < .001, r = 0.99
 9:     5 (Intercept)  b = 16.90, SE = 7.56, t(10) = 2.24, p = .049, r = 0.58
10:     5        Time b = 10.06, SE = 0.59, t(10) = 17.10, p < .001, r = 0.98
11:     6 (Intercept)  b = 44.12, SE = 7.35, t(10) = 6.00, p < .001, r = 0.88
12:     6        Time  b = 6.38, SE = 0.57, t(10) = 11.15, p < .001, r = 0.96
13:     7 (Intercept)  b = 5.84, SE = 11.42, t(10) = 0.51, p = .620, r = 0.16
14:     7        Time b = 13.21, SE = 0.89, t(10) = 14.85, p < .001, r = 0.98
15:     8 (Intercept)  b = 43.73, SE = 3.70, t(9) = 11.82, p < .001, r = 0.97
16:     8        Time   b = 4.83, SE = 0.31, t(9) = 15.44, p < .001, r = 0.98
17:     9 (Intercept) b = 52.09, SE = 4.79, t(10) = 10.88, p < .001, r = 0.96
18:     9        Time   b = 2.66, SE = 0.37, t(10) = 7.15, p < .001, r = 0.91
19:    10 (Intercept) b = 38.70, SE = 1.13, t(10) = 34.28, p < .001, r = 1.00
20:    10        Time  b = 4.07, SE = 0.09, t(10) = 46.29, p < .001, r = 1.00

When calling functions inside a data.table with the by argument, your functions have to return vectors, lists, or dataframes. Otherwise, you’ll get error messages. Run the code below to figure out why.

Return specific results (e.g., t, p, SE, beta) in separate columns

dt1[Chick %in% 1:10, summaryh(lm(weight ~ Time), showTable = T)$results2, by = Chick]

    Chick        term estimate std.error df statistic p.value  es.r   es.d es.r.squared es.adj.r.squared
 1:     1 (Intercept)   24.465     6.728 10     3.636   0.005 0.755  2.300        0.959            0.955
 2:     1        Time    7.988     0.524 10    15.255   0.000 0.979  9.648        0.959            0.955
 3:     2 (Intercept)   24.725     4.931 10     5.014   0.001 0.846  3.171        0.981            0.979
 4:     2        Time    8.720     0.384 10    22.721   0.000 0.990 14.370        0.981            0.979
 5:     3 (Intercept)   23.180     5.083 10     4.560   0.001 0.822  2.884        0.979            0.977
 6:     3        Time    8.487     0.396 10    21.455   0.000 0.989 13.569        0.979            0.977
 7:     4 (Intercept)   32.866     4.005 10     8.206   0.000 0.933  5.190        0.974            0.972
 8:     4        Time    6.089     0.312 10    19.532   0.000 0.987 12.353        0.974            0.972
 9:     5 (Intercept)   16.896     7.556 10     2.236   0.049 0.577  1.414        0.967            0.964
10:     5        Time   10.055     0.588 10    17.098   0.000 0.983 10.814        0.967            0.964
11:     6 (Intercept)   44.123     7.351 10     6.002   0.000 0.885  3.796        0.926            0.918
12:     6        Time    6.378     0.572 10    11.147   0.000 0.962  7.050        0.926            0.918
13:     7 (Intercept)    5.843    11.422 10     0.512   0.620 0.160  0.324        0.957            0.952
14:     7        Time   13.205     0.889 10    14.855   0.000 0.978  9.395        0.957            0.952
15:     8 (Intercept)   43.727     3.698  9    11.824   0.000 0.969  7.882        0.964            0.960
16:     8        Time    4.827     0.313  9    15.444   0.000 0.982 10.296        0.964            0.960
17:     9 (Intercept)   52.094     4.786 10    10.885   0.000 0.960  6.884        0.836            0.820
18:     9        Time    2.663     0.372 10     7.150   0.000 0.915  4.522        0.836            0.820
19:    10 (Intercept)   38.695     1.129 10    34.285   0.000 0.996 21.684        0.995            0.995
20:    10        Time    4.066     0.088 10    46.289   0.000 0.998 29.276        0.995            0.995

This line of code might look complicated, but it’s actually quite easy if you understand how the summaryh() function works. See here for documentation.

Why won’t summary() work?

dt1[Chick <= 10, summary(lm(weight ~ Time))] # summary() doesn't return a table
dt1[Chick <= 10, summary(lm(weight ~ Time)), by = Chick] # doesn't work!

You can also use the lmList() function from the lme4 package to fit this ‘no-pooling’ model, but you don’t get p-values, effect sizes, and other useful information!

lmList(weight ~ Time | Chick, data = dt1[Chick <= 10])

Call: lmList(formula = weight ~ Time | Chick, data = dt1[Chick <= 10]) 
Coefficients:
   (Intercept)      Time
1    24.465436  7.987899
2    24.724853  8.719861
3    23.179549  8.487370
4    32.865678  6.088640
5    16.895628 10.055362
6    44.123431  6.378006
7     5.842535 13.205264
8    43.727273  4.827273
9    52.094086  2.663137
10   38.695054  4.066102

Degrees of freedom: 119 total; 99 residual
Residual standard error: 11.44853

Support my work

Support my work and become a patron here!

0 Comments Share:

References

Regression analyses of repeated measures data in cognitive research. [link]
Lorch, R.F. and Myers, J.L., 1990. Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol 16(1), pp. 149–157.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hauselin/rtutorialsite, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lin (2019, Feb. 18). Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code). Retrieved from https://hausetutorials.netlify.com/posts/2019-02-18-fit-models-to-every-group/

BibTeX citation

@misc{lin2019complete,
  author = {Lin, Hause},
  title = {Data science: Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)},
  url = {https://hausetutorials.netlify.com/posts/2019-02-18-fit-models-to-every-group/},
  year = {2019}
}

[Lorch] Regression analyses of repeated measures data in cognitive research. [link]
Lorch, R.F. and Myers, J.L., 1990. Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol 16(1), pp. 149–157.

Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)

Author

Affiliation

Published

Citation

Table of Contents

Consider being a patron and supporting my work?

How to use `data.table` package to repeatedly fit a model to separate groups in a dataset

Completing pooling approach

Partial pooling approach

No pooling approach

Support my work

Footnotes

References

Corrections

Reuse

Citation

Complete vs. partial vs. no-pooling: Fit the same model to different groups (just one simple line of code)

Author

Affiliation

Published

Citation

Table of Contents

Consider being a patron and supporting my work?

How to use data.table package to repeatedly fit a model to separate groups in a dataset

Completing pooling approach

Partial pooling approach

No pooling approach

Support my work

Footnotes

References

Corrections

Reuse

Citation

How to use `data.table` package to repeatedly fit a model to separate groups in a dataset