How do I do stratified sampling on group-separated datasets in Python? Do packages for this exist?

by Peter, last updated September 11, 2019 17:19

Say I have the following data:

    Group_ID | Column_1 | Column_2 | Column_3 ...
    =============================================
    A        | 1        | 2        | 33
    A        | 2        | 2        | 3765
    A        | 3        | 6        | 3436
    A        | 4        | 8        | 32
    B        | 5        | 9        | 33
    B        | 3        | 34       | 385
    B        | 7        | 25       | 3
    B        | 3        | 1        | 38
    C        | 6        | 2        | 3
    C        | 8        | 2        | 4
    D        | 7        | 1        | 5
    D        | 6        | 9        | 11

I want to:

  • First, identify train-test splits that keep groups (Group_ID) separate, i.e. no group may appear in both the train and the test split.
  • Then, out of all the splits identified, pick those in which the distributions of Column_1, Column_2, Column_3 etc. are most similar between train and test.
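
To make the two steps concrete, here is a minimal brute-force sketch in plain Python, assuming the number of groups is small enough to enumerate every choice of test groups. The "similarity" measure is an assumption on my part: the sum over columns of the two-sample Kolmogorov-Smirnov statistic (the largest gap between the train and test empirical CDFs). The names `ks_distance`, `split_score` and `best_group_split` are hypothetical helpers, not functions from any library.

```python
from itertools import combinations

# Rows from the example table: (Group_ID, Column_1, Column_2, Column_3)
ROWS = [
    ("A", 1, 2, 33), ("A", 2, 2, 3765), ("A", 3, 6, 3436), ("A", 4, 8, 32),
    ("B", 5, 9, 33), ("B", 3, 34, 385), ("B", 7, 25, 3), ("B", 3, 1, 38),
    ("C", 6, 2, 3), ("C", 8, 2, 4),
    ("D", 7, 1, 5), ("D", 6, 9, 11),
]


def ks_distance(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    points = sorted(set(a) | set(b))
    return max(
        abs(sum(x <= p for x in a) / len(a) - sum(x <= p for x in b) / len(b))
        for p in points
    )


def split_score(rows, test_groups):
    """Sum of per-column KS distances between train rows and test rows."""
    train = [r[1:] for r in rows if r[0] not in test_groups]
    test = [r[1:] for r in rows if r[0] in test_groups]
    # zip(*rows) transposes the row tuples into one tuple per column.
    return sum(ks_distance(tr, te) for tr, te in zip(zip(*train), zip(*test)))


def best_group_split(rows, n_test_groups):
    """Step 1: enumerate group-separated splits. Step 2: keep the most similar."""
    groups = sorted({r[0] for r in rows})
    candidates = combinations(groups, n_test_groups)
    best = min(candidates, key=lambda g: split_score(rows, set(g)))
    return set(best), split_score(rows, set(best))


test_groups, score = best_group_split(ROWS, n_test_groups=2)
print("test groups:", sorted(test_groups), "imbalance score:", round(score, 3))
```

Enumeration grows combinatorially with the number of groups, so beyond a few dozen groups one would presumably sample candidate splits at random rather than try them all.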

In short, is there a way to split my data so that groups stay separated while the other features are distributed similarly on both sides of the split?

Ideally, I would like to do this with a package in Python or the like, if it exists.
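
For reference, I am not aware of a single function that does both things at once, but existing packages cover the pieces. A rough sketch, assuming pandas, SciPy and scikit-learn are installed: `GroupShuffleSplit` generates random splits that never put a group on both sides, and `scipy.stats.ks_2samp` can score how different each column looks between train and test. Summing the KS statistics and keeping the lowest-scoring candidate is my own assumption about what "most similar" should mean, and `imbalance` is a hypothetical helper.

```python
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.model_selection import GroupShuffleSplit

# The example table from the question.
df = pd.DataFrame({
    "Group_ID": list("AAAABBBBCCDD"),
    "Column_1": [1, 2, 3, 4, 5, 3, 7, 3, 6, 8, 7, 6],
    "Column_2": [2, 2, 6, 8, 9, 34, 25, 1, 2, 2, 1, 9],
    "Column_3": [33, 3765, 3436, 32, 33, 385, 3, 38, 3, 4, 5, 11],
})
features = ["Column_1", "Column_2", "Column_3"]

# Candidate splits: each keeps every group entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=20, test_size=0.5, random_state=0)


def imbalance(train_idx, test_idx):
    """Sum of per-column KS statistics between train and test (lower is better)."""
    return sum(
        ks_2samp(df[col].iloc[train_idx], df[col].iloc[test_idx]).statistic
        for col in features
    )


# Keep the candidate whose feature distributions differ the least.
best_train, best_test = min(
    splitter.split(df, groups=df["Group_ID"]),
    key=lambda split: imbalance(*split),
)
train_df, test_df = df.iloc[best_train], df.iloc[best_test]
print("train groups:", sorted(train_df["Group_ID"].unique()))
print("test groups:", sorted(test_df["Group_ID"].unique()))
```

Raising `n_splits` explores more candidates at the cost of more scoring work. Other distance measures, such as a difference in standardized means, could be swapped into `imbalance` in the same way.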


