Loss function for comparing high-dimensional joint distributions

by Max Ghenis   Last Updated November 09, 2018 01:19 AM

I'm synthesizing data using models trained on a source dataset, and am looking for a loss function to compare different synthesis methods*. I have some ideas below, but each has drawbacks and none is very elegant. Is there an established loss function for comparing high-dimensional joint distributions?

Here are my ideas so far. Each looks at only one variable at a time rather than at the joint distribution explicitly, so each would have to be evaluated across strata.

  • MSE: Just compares means, without considering distributions.
  • Kolmogorov-Smirnov D statistic: e.g. averaged or summed across variables. Only captures the largest gap between the two CDFs, not the full distribution.
  • Deviations from quantiles: e.g. for some set of equally-spaced quantiles. Captures more of the distribution.
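To make the per-variable ideas concrete, here is a minimal sketch of the KS and quantile options, assuming the real and synthetic data are 2-D NumPy arrays with the same numeric columns. The function names, the choice of 9 quantiles, and the plain averaging across columns are my own assumptions for illustration, not an established method.

```python
import numpy as np

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov D: largest gap between empirical CDFs."""
    real, synth = np.sort(real), np.sort(synth)
    grid = np.concatenate([real, synth])  # the gap can only change at data points
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synth = np.searchsorted(synth, grid, side="right") / synth.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))

def quantile_deviation(real, synth, n_quantiles=9):
    """Mean absolute gap between matching equally spaced quantiles."""
    probs = np.linspace(0, 1, n_quantiles + 2)[1:-1]  # drop the 0 and 1 endpoints
    gaps = np.quantile(real, probs) - np.quantile(synth, probs)
    return float(np.mean(np.abs(gaps)))

def marginal_losses(real, synth):
    """Average each per-column loss over all columns (assumed aggregation)."""
    cols = range(real.shape[1])
    return {
        "ks": float(np.mean([ks_statistic(real[:, j], synth[:, j]) for j in cols])),
        "quantile": float(np.mean([quantile_deviation(real[:, j], synth[:, j]) for j in cols])),
    }
```

Note that the quantile deviation is in the units of each variable, so columns would probably need to be standardized before averaging, whereas the KS statistic is always between 0 and 1. Neither sees dependence between columns, which is the drawback described above.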

Another idea could be something like cosine distance, matching each synthetic record with its nearest real record.
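A rough sketch of that nearest-record idea, assuming both datasets are 2-D NumPy arrays with the same columns and no all-zero rows (a zero row has no defined cosine distance). The function name is hypothetical, and the full similarity matrix is built in memory, so this only suits modest dataset sizes.

```python
import numpy as np

def nearest_record_cosine_distance(real, synth):
    """Mean cosine distance from each synthetic row to its closest real row."""
    real_unit = real / np.linalg.norm(real, axis=1, keepdims=True)
    synth_unit = synth / np.linalg.norm(synth, axis=1, keepdims=True)
    similarity = synth_unit @ real_unit.T  # shape: (n_synth, n_real)
    # Nearest real record = highest cosine similarity; distance = 1 - similarity.
    return float(np.mean(1.0 - similarity.max(axis=1)))
```

This is one-directional: it rewards synthetic records that sit close to some real record, but it does not penalize a synthesizer that covers only a small part of the real data. Cosine distance also ignores magnitude, so a synthetic row that is a scaled copy of a real row scores zero.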

* The loss function could be zero when passed the real data, so I'm separately checking to ensure that no synthetic record exactly matches a real one.
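The separate exact-match check could look something like this, assuming 2-D NumPy arrays and strict equality on every column. The function name is hypothetical, and for floating-point columns some rounding before comparison may be more appropriate.

```python
import numpy as np

def count_exact_matches(real, synth):
    """Count synthetic rows that are identical to at least one real row."""
    real_rows = {tuple(row) for row in real.tolist()}
    return sum(tuple(row) in real_rows for row in synth.tolist())
```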
