I'm synthesizing data using models trained on a source dataset, and I'm looking for a loss function to compare different data synthesis methods*. I have some ideas below, but each has drawbacks and none is very elegant. Is there an established loss function for comparing high-dimensional joint distributions?
Here are my ideas, but they all look at just one variable at a time without explicitly considering the joint distribution, so they would have to be evaluated across strata.
Another idea could be something like cosine distance, matching each synthetic record with its nearest real record.
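To make the nearest-record idea concrete, here is a minimal sketch of how I imagine it. It assumes all features are numeric (already encoded and scaled) and that no record is all zeros. The function name `mean_nearest_cosine_distance` is just something I made up for illustration.

```python
import numpy as np

def mean_nearest_cosine_distance(synthetic, real):
    """Average cosine distance from each synthetic row to its closest real row."""
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Normalize rows to unit length so dot products equal cosine similarity.
    # Assumption: no row is all zeros (that would divide by zero here).
    s = synthetic / np.linalg.norm(synthetic, axis=1, keepdims=True)
    r = real / np.linalg.norm(real, axis=1, keepdims=True)
    # Pairwise cosine distances, shape (n_synthetic, n_real).
    dist = 1.0 - s @ r.T
    # For each synthetic record, keep the distance to its nearest real record.
    nearest = dist.min(axis=1)
    return nearest.mean()

# Toy usage with random data standing in for real and synthetic tables.
rng = np.random.default_rng(0)
real_data = rng.normal(size=(200, 5))
synthetic_data = rng.normal(size=(150, 5))
print(mean_nearest_cosine_distance(synthetic_data, real_data))
```

One drawback I can already see: this is one-directional, so a method that generates records near only a small part of the real data would still score well. It also builds the full pairwise matrix, which may not fit in memory for large datasets.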
* Such a loss function would be trivially zero if a method simply returned the real data, so I'm separately checking that no synthetic record exactly matches a real one.
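For reference, the exact-match check I have in mind looks roughly like the sketch below. It assumes both datasets are 2D numeric arrays with the same column order, and `count_exact_copies` is a hypothetical helper name. Note that it uses exact equality, so float values that differ only by rounding will not count as matches.

```python
import numpy as np

def count_exact_copies(synthetic, real):
    """Count synthetic rows that are identical to at least one real row."""
    # Store real rows as hashable tuples for fast membership tests.
    real_rows = {tuple(row) for row in np.asarray(real).tolist()}
    return sum(tuple(row) in real_rows for row in np.asarray(synthetic).tolist())
```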