How to convert discrete data into a continuous curve

by Tim Hargreaves   Last Updated January 12, 2018 20:19 PM

I am interested in drawing graphs to show the distribution of letters throughout a word. Each graph would have a continuous x-axis running from 0 to 1 and frequencies on the y-axis. For example, I would expect the letter 'Q' to have a distribution that is high for small x and decreases as x increases, since words are more likely to begin with 'Q' than end with it.

I have a dataset of 1 million words that I mined from famous novels using R. For each letter, I have a list of every position at which it appeared, expressed as a proportion of the length of the word it was in, e.g. 3/8 if it was the third letter of an eight-letter word.

I am unsure how to convert this data into a smooth curve showing each letter's distribution. I have ideas that are accurate (just counting the number of occurrences of each value) but don't look smooth (since there'll be a massive jump at x = 0.5, and because of the influence of common 2-letter words). I also have ideas that are smooth (placing a normal distribution curve at each point and then summing them) but don't feel accurate or valid in any sense. What method would be a happy medium between these?

Answers 1

It's not clear why you want to describe this totally discrete data with a continuous distribution. I think you'd be better off looking at letter frequency at each position, where each position is the letter index (e.g. 3) and not the proportional location of the letter (e.g. 3/8). That proportional index has problems like the ones you mentioned, with 2-letter words only having values at 0.5 and 1, 4-letter words only having values at 0.25, 0.5, 0.75, and 1, and so on. On top of that, a value of 0.5 is the beginning of the word for 2-letter words, but halfway through the word for 10-letter words. You can see that 0.5 doesn't have as much meaning as the positional index if you're concerned about word beginnings or endings.
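The clumping described above can be seen by listing which proportional values each word length can produce. This is a minimal Python sketch for illustration only; the function names `proportional_positions` and `shared_value_lengths` are made up here and are not part of the asker's R code.

```python
from fractions import Fraction

def proportional_positions(word_length):
    """Return the relative positions i/n available in a word of n letters."""
    return [Fraction(i, word_length) for i in range(1, word_length + 1)]

def shared_value_lengths(value, max_length=10):
    """Word lengths (1..max_length) in which some letter sits exactly at `value`."""
    return [n for n in range(1, max_length + 1)
            if value in proportional_positions(n)]

# A 2-letter word can only contribute the values 1/2 and 1.
print(proportional_positions(2))
# A 4-letter word can only contribute 1/4, 1/2, 3/4 and 1.
print(proportional_positions(4))
# Every even word length contributes to x = 1/2, which is where the spike comes from.
print(shared_value_lengths(Fraction(1, 2)))
```

Exact fractions are used rather than floats so that 2/4 and 1/2 compare equal without rounding issues.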

Instead, look at all words that have at least 2 letters and calculate the frequency of each letter at position 2. Then look at all words that have at least 3 letters, and calculate the frequency at position 3, and so on. Now each positional distribution sums to 1, and accurately takes into account only words that are actually that long. By taking proportions, you get rid of those big bumps that represent certain-length words.
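The per-position calculation could be sketched roughly as follows. This is an illustrative Python version, not the asker's R code; the function name `positional_frequencies` and the toy word list are assumptions made for the example.

```python
from collections import Counter

def positional_frequencies(words, position):
    """Letter proportions at a 1-based position.

    Only words with at least `position` letters are counted, so the
    returned proportions sum to 1 (or the dict is empty if no word is
    long enough).
    """
    letters = [w[position - 1].lower() for w in words if len(w) >= position]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.items()}

# Toy data standing in for the real corpus (hypothetical).
words = ["queen", "quick", "at", "to", "the", "iraq"]

for pos in (1, 2, 3):
    print(pos, positional_frequencies(words, pos))
```

Plotting one letter's proportion against the position index (1, 2, 3, ...) then gives a discrete profile for that letter, with each point normalised by the number of words long enough to have that position.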

Nuclear Wang
January 12, 2018 19:36 PM
