Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

Given a word list of N words formed from a language of M characters, where each

ID: 657246 • Letter: G

Question

Given a word list of N words formed from a language of M characters, where each word is composed of n?1 not necessarily distinct characters, how can I find the best set of k<M characters to learn, where "best" is defined as the set of k with which the most complete words can be spelled. Ie, if I know these k characters I can spell Nk words. What is the maximum Nk for every k and which characters should I choose?

Checking a given set of k words is equivalent to a Scrabble problem, but searching the space becomes very hard very fast. This problem is of limited interest in English where M=26 but is more important in Chinese or Japanese where M?103.

I thought of considering the bipartite graph of characters and the words that they make but I'm not sure what the best search strategy is. I'm somewhat pessimistic that this problem is strictly solvable and therefore I am willing to try heuristic or stochastic methods.

Explanation / Answer

If you take each word as a set of characters, you're looking for sets of size k that contain the most words as subsets. For a given word of j unique characters out of an alphabet of size N, you have (N-j) choose (k-j) possible supersets, ie all the sets of size k that have the original j characters plus any (k-j) others.

Some time ago I worked on an algorithm for a similar problem that might be modified for this case. I'll try to sketch how it would work in this case.

Briefly, the solution is to walk all the "words" through their candidate supersets (a factorial number) in a fixed order, identify any runs of identical values that indicate a number of words that fit in the same superset, and to save the run that's longest.

This is essentially an n-way merge, looking for runs of identical values on the output. I did this with a heap of iterators where I would pull the smallest value, count it, increment it (find the next superset in its range of possible supersets), and re-insert it into the heap. This process provably visits every possible superset and counts the words that it covers.

What makes this possibly tractable is a pruning rule that skips candidates with sub-maximal runs. What I realized while coding the above is that if my only interest is runs of length L or greater, then if I look L elements ahead I can see if the current run stops before L elements, and furthermore (with some careful reasoning) eliminate all intervening values from consideration and skip to whatever value is L elements ahead.

The problem I was solving originally was association-rule mining, which walks each set through its subsets rather than supersets, and this method worked well in that case.

Hire Me For All Your Tutoring Needs
Integrity-first tutoring: clear explanations, guidance, and feedback.
Drop an Email at
drjack9650@gmail.com
Chat Now And Get Quote