Trie Tree and finding n words with highest number of occurrences, pointers

So I have this piece of code and I have to make it as it’s written in topic – find n words with highest number of occurrences, I’ve general idea of that and written it in void setorder function but I have some troubles with pointers.

Anyways, the idea is that I have two arrays, one for words and second with integers, so index [i] from words char array would have x occurrences, where x is stored in integer array with index [i].

My problem is, I don’t know how to crawl down the tree so I don’t mess anything up as I do it recursively, I just don’t seem to get the pointers to the structures, I don’t know what should I exactly put in this function arguments 🙁 Please help

edit: I forgot to add the code, heh https://gist.github.com/anonymous/a20e459e2214629d3839

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_SIZE(a) sizeof(a)/sizeof(a[0])

// Alphabet size (# of symbols)
#define ALPHABET_SIZE (26)
// Converts key current character into index
// use only 'a' through 'z' and lower case
#define CHAR_TO_INDEX(C) ((int)C - (int)'A')


char words[]={0};
int counters[1000]={0};

// trie node
typedef struct trie_node trie_node_t;
struct trie_node
{
    int counter;
    char letter;
    trie_node_t *children[ALPHABET_SIZE];
};

// trie ADT
typedef struct trie trie_t;
struct trie
{
    trie_node_t *root;
    int count;
};

// Returns new trie node (initialized to NULLs)
trie_node_t *getNode(void)
{
    trie_node_t *pNode = NULL;

    pNode = (trie_node_t *)malloc(sizeof(trie_node_t));

    if( pNode )
    {
        int i;

        pNode->counter = 0;

        for(i = 0; i < ALPHABET_SIZE; i++)
        {
            pNode->children[i] = NULL;
        }
    }

    return pNode;
}

// Initializes trie (root is dummy node)
void initialize(trie_t *pTrie)
{
    pTrie->root = getNode();
    pTrie->count = 0;
}

int wordnr=0;

void setorder(trie_node_t *pTrie, char word[])
{
    int length = strlen(word);
    char tempword[1000] = {0};


    for(int i=0; i<length; i++)
    {
        tempword[i]=word[i];
    }

    trie_node_t *pCrawl;
    pCrawl = pTrie->root;


    for(int i=0 ; i<26; i++ )
    {
        tempword[length]=pCrawl->letter;
        tempword[length+1]='';

        if(pCrawl->counter!=0)
        {
            for(int i=0; i<=length; i++)
                words[wordnr+i]=tempword[i];

            counters[wordnr]=pCrawl->counter;
        }

        else
        {
            pCrawl = pCrawl->children[i];
            setorder(pCrawl, tempword);
        }
    }
}

// If not present, inserts key into trie
// If the key is prefix of trie node, just marks leaf node
void insert(trie_t *pTrie, char key[])
{
    int level;
    int length = strlen(key);
    int index;
    trie_node_t *pCrawl;


    pTrie->count++;
    pCrawl = pTrie->root;

    for( level = 0; level < length; level++ )
    {
        index = CHAR_TO_INDEX(key[level]);
        if( !pCrawl->children[index] )
        {
            pCrawl->children[index] = getNode();
        }
        pCrawl->letter=key[level];
        pCrawl = pCrawl->children[index];
    }

    if(pCrawl->counter!=0)
        pCrawl->counter++;
    else
        pCrawl->counter=1;

    printf("counter slow 3= %dn", pCrawl->counter);
}

// Returns non zero, if key presents in trie
int search(trie_t *pTrie, char key[])
{
    int level;
    int length = strlen(key);
    int index;
    trie_node_t *pCrawl;

    pCrawl = pTrie->root;

    for( level = 0; level < length; level++ )
    {
        index = CHAR_TO_INDEX(key[level]);

        if( !pCrawl->children[index] )
        {
            return 0;
        }

        pCrawl = pCrawl->children[index];
    }

    return (0 != pCrawl && pCrawl->counter);
}

// Driver
int main()
{
    // Input keys (use only 'a' through 'z' and lower case)
    char keys[][15] = {"THE", "THE", "BYE", "A", "THERE", "ANSWER", "ANSWER", "BBUWNTSMFK", "THE", "THEIR", "ANSWER", "THE", "LOL", "OMG", "WTF"};
    trie_t trie;

    char output[][20] = {"Not present in trie", "Present in trie"};

    initialize(&trie);

    // Construct trie
    for(int i = 0; i < ARRAY_SIZE(keys); i++)
    {
        insert(&trie, keys[i]);
    }

    char word[1000] = {0};
    setorder(&trie, word);

    // Search for different keys
    printf("%s --- %sn", "THE", output[search(&trie, "THE")] );
    printf("%s --- %sn", "THESE", output[search(&trie, "THESE")] );
    printf("%s --- %sn", "THEIR", output[search(&trie, "THEIR")] );
    printf("%s --- %sn", "THAW", output[search(&trie, "THAW")] );

    return 0;
}

Answer

I think you have some conceptual errors in your code. The basic trie insertion works, but counting the occurrences is wrong.

First your code as presented here does not compile, because you have mixed up trie_t and trie_node_t in setorder. Use trie_twhen you want to refer to the whole structure and tire_node_twhen you refer to a node in that structure and its children (and grandchildren, etc.) nodes. When you crawl the trie, you follow trie_node_t nodes, starting with the root of your trie_t.

Then, when you insert a node, you don’t need to distinguish bewteen the zero and non-zero case when assigning the counter, just

pCrawl->counter++;

will do. (If you think about it, you should see that incrementing zero yields one, so you’re just catching a alleged special case that doesn’t need to be treated specially. Your code is not wrong, just needlessly complicated.)

Okay, on to walking the trie and reporting the occurrences of the words. I’m not quite sure what’s going on in setorder, but again, it looks to complicated. Get rid of the auxiliary array counters. I think the bookkeeping of that array makes your program segfault. You can also get rid of the letter field of trie_node_t. All the information you need is in the trie nodes themselves and their connections. Especially, the letter is given by the path on which you reached this node from its parent. children[0] means it’s A, children[23] means it’s X.

Tries are a species of trees, and trees are best crawled down with recursion. So let’s write a recursive variant of setorder:

#define INDEX_TO_CHAR(IX) ('A' + IX)

void setorder_rec(trie_node_t *pCrawl, char *str, int n)
{
    if (pCrawl == NULL) return;

    if (pCrawl->counter) {
        printf("%.*s: %dn", n, str, pCrawl->counter);
    }

    for (int i = 0; i < ALPHABET_SIZE; i++) {
        str[n] = INDEX_TO_CHAR(i);
        setorder_rec(pCrawl->children[i], str, n + 1);
    }
}

And a frontend function that will define a temporary array and start the recursion at the root node:

void setorder(trie_t *pTrie)
{
    char tempword[40] = {0};

    setorder_rec(pTrie->root, tempword, 0);
}

The recursive function walks down the nodes, adds the appropriate letters to the string and prints the count when a word is found. The current level of the trie equals the current string length and both equal the current recursion depth, i.e. how many times the function has called itself.

The difference between inserting and searching is that here, we haven’t got a single, linear path through the trie given by a word, i.e. sequence of letters, but we have to walk down all existing children nodes.

In a non-recursive function we would have to walk the first path, retrack to all branching points, walk down a bit more, and so on. It would be very tedious to control all that crawling down and back. In the recursive function, crawling down corresponds to a call to setorder_rec and crawling back up corresponds to a return from setorder_rec.

Note that I have created a second macro, the inverse of CHAR_TO_INDEX. When searching the trie, we need to know the index from a char. When walking all branches unconditonally, we need to be able to reconstruct the chars of a word from the children indices.

If you don’t want to just print the counts but want to create an array, you can use a global index as in your example. But you’ll have to store the words themselves alongside the counts, too, maybe in a second array, maybe in a special struct.

One last point: Your program isn’t guarded against calculating wrong indices. If you add a word that isn’t all upper-case letters line CAN'T or LOS ANGELES you will get an access violation.