Author Topic: String Puzzle (Read 7218 times)

Sidoh · « **on:** December 12, 2007, 12:56:47 pm »

Given a string, detect the number of occurrences of two substrings: "fo" and "oo" in O(n) time that works without traversing the string more than once. For example, given the string

"foobarfoo"

The answer would be 4, since there are two occurrences of "fo" and two occurrences of "oo".

After you develop the algorithm, program it in your favorite language and post it!

I don't think this is a super difficult problem, so I won't give any more hints.

iago · « **Reply #1 on:** December 12, 2007, 03:04:38 pm »

Am I crazy, or is there no trick to this problem? It seems to me that it can be done with two linear searches.

Code: [Select]

int fo = 0;
int oo = 0;
for(i = 0; i < strlen(str) - 1; i++)
  if(str[i] == 'f' && str[i+1] == 'o')
   fo++;

for(i = 0; i < strlen(str) - 1; i++)
  if(str[i] == 'o' && str[i+1] == 'o')
   oo++;

As far as I can tell, that runs in O(n). Each of the loops runs 2n times, for a total of 4n. You drop the constant, and end up with O(n).

Am I missing something?

Sidoh · « **Reply #2 on:** December 12, 2007, 03:18:07 pm »

You're absolutely right. That's an O(n) solution. That's exactly the way the problem was phrased when it was given to me, but I think that the original problem had another condition: you can only pass through the loop once.

iago · « **Reply #3 on:** December 12, 2007, 04:26:18 pm »

That's still easy:

Code: [Select]

int fo = 0;
int oo = 0;
for(i = 0; i < strlen(str) - 1; i++)
{
  if(str[i] == 'f' && str[i+1] == 'o')
   fo++;
  if(str[i] == 'o' && str[i+1] == 'o')
   oo++;
}

Perhaps they meant doing it in "n" commands, or something? *shrug*

Sidoh · « **Reply #4 on:** December 12, 2007, 04:51:08 pm »

Haha, yeah. They actually meant something completely different. They wanted you to find all repeated pairs of the strings in O(n). I have a solution for that too, but your solution is fine, assuming you incorporate out-of-bounds checks, which I'm sure you thought of.

My solution for this problem was to use a state graph. That's the first thing I thought of (within seconds) and it was fairly trivial, so I didn't bother to think past that. They're both O(n), though. You're right; that problem had no trick to it.

Here's the real problem (sort of):

Find all pairs in the string that are repeated and output the number for each pair in O(n) time.

What I'm thinking is you're supposed to use extra space. Use a hash table or something similar, I guess. I have a solution with a graph, but I'm still not sure if it's O(n). I'll have to think about it more.

Blaze · « **Reply #5 on:** December 12, 2007, 09:17:48 pm »

Well, I'll give this a shot in PHP.

<?PHP
  $thatstring = "foobarfoo";  #The string we're reading
  $counter[];                 #The counter which keeps track of how many times a pattern has happened.  :)

  for ($i = 0; $i < strlen($thatstring) - 1;$i++)
  {
    $counter[substr($thatstring, $i,2)]++;
  }

  print_r($counter);

?>

Ouputs:

Quote

[fo] => 2
[oo] => 2
[ob] => 1
[ba] => 1
[ar] => 1
[rf] => 1

Woo?

Sidoh · « **Reply #6 on:** December 12, 2007, 09:35:57 pm »

Yeah, that's effectively the solution using a hash table (but avoiding all of the messiness that's involved with hash tables). There's some shifty stuff that goes on with arrays in PHP. The way they are constructed in the engine depends on a bunch of stuff, including the size of the array, what you're puting into it and what you're doing with it after it's constructed. When you use an array in that fashion, it's acting like a dictionary/hash table.

To quote the PHP manual:

Quote

An array in PHP is actually an ordered map. A map is a type that maps values to keys. This type is optimized in several ways, so you can use it as a real array, or a list (vector), hashtable (which is an implementation of a map), dictionary, collection, stack, queue and probably more. Because you can have another PHP array as a value, you can also quite easily simulate trees.

Part of the requirement of the original problem (this is a Microsoft interview question, apparently. tests if you understand space/time trade-offs, maybe?) is that you write it in C (tests if you understand data structures, I guess?).

What would you use as the hash function? Since it's a pretty simple value (the pair), you can pretty trivially concoct a perfect hash function (1:1).

The solution using the graph is also pretty interesting. I'm not completely sure that you can guarantee that it's O(n) without use of O(R^2) space, though (where R is the range of characters). Just sort of thinking out loud here -- I suppose you have to do that with the hash table too, though. Depending on how the hash table is implemented (in addition to [tex]\alpha[/tex], its load factor = the number of elements / the capacity), you could contrive a string that induces a bunch of collisions and make lookups in the hash table O(n), which would make the entire thing O(n^2).

Incidentally, one of my research advisor's past topics is a the DAWG (Directed Acyclic Word Graph). It's a really interesting data structure that allows you to do efficient searches for substrings within strings and is intended to be used in things like the human genome project (where you're searching through gigabytes of data for a string that's a tiny tiny fraction of that). If you have a string of length n and a substring of length m, it takes O(mn) time to build the DAWG, but then it requires O(m) time to search for any substring of that length. The solution I thought of for this problem involving a graph is loosely related to this data structure, but it's definitely not a DAWG.

iago · « **Reply #7 on:** December 12, 2007, 10:00:06 pm »

Quote from: Sidoh on December 12, 2007, 09:35:57 pm

this is a Microsoft interview question, apparently.

That explains why the originally specification was so lame.

Sidoh · « **Reply #8 on:** December 12, 2007, 10:10:03 pm »

Quote from: iago on December 12, 2007, 10:00:06 pm

Quote from: Sidoh on December 12, 2007, 09:35:57 pm
this is a Microsoft interview question, apparently.
That explains why the originally specification was so lame.

Man, I knew you'd say something! I said the same thing, though. hehe.

It was posted on some board that's designed for this sort of thing (technical interview questions in computer scienceish subjects). The problem was worded strangely. Sorry about all of the confusion. Here's the original problem:

Quote

I was asked that you have a string say, "foobarfoo", in which "fo" and "oo" are repeated twice. You have to find all such repeated pairs in O(n) time. The string can only have alphanumeric elements in it and code it in C.

iago · « **Reply #9 on:** December 12, 2007, 10:42:28 pm »

Quote from: Sidoh on December 12, 2007, 10:10:03 pm

Quote from: iago on December 12, 2007, 10:00:06 pm
Quote from: Sidoh on December 12, 2007, 09:35:57 pm
this is a Microsoft interview question, apparently.
That explains why the originally specification was so lame.

Man, I knew you'd say something! I said the same thing, though. hehe.

It was posted on some board that's designed for this sort of thing (technical interview questions in computer scienceish subjects). The problem was worded strangely. Sorry about all of the confusion. Here's the original problem:

Quote
I was asked that you have a string say, "foobarfoo", in which "fo" and "oo" are repeated twice. You have to find all such repeated pairs in O(n) time. The string can only have alphanumeric elements in it and code it in C.

Come on, of course I said something! Too easy!

But yeah, a sorta-hashtable could be used in C. I'm not positive if it would be O(n) still, because of scanning the hashtable, but I'd do something like this (note: untested, but seems like it ought to work):

Code: [Select]

int matches[256 * 256];
int i;

for(i = 0; i < strlen(str) - 1; i++)
 matches[(str[i] << 8) | str[i + 1]]++;

for(i = 0; i < sizeof(matches); i++) /* Sorta ruins O(n), but necessary? */
 if(matches[i] == 2)
  printf("Found a pair: %c%c\n", matches[i] & 0x0FF, matches[i] >> 8);

<edit> works, basically. Here is the full tested C code:

Code: [Select]

#include <stdio.h>
#define MAX (256 * 256)
int main()
{
    int matches[MAX];
    int i;
    char *str = "foobarfoo";

    memset(matches, 0, MAX * sizeof(int));

    for(i = 0; i < strlen(str) - 1; i++)
        matches[(str[i] << 8) | str[i + 1]]++;

    for(i = 0; i < MAX; i++) /* Sorta ruins O(n), but necessary? */
        if(matches[i] == 2)
            printf("Found a pair: %c%c\n", i >> 8, i & 0x0FF);

    return 0;
}

And the output:

ron@slayer:~$ ./a.out
Found a pair: fo
Found a pair: oo

Incidentally, this method sorts them, too.

Sidoh · « **Reply #10 on:** December 12, 2007, 11:02:17 pm »

Nice! I think your initial size is a bit big, though. You only have 36^2 or 62^2 possible pairs, depending on case-sensitivity.

So your hash function is the first character shifted by eight bits to the left + the second character? Yeah, that seems like it'd work just fine. The traversal of the hash table is really O(1) since there's nothing in the program that will change its running time. I don't think it's necessary, however. Just traverse the string again, look up each pair you encounter, output the pair and its count if it's > 1 and set the value = 0.

iago · « **Reply #11 on:** December 12, 2007, 11:07:55 pm »

Quote from: Sidoh on December 12, 2007, 11:02:17 pm

Nice! I think your initial size is a bit big, though. You only have 36^2 or 62^2 possible pairs, depending on case-sensitivity.

So your hash function is the first character shifted by eight bits to the left + the second character? Yeah, that seems like it'd work just fine. The traversal of the hash table is really O(1) since there's nothing in the program that will change its running time. I don't think it's necessary, however. Just traverse the string again, look up each pair you encounter, output the pair and its count if it's > 1 and set the value = 0.

Yeah, my program can technically do any binary data, so the buffer is much larger than necessary (65535 entries).

You're right that I could just process the string a second time, I hadn't even thought of that! Also, if I'm ok with finding any number or repetitions >1 (not just pairs), I can build it into the loop itself, too:

Code: [Select]

#include <stdio.h>
#define MAX (256 * 256)
int main()
{
    int matches[MAX];
    int i;
    int temp;
    char *str = "foobarfoojkilajkfi23jkljlkwef89ujklsafdj23 ssdafjklasdfjklsdfkjlsadf s kjsdf23jklsdflweu";

    memset(matches, 0, MAX * sizeof(int));

    for(i = 0; i < strlen(str) - 1; i++)
    {
        temp = (str[i] << 8) | str[i + 1];
        matches[temp]++;
        if(matches[temp] == 2)
            printf("Found a pair: '%c%c'\n", str[i], str[i + 1]);
    }

    return 0;
}

And output:

ron@slayer:~$ ./a.out
Found a pair: 'fo'
Found a pair: 'oo'
Found a pair: 'jk'
Found a pair: 'kl'
Found a pair: '23'
Found a pair: 'af'
Found a pair: 'la'
Found a pair: 'sd'
Found a pair: 'fj'
Found a pair: 'ls'
Found a pair: 'df'
Found a pair: 'jl'
Found a pair: 'sa'
Found a pair: ' s'
Found a pair: 'kj'
Found a pair: '3j'
Found a pair: 'we

Sidoh · « **Reply #12 on:** December 12, 2007, 11:14:48 pm »

Awesome! I'm going to whip up something in C. I'll post my solution when I'm done. I'd really like to give the graph solution a go, but I think it'd be sort of nasty in C.

Oh, forgot to mention: it'll handle any binary data that can be represented in <= 8 bits in all cases, right? I guess is a safe assumption if you're using UTF-8, lol (which gcc does?). After that, you start to have collisions.

iago · « **Reply #13 on:** December 12, 2007, 11:50:37 pm »

Well, mine won't really work on wide characters. To hell with those Asians, though!

In any case, binary data tends to be represented in bytes anyways, so it'll work as long as you're looking for pairs of single-byte sequences.

Sidoh · « **Reply #14 on:** December 13, 2007, 12:00:23 am »

Quote from: iago on December 12, 2007, 11:50:37 pm

Well, mine won't really work on wide characters. To hell with those Asians, though!

In any case, binary data tends to be represented in bytes anyways, so it'll work as long as you're looking for pairs of single-byte sequences.

Hahaha.

Yep, that's exactly what I meant.

Here's my solution:

Code: [Select]

chris@shadow:~/src/c/pairs$ cat pairs.c
#include <stdio.h>

#define HASH_MAX 1296

int hash_digit(char d);

int main()
{
        int pairs[HASH_MAX];
        int i;
        memset(pairs, 0, HASH_MAX * sizeof(int));

        char * string;
        printf("Enter a string : ");
        scanf("%s", string);

        for(i = 0; i < strlen(string) - 1; i++)
        {
                int hash = ((36)*hash_digit(string[i]) + hash_digit(string[i + 1]));
                pairs[hash]++;
        }

        for(i = 0; i < strlen(string) - 1; i++)
        {
                int hash = ((36)*hash_digit(string[i]) + hash_digit(string[i + 1]));

                if(pairs[hash] > 1)
                {
                        printf("Found pair: %c%c, occurred %d times.\n", string[i], string[i+1], pairs[hash]);
                        pairs[hash] = 0;
                }
        }
}

int hash_digit(char d)
{
        if(d >= 0x61 && d <= 0x7A)
        {
                return (d - 0x60);
        }
        else if(d >= 0x30 && d <= 0x39)
        {
                return (d - 0x29);
        }
}

Output:

Code: [Select]

chris@shadow:~/src/c/pairs$ ./pairs 
Enter a string : aloaiwleyrkasdjfhaksjcvbwieurgfasiudfyaisudcbaksdjbkjahsvdwieuyriasudyaksdjbflkasjdfbqjwhkeriuyasdfjkahsgdfkjhagsdrfwe
Found pair: ai, occurred 2 times.
Found pair: yr, occurred 2 times.
Found pair: ka, occurred 3 times.
Found pair: as, occurred 5 times.
Found pair: sd, occurred 5 times.
Found pair: dj, occurred 3 times.
Found pair: ha, occurred 2 times.
Found pair: ak, occurred 3 times.
Found pair: ks, occurred 3 times.
Found pair: sj, occurred 2 times.
Found pair: wi, occurred 2 times.
Found pair: ie, occurred 2 times.
Found pair: eu, occurred 2 times.
Found pair: iu, occurred 2 times.
Found pair: ud, occurred 3 times.
Found pair: df, occurred 4 times.
Found pair: ya, occurred 3 times.
Found pair: su, occurred 2 times.
Found pair: jb, occurred 2 times.
Found pair: kj, occurred 2 times.
Found pair: ah, occurred 2 times.
Found pair: hs, occurred 2 times.
Found pair: uy, occurred 2 times.
Found pair: ri, occurred 2 times.

Sidoh · « **Reply #15 on:** December 13, 2007, 03:40:23 am »

I guess it's also worth noting that this could be done just as easily with a two dimensional array, haha.

You could cut down on the space usage a bunch if you used buckets for the nodes in the hash table, but it would increase the running time a bit (it'd still be O(n), though).

iago · « **Reply #16 on:** December 13, 2007, 09:32:32 am »

Personally, I like the short, simple solution. Mine works perfectly in the least amount of code!

Sidoh · « **Reply #17 on:** December 13, 2007, 12:14:47 pm »

Quote from: iago on December 13, 2007, 09:32:32 am

Personally, I like the short, simple solution. Mine works perfectly in the least amount of code!

I really did think about using a hash code similar to yours, but I decided against it since the problem specification very clearly said the input was alphanumeric. I wanted to minimize the amount of space used. I think some other solutions I saw (after doing mine, of course) were using buckets in their hash tables and they had size ~100. That's definitely not very safe, though.

By the way, you're supposed to output the number of times the pairs occur.

iago · « **Reply #18 on:** December 13, 2007, 12:29:14 pm »

Quote from: Sidoh on December 13, 2007, 12:14:47 pm

By the way, you're supposed to output the number of times the pairs occur.

I disagree based on the problem's wording:

Quote

I was asked that you have a string say, "foobarfoo", in which "fo" and "oo" are repeated twice. You have to find all such repeated pairs in O(n) time. The string can only have alphanumeric elements in it and code it in C.

It says to find all cases where they're repeated twice, then says "all such repeated pairs" -- I took that to mean that it wants cases where the pair of letters is repeated twice. It says nothing about outputting the number of times, but I took the "all such pairs" to mean that it only wants cases where there are two.

Sidoh · « **Reply #19 on:** December 19, 2007, 08:08:13 pm »

This was posted on another forum where the premise is interview questions. I think I remember the poster clarifying that this was one of the requirements of the solution. It doesn't matter, though.

Here's a paper on DAWGs. It's pretty interesting.

http://www.primes.colostate.edu/Workshops/Networks/McConnell.pdf

News:

Author Topic: String Puzzle (Read 7218 times)