Tag Archive for 'code'

Knuth-Morris-Pratt String Search Algorithm

The KMP algorithm searches for the occurrence of a word W in a sentence S while employing the observation that when a mismatch occurs W contains sufficient information to determine the start position of the next search in S.

Say W = abcd and S = abcbabcd. Now we compare W[0] with S[0], W[1] with S[1], W[2] with S[2] and they all match. But W[3] does not match S[3]. The brute force algorithm would have started the next search by comparing W[0] with S[1]. But a simple observation tells us that it would be fruitless as there is no W[0] till S[3]. We would have done better by starting the next search from S[4]. And this is exactly what the KMP tries to achieve.

KMP requires a pre-processing of the word W. We create a Partial Match Table by iterating over all the letters of W in such a way that in case of a mismatch we can quickly figure out the next start position by looking at the table. The wiki article on KMP describes with an example how to create the table:

KMP – Partial Match Table

The partial match table can be created in O(n) time using O(n) space, where n is the length of the word W. Once we have the table we can iterate over each element of S consulting the table every time a mismatch occurs. You may want to read up the algorithm with examples from

KMP – Algorithm With Examples – Wikipedia

The search happens in O(k) where k is the length of the sentence S. Thus the algorithm runs in O(n + k).

Below is my java implementation of the same:

/**
* Perform the Knuth-Morris-Pratt string search. The algorithm first pre-processes the word to create a Partial Match
* Table. The table is creatd in O(n) where n is the length of the word. After the pre-processing, the actual search can
* be performed in O(k) where k is the size of the sentence.
*/
public class KMPStringSearch {
/**
* Searches for all occurances of the word in the sentence. Runs in O(n+k) where n is the word length and k is the
* sentence length.
*
* @param word The word that is being searched
* @param sentence The collections of word over which the search happens.
* @return The list of starting indices of the matched word in the sentence. Empty list in case of no match.
*/
public List<Integer> searchString(final String word, final String sentence) {
final List<Integer> matchedIndices = new ArrayList<>();

final int sentenceLength = sentence.length();
final int wordLength = word.length();
int beginMatch = 0; // the starting position in sentence from which the match started
int idxWord = 0; // the index of the character of the word that is being compared to a character in string
final List<Integer> partialTable = createPartialMatchTable(word);
while (beginMatch + idxWord < sentenceLength)
if (word.charAt(idxWord) == sentence.charAt(beginMatch + idxWord)) {
// the characters have matched
if (idxWord == wordLength - 1) {
// the word is complete. we have a match.
matchedIndices.add(beginMatch);
// restart the search
beginMatch = beginMatch + idxWord - partialTable.get(idxWord);
if (partialTable.get(idxWord) > -1) idxWord = partialTable.get(idxWord);
else idxWord = 0;
} else idxWord++;
} else {
// mismatch. restart the search.
beginMatch = beginMatch + idxWord - partialTable.get(idxWord);
if (partialTable.get(idxWord) > -1) idxWord = partialTable.get(idxWord);
else idxWord = 0;
}

return Collections.unmodifiableList(matchedIndices);
}

/**
* Creates the Partial Match Table for the word. Runs in O(n) where n is the length of the word.
*
* @param word The word whose Partial Match Table is required.
* @return The table as a list of integers.
*/
public List<Integer> createPartialMatchTable(final String word) {
if (StringUtils.isBlank(word)) return Collections.EMPTY_LIST;

final int length = word.length();
final List<Integer> partialTable = new ArrayList<>(length + 1);
partialTable.add(-1);
partialTable.add(0);

final char firstChar = word.charAt(0);
for (int idx = 1; idx < word.length(); idx++) {
final int prevVal = partialTable.get(idx);
if (prevVal == 0) {
if (word.charAt(idx) == firstChar) partialTable.add(1);
else partialTable.add(0);
} else if (word.charAt(idx) == word.charAt(prevVal)) partialTable.add(prevVal + 1);
else partialTable.add(0);
}

return Collections.unmodifiableList(partialTable);
}
}

Popularity: 3% [?]

WTF Code: Hardcode URL When Cleaning

At Amazon it’s a Service Oriented Architecture that everyone follows. We use the PAAPI (Product Advertising API) to create links to products in various marketplaces. Both PAAPI and us use the tag parameter to identify user who sent the request. So we always clean the urls returned by PAAPI to strip off tag. 

I looked around the code and found that we have a method which takes in a list of parameters to be removed and cleanses the url. Good. I had to integrate www.javari.jp into our widget server. When testing I found that all the urls were defaulting to www.amazon.co.jp instead of javari! Now this was baffling because clearly the url cleaner did not such thing as change the domain. I went deeper into the code and what I found surprised me!

For some reason, unknown to anyone in the team, someone had decided to comment out the piece of code calling that url cleaner method and instead chose to write his own cleaner. This is what he did:

As I understand from the above code, all that he wanted to do was strip out the get parameters. But that can very easily be done by using the URL class of java. Given a valid url, URL has all the logic to parse it and spit out various segments. My modified code was:

Simple and easy as it can get!

Popularity: 1% [?]

Hive: User Defined Functions

Writing user defined functions in perl for hive are so easy! Say you have a table in hive that has tab separated columns. In your script you will expect to receive the same tab separated columns from STDIN, each line separated with \n. You can do whatever processing you want to. Then just print back everything to STDOUT in the same format – columns separated by tab and lines by \n.

For example, consider you have a table with tab separated columns full_name, address, email_id. You want to write a function to parse the full_name and extract out the first and the last name. Your script will be:

And you can call this script from hive like this:

In fact, you could form the below query to select all the people whose last name is ‘Singh’.

Popularity: 2% [?]

Recursively Remove .svn Folders

I was moving my svn projects to git repository. I needed to get rid of all the .svn folders recursively. Following is the script that I used:

Popularity: 1% [?]

Project Euler : Frequency Analysis For Message Decryption

Problem 59 of Project Euler brought back memories of Sherlock Holme’s Dancing Men. Holmes had to decrypt a series of messages – the characters of which were stick men figures in varying dancing positions. He computed the frequencies of each dancing figure. Then since ‘e’ is the most used letter in English, he replaced the most frequent dancing figure with ‘e’.

I did the same thing with a difference. The given text in the problem did not have any spaces. This meant that even the spaces had been encrypted. And the most frequent number occurred almost twice as much as the next one. So I replaced that with space. That’s it. The problem was solved.

The password used to encrypt the text is – ‘god’ and the following code decrypts the text and finds the answer in the count variable.


    private List<Distribution<Integer>> parseInputFile(String fileName) throws FileNotFoundException {
        List<Distribution<Integer>> frequencies = Lists.newArrayList(new Distribution<Integer>(), new Distribution<Integer>(), new Distribution<Integer>());

        Scanner scanner = new Scanner(new File(fileName)).useDelimiter(",");
        int count = 0;
        for(int i = 0;  scanner.hasNext(); i++) {
            int num = scanner.nextInt();
            frequencies.get(i % 3).add(num);

            if(i % 3 ==0) num ^= 103;
            else if(i%3 == 1) num ^= 111;
            else if(i%3 == 2) num ^= 100;
            count += num;
            System.out.printf("%c", num);
        }

        System.out.println("Sum = " + count);

        return frequencies;
    }

Popularity: 5% [?]

Project Euler : n Digit Integers That Are Also nth Power

Problem number 63 of Project Euler gives the expression a = b^c and imposes the condition that the number of digits in a is equal to c. An example would be 16807 = 7^5. ab and c are all positive integers. The problem asks one to find the number of such possibilities.

Solving this problem made me happy. So, I’ll go into the details.

Given the problem, the first thing that comes to mind is the special relationship between the log of a number and the number of digits. The hint is to use logarithms to simplify the problem.

a = b^c

\Rightarrow \log(a) = c \log(b)

But \log(a) = c - f, where f \in (0, 1]

\Rightarrow c - f = c \log(b)

\Rightarrow c \log(\frac{10}{b}) = f

\Rightarrow 0 < \log(\frac{10}{b}) \leq \frac{1}{c}, dividing by c \in Z^+

That’s it! Our problem has been solved. I hope you have already noted that b \in [1, 9]. This means that we can precompute the values of \log(\frac{10}{b}) and store the results in an array. Then for all c \geq 1, just count how many of the precomputed values are smaller than or equal to \frac{1}{c}. The iteration completes when the smallest precomputed value is greater than \frac{1}{c}.

Below is the piece of code I wrote in Java to solve the task.

    public int getCount() {
        m_precomputedValues = new double[9];
        for(int i = 1; i < 10; i++) {
            m_precomputedValues[i-1] = Math.log10(10.0 / i);
        }

        int count = 0;

        int numberOfDigits = 1;
        double val = 1.0 / numberOfDigits;
        while (val > m_precomputedValues[8]) {
            for(int idx = 8; idx >= 0 &amp;amp;&amp;amp; m_precomputedValues[idx] <= val; idx--) count++;
            val = 1.0 / ++numberOfDigits;
        }

        return count;
    }

Popularity: 4% [?]

Java ArrayList Is Actually Just An Array !

I was reading The Art of Computer Programming Vol 1, the chapter about arrays and list. Knuth describes the differences and mentions how it’s very easy to add a new element to a list in a constant time if we have the node where the insertion needs to happen. Same for deletion. That is when I realized I did not know how the ArrayList implementation handles it. I did a code walk through of that class – something Prasun had advised me long ago but I always ignored.

I always knew that ArrayList used array for implementation, which is how it can guarantee the O(1) element access time. But I did not know how the other methods worked. For instance, one need not provide any initial size of the ArrayList. How does Java handle it? I remember Prasun telling me that the initial size in such cases is 10. But then what happens in the case of an overflow? What is the new space that gets assigned to the list?

But I was more interested in the add(index, element) and the remove(index) methods. And they are not constant time! They are O(n) operations. Say we want to remove(6). Java actually copies over all the elements from 7..n to indexes 6..n-1 and then sets the nth element as null. This is so not what we learn as deletion in list where it was just a matter of changing a couple of pointers (the next and prev ones).

Popularity: 2% [?]

Project Euler : Count Of Lychrel Numbers Below 10000

Problem 55 of Project Euler explains what Lychrel numbers are and asks one to find the number of those below 10000. The only sticking point in the whole procedure is that after a few iterations the number become so large that BigInteger needs to be used. But otherwise, just follow the wording of the statement and the problem will be solved.

Here’s the outermost loop which does the counting:

    private static final int ARRAY_SIZE = 1000000;
    private static final BigInteger BIG_ARRAY_SIZE = BigInteger.valueOf(ARRAY_SIZE);

    public static int countOfLychrelNumbers(int upperLimit) {
        int count = 0;

        State[] lychrelNumbers = ArrayUtils.createArray(ARRAY_SIZE, State.NOT_VERIFIED, new State[0]);
        for (int i = 0; i <= upperLimit; i++) {
            if (lychrelNumbers[i] == State.NON_LYCHREL) continue;
            if (lychrelNumbers[i] == State.LYCHREL || testForLychrelProperty(lychrelNumbers, i)) {
                count++;
            }
        }

        return count;
    }

    private enum State {
        LYCHREL, NON_LYCHREL, NOT_VERIFIED
    }

It’s in the testForLychrelProperty method that all the caching logic comes in. It makes sense to cache the results. However, I have imposed a cache limit as well. Here’s the messier code with caching logic:

    private static boolean testForLychrelProperty(State[] lychrelNumbers, int number) {
        int iterCount = 1;
        State state = State.NOT_VERIFIED;
        BigInteger num = BigInteger.valueOf(number);
        List<BigInteger> found = new ArrayList<BigInteger>();
        found.add(num);
        while (iterCount < 50) {
            BigInteger reverse = NumberUtil.reverse(num);
            BigInteger sum = reverse.add(num);
            if (!num.remainder(BigInteger.TEN).equals(BigInteger.ZERO)) found.add(reverse);
            state = sum.compareTo(BIG_ARRAY_SIZE) < 0 ? lychrelNumbers[sum.intValue()] : State.NOT_VERIFIED;
            if (state == State.LYCHREL || state == State.NON_LYCHREL) {
                break;
            }
            if (PalindromeUtil.checkIfPalindrome(sum.toString())) {
                state = State.NON_LYCHREL;
                break;
            }
            num = sum;
            found.add(sum);
            iterCount++;
        }

        if (state == State.NOT_VERIFIED) state = State.LYCHREL;

        setState(lychrelNumbers, found, state);
        return state == State.LYCHREL;
    }

The setState method simply updates the cache with the latest result. Here is it’s implementation:

    private static void setState(State[] lychrelNumbers, List<BigInteger> numbers, State state) {
        for (BigInteger number : numbers) {
            if (number.compareTo(BIG_ARRAY_SIZE) > 0) return;
            lychrelNumbers[number.intValue()] = state;
        }
    }

All in all, the code runs in some 180ms on my laptop.

Popularity: 6% [?]

Project Euler : Largest Sum Of Digits In Exponentiation

The problem 56 of project euler requires one to sum the digits of the exponentiation a^b, where 1<= a,b <= 100, and find the largest sum.

Well, let’s just say I did it by proper brute force. Here’s my java code:

    public static int largestSumOfDigitsInExponentiationNew(int baseMax, int exponentMax) {
        int largest = 0;

        for (int i = 1; i <= baseMax; i++) {
            BigInteger num = BigInteger.valueOf(i);
            BigInteger prev = num;
            largest = Problem_56.checkLargestSum(prev, largest);
            for (int j = 2; j <= exponentMax; j++) {
                prev = prev.multiply(num);
                largest = Problem_56.checkLargestSum(prev, largest);
            }
        }

        return largest;
    }

    private static int checkLargestSum(BigInteger number, int curentLargest) {
        int sum = NumberUtil.getSumOfDigits(number);
        return curentLargest < sum ? sum : curentLargest;
    }

Popularity: 5% [?]

Project Euler : Count Integral Shortest Paths In All Cuboid Smaller Than Certain Dimension

Problem 86 of Project Euler is the next problem I took up. Given a cuboid with integral dimensions, the shortest distance for an ant to crawl from one corner to diagonally opposite corner need not be an integral number. The problem statement required one to consider all the possible cuboid up to a maximum size of M x M x M and report the number of cuboid that had integral value for the shortest path.

The question has a huge hint hidden within its wording. It goes on to provide two examples – for M = 99 the count is 1975 and for M  = 100 the count is 2060! Solve the problem incrementally. This problem can be disintegrated into 3 smaller problems.

Firstly, we need to know the calculation of shortest path in a cuboid as an ant would crawl. To do this, explode the 3D figure into a 2D figure keeping track of the corners. It’s a pretty simple matter to calculate the distance now. There’s one thing to be remembered though – the explosion can be done in 3 different ways. So we have potentially 3 candidates for the shortest path.

Given the dimension as a X b X c, the squared distance is one of (a+b)^{2} + c^{2}(b+c)^{2} + a^{2} or (c+a)^{2} + b^{2}. We shall see later that in our case we do not need to perform 3 calculations as the choice becomes quite obvious.

Next  we tackle the problem of induction. Say we know that the count for the problem size  M x M x M is some p. How do we calculate the value for the size (M+1) x (M+1) x (M+1)? If you were to imagine two cuboid of dimensions M and (M+1) and enumerate all the cuboid not covered by the one of dimension M, you would find them to be of the form:

i X j X (M+1)

where i = 1 to M + 1

and j = i to M + 1

There lies our solution! We setup a nested for loop and calculate the shortest distances for each of these extra cuboid. It should now be pretty obvious to notice that for the computation of shortest distance we only need to consider the case (i+j)^{2} + (M+1)^{2}. Thus we can go from M to M+1.

I hope you have noticed some redundancy in the above loop setup. Every time we calculate (i+j). Consider the two cases:

i = 1, j = 3

i = 2, j = 2

In both these cases the value of (i+j) is the same. We are unnecessarily computing values already computed. Being a little cleverer we can improve the current O(n^{2}) to O(n).

Instead of using a nested loop, we use a single for-loop from the range of 2 to 2*(M+1) as those are the range for the sum of i and j. But this introduces another problem. Say for a particular sum we get that the shortest path is an integer. We need to know the number of ways in which the sum can be decomposed into i and j. I setup a single line function to calculate this value and I was done!

The code ran in 258ms on my Lenovo T410 laptop. Below is my function to move from M to M+1. To solve the Project Euler problem, I just needed to call this function in a loop, always updating the memoized value.

 

    /**
     * Count the number of integral shortest paths in all cuboids upto a size of size X size X size
     *
     * @param size the maximum dimensions of the cuboid
     * @param memoizedValue the value for a cuboid with max dimension size - 1
     * @return the value for a cuboid of max dimension size X size X size
     */
    public int countIntegralShortDistances(int size, int memoizedValue) {
        double sizeSquared = Math.pow(size, 2);
        for (int i = 2; i <= 2 * size; i++)
            if (NumberUtil.isInteger(Math.sqrt(Math.pow(i, 2) + sizeSquared))) memoizedValue += NumberUtil
                    .countSummations(i, size);
        return memoizedValue;
    }

Popularity: 12% [?]