LCS stands for Longest Common Subsequence. It is a classic computer science problem, the basis of file comparison programs such as diff, and it has applications in bioinformatics. If you are studying LCS for the first time, I would recommend reading it from CLRS (Introduction to Algorithms). Charles Leiserson (one of the authors of the book) himself explains it here, and I must confess he explains it pretty well in this video. You may also visit the corresponding wiki article. It's a pretty long article, but they have covered everything very neatly.
LCS Problem Statement: Given two sequences, find the length of the longest subsequence present in both of them. A subsequence is a sequence that appears in the same relative order, but is not necessarily contiguous. For example, "abc", "abg", "bdf", "aeg", "acefg", etc. are subsequences of "abcdefg". So a string of length n has 2^n different possible subsequences.
Examples:
LCS for input Sequences “ABCDGH” and “AEDFHR” is “ADH” of length 3.
LCS for input Sequences “AGGTAB” and “GXTXAYB” is “GTAB” of length 4.
Still not clear about the problem formulation or the algorithm? Look at the animations here.
Here is a code of mine that finds the LCS of two strings given as input.
Recursive code:
[sourcecode language="cpp"]
/* return the larger of two integers */
int max(int a, int b)
{
    if (a < b) return b;
    return a;
}

/* LCS length of the strings x and y */
int recursive_lcs_length(char *x, char *y)
{
    if (*x == '\0' || *y == '\0')      /* one string is exhausted */
        return 0;
    else if (*x == *y)                 /* first characters match */
        return 1 + recursive_lcs_length(x+1, y+1);
    else                               /* skip a character from either string */
        return max(recursive_lcs_length(x+1, y), recursive_lcs_length(x, y+1));
}
[/sourcecode]
The recursive code takes O(2^n) time in the worst case: when the two strings have no matching characters, the last line is always executed and the recursion branches twice at every step. It's very inefficient. Let's now see the DP code.
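For instance, here is a minimal driver (my own addition, assuming the max() and recursive_lcs_length() functions from the block above) that checks the first example; it should print 3.
[sourcecode language="cpp"]
#include <stdio.h>

/* max() and recursive_lcs_length() are assumed from the recursive code above */
int recursive_lcs_length(char *x, char *y);

int main()
{
    char a[] = "ABCDGH", b[] = "AEDFHR";
    /* the LCS is "ADH", so the expected output is 3 */
    printf("%d\n", recursive_lcs_length(a, b));
    return 0;
}
[/sourcecode]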
Dynamic Programming Code (Bottom-up):
[sourcecode language="cpp"]
#include <stdio.h>
#include <string.h>
#define MAX 100

char X[MAX], Y[MAX];
int i, j, m, n;
int c[MAX][MAX];   /* c[i][j] = LCS length of X[0..i-1] and Y[0..j-1] */
int b[MAX][MAX];   /* direction table: 1 = diagonal, 2 = up, 3 = left */

int LCSlength()
{
    m = strlen(X);
    n = strlen(Y);
    /* base cases: the LCS with an empty prefix is empty */
    for (i = 0; i <= m; i++) c[i][0] = 0;
    for (j = 0; j <= n; j++) c[0][j] = 0;
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            if (X[i-1] == Y[j-1]) {              /* characters match: extend the diagonal */
                c[i][j] = c[i-1][j-1] + 1;
                b[i][j] = 1;
            }
            else if (c[i-1][j] >= c[i][j-1]) {   /* drop a character of X */
                c[i][j] = c[i-1][j];
                b[i][j] = 2;
            }
            else {                               /* drop a character of Y */
                c[i][j] = c[i][j-1];
                b[i][j] = 3;
            }
        }
    return c[m][n];
}

/* follow the direction table backwards and print the LCS left to right */
void printLCS(int i, int j)
{
    if (i == 0 || j == 0) return;
    if (b[i][j] == 1) {
        printLCS(i-1, j-1);
        printf("%c", X[i-1]);
    }
    else if (b[i][j] == 2)
        printLCS(i-1, j);
    else
        printLCS(i, j-1);
}

int main()
{
    /* read pairs of lines until EOF; fgets replaces the unsafe gets */
    while (fgets(X, MAX, stdin) != NULL && fgets(Y, MAX, stdin) != NULL) {
        X[strcspn(X, "\n")] = '\0';
        Y[strcspn(Y, "\n")] = '\0';
        printf("LCS length -> %d\n", LCSlength());
        printLCS(m, n);
        printf("\n");
    }
    return 0;
}
[/sourcecode]
The DP code above moves bottom-up and solves all the subproblems. Sometimes it may not be required to solve all the subproblems but only a few of them; that is what the memoization technique does. Look at the code using memoization below:
Memoization Technique (Top-down):
[sourcecode language="cpp"]
#include <string.h>
#define MAX 100

/* shared memo table; -1 marks an entry that has not been computed yet */
int memo[MAX][MAX];

int max(int a, int b)
{
    if (a < b) return b;
    return a;
}

/* LCS length of the prefixes x[0..n-1] and y[0..m-1] */
int lcs_memo(char *x, int n, char *y, int m)
{
    int result;
    /* base cases */
    if (n == 0 || m == 0) return 0;
    /* see if already calculated */
    if (memo[n][m] != -1) return memo[n][m];
    /* if not calculated, calculate now */
    if (x[n-1] == y[m-1])
        result = 1 + lcs_memo(x, n-1, y, m-1);
    else
        result = max(lcs_memo(x, n-1, y, m), lcs_memo(x, n, y, m-1));
    /* save result */
    memo[n][m] = result;
    return result;
}

/* wrapper: clear the memo table once, then solve top-down */
int lcs_memo_length(char *x, char *y)
{
    memset(memo, -1, sizeof(memo));
    return lcs_memo(x, strlen(x), y, strlen(y));
}
[/sourcecode]
The LCS code above (both the DP and memoization versions) takes O(mn) time and O(mn) space. It is possible to write a slightly more efficient code for LCS: the time complexity stays the same (or is perhaps a little higher in terms of constants), but the space can be reduced to linear. Check the code here. Note that the LCS is not unique; if you want to find all the LCSs, see the pseudocode at wiki. Lastly, the LCS can also be found using the edit distance algorithm by making the cost of a replacement greater than that of an insertion plus a deletion.
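In case the linked code is unavailable, here is a minimal sketch of the linear-space idea (my own illustration, not the code from the link): row i of the DP table depends only on row i-1, so two rows suffice. It returns only the length; printing the LCS itself in linear space needs a more involved divide-and-conquer trick.
[sourcecode language="cpp"]
#include <string.h>
#define MAX 100   /* same length limit as the earlier blocks */

/* LCS length of x and y using only two rows of the DP table */
int lcs_length_linear_space(const char *x, const char *y)
{
    int m = strlen(x), n = strlen(y), i, j;
    int prev[MAX + 1], curr[MAX + 1];      /* prev = row i-1, curr = row i */

    for (j = 0; j <= n; j++) prev[j] = 0;  /* LCS against an empty prefix of x */
    for (i = 1; i <= m; i++) {
        curr[0] = 0;                       /* LCS against an empty prefix of y */
        for (j = 1; j <= n; j++) {
            if (x[i-1] == y[j-1])
                curr[j] = prev[j-1] + 1;
            else
                curr[j] = (prev[j] >= curr[j-1]) ? prev[j] : curr[j-1];
        }
        memcpy(prev, curr, sizeof(prev));  /* current row becomes the previous row */
    }
    return prev[n];
}
[/sourcecode]
Keeping the shorter of the two strings along the row direction brings the extra space down to O(min(m, n)).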
LONGEST INCREASING SUBSEQUENCE
LIS, I would say, is a cousin of LCS. It stands for Longest Increasing Subsequence. The problem statement is as follows:
Given a sequence of numbers, find the length of the longest subsequence whose elements are in strictly increasing order.
This wiki article explains the problem statement pretty nicely. An interesting way of finding the LIS is to use the LCS algorithm:
- Make a sorted copy of the given sequence (say A), denoted as B. This would take O(n log n) time.
- Use LCS on A and B, which takes O(n^2) time (a rough sketch of this reduction follows the list).
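Here is that sketch (my own illustration; the function name lis_via_lcs is hypothetical): sort a copy of the sequence, drop duplicates so that the target is strictly increasing, and take the LCS length of the original against the sorted copy.
[sourcecode language="cpp"]
#include <algorithm>
#include <vector>
using namespace std;

/* LIS length of A computed as LCS(A, sorted-and-deduplicated copy of A) */
int lis_via_lcs(const vector<int> &A)
{
    vector<int> B(A);
    sort(B.begin(), B.end());                      /* O(n log n) */
    B.erase(unique(B.begin(), B.end()), B.end());  /* duplicates removed for a strict LIS */

    int m = A.size(), n = B.size();
    vector< vector<int> > c(m + 1, vector<int>(n + 1, 0));
    /* plain O(m*n) bottom-up LCS length, as in the DP code above */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            if (A[i-1] == B[j-1])
                c[i][j] = c[i-1][j-1] + 1;
            else
                c[i][j] = max(c[i-1][j], c[i][j-1]);
        }
    return c[m][n];
}
[/sourcecode]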
Alternatively, you may solve LIS without using the LCS algorithm. The straightforward DP approach is explained nicely by the Algorithmist. The LIS problem comes in two different flavours:
- Find the length of the LIS,
- Print the LIS.
If you need only the length, that's the easy part; the second part is a little trickier. Also, the LIS implementation can be done in O(n^2) or O(n log n) time. Here is an O(n log n) code to find the length of the LIS.
[sourcecode language="cpp"]
#include<iostream>
#include<set>
#include<vector>
using namespace std;
/* length of the longest strictly increasing subsequence of A */
int LIS(vector<int> A)
{
    int N = A.size(), i;
    set<int> s;                      /* holds the smallest possible tail of an
                                        increasing subsequence of every length */
    set<int>::iterator k;
    for (i = 0; i < N; i++)
    {
        if (s.insert(A[i]).second)   /* inserted only if A[i] is not already present */
        {
            k = s.find(A[i]);
            k++;
            if (k != s.end())        /* A[i] replaces its successor as a smaller tail */
                s.erase(k);
        }
    }
    return s.size();                 /* the set size equals the LIS length */
}
[/sourcecode]
The above code runs in O(n log k) time, where k is the length of the LIS found. To get the LIS itself, we need to maintain a "previous" array which stores the index of the previous element of the LIS. Here's a C++ implementation; it returns the LIS as a vector.
[sourcecode language="cpp"]
#include <iostream>
#include <algorithm>   /* for reverse */
#include <map>
#include <vector>
using namespace std;
typedef pair<int,int> PI;

/* return the longest strictly increasing subsequence of A itself */
vector<int> LIS(vector<int> A)
{
    int N = A.size(), i, j;
    vector<int> pre(N, -1), res;    /* pre[i] = index of the previous LIS element */
    map<int,int> m;                 /* value -> index of the smallest tail of each length */
    map<int,int>::iterator k, l;
    for (i = 0; i < N; i++)
    {
        if (m.insert(PI(A[i], i)).second)
        {
            k = m.find(A[i]);
            l = k;
            k++;
            if (l == m.begin())
                pre[i] = -1;        /* A[i] starts a subsequence of length 1 */
            else
            {
                l--;
                pre[i] = l->second; /* predecessor is the next smaller tail */
            }
            if (k != m.end())
                m.erase(k);         /* A[i] replaces its successor as a smaller tail */
        }
    }
    if (m.empty()) return res;      /* empty input */
    k = m.end();
    k--;
    j = k->second;                  /* index of the last element of the LIS */
    while (j != -1)
    {
        res.push_back(A[j]);
        j = pre[j];
    }
    reverse(res.begin(), res.end());
    return res;
}
[/sourcecode]
If you are not very familiar with the STL (in C++), you may go for the O(n^2) solution. Here is a C code I generally use.
[sourcecode language="cpp"]
#include <stdio.h>

/* print a longest strictly increasing subsequence of str[0..n-1] in O(n^2) */
void lis(int str[], int n)
{
    int len[n], prev[n], i, j, max = -1, k, b[n], a = 0;
    for (i = 0; i < n; i++) {
        len[i] = 1;          /* every element alone is an increasing subsequence */
        prev[i] = -1;
    }
    for (i = 1; i < n; i++)
        for (j = 0; j < i; j++) {
            if (str[i] > str[j] && len[i] < len[j] + 1) {
                len[i] = len[j] + 1;   /* extend the best subsequence ending at j */
                prev[i] = j;
            }
        }
    for (i = 0; i < n; i++)            /* find where the longest subsequence ends */
        if (len[i] > max) {
            max = len[i];
            k = i;
        }
    b[a] = str[k];                     /* walk back through prev[] to collect it */
    for (i = prev[k]; i >= 0; i = prev[k]) {
        b[++a] = str[i];
        k = i;
    }
    for (i = max - 1; i >= 0; i--)     /* b[] holds the LIS in reverse order */
        printf("%d ", b[i]);
    printf("\n");
}

int main()
{
    int str[] = {5, 9, 4};
    lis(str, sizeof(str) / sizeof(str[0]));
    return 0;
}
[/sourcecode]
Play with my codes if you don't understand them. Try different inputs. If you still don't get it, feel free to drop a comment.
References:
- I came to know of the LIS codes (the O(n log n) versions) presented above from here.
- Art of Programming (viz. no. 2 here).
- geeksforgeeks.org article for LCS
- http://www.ics.uci.edu/~eppstein/161/960229.html
ReplyDeletei think the dp code will take more than n*n running time... n*n time it ll take to build the table itself... please clarify..
ReplyDeletesorry i mean it ll take more than m*n running time..
ReplyDeleteHi Rahul,
ReplyDeleteYou are right right, building the LCS table will take O(m*n) time. And printLCS a linear time, in worst case O(m+n), so effectively it becomes O(m*n)+O(m+n). The second factor there can be left out. And that is why we say it takes O(m*n). That is how asymptotic analysis works. Does that make it clear?
yeah.. actually i was reading asymptotic analysis at that time... so, i mistook your asymptotic tight bound (order notation) for big oh (O) notation.. i understood your code.. only later i realised order notation is generally used for efficiency of algorithms...:) i was hoping you wont read my comment and i would delete it before you read it..:)
ReplyDeleteAnyways, very good explanation bhaiya.. Dynamic programming was the first topic i read in design of algorithms(apart from recursive and divide and conquer that I am generally comfortable with)... And your blog was a great motivation.. :) Keep Writing...
Good to see the interest, keep it going. And remember DP is tough. So keep attacking it to hack the tricks !
ReplyDelete