Fourier text search

=algorithm =explanation =machine learning



Fourier transforms convert between multiplication and convolution. This means they can be used for string search.



Suppose you have 2 DNA sequences, and you want to find the best match between them. One way to do this is as follows:


1) Create a set S1 of 4 arrays for the first sequence. The first array S1[0] is 1 where the sequence is A, and otherwise 0. The other 3 arrays are the same but for T, C, G.


2) Reverse the second sequence, then create a set S2 of 4 arrays for that as in (1). Pad the arrays with 0 (or perhaps 1/4) to the same length as the first set of arrays.


3) Apply a FFT to each of those 8 arrays in S1 and S2: S1[i] = FFT(S1[i])


4) Multiply the 2 sets of arrays: Products[i][j] = S1[i][j] * S2[i][j]


4) Sum the resulting 4 arrays: Sum[j] = Products[0][j] + Products[1][j] + Products[2][j] + Products[3][j]


5) Apply an inverse FFT to Sum: Sum = iFFT(Sum)


6) Find the maximum value of Sum. The position where it's found corresponds to the best match offset for the 2 DNA sequences, and the value corresponds to the quality of that match.



Is this approach actually useful? The complexity is O(n log n) where n is the longer sequence's length, while naive search for imperfect matches is O(length1 * length2). Text search usually involves finding a perfect match for a short string in a very large amount of text, in which case Fourier transforms are not a good approach. Also, a larger number of symbols obviously requires more arrays. That's why I used the example of DNA sequence alignment.


However, symbol counts can be reduced by breaking symbols into orthogonal numerical components - for example, by converting words to vectors, which is now a commonly used technique called "word embedding". One advantage of using Fourier transforms here is that a Gaussian blur can be easily applied to searches on word embedding sequences.


back to index