Manipulating DNA/RNA sequences is a very basic and fundamental operation in Molecular Biology. Writing the reverse-complement of a DNA sequence is very easy, but is also a error-prone operation if performed manually. Sequence manipulation tools are available online and free-of-charge (my favourite is the Sequence Manipulation Suite 2, available at the following URL: http://www.bioinformatics.org/sms2/).
However, we don’t need to use an online tool for getting reverse/complement sequences. A couple of lines of R code will largely suffice. Here’s an example of how we can do this.
Reverse sequence
The reverse of a sequence (as well of any other character String), is just the same string read from the end to the beginning. We can get the reverse of any string by reading it one char at time and right-to-left and then collapse the resulting character vector in order to get a String
myDNA <- "AAAATTTCCG" revDNA <- sapply(1:nchar(myDNA), (function(i){ pos <- nchar(myDNA) - i + 1 substr(myDNA, pos, pos) })) revDNA <- paste(revDNA, collapse = "") revDNA # GCCTTTAAAA
Complementary sequence
In order to get a complementary sequence, we need to convert each letter (corresponding to a DNA base) to the one we expect to find on the complementary DNA strand. Briefly, the conversion rule is: A matches T; C matches G. Undefined bases are labelled as N. The complementary base of N is N. We can store the conversion rules in a vector that will work as a dictionary.
baseMatch <- c("A", "T", "C", "G", "N") names(baseMatch) <- c("T", "A", "G", "C", "N") #complement of a T is: baseMatch["T"] # returns an A
We can read each char in the DNA string, return the complementary base letter using this dictionary and then collapse the result as we did before.
myDNA <- "AAAATTTCCG" complDNA <- paste(sapply(1:nchar(myDNA), (function(i){ myBase <- substr(myDNA, i,i) baseMatch[myBase] })), collapse = "") complDNA #TTTTAAAGGC
That’s it. We can write a function that includes this code and eventually implement some quality checks:
- convert U (Uracil) to T and match it with an A
- throw an error if non-DNA characters are found in the string;
- and so on…
The result may be found on GitHub at the following URL: