Genomic Sequencing

January 19, 2024

When genes are sequenced from a biological sample with laboratory equipment, the output after initial post-processing is a set of files containing sequences of bases. A base is one of the chemicals adenine (A), thymine (T), guanine (G), or cytosine (C). On the two strands of DNA, bases bond into base pairs (A with T, G with C), and they are the basic building blocks of DNA: sequences of them represent genes. The sequences the equipment reads from the sample come from unknown positions in the organism's genome; the equipment reports what each fragment contains, not where it came from. To figure out where in the genome these sequences line up, and thus which genes they belong to, each sequence in the generated files is searched for in the organism's genome.

This is a computationally intensive task. In its basic form it means taking thousands to millions of strings of text like AGTCACTGAGT and matching each one against the text of a genome, which is itself a long sequence of base pairs (e.g. roughly 3 billion base pairs in the case of the human genome). It is further complicated by the heuristics and advanced search techniques needed to find matches accurately and efficiently.

Genomic sequencing is frequently an embarrassingly parallel workload: the same analysis has to be run over many samples, and each run is, for the most part, independent of every other sample.
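The basic form of the search described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how production tools work: it does exact string matching only, the function names `find_exact_matches` and `map_reads` are hypothetical, and the toy genome and reads are made up. It uses a thread pool purely to show that each lookup is independent of the others; a real pipeline would spread the work across processes or cluster jobs instead.

```python
from concurrent.futures import ThreadPoolExecutor


def find_exact_matches(genome: str, read: str) -> list[int]:
    """Return every 0-based offset at which `read` occurs in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping occurrences are also found.
        start = genome.find(read, start + 1)
    return positions


def map_reads(genome: str, reads: list[str]) -> dict[str, list[int]]:
    """Look up each read independently; no lookup needs another's result."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda r: find_exact_matches(genome, r), reads)
        return dict(zip(reads, results))


# Toy data, invented for illustration only.
genome = "AGTCACTGAGTCCAGTCA"
reads = ["AGTCA", "CTGAG", "GGGG"]

hits = map_reads(genome, reads)
print(hits)  # {'AGTCA': [0, 13], 'CTGAG': [5], 'GGGG': []}
```

Because `map_reads` never shares state between reads, the list of reads (or, at a larger scale, the list of samples) can be split into chunks and handed to separate workers without any coordination, which is what makes the workload embarrassingly parallel.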