ksw2 aligner¶
Documentation¶
-
namespace
hugenomic
¶ Copyright (c) 2021, Huxelerate S.r.l. All rights reserved.
-
namespace
gpu
¶ -
class
ksw2
¶ - #include <ksw2.hpp>
Accelerated GPU implementation of a sequence aligner based on the ksw2_extd method from: https://github.com/lh3/ksw2
The aligner computes the optimal scores and the CIGAR for global and extension alignment of two sequences using dual gap affine cost function and a custom scoring matrix. The aligner also supports banded alignment and the z-drop heuristics: https://doi.org/10.1093/bioinformatics/bty191
Public Types
-
typedef uint32_t
cigar_op
¶ The data type used for storing a CIGAR operation within a CIGAR (e.g. 10M, 2I, 4D). A CIGAR operation encodes both the operation type and its length.
- Note
We use the same encoding of the ksw2 library.
Public Functions
-
ksw2
(uint32_t gpu_id, size_t gpu_memory)¶ Instantiate a ksw2 object to align sequences on a given GPU.
- Parameters
gpu_id
: The identifier of the target GPU to usegpu_memory
: The amount of GPU memory in bytes to use, if 0 all the free GPU memory will be used
- Exceptions
std::invalid_argument
: If an invalid GPU identifier is specified or if the specified GPU memory is not enough or exceeds the free GPU memory
-
~ksw2
()¶ Release all resources and the allocated GPU memory.
-
std::vector<response>
align_batch
(const std::vector<std::pair<std::vector<uint8_t>, std::vector<uint8_t>>> &pairs_of_sequences, int8_t start_gap_cost, int8_t extend_gap_cost, int8_t start_gap_cost_2, int8_t extend_gap_cost_2, const std::vector<int8_t> scoring_matrix, int32_t bandwidth, int32_t zdrop, int32_t end_bonus, bool extension_only_cigar)¶ Performs a sequence alignment of one or more pairs of query and target sequences and returns the alignment scores and the CIGAR for each pair of sequences.
The function computes at once the score of 4 different types of alignments. These alignments are: global alignment, global query alignment, global target alignment, extension alignment. For each alignment type, each pair of query and target sequences are aligned from their beginning up to an end position (query, target) that depends on the alignment type as follows:
global alignment (query_length - 1, target_length - 1)
global query alignment (query_length - 1,
FREE
)global target alignment (
FREE
, target_length - 1)extension alignment (
FREE
,FREE
)
Where
FREE
means that the algorithm selects the position that maximizes the alignment score.The function returns only the CIGAR of one alignment among: extension, global and global query alignment. To specify which CIGAR to return the user can use the extension_only_cigar and the end_bonus parameters. The following rules determine which CIGAR is returned:
global alignment CIGAR
when extension_only_cigar = false
global query alignment CIGAR
when extension_only_cigar = true AND global_query_max_score + end_bonus > extension_max_score
extension alignment CIGAR
when extension_only_cigar = true AND global_query_max_score + end_bonus <= extension_max_score
However, if the alignment is z-dropped, the extension alignment CIGAR is returned by default.
The cost function for gaps is computed as the minimum of two affine gap functions using the following formula, where l is the length of the gap:
min{start_gap_cost + l * extend_gap_cost, start_gap_cost_2 + l * extend_gap_cost_2}
The cost function associated to a match/mismatch is provided by a scoring matrix. The scoring matrix is an M*M matrix encoded as a linear array that determines the score to assign when comparing an element q of the query to an element t of the target sequence. In details, the score is assigned as:
score = scoring_matrix[t * M + q]
The value M must be in the range [2, MAX_SCORING_MATRIX_SIZE].
Notice that the values stored in the query and target sequences must be between 0 and M-1 otherwise an std::invalid_argument will be thrown.
PERFORMANCE TIP: you should always try to maximize the number of pairs of sequences in order to obtain maximum performance.
- Return
Vector of ksw2::response containing the alignment result for all pairs of sequences
- Parameters
pairs_of_sequences
: The pairs of query and target sequences to align where the first element of each pair is a query sequence and the second element of each pair is a target sequencestart_gap_cost
: The cost for starting a gapextend_gap_cost
: The cost for extending a gapstart_gap_cost_2
: The cost for starting a gap (second gap function)extend_gap_cost_2
: The cost for extending a gap (second gap function)scoring_matrix
: Scoring matrix of size M*M encoded as a linear vectorbandwidth
: The width of the band for banded alignment, use a negative number for full alignmentzdrop
: The z value for the z-drop heuristic, use a negative number to avoid z-dropend_bonus
: The bonus score to determine if the CIGAR of global query alignment should be returned instead of the CIGAR for extension alignment when extension_only_cigar is trueextension_only_cigar
: Whether to return the CIGAR for extension alignment/global query alignment (true) or the CIGAR for global alignment (false). If the alignment is z-dropped, the flag is not considered and the CIGAR for extension alignment is returned by default.
- Exceptions
std::invalid_argument
: In case of invalid sequences or scoring matrix
-
std::vector<response>
align_batch_indirect_storage
(const std::vector<const uint8_t*> &query_pointers, const std::vector<uint32_t> &query_sizes, const std::vector<const uint8_t*> &target_pointers, const std::vector<uint32_t> &target_sizes, int8_t start_gap_cost, int8_t extend_gap_cost, int8_t start_gap_cost_2, int8_t extend_gap_cost_2, const std::vector<int8_t> scoring_matrix, int32_t bandwidth, int32_t zdrop, int32_t end_bonus, bool extension_only_cigar)¶ This method provides the same functionality of ksw2::align_batch but supports a more flexible data storage for the input sequences.
Each sequence is provided as input using a memory pointer to the sequence and its size.
For details on the functionality of this method refer to ksw2::align_batch.
- Return
Vector of ksw2::response containing the alignment result for all pairs of sequences
- Parameters
query_pointers
: Vector of memory pointers to each query sequencequery_sizes
: Vector of query sizestarget_pointers
: Vector of memory pointers to each target sequencetarget_sizes
: Vector of target sizesstart_gap_cost
: The cost for starting a gapextend_gap_cost
: The cost for extending a gapstart_gap_cost_2
: The cost for starting a gap (second gap function)extend_gap_cost_2
: The cost for extending a gap (second gap function)scoring_matrix
: Scoring matrix of size M*M encoded as a linear vectorbandwidth
: The width of the band for banded alignment, use a negative number for full alignmentzdrop
: The z value for the z-drop heuristic, use a negative number to avoid z-dropend_bonus
: The bonus score to determine if the CIGAR of global query alignment should be returned instead of the CIGAR for extension alignment when extension_only_cigar is trueextension_only_cigar
: Whether to return the CIGAR for extension alignment/global query alignment (true) or the CIGAR for global alignment (false). If the alignment is z-dropped, the flag is not considered and the CIGAR for extension alignment is returned by default.
- Exceptions
std::invalid_argument
: In case of invalid sequences, invalid scoring matrix or mismatching sizes for query and target vectors
Public Static Functions
-
static inline char
get_cigar_op_type
(cigar_op op)¶ Retrieves the operation type of a CIGAR operation.
- Return
The operation type, can be one of:
’M’: match/mismatch
’I’: insertion
’D’: deletion
’U’: undefined (only if the cigar operation is incorrect)
- Parameters
op
: The cigar operation
Public Static Attributes
-
static const uint64_t
MAX_SCORING_MATRIX_SIZE
= 64¶ The maximum supported size for the custom scoring matrix
-
static const int32_t
MAX_NOT_FOUND
= -0x40000000¶ A special value assigned to a max score returned by the aligner when the max score is not found (such as when the alignment is z-dropped)
Private Members
-
std::experimental::propagate_const<std::unique_ptr<impl>>
pimpl
¶
-
struct
response
¶ - #include <ksw2.hpp>
The result of an alignment on two pair of sequences
Public Members
-
int32_t
extension_max_score
¶ Max score of extension alignment:
query from 0 to [X]
target from 0 to [Y]
can be equal to MAX_NOT_FOUND if not found
-
int32_t
extension_query_end_position
¶ Query end position ([X]) for extension alignment
-
int32_t
extension_target_end_position
¶ Target end position ([Y]) for extension alignment
-
int32_t
global_query_max_score
¶ Max score of global query alignment:
query from 0 to query_len - 1
target from 0 to [Y]
can be equal to MAX_NOT_FOUND if not found
-
int32_t
global_query_target_end_position
¶ Target end position ([Y]) for global query alignment
-
int32_t
global_target_max_score
¶ Max score of global target alignment:
query from 0 to [X]
target from 0 to target_len - 1
can be equal to MAX_NOT_FOUND if not found
-
int32_t
global_target_query_end_position
¶ Query end position ([X]) for global target alignment
-
int32_t
global_max_score
¶ Max score of global alignment:
query from 0 to query_len - 1
target from 0 to target_len - 1
can be equal to MAX_NOT_FOUND if not found
-
std::vector<cigar_op>
cigar
¶ The CIGAR of one of the following alignments:
global alignment: if extension_only_cigar=false
global query alignment: if extension_only_cigar=true and global_query_max_score + end_bonus > extension_max_score
extension alignment: if extension_only_cigar=true and global_query_max_score + end_bonus <= extension_max_score
- Note
If the alignment has been z-dropped, the CIGAR of extension alignment is returned by default
-
bool
global_query_cigar
¶ Whether the CIGAR for global query alignment is returned
-
bool
zdropped
¶ Whether the alignment has been z-dropped
-
int32_t
-
typedef uint32_t
-
class
-
namespace
Usage examples¶
examples/ksw2/1_cigar_operations_counting¶
/**
* Copyright (c) 2021, Huxelerate S.r.l.
* All rights reserved.
*
* Sample code for computing the average number of matches/mismatches, insertions and deletions
* for global alignment between random DNA reads.
*/
#include <iostream>
#include <iomanip>
#include <string>
#include <sys/time.h>
#include <vector>
#include <utility>
#include "ksw2.hpp" // HUGenomic library
using namespace hugenomic::gpu;
// number of pairs of sequences to generate and their lengths
const unsigned int NUM_PAIRS = 200000;
const size_t QUERY_SEQUENCES_LEN = 256;
const size_t TARGET_SEQUENCES_LEN = 256;
// scoring system used by the sequence alignment algorithm
const int START_GAP_COST = 4;
const int EXTEND_GAP_COST = 2;
const int START_GAP_COST_2 = 13;
const int EXTEND_GAP_COST_2 = 1;
// in this example we use 5x5 scoring matrix, where we assume the following
// encoding for positions (rows/columns) within the matrix:
// 0 -> A, 1 -> C, 2 -> G, 3 -> T, 4 -> *
std::vector<int8_t> SCORING_MATRIX = {
3, -4, -4, -4, 0,
-4, 3, -4, -4, 0,
-4, -4, 3, -4, 0,
-4, -4, -4, 3, 0,
0, 0, 0, 0, 0
};
const int ZDROP = -1; // we do not use the z-drop heuristics
const int BANDWIDTH = -1; // we do full alignment (not banded)
const int END_BONUS = 0;
const int EXTENSION_ONLY_CIGAR = false;
// utility function to generate a random dna sequence with a given length
std::vector<uint8_t> random_dna_sequence(size_t length);
// utility function to retrieve the current time in microseconds
unsigned long get_time();
int main()
{
// initialize the ksw2 aligner
const int gpu_id = 0;
ksw2 aligner(gpu_id, 0);
// generate a random set of pairs of DNA reads
std::cout << "Generating " << NUM_PAIRS
<< " random pairs of DNA reads, each pair with read lengths ("
<< QUERY_SEQUENCES_LEN << ", " << TARGET_SEQUENCES_LEN << ")"
<< std::endl;
srand(0);
std::vector< std::pair< std::vector<uint8_t>, std::vector<uint8_t> > > sequences_pairs;
for(unsigned int s = 0; s < NUM_PAIRS; s++) {
std::vector<uint8_t> query_sequence = random_dna_sequence(QUERY_SEQUENCES_LEN);
std::vector<uint8_t> target_sequence = random_dna_sequence(TARGET_SEQUENCES_LEN);
sequences_pairs.push_back({query_sequence, target_sequence});
}
std::cout << "Generation complete!" << std::endl << std::endl;
// run the alignment on each pair of random reads
std::cout << "Running alignment..." << std::endl;
int total_match_mismatch = 0;
int total_deletion = 0;
int total_insertion = 0;
unsigned long start_compute_time = get_time();
std::vector<ksw2::response> alignment_results = aligner.align_batch(sequences_pairs,
START_GAP_COST, EXTEND_GAP_COST, START_GAP_COST_2, EXTEND_GAP_COST_2,
SCORING_MATRIX, BANDWIDTH, ZDROP, END_BONUS, EXTENSION_ONLY_CIGAR);
unsigned long end_compute_time = get_time();
for(const auto &alignment: alignment_results) {
for(ksw2::cigar_op cigar_op: alignment.cigar) {
char cigar_op_type = ksw2::get_cigar_op_type(cigar_op);
uint32_t cigar_op_length = ksw2::get_cigar_op_length(cigar_op);
if(cigar_op_type == 'M')
total_match_mismatch += cigar_op_length;
else if(cigar_op_type == 'I')
total_insertion += cigar_op_length;
else if(cigar_op_type == 'D')
total_deletion += cigar_op_length;
}
}
// print the average number of matches/mismatches, insertions and deletions
std::cout << std::fixed << std::setprecision(2);
std::cout << "Alignment complete!" << std::endl << std::endl;
std::cout << "Average number of matches/mismatches = "
<< total_match_mismatch / (double) NUM_PAIRS << std::endl << std::endl;
std::cout << "Average number of insertions = "
<< total_insertion / (double) NUM_PAIRS << std::endl << std::endl;
std::cout << "Average number of deletions = "
<< total_deletion / (double) NUM_PAIRS << std::endl << std::endl;
// measure execution time and performance
double time_ms = (end_compute_time - start_compute_time) / 1000.0;
size_t cell_updates = QUERY_SEQUENCES_LEN * TARGET_SEQUENCES_LEN * NUM_PAIRS;
std::cout << "Alignment time: " << time_ms << " ms" << std::endl
<< "GCUPS (Giga Cell Update Per Second) = "
<< (cell_updates / time_ms / 1000000.0) << std::endl;
return 0;
}
std::vector<uint8_t> random_dna_sequence(size_t length)
{
std::vector<uint8_t> sequence;
for(size_t i = 0; i < length; i++) {
sequence.push_back( rand() % 5 );
}
return sequence;
}
unsigned long get_time()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}