Heidelberg University

Small Scale project, 2020 Hamming distance

Scenario

Researcher

Andreas Ott, Principal Investigator and Privatdozent, Institute for Mathematics

Initial problem

Bottleneck in an analysis pipeline of SARS-CoV-2 genome data
Calculation of hamming distances from a large set of gene expressions
This pre-processing step takes many hours to complete even using many cores
Prevented them from analysing larger datasets

Institute for Mathematics

Outcome

What we did

Replace Python implementation with a highly optimized, vectorized and parallelized C++ version
Package as a python library hammingdist to fit into the existing analysis pipeline
Add vectorized SSE2/AVX/AVX512 implementations with automatic runtime dispatch based on detected CPU
Provide pre-compiled binary wheels on PyPI for convenient installation
Set up Continuous Integration and Continuous Delivery to provide automated testing and deployment of the code

Results

Code now runs in minutes on a single core with original dataset
Code now scales using HPC resources to process hundreds of thousands of gene expressions
Publication: arXiv:2207.03394 [math.AT] MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations
Publication: arXiv:2106.07292 [q-bio.PE] Topological data analysis identifies emerging adaptive mutations in SARS-CoV-2
Pipeline: CoVtRec Topological Surveillance of Recurrent Mutations in SARS-CoV-2

Deutsch

Contact