Small Scale project, 2020Hamming distance



Andreas Ott, Principal Investigator and Privatdozent, Institute for Mathematics

Initial problem

  • Bottleneck in an analysis pipeline of SARS-CoV-2 genome data
  • Calculation of hamming distances from a large set of gene expressions
  • This pre-processing step takes many hours to complete even using many cores
  • Prevented them from analysing larger datasets


What we did

  • Replace Python implementation with a highly optimized, vectorized and parallelized C++ version
  • Package as a python library hammingdist to fit into the existing analysis pipeline
  • Add vectorized SSE2/AVX/AVX512 implementations with automatic runtime dispatch based on detected CPU
  • Provide pre-compiled binary wheels on PyPI for convenient installation
  • Set up Continuous Integration and Continuous Delivery to provide automated testing and deployment of the code


  • Code now runs in minutes on a single core with original dataset
  • Code now scales using HPC resources to process hundreds of thousands of gene expressions
  • Publication: arXiv:2207.03394 [math.AT] MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations
  • Publication: arXiv:2106.07292 [q-bio.PE] Topological data analysis identifies emerging adaptive mutations in SARS-CoV-2
  • Pipeline: CoVtRec Topological Surveillance of Recurrent Mutations in SARS-CoV-2
  • Show full screen mode

    Plot of performance improvement