The wavefront alignment (WFA) algorithm is an exact gap-affine
algorithm that takes advantage of homologous regions between the
sequences to accelerate the alignment process. Unlike traditional
dynamic programming algorithms that run in quadratic time, the WFA runs
in time O(ns+s^2), proportional to the sequence length n and the
alignment score s, using O(s^2) memory (or O(s) using the
ultralow/BiWFA mode). Moreover, the WFA algorithm exhibits simple
computational patterns that modern compilers can automatically
vectorize for different architectures without adapting the code.
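The wavefront idea can be illustrated with a much simpler scoring
scheme. The sketch below is not the WFA library's code and does not
implement gap-affine penalties: it assumes unit-cost edit distance
(mismatch, insertion, and deletion each cost 1), and the function name
wavefront_edit_distance is hypothetical. For each score s it keeps only
the furthest point reached on every diagonal and extends it for free
through matching runs, which is why similar sequences align quickly.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed wavefront by wavefront (sketch)."""
    n, m = len(a), len(b)
    target_k = m - n  # diagonal holding the end cell (i = n, j = m)

    def extend(k: int, j: int) -> int:
        # Slide along diagonal k (k = j - i) while characters match.
        i = j - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return j

    # wavefront[k] = furthest offset j reached on diagonal k at this score
    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(target_k, -1) < m:
        score += 1
        nxt = {}
        for k in range(min(wavefront) - 1, max(wavefront) + 2):
            candidates = []
            if k in wavefront:          # mismatch: advance in a and b
                candidates.append(wavefront[k] + 1)
            if k - 1 in wavefront:      # insertion: advance in b only
                candidates.append(wavefront[k - 1] + 1)
            if k + 1 in wavefront:      # deletion: advance in a only
                candidates.append(wavefront[k + 1])
            # Discard points that fall outside the alignment matrix.
            valid = [j for j in candidates if j <= m and j - k <= n]
            if valid:
                nxt[k] = extend(k, max(valid))
        wavefront = nxt
    return score


print(wavefront_edit_distance("GATTACA", "GATCACA"))
```

The number of wavefronts computed equals the final score, and each one
spans at most 2s+1 diagonals, which is the source of the s^2 term in
the bounds quoted above.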
FASDA aims to provide a fast and simple differential analysis tool
that just works and does not require any knowledge beyond basic Unix
command-line skills. The code is written entirely in C to maximize
efficiency and portability, and to provide a simple command-line user
interface.
MEGAHIT is a single-node assembler for large and complex metagenomics
NGS reads, such as those from soil samples. It uses a succinct de
Bruijn graph (SdBG) to achieve low-memory assembly. MEGAHIT can
optionally use a CUDA-enabled GPU to accelerate SdBG construction.
This is an attempt to perform a simple "libraryfication" of the GFF/GTF
parsing code that is used in the GFFRead codebase. There are not many
(any?) relatively lightweight GTF/GFF parsers exposing a C++ interface,
and the goal of this library is to provide this functionality without
the necessity of drawing in a heavy-weight dependency like SeqAn. Note:
This library draws directly from the code in GFFRead and GCLib, and
exists primarily to remove functionality (and hence code) that is
unnecessary for our downstream purposes. In the future, it may be
appropriate to just replace this library wholesale with GCLib.
MoChA is a bcftools plugin released under the MIT license for mosaic
chromosomal alteration detection and analysis from DNA microarray or
whole genome sequence data. It can be used both with Illumina and
Affymetrix data. It can also be used for detection of germline copy
number variants. Data can be prepared in usable file formats using the
gtc2vcf plugin.
deepTools contains modules for processing mapped-read data: running
multiple quality checks and creating normalized coverage files in the
standard bedGraph and bigWig formats, which allow comparison between
different files (for example, treatment and control). Finally, using
such normalized and standardized files, deepTools can create many
publication-ready visualizations to identify enrichments and for
functional annotations of the genome.
py-bigwig is a Python extension, written in C, for quick access to
bigBed files and access to and creation of bigWig files. This extension
uses libBigWig for local and remote file access.
py2bit is a Python extension, written in C, for quick access to 2bit
files for randomly accessible, packed nucleotide sequences. The
extension uses lib2bit for file access.
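To show why packed sequences remain randomly accessible, here is a
minimal pure-Python sketch of the packing principle. It is not the
py2bit or lib2bit API: the helper names pack and fetch are
hypothetical, and the sketch ignores the file header, N blocks, and
soft-masking blocks of a real 2bit file. It assumes the UCSC base
codes (T=00, C=01, A=10, G=11) with the first base in the most
significant bits of each byte.

```python
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"  # index in this string == 2-bit code


def pack(seq: str) -> bytes:
    """Pack an ACGT-only sequence into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a final partial chunk so base 0 stays in the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def fetch(packed: bytes, start: int, end: int) -> str:
    """Decode bases [start, end) without unpacking the whole sequence."""
    result = []
    for i in range(start, end):
        shift = 2 * (3 - (i & 3))           # position inside the byte
        result.append(BASES[(packed[i >> 2] >> shift) & 3])
    return "".join(result)


packed = pack("ACGTAC")
print(fetch(packed, 2, 5))
```

Because base i always lives in byte i // 4, any interval can be decoded
directly from its offset, which is what makes this layout suitable for
random access.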
Utilities for working on SAM/BAM files from The Center for Statistical
Genetics at the University of Michigan School of Public Health. They
include numerous functions such as splitting, merging, trimming reads,
filtering, validation, diff, etc.
pywgsim is a modified version of the wgsim short read simulator. The
code for wgsim has been modified to allow visualizing the simulated
mutations as a GFF file.
Biolibc-tools is a collection of simple, fast, memory-efficient
programs for processing biological data. These are simple programs
built on biolibc that are not complex enough to warrant a separate
project.
BFC is a standalone high-performance tool for correcting sequencing
errors from Illumina sequencing data. It is specifically designed for
high-coverage whole-genome human data, though it also performs well for
small genomes.
FLASH (Fast Length Adjustment of SHort reads) is a very fast and
accurate software tool to merge paired-end reads from next-generation
sequencing experiments. FLASH is designed to merge pairs of reads when
the original DNA fragments are shorter than twice the length of reads.
The resulting longer reads can significantly improve genome assemblies.
They can also improve transcriptome assembly when FLASH is used to
merge RNA-seq data.
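The merging condition above can be sketched as follows. This is a
simplified illustration and not FLASH's actual scoring: it assumes the
second read is given in the usual reverse-complement orientation, picks
the longest overlap whose mismatch ratio is within a threshold, and
ignores base qualities. The names merge_pair, min_overlap, and
max_mismatch_ratio are hypothetical.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_ratio: float = 0.1) -> Optional[str]:
    """Merge a read pair if the 3' ends overlap; return None otherwise."""
    mate = reverse_complement(read2)  # bring read2 onto read1's strand
    # Try the longest possible overlap first, then shrink.
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(x != y for x, y in zip(tail, head))
        if mismatches <= max_mismatch_ratio * overlap:
            return read1 + mate[overlap:]
    return None


# A 30 bp fragment sequenced with 20 bp reads: the reads overlap by 10 bp.
fragment = "A" * 10 + "C" * 10 + "G" * 10
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
print(merge_pair(read1, read2) == fragment)
```

When the fragment is at least twice the read length, the two reads
share no bases, no overlap passes the threshold, and the pair is left
unmerged.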
The ont_fast5_api is a simple interface to HDF5 files of the Oxford
Nanopore .fast5 file format. It provides:
o Implementation of the fast5 file schema using h5py library
o Methods to interact with and reflect the fast5 file schema
o Tools to convert between multi_read and single_read formats
o Tools to compress/decompress raw data in files
ErmineJ performs analyses of gene sets in high-throughput genomics data
such as gene expression profiling studies. A typical goal is to
determine whether particular biological pathways are "doing something
interesting" in an experiment that generates long lists of candidates.
The software is designed to be used by biologists with little or no
informatics background (but if you have one, you might be interested in
the CLI or the R support).
MMseqs2 (Many-against-Many sequence searching) is a software suite to search
and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source
GPL-licensed software implemented in C++ for FreeBSD, Linux, macOS, and (via
Cygwin) Windows. The software is designed to run on multiple cores and
servers and exhibits very good scalability. MMseqs2 can run 10000 times
faster than BLAST. At 100 times the speed of BLAST it achieves almost the same
sensitivity. It can perform profile searches with the same sensitivity as
PSI-BLAST at over 400 times its speed.