660 commits

Author SHA1 Message Date
nia
169fb9f0f3 biology: Remove SHA1 hashes for distfiles 2021-10-07 13:19:36 +00:00
adam
801b54580f py-pydicom: updated to 2.2.2
Version 2.2.0

Changes
-------
* Data elements with a VR of **AT** must now be set with values
  acceptable to :func:`~pydicom.tag.Tag`, and are always stored as a
  :class:`~pydicom.tag.BaseTag`.  Previously, any Python type could be
  set.
* :meth:`BaseTag.__eq__()<pydicom.tag.BaseTag.__eq__>` returns ``False`` rather
  than raising an exception when the operand cannot be converted to
  :class:`~pydicom.tag.BaseTag` (:pr:`1327`)
* :meth:`DA.__str__()<pydicom.valuerep.DA.__str__>`,
  :meth:`DT.__str__()<pydicom.valuerep.DT.__str__>` and
  :meth:`TM.__str__()<pydicom.valuerep.TM.__str__>` return valid DICOM
  strings instead of the formatted date and time representations
  (:issue:`1262`)
* If comparing :class:`~pydicom.dataset.FileDataset` instances, the file
  metadata is now ignored. This makes it possible to compare a
  :class:`~pydicom.dataset.FileDataset` object with a
  :class:`~pydicom.dataset.Dataset` object.
* :func:`~pydicom.pixel_data_handlers.rle_handler.rle_encode_frame` is
  deprecated and will be removed in v3.0, use
  :meth:`~pydicom.dataset.Dataset.compress` or
  :attr:`~pydicom.encoders.RLELosslessEncoder` instead.
* :func:`~pydicom.filereader.read_file` is deprecated and will be removed in
  v3.0, use :func:`~pydicom.filereader.dcmread` instead.
* :func:`~pydicom.filewriter.write_file` is deprecated and will be removed in
  v3.0, use :func:`~pydicom.filewriter.dcmwrite` instead.
* Data dictionaries updated to version 2021b of the DICOM Standard
* :class:`~pydicom.dataset.Dataset` no longer inherits from :class:`dict`

Enhancements
------------
* Added a command-line interface for pydicom.  Current subcommands are:

    * ``show``: display all or part of a DICOM file
    * ``codify`` to produce Python code for writing files or sequence items
      from scratch.

  Please see the :ref:`cli_guide` for examples and details
  of all the options for each command.
* A field containing an invalid number of bytes will result in a warning
  instead of an exception when
  :attr:`~pydicom.config.convert_wrong_length_to_UN` is set to ``True``.
* Private tags known via the private dictionary will now get the configured
  VR if read from a dataset instead of **UN** (:issue:`1051`).
* While reading explicit VR, a switch to implicit VR will be silently attempted
  if the VR bytes are not valid VR characters, and config option
  :attr:`~pydicom.config.assume_implicit_vr_switch` is ``True`` (default)
* New functionality to help with correct formatting of decimal strings (**DS**)

    * Added :func:`~pydicom.valuerep.is_valid_ds` to check whether a string is
      valid as a DICOM decimal string and
      :func:`~pydicom.valuerep.format_number_as_ds` to format a given ``float``
      or ``Decimal`` as a DS while retaining the highest possible level of
      precision
    * If :attr:`~pydicom.config.enforce_valid_values` is set to ``True``, all
      **DS** objects created will be checked for the validity of their string
      representations.
    * Added optional ``auto_format`` parameter to the init methods of
      :class:`~pydicom.valuerep.DSfloat` and
      :class:`~pydicom.valuerep.DSdecimal` and the :func:`~pydicom.valuerep.DS`
      factory function to allow explicitly requesting automatic formatting of
      the string representations of these objects when they are constructed.
* Added methods to construct :class:`~pydicom.valuerep.PersonName` objects
  from individual components of names (``family_name``, ``given_name``, etc.).
  See :meth:`~pydicom.valuerep.PersonName.from_named_components` and
  :meth:`~pydicom.valuerep.PersonName.from_named_components_veterinary`.
* Added support for downloading the large test files with the `requests
  <https://docs.python-requests.org/en/master/>`_ package in addition to
  :mod:`urllib.request` (:pr:`1340`)
* Ensured :func:`~pydicom.pixel_data_handlers.util.convert_color_space` uses
  32-bit floats for calculation, added `per_frame` flag to allow frame-by-frame
  processing and improved the speed by ~20-60% (:issue:`1348`)
* Optimisations for RLE encoding using *pydicom* (~40% faster).
* Added support for faster decoding (~4-5x) and encoding (~20x) of *RLE Lossless*
  *Pixel Data* via the `pylibjpeg-rle
  <https://github.com/pydicom/pylibjpeg-rle>`_ plugin (:pr:`1361`, :pr:`1372`).
* Added :func:`Dataset.compress()<pydicom.dataset.Dataset.compress>` function for
  compressing uncompressed pixel data using a given encoding format as specified
  by a UID. Only *RLE Lossless* is currently supported (:pr:`1372`)
* Added :mod:`~pydicom.encoders` module and the following encoders:

  * :attr:`~pydicom.encoders.RLELosslessEncoder` with 'pydicom', 'pylibjpeg'
    and 'gdcm' plugins
* Added `read` parameter to :func:`~pydicom.data.get_testdata_file`
  to allow reading and returning the corresponding dataset (:pr:`1372`)
* Handle decoded RLE segments with padding (:issue:`1438`)
* Add option to JSON functions to suppress exception and continue (:pr:`1332`)
* Allow searching :class:`~pydicom.fileset.FileSet` s for a list of elements (:pr:`1428`)
* Added hash function to SR :class:`~pydicom.sr.Code` (:pr:`1434`)
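
The decimal string (**DS**) helpers listed above can be pictured with a small standard-library sketch. It assumes the usual DS constraints (at most 16 characters, made up of digits, sign, decimal point, exponent and space padding). The names ``looks_like_valid_ds`` and ``format_float_as_ds`` are hypothetical and are not pydicom's implementation of ``is_valid_ds`` or ``format_number_as_ds``.

```python
import re

# Optional sign, digits with an optional fraction, optional exponent,
# optional space padding on either side.
_DS_PATTERN = re.compile(r"^ *[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)? *$")
MAX_DS_LENGTH = 16  # assumed DS length limit in characters


def looks_like_valid_ds(text: str) -> bool:
    """Return True if `text` fits the length limit and the number syntax."""
    return len(text) <= MAX_DS_LENGTH and _DS_PATTERN.match(text) is not None


def format_float_as_ds(value: float) -> str:
    """Format `value` in at most 16 characters, keeping as many
    significant digits as will fit."""
    if value != value or value in (float("inf"), float("-inf")):
        raise ValueError("NaN and infinity cannot be written as a decimal string")
    text = repr(value)
    if len(text) <= MAX_DS_LENGTH:
        return text
    # Drop significant digits one at a time until the string fits.
    for digits in range(16, 0, -1):
        text = f"{value:.{digits}g}"
        if len(text) <= MAX_DS_LENGTH:
            return text
    raise ValueError(f"cannot fit {value!r} into {MAX_DS_LENGTH} characters")


print(looks_like_valid_ds("0.3333333333333333"))  # too long for DS
print(format_float_as_ds(1 / 3))
```

Trimming digits only when the shortest round-trip representation is too long keeps short values such as ``1.5`` untouched.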


Fixes
-----
* Fixed pickling a :class:`~pydicom.dataset.Dataset` instance with sequences
  after the sequence had been read (:issue:`1278`)
* Fixed JSON export of numeric values
* Fixed handling of sequences of unknown length that switch to implicit
  encoding, and sequences with VR **UN** (:issue:`1312`)
* Do not load external data sources until needed - fixes problems with
  standard workflow if `setuptools` are not installed (:issue:`1341`)
* Fixed empty **PN** elements read from file being :class:`str` rather than
  :class:`~pydicom.valuerep.PersonName` (:issue:`1338`)
* Fixed handling of JPEG (10918-1) images compressed using RGB colourspace
  rather than YBR with the Pillow pixel data handler (:pr:`878`)
* Allow deepcopying a :class:`~pydicom.dataset.FileDataset` object (:issue:`1147`)
* Fixed elements with a VR of **OL**, **OD** and **OV** not being set correctly
  when an encoded backslash was part of the element value (:issue:`1412`)
* Fixed expansion of linear segments with floating point steps in
  segmented LUTs (:issue:`1415`)
* Fixed handling of code extensions with person name component delimiter
  (:pr:`1449`)
* Fixed bug decoding RGB JPEG with APP14 marker due to change in Pillow (:pr:`1444`)
* Fixed decoding for `FloatPixelData` and `DoubleFloatPixelData` via
  `pydicom.pixel_data_handlers.numpy_handler` (:issue:`1457`)
2021-10-04 08:54:01 +00:00
adam
5e7c36d9d2 revbump for boost-libs 2021-09-29 19:00:02 +00:00
bacon
39bd4376cd biology/biolibc: Update to 0.2.0.11
Regenerate man pages with improved auto-c2man
Improved formatting and added missing return value sections
2021-09-18 00:42:39 +00:00
bacon
40b825e7c1 biology/peak-classifier: Update to 0.1.1.21
Fix regression: Replace BL_BED_SET_STRAND() macro with
bl_bed_set_strand(), which performs sanity checks
2021-09-03 01:57:56 +00:00
bacon
dacaebd841 biology/biolibc: Update to 0.2.0.1
Fix regression: Replace BL_BED_SET_STRAND() macro with
bl_bed_set_strand(), which performs sanity checks
2021-09-03 01:53:52 +00:00
adam
d40206d1db py-pydicom: PLIST fix 2021-09-01 18:18:19 +00:00
bacon
a924de7072 biology/Makefile: Add biolibc-tools 2021-08-31 15:56:42 +00:00
bacon
03efa277f5 biology/biolibc-tools: import biolibc-tools-0.1.0.36
Biolibc-tools is a collection of simple, fast, and memory-efficient
programs for processing biological data.  These programs built on
biolibc are not complex enough to warrant separate projects.
2021-08-31 15:55:16 +00:00
adam
02648e6327 py-pydicom: add ALTERNATIVES 2021-08-29 13:00:03 +00:00
adam
41be103944 py-pydicom: updated to 2.2.1
Version 2.2.0

Changes

Data elements with a VR of AT must now be set with values acceptable to Tag(), and are always stored as a BaseTag. Previously, any Python type could be set.
BaseTag.__eq__() returns False rather than raising an exception when the operand cannot be converted to BaseTag
DA.__str__(), DT.__str__() and TM.__str__() return valid DICOM strings instead of the formatted date and time representations
If comparing FileDataset instances, the file metadata is now ignored. This makes it possible to compare a FileDataset object with a Dataset object.
rle_encode_frame() is deprecated and will be removed in v3.0, use compress() or RLELosslessEncoder instead.
read_file() is deprecated and will be removed in v3.0, use dcmread() instead.
write_file() is deprecated and will be removed in v3.0, use dcmwrite() instead.
Data dictionaries updated to version 2021b of the DICOM Standard
Dataset no longer inherits from dict

Enhancements

Added a command-line interface for pydicom. Current subcommands are:
show: display all or part of a DICOM file
codify to produce Python code for writing files or sequence items from scratch.
Please see the Command-line Interface Guide for examples and details of all the options for each command.
A field containing an invalid number of bytes will result in a warning instead of an exception when convert_wrong_length_to_UN is set to True.
Private tags known via the private dictionary will now get the configured VR if read from a dataset instead of UN
While reading explicit VR, a switch to implicit VR will be silently attempted if the VR bytes are not valid VR characters, and config option assume_implicit_vr_switch is True (default)
New functionality to help with correct formatting of decimal strings (DS)
Added is_valid_ds() to check whether a string is valid as a DICOM decimal string and format_number_as_ds() to format a given float or Decimal as a DS while retaining the highest possible level of precision
If enforce_valid_values is set to True, all DS objects created will be checked for the validity of their string representations.
Added optional auto_format parameter to the init methods of DSfloat and DSdecimal and the DS() factory function to allow explicitly requesting automatic formatting of the string representations of these objects when they are constructed.
Added methods to construct PersonName objects from individual components of names (family_name, given_name, etc.). See from_named_components() and from_named_components_veterinary().
Added support for downloading the large test files with the requests package in addition to urllib.request
Ensured convert_color_space() uses 32-bit floats for calculation, added per_frame flag to allow frame-by-frame processing and improved the speed by ~20-60%
Optimisations for RLE encoding using pydicom (~40% faster).
Added support for faster decoding (~4-5x) and encoding (~20x) of RLE Lossless Pixel Data via the pylibjpeg-rle plugin
Added Dataset.compress() function for compressing uncompressed pixel data using a given encoding format as specified by a UID. Only RLE Lossless is currently supported
Added encoders module and the following encoders:
RLELosslessEncoder with ‘pydicom’, ‘pylibjpeg’ and ‘gdcm’ plugins
Added read parameter to get_testdata_file() to allow reading and returning the corresponding dataset
Handle decoded RLE segments with padding
Add option to JSON functions to suppress exception and continue
Allow searching FileSets for a list of elements
Added hash function to SR Code
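
As an illustration of building a person name from individual components, here is a minimal sketch that joins the components with the ``^`` delimiter used by the DICOM PN value representation. It covers a single component group only. The function ``person_name_from_components`` and the parameters other than ``family_name`` and ``given_name`` are assumptions made for this example; it does not reproduce ``PersonName.from_named_components()``.

```python
def person_name_from_components(
    family_name: str = "",
    given_name: str = "",
    middle_name: str = "",
    name_prefix: str = "",
    name_suffix: str = "",
) -> str:
    """Join name components with '^' and drop trailing empty components."""
    components = [family_name, given_name, middle_name, name_prefix, name_suffix]
    for part in components:
        # These characters act as delimiters and cannot appear in a component.
        if "^" in part or "=" in part or "\\" in part:
            raise ValueError(f"component {part!r} contains a delimiter")
    return "^".join(components).rstrip("^")


print(person_name_from_components("Adams", "John", name_suffix="Jr."))
print(person_name_from_components("Adams", "John"))
```

Empty components in the middle are kept so that later components stay in their positions; only trailing empty components are trimmed.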

Fixes

Fixed pickling a Dataset instance with sequences after the sequence had been read
Fixed JSON export of numeric values
Fixed handling of sequences of unknown length that switch to implicit encoding, and sequences with VR UN
Do not load external data sources until needed - fixes problems with standard workflow if setuptools are not installed
Fixed empty PN elements read from file being str rather than PersonName
Fixed handling of JPEG (10918-1) images compressed using RGB colourspace rather than YBR with the Pillow pixel data handler
Allow deepcopying a FileDataset object
Fixed elements with a VR of OL, OD and OV not being set correctly when an encoded backslash was part of the element value
Fixed expansion of linear segments with floating point steps in segmented LUTs
Fixed handling of code extensions with person name component delimiter
Fixed bug decoding RGB JPEG with APP14 marker due to change in Pillow
Fixed decoding for FloatPixelData and DoubleFloatPixelData via pydicom.pixel_data_handlers.numpy_handler


Version 2.1.1

Fixes

Remove py.typed
Fix ImportError with Python 3.6.0
Fix converting Sequences with Bulk Data when loading from JSON


Version 2.1.0

Changelog

Dropped support for Python 3.5 (only Python 3.6+ supported)

Enhancements

Large testing data is no longer distributed within the pydicom package with the aim to reduce the package download size. These test files will download on-the-fly whenever either the tests are run, or should the file(s) be requested via the data manager functions. For example:
To download all files and get their paths on disk you can run pydicom.data.get_testdata_files().
To download an individual file and get its path on disk you can use pydicom.data.get_testdata_file(), e.g. for RG1_UNCI.dcm use pydicom.data.get_testdata_file("RG1_UNCI.dcm")
Added a new pixel data handler based on pylibjpeg which supports all (non-retired) JPEG transfer syntaxes
Added apply_rescale() alias
Added apply_voi() and apply_windowing()
Added prefer_lut keyword parameter to apply_voi_lut() and handle empty VOI LUT module elements
Added ability to register external data sources for use with the functions in pydicom.data
__contains__, __next__ and __iter__ implementations added to PersonName
Added convenience constants for the MPEG transfer syntaxes to pydicom.uid
Added support for decoding Waveform Data:
Added pydicom.waveforms module and generate_multiplex() and multiplex_array() functions.
Added Dataset.waveform_array() which returns an ndarray for the multiplex group at index within a Waveform Sequence element.
When JPEG 2000 image data is unsigned and the Pixel Representation is 1 the image data is converted to signed
Added keyword property for the new UID keywords in version 2020d of the DICOM Standard
Added testing of the variable names used when setting Dataset attributes and INVALID_KEYWORD_BEHAVIOR config option to allow customizing the behavior when a camel case variable name is used that isn’t a known element keyword
Added INVALID_KEY_BEHAVIOR config option to allow customizing the behavior when an invalid key is used with the Dataset in operator
Implemented full support (loading, accessing, modifying, writing) of DICOM File-sets and their DICOMDIR files via the FileSet class
Added AllTransferSyntaxes
Added option to turn on pydicom future breaking behavior to allow user code to check itself against the next major version release. Set environment variable “PYDICOM_FUTURE” to “True” or call future_behavior()
Added another signature to the bulk_data_uri_handler in from_json to allow for the communication of not just the URI but also the tag and VR to the handler. Previous handlers will work as expected, new signature handlers will get the additional information.
pack_bits() can now be used with 2D or 3D input arrays and will pad the packed data to even length by default.
Elements with the IS VR accept float strings that are convertible to integers without loss, e.g. “1.0”
Added encapsulate_extended() function for use when an Extended Offset Table is required
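
The IS behaviour noted above (accepting float strings such as "1.0" when no information is lost) can be sketched with the standard library. The helper ``parse_integer_string`` is hypothetical and only illustrates the idea; it is not pydicom's IS class.

```python
from decimal import Decimal, InvalidOperation


def parse_integer_string(text: str) -> int:
    """Convert `text` to int, accepting float notation only when lossless."""
    try:
        number = Decimal(text.strip())
    except InvalidOperation:
        raise ValueError(f"{text!r} is not a number") from None
    if not number.is_finite() or number != number.to_integral_value():
        raise ValueError(f"{text!r} cannot be converted to an integer without loss")
    return int(number)


print(parse_integer_string("1.0"))
print(parse_integer_string("42"))
```

A value such as "1.5" raises ``ValueError`` in this sketch instead of being silently truncated.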

Changes

Reading and adding unknown non-private tags now does not raise an exception per default, only when enforce_valid_values is set
Data dictionaries updated to version 2020d of the DICOM Standard
Updated a handful of the SOP Class variable names in _storage_sopclass_uids to use the new UID keywords. Variables with Multiframe in them become MultiFrame, those with and in them become And, and DICOSQuadrupoleResonanceQRStorage becomes DICOSQuadrupoleResonanceStorage.
The following UID constants are deprecated and will be removed in v2.2:
JPEGBaseline: use JPEGBaseline8Bit
JPEGExtended: use JPEGExtended12Bit
JPEGLossless: use JPEGLosslessSV1
JPEGLSLossy: use JPEGLSNearLossless
JPEG2000MultiComponentLossless: use JPEG2000MCLossless
JPEG2000MultiComponent: use JPEG2000MC
In v3.0 the value for JPEGLossless will change from 1.2.840.10008.1.2.4.70 to 1.2.840.10008.1.2.4.57 to match its UID keyword
The following lists of UIDs are deprecated and will be removed in v2.2:
JPEGLossyCompressedPixelTransferSyntaxes: use JPEGTransferSyntaxes
JPEGLSSupportedCompressedPixelTransferSyntaxes: use JPEGLSTransferSyntaxes
JPEG2000CompressedPixelTransferSyntaxes: use JPEG2000TransferSyntaxes
RLECompressedLosslessSyntaxes: use RLETransferSyntaxes
UncompressedPixelTransferSyntaxes: use UncompressedTransferSyntaxes
PILSupportedCompressedPixelTransferSyntaxes
DicomDir and the dicomdir module are deprecated and will be removed in v3.0. Use FileSet instead
pydicom.overlay_data_handlers is deprecated, use pydicom.overlays instead
Removed transfer syntax limitations when converting overlays to an ndarray
The overlay_data_handlers config option is deprecated, the default handler will always be used.

Fixes

Dataset.copy() now works as expected
Optimistically parse undefined length non-SQ data as if it’s encapsulated pixel data to avoid erroring out on embedded sequence delimiter
Fixed get_testdata_file() and get_testdata_files() raising an exception if no network connection is available
Fixed GDCM < v2.8.8 not returning the pixel array for datasets not read from a file-like
Raise TypeError if dcmread() or dcmwrite() is called with wrong argument
Gracefully handle empty Specific Character Set
Fixed empty ambiguous VR elements raising an exception
Allow apply_voi_lut() to apply VOI lookup to an input float array
Fixed Dataset.setdefault() not working correctly when the default value is None and not adding private elements when enforce_valid_values is True

Version 2.0.0

Changelog

Dropped support for Python 2 (only Python 3.5+ supported)
Changes to Dataset.file_meta
file_meta now shown by default in dataset str or repr output; pydicom.config.show_file_meta can be set False to restore previous behavior
new FileMetaDataset class that accepts only group 2 data elements
Deprecation warning given unless Dataset.file_meta set with a FileMetaDataset object (in pydicom 3, it will be required)
Old PersonName class removed; PersonName3 renamed to PersonName. Classes PersonNameUnicode and PersonName3 are aliased to PersonName but are deprecated and will be removed in version 2.1
dataelem.isMultiValue (previously deprecated) has been removed. Use dataelem.DataElement.VM instead.

Enhancements

Allow PathLike objects for filename argument in dcmread, dcmwrite and Dataset.save_as
Deflate post-file meta information data when writing a dataset with the Deflated Explicit VR Little Endian transfer syntax UID
Added config.replace_un_with_known_vr to be able to switch off automatic VR conversion for known tags with VR “UN”
Added config.use_DS_numpy and config.use_IS_numpy to have multi-valued data elements with VR of DS or IS return a numpy array

Fixes

Fixed reading of datasets with an empty Specific Character Set tag
Fixed failure to parse dataset with an empty LUT Descriptor or Red/Green/Blue Palette Color LUT Descriptor element.
Made Dataset.save_as a wrapper for dcmwrite
Removed 1.2.840.10008.1.2.4.70 - JPEG Lossless (Process 14, SV1) from the Pillow pixel data handler as Pillow doesn’t support JPEG Lossless.
Fixed error when writing elements with a VR of OF
Fixed improper conversion when reading elements with a VR of OF
Fixed apply_voi_lut() and apply_modality_lut() not handling (0028,3006) LUT Data with a VR of OW
Fixed access to private creator tag in raw datasets
Fixed description of newly added known private tag
Fixed update of private blocks after deleting private creator
Fixed bug in updating pydicom.config.use_DS_Decimal flag in DS_decimal()
2021-08-29 12:59:39 +00:00
bacon
d5c97c175e biology/peak-classifier: Update to 0.1.1.20
Updates for libxtend and biolibc API changes
2021-08-28 18:40:29 +00:00
bacon
6b12dfc571 biology/vcf2hap: Update to 0.1.3.12
Updates for libxtend and biolibc API changes
2021-08-28 18:40:06 +00:00
bacon
289a9063c4 biology/vcf-split: Update to 0.1.2.14
Updates for libxtend and biolibc API changes
2021-08-28 18:39:37 +00:00
bacon
52a18bf75d biology/ad2vcf: Update to 0.1.3.31
Updates for libxtend and biolibc API changes
Clean up and minor bug fixes
2021-08-28 18:39:10 +00:00
bacon
0b7391d132 biology/biolibc: Update to 0.2.0
Major API overhaul
New classes for FASTA and FASTQ
Generate accessor and mutator functions for all classes
Generate man pages for all functions and macros
Export delimiter-separated-value class to libxtend
2021-08-28 18:34:37 +00:00
nia
57b1e1ac6e py-numpy: "Python version >= 3.7 required." 2021-06-29 08:41:59 +00:00
nia
55394cf036 Revbump for MySQL default change 2021-06-23 20:33:06 +00:00
bacon
45fac8d332 biology/Makefile: Add peak-classifier 2021-06-15 13:55:35 +00:00
bacon
c5b9a46b0d biology/peak-classifier: import peak-classifier-0.1.1
Classify ChIP/ATAC-Seq peaks based on features provided in a GFF
Peaks are provided in a BED file sorted by chromosome and position. The GFF
must be sorted by chromosome and position, with gene-level features separated
by ### tags and each gene organized into subfeatures such as transcripts and
exons.  This is the default for common data sources.
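
The GFF layout described above, with gene-level blocks separated by ### lines, can be read block by block. The following is a minimal Python sketch, not code from peak-classifier; the function name ``split_gene_blocks`` is hypothetical.

```python
from typing import Iterable, Iterator, List


def split_gene_blocks(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of feature lines per '###'-terminated block."""
    block: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("###"):
            if block:
                yield block
            block = []
        elif line and not line.startswith("#"):
            # Skip blank lines and other comment/pragma lines.
            block.append(line)
    if block:  # final block without a trailing '###'
        yield block


sample = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=gene1\n",
    "###\n",
    "chr1\tsrc\tgene\t2000\t2500\t.\t-\t.\tID=gene2\n",
    "###\n",
]
for gene in split_gene_blocks(sample):
    print(len(gene), "features")
```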
2021-06-15 13:54:14 +00:00
bacon
e51df20828 biology/biolibc: Update to 0.1.3.2
Add LDFLAGS to allow RELRO
2021-06-15 13:47:46 +00:00
bacon
ed9ec6519f biology/vcf-split: Update to 0.1.2
Updates for new biolibc API

Upstream change log: https://github.com/auerlab/vcf-split/releases
2021-06-11 17:22:40 +00:00
bacon
9011fdfe82 biology/vcf2hap: Update to 0.1.3
Updates for new biolibc API

Upstream change log: https://github.com/auerlab/vcf2hap/releases
2021-06-11 17:09:40 +00:00
bacon
b4e8c5eeea biology/ad2vcf: Update to 0.1.3
Updates for new biolibc API

Upstream change log: https://github.com/auerlab/ad2vcf/releases
2021-06-11 17:06:54 +00:00
bacon
9749c03a24 biology/biolibc: Update to 0.1.3
Import sam_buff_t class and VCF functions from ad2vcf
Add BED and GFF support
Isolate headers under include/biolibc
Numerous small enhancements and fixes

Upstream change log: https://github.com/auerlab/biolibc/releases
2021-06-11 17:04:55 +00:00
bacon
a4f77ad14b biology/ncbi-blast+: Update to 2.11.0
Release notes: https://www.ncbi.nlm.nih.gov/books/NBK131777/
2021-06-11 13:47:39 +00:00
wiz
9f50982921 *: recursive PKGREVISION bump for sneaky gsl shared library version number change 2021-06-01 09:12:22 +00:00
brook
c2288fd3db biology/minimap2: install minimap2 program instead of python binding
The distfile for minimap2 includes two different components: (i) the
minimap2 sequence mapping program itself, and (ii) a python binding
generally referred to as mappy.  The initial version of this package
included only the python binding.  However, it is more appropriate
that the minimap2 package should contain the program of the same name,
and a new package be created with the name mappy for the python
binding.  Splitting these into two packages makes sense, because this
allows users to install the minimap2 package without python
dependencies.
2021-05-29 17:35:18 +00:00
brook
e0507013e4 biology/filter-fastq: add filter-fastq version 0.0.0.20210527 2021-05-27 17:13:15 +00:00
brook
3a68b9598a biology/filter-fastq: add filter-fastq version 0.0.0.20210527
Filter reads from a FASTQ file using a list of identifiers.

Each entry in the input FASTQ file (or files) is checked against all
entries in the identifier list. Matches are included by default, or
excluded if the --invert flag is supplied. Paired-end files are kept
consistent (in order).
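
The filtering rule above can be sketched in a few lines of Python. This is not the code of filter_fastq.py: it assumes exact matching of the first word of each header against a set of identifiers, and the names ``read_fastq``, ``read_id``, ``filter_fastq`` and ``filter_pairs`` are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into 4-line FASTQ records (an incomplete tail is dropped)."""
    stripped = (line.rstrip("\n") for line in lines)
    yield from zip(stripped, stripped, stripped, stripped)


def read_id(header: str) -> str:
    """Return the identifier from a header such as '@read1 extra text'."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def filter_fastq(
    lines: Iterable[str], identifiers: Set[str], invert: bool = False
) -> Iterator[Record]:
    """Keep records whose identifier is listed, or the others if `invert`."""
    for record in read_fastq(lines):
        if (read_id(record[0]) in identifiers) != invert:
            yield record


def filter_pairs(
    read1: Iterable[str], read2: Iterable[str], identifiers: Set[str], invert: bool = False
) -> Iterator[Tuple[Record, Record]]:
    """Filter paired files in lockstep so both outputs stay in the same order."""
    for first, second in zip(read_fastq(read1), read_fastq(read2)):
        if (read_id(first[0]) in identifiers) != invert:
            yield first, second


demo = ["@r1 desc\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "JJJJ\n"]
print(list(filter_fastq(demo, {"r1"})))
print(list(filter_fastq(demo, {"r1"}, invert=True)))
```

Deciding on the first mate of each pair and emitting both mates together is one simple way to keep paired-end output consistent.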

This is almost certainly not the most efficient way to implement this
filtering procedure. I tested a few different strategies and this one
seemed the fastest. Current timing with 16 processes is about 10
minutes per 1M paired reads with gzip'd input and output, depending on
the length of the identifier list to filter by.

usage: filter_fastq.py [-h] [-i INPUT] [-1 READ1] [-2 READ2] [-p NUM_THREADS]
                       [-o OUTPUT] [-f FILTER_FILE] [-v] [--gzip]
2021-05-27 17:11:42 +00:00
brook
f6d3a93579 Added biology/beagle version 5.2 2021-05-26 19:14:24 +00:00
brook
5320059e48 biology/beagle: added beagle 5.2
Introduction

Beagle is a software package for phasing genotypes and for imputing
ungenotyped markers. Beagle version 5.2 provides significantly faster
genotype phasing than version 5.1

Citation

If you use Beagle in a published analysis, please report the program
version and cite the appropriate article.

The Beagle 5.2 genotype imputation method is described in:

  B L Browning, Y Zhou, and S R Browning (2018). A one-penny imputed
  genome from next generation reference panels. Am J Hum Genet
  103(3):338-348. doi:10.1016/j.ajhg.2018.07.015

The most recent reference for Beagle's phasing method is:

  S R Browning and B L Browning (2007) Rapid and accurate haplotype
  phasing and missing data inference for whole genome association
  studies by use of localized haplotype clustering. Am J Hum Genet
  81:1084-1097. doi:10.1086/521987

This reference will be updated when the Beagle version 5 phasing
method is published.
2021-05-26 19:13:39 +00:00
brook
934eb80113 Added biology/racon 1.4.3 2021-05-26 18:54:29 +00:00
brook
3dfd1c7a4f biology/racon: add racon 1.4.3
## Description

Racon is intended as a standalone consensus module to correct raw
contigs generated by rapid assembly methods which do not include a
consensus step. The goal of Racon is to generate genomic consensus
which is of similar or better quality compared to the output generated
by assembly methods which employ both error correction and consensus
steps, while providing a speedup of several times compared to those
methods. It supports data produced by both Pacific Biosciences and
Oxford Nanopore Technologies.

Racon can be used as a polishing tool after the assembly with **either
Illumina data or data produced by third generation of
sequencing**. The type of input data is automatically detected.

Racon takes as input only three files: contigs in FASTA/FASTQ format,
reads in FASTA/FASTQ format and overlaps/alignments between the reads
and the contigs in MHAP/PAF/SAM format. Output is a set of polished
contigs in FASTA format printed to stdout. All input files **can be
compressed with gzip** (which will have impact on parsing time).

Racon can also be used as a read error-correction tool. In this
scenario, the MHAP/PAF/SAM file needs to contain pairwise overlaps
between reads **including dual overlaps**.

A **wrapper script** is also available to enable easier usage to the
end-user for large datasets. It has the same interface as racon but
adds two additional features from the outside. Sequences can be
**subsampled** to decrease the total execution time (accuracy might be
lower) while target sequences can be **split** into smaller chunks and
run sequentially to decrease memory consumption. Both features can be
run at the same time as well.
2021-05-26 18:53:39 +00:00
brook
7251d531ad Add biology/minimap2 2.18 2021-05-26 18:51:07 +00:00
brook
82215c7813 biology/minimap2: add minimap2 2.18
## Users' Guide

Minimap2 is a versatile sequence alignment program that aligns DNA or
mRNA sequences against a large reference database. Typical use cases
include: (1) mapping PacBio or Oxford Nanopore genomic reads to the
human genome; (2) finding overlaps between long reads with error rate
up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore
cDNA or Direct RNA reads against a reference genome; (4) aligning
Illumina single- or paired-end reads; (5) assembly-to-assembly
alignment; (6) full-genome alignment between two closely related
species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than
mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and
GMAP. It is more accurate on simulated long reads and produces
biologically meaningful alignment ready for downstream analyses. For
>100bp Illumina short reads, minimap2 is three times as fast as
BWA-MEM and Bowtie2, and as accurate on simulated data.  Detailed
evaluations are available from the minimap2 paper or the preprint.

Release 2.18-r1015 (9 April 2021)
---------------------------------

This release fixes multiple rare bugs in minimap2 and adds additional
functionality to paftools.js.

Changes to minimap2:

 * Bugfix: a rare segfault caused by an off-by-one error (#489)

 * Bugfix: minimap2 segfaulted due to an uninitialized variable (#622 and #625).

 * Bugfix: minimap2 parsed spaces as field separators in BED (#721). This led
   to issues when the BED name column contains spaces.

 * Bugfix: minimap2 `--split-prefix` did not work with long reference names
   (#394).

 * Bugfix: option `--junc-bonus` didn't work (#513)

 * Bugfix: minimap2 didn't return 1 on I/O errors (#532)

 * Bugfix: the `de:f` tag (sequence divergence) could be negative if there were
   ambiguous bases

 * Bugfix: fixed two undefined behaviors caused by calling memcpy() on
   zero-length blocks (#443)

 * Bugfix: there were duplicated SAM @SQ lines if option `--split-prefix` is in
   use (#400 and #527)

 * Bugfix: option -K had to be smaller than 2 billion (#491). This was caused
   by a 32-bit integer overflow.

 * Improvement: optionally compile against SIMDe (#597). Minimap2 should work
   with IBM POWER CPUs, though this has not been tested. To compile with SIMDe,
   please use `make -f Makefile.simde`.

 * Improvement: more informative error message for I/O errors (#454) and for
   FASTQ parsing errors (#510)

 * Improvement: abort given malformatted RG line (#541)

 * Improvement: better formula to estimate the `dv:f` tag (approximate sequence
   divergence). See DOI:10.1101/2021.01.15.426881.

 * New feature: added the `--mask-len` option to fine control the removal of
   redundant hits (#659). The default behavior is unchanged.
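
The `-K` limit mentioned in the list above is the typical symptom of storing a large value in a signed 32-bit integer. The Python sketch below reproduces the wrap-around with `ctypes`. It illustrates the general mechanism only and makes no claim about how minimap2 stores the option internally.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647, roughly "2 billion"


def as_int32(value: int) -> int:
    """Return `value` as it reads back from a signed 32-bit integer."""
    return ctypes.c_int32(value).value


def as_int64(value: int) -> int:
    """Return `value` as it reads back from a signed 64-bit integer."""
    return ctypes.c_int64(value).value


print(as_int32(INT32_MAX))      # still representable
print(as_int32(2_500_000_000))  # wraps around to a negative number
print(as_int64(2_500_000_000))  # fits once the field is 64 bits wide
```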

Changes to mappy:

 * Bugfix: mappy caused segmentation fault if the reference index is not
   present (#413).

 * Bugfix: fixed a memory leak via 238b6bb3

 * Change: always require Cython to compile the mappy module (#723). Older
   mappy packages at PyPI bundled the C source code generated by Cython such
   that end users did not need to install Cython to compile mappy. However, as
   Python 3.9 is breaking backward compatibility, older mappy does not work
   with Python 3.9 anymore. We have to add this Cython dependency as a
   workaround.

Changes to paftools.js:

 * Bugfix: the "part10-" line from asmgene was wrong (#581)

 * Improvement: compatibility with GTF files from GenBank (#422)

 * New feature: asmgene also checks missing multi-copy genes

 * New feature: added the misjoin command to evaluate large-scale misjoins and
   megabase-long inversions.

Despite the many bug fixes and minor improvements, the core algorithm
stays the same. This version of minimap2 produces nearly identical alignments
to v2.17 except in very rare corner cases.

Now unimap is recommended over minimap2 for aligning long contigs against a
reference genome. It often takes less wall-clock time and is much more
sensitive to long insertions and deletions.

(2.18: 9 April 2021, r1015)
2021-05-26 18:49:20 +00:00
brook
53256fe620 Add biology/miniasm 0.3. 2021-05-26 18:46:41 +00:00
brook
3e5a5a2e30 biology/miniasm: add miniasm 0.3
Miniasm is a very fast OLC-based *de novo* assembler for noisy long
reads. It takes all-vs-all read self-mappings (typically by minimap)
as input and outputs an assembly graph in the GFA format. Different
from mainstream assemblers, miniasm does not have a consensus step. It
simply concatenates pieces of read sequences to generate the final
unitig sequences. Thus the per-base error rate is similar to the raw
input reads.

Miniasm is still at an early development stage. It has only been tested
on a dozen PacBio and Oxford Nanopore (ONT) bacterial data
sets. Including the mapping step, it takes about 3 minutes to assemble
a bacterial genome. Under the default setting, miniasm assembles 9 out
of 12 PacBio data sets and 3 out of 4 ONT data sets into a single
contig. The 12 PacBio data sets are [PacBio E. coli
sample][PB-151103], [ERS473430][ERS473430], [ERS544009][ERS544009],
[ERS554120][ERS554120], [ERS605484][ERS605484],
[ERS617393][ERS617393], [ERS646601][ERS646601],
[ERS659581][ERS659581], [ERS670327][ERS670327],
[ERS685285][ERS685285], [ERS743109][ERS743109] and a deprecated PacBio
E. coli data set. ONT data are acquired from the Loman Lab.

For a *C. elegans* PacBio data set (only 40X are used, not the whole
dataset), miniasm finishes the assembly, including read overlapping,
in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50
is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of
1.61Mb. This dotter plot gives a global view of the miniasm assembly
(on the X axis) and the HGAP3 assembly (on Y). They are broadly
comparable. Of course, the HGAP3 consensus sequences are much more
accurate. In addition, on the whole data set (assembled in ~30 min),
the miniasm N50 is reduced to 1.79Mb. Miniasm still needs
improvements.
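
For readers unfamiliar with the N50 figures quoted above, here is a minimal
Python sketch of the usual definition: sort the contigs from longest to
shortest and report the length of the contig at which the running total first
reaches half of the assembly size. The function name `n50` and the toy lengths
are illustrative only; they are not part of miniasm or HGAP3.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running total
        # to at least half of the assembly size.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, not real assembly data.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A larger N50 means that more of the assembly sits in long contigs, which is
why the text compares the two assemblers on this number.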

Miniasm confirms that at least for high-coverage bacterial genomes, it
is possible to generate long contigs from raw PacBio or ONT reads
without error correction. It also shows that minimap can be used as a
read overlapper, even though it is probably not as sensitive as the
more sophisticated overlappers such as MHAP and DALIGNER. Coupled with
long-read error correctors and consensus tools, miniasm may also be
useful to produce high-quality assemblies.

## Algorithm Overview

1. Crude read selection. For each read, find the longest contiguous region
   covered by three good mappings. Get an approximate estimate of read
   coverage.

2. Fine read selection. Use the coverage information to find the good regions
   again but with more stringent thresholds. Discard contained reads.

3. Generate a string graph. Prune tips, drop weak overlaps and
   collapse short bubbles. These procedures are similar to those
   implemented in short-read assemblers.

4. Merge unambiguous overlaps to produce unitig sequences.
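
As a rough illustration of step 1, the following Python sketch finds the
longest contiguous region of a read that is covered by at least three mapping
intervals, using a sweep over interval endpoints. It is a sketch under stated
assumptions, not miniasm's actual code: the function name
`longest_covered_region`, the half-open `(start, end)` interval representation
and the `min_depth` parameter are all hypothetical, and it does not model what
makes a mapping "good".

```python
def longest_covered_region(intervals, min_depth=3):
    """Return (start, end) of the longest region covered by at least
    min_depth half-open intervals, or None if no such region exists."""
    events = []
    for start, end in intervals:
        if start < end:  # ignore empty or inverted intervals
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions, process starts before ends so that regions
    # which merely touch are treated as contiguous.
    events.sort(key=lambda e: (e[0], -e[1]))

    depth = 0
    region_start = None
    best = None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_depth <= depth:
            # Coverage has just reached the threshold.
            region_start = pos
        elif depth < min_depth <= previous:
            # Coverage has just dropped below the threshold.
            if pos > region_start and (
                best is None or pos - region_start > best[1] - best[0]
            ):
                best = (region_start, pos)
    return best


if __name__ == "__main__":
    # Hypothetical mapping coordinates on a single read.
    mappings = [(0, 100), (10, 90), (20, 80), (50, 200), (60, 210), (70, 220)]
    print(longest_covered_region(mappings))
```

Step 2 could in principle reuse the same routine with a stricter `min_depth`
derived from the coverage estimate, although the thresholds miniasm actually
uses are not stated here.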

## Limitations

1. Consensus base quality is similar to input reads (may be fixed with a
   consensus tool).

2. Only tested on a dozen high-coverage PacBio/ONT data sets (more testing
   needed).

3. Prone to collapsing repeats or segmental duplications longer than the input
   reads (hard to fix without error correction).
2021-05-26 18:44:44 +00:00
wiz
6eae1297d5 *: recursive bump for perl 5.34 2021-05-24 19:49:01 +00:00
nia
fa308d6128 py-dnaio: unbreak pkgsrc tree. revert removal of PYTHON_VERSIONS_INCOMPATIBLE. 2021-05-22 23:24:17 +00:00
adam
201d4f2c77 py-dnaio: updated to 0.5.1
v0.5.1
Add py.typed and distribute .pyi files
2021-05-21 11:38:13 +00:00
mrg
0a843265c7 various fixes for arm64 big endian support.
most of these simply extend matching from "aarch64" to "aarch64eb"
in various forms of code.  most remaining uses in pkgsrc of
"MACHINE_ARCH == aarch64" are because of missing aarch64eb support,
such as most of the binary-bootstrap requiring languages like rust,
go, and java.

no pkg-bump because this shouldn't change packages on systems that
could already build all of these.
2021-04-25 07:51:24 +00:00
nia
dfbfcffef5 py-dnaio: mark incompatible with python 2 2021-04-22 08:38:59 +00:00
nia
8ec3dd6866 py-cutadapt: add missing build dependency 2021-04-22 08:36:59 +00:00
adam
da0a125726 revbump for boost-libs 2021-04-21 13:24:06 +00:00
adam
9d0e79c401 revbump for textproc/icu 2021-04-21 11:40:12 +00:00
wiz
6eaa8d1255 *: remove dead download locations 2021-04-21 09:12:23 +00:00
wiz
c23272545f *: remove dead download location 2021-04-21 09:11:13 +00:00
pin
ebdbdf11f7 biology/molsketch: update to 0.7.2
This is just a small release to fix some issues with the (possibly) renamed
*.so/*.dll files after removing Qt5 support. In case you were using Molsketch
prior to version 0.7.1, it will ask you to update the corresponding settings at
start up.
For Windows users, there will be an online installer, as in version 0.7.1, but
this will now reside in a separate folder and not be updated as frequently as
Molsketch itself. Updates will instead be made available in the online
repository on GitHub, from which the installer will fetch them. Just start the
installer and select the update option.
2021-04-04 19:10:20 +00:00
adam
97dc3c02d6 py-cutadapt: updated to 3.4
v3.4 (2021-03-30)
-----------------
* :issue:`481`: An experimental single-file Windows executable of Cutadapt
  is `available for download on the GitHub "releases"
  page <https://github.com/marcelm/cutadapt/releases>`_.
* :issue:`517`: Report correct sequence in info file if read was reverse complemented
* :issue:`517`: Added a column to the info file that shows whether the read was
  reverse-complemented (if ``--revcomp`` was used)
* :issue:`320`: Fix (again) "Too many open files" when demultiplexing
2021-03-31 09:23:56 +00:00