bamnostic package

bamnostic is an OS-agnostic, Pure Python BAM file parser.

The purpose of bamnostic is to open up BAM query capabilities to all current OS environments. Special care was taken to allow bamnostic to run in all versions of Python 2.7 onward. Furthermore, bamnostic does not require any packages outside of the standard library. Therefore, bamnostic will run on all stable versions of PyPy.

Note

SAM and CRAM support is not yet implemented

The three main classes of bamnostic are:

  1. bamnostic.AlignmentFile: the BAM file handler
  2. bamnostic.AlignedSegment: an aligned read object interface
  3. bamnostic.bai.Bai: if the BAM file has an associated index file (preferred), this is the file handler for it.

Note

Within the scope of personal research, reading BAM files is the only fully supported IO. The skeleton for writing BAM files is present, just not connected.

This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.

Core Module (bamnostic.core)

class bamnostic.core.Cigar(op_code, n_op, op_id, op_name)

Bases: tuple

namedtuple for handling CIGAR data

Parameters:
  • op_code (int) – CIGAR operation index
  • n_op (int) – number of operations for a given op_code
  • op_id (str) – the string representation of the CIGAR representation (‘MIDNSHP=XB’)
  • op_name (str) – Longer string name for operation
n_op

Alias for field number 1

op_code

Alias for field number 0

op_id

Alias for field number 2

op_name

Alias for field number 3

bamnostic.core.offset_qual(qual_string)[source]

Offsets the ASCII-encoded quality string to represent PHRED score.

Every base that is in the alignment is assigned a Phred score. A Phred score (\(Q\)) is defined as \(Q=-10\log_{10}P\), where \(P\) is the base-calling error probability. Phred quality scores tend range from 10 to 60. These qualities are then offset by 33 and ASCII-encoded for readability and storage.

Parameters:qual_string (str or bytes) – Phred quality scores without offset
Returns:ASCII-encoded Phred scores offest by adding 33 to base score.
Return type:(str)

Examples

>>> qual_score = ''
>>> ''.join(offset_qual(qual_score))
'<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;'
class bamnostic.core.AlignmentFile(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False, compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]

Bases: bamnostic.bam.BamReader, bamnostic.bam.BamWriter

Wrapper to allow drop in replacement for BAM functionality in a pysam-like API.

Parameters:
  • filepath_or_object (str | file) – the path or file object of the BAM file
  • mode (str) – Mode for reading. BAM files are binary by nature (default: ‘rb’).
  • max_cache (int) – number of desired LRU cache size, preferably a multiple of 2 (default: 128).
  • index_filename (str) – path to index file (BAI) if it is named differently than the BAM file (default: None).
  • filename (str | file) – synonym for filepath_or_object
  • check_header (bool) – Obsolete method maintained for backwards compatibility (default: False)
  • check_sq (bool) – Inspect BAM file for @SQ entries within the header
  • reference_filename (str) – Not implemented. Maintained for backwards compatibility
  • filepath_index (str) – synonym for index_filename
  • require_index (bool) – require the presence of an index file or raise (default: False)
  • duplicate_filehandle (bool) – Not implemented. Raises warning if True.
  • ignore_truncation (bool) – Whether or not to allow trucated file processing (default: False).
class bamnostic.core.AlignedSegment(_io)[source]

Bases: object

Main class for handling reads within the BAM

query_alignment_end

One past the final index position of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

query_alignment_start

The starting index of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

query_name

Alias for the read name

query_alignment_sequence

The sequence of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

query_alignment_length

The length of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

reference_start

The starting index of the aligned portion of the read

reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping

reference_end

One past the final index position of the aligned portion of the read

reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping

reference_length

The length of the aligned portion of the read

reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping

query_sequence

The sequence of the read

query_ here means the query as seen in the BAM file, and therefore includes clipping

query_length

The length of sequence of the read

query_ here means the query as seen in the BAM file, and therefore includes clipping

next_reference_id

Return the reference id of mate read

next_reference_name

Return the reference name of mate read

next_reference_start

Return the reference name of mate read

is_duplicate

If the read is set as PCR or optical duplicate according to read flag

is_paired

If the read is set as paired according to read flag

is_proper_pair

If each segment properly aligned according to the aligner

is_qcfail

If the read is set as QC fail according to read flag

is_read1

If the read is set as the first read of mate pair according to read flag (implies the read is mated)

is_read2

If the read is set as the second read of mate pair according to read flag (implies the read is mated)

is_reverse

If the read is set as reversed according to read flag

is_secondary

If the read is set as secondary alignment according to read flag

is_supplementary

If the read is set as a supplementary alignment according to read flag

is_unmapped

If the read is set as unmapped according to read flag

mate_is_reverse

Check the read flag to see if the mat is reverse strand (implies the read is mated)

mate_is_unmapped

Check the read flag to see if mate is unmapped (implies the read is mated)

mapping_quality

The mapping quality (MAPQ) of the read

get_tag(tag, with_value_type=False)[source]

Gets the value associated with a given tag key.

Parameters:
  • tag (str) – the tag of interest
  • with_value_type (bool) – return what kind of value the tag
Returns:

the value associated with a given tag or the value and type of value (as seen in BAM format)

get_tags(with_value_type=False)[source]

Returns all the tags for a given read

Parameters:with_value_type (bool) – return the tag value type (as defined by BAM format)
Returns:list of tag tuples (with or without tag value type)
Return type:f_tags(list)
get_cigar_stats()[source]

Gets the counts of each CIGAR operation in the read and number of nucleotides related to those given operations.

Returns:list of CIGAR operation counts nt_counts (list): list of nucleotide counts for each operation
Return type:op_blocks (list)
ref_gen()[source]

Recreates the reference sequence associated with the given segment.

Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome.

Returns:generated reference sequence
Return type:(str)
Raises:KeyError – if read does not contain MD tag
to_bam()[source]

Writes the alignment record to a BAM file

Parameters:bam_file (string or bamnostic.bam.BamWriter) – BAM file path or open bam file in a write mode
bamnostic.core.unpack_int32()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.core.unpack_int32L()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

Bgzip BAM Format (BGZF) file handlers (bamnostic.bgzf)

bamnostic.bgzf.unpack_gzip_header()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_subfields()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_gzip_integrity()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_bgzf_metaheader()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.get_block(handle, offset=0)[source]

Pulls out entire GZIP block

Used primarily for copying the header block of a BAM file. However, it can be used to copy any BGZF block within a BAM file that starts at the given offset.

Note

Does not progress file cursor position.

Parameters:
  • handle (file) – open BAM file
  • offset (int) – offset of BGZF block (default: 0)
Returns:

Complete BGZF block

Raises:

ValueError – if the BGZF block header is malformed

Example

>>> bam_header = get_block(bamnostic.example_bam)
>>> try:
...     bam_header.startswith(b'\x1f\x8b\x08\x04')
... except SyntaxError:
...     bam_header.startswith('\x1f\x8b\x08\x04')
True
class bamnostic.bgzf.BgzfReader(filepath_or_object, mode='rb', max_cache=None, filename=None, duplicate_filehandle=None, ignore_truncation=False)[source]

Bases: object

The BAM reader. Heavily modified from Peter Cock’s BgzfReader.

tell()[source]

Return a 64-bit unsigned BGZF virtual offset.

seek(virtual_offset)[source]

Seek to a 64-bit unsigned BGZF virtual offset.

A virtual offset is a composite number made up of the compressed offset (coffset) position of the start position of the BGZF block that the position originates within, and the uncompressed offset (uoffset) within the deflated BGZF block where the position starts. The virtual offset is defined as

virtual_offset = coffset << 16 | uoffset

Parameters:

virtual_offset (int) – 64-bit unsigned composite byte offset

Returns:

an echo of the new position

Return type:

virtual_offset (int)

Raises:
  • ValueError – if within block offset is more than block size
  • AssertionError – if the start position is not the block start position

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.seek(10)
10
>>> bam.seek(bamnostic.utils.make_virtual_offset(0, 42))
Traceback (most recent call last):
    ...
ValueError: Within offset 42 but block size only 38
read(size=-1)[source]

Read method for the BGZF module.

Parameters:

size (int) – the number of bytes to read from file. Advances the cursor.

Returns:

byte string of length size

Return type:

data (bytes)

Raises:
  • NotImplementedError – if the user tries to read the whole file
  • AssertionError – if read does not return any data
readline()[source]

Read a single line for the BGZF file.

Binary operations do not support readline(). Code is commented out for posterity sake

close()[source]

Close BGZF file.

seekable()[source]

Return True indicating the BGZF supports random access.

Note

Modified from original Bio.BgzfReader: checks to see if BAM file has associated index file (BAI)

isatty()[source]

Return True if connected to a TTY device.

fileno()[source]

Return integer file descriptor.

class bamnostic.bgzf.BgzfWriter(filepath_or_object, mode='wb', compresslevel=6)[source]

Bases: object

Define a BGZFWriter object.

write(data)[source]

Write method for the class.

flush()[source]

Flush data explicitly.

close()[source]

Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.

samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, and if it is missing warns the BAM file may be truncated. In addition to samtools writing this block, so too does bgzip - so this implementation does too.

tell()[source]

Return a BGZF 64-bit virtual offset.

seekable()[source]

Return True indicating the BGZF supports random access.

isatty()[source]

Return True if connected to a TTY device.

fileno()[source]

Return integer file descriptor.

bamnostic.bgzf.unpack_int32()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_int32L()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

Binary Alignment Map (BAM) file handlers (bamnostic.bam)

class bamnostic.bam.BAMheader(_io)[source]

Bases: object

Parse and store the BAM file header

The BAM header is the plain text and byte-encoded metadata of a given BAM file. Information stored in the header are the number, length, and name of the reference sequences that reads were aligned to; version of software used; read group identifiers; etc. The BAM format also stipulates that the first block of any BAM file should be reserved just for the BAM header block.

_io

opened BAM file object

Type:file
SAMheader

parsed dictionary of the SAM header

Type:dict
n_refs

number of references

Type:int
refs

reference names and lengths listed in the BAM header

Type:dict
to_header()[source]

Allows the user to directly copy the header of another BAM file

Returns:packed byte code of entire header BGZF block
Return type:(bytesarray)
class bamnostic.bam.BamReader(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False)[source]

Bases: bamnostic.bgzf.BgzfReader

The BAM reader. Heavily modified from Peter Cock’s BgzfReader.

header

representation of header data (if present)

lengths

lengths of references listed in header

Type:list of int
nocoordinate

number of reads that have no coordinates

Type:int
nreferences

number of references in header

Type:int
ref2tid

refernce names and refID dictionary

Type:dict of str, int
references

names of references listed in header

Type:list of str
text

SAM header (if present)

Type:str
unmapped

number of unmapped reads

Type:int

Note

This implementation is likely to change. While the API was meant to mirror pysam, it makes sense to include the pysam-like API in an extension that will wrap the core reader. This would be a major refactor, and therefore will not happen any time soon (30 May 2018).

check_index(index_filename=None, req_idx=False)[source]

Checks to make sure index file is available. If not, it disables random access.

Parameters:
  • index_filename (str) – path to index file (BAI) if it does not fit naming convention (default: None).
  • req_idx (bool) – Raise error if index file is not present (default: False).
Returns:

True if index is present, else False

Return type:

(bool)

Raises:

IOError – If the index file is closed or index could not be opened

Warns:

UserWarning – If index could not be loaded. Random access is disabled.

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam)
>>> bam.check_index(bamnostic.example_bam + '.bai')
True
>>> bam.check_index('not_a_file.bai')
False
nocoordinate

Get the number of reads without coordiantes according to the statistics recorded in the index.

Returns:sum of reads without coordinates
Return type:(int)
mapped

Get the number of mapped reads according to the statistics recorded in the index.

Returns:sum of mapped reads
Return type:(int)
unmapped

Get the number of unmapped reads without coordiantes and without coordinate.

Returns:sum of unmapped reads and reads without coordinates
Return type:(int)
references

Get all references used in alignment.

Returns:reference names
Return type:(tuple of str)
nreferences

Get the number of references listed in alignment

Returns:count of references
Return type:(int)
lengths

Get all reference lengths used in alignment.

Returns:reference lengths
Return type:(tuple of int)
has_index()[source]

Checks if file has index and it is open

Returns:True if present and opened, else False
Return type:bool
fetch(contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)[source]

Creates a generator that returns all reads within the given region. (inclusive, exclusive)

Parameters:
  • contig (str) – name of reference/contig
  • start (int) – start position of region of interest (0-based)
  • stop (int) – stop position of region of interest (0-based)
  • region (str) – SAM region formatted string. Accepts tab-delimited values as well
  • tid (int) – the refID or target id of a reference/contig
  • until_eof (bool) – iterate until end of file
  • mutiple_iterators (bool) – allow multiple iterators over region. Not Implemented. Notice: each iterator will open up a new view into the BAM file, so overhead will apply.
  • reference (str) – synonym for contig
  • end (str) – synonym for stop
Yields:

reads over the region of interest if any

Raises:
  • ValueError – if the genomic coordinates are out of range or invalid
  • KeyError – Reference is not found in header

Note

SAM region formatted strings take on the following form: ‘chr1:100000-200000’

Usage:
AlignmentFile.fetch(contig='chr1', start=1, stop= 1000)
AlignmentFile.fetch('chr1', 1, 1000)
AlignmentFile.fetch('chr1:1-1000')
AlignmentFile.fetch('chr1', 1)
AlignmentFile.fetch('chr1')

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> next(bam.fetch('chr1', 1, 100)) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
EAS56_57:6:190:289:82 ... MF:C:192
>>> next(bam.fetch('chr10', 1, 10))
Traceback (most recent call last):
    ...
KeyError: 'chr10 was not found in the file header'
>>> next(bam.fetch('chr1', 1700, 1701))
Traceback (most recent call last):
    ...
ValueError: Genomic region out of bounds.
>>> next(bam.fetch('chr1', 100, 10))
Traceback (most recent call last):
    ...
AssertionError: Malformed region: start should be <= stop, you entered 100, 10
count(contig=None, start=None, stop=None, region=None, until_eof=False, tid=None, read_callback='nofilter', reference=None, end=None)[source]

Count the number of reads in the given region

Note: this counts the number of reads that overlap the given region.

Can potentially make use of a filter for the reads (or custom function that returns True or False for each read).

Parameters:
  • contig (str) – the reference name (Default: None)
  • reference (str) – synonym for contig (Default: None)
  • start (int) – 0-based inclusive start position (Default: None)
  • stop (int) – 0-based exclusive start position (Default: None)
  • end (int) – Synonym for stop (Default: None)
  • region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)
  • until_eof (bool) – count number of reads from start to end of file Note, this can potentially be an expensive operation. (Default: False)
  • read_callback (str|function) –

    select (or create) a filter of which reads to count. Built-in filters:

    • all: skips reads that contain the following flags:
      • 0x4 (4): read unmapped
      • 0x100 (256): not primary alignment
      • 0x200 (512): QC Fail
      • 0x400 (1024): PCR or optical duplicate
    • nofilter: uses all reads (Default)
    • The user can also supply a custom function that
      returns boolean objects for each read
Returns:

count of reads in the given region that meet parameters

Return type:

(int)

Raises:
  • ValueError – if genomic coordinates are out of range or invalid or random access is disabled
  • RuntimeError – if read_callback is not properly set
  • KeyError – Reference is not found in header
  • AssertionError – if genomic region is malformed

Notes

SAM region formatted strings take on the following form: ‘chr1:100000-200000’

Usage:
AlignmentFile.count(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count(‘chr1’, 1, 1000) AlignmentFile.count(‘chr1:1-1000’) AlignmentFile.count(‘chr1’, 1) AlignmentFile.count(‘chr1’)

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.count('chr1', 1, 100)
2
>>> bam.count('chr1', 1, 100, read_callback='all')
1
>>> bam.count('chr10', 1, 10)
Traceback (most recent call last):
    ...
KeyError: 'chr10 was not found in the file header'
>>> bam.count('chr1', 1700, 1701)
Traceback (most recent call last):
    ...
ValueError: Genomic region out of bounds.
>>> bam.count('chr1', 100, 10)
Traceback (most recent call last):
    ...
AssertionError: Malformed region: start should be <= stop, you entered 100, 10
count_coverage(contig=None, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None, base_quality_threshold=0)[source]

Counts the coverage of each base supported by a read the given interval.

Given an interval (inclusive, exclusive), this method pulls each read that overlaps with the region. To ensure that the read truly overlaps with the region, the CIGAR string is required. These reads can further be filtered out by their flags, MAPQ qualities, or custom filtering function. Using the CIGAR string, the aligned portion of the read is traversed and the presence of each nucleotide base is tallied into respective arrays. Additionally, the user can choose to filter the counted bases based on its base quality score that is stored in the quality string.

Parameters:
  • contig (str) – the reference name (Default: None)
  • reference (str) – synonym for contig (Default: None)
  • start (int) – 0-based inclusive start position (Default: None)
  • stop (int) – 0-based exclusive start position (Default: None)
  • end (int) – Synonym for stop (Default: None)
  • region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)
  • quality_threshold (int) – MAPQ quality threshold (Default: 15)
  • base_quality_threshold (int) – base quality score threshold (Default: 0)
  • read_callback (str|function) –

    select (or create) a filter of which reads to count. Built-in filters:

    • all: skips reads that contain the following flags:
      • 0x4 (4): read unmapped
      • 0x100 (256): not primary alignment
      • 0x200 (512): QC Fail
      • 0x400 (1024): PCR or optical duplicate
    • nofilter: uses all reads (Default)
    • The user can also supply a custom function that returns boolean objects for each read
Returns:

Four arrays in the order of A, C, G, T

Return type:

(array.array)

Raises:
  • ValueError – if genomic coordinates are out of range or invalid, random access is disabled, or nucleotide base is unrecognized
  • RuntimeError – if read_callback is not properly set
  • KeyError – Reference is not found in header
  • AssertionError – if genomic region is malformed

Notes

SAM region formatted strings take on the following form: ‘chr1:100-200’

Usage:
AlignmentFile.count_coverage(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count_coverage(‘chr1’, 1, 1000) AlignmentFile.count_coverage(‘chr1:1-1000’) AlignmentFile.count_coverage(‘chr1’, 1) AlignmentFile.count_coverage(‘chr1’)

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> for arr in bam.count_coverage('chr1', 100, 150): # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
...     print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist()))))
array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0])
array('L', [0, 0, 0, 0, ..., 15, 0, 14, 0, 0])
array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 14])
array('L', [0, 0, 0, 0, ..., 0, 15, 0, 14, 0])
>>> for arr in bam.count_coverage('chr1', 100, 150, quality_threshold=20, base_quality_threshold=25): # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
...     print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist()))))
array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0])
array('L', [0, 0, 0, 0, ..., 13, 0, 14, 0, 0])
array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 13])
array('L', [0, 0, 0, 0, ..., 0, 14, 0, 13, 0])
get_index_stats()[source]

Inspects the index file (BAI) for alignment statistics.

Every BAM index file contains metrics regarding the alignment process for the given BAM file. The stored data are the number of mapped and unmapped reads for a given reference. Unmapped reads are paired end reads where only one part is mapped. Additionally, index files also contain the number of unplaced unmapped reads. This is stored within the nocoordinate instance attribute (if present).

Returns:list of tuples for each reference in the order seen in the header. Each tuple contains the number of mapped reads, unmapped reads, and the sum of both.
Return type:idx_stats (list of tuple)
Raises:AssertionError – if the index file is not available

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.get_index_stats()
[(1446, 18, 1464), (1789, 17, 1806)]
>>> bam_no_bai = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb', index_filename='not_a_file.bai')
>>> bam_no_bai.get_index_stats()
Traceback (most recent call last):
    ...
AssertionError: No index available
is_valid_tid(tid)[source]

Return True if TID/RefID is valid.

Returns:True if TID/refID is valid, else False

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.is_valid_tid(0)
True
>>> bam.is_valid_tid(10) # because there are only 2 in this file
False
get_reference_name(tid)[source]

Convert TID/refID to reference name.

The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.

Parameters:tid (int) – TID/refID of desired reference/contig
Returns:String representation of chromosome if valid, else None
Raises:KeyError – if TID/refID is not valid

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.get_reference_name(0) # doctest: +ALLOW_UNICODE
'chr1'
>>> bam.get_reference_name(10)
Traceback (most recent call last):
    ...
KeyError: '10 is not a valid TID/refID for this file.'
get_tid(reference)[source]

Convert reference/contig name to refID/TID.

The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.

Parameters:reference (str) – reference/contig name
Returns:the TID/refID of desired reference/contig
Return type:(int)
Raises:KeyError – if reference name not found file header

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.get_tid('chr1')
0
>>> bam.get_tid('chr10')
Traceback (most recent call last):
    ...
KeyError: 'chr10 was not found in the file header'
mate(AlignedSegment)[source]

Gets the mate to a given AlignedSegment.

Note

Slow, when compared to the C-API. Not meant for high-throughput analysis.

Does not advance current iterator position.

Parameters:AlignedSegment (bamnostic.AlignedSegment) – a bamnostic AlignedSegment read with a mate
Returns:if read has a valid mate, else None
Return type:(bamnostic.AlignedSegment)
Raises:ValueError – if AlignedSegment is unpaired
head(n=5, multiple_iterators=False)[source]

List out the first n reads of the file.

This method is primarily used when doing an initial exploration of the data. Whether or not multiple_iterators is used, cursor position within the file will not change.

Note

Using multiple_interators opens a new file object of the same file currently in use and, thus, impacts the memory footprint of your analysis.

Parameters:
  • n (int) – number of aligned reads to print (default: 5)
  • mutliple_iterators (bool) – Whether to use current file object or create a new one (default: False).
Returns:

list of n reads from the front of the BAM file

Return type:

head_reads (list of AlignedSegment)

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.head(n=5, multiple_iterators = False)[0] # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
EAS56_57:6:190:289:82       ...     MF:C:192
>>> bam.head(n = 5)[1] # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
EAS56_57:6:190:289:82       ...     UQ:C:0
next()[source]

Return the next line.

seekable()[source]

Return True indicating the BGZF supports random access.

Note

Modified from original Bio.BgzfReader: checks to see if BAM file has associated index file (BAI)

class bamnostic.bam.BamWriter(filepath_or_object, mode='xb', compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]

Bases: bamnostic.bgzf.BgzfWriter

write_header(copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]

Modified code from Peter Cock’s [BamWriter](https://github.com/peterjc/biopython/blob/SamBam2015/Bio/Sequencing/SamBam/__init__.py#L1714)

Writes the header into a BAM file

Parameters:
  • copy_header (filepath_or_object) – copy the header from another BAM file
  • header (bytes) – SAM-style header
  • reference_names (list of str) – list of all reference names used in file
  • reference_lengths (list of int) – list of all reference lengths used in file
write(data)[source]

Write method for the class.

bamnostic.bam.unpack_int32()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bam.unpack_int32L()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

BAM Index (BAI) file handler (bamnostic.bai)

BAI file parser

Copyright © 2018, Marcus D. Sherman

This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.

The Binary Alignment Map (BAM) format (defined at https://samtools.github.io/hts-specs/SAMv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a BAM index (BAI) file is created. Generally speaking, a BAI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the BAI is produced, random access to the BAM is available.

This script is for parsing the binary encoded BAI file, inherently making it human readable. Furthermore, it allows subsections of the BAI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic

bamnostic.bai.unpack_chunk()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_intervals()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_bid_nchunk()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_unmapped()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

class bamnostic.bai.RefIdx(start_offset, end_offset, n_bins)

Bases: tuple

end_offset

Alias for field number 1

n_bins

Alias for field number 2

start_offset

Alias for field number 0

bamnostic.bai.reg2bin(rbeg, rend)[source]

Finds the largest superset bin of region. Numeric values taken from hts-specs

Parameters:
  • rbeg (int) – inclusive beginning position of region
  • rend (int) – exclusive end position of region
Returns:

distinct bin ID for largest superset bin of region

Return type:

(int)

bamnostic.bai.reg2bins(rbeg, rend)[source]

Generates bin ids which overlap the specified region.

Parameters:
  • rbeg (int) – inclusive beginning position of region
  • rend (int) – exclusive end position of region
Yields:

(int) – bin IDs for overlapping bins of region

Raises:

AssertionError (Exception) – if the range is malformed or invalid

class bamnostic.bai.Bai(filename)[source]

Bases: object

This class defines the bam index file object and its interface.

The purpose of this class is the binary parsing of the bam index file (BAI) associated with a given bam file. When queried, the Bai object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.

Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)

_io

opened BAI file object

Type:fileObject
_LINEAR_INDEX_WINDOW

constant of the linear interval window size

Type:int
_UNMAP_BIN

constant for bin ID of unmapped read stats

Type:int
magic

first 4 bytes of file. Must be equal to b’BAI’

Type:bytes
n_refs

number of references in BAI

Type:int
unmapped

dictionary of the unmapped read stats by each reference

Type:dict
current_ref

dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.

Type:None|dict
ref_indices

dictionary of reference ids and their start/stop offsets within the BAI file

Type:dict
n_no_coord

if present in BAI, is the number of reads that have no coordinates

Type:None|int
_last_pos

used for indexing, the byte position of the file head.

Type:int
get_chunks(n_chunks)[source]

Simple generator for unpacking chunk data

Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.

Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.

Parameters:n_chunks (int) – number of chunks to be unpacked from stream
Returns:
a list of Chunk objects with the attributes of
chunks[i] are .voffset_beg and voffset_end
Return type:chunks (list)
get_ints(n_int)[source]

Unpacks n_int number of interval virtual offsets from stream

A linear interval is defined as a 16384 bp window along a given reference. The value stored in the BAI is the virtual offset of the first read within that given interval. This virtual offset is the byte offset (coffset) of the start of the BGZF block that contains the beginning of the read and the byte offset (uoffset) within the uncompressed data of the residing BGZF block to that first read.

Note: a caveat to using linear interval with long reads: A long read can span multiple linear intervals. As such, the current encoding could potentially shift the expected region of interest to the left more than expected.

Parameters:n_int (int) – number of intervals to unpack
Returns:list of virtual offsets for n_int number of linear intervals
Return type:intervals (list)
get_bins(n_bins, ref_id=None, idx=False)[source]

Simple function that iteratively unpacks the data of a number (n_bin) of bins.

As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in the BAI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.

As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True

Parameters:n_int (int) – number of bins to be unpacked from stream
Returns:
None if just indexing the index file or a dictionary
of bin_id: chunks pairs
Return type:bins (None | dict)
Raises:AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly
get_ref[source]

Interatively unpacks all the bins, linear intervals, and chunks for a given reference

A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.

This function also serves to “index” the BAI file such that, if it is invoked by setting ids=True, will do a single pass through the BAI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing the BAI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.

Parameters:
  • ref_id (None|int) – used for random access or indexing the BAI
  • idx (bool) – Flag for setting whether or not to run an index of the BAI
Returns:

namedtuple containing the byte offsets of the reference start, stop, and number of bins or Ref: namedtuple containing a dictionary of bins and list of linear intervals

Return type:

RefIdx

Raises:

AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.

query(ref_id, start, stop=-1)[source]

Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start

Parameters:
  • ref (int) – which reference/chromosome TID
  • start (int) – left most bp position of region (zero-based)
  • stop (int) – right most bp position of region (zero-based)
Returns:

the voffset_beg of the first chunk given the chunk’s voffset_end

is greater than the voffset of the linear index that overlaps the region of interest’s start offset

Return type:

(int)

seek(offset=None, whence=0)[source]

Simple seek function for binary files

Parameters:
  • offset (None|int) – byte offset from whence to move the file head to.
  • whence (int) – 0 := from start of file, 1:= from current position, 2:= from end of file
Returns:

new byte position of file head

Return type:

(int)

Raise:
ValueError (Exception): if the offset is not an integer or is not provided
read(size=-1)[source]

Simple read function for binary files

Parameters:size (int) – number of bytes to read in (default: -1 –whole file)
Returns:the number of bytes read from file
Return type:(bytes)
tell()[source]

Simple tell function for reporting byte position of file head

Returns:byte position of file head
Return type:(int)
bamnostic.bai.unpack_int32()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_int32L()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

Coordinate-Sorted Index (CSI) file handler (bamnostic.csi)

CSI file parser

Copyright © 2018, Marcus D. Sherman

This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.

The Binary Alignment Map (CSI) format (defined at http://samtools.github.io/hts-specs/CSIv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a CSI index file can be created. Generally speaking, a CSI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the CSI is produced, random access to the BAM is available.

This script is for parsing the binary encoded CSI file, inherently making it human readable. Furthermore, it allows subsections of the CSI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic

bamnostic.csi.unpack_chunk()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.unpack_bid_loffset_nchunk()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.unpack_unmapped()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.reg2bin_csi(rbeg, rend, min_shift=14, depth=5)[source]

Finds the largest superset bin of region. Numeric values taken from hts-specs

Parameters:
  • rbeg (int) – inclusive beginning position of region
  • rend (int) – exclusive end position of region
Returns:

distinct bin ID for largest superset bin of region

Return type:

(int)

bamnostic.csi.reg2bins_csi(rbeg, rend, min_shift=14, depth=5)[source]

Generates bin ids which overlap the specified region.

Parameters:
  • rbeg (int) – inclusive beginning position of region
  • rend (int) – exclusive end position of region
Yields:

(int) – bin IDs for overlapping bins of region

Raises:

AssertionError (Exception) – if the range is malformed or invalid

class bamnostic.csi.Csi(filename)[source]

Bases: bamnostic.bai.Bai

This class defines the bam CSI index file object and its interface.

The purpose of this class is the binary parsing of the bam index file (CSI) associated with a given bam file. When queried, the Csi object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.

Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)

_io

opened CSI file object

Type:fileObject
_UNMAP_BIN

constant for bin ID of unmapped read stats

Type:int
magic

first 4 bytes of file. Must be equal to b’CSI’

Type:bytes
n_refs

number of references in CSI

Type:int
unmapped

dictionary of the unmapped read stats by each reference

Type:dict
current_ref

dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.

Type:None|dict
ref_indices

dictionary of reference ids and their start/stop offsets within theCSI file

Type:dict
n_no_coord

if present inCSI, is the number of reads that have no coordinates

Type:None|int
_last_pos

used for indexing, the byte position of the file head.

Type:int
get_chunks(n_chunks)[source]

Simple generator for unpacking chunk data

Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.

Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.

Parameters:n_chunks (int) – number of chunks to be unpacked from stream
Returns:
a list of Chunk objects with the attributes of
chunks[i] are .voffset_beg and voffset_end
Return type:chunks (list)
get_bins(n_bins, ref_id=None, idx=False)[source]

Simple function that iteratively unpacks the data of a number (n_bin) of bins.

As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in theCSI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.

As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True

Parameters:n_int (int) – number of bins to be unpacked from stream
Returns:
None if just indexing the index file or a dictionary
of bin_id: chunks pairs
Return type:bins (None | dict)
Raises:AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly
get_ref[source]

Iteratively unpacks all the bins, linear intervals, and chunks for a given reference

A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.

This function also serves to “index” theCSI file such that, if it is invoked by setting ids=True, will do a single pass through theCSI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing theCSI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.

Parameters:
  • ref_id (None|int) – used for random access or indexing theCSI
  • idx (bool) – Flag for setting whether or not to run an index of theCSI
Returns:

namedtuple containing the byte offsets of the reference start, stop, and number of bins o Ref: namedtuple containing a dictionary of bins and list of linear intervals

Return type:

RefIdx

Raises:

AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.

query(ref_id, start, stop=-1)[source]

Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start

Parameters:
  • ref (int) – which reference/chromosome TID
  • start (int) – left most bp position of region (zero-based)
  • stop (int) – right most bp position of region (zero-based)
Returns:

the voffset_beg of the first chunk given the chunk’s voffset_end

is greater than the voffset of the linear index that overlaps the region of interest’s start offset

Return type:

(int)

bamnostic.csi.unpack_int32()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.unpack_int32L()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

BAMnostic Utilities (bamnostic.utils)

bamnostic.utils.format_warnings(message, category, filename, lineno, file=None, line=None)[source]

Sets STDOUT warnings

Parameters:
  • message – the unformatted warning message being reported
  • category (str) – the level of warning (handled by warnings module)
  • filename (str) – filename for logging purposes (defaults to STDOUT)
  • lineno (int) – where the error occurred.
Returns:

formatted warning string

bamnostic.utils.unpack_int32()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.utils.unpack_int32L()

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

class bamnostic.utils.Roi(contig=None, start=None, stop=None, tid=None)[source]

Bases: object

Small __slots__ class for region of interest parsing

bamnostic.utils.flag_decode(flag_code)[source]

Simple read alignment flag decoder

Every read within a BAM file ought to have an associated flag code. Theses flags are used for read filtering and QC. The flags are described below. Additionally, they can be found here

Any given read’s flag is determined by the or (|) operand of all appropriate bit flags.

Parameters:flag_code (int) – either a standalone integer/bit flag or the read object itself
Returns:list of flag and flag description tuples.
Return type:(list of tuple)
Raises:ValueError – if provided flag is not a valid entry

Example

If a flag is 516 it is comprised of flag 4 and flag 512

>>> flag_decode(516)
[(4, 'read unmapped'), (512, 'QC fail')]
Int Bit Description
1 0x1 Template having multiple segments in sequencing
2 0x2 Each segment properly aligned according to the aligner
4 0x4 Segment unmapped
8 0x8 Next segment in the template unmapped
16 0x10 SEQ being reverse complemented
32 0x20 SEQ of the next segment in the template being reverse complemented
64 0x40 The first segment in the template
128 0x80 The last segment in the template
256 0x100 Secondary alignment
512 0x200 Not passing filters, such as platform/vendor quality controls
1024 0x400 PCR or optical duplicate
2048 0x800 Supplementary alignment
bamnostic.utils.yes_no(message)[source]

Simple prompt parser

bamnostic.utils.parse_region(contig=None, start=None, stop=None, region=None, tid=None, reference=None, end=None, until_eof=False)[source]

Parses region information from all user set parameters.

The main goal of this function is to handle the many different ways a user can put in genomic region data. One way is through keyword arguments. This is the most straight forward. However, due to Pysam’s API, there are multiple synonyms for reference/contig or stop/end. Additionally, if the user knows the refID/TID of the reference they are interested in, they can input that way as well.

The second form a submission can take is through positional arguments. Just like keyword, but ordered such that it makes up a genomic region of interest.

Note

Positional arguments make utilizing the tid parameter difficult since it is the 5th argument of the function signature.

The third form a submission can take is through using a SAM-formatted string. An example SAM-formatted string looks like ‘chr1:10-100’. As many users also copy & paste directly from tab-delimited files, such as BED files, a SAM-formatted string can take the form of ‘chr1\t10\t100’ where ‘\t’ indicates a tab space.

Lastly, the until_eof switch allows users to take all items from their desired start position (be it the whole reference or a specific spot on the reference). Setting this to True (default: False) will pull all reads to the end of the reference or file, whichever is first.

Parameters:
  • contig (str) – name of reference/contig
  • start (int) – start position of region of interest (0-based)
  • stop (int) – stop position of region of interest (0-based)
  • region (str) – SAM region formatted string. Accepts tab-delimited values as well
  • tid (int) – the refID or target id of a reference/contig
  • until_eof (bool) – iterate until end of file (default: False)
Returns:

region of interest formatted as an object with named attributes.

Return type:

query (bamnostic.utils.Roi)

Raises:

ValueError – if two synonym keywords are set, but contradict each other

Examples

# Keyword-based
>>> parse_region(contig = 'chr1', start = 10, stop = 100)
Roi(contig: chr1, start: 10, stop: 100)

# Using `tid` instead of `contig`
>>> parse_region(tid = 0, start = 10, stop = 100)
Roi(tid: 0, start: 10, stop: 100)

# Positional arguments
>>> parse_region('chr1', 10, 100)
Roi(contig: chr1, start: 10, stop: 100)

# SAM-formatted string (keyword)
>>> parse_region(region = 'chr1:10-100')
Roi(contig: chr1, start: 10, stop: 100)

# SAM-formatted string (positional)
>>> parse_region('chr1:10-100')
Roi(contig: chr1, start: 10, stop: 100)

# Tab-delimited region string
>>> parse_region('chr1\t10\t100')
Roi(contig: chr1, start: 10, stop: 100)

# Contradictory synonyms
>>> parse_region(reference='chr1', contig='chr10', start=10, stop = 100)
Traceback (most recent call last):
...
ValueError: either contig or reference must be set, not both
bamnostic.utils.unpack(fmt, _io, is_array=False)[source]

Utility function for unpacking binary data from file object or byte stream.

The only difference between this method and struct.unpack is that unpack dynamically determines the size needed to read in based on the format string. Additionally, it can process a file object or byte stream and implement a read or slice (respectively). Mainly, this is a quality of life function.

Parameters:
  • fmt (str) – the string format of the binary data to be unpacked
  • _io – built-in binary format reader (default: io.BufferedRandom)
Returns:

unpacked contents from _io based on fmt string

bamnostic.utils.make_virtual_offset(block_start_offset, within_block_offset)[source]

Compute a BGZF virtual offset from block start and within block offsets.

The BAM indexing scheme records read positions using a 64 bit ‘virtual offset’, comprising in C terms:

block_start_offset << 16 | within_block_offset

Here block_start_offset is the file offset of the BGZF block start (unsigned integer using up to 64-16 = 48 bits), and within_block_offset within the (decompressed) block (unsigned 16 bit integer).

>>> make_virtual_offset(0, 0)
0
>>> make_virtual_offset(0, 1)
1
>>> make_virtual_offset(0, 2**16 - 1)
65535
>>> make_virtual_offset(0, 2**16)
Traceback (most recent call last):
...
ValueError: Require 0 <= within_block_offset < 2**16, got 65536
bamnostic.utils.split_virtual_offset(virtual_offset)[source]

Divides a 64-bit BGZF virtual offset into block start & within block offsets.

>>> (100000, 0) == split_virtual_offset(6553600000)
True
>>> (100000, 10) == split_virtual_offset(6553600010)
True
class bamnostic.utils.LruDict(*args, **kwargs)[source]

Bases: collections.OrderedDict

Simple least recently used (LRU) based dictionary that caches a given number of items.

get(key)[source]

Basic getter that renews LRU status upon inspection

Parameters:key (str) – immutable dictionary key
update(others)[source]

Same as a regular dict.update, however, since pypy’s dict.update doesn’t go through dict.__setitem__, this is used to ensure it does

Parameters:others (iterable) – a dictionary or iterable containing key/value pairs
cull()[source]

Main utility function for pruning the LruDict

If the length of the LruDict is more than max_cache, it removes the LRU item

bamnostic.utils.parse_cigar(cigar_str)[source]

Parses a CIGAR string and turns it into a list of tuples

Parameters:cigar_str (str) – the CIGAR string as shown in SAM entry
Returns:list of tuples of CIGAR operations (by id) and number of operations
Return type:cigar_array (list)
Raises:ValueError – if CIGAR operation is invalid

Examples

>>> parse_cigar('3M1I3M1D5M') # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[(('BAM_CMATCH', 0), 3), ..., (('BAM_CMATCH', 0), 5)]
bamnostic.utils.check_cigar_arg(cigar)[source]

Checks to make sure CIGAR arugment is valid.

Parameters:argument (str or list) – CIGAR string (pre-formatted or raw)
Returns:CIGAR re-formatted as a list
Return type:(list)
Raises:ValueError – if CIGAR is not a string or pre-formatted list
bamnostic.utils.cigar_changes(seq, cigar)[source]
Recreates the reference sequence to the extent that the CIGAR string can
represent.
Parameters:
  • seq (str) – aligned segment sequence
  • cigar (list) – list of tuples of cigar operations (by id) and number of operations
Returns:

a version of the aligned segment’s reference sequence given the changes reflected in the cigar string

Return type:

cigar_formatted_ref (str)

Raises:

ValueError – if CIGAR operation is invalid

Examples

>>> cigar_changes('ACTAGAATGGCT', '3M1I3M1D5M')
'ACTGAATGGCT'
>>> cigar_changes('ACTAGAATGGCT', '3V1I3M1D5M')
Traceback (most recent call last):
    ...
ValueError: Invalid CIGAR operation (V).
bamnostic.utils.md_changes(seq, md_tag)[source]

Recreates the reference sequence of a given alignment to the extent that the MD tag can represent.

Note

Used in conjunction with cigar_changes to recreate the complete reference sequence

Parameters:
  • seq (str) – aligned segment sequence
  • md_tag (str) – MD tag for associated sequence
Returns:

a version of the aligned segment’s reference sequence given the changes reflected in the MD tag

Return type:

ref_seq (str)

Raises:

ValueError – if MD tag is None

Example

>>> md_changes('CTTATATTGGCCTT', '3C4AT4')
'CTTCTATTATCCTT'
bamnostic.utils.ref_gen(seq, cigar_string, md_tag)[source]

Recreates the reference sequence associated with the given segment.

Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome. Example reads, MD tags, and CIGAR strings taken from David Tang’s Blog.

Returns:generated reference sequence
Return type:(str)
Raises:KeyError – if read does not contain MD tag

Examples

# Only mismatches >>> seq = ‘CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG’ >>> cigar = ‘36M’ >>> md = ‘1A0C0C0C1T0C0T27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG’

# Insertions and mismatches >>> seq = ‘GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT’ >>> cigar = ‘6M1I29M’ >>> md = ‘0C1C0C1C0T0C27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT’

# Deletion and mismatches >>> seq = ‘AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC’ >>> cigar = ‘9M9D27M’ >>> md = ‘2G0A5^ATGATGTCA27’ >>> ref_gen(seq, cigar, md) ‘AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC’

# Insertion, deletion, and mismatches >>> seq = ‘AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA’ >>> cigar = ‘2M1I7M6D26M’ >>> md = ‘3C3T1^GCTCAG26’ >>> ref_gen(seq, cigar, md) ‘AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA’

bamnostic.utils.cigar_alignment(seq=None, cigar=None, start_pos=0, qualities=None, base_qual_thresh=0, query=False)[source]

Use the CIGAR to filter out all unaligned data bases

Any clipping results in the removal of those bases. If an insertion is seen in the CIGAR, those bases are removed from the sequence. If a deletion is seen in the CIGAR, those bases are padded with a period (‘.’) symbol.

Parameters:
  • seq (str) – string sequence of the aligned segment.
  • cigar (str) – the cigar string or cigartuple of the aligned segment.
  • start_pos (int) – the first aligned position of the read
  • qualities (array.array) – base quality array from read
Yields:

(tuple of str and int) – nucleotide base and index position of that base relative to reference

Example

>>> seq = 'AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA'
>>> cigar = '2M1I7M6D26M'
>>> start_position = 95
>>> c_a = cigar_alignment(seq, cigar, start_position)
>>> next(c_a)
('A', 95)