bamnostic package¶
bamnostic is an OS-agnostic, Pure Python BAM file parser.
The purpose of bamnostic is to open up BAM query capabilities to all current OS environments. Special care was taken to allow bamnostic to run in all versions of Python 2.7 onward. Furthermore, bamnostic does not require any packages outside of the standard library. Therefore, bamnostic will run on all stable versions of PyPy.
Note
SAM and CRAM support is not yet implemented
The three main classes of bamnostic are:
bamnostic.AlignmentFile
: the BAM file handlerbamnostic.AlignedSegment
: an aligned read object interfacebamnostic.bai.Bai
: if the BAM file has an associated index file (preferred), this is the file handler for it.
Note
Within the scope of personal research, reading BAM files is the only fully supported IO. The skeleton for writing BAM files is present, just not connected.
This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.
Core Module (bamnostic.core
)¶
-
class
bamnostic.core.
Cigar
(op_code, n_op, op_id, op_name)¶ Bases:
tuple
namedtuple
for handling CIGAR dataParameters: - op_code (int) – CIGAR operation index
- n_op (int) – number of operations for a given op_code
- op_id (str) – the string representation of the CIGAR representation (‘MIDNSHP=XB’)
- op_name (str) – Longer string name for operation
-
n_op
¶ Alias for field number 1
-
op_code
¶ Alias for field number 0
-
op_id
¶ Alias for field number 2
-
op_name
¶ Alias for field number 3
-
bamnostic.core.
offset_qual
(qual_string)[source]¶ Offsets the ASCII-encoded quality string to represent PHRED score.
Every base that is in the alignment is assigned a Phred score. A Phred score (\(Q\)) is defined as \(Q=-10\log_{10}P\), where \(P\) is the base-calling error probability. Phred quality scores tend range from 10 to 60. These qualities are then offset by 33 and ASCII-encoded for readability and storage.
Parameters: qual_string ( str
orbytes
) – Phred quality scores without offsetReturns: ASCII-encoded Phred scores offest by adding 33 to base score. Return type: (str) Examples
>>> qual_score = '' >>> ''.join(offset_qual(qual_score)) '<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;'
-
class
bamnostic.core.
AlignmentFile
(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False, compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]¶ Bases:
bamnostic.bam.BamReader
,bamnostic.bam.BamWriter
Wrapper to allow drop in replacement for BAM functionality in a
pysam
-like API.Parameters: - filepath_or_object (str |
file
) – the path or file object of the BAM file - mode (str) – Mode for reading. BAM files are binary by nature (default: ‘rb’).
- max_cache (int) – number of desired LRU cache size, preferably a multiple of 2 (default: 128).
- index_filename (str) – path to index file (BAI) if it is named differently than the BAM file (default: None).
- filename (str |
file
) – synonym for filepath_or_object - check_header (bool) – Obsolete method maintained for backwards compatibility (default: False)
- check_sq (bool) – Inspect BAM file for @SQ entries within the header
- reference_filename (str) – Not implemented. Maintained for backwards compatibility
- filepath_index (str) – synonym for index_filename
- require_index (bool) – require the presence of an index file or raise (default: False)
- duplicate_filehandle (bool) – Not implemented. Raises warning if True.
- ignore_truncation (bool) – Whether or not to allow trucated file processing (default: False).
- filepath_or_object (str |
-
class
bamnostic.core.
AlignedSegment
(_io)[source]¶ Bases:
object
Main class for handling reads within the BAM
-
query_alignment_end
¶ One past the final index position of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
-
query_alignment_start
¶ The starting index of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
-
query_name
¶ Alias for the read name
-
query_alignment_sequence
¶ The sequence of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
-
query_alignment_length
¶ The length of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
-
reference_start
¶ The starting index of the aligned portion of the read
reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping
-
reference_end
¶ One past the final index position of the aligned portion of the read
reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping
-
reference_length
¶ The length of the aligned portion of the read
reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping
-
query_sequence
¶ The sequence of the read
query_ here means the query as seen in the BAM file, and therefore includes clipping
-
query_length
¶ The length of sequence of the read
query_ here means the query as seen in the BAM file, and therefore includes clipping
-
next_reference_id
¶ Return the reference id of mate read
-
next_reference_name
¶ Return the reference name of mate read
-
next_reference_start
¶ Return the reference name of mate read
-
is_duplicate
¶ If the read is set as PCR or optical duplicate according to read flag
-
is_paired
¶ If the read is set as paired according to read flag
-
is_proper_pair
¶ If each segment properly aligned according to the aligner
-
is_qcfail
¶ If the read is set as QC fail according to read flag
-
is_read1
¶ If the read is set as the first read of mate pair according to read flag (implies the read is mated)
-
is_read2
¶ If the read is set as the second read of mate pair according to read flag (implies the read is mated)
-
is_reverse
¶ If the read is set as reversed according to read flag
-
is_secondary
¶ If the read is set as secondary alignment according to read flag
-
is_supplementary
¶ If the read is set as a supplementary alignment according to read flag
-
is_unmapped
¶ If the read is set as unmapped according to read flag
-
mate_is_reverse
¶ Check the read flag to see if the mat is reverse strand (implies the read is mated)
-
mate_is_unmapped
¶ Check the read flag to see if mate is unmapped (implies the read is mated)
-
mapping_quality
¶ The mapping quality (MAPQ) of the read
-
get_tag
(tag, with_value_type=False)[source]¶ Gets the value associated with a given tag key.
Parameters: - tag (str) – the tag of interest
- with_value_type (bool) – return what kind of value the tag
Returns: the value associated with a given tag or the value and type of value (as seen in BAM format)
Returns all the tags for a given read
Parameters: with_value_type (bool) – return the tag value type (as defined by BAM format) Returns: list of tag tuples (with or without tag value type) Return type: f_tags( list
)
-
get_cigar_stats
()[source]¶ Gets the counts of each CIGAR operation in the read and number of nucleotides related to those given operations.
Returns: list of CIGAR operation counts nt_counts ( list
): list of nucleotide counts for each operationReturn type: op_blocks ( list
)
-
ref_gen
()[source]¶ Recreates the reference sequence associated with the given segment.
Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome.
Returns: generated reference sequence Return type: (str) Raises: KeyError
– if read does not contain MD tag
-
to_bam
()[source]¶ Writes the alignment record to a BAM file
Parameters: bam_file (string or bamnostic.bam.BamWriter
) – BAM file path or open bam file in a write mode
-
-
bamnostic.core.
unpack_int32
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.core.
unpack_int32L
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
Bgzip BAM Format (BGZF) file handlers (bamnostic.bgzf
)¶
-
bamnostic.bgzf.
unpack_gzip_header
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bgzf.
unpack_subfields
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bgzf.
unpack_gzip_integrity
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bgzf.
unpack_bgzf_metaheader
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bgzf.
get_block
(handle, offset=0)[source]¶ Pulls out entire GZIP block
Used primarily for copying the header block of a BAM file. However, it can be used to copy any BGZF block within a BAM file that starts at the given offset.
Note
Does not progress file cursor position.
Parameters: - handle (
file
) – open BAM file - offset (int) – offset of BGZF block (default: 0)
Returns: Complete BGZF block
Raises: ValueError
– if the BGZF block header is malformedExample
>>> bam_header = get_block(bamnostic.example_bam) >>> try: ... bam_header.startswith(b'\x1f\x8b\x08\x04') ... except SyntaxError: ... bam_header.startswith('\x1f\x8b\x08\x04') True
- handle (
-
class
bamnostic.bgzf.
BgzfReader
(filepath_or_object, mode='rb', max_cache=None, filename=None, duplicate_filehandle=None, ignore_truncation=False)[source]¶ Bases:
object
The BAM reader. Heavily modified from Peter Cock’s BgzfReader.
-
seek
(virtual_offset)[source]¶ Seek to a 64-bit unsigned BGZF virtual offset.
A virtual offset is a composite number made up of the compressed offset (coffset) position of the start position of the BGZF block that the position originates within, and the uncompressed offset (uoffset) within the deflated BGZF block where the position starts. The virtual offset is defined as
virtual_offset = coffset << 16 | uoffset
Parameters: virtual_offset (int) – 64-bit unsigned composite byte offset
Returns: an echo of the new position
Return type: virtual_offset (int)
Raises: ValueError
– if within block offset is more than block sizeAssertionError
– if the start position is not the block start position
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.seek(10) 10
>>> bam.seek(bamnostic.utils.make_virtual_offset(0, 42)) Traceback (most recent call last): ... ValueError: Within offset 42 but block size only 38
-
read
(size=-1)[source]¶ Read method for the BGZF module.
Parameters: size (int) – the number of bytes to read from file. Advances the cursor.
Returns: byte string of length size
Return type: data (
bytes
)Raises: NotImplementedError
– if the user tries to read the whole fileAssertionError
– if read does not return any data
-
readline
()[source]¶ Read a single line for the BGZF file.
Binary operations do not support readline(). Code is commented out for posterity sake
-
-
class
bamnostic.bgzf.
BgzfWriter
(filepath_or_object, mode='wb', compresslevel=6)[source]¶ Bases:
object
Define a BGZFWriter object.
-
close
()[source]¶ Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.
samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, and if it is missing warns the BAM file may be truncated. In addition to samtools writing this block, so too does bgzip - so this implementation does too.
-
-
bamnostic.bgzf.
unpack_int32
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bgzf.
unpack_int32L
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
Binary Alignment Map (BAM) file handlers (bamnostic.bam
)¶
-
class
bamnostic.bam.
BAMheader
(_io)[source]¶ Bases:
object
Parse and store the BAM file header
The BAM header is the plain text and byte-encoded metadata of a given BAM file. Information stored in the header are the number, length, and name of the reference sequences that reads were aligned to; version of software used; read group identifiers; etc. The BAM format also stipulates that the first block of any BAM file should be reserved just for the BAM header block.
-
_io
¶ opened BAM file object
Type: file
-
SAMheader
¶ parsed dictionary of the SAM header
Type: dict
-
n_refs
¶ number of references
Type: int
-
refs
¶ reference names and lengths listed in the BAM header
Type: dict
-
-
class
bamnostic.bam.
BamReader
(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False)[source]¶ Bases:
bamnostic.bgzf.BgzfReader
The BAM reader. Heavily modified from Peter Cock’s BgzfReader.
-
header
¶ representation of header data (if present)
-
lengths
¶ lengths of references listed in header
Type: list
ofint
-
nocoordinate
¶ number of reads that have no coordinates
Type: int
-
nreferences
¶ number of references in header
Type: int
-
ref2tid
¶ refernce names and refID dictionary
Type: dict
ofstr
,int
-
references
¶ names of references listed in header
Type: list
ofstr
-
text
¶ SAM header (if present)
Type: str
-
unmapped
¶ number of unmapped reads
Type: int
Note
This implementation is likely to change. While the API was meant to mirror pysam, it makes sense to include the pysam-like API in an extension that will wrap the core reader. This would be a major refactor, and therefore will not happen any time soon (30 May 2018).
-
check_index
(index_filename=None, req_idx=False)[source]¶ Checks to make sure index file is available. If not, it disables random access.
Parameters: - index_filename (str) – path to index file (BAI) if it does not fit naming convention (default: None).
- req_idx (bool) – Raise error if index file is not present (default: False).
Returns: True if index is present, else False
Return type: (bool)
Raises: IOError
– If the index file is closed or index could not be openedWarns: UserWarning – If index could not be loaded. Random access is disabled.
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam) >>> bam.check_index(bamnostic.example_bam + '.bai') True
>>> bam.check_index('not_a_file.bai') False
-
nocoordinate
Get the number of reads without coordiantes according to the statistics recorded in the index.
Returns: sum of reads without coordinates Return type: (int)
-
mapped
¶ Get the number of mapped reads according to the statistics recorded in the index.
Returns: sum of mapped reads Return type: (int)
-
unmapped
Get the number of unmapped reads without coordiantes and without coordinate.
Returns: sum of unmapped reads and reads without coordinates Return type: (int)
-
references
Get all references used in alignment.
Returns: reference names Return type: ( tuple
ofstr
)
-
nreferences
Get the number of references listed in alignment
Returns: count of references Return type: (int)
-
lengths
Get all reference lengths used in alignment.
Returns: reference lengths Return type: ( tuple
ofint
)
-
has_index
()[source]¶ Checks if file has index and it is open
Returns: True if present and opened, else False Return type: bool
-
fetch
(contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)[source]¶ Creates a generator that returns all reads within the given region. (inclusive, exclusive)
Parameters: - contig (str) – name of reference/contig
- start (int) – start position of region of interest (0-based)
- stop (int) – stop position of region of interest (0-based)
- region (str) – SAM region formatted string. Accepts tab-delimited values as well
- tid (int) – the refID or target id of a reference/contig
- until_eof (bool) – iterate until end of file
- mutiple_iterators (bool) – allow multiple iterators over region. Not Implemented. Notice: each iterator will open up a new view into the BAM file, so overhead will apply.
- reference (str) – synonym for contig
- end (str) – synonym for stop
Yields: reads over the region of interest if any
Raises: ValueError
– if the genomic coordinates are out of range or invalidKeyError
– Reference is not found in header
Note
SAM region formatted strings take on the following form: ‘chr1:100000-200000’
- Usage:
AlignmentFile.fetch(contig='chr1', start=1, stop= 1000) AlignmentFile.fetch('chr1', 1, 1000) AlignmentFile.fetch('chr1:1-1000') AlignmentFile.fetch('chr1', 1) AlignmentFile.fetch('chr1')
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> next(bam.fetch('chr1', 1, 100)) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE EAS56_57:6:190:289:82 ... MF:C:192
>>> next(bam.fetch('chr10', 1, 10)) Traceback (most recent call last): ... KeyError: 'chr10 was not found in the file header'
>>> next(bam.fetch('chr1', 1700, 1701)) Traceback (most recent call last): ... ValueError: Genomic region out of bounds.
>>> next(bam.fetch('chr1', 100, 10)) Traceback (most recent call last): ... AssertionError: Malformed region: start should be <= stop, you entered 100, 10
-
count
(contig=None, start=None, stop=None, region=None, until_eof=False, tid=None, read_callback='nofilter', reference=None, end=None)[source]¶ Count the number of reads in the given region
Note: this counts the number of reads that overlap the given region.
Can potentially make use of a filter for the reads (or custom function that returns True or False for each read).
Parameters: - contig (str) – the reference name (Default: None)
- reference (str) – synonym for contig (Default: None)
- start (int) – 0-based inclusive start position (Default: None)
- stop (int) – 0-based exclusive start position (Default: None)
- end (int) – Synonym for stop (Default: None)
- region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)
- until_eof (bool) – count number of reads from start to end of file Note, this can potentially be an expensive operation. (Default: False)
- read_callback (str|function) –
select (or create) a filter of which reads to count. Built-in filters:
- all: skips reads that contain the following flags:
- 0x4 (4): read unmapped
- 0x100 (256): not primary alignment
- 0x200 (512): QC Fail
- 0x400 (1024): PCR or optical duplicate
- nofilter: uses all reads (Default)
- The user can also supply a custom function that
- returns boolean objects for each read
- all: skips reads that contain the following flags:
Returns: count of reads in the given region that meet parameters
Return type: (int)
Raises: ValueError
– if genomic coordinates are out of range or invalid or random access is disabledRuntimeError
– if read_callback is not properly setKeyError
– Reference is not found in headerAssertionError
– if genomic region is malformed
Notes
SAM region formatted strings take on the following form: ‘chr1:100000-200000’
- Usage:
- AlignmentFile.count(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count(‘chr1’, 1, 1000) AlignmentFile.count(‘chr1:1-1000’) AlignmentFile.count(‘chr1’, 1) AlignmentFile.count(‘chr1’)
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.count('chr1', 1, 100) 2
>>> bam.count('chr1', 1, 100, read_callback='all') 1
>>> bam.count('chr10', 1, 10) Traceback (most recent call last): ... KeyError: 'chr10 was not found in the file header'
>>> bam.count('chr1', 1700, 1701) Traceback (most recent call last): ... ValueError: Genomic region out of bounds.
>>> bam.count('chr1', 100, 10) Traceback (most recent call last): ... AssertionError: Malformed region: start should be <= stop, you entered 100, 10
-
count_coverage
(contig=None, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None, base_quality_threshold=0)[source]¶ Counts the coverage of each base supported by a read the given interval.
Given an interval (inclusive, exclusive), this method pulls each read that overlaps with the region. To ensure that the read truly overlaps with the region, the CIGAR string is required. These reads can further be filtered out by their flags, MAPQ qualities, or custom filtering function. Using the CIGAR string, the aligned portion of the read is traversed and the presence of each nucleotide base is tallied into respective arrays. Additionally, the user can choose to filter the counted bases based on its base quality score that is stored in the quality string.
Parameters: - contig (str) – the reference name (Default: None)
- reference (str) – synonym for contig (Default: None)
- start (int) – 0-based inclusive start position (Default: None)
- stop (int) – 0-based exclusive start position (Default: None)
- end (int) – Synonym for stop (Default: None)
- region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)
- quality_threshold (int) – MAPQ quality threshold (Default: 15)
- base_quality_threshold (int) – base quality score threshold (Default: 0)
- read_callback (str|function) –
select (or create) a filter of which reads to count. Built-in filters:
- all: skips reads that contain the following flags:
- 0x4 (4): read unmapped
- 0x100 (256): not primary alignment
- 0x200 (512): QC Fail
- 0x400 (1024): PCR or optical duplicate
- nofilter: uses all reads (Default)
- The user can also supply a custom function that returns boolean objects for each read
Returns: Four arrays in the order of A, C, G, T
Return type: (
array.array
)Raises: ValueError
– if genomic coordinates are out of range or invalid, random access is disabled, or nucleotide base is unrecognizedRuntimeError
– if read_callback is not properly setKeyError
– Reference is not found in headerAssertionError
– if genomic region is malformed
Notes
SAM region formatted strings take on the following form: ‘chr1:100-200’
- Usage:
- AlignmentFile.count_coverage(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count_coverage(‘chr1’, 1, 1000) AlignmentFile.count_coverage(‘chr1:1-1000’) AlignmentFile.count_coverage(‘chr1’, 1) AlignmentFile.count_coverage(‘chr1’)
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> for arr in bam.count_coverage('chr1', 100, 150): # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE ... print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist())))) array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0]) array('L', [0, 0, 0, 0, ..., 15, 0, 14, 0, 0]) array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 14]) array('L', [0, 0, 0, 0, ..., 0, 15, 0, 14, 0])
>>> for arr in bam.count_coverage('chr1', 100, 150, quality_threshold=20, base_quality_threshold=25): # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE ... print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist())))) array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0]) array('L', [0, 0, 0, 0, ..., 13, 0, 14, 0, 0]) array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 13]) array('L', [0, 0, 0, 0, ..., 0, 14, 0, 13, 0])
-
get_index_stats
()[source]¶ Inspects the index file (BAI) for alignment statistics.
Every BAM index file contains metrics regarding the alignment process for the given BAM file. The stored data are the number of mapped and unmapped reads for a given reference. Unmapped reads are paired end reads where only one part is mapped. Additionally, index files also contain the number of unplaced unmapped reads. This is stored within the nocoordinate instance attribute (if present).
Returns: list of tuples for each reference in the order seen in the header. Each tuple contains the number of mapped reads, unmapped reads, and the sum of both. Return type: idx_stats ( list
oftuple
)Raises: AssertionError
– if the index file is not availableExample
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.get_index_stats() [(1446, 18, 1464), (1789, 17, 1806)]
>>> bam_no_bai = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb', index_filename='not_a_file.bai') >>> bam_no_bai.get_index_stats() Traceback (most recent call last): ... AssertionError: No index available
-
is_valid_tid
(tid)[source]¶ Return True if TID/RefID is valid.
Returns: True if TID/refID is valid, else False Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.is_valid_tid(0) True
>>> bam.is_valid_tid(10) # because there are only 2 in this file False
-
get_reference_name
(tid)[source]¶ Convert TID/refID to reference name.
The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.
Parameters: tid (int) – TID/refID of desired reference/contig Returns: String representation of chromosome if valid, else None Raises: KeyError
– if TID/refID is not validExamples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.get_reference_name(0) # doctest: +ALLOW_UNICODE 'chr1'
>>> bam.get_reference_name(10) Traceback (most recent call last): ... KeyError: '10 is not a valid TID/refID for this file.'
-
get_tid
(reference)[source]¶ Convert reference/contig name to refID/TID.
The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.
Parameters: reference (str) – reference/contig name Returns: the TID/refID of desired reference/contig Return type: (int) Raises: KeyError
– if reference name not found file headerExample
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.get_tid('chr1') 0
>>> bam.get_tid('chr10') Traceback (most recent call last): ... KeyError: 'chr10 was not found in the file header'
-
mate
(AlignedSegment)[source]¶ Gets the mate to a given AlignedSegment.
Note
Slow, when compared to the C-API. Not meant for high-throughput analysis.
Does not advance current iterator position.
Parameters: AlignedSegment ( bamnostic.AlignedSegment
) – a bamnostic AlignedSegment read with a mateReturns: if read has a valid mate, else None Return type: ( bamnostic.AlignedSegment
)Raises: ValueError
– if AlignedSegment is unpaired
-
head
(n=5, multiple_iterators=False)[source]¶ List out the first n reads of the file.
This method is primarily used when doing an initial exploration of the data. Whether or not multiple_iterators is used, cursor position within the file will not change.
Note
Using multiple_interators opens a new file object of the same file currently in use and, thus, impacts the memory footprint of your analysis.
Parameters: - n (int) – number of aligned reads to print (default: 5)
- mutliple_iterators (bool) – Whether to use current file object or create a new one (default: False).
Returns: list of n reads from the front of the BAM file
Return type: head_reads (
list
ofAlignedSegment
)Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.head(n=5, multiple_iterators = False)[0] # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE EAS56_57:6:190:289:82 ... MF:C:192
>>> bam.head(n = 5)[1] # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE EAS56_57:6:190:289:82 ... UQ:C:0
-
-
class
bamnostic.bam.
BamWriter
(filepath_or_object, mode='xb', compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]¶ Bases:
bamnostic.bgzf.BgzfWriter
-
write_header
(copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]¶ Modified code from Peter Cock’s [BamWriter](https://github.com/peterjc/biopython/blob/SamBam2015/Bio/Sequencing/SamBam/__init__.py#L1714)
Writes the header into a BAM file
Parameters: - copy_header (filepath_or_object) – copy the header from another BAM file
- header (bytes) – SAM-style header
- reference_names (
list
ofstr
) – list of all reference names used in file - reference_lengths (
list
ofint
) – list of all reference lengths used in file
-
-
bamnostic.bam.
unpack_int32
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bam.
unpack_int32L
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
BAM Index (BAI) file handler (bamnostic.bai
)¶
BAI file parser
Copyright © 2018, Marcus D. Sherman
This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.
The Binary Alignment Map (BAM) format (defined at https://samtools.github.io/hts-specs/SAMv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a BAM index (BAI) file is created. Generally speaking, a BAI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the BAI is produced, random access to the BAM is available.
This script is for parsing the binary encoded BAI file, inherently making it human readable. Furthermore, it allows subsections of the BAI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic
-
bamnostic.bai.
unpack_chunk
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bai.
unpack_intervals
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bai.
unpack_bid_nchunk
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bai.
unpack_unmapped
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
class
bamnostic.bai.
RefIdx
(start_offset, end_offset, n_bins)¶ Bases:
tuple
-
end_offset
¶ Alias for field number 1
-
n_bins
¶ Alias for field number 2
-
start_offset
¶ Alias for field number 0
-
-
bamnostic.bai.
reg2bin
(rbeg, rend)[source]¶ Finds the largest superset bin of region. Numeric values taken from hts-specs
Parameters: - rbeg (int) – inclusive beginning position of region
- rend (int) – exclusive end position of region
Returns: distinct bin ID for largest superset bin of region
Return type: (int)
-
bamnostic.bai.
reg2bins
(rbeg, rend)[source]¶ Generates bin ids which overlap the specified region.
Parameters: - rbeg (int) – inclusive beginning position of region
- rend (int) – exclusive end position of region
Yields: (int) – bin IDs for overlapping bins of region
Raises: AssertionError (Exception) – if the range is malformed or invalid
-
class
bamnostic.bai.
Bai
(filename)[source]¶ Bases:
object
This class defines the bam index file object and its interface.
The purpose of this class is the binary parsing of the bam index file (BAI) associated with a given bam file. When queried, the Bai object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.
Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)
-
_io
¶ opened BAI file object
Type: fileObject
-
_LINEAR_INDEX_WINDOW
¶ constant of the linear interval window size
Type: int
-
_UNMAP_BIN
¶ constant for bin ID of unmapped read stats
Type: int
-
magic
¶ first 4 bytes of file. Must be equal to b’BAI’
Type: bytes
-
n_refs
¶ number of references in BAI
Type: int
-
unmapped
¶ dictionary of the unmapped read stats by each reference
Type: dict
-
current_ref
¶ dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.
Type: None|dict
-
ref_indices
¶ dictionary of reference ids and their start/stop offsets within the BAI file
Type: dict
-
n_no_coord
¶ if present in BAI, is the number of reads that have no coordinates
Type: None|int
-
_last_pos
¶ used for indexing, the byte position of the file head.
Type: int
-
get_chunks
(n_chunks)[source]¶ Simple generator for unpacking chunk data
Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.
Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.
Parameters: n_chunks (int) – number of chunks to be unpacked from stream Returns: - a list of Chunk objects with the attributes of
- chunks[i] are .voffset_beg and voffset_end
Return type: chunks (list)
-
get_ints
(n_int)[source]¶ Unpacks n_int number of interval virtual offsets from stream
A linear interval is defined as a 16384 bp window along a given reference. The value stored in the BAI is the virtual offset of the first read within that given interval. This virtual offset is the byte offset (coffset) of the start of the BGZF block that contains the beginning of the read and the byte offset (uoffset) within the uncompressed data of the residing BGZF block to that first read.
Note: a caveat to using linear interval with long reads: A long read can span multiple linear intervals. As such, the current encoding could potentially shift the expected region of interest to the left more than expected.
Parameters: n_int (int) – number of intervals to unpack Returns: list of virtual offsets for n_int number of linear intervals Return type: intervals (list)
-
get_bins
(n_bins, ref_id=None, idx=False)[source]¶ Simple function that iteratively unpacks the data of a number (n_bin) of bins.
As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in the BAI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.
As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True
Parameters: n_int (int) – number of bins to be unpacked from stream Returns: - None if just indexing the index file or a dictionary
- of bin_id: chunks pairs
Return type: bins (None | dict) Raises: AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly
-
get_ref
[source]¶ Interatively unpacks all the bins, linear intervals, and chunks for a given reference
A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.
This function also serves to “index” the BAI file such that, if it is invoked by setting ids=True, will do a single pass through the BAI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing the BAI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.
Parameters: - ref_id (None|int) – used for random access or indexing the BAI
- idx (bool) – Flag for setting whether or not to run an index of the BAI
Returns: namedtuple containing the byte offsets of the reference start, stop, and number of bins or Ref: namedtuple containing a dictionary of bins and list of linear intervals
Return type: Raises: AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.
-
query
(ref_id, start, stop=-1)[source]¶ Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start
Parameters: - ref (int) – which reference/chromosome TID
- start (int) – left most bp position of region (zero-based)
- stop (int) – right most bp position of region (zero-based)
Returns: - the voffset_beg of the first chunk given the chunk’s voffset_end
is greater than the voffset of the linear index that overlaps the region of interest’s start offset
Return type: (int)
-
seek
(offset=None, whence=0)[source]¶ Simple seek function for binary files
Parameters: - offset (None|int) – byte offset from whence to move the file head to.
- whence (int) – 0 := from start of file, 1:= from current position, 2:= from end of file
Returns: new byte position of file head
Return type: (int)
- Raise:
- ValueError (Exception): if the offset is not an integer or is not provided
-
-
bamnostic.bai.
unpack_int32
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.bai.
unpack_int32L
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
Coordinate-Sorted Index (CSI) file handler (bamnostic.csi
)¶
CSI file parser
Copyright © 2018, Marcus D. Sherman
This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.
The Binary Alignment Map (CSI) format (defined at http://samtools.github.io/hts-specs/CSIv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a CSI index file can be created. Generally speaking, a CSI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the CSI is produced, random access to the BAM is available.
This script is for parsing the binary encoded CSI file, inherently making it human readable. Furthermore, it allows subsections of the CSI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic
-
bamnostic.csi.
unpack_chunk
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.csi.
unpack_bid_loffset_nchunk
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.csi.
unpack_unmapped
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.csi.
reg2bin_csi
(rbeg, rend, min_shift=14, depth=5)[source]¶ Finds the largest superset bin of region. Numeric values taken from hts-specs
Parameters: - rbeg (int) – inclusive beginning position of region
- rend (int) – exclusive end position of region
Returns: distinct bin ID for largest superset bin of region
Return type: (int)
-
bamnostic.csi.
reg2bins_csi
(rbeg, rend, min_shift=14, depth=5)[source]¶ Generates bin ids which overlap the specified region.
Parameters: - rbeg (int) – inclusive beginning position of region
- rend (int) – exclusive end position of region
Yields: (int) – bin IDs for overlapping bins of region
Raises: AssertionError (Exception) – if the range is malformed or invalid
-
class
bamnostic.csi.
Csi
(filename)[source]¶ Bases:
bamnostic.bai.Bai
This class defines the bam CSI index file object and its interface.
The purpose of this class is the binary parsing of the bam index file (CSI) associated with a given bam file. When queried, the Csi object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.
Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)
-
_io
¶ opened CSI file object
Type: fileObject
-
_UNMAP_BIN
¶ constant for bin ID of unmapped read stats
Type: int
-
magic
¶ first 4 bytes of file. Must be equal to b’CSI’
Type: bytes
-
n_refs
¶ number of references in CSI
Type: int
-
unmapped
¶ dictionary of the unmapped read stats by each reference
Type: dict
-
current_ref
¶ dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.
Type: None|dict
-
ref_indices
¶ dictionary of reference ids and their start/stop offsets within theCSI file
Type: dict
-
n_no_coord
¶ if present inCSI, is the number of reads that have no coordinates
Type: None|int
-
_last_pos
¶ used for indexing, the byte position of the file head.
Type: int
-
get_chunks
(n_chunks)[source]¶ Simple generator for unpacking chunk data
Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.
Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.
Parameters: n_chunks (int) – number of chunks to be unpacked from stream Returns: - a list of Chunk objects with the attributes of
- chunks[i] are .voffset_beg and voffset_end
Return type: chunks (list)
-
get_bins
(n_bins, ref_id=None, idx=False)[source]¶ Simple function that iteratively unpacks the data of a number (n_bin) of bins.
As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in theCSI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.
As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True
Parameters: n_int (int) – number of bins to be unpacked from stream Returns: - None if just indexing the index file or a dictionary
- of bin_id: chunks pairs
Return type: bins (None | dict) Raises: AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly
-
get_ref
[source]¶ Iteratively unpacks all the bins, linear intervals, and chunks for a given reference
A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.
This function also serves to “index” theCSI file such that, if it is invoked by setting ids=True, will do a single pass through theCSI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing theCSI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.
Parameters: - ref_id (None|int) – used for random access or indexing theCSI
- idx (bool) – Flag for setting whether or not to run an index of theCSI
Returns: namedtuple containing the byte offsets of the reference start, stop, and number of bins o Ref: namedtuple containing a dictionary of bins and list of linear intervals
Return type: Raises: AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.
-
query
(ref_id, start, stop=-1)[source]¶ Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start
Parameters: - ref (int) – which reference/chromosome TID
- start (int) – left most bp position of region (zero-based)
- stop (int) – right most bp position of region (zero-based)
Returns: - the voffset_beg of the first chunk given the chunk’s voffset_end
is greater than the voffset of the linear index that overlaps the region of interest’s start offset
Return type: (int)
-
-
bamnostic.csi.
unpack_int32
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.csi.
unpack_int32L
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
BAMnostic Utilities (bamnostic.utils
)¶
-
bamnostic.utils.
format_warnings
(message, category, filename, lineno, file=None, line=None)[source]¶ Sets STDOUT warnings
Parameters: - message – the unformatted warning message being reported
- category (str) – the level of warning (handled by warnings module)
- filename (str) – filename for logging purposes (defaults to STDOUT)
- lineno (int) – where the error occurred.
Returns: formatted warning string
-
bamnostic.utils.
unpack_int32
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
bamnostic.utils.
unpack_int32L
()¶ Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
-
class
bamnostic.utils.
Roi
(contig=None, start=None, stop=None, tid=None)[source]¶ Bases:
object
Small __slots__ class for region of interest parsing
-
bamnostic.utils.
flag_decode
(flag_code)[source]¶ Simple read alignment flag decoder
Every read within a BAM file ought to have an associated flag code. Theses flags are used for read filtering and QC. The flags are described below. Additionally, they can be found here
Any given read’s flag is determined by the or (|) operand of all appropriate bit flags.
Parameters: flag_code (int) – either a standalone integer/bit flag or the read object itself Returns: list of flag and flag description tuples. Return type: ( list
oftuple
)Raises: ValueError
– if provided flag is not a valid entryExample
If a flag is 516 it is comprised of flag 4 and flag 512
>>> flag_decode(516) [(4, 'read unmapped'), (512, 'QC fail')]
Int Bit Description 1 0x1 Template having multiple segments in sequencing 2 0x2 Each segment properly aligned according to the aligner 4 0x4 Segment unmapped 8 0x8 Next segment in the template unmapped 16 0x10 SEQ being reverse complemented 32 0x20 SEQ of the next segment in the template being reverse complemented 64 0x40 The first segment in the template 128 0x80 The last segment in the template 256 0x100 Secondary alignment 512 0x200 Not passing filters, such as platform/vendor quality controls 1024 0x400 PCR or optical duplicate 2048 0x800 Supplementary alignment
-
bamnostic.utils.
parse_region
(contig=None, start=None, stop=None, region=None, tid=None, reference=None, end=None, until_eof=False)[source]¶ Parses region information from all user set parameters.
The main goal of this function is to handle the many different ways a user can put in genomic region data. One way is through keyword arguments. This is the most straight forward. However, due to Pysam’s API, there are multiple synonyms for reference/contig or stop/end. Additionally, if the user knows the refID/TID of the reference they are interested in, they can input that way as well.
The second form a submission can take is through positional arguments. Just like keyword, but ordered such that it makes up a genomic region of interest.
Note
Positional arguments make utilizing the tid parameter difficult since it is the 5th argument of the function signature.
The third form a submission can take is through using a SAM-formatted string. An example SAM-formatted string looks like ‘chr1:10-100’. As many users also copy & paste directly from tab-delimited files, such as BED files, a SAM-formatted string can take the form of ‘chr1\t10\t100’ where ‘\t’ indicates a tab space.
Lastly, the until_eof switch allows users to take all items from their desired start position (be it the whole reference or a specific spot on the reference). Setting this to True (default: False) will pull all reads to the end of the reference or file, whichever is first.
Parameters: - contig (str) – name of reference/contig
- start (int) – start position of region of interest (0-based)
- stop (int) – stop position of region of interest (0-based)
- region (str) – SAM region formatted string. Accepts tab-delimited values as well
- tid (int) – the refID or target id of a reference/contig
- until_eof (bool) – iterate until end of file (default: False)
Returns: region of interest formatted as an object with named attributes.
Return type: query (
bamnostic.utils.Roi
)Raises: ValueError
– if two synonym keywords are set, but contradict each otherExamples
# Keyword-based >>> parse_region(contig = 'chr1', start = 10, stop = 100) Roi(contig: chr1, start: 10, stop: 100) # Using `tid` instead of `contig` >>> parse_region(tid = 0, start = 10, stop = 100) Roi(tid: 0, start: 10, stop: 100) # Positional arguments >>> parse_region('chr1', 10, 100) Roi(contig: chr1, start: 10, stop: 100) # SAM-formatted string (keyword) >>> parse_region(region = 'chr1:10-100') Roi(contig: chr1, start: 10, stop: 100) # SAM-formatted string (positional) >>> parse_region('chr1:10-100') Roi(contig: chr1, start: 10, stop: 100) # Tab-delimited region string >>> parse_region('chr1\t10\t100') Roi(contig: chr1, start: 10, stop: 100) # Contradictory synonyms >>> parse_region(reference='chr1', contig='chr10', start=10, stop = 100) Traceback (most recent call last): ... ValueError: either contig or reference must be set, not both
-
bamnostic.utils.
unpack
(fmt, _io, is_array=False)[source]¶ Utility function for unpacking binary data from file object or byte stream.
The only difference between this method and struct.unpack is that unpack dynamically determines the size needed to read in based on the format string. Additionally, it can process a file object or byte stream and implement a read or slice (respectively). Mainly, this is a quality of life function.
Parameters: - fmt (str) – the string format of the binary data to be unpacked
- _io – built-in binary format reader (default: io.BufferedRandom)
Returns: unpacked contents from _io based on fmt string
-
bamnostic.utils.
make_virtual_offset
(block_start_offset, within_block_offset)[source]¶ Compute a BGZF virtual offset from block start and within block offsets.
The BAM indexing scheme records read positions using a 64 bit ‘virtual offset’, comprising in C terms:
block_start_offset << 16 | within_block_offset
Here block_start_offset is the file offset of the BGZF block start (unsigned integer using up to 64-16 = 48 bits), and within_block_offset within the (decompressed) block (unsigned 16 bit integer).
>>> make_virtual_offset(0, 0) 0
>>> make_virtual_offset(0, 1) 1
>>> make_virtual_offset(0, 2**16 - 1) 65535
>>> make_virtual_offset(0, 2**16) Traceback (most recent call last): ... ValueError: Require 0 <= within_block_offset < 2**16, got 65536
-
bamnostic.utils.
split_virtual_offset
(virtual_offset)[source]¶ Divides a 64-bit BGZF virtual offset into block start & within block offsets.
>>> (100000, 0) == split_virtual_offset(6553600000) True
>>> (100000, 10) == split_virtual_offset(6553600010) True
-
class
bamnostic.utils.
LruDict
(*args, **kwargs)[source]¶ Bases:
collections.OrderedDict
Simple least recently used (LRU) based dictionary that caches a given number of items.
-
get
(key)[source]¶ Basic getter that renews LRU status upon inspection
Parameters: key (str) – immutable dictionary key
-
-
bamnostic.utils.
parse_cigar
(cigar_str)[source]¶ Parses a CIGAR string and turns it into a list of tuples
Parameters: cigar_str (str) – the CIGAR string as shown in SAM entry Returns: list of tuples of CIGAR operations (by id) and number of operations Return type: cigar_array (list) Raises: ValueError
– if CIGAR operation is invalidExamples
>>> parse_cigar('3M1I3M1D5M') # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE [(('BAM_CMATCH', 0), 3), ..., (('BAM_CMATCH', 0), 5)]
-
bamnostic.utils.
check_cigar_arg
(cigar)[source]¶ Checks to make sure CIGAR arugment is valid.
Parameters: argument (str or list
) – CIGAR string (pre-formatted or raw)Returns: CIGAR re-formatted as a list Return type: ( list
)Raises: ValueError
– if CIGAR is not a string or pre-formatted list
-
bamnostic.utils.
cigar_changes
(seq, cigar)[source]¶ - Recreates the reference sequence to the extent that the CIGAR string can
- represent.
Parameters: - seq (str) – aligned segment sequence
- cigar (list) – list of tuples of cigar operations (by id) and number of operations
Returns: a version of the aligned segment’s reference sequence given the changes reflected in the cigar string
Return type: cigar_formatted_ref (str)
Raises: ValueError
– if CIGAR operation is invalidExamples
>>> cigar_changes('ACTAGAATGGCT', '3M1I3M1D5M') 'ACTGAATGGCT'
>>> cigar_changes('ACTAGAATGGCT', '3V1I3M1D5M') Traceback (most recent call last): ... ValueError: Invalid CIGAR operation (V).
-
bamnostic.utils.
md_changes
(seq, md_tag)[source]¶ Recreates the reference sequence of a given alignment to the extent that the MD tag can represent.
Note
Used in conjunction with cigar_changes to recreate the complete reference sequence
Parameters: - seq (str) – aligned segment sequence
- md_tag (str) – MD tag for associated sequence
Returns: a version of the aligned segment’s reference sequence given the changes reflected in the MD tag
Return type: ref_seq (str)
Raises: ValueError
– if MD tag is NoneExample
>>> md_changes('CTTATATTGGCCTT', '3C4AT4') 'CTTCTATTATCCTT'
-
bamnostic.utils.
ref_gen
(seq, cigar_string, md_tag)[source]¶ Recreates the reference sequence associated with the given segment.
Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome. Example reads, MD tags, and CIGAR strings taken from David Tang’s Blog.
Returns: generated reference sequence Return type: (str) Raises: KeyError
– if read does not contain MD tagExamples
# Only mismatches >>> seq = ‘CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG’ >>> cigar = ‘36M’ >>> md = ‘1A0C0C0C1T0C0T27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG’
# Insertions and mismatches >>> seq = ‘GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT’ >>> cigar = ‘6M1I29M’ >>> md = ‘0C1C0C1C0T0C27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT’
# Deletion and mismatches >>> seq = ‘AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC’ >>> cigar = ‘9M9D27M’ >>> md = ‘2G0A5^ATGATGTCA27’ >>> ref_gen(seq, cigar, md) ‘AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC’
# Insertion, deletion, and mismatches >>> seq = ‘AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA’ >>> cigar = ‘2M1I7M6D26M’ >>> md = ‘3C3T1^GCTCAG26’ >>> ref_gen(seq, cigar, md) ‘AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA’
-
bamnostic.utils.
cigar_alignment
(seq=None, cigar=None, start_pos=0, qualities=None, base_qual_thresh=0, query=False)[source]¶ Use the CIGAR to filter out all unaligned data bases
Any clipping results in the removal of those bases. If an insertion is seen in the CIGAR, those bases are removed from the sequence. If a deletion is seen in the CIGAR, those bases are padded with a period (‘.’) symbol.
Parameters: - seq (str) – string sequence of the aligned segment.
- cigar (str) – the cigar string or cigartuple of the aligned segment.
- start_pos (int) – the first aligned position of the read
- qualities (
array.array
) – base quality array from read
Yields: (
tuple
ofstr
andint
) – nucleotide base and index position of that base relative to referenceExample
>>> seq = 'AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA' >>> cigar = '2M1I7M6D26M' >>> start_position = 95 >>> c_a = cigar_alignment(seq, cigar, start_position) >>> next(c_a) ('A', 95)