pbsamtools¶

pbsamtools is a commandline tool that generates SAM or BAM files from a cmp.h5 file and a list of bas.h5 files. The SAM files that are generated are of the PacBio SAM specification (see below) which extends the SAM v1.4-r962 format specification (samtools).

PacBio SAM specification¶

SAM is a TAB-delimited text format used for storing information associated with the alignment of a query sequence to a reference sequence along with information related to the technology used to generate the query sequence. The contents of PacBio SAM files are directly mapped to cmp.h5 data fields and their original sources. Those mappings are defined here.

Nomenclature¶

“/DataGroup/DataSet”: Path to a cmp.h5 dataset
DataField: Data field present in the cmp.h5 including Data Groups and DataSet attribute fields as well as AlnInfo/Index column names
@XX: Header record types are preceded with ‘@’ and followed by the 2 letter type (XX)

Header section¶

Type	Tag	Regular Expression	Description	Comment
HD	VN	`1.2.0\|1.2.0.SF\|1.2.0.PB`	`cmp.h5` version	Required; root group attribute “Version” from `cmp.h5`
SQ	SN	`[^ \\t\\n\\r@=]+`	Unique reference sequence name	Required; from “/RefInfo/FullName”
	LN	`[0-9]+`	Reference sequence length	Required; from “/RefInfo/Length”
	M5	`[a-fA-F\\d]{32}`	MD5 checksum of reference sequence in uppercase with gaps and spaces removed	Required; from “/RefInfo/MD5”
RG	ID	`[a-fA-F\\d]{10}`	Unique read group ID (generated from RG:PU)	Required if @RG is present; 10 first characters of the hex digest of the RG:PU MD5 hash. The RG optional field per alignment record will also be included
	PU	`m[0-9]+_[0-9]+_[a-zA-Z0` `-9]+_c[0-9]+_s[0-9]_p[0` `-9]`	Unique platform unit as represented by the Movie Name	Optional if @RG is present; from “/MovieInfo/Name” in `cmp.h5`. The Movie Name stores date/time, instrument name, cell ID and pane ID
	SM	`c[0-9]+`	Unique cell ID (extracted from RG:PU	Optional if @RG is present; from “/MovieInfo/Name” in `cmp.h5` after extracting the 4th field using ‘_’ as the delimiter
	PL	`PacBio_[A-Z]+`	Platform/technology used to generate the original `cmp.h5` file	Optional if @RG is present; sequencing platform name (PacBio RS)
	DT	`[0-9\\-]+T[:0-9]+`	ISO8601 date/time (extracted from RG:PU	Optional if @RG is present; from “/MovieInfo/Name” in `cmp.h5` after extracting the first 2 fields using ‘_’ as the delimiter. First fiels represents the date and the second the time
PG	ID	`[a-zA-Z_\\.0-9]+`	Software that generated the `SAM` file	Required if @PG is present;
	VN	`[a-zA-Z_\\.0-9]+`	Version of software that generated the `SAM` file	Optional if @PG is present;
	CL	`[a-zA-Z_\\.0-9]+`	Commandline used to generate the `SAM` file	Optional if @PG is present; includes all commandline tool arguments and absolute file paths
CO		`[a-zA-Z:0-9]+`	One-line text of comments	Optional;

The following example is of a PacBio SAM header generated from a cmp.h5 file containing 2 reference sequences:

@ HD        VN:1.3
@SQ SN:lambda_NEB3011       LN:48502        M5:a1319ff90e994c8190a4fe6569d0822a
@RG ID:0f1b0b57bca8fd9143a2b48b3c6592ab     SM:c100072840635400000315004305181152   PU:m110208_151047_richard_c100072840635400000315004305181152_s2_p0      PL:PacBio_RS    DT:2011-02-08T15:10:47
@RG ID:fbaf81b5d7a6e69f897e793d4076c225     SM:c100072840635400000315004305181152   PU:m110208_151047_richard_c100072840635400000315004305181152_s1_p0      PL:PacBio_RS    DT:2011-02-08T15:10:47
@PG ID:SAMIO.py     VN:1.2.0.PB     CL:TEST
@CO READS:181843

Notes:

Ordering of @SQ tags: Determined by the order of appearance of reference headers in the Reference Repository associated with the SAF HDF5 or provided as an argument by the user to SAMIO.py

Alignment section¶

Data fields for each PacBio SAM alignment are populated using data fields from cmp.h5 “AlnIndex” rows and their respective “AlnArray” mappings. In the presence of a soft of hard clipped CIGAR field, clippings are determined from entries in the “/PulseData/BaseCalls/Basecall” dataset of the HDF Pulse File and quality values from the “/PulseData/BaseCalls/QualityValue” dataset. Field descriptions are updated from the current SAM specification only if they are associated with information that is unique to PacBio.

Field	Regular Expression	Description	Comment
QNAME	`[!-?A-~]{1,255}`	AlignmentID defined as: Movie Name/Hole Number or RG:ID.Hole Number	2 QNAME formats are supported. First is generated by concatenating Movie Name from “/MovieInfo/Name” with “HoleNumber” for the Hole Number, using “/” as the delimiter. The second is generated by concatenating the RG:ID with HoleNumber using a ‘.’ as the delimiter
FLAG	`[0,2p(16)-1]` where `2p(16)` is 2 to the the power of 16	Bitwise FLAG	The following bits are supported: 1,2,8,10,20,40 and 80
RNAME	`\[!-()+-<>-~][!-~]`	Reference sequence name	Using the “RefGroupID” data field from the “AlnIndex” to index “RefInfo/FullName”
POS	`[0,2p(29)-1]` where `2p(29)` is 2 to the the power of 29	1-based leftmost mapping position of the fragment/subread to the reference	“tStart”+1 from “AlnIndex”
MAPQ	`[0,2p(8)-1]` where `2p(8)` is 2 to the the power of 8	Mapping quality (phred-scaled posterior probability that the mapping position of this subread is incorrect)	Derived from “MapQV” from “AlnIndex” and set to 255 if “MapQV” is not calculated
CIGAR	`([0-9]+[MIDSH])+\|\\*`	Extended CIGAR string	In the absence of clipping solely derived from “AlignmentArray” given offsets. In the presence of clipping, derived from “/PulseData/BaseCalls/Basecall” extracted from `bas.h5` files
RNEXT	`*\|=`	Next fragment/subread reference sequence name	Equal to the RNAME of the next fragment/subread as ordered in the template/read sequence (ascending ordering of fragments by “rStart” from “AlnIndex”). For templates/reads containing a single fragment/subread set to ‘*’ else set to “=”
PNEXT	`[0,2p(29)-1]` where `2p(29)` is 2 to the the power of 29	1-based leftmost mapping position of the next fragment/subread to the reference	Set to POS of the current `SAM` record if it represents the last fragment/subread of a template/read or to POS of the next fragment/subread sorted in asecding ordering of fragments by “rStart” from “AlnIndex”
TLEN	`0`	Observed template length	Undefined
SEQ	`[acgtnACGTN.=]+\|\\*`	Fragment/subread sequence	In the absence f soft clipping, derived from the “AlignmentArray” with gaps (‘-‘) removed. In the presence of soft clipping derived from “/PulseData/BaseCalls/Basecall” extracted from `bas.h5` files
SEQ	`[!-~]+`	Query quality; ASCII-33 gives the Phred base quality	In the absence f soft clipping, derived from the “QualityValue” and thus representing “InsertionQV”. In the presence of soft clipping derived from “/PulseData/BaseCalls/QualityValue” extracted from `bas.h5` files

CIGAR strings currently support the following 3 formats:

No clipping: Only the fragment/subread alignment is taken into consideration for generating the CIGAR sequence
Hard clipping: The entire template/read is taken into consideration for determining the values of the H CIGAR operators. For each subread alignment, all bases part of the entire read that fall outside the alignment are hard clipped.
Soft clipping: As with Hard clipping, the entire read is taken into consideration for determining the values of the H and S CIGAR operators. In this case H operators are used for clipping non-High Quality (HQ) region portions of the read and S operators are determined according to the following rule: SAM record sequence START = First HQ region base for a given read (first subread) or last reported base of previous SAM record in ascending FI order (not-first subread), SAM record sequence END = Last aligned subread base (not-last subread) or last HQ region base for given read (last subread)

Optional Fields¶

RG (TYPE:Z): Matches the @RG:ID tag and represents the unique read group ID
AS (TYPE:i): Alignment score equal to : (Number of Matches * 5 ) + (Number of Mismatches * -6) + (Number of Insertions * -5) + (Number of Deletions * -5)
NM (TYPE:i): Number of nucleotide differences from the reference sequence. Equal to the sum of the number Mismatches, Insertions and Deletions
TC (TYPE:i): Number of fragments/subreads in the template/read. This is Optional and only present in SAM records with CIGAR strings supporting clipping. Set to 1 for templates/reads containing a single fragment/subread alignment.
FI (TYPE:i): 1-based index of the fragment/subread relative to all other fragments/subreads in the template/read. Optional and only present in SAM records with CIGAR strings supporting clipping. 1-based indexing is used with ascending ordering of fragments by “rStart” from “AlnIndex”
XA (TYPE:Z): Comma-separated 1-based positions of the leftmost and rightmost adapter bases (‘_’ delimited) relative to all template/read bases. Optional and only present in SAM records with CIGAR strings supporting soft clipping. For N adapters associated with a SAM record a string of the following format is generated: adapter1Start_adapter1End,adapter2Start_adapter2End,adapterNStart_adapterNEnd
XS (TYPE:i): 1-based position of the leftmost base of the fragment/subread relative to all template/read bases. Equal to “rStart” from “AlnIndex”.
XE (TYPE:i): 1-based position of the rightmost base of the fragment/subread relative to all template/read bases. Equal to “rEnd” from “AlnIndex”.
XT (TYPE:i): Read type. 3 read types are supported: 1=RS (continuous long reads), 2=RCCS (Reference-based Circular Consensus Sequencing (CCS)), 3=CCS (de novo CCS)
XC (TYPE:i): CCS coverage. Optional and only present if the XT tag is set to 2. Represents the number of fragments/subreads used to generate the given SAM record SEQ.

A single SAM record is displayed below where the CIGAR sequence supports hard clipping:

m110208_151047_richard_c100072840635400000315004305181152_s2_p0/15  35      lambda_NEB3011  11044   255     2161H34M1I15M1I3M2I3M1I13M1D5M1I3M1I8M1I15M1I12M1I4M1I3M1D3M1I11M1I1M1I2M3I7M1I44M2I30M538H     =       11050   0       GGTCCTGTCCAGAGCGCGGCAGGCGGCAGGGCTGAACGTTTAACCAGACCAAGCGGGAGGTCACTCAGCGCACGGTTACAGGCCGGGGGTAAAGCGGTGAGGCTCAGAATTGCGTCCATCAAGCCAAGATGTTGGCGCGTTCCTCCTCTGGGCTGCATCCCGGCGTGGAGGTGGAAAAGGTCGCTGAAGCCTTCGGGAAGCTGAACCCACAGACCCGACGTCGGGGCTGACGGCGA    !,)'0+'+()1/,,+**%.-*$"-&--.()!/*,(///)(**/,)/#,*0!*//(+--/-)0.-))+---+--!&(*.%,,,).%&!/-*,#+%-,,)//$-'%1-1*/**.)-)'0++1&,(*!%0.-*1(!--,+,.!$#(+",)/(!!!,).+!$$!1*,.-'!+*,*(+-*!+,/.*//0)1+--%.)'-$--*-01*1)$%'/,+/--%,1!,*/'/')/+.(.!/++.-+    AS:i:974        FI:i:8  NM:i:26 RG:Z:0f1b0b57bca8fd9143a2b48b3c6592ab   TC:i:9  XE:i:2397       XS:i:2162       XT:i:1  MD:Z:15C0G51^T50^G11T27C58

A single SAM record is displayed below where the CIGAR sequence supports softs clipping:

m110208_151047_richard_c100072840635400000315004305181152_s2_p0/15  83      lambda_NEB3011  11043   255     2661H9M1I16M1I5M1I6M3I12M1I9M2I10M2I7M1I4M1I3M3I5M1I2M2I8M1I39M1I11M1I5M1I2M1I2M2I3M1I6M2I23M58S        =       11047   0       TGGTCCTGTTCCAGAGCCGGGCAGGCCGGCAGCGGCTGAGCACGTTTAACCCCAGACAGCGAGTTCCACTCAGCGCCAACTGGTTTAAGGGCGGTTCCGGTAAAGCGCGGTGAGGGCTCAGATTGCGTCCATCAGCCAGAGTGTGGCGCGTTTCTTCCTCTGCATCCCGGCGTCGGGAGGTGTGTGACAGGGGGTCGCTGAAGCCTTCGGGAAGCTTGACCCACAGGACCCCGAGCGGTCGGGGGCTGAGGGCGAGATGGCTCATTCTCTCTCT      %0#,-0(.-$./)0(.+//)$'"/*.",(%'/!.$.!/$!!!-*,(+&)&/&()**++/,,+!0!+.(!//)+.'-!((00/,-+!-(-,#/0,!!!!/+--(),!!,/!,-*-&!,.-*/)-*,/,+-(+-.+/&.((++-)".+.'.!!),+.!/.%!/./*'''!/$/.-!1,#,/!!-/-!0,)***!!!++,/-.*#/.*-*+1,$+'00,!-+*/('++-)*+,)!/,!-%$,!1,"!$'-.%!!!-/,!(10+,.0-!*-,-*,///      AS:i:795        FI:i:1  NM:i:34 RG:Z:0f1b0b57bca8fd9143a2b48b3c6592ab   TC:i:9  XE:i:274        XS:i:59 XT:i:1  MD:Z:45A0G1C32G80A24

Dependencies¶

pbsamtools depends on PacBio’s pbcore library as well as the numpy and h5py python libraries. Additionally, for generating BAM files as well as populating the SAM MD optional field, an installation of samtools is required of version equal to or greater than 0.16.

Installation¶

To install pbsamtools, run the following command from the pbsamtools root directory:

python setup.py install

Usage¶

pbsamtools depends on 2 input files:

cmp.h5: file containing all alignment information
input.fofn: text file containing paths to all bas.h5 used for generating alignments separated by new lines.

Additionally, one can supply the path to the reference repository used for alignment (via --refrepos) but this is only optional. In the absence of this commandline argument the reference repository information as stored within the cmp.h5 file will be used. Finally, --readtype supports 3 types of reads used to generate alignments:

continuous long reads (--readtype=RS)
reference-based circularconsensus sequencing reads (--readtype=RCCS)
de novo circular consensus sequencing reads (--readtype=CCS)

usage: pbsamtools.py [-h] [-i] [-d] [-v] [--outfile OUTFILE]
                     [--refrepos REFREPOS] [--plsFOFN PLSFOFN]
                     [--tmpDir TMPDIR] [--readtype READTYPE]
                     [--clipping CLIPPING] [--bam] [--md]
                     input.bas.h5

Tool for converting cmp.h5 files to SAM/BAM files.
Notes: For all command-line arguments, default values are listed in [].

positional arguments:
  input.cmp.h5         input filename

optional arguments:
  -h, --help           show this help message and exit
  -i, --info           turn on progress monitoring to stdout [False]
  -d, --debug          turn on progress monitoring to stdout and keep temp files [False]
  -v, --version        show program's version number and exit
  --outfile OUTFILE    Name of output SAM file [output.sam]
  --refrepos REFREPOS  Path to Reference Sequence Repository (optional)
  --plsFOFN PLSFOFN    Path to input.fofn containing list of bas.h5 (required for Hard and Soft clipping)
  --tmpDir TMPDIR      temporary directory to use [./]
  --readtype READTYPE  Read type: RS, RCCS or dCCS [RS]
  --clipping CLIPPING  Clipping: Hard, Soft or None [None]

samtools arguments:
  --bam                Generate BAM file
  --md                 Populate MD field in generated SAM file

Examples¶

Generate a soft clipped SAM file (myfile.sam) for a given cmp.h5 file containing continuous long reads:

pbsamtools --info --outfile=myfile.sam --refrepos=/path/to/my/refrepos --plsFOFN=/path/to/input.fofn --readtype=RS --clipping=Soft aligned_reads.cmp.h5

Generate a hard clipped BAM file (myfile.bam) with a populated MD optional field for a given cmp.h5 file containing continuous long reads:

pbsamtools --info --bam -- md --outfile=myfile.sam --refrepos=/path/to/my/refrepos --plsFOFN=/path/to/input.fofn --readtype=RS --clipping=Hard aligned_reads.cmp.h5

Known Issues¶

None

pbsamtools¶

PacBio SAM specification¶

Nomenclature¶

Header section¶

Alignment section¶

Optional Fields¶

Dependencies¶

Installation¶

Usage¶

Examples¶

Known Issues¶

Indices and tables¶

Table Of Contents

Previous topic

This Page

Navigation

pbsamtools¶

PacBio SAM specification¶

Nomenclature¶

Header section¶

Alignment section¶

Optional Fields¶

Dependencies¶

Installation¶

Usage¶

Examples¶

Known Issues¶

Indices and tables¶

Table Of Contents

Previous topic

This Page

Quick search

Navigation