pbsamtools

pbsamtools is a commandline tool that generates SAM or BAM files from a cmp.h5 file and a list of bas.h5 files. The SAM files that are generated are of the PacBio SAM specification (see below) which extends the SAM v1.4-r962 format specification (samtools).

PacBio SAM specification

SAM is a TAB-delimited text format used for storing information associated with the alignment of a query sequence to a reference sequence along with information related to the technology used to generate the query sequence. The contents of PacBio SAM files are directly mapped to cmp.h5 data fields and their original sources. Those mappings are defined here.

Nomenclature

  • “/DataGroup/DataSet”: Path to a cmp.h5 dataset
  • DataField: Data field present in the cmp.h5 including Data Groups and DataSet attribute fields as well as AlnInfo/Index column names
  • @XX: Header record types are preceded with ‘@’ and followed by the 2 letter type (XX)

Header section

Type Tag Regular Expression Description Comment
HD VN 1.2.0|1.2.0.SF|1.2.0.PB cmp.h5 version Required; root group attribute “Version” from cmp.h5
SQ SN [^ \\t\\n\\r@=]+ Unique reference sequence name Required; from “/RefInfo/FullName”
  LN [0-9]+ Reference sequence length Required; from “/RefInfo/Length”
  M5 [a-fA-F\\d]{32} MD5 checksum of reference sequence in uppercase with gaps and spaces removed Required; from “/RefInfo/MD5”
RG ID [a-fA-F\\d]{10} Unique read group ID (generated from RG:PU) Required if @RG is present; 10 first characters of the hex digest of the RG:PU MD5 hash. The RG optional field per alignment record will also be included
  PU m[0-9]+_[0-9]+_[a-zA-Z0 -9]+_c[0-9]+_s[0-9]_p[0 -9] Unique platform unit as represented by the Movie Name Optional if @RG is present; from “/MovieInfo/Name” in cmp.h5. The Movie Name stores date/time, instrument name, cell ID and pane ID
  SM c[0-9]+ Unique cell ID (extracted from RG:PU Optional if @RG is present; from “/MovieInfo/Name” in cmp.h5 after extracting the 4th field using ‘_’ as the delimiter
  PL PacBio_[A-Z]+ Platform/technology used to generate the original cmp.h5 file Optional if @RG is present; sequencing platform name (PacBio RS)
  DT [0-9\\-]+T[:0-9]+ ISO8601 date/time (extracted from RG:PU Optional if @RG is present; from “/MovieInfo/Name” in cmp.h5 after extracting the first 2 fields using ‘_’ as the delimiter. First fiels represents the date and the second the time
PG ID [a-zA-Z_\\.0-9]+ Software that generated the SAM file Required if @PG is present;
  VN [a-zA-Z_\\.0-9]+ Version of software that generated the SAM file Optional if @PG is present;
  CL [a-zA-Z_\\.0-9]+ Commandline used to generate the SAM file Optional if @PG is present; includes all commandline tool arguments and absolute file paths
CO   [a-zA-Z:0-9]+ One-line text of comments Optional;

The following example is of a PacBio SAM header generated from a cmp.h5 file containing 2 reference sequences:

@ HD        VN:1.3
@SQ SN:lambda_NEB3011       LN:48502        M5:a1319ff90e994c8190a4fe6569d0822a
@RG ID:0f1b0b57bca8fd9143a2b48b3c6592ab     SM:c100072840635400000315004305181152   PU:m110208_151047_richard_c100072840635400000315004305181152_s2_p0      PL:PacBio_RS    DT:2011-02-08T15:10:47
@RG ID:fbaf81b5d7a6e69f897e793d4076c225     SM:c100072840635400000315004305181152   PU:m110208_151047_richard_c100072840635400000315004305181152_s1_p0      PL:PacBio_RS    DT:2011-02-08T15:10:47
@PG ID:SAMIO.py     VN:1.2.0.PB     CL:TEST
@CO READS:181843

Notes:

  1. Ordering of @SQ tags: Determined by the order of appearance of reference headers in the Reference Repository associated with the SAF HDF5 or provided as an argument by the user to SAMIO.py

Alignment section

Data fields for each PacBio SAM alignment are populated using data fields from cmp.h5 “AlnIndex” rows and their respective “AlnArray” mappings. In the presence of a soft of hard clipped CIGAR field, clippings are determined from entries in the “/PulseData/BaseCalls/Basecall” dataset of the HDF Pulse File and quality values from the “/PulseData/BaseCalls/QualityValue” dataset. Field descriptions are updated from the current SAM specification only if they are associated with information that is unique to PacBio.

Field Regular Expression Description Comment
QNAME [!-?A-~]{1,255} AlignmentID defined as: Movie Name/Hole Number or RG:ID.Hole Number 2 QNAME formats are supported. First is generated by concatenating Movie Name from “/MovieInfo/Name” with “HoleNumber” for the Hole Number, using “/” as the delimiter. The second is generated by concatenating the RG:ID with HoleNumber using a ‘.’ as the delimiter
FLAG [0,2p(16)-1] where 2p(16) is 2 to the the power of 16 Bitwise FLAG The following bits are supported: 1,2,8,10,20,40 and 80
RNAME \*[!-()+-<>-~][!-~]* Reference sequence name Using the “RefGroupID” data field from the “AlnIndex” to index “RefInfo/FullName”
POS [0,2p(29)-1] where 2p(29) is 2 to the the power of 29 1-based leftmost mapping position of the fragment/subread to the reference “tStart”+1 from “AlnIndex”
MAPQ [0,2p(8)-1] where 2p(8) is 2 to the the power of 8 Mapping quality (phred-scaled posterior probability that the mapping position of this subread is incorrect) Derived from “MapQV” from “AlnIndex” and set to 255 if “MapQV” is not calculated
CIGAR ([0-9]+[MIDSH])+|\\* Extended CIGAR string In the absence of clipping solely derived from “AlignmentArray” given offsets. In the presence of clipping, derived from “/PulseData/BaseCalls/Basecall” extracted from bas.h5 files
RNEXT *|= Next fragment/subread reference sequence name Equal to the RNAME of the next fragment/subread as ordered in the template/read sequence (ascending ordering of fragments by “rStart” from “AlnIndex”). For templates/reads containing a single fragment/subread set to ‘*’ else set to “=”
PNEXT [0,2p(29)-1] where 2p(29) is 2 to the the power of 29 1-based leftmost mapping position of the next fragment/subread to the reference Set to POS of the current SAM record if it represents the last fragment/subread of a template/read or to POS of the next fragment/subread sorted in asecding ordering of fragments by “rStart” from “AlnIndex”
TLEN 0 Observed template length Undefined
SEQ [acgtnACGTN.=]+|\\* Fragment/subread sequence In the absence f soft clipping, derived from the “AlignmentArray” with gaps (‘-‘) removed. In the presence of soft clipping derived from “/PulseData/BaseCalls/Basecall” extracted from bas.h5 files
SEQ [!-~]+ Query quality; ASCII-33 gives the Phred base quality In the absence f soft clipping, derived from the “QualityValue” and thus representing “InsertionQV”. In the presence of soft clipping derived from “/PulseData/BaseCalls/QualityValue” extracted from bas.h5 files

CIGAR strings currently support the following 3 formats:

  1. No clipping: Only the fragment/subread alignment is taken into consideration for generating the CIGAR sequence
  2. Hard clipping: The entire template/read is taken into consideration for determining the values of the H CIGAR operators. For each subread alignment, all bases part of the entire read that fall outside the alignment are hard clipped.
  3. Soft clipping: As with Hard clipping, the entire read is taken into consideration for determining the values of the H and S CIGAR operators. In this case H operators are used for clipping non-High Quality (HQ) region portions of the read and S operators are determined according to the following rule: SAM record sequence START = First HQ region base for a given read (first subread) or last reported base of previous SAM record in ascending FI order (not-first subread), SAM record sequence END = Last aligned subread base (not-last subread) or last HQ region base for given read (last subread)

Optional Fields

  1. RG (TYPE:Z): Matches the @RG:ID tag and represents the unique read group ID
  2. AS (TYPE:i): Alignment score equal to : (Number of Matches * 5 ) + (Number of Mismatches * -6) + (Number of Insertions * -5) + (Number of Deletions * -5)
  3. NM (TYPE:i): Number of nucleotide differences from the reference sequence. Equal to the sum of the number Mismatches, Insertions and Deletions
  4. TC (TYPE:i): Number of fragments/subreads in the template/read. This is Optional and only present in SAM records with CIGAR strings supporting clipping. Set to 1 for templates/reads containing a single fragment/subread alignment.
  5. FI (TYPE:i): 1-based index of the fragment/subread relative to all other fragments/subreads in the template/read. Optional and only present in SAM records with CIGAR strings supporting clipping. 1-based indexing is used with ascending ordering of fragments by “rStart” from “AlnIndex”
  6. XA (TYPE:Z): Comma-separated 1-based positions of the leftmost and rightmost adapter bases (‘_’ delimited) relative to all template/read bases. Optional and only present in SAM records with CIGAR strings supporting soft clipping. For N adapters associated with a SAM record a string of the following format is generated: adapter1Start_adapter1End,adapter2Start_adapter2End,adapterNStart_adapterNEnd
  7. XS (TYPE:i): 1-based position of the leftmost base of the fragment/subread relative to all template/read bases. Equal to “rStart” from “AlnIndex”.
  8. XE (TYPE:i): 1-based position of the rightmost base of the fragment/subread relative to all template/read bases. Equal to “rEnd” from “AlnIndex”.
  9. XT (TYPE:i): Read type. 3 read types are supported: 1=RS (continuous long reads), 2=RCCS (Reference-based Circular Consensus Sequencing (CCS)), 3=CCS (de novo CCS)
  10. XC (TYPE:i): CCS coverage. Optional and only present if the XT tag is set to 2. Represents the number of fragments/subreads used to generate the given SAM record SEQ.

A single SAM record is displayed below where the CIGAR sequence supports hard clipping:

m110208_151047_richard_c100072840635400000315004305181152_s2_p0/15  35      lambda_NEB3011  11044   255     2161H34M1I15M1I3M2I3M1I13M1D5M1I3M1I8M1I15M1I12M1I4M1I3M1D3M1I11M1I1M1I2M3I7M1I44M2I30M538H     =       11050   0       GGTCCTGTCCAGAGCGCGGCAGGCGGCAGGGCTGAACGTTTAACCAGACCAAGCGGGAGGTCACTCAGCGCACGGTTACAGGCCGGGGGTAAAGCGGTGAGGCTCAGAATTGCGTCCATCAAGCCAAGATGTTGGCGCGTTCCTCCTCTGGGCTGCATCCCGGCGTGGAGGTGGAAAAGGTCGCTGAAGCCTTCGGGAAGCTGAACCCACAGACCCGACGTCGGGGCTGACGGCGA    !,)'0+'+()1/,,+**%.-*$"-&--.()!/*,(///)(**/,)/#,*0!*//(+--/-)0.-))+---+--!&(*.%,,,).%&!/-*,#+%-,,)//$-'%1-1*/**.)-)'0++1&,(*!%0.-*1(!--,+,.!$#(+",)/(!!!,).+!$$!1*,.-'!+*,*(+-*!+,/.*//0)1+--%.)'-$--*-01*1)$%'/,+/--%,1!,*/'/')/+.(.!/++.-+    AS:i:974        FI:i:8  NM:i:26 RG:Z:0f1b0b57bca8fd9143a2b48b3c6592ab   TC:i:9  XE:i:2397       XS:i:2162       XT:i:1  MD:Z:15C0G51^T50^G11T27C58

A single SAM record is displayed below where the CIGAR sequence supports softs clipping:

m110208_151047_richard_c100072840635400000315004305181152_s2_p0/15  83      lambda_NEB3011  11043   255     2661H9M1I16M1I5M1I6M3I12M1I9M2I10M2I7M1I4M1I3M3I5M1I2M2I8M1I39M1I11M1I5M1I2M1I2M2I3M1I6M2I23M58S        =       11047   0       TGGTCCTGTTCCAGAGCCGGGCAGGCCGGCAGCGGCTGAGCACGTTTAACCCCAGACAGCGAGTTCCACTCAGCGCCAACTGGTTTAAGGGCGGTTCCGGTAAAGCGCGGTGAGGGCTCAGATTGCGTCCATCAGCCAGAGTGTGGCGCGTTTCTTCCTCTGCATCCCGGCGTCGGGAGGTGTGTGACAGGGGGTCGCTGAAGCCTTCGGGAAGCTTGACCCACAGGACCCCGAGCGGTCGGGGGCTGAGGGCGAGATGGCTCATTCTCTCTCT      %0#,-0(.-$./)0(.+//)$'"/*.",(%'/!.$.!/$!!!-*,(+&)&/&()**++/,,+!0!+.(!//)+.'-!((00/,-+!-(-,#/0,!!!!/+--(),!!,/!,-*-&!,.-*/)-*,/,+-(+-.+/&.((++-)".+.'.!!),+.!/.%!/./*'''!/$/.-!1,#,/!!-/-!0,)***!!!++,/-.*#/.*-*+1,$+'00,!-+*/('++-)*+,)!/,!-%$,!1,"!$'-.%!!!-/,!(10+,.0-!*-,-*,///      AS:i:795        FI:i:1  NM:i:34 RG:Z:0f1b0b57bca8fd9143a2b48b3c6592ab   TC:i:9  XE:i:274        XS:i:59 XT:i:1  MD:Z:45A0G1C32G80A24

Dependencies

pbsamtools depends on PacBio’s pbcore library as well as the numpy and h5py python libraries. Additionally, for generating BAM files as well as populating the SAM MD optional field, an installation of samtools is required of version equal to or greater than 0.16.

Installation

To install pbsamtools, run the following command from the pbsamtools root directory:

python setup.py install

Usage

pbsamtools depends on 2 input files:

  • cmp.h5: file containing all alignment information
  • input.fofn: text file containing paths to all bas.h5 used for generating alignments separated by new lines.

Additionally, one can supply the path to the reference repository used for alignment (via --refrepos) but this is only optional. In the absence of this commandline argument the reference repository information as stored within the cmp.h5 file will be used. Finally, --readtype supports 3 types of reads used to generate alignments:

  • continuous long reads (--readtype=RS)
  • reference-based circularconsensus sequencing reads (--readtype=RCCS)
  • de novo circular consensus sequencing reads (--readtype=CCS)
usage: pbsamtools.py [-h] [-i] [-d] [-v] [--outfile OUTFILE]
                     [--refrepos REFREPOS] [--plsFOFN PLSFOFN]
                     [--tmpDir TMPDIR] [--readtype READTYPE]
                     [--clipping CLIPPING] [--bam] [--md]
                     input.bas.h5

Tool for converting cmp.h5 files to SAM/BAM files.
Notes: For all command-line arguments, default values are listed in [].

positional arguments:
  input.cmp.h5         input filename

optional arguments:
  -h, --help           show this help message and exit
  -i, --info           turn on progress monitoring to stdout [False]
  -d, --debug          turn on progress monitoring to stdout and keep temp files [False]
  -v, --version        show program's version number and exit
  --outfile OUTFILE    Name of output SAM file [output.sam]
  --refrepos REFREPOS  Path to Reference Sequence Repository (optional)
  --plsFOFN PLSFOFN    Path to input.fofn containing list of bas.h5 (required for Hard and Soft clipping)
  --tmpDir TMPDIR      temporary directory to use [./]
  --readtype READTYPE  Read type: RS, RCCS or dCCS [RS]
  --clipping CLIPPING  Clipping: Hard, Soft or None [None]

samtools arguments:
  --bam                Generate BAM file
  --md                 Populate MD field in generated SAM file

Examples

Generate a soft clipped SAM file (myfile.sam) for a given cmp.h5 file containing continuous long reads:

pbsamtools --info --outfile=myfile.sam --refrepos=/path/to/my/refrepos --plsFOFN=/path/to/input.fofn --readtype=RS --clipping=Soft aligned_reads.cmp.h5

Generate a hard clipped BAM file (myfile.bam) with a populated MD optional field for a given cmp.h5 file containing continuous long reads:

pbsamtools --info --bam -- md --outfile=myfile.sam --refrepos=/path/to/my/refrepos --plsFOFN=/path/to/input.fofn --readtype=RS --clipping=Hard aligned_reads.cmp.h5

Known Issues

  • None

Indices and tables