pbsamtools is a commandline tool that generates SAM or BAM files from a cmp.h5 file and a list of bas.h5 files. The SAM files that are generated are of the PacBio SAM specification (see below) which extends the SAM v1.4-r962 format specification (samtools).
SAM is a TAB-delimited text format used for storing information associated with the alignment of a query sequence to a reference sequence along with information related to the technology used to generate the query sequence. The contents of PacBio SAM files are directly mapped to cmp.h5 data fields and their original sources. Those mappings are defined here.
Type | Tag | Regular Expression | Description | Comment |
---|---|---|---|---|
HD | VN | 1.2.0|1.2.0.SF|1.2.0.PB | cmp.h5 version | Required; root group attribute “Version” from cmp.h5 |
SQ | SN | [^ \\t\\n\\r@=]+ | Unique reference sequence name | Required; from “/RefInfo/FullName” |
LN | [0-9]+ | Reference sequence length | Required; from “/RefInfo/Length” | |
M5 | [a-fA-F\\d]{32} | MD5 checksum of reference sequence in uppercase with gaps and spaces removed | Required; from “/RefInfo/MD5” | |
RG | ID | [a-fA-F\\d]{10} | Unique read group ID (generated from RG:PU) | Required if @RG is present; 10 first characters of the hex digest of the RG:PU MD5 hash. The RG optional field per alignment record will also be included |
PU | m[0-9]+_[0-9]+_[a-zA-Z0 -9]+_c[0-9]+_s[0-9]_p[0 -9] | Unique platform unit as represented by the Movie Name | Optional if @RG is present; from “/MovieInfo/Name” in cmp.h5. The Movie Name stores date/time, instrument name, cell ID and pane ID | |
SM | c[0-9]+ | Unique cell ID (extracted from RG:PU | Optional if @RG is present; from “/MovieInfo/Name” in cmp.h5 after extracting the 4th field using ‘_’ as the delimiter | |
PL | PacBio_[A-Z]+ | Platform/technology used to generate the original cmp.h5 file | Optional if @RG is present; sequencing platform name (PacBio RS) | |
DT | [0-9\\-]+T[:0-9]+ | ISO8601 date/time (extracted from RG:PU | Optional if @RG is present; from “/MovieInfo/Name” in cmp.h5 after extracting the first 2 fields using ‘_’ as the delimiter. First fiels represents the date and the second the time | |
PG | ID | [a-zA-Z_\\.0-9]+ | Software that generated the SAM file | Required if @PG is present; |
VN | [a-zA-Z_\\.0-9]+ | Version of software that generated the SAM file | Optional if @PG is present; | |
CL | [a-zA-Z_\\.0-9]+ | Commandline used to generate the SAM file | Optional if @PG is present; includes all commandline tool arguments and absolute file paths | |
CO | [a-zA-Z:0-9]+ | One-line text of comments | Optional; |
The following example is of a PacBio SAM header generated from a cmp.h5 file containing 2 reference sequences:
@ HD VN:1.3
@SQ SN:lambda_NEB3011 LN:48502 M5:a1319ff90e994c8190a4fe6569d0822a
@RG ID:0f1b0b57bca8fd9143a2b48b3c6592ab SM:c100072840635400000315004305181152 PU:m110208_151047_richard_c100072840635400000315004305181152_s2_p0 PL:PacBio_RS DT:2011-02-08T15:10:47
@RG ID:fbaf81b5d7a6e69f897e793d4076c225 SM:c100072840635400000315004305181152 PU:m110208_151047_richard_c100072840635400000315004305181152_s1_p0 PL:PacBio_RS DT:2011-02-08T15:10:47
@PG ID:SAMIO.py VN:1.2.0.PB CL:TEST
@CO READS:181843
Notes:
Data fields for each PacBio SAM alignment are populated using data fields from cmp.h5 “AlnIndex” rows and their respective “AlnArray” mappings. In the presence of a soft of hard clipped CIGAR field, clippings are determined from entries in the “/PulseData/BaseCalls/Basecall” dataset of the HDF Pulse File and quality values from the “/PulseData/BaseCalls/QualityValue” dataset. Field descriptions are updated from the current SAM specification only if they are associated with information that is unique to PacBio.
Field | Regular Expression | Description | Comment |
---|---|---|---|
QNAME | [!-?A-~]{1,255} | AlignmentID defined as: Movie Name/Hole Number or RG:ID.Hole Number | 2 QNAME formats are supported. First is generated by concatenating Movie Name from “/MovieInfo/Name” with “HoleNumber” for the Hole Number, using “/” as the delimiter. The second is generated by concatenating the RG:ID with HoleNumber using a ‘.’ as the delimiter |
FLAG | [0,2p(16)-1] where 2p(16) is 2 to the the power of 16 | Bitwise FLAG | The following bits are supported: 1,2,8,10,20,40 and 80 |
RNAME | \*[!-()+-<>-~][!-~]* | Reference sequence name | Using the “RefGroupID” data field from the “AlnIndex” to index “RefInfo/FullName” |
POS | [0,2p(29)-1] where 2p(29) is 2 to the the power of 29 | 1-based leftmost mapping position of the fragment/subread to the reference | “tStart”+1 from “AlnIndex” |
MAPQ | [0,2p(8)-1] where 2p(8) is 2 to the the power of 8 | Mapping quality (phred-scaled posterior probability that the mapping position of this subread is incorrect) | Derived from “MapQV” from “AlnIndex” and set to 255 if “MapQV” is not calculated |
CIGAR | ([0-9]+[MIDSH])+|\\* | Extended CIGAR string | In the absence of clipping solely derived from “AlignmentArray” given offsets. In the presence of clipping, derived from “/PulseData/BaseCalls/Basecall” extracted from bas.h5 files |
RNEXT | *|= | Next fragment/subread reference sequence name | Equal to the RNAME of the next fragment/subread as ordered in the template/read sequence (ascending ordering of fragments by “rStart” from “AlnIndex”). For templates/reads containing a single fragment/subread set to ‘*’ else set to “=” |
PNEXT | [0,2p(29)-1] where 2p(29) is 2 to the the power of 29 | 1-based leftmost mapping position of the next fragment/subread to the reference | Set to POS of the current SAM record if it represents the last fragment/subread of a template/read or to POS of the next fragment/subread sorted in asecding ordering of fragments by “rStart” from “AlnIndex” |
TLEN | 0 | Observed template length | Undefined |
SEQ | [acgtnACGTN.=]+|\\* | Fragment/subread sequence | In the absence f soft clipping, derived from the “AlignmentArray” with gaps (‘-‘) removed. In the presence of soft clipping derived from “/PulseData/BaseCalls/Basecall” extracted from bas.h5 files |
SEQ | [!-~]+ | Query quality; ASCII-33 gives the Phred base quality | In the absence f soft clipping, derived from the “QualityValue” and thus representing “InsertionQV”. In the presence of soft clipping derived from “/PulseData/BaseCalls/QualityValue” extracted from bas.h5 files |
CIGAR strings currently support the following 3 formats:
A single SAM record is displayed below where the CIGAR sequence supports hard clipping:
m110208_151047_richard_c100072840635400000315004305181152_s2_p0/15 35 lambda_NEB3011 11044 255 2161H34M1I15M1I3M2I3M1I13M1D5M1I3M1I8M1I15M1I12M1I4M1I3M1D3M1I11M1I1M1I2M3I7M1I44M2I30M538H = 11050 0 GGTCCTGTCCAGAGCGCGGCAGGCGGCAGGGCTGAACGTTTAACCAGACCAAGCGGGAGGTCACTCAGCGCACGGTTACAGGCCGGGGGTAAAGCGGTGAGGCTCAGAATTGCGTCCATCAAGCCAAGATGTTGGCGCGTTCCTCCTCTGGGCTGCATCCCGGCGTGGAGGTGGAAAAGGTCGCTGAAGCCTTCGGGAAGCTGAACCCACAGACCCGACGTCGGGGCTGACGGCGA !,)'0+'+()1/,,+**%.-*$"-&--.()!/*,(///)(**/,)/#,*0!*//(+--/-)0.-))+---+--!&(*.%,,,).%&!/-*,#+%-,,)//$-'%1-1*/**.)-)'0++1&,(*!%0.-*1(!--,+,.!$#(+",)/(!!!,).+!$$!1*,.-'!+*,*(+-*!+,/.*//0)1+--%.)'-$--*-01*1)$%'/,+/--%,1!,*/'/')/+.(.!/++.-+ AS:i:974 FI:i:8 NM:i:26 RG:Z:0f1b0b57bca8fd9143a2b48b3c6592ab TC:i:9 XE:i:2397 XS:i:2162 XT:i:1 MD:Z:15C0G51^T50^G11T27C58
A single SAM record is displayed below where the CIGAR sequence supports softs clipping:
m110208_151047_richard_c100072840635400000315004305181152_s2_p0/15 83 lambda_NEB3011 11043 255 2661H9M1I16M1I5M1I6M3I12M1I9M2I10M2I7M1I4M1I3M3I5M1I2M2I8M1I39M1I11M1I5M1I2M1I2M2I3M1I6M2I23M58S = 11047 0 TGGTCCTGTTCCAGAGCCGGGCAGGCCGGCAGCGGCTGAGCACGTTTAACCCCAGACAGCGAGTTCCACTCAGCGCCAACTGGTTTAAGGGCGGTTCCGGTAAAGCGCGGTGAGGGCTCAGATTGCGTCCATCAGCCAGAGTGTGGCGCGTTTCTTCCTCTGCATCCCGGCGTCGGGAGGTGTGTGACAGGGGGTCGCTGAAGCCTTCGGGAAGCTTGACCCACAGGACCCCGAGCGGTCGGGGGCTGAGGGCGAGATGGCTCATTCTCTCTCT %0#,-0(.-$./)0(.+//)$'"/*.",(%'/!.$.!/$!!!-*,(+&)&/&()**++/,,+!0!+.(!//)+.'-!((00/,-+!-(-,#/0,!!!!/+--(),!!,/!,-*-&!,.-*/)-*,/,+-(+-.+/&.((++-)".+.'.!!),+.!/.%!/./*'''!/$/.-!1,#,/!!-/-!0,)***!!!++,/-.*#/.*-*+1,$+'00,!-+*/('++-)*+,)!/,!-%$,!1,"!$'-.%!!!-/,!(10+,.0-!*-,-*,/// AS:i:795 FI:i:1 NM:i:34 RG:Z:0f1b0b57bca8fd9143a2b48b3c6592ab TC:i:9 XE:i:274 XS:i:59 XT:i:1 MD:Z:45A0G1C32G80A24
pbsamtools depends on PacBio’s pbcore library as well as the numpy and h5py python libraries. Additionally, for generating BAM files as well as populating the SAM MD optional field, an installation of samtools is required of version equal to or greater than 0.16.
To install pbsamtools, run the following command from the pbsamtools root directory:
python setup.py install
pbsamtools depends on 2 input files:
Additionally, one can supply the path to the reference repository used for alignment (via --refrepos) but this is only optional. In the absence of this commandline argument the reference repository information as stored within the cmp.h5 file will be used. Finally, --readtype supports 3 types of reads used to generate alignments:
usage: pbsamtools.py [-h] [-i] [-d] [-v] [--outfile OUTFILE]
[--refrepos REFREPOS] [--plsFOFN PLSFOFN]
[--tmpDir TMPDIR] [--readtype READTYPE]
[--clipping CLIPPING] [--bam] [--md]
input.bas.h5
Tool for converting cmp.h5 files to SAM/BAM files.
Notes: For all command-line arguments, default values are listed in [].
positional arguments:
input.cmp.h5 input filename
optional arguments:
-h, --help show this help message and exit
-i, --info turn on progress monitoring to stdout [False]
-d, --debug turn on progress monitoring to stdout and keep temp files [False]
-v, --version show program's version number and exit
--outfile OUTFILE Name of output SAM file [output.sam]
--refrepos REFREPOS Path to Reference Sequence Repository (optional)
--plsFOFN PLSFOFN Path to input.fofn containing list of bas.h5 (required for Hard and Soft clipping)
--tmpDir TMPDIR temporary directory to use [./]
--readtype READTYPE Read type: RS, RCCS or dCCS [RS]
--clipping CLIPPING Clipping: Hard, Soft or None [None]
samtools arguments:
--bam Generate BAM file
--md Populate MD field in generated SAM file
Generate a soft clipped SAM file (myfile.sam) for a given cmp.h5 file containing continuous long reads:
pbsamtools --info --outfile=myfile.sam --refrepos=/path/to/my/refrepos --plsFOFN=/path/to/input.fofn --readtype=RS --clipping=Soft aligned_reads.cmp.h5
Generate a hard clipped BAM file (myfile.bam) with a populated MD optional field for a given cmp.h5 file containing continuous long reads:
pbsamtools --info --bam -- md --outfile=myfile.sam --refrepos=/path/to/my/refrepos --plsFOFN=/path/to/input.fofn --readtype=RS --clipping=Hard aligned_reads.cmp.h5