1
2
3
4
5
6
7
8
9 """Bio.SeqIO support for the FASTQ and QUAL file formats.
10
11 Note that you are expected to use this code via the Bio.SeqIO interface, as
12 shown below.
13
14 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
15 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
16 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
17 are used containing the sequence and the quality information separately.
18
19 The PHRED software reads DNA sequencing trace files, calls bases, and
20 assigns a non-negative quality value to each called base using a logged
21 transformation of the error probability, Q = -10 log10( Pe ), for example::
22
23 Pe = 1.0, Q = 0
24 Pe = 0.1, Q = 10
25 Pe = 0.01, Q = 20
26 ...
27 Pe = 0.00000001, Q = 80
28 Pe = 0.000000001, Q = 90
29
30 In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
31 In the QUAL format these quality values are held as space separated text in
32 a FASTA like file format. In the FASTQ format, each quality values is encoded
33 with a single ASCI character using chr(Q+33), meaning zero maps to the
34 character "!" and for example 80 maps to "q". For the Sanger FASTQ standard
35 the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
36 quality are then stored in pairs in a FASTA like format.
37
38 Unfortunately there is no official document describing the FASTQ file format,
39 and worse, several related but different variants exist. Reasonable
40 documentation exists at: http://maq.sourceforge.net/fastq.shtml
41
42 The good news is that Roche 454 sequencers can output files in the QUAL format,
43 and sensibly they use PHREP style scores like Sanger. Converting a pair of
44 FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
45 files from a Roche 454 SFF binary file, use the Roche off instrument command
46 line tool "sffinfo" with the -q or -qual argument. You can extract a matching
47 FASTA file using the -s or -seq argument instead.
48
49 The bad news is that Solexa/Illumina did things differently - they have their
50 own scoring system AND their own incompatible versions of the FASTQ format.
51 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
52 be negative. PHRED scores and Solexa scores are NOT interchangeable (but a
53 reasonable mapping can be achieved between them, and they are approximately
54 equal for high quality reads).
55
56 Confusingly early Solexa pipelines produced a FASTQ like file but using their
57 own score mapping and an ASCII offset of 64. To make things worse, for the
58 Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
59 FASTQ file format, this time using PHRED scores (which is more consistent) but
60 with an ASCII offset of 64.
61
62 i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
63 file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
64
65 You are expected to use this module via the Bio.SeqIO functions, with the
66 following format names:
67
68 - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
69 - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
70 offset of 33 (e.g. from the NCBI Short Read Archive). These can hold PHRED
71 scores from 0 to 93.
72 - "fastq-sanger" is an alias for "fastq".
73 - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
74 files, using Solexa scores with an ASCII offset 64. These can hold Solexa
75 scores from -5 to 62.
76 - "fastq-illumina" means new Illumina 1.3+ style FASTQ files, using PHRED
77 scores but with an ASCII offset 64, allowing PHRED scores from 0 to 62.
78
79 We could potentially add support for "qual-solexa" meaning QUAL files which
80 contain Solexa scores, but thus far there isn't any reason to use such files.
81
82 For example, consider the following short FASTQ file::
83
84 @EAS54_6_R1_2_1_413_324
85 CCCTTCTTGTCTTCAGCGTTTCTCC
86 +
87 ;;3;;;;;;;;;;;;7;;;;;;;88
88 @EAS54_6_R1_2_1_540_792
89 TTGGCAGGCCAAGGCCGATGGATCA
90 +
91 ;;;;;;;;;;;7;;;;;-;;;3;83
92 @EAS54_6_R1_2_1_443_348
93 GTTGCTTCTGGCGTGGGTGGGGGGG
94 +
95 ;;;;;;;;;;;9;7;;.7;393333
96
97 This contains three reads of length 25. From the read length these were
98 probably originally from an early Solexa/Illumina sequencer but this file
99 follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
100 offet of 33). This means we can parse this file using Bio.SeqIO using
101 "fastq" as the format name:
102
103 >>> from Bio import SeqIO
104 >>> for record in SeqIO.parse(open("Quality/example.fastq"), "fastq"):
105 ... print record.id, record.seq
106 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
107 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
108 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
109
110 The qualities are held as a list of integers in each record's annotation:
111
112 >>> print record
113 ID: EAS54_6_R1_2_1_443_348
114 Name: EAS54_6_R1_2_1_443_348
115 Description: EAS54_6_R1_2_1_443_348
116 Number of features: 0
117 Per letter annotation for: phred_quality
118 Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
119 >>> print record.letter_annotations["phred_quality"]
120 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
121
122 You can use the SeqRecord format method to show this in the QUAL format:
123
124 >>> print record.format("qual")
125 >EAS54_6_R1_2_1_443_348
126 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
127 24 18 18 18 18
128 <BLANKLINE>
129
130 Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
131
132 >>> print record.format("fastq")
133 @EAS54_6_R1_2_1_443_348
134 GTTGCTTCTGGCGTGGGTGGGGGGG
135 +
136 ;;;;;;;;;;;9;7;;.7;393333
137 <BLANKLINE>
138
139 Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
140 of 64):
141
142 >>> print record.format("fastq-illumina")
143 @EAS54_6_R1_2_1_443_348
144 GTTGCTTCTGGCGTGGGTGGGGGGG
145 +
146 ZZZZZZZZZZZXZVZZMVZRXRRRR
147 <BLANKLINE>
148
149 You can also get Biopython to convert the scores and show a Solexa style
150 FASTQ file:
151
152 >>> print record.format("fastq-solexa")
153 @EAS54_6_R1_2_1_443_348
154 GTTGCTTCTGGCGTGGGTGGGGGGG
155 +
156 ZZZZZZZZZZZXZVZZMVZRXRRRR
157 <BLANKLINE>
158
159 Notice that this is actually the same output as above using "fastq-illumina"
160 as the format! The reason for this is all these scores are high enough that
161 the PHRED and Solexa scores are almost equal. The differences become apparent
162 for poor quality reads. See the functions solexa_quality_from_phred and
163 phred_quality_from_solexa for more details.
164
165 If you wanted to trim your sequences (perhaps to remove low quality regions,
166 or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
167
168 >>> sub_rec = record[5:15]
169 >>> print sub_rec
170 ID: EAS54_6_R1_2_1_443_348
171 Name: EAS54_6_R1_2_1_443_348
172 Description: EAS54_6_R1_2_1_443_348
173 Number of features: 0
174 Per letter annotation for: phred_quality
175 Seq('TTCTGGCGTG', SingleLetterAlphabet())
176 >>> print sub_rec.letter_annotations["phred_quality"]
177 [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
178 >>> print sub_rec.format("fastq")
179 @EAS54_6_R1_2_1_443_348
180 TTCTGGCGTG
181 +
182 ;;;;;;9;7;
183 <BLANKLINE>
184
185 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
186
187 >>> from Bio import SeqIO
188 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
189 >>> out_handle = open("Quality/temp.qual", "w")
190 >>> SeqIO.write(record_iterator, out_handle, "qual")
191 3
192 >>> out_handle.close()
193
194 You can of course read in a QUAL file, such as the one we just created:
195
196 >>> from Bio import SeqIO
197 >>> for record in SeqIO.parse(open("Quality/temp.qual"), "qual"):
198 ... print record.id, record.seq
199 EAS54_6_R1_2_1_413_324 ?????????????????????????
200 EAS54_6_R1_2_1_540_792 ?????????????????????????
201 EAS54_6_R1_2_1_443_348 ?????????????????????????
202
203 Notice that QUAL files don't have a proper sequence present! But the quality
204 information is there:
205
206 >>> print record
207 ID: EAS54_6_R1_2_1_443_348
208 Name: EAS54_6_R1_2_1_443_348
209 Description: EAS54_6_R1_2_1_443_348
210 Number of features: 0
211 Per letter annotation for: phred_quality
212 UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
213 >>> print record.letter_annotations["phred_quality"]
214 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
215
216 Just to keep things tidy, if you are following this example yourself, you can
217 delete this temporary file now:
218
219 >>> import os
220 >>> os.remove("Quality/temp.qual")
221
222 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
223 files. Because the Bio.SeqIO system is designed for reading single files, you
224 would have to read the two in separately and then combine the data. However,
225 since this is such a common thing to want to do, there is a helper iterator
226 defined in this module that does this for you - PairedFastaQualIterator.
227
228 Alternatively, if you have enough RAM to hold all the records in memory at once,
229 then a simple dictionary approach would work:
230
231 >>> from Bio import SeqIO
232 >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
233 >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"):
234 ... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
235
236 You can then access any record by its key, and get both the sequence and the
237 quality scores.
238
239 >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
240 @EAS54_6_R1_2_1_540_792
241 TTGGCAGGCCAAGGCCGATGGATCA
242 +
243 ;;;;;;;;;;;7;;;;;-;;;3;83
244 <BLANKLINE>
245
246 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
247 using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
248 "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
249 for the more recent variant), as this cannot be detected reliably
250 automatically.
251
252 To illustrate this problem, let's consider an artifical example:
253
254 >>> from Bio.Seq import Seq
255 >>> from Bio.Alphabet import generic_dna
256 >>> from Bio.SeqRecord import SeqRecord
257 >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
258 ... description="Made up!")
259 >>> print test.format("fasta")
260 >Test Made up!
261 NACGTACGTA
262 <BLANKLINE>
263 >>> print test.format("fastq")
264 Traceback (most recent call last):
265 ...
266 ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
267
268 We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
269 or FASTQ format we need to provide some quality scores. These are held as a
270 list of integers (one for each base) in the letter_annotations dictionary:
271
272 >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40]
273 >>> print test.format("qual")
274 >Test Made up!
275 0 1 2 3 4 5 10 20 30 40
276 <BLANKLINE>
277 >>> print test.format("fastq")
278 @Test Made up!
279 NACGTACGTA
280 +
281 !"#$%&+5?I
282 <BLANKLINE>
283
284 We can check this FASTQ encoding - the first PHRED quality was zero, and this
285 mapped to a exclamation mark, while the final score was 40 and this mapped to
286 the letter "I":
287
288 >>> ord('!') - 33
289 0
290 >>> ord('I') - 33
291 40
292 >>> [ord(letter)-33 for letter in '!"#$%&+5?I']
293 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
294
295 Similarly, we could produce an Illumina 1.3+ style FASTQ file using PHRED
296 scores with an offset of 64:
297
298 >>> print test.format("fastq-illumina")
299 @Test Made up!
300 NACGTACGTA
301 +
302 @ABCDEJT^h
303 <BLANKLINE>
304
305 And we can check this too - the first PHRED score was zero, and this mapped to
306 "@", while the final score was 40 and this mapped to "h":
307
308 >>> ord("@") - 64
309 0
310 >>> ord("h") - 64
311 40
312 >>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
313 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
314
315 Notice how different the standard Sanger FASTQ and the Illumina 1.3+ style
316 FASTQ files look for the same data! Then we have the older Solexa/Illumina
317 format to consider which encodes Solexa scores instead of PHRED scores.
318
319 First let's see what Biopython says if we convert the PHRED scores into Solexa
320 scores (rounding to one decimal place):
321
322 >>> for q in [0,1,2,3,4,5,10,20,30,40]:
323 ... print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))
324 PHRED 0 maps to Solexa -5.0
325 PHRED 1 maps to Solexa -5.0
326 PHRED 2 maps to Solexa -2.3
327 PHRED 3 maps to Solexa -0.0
328 PHRED 4 maps to Solexa 1.8
329 PHRED 5 maps to Solexa 3.3
330 PHRED 10 maps to Solexa 9.5
331 PHRED 20 maps to Solexa 20.0
332 PHRED 30 maps to Solexa 30.0
333 PHRED 40 maps to Solexa 40.0
334
335 Now here is the record using the old Solexa style FASTQ file:
336
337 >>> print test.format("fastq-solexa")
338 @Test Made up!
339 NACGTACGTA
340 +
341 ;;>@BCJT^h
342 <BLANKLINE>
343
344 Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
345
346 >>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
347 [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
348
349 This explains why the last few letters of this FASTQ output matched that using
350 the Illumina 1.3+ format - high quality PHRED scores and Solexa scores are
351 approximately equal.
352
353 """
354 __docformat__ = "epytext en"
355
356
357
358 from Bio.Alphabet import single_letter_alphabet
359 from Bio.Seq import Seq, UnknownSeq
360 from Bio.SeqRecord import SeqRecord
361 from Interfaces import SequentialSequenceWriter
362 from math import log
363 import warnings
364
365
366 SANGER_SCORE_OFFSET = 33
367 SOLEXA_SCORE_OFFSET = 64
368
370 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
371
372 PHRED and Solexa quality scores are both log transformations of a
373 probality of error (high score = low probability of error). This function
374 takes a PHRED score, transforms it back to a probability of error, and
375 then re-expresses it as a Solexa score. This assumes the error estimates
376 are equivalent.
377
378 How does this work exactly? Well the PHRED quality is minus ten times the
379 base ten logarithm of the probability of error::
380
381 phred_quality = -10*log(error,10)
382
383 Therefore, turning this round::
384
385 error = 10 ** (- phred_quality / 10)
386
387 Now, Solexa qualities use a different log transformation::
388
389 solexa_quality = -10*log(error/(1-error),10)
390
391 After substitution and a little manipulation we get::
392
393 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
394
395 However, real Solexa files use a minimum quality of -5. This does have a
396 good reason - a random a random base call would be correct 25% of the time,
397 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
398 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
399 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
400
401 Taken literally, this logarithic formula would map a PHRED quality of zero
402 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
403 score of zero means a probability of error of one (i.e. the base call is
404 definitely wrong), which is worse than random! In practice, a PHRED quality
405 of zero usually means a default value, or perhaps random - and therefore
406 mapping it to the minimum Solexa score of -5 is reasonable.
407
408 In conclusion, we follow EMBOSS, and take this logarithmic formula but also
409 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
410 quality of zero to -5.0 as well.
411
412 Note this function will return a floating point number, it is up to you to
413 round this to the nearest integer if appropriate. e.g.
414
415 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
416 80.00
417 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
418 50.00
419 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
420 19.96
421 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
422 9.54
423 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
424 3.35
425 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
426 1.80
427 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
428 -0.02
429 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
430 -2.33
431 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
432 -5.00
433 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
434 -5.00
435
436 Notice that for high quality reads PHRED and Solexa scores are numerically
437 equal. The differences are important for poor quality reads, where PHRED
438 has a minimum of zero but Solexa scores can be negative.
439
440 Finally, as a special case where None is used for a "missing value", None
441 is returned:
442
443 >>> print solexa_quality_from_phred(None)
444 None
445 """
446 if phred_quality > 0:
447
448
449 return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10))
450 elif phred_quality == 0:
451
452 return -5.0
453 elif phred_quality is None:
454
455
456 return None
457 else:
458 raise ValueError("PHRED qualities must be positive (or zero), not %s" \
459 % repr(phred_quality))
460
462 """Convert a Solexa quality (which can be negative) to a PHRED quality.
463
464 PHRED and Solexa quality scores are both log transformations of a
465 probality of error (high score = low probability of error). This function
466 takes a Solexa score, transforms it back to a probability of error, and
467 then re-expresses it as a PHRED score. This assumes the error estimates
468 are equivalent.
469
470 The underlying formulas are given in the documentation for the sister
471 function solexa_quality_from_phred, in this case the operation is::
472
473 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
474
475 This will return a floating point number, it is up to you to round this to
476 the nearest integer if appropriate. e.g.
477
478 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
479 80.00
480 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
481 20.04
482 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
483 10.41
484 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
485 3.01
486 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
487 1.19
488
489 Note that a solexa_quality less then -5 is not expected, will trigger a
490 warning, but will still be converted as per the logarithmic mapping
491 (giving a number between 0 and 1.19 back).
492
493 As a special case where None is used for a "missing value", None is
494 returned:
495
496 >>> print phred_quality_from_solexa(None)
497 None
498 """
499 if solexa_quality is None:
500
501 return None
502 if solexa_quality < -5:
503 import warnings
504 warnings.warn("Solexa quality less than -5 passed, %s" \
505 % repr(solexa_quality))
506 return 10*log(10**(solexa_quality/10.0) + 1, 10)
507
509 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
510
511 If there are no PHRED qualities, but there are Solexa qualities, those are
512 used instead after conversion.
513 """
514 try:
515 return record.letter_annotations["phred_quality"]
516 except KeyError:
517 pass
518 try:
519 return [phred_quality_from_solexa(q) for \
520 q in record.letter_annotations["solexa_quality"]]
521 except KeyError:
522 raise ValueError("No suitable quality scores found in "
523 "letter_annotations of SeqRecord (id=%s)." \
524 % record.id)
525
526
527 _phred_to_sanger_quality_str = dict((qp, chr(min(126,qp+SANGER_SCORE_OFFSET))) \
528 for qp in range(0, 93+1))
529
530 _solexa_to_sanger_quality_str = dict( \
531 (qs, chr(min(126,int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \
532 for qs in range(-5, 93+1))
534 """Returns a Sanger FASTQ encoded quality string (PRIVATE).
535
536 >>> from Bio.Seq import Seq
537 >>> from Bio.SeqRecord import SeqRecord
538 >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
539 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
540 >>> _get_sanger_quality_str(r)
541 'SI?5+!'
542
543 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
544 the PHRED qualities are integers, this function is able to use a very fast
545 pre-cached mapping. However, if they are floats which differ slightly, then
546 it has to do the appropriate rounding - which is slower:
547
548 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
549 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
550 >>> _get_sanger_quality_str(r2)
551 'SI?5+!'
552
553 If your scores include a None value, this raises an exception:
554
555 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
556 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
557 >>> _get_sanger_quality_str(r3)
558 Traceback (most recent call last):
559 ...
560 TypeError: A quality value of None was found
561
562 If (strangely) your record has both PHRED and Solexa scores, then the PHRED
563 scores are used in preference:
564
565 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
566 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0],
567 ... "solexa_quality":[-5,-4,0,None,0,40]})
568 >>> _get_sanger_quality_str(r4)
569 'SI?5+!'
570
571 If there are no PHRED scores, but there are Solexa scores, these are used
572 instead (after the approriate conversion):
573
574 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
575 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
576 >>> _get_sanger_quality_str(r5)
577 'I?5+$"'
578
579 Again, integer Solexa scores can be looked up in a pre-cached mapping making
580 this very fast. You can still use approximate floating point scores:
581
582 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
583 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
584 >>> _get_sanger_quality_str(r6)
585 'I?5+$"'
586
587 Notice that due to the limited range of printable ASCII characters, a
588 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
589 file (using ASCII 126, the tilde). This function will issue a warning
590 in this situation.
591 """
592
593
594
595 try:
596
597 qualities = record.letter_annotations["phred_quality"]
598 except KeyError:
599
600 pass
601 else:
602
603 try:
604 return "".join([_phred_to_sanger_quality_str[qp] \
605 for qp in qualities])
606 except KeyError:
607
608 pass
609 if None in qualities:
610 raise TypeError("A quality value of None was found")
611 if max(qualities) >= 93.5:
612 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ")
613
614 return "".join([chr(min(126,int(round(qp))+SANGER_SCORE_OFFSET)) \
615 for qp in qualities])
616
617 try:
618 qualities = record.letter_annotations["solexa_quality"]
619 except KeyError:
620 raise ValueError("No suitable quality scores found in "
621 "letter_annotations of SeqRecord (id=%s)." \
622 % record.id)
623
624 try:
625 return "".join([_solexa_to_sanger_quality_str[qs] \
626 for qs in qualities])
627 except KeyError:
628
629 pass
630 if None in qualities:
631 raise TypeError("A quality value of None was found")
632
633
634 if max(qualities) >= 93.5:
635 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ")
636
637 return "".join([chr(min(126,int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \
638 for qs in qualities])
639
640
641 assert 62+SOLEXA_SCORE_OFFSET==126
642 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \
643 for qp in range(0, 62+1))
644
645 _solexa_to_illumina_quality_str = dict( \
646 (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
647 for qs in range(-5, 62+1))
649 """Returns an Illumina 1.3+ FASTQ encoded quality string (PRIVATE).
650
651 Notice that due to the limited range of printable ASCII characters, a
652 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
653 file (using ASCII 126, the tilde). This function will issue a warning
654 in this situation.
655 """
656
657
658
659 try:
660
661 qualities = record.letter_annotations["phred_quality"]
662 except KeyError:
663
664 pass
665 else:
666
667 try:
668 return "".join([_phred_to_illumina_quality_str[qp] \
669 for qp in qualities])
670 except KeyError:
671
672 pass
673 if None in qualities:
674 raise TypeError("A quality value of None was found")
675 if max(qualities) >= 62.5:
676 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ")
677
678 return "".join([chr(min(126,int(round(qp))+SOLEXA_SCORE_OFFSET)) \
679 for qp in qualities])
680
681 try:
682 qualities = record.letter_annotations["solexa_quality"]
683 except KeyError:
684 raise ValueError("No suitable quality scores found in "
685 "letter_annotations of SeqRecord (id=%s)." \
686 % record.id)
687
688 try:
689 return "".join([_solexa_to_illumina_quality_str[qs] \
690 for qs in qualities])
691 except KeyError:
692
693 pass
694 if None in qualities:
695 raise TypeError("A quality value of None was found")
696
697
698 if max(qualities) >= 62.5:
699 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ")
700
701 return "".join([chr(min(126,int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
702 for qs in qualities])
703
704
705 assert 62+SOLEXA_SCORE_OFFSET==126
706 _solexa_to_solexa_quality_str = dict((qs, chr(min(126,qs+SOLEXA_SCORE_OFFSET))) \
707 for qs in range(-5, 62+1))
708
709 _phred_to_solexa_quality_str = dict(\
710 (qp, chr(min(126,int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \
711 for qp in range(0, 62+1))
713 """Returns a Solexa FASTQ encoded quality string (PRIVATE).
714
715 Notice that due to the limited range of printable ASCII characters, a
716 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
717 file (using ASCII 126, the tilde). This function will issue a warning
718 in this situation.
719 """
720
721
722
723 try:
724
725 qualities = record.letter_annotations["solexa_quality"]
726 except KeyError:
727
728 pass
729 else:
730
731 try:
732 return "".join([_solexa_to_solexa_quality_str[qs] \
733 for qs in qualities])
734 except KeyError:
735
736 pass
737 if None in qualities:
738 raise TypeError("A quality value of None was found")
739 if max(qualities) >= 62.5:
740 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ")
741
742 return "".join([chr(min(126,int(round(qs))+SOLEXA_SCORE_OFFSET)) \
743 for qs in qualities])
744
745 try:
746 qualities = record.letter_annotations["phred_quality"]
747 except KeyError:
748 raise ValueError("No suitable quality scores found in "
749 "letter_annotations of SeqRecord (id=%s)." \
750 % record.id)
751
752 try:
753 return "".join([_phred_to_solexa_quality_str[qp] \
754 for qp in qualities])
755 except KeyError:
756
757
758 pass
759 if None in qualities:
760 raise TypeError("A quality value of None was found")
761
762
763 if max(qualities) >= 62.5:
764 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ")
765 return "".join([chr(min(126,
766 int(round(solexa_quality_from_phred(qp))) + \
767 SOLEXA_SCORE_OFFSET)) \
768 for qp in qualities])
769
770
772 """Iterate over Fastq records as string tuples (not as SeqRecord objects).
773
774 This code does not try to interpret the quality string numerically. It
775 just returns tuples of the title, sequence and quality as strings. For
776 the sequence and quality, any whitespace (such as new lines) is removed.
777
778 Our SeqRecord based FASTQ iterators call this function internally, and then
779 turn the strings into a SeqRecord objects, mapping the quality string into
780 a list of numerical scores. If you want to do a custom quality mapping,
781 then you might consider calling this function directly.
782
783 For parsing FASTQ files, the title string from the "@" line at the start
784 of each record can optionally be omitted on the "+" lines. If it is
785 repeated, it must be identical.
786
787 The sequence string and the quality string can optionally be split over
788 multiple lines, although several sources discourage this. In comparison,
789 for the FASTA file format line breaks between 60 and 80 characters are
790 the norm.
791
792 WARNING - Because the "@" character can appear in the quality string,
793 this can cause problems as this is also the marker for the start of
794 a new sequence. In fact, the "+" sign can also appear as well. Some
795 sources recommended having no line breaks in the quality to avoid this,
796 but even that is not enough, consider this example::
797
798 @071113_EAS56_0053:1:1:998:236
799 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
800 +071113_EAS56_0053:1:1:998:236
801 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
802 @071113_EAS56_0053:1:1:182:712
803 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
804 +
805 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
806 @071113_EAS56_0053:1:1:153:10
807 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
808 +
809 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
810 @071113_EAS56_0053:1:3:990:501
811 TGGGAGGTTTTATGTGGA
812 AAGCAGCAATGTACAAGA
813 +
814 IIIIIII.IIIIII1@44
815 @-7.%<&+/$/%4(++(%
816
817 This is four PHRED encoded FASTQ entries originally from an NCBI source
818 (given the read length of 36, these are probably Solexa Illumna reads where
819 the quality has been mapped onto the PHRED values).
820
821 This example has been edited to illustrate some of the nasty things allowed
822 in the FASTQ format. Firstly, on the "+" lines most but not all of the
823 (redundant) identifiers are ommited. In real files it is likely that all or
824 none of these extra identifiers will be present.
825
826 Secondly, while the first three sequences have been shown without line
827 breaks, the last has been split over multiple lines. In real files any line
828 breaks are likely to be consistent.
829
830 Thirdly, some of the quality string lines start with an "@" character. For
831 the second record this is unavoidable. However for the fourth sequence this
832 only happens because its quality string is split over two lines. A naive
833 parser could wrongly treat any line starting with an "@" as the beginning of
834 a new sequence! This code copes with this possible ambiguity by keeping
835 track of the length of the sequence which gives the expected length of the
836 quality string.
837
838 Using this tricky example file as input, this short bit of code demonstrates
839 what this parsing function would return:
840
841 >>> handle = open("Quality/tricky.fastq", "rU")
842 >>> for (title, sequence, quality) in FastqGeneralIterator(handle):
843 ... print title
844 ... print sequence, quality
845 071113_EAS56_0053:1:1:998:236
846 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
847 071113_EAS56_0053:1:1:182:712
848 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
849 071113_EAS56_0053:1:1:153:10
850 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
851 071113_EAS56_0053:1:3:990:501
852 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
853 >>> handle.close()
854
855 Finally we note that some sources state that the quality string should
856 start with "!" (which using the PHRED mapping means the first letter always
857 has a quality score of zero). This rather restrictive rule is not widely
858 observed, so is therefore ignored here. One plus point about this "!" rule
859 is that (provided there are no line breaks in the quality sequence) it
860 would prevent the above problem with the "@" character.
861 """
862
863
864 handle_readline = handle.readline
865
866
867 while True:
868 line = handle_readline()
869 if line == "" : return
870 if line[0] == "@":
871 break
872
873 while True:
874 if line[0]!="@":
875 raise ValueError("Records in Fastq files should start with '@' character")
876 title_line = line[1:].rstrip()
877
878
879
880 seq_string = handle_readline().rstrip()
881
882 while True:
883 line = handle_readline()
884 if not line:
885 raise ValueError("End of file without quality information.")
886 if line[0] == "+":
887
888 second_title = line[1:].rstrip()
889 if second_title and second_title != title_line:
890 raise ValueError("Sequence and quality captions differ.")
891 break
892 seq_string += line.rstrip()
893
894
895 if " " in seq_string or "\t" in seq_string:
896 raise ValueError("Whitespace is not allowed in the sequence.")
897 seq_len = len(seq_string)
898
899
900 quality_string = handle_readline().rstrip()
901
902 while True:
903 line = handle_readline()
904 if not line : break
905 if line[0] == "@":
906
907
908
909
910 if len(quality_string) >= seq_len:
911
912
913 break
914
915 quality_string += line.rstrip()
916
917 if seq_len != len(quality_string):
918 raise ValueError("Lengths of sequence and quality values differs "
919 " for %s (%i and %i)." \
920 % (title_line, seq_len, len(quality_string)))
921
922
923 yield (title_line, seq_string, quality_string)
924 if not line : return
925 assert False, "Should not reach this line"
926
927
929 """Generator function to iterate over FASTQ records (as SeqRecord objects).
930
931 - handle - input file
932 - alphabet - optional alphabet
933 - title2ids - A function that, when given the title line from the FASTQ
934 file (without the beginning >), will return the id, name and
935 description (in that order) for the record as a tuple of
936 strings. If this is not given, then the entire title line
937 will be used as the description, and the first word as the
938 id and name.
939
940 Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
941
942 For each sequence in a (Sanger style) FASTQ file there is a matching string
943 encoding the PHRED qualities (integers between 0 and about 90) using ASCII
944 values with an offset of 33.
945
946 For example, consider a file containing three short reads::
947
948 @EAS54_6_R1_2_1_413_324
949 CCCTTCTTGTCTTCAGCGTTTCTCC
950 +
951 ;;3;;;;;;;;;;;;7;;;;;;;88
952 @EAS54_6_R1_2_1_540_792
953 TTGGCAGGCCAAGGCCGATGGATCA
954 +
955 ;;;;;;;;;;;7;;;;;-;;;3;83
956 @EAS54_6_R1_2_1_443_348
957 GTTGCTTCTGGCGTGGGTGGGGGGG
958 +
959 ;;;;;;;;;;;9;7;;.7;393333
960
961 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
962 string encoding the PHRED qualities using a ASCI values with an offset of
963 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
964
965 Using this module directly you might run:
966
967 >>> handle = open("Quality/example.fastq", "rU")
968 >>> for record in FastqPhredIterator(handle):
969 ... print record.id, record.seq
970 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
971 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
972 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
973 >>> handle.close()
974
975 Typically however, you would call this via Bio.SeqIO instead with "fastq"
976 (or "fastq-sanger") as the format:
977
978 >>> from Bio import SeqIO
979 >>> handle = open("Quality/example.fastq", "rU")
980 >>> for record in SeqIO.parse(handle, "fastq"):
981 ... print record.id, record.seq
982 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
983 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
984 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
985 >>> handle.close()
986
987 If you want to look at the qualities, they are record in each record's
988 per-letter-annotation dictionary as a simple list of integers:
989
990 >>> print record.letter_annotations["phred_quality"]
991 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
992 """
993 assert SANGER_SCORE_OFFSET == ord("!")
994
995
996
997
998
999 q_mapping = dict()
1000 for letter in range(0,255):
1001 q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET
1002 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1003 if title2ids:
1004 id, name, descr = title2ids(title_line)
1005 else:
1006 descr = title_line
1007 id = descr.split()[0]
1008 name = id
1009 record = SeqRecord(Seq(seq_string, alphabet),
1010 id=id, name=name, description=descr)
1011 qualities = [q_mapping[letter] for letter in quality_string]
1012 if qualities and (min(qualities) < 0 or max(qualities) > 93):
1013 raise ValueError("Invalid character in quality string")
1014
1015
1016
1017
1018
1019 dict.__setitem__(record._per_letter_annotations,
1020 "phred_quality", qualities)
1021 yield record
1022
1023
1025 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
1026
1027 The optional arguments are the same as those for the FastqPhredIterator.
1028
1029 For each sequence in Solexa/Illumina FASTQ files there is a matching string
1030 encoding the Solexa integer qualities using ASCII values with an offset
1031 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
1032 will NOT perform any automatic conversion when loading.
1033
1034 NOTE - This file format is used by the OLD versions of the Solexa/Illumina
1035 pipeline. See also the FastqIlluminaIterator function for the NEW version.
1036
1037 For example, consider a file containing these five records::
1038
1039 @SLXA-B3_649_FC8437_R1_1_1_610_79
1040 GATGTGCAATACCTTTGTAGAGGAA
1041 +SLXA-B3_649_FC8437_R1_1_1_610_79
1042 YYYYYYYYYYYYYYYYYYWYWYYSU
1043 @SLXA-B3_649_FC8437_R1_1_1_397_389
1044 GGTTTGAGAAAGAGAAATGAGATAA
1045 +SLXA-B3_649_FC8437_R1_1_1_397_389
1046 YYYYYYYYYWYYYYWWYYYWYWYWW
1047 @SLXA-B3_649_FC8437_R1_1_1_850_123
1048 GAGGGTGTTGATCATGATGATGGCG
1049 +SLXA-B3_649_FC8437_R1_1_1_850_123
1050 YYYYYYYYYYYYYWYYWYYSYYYSY
1051 @SLXA-B3_649_FC8437_R1_1_1_362_549
1052 GGAAACAAAGTTTTTCTCAACATAG
1053 +SLXA-B3_649_FC8437_R1_1_1_362_549
1054 YYYYYYYYYYYYYYYYYYWWWWYWY
1055 @SLXA-B3_649_FC8437_R1_1_1_183_714
1056 GTATTATTTAATGGCATACACTCAA
1057 +SLXA-B3_649_FC8437_R1_1_1_183_714
1058 YYYYYYYYYYWYYYYWYWWUWWWQQ
1059
1060 Using this module directly you might run:
1061
1062 >>> handle = open("Quality/solexa_example.fastq", "rU")
1063 >>> for record in FastqSolexaIterator(handle):
1064 ... print record.id, record.seq
1065 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1066 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1067 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1068 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1069 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1070 >>> handle.close()
1071
1072 Typically however, you would call this via Bio.SeqIO instead with
1073 "fastq-solexa" as the format:
1074
1075 >>> from Bio import SeqIO
1076 >>> handle = open("Quality/solexa_example.fastq", "rU")
1077 >>> for record in SeqIO.parse(handle, "fastq-solexa"):
1078 ... print record.id, record.seq
1079 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1080 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1081 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1082 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1083 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1084 >>> handle.close()
1085
1086 If you want to look at the qualities, they are recorded in each record's
1087 per-letter-annotation dictionary as a simple list of integers:
1088
1089 >>> print record.letter_annotations["solexa_quality"]
1090 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
1091
1092 These scores aren't very good, but they are high enough that they map
1093 almost exactly onto PHRED scores:
1094
1095 >>> print "%0.2f" % phred_quality_from_solexa(25)
1096 25.01
1097
1098 Let's look at faked example read which is even worse, where there are
1099 more noticeable differences between the Solexa and PHRED scores::
1100
1101 @slxa_0001_1_0001_01
1102 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1103 +slxa_0001_1_0001_01
1104 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1105
1106 Again, you would typically use Bio.SeqIO to read this file in (rather than
1107 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
1108 contain thousands of reads, so you would normally use Bio.SeqIO.parse()
1109 as shown above. This example has only as one entry, so instead we can
1110 use the Bio.SeqIO.read() function:
1111
1112 >>> from Bio import SeqIO
1113 >>> handle = open("Quality/solexa_faked.fastq", "rU")
1114 >>> record = SeqIO.read(handle, "fastq-solexa")
1115 >>> handle.close()
1116 >>> print record.id, record.seq
1117 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1118 >>> print record.letter_annotations["solexa_quality"]
1119 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1120
1121 These quality scores are so low that when converted from the Solexa scheme
1122 into PHRED scores they look quite different:
1123
1124 >>> print "%0.2f" % phred_quality_from_solexa(-1)
1125 2.54
1126 >>> print "%0.2f" % phred_quality_from_solexa(-5)
1127 1.19
1128
1129 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
1130 method to output the record(s):
1131
1132 >>> print record.format("fastq-solexa")
1133 @slxa_0001_1_0001_01
1134 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1135 +
1136 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1137 <BLANKLINE>
1138
1139 Note this output is slightly different from the input file as Biopython
1140 has left out the optional repetition of the sequence identifier on the "+"
1141 line. If you want the to use PHRED scores, use "fastq" or "qual" as the
1142 output format instead, and Biopython will do the conversion for you:
1143
1144 >>> print record.format("fastq")
1145 @slxa_0001_1_0001_01
1146 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1147 +
1148 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
1149 <BLANKLINE>
1150
1151 >>> print record.format("qual")
1152 >slxa_0001_1_0001_01
1153 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
1154 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1155 1 1
1156 <BLANKLINE>
1157
1158 As shown above, the poor quality Solexa reads have been mapped to the
1159 equivalent PHRED score (e.g. -5 to 1 as shown earlier).
1160 """
1161 q_mapping = dict()
1162 for letter in range(0,255):
1163 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
1164 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1165 if title2ids:
1166 id, name, descr = title_line
1167 else:
1168 descr = title_line
1169 id = descr.split()[0]
1170 name = id
1171 record = SeqRecord(Seq(seq_string, alphabet),
1172 id=id, name=name, description=descr)
1173 qualities = [q_mapping[letter] for letter in quality_string]
1174
1175 if qualities and (min(qualities) < -5 or max(qualities)>62):
1176 raise ValueError("Invalid character in quality string")
1177
1178
1179 dict.__setitem__(record._per_letter_annotations,
1180 "solexa_quality", qualities)
1181 yield record
1182
1183
1185 """Parse new Illumina 1.3+ FASTQ like files (which differ in the quality mapping).
1186
1187 The optional arguments are the same as those for the FastqPhredIterator.
1188
1189 For each sequence in Illumina 1.3+ FASTQ files there is a matching string
1190 encoding PHRED integer qualities using ASCII values with an offset of 64.
1191
1192 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
1193 with an ASCII offset of 64. They are approximately equal but only for high
1194 qaulity reads. If you have an old Solexa/Illumina file with negative
1195 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
1196
1197 >>> from Bio import SeqIO
1198 >>> record = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-solexa")
1199 >>> print record.id, record.seq
1200 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1201 >>> print record.letter_annotations["solexa_quality"]
1202 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1203 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
1204 Traceback (most recent call last):
1205 ...
1206 ValueError: Invalid character in quality string
1207
1208 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
1209 """
1210 q_mapping = dict()
1211 for letter in range(0,255):
1212 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
1213 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1214 if title2ids:
1215 id, name, descr = title2ids(title_line)
1216 else:
1217 descr = title_line
1218 id = descr.split()[0]
1219 name = id
1220 record = SeqRecord(Seq(seq_string, alphabet),
1221 id=id, name=name, description=descr)
1222 qualities = [q_mapping[letter] for letter in quality_string]
1223 if qualities and (min(qualities) < 0 or max(qualities) > 62):
1224 raise ValueError("Invalid character in quality string")
1225
1226
1227 dict.__setitem__(record._per_letter_annotations,
1228 "phred_quality", qualities)
1229 yield record
1230
1232 """For QUAL files which include PHRED quality scores, but no sequence.
1233
1234 For example, consider this short QUAL file::
1235
1236 >EAS54_6_R1_2_1_413_324
1237 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1238 26 26 26 23 23
1239 >EAS54_6_R1_2_1_540_792
1240 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1241 26 18 26 23 18
1242 >EAS54_6_R1_2_1_443_348
1243 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1244 24 18 18 18 18
1245
1246 Using this module directly you might run:
1247
1248 >>> handle = open("Quality/example.qual", "rU")
1249 >>> for record in QualPhredIterator(handle):
1250 ... print record.id, record.seq
1251 EAS54_6_R1_2_1_413_324 ?????????????????????????
1252 EAS54_6_R1_2_1_540_792 ?????????????????????????
1253 EAS54_6_R1_2_1_443_348 ?????????????????????????
1254 >>> handle.close()
1255
1256 Typically however, you would call this via Bio.SeqIO instead with "qual"
1257 as the format:
1258
1259 >>> from Bio import SeqIO
1260 >>> handle = open("Quality/example.qual", "rU")
1261 >>> for record in SeqIO.parse(handle, "qual"):
1262 ... print record.id, record.seq
1263 EAS54_6_R1_2_1_413_324 ?????????????????????????
1264 EAS54_6_R1_2_1_540_792 ?????????????????????????
1265 EAS54_6_R1_2_1_443_348 ?????????????????????????
1266 >>> handle.close()
1267
1268 Becase QUAL files don't contain the sequence string itself, the seq
1269 property is set to an UnknownSeq object. As no alphabet was given, this
1270 has defaulted to a generic single letter alphabet and the character "?"
1271 used.
1272
1273 By specifying a nucleotide alphabet, "N" is used instead:
1274
1275 >>> from Bio import SeqIO
1276 >>> from Bio.Alphabet import generic_dna
1277 >>> handle = open("Quality/example.qual", "rU")
1278 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
1279 ... print record.id, record.seq
1280 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
1281 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
1282 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
1283 >>> handle.close()
1284
1285 However, the quality scores themselves are available as a list of integers
1286 in each record's per-letter-annotation:
1287
1288 >>> print record.letter_annotations["phred_quality"]
1289 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1290
1291 You can still slice one of these SeqRecord objects with an UnknownSeq:
1292
1293 >>> sub_record = record[5:10]
1294 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
1295 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
1296 """
1297
1298 while True:
1299 line = handle.readline()
1300 if line == "" : return
1301 if line[0] == ">":
1302 break
1303
1304 while True:
1305 if line[0]!=">":
1306 raise ValueError("Records in Fasta files should start with '>' character")
1307 if title2ids:
1308 id, name, descr = title2ids(line[1:].rstrip())
1309 else:
1310 descr = line[1:].rstrip()
1311 id = descr.split()[0]
1312 name = id
1313
1314 qualities = []
1315 line = handle.readline()
1316 while True:
1317 if not line : break
1318 if line[0] == ">": break
1319 qualities.extend([int(word) for word in line.split()])
1320 line = handle.readline()
1321
1322 if qualities and min(qualities) < 0:
1323 raise ValueError(("Negative quality score %i found in %s. " + \
1324 "Are these Solexa scores, not PHRED scores?") \
1325 % (min(qualities), id))
1326
1327
1328 record = SeqRecord(UnknownSeq(len(qualities), alphabet),
1329 id = id, name = name, description = descr)
1330
1331
1332 dict.__setitem__(record._per_letter_annotations,
1333 "phred_quality", qualities)
1334 yield record
1335
1336 if not line : return
1337 assert False, "Should not reach this line"
1338
1340 """Class to write standard FASTQ format files (using PHRED quality scores).
1341
1342 Although you can use this class directly, you are strongly encouraged
1343 to use the Bio.SeqIO.write() function instead via the format name "fastq"
1344 or the alias "fastq-sanger". For example, this code reads in a standard
1345 Sanger style FASTQ file (using PHRED scores) and re-saves it as another
1346 Sanger style FASTQ file:
1347
1348 >>> from Bio import SeqIO
1349 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1350 >>> out_handle = open("Quality/temp.fastq", "w")
1351 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1352 3
1353 >>> out_handle.close()
1354
1355 You might want to do this if the original file included extra line breaks,
1356 which while valid may not be supported by all tools. The output file from
1357 Biopython will have each sequence on a single line, and each quality
1358 string on a single line (which is considered desirable for maximum
1359 compatibility).
1360
1361 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
1362 quality scores) is converted into a standard Sanger style FASTQ file using
1363 PHRED qualities:
1364
1365 >>> from Bio import SeqIO
1366 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1367 >>> out_handle = open("Quality/temp.fastq", "w")
1368 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1369 5
1370 >>> out_handle.close()
1371
1372 This code is also called if you use the .format("fastq") method of a
1373 SeqRecord, or .format("fastq-sanger") if you prefer that alias.
1374
1375 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
1376 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1377 warning is issued.
1378
1379 P.S. To avoid cluttering up your working directory, you can delete this
1380 temporary file now:
1381
1382 >>> import os
1383 >>> os.remove("Quality/temp.fastq")
1384 """
1385 assert SANGER_SCORE_OFFSET == ord("!")
1386
1414
1416 """Class to write QUAL format files (using PHRED quality scores).
1417
1418 Although you can use this class directly, you are strongly encouraged
1419 to use the Bio.SeqIO.write() function instead. For example, this code
1420 reads in a FASTQ file and saves the quality scores into a QUAL file:
1421
1422 >>> from Bio import SeqIO
1423 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1424 >>> out_handle = open("Quality/temp.qual", "w")
1425 >>> SeqIO.write(record_iterator, out_handle, "qual")
1426 3
1427 >>> out_handle.close()
1428
1429 This code is also called if you use the .format("qual") method of a
1430 SeqRecord.
1431
1432 P.S. Don't forget to clean up the temp file if you don't need it anymore:
1433
1434 >>> import os
1435 >>> os.remove("Quality/temp.qual")
1436 """
1437 - def __init__(self, handle, wrap=60, record2title=None):
1438 """Create a QUAL writer.
1439
1440 Arguments:
1441 - handle - Handle to an output file, e.g. as returned
1442 by open(filename, "w")
1443 - wrap - Optional line length used to wrap sequence lines.
1444 Defaults to wrapping the sequence at 60 characters
1445 Use zero (or None) for no wrapping, giving a single
1446 long line for the sequence.
1447 - record2title - Optional function to return the text to be
1448 used for the title line of each record. By default
1449 a combination of the record.id and record.description
1450 is used. If the record.description starts with the
1451 record.id, then just the record.description is used.
1452
1453 The record2title argument is present for consistency with the
1454 Bio.SeqIO.FastaIO writer class.
1455 """
1456 SequentialSequenceWriter.__init__(self, handle)
1457
1458 self.wrap = None
1459 if wrap:
1460 if wrap < 1:
1461 raise ValueError
1462 self.wrap = wrap
1463 self.record2title = record2title
1464
1466 """Write a single QUAL record to the file."""
1467 assert self._header_written
1468 assert not self._footer_written
1469 self._record_written = True
1470
1471 if self.record2title:
1472 title=self.clean(self.record2title(record))
1473 else:
1474 id = self.clean(record.id)
1475 description = self.clean(record.description)
1476 if description and description.split(None,1)[0]==id:
1477
1478 title = description
1479 elif description:
1480 title = "%s %s" % (id, description)
1481 else:
1482 title = id
1483 self.handle.write(">%s\n" % title)
1484
1485 qualities = _get_phred_quality(record)
1486 try:
1487
1488
1489 qualities_strs = [("%i" % round(q,0)) for q in qualities]
1490 except TypeError, e:
1491 if None in qualities:
1492 raise TypeError("A quality value of None was found")
1493 else:
1494 raise e
1495
1496 if self.wrap:
1497 while qualities_strs:
1498 line=qualities_strs.pop(0)
1499 while qualities_strs \
1500 and len(line) + 1 + len(qualities_strs[0]) < self.wrap:
1501 line += " " + qualities_strs.pop(0)
1502 self.handle.write(line + "\n")
1503 else:
1504 data = " ".join(qualities_strs)
1505 self.handle.write(data + "\n")
1506
1508 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
1509
1510 This outputs FASTQ files like those from the early Solexa/Illumina
1511 pipeline, using Solexa scores and an ASCII offset of 64. These are
1512 NOT compatible with the standard Sanger style PHRED FASTQ files.
1513
1514 If your records contain a "solexa_quality" entry under letter_annotations,
1515 this is used, otherwise any "phred_quality" entry will be used after
1516 conversion using the solexa_quality_from_phred function. If neither style
1517 of quality scores are present, an exception is raised.
1518
1519 Although you can use this class directly, you are strongly encouraged
1520 to use the Bio.SeqIO.write() function instead. For example, this code
1521 reads in a FASTQ file and re-saves it as another FASTQ file:
1522
1523 >>> from Bio import SeqIO
1524 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1525 >>> out_handle = open("Quality/temp.fastq", "w")
1526 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
1527 5
1528 >>> out_handle.close()
1529
1530 You might want to do this if the original file included extra line breaks,
1531 which (while valid) may not be supported by all tools. The output file
1532 from Biopython will have each sequence on a single line, and each quality
1533 string on a single line (which is considered desirable for maximum
1534 compatibility).
1535
1536 This code is also called if you use the .format("fastq-solexa") method of
1537 a SeqRecord. For example,
1538
1539 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1540 >>> print record.format("fastq-solexa")
1541 @Test PHRED qualities from 40 to 0 inclusive
1542 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1543 +
1544 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
1545 <BLANKLINE>
1546
1547 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
1548 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,
1549 a warning is issued.
1550
1551 P.S. Don't forget to delete the temp file if you don't need it anymore:
1552
1553 >>> import os
1554 >>> os.remove("Quality/temp.fastq")
1555 """
1584
1586 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
1587
1588 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
1589 using PHRED scores and an ASCII offset of 64. Note these files are NOT
1590 compatible with the standard Sanger style PHRED FASTQ files which use an
1591 ASCII offset of 32.
1592
1593 Although you can use this class directly, you are strongly encouraged to
1594 use the Bio.SeqIO.write() function with format name "fastq-illumina"
1595 instead. This code is also called if you use the .format("fastq-illumina")
1596 method of a SeqRecord. For example,
1597
1598 >>> from Bio import SeqIO
1599 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1600 >>> print record.format("fastq-illumina")
1601 @Test PHRED qualities from 40 to 0 inclusive
1602 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1603 +
1604 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
1605 <BLANKLINE>
1606
1607 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
1608 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1609 warning is issued.
1610 """
1639
1641 """Iterate over matched FASTA and QUAL files as SeqRecord objects.
1642
1643 For example, consider this short QUAL file with PHRED quality scores::
1644
1645 >EAS54_6_R1_2_1_413_324
1646 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1647 26 26 26 23 23
1648 >EAS54_6_R1_2_1_540_792
1649 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1650 26 18 26 23 18
1651 >EAS54_6_R1_2_1_443_348
1652 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1653 24 18 18 18 18
1654
1655 And a matching FASTA file::
1656
1657 >EAS54_6_R1_2_1_413_324
1658 CCCTTCTTGTCTTCAGCGTTTCTCC
1659 >EAS54_6_R1_2_1_540_792
1660 TTGGCAGGCCAAGGCCGATGGATCA
1661 >EAS54_6_R1_2_1_443_348
1662 GTTGCTTCTGGCGTGGGTGGGGGGG
1663
1664 You can parse these separately using Bio.SeqIO with the "qual" and
1665 "fasta" formats, but then you'll get a group of SeqRecord objects with
1666 no sequence, and a matching group with the sequence but not the
1667 qualities. Because it only deals with one input file handle, Bio.SeqIO
1668 can't be used to read the two files together - but this function can!
1669 For example,
1670
1671 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1672 ... open("Quality/example.qual", "rU"))
1673 >>> for record in rec_iter:
1674 ... print record.id, record.seq
1675 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1676 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1677 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1678
1679 As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1680 they are in each record's per-letter-annotation dictionary as a simple
1681 list of integers:
1682
1683 >>> print record.letter_annotations["phred_quality"]
1684 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1685
1686 If you have access to data as a FASTQ format file, using that directly
1687 would be simpler and more straight forward. Note that you can easily use
1688 this function to convert paired FASTA and QUAL files into FASTQ files:
1689
1690 >>> from Bio import SeqIO
1691 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1692 ... open("Quality/example.qual", "rU"))
1693 >>> out_handle = open("Quality/temp.fastq", "w")
1694 >>> SeqIO.write(rec_iter, out_handle, "fastq")
1695 3
1696 >>> out_handle.close()
1697
1698 And don't forget to clean up the temp file if you don't need it anymore:
1699
1700 >>> import os
1701 >>> os.remove("Quality/temp.fastq")
1702 """
1703 from Bio.SeqIO.FastaIO import FastaIterator
1704 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
1705 title2ids=title2ids)
1706 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
1707 title2ids=title2ids)
1708
1709
1710
1711 while True:
1712 try:
1713 f_rec = fasta_iter.next()
1714 except StopIteration:
1715 f_rec = None
1716 try:
1717 q_rec = qual_iter.next()
1718 except StopIteration:
1719 q_rec = None
1720 if f_rec is None and q_rec is None:
1721
1722 break
1723 if f_rec is None:
1724 raise ValueError("FASTA file has more entries than the QUAL file.")
1725 if q_rec is None:
1726 raise ValueError("QUAL file has more entries than the FASTA file.")
1727 if f_rec.id != q_rec.id:
1728 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
1729 % (f_rec.id, q_rec.id))
1730 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
1731 raise ValueError("Sequence length and number of quality scores disagree for %s" \
1732 % f_rec.id)
1733
1734 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
1735 yield f_rec
1736
1737
1738
1740 """Run the Bio.SeqIO module's doctests.
1741
1742 This will try and locate the unit tests directory, and run the doctests
1743 from there in order that the relative paths used in the examples work.
1744 """
1745 import doctest
1746 import os
1747 if os.path.isdir(os.path.join("..","..","Tests")):
1748 print "Runing doctests..."
1749 cur_dir = os.path.abspath(os.curdir)
1750 os.chdir(os.path.join("..","..","Tests"))
1751 assert os.path.isfile("Quality/example.fastq")
1752 assert os.path.isfile("Quality/example.fasta")
1753 assert os.path.isfile("Quality/example.qual")
1754 assert os.path.isfile("Quality/tricky.fastq")
1755 assert os.path.isfile("Quality/solexa_faked.fastq")
1756 doctest.testmod(verbose=0)
1757 os.chdir(cur_dir)
1758 del cur_dir
1759 print "Done"
1760
1761 if __name__ == "__main__":
1762 _test()
1763