1
2
3
4
5
6
7
8
9 """Sequence input/output as SeqRecord objects.
10
11 Bio.SeqIO is also documented at U{http://biopython.org/wiki/SeqIO} and by
12 a whole chapter in our tutorial:
13 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
14 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
15
16 Input
17 =====
18 The main function is Bio.SeqIO.parse(...) which takes an input file handle,
19 and format string. This returns an iterator giving SeqRecord objects:
20
21 >>> from Bio import SeqIO
22 >>> handle = open("Fasta/f002", "rU")
23 >>> for record in SeqIO.parse(handle, "fasta"):
24 ... print record.id, len(record)
25 gi|1348912|gb|G26680|G26680 633
26 gi|1348917|gb|G26685|G26685 413
27 gi|1592936|gb|G29385|G29385 471
28 >>> handle.close()
29
30 Note that the parse() function will all invoke the relevant parser for the
31 format with its default settings. You may want more control, in which case
32 you need to create a format specific sequence iterator directly.
33
34 Input - Single Records
35 ======================
36 If you expect your file to contain one-and-only-one record, then we provide
37 the following 'helper' function which will return a single SeqRecord, or
38 raise an exception if there are no records or more than one record:
39
40 >>> from Bio import SeqIO
41 >>> handle = open("Fasta/f001", "rU")
42 >>> record = SeqIO.read(handle, "fasta")
43 >>> handle.close()
44 >>> print record.id, len(record)
45 gi|3318709|pdb|1A91| 79
46
47 This style is useful when you expect a single record only (and would
48 consider multiple records an error). For example, when dealing with GenBank
49 files for bacterial genomes or chromosomes, there is normally only a single
50 record. Alternatively, use this with a handle when download a single record
51 from the internet.
52
53 However, if you just want the first record from a file containing multiple
54 record, use the iterator's next() method:
55
56 >>> from Bio import SeqIO
57 >>> handle = open("Fasta/f002", "rU")
58 >>> record = SeqIO.parse(handle, "fasta").next()
59 >>> handle.close()
60 >>> print record.id, len(record)
61 gi|1348912|gb|G26680|G26680 633
62
63 The above code will work as long as the file contains at least one record.
64 Note that if there is more than one record, the remaining records will be
65 silently ignored.
66
67
68 Input - Multiple Records
69 ========================
70 For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records
71 using a sequence iterator can save you a lot of memory (RAM). There is
72 less benefit for interlaced file formats (e.g. most multiple alignment file
73 formats). However, an iterator only lets you access the records one by one.
74
75 If you want random access to the records by number, turn this into a list:
76
77 >>> from Bio import SeqIO
78 >>> handle = open("Fasta/f002", "rU")
79 >>> records = list(SeqIO.parse(handle, "fasta"))
80 >>> handle.close()
81 >>> print records[1].id
82 gi|1348917|gb|G26685|G26685
83
84 If you want random access to the records by a key such as the record id,
85 turn the iterator into a dictionary:
86
87 >>> from Bio import SeqIO
88 >>> handle = open("Fasta/f002", "rU")
89 >>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
90 >>> handle.close()
91 >>> len(record_dict)
92 3
93 >>> print len(record_dict["gi|1348917|gb|G26685|G26685"])
94 413
95
96 However, using list() or the to_dict() function will load all the records
97 into memory at once, and therefore is not possible on very large files.
98 Instead, for *some* file formats Bio.SeqIO provides an indexing approach
99 providing dictionary like access to any record. For example,
100
101 >>> from Bio import SeqIO
102 >>> record_dict = SeqIO.index("Fasta/f002", "fasta")
103 >>> len(record_dict)
104 3
105 >>> print len(record_dict["gi|1348917|gb|G26685|G26685"])
106 413
107
108 Many but not all of the supported input file formats can be indexed like
109 this. For example "ace", "embl", "fasta", "fastq", "genbank", "ig", "phd",
110 "pir", "tab" and "qual" work, but alignment formats like "phylip", "clustalw"
111 and "nexus" will not.
112
113 Input - Alignments
114 ==================
115 You can read in alignment files as Alignment objects using Bio.AlignIO.
116 Alternatively, reading in an alignment file format via Bio.SeqIO will give
117 you a SeqRecord for each row of each alignment:
118
119 >>> from Bio import SeqIO
120 >>> handle = open("Clustalw/hedgehog.aln", "rU")
121 >>> for record in SeqIO.parse(handle, "clustal"):
122 ... print record.id, len(record)
123 gi|167877390|gb|EDS40773.1| 447
124 gi|167234445|ref|NP_001107837. 447
125 gi|74100009|gb|AAZ99217.1| 447
126 gi|13990994|dbj|BAA33523.2| 447
127 gi|56122354|gb|AAV74328.1| 447
128 >>> handle.close()
129
130 Output
131 ======
132 Use the function Bio.SeqIO.write(...), which takes a complete set of
133 SeqRecord objects (either as a list, or an iterator), an output file handle
134 and of course the file format::
135
136 from Bio import SeqIO
137 records = ...
138 handle = open("example.faa", "w")
139 SeqIO.write(records, handle, "fasta")
140 handle.close()
141
142 In general, you are expected to call this function once (with all your
143 records) and then close the file handle.
144
145 Output - Advanced
146 =================
147 The effect of calling write() multiple times on a single file will vary
148 depending on the file format, and is best avoided unless you have a strong
149 reason to do so.
150
151 Trying this for certain alignment formats (e.g. phylip, clustal, stockholm)
152 would have the effect of concatenating several multiple sequence alignments
153 together. Such files are created by the PHYLIP suite of programs for
154 bootstrap analysis.
155
156 For sequential files formats (e.g. fasta, genbank) each "record block" holds
157 a single sequence. For these files it would probably be safe to call
158 write() multiple times.
159
160 Conversion
161 ==========
162 The Bio.SeqIO.convert(...) function allows an easy interface for simple
163 file format conversions. Additionally, it may use file format specific
164 optimisations so this should be the fastest way too.
165
166 In general however, you can combine the Bio.SeqIO.parse(...) function with the
167 Bio.SeqIO.write(...) function for sequence file conversion. Using generator
168 expressions provides a memory efficient way to perform filtering or other
169 extra operations as part of the process.
170
171 File Formats
172 ============
173 When specifying the file format, use lowercase strings. The same format
174 names are also used in Bio.AlignIO and include the following:
175
176 - ace - Reads the contig sequences from an ACE assembly file.
177 - embl - The EMBL flat file format. Uses Bio.GenBank internally.
178 - fasta - The generic sequence file format where each record starts with
179 an identifer line starting with a ">" character, followed by
180 lines of sequence.
181 - fastq - A "FASTA like" format used by Sanger which also stores PHRED
182 sequence quality values (with an ASCII offset of 33).
183 - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS
184 - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which
185 encodes Solexa quality scores (not PHRED quality scores) with an
186 ASCII offset of 64.
187 - fastq-illumina - Solexa/Illumnia 1.3+ variant of the FASTQ format which
188 encodes PHRED quality scores with an ASCII offset of 64 (not 33).
189 - genbank - The GenBank or GenPept flat file format.
190 - gb - An alias for "genbank", for consistency with NCBI Entrez Utilities
191 - ig - The IntelliGenetics file format, apparently the same as the
192 MASE alignment format.
193 - phd - Output from PHRED, used by PHRAP and CONSED for input.
194 - pir - A "FASTA like" format introduced by the National Biomedical
195 Research Foundation (NBRF) for the Protein Information Resource
196 (PIR) database, now part of UniProt.
197 - swiss - Plain text Swiss-Prot aka UniProt format.
198 - tab - Simple two column tab separated sequence files, where each
199 line holds a record's identifier and sequence. For example,
200 this is used as by Aligent's eArray software when saving
201 microarray probes in a minimal tab delimited text file.
202 - qual - A "FASTA like" format holding PHRED quality values from
203 sequencing DNA, but no actual sequences (usually provided
204 in separate FASTA files).
205
206 Note that while Bio.SeqIO can read all the above file formats, it cannot
207 write to all of them.
208
209 You can also use any file format supported by Bio.AlignIO, such as "nexus",
210 "phlip" and "stockholm", which gives you access to the individual sequences
211 making up each alignment as SeqRecords.
212 """
213 __docformat__ = "epytext en"
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229 """
230 FAO BioPython Developers
231 ========================
232 The way I envision this SeqIO system working as that for any sequence file
233 format we have an iterator that returns SeqRecord objects.
234
235 This also applies to interlaced fileformats (like clustal - although that
236 is now handled via Bio.AlignIO instead) where the file cannot be read record
237 by record. You should still return an iterator, even if the implementation
238 could just as easily return a list.
239
240 These file format specific sequence iterators may be implemented as:
241 * Classes which take a handle for __init__ and provide the __iter__ method
242 * Functions that take a handle, and return an iterator object
243 * Generator functions that take a handle, and yield SeqRecord objects
244
245 It is then trivial to turn this iterator into a list of SeqRecord objects,
246 an in memory dictionary, or a multiple sequence alignment object.
247
248 For building the dictionary by default the id propery of each SeqRecord is
249 used as the key. You should always populate the id property, and it should
250 be unique in most cases. For some file formats the accession number is a good
251 choice. If the file itself contains ambiguous identifiers, don't try and
252 dis-ambiguate them - return them as is.
253
254 When adding a new file format, please use the same lower case format name
255 as BioPerl, or if they have not defined one, try the names used by EMBOSS.
256
257 See also http://biopython.org/wiki/SeqIO_dev
258
259 --Peter
260 """
261
262 import os
263 from Bio.Seq import Seq
264 from Bio.SeqRecord import SeqRecord
265 from Bio.Align.Generic import Alignment
266 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
267
268 import AceIO
269 import FastaIO
270 import IgIO
271 import InsdcIO
272 import PhdIO
273 import PirIO
274 import SwissIO
275 import TabIO
276 import QualityIO
277
278
279
280
281
282
283
284
285
286
287
288 _FormatToIterator ={"fasta" : FastaIO.FastaIterator,
289 "gb" : InsdcIO.GenBankIterator,
290 "genbank" : InsdcIO.GenBankIterator,
291 "genbank-cds" : InsdcIO.GenBankCdsFeatureIterator,
292 "embl" : InsdcIO.EmblIterator,
293 "embl-cds" : InsdcIO.EmblCdsFeatureIterator,
294 "ig" : IgIO.IgIterator,
295 "swiss" : SwissIO.SwissIterator,
296 "phd" : PhdIO.PhdIterator,
297 "ace" : AceIO.AceIterator,
298 "tab" : TabIO.TabIterator,
299 "pir" : PirIO.PirIterator,
300 "fastq" : QualityIO.FastqPhredIterator,
301 "fastq-sanger" : QualityIO.FastqPhredIterator,
302 "fastq-solexa" : QualityIO.FastqSolexaIterator,
303 "fastq-illumina" : QualityIO.FastqIlluminaIterator,
304 "qual" : QualityIO.QualPhredIterator,
305 }
306
307 _FormatToWriter ={"fasta" : FastaIO.FastaWriter,
308 "gb" : InsdcIO.GenBankWriter,
309 "genbank" : InsdcIO.GenBankWriter,
310 "tab" : TabIO.TabWriter,
311 "fastq" : QualityIO.FastqPhredWriter,
312 "fastq-sanger" : QualityIO.FastqPhredWriter,
313 "fastq-solexa" : QualityIO.FastqSolexaWriter,
314 "fastq-illumina" : QualityIO.FastqIlluminaWriter,
315 "phd" : PhdIO.PhdWriter,
316 "qual" : QualityIO.QualPhredWriter,
317 }
318
319 -def write(sequences, handle, format):
320 """Write complete set of sequences to a file.
321
322 - sequences - A list (or iterator) of SeqRecord objects.
323 - handle - File handle object to write to.
324 - format - lower case string describing the file format to write.
325
326 You should close the handle after calling this function.
327
328 Returns the number of records written (as an integer).
329 """
330 from Bio import AlignIO
331
332
333 if isinstance(handle, basestring):
334 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
335 if not isinstance(format, basestring):
336 raise TypeError("Need a string for the file format (lower case)")
337 if not format:
338 raise ValueError("Format required (lower case string)")
339 if format != format.lower():
340 raise ValueError("Format string '%s' should be lower case" % format)
341 if isinstance(sequences,SeqRecord):
342 raise ValueError("Use a SeqRecord list/iterator, not just a single SeqRecord")
343
344
345 if format in _FormatToWriter:
346 writer_class = _FormatToWriter[format]
347 count = writer_class(handle).write_file(sequences)
348 elif format in AlignIO._FormatToWriter:
349
350
351 alignment = to_alignment(sequences)
352 alignment_count = AlignIO.write([alignment], handle, format)
353 assert alignment_count == 1, "Internal error - the underlying writer " \
354 + " should have returned 1, not %s" % repr(alignment_count)
355 count = len(alignment.get_all_seqs())
356 del alignment_count, alignment
357 elif format in _FormatToIterator or format in AlignIO._FormatToIterator:
358 raise ValueError("Reading format '%s' is supported, but not writing" \
359 % format)
360 else:
361 raise ValueError("Unknown format '%s'" % format)
362
363 assert isinstance(count, int), "Internal error - the underlying %s " \
364 "writer should have returned the record count, not %s" \
365 % (format, repr(count))
366 return count
367
368 -def parse(handle, format, alphabet=None):
369 r"""Turns a sequence file into an iterator returning SeqRecords.
370
371 - handle - handle to the file.
372 - format - lower case string describing the file format.
373 - alphabet - optional Alphabet object, useful when the sequence type
374 cannot be automatically inferred from the file itself
375 (e.g. format="fasta" or "tab")
376
377 Typical usage, opening a file to read in, and looping over the record(s):
378
379 >>> from Bio import SeqIO
380 >>> filename = "Fasta/sweetpea.nu"
381 >>> for record in SeqIO.parse(open(filename,"rU"), "fasta"):
382 ... print "ID", record.id
383 ... print "Sequence length", len(record)
384 ... print "Sequence alphabet", record.seq.alphabet
385 ID gi|3176602|gb|U78617.1|LOU78617
386 Sequence length 309
387 Sequence alphabet SingleLetterAlphabet()
388
389 For file formats like FASTA where the alphabet cannot be determined, it
390 may be useful to specify the alphabet explicitly:
391
392 >>> from Bio import SeqIO
393 >>> from Bio.Alphabet import generic_dna
394 >>> filename = "Fasta/sweetpea.nu"
395 >>> for record in SeqIO.parse(open(filename,"rU"), "fasta", generic_dna):
396 ... print "ID", record.id
397 ... print "Sequence length", len(record)
398 ... print "Sequence alphabet", record.seq.alphabet
399 ID gi|3176602|gb|U78617.1|LOU78617
400 Sequence length 309
401 Sequence alphabet DNAAlphabet()
402
403 If you have a string 'data' containing the file contents, you must
404 first turn this into a handle in order to parse it:
405
406 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n"
407 >>> from Bio import SeqIO
408 >>> from StringIO import StringIO
409 >>> for record in SeqIO.parse(StringIO(data), "fasta"):
410 ... print record.id, record.seq
411 Alpha ACCGGATGTA
412 Beta AGGCTCGGTTA
413
414 Use the Bio.SeqIO.read(handle, format) function when you expect a single
415 record only.
416 """
417
418
419
420 from Bio import AlignIO
421
422
423 if isinstance(handle, basestring):
424 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
425 if not isinstance(format, basestring):
426 raise TypeError("Need a string for the file format (lower case)")
427 if not format:
428 raise ValueError("Format required (lower case string)")
429 if format != format.lower():
430 raise ValueError("Format string '%s' should be lower case" % format)
431 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
432 isinstance(alphabet, AlphabetEncoder)):
433 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
434
435
436 if format in _FormatToIterator:
437 iterator_generator = _FormatToIterator[format]
438 if alphabet is None:
439 return iterator_generator(handle)
440 try:
441 return iterator_generator(handle, alphabet=alphabet)
442 except:
443 return _force_alphabet(iterator_generator(handle), alphabet)
444 elif format in AlignIO._FormatToIterator:
445
446
447
448 return _iterate_via_AlignIO(handle, format, alphabet)
449 else:
450 raise ValueError("Unknown format '%s'" % format)
451
452
459
473
474 -def read(handle, format, alphabet=None):
475 """Turns a sequence file into a single SeqRecord.
476
477 - handle - handle to the file.
478 - format - string describing the file format.
479 - alphabet - optional Alphabet object, useful when the sequence type
480 cannot be automatically inferred from the file itself
481 (e.g. format="fasta" or "tab")
482
483 This function is for use parsing sequence files containing
484 exactly one record. For example, reading a GenBank file:
485
486 >>> from Bio import SeqIO
487 >>> record = SeqIO.read(open("GenBank/arab1.gb", "rU"), "genbank")
488 >>> print "ID", record.id
489 ID AC007323.5
490 >>> print "Sequence length", len(record)
491 Sequence length 86436
492 >>> print "Sequence alphabet", record.seq.alphabet
493 Sequence alphabet IUPACAmbiguousDNA()
494
495 If the handle contains no records, or more than one record,
496 an exception is raised. For example:
497
498 >>> from Bio import SeqIO
499 >>> record = SeqIO.read(open("GenBank/cor6_6.gb", "rU"), "genbank")
500 Traceback (most recent call last):
501 ...
502 ValueError: More than one record found in handle
503
504 If however you want the first record from a file containing
505 multiple records this function would raise an exception (as
506 shown in the example above). Instead use:
507
508 >>> from Bio import SeqIO
509 >>> record = SeqIO.parse(open("GenBank/cor6_6.gb", "rU"), "genbank").next()
510 >>> print "First record's ID", record.id
511 First record's ID X55053.1
512
513 Use the Bio.SeqIO.parse(handle, format) function if you want
514 to read multiple records from the handle.
515 """
516 iterator = parse(handle, format, alphabet)
517 try:
518 first = iterator.next()
519 except StopIteration:
520 first = None
521 if first is None:
522 raise ValueError("No records found in handle")
523 try:
524 second = iterator.next()
525 except StopIteration:
526 second = None
527 if second is not None:
528 raise ValueError("More than one record found in handle")
529 return first
530
531 -def to_dict(sequences, key_function=None):
532 """Turns a sequence iterator or list into a dictionary.
533
534 - sequences - An iterator that returns SeqRecord objects,
535 or simply a list of SeqRecord objects.
536 - key_function - Optional callback function which when given a
537 SeqRecord should return a unique key for the dictionary.
538
539 e.g. key_function = lambda rec : rec.name
540 or, key_function = lambda rec : rec.description.split()[0]
541
542 If key_function is ommitted then record.id is used, on the assumption
543 that the records objects returned are SeqRecords with a unique id.
544
545 If there are duplicate keys, an error is raised.
546
547 Example usage, defaulting to using the record.id as key:
548
549 >>> from Bio import SeqIO
550 >>> handle = open("GenBank/cor6_6.gb", "rU")
551 >>> format = "genbank"
552 >>> id_dict = SeqIO.to_dict(SeqIO.parse(handle, format))
553 >>> print sorted(id_dict.keys())
554 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1']
555 >>> print id_dict["L31939.1"].description
556 Brassica rapa (clone bif72) kin mRNA, complete cds.
557
558 A more complex example, using the key_function argument in order to
559 use a sequence checksum as the dictionary key:
560
561 >>> from Bio import SeqIO
562 >>> from Bio.SeqUtils.CheckSum import seguid
563 >>> handle = open("GenBank/cor6_6.gb", "rU")
564 >>> format = "genbank"
565 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(handle, format),
566 ... key_function = lambda rec : seguid(rec.seq))
567 >>> for key, record in sorted(seguid_dict.iteritems()):
568 ... print key, record.id
569 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1
570 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1
571 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1
572 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1
573 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1
574 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1
575
576 This approach is not suitable for very large sets of sequences, as all
577 the SeqRecord objects are held in memory. Instead, consider using the
578 Bio.SeqIO.index() function (if it supports your particular file format).
579 """
580 if key_function is None:
581 key_function = lambda rec : rec.id
582
583 d = dict()
584 for record in sequences:
585 key = key_function(record)
586 if key in d:
587 raise ValueError("Duplicate key '%s'" % key)
588 d[key] = record
589 return d
590
591 -def index(filename, format, alphabet=None, key_function=None):
592 """Indexes a sequence file and returns a dictionary like object.
593
594 - filename - string giving name of file to be indexed
595 - format - lower case string describing the file format
596 - alphabet - optional Alphabet object, useful when the sequence type
597 cannot be automatically inferred from the file itself
598 (e.g. format="fasta" or "tab")
599 - key_function - Optional callback function which when given a
600 SeqRecord identifier string should return a unique
601 key for the dictionary.
602
603 This indexing function will return a dictionary like object, giving the
604 SeqRecord objects as values:
605
606 >>> from Bio import SeqIO
607 >>> records = SeqIO.index("Quality/example.fastq", "fastq")
608 >>> len(records)
609 3
610 >>> sorted(records.keys())
611 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
612 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta")
613 >EAS54_6_R1_2_1_540_792
614 TTGGCAGGCCAAGGCCGATGGATCA
615 <BLANKLINE>
616 >>> "EAS54_6_R1_2_1_540_792" in records
617 True
618 >>> print records.get("Missing", None)
619 None
620
621 Note that this psuedo dictionary will not support all the methods of a
622 true Python dictionary, for example values() is not defined since this
623 would require loading all of the records into memory at once.
624
625 When you call the index function, it will scan through the file, noting
626 the location of each record. When you access a particular record via the
627 dictionary methods, the code will jump to the appropriate part of the
628 file and then parse that section into a SeqRecord.
629
630 Note that not all the input formats supported by Bio.SeqIO can be used
631 with this index function. It is designed to work only with sequential
632 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any
633 interlaced file format (e.g. alignment formats such as "clustal").
634
635 For small files, it may be more efficient to use an in memory Python
636 dictionary, e.g.
637
638 >>> from Bio import SeqIO
639 >>> records = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fastq"), "fastq"))
640 >>> len(records)
641 3
642 >>> sorted(records.keys())
643 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
644 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta")
645 >EAS54_6_R1_2_1_540_792
646 TTGGCAGGCCAAGGCCGATGGATCA
647 <BLANKLINE>
648
649 As with the to_dict() function, by default the id string of each record
650 is used as the key. You can specify a callback function to transform
651 this (the record identifier string) into your prefered key. For example:
652
653 >>> from Bio import SeqIO
654 >>> def make_tuple(identifier):
655 ... parts = identifier.split("_")
656 ... return int(parts[-2]), int(parts[-1])
657 >>> records = SeqIO.index("Quality/example.fastq", "fastq",
658 ... key_function=make_tuple)
659 >>> len(records)
660 3
661 >>> sorted(records.keys())
662 [(413, 324), (443, 348), (540, 792)]
663 >>> print records[(540, 792)].format("fasta")
664 >EAS54_6_R1_2_1_540_792
665 TTGGCAGGCCAAGGCCGATGGATCA
666 <BLANKLINE>
667 >>> (540, 792) in records
668 True
669 >>> "EAS54_6_R1_2_1_540_792" in records
670 False
671 >>> print records.get("Missing", None)
672 None
673
674 Another common use case would be indexing an NCBI style FASTA file,
675 where you might want to extract the GI number from the FASTA identifer
676 to use as the dictionary key.
677
678 Notice that unlike the to_dict() function, here the key_function does
679 not get given the full SeqRecord to use to generate the key. Doing so
680 would impose a severe performance penalty as it would require the file
681 to be completely parsed while building the index. Right now this is
682 usually avoided.
683 """
684
685 if not isinstance(filename, basestring):
686 raise TypeError("Need a filename (not a handle)")
687 if not isinstance(format, basestring):
688 raise TypeError("Need a string for the file format (lower case)")
689 if not format:
690 raise ValueError("Format required (lower case string)")
691 if format != format.lower():
692 raise ValueError("Format string '%s' should be lower case" % format)
693 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
694 isinstance(alphabet, AlphabetEncoder)):
695 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
696
697
698 import _index
699 try:
700 indexer = _index._FormatToIndexedDict[format]
701 except KeyError:
702 raise ValueError("Unsupported format '%s'" % format)
703 return indexer(filename, alphabet, key_function)
704
706 """Returns a multiple sequence alignment (OBSOLETE).
707
708 - sequences -An iterator that returns SeqRecord objects,
709 or simply a list of SeqRecord objects. All
710 the record sequences must be the same length.
711 - alphabet - Optional alphabet. Stongly recommended.
712 - strict - Optional, defaults to True. Should error checking
713 be done?
714
715 Using this function is now discouraged. Rather doing this:
716
717 >>> from Bio import SeqIO
718 >>> handle = open("Clustalw/protein.aln")
719 >>> alignment = SeqIO.to_alignment(SeqIO.parse(handle, "clustal"))
720 >>> handle.close()
721
722 You are now encouraged to use Bio.AlignIO instead, e.g.
723
724 >>> from Bio import AlignIO
725 >>> handle = open("Clustalw/protein.aln")
726 >>> alignment = AlignIO.read(handle, "clustal")
727 >>> handle.close()
728 """
729
730 from Bio.Alphabet import generic_alphabet
731 from Bio.Alphabet import _consensus_alphabet
732 if alphabet is None:
733 sequences = list(sequences)
734 alphabet = _consensus_alphabet([rec.seq.alphabet for rec in sequences \
735 if rec.seq is not None])
736
737 if not (isinstance(alphabet, Alphabet) or isinstance(alphabet, AlphabetEncoder)):
738 raise ValueError("Invalid alphabet")
739
740 alignment_length = None
741 alignment = Alignment(alphabet)
742 for record in sequences:
743 if strict:
744 if alignment_length is None:
745 alignment_length = len(record.seq)
746 elif alignment_length != len(record.seq):
747 raise ValueError("Sequences must all be the same length")
748
749 assert isinstance(record.seq.alphabet, Alphabet) \
750 or isinstance(record.seq.alphabet, AlphabetEncoder), \
751 "Sequence does not have a valid alphabet"
752
753
754
755
756 if isinstance(record.seq.alphabet, Alphabet) \
757 and isinstance(alphabet, Alphabet):
758
759 if not isinstance(record.seq.alphabet, alphabet.__class__):
760 raise ValueError("Incompatible sequence alphabet " \
761 + "%s for %s alignment" \
762 % (record.seq.alphabet, alphabet))
763 elif isinstance(record.seq.alphabet, AlphabetEncoder) \
764 and isinstance(alphabet, Alphabet):
765 raise ValueError("Sequence has a gapped alphabet, alignment does not")
766 elif isinstance(record.seq.alphabet, Alphabet) \
767 and isinstance(alphabet, Gapped):
768
769 if not isinstance(record.seq.alphabet, alphabet.alphabet.__class__):
770 raise ValueError("Incompatible sequence alphabet " \
771 + "%s for %s alignment" \
772 % (record.seq.alphabet, alphabet))
773 else:
774
775 if not isinstance(record.seq.alphabet, alphabet.__class__):
776 raise ValueError("Incompatible sequence alphabet " \
777 + "%s for %s alignment" \
778 % (record.seq.alphabet, alphabet))
779 if record.seq.alphabet.gap_char != alphabet.gap_char:
780 raise ValueError("Sequence gap characters != alignment gap char")
781
782
783 if record.seq is None:
784 raise TypeError("SeqRecord (id=%s) has None for its sequence." % record.id)
785
786
787
788
789 alignment._records.append(record)
790 return alignment
791
792 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
793 """Convert between two sequence file formats, return number of records.
794
795 - in_file - an input handle or filename
796 - in_format - input file format, lower case string
797 - out_file - an output handle or filename
798 - out_format - output file format, lower case string
799 - alphabet - optional alphabet to assume
800
801 NOTE - If you provide an output filename, it will be opened which will
802 overwrite any existing file without warning. This may happen if even
803 the conversion is aborted (e.g. an invalid out_format name is given).
804
805 For example, going from a filename to a handle:
806
807 >>> from Bio import SeqIO
808 >>> from StringIO import StringIO
809 >>> handle = StringIO("")
810 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta")
811 3
812 >>> print handle.getvalue()
813 >EAS54_6_R1_2_1_413_324
814 CCCTTCTTGTCTTCAGCGTTTCTCC
815 >EAS54_6_R1_2_1_540_792
816 TTGGCAGGCCAAGGCCGATGGATCA
817 >EAS54_6_R1_2_1_443_348
818 GTTGCTTCTGGCGTGGGTGGGGGGG
819 <BLANKLINE>
820 """
821
822
823 if isinstance(in_file, basestring):
824 in_handle = open(in_file, "rU")
825 in_close = True
826 else:
827 in_handle = in_file
828 in_close = False
829
830 if isinstance(out_file, basestring):
831 out_handle = open(out_file, "w")
832 out_close = True
833 else:
834 out_handle = out_file
835 out_close = False
836
837
838 from _convert import _handle_convert
839 count = _handle_convert(in_handle, in_format,
840 out_handle, out_format,
841 alphabet)
842
843 if in_close : in_handle.close()
844 if out_close : out_handle.close()
845 return count
846
848 """Run the Bio.SeqIO module's doctests.
849
850 This will try and locate the unit tests directory, and run the doctests
851 from there in order that the relative paths used in the examples work.
852 """
853 import doctest
854 import os
855 if os.path.isdir(os.path.join("..","..","Tests")):
856 print "Runing doctests..."
857 cur_dir = os.path.abspath(os.curdir)
858 os.chdir(os.path.join("..","..","Tests"))
859 doctest.testmod()
860 os.chdir(cur_dir)
861 del cur_dir
862 print "Done"
863 elif os.path.isdir(os.path.join("Tests", "Fasta")):
864 print "Runing doctests..."
865 cur_dir = os.path.abspath(os.curdir)
866 os.chdir(os.path.join("Tests"))
867 doctest.testmod()
868 os.chdir(cur_dir)
869 del cur_dir
870 print "Done"
871
872 if __name__ == "__main__":
873
874 _test()
875