1
2
3
4
5
6 """Multiple sequence alignment input/output as Alignment objects.
7
8 The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in
9 fact the two are connected internally. Both modules use the same set of file
10 format names (lower case strings). From the user's perspective, you can read
11 in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you
12 can read in the sequences within these alignmenta using Bio.SeqIO.
13
14 Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by
15 a whole chapter in our tutorial:
16 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
17 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
18
19 Input
20 =====
21 For the typical special case when your file or handle contains one and only
22 one alignment, use the function Bio.AlignIO.read(). This takes an input file
23 handle, format string and optional number of sequences per alignment. It will
24 return a single Alignment object (or raise an exception if there isn't just
25 one alignment):
26
27 >>> from Bio import AlignIO
28 >>> handle = open("Phylip/interlaced.phy", "rU")
29 >>> align = AlignIO.read(handle, "phylip")
30 >>> handle.close()
31 >>> print align
32 SingleLetterAlphabet() alignment with 3 rows and 384 columns
33 -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI
34 MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU
35 ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN
36
37 For the general case, when the handle could contain any number of alignments,
38 use the function Bio.AlignIO.parse(...) which takes the same arguments, but
39 returns an iterator giving Alignment objects (typically used in a for loop).
40 If you want random access to the alignments by number, turn this into a list:
41
42 >>> from Bio import AlignIO
43 >>> handle = open("Emboss/needle.txt", "rU")
44 >>> alignments = list(AlignIO.parse(handle, "emboss"))
45 >>> print alignments[2]
46 SingleLetterAlphabet() alignment with 2 rows and 120 columns
47 -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec
48 LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver
49
50 Most alignment file formats can be concatenated so as to hold as many
51 different multiple sequence alignments as possible. One common example
52 is the output of the tool seqboot in the PHLYIP suite. Sometimes there
53 can be a file header and footer, as seen in the EMBOSS alignment output.
54
55 Output
56 ======
57 Use the function Bio.AlignIO.write(...), which takes a complete set of
58 Alignment objects (either as a list, or an iterator), an output file handle
59 and of course the file format::
60
61 from Bio import AlignIO
62 alignments = ...
63 handle = open("example.faa", "w")
64 alignment = SeqIO.write(alignments, handle, "fasta")
65 handle.close()
66
67 In general, you are expected to call this function once (with all your
68 alignments) and then close the file handle. However, for file formats
69 like PHYLIP where multiple alignments are stored sequentially (with no file
70 header and footer), then multiple calls to the write function should work as
71 expected.
72
73 Conversion
74 ==========
75 The Bio.AlignIO.convert(...) function allows an easy interface for simple
76 alignnment file format conversions. Additionally, it may use file format
77 specific optimisations so this should be the fastest way too.
78
79 In general however, you can combine the Bio.AlignIO.parse(...) function with
80 the Bio.AlignIO.write(...) function for sequence file conversion. Using
81 generator expressions provides a memory efficient way to perform filtering or
82 other extra operations as part of the process.
83
84 File Formats
85 ============
86 When specifying the file format, use lowercase strings. The same format
87 names are also used in Bio.SeqIO and include the following:
88
89 - clustal - Ouput from Clustal W or X, see also the module Bio.Clustalw
90 which can be used to run the command line tool from Biopython.
91 - emboss - EMBOSS tools' "pairs" and "simple" alignment formats.
92 - fasta - The generic sequence file format where each record starts with
93 an identifer line starting with a ">" character, followed by
94 lines of sequence.
95 - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA
96 tools when used with the -m 10 command line option for machine
97 readable output.
98 - ig - The IntelliGenetics file format, apparently the same as the
99 MASE alignment format.
100 - nexus - Output from NEXUS, see also the module Bio.Nexus which can also
101 read any phylogenetic trees in these files.
102 - phylip - Used by the PHLIP tools.
103 - stockholm - A richly annotated alignment file format used by PFAM.
104
105 Note that while Bio.AlignIO can read all the above file formats, it cannot
106 write to all of them.
107
108 You can also use any file format supported by Bio.SeqIO, such as "fasta" or
109 "ig" (which are listed above), PROVIDED the sequences in your file are all the
110 same length.
111 """
112 __docformat__ = "epytext en"
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128 import os
129
130 from StringIO import StringIO
131 from Bio.Seq import Seq
132 from Bio.SeqRecord import SeqRecord
133 from Bio.Align.Generic import Alignment
134 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
135
136 import StockholmIO
137 import ClustalIO
138 import NexusIO
139 import PhylipIO
140 import EmbossIO
141 import FastaIO
142
143
144
145
146 _FormatToIterator ={
147 "clustal" : ClustalIO.ClustalIterator,
148 "emboss" : EmbossIO.EmbossIterator,
149 "fasta-m10" : FastaIO.FastaM10Iterator,
150 "nexus" : NexusIO.NexusIterator,
151 "phylip" : PhylipIO.PhylipIterator,
152 "stockholm" : StockholmIO.StockholmIterator,
153 }
154
155 _FormatToWriter ={
156
157 "nexus" : NexusIO.NexusWriter,
158 "phylip" : PhylipIO.PhylipWriter,
159 "stockholm" : StockholmIO.StockholmWriter,
160 "clustal" : ClustalIO.ClustalWriter,
161 }
162
163 -def write(alignments, handle, format):
164 """Write complete set of alignments to a file.
165
166 Arguments:
167 - sequences - A list (or iterator) of Alignment objects
168 - handle - File handle object to write to
169 - format - lower case string describing the file format to write.
170
171 You should close the handle after calling this function.
172
173 Returns the number of alignments written (as an integer).
174 """
175 from Bio import SeqIO
176
177
178 if isinstance(handle, basestring):
179 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
180 if not isinstance(format, basestring):
181 raise TypeError("Need a string for the file format (lower case)")
182 if not format:
183 raise ValueError("Format required (lower case string)")
184 if format != format.lower():
185 raise ValueError("Format string '%s' should be lower case" % format)
186 if isinstance(alignments, Alignment):
187 raise TypeError("Need an Alignment list/iterator, not just a single Alignment")
188
189
190 if format in _FormatToIterator:
191 writer_class = _FormatToWriter[format]
192 count = writer_class(handle).write_file(alignments)
193 elif format in SeqIO._FormatToWriter:
194
195
196 count = 0
197 for alignment in alignments:
198 if not isinstance(alignment, Alignment):
199 raise TypeError("Expect a list or iterator of Alignment objects.")
200 SeqIO.write(alignment, handle, format)
201 count += 1
202 elif format in _FormatToIterator or format in SeqIO._FormatToIterator:
203 raise ValueError("Reading format '%s' is supported, but not writing" \
204 % format)
205 else:
206 raise ValueError("Unknown format '%s'" % format)
207
208 assert isinstance(count, int), "Internal error - the underlying %s " \
209 "writer should have returned the alignment count, not %s" \
210 % (format, repr(count))
211 return count
212
213
215 """Uses Bio.SeqIO to create an Alignment iterator (PRIVATE).
216
217 Arguments:
218 - handle - handle to the file.
219 - format - string describing the file format.
220 - alphabet - optional Alphabet object, useful when the sequence type
221 cannot be automatically inferred from the file itself
222 (e.g. fasta, phylip, clustal)
223 - seq_count - Optional integer, number of sequences expected in each
224 alignment. Recommended for fasta format files.
225
226 If count is omitted (default) then all the sequences in
227 the file are combined into a single Alignment.
228 """
229 from Bio import SeqIO
230 assert format in SeqIO._FormatToIterator
231
232 if seq_count:
233
234 seq_record_iterator = SeqIO.parse(handle, format, alphabet)
235
236 records = []
237 for record in seq_record_iterator:
238 records.append(record)
239 if len(records) == seq_count:
240 yield SeqIO.to_alignment(records)
241 records = []
242 if len(records) > 0:
243 raise ValueError("Check seq_count argument, not enough sequences?")
244 else:
245
246
247 records = list(SeqIO.parse(handle, format, alphabet))
248 if records:
249 yield SeqIO.to_alignment(records)
250 else:
251
252 pass
253
273
274 -def parse(handle, format, seq_count=None, alphabet=None):
275 """Turns a sequence file into an iterator returning Alignment objects.
276
277 Arguments:
278 - handle - handle to the file.
279 - format - string describing the file format.
280 - alphabet - optional Alphabet object, useful when the sequence type
281 cannot be automatically inferred from the file itself
282 (e.g. fasta, phylip, clustal)
283 - seq_count - Optional integer, number of sequences expected in each
284 alignment. Recommended for fasta format files.
285
286 If you have the file name in a string 'filename', use:
287
288 >>> from Bio import AlignIO
289 >>> filename = "Emboss/needle.txt"
290 >>> format = "emboss"
291 >>> for alignment in AlignIO.parse(open(filename,"rU"), format):
292 ... print "Alignment of length", alignment.get_alignment_length()
293 Alignment of length 124
294 Alignment of length 119
295 Alignment of length 120
296 Alignment of length 118
297 Alignment of length 125
298
299 If you have a string 'data' containing the file contents, use:
300
301 from Bio import AlignIO
302 from StringIO import StringIO
303 my_iterator = AlignIO.parse(StringIO(data), format)
304
305 Use the Bio.AlignIO.read() function when you expect a single record only.
306 """
307 from Bio import SeqIO
308
309
310 if isinstance(handle, basestring):
311 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
312 if not isinstance(format, basestring):
313 raise TypeError("Need a string for the file format (lower case)")
314 if not format:
315 raise ValueError("Format required (lower case string)")
316 if format != format.lower():
317 raise ValueError("Format string '%s' should be lower case" % format)
318 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
319 isinstance(alphabet, AlphabetEncoder)):
320 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
321 if seq_count is not None and not isinstance(seq_count, int):
322 raise TypeError("Need integer for seq_count (sequences per alignment)")
323
324
325 if format in _FormatToIterator:
326 iterator_generator = _FormatToIterator[format]
327 if alphabet is None :
328 return iterator_generator(handle, seq_count)
329 try:
330
331 return iterator_generator(handle, seq_count, alphabet=alphabet)
332 except TypeError:
333
334 return _force_alphabet(iterator_generator(handle, seq_count), alphabet)
335
336 elif format in SeqIO._FormatToIterator:
337
338 return _SeqIO_to_alignment_iterator(handle, format,
339 alphabet=alphabet,
340 seq_count=seq_count)
341 else:
342 raise ValueError("Unknown format '%s'" % format)
343
344 -def read(handle, format, seq_count=None, alphabet=None):
345 """Turns an alignment file into a single Alignment object.
346
347 Arguments:
348 - handle - handle to the file.
349 - format - string describing the file format.
350 - alphabet - optional Alphabet object, useful when the sequence type
351 cannot be automatically inferred from the file itself
352 (e.g. fasta, phylip, clustal)
353 - seq_count - Optional integer, number of sequences expected in each
354 alignment. Recommended for fasta format files.
355
356 If the handle contains no alignments, or more than one alignment,
357 an exception is raised. For example, using a PFAM/Stockholm file
358 containing one alignment:
359
360 >>> from Bio import AlignIO
361 >>> filename = "Clustalw/protein.aln"
362 >>> format = "clustal"
363 >>> alignment = AlignIO.read(open(filename, "rU"), format)
364 >>> print "Alignment of length", alignment.get_alignment_length()
365 Alignment of length 411
366
367 If however you want the first alignment from a file containing
368 multiple alignments this function would raise an exception.
369
370 >>> from Bio import AlignIO
371 >>> filename = "Emboss/needle.txt"
372 >>> format = "emboss"
373 >>> alignment = AlignIO.read(open(filename, "rU"), format)
374 Traceback (most recent call last):
375 ...
376 ValueError: More than one record found in handle
377
378 Instead use:
379
380 >>> from Bio import AlignIO
381 >>> filename = "Emboss/needle.txt"
382 >>> format = "emboss"
383 >>> alignment = AlignIO.parse(open(filename, "rU"), format).next()
384 >>> print "First alignment has length", alignment.get_alignment_length()
385 First alignment has length 124
386
387 You must use the Bio.AlignIO.parse() function if you want to read multiple
388 records from the handle.
389 """
390 iterator = parse(handle, format, seq_count, alphabet)
391 try:
392 first = iterator.next()
393 except StopIteration:
394 first = None
395 if first is None:
396 raise ValueError("No records found in handle")
397 try:
398 second = iterator.next()
399 except StopIteration:
400 second = None
401 if second is not None:
402 raise ValueError("More than one record found in handle")
403 if seq_count:
404 assert len(first.get_all_seqs())==seq_count
405 return first
406
407 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
408 """Convert between two alignment files, returns number of alignments.
409
410 - in_file - an input handle or filename
411 - in_format - input file format, lower case string
412 - output - an output handle or filename
413 - out_file - output file format, lower case string
414 - alphabet - optional alphabet to assume
415
416 NOTE - If you provide an output filename, it will be opened which will
417 overwrite any existing file without warning. This may happen if even the
418 conversion is aborted (e.g. an invalid out_format name is given).
419 """
420
421
422 if isinstance(in_file, basestring):
423 in_handle = open(in_file, "rU")
424 in_close = True
425 else:
426 in_handle = in_file
427 in_close = False
428
429 alignments = parse(in_handle, in_format, None, alphabet)
430
431 if isinstance(out_file, basestring):
432 out_handle = open(out_file, "w")
433 out_close = True
434 else:
435 out_handle = out_file
436 out_close = False
437
438
439 count = write(alignments, out_handle, out_format)
440
441 if in_close : in_handle.close()
442 if out_close : out_handle.close()
443 return count
444
446 """Run the Bio.AlignIO module's doctests.
447
448 This will try and locate the unit tests directory, and run the doctests
449 from there in order that the relative paths used in the examples work.
450 """
451 import doctest
452 import os
453 if os.path.isdir(os.path.join("..","..","Tests")):
454 print "Runing doctests..."
455 cur_dir = os.path.abspath(os.curdir)
456 os.chdir(os.path.join("..","..","Tests"))
457 doctest.testmod()
458 os.chdir(cur_dir)
459 del cur_dir
460 print "Done"
461
462 if __name__ == "__main__":
463 _test()
464