Package Bio :: Package AlignIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.AlignIO

  1  # Copyright 2008-2009 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Multiple sequence alignment input/output as Alignment objects. 
  7   
  8  The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in 
  9  fact the two are connected internally.  Both modules use the same set of file 
 10  format names (lower case strings).  From the user's perspective, you can read 
 11  in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you 
 12  can read in the sequences within these alignmenta using Bio.SeqIO. 
 13   
 14  Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by 
 15  a whole chapter in our tutorial: 
 16   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
 17   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}   
 18   
 19  Input 
 20  ===== 
 21  For the typical special case when your file or handle contains one and only 
 22  one alignment, use the function Bio.AlignIO.read().  This takes an input file 
 23  handle, format string and optional number of sequences per alignment.  It will 
 24  return a single Alignment object (or raise an exception if there isn't just 
 25  one alignment): 
 26   
 27      >>> from Bio import AlignIO 
 28      >>> handle = open("Phylip/interlaced.phy", "rU") 
 29      >>> align = AlignIO.read(handle, "phylip") 
 30      >>> handle.close() 
 31      >>> print align 
 32      SingleLetterAlphabet() alignment with 3 rows and 384 columns 
 33      -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI 
 34      MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU 
 35      ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN 
 36   
 37  For the general case, when the handle could contain any number of alignments, 
 38  use the function Bio.AlignIO.parse(...) which takes the same arguments, but 
 39  returns an iterator giving Alignment objects (typically used in a for loop). 
 40  If you want random access to the alignments by number, turn this into a list: 
 41   
 42      >>> from Bio import AlignIO 
 43      >>> handle = open("Emboss/needle.txt", "rU") 
 44      >>> alignments = list(AlignIO.parse(handle, "emboss")) 
 45      >>> print alignments[2] 
 46      SingleLetterAlphabet() alignment with 2 rows and 120 columns 
 47      -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec 
 48      LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver 
 49   
 50  Most alignment file formats can be concatenated so as to hold as many 
 51  different multiple sequence alignments as possible.  One common example 
 52  is the output of the tool seqboot in the PHLYIP suite.  Sometimes there 
 53  can be a file header and footer, as seen in the EMBOSS alignment output. 
 54   
 55  Output 
 56  ====== 
 57  Use the function Bio.AlignIO.write(...), which takes a complete set of 
 58  Alignment objects (either as a list, or an iterator), an output file handle 
 59  and of course the file format:: 
 60   
 61      from Bio import AlignIO 
 62      alignments = ... 
 63      handle = open("example.faa", "w") 
 64      alignment = SeqIO.write(alignments, handle, "fasta") 
 65      handle.close() 
 66   
 67  In general, you are expected to call this function once (with all your 
 68  alignments) and then close the file handle.  However, for file formats 
 69  like PHYLIP where multiple alignments are stored sequentially (with no file 
 70  header and footer), then multiple calls to the write function should work as 
 71  expected. 
 72   
 73  Conversion 
 74  ========== 
 75  The Bio.AlignIO.convert(...) function allows an easy interface for simple 
 76  alignnment file format conversions. Additionally, it may use file format 
 77  specific optimisations so this should be the fastest way too. 
 78   
 79  In general however, you can combine the Bio.AlignIO.parse(...) function with 
 80  the Bio.AlignIO.write(...) function for sequence file conversion. Using 
 81  generator expressions provides a memory efficient way to perform filtering or 
 82  other extra operations as part of the process. 
 83   
 84  File Formats 
 85  ============ 
 86  When specifying the file format, use lowercase strings.  The same format 
 87  names are also used in Bio.SeqIO and include the following: 
 88   
 89   - clustal   - Ouput from Clustal W or X, see also the module Bio.Clustalw 
 90                 which can be used to run the command line tool from Biopython. 
 91   - emboss    - EMBOSS tools' "pairs" and "simple" alignment formats. 
 92   - fasta     - The generic sequence file format where each record starts with 
 93                 an identifer line starting with a ">" character, followed by 
 94                 lines of sequence. 
 95   - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA 
 96                 tools when used with the -m 10 command line option for machine 
 97                 readable output. 
 98   - ig        - The IntelliGenetics file format, apparently the same as the 
 99                 MASE alignment format. 
100   - nexus     - Output from NEXUS, see also the module Bio.Nexus which can also 
101                 read any phylogenetic trees in these files. 
102   - phylip    - Used by the PHLIP tools. 
103   - stockholm - A richly annotated alignment file format used by PFAM. 
104   
105  Note that while Bio.AlignIO can read all the above file formats, it cannot 
106  write to all of them. 
107   
108  You can also use any file format supported by Bio.SeqIO, such as "fasta" or 
109  "ig" (which are listed above), PROVIDED the sequences in your file are all the 
110  same length. 
111  """ 
112  __docformat__ = "epytext en" #not just plaintext 
113   
114  #TODO 
115  # - define policy on reading aligned sequences with gaps in 
116  #   (e.g. - and . characters) including how the alphabet interacts 
117  # 
118  # - Can we build the to_alignment(...) functionality 
119  #   into the generic Alignment class instead? 
120  # 
121  # - How best to handle unique/non unique record.id when writing. 
122  #   For most file formats reading such files is fine; The stockholm 
123  #   parser would fail. 
124  # 
125  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
126  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format  
127   
128  import os 
129  #from cStringIO import StringIO 
130  from StringIO import StringIO 
131  from Bio.Seq import Seq 
132  from Bio.SeqRecord import SeqRecord 
133  from Bio.Align.Generic import Alignment 
134  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
135   
136  import StockholmIO 
137  import ClustalIO 
138  import NexusIO 
139  import PhylipIO 
140  import EmbossIO 
141  import FastaIO 
142   
143  #Convention for format names is "mainname-subtype" in lower case. 
144  #Please use the same names as BioPerl and EMBOSS where possible. 
145   
146  _FormatToIterator ={#"fasta" is done via Bio.SeqIO 
147                      "clustal" : ClustalIO.ClustalIterator, 
148                      "emboss" : EmbossIO.EmbossIterator, 
149                      "fasta-m10" : FastaIO.FastaM10Iterator, 
150                      "nexus" : NexusIO.NexusIterator, 
151                      "phylip" : PhylipIO.PhylipIterator, 
152                      "stockholm" : StockholmIO.StockholmIterator, 
153                      } 
154   
155  _FormatToWriter ={#"fasta" is done via Bio.SeqIO 
156                    #"emboss" : EmbossIO.EmbossWriter, (unfinished) 
157                    "nexus" : NexusIO.NexusWriter, 
158                    "phylip" : PhylipIO.PhylipWriter, 
159                    "stockholm" : StockholmIO.StockholmWriter, 
160                    "clustal" : ClustalIO.ClustalWriter, 
161                    } 
162   
163 -def write(alignments, handle, format):
164 """Write complete set of alignments to a file. 165 166 Arguments: 167 - sequences - A list (or iterator) of Alignment objects 168 - handle - File handle object to write to 169 - format - lower case string describing the file format to write. 170 171 You should close the handle after calling this function. 172 173 Returns the number of alignments written (as an integer). 174 """ 175 from Bio import SeqIO 176 177 #Try and give helpful error messages: 178 if isinstance(handle, basestring): 179 raise TypeError("Need a file handle, not a string (i.e. not a filename)") 180 if not isinstance(format, basestring): 181 raise TypeError("Need a string for the file format (lower case)") 182 if not format: 183 raise ValueError("Format required (lower case string)") 184 if format != format.lower(): 185 raise ValueError("Format string '%s' should be lower case" % format) 186 if isinstance(alignments, Alignment): 187 raise TypeError("Need an Alignment list/iterator, not just a single Alignment") 188 189 #Map the file format to a writer class 190 if format in _FormatToIterator: 191 writer_class = _FormatToWriter[format] 192 count = writer_class(handle).write_file(alignments) 193 elif format in SeqIO._FormatToWriter: 194 #Exploit the existing SeqIO parser to the dirty work! 195 #TODO - Can we make one call to SeqIO.write() and count the alignments? 196 count = 0 197 for alignment in alignments: 198 if not isinstance(alignment, Alignment): 199 raise TypeError("Expect a list or iterator of Alignment objects.") 200 SeqIO.write(alignment, handle, format) 201 count += 1 202 elif format in _FormatToIterator or format in SeqIO._FormatToIterator: 203 raise ValueError("Reading format '%s' is supported, but not writing" \ 204 % format) 205 else: 206 raise ValueError("Unknown format '%s'" % format) 207 208 assert isinstance(count, int), "Internal error - the underlying %s " \ 209 "writer should have returned the alignment count, not %s" \ 210 % (format, repr(count)) 211 return count
212 213 #This is a generator function!
214 -def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None):
215 """Uses Bio.SeqIO to create an Alignment iterator (PRIVATE). 216 217 Arguments: 218 - handle - handle to the file. 219 - format - string describing the file format. 220 - alphabet - optional Alphabet object, useful when the sequence type 221 cannot be automatically inferred from the file itself 222 (e.g. fasta, phylip, clustal) 223 - seq_count - Optional integer, number of sequences expected in each 224 alignment. Recommended for fasta format files. 225 226 If count is omitted (default) then all the sequences in 227 the file are combined into a single Alignment. 228 """ 229 from Bio import SeqIO 230 assert format in SeqIO._FormatToIterator 231 232 if seq_count: 233 #Use the count to split the records into batches. 234 seq_record_iterator = SeqIO.parse(handle, format, alphabet) 235 236 records = [] 237 for record in seq_record_iterator: 238 records.append(record) 239 if len(records) == seq_count: 240 yield SeqIO.to_alignment(records) 241 records = [] 242 if len(records) > 0: 243 raise ValueError("Check seq_count argument, not enough sequences?") 244 else: 245 #Must assume that there is a single alignment using all 246 #the SeqRecord objects: 247 records = list(SeqIO.parse(handle, format, alphabet)) 248 if records: 249 yield SeqIO.to_alignment(records) 250 else: 251 #No alignment found! 252 pass
253
254 -def _force_alphabet(alignment_iterator, alphabet):
255 """Iterate over alignments, over-riding the alphabet (PRIVATE).""" 256 #Assume the alphabet argument has been pre-validated 257 given_base_class = _get_base_alphabet(alphabet).__class__ 258 for align in alignment_iterator: 259 if not isinstance(_get_base_alphabet(align._alphabet), 260 given_base_class): 261 raise ValueError("Specified alphabet %s clashes with "\ 262 "that determined from the file, %s" \ 263 % (repr(alphabet), repr(align._alphabet))) 264 for record in align: 265 if not isinstance(_get_base_alphabet(record.seq.alphabet), 266 given_base_class): 267 raise ValueError("Specified alphabet %s clashes with "\ 268 "that determined from the file, %s" \ 269 % (repr(alphabet), repr(record.seq.alphabet))) 270 record.seq.alphabet = alphabet 271 align._alphabet = alphabet 272 yield align
273
274 -def parse(handle, format, seq_count=None, alphabet=None):
275 """Turns a sequence file into an iterator returning Alignment objects. 276 277 Arguments: 278 - handle - handle to the file. 279 - format - string describing the file format. 280 - alphabet - optional Alphabet object, useful when the sequence type 281 cannot be automatically inferred from the file itself 282 (e.g. fasta, phylip, clustal) 283 - seq_count - Optional integer, number of sequences expected in each 284 alignment. Recommended for fasta format files. 285 286 If you have the file name in a string 'filename', use: 287 288 >>> from Bio import AlignIO 289 >>> filename = "Emboss/needle.txt" 290 >>> format = "emboss" 291 >>> for alignment in AlignIO.parse(open(filename,"rU"), format): 292 ... print "Alignment of length", alignment.get_alignment_length() 293 Alignment of length 124 294 Alignment of length 119 295 Alignment of length 120 296 Alignment of length 118 297 Alignment of length 125 298 299 If you have a string 'data' containing the file contents, use: 300 301 from Bio import AlignIO 302 from StringIO import StringIO 303 my_iterator = AlignIO.parse(StringIO(data), format) 304 305 Use the Bio.AlignIO.read() function when you expect a single record only. 306 """ 307 from Bio import SeqIO 308 309 #Try and give helpful error messages: 310 if isinstance(handle, basestring): 311 raise TypeError("Need a file handle, not a string (i.e. not a filename)") 312 if not isinstance(format, basestring): 313 raise TypeError("Need a string for the file format (lower case)") 314 if not format: 315 raise ValueError("Format required (lower case string)") 316 if format != format.lower(): 317 raise ValueError("Format string '%s' should be lower case" % format) 318 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \ 319 isinstance(alphabet, AlphabetEncoder)): 320 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 321 if seq_count is not None and not isinstance(seq_count, int): 322 raise TypeError("Need integer for seq_count (sequences per alignment)") 323 324 #Map the file format to a sequence iterator: 325 if format in _FormatToIterator: 326 iterator_generator = _FormatToIterator[format] 327 if alphabet is None : 328 return iterator_generator(handle, seq_count) 329 try: 330 #Initially assume the optional alphabet argument is supported 331 return iterator_generator(handle, seq_count, alphabet=alphabet) 332 except TypeError: 333 #It isn't supported. 334 return _force_alphabet(iterator_generator(handle, seq_count), alphabet) 335 336 elif format in SeqIO._FormatToIterator: 337 #Exploit the existing SeqIO parser to the dirty work! 338 return _SeqIO_to_alignment_iterator(handle, format, 339 alphabet=alphabet, 340 seq_count=seq_count) 341 else: 342 raise ValueError("Unknown format '%s'" % format)
343
344 -def read(handle, format, seq_count=None, alphabet=None):
345 """Turns an alignment file into a single Alignment object. 346 347 Arguments: 348 - handle - handle to the file. 349 - format - string describing the file format. 350 - alphabet - optional Alphabet object, useful when the sequence type 351 cannot be automatically inferred from the file itself 352 (e.g. fasta, phylip, clustal) 353 - seq_count - Optional integer, number of sequences expected in each 354 alignment. Recommended for fasta format files. 355 356 If the handle contains no alignments, or more than one alignment, 357 an exception is raised. For example, using a PFAM/Stockholm file 358 containing one alignment: 359 360 >>> from Bio import AlignIO 361 >>> filename = "Clustalw/protein.aln" 362 >>> format = "clustal" 363 >>> alignment = AlignIO.read(open(filename, "rU"), format) 364 >>> print "Alignment of length", alignment.get_alignment_length() 365 Alignment of length 411 366 367 If however you want the first alignment from a file containing 368 multiple alignments this function would raise an exception. 369 370 >>> from Bio import AlignIO 371 >>> filename = "Emboss/needle.txt" 372 >>> format = "emboss" 373 >>> alignment = AlignIO.read(open(filename, "rU"), format) 374 Traceback (most recent call last): 375 ... 376 ValueError: More than one record found in handle 377 378 Instead use: 379 380 >>> from Bio import AlignIO 381 >>> filename = "Emboss/needle.txt" 382 >>> format = "emboss" 383 >>> alignment = AlignIO.parse(open(filename, "rU"), format).next() 384 >>> print "First alignment has length", alignment.get_alignment_length() 385 First alignment has length 124 386 387 You must use the Bio.AlignIO.parse() function if you want to read multiple 388 records from the handle. 389 """ 390 iterator = parse(handle, format, seq_count, alphabet) 391 try: 392 first = iterator.next() 393 except StopIteration: 394 first = None 395 if first is None: 396 raise ValueError("No records found in handle") 397 try: 398 second = iterator.next() 399 except StopIteration: 400 second = None 401 if second is not None: 402 raise ValueError("More than one record found in handle") 403 if seq_count: 404 assert len(first.get_all_seqs())==seq_count 405 return first
406
407 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
408 """Convert between two alignment files, returns number of alignments. 409 410 - in_file - an input handle or filename 411 - in_format - input file format, lower case string 412 - output - an output handle or filename 413 - out_file - output file format, lower case string 414 - alphabet - optional alphabet to assume 415 416 NOTE - If you provide an output filename, it will be opened which will 417 overwrite any existing file without warning. This may happen if even the 418 conversion is aborted (e.g. an invalid out_format name is given). 419 """ 420 #TODO - Add optimised versions of important conversions 421 #For now just off load the work to SeqIO parse/write 422 if isinstance(in_file, basestring): 423 in_handle = open(in_file, "rU") 424 in_close = True 425 else: 426 in_handle = in_file 427 in_close = False 428 #This will check the arguments and issue error messages, 429 alignments = parse(in_handle, in_format, None, alphabet) 430 #Don't open the output file until we've checked the input is OK: 431 if isinstance(out_file, basestring): 432 out_handle = open(out_file, "w") 433 out_close = True 434 else: 435 out_handle = out_file 436 out_close = False 437 #This will check the arguments and issue error messages, 438 #after we have opened the file which is a shame. 439 count = write(alignments, out_handle, out_format) 440 #Must now close any handles we opened 441 if in_close : in_handle.close() 442 if out_close : out_handle.close() 443 return count
444
445 -def _test():
446 """Run the Bio.AlignIO module's doctests. 447 448 This will try and locate the unit tests directory, and run the doctests 449 from there in order that the relative paths used in the examples work. 450 """ 451 import doctest 452 import os 453 if os.path.isdir(os.path.join("..","..","Tests")): 454 print "Runing doctests..." 455 cur_dir = os.path.abspath(os.curdir) 456 os.chdir(os.path.join("..","..","Tests")) 457 doctest.testmod() 458 os.chdir(cur_dir) 459 del cur_dir 460 print "Done"
461 462 if __name__ == "__main__": 463 _test() 464