
Source Code for Package Bio.Prosite

# Copyright 1999 by Jeffrey Chang.  All rights reserved.
# Copyright 2000 by Jeffrey Chang.  All rights reserved.
# Revisions Copyright 2007 by Peter Cock.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.
"""Module for working with Prosite files from ExPASy (DEPRECATED).

Most of the functionality in this module has moved to Bio.ExPASy.Prosite;
please see

Bio.ExPASy.Prosite.read          To read a Prosite file containing one entry.
Bio.ExPASy.Prosite.parse         Iterates over entries in a Prosite file.
Bio.ExPASy.Prosite.Record        Holds Prosite data.

For the following, please see the new module Bio.ExPASy.ScanProsite:

scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
_extract_pattern_hits Extract Prosite patterns from a web page.
PatternHit            Holds data from a hit against a Prosite pattern.

The other functions and classes in Bio.Prosite (including
Bio.Prosite.index_file and Bio.Prosite.Dictionary) are considered deprecated,
and were not moved to Bio.ExPASy.Prosite. If you use this functionality,
please contact the Biopython developers at biopython-dev@biopython.org to
avoid permanent removal of this module from Biopython.


This module provides code to work with the prosite.dat file from Prosite:
http://www.expasy.ch/prosite/

Tested with:
Release 15.0, July 1998
Release 16.0, July 1999
Release 17.0, Dec 2001
Release 19.0, Mar 2006


Functions:
parse                 Iterates over entries in a Prosite file.
scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
index_file            Index a Prosite file for a Dictionary.
_extract_record       Extract Prosite data from a web page.
_extract_pattern_hits Extract Prosite patterns from a web page.


Classes:
Record                Holds Prosite data.
PatternHit            Holds data from a hit against a Prosite pattern.
Dictionary            Accesses a Prosite file using a dictionary interface.
RecordParser          Parses a Prosite record into a Record object.

_Scanner              Scans Prosite-formatted data.
_RecordConsumer       Consumes Prosite data to a Record object.

"""

import warnings
warnings.warn("Bio.Prosite is deprecated, and will be removed in a"
              " future release of Biopython. Most of the functionality"
              " is now provided by Bio.ExPASy.Prosite. If you want to"
              " continue to use Bio.Prosite, please get in contact"
              " via the mailing lists to avoid its permanent removal from"
              " Biopython.", DeprecationWarning)

from types import *
import re
import sgmllib
from Bio import File
from Bio import Index
from Bio.ParserSupport import *

# There is probably a cleaner way to write the read/parse functions
# if we don't use the "parser = RecordParser(); parser.parse(handle)"
# approach. Leaving that for the next revision of Bio.Prosite.
def parse(handle):
    import cStringIO
    parser = RecordParser()
    text = ""
    for line in handle:
        text += line
        if line[:2] == '//':
            handle = cStringIO.StringIO(text)
            record = parser.parse(handle)
            text = ""
            if not record:  # Then this was the copyright notice
                continue
            yield record

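The record-splitting loop in parse() can be sketched in isolation: accumulate lines until the '//' terminator, then emit the accumulated chunk. This is a minimal, self-contained illustration (split_records is a hypothetical stand-in, not Biopython API; parse() additionally hands each chunk to RecordParser):

```python
from io import StringIO

def split_records(handle):
    # Yield the raw text of each '//'-terminated record.
    text = ""
    for line in handle:
        text += line
        if line.startswith("//"):
            yield text
            text = ""

data = ("ID   ADH_ZINC; PATTERN.\nAC   PS00387;\n//\n"
        "ID   ADH_SHORT; PATTERN.\nAC   PS00061;\n//\n")
chunks = list(split_records(StringIO(data)))
print(len(chunks))              # 2
print(chunks[1].split(";")[0])  # ID   ADH_SHORT
```

Note the copyright banner at the top of prosite.dat is itself '//'-terminated, which is why parse() skips chunks that produce no record.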
def read(handle):
    parser = RecordParser()
    try:
        record = parser.parse(handle)
    except ValueError, error:
        if error.message == "There doesn't appear to be a record":
            raise ValueError("No Prosite record found")
        else:
            raise error
    # We should have reached the end of the record by now
    remainder = handle.read()
    if remainder:
        raise ValueError("More than one Prosite record found")
    return record

class Record:
    """Holds information from a Prosite record.

    Members:
    name           ID of the record.  e.g. ADH_ZINC
    type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
    accession      e.g. PS00387
    created        Date the entry was created.  (MMM-YYYY)
    data_update    Date the 'primary' data was last updated.
    info_update    Date data other than 'primary' data was last updated.
    pdoc           ID of the PROSITE DOCumentation.

    description    Free-format description.
    pattern        The PROSITE pattern.  See docs.
    matrix         List of strings that describes a matrix entry.
    rules          List of rule definitions (from RU lines).  (strings)
    prorules       List of prorules (from PR lines).  (strings)

    NUMERICAL RESULTS
    nr_sp_release  SwissProt release.
    nr_sp_seqs     Number of seqs in that release of Swiss-Prot.  (int)
    nr_total       Number of hits in Swiss-Prot.  tuple of (hits, seqs)
    nr_positive    True positives.  tuple of (hits, seqs)
    nr_unknown     Could be positives.  tuple of (hits, seqs)
    nr_false_pos   False positives.  tuple of (hits, seqs)
    nr_false_neg   False negatives.  (int)
    nr_partial     False negatives, because they are fragments.  (int)

    COMMENTS
    cc_taxo_range  Taxonomic range.  See docs for format.
    cc_max_repeat  Maximum number of repetitions in a protein.
    cc_site        Interesting site.  list of tuples (pattern pos, desc.)
    cc_skip_flag   Can this entry be ignored?
    cc_matrix_type
    cc_scaling_db
    cc_author
    cc_ft_key
    cc_ft_desc
    cc_version     Version number (introduced in release 19.0).

    DATA BANK REFERENCES - The following are all
                           lists of tuples (swiss-prot accession,
                                            swiss-prot name)
    dr_positive
    dr_false_neg
    dr_false_pos
    dr_potential   Potential hits, but fingerprint region not yet available.
    dr_unknown     Could possibly belong.

    pdb_structs    List of PDB entries.

    """
    def __init__(self):
        self.name = ''
        self.type = ''
        self.accession = ''
        self.created = ''
        self.data_update = ''
        self.info_update = ''
        self.pdoc = ''

        self.description = ''
        self.pattern = ''
        self.matrix = []
        self.rules = []
        self.prorules = []
        self.postprocessing = []

        self.nr_sp_release = ''
        self.nr_sp_seqs = ''
        self.nr_total = (None, None)
        self.nr_positive = (None, None)
        self.nr_unknown = (None, None)
        self.nr_false_pos = (None, None)
        self.nr_false_neg = None
        self.nr_partial = None

        self.cc_taxo_range = ''
        self.cc_max_repeat = ''
        self.cc_site = []
        self.cc_skip_flag = ''

        self.dr_positive = []
        self.dr_false_neg = []
        self.dr_false_pos = []
        self.dr_potential = []
        self.dr_unknown = []

        self.pdb_structs = []

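The pattern attribute above holds raw PROSITE pattern syntax: '-' separates positions, 'x' is any residue, [ABC] is a choice, {ABC} an exclusion, and (n) a repeat count. As a rough illustration of that syntax only (prosite_to_regex is a hypothetical helper, not part of Bio.Prosite, and ignores corner cases), such a pattern maps onto a Python regex:

```python
import re

def prosite_to_regex(pattern):
    # Translate the common PROSITE pattern syntax to a Python regex:
    # drop the trailing '.' and the '-' separators, map x -> '.',
    # {...} -> [^...], (n) -> {n}, and < / > -> ^ / $ anchors.
    s = pattern.rstrip('.')
    s = s.replace('-', '')
    s = s.replace('x', '.').replace('X', '.')
    s = s.replace('{', '[^').replace('}', ']')
    s = s.replace('(', '{').replace(')', '}')
    s = s.replace('<', '^').replace('>', '$')
    return s

# An ADH_ZINC-style pattern:
print(prosite_to_regex('G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC].'))
# GHE.{2}G.{5}[GA].{2}[IVSAC]
```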
class PatternHit:
    """Holds information from a hit against a Prosite pattern.

    Members:
    name           ID of the record.  e.g. ADH_ZINC
    accession      e.g. PS00387
    pdoc           ID of the PROSITE DOCumentation.
    description    Free-format description.
    matches        List of tuples (start, end, sequence) where
                   start and end are indexes of the match, and sequence is
                   the sequence matched.

    """
    def __init__(self):
        self.name = None
        self.accession = None
        self.pdoc = None
        self.description = None
        self.matches = []

    def __str__(self):
        lines = []
        lines.append("%s %s %s" % (self.accession, self.pdoc, self.name))
        lines.append(self.description)
        lines.append('')
        if len(self.matches) > 1:
            lines.append("Number of matches: %s" % len(self.matches))
        for i in range(len(self.matches)):
            start, end, seq = self.matches[i]
            range_str = "%d-%d" % (start, end)
            if len(self.matches) > 1:
                lines.append("%7d %10s %s" % (i+1, range_str, seq))
            else:
                lines.append("%7s %10s %s" % (' ', range_str, seq))
        return "\n".join(lines)


class Dictionary:
    """Accesses a Prosite file using a dictionary interface.

    """
    __filename_key = '__filename'

    def __init__(self, indexname, parser=None):
        """__init__(self, indexname, parser=None)

        Open a Prosite Dictionary.  indexname is the name of the
        index for the dictionary.  The index should have been created
        using the index_file function.  parser is an optional Parser
        object to change the results into another form.  If set to None,
        then the raw contents of the file will be returned.

        """
        self._index = Index.Index(indexname)
        self._handle = open(self._index[Dictionary.__filename_key])
        self._parser = parser

    def __len__(self):
        return len(self._index)

    def __getitem__(self, key):
        start, length = self._index[key]
        self._handle.seek(start)
        data = self._handle.read(length)
        if self._parser is not None:
            return self._parser.parse(File.StringHandle(data))
        return data

    def __getattr__(self, name):
        return getattr(self._index, name)

class RecordParser(AbstractParser):
    """Parses Prosite data into a Record object.

    """
    def __init__(self):
        self._scanner = _Scanner()
        self._consumer = _RecordConsumer()

    def parse(self, handle):
        self._scanner.feed(handle, self._consumer)
        return self._consumer.data

class _Scanner:
    """Scans Prosite-formatted data.

    Tested with:
    Release 15.0, July 1998

    """
    def feed(self, handle, consumer):
        """feed(self, handle, consumer)

        Feed in Prosite data for scanning.  handle is a file-like
        object that contains Prosite data.  consumer is a
        Consumer object that will receive events as the report is scanned.

        """
        if isinstance(handle, File.UndoHandle):
            uhandle = handle
        else:
            uhandle = File.UndoHandle(handle)

        consumer.finished = False
        while not consumer.finished:
            line = uhandle.peekline()
            if not line:
                break
            elif is_blank_line(line):
                # Skip blank lines between records
                uhandle.readline()
                continue
            elif line[:2] == 'ID':
                self._scan_record(uhandle, consumer)
            elif line[:2] == 'CC':
                self._scan_copyrights(uhandle, consumer)
            else:
                raise ValueError("There doesn't appear to be a record")

    def _scan_copyrights(self, uhandle, consumer):
        consumer.start_copyrights()
        self._scan_line('CC', uhandle, consumer.copyright, any_number=1)
        self._scan_terminator(uhandle, consumer)
        consumer.end_copyrights()

    def _scan_record(self, uhandle, consumer):
        consumer.start_record()
        for fn in self._scan_fns:
            fn(self, uhandle, consumer)

            # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before
            # the 3D lines, instead of the other way around.
            # Thus, I'll give the 3D lines another chance after the DO lines
            # are finished.
            if fn is self._scan_do.im_func:
                self._scan_3d(uhandle, consumer)
        consumer.end_record()

    def _scan_line(self, line_type, uhandle, event_fn,
                   exactly_one=None, one_or_more=None, any_number=None,
                   up_to_one=None):
        # Callers must set exactly one of exactly_one, one_or_more, or
        # any_number to a true value.  I do not explicitly check to
        # make sure this function is called correctly.

        # This does not guarantee any parameter safety, but I
        # like the readability.  The other strategy I tried was to have
        # parameters min_lines, max_lines.

        if exactly_one or one_or_more:
            read_and_call(uhandle, event_fn, start=line_type)
        if one_or_more or any_number:
            while 1:
                if not attempt_read_and_call(uhandle, event_fn,
                                             start=line_type):
                    break
        if up_to_one:
            attempt_read_and_call(uhandle, event_fn, start=line_type)

    def _scan_id(self, uhandle, consumer):
        self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)

    def _scan_ac(self, uhandle, consumer):
        self._scan_line('AC', uhandle, consumer.accession, exactly_one=1)

    def _scan_dt(self, uhandle, consumer):
        self._scan_line('DT', uhandle, consumer.date, exactly_one=1)

    def _scan_de(self, uhandle, consumer):
        self._scan_line('DE', uhandle, consumer.description, exactly_one=1)

    def _scan_pa(self, uhandle, consumer):
        self._scan_line('PA', uhandle, consumer.pattern, any_number=1)

    def _scan_ma(self, uhandle, consumer):
        self._scan_line('MA', uhandle, consumer.matrix, any_number=1)
##        # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15
##        # contain a CC line buried within an 'MA' line.  Need to check
##        # for that.
##        while 1:
##            if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'):
##                line1 = uhandle.readline()
##                line2 = uhandle.readline()
##                uhandle.saveline(line2)
##                uhandle.saveline(line1)
##                if line1[:2] == 'CC' and line2[:2] == 'MA':
##                    read_and_call(uhandle, consumer.comment, start='CC')
##                else:
##                    break

    def _scan_pp(self, uhandle, consumer):
        # New PP line, PostProcessing, just after the MA line
        self._scan_line('PP', uhandle, consumer.postprocessing, any_number=1)

    def _scan_ru(self, uhandle, consumer):
        self._scan_line('RU', uhandle, consumer.rule, any_number=1)

    def _scan_nr(self, uhandle, consumer):
        self._scan_line('NR', uhandle, consumer.numerical_results,
                        any_number=1)

    def _scan_cc(self, uhandle, consumer):
        self._scan_line('CC', uhandle, consumer.comment, any_number=1)

    def _scan_dr(self, uhandle, consumer):
        self._scan_line('DR', uhandle, consumer.database_reference,
                        any_number=1)

    def _scan_3d(self, uhandle, consumer):
        self._scan_line('3D', uhandle, consumer.pdb_reference,
                        any_number=1)

    def _scan_pr(self, uhandle, consumer):
        # New PR line, ProRule, between 3D and DO lines
        self._scan_line('PR', uhandle, consumer.prorule, any_number=1)

    def _scan_do(self, uhandle, consumer):
        self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)

    def _scan_terminator(self, uhandle, consumer):
        self._scan_line('//', uhandle, consumer.terminator, exactly_one=1)

    # This is a list of scan functions in the order expected in the file.
    # The function definitions define how many times each line type is
    # expected (or if optional):
    _scan_fns = [
        _scan_id,
        _scan_ac,
        _scan_dt,
        _scan_de,
        _scan_pa,
        _scan_ma,
        _scan_pp,
        _scan_ru,
        _scan_nr,
        _scan_cc,

        # This is a really dirty hack, and should be fixed properly at
        # some point.  ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309
        # in Rel 17 have lines out of order.  Thus, I have to rescan
        # these, which decreases performance.
        _scan_ma,
        _scan_nr,
        _scan_cc,

        _scan_dr,
        _scan_3d,
        _scan_pr,
        _scan_do,
        _scan_terminator,
    ]

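The exactly_one / one_or_more / any_number / up_to_one flags above express line-cardinality constraints on a stream of prefixed lines. The same idea can be sketched self-containedly with a plain list standing in for the UndoHandle (scan_line here is illustrative, not Biopython API):

```python
def scan_line(lines, prefix, collect,
              exactly_one=False, one_or_more=False,
              any_number=False, up_to_one=False):
    # Consume leading lines starting with prefix, enforcing how many are
    # allowed, and pass each consumed line to collect().
    def attempt():
        if lines and lines[0].startswith(prefix):
            collect(lines.pop(0))
            return True
        return False
    if exactly_one or one_or_more:
        if not attempt():
            raise ValueError("expected a %r line" % prefix)
    if one_or_more or any_number:
        while attempt():
            pass
    if up_to_one:
        attempt()

seen = []
lines = ["ID   ADH_ZINC; PATTERN.", "PA   G-H-E.", "PA   [GA]-x(2).", "//"]
scan_line(lines, "ID", seen.append, exactly_one=True)  # exactly one ID line
scan_line(lines, "PA", seen.append, any_number=True)   # zero or more PA lines
print(len(seen))   # 3
print(lines)       # ['//']
```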
class _RecordConsumer(AbstractConsumer):
    """Consumer that converts a Prosite record to a Record object.

    Members:
    data    Record with Prosite data.

    """
    def __init__(self):
        self.data = None

    def start_record(self):
        self.data = Record()

    def end_record(self):
        self._clean_record(self.data)

    def identification(self, line):
        cols = line.split()
        if len(cols) != 3:
            raise ValueError("I don't understand identification line\n%s" \
                             % line)
        self.data.name = self._chomp(cols[1])    # don't want ';'
        self.data.type = self._chomp(cols[2])    # don't want '.'

    def accession(self, line):
        cols = line.split()
        if len(cols) != 2:
            raise ValueError("I don't understand accession line\n%s" % line)
        self.data.accession = self._chomp(cols[1])

    def date(self, line):
        uprline = line.upper()
        cols = uprline.split()

        # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE'
        if cols[2] != '(CREATED);' or \
           cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \
           cols[7][:4] != '(INF' or cols[8] != 'UPDATE).':
            raise ValueError("I don't understand date line\n%s" % line)

        self.data.created = cols[1]
        self.data.data_update = cols[3]
        self.data.info_update = cols[6]

    def description(self, line):
        self.data.description = self._clean(line)

    def pattern(self, line):
        self.data.pattern = self.data.pattern + self._clean(line)

    def matrix(self, line):
        self.data.matrix.append(self._clean(line))

    def postprocessing(self, line):
        postprocessing = self._clean(line).split(";")
        self.data.postprocessing.extend(postprocessing)

    def rule(self, line):
        self.data.rules.append(self._clean(line))

    def numerical_results(self, line):
        cols = self._clean(line).split(";")
        for col in cols:
            if not col:
                continue
            qual, data = [word.lstrip() for word in col.split("=")]
            if qual == '/RELEASE':
                release, seqs = data.split(",")
                self.data.nr_sp_release = release
                self.data.nr_sp_seqs = int(seqs)
            elif qual == '/FALSE_NEG':
                self.data.nr_false_neg = int(data)
            elif qual == '/PARTIAL':
                self.data.nr_partial = int(data)
            elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']:
                m = re.match(r'(\d+)\((\d+)\)', data)
                if not m:
                    raise Exception("Broken data %s in comment line\n%s" \
                                    % (repr(data), line))
                hits = tuple(map(int, m.groups()))
                if qual == "/TOTAL":
                    self.data.nr_total = hits
                elif qual == "/POSITIVE":
                    self.data.nr_positive = hits
                elif qual == "/UNKNOWN":
                    self.data.nr_unknown = hits
                elif qual == "/FALSE_POS":
                    self.data.nr_false_pos = hits
            else:
                raise ValueError("Unknown qual %s in comment line\n%s" \
                                 % (repr(qual), line))

    def comment(self, line):
        # Expect CC lines like this:
        #   CC   /TAXO-RANGE=??EPV; /MAX-REPEAT=2;
        # Can (normally) split on ";" and then on "="
        cols = self._clean(line).split(";")
        for col in cols:
            if not col or col[:17] == 'Automatic scaling':
                # DNAJ_2 in Release 15 has a non-standard comment line:
                # CC   Automatic scaling using reversed database
                # Throw it away.  (Should I keep it?)
                continue
            if col.count("=") == 0:
                # Missing qualifier!  Can we recover gracefully?
                # For example, from Bug 2403, in PS50293 have:
                # CC /AUTHOR=K_Hofmann; N_Hulo
                continue
            qual, data = [word.lstrip() for word in col.split("=")]
            if qual == '/TAXO-RANGE':
                self.data.cc_taxo_range = data
            elif qual == '/MAX-REPEAT':
                self.data.cc_max_repeat = data
            elif qual == '/SITE':
                pos, desc = data.split(",")
                self.data.cc_site.append((int(pos), desc))
            elif qual == '/SKIP-FLAG':
                self.data.cc_skip_flag = data
            elif qual == '/MATRIX_TYPE':
                self.data.cc_matrix_type = data
            elif qual == '/SCALING_DB':
                self.data.cc_scaling_db = data
            elif qual == '/AUTHOR':
                self.data.cc_author = data
            elif qual == '/FT_KEY':
                self.data.cc_ft_key = data
            elif qual == '/FT_DESC':
                self.data.cc_ft_desc = data
            elif qual == '/VERSION':
                self.data.cc_version = data
            else:
                raise ValueError("Unknown qual %s in comment line\n%s" \
                                 % (repr(qual), line))

    def database_reference(self, line):
        refs = self._clean(line).split(";")
        for ref in refs:
            if not ref:
                continue
            acc, name, type = [word.strip() for word in ref.split(",")]
            if type == 'T':
                self.data.dr_positive.append((acc, name))
            elif type == 'F':
                self.data.dr_false_pos.append((acc, name))
            elif type == 'N':
                self.data.dr_false_neg.append((acc, name))
            elif type == 'P':
                self.data.dr_potential.append((acc, name))
            elif type == '?':
                self.data.dr_unknown.append((acc, name))
            else:
                raise ValueError("I don't understand type flag %s" % type)

    def pdb_reference(self, line):
        cols = line.split()
        for id in cols[1:]:  # get all but the '3D' col
            self.data.pdb_structs.append(self._chomp(id))

    def prorule(self, line):
        # Assume that each PR line can contain multiple ";" separated rules
        rules = self._clean(line).split(";")
        self.data.prorules.extend(rules)

    def documentation(self, line):
        self.data.pdoc = self._chomp(self._clean(line))

    def terminator(self, line):
        self.finished = True

    def _chomp(self, word, to_chomp='.,;'):
        # Remove the punctuation at the end of a word.
        if word[-1] in to_chomp:
            return word[:-1]
        return word

    def _clean(self, line, rstrip=1):
        # Clean up a line.
        if rstrip:
            return line[5:].rstrip()
        return line[5:]

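The numerical_results handler splits each NR line on ';', then on '=', and recognises hits(seqs) pairs with the regex (\d+)\((\d+)\). The same extraction in isolation (parse_nr is illustrative, not part of the module, and returns a dict rather than setting Record attributes):

```python
import re

def parse_nr(line):
    # Parse the payload of an NR line such as
    #   NR   /TOTAL=234(234); /FALSE_NEG=0;
    # into a dict of qualifier -> (hits, seqs) tuple or raw string.
    results = {}
    for chunk in line[5:].rstrip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        qual, data = chunk.split("=")
        m = re.match(r'(\d+)\((\d+)\)', data)
        if m:
            results[qual] = tuple(map(int, m.groups()))
        else:
            results[qual] = data
    return results

hits = parse_nr("NR   /TOTAL=234(234); /FALSE_NEG=0;")
print(hits['/TOTAL'])       # (234, 234)
print(hits['/FALSE_NEG'])   # 0
```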
def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
    """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
    list of PatternHit's

    Search a sequence for occurrences of Prosite patterns.  You can
    specify either a sequence in seq or a SwissProt/trEMBL ID or accession
    in id.  Only one of those should be given.  If exclude_frequent
    is true, then the patterns with the high probability of occurring
    will be excluded.

    """
    from Bio import ExPASy
    if (seq and id) or not (seq or id):
        raise ValueError("Please specify either a sequence or an id")
    handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
    return _extract_pattern_hits(handle)

def _extract_pattern_hits(handle):
    """_extract_pattern_hits(handle) -> list of PatternHit's

    Extract hits from a web page.  Raises a ValueError if there
    was an error in the query.

    """
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.hits = []
            self.broken_message = 'Some error occurred'
            self._in_pre = 0
            self._current_hit = None
            self._last_found = None   # Save state of parsing

        def handle_data(self, data):
            if data.find('try again') >= 0:
                self.broken_message = data
                return
            elif data == 'illegal':
                self.broken_message = 'Sequence contains illegal characters'
                return
            if not self._in_pre:
                return
            elif not data.strip():
                return
            if self._last_found is None and data[:4] == 'PDOC':
                self._current_hit.pdoc = data
                self._last_found = 'pdoc'
            elif self._last_found == 'pdoc':
                if data[:2] != 'PS':
                    raise ValueError("Expected accession but got:\n%s" % data)
                self._current_hit.accession = data
                self._last_found = 'accession'
            elif self._last_found == 'accession':
                self._current_hit.name = data
                self._last_found = 'name'
            elif self._last_found == 'name':
                self._current_hit.description = data
                self._last_found = 'description'
            elif self._last_found == 'description':
                m = re.findall(r'(\d+)-(\d+) (\w+)', data)
                for start, end, seq in m:
                    self._current_hit.matches.append(
                        (int(start), int(end), seq))

        def do_hr(self, attrs):
            # <HR> inside a <PRE> section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None

        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken

        def end_pre(self):
            self._in_pre = 0

    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError(p.broken_message)
    return p.hits


def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the dictionary.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the id name will be used.

    """
    import os
    if not os.path.exists(filename):
        raise ValueError("%s does not exist" % filename)

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename

    handle = open(filename)
    records = parse(handle)
    end = 0L
    for record in records:
        start = end
        end = long(handle.tell())
        length = end - start

        if rec2key is not None:
            key = rec2key(record)
        else:
            key = record.name

        if not key:
            raise KeyError("empty key was produced")
        elif key in index:
            raise KeyError("duplicate key %s found" % key)

        index[key] = start, length
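index_file stores each record's (start, length) byte span so that Dictionary.__getitem__ can later seek and read the raw record back. The offset scheme itself can be demonstrated without Bio.Index (build_index and fetch are hypothetical stand-ins using a plain dict and an in-memory file):

```python
from io import BytesIO

def build_index(handle):
    # Map each record's ID to a (start, length) byte span, where records
    # run from the current offset up to and including the '//' line.
    index = {}
    start = handle.tell()
    name = None
    for line in iter(handle.readline, b''):
        if name is None and line.startswith(b'ID'):
            name = line.split()[1].rstrip(b';')
        if line.startswith(b'//'):
            end = handle.tell()
            index[name] = (start, end - start)
            start, name = end, None
    return index

def fetch(handle, index, key):
    # Retrieve one record by seeking to its stored span.
    start, length = index[key]
    handle.seek(start)
    return handle.read(length)

data = b"ID   A1; PATTERN.\n//\nID   B2; PATTERN.\n//\n"
h = BytesIO(data)
idx = build_index(h)
print(fetch(h, idx, b'B2'))   # the second record, as bytes
```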