Parsing BookMooch’s Asins.xml with a SAX parser
Posted by Jeff May 01, 2007
I’ve been playing with BookMooch’s API recently. They have data files for:
- Inventory: how many copies of each book are moochable.
- Wishlists: how many people want each book.
- ASINs: full details for each book.
My initial goal is the availability of each book: its inventory count minus its wishlist count. Examples:
| Availability | Title |
| -210 | Omnivore’s Dilemma |
| -172 | The God Delusion |
| 115 | The Da Vinci Code |
| 122 | Jurassic Park |
Omnivore’s Dilemma is in heavy demand; Jurassic Park is a stale meme. The value of a book decreases over time; it makes sense to trade in current books while you can.
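The availability figure is plain subtraction per book. A minimal sketch in Python 3, assuming the inventory and wishlist files have already been loaded into two dicts keyed by ISBN. The function name, the dict names, and the sample ISBNs and counts are all made up for illustration; they are not BookMooch data.

```python
def availability(inventory, wishlists):
    """Return {isbn: copies available - people wanting it}.

    A book missing from one dict is treated as having a count of 0 there.
    """
    isbns = set(inventory) | set(wishlists)
    return {isbn: inventory.get(isbn, 0) - wishlists.get(isbn, 0)
            for isbn in isbns}


# Hypothetical sample data: ISBN -> count.
inventory = {'0000000001': 3, '0000000002': 120}
wishlists = {'0000000001': 50, '0000000003': 7}

scores = availability(inventory, wishlists)

# Most wanted first (most negative availability).
for isbn, score in sorted(scores.items(), key=lambda item: item[1]):
    print(isbn, score)
```

Sorting by the score puts the books in heaviest demand at the top, which is the ordering used in the table above.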
ASINS.xml is 983MB. A DOM parser builds the whole document tree in memory, which is far too much for a file this size. A SAX parser streams through the file instead and hands each element to a callback as it goes.
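To make the streaming point concrete, here is a small Python 3 sketch that feeds a document to the SAX parser one piece at a time and only keeps a counter. The element names (asins, asin, id, Title) are my assumption about the file layout, and the generated document is made up; the real file would be read from disk in chunks instead.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Counts <asin> elements and keeps nothing else in memory."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        if name == 'asin':
            self.count += 1


def chunks(n_books):
    """Yield a made-up document piece by piece, like reading a big file."""
    yield '<asins>'
    for i in range(n_books):
        yield '<asin><id>%d</id><Title>Book %d</Title></asin>' % (i, i)
    yield '</asins>'


handler = CountingHandler()
parser = make_parser()
parser.setContentHandler(handler)
for piece in chunks(1000):
    parser.feed(piece)   # the parser is handed one piece at a time
parser.close()

print(handler.count)
```

Because the handler stores only an integer, memory use does not grow with the number of books.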
My requirements are to produce a CSV file mapping ISBN to Title. A pickled version of a Python dict would also be useful.
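As a rough sketch of those two outputs in Python 3, the snippet below writes a small mapping to CSV and to a pickle, then reads both back. The sample ISBNs and titles are invented, the files go to a temporary directory, and UTF-8 is my choice of encoding here. It uses the standard csv module so that titles containing commas survive the round trip.

```python
import csv
import os
import pickle
import tempfile

# Hypothetical sample mapping; a real run would fill this from the parser.
titles = {
    '0000000001': 'A Plain Title',
    '0000000002': 'Title, With a Comma',
}

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'isbns.txt')
pickle_path = os.path.join(out_dir, 'isbns.pickle')

# The csv module quotes fields that contain commas or quotes.
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for isbn, title in sorted(titles.items()):
        writer.writerow([isbn, title])

# Pickle is a binary format, so the file is opened in 'wb' mode.
with open(pickle_path, 'wb') as f:
    pickle.dump(titles, f)

# Read both files back to check that nothing was lost.
with open(csv_path, newline='', encoding='utf-8') as f:
    from_csv = dict(csv.reader(f))
with open(pickle_path, 'rb') as f:
    from_pickle = pickle.load(f)

print(from_csv == titles, from_pickle == titles)
```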
ASIN Detail
See example here.
SAX ContentHandler
A ContentHandler is supplied as a callback. The parser calls startElement, characters, and endElement as it walks the XML input stream. Since an id element can appear in more than one place, examining the tag name alone is insufficient to know the location in the tree. Instead, a list of containing elements makes sense:
from xml.sax.handler import ContentHandler

class asinHandler(ContentHandler):
    def __init__(self):
        self.curElements = []  # Will have the path to the current location.

    def startElement(self, name, attrs):
        self.curElements.append(name)

    def endElement(self, name):
        self.curElements.pop()
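Here is a small Python 3 sketch of why the path matters. The sample document is made up, and the Related element is purely hypothetical; it only exists to put a second id at a different depth. The handler records the full path each time an id element starts.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class PathHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.curElements = []   # path from the root to the current element
        self.id_paths = []      # every path at which an <id> starts

    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == 'id':
            self.id_paths.append('/'.join(self.curElements))

    def endElement(self, name):
        self.curElements.pop()


# Made-up document: <id> appears at two different depths.
SAMPLE = b"""<asins>
  <asin>
    <id>0000000001</id>
    <Title>Example</Title>
    <Related><id>0000000002</id></Related>
  </asin>
</asins>"""

handler = PathHandler()
parseString(SAMPLE, handler)
print(handler.id_paths)
print(handler.curElements)
```

The two id elements share a tag name but have different paths, and the list is empty again once the document has been fully parsed.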
Capturing ID, Title
My only interest is in the title and ID of each book.
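class asinHandler(ContentHandler):
    # __init__ as before; self.br is the bookRecorder from the full program.

    def characters(self, ch):
        # Depth 3 is root / asin / field. Text can arrive in several
        # calls, so append rather than assign.
        if len(self.curElements) == 3:
            if self.curElements[2] == 'id':
                self.isbn = self.isbn + ch
            elif self.curElements[2] == 'Title':
                self.title = self.title + ch

    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == 'asin':
            # A new book: reset the accumulators.
            self.isbn = ''
            self.title = ''

    def endElement(self, name):
        if name == 'asin':
            self.br.record(self.isbn.strip(), self.title.strip())
        self.curElements.pop()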
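A SAX parser is allowed to deliver the text of one element in several characters calls, which is why the handler appends to self.isbn and self.title instead of assigning. The Python 3 sketch below shows this in two ways: first by calling the callbacks by hand with a title split in two, then by parsing a made-up document. TitleHandler is a stand-in that collects pairs in a list rather than using the recorder class, and the ISBNs are invented.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class TitleHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.curElements = []
        self.records = []       # (isbn, title) pairs, in document order
        self.isbn = ''
        self.title = ''

    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == 'asin':
            self.isbn = ''
            self.title = ''

    def characters(self, ch):
        if len(self.curElements) == 3:
            if self.curElements[2] == 'id':
                self.isbn += ch
            elif self.curElements[2] == 'Title':
                self.title += ch

    def endElement(self, name):
        if name == 'asin':
            self.records.append((self.isbn.strip(), self.title.strip()))
        self.curElements.pop()


# 1. Drive the callbacks by hand, splitting the title into two chunks.
by_hand = TitleHandler()
by_hand.startElement('asins', {})
by_hand.startElement('asin', {})
by_hand.startElement('id', {})
by_hand.characters('0000000001')
by_hand.endElement('id')
by_hand.startElement('Title', {})
by_hand.characters('Jurassic ')
by_hand.characters('Park')
by_hand.endElement('Title')
by_hand.endElement('asin')
by_hand.endElement('asins')
print(by_hand.records)

# 2. Let the real parser do the same on a made-up document.
SAMPLE = b"""<asins>
  <asin>
    <id> 0000000002 </id>
    <Title>Cats &amp; Dogs</Title>
  </asin>
</asins>"""
parsed = TitleHandler()
parseString(SAMPLE, parsed)
print(parsed.records)
```

The strip() calls at the end of each asin remove the whitespace that surrounds the values inside their tags.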
Full Program
# IN: asins_fixed.xml
# OUT: isbns.txt, isbns.pickle
# http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
import codecs
import pickle
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

BOOKNUM = 0  # Number of asin records seen so far; used for progress output.
class bookRecorder:
    def __init__(self):
        # *Very Important* to open in the right encoding!
        self.f = codecs.open('isbns.txt', 'w', 'iso-8859-1')
        self.dict = {}

    def record(self, isbn, title):
        # The codecs writer encodes the text on the way out.
        self.f.write(isbn + ',' + title + '\r\n')
        self.dict[isbn] = title

    def close(self):
        self.f.close()
        # Pickle data is binary, so open the file in 'wb' mode.
        p = open('isbns.pickle', 'wb', 200000)
        pickle.dump(self.dict, p)
        p.close()
class asinHandler(ContentHandler):
    def __init__(self):
        self.br = bookRecorder()
        self.curElements = []  # Will have the path to the current location.

    def characters(self, ch):
        # Depth 3 is root / asin / field; text may arrive in several chunks.
        if len(self.curElements) == 3:
            if self.curElements[2] == 'id':
                self.isbn = self.isbn + ch
            elif self.curElements[2] == 'Title':
                self.title = self.title + ch

    def close(self):
        self.br.close()

    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == 'asin':
            self.isbn = ''
            self.title = ''
            # Print a progress count every 5000 books.
            global BOOKNUM
            BOOKNUM = BOOKNUM + 1
            if BOOKNUM % 5000 == 0:
                print(BOOKNUM)

    def endElement(self, name):
        if name == 'asin':
            self.br.record(self.isbn.strip(), self.title.strip())
        self.curElements.pop()
parser = make_parser()
curHandler = asinHandler()
parser.setContentHandler(curHandler)
# Open in binary mode so the parser, not Python, decodes the bytes.
parser.parse(open('asins_fixed.xml', 'rb'))
curHandler.close()