Jeff’s Brain Dump

Sometimes the first duty of intelligent men is the restatement of the obvious.

Parsing BookMooch’s Asins.xml with a SAX parser

Posted by Jeff May 01, 2007

I’ve been playing with BookMooch’s API recently. They have data files for:

  • Inventory: how many copies of each book are moochable.
  • Wishlists: how many people want each book.
  • ASINs: full details for each book.

My initial goal is to compute the availability (inventory count minus wishlist count) of each book. Examples:

Availability Title
-210 Omnivore’s Dilemma
-172 The God Delusion
115 The Da Vinci Code
122 Jurassic Park


Omnivore’s Dilemma is in heavy demand; Jurassic Park is a stale meme. The value of a book decreases over time; it makes sense to trade in current books while you can.
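
The availability figure is a per-book subtraction once the two counts are keyed by the same ID. Here is a minimal sketch, assuming both data files have already been loaded into dicts; the IDs and counts are made up for illustration and the helper name availability is hypothetical, not part of BookMooch's API.

```python
# Hypothetical counts keyed by book ID; the real values would come from
# BookMooch's inventory and wishlist data files.
inventory = {"isbn-a": 3, "isbn-b": 40}
wishlist = {"isbn-a": 25, "isbn-c": 7}


def availability(inventory, wishlist):
    """Return {book_id: copies on offer minus people who want the book}."""
    book_ids = set(inventory) | set(wishlist)
    # A book missing from one file counts as zero on that side.
    return {b: inventory.get(b, 0) - wishlist.get(b, 0) for b in book_ids}


avail = availability(inventory, wishlist)
# List the most in-demand (most negative) books first.
for book_id in sorted(avail, key=avail.get):
    print(avail[book_id], book_id)
```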

ASINS.xml is 983MB; a DOM parser would build the whole tree in memory, which is far too much. A streaming SAX parser is needed to handle a file this size.

My requirement is to produce a CSV file mapping ISBN to title. A pickled version of the same mapping as a Python dict would also be useful.
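
For reference, the two outputs could be written as below. This is a sketch in Python 3, not the code this post uses: it swaps in the standard csv module, which quotes titles that contain commas, and it assumes UTF-8 for the CSV file. The function name save_titles and the sample entry are hypothetical.

```python
import csv
import os
import pickle
import tempfile


def save_titles(titles, csv_path, pickle_path):
    """Write an {isbn: title} dict as a CSV file and as a pickle."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for isbn, title in titles.items():
            writer.writerow([isbn, title])
    # Pickle data is binary, so the file is opened with "wb".
    with open(pickle_path, "wb") as p:
        pickle.dump(titles, p)


# Made-up entry; the comma in the title is there to exercise CSV quoting.
sample = {"0000000000": "Example, with a comma"}
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "isbns.txt")
    pickle_path = os.path.join(tmp, "isbns.pickle")
    save_titles(sample, csv_path, pickle_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    with open(pickle_path, "rb") as p:
        restored = pickle.load(p)
```

Reading the file back with csv.reader returns the title as a single field even though it contains a comma, which plain string concatenation would not guarantee.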

ASIN Detail

See example here.

SAX ContentHandler

A ContentHandler is supplied as a callback. The parser calls startElement, characters, and endElement as it walks the XML input stream. Since the id element repeats, examining the tag name alone is insufficient to know the location in the tree. Instead, keeping a list of the containing elements makes sense:

class asinHandler(ContentHandler):
    def __init__(self):
        self.curElements=[] # Will have the path to the current location.
    def startElement(self, name, attrs):
        self.curElements.append(name)
    def endElement(self, name):
        self.curElements.pop()
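
To see what the element list holds at each step, here is a small Python 3 sketch that records the path at every start tag. The sample document is made up: the root name asins and the nested Related element are assumptions chosen only to show an id appearing at two depths, and the class name PathRecorder is hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PathRecorder(ContentHandler):
    """Records the path from the root to each element as it opens."""

    def __init__(self):
        super().__init__()
        self.curElements = []  # Path to the current location.
        self.paths = []

    def startElement(self, name, attrs):
        self.curElements.append(name)
        self.paths.append("/".join(self.curElements))

    def endElement(self, name):
        self.curElements.pop()


# Made-up sample; the real file's layout may differ.
SAMPLE = (
    b"<asins><asin><id>1</id><Title>T</Title>"
    b"<Related><id>2</id></Related></asin></asins>"
)

handler = PathRecorder()
xml.sax.parseString(SAMPLE, handler)
for path in handler.paths:
    print(path)
```

The two id elements end up with different paths (asins/asin/id and asins/asin/Related/id), which is the information the tag name alone does not give.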

Capturing ID, Title

My only interest is in the title and ID of each book.
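    def characters(self,ch):
        # The parser may deliver one text node in several chunks, so accumulate.
        if len(self.curElements) == 3:
            if self.curElements[2] == 'id':
                self.isbn = self.isbn + ch
            elif self.curElements[2] == 'Title':
                self.title = self.title + ch
    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == 'asin':
            # A new book starts: reset the accumulators.
            self.isbn = ''
            self.title = ''
    def endElement(self, name):
        if name == 'asin':
            # The book is complete: hand it to the recorder.
            self.br.record(self.isbn.strip(), self.title.strip())
        self.curElements.pop()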
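
The capture logic can be tried on a few lines of XML before running it against the full file. The following is a Python 3 sketch under stated assumptions: the sample document (root name asins, a nested Related element) is invented, and the class TitleCollector stores results in a dict instead of calling the recorder used in the full program.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Collects {id: title} for every asin element in the stream."""

    def __init__(self):
        super().__init__()
        self.curElements = []  # Path to the current location.
        self.books = {}
        self.isbn = ""
        self.title = ""

    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == "asin":
            self.isbn = ""
            self.title = ""

    def characters(self, ch):
        # Only text directly inside root/asin/id or root/asin/Title counts.
        if len(self.curElements) == 3:
            if self.curElements[2] == "id":
                self.isbn += ch
            elif self.curElements[2] == "Title":
                self.title += ch

    def endElement(self, name):
        if name == "asin":
            self.books[self.isbn.strip()] = self.title.strip()
        self.curElements.pop()


# Made-up sample; the nested id sits at depth 4 and should be ignored.
SAMPLE = b"""<asins>
  <asin><id> 111 </id><Title>First &amp; Last</Title>
    <Related><id>999</id></Related></asin>
  <asin><id>222</id><Title>Second</Title></asin>
</asins>"""

collector = TitleCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.books)
```

The title containing an entity is a useful test case, since a parser is free to deliver such text to characters in more than one call; appending to the running string handles that.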

Full Program

# IN: asins_fixed.xml
# OUT: isbns.txt, isbns.pickle

# http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
import codecs,pickle
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
BOOKNUM = 0

class bookRecorder:
    def __init__(self):
        # *Very Important* to open in the right encoding!
        self.f = codecs.open('isbns.txt', 'w', 'iso-8859-1')
        self.dict = {}
    def record(self,isbn,title):
        self.f.write(unicode(isbn + ',' + title + '\r\n'))
        self.dict[isbn] = title
    def close(self):
        self.f.close()
        # Pickle data should be written in binary mode.
        p = open('isbns.pickle', 'wb', 200000)
        pickle.dump(self.dict, p)
        p.close()

class asinHandler(ContentHandler):
    def __init__(self):
        self.br=bookRecorder()
        self.curElements=[] # Will have the path to the current location.
    def characters(self,ch):
        if len(self.curElements) == 3:
            if self.curElements[2] == 'id':
                self.isbn = self.isbn + ch
            elif self.curElements[2] == 'Title':
                self.title = self.title + ch
    def close(self):
        self.br.close()
    def startElement(self, name, attrs):
        self.curElements.append(name)
        if name == 'asin':
            self.isbn = ''
            self.title = ''
            global BOOKNUM
            BOOKNUM = BOOKNUM + 1
            # Progress report every 5000 books.
            if BOOKNUM % 5000 == 0:
                print BOOKNUM
    def endElement(self, name):
        if name == 'asin':
            self.br.record(self.isbn.strip(), self.title.strip())
        self.curElements.pop()

parser = make_parser()  
curHandler = asinHandler()
parser.setContentHandler(curHandler)
parser.parse(open('asins_fixed.xml'))
curHandler.close()
