From:	 Evelyn Mitchell 
To:	 frpythoneers@lists.tummy.com
Subject: FRPythoneers Meeting Notes 2001-04-16
Date:	 Tue, 17 Apr 2001 10:10:27 -0600
Cc:	 claird@neosoft.com, lwn@lwn.net, webmaster@python.org

Front Range Pythoneers April Meeting 

10 in attendance

Mike Olson of Fourthought (www.fourthought.com) presented on 
Processing XML in Python. 

I'd heard a bit about XML before this presentation, but had never
programmed an XML application in any language. Mike covered the
available tools in Python for working with XML including
parsers, DOM (Document Object Model) tools, XSLT (Style sheet transformation) 
tools, XPath (a query language) tools, and RDF (Resource Description Framework, 
the semantic web) tools.

Fourthought has been developing XML tools for several years, and is
active in core Python development of XML tools through the XML-SIG.
All of the Fourthought tools described are available for download
from their website. They've developed a rather nice server (4Suite
Server) which stores XML and RDF parsed documents, allowing for
quick deployment of XML-based applications.

Most of the people in the audience had done some XML development,
so the questions were on target.  Mike was very well prepared,
with deep knowledge of both XML and Python.

While I was taking notes during the presentation, I was reminded
once again why I like Python. I was transcribing the code samples
on his slides, and I realized that in the few moments I had before
the discussion moved to the next topic, I was able to visually 
proof the code for correctness. I don't guarantee that any of the
code samples in my notes will actually run, but they might. 

Thanks Mike, for a great presentation! 

And, thanks to Ken for volunteering to help out with our (non-existent)
website! We'll be putting up a Zope site on www.fr.pythoneers.com
within the next few days.

Our next meeting will be at Abacus Direct in Interlocken in Broomfield
CO on May 21 at 6:30 (presentation starts at 7:15, arrive early as the 
doors are locked after 7). 

Mailing list information and archives are at 

  http://lists.tummy.com/mailman/listinfo/frpythoneers

Evelyn Mitchell
efm@tummy.com

XML 
  eXtensible Markup Language
  tag based
    everything needs a closing tag

SGML
  spawned both HTML and XML
  SGML is too difficult to parse
  XML is a more restricted format, easier to use

Included in Python 2.0
xmllib
  simple python parser

pyexpat
  wrapper around expat parser

minidom
  dom implementation

Other:
xmlProc
  validating parser
4DOM
  DOM
4XSLT
  XSLT processor
Redland and 4RDF
  RDF implementation
4XLink
  XLink implementation

Python has strong XML support in the language, and from third parties

pyxml.sourceforge.net

XML Parsers
 xmllib
   100% pure python
   uses regular expressions to parse XML
   user defines call-backs on a per element basis
   namespace support
   not validating

(Validating: checks that the document is well formed [has closing tags],
  is valid when compared against the DTD for that document, and uses the
  correct character set [expand entities])
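
The difference between "well formed" and "valid" can be sketched with the
expat parser, which checks well-formedness only and does not validate
against a DTD. This is a minimal sketch for a current Python 3 rather than
the Python 2.0 of the talk; the is_well_formed helper and the two sample
strings are hypothetical, not from the slides.

```python
from xml.parsers import expat

def is_well_formed(text):
    """Return True if expat parses text without raising a syntax error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True

# Hypothetical samples: the second never closes its ENTRY element.
GOOD = "<ADDRBOOK><ENTRY ID='pa'></ENTRY></ADDRBOOK>"
BAD = "<ADDRBOOK><ENTRY ID='pa'></ADDRBOOK>"

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

A document can pass this check and still be invalid against its DTD;
checking that takes a validating parser such as xmlProc.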

xmllib-ex.py -----------------------------------------------------------
import xmllib
class Parser(xmllib.XMLParser):
  def __init__(self):
    xmllib.XMLParser.__init__(self)
    self.elements = {'ENTRY':(self.start_entry, None),}

  def start_entry(self,attrs):
    print "Entry with ID=%s" % attrs['ID']

p = Parser()
p.feed(XML_SRC)
p.close()

-----------------------------------------------------------
Pyexpat
  James Clark's expat parser
  very fast
  handlers added as any python callable object

pyexpat-ex.py -----------------------------------------------------------
import string
from xml.parsers import expat

def start_element(name, attrs):
  print 'Start element:', name, attrs
def end_element(name):
  print 'End element:', name
def char_data(data):
  data = string.strip(data)
  if data: print 'Character data:', string.strip(data)

parser = expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element 
parser.CharacterDataHandler = char_data
parser.Parse(XML_SRC)
-----------------------------------------------------------

xmlProc
  only fully validating python parser
  Lars Marius Garshol

import string
from xml.parsers.xmlproc import xmlproc

class Application(xmlproc.Application):
  def handle_start_tag(self, name, attrs):
    print 'Start element:', name, attrs
  def handle_data(self, data, start, end):
    data = string.strip(data[start:end])
    if data: print 'Character data:', string.strip(data)
  def handle_end_tag(self, name):
    print 'End element:', name

builder = Application()
parser=xmlproc.XMLProcessor()
parser.set_application(builder)
parser.parse_resource('addr_book1.xml')
-----------------------------------------------------------

Simple API for XML 2 (SAX2)
written in python 
builds on top of pyexpat and xmlproc to provide SAX and SAX2 interfaces
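
A minimal sketch of the SAX callback style, using the xml.sax module in a
current Python 3 standard library rather than the 2001 XML-SIG package.
The XML_SRC sample and the EntryHandler class are hypothetical; only the
ADDRBOOK, ENTRY and ID names come from the slides.

```python
import xml.sax

# Hypothetical sample document, shaped like the address book in the talk.
XML_SRC = """<ADDRBOOK>
  <ENTRY ID="pa"><NAME>First Person</NAME></ENTRY>
  <ENTRY ID="en"><NAME>Second Person</NAME></ENTRY>
</ADDRBOOK>"""

class EntryHandler(xml.sax.ContentHandler):
    """Collect the ID attribute of every ENTRY element."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        # Called by the parser once for each opening tag.
        if name == "ENTRY":
            self.ids.append(attrs["ID"])

handler = EntryHandler()
xml.sax.parseString(XML_SRC.encode("utf-8"), handler)
print(handler.ids)
```

The handler never sees the whole document at once; it only reacts to
events as the underlying parser reports them.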

DOM Document Object Model
miniDOM, included in Python 2.0
4DOM, included in XML-SIG
pDomlette/cDomlette, included in 4Suite
  a DOM is like a UML class diagram for the document

minidom
  Paul Prescod
  lightweight, 'pythonic' DOM
  DOM XML and Core recommendations only
  children of the document are a python list

minidom.parse(stream, parser=None)
minidom.parseString(string, parser=None)
  returns an instance of the document interface
Node.toxml()
Node.toprettyxml(indent='\t', newl='\n')

import sys
from xml.dom import minidom, Node

dom = minidom.parseString(XML_SRC)
for c in dom.documentElement.childNodes:
  if c.nodeType == Node.ELEMENT_NODE:
    print "ID = %s" % c.getAttribute('ID')
if sys.hexversion < 0x2000000:
  dom.unlink()
-----------------------------------------------------------
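
The serialization calls listed above (toxml and toprettyxml) can be tried
with a short sketch like the following, written for a current Python 3.
The XML_SRC sample is hypothetical, and the two-space indent is just an
illustrative choice.

```python
from xml.dom import minidom, Node

# Hypothetical sample document, shaped like the address book in the talk.
XML_SRC = '<ADDRBOOK><ENTRY ID="pa"/><ENTRY ID="en"/></ADDRBOOK>'

dom = minidom.parseString(XML_SRC)

# The children of an element behave like a Python list.
ids = [c.getAttribute("ID")
       for c in dom.documentElement.childNodes
       if c.nodeType == Node.ELEMENT_NODE]
print(ids)

# toxml() writes the node back out as is; toprettyxml() adds
# indentation and newlines between child elements.
compact = dom.documentElement.toxml()
pretty = dom.documentElement.toprettyxml(indent="  ", newl="\n")
print(compact)
print(pretty)
```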
4DOM
written by Fourthought
Fully compliant DOM level II
  some 'pythonic' interfaces
DOM Core, XML, HTML, Partial Events, Traversal, and Partial Ranges

xml.dom.ext.reader.Sax
xml.dom.ext.reader.Sax2
xml.dom.ext.reader.PyExpat
xml.dom.ext.reader.HtmlLib

xml.dom.ext.Print(root, stream=sys.stdout, encoding='UTF-8')
xml.dom.ext.PrettyPrint(root, stream=sys.stdout, encoding='UTF-8', indent='  ')
xml.dom.ext.XHtmlPrettyPrint(root, stream=sys.stdout, encoding='UTF-8',
  indent=....)

import sys
from xml.dom.ext.reader import PyExpat
from xml.dom import Node

reader = PyExpat.Reader()
dom = reader.fromString(XML_SRC)
for c in dom.documentElement.childNodes:
  if c.nodeType == Node.ELEMENT_NODE:
    #PyExpat Reader is NS aware!
    print "ID = %s" % c.getAttributeNS('','ID')
if sys.hexversion < 0x2000000:
  reader.releaseNode(dom)

-----------------------------------------------------------

XPath
primarily used to select a set of nodes from a single document
a language which is used to select from a given context
build complex searches in steps
powerful and extensible filtering mechanism
one liners (no comments)

From Root:
  descendant::ENTRY
    --> All ENTRY elements in the document
From ADDRBOOK:
  child::ENTRY/ADDRESS[text() = '42 Spam Blvd']
    --> Returns ADDRESS of second ENTRY
From anywhere:
  //ENTRY[PHONENUM/@DESC = "Work"]
    --> Returns all ENTRY elements with a PHONENUM
       child that has a DESC attribute equal to "Work"
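
The three expressions above can be approximated with the limited XPath
subset in xml.etree.ElementTree, a standard library module in current
Python 3 that is not one of the tools from the talk and is not a full
XPath 1.0 engine. This is a sketch under assumptions: the XML_SRC sample
is hypothetical (the slides only say that the second ENTRY lives at
'42 Spam Blvd'), and the [.='text'] predicate needs Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; phone numbers and IDs are made up.
XML_SRC = """<ADDRBOOK>
  <ENTRY ID="pa">
    <ADDRESS>1 Example Way</ADDRESS>
    <PHONENUM DESC="Work">555-0100</PHONENUM>
  </ENTRY>
  <ENTRY ID="en">
    <ADDRESS>42 Spam Blvd</ADDRESS>
    <PHONENUM DESC="Home">555-0101</PHONENUM>
  </ENTRY>
</ADDRBOOK>"""

root = ET.fromstring(XML_SRC)  # root is the ADDRBOOK element

# descendant::ENTRY
all_entries = root.findall(".//ENTRY")

# child::ENTRY/ADDRESS[text() = '42 Spam Blvd']
spam = root.findall("ENTRY/ADDRESS[.='42 Spam Blvd']")

# //ENTRY[PHONENUM/@DESC = "Work"]
# ElementTree cannot test a child's attribute inside a predicate, so
# select the matching PHONENUM and step back up to its parent with '..'.
work = root.findall(".//ENTRY/PHONENUM[@DESC='Work']/..")

print([e.get("ID") for e in all_entries])
print([a.text for a in spam])
print([e.get("ID") for e in work])
```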

4XPath included in 4Suite
PyPath, included in XML-SIG (based on 4XPath)

4XPath
 written by Fourthought
 fully XPath 1.0 compliant

import sys
from xml.dom.ext.reader import PyExpat
from xml import xpath

reader = PyExpat.Reader()
dom = reader.fromString(XML_SRC)

#XPath expressions can be precompiled
#expr = xpath.Compile("/ADDRBOOK/ENTRY/@ID")

ids = xpath.Evaluate("/ADDRBOOK/ENTRY/@ID", contextNode=dom)
for id in ids:
  print "ID = %s" % id.value

if sys.hexversion < 0x2000000:
  reader.releaseNode(dom)

-----------------------------------------------------------

Extensible Stylesheet Language Transformations (XSLT)

Used to transform XML into different formats (HTML, XML, PDF, RTF, etc)
Matches nodes from a source document
Builds on the XPath language
Extremely extensible 
a real programming language

4XSLT included in 4Suite
Sablotron, written in C++ with Python wrapper
Xalan, python wrapper

4XSLT
written by Fourthought
Fully XSLT 1.0 compliant

Stream, Uri, String, Dom
  stylesheets and documents

import sys
from xml.xslt import Processor

processor = Processor.Processor()
processor.appendStylesheetUri('addr_book1.xsl')
res = processor.runUri('addr_book1.xml')
print res

-----------------------------------------------------------
Resource Description Framework (RDF)
A framework for making metadata statements about resources
A statement is a simple assertion of subject, predicate, object
RDF schemas allow for validation of statements
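
The subject, predicate, object idea can be sketched in plain Python with
no RDF library at all. This is not the 4RDF or Redland API: the Statement
tuple, the match helper and the example.org resource are hypothetical,
and only the two Dublin Core predicate URIs are real identifiers.

```python
from collections import namedtuple

# One RDF statement: a (subject, predicate, object) assertion.
Statement = namedtuple("Statement", ["subject", "predicate", "object"])

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"  # Dublin Core creator
DC_TITLE = "http://purl.org/dc/elements/1.1/title"      # Dublin Core title
DOC = "http://example.org/addr_book1.xml"               # hypothetical resource

statements = [
    Statement(DOC, DC_CREATOR, "Example Author"),
    Statement(DOC, DC_TITLE, "Address Book"),
]

def match(statements, subject=None, predicate=None, obj=None):
    """Return the statements that equal every argument that is not None."""
    return [s for s in statements
            if subject in (None, s.subject)
            and predicate in (None, s.predicate)
            and obj in (None, s.object)]

creators = [s.object for s in match(statements, predicate=DC_CREATOR)]
print(creators)
```

Leaving an argument as None makes it a wildcard, which is roughly the
kind of basic searching an RDF model offers over its statements.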

Redland RDF, C based by Dave Beckett, wrapped for Python
4RDF from Fourthought

4RDF
Model manipulation and basic searching
Serialization and deserialization of standard RDF
Schema support
Optional persistence through Postgres, Oracle and Shelve
RDF Inference Language

import sys
from xml.dom.ext.reader import PyExpat
from Ft.Rdf.Serializers.Dom import Serializer
from Ft.Rdf.Drivers import Memory
from xml.dom.ext import PrettyPrint
from Ft.Rdf import Model

mem = Memory.DbAdaptor('')
mem.begin()


4Suite Server
XML Server
Optimized XML processing for XSLT, XLink, XPath and XPointer
Automatic RDF Meta-data Management
Access to XML data through CORBA, raw HTTP, SOAP, WebDAV, GUI, and 
  command line
User and group based security
Distributed Transaction system
Event generation and trigger system