At Yahoo!, I spend a fair amount of time on the Web. Not just wasting time on My Yahoo! and Digg, but trying to make sense of the manifold ways that sites expose structured (and not-so-structured) data. There isn't enough agreement about how a web page should describe, say, sports scores or movie showtimes or
WebPath is a little project of mine that provides a way to query the web of structured data hidden in messy markup, from the simplest query (like "does this page use microformats?") up to complex, web-spanning requests (like "can I get to a movie review within two clicks from here?").
The ideas behind WebPath were published at the XML 2007 conference, where lots of great conversations started. Now, more discussions are happening, since the full source code behind the project has been released under a BSD-style license. So, what makes WebPath worth looking at?
Regardless of what type of markup a site uses under the hood, my team -- the Structured Web Group -- looks for ways to pull that structured information into an index where it can be sliced and diced in ways that help searchers get instant gratification via immediate answers to their queries. To accomplish this, the team needed to get a feel for which approaches work best with particular sites. It turns out that as these requirements surfaced, a local Hack Day was coming up, providing the opportunity we needed to try some experimental approaches. In little more than 24 hours, the project progressed from scratches in a notebook to a working implementation.
WebPath at this point is a technical endeavor, suitable for those who enjoy command-line programming. It's written in Python. (It pretty much had to be, given the timeline of its development!) You could consider WebPath a kind of XPath engine, to use the W3C jargon, but its main focus is on the "Web" part of its name. WebPath includes access to a module that accepts messy and invalid pages--just like the ones all over the Web--and "tidies" them up into something easier to work with and analyze.
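To make the "tidy it up, then query it" idea concrete, here is a minimal sketch that uses only Python's standard library. It is not WebPath's code or API: `TidyTreeBuilder`, `tidy`, `elements_with_class`, and `uses_microformats` are hypothetical names made up for illustration, and treating the class names `vcard`, `vevent`, and `hreview` as microformat markers is a simplifying assumption.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TidyTreeBuilder(HTMLParser):
    """Hypothetical helper: build an element tree from messy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root node
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        clean = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly
        # closing anything left open inside it. Stray end tags are ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def tidy(html):
    """Parse possibly invalid HTML and return the root element."""
    builder = TidyTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def elements_with_class(tree, name):
    """Return elements whose class attribute contains the token `name`."""
    return [el for el in tree.iter() if name in el.get("class", "").split()]


def uses_microformats(tree, markers=("vcard", "vevent", "hreview")):
    """Assumption: a page 'uses microformats' if any marker class appears."""
    return any(elements_with_class(tree, marker) for marker in markers)


# An invalid page: unclosed <div>, <span>, and <p>, and no </html>.
MESSY = """
<html><body>
<div class="vcard"><span class="fn">Jane Doe<br>
<p>Unclosed paragraph <a href="/reviews/1">review</a>
</body>
"""

tree = tidy(MESSY)
print(uses_microformats(tree))
print([a.get("href") for a in tree.findall(".//a[@href]")])
```

The last line relies on the limited XPath subset that `xml.etree.ElementTree` supports, which is enough for simple "find every link" queries. A full XPath engine would allow richer expressions, but the shape of the workflow stays the same: repair the markup first, then ask questions of the resulting tree.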
What's next for WebPath as an open source project? Already developers are getting involved, using WebPath in their own projects, pitching in to offer new features, and, of course, helping fix bugs! I maintain a plain-text todo list of areas for immediate attention, but in the bigger picture I'd like to wire this up to Hadoop as a front end for truly web-scale queries. You too can help shape the future of this project. If you've taken a look at WebPath, I'd love to hear from you.
-m
Golden Web photo by Cyron.
There's a lot of value in learning new programming languages, even if you can't use them in your day job. Each one helps you think differently about the problems you face and gives you a new perspective on solving them.
The languages which are interesting to me are:
And Smalltalk, for knowing what the grandmaster poobahs did back in the day.
To keep your chops up, try learning a new language every year.
www.pragmaticprogrammer.com/loty/
(http://www.pragprog.com/articles/designing-learning)
Learning something new doesn't mean you're rejecting what you already know and do...