next*

a yahoo! thing

WebPath Goes Open Source

Web Central

At Yahoo!, I spend a fair amount of time on the Web. Not just wasting time on My Yahoo! and Digg, but trying to make sense of the manifold ways that sites expose structured (and not-so-structured) data. There isn't enough agreement about how a web page should describe, say, sports scores or movie showtimes or homebrew mead recipes. Different groups are working on approaches to this problem, including two technologies that get me fired up, microformats and RDFa, and now a third: WebPath.

WebPath is a little project of mine that provides a way to query the web of structured data hidden in messy markup, from the simplest query (like "does this page use microformats?") up to complex, web-spanning requests (like "can I get to a movie review within two clicks from here?").
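To make the "within two clicks" idea concrete, here is a minimal Python sketch of a depth-limited breadth-first search over links. This is not WebPath code: the LINKS graph, the page names, and the reachable_within function are all made up for illustration, and a real tool would fetch and parse live pages rather than read a dictionary.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (illustrative data only).
LINKS = {
    "home": ["listings", "about"],
    "listings": ["movie-review", "showtimes"],
    "about": [],
    "movie-review": [],
    "showtimes": ["trailer"],
    "trailer": [],
}


def reachable_within(start, target, max_clicks, links=LINKS):
    """Return True if target is reachable from start in at most max_clicks hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, clicks = queue.popleft()
        if page == target:
            return True
        if clicks == max_clicks:
            continue  # click budget used up; do not follow links from here
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, clicks + 1))
    return False
```

Calling reachable_within("home", "movie-review", 2) follows home, then listings, then the review, so it fits the two-click budget; the made-up "trailer" page sits three clicks away and does not.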

The ideas behind WebPath were published at the XML 2007 conference, where lots of great conversations started. Now, more discussions are happening, since the full source code behind the project has been released under a BSD-style license. So, what makes WebPath worth looking at?

Regardless of what type of markup a site uses under the hood, my team -- the Structured Web Group -- looks for ways to pull that structured information into an index where it can be sliced and diced in ways that help searchers get instant gratification via immediate answers to their queries. To accomplish this, the team needed to get a feel for which approaches work best with particular sites. It turns out that as these requirements surfaced, a local Hack Day was coming up, providing the opportunity we needed to try some experimental approaches. In little more than 24 hours, the project progressed from scratches in a notebook to a working implementation.

WebPath at this point is a technical endeavor, suitable for those who enjoy command-line programming. It's written in Python. (It pretty much had to be, given the timeline of its development!) You could consider WebPath a kind of XPath engine, to use the W3C jargon, but its main focus is on the "Web" part of its name. WebPath includes access to a module that accepts messy and invalid pages--just like the ones all over the Web--and "tidies" them up into something easier to work with and analyze.
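For a rough feel of the "tidy, then query" idea, here is a small sketch that uses only the Python standard library. It is not WebPath's actual API or its tidy module: TidyBuilder, tidy, uses_microformats, and the MESSY sample are names and data I made up for illustration, and ElementTree only understands a limited subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TidyBuilder(HTMLParser):
    """Build an ElementTree from HTML, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text that follows a child goes in that child's tail
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def tidy(html):
    """Parse messy HTML into a well-formed element tree."""
    builder = TidyBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def uses_microformats(root, names=("vcard", "vevent", "hreview")):
    """True if any element carries one of the given microformat root classes."""
    for elem in root.iter():
        classes = elem.get("class", "").split()
        if any(name in classes for name in names):
            return True
    return False


# Illustrative messy input: unclosed span, p, b, body, and html tags.
MESSY = ('<html><body><div class="vcard"><span class="fn">Ada'
         '<p>unclosed <b>bold</div><a href="/r">review</a>')

root = tidy(MESSY)
links = [a.get("href") for a in root.findall(".//a[@href]")]
```

The design choice here is to close back to the nearest matching open tag whenever an end tag arrives, which is about the simplest recovery rule that still yields a queryable tree; a real tidy step handles many more cases.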

What's next for WebPath as an open source project? Already developers are getting involved, using WebPath in their own projects, pitching in to offer new features, and, of course, helping fix bugs! I maintain a plain-text todo list of areas for immediate attention, but in the bigger picture I'd like to wire this up to Hadoop as a front end for truly web-scale queries. You too can help shape the future of this project. If you've taken a look at WebPath, I'd love to hear from you.

-m

Golden Web photo by Cyron.


mdubinko, February 13th, 2008 on 10:00 pm

2 Responses »

    Comment by Miguel Fonseca — February 16th, 2008 at 11:41 am


    I’ve only read the article and didn’t look at the program itself, but from your description I don’t see much of a difference between WebPath and Web-Harvest ( http://web-harvest.sourceforge.net/ )…

    Comment by ciprian — February 16th, 2008 at 1:00 pm


    Interesting. Is this some sort of WebQL application that you can use to query a website's HTML output to get any possible structured data, like the results from an airline ticket search on any ticketing website?

