next*

a yahoo! thing

WebPath Goes Open Source

Web Central

At Yahoo!, I spend a fair amount of time on the Web. Not just wasting time on My Yahoo! and Digg, but trying to make sense of the manifold ways that sites expose structured (and not-so-structured) data. There isn't enough agreement about how a web page should describe, say, sports scores or movie showtimes or homebrew mead recipes. Different groups are working on approaches to this problem, including two technologies that get me fired up, microformats and RDFa, and now a third: WebPath.

WebPath is a little project of mine that provides a way to query the web of structured data hidden in messy markup, from the simplest query (like "does this page use microformats?") up to complex, web-spanning requests (like "can I get to a movie review within two clicks from here?").
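To give a feel for the simplest kind of question, here is a minimal sketch of a "does this page use microformats?" check. To be clear, this is not WebPath's code or API: it uses only html.parser from the Python standard library, and the function name uses_microformats and the short list of root class names are assumptions picked for illustration.

```python
from html.parser import HTMLParser

# Root class names for a few well-known microformats (hCard, hCalendar,
# hReview, hAtom). Illustrative and far from exhaustive.
MICROFORMAT_ROOTS = {"vcard", "vevent", "hreview", "hentry"}


class MicroformatSniffer(HTMLParser):
    """Collects microformat root classes seen in any start tag."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute can hold several space-separated names.
                self.found.update(MICROFORMAT_ROOTS.intersection(value.split()))


def uses_microformats(html):
    """Return a sorted list of microformat root classes found in html."""
    sniffer = MicroformatSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sorted(sniffer.found)


# Deliberately sloppy markup: unquoted attribute, unclosed tags.
messy = '<div class="vcard"><span class=fn>Ada<p>unclosed <b>tags</div>'
print(uses_microformats(messy))  # prints ['vcard']
```

Because the sniffer only looks at start tags, it never needs the page to be well formed, which is the whole point when the input is the real Web.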

The ideas behind WebPath were published at the XML 2007 conference, where lots of great conversations started. Now, more discussions are happening, since the full source code behind the project has been released under a BSD-style license. So, what makes WebPath worth looking at?

Regardless of what type of markup a site uses under the hood, my team -- the Structured Web Group -- looks for ways to pull that structured information into an index where it can be sliced and diced in ways that help searchers get instant gratification via immediate answers to their queries. To accomplish this, the team needed to get a feel for which approaches work best with particular sites. It turns out that as these requirements surfaced, a local Hack Day was coming up, providing the opportunity we needed to try some experimental approaches. In little more than 24 hours, the project progressed from scratches in a notebook to a working implementation.

WebPath at this point is a technical endeavor, suitable for those who enjoy command-line programming. It's written in Python. (It pretty much had to be, given the timeline of its development!) You could consider WebPath a kind of XPath engine, to use the W3C jargon, but its main focus is on the "Web" part of its name. WebPath includes access to a module that accepts messy and invalid pages--just like the ones all over the Web--and "tidies" them up into something easier to work with and analyze.
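The "tidy, then query" idea can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not WebPath's actual module or interface: the TidyParser class, the synthetic root element, and the links_in_reviews helper are all made up for this example, and the query uses the limited XPath subset that xml.etree.ElementTree supports rather than a full XPath engine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TidyParser(HTMLParser):
    """Builds a well-formed ElementTree from sloppy HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        self.builder.start("root", {})  # synthetic wrapper element

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close every element left open since the matching start tag.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("root")
        return self.builder.close()


def links_in_reviews(html):
    """Return hrefs of links found inside elements with class 'hreview'."""
    parser = TidyParser()
    parser.feed(html)
    tree = parser.finish()
    hrefs = []
    # Note: this predicate is an exact match on the whole class attribute.
    for review in tree.findall(".//*[@class='hreview']"):
        hrefs.extend(a.get("href") for a in review.findall(".//a[@href]"))
    return hrefs


messy = """
<div class=hreview>
  <p>Great film<br>
  <a href="/reviews/42">full review
</div>
<a href="/showtimes">showtimes</a>
"""
print(links_in_reviews(messy))  # prints ['/reviews/42']
```

The unclosed p and a elements get closed when the enclosing div ends, so the showtimes link correctly lands outside the review. A real tidying module handles far more cases (implied tags, nesting rules, encodings); the sketch only shows why cleaning up first makes path-style queries possible at all.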

What's next for WebPath as an open source project? Already developers are getting involved, using WebPath in their own projects, pitching in to offer new features, and, of course, helping fix bugs! I maintain a plain-text todo list of areas for immediate attention, but in the bigger picture I'd like to wire this up to Hadoop as a front end for truly web-scale queries. You too can help shape the future of this project. If you've taken a look at WebPath, I'd love to hear from you.

-m

Golden Web photo by Cyron.


mdubinko, February 13th, 2008 on 10:00 pm


On learning new programming languages

There's a lot of value in learning new programming languages even if you can't use them in your day job. It helps you think differently about the problems you face and gives you a new perspective on solving them.

The languages that interest me are:

  • D for a C++-like language
  • Lua for embedded object-oriented scripting
  • Ruby for web development and automation
  • Erlang for concurrent and distributed programming
  • OCaml for data processing

And Smalltalk for knowing what the grandmaster poobahs did back in the day.

To keep your chops up, try learning a new language every year.

www.pragmaticprogrammer.com/loty/

(http://www.pragprog.com/articles/designing-learning)

Learning something new doesn't mean you're rejecting what you already know and do...


About Next*

  • * Tasty bits of hacker goodness
  • * A steady stream of small delights
  • * Ideas, experiments and the people behind them

  • Brought to you by the folks at Yahoo! Brickhouse

  • Editor-at-small: Cynthia Johanson
  • Site design: Matt Fukuda
  • Backend heroics: Kevin Railsback

Next*... where the wildcards are.
Copyright © 2008 Yahoo! All rights reserved.