Text Ain’t Metadata

The headline text in that screen grab from the Seattle Times is trying really hard to be metadata and connect these two items based on shared topic coverage. Turns out, though, that the character string ‘3D’ can carry different meanings in different contexts, so matching items based on common text isn’t the slam dunk certainty that the UI folks from the Times would like it to be. If we want reliable indicators of relationships between items, we’re going to have to do some information modeling work.

Headlines from Seattle Times, Nation & World, Feb. 20, 2013
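To make the modeling point concrete, here is a toy sketch in Python. The item records and topic identifiers are entirely invented; the point is the contrast between matching on shared strings and matching on explicit topic assertions.

# Toy illustration: why shared strings aren't shared topics.
# The item records and topic identifiers below are invented for this example.

items = [
    {"id": 1, "headline": "Studio unveils 3D re-release of classic film",
     "topics": {"topic:film", "topic:stereoscopic-3d"}},
    {"id": 2, "headline": "Hobbyists print 3D replacement parts at home",
     "topics": {"topic:manufacturing", "topic:3d-printing"}},
]

def related_by_text(a, b):
    """Naive approach: any shared word makes two items 'related'."""
    return bool(set(a["headline"].lower().split()) & set(b["headline"].lower().split()))

def related_by_topic(a, b):
    """Modeled approach: items are related only if they share a topic identifier."""
    return bool(a["topics"] & b["topics"])

print(related_by_text(items[0], items[1]))   # True: both headlines contain the string "3D"
print(related_by_topic(items[0], items[1]))  # False: no shared topic, no claimed relationship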
Posted in Organization

Kickapoo Ed

Initial excitement over the Knowledge Graph seems to have calmed down somewhat. That’s probably not surprising, as the reality of the first pass on such things does tend to fall short of the promises made at announcement time. Turns out it doesn’t really harm anything to drop a dollop of Wikipedia content in the right column, although I’m not sure just what it adds.

The graph doesn’t seem to find much to link out to in the general run of subject queries. Honda? Nope. Solar battery charger? Nope, the right column is still showing products for sale on that one. Digital rights management? Nope, more ads. People and place names seem to trigger the graph UI more than anything else, which makes sense in terms of unambiguous identification and lack of saleable products that can make a $tronger claim to the screen real estate. But hey, each of those queries returns something in Wikipedia, so why couldn’t the graph find something to link to somewhere on the Google results page?

Kickapoo Ed Summers

OK, let’s toss it a softball and search for a person. Hey, who’s Kickapoo Ed? The relevance ranker picked the right Ed, which is to say the one I was thinking about when I entered the query, based on three of the top four results. That second one is a curiosity, though. I had never heard of the pitcher for the early 20th-century Tigers, and given his World Series record as reported in Wikipedia, maybe that’s not a big surprise, either. Then again, I’m happy enough to have learned that tidbit of history. But why would the relevance ranker guess along with me, while the knowledge graph shoots off on a tangent of its own?

Oh wait. Maybe it’s because Inkdroid Ed doesn’t have a Wikipedia page of his own (not yet, anyway). As much wrenching as he’s done on those pages, it seems he ought to be acknowledged in that venue. But nope again. Wikipedia itself goes straight to Kickapoo Ed without even a disambiguation alert. It seems the knowledge graph powerfully emphasizes the breadth of horizontal connections over the vertical depth of digging further into a particular concept. And who doesn’t love a little serendipitous discovery?

Posted in Noise, Organization, Signal, Speculation, Weight of information

Ascendance of the factoid

Filter bubbles?

Google has more to say about the Knowledge Graph and how it will finally fulfill the Star Trek promise of computers that will intently listen, perfectly understand, and obediently retrieve exactly what we want. Hell, the graph may even read our minds and give us what we want before we want it.

I don’t see much in that announcement, or in the happy video, about the problem of the computer telling us what we are allowed to want. Well, it tells us what we can have, and that sets boundaries on what we can decide to want, just because it’s all we know about. I wonder if that bouncy graphic at the top of the Google Blog is actually intended to represent filter bubbles, or is that just a happy coincidence?

I don’t mean to make fun (much) of this rather startling and inspiring achievement. It seems built around old-school blunt-force computing power, as I have mentioned, but it looks like a fine application with focused processing of absolutely massive data in service to users’ preferences, as best they can be determined in bulk and with minimal context. The graph crunches text strings in the context of user link behavior to discover entities and the relationships between them that users might want to understand. It assembles factoids to build Tinker Toy representations of users’ most likely mental structures in the moment. More than that, it audaciously tries to model the entire mess of knowledge in a structured way and make it all visible from a single point in the graph. Actually, more even than that, it tries for visibility from a single point into all the possible graphs that might intersect there.
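For what it’s worth, here is the shape I imagine that structure taking, shrunk to a toy. The entities, relation names, and evidence threshold are all mine, invented for illustration, and nothing Google has actually described.

from collections import defaultdict

# A hypothetical miniature of an entity graph: nodes are entities, edges carry a
# relationship type plus a count of supporting evidence (text co-occurrence, co-clicks).
# The entities, relation names, and the evidence threshold are invented for illustration.

EVIDENCE_THRESHOLD = 3

class EntityGraph:
    def __init__(self):
        # (subject, relation, object) -> number of observations supporting it
        self.edges = defaultdict(int)

    def observe(self, subject, relation, obj):
        """Record one piece of evidence linking two entities."""
        self.edges[(subject, relation, obj)] += 1

    def known_facts(self, subject):
        """Relationships with enough evidence behind them to be treated as facts."""
        return [(rel, obj) for (s, rel, obj), n in self.edges.items()
                if s == subject and n >= EVIDENCE_THRESHOLD]

g = EntityGraph()
for _ in range(5):
    g.observe("Honda", "is_a", "automaker")
g.observe("Honda", "is_a", "lawn mower brand")  # true enough, but thinly evidenced here

print(g.known_facts("Honda"))
# [('is_a', 'automaker')]: the thin edge stays below the threshold and out of the graph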

That sort of process will have to make some presumptions about user intentions. The new mini-summaries of likely user topics seem intended to pop the filter bubbles that result and enable nonlinear exploration — maybe they’ll help. These amount to small Wikipedia pages, assembled on the fly, presenting bits of fact in useful context that can lead people along the paths that earlier explorers have blazed through the same parts of the forest. I still see the probability of disintermediation of sources I called out in the March post, but I’m thinking now that it’s likely to affect other summarizers like Wikipedia more than solid, original sources, and I feel better about that outcome. If Google captures bigger chunks of the web’s ad revenue by facilitating access, more power to ‘em. Are they smart enough to keep linking out to those sources and feeding traffic to the geese laying their golden eggs? I sure hope so.

One risk, though, is the loss of complexity. As the graph summarizes the summaries so strongly favored by Google’s relevance ranks, it’s omitting detail from sources that already had omitted a lot of detail. Now maybe users will find their way to detail and complex interpretation via those exploration paths that Google is so earnestly trying to build. To the extent that the paths lead to true source materials, that seems likely; to the extent that they lead only to slightly less dumbed-down summaries, the risk is rising that users’ finite determination and attention span will be exhausted before they get someplace useful.

The knowledge graph is a beautiful vision of facts assembled in user-specific context. It still needs to look toward assembling the structures users ought to want, or would want if they suspected what they were missing, in addition to those they’ve built with their click trails. Some queries really do need to be informed by topical expertise so people can move beyond the crowd-sourced zeitgeist, which can’t do much more than refresh its own image, as presented by the carnival mirror that technology provides.

Posted in Context, Design, UX, Noise, Signal

It ain’t the data that was unstructured.

Stone Tech
Photo from wallyg on Flickr under Creative Commons license

So we’re told in a report on e-commerce search engines from, incredibly, May 2012. On the surface, the reporter got it wrong, as they’re wont to do when reporting outside their areas of expertise. The eBay data clearly was structured in a database rather than distributed across random, idiosyncratically formatted documents, which is what unstructured means in this context. The lack of structure was in eBay’s use of its data. Can a company that size and THAT dependent on search really have been limping along with an algorithm that “takes a query and matches it faithfully against the title of items”? Effing eBay was content with SQL string matching in a single effing field??? Apparently technology isn’t a key determinant of success if a company can post solid financial results over a long time even with the limitations of systems as dusty as that.

The search engine project takes time because eBay’s online marketplace has so much variable information from millions of listings that are described differently by each seller – something known as unstructured data in the tech world.
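If the reporting is accurate, the old engine amounted to something like the sketch below. This is a guess in Python and SQLite, not eBay’s actual code; the schema and listings are invented, and the point is the single-field substring match and everything it ignores.

import sqlite3

# A guess at what "matches a query faithfully against the title of items" looks like.
# Schema, listings, and query are invented; the point is the single-field LIKE match.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, category TEXT, price REAL)")
conn.executemany("INSERT INTO items (title, category, price) VALUES (?, ?, ?)", [
    ("Red dress, size 8, never worn",    "clothing",        45.00),
    ("Crimson evening gown, sz 8",       "clothing",        52.00),
    ("Dress form, adjustable, red base", "sewing supplies", 89.99),
    ("The Red Dress (paperback)",        "books",            6.50),
])

query = "red dress"
rows = conn.execute("SELECT title FROM items WHERE title LIKE ?", (f"%{query}%",)).fetchall()
print([title for (title,) in rows])
# ['Red dress, size 8, never worn', 'The Red Dress (paperback)']
# The crimson gown never surfaces, the paperback does, and category and price never
# enter into it at all: structured data, used unstructuredly.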

Threat to Google?

The article asks, but it also answers: No, probably not, given the headstart Google has on these e-commerce pretenders. And the reasons aren’t just technical, or even mostly so. The use case for eBay’s or WalMart’s search starts with someone happy to get results from within those silos, so they’ve already missed the key context for people’s product searches: Buyers don’t want a product specifically from one catalog or another, they just want the damn product. Google gets that. (In some ways it has invented that, or at least nurtured it, because its pioneering reach across silos has opened gaping holes in brand loyalty just by making the options so easy to find.) If the search people at one or another e-commerce outfit really believe people will start with their own proprietary single-box interfaces, they’re making the terrible mistake of believing their own marketing messaging.

But the more interesting stuff in the article is in the discussion of that “red dress” search on the Goog. They’re the ones dealing with the unstructured data scattered across the DIVs and TDs of the Web, but they’ve obviously managed to read the semantic cues well enough to suss out some underlying structure. For all the complaining we hear about the indiscriminate, blunt-force tactics of single-box search, Google obviously has managed to infer categories and act on them to structure a display relevant to users. And as reported elsewhere they’re building out that structure in a rather disciplined and targeted way, in part through their integration of Freebase. They’ve done and are doing more with the truly unstructured data of the web than are competitors with long histories of carefully structuring their information, in part because those competitors haven’t bothered to build anything user-facing on top of those carefully constructed foundations.
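Here’s a deliberately crude sketch of what “reading the semantic cues” might mean at its most basic. The markup and patterns are invented for the example, and there’s no claim that Google works anything like this; it just shows that structure can be coaxed out of tag soup.

import re
from html.parser import HTMLParser

# A toy 'semantic cue' reader: pull the text out of arbitrary DIV/TD soup, then guess
# at structure (a price, a color) with patterns. The page and the patterns are invented.

class TextScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

page = """
<div class="prod"><td>Summer Dress</td><td>Color: Red</td>
<td>Now only $39.99!</td></div>
"""

scraper = TextScraper()
scraper.feed(page)
text = " ".join(scraper.chunks)

price = re.search(r"\$\d+(?:\.\d{2})?", text)
color = re.search(r"\b(red|blue|black|green|white)\b", text, re.IGNORECASE)
print({"price": price.group() if price else None,
       "color": color.group().lower() if color else None})
# {'price': '$39.99', 'color': 'red'}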

So when people complain about searchers satisficing within the unstructured mess of Google results — and I’m looking at YOU, librarians — maybe they should take a peek under the hood themselves, possibly learning a lesson or two in the process about how to define a use case and what to do about it after you’re finished. Library systems, AND vendors, are a lot closer to the stone-age technology that eBay is only now replacing than they are to anything current.

Posted in Context, Organization

Brought to you by your friendly librarians

Oxford University, Vatican libraries to digitize works

Illuminating

The Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (BAV) said on Thursday they intended to digitize 1.5 million pages of ancient texts and make them freely available online.

OK, now THAT is cool enough for a Thursday!

Photo from Walters Art Museum Illuminated Manuscripts on Flickr under Creative Commons license.

Posted in Background

The Morselization of Information


Wavii delivers morsels of news, five words at a time

Five years ago, Twitter’s critics dismissed the idea that news could be transmitted in 140 characters. Now, Adrian Aoun thinks it can be done in just five or six words.

one-third of a bird

So how much signal do we need to distinguish it from noise? Is five or six words, maybe 45 or 50 characters at most, enough to see traces of meaning? A good headline writer can manage that task in eight to ten words. (Reuters took ten above.) Is half a headline enough?

Twitter and SMS catch all kinds of flak for forcing users to distill a thought to 140 characters, but Aoun & company are set to do it in one-third of that or less. They seem set to perform the same function as Twitter — linking interesting static content — but they deny their audience the luxury of annotation to suggest context. The service performs a useful function only if the user can determine — accurately — whether the linked resource will be worth the attention span based only on the mini-headline. Who ever thought Twitter would start to feel like a communication indulgence?

I’m sure the NLP is cool as hell — it will have to be to do anything at all for its users. But is attention span so scarce that a useful resource must be dumbed down to five words to make it consumable? At some point this quest for precision becomes pathological. The pinpoint indicator loses context, so it indicates nothing, so only the great wash of recall is left splashing randomly around.
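Just to see how much survives the squeeze, here is a crude five-word distillation of the Reuters headline above. The stopword hack is my own naive stand-in and has nothing to do with Wavii’s actual NLP.

# A deliberately dumb stand-in for five-word news compression: keep the first five
# "content" words of a headline. Wavii's NLP is surely smarter; the point is what gets lost.

STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "at"}

def five_word_morsel(headline):
    content = [w for w in headline.split() if w.lower() not in STOPWORDS]
    return " ".join(content[:5])

print(five_word_morsel("Wavii delivers morsels of news, five words at a time"))
# 'Wavii delivers morsels news, five': the grammatical glue and the context go first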

Photo from John ‘K’ on Flickr under Creative Commons license.

Posted in Noise, Organization, Signal

Who will guard the guards?


Photo by rooReynolds

News: Web privacy rules turn poachers into gamekeepers
Views: The EU is demanding that users opt in to tracking, while the U.S. Congress seems to want people to have a relatively standard way to opt out. Either way, it’s the companies that rely on tracking who will develop the technology, and who therefore understand how to work around it so they can continue providing their services. If they can’t, and if people generally block tracking, the revenue from selling users’ eyeballs will diminish, and so will the services those users want — or users will have to start paying directly. Lots of issues at play between now & June, it seems.
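If the opt-out ends up looking like the do-not-track header being floated, the technical piece is remarkably thin: a single “DNT: 1” request header that the site is on its honor to respect. Here is a minimal sketch of a server noticing it; the handler and its behavior are invented for illustration.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal sketch: the opt-out mechanism largely boils down to a "DNT: 1" request header
# that the site is trusted to honor. The handler and its behavior here are invented.

class TrackingAwareHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        wants_no_tracking = self.headers.get("DNT") == "1"
        body = b"tracking disabled" if wants_no_tracking else b"tracking enabled"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
        # Whether anything actually changes server-side is entirely up to the operator.

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), TrackingAwareHandler).serve_forever()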

However it plays out, a larger pattern seems to be the ascendance of private organizations into regulatory roles. The EU bureaucrats are issuing the mandate, and presumably they’d also evaluate any technical solution proposed by industry. But the industry consortium takes on the larger role, and it’s the only party with its hands on the actual switches. Its members are also in the position of brokering government mandates from different jurisdictions, and there isn’t much doubt about whose interests will most powerfully drive their actions. They continue to frame their role as protecting their customers’ access to free services, but that tissue of rhetoric is too thin and too far stretched to provide any coverage at all for the primary financial motives in play.

But maybe a prior question is whether internetters themselves want to opt out. The bureaucrats might be embarrassed if they go to all the trouble of forcing the technical changes, and the reconfiguration of the industry driven by them, only to have users greet the news with a yawn and ignore the whole kerfuffle while they continue to Facebook along, oblivious to the man behind the curtain.

Posted in Noise, Privacy

The Semantics of “Semantic”

News: Google gives search a refresh, the Wall Street Journal reports.
Views: Lots of these changes sound great to me, as best I can tell from a WSJ report. I’m a fan of improving relevance and I like the idea of “examining a Web page and identifying information about specific entities referenced on it, rather than only look for keywords” depending on how it works. The language evokes processes behind linked data, especially the idea of unambiguous identifiers for entities, allowing aggregation of relevant data from all kinds of sources.

But how? The web hasn’t (yet) gotten around to tagging entities with stable URIs, and Google’s expertise is in crunching text. I suspect the answer involves lots of processing power grinding down text to its component parts and matching up bits with some level of reliability, as opposed to the (supposed) certainty of identification via stable URI. On one side, that seems a better strategy than sitting around waiting for certainty to descend from the heavens and resolve all data everywhere into one giant, tidy triple store. On the other side, that’s still the old method — “look for keywords” — just the really KEY keywords. In other keywords, between the lines of the WSJ story, it’s looking like a turbocharged version of the current technology.
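The difference I’m gesturing at, shrunk to a toy with invented documents, URIs, and triples: matching a key string versus aggregating statements that hang off a stable identifier.

# Toy contrast between keyword matching and aggregation keyed to stable identifiers.
# The documents, URIs, and triples below are invented for illustration.

documents = [
    "Jaguar tops reliability survey",               # the automaker
    "Jaguar spotted near village, livestock lost",  # the cat
]

# Keyword approach: both documents "match", ambiguity and all.
print([d for d in documents if "jaguar" in d.lower()])

# Linked-data approach: every statement hangs off an unambiguous identifier,
# so aggregating facts about the automaker never drags in the cat.
triples = [
    ("http://example.org/id/Jaguar_Cars",   "type", "Automaker"),
    ("http://example.org/id/Jaguar_Cars",   "headquarters", "Coventry"),
    ("http://example.org/id/Panthera_onca", "type", "Species"),
    ("http://example.org/id/Panthera_onca", "habitat", "Americas"),
]

def describe(uri):
    return {predicate: obj for subject, predicate, obj in triples if subject == uri}

print(describe("http://example.org/id/Jaguar_Cars"))
# {'type': 'Automaker', 'headquarters': 'Coventry'}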

If the change involves a difference in kind, it’s the effort to amass that giant and growing entity database, with pretty likely results spit back at users keystroke by keystroke. Considering how the entities are identified, the signal still is certain to contain enough noise to paint a fuzzy picture, leaving satisficing users to make do with almost-relevant information. Good enough for most, maybe, and better than the current flood of recall. But it doesn’t live up to the promises of precision inherent in the “semantic” language.

And the provenance of the information in that entity database raises a potentially larger problem. In the extreme case where Google mashes up the entire web into its own proprietary database, it disintermediates all the sources on which it relies for the value of that database. To the extent that it starves the geese laying those golden eggs, it reduces its own ability to attract the eyeballs it wants to sell to advertisers. By linking to the providers, it automatically shares that wealth and keeps them going, while capturing a share for itself. If it were to become a single-source category killer, it would compromise its own supply of information while fobbing off users with dumbed down overviews that only kinda sorta meet their needs. That looks from here like a reduction in value on all sides rather than the value-add from a well-integrated general search tool.

Would providers hate that outcome enough to start setting their robots.txt files to restrict or even block Google? They probably couldn’t expect a much better deal from Google’s competitors, even if blocking were a viable option at all. Realistically, that extreme case isn’t an especially likely outcome, but it certainly raises (or reemphasizes) questions about the point at which optimizing for one provider begins to draw down value for the system as a whole.

The nod to Facebook and other competitors is expected in a WSJ piece, but not much is made of social search as an alternative. G+ hasn’t gained near the traction it would need to be a solid basis for judgments of relevance, despite Google’s attempts to make it so. FB’s prospects for a better outcome seem no better, given the mundane patter on which it would be drawing to power suggestions in search results. True semantic web technologies seem the far better path to truly relevant search results, maybe via the relatively simple shortcut of microdata. That set of options has the additional happy benefit of preserving the diversity and diffusion needed for a healthy ecology of information providers.
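For anyone who hasn’t met microdata: it’s a handful of attributes sprinkled into ordinary markup so an entity and its properties can be read straight off the page. Below is a minimal reader for a made-up schema.org-style snippet; the markup pattern is schema.org’s, the parsing is my own toy, and real microdata handling covers nesting and typing that this ignores.

from html.parser import HTMLParser

# Minimal microdata reader: collect itemprop values from a schema.org-style snippet.
# The snippet is made up, and real microdata parsing handles nesting, itemscope,
# and typed values that this toy cheerfully ignores.

SNIPPET = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Red Dress</span>
  <span itemprop="color">Red</span>
  <span itemprop="price">39.99</span>
</div>
"""

class ItempropReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.props = {}

    def handle_starttag(self, tag, attrs):
        self.current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self.current and data.strip():
            self.props[self.current] = data.strip()
            self.current = None

reader = ItempropReader()
reader.feed(SNIPPET)
print(reader.props)
# {'name': 'Red Dress', 'color': 'Red', 'price': '39.99'}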

Posted in Context, Noise, Signal

How to Pop Those Pesky Filter Bubbles

Eli Pariser's TED talk

Eli Pariser gave a TED talk in March 2011 urging web users to beware online filter bubbles. It’s not a new set of concepts — personalization is rampant as an ostensibly well-intentioned strategy to help us find the useful bits within the web’s stream of semi-consciousness. The talk doesn’t address the potentially nasty influence of selling our eyeballs to advertisers, as so many such critiques do. Rather it makes some clear statements about how our information diet gets reduced to junk food, and it issues a call to mix in some “information vegetables” with the automatically filtered stream of preprocessed infonuggets.

One thing I think it could do better is to emphasize the need for interface design to show users how the algorithmic filters influence what they see and provide controls to adjust those effects. That’s mentioned, but it’s core to the topic and needs clearer explication, imo. Also, I can easily see someone respond to the criticism by saying, “Look, if you want to be someone who has watched Rashomon, then skip past Ace Ventura and just watch Rashomon already.” It’s asking for better-behaved net nannies rather than their elimination. Given the scale of the traffic at this point, maybe that’s the best that can be hoped for.
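The kind of control I have in mind could be as blunt as a single knob trading off “more like what you already clicked” against “what everyone else thinks matters.” A toy version, with all the items and scores invented:

# A toy ranking with a user-visible personalization knob. Items and scores are invented;
# the point is that the blend is exposed and adjustable rather than silently fixed.

items = [
    {"title": "Ace Ventura",       "personal_score": 0.9, "editorial_score": 0.2},
    {"title": "Rashomon",          "personal_score": 0.3, "editorial_score": 0.9},
    {"title": "Election coverage", "personal_score": 0.1, "editorial_score": 0.8},
]

def rank(items, personalization=0.5):
    """personalization=1.0 is the pure filter bubble; 0.0 is the shared front page."""
    def blended(item):
        return (personalization * item["personal_score"] +
                (1 - personalization) * item["editorial_score"])
    return sorted(items, key=blended, reverse=True)

for knob in (1.0, 0.3):
    print(knob, [item["title"] for item in rank(items, knob)])
# 1.0 ['Ace Ventura', 'Rashomon', 'Election coverage']
# 0.3 ['Rashomon', 'Election coverage', 'Ace Ventura']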

Posted in Censorship, Context, Design, UX, Privacy

Literate distractibility?

next click
Photo from MiikaS on Flickr under Creative Commons BY-SA license

Reuters tells us about a Pew survey of “technology insiders, critics and students” asking what skills young people will need in 2020. The key ones are cooperative work that makes problem-solving a public, crowd-sourced activity, searching for information online, and evaluating information quality. As best I can tell, the third is a more specific restatement of the second, but whatever.

What interests me about the key-skill list is what’s omitted: thinking things through. The survey respondents apparently worry that the term “learning” more and more describes a process of flitting among virtual experiences like a stone skipping across a pond, never pausing to look below and consider what may be holding up that surface. The article quotes Jonathan Grudin: “[T]he ability to read one thing and think hard about it for hours will not be of no consequence, but it will be of far less consequence for most people.” I wonder what he’d say about how the noise facilitated by recall-driven information technology encourages that surfacization with its bare list of possibly relevant search trails. Users see a potential experience in each item on the laundry list of linked factoids that current interfaces provide. They get no help, or damn little, deciding which might represent a good path toward better understanding.

Barry Chudakov has something to say about that: “Is this my intention, or is the tool inciting me to feel and think this way?” That’s a good question. What’s the answer?

Posted in Design, UX, Noise