January 10, 2012
highlighting the exposure of data on the internet—and providing added impetus for organizations to expose data in a way that can efficiently be found and accessed.
I’ve been trying to solve a simple problem for the past six years: Google and other search engines don’t tell you enough about a site’s content or its suitability before you visit it. To this end, I’ve been helping to create a new technology and methodology that exposes additional information consumers will find useful. Couple this with MetaCert’s partnership with ICM Registry to label every .XXX domain, and my early involvement in the W3C Semantic Web Education and Outreach Programme, and you get an interesting mix of feelings and emotions.
I replied to Lisa’s email, but I was so compelled by the subject that I thought I’d write a post about it. This post doesn’t express the opinion of Lisa or Common Crawl in any way.
When reading Stephen’s post, I couldn’t help but feel he’s trying to reinvent a concept more complicated than the Semantic Web and then making it even more complicated by adding a new gTLD to the mix. If the Semantic Web hasn’t seen mass adoption with the backing of the W3C over the past 15+ years, what hope does anyone have of creating a new complicated standard to achieve the same thing with a new TLD? If he were talking about a slick user interface for “consumers” to access such information more easily, without knowing/caring for the jargon, he’d be onto something. In fact, he’d be doing exactly what MetaCert is aiming to achieve. When I was Chair of the British Interactive Media Association for three years I rarely met a designer or developer who understood the purpose of the Semantic Web - let alone what RDF is, or any of that stuff about metadata. All they care about is making sure current/existing search engines expose their customers’ websites.
Why create a new gTLD for the data that lies beneath websites? What would the domain look like: data.data, awesomecontent.data? Or is the idea to sell a .data domain for each site that wants to expose the metadata that lies beneath? I certainly hope not. In the end it would cost about $1m for the new gTLD application. $180k is the base application fee but, as any TLD expert will tell you, that’s likely to jump to a million bucks easily with all the legal fees etc. Stephen would be much better off investing that money in MetaCert - which has a six-year lead, a method of labeling content that is a W3C Full Recommendation, and some partnerships that will help enable adoption, but more importantly, a focus on what consumers need and sometimes want. I say sometimes want because they don’t always know what they need or want - that’s why we need to innovate.
MetaCert’s mission is to become the IMDB.com for the Web - providing consumers with more information about the content and its suitability before visiting a website. The first step was to help instigate the creation of a new standard for labeling content - this doesn’t guarantee adoption, but it certainly helps - and that piece of the puzzle alone took four and a half years.
The team is now building data sets, but only so tools can be built to make use of them - data alone is worthless unless it can be interpreted and easily consumed. The first data set we created is for the benefit of parents - they can now better protect their families from sexually explicit content. Adults who wish to find that type of content but avoid it at work also benefit.
We have the largest data set of sexually explicit content worldwide, with an index of over half a billion webpages. The data by itself is worthless. Does a mother/father know what a data set is? No. Most don’t know what a browser is, let alone a browser extension or a plug-in (they are different). So we are building family safety tools that are easy to use. We are also in the process of encouraging mainstream players to update their existing family safety controls with our data set - as browser extensions etc. aren’t scalable across the entire web. We will build data sets for other useful purposes as soon as we have provided a whole product for family safety.
Going back to Stephen’s post, the aspiration/goal is admirable and similar to mine - but the two require very different implementations.
Here is a little insight into what most consumers would like to know about websites before visiting them:
- Is it child safe / safe to open at work or in front of my kids?
- Is it secure?
- Does it respect my personal information?
- Does this site really belong to this company?
- Is everything on this site free or do I need to pay for stuff?
- What’s the track record of this company? (a little more tricky)
- Is this website accessible to me? (can I increase the size of text, for example)
I’m providing examples because it’s important to always keep the consumer in mind. How do they benefit? What difference will it make to their life? Protecting children online, making our parents feel safer and more secure, and helping people with disabilities find accessible content are all benefits.