mainentityofpage

Schema.org defines markup that lets search engines and the like process web sites with some intelligence. You add extra information as markup on your web site’s pages, then internet bots read this data and use it to process your site more effectively. This increases the accuracy of search engines’ understanding of your site, so results of apt queries are more likely to suggest your website.

Unfortunately, a number of their markup definitions and descriptions are pretty awful. I’ve already written something on isFamilyFriendly, which smells like something imposed by an external sponsor.

This time, I’ve found a mess which can only be the fault of schema.org itself. Their properties mainEntity and mainEntityOfPage are appallingly defined. The problem is that Google requires them to be used in blog post markup.

According to schema.org:

mainEntityOfPage (of type Url) “indicates a page (or other CreativeWork) for which this thing is the main entity being described.”
mainEntity (of type Thing) “indicates the primary entity described in some page or other CreativeWork. Inverse property: mainEntityOfPage.”

So, to 'clarify': mainEntityOfPage, which is called a Page, but is a Url not a Page which is a Thing, indicates a page that is not a Page, and must be a thing that is not a Thing, and there is another main entity that is not a mainEntity, and one of the … er … wotsits … in that mess is being described.

Google have a somewhat clearer definition, “The canonical URL of the article page. Specify mainEntityOfPage when the article is the primary topic of the article page.”

Unfortunately, the Google definition also has a self–destruct feature. It uses the word article. These two properties are for marking up HTML pages, and HTML defines article. By implication, the Google definition states that mainEntityOfPage applies to the contents of the HTML article tag. But this markup is a URL. But it is on the article page, which is the page containing the article to which it is being applied. So the only possible value of the url is its own page, e.g., in unix speak, “.”. What’s the point of that? Anything reading the page already knows its Url, because it needed it to find the page to read.

Perhaps I should look more closely at the Google definition. It uses the word canonical. It says the tag is used to point to the canonical URL of the article page. Now, I’m presuming canonical is the computing use, which means the preferred or unique version. Since I do not knowingly copy other people’s work, mine is the only version I know about, so that definition also shows the URL should reference itself. So every definition of that tag that I can work out defines it to be a URL that points to itself. That’s pointless. Yet Google insist it is used. Why?!

I have to admit that, if this is correct, I will cheat by using the photographers’ solution to overdefinition, which is blur, and blur the reference to be to the site containing the page, not the page itself.

As an aside, there’s a problem with that word canonical, which is why I dislike it. The computing use indicates the unique definition, which by implication must be the best definition. But the normal English definition of canonical refers to canon law, which is church law. Law is always being revised, and thus cannot be perfect (if it were perfect, it would not need revising). It may (sometimes) be good, but it is not perfect. The problem is illustrated in the cliché “the best is the enemy of the good”, which reveals the computing jargonisation of canonical is the enemy of standard usage. They are opposites. So, canonical means the opposite of itself, and thus is a classical example of a word to be avoided when clear and precise definition is essential—such as in computing. HTML, including web page markup, is computing. It’s clear anything wanting precision which uses the word canonical cannot get it. But Google insist on precision. Bah!

In these circumstances, I’ll fall back on a third, informal definition of the word, one from the military, which means a reference to cannons. Cannons are devices that blow things up with a bang. When they go wrong, they blow themselves up with a bang, along with anyone around them. It seems, underneath it all, that this is the appropriate meaning of the word canonical in the Google definition of mainEntityOfPage.

The other schema.org tag mentioned above is mainEntity, defined to be the inverse of mainEntityOfPage. But mainEntityOfPage is already established to refer to itself, so mainEntity must logically refer to something that is not itself. But its type is a Thing, which normally contains additional information, and is normally defined where it is written, as per the examples on the schema.org page. So mainEntity, from the examples, contains information, and that information is part of the page. So mainEntityOfPage refers to part of itself, whereas its inverse, mainEntity, refers to part of itself. Thus, in the schema.org world, a page is the opposite of itself.
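To make the two shapes concrete, here is a minimal Python sketch that builds both as JSON-LD. The URL and headline are hypothetical placeholders, and the layout is only my reading of the schema.org examples, not a definitive statement of what either property means.

```python
import json

# Hypothetical URL, for illustration only.
PAGE_URL = "https://example.com/blog/a-post"

# mainEntityOfPage: written on the thing, with a URL as its value.
post_pointing_at_page = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "A post",
    "mainEntityOfPage": PAGE_URL,
}

# mainEntity: written on the page, with a nested Thing as its value.
page_containing_post = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "url": PAGE_URL,
    "mainEntity": {"@type": "BlogPosting", "headline": "A post"},
}

print(json.dumps(post_pointing_at_page, indent=2))
print(json.dumps(page_containing_post, indent=2))
```

In the first form the value is a bare URL; in the second it is a nested object carrying its own information, which is the asymmetry complained about above.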

This is a thorough mess. I wonder if it is an April fool that got out of control? Or did someone, somewhere, who is scared of Artificial Intelligence, take 1960s Hollywood nasties about blinkenlichten blowing themselves up too seriously, and create a definition bomb in the schema because they hope real computers can’t handle self–references? (Clue: Hollywood was, as usual, talking bollox.)

Clearly and obviously, a thing that’s at fault is my understanding of the concepts. It would be nice if Schema.org replaced their current explanations of the terms with something that explains them. I’d like to know what they are supposed to do.

The practical problem for me is that Google requires blog pages to be decorated with mainEntityOfPage, and I have a blog I want to mark up, so I have to use these badly explained things. I’m going to fall back on my silly get–out–of–jail–free article observation above: I’ll blur the precision and have mainEntityOfPage refer to this site’s home page, just to shut the pointless error messages up.
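That workaround is easy to sketch in Python. The snippet is only an illustration: the example.com URL, the headline, and the helper names `site_home` and `blog_posting` are my own placeholders, not anything Google or schema.org prescribes.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def site_home(page_url: str) -> str:
    """Reduce a page URL to its site's home page (scheme and host only)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def blog_posting(page_url: str, headline: str) -> dict:
    """Build BlogPosting JSON-LD with the 'blurred' mainEntityOfPage."""
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        # Blurred on purpose: the site's home page, not the page itself.
        "mainEntityOfPage": site_home(page_url),
    }


# Hypothetical page URL, for illustration only.
print(json.dumps(blog_posting("https://example.com/blog/a-post", "A post"), indent=2))
```

The printed JSON would go inside a `<script type="application/ld+json">` element on the blog page.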

16 4 10

UPDATE: I received an email from Bart Turczynski of zety.com, who prompted me to reexamine my understanding of canonical in this context. From that, I found RFC 6596, which should clear the matter up for me.
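RFC 6596 defines the canonical link relation, which a page can declare with a link element carrying rel="canonical". As a rough sketch of what a reader of the page might do with it, the Python below pulls that URL out of some HTML using only the standard library. The class name, function name, and sample markup are hypothetical, and this is a simplification, not a full implementation of the RFC.

```python
from html.parser import HTMLParser


class CanonicalLinkFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated list of link types, compared case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html: str):
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page, for illustration only.
SAMPLE = (
    '<html><head><link rel="canonical" '
    'href="https://example.com/blog/a-post"></head><body></body></html>'
)
print(find_canonical(SAMPLE))
```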

Amusingly, I may have already implemented a solution! IndieWeb requires microformats’ h-entry on blog entries. That includes u-uid, which I now think corresponds to mainEntityOfPage. If that’s correct—I still need to carefully read the RFC—then I can generalise that solution, although it’s going to take some work to apply it across the whole site.
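If u-uid really does correspond to mainEntityOfPage, generalising might look something like the Python sketch below, which reads the u-uid link out of an h-entry and reuses it in the schema.org markup. This is an assumption-laden illustration: the names `UidFinder` and `json_ld_from_h_entry` and the sample markup are hypothetical, and the class lookup is far simpler than a real microformats parser (it does not even check that the link sits inside an h-entry).

```python
import json
from html.parser import HTMLParser


class UidFinder(HTMLParser):
    """Remember the href of the first <a> or <link> with class u-uid."""

    def __init__(self):
        super().__init__()
        self.uid = None

    def handle_starttag(self, tag, attrs):
        if self.uid is not None or tag not in ("a", "link"):
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "u-uid" in classes and attr.get("href"):
            self.uid = attr["href"]


def json_ld_from_h_entry(html: str, headline: str):
    """Build BlogPosting JSON-LD whose mainEntityOfPage is the entry's u-uid."""
    finder = UidFinder()
    finder.feed(html)
    if finder.uid is None:
        return None  # no u-uid found, so nothing to reuse
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "mainEntityOfPage": finder.uid,
    }


# Hypothetical blog entry, for illustration only.
ENTRY = (
    '<article class="h-entry">'
    '<a class="u-url u-uid" href="https://example.com/blog/a-post">permalink</a>'
    "</article>"
)
print(json.dumps(json_ld_from_h_entry(ENTRY, "A post"), indent=2))
```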

The schema.org definition remains confusing; I wish they referenced the RFC, or at least the Wikipedia description of the problem and its solution. Right now, their description smells very much like an introduction to One Song To The Tune Of Another.

18 8 14