I spotted an interesting thread at the Google News Help forum where a site owner complained that their articles weren't being included in Google News. Google replied that some of the site's formatting tags may be causing problems for its crawler.
What is interesting is that the specific tags called out by Google as the issue were standard paragraph break tags.
Harvey P. from Google said:
In reviewing your site, I found a couple of things that may be preventing our crawler from indexing your articles. In the HTML code of article pages, you use many formatting tags such as <p> and <br> that may cause problems for our crawler. Removing frequent use of these tags may help our system better identify and index your articles.
I looked at the site in question and picked a random article, and nothing seemed out of the ordinary. The code, including the <p> and <br> tags used throughout the body content, looked typical.
So I am not sure whether there was a specific article that had too much HTML formatting in it.
We do get errors on some of our articles, specifically the daily recap posts. The error we get is "Article fragmented," which means:
The article body that we extracted from the HTML page appears to consist of isolated sentences not grouped together into paragraphs. We generated this error to avoid including what might be an incorrect piece of text.
Recommendations
* Try formatting your articles into text paragraphs of a few sentences each.
* Make sure your sentences are well punctuated.
* Make sure you don't use frequent <p> and <br> tags within your paragraphs, and try to avoid breaking up the article body in general.
* Consider removing some of the non-article text from the article page.
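If you want a rough way to check your own pages against those recommendations, here is a minimal sketch in Python. To be clear, this is not Google's algorithm, which is not public. It is a hypothetical heuristic: it splits the article HTML at every <p> and <br> tag, counts the sentences in each resulting block, and reports the share of blocks that hold one sentence or fewer. The names BlockSplitter, count_sentences and fragmented_ratio are my own, and the sentence detection is deliberately crude.

```python
import re
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Collects runs of text separated by <p> or <br> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Normalize whitespace and keep the block only if it has text.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "br"):
            self._flush()

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def count_sentences(text):
    # Crude assumption: a sentence ends with ., ! or ? followed by
    # whitespace or the end of the block.
    return len(re.findall(r"[.!?](?:\s|$)", text))


def fragmented_ratio(html):
    """Return the share of text blocks containing at most one sentence."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    if not splitter.blocks:
        return 0.0
    short = sum(1 for block in splitter.blocks if count_sentences(block) <= 1)
    return short / len(splitter.blocks)


if __name__ == "__main__":
    grouped = "<p>One. Two. Three.</p><p>Four. Five.</p>"
    broken = "Line one.<br>Line two.<br>Line three."
    print(fragmented_ratio(grouped))
    print(fragmented_ratio(broken))
```

A high ratio would suggest the body is made up mostly of isolated sentences, which is the pattern the error message describes. What counts as "too high" is anyone's guess, so treat the number as a way to compare your own article templates, not as a pass or fail score.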
So I suspect Harvey is responding to a specific type of article on the site that is not properly structured.
Forum discussion at Google News Help.