How does a search engine work?
In order for a document, usually a web page, to appear in search results, every search engine must first find the document and then understand its content. The first step is therefore to get an overview of as many documents as possible. A crawler like the Googlebot essentially does nothing but follow URLs in order to constantly discover new pages.
When the crawler lands on a URL, it first downloads the HTML document and scans the source code for basic information on the one hand and for links to other URLs on the other. Meta information such as the robots meta tag or the canonical tag tells the crawler how to process the page. The crawler can then follow the links it has found.
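Both pieces of meta information live in the document head. A minimal example (the URL is purely illustrative):

<head>
  <!-- tells the crawler not to index this page, but to follow its links -->
  <meta name="robots" content="noindex, follow">
  <!-- declares https://www.example.com/subpage as the canonical URL -->
  <link rel="canonical" href="https://www.example.com/subpage">
</head>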
For this, the crawler passes the HTML file and all resources on to the indexer, which in Google’s case is called Caffeine. The indexer renders the document and can then index the contents of the web page. The algorithm then makes sure that the most relevant documents can be found for matching search queries.
<!doctype html>
<html>
  <head></head>
  <body>
    <app-root></app-root>
    <script src="runtime.js"></script>
    <script src="polyfills.js"></script>
    <script src="main.js"></script>
  </body>
</html>
While structure and content are already included in the HTML source code of a server-side rendered document, the pre-DOM HTML of React, Angular and co. arrives almost empty, as the listing above shows. If the entire content of a page is loaded this way, a search engine crawler that only reads the HTML source code gets virtually no information. The robot finds neither links to follow, nor basic information about the page content, nor any meta tags.
In fact, we know that Google uses a Web Rendering Service (WRS) based on headless Chrome inside Caffeine. Unfortunately, we also know that this service is currently still at the level of Chrome version 41 and therefore behaves like a three-year-old browser. Fortunately, the responsible team around John Mueller and Martin Splitt has already stressed that they are working hard to move to a newer version as soon as possible and want to keep up with Chrome updates in the future.
But as long as that is not the case, you can look up on www.caniuse.com or in the Chrome Platform Status which features this Chrome 41 supports and which it does not.
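If a page relies on newer APIs, it makes sense to feature-detect and polyfill them. A minimal sketch; IntersectionObserver is just one example of an API that Chrome 41 does not yet ship (it arrived in Chrome 51):

// Chrome 41 predates IntersectionObserver, so guard its use
// and fall back to an eager alternative the WRS can handle.
if ('IntersectionObserver' in window) {
  var observer = new IntersectionObserver(function (entries) {
    // lazy-load images only when they scroll into view
  });
} else {
  // fallback: load everything immediately so the content is visible to the WRS
}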
Two other tools for testing Google’s rendering are the Mobile-Friendly Test and the Rich Results Test. Both show you the rendered HTML; the Mobile-Friendly Test additionally provides a screenshot of the rendered page.
URLs are the API for crawlers
For internal links it is important to avoid pushState bugs, so that the URL actually supported by the server is the one that gets called. Otherwise content may suddenly be available under multiple URLs and become duplicate content.
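A sketch of one such bug; the route names are illustrative:

// The server canonically serves /subpage without a trailing slash.
// If the client-side router pushes a different variant, the same
// content suddenly lives under two URLs: classic duplicate content.
history.pushState(null, '', '/subpage/');  // bug: trailing-slash variant
history.pushState(null, '', '/subpage');   // correct: matches the server URL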
You also have to know that the Googlebot always works statelessly. Cookies, Local Storage, Session Storage, IndexedDB, Service Workers, etc. are not supported. The crawler visits every URL as a completely new user. It is therefore important to make sure that all routes, i.e. all URLs, are always directly accessible.
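A typical anti-pattern, sketched with illustrative names: content that is only reachable after some client-side state has been set is invisible to the crawler.

// Anti-pattern: gating a route on stored state. Googlebot supports
// neither cookies nor Session Storage, so for the crawler this check
// fails on every single URL and the actual content is never reached.
var visited = sessionStorage.getItem('visitedBefore');
if (!visited) {
  location.href = '/welcome';  // the bot is bounced here every time
}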
Server Status Codes
If content is no longer available, a 404 status should also be returned correctly. Search engines then remove these URLs from the search results. This also includes avoiding soft-404 errors: a 404 page is called a 404 page because it returns exactly this status code, and not because an “Oops! Sorry, this page does not exist” message is displayed while the server sends a 200 (OK) code.
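A minimal sketch, assuming a Node.js server with Express (the file names are illustrative):

var express = require('express');
var path = require('path');
var app = express();

// Unknown routes get the error page WITH the matching 404 status.
// Sending the same page with status 200 would be a soft 404.
app.use(function (req, res) {
  res.status(404).sendFile(path.join(__dirname, '404.html'));
});

app.listen(8080);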
Links are only reliably followed if they are real <a> elements with an href attribute. Of the following three variants, only the last one is a link the crawler can follow:

<a onclick="location.href('https://www.example.com/subpage');">My subpage</a>
<span onclick="goTo('/subpage');">My subpage</span>
<a href="/subpage">My subpage</a>
Caution is also required with the robots meta tag. Consider a page that ships a noindex in its initial HTML and removes it again at runtime via JavaScript:

<meta name="robots" content="noindex">
<script>
  var robots = document.querySelector('meta[name="robots"]');
  document.head.removeChild(robots);
</script>
But in the first step, remember, the Googlebot comes to the page and downloads the raw HTML, where it finds the noindex. The page is therefore never passed on to Caffeine, where it would be rendered, and Google never sees that in the finished DOM the noindex would no longer be in the head. As a result, the page is not indexed at all.
Rendering web pages is extremely resource-intensive, even for industry giants like Google. Therefore, as mentioned above, the rendering process does not happen immediately after the crawler discovers a URL, but only when corresponding resources become free. It can take up to a week for a page to be rendered. This makes the otherwise quite simple process of crawling and indexing extremely complicated and inefficient.
This requires a middleware that distinguishes whether a request comes from a normal browser or from a bot. Usually the user agent is simply read out and, if necessary, the IP address is additionally verified against those published for the respective bot. John Mueller, Senior Webmaster Trends Analyst at Google, mentioned in a Google I/O ’18 talk that this kind of differentiation is not considered cloaking. It should be clear, however, that the pre-rendered and the client-side version must not differ in content.
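A minimal sketch of such a middleware, assuming an Express server; the bot list and the servePrerenderedSnapshot helper are illustrative, not a real API:

var BOT_PATTERN = /googlebot|bingbot|yandexbot|baiduspider/i;

function dynamicRendering(req, res, next) {
  var userAgent = req.headers['user-agent'] || '';
  if (BOT_PATTERN.test(userAgent)) {
    // bots get a pre-rendered HTML snapshot of the requested URL
    servePrerenderedSnapshot(req.url, res);  // hypothetical helper
  } else {
    // normal browsers receive the client-side rendered app
    next();
  }
}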
Hybrid “Isomorphic” rendering
Testing whether you are found
Of course, you can always download version 41 of the Chrome browser yourself and look at how the page renders. The console in the local DevTools provides information about which features this old version does not yet support.
A simple Google search also lets you test whether the contents of your own page have been indexed correctly. Using the search operator site:example.com together with a text excerpt from the page in question, you can quickly determine whether Google finds the content.