Hi, I want to ask about XML sitemaps

Hello, I'm a new member here. Hopefully I can learn a lot from this forum :happy:

So, right now I have a problem with my site. The root domain runs Laravel with all the functionality such as /portofolio, /member/, /users, etc. But I have a subfolder for my blog, and it runs WordPress.

The root domain itself is already ranking for several of the main keywords, but I want to generate a sitemap for it. So, what do you think?

My question is, what should I do to generate a sitemap for the root domain? I just started crawling the website using xml-sitemaps.com, but the sitemap that site generated also includes /blog, while /blog already has an XML sitemap from Yoast. Is that safe from duplication? Any suggestions?


Thank you, and sorry for my bad English.

Regards,

Regi Ausched
 
Hi, @If x=y, good question, and welcome to BuSo.

What I recommend is that you take a look at Laravel and decide which sub-folders even deserve to be indexed, let alone in your sitemap. I'd probably noindex anything related to /members/ and /users/, etc. If it's a user profile, especially user-generated, then I wouldn't put that in my index or sitemap.

Ultimately you should be able to trim this down to the static portions of the Laravel site. You can grab some free software like Xenu Link Sleuth, Screaming Frog, or Integrity Pro and crawl the site, but exclude folders like /member/, /users/, and most importantly /blog/.

You'll be left with the URLs for the static portion of the Laravel site, which you can easily generate an XML sitemap for. Google something like "Generate XML sitemap from list of URLs" and there's several options you can use.

Then you can slap that in the root of your public_html or wherever, submit it in Search Console, drop a reference to it in your robots.txt, etc.
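
For reference, a bare-bones sitemap built from a list of URLs is just this kind of XML (domain.com, the paths, and the dates here are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://domain.com/</loc>
    <lastmod>2021-01-15</lastmod>
  </url>
  <url>
    <loc>https://domain.com/portofolio</loc>
    <lastmod>2021-01-15</lastmod>
  </url>
</urlset>
```

And the robots.txt reference is a single line per sitemap, something like this (assuming Yoast's default index lives at /blog/sitemap_index.xml, check yours):

```
# robots.txt
Sitemap: https://domain.com/sitemap.xml
Sitemap: https://domain.com/blog/sitemap_index.xml
```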

But no, you don't want duplicate URLs in multiple sitemaps.
 
@Ryuzaki Thank you for the response.

So, I can generate an XML sitemap for the main site (Laravel) and exclude everything else, including /blog? And by the way, if /blog is excluded, isn't the blog also going to get deindexed?

Anyway, what do you mean by "exclude"? Is it just not listing the URL? Or listing the URL in the sitemap but deindexing it with robots.txt by disallowing Googlebot from crawling it?

Thank you, Ryuzaki.
 
@If x=y, there is a difference between what is in your sitemap and what Google indexes. They can index things you don't put in your sitemap, and they can skip things you do put in it. A sitemap only helps them discover your content and see when timestamps have been updated so they can revisit and index the changes.

In this case, "exclude" means to tell the spidering software that you're using to crawl your own site to ignore a URL if it exists in a specific sub-folder of the URL hierarchy. It saves you time on crawling and keeps them out of the list of URLs you get from the spider, which you'd later use to generate your sitemap.

Generally, you don't want to use the robots.txt file to control indexing. If Google can't crawl a page you've set to "noindex", they'll never see that "noindex" flag and can still index the page, even though they can't access the content. Pages with links pointing at them deserve to be indexed, right? That's the entire philosophy of the Google algorithm. So if you're linking to pages you don't want indexed but don't let Google crawl them, they won't know not to index them, and they'll create an empty entry in the index for that URL. That's very bad news for your website.

If a set of pages is accessible on the front end of your website, meaning real users and Googlebot can reach them through links, and these are pages you don't want indexed, then you must allow crawling of them (or you risk those blank pages being indexed). If you don't want them indexed, they need to include the <meta name="robots" content="noindex" /> directive in the <head>.

To repeat, they MUST be accessible to Googlebot to be "noindexed", so do not under any circumstances block them in the robots.txt file. If you want to suggest to Google not to waste your or their own time & resources crawling those sections of your site, then link to them using the <a rel="nofollow"> tag.

Nofollow is for controlling crawling on the front end. Robots.txt is to stop accidental crawling into the back end. Noindex is to control indexing.
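
To make those concrete, a page that should stay crawlable but out of the index, and a link into a section you'd rather Google not spend crawl time on, look something like this (the paths are just examples):

```html
<!-- In the <head> of a page you want crawled but NOT indexed
     (so don't block it in robots.txt, or Google never sees this): -->
<meta name="robots" content="noindex" />

<!-- A front-end link into a section you'd rather Google not crawl: -->
<a href="/members/some-profile" rel="nofollow">Member profile</a>
```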
 
Hi @Ryuzaki, interesting that you posted this today. I was reading this https://developers.google.com/search/docs/advanced/crawling/large-site-managing-crawl-budget and noticed the following:

Block crawling of URLs that shouldn't be indexed. Some pages might be important to users, but shouldn't appear in Search results. For example, infinite scrolling pages that duplicate information on linked pages, or differently sorted versions of the same page. If you can't consolidate them as described in the first bullet, block these unimportant (for search) pages using robots.txt or the URL Parameters tool (for duplicate content reached by URL parameters). Don't use noindex, as Google will still request, but then drop the page when it sees the noindex tag, wasting crawling time. Don't use robots.txt to temporarily free up crawl budget for other pages; use robots.txt to block pages or resources that you think that we shouldn't crawl at all. Google won't shift this freed-up crawl budget to other pages unless Google is already hitting your site's serving limit.

Which goes against what you are saying?
 
Which goes against what you are saying?
They give examples of what they're talking about. This is for giant sites, like eCommerce sites with faceted navigation, for example. There are better ways to manage crawling on those, such as blocking out certain URL parameters with the URL Parameters tool.

So like .com/tshirts/?color=white&size=large&order=newest and .com/tshirts/?color=white&size=large&order=oldest and tons of other variations. They all return the same results in different orders. You can safely direct Google not to crawl those without risk of them being indexed because of how the URL Parameters tool works.

They go on to further specify how to deal with robots.txt by mentioning "pages we shouldn't crawl at all", which would be stuff like /wp-admin/. That shouldn't even be discoverable, but maybe you aren't blocking directory listings and someone who isn't you links to the backend of your site. That could set Google loose in the backend, where they'd find and try to crawl everything frequently, even though it's password protected upon access.

Same with infinite scrolling. The URL in the browser's address bar can update visually as you scroll, but that's not the actual URL of a separate page. This is where using rel="canonical" comes into play.
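
Whether it's an infinitely scrolling page or one of those sorted variations, the canonical is just a one-liner in the head pointing back at the URL you actually want surfaced (URLs here are placeholders):

```html
<!-- On .com/tshirts/?color=white&size=large&order=oldest and friends: -->
<link rel="canonical" href="https://example.com/tshirts/" />
```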

There are various methods that all either point Google to surface the correct URL, keep it visible so they see the noindex tag, or suggest they not crawl it, like with nofollow.

It's all about the difference between what can be crawled but you'd rather it wasn't, and what shouldn't be crawled at all (mainly frontend vs. backend). Robots.txt is how you restrict crawling on the backend: it should never be crawled, and you can serve HTTP header "noindex" directives in case they do get through. The URL Parameters tool is for blocking dynamically generated filtered content that's all duplicate.

For the front end, you never want to block crawling, because you'll jack up your indexing. You want to use nofollow tags, the URL Parameters tool, and canonicals. Those all offer suggestions to Google not to crawl, but still allow them to see the "noindex" meta tag.

The logic pattern should flow something like this:
  • Block on the back end (robots.txt + HTTP header noindex, sketched below the list)
  • Block dynamically loaded & auto generated URLs (URL Parameter tool)
  • Block duplicate URLs (robots.txt + canonical)
  • Suggest on the front end (nofollow + meta noindex)
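
For the back-end bullet, the robots.txt part is as simple as this (the WordPress paths are just the usual example, swap in whatever your back end actually is):

```
# robots.txt - keep crawlers out of the back end entirely
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

Pair that with an X-Robots-Tag "noindex" HTTP header at the server level for anything that slips through.
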
The key thing to understand in that block of text from the Google page you quoted is the first sentence which acts as a bullet point header on the page: "Block crawling of URLs that shouldn't be indexed." It doesn't say "block crawling of URLs you don't want indexed" but specifically says "shouldn't". That's a big difference when it comes to managing indexation.

That quote is also concerned with crawl budgets, not indexing. If you're maxing out your crawl budget (tens of millions of pages) then you need to be thinking about what kind of crap shouldn't be live on the site anyways, like member profiles, pages generated for every image, paginated comments, archives, let alone tons of duplicate variations from faceted navigation, etc. Crawling and indexing are different beasts with different solutions.
 
@Ryuzaki thank you for the explanation.
I got your point.
Let's say I have a website with around 2.1K pages indexed. The main keywords are on about 100 Laravel pages, and 20 of them already rank without a sitemap (that's where my living comes from, enough to live on in SEA, at least for myself :smile: ), plus around 2,000 blog posts which already have a sitemap via Yoast.

Real case scenario:

1. About exclude vs. noindex: I already know how to implement the noindex part. Let's say I want to noindex (and remove if already indexed) pages like /users, /signup, /newsletter, etc.

For example, if I want to noindex /users, I'm going to use a noindex meta tag in the HTML head of that section and use robots.txt to disallow domain.com/*/users (is that the right pattern for the entire /users section?). Is that right? And does "exclude" just mean not putting the URL in the sitemap?

2. Updating the sitemap
How do I automatically or dynamically update the sitemap when something changes?

3. The /blog sitemap
Do I just submit it alongside the sitemap I want to create for Laravel? Is it OK to have two sitemaps?

I'm just worried that after I submit the sitemap for Laravel, the pages that are already ranking will run into duplicate issues or something else will get messed up in the search results. What do you think?

Thank you, Ryu.
 
You'll be left with the URLs for the static portion of the Laravel site, which you can easily generate an XML sitemap for. Google something like "Generate XML sitemap from list of URLs" and there's several options you can use.
Hello Ryu,

Does the sitemap dynamically change and update when content is updated or created?

If not, what should we do in order to do that?

Thanks, Ryu. :smile:
 
For example, if I want to noindex /users, I'm going to use a noindex meta tag in the HTML head of that section and use robots.txt to disallow domain.com/*/users (is that the right pattern for the entire /users section?). Is that right? And does "exclude" just mean not putting the URL in the sitemap?
Yes on meta tag noindex on /users/. No on blocking it in robots.txt. You want it crawled or Google won't see the noindex tag. Yes on excluding it from the sitemap. You don't need Google recrawling that stuff if it's not going to be indexed. Use nofollow links any time you link to that section of the site.
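
If it helps, here's a minimal sketch of one way to send that noindex signal from Laravel itself. The middleware name and the routes it wraps are hypothetical, so adapt them to your app:

```php
<?php

// app/Http/Middleware/NoIndex.php -- hypothetical name, adjust to taste

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;

class NoIndex
{
    public function handle(Request $request, Closure $next)
    {
        $response = $next($request);

        // Send the noindex directive as an HTTP header. These pages must stay
        // crawlable (no robots.txt block) or Googlebot never sees the directive.
        $response->headers->set('X-Robots-Tag', 'noindex');

        return $response;
    }
}
```

Register it as route middleware and wrap your /users, /signup, and /newsletter routes in it. Google treats the X-Robots-Tag header the same as the meta tag, but if you prefer, you can print the <meta name="robots" content="noindex" /> tag in the Blade layout for those pages instead. Either way, also leave those URLs out of the sitemap.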

How do I automatically or dynamically update the sitemap when something changes?
For Laravel you'll need to find a solution that can automatically handle the sitemap for you. You'll want one that lets you exclude sections of the site from the sitemap. You'll want to look into solutions like this, as an example.
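
There are Laravel packages built for exactly this, but as a rough idea of what a hand-rolled version looks like, here's a sketch of a route that rebuilds the sitemap from the database on every request, so it's always current. The Portfolio model and its slug/updated_at columns are hypothetical stand-ins for whatever static sections you actually want listed:

```php
<?php

// routes/web.php -- a rough sketch; Portfolio is a hypothetical model standing
// in for whatever static sections of the Laravel site you want in the sitemap.

use App\Models\Portfolio;
use Illuminate\Support\Facades\Route;

Route::get('/sitemap.xml', function () {
    // Start with the homepage, then add one entry per portfolio item,
    // using each item's last update for <lastmod>.
    $entries = [
        ['loc' => url('/'), 'lastmod' => now()->toDateString()],
    ];

    foreach (Portfolio::all() as $item) {
        $entries[] = [
            'loc'     => url('/portofolio/' . $item->slug),
            'lastmod' => $item->updated_at->toDateString(),
        ];
    }

    // Build the XML by hand so there are no extra dependencies.
    $xml = '<?xml version="1.0" encoding="UTF-8"?>';
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
    foreach ($entries as $entry) {
        $xml .= '<url><loc>' . e($entry['loc']) . '</loc>'
              . '<lastmod>' . $entry['lastmod'] . '</lastmod></url>';
    }
    $xml .= '</urlset>';

    return response($xml)->header('Content-Type', 'application/xml');
});
```

Since it's generated per request, it updates itself whenever content changes. If the site gets big you'd want to cache it, or lean on a package such as spatie/laravel-sitemap that can crawl the site and exclude sections for you.
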
Do I just submit it alongside the sitemap I want to create for Laravel? Is it OK to have two sitemaps?
Yes, you can have your Laravel sitemap and your Wordpress sitemap. I run sites that are only Wordpress that themselves have as many as 5 sitemaps. Yoast, as you mentioned, is probably creating separate sitemaps under one single "sitemap index". It's fine.
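
For what it's worth, a "sitemap index" is just another small XML file pointing at the individual sitemaps, something like this (the file names are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://domain.com/blog/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://domain.com/blog/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```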
 
OK, I understand what I should do based on your explanation, @Ryuzaki. Thanks.

But what if I also want to noindex URLs with a query string, such as domain.com/users?short, like that?
 
But what if I also want to noindex URLs with a query string, such as domain.com/users?short, like that?

You can show the meta noindex tag on those or send the noindex HTTP header by using Regex in the .htaccess file.

Alternatively, you can simply set canonical tags on parameterized URLs to point back to the non-parameter versions of the page.

Perhaps doing both is the best idea.
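
As a rough sketch of the .htaccess half, assuming Apache with mod_rewrite and mod_headers available (the /users path and the variable name are just examples):

```apache
# Flag any /users request that carries a query string...
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{QUERY_STRING} .
    RewriteRule ^users - [E=NOINDEX_PARAMS:1]
</IfModule>

# ...and send the noindex header for flagged requests. Depending on how
# Apache processes the rewrite, the variable may get a REDIRECT_ prefix,
# so cover both names.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex" env=NOINDEX_PARAMS
    Header set X-Robots-Tag "noindex" env=REDIRECT_NOINDEX_PARAMS
</IfModule>
```

The canonical half is just a <link rel="canonical" href="https://domain.com/users" /> in the head of those parameterized pages, pointing back at the clean URL.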
 
Is that best practice? Because I've seen a lot of sites blocking those with robots.txt.
 