There are many reasons you might want to find all the URLs on a website, but your exact goal will determine what you're searching for. For instance, you may want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.
Archive.org
Archive.org is a useful tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the missing export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not offer a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
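If you'd rather not scrape the interface at all, the same data is available through Archive.org's CDX API. Here's a minimal Python sketch, assuming the requests library; example.com and the output filename are placeholders to swap for your own:

```python
import requests

# Query the Wayback Machine CDX API for URLs captured under a domain.
# "collapse=urlkey" deduplicates snapshots of the same URL;
# "fl=original" returns only the original URL column.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",   # swap in your own domain
        "output": "json",
        "fl": "original",
        "collapse": "urlkey",
        "limit": 10000,
    },
    timeout=60,
)
resp.raise_for_status()

rows = resp.json()
# With output=json, the first row is a header (["original"]); the rest are URLs.
urls = [row[0] for row in rows[1:]]

with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Saved {len(urls)} URLs")
```

This pulls the same up-to-10,000 rows the web interface shows, so the size limit still applies, but you get a clean text file without any browser scraping.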
Moz Pro
While you might typically use a backlink index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach often works well as a proxy for Googlebot's discoverability.
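If you do go the API route, the sketch below shows the general shape of a request to Moz's v2 Links API. I haven't verified every field here, so treat the body parameters and response keys as assumptions and check them against Moz's current documentation:

```python
import requests

# Hedged sketch of pulling inbound links via Moz's v2 Links API.
# Endpoint, body fields, and response keys are assumptions based on
# my reading of Moz's docs; verify before relying on this.
ACCESS_ID = "your-access-id"     # placeholder credentials
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),   # HTTP Basic auth (assumption)
    json={
        "target": "example.com/",       # your site
        "target_scope": "root_domain",  # assumed scope value
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()

# Collect the linked-to page for each result; key names are assumptions.
target_pages = {
    item.get("target", {}).get("page")
    for item in resp.json().get("results", [])
}
print(sorted(p for p in target_pages if p))
```

Paginate and accumulate the target pages across requests to build your list for larger sites.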
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
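For reference, here's a minimal sketch of paginating through the Search Analytics endpoint with the official Python client. It assumes a service account that has been granted access to the property; the dates, credential path, and property URL are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Authenticate with a service account added as a user on the property.
SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("searchconsole", "v1", credentials=creds)

# Page through every URL with impressions in the date range.
pages, start_row = [], 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # your verified property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = response.get("rows", [])
    pages.extend(row["keys"][0] for row in rows)
    if len(rows) < 25000:
        break
    start_row += 25000

print(f"Collected {len(pages)} pages with impressions")
```

Because the API paginates in 25,000-row chunks, this gets you far past the web interface's export cap.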
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps (a programmatic alternative is sketched after the note below):
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
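If you'd rather skip the UI, the same filtered list can be pulled through the GA4 Data API. A minimal sketch, assuming the google-analytics-data Python client, Application Default Credentials, and a placeholder property ID:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest
)

# Pull pagePath values containing "/blog/" from a GA4 property.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",  # your numeric GA4 property ID
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    # Filter to paths containing /blog/, mirroring the segment above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)

blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"Found {len(blog_paths)} blog paths")
```

Run one request per URL pattern (e.g., /blog/, /products/) and combine the results to cover the whole site.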
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a basic parsing approach is sketched below.
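If you'd rather stay in a notebook, a few lines of Python get you surprisingly far. This sketch assumes the common/combined log format and a local file named access.log; adjust the regex for your server's actual format:

```python
import re

# Extract unique request paths from a combined-format access log.
# The regex assumes lines like:
#   1.2.3.4 - - [date] "GET /path HTTP/1.1" 200 1234 ...
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log") as f:   # swap in your own log file
    for line in f:
        match = LINE_RE.search(line)
        if match:
            # Drop query strings so /page?a=1 and /page?a=2 dedupe together.
            paths.add(match.group(1).split("?")[0])

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"Found {len(paths)} unique paths")
```

For multi-gigabyte logs, the same line-by-line approach still works since nothing is held in memory except the deduplicated set of paths.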
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
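If you've gone the Jupyter Notebook route, a short pandas script can handle the merge. A minimal sketch, with placeholder filenames standing in for whatever exports you collected:

```python
import pandas as pd

# Combine URL lists from the sources above, normalize, and deduplicate.
# Filenames are placeholders; adjust column handling for exports
# that have headers or multiple columns.
sources = [
    "archive_org_urls.txt",
    "gsc_pages.csv",
    "ga4_paths.csv",
    "log_paths.txt",
]

frames = [
    pd.read_csv(path, header=None, names=["url"]) for path in sources
]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().astype(str)

# Light normalization: strip whitespace, fragments, and trailing
# slashes so near-duplicates collapse together.
urls = (
    urls.str.strip()
        .str.replace(r"#.*$", "", regex=True)
        .str.rstrip("/")
)

deduped = sorted(urls.drop_duplicates())
pd.Series(deduped, name="url").to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs")
```

Depending on your sources, you may also want to resolve relative paths against your domain and unify http/https before deduplicating.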
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!