How to Check for Duplicate Content
Duplicate content on a website should be avoided, since it makes it difficult for search engines to decide which version of a page to rank for a query.
While there is no such thing as a “duplicate content penalty” in SEO, highly similar content can cause crawling inefficiencies, dilute PageRank, and be a sign of content that should be consolidated, removed, or improved.
It’s important to note that duplicate and similar content is a natural feature of the web, and it isn’t always a problem for search engines, which canonicalise and filter URLs by design where appropriate. At a larger scale, however, this becomes more difficult for them.
By avoiding duplicate content, you keep control over what is indexed and ranked, rather than relying on search engines to decide. You can also reduce crawl budget waste and consolidate indexing and link signals to aid ranking.
In this blog, we will show you how to use the Screaming Frog SEO Spider to discover both exact duplicate content and near-duplicate content, where some text matches between pages on a website.
NB: The first two steps are only available with a licence.
1. Set up to find near duplicate content
By default, the SEO Spider will automatically identify exact duplicate pages. However, to identify ‘Near Duplicates’ the configuration must be enabled, which allows it to store the content of each page.
In the top menu, select Configuration > Content > Duplicates and tick ‘Enable Near Duplicates’.
This enables the Minhash algorithm to help identify near duplicates.
The first option, ‘Only Check Indexable Pages for Duplicates’, does what you’d expect: if two URLs are identical but one is canonicalised to the other (and is therefore non-indexable), it will not be reported unless this option is disabled.
The near-duplicate similarity threshold is set to 90% by default, but you can change this if you wish. It can also be adjusted post-crawl, saving you from recrawling a large site.
If you are interested in finding crawl budget issues, then untick the ‘Only Check Indexable Pages For Duplicates’ option, as this can help find areas of potential crawl waste.
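The SEO Spider’s own MinHash implementation isn’t public, but the general technique it names can be sketched in a few lines. The version below is purely illustrative: the choice of word-trigram shingles, 64 hash functions, and MD5 as the underlying hash are our own assumptions, not the tool’s.

```python
import hashlib
import re

def shingles(text, k=3):
    """Break text into overlapping word k-grams ("shingles")."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """For each of `num_hashes` seeded hash functions, keep the minimum
    hash value over all shingles. Similar sets produce similar signatures."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    """The fraction of signature positions that agree approximates the
    Jaccard similarity of the two shingle sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two near-identical pages (hypothetical text):
page_a = "Our guide to checking a website for duplicate content and fixing it."
page_b = "Our guide to checking a website for duplicated content and fixing it."
sim = estimated_similarity(
    minhash_signature(shingles(page_a)),
    minhash_signature(shingles(page_b)),
)
print(f"Estimated similarity: {sim:.0%}")  # compare against the 90% threshold
```

The point of the signature is efficiency: instead of comparing every page’s full text against every other page, a crawler only needs to compare short fixed-length signatures, then flag pairs whose estimated similarity exceeds the configured threshold.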
2. Configure Content Area for analysis
You’re able to configure the content used for near-duplicate analysis. For a new crawl, we recommend using the default set-up and refining it later, once the content used in the analysis can be seen and considered.
The SEO Spider will automatically exclude both the nav and footer elements to focus on main body content. However, not every website is built using these HTML5 elements, so you’re able to refine the content area used for the analysis if required. You can choose to ‘include’ or ‘exclude’ HTML tags, classes and IDs in the analysis.
In the top menu, click Configuration > Content > Area to select which parts of the page you wish to exclude. In this example, we may wish to exclude a mobile menu that sits outside the header and footer elements.
While this isn’t a major issue, to help focus on the main body text of the page, its class name ‘mobile-menu__dropdown’ can be entered into the ‘Exclude Classes’ box.
This excludes the menu from the duplicate content analysis.
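To make the idea of a content area concrete, here is a minimal sketch of extracting page text while skipping the nav, footer, and an excluded class. The parser, class name, and HTML below are hypothetical examples mirroring the ‘Exclude Classes’ setting; the SEO Spider performs this internally and assumes well-formed HTML.

```python
from html.parser import HTMLParser

class ContentAreaExtractor(HTMLParser):
    """Collect visible text, skipping <nav>, <footer>, and any element
    whose class is in the exclude list (assumes well-formed HTML with
    no unclosed tags inside skipped regions)."""

    def __init__(self, exclude_classes=()):
        super().__init__()
        self.exclude_classes = set(exclude_classes)
        self.skip_depth = 0  # >0 while inside an excluded element
        self.text = []

    def _excluded(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        return tag in ("nav", "footer") or self.exclude_classes & set(classes)

    def handle_starttag(self, tag, attrs):
        # Track nesting so everything inside an excluded element is skipped.
        if self.skip_depth or self._excluded(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.text.append(data.strip())

html = """<body>
  <nav>Home | About</nav>
  <div class="mobile-menu__dropdown">Home | About</div>
  <p>Main body content used for duplicate analysis.</p>
  <footer>Copyright</footer>
</body>"""

parser = ContentAreaExtractor(exclude_classes={"mobile-menu__dropdown"})
parser.feed(html)
print(" ".join(parser.text))  # Main body content used for duplicate analysis.
```

Without the exclusion, the repeated menu text would be counted on every page, inflating the similarity score between otherwise distinct pages.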
Now we are ready to commence a crawl.
3. Crawl the website
Open up the SEO Spider, type or paste the website you wish to crawl into the ‘Enter URL to spider’ box, and hit ‘Start’.
4. View duplicate content
The ‘Content’ tab has two filters related to duplicate content in its filter drop-down: ‘Exact Duplicates’ and ‘Near Duplicates’.
Only ‘Exact Duplicates’ is available to view in real-time during a crawl. ‘Near Duplicates’ requires calculation at the end of the crawl, via post-crawl ‘Crawl Analysis’, before it is populated with data.
The right-hand ‘overview’ pane displays the number of exact and near duplicate URLs that require your attention.
In the main window pane, the duplicate URLs are displayed alongside their respective hash value, which has to be exactly the same for a match to qualify.
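As an illustration of how hash-based exact-duplicate detection works (a minimal sketch, not the SEO Spider’s actual hashing code, and the HTML here is hypothetical):

```python
import hashlib

def page_hash(html):
    """Hash the page source; identical sources yield identical digests,
    so matching hashes identify exact duplicates."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

page_a = "<html><body><p>Black Friday deals</p></body></html>"
page_b = "<html><body><p>Black Friday deals</p></body></html>"
page_c = "<html><body><p>Cyber Monday deals</p></body></html>"

print(page_hash(page_a) == page_hash(page_b))  # True: exact duplicates
print(page_hash(page_a) == page_hash(page_c))  # False: different content
```

This is also why exact duplicates can be reported in real-time during a crawl: a hash comparison is cheap, whereas near-duplicate similarity needs the full crawl’s content before signatures can be compared.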
5. Crawl Analysis Required
To run crawl analysis for near-duplicate content, select ‘Crawl Analysis’ in the top menu, then click ‘Configure’ and ensure the ‘Content’ tick box is selected.
Now click Crawl Analysis > Start, and if the ‘Crawl Analysis Required’ message was displayed, the near duplicates will be populated. They are ranked in descending order of similarity, from the highest match to the lowest.
The guide above should help illustrate the simple steps required to get started with the Screaming Frog SEO Spider.
For more information, check out the videos on YouTube.
Likewise, if you have any further questions, then please get in touch via our contact page.