Duplicate content is content that appears on a website under the same title at more than one URL. When you create more than one page with the same title under the same content type in a popular CMS such as Drupal, Joomla, or WordPress, the pages share a title but get different URLs. For example, I created three basic pages titled 'About QA' in Drupal, and it generated three URLs: 'http://localhost/duplicatexontent/about-qa', 'http://localhost/duplicatexontent/about-qa-0' and 'http://localhost/duplicatexontent/about-qa-1'. Duplicate content can also appear when a script migrates data such as pages, orders, or products from one website to another. In this tutorial I show how to find duplicate content using Selenium WebDriver.
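To make the URL pattern concrete, here is a minimal sketch of how a numeric suffix can end up on a repeated title. It is only an illustration of the behavior described above, not Drupal's (or any CMS's) actual alias code; the class `AliasSketch` and the methods `slugify` and `uniqueAlias` are hypothetical names made up for this example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: a made-up model of alias generation, not real CMS code.
final class AliasSketch {
    // Turns a title such as "About QA" into a slug such as "about-qa".
    static String slugify(String title) {
        return title.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-|-$", "");
    }

    // Returns an alias that is not yet in `taken`, appending -0, -1, ...
    // when the plain slug is already used, and records the result.
    static String uniqueAlias(String title, Set<String> taken) {
        String base = slugify(title);
        String alias = base;
        for (int n = 0; taken.contains(alias); n++) {
            alias = base + "-" + n;
        }
        taken.add(alias);
        return alias;
    }

    public static void main(String[] args) {
        Set<String> taken = new LinkedHashSet<>();
        // Create the same title three times, as in the 'About QA' example.
        for (int i = 0; i < 3; i++) {
            System.out.println(uniqueAlias("About QA", taken));
        }
        // Prints: about-qa, about-qa-0, about-qa-1 (one per line)
    }
}
```

The point of the sketch is that the first page keeps the clean alias and every later page with the same title gets a hyphen and a number at the end, which is the pattern the Selenium code looks for.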
Demo Duplicate Content HTML Page Code
<!DOCTYPE html>
<html>
<head>
<title>Duplicate contents</title>
</head>
<body>
<div align="center">
<table border="1">
<thead>
<tr>
<th>Title</th>
<th>Type</th>
<th>Author</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://localhost/duplicatexontent/about-qa">About QA</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/about-qa-1">About QA</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/code-runner">Code Runner</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td> <a href="http://localhost/duplicatexontent/access-denied">ACCESS DENIED</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/circulation">Circulation</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/digital-advertising">Digital Advertising</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/summary-body">Summary of Body</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td> <a href="http://localhost/duplicatexontent/webinars">Webinars</a></td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/videos">Videos</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td> <a href="http://localhost/duplicatexontent/resources">Resources</a></td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td> <a href="http://localhost/duplicatexontent/news">News</a></td>
<td>Basic page</td>
<td>hiro</td>
</tr>
<tr>
<td><a href="http://localhost/duplicatexontent/about-qa-10">About QA</a> </td>
<td>Basic page</td>
<td>hiro</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
HTML page Output
Demo Selenium WebDriver Code for the Above Duplicate Content HTML Page
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class Duplicatecontentshandler {
    public static void main(String[] args) {
        List<String> freshcontents = new ArrayList<>();
        List<String> duplicatecontents = new ArrayList<>();
        WebDriver driver = new FirefoxDriver();
        try {
            driver.manage().window().maximize();
            // Set the implicit wait before loading the page so it applies to the lookup below.
            // Selenium 4 deprecates this overload in favor of implicitlyWait(Duration.ofSeconds(10)).
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            driver.get("file:///C:/Users/Hiro%20Mia/Desktop/Blog%20content/duplicate%20contents.html");
            // findElements returns an empty list (it does not throw) when no link is found
            List<WebElement> urllist = driver.findElements(By.tagName("a"));
            for (WebElement url : urllist) {
                String href = url.getAttribute("href");
                if (href == null) {
                    continue; // skip anchors without an href
                }
                String entry = url.getText().trim() + " " + href.trim();
                // check duplicate content: the URL ends with a hyphen and one or more digits
                // ("[0-9]+" rather than "[0-9]" so that suffixes such as -10 also match)
                if (href.trim().matches("[^\\s]+-[0-9]+")) {
                    // store duplicate contents into the duplicatecontents list
                    duplicatecontents.add(entry);
                } else {
                    // store fresh content into the freshcontents list
                    freshcontents.add(entry);
                }
            }
        } finally {
            // always close the browser, even if something above fails
            driver.quit();
        }
        System.out.println("===== Duplicate contents =======");
        System.out.println("Number of duplicate contents =: " + duplicatecontents.size());
        for (String duplicate : duplicatecontents) {
            System.out.println(duplicate);
        }
        System.out.println("\n===== Fresh contents =======");
        System.out.println("Number of Fresh contents =: " + freshcontents.size());
        for (String fresh : freshcontents) {
            System.out.println(fresh);
        }
    }
}
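Since duplicate content is defined here as the same title appearing at more than one URL, another way to look at the same links is to group them by title. The following is a minimal sketch of that idea in plain Java, with no browser involved; the class `TitleGrouping` and its methods are hypothetical names, and each input row is assumed to be a {title, url} pair such as the values returned by `getText()` and `getAttribute("href")` in the Selenium code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Selenium.
final class TitleGrouping {
    // Groups URLs by trimmed title. Each row is assumed to be {title, url}.
    static Map<String, List<String>> groupByTitle(List<String[]> links) {
        Map<String, List<String>> byTitle = new LinkedHashMap<>();
        for (String[] link : links) {
            byTitle.computeIfAbsent(link[0].trim(), t -> new ArrayList<>())
                   .add(link[1].trim());
        }
        return byTitle;
    }

    // Keeps only the titles that map to more than one URL.
    static Map<String, List<String>> duplicatesOnly(Map<String, List<String>> byTitle) {
        Map<String, List<String>> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : byTitle.entrySet()) {
            if (entry.getValue().size() > 1) {
                duplicates.put(entry.getKey(), entry.getValue());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String root = "http://localhost/duplicatexontent/";
        List<String[]> links = new ArrayList<>();
        links.add(new String[] {"About QA", root + "about-qa"});
        links.add(new String[] {"About QA", root + "about-qa-1"});
        links.add(new String[] {"Code Runner", root + "code-runner"});
        links.add(new String[] {"About QA", root + "about-qa-10"});

        // Only "About QA" has more than one URL, so only it is printed.
        duplicatesOnly(groupByTitle(links))
                .forEach((title, urls) -> System.out.println(title + " -> " + urls));
    }
}
```

Grouping by title does not depend on how the CMS names its URLs, but it reports the original page together with its copies, so it answers a slightly different question than the suffix check.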
Output
===== Duplicate contents =======
Number of duplicate contents =: 2
About QA http://localhost/duplicatexontent/about-qa-1
About QA http://localhost/duplicatexontent/about-qa-10
===== Fresh contents =======
Number of Fresh contents =: 10
About QA http://localhost/duplicatexontent/about-qa
Code Runner http://localhost/duplicatexontent/code-runner
ACCESS DENIED http://localhost/duplicatexontent/access-denied
Circulation http://localhost/duplicatexontent/circulation
Digital Advertising http://localhost/duplicatexontent/digital-advertising
Summary of Body http://localhost/duplicatexontent/summary-body
Webinars http://localhost/duplicatexontent/webinars
Videos http://localhost/duplicatexontent/videos
Resources http://localhost/duplicatexontent/resources
News http://localhost/duplicatexontent/news
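One limitation of checking only the URL ending is that a page whose own address ends in a hyphen and digits would be reported as a duplicate as well. A possible refinement, sketched here under that assumption, is to strip the numeric suffix and confirm that the resulting base URL is also present in the list. The class `BaseUrlCheck`, its methods, and the `top-10` URL are made up for illustration and do not appear in the demo page.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Selenium.
final class BaseUrlCheck {
    // Removes a trailing "-<digits>" from the URL, if there is one.
    static String stripNumericSuffix(String url) {
        return url.replaceFirst("-\\d+$", "");
    }

    // A URL counts as a confirmed duplicate only if it has a numeric suffix
    // and the URL without that suffix is also in the set.
    static boolean isConfirmedDuplicate(String url, Set<String> allUrls) {
        String base = stripNumericSuffix(url);
        return !base.equals(url) && allUrls.contains(base);
    }

    public static void main(String[] args) {
        String root = "http://localhost/duplicatexontent/";
        Set<String> allUrls = new LinkedHashSet<>(Arrays.asList(
                root + "about-qa",
                root + "about-qa-1",
                root + "about-qa-10",
                root + "top-10")); // made-up URL: numbered, but no "top" page exists
        for (String url : allUrls) {
            System.out.println(url + " -> " + isConfirmedDuplicate(url, allUrls));
        }
    }
}
```

With this data, about-qa-1 and about-qa-10 are confirmed because about-qa exists, while the made-up top-10 page is not, since there is no base page for it.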