How to Find Duplicate Content on a Website or Web Page in Selenium WebDriver Using Java

Duplicate content is content that appears on a website under more than one URL with the same title. When you create more than one page with the same title under the same content type in a popular CMS such as Drupal, Joomla, or WordPress, the pages share a title but get different URLs. For example, I created three basic pages titled 'About QA' in Drupal, and the three resulting URLs are 'http://localhost/duplicatexontent/about-qa', 'http://localhost/duplicatexontent/about-qa-0' and 'http://localhost/duplicatexontent/about-qa-1'. Duplicate content can also appear when a script migrates data such as pages, orders, or products from one website to another. This tutorial shows how to find duplicate content using Selenium WebDriver.
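The URL pattern described above can be checked without a browser. The following is a minimal sketch, not part of the original tutorial: the class name DuplicateUrlCheck and the method looksLikeDuplicate are hypothetical, and it assumes the CMS marks a repeated title by appending a hyphen and a number to the URL.

```java
import java.util.regex.Pattern;

public class DuplicateUrlCheck {
    // Assumption: a duplicate page gets "-<number>" appended to its URL,
    // as in the about-qa-0 and about-qa-1 examples.
    private static final Pattern NUMERIC_SUFFIX = Pattern.compile("\\S+-[0-9]+");

    // Returns true when the whole URL ends with a hyphen followed by digits.
    public static boolean looksLikeDuplicate(String url) {
        return url != null && NUMERIC_SUFFIX.matcher(url.trim()).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://localhost/duplicatexontent/about-qa",
            "http://localhost/duplicatexontent/about-qa-0",
            "http://localhost/duplicatexontent/about-qa-1"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + looksLikeDuplicate(url));
        }
    }
}
```

This is only a heuristic: a legitimate page whose URL happens to end in a hyphen and a number would also be flagged, so the results are best treated as candidates to review.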


Demo Duplicate Content HTML Page Code
 <!DOCTYPE html>  
 <html>  
 <head>  
 <title>duplicate</title>
 </head>  
 <body>  
 <div align="center">  
 <table border="1">  
  <thead>   
   <tr>   
       <th>Title</th>   
       <th>Type</th>  
        <th>Author</th>  
      </tr>  
  </thead>  
  <tbody>   
   <tr>  
       <td><a href="http://localhost/duplicatexontent/about-qa">About QA</a> </td>  
       <td>Basic page</td>  
        <td>hiro</td>  
      </tr>   
       <tr>  
        <td><a href="http://localhost/duplicatexontent/about-qa-1">About QA</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td><a href="http://localhost/duplicatexontent/code-runner">Code Runner</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td> <a href="http://localhost/duplicatexontent/access-denied">ACCESS DENIED</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td><a href="http://localhost/duplicatexontent/circulation">Circulation</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>   
       <tr>  
        <td><a href="http://localhost/duplicatexontent/digital-advertising">Digital Advertising</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td><a href="http://localhost/duplicatexontent/summary-body">Summary of Body</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td> <a href="http://localhost/duplicatexontent/webinars">Webinars</a></td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td><a href="http://localhost/duplicatexontent/videos">Videos</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td> <a href="http://localhost/duplicatexontent/resources">Resources</a></td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>        
       <tr>  
        <td> <a href="http://localhost/duplicatexontent/news">News</a></td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>  
   <tr>  
        <td><a href="http://localhost/duplicatexontent/about-qa-10">About QA</a> </td>  
        <td>Basic page</td>  
        <td>hiro</td>  
       </tr>         
  </tbody>  
 </table>  
 </div>  
 </body>  
 </html>  

HTML page Output


Demo Selenium WebDriver Code for the Duplicate Content HTML Page Above
 import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.TimeUnit;
 import org.openqa.selenium.By;
 import org.openqa.selenium.WebDriver;
 import org.openqa.selenium.WebElement;
 import org.openqa.selenium.firefox.FirefoxDriver;

 public class Duplicatecontentshandler {
   public static void main(String[] args) {
     List<String> freshcontents = new ArrayList<>();
     List<String> duplicatecontents = new ArrayList<>();
     WebDriver driver = new FirefoxDriver();
     try {
       driver.manage().window().maximize();
       // set the implicit wait before loading the page
       driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
       driver.get("file:///C:/Users/Hiro%20Mia/Desktop/Blog%20content/duplicate%20contents.html");
       // findElements returns an empty list (it does not throw) when nothing matches
       List<WebElement> urllist = driver.findElements(By.tagName("a"));
       for (WebElement url : urllist) {
         String href = url.getAttribute("href");
         if (href == null) {
           continue; // skip anchors without an href
         }
         href = href.trim();
         String entry = url.getText().trim() + "  " + href;
         // check duplicate content: the URL ends with a hyphen followed by one or more digits
         if (href.matches("[^\\s]+\\-[0-9]+")) {
           // store duplicate contents into duplicatecontents List variable
           duplicatecontents.add(entry);
         } else {
           // store fresh content into freshcontents List variable
           freshcontents.add(entry);
         }
       }
     } finally {
       // always close the browser, even if an exception was thrown
       driver.quit();
     }
     System.out.println("===== Duplicate contents =======");
     System.out.println("Number of duplicate contents =: " + duplicatecontents.size());
     for (String duplicate : duplicatecontents) {
       System.out.println(duplicate);
     }
     System.out.println("\n===== Fresh contents =======");
     System.out.println("Number of Fresh contents =: " + freshcontents.size());
     for (String fresh : freshcontents) {
       System.out.println(fresh);
     }
   }
 }

Output
 ===== Duplicate contents =======  
 Number of duplicate contents =: 2  
 About QA  http://localhost/duplicatexontent/about-qa-1  
 About QA  http://localhost/duplicatexontent/about-qa-10  
 ===== Fresh contents =======  
 Number of Fresh contents =: 10  
 About QA  http://localhost/duplicatexontent/about-qa  
 Code Runner  http://localhost/duplicatexontent/code-runner  
 ACCESS DENIED  http://localhost/duplicatexontent/access-denied  
 Circulation  http://localhost/duplicatexontent/circulation  
 Digital Advertising  http://localhost/duplicatexontent/digital-advertising  
 Summary of Body  http://localhost/duplicatexontent/summary-body  
 Webinars  http://localhost/duplicatexontent/webinars  
 Videos  http://localhost/duplicatexontent/videos  
 Resources  http://localhost/duplicatexontent/resources  
 News  http://localhost/duplicatexontent/news  
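The output lists the first 'About QA' page as fresh because only its URL is inspected. Since duplicate content is defined here as one title appearing under more than one URL, a complementary check is to group links by title. The following is a rough sketch under that assumption, not the tutorial's method: the class TitleGrouping and the method duplicatesByTitle are hypothetical names, and each title/URL pair is assumed to come from url.getText() and url.getAttribute("href") in the WebDriver loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TitleGrouping {
    // Each link is a {title, url} pair. Returns only the titles that
    // appear under more than one URL, in first-seen order.
    public static Map<String, List<String>> duplicatesByTitle(List<String[]> links) {
        Map<String, List<String>> byTitle = new LinkedHashMap<>();
        for (String[] link : links) {
            byTitle.computeIfAbsent(link[0].trim(), k -> new ArrayList<>()).add(link[1].trim());
        }
        // Drop titles that map to a single URL.
        byTitle.values().removeIf(urls -> urls.size() < 2);
        return byTitle;
    }

    public static void main(String[] args) {
        List<String[]> links = Arrays.asList(
            new String[] {"About QA", "http://localhost/duplicatexontent/about-qa"},
            new String[] {"About QA", "http://localhost/duplicatexontent/about-qa-1"},
            new String[] {"News", "http://localhost/duplicatexontent/news"},
            new String[] {"About QA", "http://localhost/duplicatexontent/about-qa-10"});
        duplicatesByTitle(links).forEach((title, urls) ->
            System.out.println(title + " appears under " + urls.size() + " URLs: " + urls));
    }
}
```

Grouping by title does not depend on how the CMS names its URLs, but it compares the link text exactly, so titles that differ only in case or spacing inside the text would count as different.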
