Finding Duplicate Documents in MongoDB

Posted By : Mohd Adnan | 30-Nov-2018

Web Apps

MongoDB Aggregation

Finding duplicate values in your database can be difficult, especially if you have millions of documents in the collection.
MongoDB's aggregation features make it achievable by allowing customization and provide flexibility as to how documents are grouped together and filtered.

Finding Duplicates with Aggregations

You can find duplicate values within your MongoDB database using the aggregate method along with the $group and $match aggregation pipeline operators.

Let's see how to use the aggregate method in MongoDB.

db.list.aggregate([ ... ]);

The aggregate function takes the parameter as an array of aggregation operators.
The $group operator is the first one we want to use, and we want to set it up so it groups our documents by the REF number.

db.list.aggregate([ {$group: { _id: {REF: "$REF"} } } ]);

We'll need an _id field first, which is required because it indicates what we are grouping by the unique Id. In this case, we are grouping together the REF numbers so $REF is used to point to the REF field. The key REF within _id can have any name.

Running the aggregation using only the $group pipeline operator, we'll get something like the following below from the MongoDB shell. However, the $group pipeline operator doesn't order documents so you may get varying results:

{ "_id" : { "REF" : 5821 } }
{ "_id" : { "REF" : 35142 } }
{ "_id" : { "REF" : 32744 } }
{ "_id" : { "REF" : 22013 } }
{ "_id" : { "REF" : 26120 } }

In order to view documents that have the same REF number, we'll need to collect them together and show their IDs. For that, we can create another field to preserve the IDs as we move through the pipeline. What we'll do is associate a REF with all of the MongoDB documents that have the same REF number in an array. The array will be created with the $addToSet operator and it will include all the ObjectIds of the documents with the same REF number using each document's _id field represented by $_id. If there is more than one ObjectId associated with a REF number in the uniqueIds array, then we've found a duplicate.

db.list.aggregate([ {$group: { _id: {REF: "$REF"}, uniqueIds: {$addToSet: "$_id"} } } ]);

At this breakpoint, we could possibly stop and check to see which groupings have more than one unique ObjectId and decide which document we would like to keep. However, if we had hundreds or thousands of documents. It would become very time consuming and to track for all of the groupings with more than one unique ObjectId. So, let's make it a little easier by including another field in our $group pipeline operator that will count the documents. The count field uses the $sum operator which adds the expression 1to the total for this group for each document in the group. When used in the $groupstage, $sum returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key.

This aggregation will look like the following:

db.list.aggregate([ {$group: { _id: {REF: "$REF"}, uniqueIds: {$addToSet: "$_id"}, count: {$sum: 1} } } ]);

Running this aggregation, we'll get the number of documents that have the same REFin the count field.

{ "_id" : { "REF" : 34985 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6c5") ], "count" : 1 }
{ "_id" : { "REF" : 34172 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6c6") ], "count" : 1 }
{ "_id" : { "REF" : 21900 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6cd") ], "count" : 1 }
{ "_id" : { "REF" : 29444 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6d0") ], "count" : 1 }

To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match pipeline operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1. This looks like the following:

db.list.aggregate([
    {$group: {
        _id: {REF: "$REF"},
        uniqueIds: {$addToSet: "$_id"},
        count: {$sum: 1}
        }
    },
    {$match: { 
        count: {"$gt": 1}
        }
    }
]);

The result of this query will give us only the grouped documents that have more than one document with the same value.

{ "_id" : { "REF" : 31111 }, "uniqueIds" : [ ObjectId("5a9535b6a44447fca16fa6e3"), ObjectId("5a9535b6a44447fca16fa6e4") ], "count" : 2 }
{ "_id" : { "REF" : 3204 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6d4"), ObjectId("5a9535b6a44447fca16fa6dc") ], "count" : 2 }
{ "_id" : { "REF" : 19487 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6d6"), ObjectId("5a9535b6a44447fca16fa6dd") ], "count" : 2 }

The filtering of the results to include REF groupings with more than one document allows for a lot of the noise to be removed. At times, you may have more groupings with a lot more duplicates. Hence, you may like to order them so that the groupings with the most duplicates appear at the top of your results, or in descending order. MongoDB provides us with the $sort pipeline operator to select the fields that we want to sort our groupings by. To sort by the 'count' field in descending order, we can use the expression -1, and now our aggregation will look like:

db.list.aggregate([
    {$group: {
        _id: {REF: "$REF"},
        uniqueIds: {$addToSet: "$_id"},
        count: {$sum: 1}
        }
    },
    {$match: { 
        count: {"$gt": 1}
        }
    },
    {$sort: {
        count: -1
        }
    }
]);

Now, the results will be in descending order on the basis of count field.

{ "_id" : { "REF" : 58915 }, "uniqueIds" : [ ObjectId("5a9535b6a44447fca16fa6df"), ObjectId("5a9535b6a44447fca16fa6e0"), ObjectId("5a9535b6a44447fca16fa6e1") ], "count" : 3 }
{ "_id" : { "REF" : 32984 }, "uniqueIds" : [ ObjectId("5a9535b6a44447fca16fa6e3"), ObjectId("5a9535b6a44447fca16fa6e4") ], "count" : 2 }

MongoDB's aggregate method allows you to find duplicates in just a few lines of code.

Related Tags

About Author

Mohd Adnan

Adnan, an experienced Backend Developer, boasts a robust expertise spanning multiple technologies, prominently Java. He possesses an extensive grasp of cutting-edge technologies and boasts hands-on proficiency in Core Java, Spring Boot, Hibernate, Apache Kafka messaging queue, Redis, as well as relational databases such as MySQL and PostgreSQL. Adnan consistently delivers invaluable contributions to a variety of client projects, including Vision360 (UK) - Konfer, Bitsclan, Yogamu, Bill Barry DevOps support, enhedu.com, Noorisys, One Infinity- DevOps Setup, and more. He exhibits exceptional analytical skills alongside a creative mindset. Moreover, he possesses a fervent passion for reading books and exploring novel technologies and innovations.

Ready to innovate? Let's get in touch

Attach files

Recaptcha is required.

Backend

Full Stack

Frontend

Blockchain

Mobile

Video Streaming

E-commerce

ERP

CMS

Devops

AR/VR

Software Development Services

Metaverse Innovation & Consulting

Digital Experience

Digital Trivergence

Data Services

Scaffold

Company

Finding Duplicate Documents in MongoDB

Posted By : Mohd Adnan | 30-Nov-2018

Related Tags

About Author

Mohd Adnan

More From Oodles

Ready to Expand? Discover PWA Tactics That Drive Growth

In this article, we will explore in-depth how progressive enhancement can elevate your digital expansion approach, its principles, and how you can implement it to improve your accessibility.

Arpita Pal | 22-Jan-2025

Essential Security Practices for Securing Your Business’s PWA

In this article, we will explore the potential threats to your PWA and impactful strategies to safeguard it against these risks.

Arpita Pal | 30-Dec-2024

Elevate Your Digital Presence with Our PWA Optimization Solutions

In this article, we have provided a detailed discussion of the most effective strategies to elevate its user experience, performance and functionality.

Arpita Pal | 29-Nov-2024

Ready to innovate? Let's get in touch

Valued Services

Resources

Expertise

Connect with us

© Copyright 2025 Oodles Technologies Pvt Ltd. All rights reserved.