Finding Duplicate Documents in MongoDB
Posted By : Mohd Adnan | 30-Nov-2018
Finding duplicate values in your database can be difficult, especially if you have millions of documents in the collection.
MongoDB's aggregation features make it achievable by allowing customization and provide flexibility as to how documents are grouped together and filtered.
Finding Duplicates with Aggregations
You can find duplicate values within your MongoDB database using the aggregate method along with the $group and $match aggregation pipeline operators.
Let's see how to use the aggregate method in MongoDB.
db.list.aggregate([ ... ]);
The aggregate function takes the parameter as an array of aggregation operators.
The $group operator is the first one we want to use, and we want to set it up so it groups our documents by the REF number.
db.list.aggregate([ {$group: { _id: {REF: "$REF"} } } ]);
We'll need an _id field first, which is required because it indicates what we are grouping by the unique Id. In this case, we are grouping together the REF numbers so $REF is used to point to the REF field. The key REF within _id can have any name.
Running the aggregation using only the $group pipeline operator, we'll get something like the following below from the MongoDB shell. However, the $group pipeline operator doesn't order documents so you may get varying results:
{ "_id" : { "REF" : 5821 } }
{ "_id" : { "REF" : 35142 } }
{ "_id" : { "REF" : 32744 } }
{ "_id" : { "REF" : 22013 } }
{ "_id" : { "REF" : 26120 } }
In order to view documents that have the same REF number, we'll need to collect them together and show their IDs. For that, we can create another field to preserve the IDs as we move through the pipeline. What we'll do is associate a REF with all of the MongoDB documents that have the same REF number in an array. The array will be created with the $addToSet operator and it will include all the ObjectIds of the documents with the same REF number using each document's _id field represented by $_id. If there is more than one ObjectId associated with a REF number in the uniqueIds array, then we've found a duplicate.
db.list.aggregate([ {$group: { _id: {REF: "$REF"}, uniqueIds: {$addToSet: "$_id"} } } ]);
At this breakpoint, we could possibly stop and check to see which groupings have more than one unique ObjectId and decide which document we would like to keep. However, if we had hundreds or thousands of documents. It would become very time consuming and to track for all of the groupings with more than one unique ObjectId. So, let's make it a little easier by including another field in our $group pipeline operator that will count the documents. The count field uses the $sum operator which adds the expression 1to the total for this group for each document in the group. When used in the $groupstage, $sum returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key.
This aggregation will look like the following:
db.list.aggregate([ {$group: { _id: {REF: "$REF"}, uniqueIds: {$addToSet: "$_id"}, count: {$sum: 1} } } ]);
Running this aggregation, we'll get the number of documents that have the same REFin the count field.
{ "_id" : { "REF" : 34985 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6c5") ], "count" : 1 }
{ "_id" : { "REF" : 34172 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6c6") ], "count" : 1 }
{ "_id" : { "REF" : 21900 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6cd") ], "count" : 1 }
{ "_id" : { "REF" : 29444 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6d0") ], "count" : 1 }
To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match pipeline operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1. This looks like the following:
db.list.aggregate([
{$group: {
_id: {REF: "$REF"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
]);
The result of this query will give us only the grouped documents that have more than one document with the same value.
{ "_id" : { "REF" : 31111 }, "uniqueIds" : [ ObjectId("5a9535b6a44447fca16fa6e3"), ObjectId("5a9535b6a44447fca16fa6e4") ], "count" : 2 }
{ "_id" : { "REF" : 3204 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6d4"), ObjectId("5a9535b6a44447fca16fa6dc") ], "count" : 2 }
{ "_id" : { "REF" : 19487 }, "uniqueIds" : [ ObjectId("5a9535b5a44447fca16fa6d6"), ObjectId("5a9535b6a44447fca16fa6dd") ], "count" : 2 }
The filtering of the results to include REF groupings with more than one document allows for a lot of the noise to be removed. At times, you may have more groupings with a lot more duplicates. Hence, you may like to order them so that the groupings with the most duplicates appear at the top of your results, or in descending order. MongoDB provides us with the $sort pipeline operator to select the fields that we want to sort our groupings by. To sort by the 'count' field in descending order, we can use the expression -1, and now our aggregation will look like:
db.list.aggregate([
{$group: {
_id: {REF: "$REF"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
},
{$sort: {
count: -1
}
}
]);
Now, the results will be in descending order on the basis of count field.
{ "_id" : { "REF" : 58915 }, "uniqueIds" : [ ObjectId("5a9535b6a44447fca16fa6df"), ObjectId("5a9535b6a44447fca16fa6e0"), ObjectId("5a9535b6a44447fca16fa6e1") ], "count" : 3 }
{ "_id" : { "REF" : 32984 }, "uniqueIds" : [ ObjectId("5a9535b6a44447fca16fa6e3"), ObjectId("5a9535b6a44447fca16fa6e4") ], "count" : 2 }
MongoDB's aggregate method allows you to find duplicates in just a few lines of code.
Cookies are important to the proper functioning of a site. To improve your experience, we use cookies to remember log-in details and provide secure log-in, collect statistics to optimize site functionality, and deliver content tailored to your interests. Click Agree and Proceed to accept cookies and go directly to the site or click on View Cookie Settings to see detailed descriptions of the types of cookies and choose whether to accept certain cookies while on the site.
About Author
Mohd Adnan
Adnan, an experienced Backend Developer, boasts a robust expertise spanning multiple technologies, prominently Java. He possesses an extensive grasp of cutting-edge technologies and boasts hands-on proficiency in Core Java, Spring Boot, Hibernate, Apache Kafka messaging queue, Redis, as well as relational databases such as MySQL and PostgreSQL. Adnan consistently delivers invaluable contributions to a variety of client projects, including Vision360 (UK) - Konfer, Bitsclan, Yogamu, Bill Barry DevOps support, enhedu.com, Noorisys, One Infinity- DevOps Setup, and more. He exhibits exceptional analytical skills alongside a creative mindset. Moreover, he possesses a fervent passion for reading books and exploring novel technologies and innovations.