Metadata

Metadata, courtesy of http://www.flickr.com/photos/sjcockell/6126442977

Metadata is data that describes and provides information about other data. A classic example is the information about a digital photo, such as the ISO settings, resolution, pixel depth, exposure, camera settings, camera type and so on.

Let’s look at some sample metadata for an image.

Field Value
File name img_1771.jpg
File size 32764 Bytes
MIME type image/jpeg
Image size 480 x 360
Camera make Canon
Camera model Canon PowerShot S40
Image timestamp 2003-12-14 12-01-44
Image number 117-1771
Exposure time 1500 s
Aperture F4.9
Exposure bias 0 EV
Flash No, auto

Each image might contain a different mix of fields depending on what features the camera supports. A cell phone might add GPS coordinates as metadata, while another might add more detailed flash information.

Obviously, it’s unfeasible to create an index for each possible metadata field. The schema we present in this chapter is optimized to allow for quick and efficient retrieval of documents by metadata fields.

Schema Observations

  • Efficiently query in diverse sets of metadata tags, using a single index.
  • All metadata keys are represented in the index, meaning it can get fairly large.
  • Optimized for a read heavy workload.

Schema

In our schema we are going to leverage the fact that one can index arrays of objects easily in MongoDB. By using this ability, we can create a metadata field in a document, that can be efficiently queried.

{
  "metadata": [
    {"key": "File Name", "value": "img_1771.jpg"},
    {"key": "File size", "value": 32764},
    {"key": "MIME type", "value": "image/jpeg"},
    {"key": "Image size", "value": {"width": 480, "height": 360}},
    {"key": "Camera make", "value": "Canon"},
    {"key": "Camera model", "value": "Canon PowerShot S40"},
    {"key": "Image timestamp", "value": ISODate("2014-01-01T10:01:00Z")},
    {"key": "Image number", "value": "117-1771"},
    {"key": "Exposure time", "value": "1/500 s"},
    {"key": "Aperture", "value": "F4.9"},
    {"key": "Exposure bias", "value": "0 EV"},
    {"key": "Flash", "value": "No, auto"}
  ]
}

Example 1: A document containing the metadata schema

Operations

To correctly query this schema, we need to learn about two query selection operators that allow us to match embedded documents in an array.

$all

The $all query operator is defined as selecting all the documents where the value of a field is an array that contains all the specified elements.

var col = db.getSisterDB("supershot").images;
col.findOne({tags: {$all: [ "appliance", "school", "book" ]}});

Example 2: A simple $all query example

$elemMatch

The $elemMatch query operator matches more than one component within an array element.

var col = db.getSisterDB("supershot").images;
col.findOne({metadata: {$elemMatch: {key: "File Name", value: "img_1771.jpg"}}});

Example 3: A simple $elemMatch query example

Let’s see how we can use the two operators to locate documents in our metadata schema.

The $elemMatch operator looks like the obvious first choice. However, the problem is that our metadata array is defined as objects that all have key and value fields. If you attempt to enter multiple matches using key and value in the $elemMatch only the last pair will be used.

If we wish to locate a photo that has MIME type equal to image/jpeg, and also Flash equal to No, auto we need to combine the two query operators $all and $elemMatch.

Let’s take a look at how to query our desired documents.

Querying Documents by Multiple metadata fields

var col = db.getSisterDB("supershot").images;
col.find({ metadata: { $all: [
            { "$elemMatch" : { key : "MIME type", value: "image/jpeg" } },
            { "$elemMatch" : { key: "Flash", value: "No, auto" } }
          ]}
       }).toArray();

Example 4: Locate images using multiple metadata fields

The first $elemMatch operator will locate all the documents with the MIME type equal to image/jpeg and then filter on the Flash key.

Indexes

To provide efficient retrieval of documents using the metadata schema, we need to create an index that allows us to optimize the query using the $elemMatch and $all operators.

For this we will need a compound index on the key and value fields inside the metadata array.

var col = db.getSisterDB("supershot").images;
db.images.ensureIndex({"metadata.key": 1, "metadata.value": 1});

Example 5: Create index for metadata field

Scaling

Secondary Reads

If the site is read heavy (say a photo album site), it might make sense to offload reads to secondary servers allowing the read load to be spread out.

Using secondary reads comes down to the level of acceptable latency for your application is willing to live with as there might be some delay between a write happening on a primary and the document being visible on the secondary.

Sharding

There is no efficient way to shard documents by a generic metadata key as the general nature of the schema precludes the usage of any of the fields in the metadata structure as a shard key.

The metadata would typically be part of the narrowing of a query and the shard key would be related to the purpose of the document containing the metadata structure.

Let’s take the image example and amend it with a user Id. We’ll would use the userId field as the shard key, route the query to a single shard and then narrow the query using the metadata fields.

{
  , "userId": "olepeteraaa1"
  , "metadata": [
    {"key": "File Name", "value": "img_1771.jpg"},
    {"key": "File size", "value": 32764},
    {"key": "MIME type", "value": "image/jpeg"},
    {"key": "Image size", "value": {"width": 480, "height": 360}},
    {"key": "Camera make", "value": "Canon"},
    {"key": "Camera model", "value": "Canon PowerShot S40"},
    {"key": "Image timestamp", "value": ISODate("2014-01-01T10:01:00Z")},
    {"key": "Image number", "value": "117-1771"},
    {"key": "Exposure time", "value": "1/500 s"},
    {"key": "Aperture", "value": "F4.9"},
    {"key": "Exposure bias", "value": "0 EV"},
    {"key": "Flash", "value": "No, auto"}
  ]
}
important

Using the userId as the shard key will ensure user specific images will be located on the same shard.

Performance

A simple exploration of the performance on a single machine with MongoDb 3.0 shows the difference between MMAP and WiredTiger for a narrow simulation using the schema simulation framework mongodb-schema-simulator.

Scenario

https://github.com/christkv/mongodb-schema-simulator/blob/master/examples/scripts/single_or_replset/metadata_access_scenario.js

MongoDb runs on a MacBook Pro Retina 2013 with 512 GB ssd and 16 gb ram. The simulation runs with the following parameters against a single mongodb instance under osx 10.10 Yosemite.

Parameters Value
processes 4
poolSize per process 50
type linear
Resolution in milliseconds 1000
Iterations run 25
Number of users querying using metadata per iteration 1000
Execution strategy slicetime

MMAP

The MMAP engine is run using the default settings on MongoDB 3.0.1.

Metadata Read Simulation

metadata scenario results

Statistics Value
Runtime 40.874 seconds
Mean 0.954 milliseconds
Standard Deviation 0.291 milliseconds
75 percentile 1.187 milliseconds
95 percentile 1.367 milliseconds
99 percentile 1.619 milliseconds
Minimum 0.441 milliseconds
Maximum 4.885 milliseconds

As we can see the 1000 users a second impacts the minimum and maximum as well as the average query time a fair bit.

WiredTiger

The WiredTiger engine is run using the default settings on MongoDB 3.0.1.

Metadata Read Simulation

metadata scenario results

Statistics Value
Runtime 40.01 seconds
Mean 0.992 milliseconds
Standard Deviation 0.318 milliseconds
75 percentile 1.232 milliseconds
95 percentile 1.431 milliseconds
99 percentile 1.706 milliseconds
Minimum 0.484 milliseconds
Maximum 9.827 milliseconds

As expected there is not much difference between the MMAP and WiredTiger storage engines since this is mainly a read only workload.

Notes

important

Runtime versus Iterations

The number of iterations here is 25 meaning that every 1000 milliseconds we start up another 1000 users attempting to read metadata. We can see that the total runtime is around ~40 seconds instead of 25 seconds, which means we have surpased the maximum continous read capacity of the MongoDB instance on the used hardware. If we adjusted down the simulation load until the Runtime of the scenario was around ~25 seconds, we could estimate the maxmimum load our topology can handle.