Create my own "shard" / "fragment" key

I’m having trouble researching this one; the right terms are eluding me. Say I have a hexadecimal string that is always unique (http://php.net/manual/en/class.mongoid.php). I need a way to “shard” or split these keys across N groups. It doesn’t really matter how many groups, preferably 16 or 32 I suppose. But I need to ensure they get split evenly, and that none are overlooked.

The term for this within MongoDB is a “shard key”, but I need to recreate my own in my own programming language (PHP or Python). Any ideas, or terms I can use to get researching / testing?
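The usual search terms for this are “hash partitioning” or “hash-based bucketing” (and “consistent hashing” if the number of groups may change later). A minimal Python sketch of the idea, assuming a hypothetical `bucket` helper and an MD5 digest as the hash (any stable hash of the whole string works):

```python
import hashlib

def bucket(key: str, n_buckets: int = 16) -> int:
    """Deterministically map an arbitrary string key into one of n_buckets groups."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Interpret the digest as a big integer; mod spreads keys evenly across buckets.
    return int(digest, 16) % n_buckets
```

Because the hash is deterministic, the same key always lands in the same group, and because MD5 output is uniformly distributed, the groups come out roughly even with no key ever overlooked.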

If it helps at all my end goal is to be able to multi thread a task on data sets.

Why not just use MongoDB’s sharding, or a message queue? That is really what you are after.

PHP generally won’t help, as it isn’t internally capable of multi-threading, and sharding anything typically requires that. Python would do, but why bother when you could just submit messages to RabbitMQ or something similar?

I’ll most likely be using Python, though I can very easily create artificial multi-threading in PHP (I may do this, as I’m very surprised to find that early benchmarks show PHP loops to be nearly twice as fast). Keep in mind I’m not actually looking to “shard” the data, only to split it into groups for operations. If I were dealing with a plain number, I might just grab the last digit and split it 10 ways.
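The last-digit trick has a direct analogue for a hex string: the final hex character is already a value from 0 to 15. A hypothetical sketch (note that this is only evenly distributed if the last character of the ID cycles uniformly, which happens to hold for MongoDB ObjectIds since they end in an incrementing counter; for arbitrary strings, hash the whole key instead):

```python
def last_hex_group(hex_id: str, n_groups: int = 16) -> int:
    """Group a hex string by its last character, like grabbing the last digit of a number."""
    # int(..., 16) parses the final hex character as 0-15; mod handles other group counts.
    return int(hex_id[-1], 16) % n_groups
```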

Gotcha. This really is still a job for message queuing, which could likely be tuned to group things based on some parameter. The other big question is: does this calculation have cross-cutting concerns? I.e., massively parallelizing a running total that needs everything else to be calculated first won’t actually help.

The basic operation will be to summarize an array across a time period for each given “ID”:


```js
{
  ID: "2v23kksdf", // my fake id, like it?
  day: "1/2/2013",
  log: [
    { action: "foo", duration: 3 },
    { action: "foo", duration: 2 },
    { action: "bar", duration: 5 }
  ]
}
{
  ID: "2v23kksdf",
  day: "1/3/2013",
  log: [
    { action: "foo", duration: 2 },
    { action: "foo", duration: 7 },
    { action: "bar", duration: 1 }
  ]
}
```

It will wind up being a month’s worth of data that I’ll be summarizing: per ID, summarize each action in `log` into a new collection.
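In Python terms, that summarization step could look something like the sketch below (the `summarize` name and the `(ID, action)` key shape are my assumptions, not from the thread):

```python
from collections import defaultdict

def summarize(docs):
    """Total the log durations per (ID, action) across a batch of documents."""
    totals = defaultdict(int)
    for doc in docs:
        for entry in doc["log"]:
            totals[(doc["ID"], entry["action"])] += entry["duration"]
    return dict(totals)
```

Each worker thread (or process) would run this over its own group of IDs, which is exactly why a stable key-to-group split matters: every document for a given ID must land with the same worker so its totals are complete.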

Gotcha.

This is a classic map-reduce scenario; why not just use that feature directly in MongoDB? It takes this down to perhaps a 15-minute coding exercise.

Other options are CouchDB, Hadoop, or RavenDB if you want to go .NET.

PS: in case you are wondering what map reduce is, this is the best explanation I can think of: http://ayende.com/blog/4435/map-reduce-a-visual-explanation
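To make the pattern concrete with the log documents above, here is a toy two-phase sketch in Python; the function names are hypothetical and this illustrates the map/emit/reduce shape rather than MongoDB’s actual API:

```python
from collections import defaultdict

def map_doc(doc):
    # Map phase: emit one (key, value) pair per log entry.
    for entry in doc["log"]:
        yield (doc["ID"], entry["action"]), entry["duration"]

def reduce_pairs(pairs):
    # Reduce phase: combine every value emitted under the same key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

The point of the split is that the map phase is embarrassingly parallel (each document is independent), and only the reduce phase needs all the emitted pairs for a given key in one place.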

Well, while the scenario I provided would be best suited to map-reduce, I DO have other operations I’ll be doing that wouldn’t fit that model. (Map-reduce is on my list of things to learn, quickly.) I’ve learned that MongoDB’s map-reduce is horrible, and its aggregation framework is just as bad. I’ll soon be getting an instance of Hadoop running alongside Mongo to handle some of my larger operations.

I appreciate the pointers; many times members do need that push toward what they were “really” looking for. But for this exercise, I really do need a way to convert a string into something I can easily group, if you have any ideas :slight_smile:

I was looking at some sort of base-10 conversion for a bit, but I would like to keep the operation on the SQL side (yes, this operation might be done on SQL Server; it just depends).

Gotcha.

A simple approach you could write as a SQL stored proc: take a count, alphabetize the DISTINCT IDs, chop the list into sections, update each item with a batch number, group by batch, etc. Once you get that far, SQL should eat it right up.
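The count/alphabetize/chop steps above can be sketched in Python terms (a minimal sketch; the `assign_batches` name is mine, not from the thread):

```python
def assign_batches(ids, n_batches):
    """Alphabetize the distinct IDs and chop the list into n roughly equal batches."""
    distinct = sorted(set(ids))
    size = -(-len(distinct) // n_batches)  # ceiling division: items per batch
    # Each ID's batch number is just its position in the sorted list divided by the chunk size.
    return {id_: i // size for i, id_ in enumerate(distinct)}
```

On the SQL side, `NTILE(n) OVER (ORDER BY id)` on SQL Server does the same alphabetize-and-chop in a single window function. Note that unlike hashing, this split is only stable while the set of IDs stays fixed: adding new IDs shifts the boundaries.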