MongoDB into AWS Redshift
We've got a pretty big MongoDB instance with sharded collections. It's reached the point where it's becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
1. AWS Redshift
2. Hadoop + Hive

We want to be able to use SQL syntax to analyze our data, and we want close to real-time access to the data (a few minutes of latency is fine; we just don't want to wait for the whole of MongoDB to sync overnight).
As far as I can gather, for option 2, one can use https://github.com/mongodb/mongo-hadoop to move data from MongoDB into a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. Looking at the Amazon articles, it seems like the right way to go is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone who has done something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding our own migrator using NodeJS. I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what we had to do in the end.
Timestamped data
Basically, we ensure that all the MongoDB collections we want migrated to tables in Redshift are timestamped, and indexed according to that timestamp.
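For illustration, a minimal sketch of that setup (the updatedAt field name and the connection details are my assumptions, not something specified above):

    // Sketch: every collection we migrate carries an "updatedAt" field, and we
    // keep an index on it so the delta queries below stay cheap.
    const { MongoClient } = require('mongodb');

    async function ensureTimestampIndex(uri, dbName, collectionName) {
      const client = await MongoClient.connect(uri);
      try {
        await client.db(dbName)
          .collection(collectionName)
          .createIndex({ updatedAt: 1 });
      } finally {
        await client.close();
      }
    }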
Plugins returning cursors
We wrote a plugin for each migration we want from a Mongo collection to a Redshift table. Each plugin returns a cursor that takes the last migrated date into account (passed to it from the migrator engine), and returns only the data that has changed since the last successful migration for that plugin.
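A sketch of what such a plugin's cursor function could look like (the collection name, field name and function name are hypothetical):

    // Hypothetical plugin for an "employees" collection: given the date of the
    // last successful migration, return a cursor over only the changed documents.
    module.exports.getCursor = function (db, lastMigratedDate) {
      const query = lastMigratedDate
        ? { updatedAt: { $gt: lastMigratedDate } }
        : {};
      return db.collection('employees').find(query).sort({ updatedAt: 1 });
    };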
How the cursors are used
The migrator engine uses the cursor and loops through each record. It calls back into the plugin for each record to transform the document into an array, which the migrator uses to create a delimited line that it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
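Roughly, the export loop could look like this (a sketch under assumed names; the actual transform is supplied per plugin):

    const fs = require('fs');

    // Sketch: walk the cursor, let the plugin flatten each document into an
    // array of fields, and stream one tab-delimited line per record to disk.
    async function exportToFile(cursor, transform, filePath) {
      const out = fs.createWriteStream(filePath);
      while (await cursor.hasNext()) {
        const doc = await cursor.next();
        const fields = transform(doc);         // e.g. [id, name, updatedAt, ...]
        out.write(fields.join('\t') + '\n');   // tabs, since the data is full of commas and pipes
      }
      out.end();
      return filePath;
    }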
Delimited exports to S3 and into a table on Redshift
The migrator then uploads the delimited file to S3 and runs the Redshift COPY command to load the file from S3 into a temp table, using the plugin configuration to name it and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
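The upload and COPY step could look roughly like this (the bucket, key and IAM role ARN are placeholders, and the aws-sdk/pg clients are my assumptions; Redshift speaks the Postgres wire protocol, so the pg client works against it):

    const fs = require('fs');
    const AWS = require('aws-sdk');
    const { Client } = require('pg');

    // Sketch: push the tab-delimited export to S3, then COPY it into the temp
    // table (assumed to already exist, created from the plugin's schema file).
    async function loadIntoTempTable(filePath, redshiftConfig) {
      await new AWS.S3().upload({
        Bucket: 'my-migration-bucket',              // placeholder bucket
        Key: 'exports/employees.tsv',               // placeholder key
        Body: fs.createReadStream(filePath),
      }).promise();

      const client = new Client(redshiftConfig);
      await client.connect();
      await client.query(`
        COPY temp_employees
        FROM 's3://my-migration-bucket/exports/employees.tsv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'  -- placeholder role
        DELIMITER '\\t'`);
      await client.end();
    }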
Now we've got the data in a temp table, and every record in the temp table has an id originating from the MongoDB collection. That allows us to run a delete against the target table, in our example the employees table, for any id present in the temp table. If any of the tables don't exist, they get created on the fly, based on the schema provided by the plugin. Then we insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so a deleted record simply comes across with an updated is_deleted flag in Redshift.
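The merge itself is just SQL run against Redshift; a sketch of the delete-then-insert, following the employees example (and assuming the temp and target tables share the same column order):

    // Sketch: upsert by delete-then-insert, keyed on the ids now in the temp table.
    // Soft-deleted documents simply arrive as updates with is_deleted set, so they
    // need no special handling here.
    async function mergeTempIntoTarget(client) {
      await client.query('BEGIN');
      await client.query(
        'DELETE FROM employees USING temp_employees WHERE employees.id = temp_employees.id');
      await client.query('INSERT INTO employees SELECT * FROM temp_employees');
      await client.query('COMMIT');
    }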
Once the whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran for it. That value is passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
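The bookkeeping can be as simple as one small table (the table and column names below are assumptions):

    // Sketch: remember when this plugin last migrated successfully, so the next
    // run can hand the date back to the plugin's cursor.
    async function saveLastMigrated(client, pluginName, migratedAt) {
      await client.query('DELETE FROM migration_state WHERE plugin = $1', [pluginName]);
      await client.query(
        'INSERT INTO migration_state (plugin, last_migrated_at) VALUES ($1, $2)',
        [pluginName, migratedAt]);
    }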
So in summary, each plugin/migration provides the following to the engine:
- A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
- A transform function, which the engine uses to turn each document in the cursor into a delimited string that gets appended to the export file.
- A schema file, a SQL file containing the schema for the table at Redshift.