After asking the following question on stackoverflow.com it was suggested I could get better answers here. So here we go. Any information is appreciated.

In my Delphi service application I need to store a huge number of documents, mostly images, in some type of database, along with metadata such as file names and sizes and derived data such as OCR results. A single document can be up to around 300 MB, while the total database size might reach the terabyte range. Storage and retrieval performance is important.

Until now I have handled this with mORMot's TSynBigTable (along with their REST HTTP/WebSockets server over IInterface), but that storage solution doesn't seem very stable. It is also a single file, so it has a hard size limit imposed by the file system. On top of that, it gets corrupted rather easily (some problem with its memory cache/disk flush logic, I think).

I believe Lucene would be perfect for my needs, but it has no Delphi port as far as I know. Another alternative I can think of is MongoDB, but I have no idea how stable it will be when handling the sizes mentioned above.

Ideally, I should be able to embed it into my service application.

I could use some advice on what to use. Any ideas? Any experience with data sizes like these with MongoDB in Delphi, or something similar?

(I hope I explained this specifically enough that it's not too general a question.)

Edit: I forgot to mention that I need to be able to search that metadata. For instance, I need to index the OCR text and search it.

Edit2: I think it's important to mention that I am NOT proposing to store the files in an SQL database such as MySQL or MSSQL.
http://stackoverflow.com/questions/30751809/lucene-or-similar-document-store-for-delphi

Comments

  1. As far as I know, MongoDB is very stable and fast. It looks like it could solve your task. The DBA at my company recommends it highly.

  2. On the systems I design and implement, I store the files in the filesystem and the files' metadata in the database. For images I also make thumbnails and store them alongside the originals: a list view of the files shows the thumbnails for speed, and the original image is loaded only when the user opens the file. I believe you can have two services with mORMot, one for sending the metadata from the database and one for sending the actual files, without many changes to your system.
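
As a rough illustration of the split described in this comment (files on disk, metadata in a database), here is a minimal Python sketch. SQLite stands in for the metadata database, and the `documents` table layout, the function names, and the hash-based directory scheme are all assumptions made for illustration; none of them come from the thread. Thumbnail generation is left out because it would need an imaging library.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_catalog(db_path):
    """Open (or create) the metadata database. Hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id          INTEGER PRIMARY KEY,
               file_name   TEXT    NOT NULL,
               size_bytes  INTEGER NOT NULL,
               sha256      TEXT    NOT NULL,
               stored_path TEXT    NOT NULL
           )"""
    )
    return conn


def store_document(conn, root, file_name, data):
    """Write the bytes to the filesystem, then record metadata in the database."""
    digest = hashlib.sha256(data).hexdigest()
    # Spread files over subdirectories so no single directory grows huge.
    target = Path(root) / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    with conn:  # commit the metadata row, or roll back on error
        cur = conn.execute(
            "INSERT INTO documents(file_name, size_bytes, sha256, stored_path) "
            "VALUES (?, ?, ?, ?)",
            (file_name, len(data), digest, str(target)),
        )
    return cur.lastrowid


def load_document(conn, doc_id):
    """Look up the path in the database and read the file from disk."""
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return Path(row[0]).read_bytes()


root = tempfile.mkdtemp()
conn = open_catalog(":memory:")
doc_id = store_document(conn, root, "scan-001.tif", b"fake image bytes")
```

For documents in the 300 MB range, a real implementation would probably stream the file in chunks rather than hold it in memory as this sketch does.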

    By the way you can get "Instant MongoDB" e-book today for free at:
    https://www.packtpub.com/packt/offers/free-learning?utm_source=Sentori&utm_campaign=Free+Learning+30th+April+15

  3. There is also a very stable and actively developed mongo-delphi-driver on GitHub.

  4. Cheers for the information. I'm currently checking out MongoDB. I think I might go the Synopse mORMot way for the client, though. It looks fast and offers a nice ORM approach.

  5. mORMot supports MongoDB as backend for its ORM.
    You could also access MongoDB directly via its SynMongoDB unit.
    But IMHO neither GridFS nor MongoDB is very good for serving files, since they divide each file into chunks before storing it.
    A NAS may be a good alternative, in your particular case.
    See http://stackoverflow.com/a/30758546/458259

  6. I wrote a COM wrapper (for my use case) in Visual Studio around Lucene.net (the C# version of Lucene). It obviously requires the COM object to be installed when you install your software on other PCs, but it is amazingly fast. I limit the number of results from Lucene to 200 (too many results means the user has to refine their search). I return the relationship ID stored in the Lucene "document" (among other things, like lat and lng to plot the document location on Google Maps), then build a query to retrieve the database-specific items my app uses (written in Delphi XE7). I have currently indexed over 12,000 documents (not gigantic, but it serves my needs), and a user-entered query from the UI, across the internet (n-tier app) and back with results, usually takes around 300 ms. I'm VERY happy with the results. If you know C# and need to learn Lucene, go to manning.com. They have a Lucene (Java) book you can pick up to learn how everything works. Hope that helps.

  7. Simon Stuart  If you don't mind answering, from a 10,000 foot view, how are you integrating ElasticSearch with your Delphi app?  Using a java bridge, or something different?

  8. Simon Stuart  Cool!  I have Manning's "Elasticsearch in Action", but haven't had a chance to read it yet.  I didn't know it had a RESTful client interface.  I was thinking it was a JAR/class file you pulled into your Java project.

  9. I do not advise using MongoDB. I read a very well articulated article on why it is essentially useless for most use cases.

    I can find it again if anybody's interested.

    A

  10. Consider testing SQLite3 FTS4, even with no stored text, just indexes. It works very well. See http://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITL_8
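
To make the "just indexes, no stored text" idea concrete, here is a minimal Python sketch using a contentless FTS4 table (`content=""`), which keeps the full-text index but not the indexed text itself. It assumes the SQLite build behind Python's `sqlite3` module has FTS4 enabled, and the table name `ocr_index`, the column `body`, and the helper functions are hypothetical names chosen for illustration. This is plain SQLite, not the mORMot API.

```python
import sqlite3


def build_index(conn):
    # content="" makes the FTS4 table contentless: only the index is kept.
    conn.execute(
        'CREATE VIRTUAL TABLE IF NOT EXISTS ocr_index USING fts4(content="", body)'
    )


def index_document(conn, doc_id, ocr_text):
    # Contentless tables need an explicit docid; here it is the document's ID
    # in whatever store holds the actual file.
    conn.execute(
        "INSERT INTO ocr_index(docid, body) VALUES (?, ?)", (doc_id, ocr_text)
    )


def search(conn, query):
    # Only docid can be read back; the OCR text itself is not stored.
    rows = conn.execute(
        "SELECT docid FROM ocr_index WHERE ocr_index MATCH ? ORDER BY docid",
        (query,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn)
index_document(conn, 1, "invoice number 4711 from Acme")
index_document(conn, 2, "delivery note for Acme warehouse")
index_document(conn, 3, "holiday photo")
```

A search then returns document IDs, which the application would use to fetch the file and its metadata from wherever they are stored.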

  11. Andrea Raimondi Please stop trolling. MongoDB is a great NoSQL database, especially with its latest WiredTiger engine. But it has its purpose, of course.

  12. This is the original article: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

    It is a bit old, but I feel it's very well argued ;)

    I am no expert in MongoDB, but this article highlights quite clearly the doubts I have about all NoSQL storage. If they are now including some sort of SQL, that is good.

    A

  13. Andrea Raimondi The author of that article did it the wrong way, confusing everything and also confusing you.
    The article describes using MongoDB to store graph data - which is a wrong idea: you need a graph NoSQL engine!
    The author then fell back to a flat schema on MySQL and compared that with MongoDB... If you expect NoSQL to be an RDBMS on steroids, you will be disappointed.
    Take a look at what a real expert writes: http://martinfowler.com/articles/nosqlKeyPoints.html
    NoSQL won't replace RDBMS, but it is another way of storing data. In fact, there are several kinds of NoSQL database. Please also check my blog article at http://blog.synopse.info/post/2014/02/28/Are-NoSQL-databases-ACID and the notion of "Aggregate", which comes from DDD.
    IMHO you could use MongoDB, but with a front-end like our little mORMot, which allows you to switch from an RDBMS to MongoDB when needed - even translating the SQL queries on the fly - and still be able to store complex nested information - http://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITL_83
    Perhaps PostgreSQL, with its JSONB column type, is a good mix of NoSQL and RDBMS. Or even SQLite3 + a TDocVariant field for storing metadata in our ORM - see http://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITLE_73

  14. Hi!

    From your link:
    - To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade off consistency versus latency.
    - The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.

    I have a simple question: show me a use case where not having consistency is acceptable.

    I am really curious because I can't think of one.

    A

  15. Andrea Raimondi In the real world, you do NOT have automatic consistency. For instance, if you get married, all your relatives (and institutions) have to be notified that your marital status changed. Please read about data modeling and state, e.g. in the DDD context. http://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITL_91
    Even in an RDBMS, your data is sometimes not consistent, on purpose. For instance, when you do accounting, you never replace previous data: if you need to "delete" an event, you "add" a new event with the reverse money value, to bring the balance back to its previous state. In short: consistency depends on the bounded context on which you are currently focusing! The latest state of an account has to be accurate, but you won't forget the previous states.
    BigData does not care much about consistency. For instance, if you are tracking web site clicks, missing several clicks is not an issue, when you have billions. When you are doing statistics, you may not need accurate data, just representative data.
    Last but not least, NoSQL could be ACID - see my previous link. Even over a network, MongoDB has modes to ensure that the data modification is propagated to all slaves before acknowledging it.

  16. A. Bouchez On the other hand, in accounting you need consistency for all intermediate states, otherwise that's a recipe for losing money - or laundering money, depending on your point of view :)

    Typically if you have a movement between two accounts, it will be (at least) two accounting entries, and these have to be consistent. The current balance can be an aggregate or computed state, but any time it is displayed, the current balance is at best a "snapshot" of the actual current balance anyway.
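
The two-entry movement described here can be sketched with an ordinary SQL transaction. This is a minimal Python example under stated assumptions: SQLite as the store, a hypothetical append-only `entries` table, made-up account names, and integer amounts (e.g. cents). Both entries of a movement are written inside one transaction, so either both are stored or neither is, and the balance is computed from the entries rather than stored.

```python
import sqlite3


def open_ledger():
    conn = sqlite3.connect(":memory:")
    # Append-only table: rows are only ever inserted, never updated or deleted.
    conn.execute(
        "CREATE TABLE entries ("
        "id INTEGER PRIMARY KEY, account TEXT NOT NULL, amount INTEGER NOT NULL)"
    )
    return conn


def transfer(conn, source, target, amount):
    """Record a movement as two entries that succeed or fail together."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    with conn:  # commits both inserts, or rolls both back on an exception
        conn.execute(
            "INSERT INTO entries(account, amount) VALUES (?, ?)", (source, -amount)
        )
        conn.execute(
            "INSERT INTO entries(account, amount) VALUES (?, ?)", (target, amount)
        )


def balance(conn, account):
    """The balance is an aggregate over all entries of the account."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM entries WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]


ledger = open_ledger()
transfer(ledger, "alice", "bob", 250)
# "Deleting" the movement means adding a reversing pair, not removing rows.
transfer(ledger, "bob", "alice", 250)
```

After the reversal both balances are back at zero, while all four entries remain in the table as history.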

    A few crypto-currency exchanges using NoSQL DBs learned about the troubles of consistency and atomicity at their own expense (cf. FlexCoin & PoloniEx).

    And even when tracking billions of clicks, you cannot always afford to miss some. For instance, when those clicks are tied to revenue sharing, one out of tens of thousands of clicks can be the one that matters (the one that made a sale). The consistency can be "eventual" (delayed), but it's still required at some point.

  17. Sorry for the delayed response, but I was really busy.

    It seems to me that A. Bouchez  is confusing consistency with accuracy.
    Consistency, in my world-view, means that - at any point in time - whatever node you look at, you have the same data.
    Accuracy, on the other hand, means - always in my world-view - that at any point in time the data reflects the situation.

    So, it is true that - in some cases - even RDBMSes can be both inconsistent and inaccurate. This is particularly true in complex business applications which have convoluted replication processes. There is, however, a difference: in the RDBMS world, you can always go to the source and fix the replicated data, because you have one endpoint from which you replicate and one to which you replicate. This isn't necessarily true with NoSQL data, where you have no way to say which node must be the trusted source. This is why there are issues if you have to change the underlying DB to either another NoSQL or to an RDBMS.

    Am I missing something here?

    A

  18. IMHO both are linked. If your DB is not consistent, it won't be accurate.
    And my answer wanted to point out that ACID is not always mandatory.
    A single master / several slaves replication scheme exists in both RDBMS and NoSQL: as I stated, MongoDB has a dedicated acknowledged mode for that.
    No need to introduce NoSQL: you may have issues even if you do not change the configuration or setup of your very same RDBMS system. Just change the storage engine under MySQL, and you will find out.
    In all cases, if you change one RDBMS to another, you may find oddities.
    And if you expect NoSQL to be a replacement for an RDBMS, you will fail for sure.
    One of the rules I know is the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem): you can pick only two of the three - Consistency, Availability, Partition tolerance... This affects all kinds of databases.
    Writing "I do not advise MongoDB", with no argument but a deprecated and weak blog article, does not sound right to me. In short: when to advise MongoDB? or NoSQL? or MSSQL? or MySQL? or PostgreSQL? or  Oracle? It depends... and sadly most of the time the decision is not taken from the technological point of view, but from high management or business affinity/habits...

