Question
We are working on a fulfillment system for an online book store. Every day we receive tens of CSV files (about 90 files) with data about purchased orders.
Example:
OrderId | BookTitle        | Quantity | Price | BuyerName | ...
1       | The hunger games | 1        | 10    | John Doe  | ...
1       | The Exorcist     | 2        | 25    | John Doe  | ...
2       | Superman         | 1        | 5.5   | Jane Nose | ...
What we currently do:
The data needs to be translated into our domain entities (Order, OrderItem, etc.). We load the file into dummy tables, then use very complex queries to populate the Orders and OrderItems tables.
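To give an idea of their shape, the population step boils down to queries roughly like the following (heavily simplified; the real queries carry much more logic, and the table/column names are just taken from the example above):
-- Simplified illustration of the population step.
insert into Orders (OrderId, BuyerName)
select distinct OrderId, BuyerName
from dummyOrders;

insert into OrderItems (OrderId, BookTitle, Quantity, Price)
select OrderId, BookTitle, Quantity, Price
from dummyOrders;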
Problem:
All the business logic lives in the database (in the queries), so unit testing it is a nightmare. In addition, we were asked to change the database we are using (not exactly, but something terrifyingly similar).
What we tried:
We are moving to a DDD approach where the domain model is the core of the application that holds all the logic and validation.
We tried to parse the file line by line, create the respective entities, validate them accordingly, and then save them to the database.
This resulted in thousands of calls to the database, and the server sometimes runs out of memory because of the number of records in a single file (around 90 thousand records per file).
I know this problem domain is very old and someone must have found a better way to handle such a task. How do you process such huge files in your domain without suffering performance problems?
UPDATE
We are using a MySQL database with PHP and Apache.
The CSV file is loaded into a dummy table using something similar to this:
load data local infile 'orders.csv' into table dummyOrders
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
(orderId, bookTitle, Quantity, Price, BuyerName, ...);
Explanation / Answer
You were doing it right - load the file into a dummy table (preferably on a staging database), then manipulate the data with stored procedures. Those are easy to test: each one is little more than a single function, and you can put test data in the tables and run your sprocs against that data inside a transaction that gets rolled back afterwards, so the tests stay isolated regardless of what data already exists in there.
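A minimal sketch of that test pattern, assuming a procedure called LoadOrdersFromStaging and the dummyOrders/Orders tables from the question (the names are examples, not your actual schema):
-- Seed known rows, run the procedure under test, check the outcome, then roll
-- back so nothing is left behind, whatever data was already in the tables.
start transaction;
insert into dummyOrders (OrderId, BookTitle, Quantity, Price, BuyerName)
values (1, 'The Hunger Games', 1, 10, 'John Doe');
call LoadOrdersFromStaging();                    -- hypothetical sproc under test
select count(*) from Orders where OrderId = 1;   -- expect 1 header row
rollback;
This works as long as the procedure itself does not commit; the DML it runs simply joins the surrounding transaction.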
Once you have your data in a good state, you migrate it to the production DB. That database does not need to contain any of the loading SQL, so it can be simpler to manage.
So basically, split your operations in two: staging and production. I've used this approach with millions of records and got very good performance (though with SQL Server, not MySQL).
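If the staging and production schemas happen to live on the same MySQL server, the hand-over can be as simple as copying the already-cleaned rows (the schema names here are assumptions); otherwise a dump/restore or a small export job does the same thing:
insert into production.Orders (OrderId, BuyerName)
select OrderId, BuyerName from staging.Orders;

insert into production.OrderItems (OrderId, BookTitle, Quantity, Price)
select OrderId, BookTitle, Quantity, Price from staging.OrderItems;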