DB Migrations #221


Closed · wants to merge 4 commits

Conversation

@dpordomingo (Contributor) commented Mar 20, 2018

depends on #206 and #232

required by #197 and #233

This PR defines the migration from our current Staging DB, which has no uast columns, to a new internal DB with nullable uast columns in the file_pairs table, as required by #206.
This PR also removes the diff column from the internal DB, as required by #232.
This PR also drops the features table from the internal DB, as required by #233.

How to use it:

  1. use an old internal database (internal.db) → can be downloaded from the export tool
  2. use a source DB containing UASTs (source.db) → from the ML issue, stored in gdrive
  3. ensure that the source DB has the UASTs to import inside a files table
  4. run the migrations
# prepares the current "internal.db": adds "uast_a" and "uast_b" nullable cols
go run cli/migrations/*.go uast-add-cols internal.db
# imports UASTs into "internal.db", reading from "source.db"
go run cli/migrations/*.go uast-import internal.db source.db
# prepares the current "internal.db": removes the "diff" col
go run cli/migrations/*.go diff-rm-col internal.db
# drops the "features" table from the current "internal.db"
go run cli/migrations/*.go features-drop-table internal.db

How to validate the result

With this query you can fetch the FilePairs whose UASTs could not be imported:

sqlite3 internal.db 'SELECT blob_id_a, uast_a, blob_id_b, uast_b
 FROM file_pairs
 WHERE uast_a IS NULL or uast_b IS NULL;'
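As a toy illustration of what that query flags (hypothetical schema and data, not the real internal.db), a pair whose UAST failed to import shows up through its NULL column:

```shell
# Toy illustration of the NULL check above; schema and data are hypothetical.
sqlite3 pairs.db "
CREATE TABLE file_pairs (blob_id_a TEXT, uast_a BLOB, blob_id_b TEXT, uast_b BLOB);
INSERT INTO file_pairs VALUES
  ('a1', X'01', 'b1', X'02'),  -- both UASTs imported
  ('a2', NULL,  'b2', X'03');  -- uast_a failed to import
"
# lists only the pair whose import failed
sqlite3 pairs.db "SELECT blob_id_a, blob_id_b FROM file_pairs WHERE uast_a IS NULL OR uast_b IS NULL;"
```

Only the second row is returned, since IS NULL is the only way to match a missing blob (`uast_a = NULL` would never match).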

If you made a backup of the file_pairs table, you can check that the primary keys were preserved by running:

sqlite3 internal.db 'SELECT count(*)
  FROM file_pairs_backup fb, file_pairs f
  WHERE
    fb.id=f.id and
    fb.blob_id_a=f.blob_id_a and
    fb.blob_id_b=f.blob_id_b and
    fb.uast_a=f.uast_a and
    fb.uast_b=f.uast_b;'

The returned value is the number of rows whose primary keys and UAST values still match the backup; it should equal the total row count of file_pairs.
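For completeness, the file_pairs_backup table used in the check above can be created before migrating with a plain CREATE TABLE ... AS SELECT. A sketch with a hypothetical toy schema (the real internal.db has more columns):

```shell
# Sketch: snapshot file_pairs before migrating. Toy schema, for illustration only.
sqlite3 demo.db "
CREATE TABLE file_pairs (id INTEGER PRIMARY KEY, blob_id_a TEXT, blob_id_b TEXT);
INSERT INTO file_pairs (blob_id_a, blob_id_b) VALUES ('a1', 'b1'), ('a2', 'b2');
"
# copies every row, including the ids (note: constraints are not copied, only data)
sqlite3 demo.db "CREATE TABLE file_pairs_backup AS SELECT * FROM file_pairs;"
sqlite3 demo.db "SELECT count(*) FROM file_pairs_backup;"
```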

Disclaimer

These are not real migrations with up/down support.

You can read more details about these scripts in the migrations/README.md

@dpordomingo dpordomingo added the feature-request (New feature or request from the users) label Mar 20, 2018
@dpordomingo dpordomingo self-assigned this Mar 20, 2018
@dpordomingo dpordomingo requested review from smacker, carlosms and bzz March 20, 2018 18:22
@dpordomingo dpordomingo mentioned this pull request Mar 20, 2018
12 tasks
@bzz (Contributor) commented Mar 21, 2018

Seems like only commits after 6e6169f need to be reviewed in here.

@@ -0,0 +1,58 @@
package main
@bzz (Contributor) Mar 21, 2018

Wow, +325 LOC for a not-a-real migration that adds 2 columns to 1 table? That seems a LOT... sorry if I have misunderstood the purpose of this; please do not hesitate to clarify that here.

Do you think there might be a way to avoid this complexity, and rather create a separate issue for adopting a real migration tool instead?

As discussed IRL (I'm sorry if that was not articulated well enough), I think it's nice to have a DML SQL command in a small .sql file to add the new columns. From my side, having a custom mechanism implemented to handle this task is 🙅‍♂️, as at least for now it seems like a lot of overhead (that needs to be maintained) to have custom code that needs to be go run, etc.

@dpordomingo (Author) replied

There are two scripts.

The first one adds the new columns. It's 58 lines because it wraps the query in a command; most of the lines are command documentation.

The second one is 176 lines to assign UASTs to existing FilePairs. It is needed because I couldn't guarantee that the provided table with UASTs is the same as the current one, so it could not simply be swapped in; even more: I could guarantee that it is different, with different ids and fewer cols. Transforming the provided one into the current one may require more work, and it would not be reusable.
Because of that, the command just looks for FilePairs without a UAST and fills them in from the source table.

About having separate commands instead of a bunch of queries: it seems easier to test and reproduce the process, shortening the time needed for the migration itself.
The commands themselves can be reused to create more commands while staying consistent between them. I copied the idea from Borges, which provides a clean way to pack its commands.

About the maintenance overhead, I cannot see how, since it won't be needed (same as for the other cli scripts in CAT, which are not supported nor prod ready).
If you are really concerned about having unmaintained code in the repo, we can delete all these scripts tomorrow, or as soon as we finish this task. Anyway, I think it's better to validate that the modifications to be done are what we expect.

About creating an issue to add a migration tool, I agree on that, but as we agreed, it is something that should maybe be done in the future.

About creating an issue to add a migration tool, I agree on that, but as we agreed, it is something that should be maybe done in the future.

@bzz (Contributor) left a comment

👍 For really nice PR description, with detailed test plan, etc.

If I did not misunderstand the purpose of this in #221 (review): for this change I would strongly prefer a single DML statement in a single .sql file instead of custom code.

@smacker (Contributor) commented Mar 21, 2018

Sorry, I don't get why we need all this code for a one-time task, and especially why we need it in master.

PS. It looks like it should be possible to achieve just by this (pseudo code):

$ sqlite source.db
> SELECT 'UPDATE file_pairs SET uast_a = ' + uast_a + ', uast_b = ' + uast_b + ' WHERE blob_id_a = ' + blob_id_a +  ' AND blob_id_b = ' + blob_id_b ';' FROM files INTO FILE 'data_migration.sql';
$ sqlite internal.db
> ALTER TABLE file_pairs ADD COLUMN uast_a BLOB;
> ALTER TABLE file_pairs ADD COLUMN uast_b BLOB;
$ sqlite internal.db < data_migration.sql
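For the record, a runnable SQLite rendering of the pseudo code above might look like this. It is a sketch with toy schemas and *_demo.db file names (hypothetical, to keep it clearly separate from the real DBs); in SQLite, string concatenation is || rather than +, and quote() renders blobs as X'..' literals, which is one possible way around the binary-data packing issue:

```shell
# Toy source DB with a "files" table holding the UASTs (hypothetical schema).
sqlite3 source_demo.db "
CREATE TABLE files (blob_id_a TEXT, blob_id_b TEXT, uast_a BLOB, uast_b BLOB);
INSERT INTO files VALUES ('a1', 'b1', X'CAFE', X'BEEF');
"

# Toy internal DB with a "file_pairs" table still lacking the uast columns.
sqlite3 internal_demo.db "
CREATE TABLE file_pairs (blob_id_a TEXT, blob_id_b TEXT);
INSERT INTO file_pairs VALUES ('a1', 'b1');
"

# Generate one UPDATE statement per source row; quote() emits X'..' blob
# literals and properly quoted strings, so the SQL survives binary data.
sqlite3 source_demo.db "SELECT 'UPDATE file_pairs SET uast_a = ' || quote(uast_a)
  || ', uast_b = ' || quote(uast_b)
  || ' WHERE blob_id_a = ' || quote(blob_id_a)
  || ' AND blob_id_b = ' || quote(blob_id_b) || ';'
FROM files;" > data_migration.sql

# Add the columns, then apply the generated statements.
sqlite3 internal_demo.db "ALTER TABLE file_pairs ADD COLUMN uast_a BLOB;"
sqlite3 internal_demo.db "ALTER TABLE file_pairs ADD COLUMN uast_b BLOB;"
sqlite3 internal_demo.db < data_migration.sql
```

After this, the toy file_pairs row carries the blobs copied over from the toy files table.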

@carlosms (Contributor) left a comment

Didn't see any obvious problems in the code.

Taking into account that this is a one-off internal script that won't be supported, I won't comment on whether it is overkill or not.

@dpordomingo (Author)

Thanks @smacker for your suggestion.
The problem I found is that the UAST contains binary data that must be properly packed to be used in the scripts. Fixing and testing that would consume more or less the same time as doing it programmatically.

On the other hand, if we agree that non-production code should not be in master, I can accept that, but then we should maybe also create a new issue to drop the other non-production code that is already in master.

To be consistent, we could also schedule that task to be done separately.

@dpordomingo dpordomingo force-pushed the migrations branch 2 times, most recently from dcd750d to 25d847e Compare March 22, 2018 17:21
@bzz (Contributor) left a comment

I still think this approach is stretching the limits a bit, and I politely disagree with the argument that having multiple commands brings no maintenance overhead, but I commit to the solution that @dpordomingo, as the author of this PR and a maintainer of this project, suggests.

LGTM

@dpordomingo (Author)

The main goal of this PR was to ensure that there was no error in the migration process.
I understand from the conversation that all of you are concerned about adding more and more non-production or non-reusable code. Since I share this concern, I'm OK with applying the migrations locally and holding off this merge until further agreement or decision.

@dpordomingo dpordomingo force-pushed the migrations branch 2 times, most recently from 06bc285 to fd6524b Compare March 23, 2018 16:11
@smacker (Contributor) commented Mar 26, 2018

@dpordomingo applying migration locally without merging to master would be great! Thanks!

@dpordomingo dpordomingo force-pushed the migrations branch 2 times, most recently from b487c67 to 9d59aff Compare March 27, 2018 17:02
@dpordomingo dpordomingo mentioned this pull request Mar 27, 2018
@dpordomingo dpordomingo force-pushed the migrations branch 2 times, most recently from 1913173 to f113671 Compare April 4, 2018 18:39
@dpordomingo dpordomingo changed the title UAST Migrations DB Migrations Apr 4, 2018
@bzz (Contributor) commented Apr 5, 2018

@dpordomingo is there anything left before merging this?

@dpordomingo (Author)

I understood from #221 (comment) that we won't merge it, due to your concerns explained in the discussion above.
I will close this PR once the migrations have been applied, unless you prefer to do it in a different way.

@dpordomingo (Author)

Migrations were successfully applied, and the new DB was pushed to Prod and Staging.
The dpordomingo:migrations branch will be kept open so further migrations can be stored there.

The last migration will be applied once the ML Feature Extraction API has been integrated into CAT:

@dpordomingo dpordomingo closed this Apr 9, 2018
@dpordomingo dpordomingo removed the request for review from smacker April 9, 2018 11:50