published on Saturday July 13, 2019 00:07:00
I haven't blogged in a long time and decided to write a post today. As a data engineer at Agoda, one of my day-to-day tasks is supporting users from various teams.
To give you some background, our team owns ETL tools (for importing/exporting data to and from Hadoop), a data discovery web application, and a customized version of HUE with some extra features.
The problem
Last Thursday, one of my colleagues from the BI team ran into an issue with one of our tools. She had an ETL job
(essentially a Sqoop job) that imports data from Postgres into a Hive table. She noticed that a column in her target
table had been renamed: the column was called native in the source, but it showed up as _native in the target Hive table.
She also had another ETL job that selects data from the target table created by the first job, and it failed because
the column names no longer matched.
Investigating the problem
At first glance, I suspected that Sqoop (or something along the way) must have renamed the native column to _native
because it looked like some kind of special keyword. I Googled the lists of reserved words in Postgres,
Hive, and Sqoop, but native wasn't on any of them.
Since I already had the Sqoop source code cloned on my local machine, I could simply search it with
$ ag "\bnative\b"
which returned 8 matches from the repo. Sweet! 😊
The search results look something like this:
CHANGELOG.txt
24: * [SQOOP-1429] - Fix native characters usage for SqlServer object names
...
src/java/org/apache/sqoop/orm/ClassWriter.java
96: JAVA_RESERVED_WORDS.add("native");
src/test/com/cloudera/sqoop/orm/TestClassWriter.java
344: assertEquals("_native", ClassWriter.toJavaIdentifier("native"));
I could clearly see that, as expected, Sqoop did rename the column native to _native. But why?
After reading these files from the Sqoop source code:
- src/java/com/cloudera/sqoop/orm/ClassWriter.java
- src/java/org/apache/sqoop/orm/ClassWriter.java
- src/test/com/cloudera/sqoop/orm/TestClassWriter.java
as well as this blog post, I now understand why it does that. Roughly speaking, when you run a Sqoop import, Sqoop first generates ORM classes based on the data model of the source table and then uses those classes in the MapReduce job that performs the actual import.
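To illustrate the idea behind ClassWriter.toJavaIdentifier(), here is a minimal sketch, not Sqoop's actual implementation: the class name ColumnNameSanitizer and the shortened keyword list are mine, but the grep above shows that ClassWriter registers "native" in JAVA_RESERVED_WORDS and maps it to "_native".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the renaming idea: a column name that collides with a
// Java keyword cannot be used as a field name in the generated ORM class,
// so an underscore is prepended. (Illustrative only; the real ClassWriter
// also handles characters that are invalid in Java identifiers, etc.)
public class ColumnNameSanitizer {

    // A few entries for illustration; Sqoop registers the full keyword list.
    private static final Set<String> JAVA_RESERVED_WORDS =
            new HashSet<>(Arrays.asList("native", "class", "public", "int", "new"));

    public static String toJavaIdentifier(String candidate) {
        // Prepend "_" when the column name is a reserved word in Java.
        return JAVA_RESERVED_WORDS.contains(candidate) ? "_" + candidate : candidate;
    }

    public static void main(String[] args) {
        System.out.println(toJavaIdentifier("native"));  // prints _native
        System.out.println(toJavaIdentifier("country")); // prints country
    }
}
```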
Some experimentation
1. Set up test data on Postgres: a table with a column literally named native is all we need (a minimal sketch follows).
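Here is a sketch of this step using plain JDBC; the connection details, the extra id column, and the sample row are assumptions for this walkthrough, and running the same SQL through psql works just as well:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Creates a small test table with a column literally named "native".
// Requires the PostgreSQL JDBC driver on the classpath; the URL, user,
// and password below are assumptions for a local setup.
public class SetupTestData {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/postgres";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres");
             Statement st = conn.createStatement()) {
            // "native" is not reserved in Postgres, so no quoting is needed here.
            st.execute("CREATE TABLE test_table (id integer, native text)");
            st.execute("INSERT INTO test_table VALUES (1, 'hello')");
        }
    }
}
```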
2. Run the Sqoop codegen command:
$ sqoop codegen \
--driver org.postgresql.Driver \
--connect jdbc:postgresql://localhost/postgres \
--username postgres \
--table test_table
Sqoop import does this step internally, but I ran only the codegen command because I just wanted to see the generated class.
3. The generated class looks like this:
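Roughly, and trimmed to the parts relevant here; the real generated file is a few hundred lines long, and the exact layout depends on the Sqoop version:

```java
// Trimmed, illustrative sketch of the generated test_table.java. The real
// class extends Sqoop's SqoopRecord and implements Hadoop's DBWritable and
// Writable so the MapReduce job can read and write records with it.
public class test_table {

  private Integer id;
  private String _native; // source column "native", renamed because "native" is a Java keyword

  public Integer get_id() {
    return id;
  }

  public String get__native() {
    return _native;
  }

  public void set__native(String _native) {
    this._native = _native;
  }

  // ... readFields(ResultSet), write(PreparedStatement), toString(), etc.
}
```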
As you can see, Sqoop put _ in front of native since native is one of the Java keywords.
Conclusion
- Understanding the internals of a system is crucial and always a satisfying experience.
- Sqoop generates Java ORM classes before it does the actual import in MapReduce.
- A warning log that says "this column has been renamed because ..." would be helpful.