Friday, July 21, 2017

Storing the database password securely in SQOOP



Sqoop 1.4.6 introduced the option "--password-alias", which takes a password alias instead of the actual password on the command line. The argument value for this option is the alias under which the actual password is stored in the credential store.

To use this option, the underlying Hadoop version must provide an API to separate password storage from applications. As of Hadoop 2.6.0, the hadoop credential command provides this facility. The passwords are stored with their aliases in a keystore that is password protected.

Ex: This functionality was tested on the versions below.
Hadoop 2.6.0-cdh5.7.0 
Sqoop 1.4.6-cdh5.7.0

Step 1). Run the below command to see the usage of credential API.
$hadoop credential --help

create: to create a new password alias
delete: to delete already created alias
list: to list the available password alias

Step 2): Create a password alias for your database password as below.
Note: This example uses a MySQL DB with username root and password cloudera.

$ hadoop credential create mydb.password.alias -provider jceks://hdfs/user/cloudera/mydb.password.jceks

Enter password: 
Enter password again: 
mydb.password.alias has been successfully created.

org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

In the above command, mydb.password.alias is an alias for our DB password (cloudera in this example), and jceks://hdfs/user/cloudera/mydb.password.jceks is the HDFS path (/user/cloudera) where the .jceks keystore file is created. The keystore is a password-protected binary file, not plain text, so it is secure.

Step 3): To list the above created alias use the below command.

$ hadoop credential list -provider jceks://hdfs/user/cloudera/mydb.password.jceks

Listing aliases for CredentialProvider: jceks://hdfs/user/cloudera/mydb.password.jceks


mydb.password.alias

Step 4): The Sqoop import command below uses the alias created above in place of the password.

$sqoop import \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/cloudera/mydb.password.jceks \
--connect jdbc:mysql://localhost/testDb \
--table student \
--username root \
--password-alias mydb.password.alias \
--delete-target-dir \
--target-dir /user/cloudera/student \
-m 1

Note: --password-alias is the option to specify the alias value we have created for our database. 






Friday, July 14, 2017

hadoop interview questions

1. What is the superclass for exceptions in Java?
Throwable

2. What is the difference between put and copyFromLocal command?
Both are almost the same: copyFromLocal is restricted to copying from the local filesystem, while put can take its source from anywhere (another HDFS, the local filesystem, etc.).

3. How to change reducer output part file name to user specified names?
Below is the code to rename the reducer output part files from "part-*" to "customName-*".
In the driver code:
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); // avoids the creation of empty default part files
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, IntWritable.class); // registers a named output for the part files

In the reducer code:
Instead of context.write, use multipleOutputs.write(key, totalWordCount, generateFileName(key, totalWordCount));
where generateFileName is a helper that builds the desired output file name.

4. What is speculative execution?
When tasks run slowly, Hadoop doesn't try to diagnose and fix them; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. This is termed speculative execution of tasks.

5. Default no of mappers in sqoop job, what happen if there is no primary key in table?
The default number of map tasks is 4. If the table has no primary key defined, you must either supply --split-by <column> or use the -m 1 option to import the data.

6. Difference between ORDER BY and SORT BY in hive?
In Hive, ORDER BY guarantees a total ordering of the data, but to do so it must pass everything through a single reducer.
SORT BY sorts the data per reducer: each of the N reducers receives sorted data, but reducers can have overlapping ranges, so there is no global order.

7. What is Secondary Sort in hadoop?
With the MapReduce framework, the keys are sorted but the values associated with each key
are not. For the values to be sorted too, we need to write code to perform what is
referred to as a secondary sort.

a) Make the key a composite of the natural key and the natural value  
b) The sort comparator should order by the composite key, that is, the natural key and natural 
value.
c) The partitioner and grouping comparator for the composite key should consider only the natural
key for partitioning and grouping.
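Outside of Hadoop's APIs, the ordering half of a secondary sort can be sketched with a plain Java comparator on a composite key; the record type and names below are hypothetical, purely to illustrate step (b):

```java
import java.util.*;

public class SecondarySortSketch {
    // composite key: the natural key plus the value we also want ordered
    record Composite(String naturalKey, int value) {}

    public static List<Composite> sorted(List<Composite> in) {
        List<Composite> out = new ArrayList<>(in);
        // sort comparator orders by the composite: natural key first, then value
        out.sort(Comparator.comparing(Composite::naturalKey)
                           .thenComparingInt(Composite::value));
        return out;
    }

    public static void main(String[] args) {
        // values now arrive sorted within each natural key, as a reducer would see them
        System.out.println(sorted(List.of(
                new Composite("b", 2), new Composite("a", 9), new Composite("a", 1))));
    }
}
```

In a real job, the partitioner and grouping comparator would additionally ignore the value part, so all records sharing a natural key still reach the same reduce call.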

8). Difference between String, StringBuffer and StringBuilder?
String is immutable (once created it cannot be changed). String literals are stored in the string constant pool.
StringBuffer is mutable, meaning the value of the object can be changed; objects created through StringBuffer are stored on the heap.
StringBuffer has the same methods as StringBuilder, but each method in StringBuffer is synchronized, so StringBuffer is thread safe. StringBuilder is not thread safe; it was introduced in Java 1.5.
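A minimal sketch of the immutable-vs-mutable distinction (the class name is made up):

```java
public class MutabilityDemo {
    public static void main(String[] args) {
        String s = "abc";
        s.concat("def");                 // returns a NEW string; s itself is unchanged
        System.out.println(s);           // abc

        StringBuilder sb = new StringBuilder("abc");
        sb.append("def");                // modifies the same object in place
        System.out.println(sb);          // abcdef

        StringBuffer buf = new StringBuffer("abc");
        buf.append("def");               // same API, but every method is synchronized
        System.out.println(buf);         // abcdef
    }
}
```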

9). What happens if we call the run() method directly instead of calling the start() method?
The start() method makes sure the code runs in a new thread context. If we call run() directly, it is treated like an ordinary method call and runs in the context of the current thread instead of a new one. start() contains the special code that spawns the new thread; run() obviously doesn't have that ability, because we didn't include it when we wrote the run() method.
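The difference can be observed directly by recording which thread executes the task; the class and helper names below are made up for illustration:

```java
public class RunVsStart {
    // records which thread actually executes the task when run() is called directly
    static String runDirectly() {
        final String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = Thread.currentThread().getName(), "worker");
        t.run();                 // ordinary method call: executes on the calling thread
        return seen[0];
    }

    static String startProperly() throws InterruptedException {
        final String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = Thread.currentThread().getName(), "worker");
        t.start();               // spawns a new thread context
        t.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDirectly());    // name of the calling thread, e.g. "main"
        System.out.println(startProperly());  // worker
    }
}
```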

10). What is the difference between ArrayList and Vector?
Vector is synchronized while ArrayList is not. Vector is slow while ArrayList is fast. When it needs to grow, Vector doubles its capacity, while ArrayList increases its size by about 50%. ArrayList's initial capacity is 10.
Default initial capacities: Vector - 10, ArrayList - 10, Hashtable - 11, HashMap - 16, HashSet - 16
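Vector exposes a public capacity() method, so the doubling behaviour can be observed directly (ArrayList deliberately hides its capacity); a quick sketch:

```java
import java.util.Vector;

public class VectorGrowth {
    public static void main(String[] args) {
        Vector<Integer> v = new Vector<>();      // default initial capacity is 10
        System.out.println(v.capacity());        // 10
        for (int i = 0; i < 11; i++) v.add(i);   // the 11th element forces a grow
        System.out.println(v.capacity());        // 20: Vector doubles its capacity
        // ArrayList grows by ~50% instead, but exposes no capacity() method
    }
}
```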

11). What are access modifiers and access specifiers in Java?
There are two types of modifiers in Java: access modifiers and non-access modifiers.
Access modifiers in Java specify the accessibility (scope) of a data member, method, constructor, or class.
There are 4 types of java access modifiers:
private--class level
default--package level
protected--package+inherited classes
public--global access
There are many non-access modifiers such as static, abstract, synchronized, native, volatile, transient etc. 
There are no access specifiers in java.

12). Why String is Immutable in Java ?
The string pool (string intern pool) is a special storage area in the Java heap. When a string is created and the same string already exists in the pool, the reference to the existing string is returned instead of creating a new object. If String were not immutable, changing the string through one reference would lead to the wrong value for the other references.
The hashcode of a string is frequently used in Java, for example in a HashMap. Being immutable guarantees that the hashcode will always be the same, so it can be cached without worrying about changes. That means there is no need to recalculate the hashcode every time it is used, which is more efficient.
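Both points, pooling and the stable hashcode, can be seen in a few lines (the class name is made up):

```java
public class StringPoolDemo {
    public static void main(String[] args) {
        String a = "hello";                       // literal goes to the string pool
        String b = "hello";                       // reuses the pooled object
        System.out.println(a == b);               // true: same reference

        String c = new String("hello");           // forces a new heap object
        System.out.println(a == c);               // false: different reference
        System.out.println(a == c.intern());      // true: intern() returns the pooled one

        // hashCode is cached after the first call, safe because the value cannot change
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```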

13). What is RDD Lineage?
Spark does not replicate data in memory, so if any data is lost it is rebuilt using RDD lineage. Since RDDs are created over a set of transformations, Spark logs those transformations rather than the actual data. The graph of transformations used to produce an RDD is called its lineage graph. If we lose some partition of an RDD, we can replay the transformations for that partition from the lineage to recompute it, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computation.

14). What is the CAP theorem in NoSQL databases?
CAP states that a distributed data store can provide at most two of Consistency, Availability, and Partition tolerance. The combinations available are:

CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that
the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the
partition is resolved).
CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a
node goes down.
AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you
aren't guaranteed that all nodes will have the same data (either during or after the partition)

15). What is the difference between count(*) and count(1) in SQL?
Both count every row, including rows containing NULLs, and return the same result; count(1) simply evaluates the constant 1 for each row. Only count(column) behaves differently: it skips rows where that column is NULL.

16). How to check the location of already created hive external table?
describe formatted <table name>;

17). Explain MapReduce flow in different phases?
MapReduce has four phases:
map,
combine,
shuffle and sort,
reduce.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs, typically on the same physical nodes that hold the input data. Each mapper emits zero, one, or multiple output key/value pairs per input key/value pair; these are called intermediate key/value pairs, and their type usually differs from the input type. The mapper must be supplied by the programmer.
The combine phase is done by combiners. A combiner should combine key/value pairs that share the same key. Each combiner may run zero, one, or multiple times; the framework decides whether and how often to run it, and the programmer has no control over this. A combiner's output key/value types must be the same as its input types.
The shuffle and sort phase is done by the framework. Data from all mappers is grouped by key, split among the reducers, and sorted by key, so each reducer obtains all values associated with a given key. The programmer may supply a custom compare function for sorting and a partitioner for the data split. All key/value pairs going to the same reducer are sorted by key, but there is no global sorting.
The reducer obtains key/[values list] pairs sorted by key, where the values list contains all values produced by the mappers for that key. Each reducer emits zero, one, or multiple output key/value pairs, usually of a different type from the input. The reducer must be supplied by the programmer.

18). What is the clause used for bucketing in hive?
We use CLUSTERED BY clause for bucketing in hive.

19). What is the difference between GROUP and COGROUP?
COGROUP on one table is the same as GROUP. We use COGROUP when we need to group two tables by a column and then join on the grouped column.

20). How to use incremental lastmodified mode in sqoop?
After the initial import, if any changes are made at the source we can use lastmodified mode to import the updates from the source database into Hadoop. For this mode a last-modified timestamp column is mandatory, to identify the rows that changed after the previous import.
Ex: sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P   --check-column mDate --incremental lastmodified --last-value 2014-01-25

21). What is HDFS Federations?
The NameNode keeps a reference to every file and block in the filesystem in memory, which means that on a large cluster with many files, memory becomes the limiting factor. HDFS Federation, available in the 2.x series, allows adding NameNodes, each of which manages a portion of the filesystem namespace, for example /user on namenode1 and /tmp on namenode2. The two NameNodes are independent of each other: a failure of namenode1 does not affect namenode2.

22). What is data-integrity in hdfs?
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every 512 bytes of data by default (a CRC-32 checksum is 4 bytes). Datanodes are responsible for verifying the data they receive before storing the data and its checksum.
It is possible to disable checksum verification by passing false to the setVerifyChecksum() method on the FileSystem before using open() to read a file.
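As an illustration only (HDFS uses its own chunked CRC implementation internally), java.util.zip.CRC32 can show the idea of computing a checksum on write and re-verifying it on read:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // checksum of one chunk; HDFS stores one 4-byte CRC per 512-byte chunk
    static long crc(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "some block data".getBytes(StandardCharsets.UTF_8);
        long stored = crc(data);                   // computed on write, stored alongside the data

        // on read, the checksum is recomputed and compared
        System.out.println(crc(data) == stored);   // true: data intact

        data[0] ^= 1;                              // flip one bit to simulate corruption
        System.out.println(crc(data) == stored);   // false: corruption detected
    }
}
```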

23). How to set different replication factor for a specific file?
hadoop fs -setrep -w 3 /my/file

24).  What is the Use of SSH in Hadoop ?
SSH gives secure access to a remote host and is a more secure alternative to rlogin and telnet; Hadoop's control scripts use it to log in to the cluster nodes and start or stop the daemons there.

25). What is Hive Metastore ?
The Hive Metastore is a database that stores metadata about your Hive tables: table name, column names, data types, table location, number of buckets in the table, etc.

26). What is fail-fast approach in HashMap?
Fail-fast means that when you try to modify a collection while iterating through it, the iterator fails and throws a ConcurrentModificationException.
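A small sketch that provokes the exception (the class and method names are made up):

```java
import java.util.*;

public class FailFastDemo {
    // returns true if modifying the map while iterating triggers fail-fast behaviour
    static boolean failsFast() {
        Map<String, Integer> map = new HashMap<>(Map.of("a", 1, "b", 2));
        try {
            for (String key : map.keySet()) {
                map.put("c", 3);     // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;             // the iterator detected the modification
        }
    }

    public static void main(String[] args) {
        System.out.println("fail-fast triggered: " + failsFast());
    }
}
```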

27). What is spilling in hadoop ?
Map output is first written to a memory buffer whose size is set by the io.sort.mb property (100 MB by default). When the buffer reaches a certain threshold, set by io.sort.spill.percent, it starts spilling the buffered data to disk. Before the data is written to disk, it is divided into partitions corresponding to the reducers, and within each partition an in-memory sort by key is performed. If a combiner function is specified, it is run on the sorted data whenever there are at least min.num.spills.for.combine spill files; after the combiner runs, the data is written to disk.

28). Difference between Runnable and Callable in Java?
Both Runnable and Callable represent a task intended to be executed in a separate thread.
Runnable has existed since JDK 1.0, while Callable was added in JDK 1.5.
The main difference between the two is that Callable's call() method can return a value and throw a checked exception,
which is not possible with Runnable's run() method. Submitting a Callable returns a Future object, which can hold the result of the computation.
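A sketch using ExecutorService (the class and helper names are hypothetical):

```java
import java.util.concurrent.*;

public class CallableDemo {
    // Callable can return a value and throw a checked exception; Runnable can do neither
    static int compute() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> 6 * 7;
            Future<Integer> future = pool.submit(task);
            return future.get();           // blocks until the result is ready
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compute());     // 42
    }
}
```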

29). Difference between notify and notifyAll in Java?
The notify() method doesn't provide any way to choose a particular thread, so it is only useful
when you know that exactly one thread is waiting. notifyAll(), on the other hand, sends a notification
to all waiting threads and lets them compete for the lock, which ensures that at least one thread will proceed.
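The standard wait/notify pattern, sketched with a hypothetical class (note the while loop guarding against spurious wakeups, and that wait/notify are called on the lock object, not on a thread):

```java
public class NotifyDemo {
    private final Object lock = new Object();
    private boolean done = false;

    // waits on the lock object: the lock belongs to the object, not the thread
    boolean await() throws InterruptedException {
        synchronized (lock) {
            while (!done) {          // loop guards against spurious wakeups
                lock.wait();
            }
        }
        return done;
    }

    void finish() {
        synchronized (lock) {
            done = true;
            lock.notifyAll();        // wakes every waiting thread; notify() wakes just one
        }
    }

    public static void main(String[] args) throws InterruptedException {
        NotifyDemo demo = new NotifyDemo();
        new Thread(demo::finish).start();
        System.out.println("woke up, done=" + demo.await());
    }
}
```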

30). Why wait, notify and notifyAll are not inside thread class? 
One obvious reason is that Java provides locks at the object level, not the thread level.
Every object has a lock, which is acquired by a thread. If a thread needs to wait for a certain lock,
it makes sense to call wait() on that object rather than on the thread.
Had wait() been declared on the Thread class, it would not have been clear which lock the thread was waiting for.
In short, since wait, notify, and notifyAll operate at the object level,
it makes sense to define them on the Object class, because the lock belongs to the object.

31). How to find nth highest salary in hive?
Ex: query for the 4th highest salary.
select * from (
  select salary, ROW_NUMBER() over (ORDER BY salary desc) as row_no from emp group by salary) res 
where res.row_no = 4;

The problem with row_number() is that if duplicate salaries exist, the same salary can get two different row numbers; but because of the GROUP BY on salary, duplicates are removed and we still get the correct result.

select * from (
  select salary, rank() over (ORDER BY salary desc) as row_no from emp group by salary) res 
where res.row_no = 4;

The problem with the rank() function is that it skips ranks when duplicate salaries are found.

select * from (
  select salary, dense_rank() over (ORDER BY salary desc) as row_no from emp ) res 
where res.row_no = 4;

dense_rank() is the perfect fit for finding the nth highest salary.
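The dense_rank() semantics can be mirrored in plain Java by de-duplicating the salaries before ranking; a sketch with made-up data:

```java
import java.util.*;
import java.util.stream.*;

public class NthHighestSalary {
    // nth highest DISTINCT salary, mirroring dense_rank(): duplicates share a rank
    static OptionalInt nthHighest(int[] salaries, int n) {
        return IntStream.of(salaries)
                .distinct()                          // 800, 800 counts as one rank
                .boxed()
                .sorted(Comparator.reverseOrder())   // highest first
                .skip(n - 1)                         // skip the first n-1 ranks
                .mapToInt(Integer::intValue)
                .findFirst();
    }

    public static void main(String[] args) {
        int[] salaries = {900, 800, 800, 700, 600};
        System.out.println(nthHighest(salaries, 3).getAsInt());  // 700 (800 counted once)
    }
}
```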

32). What is the difference between GROUP and COGROUP in PIG?
The GROUP and COGROUP operators are identical. Both operators work with one or more relations. For readability GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. You can COGROUP up to but no more than 127 relations at a time.

33). What is the Hive Metastore?
The Metastore is the component that stores the system catalog which contains metadata about tables, columns, and partitions.
Usually, metadata is stored in a traditional RDBMS format. However, by default, Apache Hive uses the Derby database. Any Java Database Connectivity or JDBC compliant database like MySQL can be used for Metastore.

34). What is “SerDe” in “Hive”?
The “SerDe” interface allows you to instruct “Hive” about how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses “SerDe” (and “FileFormat”) to read and write table rows.

35). Explain “WAL” and “Hlog” in “HBase”?
“WAL” (Write Ahead Log) is similar to the “MySQL BIN” log; it records all the changes that occur in the data. It is a standard sequence file by Hadoop and stores “HLogkeys”.  These keys consist of a sequential number as well as actual data and are used to replay not yet persisted data after a server crash. So, in case of server failure, the “WAL” works as the lifeline and retrieves the lost data.

36). How to change the FileFormat of hive table after it created?
ALTER TABLE [DB.]<TableName> SET FILEFORMAT <FileFormat Name>;

37). Difference between the map and flatMap functions in Spark?
map and flatMap are similar: both take an element from the input RDD and apply a function to it. They differ in that map returns exactly one element per input element, while flatMap can return a list of elements (0 or more) per input element.
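The same distinction exists in Java streams, which makes for a compact sketch of the two operations:

```java
import java.util.*;
import java.util.stream.*;

public class MapVsFlatMap {
    public static void main(String[] args) {
        List<String> lines = List.of("to be", "or not");

        // map: exactly one output element per input element
        List<Integer> lengths = lines.stream()
                .map(String::length)
                .collect(Collectors.toList());
        System.out.println(lengths);        // [5, 6]

        // flatMap: zero or more output elements per input element
        List<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
        System.out.println(words);          // [to, be, or, not]
    }
}
```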

38). How to change one row into multiple rows while operating on multiple columns in Hive?
Input: 
ID     NAME     CLASS     Start Date      End Date
1      ABC      X         20171106        20171107 

select id, newname name, class, CONCAT_WS('-',substr(startdate,1,4),substr(startdate,5,2),substr(startdate,7,2)) as StartDate,
CONCAT_WS('-',substr(enddate,1,4),substr(enddate,5,2),substr(enddate,7,2)) as EndDate from hive_problem1
LATERAL VIEW EXPLODE(split(CONCAT_WS('-',substr(name,1,1),substr(name,2,1),substr(name,3,1)),'-')) dummytable as newname;

39). What is the difference between ORC and Parquet formats?

The often-cited difference is vendor backing: ORC comes from Hortonworks (HDP) while Parquet comes from Cloudera (CDH). So on a CDH distro you are better off using Parquet, since CDH carries up-to-date Hive bug fixes for Parquet, and similarly ORC is the better fit on an HDP distribution. Technically, ORC adds lightweight indexes and supports Hive ACID transactions, while Parquet has broader ecosystem support (for example in Spark) and handles nested data well.

40). Difference between group by vs reduce by in spark?
With groupByKey, all the data is sent from the map tasks to the reduce tasks over the network, whereas with reduceByKey a combiner runs on each map task first, so only partially aggregated data is shuffled to the reducers.