Log Structured Merge Trees


LSM trees are designed for high write throughput and are used as the storage engine of various databases such as HBase, Cassandra, LevelDB and RocksDB. As the name suggests, writes are made to log files in append-only mode. These log files are merged and compacted in the background and indexed for efficient search.

To discuss how LSM trees work, let's first understand a few basics:

  1. Log storage
  2. Segment
  3. Segment Merge
  4. Memtable
  5. SSTable
  6. WAL

Log storage – As the name suggests, it is an append-only file: any new data is appended to the end of the file. The file contains key-value pairs, where the key is the record key and the value is the data. Since the file is append-only, an update is written as a new record, so the log can contain multiple records for the same key; the most recent one holds the current value.

Segment – Continuously appending to a single log file makes the file grow large and can eventually exhaust the disk space. A good solution is to break the log into segments of a certain size: when the current segment reaches that size, it is closed and new writes go to a new segment.

Segment Merge – Since the segment files are small, they can be merged periodically for compaction. Merging and compaction can run in a background thread while regular operations continue. After the merge is complete, reads are switched to the new segments and the old segments can be deleted. During the merge, multiple records with the same key are collapsed into a single record that keeps the most recent value.
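The merge step can be sketched in a few lines of Java. This is only a minimal illustration, assuming each segment is an in-memory list of key-value records in append order; the class name SegmentMerge and its merge method are made up for this example and do not come from any real storage engine.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only. */
final class SegmentMerge {
    /**
     * Merges log segments given in oldest-to-newest order into one compacted,
     * key-sorted segment. A later record for a key replaces any earlier one.
     */
    static TreeMap<String, String> merge(List<List<Map.Entry<String, String>>> oldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (List<Map.Entry<String, String>> segment : oldestFirst) {
            for (Map.Entry<String, String> record : segment) {
                merged.put(record.getKey(), record.getValue()); // newest value wins
            }
        }
        return merged;
    }
}
```

Because a TreeMap keeps its keys sorted, the result already has the shape of the sorted segments described further below. A real engine would stream the segments from disk instead of holding them in memory.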


SSTable – Sorted String Table is a concept borrowed from Google Bigtable: a file that stores a set of immutable row fragments sorted by row key. It is a log segment whose keys are sorted.

Write Ahead Log (WAL) – Write-ahead logging is a technique where each write request is first written to a separate file on disk before it is applied. This helps with crash recovery: the data in the file can be replayed and re-inserted.

Write Amplification – Log-structured indexes rewrite data multiple times due to repeated compaction and merging of SSTables. This effect, where one write to the database results in multiple writes to the disk over the course of the database’s lifetime, is known as write amplification.

How do reads and writes work in an LSM tree?

  • When a write request comes in, it is added to an in-memory balanced tree data structure (e.g. a red-black tree). This in-memory tree is sometimes referred to as a memtable.
  • Along with being added to the memtable, the request is also logged in a WAL on disk. The log helps to recover the memtable if the system or the process crashes.
  • When the memtable grows beyond some threshold, it is flushed to disk as an SSTable file. This is efficient because the tree already keeps the key-value pairs sorted by key. The new file becomes the most recent segment of the database. While the flush is in progress, writes continue as normal into a new memtable.
  • To serve a read request, the key is first searched in the memtable, then in the most recent on-disk segment, then in the next older segment, and so on. Even with compaction, a read may still need to visit many files. Most implementations avoid much of this work by using a Bloom filter, a memory-efficient structure that tells whether a file might contain a key or definitely does not.
  • The merging and compaction of segments happens in the background.
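The write and read paths above can be modelled with a toy class. This is a sketch under simplifying assumptions: everything lives in memory, the WAL is just a list standing in for an on-disk file, and Bloom filters and background compaction are left out. MiniLsm and its method names are invented for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the LSM write and read paths; all names are hypothetical. */
final class MiniLsm {
    private final int memtableLimit;
    private TreeMap<String, String> memtable = new TreeMap<>(); // TreeMap is a red-black tree
    private final List<String> wal = new ArrayList<>();         // stand-in for an on-disk log
    private final Deque<TreeMap<String, String>> segments = new ArrayDeque<>(); // newest first

    MiniLsm(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value); // 1. log the write so it could be replayed after a crash
        memtable.put(key, value);   // 2. apply it to the memtable
        if (memtable.size() >= memtableLimit) {
            flush();
        }
    }

    /** The full memtable becomes the newest immutable, sorted segment. */
    private void flush() {
        segments.addFirst(memtable);
        memtable = new TreeMap<>();
        wal.clear(); // flushed records no longer need to be replayed
    }

    /** Looks in the memtable first, then in segments from newest to oldest. */
    String get(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (TreeMap<String, String> segment : segments) {
            value = segment.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int segmentCount() {
        return segments.size();
    }
}
```

With a limit of 2, writing a=1 and b=2 triggers a flush; a later a=3 stays in the memtable and shadows the older value on reads.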

Advantages of LSM

  • LSM trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (this depends on the storage engine configuration and the workload) and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree.
  • LSM trees can be compressed better and thus often produce smaller files on disk than B-trees.
  • B-tree storage engines leave some disk space unused due to fragmentation: when a page is split or when a row cannot fit into an existing page. Since LSM trees are not page oriented and writes are sequential, the storage overhead is low.

Disadvantages of LSM

  • The compaction process of LSM trees can sometimes interfere with the performance of ongoing reads and writes. When writing to an empty database, the full disk bandwidth can be used for the initial writes, but the bigger the database gets, the more disk bandwidth is required for compaction.
  • An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments.

How does HTTPS work?


Sending data over the internet in plain HTTP is like sending a box by courier with no lock on it: anyone who handles the box can open it. (One can argue about picking the lock, but that’s not the problem we are trying to solve here.)

To send the data securely, HTTPS was invented, which is nothing but

HTTP + SSL (Secure Sockets Layer)

SSL, and its successor TLS, uses cryptography to encrypt the data.

Let’s discuss the basics of cryptography.

1) Encryption – The process of converting plain text into an unreadable, random-looking string (ciphertext) is called encryption. A key is used to perform the conversion.
2) Key – The key is the secret value used to encrypt and decrypt the data.
3) Decryption – The process of converting the ciphertext back into plain text is called decryption. A key is used for this conversion as well.

If text A is encrypted and then decrypted back to text A with the same key, both sides must share that key. Such a key is called a symmetric key.

Symmetric key – An encryption system in which the sender and receiver of a message share a single, common key that is used to encrypt and decrypt the message. Symmetric-key systems are simpler and faster, but their main drawback is that the two parties must somehow exchange the key in a secure way.

Asymmetric key – The keys are simply large numbers that have been paired together but are not identical (asymmetric). One key in the pair can be shared with everyone; it is called the public key. The other key in the pair is kept secret; it is called the private key. Either of the keys can be used to encrypt a message; the opposite key from the one used to encrypt the message is used for decryption.
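To make the symmetric case concrete, here is a small round trip using the standard javax.crypto API. It is a sketch, not a recommendation for production use: the choice of AES with GCM, the 128-bit key and the 12-byte IV are assumptions made for this example, and the class name SymmetricDemo is made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical demo class: one shared key both encrypts and decrypts. */
final class SymmetricDemo {
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128); // key size in bits
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] newIv() {
        byte[] iv = new byte[12]; // must be fresh for every message under the same key
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    static byte[] encrypt(SecretKey key, byte[] iv, String plainText) {
        return run(Cipher.ENCRYPT_MODE, key, iv, plainText.getBytes(StandardCharsets.UTF_8));
    }

    static String decrypt(SecretKey key, byte[] iv, byte[] cipherText) {
        return new String(run(Cipher.DECRYPT_MODE, key, iv, cipherText), StandardCharsets.UTF_8);
    }

    private static byte[] run(int mode, SecretKey key, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(mode, key, new GCMParameterSpec(128, iv)); // 128-bit auth tag
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Decrypting with the same key and IV returns the original text; decrypting with a different key fails. The open question, how the two sides get the same key in the first place, is exactly the drawback described above.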

Now that we are aware of the basics of cryptography, let’s look into how HTTPS works.

  • When a user clicks on an https link, the browser makes a TCP connection to the server on the HTTPS port, 443.
  • After the connection is established, the SSL handshake starts between the browser and the server.

The series of exchanges between the server and the client can be categorized into 3 groups.

  • Hello: The client sends a hello message which contains details such as the highest SSL version it supports, the cipher algorithms it supports, the compression algorithms it supports, and random data that is later used to generate the symmetric key. The server responds with a hello message containing the SSL version and cipher to be used, a session id, and its own random data, which is also used later in the generation of the key.
  • Certificate Exchange: After the server hello message, the server sends its digital certificate. The certificate contains the server’s public key and establishes the identity of the server to the browser. The digital signature on the certificate is someone vouching for the fact that a particular public key belongs to a particular individual or organization. In order to be trusted by the average web browser, certificates have to be signed by a trusted Certificate Authority (CA). CAs are companies that perform manual inspection and review to make sure that the applying entity is both:
    1. a real person or business that exists in the public record
    2. in control of the domain for which they are requesting a signed certificate

    Once the CA verifies that the applicant is real and really owns the domain, the CA will “sign” the site’s certificate, essentially putting its stamp of approval on the fact that this public key really belongs to the site and should be trusted. The browser comes preloaded with a list of trusted CAs.

  • Key Exchange: After receiving and verifying the digital certificate, the browser generates a symmetric key. It sends this key to the server encrypted with the server’s public key. Since the message is encrypted using the server’s public key, it can only be decrypted with the matching private key, which resides only on the server.

Once the symmetric key is exchanged the browser can start interacting with the server by sending encrypted messages securely.
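The key exchange step can be imitated with the standard Java crypto classes. This is a simplified sketch of the idea only, not of the real handshake: it assumes an RSA key pair for the server, skips certificates and the hello messages entirely, and uses made-up names (KeyExchangeDemo, wrapForServer, unwrapOnServer).

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical demo: the browser sends a symmetric key under the server's public key. */
final class KeyExchangeDemo {
    private static final String RSA_OAEP = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    static KeyPair newServerKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static SecretKey newSessionKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Browser side: only the public key from the certificate is needed. */
    static byte[] wrapForServer(PublicKey serverPublicKey, SecretKey sessionKey) {
        try {
            Cipher cipher = Cipher.getInstance(RSA_OAEP);
            cipher.init(Cipher.ENCRYPT_MODE, serverPublicKey);
            return cipher.doFinal(sessionKey.getEncoded());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Server side: only the holder of the private key can recover the session key. */
    static SecretKey unwrapOnServer(PrivateKey serverPrivateKey, byte[] wrapped) {
        try {
            Cipher cipher = Cipher.getInstance(RSA_OAEP);
            cipher.init(Cipher.DECRYPT_MODE, serverPrivateKey);
            return new SecretKeySpec(cipher.doFinal(wrapped), "AES");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After unwrapOnServer returns, both sides hold the same AES key and can switch to fast symmetric encryption for the rest of the conversation.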

Power of grep: search set of strings from an input file


grep is a common command used to search for patterns in a file. I commonly use it for searching access log files and debug log files, and for looking for patterns in user behavior.

If a user wants to search for a word or phrase in a file, the command is

grep "hello world" input.txt

This lists the lines which contain “hello world”.

grep works with regular expressions, so patterns can contain wildcard-like operators too.

grep "hello*" input.txt

In a regular expression, * means “zero or more of the preceding character”, so this lists all the lines containing “hell” followed by any number of “o” characters (for example “hell”, “hello” or “hellooo”). To match “hello” followed by any text, use "hello.*".

Now suppose there is a file with a list of user ids, one per line, and the task is to find in debug.log all the users who have visited the website.

grep -F -f userids.txt debug.log
The above command tells grep to read the patterns from the file userids.txt, treat them as fixed strings, and print every line of debug.log that contains at least one of them.

-F, --fixed-strings

  Interpret pattern as a set of fixed strings (i.e. force grep to behave as fgrep).

-f file, --file=file

  Read one or more newline separated patterns from a file.  Empty pattern lines match every input line.  Newlines are not considered part of a pattern.  If a file is empty, nothing is matched.
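For readers who think in Java more easily than in shell flags, the matching rule of -F -f can be written out as a small method. This is only a sketch of the semantics as I understand them from the options above, not how grep is implemented; FixedStringGrep is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of what "grep -F -f patterns file" selects. */
final class FixedStringGrep {
    /** Returns the lines that contain at least one pattern as a literal substring. */
    static List<String> grep(List<String> patterns, List<String> lines) {
        List<String> matches = new ArrayList<>();
        for (String line : lines) {
            for (String pattern : patterns) {
                if (line.contains(pattern)) { // plain substring test, no regex involved
                    matches.add(line);
                    break; // a line is reported once, however many patterns match
                }
            }
        }
        return matches;
    }
}
```

Two details line up with the manual text: an empty pattern matches every line (every string contains the empty string), and an empty pattern list matches nothing. Note that a user id such as u7 also matches a line containing u71, because the match is by substring.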

Java Interview Questions


Some of the most common interview questions I generally ask Java programmers. For a more comprehensive list, check Java Interview Questions.

Q) What is polymorphism?

Ans) The ability to define a function in multiple forms is called polymorphism. In Java and C++ there are two types of polymorphism: compile-time polymorphism (overloading) and runtime polymorphism (overriding).

Method overriding – Overriding occurs when a child class implements a method with the same signature as a method in its parent class. When you override methods, the JVM determines the proper method to call at the program’s run time, not at compile time.

Method overloading – Overloading occurs when several methods have the same name but a different number or type of parameters.

  • Overloading is resolved at compile time.
  • Overloaded methods share a name but differ in their parameter lists.
  • They may take a different number of parameters.
  • They may take the same number of parameters but of different types.
  • A different return type alone is not enough to overload a method.

Example of Overloading

int add(int a, int b)
float add(float a, int b)
float add(int a, float b)
void add(float a)
int add(int a)
void add(int a) // error: conflicts with int add(int a), only the return type differs

Example of Overriding

class BookDetails {
  String title;
  void setBook(String title) {}
}
class ScienceBook extends BookDetails {
  @Override
  void setBook(String title) {} // overriding
  void setBook(String title, String publisher, float price) {} // overloading
}
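The difference between the two kinds of dispatch is easiest to see when the declared type and the runtime type of a variable differ. The classes below (Animal, Dog, DispatchDemo) are invented for this illustration.

```java
class Animal {
    String sound() {
        return "generic";
    }
}

class Dog extends Animal {
    @Override
    String sound() { // overriding: chosen at run time from the actual object
        return "woof";
    }
}

final class DispatchDemo {
    // Two overloads: the compiler picks one from the declared type of the argument.
    static String describe(Animal animal) {
        return "animal";
    }

    static String describe(Dog dog) {
        return "dog";
    }

    static String run() {
        Animal pet = new Dog(); // declared type Animal, runtime type Dog
        return describe(pet) + "/" + pet.sound();
    }
}
```

DispatchDemo.run() returns "animal/woof": the overload is selected at compile time using the declared type Animal, while the overridden sound() is selected at run time using the actual Dog object.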

Q) What is the difference between final, finally and finalize() in Java?

Ans) final – A final variable acts as a constant, a final class cannot be extended, and a final method cannot be overridden by a subclass.

finally – is used with try-catch for exception handling. The finally block is optional and provides a mechanism to clean up regardless of what happens within the try block (except when the JVM exits, e.g. through a System.exit(0) call). Use the finally block to close files or to release other system resources such as database connections, statements etc.

finalize() – a method of the Object class that the garbage collector may invoke before reclaiming an object, allowing the object to clean up its state. Because there is no guarantee when, or even whether, it runs, it should not be relied on to release resources such as connections or file handles; the method has been deprecated since Java 9.

Q) What is the difference between HashMap and Hashtable?

Ans) Both collections implement Map. Both collections store data as key-value pairs. The key differences between the two are

  1. HashMap is not synchronized but Hashtable is.
  2. Another difference is that the iterator of a HashMap is fail-fast while the enumerator of a Hashtable isn’t.
    Fail-fast – if the map is structurally modified at any time after the iterator is created, in any way except through the iterator’s own remove method, the iterator will throw a ConcurrentModificationException.
  3. HashMap permits null values and one null key, while Hashtable doesn’t allow a null key or a null value.
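The null handling and the fail-fast behaviour can be checked with a few lines of code. The class and method names below are made up for this example; the collections and exceptions are the standard java.util ones.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

/** Hypothetical demo class for the differences listed above. */
final class MapDifferences {
    static boolean hashMapAcceptsNullKey() {
        Map<String, String> map = new HashMap<>();
        map.put(null, "value");
        return map.containsKey(null);
    }

    static boolean hashtableRejectsNullKey() {
        Map<String, String> table = new Hashtable<>();
        try {
            table.put(null, "value");
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    static boolean hashMapIteratorFailsFast() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        try {
            for (String key : map.keySet()) {
                map.put(key + "-copy", 0); // structural change while iterating
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }
}
```

All three methods return true. The last one shows why code that needs to add or remove entries during iteration should go through the iterator itself or collect the changes and apply them afterwards.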

Q) What is the difference between abstract class and interface?

Ans)

  • A class is abstract when it is declared with the abstract keyword; it must be declared abstract if it contains at least one abstract method. It can also contain any number of concrete methods. An interface traditionally contains only abstract (non-implemented) methods; since Java 8 it can also contain default and static methods with a body.
  • An abstract class can have public, private, protected or default (package-private) variables and also constants. In an interface, a variable is by default public static final. In a nutshell, an interface doesn’t have any variables, it only has constants.
  • A class can extend only one abstract class, but a class can implement multiple interfaces.
  • If a concrete class implements an interface, it is compulsory to implement all of the interface’s abstract methods. If it extends an abstract class, it only has to implement the methods that are declared abstract.
  • The issue with an interface is that if you add a new abstract method to its contract, you MUST implement the new method in all of the classes which implement that interface. In the case of an abstract class, the method can simply be implemented in the abstract class and then called by its subclasses. Since Java 8, a default method gives interfaces a similar escape hatch.
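A short example shows the pieces side by side. Greeter, BaseGreeter and EnglishGreeter are hypothetical types written for this illustration, and the default method assumes Java 8 or later.

```java
interface Greeter {
    String SEPARATOR = ", "; // implicitly public static final: a constant, not a variable

    String name(); // abstract: every concrete implementor must provide it

    // A default method can be added later without breaking existing implementors.
    default String greet() {
        return "Hello" + SEPARATOR + name();
    }
}

abstract class BaseGreeter implements Greeter {
    private final String name; // instance state is allowed in an abstract class

    BaseGreeter(String name) {
        this.name = name;
    }

    @Override
    public String name() { // implemented once here, inherited by all subclasses
        return name;
    }
}

final class EnglishGreeter extends BaseGreeter {
    EnglishGreeter() {
        super("World");
    }
}
```

new EnglishGreeter().greet() returns "Hello, World". EnglishGreeter itself implements nothing: name() comes from the abstract class and greet() from the interface's default method.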

Q) What is the difference between equals() and == ?

Ans) The == operator compares the references of two objects, i.e. whether both variables point to the same object.
public boolean equals(Object o) is a method provided by the Object class. The default implementation uses the == operator to compare two objects. But since the method can be overridden, as it is in the String class, equals() can be used to compare the values of two objects.

String str1 = "MyName";
String str2 = "MyName";         // same interned literal, so str1 == str2 is true
String str3 = new String(str2); // a distinct object with the same characters

if (str1 == str3) { // compares references
  System.out.println("Objects are equal");
} else {
  System.out.println("Objects are not equal");
}
if (str1.equals(str3)) { // compares contents
  System.out.println("Objects are equal");
} else {
  System.out.println("Objects are not equal");
}

Output:
Objects are not equal
Objects are equal

String str2 = "MyName";
String str3 = str2; // both variables refer to the same object
if (str2 == str3) {
  System.out.println("Objects are equal");
} else {
  System.out.println("Objects are not equal");
}
if (str3.equals(str2)) {
  System.out.println("Objects are equal");
} else {
  System.out.println("Objects are not equal");
}

Output:
Objects are equal
Objects are equal

Q) What is the difference between an ArrayList and a Vector?

Ans)

  • Synchronization – ArrayList is not thread-safe whereas Vector is thread-safe. In the Vector class, methods like add() and get(int i) are declared synchronized, which makes Vector thread-safe.
  • Data growth – Internally, both ArrayList and Vector hold their contents in an array. When an element is inserted into an ArrayList or a Vector, the object needs to expand its internal array if it runs out of capacity. A Vector defaults to doubling the size of its array, while an ArrayList increases its array size by 50 percent.
  • Performance – Since Vector is synchronized, it is generally slower than ArrayList.

Q) Which classes implement the Set interface?

Ans) A Set is a collection that contains no duplicate elements. More formally, a set contains no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. HashSet and TreeSet are commonly used classes which implement the Set interface; SortedSet is a sub-interface of Set.

  • SortedSet – It is an interface which extends Set. As the name suggests, the interface allows the data to be iterated in ascending order, sorted either by the elements’ natural ordering (Comparable) or by a Comparator. All elements inserted into a sorted set must implement Comparable or be accepted by the set’s Comparator.
  • TreeSet – It is an implementation of the SortedSet interface. This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains). The class is not synchronized. The class uses a red-black tree data structure.
  • HashSet – This class implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time. This class permits the null element. This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
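A quick way to see the difference in practice is to feed the same list with duplicates into both implementations. SetDemo and its method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical demo class contrasting TreeSet and HashSet. */
final class SetDemo {
    /** TreeSet drops duplicates and iterates in ascending (natural) order. */
    static List<String> sortedUnique(List<String> input) {
        SortedSet<String> sorted = new TreeSet<>(input);
        return new ArrayList<>(sorted);
    }

    /** HashSet drops duplicates too, but promises nothing about iteration order. */
    static int distinctCount(List<String> input) {
        Set<String> unique = new HashSet<>(input);
        return unique.size();
    }
}
```

For the input pear, apple, pear, fig the first method returns apple, fig, pear and the second returns 3.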

Q) Describe the exception hierarchy in Java?

Ans) The hierarchy is as follows:

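[Figure: Java exception hierarchy]

Throwable is the parent class of all exception classes. It has two direct subclasses: Error and Exception. There are two types of exceptions: checked exceptions and unchecked exceptions. RuntimeException extends Exception, and it and its subclasses are unchecked, as are Error and its subclasses. Every other subclass of Exception is checked.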
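The checked versus unchecked rule can be expressed directly in terms of the class hierarchy. The helper below is a small sketch; ExceptionHierarchy and isChecked are names invented for this example.

```java
/**
 * Hypothetical helper. The hierarchy it relies on:
 *
 *   Throwable
 *     Error              (unchecked)
 *     Exception          (checked)
 *       RuntimeException (unchecked)
 */
final class ExceptionHierarchy {
    /** A throwable type is checked unless it is a RuntimeException or an Error. */
    static boolean isChecked(Class<? extends Throwable> type) {
        return !RuntimeException.class.isAssignableFrom(type)
            && !Error.class.isAssignableFrom(type);
    }
}
```

For example, isChecked(java.io.IOException.class) is true, while isChecked(NullPointerException.class) and isChecked(OutOfMemoryError.class) are both false.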

Difference between final, finally and finalize in Java?


final – The final keyword can be used with a class, a variable or a method.

  • A variable declared as final acts as a constant, which means once the variable is declared and assigned, its value cannot be changed. An object reference can also be final, which means that once it refers to an object it cannot be reassigned to a different object, although the properties or fields of that object can still be changed.
  • A final class cannot be extended by any other class. E.g. String, Integer.
  • A final method in a class cannot be overridden in the child class.

The underlying idea of the final keyword is that whatever it marks cannot be changed once it is defined.

// MyClass stands for any class of your own.
public class Test {
  private static final String PREFIX = "test.";
  private final MyClass obj = new MyClass();

  public Test() {
    obj = new MyClass(); // compile error: the final field obj is already assigned
  }
}

public class Test {
  private static final String PREFIX = "test.";
  private final MyClass obj;

  public Test() {
    obj = new MyClass(); // this works: a blank final field is assigned exactly once
  }
}

finally – The finally keyword is used with a try-catch block for handling exceptions. The finally block is optional. When present, its code is executed after the try or catch block completes, whether or not an exception was thrown. The general use case for a finally block is to close the resources used in the try block. For example, file streams, I/O stream objects, database connections and HTTP connections are generally closed in a finally block.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Test {
  public static void main(String[] args) {
    BufferedReader br = null;
    try {
      String sCurrentLine = "";
      br = new BufferedReader(new FileReader("C:\\testing.txt"));
      while ((sCurrentLine = br.readLine()) != null) {
        System.out.println(sCurrentLine);
      }
    } catch (IOException e) {
      e.printStackTrace();
    } finally { // close the resource whether or not reading succeeded
      try {
        if (br != null) br.close();
      } catch (IOException ex) {
        ex.printStackTrace();
      }
    }
  }
}
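Since Java 7 the same cleanup can also be written with try-with-resources, which closes the reader automatically. The sketch below is a variation on the example above, not a replacement for it: it reads from a String instead of a file so that it runs anywhere, and the class and method names are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical demo class for try-with-resources. */
final class TryWithResourcesDemo {
    static int countLines(String text) {
        // The reader declared in the parentheses is closed automatically,
        // on normal completion as well as when an exception is thrown.
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            int lines = 0;
            while (br.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An explicit finally block is still the right tool for cleanup that is not tied to an AutoCloseable resource, such as releasing a lock.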

finalize() – This is a method of the Object class. It is invoked before an object is discarded by the garbage collector, allowing it to clean up its state. It should not be used to release non-memory resources like file handles, sockets, database connections etc., because the system has only a finite number of these resources and you do not know when garbage collection is going to kick in and release them through the finalize() method.