Sunday, October 14, 2012

SQL Server Index internals (SQL Server)


What Is an Index?
One of the best ways to reduce logical and physical reads is an index. An index can be used to search for data in a table without scanning the entire table. An index on a database table is analogous to the index of a book: for example, if you want to look up a phrase in a book, you can go directly to the right page using the book's index.

You can use two different approaches to create an index on a table.

Like a dictionary:
A dictionary stores words in alphabetical order. An index on a table can maintain data in a similar fashion: the data is kept ordered (although there may be duplicates), so it can be accessed directly. This type of index is a clustered index.

Like a book index:
Just like the index of a book, an index on a table can contain the index key values and pointers to the actual data. This type of index is a nonclustered index.

Index Design Recommendations:

1. Examine the WHERE Clause and Join Criteria Columns:
a.       The optimizer identifies the columns used in the JOIN and WHERE clauses.
b.       The optimizer then examines the indexes available on those columns.
c.       Finally, the optimizer estimates the least costly method of retrieving the qualifying rows.


Download the AdventureWorks database from CodePlex and attach it as follows:


CREATE DATABASE MyAdventureWorks
    ON (FILENAME = 'D:\Programming\AdventureWorks2008_Data.mdf'),
    (FILENAME = 'D:\Programming\AdventureWorks2008_Log.ldf')
    FOR ATTACH;

When no WHERE clause is included:
SELECT * FROM Person.Person

(19972 row(s) affected)
Table 'Person'. Scan count 1, logical reads 3816, physical reads 0

Once a WHERE clause is included in the query:

SELECT * FROM Person.Person A
WHERE A.BusinessEntityID = 2
(1 row(s) affected)
Table 'Person'. Scan count 0, logical reads 3, physical reads 0
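
The I/O figures shown above come from SQL Server's I/O statistics; a minimal way to reproduce them in SSMS:

SET STATISTICS IO ON

-- No WHERE clause: every page of the table is scanned
SELECT * FROM Person.Person

-- WHERE on BusinessEntityID: a seek on the clustered index reads only a few pages
SELECT * FROM Person.Person A
WHERE A.BusinessEntityID = 2

SET STATISTICS IO OFF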



2. Use Narrow Indexes:
The larger the index key, the more pages are needed to store it and the more time it takes to search the data (a narrow index can accommodate more rows in an 8 KB index page). So, minimize the use of wide data types in an index.
The following query shows how an index is represented as a B-tree:


SELECT
      A.name,
      A.type_desc,
      B.page_count,
      B.record_count,
      B.index_level
FROM sys.indexes A
      INNER JOIN sys.dm_db_index_physical_stats(DB_ID(N'MyAdventureWorks'), OBJECT_ID(N'Person.Person'), NULL, NULL, 'DETAILED') AS B
            ON A.index_id = B.index_id
WHERE A.object_id = object_id (N'Person.Person')

The wider the index, the more pages and levels are required to represent the B-tree, and the more time the index takes to search the data.

3. Examine Column Uniqueness:

The query optimizer will not be able to narrow down the result set if the column values are not unique.
For example, creating an index on a Gender column will not help the optimizer narrow down the result set.

The column with the highest number of unique values is the best candidate for indexing when referenced in a WHERE
or JOIN clause.

Note: It is highly recommended that you create indexes on columns with high selectivity.

                  Selectivity = Number of unique values in the column / Total number of rows in the table
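
For example, a quick way to estimate the selectivity of a candidate column (a sketch; the choice of LastName is just an illustration):

SELECT COUNT(DISTINCT LastName) * 1.0 / COUNT(*) AS LastNameSelectivity
FROM Person.Person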


4. Consider Column Ordering:

An index key is sorted on the index columns. If an index is created on more than one column (a composite index), the index key is sorted on the first column and then on the remaining columns.



CREATE NONCLUSTERED INDEX [IX_Person_LastName_FirstName_MiddleName] ON [Person].[Person]
(
      [LastName] ASC,
      [FirstName] ASC,
      [MiddleName] ASC
)


SELECT *
FROM Person.Person
WHERE FirstName = 'Isabella'

Since FirstName is not the leading column of the composite index, the optimizer cannot seek on this index. Filtering on the leading column, LastName, allows an index seek:

SELECT *
FROM Person.Person
WHERE LastName = 'Miller'

A lookup is performed to fetch the other columns, since only the three index columns are stored at the leaf level. A clustered index already exists on the table, so the leaf nodes point to the clustering key and a Key Lookup against the clustered index is used:


SELECT *
FROM Person.Person
WHERE LastName = 'Miller'
      AND MiddleName = 'A'

Here the seek is still on LastName only; because MiddleName is not the second column of the index, it is applied as a residual predicate.


Clustered index: Table rows are sorted on the clustered index columns in the data pages. Since table data can be stored in only one order, a table can have only one clustered index.

Heap table: A table with no clustered index is a heap table.

Relationship with Nonclustered Indexes:
1.       If the table has no clustered index, the leaf nodes of a nonclustered index contain the index
columns and a pointer to the heap row containing the actual data.
2.       If the table has a clustered index, the leaf nodes of a nonclustered index contain the index columns and the clustered index key.

Consider the following table:

Col1   Col2   Col3
1      A1     B1
2      A2     B2

If a nonclustered index is created on Col2:

Case 1: Nonclustered index with no clustered index (heap)

Col2   Row locator
A1     Pointer to heap row with Col1 = 1
A2     Pointer to heap row with Col1 = 2

Case 2: Nonclustered index with a clustered index (on Col1)

Col2   Row locator
A1     Clustering key 1
A2     Clustering key 2


Clustered Index Recommendations:
1.       Create the clustered index first, then the nonclustered indexes. If the nonclustered indexes are created first, their row locators point to the heap, not the clustering key, so they all have to be rebuilt when the clustered index is added.
2.       Keep the index narrow.
3.       Rebuild an index in a single step using the DROP_EXISTING clause, as shown below.
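
For example, the composite index created earlier in this post can be rebuilt in a single step (a sketch; any other index options would go in the same WITH clause):

CREATE NONCLUSTERED INDEX [IX_Person_LastName_FirstName_MiddleName] ON [Person].[Person]
(
      [LastName] ASC,
      [FirstName] ASC,
      [MiddleName] ASC
)
WITH (DROP_EXISTING = ON)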

When Not to Use a Clustered Index:
1.       Do not create a clustered index on a frequently updated column, since every update changes the row locator stored in all nonclustered indexes, increasing the cost of queries.
2.       Avoid wide keys, since the clustering key is stored in every nonclustered index.

In both of the cases mentioned above, a nonclustered index is recommended instead.



Advanced Indexing Techniques
1.      Covering Index: A covering index contains all the columns required to satisfy a SQL query without going to the base table.

SELECT PostalCode
FROM Person.Address
      WHERE StateProvinceID = 42


A lookup is required to fetch the PostalCode column of the table, so we can cover the query with a nonclustered index on StateProvinceID that also carries PostalCode. We can use the INCLUDE clause to create the covering index:

CREATE NONCLUSTERED INDEX [IX_Address_StateProvinceID]
 ON [Person].[Address]
(
            [StateProvinceID] ASC
)
INCLUDE (PostalCode)
WITH (DROP_EXISTING = ON)

If we rerun the query, an Index Seek is performed without any lookup.

      
INCLUDE IN INDEX:
Use INCLUDE for a column that is not in the WHERE/JOIN/GROUP BY/ORDER BY clauses, but only in the column list of the SELECT clause.
The INCLUDE clause adds the data at the lowest (leaf) level, rather than in the index tree. This keeps the index tree smaller because the included column is not part of the tree.
This means included columns are not useful for predicates, sorting, etc., as mentioned above. However, they are useful when you have a residual lookup for a few rows located via the key column(s).

2.      Index Intersections: SQL Server can use two or more indexes on the same table to optimize a query execution plan.
If there is more than one index on a table, SQL Server can perform an intersection of those indexes to satisfy the query.
For example: if a nonclustered index exists on the SalesPersonID column and no index exists on OrderDate, SQL Server uses a Clustered Index Scan to get the result:

SELECT *
FROM Sales.SalesOrderHeader
WHERE SalesPersonID = 276
            AND OrderDate BETWEEN '4/1/2002' AND '7/1/2002'


We could add the OrderDate column to the nonclustered index on SalesPersonID to improve performance. When it is not permissible to modify the existing index, we can create another nonclustered index instead:
CREATE NONCLUSTERED INDEX IX_OrderDate
      ON Sales.SalesOrderHeader
(OrderDate)

Now, if we rerun the query, an index intersection is performed, resulting in index seeks.


3.      Index Joins: An index join is a variation of index intersection. When the covering index technique is applied to an index intersection (that is, the intersected indexes together cover the query), an index join is performed.
For example:
For example:
SELECT SalesPersonID, OrderDate
FROM Sales.SalesOrderHeader
WHERE SalesPersonID = 276
      AND OrderDate BETWEEN '4/1/2002' AND '7/1/2002'



4.      Filtered Index: A filtered index is a nonclustered index that uses a filter, basically a WHERE clause, to create a highly selective set of keys over a column. For example, consider a column with a large number of NULL values: instead of indexing every row, we can create a nonclustered index that contains only the non-NULL values by filtering the NULLs out:

CREATE NONCLUSTERED Index IX_Text ON
Sales.SalesOrderHeader (PurchaseOrderNumber, SalesPersonId)
INCLUDE (OrderDate, ShipDate)
WHERE PurchaseOrderNumber IS NOT NULL
 AND SalesPersonId IS NOT NULL
           

5.      Indexed View: A database view can be materialized on disk by creating a unique clustered index on the view. Such a view is referred to as an indexed view. The view's result set is persisted immediately on physical storage in the database. After the view is materialized, multiple nonclustered indexes can be created on the indexed view.
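
A minimal sketch of an indexed view (the view name, columns, and aggregates are illustrative; indexed views impose restrictions such as SCHEMABINDING, two-part table names, and COUNT_BIG with GROUP BY):

CREATE VIEW Sales.vSalesPersonTotals
WITH SCHEMABINDING   -- required for an indexed view
AS
SELECT SalesPersonID,
       COUNT_BIG(*)  AS OrderCount,   -- required when the view uses GROUP BY
       SUM(SubTotal) AS TotalSales
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
GO

-- Materialize the view with a unique clustered index
CREATE UNIQUE CLUSTERED INDEX IX_vSalesPersonTotals
    ON Sales.vSalesPersonTotals (SalesPersonID)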


Monday, October 8, 2012

How to run a SQL Server Agent Job using a Non-System Account


A SQL Server Agent job is configured to run under a system account (the SQL Server Agent service account) by default, so ordinarily we cannot run the job using another account. But SQL Server provides the capability to run a job step under a non-system account using a proxy.

Sometimes, when we pull data from different data sources and the system account does not have permission on the source, we have to use another account to pull the data.
To create a proxy, we first have to create a credential. This credential must map to an account that has the required access permissions.

To create a credential
1. In Object Explorer, expand the Security folder.
2. Right-click the Credentials folder and select New Credential….
3. In the New Credential dialog box, in the Credential Name box, type a name for the credential.
4. In the Identity box, type the name of the account used for outgoing connections (when leaving the context of SQL Server). Typically, this will be a Windows user account, but the identity can be an account of another type.
5. In the Password and Confirm password boxes, type the password of the account specified in the Identity box. If Identity is a Windows user account, this is the Windows password. The password can be blank if no password is required.
6. Click OK.

To create a proxy account
1. In Object Explorer, expand a server. Expand SQL Server Agent.
2. Right-click Proxies and select New Proxy.
3. On the General page of the New Proxy Account dialog, specify the proxy name, credential name, and description for the new proxy.
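
Equivalently, the credential and proxy can be created with T-SQL (a sketch; the names, identity, and subsystem below are illustrative):

-- Create a credential mapping to the account that has access to the data source
CREATE CREDENTIAL DataPullCredential
    WITH IDENTITY = N'DOMAIN\DataPullUser', SECRET = N'StrongPassword!'

-- Create a proxy based on the credential
EXEC msdb.dbo.sp_add_proxy
    @proxy_name = N'DataPullProxy',
    @credential_name = N'DataPullCredential'

-- Allow the proxy to be used by a job-step subsystem (e.g., CmdExec)
EXEC msdb.dbo.sp_grant_proxy_to_subsystem
    @proxy_name = N'DataPullProxy',
    @subsystem_name = N'CmdExec'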

Once the proxy is created, go to the data-pull step of the SQL Server Agent job, select the Run as option, and then select the proxy that you created.

This will successfully run the job.

Sunday, October 7, 2012

Installing Hadoop on Ubuntu (12.04) - single node


Installing Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Creating user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Configuring SSH
su - hduser
ssh-keygen -t rsa -P ""

Then authorize the new key for passwordless login:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

To be sure that the SSH installation went well, you can open a new terminal and try to create an SSH session as hduser with the following command:

$ssh localhost

Reinstall the SSH server if localhost does not connect:
sudo apt-get install openssh-server

Edit Sudoers
pkexec visudo

Add the line below to add hduser to the sudoers file:
hduser ALL=(ALL) ALL

Press Ctrl + O to save in nano.

Disable IPv6
Run the following command using a root account:
$sudo gedit /etc/sysctl.conf
This command opens sysctl.conf in a text editor; copy the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

If you face a problem telling you that you don't have permissions, remember to run the previous commands as root.
These steps require you to reboot your system, but alternatively, you can run the following command to re-initialize the configuration:

$sudo sysctl -p
To make sure that IPv6 is disabled, you can run the following command (it should print 1):
$cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Configuration of Hadoop
Installing Hadoop

Now we can download Hadoop to begin the installation. Go to Apache Downloads and download Hadoop version 0.20.2. To avoid permission issues, download the tar file into the hduser home directory, for example /home/hduser.

Then you need to extract the tar file and rename the extracted folder to 'hadoop'. Open a new terminal and run the following commands:

$ cd /home/hduser
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop

Update $HOME/.bashrc
You will need to update the .bashrc file for hduser (and for every user who needs to administer Hadoop). To edit the .bashrc file, open it as root:

$sudo gedit /home/hduser/.bashrc

Then add the following configuration at the end of the .bashrc file:

# Set Hadoop-related environment variables

export HADOOP_HOME=/home/hduser/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Some convenient aliases and functions for running Hadoop-related commands

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
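
To apply these changes in the current shell without logging out and back in, reload the file:

$ source $HOME/.bashrc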

Hadoop Configuration

Now we need to configure the Hadoop framework on the Ubuntu machine. The following are the configuration files we need to update.

hadoop-env.sh
We only need to update the JAVA_HOME variable in this file. Open the file in a text editor using the following command:

$sudo gedit /home/hduser/hadoop/conf/hadoop-env.sh

or

nano /home/hduser/hadoop/conf/hadoop-env.sh

Then you will need to change the following line

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

To 

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Note: if you face an "Error: JAVA_HOME is not set" error while starting the services, it seems you forgot to uncomment the previous line (just remove the #).

core-site.xml
First, we need to create a temp directory for the Hadoop framework. If you need this environment for testing or quick prototyping (e.g., developing simple Hadoop programs for your own tests), I suggest creating this folder under the /home/hduser/ directory. Otherwise, you could create it in a shared place (like /usr/local), but you may face security issues there. To avoid the exceptions that permissions can cause (like java.io.IOException), I have created the tmp folder under the hduser home directory.

To create this folder, type the following command:

$ sudo mkdir /home/hduser/tmp

Please note that if you want to make another admin user (e.g., hduser2 in the hadoop group), you should grant that user read and write permissions on this folder using the following commands:


$ sudo chown hduser2:hadoop /home/hduser/tmp

$ sudo chmod 755 /home/hduser/tmp
Now, we can open hadoop/conf/core-site.xml to edit the hadoop.tmp.dir entry.
We can open the core-site.xml using text editor:

$sudo gedit /home/hduser/hadoop/conf/core-site.xml

or

nano /home/hduser/hadoop/conf/core-site.xml

Then add the following configuration between the <configuration> ... </configuration> elements:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

mapred-site.xml
We will open hadoop/conf/mapred-site.xml using a text editor and add the following configuration values (like core-site.xml):

nano /home/hduser/hadoop/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

hdfs-site.xml
Open hadoop/conf/hdfs-site.xml using a text editor and add the following configuration:

nano /home/hduser/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create time.</description>
</property>

Formatting the NameNode

You must format the NameNode in HDFS before starting the cluster for the first time. Do not do this while the system is running; it is usually done only once, when you install the cluster. Run the following command:

$/home/hduser/hadoop/bin/hadoop namenode -format

Starting the Hadoop Cluster

Navigate to the hadoop/bin directory and run the ./start-all.sh script:

cd /home/hduser/hadoop/bin
./start-all.sh

There is a handy tool called jps; run it to make sure all the services are up. On a single-node setup you should see NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker (plus Jps itself):

$ jps
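
As a quick smoke test (assuming the daemons are running and hadoop/bin is on the PATH as configured in .bashrc above; the directory names here are just examples), you can copy a local file into HDFS and list it:

$ hadoop fs -mkdir /user/hduser/test
$ hadoop fs -copyFromLocal /etc/hosts /user/hduser/test
$ hadoop fs -ls /user/hduser/test

When you are done, stop all the daemons with ./stop-all.sh.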


EDIT: I observed many changes in the folder structure in the latest releases of Hadoop, so the updated configuration of the files is given below:
core-site.xml:

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>

mapred-site.xml:

<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9001</value>
</property>

hdfs-site.xml:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>



Hadoop-env.sh:
# The java implementation to use.
export JAVA_HOME="$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")"
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}