
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 8, AUGUST 2011, ISSN 2151-9617, HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/, WWW.JOURNALOFCOMPUTING.ORG

An Efficient Approach of Fast External Sorting Algorithm in Data Warehouse


Abhishek Purohit, Naveen Hemrajani, Savita Shiwani, Ruchi Dave
Abstract: Sorting of bulk data in a warehouse is possible through external sorting, and the performance of external sorting is analyzed in terms of both time and I/O complexity. The proposed method is a hybrid technique that uses quick sort and in-place merging in two distinct phases. Both the time and I/O complexities of the proposed algorithm are analyzed here. The proposed algorithm uses a special in-place merging technique, which creates no extra backup file for manipulating huge records. Because of this, the algorithm saves the large amount of disk space that would otherwise be needed to hold a copy of the file. This also reduces the time complexity and makes the algorithm faster.

Index Terms: External sorting, in-place merging, merge and quick sort, algorithm, time and space complexity.

1 INTRODUCTION
Though the memories of current computers have been increasing rapidly, there still exists a need for external sorting for large databases, and sorting has continued to account for roughly one-fourth of all computer cycles. The problem of how to sort data efficiently has been widely discussed. The main concern with external sorting is to minimize disk access, since reading a disk block takes about a million times longer than accessing an item in RAM. The most common external sorting algorithm still uses the merge sort described by Knuth [1]. The number of I/Os is a more appropriate measure of the performance of external sorting and other external-memory problems, because I/O speed is much slower than CPU speed. In two-way merge sort, a file is divided into two subfiles. The records of the two subfiles are written to two auxiliary files, whereby, by pairwise comparison, the smaller records are always written first, thus producing sorted runs of two records each. During the second pass, the runs from the two output files are compared pairwise, producing new runs of four records each, which are in sorted sequence. This process continues until the entire file is sorted. This routine makes use of temporary disk files. Dufrene and Lin [2] proposed an algorithm in which no other external file is needed; only the original file (the file to be sorted) is used. M. N. Adnan et al. proposed a hybrid external sorting algorithm with no additional disk space. A similar algorithm was proposed by M. R. Islam et al. [3]. In all three of these algorithms the authors gave attention to the time complexities but not to the I/O complexities. In this paper we study the I/O complexities of these algorithms. For this, in the next section we review the external sorting algorithms with no additional disk space.
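The two-way merge process described above can be sketched in a few lines of Python (an in-memory model: lists stand in for the disk files, and `heapq.merge` plays the role of the pairwise comparison; names are illustrative):

```python
from heapq import merge

def two_way_merge_sort(records):
    """Model of two-way external merge sort: runs double in length
    each pass until the whole file is a single sorted run."""
    runs = [[r] for r in records]              # initial runs of one record each
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs) - 1, 2):   # merge runs pairwise
            merged.append(list(merge(runs[i], runs[i + 1])))
        if len(runs) % 2:                      # odd run carried into the next pass
            merged.append(runs[-1])
        runs = merged
    return runs[0]

print(two_way_merge_sort([4, 1, 3, 2, 5]))   # → [1, 2, 3, 4, 5]
```

Each iteration of the `while` loop corresponds to one full read/write pass over the external file, which is exactly the cost the I/O analysis counts.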

2 RELATED WORK
In this section we review three external sorting algorithms with no additional disk space. The proposed algorithm is based on the algorithms proposed by Dufrene and Lin [2] and M. R. Islam et al. [3]. Among these algorithms, the overall performance of M. R. Islam et al. [3] is the best, so we review that algorithm in detail in the following subsections.

2.1 An efficient external sorting algorithm

This algorithm, proposed by Dufrene and Lin [2], is essentially a generalization of the internal bubble sort, where the individual record of the internal sort is replaced by a block of records in the external file. At the first iteration, Block_1 and Block_N are read into the lower half and upper half of the memory array respectively. These two blocks are then sorted using quick sort. The records of the lower half, which now contains the lowest sorted records of Block_1 and Block_N, are retained in the memory array, while the records of the upper half are written back to the Block_N area of the external file. Block_N-1 is then read into the upper half and the process continues.

________
Abhishek Purohit is an M.Tech Software Engineering scholar at Suresh Gyan Vihar University, Jaipur, India.
Naveen Hemrajani is Vice Principal at Suresh Gyan Vihar University, Jaipur, India.
Savita Shiwani is Assistant Professor, Department of Computer Science & Engineering, SGVU, Jaipur, India.
Ruchi Dave is Assistant Professor, Department of Computer Science & Engineering, SGVU, Jaipur, India.
________

2.2 A faster external sorting algorithm to sort bulk data

This algorithm, proposed by M. N. Adnan et al., is also a generalization of the internal bubble sort. The algorithm works in two phases. In the first phase, it works like the algorithm of Dufrene and Lin reviewed in the previous section. After this phase, the lower half of the memory array contains the lowest sorted records of the entire file. In the second phase, the records in the memory array are sorted by merging, and the sorted records are written to the position of Block_N-1 in the external file until that block is full. The remaining records in the lower half (if any) are copied into the upper half of the memory array, so the upper half contains the highest records of Block_N and Block_N-1. Merge sort is then applied again to sort the records in the upper half of the memory array, the lower half serving as the additional space that merge sort requires. The next iteration starts with Block_N-2 and Block_N-1 read into the lower half and upper half of the memory array respectively; at the end of each iteration, the upper half is copied into the lower half of the memory array. The merging terminates when the last remaining block has been read into the lower half of the memory array and processed. After this, the upper half of the memory array contains the highest sorted records, and they are written to the position of Block_N in the external file.

2011 Journal of Computing Press, NY, USA, ISSN 2151-9617

2.3 In-Place Merging Algorithm

This algorithm, proposed by M. R. Islam et al. [3], works in two phases. In the first phase, it works like the algorithm proposed by Dufrene and Lin: Block_1 and Block_S are read into the lower half and upper half of the memory array respectively, and they are sorted using quick sort. This phase terminates when Block_2 has been read into the upper half of the memory array and sorted with the remaining records in the lower half. Thus we get sorted runs, and after this phase the lower half of the memory array contains the lowest sorted records of the entire file. The algorithm then switches to its second phase, where the sorting process continues under two cases:

Case 1: The required blocks are read, and if the last record of the lower half of the memory array is smaller than the first record of the upper half, the records of the memory array are already in order; no sorting is required, and the next block is read.

Case 2: This is the general case, in which the in-place merging technique is used.

In the second phase, Block_S-1 and Block_S are read into the lower and upper halves of the memory array respectively. In Case 1 the blocks need not be written back to the external file. In Case 2, after the in-place merging is applied, the upper half of the memory array contains the highest ordered records of Block_S and Block_S-1, and the lower half is written back to its corresponding position in the external file (Fig. 6). After this, Block_S-2 is read into the lower half of the memory array and checked against the conditions of Case 1 and Case 2. In this way, when Block_2 has been processed, the upper half of the memory array contains the highest sorted records of the entire file, and for Case 2 they are written to the position of Block_S in the external file. The next iteration starts with Block_S-2 and Block_S-1 read into the lower and upper halves of the memory array respectively. At the end of this iteration, the upper half of the memory array contains the highest sorted records among Block_2, Block_3, ..., Block_S-1, and for Case 2 they are written to the position of Block_S-1 in the external file. After each pass, the unsorted part of the external file decreases by one block. The last two blocks to be processed are Block_2 and Block_3; once they are done, the entire file is sorted.

2.3.1 Algorithm: an external sorting algorithm using in-place merging with no additional disk space

1. Declare the blocks of the external file, each half the size of the memory array. Let the blocks be Block_1, Block_2, ..., Block_S-1, Block_S.
2. If there is only one block in the external file, quick sort the entire memory array.
3. Read Block_1 into the lower half of the memory array and set T = S. // Begins first phase
4. Read Block_T into the upper half of the memory array.
5. Sort the entire memory array using quick sort.
6. Write the upper half of the memory array to the Block_T area of the external file.
7. Decrement Block_T by one block.
8. Repeat from step 4 if Block_T is not equal to Block_1.
9. Write the lower half of the memory array to the Block_1 area of the external file.
10. Set P = S. // Begins second phase
11. Read Block_P into the upper half of the memory array and set Q = P - 1.
12. Read Block_Q into the lower half of the memory array.
13. If the last element of the lower half is greater than the first element of the upper half, then sort (merge) the memory array using in-place merging and write the lower half of the memory array to the Block_Q area of the external file.
14. Decrement Block_Q by one block.
15. Repeat from step 12 if Block_Q is not equal to Block_1.


JOURNAL OF COMPUTING, VOLUME 3, ISSUE 8, AUGUST 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG

86

16. Write the upper half of the memory array to the Block_P area of the external file and decrement Block_P by one block.
17. Repeat from step 11 if Block_P is not equal to Block_2.
18. // End of sorting procedure
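As a sanity check of the steps above, here is a compact in-memory model of the whole procedure (hypothetical names; `sorted()` stands in both for the quick sort of the first phase and for the in-place merge of the second, and the loops run down to Block_1 rather than stopping at Block_2, which keeps the sketch obviously correct):

```python
def external_sort(blocks):
    """Model of the Sec. 2.3 algorithm: `blocks` plays the external file,
    one equal-sized list per block; two blocks fit in memory at once."""
    S = len(blocks)
    if S == 1:                                  # step 2: single-block file
        blocks[0].sort()
        return blocks
    # First phase (steps 3-9): quicksort sweeps leave the lowest
    # records in the lower half, written back to Block_1 at the end.
    lower = blocks[0]
    for T in range(S - 1, 0, -1):
        mem = sorted(lower + blocks[T])         # quick sort the memory array
        half = len(lower)
        lower, blocks[T] = mem[:half], mem[half:]
    blocks[0] = lower
    # Second phase (steps 10-17): merging passes settle each block in turn.
    for P in range(S - 1, 0, -1):
        upper = blocks[P]
        for Q in range(P - 1, -1, -1):
            low = blocks[Q]
            if low[-1] > upper[0]:              # Case 2: merging is needed
                mem = sorted(low + upper)       # stands in for in-place merge
                half = len(low)
                blocks[Q], upper = mem[:half], mem[half:]
        blocks[P] = upper                       # step 16
    return blocks

print(external_sort([[9, 4], [7, 2], [5, 1], [8, 3]]))
# → [[1, 2], [3, 4], [5, 7], [8, 9]]
```

After each outer pass of the second phase, blocks[P] holds the largest remaining records, mirroring how the paper's passes shrink the unsorted part of the file by one block.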

3 PROPOSED MODEL
3.1 Sorting of bulk data in a data warehouse using less time and space

The proposed algorithm works in several phases. In the first phase, the external file is divided into equal-sized blocks, where the size of each block is approximately equal to the available main memory (RAM) of the computer. If the size of the available internal memory is M and the size of the external file is N, then the size of each block is M and the number of blocks is S = N/M. Block_A is read into memory, the records in main memory are sorted using quick sort (Fig. 1(A)), and the sorted block is written back to Block_A. The process continues until the last block, Block_N, has been processed. After this the proposed algorithm switches to its next phase.
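The first phase above is just a block-wise quick sort; a minimal in-memory sketch (list slices stand in for disk blocks, and Python's built-in sort for quick sort; names are illustrative):

```python
def phase_one(records, M):
    """Split the file into S = N/M blocks and sort each block in RAM."""
    blocks = [records[i:i + M] for i in range(0, len(records), M)]
    for block in blocks:
        block.sort()          # in-memory quick sort of one block
    return blocks             # each block is written back to its own area

print(phase_one([7, 3, 9, 1, 4, 8], 3))   # → [[3, 7, 9], [1, 4, 8]]
```

Each block is read once and written once, which is where the N/M write operations counted in Section 4.1 come from.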

Each sorted block is divided into two sub-blocks, B_1 and B_2 (Fig. 1(B)). Sub-block B_1 of Block_A and sub-block B_1 of Block_B are read into the lower and upper halves of main memory (RAM) respectively. The records of the lower half and the upper half, which are individually sorted, are merged using the in-place merging technique. After merging, the records of the upper half are written to B_1 of Block_B, and the records of sub-block B_1 of Block_C are read into the upper half. The memory is then merged again using the in-place merging technique, the upper half is written to the position of B_1 of Block_C, and sub-block B_1 of Block_D is read into the upper half. This process repeats until sub-block B_1 of Block_N has been read into the upper half and processed. At that point the lower half of main memory contains the lowest records, in sorted order, among Block_A through Block_N, and it is written to the position of B_1 of Block_A.

Next, sub-block B_2 of Block_N and sub-block B_2 of Block_N-1 are read into the upper and lower halves of the memory array respectively, and the two halves are merged using the in-place merging technique. After merging, the lower half is written to B_2 of Block_N-1, and sub-block B_2 of Block_N-2 is read into the lower half. The memory is merged again, the lower half is written to the position of B_2 of Block_N-2, and sub-block B_2 of Block_N-3 is read into the lower half (Fig. 2(A)). This process repeats until sub-block B_2 of Block_A has been read into the lower half and processed. Now the upper half of main memory contains the highest records, in sorted order, among Block_A through Block_N, and it is written to the position of B_2 of Block_N. At this point sub-block B_1 of Block_A and sub-block B_2 of Block_N contain the lowest and highest sorted records respectively.

Now B_2 of Block_A and B_1 of Block_B are read into main memory and merged, and the lower and upper halves are written back to the positions of B_2 of Block_A and B_1 of Block_B respectively. Then B_2 of Block_A is renamed B_1 and B_1 of Block_B is renamed B_2, and both are taken to be under a new Block_A. Similarly, B_2 of Block_B and B_1 of Block_C are merged and renamed (Fig. 2(B)) to form a new Block_B. This technique repeats until B_1 of Block_N has been processed; let the last new block be Block_N. The above procedure is then applied to the new blocks, Block_A through Block_N, to obtain the next lowest and highest sorted records.
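One B_1 pass of the kind just described can be modelled as follows (a sketch under the same in-memory simplification; `inplace_merge` is a simple rotation-based stand-in, not the paper's special in-place merging technique):

```python
def inplace_merge(a, mid):
    """Merge the two sorted runs a[:mid] and a[mid:] in place."""
    i, j = 0, mid
    while i < j < len(a):
        if a[i] <= a[j]:
            i += 1
        else:
            a.insert(i, a.pop(j))   # rotate a[j] down into position i
            i += 1
            j += 1
    return a

def lowest_pass(blocks):
    """After the pass, blocks[0] holds the lowest records of the file, sorted."""
    lower = blocks[0]
    for k in range(1, len(blocks)):
        mem = lower + blocks[k]                    # lower and upper halves of RAM
        inplace_merge(mem, len(lower))
        half = len(lower)
        lower, blocks[k] = mem[:half], mem[half:]  # upper half written back
    blocks[0] = lower                              # lowest records of the file
    return blocks

print(lowest_pass([[2, 5], [1, 9], [3, 4]]))   # → [[1, 2], [5, 9], [3, 4]]
```

The symmetric B_2 pass runs in the opposite direction and leaves the highest records in the last sub-block.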

4 IMPLEMENTATION

4.1 Output Complexity

If the size of each block is M and the size of the external file is N, the first phase of the proposed algorithm takes N/M write (output) operations. In the second phase, obtaining the first lowest sorted sub-block takes N/M write operations, and obtaining the first highest sorted sub-block takes another N/M write operations. Next, N/M - 1 blocks have to be processed to generate the new blocks, which requires N/M - 1 write operations. Then, obtaining the next lowest and highest sorted sub-blocks takes N/M - 1 write operations in each case, and so on until the whole file is sorted. The total number of output operations is the sum of these terms.
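The per-phase write counts just listed can be tallied in a small sketch (the closing total is missing from the source, so this only illustrates the individual terms; S = N/M is assumed to divide evenly):

```python
def write_counts(N, M):
    """Write (output) operations per step of Sec. 4.1, with S = N/M blocks."""
    S = N // M
    return {
        "phase 1 (block quicksort)":      S,
        "first lowest sub-block":         S,
        "first highest sub-block":        S,
        "regrouping into new blocks":     S - 1,
        "each later lowest/highest pass": S - 1,
    }

print(write_counts(8000, 1000))
```

For N = 8000 and M = 1000 this gives S = 8, so the first three terms are 8 writes each and the later passes cost 7 writes each.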

4.2 Time Complexity

The average-case time complexity of the internal quick sort is O(n log n), where n is the number of records to be sorted. So the time complexity of the first phase of the proposed algorithm is (N/M)(n log n). In the merging phase, the algorithm uses the special in-place merging technique, whose running time depends on the total number of comparisons: merging n records with the in-place merging technique needs n comparisons. So the time to obtain the first lowest and highest sorted sub-blocks is (N/M - 1)n + (N/M - 1)n = 2(N/M - 1)n. Since the file size decreases by one sub-block after each iteration, the total time complexity is the sum of these per-iteration costs.

5 COMPARISON AND DISCUSSION

The output and time complexities of the algorithms of Approach 2.1 and Approach 2.2 are summarized in Table 1.

Table 1: Complexities of Approach 2.1 and Approach 2.2

Both the output and time complexity of Approach 2.2 are better than those of Approach 2.1, so the overall performance of Approach 2.2 is better. Therefore, in this section we compare and discuss the complexities of Approach 2.2 and the proposed algorithm. Moreover, the value of M in the proposed algorithm equals 2B in Approach 2.2, so for the comparison M has been replaced by 2B.

Here, for N > 4B, T1 - T2 > 0, i.e. T1 > T2, where T1 and T2 are the time complexities of Approach 2.2 and the proposed algorithm respectively. So the time complexity of the proposed algorithm is less than that of Approach 2.2. The reduction in time complexity of the proposed algorithm relative to the algorithm of Approach 2.2 is calculated and given in Table 2.

6 CONCLUSION AND FUTURE WORK

In this paper a new data warehouse sorting method is proposed. The approach uses an in-place merging technique and performs sorting in the data warehouse in minimum time and within the existing main memory (RAM) space, so bulk data in a data warehouse can easily be sorted. During the sorting process the whole main memory is in use, so no other process can execute concurrently with the sort. In future work we plan to achieve the same time complexity without using the complete main memory (RAM), so that other processes can execute concurrently with the sorting process.

7 ACKNOWLEDGEMENT

I express my sincere and deep gratitude to all the faculty members for their valuable help and guidance, which enabled us to complete this paper. I take great pleasure in expressing my deepest sense of gratitude to Mr. Naveen Hemrajani (Vice Principal), who took all pains to see that no compromise was made in the quality of the paper and that it was completed well in time, and who provided all the facilities for its preparation. With a deep sense of gratitude, I acknowledge my obligation to my guides Mrs. Ruchi Dave and Ms. Savita Shiwani for their untiring devotion, valuable suggestions, corrections, guidance, supervision and encouragement, which played an important role in the preparation of this paper. Finally, thanks to all my friends and all the faculty members.

8 REFERENCES

[1] D. E. Knuth, Sorting and Searching, The Art of Computer Programming, Vol. 3, Addison-Wesley, Reading, MA, 2nd edition, 1998.
[2] W. R. Dufrene and F. C. Lin, "An efficient sorting algorithm with no additional space," Comput. J. 35(3) (1992).
[3] M. R. Islam, W. Nusrat, M. Hossain and S. M. M. Rana, "A New External Sorting Algorithm with No Additional Disk Space with Special In-place Merging Technique," presented at the International Conference on Computer and Information Technology (ICCIT), 26-28 December 2004, Dhaka, Bangladesh.
[4] J. Katajainen and T. Pasanen, "In-Place Sorting with Fewer Moves," Inform. Process. Lett. 70 (1999) 31-37.
[5] J. Katajainen, T. Pasanen and J. Teuhola, "Practical In-Place Mergesort," Nordic J. Comput. 3 (1996) 27-40.
[6] J. Katajainen and J. L. Träff, "A Meticulous Analysis of Mergesort Programs," Lect. Notes Comput. Sci. 1203 (1997) 217-228.
[7] V. Geffert, J. Katajainen and T. Pasanen, "Asymptotically Efficient In-Place Merging," Theoret. Comput. Sci. 237 (2000) 159-181.
[8] G. Franceschini and V. Geffert, "An In-Place Sorting with O(n log n) Comparisons and O(n) Moves," J. Assoc. Comput. Mach. 52 (2005) 515-537.
[9] E. E. Lindstrom and J. S. Vitter, "The design and analysis of BucketSort for bubble memory secondary storage," IEEE Trans. Comput. C-34(3) (1985) 218-233.
[10] Clifford A. Shaffer, A Practical Introduction to Data Structures and Algorithm Analysis, Prentice-Hall, 1997.
[11] B. Singh and T. L. Naps, Introduction to Data Structure, West Publishing Co., St. Paul, MN, 1985.
[12] Thomas Niemann, Sorting and Searching Algorithms, ePaperPress.
[13] P.-A. Larson and G. Graefe, "Memory Management during Run Generation in External Sorting," Microsoft.
[14] B. C. Huang and M. A. Langston, "Practical In-Place Merging," Communications of the ACM.

Mr. Abhishek Purohit is an M.Tech (S.E.) scholar at SGVU and a faculty member at ECB, Bikaner. He received his B.E. degree in Computer Science & Engineering from Rajasthan University in 2004 and his M.Tech (SE) in 2011 (appeared). He possesses three years of teaching experience and has presented one paper at an international conference and two at national conferences. He is also working as an Assistant Professor in the Department of Computer Science & Engineering at ECB, Bikaner, an autonomous institute of the Government of Rajasthan.

Prof. Naveen Hemrajani is Vice Principal (Engg.) at SGVU and Chairman of CSI (Jaipur Chapter). He received his B.E. degree in Computer Science & Engineering from Shivaji University in 1992 and his M.Tech (CSE) in 2004. His Ph.D. research topic was Admission Control for Video Transmission. He possesses 19 years of teaching and research experience, has published two books and many research papers in reputed international and national journals, and has presented several papers at international and national conferences. He is an editorial board member of many reputed international journals and is working on a DST (Department of Science & Technology) sanctioned project.


Ms. Savita Shiwani has more than 12 years of teaching experience. She holds M.Sc. (Computer Science), MCA and M.Tech (Computer Science) degrees and is currently pursuing a Ph.D. from Banasthali Vidyapith. She also holds the 'A' and 'B' level certificates from DOEACC, New Delhi. At present she is a faculty member at Suresh Gyan Vihar University, Jaipur. She has four book publications to her credit and fifteen under publication, eight publications in national journals and seven in international journals, and has presented three papers at international and eight at national conferences. She holds a lifetime membership of the Computer Society of India and has been associated with different universities, including the University of Rajasthan, Jaipur; Rajasthan Technical University (RTU), Kota; Indira Gandhi National Open University (IGNOU); Makhan Lal Chaturvedi National University, Bhopal; Banasthali Vidyapith; and Kota Open University.

Mrs. Ruchi Dave has more than 10 years of experience. She holds an M.Tech (CS) degree. At present she is working as an Assistant Professor at Suresh Gyan Vihar University, Jaipur. She has three publications in national journals and five in international journals, and has presented papers at international conferences. She holds a lifetime membership of the Computer Society of India.

