BR Algorithms Corner

Ardeshir
BRFite
Posts: 1114
Joined: 15 Jan 2008 03:10
Location: Londonistan/Nukkad

Re: BR Algorithms Corner

Post by Ardeshir »

Common products - exactly like it says in the example, an iPhone 16gb listed differently by different retailers. The ideal situation would be if every retailer used something like a UPC/EAN code to identify a product, but since most don't, we have to create an internal reference point. In other words, comparing like for like items.
Rahul Mehta
BRF Oldie
Posts: 2577
Joined: 22 Nov 2001 12:31
Location: Ahmedabad, India --- Bring JurySys in India
Contact:

Re: BR Algorithms Corner

Post by Rahul Mehta »

Ardeshir wrote:I have a problem, and would really appreciate some direction on this.

I have n-number of tables taken from different sources (e.g. Walmart, Target, Mom N Pop Store etc.) with identical descriptors such as Product Name, Product Price, Product Brand, Product Description etc. Given that the same product can be listed in different ways by different sources (e.g. "iPhone 5s 16gb" vs "Black 16gb iPhone 5s"), how can I find common products across these n-number of tables? One approach I have been exploring is the use of Bayesian Classification.
Each description is a list of tokens, e.g. "iPhone 5s 16gb" has 3 tokens, namely "iPhone", "5s" and "16gb".

So sort all tokens alphabetically inside each description

So

"iPhone 5s 16gb" becomes "16gb 5s iPhone"
"Black 16gb iPhone 5s" becomes "16gb 5s Black iphone"

Now comparison across descriptions becomes easy. Sort all the new description strings alphabetically and identical strings, i.e. identical products, will be adjacent.

Now if two different tokens have identical meanings, then we have an issue. E.g. 16gb can be written as "16gb" or "Sixteen GB". They mean the same, but are written differently, so the above method will not work. Also, if a description is incomplete, then the above algorithm will not give full results. E.g. in the description "iPhone 5s 16gb", the color "black" or "white" is missing. So this algorithm will have shortcomings due to incomplete data.
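For illustration, here is a minimal C++ sketch of that token-sort idea: lowercase each description, split it on whitespace, sort the tokens, and group listings whose normalized keys are identical. The sample listings and the whitespace tokenizer are assumptions made for the sketch, and it inherits the limitations noted above (synonyms like "Sixteen GB" and missing tokens such as a color are not handled).

Code: Select all

#include <algorithm>
#include <cctype>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Normalize a product description: lowercase it, split on whitespace,
// sort the tokens and join them back into a single key string.
std::string normalize(const std::string& desc) {
    std::istringstream iss(desc);
    std::vector<std::string> tokens;
    std::string tok;
    while (iss >> tok) {
        std::transform(tok.begin(), tok.end(), tok.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        tokens.push_back(tok);
    }
    std::sort(tokens.begin(), tokens.end());
    std::string key;
    for (const auto& t : tokens) key += t + " ";
    return key;
}

int main() {
    // Made-up listings from different "retailers".
    std::vector<std::string> listings = {
        "iPhone 5s 16gb", "16gb iphone 5s", "Black 16gb iPhone 5s"};
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& l : listings) groups[normalize(l)].push_back(l);
    for (const auto& g : groups)
        std::cout << g.first << "-> " << g.second.size() << " listing(s)\n";
}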
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

Ardeshir wrote:Common products - exactly like it says in the example, an iPhone 16gb listed differently by different retailers. The ideal situation would be if every retailer used something like a UPC/EAN code to identify a product, but since most don't, we have to create an internal reference point. In other words, comparing like for like items.
Gotcha! Structurally, it is a list of aliases attached to a node.
To construct it, perhaps you could use Bayesian inference.

But if there is a typo, or a usage pattern that is wrong, then that would need to be classified as well [perhaps as a separate set of nodes attached to the parent node]. What would be interesting is if the classifications themselves were priority lists, prioritized based on some metric evaluation and trust-score values.
Prem
BRF Oldie
Posts: 21233
Joined: 01 Jul 1999 11:31
Location: Weighing and Waiting 8T Yconomy

Re: BR Algorithms Corner

Post by Prem »

Does this Yindianleela fit here

What It Takes to Win the World’s Highest Computer Science Honor
http://www.wired.com/2014/08/subhash-kh ... nna-prize/
One summer afternoon in 2001, while visiting relatives in India, Subhash Khot drifted into his default mode — quietly contemplating the limits of computation. For hours, no one could tell whether the third-year Princeton University graduate student was working or merely sinking deeper into the living-room couch. That night, he woke up, scribbled something down and returned to bed. Over breakfast the next morning, he told his mother that he had come up with an interesting idea. She didn't know what it was, but her reserved older son seemed unusually happy.

Khot's insight — now called the Unique Games Conjecture — helped him make progress on a problem he was working on at the time, but even Khot and his colleagues did not realize its potential. "It just sounded like an idea that would be nice if it was true," recalled Khot, now a 36-year-old computer science professor at New York University's Courant Institute of Mathematical Sciences.

When Khot returned to Princeton, he mentioned the idea to Sanjeev Arora, his doctoral adviser, who advised him to hold off on publishing it. "I wasn't sure it was going to be a good paper," Arora said. "I thought it was maybe a little premature, that it was just a month since he had the idea."

In a sense, Khot's insight completed an idea set in motion by another mentor, Johan Håstad of the Royal Institute of Technology in Stockholm. But even Håstad ignored Khot's conjecture at first. "I thought it might get proven or disproven in a year," he said. "It took us awhile to realize it had all these consequences."

Over the next several years, what seemed a modest observation — that a particular assumption could simplify certain approximation algorithm problems — grew into one of the most influential new ideas in theoretical computer science. Today, for his "prescient" assumption and his subsequent leadership in "the effort to understand its complexity and its pivotal role in the study of efficient approximation of optimization problems," Khot was awarded the 2014 Rolf Nevanlinna Prize, widely considered one of the top honors in his field.

In announcing the prize on its website, the International Mathematical Union also credited Khot's work for generating "new exciting interactions between computational complexity, analysis and geometry."

Khot, who tends to keep his thoughts close and acknowledgment of his achievements even closer, was as surprised as his colleagues by the success of the Unique Games Conjecture. "I definitely didn't expect that this small proposal would go so far," he said.

Although Arora, like others studying the limits of approximation algorithms, was initially unconvinced by Khot's "pie in the sky" idea, he now credits his former graduate student for sensing that his proposal could clear "a fundamental stumbling block."

The way Khot looks at things might strike some as pessimistic. Given his research into the theoretical limitations of computers, perhaps it is natural that he tends to see what cannot be done or what might go wrong. When packing for vacations, for example, Khot tries to anticipate any ailments that could strike his 2-year-old son, Neev, by bringing all the medicine he thinks his family could need.

"He has a good sense of what's going to go wrong — he's very analytical," said his wife, Gayatri Ratnaparkhi, a 32-year-old analyst at the Federal Reserve Bank of New York. "And the end result is that not much goes wrong in our lives."

But Khot's analytical approach can also be maddening, Ratnaparkhi said. "He tries to optimize things in every possible way," she said. When walking from point A to point B, for example, Khot always wants to find the shortest path. Ratnaparkhi has to persuade him to take the scenic route. And shopping becomes "a major enterprise," as Khot feels "obliged to go to five stores and take a look at prices," he said. He tries to avoid the outings whenever possible.

Then there were the cookies. One time, inside a local bakery, Khot was surprised to find that three 33-cent cookies were being sold for a dollar. While it didn't prevent him from buying the cookies, he said, "even if it's one cent, it seems off."

Calling Khot a "first-class mathematician," Naor highlighted the importance of the big, abstract questions he studies: "The boundaries between tractable and intractable are inherently interesting to us as humans."

In the three decades preceding Khot's graduate school research, computer scientists had shown that hundreds of important computational challenges belong to a category called "NP-hard" problems, which most computer scientists believe cannot be solved exactly by any algorithm that runs in a reasonable amount of time. One example is the famous "traveling salesman" problem, which asks for the shortest round-trip route that visits each city in a collection of cities once. By the time Khot arrived at Princeton in 1999, many computer scientists had shifted their focus to exploring efficient algorithms that find good approximate solutions to these difficult problems.

And computer scientists were successful at doing so — for many problems. But in 1992, a team of computer scientists — including Khot's adviser-to-be, Arora — astonished their colleagues by proving a result called the PCP theorem, which enabled researchers to show that for a wide variety of computational problems, even finding good approximate solutions is NP-hard, meaning that it's a task that, most computer scientists believe, is impossible to carry out efficiently.

This revelation dashed researchers' hopes of identifying arbitrarily good approximation algorithms for every problem, but it opened up a new line of inquiry: trying to generate "exact hardness" results, statements of the form, "Here's an approximation algorithm for problem X and a proof that finding any better approximation is NP-hard." Shortly before Khot started graduate school, Håstad established exact hardness results for a few approximation problems. It was unclear, however, how to extend his results to other computation problems.

At Princeton, after breezing through the department's prerequisites in three months — course work that usually takes good students a year and average students two, Arora said — Khot started playing around with Håstad's techniques, trying to establish exact hardness results for several problems. Then came his epiphany while vacationing in India: One of his problems got much simpler if he made a certain assumption about how difficult a certain approximation problem is. Back at Princeton, Khot realized that several of his other problems also became easier if he made the same assumption. He eventually named this assumption the Unique Games Conjecture.
Proving the Impossible
Khot grew up in Ichalkaranji — considered a small town in India with a population of just under 300,000 — where he was well-known for winning math competitions; his name and picture frequently appeared in the local Marathi-language newspapers Sakal ("Morning") and Pudhari ("Leader"). At 16, he achieved the highest score countrywide in the Indian Institute of Technology Joint Entrance Exam, a test so notoriously difficult that most eligible students don't bother to take it. At 17, Khot went off to study computer science at the school in Mumbai without ever having touched a computer.

Khot was an autodidact from an early age. He loved reading Russian science books that had been translated into Marathi. His favorite one included chapters describing rare elements like palladium and gallium. But he seemed destined to follow his parents into the medical profession. His father and mother — an ear, nose and throat specialist and a general practitioner, respectively — ran a clinic on the first two floors of the family's residence that bustled with patients from the local textile industry, many of whom suffered from respiratory ailments.

Suddenly, everyone was studying the implications of the Unique Games Conjecture. "You should see the number of mathematicians working on problems that emanated from this conjecture," said Avi Wigderson of the Institute for Advanced Study in Princeton, N.J. The years following the Max Cut result witnessed a flood of approximation hardness results — theorems of the form, "If the Unique Games Conjecture is true, then it's NP-hard to approximate the solutions of problem X any closer than Y percent."

"This conjecture suddenly became really interesting and important," Wigderson said. It even seemed to help prove approximation hardness results about problems that on the surface seemed to have nothing to do with the problem at the heart of the Unique Games Conjecture, which involves assigning colors to the nodes of a network. "What was special about his problem?" Wigderson asked. "It wasn't clear."

In 2008, Prasad Raghavendra of the University of California, Berkeley, showed that if the UGC is true, then it's possible to establish the approximation hardness of an entire category of common computational problems called "constraint satisfaction" problems. These involve trying to find the solution to a problem that satisfies as many of a set of constraints as possible — for example, the wedding seating chart that places feuding family members at different tables as much as possible.

"We understand immediately an infinite class of problems by relying only on the one problem [that Khot] postulated is hard, which is amazing," Wigderson said. The conjecture "creates an understanding that one rarely expects — that's why it's so interesting and beautiful," he said.

"It has changed the way we think about a lot of problems in computer science," said Ryan O'Donnell, a theoretical computer scientist at Carnegie Mellon University in Pittsburgh.
If the Unique Games Conjecture is ever disproved, all the approximation hardness results that emanate from it will collapse. But certain other results will hold firm: For some mysterious reason, the proofs of the approximation hardness results and the attempts to prove the conjecture itself have led mathematicians to state — and then prove — an assortment of theorems about seemingly unrelated areas of mathematics, including the geometry of foams, the relationship between different ways of measuring distance, and even the stability of different election systems. “Out popped these very natural questions,” O’Donnell said. These results will hold up regardless of whether the Unique Games Conjecture turns out to be true or false.

For the problem of coloring the nodes of a network that has a collection of constraints about which color combinations are allowed (top left), it is sometimes possible to find a coloring that satisfies all the constraints (top right). But for some networks and constraints (bottom left), it is impossible to satisfy all the constraints simultaneously. The Unique Games Conjecture concerns the problem of finding a coloring that satisfies as many constraints as possible, such as the bottom right coloring, which satisfies all the constraints except the one for the thick edge. (Quanta Magazine)
It remains to be seen whether computer scientists will be able to prove or disprove Khot's Unique Games Conjecture. A proof would be a boon to computer scientists, but a disproof might be even more exciting, Arora said. Researchers agree that disproving the conjecture would probably require innovative new algorithmic techniques that could unlock a host of different approximation problems. "If someone came up with an efficient algorithm [for the Unique Games problem], we would have a very valuable new insight into how to design algorithms," Arora said.

Khot doesn't expect someone to prove or disprove his conjecture any time soon. "At this point, we can probably hope to just keep constructing pieces of evidence in one direction or another," he said. He is working on proving the conjecture but is also exploring whether he can come up with different avenues toward proving approximation hardness results. "That's the real goal," he said.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

Jhujar: it fits perfectly. The first Nevanlinna Prize winner was Robert Endre Tarjan, for his analysis of union-find (the start of amortized analysis, which eventually led to the smoothed analysis of the simplex algorithm by Spielman and Teng, for which they won the Gödel Prize, and the former won the Nevanlinna in Hyderabad), in addition to other things like planarity testing and push-relabel for network flows. I am yet to look at Khot's work in more depth, but I knew of him before this prize.
negi
BRF Oldie
Posts: 13112
Joined: 27 Jul 2006 17:51
Location: Ban se dar nahin lagta , chootiyon se lagta hai .

Re: BR Algorithms Corner

Post by negi »

Ardeshir wrote:I have a problem, and would really appreciate some direction on this.

I have n-number of tables taken from different sources (e.g. Walmart, Target, Mom N Pop Store etc.) with identical descriptors such as Product Name, Product Price, Product Brand, Product Description etc. Given that the same product can be listed in different ways by different sources (e.g. "iPhone 5s 16gb" vs "Black 16gb iPhone 5s"), how can I find common products across these n-number of tables? One approach I have been exploring is the use of Bayesian Classification.
This is a very typical enterprise data problem; depending on the tools at your disposal you can address it in many different ways. One very crude way to do this with a scripting language of your choice would be to maintain a reference table of sorts, holding all possible combinations of product name and, against each, a standard name of your choice, and then merely do a lookup using the input string to retrieve a standardized value from this table. You can leverage perl/awk and the power of regular expressions to catch hold of as many patterns as possible.

You can maintain a separate script for populating or constructing this table whenever your source tables/systems change.
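A rough sketch of that reference-table idea, in C++ rather than perl/awk: a hand-maintained list of regular-expression patterns, each mapped to a standardized product name, and a lookup that returns the first pattern that matches the input string. The patterns and product names below are made up purely for illustration.

Code: Select all

#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Reference table: pattern -> standardized name (maintained by a separate script).
    std::vector<std::pair<std::regex, std::string>> reference = {
        {std::regex("iphone\\s*5s.*16\\s*gb|16\\s*gb.*iphone\\s*5s", std::regex::icase),
         "Apple iPhone 5s 16GB"},
        {std::regex("galaxy\\s*s5", std::regex::icase), "Samsung Galaxy S5"},
    };

    std::vector<std::string> inputs = {
        "iPhone 5s 16gb", "Black 16gb iPhone 5s", "Samsung GALAXY S5 white"};
    for (const auto& in : inputs) {
        std::string standardized = "UNMATCHED";
        for (const auto& entry : reference) {
            if (std::regex_search(in, entry.first)) { standardized = entry.second; break; }
        }
        std::cout << in << " -> " << standardized << "\n";
    }
}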
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

This will appeal to a lot of people on BRF (weapon enthusiasts :wink: )

If programming languages were weapons

Apt description of perl:

Perl is a molotov cocktail, it was probably useful once, but few people use it now.
member_22733
BRF Oldie
Posts: 3788
Joined: 11 Aug 2016 06:14

Re: BR Algorithms Corner

Post by member_22733 »

With the little bit of C# programming that I have done for some stupid instrumentation related project, I completely agree with this:
C# is a powerful laser rifle strapped to a donkey, when taken off the donkey the laser doesn’t seem to work as well. :rotfl: :rotfl:
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

C# is a good language - better than both Java and C++ - is what I came away with after reading tutorials and manuals for a week or so. Unfortunately I was using Mono on Linux (the laser came off the donkey and went onto an Arabian horse, and the power of the laser got downgraded too). The language by itself is not useful - it is the libraries that differentiate one from the other (for normal business CRUD and TP on servers, and GUI on the desktop). .NET 4.5 (or whatever the latest .NET is) is a must to be able to use C# to the fullest.

LokeshC: Have you tried F#? It is supposedly very good - much better than any of the languages out there. I have also been hearing great things about Julia (of all places, on the Sage math devel lists).

Waiting for my 64GB multiple-VM desktop to be set up - actually two, one for home and one for work - so that I can start playing with some of the newfangled stuff like MPI (I am not certain what we can get - the interconnect is really poor and what we do is not embarrassingly parallel). Currently I develop on FC12 on a 32-bit machine, emacs, and in C - very hard to beat this combo for HPC (except the 32-bit part and the amount of disk I have, which is being filled with all kinds of non-work-related stuff, at least for now).
Last edited by Vayutuvan on 02 Sep 2014 03:57, edited 1 time in total.
member_22733
BRF Oldie
Posts: 3788
Joined: 11 Aug 2016 06:14

Re: BR Algorithms Corner

Post by member_22733 »

I haven't tried F# yet. In my case, here is how the laser came off the donkey: I developed a soft-algo for parameter estimation using C# in that project, but it was for offline processing and the control frequency was slowish (to be used on a large floating rig with a slowish response time; nothing exciting happens usually). My choice would have been to use something like Matlab, but I was "grandfathered" into using C#.

Some people then assumed that if we got a real-time version of that algo into an embedded system, it could be used in a lot of similar problems, which could really help the industry. So I was tasked with helping an embedded team understand the C# code and port it onto firmware. That is where the donkey disappeared and the laser ended up doing all sorts of things that no one ever foresaw :). Finally managed to help those guys out, but the struggle was insane :)
member_22733
BRF Oldie
Posts: 3788
Joined: 11 Aug 2016 06:14

Re: BR Algorithms Corner

Post by member_22733 »

Waiting for my 64GB multiple VM desktop to be setup so that I can start playing with some of the new fangled stuff. Currently I develop on FC12, emacs, and in C - very hard to beat this combo for HPC.
I have an extremely ambitious :( pet project that I want to complete. Much like the "artificial life" projects that go on in various univs, it is a simulated robotics project, where I simulate a whole 3D environment and evolve artificial neural networks with controls to function as "living things" in these worlds. These worlds are filled with danger and usually need some sort of intelligence to survive (i.e. consume more energy than you spend). The bottleneck is combining reinforcement learning with neural networks, but once I figure that out I will need something like your machine plus a high-performance GPU for something "meaningful" to evolve.
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

http://web.stanford.edu/class/cs240/rea ... amport.pdf
Time, Clocks, and the Ordering of Events in a Distributed System!

more such from his links
http://www.se-radio.net/2014/04/episode ... d-systems/
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

http://en.wikipedia.org/wiki/Stable_marriage_problem

Code: Select all

function stableMatching {
    Initialize all m ∈ M and w ∈ W to free
    while ∃ free man m who still has a woman w to propose to {
       w = m's highest ranked woman to whom he has not yet proposed
       if w is free
         (m, w) become engaged
       else some pair (m', w) already exists
         if w prefers m to m'
           (m, w) become engaged
           m' becomes free
         else
           (m', w) remain engaged
    }
}
can someone think of writing a temporal logic statement on this? I tried, but gave up.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

A 3D fractal - looks like a crown fit for Borgias

http://farm9.staticflickr.com/8048/8447 ... cd5d_o.png
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

Not the broccoli-shaped bottom. :) How will the Pope put it on his head?
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

what about this?
https://pbs.twimg.com/media/BtZWtk3CQAEpIZ8.jpg:large

are broccoli and cauliflower fractals?
vishvak
BR Mainsite Crew
Posts: 5836
Joined: 12 Aug 2011 21:19

Re: BR Algorithms Corner

Post by vishvak »

Fractals in natural phenomena have long been known.
http://en.wikipedia.org/wiki/Fractal#Na ... l_features

Even lightning bolts show this behavior, as do trees, river networks, earthquakes and galaxy formations.

Fractals in Rangoli: link
About Fractals in Kolam - an auspicious symbol in the Indian culture.
Neshant
BRF Oldie
Posts: 4852
Joined: 01 Jan 1970 05:30

Re: BR Algorithms Corner

Post by Neshant »

LokeshC wrote: I have an extremely ambitions :( pet project that I want to complete. Much like the "artificial life" projects that go on in various univs it is a simulated robotics project, where I simulate a whole 3d environment and evolve artificial neural networks with controls to function as "living things" in these worlds. These worlds are filled with danger and usually need some sort of intelligence to survive (i.e. consume more energy than spend). The bottle neck is to combine reinforcement learning with neural networks, but once I figure that out I will need something like your machine plus a high performance GPU for something "meaningful" to evolve.
I remember a commercial simulation game called Creatures (tm).

Little animal-like creatures with digital DNA were running around in an interesting environment. They could reproduce with other creatures and the DNA was a hybrid of the two. You could introduce toys, aphrodisiacs, medicines and it would evolve the personality of the creature and some of it would rub off on their DNA or something. They would grow old after being hatched from an egg. You could trade eggs with others which were hybrids of various parents you had chosen to put together. There were also monsters running around in there. Someone figured out how to splice the digital DNA of a monster with one of the creatures and shared his eggs with others.

I was amazed at the game and thought of it as the future of how artificial life might be evolved digitally.

Check it out below :
http://en.wikipedia.org/wiki/Creatures_ ... #Creatures
johneeG
BRF Oldie
Posts: 3473
Joined: 01 Jun 2009 12:47

Re: BR Algorithms Corner

Post by johneeG »

Neshant wrote:
LokeshC wrote: I have an extremely ambitions :( pet project that I want to complete. Much like the "artificial life" projects that go on in various univs it is a simulated robotics project, where I simulate a whole 3d environment and evolve artificial neural networks with controls to function as "living things" in these worlds. These worlds are filled with danger and usually need some sort of intelligence to survive (i.e. consume more energy than spend). The bottle neck is to combine reinforcement learning with neural networks, but once I figure that out I will need something like your machine plus a high performance GPU for something "meaningful" to evolve.
I remember a commercial simulation game called Creatures (tm).

Little animal like creatures with digital DNA were running around in an interesting environment. They could reproduce with other creatures and the DNA was a hybrid of the two. You could introduce toys, aphrodesiacs, medicines and it would evolve the personality of the creature and some of it would rub off on their DNA or something. They would grow old after being hatched from an egg. You could trade eggs with others which were hybrids of various parents you had chosen to put together. There were also monsters running around in there. Someone figured out how to splice the digital DNA of a monster with one of the creatures and shared his eggs with others.

I was amazed at the game and thought of it as the future of how artificial life might be evolved digitally.

Check it out below :
http://en.wikipedia.org/wiki/Creatures_ ... #Creatures
The main stumbling block to creating such artificial creatures is ego: the sense of "I" or "me", i.e. Aham-kara.

That seems to be the main difference between living beings and non-living beings.

In this connection, I had a doubt:
a) how to define time?
b) how to define place?
ArmenT
BR Mainsite Crew
Posts: 4239
Joined: 10 Sep 2007 05:57
Location: Loud, Proud, Ugly American

Re: BR Algorithms Corner

Post by ArmenT »

Ok. Here's a nice puzzle which seems counter-intuitive.

Problem: You are given N random integers and you should insert them into a sequence in numeric order, e.g. say you are given 5 1 4 2
- 5
- 1 5
- 1 4 5
- 1 2 4 5
and so on...

And for the second part, you should remove elements one at a time from the sequence by picking a random position, e.g. say we picked 1 2 0 0 (I'm assuming indexes are 0-based here)
- 1 2 4 5
- 1 4 5
- 1 4
- 4
-
At what value of N do you think linked lists would be a good choice for this task, and at what N would the humble array (or vector in C++ speak) be better?

The answers may surprise you :). Post away.
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

In both operations and with both types (vector/ArrayList and linked list) we have to inspect the value at the position to insert, right? Dunno about C++, but in a Java ArrayList it would be expensive to reach the position, resize the array and add the elements; a linked list would just relink the next and previous references.

Short answer: I am thinking that for adding and removing, the linked list is the better choice, and for accessing elements, arrays. At what value of N confuses me now, as either one has to search up to n-1 elements to find the value.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

ArmenT: In the absence of more information, here is my answer.

Use a vector (dynamic array). The way you posed the problem, I am assuming that the vector needs to be accessed by rank. As new insertions come in, binary search for the place of insertion, make space for one element (by moving everything greater one place to the right, starting from the back) and insert the element. Since it is a dynamic array, space will be created (the capacity is actually doubled when it runs out). Deletion is similar but needs no binary search, because one is given the rank (index) of the element to be deleted. Of course, the dynamic array will have to shrink once the free space goes beyond 25%. Obviously, at the time of expansion or shrinkage of the dynamic array, memory usage temporarily doubles (the old and new arrays coexist during the copy).

One can also use a singly linked list, which will hit its memory high-water mark at the time when the insertions have stopped.

In both cases, the algorithm is O(n^2). The binary search gets subsumed, as it is O(log n) per element and hence O(n log n) overall.
The linked list solution has the same time bounds as the dynamic array with deletions.

More nuanced points are:

1. The dynamic array will have a higher memory high-water mark, but the time is more like c_1*n^2 + c_2*n*log(n).
2. The singly linked list has a lower memory high-water mark (this needs to be verified), but the time is more like c_3*n^2.

It all depends on the space-time trade-off. This is very important only when n gets very large. In that case, both will have very similar running times, unless the time for direct access is far smaller than the time for one level of indirection. For linked lists, one should not depend on the system memory management (or garbage collection), which tends to be really bad for high-performance applications. I am talking about situations like when one has only 64GB RAM whereas n is ~16e9. Considering 4 bytes per int (your elements are ints, aren't they?), the high-water mark will take up all the RAM.

At n = 16e9, the number of operations is a (hopefully) small multiple of 256e18. The initial binary searches in the case of dynamic arrays cost << 256e18 and can safely be ignored.

If access by rank is not required and the insertions take place before deletions, one can do more things with heaps (AKA priority queues).
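A minimal C++ sketch of the dynamic-array approach described above: std::lower_bound does the binary search, and std::vector takes care of shifting elements and doubling the capacity. The input and deletion sequences are the ones from ArmenT's example.

Code: Select all

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> seq;

    // Insert: binary search for the position, then shift the tail right by one.
    for (int x : {5, 1, 4, 2}) {
        auto pos = std::lower_bound(seq.begin(), seq.end(), x);
        seq.insert(pos, x);                    // O(log n) search + O(n) shift
    }
    // seq is now {1, 2, 4, 5}

    // Delete by rank (0-based index): no search needed, but the tail shifts left.
    for (int rank : {1, 2, 0, 0}) {
        std::cout << "deleting rank " << rank << " -> value " << seq[rank] << "\n";
        seq.erase(seq.begin() + rank);         // O(n) shift
    }
}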
vishvak
BR Mainsite Crew
Posts: 5836
Joined: 12 Aug 2011 21:19

Re: BR Algorithms Corner

Post by vishvak »

Probably B-trees (balanced trees) will be relevant here, since this can be done simply using the B-tree method. If I am not mistaken, B-trees can be made with any data structure. But maybe a B-tree is overkill here.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

Vishvak: The way the question was posed, it seems to me that one needs the vector to be sorted at all times and accessible by rank in O(1). Also, this needs to be an online algorithm. In that case there are not too many options left.
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

His core need is add and remove, use-case wise. A sorted array, or a direct pointer to the linked objects, is the faster way to hit the position. I'm ignoring memory allocation here.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

Adding can be done using a priority queue. But if this list is a shared resource where at any given point there are other threads accessing it by rank (a query like: give me the element which is k'th in the sort order), how can that be accomplished?

I just thought of a way to reduce it to O(n log n) if more space can be used and the input data (keys?) do not have duplicates or holes. B-trees are a possibility, but they have a lot of wasted space and pointer following. But it is an interesting problem and, I think, an oft-used pattern in a larger program/application.
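The post does not spell out the construction, but one structure that fits the "more space, keys without duplicates or holes" idea is a Fenwick (binary indexed) tree over the key universe: an insertion is a point update, and the k'th remaining element is found by descending the tree, both in O(log U). A sketch, under the assumption that keys are non-negative integers below a known bound U (here 16), again on ArmenT's example:

Code: Select all

#include <iostream>
#include <vector>

// Fenwick tree over the key universe [0, U): count[k] = 1 if key k is present.
// Supports insert/delete of a key and "find the element of a given rank",
// each in O(log U).
struct Fenwick {
    int n;
    std::vector<int> tree;
    explicit Fenwick(int n) : n(n), tree(n + 1, 0) {}

    void add(int key, int delta) {             // point update at a key
        for (int i = key + 1; i <= n; i += i & (-i)) tree[i] += delta;
    }
    int select(int k) const {                  // smallest key with prefix count >= k (k is 1-based)
        int pos = 0, pw = 1;
        while (pw * 2 <= n) pw *= 2;
        for (; pw > 0; pw >>= 1) {
            if (pos + pw <= n && tree[pos + pw] < k) {
                pos += pw;
                k -= tree[pos];
            }
        }
        return pos;                            // 0-based key
    }
};

int main() {
    Fenwick f(16);                             // assumes keys lie in [0, 16)
    for (int x : {5, 1, 4, 2}) f.add(x, 1);
    for (int rank : {1, 2, 0, 0}) {            // 0-based ranks, as in the puzzle
        int key = f.select(rank + 1);
        std::cout << "deleting rank " << rank << " -> key " << key << "\n";
        f.add(key, -1);
    }
}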
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

Math guru, many threads can access arrays; the only thing is we have to restrict/synchronize the access. So why is that a problem?

--ps

found a nice, useful link
http://bigocheatsheet.com/
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

SaiK: The main problem is to find an element given its rank in sorted order. O(n^2) is too high a time complexity for any reasonably sized problem.

I remember now. Scratch the priority queue. It needs to be an augmented balanced binary search tree where the augmented data is the size of the tree rooted at a node.

The following will work in O(n log n) where the deletions are by rank, not by key. These data structures are called augmented data structures.

Basically, as the elements come in, build a balanced binary search tree. You need one extra field at each node which keeps the size of the subtree rooted at that node. The n insertions take O(n log n) in total.

Query by rank or delete by rank: A running sum and comparing with k will get you to the element you want in O(log n), i.e. height of a balanced binary tree.

While deleting and/or rebalancing, the size fields need to be updated.
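As a concrete stand-in for that augmented red-black tree, GCC's policy-tree extension (__gnu_pbds) ships an order-statistic tree that maintains exactly this subtree-size information. A sketch, assuming a GCC/libstdc++ toolchain and distinct keys, again on ArmenT's example:

Code: Select all

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>
#include <iostream>

// Red-black tree augmented with subtree sizes: find_by_order() is select-by-rank.
using ordered_set = __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                                     __gnu_pbds::rb_tree_tag,
                                     __gnu_pbds::tree_order_statistics_node_update>;

int main() {
    ordered_set s;
    for (int x : {5, 1, 4, 2}) s.insert(x);    // O(log n) per insertion
    for (int rank : {1, 2, 0, 0}) {
        auto it = s.find_by_order(rank);       // k'th smallest element, O(log n)
        std::cout << "deleting rank " << rank << " -> key " << *it << "\n";
        s.erase(it);                           // rebalancing updates the size fields
    }
}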
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

SaiK: That cheat sheet is a good ready reference.

We get O(n log n) for n insertions. Querying for an element (by element value) is O(log n). You need to find it before you can delete, right? With n deletions, you end up with O(n log n) for n deletions. If your deletions are by key, then there is no problem. But the deletion is by rank. So you need the augmented data structure I described above to get an overall O(n log n) performance for n insertions and O(n) deletions by rank (position in the sorted list).
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

How about index arrays as a separate list? The indexes are reordered for the given array, but a reference has to read the index array first. Removing is nothing but a soft delete: the index would be removed/reset.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

SaiK: what happens if an element gets deleted? The ranks of all elements greater than the deleted element's rank get reduced by one. Now the next deletion comes in. One would have to scan this list from the beginning for holes. On average, assuming a uniform distribution, the scans are going to take n/2 steps, so it will end up being O(n^2) time complexity.

For augmented data structures, CLR has a discussion. I also found an online page which I will post in an hour or two.

ArmenT: don't leave us in the lurch. Pl. tell us more about what the larger problem is.

Added Later: That on-line source seems to be a copyright violation of CLR - it is an HTML version of CLR "Chapter 15 Augmenting Data Structures".

The origin of the idea is from Preparata and Shamos and a paper by Edelsbrunner. I met two of the mentioned people.
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

let us take this to show what I was thinking:
orig: {5,1,4,2}
orig index: {3,0,2,1}
adding:
after first arment routine: {1,2,4,5}
after first arment routine index: {1,3,2,0}
removing:
before second arment routine: {1,2,4,5}
index changes: for 1 2 0 0 [hope the random choice is fine on the index, and not on the referent]
{1,3,2,0} -> {1,2,4,5}
{1,2,0} -> {1,4,5}
{1,,2,} -> {1,4}
{,,2,} -> {4}
{,,,,} -> {}

we can keep the holes with a value of "-1" or just remove them from index arrays.

Is this the wrong approach?
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

SaiK: I am unable to figure out how your algorithm is going to work from your example above. Let us assume that, yes, you can have S as a sentinel value, i.e. any value that doesn't occur in the input, and that all arrays are 0-based.

The brute-force insertion-sort-based algorithm I gave first is O(n^2), and so is the one you described above - if I am understanding you correctly.

start: {1,3,2,0} -> {1,2,4,5}
delete 1: {1,S,2,0} -> {1,S, 4,5}
delete 2: Which one is rank 2? You start your scan of the index array from the beginning and keep a count, skipping all the S (-1) values. In this case you find that the original index 0 is the current index 2, which is 5, and delete it. The point is that you have to scan the index array from the beginning every time a delete comes in, skipping all the S values. That is the problem.

Here is a deletion sequence which achieves the worst case bound.

0 0 0 0

where your scans are 1 + 2 + 3 + 4.

Let us say you inserted n elements. And after that I have the deletion sequence 0 0 ... 0 which is a fraction of n, say n/k.
Then you will have to scan 1 + 2 + 3 + ... + (n/k) = (n/k)((n/k)+1)/2 which is O(n^2).

By the way, any kind of index array is not going to help you at all in reducing the number of operations, asymptotically speaking. For insertions, insert in place; for deletions, delete and compress the hole out - simple, but O(n^2). For any reasonably sized n, the speed-up with an augmented red-black tree will be huge. Say n = 1024 = 2^10; then insertion in place etc. will take a number of operations which is a multiple of 2^20, whereas with the augmented BST the operations will be 10 * 2^10. You will get a speed-up of about 100.

If n = 2^20, then brute force will have c_1*2^40 ops vs. only c_2*20*2^20 for the augmented BST - a speed-up of roughly fifty thousand times.
Nandu
BRF Oldie
Posts: 2195
Joined: 08 Jan 2002 12:31

Re: BR Algorithms Corner

Post by Nandu »

SaiK wrote:http://en.wikipedia.org/wiki/Stable_marriage_problem

can someone think of writing a temporal logic statement on this? I tried, but gave up.
No idea what you mean by temporal logic, but the algorithm itself is very simple.

Men propose. Women decide. Women can switch.

Each man, if he is not engaged, proposes to the highest-preference woman to whom he hasn't already proposed. The woman accepts if she is free or if she prefers the proposer to her current fiance.

That is all there is to it.
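A small runnable C++ version of that proposal loop (Gale-Shapley), on a made-up 3x3 preference instance:

Code: Select all

#include <iostream>
#include <queue>
#include <vector>

int main() {
    // pref[i] lists partners from most to least preferred (indices, not names).
    std::vector<std::vector<int>> manPref   = {{0, 1, 2}, {1, 0, 2}, {0, 2, 1}};
    std::vector<std::vector<int>> womanPref = {{1, 0, 2}, {0, 1, 2}, {2, 1, 0}};
    const int n = static_cast<int>(manPref.size());

    // womanRank[w][m] = position of man m in woman w's list (lower = preferred).
    std::vector<std::vector<int>> womanRank(n, std::vector<int>(n));
    for (int w = 0; w < n; ++w)
        for (int r = 0; r < n; ++r) womanRank[w][womanPref[w][r]] = r;

    std::vector<int> nextChoice(n, 0);   // next woman each man will propose to
    std::vector<int> fiance(n, -1);      // fiance[w] = man currently engaged to woman w
    std::queue<int> freeMen;
    for (int m = 0; m < n; ++m) freeMen.push(m);

    while (!freeMen.empty()) {
        int m = freeMen.front(); freeMen.pop();
        int w = manPref[m][nextChoice[m]++];   // highest-ranked woman not yet proposed to
        if (fiance[w] == -1) {
            fiance[w] = m;                     // she is free: engaged
        } else if (womanRank[w][m] < womanRank[w][fiance[w]]) {
            freeMen.push(fiance[w]);           // she switches: old fiance becomes free
            fiance[w] = m;
        } else {
            freeMen.push(m);                   // rejected: he will try his next choice
        }
    }
    for (int w = 0; w < n; ++w)
        std::cout << "woman " << w << " <-> man " << fiance[w] << "\n";
}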
Nandu
BRF Oldie
Posts: 2195
Joined: 08 Jan 2002 12:31

Re: BR Algorithms Corner

Post by Nandu »

The ArmenT problem. I believe it is about just the two given data structures, i.e. no additional auxiliary DS.

So, for the first part, we have search + insert.
Search can be log N for the array, linear for the linked list. Insert is O(N) for the array (have to move the right half of the array) and O(1) for the linked list.

Part two is locate + delete.
Locate is O(1) for the array, O(N) for the linked list. Delete is O(1) for the linked list, O(N) for the array.

I am going to ignore the O(1) parts for convenience (if N is very small, have to revisit). Also, will assume constants are 1 (i.e. same cost to access an array element as to move from one linked list node to next, etc).

Array: sum(k=1..N)(log k) + sum(k=1..N)(k/2) + sum(k=1..N)(k/2) (terms are search, insert, delete, respectively).

List: sum(k=1..N)(k/2) + sum(k=1..N)(k/2) (search + locate).

That is enough for me to say that linked list is always better than array (which probably means I missed something).
ArmenT
BR Mainsite Crew
Posts: 4239
Joined: 10 Sep 2007 05:57
Location: Loud, Proud, Ugly American

Re: BR Algorithms Corner

Post by ArmenT »

Wow, there sure were a lot of replies to the puzzle.

As some of you have pointed out, with an array (or vector), you can efficiently find the insert position by binary searching, but you have to copy all the elements after it down one position each and then insert your element. With a linked list, you have to linear search your way to the proper position, but you can then insert the element by allocating new memory and updating a few pointers.

Similarly, when deleting, with an array (or vector): you can easily jump to any position by index, but actually deleting it needs you to copy all the elements after it by one position up. With a linked list: you need to linear search your way to get to the position, but then deleting it consists of merely updating a couple of pointers and freeing the allocated memory of your node.

So, pluses and minuses with each approach. However, for the problem as stated, it turns out that arrays generally seem to outperform linked lists, and the difference gets bigger the larger the number of elements! It turns out that the maximum compute time is spent searching for the position to insert to/delete from, not on inserting or deleting the item itself. And this is where the linked list really sucks. The difference becomes worse on modern CPUs, for a reason that I'll state in the next para below.

This video by Bjarne Stroustrup is very illuminating. He points out that this puzzle was originally suggested to him by Jon Bentley (famous algorithms guru). What is interesting about Stroustrup's presentation is that he did not use any of the optimization tricks that some of you have suggested (e.g. instead of binary searching through the array to find the appropriate position, he linear searched through it). Even with this disadvantage, arrays still managed to outperform linked lists! Why is this so? Because modern CPUs have memory caching optimizations built in (so if the CPU is told to access a particular memory location, the bus pulls the surrounding memory into a local cache as well, because it assumes that you're going to use the data in the nearby memory soon enough). With a normal array or vector, since the data is stored in linear fashion, when you access the first element, the next X elements are read into the local L1 cache (or L2 cache or L3 cache etc.) as well, and therefore access to the next few elements is faster. With a linked-list type structure, first, the elements are bigger for the same information (each node also stores pointers), which means fewer of them fit in the cache. Second, a linked list structure is generally not linear by nature, so accessing the next element in the list causes the CPU to cache miss (i.e. it detects that the data to read is not in the local cache), so the CPU has to send a signal on the bus to go back to RAM to get it, which makes it slower because the CPU has to idle while waiting for the RAM to respond. Each cache miss causes delays on the order of something like 100-700 clock cycles of idling on modern CPUs. I guess real-world CPU memory caching wreaks merry hell on theoretical types that talk about Big-O notation, and it reminds me of Knuth's famous quote ("Beware, I've only proved this algorithm correct, not actually tested it!").

As a further test, here's some C++ test code that David Coppola wrote that you can run and verify for yourself :). There was quite a fascinating discussion on reddit that followed Stroustrup's video (link to discussion).
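In the same spirit, a self-contained C++ sketch (not Coppola's code) that times both containers on the stated task; it uses linear search for both containers, as in Stroustrup's experiment, and the element count is an arbitrary choice, so absolute numbers will vary by machine.

Code: Select all

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <list>
#include <vector>

// Insert n random ints in sorted order, then delete from random positions,
// using the same linear-search strategy for both container types.
template <typename Container>
long long run(int n) {
    Container c;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        int x = std::rand();
        auto it = c.begin();
        while (it != c.end() && *it < x) ++it;   // walk to the insert position
        c.insert(it, x);
    }
    for (int size = n; size > 0; --size) {
        auto it = c.begin();
        std::advance(it, std::rand() % size);    // walk to a random position
        c.erase(it);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main() {
    const int n = 20000;
    std::cout << "vector: " << run<std::vector<int>>(n) << " ms\n";
    std::cout << "list:   " << run<std::list<int>>(n) << " ms\n";
}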
Last edited by ArmenT on 14 Jan 2015 10:08, edited 1 time in total.
Vayutuvan
BRF Oldie
Posts: 12079
Joined: 20 Jun 2011 04:36

Re: BR Algorithms Corner

Post by Vayutuvan »

ArmenT: for large n, the O(n lg n) achievable with an augmented balanced BST will beat either the array or the linked list solution any day. The implementation is not all that bad either. Obviously pointer access - one level of indirect addressing - increases the hidden constant in the big O, but as n gets bigger it gets swamped by n/lg n. The speed-up will be so obvious that it would be impossible to miss. Probably Boost has a red-black container? If so, try it out vis-a-vis a dynamic table container object.
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

2015 is a palindrome

-------------------

from office mail

top 10 institutions with best papers in CS

Institutions with Best Papers
Microsoft Research 41.0
Stanford University 29.2
Massachusetts Institute of Technology 28.9
Carnegie Mellon University 28.4
University of Washington 25.7
University of California Berkeley 19.5
University of Toronto 13.4
Cornell University 12.9
IBM Research 12.2
University of Illinois at Urbana-Champaign 11.9
SaiK
BRF Oldie
Posts: 36424
Joined: 29 Oct 2003 12:31
Location: NowHere

Re: BR Algorithms Corner

Post by SaiK »

This is interesting... I have to learn it.
http://en.wikipedia.org/wiki/Grover's_algorithm
Nandu
BRF Oldie
Posts: 2195
Joined: 08 Jan 2002 12:31

Re: BR Algorithms Corner

Post by Nandu »

There is only one quantum algorithm. It is the QFT, the quantum Fourier transform.

Grover's algorithm and Shor's algorithm are just variations on it.