eterm 4 days ago

> if you require your parens to be balanced, that's no longer a regular language.

TIL: https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_lang...

As someone who has only recently learned the formal definition of regular language ( Thanks to https://www.youtube.com/watch?v=9syvZr-9xwk as mentioned on this board recently ), I'm interested in the formal proof for this?

It feels intuitively true, but I haven't finished the course yet and therefore haven't come across it and can't yet reason well about it.

Is the problem then in trying to "encode" whether brackets are balanced, that we would need infinite "levels" of how many more brackets we've opened rather than closed, (i.e a stack) and that violates the finite nature of finite automota?

So I've googled it now, and I've found the result is due to the "Pumping lemma": https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_lang...

1
gbacon 4 days ago

Pulling on the thread of your intuition about needing infinite levels, a regular language is one that can be recognized by a finite automaton, so requiring infinite memory or arbitrarily high counting suggests that the language is not regular. It’s not perfect; we prove it as below.

Assume for purpose of contraction that our language L of balanced parentheses is regular. Then L has pumping length p ≥ 1 such that every wL where |w| ≥ p can be decomposed w = xyz where |y| ≥ 1 and |xy| ≤ p. It is called the pumping lemma because we can “pump” the non-empty substring y either up or down, i.e., xy⁰z = xz and xyⁿz are also all in L. In formal languages, exponentiation notates character or string repetition.

We don’t know p and may not cherry pick it. Pretend you are playing a game against an adversary who gets to select arbitrary p ≥ 1, and then you use that p to choose wL to spoil the regular language party. With the language of balanced parentheses, no matter what p the adversary selects, you can force the leading xy prefix to contain all left-parentheses by choosing w = (ᵖ)ᵖ. The substring y cannot be empty, so pumping down gives xy⁰z = (ᵖ⁻ᵏ)ᵖ belongs to L, a contradiction.

One of my favorite test questions for automata theory classes is to have ChatGPT attempt a proof that a language is not regular and then have students correct the proof. You linked to a lecture by Michael Sipser, and the undergraduate version of the course I teach uses his excellent textbook.

https://www.amazon.com/Introduction-Theory-Computation-Micha...

His lecture on the pumping lemma for regular languages is

https://youtu.be/KAySmSEGc9U?t=1807

eterm 4 days ago

I've just watched that video, thank you, and it's interesting that he shows an example where my intuition of counting leading to non-regularity fails:

We've seen you can't detect if there is a balanced number of 0's and 1's in a string of 0's and 1's with a FA/Regular expression, howver you CAN detect if there are a balanced number of "01" and "10" substrings in a string of 0's and 1's.

Proving that 01 and 10 balancing was left as an exercise to the reader, which I will try to work through here, hopefully I've not made a mistake. My intuition here says that it is equivalent to checking whether it starts and ends with the same symbol, because 01 leaves you having last read a 1, and therefore then requires a "10" substring to get back to having last read a 0, and vice-versa.

The FA I believe therefore can be constructed with transitions

    Q0 -> epsilon -> Q1
    Q0 -> epsilon -> Q3
    Q1 -> 0 -> Q2
    Q2 -> 1 -> Q1
    Q1 -> epsilon -> Q5
    Q3 -> 1 -> Q4
    Q4 -> 0 -> Q3
    Q3 -> epsilon -> Q5
    ( Starting Q0, Accepting Q5 )
    
Or presented differently, (0(0U1)0) U (1(0U1)1) U epsilon

gbacon 2 days ago

Your intuition about starting and ending with the same symbol is correct. This is a good example of why the unbounded memory rule of thumb is not perfect.

The NFA you described recognizes

    (01)* ∪ (10)*
All strings in the above language will begin and end with different symbols and have only even lengths. (Balanced means not only equal in number but that they nest properly, e.g., )()()( and ())( have equal numbers of left- and right-parentheses but not in balance.) The empty string ε does belong to the language, and your NFA — but not your regular expression — correctly accepts it. The language from Sipser’s lecture has strings of odd length as well, the shortest being 0 and 1 that an equal number (namely, zero) of 01 and 10 substrings.

Your regular expression is close. It looks like HN’s markdown formatting ate your Kleene stars. Always account for the empty string, and in your case it does need to appear explicitly.

    ε ∪ 0 ∪ 1 ∪ 0(0 ∪ 1)*0 ∪ 1(0 ∪ 1)*1
Because this is a simple enough language, I suggest building a DFA rather than an NFA. You know that your automaton will have two branches: one for strings beginning with 0 and the other for those that begin with 1. That accounts for three states so far, and your initial state q0 has both of its transitions defined. Let’s say that q1 is the state after reading the initial 0. (Remember that you can interpret each state in a finite automaton’s transition graph as “I have read …”) When you’re in q1 and read a 0, where do you go? When you’re in q1 and read a 1, where do you go? You’ll know you completed this branch when you have defined 0 and 1 transitions for all its states. Then move on to completing the branch for strings that begin with 1. The intuition is it ought to have similar structure to the first branch. Finally, label your final or accept states. Always remember to account for the empty string!

What’s really interesting is the NFA you defined has, after conversion to DFA and minimization, almost the same structure as the correct DFA: the only difference is which are accept states.